Slurm DSE: squeue causes RuntimeException
Apparently squeue fails quite frequently with error messages such as:
slurm_load_jobs error: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
...
Each failed call then produces a warning in the TaPaSCo log:
[17:30:12 <scala-execution-context-global-202: Slurm$> WARN] Slurm `squeue` failed: java.lang.RuntimeException: Nonzero exit value: 1
[17:30:12 <scala-execution-context-global-237: Slurm$> WARN] Slurm `squeue` failed: java.lang.RuntimeException: Nonzero exit value: 1
[17:30:12 <scala-execution-context-global-230: Slurm$> WARN] Slurm `squeue` failed: java.lang.RuntimeException: Nonzero exit value: 1
...
This in turn apparently crashes the DSE, because the batch finishes with no results for its 25 elements:
[17:30:12 <scala-execution-context-global-44: DesignSpaceExplorationTask> ERROR] exception: java.lang.AssertionError: assertion failed: elements length (25) does not match results length (0), stacktrace: de.tu_darmstadt.cs.esa.tapasco.dse.Exploration$Events$BatchFinished.<init>(Exploration.scala:249)
de.tu_darmstadt.cs.esa.tapasco.dse.ConcreteBatch.start(Batch.scala:33)
de.tu_darmstadt.cs.esa.tapasco.dse.ConcreteExploration.apply(Exploration.scala:144)
de.tu_darmstadt.cs.esa.tapasco.dse.ConcreteExploration.start(Exploration.scala:200)
de.tu_darmstadt.cs.esa.tapasco.task.DesignSpaceExplorationTask.job(DesignSpaceExplorationTask.scala:75)
de.tu_darmstadt.cs.esa.tapasco.task.Tasks$ProcessingRunnable.$anonfun$run$1(Tasks.scala:156)
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:12)
scala.concurrent.Future$.$anonfun$apply$1(Future.scala:653)
scala.util.Success.$anonfun$map$1(Try.scala:251)
scala.util.Success.map(Try.scala:209)
scala.concurrent.Future.$anonfun$map$1(Future.scala:287)
scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:140)
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
[17:30:13 <main: DesignSpaceExploration$> INFO] all DSE tasks have finished
Example run:
tapasco --slurm dse --composition [arrayupdate x 1] --dimensions freq,area --batchSize 25 --frequency 50 --platforms pynq
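
Possible mitigation, as a rough sketch only: since the socket timeouts look transient, the squeue call could be retried a few times with a growing delay before the failure is propagated. The "Nonzero exit value: 1" message matches what scala.sys.process throws from `!!`, so the sketch assumes the call is made that way. The names SqueueRetry, retry and squeueOutput are hypothetical and not taken from the TaPaSCo sources, and the attempt count and delay are arbitrary.

```scala
import scala.sys.process._
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of TaPaSCo.
object SqueueRetry {
  /** Runs `op` up to `attempts` times; sleeps `delayMs` after a failure
    * and doubles the delay before the next try. Returns the first success
    * or the last failure. */
  @annotation.tailrec
  def retry[A](attempts: Int, delayMs: Long)(op: () => A): Try[A] =
    Try(op()) match {
      case s @ Success(_) => s
      case Failure(_) if attempts > 1 =>
        Thread.sleep(delayMs)
        retry(attempts - 1, delayMs * 2)(op)
      case f @ Failure(_) => f
    }

  /** Assumed call shape: `!!` throws a RuntimeException
    * ("Nonzero exit value: N") when squeue exits with a nonzero code. */
  def squeueOutput(): Try[String] =
    retry(attempts = 5, delayMs = 500L)(() => Seq("squeue").!!)
}
```

With something like this, a caller would only see a Failure after all attempts are exhausted, and could then treat the batch as failed instead of running into the length assertion.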