Dear Madam or Sir,
I have a major problem with the joblines.
I launched a virtual screening on a supercomputer running Slurm, using part of the REAL database (around 400 million ligands).
We defined 100 joblines for the job. The main parameters we modified in the all.ctrl file were: cpus_per_step=28, central_todo_list_splitting_size=2000, ligands_todo_per_queue=100000, ligands_per_refilling_step=1000.
The main modification for Slurm was: #SBATCH --mem=120000mb
Of the 100 joblines, 31 quickly ran into problems (they were killed, and the failures are still continuing). The output looks like the following (here for jobline 95):
Before (re)filling the todolists the queue 95-1-1 had 0 ligands todo distributed in 0 collections.
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 1).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 2).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 3).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 4).
After (re)filling the todolists the queue 95-1-1 has 100099 ligands todo distributed in 95 collections.
The todo lists for the queues were (re)filled in 1 second(s) (waiting time not included).
The waiting time was 103 second(s).
Starting job step 1 on host my080.
Job step 1 is starting queue 95-1-1 on host my080.
-
Trying to stop this queue and causing the jobline to fail…
-
Trying to stop this queue and causing the jobline to fail…
*** Final Job Information ***
======================================================================
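
To check which joblines show the same pattern, the job output files could be scanned with something like the sketch below. It relies only on the two message formats visible in the log above; the function name `summarize_log` is hypothetical, and reading the queue ID as jobline-step-queue is my assumption.

```python
import re

# Message texts copied from the log above; everything else is illustrative.
FAIL_MARKER = "Trying to stop this queue and causing the jobline to fail"
QUEUE_START = re.compile(r"is starting queue (\d+)-(\d+)-(\d+) on host (\S+?)\.?$")


def summarize_log(text):
    """Collect the queues started in one jobline log and count failure messages."""
    queues = []
    failures = 0
    for line in text.splitlines():
        match = QUEUE_START.search(line.strip())
        if match:
            # Assumption: the ID is <jobline>-<step>-<queue>.
            jobline, step, queue, host = match.groups()
            queues.append(
                {"jobline": int(jobline), "step": int(step),
                 "queue": int(queue), "host": host}
            )
        if FAIL_MARKER in line:
            failures += 1
    return {"queues": queues, "failure_messages": failures}


sample = """Starting job step 1 on host my080.
Job step 1 is starting queue 95-1-1 on host my080.
-
Trying to stop this queue and causing the jobline to fail…
-
Trying to stop this queue and causing the jobline to fail…
"""

summary = summarize_log(sample)
print(summary)
```

Running this over each jobline's output file and grouping the results by host would show whether the failures cluster on particular nodes or are spread across all of them.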
Could you help me understand what is causing this problem?
Thank you,
Ronan