Problems with joblines

bureau · May 7, 2020, 3:24pm

Madam, Sir,
I have a major problem with the joblines.

I launched a virtual screening of part of the REAL database (around 400 million ligands) on a supercomputer running Slurm.
We defined 100 joblines for the job. The main parameters we modified in the all.ctrl file were: cpus_per_step=28 / central_todo_list_splitting_size=2000 / ligands_todo_per_queue=100000 / ligands_per_refilling_step=1000.
Main modification for Slurm: #SBATCH --mem=120000mb

Of the 100 joblines, 31 quickly ran into problems (they were killed, and the failures have not stopped), with output like the following (for instance, for jobline 95):
Before (re)filling the todolists the queue 95-1-1 had 0 ligands todo distributed in 0 collections.

The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 1).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 2).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 3).
The ligand-collections/todo/todo.all (if existent) did not meet the requirements for continuation (trial 4).
After (re)filling the todolists the queue 95-1-1 has 100099 ligands todo distributed in 95 collections.

The todo lists for the queues were (re)filled in 1 second(s) (waiting time not included).
The waiting time was 103 second(s).

Starting job step 1 on host my080.
Job step 1 is starting queue 95-1-1 on host my080.

Trying to stop this queue and causing the jobline to fail…

Trying to stop this queue and causing the jobline to fail…

              *** Final Job Information ***

======================================================================

I need some information on this problem. Could you help me?

Thank you,

Ronan,

Christoph · June 22, 2020, 6:15pm

Hi Ronan,

Were you able to solve the problem in the meantime? If not, I would look into the logfiles in the folder workflow/output-files/queues/…

If there is nothing, I would start an interactive job and try to get the logfiles from the temporary directories.
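
For the first suggestion, here is a minimal sketch of how one might scan those logfiles for suspicious lines. It assumes the queue logs are plain text files somewhere under workflow/output-files/queues/ and that errors can be spotted by simple keywords; the keyword list and the function name `find_suspicious_lines` are hypothetical and not part of VirtualFlow.

```python
from pathlib import Path

# Hypothetical keywords; adjust them to what your logs actually contain.
KEYWORDS = ("error", "fail", "killed", "traceback")


def find_suspicious_lines(root, keywords=KEYWORDS):
    """Return (path, line_number, text) for each line containing a keyword."""
    hits = []
    root = Path(root)
    if not root.is_dir():
        # Nothing to scan, for example when run outside the project folder.
        return hits
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            # Skip files that cannot be read.
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Assumed location, taken from the folder mentioned above.
    for path, number, line in find_suspicious_lines("workflow/output-files/queues"):
        print(f"{path}:{number}: {line}")
```

Run from the project's top-level folder, this prints every matching line together with its file and line number, which may help narrow down which queues failed first.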

Best,
Christoph

bureau · June 30, 2020, 2:06pm

Hi Christoph,

Setting “error_response=ignore” resolved many of the problems.
Thank you again for the quality of your software and your reply.
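
For anyone who wants to check which values are active before submitting, here is a small sketch. It assumes the control file consists of plain key=value lines with `#` comments, which is an assumption about the format; the helper `read_ctrl_settings` is hypothetical, and the sample text only repeats the values quoted in this thread.

```python
def read_ctrl_settings(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Sample content built only from the values mentioned in this thread.
sample = """\
# values quoted in this thread
cpus_per_step=28
central_todo_list_splitting_size=2000
ligands_todo_per_queue=100000
ligands_per_refilling_step=1000
error_response=ignore
"""

settings = read_ctrl_settings(sample)
print(settings["error_response"])  # prints "ignore"
```

In practice one would pass the contents of the real control file instead of `sample`.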

Ronan,