Batch job submitted but immediately failed

Dear VirtualFlow community,

Before screening my own protein, I wanted to make sure VirtualFlow works properly, so I followed the tutorial with the provided docking scenarios. Everything goes smoothly, and upon running the command ./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1 I see the following:

Submitted batch job #number
The job for jobline X has been submitted.

However, it's followed by an immediate failure:
Slurm Job_id=XXX Name=t-4.1 Failed, Run time 00:00:29, FAILED, ExitCode 1

What could be the problem?
Is there a way to see what's going wrong?

Dear Kaneki,

Welcome to the forum :slight_smile:

I would recommend looking into the logfiles in the workflow/output-files/jobs folder.

More details can be found in the Troubleshooting section of the documentation:

https://docs.virtual-flow.org/documentation/-LdE8RH9UN4HKpckqkX3/troubleshooting

If you don’t see logfiles, then make sure that the parameter store_queue_log_files in the control file is set to all_uncompressed.
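
You can check the current value directly from the tools folder, for example (a minimal sketch; the relative paths assume the standard folder layout from the tutorial):

  grep '^store_queue_log_files=' ../workflow/control/all.ctrl   # should show all_uncompressed
  ls ../workflow/output-files/jobs/                              # job-level logfiles, if any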

Good luck,
Christoph

Thanks for your reply,

The store_queue_log_files parameter is set to all_uncompressed in the all.ctrl file, but the workflow/output-files/jobs folder still only contains job-1.1_135.out files.

So it's weird that it doesn't produce any logfiles despite the parameter being set.

Inside the job-1.1_135.out file I find the following errors:

  1. Error was trapped
    srun: error: wen01: task 0: Exited with exit code 1
    Error in bash script one-step.sh
    Error on line 55
    Environment variables (wen01 is the node)
  2. Error was trapped
    Error in bash script /var/spool/slurm/job00135/slurm_script
    Error on line 298
    Environment variables
  3. BASH_FUNC_module()=() { eval /usr/bin/modulecmd bash $*
    }
    _=/usr/bin/env

Trying to stop this queue and causing the jobline to fail…

In addition, both of the following folders are empty after running the job:
workflow/ligand-collections/ligand-lists
and
workflow/output-files/queues

@kaneki Can you set (in addition to store_queue_log_files=all_uncompressed) also the parameter verbosity_logfiles=debug in the control file, run the workflow again with this setting, and attach the log file? The folder workflow/output-files/jobs should contain one logfile for each job run by the batch system. Those are present in your folder, so all seems to be fine regarding these log files, and they should contain the cause of the error.
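
Once verbosity_logfiles=debug is active, the newest job logfile should contain a full bash trace of where things stop; you can inspect it for example like this (a sketch; paths as used in this thread, run from the tools folder):

  newest=$(ls -t ../workflow/output-files/jobs/ | head -n 1)   # most recent job logfile
  less "../workflow/output-files/jobs/${newest}"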

Since the workflow/output-files/queues folder is still empty as you said, the workflow seems to fail before starting the actual queues (one for each CPU) which process the ligands. Therefore the workflow/ligand-collections/ligand-lists folder is also still empty, since no ligands have been processed yet.

I would like to attach a logfile, but both verbosity_logfiles=debug and store_queue_log_files=all_uncompressed are set in the control file (all.ctrl) as you mentioned, and yet no logfiles are produced.

Indeed, it seems not to even reach a queue, so I'm really confused about what's going on.
I'm just following the tutorial with the provided files, so I did not change anything other than what was just specified.

It may be important to note that I used the values given in the tutorial:
steps_per_job=1
cpus_per_step=5
queues_per_step=5
cpus_per_queue=1

Could it be that I have to change this configuration, since VirtualFlow is not even able to start a queue on my cluster, hence failing the jobline?

@kaneki You mentioned earlier that for example the file job-1.1_135.out was there, which is a logfile of the job. Could you share this (or a similar) file? (This is also the log file I meant in my previous post, i.e. the log file on the job level)
Do all the files in the folder workflow/output-files/jobs have the ending “out”, or do some also have the ending “err”?

Okay, my bad, I assumed logfiles would be labeled separately as “logfile”.
There are no “err” files, only “out” files.

I sent you a PM containing the logfile.

That’s a good point, renaming the workflow/output-files folder to “logfiles” might be a good idea. I’ll put that on my todo list :slight_smile:

It seems that you are running your workflow in the home folder, is that right? On most clusters, this should not be done unless your admins recommend it. Usually a fast shared cluster filesystem should be used. Is the home folder available to the compute nodes? If not, this might be causing the immediate failure of the jobs.

If this does not help, I would recommend running an interactive job: start an interactive session with srun --nodes=1 --ntasks-per-node=2 --pty bash (additional parameters might be needed on your cluster), then go to the tools folder and run

bash ../workflow/job-files/main/1.job

You will then see all the output live on your screen.
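
Put together, the interactive test could look like this (a sketch; the exact srun options are cluster-dependent, and the path placeholder needs to be adjusted to your setup):

  srun --nodes=1 --ntasks-per-node=2 --pty bash   # interactive shell on a compute node
  cd <path-to-your-VFVS-copy>/tools               # placeholder path, adjust to your installation
  bash ../workflow/job-files/main/1.job           # run the job script directly and watch the output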

I'm indeed running it in my home folder. When running an interactive job followed by bash ../workflow/job-files/main/1.job, the final job information shows a starting time and ending time with a difference of 1 second. I will try to copy everything to the local scratch folder of a node and run from there. Indeed, the problem might lie in the home folder. I will let you know.

The local scratch of a compute node is again another type of storage, which is only accessible by that node. Usually, a cluster has a shared cluster/scratch filesystem which is available to all compute nodes (including the login node). This is the one you want to use :slight_smile:
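
One way to see what storage is actually available (a sketch; the mount points differ per cluster, so your admins can tell you which one, if any, is the shared scratch filesystem):

  df -h                    # filesystems mounted on the login node
  srun -n 1 -N 1 df -h     # filesystems mounted on a compute node, for comparison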

We only have local scratch folders, not a shared one. Would it still be possible to run VirtualFlow on one node? I'm just testing how to use it before sending it to an HPC facility!

edit:

  • Could it be a problem with additional programs such as Open Babel or AutoDock, etc.? I did not install AutoDock on the cluster because the documentation mentioned that it is already included in the VirtualFlow folder.

That’s a good question. Theoretically, yes, but it depends on how exactly your cluster is set up.

Regarding the settings you mentioned earlier, like steps_per_job=1, cpus_per_step=5, queues_per_step=5 or cpus_per_queue=1, this could also be a cause of the problem (but I don’t expect it). Normally, if you have an unsupported setting, for example for cpus_per_step, the batch system wouldn’t even allow you to submit the job in the first place.

The external docking programs should not be the problem, since they are shipped with the package. And you would see specific error messages about this in the log files when the workflow tries to use these programs.

The error message in your job output files indicates that as soon as “srun” is run (which usually starts the script one-step.sh on one of the compute nodes), the workflow crashes immediately without any output from the one-step.sh script. In my experience this happens only if there is a fundamental problem, such as the filesystem suddenly no longer being available to the workflow/compute node. Your cluster admin might be able to help.
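
A quick way to check this (a sketch, assuming a Slurm cluster and the paths used in this thread, run from the tools folder) is to ask a compute node directly whether it can see the workflow files:

  srun -n 1 -N 1 ls ../workflow/job-files/sub/one-step.sh
  # if the home folder is not mounted on the compute nodes, this will fail
  # instead of listing the script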

I will forward this to the admin:

Regarding the settings you mentioned earlier, like steps_per_job=1, cpus_per_step=5, queues_per_step=5 or cpus_per_queue=1, this could also be a cause of the problem (but I don’t expect it).

In addition, the admin tried running the job; he faces the same problem and gets the following error, which he asked me to forward to you:

+ echo 'Starting job step 1 on host ana01.'
Starting job step 1 on host ana01.
+ pids[$(( VF_STEP_NO - 0 ))]=191312
+ sleep 1
+ srun --relative=0 -n 1 -N 1 ../workflow/job-files/sub/one-step.sh
cpu-bind=MASK - ana01, task 0 0 [191326]: mask 0x1f set
Error was trapped
srun: error: ana01: task 0: Exited with exit code 1
Error in bash script one-step.sh
Error on line 55

You can try the following two things:

  1. Set steps_per_job=1, cpus_per_step=1, queues_per_step=1 and cpus_per_queue=1. Set these parameters in the workflow/control/all.ctrl file if you don’t want to prepare the folders again with ./vf_prepare_folders.sh. If you instead change the settings in the template control file tools/templates/all.ctrl, you first need to prepare the workflow folders again with ./vf_prepare_folders.sh. Then try running the workflow again with ./vf_start_jobline.sh (see the sketch after this list).
  2. If that does not work, then in addition to the settings above (without re-preparing the workflow files), edit the file ../workflow/job-files/main/1.job and change line 294 (the line containing srun) to
    bash ../workflow/job-files/sub/one-step.sh
    Then run an interactive job, e.g. with srun --pty --mem 1G -n 1 -N 1 --cpus-per-task=1 -t 0-1:00:00 /usr/bin/env bash (How exactly you need to start an interactive session will depend on your cluster.)
    Then while you are still in the tools folder, run bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt
    Then you can send me the file log.txt which should be in your current directory (i.e. the tools folder).
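
For point 1, changing the already-prepared control file could look like this (a sketch; the paths and the submit command are the ones used in the tutorial, so adjust them if yours differ):

  sed -i -e 's/^steps_per_job=.*/steps_per_job=1/' \
         -e 's/^cpus_per_step=.*/cpus_per_step=1/' \
         -e 's/^queues_per_step=.*/queues_per_step=1/' \
         -e 's/^cpus_per_queue=.*/cpus_per_queue=1/' \
         ../workflow/control/all.ctrl
  ./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1   # same submit command as in the tutorial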

Did I understand correctly that the cluster on which you are trying to get it to run is different from the larger cluster where you will run the real virtual screenings later?

I will try the two steps mentioned above today and get back to you with the results later today.

Yes, that's true. First I wanted to understand how VirtualFlow works; since I got errors with my own files, I tried running the tutorial with the given files, but the same errors (the ones we talked about) arose. I want to run it on our own cluster first to check whether the ligands are docked at the position I want before wasting time at a big HPC facility!

EDIT:
Point 1 didn't work either; I will send you the logfile when I have finished point 2.
Point 2: line 294 of 1.job contains:
for pid in ${pids[@]}; do
    wait $pid || let "exit_code=1"

Testing that the ligands are docked in the right docking site when you set up your own target is definitely a good idea. But in principle you can also do this without VirtualFlow: you can just run AutoDock Vina manually one time to test, and look at the results (that's how I usually do it). Then you can use the same Vina config file that you use for VirtualFlow. Of course, also checking the docking results of VirtualFlow in small test runs is a good idea, to be sure the dockings are working as you want them to.
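
For such a one-off check, a manual Vina run could look roughly like this (a sketch; the file names are placeholders, and the config file would be the same one you later give to VirtualFlow):

  vina --config vina_config.txt \
       --receptor receptor.pdbqt \
       --ligand test_ligand.pdbqt \
       --out test_ligand_docked.pdbqt

You can then open the receptor and the docked poses in a molecular viewer (for example PyMOL) to confirm that they sit in the intended binding site.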

If you want to use a big HPC facility, you will also need to run some smaller tests there anyway, because each HPC system is different. Thus if for some reason you don't get it working on your own HPC system within a reasonable amount of time, it might be more efficient to test on the large HPC facility directly (on a small test scale).

Regarding point 2: maybe you have a slightly different version than the one I looked at; the line containing the srun command is “around” line 294.

Those are valid points.

In addition, I sent you a PM with the logfiles from the two steps you asked me to perform.
For point two:
Since I ran the command with 1 job, there was only 1 jobfile, in which I changed line 285 from:
srun --relative=$((VF_STEP_NO - 1)) -n 1 -N 1 ../workflow/job-files/sub/one-step.sh &

into

bash ../workflow/job-files/sub/one-step.sh

So I removed the whole line and replaced it with the line above, and it resulted in an error at that line. For point two I get two errors, and again the error which keeps coming back:

Starting job step 1 on host ana01.
+ bash ../workflow/job-files/sub/one-step.sh
Error was trapped
Error in bash script one-step.sh
Error on line 55

Thank you for the log files. That is really strange; I have never seen anything like this before.

As soon as the script one-step.sh is started, an error appears from line 55 without any output from the earlier lines of that file. And line 55 of that file doesn’t seem to make sense either.

Without changing or resetting anything else, can you start your interactive job again and only change the line
bash ../workflow/job-files/sub/one-step.sh
in the file ../workflow/job-files/main/1.job to
source ../workflow/job-files/sub/one-step.sh
and run again
bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt
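
If you prefer to make that swap non-interactively, something like this should work (a sketch; verify the exact wording of the line in your copy of 1.job first, since versions differ slightly):

  # replace the previously inserted "bash ..." line with "source ..."
  sed -i 's|bash ../workflow/job-files/sub/one-step.sh|source ../workflow/job-files/sub/one-step.sh|' \
      ../workflow/job-files/main/1.job
  bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt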

Also, can you send me the file ../workflow/job-files/sub/one-step.sh as well next time?

I have looked at the latest files which you sent me. This time we got the bash debug output from the one-step.sh script, as you might have noticed. The one-step.sh shell script sources the ~/.bashrc file, which in turn seems to source /etc/bashrc:

+ source ../workflow/job-files/sub/one-step.sh
++ trap 'error_response_std $LINENO' ERR
++ trap time_near_limit 10
++ trap time_near_limit 1 2 3 9 15
++ trap clean_up EXIT
++ source /home/samur001/.bashrc
+++ '[' -f /etc/bashrc ']'
+++ . /etc/bashrc
++++ '[' '' ']'
++++ shopt -q login_shell

and it seems an error is occurring while this script is sourced. The error occurs shortly after the command gau-machine (related to Gaussian?) is run. Thus you could for example try to fix your .bashrc (or the /etc/bashrc) file, or you can edit the file one-step.sh and comment out/remove the line which sources your .bashrc. VFVS should still be able to run.
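
A minimal way to make that edit (a sketch; verify the exact wording of the sourcing line in your one-step.sh first, as it may differ between versions):

  # prefix the line that sources ~/.bashrc with '#' to disable it
  sed -i 's|^\([[:space:]]*source .*bashrc\)|# \1|' ../workflow/job-files/sub/one-step.sh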

It seems to be running after commenting out this sourcing line!
What is the command for stopping a job?