VFVS Tutorial-1 failing after 1 min

ErwinPannecoucke · October 28, 2020, 4:26pm

Dear all,

I’m trying out the preconfigured VFVS tutorial, and all jobs fail a few seconds after being submitted. Unfortunately, I can’t find the error, or even a reference to one, which is why I’m reaching out. My apologies if this issue has been addressed previously: I did search this forum for a solution first, to no avail.

What I did:

In ./templates/all.ctrl, I’ve enabled verbosity_commands=debug and verbosity_logfiles=debug
Next, I adapted ./templates/template1.slurm.sh to our system:

template1.slurm.sh

```
# Slurm Settings
###############################################################################

#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
##SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=cn1522
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out           # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.error            # File to which standard err will be written
#SBATCH --signal=10@300
```

Finally, I ran ./vf_prepare_folders.sh,
followed by
./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1

Most jobs finish quite fast after having appeared in the squeue, the longest job stays in there for approximately 1-1.5 minutes.

This is the relevant sbatch output of the first job:

../workflow/output-files/jobs/job-1.1_172.out

```
===========================================================


 * Preparing the to-do lists for jobline 1

Wed Oct 28 16:50:21 CET 2020

Starting the (re)filling of the todolists of the queues.

Before (re)filling the todolists the queue 1-1-1 had 0 ligands todo distributed in 0 collections.

After (re)filling the todolists the queue 1-1-1 has 106 ligands todo distributed in 4 collections.

The todo lists for the queues were (re)filled in 0 second(s) (waiting time not included).
The waiting time was 0 second(s).

Starting job step 1 on host cn1522-irc-ugent-be.

 * Trying to stop this queue and causing the jobline to fail...

 * Trying to stop this queue and causing the jobline to fail...

                 *** Final Job Information ***
======================================================================

Starting time: Wed Oct 28 16:50:20 CET 2020
Ending time:   Wed Oct 28 16:50:22 CET 2020
```

This is the output of the corresponding .error:

../workflow/output-files/jobs/job-1.1_172.error

Error was trapped
srun: error: cn1522-irc-ugent-be: task 0: Exited with exit code 1
Error in bash script one-step.sh
Error on line 3
...
Error was trapped
Error in bash script /mnt/DATA1/programs/slurm/spool/slurmd/job00172/slurm_script
Error on line 298
...

I have no idea where to start troubleshooting, and have already spent quite some time on this. Can somebody give me a hint towards the actual root of this error?

Thank you kindly,
Erwin

Christoph · November 14, 2020, 9:07pm

Dear Erwin,

I’ve never seen an error like this.

Have you installed VirtualFlow on a shared cluster filesystem?

Line 3 is a comment, so this is indeed strange.
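
For context, messages of the form `Error was trapped` followed by `Error on line N` are what a Bash `ERR` trap typically prints. The following is only a minimal sketch of that mechanism, not the actual VirtualFlow code: the names `error_response` and `run_with_trap` are hypothetical, and the message texts are copied from the log above.

```shell
#!/usr/bin/env bash

# Hypothetical handler: reports the line number passed in by the trap.
error_response() {
    echo "Error was trapped" 1>&2
    echo "Error on line $1" 1>&2
}

# Hypothetical helper: runs a command with the ERR trap active.
# The function body is a subshell, so the trap does not leak into the caller.
run_with_trap() (
    trap 'error_response $LINENO' ERR
    "$@"
)

# `false` returns a non-zero status, which fires the ERR trap.
# The trailing `true` only keeps the demo's own exit status at zero.
demo_output=$(run_with_trap false 2>&1; true)
echo "$demo_output"
```

The number printed is whatever `$LINENO` holds at the moment the trap fires.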

Which version of Bash is installed on your cluster?
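
Both questions above can be answered from a shell in the VirtualFlow directory. This is a small sketch rather than an official VirtualFlow diagnostic, and it assumes a Linux system where `df` accepts `-T` (it falls back to plain `df` otherwise).

```shell
#!/usr/bin/env bash

# Report the version of the Bash instance running this script.
echo "Bash version: ${BASH_VERSION}"
echo "Bash major version: ${BASH_VERSINFO[0]}"

# Show the filesystem type of the current directory (e.g. xfs, ext4, nfs).
# Falls back to plain `df` where the -T option is not supported.
df -T . 2>/dev/null || df .
```

The compute nodes may not match the node used for submission, so it can be worth running the same lines inside a job as well.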

Best,
Christoph

ErwinPannecoucke · November 16, 2020, 9:02am

Dear Christoph,

Thank you for confirming that this problem is indeed uncommon; it was driving me crazy.

The system is an in-house GPU workstation running CentOS 7 (2x 16-core Intel Xeon Gold 5218 CPUs @ 2.30GHz (64 logical cores with multithreading); 252 GB RAM) on which we installed SLURM ourselves. We were able to confirm that our SLURM setup works, as we use it to run several scripts, developed for an HPC at Grenoble, that process EM data.

The tutorial is run on a 1TB (scratch) SSD, to which my user has full read/write access. I also tried running the script with sudo, but that makes no difference.

I realize that I’m not giving a lot of information to troubleshoot this. But honestly, I have no idea where to start looking for problems. Don’t hesitate to ask for more information!

Thank you in advance,
Erwin