VFVS Tutorial-1 failing after 1 min

Dear all,

I’m trying out the preconfigured VFVS tutorial, and all jobs fail a few seconds after being submitted. Unfortunately, I can’t find the error, or even a reference to an error, which is why I’m reaching out. My apologies if this issue has been addressed before: I did search this forum for a solution first, to no avail.

What I did:

  1. In ./templates/all.ctrl, I’ve enabled verbosity_commands=debug and verbosity_logfiles=debug

  2. Next, I adapted ./templates/template1.slurm.sh to our system:

template1.slurm.sh
```
# Slurm Settings
###############################################################################

#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
##SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=cn1522
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out     # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.error    # File to which standard err will be written
#SBATCH --signal=10@300
```
  3. Finally, I ran ./vf_prepare_folders.sh, followed by
     ./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1

Most jobs disappear from the squeue almost immediately after being submitted; the longest-lived job stays there for approximately 1-1.5 minutes.

This is the relevant sbatch output of the first job:

../workflow/output-files/jobs/job-1.1_172.out
```
===========================================================


 * Preparing the to-do lists for jobline 1

Wed Oct 28 16:50:21 CET 2020

Starting the (re)filling of the todolists of the queues.

Before (re)filling the todolists the queue 1-1-1 had 0 ligands todo distributed in 0 collections.

After (re)filling the todolists the queue 1-1-1 has 106 ligands todo distributed in 4 collections.

The todo lists for the queues were (re)filled in 0 second(s) (waiting time not included).
The waiting time was 0 second(s).

Starting job step 1 on host cn1522-irc-ugent-be.

 * Trying to stop this queue and causing the jobline to fail...

 * Trying to stop this queue and causing the jobline to fail...

                 *** Final Job Information ***
======================================================================

Starting time: Wed Oct 28 16:50:20 CET 2020
Ending time:   Wed Oct 28 16:50:22 CET 2020
```

This is the output of the corresponding .error:

../workflow/output-files/jobs/job-1.1_172.error
```
Error was trapped
srun: error: cn1522-irc-ugent-be: task 0: Exited with exit code 1
Error in bash script one-step.sh
Error on line 3
...
Error was trapped
Error in bash script /mnt/DATA1/programs/slurm/spool/slurmd/job00172/slurm_script
Error on line 298
...
```

I have no idea where to start troubleshooting, and I’ve already spent quite some time on this. Can somebody point me towards the actual root of this error?

Thank you kindly,
Erwin

Dear Erwin,

I’ve never seen an error like this.

Have you installed VirtualFlow on a shared cluster filesystem?

Line 3 is a comment, so this is indeed strange.
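For context, the “Error was trapped … Error on line N” messages in your .error file are produced by a Bash error trap. A minimal sketch of that mechanism (not VFVS’s exact trap, which prints more detail):

```shell
#!/usr/bin/env bash
# Minimal sketch of a Bash ERR trap that reports the failing line,
# similar in spirit to the "Error was trapped ... Error on line N"
# messages above (the actual VFVS trap is more elaborate).
set -E  # propagate the ERR trap into functions and subshells
trap 'echo "Error was trapped: error on line ${LINENO}"' ERR
false   # any command with a nonzero exit status fires the trap
echo "execution continues after the trap (no set -e here)"
```

The line number printed is the line of the failing command as Bash counts it in the script it is actually executing, which is why a reported “line 3” that lands on a comment is suspicious.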

Which version of Bash is installed on your cluster?
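You can check it with something like this (my assumption here is that an old Bash lacking 4.x features could fail in odd ways; also worth running inside an sbatch/srun job to confirm the compute nodes see the same Bash as your login shell):

```shell
# Report the version of the Bash executing this snippet, plus the
# version of the first "bash" on PATH (they can differ on clusters).
echo "Running shell: ${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}.${BASH_VERSINFO[2]}"
bash --version | head -n 1
```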

Best,
Christoph

Dear Christoph,

Thank you for confirming that the problem indeed doesn’t occur often, it was driving me crazy 🙂

The system is an in-house GPU workstation running CentOS 7 (2x 16-core Intel Xeon Gold 5218 CPUs @ 2.30 GHz, 64 threads with multithreading; 252 GB RAM) on which we installed SLURM ourselves. We were able to confirm that our SLURM setup works, as we are using several scripts, developed for an HPC cluster in Grenoble, to process EM data through this interface.

The tutorial is run on a 1 TB (scratch) SSD to which my user has full read/write access. I also tried running the script as sudo, but that makes no difference.
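In case it helps, a quick way to double-check that the directory the jobs write their logs to is reachable and writable (path taken from the #SBATCH --output/--error settings in the template above; adjust to your layout):

```shell
# Check that the directory the #SBATCH --output/--error lines point at
# exists and is writable; a missing or read-only directory can make
# Slurm jobs fail without producing any output at all.
dir=../workflow/output-files/jobs
if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir is writable"
else
    echo "PROBLEM: $dir is missing or not writable"
fi
```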

I realize I’m not giving a lot of information to troubleshoot with. But honestly, I have no idea where to start looking for problems. Don’t hesitate to ask for more information!

Thank you in advance,
Erwin