Dear all,
I’m trying out the preconfigured VFVS tutorial, and all jobs fail a few seconds after being submitted. Unfortunately, I can’t find the error, or even a reference to one, which is why I’m reaching out. My apologies if this issue has been addressed previously: I did search this forum for a solution first, to no avail.
What I did:
- In `./templates/all.ctrl`, I’ve enabled `verbosity_commands=debug` and `verbosity_logfiles=debug`.
- Next, I adapted `./templates/template1.slurm.sh` to our system:
template1.slurm.sh
```
# Slurm Settings
###############################################################################
#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
##SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=cn1522
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.error # File to which standard err will be written
#SBATCH --signal=10@300
```
- Finally, I ran `./vf_prepare_folders.sh`, followed by `./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1`.
Most jobs finish shortly after appearing in the `squeue` output; the longest-running job stays there for approximately 1 to 1.5 minutes.
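To rule out that the two debug settings were left commented out, the control file can be checked quickly. This is only a sketch: the helper name `show_verbosity` is made up for illustration, and it assumes the settings appear as plain `key=value` lines in `./templates/all.ctrl`.

```shell
# Hypothetical helper (not part of VFVS): print the active verbosity settings.
# Assumes plain key=value lines; commented-out lines are not matched.
show_verbosity() {
    ctrl_file="${1:-./templates/all.ctrl}"
    grep -E '^[[:space:]]*verbosity_(commands|logfiles)=' "$ctrl_file"
}

# Usage: show_verbosity ./templates/all.ctrl
```

If a setting is missing from the output, it is either commented out or spelled differently in the file.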
This is the relevant sbatch output of the first job:
../workflow/output-files/jobs/job-1.1_172.out
```
===========================================================
* Preparing the to-do lists for jobline 1
Wed Oct 28 16:50:21 CET 2020
Starting the (re)filling of the todolists of the queues.
Before (re)filling the todolists the queue 1-1-1 had 0 ligands todo distributed in 0 collections.
After (re)filling the todolists the queue 1-1-1 has 106 ligands todo distributed in 4 collections.
The todo lists for the queues were (re)filled in 0 second(s) (waiting time not included).
The waiting time was 0 second(s).
Starting job step 1 on host cn1522-irc-ugent-be.
* Trying to stop this queue and causing the jobline to fail...
* Trying to stop this queue and causing the jobline to fail...
*** Final Job Information ***
======================================================================
Starting time: Wed Oct 28 16:50:20 CET 2020
Ending time: Wed Oct 28 16:50:22 CET 2020
```
This is the output of the corresponding .error:
../workflow/output-files/jobs/job-1.1_172.error
Error was trapped
srun: error: cn1522-irc-ugent-be: task 0: Exited with exit code 1
Error in bash script one-step.sh
Error on line 3
...
Error was trapped
Error in bash script /mnt/DATA1/programs/slurm/spool/slurmd/job00172/slurm_script
Error on line 298
...
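The two line numbers in this output (line 3 of `one-step.sh` and line 298 of `slurm_script`, which is Slurm's spool copy of the submitted batch script) are the only concrete pointers so far. A minimal sketch for looking at the code around such a line follows; `show_context` is a made-up helper name, and the paths in the usage comment are placeholders, not actual VFVS locations.

```shell
# Hypothetical helper: print the lines around a given line number of a script,
# marking the reported line with ">".
show_context() {
    ctx_file="$1"          # script to inspect
    ctx_line="$2"          # line number reported in the error message
    ctx_radius="${3:-3}"   # number of lines to show before and after
    awk -v t="$ctx_line" -v r="$ctx_radius" '
        NR >= t - r && NR <= t + r {
            printf "%s %d: %s\n", (NR == t ? ">" : " "), NR, $0
        }' "$ctx_file"
}

# Usage (placeholder paths):
#   show_context path/to/submitted-job-script 298
#   show_context path/to/one-step.sh 3
```

Since the spool copy usually disappears once the job ends, the line number would have to be looked up in the job script that was actually submitted.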
I have no idea where to start troubleshooting, and have already spent quite some time on this. Can somebody give me a hint towards the actual root cause of this error?
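For completeness, here is a rough sketch of how the output files could be scanned for further hints. It assumes that any additional logs sit somewhere below `../workflow/output-files` (only the `jobs` subfolder is confirmed above); the helper name `find_error_hints` and the search terms are arbitrary choices.

```shell
# Hypothetical helper: list every line below a directory that mentions a
# typical failure keyword, with file name and line number.
find_error_hints() {
    log_dir="${1:-../workflow/output-files}"
    grep -rniE 'error|fail|not found|denied' "$log_dir"
}

# Usage: find_error_hints ../workflow/output-files
```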
Thank you kindly,
Erwin