Recently I was following the tutorials on the VirtualFlow website to install the software on our UHN Slurm cluster. However, when I ran the template example, I encountered the following error:
"srun: error: Unable to create step for job 874794: Memory required by task is not available
Error was trapped
Error in bash script /var/spool/slurmd/job874794/slurm_script
Error on line 298
Environment variables
"
I changed the partition to “all” according to our slurm partition list. Here is its basic info:
all
Default=YES
MaxNodes=1
MaxTime=5-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=30720M
I am wondering if anyone here has encountered such issues before. Look forward to hearing from you. Thank you!
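For what it's worth, here is a minimal sketch of how I would double-check the partition limits and test a bare single-CPU step outside of VFVS (this assumes the standard Slurm client tools are available; the partition name "all" is taken from the list above):
scontrol show partition all | grep -oE '(DefMemPerCPU|MaxMemPerNode|MaxTime)=[^ ]+'
# Try a plain one-CPU step with the same memory request the job template uses;
# if this already fails with "Memory required by task is not available", the
# problem is on the Slurm side rather than in VFVS.
srun --partition=all --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=256M --time=00:05:00 hostname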
Thank you for your reply. Below is the configuration of my all.ctrl file.
job_letter=t
# One alphabetic character (i.e. a letter from a-z or A-Z)
# Should not be changed during runtime, and be the same for all joblines
# Required when running VF several times on the same cluster to distinguish the jobs in the batchsystem
# Settable via range control files: No
batchsystem=SLURM
# Possible values: SLURM, TORQUE, PBS, LSF, SGE
# Settable via range control files: No
partition=all
# Partitions are also called queues in some batchsystems
# Settable via range control files: Yes
timelimit=0-07:00:00
# Format for slurm: dd-hh:mm:ss
# Format for TORQUE and PBS: hh:mm:ss
# Format for SGE: hh:mm:ss
# Format for LSF: hh:mm
# For all batchsystems: always fill up with two digits per field (used by the job scripts)
# Settable via range control files: Yes
steps_per_job=1
# Not (yet) available for LSF and SGE (is always set to 1)
# Should not be changed during runtime, and be the same for all joblines
# Settable via range control files: Yes
cpus_per_step=1
# Sets the slurm cpus-per-task variable (task = step) in SLURM
# In LSF this corresponds to the number of slots per node
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
queues_per_step=1
# Sets the number of queues/processes per step
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
cpus_per_queue=1
# Should be equal to or higher than <cpus-per-step/queues-per-step>
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
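As a back-of-the-envelope check of what these values imply for each job (this is only my own reading of the settings above, not an official VFVS formula):
# CPUs per job    = steps_per_job * cpus_per_step     = 1 * 1 = 1
# queues per job  = steps_per_job * queues_per_step   = 1 * 1 = 1
# memory per job  = CPUs per job  * mem-per-cpu       = 1 * 256 MB = 256 MB
So every srun step launched inside the job has to fit within 256 MB in total, which matches the partition's DefMemPerCPU exactly.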
Here are my Slurm settings in template1.slurm.sh:
# Slurm Settings
###############################################################################
#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=256M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=main
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.out # File to which standard err will be written
#SBATCH --signal=10@300
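To rule out VFVS itself, I could also try submitting a stripped-down job with the same header and a single srun step (just a sketch; the file name test-header.sh is made up):
#!/usr/bin/env bash
#SBATCH --job-name=memtest
#SBATCH --time=00-00:10:00
#SBATCH --mem-per-cpu=256M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=all
# One step with the same memory request as the job itself; if this already
# fails with "Memory required by task is not available", the problem is in the
# Slurm/partition configuration rather than in the VFVS job script.
srun --ntasks=1 --cpus-per-task=1 --mem-per-cpu=256M hostname
Submitted with: sbatch test-header.sh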
In our Slurm system, the partition names include "all" and "himem":
all
Default=YES
MaxNodes=1
MaxTime=5-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=30720M
himem
Default=NO
MaxNodes=1
MaxTime=7-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=61440M
Limit: max 60 running jobs
These are the only parts that I modified in the files; I did not change anything else.
That is strange. It seems to be a SLURM-related problem specific to your cluster.
Your config and job template seem to be fine. I see that you even reduced the mem-per-cpu setting in your job file; I believe the default value was higher. With only 256 MB you might run into other problems during runtime, since the docking program and VirtualFlow often require around 500 MB per CPU, sometimes even more.
Thank you for your prompt reply and suggestions. Yes, I reduced the memory to 256 MB and the request is still rejected.
I will contact our cluster manager about this issue. I also think it is likely a cluster-specific issue, as I was using the preconfigured example from VirtualFlow.
Were you able to get this issue fixed? If so, could you please share with us how you did it? This would be useful to other members of the VirtualFlow community in case they encounter the same error.
Try increasing the memory per CPU, starting from 1 GB or 2 GB. Edit your template.slurm (I am not sure which one you are using in your case) as well as all.ctrl.
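For the Slurm template, the change is a single line, for example something like this (a sketch only; adjust the path to wherever your template1.slurm.sh lives):
# Raise the per-CPU memory request in the job template, e.g. from 256M to 1G
# (run from the directory that contains template1.slurm.sh):
sed -i 's/--mem-per-cpu=256M/--mem-per-cpu=1G/' template1.slurm.sh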
The amount of RAM required by VFVS depends on the precise docking program, the docking settings, and the types of ligands which you are screening. In my experience, at least 500 MB per queue/core is required in many cases, but I have had a few cases where I needed to go to 600 or 700 MB. This assumes that the setting tempdir_default in the control file does not point to a RAM-drive (such as /dev/shm), but to a hard drive such as /tmp.
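For example, in the control file this is the relevant line (a sketch; the surrounding comments in your all.ctrl may differ):
tempdir_default=/tmp
# A disk-backed location such as /tmp keeps the workflow's temporary files out
# of RAM; pointing this at a ram-drive such as /dev/shm would add the size of
# those files on top of the ~500-700 MB per queue mentioned above.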