Recently I was following the tutorials on the VirtualFlow website to install the software on our UHN Slurm cluster. However, when I ran the template example, I encountered the following error:
"srun: error: Unable to create step for job 874794: Memory required by task is not available
Error was trapped
Error in bash script /var/spool/slurmd/job874794/slurm_script
Error on line 298
Environment variables
"
I changed the partition to “all” according to our slurm partition list. Here is its basic info:
all
Default=YES
MaxNodes=1
MaxTime=5-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=30720M
I am wondering if anyone here has encountered such issues before. Look forward to hearing from you. Thank you!
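For what it's worth, here is a minimal sketch of how I would double-check the partition limits and test a bare single-CPU step outside of VFVS (this assumes the standard Slurm client tools are available; the partition name "all" is taken from the list above):
scontrol show partition all | grep -oE '(DefMemPerCPU|MaxMemPerNode|MaxTime)=[^ ]+'
# Try a plain one-CPU step with the same memory request the job template uses;
# if this already fails with "Memory required by task is not available", the
# problem is on the Slurm side rather than in VFVS.
srun --partition=all --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=256M --time=00:05:00 hostname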
Thank you for your reply. Below is the configuration of my all.ctrl file.
job_letter=t
# One alphabetic character (i.e. a letter from a-z or A-Z)
# Should not be changed during runtime, and be the same for all joblines
# Required when running VF several times on the same cluster to distinguish the jobs in the batchsystem
# Settable via range control files: No
batchsystem=SLURM
# Possible values: SLURM, TORQUE, PBS, LSF, SGE
# Settable via range control files: No
partition=all
# Partitions are also called queues in some batchsystems
# Settable via range control files: Yes
timelimit=0-07:00:00
# Format for slurm: dd-hh:mm:ss
# Format for TORQUE and PBS: hh:mm:ss
# Format for SGE: hh:mm:ss
# Format for LSF: hh:mm
# For all batchsystems: always fill up with two digits per field (used by the job scripts)
# Settable via range control files: Yes
steps_per_job=1
# Not (yet) available for LSF and SGE (is always set to 1)
# Should not be changed during runtime, and be the same for all joblines
# Settable via range control files: Yes
cpus_per_step=1
# Sets the slurm cpus-per-task variable (task = step) in SLURM
# In LSF this corresponds to the number of slots per node
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
queues_per_step=1
# Sets the number of queues/processes per step
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
cpus_per_queue=1
# Should be equal to or higher than <cpus-per-step/queues-per-step>
# Should not be changed during runtime, and be the same for all joblines
# Not yet available for SGE (always set to 1)
# Settable via range control files: Yes
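As a back-of-the-envelope check of what these values imply for each job (this is only my own reading of the settings above, not an official VFVS formula):
# CPUs per job    = steps_per_job * cpus_per_step     = 1 * 1 = 1
# queues per job  = steps_per_job * queues_per_step   = 1 * 1 = 1
# memory per job  = CPUs per job  * mem-per-cpu       = 1 * 256 MB = 256 MB
So every srun step launched inside the job has to fit within 256 MB in total, which matches the partition's DefMemPerCPU exactly.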
Here are my Slurm settings in template1.slurm.sh:
# Slurm Settings
###############################################################################
#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=256M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=main
#SBATCH --output=../workflow/output-files/jobs/job-1.1_%j.out # File to which standard out will be written
#SBATCH --error=../workflow/output-files/jobs/job-1.1_%j.out # File to which standard err will be written
#SBATCH --signal=10@300
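To rule out VFVS itself, I could also try submitting a stripped-down job with the same header and a single srun step (just a sketch; the file name test-header.sh is made up):
#!/usr/bin/env bash
#SBATCH --job-name=memtest
#SBATCH --time=00-00:10:00
#SBATCH --mem-per-cpu=256M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=all
# One step with the same memory request as the job itself; if this already
# fails with "Memory required by task is not available", the problem is in the
# Slurm/partition configuration rather than in the VFVS job script.
srun --ntasks=1 --cpus-per-task=1 --mem-per-cpu=256M hostname
Submitted with: sbatch test-header.sh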
In our Slurm system, the partition names include "all" and "himem":
all
Default=YES
MaxNodes=1
MaxTime=5-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=30720M
himem
Default=NO
MaxNodes=1
MaxTime=7-00:00:00
DefMemPerCPU=256M
DefaultTime=1-00:00:00
MaxMemPerNode=61440M
Limit: max 60 running jobs
These are the only parts that I modified in the files; I did not change anything else.
That is strange. It seems to be a SLURM-related problem specific to your cluster.
Your config and job template seem to be fine. I see that you even reduced the mem-per-cpu setting in your job file; I believe the default value was higher. With only 256 MB you might run into other problems during runtime, since the docking program and VirtualFlow often require around 500 MB per CPU, sometimes even more.
Thank you for your prompt reply and suggestions. Yes, I reduced the memory to 256 MB and the request is still rejected.
I will contact our cluster manager about this issue. I also think it is likely a cluster-specific issue, as I was using the preconfigured example from VirtualFlow.
Were you able to get this issue fixed? If so, could you please share with us how you did it? This would be useful to other members of the VirtualFlow community in case they encounter the same error.
Try increasing the memory per CPU, starting from 1 GB or 2 GB. Edit your template.slurm (I am not sure which one you are using in your case) as well as all.ctrl.
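For the Slurm template, the change is a single line, for example something like this (a sketch only; adjust the path to wherever your template1.slurm.sh lives):
# Raise the per-CPU memory request in the job template, e.g. from 256M to 1G
# (run from the directory that contains template1.slurm.sh):
sed -i 's/--mem-per-cpu=256M/--mem-per-cpu=1G/' template1.slurm.sh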
The amount of RAM required by VFVS depends on the precise docking program, the docking settings, and the types of ligands which you are screening. In my experience, at least 500 MB per queue/core is required in many cases, but I have had a few cases where I needed to go to 600 or 700 MB. This assumes that the setting tempdir_default in the control file does not point to a RAM-drive (such as /dev/shm), but to a hard drive such as /tmp.
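For example, in the control file this is the relevant line (a sketch; the surrounding comments in your all.ctrl may differ):
tempdir_default=/tmp
# A disk-backed location such as /tmp keeps the workflow's temporary files out
# of RAM; pointing this at a ram-drive such as /dev/shm would add the size of
# those files on top of the ~500-700 MB per queue mentioned above.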