Jobs not running in parallel?

I get this output from my ./vf_report.sh -c workflow command:

                                         Joblines

Number of jobfiles in the workflow/jobfiles/main folder: 90
Number of joblines in the batch system: 80
Number of joblines in the batch system currently running: 1

  • Number of joblines in queue “project” currently running: 1

Number of joblines in the batch system currently not running: 79

  • Number of joblines in queue “project” currently not running: 79

Number of cores/slots currently used by the workflow: 90

Does this mean I am only running 1 job at a time? I thought this should be going in parallel - I have a 96-CPU system set up on GCP. Where would I look to change the input to ensure that things are being run in parallel?

Thanks in advance -

Hey there,

I’m new to computing, but it could be that you’re only allowed to run 1 job at a time within your cluster - that might be set by the admin? I have that issue on certain accounts. However, it does run in parallel across all CPUs on a node, at least for my use : ), though I have been using a SLURM manager.

If you want to ensure that you’re taking advantage of a whole node, make sure that the all.ctrl file in the tools/templates directory is set to use the whole node. Explanations of how to do so can be found in the tutorials, and there is some information in the supplementary information of the Nature paper about VirtualFlow.

Hope this is of some help as you troubleshoot.

Cheerfully

Thanks BK for the suggestions.

The cluster in question is under my control - I am running a virtual instance on the Google Cloud Platform (GCP) with 96 CPUs, with SLURM as the job manager. I have root access, so anything that needs to be changed here is accessible to me. I have read (and reread and reread) the VirtualFlow docs - I can’t find where my setup may be wrong such that I can only run 1 job at a time here. The work is not being spread across all CPUs - CPU load rarely goes above 25%. I have queued anywhere from 5 to 90 jobs at the start and still get the same one-at-a-time processing.

So - still hoping that someone out there with experience in VirtualFlow, SLURM, and GCP will help me find the setting that I need to adjust.

Hi,

It looks from your first output like your one job is using pretty much all cores of that machine (90).

Could you share your VirtualFlow all.ctrl file so we can see the settings there? Something like:

cat templates/all.ctrl | egrep -v '#'

Also, how is your Slurm cluster set up: only that one compute node with 96 CPUs, or have you configured it to scale the number of compute nodes based on load? That info should be in

/apps/slurm/slurm-19.05.6/etc/slurm.conf
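If it helps, these are the kinds of lines to look for in slurm.conf; the node and partition names below are placeholders, not your actual values:

```
# Illustrative slurm.conf excerpts (placeholder names/values):
SelectType=select/cons_res          # lets several jobs share one node's cores
SelectTypeParameters=CR_Core
NodeName=compute-0 CPUs=96          # how many CPUs Slurm believes the node has
PartitionName=project Nodes=compute-0 Shared=NO
```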

Hi Guilhem -
Thanks for the response.

To answer your first request - here is the output. I do have my CPU count set to use all of the CPUs. I am running this as a single 96-CPU instance - SLURM can’t launch additional instances the way I have it set up.

Here is the all.ctrl output:


*************************************************************** Job Resource Configuration ****************************************************************


job_letter=a

batchsystem=SLURM

partition=project

timelimit=0-05:00:00

steps_per_job=1

cpus_per_step=90

queues_per_step=90

cpus_per_queue=1


********************************************************************* Workflow Options ********************************************************************


central_todo_list_splitting_size=10000

ligands_todo_per_queue=1000

ligands_per_refilling_step=1000

collection_folder=../../ligand_library/

minimum_time_remaining=10

dispersion_time_min=3
dispersion_time_max=10

verbosity_commands=standard

verbosity_logfiles=debug

store_queue_log_files=all_uncompressed

keep_ligand_summary_logs=true

error_sensitivity=high

error_response=ignore

tempdir=/dev/shm


***************************************************************** Virtual Screening Options ***************************************************************


docking_scenario_names=qvina02_rigid_receptor1:smina_rigid_receptor1

docking_scenario_programs=qvina02:smina_rigid

docking_scenario_replicas=1:1

docking_scenario_inputfolders=../input-files/qvina02_rigid_receptor1:../input-files/smina_rigid_receptor1


******************************************************************* Terminating Variables *****************************************************************


stop_after_next_check_interval=false

ligand_check_interval=100

stop_after_collection=false

stop_after_job=false


And here is the slurm.conf output (with more sensitive info xxxxxx’d out):

ClusterName=virtualflow
SlurmctldHost=xxxxxxxxxxxxxxxxx
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
EnforcePartLimits=ALL
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SrunPortRange=60001-63000
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/home/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
TmpFS=/scratch
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=60
PriorityFavorSmall=YES
PriorityMaxAge=7-0
PriorityUsageResetPeriod=WEEKLY
PriorityWeightAge=1000
PriorityWeightFairshare=2000
PriorityWeightJobSize=3000
PriorityWeightPartition=5000
PriorityWeightQOS=0
MaxArraySize=10000
MaxJobCount=100000
AccountingStoreJobComment=YES
AccountingStorageEnforce=limits
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm-llnl/job_completions
JobCompHost=xx.xx.xx.xx (removed this)
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=xxxxxxxxxxxxxxxxxxxxx NodeAddr=xxxxxxxx RealMemory=84000 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 Procs=96
PartitionName=debug Priority=8000 Nodes=xxxxxxxxxxxxxxx AllowQOS=restrained Default=NO MaxTime=INFINITE State=UP Shared=NO
PartitionName=project Priority=9000 Nodes=xxxxxxxxxxxxx MaxTime=INFINITE AllowQOS=maxjobs Default=NO State=UP Shared=NO
PartitionName=cpu Priority=5000 Nodes=xxxxxxxxxxxxxxx MaxTime=INFINITE AllowQOS=cpuonly Default=YES State=UP Shared=NO
PartitionName=gpu Priority=4000 Nodes=xxxxxxxxxxxxxx MaxTime=INFINITE MaxCPUsPerNode=8 Default=NO State=UP Shared=NO

Many thanks for any suggestions you can offer.

-Byron

Hey Byron,

Thank you for sharing the outputs. I think your issue comes from the cpus_per_step and queues_per_step configuration. Please check out another example here.

I think what’s happening in your scenario is that with cpus_per_step=90, you pass that requirement on to Slurm: it must find a compute node with at least 90 free cores to execute one job, which puts all subsequent jobs on hold until that one job completes, since you only have 1 compute node with 96 cores. Even if the given job doesn’t need 90 cores, from a scheduling standpoint Slurm considers those 90 cores in use. I’d recommend lowering the cpus_per_step/queues_per_step counts, and probably also allowing Slurm to schedule new compute nodes when needed, using a smaller instance type (8, 16, 32, or 64 CPUs, depending on your job requirements); that would let you scale out the processing across multiple nodes in the cluster as well.
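As a rough sketch of the scheduling math (assuming, as described above, that Slurm reserves cpus_per_step cores per jobline):

```shell
# How many joblines fit on one node, given that Slurm reserves
# cpus_per_step cores per jobline (illustrative numbers):
node_cores=96

for cpus_per_step in 90 8; do
    concurrent=$((node_cores / cpus_per_step))
    echo "cpus_per_step=${cpus_per_step}: at most ${concurrent} jobline(s) running at once"
done
```

With cpus_per_step=90 only one jobline fits on the 96-core node at a time; dropping it to 8, for example, would allow 12 joblines to run concurrently on that same node.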

Hopefully this helps and you can distribute more job processing in your setup.

Regards,
Guilhem


Thanks Guilhem! Makes sense - I will give it a try.

Hi Guilhem -

Thanks so much for your help with this -

I followed the example you referenced for all.ctrl.

Slight change - rather than ‘8’ as the CPU value, I chose 5, i.e.:

steps_per_job=1
cpus_per_step=5
queues_per_step=5
cpus_per_queue=1

Since I have a 96-CPU setup, I thought launching 18 joblines would be suitable.

I used
./vf_start_jobline.sh 1 18 templates/template1.slurm.sh submit 1

to start (after running the ./vf_prepare_folders)
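The expected core usage for this configuration, assuming each jobline reserves cpus_per_step cores, would be:

```shell
# Expected reservation: 18 joblines at cpus_per_step=5 cores each
joblines=18
cpus_per_step=5
total=$((joblines * cpus_per_step))
echo "${total} of 96 cores requested"   # 90 of 96, so all 18 joblines should fit on the node
```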

It still appears that the jobs are running serially:

                                     Workflow Status

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

                                         Joblines

Number of jobfiles in the workflow/jobfiles/main folder: 18
Number of joblines in the batch system: 1
Number of joblines in the batch system currently running: 1

  • Number of joblines in queue “project” currently running: 1

Number of joblines in the batch system currently not running: 0

  • Number of joblines in queue “project” currently not running: 0

Number of cores/slots currently used by the workflow: 6

                                        Collections
    

Total number of ligand collections: 52
Number of ligand collections completed: 0
Number of ligand collections in state “processing”: 5
Number of ligand collections not yet started: 47

                             Ligands (in completed collections)

Total number of ligands: 47638
Number of ligands started: 0
Number of ligands successfully completed: 0
Number of ligands failed: 0

                            Dockings (in completed collections)

Docking runs per ligand: 2
Number of dockings started: 0
Number of dockings successfully completed: 0
Number of dockings failed: 0

A quick view of CPU load with ‘top’ shows that only 5 qvina/smina processes are running at any one time, with only 5% of total CPU use on the system.

Could it be something wrong with my slurm setup that is restricting the cpu use?

Thanks -

Byron

I’m not completely sure if it’s correct, but it seems that VirtualFlow is told to use 1 node per job (steps_per_job), and that node should use 5 CPUs. But you only have 1 node, which has 96 CPUs, and you’re not using all of them, since you’ve told the ctrl file to use just 1 node and that node should use only 5 CPUs.

If you have only 1 node, then I don’t think you can use a parallel run. But my knowledge is limited, so I hope somebody can verify what I say, or at least correct me.