In my experience, if the command is not found then you're not calling it from the right folder. So even if the tutorial says, for example, submit.sh XXX YYY, on your own cluster you may need to give the location of this submit script, e.g.: /home/dagarshali/submit.sh XXX YYY
Hi
This is my first time using HPC. I had set up an AWS cluster with SGE but was using Slurm commands. I have since fixed that, and now I get a new error, shown below:
sbatch: error: invalid partition specified: test
sbatch: error: Batch job submission failed: Invalid partition name specified
Error was trapped which is a nonstandard error.
Error in bash script submit.sh
Error on line 68
Invalid partition test. What am I supposed to give for that?
Thank you for the response… I did that… Now I get another error:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Regarding the memory problem: each compute node has a maximum amount of memory which can be used by Slurm jobs. If the batch jobs request more memory than is available on the compute nodes, Slurm will show this error message and refuse to accept the job.
The total memory which you are requesting per node is (memory per CPU) × (CPUs per node).
The memory which you request per cpu (core) is specified in the file tools/templates/template1.slurm.sh by the setting --mem-per-cpu=....
The number of CPUs per node is specified in the file all.ctrl by the setting cpus_per_step.
Thus you either need to increase the memory of the virtual machines/nodes which Slurm uses, or decrease the memory requested per node (I would recommend at least 500 MB per core).
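As a rough illustration (a minimal sketch using the values mentioned further down in this thread, i.e. --mem-per-cpu=500M and cpus_per_step=96; the variable names are only for the example):

```
# Back-of-the-envelope check of the per-node memory request (illustrative values)
mem_per_cpu_mb=500      # from --mem-per-cpu=500M in template1.slurm.sh
cpus_per_node=96        # from cpus_per_step in all.ctrl
echo "Requested per node: $(( mem_per_cpu_mb * cpus_per_node )) MB"   # prints 48000 MB
```

If the nodes expose less memory than that to Slurm, the submission is rejected with the "Memory specification can not be satisfied" error.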
```
#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=compute
#SBATCH --output=…/workflow/output-files/jobs/job-1.1_%j.out  # File to which standard out will be written
#SBATCH --error=…/workflow/output-files/jobs/job-1.1_%j.out   # File to which standard err will be written
#SBATCH --signal=10@300
```
In the templates/all.ctrl file, we have:
```
steps_per_job=1
cpus_per_step=96
queues_per_step=96
cpus_per_queue=1
```
The compute node that we have in the cluster is a c5.24xlarge (96 vCPUs, 200 GB memory) on AWS.
I issued the command ./vf_startjobline.sh 1 5 templates/template1.slurm.sh submit 1
Here are a few trials I attempted:

1. Changed cpus_per_step=96 and queues_per_step=96 based on your suggestion, and ran the command ./vf_startjobline.sh 1 5 templates/template1.slurm.sh submit 1
– Result: sbatch: error: Batch job submission failed: More processors requested than permitted
2. Changed cpus_per_step=1 and queues_per_step=1, keeping the default template1.slurm.sh with #SBATCH --mem-per-cpu=500M
– Result: sbatch: error: Memory specification can not be satisfied
3. Changed cpus_per_step=1 and queues_per_step=1, and edited template1.slurm.sh to use #SBATCH --mem-per-cpu=1M
– Result: No errors, and it starts to run
I have never had to set up a cluster and scheduler myself before, so I am not sure what is happening or how to fix these errors. Any help you can provide to fix or troubleshoot this issue would be invaluable.
Once the Slurm cluster is running, you can find out more about the available resources (partitions, nodes) by using, for instance, the scontrol command. The settings which Slurm uses to define the available resources are usually in the slurm.conf file in the Slurm installation folder.
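For example, these standard Slurm commands (not specific to VirtualFlow) show what the scheduler thinks is available:

```
sinfo                     # partitions, their state and node counts
scontrol show partition   # per-partition limits (e.g. MaxNodes, MaxMemPerNode)
scontrol show nodes       # per-node CPU count (CPUTot) and memory (RealMemory)
```

The partition name used with --partition=... in the job template has to match one of the partitions listed there.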
When we tested VirtualFlow on AWS last year, we didn't have these errors; the Slurm cluster was set up automatically in a way which could be used by VirtualFlow. We used, for example, m5.24xlarge compute nodes for testing.
--mem-per-cpu=1M will certainly be too little for VirtualFlow, but it could be that Slurm (if set up in this way) doesn't mind if the workflow actually uses more memory than it requested when the job was submitted.
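Whether the requested memory is actually enforced depends on how that Slurm installation is configured; as a rough pointer (these are standard Slurm options, and the values shown are only illustrative, not necessarily what your cluster uses):

```
# slurm.conf – memory is only scheduled as a resource with a *_Memory selector, e.g.:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf – jobs are only confined to the memory they requested if this is enabled:
ConstrainRAMSpace=yes
```

If memory is neither scheduled nor constrained, a job submitted with --mem-per-cpu=1M may run fine even though it uses far more memory in practice.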
For help on how to set up Slurm with suitable settings on AWS, it might be best to get in touch with the people from these projects directly (AWS ParallelCluster, SchedMD Slurm) and to use the resources they provide (VirtualFlow only runs on top of these systems, and we are not specialized in configuring them either, since in most cases they are provided in a way which works automatically). Here are some resources from ParallelCluster and SchedMD:
I have a similar problem when following tutorial 2:
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Error was trapped which is a nonstandard error.
Error in bash script submit.sh
Error on line 68
I am using a single-CPU machine from the Google Cloud Console just for testing. Therefore I changed the settings in the templates/all.ctrl file to be: