In my experience, if the command is not found then you're not calling it from the right folder. So even if the tutorial says, for example, submit.sh XXX YYY, on your own cluster you may need to give the location of this submit script, e.g.: /home/dagarshali/submit.sh XXX YYY
Hi
This is my first time using HPC. I had set up an AWS cluster with SGE but was using Slurm commands. I have since fixed that, and now I get a new error, shown below:
sbatch: error: invalid partition specified: test
sbatch: error: Batch job submission failed: Invalid partition name specified
Error was trapped which is a nonstandard error.
Error in bash script submit.sh
Error on line 68
Invalid partition test. What am I supposed to give for that?
Thank you for the response… I did that… Now I get another error:
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Regarding the memory problem: each compute node has a maximum amount of memory which can be used by Slurm jobs. If the batch jobs request more memory than is available on the compute nodes, Slurm will show this error message and refuse to accept the job.
The total memory which you are requesting per node is (memory per CPU) × (CPUs per node).
The memory which you request per cpu (core) is specified in the file tools/templates/template1.slurm.sh by the setting --mem-per-cpu=....
The number of CPUs per node is specified in the file all.ctrl by the setting cpus_per_step.
Thus you either need to increase the memory of the virtual machines/nodes which Slurm uses, or decrease the memory requested per node (I would recommend at least 500 MB per core).
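As a rough illustration (a minimal sketch using the values mentioned further down in this thread, i.e. --mem-per-cpu=500M and cpus_per_step=96; the variable names are only for the example):

```
# Back-of-the-envelope check of the per-node memory request (illustrative values)
mem_per_cpu_mb=500      # from --mem-per-cpu=500M in template1.slurm.sh
cpus_per_node=96        # from cpus_per_step in all.ctrl
echo "Requested per node: $(( mem_per_cpu_mb * cpus_per_node )) MB"   # prints 48000 MB
```

If the nodes expose less memory than that to Slurm, the submission is rejected with the "Memory specification can not be satisfied" error.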
```
#SBATCH --job-name=h-1.1
##SBATCH --mail-user=To be completed if uncommented
#SBATCH --mail-type=fail
#SBATCH --time=00-12:00:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=compute
#SBATCH --output=…/workflow/output-files/jobs/job-1.1_%j.out  # File to which standard out will be written
#SBATCH --error=…/workflow/output-files/jobs/job-1.1_%j.out   # File to which standard err will be written
#SBATCH --signal=10@300
```
In the templates/all.ctrl file, we have:
```
steps_per_job=1
cpus_per_step=96
queues_per_step=96
cpus_per_queue=1
```
The compute node that we have in the cluster is a c5.24xlarge (96 vCPUs, 200 GB memory) on AWS.
I issued the command ./vf_startjobline.sh 1 5 templates/template1.slurm.sh submit 1
Here are a few trials I attempted:

1. Changed cpus_per_step=96 and queues_per_step=96 based on your suggestion, and ran the command ./vf_startjobline.sh 1 5 templates/template1.slurm.sh submit 1
– Result: sbatch: error: Batch job submission failed: More processors requested than permitted
2. Changed cpus_per_step=1 and queues_per_step=1, keeping the default template1.slurm.sh with #SBATCH --mem-per-cpu=500M
– Result: sbatch: error: Memory specification can not be satisfied
3. Changed cpus_per_step=1 and queues_per_step=1, and edited template1.slurm.sh to use #SBATCH --mem-per-cpu=1M
– Result: No errors, and it starts to run
I have never had to set up a cluster and scheduler myself before, so I am not sure what is happening or how to fix these errors. Any help you can provide to fix or troubleshoot this issue would be invaluable.
Once the Slurm cluster is running, you can find out more about the available resources (partitions, nodes) by using, for instance, the scontrol command. The settings which Slurm uses to define the available resources are usually in the slurm.conf file in the Slurm installation folder.
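For example, these standard Slurm commands (not specific to VirtualFlow) show what the scheduler thinks is available:

```
sinfo                     # partitions, their state and node counts
scontrol show partition   # per-partition limits (e.g. MaxNodes, MaxMemPerNode)
scontrol show nodes       # per-node CPU count (CPUTot) and memory (RealMemory)
```

The partition name used with --partition=... in the job template has to match one of the partitions listed there.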
When we tested VirtualFlow on AWS last year, we didn't have these errors; the Slurm cluster was set up automatically in a way which could be used by VirtualFlow. We used, for example, m5.24xlarge compute nodes for testing.
--mem-per-cpu=1M will certainly be too little for VirtualFlow, but it could be that Slurm (if set up in this way) doesn't mind if the workflow actually uses more memory than it requested when the job was submitted.
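Whether the requested memory is actually enforced depends on how that Slurm installation is configured; as a rough pointer (these are standard Slurm options, and the values shown are only illustrative, not necessarily what your cluster uses):

```
# slurm.conf – memory is only scheduled as a resource with a *_Memory selector, e.g.:
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf – jobs are only confined to the memory they requested if this is enabled:
ConstrainRAMSpace=yes
```

If memory is neither scheduled nor constrained, a job submitted with --mem-per-cpu=1M may run fine even though it uses far more memory in practice.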
For help on how to set up Slurm with suitable settings on AWS, it might be best to get in touch with the people from these projects directly (AWS ParallelCluster, SchedMD Slurm) and to use the resources they provide (VirtualFlow only runs on top of these systems, and we are not specialized in configuring them either, since in most cases they are provided in a way which works automatically). Here are some resources from ParallelCluster and SchedMD:
I have a similar problem when following tutorial 2:
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
Error was trapped which is a nonstandard error.
Error in bash script submit.sh
Error on line 68
I am using a single-CPU machine from the Google Cloud Console just for testing. Therefore I changed the settings in the templates/all.ctrl file to be: