Optimal high performance computing facility setup

kaneki · March 31, 2020, 2:15pm

Hello community

I have been granted 10,000 cores for a total of 500,000 CBU of compute time. If I'm correct, this means I can run VirtualFlow for 50 hours on 10,000 cores (500,000 / 10,000). That is not enough to screen 1 billion compounds. In the VirtualFlow publication, the authors mention that 10,000 cores can screen 1 billion compounds in 336 hours, so with my 50 hours I can screen roughly 140 million compounds.
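The estimate above can be checked with a short sketch. It assumes that one CBU corresponds to one core-hour and that throughput scales linearly with runtime at a fixed core count; both are assumptions, and the function names are purely illustrative.

```python
def wall_hours(budget_core_hours: float, cores: int) -> float:
    """Hours of wall time available when all cores run in parallel."""
    return budget_core_hours / cores


def compounds_screened(hours: float,
                       ref_compounds: float = 1_000_000_000,
                       ref_hours: float = 336) -> float:
    """Scale the reference figure (1 billion compounds in 336 hours
    on 10,000 cores) linearly to a shorter runtime."""
    return ref_compounds * hours / ref_hours


hours = wall_hours(500_000, 10_000)
print(hours)                                       # 50.0
print(round(compounds_screened(hours) / 1e6, 1))   # 148.8 (million compounds)
```

Under these assumptions the linear estimate comes out closer to 149 million than to 140 million compounds.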

Now my question: if I screen 140 million compounds with 10,000 cores, what would be the optimal parameters in the ctrl file for such a setup, e.g. steps_per_job=1, cpus_per_step=1, queues_per_step=1, or cpus_per_queue=1?

In addition, how many jobs would be appropriate to use in ./vf_start_jobline.sh 1 10 templates/template1.slurm.sh submit 1?

Let me know what you think!

Christoph · March 31, 2020, 2:43pm

Hi Kaneki,

Welcome back, and congratulations on being granted the computation time.

VirtualFlow is quite flexible regarding these settings (so that it can run on any HPC system). The optimal settings will depend on the specific HPC system you are using.

If, for example, your HPC system always allocates full compute nodes to users/jobs, then I would set cpus_per_step and queues_per_step to the number of cores per compute node. For such a system I would set steps_per_job to something like 10, meaning 10 nodes per job; cpus_per_queue=1 is always recommended. So if there are, for instance, 32 cores per node, then one job (with 10 nodes per job) would use a total of 320 cores. Thus, if you want to use 10,000 cores in parallel in this case, you would need to submit around 31 jobs.

The number of compounds you can screen with your computation time will also depend on the processor speed. Maybe your CPUs are faster than the ones that were used for the publication.

kaneki · March 31, 2020, 2:45pm

Thanks for your elaborate answer. That makes sense. However, where do you get the 31 jobs from?

Christoph · March 31, 2020, 2:49pm

Here is my calculation: (10000 cores in total)/((32 cores per node)x(10 nodes per job))=31.25
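The same calculation can be written as a small helper. This is a sketch that assumes one step occupies one full node, so a job uses steps_per_job × cpus_per_step cores; the function names are hypothetical and not part of VirtualFlow. It rounds down so the total stays within the core allocation.

```python
def cores_per_job(steps_per_job: int, cpus_per_step: int) -> int:
    """Cores used by one job, assuming one step runs on one full node."""
    return steps_per_job * cpus_per_step


def jobs_for_allocation(total_cores: int, steps_per_job: int,
                        cpus_per_step: int) -> int:
    """Largest number of jobs that fits within the core allocation."""
    return total_cores // cores_per_job(steps_per_job, cpus_per_step)


# 32 cores per node, 10 nodes per job, 10,000 cores in total
jobs = jobs_for_allocation(10_000, steps_per_job=10, cpus_per_step=32)
print(jobs)                            # 31
print(jobs * cores_per_job(10, 32))    # 9920 cores in use
```

Rounding down leaves 80 of the 10,000 cores idle, whereas a 32nd job would need 10,240 cores.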

kaneki · July 6, 2020, 10:32am

Number of nodes   Cores per node
177               32
1080              24
540               24
32                32
64                16
18                64

steps_per_job: 10
cpus_per_step: 24
queues_per_step: 24
cpus_per_queue: 1

(10000 cores in total)/((24 cores per node)x(10 nodes per job)) = 41.67, so 41 jobs to be submitted

./vf_start_jobline.sh 1 41 templates/template1.slurm.sh submit 1

I assume this would be good then?
I want to screen 150 million compounds. Do I have to change the following settings in the ctrl file?

central_todo_list_splitting_size=10000
ligands_todo_per_queue=1000
ligands_per_refilling_step=100

gal · January 4, 2021, 8:56pm

Hi Kaneki,

I have a similar problem on a GCP cluster. Can you tell us how it went in general?

Also, do you have an answer to your last question?
"Do I have to change the ctrl file in terms of:

central_todo_list_splitting_size=10000
ligands_todo_per_queue=1000
ligands_per_refilling_step=100"

gal · January 12, 2021, 11:46pm

Hi,
I am confused about these CPU settings after trying two different settings based on this thread. I will share screenshots from my runs along with the points that are unclear to me, and I would be glad to get help clarifying them.

I use GCP + Slurm with a maximum quota of 400 CPUs, and each compute node is set to 8 CPUs, as shown here:
[Screenshot: Screen Shot 2021-01-12 at 2.56.21 PM]

Then I set the all.ctrl file as suggested above to 1,8,8,1 (steps_per_job, cpus_per_step, queues_per_step, cpus_per_queue), as shown below, to match the number of CPUs per node.
[Screenshot: Screen Shot 2021-01-12 at 1.53.06 PM]

Question 1: The CPU numbers shown as running differ between the two cases, and I do not understand why.

In the current run with the 1,8,8,1 setting, I seem to have normal CPU numbering, like this:
[Screenshot: Screen Shot 2021-01-12 at 1.53.21 PM]

But if I run with the CPU settings at 1,1,1,1, then I see the same CPU number listed 8 times as Running. This confused me, because it gives the impression that the same job is running 8 times, as seen below:
[Screenshot: Screen Shot 2021-01-12 at 2.18.15 PM]

To make the 1,1,1,1 setting clear, I paste it below too:
[Screenshot: Screen Shot 2021-01-12 at 2.17.30 PM]

In my experience from the short trials, the 1,1,1,1 setting feels faster, but because of this confusion I wanted to clarify it before using it.
Thanks
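
For comparison, the two settings can be put side by side with the arithmetic used earlier in this thread. This is only a sketch: it assumes a job uses steps_per_job × cpus_per_step cores and runs steps_per_job × queues_per_step queues, and the class and function names are hypothetical, not part of VirtualFlow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtrlSettings:
    steps_per_job: int
    cpus_per_step: int
    queues_per_step: int
    cpus_per_queue: int

    def cores_per_job(self) -> int:
        # Assumption: each step occupies cpus_per_step cores.
        return self.steps_per_job * self.cpus_per_step

    def queues_per_job(self) -> int:
        return self.steps_per_job * self.queues_per_step


def summarize(settings: CtrlSettings, cpu_quota: int) -> dict:
    """Jobs and parallel queues that fit within the CPU quota."""
    jobs = cpu_quota // settings.cores_per_job()
    return {
        "cores_per_job": settings.cores_per_job(),
        "jobs_at_quota": jobs,
        "parallel_queues": jobs * settings.queues_per_job(),
    }


CPU_QUOTA = 400  # quota mentioned in the post above
for label, s in [("1,8,8,1", CtrlSettings(1, 8, 8, 1)),
                 ("1,1,1,1", CtrlSettings(1, 1, 1, 1))]:
    print(label, summarize(s, CPU_QUOTA))
```

Under these assumptions both settings reach the same 400 parallel queues at the quota; they differ only in the number of Slurm jobs needed (50 versus 400).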