Scaling question

I managed to get set up on GCP with SLURM as a job manager. I am trying to see how well the scaling works.
I have been using your Tutorial 1 example as a test case: roughly 1,200 ligands there against GK (I think).
I ran a 15-CPU job and it finished in 3 hours 45 minutes.
I then tried a 95-CPU job, which finished in ~1 hour.
So ~6x the CPUs only brought a <4x speedup.

all.ctrl has the following settings:
steps_per_job=1
cpus_per_step=95
queues_per_step=95
cpus_per_queue=1

I changed some of the ligand-filling parameters in all.ctrl to account for what looked like it would be a depletion of ligands for the parallel jobs:

central_todo_list_splitting_size=90 (was 10000 for the 15-CPU job)
ligands_todo_per_queue=15 (was 1000 for the 15-CPU job)
ligands_per_refilling_step=5 (was 10 for the 15-CPU job)

So my question: are these settings correct? Should I have left the second set of values at the higher numbers?

Things seem sub-optimal here. Another clue: %CPU utilization was only around 30% while the job was running.

Thanks in advance for comments/help.

Hi Byron,

To test the scaling behavior, one needs to run larger-scale workflows. On small scales this does not work well with the REAL library which we provide. The reason is that ligands are grouped into collections, one collection is processed by one CPU/queue, and the job runs until all CPUs/queues have completed. Thus no matter how many CPUs you add, one ligand collection (containing around 1000 ligands on average) will still need the same time to process, and this sets a lower bound on the total runtime.

This is different with larger workflows. The more CPUs you have, the more collections can be processed in parallel, so the rate at which ligands are processed increases linearly. But because not every CPU/queue finishes at the same time at the end of the workflow, doubling the CPUs might not quite halve the total runtime.
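As a rough illustration of this collection-level granularity (a toy model of my own, not anything VirtualFlow computes internally), assume every collection takes about the same time on one CPU. Then the wall-clock time is set by the CPU that has to work through the most collections, and adding CPUs beyond the number of collections stops helping:

```python
import math

def estimated_wallclock(n_collections, n_cpus, hours_per_collection=1.0):
    # Toy model: each CPU processes whole collections sequentially, so the
    # wall-clock time is set by the CPU that receives the most collections.
    return math.ceil(n_collections / n_cpus) * hours_per_collection

# With 200 equal-sized collections, adding CPUs only helps up to ~200 CPUs:
for cpus in (15, 95, 200, 400):
    print(f"{cpus:>3} CPUs -> ~{estimated_wallclock(200, cpus):.0f} h")
```

The point is only that below roughly as many CPUs as there are collections the speedup is close to linear, and above that you gain nothing.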


Thanks Christoph. This is what I suspected was happening here.

So altering central_todo_list_splitting_size and the related parameters as I described cannot change this behavior? I was hoping I could break the collections into smaller pieces this way and feed them to the overall process to make more efficient use of all of the CPUs. I can’t find full documentation on these parameters, so it was a bit of a guess based on the descriptions I could find.

I was hoping I could do shorter runs while working out the ideal (and affordable!) parameters/setup to apply to the larger run. It sounds like it would be better to generate a larger library and do partial runs (e.g. time to complete 10% of the library) to evaluate? So if I had a 200K library, that would be about 200 collections, and I should see a linear improvement until I start exceeding 200 CPUs, correct?

Hi Byron,

Yes, changing the setting central_todo_list_splitting_size and the related parameters will not help. This setting concerns an internal mechanism which splits the global todo list into smaller pieces, so that each piece can be searched/processed faster. The ligands of the REAL library which we provide on the homepage are prepackaged into collections, and each collection has a fixed size; this cannot be changed on the fly with a setting at the moment.

Yes, if you have a library of around 200K ligands in 200 collections, you should see roughly linear “completion time” behavior, but only if all collections have the same number of ligands. With the REAL library we provide, some collections have 800 ligands, some have 1200, and so on. Thus the CPU (or queue) with 1200 ligands would need around 50% longer than the one with 800 ligands (this is just an estimate). So on such small scales you don’t see a linear completion-time behavior. If you run large-scale computations, this effect is minimized: for example, if one queue (or CPU) has 20000 ligands to process and another has 20500, then there is only a 2.5% difference in the number of ligands.
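To see how this imbalance shrinks with scale, here is a small simulation (purely illustrative; the 800-1200 range just mirrors the example above). Each queue takes whole collections until it reaches its ligand target, and the relative spread between the most and least loaded queues is reported:

```python
import random

def queue_load_spread(ligands_per_queue, n_queues=100, seed=0):
    # Each queue takes whole collections (800-1200 ligands, drawn at random)
    # until it reaches its target; return the relative spread of the final loads.
    rng = random.Random(seed)
    loads = []
    for _ in range(n_queues):
        total = 0
        while total < ligands_per_queue:
            total += rng.randint(800, 1200)
        loads.append(total)
    return (max(loads) - min(loads)) / min(loads)

for target in (1000, 20000):
    print(f"target {target:>5} ligands/queue -> spread ~{queue_load_spread(target):.0%}")
```

With ~1000 ligands per queue the spread can approach 50%, while with ~20000 ligands per queue it drops to a few percent.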

If you run jobs where each job uses a single CPU, then you would not lose any allocated/obtained computing time, since each job ends automatically once all ligands of its queue are processed.

What scales linearly is the “throughput”. If you use 1000 CPUs with VirtualFlow and all of them are processing ligands, then on average a certain number of ligands will be processed per minute (or hour). If 2000 processors are used, then the throughput will on average double. Thus if you give VFVS, let’s say, around 200 million molecules, then the time until you complete 100 million compounds when using 1000 CPUs will be double the time when using 2000 CPUs. It is only at the very end of the workflow that the queues/CPUs do not finish at exactly the same time. To minimize this further, you can also do the following: let’s say you want to screen 100 million compounds. You can give VFVS 120 million instead, and stop the workflow after you have reached 100 million. That’s usually what I do when I want to reach a “specific number” of ligands screened.
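A quick back-of-the-envelope version of this (the per-CPU rate below is a made-up placeholder, not a VirtualFlow benchmark; you would measure your own):

```python
def hours_to_reach(target_ligands, n_cpus, ligands_per_cpu_hour=20):
    # Throughput scales with the number of busy CPUs, so the time to reach a
    # given number of screened ligands is inversely proportional to the CPU count.
    return target_ligands / (n_cpus * ligands_per_cpu_hour)

for cpus in (1000, 2000):
    print(f"{cpus} CPUs -> ~{hours_to_reach(100_000_000, cpus):,.0f} h to 100 M ligands")
```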

I hope this clarifies the issue.

Many thanks for the answer - things are getting clearer here. Sounds like ensuring that all collections contain roughly equivalent numbers of ligands is important.

Could you confirm:

From the way you have set up your library, are collections broken into sets of ~1K compounds ideal?

Thanks -

Byron

Hi Byron,

I’m glad things are getting clearer :)

Yes, around 1000 compounds per collection works quite well in practice for many cases, for example when you do a fast primary virtual screening (with a rigid receptor), in my experience. The REAL library which we prepared contains on average around 1000 compounds per collection. If you set the parameter ligands_todo_per_queue to, let’s say, 10000, then VirtualFlow tries to assign to that queue (i.e. one worker thread running on one CPU) enough collections to reach roughly 10000 ligands. Since the collection size is on average 1000, one queue might end up with 10500 ligands and another with 10100 (since the collection sizes vary from collection to collection).
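A simplified picture of that assignment (my own sketch, not the actual VFVS code): a queue keeps taking whole collections from the todo list until it has at least ligands_todo_per_queue ligands, so it overshoots the target by at most about one collection:

```python
def assign_collections(collection_sizes, ligands_todo_per_queue):
    # Take whole collections until the ligand target is reached; the final
    # count exceeds the target by at most roughly one collection size.
    assigned, total = [], 0
    for size in collection_sizes:
        assigned.append(size)
        total += size
        if total >= ligands_todo_per_queue:
            break
    return assigned, total

todo = [980, 1150, 1020, 870, 1210, 990, 1080, 930, 1160, 1040, 900]
collections, total = assign_collections(todo, 10000)
print(len(collections), "collections,", total, "ligands")  # 10 collections, 10430 ligands
```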

The reason why the collection sizes vary is that before we prepared the REAL library with VFLP, we split the library into pieces of exactly 1000 ligands per collection (in SMILES format). During the preparation, tautomers and protonation states are enumerated, so multiple different molecules/protomers can arise from one molecule. Also, collections belong to tranches, and some tranches might have fewer than 1000 ligands, in which case there is only one collection with fewer than 1000 compounds.

In practice, this is not a problem. Not every queue gets exactly the same number of ligands, but in large-scale screenings the “throughput” (ligands processed per minute) still scales linearly until almost the end of the screening, when some queues start to finish earlier than others because no ligand collections are left in the central todo list. That some queues finish earlier at the very end is usually not a problem at all.

If you do screenings where the receptor is flexible (e.g. with a lot of flexible residues), which one might do in a second-stage screening (for example, rescoring the best 1% of hits), then the processing time per ligand is much longer. In this case, when preparing the input library for the second stage, one might want to use a much smaller collection size (e.g. 50).

In my experience, a collection size is ideal when processing one collection takes a few hours. If one queue has more ligands to do than another, the difference is at most roughly the size of one collection. So if one collection takes a few hours, the difference in completion time among queues should also only be a few hours.
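As a rough rule of thumb for sizing collections (my own back-of-the-envelope; the per-ligand times below are purely illustrative and depend entirely on your docking setup):

```python
def collection_size_for(target_hours_per_collection, seconds_per_ligand):
    # Choose a collection size so that one collection takes roughly the target
    # number of hours on a single CPU; both inputs are estimates you measure
    # for your own receptor/docking settings.
    return round(target_hours_per_collection * 3600 / seconds_per_ligand)

print(collection_size_for(4, 15))   # fast rigid-receptor docking -> ~960 ligands
print(collection_size_for(4, 300))  # many flexible residues      -> ~48 ligands
```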

Hope this helps,
Christoph