Hi Byron,
I’m glad things are getting clearer.
Yes, in my experience around 1000 compounds per collection works quite well in practice for many cases, for example a fast primary virtual screening with a rigid receptor. The REAL library that we prepared contains on average around 1000 compounds per collection. If you set the parameter ligands_todo_per_queue to, let’s say, 10000, then VirtualFlow tries to assign to that queue (i.e. one worker thread running on one CPU) as many collections as are needed to reach roughly 10000 ligands. Since the collection size is around 1000 on average, one queue might end up with 10500 ligands and another with 10100 (since the size varies from collection to collection).
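To make the assignment described above concrete, here is a minimal Python sketch. It is not VirtualFlow’s actual code: the function name assign_collections, the tuple layout of the todo list, and the collection names and sizes are all made up for illustration. It only shows why a queue that takes whole collections ends up somewhat above the ligands_todo_per_queue target.

```python
def assign_collections(todo, ligands_todo_per_queue):
    """Take whole collections from the front of the shared todo list
    until the queue has at least the requested number of ligands.

    todo: list of (collection_name, ligand_count) tuples (hypothetical layout).
    Returns (assigned_collection_names, total_ligand_count).
    """
    assigned = []
    total = 0
    while todo and total < ligands_todo_per_queue:
        name, count = todo.pop(0)  # collections are never split
        assigned.append(name)
        total += count
    return assigned, total


# Hypothetical collection sizes scattered around 1000 ligands.
sizes = [1050, 980, 1100, 1020, 990, 1060, 1010, 1040, 970, 1030, 1250, 940]
todo = [(f"collection_{i}", n) for i, n in enumerate(sizes)]

assigned, total = assign_collections(todo, ligands_todo_per_queue=10000)
print(len(assigned), total)  # 10 collections, 10250 ligands
print(len(todo))             # 2 collections remain for other queues
```

The overshoot is always less than the size of the last collection taken, which is why queues end up with slightly different totals.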
The reason the collection sizes vary is that before we prepared the REAL library with VFLP, we split it into pieces of exactly 1000 ligands per collection (in SMILES format). During the preparation, tautomers and protonation states are enumerated, so one input molecule can give rise to several different molecules/protomers. Also, collections belong to tranches, and some tranches might contain fewer than 1000 ligands, in which case the tranche consists of a single collection with fewer than 1000 compounds.
In practice, this is not a problem. Not every queue gets exactly the same number of ligands, but in large-scale screenings the “throughput” (ligands processed per minute) still scales linearly until almost the end of the screening, i.e. until some queues start to finish earlier than others and no ligand collections are left in the central todo list. At the end of a screening, some queues finish earlier than others, which is usually not a problem at all.
If you do screenings where the receptor is flexible (e.g. a lot of flexible residues), which one might do in a second-stage screening (rescoring of the best 1% of hits, for example), then the processing time per ligand is much longer. So in this case, when one prepares the input library for the second stage, one might want to use a much smaller collection size (e.g. 50).
In my experience, a collection size is ideal when processing one collection takes a few hours. If one queue has, let’s say, more ligands to do than another queue, the difference should be at most the size of one collection. So if one collection takes a few hours, the completion times of different queues should differ by only a few hours.
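As a rough back-of-the-envelope helper for this rule of thumb, the sketch below turns a measured docking time per ligand into a collection size. The function name, the 3-hour target, and the per-ligand timings are assumptions chosen for illustration, not measured values; you would plug in timings from a small test run of your own.

```python
def suggested_collection_size(seconds_per_ligand, target_hours=3.0):
    """Number of ligands that one queue processes in about target_hours.

    seconds_per_ligand: average processing time per ligand from a test run.
    target_hours: desired processing time per collection ("a few hours").
    """
    return max(1, round(target_hours * 3600 / seconds_per_ligand))


# Hypothetical timings: a fast rigid-receptor run vs. a slow flexible-receptor run.
rigid = suggested_collection_size(10.8)    # about 11 s per ligand
flexible = suggested_collection_size(216)  # about 3.6 min per ligand
print(rigid, flexible)  # 1000 50
```

With these assumed timings, the result matches the sizes mentioned above: around 1000 for a fast primary screening and around 50 for a flexible second stage.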
Hope this helps,
Christoph