Reduce I/O to further speed up the calculation?

Hi Christoph,

I am very interested in VFVS and am planning to run it on a 10,000-core HPC system. I did a lot of testing before setting up the final large-scale screening, and I have some concerns about the following issues.

  1. I found that after submitting the vf_start_jobline.sh command, and before the actual calculation starts, the whole input-files directory is copied to /tmp/ on every node. If one screens a huge library (say the 1.4 billion-compound REAL library) on a large number of nodes (e.g. 32 cores/node, 312 nodes in total), this would put a pretty high load on the intranet. It also requires a large amount of storage space in /tmp. Since /tmp (or any other VF_TMPDIR) should ideally be on high-speed storage such as an SSD, this may further increase the expense.

Would it be easy to modify the VirtualFlow scripts so that only a subset of the library is copied to each node? Ideally, only those compound collections that will be processed on a certain node would be copied to that node.

  2. It seems that /dev/shm/ is created but not really used: all subdirectories are empty. Since the memory disk is super fast, and nowadays most HPC systems are equipped with large amounts of memory, I wonder if we could make better use of /dev/shm. The AutoDock-family programs themselves do not seem to consume a lot of memory, so making better use of the large memory disk may be reasonable.

The following questions are not about the development of VF, but about the use of VF. If you feel it is necessary, I can re-post them in another section of this forum –

  1. As a new user, I am not yet familiar with the different AutoDock-family programs such as vina, qvina02, smina… and there may be even more. Unfortunately, a quick search did not turn up a website or a paper that carefully introduces all of them – with more searching I might find something useful, but could you please simply recommend some literature to me? Thanks! As you may understand, before beginning a huge screening, I have to consider the whole strategy. One of the most important questions is which program to use, weighing docking accuracy against computing speed, etc. Suggestions from a highly experienced expert like you would therefore be invaluable.

  2. Perhaps a reasonable strategy is to run VFVS in at least two stages? In the first run I would use qvina02 to take advantage of its high speed (despite its relatively lower accuracy), and then pick the top 10 million hits and run VFVS a second time using something like smina. What is your suggestion? Thanks!

  3. And how about flexible docking? A similar concern relates to the “exhaustiveness” parameter in the AutoDock configuration file. These options may further slow down the calculation and dramatically increase the expense. Do you have any suggestions?

Thank you very much! Finally, I would like to say that VirtualFlow is really a very good idea and a very good package for people in the drug discovery field! Thank you very much for this wonderful contribution!

Best,
Daniel


Dear Daniel,

Thanks for your post, and welcome to the forum :slight_smile:

Regarding question 1, you are absolutely right that the entire library should not be copied over to each local node. For this, the library needs to be in a folder other than the input-files folder. You can store the library anywhere you want on the shared cluster file system (which should be fast, with high I/O), and you can specify the path to that folder via the variable collection_folder in the all.ctrl file. Thanks for pointing this out – we should state this more explicitly in the documentation, and also adjust the tutorials to reflect it.
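For example, the relevant line in all.ctrl might look like this (a sketch only – the path is a hypothetical placeholder for wherever you keep the library on your cluster's shared file system):

```shell
# all.ctrl (excerpt) – hypothetical path, adjust to your cluster
# Store the ligand library outside of input-files, on the shared file system,
# so that only the needed collections are fetched rather than the whole library:
collection_folder=/shared/projects/screening/REAL_library
```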

Regarding question 2, yes, /dev/shm is super fast and has the highest performance, and ideally you do as much as possible on /dev/shm with VirtualFlow. If tempdir_fast=/dev/shm, then /dev/shm will be used during runtime. The contents are deleted again after the job is over, which may be why you are seeing empty folders. If you have enough memory, you can also set tempdir_default to /dev/shm, in which case VFVS will use /dev/shm also for the local I/O that does not need a very fast I/O system. However, because it is not essential to use /dev/shm for all local I/O of VFVS, it is fine to use a local disk (usually /tmp), so the user has a choice.
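Putting the two tempdir options together, the relevant all.ctrl lines could look like this (a sketch under the assumption that your nodes have enough RAM to hold the working data in /dev/shm):

```shell
# all.ctrl (excerpt) – tempdir settings, adjust to your hardware
tempdir_fast=/dev/shm   # high-speed scratch space on the RAM disk
tempdir_default=/tmp    # ordinary local I/O; set this to /dev/shm too if memory allows
```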

Regarding your other questions, yes, it would be great if you could move them to a separate thread in the appropriate section of the forum.

Best,
Christoph