VFVS for randomly selected ligands?

gal · December 3, 2020, 7:37pm

Hi. It looks like running the full REAL library is a bit expensive.
Is there a way to submit VFVS for a “randomly” selected 100 million ligands from the ready REAL library?
In particular, is there an option that can be set before submitting the current bash scripts from the tutorials, or can it be done by simply changing the config files?
Thanks.

Christoph · February 11, 2021, 12:24am

Hi Gal,

a similar question was asked here, and there are some answers:

Also, the documentation describes how you can select subsets of the entire library. Basically, the collections of the library that are screened are defined in the todo.all file.
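
Since whole collections are the unit of selection, one way to get a roughly random subset is to keep a random sample of the lines in todo.all. The following is a minimal sketch, assuming the file lists one collection per line; the function name `sample_collections` and the example lines are hypothetical and not part of VFVS.

```python
import random


def sample_collections(todo_lines, n_collections, seed=0):
    """Return a random subset of todo.all lines, keeping their original order.

    Assumption: each non-empty line describes exactly one collection.
    """
    entries = [line for line in todo_lines if line.strip()]
    if n_collections >= len(entries):
        return entries
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = sorted(rng.sample(range(len(entries)), n_collections))
    return [entries[i] for i in chosen]


# Hypothetical example lines; the real todo.all may look different.
example = [
    "collection_A_00000 1000",
    "collection_A_00001 1000",
    "collection_B_00000 800",
    "collection_C_00000 1200",
]
subset = sample_collections(example, 2, seed=42)
```

The sampled lines could then be written to a new todo.all. Note that this samples collections, not individual ligands, so the number of ligands selected depends on the collection sizes.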

Best,
Christoph

gal · February 11, 2021, 12:52am

Hi Christoph,
thanks.
I saw that thread, but what I mean in particular is a clustered version of the library based on similarity.
For example, a two-stage screening would be cheaper if we had a smaller library (10 million) that represents the whole REAL library.
We could screen it in the first stage, and then
in the second stage select a larger subset based on the top 1K hits of the first screening.
Such a strategy, if possible, might reduce cost and increase speed.

Christoph · February 11, 2021, 1:10am

Hi Altay,

Yes, multi-stage screenings with fewer compounds in the later stages can substantially decrease the computational costs. Basically, this is also what we did in our Nature paper with KEAP1.

The input ligand libraries for the second-stage screenings, which contain the top X compounds from the first stage, need to be prepared manually at this point, because screening an arbitrary selection of ligands is not possible at the moment; only entire collections of ligands can be screened. We have this on our to-do list for future versions and plan to provide scripts that automate it (if you want to work on this feature, please let us know).

If you prepare the collections for your stage 2 screening, you might want to use smaller collections (e.g. 50 compounds per collection) if you plan to use higher accuracy in the dockings, since the dockings then take longer than in the stage 1 screening.
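
As an illustration of the grouping step only, the top hits could be split into small collections like this. It is a minimal sketch: the function name and ligand IDs are hypothetical, and packing the groups into the VFVS input format is not covered.

```python
def split_into_collections(ligand_ids, collection_size=50):
    """Split a ranked list of ligand IDs into consecutive groups of at most collection_size."""
    if collection_size < 1:
        raise ValueError("collection_size must be at least 1")
    return [
        ligand_ids[start:start + collection_size]
        for start in range(0, len(ligand_ids), collection_size)
    ]


# Hypothetical ligand IDs standing in for the top hits of stage 1.
top_hits = [f"ligand_{i:04d}" for i in range(120)]
collections = split_into_collections(top_hits, collection_size=50)
```

With 120 hits and a collection size of 50, this gives two full collections and one with the remaining 20 ligands.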

Best,
Christoph

gal · February 11, 2021, 1:24am

I actually meant something a bit different. Let me clarify.
I mean clustering the 1.2 billion ligands into, let's say, 100 clusters based on their similarity to each other.
We can then sample around 100K ligands from each cluster, proportionally, as representatives of the full library and obtain the 10 million representative library for the first stage.

In the second stage, based on the proportion of each of the 100 clusters in the top 1K hits, we can sample from the clusters that provide more hits and skip the clusters that do not provide any hits in the top 1K of the first stage.

So, in the second stage we can screen only 100 million ligands, but since they are from the relevant clusters, we should expect similar performance to the full 1.2 billion screening.
But I am not sure whether the REAL library can be clustered this way, as I have not done a similar study before, so I am just brainstorming about the problem. I can code in Python or R (not shell), and if someone with experience in clustering ligands can supervise me, I could contribute such a utility with that guidance.
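
To make the second-stage idea concrete, here is a rough sketch of how the 100 million budget could be split across clusters according to their share of the top hits. Everything here is an assumption for illustration: the function name, the cluster labels, and the cluster sizes are made up, and how the clusters are obtained in the first place is left open.

```python
from collections import Counter


def allocate_stage2_budget(top_hit_clusters, total_budget, cluster_sizes):
    """Split total_budget across clusters in proportion to their share of the top hits.

    Clusters without any top hit receive nothing, and no cluster is asked
    for more ligands than it contains.
    """
    counts = Counter(top_hit_clusters)
    n_hits = sum(counts.values())
    budget = {}
    for cluster, hits in counts.items():
        share = total_budget * hits // n_hits  # integer share, rounded down
        budget[cluster] = min(share, cluster_sizes[cluster])
    return budget


# Hypothetical data: cluster label of each of the top 1K hits from stage 1.
top_hit_clusters = [3] * 600 + [17] * 300 + [42] * 100
# Hypothetical cluster sizes in the full library.
cluster_sizes = {3: 12_000_000, 17: 50_000_000, 42: 12_000_000, 5: 12_000_000}
budget = allocate_stage2_budget(top_hit_clusters, 100_000_000, cluster_sizes)
```

In this example cluster 3 would deserve 60 million ligands by its share of hits but only contains 12 million, so it is capped. The sketch does not redistribute the unused budget to the other clusters, which a real utility would probably want to do.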

Christoph · February 11, 2021, 10:58pm

Thanks for the additional details and clarification.

This is a potentially interesting feature. Currently this is not yet possible in an automated way, if I understand you correctly. But you could do it manually as follows:

You could use VFLP to prepare the Diversity Set of the REAL library, which is provided by Enamine:
https://enamine.net/hit-finding/diversity-libraries/dds-50240

Then you screen it with VFVS, take the top hits (as many as you want), and search for analogs of these compounds (to create the clusters). From these you create an analog library that contains all your clusters, which you could then screen with VFVS.

We will put this on our wish list and think about how best to do it. We might come back to your offer to help; we really appreciate it.

gal · February 11, 2021, 11:04pm

Thanks. How do you currently generate the analogs of the hits? I could not find specific information about it in your case study in the Nature paper.

Christoph · February 13, 2021, 12:38am

Yes, you will not find it in the Nature paper, because there we did not search for analogs of the best stage 1 hits. We rescreened the top ~3M compounds of the first stage.

Regarding the creation of analog clusters of hits, you have complete freedom. You can use any external tool of your preference to create the clusters of analogs. The only requirement is that the new ligand library that contains your analog clusters is in the typical VFVS input ligand database format.
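
As one possible illustration of an analog search, the sketch below selects library compounds by Tanimoto similarity to a hit. It is not how VirtualFlow does it. It assumes fingerprints are already available as Python sets of on-bit indices (in practice a cheminformatics toolkit would compute them), and the ligand names and the 0.7 threshold are arbitrary.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def find_analogs(hit_fp, library_fps, threshold=0.7):
    """Return IDs of library ligands whose similarity to the hit is at least threshold."""
    return [
        ligand_id
        for ligand_id, fp in library_fps.items()
        if tanimoto(hit_fp, fp) >= threshold
    ]


# Hypothetical toy fingerprints.
hit_fp = {1, 2, 3, 4}
library_fps = {
    "lig_a": {1, 2, 3, 4, 5},  # similarity 4/5 = 0.8
    "lig_b": {1, 2, 9, 10},    # similarity 2/6, about 0.33
    "lig_c": {1, 2, 3},        # similarity 3/4 = 0.75
}
analogs = find_analogs(hit_fp, library_fps, threshold=0.7)
```

Here `lig_a` and `lig_c` pass the threshold and would go into the analog cluster of this hit, which then still has to be converted into the VFVS input ligand database format.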