Multi-stage screen: Ligand selection & preparation

Morten · December 8, 2022, 6:27pm

Hi everyone,

I’m almost done with a 62 million ligand screen and I want to run a second round on my top hits using flexible residues in my receptor.
How do I prepare a ligand library for multi-stage screening? You state that you did this in the Nature paper first describing VirtualFlow.

However, I find very little information about this.

Christoph stated the following in the forum:

To prepare the input ligand libraries for the second stage screenings, which contain the top X compounds from the first stage, these need to be prepared manually at this point, because a random selection of the ligands to be screened is not possible at the moment. At the moment, only entire collections of ligands can be screened. We have this on our todo-list for future versions and provide scripts which automate this (if you want to work on this feature, please let us know).

In the Nature paper you did a second screen of 3 million ligands. Clearly you didn’t extract these one by one.

If this needs to be done purely manually I can extract maybe a few thousand, but it would not be great fun. Do you have a recommended protocol?

Christoph · December 10, 2022, 9:29pm

Hi @Morten ,

I’m glad to hear you were able to almost complete the 62 million screen.

Regarding the preparation of a library for the second stage, what I meant by “manually” is doing it with some custom bash scripts (or similar).

We also provide a basic script in the VF Tools package: vfvs_prepare_newcollections.sh
that you can find here: VFTools/bin at master · VirtualFlow/VFTools · GitHub

You can find some instructions in the file itself, and you can also look at the source code if needed.

I hope this helps,
Christoph

Morten · December 14, 2022, 9:12pm

Hi @Christoph

That’s helpful.
I expect to be ready for the second stage in a matter of days, and I’m now sorting out the protocol:

How to run the vfvs_prepare_newcollections.sh:

vfvs_prepare_newcollections.sh <ligand file> <pdbqt_input_folder> <pdbqt_folder_format> <ligands_per_collection> <output folder>

Do you have any general guidelines for <ligands_per_collection>?

What about pdbqt_folder_format?
The possible values are tar_tar, meta, sub_tar, and hash_metatranche.
I currently have the library in the following file structure:
.../ligand-library/XX/XXXXXX.tar
Is that tar_tar or sub_tar?

How to make the selection for second stage screening:
I have looked at the tutorial for how to Complete Ligand Ranking.

I have done two docking scenarios, and I have thus two firstposes.all.minindex.sorted.clean files. I now want to make the selection, merge the ligand files, and delete duplicates. The resulting ligand file can then be used as input for vfvs_prepare_newcollections.sh.
I think I have a good way of doing this:

Let’s say that I want to extract ligands with estimated affinities of -9 kcal/mol or better (i.e. a score ≤ -9), and then write an output file containing the collection name, ligand name, and estimated affinity.

awk -F" " '$4 <= -9 {print $1, $2, $4}' firstposes.all.minindex.sorted.clean > ranked_ligands_1

I then simply use cat to create ligand_merge including both docking scenarios:

cat ranked_ligands_1 ranked_ligands_2 > ligand_merge

Lastly I remove duplicated ligands with:
awk '!a[$2]++' ligand_merge > ligand_merge_rd

Does this seem sensible?
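
For reference, the filter, merge, and de-duplication steps above can be sketched as a single pipeline. One caveat: awk '!a[$2]++' keeps the first row it sees for each ligand, so after a plain cat the row from the first scenario wins even when the second scenario scored better. Sorting by score before de-duplicating keeps the best row instead. The two input files and their four-column layout (collection, ligand, an unused column, score) are made-up demo data; the real column positions in firstposes.all.minindex.sorted.clean should be checked before reusing this.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical demo input: collection, ligand, placeholder column, score.
workdir="$(mktemp -d)"
cat > "$workdir/scenario_1.clean" <<'EOF'
JBFCEG_00001 PV-0001_1_T1 x -9.4
JBFCEG_00001 PV-0002_1_T1 x -8.1
JBECEF_00000 PV-0003_1_T1 x -10.2
EOF
cat > "$workdir/scenario_2.clean" <<'EOF'
JBFCEG_00001 PV-0001_1_T1 x -9.9
JBEDDF_00000 PV-0004_1_T1 x -9.0
EOF

cutoff=-9

# 1) keep rows with a score at or below the cutoff (collection, ligand, score)
# 2) sort by score, most negative first
# 3) keep only the first (best-scoring) row per ligand name
awk -v cutoff="$cutoff" '$4 <= cutoff {print $1, $2, $4}' \
    "$workdir/scenario_1.clean" "$workdir/scenario_2.clean" \
  | LC_ALL=C sort -k3,3n \
  | awk '!seen[$2]++' > "$workdir/ligand_merge_rd"

cat "$workdir/ligand_merge_rd"
```

The resulting file has the collection name in column 1 and the ligand name in column 2, which is the layout used for the ligand file elsewhere in this thread.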

Morten · January 2, 2023, 6:33am

I have been trying:

vfvs_prepare_newcollections.sh /home/rekggla/Scratch/VF_upload/merge_n9_dr /home/rekggla/Scratch/tmp/ligand-library/ tar_tar 1000 /home/rekggla/Scratch/VF_upload/lib_merge_n9_5_dr/

and

vfvs_prepare_newcollections.sh ../../../Scratch/VF_upload/merge_n9_5_dr ../../../Scratch/tmp/ligand-library/ tar_tar 1000 ../../../Scratch/VF_upload/lig_merge

It results in:

              Extracting the winning structrures                 
/home/rekggla/programs/VFTools/bin/vfvs_prepare_newcollections.sh: line 133: …//home/rekggla/Scratch/VF_upload/merge_n9_dr: No such file or directory

*** The preparation of the intermediate folders has been completed ***

*** Starting the preparation of the length.all file ***

If the file /home/rekggla/Scratch/VF_upload/lib_merge_n9_5_dr/.length.all exists already it will be cleared.
ls: cannot access /home/rekggla/Scratch/VF_upload/lib_merge_n9_5_dr/.tmp2: No such file or directory

*** The preparation of the length.all file has been completed ***

*** Starting the preparation of the tar archives ***
/home/rekggla/programs/VFTools/bin/vfvs_prepare_newcollections.sh: line 150: cd: /home/rekggla/Scratch/VF_upload/lib_merge_n9_5_dr/.tmp2: No such file or directory
Error was trapped
Error in bash script vfvs_prepare_newcollections.sh
Error on line 150
Exiting.

I can’t really make sense of this. It refers to lines 133 and 150, which seem to be related to my input library (the library I used for my first screen). It also refers to a .tmp2 folder that the script presumably should create but fails to.
What am I doing wrong here?

Morten · January 5, 2023, 5:47pm

I have kept on trying, and I have been testing on a computer running Linux.
It is hard to understand all this from what is available as documentation.

I have tried all the different pdbqt_folder_formats because I don’t know which one to use. I used the REAL library, and I prepared it from the VF tutorial. This library is then also used as input for vfvs_prepare_newcollections.sh.

It seems that the script has difficulties finding the paths. The script seems to look here:
ligand-library/ABCDEF.tar

While the actual library has sub-folders:
ligand-library/AB/ABCDEF.tar

Any ideas on how to do this?


./vfvs_prepare_newcollections.sh ../test_firstposes.all.minindex.sorted.clean ligand-library tar_tar 100 lib_merge/


*********************************************************************
                  Extracting the winning structrures                 
*********************************************************************


 *** Adding the ligand XX-XXXXXXXXXXXX_X_XX to the collection XXXXXX_XXXXX-0001 ***
tar: ../../ligand-library/ABCDEF.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
tar: ABCDEF/00000.pdbqt.gz.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

...

 *** The preparation of the intermediate folders has been completed ***

 *** Starting the preparation of the length.all file ***
 * If the file lib_merge/.length.all exists already it will be cleared.

 *** Adding the collection XXXXXX_00000-0001 to the length.all file ***
./vfvs_prepare_newcollections.sh: line 168: ../../lib_merge/.length.all: No such file or directory
Error was trapped
Error in bash script vfvs_prepare_newcollections.sh
Error on line 168
Exiting.

xingbb · January 7, 2023, 7:06pm

Hello, have you solved this problem? I am also facing the same confusion, thank you very much.

_Chris_Secker · January 8, 2023, 9:54pm

Hi @Morten,

Congrats on your 62 million ligand screen using VirtualFlow.

I can try to help you with this issue. First of all, the pdbqt_folder_format you have is called “meta” (e.g. AB/ABCDEF.tar). Please also make sure that you use the latest version of the VFTools script from github (VFTools/vfvs_prepare_newcollections.sh at master · VirtualFlow/VFTools · GitHub).

Please try running it this way and let me know if that works. If not, it would be great if you could give more info on your “ligand file” (however, the firstposes.all.minindex.sorted.clean file should be fine). Additionally, more info on your “pdbqt input folder” could be helpful.

All the best
Chris

Morten · January 9, 2023, 11:41am

Hi @_Chris_Secker,

Thanks for getting back to me.

I am also using the latest script, and I have now tested with meta. Still no luck.

Here’s an example:

./vfvs_prepare_newcollections.sh ../test_firstposes.all.minindex.sorted.clean ../ligand-library meta 100 lib_merge/

*********************************************************************
                  Extracting the winning structrures                 
*********************************************************************


 *** Adding the ligand PV-002015206817_3_T1 to the collection JBFCEG_00001-0001 ***


 * Extracting collection JBFCEG_00001
tar: ../../ligand-library//JBFCEG.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
./vfvs_prepare_newcollections.sh: line 141: cd: JBFCEG: No such file or directory
Error was trapped
Error in bash script vfvs_prepare_newcollections.sh
Error on line 141
Exiting.

As you can see I’m running the script locally and I have it in the same folder as my input files.
I have tried multiple permutations: full paths for the input files and folders, or just their folder names. I have to put ../ before my ligand input file for it to be read.

I think there’s a confusion with the folder hierarchy, and I can’t see how to fix that.
In the example above it tries to access the ligand in /ligand-library//JBFCEG.tar. Is the // wrong? Should it not be /JB/, like this: /ligand-library/JB/JBFCEG.tar?

PS:
Do you have any tips regarding <ligands_per_collection>? What are sensible guidelines for this number?

_Chris_Secker · January 9, 2023, 12:43pm

Hi @Morten,

Thanks for the info. I wonder why there is no metatranche info in the tar command. Can you give me an example line from your ligand file? And yes, exactly: it should be ligand-library/JB/JBFCEG.tar

Regarding ligands_per_collection, a general recommendation is 1000 to 10000 for what I’d call an average cluster. But it largely depends on the nodes and configuration of your Slurm cluster and on the docking programs and settings you are using. E.g. if you want one job to work on ~10 collections, you should make sure that the job does not exceed the time limit of the Slurm partition it is running on. How long a node needs to process one ligand in turn depends on how many CPUs the job uses on the node, which docking program you use, which settings you use, etc.
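
As a rough illustration of that reasoning, here is a back-of-the-envelope check of whether one job fits into a partition time limit. Every number below is a hypothetical placeholder (in particular seconds_per_ligand, which has to be measured for your own docking setup), and the estimate assumes the ligands are spread evenly over the CPUs of the job.

```shell
#!/usr/bin/env bash
set -eu

# All values are hypothetical placeholders; replace them with your own.
ligands_per_collection=1000
collections_per_job=10
seconds_per_ligand=30     # measured docking time per ligand on one CPU
cpus_per_job=16
time_limit_hours=12       # time limit of the Slurm partition

# Total CPU time divided over the CPUs of the job (integer arithmetic).
estimated_seconds=$(( ligands_per_collection * collections_per_job * seconds_per_ligand / cpus_per_job ))

# Round up to whole hours so the estimate errs on the safe side.
estimated_hours=$(( (estimated_seconds + 3599) / 3600 ))

if [ "$estimated_hours" -le "$time_limit_hours" ]; then
    fits=yes
else
    fits=no
fi

echo "estimated runtime: ${estimated_hours} h, fits in ${time_limit_hours} h limit: ${fits}"
```

If the estimate comes out above the time limit, lowering ligands_per_collection or the number of collections per job are the two obvious knobs.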

Best
Chris

Morten · January 9, 2023, 1:20pm

Hi @_Chris_Secker,

./vfvs_prepare_newcollections.sh ../test_firstposes.all.minindex.sorted.clean ligand-library meta 100 lib_merge/


*********************************************************************
                  Extracting the winning structrures                 
*********************************************************************


 *** Adding the ligand PV-002015206817_3_T1 to the collection JBFCEG_00001-0001 ***


 * Extracting collection JBFCEG_00001
tar: ../ligand-library//JBFCEG.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
./vfvs_prepare_newcollections.sh: line 141: cd: JBFCEG: No such file or directory
Error was trapped
Error in bash script vfvs_prepare_newcollections.sh
Error on line 141
Exiting.

I tried changing the ligand-library input to an absolute path, to …/ligand-library, ligand-library/ etc. In all cases the script fails to build the intended path.

The tar in the example above is indeed in the library:
ligand-library/JB/JBFCEG.tar

I don’t see how I can change the input library path in such a way that the script reads correctly. Is that possible or is this a bug in the script?

This is the relevant code. I kind of get what is happening, but I’m above all a wet-lab scientist and my bash scripting skills are rudimentary.

elif [ "${pdbqt_folder_format}" == "meta" ]; then
    if [ "${new_collection}" == "true" ]; then
        echo
        echo
        echo " * Extracting collection ${collection}"
        rm -r ${old_tranche} &>/dev/null || true
        tar -xf ../${pdbqt_input_folder}/${metatranche}/${tranche}.tar ${tranche}/${collection_no}.tar.gz || true
        cd ${tranche}
        tar -xzf ${collection_no}.tar.gz || true
        cd ..
    fi
    cp ${tranche}/${collection_no}/${ligand}.pdbqt ../${output_folder}.tmp2/${collection_new}/${ligand}.pdbqt || true

Morten · January 24, 2023, 11:16am

This is fixed now.

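If you’re interested: I swapped lines 86 and 87 of vfvs_prepare_newcollections.sh. In the original order, metatranche was derived from tranche before tranche had been set for the current ligand, which would explain the empty path component (the // in the tar errors above).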

Going from:

metatranche="${tranche:0:2}"
tranche="${collection/_*}"

To:


tranche="${collection/_*}"
metatranche="${tranche:0:2}"
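
The effect of the ordering can be reproduced in isolation. This is only a sketch: the two assignments are copied from the snippet above, the collection name is taken from the earlier error log, and the ligand-library prefix and the initial empty tranche are assumptions made for the demo.

```shell
#!/usr/bin/env bash

collection="JBFCEG_00001"

# Original order: tranche is still empty when metatranche is derived from it.
tranche=""
metatranche="${tranche:0:2}"        # first two characters of an empty string
tranche="${collection/_*}"          # strip everything from the first underscore
broken_path="ligand-library/${metatranche}/${tranche}.tar"

# Swapped order: tranche is set first, so metatranche gets its first two letters.
tranche="${collection/_*}"
metatranche="${tranche:0:2}"
fixed_path="ligand-library/${metatranche}/${tranche}.tar"

echo "$broken_path"
echo "$fixed_path"
```

The first path has the same empty // component as the tar error messages, while the second matches the AB/ABCDEF.tar layout of the library.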

Morten · March 4, 2023, 1:01pm

Follow up question:

The script vfvs_prepare_newcollections.sh generates the new library and writes a .length.all, which I then use as the new todo.all.

However, the submitted jobs tend to terminate, and this seems to be related to the (re)filling of the todo lists from todo.all.

From the logfile of a job that terminated:

...

* Preparing the to-do lists for jobline 1


Starting the (re)filling of the todolists of the queues.

Before (re)filling the todolists the queue 1-1-1 had 1001 ligands todo distributed in 446 collections.

 * Info: No more todo lists.
The next todo list will be used (todo.all.0000)
There is no more ligand collection in the todo.all file. Stopping the refilling procedure.

...

In all.ctrl I’ve tried changing the values of central_todo_list_splitting_size, ligands_todo_per_queue, ligands_per_refilling_step, and prepare_queue_todolists.
As far as I can see, those are the only parameters that should be different compared to my first screen. None of those seem to fix the job submission though.

Am I missing something? How can I fix this so I can properly submit jobs?

Sorin · March 29, 2023, 1:59pm

Hi Morten,

First, allow me to apologize for the tardy reply to your question. So by looking at what you posted, my first assumption would be that there might be something wrong with the file itself or that some of the parameters mentioned have improper values assigned to them.

Just a quick question: how many ligands are you docking in your second stage screen?

Also what is the value of prepare_queue_todolists? Perhaps this is set to true? This should be set to false in the production run.

Also, what are the values you assigned to the rest of the parameters you mentioned?

The second thing that might be problematic is that vfvs_prepare_newcollections.sh requires a ligand file with the collection name in the first column and the ligand name in the second column.

I’m not sure how you generated this file, but the approach I use is to simply run vfvs_pp_all.sh (part of VFTools) with both numerical arguments at 0.

This in turn will generate a first_poses file containing all docked ligands, from which you can keep any number by text manipulation.

A simple approach that I use (e.g. for the first 100 molecules/lines) is to run head -100 input.file > output.file

On that file you can then run vfvs_prepare_newcollections.sh.
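
Before running the script, the ligand file can be sanity-checked against the column requirement mentioned above. A minimal sketch, assuming a whitespace-separated ranking file that is already sorted best-first; the file contents and the top_n value are made up for the demo.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical ranking file: collection name, ligand name, score.
workdir="$(mktemp -d)"
cat > "$workdir/ranking" <<'EOF'
JBECEF_00000 PV-0003_1_T1 -10.2
JBFCEG_00001 PV-0001_1_T1 -9.9
JBEDDF_00000 PV-0004_1_T1 -9.0
JBECEG_00000 PV-0005_1_T1 -8.7
EOF

# Keep the first top_n lines, as with head -100 above.
top_n=3
head -n "$top_n" "$workdir/ranking" > "$workdir/ligand_file"

# Lines with fewer than two columns cannot carry collection and ligand names.
bad_lines=$(awk 'NF < 2 {bad++} END {print bad + 0}' "$workdir/ligand_file")

# Ligand names that occur more than once.
duplicate_ligands=$(awk '{count[$2]++} END {for (l in count) if (count[l] > 1) d++; print d + 0}' "$workdir/ligand_file")

echo "lines: $(wc -l < "$workdir/ligand_file"), malformed: $bad_lines, duplicated ligands: $duplicate_ligands"
```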

Hope this helps!

Kind regards,

Sorin

Morten · April 11, 2023, 5:31am

Hi Sorin,

I’m currently on holidays and don’t have access to the cluster. I’ll answer as best as I can, and if there are more questions I can address them at a later point.

My second screen is about 50k ligands. I prepared the library as described in detail above: first I used vfvs_pp_prepare_dockingposes.sh as demonstrated in the tutorial, and then vfvs_prepare_newcollections.sh. The resulting todo.all has a long list of entries, but each has only 1 or a small number of ligands. Is that still okay, or could that be problematic?

…
JBECEF_00000-0001 1
JBECEF_00001-0001 1
JBECEG_00000-0001 3
JBEDDF_00000-0001 1
…
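
A quick way to summarise such a list is to count the collections and add up the ligand counts in the second column. This sketch assumes the two-column layout shown above (collection name, number of ligands) and uses the four example lines as demo data.

```shell
#!/usr/bin/env bash
set -eu

# Demo todo.all built from the example lines above.
workdir="$(mktemp -d)"
cat > "$workdir/todo.all" <<'EOF'
JBECEF_00000-0001 1
JBECEF_00001-0001 1
JBECEG_00000-0001 3
JBEDDF_00000-0001 1
EOF

# Number of collections (lines) and total number of ligands (sum of column 2).
n_collections=$(awk 'END {print NR}' "$workdir/todo.all")
n_ligands=$(awk '{total += $2} END {print total + 0}' "$workdir/todo.all")

echo "collections: $n_collections, ligands: $n_ligands"
```

Comparing the ligand total with the number of lines in the ligand file shows whether any ligands were lost while the new collections were built.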

As far as I can see, there isn’t a problem with making the ligand list, generating the library, or defining the todo.all. It fails when I submit jobs. Previously a small number of jobs started, but most jobs terminate quickly. I’ve tried varying the parameters mentioned above wildly; none of that has helped, but I would have to get back to you on the exact values.
Were you unable to extract anything from the logfile in my previous post? If not, let me know if there’s anything in particular that you want to see.

[Sorin]
I’m not sure how you generated this file, but the approach I use is to simply run vfvs_pp_all.sh (part of VFTools) with both numerical arguments at 0.

This in turn will generate a first_poses file that would contain all docked ligands, but you can keep any number of these by text manipulation.

I did not use vfvs_pp_all.sh. It should still contain the required ligand information though. I explained all my steps in detail in previous posts in this thread.

Can you see any issues in my setup?

Would be great to get this going again! =)

Sorin · April 14, 2023, 7:11am

Hi Morten,

Please use the steps outlined above. That way we can at least have a coherent view of what is working (or not working) properly. Let me know how it went and we will take it from there.

Kind regards,

Sorin

Morten · April 17, 2023, 4:05pm

Hi Sorin
I have now checked the all.ctrl and you can find the values below:

Currently all.ctrl has the following values:
prepare_queue_todolists=false
central_todo_list_splitting_size=10000
ligands_todo_per_queue=1000
ligands_per_refilling_step=1000

This screen has a total of 293755 ligands.

I would appreciate if you could help select sensible values.

I also attach my all.ctrl here in case there are other options I should change.
all.ctrl (16.6 KB)

[Sorin]
The second thing that might be problematic is that vfvs_prepare_newcollections.sh requires a ligand file with the collection name in the first column and the ligand name in the second column.

This is fulfilled. By using vfvs_pp_prepare_dockingposes.sh I was left with a file with the collection names and ligand names in the first and second column, respectively. The library generated by vfvs_prepare_newcollections.sh also seems perfectly fine and contains what we intended.
If you see any reason to still try your protocol I would be happy to do so.

Sorin · May 5, 2023, 1:22pm

Hi Morten,

These values seem fine to me. Do you see any errors in your logs? (For example, the ones that are controlled by the option below):

store_queue_log_files=all_compressed_error_uncompressed

If so, can you please send me the logs?