VFLP fails when attempting to fill queue

Julian · April 9, 2020, 10:12am

Hi Christoph,

thanks for making VirtualFlow available! I used VFVS and it worked great.

I am now trying to develop my custom libraries from SMILES using VFLP and encounter some problems that I believe to be related to the structure of my input.
When I start the job (t-1.1), it runs for only a minute before handing over to a successive job (t-1.2). It does not produce any ligand output.
The job output (workflow/output-files/jobs/job-1.1…) is:

Preparing the to-do lists for jobline 1

Starting the (re)filling of the todolists of the queues.

Before (re)filling the todolists the queue 1-1-1 had 0 ligands todo distributed in 0 collections.

Info: No more todo lists.

Warning: There exists an old (locked) todo file. Trying to take care of it…

Warning: The old todo file is a symlink or an empty file. Removing it…
The next todo list will be used (todo.all.0000)

Info: No more todo lists.

Warning: There exists an old (locked) todo file. Trying to take care of it…

Warning: The old todo file is a symlink or an empty file. Removing it…
The next todo list will be used (todo.all.0000)
There is no more ligand collection in the todo.all file. Stopping the refilling procedure.
After (re)filling the todolists the queue 1-1-1 has 6 ligands todo distributed in 6 collections.

The todo lists for the queues were (re)filled in 0 second(s) (waiting time not included).
The waiting time was 1 second(s).

Starting job step VF_STEP_NO on host eu-c7-052-10.
Job step 1 is starting queue 1-1-1 on host eu-c7-052-10.
Error was trapped
Error in bash script one-step.sh
Error on line 307

To me it seems like the tool fails to load the ligand SMILES. I tried with a number of different inputs, including:

*.smi files with one SMILES per collection
*.smi files with multiple SMILES per collection
*.txt files with one SMILES per collection
the structure that is output from VFTools/bin/vflp_prepare_inputdb_smi2indtar.sh when applied to a file holding multiple SMILES and names
but all failed in the same way

Can you help me use the correct input for VFLP? Or is my mistake somewhere else?

Below are some minor issues I noticed with VFLP, which might or might not be related to the main problem:

Minor issues:

The job tries to terminate the queue but fails (i.e. next job starts). Related output:

Trying to stop this queue and causing the jobline to fail…
cp: cannot stat ‘/dev/shm/“username”/VFLP/t/1-1/1-1-1/workflow/output-files/queues/1/1/queue-1-1-1.*’: No such file or directory

In the file workflow/ligand-collections/todo/1/1/1-1-1 every entry has a duplicate (i.e. there are twice as many ligands as in templates/todo.all)
For templates/todo.all, the default GitHub file holds only “tranche”_“collection” and not the “number of ligands in collection” as specified in the documentation
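
A quick sanity check for the last two points could look like the sketch below. It assumes that each line of a todo file should read `<tranche>_<collection> <number of ligands>`, which is how the post above describes the documented format; the helper name `check_todo` and the sample entries are hypothetical and not part of VFLP.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of VFLP): report lines of a todo file that do
# not match "<tranche>_<collection> <ligand count>", plus duplicated entries.
check_todo() {
    local file="$1"
    # Lines that lack the underscore separator or the trailing ligand count.
    grep -vnE '^[^_[:space:]]+_[^[:space:]]+ [0-9]+$' "$file" | sed 's/^/malformed: /'
    # Collection names that occur more than once.
    awk '{print $1}' "$file" | sort | uniq -d | sed 's/^/duplicate: /'
}

# Demo on a throwaway file with made-up entries.
tmp="$(mktemp)"
printf '%s\n' 'AAAA_00000 4232' 'AAAA_00001' 'AAAA_00000 4232' > "$tmp"
check_todo "$tmp"
rm -f "$tmp"
```

On the sample data this flags the second line as malformed (no ligand count) and `AAAA_00000` as a duplicate. The same function can be pointed at templates/todo.all or at a file under workflow/ligand-collections/todo/.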

thomc · April 14, 2020, 10:44am

I had similar problems that were ultimately due to my LSF environment. I could never make sense of why specific errors were output when I tried to trace them.

I’ll copy/paste what worked for me below, in case it is helpful to you (disclaimer: I am not entirely sure it will solve your issue).

Thanks,
Chris

By removing and adjusting the following lines, tutorials ran fine:

Deleted within templates/template1.lsf.sh:
#BSUB -R "rusage[mem=400]"
#BSUB -R "select[scratch]"
#BSUB -R "span[ptile=4]"
Changed within templates/template1.lsf.sh:
#BSUB -q medium --> #BSUB -q /actualQueue/

Julian · April 14, 2020, 11:02am

Thanks Chris!

Actually I solved these issues already to get VFVS to run.
To improve your control over the resource allocation, I suggest you change
#BSUB -R "select[scratch]" to #BSUB -R "rusage[scratch=2000]"
so that you still have control over the scratch space you request. I also had no problems with the other two lines (you just have to adjust the values). I deleted the #BSUB -q medium line entirely since our LSF cluster forbids selection of queues by the user.
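
The two template changes described above can be sketched as a small sed filter. This is only an illustration: it assumes the directives appear in the template exactly as quoted in this thread (with plain double quotes), the function name `patch_template` is hypothetical, and the value 2000 is simply the one from the post and has to be adapted to the cluster.

```bash
#!/usr/bin/env bash
# Sketch: print a patched copy of an LSF template to stdout.
#  - replace the scratch selection with an explicit scratch reservation
#  - drop the queue line, for clusters that forbid user-selected queues
patch_template() {
    sed -e 's/^#BSUB -R "select\[scratch]"$/#BSUB -R "rusage[scratch=2000]"/' \
        -e '/^#BSUB -q /d' "$1"
}

# Demo on a throwaway file that mimics the directives quoted in the thread.
tmp="$(mktemp)"
printf '%s\n' '#BSUB -R "rusage[mem=400]"' \
              '#BSUB -R "select[scratch]"' \
              '#BSUB -q medium' > "$tmp"
patch_template "$tmp"
rm -f "$tmp"
```

Because the function writes to stdout instead of editing in place, the result can be reviewed first, for example by redirecting it to a new file and comparing it with the original template using diff.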

At least this is what works for me in VFVS. VFLP still does not work at all.

Best wishes,
Julian

Julian · April 14, 2020, 12:49pm

Hi everyone,

I made some progress in turning VFLP into something that actually works, but I am not quite there yet.

Crucially, there is a bug in /tools/templates/one-step.sh: on line 105, "/$/{VF_QUEUE_NO_2}/" needs to be "/${VF_QUEUE_NO_2}/", without the slash after the $.
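
The effect of the stray slash can be reproduced in isolation. The snippet below is a minimal sketch: the `queues/.../log` path and the value of the variable are made up for illustration and are not the actual line from one-step.sh.

```bash
#!/usr/bin/env bash
VF_QUEUE_NO_2=1   # sample value, for illustration only

# With the stray slash, "$" is followed by "/" and is kept literally,
# so the variable is never expanded.
broken="queues/$/{VF_QUEUE_NO_2}/log"

# Without it, ${VF_QUEUE_NO_2} expands to the variable's value.
fixed="queues/${VF_QUEUE_NO_2}/log"

printf 'broken: %s\nfixed:  %s\n' "$broken" "$fixed"
```

The broken form yields a path containing the literal text `$/{VF_QUEUE_NO_2}`, which would be consistent with files not being found where the rest of the workflow expects them.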

However this still does not solve the problem since there are more bugs.

If JChemSuite is used (and thus nailgun is needed), the next problem is that one-step.sh, on line 258, attempts a chmod on …/nailgun/nailgun-client/target/ng, but the folder target and the file ng do not exist. EDIT: It is necessary to run “make” in the nailgun directory, which produces the required files. VFLP still fails at runtime with the following error:

Error: Could not find or load main class com.martiansoftware.nailgun.NGServer
Caused by: java.lang.ClassNotFoundException: com.martiansoftware.nailgun.NGServer

If JChemSuite is not used (and thus nailgun is not needed), VFLP jobs still report an error on line 307. Unfortunately, this is not informative, as it only means that not all queues exited without error.
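
When a trap only reports a line number, a generic bash technique is to have it also print the failing command and its exit status. The sketch below shows that technique on its own; it is not the trap used in one-step.sh (which is not shown in this thread), and `report_error` and `run_step` are hypothetical names.

```bash
#!/usr/bin/env bash
# Print the exit status, the line number and the text of the failing command.
report_error() {
    echo "Error: exit status $1 on line $2 while running: $3" >&2
}

# The body runs in a subshell so that the trap stays local to this demo.
run_step() (
    trap 'report_error "$?" "$LINENO" "$BASH_COMMAND"' ERR
    false                     # stand-in for a queue that exits with an error
    echo "step finished"
)

run_step
```

Here the trap fires for `false` and the script then continues, because `set -e` is not enabled. Adding a similar message to a copy of the failing script, or inspecting the per-queue log files, may narrow down which command actually failed.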

Any help and further ideas in getting VFLP to run are highly welcome

Best,
Julian

vas2201 · April 16, 2020, 7:32pm

Hi Julian,

Apparently, according to the link below, we are supposed to have the following code in one-step.sh to launch the server. I am not very good at shell scripting, so I am still trying to figure it out.

java -cp ~/nailgun/tools/nailgun-server-0.9.1.jar -server com.martiansoftware.nailgun.NGServer 127.0.0.1

http://www.martiansoftware.com/nailgun/quickstart.html

Regards
Vas

vas2201 · April 17, 2020, 4:09am

Hi Julian,

I fixed the error above (Error: Could not find or load main class com.martiansoftware.nailgun.NGServer) as follows, but ended up with other bugs related to the Java version installed in the root folder.

Download the nailgun-server-0.9.1.jar file from the web, put it in any folder, and use the path to that jar file (~/path/to/ below).

Edit line 171 of the one-queue.sh script as follows:

java -Xmx${java_max_heap_size}G -cp ~/path/to/nailgun-server-0.9.1.jar -server com.martiansoftware.nailgun.NGServer localhost:${NG_PORT} &
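
Before editing the script, it can help to confirm that the jar on the classpath really contains the class named in the error. The following is a rough sketch, assuming `unzip` is installed; the `NAILGUN_JAR` variable, the placeholder path and the function name `has_ngserver` are hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds if an archive listing read from stdin contains the NGServer class.
has_ngserver() {
    grep -q 'com/martiansoftware/nailgun/NGServer\.class'
}

# Placeholder path; point NAILGUN_JAR at the jar you actually downloaded.
jar_path="${NAILGUN_JAR:-$HOME/path/to/nailgun-server.jar}"

if [ -f "$jar_path" ] && command -v unzip >/dev/null 2>&1; then
    if unzip -l "$jar_path" | has_ngserver; then
        echo "NGServer class found in $jar_path"
    else
        echo "NGServer class missing from $jar_path"
    fi
else
    echo "Skipping check: $jar_path or unzip is not available"
fi
```

If the class is missing, or the path passed to `-cp` does not exist, Java reports the same "Could not find or load main class" error as quoted above.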

-Cheers
Vas

nailgun/tools/google-java-format/README.txt
nailgun/tools/google-java-format/google-java-format-1.6-all-deps.jar
NGServer 0.9.1 started on 127.0.0.1, port 53005.
Job step 1 is starting queue 1-1-1 on host compt118.
Job step 1 is starting queue 1-1-2 on host compt118.
Job step 1 is starting queue 1-1-3 on host compt118.
Job step 1 is starting queue 1-1-4 on host compt118.
Job step 1 is starting queue 1-1-5 on host compt118.
Error was trapped
Error in bash script one-step.sh
Error on line 308
Environment variables

truexc · February 24, 2022, 3:16pm

hey guys!

I’ve been trying to get VFLP to work for a little while and I have been getting this exact error on all my job outputs. I have …

the ligand library in the exact directory structure as VFVS but with .smi files
AA → AAAA.tar.gz → 00000.tar.gz → 4232 individual smi files
I’ve created a module for openbabel that is loaded in template1.slurm.sh
I don’t use JChem and have either disabled those features or directed them to openbabel
I even tried “none” in the secondary option and “obabel” two times.
I only have one item in the todo list “AAAA_00000 4232”
all the splitting and job handling stuff was set to 4232
I’ve set number of nodes, cpus per job, cpus per queue to one
I’ve made sure the partition is correctly defined in template1.slurm.sh

I am running out of things to check and it all results in the same error message.
"""
Error was trapped
Error in bash script one-step.sh
Error on line 307
"""
This error message is extremely unhelpful and I have yet to find a solution. Has any progress been made here? Any help would be greatly appreciated.

Jason_wu · April 6, 2022, 11:25pm

I also encountered the same issue and have no idea how to deal with it. Do you have any updates?

truexc · April 7, 2022, 3:58pm

@Jason_wu,

In one-step.sh at line 155, you will see the script use a variable before it is defined on line 161.
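
This class of bug can be made visible with bash's `set -u` option, which turns a reference to an unset variable into an error. The sketch below is generic: the variable name `queue_dir` and its value are hypothetical, since the thread does not say which variable is affected.

```bash
#!/usr/bin/env bash
# Each function body runs in a subshell, so "set -u" and the resulting
# abort stay contained within the demonstration.
use_before_define() (
    set -u
    echo "queue dir: ${queue_dir}"   # referenced here ...
    queue_dir="/tmp/queue-1-1-1"     # ... but only assigned here
)

define_before_use() (
    set -u
    queue_dir="/tmp/queue-1-1-1"
    echo "queue dir: ${queue_dir}"
)

use_before_define 2>/dev/null || echo "use_before_define failed: unbound variable"
define_before_use
```

Without `set -u`, the first function would silently print an empty value, which is how such an ordering mistake can go unnoticed.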

in line 1051 of one-queue.sh, I also found that the logic to confirm ‘max_obabel_energy’ is an integer was incorrect.
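
For reference, one common way to test whether a value is an integer in bash is a regular-expression match. This is a minimal sketch of that technique, not the code from one-queue.sh (the original check is not quoted in this thread); the function name `is_integer` and the sample value are assumptions.

```bash
#!/usr/bin/env bash
# Succeeds only if the argument is an optionally signed decimal integer.
is_integer() {
    [[ "$1" =~ ^-?[0-9]+$ ]]
}

max_obabel_energy="10000"   # sample value, for illustration only

if is_integer "$max_obabel_energy"; then
    echo "max_obabel_energy is an integer"
else
    echo "max_obabel_energy is not an integer: '$max_obabel_energy'"
fi
```

Empty strings, decimals such as 4.2 and values with trailing text are all rejected by this pattern.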

It has been a while since I last looked at it, but I remember fixing these two things. After that, it ended up running an endless loop that ultimately did nothing.

I think VFLP just needs more debugging, in my opinion.

Jason_wu · April 22, 2022, 10:45am

Thanks for the response.
Yes, it requires more debugging, or more detailed documentation.