Issue when trying to run the VFVS_GK tutorial (tutorial 1)

Hello there VirtualFlow community!

I’ve been trying to run the above tutorial. I’m using a SLURM manager, and I’ve made the proper, or what I think are proper : ), modifications to the all.ctrl file and the template1.slurm.sh file; in this example the todo.all file is already set up properly. I then execute the prepare folders command ./vf_prepare_folder.sh.

When I try to run with ./vf_start_jobline.sh I get the following error:

./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1


        ::  ::  ::  ::::. :::::: ::  ::  .::::.  ::      :::::  ::    .::::. ::      ::
        ::  ::  ::  :: ::   ::   ::  ::  ::  ::  ::      ::     ::    ::  :: ::  ::  ::
         ::::   ::  :::.    ::   ::  ::  ::::::  ::      :::::  ::    ::  ::  ::::::::
          ::    ::  :: ::   ::    ::::   ::  ::  ::::    ::     ::::: '::::'   ::  ::


++ grep -m 1 '^batchsystem=' ../workflow/control/all.ctrl
++ tr -d '[[:space:]]'
++ awk -F '[=#]' '{print $2}'
+ batchsystem=SLURM
+ echo ''

++ seq 1 12
+ for i in '$(seq ${start_jobline_no} ${end_jobline_no})'
+ cp templates/template1.slurm.sh ../workflow/job-files/main/1.job
+ sed -i 's/-1\.1/-1\.1/g' ../workflow/job-files/main/1.job
+ cd helpers
+ . sync-jobfile.sh 1
++ '[' 1 = -h ']'
++ [[ 1 -ne 1 ]]
++ trap 'error_response_nonstd $LINENO' ERR
++ jobline_no=1
++ controlfile=
+++ ls '../../workflow/control/*-*'
+++ true
++ '[' -z '' ']'
++ export controlfile=../../workflow/control/all.ctrl
++ controlfile=../../workflow/control/all.ctrl
+++ grep -m 1 '^batchsystem=' ../../workflow/control/all.ctrl
+++ tr -d '[[:space:]]'
+++ awk -F '[=#]' '{print $2}'
++ batchsystem=SLURM
++ echo -e 'Syncing the jobfile of jobline 1 with the controlfile file ../../workflow/control/all.ctrl.'
Syncing the jobfile of jobline 1 with the controlfile file ../../workflow/control/all.ctrl.
+++ grep -m 1 '^steps_per_job=' ../../workflow/control/all.ctrl
+++ tr -d '[[:space:]]'
+++ awk -F '[=#]' '{print $2}'
++ steps_per_job_new=1
++ '[' SLURM = SLURM ']'
+++ grep -m 1 nodes= ../../workflow/job-files/main/1.job
++ job_line='job_line=$(grep -m 1 "nodes=" ../workflow/job-files/main/${VF_JOBLINE_NO}.job)'
++ steps_per_job_old='job_line=$(grep -m 1 "nodes=" ../workflow/job-files/main/${VF_JOBLINE_NO}.job)'
++ steps_per_job_old='job_line=$(grep -m 1 "nodes=" ../workflow/job-files/main/${VF_JOBLINE_NO}.job)'
++ sed -i 's/nodes=job_line=$(grep -m 1 "nodes=" ../workflow/job-files/main/${VF_JOBLINE_NO}.job)/nodes=1/g' ../../workflow/job-files/main/1.job
sed: -e expression #1, char 51: unknown option to `s'
+++ error_response_nonstd 88
+++ echo 'Error was trapped which is a nonstandard error.'
Error was trapped which is a nonstandard error.
++++ basename sync-jobfile.sh
+++ echo 'Error in bash script sync-jobfile.sh'
Error in bash script sync-jobfile.sh
+++ echo 'Error on line 88'
Error on line 88
+++ exit 1

Any thoughts?

I’m rather new to programming so any information would be great! Happy to provide any additional information that might make understanding this easier.

Looking forward to hearing back.

Cheerfully,
Kirtley

Hi Kirtley,

The error seems to be coming from here: sed: -e expression #1, char 51: unknown option to `s’
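
If I’m reading the xtrace right, the text that sync-jobfile.sh ends up putting into the sed replacement is the whole grep’d line, which itself contains / characters, and a / inside the replacement part of an s/…/…/ expression ends the expression early, so sed chokes on whatever follows. A minimal sketch of the same failure with made-up values, and the usual workaround of picking a delimiter that doesn’t occur in the text:

$ value='a/b'                                        # replacement text that contains a slash
$ echo "nodes=4" | sed "s/nodes=.*/nodes=${value}/"  # fails: unknown option to `s'
$ echo "nodes=4" | sed "s|nodes=.*|nodes=${value}|"  # prints: nodes=a/b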

Do you mind sharing your all.ctrl file? Also, what changes did you make in the template1.slurm.sh file? Do you mind sharing that one as well?

Regards,
Guilhem

Hello Guilhem,

Thanks for getting back so quickly. Sure, I can share.

Here were my changes within the all.ctrl file:

job_letter=t
batchsystem=SLURM
partition=batch
timelimit=02-00:00:00
steps, cpus, queues, and cpus/queue all = 1 (originally these were set to maximize the node, but they were changed to 1 during troubleshooting)
verbosity_commands=debug
verbosity_logfiles=debug
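
In other words, the relevant lines in all.ctrl ended up looking something like this (if I have the parameter names right):

steps_per_job=1
cpus_per_step=1
queues_per_step=1
cpus_per_queue=1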

This is what I changed in the Slurm settings of the template1.slurm.sh file:
# Slurm Settings
###############################################################################

#SBATCH -A syb105
#SBATCH --mail-user=amosbk@ornl.gov
#SBATCH --mail-type=fail
#SBATCH -t 48:00:00
#SBATCH -N 1
#SBATCH -J VFTEST
#SBATCH --mem=0
#SBATCH -o ../workflow/output-files/jobs/job-1.1_%j.out           # File to which standard out will be written
#SBATCH -e ../workflow/output-files/jobs/job-1.1_%j.out            # File to which standard err will be written

What are your thoughts?

I’m having some issues trying to share the files in any readable format. I did some digging through sync-jobfile.sh and the 1.job file trying to understand this, but I’m not sure what it means.

Cheerfully,
Kirtley

Hi Kirtley,

I don’t see anything specific; in particular, your changes in the template1.slurm.sh file seem to be commented, so they shouldn’t be impacting anything there.
Validate the name of your Slurm partition (I’m assuming it’s batch in your case; just check by running sinfo on the login/controller node).
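
For example (the output below is just illustrative, not from your cluster):

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 2-00:00:00      4   idle node[001-004]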

My setup is a couple of months older than yours, so I’ll try to download the latest version and check whether I run into any issues.

Hello Guilhem and others,

We got it to work over here. We changed the Slurm file (a drastic change from my original one : ) and got it to run, but then had to make some additional changes to the all.ctrl file. As we were debugging we found that tempdir needed to be changed to tempdir_default. We also ran into some strange scratch-file issues, but it’s working now.
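
In case it helps anyone else, the line in question in our all.ctrl ended up along these lines (the value shown here is only an example, not necessarily what we used):

tempdir_default=/tmp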

Another question: is it common to have failures at the end of the run? Our run yielded 26 failed ligands out of 1123 and 26 failed dockings out of 2220.

Also, is VirtualFlow set up for hyper-threading : ) Or is that a naive question?

Cheerfully,
Kirtley

Hi Kirtley,

That’s correct, the example in tutorial 1 results in 26 failed dockings. You probably have the same workflow report as below:

$ ./vf_report.sh -c workflow

    ::  ::  ::  ::::. :::::: ::  ::  .::::.  ::      :::::  ::    .::::. ::      ::
    ::  ::  ::  :: ::   ::   ::  ::  ::  ::  ::      ::     ::    ::  :: ::  ::  ::
     ::::   ::  :::.    ::   ::  ::  ::::::  ::      :::::  ::    ::  ::  ::::::::
      ::    ::  :: ::   ::    ::::   ::  ::  ::::    ::     ::::: '::::'   ::  ::



                              Sat Jun  6 04:48:54 UTC 2020


                                     Workflow Status

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

                                         Joblines

Number of jobfiles in the workflow/jobfiles/main folder: 12
Number of joblines in the batch system: 0
Number of joblines in the batch system currently running: 0
Number of joblines in the batch system currently not running: 0
Number of cores/slots currently used by the workflow: 0

                                        Collections

Total number of ligand collections: 68
Number of ligand collections completed: 68
Number of ligand collections in state “processing”: 0
Number of ligand collections not yet started: 0

                             Ligands (in completed collections)

Total number of ligands: 1123
Number of ligands started: 1123
Number of ligands successfully completed: 1097
Number of ligands failed: 26

                            Dockings (in completed collections)

Docking runs per ligand: 2
Number of dockings started: 2220
Number of dockings successfully completed: 2194
Number of dockings failed: 26

Regarding your second question, it’s not so much VirtualFlow itself that’s configured for hyper-threading but the nodes on which you run the workflow. More details here in the Slurm documentation. Note that on Google Cloud, for example, each vCPU is by default implemented as a single hardware hyper-thread of the available CPU platform (details here), but the mapping in the Slurm cluster will then be one job per thread and one thread per vCPU. You can check your Slurm nodes’ configuration with something like:
$ scontrol show node | grep Thread
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
State=IDLE+CLOUD+POWER ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
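
For completeness: if your nodes did expose hyper-threads and you wanted Slurm to schedule onto them, the node definitions in slurm.conf would declare ThreadsPerCore=2, along the lines of the sketch below (hostnames and counts are made up):

# slurm.conf (illustrative only)
NodeName=compute[001-004] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=batch Nodes=compute[001-004] Default=YES MaxTime=INFINITE State=UP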

Hello Guilhem,

Sure enough, that’s exactly the output that I got from ./vf_report.sh… but with a bit more verbosity. Do you have any idea why they failed? Is it that the ligand isn’t able to dock with the protein provided in the parameters? Is there any way to find out why they failed?

Thanks for sharing the information; that exactly answers the question. I knew it would be system dependent, but I wasn’t sure whether code needed to be written in a particular way to be hyper-threaded across a CPU versus being run on a non-hyper-threaded CPU.

Always more to learn.

Cheerfully,
Kirtley

Hi Kirtley,

I’m not a ligand expert myself, I’m more on the compute/cloud side : ), but looking at the workflow output files it seems to be a ligand coordinates error:

$ pwd
/mnt/virtualflow/VFVS_GK/workflow/output-files

$ grep -r failed
jobs/job-9.1_10.out:ln: failed to create symbolic link ‘./todo.all.locked’: File exists
jobs/job-12.1_13.out:ln: failed to create symbolic link ‘./todo.all.locked’: File exists
jobs/job-2.1_3.out:ln: failed to create symbolic link ‘./todo.all.locked’: File exists
queues/queue-4-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-3-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-8-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-8-1-1.out.all:Ligand PV-001825361157_1 failed(ligand_elements:B) on Tue Apr 14 03:59:41 UTC 2020.
queues/queue-8-1-1.out.all:Ligand PV-001825361333_1 failed(ligand_elements:B) on Tue Apr 14 03:59:42 UTC 2020.
queues/queue-11-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-7-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-6-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-1-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-2-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-12-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-12-1-1.out.all:Ligand Z2002569892_8 failed(ligand_coordinates) on Tue Apr 14 03:37:17 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_31 failed(ligand_coordinates) on Tue Apr 14 03:37:17 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_45 failed(ligand_coordinates) on Tue Apr 14 03:37:17 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_66 failed(ligand_coordinates) on Tue Apr 14 03:37:18 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_78 failed(ligand_coordinates) on Tue Apr 14 03:38:36 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_86 failed(ligand_coordinates) on Tue Apr 14 03:38:36 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_88 failed(ligand_coordinates) on Tue Apr 14 03:38:37 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_102 failed(ligand_coordinates) on Tue Apr 14 03:41:03 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_103 failed(ligand_coordinates) on Tue Apr 14 03:41:03 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_107 failed(ligand_coordinates) on Tue Apr 14 03:41:04 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_117 failed(ligand_coordinates) on Tue Apr 14 03:42:16 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_119 failed(ligand_coordinates) on Tue Apr 14 03:42:16 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_171 failed(ligand_coordinates) on Tue Apr 14 03:45:50 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_186 failed(ligand_coordinates) on Tue Apr 14 03:45:50 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_188 failed(ligand_coordinates) on Tue Apr 14 03:45:51 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_192 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_200 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_203 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_230 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_232 failed(ligand_coordinates) on Tue Apr 14 03:47:07 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_235 failed(ligand_coordinates) on Tue Apr 14 03:47:07 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_238 failed(ligand_coordinates) on Tue Apr 14 03:47:07 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z2002569892_249 failed(ligand_coordinates) on Tue Apr 14 03:47:07 UTC 2020.
queues/queue-12-1-1.out.all:Ligand Z1270872534_1 failed(ligand_coordinates) on Tue Apr 14 04:13:01 UTC 2020.
queues/queue-5-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-10-1-1.out.all:# If the conversion failed, a reason is stated.
queues/queue-9-1-1.out.all:# If the conversion failed, a reason is stated.

Then looking at specifically this file : queues/queue-12-1-1.out.all

Ligand 42 of job 12.1 belonging to collection HACABE_00000: Z2002569892_192


The ligand contains elements with the same coordinates.
Skipping this ligand and continuing with next one.
Ligand Z2002569892_192 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.

  Ligand 43 of job 12.1 belonging to collection HACABE_00000: Z2002569892_200

The ligand contains elements with the same coordinates.
Skipping this ligand and continuing with next one.
Ligand Z2002569892_200 failed(ligand_coordinates) on Tue Apr 14 03:47:06 UTC 2020.

[…]

Hey Guilhem,

Thanks for getting back to me. Maybe it’s because it’s saying there are two atoms with the same coordinates, which isn’t physically reasonable? I did see that ligand_elements and ligand_coordinates are the reasons given for the failures.

Another thing: I wasn’t able to get the VFTools post-processing to work within this tutorial. It produced files, but they were all empty. I blame this on the lack of an outputfiles_level setting in the all.ctrl file. I was able to get them to work within the second tutorial, well, aside from vfvs_pp_prepare_dockingposes.sh. Still working on that.

Were you able to get the post processing scripts to work?

Cheerfully,
Kirtley

Yeah, the post-processing scripts worked for me. There was a small hiccup regarding the naming of an argument (tranche vs. tranch, if I remember correctly…), but Christoph has already fixed this in the code and/or documentation.