Before screening my own protein I wanted to make sure VirtualFlow works properly, so I followed the tutorial with the provided docking scenarios. Everything goes smoothly, and upon running the command ./vf_start_jobline.sh 1 12 templates/template1.slurm.sh submit 1 I see the following:
Submitted batch job #number
The job for jobline X has been submitted.
However, it is followed by an immediate failure:
Slurm Job_id=XXX Name=t-4.1 Failed, Run time 00:00:29, FAILED, ExitCode 1
What could be the problem?
Is there a way to see what's going wrong?
The store_queue_log_files parameter is set to all_uncompressed in the all.ctrl file, but the workflow/output-files/jobs folder still only contains job-1.1_135.out files, so it is strange that no queue log files are produced despite this setting.
Inside job-1.1_135.out I find the following errors:
Error was trapped
srun: error: wen01: task 0: Exited with exit code 1
Error in bash script one-step.sh
Error on line 55
Environment variables (wen01 is the node)
Error was trapped
Error in bash script /var/spool/slurm/job00135/slurm_script
Error on line 298
Environment variables
@kaneki Can you set (in addition to store_queue_log_files=all_uncompressed) the parameter verbosity_logfiles=debug in the control file, run the workflow again with this setting, and attach the log file? The folder workflow/output-files/jobs should contain one log file for each job run by the batch system. These files are present in your folder, so everything seems fine regarding the job-level log files, and they should contain the cause of the error.
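For reference, the two settings in workflow/control/all.ctrl would then look roughly like this (just a sketch, all other parameters omitted; the exact defaults may differ between versions):

verbosity_logfiles=debug
store_queue_log_files=all_uncompressed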
Since the workflow/output-files/queues folder is still empty, as you said, the workflow seems to fail before starting the actual queues (one for each CPU) which process the ligands. Therefore the workflow/ligand-collections/ligand-lists folder is also still empty, since no ligands have been processed yet.
I would like to attach a log file, but even though both verbosity_logfiles=debug and store_queue_log_files=all_uncompressed are set in the control file (all.ctrl) as you mentioned, no log files are produced.
Indeed, it seems not even to reach a queue, so I'm really confused about what's going on.
I'm just following the tutorial with the provided files, so I did not change anything other than what was specified above.
Perhaps important to note: I used the values given in the tutorial: steps_per_job=1, cpus_per_step=5, queues_per_step=5, cpus_per_queue=1.
Could it be that I have to change this configuration? Maybe VirtualFlow is not able to even start a queue on my cluster, which causes the jobline to fail?
@kaneki You mentioned earlier that, for example, the file job-1.1_135.out was there, which is a log file of the job. Could you share this (or a similar) file? (This is also the log file I meant in my previous post, i.e. the log file on the job level.)
Do all the files in the folder workflow/output-files/jobs have the ending "out", or do some have the ending "err"?
That's a good point, renaming the workflow/output-files folder to "logfiles" might be a good idea. I'll put that on my to-do list.
It seems that you are running your workflow in the home folder, is that right? On most clusters this should not be done, unless your admins recommend it. Usually a fast shared cluster filesystem should be used. Is the home folder available to the compute nodes? If not, this might cause the immediate failure of the jobs.
If this does not help, I would recommend running an interactive job: start an interactive session with srun --nodes=1 --ntasks-per-node=2 --pty bash (additional parameters might be needed on your cluster), then go to the tools folder and run the main job script of the first jobline directly with bash ../workflow/job-files/main/1.job
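A rough sketch of the procedure, assuming the srun options above work on your cluster (adjust them if needed):

# start an interactive session on a compute node
srun --nodes=1 --ntasks-per-node=2 --pty bash
# change into the tools folder (this assumes you start in the VirtualFlow root directory)
cd tools
# run the job script of jobline 1 directly, without the batch system
bash ../workflow/job-files/main/1.job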
I'm indeed running it in my home folder. When running an interactive job followed by bash ../workflow/job-files/main/1.job, the final job information shows a start time and an end time that differ by only one second. I will try to copy everything to the local scratch folder of a node and run it from there. Indeed, the problem might lie in the home folder. I will let you know.
The local scratch of a compute node is again another type of storage, which is only accessible by that node. Usually, a cluster has a shared cluster/scratch filesystem which is available to all compute nodes (including the login node). This is the one you want to use
We only have local scratch folders, not a shared one. Would it still be possible to run VirtualFlow on one node? I'm just testing how to use it before sending it to an HPC facility!
edit:
Could it be a problem with additional programs such as Open Babel, AutoDock, etc.? I did not install AutoDock on the cluster because the documentation mentioned that it is already included in the VirtualFlow folder.
That's a good question. Theoretically, yes, but it depends on how exactly your cluster is set up.
Regarding the settings you mentioned earlier, like steps_per_job=1, cpus_per_step=5, queues_per_step=5 or cpus_per_queue=1, this could also be a cause of the problem (though I don't expect it). Normally, if you have an unsupported setting, for example for cpus_per_step, the batch system wouldn't even allow you to submit the job in the first place.
The external docking programs should not be the problem, since they are shipped with the package. And you would see specific error messages about them in the log files when the workflow tries to use these programs.
The error message in your job output files indicates that as soon as srun is run (which usually starts the script one-step.sh on one of the compute nodes), your workflow crashes immediately, without any output from the one-step.sh script. In my experience this happens only if there is a fundamental problem, such as the filesystem suddenly no longer being available to the workflow/compute node. Your cluster admin might be able to help.
In addition, the admin tried running the job and he faces the same problem. He gets the following error, which he asked me to forward to you:
Set steps_per_job=1, cpus_per_step=1, queues_per_step=1 and cpus_per_queue=1. Set these parameters in the workflow/control/all.ctrl file if you do not prepare the folders again with ./vf_prepare_folders.sh; if you change the settings in the template control file tools/templates/all.ctrl instead, you first need to prepare the workflow folders again with ./vf_prepare_folders.sh. Then try running the workflow again with ./vf_start_jobline.sh
If that does not work, then in addition to the settings above (without re-preparing the workflow files), edit the file ../workflow/job-files/main/1.job and change line 294 (the line containing srun) to bash ../workflow/job-files/sub/one-step.sh
Then run an interactive job, e.g. with srun --pty --mem 1G -n 1 -N 1 --cpus-per-task=1 -t 0-1:00:00 /usr/bin/env bash (How exactly you need to start an interactive session will depend on your cluster.)
Then while you are still in the tools folder, run bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt
Then you can send me the file log.txt which should be in your current directory (i.e. the tools folder).
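Put together, the sequence looks roughly like this (only a sketch; line numbers, paths and srun options may differ on your system):

# 1) in workflow/control/all.ctrl (or in tools/templates/all.ctrl followed by ./vf_prepare_folders.sh) set:
#    steps_per_job=1, cpus_per_step=1, queues_per_step=1, cpus_per_queue=1
# 2) in ../workflow/job-files/main/1.job, replace the srun line (around line 294) with:
#    bash ../workflow/job-files/sub/one-step.sh
# 3) start an interactive session (adjust the options to your cluster):
srun --pty --mem 1G -n 1 -N 1 --cpus-per-task=1 -t 0-1:00:00 /usr/bin/env bash
# 4) from the tools folder, run the job script and capture all output:
bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt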
Did I understand correctly that the cluster on which you are trying to get it running is different from the larger cluster where you will run the real virtual screenings later?
I will try the two steps you mentioned today and get back to you with the results later today.
Yes, that's true. First I wanted to understand how VirtualFlow works: since I got errors with my own files, I tried running the tutorial with the provided files, but the same errors (the ones we talked about) arose. I want to run it on our own cluster first to check that the ligands are docked at the position I want before wasting time at a big HPC facility!
EDIT:
Point 1 didn't work either; I will send you the log file when I have finished point 2.
Point 2: in 1.job, line 294 contains
for pid in ${pids[@]}; do
wait $pid || let "exit_code=1"
Testing that the ligands are docked in the right docking site when you set up your own target is definitely a good idea. But in principle you can also do this without VirtualFlow: you can just run AutoDock Vina manually that one time to test, and look at the results (that's how I usually do it). Then you can use the same Vina config file for VirtualFlow. Of course, checking the docking results of VirtualFlow in small test runs as well is a good idea, to be sure the dockings are working as you want them to.
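A minimal sketch of such a manual test run (the file names here are only placeholders; the config file with the receptor and docking box is the same one you would later use with VirtualFlow):

# dock a single test ligand with AutoDock Vina using your docking config file
vina --config docking.conf --ligand test_ligand.pdbqt --out test_ligand_docked.pdbqt

You can then open test_ligand_docked.pdbqt together with the receptor in a molecular viewer and check whether the poses sit in the intended binding site.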
If you want to use a big HPC facility, you will also need to run some smaller tests there anyway, because each HPC system is different. Thus, if for some reason you don't get it working on your own HPC system within a reasonable amount of time, it might be more efficient to test on the large HPC facility directly (on a small test scale).
Regarding point 2: maybe you have a slightly different version than the one I looked at; the line containing the srun command is "around" line 294.
In addition, I sent you a PM with the log files from the two steps you asked me to perform:
For point two: since I ran the command with one job, there was only one job file, in which I changed line 285,
srun --relative=$((VF_STEP_NO - 1)) -n 1 -N 1 ../workflow/job-files/sub/one-step.sh &
into
bash ../workflow/job-files/sub/one-step.sh
That is, I removed the whole line and put the line above in its place, and it resulted in an error at that line. For point two I get two errors, and again the error which keeps coming back:
Starting job step 1 on host ana01.
+ bash ../workflow/job-files/sub/one-step.sh
Error was trapped
Error in bash script one-step.sh
Error on line 55
Thank you for the log files. That is really strange; I have never seen anything like this before.
As soon as the script one-step.sh is started, an error is reported for line 55, without any output from the earlier lines of that file. And line 55 of that file doesn't seem to make sense either.
Without changing or resetting anything else, can you start your interactive job again and only change the line
bash ../workflow/job-files/sub/one-step.sh
in the file ../workflow/job-files/main/1.job to
source ../workflow/job-files/sub/one-step.sh
and then run bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt again?
Also, can you send along the file ../workflow/job-files/sub/one-step.sh as well next time?
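In other words, roughly (the exact line number inside 1.job may differ in your version):

# in ../workflow/job-files/main/1.job, change
#   bash ../workflow/job-files/sub/one-step.sh
# into
#   source ../workflow/job-files/sub/one-step.sh
# then rerun from the tools folder and capture the output:
bash ../workflow/job-files/main/1.job 2>&1 | tee log.txt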
I have looked at the latest files you sent me. This time we got the bash debug output from the one-step.sh script, as you might have noticed. The one-step.sh shell script sources the ~/.bashrc file, which in turn seems to source /etc/bashrc, and it seems an error occurs while this script is sourced. The error occurs shortly after the command gau-machine (related to Gaussian?) is run. Thus you could, for example, try to fix your .bashrc (or the /etc/bashrc) file, or you can edit the file one-step.sh and comment out/remove the line which sources your .bashrc. VFVS should still be able to run.
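For example, if one-step.sh contains a line along the lines of the following (the exact wording in your version may differ), commenting it out should be enough for testing:

# in ../workflow/job-files/sub/one-step.sh, disable the sourcing of the user shell setup:
# source ~/.bashrc   # commented out: /etc/bashrc raises an error (around the gau-machine command) on this cluster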