Getting started
The CCPEM-Pipeliner provides easy access to a variety of software tools for all steps of cryoEM data processing, from preprocessing through model building and validation. The workflow is tracked, and tools are provided for visualising the full project and analysing the results.
The ccpem-pipeliner serves as the back end for its companion software Doppio, which provides a full graphical user interface.
Start with a Project
The Project is contained in a single project directory.
Note
Paths used in the pipeliner, such as paths to files in job parameters, are generally relative to this project directory.
To create a project or access an existing project using the API:
from pipeliner.api.manage_project import PipelinerProject
my_project = PipelinerProject()
To start a new project from the command line
$ pipeliner --start_new_project
A Project is made up of Jobs
The project is made up of Jobs. Each job is one type of operation on the data, although jobs can have several steps. Jobs are defined by their jobtype. The format of the jobtype is:
<program>.<function>.<keywords>
with:
- <program>: the main piece of external software used by the job
- <function>: the task being performed
- <keywords>: serve to further differentiate the jobtype; an unlimited number are allowed
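The three-part structure can be illustrated with a short sketch (this helper is purely illustrative and not part of the pipeliner API):

```python
# Illustrative sketch: splitting a jobtype string into its
# <program>.<function>.<keywords> components.
def split_jobtype(jobtype: str):
    program, function, *keywords = jobtype.split(".")
    return program, function, keywords

# The jobtype used in the examples below:
print(split_jobtype("relion.autopick.log"))
# -> ('relion', 'autopick', ['log'])
```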
To get information about a specific job type from the command line:
$ pipeliner --job_info <job type>
Jobs are written in their own job directories with the format:
<function>/job<nnn>/
with the job number automatically incremented as the project progresses.
Note
A job’s directory also serves as its name, which is used to identify it (e.g. AutoPick/job004/). The trailing slash at the end is required.
Jobs are created from parameter files
Jobs can be created by reading from either of two types of Parameter Files: run.job or job.star
Both files define the jobtype, whether the job is new or a continuation of an old job, and the parameters (JobOptions).
run.job files are more verbose and easier to manually edit:
job_type == relion.autopick.log
is_continue == false
Pixel size in micrographs (A) == 1.02
Pixel size in references (A) == 3.54
...
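Because run.job files are simple "key == value" text, they can be read with a few lines of standard-library Python. This is an illustrative sketch, not the pipeliner's own parser:

```python
# Illustrative sketch: parse the "key == value" lines of a run.job file
# into a dict of parameters.
def read_runjob(lines):
    params = {}
    for line in lines:
        if "==" in line:
            key, value = line.split("==", 1)
            params[key.strip()] = value.strip()
    return params

example = [
    "job_type == relion.autopick.log",
    "is_continue == false",
    "Pixel size in micrographs (A) == 1.02",
]
print(read_runjob(example)["job_type"])
# -> relion.autopick.log
```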
job.star files have a more complicated format, but have the advantage that the pipeliner provides functions to edit them dynamically:
data_job
_rlnJobTypeLabel relion.autopick.log
_rlnJobIsContinue 0
data_joboptions_values
loop_
_rlnJobOptionVariable #1
_rlnJobOptionValue #2
angpix 1.02
angpix_ref 3.54
...
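The data_joboptions_values block above is a STAR loop of (variable, value) pairs. A dedicated STAR parser is the robust way to read these files, but the structure can be illustrated with a minimal plain-Python sketch (not the pipeliner's own code):

```python
# Illustrative sketch: collect the (variable, value) pairs that follow the
# _rlnJobOptionVariable/_rlnJobOptionValue column headers in a job.star loop.
def read_joboption_values(lines):
    opts, in_loop = {}, False
    for line in lines:
        line = line.strip()
        if line.startswith("_rlnJobOption"):
            in_loop = True  # past the column headers
            continue
        if in_loop and line and not line.startswith("_"):
            variable, value = line.split(None, 1)
            opts[variable] = value
    return opts

example = [
    "loop_",
    "_rlnJobOptionVariable #1",
    "_rlnJobOptionValue #2",
    "angpix 1.02",
    "angpix_ref 3.54",
]
print(read_joboption_values(example))
# -> {'angpix': '1.02', 'angpix_ref': '3.54'}
```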
Note
job.star and run.job files can be used interchangeably for almost all applications in a project.
Getting a run.job or job.star file
A parameter file with the default values for any job can be generated with write_default_runjob() or write_default_jobstar()
API:
from pipeliner.api.api_utils import write_default_runjob, write_default_jobstar
write_default_runjob("relion.autopick.log")
write_default_jobstar("relion.autopick.log")
Command line:
$ pipeliner --default_runjob relion.autopick.log
$ pipeliner --default_jobstar relion.autopick.log
This will create the files relion_autopick_log_job.star and relion_autopick_log_run.job
Running a job
With the parameter file created, the job can now be run with run_job():
API:
my_project.run_job("relion_autopick_log_job.star")
Command line:
$ pipeliner --run_job relion_autopick_log_job.star
This will create and run the job AutoPick/job001/
Alternatively, a job can be run from a dict containing its parameters, which can be generated with the function pipeliner.api.api_utils.job_default_parameters_dict(). This dict can be edited in place before using it to run a job.
from pipeliner.api.api_utils import job_default_parameters_dict
params = job_default_parameters_dict("relion.autopick.log")
params["fn_input"] = "Path/to/new/input_file.mrc"
my_project.run_job(params)
Continuing a job
Some jobs can be continued from where they finished. When a job is run, a file continue_job.star is written in its job directory. This file contains only the parameters that are allowed to be modified when the job is continued. Edit this file if any parameters need to be changed, and then continue the job with:
API:
my_project.continue_job("AutoPick/job001/")
Command line:
$ pipeliner --continue_job AutoPick/job001/
Note
The job’s full name was used to continue it, not the name of the continue_job.star file
Submitting jobs to a queue
Pipeliner jobs can be submitted to a queuing system using a submission script template that incorporates values from the job’s JobOptions.
The pipeliner will update variables bracketed by XXX in the submission template from the job’s JobOptions and then run the submission script using the command specified in the job’s qsub JobOption.
| Script Variable | JobOption | GUI Field |
|---|---|---|
| XXXmpinodesXXX (see note) | nr_mpi | Number of MPI procs: |
| XXXthreadsXXX (see note) | nr_threads | Number of threads: |
| XXXdedicatedXXX | min_dedicated | Minimum dedicated cores per node: |
| XXXqueueXXX | queuename | Queue name: |
| XXXextra1XXX | qsub_extra_1 | Set from environment variable PIPELINER_QSUB_EXTRA1 |
| XXXextra2XXX | qsub_extra_2 | Set from environment variable PIPELINER_QSUB_EXTRA2 |
| XXXextra3XXX | qsub_extra_3 | Set from environment variable PIPELINER_QSUB_EXTRA3 |
| XXXextra4XXX | qsub_extra_4 | Set from environment variable PIPELINER_QSUB_EXTRA4 |
There are some additional variables available for submission scripts that are not drawn from the JobOptions:
| Script Variable | Substitution |
|---|---|
| XXXnameXXX | The job's name; the same as its output directory |
| XXXcoresXXX | The number of MPI processes multiplied by the number of threads |
| XXXerrfileXXX | Path to the job's run.err file |
| XXXoutfileXXX | Path to the job's run.out file |
| XXXcommandXXX (see note) | The full commands list for the job |
Note
The variable XXXcommandXXX will already have the mpirun command specified by the mpirun_com JobOption prepended to commands where necessary. It generally does NOT need to be included in the submission script. The default for mpirun_com is mpirun -n XXXmpinodesXXX, meaning the XXXmpinodesXXX variable generally also does not need to be included in the commands section of the submission script template. Similarly, the number of threads used by a job is usually set in the commands, so the XXXthreadsXXX variable rarely needs to be used.
The submission script template must be written for your specific system. Here is an example submission script template for a cluster running SLURM:
#!/bin/bash
#SBATCH --ntasks=XXXmpinodesXXX
#SBATCH --partition=XXXqueueXXX
#SBATCH --cpus-per-task=XXXthreadsXXX
#SBATCH --error=XXXerrfileXXX
#SBATCH --output=XXXoutfileXXX
#SBATCH --gres=gpu:2
XXXcommandXXX
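The substitution the pipeliner performs on a template like the one above can be sketched in a few lines of Python (an illustration of the XXX...XXX convention, not the pipeliner's own implementation):

```python
# Illustrative sketch: replace each XXX<name>XXX placeholder in a submission
# script template with the corresponding JobOption value.
def fill_template(template: str, values: dict) -> str:
    for name, value in values.items():
        template = template.replace(f"XXX{name}XXX", str(value))
    return template

template = "#SBATCH --ntasks=XXXmpinodesXXX\n#SBATCH --partition=XXXqueueXXX"
print(fill_template(template, {"mpinodes": 8, "queue": "gpu"}))
# -> #SBATCH --ntasks=8
#    #SBATCH --partition=gpu
```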
Modifying parameters
The Python API can modify job.star parameter files on the fly using edit_jobstar(). This avoids manual editing of the parameter files when stringing together multiple jobs:
from pipeliner.api.api_utils import edit_jobstar
movie_jobstar = my_project.write_default_jobstar("relion.import.movies")
edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"})
movie_job = my_project.run_job(movie_jobstar)
mocorr_jobstar = my_project.write_default_jobstar("relion.motioncorr.own")
edit_jobstar(mocorr_jobstar, {"fn_in": movie_job.output_name + "movies.star"})
mocorr_job = my_project.run_job(mocorr_jobstar)
Alternatively, this can be done solely with dicts:
from pipeliner.api.api_utils import job_default_parameters_dict
import_params = job_default_parameters_dict("relion.import.movies")
import_params["fn_in_raw"] = "Movies/*.mrcs"
movie_job = my_project.run_job(import_params)
mocorr_params = job_default_parameters_dict("relion.motioncorr.own")
mocorr_params["fn_in"] = movie_job.output_name + "movies.star"
mocorr_job = my_project.run_job(mocorr_params)
Running schedules
Scheduling allows sets of jobs to be run multiple times via schedule_job() and run_schedule()
Note
When a job is scheduled, placeholder files are created for all of its outputs, so these files can be used as if they already existed.
Here the same jobs as above are run, except using the scheduling functions to run the set of import and motion correction jobs 10 times:
API:
movie_jobstar = my_project.write_default_jobstar("relion.import.movies")
edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"})
movie_job = my_project.schedule_job(movie_jobstar)
mocorr_jobstar = my_project.write_default_jobstar("relion.motioncorr.own")
edit_jobstar(mocorr_jobstar, {"fn_in": movie_job.output_name + "movies.star"})
mocorr_job = my_project.schedule_job(mocorr_jobstar)
my_project.run_schedule(
fn_sched="my_schedule",
    job_ids=[movie_job.output_name, mocorr_job.output_name],
nr_repeat=10,
)
To accomplish this from the command line, the parameter files for the Import and MotionCorr jobs must already have been created with the correct file names as inputs:
$ pipeliner --schedule_job <import job param file>
$ pipeliner --schedule_job <motion corr job param file>
$ pipeliner --run_schedule --name my_schedule --jobs job001 job002 --nr_repeat 10
Note
The command line tool intelligently parses job names; for the job named Import/job001/ it accepts job001 or 1 as well as the full job name.
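The kind of normalisation this implies can be sketched as follows (a hypothetical helper, not the pipeliner's own code):

```python
# Illustrative sketch: resolve the accepted job-name forms
# ("Import/job001/", "job001", "1") to the same job number.
import re

def job_number(name: str):
    match = re.search(r"(\d+)/?$", name.strip())
    return int(match.group(1)) if match else None

for form in ("Import/job001/", "job001", "1"):
    print(job_number(form))
# -> 1 for each form
```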
Other job tools
A variety of other tools exist for modifying jobs in the project. See the API documentation for how to use these functions:
set_alias
- Give a job a more descriptive name
run_cleanup
- Move intermediate files from jobs into the trash to save disk space
delete_job
- Move a job to the trash
undelete_job
- Remove a job from the trash and restore it to the project
empty_trash
- Permanently delete files in the trash
prepare_metadata_report
- Get metadata about an entire project