Getting started
The CCPEM-Pipeliner provides easy access to a variety of software tools for all steps of cryoEM data processing, from preprocessing through model building and validation. The workflow is tracked, and tools are provided for visualising the full project and analysing the results.
The ccpem-pipeliner serves as the back end for its companion software Doppio, which provides a full graphical user interface.
Start with a Project
The Project is contained in a single project directory.
Note
Paths used in the pipeliner, such as paths to files in job parameters, are generally relative to this project directory.
To create a project or access an existing project using the API:
from pipeliner.api.manage_project import PipelinerProject
my_project = PipelinerProject()
To start a new project from the command line
$ pipeliner --start_new_project
A Project is made up of Jobs
The project is made up of Jobs. Each job is one type of operation on the data, although jobs can have several steps. Jobs are defined by their jobtype. The format of the jobtype is:
<program>.<function>.<keywords>
with:
- <program>: the main piece of external software used by the job
- <function>: the task being performed
- <keywords>: serve to further differentiate the jobtype; an unlimited number are allowed
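The three-part structure can be illustrated with a short sketch (this helper is purely illustrative and not part of the pipeliner API):

```python
# Illustrative sketch: splitting a jobtype string into its
# <program>.<function>.<keywords> components.
def split_jobtype(jobtype: str):
    program, function, *keywords = jobtype.split(".")
    return program, function, keywords

# The jobtype used in the examples below:
print(split_jobtype("relion.autopick.log"))
# -> ('relion', 'autopick', ['log'])
```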
To get information about a specific job type from the command line:
$ pipeliner --job_info <job type>
Jobs are written in their own job directories with the format:
<function>/job<nnn>/
with the job number automatically incremented as the project progresses.
Note
A job’s directory also serves as its name, which is used to identify it (e.g. AutoPick/job004/). The trailing slash at the end is required.
Jobs are created from parameter files
Jobs can be created by reading from either of two types of Parameter Files: run.job or job.star
Both files define the jobtype, whether the job is new or a continuation of an old job, and the parameters (JobOptions).
run.job files are more verbose and easier to manually edit:
job_type == relion.autopick.log
is_continue == false
Pixel size in micrographs (A) == 1.02
Pixel size in references (A) == 3.54
...
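Because run.job files are simple "key == value" text, they can be read with a few lines of standard-library Python. This is an illustrative sketch, not the pipeliner's own parser:

```python
# Illustrative sketch: parse the "key == value" lines of a run.job file
# into a dict of parameters.
def read_runjob(lines):
    params = {}
    for line in lines:
        if "==" in line:
            key, value = line.split("==", 1)
            params[key.strip()] = value.strip()
    return params

example = [
    "job_type == relion.autopick.log",
    "is_continue == false",
    "Pixel size in micrographs (A) == 1.02",
]
print(read_runjob(example)["job_type"])
# -> relion.autopick.log
```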
job.star files have a more complicated format, but have the advantage that the pipeliner provides functions to edit them dynamically:
data_job
_rlnJobTypeLabel relion.autopick.log
_rlnJobIsContinue 0
data_joboptions_values
loop_
_rlnJobOptionVariable #1
_rlnJobOptionValue #2
angpix 1.02
angpix_ref 3.54
...
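The data_joboptions_values block above is a STAR loop of (variable, value) pairs. A dedicated STAR parser is the robust way to read these files, but the structure can be illustrated with a minimal plain-Python sketch (not the pipeliner's own code):

```python
# Illustrative sketch: collect the (variable, value) pairs that follow the
# _rlnJobOptionVariable/_rlnJobOptionValue column headers in a job.star loop.
def read_joboption_values(lines):
    opts, in_loop = {}, False
    for line in lines:
        line = line.strip()
        if line.startswith("_rlnJobOption"):
            in_loop = True  # past the column headers
            continue
        if in_loop and line and not line.startswith("_"):
            variable, value = line.split(None, 1)
            opts[variable] = value
    return opts

example = [
    "loop_",
    "_rlnJobOptionVariable #1",
    "_rlnJobOptionValue #2",
    "angpix 1.02",
    "angpix_ref 3.54",
]
print(read_joboption_values(example))
# -> {'angpix': '1.02', 'angpix_ref': '3.54'}
```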
Note
job.star and run.job files can be used interchangeably for almost all applications in a project.
Getting a run.job or job.star file
A parameter file with the default values for any job can be generated with write_default_runjob() or write_default_jobstar()
API:
from pipeliner.api.api_utils import write_default_runjob, write_default_jobstar
write_default_runjob("relion.autopick.log")
write_default_jobstar("relion.autopick.log")
Command line:
$ pipeliner --default_runjob relion.autopick.log
$ pipeliner --default_jobstar relion.autopick.log
This will create the files relion_autopick_log_job.star and relion_autopick_log_run.job
Running a job
With the parameter file created, the job can now be run with run_job():
API:
my_project.run_job("relion_autopick_log_job.star")
Command line:
$ pipeliner --run_job relion_autopick_log_job.star
This will create and run the job AutoPick/job001/
Alternatively, a job can be run from a dict containing its parameters, which can be generated with the function pipeliner.api.api_utils.job_default_parameters_dict(). This dict can be edited in place before using it to run a job.
from pipeliner.api.api_utils import job_default_parameters_dict
params = job_default_parameters_dict("relion.autopick.log")
params["fn_input"] = "Path/to/new/input_file.mrc"
my_project.run_job(params)
Continuing a job
Some jobs can be continued from where they finished. When a job is run, a file continue_job.star is written in its job directory. This file contains only the parameters that are allowed to be modified when the job is continued. Edit this file if any parameters need to be changed, and then continue the job with:
API:
my_project.continue_job("AutoPick/job001/")
Command line:
$ pipeliner --continue_job AutoPick/job001/
Note
The job’s full name was used to continue it, not the name of the continue_job.star file
Submitting jobs to a queue
Pipeliner jobs can be submitted to a queuing system using a submission script template that incorporates values from the job’s JobOptions.
The pipeliner will update variables bracketed by XXX in the submission template from the job’s JobOptions and then run the submission script using the command specified in the job’s qsub JobOption.
| Script Variable | JobOption | GUI Field |
|---|---|---|
| XXXmpinodesXXX (see note) | nr_mpi | Number of MPI procs: |
| XXXthreadsXXX (see note) | nr_threads | Number of threads: |
| XXXdedicatedXXX | min_dedicated | Minimum dedicated cores per node: |
| XXXqueueXXX | queuename | Queue name: |
| XXXextra1XXX | qsub_extra_1 | Set from environment variable PIPELINER_QSUB_EXTRA1 |
| XXXextra2XXX | qsub_extra_2 | Set from environment variable PIPELINER_QSUB_EXTRA2 |
| XXXextra3XXX | qsub_extra_3 | Set from environment variable PIPELINER_QSUB_EXTRA3 |
| XXXextra4XXX | qsub_extra_4 | Set from environment variable PIPELINER_QSUB_EXTRA4 |
There are some additional variables available for submission scripts that are not drawn from the JobOptions:
| Script Variable | Substitution |
|---|---|
| XXXnameXXX | The job's name; the same as its output directory |
| XXXcoresXXX | The number of MPI processes multiplied by the number of threads |
| XXXerrfileXXX | Path to the job's run.err file |
| XXXoutfileXXX | Path to the job's run.out file |
| XXXcommandXXX (see note) | The full commands list for the job |
Note
The variable XXXcommandXXX will already have the mpirun command specified by the mpirun_com JobOption prepended to commands where necessary. It generally does NOT need to be included in the submission script. The default for mpirun_com is mpirun -n XXXmpinodesXXX, meaning the XXXmpinodesXXX variable generally also does not need to be included in the commands section of the submission script template. Similarly, the number of threads used by a job is usually set in the commands, so the XXXthreadsXXX variable rarely needs to be used.
The submission script template must be written for your specific system. Here is an example submission script template for a cluster running SLURM:
#!/bin/bash
#SBATCH --ntasks=XXXmpinodesXXX
#SBATCH --partition=XXXqueueXXX
#SBATCH --cpus-per-task=XXXthreadsXXX
#SBATCH --error=XXXerrfileXXX
#SBATCH --output=XXXoutfileXXX
#SBATCH --gres=gpu:2
XXXcommandXXX
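The substitution the pipeliner performs on a template like the one above can be sketched in a few lines of Python (an illustration of the XXX...XXX convention, not the pipeliner's own implementation):

```python
# Illustrative sketch: replace each XXX<name>XXX placeholder in a submission
# script template with the corresponding JobOption value.
def fill_template(template: str, values: dict) -> str:
    for name, value in values.items():
        template = template.replace(f"XXX{name}XXX", str(value))
    return template

template = "#SBATCH --ntasks=XXXmpinodesXXX\n#SBATCH --partition=XXXqueueXXX"
print(fill_template(template, {"mpinodes": 8, "queue": "gpu"}))
# -> #SBATCH --ntasks=8
#    #SBATCH --partition=gpu
```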
Modifying parameters
The Python API can modify job.star parameter files on the fly using edit_jobstar(). This avoids manual editing of the parameter files when stringing together multiple jobs:
from pipeliner.api.api_utils import edit_jobstar
movie_jobstar = my_project.write_default_jobstar("relion.import.movies")
edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"})
movie_job = my_project.run_job(movie_jobstar)
mocorr_jobstar = my_project.write_default_jobstar("relion.motioncorr.own")
edit_jobstar(mocorr_jobstar, {"fn_in": movie_job.output_name + "movies.star"})
mocorr_job = my_project.run_job(mocorr_jobstar)
Alternatively, this can be done solely with dicts:
from pipeliner.api.api_utils import job_default_parameters_dict
import_params = job_default_parameters_dict("relion.import.movies")
import_params["fn_in_raw"] = "Movies/*.mrcs"
movie_job = my_project.run_job(import_params)
mocorr_params = job_default_parameters_dict("relion.motioncorr.own")
mocorr_params["fn_in"] = movie_job.output_name + "movies.star"
mocorr_job = my_project.run_job(mocorr_params)
Running schedules
Scheduling allows sets of jobs to be run multiple times via schedule_job() and run_schedule()
Note
When a job is scheduled, placeholder files are created for all of its outputs, so these files can be used as if they already existed.
Here the same jobs as above are run, except using the scheduling functions to run the set of import and motion correction jobs 10 times:
API:
movie_jobstar = my_project.write_default_jobstar("relion.import.movies")
edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"})
movie_job = my_project.schedule_job(movie_jobstar)
mocorr_jobstar = my_project.write_default_jobstar("relion.motioncorr.own")
edit_jobstar(mocorr_jobstar, {"fn_in": movie_job.output_name + "movies.star"})
mocorr_job = my_project.schedule_job(mocorr_jobstar)
my_project.run_schedule(
fn_sched="my_schedule",
    job_ids=[movie_job.output_name, mocorr_job.output_name],
nr_repeat=10,
)
To accomplish this from the command line, the parameter files for the Import and MotionCorr jobs must already have been created with the correct file names as inputs:
$ pipeliner --schedule_job <import job param file>
$ pipeliner --schedule_job <motion corr job param file>
$ pipeliner --run_schedule --name my_schedule --jobs job001 job002 --nr_repeat 10
Note
The command line tool intelligently parses job names; for the job named Import/job001/ it accepts job001 or 1 as well as the full job name.
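The kind of normalisation this implies can be sketched as follows (a hypothetical helper, not the pipeliner's own code):

```python
# Illustrative sketch: resolve the accepted job-name forms
# ("Import/job001/", "job001", "1") to the same job number.
import re

def job_number(name: str):
    match = re.search(r"(\d+)/?$", name.strip())
    return int(match.group(1)) if match else None

for form in ("Import/job001/", "job001", "1"):
    print(job_number(form))
# -> 1 for each form
```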
Other job tools
A variety of other tools exist for modifying jobs in the project. See the API documentation for how to use these functions:
set_alias
- Give a job a more descriptive name
run_cleanup
- Move intermediate files from jobs into the trash to save disk space
delete_job
- Move a job to the trash
undelete_job
- Remove a job from the trash and restore it to the project
empty_trash
- Permanently delete files in the trash
prepare_metadata_report
- Get metadata about an entire project