===============
Getting started
===============

The CCPEM-Pipeliner provides easy access to a variety of software tools for all steps
of cryoEM data processing, from preprocessing through model building and
validation.  The workflow is tracked, and tools are provided for visualising the full
project and analysing the results.

The ccpem-pipeliner serves as the back end for its companion software
`Doppio <https://gitlab.com/ccpem/doppio>`_, which provides a full graphical user interface.

Start with a *Project*
----------------------

The **Project** is contained in a single project directory.

.. note::
    Paths used in the pipeliner, such as paths to files in job parameters, are
    generally relative to this project directory.
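
As a rough illustration of the note above, a project-relative path can be turned into
a full path by joining it to the project directory.  This is a minimal sketch only:
`resolve_in_project` and the example directory `/data/my_project` are hypothetical and
are not part of the Pipeliner API.

.. code-block:: python

    from pathlib import PurePosixPath


    def resolve_in_project(project_dir: str, relative_path: str) -> PurePosixPath:
        """Join a project-relative path to the project directory."""
        path = PurePosixPath(relative_path)
        # Absolute paths are returned unchanged; relative ones are anchored
        # at the project directory.
        return path if path.is_absolute() else PurePosixPath(project_dir) / path


    full_path = resolve_in_project("/data/my_project", "Import/job001/movies.star")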

To create a project or access an existing project using the API:

.. code-block:: python

    from pipeliner.api.manage_project import PipelinerProject

    my_project = PipelinerProject()

To start a new project from the command line:

.. code-block:: console

    $ pipeliner --new_project


A *Project* is made up of *Jobs*
--------------------------------

The project is made up of **Jobs**.  Each job is one type of operation on the data,
although jobs can have several steps.  Jobs are defined by their **jobtype**.  The
format of the jobtype is:

::

    <program>.<function>.<keywords>

with:
 | `<program>`: the main piece of external software used by the job
 | `<function>`: the task being performed
 | `<keywords>`: serve to further differentiate the jobtype; an unlimited number are allowed
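
The jobtype format above can be taken apart with plain string handling.  The helper
below is a hypothetical sketch and not a Pipeliner function; it assumes that the
program and function parts are mandatory and that the keywords may be absent.

.. code-block:: python

    def split_jobtype(jobtype: str) -> tuple[str, str, list[str]]:
        """Split '<program>.<function>.<keywords>' into its parts."""
        parts = jobtype.split(".")
        # Assumption: at least a program and a function, with no empty parts
        if len(parts) < 2 or not all(parts):
            raise ValueError(f"Not a valid jobtype: {jobtype!r}")
        program, function, *keywords = parts
        return program, function, keywords


    program, function, keywords = split_jobtype("relion.autopick.log")

Here `program` is `relion`, `function` is `autopick` and `keywords` holds the single
keyword `log`.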

To get information about a specific job type from the command line:

.. code-block:: console

    $ pipeliner --job_info <job type>

Jobs are written in their own job directories with the format:

::

    <function>/job<nnn>/

with the job number automatically incremented as the project progresses.

.. note::
    A job's directory is also its name, which is used to identify it.
    The job's name, for example `AutoPick/job004/`, **requires** the trailing slash.
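
To make the naming rule concrete, the sketch below builds and checks job names in the
`<function>/job<nnn>/` form.  The helper names are hypothetical, and the three-digit
zero padding is inferred from the `AutoPick/job004/` example rather than taken from
the Pipeliner source.

.. code-block:: python

    import re

    # '<function>/job<nnn>/' with a mandatory trailing slash
    JOB_NAME = re.compile(r"(?P<function>[^/]+)/job(?P<number>\d{3,})/")


    def make_job_name(function: str, number: int) -> str:
        """Build a job name such as 'AutoPick/job004/'."""
        return f"{function}/job{number:03d}/"


    def is_job_name(name: str) -> bool:
        """Return True only if the whole string has the job name form."""
        return JOB_NAME.fullmatch(name) is not None


    name = make_job_name("AutoPick", 4)

Under these assumptions `is_job_name` rejects `AutoPick/job004` because the trailing
slash is missing.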


Jobs are created from parameter files
-------------------------------------

Jobs can be created by reading from either of two types of **Parameter Files**:
`run.job` or `job.star`.

Both files define the jobtype, whether the job is new or a continuation of an old job,
and the parameters or **JobOptions**.

`run.job` files are more verbose and easier to manually edit:

.. code-block:: text

 job_type == relion.autopick.log
 is_continue == false
 Pixel size in micrographs (A) == 1.02
 Pixel size in references (A) == 3.54
 ...
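
The `label == value` layout shown above is simple enough to read with a few lines of
Python.  The reader below is an illustrative sketch based only on this excerpt; it is
not the parser the Pipeliner uses, and `read_runjob_text` is a hypothetical name.

.. code-block:: python

    RUNJOB_TEXT = """\
    job_type == relion.autopick.log
    is_continue == false
    Pixel size in micrographs (A) == 1.02
    Pixel size in references (A) == 3.54
    """


    def read_runjob_text(text: str) -> dict[str, str]:
        """Collect 'label == value' lines into a dict of strings."""
        options = {}
        for line in text.splitlines():
            if "==" not in line:
                continue  # skip blank lines and anything without a separator
            label, value = line.split("==", 1)
            options[label.strip()] = value.strip()
        return options


    options = read_runjob_text(RUNJOB_TEXT)

All values are kept as strings, so `options["is_continue"]` is the text `false` rather
than a boolean.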


`job.star` files have a more complicated format but have the advantage that the 
Pipeliner has functions to dynamically edit them:

.. code-block:: text

 data_job

 _rlnJobTypeLabel          relion.autopick.log
 _rlnJobIsContinue    0
 
 data_joboptions_values
 loop_
 _rlnJobOptionVariable #1 
 _rlnJobOptionValue #2 
 angpix         1.02 
 angpix_ref     3.54 
 ...

.. note::
    `job.star` and `run.job` files can be used interchangeably for almost all
    applications in a project.
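
For comparison, the JobOptions in the `job.star` excerpt can be pulled out in a
similar way.  This is a deliberately simplified sketch, not a full STAR parser: it
assumes one name and one unquoted value per row, and `read_joboptions` is a
hypothetical helper.  Real files are better handled with the Pipeliner's own tools.

.. code-block:: python

    JOBSTAR_TEXT = """\
    data_job

    _rlnJobTypeLabel          relion.autopick.log
    _rlnJobIsContinue    0

    data_joboptions_values
    loop_
    _rlnJobOptionVariable #1
    _rlnJobOptionValue #2
    angpix         1.02
    angpix_ref     3.54
    """


    def read_joboptions(text: str) -> dict[str, str]:
        """Read the rows of the 'data_joboptions_values' loop into a dict."""
        options = {}
        in_block = False
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if line.startswith("data_"):
                # Track whether we are inside the JobOptions data block
                in_block = line == "data_joboptions_values"
                continue
            if not in_block or not line or line == "loop_" or line.startswith("_"):
                continue  # skip other blocks, blank lines and column labels
            name, _, value = line.partition(" ")
            options[name] = value.strip()
        return options


    joboptions = read_joboptions(JOBSTAR_TEXT)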


Getting a `run.job` or `job.star` file
--------------------------------------

A parameter file with the default values for any job can be generated
with :meth:`~pipeliner.api.api_utils.write_default_runjob` or
:meth:`~pipeliner.api.api_utils.write_default_jobstar`

API:

.. code-block:: python

    from pipeliner.api.api_utils import write_default_runjob, write_default_jobstar

    # Each call writes a parameter file containing the job's default values
    write_default_runjob("relion.autopick.log")
    write_default_jobstar("relion.autopick.log")

Command line:

.. code-block:: console

    $ pipeliner --default_runjob relion.autopick.log
    $ pipeliner --default_jobstar relion.autopick.log

This will create the files `relion_autopick_log_job.star` and
`relion_autopick_log_run.job`.
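
Judging from the file names above, the default names appear to be the jobtype with its
dots replaced by underscores, followed by `_run.job` or `_job.star`.  The sketch below
encodes that inferred pattern; it is an assumption drawn from this one example, and
`default_file_names` is a hypothetical helper.

.. code-block:: python

    def default_file_names(jobtype: str) -> tuple[str, str]:
        """Return the (run.job, job.star) file names expected for a jobtype."""
        stem = jobtype.replace(".", "_")
        return f"{stem}_run.job", f"{stem}_job.star"


    runjob_name, jobstar_name = default_file_names("relion.autopick.log")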


Running a job
-------------

With the parameter file created, the job can now be run with
:meth:`~pipeliner.api.manage_project.PipelinerProject.run_job`:

API:

.. code-block:: python

    my_project.run_job("relion_autopick_log_job.star")

Command line:

.. code-block:: console 

    $ pipeliner --run_job relion_autopick_log_job.star

This will create and run the job `AutoPick/job001/`.

Alternatively, a job can be run from a :class:`dict` containing its parameters,
which can be generated with the function :meth:`pipeliner.api.api_utils.job_default_parameters_dict`.
This dict can be edited in place before using it to run a job.

.. code-block:: python

 from pipeliner.api.manage_project import PipelinerProject
 from pipeliner.api.api_utils import job_default_parameters_dict

 proj = PipelinerProject(make_new_project=True)
 params = job_default_parameters_dict("relion.autopick.log")
 params["fn_input_autopick"] = "Path/to/new/input_file.mrc"
 proj.run_job(params)


Continuing a job
----------------

Some jobs can be continued from where they finished.  When a job is run a file 
`continue_job.star` is written in its job directory.  This file contains only the
parameters that are allowed to be modified when the job is continued.  Edit this
file if any parameters need to be changed and then continue the job with:

API:

.. code-block:: python

    my_project.continue_job("AutoPick/job001/")

Command line:

.. code-block:: console 

    $ pipeliner --continue_job AutoPick/job001/

.. note::
    The job's full name is used to continue it, *not* the name of the
    `continue_job.star` file.

Submitting jobs to a queue
--------------------------

Pipeliner jobs can be submitted to a queuing system using a submission script template
that incorporates values from the job's JobOptions.

The pipeliner will update variables bracketed by `XXX` in the submission template
from the job's JobOptions and then run the submission script using the command
specified in the job's `qsub` JobOption.

.. list-table:: Template variables updated from JobOptions
    :header-rows: 1

    * - Script Variable
      - JobOption
      - GUI Field
    * - XXXmpinodesXXX :sup:`see note`
      - nr_mpi
      - Number of MPI procs:
    * - XXXthreadsXXX :sup:`see note`
      - nr_threads
      - Number of threads:
    * - XXXdedicatedXXX
      - min_dedicated
      - Minimum dedicated cores per node:
    * - XXXqueueXXX
      - queuename
      - Queue name:
    * - XXXextra1XXX
      - qsub_extra_1
      - Set from environment variable PIPELINER_QSUB_EXTRA1
    * - XXXextra2XXX
      - qsub_extra_2
      - Set from environment variable PIPELINER_QSUB_EXTRA2
    * - XXXextra3XXX
      - qsub_extra_3
      - Set from environment variable PIPELINER_QSUB_EXTRA3
    * - XXXextra4XXX
      - qsub_extra_4
      - Set from environment variable PIPELINER_QSUB_EXTRA4

There are some additional variables available for submission scripts that are not
drawn from the JobOptions:

.. list-table:: Additional template variables
    :header-rows: 1

    * - Script Variable
      - Substitution
    * - XXXnameXXX
      - The job's name; the same as its output directory
    * - XXXcoresXXX
      - The number of MPI processes multiplied by the number of threads
    * - XXXerrfileXXX
      - Path to the job's run.err file
    * - XXXoutfileXXX
      - Path to the job's run.out file
    * - XXXcommandXXX :sup:`see note`
      - The full commands list for the job.

.. note::
    The variable `XXXcommandXXX` will already have the mpirun command specified by the
    `mpirun_com` JobOption prepended to commands where necessary, so the mpirun command
    generally does NOT need to be included in the submission script.  The default for
    `mpirun_com` is `mpi_run -n XXXmpinodesXXX`, meaning the `XXXmpinodesXXX` variable
    generally also does not need to be included in the commands section of the
    submission script template.  Similarly, the number of threads used by a job is
    usually set in the commands, so the `XXXthreadsXXX` variable rarely needs to be used.

The submission script template must be written for your specific system. Here is an
example submission script template for a cluster running SLURM:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --ntasks=XXXmpinodesXXX
    #SBATCH --partition=XXXqueueXXX
    #SBATCH --cpus-per-task=XXXthreadsXXX
    #SBATCH --error=XXXerrfileXXX
    #SBATCH --output=XXXoutfileXXX
    #SBATCH --gres=gpu:2

    XXXcommandXXX
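
The substitution described in this section can be pictured with the following minimal
sketch.  It is not the Pipeliner's own implementation: `fill_template` and
`template_values` are hypothetical helpers, the shortened template and the
`echo hello` command are placeholders, and unknown variables are simply left untouched.

.. code-block:: python

    import re

    TEMPLATE = """\
    #!/bin/bash
    #SBATCH --ntasks=XXXmpinodesXXX
    #SBATCH --cpus-per-task=XXXthreadsXXX
    #SBATCH --partition=XXXqueueXXX

    XXXcommandXXX
    """


    def template_values(nr_mpi: int, nr_threads: int, queuename: str, command: str) -> dict:
        """Map template variable names to values taken from the JobOptions."""
        return {
            "mpinodes": nr_mpi,
            "threads": nr_threads,
            # XXXcoresXXX is the number of MPI processes times the threads
            "cores": nr_mpi * nr_threads,
            "queue": queuename,
            "command": command,
        }


    def fill_template(template: str, values: dict) -> str:
        """Replace every XXXnameXXX variable that has a known value."""

        def substitute(match: re.Match) -> str:
            # Leave the variable as written if no value is available
            return str(values.get(match.group(1), match.group(0)))

        return re.sub(r"XXX(\w+?)XXX", substitute, template)


    script = fill_template(TEMPLATE, template_values(4, 2, "gpu", "echo hello"))

With these example values the filled script requests 4 tasks with 2 CPUs each, and
`XXXcoresXXX` would be replaced by 8 if the template used it.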


Modifying parameters
--------------------

The Python API can modify `job.star` parameter files on the fly using
:meth:`~pipeliner.api.api_utils.edit_jobstar`. This avoids manual editing
of the parameter files when stringing together multiple jobs:

.. code-block:: python

    from pipeliner.api.api_utils import edit_jobstar, write_default_jobstar

    # Write a default parameter file, set the input, then run the job
    movie_jobstar = write_default_jobstar("relion.import.movies")
    edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"}, movie_jobstar)
    movie_job = my_project.run_job(movie_jobstar)

    # Build the next job's input path from the first job's output directory
    mocorr_jobstar = write_default_jobstar("relion.motioncorr.own")
    edit_jobstar(mocorr_jobstar, {"input_star_mics": movie_job.output_dir + "movies.star"}, mocorr_jobstar)
    mocorr_job = my_project.run_job(mocorr_jobstar)

Alternatively, this can be done solely with dicts:

.. code-block:: python

    from pipeliner.api.api_utils import job_default_parameters_dict

    import_params = job_default_parameters_dict("relion.import.movies")
    import_params["fn_in_raw"] = "Movies/*.mrcs"
    movie_job = my_project.run_job(import_params)

    mocorr_params = job_default_parameters_dict("relion.motioncorr.own")
    # Build the input path from the first job's output directory
    mocorr_params["input_star_mics"] = movie_job.output_dir + "movies.star"
    mocorr_job = my_project.run_job(mocorr_params)

Running schedules
-----------------

Scheduling allows sets of jobs to be run multiple times via
:meth:`~pipeliner.api.manage_project.PipelinerProject.schedule_job` and
:meth:`~pipeliner.api.manage_project.PipelinerProject.run_schedule`.

.. note::
    When a job is scheduled, placeholder files are created for all of its outputs,
    so these files can be used as if they already exist.

The following example runs the same jobs as above, but uses the scheduling functions
to run the set of import and motion correction jobs 10 times:

API:

.. code-block:: python

    from pipeliner.api.manage_project import PipelinerProject
    from pipeliner.api.api_utils import write_default_jobstar, edit_jobstar

    my_project = PipelinerProject(make_new_project=True)

    movie_jobstar = write_default_jobstar("relion.import.movies")
    edit_jobstar(movie_jobstar, {"fn_in_raw": "Movies/*.mrcs"}, movie_jobstar)
    movie_job = my_project.schedule_job(movie_jobstar)

    mocorr_jobstar = write_default_jobstar("relion.motioncorr.own")
    edit_jobstar(mocorr_jobstar, {"input_star_mics": movie_job.output_dir + "movies.star"}, mocorr_jobstar)
    mocorr_job = my_project.schedule_job(mocorr_jobstar)

    my_project.run_schedule(
        fn_sched="my_schedule",
        job_ids=[movie_job.output_dir, mocorr_job.output_dir],
        nr_repeat=10,
    )


To accomplish this from the command line, the parameter files for the Import and
MotionCorr jobs must already have been created with the correct file names as inputs:

.. code-block:: console 

    $ pipeliner --schedule_job <import job param file>
    $ pipeliner --schedule_job <motion corr job param file>
    $ pipeliner --run_schedule --name my_schedule --jobs job001 job002 --nr_repeat 10

.. note::
    The command line tool intelligently parses job names, so for the job named
    `Import/job001/` it would accept `job001` or `1` as well as the full job name.
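
The kind of matching described in the note can be sketched as follows.  This is only an
illustration of the idea, not the command line tool's actual parser; `resolve_job_name`
is a hypothetical helper and the job list is example data.

.. code-block:: python

    import re


    def resolve_job_name(query: str, job_names: list[str]) -> str:
        """Return the full job name for a full name, 'jobNNN' or a bare number."""
        if query in job_names:
            return query
        match = re.fullmatch(r"(?:job)?(\d+)", query)
        if match is None:
            raise ValueError(f"Cannot interpret {query!r} as a job")
        number = int(match.group(1))
        # Compare job numbers as integers so that '1' and 'job001' are equivalent
        hits = [
            name
            for name in job_names
            if int(name.rstrip("/").rsplit("job", 1)[-1]) == number
        ]
        if len(hits) != 1:
            raise ValueError(f"{query!r} matches {len(hits)} jobs")
        return hits[0]


    jobs = ["Import/job001/", "MotionCorr/job002/"]
    resolved = [resolve_job_name(q, jobs) for q in ("Import/job001/", "job001", "1")]

All three queries resolve to the same full job name in this sketch.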


Other job tools
---------------

A variety of other tools exist for modifying jobs in the project.  See the API
documentation for how to use these functions:

 - :meth:`~pipeliner.api.manage_project.PipelinerProject.set_alias` - Give a job a
   more descriptive name
 - :meth:`~pipeliner.api.manage_project.PipelinerProject.run_cleanup` - Move intermediate
   files from jobs into the trash to save disk space
 - :meth:`~pipeliner.api.manage_project.PipelinerProject.delete_job` - Move a job to the
   trash
 - :meth:`~pipeliner.api.manage_project.PipelinerProject.undelete_job` - Remove a job from
   the trash and restore it to the project
 - :meth:`~pipeliner.api.manage_project.PipelinerProject.empty_trash` - Permanently delete
   files in the trash
 - :meth:`~pipeliner.api.manage_project.PipelinerProject.prepare_metadata_report` - Get metadata
   about an entire project


Logging
-------

Logging in the pipeliner is performed using Python's standard :mod:`logging` module.
If you are using the pipeliner as a library, propagation of log messages can be
disabled like this:

.. code-block:: python

 import logging

 # Stop pipeliner log messages from reaching the root logger's handlers
 pipeliner_logger = logging.getLogger("pipeliner")
 pipeliner_logger.propagate = False
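
With propagation disabled, messages are only emitted by handlers attached to the
`pipeliner` logger itself.  The sketch below uses only the standard :mod:`logging`
module to attach such a handler; the in-memory stream, the format string and the
`WARNING` threshold are arbitrary choices made for illustration.

.. code-block:: python

    import io
    import logging

    pipeliner_logger = logging.getLogger("pipeliner")
    pipeliner_logger.propagate = False

    # Send pipeliner messages to a dedicated handler instead of the root logger
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    pipeliner_logger.addHandler(handler)
    pipeliner_logger.setLevel(logging.WARNING)

    pipeliner_logger.info("dropped: below the WARNING threshold")
    pipeliner_logger.warning("example message")

    captured = stream.getvalue()

In an application the `StringIO` stream would typically be replaced by a file or
another destination of your choice.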