General Utilities

These utilities are used by the pipeliner for basic tasks such as nice-looking on-screen display, checking file names, and getting directory and file names.

class pipeliner.utils.DirectoryBasedLock(dirname: str | PathLike[str] = '.relion_lock', timeout=60.0)

Bases: object

A lock based on the creation and existence of a directory on the file system.

The interface is almost the same as Python’s standard multiprocessing.Lock, except for some changes related to timeout behaviour:

There is a default timeout of 60 seconds when acquiring the lock (rather than the default None value, with corresponding infinite timeout, that is used by multiprocessing.Lock). This is for compatibility with previous RELION locking timeout behaviour.
A timeout for use when entering a context manager can be set when the lock object is created. Note that this value is ignored if the acquire() method is called directly. If there is a timeout waiting to acquire the lock when entering a context manager, a TimeoutError is raised.

The principle of this lock is that directory creation is an atomic operation provided by the file system, even in (most, modern) networked file systems. If several processes try to create the same directory at the same time, only one will succeed and the rest will get an error. Therefore, we can use this as a locking primitive, acquiring the lock if we successfully create the directory and releasing it by deleting the directory afterwards.

The lock directory name can be set if required. For compatibility with RELION, the default directory name is “.relion_lock”.
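The locking principle described above can be sketched with nothing but os.mkdir() and os.rmdir(). This is an illustration, not the pipeliner implementation: the helper names try_acquire and release and the poll_interval parameter are hypothetical, and polling at a fixed interval is an assumption.

```python
import os
import tempfile
import time


def try_acquire(lock_dir, timeout=60.0, poll_interval=0.1):
    """Return True once lock_dir has been created, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.mkdir(lock_dir)  # atomic: only one caller can succeed
            return True
        except FileExistsError:
            # Another process holds the lock; retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval)


def release(lock_dir):
    """Release the lock by removing the directory."""
    os.rmdir(lock_dir)


with tempfile.TemporaryDirectory() as project_dir:
    lock_dir = os.path.join(project_dir, ".relion_lock")
    first = try_acquire(lock_dir, timeout=1.0)   # directory created
    second = try_acquire(lock_dir, timeout=0.2)  # already held, times out
    release(lock_dir)
    third = try_acquire(lock_dir, timeout=1.0)   # free again
    release(lock_dir)
```

Because the lock is just a directory, a process that crashes while holding it leaves the directory behind, which is one reason a finite default timeout is useful.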

acquire(block=True, timeout=60.0)

Acquire a lock, blocking or non-blocking.

With the block argument set to True (the default), the method call will block until the lock is in an unlocked state, then set it to locked and return True.

With the block argument set to False, the method call does not block. If the lock is currently in a locked state, return False; otherwise set the lock to a locked state and return True.

When invoked with a positive, floating-point value for timeout, block for at most the number of seconds specified by timeout as long as the lock can not be acquired. The default is 60.0 seconds; note that this is different from the default timeout in multiprocessing.Lock.acquire().

Invocations with a negative value for timeout are equivalent to a timeout of zero. Invocations with a timeout value of None set the timeout period to infinite. The timeout argument has no practical implications if the block argument is set to False and is thus ignored.

Returns: True if the lock has been acquired or False if the timeout period has elapsed.

Raises: Various possible errors from os.mkdir(), including FileNotFoundError or PermissionError.

release()

Release the lock.

This can be called from any thread, not only the thread which has acquired the lock.

When the lock is locked, reset it to unlocked, and return. If any other threads are blocked waiting for the lock to become unlocked, allow exactly one of them to proceed.

When invoked on an unlocked lock, a RuntimeError is raised.

There is no return value.

pipeliner.utils.atomic_write_json(obj: object, filename: str, **json_dump_kwargs) → None

Save an object to JSON as an atomic operation.

This ensures any processes reading the JSON file will always see a valid version (old or new) and not a half-written new file.

This function writes the JSON to a temporary file first, then renames it after writing is complete and flushed to disk. The temporary file will always be removed even if the write-and-rename is unsuccessful.

Pass additional keyword arguments for the json.dump() command as keyword arguments to this function.

Note: using this function avoids race conditions caused by processes reading the file when it is partly-written by another process. It does not avoid possible problems caused by two processes writing to the same file at the same time. In that case, one version will be kept and the other will not, but there is no guarantee about which one will be kept. Avoiding that problem would require some kind of file locking, versioning or time stamping to identify conflicting or simultaneous edits.

Parameters:

obj – The object to save to JSON
filename – The file name or path to save to
json_dump_kwargs – Keyword arguments to be passed through to json.dump()
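The write-then-rename pattern described above might look roughly like the following. This is a sketch under stated assumptions rather than the pipeliner's own code: the function name atomic_write_json_sketch is hypothetical, and the use of tempfile.mkstemp() and os.replace() is one reasonable way to get the behaviour, not necessarily the one used.

```python
import json
import os
import tempfile


def atomic_write_json_sketch(obj, filename, **json_dump_kwargs):
    """Write obj as JSON to filename via a temporary file and a rename."""
    target_dir = os.path.dirname(os.path.abspath(filename))
    # Create the temporary file in the target directory so that the final
    # rename stays on one file system.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(obj, tmp_file, **json_dump_kwargs)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # make sure the data is on disk
        os.replace(tmp_name, filename)  # atomic replacement of the target
    finally:
        # After a successful rename the temporary name no longer exists.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)


with tempfile.TemporaryDirectory() as work_dir:
    path = os.path.join(work_dir, "settings.json")
    atomic_write_json_sketch({"a": 1}, path, indent=2)
    atomic_write_json_sketch({"a": 2}, path, indent=2)
    with open(path) as f:
        loaded = json.load(f)
    leftovers = [n for n in os.listdir(work_dir) if n.endswith(".tmp")]
```

A reader that opens the file at any moment sees either the first or the second version in full, because the target name is only ever swapped to a completely written file.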

pipeliner.utils.check_for_illegal_symbols(check_string: str, string_name: str = 'input', exclude: str = '') → str | None

Check a text string doesn’t have any of the disallowed symbols.

Illegal symbols are !*?()^/#<>&%{}$."' and @.

Parameters:

check_string (str) – The string to be checked
string_name (str) – The name of the string being checked; for more informative error messages
exclude (str) – Any symbols that are normally in the illegal symbols list but should be allowed.

Returns:

An error message if any illegal symbols are present, otherwise None

Return type:

str | None

pipeliner.utils.clean_job_dirname(dirname: str) → str

Makes sure a pipeline job_dir name is valid and in the right format

Parameters: dirname (str) – The dirname to check
Returns: The correctly formatted dirname
Return type: str
Raises: ValueError – If the dir name cannot be formatted correctly

pipeliner.utils.clean_jobname(jobname: str) → str

Makes sure job names are in the correct format

Job names must have a trailing slash, cannot begin with a slash, and have no illegal characters

Parameters: jobname (str) – The job name to be checked
Returns: The job name, with corrections if necessary
Return type: str

pipeliner.utils.compare_dicts(a_dict: dict, e_dict: dict, tolerance: float | None = None) → bool

Compare two dictionaries, with an optional tolerance for float values.

If the items in the dictionaries are all dicts, compare their contents with tolerance applied to float values.

Parameters:

a_dict (dict) – the actual dictionary
e_dict (dict) – the expected dictionary
tolerance (float) – The relative tolerance for float values, i.e. to compare as equivalent, the values must differ by less than this value multiplied by the value from a_dict. Use None or 0 if the values must match exactly.

Returns:

do they match (within tolerance if specified)

Return type:

bool

pipeliner.utils.compare_nested_lists(a_list: list, e_list: list, tolerance: float | None = None)

Compare two nested lists, with an optional tolerance for float values.

If the items in the lists are all dicts, recursively compare their contents with tolerance applied to float values.

If a tolerance is given, list items will be converted to floats and then compared to check if their absolute difference is less than tolerance multiplied by the value from a_list.

Otherwise the individual items in each list will be compared with ==.

Parameters:

a_list (list) – the actual list
e_list (list) – the expected list
tolerance (float) – The relative tolerance for float values, i.e. to compare as equivalent, the values must differ by less than this value multiplied by the value from a_list. Use None or 0 if the values must match exactly.

Returns:

do they match (within tolerance if specified)

Return type:

bool
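The relative-tolerance rule can be illustrated with a small recursive comparison. This is only a sketch: the names values_match and nested_lists_match are hypothetical, taking the absolute value of the a_list entry (so that negative numbers behave sensibly) is an assumption, and dicts inside the lists are not handled here.

```python
def values_match(actual, expected, tolerance=None):
    """Compare two leaf values, optionally with a relative tolerance."""
    if not tolerance:  # None or 0: the values must match exactly
        return actual == expected
    try:
        actual_f, expected_f = float(actual), float(expected)
    except (TypeError, ValueError):
        # Not numeric, so fall back to an ordinary equality check.
        return actual == expected
    # The difference must be smaller than tolerance times the actual value.
    return abs(actual_f - expected_f) < tolerance * abs(actual_f)


def nested_lists_match(a_list, e_list, tolerance=None):
    """Recursively compare two (possibly nested) lists."""
    if len(a_list) != len(e_list):
        return False
    for actual, expected in zip(a_list, e_list):
        if isinstance(actual, list) and isinstance(expected, list):
            if not nested_lists_match(actual, expected, tolerance):
                return False
        elif not values_match(actual, expected, tolerance):
            return False
    return True


close = nested_lists_match([[1.0, 2.0], [3.0]], [[1.001, 2.0], [3.0]], tolerance=0.01)
exact = nested_lists_match([[1.0, 2.0], [3.0]], [[1.001, 2.0], [3.0]])
far = nested_lists_match([[1.0]], [[1.5]], tolerance=0.01)
```

With a strict "less than" test, two values that are both exactly zero do not match when a non-zero tolerance is given; whether the real functions treat that case specially is not stated here.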

pipeliner.utils.convert_relative_filename(filename: str) → str

Convert a filename that is relative to the project to just its name

For example:

../../my_dir/my_file.txt -> my_dir/my_file
/my_dir/my_file.txt -> my_dir/my_file
~/my_dir/my_file.txt -> my_dir/my_file

Parameters: filename
Returns: The part of the file path that is not relative to the project
Return type: str

pipeliner.utils.count_file_lines(filename: str) → int

Fast and efficient count of number of lines in a file

Parameters: filename (str) – Name of the file to count the lines in
Returns: Number of lines
Return type: int
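One common way to count lines quickly is to read the file in large binary chunks and count newline bytes, which avoids decoding the text or building a list of lines. The sketch below assumes that approach; the name count_lines_sketch and the chunk_size parameter are hypothetical, and the actual function may work differently.

```python
import os
import tempfile


def count_lines_sketch(filename, chunk_size=1024 * 1024):
    """Count newline characters by reading the file in binary chunks."""
    count = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


with tempfile.TemporaryDirectory() as work_dir:
    path = os.path.join(work_dir, "example.star")
    with open(path, "w") as f:
        f.write("data_\nloop_\n_rlnExample\n")
    n_lines = count_lines_sketch(path)
```

Note that this variant counts newline characters, so a final line without a trailing newline is not included.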

pipeliner.utils.date_time_tag(compact: bool = False) → str

Get a current date and time tag

It can return a compact version or one that is easier to read

Parameters:

compact (bool) – Should the returned tag be in the compact form

Returns:

The datetime tag

compact format is: YYYYMMDDHHMMSS

verbose form is: YYYY-MM-DD HH:MM:SS.MS

Return type:

str
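The two formats can be reproduced with datetime.strftime(). In this sketch the function name and the now parameter (added so the output is reproducible) are hypothetical, and rendering the ".MS" part as six-digit microseconds is an assumption.

```python
from datetime import datetime


def date_time_tag_sketch(compact=False, now=None):
    """Return a timestamp tag in compact or verbose form."""
    now = now or datetime.now()
    if compact:
        return now.strftime("%Y%m%d%H%M%S")
    # Assumption: fractional seconds are shown as microseconds.
    return now.strftime("%Y-%m-%d %H:%M:%S.%f")


fixed = datetime(2024, 3, 5, 14, 7, 9, 250000)
compact_tag = date_time_tag_sketch(compact=True, now=fixed)
verbose_tag = date_time_tag_sketch(now=fixed)
# compact_tag is "20240305140709"
# verbose_tag is "2024-03-05 14:07:09.250000"
```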

pipeliner.utils.decompose_pipeline_filename(fn_in: str) → Tuple[str, int, str]

Breaks a job name into usable pieces

Returns everything before the job number, the job number as an int, and everything after the job number. This works for file names up to 20 directories deep; the 20-directory limit comes from the RELION code but is not really necessary any more.

Parameters:

fn_in (str) – The job or file name to be broken down in the format: <jobtype>/jobxxx/<filename>

Returns:

The decomposed file name: (str, int, str)

[0] Everything before ‘job’ in the file name

[1] The job number

[2] Everything after the job number

Return type:

tuple

Raises:

ValueError – If the input file name is more than 20 directories deep
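As an illustration of the decomposition, a regular expression can split a name of the form <jobtype>/jobxxx/<filename> into its three pieces. This is a hypothetical sketch: whether the real function keeps the trailing slash in the first element is an assumption, the 20-directory check is omitted, and raising ValueError when no job number is found is a choice made for this example only.

```python
import re

# Lazily match everything up to the first "job<digits>" path component.
_JOB_PATTERN = re.compile(r"^(.*?)job(\d+)(?:/(.*))?$")


def decompose_sketch(fn_in):
    """Split a pipeline file name into (before, job number, after)."""
    match = _JOB_PATTERN.match(fn_in)
    if match is None:
        raise ValueError(f"No job number found in {fn_in!r}")
    before, number, after = match.groups()
    return before, int(number), after or ""


parts = decompose_sketch("Import/job001/movies.star")
job_only = decompose_sketch("Class2D/job012/")
```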

pipeliner.utils.file_in_project(filename: str) → bool

Check that a file is part of the project

Not done with os.path.abspath(file).startswith(project_dir) because this causes errors during testing

pipeliner.utils.find_common_string(input_strings: List[str]) → str

Find the common part of a list of strings starting from the beginning

Parameters: input_strings (list) – List of strings to compare
Returns: The common portion of the strings
Return type: str
Raises: ValueError – If input_strings contains fewer than 2 strings
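A common prefix of this kind can be found with the standard library's os.path.commonprefix(), which compares strings character by character despite its name. The sketch below assumes that behaviour is what is wanted; find_common_string_sketch is a hypothetical name.

```python
import os


def find_common_string_sketch(input_strings):
    """Return the longest prefix shared by all the input strings."""
    if len(input_strings) < 2:
        raise ValueError("At least two strings are needed for a comparison")
    # commonprefix works character by character, not per path component.
    return os.path.commonprefix(input_strings)


common = find_common_string_sketch(
    ["Class2D/job003/run_it025", "Class2D/job003/run_it020"]
)
# common is "Class2D/job003/run_it02"
```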

pipeliner.utils.format_string_to_type_objs(in_str: str) → str | int | float | bool | None

Returns Int, Float, Bool, and None Objects from strings

Any number with a decimal point, in scientific notation, or ‘NaN’ will return a float
Any other number will return an int
‘False’ or ‘false’ returns False
‘True’ or ‘true’ returns True
‘None’ returns a NoneType object

Parameters: in_str (str) – The input string
Returns: The appropriate object
Return type: Optional[Union[str, int, float, bool]]
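The conversion rules above can be approximated by trying Python's own parsers in order. This is a sketch with a hypothetical name, to_typed_object, and it is slightly more permissive than the rules as written (for example, int() tolerates surrounding whitespace and float() also accepts 'inf').

```python
import math


def to_typed_object(in_str):
    """Convert a string to bool, None, int or float where possible."""
    if in_str in ("True", "true"):
        return True
    if in_str in ("False", "false"):
        return False
    if in_str == "None":
        return None
    try:
        return int(in_str)  # plain integers
    except ValueError:
        pass
    try:
        return float(in_str)  # decimals, scientific notation, NaN
    except ValueError:
        return in_str  # anything else stays a string


converted = [to_typed_object(s) for s in ("3", "3.0", "1e3", "true", "None", "abc")]
nan_value = to_typed_object("NaN")
```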

pipeliner.utils.get_directory_info(dir_path: str | Path) → List[Tuple[str, str]]

pipeliner.utils.get_directory_size(directory_path: str | Path) → int

Get the size of a directory

Parameters: directory_path (str) – The dir to check
Returns: The size in bytes
Return type: int

pipeliner.utils.get_file_size_mb(file: str | Path) → float

Get the size of a file in MB, rounded to 2 decimal places

Parameters: file (str) – The file to check

pipeliner.utils.get_job_number(job_name)

Get the job number from a pipeliner job as an int

Parameters: job_name (str) – The job name in the pipeliner format
Returns: The job number
Return type: int

pipeliner.utils.get_job_runner_command() → List[str]

Get the full command to run the job_runner.py script.

pipeliner.utils.get_job_script(name: str) → str

Get the full path to a job script file.

Returns: The job script file, if it exists.
Raises: FileNotFoundError – if the named job script cannot be found.

pipeliner.utils.get_mrc_map_header_string(mapfile: str) → str

pipeliner.utils.get_mrc_map_summary(mapfile: str, use_data: bool = False) → Dict[str, Tuple | int | float]

pipeliner.utils.get_percentile_from_emdb_json(metric: str, score: float | int, resolution: float | None = None) → str

Calculate the percentile rank of a score against structures from the EMDB.

If a resolution is provided, the percentile is relative to structures in a range of similar resolutions, otherwise it is relative to all structures.

Parameters:

metric – The name of the metric, i.e. one of the keys in validation_distribution_histograms.json.
score – The score to calculate a percentile for.
resolution – The resolution of this structure (optional)

Returns:

A string with the percentile (0.0 = worst, 100.0 = best) and the relevant resolution range that was used (e.g. “73.4 (1.0-2.0 Å)”), or an empty string if the percentile could not be determined.

pipeliner.utils.get_pipeliner_root() → Path

Get the directory of the main pipeliner module

Returns: The path of the pipeliner
Return type: Path

pipeliner.utils.get_python_command() → List[str]

Get the command to launch the current Python interpreter.

Note that the command is returned as a list and might include some arguments as well as the command itself.

pipeliner.utils.get_regenerate_results_command() → List[str]

Get the full command to run the regenerate_results.py script.

pipeliner.utils.get_unique_basenames(files: List[str], include_ext=True) → Dict[str, str]

Increment the base names of files if there are duplicates

e.g. for Import/job001/myfile, Import/job002/myfile

job001_myfile and job002_myfile are set as unique names in the output dict

If the files have different extensions:

e.g. for Import/job001/myfile.ext1, Import/job002/myfile.ext2

myfile.ext1 and myfile.ext2 are returned in the output dict. If include_ext is False, job001_myfile and job002_myfile are set.

Parameters:

files (List[str]) – The files to operate on
include_ext (bool) – Include file extension in unique basename check? True by default

Returns:

The file name and its incremented basename

Return type:

Dict[str, str]

pipeliner.utils.is_uuid4(in_str: str) → bool

Check that a string is a UUID4

Parameters: in_str (str) – The string to test
Returns: Whether the string is a valid UUID4
Return type: bool
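A check of this kind can be built on the standard uuid module by parsing the string and inspecting its version field. The sketch below uses a hypothetical name, looks_like_uuid4, and is lenient in the same ways as uuid.UUID() itself (for instance, it also accepts the string without hyphens); the real function may be stricter.

```python
import uuid


def looks_like_uuid4(in_str):
    """Return True if in_str parses as a version 4 UUID."""
    try:
        parsed = uuid.UUID(in_str)
    except (ValueError, AttributeError, TypeError):
        return False
    # Do not pass version=4 to uuid.UUID: that would overwrite the
    # version bits instead of checking them.
    return parsed.version == 4


random_ok = looks_like_uuid4(str(uuid.uuid4()))
time_based = looks_like_uuid4(str(uuid.uuid1()))
not_a_uuid = looks_like_uuid4("job001")
```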

pipeliner.utils.launch_detached_process(command: List[str], timeout: float, **popen_kwargs) → int | None

Run the given command as a detached process.

The process is started in a new session and with all file handles set to null, to ensure it keeps running in the background after the parent Python process exits.

Parameters:

command (List[str]) – The commands to execute
timeout (float) – How many seconds to wait to check if the command failed early
popen_kwargs – Additional keyword arguments to be passed to subprocess.Popen

Returns:

The process’s return code, if available, or None if it is still running after timeout seconds

Return type:

int | None
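The behaviour described above maps closely onto subprocess.Popen options. The following is a minimal sketch, assuming start_new_session=True and subprocess.DEVNULL are used to detach the child; the name launch_detached_sketch is hypothetical and the real implementation may differ in detail.

```python
import subprocess
import sys


def launch_detached_sketch(command, timeout, **popen_kwargs):
    """Start command in its own session and wait briefly for early failure."""
    process = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the parent's session
        **popen_kwargs,
    )
    try:
        # A return code here means the process finished (or failed) early.
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # still running in the background


finished = launch_detached_sketch([sys.executable, "-c", "pass"], timeout=30)
failed = launch_detached_sketch(
    [sys.executable, "-c", "import sys; sys.exit(3)"], timeout=30
)
still_running = launch_detached_sketch(
    [sys.executable, "-c", "import time; time.sleep(1)"], timeout=0.1
)
```

A None result only means the process had not exited within the timeout; it says nothing about whether the command eventually succeeds.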

pipeliner.utils.make_atomic_model_summary(modelfile: str, summary_file: str)

pipeliner.utils.make_pretty_header(text: str, char: str = '-=', top: bool = True, bottom: bool = True)

Make nice looking headers for on-screen display

Parameters:

text (str) – The text to put in the header
char (str) – What characters to use for the header
top (bool) – Put a border on the top?
bottom (bool) – Put a border on the bottom?

Returns:

A nice looking header

Return type:

str

pipeliner.utils.make_pretty_size(size_b: int) → str

Get a file/dir size in human-readable form

Parameters: size_b (int) – The size in bytes
Returns: The size in B, KB, MB, GB, or TB as appropriate
Return type: str

pipeliner.utils.print_nice_columns(list_in: List[str], err_msg: str = 'ERROR: No items in input list')

Takes a list of items and makes three columns for nicer on-screen display

Parameters:

list_in (List[str]) – The list to display in columns
err_msg (str) – The message to display if the list is empty

pipeliner.utils.run_subprocess(*args, **kwargs) → CompletedProcess

pipeliner.utils.str_is_hex_colour(in_string, allow_0x: bool = False) → bool

Test that a string is a hexadecimal colour code

Valid codes consist of a # symbol or ‘0x’ followed by exactly six hexadecimal digits (0-9 or a-f, lower or upper case).

Parameters:

in_string (str) – The string to test
allow_0x (bool) – Also allow ‘0x’ style codes

Returns:

is it a valid colour code?

Return type:

bool
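The validity rule above is straightforward to express as a regular expression. This sketch uses the hypothetical name is_hex_colour_sketch and assumes that only a lowercase '0x' prefix is accepted, which is not stated explicitly.

```python
import re

_HASH_CODE = re.compile(r"#[0-9a-fA-F]{6}")
_ZERO_X_CODE = re.compile(r"0x[0-9a-fA-F]{6}")


def is_hex_colour_sketch(in_string, allow_0x=False):
    """Return True for '#RRGGBB' codes, and '0xRRGGBB' codes if allowed."""
    if _HASH_CODE.fullmatch(in_string):
        return True
    return bool(allow_0x and _ZERO_X_CODE.fullmatch(in_string))


results = {
    "#1a2B3c": is_hex_colour_sketch("#1a2B3c"),
    "#12345": is_hex_colour_sketch("#12345"),
    "#GGGGGG": is_hex_colour_sketch("#GGGGGG"),
    "0xFFFFFF": is_hex_colour_sketch("0xFFFFFF"),
    "0xFFFFFF allowed": is_hex_colour_sketch("0xFFFFFF", allow_0x=True),
}
```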

pipeliner.utils.subprocess_popen(*args, **kwargs) → Popen

pipeliner.utils.touch(filename: str)

Create an empty file

Parameters: filename (str) – The name for the file to create

pipeliner.utils.truncate_number(number: float, maxlength: int) → str

Return a number with no more than maxlength decimal places and no trailing 0s

This is used to format numbers in exactly the same way as RELION does. For example, with maxlength 3: 1.2000 -> 1.2, 1.0 -> 1, 1.23 -> 1.23. RELION commands are happy to accept numbers with any number of decimal places or trailing 0s. This function is just to maintain continuity between RELION and pipeliner commands.

Parameters:

number (float) – The number to be truncated
maxlength (int) – The maximum number of decimal places
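The formatting in the examples above can be reproduced with fixed-point formatting followed by stripping trailing zeros. This is a sketch with a hypothetical name, truncate_number_sketch; note that fixed-point formatting rounds the last digit, and whether the real function rounds or truncates is not stated here.

```python
def truncate_number_sketch(number, maxlength):
    """Format number with at most maxlength decimals and no trailing 0s."""
    text = f"{number:.{maxlength}f}"
    if "." in text:
        # Only strip zeros after a decimal point, so that 100 stays "100".
        text = text.rstrip("0").rstrip(".")
    return text


formatted = [truncate_number_sketch(x, 3) for x in (1.2000, 1.0, 1.23)]
# formatted is ["1.2", "1", "1.23"]
whole = truncate_number_sketch(100.0, 0)
```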

pipeliner.utils.update_jobinfo_file(jobdir: str, action: str | None = None, comment: str | None = None, command_list: List[str] | None = None) → None

Update the file in the jobdir that stores info about the job

Parameters:

jobdir – The job that contains the file to be updated
action (str) – what action was performed on the job, e.g. Run, Scheduled, Cleaned up
comment (str) – Comment to append to the job’s comments list
command_list (list) – Commands that were run. Generally None if action was any other than Run or Scheduled

pipeliner.utils.wrap_text(text_string: str)

Produces <= 55 character wide wrapped text for on-screen display

Parameters: text_string (str) – The text to be displayed
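Wrapping to a fixed width is what the standard textwrap module does, so a comparable result can be sketched as follows. The name wrap_text_sketch and the width parameter are hypothetical, and returning the wrapped string (rather than printing it) is an assumption.

```python
import textwrap


def wrap_text_sketch(text_string, width=55):
    """Wrap text so that no line is longer than width characters."""
    return textwrap.fill(text_string, width=width)


sample = (
    "The pipeliner prints long status messages to the terminal, "
    "so wrapping keeps them readable on narrow displays."
)
wrapped = wrap_text_sketch(sample)
```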

pipeliner.utils.write_script_json(script_path: str, data: List[str], json_path: str) → None

This is a utility for creating a visualisation script node (NODE_VISUALISATIONSCRIPT) as a JSON file containing the path to the script and the list of associated data files to be opened with a molecular graphics (MG) tool.

Parameters:

script_path – script file to be run with the MG tool
data – list of data files (e.g. map/model) associated with the script
json_path – output json path