Chapter 6: Running Ruffus from the command line with ruffus.cmdline

We find that much of our Ruffus pipeline code is built on the same template, and this is generally a good place to start when developing a new pipeline.

From version 2.4, Ruffus includes an optional Ruffus.cmdline module that provides support for a set of common command line arguments. This makes writing Ruffus pipelines much more pleasant.

Template for argparse

All you need to do is copy these 6 lines

import ruffus.cmdline as cmdline

parser = cmdline.get_argparse(description='WHAT DOES THIS PIPELINE DO?')

#   <<<---- add your own command line options like --input_file here
# parser.add_argument("--input_file")

options = parser.parse_args()

#  standard python logger which can be synchronised across concurrent Ruffus tasks
logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)

#   <<<----  pipelined functions go here

cmdline.run (options)

We recommend using the standard argparse module, but the deprecated optparse module works as well (see the template at the end of this chapter).
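
For illustration only, here is a minimal, self-contained pipeline built on this template. The task names and file names (create_initial_files, copy_files, start.*.txt) are hypothetical, not part of Ruffus:

import ruffus.cmdline as cmdline
from ruffus import originate, transform, suffix

parser = cmdline.get_argparse(description='Minimal example pipeline')
options = parser.parse_args()

#  shared logger and mutex, exactly as in the template above
logger, logger_mutex = cmdline.setup_logging(__name__, options.log_file, options.verbose)

@originate(["start.one.txt", "start.two.txt"])
def create_initial_files(output_file):
    # create empty starting files
    open(output_file, "w").close()

@transform(create_initial_files, suffix(".txt"), ".copy")
def copy_files(input_file, output_file):
    # copy each starting file, logging through the shared logger
    with open(input_file) as src, open(output_file, "w") as dest:
        dest.write(src.read())
    with logger_mutex:
        logger.info("Copied %s to %s" % (input_file, output_file))

cmdline.run(options)

Saved as myscript, this can be driven entirely from the command line options described in the rest of this chapter (myscript --just_print, myscript --jobs 2, and so on).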

Command Line Arguments

Ruffus.cmdline by default provides these predefined options:

-v, --verbose
    --version
-L, --log_file

    # tasks
-T, --target_tasks
    --forced_tasks
-j, --jobs
    --use_threads


    # printout
-n, --just_print

    # flow chart
    --flowchart
    --key_legend_in_graph
    --draw_graph_horizontally
    --flowchart_format


    # check sum
    --touch_files_only
    --checksum_file_name
    --recreate_database

1) Logging

The script provides logging both to the command line:

myscript -v
myscript --verbose

and to an optional log file:

# keep tabs on yourself
myscript --log_file /var/log/secret.logbook

Logging is ignored if neither --verbose nor --log_file is specified on the command line.

Ruffus.cmdline automatically allows you to write to a shared log file via a proxy from multiple processes. However, you do need to hold the logging mutex (logger_mutex in the template above) for the log file to be synchronised properly across different jobs:

with logger_mutex:
    logger.info("Look Ma. No hands")

Logging is set up so that you can write

A) Only to the log file:

logger.info("A message")

B) Only to the display:

logger.debug("A message")

C) To both simultaneously:

from ruffus.cmdline import MESSAGE

logger.log(MESSAGE, "A message")
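
As a sketch, the three cases can be combined inside a single task; the messages here are purely illustrative:

from ruffus.cmdline import MESSAGE

with logger_mutex:
    logger.info("Only written to the log file")
    logger.debug("Only written to the display")
    logger.log(MESSAGE, "Written to both the log file and the display")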

2) Tracing pipeline progress

This is extremely useful for understanding what is happening with your pipeline, what tasks and which jobs are up-to-date etc.

See Chapter 5: Understanding how your pipeline works with pipeline_printout(...)

To trace the pipeline, call the script with the following options:

# well-mannered, reserved
myscript --just_print
myscript -n

or

# extremely loquacious
myscript --just_print --verbose 5
myscript -n -v5

Increasing levels of verbosity (from --verbose 1 to --verbose 5) provide progressively more detailed output.

3) Printing a flowchart

This is the subject of Chapter 7: Displaying the pipeline visually with pipeline_printout_graph(...).

Flowcharts can be specified using the following option:

myscript --flowchart xxxchart.svg

The extension of the flowchart file indicates what format the flowchart should take, for example svg, jpg, etc.

This can be overridden with --flowchart_format.

4) Running in parallel on multiple processors

Optionally specify the number of parallel strands of execution and the last target task to run. The pipeline will start from any out-of-date tasks that precede the target and will proceed no further than the target.

myscript --jobs 15 --target_tasks "final_task"
myscript -j 15

5) Set up checkpointing so that Ruffus knows which files are out of date

The checkpoint file name defaults to the value of the DEFAULT_RUFFUS_HISTORY_FILE environment variable.

If this is not set, it will default to .ruffus_history.sqlite in the current working directory.

In either case, the file name can be overridden on the command line:

myscript --checksum_file_name mychecksum.sqlite
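
The history file location can also be chosen via the environment. This is a sketch in Python, assuming the variable is set before ruffus is imported; the path is hypothetical:

import os

# assumed: set before ruffus is imported so it picks up the value
os.environ["DEFAULT_RUFFUS_HISTORY_FILE"] = "/path/to/my_history.sqlite"

import ruffus.cmdline as cmdline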

Recreating checkpoints

Create or update the checkpoint file so that all existing files in completed jobs appear up to date. This will stop sensibly if the current state is incomplete or inconsistent.

myscript --recreate_database

Touch files

As far as possible, create empty files with the correct timestamp to make the pipeline appear up to date.

myscript --touch_files_only

6) Skipping specified options

Note that particular options can be skipped (i.e. not added to the command line) if they conflict with your own options, for example:

# see below for how to use get_argparse
parser = cmdline.get_argparse(  description='WHAT DOES THIS PIPELINE DO?',
                                # Exclude the following options: --log_file --key_legend_in_graph
                                ignored_args = ["log_file", "key_legend_in_graph"])
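
Having excluded a conflicting option, you can then add your own replacement. This sketch re-adds --log_file with a hypothetical default path:

parser = cmdline.get_argparse(  description='WHAT DOES THIS PIPELINE DO?',
                                ignored_args = ["log_file"])

# add our own --log_file option in place of the excluded one
parser.add_argument("--log_file", default="/my/own/default.log")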

7) Specifying verbosity and abbreviating long paths

The verbosity can be specified on the command line

myscript --verbose 5

# verbosity of 5 + 1 = 6
myscript --verbose 5 --verbose

# verbosity reset to 2
myscript --verbose 5 --verbose --verbose 2

If the printed paths are too long and need to be abbreviated, or, alternatively, if you want to see the full absolute paths of your input and output parameters, you can specify an extension to the verbosity. See the manual discussion of verbose_abbreviated_path for more details. This is specified as --verbose VERBOSITY:VERBOSE_ABBREVIATED_PATH (no spaces!).

For example:

 # verbosity of 4
 myscript.py --verbose 4

 # display three levels of nested directories
 myscript.py --verbose 4:3

 # restrict input and output parameters to 60 letters
 myscript.py --verbose 4:-60

8) Displaying the version

Note that the version for your script will default to "%(prog)s 1.0" unless specified:

parser = cmdline.get_argparse(  description='WHAT DOES THIS PIPELINE DO?',
                                version = "my_programme.py v. 2.23")

Template for optparse

optparse has been deprecated since Python 2.7, but the following template works as well:

#
#   Using optparse (deprecated since Python 2.7)
#
from ruffus import *
import ruffus.cmdline as cmdline

parser = cmdline.get_optgparse(version="%prog 1.0", usage = "\n\n    %prog [options]")

#   <<<---- add your own command line options like --input_file here
# parser.add_option("-i", "--input_file", dest="input_file", help="Input file")

(options, remaining_args) = parser.parse_args()

#  logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging ("this_program", options.log_file, options.verbose)

#   <<<----  pipelined functions go here

cmdline.run (options)