Where I see Ruffus going

These are the future enhancements I would like to see in Ruffus:

  • Simpler syntax
    • Extremely pared down syntax where strings are interpreted as commands (like gmake) but with full Ruffus support / string interpolation etc.
    • More powerful non-decorator OOP syntax
    • More customisation points for your own syntax / database use
  • Better support for computational clusters / larger scale pipelines
    • Running jobs out of sequence
    • Long running pipeline where input can be added later
    • Restarting failed jobs robustly
    • Finding out why jobs fail
    • Does Ruffus scale to thousands of parallel jobs? What are the bottlenecks?
  • Better displays of progress
    • Query which tasks / jobs are being run
    • GUI displays
  • Dynamic control during pipeline progress
    • Turn tasks on and off
    • Pause pipelines
    • Pause jobs
    • Change priorities
  • Better handling of data
    • Can we read and write from databases instead of files?
    • Can we cleanup files but preserve history?

In the upcoming release:

Todo: Document output_from()

Todo: Document the new syntax

Todo: Log the progress through the pipeline in a machine-parsable format

A standard, parsable format for reporting the state of the pipeline:

  • Timestamped text file
  • Timestamped Database

Todo: either_or: Prevent failed jobs from propagating further

Motivating example:

@transform(prevtask, suffix(".txt"), either_or(".failed", ".succeed"))
def task(input_file, output_files):
    failed_file_name, succeed_file_name = output_files
    if not run_operation(input_file, succeed_file_name):
        # touch the failed file to mark this job as failed
        with open(failed_file_name, "w") as failed_file:
            pass

Todo: (bug fix) pipeline_printout_graph should print inactive tasks

Todo: Mark input strings as non-file names, and add support for dynamically returned parameters

  1. Use indicator object.
  2. What is a good name? "output_from()", "NOT_FILE_NAME" :-)
  3. They will still participate in suffix, formatter and regex replacement

Bernie Pope suggests that we should generalise this:

If any object in the input parameters is a (non-list/tuple) class instance, check (via getattr) whether it has a ruffus_params() function. If it does, call it to obtain a list which is substituted in place. If there are strings nested within, these will also take part in Ruffus string substitution. Objects with ruffus_params() always “decay” to the results of the function call.

output_from would be a simple wrapper which returns the internal string via ruffus_params()

class output_from(object):
    def __init__(self, file_name):
        self.file_name = file_name
    def ruffus_params(self):
        return [self.file_name]
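
For example, an arbitrary user-defined class could take part in the same protocol. The class below is purely illustrative and not part of Ruffus:

class sample_set(object):
    # hypothetical user class: "decays" to its file names via ruffus_params()
    def __init__(self, *file_names):
        self.file_names = list(file_names)
    def ruffus_params(self):
        # the returned strings take part in normal Ruffus string substitution
        return self.file_names

Passing sample_set("a.txt", "b.txt") in the input parameters would then behave as if the two file names had been given directly.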

Returning a list should be like wildcards and should not introduce an unnecessary level of indirection for output parameters, i.e. suffix(".txt") or formatter() / "{basename[0]}" should work.

Check!

Future Changes to Ruffus

I would appreciate feedback and help on all these issues and on where to take Ruffus next.

Future Changes are features where we more or less know where we are going and how to get there.

Planned Improvements describes features we would like in Ruffus but where the implementation or syntax has not yet been (fully) worked out.

If you have suggestions or contributions, please either write to me (ruffus_lib at llew.org.uk) or send a pull request via the git site.

Todo: Replacements for formatter(), suffix(), regex()

formatter() etc. should be self-contained objects derived from a single base class with behaviour, rather than empty tags used for dispatching to functions.

An inheritance scheme is a better fit for this design, and the code should be switched over to one.
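
A minimal sketch of what such an inheritance scheme might look like; the class and method names are illustrative only and not the actual Ruffus API:

class params_replacer(object):
    # base class: carries the behaviour instead of being an empty dispatch tag
    def replace(self, input_file_name, output_pattern):
        raise NotImplementedError

class suffix_replacer(params_replacer):
    def __init__(self, suffix_to_match):
        self.suffix_to_match = suffix_to_match
    def replace(self, input_file_name, output_pattern):
        # swap the matching suffix for the output pattern, or refuse the input
        if input_file_name.endswith(self.suffix_to_match):
            return input_file_name[:-len(self.suffix_to_match)] + output_pattern
        return None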

Todo: Allow “extra” parameters to be used in output substitution

Formatter substitution should be able to refer to the original elements of the input and extra parameters (without converting them to strings), i.e. to the original (nested) data structures.

This will allow normal Python datatypes to be handed down and slipstreamed into a pipeline more easily.

The syntax would use Ruffus (> version 2.4) formatter:

@transform( ..., formatter(), [
                                "{EXTRAS[0][1][3]}",    # EXTRAS
                                "{INPUTS[1][2]}"],...)  # INPUTS
def taskfunc():
    pass

INPUTS and EXTRAS indicate that we are referring to the original input and extra parameters, respectively.

These are the full (nested) parameters in all their original form. In the case of the input parameters, this obviously depends on the decorator, so:

@transform(["a.text", [1, "b.text"]], formatter(), "{INPUTS[0][0]}")
def taskfunc():
    pass

would give

job #1
   input  == "a.text"
   output == "a"

job #2
   input  == [1, "b.text"]
   output == 1

The entire string must consist of INPUTS or EXTRAS followed optionally by N levels of square brackets, i.e. it must match "(INPUTS|EXTRAS)(\[\d+\])+".

No string conversion takes place.

For INPUTS or EXTRAS which have objects with a ruffus_params() function (see Todo item above), the original object rather than the result of ruffus_params() is forwarded.

Todo: Extra signalling before and after each task and job

@prejob(custom_func)
@postjob(custom_func)
def task():
    pass

@prejob / @postjob would be run in the child processes.
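
For illustration, the hooks might simply log timing information from the child process; the hook signature here (receiving the job parameters) is an assumption:

import os
import time

def log_job_start(*job_params):
    # would run in the child process immediately before the job function
    print("[pid %d] %.0f starting job: %r" % (os.getpid(), time.time(), job_params))

def log_job_end(*job_params):
    # would run in the child process immediately after the job function returns
    print("[pid %d] %.0f finished job: %r" % (os.getpid(), time.time(), job_params))

These would then be passed as @prejob(log_job_start) and @postjob(log_job_end), as in the snippet above.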

Todo: @split / @subdivide returns the actual output created

  • Overrides (but does not replace) wild cards.

  • Returns a list, each element with output and extra parameters.

  • Won’t include extraneous files which were not created in the pipeline but which just happened to match the wild card

  • We should have ruffus_output_params, ruffus_extra_params wrappers for clarity:

    @split("a.file", "*.txt")
    def split_into_txt_files(input_file, output_files):
        output_files = ["a.txt", "b.txt", "c.txt"]
        for output_file_name in output_files:
            with open(output_file_name, "w") as oo:
                pass
        return [
                ruffus_output("a.file"),
                [ruffus_output(["b.file", "c.file"]), ruffus_extras(13, 14)],
                ]
    
  • Consider yielding?

Checkpointing

  • If a checkpoint file is used, the actual files created are saved and checked the next time
  • If no files are generated, no files are checked the next time...
  • The output files do not have to match the wildcard, though we could output a warning message if that happens. This is obviously dangerous because the behaviour will change if the pipeline is rerun without using the checkpoint file
  • What happens if the task function changes?

Todo: New decorators

Todo: @originate

Each (serial) invocation returns a list of output parameters until it returns None (empty list = continue, None = break).
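
A hypothetical sketch of this protocol; the decorator argument and the calling convention are guesses:

pending_batches = [["batch_1.start"], ["batch_2.start", "batch_3.start"]]

@originate("*.start")
def make_starting_files():
    # each (serial) call returns a list of output parameters;
    # an empty list means "nothing this time, call again", None means "stop"
    if not pending_batches:
        return None
    return pending_batches.pop(0)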

Todo: @recombine

Like @collate, but automatically regroups jobs which were the result of a previous @subdivide / @split (even after intervening @transform).

This is the only way job trickling can work without stalling the pipeline: we would know how many jobs are pending for each @recombine job and which jobs go together.
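
A hypothetical sketch of how @recombine might read in a pipeline; @subdivide and @transform exist today, but @recombine and its signature are guesses:

@subdivide(previous_task, formatter(), "{path[0]}/{basename[0]}.*.chunk")
def split_into_chunks(input_file, output_files):
    pass

@transform(split_into_chunks, suffix(".chunk"), ".processed")
def process_chunk(input_file, output_file):
    pass

# regroups exactly the jobs that came from one split_into_chunks job,
# without the pattern matching that @collate would require
@recombine(process_chunk, suffix(".processed"), ".recombined")
def recombine_chunks(input_files, output_file):
    pass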

Todo: Bioinformatics example to end all examples

Uses
  • @product
  • @subdivide
  • @transform
  • @collate
  • @merge

Todo: Allow the next task to start before all jobs in the previous task have finished

Jake (Biesinger) calls this Job Trickling!

  • A single long-running job will no longer hold up the entire pipeline
  • Calculates dependencies dynamically at the job level.
  • The goal is to have a long-running (months) pipeline to which we can keep adding input...
  • We can choose between prioritising completion of the entire pipeline for some jobs (depth first) or trying to complete as many tasks as possible (breadth first)

Converting to per-job rather than per-task dependencies

Some decorators prevent per-job (rather than per-task) dependency calculations and will cause the pipeline to stall until the dependent tasks have completed (the current situation):

  • Some types of jobs unavoidably depend on an entire previous task completing:
    • add_inputs(), inputs()
    • @merge
    • @split (implicit @merge)
  • @split, @originate produce a variable amount of output at runtime and must be completed before the next task can be run.
    • Should yield instead of return?
  • @collate needs to pattern match all the inputs of a previous task
    • Replace @collate with @recombine which “remembers” and reverses the results of a previous @subdivide or @split
    • Jobs need unique job_id tag
    • Jobs are assigned (nested) grouping ids which accompany them down the pipeline after @subdivide / @split and are removed after @recombine
    • Should have a count of jobs so we always know when an “input slot” is full
  • The funny “single file” mode for @transform, @files needs to be regularised so that it is a syntactic (front end) convenience (oddity!) and does not plague the innards of Ruffus

Breaking change: to force the entirety of the previous task to complete before the next one, use @follows
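
For example, using the existing @follows decorator (assuming a previous_task defined upstream):

@follows(previous_task)
@transform(previous_task, suffix(".txt"), ".out")
def next_task(input_file, output_file):
    # under job trickling, @transform alone would only wait for the matching
    # upstream job; @follows forces the whole of previous_task to finish first
    pass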

Implementation

  • “Push” model. Completing jobs “check in” their outputs to “input slots” for all the successor jobs.
  • When “input slots” are full for any job, it is put on the dispatch queue to be run.
  • The priority (depth first or breadth first) can be set here.
  • pipeline_run / pipeline_printout create a task dependency tree structure (from decorator dependencies) (a runtime pipeline object)
  • Each task in the pipeline object knows which other tasks wait on it.
  • When output is created by a job, it sends messages (i.e. function calls) to all dependent tasks in the pipeline object with the new output
  • Sets of output, such as from @split, @subdivide and @originate, have a terminating condition and/or an associated count (# of output)
  • Tasks in the pipeline object forward incoming inputs to task input slots (for slots common to all jobs in a task: @inputs, @add_inputs) or to slots in new jobs in the pipeline object
  • When all slots are full in each job, this triggers putting the job parameters onto the job submission queue
  • The pipeline object should allow Ruffus to be reentrant?
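
A very rough sketch just to make the push model concrete; all names below are invented for illustration and not part of Ruffus:

class job_slots(object):
    # one pending job: its "input slots" fill up as predecessor jobs check in
    def __init__(self, n_inputs_expected, output):
        self.n_inputs_expected = n_inputs_expected
        self.output = output
        self.inputs = []

    def check_in(self, predecessor_output, dispatch_queue):
        self.inputs.append(predecessor_output)
        if len(self.inputs) == self.n_inputs_expected:
            # all input slots are full: the job parameters go on the queue
            dispatch_queue.append((self.inputs, self.output))

dispatch_queue = []                         # stand-in for the real job submission queue
job = job_slots(n_inputs_expected=2, output="a.merged")
job.check_in("a.part1.processed", dispatch_queue)
job.check_in("a.part2.processed", dispatch_queue)   # job is now ready to run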

Todo: Allow checkpoint files to be moved

Allow checkpoint files to be “rebased” so that moving the working directory of the pipeline does not invalidate all the files.

We need some sort of path search and replace mechanism which handles conflicts, and probably versioning?
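
A minimal sketch of the kind of path rewriting this might involve, assuming the checkpoint entries can be read as a mapping from file path to checksum record (the function name is invented):

import os

def rebase_checkpoint_paths(checkpoint_entries, old_prefix, new_prefix):
    # checkpoint_entries: dict mapping file paths to checksum records
    rebased = {}
    for path, record in checkpoint_entries.items():
        if path.startswith(old_prefix):
            path = os.path.join(new_prefix, os.path.relpath(path, old_prefix))
        if path in rebased:
            raise ValueError("rebasing produced a conflicting path: %s" % path)
        rebased[path] = record
    return rebased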

Todo: Remove intermediate files

Often large intermediate files are produced in the middle of a pipeline which could be removed. However, their absence would cause the pipeline to appear out of date. What is the best way to solve this?

In gmake, all intermediate files which are not marked .PRECIOUS are deleted.

We can similarly mark all tasks producing intermediate files, so that all their output files can be deleted, using an @intermediate/provisional/transient/temporary/interim/ephemeral decorator.

The tricky part of the design is how to delete files without disrupting our ability to build the original file dependency DAG, and hence check which tasks have up-to-date output when the pipeline is run again.

  1. We can just work back from upstream/downstream files and ignore the intermediate files as gmake does. However, the increased power of Ruffus makes this very fragile: In gmake, the DAG is entirely specified by the specified destination files. In Ruffus, the number of task files is indeterminate, and can be changed at run time (see @split and @subdivide)
  2. We can save the filenames into the checksum file before deleting them
  3. We can leave the files in place but zero out their contents. It is probably best to write a small magic text value to the file, e.g. “RUFFUS_ZEROED_FILE”, so that we are not confused by real files of zero size.

In practice (2) and (3) should be combined for safety.

  1. pipeline_cleanup() will print out a list of files to be zeroed, or a list of commands to zero files, or just do it for you
  2. When rerunning, we can force files to be recreated using pipeline_run(..., forcedtorun_tasks,...), and Ruffus will track back through lists of dependencies and recreate all “zeroed” files.
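
A minimal sketch of option (3), zeroing a file while leaving a recognisable magic value behind; the marker string follows the suggestion above and the helper names are invented:

ZEROED_MARKER = "RUFFUS_ZEROED_FILE"

def zero_out_file(file_name):
    # replace the (large) contents with a small magic value so the file still
    # exists for dependency checking but no longer takes up space
    with open(file_name, "w") as zeroed_file:
        zeroed_file.write(ZEROED_MARKER)

def is_zeroed_file(file_name):
    # distinguish deliberately zeroed files from genuinely empty output
    try:
        with open(file_name) as possibly_zeroed:
            return possibly_zeroed.read(len(ZEROED_MARKER) + 1) == ZEROED_MARKER
    except IOError:
        return False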

Planned Improvements to Ruffus

  • @split needs to be able to specify at run time the number of resulting jobs without using wild cards
  • legacy support for wild cards and file names.

Planned: Running python code (task functions) transparently on remote cluster nodes

Wait until the next release.

Ruffus will be bumped to v3.0 if it can run Python jobs transparently on a cluster!

Abstract out task.run_pooled_job_without_exceptions() as a function which can be supplied to pipeline_run.

Common “job” interface:

  • marshalled arguments
  • marshalled function
  • submission timestamp

Returns:

  • completion timestamp
  • returned values
  • exception
  1. Full version: use libpythongrid?
     • http://zguide.zeromq.org/page:all
     • Christian Widmer <ckwidmer@gmail.com>
     • Cheng Soon Ong <chengsoon.ong@unimelb.edu.au>
     • https://code.google.com/p/pythongrid/source/browse/#git%2Fpythongrid
     • Probably not good to base Ruffus entirely on libpythongrid, to minimise dependencies, the use of sophisticated configuration policies etc.
  2. Start with a light-weight file-based protocol
     • specify where the scripts should live
     • use drmaa to start jobs
     • have an executable ruffus module which knows how to load and deserialise (unmarshall) the function / parameters from disk. This would be what drmaa starts up, given the marshalled data as an argument
     • time stamp
     • “heart beat” to check that the job is still running
  3. Next step: socket-based protocol
     • use a specified master port in the ruffus script
     • start remote processes using drmaa
     • the child receives the marshalled data and the address::port of the ruffus script (head node), and must initiate a hand shake or die
     • process recycling: run successive jobs on the same remote process for reduced overhead, until it exceeds the max number of jobs per process or the min/max time on the same process
     • resubmit if the process dies (don’t do sophisticated stuff like libpythongrid).
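
A rough sketch of the file-based approach in item 2, using pickle for the marshalling; the file layout and the worker entry point are assumptions, not a worked-out design:

import pickle
import sys

def write_job_file(job_file_name, func, args):
    # serialise the job function and its arguments for a remote worker;
    # pickle only stores module-level functions by reference, so a real
    # implementation would need something more robust
    with open(job_file_name, "wb") as job_file:
        pickle.dump((func, args), job_file)

def run_job_file(job_file_name):
    # what the executable worker module would do once started (e.g. via drmaa)
    with open(job_file_name, "rb") as job_file:
        func, args = pickle.load(job_file)
    return func(*args)

if __name__ == "__main__":
    # worker invocation: python this_module.py <job_file>
    print(run_job_file(sys.argv[1]))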

Planned: Custom parameter generator

Request on mailing list

I’ve often wished that I could use an arbitrary function to process the input filepath instead of just a regex.
def f(inputs, outputs, extra_param1, extra_param2):
    # do something to generate parameters
    return new_output_param, new_extra_param1, new_extra_param2

now f() can be used inside a Ruffus decorator to generate the outputs from inputs, instead of being forced to use a regex for the job.

Cheers, Bernie.

This leverages built-in Ruffus functionality: you don’t have to write the entire parameter generation from scratch.

  • Gets passed an iterator where you can do a for loop to get input parameters / a flattened list of files
  • Other parameters are forwarded as is
  • The duty of the function is to yield input, output and extra parameters (see the sketch below)
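
A minimal sketch of such a generator; the exact arguments Ruffus would pass are not yet decided, so the iterator-of-input-parameters shape below is an assumption:

import os

def my_param_generator(input_params, extra_param1, extra_param2):
    # input_params: iterator over the input parameters of the upstream jobs
    for input_file in input_params:
        output_file = os.path.splitext(input_file)[0] + ".out"
        # yield input, output and extra parameters for one job
        yield input_file, output_file, extra_param1, extra_param2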

Simple to do but how do we prevent this from being a job-trickling barrier?

Postpone until we have an initial design for job-trickling: Ruffus v.4 ;-(

Planned: Ruffus GUI interface.

Desktop (PyQt) or web-based solution? I’d love to see an SVG pipeline picture that I could actually interact with.

Planned: Non-decorator / Function interface to Ruffus

Planned: @retry_on_error(NUM_OF_RETRIES)
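
The decorator does not exist yet; a generic sketch of the intended behaviour might look like this (name, signature and semantics are all illustrative):

import functools
import time

def retry_on_error(num_of_retries, delay_seconds=0):
    # re-run the job function up to num_of_retries extra times before giving up
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(num_of_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == num_of_retries:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorate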

Planned: Clean up

The plan is to store the files and directories created via a standard interface.

The placeholder for this is a register_cleanup function call.

Jobs can specify the files they created and which need to be deleted by returning a list of file names from the job function.

So:

  • raise Exception = error
  • return False = halt the pipeline now
  • return string / list of strings = cleanup files/directories later
  • return anything else = ignored

The cleanup file/directory store interface can be connected to a text file or a database.

The cleanup function would look like this:

pipeline_cleanup(cleanup_log("../cleanup.log"), [instance = "october19th"])
pipeline_cleanup(cleanup_msql_db("user", "password", "hash_record_table"))

The parameters for where and how to store the list of created files could be similarly passed to pipeline_run as an extra parameter:

pipeline_run(cleanup_log("../cleanup.log"), [instance = "october19th"])
pipeline_run(cleanup_msql_db("user", "password", "hash_record_table"))

where cleanup_log and cleanup_msql_db are classes which have functions for

  1. storing files
  2. retrieving files
  3. clearing entries
  • Files would be deleted in reverse order, and directories after files.

  • By default, only empty directories would be removed.

    But this could be changed with a --forced_remove_dir option

  • A --remove_empty_parent_directories option would be supported via os.removedirs(path).
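
A minimal sketch of the store interface described above, using a plain text log; cleanup_log mirrors the class named in the examples, but the method names are assumptions:

class cleanup_log(object):
    # text-file backend for the cleanup store: one created file per line
    def __init__(self, log_file_name, instance=""):
        self.log_file_name = log_file_name
        self.instance = instance

    def store_file(self, file_name):
        with open(self.log_file_name, "a") as log_file:
            log_file.write("%s\t%s\n" % (self.instance, file_name))

    def retrieve_files(self):
        with open(self.log_file_name) as log_file:
            return [line.rstrip("\n").split("\t", 1)[1] for line in log_file]

    def clear_entries(self):
        open(self.log_file_name, "w").close()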