@collate( input, filter, output, [extras,...] )

Purpose:

Uses filter to identify sets of inputs that are to be grouped (collated) together.

All inputs that generate identical output and extras under the formatter or regex (regular expression) filter are collated into one job.

This is a many-to-fewer operation.

Only out-of-date jobs (determined by comparing input and output files) will be re-run.
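
As a rough illustration of what comparing input and output files can mean, the sketch below treats a job as out of date when its output is missing or older than any of its inputs. This is an assumption made for illustration only: the helper is_out_of_date is hypothetical, is not part of Ruffus, and the checks Ruffus actually performs may differ.

```python
import os
import tempfile


def is_out_of_date(input_files, output_file):
    """Hypothetical helper: True if the output is missing or older than any input."""
    if not os.path.exists(output_file):
        return True
    output_mtime = os.path.getmtime(output_file)
    return any(os.path.getmtime(f) > output_mtime for f in input_files)


def demo():
    """Exercise the three cases: missing output, fresh output, stale output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "a.fish")
        out = os.path.join(tmp, "fish.summary")
        with open(src, "w"):
            pass
        missing = is_out_of_date([src], out)   # output does not exist yet

        with open(out, "w"):
            pass
        os.utime(src, (1000, 1000))            # input older than output
        os.utime(out, (2000, 2000))
        fresh = is_out_of_date([src], out)

        os.utime(src, (3000, 3000))            # input now newer than output
        stale = is_out_of_date([src], out)
    return missing, fresh, stale
```

Calling demo() exercises the three cases in order: no output yet, output newer than its input, and input modified after the output was written.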

Example:

regex(r".+\.(.+)$"), "\1.summary" creates a separate summary file for each suffix:

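from ruffus import collate, regex

animal_files = "a.fish", "b.fish", "c.mammals", "d.mammals"
# summarise by file suffix: \1 is the suffix captured by the regex
@collate(animal_files, regex(r".+\.(.+)$"), r'\1.summary')
def summarize(infiles, summary_file):
    # infiles is the list of input files that share one suffix
    pass
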
  1. The output and optional extras parameters are passed to the task function after string substitution. Non-string values are passed through unchanged.

  2. Each collate job consists of the input files which string substitution maps to identical output and extras.

  3. The above example results in two jobs:
    ["a.fish", "b.fish" -> "fish.summary"]
    ["c.mammals", "d.mammals" -> "mammals.summary"]
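
The grouping described in the notes above can be imitated in plain Python. The sketch below is not Ruffus's implementation: collate_inputs and substitute are hypothetical helpers, re.sub is used as an assumed stand-in for Ruffus's string substitution, and skipping inputs that do not match the pattern is likewise an assumption.

```python
import re


def substitute(compiled, name, param):
    """Apply the regex substitution to strings; pass other values through unchanged."""
    if isinstance(param, str):
        return compiled.sub(param, name, count=1)
    return param


def collate_inputs(input_files, pattern, output, extras=()):
    """Group inputs whose substituted output and extras are identical.

    Returns a list of (inputs, output, extras) tuples, one per job.
    """
    compiled = re.compile(pattern)
    jobs = []
    for name in input_files:
        if compiled.search(name) is None:
            continue  # assumption: inputs that do not match are skipped
        job_output = substitute(compiled, name, output)
        job_extras = [substitute(compiled, name, e) for e in extras]
        for inputs, existing_output, existing_extras in jobs:
            if (existing_output, existing_extras) == (job_output, job_extras):
                inputs.append(name)  # same output and extras: same job
                break
        else:
            jobs.append(([name], job_output, job_extras))
    return jobs


animal_files = ["a.fish", "b.fish", "c.mammals", "d.mammals"]
jobs = collate_inputs(animal_files, r".+\.(.+)$", r"\1.summary",
                      extras=[r"\1", 42])
for inputs, job_output, job_extras in jobs:
    print(inputs, "->", job_output, job_extras)
```

The string extra r"\1" is substituted per job, while the integer 42 is passed through unchanged, mirroring note 1.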

Parameters:

  • input = tasks_or_file_names

    can be a:

    1. Task / list of tasks.

      File names are taken from the output of the specified task(s)

    2. (Nested) list of file name strings (as in the example above).
      File names containing *[]? will be expanded as a glob.

      E.g.: "a.*" => "a.1", "a.2"

  • filter = matching_regex

    is a Python regular expression string, which must be wrapped in a regex indicator object. See the Python regular expression (re) documentation for details of regular expression syntax.

  • output = output

    Specifies the resulting output file name(s) after string substitution.

  • extras = extras

    Any extra parameters are passed to the task function: strings after string substitution, non-string values unchanged.

    If you are using named parameters, these can be passed as a list, i.e. extras=[...]

    Any extra parameters are consumed by the task function and not forwarded further down the pipeline.
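
The glob expansion mentioned under the input parameter can be tried out with Python's standard glob module. The sketch below only illustrates the idea; expand_inputs is a hypothetical helper, not a Ruffus function, and sorting the matches is a choice made here to keep the result predictable.

```python
import glob
import os
import tempfile


def expand_inputs(names):
    """Expand names containing glob characters; keep all other names as given."""
    expanded = []
    for name in names:
        if any(ch in name for ch in "*?[]"):
            expanded.extend(sorted(glob.glob(name)))
        else:
            expanded.append(name)
    return expanded


def demo():
    """Create a few files in a temporary directory and expand "a.*" against them."""
    with tempfile.TemporaryDirectory() as tmp:
        for fname in ("a.1", "a.2", "b.1"):
            with open(os.path.join(tmp, fname), "w"):
                pass
        found = expand_inputs([os.path.join(tmp, "a.*"), "c.txt"])
        return [os.path.basename(p) for p in found]
```

Here "a.*" picks up a.1 and a.2 but not b.1, while the plain name c.txt is kept as given even though no such file exists.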

Example 2:

Suppose we had the following files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal

pufferfish.fish.animal

and we wanted to end up with three different output files:

cows.mammals.animal
horses.mammals.animal
sheep.mammals.animal
    -> mammals.results

snake.reptile.animal
lizard.reptile.animal
crocodile.reptile.animal
    -> reptile.results

pufferfish.fish.animal
    -> fish.results

This is the @collate code required:

animals = [     "cows.mammals.animal",
                "horses.mammals.animal",
                "sheep.mammals.animal",
                "snake.reptile.animal",
                "lizard.reptile.animal",
                "crocodile.reptile.animal",
                "pufferfish.fish.animal"]

@collate(animals, regex(r"(.+)\.(.+)\.animal"),  r"\2.results")
# \1 = species [cows, horses, ...]
# \2 = phylogenetic group [mammals, reptile, fish]
def summarize_animals_into_groups(species_files, result_file):
    " ... more code here"
    pass
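
The task body is elided above. As a purely hypothetical sketch of what such a function might do, the version below writes one line per input file giving its name and line count. It is plain Python called directly rather than through Ruffus, and the file contents used in demo are invented for illustration.

```python
import os
import tempfile


def summarize_animals_into_groups(species_files, result_file):
    """Write one tab-separated line per input file: file name and line count."""
    with open(result_file, "w") as out:
        for path in species_files:
            with open(path) as src:
                line_count = sum(1 for _ in src)
            out.write(f"{os.path.basename(path)}\t{line_count}\n")


def demo():
    """Summarise two made-up mammal files into mammals.results."""
    with tempfile.TemporaryDirectory() as tmp:
        inputs = []
        for name, text in (("cows.mammals.animal", "moo\nmoo\n"),
                           ("sheep.mammals.animal", "baa\n")):
            path = os.path.join(tmp, name)
            with open(path, "w") as f:
                f.write(text)
            inputs.append(path)
        result = os.path.join(tmp, "mammals.results")
        summarize_animals_into_groups(inputs, result)
        with open(result) as f:
            return f.read()
```

Note that the first argument is a list: each collated job receives all the input files of its group at once.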

See @merge for an alternative way to summarise files.