Chapter 17: @combinations, @permutations and all versus all @product

Overview

A surprising number of computational problems involve some form of all-versus-all calculation. Previously, this required supplying all the parameters on the fly via a custom function with @files.

From version 2.4, Ruffus supports @combinations_with_replacement, @combinations, @permutations and @product.

As far as possible, these provide all the functionality of the four combinatoric iterators of the same names in the standard Python itertools module.
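
As a quick orientation, the four itertools functions can be compared side by side on the same small input. This is a plain-Python sketch that does not use Ruffus; the names `items`, `joined` and `summary` are illustrative only.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

items = "ABC"

def joined(tuples):
    """Render each tuple of letters as a short string, e.g. ('A', 'B') -> 'AB'."""
    return ["".join(t) for t in tuples]

summary = [
    ("product",                       joined(product(items, items))),
    ("permutations",                  joined(permutations(items, 2))),
    ("combinations",                  joined(combinations(items, 2))),
    ("combinations_with_replacement", joined(combinations_with_replacement(items, 2))),
]

for name, results in summary:
    # Print the number of tuples each iterator yields, then the tuples.
    print("%-30s %d  %s" % (name, len(results), " ".join(results)))
```

Only product and combinations_with_replacement pair an element with itself, and only product and permutations yield both A B and B A.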

Generating output with formatter()

String replacement always takes place via formatter(). Unfortunately, the other Ruffus workhorses, regex() and suffix(), do not have sufficient syntactic flexibility.

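Each combinatorics decorator deals with multiple sets of inputs, whether in a self-self comparison (such as @combinations_with_replacement, @combinations and @permutations) or in a comparison between different sets (such as @product).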

The replacement strings thus require an extra level of indirection to refer to parsed components.

  1. The first level refers to which set of inputs.
  2. The second level refers to which input file in any particular set of inputs.

For example, if the inputs are [A1,A2],[B1,B2],[C1,C2] vs [P1,P2],[Q1,Q2],[R1,R2] vs [X1,X2],[Y1,Y2],[Z1,Z2], then '{basename[2][0]}' is the basename for

  • the third set of inputs (X,Y,Z) and
  • the first file name string in each Input of that set (X1, Y1, Z1)
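
The two levels of indexing can be mimicked in plain Python. The sketch below is not Ruffus code: `basenames_for_job` is a hypothetical stand-in for what formatter() makes available as `basename`, and the `.txt` extension on the file names is an assumption made only so that there is something to strip.

```python
import os
from itertools import product

def basenames_for_job(job_inputs):
    """Hypothetical stand-in for formatter(): one row per set of inputs,
    one entry per file, holding the name without directory or extension."""
    return [[os.path.splitext(os.path.basename(name))[0] for name in one_input]
            for one_input in job_inputs]

# Three sets of inputs; each input is itself a pair of file names.
set_0 = [["A1.txt", "A2.txt"], ["B1.txt", "B2.txt"], ["C1.txt", "C2.txt"]]
set_1 = [["P1.txt", "P2.txt"], ["Q1.txt", "Q2.txt"], ["R1.txt", "R2.txt"]]
set_2 = [["X1.txt", "X2.txt"], ["Y1.txt", "Y2.txt"], ["Z1.txt", "Z2.txt"]]

seen = []
for job_inputs in product(set_0, set_1, set_2):
    basename = basenames_for_job(job_inputs)
    # First index [2]: the third set of inputs.
    # Second index [0]: the first file name of the input drawn from that set.
    seen.append(basename[2][0])

print(sorted(set(seen)))
```

Across all 27 jobs, `basename[2][0]` only ever takes the values X1, Y1 and Z1.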

All vs all comparisons with @product

@product generates the Cartesian product between sets of input files, i.e. all vs all comparisons.

The effect is analogous to a nested for loop.
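
The nested-loop analogy can be checked directly with itertools, without Ruffus. In this minimal sketch the letters stand in for the three sets of input files; the variable name `nested` is illustrative.

```python
from itertools import product

# Three nested loops, one per set of inputs.
nested = []
for first in "ab":
    for second in "pq":
        for third in "xy":
            nested.append((first, second, third))

# product() yields the same tuples in the same order.
assert nested == list(product("ab", "pq", "xy"))
print(len(nested))
```

The last set varies fastest, just as the innermost loop does.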

@product can be useful, for example, in bioinformatics for finding the corresponding genes (orthologues) for a set of proteins in multiple species.

>>> from itertools import product
>>> # product('ABC', 'XYZ') --> AX AY AZ BX BY BZ CX CY CZ
>>> [ "".join(a) for a in product('ABC', 'XYZ')]
['AX', 'AY', 'AZ', 'BX', 'BY', 'BZ', 'CX', 'CY', 'CZ']

This example calculates the @product of the A,B and P,Q and X,Y files

from ruffus import *
from ruffus.combinatorics import *

#   Three sets of initial files
@originate([ 'a.start', 'b.start'])
def create_initial_files_ab(output_file):
    with open(output_file, "w") as oo: pass

@originate([ 'p.start', 'q.start'])
def create_initial_files_pq(output_file):
    with open(output_file, "w") as oo: pass

@originate([ 'x.start', 'y.start'])
def create_initial_files_xy(output_file):
    with open(output_file, "w") as oo: pass

#   @product
@product(   create_initial_files_ab,        # Input
            formatter("(.start)$"),         # match input file set # 1

            create_initial_files_pq,        # Input
            formatter("(.start)$"),         # match input file set # 2

            create_initial_files_xy,        # Input
            formatter("(.start)$"),         # match input file set # 3

            "{path[0][0]}/"                 # Output Replacement string
            "{basename[0][0]}_vs_"          #
            "{basename[1][0]}_vs_"          #
            "{basename[2][0]}.product",     #

            "{path[0][0]}",                 # Extra parameter: path for 1st set of files, 1st file name

            ["{basename[0][0]}",            # Extra parameter: basename for 1st set of files, 1st file name
             "{basename[1][0]}",            #                               2nd
             "{basename[2][0]}",            #                               3rd
             ])
def product_task(input_file, output_parameter, shared_path, basenames):
    # print() with a single string argument behaves identically in Python 2 and 3
    print("# basenames      =  " + " ".join(basenames))
    print("input_parameter  =  %s" % (input_file,))
    print("output_parameter =  %s\n" % output_parameter)


#
#       Run
#
pipeline_run(verbose=0)

This results in:

>>> pipeline_run(verbose=0)

# basenames      =  a p x
input_parameter  =  ('a.start', 'p.start', 'x.start')
output_parameter =  /home/lg/temp/a_vs_p_vs_x.product

# basenames      =  a p y
input_parameter  =  ('a.start', 'p.start', 'y.start')
output_parameter =  /home/lg/temp/a_vs_p_vs_y.product

# basenames      =  a q x
input_parameter  =  ('a.start', 'q.start', 'x.start')
output_parameter =  /home/lg/temp/a_vs_q_vs_x.product

# basenames      =  a q y
input_parameter  =  ('a.start', 'q.start', 'y.start')
output_parameter =  /home/lg/temp/a_vs_q_vs_y.product

# basenames      =  b p x
input_parameter  =  ('b.start', 'p.start', 'x.start')
output_parameter =  /home/lg/temp/b_vs_p_vs_x.product

# basenames      =  b p y
input_parameter  =  ('b.start', 'p.start', 'y.start')
output_parameter =  /home/lg/temp/b_vs_p_vs_y.product

# basenames      =  b q x
input_parameter  =  ('b.start', 'q.start', 'x.start')
output_parameter =  /home/lg/temp/b_vs_q_vs_x.product

# basenames      =  b q y
input_parameter  =  ('b.start', 'q.start', 'y.start')
output_parameter =  /home/lg/temp/b_vs_q_vs_y.product

Permute all k-tuple orderings of inputs without repeats using @permutations

Generates the permutations of all the elements of a set of inputs (e.g. A B C D):
  • r-length tuples of input elements
  • excluding repeated elements (A A)
  • where the order within each tuple is significant (both A B and B A).

>>> from itertools import permutations
>>> # permutations('ABCD', 2) --> AB AC AD BA BC BD CA CB CD DA DB DC
>>> [ "".join(a) for a in permutations("ABCD", 2)]
['AB', 'AC', 'AD', 'BA', 'BC', 'BD', 'CA', 'CB', 'CD', 'DA', 'DB', 'DC']

The following example calculates the @permutations of the A,B,C,D files

from ruffus import *
from ruffus.combinatorics import *

#   initial file pairs
@originate([ ['A.1_start', 'A.2_start'],
             ['B.1_start', 'B.2_start'],
             ['C.1_start', 'C.2_start'],
             ['D.1_start', 'D.2_start']])
def create_initial_files_ABCD(output_files):
    for output_file in output_files:
        with open(output_file, "w") as oo: pass

#   @permutations
@permutations(create_initial_files_ABCD,      # Input
              formatter(),                    # match input files

              # tuple of 2 at a time
              2,

              # Output Replacement string
              "{path[0][0]}/"
              "{basename[0][1]}_vs_"
              "{basename[1][1]}.permutations",

              # Extra parameter: path for 1st set of files, 1st file name
              "{path[0][0]}",

              # Extra parameter
              ["{basename[0][0]}",  # basename for 1st set of files, 1st file name
               "{basename[1][0]}",  #                                2nd
               ])
def permutations_task(input_file, output_parameter, shared_path, basenames):
    print(" - ".join(basenames))


#
#       Run
#
pipeline_run(verbose=0)

This results in:

>>> pipeline_run(verbose=0)

A - B
A - C
A - D
B - A
B - C
B - D
C - A
C - B
C - D
D - A
D - B
D - C
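
The twelve jobs above match the number of ordered pairs that can be drawn from four inputs, n!/(n-r)! = 4!/2! = 12. The following plain-Python check uses itertools rather than Ruffus; the names `inputs`, `pairs` and `expected` are illustrative.

```python
from itertools import permutations
from math import factorial

inputs = ["A", "B", "C", "D"]
r = 2

pairs = list(permutations(inputs, r))
expected = factorial(len(inputs)) // factorial(len(inputs) - r)

# Twelve ordered pairs in total.
assert len(pairs) == expected == 12
# Order is significant, so every pair also appears reversed.
assert all((b, a) in pairs for a, b in pairs)
# No element is ever paired with itself.
assert all(a != b for a, b in pairs)

print(len(pairs))
```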

Select unordered k-tuples within inputs excluding repeated elements using @combinations

Generates the combinations of all the elements of a set of inputs (e.g. A B C D):
  • r-length tuples of input elements
  • without repeated elements (A A)
  • where order of the tuples is irrelevant (either A B or B A, not both).

@combinations can be useful, for example, in calculating a transition probability matrix for a set of states. The diagonals are meaningless “self-self” transitions which are excluded.
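
Taking two states at a time, the pairs selected correspond to the cells above the diagonal of a square matrix. This is a plain-Python sketch using itertools rather than Ruffus; the names `states`, `index` and `visited` are illustrative.

```python
from itertools import combinations

states = ["A", "B", "C", "D"]
index = dict((state, i) for i, state in enumerate(states))

# Mark which cells of a 4 x 4 matrix are visited by combinations of 2.
visited = [[False] * len(states) for _ in states]
for first, second in combinations(states, 2):
    visited[index[first]][index[second]] = True

# "x" marks a visited cell, "." an unvisited one.
for row in visited:
    print(" ".join("x" if cell else "." for cell in row))
```

The diagonal ("self-self") cells and the mirror-image cells below it are never visited, which leaves 6 of the 16 cells.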

>>> from itertools import combinations
>>> # combinations('ABCD', 3) --> ABC ABD ACD BCD
>>> [ "".join(a) for a in combinations("ABCD", 3)]
['ABC', 'ABD', 'ACD', 'BCD']

This example calculates the @combinations of A,B,C,D files

from ruffus import *
from ruffus.combinatorics import *

#   initial file pairs
@originate([ ['A.1_start', 'A.2_start'],
             ['B.1_start', 'B.2_start'],
             ['C.1_start', 'C.2_start'],
             ['D.1_start', 'D.2_start']])
def create_initial_files_ABCD(output_files):
    for output_file in output_files:
        with open(output_file, "w") as oo: pass

#   @combinations
@combinations(create_initial_files_ABCD,      # Input
              formatter(),                    # match input files

              # tuple of 3 at a time
              3,

              # Output Replacement string
              "{path[0][0]}/"
              "{basename[0][1]}_vs_"
              "{basename[1][1]}_vs_"
              "{basename[2][1]}.combinations",

              # Extra parameter: path for 1st set of files, 1st file name
              "{path[0][0]}",

              # Extra parameter
              ["{basename[0][0]}",  # basename for 1st set of files, 1st file name
               "{basename[1][0]}",  #              2nd
               "{basename[2][0]}",  #              3rd
               ])
def combinations_task(input_file, output_parameter, shared_path, basenames):
    print(" - ".join(basenames))


#
#       Run
#
pipeline_run(verbose=0)

This results in:

>>> pipeline_run(verbose=0)
A - B - C
A - B - D
A - C - D
B - C - D

Select unordered k-tuples within inputs including repeated elements with @combinations_with_replacement

Generates the combinations with replacement of all the elements of a set of inputs (e.g. A B C D):
  • r-length tuples of input elements
  • including repeated elements (A A)
  • where order of the tuples is irrelevant (either A B or B A, not both).

@combinations_with_replacement can be useful, for example, in bioinformatics for finding evolutionary relationships between genetic elements such as proteins and genes. Self-self comparisons can be used as a baseline for scaling similarity scores.
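
One way self-self comparisons might serve as a baseline is sketched below in plain Python, without Ruffus. Everything here is a toy assumption for illustration: the sequences are made up, `raw_score` is a hypothetical similarity measure (a count of matching positions), and dividing by the smaller self-self score is only one of many possible scaling schemes.

```python
from itertools import combinations_with_replacement

# Hypothetical toy data: each "sequence" is just a short string.
sequences = {"A": "GATTACA", "B": "GATTAGA", "C": "CATTACA"}

def raw_score(s, t):
    """Toy similarity: number of positions holding the same character."""
    return sum(1 for x, y in zip(s, t) if x == y)

# Score every unordered pair, including each sequence against itself.
raw = {}
for a, b in combinations_with_replacement(sorted(sequences), 2):
    raw[(a, b)] = raw_score(sequences[a], sequences[b])

# Scale each score by the smaller of the two self-self baselines (assumed scheme).
scaled = {}
for (a, b), score in raw.items():
    baseline = min(raw[(a, a)], raw[(b, b)])
    scaled[(a, b)] = score / float(baseline)

for pair in sorted(scaled):
    print("%s - %s  %.2f" % (pair[0], pair[1], scaled[pair]))
```

Because the self-self pairs are generated alongside the others, the baselines are available without a separate pass over the inputs.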

>>> from itertools import combinations_with_replacement
>>> # combinations_with_replacement('ABCD', 2) --> AA AB AC AD BB BC BD CC CD DD
>>> [ "".join(a) for a in combinations_with_replacement('ABCD', 2)]
['AA', 'AB', 'AC', 'AD', 'BB', 'BC', 'BD', 'CC', 'CD', 'DD']

This example calculates the @combinations_with_replacement of A,B,C,D files

from ruffus import *
from ruffus.combinatorics import *

#   initial file pairs
@originate([ ['A.1_start', 'A.2_start'],
             ['B.1_start', 'B.2_start'],
             ['C.1_start', 'C.2_start'],
             ['D.1_start', 'D.2_start']])
def create_initial_files_ABCD(output_files):
    for output_file in output_files:
        with open(output_file, "w") as oo: pass

#   @combinations_with_replacement
@combinations_with_replacement(create_initial_files_ABCD,   # Input
              formatter(),                                  # match input files

              # tuple of 2 at a time
              2,

              # Output Replacement string
              "{path[0][0]}/"
              "{basename[0][1]}_vs_"
              "{basename[1][1]}.combinations_with_replacement",

              # Extra parameter: path for 1st set of files, 1st file name
              "{path[0][0]}",

              # Extra parameter
              ["{basename[0][0]}",  # basename for 1st set of files, 1st file name
               "{basename[1][0]}",  #              2nd
               ])
def combinations_with_replacement_task(input_file, output_parameter, shared_path, basenames):
    print(" - ".join(basenames))


#
#       Run
#
pipeline_run(verbose=0)

This results in:

>>> pipeline_run(verbose=0)
A - A
A - B
A - C
A - D
B - B
B - C
B - D
C - C
C - D
D - D