Chapter 20: Manipulating task inputs via string substitution using inputs() and add_inputs()
See also
- Manual Table of Contents
- inputs() syntax
- add_inputs() syntax
Note
Remember to look at the example code:
Overview
The previous chapters described how Ruffus allows the Output names for each job to be generated from the Input names via string substitution. This is how Ruffus can automatically chain multiple tasks in a pipeline together seamlessly.
Sometimes it is useful to be able to modify the Input by string substitution as well. There are two situations where this additional flexibility is needed:
- You need to add prerequisites or file names to the Input of every single job.
- You need to add Input file names which are some variant of the existing ones.
Both will be much more obvious with some examples.
Adding additional input prerequisites per job with add_inputs()
1. Example: compiling c++ code
Let us first compile some c++ ("*.cpp") files using plain @transform syntax:
    # source files exist before our pipeline
    source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]
    for source_file in source_files:
        open(source_file, "w")

    from ruffus import *

    @transform(source_files, suffix(".cpp"), ".o")
    def compile(input_filename, output_file):
        open(output_file, "w")

    pipeline_run()
2. Example: Adding a common header file with add_inputs()
    # source files exist before our pipeline
    source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]
    for source_file in source_files:
        open(source_file, "w")

    # common (universal) header exists before our pipeline
    open("universal.h", "w")

    from ruffus import *

    # make header files
    @transform(source_files, suffix(".cpp"), ".h")
    def create_matching_headers(input_file, output_file):
        open(output_file, "w")

    @transform(source_files, suffix(".cpp"),
               # add header to the input of every job
               add_inputs("universal.h",
                          # add result of task create_matching_headers to the input of every job
                          create_matching_headers),
               ".o")
    def compile(input_filename, output_file):
        open(output_file, "w")

    pipeline_run()

    >>> pipeline_run()
        Job = [hasty.cpp -> hasty.h] completed
        Job = [messy.cpp -> messy.h] completed
        Job = [tasty.cpp -> tasty.h] completed
    Completed Task = create_matching_headers
        Job = [[hasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> hasty.o] completed
        Job = [[messy.cpp, universal.h, hasty.h, messy.h, tasty.h] -> messy.o] completed
        Job = [[tasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> tasty.o] completed
    Completed Task = compile
3. Example: Additional Input can be tasks
We can also add a task name to add_inputs(). This chains the Output, i.e. run time results, of any previous task as an additional Input to every single job in the task.
    # make header files
    @transform(source_files, suffix(".cpp"), ".h")
    def create_matching_headers(input_file, output_file):
        open(output_file, "w")

    @transform(source_files, suffix(".cpp"),
               # add header to the input of every job
               add_inputs("universal.h",
                          # add result of task create_matching_headers to the input of every job
                          create_matching_headers),
               ".o")
    def compile(input_filenames, output_file):
        open(output_file, "w")

    pipeline_run()

    >>> pipeline_run()
        Job = [[hasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> hasty.o] completed
        Job = [[messy.cpp, universal.h, hasty.h, messy.h, tasty.h] -> messy.o] completed
        Job = [[tasty.cpp, universal.h, hasty.h, messy.h, tasty.h] -> tasty.o] completed
    Completed Task = compile
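The shape of the job parameters in that log can be mimicked in plain Python. This is only a sketch of the bookkeeping, not Ruffus code: the helper name `jobs_with_added_inputs` is hypothetical, and the header list is sorted purely so the order matches the log.

```python
# Plain-Python sketch (no Ruffus involved) of how every job receives
# the same extra Input entries on top of its own source file.
source_files = ["hasty.cpp", "tasty.cpp", "messy.cpp"]

# Stand-in for the Output of the upstream task: one header per source file.
# Sorted only so that the order matches the log shown above.
header_files = sorted(name[:-len(".cpp")] + ".h" for name in source_files)


def jobs_with_added_inputs(sources, extra_inputs):
    """Return one [inputs, output] pair per source file (hypothetical helper)."""
    jobs = []
    for source in sorted(sources):
        output = source[:-len(".cpp")] + ".o"
        # The original Input comes first, followed by the shared extras.
        jobs.append([[source] + list(extra_inputs), output])
    return jobs


jobs = jobs_with_added_inputs(source_files, ["universal.h"] + header_files)
for inputs, output in jobs:
    print(f"Job = [[{', '.join(inputs)}] -> {output}]")
```

Every job carries all three generated headers, which is exactly the behaviour the next example sets out to avoid.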
4. Example: Add corresponding files using add_inputs() with formatter or regex
The previous example created headers corresponding to our source files and added them as the Input to the compilation. That is generally not what you want. Instead, what is generally needed is a way to:
- Look up the exact corresponding header for the specific job, and not add all possible files to all jobs in a task. When compiling hasty.cpp, we just need to add hasty.h (and universal.h).
- Add a pre-existing file name (hasty.h already exists, so it should not be created via another task).
This is a surprisingly common requirement: In bioinformatics sometimes DNA or RNA sequence files come singly in *.fastq and sometimes in matching pairs: *1.fastq, *2.fastq etc. In the latter case, we often need to make sure that both sequence files are being processed in tandem. One way is to take one file name (*1.fastq) and look up the other.
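The "take one file name and look up the other" idea can be sketched in plain Python before involving Ruffus at all. The naming scheme below (read 1 ends in `1.fastq`, its mate in `2.fastq`), the sample file names, and the helper `mate_of` are assumptions made for illustration only.

```python
import re

# Assumed naming scheme: read 1 ends in "1.fastq", its mate in "2.fastq".
_READ1 = re.compile(r"^(?P<stem>.*)1\.fastq$")


def mate_of(read1_name):
    """Return the matching *2.fastq name, or None if this is not a *1.fastq file."""
    match = _READ1.match(read1_name)
    if match is None:
        return None
    return match.group("stem") + "2.fastq"


# Hypothetical file names: two paired samples and one single-ended file.
for name in ["sampleA_1.fastq", "sampleB_1.fastq", "single.fastq"]:
    print(name, "->", mate_of(name))
```

A single-ended file simply has no mate under this scheme, so the helper returns `None` rather than guessing a name.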
add_inputs() uses standard Ruffus string substitution via formatter and regex to look up (generate) Input file names. (As a rule suffix only substitutes Output file names.)
    @transform(source_files,
               formatter(".cpp$"),
               # corresponding header for each source file
               add_inputs("{basename[0]}.h",
                          # add header to the input of every job
                          "universal.h"),
               "{basename[0]}.o")
    def compile(input_filenames, output_file):
        open(output_file, "w")
This script gives the following output:
    >>> pipeline_run()
        Job = [[hasty.cpp, hasty.h, universal.h] -> hasty.o] completed
        Job = [[messy.cpp, messy.h, universal.h] -> messy.o] completed
        Job = [[tasty.cpp, tasty.h, universal.h] -> tasty.o] completed
    Completed Task = compile
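To make the substitution itself concrete, here is a plain-Python approximation of how a `{basename[0]}` template could be expanded for each source file. It is not how Ruffus is implemented: `expand` and `compile_job` are hypothetical helpers, and the sketch only models the `basename` field.

```python
import os


def expand(template, source):
    """Fill a '{basename[0]}'-style template from one source file name."""
    base = os.path.splitext(os.path.basename(source))[0]
    # A one-element list lets str.format resolve the "[0]" index in the template.
    return template.format(basename=[base])


def compile_job(source):
    """Return [inputs, output] for one source file (hypothetical helper)."""
    inputs = [source, expand("{basename[0]}.h", source), "universal.h"]
    return [inputs, expand("{basename[0]}.o", source)]


for source in ["hasty.cpp", "messy.cpp", "tasty.cpp"]:
    inputs, output = compile_job(source)
    print(f"Job = [[{', '.join(inputs)}] -> {output}]")
```

Each job now sees only its own header plus the shared one, in contrast to the earlier example where every job received every header.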
Replacing all input parameters with inputs()
The previous examples all added to the set of Input file names. Sometimes it is necessary to replace all the Input parameters altogether.
5. Example: Running matching python scripts using inputs()
Here is a contrived example: we wish to find all cython/python files which have been compiled into corresponding c++ source files. Instead of compiling the c++, we shall invoke the corresponding python scripts.
Given three c++ files and their corresponding python scripts:
    @transform(source_files,
               formatter(".cpp$"),
               # corresponding python file for each source file
               inputs("{basename[0]}.py"),
               "{basename[0]}.results")
    def run_corresponding_python(input_filenames, output_file):
        open(output_file, "w")
The Ruffus code will call each python script corresponding to its c++ counterpart:
    >>> pipeline_run()
        Job = [hasty.py -> hasty.results] completed
        Job = [messy.py -> messy.results] completed
        Job = [tasty.py -> tasty.results] completed
    Completed Task = run_corresponding_python
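The difference between adding to and replacing the Input can be summarised with a small plain-Python model. This is a simplified illustration, not Ruffus behaviour in full: `job_inputs` is a hypothetical helper and the `replace` flag stands in for the choice between inputs() and add_inputs().

```python
def job_inputs(source, generated, replace):
    """Return the Input one job would receive in this simplified model."""
    if replace:
        # Models inputs(): the original Input is discarded entirely.
        return generated
    # Models add_inputs(): the original Input is kept, the new name follows it.
    return [source, generated]


for source in ["hasty.cpp", "messy.cpp", "tasty.cpp"]:
    script = source[:-len(".cpp")] + ".py"
    print("replace:", job_inputs(source, script, replace=True))
    print("add:    ", job_inputs(source, script, replace=False))
```

With `replace=True` the c++ file name no longer reaches the job at all, which matches the single-file Input shown in the log above.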