Nextflow is both a reactive workflow framework and a domain-specific language (DSL). It is gaining a lot of traction in bioinformatics, thanks in large part to the nf-core open source community that develops and publishes reusable workflows for many use cases.
To start learning Nextflow, I worked through Andrew Severin’s excellent Creating a NextFlow workflow tutorial. (The tutorial follows the older DSL1 specification of Nextflow, but only a few small modifications were needed to run it under DSL2.)
The DSL2 code I wrote is here and these are notes I took while working through the tutorial:
To make a variable a pipeline parameter, prepend it with `params.`, then specify it on the command line:

main.nf:

```
#! /usr/bin/env nextflow

params.query = "file.fasta"
println "Querying file $params.query"
```

shell command:

```
nextflow run main.nf --query other_file.fasta
```

The `-log` argument directs logging to the specified file:

```
nextflow -log nextflow.log run main.nf
```

To clean up intermediate files automatically upon workflow completion, use the `cleanup` parameter within a profile:

```
profiles {
  standard { cleanup = true }
  debug    { cleanup = false }
}
```

- By convention the `standard` profile is implicitly used when no other profile is specified by the user.
- Cleaning up intermediate files precludes the use of `-resume` (an example follows below).
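For instance, with the profiles above, switching to the `debug` profile keeps the intermediate files so a failed run can be resumed (a hypothetical session, not from the tutorial):

```
# cleanup is disabled in the debug profile, so work/ is preserved
nextflow run main.nf -profile debug --query other_file.fasta

# after fixing the problem, resume from the cached tasks
nextflow run main.nf -profile debug --query other_file.fasta -resume
```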
The `nextflow.config` file sets the global parameters, e.g.

- process
- manifest
- executor
- profiles
- docker
- singularity
- timeline
- report
- etc
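A minimal sketch of what such a `nextflow.config` could look like (the values are placeholders, not the tutorial's):

```
// nextflow.config -- illustrative values only
params.outdir = './results'

process {
  cpus   = 2
  memory = '4 GB'
}

executor {
  queueSize = 10                      // max tasks submitted at once
}

timeline {
  enabled = true
  file    = "$params.outdir/timeline.html"
}

manifest {
  mainScript = 'main.nf'
  version    = '0.1.0'
}
```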
Contents of the `work` folder for a nextflow task:

- `.command.begin` is the begin script, if you have one.
- `.command.err` is useful when the task crashes.
- `.command.run` is the full nextflow pipeline that was run; this is helpful when troubleshooting a nextflow error rather than a script error.
- `.command.sh` shows what was run.
- `.exitcode` will have the exit code in it.
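For example, to poke around a failed task's work directory (the hash path below is made up; the real one appears in the error message or in the output of `nextflow log`):

```
ls -a work/3f/9d2a1c.../            # the hidden .command.* files live here
cat work/3f/9d2a1c.../.command.sh   # the exact script that was executed
cat work/3f/9d2a1c.../.command.err  # stderr of the failing task
cat work/3f/9d2a1c.../.exitcode     # the task's exit status
```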
Displaying help messages
main.nf:

```
def helpMessage() {
  log.info """
  Usage:

  The typical command for running the pipeline is as follows:
    nextflow run main.nf --query QUERY.fasta --dbDir "blastDatabaseDirectory" --dbName "blastPrefixName"

  Mandatory arguments:
    --query   Query fasta file of sequences you wish to BLAST
    --dbDir   BLAST database directory (full path required)
    [...]
  """
}

// Show help message
if (params.help) {
  helpMessage()
  exit 0
}
```

shell command:

```
nextflow run main.nf --help
```
The `publishDir` directive accepts arguments like `mode` and `pattern` to fine-tune its behavior, e.g.

```
output:
file("${label}/short_summary.specific.*.txt")

publishDir "${params.outdir}/BUSCOResults/${label}/", mode: 'copy', pattern: "${label}/short_summary.specific.*.txt"
```

DSL2 allows piping, e.g.
```
workflow {
  res = Channel
    .fromPath(params.query)
    .splitFasta(by: 1, file: true)
    | runBlast

  res.collectFile(name: 'blast_output_combined.txt', storeDir: params.outdir)
}
```

Add a timeline report to the output with

```
timeline {
  enabled = true
  file    = "$params.outdir/timeline.html"
}
```

(in `nextflow.config`).

Add a detailed execution report with

```
report {
  enabled = true
  file    = "$params.outdir/report.html"
}
```

(in `nextflow.config`).

Include a profile-specific configuration file:
nextflow.config:

```
profiles {
  slurm { includeConfig './configs/slurm.config' }
}
```

configs/slurm.config:

```
process {
  executor = 'slurm'
  clusterOptions = '-N 1 -n 16 -t 24:00:00'
}
```

and use it via

```
nextflow run main.nf -profile slurm
```

Similarly, refer to a test profile, specified in a separate file:
nextflow.config:

```
profiles {
  test { includeConfig './configs/test.config' }
}
```
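The included file would typically pin small test inputs; a hypothetical sketch (none of these paths come from the tutorial):

```
// configs/test.config -- made-up contents for a quick smoke test
params {
  query  = "$projectDir/test-data/tiny.fasta"
  dbDir  = "$projectDir/test-data/blastDB"
  dbName = "tinyDB"
}
```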
Adding a manifest to `nextflow.config`:

```
manifest {
  name = 'isugifNF/tutorial'
  author = 'Andrew Severin'
  homePage = 'www.bioinformaticsworkbook.org'
  description = 'nextflow bash'
  mainScript = 'main.nf'
  version = '1.0.0'
}
```
Using a `label` for a process allows granular control of a process’ configuration:

main.nf:

```
process runBlast {
  label 'blast'
}
```

nextflow.config:

```
process {
  executor = 'slurm'
  clusterOptions = '-N 1 -n 16 -t 02:00:00'

  withLabel: blast {
    module = 'blast-plus'
  }
}
```

- The `label` has to be placed before the `input` section.
Loading a `module` specifically for a process:

```
process runBlast {
  module = 'blast-plus'
  publishDir "${params.outdir}/blastout"

  input:
  path queryFile

  . . . // these three dots mean I didn't paste the whole process.
}
```

Enabling
`docker` in the `nextflow.config`:

```
docker {
  enabled = true
}
```

- The docker container can be specified in the process, e.g. `container = 'ncbi/blast'` or `container = 'quay.io/biocontainers/blast:2.2.31--pl526he19e7b1_5'`
- We can include additional options to pass to the container as well:
  `containerOptions = "--bind $launchDir/$params.outdir/config:/augustus/config"`
- `projectDir` refers to the directory where the main workflow script is located. (It used to be called `baseDir`.)
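For instance, `projectDir` is handy for anchoring paths to files shipped with the pipeline, regardless of where the run is launched from (the paths below are made-up examples):

```
// main.nf
println "Pipeline directory: $projectDir"

// e.g. a default input bundled with the repository (hypothetical path)
params.adapters = "$projectDir/assets/adapters.fa"
```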
Referring to local directories from within a docker container: create a channel

- Working in containers, we need a way to pass the database file location directly into the `runBlast` process, without relying on the local path (see the sketch below).
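A partial sketch of the idea, using the same names as the workflow shown further below: the database directory is declared as a `path` input, so Nextflow stages it into the task's work directory where the container can see it (the `blastn` command line here is only illustrative, not the tutorial's):

```
// in the workflow: turn the database directory into a reusable value channel
dbDir_ch = channel.fromPath(params.dbDir).first()

process runBlast {
  container 'ncbi/blast'      // image mentioned above

  input:
  path queryFile
  path dbDir                  // staged into the work dir, visible inside the container
  val  dbName

  output:
  path 'blast_output.txt'

  script:
  """
  blastn -db $dbDir/$dbName -query $queryFile -outfmt 6 -out blast_output.txt
  """
}
```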
Repeating a process over each element of a channel with `each`: input repeaters (a small sketch below).
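A minimal made-up example of an `each` input repeater: the process runs once for every combination of an incoming file and each value in the list (the process name and values are hypothetical):

```
process align {
  input:
  path fastaFile
  each mode                  // input repeater: one task per value

  output:
  path "${fastaFile.baseName}.${mode}.txt"

  script:
  """
  echo "aligning $fastaFile in $mode mode" > ${fastaFile.baseName}.${mode}.txt
  """
}

workflow {
  files_ch = channel.fromPath(params.query).splitFasta(by: 1, file: true)
  align(files_ch, ['fast', 'sensitive'])
}
```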
Turning a queue channel into a value channel, which can be used multiple times.
- A value channel is implicitly created by a process when it is invoked with a simple value.
- A value channel is also implicitly created as output for a process whose inputs are all value channels.
- A queue channel can be converted into a value channel by returning a single value, using e.g. `first`, `last`, `collect`, `count`, `min`, `max`, `reduce`, `sum`, etc.

For example, the `runBlast` process receives three inputs in the following example:

- the `queryFile_ch` queue channel, with multiple sequences.
- the `dbDir_ch` value channel, created by calling `.first()`, which is reused for all elements of `queryFile_ch`.
- the `dbName_ch` value channel, which is also reused for all elements of `queryFile_ch`.
```
workflow {
  channel.fromPath(params.dbDir).first()
    .set { dbDir_ch }

  channel.from(params.dbName).first()
    .set { dbName_ch }

  queryFile_ch = channel
    .fromPath(params.query)
    .splitFasta(by: 1, file: true)

  res = runBlast(queryFile_ch, dbDir_ch, dbName_ch)
  res.collectFile(name: 'blast_output_combined.txt', storeDir: params.outdir)
}
```
Additional resources

This work is licensed under a Creative Commons Attribution 4.0 International License.