QuantSeq RNAseq analysis (1): configuring the nf-core/rnaseq workflow

nextflow
NGS
Author

Thomas Sandmann

Published

January 16, 2023

Note

This is the first of four posts documenting my progress toward processing and analyzing QuantSeq FWD 3’ tag RNAseq data with the nf-core/rnaseq workflow.

  1. Configuring the nf-core/rnaseq workflow
  2. Exploring the workflow outputs
  3. Validating the workflow by reproducing results published by Xia et al (no UMIs)
  4. Validating the workflow by reproducing results published by Nugent et al (including UMIs)

Many thanks to Harshil Patel, António Miguel de Jesus Domingues and Matthias Zepper for their generous guidance & input via nf-core slack. (Any mistakes are mine.)

tl;dr

Installing Nextflow

To run nextflow, we create a new conda environment. (See conda installation instructions if you are not yet familiar with the conda package manager.) Then we install nextflow and its dependencies from the bioconda repository.

> conda create -n nf
> conda install -c bioconda nextflow

Once the installation is complete, we activate our new conda environment and verify that Nextflow is installed:

> conda activate nf
> nextflow -v
nextflow version 22.10.4.5836

Nextflow configs

Nextflow separates the definition of a workflow from the platform it is executed on. For example, you can run the same workflow on your local computer, a high-performance cluster or using a cloud provider like Amazon Web Services (AWS) or Google Cloud Services (GCS).

Each nf-core workflow defines the required software sources, which can either be installed (e.g. via conda) or be made available through a container engine such as docker, singularity, etc. For details, please consult the Nextflow documentation.

The desired configuration is specified in one or more configuration file(s), and passed to the nextflow command via its -profile argument. Here, we execute the workflows on the local system, using docker containers by specifying the built-in -profile docker configuration.

Analysis of published studies

To validate the nf-core/rnaseq workflow for QuantSeq 3’ tag RNA-seq data, we reanalyze data from two published studies, one with and one without the use of unique molecular identifiers (UMIs).

  1. “Fibrillar Aβ causes profound microglial metabolic perturbations in a novel APP knock-in mouse model” by Xia et al, 2021
  1. “TREM2 regulates microglial lipid metabolism during aging in mice” by Nugent et al, 2020.

Quantseq FWD without UMIs: Xia et al dataset

Here, we will reanalyze RNA-seq data published in the preprint “Fibrillar Aβ causes profound microglial metabolic perturbations in a novel APP knock-in mouse model” by Xia et al, 2021.

Experimental design

This study examines FACS-isolated microglia from 18 mice from three genetic backgrounds (N = 6 WT, N =6 heterozygous App-SAA and N = 6 homozygous App-SAA animals). Microglia were isolated from the brain (cortex & hippocampus) of each animal using FACS and gene expression changes were analyzed using 3-tag RNA-seq.

From each mouse, two (technical) replicate pools of microglia were collected and processed into separate libraries with Lexogen’s QuantSeq FWD. This dataset does not include unique molecular identifiers (UMIs).

Raw data retrieval with nf-core/fetchngs

We start the analysis by downloading the raw data (36 FASTQ files) and the sample metadata with the nf-core/fetchngs workflow. For datasets stored in SRA, we only need to paste the SRA Study identifier (SRP282921) into a text file for us as the input for the fetchngs workflow. The following command will download all of the FASTQ files associated with this study into the raw_data directory:

> mkdir SRP282921
> cd SRP282921
> echo SRP282921 > ids.txt
> nextflow run nf-core/fetchngs \
      -revision 1.9 \
      -profile docker \
      --outdir raw_data \
      --input ids.txt \
      --nf_core_pipeline rnaseq \
      --nf_core_rnaseq_strandedness forward

By specifying the --nf_core_pipeline argument, the workflow creates a sample sheet that can be used as input for the nf-core/rnaseq pipeline. Because the data was generated with the QuantSeq FWD kit, we also know that the reads are from the forward strand and add this information to the sample sheet via the --nf_core_rnaseq_strandedness argument. 1

Once the workflow has completed, the raw_data output directory contains the following structure:

> tree raw_data
raw_data/
├── custom
│   └── user-settings.mkfg
├── fastq
│   ├── md5
│   │   ├── SRX9142647_SRR12661938.fastq.gz.md5
│   │   ├── SRX9142648_SRR12661924.fastq.gz.md5
│   │   ├── 34 additional .fastq.gz.md5 files (not shown)
│   ├── SRX9142647_SRR12661938.fastq.gz
│   ├── SRX9142648_SRR12661924.fastq.gz
│   ├── 34 additional .fastq.gz files (not shown)
├── metadata
│   └── SRP282921.runinfo_ftp.tsv
├── pipeline_info
│   ├── execution_report_2023-01-12_01-49-02.html
│   ├── execution_timeline_2023-01-12_01-49-02.html
│   ├── execution_trace_2023-01-12_01-49-02.txt
│   ├── pipeline_dag_2023-01-12_01-49-02.html
│   └── software_versions.yml
└── samplesheet
    ├── id_mappings.csv
    ├── multiqc_config.yml
    ├── nf_params.json
    ├── run.sh
    └── samplesheet.csv
6 directories, 90 files
  • The 36 FASTQ files are in the fastq subfolder, each accompanied by a MD5 checksum file in the md5 directory.
  • In the samplesheet subdirectory, we find the samplesheet.csv file that we can pass to the nf-core/rnaseq workflow.

For more details about the output oft he workflow, please check the next post in this series] and the nf-core/rnaseq output docs.

Configuring the nf-core/rnaseq workflow

Reference data

According to the sample metadata in NCBI GEO the authors mapped the data to the mouse GRCm38 genome version, using Gencode annotations from release M17. To later compare our results with those obtained in the original publication, we use the same Gencode version (even though it is not the most recent).

The nf-core/rnaseq workflow can automatically generate all necessary indices when provided with

  1. A FASTA file with the genomic sequences, e.g. Gencode’s Primary genome assembly and
  2. A GFP file with matching gene annotations, e.g.  Gencode Comprehensive gene annotations

Because downloading and indexing the genome is time consuming, we will save them for later use.

Custom arguments

The nf-core/rnaseq pipeline has robust default parameters that are suitable for whole transcriptome RNA-seq data. But this dataset was generated with Lexogen’s QuantSeq FWD 3’ mRNA library preparation kit, which specificially targets the 3’ end of polyadenylated transcripts. As a consequence, reads align primarily to the 3’ UTR and the final exon and coverage of the remainder of the gene body is minimal.

To account for these properties, we supply a number of custom arguments to the nextflow run command, which augment the workflow for this specific data type.

Extra STAR arguments

Lexogen provides an example analysis workflow on their website, which uses the STAR aligner with several non-default parameters. Most of them parameters correspond to the ENCODE standard options listed in the STAR manual:

  • --alignIntronMax 1000000
  • --alignIntronMin 20
  • --alignMatesGapMax 1000000
  • --alignSJoverhangMin 8
  • --alignSJDBoverhangMin 1
  • --outFilterMismatchNmax 999
  • --outFilterMultimapNmax 20
  • --outFilterType BySJout

The ENCODE project originally focussed on data from human and mouse. The parameters shown above are suitable for these genomes or those with comparable characteristics. For analysis of data from other organisms, e.g. those with shorter introns, you might need to modify them.

In addition, Lexogen also specified the

  • --outFilterMismatchNoverLmax 0.1, decreasing the tolerance for mismatches.

Lexogen’s example workflow uses bbduk to trim adapters, poly(A) tails and low quality bases from the 3’ end of reads. The nf-core/rnaseq workflow uses Trim Galore to trim adapters, but does not remove poly(A) tails. Instead we use the STAR ligner to clip poly(A) tails during the alignment stage:

  • --clip3pAdapterSeq AAAAAAAA

We will pass all of the arguments listed above to STAR via nf-core/rnaseq’s --extra_star_align_args argument (as a string, see example below).

Extra Salmon arguments

The salmon algorithm takes into account transcript length when quantifying gene expression. Because QuantSeq (and other 3’ tag sequencing methods) only capture the 3’ most part of each transcript, not its full length, this feature needs to be deactivated:

  • --noLengthCorrection

The nf-core/rnaseq workflow features a special argument, extra_salmon_quant_args to pass additional arguments to the salmon tool (see syntax below).

Starting the workflow

We can pass the parameters we discussed above as individual arguments to the nextflow run command (prefixed with --). Alternatively, we can collect them in a JSON file and specify its name via the -params-file argument.

For example, the following JSON string specifies the parameters for analyzing sQuantSeq FWD data without UMIs:

{
    "save_reference": true,
    "fasta": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/GRCm38.primary_assembly.genome.fa.gz",
    "gtf": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/gencode.vM17.primary_assembly.annotation.gtf.gz",
    "extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
    "with_umi": false,
    "extra_salmon_quant_args": "--noLengthCorrection",
    "skip_stringtie": true,
    "gencode": true
}

We store the JSON string in a file called nf_params.json, and then pass it to the nextflow run command 2:

> nextflow run \
      nf-core/rnaseq \
      -r 3.10.1 \
      -profile docker \
      -params-file nf_params.json \
      -resume \
      --input raw_data/samplesheet/samplesheet.csv \
      --outdir SRP282921

Examining the output files

Reusing reference files and indices

Retrieving the reference files and indexing them is time consuming. If you are planning to map additional sample against the same genome / transcriptome in the future, you might want to reuse them. Because we set the save_reference argument to true in our workflow, the output directory contains a genome folder, featuring reference files (e.g. genome and transcriptome sequences in FASTQ formats) and indices (e.g. for STAR, salmon and rsem).

Let’s move the genome folder to our current working directory (or any other suitable location). 3 For future runs, we can include paths to the files and indices in our nf_params.json file, e.g.

{
    "save_reference": false,
    "fasta": "genome/GRCm38.primary_assembly.genome.fa",
    "gtf": "genome/gencode.vM17.primary_assembly.annotation.gtf",
    "gene_bed": "genome/gencode.vM17.primary_assembly.annotation.bed",
    "transcript_fasta": "genome/genome.transcripts.fa",
    "star_index": "genome/index/star",
    "salmon_index": "genome/index/salmon",
    "rsem_index": "genome/rsem",
    "extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
    "with_umi": false,
    "extra_salmon_quant_args": "--noLengthCorrection",
    "skip_stringtie": true,
    "gencode": true
}

Our next run will execute much faster, because these reference files and indices are already available and don’t need to be created from scratch.

Note

If you chose a location other than genome to store your reference data, please specify the (absolute) path in the JSON file instead.

Quantseq FWD with UMIs: Nugent et al dataset

Especially when only low amounts of input material (total RNA) are available, the QuantSeq FWD procotol requires multiple PCR amplification steps. Each step generates clones of the originally captured tags, but does not provide new information about transcript abundance. Inclusion of unique molecular identifiers (UMIs) in the second strand synthesis guarantees that PCR duplicates can be identified and removed in the analysis (based on both the coordinates of their alignment and the UMI sequence).

Nugent et al used the QuantSeq FWD protocol with UMIs, and we will reanalyze this dataset to illustrate how to instruct UMI-tools to extract the UMIs from the raw reads and deduplicated the STAR alignments.

Experimental design

This study examines FACS-isolated astroctyes and microglia from female mice that were either 2- or 16 months of age. The animals are either wildtype (WT) or
homozygous knockouts (KO) for the Trem2 gene.

From each mouse, separate microglia and astrocyte samples were collected and processed into separate libraries with Lexogen’s QuantSeq FWD kit and the QuantSeq UMI add-on module.

Raw data retrieval with nf-core/fetchngs

As in the first example, we use the nf-core/fetchngs workflow to retrieve the FASTQ files from SRA.

> mkdir SRP213880
> cd SRP213880
> echo SRP213880 > ids.txt
> nextflow run nf-core/fetchngs \
      -revision 1.9 \
      -profile docker \
      --outdir raw_data \
      --input ids.txt \
      --nf_core_pipeline rnaseq \
      --nf_core_rnaseq_strandedness forward

Configuring the nf-core/rnaseq workflow

Reference data

According to the sample metadata in NCBI GEO the authors used the same reference data as Xia et al, Gencode release M17. We can therefore reuse the reference files and indices we generated in the first analysis (see above).

Unique molecular identifiers

To process QuantSeq data with UMIs, the following parameters need to be provided.

  • --with_umi
  • --umitools_extract_method "regex": The type of pattern to detect, either string (default) or regex
  • --umitools_bc_pattern "^(?P<umi_1>.{6})(?P<discard_1>.{4}).*": A regular expression that instructs UMI-tools to extract the first 6 bases (the UMI) from the 5’ end of the read 4 and to discard the following 4 bases. (Lexogen adds an invariant TATA sequence motif to each UMI.)
  • --umitools_grouping_method "unique": The method used to group similar UMIs, e.g. to correct sequencing errors. The default method is directional, but it is computationally expensive. Here, we use the simpler unique method, suppressing grouping / error correction entirely.
Note

These arguments shown above are suitable for UMIs offered by Lexogen for the QuantSeq protocol. If you use custom UMIs, please consult the nf-core/rnaseq documentation for full details and additional options for details on additional parameters you can set.

Starting the workflow

Next, we add the UMI-related parameters to the JSON file we used above, so we can pass them to nextflow run via the -params-file argument. Because Nugent et al used the same reference (Gencode M17), we can reuse the same reference we generated for Xia et al’s dataset. (Here, we assume the precomputed references are in the genome folder to our current working directory. If you chose to store them elsewhere, please provide absolute paths in the JSON file instead.)

{
    "save_reference": false,
    "fasta": "genome/GRCm38.primary_assembly.genome.fa",
    "gtf": "genome/gencode.vM17.primary_assembly.annotation.gtf",
    "gene_bed": "genome/gencode.vM17.primary_assembly.annotation.bed",
    "transcript_fasta": "genome/genome.transcripts.fa",
    "star_index": "genome/index/star",
    "salmon_index": "genome/index/salmon",
    "rsem_index": "genome/rsem",
    "extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
    "with_umi": true,
    "umitools_extract_method": "regex",
    "umitools_grouping_method": "unique",
    "umitools_bc_pattern": "^(?P<umi_1>.{6})(?P<discard_1>.{4}).*",
    "extra_salmon_quant_args": "--noLengthCorrection",
    "skip_stringtie": true,
    "gencode": true
}

We store the JSON string in a file called nf_params.json, and then pass it to the nextflow run command:

> nextflow run \
      nf-core/rnaseq \
      -r 3.10.1 \
      -profile docker \
      -params-file nf_params.json \
      -resume \
      --input raw_data/samplesheet/samplesheet.csv \
      --outdir SRP213880

Upon successful completion, the workflow’s output is available in the SRP213880 directory.

Next, we will take a closer look at the available results in the second post in this series].

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

  1. The strandedness column in the sample sheet CSV file must contain one of forward, reverse, unstranded or auto. The latter will prompt the nf-core/rnaseq workflow to subsample reads and map them to the transcriptome with salmon to determine the likely orientation of the reads. Because this adds additional steps to the workflow execution, it is recommended that you specify the strandedness of your library explicitly whenever possible.↩︎

  2. The -resume argument allows you to resume previous executions of the same workflow. Nextflow will reuse any existing outputs and generate only those that are missing.↩︎

  3. Nextflow can interact with different storage backends. For example, you could provide URLs of remote files (e.g. with the ftp:// prefix) or in cloud storage (e.g. with the s3:// or gs:// prefix) and they will automatically be downloaded and staged when you start the run.↩︎

  4. Typically, QuantSeq libraries are single-end (not paired-end) and the UMI is therefore found in the R1 read.↩︎