This is the first of four posts documenting my progress toward processing and analyzing QuantSeq FWD 3’ tag RNAseq data with the nf-core/rnaseq workflow.
- Configuring the nf-core/rnaseq workflow
- Exploring the workflow outputs
- Validating the workflow by reproducing results published by Xia et al (no UMIs)
- Validating the workflow by reproducing results published by Nugent et al (including UMIs)
Many thanks to Harshil Patel, António Miguel de Jesus Domingues and Matthias Zepper for their generous guidance & input via nf-core slack. (Any mistakes are mine.)
tl;dr
- This tutorial documents how to configure & execute the nf-core/rnaseq workflow for the processing of raw QuantSeq 3’ FWD RNA-seq data.
- It highlights custom nf-core/rnaseq parameters settings and applies them to two published datasets, one with and one without UMIs.
Installing Nextflow
To run nextflow, we create a new conda environment. (See conda installation instructions if you are not yet familiar with the conda package manager.) Then we install nextflow and its dependencies from the bioconda repository.
> conda create -n nf
> conda install -c bioconda nextflow
Once the installation is complete, we activate our new conda environment and verify that Nextflow is installed:
> conda activate nf
> nextflow -v
nextflow version 22.10.4.5836
Nextflow configs
Nextflow separates the definition of a workflow from the platform it is executed on. For example, you can run the same workflow on your local computer, a high-performance cluster or using a cloud provider like Amazon Web Services (AWS) or Google Cloud Services (GCS).
Each nf-core workflow defines the required software sources, which can either be installed (e.g. via conda) or be made available through a container engine such as docker, singularity, etc. For details, please consult the Nextflow documentation.
The desired configuration is specified in one or more configuration file(s), and passed to the nextflow
command via its -profile
argument. Here, we execute the workflows on the local system, using docker containers by specifying the built-in -profile docker
configuration.
Analysis of published studies
To validate the nf-core/rnaseq workflow for QuantSeq 3’ tag RNA-seq data, we reanalyze data from two published studies, one with and one without the use of unique molecular identifiers (UMIs).
- “Fibrillar Aβ causes profound microglial metabolic perturbations in a novel APP knock-in mouse model” by Xia et al, 2021
- QuantSeq 3’ mRNA-Seq Library Prep Kit FWD FWD
- This data does not include UMIs.
- Raw data are available via
- “TREM2 regulates microglial lipid metabolism during aging in mice” by Nugent et al, 2020.
- QuantSeq 3’ mRNA-Seq Library Prep Kit FWD FWD
- This data includes UMIs, which and can be used to filter alignments originating from PCR duplicates.
- Raw data are available via
Quantseq FWD without UMIs: Xia et al dataset
Here, we will reanalyze RNA-seq data published in the preprint “Fibrillar Aβ causes profound microglial metabolic perturbations in a novel APP knock-in mouse model” by Xia et al, 2021.
Experimental design
This study examines FACS-isolated microglia from 18 mice from three genetic backgrounds (N = 6 WT, N =6 heterozygous App-SAA and N = 6 homozygous App-SAA animals). Microglia were isolated from the brain (cortex & hippocampus) of each animal using FACS and gene expression changes were analyzed using 3-tag RNA-seq.
From each mouse, two (technical) replicate pools of microglia were collected and processed into separate libraries with Lexogen’s QuantSeq FWD. This dataset does not include unique molecular identifiers (UMIs).
Raw data retrieval with nf-core/fetchngs
We start the analysis by downloading the raw data (36 FASTQ files) and the sample metadata with the nf-core/fetchngs workflow. For datasets stored in SRA, we only need to paste the SRA Study identifier
(SRP282921) into a text file for us as the input
for the fetchngs workflow. The following command will download all of the FASTQ files associated with this study into the raw_data
directory:
> mkdir SRP282921
> cd SRP282921
> echo SRP282921 > ids.txt
> nextflow run nf-core/fetchngs \
-revision 1.9 \
-profile docker \
--outdir raw_data \
--input ids.txt \
--nf_core_pipeline rnaseq \
--nf_core_rnaseq_strandedness forward
By specifying the --nf_core_pipeline
argument, the workflow creates a sample sheet that can be used as input for the nf-core/rnaseq
pipeline. Because the data was generated with the QuantSeq FWD kit, we also know that the reads are from the forward
strand and add this information to the sample sheet via the --nf_core_rnaseq_strandedness
argument. 1
Once the workflow has completed, the raw_data
output directory contains the following structure:
> tree raw_data
raw_data/
├── custom
│ └── user-settings.mkfg
├── fastq
│ ├── md5
│ │ ├── SRX9142647_SRR12661938.fastq.gz.md5
│ │ ├── SRX9142648_SRR12661924.fastq.gz.md5
│ │ ├── 34 additional .fastq.gz.md5 files (not shown)
│ ├── SRX9142647_SRR12661938.fastq.gz
│ ├── SRX9142648_SRR12661924.fastq.gz
│ ├── 34 additional .fastq.gz files (not shown)
├── metadata
│ └── SRP282921.runinfo_ftp.tsv
├── pipeline_info
│ ├── execution_report_2023-01-12_01-49-02.html
│ ├── execution_timeline_2023-01-12_01-49-02.html
│ ├── execution_trace_2023-01-12_01-49-02.txt
│ ├── pipeline_dag_2023-01-12_01-49-02.html
│ └── software_versions.yml
└── samplesheet
├── id_mappings.csv
├── multiqc_config.yml
├── nf_params.json
├── run.sh
└── samplesheet.csv
6 directories, 90 files
- The 36 FASTQ files are in the
fastq
subfolder, each accompanied by a MD5 checksum file in themd5
directory. - In the
samplesheet
subdirectory, we find thesamplesheet.csv
file that we can pass to the nf-core/rnaseq workflow.
For more details about the output oft he workflow, please check the next post in this series] and the nf-core/rnaseq output docs.
Configuring the nf-core/rnaseq workflow
Reference data
According to the sample metadata in NCBI GEO the authors mapped the data to the mouse GRCm38 genome version, using Gencode annotations from release M17. To later compare our results with those obtained in the original publication, we use the same Gencode version (even though it is not the most recent).
The nf-core/rnaseq
workflow can automatically generate all necessary indices when provided with
- A FASTA file with the genomic sequences, e.g. Gencode’s Primary genome assembly and
- A GFP file with matching gene annotations, e.g. Gencode Comprehensive gene annotations
Because downloading and indexing the genome is time consuming, we will save them for later use.
Custom arguments
The nf-core/rnaseq
pipeline has robust default parameters that are suitable for whole transcriptome RNA-seq data. But this dataset was generated with Lexogen’s QuantSeq FWD 3’ mRNA library preparation kit, which specificially targets the 3’ end of polyadenylated transcripts. As a consequence, reads align primarily to the 3’ UTR and the final exon and coverage of the remainder of the gene body is minimal.
To account for these properties, we supply a number of custom arguments to the nextflow run
command, which augment the workflow for this specific data type.
Extra STAR arguments
Lexogen provides an example analysis workflow on their website, which uses the STAR aligner with several non-default parameters. Most of them parameters correspond to the ENCODE standard options listed in the STAR manual:
--alignIntronMax 1000000
--alignIntronMin 20
--alignMatesGapMax 1000000
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--outFilterMultimapNmax 20
--outFilterType BySJout
The ENCODE project originally focussed on data from human and mouse. The parameters shown above are suitable for these genomes or those with comparable characteristics. For analysis of data from other organisms, e.g. those with shorter introns, you might need to modify them.
In addition, Lexogen also specified the
--outFilterMismatchNoverLmax 0.1
, decreasing the tolerance for mismatches.
Lexogen’s example workflow uses bbduk to trim adapters, poly(A) tails and low quality bases from the 3’ end of reads. The nf-core/rnaseq workflow uses Trim Galore to trim adapters, but does not remove poly(A) tails. Instead we use the STAR ligner to clip poly(A) tails during the alignment stage:
--clip3pAdapterSeq AAAAAAAA
We will pass all of the arguments listed above to STAR via nf-core/rnaseq’s --extra_star_align_args
argument (as a string, see example below).
Extra Salmon arguments
The salmon algorithm takes into account transcript length when quantifying gene expression. Because QuantSeq (and other 3’ tag sequencing methods) only capture the 3’ most part of each transcript, not its full length, this feature needs to be deactivated:
--noLengthCorrection
The nf-core/rnaseq workflow features a special argument, extra_salmon_quant_args
to pass additional arguments to the salmon
tool (see syntax below).
Starting the workflow
We can pass the parameters we discussed above as individual arguments to the nextflow run
command (prefixed with --
). Alternatively, we can collect them in a JSON file and specify its name via the -params-file
argument.
For example, the following JSON string specifies the parameters for analyzing sQuantSeq FWD data without UMIs:
{
"save_reference": true,
"fasta": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/GRCm38.primary_assembly.genome.fa.gz",
"gtf": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M17/gencode.vM17.primary_assembly.annotation.gtf.gz",
"extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
"with_umi": false,
"extra_salmon_quant_args": "--noLengthCorrection",
"skip_stringtie": true,
"gencode": true
}
We store the JSON string in a file called nf_params.json
, and then pass it to the nextflow run
command 2:
> nextflow run \
\
nf-core/rnaseq -r 3.10.1 \
-profile docker \
-params-file nf_params.json \
-resume \
--input raw_data/samplesheet/samplesheet.csv \
--outdir SRP282921
Examining the output files
Reusing reference files and indices
Retrieving the reference files and indexing them is time consuming. If you are planning to map additional sample against the same genome / transcriptome in the future, you might want to reuse them. Because we set the save_reference
argument to true
in our workflow, the output directory contains a genome
folder, featuring reference files (e.g. genome and transcriptome sequences in FASTQ formats) and indices (e.g. for STAR, salmon and rsem).
Let’s move the genome folder to our current working directory (or any other suitable location). 3 For future runs, we can include paths to the files and indices in our nf_params.json
file, e.g.
{
"save_reference": false,
"fasta": "genome/GRCm38.primary_assembly.genome.fa",
"gtf": "genome/gencode.vM17.primary_assembly.annotation.gtf",
"gene_bed": "genome/gencode.vM17.primary_assembly.annotation.bed",
"transcript_fasta": "genome/genome.transcripts.fa",
"star_index": "genome/index/star",
"salmon_index": "genome/index/salmon",
"rsem_index": "genome/rsem",
"extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
"with_umi": false,
"extra_salmon_quant_args": "--noLengthCorrection",
"skip_stringtie": true,
"gencode": true
}
Our next run will execute much faster, because these reference files and indices are already available and don’t need to be created from scratch.
If you chose a location other than genome
to store your reference data, please specify the (absolute) path in the JSON file instead.
Quantseq FWD with UMIs: Nugent et al dataset
Especially when only low amounts of input material (total RNA) are available, the QuantSeq FWD procotol requires multiple PCR amplification steps. Each step generates clones of the originally captured tags, but does not provide new information about transcript abundance. Inclusion of unique molecular identifiers (UMIs) in the second strand synthesis guarantees that PCR duplicates can be identified and removed in the analysis (based on both the coordinates of their alignment and the UMI sequence).
Nugent et al used the QuantSeq FWD protocol with UMIs, and we will reanalyze this dataset to illustrate how to instruct UMI-tools to extract the UMIs from the raw reads and deduplicated the STAR alignments.
Experimental design
This study examines FACS-isolated astroctyes and microglia from female mice that were either 2- or 16 months of age. The animals are either wildtype (WT
) or
homozygous knockouts (KO
) for the Trem2 gene.
From each mouse, separate microglia and astrocyte samples were collected and processed into separate libraries with Lexogen’s QuantSeq FWD kit and the QuantSeq UMI add-on module.
Raw data retrieval with nf-core/fetchngs
As in the first example, we use the nf-core/fetchngs
workflow to retrieve the FASTQ files from SRA.
> mkdir SRP213880
> cd SRP213880
> echo SRP213880 > ids.txt
> nextflow run nf-core/fetchngs \
-revision 1.9 \
-profile docker \
--outdir raw_data \
--input ids.txt \
--nf_core_pipeline rnaseq \
--nf_core_rnaseq_strandedness forward
Configuring the nf-core/rnaseq workflow
Reference data
According to the sample metadata in NCBI GEO the authors used the same reference data as Xia et al, Gencode release M17. We can therefore reuse the reference files and indices we generated in the first analysis (see above).
Unique molecular identifiers
To process QuantSeq data with UMIs, the following parameters need to be provided.
--with_umi
--umitools_extract_method "regex"
: The type of pattern to detect, eitherstring
(default) orregex
--umitools_bc_pattern "^(?P<umi_1>.{6})(?P<discard_1>.{4}).*"
: A regular expression that instructsUMI-tools
to extract the first 6 bases (the UMI) from the 5’ end of the read 4 and to discard the following 4 bases. (Lexogen adds an invariantTATA
sequence motif to each UMI.)--umitools_grouping_method "unique"
: The method used to group similar UMIs, e.g. to correct sequencing errors. The default method isdirectional
, but it is computationally expensive. Here, we use the simplerunique
method, suppressing grouping / error correction entirely.
These arguments shown above are suitable for UMIs offered by Lexogen for the QuantSeq protocol. If you use custom UMIs, please consult the nf-core/rnaseq documentation for full details and additional options for details on additional parameters you can set.
Starting the workflow
Next, we add the UMI-related parameters to the JSON file we used above, so we can pass them to nextflow run
via the -params-file
argument. Because Nugent et al used the same reference (Gencode M17), we can reuse the same reference we generated for Xia et al’s dataset. (Here, we assume the precomputed references are in the genome
folder to our current working directory. If you chose to store them elsewhere, please provide absolute paths in the JSON file instead.)
{
"save_reference": false,
"fasta": "genome/GRCm38.primary_assembly.genome.fa",
"gtf": "genome/gencode.vM17.primary_assembly.annotation.gtf",
"gene_bed": "genome/gencode.vM17.primary_assembly.annotation.bed",
"transcript_fasta": "genome/genome.transcripts.fa",
"star_index": "genome/index/star",
"salmon_index": "genome/index/salmon",
"rsem_index": "genome/rsem",
"extra_star_align_args": "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --outFilterType BySJout --outFilterMismatchNoverLmax 0.1 --clip3pAdapterSeq AAAAAAAA",
"with_umi": true,
"umitools_extract_method": "regex",
"umitools_grouping_method": "unique",
"umitools_bc_pattern": "^(?P<umi_1>.{6})(?P<discard_1>.{4}).*",
"extra_salmon_quant_args": "--noLengthCorrection",
"skip_stringtie": true,
"gencode": true
}
We store the JSON string in a file called nf_params.json
, and then pass it to the nextflow run
command:
> nextflow run \
\
nf-core/rnaseq -r 3.10.1 \
-profile docker \
-params-file nf_params.json \
-resume \
--input raw_data/samplesheet/samplesheet.csv \
--outdir SRP213880
Upon successful completion, the workflow’s output is available in the SRP213880
directory.
Next, we will take a closer look at the available results in the second post in this series].
This work is licensed under a Creative Commons Attribution 4.0 International License.
Footnotes
The
strandedness
column in the sample sheet CSV file must contain one offorward
,reverse
,unstranded
orauto
. The latter will prompt thenf-core/rnaseq
workflow to subsample reads and map them to the transcriptome with salmon to determine the likely orientation of the reads. Because this adds additional steps to the workflow execution, it is recommended that you specify thestrandedness
of your library explicitly whenever possible.↩︎The
-resume
argument allows you to resume previous executions of the same workflow. Nextflow will reuse any existing outputs and generate only those that are missing.↩︎Nextflow can interact with different storage backends. For example, you could provide URLs of remote files (e.g. with the
ftp://
prefix) or in cloud storage (e.g. with thes3://
orgs://
prefix) and they will automatically be downloaded and staged when you start the run.↩︎Typically, QuantSeq libraries are single-end (not paired-end) and the UMI is therefore found in the R1 read.↩︎