QuantSeq RNAseq analysis (2): Exploring nf-core/rnaseq output

nextflow
NGS
Author: Thomas Sandmann

Published: January 16, 2023

Note

This is the second of four posts documenting my progress toward processing and analyzing QuantSeq FWD 3’ tag RNAseq data with the nf-core/rnaseq workflow.

  1. Configuring & executing the nf-core/rnaseq workflow
  2. Exploring the workflow outputs
  3. Validating the workflow by reproducing results published by Xia et al (no UMIs)
  4. Validating the workflow by reproducing results published by Nugent et al (including UMIs)

Many thanks to Harshil Patel, António Miguel de Jesus Domingues and Matthias Zepper for their generous guidance & input via nf-core slack. (Any mistakes are mine.)

tl;dr

Reports

MultiQC report

The MultiQC HTML report is a one-stop shop that summarises QC metrics across the workflow. It can be found in the multiqc folder, in a subdirectory named after the aligner & quantifier combination used (default: star_salmon).

multiqc report
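To locate the report programmatically, something like the following minimal sketch could be used. The output directory name `results` and the helper `find_multiqc_reports` are hypothetical; `multiqc_report.html` is MultiQC's default file name, but a run configured with a custom title may name it differently, so the sketch matches any HTML file in the folder.

```python
from pathlib import Path


def find_multiqc_reports(outdir, aligner="star_salmon"):
    """Return the HTML reports found under <outdir>/multiqc/<aligner>."""
    report_dir = Path(outdir) / "multiqc" / aligner
    if not report_dir.is_dir():
        return []
    # Match any HTML file rather than a fixed name, in case the report
    # was given a custom title.
    return sorted(report_dir.glob("*.html"))


# "results" is a hypothetical output directory; substitute your own.
for report in find_multiqc_reports("results"):
    print(report)
```

If a different aligner & quantifier combination was used, pass its folder name as the `aligner` argument.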

Pipeline info

The pipeline_info folder contains HTML reports and text (CSV, TXT, YML) files with information about the run, including the versions of the software tools used.

pipeline info

FastQC reports

The fastqc folder contains the output of the fastqc tool. Most of the reported metrics are included in the MultiQC report as well, but the HTML reports for individual samples are available here if needed.

FastQC report

Trim Galore reports

The trimgalore folder contains

  • trimming reports for each sample
  • the fastqc sub-folder with quality metrics for the trimmed FASTQ files

Trim Galore

umitools

This folder contains the log files returned by UMI-tools.

UMI-tools

Workflow results

The main output of the workflow is available in the star_salmon subdirectory. (This folder is named after the selected alignment & quantification strategy; star_salmon, for example, is present only if that tool combination was used.)

It contains multiple folders, as well as the (deduplicated) sorted BAM files for each sample.

bigwig

Genome coverage in bigWig format for each sample.

BigWig files

DESeq2 object & QC metrics

The workflow aggregates all counts into a DESeq2 R object, performs QC and exploratory analyses, and serializes the object as deseq2.dds.RData.

DESeq2 object & QC metrics

Dupradar

The dupRadar Bioconductor package performs duplication rate quality control.

dupRadar

Featurecounts

The output of the featurecounts tool is only used to generate QC metrics. For the actual gene-level quantitation, the output of salmon (default) or rsem is used.

The metrics reported in the featurecounts folder are included in the MultiQC report.

featureCounts

STAR log files

This folder contains log files output by the STAR aligner.

STAR logs

Qualimap

The qualimap package generates QC metrics from BAM files.

Qualimap

RSEQC

The output of the rseqc quality control package is in this directory.

RSEQC

Alignments & gene-level counts

The star_salmon folder also contains the main results of the workflow: gene-level counts and alignments (BAM files).

Salmon quantitation

Aggregated

The workflow outputs salmon quantitation results aggregated across all samples. Different types of counts (e.g. raw, length-scaled, TPMs) are available; which one to use depends on the downstream analysis approach. Please see tximport for details.

Salmon aggregated output
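As an illustration, one of the aggregated tables could be read with the Python standard library along these lines. This is a sketch under assumptions: the file name `salmon.merged.gene_counts.tsv`, the `results` directory and the column layout (`gene_id` and `gene_name` followed by one column per sample) reflect what I would expect from the workflow, so please check them against your own output. The helper `read_gene_counts` is hypothetical.

```python
import csv
from pathlib import Path


def read_gene_counts(tsv_path):
    """Parse a merged gene-level table into {gene_id: {sample: value}}."""
    counts = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Assumed layout: "gene_id" and "gene_name" annotation columns,
        # followed by one numeric column per sample.
        samples = [c for c in reader.fieldnames
                   if c not in ("gene_id", "gene_name")]
        for row in reader:
            counts[row["gene_id"]] = {s: float(row[s]) for s in samples}
    return counts


# Hypothetical path; adjust to your own output directory.
table = Path("results/star_salmon/salmon.merged.gene_counts.tsv")
if table.is_file():
    counts = read_gene_counts(table)
    print(f"{len(counts)} genes loaded")
```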

By sample

In addition, the salmon outputs for individual samples are available in sub-folders, one for each sample.

Salmon output for each sample
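The per-sample estimates could be collected with a short sketch like the one below. It assumes that each sample sub-folder holds salmon's standard `quant.sf` table (tab-separated, with `Name`, `TPM` and `NumReads` among its columns). The helpers `read_quant_sf` and `collect_samples` are hypothetical names used for illustration.

```python
import csv
from pathlib import Path


def read_quant_sf(path):
    """Return {transcript_name: (tpm, num_reads)} from a quant.sf file."""
    estimates = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            estimates[row["Name"]] = (float(row["TPM"]),
                                      float(row["NumReads"]))
    return estimates


def collect_samples(salmon_dir):
    """Map each sample sub-folder containing quant.sf to its estimates."""
    salmon_dir = Path(salmon_dir)
    if not salmon_dir.is_dir():
        return {}
    return {quant.parent.name: read_quant_sf(quant)
            for quant in sorted(salmon_dir.glob("*/quant.sf"))}
```

Calling `collect_samples` on the star_salmon folder returns one entry per sample; folders without a `quant.sf` file (e.g. bigwig or log) are skipped.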

STAR alignments

The sorted (and, in the case of datasets that include UMIs, deduplicated) BAM files and their indices are available:

STAR alignments
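Before handing the alignments to downstream tools, it can be worth confirming that every BAM file has its index next to it. A minimal sketch, assuming `.bai` indices named either `<file>.bam.bai` or `<file>.bai` (the helper `bams_missing_index` is hypothetical, and other index formats are not considered):

```python
from pathlib import Path


def bams_missing_index(bam_dir):
    """Return the BAM files in bam_dir that lack a matching .bai index."""
    missing = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        # Accept both common naming styles: sample.bam.bai and sample.bai
        candidates = (bam.with_name(bam.name + ".bai"),
                      bam.with_suffix(".bai"))
        if not any(c.is_file() for c in candidates):
            missing.append(bam)
    return missing
```

An empty list means that every BAM file in the folder is indexed.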

Genome, gene annotations & indices

If the workflow was executed with the "save_reference": true parameter, then all reference files (FASTA, GTF, BED, etc.) and the indices generated for STAR, salmon and rsem are returned in the genome folder within the output directory:

genome folder

These files can be reused for future runs, shortening the execution time of the workflow.
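One way to reuse them is to write a parameter file for the next run that points at the saved files. The sketch below assumes the parameter names `fasta`, `gtf`, `star_index` and `salmon_index`, as I understand them from the nf-core/rnaseq documentation; please verify them for your pipeline version. The helpers `reference_params` and `write_params` are hypothetical, and the paths must be supplied by the caller because the exact layout of the genome folder is not assumed here.

```python
import json
from pathlib import Path


def reference_params(fasta, gtf, star_index=None, salmon_index=None):
    """Build a params dict pointing a later run at saved reference files."""
    params = {"fasta": fasta, "gtf": gtf,
              "star_index": star_index, "salmon_index": salmon_index}
    # Drop indices that were not supplied and normalise paths to strings.
    params = {key: str(value) for key, value in params.items()
              if value is not None}
    missing = [path for path in params.values() if not Path(path).exists()]
    if missing:
        raise FileNotFoundError(f"reference files not found: {missing}")
    return params


def write_params(params, path):
    """Serialize the params dict as JSON, e.g. for Nextflow's -params-file."""
    Path(path).write_text(json.dumps(params, indent=2) + "\n")
```

The resulting JSON file can then be passed to Nextflow with the `-params-file` option, alongside the other parameters of the run.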

Next, in the third and fourth posts in this series, we will perform differential expression analyses to compare the gene counts returned by the nf-core/rnaseq workflow to those posted on GEO by the authors of the two datasets we processed in the first post.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.