tl;dr
This week I learned how to (1) Perform bulk RNA-seq analyses in python, using the excellent pydeseq2 module, (2) Explore the recount3 collection of thousands of consistently preprocessed public RNA-seq datasets, and (3)Create a web application using marimo to perform (simple) analyses interactively.
Overview
When it comes to the analysis of bulk (or pseudo-bulk) RNA-seq data, my go-to R packages are either limma or DESeq21. This week, I wanted to learn how to implement a similar analysis using python - and do so in the marimo
notebook environment, a project I have been following for some time2.
To learn how to use marimo, both as a notebook and in the form of an app, I chose to create a workflow that performs differential gene expression analysis of bulk RNA-seq data using the excellent pydeseq2 module3. The result is my first notebook / app, called mariner.
Data source
The recount3 project has processed and shared thousands of publicly available human and mouse RNA-seq datasets. The recount3 team has provided the study explorer (a web application based on the shiny R package) that allows users to search and filter the collection, and to get links to various processed files stored as compressed CSV files, e.g. gene- and exon-level counts, sample and project metadata, as well as gene annotations (based on Gencode).
A full list of all available projects, samples and files is also available in the the study explorer github repository.
Getting started with a template
To get started, I took advantage of the marimo uv starter template, which creates a project directory including src
and test
directories, as well as various configuration files to leverage the uv package manager.
The final project has the following structure:
├── .github/ # GitHub Actions workflows
├── data # Datasets
│ └── recount3.db # duckdb metadata database
├── docs # Documentation
│ └── images # Screenshots
├── scripts/ # Scripts
│ └── recount3.db # R script used to create the duckdb database
├── src/ # Source code
│ └── app.py # Sample marimo notebook
│ └── utils.py # Sample marimo notebook
├── pyproject.toml # Project configuration
├── tests/ # Test files
│ └── test_utils.py # Unit test for the utils submodule
├── pyproject.toml # Project configuration
├── uv.lock # Dependency lock file
└── Dockerfile # Instructions to build a docker image
Analysis steps
The mariner notebook allows users to:
- Browse RNA-sequencing projects curated by the recount3 team & select a study of interest
I wrote a simple R script to collate this metadata into a duckdb database, which is accessed via http from the notebook. (I chose
duckdb
over sqlite
because I can store the database file in my github repository and access it via http from anywhere.) This allows me to display an interactive table in my marimo notebook for users to search, and to select a dataset of interest from. Because marimo supports SQL code cells and connects to duckdb
out of the box, it was straightforward to display additional details about the selected dataset.
- Retrieve the raw counts, sample- and gene-annotations from recount3
recount3 provides sample metadata (e.g. annotations provided by the original submitters of the data) and raw counts in the form of CSV files. I wrote simple functions to retrieve, parse & filter these files, to provide users with a quick overview of the study design. (I also implemented a simple caching functionality using the
@lru_cache
decorator from the functtools
module to avoid downloading the same files in the same session twice.)
To keep my notebook short and potentially reuse these functions in other projects, I kept these and other helper functions in a separate utils module.
- Define up to two covariates of interest for differential expression analysis
Some of the studies in recount3 are either very large or very complex, e.g. with many different treatments, conditions or sample sources. For now, I chose to enable users to define the linear model to fit based on existing sample annotation columns (e.g. agent
or cell_line
in the example shown) above. Users choose the main effect of interest, and can optionally also include one additional covariate to adjust for (which is included as an additive predictor in the model)4.
Focussing only on categorical covariates, I enabled users to choose the numerator and the denominator of the comparison via a drop-down menu.
- Fit a generalized linear model using pydeseq2 and extract p-values for a comparison of interest
Once the design of the experiment is defined, it’s time to fit the negative binomial model, to extract the contrast of interest and provide p-values using Wald’s test.
At this point, I struggled a bit because the pydeseq2 tutorial illustrates how to first create a DeseqDataSet object, and then apply several methods that modify its state. First, I tried to perform each step in a different code cell, but that conflicted with marimo’s requirement to keep track of the reactive relationships between them.
Calmcode’s video on “Dealing with state in marimo” helped me a lot to get a better grip on this difference between python
and marimo
concepts.
The results are presented in an interactive table, with one row for each tested gene, rendered via marimo.ui.table, providing helpful overview charts, the ability to filter, sort or download - and to select one or more genes for visualization.
- Plot the raw or normalized gene expression counts for one or more genes of interest
There are many different ways to explore the results of a differential expression analyis. Here, I opted for simple scatter plots. Users can specify the categorical x-axis (based on the available sample annotations) and display either raw or normalized counts for each sample.
Deployment via Docker
Because not all of the python modules that are imported by my notebook are available as pure python wheels today, the notebook cannot be deployed via webassembly. For now, I can package and deploy the app on a server using docker instead.
A pre-built docker image for the arm64
platform is available from dockerhub:
docker pull tomsing1/mariner
docker run -p 8080:8080 --rm -it tomsing1/mariner
You can reach the application at localhost:8080.
Alternatively, you can build the docker image yourself, using the Dockerfile included in the repository:
docker build -t tomsing1/mariner .
docker run -p 8080:8080 --rm -it tomsing1/mariner
Lessons learned
marimo
provides great interactive elements out of the box, making it easy to first develop a workflow based on hard-coded variables, and to then add interactivity (e.g. via pull-down menus) later.- The interaction between marimo and my
duckdb
database was seamless, thanks to marimo’s built-in SQL cells. - A big difference between writing regular python scripts or Jupyter notebooks and
marimo
is the latter’s requirement to have a single cell that defines a variable (to avoid ambiguous reactive graphs). That’s still something I need to wrap my head around. - Because marimo notebooks / apps are just python, I was able to add the same
pytest
unit tests I would use for other modules, and to check / format my code with ruff 5. - I am only at the beginning of my python journey, but I am sure
marimo
will remain a part of it!
Known limitations
The main goal of this project was to learn, not to ship a fully functional application. With that in mind, here are a few known limitations:
- Some of the RNA-seq projects curated by recount3 are either very large (and take a long time to model) or are not associated with obvious covariates for modeling. For these studies, users likely want to download the data & annotations via the recount3 study explorer and perform a custom analysis.
- Currently, only categorical covariates are supported by pydeseq2.
- Because not all of the python modules that are imported by my notebook are available as pure python wheels today, the notebook cannot be deployed via webassembly. For now, I can package and deploy the app on a server using docker instead.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Footnotes
DESeq2 estimates the variance-mean dependence in count data and tests for differential expression usinga model based on the negative binomial distribution. limma’s
voom
method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.↩︎marimo’s documentation is excellent. Recently, the marimo team has also started to post short videos on their youtube channel. And I found Calm code’s marimo track to be an excellent place to start learning about it! (Thank you Vincent!)↩︎
pydeseq2
is a reimplementation of the pioneering DESeq2 R package and has been given a stamp of approval by Michael Love, one of DESeq2’s original authors.↩︎pydeseq2
uses the formulaic module to define linear models, which felt quite familiar to me coming from R.↩︎I disabled linting for lines that merely called an object of interest, triggering it to be displayed in the notebook, with the
# noqa B018
comment, e.g. here↩︎