SQL and noSQL approaches to creating & querying databases (using R)

Creating, polulating and querying SQL and noSQL databases with R

The first step of any data analysis is to obtain and explore the available data, often by accessing and querying a database. There many great introductions on how to read data into an R session. But I found it harder to find tutorials on how to create and populate a new database from scratch. Here, I explore both noSQL and SQL approaches to data management.

As an example use case, we store a collection of gene sets, specifically the mouse MSigDb hallmark gene sets (MH), either as unstructured documents or in relational tables.

Motivation

Bioconductor offers well designed S4 Classes to store gene set collections, including e.g. in a list-like GSEABase::GeneSetCollection or a set of three tibbles within a BiocSet::BiocSet object. So why could we be interested in storing this information in a database?

A database (e.g. SQLite, Postgres, etc) offers a standardized way to store, manage and access information in a language-agnostic way. E.g. some of my colleagues use python for their analyses and are comfortable retrieving gene set information from a database, but not necessarily from an R S4 object.
Gene sets capture knowledge from multiple experiments, studies and sources. If you are part of a larger organization a single source of truth, available in a central location, is very useful.
Collaborators might not be interested / able to access information programmatically, e.g. they may prefer a web application to search, share and edit gene sets. Many tools to build web applications have built-in capabilities to interact with a database.
As the number of gene sets grows, sharing them in the form of one or more files might become cumbersome. A hosted database (e.g. Postgres or MariaDB ) allows users to retrieve only the information they need.

In this tutorial, I am using the SQLite engine to explore both relational and non-relational ways to manage gene sets. SQLite can be embedded into applications, and does not require a central server, making it ideal for experimentation. (But as you move into a scenario where multiple users need to access a central, it is time to switch to a hosted database instead; my favorite is Postgres.)

library(BiocSet)
library(dm)
library(dbplyr)
library(dplyr)
library(jsonlite)
library(nodbi)
library(org.Mm.eg.db)
library(purrr)
library(DiagrammeR)
library(RSQLite)
library(tibble)
library(tidyr)

The Mouse Hallmarks MSigDB collection

At the time of writing, Mouse Molecular Signatures Database (MSigDB) contains 15918 gene sets, organized into numerous different collections. For example, the 50 hallmark gene sets (MH) summarize and represent specific well-defined biological states or processes ( Liberzon et al, Cell Systems, 2015 ).

Gene symbols

Each of the 50 sets in the collection contains between 32 and 200 official gene symbols, specifying the members of the gene set.

Here, I will use the hallmarks collection as an example but the overall approach can be applied to other gene set collection in a similar way. (You might need additional / different annotation fields, though.)

The mouse hallmarks collection is available in different formats, including as a JSON file. Let’s start by reading it into an R session as nested list mh.

json_file <- paste0(
  "https://raw.githubusercontent.com/tomsing1/blog/main/posts/",
  "geneset-sqlite-db/mh.all.v2022.1.Mm.json")
mh <- jsonlite::read_json(json_file, simplifyVector = TRUE)

Each of the 50 elements in the JSON file corresponds to a different gene set,

head(names(mh))

[1] "HALLMARK_ADIPOGENESIS"        "HALLMARK_ALLOGRAFT_REJECTION"
[3] "HALLMARK_ANDROGEN_RESPONSE"   "HALLMARK_ANGIOGENESIS"       
[5] "HALLMARK_APICAL_JUNCTION"     "HALLMARK_APICAL_SURFACE"

each gene set is a nested list with the following elements,

lengths(mh[[1]])

              systematicName                         pmid 
                           1                            1 
                 exactSource                  geneSymbols 
                           1                          200 
                   msigdbURL           externalDetailsURL 
                           1                            1 
        filteredBySimilarity externalNamesForSimilarTerms 
                           0                            0 
                  collection 
                           1

and the gene symbols that make up the set are listed in the geneSymbols vector:

head(mh[[1]]$geneSymbols)

[1] "Abca1" "Abcb8" "Acaa2" "Acadl" "Acadm" "Acads"

noSQL: storing gene sets as unstructured documents

Each gene set is represented as a list - so why not store it in the same way? A noSQL database is designed to store unstructed information, e.g. data models that are not organized in tables, making them flexible and scalable. Examples of noSQL databases include e.g. Mongodb, CouchDB or AWS dynamodb.

In addition, traditional relational database engines - including SQLite and Postgres - can also store unstructured data in dedicated JSON fields.

The nodbi R package provides a unified interface to multiple noSQL implementations, including SQLite. (If you are interested in a deeper look at how to create & query a JSON field in SQLite with raw SQL, check out this gist ).

Creating & populating a noSQL database with the `nodbi` R package

To experiment with its noSQL mode, we create a temporary SQLite database in memory. (For real data, you definitely want to provide a file path as the dbname instead!)

src <- nodbi::src_sqlite(dbname = ":memory:")

Right now, the names of the gene sets are only stored as the names() of the list elements, e.g. not in a field within each sub-list itself. To make sure they are included in each database record, we add them to each sub-list in a new name field.

mh2 <- lapply(names(mh), \(gs) c("name" = gs, mh[[gs]]))

Unique identifiers

The docdb_create() function accepts either a data.frame, a JSON string or a list as its value argument.

If you include a field _id in your list, it will be used as the primary key for each element. If no _id field is found, then the _id field is created automatically with a call to the uuid::UUIDgenerate() function.

If you provide a data.frames() with row.names, they will be used to populate the _id field.

Now we are ready to create a new SQLite table hallmarks and populate it with the 50 gene sets.

docdb_create(src, key = "hallmarks", value = mh2)

[1] 50

We can retrieve the full set of records as a data.frame with the docdb_get() function. (Here we select a subset of the returned columns due to space constraints.) Because each gene set contains multiple geneSymbols, this field is a list-column.

docdb_get(src, "hallmarks")[1:4, c("name", "geneSymbols", "pmid")]

                          name  geneSymbols     pmid
1        HALLMARK_ADIPOGENESIS Abca1, A.... 30224793
2 HALLMARK_ALLOGRAFT_REJECTION Aars, Ab.... 30224793
3   HALLMARK_ANDROGEN_RESPONSE Abcc4, A.... 30224793
4        HALLMARK_ANGIOGENESIS Apoh, Ap.... 30224793

Querying with JSON filters

More commonly, users might want to retrieve one or more gene sets by name. The docdb_query() function accepts a query argument specifying the desired filter criteria (as MongoDB JSON ).

results <- nodbi::docdb_query(
  src = src, key = "hallmarks",
  query = '{"name": "HALLMARK_ADIPOGENESIS"}')
results[, c("name", "geneSymbols", "pmid")]

                   name  geneSymbols     pmid
1 HALLMARK_ADIPOGENESIS Abca1, A.... 30224793

The fields argument allows us to return only specific columns. (Specifying a field as 1 or 0 will include or exclude it, respectively.)

nodbi::docdb_query(
  src = src, key = "hallmarks",
  query = '{"name": "HALLMARK_ADIPOGENESIS"}',
  fields = '{"name": 1, "geneSymbols": 1}'
)

                                   _id                  name  geneSymbols
1 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Abca1, A....

We can also identify gene sets containing at least one of the given gene symbols:

results <- nodbi::docdb_query(
  src = src, key = "hallmarks",
query =
'{"geneSymbols": {"$in": ["Abca1",
 "Gapdh"]}}',
  fields = '{"name": 1, "geneSymbols": 1}'
)
results

                                   _id                             name
1 0498deea-1353-11f0-9761-a4f9335c1df4            HALLMARK_ADIPOGENESIS
2 0498df1d-1353-11f0-9761-a4f9335c1df4    HALLMARK_BILE_ACID_METABOLISM
3 0498df94-1353-11f0-9761-a4f9335c1df4   HALLMARK_INFLAMMATORY_RESPONSE
4 0498dfee-1353-11f0-9761-a4f9335c1df4       HALLMARK_PROTEIN_SECRETION
5 0498e00c-1353-11f0-9761-a4f9335c1df4 HALLMARK_TNFA_SIGNALING_VIA_NFKB
   geneSymbols
1 Abca1, A....
2 Abca1, A....
3 Abca1, A....
4 Abca1, A....
5 Abca1, A....

Unnesting columns

Because the set contains more than one geneSymbol, we obtain a nested data.frame. We can unnest it e.g. with the tidyr R package

tidyr::unnest(results, cols = c(geneSymbols))

# A tibble: 798 × 3
   `_id`                                name                  geneSymbols
   <chr>                                <chr>                 <chr>      
 1 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Abca1      
 2 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Abcb8      
 3 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acaa2      
 4 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acadl      
 5 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acadm      
 6 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acads      
 7 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acly       
 8 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Aco2       
 9 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Acox1      
10 0498deea-1353-11f0-9761-a4f9335c1df4 HALLMARK_ADIPOGENESIS Adcy6      
# ℹ 788 more rows

Querying using SQL

Formulating the queries as JSON strings is tedious, though. Alternatively, SQLite also supports querying JSON columns using SQL (muddying the border between noSQL and SQL). For example, we can use SQLite’s -> and ->> operators and the json_each() SQL function to create a query that returns the names of all gene sets that include e.g. the Abca1 gene:

SELECT hallmarks.json->>'name' as name
FROM hallmarks, json_each(hallmarks.json, '$.geneSymbols')
WHERE json_each.value LIKE '%Abca1%'

5 records
name
HALLMARK_ADIPOGENESIS
HALLMARK_BILE_ACID_METABOLISM
HALLMARK_INFLAMMATORY_RESPONSE
HALLMARK_PROTEIN_SECRETION
HALLMARK_TNFA_SIGNALING_VIA_NFKB

Depending on comfortable you are reading / writing SQL, this might be a nicer approach.

Limitations

SQLite’s JSON operators are somewhat limited, e.g. there is no straightforward way to ask whether a column contains one or more gene identifiers (e.g. the query we performed above using a query JSON string). Indexing a SQLite JSON column also comes with limitations.

The Postgres database engine supports JSON and binary JSONB fields) with indexing & additional operators like the @> contains operator.

noSQL summary

This example highlights some of the advantages of a noSQL solution:

Rapid ingestion of data without the need for a rigid schema.
Simple retrieval of individual object identified by their primary key.

But also some of the disadvantages:

Queries that descend into the (potentially nested) objects must be carefully constructed.
Increasing database performance with indices is more complicated than for relational databases (see below.)

Next, we will try another approach: reshaping the gene set collection into a set of tables and modeling the relationship between them.s

SQL: storing gene sets in a relational database

R has excellent support for interacting with relational database, e.g. via the foundational ‘Common Database Interface’ (DBI) package and the numerous database-specific packages built on top of it, including the RSQLite, RPostgres and many others.

To take advantage of a relational database we have perform a little more work up-front. But this effort is amply repaid by simplifying subsequent queries.

Learning from Bioconductor: BiocSet’s three tables

The BiocSet Class from the eponymous Bioconductor package represents a collection of gene sets in three tibbles. Let’s create a simple BiocSet with two gene sets for illustration:

set_names <- purrr::map_chr(mh2[1:2], "name")
gene_ids <- purrr::map(mh2[1:2], "geneSymbols")
es <- BiocSet(setNames(gene_ids, set_names))

The first two tibbles represent genes (called elements) and sets, respectively:

es_element: one row per gene

head(es_element(es))

# A tibble: 6 × 1
  element
  <chr>  
1 Abca1  
2 Abcb8  
3 Acaa2  
4 Acadl  
5 Acadm  
6 Acads

es_set: one row per gene set

es_set(es)

# A tibble: 2 × 1
  set                         
  <chr>                       
1 HALLMARK_ADIPOGENESIS       
2 HALLMARK_ALLOGRAFT_REJECTION

The third table establishes the many-to-many relationship between genes and sets, e.g. it tracks which gene is a member of each set.

es_elementset: gene x set combination

# we are showing 10 random rows
set.seed(42)
es_elementset(es)[sample(nrow(es_elementset(es)), size = 10), ]

# A tibble: 10 × 2
   element set                         
   <chr>   <chr>                       
 1 Coq5    HALLMARK_ADIPOGENESIS       
 2 Irf7    HALLMARK_ALLOGRAFT_REJECTION
 3 Qdpr    HALLMARK_ADIPOGENESIS       
 4 Elovl6  HALLMARK_ADIPOGENESIS       
 5 Cd28    HALLMARK_ALLOGRAFT_REJECTION
 6 Ppm1b   HALLMARK_ADIPOGENESIS       
 7 Mtarc2  HALLMARK_ADIPOGENESIS       
 8 Zap70   HALLMARK_ALLOGRAFT_REJECTION
 9 Ndufb7  HALLMARK_ADIPOGENESIS       
10 Il15    HALLMARK_ALLOGRAFT_REJECTION

Each of these tables can be augmented with additional metadata, e.g. we could add Entrez gene identifiers to the es_element (see below), or long-form descriptions for each set to the es_set tibble.

These three tables can easily be represented in a relational database, using the element and set columns as primary keys.

Creating and populating a relational database

Let’s start with a fresh SQLite database.

con <- dbConnect(RSQLite::SQLite(), ":memory:")

First, we create the geneset data.frame that lists all gene sets, and we also include their MSigDb URLs as metadata:

geneset <- data.frame(
  geneset = purrr::map_chr(mh2, "name"),
  url = purrr::map_chr(mh2, "msigdbURL"))
head(geneset)

                       geneset
1        HALLMARK_ADIPOGENESIS
2 HALLMARK_ALLOGRAFT_REJECTION
3   HALLMARK_ANDROGEN_RESPONSE
4        HALLMARK_ANGIOGENESIS
5     HALLMARK_APICAL_JUNCTION
6      HALLMARK_APICAL_SURFACE
                                                                                 url
1        https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_ADIPOGENESIS
2 https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_ALLOGRAFT_REJECTION
3   https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_ANDROGEN_RESPONSE
4        https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_ANGIOGENESIS
5     https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_APICAL_JUNCTION
6      https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_APICAL_SURFACE

Next, we identify all unique gene symbols, annotate them with their Entrez ids (using the org.Mm.eg.db Bioconductor annotation package), and store both identifier types in the element data.frame.

gene_symbols <- unique(unlist(purrr::map(mh2, "geneSymbols")))
element <- data.frame(
  element = gene_symbols,
  entrezid = mapIds(org.Mm.eg.db, keys = gene_symbols, keytype = "SYMBOL", 
                    column = "ENTREZID")
  )
head(element)

      element entrezid
Abca1   Abca1    11303
Abcb8   Abcb8    74610
Acaa2   Acaa2    52538
Acadl   Acadl    11363
Acadm   Acadm    11364
Acads   Acads    11409

Finally, we create the element_set join table, connecting gene sets to their constituent genes:

elementset <- purrr::map_df(mh2, \(gs) {
  with(gs, 
       data.frame(
         element = geneSymbols,
         geneset = name
       )
  )
})
head(elementset)

  element               geneset
1   Abca1 HALLMARK_ADIPOGENESIS
2   Abcb8 HALLMARK_ADIPOGENESIS
3   Acaa2 HALLMARK_ADIPOGENESIS
4   Acadl HALLMARK_ADIPOGENESIS
5   Acadm HALLMARK_ADIPOGENESIS
6   Acads HALLMARK_ADIPOGENESIS

Next, we write each data.frame into a separate table in our SQLite database.

Verifying foreign keys

By default, SQLite does not verify that foreign keys actually exist in the referenced table, and the following SQL expression yields 0:

dbGetQuery(con, "PRAGMA foreign_keys")[["foreign_keys"]]

[1] 0

To enable foreign key checking, we use the following command:

dbExecute(con, 'PRAGMA foreign_keys=ON;')

[1] 0

Now the same query we issued above confirms that the foreign key check is active.

dbGetQuery(con, "PRAGMA foreign_keys")[["foreign_keys"]]

[1] 1

For each of the tables we want to create, we first check whether a table with the same name already exists, and delete it if necessary.

The we use the dbAppendTable function, and not the dbWriteTable command, because dbWriteTable will overwrite an existing table and all of the constraints we have enforced in the CREATE TABLE statment (e.g. primary key, foreign key relationships, field types, etc) will be lost.

# delete existing tables, if any
for (table_name in c("tbl_geneset", "tbl_element", "tbl_elementset")) {
  if (dbExistsTable(con, table_name)) {
    dbExecute(con, paste0("DROP TABLE ", table_name))
  }
}
# create and populate the three tables
dbExecute(con, 
          "CREATE TABLE tbl_geneset (geneset TEXT PRIMARY KEY, url TEXT)")
dbAppendTable(con, name = "tbl_geneset", value = geneset)

dbExecute(con, 
          "CREATE TABLE tbl_element (element TEXT PRIMARY KEY, entrezid TEXT)")
dbAppendTable(con, name = "tbl_element", value = element)

dbExecute(con, paste(
  "CREATE TABLE tbl_elementset (",
  "element TEXT,", 
  "geneset TEXT,",
  "FOREIGN KEY(geneset) REFERENCES tbl_geneset(geneset),",
  "FOREIGN KEY(element) REFERENCES tbl_element(element)",
  ")")
  )
dbAppendTable(con, name = "tbl_elementset", value = elementset)

We now have three tables in the database:

dbListTables(con)

[1] "tbl_element"    "tbl_elementset" "tbl_geneset"

Checking the foreign keys

Let’s take a peek at the foreign keys we defined in the tbl_elementset table, to make sure they are correct.

dbGetQuery(con, "PRAGMA foreign_key_list('tbl_elementset')")

  id seq       table    from      to on_update on_delete match
1  0   0 tbl_element element element NO ACTION NO ACTION  NONE
2  1   0 tbl_geneset geneset geneset NO ACTION NO ACTION  NONE

There are two foreign keys, one connecting to the element field in the tbl_element table, and the other connecting to the geneset field in the tbl_geneset table. Looks good!

Plotting relationships

As we create and need to keep track of multiple tables, it is useful to visualize their contents (fields, columns) and relationships in a model diagram. The awesome dm R package, designed to bring an existing relational data model into your R session, can be used to generate diagrams like the one shown below. (dm can identify the keys in postgres and SQL server database engines automatically, but for SQLite we need to specify them ourselves with the dm_add_pk() and dm_add_fk() functions.)

dm_from_con(con, learn_keys = FALSE) %>%
  dm_add_pk(tbl_element, element) %>%
  dm_add_pk(tbl_geneset, geneset) %>%
  dm_add_fk(tbl_elementset, element, tbl_element) %>%
  dm_add_fk(tbl_elementset, geneset, tbl_geneset) %>%
  dm_draw(view_type = "all")

Model diagram

Querying the database

Great! Now we are ready to query our database. To make our lives easier, we will use the dplyr package to translate our R syntax into SQL. (But we could just as well use plain SQL instead.)

First we define the remote tables by connecting to our brand new database:

tbl_geneset <- tbl(con, "tbl_geneset")
tbl_element <- tbl(con, "tbl_element")
tbl_elementset <- tbl(con, "tbl_elementset")

Let’s return the gene symbols and entrez identifiers that make up the HALLMARK_APOPTOSIS gene set and display the first 5 (in alphabetical order of the gene symbols).

result <- tbl_elementset %>% 
  dplyr::filter(geneset == "HALLMARK_ADIPOGENESIS") %>%
  dplyr::inner_join(tbl_element, by = "element") %>%
  dplyr::slice_min(n = 5, order_by = element)
result

# Source:   SQL [?? x 3]
# Database: sqlite 3.47.1 [:memory:]
  element geneset               entrezid
  <chr>   <chr>                 <chr>   
1 Abca1   HALLMARK_ADIPOGENESIS 11303   
2 Abcb8   HALLMARK_ADIPOGENESIS 74610   
3 Acaa2   HALLMARK_ADIPOGENESIS 52538   
4 Acadl   HALLMARK_ADIPOGENESIS 11363   
5 Acadm   HALLMARK_ADIPOGENESIS 11364

And now let’s add the gene set’s URL as well:

result %>%
  dplyr::left_join(tbl_geneset, by = "geneset")

# Source:   SQL [?? x 4]
# Database: sqlite 3.47.1 [:memory:]
  element geneset               entrezid url                                    
  <chr>   <chr>                 <chr>    <chr>                                  
1 Abca1   HALLMARK_ADIPOGENESIS 11303    https://www.gsea-msigdb.org/gsea/msigd…
2 Abcb8   HALLMARK_ADIPOGENESIS 74610    https://www.gsea-msigdb.org/gsea/msigd…
3 Acaa2   HALLMARK_ADIPOGENESIS 52538    https://www.gsea-msigdb.org/gsea/msigd…
4 Acadl   HALLMARK_ADIPOGENESIS 11363    https://www.gsea-msigdb.org/gsea/msigd…
5 Acadm   HALLMARK_ADIPOGENESIS 11364    https://www.gsea-msigdb.org/gsea/msigd…

Pulling data into a BiocSet

Finally, we can easily pull selected (or even all) gene sets into a Bioconductor BiocSet object for analysis in R. Importantly, the database does not require us to use R: e.g. python users can connect to the same SQLite database (e.g. using sqlalchemy ) and retrieve the information in whatever form is most useful to them.

For example, let’s retrieve all gene sets whose name ends in the letter N, store them in a list and create a BiocSet object.

gene_set_list <- with(
  tbl_elementset %>% 
    dplyr::filter(geneset %like% '%N') %>%
    collect(), 
  split(element, geneset)
)
es <- BiocSet(gene_set_list)

Next, we add gene set metadata to the es_set tibble, by joining it with the (richer) information in the database. This will add the url column.

es <- left_join_set(es, 
  tbl_geneset, by = c(set = "geneset"), 
  copy = TRUE
)
es_set(es)

# A tibble: 8 × 2
  set                                        url                                
  <chr>                                      <chr>                              
1 HALLMARK_ALLOGRAFT_REJECTION               https://www.gsea-msigdb.org/gsea/m…
2 HALLMARK_APICAL_JUNCTION                   https://www.gsea-msigdb.org/gsea/m…
3 HALLMARK_COAGULATION                       https://www.gsea-msigdb.org/gsea/m…
4 HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION https://www.gsea-msigdb.org/gsea/m…
5 HALLMARK_KRAS_SIGNALING_DN                 https://www.gsea-msigdb.org/gsea/m…
6 HALLMARK_OXIDATIVE_PHOSPHORYLATION         https://www.gsea-msigdb.org/gsea/m…
7 HALLMARK_PROTEIN_SECRETION                 https://www.gsea-msigdb.org/gsea/m…
8 HALLMARK_UV_RESPONSE_DN                    https://www.gsea-msigdb.org/gsea/m…

And finally, let’s also add the entrezid column from out database to the es_element table:

es <- left_join_element(es, 
  tbl_element, by = "element", 
  copy = TRUE
)
es_element(es)

# A tibble: 1,233 × 2
   element entrezid
   <chr>   <chr>   
 1 Aars    <NA>    
 2 Abce1   24015   
 3 Abi1    11308   
 4 Ache    11423   
 5 Acvr2a  11480   
 6 Akt1    11651   
 7 Apbb1   11785   
 8 B2m     12010   
 9 Bcat1   12035   
10 Bcl10   12042   
# ℹ 1,223 more rows

SQL summary

For this example the effort required to transform the dataset into a set of three tables - the starting point for import into a relational database - was minimal.
Given the use case, e.g. management of a gene set collections, the number of times that data is added to the database is likely much smaller than the number of times it is queried. That makes it worth the effort to transform it once - and benefit from this upfront cost ever after.
Because we knew exactly which properties / annotations we wanted to capture in the database, defining the database tables and their relationships (e.g. the schema ) was not an obstacle, either.
Enabling users to query the data using simple SQL or via a higher level abstraction like dplyr makes it accessible to a broader audience.

Defining a schema is much harder when we deal with datasets that are less standardized, deeply nested, changing over time, etc.

References

If you are new to working with databases, then you might find these two great books useful:

SQL for Data Scientists: A Beginner’s Guide for Building Datasets for Analysis by Renee M. P. Teate is a great starting place to learn SQL. It mainly focusses on accessing existing databases.
Practical SQL: A Beginner’s Guide to Storytelling with Data by Anthony DeBarros teaches readers how to create & populate a Postgres database, and how to index and search it effectively.

SessionInfo

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       Debian GNU/Linux 12 (bookworm)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2025-04-06
 pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
 quarto   1.6.36 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 ! package          * version date (UTC) lib source
 P AnnotationDbi    * 1.68.0  2024-10-29 [?] Bioconduc~
 P backports          1.5.0   2024-05-23 [?] RSPM
 P Biobase          * 2.66.0  2024-10-29 [?] Bioconduc~
 P BiocGenerics     * 0.52.0  2024-10-29 [?] Bioconduc~
 P BiocIO             1.16.0  2024-10-29 [?] Bioconduc~
 P BiocManager        1.30.25 2024-08-28 [?] RSPM
 P BiocSet          * 1.20.0  2024-10-29 [?] Bioconduc~
 P Biostrings         2.74.1  2024-12-16 [?] Bioconduc~
 P bit                4.6.0   2025-03-06 [?] RSPM
 P bit64              4.6.0-1 2025-01-16 [?] RSPM
 P blob               1.2.4   2023-03-17 [?] RSPM
 P cachem             1.1.0   2024-05-16 [?] RSPM
 P cli                3.6.4   2025-02-13 [?] RSPM
 P crayon             1.5.3   2024-06-20 [?] RSPM
 P curl               6.2.2   2025-03-24 [?] RSPM
 P DBI                1.2.3   2024-06-02 [?] RSPM
   dbplyr           * 2.5.0   2024-03-19 [1] RSPM (R 4.4.0)
 P DiagrammeR       * 1.0.11  2024-02-02 [?] RSPM
 P digest             0.6.37  2024-08-19 [?] RSPM
 P dm               * 1.0.11  2024-11-25 [?] RSPM
 P dplyr            * 1.1.4   2023-11-17 [?] RSPM
 P evaluate           1.0.3   2025-01-10 [?] RSPM
 P fastmap            1.2.0   2024-05-15 [?] RSPM
 P generics           0.1.3   2022-07-05 [?] RSPM
 P GenomeInfoDb       1.42.3  2025-01-27 [?] Bioconduc~
 P GenomeInfoDbData   1.2.13  2024-12-24 [?] Bioconductor
 P glue               1.8.0   2024-09-30 [?] RSPM
 P htmltools          0.5.8.1 2024-04-04 [?] RSPM
 P htmlwidgets        1.6.4   2023-12-06 [?] RSPM
 P httr               1.4.7   2023-08-15 [?] RSPM
 P igraph             2.1.4   2025-01-23 [?] RSPM
 P IRanges          * 2.40.1  2024-12-05 [?] Bioconduc~
 P jqr                1.4.0   2024-12-16 [?] RSPM
 P jsonlite         * 1.9.0   2025-02-19 [?] RSPM
 P KEGGREST           1.46.0  2024-10-29 [?] Bioconduc~
 P knitr              1.49    2024-11-08 [?] RSPM
 P lazyeval           0.2.2   2019-03-15 [?] RSPM
 P lifecycle          1.0.4   2023-11-07 [?] RSPM
 P magrittr           2.0.3   2022-03-30 [?] CRAN (R 4.4.1)
 P memoise            2.0.1   2021-11-26 [?] CRAN (R 4.4.1)
 P nodbi            * 0.12.0  2025-03-30 [?] RSPM
 P ontologyIndex      2.12    2024-02-27 [?] RSPM
 P org.Mm.eg.db     * 3.20.0  2025-04-07 [?] Bioconductor
 P pillar             1.10.1  2025-01-07 [?] RSPM
 P pkgconfig          2.0.3   2019-09-22 [?] RSPM
 P plyr               1.8.9   2023-10-02 [?] RSPM
 P png                0.1-8   2022-11-29 [?] RSPM
 P purrr            * 1.0.4   2025-02-05 [?] RSPM
 P R.methodsS3        1.8.2   2022-06-13 [?] RSPM
 P R.oo               1.27.0  2024-11-01 [?] RSPM
 P R.utils            2.13.0  2025-02-24 [?] RSPM
 P R6                 2.6.1   2025-02-15 [?] RSPM
 P RColorBrewer       1.1-3   2022-04-03 [?] CRAN (R 4.4.1)
 P Rcpp               1.0.14  2025-01-12 [?] RSPM
   renv               1.1.4   2025-03-20 [1] RSPM (R 4.4.0)
 P rlang              1.1.5   2025-01-17 [?] RSPM
 P rmarkdown          2.29    2024-11-04 [?] RSPM
 P RSQLite          * 2.3.9   2024-12-03 [?] RSPM
 P rstudioapi         0.17.1  2024-10-22 [?] RSPM
 P S4Vectors        * 0.44.0  2024-10-29 [?] Bioconduc~
 P sessioninfo        1.2.3   2025-02-05 [?] RSPM
 P stringi            1.8.7   2025-03-27 [?] RSPM
 P tibble           * 3.2.1   2023-03-20 [?] RSPM
 P tidyr            * 1.3.1   2024-01-24 [?] RSPM
 P tidyselect         1.2.1   2024-03-11 [?] RSPM
 P UCSC.utils         1.2.0   2024-10-29 [?] Bioconduc~
 P utf8               1.2.4   2023-10-22 [?] RSPM
 P uuid               1.2-1   2024-07-29 [?] RSPM
 P V8                 6.0.3   2025-03-26 [?] RSPM
 P vctrs              0.6.5   2023-12-01 [?] RSPM
 P visNetwork         2.1.2   2022-09-29 [?] RSPM
 P withr              3.0.2   2024-10-28 [?] RSPM
 P xfun               0.51    2025-02-19 [?] RSPM
 P XVector            0.46.0  2024-10-29 [?] Bioconduc~
 P yaml               2.3.10  2024-07-26 [?] RSPM
 P zlibbioc           1.52.0  2024-10-29 [?] Bioconduc~

 [1] /home/sandmann/repositories/blog/posts/geneset-sqlite-db/renv/library/linux-debian-bookworm/R-4.4/x86_64-pc-linux-gnu
 [2] /home/sandmann/.cache/R/renv/sandbox/linux-debian-bookworm/R-4.4/x86_64-pc-linux-gnu/db5e602d

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

This work is licensed under a Creative Commons Attribution 4.0 International License.

Creating, polulating and querying SQL and noSQL databases with R

Motivation

The Mouse Hallmarks MSigDB collection

noSQL: storing gene sets as unstructured documents

Creating & populating a noSQL database with the nodbi R package

Querying with JSON filters

Querying using SQL

noSQL summary

SQL: storing gene sets in a relational database

Learning from Bioconductor: BiocSet’s three tables

Creating and populating a relational database

Plotting relationships

Querying the database

Pulling data into a BiocSet

SQL summary

References

Creating & populating a noSQL database with the `nodbi` R package