library(DBI)
library(duckdb)
library(glue)
ClinVar is a public archive of reports of human variations associated with diseases and drug responses, with supporting evidence, hosted at the NCBI.
Retrieving ClinVar variants from the NCBI’s FTP server
First, we read information about disease-associated variants, available in a gzip-compressed tab-delimited text file from the NCBI, and store information about single-nucleotide variants (SNVs) in a local parquet file.
Frequently, information about genetic variation is shared using files in Variant Call Format (VCF) or variations of it, e.g. genomic VCF (gVCF).
While widely used, VCF files for large collections of variants can become unwieldy. In 2021, researchers from 23andMe explored other file formats and reported that storing variant data in parquet files might be advantageous.
The NCBI makes ClinVar’s information available in multiple formats, including VCF and tab-delimited text files. Here, I am retrieving the latter. Multiple solutions to convert VCF to character-delimited or parquet files exist, including methods in the GLOW Spark framework, or the VariantsToTable tool included in the GATK suite. The vcf-to-duckdb python module goes one step further and converts a VCF file to a DuckDB database, storing the data as Parquet files alongside an accompanying SQL schema.
Let’s start by creating an in-memory duckdb instance and installing the httpfs extension to retrieve remote files.
con <- dbConnect(duckdb())
DBI::dbExecute(con, "INSTALL httpfs")
[1] 0
The latest release of ClinVar is available from its website. For this tutorial, I will convert it into a local parquet file stored in a temporary directory1.
kTab <- paste0(
  "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/",
  "variant_summary.txt.gz"
)
kParquet <- tempfile(fileext = ".parquet")
ClinVar accepts reports about any type of genetic variant, including single nucleotide variants (SNVs), small insertions/deletions or large copy number variations for human genome assemblies GRCh37 and GRCh38. Here, I will only retain rows describing single nucleotide variants and their location in the GRCh38 assembly.
A single SQL query retrieves the ~380 MB tab-delimited file, filters for single-nucleotide variants on the GRCh38 assembly, and writes them into a ~325 MB parquet file in a single step.2
sql <- glue::glue_sql("
COPY (
    SELECT *
    FROM read_csv({kTab}, AUTO_DETECT = TRUE, NORMALIZE_NAMES = TRUE)
    WHERE Assembly = 'GRCh38' AND _type = 'single nucleotide variant'
)
TO {kParquet} (FORMAT 'PARQUET');", .con = con)
DBI::dbExecute(con, sql)
[1] 3405475
Exploring genetic variants stored in parquet format
I can access the parquet file in different ways from R (or python or other environments). For example, I could read its entire content into my R session using
- the {nanoparquet} R package, or
- arrow::read_parquet from the arrow package, which supports dplyr verbs for downstream processing.
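As a quick illustration, here is a minimal sketch of the first option, assuming the {nanoparquet} package is installed. It uses a tiny stand-in parquet file; in the post’s workflow, you would point read_parquet at kParquet instead:

```r
# Write a tiny stand-in parquet file; in the workflow above, this would be
# the kParquet file created from the ClinVar download.
tmp <- tempfile(fileext = ".parquet")
nanoparquet::write_parquet(
  data.frame(seqnames = "NC_000016.10", start = 2765864, end = 2765864),
  tmp
)

# Read the entire file into the R session as a data.frame
snvs <- nanoparquet::read_parquet(tmp)
nrow(snvs)
```

This reads everything into memory at once, which is fine for small files but defeats the purpose for the full ClinVar dataset.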
Alternatively, I can use duckdb’s internal functionality to connect to external files, and then pass the database connection to dbplyr, which will translate common dplyr verbs into SQL for me. I will demonstrate this approach in the remainder of this post.
Recently, the duckplyr R package was released, offering yet another way of using duckdb for data analysis. duckplyr was built around an embedded version of duckdb to provide a drop-in replacement for dplyr’s functionality3, with automatic fall-back to dplyr for any functionality that is not supported within the duckdb database instance.
A main distinction between duckplyr and combining duckdb with dbplyr is that the former aims to support all R operators, functions, and data types4. It automatically decides when to pull data from the database instance into the R session (e.g. materialize it in memory). While that is often convenient, it might require some extra attention, especially when working with very large datasets.
suppressPackageStartupMessages(
library(duckplyr)
)
The duckplyr::read_parquet_duckdb function connects to a parquet file and returns a duckplyr_df object.
df <- read_parquet_duckdb(kParquet, prudence = "thrifty")
class(df)
[1] "prudent_duckplyr_df" "duckplyr_df" "tbl_df" "tbl"
[5] "data.frame"
The read_parquet_duckdb() function accepts a prudence argument, which governs whether and when data is actually materialized in memory.
- By default, duckplyr is thrifty and decides at each step of a workflow whether the dataset is small enough (< 1 million cells = rows multiplied by columns) to materialize in memory right away5.
- Alternatively, we could choose a lavish approach, suitable for small datasets, which triggers computation immediately.
- Finally, setting prudence to stingy avoids automatic materialization and requires the use of collect() or as_tibble() to materialize the data in a conventional data.frame.
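To illustrate the stingy setting, here is a small sketch using a stand-in parquet file written with duckdb; in the post’s workflow, kParquet would take its place:

```r
suppressPackageStartupMessages(library(duckplyr))

# Write a tiny stand-in parquet file with duckdb; in the workflow above,
# kParquet would take its place.
tmp <- tempfile(fileext = ".parquet")
con_tmp <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con_tmp, sprintf(
  "COPY (SELECT range AS x FROM range(5)) TO '%s' (FORMAT 'PARQUET')", tmp
))
DBI::dbDisconnect(con_tmp)

# With prudence = "stingy", nothing is materialized automatically;
# collect() (or as_tibble()) pulls the result into the R session.
lazy <- read_parquet_duckdb(tmp, prudence = "stingy")
result <- dplyr::collect(lazy)
nrow(result)
```

This pattern keeps all intermediate steps lazy, which is the safest choice when the parquet file is much larger than the available memory.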
Registering ClinVar as a VIEW
Currently, our in-memory duckdb database does not contain any tables.
dbListTables(con)
character(0)
Let’s select a subset of fields from the parquet file and also add a new width column. Because we will use the output of this query multiple times, it is useful to add it to the database as a VIEW, i.e. a canned query that is accessible under the clinvar name. (Creating a TABLE instead of a VIEW would result in importing an actual copy of the data into the duckdb database.)
Whenever we access the clinvar view, the SQL query that defines it is re-executed, i.e. data is retrieved from the parquet file.
CREATE OR REPLACE VIEW clinvar as
SELECT cv.*, (cv.end - cv.start) + 1.0 AS width
FROM (
    SELECT
        rcvaccession,
        _name AS name,
        chromosomeaccession AS seqnames,
        _start AS start,
        stop AS end,
        _type AS type,
        clinicalsignificance,
        clinsigsimple,
        phenotypeids,
        phenotypelist,
        rs_dbsnp AS dbsnp,
        nsvesv_dbvar AS dbvar,
        otherids
    FROM read_parquet(?kParquet)
) AS cv
DBI::dbListTables(con)
[1] "clinvar"
Next, let’s create a toy example of a query, i.e. a genomic region of interest defined by its genomic coordinates.
query <- data.frame(
  seqnames = "NC_000016.10",
  start = 2765864,
  end = 2765864
)
query
seqnames start end
1 NC_000016.10 2765864 2765864
Querying ClinVar with dplyr
suppressPackageStartupMessages(
library(dbplyr)
)
By passing the duckdb connection to the tbl function, I can use the database tables in dplyr workflows, e.g. to retrieve any ClinVar variants with the same coordinates as my query.
dplyr::inner_join(query, tbl(con, "clinvar"),
                  by = join_by(seqnames, start, end))
Ouch, that didn’t work - it triggered the following error:
Error in `auto_copy()`:
! `x` and `y` must share the same src.
ℹ `x` is a <tbl_duckdb_connection/tbl_dbi/tbl_sql/tbl_lazy/tbl> object.
ℹ `y` is a <data.frame> object.
ℹ Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).
The query data.frame exists in my R session, but the clinvar table is in the database. To join these two sources of information, I need to add the query to the database first.
I could use the dbWriteTable function to add a new (potentially temporary) table to the database. But there is a more convenient option: the duckdb::duckdb_register function makes the database aware of a data.frame without having to transfer any data; it is registered as a “virtual table” instead.
duckdb::duckdb_register(con, "query", query, overwrite = TRUE)
DBI::dbListTables(con)
[1] "clinvar" "query"
Now the join can proceed, e.g. by using dplyr to write a JOIN SQL statement for us, and we observe that this SNV corresponds to ClinVar report RCV002776799.
dplyr::inner_join(tbl(con, "query"), tbl(con, "clinvar"),
                  by = join_by(seqnames, start, end))
# Source: SQL [?? x 14]
# Database: DuckDB v1.3.2 [root@Darwin 24.6.0:R 4.5.1/:memory:]
seqnames start end rcvaccession name type clinicalsignificance
<chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 NC_000016.10 2765864 2765864 RCV002776799 NM_01633… sing… Uncertain significa…
# ℹ 7 more variables: clinsigsimple <dbl>, phenotypeids <chr>,
# phenotypelist <chr>, dbsnp <dbl>, dbvar <chr>, otherids <chr>, width <dbl>
Querying ClinVar using SQL
We can also write and submit SQL statements directly, and duckdb will search the parquet file (represented by the clinvar view) for us.
For example, we can request three rows:
SELECT * FROM clinvar LIMIT 3;
rcvaccession | name | seqnames | start | end | type | clinicalsignificance | clinsigsimple | phenotypeids | phenotypelist | dbsnp | dbvar | otherids | width |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCV000000014 | NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg) | NC_000015.10 | 84799209 | 84799209 | single nucleotide variant | Uncertain significance | 0 | MONDO:MONDO:0033005,MedGen:C4551772,OMIM:251300,Orphanet:2065,Orphanet:83472 | Galloway-Mowat syndrome 1 | 150829393 | - | ClinGen:CA210674,UniProtKB:Q92610#VAR_064583,OMIM:613624.0001 | 1 |
RCV000000015|RCV000578659|RCV001194045|RCV003390625 | NM_017547.4(FOXRED1):c.694C>T (p.Gln232Ter) | NC_000011.10 | 126275389 | 126275389 | single nucleotide variant | Pathogenic | 1 | MONDO:MONDO:0032624,MedGen:C4748791,OMIM:618241|MedGen:C3661900|MONDO:MONDO:0009723,MedGen:C2931891,OMIM:256000,Orphanet:506| | Mitochondrial complex I deficiency, nuclear type 19|not provided|Leigh syndrome|FOXRED1-related disorder | 267606829 | - | ClinGen:CA113792,OMIM:613622.0001 | 1 |
RCV000000016 | NM_017547.4(FOXRED1):c.1289A>G (p.Asn430Ser) | NC_000011.10 | 126277517 | 126277517 | single nucleotide variant | Likely pathogenic | 1 | MONDO:MONDO:0032624,MedGen:C4748791,OMIM:618241 | Mitochondrial complex I deficiency, nuclear type 19 | 267606830 | - | ClinGen:CA113794,UniProtKB:Q96CU9#VAR_064571,OMIM:613622.0002 | 1 |
information about a specific dbSNP identifier (rsid),
SELECT * FROM clinvar WHERE dbsnp == '1294019636';
rcvaccession | name | seqnames | start | end | type | clinicalsignificance | clinsigsimple | phenotypeids | phenotypelist | dbsnp | dbvar | otherids | width |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCV002776799 | NM_016333.4(SRRM2):c.5336G>A (p.Arg1779Lys) | NC_000016.10 | 2765864 | 2765864 | single nucleotide variant | Uncertain significance | 0 | MeSH:D030342,MedGen:C0950123 | Inborn genetic diseases | 1294019636 | - | ClinGen:CA394422276 | 1 |
look up a variant by its ClinVar identifier,
SELECT * FROM clinvar WHERE rcvaccession == 'RCV000000014';
rcvaccession | name | seqnames | start | end | type | clinicalsignificance | clinsigsimple | phenotypeids | phenotypelist | dbsnp | dbvar | otherids | width |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCV000000014 | NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg) | NC_000015.10 | 84799209 | 84799209 | single nucleotide variant | Uncertain significance | 0 | MONDO:MONDO:0033005,MedGen:C4551772,OMIM:251300,Orphanet:2065,Orphanet:83472 | Galloway-Mowat syndrome 1 | 150829393 | - | ClinGen:CA210674,UniProtKB:Q92610#VAR_064583,OMIM:613624.0001 | 1 |
list all chromosome names that are included in the ClinVar file,
SELECT DISTINCT seqnames AS chr FROM clinvar ORDER BY chr;
chr |
---|
NC_000001.11 |
NC_000002.12 |
NC_000003.12 |
NC_000004.12 |
NC_000005.10 |
NC_000006.12 |
NC_000007.14 |
NC_000008.11 |
NC_000009.12 |
NC_000010.11 |
or retrieve variants included in a specified range (in this case the same query we made with dplyr above).
SELECT * FROM clinvar
INNER JOIN query
ON (
    clinvar.seqnames = query.seqnames AND
    clinvar.start <= query.end AND
    clinvar.end >= query.start
)
rcvaccession | name | seqnames | start | end | type | clinicalsignificance | clinsigsimple | phenotypeids | phenotypelist | dbsnp | dbvar | otherids | width | seqnames | start | end |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RCV002776799 | NM_016333.4(SRRM2):c.5336G>A (p.Arg1779Lys) | NC_000016.10 | 2765864 | 2765864 | single nucleotide variant | Uncertain significance | 0 | MeSH:D030342,MedGen:C0950123 | Inborn genetic diseases | 1294019636 | - | ClinGen:CA394422276 | 1 | NC_000016.10 | 2765864 | 2765864 |
Finally, we close the connection to the in-memory database.
DBI::dbDisconnect(con)
Conclusions
Storing large datasets in parquet files and querying them using duckdb is one of my favorite things I learned this year. In this example, I used a local in-memory duckdb instance to search a local file, but I could do the same with files stored remotely, e.g. on AWS S3.
Eventually, as file sizes increase, it will become too slow to query files over the network, and I will have to consider moving my query engine closer to the data, e.g. by running duckdb on a cloud instance, or by querying the parquet files through a cloud database such as MotherDuck or AWS Athena. But for prototyping and medium-sized datasets, duckdb is a powerful addition to my genetics toolkit.
Reproducibility
Session Information
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] dbplyr_2.5.0 duckplyr_1.1.1 dplyr_1.1.4 glue_1.8.0 duckdb_1.3.2
[6] DBI_1.2.3
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
[5] xfun_0.52 purrr_1.1.0 renv_1.1.5 generics_0.1.4
[9] jsonlite_2.0.0 htmltools_0.5.8.1 rmarkdown_2.29 evaluate_1.0.4
[13] tibble_3.3.0 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
[17] memoise_2.0.1 compiler_4.5.1 pkgconfig_2.0.3 collections_0.3.8
[21] digest_0.6.37 R6_2.6.1 utf8_1.2.6 tidyselect_1.2.1
[25] pillar_1.11.0 magrittr_2.0.3 tools_4.5.1 cachem_1.1.0
This work is licensed under a Creative Commons Attribution 4.0 International License.
Footnotes
1. For real use cases, you probably want to store the parquet file in a permanent location, either locally or in a cloud bucket.↩︎
2. The same command could be used to convert very large files, as duckdb does not need to retain all of the variants in memory at the same time. If not all of the columns included in the source file are needed, selecting only a subset of fields can also dramatically decrease the size of the final parquet file.↩︎
3. With duckplyr there is no need to create and manage an in-memory duckdb instance - it’s all handled internally.↩︎
4. See this github issue for details.↩︎
5. For more details, please consult duckplyr’s “Memory protection: controlling automatic materialization” vignette.↩︎