Matching genetic variants using parquet files and duckdb

R
TIL
Author

Thomas Sandmann

Published

August 17, 2025

ClinVar is a public archive of reports of human variations associated with diseases and drug responses, with supporting evidence, hosted at the NCBI.

Retrieving ClinVar variants from the NCBI’s FTP server

First, we read information about disease-associated variants from a gzip-compressed tab-delimited text file provided by the NCBI, and store the single-nucleotide variants (SNVs) in a local parquet file.

Frequently, information about genetic variation is shared using files in Variant Call Format (VCF) or variations of it, e.g. genomic VCF (gVCF).

While widely used, VCF files for large collections of variants can become unwieldy. In 2021, researchers from 23andMe explored other file formats and reported that storing variant data in parquet files might be advantageous.

The NCBI makes ClinVar’s information available in multiple formats, including VCF and tab-delimited text files. Here, I am retrieving the latter. Multiple solutions to convert VCF to character-delimited or parquet files exist, including methods in the GLOW Spark framework, or the VariantsToTable tool included in the GATK suite. The vcf-to-duckdb python module goes one step further and converts a VCF file into a DuckDB database, comprising both Parquet files and an accompanying SQL schema.

library(DBI)
library(duckdb)
library(glue)

Let’s start by creating an in-memory duckdb instance and installing the httpfs extension to retrieve remote files.

con <- dbConnect(duckdb())
DBI::dbExecute(con, "INSTALL httpfs")
[1] 0

The latest release of ClinVar is available from its website. For this tutorial, I will convert it into a local parquet file stored in a temporary directory1.

kTab <- paste0(
    "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/",
    "variant_summary.txt.gz"
)
kParquet <- tempfile(fileext = ".parquet")

ClinVar accepts reports about any type of genetic variant, including single nucleotide variants (SNVs), small insertions/deletions or large copy number variations for human genome assemblies GRCh37 and GRCh38. Here, I will only retain rows describing single nucleotide variants and their location in the GRCh38 assembly.

A single SQL query retrieves the ~380 MB tab-delimited file, filters for single-nucleotide variants on the GRCh38 assembly, and writes them into a ~325 MB parquet file in a single step 2.

sql <- glue::glue_sql("
    COPY (
        SELECT *
        FROM read_csv({kTab}, AUTO_DETECT = TRUE, NORMALIZE_NAMES = TRUE)
        WHERE Assembly = 'GRCh38' AND _type = 'single nucleotide variant'
    )
    TO {kParquet} (FORMAT 'PARQUET');", .con = con)
DBI::dbExecute(con, sql)
[1] 3405475

Exploring genetic variants stored in parquet format

I can access the parquet file in different ways from R (or python or other environments). For example, I could read its entire content into my R session using

  • the {nanoparquet} R package, or
  • arrow::read_parquet from the arrow package, which supports dplyr verbs for downstream processing.

Alternatively, I can use duckdb’s internal functionality to connect to external files, and then pass the database connection to dbplyr, which will translate common dplyr verbs into SQL for me. I will demonstrate this approach in the remainder of this post.

Recently, the duckplyr R package was released, offering yet another way of using duckdb for data analysis. duckplyr is built around an embedded version of duckdb and provides a drop-in replacement for dplyr 3, automatically falling back to dplyr for any operation that is not supported within the duckdb database instance.

A main distinction between duckplyr and combining duckdb with dbplyr is that the former aims to support all R operators, functions, and data types 4. It automatically decides when to pull data from the database instance into the R session (i.e. materialize it in memory). While that is often convenient, it can require some extra attention, especially when working with very large datasets.

suppressPackageStartupMessages(
  library(duckplyr)
)

The duckplyr::read_parquet_duckdb function connects to a parquet file and returns a duckplyr_df object.

df <- read_parquet_duckdb(kParquet, prudence = "thrifty")
class(df)
[1] "prudent_duckplyr_df" "duckplyr_df"         "tbl_df"              "tbl"                
[5] "data.frame"         

The read_parquet_duckdb() function accepts a prudence argument, which governs whether and when data is actually materialized in memory.

  • By default, duckplyr is thrifty and decides at each step of a workflow whether the dataset is small enough (< 1 million cells = rows multiplied by columns) to materialize in memory right away 5.
  • Alternatively, we could choose a lavish approach, suitable for small datasets, which triggers the computation immediately.
  • Finally, setting prudence to stingy will avoid automatic materialization and requires the use of collect() or as_tibble() to materialize the data in a conventional data.frame.

Registering ClinVar as a VIEW

Currently, our in-memory duckdb database does not contain any tables.

dbListTables(con)
character(0)

Let’s select a subset of fields from the parquet file and also add a new width column. Because we will use the output of this query multiple times, it is useful to add it to the database as a VIEW, i.e. a canned query that is accessible under the clinvar name. (Creating a TABLE instead of a VIEW would result in importing an actual copy of the data into the duckdb database.)

Whenever we access the clinvar view, the SQL query that defines it is re-executed, i.e. the data is retrieved from the parquet file again.

CREATE OR REPLACE VIEW clinvar as
    SELECT cv.*, (cv.end - cv.start) + 1.0 AS width 
    FROM (
        SELECT
            rcvaccession,
            _name as name,
            chromosomeaccession AS seqnames,
            _start AS start,
            stop AS end,
            _type AS type,
            clinicalsignificance,
            clinsigsimple,
            phenotypeids,
            phenotypelist,
            rs_dbsnp AS dbsnp,
            nsvesv_dbvar AS dbvar,
            otherids
        FROM (FROM read_parquet(?kParquet)) AS cv
    ) AS cv
duckdb::dbListTables(con)
[1] "clinvar"

Next, let’s create a toy example of a query, i.e. a genomic region of interest defined by its genomic coordinates.

query <- data.frame(
  seqnames = "NC_000016.10",
  start = 2765864,
  end = 2765864
)
query
      seqnames   start     end
1 NC_000016.10 2765864 2765864

Querying ClinVar with dplyr

suppressPackageStartupMessages(
  library(dbplyr)
)

By passing the duckdb connection to the tbl function, I can use the database tables in dplyr workflows, e.g. to retrieve any ClinVar variants with the same coordinates as my query.

dplyr::inner_join(query, tbl(con, "clinvar"),
                  by = join_by(seqnames, start, end))

Ouch, that didn’t work - it triggered the following error:

Error in `auto_copy()`:
! `x` and `y` must share the same src.
ℹ `x` is a <tbl_duckdb_connection/tbl_dbi/tbl_sql/tbl_lazy/tbl> object.
ℹ `y` is a <data.frame> object.
ℹ Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).

The query data.frame exists in my R session, but the clinvar table is in the database. To join these two sources of information, I need to add the query to the database first.

I could use the dbWriteTable function to add a new (potentially temporary) table to the database. But there is a more convenient option: the duckdb::duckdb_register function makes the database aware of a data.frame without having to transfer any data; it is registered as a “virtual table” instead.

duckdb::duckdb_register(con, "query", query, overwrite = TRUE)
duckdb::dbListTables(con)
[1] "clinvar" "query"  

Now the join can proceed, with dplyr translating it into a JOIN SQL statement for us, and we observe that this SNV corresponds to ClinVar report RCV002776799.

dplyr::inner_join(tbl(con, "query"), tbl(con, "clinvar"),
                  by = join_by(seqnames, start, end))
# Source:   SQL [?? x 14]
# Database: DuckDB v1.3.2 [root@Darwin 24.6.0:R 4.5.1/:memory:]
  seqnames       start     end rcvaccession name      type  clinicalsignificance
  <chr>          <dbl>   <dbl> <chr>        <chr>     <chr> <chr>               
1 NC_000016.10 2765864 2765864 RCV002776799 NM_01633… sing… Uncertain significa…
# ℹ 7 more variables: clinsigsimple <dbl>, phenotypeids <chr>,
#   phenotypelist <chr>, dbsnp <dbl>, dbvar <chr>, otherids <chr>, width <dbl>

Querying ClinVar using SQL

We can also write and submit SQL statements directly, and duckdb will search the parquet file (represented by the clinvar view) for us.

For example, we can request three rows:

SELECT * FROM clinvar LIMIT 3;
3 records
rcvaccession name seqnames start end type clinicalsignificance clinsigsimple phenotypeids phenotypelist dbsnp dbvar otherids width
RCV000000014 NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg) NC_000015.10 84799209 84799209 single nucleotide variant Uncertain significance 0 MONDO:MONDO:0033005,MedGen:C4551772,OMIM:251300,Orphanet:2065,Orphanet:83472 Galloway-Mowat syndrome 1 150829393 - ClinGen:CA210674,UniProtKB:Q92610#VAR_064583,OMIM:613624.0001 1
RCV000000015|RCV000578659|RCV001194045|RCV003390625 NM_017547.4(FOXRED1):c.694C>T (p.Gln232Ter) NC_000011.10 126275389 126275389 single nucleotide variant Pathogenic 1 MONDO:MONDO:0032624,MedGen:C4748791,OMIM:618241|MedGen:C3661900|MONDO:MONDO:0009723,MedGen:C2931891,OMIM:256000,Orphanet:506| Mitochondrial complex I deficiency, nuclear type 19|not provided|Leigh syndrome|FOXRED1-related disorder 267606829 - ClinGen:CA113792,OMIM:613622.0001 1
RCV000000016 NM_017547.4(FOXRED1):c.1289A>G (p.Asn430Ser) NC_000011.10 126277517 126277517 single nucleotide variant Likely pathogenic 1 MONDO:MONDO:0032624,MedGen:C4748791,OMIM:618241 Mitochondrial complex I deficiency, nuclear type 19 267606830 - ClinGen:CA113794,UniProtKB:Q96CU9#VAR_064571,OMIM:613622.0002 1

information about a specific dbSNP identifier (rsid),

SELECT * FROM clinvar WHERE dbsnp == '1294019636';
1 records
rcvaccession name seqnames start end type clinicalsignificance clinsigsimple phenotypeids phenotypelist dbsnp dbvar otherids width
RCV002776799 NM_016333.4(SRRM2):c.5336G>A (p.Arg1779Lys) NC_000016.10 2765864 2765864 single nucleotide variant Uncertain significance 0 MeSH:D030342,MedGen:C0950123 Inborn genetic diseases 1294019636 - ClinGen:CA394422276 1

look up a variant by its ClinVar identifier,

SELECT * FROM clinvar WHERE rcvaccession == 'RCV000000014';
1 records
rcvaccession name seqnames start end type clinicalsignificance clinsigsimple phenotypeids phenotypelist dbsnp dbvar otherids width
RCV000000014 NM_014630.3(ZNF592):c.3136G>A (p.Gly1046Arg) NC_000015.10 84799209 84799209 single nucleotide variant Uncertain significance 0 MONDO:MONDO:0033005,MedGen:C4551772,OMIM:251300,Orphanet:2065,Orphanet:83472 Galloway-Mowat syndrome 1 150829393 - ClinGen:CA210674,UniProtKB:Q92610#VAR_064583,OMIM:613624.0001 1

list all chromosome names that are included in the ClinVar file,

SELECT DISTINCT seqnames AS chr FROM clinvar ORDER BY chr;
Displaying records 1 - 10
chr
NC_000001.11
NC_000002.12
NC_000003.12
NC_000004.12
NC_000005.10
NC_000006.12
NC_000007.14
NC_000008.11
NC_000009.12
NC_000010.11

or retrieve variants included in a specified range (in this case, the same query we made with dplyr above).

SELECT * FROM clinvar
INNER JOIN query
ON (
    clinvar.seqnames = query.seqnames AND
    clinvar.start <= query.end AND
    clinvar.end >= query.start
    )
1 records
rcvaccession name seqnames start end type clinicalsignificance clinsigsimple phenotypeids phenotypelist dbsnp dbvar otherids width seqnames start end
RCV002776799 NM_016333.4(SRRM2):c.5336G>A (p.Arg1779Lys) NC_000016.10 2765864 2765864 single nucleotide variant Uncertain significance 0 MeSH:D030342,MedGen:C0950123 Inborn genetic diseases 1294019636 - ClinGen:CA394422276 1 NC_000016.10 2765864 2765864

Finally, we close the connection to the in-memory database.

duckdb::dbDisconnect(con)

Conclusions

Storing large datasets in parquet files and querying them using duckdb is one of my favorite things I learned this year. In this example, I am using a local in-memory duckdb instance to search a local file, but I could do the same with files stored remotely e.g. on AWS S3.

Eventually, as file sizes increase, it will become too slow to query files over the network, and I will have to consider moving my query engine closer to the data, e.g. by running duckdb on a cloud instance, or by querying the parquet files through a cloud service such as MotherDuck or AWS Athena. But for prototyping and medium-sized datasets, duckdb is a powerful addition to my genetics toolkit.

Reproducibility

Session Information
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] dbplyr_2.5.0   duckplyr_1.1.1 dplyr_1.1.4    glue_1.8.0     duckdb_1.3.2  
[6] DBI_1.2.3     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.5         knitr_1.50        rlang_1.1.6      
 [5] xfun_0.52         purrr_1.1.0       renv_1.1.5        generics_0.1.4   
 [9] jsonlite_2.0.0    htmltools_0.5.8.1 rmarkdown_2.29    evaluate_1.0.4   
[13] tibble_3.3.0      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[17] memoise_2.0.1     compiler_4.5.1    pkgconfig_2.0.3   collections_0.3.8
[21] digest_0.6.37     R6_2.6.1          utf8_1.2.6        tidyselect_1.2.1 
[25] pillar_1.11.0     magrittr_2.0.3    tools_4.5.1       cachem_1.1.0     

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

  1. For real use cases, you probably want to store the parquet file in a permanent location, either locally or in a cloud bucket.↩︎

  2. The same command could be used to convert very large files, as duckdb does not need to retain all of the variants in memory at the same time. If not all of the columns included in the source file are needed, selecting only a subset of fields can also dramatically decrease the size of the final parquet file.↩︎

  3. With duckplyr there is no need to create and manage an in-memory duckdb instance - it’s all handled internally.↩︎

  4. See this github issue for details.↩︎

  5. For more details, please consult duckplyr’s Memory protection: controlling automatic materialization vignette.↩︎