Documenting data wrangling with the dtrackr R package

R
tidyverse
TIL
Author

Thomas Sandmann

Published

June 26, 2023

tl;dr

Today I learned about Robert Challen’s dtrackr R package. It extends functionality from the tidyverse to track and visualize the data wrangling operations that have been applied to a dataset.

Motivation

Publications involving cohorts of human subjects often include flow charts describing which participants were screened, included in a specific study arm or excluded from analysis. In fact, reporting guidelines such as CONSORT, STROBE or STARD include visualizations that communicate how participants flowed through the study.

If a dataset is processed with tidyverse functions, e.g. from the dplyr or tidyr R packages, methods from the dtrackr package add metadata to each step and automatically generate a flow chart.
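As a minimal sketch of this idea, the example below tracks two dplyr steps on the built-in iris dataset, which stands in for a real study cohort. It assumes that the dplyr and dtrackr packages are installed; the filter criterion is arbitrary and chosen only for illustration.

```r
library(dplyr)
library(dtrackr)

# iris is a stand-in for a cohort table; the filter criterion is arbitrary
tracked <- iris %>%
  track() %>%                       # start recording the processing history
  filter(Species != "setosa") %>%   # this step is recorded as an exclusion
  group_by(Species)                 # this step is recorded as a stratification

# Render the recorded steps as a flow chart
flowchart(tracked)
```

The tracked object is still a data.frame, so it can be passed on to downstream analysis code as usual.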

Installation

The dtrackr package is available from CRAN:

if (!requireNamespace("dtrackr", quietly = TRUE)) {
  install.packages("dtrackr")
}
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("dtrackr"))
library("glue")
library("GenomicDataCommons", 
        include.only = c("cases", "results", "ids", "gdc_clinical"))

The package includes several very useful vignettes, including an example of processing clinical trial data according to the CONSORT guidelines.

Retrieving metadata from The Cancer Genome Atlas

The GenomicDataCommons Bioconductor package provides an interface to search and retrieve data and metadata from the NIH Genomic Data Commons (GDC), including information from The Cancer Genome Atlas (TCGA), an international collaboration that collected molecular and clinical data on tens of thousands of human tumor samples.

Here, we retrieve metadata on the subjects and samples available as part of TCGA, and then use dtrackr to select a (hypothetical) subset of samples for analysis.

We start by retrieving data on 500 cases:

case_ids = cases() %>% 
  results(size=500L) %>% 
  ids()
clindat = gdc_clinical(case_ids)
names(clindat)
[1] "demographic" "diagnoses"   "exposures"   "main"       

This returns four data.frames (demographic, diagnoses, exposures and main) with complementary pieces of metadata for each participant. Of note, the diagnoses data.frame can contain multiple rows for the same case_id, e.g. when both a primary tumor and a metastasis sample were collected from the same patient.
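Because a table may hold more than one row per case, it can be worth checking for repeated identifiers before joining tables. Here is a minimal sketch in base R, using a small hypothetical data.frame in place of clindat$diagnoses (the identifiers and the sample_type column are made up for illustration):

```r
# Hypothetical stand-in for clindat$diagnoses
diagnoses <- data.frame(
  case_id = c("case-1", "case-1", "case-2", "case-3"),
  sample_type = c("primary", "metastasis", "primary", "primary")
)

# Count the rows per case and keep the identifiers that occur more than once
rows_per_case <- table(diagnoses$case_id)
repeated_cases <- names(rows_per_case[rows_per_case > 1])

repeated_cases
```

Joining such a table to one with a single row per case would repeat the matching rows, which changes the record counts reported downstream.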

Next, we will wrangle the data into shape and track our process with dtrackr.

Default options

First, we set a few default options:

old = options(
  dtrackr.strata_glue="{tolower(.value)}",
  dtrackr.strata_sep=", ",
  dtrackr.default_message = "{.count} records",
  dtrackr.default_headline = "{paste(.strata, ' ')}"
)
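The option values above are glue templates: expressions in curly braces are evaluated and replaced when a message is built. As a rough sketch of that mechanism, the following example calls the glue package directly, with made-up values standing in for the variables (.count, .value) that dtrackr supplies at each step:

```r
library(glue)

# Made-up values; in a tracked pipeline dtrackr provides these variables
message_text <- glue_data(list(.count = 500), "{.count} records")
stratum_text <- glue_data(list(.value = "Gliomas"), "{tolower(.value)}")

message_text
stratum_text
```

Any R expression can appear inside the braces, which is why the strata template can lower-case each value with tolower().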
# Label the starting point of the flow chart
clindat$demographic %>%
  comment("Demographic") %>%
  track() %>% 
  # Add disease_type from the main table and report the record counts
  inner_join(
    dplyr::select(clindat$main, case_id, disease_type),
    by = "case_id", 
    .headline = "Added disease type",
    .messages = c("{.count.lhs} records from Demographic table",
                  "joined with {.count.rhs} records from Main table:",
                  "{.count.out} in linked set")
  ) %>%
  # Keep subjects with any of three disease types
  include_any(
    disease_type == "Adenomas and Adenocarcinomas" ~ "{.included} Adenomas/ Adenocarcinomas",
    disease_type == "Ductal and Lobular Neoplasms" ~ "{.included} Ductal and Lobular Neoplasms",
    disease_type == "Gliomas" ~ "{.included} Gliomas",
    .headline = "Included disease types") %>%
  # Drop subjects that match any of the exclusion criteria
  exclude_all(
    age_at_index<35 ~ "{.excluded} subjects under 35",
    age_at_index>75 ~ "{.excluded} subjects over 75",
    race!="white" ~ "{.excluded} non-white subjects",
    .headline = "Exclusions:"
  ) %>%
  # Stratify by disease type, without a message for this step
  group_by(disease_type, .messages="") %>%
  # Report the number of subjects per ethnicity in each stratum
  count_subgroup(ethnicity) %>%
  # Report the percentage of male subjects in each stratum
  status(
    percent_male = sprintf("%1.2f%%", mean(gender=="male") * 100),
    .messages = c("male: {percent_male}")                    
  ) %>%
  # Merge the strata again and report the final record count
  ungroup(.messages = "{.count} in final data set") %>%
  # Render the recorded steps as a flow chart
  flowchart()
[Flow chart with the following nodes, from top to bottom]
Demographic
Added disease type: 500 records from Demographic table joined with 500 records from Main table: 500 in linked set
Included disease types, inclusions: 111 Adenomas/ Adenocarcinomas, 292 Ductal and Lobular Neoplasms, 41 Gliomas
Exclusions (side branch): 8 subjects under 35, 27 subjects over 75, 178 non-white subjects
adenomas and adenocarcinomas: hispanic or latino: 1 items, not hispanic or latino: 45 items, Unknown: 5 items; male: 64.71%
ductal and lobular neoplasms: hispanic or latino: 11 items, not hispanic or latino: 125 items, not reported: 17 items; male: 1.31%
gliomas: hispanic or latino: 2 items, not hispanic or latino: 32 items, Unknown: 4 items; male: 47.37%
242 in final data set
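To make the selection logic of the pipeline explicit: include_any() keeps rows that match at least one of its criteria, while exclude_all() drops rows that match at least one of its criteria. The sketch below mimics the exclusion step with plain dplyr on a small made-up data.frame. It is only an approximation, since it ignores missing values and does not record any counts for a flow chart.

```r
library(dplyr)

# Made-up subjects; the column names mirror those used in the pipeline
subjects <- data.frame(
  age_at_index = c(30, 50, 80, 60),
  race = c("white", "white", "white", "asian")
)

# Drop a row as soon as any one of the exclusion criteria is TRUE
kept <- subjects %>%
  filter(!(age_at_index < 35 | age_at_index > 75 | race != "white"))

kept
```

A subject can match several criteria at once, so the per-criterion counts shown in an exclusion box do not necessarily add up to the number of rows removed.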

Finally, we restore the default options:

options(old)
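This works because options() returns the previous values of the options it changes, so passing that list back restores the earlier state. A small sketch with the base R digits option, chosen here only for illustration:

```r
before <- getOption("digits")

# options() returns the previous values of the options it changes
old <- options(digits = 3)
during <- getOption("digits")

# Passing the saved list back restores the earlier state
options(old)
after <- getOption("digits")

c(before = before, during = during, after = after)
```

Options that were unset beforehand are stored as NULL in the saved list, so restoring it removes them again.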

Reproducibility

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       macOS Ventura 13.4
 system   x86_64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-06-26
 pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package            * version   date (UTC) lib source
 BiocGenerics         0.46.0    2023-04-25 [1] Bioconductor
 bitops               1.0-7     2021-04-24 [1] CRAN (R 4.3.0)
 cli                  3.6.1     2023-03-23 [1] CRAN (R 4.3.0)
 crayon               1.5.2     2022-09-29 [1] CRAN (R 4.3.0)
 curl                 5.0.1     2023-06-07 [1] CRAN (R 4.3.0)
 digest               0.6.32    2023-06-26 [1] CRAN (R 4.3.0)
 dplyr              * 1.1.2     2023-04-20 [1] CRAN (R 4.3.0)
 dtrackr            * 0.4.0     2023-03-24 [1] CRAN (R 4.3.0)
 evaluate             0.21      2023-05-05 [1] CRAN (R 4.3.0)
 fansi                1.0.4     2023-01-22 [1] CRAN (R 4.3.0)
 fastmap              1.1.1     2023-02-24 [1] CRAN (R 4.3.0)
 generics             0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
 GenomeInfoDb         1.36.1    2023-06-21 [1] Bioconductor
 GenomeInfoDbData     1.2.10    2023-04-30 [1] Bioconductor
 GenomicDataCommons * 1.24.2    2023-05-25 [1] Bioconductor
 GenomicRanges        1.52.0    2023-04-25 [1] Bioconductor
 glue               * 1.6.2     2022-02-24 [1] CRAN (R 4.3.0)
 hms                  1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
 htmltools            0.5.5     2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets          1.6.2     2023-03-17 [1] CRAN (R 4.3.0)
 httr                 1.4.6     2023-05-08 [1] CRAN (R 4.3.0)
 IRanges              2.34.1    2023-06-22 [1] Bioconductor
 jsonlite             1.8.5     2023-06-05 [1] CRAN (R 4.3.0)
 knitr                1.43      2023-05-25 [1] CRAN (R 4.3.0)
 lifecycle            1.0.3     2022-10-07 [1] CRAN (R 4.3.0)
 magrittr             2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
 pillar               1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig            2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
 purrr                1.0.1     2023-01-10 [1] CRAN (R 4.3.0)
 R6                   2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
 rappdirs             0.3.3     2021-01-31 [1] CRAN (R 4.3.0)
 Rcpp                 1.0.10    2023-01-22 [1] CRAN (R 4.3.0)
 RCurl                1.98-1.12 2023-03-27 [1] CRAN (R 4.3.0)
 readr                2.1.4     2023-02-10 [1] CRAN (R 4.3.0)
 rlang                1.1.1     2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown            2.22      2023-06-01 [1] CRAN (R 4.3.0)
 rstudioapi           0.14      2022-08-22 [1] CRAN (R 4.3.0)
 S4Vectors            0.38.1    2023-05-11 [1] Bioconductor
 sessioninfo          1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
 stringi              1.7.12    2023-01-11 [1] CRAN (R 4.3.0)
 stringr              1.5.0     2022-12-02 [1] CRAN (R 4.3.0)
 tibble               3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
 tidyr                1.3.0     2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect           1.2.0     2022-10-10 [1] CRAN (R 4.3.0)
 tzdb                 0.4.0     2023-05-12 [1] CRAN (R 4.3.0)
 utf8                 1.2.3     2023-01-31 [1] CRAN (R 4.3.0)
 V8                   4.3.0     2023-04-08 [1] CRAN (R 4.3.0)
 vctrs                0.6.3     2023-06-14 [1] CRAN (R 4.3.0)
 withr                2.5.0     2022-03-03 [1] CRAN (R 4.3.0)
 xfun                 0.39      2023-04-20 [1] CRAN (R 4.3.0)
 xml2                 1.3.4     2023-04-27 [1] CRAN (R 4.3.0)
 XVector              0.40.0    2023-04-25 [1] Bioconductor
 yaml                 2.3.7     2023-01-23 [1] CRAN (R 4.3.0)
 zlibbioc             1.46.0    2023-04-25 [1] Bioconductor

 [1] /Users/sandmann/Library/R/x86_64/4.3/library
 [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library

──────────────────────────────────────────────────────────────────────────────

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.