if (!requireNamespace("dtrackr", quietly = TRUE)) {
  install.packages("dtrackr")
}
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("dtrackr"))
library("glue")
library("GenomicDataCommons",
        include.only = c("cases", "results", "ids", "gdc_clinical"))
tl;dr
Today I learned about Robert Challen’s dtrackr R package. It extends functionality from the tidyverse to track and visualize the data wrangling operations that have been applied to a dataset.
Motivation
Publications involving cohorts of human subjects often include flow charts describing which participants were screened, included in a specific study arm or excluded from analysis. In fact, reporting guidelines such as CONSORT, STROBE or STARD include visualizations that communicate how the participants flowed through the study.
If a dataset is processed with tidyverse functions, e.g. from the dplyr or tidyr R packages, methods from the dtrackr package add metadata to each step and automatically generate a flow chart.
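To make the idea concrete, here is a minimal sketch on a toy cohort. The `toy` data frame, its column names and its values are invented for illustration; only `track()` and `exclude_all()` are taken from dtrackr.

```r
library(dplyr)
library(dtrackr)

# Hypothetical toy cohort, invented for illustration only
toy = tibble(
  id  = 1:6,
  age = c(25, 40, 52, 67, 80, 33)
)

tracked = toy %>%
  track() %>%                                    # start recording the history
  exclude_all(
    age < 35 ~ "{.excluded} subjects under 35"   # the rule and its count are logged
  )

nrow(tracked)
```

The two subjects under 35 are dropped, and the step is stored alongside the data, so calling `flowchart(tracked)` afterwards should draw the recorded steps.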
Installation
The dtrackr package is available from CRAN. It contains several very useful vignettes, including an example of processing clinical trial data according to CONSORT guidelines.
Retrieving metadata from The Cancer Genome Atlas
The GenomicDataCommons Bioconductor package provides an interface to search and retrieve data and metadata from the NIH Genomic Data Commons (GDC), including information from The Cancer Genome Atlas (TCGA), an international collaboration that collected molecular and clinical data on tens of thousands of human tumor samples.
Here, we retrieve metadata on the subjects and samples available as part of TCGA, and then use dtrackr to select a (hypothetical) subset of samples for analysis.
We start by retrieving data on 500 cases
case_ids = cases() %>%
  results(size=500L) %>%
  ids()
clindat = gdc_clinical(case_ids)
names(clindat)
[1] "demographic" "diagnoses" "exposures" "main"
and obtain four data.frames: demographic, diagnoses, exposures and main, with complementary pieces of metadata for each participant. Of note, the diagnoses data.frame can contain multiple rows for the same case_id, e.g. when both a primary tumor and a metastasis sample were collected from the same patient.
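A quick way to spot such repeated cases is to count rows per case_id. The sketch below uses a small stand-in table rather than the real download; the `sample` column and all values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for clindat$diagnoses; values invented for illustration
diagnoses = tibble(
  case_id = c("c1", "c1", "c2", "c3"),
  sample  = c("primary", "metastasis", "primary", "primary")
)

# Cases that contribute more than one diagnosis row
multi = diagnoses %>%
  count(case_id, name = "n_rows") %>%
  filter(n_rows > 1)

multi
```

With the real data, the same `count()` / `filter()` pair could be applied to `clindat$diagnoses` before joining it to a one-row-per-case table.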
Next, we will wrangle it into shape and track our process with dtrackr.
Default options
First, we set a few default options:
old = options(
  dtrackr.strata_glue="{tolower(.value)}",
  dtrackr.strata_sep=", ",
  dtrackr.default_message = "{.count} records",
  dtrackr.default_headline = "{paste(.strata, ' ')}"
)
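These option values are glue templates: the expressions in curly braces are evaluated against variables that dtrackr provides at each step. The sketch below imitates that with glue directly; the `step` list and its values are assumptions made up for illustration, not something dtrackr exposes under that name.

```r
library(glue)

# Hypothetical values standing in for what dtrackr supplies at each step
step = list(.value = "Gliomas", .count = 42L)

# Same templates as in the options above, filled in by hand
strata_label = glue_data(step, "{tolower(.value)}")
message_text = glue_data(step, "{.count} records")

strata_label
message_text
```

Here `strata_label` becomes "gliomas" and `message_text` becomes "42 records", which is roughly how group labels and default messages end up looking in the flow chart.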
clindat$demographic %>%
  comment("Demographic") %>%
  track() %>%
  inner_join(
    dplyr::select(clindat$main, case_id, disease_type),
    by = "case_id",
    .headline = "Added disease type",
    .messages = c("{.count.lhs} records from Demographic table",
                  "joined with {.count.rhs} records from Main table:",
                  "{.count.out} in linked set")
  ) %>%
  include_any(
    disease_type == "Adenomas and Adenocarcinomas" ~ "{.included} Adenomas/ Adenocarcinomas",
    disease_type == "Ductal and Lobular Neoplasms" ~ "{.included} Ductal and Lobular Neoplasms ",
    disease_type == "Gliomas" ~ "{.included} Gliomas",
    .headline = "Included disease types") %>%
  exclude_all(
    age_at_index<35 ~ "{.excluded} subjects under 35",
    age_at_index>75 ~ "{.excluded} subjects over 75",
    race!="white" ~ "{.excluded} non-white subjects",
    .headline = "Exclusions:"
  ) %>%
  group_by(disease_type, .messages="") %>%
  count_subgroup(ethnicity) %>%
  status(
    percent_male = sprintf("%1.2f%%", mean(gender=="male") * 100),
    .messages = c("male: {percent_male}")
  ) %>%
  ungroup(.messages = "{.count} in final data set") %>%
  flowchart()
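For readers more familiar with plain dplyr, the row selection done by include_any() and exclude_all() above corresponds roughly to two filter() calls, without the bookkeeping. This is a sketch of my understanding on invented data with no missing values: a row is kept if it matches any inclusion rule, and dropped if it matches any exclusion rule.

```r
library(dplyr)

# Hypothetical records, invented for illustration only
toy = tibble(
  disease_type = c("Gliomas", "Gliomas", "Other", "Gliomas"),
  age_at_index = c(30, 50, 60, 80),
  race         = c("white", "white", "white", "asian")
)

kept = toy %>%
  # include_any(): keep rows that match at least one inclusion rule
  filter(disease_type == "Gliomas" |
           disease_type == "Adenomas and Adenocarcinomas") %>%
  # exclude_all(): drop rows that match at least one exclusion rule
  filter(!(age_at_index < 35 | age_at_index > 75 | race != "white"))

kept
```

Only the 50-year-old glioma case survives both steps. What dtrackr adds on top is the per-rule count that appears in the flow chart.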
Finally, we restore the default options:
options(old)
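This works because options() returns the previous values of whatever it changes, so the returned list can be passed straight back in. A small base R sketch of the same pattern, using the built-in `digits` option instead of the dtrackr ones:

```r
before = getOption("digits")

old = options(digits = 3)      # returns the previous setting as a list
during = getOption("digits")   # the new value is active here

options(old)                   # restore what was there before
after = getOption("digits")
```

After the last call, `after` is the same as `before`, while `during` is 3.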
Reproducibility
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.0 (2023-04-21)
os macOS Ventura 13.4
system x86_64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Los_Angeles
date 2023-06-26
pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
BiocGenerics 0.46.0 2023-04-25 [1] Bioconductor
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
curl 5.0.1 2023-06-07 [1] CRAN (R 4.3.0)
digest 0.6.32 2023-06-26 [1] CRAN (R 4.3.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
dtrackr * 0.4.0 2023-03-24 [1] CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
GenomeInfoDb 1.36.1 2023-06-21 [1] Bioconductor
GenomeInfoDbData 1.2.10 2023-04-30 [1] Bioconductor
GenomicDataCommons * 1.24.2 2023-05-25 [1] Bioconductor
GenomicRanges 1.52.0 2023-04-25 [1] Bioconductor
glue * 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
httr 1.4.6 2023-05-08 [1] CRAN (R 4.3.0)
IRanges 2.34.1 2023-06-22 [1] Bioconductor
jsonlite 1.8.5 2023-06-05 [1] CRAN (R 4.3.0)
knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
RCurl 1.98-1.12 2023-03-27 [1] CRAN (R 4.3.0)
readr 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.22 2023-06-01 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
S4Vectors 0.38.1 2023-05-11 [1] Bioconductor
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
V8 4.3.0 2023-04-08 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xml2 1.3.4 2023-04-27 [1] CRAN (R 4.3.0)
XVector 0.40.0 2023-04-25 [1] Bioconductor
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
zlibbioc 1.46.0 2023-04-25 [1] Bioconductor
[1] /Users/sandmann/Library/R/x86_64/4.3/library
[2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
──────────────────────────────────────────────────────────────────────────────
This work is licensed under a Creative Commons Attribution 4.0 International License.