Retrieving access-controlled data from NCBI’s dbGAP repository

NCBI
dbGAP
TIL
Author

Thomas Sandmann

Published

September 17, 2023

tl;dr:

Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and Access Controlled data. The latter is usually used to disseminate data from individual human participants and requires a data access application.

After the data access request has been granted, it is time to retrieve the actual data from dbGAP and - in case of short read data - its sister repository the Short Read Archive.

Authenticating with JWT or NGC files

The path to authenticating and downloading dbGAP data differs depending on whether you are using the AWS or GCP cloud proviers, or a local compute infrastructure (or another, unsupported cloud provider) instead.

Authentication within AWS or GCP cloud environments

On these two platforms, you have two paths to access the data:

  1. With a JWT file: A JWT1 file, introduced with sra-tools version 2.10, allows users to transfer data from dbGAP’s cloud buckets into your own cloud instance. (Because both your and dbGAP’s system’s share the same cloud environment, this is faster than a regular transfer e.g. via https or ftp) 2
  1. Via fusera: Alternatively, you can mount dbGAP’s buckets as read-only volumes on your cloud instances via fusera3

The nf-core/fetchngs workflow supports the retrieval of dbGAP data via JWT file authentication, e.g. when it is executed on AWS or GCP compute instances (see above). As all nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS Batch queue. (Highly recommended when you need to retrieve large amounts of data.)

Authenticating outside AWS / GCP

On all other compute platforms, including your laptop or your local high- performance cluster (HPC), you need to authenticate with an NGC file (containing your repository key) instead45.

In this blog post, I will outline how to use NGC authentication, but make sure to read dbGAP’s official documentation as well.

Retrieving dbGAP data with NGC authentication

If you are not working on AWS or GCP, and need to rely on NGC authentication, the following steps might be useful.

1. Log into dbGAP

2. Install sra-tools from github

I usually download the latest binary of the sra-tools suite for my operating system from [github]https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). Alternatively, you can also install it using Bioconda

Note

Please note that the sra-tools package is frequently updated, so make sure you have the latest version).

For example, this code snipped retrieves and decompresses the latest version for Ubuntu Linux into the ~/bin directory:

mkdir -p ~/bin
pushd ~/bin
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
rm sratoolkit.3.0.7-ubuntu64.tar.gz
popd

Afterward, I add the sra directory to my PATH and verify that it’s the version I expected:

export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
prefetch --version  # verify that it's the version you downloaded
Note

The best source of information about using the various tools included in the sra-tools is the sra-tools wiki.

3. Configure the sra toolkit

Next, I configure the toolkit, especially the location of the cache directory. The prefetch command stores all SRA files it downloads in this location, so I make sure it is on a volume that is large enough to hold the expected amount of data.

./vdb-config -i
  • In the Cache section, O specify an existing directory as the public user-repository. This is where prefetch will be download files to (and they will be kept until the cache is cleared!)

My settings are stored in the ${HOME}/.ncbi/user-settings.mkfg file. For more information and other ways to configure the toolkit, please see its wiki page.

4. Log into dbGAP to retrieve the repository key

  • Back in on the dbGAP website, navigate to the “My Projects” tab
  • Choose “get dbGaP repository key” in the “Actions” column.
  • Download the repository key file with the .ngc extension to your system.

5. Choose the files to download from SRA

  • In your dbGAP account, next navigate to the “My requests” tab.
  • Click on “Request files” on the right side of the table.
  • Navigate to the SRA data (reads and reference alignments) tab.
  • Click on SRA Run Selector
  • Select all of the files you would like to download in the table at the bottom of the page.
  • Toggle the Selected radio button.

6. Download the .krt file that specifies which files to retrieve

  • Download the .krt file by clicking on the green Cart file button.

7. Initiate the download of the files in SRA format

  • Now, with both the .ngc and .krt files in hand, we can trigger the download with the sra-tool’s prefetch command. We need to provide both paths to
    • the repository key (via --ngc) and
    • the cart file (via --cart)

For example, this code snipped assumes the two files are are in my home directory. (The exact names of your .ngc and .krt files will be different.)

mkdir -p ~/dbgap
pushd ~/dbgap
prefetch \
  --max-size u \
  --ngc ~/prj_123456.ngc \
  --cart ~/cart_DAR12345_2023081212345.krt
popd

Note: The files are downloaded (and cached) in SRA format into the directory I specified when configuring the sra-toolkit (e.g. the public user-repository). Extracting reads and generating FASTQ files is a separate step.

8. Decrypt SRA files and extract reads in FASTQ format

🚨 The final fastq-files will be approximately 7 times the size of the accession. The fasterq-dump-tool needs temporary space (scratch space) of about 1.5 times the amount of the final fastq-files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession.

🚨 The extraction and recompression steps are very CPU intensive, and it is recommended to use multiple cores. (The code below uses all available cores, as determined via the nproc --all command.)

The fasterq-dump tool extracts the reads into FASTQ files. It only accepts a single accession at a time, and expects to find the corresponding SRA file in the cache directory. Like the prefetch command above, it requires the .ngc file to verify that I am permitted to decrypt the data.

To save disk space I only extract a single SRA file at a time and then compress the FASTQ files with pigz. Afterward I copy the compressed FASTQ files to an AWS S3 bucket and delete the local files before processing the next accession.

#!/usr/bin/env bash
set -e
set -o nounset

declare -r CACHEDIR="~/cache/sra/"  # the cache directory with .sra files
declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"

for SRA_FILE in ${CACHEDIR}/*.sra
do
  fasterq-dump -p \
    --threads $(nproc --all) \
    --ngc ~/prj_123456.ngc \
    $(basename $SRA_FILE .sra)
  pigz \
    --processes $(nproc --all) \
    *.fastq
  aws s3 sync \
    --exclude "*" \
    --include "*.fastq.gz" \
    . \
    ${BUCKET}
  rm *.fastq.gz
done

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

  1. JSON Web Token↩︎

  2. dbGAP’s official JWT instructions are here.↩︎

  3. dbGAP’s official fusera instructions are here.↩︎

  4. dbGAP’s official NGC instructions are here.↩︎

  5. NGC file authentication also works on cloud instances, e.g. an AWS EC2 instance, but it is slower as it doesn’t take advantage of the fact that your instance and dbGAP’s data bucket are co-located.↩︎