tl;dr:
Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and Access Controlled data. The latter is usually used to disseminate data from individual human participants and requires a data access application.
After the data access request has been granted, it is time to retrieve the actual data from dbGAP and, in the case of short read data, its sister repository, the Sequence Read Archive (SRA).
Authenticating with JWT or NGC files
The path to authenticating and downloading dbGAP data differs depending on whether you are using the AWS or GCP cloud providers, or a local compute infrastructure (or another, unsupported cloud provider) instead.
Authentication within AWS or GCP cloud environments
On these two platforms, you have two paths to access the data:
- With a JWT file: A JWT file, introduced with sra-tools version 2.10, allows users to transfer data from dbGAP’s cloud buckets into their own cloud instance. (Because both your and dbGAP’s systems share the same cloud environment, this is faster than a regular transfer, e.g. via https or ftp.)
- Via fusera: Alternatively, you can mount dbGAP’s buckets as read-only volumes on your cloud instances via fusera.
The nf-core/fetchngs workflow supports the retrieval of dbGAP data via JWT file authentication, that is, when it is executed on AWS or GCP compute instances (see above). Like all nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS Batch queue. (Highly recommended when you need to retrieve large amounts of data.)
Authenticating outside AWS / GCP
On all other compute platforms, including your laptop or your local high-performance computing (HPC) cluster, you need to authenticate with an NGC file (containing your repository key) instead (see the footnote at the end of this post).
In this blog post, I will outline how to use NGC authentication, but make sure to read dbGAP’s official documentation as well.
Retrieving dbGAP data with NGC authentication
If you are not working on AWS or GCP and need to rely on NGC authentication, the following steps might be useful.
1. Log into dbGAP
- Navigate to the dbGAP login page for controlled access and log in with your eRA credentials.
2. Install sra-tools from github
I usually download the latest binary of the sra-tools suite for my operating system from GitHub (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). Alternatively, you can install it using Bioconda.
Please note that the sra-tools package is frequently updated, so make sure you have the latest version.
For example, this code snippet retrieves and decompresses version 3.0.7 for Ubuntu Linux into the ~/bin directory:
mkdir -p ~/bin
pushd ~/bin
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
rm sratoolkit.3.0.7-ubuntu64.tar.gz
popd
Afterward, I add the toolkit's bin directory to my PATH and verify that it’s the version I expected:
export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
prefetch --version # verify that it's the version you downloaded
The best source of information about using the various tools included in the sra-tools suite is the sra-tools wiki.
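The version check can also be scripted, which is handy on shared systems where an older toolkit may come first on the PATH. The following is only a sketch: the helper name `version_ge` and the `MIN_VERSION` threshold are my own illustrative choices, and the comparison assumes a `sort` that supports the `-V` (version sort) option, as GNU coreutils does.

```shell
#!/usr/bin/env sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Assumption: `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_VERSION="2.10.0"  # hypothetical threshold; adjust to your needs

if command -v prefetch >/dev/null 2>&1; then
  # Take the first x.y.z string printed by `prefetch --version`.
  installed="$(prefetch --version | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
  if version_ge "${installed}" "${MIN_VERSION}"; then
    echo "prefetch ${installed} is recent enough"
  else
    echo "prefetch ${installed} is older than ${MIN_VERSION}" >&2
  fi
else
  echo "prefetch not found on PATH" >&2
fi
```

A plain string comparison would rank 2.9 above 2.10, which is why the sketch relies on version sort instead.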
3. Configure the sra toolkit
Next, I configure the toolkit, especially the location of the cache directory. The prefetch command stores all SRA files it downloads in this location, so I make sure it is on a volume that is large enough to hold the expected amount of data.
vdb-config -i
- In the Cache section, I specify an existing directory as the public user-repository. This is where prefetch will download files to (and they will be kept until the cache is cleared!).
My settings are stored in the ${HOME}/.ncbi/user-settings.mkfg file. For more information and other ways to configure the toolkit, please see its wiki page.
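Because the cache volume has to hold everything prefetch downloads, it can be worth checking the free space before starting. Here is a rough sketch; the helper `free_kib` and the variables `CACHE_DIR` and `NEEDED_KIB` are hypothetical names I picked for illustration, so substitute your own cache directory and expected download size.

```shell
#!/usr/bin/env sh
# free_kib DIR: print the space available on the volume holding DIR, in KiB.
# `df -P` forces the single-line POSIX output format, `-k` reports KiB.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

CACHE_DIR="${CACHE_DIR:-.}"          # hypothetical: defaults to the current directory
NEEDED_KIB="${NEEDED_KIB:-1048576}"  # hypothetical: expected download size (here 1 GiB)

available="$(free_kib "${CACHE_DIR}")"
if [ "${available}" -ge "${NEEDED_KIB}" ]; then
  echo "${CACHE_DIR}: ${available} KiB free, enough for ${NEEDED_KIB} KiB"
else
  echo "${CACHE_DIR}: only ${available} KiB free, need ${NEEDED_KIB} KiB" >&2
fi
```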
4. Log into dbGAP to retrieve the repository key
- Back on the dbGAP website, navigate to the “My Projects” tab.
- Choose “get dbGaP repository key” in the “Actions” column.
- Download the repository key file with the .ngc extension to your system.
5. Choose the files to download from SRA
- In your dbGAP account, navigate to the “My requests” tab.
- Click on “Request files” on the right side of the table.
- Navigate to the SRA data (reads and reference alignments) tab.
- Click on SRA Run Selector.
- Select all of the files you would like to download in the table at the bottom of the page.
- Toggle the Selected radio button.
6. Download the .krt file that specifies which files to retrieve
- Download the .krt file by clicking on the green Cart file button.
7. Initiate the download of the files in SRA format
- Now, with both the .ngc and .krt files in hand, we can trigger the download with sra-tools’ prefetch command. We need to provide the paths to both
  - the repository key (via --ngc) and
  - the cart file (via --cart).
For example, this code snippet assumes the two files are in my home directory. (The exact names of your .ngc and .krt files will be different.)
mkdir -p ~/dbgap
pushd ~/dbgap
prefetch \
--max-size u \
--ngc ~/prj_123456.ngc \
--cart ~/cart_DAR12345_2023081212345.krt
popd
Note: The files are downloaded (and cached) in SRA format into the directory I specified when configuring the sra-toolkit (i.e. the public user-repository). Extracting reads and generating FASTQ files is a separate step.
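Before kicking off a long download, it can help to confirm that both input files are actually in place. The sketch below is one way to do that; `check_input` is a hypothetical helper of my own, and the default paths simply mirror the placeholder file names used in the prefetch example.

```shell
#!/usr/bin/env sh
# check_input FILE EXT: succeed only if FILE ends in .EXT, is readable, and is not empty.
check_input() {
  case "$1" in
    *."$2") ;;
    *) echo "expected a .$2 file: $1" >&2; return 1 ;;
  esac
  if [ ! -r "$1" ] || [ ! -s "$1" ]; then
    echo "missing, unreadable, or empty: $1" >&2
    return 1
  fi
}

# Hypothetical paths; substitute your own repository key and cart file.
NGC_FILE="${NGC_FILE:-${HOME}/prj_123456.ngc}"
KRT_FILE="${KRT_FILE:-${HOME}/cart_DAR12345_2023081212345.krt}"

if check_input "${NGC_FILE}" ngc && check_input "${KRT_FILE}" krt; then
  echo "inputs look fine; ready to run prefetch"
else
  echo "fix the paths above before running prefetch" >&2
fi
```

The check only looks at the file name and size; it cannot tell whether the key is still valid for your project.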
8. Decrypt SRA files and extract reads in FASTQ format
🚨 The final fastq-files will be approximately 7 times the size of the accession. The fasterq-dump-tool needs temporary space (scratch space) of about 1.5 times the amount of the final fastq-files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession.
🚨 The extraction and recompression steps are very CPU intensive, and it is recommended to use multiple cores. (The code below uses all available cores, as determined via the nproc --all command.)
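The disk-space rule of thumb above is easy to turn into a quick estimate. This is a sketch using shell integer arithmetic; the function names and the example size `SRA_GIB` are hypothetical, and the factors (7 times for the FASTQ output, 1.5 times that for scratch space) are taken directly from the warning above.

```shell
#!/usr/bin/env sh
# All functions take the accession size and print an estimate in the same unit.
# Integer arithmetic truncates, so results for odd sizes are rounded down.
estimate_fastq()   { echo $(( $1 * 7 )); }                   # final FASTQ files: ~7x
estimate_scratch() { echo $(( $1 * 7 * 3 / 2 )); }           # scratch: ~1.5x the FASTQ size
estimate_peak()    { echo $(( $1 * 7 + $1 * 7 * 3 / 2 )); }  # FASTQ + scratch

SRA_GIB=10  # hypothetical accession size in GiB
echo "FASTQ output:  ~$(estimate_fastq "${SRA_GIB}") GiB"
echo "scratch space: ~$(estimate_scratch "${SRA_GIB}") GiB"
echo "peak total:    ~$(estimate_peak "${SRA_GIB}") GiB"
```

For a 10 GiB accession this works out to 70 GiB of FASTQ plus 105 GiB of scratch space, 175 GiB in total, which is 17.5 times the accession size and in line with the "approximately 17 times" quoted above.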
The fasterq-dump tool extracts the reads into FASTQ files. It only accepts a single accession at a time, and expects to find the corresponding SRA file in the cache directory. Like the prefetch command above, it requires the .ngc file to verify that I am permitted to decrypt the data.
To save disk space, I only extract a single SRA file at a time and then compress the FASTQ files with pigz. Afterward, I copy the compressed FASTQ files to an AWS S3 bucket and delete the local files before processing the next accession.
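#!/usr/bin/env bash
set -e
set -o nounset
# the cache directory with .sra files
# (use ${HOME} here: a "~" inside double quotes is not expanded by bash)
declare -r CACHEDIR="${HOME}/cache/sra"
declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"
for SRA_FILE in "${CACHEDIR}"/*.sra
do
  # extract the reads of a single accession into FASTQ files
  fasterq-dump -p \
    --threads "$(nproc --all)" \
    --ngc ~/prj_123456.ngc \
    "$(basename "${SRA_FILE}" .sra)"
  # compress the FASTQ files using all available cores
  pigz \
    --processes "$(nproc --all)" \
    *.fastq
  # copy only the compressed FASTQ files to the S3 bucket
  aws s3 sync \
    --exclude "*" \
    --include "*.fastq.gz" \
    . "${BUCKET}"
  # delete the local copies before processing the next accession
  rm *.fastq.gz
done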
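Since the loop deletes the local files right after the upload, you might want to test the compressed files first. The following sketch uses `gzip -t`, which checks the integrity of gzip-format files (the format pigz writes); `verify_gz` is a hypothetical helper name, and the small demo file only exists to make the example self-contained.

```shell
#!/usr/bin/env sh
# verify_gz FILE...: succeed only if every file passes gzip's integrity test.
verify_gz() {
  for gz_file in "$@"; do
    if ! gzip -t "${gz_file}" 2>/dev/null; then
      echo "corrupt or incomplete: ${gz_file}" >&2
      return 1
    fi
  done
}

# Demo on a tiny FASTQ record written to a temporary directory.
demo_dir="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' > "${demo_dir}/demo.fastq"
gzip "${demo_dir}/demo.fastq"
verify_gz "${demo_dir}"/*.fastq.gz && echo "all files passed"
rm -r "${demo_dir}"
```

In the loop above, a call such as `verify_gz *.fastq.gz` could go between the pigz and aws steps, so that a failed compression stops the script before anything is uploaded or deleted.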
This work is licensed under a Creative Commons Attribution 4.0 International License.
Footnotes
NGC file authentication also works on cloud instances, e.g. an AWS EC2 instance, but it is slower as it doesn’t take advantage of the fact that your instance and dbGAP’s data bucket are co-located.