Acquiring Genomic Data from NCBI
An Introduction to Genomic Data and NCBI Databases
Genomic data is essential for studying evolutionary biology, human health, and epidemiology. For example, genomic data is used to track the evolution of various strains of the SARS-CoV-2 virus (see the Nextstrain visualizations linked in the references for some great examples). But first, what is the difference between genomic and genetic data? Genomic data includes all of the genetic material from an organism, that is, the sequence of every gene from start to finish. Genetic data, on the other hand, focuses on the function or composition of a single gene.
Since the Human Genome Project was completed in 2003, sequencing technologies have advanced dramatically and costs have dropped. As the quantity of genomic data grows, storing, managing, and sharing large files across research labs becomes challenging. The National Center for Biotechnology Information (NCBI) stores both raw sequencing reads, through its Sequence Read Archive (SRA) database, and assembled genomes, in its Assembly database. The difference between the two is that raw reads are short segments of the full genome, so many computational steps are needed to assemble them into a genome. There are different workflows for assembling data, and researchers prefer one or another depending on their research question. Most pipelines start with acquiring raw reads (fastq files), so that is what we will focus on here.
Find Your Sequences
Let’s go through an example workflow to acquire data for several SARS-CoV-2 genomes. If we search for “SARS-CoV-2” on NCBI’s SRA database, we see that there are 6,380,100 results (and counting…). Wow! Using the checkboxes, let’s select the first 5 genomes on the list. Then, in the upper right corner, click “Send to”, select “Run Selector”, and click “Go”. The Run Selector tool lets us see more metadata about these sequences: who collected them (e.g. Broad Institute), what host they came from (e.g. Homo sapiens), etc. From this tool, you can easily download all of the metadata for the selected sequences. Importantly, you can also download a list of the run accessions (unique codes for each of the genomes uploaded to SRA) by clicking Download > Accession List. The run accessions have names like “SRR22249084”. We will use this list to programmatically download all of the read data for these accessions. The list is saved as a .txt file.
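For reference, the downloaded accession list is just a plain text file with one run accession per line. Using accessions that appear elsewhere in this walkthrough (your list will differ depending on what you selected), the file looks something like:

SRR22249084
SRR10056629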
Install or Load SRA Toolkit
The SRA Toolkit (quick reminder: SRA stands for Sequence Read Archive) is a code repository managed by NCBI. It includes several tools and libraries that let us easily download SRA data from the command line. This is particularly helpful for downloading SRA data onto cloud computing resources: you can download sequencing data locally for small projects, but the size of these files can be prohibitive. Here’s a quick rundown of ways to install the SRA Toolkit:
- Instructions for downloading and installing locally are included on the GitHub page.
- Some university computing resources may have standard bioinformatics software pre-installed. If you are working on Berkeley’s High Performance Computing (HPC) cluster, sra-tools is already available. Loading the software is simple:
module load sra-tools
Be aware that there are different versions of the SRA Toolkit. The command we are going to use (fasterq-dump) is a more recent replacement for the older fastq-dump command. You can still use fastq-dump; some key differences between the two are outlined in the fasterq-dump notes linked in the references.
Use a Python Loop to Download Your Data
Before we download any data, make sure you have the accession list file in your working directory and that sra-tools has been installed. Then, in the same directory, create a Python file. We will call this file fasterq_loop.py. The first step is to load in our run accession (SRR) list:
srr_list = open("SRR_Acc_List_example.txt","r").read().splitlines()
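As a small aside, exported accession lists sometimes end in a blank line. If you want to guard against that, a slightly more defensive version of the same step (a sketch, assuming the same example file name) uses a context manager and drops empty entries:

# Read the accession list, skipping blank lines and stray whitespace
with open("SRR_Acc_List_example.txt") as handle:
    srr_list = [line.strip() for line in handle if line.strip()]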
Great. We now have a list of the accessions we want sequencing data for. The next step is to loop through this accession list and call the fasterq-dump command (from the SRA Toolkit) for each accession. To do so, we will wrap the command in a string and then run it through the subprocess module, which allows us to execute shell commands from Python. Note that we import subprocess before this loop.
import subprocess

for srr in srr_list:
    # Build the fasterq-dump command for this accession, then run it in the shell
    fasterq_dump = "fasterq-dump " + srr + " --outdir example_dir --progress --threads 4"
    subprocess.call(fasterq_dump, shell=True)
The other flags set the output directory, display a progress bar, and use 4 threads. Finally, let’s run the Python file we just wrote from the console to download our data:
python fasterq_loop.py
We see the progress report for downloading the first accession. Nice!
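As an aside, if you prefer to avoid shell=True, a variant of the same loop (a sketch, assuming the same accession list and output directory as above) passes the arguments to subprocess.run as a list, so no shell is involved, and stops on the first failed download:

import subprocess

# Read the accession list as before
srr_list = open("SRR_Acc_List_example.txt", "r").read().splitlines()

for srr in srr_list:
    # check=True raises an error if fasterq-dump exits with a non-zero status
    subprocess.run(
        ["fasterq-dump", srr, "--outdir", "example_dir", "--progress", "--threads", "4"],
        check=True,
    )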
Check Our Output
I used the above code to download some of the accessions for this example. Let’s look at the contents of the output directory we created (example_dir). We will call ls with the long (-l) and human-readable (-h) flags to check file sizes.
cd example_dir
ls -l -h
Wait, why are there 2 files for each accession? For many sequencing types, the run outputs files for both strands of the double-stranded DNA: one file holds the “forward” reads (file name ending in _1), while the other holds the “reverse” reads (file name ending in _2). So, for each accession, there is about 9.6 Gb of data. This is why cloud computing resources are helpful for large genomic datasets.
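If you want a quick sanity check on a pair of files, each fastq record spans exactly four lines, so the number of reads is simply the line count divided by four; the counts for the _1 and _2 files of an accession should match. A minimal sketch, using one of the accessions downloaded above:

# Count reads in a fastq file: each record is exactly four lines
def count_reads(path):
    with open(path) as handle:
        return sum(1 for _ in handle) // 4

print(count_reads("example_dir/SRR10056629_1.fastq"))
print(count_reads("example_dir/SRR10056629_2.fastq"))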
Finally, let’s check out what one of our files looks like by calling “head”:
head SRR10056629_1.fastq
This is standard fastq file format. We have information about each individual read: the length, the sequence, and the quality encoding. Here, most of the bases (that is, the A, C, G, or T) have a quality score of “F”, which is on the upper end.
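As a quick aside on what that “F” means: fastq quality characters are ASCII codes offset by 33 in the Sanger/Illumina 1.8+ encoding (see the Illumina reference below), so converting a character to its Phred quality score is a one-liner. A minimal sketch:

# Convert a fastq quality character to its Phred score (Phred+33 encoding)
def phred_score(char):
    return ord(char) - 33

print(phred_score("F"))  # 37, i.e. roughly a 1 in 5,000 chance the base call is wrong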
References
- Nextstrain. Genomic epidemiology of SARS-CoV-2 with subsampling focused globally over the past 6 months. https://nextstrain.org/ncov/gisaid/global/6m?l=unrooted
- NCBI. SRA Toolkit 3.0.3. https://github.com/ncbi/sra-tools
- Harrington, R. Fasterq-dump. https://rnnh.github.io/bioinfo-notebook/docs/fasterq-dump.html
- Illumina. Quality Score Encoding. https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm