In Silico Approach to Mining Viral Sequences from Bulk RNA-Seq Data
You may have heard of a microbiome, but what about a virome? Like other microbes, viruses shape host and ecosystem health, but can be challenging to isolate and study.
All biological entities, including viruses, contain unique DNA or RNA. Sequencing technologies allow us to identify (micro)organisms and their functions. One approach to identifying viruses through DNA/RNA sequencing involves physically isolating viruses from a sample before sequencing. However, this approach is resource- and time-intensive, requiring specialized equipment and labor-intensive filtering approaches [1].
In contrast, bulk sequencing identifies all of the DNA or RNA present in a mixed community, including viruses. These methods permit the identification of viruses in a sample without needing complex, specialized processing methods. Furthermore, they can permit paired analyses of different microbial groups and their hosts. For example, bulk RNA-Seq data can be used to study both host function and microbial community composition/function since it contains (1) transcripts of actively expressed host and microbe (including DNA virus) genes and (2) genomes of RNA viruses. Yet, mining viral data from these datasets can be elusive and underutilized.
With a suite of tools optimized for analysis of viral sequences from metagenomic/transcriptomic data, researchers can mine viral sequences from next-generation sequencing data to understand viral communities in different contexts. In this tutorial, we will explore a concise in silico approach to mine viral sequences from publicly available bulk RNA-Seq data from distinct layers of a lake in Washington, U.S.A. These analyses pave the way for downstream comparisons of the role of viruses in specific contexts, including disease contexts across animal, plant, and human hosts.
Step 1: Install Programs
This tutorial is intended for Linux HPC systems with Anaconda3 (2024.02-1-11.4), Git, Python (3.11.6-gcc-11.4.0), and SPAdes (4.1.0) preinstalled and available as modules. If your system is not set up this way, consult the program documentation to download/install Anaconda, Git, Python, and SPAdes
To set up your environment for this tutorial, clone the GitHub repository and make the scripts executable.
|
# Clone the GitHub repo for this tutorial git clone https://github.com/ckarrick/Virus_RNA-seq_tutorial.git # Move to the directory cd Virus_RNA-seq_tutorial/ # Make the scripts in each subdirectory executable chmod +x data/*.sh chmod +x programs/*.sh |
This should download a directory titled Virus_RNA-seq_tutorial. Navigate to the programs subdirectory and install the programs with the download_programs.sh script.
|
# Move to the programs directory cd programs/ # Download all required programs ./download_programs.sh |
Alternatively, you can download each program individually by running each line of the download_programs.sh script individually as follows:
|
# Download SRA-Toolkit wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-al... # Decompress SRA-Toolkit tar -vxzf sratoolkit.tar.gz # Download FastP wget http://opengene.org/fastp/fastp # Make FastP executable chmod a+x ./fastp # Install BBMap wget https://sourceforge.net/projects/bbmap/files/BBMap_39.09.tar.gz # Decompress BBMap tar -xvf BBMap_39.09.tar.gz # Install MVP with conda module load anaconda3 # Load anaconda3 module mamba create -n mvip -c conda-forge -c bioconda mvip # Make the MVP scripts executable chmod +x mvip/modules/*.py |
Step 2: Download Data
For this tutorial, we are using RNA-Seq data from a study that investigated viral activity in different layers of lakes [2]. To reduce the computational resources required for this tutorial, we will use two samples, one from the mixolimnion layer of Lime Blue lake and one from the microbial plate layer of this lake. These data are available with NCBI accession numbers SRR30671252 and SRR30671260. To download these data from NCBI using these accession numbers, run the download_data.sh script in the data directory.
|
# Move to the data directory cd ../data/ # Run the download_data.sh script ./download_data.sh |
Alternatively, you can download the data by running the code in the download_data.sh script separately by running:
|
# For each sample's accession number for sra in SRR30671252 SRR30671260; do # Download the SRA file ../programs/sratoolkit.3.2.1-alma_linux64/bin/prefetch $sra # Convert the SRA file to a FASTQ file ../programs/sratoolkit.3.2.1-alma_linux64/bin/fasterq-dump $sra done |
This should download four total FASTQ files, two per sample (ending with _1.fastq and _2.fastq), since these data were generated from paired-end sequencing. For ease of interpretation, we will rename these files from their accession numbers to their IDs.
|
# Rename the FASTQ files mv SRR30671252_1.fastq Lime_Blue_Mixolimnion_1.fastq mv SRR30671252_2.fastq Lime_Blue_Mixolimnion_2.fastq mv SRR30671260_1.fastq Lime_Blue_Microbial_Plate_1.fastq mv SRR30671260_2.fastq Lime_Blue_Microbial_Plate_2.fastq |
Step 3: Clean and Normalize Data
Now that we have downloaded the RNA-Seq data, we will clean it for downstream analyses. First, we will use FastP with default parameters for quality control and adapter trimming. You can do this by either running the run_fastp.sh script in the data directory…
|
# Run the run_fastp.sh script ./run_fastp.sh |
…or running FastP outside of the script:
|
# For each sample for sample in *_1.fastq; do # Define the sample name from what comes before _1.fastq sample_name=$(basename $sample _1.fastq) # Run FastP ../programs/fastp \ -i ${sample_name}_1.fastq \ -I ${sample_name}_2.fastq \ -o cleaned_${sample_name}_1.fastq \ -O cleaned_${sample_name}_2.fastq done |
After running FastP, you should have four new files corresponding to the raw FASTQ sequences, but starting with “cleaned.” Next, normalize the data with BBNorm within BBMap to reduce biases from variable sequencing depth within/between samples. If your HPC has a job scheduler, it may be helpful to run this as a job. Either run the run_bbnorm.sh script…
|
# Run the run_bbnorm.sh script ./run_bbnorm.sh |
…or run BBNorm outside of the script:
|
# For each sample for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do # Run BBNorm ../programs/bbmap/bbnorm.sh \ in=cleaned_${sample_name}_1.fastq \ in2=cleaned_${sample_name}_2.fastq \ out=normalized_${sample_name}_1.fastq \ out2=normalized_${sample_name}_2.fastq done |
Now the data are ready for downstream analyses since they have been cleaned and normalized! At this point, some researchers employ an additional filtering step to filter out host reads using BBSplit within BBMap, but we will skip this step for the purposes of this tutorial since we are using environmental RNA-Seq data not associated with a specific host. For an example of this approach, see [3].
Step 4: Perform De Novo Transcriptome Assembly
Next, we will assemble viral sequences from the RNA-Seq data. We will use rnaviralSPAdes, a program specifically designed for de novo transcriptome assembly (i.e., assembling without a reference transcriptome) of viral sequences from RNA-Seq data. If your HPC has a job scheduler available, it may be useful to run this step as a job.
|
# Load required programs module load python module load spades/4.1.0 # Assemble viral contigs with rnaviralSPAdes for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do spades.py \ --rnaviral \ -1 normalized_${sample_name}_1.fastq \ -2 normalized_${sample_name}_2.fastq \ -o ${sample_name}_rnaviralspades done |
This should create subdirectories within the data directory for each sample. We will use the scaffolds.fasta file in each directory to map our reads to our assembly file in Step 5 below.
Step 5: Map Reads to Viral Assemblies and Quantify Viruses
Last, we will map the RNA-Seq read to the viral assemblies we created in Step 4. Then, we will annotate these assemblies to identify viral groups (as virus operational taxonomic units or vOTUs) and proteins. All of these functions will be performed with the Modular Viromics Pipeline, or MVP [4]. As in Step 4, it is recommended to run these commands as a job on an HPC, if possible.
MVP takes a modular approach to identify viruses, perform quality assessment/filtering, cluster viral sequences into vOTUs, estimate abundance, annotate function, and perform binning. First, activate the MVP conda environment.
|
# Unload python module unload python # Load anaconda3 module load anaconda3 # Load the MVP conda environment source activate mvip |
MVP requires files to end in R1.fastq or R2.fastq. So, we need to rename our normalized FASTQ files in this format.
|
# Rename the normalized FASTQ files to be compatible with MVP for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do # Rename forward reads mv normalized_${sample_name}_1.fastq normalized_${sample_name}_R1.fastq # Rename reverse reads mv normalized_${sample_name}_2.fastq normalized_${sample_name}_R2.fastq done |
MVP also requires a metadata file so it can identify the assembly file associated with each sample. This metadata file is a .txt file with tab-separated values of the sample number, sample ID, absolute assembly path to the scaffolds.fasta output from rnaviralSPAdes, and absolute read path to the normalized FASTQ files we renamed above. There is an example mvp_metadata.txt file in the MVP subdirectory of the data directory, but you will adjust the paths with your absolute file paths. For more information about this file, see the MVP user documentation
To get the absolute read path, run pwd in the data directory. Paste the output of pwd plus “normalized_Lime_Blue_Microbial_Plate_R1.fastq” or “normalized_Lime_Blue_Mixolimnion_R1.fastq” in the mvp_metadata.txt file.
|
# Print the working directory of the data directory pwd |
To get the absolute assembly path, navigate to the output of rnaviralSPAdes for each sample and run pwd.
|
# Move to the directory containing the rnaviralSPAdes output for the mixolimnion sample cd Lime_Blue_Mixolimnion_rnaviralspades/ # Print the working directory pwd # Move to the directory containing the rnaviralSPAdes output for the microbial plate sample cd ../Lime_Blue_Microbial_Plate_rnaviralspades/ # Print the working directory pwd |
Paste the output of pwd plus “/scaffolds.fasta” for each sample in the Assembly_Path fields of the mvp_metadata.txt file. To edit the mvp_metadata.txt file, run:
|
# Move to the MVP directory cd ../MVP/ # Open the mvp_metadata.txt file to edit it vi mvp_metadata.txt |
Once you have edited the mvp_metadata.txt file appropriately, exit the vi editor with esc+:wq to save your changes. Ensure each field is tab-separated before moving on.
Now, we are ready to run each module of MVP.
|
# Run MVP module 00 (MVP setup) mvip MVP_00_set_up_MVP -i . -m mvp_metadata.txt |
After running MVP module 00, your 00_DATABASES directory should contain six subdirectories of each database MVP will use in other modules. Ensure all the databases are downloaded before continuing. If they are not all present, try rerunning module 00 above. Once all the databases are downloaded, you can proceed to the remaining modules.
|
# Run MVP module 01 (Running geNomad and CheckV) mvip MVP_01_run_genomad_checkv -i . -m mvp_metadata.txt # Run MVP module 02 (Filtering viral prediction) mvip MVP_02_filter_genomad_checkv -i . -m mvp_metadata.txt # Run MVP module 03 (Clustering) mvip MVP_03_do_clustering -i . -m mvp_metadata.txt # Run MVP module 04 (Read mapping) mvip MVP_04_do_read_mapping -i . -m mvp_metadata.txt # Run MVP module 05 (Creating vOTU tables) mvip MVP_05_create_vOTU_table -i . -m mvp_metadata.txt # Run MVP module 06 (Functional prediction) mvip MVP_06_do_functional_annotation -i . -m mvp_metadata.txt # Run MVP module 07 (Binning viral genomes) mvip MVP_07_do_binning -i . -m mvp_metadata.txt # Run MVP module 100 (Summarize output) mvip MVP_100_summarize_outputs -i . -m mvp_metadata.txt |
Within the MVP directory, MVP will produce subdirectories of the output per module. You can use these outputs individually or navigate to the 100_SUMMARIZED_OUTPUTS directory for summary outputs. These outputs can be used for downstream statistical analyses and visualizations in RStudio. In particular, the output from module 05 (ending with vOTU_RPKM_Table.tsv) can be used as a vOTU counts table for community ecology analyses.
Summary
In this tutorial, we ran a bioinformatic pipeline to mine publicly available RNA-Seq datasets for viral sequences. We downloaded the necessary programs and data, cleaned and normalized the data, performed de novo transcriptome assembly, and mapped the reads to the assembled transcriptomes to quantity viruses in each sample. This pipeline can be applied to other RNA-Seq datasets, and the output can be used to analyze and compare viral communities in different contexts, including diseases.
References
-
Kosmopoulos, J.C., Klier, K.M., Langwig, M.V., Tran, P.Q., & Anantharaman, K. (2024). Viromes vs. mixed community metagenomes: choice of method dictates interpretation of viral community ecology. Microbiome, 12: 195. https://doi.org/10.1186/s40168-024-01905-x
-
Varona, N.S., Wallace, B.A., Bosco-Santos, A., Mullinax, J., Stiffler, A.K., O’Beirne, M.D., Ford, J., Fulton, J.M., Werne, J.P., Gilhooly III, W.P., & Silveira, C.B. (2025). Microbiome, 13: 104. https://doi.org/10.1186/s40168-025-02085-y
-
Beavers, K.M., Van Buren, E.W., Rossin, A.M., Emery, M.A., Veglia, A.J., Karrick, C.E., MacKnight, N.J., Dimos, B.A., Meiling, S.S., Smith, T.B., Apprill, A., Muller, E.M., Holstein, D.M., Correa, A.M.S., Brandt, M.E., & Mydlarz, L.D. (2023). Stony coral tissue loss disease induces transcriptional signatures of in situ degradation of dysfunctional Symbiodiniaceae. Nature Communications, 14: 2915. https://doi.org/10.1038/s41467-023-38612-4
-
Coclet, C., Camargo, A.P., & Roux, S. (2024). MVP: a modular viromics pipeline to identify, filter, cluster, annotate, and bin viruses from metagenomes. mSystems, 9: e00888-24. https://doi.org/10.1128/msystems.00888-24
