In Silico Approach to Mining Viral Sequences from Bulk RNA-Seq Data

October 28, 2025

In Silico Approach to Mining Viral Sequences from Bulk RNA-Seq Data

You may have heard of a microbiome, but what about a virome? Like other microbes, viruses shape host and ecosystem health, but can be challenging to isolate and study. 

All biological entities, including viruses, contain unique DNA or RNA. Sequencing technologies allow us to identify (micro)organisms and their functions. One approach to identifying viruses through DNA/RNA sequencing involves physically isolating viruses from a sample before sequencing. However, this approach is resource- and time-intensive, requiring specialized equipment and labor-intensive filtering approaches [1]. 

In contrast, bulk sequencing identifies all of the DNA or RNA present in a mixed community, including viruses. These methods permit the identification of viruses in a sample without needing complex, specialized processing methods. Furthermore, they can permit paired analyses of different microbial groups and their hosts. For example, bulk RNA-Seq data can be used to study both host function and microbial community composition/function since it contains (1) transcripts of actively expressed host and microbe (including DNA virus) genes and (2) genomes of RNA viruses. Yet, mining viral data from these datasets can be elusive and underutilized.

With a suite of tools optimized for analysis of viral sequences from metagenomic/transcriptomic data, researchers can mine viral sequences from next-generation sequencing data to understand viral communities in different contexts. In this tutorial, we will explore a concise in silico approach to mine viral sequences from publicly available bulk RNA-Seq data from distinct layers of a lake in Washington, U.S.A. These analyses pave the way for downstream comparisons of the role of viruses in specific contexts, including disease contexts across animal, plant, and human hosts.

Step 1: Install Programs

This tutorial is intended for Linux HPC systems with Anaconda3 (2024.02-1-11.4), Git, Python (3.11.6-gcc-11.4.0), and SPAdes (4.1.0) preinstalled and available as modules. If your system is not set up this way, consult the program documentation to download/install Anaconda, Git, Python, and SPAdes

To set up your environment for this tutorial, clone the GitHub repository and make the scripts executable.

# Clone the GitHub repo for this tutorial

git clone https://github.com/ckarrick/Virus_RNA-seq_tutorial.git


# Move to the directory

cd Virus_RNA-seq_tutorial/


# Make the scripts in each subdirectory executable

chmod +x data/*.sh

chmod +x programs/*.sh

This should download a directory titled Virus_RNA-seq_tutorial. Navigate to the programs subdirectory and install the programs with the download_programs.sh script.

# Move to the programs directory

cd programs/


# Download all required programs

./download_programs.sh

Alternatively, you can download each program individually by running each line of the download_programs.sh script individually as follows:

# Download SRA-Toolkit

wget --output-document sratoolkit.tar.gz https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-al...


# Decompress SRA-Toolkit

tar -vxzf sratoolkit.tar.gz


# Download FastP

wget http://opengene.org/fastp/fastp


# Make FastP executable

chmod a+x ./fastp


# Install BBMap

wget https://sourceforge.net/projects/bbmap/files/BBMap_39.09.tar.gz


# Decompress BBMap

tar -xvf BBMap_39.09.tar.gz


# Install MVP with conda

module load anaconda3 # Load anaconda3 module

mamba create -n mvip -c conda-forge -c bioconda mvip


# Make the MVP scripts executable

chmod +x mvip/modules/*.py

Step 2: Download Data

For this tutorial, we are using RNA-Seq data from a study that investigated viral activity in different layers of lakes [2]. To reduce the computational resources required for this tutorial, we will use two samples, one from the mixolimnion layer of Lime Blue lake and one from the microbial plate layer of this lake. These data are available with NCBI accession numbers SRR30671252 and SRR30671260. To download these data from NCBI using these accession numbers, run the download_data.sh script in the data directory.

# Move to the data directory

cd ../data/


# Run the download_data.sh script

./download_data.sh

Alternatively, you can download the data by running the code in the download_data.sh script separately by running:

# For each sample's accession number

for sra in SRR30671252 SRR30671260; do

  # Download the SRA file

  ../programs/sratoolkit.3.2.1-alma_linux64/bin/prefetch $sra

  # Convert the SRA file to a FASTQ file

  ../programs/sratoolkit.3.2.1-alma_linux64/bin/fasterq-dump $sra

done

This should download four total FASTQ files, two per sample (ending with _1.fastq and _2.fastq), since these data were generated from paired-end sequencing. For ease of interpretation, we will rename these files from their accession numbers to their IDs.

# Rename the FASTQ files

mv SRR30671252_1.fastq Lime_Blue_Mixolimnion_1.fastq

mv SRR30671252_2.fastq Lime_Blue_Mixolimnion_2.fastq

mv SRR30671260_1.fastq Lime_Blue_Microbial_Plate_1.fastq

mv SRR30671260_2.fastq Lime_Blue_Microbial_Plate_2.fastq

Step 3: Clean and Normalize Data

Now that we have downloaded the RNA-Seq data, we will clean it for downstream analyses. First, we will use FastP with default parameters for quality control and adapter trimming. You can do this by either running the run_fastp.sh script in the data directory…

# Run the run_fastp.sh script

./run_fastp.sh

…or running FastP outside of the script:

# For each sample

for sample in *_1.fastq; do

  # Define the sample name from what comes before _1.fastq

  sample_name=$(basename $sample _1.fastq)

  # Run FastP

  ../programs/fastp \

    -i ${sample_name}_1.fastq \

    -I ${sample_name}_2.fastq \

    -o cleaned_${sample_name}_1.fastq \

    -O cleaned_${sample_name}_2.fastq

done

After running FastP, you should have four new files corresponding to the raw FASTQ sequences, but starting with “cleaned.” Next, normalize the data with BBNorm within BBMap to reduce biases from variable sequencing depth within/between samples. If your HPC has a job scheduler, it may be helpful to run this as a job. Either run the run_bbnorm.sh script…

# Run the run_bbnorm.sh script

./run_bbnorm.sh

…or run BBNorm outside of the script:

# For each sample

for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do

  # Run BBNorm

  ../programs/bbmap/bbnorm.sh \

    in=cleaned_${sample_name}_1.fastq \

    in2=cleaned_${sample_name}_2.fastq \

    out=normalized_${sample_name}_1.fastq \

    out2=normalized_${sample_name}_2.fastq

done

Now the data are ready for downstream analyses since they have been cleaned and normalized! At this point, some researchers employ an additional filtering step to filter out host reads using BBSplit within BBMap, but we will skip this step for the purposes of this tutorial since we are using environmental RNA-Seq data not associated with a specific host. For an example of this approach, see [3].

Step 4: Perform De Novo Transcriptome Assembly

Next, we will assemble viral sequences from the RNA-Seq data. We will use rnaviralSPAdes, a program specifically designed for de novo transcriptome assembly (i.e., assembling without a reference transcriptome) of viral sequences from RNA-Seq data. If your HPC has a job scheduler available, it may be useful to run this step as a job.

# Load required programs

module load python

module load spades/4.1.0


# Assemble viral contigs with rnaviralSPAdes

for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do

  spades.py \

    --rnaviral \

    -1 normalized_${sample_name}_1.fastq \

    -2 normalized_${sample_name}_2.fastq \

    -o ${sample_name}_rnaviralspades

done

This should create subdirectories within the data directory for each sample. We will use the scaffolds.fasta file in each directory to map our reads to our assembly file in Step 5 below.

Step 5: Map Reads to Viral Assemblies and Quantify Viruses

Last, we will map the RNA-Seq read to the viral assemblies we created in Step 4. Then, we will annotate these assemblies to identify viral groups (as virus operational taxonomic units or vOTUs) and proteins. All of these functions will be performed with the Modular Viromics Pipeline, or MVP [4]. As in Step 4, it is recommended to run these commands as a job on an HPC, if possible.

MVP takes a modular approach to identify viruses, perform quality assessment/filtering, cluster viral sequences into vOTUs, estimate abundance, annotate function, and perform binning. First, activate the MVP conda environment.

# Unload python

module unload python


# Load anaconda3

module load anaconda3


# Load the MVP conda environment

source activate mvip

MVP requires files to end in R1.fastq or R2.fastq. So, we need to rename our normalized FASTQ files in this format.

# Rename the normalized FASTQ files to be compatible with MVP

for sample_name in Lime_Blue_Mixolimnion Lime_Blue_Microbial_Plate; do

  # Rename forward reads

  mv normalized_${sample_name}_1.fastq normalized_${sample_name}_R1.fastq

  # Rename reverse reads

  mv normalized_${sample_name}_2.fastq normalized_${sample_name}_R2.fastq

done

MVP also requires a metadata file so it can identify the assembly file associated with each sample. This metadata file is a .txt file with tab-separated values of the sample number, sample ID, absolute assembly path to the scaffolds.fasta output from rnaviralSPAdes, and absolute read path to the normalized FASTQ files we renamed above. There is an example mvp_metadata.txt file in the MVP subdirectory of the data directory, but you will adjust the paths with your absolute file paths. For more information about this file, see the MVP user documentation

To get the absolute read path, run pwd in the data directory. Paste the output of pwd plus “normalized_Lime_Blue_Microbial_Plate_R1.fastq” or “normalized_Lime_Blue_Mixolimnion_R1.fastq” in the mvp_metadata.txt file.

# Print the working directory of the data directory

pwd

To get the absolute assembly path, navigate to the output of rnaviralSPAdes for each sample and run pwd.

# Move to the directory containing the rnaviralSPAdes output for the mixolimnion sample

cd Lime_Blue_Mixolimnion_rnaviralspades/


# Print the working directory

pwd


# Move to the directory containing the rnaviralSPAdes output for the microbial plate sample

cd ../Lime_Blue_Microbial_Plate_rnaviralspades/


# Print the working directory

pwd

Paste the output of pwd plus “/scaffolds.fasta” for each sample in the Assembly_Path fields of the mvp_metadata.txt file. To edit the mvp_metadata.txt file, run:

# Move to the MVP directory

cd ../MVP/


# Open the mvp_metadata.txt file to edit it

vi mvp_metadata.txt

Once you have edited the mvp_metadata.txt file appropriately, exit the vi editor with esc+:wq to save your changes. Ensure each field is tab-separated before moving on.

Now, we are ready to run each module of MVP.

# Run MVP module 00 (MVP setup)

mvip MVP_00_set_up_MVP -i . -m mvp_metadata.txt

After running MVP module 00, your 00_DATABASES directory should contain six subdirectories of each database MVP will use in other modules. Ensure all the databases are downloaded before continuing. If they are not all present, try rerunning module 00 above. Once all the databases are downloaded, you can proceed to the remaining modules.

# Run MVP module 01 (Running geNomad and CheckV)

mvip MVP_01_run_genomad_checkv -i . -m mvp_metadata.txt


# Run MVP module 02 (Filtering viral prediction)

mvip MVP_02_filter_genomad_checkv -i . -m mvp_metadata.txt


# Run MVP module 03 (Clustering)

mvip MVP_03_do_clustering -i . -m mvp_metadata.txt


# Run MVP module 04 (Read mapping)

mvip MVP_04_do_read_mapping -i . -m mvp_metadata.txt


# Run MVP module 05 (Creating vOTU tables)

mvip MVP_05_create_vOTU_table -i . -m mvp_metadata.txt


# Run MVP module 06 (Functional prediction)

mvip MVP_06_do_functional_annotation -i . -m mvp_metadata.txt


# Run MVP module 07 (Binning viral genomes)

mvip MVP_07_do_binning -i . -m mvp_metadata.txt


# Run MVP module 100 (Summarize output)

mvip MVP_100_summarize_outputs -i . -m mvp_metadata.txt

Within the MVP directory, MVP will produce subdirectories of the output per module. You can use these outputs individually or navigate to the 100_SUMMARIZED_OUTPUTS directory for summary outputs. These outputs can be used for downstream statistical analyses and visualizations in RStudio. In particular, the output from module 05 (ending with vOTU_RPKM_Table.tsv) can be used as a vOTU counts table for community ecology analyses.

Summary

In this tutorial, we ran a bioinformatic pipeline to mine publicly available RNA-Seq datasets for viral sequences. We downloaded the necessary programs and data, cleaned and normalized the data, performed de novo transcriptome assembly, and mapped the reads to the assembled transcriptomes to quantity viruses in each sample. This pipeline can be applied to other RNA-Seq datasets, and the output can be used to analyze and compare viral communities in different contexts, including diseases.

References

  1. Kosmopoulos, J.C., Klier, K.M., Langwig, M.V., Tran, P.Q., & Anantharaman, K. (2024). Viromes vs. mixed community metagenomes: choice of method dictates interpretation of viral community ecology. Microbiome, 12: 195. https://doi.org/10.1186/s40168-024-01905-x

  2. Varona, N.S., Wallace, B.A., Bosco-Santos, A., Mullinax, J., Stiffler, A.K., O’Beirne, M.D., Ford, J., Fulton, J.M., Werne, J.P., Gilhooly III, W.P., & Silveira, C.B. (2025). Microbiome, 13: 104. https://doi.org/10.1186/s40168-025-02085-y

  3. Beavers, K.M., Van Buren, E.W., Rossin, A.M., Emery, M.A., Veglia, A.J., Karrick, C.E., MacKnight, N.J., Dimos, B.A., Meiling, S.S., Smith, T.B., Apprill, A., Muller, E.M., Holstein, D.M., Correa, A.M.S., Brandt, M.E., & Mydlarz, L.D. (2023). Stony coral tissue loss disease induces transcriptional signatures of in situ degradation of dysfunctional Symbiodiniaceae. Nature Communications, 14: 2915. https://doi.org/10.1038/s41467-023-38612-4

  4. Coclet, C., Camargo, A.P., & Roux, S. (2024). MVP: a modular viromics pipeline to identify, filter, cluster, annotate, and bin viruses from metagenomes. mSystems, 9: e00888-24. https://doi.org/10.1128/msystems.00888-24