Spooky Microbiomes and the Curse of Dimensionality

October 25, 2021

Microbiomes are all the rage today and this trendiness is clear with the Human Microbiome Market predicted to be valued at more than 1 billion dollars by the year 2027 (up from 376 million in 2019). 

With studies showing that our microbial community is associated with health outcomes, from regulating our brain chemistry and behaviors to managing our blood pressure, it is difficult to argue against keeping our little microbes happy. As an avid climber, I was particularly excited at a mice study linking the microbiome and grip strength through the development of skeletal muscle (very cool). 

The Microbiome

So what is a microbiome, and what has accelerated research in this field? 

Modern definitions view a microbiome as the microorganisms including bacteria, archaea, fungi, algae, and small protists, collectively known as a microbial community, with distinct properties and functions that result in the formation of ecological niches. For this reason, microbiome research often uses techniques found in community ecology in an attempt to describe how biotic and abiotic factors affect community assemblages, but in this case, the environments are often living bodies! 

Before the expansion of methods to cheaply and efficiently sequence DNA, microbes were manually cultured on plates to identify species and strains. This not only represented a small fraction of the community which can be grown on a plate but made it difficult to quantify the relative abundances of our microbes. Now that DNA sequencing costs have drastically dropped, it has become commonplace to profile entire microbial communities using microbial DNA.

But how do we even start to observe microbial communities and gather data for analysis?

Sequencing Data Generation

To identify the bacterial species in our samples (say, stool samples from people on different continents), and count their relative abundance, we need a way to use DNA markers that are found among all microbes of interest, and have enough variation in their features (the DNA string) to classify their “species”.

A popular method for profiling the members of microbial communities and their relative abundances are called 16S amplicon sequencing (biology jargon alert!). This method takes advantage of a universal gene in bacteria, the 16S gene, that encodes information to build ribosomes, the molecular machines that synthesize proteins (note: we only limit our scope to bacteria). This gene is an ideal “barcode” that can represent a bacterial “species” because it hits a key goldilocks condition needed for taxonomic assignment. It contains regions of DNA sequence that are always the same (which we can use to find which part of the genome we want to sequence) and regions that are highly variable (features that are different, and can be use to identify different microbes and assign the degree of relatedness among “species”).

The Species Problem

You may be wondering why I keep using the term bacterial “species” in quotation marks. This is one of the fundamental issues in microbiome studies because 

1) many species are unknown, and 

2) assigning species based on a short sequence of DNA in bacteria is a difficult task. 

To make matters worse, there can be many errors and biases in our sequence data from the way we prepare our samples, whether it’s from DNA extraction to DNA amplification, to errors made in the sequencing machine itself.

So what do we do with our potentially sketchy data? We are left with two options, either we throw away these unknowns due to the possibility of errors, or we find a way to determine meaningful unknowns that are (mostly) error-free. The vast majority of studies use the latter. 

There are two schools of thought in the generation of meaningful unknowns for our 16S sequence data of bacterial DNA. The first is Operational Taxon Unit (OTU) clustering, in which we cluster sequences that share a certain degree of similarity across the DNA string. The most commonly used metric is 95% or 97% similarity. The second is Amplicon Sequence Variant (ASV) detection, where we use our sequence data to model the probability that a given sequence comes from errors, and we can drop them from our dataset. In the big picture, OTU clustering blurs our data into a consensus sequence, while ASV detection tries to filter out errors and keep sequences with high statistical confidence of not being an error. 

One major deal breaker that favors ASV detection over OTU clustering is cross-study comparisons. When using OTU clustering by a % threshold, we create an imaginary DNA consensus sequence that may not be a real biological sequence that exists in nature. This makes it hard to compare OTUs generated by clustering across studies. In opposition, ASVs represent real biological sequences (barring false positives) that can be consistently compared across studies. For this reason, I prefer to use ASVs in my own research, as it tries to model sequence data in a more biologically consistent manner.

The Dataset

So what does our complete dataset look like after sequencing, and generating error-corrected counts of our ASVs? We can view this as a very large matrix called the ASV table. The rows have every ASV (all unique bacterial sequences in our samples), the columns have each of our samples (human stool samples), and the values are the count of each ASV found within a sample. In my head, I like to flip (or transpose) this matrix, to think about our human stool samples, as the rows, and the ASVs as our feature columns within each sample. 

A few important things to highlight about our dataset: 

  1. It is a sparse matrix - Many of the entries are zero, meaning a given sample does not contain any observations of a specific ASV.

  2. It is large P, small N - It is common to have say n= 100 samples, and have p = 10,000 features (ASVs). Each feature column can be thought of as a dimension in our dataset, so our ASV table is very high-dimensional.

Due to these quirks in our dataset, many common statistical methods do not work very well, and this problem is known as the curse of dimensionality

An important side note is that we need to maintain other important features for both the samples and ASVs that we can link back to our ASV count table. This may include features about our human stool samples, such as age, and country of origin, as well as features about our ASVs such as taxonomic assignments, and degree of phylogenetic relatedness between ASVs. To derive biologically meaningful results, taking into account the degree of relatedness between ASVs lets us normalize for the correlation between bacteria. 

To illustrate this, imagine a DNA string of length 250 only differs by two letters in different ASVs. Should we consider these separate features, or should we take into account that they likely perform similar functions?

Luckily for us, in R there is a package called phyloseq that contains a phyloseq object capable of containing all this data, their relationships, and methods to extract information/conduct statistical tests.

Statistical Analysis

So how do we conduct statistical analyses on this cacophony of data? 

The first question many researchers ask is “How diverse is each sample”? One simple way you can calculate diversity is by counting the number of unique ASVs in a given sample, a metric known as observed richness. This summarizes our 100,000 ASV feature columns, into a single column. Now that we have summarized the entire diversity for each sample into a single number, we can use this to run whatever statistical test you want, like making linear models of stool sample participant age ~ observed richness, or run an ANOVA of stool sample participant country of origin ~ observed richness. The world is your oyster! It should be noted that there are many more advanced ways to calculate sample diversity that includes the measures of evenness of counts. These methods may be adequate to answer questions such as “Do westernized countries have less diverse microbial communities?”, and “Does microbial diversity decrease with age?”.

For other questions such as “Are microbial communities more similar within different continents?”, a simple diversity metric will not suffice. So do we give up? Never! We need to jump through a few more hoops, but there are ways to tame this very high-dimensional ASV count table. A common first step is to convert our wide matrix into a simpler one by calculating sample-sample pairwise dissimilarities known as a dissimilarity matrix. In our human stool example, we would get a 100x100 symmetrical matrix where the rows and columns are our sample names, and the values are the dissimilarity value between the two samples. 

There are two common ways to compute the dissimilarity between two samples. The first is Bray-Curtis dissimilarity. It is calculated by taking the sum of the lower count for each ASV between two samples multiplied by two and dividing it by the sum of the total count for ASVs in sample one and sample two. Notice that if two samples have the same count the value is 0, which is exactly what we will have on the diagonal of our dissimilarity matrix comparing a sample with itself. The second dissimilarity metric is called UniFrac, which uses evolutionary relatedness (phylogenetic) distances to calculate dissimilarity. Weighted UniFrac combines this information with count data to calculate dissimilarity. 

With our dissimilarity matrices in hand, we can now test the statistical significance of ASV community groupings along with some factors, such as country of origin. Using permutation multivariate analysis of variance (PERMANOVA), we can test the null hypothesis that the centroids between our groups are the same. PERMANOVA compares within-group variation (distance of each sample to the centroid within its group) to between-group variation (distance of each group centroid to the centroid of the entire dataset) resulting in a pseudo-F-statistic. The wonderful thing about PERMANOVA is that it is non-parametric, or doesn’t rely on assumptions about the data’s distribution. Instead, we build our own distribution from the data itself! Briefly, we calculate F statistics through random permutations where we shuffle the group labels for each sample. We can then extract a p-value which is the proportion of F-statistics greater than or equal to what we get from our real dataset, over the total number of permutations we generated F-statistics for by random permutation!


If you made it to this point, congratulations! I hope you gained a little intuition into how data is generated and analyzed in microbiome experiments. Now go forth and remember to always avoid the curse of dimensionality… or jump right in like I do!