A brief primer on Hidden Markov Models

April 25, 2022

For many data science problems, there is a need to estimate unknown information from a sequence of observed events. You may want to know, for instance, whether a person is angry or happy, given a sequence of brain scans taken while playing a video game. Or you may be digitizing an ancient text, but, due to water damage, can’t tell what one word in the sequence says. Or in my case (I’m a wildlife biologist), you may want to infer whether an animal is sleeping or eating at any given moment using a sequence of animal GPS locations. 

Now, there are many ways to tackle these types of sequential problems. In the data science world, there is a tendency to reach for machine learning approaches that search for relationships in the data, such as Long Short-Term Memory (LSTM) networks, a class of recurrent neural network (RNN). But in many cases, we don’t have enough data, or the sequences are too long, to train RNNs effectively. In such cases, simpler is better. 

Enter the Hidden Markov Model (HMM). HMMs are a robust, lightweight approach built on basic probability theory: they describe a sequence with a small set of probability distributions and pick the explanation that maximizes probability. To understand HMMs, we first need to talk about the Markov chain.

Imagine we want to predict whether tomorrow’s weather will be rainy or sunny. For this toy example, there are two possible “states” of weather: rain and sun. In probability theory, a Markov chain says that the probability of the future state depends only on the current state.

Thus, we’ll assume that tomorrow’s weather (rainy or sunny) ONLY depends on the weather today. In order to predict tomorrow’s weather, we need to know the probability of each possible transition:  

  1. Rain -> Sun

  2. Rain -> Rain 

  3. Sun -> Rain

  4. Sun -> Sun

To get these probabilities, we collect a sequence of weather data and calculate, from the observed counts, the probability of transitioning between each pair of states. These are called the Transition Probabilities.
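To make that counting step concrete, here’s a minimal sketch in Python; the ten-day weather sequence is made up purely for illustration:

```python
from collections import Counter

# A made-up sequence of daily weather observations
weather = ["sun", "sun", "rain", "rain", "rain",
           "sun", "rain", "rain", "sun", "rain"]

# Count each consecutive (today -> tomorrow) pair
pair_counts = Counter(zip(weather, weather[1:]))
state_counts = Counter(weather[:-1])  # how often each state appears as "today"

# Transition probability = count(today -> tomorrow) / count(today)
transition = {
    (today, tomorrow): pair_counts[(today, tomorrow)] / state_counts[today]
    for today in ("sun", "rain")
    for tomorrow in ("sun", "rain")
}

# Predict tomorrow: pick the most probable transition out of today's state
today = weather[-1]
prediction = max(("sun", "rain"), key=lambda s: transition[(today, s)])
```

With this toy sequence, rainy days are followed by rain 60% of the time, so after a rainy day the chain predicts more rain.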

And voila! Some simple math to decide the most probable weather for tomorrow based on today’s weather. This is, in essence, what a Markov chain does. There are many fun/bizarre examples of using Markov chains, such as training text-generation applications to write like George R.R. Martin, of Game of Thrones acclaim. 

But there is one problem. Sometimes, we can’t actually observe the state to make our prediction. For our toy example, imagine you are trying to predict the weather, but now you are locked in a windowless cafeteria. You can only predict the weather based on observing what people in the cafeteria are eating. 

Or skip ahead to a real-world application. I’m an ecologist studying how coyotes adapt to changing landscapes. If I simply follow a coyote around, it will know I’m there and change its behavior. So I need to classify the coyote’s behavioral state (sleeping or eating) without actually seeing the animal! As a result, ecologists often use GPS collars to remotely gather information on an animal’s location, the distance or angle between points, and movement speed. This kind of dataset can be used to infer what the animal is doing -- its behavioral state -- at any given moment.  

Now let’s talk about Hidden Markov Models. These models find the probability of a hidden (or “latent”) state given a sequence of observed data. In other words, an HMM is a Markov chain over states we never observe directly; all we see is the data each state emits.

Okay, let’s jump back to our toy example of weather prediction. Your observed dataset is no longer the weather itself; now it’s just whether a person in the windowless cafeteria is eating ice cream or hot soup. 

Let’s say that when it’s sunny outside, there is a 60% chance a person eats ice cream and a 40% chance they eat hot soup. Similarly, when it’s raining, there is an 80% chance they eat hot soup and a 20% chance they eat ice cream. These probabilities of each observation given a hidden state are called the Emission Probabilities.
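In code, emission probabilities are just a lookup table. Note that an HMM’s emission probabilities condition on the hidden state -- P(food | weather) -- so this sketch writes the toy numbers in that direction:

```python
# Emission probabilities, P(observation | hidden state):
# each row conditions on one hidden weather state and sums to 1.
emission = {
    "sun":  {"ice cream": 0.6, "hot soup": 0.4},
    "rain": {"ice cream": 0.2, "hot soup": 0.8},
}
```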

By combining the transition probabilities (how likely a given state is to continue or change) with the emission probabilities (how likely each observation is under each hidden state), we can estimate the most likely hidden state at each step, and the most likely state sequence overall. And this, in essence, defines the HMM. 

To assign the most likely weather sequence for the week, given what people ate in the cafeteria, you would find the state with the highest combined transition and emission probability at each step, assign that state, and move to the next observation in the sequence.

While there are many things you can do with an HMM, most of the time we want to infer what the hidden states were, given the sequence of observed data. For that, we often use the Viterbi algorithm, a dynamic-programming method that computes the most likely sequence of hidden states given our observations.
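Here’s a compact sketch of Viterbi decoding for the cafeteria example; the starting, transition, and emission probabilities are toy numbers assumed purely for illustration:

```python
states = ("sun", "rain")
observations = ["ice cream", "ice cream", "hot soup", "hot soup"]

# Toy parameters (assumed for illustration)
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.7, "rain": 0.3},
    "rain": {"sun": 0.4, "rain": 0.6},
}
emission = {
    "sun":  {"ice cream": 0.6, "hot soup": 0.4},
    "rain": {"ice cream": 0.2, "hot soup": 0.8},
}

def viterbi(obs):
    # prob[s] = probability of the best state path ending in state s
    prob = {s: start[s] * emission[s][obs[0]] for s in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_paths = {}, {}
        for s in states:
            # Best previous state to transition into s from
            best_prev = max(states, key=lambda p: prob[p] * transition[p][s])
            new_prob[s] = prob[best_prev] * transition[best_prev][s] * emission[s][o]
            new_paths[s] = paths[best_prev] + [s]
        prob, paths = new_prob, new_paths
    best_last = max(states, key=lambda s: prob[s])
    return paths[best_last]
```

At each step the algorithm keeps, for every state, only the single best path ending there, so it never has to enumerate all possible state sequences. With these numbers, two ice cream days followed by two hot soup days decode to sun, sun, rain, rain.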

Jump back to my case, where I’m remotely assigning behavioral states (e.g., sleeping, eating, running) to an animal’s movement pattern. If the animal was running at one time step, it is most likely still running at the next; if it was sleeping, it likely stayed asleep. These are the transition probabilities. Similarly, if the animal moved far and fast, it was probably running; if it turned around a lot, moved slowly, and didn’t go far, it was probably sleeping. These are the emission probabilities. Together, these probabilities inform our HMM and the most probable movement sequence for the animal.

While I don’t cover them in this primer, there are standard algorithms for the other core HMM tasks: the Forward algorithm evaluates the probability of an observed sequence under a model, the Forward-Backward algorithm computes the probability of each hidden state at each step, and the Baum-Welch algorithm learns the model parameters from data. Choosing sensible initial model parameters is also key to how you decode and assign your state sequence. To learn more about this and HMMs, take a look at training HMMs.
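For a taste of the evaluation task, here’s a sketch of the forward pass, which sums over every possible hidden path to get the total probability of an observed sequence (again with toy numbers assumed for illustration):

```python
states = ("sun", "rain")

# Toy parameters (assumed for illustration)
start = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun":  {"sun": 0.7, "rain": 0.3},
    "rain": {"sun": 0.4, "rain": 0.6},
}
emission = {
    "sun":  {"ice cream": 0.6, "hot soup": 0.4},
    "rain": {"ice cream": 0.2, "hot soup": 0.8},
}

def forward(obs):
    # alpha[s] = P(observations so far, and current state = s)
    alpha = {s: start[s] * emission[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[p] * transition[p][s] for p in states) * emission[s][o]
            for s in states
        }
    # Total probability of the sequence: sum over possible final states
    return sum(alpha.values())
```

Unlike Viterbi, which keeps only the single best path, the forward pass sums over all paths, which is what you want when scoring or comparing models.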


Overall, I’ve found HMMs to be an approachable, versatile, easy-to-implement class of model with many interesting and diverse applications in pattern recognition, including computer vision, speech recognition, bioinformatics, animal movement, and more. I hope you found this primer on HMMs useful. I definitely encourage you to learn more -- happy state decoding!