Which Coin Should I Flip? The Multi-Arm Bandit

February 4, 2025


Say that you are offered the opportunity to play the following game: there are two coins, possibly biased, and at each round you choose one of them to flip. The game lasts 10 rounds, and each time you flip Heads, you earn one dollar. How do you decide which coin to flip at each round? In this blog post, we will show how this problem can be approached and solved using the famous Multi-Arm Bandit framework. Using this particular application as an example, we will illustrate the key ideas involved, as well as discuss possible variations and how they affect the solution of the problem.

Let’s label the two coins A and B, and denote the probability of getting Heads for each of them by pA and pB, respectively. Notice that we are including the possibility that these probabilities could be 0 or 1 - in which case the coin has the same face on both sides. The main trade-off that you, as the decision maker, face is that at each round you must choose between two different strategies: you can either try to maximize your earnings by choosing the coin that has had the highest proportion of Heads in the previous flips - which we call the “exploitation” strategy - or you can try to obtain more information about pA and pB, so that you can make a better-informed decision - what we call the “exploration” strategy.

To better understand the dynamics of the algorithm, let us consider what you might want to do in a hypothetical play of this game. In the first round, you have no information about either coin, so, without loss of generality, you choose coin A. If the initial flip is Heads, you might want to choose coin A again in the next round, as the first flip was a positive signal - you at least know that pA > 0. But it could be the case that pA is very close to zero, and you just got lucky on the first flip. In that case, it would be sensible to choose coin B, so that you could obtain some information about the other option. Let’s say that you decide to go with the first option and flip coin A again, resulting in Heads. Now you are even more confident about choosing coin A rather than B - in other words, “exploiting” option A. But if the second flip is Tails, you are back to square one regarding coin A, and flipping coin B becomes more attractive - you decide to do some “exploration”.

One way of formalizing this problem is in terms of Bayesian Updating. We will omit some of the technical details, but those interested can read more about this in Casella and Berger, Statistical Inference, Section 7.2.3. Before you flip any coin, you have no information whatsoever about pA and pB. In mathematical terms, this is what we call an uninformative prior, which is a measure of your beliefs about each pi before the game starts. Since each pi can take any value in [0,1], we will assume that all values have the same prior probability, so that each pi has a uniform distribution on [0,1]. Bayesian Updating is the process by which agents use the information from each flip to update their beliefs about these probabilities. Notice that there are two levels of uncertainty here - the uncertainty about whether each coin flip will result in Heads or Tails, and the uncertainty about the probability of flipping Heads or Tails at each round. Let’s take a look at a few concrete examples to see what the updates might look like.
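
For this coin-flipping setup, the update has a convenient closed form: the uniform prior on pi is a Beta(1, 1) distribution, and after observing h Heads and t Tails the posterior is Beta(1 + h, 1 + t). Here is a minimal sketch in Python (the helper function is ours, purely for illustration):

```python
# Bayesian updating for one coin: a uniform Beta(1, 1) prior on p,
# updated on Bernoulli flips, yields a Beta(1 + h, 1 + t) posterior.
from scipy.stats import beta

def posterior_after(flips):
    """Return the Beta posterior parameters after a string of 'H'/'T' flips."""
    h = flips.count("H")
    t = flips.count("T")
    return 1 + h, 1 + t

a, b = posterior_after("HH")   # two Heads in a row, as in the scenario above
print(beta(a, b).mean())       # posterior mean of p: Beta(3, 1) gives 0.75
```

After two Heads, for instance, the posterior is Beta(3, 1), whose mean of 0.75 matches the intuition from the hypothetical scenario above: two Heads in a row make coin A look quite attractive.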

First, let’s consider the case in which all 10 flips result in Heads (the sequence HHHHHHHHHH). We assume an uninformative prior - that is, before any flips, the agent considers all values of the probability of Heads equally likely. After each flip that comes up Heads, the agent becomes more and more confident that the probability of Heads is close to 1. This can be seen in Figure 1, which starts with a flat line - the uninformative prior - but ends with a curve whose mass is highly concentrated near 1.

Figure 1: Probability Distributions for different realizations of Heads and Tails - All Heads

Now, let’s consider the case in which the realizations alternate between Heads and Tails (the sequence HTHTHTHTHT). In this case, after each pair of realizations, the agent becomes more convinced that the coin is fair, placing more and more weight on values of the probability of Heads around 0.5. At the same time, the belief that the coin is very biased - the probability of Heads being close to zero or one - goes down fast. This generates a familiar bell curve, as can be seen in Figure 2.

Figure 2: Probability Distributions for different realizations of Heads and Tails - Equal Number of Heads and Tails

Lastly, imagine that you see two streaks when flipping the coin - the first five realizations are Heads, and the following five are Tails (the sequence HHHHHTTTTT). As seen in Figure 3, you start similarly to the first case, with a high degree of belief that the probability of Heads is large. But after the first realization of Tails, your beliefs begin to rebalance, shifting weight back toward lower values. Once you have seen as many Tails as Heads, you end up with exactly the same posterior as in our second case, in which the realizations alternated perfectly. This showcases one important aspect of Bayesian Updating: the final estimate depends only on how many Heads and Tails were observed, not on the order in which they appeared.

Figure 3: Probability Distributions for different realizations of Heads and Tails - Two Streaks
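
A short sketch makes this order-independence concrete: updating flip by flip under the uniform prior, the alternating sequence of Figure 2 and the two-streak sequence of Figure 3 end at exactly the same posterior.

```python
# Flip-by-flip updating for the sequences behind Figures 2 and 3.
# Heads increments alpha, Tails increments beta; both sequences contain
# five of each, so both end at the same Beta(6, 6) posterior.
def update(a, b, flip):
    return (a + 1, b) if flip == "H" else (a, b + 1)

for seq in ["HTHTHTHTHT", "HHHHHTTTTT"]:
    a, b = 1, 1  # uniform Beta(1, 1) prior
    for flip in seq:
        a, b = update(a, b, flip)
    print(seq, "->", "Beta(%d, %d)" % (a, b))  # both print Beta(6, 6)
```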

To finalize the specification of the algorithm, we need to determine how you will use the information you have obtained to decide which coin to flip. For example, you might simply flip whichever coin you currently believe has the higher pi - say, the one with the higher posterior mean - choosing coin A as a tie-breaker. This would be a pure exploitation strategy.
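
As a sketch (representing beliefs as Beta parameters, consistent with the updating above, is our modeling choice), the pure exploitation rule might look like this:

```python
# Pure "exploitation": flip the coin with the higher posterior mean,
# breaking ties in favor of coin A.
def exploit(beliefs):
    """beliefs maps 'A'/'B' to (alpha, beta) posterior parameters."""
    mean = lambda ab: ab[0] / (ab[0] + ab[1])
    return "A" if mean(beliefs["A"]) >= mean(beliefs["B"]) else "B"

print(exploit({"A": (2, 1), "B": (1, 1)}))  # 'A': one Heads beats no information
```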

Let’s start with an example assuming that both coins are fair and that you use the “exploitation” strategy described above. Figure 4 illustrates the belief updating process for different sequences of realizations of the coin flips. In the first simulation, the first coin flip resulted in Tails, which lowers the posterior mean for coin A and makes flipping coin B the more attractive choice for the next round.

Figure 4: Updating under different realizations for the “Exploitation” rule
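
One simulated game of this kind (our own illustrative sketch, with both coins fair and the exploitation rule above) could look like this:

```python
# One simulated 10-round game with two fair coins under the
# exploitation rule: flip the coin with the higher posterior mean.
import random

random.seed(0)                          # illustrative seed
p_true = {"A": 0.5, "B": 0.5}           # both coins fair
beliefs = {"A": (1, 1), "B": (1, 1)}    # uniform Beta(1, 1) priors
mean = lambda ab: ab[0] / (ab[0] + ab[1])

for _ in range(10):
    coin = "A" if mean(beliefs["A"]) >= mean(beliefs["B"]) else "B"
    a, b = beliefs[coin]
    if random.random() < p_true[coin]:  # flip comes up Heads
        beliefs[coin] = (a + 1, b)
    else:                               # flip comes up Tails
        beliefs[coin] = (a, b + 1)

print(beliefs)  # final posterior parameters after 10 rounds
```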

Another possible strategy is to alternate: continue “exploiting” on odd rounds, but use an “exploration” strategy on even rounds by flipping the coin not flipped in the previous round. Figure 5 depicts the updating under this type of strategy. We can see that, unlike under the Exploitation rule, you never get “stuck” flipping the same coin for too long.

Figure 5: Updating under different realizations for the “Alternating” rule
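
A sketch of this alternating rule (the round numbering and helper names are ours):

```python
# "Alternating" rule: exploit on odd rounds (1-indexed); on even rounds,
# explore by flipping the coin that was not flipped in the previous round.
def alternating_choice(round_num, last_coin, beliefs):
    mean = lambda ab: ab[0] / (ab[0] + ab[1])
    if round_num % 2 == 1:  # odd round: exploit the higher posterior mean
        return "A" if mean(beliefs["A"]) >= mean(beliefs["B"]) else "B"
    return "B" if last_coin == "A" else "A"  # even round: switch coins
```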

You could also introduce some randomization into how you choose between the two strategies. For example, you might roll a die before each round: if it shows a 1, you exploit, and for any other number, you explore by switching coins. We can see these dynamics in Figure 6. As in Figure 5, you never spend too long flipping one coin without trying the other. However, unlike in Figure 5, you can spend more than one round without alternating.

Figure 6: Updating under different realizations for the “Random Alternating” rule
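
A sketch of the die-roll rule (again with our Beta-parameter representation of beliefs):

```python
# "Random Alternating" rule: roll a fair die before each round.
# A 1 (probability 1/6) means exploit; anything else (probability 5/6)
# means explore by switching to the coin not flipped last round.
import random

def die_roll_choice(last_coin, beliefs):
    mean = lambda ab: ab[0] / (ab[0] + ab[1])
    if random.randint(1, 6) == 1:  # exploit
        return "A" if mean(beliefs["A"]) >= mean(beliefs["B"]) else "B"
    return "B" if last_coin == "A" else "A"  # explore
```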

The decision rule might also be much more complicated. For example, it might involve a randomization strategy that depends on your current beliefs. A natural choice would be to exploit with a probability that reflects your degree of confidence, while always leaving some room for exploration. This reflects a key principle in statistics: observing more realizations of a random variable allows you to be more confident in statements about the data-generating process, but never allows you to be certain of them.
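
One well-known rule that follows this principle is Thompson sampling: draw a plausible value of pi from each coin’s current posterior and flip the coin with the larger draw. A coin you are confident is good wins most draws, yet an under-explored coin always retains some chance of being chosen. A minimal sketch, under the same Beta-parameter representation as above:

```python
# Thompson sampling sketch: sample a value of p from each coin's
# Beta posterior and flip the coin with the larger sampled value.
import random

def thompson_choice(beliefs):
    """beliefs maps 'A'/'B' to (alpha, beta) posterior parameters."""
    draws = {c: random.betavariate(a, b) for c, (a, b) in beliefs.items()}
    return max(draws, key=draws.get)
```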

In this blog post, we discussed a particular application of the Multi-Arm Bandit decision-making framework. It can be adapted to tackle much more complex problems, including many information sources with more general probability distributions, more intricate decision-making and updating rules, and even allowing agents to decide whether to stop playing at any round. The framework also has a large number of practical applications - for example, helping an institution such as a pharmaceutical company or a science foundation decide which projects to fund, taking into account how much risk and return each one brings.