Introduction to Propensity Score Matching with MatchIt

April 1, 2024

Why Matching?

Identifying and explaining cause-and-effect relationships is valuable for data scientists in a wide array of disciplines, from medical research to social science to public policy. But causality is often difficult to establish firmly: how do we know that a particular experimental intervention or policy program is having its intended impact? One of the biggest challenges is our inability to observe counterfactuals. If a physician gives a patient a particular drug, for example, they cannot observe what would have happened had they not given the patient that drug. It is impossible to observe the same individual in both the “treated” condition and the “untreated” condition at the same time. (Throughout this post, I will use the term treatment to describe observations that have been affected by a particular intervention, and the term control to describe observations that have not.)

As a result, researchers must come up with workarounds: alternative strategies for trying to establish causality. While the full range of causal inference methods is beyond the scope of this blog post, propensity score matching is a relatively easy-to-implement statistical procedure. Given that the fundamental challenge of causal inference is the need to see what would happen under both “treatment” and “control” conditions, one strategy is to find two different groups, one that receives the treatment (e.g. a patient who receives an experimental drug, or a population that is affected by a new policy) and one that does not, that are otherwise as similar to each other as possible. If we can identify two groups of people that are extremely similar, then we can argue that any differences in their outcomes are the result of the treatment. To create this balanced comparison, we need a way to determine which observations from our “control” population are most similar to those in our “treated” population. Matching gives us exactly that.

Example Using `MatchIt` in R

MatchIt is an R package that provides an array of options for matching observations and returns matched datasets as output.

Step 1: Load MatchIt and explore the dataset

First, we load the MatchIt package into our R session. We will also load cobalt, a complementary package for assessing balance in matched datasets. For this example, we will be working with the sample dataset lalonde from the MatchIt package, which contains a “treatment” group from the National Supported Work Demonstration (NSW), a program from the mid-1970s designed to help participants enter the labor market. In addition to observations of individuals that participated in the NSW demonstration, this dataset also includes a “control” group from the Panel Study of Income Dynamics (PSID), a long-running longitudinal survey intended to provide a representative sample of the U.S. population over time.

The goal of this exercise will be to use the shared characteristics of these two datasets to create a balanced comparison of treatment observations and control observations in order to answer the question: Did participation in the NSW demonstration program result in higher incomes?

> library(MatchIt)

> library(cobalt)

> View(lalonde)

Dataframe with rows "NSW" 1-10 and "PSID" 1-10, and with columns "treat", "age", "educ", "race", "married", "nodegree", "re74", "re75", and "re78"

In the sample of the dataset above, we can see several important pieces of information. The first 10 rows in this sample are labeled “NSW”, indicating that they originate from the NSW dataset; the last 10 are labeled “PSID”, indicating that they originate from the PSID dataset. All of the NSW observations receive a value of 1 for the treat column, because the NSW dataset is our treatment group, and all of the PSID observations receive a value of 0 for the treat column because the PSID dataset is our control group.

We also observe several other pieces of information: the age of the individual (age), the number of years of education they have received (educ), their race (race), marital status (married), and whether they lack a high school degree (nodegree, where 1 means no degree). The final three fields, re74, re75, and re78, represent the individual's income (real earnings) in 1974, 1975, and 1978 respectively. We can see that the NSW participants shown here had no income in 1974 or 1975, but many of them did by 1978!

In total, we have 429 observations in our control group and 185 observations in our treatment group. 

> table(lalonde$treat)

  0   1

429 185

Let’s say we’re interested in determining whether participation in the NSW program significantly increased the incomes of individuals that participated in the program. We can use a t-test to see whether the treatment group has a higher or lower income in 1978 (re78) than the control group: 

> t.test(re78 ~ treat, data = lalonde)

mean in group 0 mean in group 1

6984.170    6349.144

This means that the average 1978 income of individuals participating in the program (group 1) was about $635 lower than that of individuals that did not participate in the program (group 0). This is a bad sign for our original hypothesis that participating in the NSW program results in higher incomes!

The problem with this approach is that the NSW and PSID are two entirely different datasets representing entirely different populations. How do we know that these two groups are comparable with one another? What if the individuals in the PSID are older or have more years of education, and are simply more likely to have higher incomes than the participants in the NSW demonstration group? This is where matching can provide us with a more balanced comparison between the two groups.
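One quick way to check this concern is to compare the average covariate values of the two groups before any matching. The sketch below is only illustrative: it uses base R's aggregate() on the lalonde data, the object name covariate_means is my own choice, and it assumes the MatchIt copy of lalonde, in which married and nodegree are coded 0/1.

```r
library(MatchIt)

# Load MatchIt's copy of the data explicitly
# (cobalt ships a dataset with the same name)
data("lalonde", package = "MatchIt")

# Mean of each numeric covariate by treatment status
# (0 = PSID control group, 1 = NSW treatment group)
covariate_means <- aggregate(
  cbind(age, educ, married, nodegree, re74, re75) ~ treat,
  data = lalonde,
  FUN = mean
)

print(covariate_means)
```

Large gaps between the two rows of this table suggest that the raw comparison of 1978 incomes mixes the effect of the program with pre-existing differences between the groups.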

Step 2: Matching

Exact Matching. One strategy to resolve this issue is to construct a new dataset consisting of a subset of individuals from both groups that have exactly the same features, a procedure known as “exact matching”. For example, if we wanted to find an exact match for the first observation in our dataset (an individual that participated in the NSW and was 37 years old, Black, married, and had 11 years of schooling), we would want to find an individual in the control population with those exact same characteristics in terms of age, race, marital status, and educational attainment.

We can perform this operation using the MatchIt package. First, we specify a formula that has the variable indicating the individual’s treatment status (treat) on the left-hand side and all of the variables we want to match on (in this case age, years of education, race, marital status, and degree status) on the right-hand side. This is our matching formula. Since we are interested in obtaining an exact match on all of these characteristics, we then add the argument exact and provide a one-sided formula containing the same variables.

> exact_match <- matchit(treat ~ age + educ + race + married + nodegree,

>                        exact = ~ age + educ + race + married + nodegree,

>                        data = lalonde)

We can take a look at a summary of our new object by calling its name:

> exact_match

A matchit object

 - method: 1:1 nearest neighbor matching without replacement

 - distance: Propensity score

- estimated with logistic regression

 - number of obs.: 614 (original), 90 (matched)

 - target estimand: ATT

 - covariates: age, educ, race, married

This tells us that we have produced a “matchit” object based on “1:1 nearest neighbor matching without replacement”. This means that each treatment observation retained in the final matched dataset was matched with exactly one control observation. However, there is a problem: the summary also reveals that while we started with 614 observations, only 90 observations were matched! This means that our final dataset consists of only 45 individuals from the treatment group and 45 individuals from the control group. Using exact matching means we lose a lot of information, because oftentimes we are not able to find two individuals that are exactly identical between our treatment and control datasets. We need a different strategy!
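To see why exact matching discards so many observations, it can help to mimic the idea by hand. The following is a minimal base R sketch on a made-up toy data frame (the values, the object names toy and exact_pairs, and the choice of only two matching variables are all assumptions for illustration, not part of the lalonde analysis). It joins treated and control units on identical covariate values using merge().

```r
# Toy data: three treated units and five control units (hypothetical values)
toy <- data.frame(
  id      = 1:8,
  treat   = c(1, 1, 1, 0, 0, 0, 0, 0),
  age     = c(37, 22, 30, 37, 22, 45, 30, 30),
  married = c(1, 0, 0, 1, 1, 0, 0, 0)
)

treated <- toy[toy$treat == 1, ]
control <- toy[toy$treat == 0, ]

# Keep only treated/control combinations that agree exactly on every covariate
exact_pairs <- merge(
  treated, control,
  by = c("age", "married"),
  suffixes = c("_t", "_c")
)

print(exact_pairs)

# Treated units with no identical control are dropped entirely
unmatched_treated <- setdiff(treated$id, exact_pairs$id_t)
print(unmatched_treated)
```

In this toy example the 22-year-old unmarried treated unit has no identical control and drops out, which is the same mechanism that shrinks the lalonde sample when every covariate must agree.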

Propensity Score Matching. What if our control dataset contains an individual that is almost the same as the first observation in our treatment dataset (someone who is Black, married, has 11 years of schooling, and doesn’t have a degree) but they are 38 years old instead of 37? Isn’t that close enough for a match? This is where propensity score matching (PSM) can be an incredibly useful tool. PSM does not require us to find individuals that are exactly the same between our treatment and control datasets. Instead, it uses the matching criteria we provide to calculate a “propensity score”: a single numeric score that estimates how likely it is that an observation with particular characteristics is in the treatment group instead of the control group.

By default, this score is calculated via a logistic regression that predicts the probability that an individual is in the treatment group based on their individual characteristics. The MatchIt package does this calculation for us automatically, constructing a new matched dataset composed of each treatment observation and the control observation that has the closest propensity score.
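For intuition, the logistic regression step can be sketched by hand with base R's glm(). This is an illustration under assumptions rather than a description of MatchIt's internals: the names ps_model and manual_ps are my own, and the scores should only be expected to agree closely with those that matchit() stores in the distance component of its result.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Logistic regression of treatment status on the matching covariates
ps_model <- glm(
  treat ~ age + educ + race + married + nodegree,
  data = lalonde,
  family = binomial(link = "logit")
)

# Fitted probabilities: the estimated chance of being in the treatment group
manual_ps <- fitted(ps_model)

# Compare the distribution of scores in each group
summary(manual_ps[lalonde$treat == 1])
summary(manual_ps[lalonde$treat == 0])
```

Treated individuals tend to receive higher scores than controls, and matching then looks for controls whose scores are close to those of the treated individuals.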

> ps_match <- matchit(treat ~ age + educ + race + married + nodegree,

>                     data = lalonde)

> ps_match

A matchit object

 - method: 1:1 nearest neighbor matching without replacement

 - distance: Propensity score

- estimated with logistic regression

 - number of obs.: 614 (original), 370 (matched)

 - target estimand: ATT

 - covariates: age, educ, race, married

This object has a much larger matched dataset! Whereas exact matching only resulted in a dataset with 90 observations, the ps_match object contains 370 observations. In other words, it contains every single one of the original 185 treatment observations, and the 185 control observations whose propensity scores most closely resemble those of the treatment observations!
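The phrase “1:1 nearest neighbor matching without replacement” can be made concrete with a small greedy loop. The sketch below is a simplified illustration in base R on made-up propensity scores (the values and the names treated_ps, control_ps, and matches are hypothetical). It processes treated units from the highest score to the lowest, which is an assumption about ordering made for this example, and it is not MatchIt's actual implementation.

```r
# Hypothetical propensity scores for three treated and five control units
treated_ps <- c(t1 = 0.62, t2 = 0.35, t3 = 0.80)
control_ps <- c(c1 = 0.10, c2 = 0.33, c3 = 0.60, c4 = 0.58, c5 = 0.75)

# Track which controls are still unused (matching is without replacement)
available <- rep(TRUE, length(control_ps))
matches <- setNames(character(length(treated_ps)), names(treated_ps))

# Visit treated units in descending order of propensity score
for (unit in names(sort(treated_ps, decreasing = TRUE))) {
  gaps <- abs(control_ps - treated_ps[[unit]])
  gaps[!available] <- Inf          # already-used controls cannot be picked again
  best <- which.min(gaps)          # closest remaining control
  matches[[unit]] <- names(control_ps)[best]
  available[best] <- FALSE
}

print(matches)
```

Each treated unit ends up with exactly one control, and controls that are never the closest remaining option (here c1 and c4) are left out of the matched dataset.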

Step 3: Run diagnostics for matching balance

Now that we’ve created our matches, we need to determine whether we have actually been successful in creating balance between our treatment and control groups! We can do this using a diagnostic tool known as a “balance plot”, which examines the average difference between our treatment and control groups for each matching variable. We can generate a nice-looking balance plot using the love.plot function from the cobalt package. Let’s start by looking at the balance plot for our exact match:

> love.plot(exact_match, drop.distance = TRUE)

Graph titled "Covariate Balance"

In this plot, the first row provides an overall diagnostic balance measure called “distance”, and each subsequent row represents a single variable (or each variable category, in the case of the race variable). The red dot indicates how large the difference was between the treatment and control groups before we matched our data. Judging by where these red dots fall, people in the treatment group were quite a bit younger, more likely to be Black, less likely to be white, and less likely to be married than people in the control group. The blue dots, on the other hand, tell us that once we found observations that exactly matched between the two datasets, the difference between the treatment and control groups was exactly zero. Our goal is always to get those blue dots as close to zero as possible!

Now, let’s take a look at the balance plot for our propensity score matching object:

> love.plot(ps_match)

A second graph titled "Covariate Balance", with points more widely scattered than the first.

It looks like propensity score matching has not reduced the differences between the control group and treatment group all the way to zero. Instead, the blue dots for most variables have moved closer to zero, with the exception of the Hispanic category, which has moved slightly further away. Overall, it looks like our balance has improved quite a bit. Ultimately, we have to decide whether we think that our matched dataset is balanced enough. For this exercise, let’s assume that it is!
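The quantity behind these dots is typically a standardized mean difference for continuous covariates. As a rough sketch of how such a number is computed, the base R function below divides the difference in group means by the standard deviation in the treated group, which is one common convention when the estimand is the ATT. The function name smd_treated_sd and the toy ages are hypothetical, and cobalt's own defaults may differ in detail.

```r
# Standardized mean difference, scaled by the treated group's standard deviation
smd_treated_sd <- function(x, treat) {
  treated <- x[treat == 1]
  control <- x[treat == 0]
  (mean(treated) - mean(control)) / sd(treated)
}

# Toy ages: three treated units followed by five control units (made-up values)
age   <- c(30, 34, 38, 31, 35, 37, 50, 58)
treat <- c(1, 1, 1, 0, 0, 0, 0, 0)

# Suppose matching kept the three controls closest in age and dropped the rest
keep <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)

smd_before <- smd_treated_sd(age, treat)
smd_after  <- smd_treated_sd(age[keep], treat[keep])

print(c(before = smd_before, after = smd_after))
```

A value that moves toward zero after matching corresponds to a blue dot sitting closer to the center line than its red counterpart.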

Step 4: Create balanced dataset

Now that we have our match, let’s convert it into a dataset that we can analyze! We can use the match.data function to turn ps_match into a dataset, which we’ll call matched:

> matched <- match.data(ps_match)

> View(matched)

5 rows of the same dataframe as before, with columns "treat", "age", "educ", "race", "married", "nodegree", "re74", "re75", and "re78", as well as 3 new columns "distance", "weights", and "subclass"

This produces a new dataset that has the same variables as our original data, plus a few important new ones: distance gives us the propensity score associated with each observation, weights gives the matching weight assigned to each observation, and subclass tells us which observations were matched with each other. The first two observations were matched together in subclass 1, and the next two were matched together in subclass 2. We can see that these matched observations are similar in some ways and different in others. The first two observations are similar in age, level of education, marital status, and degree status. However, one is identified as Black and the other as Hispanic. On the other hand, the second pair of observations matches perfectly along the characteristics that we specified! Both are 33 years old, have 12 years of education, are white, are married, and don’t have a degree.
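To inspect matched pairs side by side, one option is to sort the matched data by subclass. The sketch below rebuilds the match so that it runs on its own. The object name pairs_view and the choice of displayed columns are illustrative assumptions.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Re-create the propensity score match and extract the matched data
ps_match <- matchit(treat ~ age + educ + race + married + nodegree,
                    data = lalonde)
matched <- match.data(ps_match)

# Sort so that each pair sits together, treated unit first
pairs_view <- matched[order(matched$subclass, -matched$treat),
                      c("subclass", "treat", "age", "educ", "race",
                        "married", "nodegree", "distance")]
head(pairs_view, 4)

# With 1:1 matching, every subclass should contain exactly two observations
table(table(matched$subclass))
```

Reading down the distance column within a pair shows how close the two propensity scores are, even when individual covariates such as race differ.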

Step 5: Measure treatment effect

Now that we have our balanced dataset, we can see whether there is a difference between people that participated in the NSW program and people that didn’t. Let’s run another t-test on the difference in 1978 income (re78), this time using our “matched” dataset:

> t.test(re78 ~ treat, data = matched)

mean in group 0 mean in group 1

5852.884    6349.144

Now, it looks like average income in the treatment group is about $496 higher than in the matched control group! In other words, once we have accounted for the differences between our treatment and control populations by running propensity score matching to make those populations more similar, it appears that participating in the NSW program did result in higher incomes! There are many opportunities to further improve the quality of these matches as well, making propensity score matching a valuable and flexible tool for the analysis of causal effects.
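As one possible alternative to the t-test, the same comparison can be expressed as a weighted regression on the matched data, which carries over to matching schemes where the weights are not all equal. This is a sketch under assumptions rather than a recommended final analysis: the names fit, att_estimate, and diff_in_means are my own, and the ordinary standard errors from lm() do not account for the pairing.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Re-create the match so the example runs on its own
ps_match <- matchit(treat ~ age + educ + race + married + nodegree,
                    data = lalonde)
matched <- match.data(ps_match)

# Weighted regression of 1978 income on treatment status,
# using the matching weights returned by match.data()
fit <- lm(re78 ~ treat, data = matched, weights = weights)
att_estimate <- coef(fit)[["treat"]]

# For 1:1 matching this should coincide with the simple difference in group means
diff_in_means <- mean(matched$re78[matched$treat == 1]) -
  mean(matched$re78[matched$treat == 0])

print(c(regression = att_estimate, difference = diff_in_means))
```

The coefficient on treat is the estimated difference in 1978 income between matched participants and non-participants.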


  1. Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8), 1-28.