Design Your Observational Study with the Joint Variable Importance Plot
In my previous blog post, I walked through a gentle introduction to causal inference. This time, I’ll address how to design an observational study for causal inference. There are many scenarios where running an experiment is not ethical or feasible. For example, a researcher who wishes to study the effects of smoking on lung cancer cannot randomly assign people to smoke. Instead, the researcher must design an observational study that compares groups that are as similar as possible, so that methods from randomized experiments or trials can be applied.

Often, we make two groups similar by establishing balance, defined as similar averages of each variable in the treated and control groups. For example, the pre-adjustment average age might be 30.5 years for the treated group and 25.3 years for the control group, while the post-adjustment averages are 29.6 years and 29.3 years. In this case, age is balanced between the two groups after adjustment. While researchers often seek to balance all the variables available, not all variables matter equally to the intervention and the outcome. I discuss how to use the joint variable importance plot to visualize which variables to prioritize.
Confounders can explain away the causal effect
A confounding variable, or confounder, is a variable that is related to both the intervention and the outcome and can therefore explain away the causal relationship we appear to find. This is concerning because if a variable can explain away that relationship, the analysis is moot. Therefore, it is important to pay attention to confounders and adjust for them.
One example is age: if the treated group is older and sicker and the control group is younger and healthier, then age can potentially explain away the effect of the intervention. While it is important to adjust for age, there are often other variables to adjust for such as socioeconomic status, education levels, and gender.
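To make the age example concrete, here is a minimal simulation sketch in base R. Everything in it is hypothetical: the data are simulated, the true treatment effect is set to zero, and linear regression stands in for whatever adjustment strategy you prefer. It only illustrates how an unadjusted comparison can show an "effect" that is really age.

```r
set.seed(1)
n <- 5000

# Simulated confounder: age in years (hypothetical data)
age <- rnorm(n, mean = 40, sd = 10)

# Older people are more likely to receive the treatment
treat <- rbinom(n, size = 1, prob = plogis((age - 40) / 10))

# The outcome depends on age only; the true treatment effect is zero
outcome <- 0.5 * age + rnorm(n)

# Unadjusted comparison: difference in mean outcome between groups
naive <- mean(outcome[treat == 1]) - mean(outcome[treat == 0])

# Adjusted comparison: regression that includes age
adjusted <- unname(coef(lm(outcome ~ treat + age))["treat"])

cat("Unadjusted difference:", round(naive, 2), "\n")
cat("Age-adjusted estimate:", round(adjusted, 2), "\n")
```

The unadjusted difference is clearly away from zero even though the treatment does nothing, while the age-adjusted estimate sits close to zero.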
Identifying confounders with the Love plot
A common approach to identifying confounders is the Love plot, which displays the mean differences or standardized mean differences between the treated and control groups for every available variable in the data. Let’s consider a classic dataset for evaluating the effect of job training programs on earnings, where the intervention is selection into a job training program and the outcome is earnings in 1978 [1]. The variables include the following:
- treat – whether the person was selected to be in the National Supported Work Demonstration job-training program;
- age – age in years;
- educ – education in years;
- black – whether the person was Black/African American;
- hisp – whether the person was Hispanic;
- marr – whether the person was married;
- nodegree – whether the person had no degree;
- re74, re75, re78 – real earnings in 1974, 1975, and 1978, respectively.
After log-transforming the earnings, we can draw the Love plot of the absolute standardized mean difference between the treated and control groups. Figure 1 suggests that nodegree is the most important variable for adjustment. However, the Love plot shows only the treatment imbalance; it does not consider whether the variables are related to the outcome.
Figure 1: Love plot to visualize treatment imbalance through absolute standardized mean differences
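The quantity behind a Love plot can be sketched in a few lines of base R. This is an illustration on simulated data, not the code that produced Figure 1: the data frame `toy`, its columns, and the helper `abs_smd` are hypothetical, and the pooled standard deviation used here is one common convention for standardizing among several.

```r
# Absolute standardized mean difference for one covariate.
# Assumption: standardize by the pooled SD of the two groups.
abs_smd <- function(x, treat) {
  x_t <- x[treat == 1]
  x_c <- x[treat == 0]
  pooled_sd <- sqrt((var(x_t) + var(x_c)) / 2)
  abs(mean(x_t) - mean(x_c)) / pooled_sd
}

set.seed(2)
n <- 2000

# Hypothetical data: 'age' is imbalanced, 'educ' is balanced by construction
toy <- data.frame(treat = rbinom(n, 1, 0.5))
toy$age  <- rnorm(n, mean = 25 + 5 * toy$treat, sd = 7)
toy$educ <- rnorm(n, mean = 10, sd = 2)

covariates <- c("age", "educ")
smd <- sapply(toy[covariates], abs_smd, treat = toy$treat)
print(round(smd, 3))

# A bare-bones Love plot: one dot per covariate, sorted by imbalance
if (interactive()) {
  dotchart(sort(smd), xlab = "Absolute standardized mean difference")
}
```

Note that nothing in this calculation touches the outcome, which is exactly the limitation described above.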
Using the jointVIP to identify confounders
The joint variable importance plot (jointVIP) instead considers both the outcome and treatment dimensions of each covariate [2]. The x-axis shows the familiar absolute standardized mean difference, computed with a slightly different denominator for standardizing, to capture the treatment imbalance. The y-axis shows the outcome correlation, measured with the Pearson correlation.
The contour curves trace the bias, which helps in interpreting variable importance and allows us to compare variables that are far apart on the plot. For example, consider two points: one has higher treatment imbalance and lower outcome correlation, and the other has lower treatment imbalance and higher outcome correlation. At first glance the two may be difficult to compare; with the bias curves, they can be compared with ease.
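The comparison of two such points can be sketched in base R. This is not the jointVIP package's implementation. It rests on two assumptions of mine that the post does not spell out: that the "slightly different denominator" is the covariate's standard deviation in a separate pilot sample, and that the bias traced by the curves behaves like the product of the two axes. The data and all names (`pilot`, `analysis`, `x1`, `x2`) are simulated and hypothetical.

```r
set.seed(3)

# Hypothetical pilot sample, used only for outcome information
n_pilot <- 2000
pilot <- data.frame(x1 = rnorm(n_pilot), x2 = rnorm(n_pilot))
pilot$y <- 0.1 * pilot$x1 + 0.9 * pilot$x2 + rnorm(n_pilot)

# Hypothetical analysis sample with treated and control units
n <- 4000
analysis <- data.frame(treat = rbinom(n, 1, 0.5))
analysis$x1 <- rnorm(n, mean = 0.8 * analysis$treat)  # larger imbalance
analysis$x2 <- rnorm(n, mean = 0.4 * analysis$treat)  # smaller imbalance

covariates <- c("x1", "x2")
joint <- t(sapply(covariates, function(v) {
  mean_diff <- mean(analysis[[v]][analysis$treat == 1]) -
    mean(analysis[[v]][analysis$treat == 0])
  # ASSUMPTION: standardize by the pilot-sample standard deviation
  imbalance <- abs(mean_diff) / sd(pilot[[v]])
  # Pearson correlation with the outcome, from the pilot sample only
  outcome_cor <- abs(cor(pilot[[v]], pilot$y))
  # ASSUMPTION: bias summarized as the product of the two axes
  c(imbalance = imbalance, outcome_cor = outcome_cor,
    bias = imbalance * outcome_cor)
}))

print(round(joint, 3))
```

Here `x1` is the more imbalanced variable, so a Love plot would rank it first, yet `x2` carries the larger bias under these assumptions because it is far more strongly related to the outcome.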
In the jointVIP for the working example (Figure 2), the variables log_re74 and log_re75 (the log-transformed real earnings in 1974 and 1975) are highlighted as the most important confounders for adjustment. Notice that log_re74 is not given top importance when using the Love plot alone (Figure 1).
Figure 2: Joint variable importance plot to visualize both treatment imbalance and outcome correlation through absolute standardized mean differences and Pearson correlation
Plot jointVIP in R
This blog post is a gentle introduction to the jointVIP paper [2] and the accompanying software paper [3]. Those who wish to learn more are encouraged to read the package's "Get started" vignette for a coding walkthrough in R. An R Shiny application is also available for those who are unfamiliar with the language.
    library(jointVIP)

    # Names of the treatment, outcome, and covariate columns
    treatment = 'treat'
    outcome = 'log_re78'
    covariates = c('age', 'educ', 'black', 'hisp', 'marr', 'nodegree',
                   'log_re74', 'log_re75')

    # pilot_df and analysis_df must already exist as data frames
    new_jointVIP = create_jointVIP(treatment = treatment,
                                   outcome = outcome,
                                   covariates = covariates,
                                   pilot_df = pilot_df,
                                   analysis_df = analysis_df)

    # Draw the joint variable importance plot
    plot(new_jointVIP)

Example code for how to use the jointVIP package and plot a jointVIP object
Takeaways
When designing an observational study, it is important to identify the key confounders for adjustment. This blog post does not highlight any specific adjustment strategy because identifying those confounders matters for many of them (matching, weighting, and regression). While the Love plot is traditionally used, it considers only the treatment imbalance and can be misleading if knowledge about the outcome is not incorporated.
In practice, when designing with the jointVIP, it is important not to use outcome information from the analysis data to inform the design! We recommend using a separate sample containing only controls (a pilot sample) to inform the outcome relationship. For the example data, a previously conducted study was available [4]. Another way to obtain the pilot sample is to draw a random selection of control units.
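Drawing a random pilot sample of controls can be sketched in base R as follows. The data frame `full_df` is simulated and hypothetical, and the 20% pilot fraction is an arbitrary choice for illustration rather than a recommendation.

```r
set.seed(4)

# Hypothetical study data: treat is 1 for treated units, 0 for controls
full_df <- data.frame(id = 1:1000, treat = rbinom(1000, 1, 0.3))

# Only control units are eligible for the pilot sample
control_rows <- which(full_df$treat == 0)

# ASSUMPTION: set aside 20% of the controls (the fraction is arbitrary)
pilot_rows <- sample(control_rows, size = floor(0.2 * length(control_rows)))

# The pilot sample informs the outcome relationship; the remaining
# units form the analysis sample used for the actual study
pilot_df <- full_df[pilot_rows, ]
analysis_df <- full_df[-pilot_rows, ]
```

Because the pilot units are removed from the analysis sample, outcome information used during design never comes from units that enter the final analysis.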
References
1. LaLonde, Robert J. "Evaluating the econometric evaluations of training programs with experimental data." The American Economic Review (1986): 604-620.
2. Liao, Lauren D., Yeyi Zhu, Amanda L. Ngo, Rana F. Chehab, and Samuel D. Pimentel. "Prioritizing Variables for Observational Study Design using the Joint Variable Importance Plot." The American Statistician (2024): 1-9.
3. Liao, Lauren D., and Samuel D. Pimentel. "jointVIP: Prioritizing variables in observational study design with joint variable importance plot in R." arXiv preprint arXiv:2302.10367 (2023).
4. Dehejia, Rajeev H., and Sadek Wahba. "Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs." Journal of the American Statistical Association 94, no. 448 (1999): 1053-1062.