One of the opportunities and challenges of 'data science' is that it allows you to bring the methodological flexibility of scientific computing to bear on newly available, complex data sets from ubiquitous sensors and the Web. This is an opportunity because it means there are ways we can confirm old theories using ecologically valid observational data, as well as potentially prove new theories. It is a challenge because ecological data does not isolate the factors of interest and the techniques for deriving causal conclusions from it are notoriously tricky and often statistically weak.

A combination of the Rubin causal model and Pearl's graphical modeling techniques is ascendant in the social sciences now. These tools are powerful but depend on a generalized sense of independence called the Stable Unit Treatment Value Assumption (SUTVA). Among other things, this assumption enforces a clear demarcation of one's unit of analysis when performing causal inference. In a typical example, a 'unit' is a person and the 'treatment' is the assignment of medication. The applications of this theory can get much more complex.

I am running into the limits of this framework for causal inference, at least in its basic form. This is because I'm interested less in any individual person represented in my data set and more in the emergent topological properties of the social network they form as a whole. How can I think about the ways networks connect their members over time causally, when the dominant framework of causal reasoning requires the stable independence of units?

Note that I am not talking about peer and network effects that are mediated by a social network. While this is an emerging area of research (cf. van der Laan, 2012; Fafchamps, 2015), it is not exactly what I'm getting at. I'm talking about something that is more often discussed in the literature on complex networks, where the topology of the network is itself of interest.

One example is the preferential attachment process, which is commonly invoked to explain the scale-free degree distribution of social networks. Graphs generated with this mathematically defined process provably have emergent properties that match interesting networks that we find in the world. This opens up an interesting class of problems. Suppose we have observational data of a network topology changing over time, for example a network of research collaborators where a collaborative link has a specific duration. What is the best (most well-fitting and concise) process that explains a topology like the one observed?
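To make preferential attachment concrete, here is a minimal sketch in Python of one common variant, a Barabási-Albert-style growth process. It is an illustration under stated assumptions, not the specific model discussed at the working group: the function names, the star-shaped starting graph, and the parameters n_nodes and m_links are all choices made for this example. Each new node links to m_links existing nodes, chosen with probability proportional to their current degree.

```python
import random
from collections import Counter


def preferential_attachment_graph(n_nodes, m_links, seed=None):
    """Grow an undirected graph where new nodes prefer high-degree targets.

    Illustrative assumption: the first new node links to m_links isolated
    seed nodes, and every later node links to m_links distinct existing
    nodes sampled in proportion to their degree.
    """
    if m_links < 1 or n_nodes <= m_links:
        raise ValueError("need 1 <= m_links < n_nodes")
    rng = random.Random(seed)
    edges = []
    # Each node appears in `stubs` once per incident edge, so a uniform
    # draw from `stubs` is a degree-proportional draw over nodes.
    stubs = []
    targets = list(range(m_links))
    for new_node in range(m_links, n_nodes):
        for target in targets:
            edges.append((new_node, target))
            stubs.extend((new_node, target))
        # Pick the distinct targets for the next arriving node.
        chosen = set()
        while len(chosen) < m_links:
            chosen.add(rng.choice(stubs))
        targets = sorted(chosen)
    return edges


def degree_counts(edges):
    """Return a Counter mapping each node to its degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


edges = preferential_attachment_graph(n_nodes=2000, m_links=2, seed=0)
degree = degree_counts(edges)
ranked = sorted(degree.values())
# A heavy-tailed degree distribution shows up as a maximum degree far
# above the median degree.
print("median degree:", ranked[len(ranked) // 2])
print("max degree:", ranked[-1])
```

Note that the sketch only generates graphs from an assumed process. It says nothing about which process best explains an observed network, which is the harder question.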

This is hard to reason about causally because, naively, graph generation through a mathematical process is endogenous, meaning it is determined by the internal mechanics of the system. Since causal claims require the possibility of an exogenous intervention, it's unclear how to even think about what a causal claim would mean in this case.

The potential impossibility of this problem is no reason to be deterred. We are scientists, after all, and if we did not attempt the impossible our lives would be very dull. So we had a lively discussion about this problem at the last Social Computing working group, where I laid out some preliminary thoughts about how to move forward on this in a loosely Bayesian framework. Here are my slides.

If there is anyone out there interested in this topic, I hope they will contact me at sb at ischool dot berkeley dot edu to discuss it.


Sebastian Benthall

Sebastian Benthall is a PhD student at the School of Information. He is interested in collective intelligence in an open collaborative setting, with a focus on open software development. He is also interested in the foundations and limits of data science.