Before COVID-19, becoming a data scientist was never on my radar. As a policy analyst for the California Research Bureau, a legislative research and reference section of the California State Library, I’ve worked on a variety of projects and requests. For the last 8 years, my work has focused on producing timely, confidential, concise summaries of bill ideas and trending topics drawing from academic, governmental, news media, and policy sources.
California legislative staff are in overdrive from January through September in a herculean effort that starts with putting bills “across the desk” and ends with passage by the Legislature and approval by the Governor. COVID-19 changed that: it slowed the normally frantic pace of legislative activity, reduced bill packages, and created an opportunity for a legislative staff member to pose a rather provocative question in early 2021 that I knew couldn’t be answered with the usual qualitative tools and methods.
To correctly answer the question, I’d have to become a data scientist.
Trained in Ethnic Studies, American Studies, and Anthropology, I was happily, irrevocably, and resolutely qualitative; that is, until I had my first experience with a Jupyter Notebook in Python Fundamentals 1-4, taught regularly at UC Berkeley’s D-Lab by Renata Barreto. I was introduced to D-Lab in early 2020 through working with a group of Cal-in-Sac students on COVID-19 related projects. Learning for the first time in my professional career that it was okay not to know (IOKN2K) was, and continues to be, revelatory.
In this blog post, I will take you briefly through my data science journey: the research question, acquiring the data, “seeing it” with visualizations, and how Python rescued me from a lifetime losing streak with statistical analysis software tools.
A PROVOCATIVE QUESTION
“The client would like a data set (spreadsheet) of the last 30 years of CA state legislative sessions, to include the ratio of bill introduction to bill passage for all legislators. He is interested in comparing the passage ratios of urban versus rural district legislators. He is open to how to define urban and rural – we discussed rural definitions illustrated here: https://www.ers.usda.gov/webdocs/DataFiles/53180/25559_CA.pdf?v=0 and also using districts in the California Legislative Rural Caucus.”
To answer this question, I knew I would have to use the legislative session data that is publicly available via the California Legislative Information website. The data would require a lot of cleaning and need to be paired with data from different sources to be of true value. Thankfully, I’d used it before, but mostly for fun. It was a secret I had shared with only a few co-workers: I’d been quietly interested in legislative data and visualizing legislative activity for a while on the side, generating Sankey diagrams of which California codes were tinkered with the most for a single session (the answer: Government Code), and making heatmaps to check out how many bills had been chaptered for the last 10 sessions.
To be sure, there were challenges ahead: we’re non-partisan at the Research Bureau so I’d have to stick closely to existing facts and evidence, word choice would be critical to minimize any potential for misinterpretation, definitions of “rural” vary so we’d have to use the bipartisan bicameral self-selected caucus as a proxy, and there wasn’t much time in my already busy schedule, so I’d have to contribute my nights and weekends to make it work. The legislative staff member, thankfully, was willing to take the data science leap with me and open to whatever outcomes would be observable from such a large corpus of historical legislative data, once assembled, analyzed, explored, and visualized.
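For readers curious what the requested ratio looks like in code, here is a minimal sketch using pandas. It is not the analysis from the project itself: the table, the column names (legislator, district_type, introduced, chaptered), and the numbers are all made up for illustration, and the function passage_ratio_by_group is a hypothetical helper.

```python
import pandas as pd

# Hypothetical sample data: one row per legislator.
# Column names and values are invented for illustration only.
bills = pd.DataFrame(
    {
        "legislator": ["A", "B", "C", "D"],
        "district_type": ["urban", "urban", "rural", "rural"],
        "introduced": [40, 20, 10, 30],
        "chaptered": [20, 5, 4, 6],
    }
)


def passage_ratio_by_group(df: pd.DataFrame) -> pd.Series:
    """Return chaptered / introduced, pooled within each district type."""
    totals = df.groupby("district_type")[["introduced", "chaptered"]].sum()
    return totals["chaptered"] / totals["introduced"]


ratios = passage_ratio_by_group(bills)
print(ratios)
```

Pooling the counts before dividing weights each bill equally; averaging each legislator's own ratio would weight each legislator equally instead, and the two can give different answers.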
SCATTERPLOTS, LINES, AND CHOROPLETHS, OH MY!
Initially the legislative staff member wanted to “see all the data.” With the help of historical voting data made available by Alex Vassar, Legislative Historian, we were able to “see” the comparison between rural vs. urban, by house and district, in a virtual gallery of Seaborn-driven pointillist scatterplots: total votes cast, total measures introduced, total measures chaptered, average measures chaptered, percent of measures by house and district per square mile, and percent of measures by house and district per person per square mile.
Although I was convinced that the patterns in the scatterplots were self-evident, the deluge of dots was just too much for both of us (as well as for my co-workers who indulgently gave me feedback along the way!). Happy that he could “see” the data in all its glorious totality with the scatterplot gallery, the legislative staff member commissioned a series of cleaner, more elegant line graphs. While we knew from the previous plot gallery that urban districts introduce and chapter more measures than rural districts, it's much easier to see with this set of visualizations.
Finally, the legislative staff member challenged me to represent the data geographically, wanting to see the success rate (percent introduced and chaptered) and failure rate (percent introduced and not chaptered). Breaking Anaconda multiple times and wrestling with dependencies to create a geo-friendly environment, I was able to produce a synchronic series of choropleths for the 2019 session that offered a different way of looking at the data, not en masse as with the dot deluge, but in stark, comparative colorful relief.
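The success and failure rates behind a choropleth can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the code used for the 2019 maps: the district numbers and counts are invented, add_rates is a hypothetical helper, and the four equal-width color classes are just one possible choice. Joining the result to district boundaries and drawing the map is left out.

```python
import pandas as pd

# Hypothetical per-district counts; the values are invented for illustration.
districts = pd.DataFrame(
    {
        "district": [1, 2, 3],
        "introduced": [50, 20, 40],
        "chaptered": [10, 15, 20],
    }
)


def add_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Add success and failure percentages plus a color class for mapping."""
    out = df.copy()
    # Success rate: percent of introduced measures that were chaptered.
    out["success_pct"] = 100 * out["chaptered"] / out["introduced"]
    # Failure rate: percent of introduced measures that were not chaptered.
    out["failure_pct"] = 100 - out["success_pct"]
    # A choropleth shades each area by class; here, four equal-width bins.
    out["success_class"] = pd.cut(
        out["success_pct"],
        bins=[0, 25, 50, 75, 100],
        labels=["0-25", "25-50", "50-75", "75-100"],
        include_lowest=True,
    )
    return out


rated = add_rates(districts)
print(rated)
```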
The surprisingly easy creation of animated gifs with the choropleths soon followed, and in no uncertain terms, I realized I had become irrevocably infected with the desire to evangelize the glories of Python to anyone who would listen, addicted to the experience of seeking out new and novel data science tools, endlessly importing and trying out libraries I knew nothing about, forever flipping errors into my browser, restarting my kernel and clearing all outputs ad nauseam, leaping onto JupyterLab whenever I had time.
As is often the case with our requests, I’m not sure how the legislative staff member put this trinity of visualization galleries to use. I do know that it was appreciated: sometime later, he presented me with a certificate of recognition thanking me for my “hard work and research into the trends and history of the State Legislature.”
Those seven glorious, whirlwind months of data science skill-building were, and still are, a redemption of sorts. Truly a baptism by fire. I’d taken Pascal in undergrad, SPSS in grad school, explored Stata early in my government career, and barely survived R courses on Coursera and HarvardX, never having the chance to apply those nascent skills directly to a real-world problem. Learning how to code in Python, and applying it, changed everything!
Want to get into data science but don’t know where to start? Pick a side project that lights you up. Have a plan to keep you on track. Develop a routine. Think you are unqualified? Learn how to overcome that feeling. Surround yourself with data scientists who love to teach. Be fearless! Remember that even when accidents happen, you can rely on your team to see you through.