How I Accidentally Became Interested in Data Science

February 24, 2020

Jae Yeon Kim

About three years ago, six months after I became a PhD candidate, I learned that I was going to become a father. The baby shock was not part of my well-crafted plan for writing my dissertation, and I needed to deal with it. As soon as I heard the news, I emailed my funding agency and rescheduled my fieldwork in Canada. I finished most of the archival work for my dissertation research within that semester. I then wrote a chapter, based on the data I had collected, while waiting for the baby to arrive. (The paper was recently accepted by Studies in American Political Development, a flagship journal in my subfield. My other baby was also forthcoming!)

Soon after, the size of my family increased by 50%, and I realized that the time I had for myself, and especially for my research, had decreased by more than 50%. I was born and raised in a poor, but proud, working-class family in South Korea. My parents emphasized a strong work ethic throughout my upbringing. If five hours of work a day was not enough, I doubled my efforts. However, this time was different. My wife was also in school, and she was looking for a job. With no family members or relatives in the States, we were entirely on our own. I had to sacrifice sleep and even cut back on time for eating to make time for research. I couldn’t complain, because my wife was going through a harder time as a mother, student, and job seeker.

I became interested in data science by accident. I did so not because I wanted to, but because I had to. I loved visiting archives, flipping through arcane documents, and discovering things I had missed in conventional narratives. As a new parent, I needed to find a new and far more efficient way to do research. Instead of visiting archives, I collected more than 80,000 historical newspaper articles as HTML files from an online database and parsed them using Python. I then hired undergraduate research assistants, did content analysis, and classified these articles using supervised machine learning. My paper based on these preliminary findings was selected to receive the best paper award in Asian Pacific American Politics at the upcoming Western Political Science Association annual meeting. I applied the techniques I developed for my dissertation to my new research project, which uses machine learning (text classification) to create critical data for causal inference (interrupted time series design). I have shared all the code I wrote for each step in these research projects through my GitHub repositories. I hope that these resources can help other researchers who are struggling with their own life challenges. I am fully aware that dealing with multiple responsibilities is not an uncommon problem in the research community.
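For readers curious what "parse the HTML, then classify with supervised learning" can look like in practice, here is a minimal sketch using only the Python standard library. It is not the code from my repositories: the function names, the two labels, and the four toy pages are hypothetical, and a small Naive Bayes classifier stands in for whatever model a real project would use.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of one HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return "".join(c.lower() if c.isalpha() else " " for c in text).split()


def train_naive_bayes(labeled_texts):
    """Count words per label from (text, label) pairs coded by hand."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled pages standing in for coded newspaper articles.
pages = [
    ("<html><body><h1>Council vote</h1><p>The city council passed the election reform bill.</p></body></html>", "politics"),
    ("<html><body><p>Voters line up as the election campaign ends.</p><script>track()</script></body></html>", "politics"),
    ("<html><body><p>Local bakery wins prize for its bread and pastry.</p></body></html>", "other"),
    ("<html><body><p>New bakery opens downtown selling fresh bread.</p></body></html>", "other"),
]
model = train_naive_bayes([(html_to_text(h), y) for h, y in pages])
new_page = "<html><body><p>The election bill goes to the council.</p></body></html>"
print(predict(model, html_to_text(new_page)))
```

The point of the sketch is the division of labor: people label a small sample, and the model extends those labels to the articles nobody has time to read by hand.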

These days, more than 90% of my research time is devoted to writing and revising code. I consider myself situated at the intersection of data science and the social sciences. I am excited to be a D-Lab data science fellow and a data science education program fellow this semester, and to get more involved in the vibrant, open, and inclusive data science community at UC Berkeley.

Sometimes, life leads us through unexpected twists and turns. That is okay. New challenges make us acquire new skills. New skills open doors to new opportunities. I was initially drawn to data science because I needed those tools to survive. Looking back, I am grateful for this transition, because data science helped me handle a complex world more efficiently, both personally and professionally.

That said, if you are new to programming, sign up for a workshop at the D-Lab. If you have challenging computational or statistical problems, please feel free to make an appointment with a D-Lab consultant. (I am also a D-Lab consultant.) I am looking forward to getting to know you and welcoming you to the data science community.