The recent explosion in the availability of digitized text, whether produced in real time by Internet users on sites such as Twitter and Facebook or on blogs, or created by the ongoing push of companies like Google to digitize published texts, has prompted an interest in new methods to analyze text-as-data. D-Lab is interested in learning, exploring, and providing training in these newly developing methods. Members of D-Lab's Python and Text Analysis Working Group, for example, are working together to learn the methods and tools necessary for different types of quantitative text analysis. This blog post highlights my effort to bring together some of the tools we've been learning in this working group, and to demonstrate how these tools can be applied to real data.

Using articles about business in two major newspapers, I use automated text analysis to compare the articles that mention companies with a woman CEO or the CEO herself to the articles that mention companies with a man CEO or the CEO himself. I do two basic quantitative and comparative analyses: a sentiment analysis to measure the use of affective words, and a difference of proportion analysis to measure unique content in different categories of articles.

The Setup

While more and more women are becoming CEOs of Fortune 500 companies, they still represent a small proportion of CEOs compared to men. Only 21 of the Fortune 500 companies (4%) have a CEO who is a woman. To explore some uses of quantitative text analysis I decided to compare how two major newspapers, the liberal-leaning New York Times and the conservative-leaning Washington Times, report on companies that have women CEOs compared to companies that have men CEOs. For this brief analysis I chose to compare the 21 Fortune 500 companies headed by women to the top 21 companies headed by men. This includes the following CEOs and their companies:

Top 21 companies headed by women (source):

Meg Whitman, HP (#15)
Virginia Rometty, IBM (#20)
Patricia A. Woertz, Archer Daniels Midland Company (ADM) (#27)
Indra K. Nooyi, PepsiCo, Inc. (#43)
Marillyn Hewson, Lockheed Martin (#59)
Ellen J. Kullman, DuPont (#72)
Irene B. Rosenfeld, Mondelez International
Phebe Novakovic, General Dynamics
Carol M. Meyrowitz, The TJX Companies, Inc.
Ursula M. Burns, Xerox Corporation
Lynn J. Good, Duke Energy
Deanna M. Mulligan, Guardian
Sheri S. McCoy, Avon Products Inc.
Debra L. Reed, Sempra Energy
Denise M. Morrison, Campbell Soup
Heather Bresch, Mylan
Ilene Gordon, Ingredion Incorporated
Kathleen M. Mazzarella, Graybar Electric
Gracia C. Martore, Gannett
Mary Agnes (Maggie) Wilderotter, Frontier Communications
Marissa Mayer, Yahoo

Top 21 companies headed by men (source):

Michael T. Duke, Wal-Mart
Rex W. Tillerson, Exxon Mobil
John S. Watson, Chevron
Greg C. Garland, Phillips 66
Warren E. Buffett, Berkshire Hathaway
Timothy D. Cook, Apple
Daniel F. Akerson, General Motors
Jeffrey R. Immelt, General Electric
William R. Klesse, Valero Energy
Alan R. Mulally, Ford Motor
Randall L. Stephenson, AT&T
Timothy J. Mayopoulos, Fannie Mae
Larry J. Merlo, CVS Caremark
John H. Hammergren, McKesson
Lowell C. McAdam, Verizon Communications
Stephen J. Hemsley, UnitedHealth Group
James Dimon, J.P. Morgan Chase & Co.
George S. Barrett, Cardinal Health
Brian T. Moynihan, Bank of America Corp.
W. Craig Jelinek, Costco Wholesale
David B. Dillon, Kroger

Data Collection and Cleaning

I first downloaded a few thousand articles from LexisNexis, found using a keyword search for “business” in both newspapers. The Python and Text Analysis working group had been working through Python code written by Neal Caren, found here, here, and here, which includes a script to transform output from LexisNexis into a comma-separated values (CSV) file. I applied this LexisNexis script to my downloaded files, producing a CSV file with each article on one line. Using this CSV file I then did a keyword search through each document for references to the 42 companies above: either the last name of the CEO or the name of the company. I created two dummy variables for each document, one that indicates the document mentioned one of the companies headed by a woman or the last name of its CEO, and one that indicates the document mentioned one of the companies headed by a man or the last name of its CEO. You can see some of my code here. My final CSV file includes all of the information in the original LexisNexis output plus two additional columns, one for each of the two dummy variables.
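The flagging step described above can be sketched roughly as follows. This is a minimal illustration, not the code linked in the post: the shortened keyword lists, the column name TEXT, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical, shortened keyword lists; the real search covered all 42 companies.
WOMEN_TERMS = ["Whitman", "HP", "Rometty", "IBM", "Mayer", "Yahoo"]
MEN_TERMS = ["Tillerson", "Exxon Mobil", "Cook", "Apple", "Buffett"]


def build_pattern(terms):
    """Compile one case-sensitive regex that matches any term as a whole word or phrase."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b")


WOMEN_RE = build_pattern(WOMEN_TERMS)
MEN_RE = build_pattern(MEN_TERMS)


def flag_article(text):
    """Return (woman_ceo, man_ceo) dummy variables, each 0 or 1, for one article."""
    return int(bool(WOMEN_RE.search(text))), int(bool(MEN_RE.search(text)))


def add_dummies(rows, text_field="TEXT"):
    """Copy each row (a dict, e.g. from csv.DictReader) and add the two dummy columns.

    The column name "TEXT" is an assumption about the CSV layout.
    """
    flagged = []
    for row in rows:
        woman, man = flag_article(row[text_field])
        flagged.append({**row, "woman_ceo": woman, "man_ceo": man})
    return flagged
```

A plain keyword match like this will produce some false positives (a last name such as "Cook" also appears as an ordinary capitalized word at the start of a sentence), so a sample of the flagged articles is worth checking by hand.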

Armed with this CSV file as my data, I did the following calculations using a combination of Python and R.

Sentiment Analysis

There were a total of 1,051 articles from The New York Times and 1,526 from The Washington Times. I am interested only in those that mention the 42 companies of interest, and only in articles that mention a company headed by a woman but not a company headed by a man, or vice versa. In other words, the categories had to be mutually exclusive. I ended up with a total of 456 articles. Within these articles I am interested in four categories: documents that mention women CEOs or their companies in The New York Times and in The Washington Times, and the same for documents that mention men CEOs or their companies. This is a summary of the number of documents in each category (with proportions of the 456 articles in parentheses):



                        Woman CEO     Man CEO       Total
New York Times          33 (0.07)     196 (0.43)    229 (0.50)
The Washington Times    67 (0.15)     160 (0.35)    227 (0.50)
Total                   100 (0.22)    356 (0.78)    456 (1.00)
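One way to produce a summary like this from the flagged CSV rows is sketched below. The column names paper, woman_ceo, and man_ceo are assumptions for illustration, not the names used in my actual file.

```python
from collections import Counter


def categorize(row):
    """Return (paper, group) for a mutually exclusive article, otherwise None."""
    woman, man = int(row["woman_ceo"]), int(row["man_ceo"])
    if woman + man != 1:
        # The article mentions neither group or both groups, so it is excluded.
        return None
    return (row["paper"], "woman" if woman else "man")


def tabulate(rows):
    """Map each (paper, group) cell to (count, proportion of all kept articles)."""
    counts = Counter()
    for row in rows:
        cell = categorize(row)
        if cell is not None:
            counts[cell] += 1
    total = sum(counts.values())
    return {cell: (n, round(n / total, 2)) for cell, n in counts.items()}
```

Rows that mention both groups are dropped along with rows that mention neither, which is what keeps the four categories mutually exclusive.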


Same categories, with word count rather than document count:



                        Woman CEO        Man CEO           Total
New York Times          26,388 (0.07)    149,439 (0.42)    175,827 (0.49)
The Washington Times    53,733 (0.15)    129,668 (0.36)    183,401 (0.51)
Total                   80,121 (0.22)    279,107 (0.78)    359,228 (1.00)

Using Python I concatenated all of the documents in each category into a single text file, giving four text files, one for each category. I first did a rough sentiment analysis on each of these four documents, calculating the proportion of positive and negative words in each document (I borrowed code from Caren's tutorials, linked above, to do this calculation). The results:


                        Woman CEO               Man CEO                 Total
New York Times          pos: 0.0285313731434    pos: 0.0296061875259    pos: 0.0294450478525
                        neg: 0.0186420127311    neg: 0.0179711230948    neg: 0.0180719123359
The Washington Times    pos: 0.0349878100981    pos: 0.0337610660415    pos: 0.0341206490448
                        neg: 0.0214207284164    neg: 0.0195718560104    neg: 0.0201136264503
Total                   pos: 0.0328615645749    pos: 0.0315364856683    pos: 0.0318320993153
                        neg: 0.02050571614      neg: 0.0187148246202    neg: 0.019114290486

Overall these documents contained a higher percentage of positive words than negative words (3.2% versus 1.9%). The documents that mentioned companies with a woman CEO contained a higher percentage of both positive and negative words (3.3% and 2.1% compared to 3.2% and 1.9% for men-led companies), suggesting the use of more affective words overall in these documents when compared to documents that mentioned men CEOs. The documents published by The Washington Times used more affective words overall compared to those published in The New York Times (3.4% and 2.0% compared to 2.9% and 1.8%). The New York Times used a higher percentage of positive words and a lower percentage of negative words in the documents that mentioned companies with a man CEO (3.0% and 1.8%) compared to those mentioning companies with a woman CEO (2.9% and 1.9%), suggesting a slightly more positive sentiment overall toward these men-led companies. The Washington Times, on the other hand, used a higher percentage of both positive and negative words in the documents mentioning companies with a woman CEO than in those mentioning companies with a man CEO (3.5% and 2.1% compared to 3.4% and 2.0%). The Washington Times thus was not only more affective overall, it was particularly affective when discussing companies with a woman CEO.

These results do not, by themselves, say anything about how these papers discuss CEOs of different genders. There are many problems with these data, including possible correlations with other variables such as the relatively higher economic standing of companies headed by men CEOs, and possible differences in the types of companies headed by each. A more robust analysis would control for these confounding variables. The results do, however, provide some descriptive statistics about these four categories.
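The proportion calculation used in this section can be sketched as follows. This is not Caren's code: the two tiny word lists are placeholders I made up for illustration, and a real analysis would load a full positive and negative sentiment lexicon instead.

```python
import re

# Placeholder word lists (assumptions for illustration only).
POSITIVE = {"good", "great", "gain", "strong", "success"}
NEGATIVE = {"bad", "loss", "weak", "crisis", "fail"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_shares(text):
    """Return (share of positive words, share of negative words) for one document."""
    words = tokenize(text)
    if not words:
        return 0.0, 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive / len(words), negative / len(words)
```

Applied to each of the four concatenated text files, a function like this returns one pos value and one neg value per category, which is the shape of the results reported above.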

Difference of Proportions

To learn more about the content used in these articles I did one more simple mathematical calculation comparing these four categories: the difference of proportions.

The observed proportions are defined as:

$$f_w^{(i)} = \frac{y_w^{(i)}}{n^{(i)}}$$

The difference of proportions between two documents $(i)$ and $(j)$ is defined as:

$$f_w^{(i)} - f_w^{(j)}$$

Here $y_w^{(i)}$ is the frequency of word $w$ in document $(i)$ and $n^{(i)}$ is the total number of words, or row sum, for document $(i)$.


For each pair of documents, this calculation returns the words with the largest and the smallest differences of proportions, that is, the words most unique to each of the two documents. Using the same four text files, I compared the documents mentioning men and women CEOs within each newspaper.
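A minimal sketch of the difference-of-proportions calculation is below. The function names and the simple tokenizer are assumptions for illustration, and the sketch does not remove stop words, so very common words would need to be filtered out separately before reading the results.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def proportions(text):
    """Map each word to its count divided by the document's total word count."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def difference_of_proportions(text_a, text_b):
    """Map each word in either document to (proportion in A) minus (proportion in B)."""
    f_a, f_b = proportions(text_a), proportions(text_b)
    vocabulary = set(f_a) | set(f_b)
    return {word: f_a.get(word, 0.0) - f_b.get(word, 0.0) for word in vocabulary}


def most_distinctive(text_a, text_b, k=40):
    """Return the k words most distinctive of A and the k most distinctive of B."""
    diff = difference_of_proportions(text_a, text_b)
    # Sort from largest to smallest difference; ties are broken alphabetically.
    ranked = sorted(diff, key=lambda word: (-diff[word], word))
    return ranked[:k], ranked[-k:][::-1]
```

Calling most_distinctive with the text of the men-CEO file and the women-CEO file for one newspaper would return two word lists of the kind shown below, one for each side of the comparison.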

In The New York Times, the most unique words in the documents mentioning men CEOs compared to those mentioning women CEOs:

"percent"   "year"      "sales"     "bank"      "billion"   "last"      "banks"     "financial" "investors" "million"   "american"  "market"    "fed"       "google"    "according" "tax"       "stock"     "month"     "america"   "united"    "federal"   "may"       "quarter"   "results"   "growth"    "deal"      "vehicles"  "stocks"    "people"    "capital"   "company"   "many"      "european"  "analysts"  "increase"  "general"  "economy"    "markets"   "crisis"    "including"

And the most unique words in the documents mentioning female CEOs compared to male CEOs:

"got"     "activist"     "felt"     "patients"     "play"     "side"     "journalist"       "nuclear"        "life"      "hes"      "newspapers" "york"      "plant"     "thought"      "just"     "technology"     "newspaper"       "tribune"       "sometimes"       "science"     "drugs"      "something"     "chicago"    "stories"    "state"      "disease"      "post"       "time"      "fire"      "industry"      "food"       "writing"    "media"      "really"     "broadway"     "drilling"     "hulu"     "story"      "news"       "amazon"     "drug" 

In The Washington Times, the most unique words in the documents mentioning male CEOs versus documents mentioning female CEOs:

"oil"       "said"      "will"      "gas"       "stores"    "billion"   "year"      "city"      "bill"      "much"      "jobs"      "prices"    "one"       "caps"      "wage"      "energy"        "housing"      "market"      "money"     "business"      "economy"       "financial"     "growth"    "mayor"     "council"      "company"       "quarter"      "day"     "retailers"       "going"       "small"        "put"       "workers"   "first"     "top"       "games"     "last"      "per"       "american"       "minimum" 

And the most unique words in the documents mentioning women CEOs versus those mentioning men CEOs:

"planned"     "election"     "poll"      "employees"    "america"      "immigration"  "iran"      "online"      "supreme"     "freedom"      "post"      "information"  "paul"     "pentagon"     "political"    "foreign"     "million"      "national"      "kerry"       "times"      "government"      "war"      "department"   "nsa"     "intelligence"      "federal"      "case"      "republican"      "percent"      "former"     "military"     "security"     "women"    "program"     "media"     "americans"      "news"      "court"      "state"     "president"    "defense"  


The most defining words for the documents mentioning men CEOs largely relate to finance in The New York Times, e.g. “billion”, “bank”, “percent”, “sales”, “financial”, “investors”, “stocks”, and “quarter”; and to the economy in The Washington Times: “oil”, “gas”, “jobs”, “prices”, “wage”, “housing”, “market”, “money”, “economy”, “growth”, and “company”. In contrast, the most defining words for the documents mentioning women CEOs in The New York Times are related to medicine: “patients”, “science”, “drugs”, “disease”, “drug”; and journalism: “journalist”, “newspapers”, “newspaper”, “tribune”, “stories”, “post”, “writing”, “media”, “story”, and “news”. In The Washington Times the most defining words for the documents mentioning women CEOs relate to the government: “election”, “poll”, “kerry”, “government”, “federal”, “republican”, and “president”; foreign relations and the military: “iran”, “pentagon”, “foreign”, “military”, “security”, and “defense”; and political topics: “immigration”, “nsa”, and “intelligence”. There are also a few journalism-related words: “media” and “news”.

Overall, The Washington Times focuses on the government and the economy, which is not surprising given its proximity to Washington, D.C., and The New York Times focuses on topics outside of the government, including finance, medicine, and the news. Within The Washington Times, the two major topics (the government and the economy) are stratified by gender: the economy defines the documents mentioning men CEOs and the government defines the documents mentioning women CEOs. Within The New York Times the two major topics (finance and medicine) are also stratified by gender: finance defines the documents mentioning men CEOs while medicine defines the documents mentioning women CEOs.


Without reading any articles, these two simple methods of quantitative text analysis reveal interesting patterns in the text. For example, we know The Washington Times, at least in the articles related to business, is more affective and more concentrated on the government than the equivalent articles in The New York Times. Through this analysis we can also garner some information about the companies headed by men versus women. As the text analysis suggests, many of the companies headed by men are focused on finance and the economy: J.P. Morgan Chase & Co., Bank of America, Fannie Mae, Exxon Mobil, Chevron, Phillips 66, and Berkshire Hathaway. The companies headed by women are also related to the words identified in the text analysis. Medicine and health: Avon Products Inc., Guardian Life Insurance, and DuPont; the news: Gannett; and defense and foreign relations: Lockheed Martin and General Dynamics.

There are some inconsistencies in the results. For example, CVS Caremark, McKesson, UnitedHealth Group, and Cardinal Health all have men CEOs, yet the health terms were defining of the documents mentioning women CEOs. This may be because the health terms relate to food and food-borne diseases rather than health care, which would match the high number of food-based companies headed by women (PepsiCo, Inc., Mondelez International, Campbell Soup, and Ingredion Incorporated). I would have to look into some of the articles in each category to figure out what is producing these results.

This is a relatively simple analysis on limited data. These methods, because they are fully automated, can scale up almost indefinitely. Processing 1,000,000 documents takes roughly the same amount of researcher effort as processing 100 documents (it only requires more computing time), so sampling the text is not necessary. To make this analysis more complete I could include many more articles from each newspaper, I could include more newspapers, and I could link to data that includes all of the companies headed by women and all of the companies headed by men. I could also include more information about each company, so I could hold particular variables constant. Alternatively, I could introduce change or time into the design. If a company changes from a man CEO to a woman CEO, or vice versa, I could compare articles that mention this company before and after the change, or simply look at overall change through time.

Despite the limitations, the preceding calculations are examples of how quantitative text analysis can quickly and efficiently provide a preliminary analysis of text-based data. Hopefully it piqued your interest in quantitative text analysis and will motivate you to learn more about these methods.


Laura Nelson

Laura Nelson is an Assistant Professor of Sociology at Northeastern University, the author of “Computational Grounded Theory: A Methodological Framework”, and a contributor to various blog forums, most recently the forum on data analytics and inclusivity.