Enhancing Research Transparency Inspired by Grounded Theory

April 30, 2024

Enhancing Research Transparency Inspired by Grounded Theory

In today’s landscape of data science research, transparency is often viewed as a checklist: dataset shared, code uploaded, and, at most, documenting and registering the research plan beforehand. True transparency in research, however, requires much more. It involves opening up the entire scientific process, including sharing the exploration of data, the development of data collection methodologies, and the dynamic adjustments made throughout the study. In this post, I will discuss how grounded theory, an approach to systematically generating theory from data and documenting the process, can make us better researchers by taking a broader perspective on transparency and open science.

As a researcher, I first encountered grounded theory in the context of qualitative analysis. However, as I progressed on my journey to become a well-rounded, full-cycle researcher, engaging in both qualitative and quantitative work, I discovered that the principles of grounded theory could transform all aspects of my research, including data analysis for quantitative methods. When I applied grounded theory to my big data and quantitative projects, every step of my research began contributing to the emerging theory, results, and discussion. This included the often-overlooked initial phases of data exploration and the crucial process of selecting variables.

The grounded theory approach encouraged me to continuously reflect and adapt, prompting me to meticulously document each step, and enriched my academic skills and intuition. By integrating grounded theory, I was also able to demonstrate a clear trail from initial explorations to conclusions, ensuring transparency in every phase—from data collection to analysis and the contribution of each to the evolving understanding of the study's phenomena. 

This realization led me to think of grounded theory as an invaluable research tool. Therefore, in this blog post, I will explore how grounded theory can be adopted to enhance transparency in research, laying the foundations for how a quantitative researcher can use grounded theory.

What is Grounded Theory?

Grounded theory is a research methodology that enables researchers to develop theories rooted in empirical data. Unlike many other qualitative approaches that start with a pre-existing theory to categorize data, grounded theory adopts a bottom-up, inductive process. This method is especially valuable in scenarios where little is known about a phenomenon or possibly related theories or when the goal is to generate novel insights directly from the data.

Grounded theory views research as an iterative process involving stages of open coding, axial coding, and selective coding, accompanied by frequent memo writing. This progression allows researchers to identify emerging patterns, themes, and relationships, which become the building blocks for theory development. The methodology encourages letting the data speak for itself, avoiding the constraints of predefined categories and theories.

The journey begins with open coding, or dissecting data into discrete pieces. Each fragment is carefully examined and tagged with codes capturing its essence in an inherently iterative process. Data scientists will recognize this as being similar to data preparation and preliminary data analysis, where each data point or variable is evaluated for patterns and anomalies.This stage invites a constant comparison of new data segments with established codes, refining the codes as the analysis deepens—a dialogue with the data where each piece can reveal new insights or challenge existing interpretations.

The second step is axial coding, in which the initial codes are organized into categories and subcategories, unveiling their relationships. For data scientists, this can be likened to the process of clustering or classifying data and building statistical, economic, or econometric models, where relationships within the data are identified and explored. This phase demands revisiting the data and initial codes to ensure the emerging categories accurately reflect the data's intricacies. As axial coding advances, connections and discrepancies between categories arise. The researcher might observe that while physicians recognize the potential of neural network applications for diagnosis, concerns about patient privacy emerge. These insights prompt revisiting the data, refining categories to better capture physicians' nuanced opinions on AI.

The culmination of this process is selective coding, where core categories are chosen, revisited, and synthesized into a cohesive theoretical framework. This step involves a thorough and continual review process to ensure that the developed theory fully represents the data and the primary subject being analyzed. For instance, selective coding might involve identifying some discrepancies between key categories and previous codes. These discrepancies necessitate a reassessment and refinement of the initial theory to more accurately reflect the phenomenon, akin to model refinement and validation in data science, ensuring the variables accurately capture the intended phenomena.

An essential component of grounded theory, regardless of the coding stage, is memo writing, which functions as a vital reflective practice where researchers record insights, hypotheses, and questions throughout the analysis. These memos capture the evolving thought process, documenting pivotal 'aha' moments when new, significant insights are uncovered. Memo writing ensures that each step, whether a minor adjustment or a major theoretical shift, is well-documented and grounded in the data. 

Taken together, grounded theory, with its meticulous coding processes and iterative cycles, is traditionally associated with qualitative research but can significantly enhance transparency across various disciplines, including quantitative data science. In practice, many researchers in quantitative fields are already engaging in a similar approach to research. Integrating grounded theory into this process formalizes these steps, guiding researchers on when and how to record their findings and decisions effectively.

The Practical Guide to Leveraging Grounded Theory for Transparency in Data Science

To see how grounded theory can bolster data science, let’s have a look at an example. Suppose a researcher aims to conduct an econometric study on physicians' utilization of AI and its impacts, analyzing a dataset of Electronic Health Records (EHR) from Florida over five years. To enhance their work's transparency and robustness, the researcher can integrate grounded theory principles into the data analysis process, improving the study's clarity and the potential for replication, discussion, and impact.

Open Coding: Data Exploration and Variable Identification

In the open coding phase of the quantitative study analyzing physicians' utilization of AI and its impacts, the researcher engages in a detailed examination of the dataset. This stage involves identifying key variables such as patient demographics, physician characteristics, AI technology usage, and health outcomes. Each variable is meticulously examined to discern its characteristics and potential anomalies. The researcher documents the rationale behind selecting or excluding specific variables and any preprocessing steps like cleaning, normalization, and transformation, which could impact the results' interpretation.

Suppose a challenge arises when a variable measuring patients' income levels shows a high percentage of missing values. This socio-economic data could provide insights into disparities in health outcomes, and the high rate of missing data renders it unusable in its raw form. To address this, the researcher might exclude the variable from the analysis or apply advanced imputation techniques to estimate the missing values based on other available data, thereby maintaining its inclusion. Inspired by grounded theory, the researcher meticulously documents this decision, explaining the rationale and potential consequences. This thorough documentation ensures that the adjustments made are transparent and their impact on the study's methodology and results is clear.

This documentation also clarifies the basis for subsequent model development, ensuring the research process's initial stages are open and reproducible. It highlights the absence of key variables that could inform on inequalities and transparency in AI-assisted healthcare, thus drawing attention to the need for more inclusive and representative datasets.

To be clear, transparency of this nature extends beyond research; it could also enhance the utility of research for policy-making. For instance, identifying missing variables can prompt policymakers and healthcare organizations to prioritize collecting and sharing data that enables the examination of inequalities and transparencies. Consistently reporting the absence of crucial variables related to equity and transparency can create a compelling case for investing in comprehensive data collection efforts. It also aids in establishing governance frameworks that promote responsible data sharing and use.

The transparency encouraged by grounded theory benefits the broader scientific community as well. For example, if another researcher using the same dataset encounters detailed documentation, they can quickly understand why certain variables were deemed unsuitable and avoid spending time on them. Instead, they might focus on finding ways to impute missing values or transform variables differently, potentially uncovering new insights that were previously overlooked.

Axial Coding: Econometric Model Specification

The next step is axial coding, which involves refining and relating the categories identified during open coding, exploring their connections, and organizing them into a coherent structure. In the context of econometric modeling, this phase entails developing preliminary models to capture the relationships between key variables identified earlier. Quantitative researchers, at this stage, delve deeper into these categories, adjusting them to more accurately reflect the underlying data, and construct econometric models as tools to probe these relationships and identify significant patterns.

Suppose the researcher decides to use a linear regression model to analyze how AI-assisted diagnoses per physician impact patient outcomes, like hospital readmission rates. Recognizing a potentially nuanced relationship, they might hypothesize that the effect of AI on readmission rates is influenced by physician experience.

To investigate this, the researcher creates an interaction term called "AI_Experience_Interaction" by combining physician experience and the number of AI-assisted diagnoses.

This approach allows them to examine how the impact of AI-assisted diagnoses on readmission rates changes with the physician's level of experience. For further refinement, the researcher might also adjust variables, such as merging the type of AI technology with the specific conditions diagnosed into a composite variable called "AI_Tech_Condition". This new variable could help assess how different technologies affect outcomes for various medical conditions.

These steps of combining and refining variables allow the researcher to capture the complexities in the data. However, each of these actions might impact the results and their interpretations. Therefore, although such decisions in variable selection and model specification are critical, they are often unreported in the final article. Documenting these choices enhances the transparency of the model development process and aids in understanding the researcher's methodology.

Note that this approach contrasts sharply with conventional methodology sections, which often provide only a brief overview and leave readers questioning the rationale behind specific choices. By detailing the reasoning for selecting particular models and estimation methods, researchers clarify the empirical and theoretical foundations underpinning their work.

Transparency in documenting such processes invites constructive criticism from peers, which is crucial in an academic environment often dominated by a "publish or perish" mentality. For example, if another researcher identifies a more appropriate model or estimation method, they can share this information, leading to a more robust and accurate analysis. Such collaboration, facilitated by transparency, helps counteract the pressures of rushing work and encourages a more deliberate and thorough examination of econometric models.

Just as importantly, transparency helps prevent duplication of efforts and promotes the efficient use of research resources. If a researcher documents that a certain model specification, such as a linear regression model, led to unsatisfactory results due to its inability to capture non-linear relationships, other researchers can avoid this pitfall. They might explore more suitable approaches like non-linear models or machine learning techniques, saving time and resources that might otherwise be spent on ineffective models.

Selective Coding: Theoretical Framework Synthesis

As discussed, selective coding is the phase where the core category, the central phenomenon around which all other categories are integrated, is identified and refined. This phase involves systematically relating the core category to other categories, validating those relationships, and filling in categories that need further development. In the context of econometrically examining AI's influence on healthcare outcomes, selective coding could involve integrating findings from various models into a cohesive theoretical framework–even the observations, actions, and models that are conventionally not reported but could influence the theoretical understanding of the phenomenon.

For example, the researcher might discover that the initial framework does not capture the role of physician experience in moderating AI's impact on patient outcomes, prompting a revisit to the open coding phase to identify relevant variables and refine the models accordingly. Transparency in this phase is crucial for ensuring the credibility and reproducibility of the research findings. By documenting the evolution of the theoretical framework and the iterative process of refining it, the researcher can demonstrate the robustness of their conclusions and invite constructive feedback from the scientific community. Suppose another researcher identifies a gap in the framework or suggests an alternative interpretation of the findings. In that case, this can lead to a fruitful dialogue and the development of a more comprehensive understanding of AI's role in healthcare.

Memo Writing: Documentation and Revisiting Other Steps

In the context of using grounded theory for quantitative data analysis, memo writing is essential for enhancing documentation and transparency. This technique could involve several types of memos that collectively ensure a thorough and clear record of the research process. For instance, the researcher might decide to have decision documentation memos for recording the reasoning behind every methodological choice, such as why certain econometric models were chosen or why specific variables were included or excluded. They could also keep the logs of model iterations to provide detailed accounts of changes made to the models during the study, including adjustments in parameters or variables, and discuss how these changes affect the outcomes.

Suppose the researcher decides to write a model iteration log. They initially introduced the interaction term "AI_Experience_Interaction." This was a strategic move to investigate how the effectiveness of AI technology in diagnosing conditions varies with the physician’s level of experience. Then the researcher decides to merge the type of AI technology with the specific conditions diagnosed into a composite variable named "AI_Tech_Condition." This variable is intended to assess how different AI technologies affect outcomes across various medical conditions. The log would detail this adjustment, explaining that the synthesis of these variables into a single composite aims to reduce complexity and enhance the model's ability to capture nuances in how technology impacts healthcare outcomes.

Therefore, each entry in the Model Iteration Log not only documents these changes but also includes theoretical justifications for each model adjustment, the estimation methods used, and the criteria for variable selection as well as the impact of the observations on researchers’ insights and theorizing. This documentation provides a properly transparent account of how the models are developed and refined over time, allowing other researchers to understand the progression and reasoning behind each methodological choice.


The heart of grounded theory lies in meticulous documentation—recording each decision, code, and the rationale behind moving from one step to another. This comprehensive record, including the rationale for each decision and the statistical methods applied, will create a detailed trail from raw data to the theoretical framework, making the quantitative research methodology transparent and reproducible.

In the reviewed example, the researcher will probably stand out for its methodological transparency, encouraging replication, critique, and further research within their scientific and stakeholders’ communities. Their study is not only accessible but also positioned to make a significant impact on understanding AI's role in healthcare. It adds an enormous amount of richness to the academic output, which is normally disregarded.

I should address the elephant in the room: this level of documentation can be time-intensive, which may be a challenge for researchers in academia who are often pressed for time. The pressure to publish quickly and the demands of other academic responsibilities can make it difficult to allocate the necessary resources to document every aspect of the research process. Nevertheless, I believe that the benefits of transparency and reproducibility outweigh the costs.

This means that academia desperately needs to rethink its reward structures. Institutions and funding agencies need to support this approach by recognizing the value of transparent and reproducible research and providing the necessary resources and incentives to encourage researchers to adopt these practices.

Ultimately, the most rewarding aspect of research is seeing others understand and utilize our findings, and grounded theory magnifies this impact by inviting others into our world through detailed documentation and sharing of our research journey. By investing in transparency and reproducibility, we can ensure that our research has a lasting impact and contributes to the collective advancement of our fields.

Figure: Grounded Theory for Data Science


  1. Strauss, A. L., & Corbin, J. M. (1997). Grounded theory in practice. Sage.