Organized Code Repositories Accelerate Science and Facilitate Reproducibility

March 2, 2021

Computational and data-driven research increasingly requires developing complex codebases. At the same time, many scientists don’t receive training in software engineering practices, resulting in, for some, the perception that scientists write terrible software. As scientists, we should expect good software to accelerate our work and facilitate its reproducibility. While building good coding practices takes some time and experience, it doesn’t require a degree in computer science. By following basic design principles, we can put ourselves in a position to succeed.

A first step we can take is to ensure that our code repositories are well-organized and easy to parse. In this post, I’ll make the case that structuring your repositories according to a predictable template will make your science easier, cleaner, and more reproducible. I’ll start by discussing virtual environments, then version control (specifically git and Github), and end by detailing a template for a repository. This template is designed with a “paper repository” in mind: that is, it’s for projects whose output is some kind of written report (e.g., publication, thesis, technical report). Furthermore, I’m going to focus on Python, but many of these principles hold for other languages.

The template I present is accessible on Github. It’s forked from another template that is used by my lab. This original template was developed by Jesse Livezey, a postdoc at Lawrence Berkeley National Laboratory.

0) Precursor: Virtual Environments

Imagine you have a codebase for one of your projects. You submit a paper, and go work on other projects while waiting on reviews. When the reviews come back, you’ve got to do some additional analyses. So, you fire up the old codebase, only to find that all your code breaks! Turns out, while working on the newer projects, you updated package A from version 1.0 to version 2.0. Unfortunately, this update changed the behavior of specific functions your codebase relied on, since your code was written against version 1.0.

This scenario is one motivation for using a virtual environment. A virtual environment is an isolated copy of Python and any external packages, all installed at specific versions. Ideally, you’d have a virtual environment for each one of your research projects. Thus, when using a virtual environment, you can be confident that any changes you make won’t mess with your other projects. So, in the scenario above, your codebase would have its own virtual environment, in which package A would be version 1.0. Meanwhile, your other projects would have their own virtual environment(s), in which package A would be version 2.0. When you return to the old codebase, you’d switch back to the virtual environment for it, and everything should work smoothly.

The most common approaches to handling virtual environments in Python are the virtualenv package and Anaconda. I strongly recommend Anaconda (often referred to as conda), as it’s specifically tailored toward scientific programming. I’d recommend having a conda environment for each research project (or group of very similar projects). So, step 1: create a conda environment for your project (for example, conda create --name myproject python=3.9, followed by conda activate myproject).

1) Version control: Setting up a Github repository

Now imagine that you’re collaborating with one of your labmates on a project. You’re both making changes to functions in the codebase. At one point, you both have changed the same lines in a particular function. How do you go about merging your changes so that you’re both using the same code? This is the rationale for version control: a system that manages and records changes to a codebase. The most commonly used version control system is git (others include Mercurial and SVN). git is often used in tandem with a cloud-based hosting platform - the most common is Github (others include Gitlab and Bitbucket). The benefit of using Github is that its web platform makes it easier to collaborate on code with others.

A full tutorial on Github is beyond the scope of this blog post (attend the git D-Lab workshop to learn more!). However, your next step after creating the environment should be to initialize a git repository for your project (git init), and make sure it’s hosted somewhere like Github, Gitlab, or Bitbucket. I can’t emphasize this enough: you should have a git repository for every one of your research projects. If you’re concerned about keeping early-stage analysis private before publication (which is perfectly valid), you can make your repository private. Ultimately, there is no good reason not to use version control.

2) Creating a Package for your Project

Now that you’ve created a repository, you need to start populating it. The first file you’ll want is setup.py. This file tells Python how to treat your repository as a package. It contains descriptive information about the package, as well as its dependencies - the other packages that must be installed before you can run your code.
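
To make this concrete, here’s a minimal sketch of such a file, using setuptools. The package name codebase matches the template; the metadata and dependencies are placeholders you’d replace with your own:

    from setuptools import setup, find_packages

    # Minimal setup.py sketch; all metadata here is placeholder.
    setup(
        name="codebase",            # the package name; also the folder name below
        version="0.1.0",
        description="Analysis code accompanying my paper",
        packages=find_packages(),   # discovers the codebase/ folder automatically
        install_requires=[          # dependencies installed alongside the package
            "numpy",
            "scipy",
            "matplotlib",
        ],
    )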

Once you have a setup file, you can install your package into your conda environment using pip, Python’s package installer (pip and conda can work together, each capable of installing packages into a conda environment). In particular, you can use pip to install an editable version of the package (pip install -e . from the repository root). This means that any changes you make during development are automatically reflected in the installed package. So, if you’re testing some code and find a bug, you can fix the bug and keep working, without having to reinstall the package.

The benefit of having your repository be an installable package is that you can access the code within it - any classes or utility functions - anywhere you might be coding, as long as you import the package. This is much easier than having to set your working directory every time you need to import a class or function from your codebase.
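
For example, once the package is installed, imports like the following work from any directory - in a script, a notebook, or an interactive session (run_analysis is a hypothetical function standing in for whatever lives in analysis.py):

    # No need to change directories or modify sys.path:
    # the installed package is importable from anywhere.
    from codebase import utils
    from codebase.analysis import run_analysis  # hypothetical function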

3) Choosing the best folder structure 

Now that you have an installable package, you’re ready to begin developing. But where do all your files go? It wouldn’t be productive to have all your files in the same folder - having some organization will make it easier for you and others to efficiently use the repository. The folder organization for the template I linked to is as follows:

  • codebase: The name of this folder is set by the setup.py file. This is your main codebase: any code that you expect to be imported when this package is imported should go in here. This includes any classes and functions that are consistently used in analyses relevant to the project. Some suggested files are included in the template: analysis.py, plotting.py, and utils.py, which contain, as one might expect, analysis, plotting, and utility functions, respectively.

  • scripts: Contains scripts that perform the important analyses for the project. These scripts will depend on the functions in the codebase, but should not define functions themselves. For example, a script may apply functions from analysis.py to specific datasets, producing outputs that are used in the figures of the paper.

  • notebooks: Contains Jupyter notebooks that perform important analyses for the project. Note that there may be some flexibility between what goes in scripts and notebooks. This is often up to personal style. As a general rule of thumb, if the output is a plot, use a notebook. If the output is processed data, use a script. 

  • figures: A separate folder, often consisting of Jupyter notebooks that generate the figures (or at least each figure’s subpanels) of the paper the project leads to. Ideally, each figure should have its own notebook. This way, anyone can download your repository, install the package, and easily regenerate the figures in your paper.

  • tests: An important component of good software engineering is unit testing, where you develop simple tests for the classes and functions in the package. You often do some sort of unit testing as you debug your code. However, storing these tests in their own folder - where they can be run with an external package like pytest - increases confidence in the quality and correctness of the code. A minimal test sketch follows this list.
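
As a minimal sketch, a test file might look like the following; zscore is a hypothetical utility function assumed to live in codebase/utils.py (it’s sketched in the documentation bullet below). Running pytest tests/ from the repository root will discover and run every test:

    # tests/test_utils.py
    import numpy as np

    from codebase.utils import zscore  # hypothetical utility function


    def test_zscore_has_zero_mean_and_unit_variance():
        data = np.array([1.0, 2.0, 3.0, 4.0])
        standardized = zscore(data)
        # A standardized array should have mean 0 and standard deviation 1.
        assert np.isclose(standardized.mean(), 0.0)
        assert np.isclose(standardized.std(), 1.0)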

Ultimately, the only required folder here is codebase, since it contains the package code. The rest can be tailored to your preferences, but you can use this organization as a starting point.
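
Putting it all together, a repository following this template might be laid out roughly as follows (my_project is a placeholder name, and test_utils.py refers to the test sketch above):

    my_project/
    ├── setup.py
    ├── codebase/
    │   ├── __init__.py
    │   ├── analysis.py
    │   ├── plotting.py
    │   └── utils.py
    ├── scripts/
    ├── notebooks/
    ├── figures/
    └── tests/
        └── test_utils.py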

4) And beyond…

That’s it! Following the above steps will help you push your science forward more efficiently, and help others use, reproduce, and build upon your work.

But wait, there’s more! Here’s a smattering of additional tools you can look into to take your scientific programming to the next level:

  • Code coverage and continuous integration: I mentioned above that unit testing is very important for ensuring that users trust the correctness of your package. Github provides tools that make these unit tests more effective. The first is code coverage: a measure of how much of your code is actually exercised by your tests. The second is continuous integration, which automatically re-runs your unit tests as new changes are integrated into the package. Both code coverage and continuous integration rely on third-party services that run via Github any time updates are made to the repository.

  • Code linting: In addition to unit testing, consistent code style is instrumental to ensuring that your code is readable and clean. Different languages have style standards that you should follow (e.g., when to indent, when to add spaces, restrictions on variable names, etc.). In Python, packages such as flake8 and pylint can automatically lint your code, pointing out instances where the style guide is not being adhered to. You can include a configuration file in your Github repository that details the custom style guide it follows (e.g., a .flake8 file).

  • Documentation: The last key component of code reproducibility is documentation. In Python, classes and functions should each be accompanied by docstrings, which document their inputs, their outputs, and what they do. There are tools that will automatically compile all docstrings into an easy-to-read website (e.g., hosted on Read the Docs). In Python, you can use a package called Sphinx to generate these websites. A sketch of a documented function appears after this list.

  • Docker: A virtual environment helps reproducibility by providing a record of exactly what packages are installed, and what their version numbers are. However, this might not be good enough, particularly if users are running different operating systems. For example, packages often require slightly different dependencies, making cross-platform building of virtual environments tricky. This is where Docker comes in: Docker provides a platform for constructing a container that is, quite literally, a barebones virtual OS capable of running your code. That way, another user can simply run your code within a Docker container, without having to worry about the details of the underlying environment.
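
As promised above, here’s a minimal sketch of a documented function, written in the NumPy docstring style that Sphinx can render; zscore is the same hypothetical utility used in the test sketch earlier:

    def zscore(data):
        """Standardize an array to zero mean and unit variance.

        Parameters
        ----------
        data : np.ndarray
            The array to standardize.

        Returns
        -------
        np.ndarray
            The standardized array.
        """
        return (data - data.mean()) / data.std()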