Reduce, Reuse, Recycle: Practical strategies for working with large datasets

October 12, 2022

When the size of your datasets starts to approach the size of your computer’s available memory, even the simplest data wrangling tasks can become frustrating. Suddenly, reading in a .csv or calculating a simple average becomes time-consuming or impossible. For students and researchers, accessing additional computing resources can be costly or simply not an option. Here are some principles and strategies for reducing the overhead of your dataset while keeping the momentum going. The code focuses mainly on reading csv files - a very common data format - into Python and pandas, but the overarching principles will apply to any language.

Reduce

While developing your code, reduce the amount of data you are reading into memory. Often, while you are in the exploratory phase, you just need to become familiar with the structure of your data, so working with a small sample initially will suffice. 

Sampling

The simplest way to reduce the size of your data? Limit the number of rows and columns. In Python/pandas, the read_csv() function, which you’ll commonly use to import your data, comes with keyword arguments for keeping your dataframe a manageable size. This is akin to taking a non-random sample of your data.* These keyword arguments (demonstrated in the sketch after this list) include:

  • nrows: an integer value indicating the number of rows you’d like pandas to read in from the file

  • skiprows: relatedly, skiprows lets you skip a specified number of rows. This can be used in conjunction with nrows.

  • usecols: takes a list of column names to keep in the dataframe
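
For example, here’s a rough sketch of these arguments in action (the file path and column names are placeholders):

import pandas as pd

filename = "data/filename.csv"
# Read only the first 1,000 rows of two columns
sample = pd.read_csv(filename, nrows=1000, usecols=["col_a", "col_b"])

# Or skip the first 500 data rows (keeping the header row) before reading 1,000 rows
sample = pd.read_csv(filename, nrows=1000, skiprows=range(1, 501),
                     usecols=["col_a", "col_b"])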

You can develop your data cleaning, manipulation, and exploratory analysis scripts – which likely won’t need the full dataset during development – on these samples first. Then, execute them on the full dataset once you are more certain they will run.

Chunking

Another approach when you have limited memory is to process your dataset in chunks. You can think about this like taking slices from a cake rather than trying to eat the entire cake at once. Helpfully, pd.read_csv() offers the chunksize argument, where the number passed is the number of lines in each “slice” to work with at once. This returns an iterable reader object of data chunks, which you can then loop over to process. Here’s an example (source):

filename = "data/filename.csv"
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        # code to process the chunk

However, note that if limited memory is the issue, you still won’t be able to read the entire dataset into memory and work with it all at once. Most often, I’ll use chunking to conduct manipulations across the entire dataset and save my (much smaller) outputs to a list.
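
Here’s a rough sketch of that pattern (the column name value_col is a hypothetical stand-in for whatever you’re filtering on):

import pandas as pd

filename = "data/filename.csv"
outputs = []
with pd.read_csv(filename, chunksize=10 ** 6) as reader:
    for chunk in reader:
        # keep only what you need from each chunk, e.g. a filtered subset
        outputs.append(chunk[chunk["value_col"] > 0])

# combine the per-chunk outputs into a single, smaller dataframe
result = pd.concat(outputs, ignore_index=True)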

Reuse and Recycle

Another set of strategies involves reusing components of the dataset. 

Saving and reusing objects using pickle

Python’s built-in (and in my opinion, best-named) pickle library allows you to save Python objects – including pandas dataframes – directly to disk, rather than converting them and saving them in a format like csv. This means that when you open a pickle file, the Python object inside it will look exactly like the one that was saved – no need to worry about conversions! So, for a dataframe you’ve cleaned and processed, you can save it directly to a pickle file for reuse later without having to worry about how conversion to csv might change the underlying data. Here’s an example of this process:

# Pickle a dataframe/object
import pickle
import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3]})  # stand-in for your dataframe/object
filename = "filename.pickle"
with open(filename, "wb") as f:
    pickle.dump(df, f)

# Load / unpickle the dataframe/object
with open(filename, "rb") as f:
    new_df = pickle.load(f)
    # new_df will look exactly like df

Loading Python pickles is speedier than reading in a csv, since pickle files store the object in a binary format that doesn’t need to be parsed – if you happen to open one of these files in a text editor, you’ll notice that it is totally incomprehensible.
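
If you want to verify the speedup on your own data, here’s a minimal sketch using time.time() (the file names are placeholders for a csv and its pickled equivalent):

import pickle
import time

import pandas as pd

start = time.time()
df_csv = pd.read_csv("data/filename.csv")
print("csv read:", time.time() - start)

start = time.time()
with open("filename.pickle", "rb") as f:
    df_pickle = pickle.load(f)
print("pickle read:", time.time() - start)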

However, they do come with a couple of considerations: 

  • Pickles can take up significant space on your hard drive, but most of us have more available storage space than memory.

  • Since pickles can contain executable code, there is a potential security risk when loading pickles from unknown origins. However, if you are just using pickles you create yourself, you should be good to go!

Pre-define your datatypes

If your input data schema changes infrequently, pre-defining your datatypes can help pandas speed up read-in by skipping the step of inferring the column data types. The read_csv() function’s dtype argument takes a dictionary of columns and their corresponding types:

import numpy as np
import pandas as pd

dtypes = {'col_a': np.float64, 'col_b': np.int32, 'col_c': 'object'}
pd.read_csv(..., dtype=dtypes)

If you have a lot of columns, you can combine this with taking a sample using nrows, getting the column names and data types as a Series via the .dtypes attribute, and then hard-coding them as dtypes for future use:

# Get header and data types using a sample
import pandas as pd

filename = 'data/filename.csv'
nrows = 10 ** 4
df = pd.read_csv(filename, nrows=nrows)
# Show dtypes
print(df.dtypes)

# Make dict using dtypes
keys = [key for key in df.dtypes.index]
values = [str(value) for value in df.dtypes.values]
# Copy and hard-code this output
print(dict(zip(keys, values)))

# When loading the larger/full dataset, use the hard-coded output
dtypes = {'col_a': 'float64', 'col_b': 'int32', 'col_c': 'object'}
df2 = pd.read_csv(filename, dtype=dtypes)

Still too much data? 

  • If you do have resources available, look into cloud computing resources offered by Google / Amazon Web Services.

  • Look into parallelizing utilities such as Dask, which can help optimize chunking and processing operations. 

  • If you really need to go deep into optimizing your code, use timing and profiling tools such as %%timeit, time.time(), and the cProfile library (see the sketch below).
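
As a minimal sketch of the profiling route (the file path is a placeholder; %%timeit is an IPython/Jupyter cell magic, so it isn’t shown here):

import cProfile

import pandas as pd

filename = "data/filename.csv"

# Profile a single call and sort the report by cumulative time
cProfile.run("pd.read_csv(filename)", sort="cumtime")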

*An aside on random sampling

Sometimes you’ll want a representative, random sample: this gets a little tricky, as there is not always a straightforward approach. My favorite approach leverages the Bash command line and the shuf utility. Here’s a command you can run to subsample your dataset:

tail -n +2 <filename> | shuf -n <nrows> -o <new_filename>

Let’s break down this command: 

# Skip the header row of filename
tail -n +2 <filename>

# pass the output to the next command
|

# shuffle all rows of the file and return nrows in your sample
shuf -n <nrows>

# save this random sample output to new_filename
-o <new_filename>

You’ll then need to add back the header using pandas’ read_csv() function. Specifically, you’ll use the header and names arguments:

pd.read_csv(<new_filename>, header=None, names=<list of header labels>)
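
For instance, here’s a rough sketch of reattaching the header labels (the file paths are placeholders):

import pandas as pd

# Grab the header labels from the original file without reading any data rows
header = pd.read_csv("data/filename.csv", nrows=0).columns.tolist()
# Read the shuffled, header-less sample and reattach the labels
sample = pd.read_csv("data/sample.csv", header=None, names=header)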