Is your Random Sample Really Random?

January 20, 2022

One of the most common ways people run into random numbers is through their research. We often hear the term “random sample,” or a “randomized” assignment to a control group. Sometimes we randomly select a certain number of rows or columns from a dataset to perform an analysis on a representative snapshot of the data. Additionally, for many of us from a natural science or engineering background, random numbers are often used in simulations or optimization models. Given the wide variety of uses for random numbers in Data Science, I thought it would be interesting to take an introductory look at what they really are and how they can be generated by a computer.

In general, the intuition behind a random number is pretty easy to grasp: pulling a card from a deck of cards, rolling a die, or flipping a coin are all classic ‘random’ events in which each outcome has an equal chance of occurring. In other words, if you ran that same event over and over again, each possible outcome would occur roughly the same number of times, and the proportions would get closer to equal the longer you ran it. The catch, however, is that many of the above are somewhat deterministic physical processes, especially if a human is involved. The likelihood that every die roll, every coin flip, or every card shuffle ends up being perfectly random over millions or billions of iterations is very small, as there is likely some systematic, human-induced pattern. So what, then, is a perfectly random number?

The ‘classical’ perfectly random numbers are generated by natural processes. One oft-used example is particle decay, in which one particle will ‘decide’ to split into two or more component particles. The timing of this process is completely random and unpredictable; even with billions of observations of particle decays, we couldn't predict when the next particle will decay any better than chance. This realization is important because it hints at one of the key aspects of random numbers: each outcome or number that is selected must be independent of all the other outcomes or numbers. This is what separates a truly random number from what is called a pseudo-random number.

These pseudo-random numbers are almost always what computer programs and algorithms use: a series of numbers is generated deterministically from some initial seed value. One common family of such generators is the Linear Congruential Generator, or LCG. Essentially, given some initial seed value, the generator multiplies the seed by a constant, adds a second constant, and takes the result modulo a third constant to produce a new value; the new value is then used as the seed for the next random number. This process produces a long sequence of numbers that appear independent of one another. While these numbers are clearly not entirely random (they were generated by a function, and each value is fully determined by the previous one), this is a computationally very efficient way to generate near-random numbers. It is important to note, however, that a method such as LCG is not generally considered cryptographically secure [1].
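To make the multiply, add, and modulus steps concrete, here is a minimal sketch of an LCG in Python. The function name lcg and the helper take are my own for illustration, and the constants are one well-known published parameter set (often attributed to Numerical Recipes), chosen only as an example rather than as a recommendation.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield an endless stream of pseudo-random integers in [0, m).

    a, c and m are illustrative constants (an assumption of this sketch);
    real libraries use different and usually stronger generators.
    """
    state = seed
    while True:
        # Multiply, add, then take the modulus; the result becomes
        # the "seed" for the next number in the sequence.
        state = (a * state + c) % m
        yield state


def take(gen, n):
    """Collect the next n values from a generator into a list."""
    return [next(gen) for _ in range(n)]


first = take(lcg(seed=42), 5)
again = take(lcg(seed=42), 5)   # same seed, so the same sequence
other = take(lcg(seed=43), 5)   # different seed, different sequence

print(first)
print(first == again)
print(first == other)
```

Running it twice with the same seed gives exactly the same list, which is the whole point: the output looks random, but it is completely determined by the seed.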

These pseudo-random numbers are still more than good enough for most applications in Data Science, however. It is especially important that Data Scientists understand, for example, that how they set the seed value determines how reproducible their results are. Woodcock notes that using a Python function such as numpy.random.seed() to set the seed is not preferable in large projects, as the global random seed could be reset by other code without you knowing [2]. Instead, the recommended practice is to call np.random.default_rng(), which creates a local generator object with its own seed that you can pass explicitly to each part of your code [2]. So to answer the question in the title: no, your random sample probably isn’t really random, but with good coding practices it shouldn’t matter too much unless you are doing cryptography, high-level simulations, or working with very large datasets.
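As a rough sketch of what that looks like in practice, the snippet below draws a random sample of rows using a local generator instead of the global seed. The function sample_rows, the toy array, and the seed value 2022 are all hypothetical choices of mine for illustration.

```python
import numpy as np


def sample_rows(data, n, rng):
    """Return n distinct rows of data, chosen with the given generator."""
    # Pick n row indices without replacement using the local generator.
    idx = rng.choice(data.shape[0], size=n, replace=False)
    return data[idx]


# Toy dataset: 10 rows, 2 columns (illustrative only).
data = np.arange(20).reshape(10, 2)

# Two independent generators created with the same seed.
rng_a = np.random.default_rng(2022)
rng_b = np.random.default_rng(2022)

sample_a = sample_rows(data, 3, rng_a)
sample_b = sample_rows(data, 3, rng_b)

# Same seed and same sequence of calls, so the samples match.
print(np.array_equal(sample_a, sample_b))
```

Because the generator is passed in as an argument, nothing elsewhere in the project can quietly change the state it draws from, which is the reproducibility benefit described above.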

That is about it for this introduction to randomness and pseudo-random numbers in Data Science! If you are still interested in true random number generation, or want to explore RNG options for your work, there are a number of organizations and websites that generate random numbers for free. One such site is random.org, which uses atmospheric noise to generate truly random numbers [3].

Citations: 

[1] Rebecca N. Wright, in Encyclopedia of Physical Science and Technology (Third Edition), 2003

[2] Woodcock, Henri. “Stop Using Numpy.random.seed().” Medium. Towards Data Science, March 1, 2021. https://towardsdatascience.com/stop-using-numpy-random-seed-581a9972805f

[3] https://www.random.org