Machine learning in atmospheric science

The atmosphere is an incredibly complex (and fascinating, I would add!) chemical-physical system. Imagine it as a big mixture of gaseous molecules and liquid and solid particles, the latter commonly referred to as aerosol particles or particulate matter (PM). The air you are breathing at any given moment contains literally thousands of chemical compounds, which can be either inorganic (like sea salt, dust or volcanic ash) or organic (i.e. carbon-containing molecules, coming from sources such as engine exhausts, forests, scented candles and oils secreted from your own skin). In addition, the composition of the atmosphere varies in both space and time. For example, in our cities the chemical composition of the air we breathe can change with the seasons, owing to the use of domestic heating in the winter and its absence in the summer, or to differences in temperature and solar radiation. As for spatial variability, we have all experienced the difference between the air at a beach and the air in a polluted city, but composition also varies vertically from the ground up. Finally, molecules in the atmosphere undergo a wide variety of chemical transformations, and freshly emitted compounds can greatly differ from what scientists often refer to as ‘aged’ components.

Both our knowledge about the chemical composition of the atmosphere and our ability to measure trace concentrations of chemicals have dramatically improved over the last decades. That has certainly increased our understanding of how altering the composition of the atmosphere affects air quality, people’s health, ecosystems and the global climate. At the same time, however, the amount of data generated by new monitoring systems can be enormous, and extracting patterns from such a complex system is often not straightforward with ‘traditional’ data analysis and numerical modelling.

It seems more and more clear that machine learning algorithms could be powerful tools for investigating the intrinsic complexity of the atmosphere, and the number of works using such methodologies is rapidly rising in specialized journals. Some of the most interesting applications of machine learning in atmospheric science include air quality forecasting, weather normalization, single-particle classification, instrument development and the assessment of the health, social and climate impacts of atmospheric pollution. What follows is certainly not an extensive literature review of the various machine learning models used in atmospheric science, but rather a summary of these recurring themes, with references to a few works I have come across as examples.

Weather normalization of pollutant concentrations

Seasonal and day-to-day variations in the concentration of harmful pollutants are often pronounced and can make the recognition of smaller but significant long-term variations less straightforward than one would wish. For example, if we want to look at how effective the regulation of certain pollutant emissions is in decreasing their concentration in the atmosphere, ideally we would want a method that removes the variability in our measurements due to meteorology and seasonality and leaves us with a ‘normalized’ long-term trend. These kinds of approaches are referred to as ‘weather normalization’ and they remove the effect of meteorological parameters such as (but not limited to) rainfall, wind speed and direction, radiation intensity, temperature and relative humidity.

A weather normalization approach based on predictive random forest models was recently proposed by Grange and coworkers. It has the advantages of being applicable to time series of virtually any pollutant and of estimating the importance of each meteorological variable included in the normalization. This makes it possible not only to calculate normalized long-term trends and evaluate the effectiveness of air quality policy, but also to gain insight into the weather variables that most determine the day-to-day variations.
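To make the idea of normalization more concrete, here is a minimal sketch of the resampling step only, written under several assumptions. The function `toy_model` is a hypothetical stand-in for a fitted predictive model (in the approach described above this would be a random forest), the single weather variable `wind_speed` and all of the numbers are invented for illustration, and the details of the published method may differ. The sketch keeps the time of each observation fixed, draws weather at random from the whole record, and averages the resulting predictions.

```python
import random
from statistics import mean


def weather_normalise(predict, times, met_records, n_draws=200, seed=0):
    """Average model predictions over resampled weather for each time step."""
    rng = random.Random(seed)
    normalised = []
    for t in times:
        # Keep the time (trend) term fixed; draw weather from the whole record.
        draws = [predict(t, rng.choice(met_records)) for _ in range(n_draws)]
        normalised.append(mean(draws))
    return normalised


def toy_model(day, met):
    """Hypothetical fitted model: slow decline in time, dilution by wind."""
    return 50.0 - 0.01 * day - 3.0 * met["wind_speed"]


def roughness(series):
    """Mean absolute change between consecutive values."""
    return mean(abs(b - a) for a, b in zip(series, series[1:]))


# Synthetic one-year daily record (illustrative values only).
rng = random.Random(42)
days = list(range(365))
met_records = [{"wind_speed": rng.uniform(0.0, 8.0)} for _ in days]
observed = [toy_model(d, m) for d, m in zip(days, met_records)]

normalised = weather_normalise(toy_model, days, met_records)

print(f"day-to-day change, observed:   {roughness(observed):.2f}")
print(f"day-to-day change, normalised: {roughness(normalised):.2f}")
```

In this toy setting the normalised series fluctuates much less from day to day than the observed one, while the slow downward trend is preserved.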

At the time this blog post was written, there were a little over 60 works in the literature using Grange’s approach, both to investigate historic time series of several pollutants and to look into the effect of COVID-19 lockdowns on pollutant concentrations and air quality.

Air quality and weather forecasting

The forecasting of pollutant concentrations (from particulate matter to nitrogen oxide and ozone) in both outdoor and indoor environments, together with weather forecasting, has been approached with various machine learning methods. When time series long enough to train a model for the prediction of future weather or pollution events are available, machine learning could provide a valid, and potentially more accurate, alternative to traditional numerical models.
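As an illustration of how a time series is typically prepared for this kind of forecasting, the sketch below turns a series into lagged inputs and targets, splits it chronologically, and computes a persistence baseline that any learned model should beat. The helper names and the hourly PM values are hypothetical and chosen only for the example; no specific published model is implied.

```python
from statistics import mean


def make_lagged(series, n_lags):
    """Return (X, y): each row of X holds the n_lags values preceding y."""
    X = [series[i - n_lags:i] for i in range(n_lags, len(series))]
    y = [series[i] for i in range(n_lags, len(series))]
    return X, y


def chronological_split(X, y, train_fraction=0.8):
    """Split without shuffling, so the test rows come after the training rows."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]


def persistence_mae(X, y):
    """Mean absolute error of 'the next value equals the most recent one'."""
    return mean(abs(target - row[-1]) for row, target in zip(X, y))


# Invented hourly PM concentrations (illustrative values only).
hourly_pm = [12.0, 14.0, 13.0, 17.0, 21.0, 19.0,
             16.0, 15.0, 18.0, 22.0, 20.0, 17.0]

X, y = make_lagged(hourly_pm, n_lags=3)
X_train, y_train, X_test, y_test = chronological_split(X, y)

print(X[0], "->", y[0])
print("persistence MAE on test rows:", persistence_mae(X_test, y_test))
```

The split is chronological rather than random because shuffling would let the model see the future of the series during training.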

Health, climate and social impacts of atmospheric pollution

It is well known that both acute and long-term exposure to particulate matter can lead to negative health effects in humans. Because of the worldwide nature of this problem, estimating PM concentrations with machine learning algorithms applied to satellite optical measurements, coupled to exposure models, is promising, especially where extensive networks of ground-level PM measurements are not available. This is particularly true in developing countries, which also happen to be among those most affected by high levels of atmospheric pollution and therefore by air pollution-related premature deaths. Additionally, unequal exposure can arise from patterns of segregation in cities, with certain social groups more likely to be exposed to bad air quality, and questions around the equal distribution of air quality monitoring stations can also be addressed with similar approaches.

Other examples of the application of machine learning models related to the health effects of atmospheric pollution include the estimation of hourly exposure to PM from biomass burning and ozone exposure during wildfires, of indoor exposure to airborne fungi that can be responsible for respiratory diseases and of the exposure to heavy metals in ultra-fine PM that can penetrate deeply into the human respiratory tract and enter the cardiovascular system. 

Furthermore, atmospheric pollutants such as CO2, methane, PM and others are responsible for climate change, one of the biggest challenges that humanity has to confront. Climate change affects many interconnected aspects of life on Earth, and ‘machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate’ (from the mission statement of Climate Change AI).

Particle classification

Aerosol particles in the atmosphere are extremely heterogeneous in terms of their chemical composition, morphology and phase state. One useful approach to tackle this heterogeneity is to characterize large numbers of individual particles one at a time and then statistically describe the analyzed particles as an ensemble.

For this reason, a few analytical methods have been developed in recent years to characterize atmospheric aerosols at the single-particle level. Machine learning classification algorithms are more and more often coupled to these techniques to attribute each particle to a specific compositional category and extract useful statistics on the observed population of particles. Both supervised classification methods and unsupervised feature extraction algorithms are used in this context. Examples of the application of machine learning to particle classification include the characterization of fungal spores, bacteria and pollens in biological aerosol, the detection of engineered nanoparticles in environmental samples and the study of long-term changes in chemical composition in the Finnish boreal forest. One of the most widely adopted algorithms in this respect is positive matrix factorization, applied either to data from aerosol mass spectrometers or to offline PM chemical analysis, with the aim of determining the emission source of the analyzed particles.
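The supervised side of this workflow can be sketched with a deliberately simple classifier. The example below uses a nearest-centroid rule, which is only a stand-in for the more capable models used in practice, and it is not positive matrix factorization. The three features (fractions of the total signal attributed to sulfate, organic and sea-salt ions), the class names and all values are hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Mean feature vector of each labelled particle class."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(vector, centroids):
    """Assign a particle to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


# Hypothetical labelled particles: [sulfate, organic, sea-salt] signal fractions.
training_features = [
    [0.10, 0.80, 0.10], [0.15, 0.75, 0.10],  # organic-rich
    [0.05, 0.05, 0.90], [0.10, 0.10, 0.80],  # sea-salt-like
    [0.80, 0.15, 0.05], [0.70, 0.20, 0.10],  # sulfate-rich
]
training_labels = ["organic", "organic", "sea salt", "sea salt",
                   "sulfate", "sulfate"]

centroids = fit_centroids(training_features, training_labels)

# Classify new particles one at a time, then describe them as an ensemble.
new_particles = [[0.12, 0.70, 0.18], [0.02, 0.08, 0.90], [0.60, 0.30, 0.10]]
predicted = [classify(p, centroids) for p in new_particles]
counts = {label: predicted.count(label) for label in centroids}

print(predicted)
print(counts)
```

The final `counts` dictionary is the kind of population-level statistic the text refers to: each particle is labelled individually and the labels are then summarized over the whole sample.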

Instrument design

Machine learning models have the potential to inform instrument design, particularly by simplifying the analytical approaches used to characterize aerosol particles. Considering the chemical heterogeneity of particles in the atmosphere, it could be tempting to combine multiple analytical approaches to correctly classify collected aerosol particles. However, in order to minimize the complexity of instrumental setups and measurement campaigns, machine learning can be used to pin down which variables are most relevant for discriminating, for example, between particles deriving from different emission sources. Ideally, instrument development would then focus only on the measurement of those specific features, thus reducing both costs and the time required for data analysis.
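One simple way to rank candidate measurements by how well they separate emission sources is a Fisher-type score, the ratio of between-class to within-class variance computed feature by feature. The sketch below is an assumption-laden illustration rather than the method of any particular study: the feature names, source labels and values are invented, and in practice model-based importance measures are often used instead.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Between-class over within-class variance for a single feature."""
    classes = sorted(set(labels))
    groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
    overall = mean(values)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups)
    within = sum(len(g) * pvariance(g) for g in groups)
    return between / within if within > 0 else float("inf")


def rank_features(table, labels):
    """Feature names sorted from most to least discriminating."""
    scores = {name: fisher_score(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical single-particle measurements from two emission sources.
sources = ["traffic"] * 4 + ["wood burning"] * 4
measurements = {
    "black_carbon_fraction": [0.60, 0.55, 0.65, 0.58, 0.20, 0.25, 0.18, 0.22],
    "potassium_signal": [0.05, 0.08, 0.04, 0.06, 0.40, 0.45, 0.38, 0.42],
    "particle_diameter_um": [0.30, 0.45, 0.25, 0.50, 0.35, 0.28, 0.48, 0.40],
}

ranking = rank_features(measurements, sources)
print(ranking)
```

With these invented numbers the particle diameter ranks last, which is the kind of result that would suggest an instrument could drop that measurement without losing much discriminating power.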

To conclude, alongside all the exciting work and the enormous potential of machine learning in atmospheric science, it seems important to remark that good practices around the implementation of machine learning algorithms need to be shared and followed by the atmospheric science community. Such good practices include broadly applicable ethical principles in artificial intelligence, algorithmic fairness, and code and data availability to ensure the reproducibility of results.


Grazia Rovelli

Grazia is a postdoctoral scholar at the Chemical Science Division at Berkeley Lab and a Data Science Fellow at D-Lab. Her research has focused on several different aspects of atmospheric chemistry and she is now interested in data science and machine learning tools applied to atmospheric pollution problems.