Uncertainty Quantification for Air Quality Forecasting using Multiple Data Sources

Authors: Carl Malings, K. Emma Knowland, and Stephen Cohn

Poor air quality is a major global environmental challenge, leading to reduced visibility, damage to crops, exacerbation of many chronic health conditions, and millions of premature deaths each year. The ability to accurately forecast future air quality can help individuals and public health organizations take preventative action to reduce some of these impacts. However, any effort to predict the future is inherently uncertain, so being able to appropriately assess and convey the relative confidence of these forecasts is an important part of making them actionable.

To this end, GMAO scientists, in collaboration with university and private sector researchers, have developed a methodology that combines multiple sources of air quality information via a data fusion technique (as discussed in a previous science snapshot) to produce spatially refined forecasts of hourly air quality pollutant concentrations for the next 96 hours (4 days), together with estimates of the uncertainty in these forecasts. The data fusion operates in four successive phases. In phase 1, global atmospheric composition forecasts from the GMAO GEOS-CF system provide an initial forecast at relatively coarse (0.25 degree, about 25 kilometer) spatial scale. In phase 2, satellite information on trace gas concentrations from ESA TROPOMI data products is used to spatially refine these forecasts, based on past comparisons between GEOS-CF and the satellite data. In phase 3, several days of historical in-situ air quality monitoring data from the US EPA’s Air Data service (or other local monitoring information, as available) are integrated to correct for systematic biases in the earlier phases. Finally, in phase 4, the most recent measurements from these in-situ monitors are used to update the forecast in near-real time. During each phase, the spatial and temporal variability of each data source, as well as differences and covariances among the data sources, are also assessed; this informs the data fusion methodology’s estimates of the accuracy of the forecast it produces. Figure 1 provides a conceptual overview of how the different data sources fit together.
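To illustrate the idea behind this kind of sequential fusion, the sketch below applies a simple inverse-variance (Kalman-style) update, folding a sequence of increasingly local information sources into an initial coarse forecast. This is a minimal conceptual sketch assuming Gaussian errors and a single scalar concentration; the function and all numerical values are illustrative stand-ins, not the actual GEOS-CF data fusion algorithm.

```python
def fuse(prior_mean, prior_var, obs_mean, obs_var):
    """Precision-weighted (inverse-variance) combination of a prior
    estimate with one new data source; returns fused mean and variance."""
    w = prior_var / (prior_var + obs_var)  # weight given to the new source
    fused_mean = prior_mean + w * (obs_mean - prior_mean)
    fused_var = prior_var * obs_var / (prior_var + obs_var)
    return fused_mean, fused_var

# Phase 1: a coarse model forecast (e.g., NO2 in ppb) with its error variance
mean, var = 12.0, 9.0

# Phases 2-4: successively fold in satellite, historical in-situ, and
# recent in-situ information (illustrative numbers, not real data)
for obs_mean, obs_var in [(10.0, 4.0), (11.0, 2.0), (10.5, 1.0)]:
    mean, var = fuse(mean, var, obs_mean, obs_var)
```

Note that each update shrinks the variance: every additional data source makes the fused estimate more confident, which is how a scheme like this can report uncertainty that tightens as phases 2 through 4 become available.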

Figure 1: A conceptual diagram of the information sources contributing to air quality forecasting in the data fusion methodology discussed here. In phase 1, the NASA GEOS-CF model system (blue) provides complete information in space and time, including forecasts out to 96 hours ahead, but at a relatively coarse spatial resolution, and so is unable to resolve highly localized air quality effects. In phase 2, satellite observations, from TROPOMI (orange) in this example, provide more spatial detail to capture more localized patterns. However, satellite data are only available at times when the satellite is overhead, and the satellite observes the atmospheric column rather than the near-surface layer that is most relevant for air quality. In phase 3, in-situ monitors (green), operated by regulatory agencies such as the US EPA, give accurate point-specific information, but only at relatively few locations, which might be far from a point of interest. In phase 4, the most recent information available from these monitors (purple) is used to further update the forecasts. Bringing together information from all of these sources via data fusion, and assessing how they have related to each other in the past, can both improve the forecasts and characterize the confidence we should place in them.

An evaluation of this methodology for data fusion with uncertainty quantification was conducted using air quality data collected during 2019 in the regions around San Francisco, California, and New York City, New York as case studies. The aim of this evaluation was to determine whether the methodology could produce reasonable confidence intervals for its forecasts. In other words, if the methodology were asked to predict a range of concentration values with a 75% chance of covering the true concentration, what fraction of actual measurements would fall within that range over time? Figure 2 shows one of the outcomes of this evaluation. Across the four phases of the data fusion process, and for forecasts out to 96 hours, the fraction of measurements falling inside the forecast’s 75% confidence interval was assessed during September 2019 for each of 25 NO2 monitoring sites in the San Francisco area. While this coverage fraction varied from site to site, from a low of about 40% to a high of about 95%, the median performance was close to the target value of 75%, indicated by the horizontal dotted line. It was important to assess the uncertainty quantification across all four phases, rather than only for the final phase (phase 4), since limitations on data availability (e.g., little to no satellite data due to persistent cloud cover, or delays in obtaining the latest in-situ data) might prevent the method from moving past any given phase in a practical application.

Figure 2: An assessment of the fraction of measured NO2 concentrations falling within the 75% confidence interval (CI) of the forecast, plotted on the vertical axis. The boxplots indicate the spread of fractions across 25 evaluation sites in San Francisco, California, during September 2019. Evaluations were conducted across four phases of the data fusion process (indicated by colors) and for forecasts up to 96 hours ahead (horizontal axis). In all cases, the fraction is typically around the expected result of 75%, indicated by the horizontal dotted line.

Investigating the site-to-site differences in more detail, it was found that sites very close to highways, which are a major source of NO2, accounted for most of the sites with lower coverage. This is to be expected given the information included in the methodology: even the high-resolution satellite information is too coarse to capture the very sharp spatial gradients of NO2 that occur within a few dozen meters of a highway, where these monitors are sited. Data from these sites were purposefully withheld from the data fusion during the evaluation, so that the results would fairly represent the performance of the methodology at locations without in-situ air quality monitors. In future iterations of the methodology, additional information about the presence of potential local sources might be brought in to mitigate this limitation. Nevertheless, the results indicate that the methodology can convey credible estimates of confidence in the forecasts it provides across a range of conditions and forecast lead times. Such estimates can inform public health managers about the probabilities of different air quality risk scenarios and help them make informed decisions about which response actions might be the most robust in the face of uncertainty. This methodology is being implemented as part of an ongoing NASA Health and Air Quality applied science project, as discussed in a previous science snapshot.

References:

Malings, C., Knowland, K. E., Pavlovic, N., Coughlin, J. G., King, D., Keller, C. A., Cohn, S. E., & Martin, R. V. (2024). Air quality estimation and forecasting via data fusion with uncertainty quantification: Theoretical framework and preliminary results. Journal of Geophysical Research: Machine Learning and Computation, 1(4), e2024JH000183. https://doi.org/10.1029/2024JH000183
