Pattern Detection in Climate Data

Problem Statement and Goals

Modern climate simulations produce massive amounts of data. Commonly used models such as the Community Earth System Model (CESM) and the Weather Research and Forecasting model (WRF) routinely produce tens of gigabytes of data, and the next generation of global cloud-resolving simulations will produce terabytes of data per timestep. Coordinated multi-model ensembles of climate model simulations such as the Coupled Model Intercomparison Project (CMIP3) contain hundreds of gigabytes today, and CMIP5 will contain tens of terabytes in the near future. These ensembles are a crucial tool for international and national organizations, such as the Intergovernmental Panel on Climate Change (IPCC), tasked with assessing the human role in climate change and evaluating potential adaptation and mitigation strategies for reducing that role.

While there are a number of opportunities to mine climate datasets for scientifically important features, a manual examination of the multivariate timeseries output is simply infeasible. We need automated techniques that can search and analyze these datasets for important phenomena. Climate datasets also expose abundant parallelism across spatial locations, timesteps, and ensemble members; we need to design scalable tools that can exploit these modalities.

In this work, we are:

Implementation and Results

Figure 1. Visualization of water vapor in a CAM5 0.25 degree simulation.

We are interested in developing tools that can automatically detect and track a number of extreme weather phenomena (such as tropical cyclones, extra-tropical cyclones, atmospheric rivers, and blocking events) in high-resolution model output. Figure 1 shows a typical snapshot from a high-resolution CAM5 simulation.

We have designed and implemented TECA (Toolkit for Extreme Climate Analysis), a framework for creating end-to-end climate analysis applications. The toolkit facilitates development of new feature detection techniques and provides a number of convenience functions (e.g., loading NetCDF files and calendar support). Most importantly, the toolkit supports spatial and temporal parallelism, which in turn enables a pattern detection code to process massive outputs. We present case studies demonstrating the application of TECA to Tropical Cyclone and Atmospheric River detection.

Figure 2. Tropical cyclone tracks for 19 years of CAM5 output. Tracks are colored by storm category on the Saffir–Simpson scale.

Tropical Cyclone Detection

We incorporated the TSTORMS code (originally developed at the Princeton Geophysical Fluid Dynamics Lab) into the TECA framework for detecting tropical cyclones (TCs). The code consists of two steps: detecting candidate points and stitching the detections into trajectories. The detection step applies threshold conditions to a number of variables (sea level pressure, vorticity, and temperature). All grid points that satisfy these multivariate conditions become candidates for the stitching step, which imposes spatio-temporal constraints involving the speed of tropical storms, their tendency to move poleward and westward, and a specified minimum temporal duration. These criteria are used to assign candidate points to individual tracks and to resolve ties. The final outputs are individual storm tracks, which can then be counted and categorized by maximum wind speed. The computationally expensive detection step is embarrassingly parallel across timesteps, whereas the stitching step takes little runtime and can be executed serially. We used TECA to run the detection step in parallel and aggregated the results for stitching.
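The two-step structure described above can be illustrated with a minimal Python sketch. It is not the TSTORMS or TECA implementation: the threshold values, the field names (slp, vort, temp_anom), and the nearest-neighbor linking rule are all assumptions made for illustration, and the poleward and westward tendency criteria are omitted. A thread pool stands in for whatever parallel workers a real run would use.

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot

# Hypothetical values; the actual TSTORMS criteria are not given in the text.
MAX_PRESSURE = 1000.0   # sea level pressure must fall below this (hPa)
MIN_VORTICITY = 1.0e-4  # vorticity must exceed this (1/s)
MIN_WARM_CORE = 1.0     # temperature anomaly must exceed this (K)
MAX_STEP = 2.0          # max distance (grid cells) a storm may move per timestep
MIN_DURATION = 3        # min number of timesteps for a valid track


def detect(timestep):
    """Return (t, row, col) for every grid point that passes all thresholds."""
    t, fields = timestep
    hits = []
    for i, row in enumerate(fields["slp"]):
        for j, slp in enumerate(row):
            if (slp < MAX_PRESSURE
                    and fields["vort"][i][j] > MIN_VORTICITY
                    and fields["temp_anom"][i][j] > MIN_WARM_CORE):
                hits.append((t, i, j))
    return hits


def stitch(candidates):
    """Serially link candidates into tracks; ties go to the nearest track."""
    tracks = []
    for cand in sorted(candidates):
        t, i, j = cand
        best = None
        for track in tracks:
            pt, pi, pj = track[-1]
            dist = hypot(i - pi, j - pj)
            # Only extend tracks whose last point is in the previous timestep.
            if pt == t - 1 and dist <= MAX_STEP:
                if best is None or dist < best[0]:
                    best = (dist, track)
        if best is None:
            tracks.append([cand])
        else:
            best[1].append(cand)
    return [trk for trk in tracks if len(trk) >= MIN_DURATION]


def find_tracks(timesteps, workers=4):
    """Detect per timestep in parallel, then stitch the pooled results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_step = list(pool.map(detect, timesteps))
    return stitch([c for hits in per_step for c in hits])
```

Each timestep is passed as a (t, fields) pair, where fields maps the three variable names to 2-D grids. Because detect touches only its own timestep, the map over timesteps needs no communication; only the small list of candidates is gathered for the serial stitch.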

Atmospheric River Detection

Figure 3. Sample CAM5 AR detections from our parallel code. AR events in February and December of 1991 are plotted. Note that our method is able to detect AR events of different shapes and sizes.

We applied the TECA framework to detect Atmospheric Rivers (ARs) in both model output and observational data. The AR detection process consists of a detection phase and a post-processing phase. The detection algorithm computes a time-averaged integrated water vapor field, applies a threshold followed by connected component analysis, and finally checks the geometric connectivity of each resulting structure (origin in the tropics and landfall on the US west coast). Post-processing operations on AR detections include binning by month as well as length, width, and flux calculations.
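The detection phase described above (time average, threshold, connected components, geometric check) can be sketched in a few lines of Python. This is an illustrative sketch only, not the TECA code: the threshold value, the use of 4-connectivity, and the representation of the tropics as a set of grid rows and the coast as a set of grid cells are all assumptions.

```python
from collections import deque

IWV_THRESHOLD = 20.0  # hypothetical integrated water vapor threshold


def time_average(frames):
    """Average a list of equally shaped 2-D grids cell by cell."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[i][j] for f in frames) / n for j in range(cols)]
            for i in range(rows)]


def connected_components(mask):
    """Group 4-connected True cells using a breadth-first search."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for i in range(rows):
        for j in range(cols):
            if not mask[i][j] or seen[i][j]:
                continue
            seen[i][j] = True
            queue, cells = deque([(i, j)]), []
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(cells)
    return components


def detect_rivers(frames, tropics_rows, coast_cells):
    """Keep components that touch both the tropics and the coast."""
    mean = time_average(frames)
    mask = [[value > IWV_THRESHOLD for value in row] for row in mean]
    rivers = []
    for cells in connected_components(mask):
        in_tropics = any(r in tropics_rows for r, _ in cells)
        makes_landfall = any(cell in coast_cells for cell in cells)
        if in_tropics and makes_landfall:
            rivers.append(cells)
    return rivers
```

Here frames is a list of 2-D grids covering the averaging window, tropics_rows is a set of row indices, and coast_cells is a set of (row, col) pairs. Each averaging window can be processed independently of the others, which is what makes the detection phase straightforward to parallelize over time.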

We successfully developed parallel versions of the TC and AR detection tools and applied them to model output and observational data. The TC detection code was able to process 19 years of CAM5 output in about two hours on 7,000 cores. In contrast, we estimate the runtime of a serial version of this code on this problem to be about 583 days. The AR detection code processed 8 years of satellite imagery data on 8,500 cores in three seconds.
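As a quick check on what the TC figures imply, the speedup and parallel efficiency can be worked out directly from the numbers quoted above. The function name below is hypothetical, and since the parallel runtime is given only as "about two hours", the results are approximate.

```python
def scaling_summary(serial_days, parallel_hours, cores):
    """Return (speedup, efficiency) for a serial estimate vs. a parallel run."""
    serial_hours = serial_days * 24
    speedup = serial_hours / parallel_hours
    efficiency = speedup / cores  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# Figures taken from the text: 583 days serial, about 2 hours on 7,000 cores.
speedup, efficiency = scaling_summary(583, 2, 7000)
print(f"speedup ~ {speedup:.0f}x, efficiency ~ {efficiency:.1%}")
```

The implied speedup is roughly 7,000x, that is, close to linear in the core count. This is consistent with the detection step being embarrassingly parallel, although the text does not say how the serial estimate was obtained, so it may itself be an extrapolation from the parallel run.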

This work appeared in the Third Workshop on Data Mining in Earth System Science [1].


Scientists are now able to use this capability to characterize extreme climate phenomena and assess the quality of model output. LBL climate scientists used the TC detection tool to verify that the CAM5 model produces more realistic TC counts as the model resolution increases from 1 degree to 0.25 degrees. They were also able to verify that the onset of the tropical cyclone season in the North Atlantic was very close to real-world observations. Similarly, the AR detection tool was used to automatically characterize the different shapes and sizes of atmospheric rivers, a diversity that was previously under-appreciated by the scientific community.


[1] Prabhat, Oliver Rübel, Surendra Byna, Kesheng Wu, Fuyu Li, Michael Wehner, and E. Wes Bethel. TECA: A Parallel Toolkit for Extreme Climate Analysis. In Third Workshop on Data Mining in Earth System Science (DMESS 2012) at the International Conference on Computational Science (ICCS 2012), Omaha, Nebraska, June 2012. LBNL-5352E.


Prabhat, Suren Byna, Oliver Rubel, Michael Wehner