Focus Areas

Our research and development efforts focus on all stages of the data analysis pipeline to enable scientific discovery:

Machine Learning and Feature-based Analytics:

We develop new approaches for scalable machine learning, topological data analysis, and computer vision to enable detection of salient features, dimensionality reduction, and data modeling. Our methods target automation of complex data analysis tasks, feature-based comparison of simulations, experiments and observations, analysis across scales, and modeling of complex systems.

Scalable Algorithms and Infrastructure:

We conduct software technology R&D towards the creation of scalable data-intensive algorithms and end-to-end data science applications. We create novel analytics algorithms that make effective use of modern parallel compute architectures. We also develop in situ algorithms and software technologies that couple data processing and analysis with data generation to reduce I/O and storage cost, enable more accurate data analysis, and derive reusable data products for further analysis.

FAIR Data Science:

We develop new approaches to make scientific data, machine learning, and analytics findable, accessible, interoperable, and reusable (FAIR). From the development of data standards and methods for data modeling and provenance to online data portals, our research in this area makes scientific data and analysis methods broadly accessible to the scientific community.

Science Applications:

Across all our research thrusts we work closely with application science teams, applying our methods to uncover new knowledge in scientific data from simulations, experiments and observations. Combining methods from across our research portfolio, we develop scalable and interactive applications that integrate visualization, machine learning, and data analytics methods to solve application-specific data understanding problems.

Current Research Projects

Current funding for our group comes from the following projects:

  • Accelerating HEP Science - Inference and Machine Learning at Extreme Scale SciDAC: This project brings together ASCR and HEP researchers to develop and apply new methods and algorithms in the area of extreme-scale inference and machine learning. The research program melds high-performance computing and techniques for “big data” analysis to enable new avenues of scientific discovery. The focus is on developing powerful and widely applicable approaches to attack problems that would otherwise be largely intractable.
  • Base Program - Scalable Data-Computing Convergence and Scientific Knowledge Discovery: To answer today’s increasingly complex and data-intensive science questions in experimental, observational and computational sciences, we are developing methods in three interrelated R&D areas: (i) We are creating new scalable data analysis methods capable of running on large-scale computational platforms to respond to increasingly complex lines of scientific inquiry. (ii) Our new computational design patterns for key analysis methods will help scientific researchers take full advantage of rapidly evolving trends in computational technology, such as increasing cores per processor, deeper memory and storage hierarchies, and more complex computational platforms. The key objectives are high performance and portability across DOE computational platforms. (iii) By combining analysis and processing methods into data pipelines for use in large-scale HPC platforms—either standalone or integral to a larger scientific workflow—we are maximizing the opportunities for analyzing scientific data using a diverse collection of software tools and computational resources.
  • Berkeley Institute for Data Science (BIDS): Founded in 2013, the Berkeley Institute for Data Science (BIDS) is a central hub of research and education at UC Berkeley designed to facilitate and nurture data-intensive science. People are at the heart of BIDS. We are building a community centered on a cohort of talented data science fellows and senior fellows who are representative of the world-class researchers from across campus and are leading the data science revolution within their disciplines.
  • Calibrated And Systematic Characterization, Attribution, And Detection Of Extremes (CASCADE): Changes in the risk of extreme weather events may pose some of the greatest hazards to society and the environment as the climate system changes due to global warming. The CASCADE project will advance the Nation’s ability to identify and project climate extremes and how they are impacted by environmental drivers.
  • Center for Advanced Mathematics for Energy Research (CAMERA): The Center for Advanced Mathematics for Energy Research (CAMERA) is a coordinated team of applied mathematicians, computer scientists, light scientists, materials scientists, and computational chemists focusing on targeted science problems, with initial involvements aimed at the Advanced Light Source (ALS), Molecular Foundry, and National Center for Electron Microscopy (NCEM). Together, our goal is to accelerate the transfer of new mathematical ideas to experimental science.
  • ECP ALPINE - Algorithms and Infrastructure for In Situ Visualization and Analysis: Data analysis and visualization are critical to decoding complex simulations and experiments, debugging simulation codes, and presenting scientific insights. ECP ALPINE extends DOE investments in basic research, software development, and product deployment for visualization and analysis by delivering exascale-ready in situ data analysis and visualization capabilities to the Exascale Computing Project (ECP).
  • IDEAL – Image across Domains, Experiments, Algorithms and Learning: IDEAL focuses on computer vision and machine learning algorithms for timely interpretation of experimental data recorded as 2D or multispectral images. One of our projects is pyCBIR (content-based image retrieval in Python), which searches millions of scientific images using their fingerprints and ranks the results by image content similarity. These fingerprints, and other compact data representations, are essential to the new regimes of data collection at the experimental facilities we support, such as the LBNL Advanced Light Source.
  • Interactive Machine Learning for Tomogram Segmentation and Annotation: Despite tremendous progress in biological imaging that has yielded tomograms with ever-higher resolutions, the segmentation of cell tomograms into organelles and proteins remains a challenging task. The difficulty is most extreme in the case of cryo-electron tomography (cryo-ET), where the samples exhibit inherently low contrast due to the limited electron dose that can be applied during imaging before radiation damage occurs. The tomograms have a low signal-to-noise ratio (SNR), as well as missing-wedge artifacts caused by the limited sample tilt range that is accessible during imaging. While SNR can be improved by applying contrast enhancement and edge detection methods, these algorithms can also generate false connectivity and additional artifacts that degrade the results produced by automatic segmentation programs. Automatic segmentation approaches are of great interest if these challenges can be overcome, but today they are precluded by the complexity of the specimen and the SNR limitations described above. State-of-the-art machine learning results are not generally suitable for deep mining. In fact, the situation in cryo-ET is quite the opposite: the highest quality segmentations are produced by hand, representing effort levels ranging from days to months. Segmentation tools could be vastly improved if they were constructed to take into account prior knowledge, minimizing the sensitivity to noise and false connectivity. To the best of our knowledge, there are no methods using specific contextual information about biological structures as restraints for segmentation. Nor are there approaches that incorporate active learning with feedback from the user, which would provide guidance as to the correctness of the segmentation.
We are developing new machine learning techniques to facilitate the segmentation, extraction, visualization, and annotation of biological substructures within 3D tomograms obtained from a variety of imaging modalities.
  • Neurodata without Borders (NWB): Neurodata without Borders (NWB) develops a unified data format for cellular-based neurophysiology data, focused on the dynamics of groups of neurons measured under a large range of experimental conditions. The NWB team consists of neuroscientists and software developers who recognize that adoption of a unified data format is an important step toward breaking down the barriers to data sharing in neuroscience.
  • RAPIDS – SciDAC Institute for Computer Science and Data: The objective of RAPIDS is to assist DOE Office of Science application teams in overcoming computer science and data challenges in the use of DOE supercomputing resources to achieve science breakthroughs. To accomplish this objective, the Institute will solve computer science and data technical challenges for SciDAC and SC science teams, work directly with SC scientists and DOE facilities to adopt and support our technologies, and coordinate with other DOE computer science and applied mathematics activities to maximize impact on SC science.
  • SENSEI - Extreme-scale In Situ Methods and Infrastructure: A fact of life on current and future HPC platforms is the increasingly arduous task of writing out data to persistent storage, which impedes or prevents scientific discovery as data goes unanalyzed. In situ methods work around this problem by performing analysis, visualization, and related processing while the data is still resident in memory. The SENSEI project focuses on a set of challenges relating to effectively using in situ methods and infrastructure at scale.
  • ScienceSearch: Automated Metadata using Machine Learning: Next-generation scientific discoveries rely on the insights we can derive from the large amounts of data that are produced through simulations and experimental and observational facilities. The goal of ScienceSearch is to investigate the use of machine learning techniques to generate metadata automatically, which will enable search on data. Enabling search on data will accelerate scientific discoveries through virtual experiments and multidisciplinary, multimodal data assimilation.
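
To illustrate the content-based image retrieval idea described under IDEAL above, the following is a minimal sketch of fingerprint-based search. It is not pyCBIR's implementation or API. The function names, the histogram fingerprint, and the cosine similarity measure are all assumptions chosen for brevity, and a production system would more likely use learned features and an approximate nearest-neighbor index.

```python
import math


def fingerprint(image, bins=8):
    """Hypothetical fingerprint: a normalized intensity histogram.

    `image` is a non-empty 2D list of floats in [0, 1].
    """
    hist = [0.0] * bins
    count = 0
    for row in image:
        for value in row:
            # Clamp so that value == 1.0 falls in the last bin.
            index = min(int(value * bins), bins - 1)
            hist[index] += 1.0
            count += 1
    return [h / count for h in hist]


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_image, database):
    """Return (name, score) pairs sorted from most to least similar.

    `database` maps image names to precomputed fingerprints, so the
    stored images themselves are never re-read at query time.
    """
    query_fp = fingerprint(query_image)
    scores = [(name, cosine_similarity(query_fp, fp))
              for name, fp in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 2x2 "images" (assumed data, for illustration only).
dark = [[0.05, 0.10], [0.10, 0.15]]
bright = [[0.90, 0.95], [0.85, 0.99]]
mixed = [[0.05, 0.95], [0.10, 0.90]]

database = {
    "dark": fingerprint(dark),
    "bright": fingerprint(bright),
    "mixed": fingerprint(mixed),
}

query = [[0.08, 0.12], [0.02, 0.11]]
ranking = rank_by_similarity(query, database)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because only the compact fingerprints are compared at query time, the cost of a search depends on the fingerprint length, not on the image size. This is why such representations matter when facilities produce images faster than they can be inspected.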