Our group’s research efforts focus on all stages of the visual data analysis pipeline:
We conduct software technology R&D towards the creation of scalable
data-intensive analyses and end-to-end data science applications. We develop
scalable analysis libraries (e.g., DIY, DAGR) and toolkits (e.g., BASTet, TECA),
application data management, query and I/O libraries (e.g., PyNWB), and
end-to-end data systems (e.g., OpenMSI) that run on DOE HPC platforms.
Parallel, High-performance Algorithms:
We create algorithms that make effective use of the computational resources
provided by massively parallel computers.
In Situ Methods:
On current and future HPC platforms it is becoming increasingly arduous to
write out data to persistent storage, impeding or preventing scientific
discovery as data goes unanalyzed. Our group conducts R&D on in situ
algorithms and software technology that work around this problem by performing
analysis, visualization, and related processing while the data is still
resident in memory.
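The in situ idea can be illustrated with a minimal sketch. It is not the group's software: `simulate_step`, `in_situ_summary`, and `run` are hypothetical names, and the random-walk "simulation" is only a stand-in for a real solver. The point is that each time step is reduced to a small summary while the field is still in memory, so the full field never has to be written to persistent storage.

```python
import numpy as np

def simulate_step(state, rng):
    """Hypothetical stand-in for one time step of a simulation."""
    return state + 0.1 * rng.standard_normal(state.shape)

def in_situ_summary(state, bins, value_range):
    """Reduce the in-memory field to a small summary: extrema plus a histogram."""
    counts, _ = np.histogram(state, bins=bins, range=value_range)
    return {"min": float(state.min()), "max": float(state.max()), "hist": counts}

def run(num_steps=5, shape=(64, 64), bins=16, seed=0):
    rng = np.random.default_rng(seed)
    state = np.zeros(shape)
    summaries = []
    for _ in range(num_steps):
        state = simulate_step(state, rng)
        # Analyze while the data is still resident in memory;
        # only the compact summary is retained, not the full field.
        summaries.append(in_situ_summary(state, bins, (-2.0, 2.0)))
    return summaries

summaries = run()
print(len(summaries), "summaries kept;", summaries[-1]["hist"].size, "bins each")
```

In a production setting the summary step would be replaced by whatever analysis or visualization the science requires; the structure of the loop is what the sketch is meant to show.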
Interactive Applications:
Interactive applications combine visualization and data analysis to solve
application-specific data understanding problems. We develop visualization tools
for a wide range of applications, such as three-dimensional gene expression,
brain connectivity, and spectral distributions.
Topological Data Analysis:
Topological Data Analysis identifies salient features in data, thus
reducing data complexity and dimensionality. Detected features enable
comparisons across simulations, experiments and observations. Our research in
applied topological analysis focuses on parallel algorithms and large scale
Computer Vision and Image Processing:
We develop new approaches for scientific image analysis for use on HPC
platforms that are applicable to a wide array of problems challenged by increasing
data size and complexity. Computer vision, feature modeling, analysis, and
Machine Learning:
We develop new approaches for scalable machine learning
and use machine learning methods for scientific data understanding.
Current Research Projects
Current funding for our group comes from the following projects:
- Accelerating HEP Science - Inference and Machine Learning at Extreme Scale SciDAC: This project brings together ASCR and HEP researchers to develop and apply new methods and algorithms in the area of extreme-scale inference and machine learning. The research program melds high-performance computing and techniques for “big data” analysis to enable new avenues of scientific discovery. The focus is on developing powerful and widely applicable approaches to attack problems that would otherwise be largely intractable.
- Base Program - Scalable Data-Computing Convergence and Scientific Knowledge Discovery: To answer today’s increasingly complex and data-intensive science questions in experimental, observational and computational sciences, we are developing methods in three interrelated R&D areas: (i) We are creating new scalable data analysis methods capable of running on large-scale computational platforms to respond to increasingly complex lines of scientific inquiry. (ii) Our new computational design patterns for key analysis methods will help scientific researchers take full advantage of rapidly evolving trends in computational technology, such as increasing cores per processor, deeper memory and storage hierarchies, and more complex computational platforms. The key objectives are high performance and portability across DOE computational platforms. (iii) By combining analysis and processing methods into data pipelines for use in large-scale HPC platforms—either standalone or integral to a larger scientific workflow—we are maximizing the opportunities for analyzing scientific data using a diverse collection of software tools and computational resources.
- Berkeley Institute for Data Science (BIDS): Founded in 2013, the Berkeley Institute for Data Science (BIDS) is a central hub of research and education at UC Berkeley designed to facilitate and nurture data-intensive science. People are at the heart of BIDS. We are building a community centered on a cohort of talented data science fellows and senior fellows who are representative of the world-class researchers from across campus and are leading the data science revolution within their disciplines.
- Calibrated And Systematic Characterization, Attribution, And Detection Of Extremes (CASCADE): Changes in the risk of extreme weather events may pose some of the greatest hazards to society and environment as the climate system changes due to global warming. The CASCADE project will advance the Nation’s ability to identify and project climate extremes and how they are impacted by environmental drivers.
- Center for Advanced Mathematics for Energy Research (CAMERA): The Center for Advanced Mathematics for Energy Research (CAMERA) is a coordinated team of applied mathematicians, computer scientists, light scientists, materials scientists, and computational chemists focusing on targeted science problems, with initial involvements aimed at the Advanced Light Source (ALS), Molecular Foundry, and National Center for Electron Microscopy (NCEM). Together, our goal is to accelerate the transfer of new mathematical ideas to experimental science.
- ECP ALPINE - Algorithms and Infrastructure for In Situ Visualization and Analysis: Data analysis and visualization are critical to decoding complex simulations and experiments, debugging simulation codes, and presenting scientific insights. ECP ALPINE extends DOE investments in basic research, software development, and product deployment for visualization and analysis by delivering exascale-ready in situ data analysis and visualization capabilities to the Exascale Computing Project (ECP).
- IDEAL – Image across Domains, Experiments, Algorithms and Learning: IDEAL focuses on computer vision and machine learning algorithms for timely interpretation of experimental data recorded as 2D or multispectral images. One of our projects is pyCBIR, which stands for content-based image retrieval in Python; it searches through millions of scientific images using their fingerprints and ranks the results by image content similarity. These fingerprints, and other compact data representations, are essential to the new regimes of data collection at the experimental facilities that we support, such as the LBNL Advanced Light Source.
- Interactive Machine Learning for Tomogram Segmentation and Annotation: Despite tremendous progress in biological imaging that has yielded tomograms with ever-higher resolutions, the segmentation of cell tomograms into organelles and proteins remains a challenging task. The difficulty is most extreme in the case of cryo-electron tomography (cryo-ET), where the samples exhibit inherently low contrast due to the limited electron dose that can be applied during imaging before radiation damage occurs. The tomograms have a low signal-to-noise ratio (SNR), as well as missing-wedge artifacts caused by the limited sample tilt range that is accessible during imaging. While SNR can be improved by applying contrast enhancement and edge detection methods, these algorithms can also generate false connectivity and additional artifacts that degrade the results produced by automatic segmentation programs. Automatic segmentation approaches would be of great interest if these challenges could be overcome, but today they are precluded by the complexity of the specimen and the SNR limitations described above. State-of-the-art machine learning results are not generally suitable for deep mining. In fact, the situation in cryo-ET is quite the opposite: the highest-quality segmentations are produced by hand, representing effort levels ranging from days to months. Segmentation tools could be vastly improved if they were constructed to take prior knowledge into account, minimizing the sensitivity to noise and false connections. To the best of our knowledge, there are no methods that use specific contextual information about biological structures as restraints for segmentation, nor are there approaches that incorporate active learning with feedback from the user, which would provide guidance as to the correctness of the segmentation. We are developing new machine learning techniques to facilitate the segmentation, extraction, visualization, and annotation of biological substructures within 3D tomograms obtained from a variety of imaging modalities.
- Neurodata without Borders (NWB): Neurodata without Borders (NWB) develops a unified data format for cellular-based neurophysiology data, focused on the dynamics of groups of neurons measured under a large range of experimental conditions. The NWB team consists of neuroscientists and software developers who recognize that adoption of a unified data format is an important step toward breaking down the barriers to data sharing in neuroscience.
- RAPIDS – SciDAC Institute for Computer Science and Data: The objective of RAPIDS is to assist DOE Office of Science application teams in overcoming computer science and data challenges in the use of DOE supercomputing resources to achieve science breakthroughs. To accomplish this objective, the Institute will solve computer science and data technical challenges for SciDAC and SC science teams, work directly with SC scientists and DOE facilities to adopt and support our technologies, and coordinate with other DOE computer science and applied mathematics activities to maximize impact on SC science.
- SENSEI - Extreme-scale In Situ Methods and Infrastructure: A fact of life on current and future HPC platforms is the increasingly arduous task of writing out data to persistent storage, which impedes or prevents scientific discovery as data goes unanalyzed. In situ methods work around this problem by performing analysis, visualization, and related processing while the data is still resident in memory. The SENSEI project focuses on a set of challenges relating to effectively using in situ methods and infrastructure at scale.
- ScienceSearch: Automated Metadata using Machine Learning: Next-generation scientific discoveries rely on the insights we can derive from the large amounts of data that are produced through simulations and experimental and observational facilities. The goal of ScienceSearch is to investigate using machine learning techniques to generate automated metadata that will enable search on data. Enabling search on data will accelerate scientific discoveries through virtual experiments, multidisciplinary and multimodal data assimilation.
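
The fingerprint-based search described in the IDEAL item above can be sketched in a few lines. This is not pyCBIR's API or its actual method: `fingerprint`, `build_index`, and `search` are hypothetical names, the intensity-histogram descriptor is an illustrative assumption, and the images are synthetic. The sketch only shows the general pattern of reducing each image to a compact vector and ranking candidates by similarity to a query.

```python
import numpy as np

def fingerprint(image, bins=32):
    """Compact descriptor: a unit-length intensity histogram (illustrative choice)."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def build_index(images, bins=32):
    """Stack one fingerprint per image into a (num_images, bins) array."""
    return np.stack([fingerprint(img, bins) for img in images])

def search(index, query_image, top_k=3, bins=32):
    """Return (image_id, score) pairs, most similar first."""
    q = fingerprint(query_image, bins)
    scores = index @ q  # cosine similarity, since rows and q have unit length
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
# Synthetic "images" with values in [0, 1]; each has a different mean brightness.
images = [np.clip(rng.normal(loc=m, scale=0.05, size=(32, 32)), 0.0, 1.0)
          for m in (0.2, 0.4, 0.6, 0.8)]
index = build_index(images)
results = search(index, images[2], top_k=2)
print(results)
```

Because only the fingerprints are compared, the cost of a query depends on the descriptor length rather than the image size, which is what makes this pattern attractive for large collections.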