LBL CRD Machine Learning Reading Group (2006-2007)

Potential Topics

Below is a preliminary list of topics in machine learning and statistical data analysis, along with several subtopics within each area and some DOE applications to which they may be relevant. In discussing the mathematical and algorithmic details of these methods and models, we want to think critically about which are scalable to large scientific data sets and what research efforts are necessary to scale them to these applications.

Please send suggestions for additional or alternative 1) main topics, 2) subtopics & methods, 3) applications, or 4) readings to romano@hpcrd.lbl.gov. Many of these areas have overlapping subtopics, so some topics may be split into several parts.

Date

Topic

Definition

Recent and Classical Methods

Applications

Readings

Discussion Leader(s)

November 9, 2006
50F-1647
12noon-1:30pm
Introductory Meeting
  • Introductions; purpose of group
  • Review topics, definitions
  • Discuss areas of maximum interest, potential applications
  • Prediction Machines. Bishop, C. M., et al., 2020 Science, Microsoft Research, pp. 34-35, 2006.
  • The Discipline of Machine Learning. Mitchell, T., Technical Report CMU-ML-06-108, July 2006.
  • Statistical Learning/Pattern Recognition Glossary, Tom Minka.
  • Raquel Romano
November 20, 2006
50B-4025
3-4:30pm
Applications
  • Review target applications and their machine learning problems in preparation for January DOE workshop
  • Define key techniques and discuss how they can be extended to petascale level if necessary
  • Application Summaries
  • Raquel Romano
November 30, 2006
50B-4205
12noon-1:30pm
Example Methods in Biology & Climate
  • Clique-finding; biclustering
  • Bayesian hierarchical space-time modeling; MCMC approaches
  • Protein Fractionation
  • Climate Data Assimilation
  • Combining Ensemble Run Simulations from Multiple Models
  • Hierarchical Bayesian Space-Time Models (1998), Christopher K. Wikle, L. Mark Berliner, Noel Cressie
  • Multivariate Bayesian Analysis of Atmosphere-Ocean General Circulation Models (2006), Furrer, R, Sain, S. R., Nychka, D., and Meehl, G. A.
  • Claudia Tebaldi Talks (NCAR/Stanford)
  • Chris Ding
  • Raquel Romano
December 11, 2006
50F-1647
11:30am-1pm
High-Energy Physics: Track Reconstruction
  • Adaptive methods with application to track reconstruction at LHC
  • Track reconstruction in high density environment
  • Ali Pinar
  • Juan Meza
January 10, 2007
50B-4205
2pm-3:30pm
Classification: Theory
Statistical procedures that predict the group to which a given item belongs, using quantitative measurements or characteristics inherent to the item (referred to as features, traits, variables, etc.). Prediction models are built from training sets of items previously labeled according to group membership. Supervised learning.
  • Ensemble Learning (Boosted Decision Trees; Random Forests)
  • Kernel Methods (Support Vector Machines)
  • Linear Discriminants (LDA)
  • K-Nearest Neighbors
  • Naive Bayes
  • Neural Networks
  • Astrophysics: image search
  • Bioinformatics
  • High-energy physics: track recognition
  • See presentations for more detailed lists of references.
  • Ensemble Learning. Dietterich, T. G. In The Handbook of Brain Theory and Neural Networks, Second edition, Cambridge, MA: The MIT Press, 2002. 405-408.
  • A tutorial on nu-support vector machines. P.-H. Chen, C.-J. Lin, and B. Schölkopf. Applied Stochastic Models in Business and Industry, 21 (2005), 111-136.
  • Juan - Ensemble Learning
  • Raquel - Naive Bayes, LDA, Fisher Linear, Logistic Regression, Kernel Methods
  • Ali - K-Nearest Neighbors, Neural Networks
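As a small illustration of the supervised setting defined for this session, the sketch below implements K-Nearest Neighbors, one of the listed methods, in plain Python. The function name `knn_predict`, the toy training set, and the choice of Euclidean distance are assumptions made for illustration only; they are not taken from the session materials.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the group of `query` by majority vote among its k nearest
    labeled training items. `train` is a list of (features, label) pairs."""
    # Rank the labeled items by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # The predicted group is the most common label among the k nearest.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two features per item, two groups.
train = [((1.0, 1.1), "a"), ((0.9, 1.0), "a"), ((1.2, 0.8), "a"),
         ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.8, 3.0), "b")]

print(knn_predict(train, (1.0, 1.0)))
print(knn_predict(train, (3.0, 3.0)))
```

Each prediction here scans the whole training set, so cost grows linearly with its size; that is one reason the scalability of such methods to large scientific data sets is a discussion point for the group.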
January 24, 2007
50B-4205
1pm-2:30pm
Clustering: Theory
Statistical techniques for partitioning a data set into subsets (clusters), so that the data in each subset share some common trait, typically proximity according to some defined distance measure. Unsupervised learning.
  • Spectral clustering
  • Hierarchical clustering
  • K-Means
  • Expectation Maximization (EM)
  • Computational Biology: gene expression profiles, protein sequences
  • Data Clustering: A Review, Jain, Murty, and Flynn, ACM Computing Surveys, 1999.
  • Unsupervised and Semi-supervised Clustering: A Brief Survey, Grira et al., in A Review of Machine Learning Techniques for Processing Multimedia Content, 2005.
  • On spectral clustering: Analysis and an algorithm, Ng, Jordan, and Weiss, NIPS, 2001.
  • Ekow Otoo - Clustering: Hierarchical, Grid-Based, & Visualization
  • Chris Ding - A Tutorial on Spectral Clustering, ICML, 2004.
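To make the clustering definition above concrete, here is a minimal sketch of K-Means (Lloyd's algorithm), one of the listed methods, using Euclidean distance as the proximity measure. The function name `kmeans`, the fixed iteration count, the hand-picked initial centers, and the toy points are all illustrative assumptions.

```python
import math

def nearest_center(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and moving each center to the mean of its points."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Recompute each center; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

# Hypothetical data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, [(0, 0), (10, 10)])
print(labels)
```

The result depends on the initial centers, and the number of clusters must be chosen in advance; no labels are used at any point, which is what makes the procedure unsupervised.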
February 8, 2007
50F-1647
12noon-1:30pm
Dimensionality Reduction: Theory
Transformation (linear or nonlinear) of high-dimensional data into a lower-dimensional subspace satisfying certain criteria, e.g., minimum loss of information, elimination of noise, extraction of salient features. Assumes the data of interest lies in a lower-dimensional space which is more desirable for statistical analysis.
  • Linear Methods (PCA, ICA, NMF)
  • Nonlinear Methods; Manifold Learning
  • Feature Selection
  • Astrophysics: parameterizing spectra
  • Climate Modeling
  • Combustion Modeling
  • Linear Dimensionality Reduction, Liang, P., Lecture Notes, Practical Machine Learning, CS294-10, UC Berkeley, October 2006.
  • PCA and Matrix Factorization for Learning, Chris Ding, ICML 2005 Tutorial.
  • Algorithms For Manifold Learning, Cayton, L., Research Exam, 2005.
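As an illustration of a linear method from the list above, the following sketch computes the first principal component of a small data set (PCA) and projects the data onto it. It uses power iteration on the sample covariance matrix, which is only one of several ways to compute PCA. The function name, the iteration count, and the toy data are assumptions, and the sketch assumes the data are not constant so the covariance matrix is nonzero.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores): the leading eigenvector of the sample
    covariance matrix and the 1-D projection of each centered row onto it."""
    n, d = len(rows), len(rows[0])
    # Center the data by subtracting the per-feature mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the leading direction.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# Hypothetical 2-D data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

For data like this, the single score per point retains most of the variance of the original two features, which is the "minimum loss of information" criterion mentioned in the definition.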
TBA
Classification: Practice
Practical issues and applications of the above.
  • Online Learning
  • Large-Scale Problems
  • (see above)
  • A Parallel Mixture of SVMs for Very Large Scale Problems. Collobert, R., et al., Advances in Neural Information Processing Systems, NIPS 14. MIT Press, 2002.
  • The Interplay of Optimization and Machine Learning Research, Bennett et al., Journal of Machine Learning Research 7 (2006) 1265-1281. (JMLR Special Topic)
  • Shogun - A Large Scale Machine Learning Toolbox, Sonnenburg, Raetsch & De Bona.
TBA
Dimensionality Reduction: Practice
Practical issues and applications of the above.
  • Feature Selection
  • (see above)
  • An Introduction to Variable and Feature Selection, Isabelle Guyon, André Elisseeff, Journal of Machine Learning Research, 3(Mar):1157-1182, 2003.
TBA
Graphical Models
  • A Brief Introduction to Graphical Models and Bayesian Networks, Murphy, K., 1998.
TBA
Time Series Analysis
Methods for modeling patterns in sequences of observed variables and using the models to predict future values or compare multiple time series.
  • Kalman Filtering
  • Hidden Markov Models (HMMs)
  • Markov Chain Monte Carlo (MCMC) Methods
  • Anomaly and Change Detection
  • Trend Estimation
  • Climate and Combustion Modeling: learning spatial/temporal dependencies among variables
  • Overview of time-series-based anomaly detection algorithms
  • The Kalman Filter
  • A tutorial on hidden Markov models and selected applications in speech recognition. L. Rabiner. Proc. IEEE, 77(2), 257-286, 1989.
  • An introduction to MCMC for machine learning, C. Andrieu et al., Machine Learning, vol. 50, pp. 5-43, Jan.-Feb. 2003.
  • Large data series: Modeling the usual to identify the unusual, Downing, D. J., et al., Computational Statistics and Data Analysis, 32(3), 28 January 2000, pp. 245-258.
TBA
Multiway Analysis
Multilinear extensions of matrix-based multivariate analyses, where underlying data representations are higher-order tensors.
  • Tensor Decomposition
  • PARAFAC/CANDECOMP
  • Climate and Combustion Modeling: decomposing spatiotemporal data
  • PARAFAC. Tutorial & applications, Rasmus Bro
  • Tensor Compression for Petabyte-Size Data, Eugene Tyrtyshnikov, Workshop on Algorithms for Modern Massive Data Sets, MMDS 2006.
  • Multilinear Algebra in Data Analysis, Lek-Heng Lim, Workshop on Algorithms for Modern Massive Data Sets, MMDS 2006.
Potential Participants

General Interest

  • Cecilia Aragon
  • Wes Bethel
  • Chris Ding
  • Krishna Palaniappan
  • Juan Meza
  • Esmond Ng
  • Ali Pinar
  • Raquel Romano
  • Doron Rotem
  • Janet Jacobsen
  • Gunther Weber

  • romano@hpcrd.lbl.gov