U.S. Department of Energy

Pacific Northwest National Laboratory

Population Based Hypothesis Generation and Control

Objective

Approaches to analyzing streaming data tend to use a single predictive model or a fixed ensemble, on the assumption that the models at hand are optimal for the problem under consideration. In complex environments, a single model may be of little use, and a fixed ensemble may be unable to arrive at a statistically optimal set that produces insightful information for an analyst. We propose a population-based approach in which different combinations of models are selected using the statistical principles of effectiveness, parsimony, and goodness of fit.

Approach

With a single, fixed set of models, it is usually not possible to know the effect each model has on a response or which combination of models is most useful. Careful investigation of different model combinations allows estimation of the effect that each available classification model has on a response, alone and in combination with others. These effect estimates allow models to be pruned to yield a parsimonious ensemble. The population-based approach lets models that do not contribute significantly to classification performance remain dormant, while the rest provide input to an information aggregator that can produce an actionable hypothesis.
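The effect estimation described above can be illustrated with a small two-level factorial experiment over model inclusion. The following is a minimal sketch under stated assumptions, not the project's actual method: the labels, the three classifiers "A", "B", and "C", the majority-vote aggregator, and the fixed pruning threshold are all hypothetical, and the threshold stands in for a proper significance test.

```python
from itertools import combinations, product
from math import prod

# Hypothetical toy data: true labels and the 0/1 predictions of three
# made-up classifiers on ten instances (not results from the project).
LABELS = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
PREDICTIONS = {
    "A": [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    "B": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "C": [0, 1, 1, 0, 0, 1, 0, 0, 1, 1],
}


def ensemble_accuracy(active):
    """Accuracy of a strict-majority vote over the active models.

    Ties, and the empty ensemble, fall back to predicting class 0.
    """
    correct = 0
    for i, label in enumerate(LABELS):
        votes = sum(PREDICTIONS[name][i] for name in active)
        predicted = 1 if 2 * votes > len(active) else 0
        correct += predicted == label
    return correct / len(LABELS)


def factorial_effects(names):
    """Estimate main and interaction effects from a full 2^k design."""
    k = len(names)
    runs = []
    # Each run codes every model as +1 (included) or -1 (left out).
    for signs in product((-1, 1), repeat=k):
        active = [n for n, s in zip(names, signs) if s == 1]
        runs.append((dict(zip(names, signs)), ensemble_accuracy(active)))
    effects = {}
    for order in range(1, k + 1):
        for term in combinations(names, order):
            # Contrast: responses weighted by the product of the coded levels.
            contrast = sum(prod(x[n] for n in term) * y for x, y in runs)
            effects[term] = contrast / 2 ** (k - 1)
    return effects


def split_population(names, effects, threshold=0.05):
    """Keep models whose main effect exceeds the (assumed) threshold."""
    active = [n for n in names if effects[(n,)] > threshold]
    dormant = [n for n in names if n not in active]
    return active, dormant


names = list(PREDICTIONS)
effects = factorial_effects(names)
active, dormant = split_population(names, effects)

for term, value in effects.items():
    print("x".join(term), round(value, 3))
print("active:", active, "dormant:", dormant)
```

A full factorial design needs 2^k runs for k models, so a larger population would likely call for a fractional design or a search over combinations rather than exhaustive enumeration.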

Achievements

  • Produced a systematic approach, based on statistical principles, to evaluate effectively and efficiently the effect of individual models in a population and to identify significant interactions between models and performance thresholds
  • Establishing a metric by which improvement over single-model and fixed-ensemble approaches can be measured, and weighing the trade-offs between the simplicity of a single-model approach and the more complex environment proposed by this work
  • At the end of our first year, we will have established how much improvement can be made over a single-model approach and what level of complexity can be practically tolerated for the population-based method
  • The paper “Selecting a Classification Ensemble and Detecting Process Drift in an Evolving Data Stream” was accepted for publication and presentation at the 11th International Conference on Data Mining

Impact

  • Metrics for performance when a single classifier or a single predictive model is used are very well established (one-step-ahead predictions, lack of fit, leave-one-out validation)
  • Combining multiple predictive models and classifiers is challenging because rules for individual predictors may not apply in a multiple-model environment
  • Statistically based tools that explain how predictions from individual models act in combination with information from other classifiers, and that show where information may be lacking, can be very useful to chemists, intelligence analysts, and statisticians