U.S. Department of Energy

Pacific Northwest National Laboratory

Selecting a Classification Ensemble and Detecting Process Drift in an Evolving Data Stream

Publish Date: Wednesday, September 30, 2015
We characterize the commercial behavior of a group of companies in a common line of business by applying a small ensemble of classifiers to a stream of records containing commercial activity information. The approach finds a subset of classifiers that predicts company labels with reasonable accuracy. The ensemble's performance, measured as its error rate under stable conditions, can be characterized using an exponentially weighted moving average (EWMA) statistic. The behavior of the EWMA statistic can then be used to monitor a record stream from the commercial network and determine when significant changes have occurred. Results indicate that larger classification ensembles are not necessarily optimal, which points to the need to search the combinatorial space of classifiers systematically. Results also show that the current and past performance of an ensemble can be used to detect statistically significant changes in the activity of the network. The dataset used in this work contains tens of thousands of high-level commercial activity records with continuous and categorical variables and hundreds of labels, which makes classification challenging.
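The EWMA monitoring idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothing weight `lam`, the limit multiplier `L`, the one-sided upper control limit, and the function name `ewma_drift_monitor` are all assumptions chosen for illustration, following a standard textbook EWMA control chart applied to 0/1 misclassification indicators.

```python
import math


def ewma_drift_monitor(errors, p0, lam=0.1, L=3.0):
    """Track an EWMA of 0/1 misclassification indicators.

    errors : iterable of 0 (correct) or 1 (misclassified), in stream order
    p0     : error rate estimated under stable conditions (assumed known)
    lam    : EWMA smoothing weight in (0, 1]; an illustrative default
    L      : control-limit width in standard deviations; illustrative

    Returns (ewma_values, alarm), where alarm is the 0-based index of the
    first record whose EWMA exceeds the upper control limit, or None.
    """
    sigma = math.sqrt(p0 * (1.0 - p0))  # std. dev. of one Bernoulli(p0) error
    z = p0                              # start the EWMA at the stable rate
    values = []
    alarm = None
    for t, e in enumerate(errors, start=1):
        z = lam * e + (1.0 - lam) * z
        # Time-varying EWMA standard deviation (textbook control-chart form).
        width = L * sigma * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        )
        values.append(z)
        if alarm is None and z > p0 + width:
            alarm = t - 1
    return values, alarm


if __name__ == "__main__":
    # Synthetic stream: 20% errors while stable, then every record misclassified.
    stable = [0, 0, 0, 0, 1] * 40
    drifted = stable + [1] * 20
    _, alarm = ewma_drift_monitor(drifted, p0=0.2)
    print("first alarm at record index:", alarm)
```

A smaller `lam` gives more weight to past performance and smooths out isolated errors, at the cost of reacting more slowly to an abrupt change; the synthetic stream here is only a stand-in for real classifier output.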
Heredia-Langner A, LR Rodriguez, A Lin, and JB Webster. 2015. "Selecting a Classification Ensemble and Detecting Process Drift in an Evolving Data Stream." In DMIN'15: The 11th International Conference on Data Mining, July 27-30, 2015, Las Vegas, Nevada, ed. R Stahlbock, et al., pp. 31-36. CSREA Press, Athens, GA.