U.S. Department of Energy

Pacific Northwest National Laboratory

Scalable Feature Extraction and Sampling for Streaming Data Analysis

Today, scientific simulations, experiments, and handheld devices produce data at ever-increasing velocity. Machine Learning and Data Mining (MLDM) algorithms are essential for analyzing high-velocity streams, yet existing algorithms are restricted to low velocities and sequential execution. For example, most MLDM libraries can leverage GPUs and multi-core systems but rarely use large-scale distributed systems effectively. Under this project, we propose to design several algorithms that address this fundamental limitation. We will design parallel, incremental MLDM algorithms for feature extraction and incremental model generation. In addition to large-scale distributed-memory algorithms, the project intends to design algorithms that incrementally find the samples most critical for model learning (without overfitting) and produce simple models that can be readily deployed on memory- and compute-constrained systems.
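To make the sample-selection idea concrete, the following is a minimal sketch, not the project's actual algorithm, of incremental learning that retains only "critical" low-margin samples; scikit-learn's SGDClassifier and the 0.5 margin threshold are illustrative assumptions:

    # Sketch only: SGDClassifier stands in for the project's incremental
    # MLDM algorithms; the margin threshold of 0.5 is an assumed value.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier(loss="hinge")           # linear model trained by SGD
    classes = np.array([0, 1])
    critical = []                                 # retained low-margin samples

    for step in range(100):                       # simulated stream of mini-batches
        X = rng.normal(size=(32, 10))
        y = (X[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
        model.partial_fit(X, y, classes=classes)  # incremental model update
        margins = np.abs(model.decision_function(X))
        keep = margins < 0.5                      # samples near the decision boundary
        critical.extend(zip(X[keep], y[keep]))    # keep only the critical samples

    print(f"retained {len(critical)} critical samples out of 3200 seen")

Retaining only low-margin samples bounds the memory footprint of the learner, which is what makes deployment on memory- and compute-constrained systems plausible.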

Approach

We are interested in feature extraction and hypothesis generation, to be realized by designing and implementing novel parallel, incremental machine learning algorithms. We will conduct this research on Intel and AMD architectures, novel networking architectures, and GPUs. Our algorithms will use elimination (to reduce time complexity) and recall (to maintain precision) while applying incremental updates to existing models. As a first step, we have enhanced Google TensorFlow and Berkeley Caffe for execution on extreme-scale systems.
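The mechanics of those enhancements are not described here, but the general pattern behind MPI-enabled deep learning stacks is data-parallel training with gradient averaging. The sketch below illustrates that pattern with mpi4py and a plain logistic-regression update; the model, data shapes, and learning rate are illustrative assumptions, not MaTEx internals:

    # Sketch only: data-parallel SGD with MPI gradient averaging.
    # The logistic-regression model and hyperparameters are assumed for
    # illustration; MaTEx's actual TensorFlow/Caffe changes may differ.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(rank)             # each rank holds its own data shard
    X = rng.normal(size=(256, 10))
    y = (X @ np.arange(10.0) > 0).astype(float)
    w = np.zeros(10)                              # model replicated on every rank

    for epoch in range(20):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # local forward pass
        local_grad = X.T @ (p - y) / len(y)       # local gradient on this shard
        grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, grad, op=MPI.SUM)
        w -= 0.1 * (grad / size)                  # identical averaged update everywhere

    if rank == 0:
        print("trained weights:", np.round(w, 2))

Run with, e.g., mpirun -np 4 python train.py; because every rank applies the same averaged gradient, the replicated models stay synchronized without a parameter server.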

Benefit

We are adding a new dimension to large-scale machine learning, and we consider the potential impact of the proposed research to be significant. We are using datasets from several domains, including High Energy Physics and detection of offending Uniform Resource Locators (URLs), along with streaming-data use cases, to demonstrate the effectiveness of the proposed solutions. Our open-source release, the Machine Learning Toolkit for Extreme Scale (MaTEx), consists of several parallel implementations of algorithms that are routinely used by data analysts. With these implementations, we expect scientists and analysts to be able to learn new models and analyze data much faster than with current practice. The project has already produced several important achievements, including scalable implementations of deep learning algorithms on extreme-scale systems. The research is integrated with MaTEx (http://hpc.pnl.gov/matex).
