U.S. Department of Energy

Pacific Northwest National Laboratory

Streaming Query User Interface (SQUINT)

How can a user make sense of high-volume, non-numerical streaming data? Can the machine learn the user’s models and goals and act as the user’s surrogate, making her job easier and faster? Furthermore, how should anomalies be presented in the context of normal behavior so that the user can understand the machine’s decision making? Our goal is to solve these problems with SQUINT, with a focus on detecting insider threats in streaming cloud telemetry data, along with other applications.

SQUINT is a stream summarization architecture that facilitates real-time summarization and visual representation of large collections of temporal event sequences. The capability will allow analysts to build and refine queries from example events, either observed or constructed, and then deploy these queries to continuously organize and summarize the large volumes of streaming data.

Queries are intended to operate on unlabeled data, leveraging user knowledge to solve an unsupervised sequence-labeling problem. Queries may receive continuous feedback from the analyst to improve model performance. In turn, the analyst receives continuous feedback from stream summaries (visualizations), which are updated not only as new data arrives but also as the queries themselves improve from user feedback.

Challenge

Current practices for understanding streaming data often ignore the time dimension by treating events as independent instances. Our approach breaks from this practice by relying on state-of-the-art machine learning techniques to model sequences of events.

Approach

We propose a stream summarization architecture, SQUINT, that facilitates real-time summarization and visual representation of large collections of temporal event sequences. SQUINT summaries are useful (i.e., customizable to the analyst’s interest) and timely (i.e., available and up to date within seconds of individual events). SQUINT summarizes the stream within the appropriate historical context to improve decision-making. The analyst controls how the stream is summarized by interactively constructing one or more queries that organize the stream, past and present, into clusters or ranked lists.
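As a rough illustration of a query built from example event sequences and used to rank a stream, consider the minimal sketch below. It is not SQUINT's implementation: the source does not describe the underlying models, so a simple overlap of adjacent event pairs (bigrams) stands in for the sequence model. The class name ExampleQuery, its methods, and the event names are all hypothetical.

```python
from collections import Counter


def bigrams(events):
    """Count adjacent event pairs, which keeps local ordering information."""
    return Counter(zip(events, events[1:]))


class ExampleQuery:
    """Hypothetical query built from analyst-supplied example sequences."""

    def __init__(self, examples):
        # Profile: bigram counts pooled over all example sequences.
        self.profile = Counter()
        for seq in examples:
            self.profile.update(bigrams(seq))
        self.ranked = []  # list of (score, sequence id), best first

    def add_example(self, events):
        """Analyst feedback: fold another example into the query profile.

        Scores already stored in the ranked list are not recomputed here.
        """
        self.profile.update(bigrams(events))

    def score(self, events):
        """Fraction of a sequence's bigrams that also occur in the profile."""
        grams = bigrams(events)
        total = sum(grams.values())
        if total == 0:
            return 0.0
        shared = sum((grams & self.profile).values())
        return shared / total

    def observe(self, seq_id, events):
        """Score a newly arrived sequence and update the ranked list."""
        self.ranked.append((self.score(events), seq_id))
        self.ranked.sort(key=lambda item: -item[0])
        return self.ranked


# Usage: one example sequence defines the query; new sequences are ranked.
query = ExampleQuery([["login", "download", "logout"]])
query.observe("session-a", ["login", "browse", "logout"])
query.observe("session-b", ["login", "download", "logout"])
```

Because the score is computed over ordered pairs rather than individual events, two sessions containing the same events in a different order can rank differently, which is the property that treating events as independent instances would lose.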

Benefit

A major benefit of our approach is that it models temporal event sequences in a manner that allows existing machine learning techniques to be leveraged. This gives the user more adaptability: rather than training machine learning models for specific, predefined tasks, the user can create new models on the fly for a particular need.
