U.S. Department of Energy

Pacific Northwest National Laboratory

Streaming Data Characterization

A key task for any streaming system is to decide, from moment to moment, which data items to remember and which to forget. Put simply, the processing system should remember the data items that are or will be important to the streaming query, and forget the others. The Streaming Data Characterization (SDC) project aims to provide a set of algorithms that make this decision in a computationally efficient manner as the stream flows on. We do this by leveraging formal task descriptions and domain ontologies, and combining them with computationally efficient data structures and ranking systems.

Challenge

Reasoning and querying over data streams rely on the ability to deliver a sequence of stream snapshots to the processing algorithms. These snapshots are typically provided using windows, which act as temporarily static views into the stream. This design leads to an alternating pattern: fix the window, execute the query, update the window with new stream data, execute the query on the updated window, and so forth. However, with the most common window update strategies (e.g., First-In-First-Out, Least-Recently-Used, or Least-Frequently-Used), the query may fail to find the desired result because critical stream data can be flushed from the window before the query identifies it.
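
The failure mode described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in chosen for illustration: the event labels, the `run_fifo` and `both_present` helpers, and the window sizes are assumptions, not part of any SDC software. The query needs two events in the same snapshot, and a small First-In-First-Out window drops the first event before the second arrives.

```python
from collections import deque

def run_fifo(stream, window_size, query):
    """Alternate between updating a FIFO window and querying the snapshot."""
    window = deque(maxlen=window_size)  # the oldest item is dropped when full
    for item in stream:
        window.append(item)   # update the window with new stream data
        if query(window):     # execute the query on the current snapshot
            return True
    return False

def both_present(window):
    """Hypothetical query: both events must be visible in the same snapshot."""
    return "login_failed" in window and "privilege_escalation" in window

# Hypothetical stream: the two related events are separated by unrelated items.
stream = ["login_failed", "noise_1", "noise_2", "noise_3", "privilege_escalation"]

fifo_small = run_fifo(stream, 3, both_present)  # window too small for the gap
fifo_large = run_fifo(stream, 5, both_present)  # window spans the whole stream
print(fifo_small, fifo_large)
```

With a window of three items, "login_failed" has already been flushed when "privilege_escalation" arrives, so the query never succeeds. Enlarging the window only helps if the required size is known in advance.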

Approach

We are exploring computationally efficient window update algorithms based on semantic processing that can mitigate this problem by ensuring that assertions that are likely to be important for the query are kept in the window for a longer time. Essentially, SDC’s algorithms dynamically rank the window assertions by their importance, and use this ranking to preserve the important assertions through window updates. Also, because the algorithms rely on semantic rules, it is possible for human operators to dynamically modify the ranking algorithms, and thus provide a way for human judgment and insight to interact with the stream processing algorithms.
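
The general idea of a ranked window update can be sketched as follows. This is not the SDC algorithm itself, which the text does not specify: the numeric weights in `rules` are an assumed stand-in for semantic rules and ontologies, and `update_window`, `score`, `run_ranked`, and the event labels are hypothetical names introduced only for this illustration.

```python
# Assumed stand-in for semantic rules: a weight per assertion label.
# An operator could edit this dictionary while the stream is being processed.
rules = {"login_failed": 5, "privilege_escalation": 5}

def score(item):
    """Importance of an assertion under the current rules (0 if no rule applies)."""
    return rules.get(item, 0)

def update_window(window, item, capacity):
    """Add item; if over capacity, evict the lowest-ranked assertion.

    The window list is kept in arrival order and min() returns the first
    minimum, so ties are broken by evicting the oldest assertion.
    """
    window.append(item)
    if len(window) > capacity:
        victim = min(range(len(window)), key=lambda i: score(window[i]))
        window.pop(victim)
    return window

def both_present(window):
    """Hypothetical query: both events must be visible in the same snapshot."""
    return "login_failed" in window and "privilege_escalation" in window

def run_ranked(stream, capacity, query):
    """Alternate between ranked window updates and query execution."""
    window = []
    for item in stream:
        update_window(window, item, capacity)
        if query(window):
            return True
    return False

stream = ["login_failed", "noise_1", "noise_2", "noise_3", "privilege_escalation"]
ranked_result = run_ranked(stream, 3, both_present)
print(ranked_result)
```

In this sketch the highly ranked "login_failed" assertion survives the window updates that evict the unranked items, so the query succeeds with a window of only three items. If the rules are emptied, every assertion scores zero and the update degenerates to First-In-First-Out eviction, which shows how changing the rules changes what the window retains.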

Benefit

SDC algorithms will directly benefit:

  • Stream processing applications, such as cyber detection systems, in which human operators must use their expertise to help define the important facts to remember
  • Any stream processing system where the window size cannot be determined in advance