
Pacific Northwest National Laboratory

Publications

2016

Learning representations of shape cues in 2D and 3D objects for recognition is a fundamental task in computer vision. Deep neural networks (DNNs) have shown promising performance on this task. Because shapes vary widely, accurate recognition relies on good estimates of model uncertainty, which is ignored in the traditional training of DNNs via stochastic optimization. This...
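The abstract does not name a specific uncertainty estimator, so the following is an illustration only: a minimal sketch of Monte Carlo dropout, one common way to obtain predictive uncertainty from a DNN. The architecture and sample count are hypothetical placeholders, not the paper's method.

```python
# Illustrative sketch: Monte Carlo dropout for predictive uncertainty.
# Architecture and n_samples are hypothetical placeholders.
import torch
import torch.nn as nn

class ShapeClassifier(nn.Module):
    def __init__(self, in_dim=128, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Dropout(p=0.5),            # kept active at inference for MC sampling
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

def mc_dropout_predict(model, x, n_samples=50):
    """Average softmax over stochastic forward passes; the spread across
    samples serves as an uncertainty estimate."""
    model.train()                         # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)    # predictive mean and per-class spread

mean_prob, uncertainty = mc_dropout_predict(ShapeClassifier(), torch.randn(4, 128))
```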
As we delve deeper into the ‘Digital Age’, we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data were created daily, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, the Internet of Things, enterprises,...
While streaming data have become increasingly popular in business and research communities, semantic models and processing software for streaming data have not kept pace. Traditional semantic solutions have not addressed transient data streams. Semantic web languages (e.g., RDF, OWL) have typically addressed static data settings, and linked data approaches have predominantly addressed static...
Precise analysis of (S)TEM images and video is a time- and labor-intensive process. As an example, determining when crystal growth and shrinkage occurs during the dynamic process of Li dendrite deposition and stripping involves manually scanning through each frame in the video to extract a specific set of frames/images. For large numbers of images, this process can be very time consuming, so...
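A minimal sketch of the kind of automation this implies, assuming frames arrive as a NumPy array: flag frames whose pixel-wise change from the previous frame exceeds a threshold. Real (S)TEM pipelines would add denoising and drift correction; the threshold here is a hypothetical tuning parameter.

```python
import numpy as np

def flag_change_frames(frames, threshold=0.05):
    """frames: array of shape (n_frames, H, W) with values in [0, 1].
    Returns indices of frames that differ strongly from their predecessor,
    a crude proxy for growth/shrinkage events."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    scores = diffs.mean(axis=(1, 2))            # mean absolute change per frame
    return np.where(scores > threshold)[0] + 1  # +1: diff i compares frames i, i+1
```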
NSF workshop on Stream Reasoning
This report documents progress made on all LDRD-funded projects during fiscal year 2015.
A deep generative model is developed for representation and analysis of images, based on a hierarchical convolutional dictionary-learning framework. Stochastic unpooling is employed to link consecutive layers in the model, yielding top-down image generation. A Bayesian support vector machine is linked to the top-layer features, yielding max-margin discrimination. Deep deconvolutional inference is...
Estimating the confidence for a link is a critical task for Knowledge Graph construction. Link prediction, or predicting the likelihood of a link in a knowledge graph based on prior state, is a key research direction within this area. We propose a Latent Feature Embedding based link recommendation model for the prediction task and apply a Bayesian Personalized Ranking based optimization technique for...
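As an illustration of the two pieces named above, here is a minimal sketch of latent-feature link scoring trained with a Bayesian Personalized Ranking (BPR) objective: observed links should score higher than unobserved ones. The dot-product scorer, dimensions, and rates are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, dim, lr, reg = 1000, 32, 0.05, 1e-4
E = 0.1 * rng.standard_normal((n_entities, dim))   # latent entity embeddings

def bpr_step(head, pos_tail, neg_tail):
    """One SGD step on the BPR loss -log sigmoid(s_pos - s_neg),
    where s = dot(E[head], E[tail])."""
    h, p, n = E[head].copy(), E[pos_tail].copy(), E[neg_tail].copy()
    g = 1.0 / (1.0 + np.exp(h @ p - h @ n))        # sigmoid(-(s_pos - s_neg))
    E[head]     += lr * (g * (p - n) - reg * h)
    E[pos_tail] += lr * (g * h - reg * p)
    E[neg_tail] += lr * (-g * h - reg * n)

bpr_step(3, 17, 420)   # observed edge (3, 17) vs. sampled negative (3, 420)
```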
Storyline visualizations offer an approach that promises to capture the spatio-temporal characteristics of individual observers and simultaneously illustrate emerging group behaviors. We develop a visual analytics approach to parsing, aligning, and clustering fixation sequences from eye tracking data. Visualization of the results captures the similarities and differences across a group of...
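A minimal sketch of the parse/align/cluster idea: encode each viewer's fixation sequence as a string of area-of-interest (AOI) labels, compare sequences by edit distance, and cluster the resulting distance matrix. The AOI coding, linkage method, and cluster count here are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1,
                          d[i-1, j-1] + (a[i-1] != b[j-1]))
    return d[len(a), len(b)]

seqs = ["ABBC", "ABC", "CCBA", "CBA"]     # hypothetical AOI label sequences
n = len(seqs)
dist = np.array([[edit_distance(seqs[i], seqs[j]) for j in range(n)]
                 for i in range(n)], dtype=float)
groups = fcluster(linkage(squareform(dist), method="average"),
                  t=2, criterion="maxclust")   # two behavioral groups
```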

2015

It is useful to understand and to predict the dynamics of cognitive performance and how the timing of breaks affects this process. Prior research analyzed data from online standardized test questions, enabling the creation of a model in which a secondary resource replenishes a primary resource that determines the probability of a successful outcome. However, parameters for this model require...
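A minimal simulation of the two-resource model described above: a primary resource P sets the success probability and is drained by each question, a secondary resource S slowly replenishes P, and a break restores P. All rates are hypothetical placeholders, not fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
P, S = 1.0, 1.0               # primary and secondary resource levels
drain, refill = 0.03, 0.01    # hypothetical per-question cost and refill rate

for q in range(60):
    if q == 30:
        P = min(1.0, P + 0.3)             # a mid-session break restores P
    success = rng.random() < P            # success probability tracks P
    transfer = min(S, refill)             # secondary slowly replenishes primary
    P = max(0.0, min(1.0, P - drain + transfer))
    S -= transfer
    if q % 10 == 0:
        print(f"q{q:02d}  P={P:.2f}  success={success}")
```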
Computing innovations have fundamentally changed many aspects of scientific inquiry. For example, advances in robotics, high-end computing, networking, and databases now underlie much of what we do in science such as gene sequencing, general number crunching, sharing information between scientists, and analyzing large amounts of data. As computing has evolved at a rapid pace, so too has its...
We characterize the commercial behavior of a group of companies in a common line of business using a small ensemble of classifiers on a stream of records containing commercial activity information. This approach is able to effectively find a subset of classifiers that can be used to predict company labels with reasonable accuracy. Performance of the ensemble, its error rate under stable...
Power has become the major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models, and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides a potential for energy and power savings if an appropriate...
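A minimal sketch of measuring slack, assuming an mpi4py program run under mpiexec with at least two ranks: the receiver times how long it blocks inside a single MPI call. A power-aware runtime could lower core frequency (e.g., via DVFS) when measured slack is consistently large; the DVFS hook is platform-specific and omitted here.

```python
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    time.sleep(0.5)              # simulate a late sender
    comm.send(b"payload", dest=1)
elif rank == 1:
    t0 = MPI.Wtime()
    msg = comm.recv(source=0)    # blocking call; the waiting time is slack
    slack = MPI.Wtime() - t0
    print(f"rank 1 observed {slack:.3f}s of slack in one MPI recv")
```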
Machine learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce, and PGAS. The k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literature and in practice. However, on high-performance...
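A minimal sketch of the data-parallel pattern most parallel k-NN implementations share: each partition computes its local k nearest neighbors, and the partial results are merged into a global top-k before voting. The partition layout and k are illustrative; labels are assumed to be nonnegative integers.

```python
import numpy as np

def local_topk(query, X_part, y_part, k):
    """k nearest neighbors of `query` within one data partition."""
    d = np.linalg.norm(X_part - query, axis=1)
    idx = np.argpartition(d, min(k, len(d) - 1))[:k]
    return d[idx], y_part[idx]

def knn_classify(query, partitions, k=5):
    """Merge per-partition candidates, then majority-vote on labels."""
    dists, labels = zip(*(local_topk(query, X, y, k) for X, y in partitions))
    d, l = np.concatenate(dists), np.concatenate(labels)
    nearest = l[np.argsort(d)[:k]]        # global top-k from merged candidates
    return np.bincount(nearest).argmax()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), np.arange(100) % 2
parts = [(X[:50], y[:50]), (X[50:], y[50:])]   # two simulated partitions
print(knn_classify(X[0], parts, k=5))
```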
Cognitive Depletion, the decline in user performance over time through the exhaustion of mental resources, ensures an increasing prevalence of human error in the interaction between computers and their users. Key logger data from the Science of Interaction project was analyzed to determine whether patterns in user activity could be used to detect changes in user performance. Though the majority of...
This brief white paper describes PNNL’s Analysis In Motion (AIM) initiative, with special emphasis on the requirements of AIM’s TEM use case.
Business intelligence problems are particularly challenging due to the use of large volume and high velocity data in attempts to model and explain complex underlying phenomena. Incremental machine learning based approaches for summarizing trends and identifying anomalous behavior are often desirable in such conditions to assist domain experts in characterizing their data. The overall goal of this...
Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm that has become ubiquitous, largely due to its high accuracy and insensitivity to data dimensionality. The objective of SVM is to find an optimal boundary, known as a hyperplane, that separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples...
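A minimal sketch of the max-margin idea using scikit-learn: fit a linear SVM on synthetic data and inspect the few samples (support vectors) that define the separating hyperplane. The data and C value are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]     # hyperplane: w . x + b = 0
print(f"{len(clf.support_)} of {len(X)} samples are support vectors")
print(f"margin width = {2 / np.linalg.norm(w):.3f}")
```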
In this paper, we propose a work-stealing runtime, the Library for Work Stealing (LibWS), that uses the MPI one-sided model to design a scalable FP-Growth, the de facto frequent pattern mining algorithm, on large-scale systems. LibWS provides locality-efficient and highly scalable work-stealing techniques for load balancing across a variety of data distributions. We also propose a novel...
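LibWS itself targets MPI one-sided operations; as an illustration of the underlying work-stealing pattern only, here is a minimal thread-based sketch: each worker pops tasks from the back of its own deque and steals from the front of a random victim's deque when idle. Queue sizes and the "work" performed are placeholders.

```python
import random
import threading
from collections import deque

n_workers = 4
queues = [deque(range(i * 25, (i + 1) * 25)) for i in range(n_workers)]
locks = [threading.Lock() for _ in range(n_workers)]
done = []   # list.append is thread-safe in CPython

def worker(wid):
    while True:
        task = None
        with locks[wid]:
            if queues[wid]:
                task = queues[wid].pop()             # LIFO from own queue
        if task is None:                             # idle: try to steal
            victim = random.randrange(n_workers)
            with locks[victim]:
                if queues[victim]:
                    task = queues[victim].popleft()  # FIFO steal from victim
        if task is None:
            if not any(queues):                      # all queues drained
                return
            continue
        done.append(task * task)                     # stand-in for real work

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```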
Semantic roles play a significant part in extracting knowledge from text. Current unsupervised approaches use features from grammar structures to induce semantic roles onto words. This dependence on grammars makes it difficult to adapt to noisy and new languages. In this paper we develop a data-driven approach for identifying semantic roles, where we are truly unsupervised until...
Today the ability to make sense of data is foundational to all discoveries, innovations, and decision making; the outcome of our work often critically depends on the speed and adaptability of the analysis and interpretation processes we employ. While much progress has been made in automating the analysis of standard events, little is available to support complex or rare event analysis situations...
Temporal and spatial resolution of chemical imaging methodologies such as X-ray tomography are rapidly increasing, leading to more complex experimental procedures and fast-growing data volumes. Automated analysis pipelines and big data analytics are becoming essential to effectively evaluate the results of such experiments. Offering these techniques in an adaptive, streaming environment can...
Some of the most pressing machine learning applications such as cyber security and object recognition lack enough ground-truth training data to build a classifier. Rather than build a classifier, our approach is to determine when data is anomalous or deviates from the norm. We demonstrate the use of an autoencoder to both learn a feature space and then identify anomalous portions without the aid...
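A minimal sketch of the autoencoder idea described above, assuming scikit-learn: train a bottlenecked network to reconstruct "normal" data, then flag inputs with high reconstruction error as anomalous. The architecture, data, and threshold rule are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (500, 20))            # stand-in for normal data
test = np.vstack([rng.normal(0, 1, (10, 20)),   # normal samples
                  rng.normal(5, 1, (10, 20))])  # anomalous samples

ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ae.fit(normal, normal)          # learn to reproduce inputs through a bottleneck

err = ((test - ae.predict(test)) ** 2).mean(axis=1)
threshold = np.percentile(((normal - ae.predict(normal)) ** 2).mean(axis=1), 99)
print("anomalous:", err > threshold)            # high error = deviates from norm
```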
Standardized testing plays a central role in educational and vocational evaluation. By analyzing performance of a large cohort answering practice standardized test questions online, we show that accuracy and learning decline as the test session progresses, but improve following a break. To explain these findings, we hypothesize that answering questions consumes some finite cognitive resources,...

2014

Visual analytics is inherently a collaboration between human and computer. However, in current visual analytics systems, the computer has limited means of knowing about its users and their analysis processes. While existing research has shown that a user's interactions with a system reflect a large amount of the user's reasoning process, there has been limited advancement in developing automated...
The report is the culmination of a year-long evaluation of the drivers of Big Data in the life sciences, possible risks and benefits, and existing or needed solutions to address the risks identified.
The central aim of this project, which is part of the Analytic Framework Focus Area, is to enable the integration of multidisciplinary research efforts and their products into a unified framework for discovery and validation of complex signatures. This research leverages multiple capabilities developed under the PNNL Signature Discovery Initiative including the analytic framework architecture,...
We present a prototype Active Data environment, which addresses the challenge of human-in-the-loop multi-INT analysis of data from diverse sources. Active Data combines powerful user interfaces with data and task models. It infers the user's task from her interactions and recommends data to the user in the context of the ongoing analysis. This environment extends the analyst's reach,...
The rise of Big Data has influenced the design and technical implementation of visual analytic tools required to handle the increased volumes, velocities, and varieties of data. This has required a set of data management and computational advancements to allow us to store and compute on such datasets. However, as the ultimate goal of visual analytic technology is to enable the discovery and...
Interactive analytics provide users a myriad of computational means to aid in extracting meaningful information from large and complex datasets. Much prior work focuses either on advancing the capabilities of machine-centric approaches by the data mining and machine learning communities, or human-driven methods by the visualization and CHI communities. However, these methods do not yet support a...
We propose to advance the state of the art in PII-scrubbing techniques by designing and administering a public competition with cash prizes, on behalf of an interested government sponsor. The goal of the contest would be to elicit novel techniques that can remove PII from an increasingly difficult set of data sources. If successful, this would engage a nontraditional...
We describe in this document a categorization of existing and emerging capabilities that address the mission drivers: 1) The need to develop scientific fundamental understanding of cyber systems, 2) The need for sense making in complex, vast, data streams to support awareness and decision making, and 3) The need to understand the complex interactions between the physical world and the cyber world...
Human-Centered Big Data Research (HCBDR) is an area of work focused on the methodologies for understanding how humans interact with “big data”. In the context of this paper, we refer to “big data” in a holistic sense, including most (if not all) of the dimensions defining the term, such as complexity, variety, velocity, veracity, etc. Simply put, big data requires us...
Designing software for collaborative sensemaking environments begins with a set of very challenging requirements. At a high level, the software needs to be flexible enough to support multiple lines of inquiry, contradictory hypotheses, and collaborative tasking by multiple analysts. It should also include support for managing evolving human/machine workflows and analytic products at various...
Pacific Northwest National Laboratory (PNNL) has put significant effort into nonproliferation activities as an institution, both in terms of the classical nuclear material focused approach and in the examination of other strategic goods necessary to implement a nuclear program. To assist in these efforts, several projects in the Analysis in Motion (AIM) and Signature Discovery (SDI) Initiatives...
Support Vector Machines (SVM), a popular machine learning technique, has been applied to a wide range of domains such as science, finance, and social networks for supervised learning. Whether it is identifying high-risk patients by health-care professionals, or potential high-school students to enroll in college by school districts, SVMs can play a major role for social good. This paper...
Nuclear Magnetic Resonance (NMR) spectroscopy is a valuable tool for analyzing the composition of various small non-protein molecules by comparing the experimental spectrum against a library of expected peak locations. Currently, small molecule identification can be time-consuming and labor-intensive, as spectral results can vary with sample preparation and run conditions, and typically hundreds...
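A minimal sketch of library matching on peak locations: score a candidate compound by how many of its expected peaks fall within a tolerance of the observed peaks. The library entries, observed peaks, and tolerance are hypothetical; real pipelines must also handle the preparation- and run-condition shifts the abstract notes.

```python
import numpy as np

def match_score(observed_ppm, expected_ppm, tol=0.02):
    """Fraction of expected peaks with an observed peak within `tol` ppm."""
    observed = np.asarray(observed_ppm)
    hits = sum(np.any(np.abs(observed - peak) <= tol) for peak in expected_ppm)
    return hits / len(expected_ppm)

library = {"glucose": [3.24, 3.40, 3.53, 4.64, 5.23],   # hypothetical entries
           "alanine": [1.47, 3.77]}
observed = [1.46, 3.25, 3.41, 3.52, 4.63, 5.22]
best = max(library, key=lambda c: match_score(observed, library[c]))
print(best, match_score(observed, library[best]))
```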
With rapid real-time data streams, predictive analysis algorithms must execute under stringent space and time constraints. Incremental machine learning offers a class of algorithms that can naturally support predictive analysis on streaming data. Traditional machine learning approaches assume that a good training set is always available a priori and contains all the required...
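A minimal sketch of incremental learning on a stream, assuming scikit-learn: SGDClassifier.partial_fit updates the model one mini-batch at a time, so no complete training set is needed up front and memory stays bounded. The stream and batch size are simulated.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])                  # must be declared on the first call

for _ in range(100):                        # simulated stream of mini-batches
    X = rng.normal(0, 1, (32, 10))
    y = (X[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)  # constant-time, bounded-memory update
```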

2011

Scientists often use specific data analysis and presentation methods familiar within their domain. But does high familiarity drive better analytical judgment? This question is especially relevant when familiar methods themselves can have shortcomings: many visualizations used conventionally for scientific data analysis and presentation do not follow established best practices. This necessitates...

