U.S. Department of Energy

Pacific Northwest National Laboratory

Publications

2017

Scanning transmission electron microscopes (STEM) provide high-resolution images at an atomic scale. Unfortunately, the level of electron dose required to achieve these high-resolution images results in a potentially large amount of specimen damage. A promising approach to mitigate specimen damage is to subsample the specimen [1, 2, 3]. With random sampling, the microscope creates high-resolution...
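
A minimal sketch of the random-subsampling idea in Python: only a fraction of pixel positions are scanned, and the unscanned positions are filled back in afterwards. The nearest-neighbor interpolation here is just a stand-in for the inpainting-style reconstructions cited in [1, 2, 3]; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.interpolate import griddata

def subsample_and_reconstruct(image, fraction=0.2, seed=0):
    """Scan a random fraction of pixels, then fill in the rest by interpolation."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    mask = rng.random((h, w)) < fraction      # which pixels the beam visits
    ys, xs = np.nonzero(mask)                 # coordinates actually scanned
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    # Stand-in for the cited inpainting methods: nearest-neighbor fill.
    recon = griddata((ys, xs), image[ys, xs], (grid_y, grid_x), method='nearest')
    return mask, recon
```

Because only `fraction` of the positions are dosed, the total electron dose drops roughly in proportion, which is the damage-mitigation argument above.
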
We developed CHISSL, a human-machine interface that utilizes supervised machine learning in an unsupervised context to help the user group unlabeled instances by her own mental model. The user primarily interacts via correction (moving a misplaced instance into its correct group) or confirmation (accepting that an instance is placed in its correct group). Concurrent with the user's...
We introduce new dictionary learning methods for tensor-variate data of any order. We represent each data item as a sum of Kruskal decomposed dictionary atoms within the framework of beta-process factor analysis (BPFA). Our model is nonparametric and can infer the tensor-rank of each dictionary atom. This Kruskal-Factor Analysis (KFA) is a natural generalization of BPFA. We also extend KFA to a...
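
For readers unfamiliar with the Kruskal (CP) form, each order-N dictionary atom in such a model is a sum of rank-one tensors. The notation below is assumed for illustration (R_k is the inferred rank of atom k, z_ik are binary beta-process indicators, w_ik are weights) and is not copied from the paper:

```latex
\mathcal{D}^{(k)} = \sum_{r=1}^{R_k} \lambda^{(k)}_r \,
\mathbf{a}^{(k,1)}_r \circ \mathbf{a}^{(k,2)}_r \circ \cdots \circ \mathbf{a}^{(k,N)}_r ,
\qquad
\mathcal{X}_i \approx \sum_{k=1}^{K} z_{ik}\, w_{ik}\, \mathcal{D}^{(k)}
```
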
To promote more interactive and dynamic machine learning, we revisit the notion of user-interface metaphors. User-interface metaphors provide intuitive constructs for supporting user needs through interface design elements. A user-interface metaphor provides a visual or action pattern that leverages a user’s knowledge of another domain. Metaphors suggest both the visual representations that...
Capacity coefficient analysis could offer a theoretically grounded alternative to subjective measures and dual-task assessments of cognitive workload. Workload capacity, or workload efficiency, is a human information processing modeling construct defined as the amount of information that can be processed by the visual cognitive system in a specified amount of time. In this paper, I...
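
For context, the standard redundant-target (OR) form of the capacity coefficient compares integrated hazards of response times when both channels are active against each channel alone; the paper may use a variant:

```latex
C_{\mathrm{OR}}(t) = \frac{H_{AB}(t)}{H_{A}(t) + H_{B}(t)},
\qquad H(t) = -\ln S(t)
```

Here S(t) is the survivor function of the response-time distribution; C(t) > 1 indicates super capacity, C(t) = 1 unlimited capacity, and C(t) < 1 limited capacity.
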
Visual data analysis helps people gain insights into data via interactive visualizations. People generate and test hypotheses and questions about data in the context of the domain. This process can generally be referred to as sensemaking. Much of the work on studying sensemaking (and creating visual analytic techniques in support of it) has been focused on static datasets. However, how do the...
Visual analytic systems have long relied on user studies and standard datasets to demonstrate advances to the state of the art, as well as to illustrate the efficiency of solutions to domain-specific challenges. This approach has enabled some important comparisons between systems, but unfortunately the narrow scope required to facilitate these comparisons has prevented many of these lessons from...

2016

The FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large-scale datasets. While several researchers have designed distributed-memory FP-Growth algorithms, it is pivotal to consider fault-tolerant FP-Growth, which can address the increasing fault rates in large-scale systems. In this work, we propose a novel parallel...
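
A compact, serial Python sketch of the FP-tree construction at the heart of FP-Growth (header-table links and the recursive conditional-tree mining are omitted); names and structure are illustrative, not the paper's fault-tolerant distributed implementation.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Count items, drop infrequent ones, and insert frequency-sorted prefixes."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    root = FPNode(None, None)
    for t in transactions:
        # Frequent items only, ordered by descending global frequency.
        items = sorted((i for i in t if counts[i] >= min_support),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, counts
```

Transactions sharing a frequent prefix collapse onto a single path, which is why mining the tree is cheaper than rescanning the raw data.
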
The recent successes of machine learning (ML) have exposed the need for making models more interpretable and accessible to different stakeholders in the data science ecosystem, like domain experts, data analysts, and even the general public. The assumption here is that higher interpretability will lead to more confident human decision-making based on model outcomes. In this talk, we report on two...
Dealing with the curse of dimensionality is a key challenge in high-dimensional data visualization. We present SeekAView to address three main gaps in the existing research literature. First, automated methods like dimensionality reduction or clustering suffer from a lack of transparency in letting analysts interact with their outputs in real-time to suit their exploration strategies. The results...
Visual analytic systems have long relied on user studies and standard datasets to demonstrate advances to the state of the art, as well as to illustrate the efficiency of solutions to domain-specific challenges. This approach has enabled some important comparisons between systems, but unfortunately the narrow scope required to facilitate these comparisons has prevented many of these lessons from...
Reasoning and querying over data streams rely on the ability to deliver a sequence of stream snapshots to the processing algorithms. These snapshots are typically provided using windows as views into streams and associated window management strategies. Generally, the goal of any window management strategy is to preserve the most important data in the current window and preferentially evict the...
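
As a concrete baseline, the sketch below implements the simplest such strategy: a time-based sliding window that evicts the oldest items. The importance-aware policies the abstract alludes to would replace the eviction rule; class and method names are illustrative.

```python
from collections import deque
import time

class TimeWindow:
    """Keep only items newer than `span` seconds (assumes monotonic timestamps)."""
    def __init__(self, span):
        self.span, self.items = span, deque()

    def insert(self, value, ts=None):
        ts = time.monotonic() if ts is None else ts
        self.items.append((ts, value))
        self._evict(ts)

    def _evict(self, now):
        # Oldest-first eviction; an importance-aware policy would go here.
        while self.items and now - self.items[0][0] > self.span:
            self.items.popleft()

    def snapshot(self):
        """The current window contents, i.e. one stream snapshot."""
        return [v for _, v in self.items]
```
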
In-situ (scanning) transmission electron microscopy (S/TEM) is being developed for numerous applications in the study of nucleation and growth under electrochemical driving forces. For this type of experiment, one of the key parameters to identify is when nucleation initiates. Typically, identifying the moment that crystals begin to form is a manual process requiring the user to...
Combining interactive visualization with automated analytical methods like statistics and data mining facilitates data-driven discovery. These visual analytic methods are beginning to be instantiated within mixed-initiative systems, where humans and machines collaboratively influence evidence-gathering and decision-making. But an open research question is that, when domain experts analyze their...
Learning the representation of shape cues in 2D & 3D objects for recognition is a fundamental task in computer vision. Deep neural networks (DNNs) have shown promising performance on this task. Due to the large variability of shapes, accurate recognition relies on good estimates of model uncertainty, which is ignored in the traditional training of DNNs via stochastic optimization. This...
As we delve deeper into the ‘Digital Age’, we witness an explosive growth in the volume, velocity, and variety of the data available on the Internet. For example, in 2012 about 2.5 quintillion bytes of data was created on a daily basis, originating from a myriad of sources and applications including mobile devices, sensors, individual archives, social networks, Internet of Things, enterprises,...
While streaming data have become increasingly more popular in business and research communities, semantic models and processing software for streaming data have not kept pace. Traditional semantic solutions have not addressed transient data streams. Semantic web languages (e.g., RDF, OWL) have typically addressed static data settings and linked data approaches have predominantly addressed static...
Precise analysis of both (S)TEM images and video is a time- and labor-intensive process. As an example, determining when crystal growth and shrinkage occurs during the dynamic process of Li dendrite deposition and stripping involves manually scanning through each frame in the video to extract a specific set of frames/images. For large numbers of images, this process can be very time consuming, so...
NSF workshop on Stream Reasoning
This report documents progress made on all LDRD-funded projects during fiscal year 2015.
A deep generative model is developed for representation and analysis of images, based on a hierarchical convolutional dictionary-learning framework. Stochastic unpooling is employed to link consecutive layers in the model, yielding top-down image generation. A Bayesian support vector machine is linked to the top-layer features, yielding max-margin discrimination. Deep deconvolutional inference is...
Estimating the confidence for a link is a critical task in Knowledge Graph construction. Link prediction, or predicting the likelihood of a link in a knowledge graph based on its prior state, is a key research direction within this area. We propose a Latent Feature Embedding based link recommendation model for the prediction task and utilize a Bayesian Personalized Ranking based optimization technique for...
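
For reference, the generic Bayesian Personalized Ranking objective has the form below; the notation is assumed here for illustration, with s_Θ scoring a candidate link from the latent embeddings, ℓ+ ranging over observed links, ℓ− over unobserved ones, and σ the logistic sigmoid:

```latex
\max_{\Theta} \; \sum_{(\ell^{+},\, \ell^{-})}
\ln \sigma\!\bigl( s_{\Theta}(\ell^{+}) - s_{\Theta}(\ell^{-}) \bigr)
\; - \; \lambda \, \lVert \Theta \rVert_2^2
```

BPR optimizes the ranking of observed links above unobserved ones rather than their absolute scores, which fits the link-confidence framing above.
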
Storyline visualizations offer an approach that promises to capture the spatio-temporal characteristics of individual observers and simultaneously illustrate emerging group behaviors. We develop a visual analytics approach to parsing, aligning, and clustering fixation sequences from eye tracking data. Visualization of the results captures the similarities and differences across a group of...

2015

It is useful to understand and to predict the dynamics of cognitive performance and how the timing of breaks affects this process. Prior research analyzed data from online standardized test questions, enabling the creation of a model in which a secondary resource replenishes a primary resource that determines the probability of a successful outcome. However, parameters for this model require...
Computing innovations have fundamentally changed many aspects of scientific inquiry. For example, advances in robotics, high-end computing, networking, and databases now underlie much of what we do in science such as gene sequencing, general number crunching, sharing information between scientists, and analyzing large amounts of data. As computing has evolved at a rapid pace, so too has its...
We characterize the commercial behavior of a group of companies in a common line of business using a small ensemble of classifiers on a stream of records containing commercial activity information. This approach is able to effectively find a subset of classifiers that can be used to predict company labels with reasonable accuracy. Performance of the ensemble, its error rate under stable...
Power has become the major impediment in designing large-scale high-end systems. The Message Passing Interface (MPI) is the de facto communication interface used as the back-end for designing applications, programming models, and runtimes for these systems. Slack, the time spent by an MPI process in a single MPI call, provides a potential for energy and power savings, if an appropriate...
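
A minimal mpi4py sketch of measuring slack around a single blocking call (assumes two ranks; the same pattern in C uses MPI_Wtime around MPI_Recv). What a runtime would do with the measurement is left as a comment because it is implementation-specific.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    t0 = MPI.Wtime()
    data = comm.recv(source=1, tag=0)   # blocks until rank 1 sends
    slack = MPI.Wtime() - t0            # time spent waiting inside the call
    # A power-aware runtime could use `slack` to choose a lower power state
    # (e.g., via DVFS) for comparable waits later in the run.
    print(f"slack in recv: {slack:.6f} s")
elif rank == 1:
    comm.send("payload", dest=0, tag=0)
```
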
Machine Learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce, and PGAS. The k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literature and practice. However, on high-performance...
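
For reference, a serial k-NN classifier fits in a few lines of Python; the parallel implementations the abstract surveys mostly distribute the distance computation and top-k selection across nodes.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```
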
Cognitive Depletion, the decline in user performance over time through the exhaustion of mental resources, ensures an increasing prevalence of human error in the interaction between computers and their users. Key logger data from the Science of Interaction project was analyzed to determine if patterns in user activity could be used to determine change in user performance. Though the majority of...
This brief white paper describes PNNL’s Analysis In Motion (AIM) initiative, with special emphasis on the requirements of AIM’s TEM use case.
Business intelligence problems are particularly challenging due to the use of large volume and high velocity data in attempts to model and explain complex underlying phenomena. Incremental machine learning based approaches for summarizing trends and identifying anomalous behavior are often desirable in such conditions to assist domain experts in characterizing their data. The overall goal of this...
The Support Vector Machine (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary, also known as a hyperplane, which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples...
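
The maximum-margin objective described here, in its standard hard-margin primal form:

```latex
\min_{\mathbf{w},\, b} \; \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \bigl( \mathbf{w}^{\top} \mathbf{x}_i + b \bigr) \ge 1 \quad \forall i
```

The margin is 2/||w||, and the few samples that meet the constraint with equality are the support vectors; non-separable data adds slack variables (the soft-margin variant).
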
In this paper, we propose a work-stealing runtime, the Library for Work Stealing (LibWS), using the MPI one-sided model for designing a scalable FP-Growth, the de facto frequent pattern mining algorithm, on large-scale systems. LibWS provides locality-efficient and highly scalable work-stealing techniques for load balancing on a variety of data distributions. We also propose a novel...
Semantic roles play a significant role in extracting knowledge from text. Current unsupervised approaches utilize features from grammar structures to induce semantic roles onto words. The dependence on these grammars makes it difficult to adapt to noisy and new languages. In this paper we develop a data-driven approach for identifying semantic roles, where we are truly unsupervised till...
Today the ability to make sense of data is foundational to all discoveries, innovations, and decision making; the outcome of our work often critically depends on the speed and adaptability of the analysis and interpretation processes we employ. While much progress has been made in automating the analysis of standard events, little is available to support complex or rare event analysis situations...
Temporal and spatial resolution of chemical imaging methodologies such as x-ray tomography are rapidly increasing, leading to more complex experimental procedures and fast growing data volumes. Automated analysis pipelines and big data analytics are becoming essential to effectively evaluate the results of such experiments. Offering those data techniques in an adaptive, streaming environment can...
Some of the most pressing machine learning applications such as cyber security and object recognition lack enough ground-truth training data to build a classifier. Rather than build a classifier, our approach is to determine when data is anomalous or deviates from the norm. We demonstrate the use of an autoencoder to both learn a feature space and then identify anomalous portions without the aid...
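
A sketch of the scoring step, assuming `model` is any trained autoencoder exposing a `predict` method (a hypothetical interface, not a specific library's): samples the model reconstructs poorly are flagged as anomalous, with the threshold taken from the score distribution itself so no ground-truth labels are needed.

```python
import numpy as np

def anomaly_scores(model, X):
    """Per-sample reconstruction error; high error suggests an anomaly."""
    X_hat = model.predict(X)                     # autoencoder reconstruction
    return np.mean((X - X_hat) ** 2, axis=1)     # mean squared error per row

def flag_anomalies(scores, quantile=0.99):
    # Threshold from the empirical score distribution: no labels required.
    return scores > np.quantile(scores, quantile)
```
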
Standardized testing plays a central role in educational and vocational evaluation. By analyzing performance of a large cohort answering practice standardized test questions online, we show that accuracy and learning decline as the test session progresses, but improve following a break. To explain these findings, we hypothesize that answering questions consumes some finite cognitive resources,...
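
An illustrative discrete-time rendering of the two-resource hypothesis; the parameter names, values, and update rules below are assumptions for exposition, not the paper's fitted model.

```python
def step(primary, secondary, on_task, drain=0.05, transfer=0.02, recover=0.01):
    """One time step: work drains the primary resource, the secondary resource
    partially replenishes it, and breaks restore the secondary resource."""
    if on_task:
        primary = max(0.0, primary - drain)        # answering costs resources
        refill = transfer * secondary              # secondary tops up primary
        primary = min(1.0, primary + refill)
        secondary = max(0.0, secondary - refill)
    else:
        secondary = min(1.0, secondary + recover)  # a break replenishes
    return primary, secondary, primary             # last value: P(success) proxy
```

Iterating `step` over a simulated session is meant to echo the qualitative pattern reported above: success probability declines with continued work and recovers after a break.
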

2014

Visual analytics is inherently a collaboration between human and computer. However, in current visual analytics systems, the computer has limited means of knowing about its users and their analysis processes. While existing research has shown that a user's interactions with a system reflect a large amount of the user's reasoning process, there has been limited advancement in developing...
The report is the culmination of a year-long evaluation of the drivers of Big Data in the life sciences, possible risks and benefits, and existing or needed solutions to address the risks identified.
The central aim of this project, which is part of the Analytic Framework Focus Area, is to enable the integration of multidisciplinary research efforts and their products into a unified framework for discovery and validation of complex signatures. This research leverages multiple capabilities developed under the PNNL Signature Discovery Initiative including the analytic framework architecture,...
We present a prototype Active Data environment, which addresses the challenge of human-in-the-loop multi-INT analysis of data from diverse sources. Active Data combines powerful user interfaces, data and task models. It infers the user's task from her interactions and recommends data to the user in the context of the ongoing analysis. This environment extends the analyst's reach,...
The rise of Big Data has influenced the design and technical implementation of visual analytic tools required to handle the increased volumes, velocities, and varieties of data. This has required a set of data management and computational advancements to allow us to store and compute on such datasets. However, as the ultimate goal of visual analytic technology is to enable the discovery and...
Interactive analytics provide users a myriad of computational means to aid in extracting meaningful information from large and complex datasets. Much prior work focuses either on advancing the capabilities of machine-centric approaches by the data mining and machine learning communities, or human-driven methods by the visualization and CHI communities. However, these methods do not yet support a...
We propose to advance the state of the art in PII scrubbing techniques by designing and administering a public competition with cash prizes, on behalf of an interested government sponsor. The goal of the contest would be to create a set of novel methods to develop techniques that can remove PII from an increasingly-difficult set of data sources. If successful, this would engage a nontraditional...
We describe in this document a categorization of existing and emerging capabilities that address the mission drivers: 1) The need to develop scientific fundamental understanding of cyber systems, 2) The need for sense making in complex, vast, data streams to support awareness and decision making, and 3) The need to understand the complex interactions between the physical world and the cyber world...
Human-Centered Big Data Research (HCBDR) is an area of work focused on methodologies and research for understanding how humans interact with “big data”. In the context of this paper, we refer to “big data” in a holistic sense, including most (if not all) of the dimensions defining the term, such as complexity, variety, velocity, veracity, etc. Simply put, big data requires us...
Visual analytics is inherently a collaboration between human and computer. However, in current visual analytics systems, the computer has limited means of knowing about its users and their analysis processes. While existing research has shown that a user’s interactions with a system reflect a large amount of the user’s reasoning process, there has been limited advancement in developing automated...
Designing software for collaborative sensemaking environments begins with a set of very challenging requirements. At a high level, the software needs to be flexible enough to support multiple lines of inquiry, contradictory hypotheses, and collaborative tasking by multiple analysts. It should also include support for managing evolving human/machine workflows and analytic products at various...
