U.S. Department of Energy

Pacific Northwest National Laboratory

Publications

2018

Cyber network analysts follow complex processes in their investigations of potential threats to their network. Much research is dedicated to providing automated tool support in the effort to make their tasks more efficient, accurate, and timely. This tool support comes in a variety of implementations from machine learning algorithms that monitor streams of data to visual analytic environments for...
We describe a novel study of decision-making processes around misinformation on social media. Using a custom-built visual analytic system, we presented users with news content from social media accounts from a variety of news outlets, including outlets engaged in distributing misinformation. We conducted controlled experiments to study decision-making regarding the veracity of these news outlets...
Liquid chromatography-mass spectrometry (LC-MS)-based proteomics studies of large sample cohorts can easily require from months to years to complete. Acquiring consistent, high-quality data in such large-scale studies is challenging because of normal variations in instrumentation performance over time, as well as artifacts introduced by the samples themselves, such as those due to collection,...
One of the main limitations in the use of operando scanning transmission electron microscopes to study dynamic chemical processes is the effect of the electron beam on the kinetics of the reaction being observed. Here we demonstrate that a flexible Gaussian mixture model can be used to extract quantitative information directly from sub-sampled images, i.e. images where not all the pixels in the...
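As background, a minimal sketch of the general idea is shown below: fit a Gaussian mixture to the intensities of only the measured pixels of a sub-sampled frame and read off component assignments. The image, sampling fraction, and number of components are illustrative assumptions, not details of the published model.

    # Illustrative sketch only (not the authors' code): fit a Gaussian mixture to
    # the intensity values of the measured pixels of a sub-sampled image.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    frame = rng.random((256, 256))              # stand-in for one STEM frame
    mask = rng.random(frame.shape) < 0.2        # only ~20% of pixels were measured

    measured = frame[mask].reshape(-1, 1)       # use measured pixels only
    gmm = GaussianMixture(n_components=3).fit(measured)
    labels = gmm.predict(measured)              # per-pixel component assignments
    print(gmm.means_.ravel(), np.bincount(labels))
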
We demonstrate CHISSL, a scalable client-server system for real-time interactive machine learning. Our system is capable of incorporating user feedback incrementally and immediately without a structured or pre-defined prediction task. Computation is partitioned between a lightweight web-client and a heavyweight server. The server relies on representation learning and agglomerative clustering to...
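A minimal sketch of the server-side pattern described here, representation vectors grouped by agglomerative clustering and queried for neighbors of a user-corrected instance, appears below; the data, feature dimension, and nearest-neighbor recommendation rule are assumptions for illustration, not the CHISSL implementation.

    # Hedged sketch, not the CHISSL code: cluster learned representations once,
    # then suggest instances near a user-corrected one without refitting.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import pairwise_distances

    X = np.random.rand(500, 32)                          # learned representations
    clusters = AgglomerativeClustering(n_clusters=10).fit_predict(X)

    def recommend(corrected_idx, k=5):
        """Return the k instances closest to a user-corrected instance."""
        d = pairwise_distances(X[corrected_idx:corrected_idx + 1], X).ravel()
        return np.argsort(d)[1:k + 1]

    print(clusters[:10], recommend(42))
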
We demonstrate Percolator, a distributed system for graph pattern discovery in dynamic graphs. In contrast to conventional mining systems, Percolator advocates efficient pattern mining schemes that (1) support pattern detection with keywords; (2) integrate incremental and parallel pattern mining; and (3) support analytical queries such as trend analysis. The core idea of Percolator is to...

2017

In this paper we present a technique to train neural network models on small amounts of data. Current methods for training neural networks on small amounts of rich data typically rely on strategies such as fine-tuning a pre-trained neural network or the use of domain-specific hand-engineered features. Here we take the approach of treating network layers, or entire networks, as modules and...
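The sketch below illustrates the modular treatment in its simplest form: reuse a pre-trained block as a frozen module and train only a small new head on the limited data. The layer sizes, optimizer, and toy data are illustrative assumptions rather than the paper's architecture.

    # Illustrative only: one frozen reused module plus one trainable module.
    import torch
    import torch.nn as nn

    pretrained_block = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # reused module
    head = nn.Linear(64, 2)                                           # new task module

    for p in pretrained_block.parameters():
        p.requires_grad = False                                       # freeze reused module

    model = nn.Sequential(pretrained_block, head)
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))           # small dataset
    for _ in range(10):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
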
Graph mining is an important data analysis methodology, but struggles as the input graph size increases. The scalability and usability challenges posed by such large graphs make it imperative to sample the input graph and reduce its size. The critical challenge in sampling is to identify the appropriate algorithm to ensure the resulting analysis does not suffer heavily from the data reduction....
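For context, the simplest form of the sampling step discussed here is shown below: uniform node sampling with an induced subgraph in NetworkX. The synthetic graph and the 10% rate are assumptions for illustration; the paper's point is precisely that the choice of algorithm matters more than this naive baseline suggests.

    # Naive baseline only: uniform node sampling with an induced subgraph.
    import random
    import networkx as nx

    G = nx.barabasi_albert_graph(10_000, 3)          # stand-in for a large input graph
    keep = random.sample(list(G.nodes()), k=G.number_of_nodes() // 10)
    H = G.subgraph(keep).copy()                      # 10% induced subgraph
    print(H.number_of_nodes(), H.number_of_edges())
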
This paper presents the results and conclusions of our participation in the Clickbait Challenge 2017 on automatic clickbait detection in social media. We first describe linguistically-infused neural network models and identify informative representations to predict the level of clickbaiting present in Twitter posts. Our models allow us to answer not only the question of whether a post is clickbait or...
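A hedged sketch of the task setup only (not the linguistically-infused networks described in the paper): scoring the degree of clickbaiting in a post as a regression over simple text features. The example posts and scores are invented for illustration.

    # Task illustration only: regress a clickbait score from text features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    posts = ["You won't believe what happened next",
             "Federal budget passes committee vote",
             "10 tricks mechanics don't want you to know",
             "City council approves new transit plan"]
    scores = [0.9, 0.1, 0.8, 0.1]                    # invented clickbaitiness labels

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
    model.fit(posts, scores)
    print(model.predict(["This one weird trick will change everything"]))
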
During knowledge elicitations with cyber analysts, we uncovered a need for tools that help analysts understand threat alerts in the context of baseline "normal" behaviors. We used an iterative design process to create a prototype alert management system with which we can explore the critical design space for effective baseline visualizations. We report herein on the design elements of...
In this paper, we propose a visual analytics workflow to help data scientists explore, diagnose, and understand the decisions made by a binary classifier. The approach leverages “instance-level explanations”, measures of local feature relevance that explain single instances, and uses them to build a set of visual representations that guide the users in their investigation. The workflow is based on...
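One simple, concrete form of the "instance-level explanation" such a workflow consumes is sketched below: for a linear classifier, the per-feature contribution to a single instance's decision score. The dataset and model are illustrative; the workflow itself is described as consuming local relevance measures, not this particular one.

    # Illustration of a local feature-relevance measure for one instance.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=6, random_state=0)
    clf = LogisticRegression().fit(X, y)

    i = 0                                            # one instance to explain
    contributions = clf.coef_[0] * X[i]              # per-feature contribution to the score
    for j in np.argsort(-np.abs(contributions)):
        print(f"feature {j}: {contributions[j]:+.3f}")
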
Experts, cutting across domains like biology, climate science, cyber, energy, etc., frequently use visualizations as the principal medium for making analytical judgments or for presenting the results of their analysis. However, scientists are often skeptical about adopting new visualization methods over familiar ones, although the latter might be perceptually sub-optimal. This is due to the use...
Visual analytic tools combine the complementary strengths of humans and machines in human-in-the-loop systems. Humans provide invaluable domain expertise and sensemaking capabilities to this discourse with analytic models; however, little consideration has yet been given to the ways inherent human biases might shape the visual analytic process. In this paper, we establish a conceptual framework...
Systems have biases. Their interfaces naturally guide a user toward specific patterns of action. For example, modern word-processors and spreadsheets are both capable of wrapping text, checking spelling, storing tables, and calculating formulas. You could write a paper in a spreadsheet or do simple business modeling in a word-processor. However, their interfaces naturally...
Wall E, LM Blaha, C Paul, KA Cook, and A Endert. 2017. "Four Perspectives on Human Bias in Visual Analytics." In Dealing with Cognitive Biases in Visualisations: a VIS 2017 workshop.
In the age of data science, the use of interactive information visualization techniques has become increasingly ubiquitous. From online scientific journals to the New York Times graphics desk, the utility of interactive visualization for both storytelling and analysis has become ever more apparent. As these techniques have become more readily accessible, the appeal of combining interactive...
Storylines are adept at communicating complex change by encoding time naturally on the x-axis and using the proximity of lines in the y direction to encode an interaction between entities at a particular time. The original definition of a storyline visualization requires data defined in terms of explicit interaction sessions. A more relaxed definition allows storyline visualization to be applied...
Cyber network analysts follow complex processes in their investigations of potential threats to their network. Much research is dedicated to providing automated tool support in the effort to make their tasks more efficient, accurate, and timely. This tool support comes in a variety of implementations from machine learning algorithms that monitor streams of data to visual analytic environments for...
Property graphs can be used to represent heterogeneous networks with attributed vertices and edges. Given one property graph, simulating another graph of the same or greater size with identical statistical properties with respect to the attributes and connectivity is critical for privacy preservation and benchmarking purposes. In this work we tackle the problem of capturing the statistical...
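As a point of reference only, the sketch below reproduces a single connectivity statistic, the degree sequence, with a configuration model in NetworkX; the attributed, property-graph generation the work targets requires substantially more than this baseline.

    # Baseline illustration: random multigraph with the same degree sequence.
    import networkx as nx

    G = nx.karate_club_graph()
    degrees = [d for _, d in G.degree()]
    H = nx.configuration_model(degrees, seed=7)      # same degree sequence, random wiring
    print(sorted(degrees) == sorted(d for _, d in H.degree()))   # True
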
State-of-the-art visual analytics models and frameworks mostly assume a static snapshot of the data, while in many cases it is a stream with constant updates and changes. Exploration of streaming data poses unique challenges as machine-level computations and abstractions need to be synchronized with the visual representation of the data and the temporally evolving human insights. In the visual...
Electron energy loss spectroscopy (EELS) is typically conducted in STEM mode with a spectrometer, or in TEM mode with energy selection. These methods produce a 3D data set (x, y, energy). Some compressive sensing [1,2] and inpainting [3,4,5] approaches have been proposed for recovering a full set of spectra from compressed measurements. In many cases the final form of the spectral data is an...
Compressive Sensing (CS) allows a signal to be sparsely measured first and accurately recovered later in software [1]. In scanning transmission electron microscopy (STEM), it is possible to compress an image spatially by reducing the number of measured pixels, which decreases electron dose and increases sensing speed [2,3,4]. The two requirements for CS to work are: (1) sparsity of basis...
Traditionally, microscopists have worked with the Nyquist-Shannon theory of sampling, which states that to be able to reconstruct the image fully it needs to be sampled at a rate of at least twice the highest spatial frequency in the image. This sampling rate assumes that the image is sampled at regular intervals and that every pixel contains information that is crucial for the image (it even...
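For reference, the sampling condition referred to above can be stated as follows (standard form, not specific to this work):

    % Nyquist-Shannon condition: sampling rate versus highest spatial frequency.
    \[
      f_s \;\ge\; 2 f_{\max}
      \qquad\Longleftrightarrow\qquad
      \Delta x \;\le\; \frac{1}{2 f_{\max}},
    \]
    % where $f_s$ is the regular sampling rate, $f_{\max}$ the highest spatial
    % frequency present in the image, and $\Delta x$ the corresponding pixel spacing.
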
Compressive sensing approaches are beginning to take hold in (scanning) transmission electron microscopy (S/TEM) [1,2,3]. Compressive sensing is a mathematical theory about acquiring signals in a compressed form (measurements) and the probability of recovering the original signal by solving an inverse problem [4]. The inverse problem is underdetermined (more unknowns than measurements), so it is...
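A minimal numerical sketch of the underdetermined recovery problem mentioned here is given below, using orthogonal matching pursuit as one standard sparse solver; the signal length, measurement count, and solver choice are illustrative assumptions and not the S/TEM reconstruction pipeline.

    # Illustrative sparse recovery: y = A x with far fewer measurements than unknowns.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(1)
    n, m, k = 256, 64, 5                              # unknowns, measurements, sparsity
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse signal
    A = rng.standard_normal((m, n))                   # measurement matrix (m << n)
    y = A @ x                                         # compressed measurements

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
    print(np.linalg.norm(omp.coef_ - x))              # small residual on most runs
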
Images that are collected via scanning transmission electron microscopy (STEM) can be undersampled to avoid damage to the specimen while maintaining resolution [1, 2]. We have used BPFA to impute missing data and reduce noise [3]. The reconstruction is typically evaluated using the peak signal-to-noise ratio (PSNR). This measure is too conservative for STEM images and we propose that the Fourier...
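For reference, the PSNR described here as too conservative for STEM images is conventionally defined as:

    % Standard peak signal-to-noise ratio of a reconstruction \hat{I} of image I.
    \[
      \mathrm{PSNR}(I,\hat{I}) \;=\; 10 \log_{10}
        \frac{I_{\max}^{2}}{\tfrac{1}{N}\sum_{i=1}^{N}\bigl(I_i - \hat{I}_i\bigr)^{2}}
      \ \mathrm{dB},
    \]
    % where $I_{\max}$ is the peak intensity and $N$ the number of pixels.
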
In this article, a team of experts in synthetic biology, data analytics, and national security describe the overall supply chain surrounding synthetic biology. The team analyzes selected interactions within that network to better understand the risks raised by synthetic biology and identifies opportunities for risk mitigation. To introduce the concept, the article will briefly describe how an...
Since Wolfgang Pauli posed the question in 1933 of whether the probability densities |Ψ(r)|² (real-space image) and |Ψ(q)|² (reciprocal space image) uniquely determine the wave function Ψ(r) [1], the so-called Pauli Problem has sparked numerous methods in all fields of microscopy [2, 3]. Reconstructing the complete wave function Ψ(r) = a(r)e^(-iφ(r)) with the amplitude a(r) and the phase φ(r) from the...
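The quantities involved can be summarized as follows (standard relations, restated here for readability):

    % Amplitude-phase decomposition and the two measurable densities; the
    % reciprocal-space wave function is the Fourier transform of \Psi(\mathbf{r}).
    \[
      \Psi(\mathbf{r}) = a(\mathbf{r})\, e^{-i\varphi(\mathbf{r})}, \qquad
      \Psi(\mathbf{q}) = \mathcal{F}\{\Psi(\mathbf{r})\}, \qquad
      \text{measured: } |\Psi(\mathbf{r})|^{2} \ \text{and}\ |\Psi(\mathbf{q})|^{2}.
    \]
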
Capacity coefficient analysis could offer a theoretically grounded alternative approach to subjective measures and dual task assessment of cognitive workload. Workload capacity or workload efficiency is a human information processing modeling construct defined as the amount of information that can be processed by the visual cognitive system given a specified amount of time. In this paper, I...
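For readers outside the human-factors literature, one standard form of the capacity coefficient is given below (the OR, or redundant-target, version; stated for context, with notation that may differ from the paper's):

    % Capacity coefficient for a redundant-target (OR) task, where H(t) = -ln S(t)
    % is the cumulative hazard of the response-time distribution and the subscripts
    % denote the double-target (AB) and single-target (A, B) conditions.
    \[
      C_{\mathrm{OR}}(t) \;=\; \frac{H_{AB}(t)}{H_{A}(t) + H_{B}(t)},
    \]
    % with C(t) = 1 read as unlimited-capacity parallel processing, C(t) > 1 as
    % super-capacity, and C(t) < 1 as limited capacity.
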
Scanning transmission electron microscopes (STEM) provide high resolution images at an atomic scale. Unfortunately, the level of electron dose required to achieve these high resolution images results in a potentially large amount of specimen damage. A promising approach to mitigate specimen damage is to subsample the specimen [1, 2, 3]. With random sampling, the microscope creates high resolution...
We developed CHISSL, a human-machine interface that utilizes supervised machine learning in an unsupervised context to help the user group unlabeled instances according to her own mental model. The user primarily interacts via correction (moving a misplaced instance into its correct group) or confirmation (accepting that an instance is placed in its correct group). Concurrent with the user's...
To realize the full potential of machine learning in diverse real-world domains, it is necessary for model predictions to be readily interpretable and actionable for the human in the loop. Analysts, who are the users but not the developers of machine learning models, often do not trust a model because of the lack of transparency in associating predictions with the underlying data space. To...
We introduce new dictionary learning methods for tensor-variate data of any order. We represent each data item as a sum of Kruskal decomposed dictionary atoms within the framework of beta-process factor analysis (BPFA). Our model is nonparametric and can infer the tensor-rank of each dictionary atom. This Kruskal-Factor Analysis (KFA) is a natural generalization of BPFA. We also extend KFA to a...
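For background, the Kruskal (CP) form referenced here expresses a tensor as a sum of rank-one terms; the short sketch below computes such a decomposition with TensorLy for a random 3-way tensor. The tensor size and rank are illustrative, and this is plain CP, not the nonparametric KFA model.

    # Background only: Kruskal/CP decomposition of a random 3-way tensor.
    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import parafac

    X = tl.tensor(np.random.rand(10, 12, 8))
    weights, factors = parafac(X, rank=3)         # sum of 3 rank-one (Kruskal) terms
    print([f.shape for f in factors])             # [(10, 3), (12, 3), (8, 3)]
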
The ability to construct domain specific knowledge graphs (KG) and perform question-answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach for most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge...
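As a toy illustration of the target artifact only (the triples and query below are invented, and this is not the proposed framework), a small knowledge graph can be held as an edge-labeled directed graph and queried by traversal:

    # Toy knowledge graph built from (subject, relation, object) triples.
    import networkx as nx

    triples = [("aspirin", "treats", "headache"),
               ("aspirin", "interacts_with", "warfarin"),
               ("warfarin", "treats", "thrombosis")]

    KG = nx.MultiDiGraph()
    for s, r, o in triples:
        KG.add_edge(s, o, relation=r)

    # Simple question answering by traversal: what does aspirin interact with?
    print([o for _, o, d in KG.out_edges("aspirin", data=True)
           if d["relation"] == "interacts_with"])
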
Deep Learning (DL) algorithms have become the de facto choice for data analysis. Several DL implementations, primarily limited to a single compute node, such as Caffe, TensorFlow, Theano and Torch, have become readily available. Distributed DL implementations capable of execution on large scale systems are becoming important to address the computational needs of large data produced by...
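A conceptual sketch of the synchronous data-parallel pattern that distributed DL implementations commonly build on is shown below using MPI; the parameter vector, learning rate, and single gradient step are illustrative assumptions, not the system described in the paper.

    # Conceptual sketch: each rank computes a local gradient on its data shard,
    # gradients are summed across ranks, and the averaged update is applied.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    weights = np.zeros(1000)                         # replicated model parameters
    local_grad = np.random.rand(1000)                # gradient from this rank's shard

    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    weights -= 0.01 * (global_grad / size)           # synchronous averaged update
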
To promote more interactive and dynamic machine learning, we revisit the notion of user-interface metaphors. User-interface metaphors provide intuitive constructs for supporting user needs through interface design elements. A user-interface metaphor provides a visual or action pattern that leverages a user’s knowledge of another domain. Metaphors suggest both the visual representations that...
Visual data analysis helps people gain insights into data via interactive visualizations. People generate and test hypotheses and questions about data in the context of the domain. This process can generally be referred to as sensemaking. Much of the work on studying sensemaking (and creating visual analytic techniques in support of it) has been focused on static datasets. However, how do the...
Visual analytic systems have long relied on user studies and standard datasets to demonstrate advances to the state of the art, as well as to illustrate the efficiency of solutions to domain-specific challenges. This approach has enabled some important comparisons between systems, but unfortunately the narrow scope required to facilitate these comparisons has prevented many of these lessons from...

2016

The FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault-tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel...
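For orientation, a small single-node FP-Growth run is shown below with mlxtend (the transactions and support threshold are invented); the paper's contribution, fault tolerance in a distributed-memory setting, is of course not captured by this call.

    # Single-node FP-Growth on a toy transaction set, for context only.
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)
    print(fpgrowth(onehot, min_support=0.6, use_colnames=True))
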
Machine Learning and Data Mining (MLDM) algorithms are becoming ubiquitous in model learning from the large volume of data generated using simulations, experiments and handheld devices. Deep Learning algorithms, a class of MLDM algorithms, are applied for automatic feature extraction and for learning non-linear models in unsupervised and supervised settings. Naturally, several libraries...
The recent successes of machine learning (ML) have exposed the need for making models more interpretable and accessible to different stakeholders in the data science ecosystem, like domain experts, data analysts, and even the general public. The assumption here is that higher interpretability will lead to more confident human decision-making based on model outcomes. In this talk, we report on two...
Dealing with the curse of dimensionality is a key challenge in high-dimensional data visualization. We present SeekAView to address three main gaps in the existing research literature. First, automated methods like dimensionality reduction or clustering suffer from a lack of transparency in letting analysts interact with their outputs in real-time to suit their exploration strategies. The results...
Visual analytic systems have long relied on user studies and standard datasets to demonstrate advances to the state of the art, as well as to illustrate the efficiency of solutions to domain-specific challenges. This approach has enabled some important comparisons between systems, but unfortunately the narrow scope required to facilitate these comparisons has prevented many of these lessons from...
Reasoning and querying over data streams rely on the ability to deliver a sequence of stream snapshots to the processing algorithms. These snapshots are typically provided using windows as views into streams and associated window management strategies. Generally, the goal of any window management strategy is to preserve the most important data in the current window and preferentially evict the...
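The baseline against which such strategies are usually framed is a plain count-based FIFO window, sketched below; the capacity and items are illustrative, and the point of the work is to replace the naive "evict oldest" rule with importance-aware eviction.

    # Naive count-based sliding window with FIFO eviction, for contrast.
    from collections import deque

    class SlidingWindow:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = deque()

        def insert(self, item):
            if len(self.items) == self.capacity:
                self.items.popleft()      # evict oldest item (no notion of importance)
            self.items.append(item)

        def snapshot(self):
            return list(self.items)       # view handed to downstream reasoning/queries

    w = SlidingWindow(3)
    for x in range(5):
        w.insert(x)
    print(w.snapshot())                   # [2, 3, 4]
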
In-situ (scanning) transmission electron microscopy (S/TEM) is being developed for numerous applications in the study of nucleation and growth under electrochemical driving forces. For this type of experiment, one of the key parameters is to identify when nucleation initiates. Typically, identifying the moment that crystals begin to form is a manual process requiring the user to...
Combining interactive visualization with automated analytical methods like statistics and data mining facilitates data-driven discovery. These visual analytic methods are beginning to be instantiated within mixed-initiative systems, where humans and machines collaboratively influence evidence-gathering and decision-making. But an open research question is that, when domain experts analyze their...
