About
271 Publications
45,767 Reads
12,890 Citations
Publications (271)
In the rapidly evolving healthcare industry, platforms now have access to not only traditional medical records, but also diverse data sets encompassing various patient interactions, such as those from healthcare web portals. To address this rich diversity of data, we introduce WellFactor: a method that derives patient profiles by integrating info...
Cut-based directed graph (digraph) clustering often focuses on finding dense within-cluster or sparse between-cluster connections, similar to cut-based undirected graph clustering methods. In contrast, for flow-based clusterings the edges between clusters tend to be oriented in one direction and have been found in migration data, food webs, and tra...
A framework is proposed to simultaneously cluster objects and detect anomalies in attributed graph data. Our objective function along with the carefully constructed constraints promotes interpretability of both the clustering and anomaly detection components, as well as scalability of our method. In addition, we developed an algorithm called Outlie...
Advances in single cell transcriptomics have allowed us to study the identity of single cells. This has led to the discovery of new cell types and high resolution tissue maps of them. Technologies that measure multiple modalities of such data add more detail, but they also complicate data integration. We offer an integrated analysis of the spatial...
Human-in-the-loop topic modeling allows users to explore and steer the process to produce better quality topics that align with their needs. When integrated into visual analytic systems, many existing automated topic modeling algorithms are given interactive parameters to allow users to tune or adjust them. However, this has limitations when the al...
We consider the problem of low-rank approximation of massive dense nonnegative tensor data, for example, to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution...
Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, and Gaussian), and solve the factorization by minimizing some empirical loss functions defined based on the corresponding distribution. However, this approach suffers from several drawbacks: 1) In reality, the underlying distribution...
Motivation: Recent advances in single cell transcriptomics have allowed us to examine the identity of single cells, which has led to the discovery of new cell types and high resolution maps of cell type composition in tissues. Technologies that measure multiple modalities of single cell data provide a more comprehensive picture of a cell, but they a...
Existing tensor factorization methods assume that the input tensor follows some specific distribution (e.g., Poisson, Bernoulli, and Gaussian), and solve the factorization by minimizing some empirical loss functions defined based on the corresponding distribution. However, this approach suffers from several drawbacks: 1) In reality, the underlying distributions...
Nonnegative matrix factorization (NMF) is a prominent technique for data dimensionality reduction that has been widely used for text mining, computer vision, pattern discovery, and bioinformatics. In this paper, a framework called ARkNLS (Alternating Rank-k Nonnegativity constrained Least Squares) is proposed for computing NMF. First, a recursive f...
We propose a flexible framework for clustering hypergraph-structured data based on recently proposed random walks utilizing edge-dependent vertex weights. When incorporating edge-dependent vertex weights (EDVW), a weight is associated with each vertex-hyperedge pair, yielding a weighted incidence matrix of the hypergraph. Such weightings have been...
Complex relationships among entities can be modeled very effectively using hypergraphs. Hypergraphs model real-world data by allowing a hyperedge to include two or more entities. Clustering of hypergraphs enables us to group the similar entities together. While most existing algorithms solely consider the connection structure of a hypergraph to sol...
Phenotyping electronic health records (EHR) focuses on defining meaningful patient groups (e.g., heart failure group and diabetes group) and identifying the temporal evolution of patients in those groups. Tensor factorization has been an effective tool for phenotyping. Most of the existing works assume either a static patient representation with ag...
We consider the problem of low-rank approximation of massive dense non-negative tensor data, for example to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution...
A hybrid method called JointNMF is presented which is applied to latent information discovery from data sets that contain both text content and connection structure information. The new method jointly optimizes an integrated objective function, which is a combination of two components: the Nonnegative Matrix Factorization (NMF) objective function for...
Topic modeling is commonly used to analyze and understand large document collections. However, in practice, users want to focus on specific aspects or "targets" rather than the entire corpus. For example, given a large collection of documents, users may want only a smaller subset which more closely aligns with their interests, tasks, and domains. I...
Understanding narrative content has become an increasingly popular topic. Nonetheless, research on identifying common types of narrative characters, or personae, is impeded by the lack of automatic and broad-coverage evaluation methods. We argue that computationally modeling actors provides benefits, including novel evaluation mechanisms for person...
This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doi...
Embedding and visualizing large-scale high-dimensional data in a two-dimensional space is an important problem, because such visualization can reveal deep insights into complex data. However, most of the existing embedding approaches run at an excessively high precision, even when users want to obtain a brief insight from a visualization of large-sca...
Co-clustering computes clusters of data items and the related features concurrently, and it has been used in many applications such as community detection, product recommendation, computer vision, and pricing optimization. In this paper, we propose a new co-clustering method, called CoDiNMF, which improves the clustering quality and finds direction...
Detecting anomalous events of a particular area in a timely manner is an important task. Geo-tagged social media data are a useful resource for this task; however, the abundance of everyday language in them still makes this task challenging. To address such challenges, we present TopicOnTiles, a visual analytics system that can reveal information rel...
Understanding of narrative content has become an increasingly popular topic. However, narrative semantics pose difficult challenges as the effects of multiple narrative facets, such as the text, events, character types, and genres, are tightly intertwined. We present a joint representation learning framework for embedding actors, literary character...
Integer datasets frequently appear in many applications in science and engineering. To analyze these datasets, we consider an integer matrix approximation technique that can preserve the original dataset characteristics. Because integers are discrete in nature, to the best of our knowledge, no previously proposed technique developed for real number...
SmallK is a high performance software package for low rank matrix approximation via the nonnegative matrix factorization (NMF). NMF is a constrained low rank approximation where a matrix is approximated by a product of two nonnegative factors. The role of NMF in data analytics has been as significant as the singular value decomposition (SVD). Howev...
In this article, we present an interactive visual information retrieval and recommendation system, called VisIRR, for large-scale document discovery. VisIRR effectively combines the paradigms of (1) a passive pull through query processes for retrieval and (2) an active push that recommends items of potential interest to users based on their prefere...
Detecting anomalous events of a particular area in a timely manner is an important task. Geo-tagged social media data are a useful resource for this task, but the abundance of everyday language in them still makes this task challenging. To address such challenges, we present TopicOnTiles, a visual analytics system that can reveal the information rele...
Event detection is a very important problem across many domains and is broadly applicable, encompassing many disciplines within engineering systems. In this paper, we focus on improving the user’s ability to quickly identify threat events such as malware, military policy violations, and natural environmental disasters. The information to perform t...
Background: Community discovery is an important task for revealing structures in large networks. The massive size of contemporary social networks poses a tremendous challenge to the scalability of traditional graph clustering algorithms and the evaluation of discovered communities. Methods: We propose a divide-and-conquer strategy to discover hierar...
Notes for SmallK release 2017/07/21: 1. All code compiled with gcc 7.1.0; 2. Vagrant installation upgraded; 3. Docker installation available; 4. OSX Sierra SIP (system integrity protection) issue resolved.
See first comment for additional release notes.
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF us...
Hierarchical organization of information processing in brain networks is known to exist and has been widely studied. To find proper hierarchical structures in the macaque brain, traditional methods need the entire set of pairwise hierarchical relationships between cortical areas. In this paper, we present a new method that discovers hierarchical st...
We present a hybrid method for latent information discovery on the data sets containing both text content and connection structure based on constrained low rank approximation. The new method jointly optimizes the Nonnegative Matrix Factorization (NMF) objective function for text clustering and the Symmetric NMF (SymNMF) objective function for grap...
We present a hybrid method for latent information discovery on the data sets containing both text content and connection structure based on constrained low rank approximation. The new method jointly optimizes the Nonnegative Matrix Factorization (NMF) objective function for text clustering and the Symmetric NMF (SymNMF) objective function for graph...
Autonomous vehicles (AVs) have emerged as a transformative technology with the potential both to fundamentally improve lives in cities and to exacerbate suburban sprawl, vehicle miles traveled, and the associated greenhouse gas emissions. Are communities willing to adopt best practices that can lead to early adoption of more sustainable outcome...
One of the key advantages of visual analytics is its capability to leverage both humans' visual perception and the power of computing. A big obstacle in integrating machine learning with visual analytics is its high computing cost. To tackle this problem, this paper presents PIVE (Per-Iteration Visualization Environment) that supports real-time in...
The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. In such cases, it often becomes difficult to separate the outliers from the natural variations in the patterns in the underlying data. In this paper, we present a matrix factoriz...
Visual analytics techniques help users explore high-dimensional data. However, it is often challenging for users to express their domain knowledge in order to steer the underlying data model, especially when they have little attribute-level knowledge. Furthermore, users' complex, high-level domain knowledge, compared to low-level attributes, posits...
Embedding and visualizing large-scale high-dimensional data in a two-dimensional space is an important problem since such visualization can reveal deep insights out of complex data. Most of the existing embedding approaches, however, run on an excessively high precision, ignoring the fact that at the end, embedding outputs are converted into coarse...
We present a visual analytics system that supports the geospatiotemporal analysis of social media data based on a large-scale distributed topic modeling technique. Through the analysis of social media data in a given time and region, we can identify critical events in real time. However, it takes significant time to perform such analyses against a...
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors $W$ and $H$, for the given input matrix $A$, such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in soc...
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A ≈ WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks...
In this chapter, we propose a new improved scalable low rank approximation algorithm for such bounded matrices called Bounded Matrix Low Rank Approximation (BMA) that bounds every element of the approximation PQ. We also present an alternate formulation to bound existing recommender system algorithms called BALS and discuss its convergence. Our expe...
Clustering high-dimensional data and making sense out of its result is a challenging problem. In this paper, we present a weakly supervised nonnegative matrix factorization (NMF) and its symmetric version that take into account various prior information via regularization in clustering applications. Unlike many other existing methods, the proposed...
Integer data sets frequently appear in many applications in sciences and technology. To analyze these, integer low rank approximation has received much attention due to its capacity to represent the results in integers, preserving the meaning of the original data sets. To our knowledge, none of the previously proposed techniques developed for real nu...
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data. In this paper, we propose a fast method for hierarchical clustering and topic modeling called HierNMF2. Our method is based on fast Rank-2 nonnegative matrix factorization (NMF) that performs binary clustering and an efficient n...
Nonnegative matrix factorization (NMF) approximates a nonnegative matrix by the product of two low-rank nonnegative matrices. Since it gives semantically meaningful results that are easily interpretable in clustering applications, NMF has been widely used as a clustering method, especially for document data, and as a topic modeling method. We describe...
Scatterplots are effective visualization techniques for multidimensional data that use two (or three) axes to visualize each data item as a point at its corresponding x and y Cartesian coordinates. Typically, each axis is bound to a single data attribute. Interactive exploration occurs by changing the data attributes bound to each of these axes. In the...
Understanding large-scale document collections in an efficient manner is an important problem. Usually, document data are associated with other information (e.g., an author's gender, age, and location) and their links to other entities (e.g., co-authorship and citation networks). For the analysis of such data, we often have to reveal common as well...
Nonnegative matrix factorization (NMF) provides a lower rank approximation of a matrix by a product of two nonnegative factors. NMF has been shown to produce clustering results that are often superior to those by other methods such as K-means. In this paper, we provide further interpretation of NMF as a clustering method and study an extended formu...
This paper contributes a method for combining sparse parallel graph algorithms with dense parallel linear algebra algorithms in order to understand dynamic graphs including the temporal behavior of vertices. Our method is the first to cluster vertices in a dynamic graph based on arbitrary temporal behaviors. In order to successfully implement this...
The n-gram model has been widely used to capture the local ordering of words, yet its exploding feature space often causes an estimation issue. This paper presents local context sparse coding (LCSC), a non-probabilistic topic model that effectively handles large feature spaces using sparse coding. In addition, it introduces a new concept of localit...
We present our process and analysis for VAST 2014 Mini Challenge 1 and 2, which integrate an off-the-shelf tool, Jigsaw, rapid web-based visualization prototyping using D3, and analytics-based visualizations using Matlab.
Sentiment analysis predicts a one-dimensional quantity describing the positive or negative emotion of an author. Mood analysis extends the one-dimensional sentiment response to a multi-dimensional quantity, describing a diverse set of human emotions. In this paper, we extend sentiment and mood analysis temporally and model emotions as a function of...
Tracking the movement of vehicles in urban environments using fixed position sensors, mobile sensors, and crowd-sourced data is a challenging but important problem in applications such as law enforcement and defense. A dynamic data driven application system (DDDAS) is described to track a vehicle's movements by repeatedly identifying the vehicle un...
Problems that involve separating grouped outliers from the low-rank part of a given data matrix have attracted great attention in recent years in image analysis applications such as background modeling and face recognition. In this paper, we introduce a new formulation called Linf-norm based robust asymmetric nonnegative matrix factorization (RANMF) for the...
A main bottleneck in integrating computational methods with visual analytics is their significant computational cost, which hinders real-time interactive visualization with them. To solve this, we present PIVE (Per-Iteration Visualization Environment), which visualizes intermediate results from algorithm iterations, thus allowing users to...
We present VisIRR, an interactive visual information retrieval and recommendation system for large-scale document data. Starting with a query, VisIRR visualizes the retrieved documents in a scatter plot along with their topic summary. Next, based on interactive personalized preference feedback on the documents, VisIRR collects and visual...
Due to an ever-increasing amount of document data and the complexities involved in their analyses that will reveal meaningful insights, it is crucial to guide users in their decision-making processes using advanced methods that are both interactive and information preserving. Numerous computational approaches from machine learning, data mining, in...
Micro-finance organizations provide non-profit lending opportunities to mitigate poverty by financially supporting impoverished, yet skilled entrepreneurs who are in desperate need of an institution that lends to them. In Kiva.org, a widely-used crowd-funded micro-financial service, a vast amount of micro-financial activities are done by lending te...
Non-profit micro-finance organizations provide lending opportunities to eradicate poverty by financially equipping impoverished, yet skilled entrepreneurs who are in desperate need of an institution that lends to those who have little. Kiva.org, a widely-used crowd-funded micro-financial service, provides researchers with an extensive amount of pub...
We review algorithms developed for nonnegative matrix factorization (NMF) and nonnegative tensor factorization (NTF) from a unified view based on the block coordinate descent (BCD) framework. NMF and NTF are low-rank approximation methods for matrices and tensors in which the low-rank factors are constrained to have only nonnegative elements. The n...
Topic modeling has been widely used for analyzing text document collections. Recently, there have been significant advancements in various topic modeling techniques, particularly in the form of probabilistic graphical modeling. State-of-the-art techniques such as Latent Dirichlet Allocation (LDA) have been successfully applied in visual text analyt...
The traditional data analysis tools support strong computational capabilities and numerous standard visualization techniques. However, they provide little visual interactions due to the fact that the tools maintain a wide applicability to diverse data domains, and thus any inherent meanings associated with the data domains are hardly allowed. To co...
Nonnegative least squares problems are useful in applications where the physical nature of the problem domain permits only additive linear combinations. We discuss the l1-regularized nonnegative least squares (L1-NLS) problem, where l1-regularization is used to induce sparsity. Although l1-regularization has been successfully used in least squares...
In this paper, we design a hierarchical clustering algorithm for high-resolution hyperspectral images. At the core of the algorithm, a new rank-two nonnegative matrix factorization (NMF) algorithm is used to split the clusters, which is motivated by convex geometry concepts. The method starts with a single cluster containing all pixels, and, at ea...
We present Lytic, a domain-independent, faceted visual analytic (VA) system for interactive exploration of large datasets. It combines a flexible UI that adapts to arbitrary character-separated value (CSV) datasets with algorithmic preprocessing to compute unsupervised dimension reduction and cluster data from high-dimensional fields. It provides a...
Nonnegative matrix factorization (NMF) has been successfully used as a clustering method, especially for flat partitioning of documents. In this paper, we propose an efficient hierarchical document clustering method based on a new algorithm for rank-2 NMF. When the two-block coordinate descent framework of nonnegative least squares is applied to com...
Many modern data sets can be represented in high dimensional vector spaces and have benefited from computational methods that utilize advanced techniques from numerical linear algebra and optimization. Visual analytics approaches have contributed greatly to data understanding and analysis due to utilization of both automated algorithms and human's...