## About

89

Publications

9,790

Reads

797

Citations

Additional affiliations

August 2009 - present

March 1991 - July 2009

Education

January 1983 - October 1989

## Publications

This special issue of Information Visualization explores the technical challenges and technology development opportunities of graph visual analytics arising from the trend of big data. Big graph visual analytics is about applying visualization and analytics techniques to gather, analyze, and understand big graphs and the knowledge behind them.

In this work we propose a novel formulation that models the attack and compromise on a cyber network as a combination of two parts: direct compromise of a host, and compromise occurring through the spread of the attack on the network from a compromised host. The model parameters for the nodes are a concise representation of the host profiles th...

In the past decade, existing and new knowledge and datasets have been encoded in different ontologies for semantic web and biomedical research. Ontologies are often very large in terms of the number of concepts and relationships, which makes the analysis of ontologies and the represented knowledge graph computationally expensive and time consuming. As t...

This chapter discusses the approaches integrated in GEMS (Graph database Engine for Multithreaded Systems) for managing and querying datasets of RDF (Resource Description Framework) triples. GEMS is a software stack that implements graph databases on top of commodity, high-performance clusters. GEMS is composed of a SPARQL-to-C++ compiler, a librar...

A software stack relies primarily on graph-based methods to implement scalable resource description framework databases on top of commodity clusters, providing an inexpensive way to extract meaning from volumes of heterogeneous data.

Many fields require organizing, managing, and analyzing massive amounts of data. Among them, we can find social network analysis, financial risk management, threat detection in complex network systems, and medical and biomedical databases. For these areas, there is a problem not only in terms of size but also in terms of performance, because the pr...

Data-intensive science simultaneously derives from and creates the need for large quantities of data. As such, scientists increasingly need to discover and analyze new datasets from diverse sources. Beyond the sheer volume of data, issues posed by the resultant data heterogeneity are often overlooked. We postulate that heterogeneity challenges can...

Big data problems are often more akin to sparse graphs than to relational tables. As such, we argue that graph-based physical representations provide advantages in terms of both size and speed for executing queries. Drawing from research in sparse matrices, we use a compressed sparse row (CSR) format to model graph-oriented data. We also present...
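To illustrate the compressed sparse row (CSR) idea mentioned in this abstract, here is a minimal sketch in plain Python. It is not the authors' implementation; the function names `build_csr` and `neighbors` and the toy edge list are assumptions made for illustration only.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from directed (src, dst) edges."""
    # Count the out-degree of every vertex, shifted by one position.
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    # A prefix sum turns the counts into the start offset of each row.
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]
    # Fill the column indices, advancing one write cursor per row.
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def neighbors(row_ptr, col_idx, v):
    """Out-neighbors of v are one contiguous slice of col_idx."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]


# Hypothetical toy graph with 4 vertices and 4 directed edges.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
row_ptr, col_idx = build_csr(4, edges)
```

The graph is stored in two flat arrays whose total length is the number of vertices plus the number of edges plus one, and a neighbor lookup is a single slice, which is the kind of size and speed advantage the abstract argues for.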

This article presents GEMS, a full software system for accelerating large-scale graph databases on commodity clusters. Unlike current approaches, GEMS addresses graph databases by primarily employing graph-based methods, which is reflected at all levels of the stack. On the one hand, this allows exploiting the space efficiency of graph data structu...

Science is increasingly motivated by the need to process larger quantities of data. It is facing severe challenges in data collection, management, and processing, so much so that the computational demands of "data scaling" are competing with, and in many fields surpassing, the traditional objective of decreasing processing time. Example domains wit...

We are developing a full software system for accelerating semantic graph databases on commodity clusters that scales to hundreds of nodes while maintaining constant query throughput. Our framework comprises a SPARQL-to-C++ compiler, a library of parallel graph methods, and a custom multithreaded runtime layer, which provides a Partitioned Global Addr...

The emergence of petascale triple stores has motivated the investigation of alternatives to traditional table-based relational methods. Since triple stores represent data as structured tuples, graphs are a natural data structure for encoding their information. The use of graph data structures, rather than tables, requires us to rethink the methods u...

We consider cyber traffic analysis (TA) as a challenge problem for research in graph database systems. TA involves observing and analyzing connections between clients, servers, hosts, and actors within IP networks, over time, to detect suspicious patterns. Towards that end, NetFlow (or more generically, IPFLOW) data are available from routers and s...

A recent trend in hardware development is producing computing systems that are stretching the number of cores and size of shared-memory beyond where most fundamental serial algorithms perform well. The expectation is that this trend will continue. So it makes sense to rethink our fundamental algorithms such as sorting. There are many situations whe...

Graph mining algorithms that seek to find interesting structure in a graph are compelling for many reasons but may not lead to useful information learned from the data. This position paper explores the current graph mining approaches and suggests why certain algorithms may provide misleading information whereas others may be just what is needed. In...

Large-scale L1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordin...

The goal of N - x contingency selection is to pick a subset of critical cases to assess their potential to initiate a severe crippling of an electric power grid. Even for a moderate-sized system there can be an overwhelmingly large number of contingency cases that need to be studied. The number grows exponentially with x. This combinatorial explosi...

Neighboring regional transmission organizations (RTOs) and independent system operators (ISOs) exchange electric power to enable efficient and reliable operation of the grid. The net interchange (NI) schedule is the sum of the transactions (in MW) between an RTO/ISO and its neighbors. Effective forecasting of the amount of actual NI can improve grid ope...

We present a generic framework for parallel coordinate descent (CD) algorithms that includes, as special cases, the original sequential algorithms Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm. We introduce two novel parallel algorithms that are also special cases, Thread-Greedy CD and Coloring-Based CD, and give p...

A recent focus in itemset mining has been the discovery of frequent itemsets from high-dimensional datasets. With exponentially increasing running time as average row length increases, mining such datasets renders most conventional algorithms impractical. Unfortunately, large cardinality itemsets are likely to be more informative than small cardina...

Electrical power grid contingency analysis aims to understand the impact of potential component failures and assess a system's capability to tolerate them. Exploring all potential x-component failures, even for modest sizes of x > 1, is computationally infeasible due to the combinatorial explosion of cases to consider. A common app...

To date, the application of high-performance computing resources to Semantic Web data has largely focused on commodity hardware and distributed memory platforms. In this paper we make the case that more specialized hardware can offer superior scaling and close to an order of magnitude improvement in performance. In particular we examine the Cray XM...

As semantic graph database technology grows to address components ranging from large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to understand their inherent semantic structure, whether codified in explicit ontologies or not. Our group is researching novel methods fo...

A key challenge to automated clustering of documents in large text corpora is the high cost of comparing documents in a multi-million dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a distance metric that obeys the triangle inequality; the algorithm strives to minimize the number of dis...

Given a semantic graph data set, perhaps one lacking in an explicit ontology, we wish to first identify its significant semantic structures, and then measure the extent of their significance. Casting a semantic graph dataset as an edge-labeled, directed graph, this task can be built on the ability to mine frequent labeled subgraphs in edge-labeled,...

The dominant parallel programming models for shared memory computers, Pthreads and OpenMP, are both thread-centric in that they are based on explicit management of tasks and enforce data dependencies and output ordering through task management. By comparison, the Cray XMT programming model is data-centric where the primary concern of the programme...

Two of the most commonly used hashing strategies, linear probing and hashing with chaining, are adapted for efficient execution on a Cray XMT. These strategies are designed to minimize memory contention. Datasets that follow a power law distribution cause significant performance challenges to shared memory parallel hashing implementations. Experiment...

As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with resp...

Three parallel implementations of a divide-and-conquer search algorithm (called SUDA2) for finding minimal unique itemsets (MUIs) are compared in this paper. The identification of MUIs is used by national statistics agencies for statistical disclosure assessment. The first parallel implementation adapts SUDA2 to a symmetric multi-processor cluster...

We present a database of spectral lags and internal luminosity function (ILF) measurements for gamma-ray bursts (GRBs) in the BATSE catalog. Measurements were made using 64 ms count rate data and are defined for various combinations of the four broadband BATSE energy channels. We discuss the processes used for measuring lags and ILF characteristics...

Guided by the supervised pattern recognition algorithm C4.5 developed by Quinlan in 1986, we examine the three gamma-ray burst classes identified by Mukherjee et al. in 1998. C4.5 provides strong statistical support for this classification. However, with C4.5 and our knowledge of the BATSE instrument, we demonstrate that class 3 (intermediate fluen...

We present a database of spectral lags and internal luminosity function (ILF) measurements for gamma-ray bursts (GRBs) in the BATSE catalog. Measurements were made using 64 ms count rate data and are defined for various combinations of the four broadband BATSE energy channels. (3 data files).

SUDA2 is a recursive search algorithm for minimal unique itemset detection. Such sets of items are formed via combinations of non-obvious attributes enabling individual record identification. The nature of SUDA2 allows work to be divided into non-overlapping tasks enabling parallel execution. Earlier work developed a parallel implementation for SUD...

A new algorithm, SUDA2, is presented which finds minimally unique itemsets, i.e., minimal itemsets of frequency one. These itemsets, referred to as Minimal Sample Uniques (MSUs), are important for statistical agencies that wish to estimate the risk of disclosure of their datasets. SUDA2 is a recursive algorithm which uses new observations about the p...
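To make the definition of a Minimal Sample Unique concrete, the following is a minimal brute-force sketch. It is deliberately not SUDA2: it enumerates every attribute combination instead of pruning the search space, and it is only practical for tiny tables. The function name `minimal_sample_uniques` and the toy records are assumptions for illustration.

```python
from itertools import combinations


def minimal_sample_uniques(records):
    """Map each row index to its minimal unique itemsets (exhaustive search)."""

    def support(itemset):
        # Number of records matching every (column, value) pair in itemset.
        return sum(all(rec[c] == v for c, v in itemset) for rec in records)

    result = {}
    for i, rec in enumerate(records):
        items = list(enumerate(rec))  # (column index, value) pairs of this row
        found = []
        # Visit itemsets by increasing size so smaller uniques are found first.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                # Skip supersets of an already-found unique: they are not minimal.
                if any(set(m) <= set(combo) for m in found):
                    continue
                if support(combo) == 1:  # frequency one: only this row matches
                    found.append(combo)
        result[i] = found
    return result


# Hypothetical toy microdata: (sex, age band, city).
records = [
    ("F", "30s", "NY"),
    ("F", "40s", "NY"),
    ("M", "30s", "NY"),
    ("M", "30s", "LA"),
]
msus = minimal_sample_uniques(records)
```

In this toy table the second row is identified by its age band alone, while the first row needs the combination of sex and age band. Rows with small MSUs are the ones a disclosure-risk assessment would flag.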

Statistical evidence exists for two or more gamma-ray burst (GRB) subclasses. Pattern recognition algorithms also find this. However, not all statistical clusterings necessarily indicate separate source populations. Subclass identification can aid our understanding of systematic observational and instrumental biases. We demonstrate several computation...

Gamma-Ray Burst (GRB) prompt emission contains information that can be used to infer structure of the relativistic outflow. Spectral lags, the Internal Luminosity Function (ILF), and Color-Color Diagrams are attributes that provide diagnostics with which jet structure can be studied. These attributes help delineate properties of internal shocks ori...

The scientific method encourages sharing data with other researchers to independently verify conclusions. Currently, technical barriers impede such public scrutiny. A strategy for offering scientific data for public analysis is described. With this strategy, effectively no requirements of software installation (other than a web browser) or data man...

We present SUDA2, a recursive algorithm for finding Minimal Sample Uniques (MSUs). SUDA2 uses a novel method for representing the search space for MSUs and new observations about the properties of MSUs to prune and traverse this space. Experimental comparisons with previous work demonstrate that SUDA2 is not only several orders of magnitude faster bu...

Despite being the most energetic phenomenon in the known universe, the astrophysics of gamma-ray bursts (GRBs) has still proven difficult to understand. It has only been within the past five years that the GRB distance scale has been firmly established, on the basis of a few dozen bursts with x-ray, optical, and radio afterglows. The afterglows ind...

Data mining techniques indicate that three gamma-ray burst classes exist. Our analysis indicates that the intermediate class is produced not from the fluence duration bias but from sample incompleteness caused by BATSE's preference to trigger on shorter bursts relative to longer bursts. We introduce the dual timescale peak flux to minimize this pre...

We review GRB classification using statistical clustering and data mining techniques. These techniques generally indicate that more than two GRB subgroups exist. The number of subgroups discovered depends on the database size, classification algorithm(s) used, choice of data attributes and way in which measurement errors are treated in the analysis...

We classify BATSE gamma-ray bursts using unsupervised clustering algorithms in order to compare classification with statistical clustering techniques. BATSE bursts detected with homogeneous trigger criteria and measured with a limited attribute set (duration, hardness, and fluence) are classified using four unsupervised algorithms (the concept hier...

The Internal Luminosity Function (ILF) is the differential distribution of luminosity measured within a gamma‐ray burst (GRB). Most GRBs are found to have pseudo power‐law ILFs; the properties of the ILF power‐law index α have been examined by Horack and Hakkila for a sample of 50 bright GRBs (ApJ 479, 371). We measure α values for 348 BATSE GRBs, t...

The GRB ToolSHED is an online suite of induction-based machine learning and statistical tools designed for gamma-ray burst classification and cluster analysis. The ToolSHED also includes a large preprocessed gamma-ray burst database. We report on the current status of the ToolSHED.

Unsupervised pattern-recognition algorithms support the existence of three gamma-ray burst classes: class 1 (long, large-fluence bursts of intermediate spectral hardness), class 2 (short, small-fluence, hard bursts), and class 3 (soft bursts of intermediate durations and fluences). The algorithms surprisingly assign larger membership to class 3 tha...

Statistical evidence exists for two or more gamma-ray burst (GRB) subclasses. Pattern recognition algorithms also find this. However, not all statistical clusterings necessarily indicate separate source populations. Subclass identification can aid our understanding of systematic observational and instrumental biases. We demonstrate several computation...

We describe the Message Minimizing Load Redistribution Problem, which arises from the need to redistribute work when performing load balancing in a parallel computing environment. We consider a global perspective and seek a redistribution plan that minimizes the overall processing time. We define the cost associated with a solution to be the number...

Gamma-ray bursts provide what is probably one of the messiest of all astrophysical data sets. Burst class properties are indistinct, as overlapping characteristics of individual bursts are convolved with effects of instrumental and sampling biases. Despite these complexities, data mining techniques have allowed new insights to be made about gamma-r...

We describe the design of a suite of software tools to allow users to query Gamma Ray Burst (GRB) data and perform data mining expeditions. We call this suite of tools a shed (SHell for Expeditions using Datamining). Our schedule is to have a completed prototype (funded via the NASA AISRP) by February, 2002. Meanwhile, interested users will find a...

We use ESX, a product of Information Acumen Corporation, to perform unsupervised learning on a data set containing 797 gamma-ray bursts taken from the BATSE 3B catalog. Assuming all attributes to be lognormally distributed, Mukherjee et al. (1998) analyzed these same data using a statistical cluster analysis. Utilizing the logarithmic values for T9...

The three gamma-ray burst (GRB) classes identified by statistical clustering analysis (Mukherjee et al. 1998) are examined using the pattern recognition algorithm C4.5 (Quinlan 1986). Although the statistical existence of Class 3 (intermediate duration, intermediate fluence, soft) is supported, the properties of this class do not need to arise from...

The fluence duration bias causes fluences and durations of faint gamma-ray bursts to be systematically underestimated relative to their peak fluxes. Using Monte Carlo analysis, we demonstrate how this effect explains characteristics of structure of the fluence vs. 1024 ms peak flux diagram. Evidence of this bias exists in the BATSE fluence duration...

A tournament is a complete directed graph. A convex subset is a vertex subset with the property that every two-path beginning and ending inside the convex subset is contained completely within the subset. This paper shows a relationship between convex subsets and transitive closures which leads to an optimal O(n^3)-time algorithm for finding all c...

An understanding of gamma-ray burst (GRB) physics is dependent upon interpreting the large body of GRB spectral and temporal data. Although many GRB spectral and temporal attributes have been identified by various researchers, considerable disagreement exists as to the physical meaning and relative importance of each. We present preliminary but pro...

Artificial intelligence (AI) classifiers can be used to classify unknowns, refine existing classification parameters, and identify/screen out ineffectual parameters. We present an AI methodology for classifying new gamma-ray bursts, along with some preliminary results. Comment: 5 pages, 2 postscript figures. To appear in the Fourth Huntsville Gamma...

A computational study of some logarithmic barrier decomposition algorithms for semi-infinite programming is presented in this paper. The conceptual algorithm is a straightforward adaptation of the logarithmic barrier cutting plane algorithm which was presented recently by den Hartog et al. (Annals of Operations Research, 58, 69–98, 1995), to solve...

A parallel approximation algorithm for the MAXIMUM 2-CNF SATISFIABILITY problem is presented. This algorithm runs in O(log^2(n + |F|)) parallel time on a CREW PRAM machine using O(n + |F|) processors, where n is the number of variables and |F| is the number of clauses. Performance guarantees are considered for three slightly differing definitions...

A work-efficient deterministic NC algorithm is presented for finding a maximum matching in a bipartite expander graph with any expansion factor β > 1. This improves upon a recently presented deterministic NC maximum matching algorithm which is restricted to those bipartite expanders with large expansion factors (β ≥ Δ^ε, ε > 0), and is no...

A tournament is a complete directed graph. A convex subset is a vertex subset with the property that every two-path beginning and ending inside the convex subset is contained completely within the subset. This paper shows a relationship between convex subsets and transitive closures which leads to an optimal O(n^3)-time algorithm for finding all c...

A tournament is a complete directed graph. A convex subset is a vertex subset with the property that every two-path beginning and ending inside the convex subset is contained completely within the subset. This paper shows that every nontrivial convex subset is the closure of a subset of vertices of cardinality two. This result leads to algorithms t...
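The definition of a convex subset given in this abstract can be checked directly. The sketch below is only a literal test of that definition, not the algorithm from the paper; the adjacency-matrix encoding `beats`, the function name `is_convex`, and the two toy tournaments are assumptions for illustration.

```python
def is_convex(beats, subset):
    """Check convexity of a vertex subset in a tournament.

    beats[u][v] is True when the tournament contains the arc u -> v.
    The subset is convex if every two-path u -> w -> v whose endpoints
    u and v lie inside the subset also has its middle vertex w inside.
    """
    n = len(beats)
    inside = set(subset)
    for u in inside:
        for v in inside:
            if u == v:
                continue
            for w in range(n):
                # A two-path that leaves the subset violates convexity.
                if w not in inside and beats[u][w] and beats[w][v]:
                    return False
    return True


# Hypothetical examples: a 3-cycle (0 -> 1 -> 2 -> 0) and a
# transitive tournament (0 -> 1, 0 -> 2, 1 -> 2).
cycle = [
    [False, True, False],
    [False, False, True],
    [True, False, False],
]
transitive = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
```

In the 3-cycle, `{0, 1}` is not convex because the path 1 -> 2 -> 0 passes through vertex 2. In the transitive tournament, `{0, 1}` is convex while `{0, 2}` is not, since 0 -> 1 -> 2 leaves the subset.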

We show that for any constant k > 0, a matching with cardinality at least 1-1/(k+1) times the maximum can be computed in NC.

The maximum cut problem is known to be an important NP-complete problem with many applications. The authors investigate this problem (which they call the normal maximum cut problem) and a variant of it (which is referred to as the connected maximum cut problem). They show that any n-vertex e-edge graph admits a cut with at least the fraction 1/2+...