
Jonathan W. Berry - Sandia National Laboratories
Publications: 70
Reads: 10,437
Citations: 2,488
Publications (70)
Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous pl...
Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitt...
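The real-time reporting condition in the abstract above can be illustrated with a minimal sketch. It assumes the stream length N is known in advance and keeps an exact count for every item, which is a simplification for illustration and not the method of the paper. The function name `report_heavy_hitters` and its parameters are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Iterator, Tuple


def report_heavy_hitters(
    stream: Iterable[Hashable], phi: float, n: int
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) when an item reaches its T-th occurrence.

    T = ceil(phi * n) is the heavy-hitter threshold. Positions are 1-based.
    Exact counting is an assumption made for clarity; it uses memory
    proportional to the number of distinct items.
    """
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must be in (0, 1]")
    threshold = max(1, math.ceil(phi * n))
    counts: Counter = Counter()
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        # Report exactly once, at the moment the item becomes a heavy hitter.
        if counts[item] == threshold:
            yield position, item


if __name__ == "__main__":
    events = list("abacabad")
    for position, item in report_heavy_hitters(events, phi=0.25, n=len(events)):
        print(f"{item!r} became a heavy hitter at position {position}")
```

Because each item is reported only when its count equals the threshold, every heavy hitter is emitted once and with no delay after its T-th occurrence.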
Motivated by the properties of unending real-world cybersecurity streams, we present a new graph streaming model: XStream. We maintain a streaming graph and its connected components at single-edge granularity. In cybersecurity graph applications, input streams typically consist of edge insertions; individual deletions are not explicit. Analysts mai...
Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacenc...
We introduce HITMIX, a new technique for network seed-set expansion, i.e., the problem of identifying a set of graph vertices related to a given seed-set of vertices. We use the moments of the graph's hitting-time distribution to quantify the relationship of each non-seed vertex to the seed-set. This involves a deterministic calculation for the hit...
Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of terascale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used ex...
A key problem in social network analysis is to identify nonhuman interactions. State‐of‐the‐art bot‐detection systems like Botometer train machine‐learning models on user‐specific data. Unfortunately, these methods do not work on data sets in which only topological information is available. In this paper, we propose a new, purely topological approa...
Given a stream $S = (s_1, s_2, ..., s_N)$, a $\phi$-heavy hitter is an item $s_i$ that occurs at least $\phi N$ times in $S$. The problem of finding heavy-hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $s_t$ occurs exactly $\phi N$ times...
Technologies such as Multi-Channel DRAM (MCDRAM) or High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend has raised questions about how applications should manage data transfers between levels. This paper focuses on evaluating different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycor...
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units,...
We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent...
Triangle enumeration is a fundamental graph operation. Despite the lack of provably efficient (linear, or slightly super-linear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs t...
Communities of vertices within a giant network such as the World Wide Web are likely to be vastly smaller than the network itself. However, Fortunato and Barthélemy have proved that modularity maximization algorithms for community detection may fail to resolve communities with fewer than √(L/2) edges, where L is the number of edges in the entire netw...
We present an experimental study of parallel algorithms for solving the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs. We implement Meyer and Sander's Δ-stepping algorithm and report performance results on the Cray MTA-2, a multithreaded parallel architecture. The MTA-2 is a high-end shared memory s...
Enumerating triangles (3-cycles) in graphs is a kernel operation for social network analysis. For example, many community detection methods depend upon finding common neighbors of two related entities. We consider Cohen's simple and elegant solution for listing triangles: give each node a 'bucket.' Place each edge into the bucket of its endpoint of...
Developing multi-threaded graph algorithms, even when using the MTGL infrastructure, provides a number of challenges, including discovering appropriate levels of parallelism, preventing memory hot spotting, and eliminating accidental synchronization. In this paper, we have demonstrated that using the combination of Qthreads and MTGL with commodity...
The US Environmental Protection Agency (EPA) is the lead federal agency for the security of drinking water in the United States. The agency is responsible for providing information and technical assistance to the more than 50,000 water utilities across the country. The distributed physical layout of drinking-water utilities makes them inherently vu...
Following the events of September 11, 2001, in the United States, world public awareness for possible terrorist attacks on water supply systems has increased dramatically. Among the different threats for a water distribution system, the most difficult to address is a deliberate chemical or biological contaminant injection, due to both the uncertain...
The authors describe a decision framework for selecting sensor monitoring locations for a contamination warning system. Using the threat ensemble vulnerability assessment and sensor placement optimization tool (TEVA-SPOT) to determine sensor placement, a utility can eliminate the guessing game of where to best locate sensors. Specifically, sensor l...
We present the TEVA-SPOT Toolkit, a sensor placement optimization tool developed within the USEPA TEVA program. The TEVA-SPOT Toolkit provides a sensor placement framework that facilitates research in sensor placement optimization and enables the practical application of sensor placement solvers to real-world CWS design applications. This paper pro...
Placing sensors in municipal water networks to protect against a set of contamination events is a classic p-median problem for most objectives when we assume that sensors are perfect. Many researchers have proposed exact and approximate solution methods for this p-median formulation. For full-scale networks with large contamination event suites,...
Large, complex graphs arise in many settings including the Internet, social networks, and communication networks. To study such data sets, the authors explored the use of high-performance computing (HPC) for graph algorithms. They found that the challenges in these applications are quite different from those arising in traditional HPC applications...
The general sensor placement problem (SPP) for contaminant warning system (CWS) design involves placement of a limited number of sensors such that the expected impact of an attack is minimized. We cast the SPP in terms of the well-known p-median problem from discrete location theory. The p-median formulation assumes a fixed number of attack scenari...
The practical utility of optimization technologies is often impacted by factors that reflect how these tools are used in practice, including whether various real-world constraints can be adequately modeled, the sophistication of the analysts applying the optimizer, and related environmental factors (e.g. whether a company is willing to trust predic...
In this paper we apply theoretical and practical results from facility location theory to the problem of community detection in networks. The result is an algorithm that computes bounds on a minimization variant of local modularity. We also define the concept of an edge support and a new measure of the goodness of community structures with respect...
In this paper, we introduce EXACT, the EXperimental Algorithmics Computational Toolkit. EXACT is a software framework for describing, controlling, and analyzing computer experiments. It provides the experimentalist with convenient software tools to ease and organize the entire experimental process, including the description of factors and levels, t...
This short, non-technical paper summarizes the main points made by the author in a talk at the DIMACS workshop "Computer Generated Conjectures from Graph Theoretic and Chemical Databases I". His experience leading the LINK project should be helpful to any future non-commercial efforts to produce a freely available and general purpose graph software pac...
Sensor placement problems for municipal water distribution networks usually involve detecting a series of scenarios. The number of scenarios needed to accurately model a full set of possible events based on season, special events, and type of contamination can grow much faster than the size of the network. We introduce two new methods for reduci...
We present a study of multithreaded implementations of Thorup's algorithm for solving the single source shortest path (SSSP) problem for undirected graphs. Our implementations leverage the fledgling multithreaded graph library (MTGL) to perform operations such as finding connected components and extracting induced subgraphs. To achieve good paralle...
Search-based graph queries, such as finding short paths and isomorphic subgraphs, are dominated by memory latency. If input graphs can be partitioned appropriately, large cluster-based computing platforms can run these queries. However, the lack of compute-bound processing at each vertex of the input graph and the constant need to retrieve neighbor...
Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for develop...
We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the Δ-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory system offering two unique features that aid the e...
The Cray MTA-2 system provides exceptional performance on a variety of sparse graph algorithms. Unfortunately, it was an extremely expensive platform. Cray is preparing an Eldorado platform that leverages the Cray XT3 network and system infrastructure while integrating a new revision of the MTA-2 processors that is pin compatible with the AMD O...
We will discuss our experiences in designing and using a software infrastructure for processing semantic graphs on massively multithreaded computers. We have developed implementations of several algorithms for connected components, subgraph isomorphism, and s-t connectivity. We will discuss their performance on the existing Cray MTA-2, and thei...
A new trend in processor design is increased on-chip support for multithreading in the form of both chip multiprocessors and simultaneous multithreading. Recent research in database systems has begun to explore increased thread-level parallelism made possible by these new multicore and multithreaded processors. The question of how best to use t...
In recent years, several integer programming models have been proposed to place sensors in municipal water networks in order to detect intentional or accidental contamination. Although these initial models assumed that it is equally costly to place a sensor at any place in the network, there clearly are practical cost constraints that would impact...
A number of algorithms have been developed to solve the problem of where to place a limited number of sensors in a water distribution network such that public health protection from accidental or intentional contaminant injections is maximized. However, the ability of these algorithms to solve real-world, large-scale sensor placement problems has...
We consider the accuracy of predictions made by integer programming (IP) models of sensor placement for water security applications. We have recently shown that IP models can be used to find optimal sensor placements for a variety of different performance criteria (e.g. minimize health impacts and minimize time to detection). However, these models...
Integer programming (IP) is a general optimization technology capable of expressing most resource allocation decisions. More specifically, IP is the optimization of a linear objective function subject to linear constraints and additional nonlinear integrality constraints. For sensor placement problems, discrete decision variables usually represent d...
Finding the central sets, such as the median sets, of a network topology is a fundamental step in the design and analysis of general distributed systems. This paper presents an alternative synchronous distributed algorithm for finding the median set in general tree structures, based on a revision of a simple sequential algorithm. When this algorith...
Managing an industrial production facility requires carefully allocating limited resources, and gives rise to large, potentially complicated scheduling problems. In this paper we consider a specific instance of such a problem: planning efficient utilization of the facilities and technicians that maintain the United States nuclear stockpile. A detai...
Finding the central sets, such as center and median sets, of a network topology is a fundamental step in the design and analysis of complex distributed systems. This paper presents distributed synchronous algorithms for finding central sets in general tree structures. Our algorithms are distinguished from previous work in that they take only qualit...
We present a model for optimizing the placement of sensors in municipal water networks to detect maliciously-injected contaminants. An optimal sensor configuration minimizes the expected fraction of the population at risk. We formulate this problem as an integer program, which can be solved with generally available IP solvers. We find optimal senso...
This paper presents a new heuristic for graph partitioning called Path Optimization (PO), and the results of an extensive set of empirical comparisons of the new algorithm with two very well-known algorithms for partitioning: the Kernighan-Lin algorithm and simulated annealing. Our experiments are described in detail, and the results are presented...
This paper will describe the basic architecture of the system and illustrate its flexibility with several examples. These descriptions will be accompanied by commentary on the associated design decisions, but will certainly not be exhaustive. The LINK manual fills in We would like to acknowledge the support of DIMACS and NSF grant CCR-9214487. DIMA...
This paper introduces the system as an educational tool which can be used to visualize and experiment with discrete algorithms. An extended example demonstrates the flexibility of the system in the context of a fundamental graph algorithm: finding the strongly connected components of a directed graph.
and Technology Center, funded under contract STC-91-19999; and also receives support from the New Jersey Commission on Science and Technology. 1 I took over the direction of the project in June, 1995, and spent a year at DIMACS preparing the public release. Figure 1: A Latka tournament and an induced subgraph extracted with the LINK GUI Figure 2: T...
Link is a tool for exploring combinatorial objects. It provides a graphical interface, a functional language interface, an algorithm animation system, and a detachable set of C++ libraries. Link allows the user to interactively explore the structure of combinatorial objects such as sets, graphs, digraphs, and hypergraphs. Link makes it easy to defi...
This paper introduces the Link system for exploring combinatorial objects. It provides graphical and functional language interfaces which are user friendly and flexible, and includes a detachable set of C++ libraries. Link allows us to interactively explore the structure of combinatorial objects such as collections, graphs, and hypergraphs. Graphs...
This paper introduces the LINK system as a flexible tool for the creation, manipulation, and drawing of graphs and hypergraphs. We describe the basic architecture of the system and illustrate its flexibility with several examples. LINK is distinguished from existing software for discrete mathematics by its layered interface, including a graphical u...
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94-AL85000. Approved for public release; further dissemination unlimited.
In this document, we propose a dynamic graph data structure that can serve as a common data structure for multiple real-world applications. The extensible representation for dynamic complex networks is space-efficient, allows parallelism over vertices and edges independently, and can be used for efficient checkpoint/restart of the data.