Judy Qiu's research while affiliated with Indiana University East and other places

Publications (100)

Preprint
Forecasting is challenging because of the uncertainty that results from exogenous factors. This work investigates the rank position forecasting problem in car racing, which predicts the rank positions of cars at future laps. Among the many factors that bring changes to the rank positions, pit stops are critical but irregular and rare. We found exist...
Preprint
Full-text available
Subgraph counting aims to count occurrences of a template T in a given network G(V, E). It is a powerful graph analysis tool and has found real-world applications in diverse domains. Scaling subgraph counting problems is known to be memory bounded and computationally challenging with exponential complexity. Although scalable parallel algorithms are...
Chapter
Our project is at the interface of Big Data and HPC – High-Performance Big Data computing and this paper describes a collaboration between 7 collaborating Universities at Arizona State, Indiana (lead), Kansas, Rutgers, Stony Brook, Virginia Tech, and Utah. It addresses the intersection of High-performance and Big Data computing with several differe...
Preprint
Full-text available
Subgraph counting aims to count the occurrences of a subgraph template T in a given network G. The basic problem of computing structural properties such as counting triangles and other subgraphs has found applications in diverse domains. Recent biological, social, cybersecurity and sensor network applications have motivated solving such problems on...
Preprint
Full-text available
The convergence of HPC and data-intensive methodologies provides a promising approach to major performance improvements. This paper provides a general description of the interaction between traditional HPC and ML approaches and motivates the "Learning Everywhere" paradigm for HPC. We introduce the concept of "effective performance" that one can achi...
Preprint
Full-text available
This paper describes opportunities at the interface between large-scale simulations, experiment design and control, machine learning (ML including deep learning DL) and High-Performance Computing. We describe both the current status and possible research issues in allowing machine learning to pervasively enhance computational science. How should on...
Technical Report
Full-text available
Our project is at the interface of Big Data and HPC -- High-Performance Big Data computing and this paper describes a collaboration between 7 collaborating Universities at Arizona State, Indiana (lead), Kansas, Rutgers, Stony Brook, Virginia Tech, and Utah. It addresses the intersection of High-performance and Big Data computing with several different app...
Preprint
Full-text available
Subgraph counting aims to count the number of occurrences of a subgraph T (also known as a template) in a given graph G. The basic problem has found applications in diverse domains. The problem is known to be computationally challenging - the complexity grows both as a function of T and G. Recent applications have motivated solving such problems on massiv...
Article
Several variants of the subgraph isomorphism problem, e.g., finding, counting, and estimating frequencies of subgraphs in networks arise in a number of real world applications, such as web analysis, disease diffusion prediction, and social network analysis. These problems are computationally challenging in having to scale to very large networks wit...
Article
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of terascale integration. Among emerging killer applications, parallel graph processing has been a critical technique to analyze connected data. In this paper, we empirically evaluate various computing platforms including an Intel Xeon E5 CPU, an NVIDIA GeForce GTX 1070 GPU an...
Article
Full-text available
The system generates three errors of "Bad character(s) in field Abstract" for no reason. Please refer to manuscript for the full abstract.
Technical Report
Full-text available
Two major trends in computing systems are the growth in high performance computing (HPC) with an international exascale initiative, and the big data phenomenon with an accompanying cloud infrastructure of well publicized dramatic and increasing size and sophistication. This tutorial weaves these trends together using some key building blocks. The f...
Poster
Full-text available
Status of NSF 1443054 Project: Big Data Application Analysis identifies features of data intensive applications that need to be supported in software and represented in benchmarks. This analysis was started for the proposal and has been extended to support HPC-Simulations-Big Data convergence. The proje...
Article
LDA is a widely used machine learning technique for big data analysis. The application includes an inference algorithm that iteratively updates a model until it converges. A major challenge is the scaling issue in parallelization owing to the fact that the model size is huge and parallel workers need to communicate the model continually. We identif...
Conference Paper
Full-text available
Two major trends in computing systems are the growth in high performance computing (HPC) with in particular an international exascale initiative, and big data with an accompanying cloud infrastructure of dramatic and increasing size and sophistication. In this paper, we study an approach to convergence for software and applications/algorithms and s...
Poster
Full-text available
This poster introduces all of the DSC projects below and covers 1), 3), 4) and 5): 1) Digital Science Center Facilities; 2) RaPyDLI Deep Learning Environment; 3) SPIDAL Scalable Data Analytics Library and applications including Bioinformatics and Polar Remote Sensing Data Analysis; 4) MIDAS Big Data Software; Harp for HPC-ABDS; 5) Big Data Ogres Classification an...
Poster
Full-text available
This covers Streaming workshops held, IoTCloud for cloud control of robots, SPIDAL project, HPC-ABDS, WebPlotviz visualization and Stock Market data, Scientific paper impact analysis for XSEDE
Poster
Full-text available
This poster covers the Harp HPC Hadoop plugin, RaPyDLI deep learning system, Virtual Clusters on XSEDE Comet system, Cloudmesh to defer Ansible Big data applications, Big Data Ogres and Diamonds to converge HPC and Big Data, Performance of Flink on machine learning
Technical Report
Full-text available
This is a 21-month progress report on an NSF-funded project NSF14-43054 started October 1, 2014 and involving a collaboration between university teams at Arizona, Emory, Indiana (lead), Kansas, Rutgers, Virginia Tech, and Utah. The project is constructing data building blocks to address major cyberinfrastructure challenges in seven different communi...
Conference Paper
We categorize parallel machine learning applications into four types of computation models and propose a new set of model-centric computation abstractions. This work sets up parallel machine learning as a combination of training data-centric and model parameter-centric processing. The analysis uses Latent Dirichlet Allocation (LDA) as an example, a...
Technical Report
Full-text available
This report is a contribution to Frankfurt BDEC meeting on "exascale and big data". In previous BDEC meetings we described two concepts "The Big Data Ogres" and "The HPC-ABDS Software Stack" that here we bring together and extend to provide an approach to the convergence of Big Data and HPC (simulations). Our approach suggests a hardware architectu...
Article
Full-text available
The study of social phenomena is becoming increasingly reliant on big data from online social networks. Broad access to social media data, however, requires software development skills that not all researchers possess. Here we present the IUNI Observatory on Social Media, an open analytics platform designed to facilitate computational social scienc...
Preprint
Full-text available
The study and adoption of deep learning methods has led to significant progress in different application domains. As deep learning continues to show promise and its utilization matures, so does the infrastructure and software needed to support it. Various frameworks have been developed in recent years to facilitate both implementation and training...
Conference Paper
Full-text available
We review the High Performance Computing Enhanced Apache Big Data Stack HPC-ABDS and summarize the capabilities in 21 identified architecture layers. These cover Message and Data Protocols, Distributed Coordination, Security & Privacy, Monitoring, Infrastructure Management, DevOps, Interoperability, File Systems, Cluster & Resource management, Data...
Article
Full-text available
We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dim...
Conference Paper
Full-text available
We study many Big Data applications from a variety of research and commercial areas and suggest a set of characteristic features and possible kernel benchmarks that stress those features for data analytics. We draw conclusions for the hardware and software architectures that are suggested by this analysis.
Chapter
Full-text available
The intensive research activity in analysis of social media and micro-blogging data in recent years suggests the necessity and great potential of platforms that can efficiently store, query, analyze, and visualize social media data. To support these “social media observatories” effectively, a storage platform must satisfy special requirements for l...
Conference Paper
Full-text available
Analysis of structural properties and dynamics of networks is currently a central topic in many disciplines including Social Sciences, Biology and Business. CINET, a cyberinfrastructure for such studies, introduced the concept of supporting network analysis as a service. The basic idea is to allow experts in various disciplines to focus on obtaini...
Conference Paper
Full-text available
Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and algorithms applied in each stage may use different computation and communicati...
Conference Paper
We generalize MapReduce, Iterative MapReduce and data intensive MPI runtime as a layered Map-Collective architecture with Map-All Gather, Map-All Reduce, MapReduce Merge Broadcast and Map-Reduce Scatter patterns as the initial focus. Map-collectives improve the performance and efficiency of the computations while at the same time facilitating ease...
Article
The Special Issue of Concurrency and Computation: Practice and Experience, August 2013, discusses papers presented at the Emerging Computational Methods for the Life Sciences Workshop (ECMLS2012). Weber and colleague note that GPUs and multicore processors are now pervasive in computational sciences and high-performance computing. Their high-arithm...
Article
The MapReduce programming model has proven useful for data-driven high throughput applications. However, the conventional MapReduce model limits itself to scheduling jobs within a single cluster. As job sizes become larger, single-cluster solutions grow increasingly inadequate. We present a hierarchical MapReduce framework that utilizes computation...
Article
Full-text available
Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive appli...
Conference Paper
Full-text available
As data intensive applications evolve, many research projects involving Big Data require efficient extraction and analysis of specific data subsets, rather than the whole dataset. Social media data analysis is one such example. While social media platforms provide tremendous data about all kinds of social activities, most research analyses focus on...
Conference Paper
Large-scale iterative computations are common in many important data mining and machine learning algorithms. Most of these applications can be specified as iterations of MapReduce computations, leading to the Iterative MapReduce programming model [1] for efficient execution of data-intensive iterative computations interoperably between HPC and clou...
Article
Recent advances in data-intensive computing for science discovery are fueling a dramatic growth in the use of data-intensive iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure and storage services, offers a very attractive environment in which scientists can perform...
Article
Full-text available
Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to...
Article
The Special Issue of Concurrency and Computation: Practice and Experience 2013 deals with the latest trends in parallel and distributed high-performance systems applied to life science problems. Mitchel and co-researchers present parallel implementations of two popular microarray data analysis techniques, exploratory clustering analyses using the r...
Article
Full-text available
The recent explosion of publicly available biology gene sequences and chemical compounds offers an unprecedented opportunity for data mining. To make data analysis feasible for such vast volume and high-dimensional scientific data, we apply high performance dimension reduction algorithms. It facilitates the investigation of unknown structures in a...
Conference Paper
Full-text available
In order to meet the big data challenge of today's society, several parallel execution models on distributed memory architectures have been proposed: MapReduce, Iterative MapReduce, graph processing, and dataflow graph processing. Dryad is a distributed data-parallel execution engine that models programs as dataflow graphs. In this paper, we evaluate...
Conference Paper
Full-text available
The recent advance in next generation sequencing (NGS) techniques has enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing of individual microbial species in the lab. One can profile the marker genes of 16S rRNA encoded in the sample through the amplification of highly variable regions in t...
Conference Paper
A predictive pre-fetcher, which predicts future data access events and loads the data before users request it, has been widely studied, especially in file systems and web content servers, to reduce data load latency. In scientific data visualization in particular, pre-fetching can reduce the IO waiting time. In order to increase the accuracy, we apply a d...
Article
Full-text available
The Special Issue of Distributed Parallel Databases journal, 2012, discusses novel data processing techniques for this new data-driven world. This data intensive eScience special issue encouraged researchers to submit and present original work related to the latest trends in preservation, movement, access and analysis of massive datasets that requi...
Conference Paper
Full-text available
Networks are an effective abstraction for representing real systems. Consequently, network science is increasingly used in academia and industry to solve problems in many fields. Computations that determine structure properties and dynamical behaviors of networks are useful because they give insights into the characteristics of real systems. We int...
Article
Full-text available
The shift to parallel computing -- including multi-core computer architectures, cloud distributed computing, and general-purpose GPU programming -- leads to fundamental changes in the design of software and systems. As a result, learning parallel, distributed, and cloud techniques in order to allow software to take advantage of the shift toward para...
Conference Paper
Full-text available
Many distributed computing models have been developed for high performance processing of large scale scientific data. Among them, MapReduce is a popular and widely used fine grain parallel runtime. Workflows integrate and coordinate distributed and heterogeneous components to solve the computation problem which may contain several MapReduce jobs. H...
Conference Paper
Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the Protein Sequence Universe (PSU), the protein sequence space, together with the costs and complexities of manual curation creates a major bottleneck in life sciences research. Existing r...
Article
Full-text available
MapReduce distributed data processing architecture has become the de-facto data-intensive analysis mechanism in compute clouds and in commodity clusters, mainly due to its excellent fault tolerance features, scalability, ease of use and the simpler programming model. MapReduceRoles for Azure (MR4Azure) is a decentralized, dynamically scalable MapRe...
Article
Full-text available
Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential...
Chapter
Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to...
Article
We present performance results on a Windows cluster with up to 768 cores using Message Passing Interface (MPI) and two variants of threading—Concurrency and Coordination Runtime (CCR) and Task Parallel Library (TPL). CCR presents a message-based interface, while TPL allows for loops to be automatically parallelized. MPI is used between the cluster...
Article
Full-text available
Technical advancements produce huge amounts of scientific data, usually in high-dimensional formats, and it is becoming more important to analyze such large-scale high-dimensional data. Dimension reduction is a well-known approach for high-dimensional data visualization, but can be very time and memory demanding for large problems. Among...