About
225 Publications
30,625 Reads
15,583 Citations
Publications (225)
Data science applications increasingly rely on heterogeneous data sources and analytics. This has led to growing interest in polystore systems, especially analytical polystores. In this work, we focus on a class of emerging multi-data model analytics workloads that fluidly straddle relational, graph, and text analytics. Instead of a generic polysto...
The Relationship-based Access Control Model (ReBAC) generalizes Role-based Access Control (RBAC) by considering both hierarchical and non-hierarchical relationships between users to specify access control of a set of target resources (objects). This paper extends the ReBAC model by considering relationships between objects as well as between subjec...
Early detection of diseases such as COVID-19 could be a critical tool in reducing disease transmission by helping individuals recognize when they should self-isolate, seek testing, and obtain early medical intervention. Consumer wearable devices that continuously measure physiological metrics hold promise as tools for early illness detection. We ga...
There is significant variability in neutralizing antibody responses (which correlate with immune protection) after COVID-19 vaccination, but only limited information is available about predictors of these responses. We investigated whether device-generated summaries of physiological metrics collected by a wearable device correlated with post-vaccin...
Modern big data applications usually involve heterogeneous data sources and analytical functions, leading to increasing demand for polystore systems, especially analytical polystore systems. This paper presents the AWESOME system along with a domain-specific language, ADIL. ADIL is a powerful language that supports 1) native heterogeneous data models s...
Knowledge analysis is an important application of knowledge graphs. In this paper, we present a complex knowledge analysis problem that discovers the gaps in the technology areas of interest to an organization. Our knowledge graph is developed on a heterogeneous data management platform. The analysis combines semantic search, graph analytics, and p...
Quantum materials research is a rapidly growing domain of materials research, seeking novel compounds whose electronic properties are born from the uniquely quantum aspects of their constituent electrons. The data from this rapidly evolving area of quantum materials requires a new community-driven approach for collaboration and sharing the data fro...
Social media data are often modeled as heterogeneous graphs with multiple types of nodes and edges. We present a discovery algorithm that first chooses a "background" graph based on a user's analytical interest and then automatically discovers subgraphs that are structurally and content-wise distinctly different from the background graph. The techn...
We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we develop an investigative exploration system called boutique that allows a user to perform a mul...
Many data science applications like social network analysis use graphs as their primary form of data. However, acquiring graph-structured data from social media presents some interesting challenges. The first challenge is the high data velocity and bursty nature of the social media data. The second challenge is that the complex nature of the data m...
Temporal text, i.e., time-stamped text data, is found abundantly in a variety of data sources such as newspapers, blogs, and social media posts. While today's data management systems provide facilities for searching full-text data, they do not provide any simple primitives for performing analytical operations with text. This paper proposes the temporal...
AWESOME is a polystore system that enables a data analyst to create a data ingestion script that specifies how it should collect, organize, run a data-derivation pipeline and reports results of the analysis. The collected data can be stored in different component stores under AWESOME for subsequent secondary analysis. This paper demonstrates the pr...
Proactive forensics uses the investigative principles of digital forensics to develop automated techniques that prevent cybercrime. One such prevention-minded methodology is PROFORMA, a prototype system that continuously evaluates the trustworthiness and risk of social communications.
Specifying the search space is an important step in designing multimedia annotation systems. With the large amount of data available from sensors and web services, context-aware approaches for pruning search spaces are becoming increasingly common. In these approaches, the search space is limited by the contextual information obtained from a fixed...
Polystores, i.e., data management systems that use multiple stores for different data models, are gaining popularity. We are developing a polystore-based system called AWESOME to support social data analytics. The AWESOME polystore can support relational, semistructured, graph and text data and houses a Spark computation engine to produce derived d...
Attack graphs have been widely used by network security administrators to understand the possible attack paths an attacker may follow to compromise critical resources. As networks get larger and more complex, one needs to use databases to perform iterative, interactive analysis tasks with attack graphs. In this paper we investigate h...
Social media data can be viewed as "mixed model" data that reflect interesting community behavior. We take a graph-centric view of microblogs and develop a user-defined specification of a community on these social graphs. We demonstrate that the temporal behavior of communities can be captured by a set of graph metrics. We describe a system which tran...
Wildfires are critical for ecosystems in many geographical regions. However, our current urbanized existence in these environments is inducing the ecological balance to evolve into a different dynamic leading to the biggest fires in history. Wildfire wind speeds and directions change in an instant, and first responders can only be effective if they...
Graphs have emerged as an important genre of data that are found in a wide class of applications. The most dominant benchmark for graph data today is Graph 500 that generates a Stochastic Kronecker graph of various sizes, and reports the time to perform a breadth-first search. Apache Giraph uses Pagerank computation as an algorithmic benchmark for...
Connecting people to the resources they need is a fundamental task for any society. We present the idea of a technology that can be used by the middle tier of a society so that it uses people's mobile devices and social networks to connect the needy with providers. We conceive of a world observatory called the Social Life Network (SLN) that connect...
The NIF system is a semantic search engine that uses an ontology to improve search quality. In this experience paper we present SKEYQL, our semantic keyword query language and describe a number of ontology-based query reformulation strategies that go beyond standard query expansion techniques. We also present a set of lessons learnt and strategies...
Computational problems are increasingly relying on context-aware approaches for tractable solutions. Usually, these approaches statically link additional sources of information to those already present in the problem space. We have been building CueNet, a context discovery framework, which will dynamically discover the most relevant context for a g...
The availability of enormous volumes of heterogeneous Cyber-Physical-Social (CPS) data streams allows the design and implementation of networks to connect people with essential life resources. We call these networks Social Life Networks (SLNs). We are developing concepts, technology, and infrastructure to design and build these networks. SLNs will be he...
We report on progress of employing the Kepler workflow engine to prototype “end-to-end” application integration workflows that concern data coming from microscopes deployed at the National Center for Microscopy Imaging Research (NCMIR). This system is built upon the mature code base of the Cell Centered Database (CCDB) and integrated rule-oriented...
The number of available neuroscience resources (databases, tools, materials, and networks) available via the Web continues to expand, particularly in light of newly implemented data sharing policies required by funding agencies and journals. However, the nature of dense, multifaceted neuroscience data and the design of classic search engine systems...
In this short paper, we present early results from ongoing research on creating a new graph-based representation from NLP analysis of scientific documents so that the graph can be utilized for answering structured queries on NL-processed data. We present a sketch of the data model and the query language to show how scientifically meaningful quer...
An initiative of the NIH Blueprint for neuroscience research, the Neuroscience Information Framework (NIF) project advances neuroscience by enabling discovery and access to public research data and tools worldwide through an open source, semantically enhanced search portal. One of the critical components for the overall NIF system, the NIF Standard...
The number of available neuroscience resources (databases, tools, materials and networks) on the web has expanded and continues to expand, particularly in light of newly implemented data sharing policies required by funding agencies and journals. However, the nature of dense, multi-faceted neuroscience data and the design of classic search engine systems...
Methods. Detailed description of the methods and data types used in the BiologicalNetworks system for host-pathogen studies.
Understanding the immune response mechanisms of a pathogen-infected host requires multi-scale analysis of genome-wide data. Data integration methods have proved useful to the study of biological processes in model organisms, but their systematic application to the study of host immune system response to a pathogen and human disease is still in the ini...
In this paper, we examine the problem of efficiently computing aggregate functions over polygonal regions of space. We first formalize a class of efficient region-based aggregation models, where the aggregation query is computed by representing the query region with pre-defined regions using set operations. By focusing on a grid tessellation, we fir...
As we saw in the last chapter, there is wide diversity in the way data modelers and knowledge representation researchers view events. In this chapter, we will present a number of approaches to event data modeling.
A significant problem in the study of mechanisms of an organism's development is the elucidation of interrelated factors which are making an impact on the different levels of the organism, such as genes, biological molecules, cells, and cell systems. Numerous sources of heterogeneous data which exist for these subsystems are still not integrated su...
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) is an eScience project to enable the microbial ecology community in managing the challenges of metagenomics analysis. CAMERA supports extensive metadata-based data acquisition and access, as well as execution of metagenomics experiments through stand...
This paper presents BIODB, an ontology-enhanced information system to manage heterogeneous data. An ontology-enhanced system is a system where ad hoc data is imported into the system by a user, annotated by the user to connect the data to an ontology or other data sources, and then all data connected through the ontology can be queried in a federat...
Social science is often concerned with the emergence of collective behavior out of the interactions of large numbers of individuals, but in this regard it has long suffered from a severe measurement problem - namely that individual-level behavior and ...
The objective of the CAMERA project is to provide a facility to enable researchers to achieve revolutionary advances in the understanding of marine microbial ecology.
This paper proposes a navigational method for mining by collecting evidence from diverse data sources. Since the representation method and even the semantics of data elements differ widely from one data source to another, consolidating data under a single platform is not cost-effective. Instead, this paper proposes a method of mining i...
Events are at least as important as objects in modeling the dynamic universe. Modeling the real world and weaving the web of events require Composite Events that are valid constitution of atomic and composite sub-events. The progress in event composition has been limited to construction of some entities and relationships in upper ontologies such as...
As increasing volumes and varieties of data are becoming available online, the challenges of accessing and using heterogeneous data resources are growing. We have developed a mediator-based data integration system called Cartel for biological oceanography data. A mediation approach is appropriate in cases where a single central warehouse is not des...
Since the dawn of human civilization, stories have been a popular medium of communication, both synchronously and asynchronously. Technically, a story is a time-ordered coherent sequence of events. In many applications, heterogeneous data is collected and organized so appropriate stories could be told. In this paper, we present a system that help...
Autism spectrum disorder is an inherently complex phenomenon requiring large studies of many different types to further understanding of its causes. The National Database for Autism Research (NDAR) is being constructed to aid in this effort by providing a means for researchers to share and integrate data. An autism ontology drafted by a group at St...
When SQL and the relational data model were introduced 25 years ago as a general data management concept, enterprise software migrated quickly to this new technology. It is fair to say that SQL and the various implementations of RDBMSs became the backbone ...
Using global physical and biological datasets, we tested oceanographic retention (factoring out effects of seamount depth and age) as one possible mechanism structuring seamount benthic decapod and gastropod communities. We first determined the relative oceanographic retentive potential (such as from Taylor caps or columns) for individual seamo...
Amarnath Gupta and Yang Yang (University of California San Diego; {gupta,yyang}@sdsc.edu), Aditya Bagchi (Indian Statistical Institute, Calcutta; aditya@isical.ac.in), Animesh Ray (Keck Graduate Institute; Animesh Ray@kgi.edu)
This paper presents current progress in the development of a semantic data integration environment that is part of the Biomedical Informatics Research Network (BIRN; http://www.nbirn.net) project. BIRN is sponsored by the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH). A goal is the developmen...
The overarching goal of the NIF (Neuroscience Information Framework) project is to be a one-stop-shop for Neuroscience. This paper provides a technical overview of how the system is designed. The technical goal of the first version of the NIF system was to develop an information system that a neuroscientist can use to locate relevant information fr...
A critical component of the Neuroscience Information Framework (NIF) project is a consistent, flexible terminology for describing and retrieving neuroscience-relevant resources. Although the original NIF specification called for a loosely structured controlled vocabulary for describing neuroscience resources, as the NIF system evolved, the requirem...
Querying live media streams is a challenging problem that is becoming an essential requirement in a growing number of applications. Research in multimedia information systems has addressed and made good progress in dealing with archived data. Meanwhile, research in stream databases has received significant attention for querying alphanumeric symbol...
Annotation is the process of supplementing data with additional information that was not part of the actual observation, but reflects post-facto comments and associations made by a user who analyzes the data. While annotation management systems are emerging in the field of relational data, such systems for scientific applications, where there is a...
Databases have become integral parts of data management, dissemination, and mining in biology. At the Second Annual Conference on Electron Tomography, held in Amsterdam in 2001, we proposed that electron tomography data should be shared in a manner analogous to structural data at the protein and sequence scales. At that time, we outlined our progre...
The broadly defined mission of the Biomedical Informatics Research Network (BIRN, www.nbirn.net) is to better understand the causes of human disease and the specific ways in which animal models inform that understanding. To construct the community-wide infrastructure for gathering, organizing and managing this knowledge, BIRN is developing a federated...
With support from the Institutes and Centers forming the NIH Blueprint for Neuroscience Research, we have designed and implemented a new initiative for integrating access to and use of Web-based neuroscience resources: the Neuroscience Information Framework. The Framework arises from the expressed need of the neuroscience community for neuroinforma...
This chapter focuses on the application of brain cartography to the problem of multiscale integration of brain data in the context of the Biomedical Informatics Research Network (BIRN) project. The BIRN project focuses on creating a grid infrastructure for integrating data on brain morphology and function obtained by different researchers to support co...
We present the semantic data model for an ontological database for subcellular anatomy for Neurosciences. The data model builds upon the foundations of OWL and the Basic Formal Ontology, but extends them to include novel constructs that address several unresolved challenges encountered by biologists in using ontological models in their databases. T...