Rihan Hai

Rihan Hai
Delft University of Technology (TU Delft) · Web Information Systems (WIS) group

Dr. rer. nat. in Computer Science
Assistant Professor in the WIS group, TU Delft

About

21 Publications
8,865 Reads
401 Citations
Introduction
I am an assistant professor at Delft University of Technology (TU Delft). Before joining TU Delft, I received my Ph.D. degree from the Computer Science 5 (Information Systems) group of RWTH Aachen University, and my master's degree from the School of Information, Tsinghua University. My research focuses on data integration and dataset discovery in large-scale data lakes.
Skills and Expertise

Publications

Publications (21)
Article
A digital twin (DT), originally defined as a virtual representation of a physical asset, system, or process, is a new concept in health care. A DT in health care is not a single technology but a domain-adapted multimodal modeling approach incorporating the acquisition, management, analysis, prediction, and interpretation of data, aiming to improve...
Preprint
Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metadata is currently not standardised; its expressivity is limited; and there is no...
Preprint
Virtually every sizable organization nowadays is building a form of a data lake. In theory, every department or team in the organization would enrich their datasets with metadata, and store them in a central data lake. Those datasets can then be combined in different ways and produce added value to the organization. In practice, though, the situati...
Preprint
Full-text available
The data needed for machine learning (ML) model training and inference can reside in different separate sites, often termed data silos. For data-intensive ML applications, data silos present a major challenge: the integration and transformation of data demand a lot of manual work and computational resources. Sometimes, data cannot leave the local...
Preprint
A Digital Twin (DT), originally defined as a virtual representation of a physical asset, system, or process, is a new concept in healthcare. A DT in healthcare cannot be a single technology, but a domain-adapted multi-modal modelling approach, which incorporates the acquisition, management, analysis, prediction, and interpretatio...
Preprint
Full-text available
Although big data has been discussed for some years, it still poses many research challenges, especially regarding the variety of data. It is very difficult to efficiently integrate, access, and query the large volume of diverse data in information silos with traditional 'schema-on-write' approaches such as data warehouses. Data lakes have been propo...
Chapter
This chapter introduces the most important features of data lake systems, and from there it outlines an architecture for these systems. The vision for a data lake system is based on a generic and extensible architecture with a unified data model, facilitating the ingestion, storage and metadata management over heterogeneous data sources. The chapte...
Chapter
Functional dependencies are important for the definition of constraints and relationships that have to be satisfied by every database instance. Relaxed functional dependencies (RFDs) can be used for data exploration and profiling in datasets with lower data quality. In this work, we present an approach for RFD discovery in heterogeneous data lakes....
Chapter
In this work, we present a Metadata Framework in the direction of extending intelligence mechanisms from the Cloud to the Edge. To this end, we build on our previously introduced notion of Data Lagoons (the analogue of Data Lakes at the network edge) and we introduce a novel architecture and Metadata model for the efficient interaction between Data...
Conference Paper
Full-text available
Schema mappings express the relationships between sources in data interoperability scenarios and can be expressed in various formalisms. Source-to-target tuple-generating dependencies (s-t tgds) can be easily used for data transformation or query rewriting tasks. Second-order tgds (SO tgds) are more expressive as they can also represent the composi...
Chapter
Full-text available
JSON has become one of the most popular data formats. Yet studies on JSON data integration (DI) are scarce. In this work, we study one of the key DI tasks, nested mapping generation in the context of integrating heterogeneous JSON-based data sources. We propose a novel mapping representation, namely bucket forest mappings, that models the nested map...
Chapter
The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data in its original structure, and by providing the datasets for querying a...
Conference Paper
Medical engineering (ME) is an interdisciplinary domain with short innovation cycles. Usually, researchers from several fields cooperate in ME research projects. To support the identification of suitable partners for a project, we present an integrated approach for patent classification combining ideas from topic modeling, ontology modeling & match...
Article
In addition to volume and velocity, big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches that can resolve the integration challenges even for large volumes of data. Data lakes should reduce the upfront integration costs and provide a more flexible way for data integration and analysi...
Conference Paper
Full-text available
As the challenge of our time, Big Data still poses many research problems, especially regarding the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this proble...
Conference Paper
The heterogeneity of sources in Big Data systems requires new integration approaches which can handle the large volume of the data as well as its variety. Data lakes have been proposed to reduce the upfront integration costs and to provide more flexibility in integrating and analyzing information. In data lakes, data from the sources is copied in i...
Conference Paper
Data Warehouses (DW) have to continuously adapt to evolving business requirements, which implies structure modification (schema changes) and data migration requirements in the system design. However, it is challenging for designers to control the performance and cost overhead of different schema change implementations. In this paper, we demonstrate...

Network

Cited By

Projects

Projects (2)
Project
Pushing the notion of Data Lakes to the network Edge by introducing a novel architecture for Extract-Load-Transform (ELT) on edge gateways.
Project
The workshop QDB 2016 aims at discussing recent advances and challenges in data quality management in database systems, and focuses especially on problems related to Big Data Integration and Big Data Quality. The workshop will provide a forum for the presentation of research results, a panel discussion, and an invited keynote talk. For detailed information please check the workshop website: dbis.rwth-aachen.de/QDB2016