
Berthold Reinwald
- PhD
- Manager at IBM
About
- 110 publications
- 45,738 reads
- 2,937 citations
Publications (110)
Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignmen...
Knowledge graphs are at the core of numerous consumer and enterprise applications where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning em-beddings is time and resource intensive and needs to be done in a distributed manner to leverage comput...
Developing scalable solutions for training Graph Neural Networks (GNNs) for link prediction tasks is challenging due to the high data dependencies which entail high computational cost and huge memory footprint. We propose a new method for scaling training of knowledge graph embedding models for link prediction to address these challenges. Towards t...
This paper describes an end-to-end solution for the relationship prediction task in heterogeneous, multi-relational graphs. We particularly address two building blocks in the pipeline, namely heterogeneous graph representation learning and negative sampling. Existing message passing-based graph neural networks use edges either for graph traversal a...
Knowledge graph embedding methods learn embeddings of entities and relations in a low dimensional space which can be used for various downstream machine learning tasks such as link prediction and entity matching. Various graph convolutional network methods have been proposed which use different types of information to learn the features of entities...
Sparse and irregularly sampled multivariate time series are common in clinical, climate, financial and many other domains. Most recent approaches focus on classification, regression or forecasting tasks on such data. In forecasting, it is necessary to not only forecast the right value but also to forecast when that value will occur in the irregular...
A person ontology comprising concepts, attributes and relationships of people has a number of applications in data protection, de-identification, population of knowledge graphs for business intelligence and fraud prevention. While artificial neural networks have led to improvements in Entity Recognition, Entity Classification, and Relation Extraction...
Unplanned intensive care units (ICU) readmissions and in-hospital mortality of patients are two important metrics for evaluating the quality of hospital care. Identifying patients with higher risk of readmission to ICU or of mortality can not only protect those patients from potential dangers, but also reduce the high costs of healthcare. In this w...
Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution p...
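Since the abstract above centers on estimating the sparsity of intermediates, a minimal sketch may help. The snippet below implements only the classic metadata-based baseline that assumes independently, uniformly distributed non-zeros; the function name and interface are illustrative, not the estimator proposed in the paper.

```python
import numpy as np

def estimate_product_sparsity(sp_a, sp_b, k):
    """Estimate the fraction of non-zeros in C = A @ B, where A is
    m x k with sparsity sp_a and B is k x n with sparsity sp_b.
    Assumes non-zeros are independently, uniformly distributed:
    c_ij is non-zero unless all k partial products vanish."""
    return 1.0 - (1.0 - sp_a * sp_b) ** k

# Compare the estimate against an empirical random product.
rng = np.random.default_rng(0)
m, k, n = 200, 200, 200
A = (rng.random((m, k)) < 0.05).astype(float)
B = (rng.random((k, n)) < 0.05).astype(float)
est = estimate_product_sparsity(0.05, 0.05, k)
actual = np.count_nonzero(A @ B) / (m * n)
```

Real optimizers refine this baseline with structural information (e.g., per-row and per-column non-zero counts), since the uniformity assumption can be far off for skewed data.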
Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compr...
Entity Type Classification can be defined as the task of assigning category labels to entity mentions in documents. While neural networks have recently improved the classification of general entity mentions, pattern matching and other systems continue to be used for classifying personal data entities (e.g. classifying an organization as a media com...
Large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable fast matrix-vector operations on in-memory data. General-purpose, heavy- a...
Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different analytics tasks ranging from model preparation, building, evaluation, and tuning for both machine learning and deep le...
Many large-scale machine learning (ML) systems allow specifying custom ML algorithms by means of linear algebra programs, and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators---in terms of fused chains of basic operators---are ubiquitous. These opportunities include (1) fewer mat...
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable very fast matrix-vector operations on in-memory data. General-purpose,...
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of...
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose, heavy- and lightweight compression techniques struggle to achi...
Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify the usage and/or development of ML algorithms, which is especially important in the context of large-scale computations. However, ML sy...
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuf...
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks. State-of-the-art compilers in this context are very sensitive to memory constraints of th...
Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, the efficient generation of samples becomes pro...
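As a concrete illustration of the sample-generation step described above, here is a minimal single-node sketch of k-fold sample generation. Names are illustrative; the paper's contribution is doing this efficiently at scale, which this sketch does not attempt.

```python
import numpy as np

def kfold_samples(X, y, k, seed=0):
    """Assign each row to one of k folds once, then yield k
    (train, test) samples by masking. This is the naive formulation;
    scalable systems push the row selection into the distributed
    runtime instead of materializing per-fold copies."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, k, size=len(X))  # one fold id per row
    for f in range(k):
        test = folds == f
        yield (X[~test], y[~test]), (X[test], y[test])

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
splits = list(kfold_samples(X, y, k=5))
```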
A method and apparatus control use of a computing resource by multiple tenants in a DBaaS service. The method includes intercepting a task that is to access a computing resource, the task being an operating system process or thread; identifying a tenant that is in association with the task from the multiple tenants; determining other tasks of the ten...
Exploitation of parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such an exploitation. Two approaches are commonly pursued: (i) developing specific GPU accelerated implementations of complete ML algor...
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: sh...
An external service at a service provider server is invoked from a database by accessing from over a network a description of the external service published by the service provider external to the database. A database invocation mechanism is generated from the accessed description of the external service, wherein the database invocation mechanism c...
We consider the learning of a distance metric, using the Localized Supervised Metric Learning (LSML) scheme, that discriminates entities characterized by high dimensional feature attributes, with respect to labels assigned to each entity. LSML is a supervised learning scheme that learns a Mahalanobis distance grouping together features with the sam...
A method and system for discovering keys in a database. A minimal set of non-keys of the database are found. The database includes at least two entities and at least two attributes. The minimal set of non-keys includes at least two non-keys. Each entity independently includes a value of each attribute. A set of keys of the database is generated fro...
SystemML enables declarative, large-scale machine learning (ML) via a high-level language with R-like syntax. Data scientists use this language to express their ML algorithms with full flexibility but without the need to hand-tune distributed runtime execution plans and system configurations. These ML programs are dynamically compiled and optimized...
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables, in contrast to existing large-scale machine learning libraries, automatic optimization. SystemML's primary focus is on data paral...
Systems and methods for processing Machine Learning (ML) algorithms in a MapReduce environment are described. In one embodiment of a method, the method includes receiving a ML algorithm to be executed in the MapReduce environment. The method further includes parsing the ML algorithm into a plurality of statement blocks in a sequence, wherein each s...
A computer-implemented method for accessing content items in a content store are described. In one embodiment, the computer-implemented method includes maintaining a text index of content items in a content store to enable a keyword search on the content items, receiving a query having a keyword and generating a hit list from the text index using t...
Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these analytics for varying data characteristics and at scale is challenging. To address these challenge...
SaaS (Software as a Service) provides new business opportunities for application providers to serve more customers in a scalable and cost-effective way. SaaS also raises new challenges and one of them is multi-tenancy. Multi-tenancy is the requirement of deploying only one shared application to serve multiple customers (i.e. tenant) instead of depl...
Database-as-a-Service (DBaaS) has gained significant momentum with the prevailing usage of Cloud computing. Multi-tenancy is one of the key features of DBaaS offerings, where a large volume of databases with different Service Level Agreement (SLA) requirements are co-located in one environment and share resources. As Cloud resources are elastic and...
With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level...
We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estima...
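To make the flavor of estimation-based set similarity concrete, here is a minimal MinHash sketch for estimating Jaccard similarity, a standard technique in this area. The hash family and parameters below are illustrative, not the index described in the paper.

```python
import random

P = (1 << 61) - 1  # large prime modulus for the hash family

def minhash_signature(s, num_hashes=128, seed=42):
    """Signature = per-hash-function minimum over the set, using
    random linear hashes h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % P for x in s) for a, b in coeffs]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates
    |A intersect B| / |A union B|."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 100))
b = set(range(50, 150))  # true Jaccard similarity: 50/150, about 0.33
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
```

The estimate's standard error shrinks as the signature grows, roughly as 1/sqrt(num_hashes).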
Jie Zhu, Bo Gao, Zhihu Wang, [...], Wei Sun
In Database-as-a-Service (DBaaS), a large number of tenants share DBaaS resources (CPU, I/O and Memory). While the DBaaS provider runs DBaaS to "share" resources across the entire tenant population to maximize resource utilization and minimize cost, the tenants subscribe to DBaaS at a low price point while still having resources conceptually "isola...
MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-le...
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in th...
Entity uncertainty is an unavoidable problem in modern enterprise databases, resulting from integration of data over multiple sources. In traditional warehousing, the administrator, during an ETL process, manually and laboriously resolves inconsistent data records to discover "true" entities (customers, products, etc.) and identify their "correct"...
Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high quality, high recall search in databases, and the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for...
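For reference, the query-time computation these systems try to avoid is a personalized PageRank power iteration over the full graph. A minimal sketch, with an illustrative function and graph encoding:

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """Power-iteration personalized PageRank on an adjacency dict
    {node: [out-neighbors]}. A fraction alpha of the mass restarts
    at `source` every step; dangling mass also returns to `source`."""
    rank = {v: (1.0 if v == source else 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, out in adj.items():
            if out:
                share = (1.0 - alpha) * rank[v] / len(out)
                for w in out:
                    nxt[w] += share
            else:
                nxt[source] += (1.0 - alpha) * rank[v]
        nxt[source] += alpha  # total restart mass (ranks sum to 1)
        rank = nxt
    return rank

adj = {'a': ['b'], 'b': ['c'], 'c': ['a']}
pr = personalized_pagerank(adj, 'a')
```

On this 3-cycle the source node ends up with the highest score, since all restart mass returns to it; precomputation and approximation schemes aim to avoid running this full iteration per query.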
Content Management Systems (CMS) store enterprise data such as insurance claims, insurance policies, legal documents, patent applications, or archival data like in the case of digital libraries. Search over content allows for information retrieval, but does not provide users with great insight into the data. A more analytical view is needed through...
DBPubs is a system for effectively analyzing and exploring the content of database publications by combining keyword search with OLAP-style aggregations, navigation, and reporting. DBPubs starts with keyword search over the content of publications. The publications' metadata such as title, authors, venues, year, and so on, provide traditional...
The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant cost in both understanding and integrating the databases. Existing solutions addressed mining for keys and foreign keys, but paid little attention to more high-level structures of databases. In this paper, we consider the problem of discovering...
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis i...
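A minimal sketch of a mergeable synopsis of this kind is the KMV ("k minimum values") estimator, which the snippet below illustrates under simplifying assumptions; the synopsis size, hash function, and names are illustrative, not the paper's exact construction.

```python
import hashlib

K = 64  # synopsis size

def _h(v):
    """Map a value to a pseudo-uniform float in [0, 1) via a stable hash."""
    d = hashlib.sha1(str(v).encode()).digest()
    return int.from_bytes(d[:8], 'big') / 2**64

def kmv_synopsis(values, k=K):
    """Per-partition synopsis: the k smallest distinct hash values."""
    return sorted({_h(v) for v in values})[:k]

def merge(syn_a, syn_b, k=K):
    """Synopsis for a union of partitions: k smallest of the combined sets."""
    return sorted(set(syn_a) | set(syn_b))[:k]

def estimate_dv(syn, k=K):
    """Classic KMV estimate: (k - 1) / (k-th smallest hash value)."""
    if len(syn) < k:
        return len(syn)  # fewer than k distinct values were seen
    return (k - 1) / syn[k - 1]

part1 = kmv_synopsis(range(0, 600))
part2 = kmv_synopsis(range(400, 1000))  # overlaps part1 on 400..599
est = estimate_dv(merge(part1, part2))  # true distinct count: 1000
```

The key property, matching the synopsis-warehouse setting above, is that per-partition synopses can be combined without revisiting the raw data, even when partitions overlap.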
We model heterogeneous data sources with cross references, such as those crawled on the (enterprise) web, as a labeled graph with data objects as typed nodes and references or links as edges. Given the labeled data graph, we introduce flexible and efficient querying capabilities that go beyond existing capabilities by additionally discovering meani...
Gaining business insights from data has recently been the focus of research and product development. On Line-Analytical Processing (OLAP) tools provide elaborate query languages that allow users to group and aggregate data in various ways, and explore interesting trends and patterns in the data. However, the dynamic nature of today's data along wit...
Gaining business insights such as measuring the effectiveness of a product campaign requires the integration of a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, d...
Identification of (composite) key attributes is of fundamental importance for many different data management tasks such as data modeling, data integration, anomaly detection, query formulation, query optimization, and indexing. However, information about keys is often missing or incomplete in many real-world database scenarios. Surprisingly, the...
Modern business applications involve a lot of distributed data processing and inter-site communication, for which they rely on middleware products. These products provide the data access and communication framework for the business applications. Integrated messaging seeks to integrate messaging operations into the database, so as to provide a singl...
The high cost of data consolidation is the key market inhibitor to the adoption of traditional information integration and data warehousing solutions. In this paper, we outline a next-generation integrated database management system that takes traditional information integration, content management, and data warehouse techniques to the next l...
An XML publish/subscribe system needs to match many XPath queries (subscriptions) over published XML documents. The performance and scalability of the matching algorithm is essential for the system when the number of XPath subscriptions is large. Earlier solutions to this problem usually built large finite state automata for all the XPath subscri...
We introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit...
Caching as a means to improve scalability, throughput, and response time for Web applications has received a lot of attention. Caching can take place at different levels in a multi-tier architecture. However, due to the increased personalization and dynamism in Web content, caching at the bottom of the multi-tier architecture - the database l...
This chapter introduces a new database object called cache table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tabl...
The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas ...
We introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit...
Enterprises have been storing multidimensional data, using a star or snowflake schema, in relational databases for many years. Over time, relational database vendors have added optimizations that enhance query performance on these schemas. During the 1990s many special-purpose databases were developed that could handle added calculational complexit...
The World Wide Web offers a tremendous amount of information. Accessing and integrating the available information is a challenge. “Screen scraping” and reverse template engineering are manual and error-prone integration techniques from the past. The advent of Simple Object Access Protocol (SOAP) from the World Wide Web Consortium (W3C) allowed Web...
Most business data are stored in relational database systems, and SQL (Structured Query Language) is used for data retrieval and manipulation. With XML (Extensible Markup Language) rapidly becoming the de facto standard for retrieving and exchanging data, new functionality is expected from traditional databases. Existing SQL applications will evolv...
XML is rapidly emerging as a standard for exchanging business data on the World Wide Web. For the foreseeable future, however, most business data will continue to be stored in relational database systems. Consequently, if XML is to fulfill its potential, some mechanism is needed to publish relational data as XML documents. Towards that goal, one of...
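The core "tagging" step such a publishing mechanism performs can be sketched in a few lines: given rows from a single relation, wrap each row and each column value in elements. This is a toy illustration with made-up table and column names; real middleware additionally handles nesting across joined tables and streaming of large results.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(table_name, columns, rows):
    """Tag relational rows as an XML document: one element per row,
    one child element per column value."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for col, val in zip(columns, row):
            ET.SubElement(row_el, col).text = str(val)
    return ET.tostring(root, encoding="unicode")

xml_doc = rows_to_xml("accounts",
                      ["id", "owner", "balance"],
                      [(1, "Alice", 250), (2, "Bob", 99)])
```

A central design question in this line of work is whether such tagging runs inside the database engine (pushing composition into SQL) or outside it in middleware.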
Most existing workflow management systems (WFMSs) are based on a client/server architecture. This architecture simplifies the overall design but it does not match the distributed nature of workflow applications and it imposes severe limitations in terms of scalability and reliability. Moreover, workflow engines are not very sophisticated in terms o...
In today's heterogeneous development environments, application programmers have the responsibility to segment their application data and store relational data in RDBMSs, C++ objects in OODBMSs, SOM objects in OMG persistent stores, and OpenDoc or OLE compound documents in document files. In addition, they must deal with multiple server systems with...
In today's IT infrastructures, data is stored in SQL databases, non-SQL databases, and host databases like ISAM/VSAM files. Non-SQL databases are specialized data stores controlled by applications like spreadsheets, mail, directory and index services. Developing applications accessing a variety of different data sources is challenging for applicat...
In this paper, we provide a brief and necessarily incomplete overview of what IBM is doing to address some of the aforementioned challenges. In particular, we describe how IBM's DB2 Universal Database product family is responding to these challenges in order to provide heterogeneous data access capabilities to DB2 customers. To address the problem of...
We describe the open, extensible architecture of SQL for accessing data stored in external data sources not managed by the SQL engine. In this scenario, SQL engines act as middleware servers providing access to external data using SQL DML statements and joining external data with SQL tables in heterogeneous queries. We describe the state-of-the art...
Most existing workflow management systems (WFMSs) are based on a client/server architecture. This architecture simplifies the overall design but it does not match the distributed nature of workflow applications and imposes severe limitations in terms of scalability and reliability. Moreover, workflow engines are not very sophisticated in terms of da...
In this paper we describe the design and implementation of workflow management applications on top of Notes Release 4. We elaborate on various design issues for Notes workflow applications and introduce Notes Release 4's native workflow concepts like agents, events, macros, Lotus-Script, OLE2 capabilities, and doclinks, which make Notes a powerful...
In today's heterogeneous development environments, application programmers have the responsibility to segment their application data and to store those data in different types of stores. That means relational data will be stored in RDBMSs (relational database management systems), C++ objects in OODBMSs (object-oriented database management systems),...
In this paper, we present the Exotica research project currently in progress at the IBM Almaden Research Center. One of the goals of the project is to bring together industrial trends and research issues in the workflow area. It is for this reason that we have focused on a particular commercial product, FlowMark, IBM's workflow product [15, 16, 19, 20...
Cooperation is both a key concept to deal with complex design tasks and an ambiguous notion interpreted in a variety of ways. In particular, cooperation is considered from its behavioral aspects relating to the course of the dialogue (communication), and from a synchronizational viewpoint with respect to coordination issues. From the database point...
The Shared Memory-Resident Cache (SMRC) facility under development at IBM Almaden enables persistent C++ data structures to reside in a relational database by utilizing its binary large object (BLOB) facilities. Through SMRC, persistent C++ data then can be accessed both programmatically and through relational queries. Testing and refinement of the...
This chapter describes selected implementation aspects of the architecture of the ActMan workflow control system presented in Chapter 7. The framework architecture of the workflow control system in Section 7.2 and its distributed configuration serve as a guide.
The design and modeling of regulated, work-sharing application systems are classic tasks in software development. This chapter first introduces the essential components of regulated, work-sharing application systems. The design of these components is then described on the basis of fundamental construction principles...
In regulated, work-sharing application systems, computer-supported activities are integrated into a unified environment in order to support the regulated, work-sharing character of an application. While the previous chapter laid out the foundations of the design and modeling of regulated, work-sharing application systems...
The ActMan workflow control system realizes the control and data flow in regulated, work-sharing application systems. This chapter describes the architecture of the workflow control system in its individual subcomponents, while the following chapter presents selected implementation aspects of the realization of ActMan...
The case study described in the previous chapter represents a concrete example of a regulated, work-sharing application system. In this chapter, the case study is interpreted with regard to regulated, work-sharing application systems in general. First, a definition of the term regulated, work-sharing application sys...
This chapter defines an application-oriented processing model for regulated, work-sharing applications. The processing model is closely oriented to the class of applications to be supported and governs the execution of processes in regulated, work-sharing applications (Section 6.1). For this application-oriented...
This chapter presents a concrete case study of a regulated, work-sharing application. The case study comes from the field of computer-integrated manufacturing systems, a typical application area in which automation through the use of computers and rationalization through the division and organization of labor are declared goals. The...
In this concluding main section, starting from the fundamental objectives in the design and operation of regulated, work-sharing application systems, the essential characteristics of workflow control systems are highlighted and future lines of development are then outlined.
The subject of this book is a comprehensive presentation of workflow management in distributed systems. A case study first provides a conceptual introduction to this new field of research. Subsequently, the necessary foundations and basic mechanisms for the modeling and realization of workflow manage...
This paper considers control and data flow of well-structured procedures in distributed application systems. At control flow level, an application-oriented cooperation model is used to model well-structured cooperative work in distributed applications. At data flow level, a customizable data management mechanism passes data between activities and p...
A business enterprise consists of a multitude of functional units that must interact in a controlled manner with regard to the common goals of the enterprise. The cooperation of the functional units is reflected in a network of dependencies between the functional units. In a cooperation project of the Universität Erlan...