Berthold Reinwald

  • PhD
  • Manager at IBM

About

110
Publications
45,738
Reads
2,937
Citations
Current institution
IBM
Current position
  • Manager

Publications (110)
Article
Full-text available
Semantic schema alignment aims to match elements across a pair of schemas based on their semantic representation. It is a key primitive for data integration that facilitates the creation of a common data fabric across heterogeneous data sources. Deep learning approaches such as graph representation learning have shown promise for effective alignmen...
Conference Paper
Full-text available
Knowledge graphs are at the core of numerous consumer and enterprise applications where learned graph embeddings are used to derive insights for the users of these applications. Since knowledge graphs can be very large, the process of learning embeddings is time and resource intensive and needs to be done in a distributed manner to leverage comput...
Preprint
Full-text available
Developing scalable solutions for training Graph Neural Networks (GNNs) for link prediction tasks is challenging due to the high data dependencies which entail high computational cost and huge memory footprint. We propose a new method for scaling training of knowledge graph embedding models for link prediction to address these challenges. Towards t...
Article
Full-text available
This paper describes an end-to-end solution for the relationship prediction task in heterogeneous, multi-relational graphs. We particularly address two building blocks in the pipeline, namely heterogeneous graph representation learning and negative sampling. Existing message passing-based graph neural networks use edges either for graph traversal a...
Preprint
Full-text available
This paper describes an end-to-end solution for the relationship prediction task in heterogeneous, multi-relational graphs. We particularly address two building blocks in the pipeline, namely heterogeneous graph representation learning and negative sampling. Existing message passing-based graph neural networks use edges either for graph traversal a...
Preprint
Full-text available
Knowledge graph embedding methods learn embeddings of entities and relations in a low dimensional space which can be used for various downstream machine learning tasks such as link prediction and entity matching. Various graph convolutional network methods have been proposed which use different types of information to learn the features of entities...
Preprint
Sparse and irregularly sampled multivariate time series are common in clinical, climate, financial and many other domains. Most recent approaches focus on classification, regression or forecasting tasks on such data. In forecasting, it is necessary to not only forecast the right value but also to forecast when that value will occur in the irregular...
Preprint
A person ontology comprising concepts, attributes and relationships of people has a number of applications in data protection, de-identification, population of knowledge graphs for business intelligence and fraud prevention. While artificial neural networks have led to improvements in Entity Recognition, Entity Classification, and Relation Extraction...
Conference Paper
Full-text available
Unplanned intensive care units (ICU) readmissions and in-hospital mortality of patients are two important metrics for evaluating the quality of hospital care. Identifying patients with higher risk of readmission to ICU or of mortality can not only protect those patients from potential dangers, but also reduce the high costs of healthcare. In this w...
Conference Paper
Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution p...
Article
Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compr...
Preprint
Entity Type Classification can be defined as the task of assigning category labels to entity mentions in documents. While neural networks have recently improved the classification of general entity mentions, pattern matching and other systems continue to be used for classifying personal data entities (e.g. classifying an organization as a media com...
Article
Full-text available
Large-scale machine learning algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable fast matrix-vector operations on in-memory data. General-purpose, heavy- a...
Article
Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different analytics tasks ranging from model preparation, building, evaluation, and tuning for both machine learning and deep le...
Article
Many large-scale machine learning (ML) systems allow specifying custom ML algorithms by means of linear algebra programs, and then automatically generate efficient execution plans. In this context, optimization opportunities for fused operators (in terms of fused chains of basic operators) are ubiquitous. These opportunities include (1) fewer mat...
Article
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory and enable very fast matrix-vector operations on in-memory data. General-purpose,...
Article
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of...
Article
Full-text available
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose, heavy- and lightweight compression techniques struggle to achi...
Article
Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify the usage and/or development of ML algorithms, which is especially important in the context of large-scale computations. However, ML sy...
Article
Full-text available
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuf...
Conference Paper
Full-text available
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks. State-of-the-art compilers in this context are very sensitive to memory constraints of th...
Conference Paper
Meta learning techniques such as cross-validation and ensemble learning are crucial for applying machine learning to real-world use cases. These techniques first generate samples from input data, and then train and evaluate machine learning models on these samples. For meta learning on large datasets, the efficient generation of samples becomes pro...
Patent
A method and apparatus controls use of a computing resource by multiple tenants in DBaaS service. The method includes intercepting a task that is to access a computer resource, the task being an operating system process or thread; identifying a tenant that is in association with the task from the multiple tenants; determining other tasks of the ten...
Article
Exploitation of parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such an exploitation. Two approaches are commonly pursued: (i) developing specific GPU accelerated implementations of complete ML algor...
Chapter
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: sh...
Patent
An external service at a service provider server is invoked from a database by accessing from over a network a description of the external service published by the service provider external to the database. A database invocation mechanism is generated from the accessed description of the external service, wherein the database invocation mechanism c...
Conference Paper
We consider the learning of a distance metric, using the Localized Supervised Metric Learning (LSML) scheme, that discriminates entities characterized by high dimensional feature attributes, with respect to labels assigned to each entity. LSML is a supervised learning scheme that learns a Mahalanobis distance grouping together features with the sam...
Patent
A method and system for discovering keys in a database. A minimal set of non-keys of the database are found. The database includes at least two entities and at least two attributes. The minimal set of non-keys includes at least two non-keys. Each entity independently includes a value of each attribute. A set of keys of the database is generated fro...
Article
Full-text available
SystemML enables declarative, large-scale machine learning (ML) via a high-level language with R-like syntax. Data scientists use this language to express their ML algorithms with full flexibility but without the need to hand-tune distributed runtime execution plans and system configurations. These ML programs are dynamically compiled and optimiz...
Article
Full-text available
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables, in contrast to existing large-scale machine learning libraries, automatic optimization. SystemML's primary focus is on data paral...
Patent
Systems and methods for processing Machine Learning (ML) algorithms in a MapReduce environment are described. In one embodiment of a method, the method includes receiving a ML algorithm to be executed in the MapReduce environment. The method further includes parsing the ML algorithm into a plurality of statement blocks in a sequence, wherein each s...
Patent
A computer-implemented method for accessing content items in a content store are described. In one embodiment, the computer-implemented method includes maintaining a text index of content items in a content store to enable a keyword search on the content items, receiving a query having a keyword and generating a hit list from the text index using t...
Conference Paper
Full-text available
Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these analytics for varying data characteristics and at scale is challenging. To address these challenge...
Article
SaaS (Software as a Service) provides new business opportunities for application providers to serve more customers in a scalable and cost-effective way. SaaS also raises new challenges and one of them is multi-tenancy. Multi-tenancy is the requirement of deploying only one shared application to serve multiple customers (i.e. tenant) instead of depl...
Chapter
SaaS (Software as a Service) provides new business opportunities for application providers to serve more customers in a scalable and cost-effective way. SaaS also raises new challenges and one of them is multi-tenancy. Multi-tenancy is the requirement of deploying only one shared application to serve multiple customers (i.e. tenant) instead of depl...
Conference Paper
Database-as-a-Service (DBaaS) has gained significant momentum with the prevailing usage of Cloud computing. Multi-tenancy is one of the key features of a DBaaS offering, where a large volume of databases with different Service Level Agreement (SLA) requirements are co-located in one environment and share resources. As Cloud resources are elastic and...
Article
Full-text available
With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level...
Article
Full-text available
We propose a similarity index for set-valued features and study algorithms for executing various set similarity queries on it. Such queries are fundamental for many application areas, including data integration and cleaning, data profiling as well as near duplicate document detection. In this paper, we focus on Jaccard similarity and present estima...
Conference Paper
In Database-as-a-Service (DBaaS), a large number of tenants share DBaaS resources (CPU, I/O and Memory). While the DBaaS provider runs DBaaS to "share" resources across the entire tenant population to maximize resource utilization and minimize cost, the tenants subscribe to DBaaS at a low price point while still having resources conceptually "isola...
Conference Paper
Full-text available
MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-le...
Article
Full-text available
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in th...
Conference Paper
Entity uncertainty is an unavoidable problem in modern enterprise databases, resulting from integration of data over multiple sources. In traditional warehousing, the administrator, during an ETL process, manually and laboriously resolves inconsistent data records to discover "true" entities (customers, products, etc.) and identify their "correct"...
Conference Paper
Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high quality, high recall search in databases, and the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for...
Article
Full-text available
Content Management Systems (CMS) store enterprise data such as insurance claims, insurance policies, legal documents, patent applications, or archival data like in the case of digital libraries. Search over content allows for information retrieval, but does not provide users with great insight into the data. A more analytical view is needed through...
Article
Full-text available
DBPubs is a system for effectively analyzing and exploring the content of database publications by combining keyword search with OLAP-style aggregations, navigation, and reporting. DBPubs starts with keyword search over the content of publications. The publications' metadata such as title, authors, venues, year, and so on, provide traditional...
Conference Paper
Full-text available
The increasing complexity of enterprise databases and the prevalent lack of documentation incur significant cost in both understanding and integrating the databases. Existing solutions addressed mining for keys and foreign keys, but paid little attention to more high-level structures of databases. In this paper, we consider the problem of disco...
Conference Paper
Full-text available
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis i...
Conference Paper
Full-text available
We model heterogeneous data sources with cross references, such as those crawled on the (enterprise) web, as a labeled graph with data objects as typed nodes and references or links as edges. Given the labeled data graph, we introduce flexible and efficient querying capabilities that go beyond existing capabilities by additionally discovering meani...
Conference Paper
Full-text available
Gaining business insights from data has recently been the focus of research and product development. On-Line Analytical Processing (OLAP) tools provide elaborate query languages that allow users to group and aggregate data in various ways, and explore interesting trends and patterns in the data. However, the dynamic nature of today's data along wit...
Conference Paper
Gaining business insights such as measuring the effectiveness of a product campaign requires the integration of a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, d...
Conference Paper
Full-text available
Identification of (composite) key attributes is of fundamental importance for many different data management tasks such as data modeling, data integration, anomaly detection, query formulation, query optimization, and indexing. However, information about keys is often missing or incomplete in many real-world database scenarios. Surprisingly, the...
Conference Paper
Modern business applications involve a lot of distributed data processing and inter-site communication, for which they rely on middleware products. These products provide the data access and communication framework for the business applications. Integrated messaging seeks to integrate messaging operations into the database, so as to provide a singl...
Conference Paper
Full-text available
The high cost of data consolidation is the key market inhibitor to the adoption of traditional information integration and data warehousing solutions. In this paper, we outline a next-generation integrated database management system that takes traditional information integration, content management, and data warehouse techniques to the next l...
Conference Paper
Full-text available
An XML publish/subscribe system needs to match many XPath queries (subscriptions) over published XML documents. The performance and scalability of the matching algorithm is essential for the system when the number of XPath subscriptions is large. Earlier solutions to this problem usually built large finite state automata for all the XPath subscri...
Article
Full-text available
We introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit...
Article
Caching as a means to improve scalability, throughput, and response time for Web applications has received a lot of attention. Caching can take place at different levels in a multi-tier architecture. However, due to the increased personalization and dynamism in Web content, caching at the bottom of the multi-tier architecture - the database l...
Chapter
This chapter introduces a new database object called cache table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tabl...
Conference Paper
The education industry has a very poor record of productivity gains. In this brief article, I outline some of the ways the teaching of a college course in database systems could be made more efficient, and staff time used more productively. These ideas ...
Conference Paper
Full-text available
We introduce a new database object called Cache Table that enables persistent caching of the full or partial content of a remote database table. The content of a cache table is either defined declaratively and populated in advance at setup time, or determined dynamically and populated on demand at query execution time. Dynamic cache tables exploit...
Article
Full-text available
Enterprises have been storing multidimensional data, using a star or snowflake schema, in relational databases for many years. Over time, relational database vendors have added optimizations that enhance query performance on these schemas. During the 1990s many special-purpose databases were developed that could handle added calculational complexit...
Article
Full-text available
The World Wide Web offers a tremendous amount of information. Accessing and integrating the available information is a challenge. “Screen scraping” and reverse template engineering are manual and error-prone integration techniques from the past. The advent of Simple Object Access Protocol (SOAP) from the World Wide Web Consortium (W3C) allowed Web...
Article
Full-text available
Most business data are stored in relational database systems, and SQL (Structured Query Language) is used for data retrieval and manipulation. With XML (Extensible Markup Language) rapidly becoming the de facto standard for retrieving and exchanging data, new functionality is expected from traditional databases. Existing SQL applications will evolv...
Article
Full-text available
XML is rapidly emerging as a standard for exchanging business data on the World Wide Web. For the foreseeable future, however, most business data will continue to be stored in relational database systems. Consequently, if XML is to fulfill its potential, some mechanism is needed to publish relational data as XML documents. Towards that goal, one of...
Article
Full-text available
Most existing workflow management systems (WFMSs) are based on a client/server architecture. This architecture simplifies the overall design but it does not match the distributed nature of workflow applications and it imposes severe limitations in terms of scalability and reliability. Moreover, workflow engines are not very sophisticated in terms o...
Article
Full-text available
In today's heterogeneous development environments, application programmers have the responsibility to segment their application data and store relational data in RDBMSs, C++ objects in OODBMSs, SOM objects in OMG persistent stores, and OpenDoc or OLE compound documents in document files. In addition, they must deal with multiple server systems with...
Conference Paper
In today's IT infrastructures, data is stored in SQL databases, non-SQL databases, and host databases like ISAM/VSAM files. Non-SQL databases are specialized data stores controlled by applications like spreadsheets, mail, directory and index services. Developing applications accessing a variety of different data sources is challenging for applicat...
Article
In this paper, we provide a brief and necessarily incomplete overview of what IBM is doing to address some of the aforementioned challenges. In particular, we describe how IBM's DB2 Universal Database product family is responding to these challenges in order to provide heterogeneous data access capabilities to DB2 customers. To address the problem of...
Conference Paper
We describe the open, extensible architecture of SQL for accessing data stored in external data sources not managed by the SQL engine. In this scenario, SQL engines act as middleware servers providing access to external data using SQL DML statements and joining external data with SQL tables in heterogeneous queries. We describe the state-of-the-art...
Conference Paper
Full-text available
Most existing workflow management systems (WFMSs) are based on a client/server architecture. This architecture simplifies the overall design but it does not match the distributed nature of workflow applications and imposes severe limitations in terms of scalability and reliability. Moreover, workflow engines are not very sophisticated in terms of da...
Conference Paper
Full-text available
In this paper we describe the design and implementation of workflow management applications on top of Notes Release 4. We elaborate on various design issues for Notes workflow applications and introduce Notes Release 4's native workflow concepts like agents, events, macros, LotusScript, OLE2 capabilities, and doclinks, which make Notes a powerful...
Article
Full-text available
In today's heterogeneous development environments, application programmers have the responsibility to segment their application data and to store those data in different types of stores. That means relational data will be stored in RDBMSs (relational database management systems), C++ objects in OODBMSs (object-oriented database management systems),...
Article
In this paper, we present the Exotica research project currently in progress at the IBM Almaden Research Center. One of the goals of the project is to bring together industrial trends and research issues in the workflow area. It is for this reason that we have focused on a particular commercial product, FlowMark, IBM's workflow product [15, 16, 19, 20...
Conference Paper
Cooperation is both a key concept to deal with complex design tasks and an ambiguous notion interpreted in a variety of ways. In particular, cooperation is considered from its behavioral aspects relating to the course of the dialogue (communication), and from a synchronizational viewpoint with respect to coordination issues. From the database point...
Conference Paper
The Shared Memory-Resident Cache (SMRC) facility under development at IBM Almaden enables persistent C++ data structures to reside in a relational database by utilizing its binary large object (BLOB) facilities. Through SMRC, persistent C++ data then can be accessed both programmatically and through relational queries. Testing and refinement of the...
Chapter
This chapter describes selected implementation aspects of the architecture of the ActMan workflow control system presented in Chapter 7. The framework architecture of the workflow control system in Section 7.2 and its distributed configuration serve as a guide.
Chapter
The design and modeling of well-structured cooperative application systems are classic tasks of software development. This chapter first introduces the essential components of well-structured cooperative application systems. Following that, based on fundamental construction principles, the design of these components...
Chapter
In well-structured cooperative application systems, computer-supported activities are integrated into a uniform environment in order to support the well-structured and cooperative nature of an application. While the preceding chapter laid out the fundamentals of designing and modeling well-structured cooperative application systems...
Chapter
The ActMan workflow control system implements the control and data flow in well-structured cooperative application systems. This chapter describes the architecture of the workflow control system in terms of its individual subcomponents, while the following chapter presents selected implementation aspects of the realization of ActMan...
Chapter
The case study described in the preceding chapter is a concrete example of a well-structured cooperative application system. In this chapter, the case study is interpreted in general terms with regard to well-structured cooperative application systems. First, a definition of the term well-structured cooperative application systems...
Chapter
This chapter defines an application-oriented processing model for well-structured cooperative applications. The processing model is closely aligned with the class of applications to be supported and governs the execution of workflows in well-structured cooperative applications (Section 6.1). For this application-oriented...
Chapter
This chapter presents a concrete case study of a well-structured cooperative application. The case study comes from the field of computer-integrated manufacturing systems, a typical application area in which automation through the use of computers and rationalization through the division and organization of labor are declared goals...
Chapter
In this concluding main section, starting from the fundamental objectives in the design and operation of well-structured cooperative application systems, the essential characteristics of workflow control systems are highlighted and future lines of development are then outlined.
Article
The subject of this book is a comprehensive presentation of workflow management in distributed systems. First, a case study is used to give a conceptual introduction to this new field of research. Following that, the necessary foundations and basic mechanisms for modeling and implementing workflow management...
Conference Paper
This paper considers control and data flow of well-structured procedures in distributed application systems. At control flow level, an application-oriented cooperation model is used to model well-structured cooperative work in distributed applications. At data flow level, a customizable data management mechanism passes data between activities and p...
Chapter
A business enterprise consists of a multitude of functional units that must interact in a controlled manner with regard to the common corporate goals. The cooperation of the functional units is reflected in a network of dependencies between the functional units. In a cooperation project of the University of Erlan...
