Figure 4 - uploaded by Erhard Rahm
General entity resolution workflow 

Source publication
Conference Paper
An often forgotten asset of many companies is internal process data. From the everyday processes that run within companies, huge amounts of such data are collected within different software systems. However, the value of analyzing this data holistically is often not exploited. This is mainly due to the inherent heterogeneity of the different data so...

Contexts in source publication

Context 1
... workflows like this are useful to generate certain types of edges for our property graph that connect objects of the same type. However, more complex linking approaches are needed. To enable the functionality needed for the current Smart Link Infrastructure, i.e. the calculation of links between arbitrary data objects representing entities such as persons, documents, etc., most of the steps depicted in Figure 4 are extended and an additional, optional rule application step is added to the process. Figure 5 shows a general linking workflow. Especially when trying to support search and information discovery in knowledge management processes, the data often consists of unstructured text from documents with possibly incomplete metadata. The string similarity of the content of two text documents, for example, does not provide enough information about the relation between them. The only relationship that could be derived from a low lexicographic distance is that the documents have duplicate text or large parts of overlapping text. Named entity recognition, keyword detection, the detection of hyperlinks and mail addresses, as well as other text mining approaches become important to link documents with semantic relations that go beyond 'sameAs' relations. A linking workflow that relies on text mining and uses a simple rule to decide the relation type could look like ...
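As an illustration of this kind of rule-based linking, the following sketch approximates named entity extraction with a crude keyword heuristic and applies a single rule that emits a typed link when two documents share enough entities. The function names, the relation type and the thresholds are assumptions made for the example, not part of the Smart Link Infrastructure:

import re
from collections import Counter

def extract_entities(text, top_k=10):
    """Crude stand-in for named entity recognition / keyword detection:
    picks the most frequent capitalized tokens as candidate entities."""
    tokens = re.findall(r"\b[A-Z][a-zA-Z]{2,}\b", text)
    return {tok for tok, _ in Counter(tokens).most_common(top_k)}

def link_documents(doc_a, doc_b, min_shared=2):
    """Simple rule: if two documents mention enough of the same entities,
    emit a typed link ('mentionsSameEntities') with a confidence score."""
    ents_a, ents_b = extract_entities(doc_a), extract_entities(doc_b)
    shared = ents_a & ents_b
    if len(shared) >= min_shared:
        confidence = len(shared) / max(len(ents_a | ents_b), 1)
        return ("mentionsSameEntities", confidence, sorted(shared))
    return None  # no link derived by this rule

# Example usage with two short text snippets
link = link_documents(
    "Report on the Smart Link Infrastructure prototype developed in Leipzig.",
    "Meeting notes: discussed the Smart Link Store prototype and open issues.",
)
print(link)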
Context 2
... architecture of the Smart Link Infrastructure is shown in Figure 1. It consists of three layers. On the top are the user interface components. These are HTML user interfaces supporting administrative tasks, i.e. data import and linking of data records, as well as graphical user interfaces for the visualization of graphs, querying and the creation of reports. Since the Smart Link Infrastructure aims at an integration with existing software systems, external tools like OneNote (as used in the knowledge management processes described in chapter 3) are supported as well. The second layer consists of a collection of services: the Schema Matching Service, the Link Generator Service and the Storage Service. The first two serve data import and management tasks, while the Storage Service represents the general interface to the central graph database, providing operations to store and query data. The services each expose a REST-based API towards the UI components and external clients. Communication happens largely with JSON messages. At the bottom of the architecture the data sources can be found. One is the central graph database itself, and on the left are the multitude of sources (e.g. databases, file repositories, SPARQL endpoints etc.) from which data is imported.

To be able to holistically analyze and search within data that is distributed across multiple heterogeneous data sources, various data integration approaches are used. There are some similarities to the ETL processes used when building data warehouses to allow efficient analytic processing, e.g. when data from independently created databases is to be combined into a unified view that gives new insights for analysts. The general workflow applied within the Smart Link Infrastructure is depicted in Figure 2. The data integration workflow starts with the data import, which is supported by a service providing schema matching methods. In this step, relevant data sources such as files or databases can be selected and a mapping between their schema and the schema of the Smart Link Store is created. Based on the mapping, new entities are added to the information graph. The use of the schema matching service is optional and supports administrators during the definition of the mapping. It could simplify subsequent steps if a sound input schema is available. In the linking step, entity resolution approaches are applied to identify relations between entity sets that originate from different sources. In contrast to standard linking approaches, we also support text mining techniques and new mapping types. Text mining helps to deal with unstructured data like text documents, whereas new mapping types are needed to determine the type of semantic relations between entities. The linking step is important for defining a well-connected information graph, especially if the imported data sets contain no direct references to each other. In the center of the Smart Link workflow there is a graph database, the Smart Link Store, which provides the capability to store and query data. In particular, it offers means for executing pattern queries on the information graph that are crucial for the analysis. Each step of the workflow will be detailed in the following chapters. Chapter 3 will introduce two real-world use case scenarios that utilize the Smart Link Infrastructure and give an insight into the actual usage.

In order to pre-fill the Smart Link Store with entities, an administrative user can use the Data Import Tool. 
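The architecture description above mentions REST-based APIs and JSON messages between the services and their clients. As a rough, hypothetical sketch of such a call (the endpoint path, port and payload fields are assumptions, not the actual Storage Service API), adding an entity to the information graph could look like this:

import requests

# Hypothetical base URL of the Storage Service; the real deployment details
# are not specified in the excerpt above.
STORAGE_SERVICE = "http://localhost:8080/storage"

def add_entity(entity_type, properties):
    """Post a new entity as a JSON message to the (assumed) REST API
    of the Storage Service."""
    payload = {"type": entity_type, "properties": properties}
    resp = requests.post(STORAGE_SERVICE + "/entities", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. the stored entity including its generated id

# Example: import a person record from an external source
created = add_entity("Person", {"name": "Jane Doe", "source": "hr_database"})
print(created)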
Such pre-filling allows a user to connect to existing data sources such as relational databases, Excel files or data sources that are exposed through a SPARQL endpoint. When integrating a data source into the information graph, the schema of the data source needs to be mapped to the existing flexible schema of the information graph. Such a mapping describes how elements of the source schema correspond to elements of the information graph schema. As discussed above, extending the graph schema is not problematic, but existing types and properties should be reused. Adding properties to existing types or adding new types is triggered implicitly when new entities are uploaded. Defining the mappings can be complex and time-consuming. It is often done manually, with the help of point-and-click interfaces. The data import tool of the Smart Link Infrastructure offers a point-and-click interface similar to existing mapping solutions (see Figure 3). Moreover, a schema matching service is integrated that is able to compute suggestions for mappings.

To reduce the manual effort in mapping, the matching service semi-automatically computes a mapping suggestion for the user. It contains a number of matching algorithms and a library of schema importers for different schema types [PER11, DR02]. It takes two schemas as input and computes a mapping suggestion between them. Similarities between source and target elements are computed on the metadata level but also on the instance level. Since current matching systems are often not robust enough to cope with very heterogeneous source schemas, we developed an adaptive matching approach [PER12]. This approach automatically configures a schema matching process that consists of a set of operators for matching and filtering. Based on measured features of the input schemas and intermediate results, so-called rewrite rules can be defined. These rules analyze the input schemas and intermediate results while a process is executing and rewrite the process to better fit the problem at hand. As already described, the adaptive schema matching approach is crucial for mapping the heterogeneous schema types to an integrated schema and for finally integrating those sources within a common information graph. The schema matching service described above implements parts of the adaptive matching system and is therefore able to improve the quality of matching results.

Creating graphs from a structured data source like a database with well-modeled metadata can be relatively easy. In contrast, creating a graph from unstructured sources like document collections or independently created databases requires a component to find links between entities. The Link Generator Service allows the creation, management and execution of workflows, so-called linkers, that can determine relations between entities, the relation type and the confidence of these relations. Linkers are related to the entity resolution workflows used in various data integration and data quality related scenarios (e.g. link discovery in the web of linked open data [NKH+13], web data integration [WK11] or classic ETL processes in data warehousing [CP11]). Figure 4 shows how entity resolution workflows look at an abstract level (cf. [Chr12]). Normally they are used for detecting data objects that are equivalent in the real world but have different representations in multiple data sources. 
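To make the idea of a mapping suggestion more concrete, the following is a minimal sketch that scores source attribute names against target attribute names and proposes the best correspondence above a threshold. It uses only metadata-level name similarity; the actual matching service additionally exploits instance-level evidence and the adaptive rewrite rules of [PER12], and all attribute names and thresholds below are illustrative:

from difflib import SequenceMatcher

def name_similarity(a, b):
    """Metadata-level similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_mapping(source_attrs, target_attrs, threshold=0.5):
    """Propose, for each source attribute, the most similar target attribute
    as a correspondence (source, target, sim) if it exceeds the threshold."""
    suggestions = []
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: name_similarity(s, t))
        sim = name_similarity(s, best)
        if sim >= threshold:
            suggestions.append((s, best, round(sim, 2)))
    return suggestions

# Example: map a relational source schema onto the information graph schema
source = ["emp_name", "e_mail", "dept"]
target = ["name", "email", "department", "document_title"]
print(suggest_mapping(source, target))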
The input of an entity resolution workflow is usually one or more sets of data records from one or multiple databases. The operations in the preprocessing step include data transformation steps (e.g. to convert data types or remove special characters) and filter operations or blocking steps to reduce the search space for finding matching objects [DN09]. Match operations are then applied to determine the pairwise similarity of the candidates. Depending on the domain, different distance metrics on one attribute or a combination of attributes can be used to determine the likelihood of two entities being equal [EIV07]. The match result is usually an instance-level mapping: a set of correspondences of the type (entity1, entity2, sim) where sim is the confidence of two entities being equal. In general, a mapping contains correspondences of a single type, i.e. 'sameAs'. This resulting mapping is then used to determine a set of matching object pairs and a set of non-matching object pairs from the input data sources.

Standard entity resolution workflows can be used to find objects in the graph store that are very similar and to merge them under the assumption that they are duplicate nodes or slightly different versions of the same node. These data quality related workflows, however, are not the main focus of the Link Generator Service. Instead of searching for pairs with a high confidence of a 'sameAs' relation, linking workflows aim to find links between arbitrary objects and to determine their type. In the easiest case a linker can determine a link between two objects simply by calculating the similarity of a certain field that exists in both objects. An example of this is a linker that tries to find documents written by the same author, which involves the following ...
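The sketch below illustrates this abstract workflow: a simple blocking step on one attribute, a pairwise match operation based on string similarity, and a threshold that separates matching from non-matching pairs. The attribute name, blocking key and threshold are illustrative assumptions and not the configuration of the Link Generator Service:

from difflib import SequenceMatcher
from collections import defaultdict

def similarity(a, b):
    """Pairwise string similarity in [0, 1], used as the match operation."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(records_a, records_b, attr="author", threshold=0.85):
    """Return correspondences (id_a, id_b, sim) for record pairs whose
    attribute values are similar enough; pairs below the threshold are non-matches."""
    # Blocking: group candidates by a cheap key (first letter of the attribute)
    # to reduce the search space before the pairwise comparison.
    blocks = defaultdict(list)
    for rid, rec in records_b.items():
        blocks[rec[attr][:1].lower()].append(rid)

    mapping = []
    for id_a, rec_a in records_a.items():
        for id_b in blocks.get(rec_a[attr][:1].lower(), []):
            sim = similarity(rec_a[attr], records_b[id_b][attr])
            if sim >= threshold:
                mapping.append((id_a, id_b, round(sim, 2)))
    return mapping  # correspondences, e.g. of type 'sameAs' or 'sameAuthor'

# Example: documents from two sources with slightly different author spellings
docs_a = {"d1": {"author": "Erhard Rahm"}, "d2": {"author": "Jane Doe"}}
docs_b = {"d3": {"author": "Erhard Rahmm"}, "d4": {"author": "John Smith"}}
print(resolve(docs_a, docs_b))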

Similar publications

Article
This article aims to analyze the relationship between logistics and knowledge management in a context before and after the emergence of industry 4.0. This article identifies the main technologies applied to logistics 4.0. For this, a survey of articles was conducted in the search engines Scopus and Web of Science for the period from 2012 to 2017. I...
Article
Hysa B., Spałek S., “Opportunities and threats presented by social media in project management”, Heliyon, Volume 5, Issue 4, 2019, ISSN 2405-8440, https://doi.org/10.1016/j.heliyon.2019.e01488, pp. 1-28, SCOPUS indexed. The application of new technologies is rapidly increasing, not only in private but also in professional spheres, including projec...
Article
Purpose The purpose of this paper is to present the main barriers, practices, methods and knowledge management tools in startups that are characterized as agile organizations with dynamic capabilities to meet the demands of a business environment of high volatility, uncertainties, complexity and ambiguity. Design/methodology/approach The conceptua...
Article
This paper deliberates the influence of organisational agility (OA) on knowledge management (KM), which enables organisations to survive and achieve their competitive advantage through developing and integrating the KM strategy and sustainable knowledge transfer capability. Currently, the conception of agility has become widespread in organisationa...

Citations

... Drawing connections between documents makes the corpus much more valuable, as discussed earlier. Currently, documents can be overlaid with extracted entities or annotated with terms [Barczyński et al. 2010], and semantic relations based on entity occurrence [Peukert et al. 2015] can be used to generate typed links between documents. After deriving semantics from documents, a logical next step is deriving semantics from unstructured corpora. ...
Article
Enterprise data is an amalgam of mostly semi-structured and unstructured data and documents stored in heterogeneous systems. The available structure is often not readily apparent or modelled to be useful. Formats such as PDF, DWG, Excel, or Word offer a high grade of flexibility; the issue is rather that their freeform content does not divulge its structure and meaning. In the case of binary formats as used in CAD or simulation tools, even basic textual content may be missing. Structured metadata is only sometimes available. When taking a step back, we see a challenging quality issue not based on individual documents, but on the whole corpus of enterprise documents. Full paper available: http://www.informatik.uni-oldenburg.de/~there/