About
121
Publications
44,568
Reads
3,959
Citations
Introduction
Current institution
Additional affiliations
January 2000 - April 2008
Publications (121)
Existing evaluations of entity linking systems often say little about how the system is going to perform for a particular application. There are four fundamental reasons for this: many benchmarks focus on named entities; it is hard to define which other entities to include; there are ambiguities in entity recognition and entity linking; many benchm...
We present Elevant, a tool for the fully automatic fine-grained evaluation of a set of entity linkers on a set of benchmarks. Elevant provides an automatic breakdown of the performance by various error categories and by entity type. Elevant also provides a rich and compact, yet very intuitive and self-explanatory visualization of the results of a l...
Timetable information in public transportation networks exhibits a large degree of redundancy. For example, consider a bus going from station A to station B at 6:00, 6:15, 6:30, 6:45, 7:00, 7:15, 7:30, ..., 20:00: the very same data can be provided by a frequency-based representation as ‘6:00-20:00, every 15 minutes’ in considerably less space. Neverthel...
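As a rough illustration of the frequency-based representation mentioned in this abstract, the following minimal Python sketch greedily groups sorted departure times into runs with a constant headway. The function names (`compress`, `fmt`) and the `min_run` threshold are assumptions made for this example; this is not the algorithm from the paper.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (first departure, last departure, headway), in minutes


def compress(departures: List[int], min_run: int = 3) -> List[Run]:
    """Greedily group sorted departure times (minutes after midnight) into
    constant-headway runs. A headway of 0 marks a single departure.
    min_run is a hypothetical threshold: shorter runs are not compressed."""
    runs: List[Run] = []
    i, n = 0, len(departures)
    while i < n:
        if i + 1 < n:
            headway = departures[i + 1] - departures[i]
            j = i + 1
            # Extend the run while the gap to the next departure stays the same.
            while j + 1 < n and departures[j + 1] - departures[j] == headway:
                j += 1
            if j - i + 1 >= min_run:
                runs.append((departures[i], departures[j], headway))
                i = j + 1
                continue
        # Too few equally spaced departures: keep this one as a single entry.
        runs.append((departures[i], departures[i], 0))
        i += 1
    return runs


def fmt(run: Run) -> str:
    """Render a run in the style used in the abstract."""
    first, last, headway = run

    def clock(t: int) -> str:
        return f"{t // 60}:{t % 60:02d}"

    if headway == 0:
        return clock(first)
    return f"{clock(first)}-{clock(last)}, every {headway} minutes"


# Departures every 15 minutes from 6:00 to 20:00, as in the abstract's example.
departures = list(range(6 * 60, 20 * 60 + 1, 15))
runs = compress(departures)
print(len(departures), "departures ->", [fmt(r) for r in runs])
```

In this sketch the 57 individual departures collapse into a single run. Real timetables contain irregular departures as well, which the sketch keeps as single entries.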
We show how to achieve fast autocompletion for SPARQL queries on very large knowledge bases. At any position in the body of a SPARQL query, the autocompletion suggests matching subjects, predicates, or objects. The suggestions are context-sensitive in the sense that they lead to a non-empty result and are ranked by their relevance to the part of th...
We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently ari...
We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but it's not part of the problem to correct them. For example, given: "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". It is...
Schematic transit maps (often called “metro maps” in the literature) are important to produce comprehensible visualizations of complex public transit networks. In this work, we investigate the problem of automatically drawing such maps on an octilinear grid with an arbitrary (but optimal) number of edge bends. Our approach can naturally deal with o...
We investigate the following map-matching problem: given a sequence of stations taken by a public transit vehicle and given the underlying network, find the most likely geographical course taken by that vehicle. We provide a new algorithm and tool, which is based on a hidden Markov model and takes characteristics of transit networks into account. O...
We present LOOM (Line-Ordering Optimized Maps), a fully automatic generator of geographically accurate transit maps. The input to LOOM is data about the lines of a given transit network, namely for each line, the sequence of stations it serves and the geographical course the vehicles of this line take. We parse this data from GTFS, the prevailing s...
The WSDM Cup 2017 was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). It addressed key challenges of knowledge bases today: quality assurance and entity search. For quality assurance, we tackle the task of vandalism detection, based on a dataset of more than 82 million user-co...
This paper provides an overview of the triple scoring task at the WSDM Cup 2017, including a description of the task and the dataset, an overview of the participating teams and their results, and a brief account of the methods employed. In a nutshell, the task was to compute relevance scores for knowledge-base triples from relations, where such sco...
We present QLever, a query engine for efficient combined search on a knowledge base and a text corpus, in which named entities from the knowledge base have been identified (that is, recognized and disambiguated). The query language is SPARQL extended by two QLever-specific predicates ql:contains-entity and ql:contains-word, which can express the oc...
We provide a quality evaluation of KB+Text search, a deep integration of knowledge base search and standard full-text search. A knowledge base (KB) is a set of subject–predicate–object triples with a common naming scheme. The standard query language is SPARQL, where queries are essentially lists of triples with variables. KB+Text search extends thi...
We survey recent advances in algorithms for route planning in transportation networks. For road networks, we show that one can compute driving directions in milliseconds or less even at continental scale. A variety of techniques provide different trade-offs between preprocessing effort, space requirements, and query time. Some algorithms can answer...
This article provides a comprehensive overview of the broad area of semantic search on text and knowledge bases. In a nutshell, semantic search is "search with meaning". This "meaning" can refer to various parts of the search process: understanding the query (instead of just finding matches of its components in the data), understanding the data (in...
This monograph provides a comprehensive overview of the broad area of semantic search on text and knowledge bases. In a nutshell, semantic search is "search with meaning". This "meaning" can refer to various parts of the search process: understanding the query (instead of just finding matches of its components in the data), understanding the data (...
We show how to estimate population numbers for arbitrary user-defined regions, down to the level of individual buildings. This is important for various applications like evacuation planning, facility placement, or traffic estimation. However, census data with precise population numbers is typically only available at the level of cities, villages, o...
Real-world factoid or list questions often have a simple structure, yet are hard to match to facts in a given knowledge base due to high representational and linguistic variability. For example, to answer "who is the ceo of apple" on Freebase requires a match to an abstract "leadership" entity with three relations "role", "organization" and "person...
We compute and evaluate relevance scores for knowledge-base triples from type-like relations. Such a score measures the degree to which an entity "belongs" to a type. For example, Quentin Tarantino has various professions, including Film Director, Screenwriter, and Actor. The first two would get a high score in our setting, because those are his ma...
We introduce a framework to create a world-wide live map of public transit, i.e. the real-time movement of all buses, subways, trains and ferries. Our system is based on freely available General Transit Feed Specification (GTFS) timetable data and also features real-time delay information (where available). The main problem of such a live tracker i...
We consider the application of route planning in large public-transportation networks (buses, trains, subways, etc.). Many connections in such networks are operated at periodic time intervals. When a set of connections has sufficient periodicity, it becomes more efficient to store the time range and frequency (e.g., every 15 minutes from 8:00am-6:00...
We present TRAVIC, a thin browser-based client that is able to display smooth vehicle movements on a map. The focus is on visualizing world-wide public transit vehicle movements in an interactive way. But we also investigate other use cases, for example, traffic simulation. We describe in detail which server requests are fired and how the received...
We combine search in triple stores with full-text search into what we call semantic full-text search. We provide a fully functional web application that allows the incremental construction of complex queries on the English Wikipedia combined with the facts from Freebase. The user is guided by context-sensitive suggestions of matching words,...
The vision behind our project is a closed-loop system for continuous deep brain stimulation (DBS) based on features extracted from complex motion data. Our focus is on Parkinson's disease (PD) patients, with a possible expansion to related neurological disorders. The system we envision is lightweight, will continuously gather motion information, wil...
We seek to compute utilization information for public spaces, in particular forests: which parts are used by how many people. Our contribution is threefold. First, we present a sound model for computing this information from publicly available data such as road maps and population counts. Second, we present efficient algorithms for computing the de...
Public-transportation route-planning systems typically work as follows. The user specifies a source and a target location, as well as a departure time. The system then returns one or more optimal trips at or after that departure time. In this paper, we consider guidebook routing, where the goal is to provide time-independent answers that are valid o...
Recent Open Information Extraction (OpenIE) systems utilize grammatical structure to extract facts with very high recall and good precision. In this paper, we point out that a significant fraction of the extracted facts is, however, not informative. For example, for the sentence "The ICRW is a non-profit organization headquartered in Washington", the...
We demonstrate a system for fast and intuitive exploration of the Freebase dataset. This required solving several non-trivial problems, including: entity scores for proper ranking and name disambiguation, a unique meaningful name for every entity and every type, extraction of canonical binary relations from multi-way relations (which in Freebase ar...
In this paper we present a novel index data structure tailored towards semantic full-text search. Semantic full-text search, as we call it, deeply integrates keyword-based full-text search with structured search in ontologies. Queries are SPARQL-like, with additional relations for specifying word-entity co-occurrences. In order to build such querie...
We present Icecite, a new fully web-based research paper management system (RPMS). Icecite facilitates the following otherwise laborious and time-consuming steps typically involved in literature research: automatic metadata and reference extraction, on-click reference downloading, shared annotations, offline availability, and full-featured search i...
We study multi-modal route planning allowing arbitrary (meaningful) combinations of public transportation, walking, and taking a car / taxi. In the straightforward model, the number of Pareto-optimal solutions explodes. It turns out that many of them are similar to each other or unreasonable. We introduce a new filtering procedure, types and thresh...
Transfer pattern routing is a state-of-the-art speed-up technique for finding optimal paths which minimize multiple cost criteria in public transportation networks. It precomputes sequences of transfer stations along optimal paths. At query time, the optimal paths are searched among the stored transfer patterns, which allows for very fast response...
We show how contextual sentence decomposition (CSD), a technique originally developed for high-precision semantic search, can be used for open information extraction (OIE). Intuitively, CSD decomposes a sentence into the parts that semantically "belong together". By identifying the (implicit or explicit) verb in each such part, we obtain facts like...
We consider the problem of fuzzy full-text search in large text collections, that is, full-text search which is robust against errors both on the side of the query as well as on the side of the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milli...
Classical full-text search looks for occurrences of the entered search words in a given set of texts. This approach works very well for many queries, but it also has obvious limits. Semantic search tries to "understand" both the search query and the texts being searched. This art...
Epigenome mapping consortia are generating resources of tremendous value for studying epigenetic regulation. To maximize their utility and impact, new tools are needed that facilitate interactive analysis of epigenome datasets. Here we describe EpiExplorer, a web tool for exploring genome and epigenome data on a genomic scale. We demonstrate EpiExp...
Supplemental figures.
We discuss the advantages and shortcomings of full-text search on the one hand and search in ontologies/triple stores on the other hand. We argue that both techniques have an important quality missing from the other. We advocate a deep integration of the two, and describe the associated requirements and challenges.
We present Broccoli, a fast and easy-to-use search engine for what we call
semantic full-text search. Semantic full-text search combines the capabilities
of standard full-text search and ontology search. The search operates on four
kinds of objects: ordinary words (e.g., edible), classes (e.g., plants),
instances (e.g., Broccoli), and relations (e....
Dimension reduction techniques have been a successful avenue for automatically extracting the "concepts" underlying unstructured data, a task that naturally arises in fields as diverse as information retrieval, image processing, social science, etc. It is surprising how much can be achieved for this task using only the raw data itself, without re...
As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, faceted search, error-tolerant search, database-style select and join, and semantic search. In thi...
We show how to route on very large public transportation networks (up to half a billion arcs) with average query times of
a few milliseconds. We take into account many realistic features like: traffic days, walking between stations, queries between
geographic locations instead of a source and a target station, and multi-criteria cost functions. Our...
This paper is a report on the 3D Shape Retrieval Contest 2010 (SHREC'10) track on large scale retrieval. This benchmark allows evaluating how well retrieval algorithms scale up to large collections of 3D models. The task was to perform 40 queries in a dataset of 10000 shapes. We describe the methods used and discuss the results and significance anal...
We consider fast two-sided error-tolerant search that is robust against errors both on the query side (type alogrithm, find documents with algorithm) as well as on the document side (type algorithm, find documents with alogrithm). We show how to realize this feature with an index blow-up of 10% to 20% and an increase in query time by a factor of at...
When you drive to somewhere ‘far away’, you will leave your current location via one of only a few ‘important’ traffic junctions. Starting from this informal observation, we develop an algorithmic approach, transit node routing, that allows us to reduce quickest-path queries in road networks to a small number of table lookups. We present tw...
There are two kinds of people: those who travel by car, and those who use public transport. The topic of this article is to show that the algorithmic problem of computing the fastest way to get from A to B is also surprisingly different on road networks than on public transportation networks.
On road networks, even very large ones like that of the...
We show how a half-inverted index can be constructed twice as fast as an ordinary inverted index. As shown in a series of
recent works, the half-inverted index enables very fast prefix search, which in turn is the basis for very fast processing
of many other types of advanced queries. Our construction algorithm is truly single-pass in that every po...
We consider the following spelling variants clustering problem: Given a list of distinct words, called lexicon, compute (possibly overlapping) clusters of words which are spelling variants of each other. This problem naturally arises in the context of error-tolerant full-text search of the following kind: For a given query, return not only document...
We present a demo of ESTER, a search engine that combines the ease of use, speed and scalability of full-text search with the powerful semantic capabilities of ontologies. ESTER supports full-text queries, ontological queries and combinations of these, yet its interface is as easy as can be: A standard search field with semantic information provi...
We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every keystroke display those completions of the last query word that would lead to the best hits, and also display the best such hits. The following problem is at the core of this feature: for a fixed document collection, given a s...
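The autocompletion scenario in this abstract can be sketched with an ordinary inverted index: intersect the postings of the words already typed, then scan the sorted vocabulary for words that start with the last, partially typed word and occur in the remaining documents. This is only a naive baseline under assumed names (`build_index`, `complete`) and a toy document set, not the index structure proposed in the paper; it also assumes a non-empty query and leaves out ranking.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def build_index(docs: Dict[int, str]) -> Tuple[Dict[str, Set[int]], List[str]]:
    """Build a word -> document-id inverted index and a sorted vocabulary."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index, sorted(index)


def complete(index: Dict[str, Set[int]], vocab: List[str], query: str) -> Dict[str, List[int]]:
    """Return completions of the last query word that still lead to a hit,
    together with the matching document ids. Assumes a non-empty query."""
    *context, prefix = query.lower().split()

    # Documents that contain all fully typed query words (None = no restriction).
    candidates: Optional[Set[int]] = None
    for word in context:
        postings = index.get(word, set())
        candidates = postings if candidates is None else candidates & postings

    completions: Dict[str, List[int]] = {}
    # In a sorted vocabulary, all words sharing a prefix form one contiguous range.
    start = bisect_left(vocab, prefix)
    for word in vocab[start:]:
        if not word.startswith(prefix):
            break
        hits = index[word] if candidates is None else index[word] & candidates
        if hits:
            completions[word] = sorted(hits)
    return completions


# Toy collection, made up for this example.
docs = {
    1: "fast prefix search",
    2: "prefix sums in parallel",
    3: "search engines and prefetching",
}
index, vocab = build_index(docs)
print(complete(index, vocab, "search pre"))
```

The scan visits every vocabulary word in the prefix range and intersects its postings one by one, which becomes expensive for short prefixes on large collections.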
Recent IR extensions to XML query languages such as XPath 1.0 Full-Text or the NEXI query language of the INEX benchmark series reflect the emerging interest in IR-style ranked retrieval over semistructured data. TopX is a top-k retrieval engine for text and semistructured data. It terminates query execution as soon as it can safely determine the...
We present an efficient realization of the following interactive search engine feature: as the user is typing the query, words that are related to the last query word and that would lead to good hits are suggested, as well as selected such hits. The realization has three parts: (i) building clusters of related terms, (ii) adding this information as...
We present ESTER, a modular and highly efficient system for combined full-text and ontology search. ESTER builds on a query engine that supports two basic operations: prefix search and join. Both of these can be implemented very efficiently with a compact index, yet in combination provide powerful querying capabilities. We show how ESTER can answer...
When you drive to somewhere far away, you will leave your current location via one of only a few important traffic junctions.
Starting from this informal observation, we developed an algorithmic approach, transit node routing, that allows us to reduce
quickest path queries in road networks to a small number of table lookups. For road maps of Wester...
CompleteSearch is a highly interactive search engine, which, instantly after every single keystroke, offers to the user various kinds of feedback, like promising query completions or refinements by category. We combined CompleteSearch with our institute's helpdesk system and carried out a small user study with some of the staff operating the h...
We describe CompleteSearch, an interactive search engine that offers the user a variety of complex features, which at first glance have little in common, yet are all provided via one and the same highly optimized core mechanism. This mechanism answers queries for what we call context-sensitive prefix search and completion: given a set of documents...
Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods i...
We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexin...
We present an improved average case analysis of the maximum cardinality
matching problem. We show that in a bipartite or general random graph on n
vertices, with high probability every non-maximum matching has an augmenting
path of length O(log n). This implies that augmenting path algorithms like
the Hopcroft-Karp algorithm for bipartite graphs an...
This paper describes the setup and results of our contribution to the TREC 2006 Terabyte Track. Our implementation was based on the algorithms proposed in (1) "IO-Top-k: Index-Access Optimized Top-K Query Processing, VLDB'06", with a main focus on the efficiency track.
We introduce the concept of transit nodes, as a means for preprocessing a road network, with given coordinates for each node and a travel time for each edge, such that point-to-point shortest-path queries can be answered extremely fast. The transit nodes are a set of nodes as small as possible with the property that every shortest path that is non-...
Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation...
We argue that the ability to identify pairs of related terms is at the heart of what makes spectral retrieval work in practice. Schemes such as latent semantic indexing (LSI) and its descendants have this ability in the sense that they can be viewed as computing a matrix of term-term relatedness scores which is then used to expand the given documen...
We point out that for two sets of measurements, it can happen that the average of one set is larger than the average of the other set on one scale, but becomes smaller after a non-linear monotone transformation of the individual measurements. We show that the inclusion of error bars is no safeguard against this phenomenon. We give a theorem, ho...
We view a variety of established methods for ranked retrieval from a common angle, namely as a process of combining query-independent rankings that were precomputed for certain attributes. Apart from a general insight into what effectively distinguishes various schemes from each other, we obtain three specific results concerned with concept-based r...
We point out that for two sets of measurements, it can happen that the average of one set is larger than the average of the other set on one scale, but becomes smaller after a non-linear monotone transformation of the individual measurements. We show that the inclusion of error bars is no safeguard against this phenomenon. We give a theorem, howeve...
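The phenomenon described in this abstract is easy to reproduce with a small numeric example. The sketch below uses two made-up measurement sets (not data from the paper) and the natural logarithm as the non-linear monotone transformation.

```python
import math
from statistics import mean

# Two hypothetical sets of measurements, chosen only to illustrate the effect.
a = [1.0, 9.0]
b = [4.0, 4.5]

# On the original scale, set a has the larger average.
mean_a, mean_b = mean(a), mean(b)

# After a monotone transformation (natural log), the order of the averages flips.
log_mean_a = mean(math.log(x) for x in a)
log_mean_b = mean(math.log(x) for x in b)

print(f"original scale: mean(a) = {mean_a:.3f}, mean(b) = {mean_b:.3f}")
print(f"log scale:      mean(a) = {log_mean_a:.3f}, mean(b) = {log_mean_b:.3f}")
```

The flip occurs because the logarithm compresses the large value in `a` much more than it compresses the tightly clustered values in `b`.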