Arjen P. de Vries

Radboud University | RU · Institute for Computing and Information Sciences

Prof.dr.ir.

About

348
Publications
44,861
Reads
6,228
Citations
Introduction
Computer scientist specialized in information access to dataspaces through the integration of information retrieval and databases.
Additional affiliations
November 2015 - present
Radboud University
Position
  • Professor (Full)
Description
  • Professor of Information Retrieval.
November 2009 - present
Spinque
Position
  • Co-founder
September 2008 - October 2015
Delft University of Technology
Position
  • Full Professor (0.2 fte)

Publications (348)
Article
Full-text available
In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophist...
Conference Paper
Full-text available
In the course of a search session, searchers often modify their queries several times. In most previous work analyzing search logs, the addition of terms to a query is identified with query specification and the removal of terms with query generalization. By analyzing the result sets that motivated searchers to make modifications, we show that this...
Conference Paper
Full-text available
This paper introduces the concept of a Parameterised Search System (PSS), which allows flexibility in user queries, and, more importantly, allows system engineers to easily define customised search strategies. Putting this idea into practice requires a carefully designed system architecture that supports a declarative abstraction language for the s...
Conference Paper
Full-text available
This position statement advocates that the integration of information retrieval and databases, a topic that has been studied for many years (see e.g. [3]), is now in a state where the technology is ready to be brought out of the laboratory, and that this technology is especially a good match for the meaningful, semantic annotations that are the top...
Article
Full-text available
Recently, online social networks have emerged that allow people to share their multimedia files, retrieve interesting content, and discover like-minded people. These systems often provide the possibility to annotate the content with tags and ratings. Using a random walk through the social annotation graph, we have combined these annotations into a...
Article
Full-text available
A recommender system imposes differences between users, by presenting to them different recommendation lists, which they respond to, resulting in different “reaction” lists. Comparison of the differences in the recommendation and reaction lists can indicate different user states. Users can approve the imposed difference, end up narrowing the differ...
Chapter
Domain specialists such as council members may benefit from specialised search functionality, but it is unclear how to formalise the search requirements when developing a search system. We adapt a faceted task model for the purpose of characterising the tasks of a target user group. We first identify which task facets council members use to describ...
Preprint
Full-text available
Pre-trained language models such as BERT have been a key ingredient to achieve state-of-the-art results on a variety of tasks in natural language processing and, more recently, also in information retrieval. Recent research even claims that BERT is able to capture factual knowledge about entity relations and properties, the information that is commo...
Preprint
Full-text available
User intent classification is an important task in information retrieval. In this work, we introduce a revised taxonomy of user intent. We take the widely used differentiation between navigational, transactional and informational queries as a starting point, and identify three different sub-classes for the informational queries: instrumental, factu...
Preprint
Full-text available
Machine understanding of user utterances in conversational systems is of utmost importance for enabling engaging and meaningful conversations with users. Entity Linking (EL) is one of the means of text understanding, with proven efficacy for various downstream tasks in information retrieval. In this paper, we study entity linking for conversational...
Preprint
We explore how to generate effective queries based on search tasks. Our approach has three main steps: 1) identify search tasks based on research goals, 2) manually classify search queries according to those tasks, and 3) compare three methods to improve search rankings based on the task context. The most promising approach is based on expanding th...
Preprint
Full-text available
Conversational AI systems are being used in personal devices, providing users with highly personalized content. Personalized knowledge graphs (PKGs) are one of the recently proposed methods to store users' information in a structured form and tailor answers to their liking. Personalization, however, is prone to amplifying bias and contributing to t...
Chapter
Word embeddings provide a common basis for modern natural language processing tasks; however, they have also been a source of discussion regarding their possible biases. This has led to a number of publications regarding algorithms for removing this bias from word embeddings. Debiasing should make the embeddings fairer in their use, avoiding potent...
Preprint
Full-text available
Entity linking is a standard component in modern retrieval systems that is often performed by third-party toolkits. Despite the plethora of open source options, it is difficult to find a single system that has a modular architecture where certain components may be replaced, does not depend on external sources, can easily be updated to newer Wikipedi...
Preprint
Full-text available
In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings contributes to a higher incr...
Chapter
In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings contributes to a higher incr...
Chapter
Full-text available
When researchers speak of BM25, it is not entirely clear which variant they mean, since many tweaks to Robertson et al.’s original formulation have been proposed. When practitioners speak of BM25, they most likely refer to the implementation in the Lucene open-source search library. Does this ambiguity “matter”? We attempt to answer this question w...
Preprint
Full-text available
There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions...
Chapter
Full-text available
Conference Paper
Full-text available
As part of the project "Strengthening the Lebanese Water and Agricultural Sector", a Managed Aquifer Recharge (MAR) pilot system was implemented in the western Bekaa Valley. This paper gives a project overview: 1) Background: The geological setting and challenges of implementing MAR systems in karstic aquifers; 2) Development of a site suitability...
Preprint
Full-text available
Search conducted in a work context is an everyday activity that has been around since long before the Web was invented, yet we still seem to understand little about its general characteristics. With this paper we aim to contribute to a better understanding of this large but rather multi-faceted area of 'professional search'. Unlike task-based studi...
Article
Full-text available
It has been widely acknowledged that reinstallations and re-executions of contemporary artworks substantially rely on available documentation. Especially for installations and performances it is crucial to record the artist’s intent, past iterations, and tacit knowledge involved in staging the artwork. The growing presence of contemporary artworks...
Chapter
This chapter discusses the relationship between privacy and algorithms that make use of large amounts of multimedia data. As users continue to post their audiovisual content online, and as companies continue to collect user profiles and interaction data, concerns about privacy are becoming increasingly urgent. The chapter focuses on multimedia algo...
Article
Full-text available
In this report we describe the outcome of the First International Workshop on Professional Search, held in co-location with SIGIR 2018. The workshop addressed the specific requirements and challenges of professional search, as opposed to web search. The workshop included a survey held among 113 professional searchers, two keynote talks, six short p...
Conference Paper
This RecSys Challenge paper by Team Radboud presents a solution to the automatic playlist continuation (APC) task using random walks, inspired by Pinterest's Pixie [5] and earlier work by the second author [4]. The generic idea of recommendation using random walks is specialised to the APC task by the specific choices made to represent playlists an...
Article
Full-text available
The purpose of the Strategic Workshop in Information Retrieval in Lorne is to explore the long-range issues of the Information Retrieval field, to recognize challenges that are on, or even over, the horizon, to build consensus on some of the key challenges, and to disseminate the resulting information to the research community. The intent is that thi...
Conference Paper
Full-text available
Professional search is a problem area in which many facets of information retrieval are addressed, both system-related (e.g. distributed search) and user-related (e.g. complex information needs), and the interface between user and system (e.g. supporting exploratory search tasks). Professional search tasks have specific requirements, different from...
Preprint
Full-text available
We implemented and evaluated a two-stage retrieval method for personalized academic search in which the initial search results are re-ranked using an author-topic profile. In academic search tasks, the user's own data can help optimize the ranking of search results to match the searcher's specific individual needs. The author-topic profile consis...
Article
Full-text available
A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in comm...
Article
Full-text available
This paper describes the participation of team Chicory in the Triple Ranking Challenge of the WSDM Cup 2017. Our approach deploys a large collection of entity-tagged web data to estimate the correctness of the relevance relation expressed by the triples, in combination with a baseline approach using Wikipedia abstracts following [1]. Relevance esti...
Conference Paper
Recommender System research has evolved to focus on developing algorithms capable of high performance in online systems. This development calls for a new evaluation infrastructure that supports multi-dimensional evaluation of recommender systems. Today's researchers should analyze algorithms with respect to a variety of aspects including predictive...
Conference Paper
Full-text available
Implementing keyword search and other IR tasks on top of relational engines has become viable in practice, especially thanks to high-performance column-store technology. Supporting complex combinations of structured and unstructured search in real-world heterogeneous data spaces, however, requires more than "just" IR-on-DB. In this work, we walk t...
Conference Paper
As audiovisual archives are digitizing their collections and making these collections available online, the need arises to also establish connections between different collections and to allow for cross-collection search and browsing. Structured vocabularies, made available as Linked Data, can be used as connecting points by aligning thesauri from...
Conference Paper
Recommender systems leverage both content and user interactions to generate recommendations that fit users' preferences. The recent surge of interest in deep learning presents new opportunities for exploiting these two sources of information. To recommend items we propose to first learn a user-independent high-dimensional semantic space in which it...
Conference Paper
Web archives preserve the fast changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library o...
Conference Paper
In the evaluation of recommender systems, the quality of recommendations made by a newly proposed algorithm is compared to the state-of-the-art, using a given quality measure and dataset. Validity of the evaluation depends on the assumption that the evaluation does not exhibit artefacts resulting from the process of collecting the dataset. The main...
Conference Paper
Successful news recommendation requires facing the challenges of dynamic item sets, contextual item relevance, and of fulfilling non-functional requirements, such as response time. The CLEF NewsREEL challenge is a campaign-style evaluation lab allowing participants to tackle news recommendation and to optimize and evaluate their recommender algorit...
Conference Paper
First Story Detection (FSD) systems aim to identify those news articles that discuss an event that was not reported before. Recent work on FSD has focussed almost exclusively on efficiently detecting documents that are dissimilar from their nearest neighbor. We propose a novel FSD approach that is more effective, by adapting a recently proposed met...
Conference Paper
We have collected the access logs for our university's web domain over a time span of 4.5 years. We now release the pre-processed data of a 3-month period for research into user navigation behavior. We preprocessed the data so that only successful GET requests of web pages by non-bot users are kept. The resulting 3-month collection comprises 9.6M p...
Conference Paper
Full-text available
Since their introduction, Word2Vec and its variants have been widely used to learn semantics-preserving representations of words or entities in an embedding space, which can be used to produce state-of-the-art results for various Natural Language Processing tasks. Existing implementations aim to learn efficiently by running multiple threads in parallel while o...
Article
Full-text available
Since their introduction, Word2Vec and its variants have been widely used to learn semantics-preserving representations of words or entities in an embedding space, which can be used to produce state-of-the-art results for various Natural Language Processing tasks. Existing implementations aim to learn efficiently by running multiple threads in parallel while o...
Conference Paper
Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability for all documents in a collection. Previous e...
Article
Full-text available
The most common approach to measuring the effectiveness of Information Retrieval systems is by using test collections. The Contextual Suggestion (CS) TREC track provides an evaluation framework for systems that recommend items to users given their geographical context. The specific nature of this track allows the participating teams to identify can...
Conference Paper
This paper proposes a range of probabilistic models of local expertise based on geo-tagged social network streams. We assume that frequent visits result in greater familiarity with the location in question. To capture this notion, we rely on spatio-temporal information from users’ online check-in profiles. We evaluate the proposed models on a large...
Presentation
Full-text available
Machine Learning classifiers can be used to analyze trends in counts of items per class, e.g., over time or location. This counting task is a basis for a variety of data analysis use cases, such as the study of species populations living in an ecosystem, or the profiling of customers using a service or a product. Classification results are inherent...
Article
Information retrieval systems rely on multitudes of individual features in order to determine the ranking of documents for a given user and query combination. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make...
Conference Paper
Full-text available
Following online news about a specific event can be a difficult task as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We propose a three-step approach to online news tracki...
Conference Paper
Web archives preserve the fast changing web. While we can archive the web pages, the popularity of queries in the past has usually not been preserved. Previous studies have observed the importance of anchor text for improving the quality of text search, and have shown that anchor text is similar to real user queries and document titles. Other stud...
Conference Paper
Contextual suggestion aims at recommending items to users given their current context, such as location-based tourist recommendations. Our contextual suggestion ranking model consists of two main components: selecting candidate suggestions and providing a ranked list of personalized suggestions. We focus on selecting appropriate suggestions from th...
Conference Paper
Full-text available
Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually show significant conditional dependencies while the vast majority of co-located terms has no impl...
Conference Paper
Traditional batch evaluation metrics assume that user interaction with search results is limited to scanning down a ranked list. However, modern search interfaces come with additional elements supporting result list refinement (RLR) through facets and filters, making user search behavior increasingly dynamic. We develop an evaluation framework that...
Conference Paper
Full-text available
Following news about a specific event can be a difficult task as new information is often scattered across web pages. An up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We demonstrate an approach that is feasible for online tracking of news that...
Article
Full-text available
Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to...
Conference Paper
We investigate the role of geographic proximity in news consumption. Using a month-long log of user interactions with news items of ten information portals, we study the relationship between users' geographic locations and the geographic foci of information portals and local news categories. We find that the location of news consumers correlates wi...
Article
State-of-the-art instance matching approaches do not perform well when used for matching instances across heterogeneous datasets. This shortcoming derives from their core operation depending on direct matching, which involves a direct comparison of instances in the source with instances in the target dataset. Direct matching is not suitable when th...
Conference Paper
Cumulative Citation Recommendation (CCR) is defined as: given a stream of documents on one hand and Knowledge Base (KB) entities on the other, filter, rank and recommend citation-worthy documents. The pipeline encountered in systems that approach this problem involves four stages: filtering, classification, ranking (or scoring), and evaluation. Fil...
Conference Paper
Full-text available
Internet users are turning more frequently to online news as a replacement for traditional media sources such as newspapers or television shows. Still, discovering news events online and following them as they develop can be a difficult task. In previous work, we presented a novel approach to extract sentences from an online stream of news articles...
Conference Paper
Full-text available
Following online news about a specific event can be a difficult task as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. Several approaches exist to compose a summary of salie...
Conference Paper
Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies - most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions...
Conference Paper
Full-text available
Modern relevance models consider a wide range of criteria in order to identify those documents that are expected to satisfy the user's information need. With growing dimensionality of the underlying relevance spaces the need for sophisticated score combination and estimation schemes arises. In this paper, we investigate the use of copulas, a mod...
Conference Paper
Full-text available
Data visualization and exploration tools are crucial for data scientists, especially during pilot studies. In this paper, we present an extensible open-source workbench for aggregating, summarizing and filtering social network profiles derived from tweets. We demonstrate its range of basic features for two use cases: geo-spatial profile summarizati...
Article
In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express...
Conference Paper
Recommender Systems need to deal with different types of users who represent their preferences in various ways. This difference in user behaviour has a deep impact on the final performance of the recommender system, where some users may receive either better or worse recommendations depending, mostly, on the quantity and the quality of the informat...