Djoerd Hiemstra
Radboud University | RU

About

298
Publications
31,369
Reads
5,005
Citations

Publications (298)
Article
Full-text available
In online peer-to-peer fundraising, individual fundraisers, acting on behalf of nonprofit organizations, mobilize their social networks using social media to request donations. Whereas existing studies focus on networks of donors to explain success, we examine the role of the networks of fundraisers and their effect on fundraising outcomes. By draw...
Preprint
We explore how to generate effective queries based on search tasks. Our approach has three main steps: 1) identify search tasks based on research goals, 2) manually classify search queries according to those tasks, and 3) compare three methods to improve search rankings based on the task context. The most promising approach is based on expanding th...
Article
Full-text available
This study examines how the interplay between an online campaign’s network structure and prosocial cultural norms in a country affect charitable giving. We conducted a multilevel analysis that includes Twitter network and aggregated donation data from the 2013 Movember fundraising campaigns in 24 countries during 62 campaign days. Prosocial cultura...
Book
This two-volume set LNCS 11437 and 11438 constitutes the refereed proceedings of the 41st European Conference on IR Research, ECIR 2019, held in Cologne, Germany, in April 2019. The 48 full papers presented together with 2 keynote papers, 44 short papers, 8 demonstration papers, 8 invited CLEF papers, 11 doctoral consortium papers, 4 workshop paper...
Conference Paper
We argue that there is a need for Multi-Tenant Customizable OLTP systems. Such systems need a Multi-Tenant Customizable Database (MTC-DB) as a backing. To stimulate the development of such databases, we propose the benchmark MTCB. Benchmarks for OLTP exist and multi-tenant benchmarks exist, but no MTC-DB benchmark exists that accounts for customiza...
Conference Paper
People tend to type short queries; however, the belief is that longer queries are more effective. Consequently, a number of attempts have been made to encourage and motivate people to enter longer queries. While most have failed, a recent attempt - conducted in a laboratory setup - in which the query box has a halo or glow effect that changes as t...
Conference Paper
Full-text available
We combine social theory and NLP methods to classify English-speaking Twitter users' on-line social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classif...
Conference Paper
Web content changes rapidly [18]. In Focused Web Harvesting [17], which aims to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need to access the set of all web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigatin...
Conference Paper
Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but can also lead to loss of query focus and harm performance in particular...
Article
Full-text available
We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query exp...
Article
A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing...
Article
Full-text available
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particul...
Conference Paper
Full-text available
With the goal of harvesting all information about a given entity, in this paper, we try to harvest all matching documents for a given query submitted on a search engine. The objective is to retrieve all information about for instance "Michael Jackson", "Islamic State", or "FC Barcelona" from indexed data in search engines, or hidden data behind web...
Article
Full-text available
Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the lack of a standard set of evaluation b...
Conference Paper
Full-text available
We consider the task of automatically identifying participants' motivations in the public health campaign Movember and investigate the impact of the different motivations on the amount of campaign donations raised. Our classification scheme is based on the Social Identity Model of Collective Action (van Zomeren et al., 2008). We find that automatic...
Article
Full-text available
On the widely used messaging platform Twitter, about 2% of tweets contain their geographical location as exact GPS coordinates (latitude and longitude). Knowing the location of a tweet is useful for many data analytics questions. This research looks at determining a location for tweets that do not contain GPS coordinates. An acc...
Article
Full-text available
Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in larg...
Conference Paper
This paper presents 'FedWeb Greatest Hits', a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.
Conference Paper
Full-text available
Health campaigns that aim to raise awareness and subsequently raise funds for research and treatment are commonplace. While many local campaigns exist, very few attract the attention of a global audience. One of those global campaigns is Movember, an annual campaign during the month of November, that is directed at men’s health with special foci on...
Conference Paper
Selecting and aggregating different types of content from multiple vertical search engines is becoming popular in web search. The user vertical intent, the verticals the user expects to be relevant for a particular information need, might not correspond to the vertical collection relevance, the verticals containing the most relevant content. In thi...
Conference Paper
Sequence labeling has wide applications in natural language processing and speech processing. Popular sequence labeling models suffer from some known problems. Hidden Markov models (HMMs) are generative models and they cannot encode transition features; conditional Markov models (CMMs) suffer from the label bias problem; and training of conditional...
Article
Children represent an increasing group of web users. Some of the key problems that hamper their search experience are their limited vocabulary, their difficulty in using the right keywords, and the inappropriateness of their general-purpose query suggestions. In this work, we propose a method that uses tags from social media to suggest queries relat...
Article
Full-text available
The Internet is increasingly used by young children for all kinds of purposes. Nonetheless, there are not many resources especially designed for children on the Internet and most of the content online is designed for grown-up users. This situation is problematic if we consider the large differences between young users and adults since their topic i...
Conference Paper
Full-text available
To express a more nuanced notion of relevance as compared to binary judgments, graded relevance levels can be used for the evaluation of search results. Especially in Web search, users strongly prefer top results over less relevant results, and yet they often disagree on which are the top results for a given information need. Whereas previous works...
Article
Harvesters play a crucial role in making deep web data accessible. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To have all influential elements in one big picture, a new c...
Conference Paper
Structured prediction has wide applications in many areas. Powerful and popular models for structured prediction have been developed. Despite the successes, they suffer from some known problems: (i) Hidden Markov models are generative models which suffer from the mismatch problem. Also it is difficult to incorporate overlapping, non-independent fea...
Article
Full-text available
Concept based video retrieval often relies on imperfect and uncertain concept detectors. We propose a general ranking framework to define effective and robust ranking functions, through explicitly addressing detector uncertainty. It can cope with multiple concept-based representations per video segment and it allows the re-use of effective text ret...
Conference Paper
We describe a novel and flexible method that translates free-text queries to structured queries for filling out web forms. This can benefit searching in web databases which only allow access to their information through complex web forms. We introduce boosting and discounting heuristics, and use the constraints imposed by a web form to find a solut...
Article
In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks (i.e., the DBLP publication network and the Twitter social network). This task is a crucial step toward building a multi-environment expert finding system that has recently attracted much attention in Infor...
Conference Paper
Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the ef...
Conference Paper
Building a federated search engine based on a large number of existing web search engines is a challenge: implementing the programming interface (API) for each search engine is an exacting and time-consuming job. In this demonstration we present SearchResultFinder, a browser plugin which speeds up determining reusable XPaths for extracting search resu...
Conference Paper
In this paper we explore vertical selection methods in aggregated search in the specific domain of topics for children between 7 and 12 years old. A test collection was built consisting of 25 verticals, 3.8K queries, and relevance assessments that map relevant verticals to a large sample of these queries. We gather relevance assessment...
Conference Paper
Full-text available
This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three-tier algorithm (query categorization, information extraction, and output generation) and offers suggestions on how each of these can be im...
Conference Paper
In the early years of information retrieval, the focus of research was on systems aspects such as crawling, indexing, and relevancy ranking. Over the years, more and more user-related information such as click information or search history has entered the equation creating more and more personalized search experiences, though still within the scope...
Conference Paper
In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks. Utilizing several name similarity patterns and contextual properties of these networks, we design a focused crawler to find highly probable matching pairs, then the problem of name disambiguation is reduced...
Conference Paper
How well can the relevance of a page be predicted, purely based on snippets? This would be highly useful in a Federated Web Search setting where caching large amounts of result snippets is more feasible than caching entire pages. The experiments reported in this paper make use of result snippets and pages from a diverse set of actual Web search eng...
Article
Accessing information is an essential factor in decision making processes occurring in different domains. Therefore, broadening the coverage of available information for the decision makers is of vital importance. In such an information-thirsty environment, accessing every source of information is considered highly valuable. Nowadays, the main...
Article
With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, it is the knowledge of a data source size that enables the algorithms to make accurate decisions in stopping crawling or sampling processes which can be...
Article
A panel discussion on the use of proprietary data was held at SIGIR 2012 in Portland. This report summarizes the positions put forward by the six panelists and the points that arose during the wider discussion that followed.
Conference Paper
Full-text available
What is the likelihood that a Web page is considered relevant to a query, given the relevance assessment of the corresponding snippet? Using a new federated IR test collection that contains search results from over a hundred search engines on the internet, we are able to investigate such research questions from a global perspective. Our test colle...
Conference Paper
With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, it is the knowledge of a data source size that enables the algorithms to make accurate decisions in stopping the crawling or sampling processes which...
Conference Paper
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show th...
Article
Full-text available
Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shard...
Conference Paper
Full-text available
Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment ha...
Conference Paper
Full-text available
One of the biggest problems that children experience while searching the web occurs during the query formulation process. Children have been found to struggle to formulate queries based on keywords given their limited vocabulary and their difficulty in choosing the right keywords. In this work we propose a method that utilizes tags from social media t...
Article
Full-text available
In this paper we address the following important questions for concept-based video retrieval: (1) What is the impact of detector performance on the performance of concept-based retrieval engines, and (2) will these engines be applicable to real-life search tasks if detector performance improves in the future? We use Monte Carlo simulations to answe...
Article
Full-text available
When undergoing medical treatment in combination with extended stays in hospitals, children have been frequently found to develop an interest in their condition and the course of treatment. A natural means of searching for related information would be to use a web search engine. The medical domain, however, imposes several key challenges on young a...
Article
Full-text available
We report how users interact with an experimental system that transforms single-field textual input into a multi-field query for an existing travel planner system. The experimental system was made publicly available and we collected over 30,000 queries from almost 12,000 users. From the free-text query log, we examined how users formulated structur...
Article
Full-text available
Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these has seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralized solutions. In this art...
Conference Paper
Full-text available
The Emma Search (EmSe) demonstrator developed for the Emma Children's Hospital showcases the PuppyIR project and PuppyIR framework for building information services for children.
Article
Full-text available
Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to a...
Book
This book constitutes the proceedings of the Third International Conference of the CLEF Initiative, CLEF 2012, held in Rome, Italy, in September 2012. The 14 papers and 3 poster abstracts presented were carefully reviewed and selected for inclusion in this volume. Furthermore, the book contains 2 keynote papers. The papers are organized in topical...
Conference Paper
For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels o...
Conference Paper
This paper investigates the problem of using free-text queries as an alternative means for searching 'behind' web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-t...
Article
Full-text available
For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is by caching search results for queries and allowing peers to access each other's cache. In this paper we ex...
Conference Paper
Full-text available
Children experience several difficulties retrieving information using current Information Retrieval (IR) systems. Particularly, children struggle to find the right keywords to construct queries given their lack of domain knowledge. This problem is even more critical in the case of the specialized health domain. In this work we present a novel metho...
Conference Paper
Full-text available
We investigated the use of free-text queries as an alternative means for searching ‘behind’ web forms. We conducted a user study where we evaluated our prototype free-text interface in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at com...
Article
Full-text available
Recent work shows that children are very well capable of searching with Google, due to their familiarity with the interface. However, children do have difficulties with the vertical list representation of the results. In this paper, we present an alternative result representation for a touch interface, the ImagePile. The ImagePile displays the resu...
Article
Full-text available
This paper investigates the problem of translating free-text queries into key-value pairs as an alternative means for searching 'behind' web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that...
Article
Full-text available
This report presents preliminary results for the TREC 2010 ad-hoc web search task. We ran our MIREX system on 0.5 billion web documents from the ClueWeb09 crawl. On average, the system retrieves at least 3 relevant documents on the first result page containing 10 results, using a simple index consisting of anchor texts, page titles, and spam remova...
Conference Paper
Full-text available
We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages showing that sequential scanning is a viable approach to running large-scale information retr...
Article
Full-text available
The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. As an alternative, piecewise training divides the full graph into pieces, trains them independently, and combines the learned weights at test time. In this paper, we present separate training for undirected models based on the novel Co...