Geert-Jan Houben
Delft University of Technology | TU
prof.dr.ir.
Publications: 317
Reads: 63,786
Citations: 8,097
Introduction
Geert-Jan Houben currently works at Delft University of Technology. See gjhouben.nl for more info.
Publications (317)
How can humans remain in control of artificial intelligence (AI)-based systems designed to perform tasks autonomously? Such systems are increasingly ubiquitous, creating benefits - but also undesirable situations where moral responsibility for their actions cannot be properly attributed to any particular person or group. The concept of meaningful h...
The concept of meaningful human control has been proposed to address responsibility gaps and mitigate them by establishing conditions that enable a proper attribution of responsibility for humans (e.g., users, designers and developers, manufacturers, legislators). However, the relevant discussions around meaningful human control have so far not res...
The increasing use of data-driven decision support systems in industry and governments is accompanied by the discovery of a plethora of bias and unfairness issues in the outputs of these systems. Multiple computer science communities, and especially machine learning, have started to tackle this problem, often developing algorithmic solutions to mit...
In online crowd mapping, crowd workers recruited through crowdsourcing marketplaces collect geographic data. Compared to traditional mapping methods, where workers physically explore the area, the benefit of using online crowd mapping is the potential to be cost-effective and time-efficient. Previous studies have focused on mapping urban objects us...
Up-to-date listings of retail stores and related building functions are challenging and costly to maintain. We introduce a novel method for automatically detecting, geo-locating, and classifying retail stores and related commercial functions, on the basis of storefronts extracted from street-level imagery. Specifically, we present a deep learning a...
This demo presents VirtualCrowd, a simulation platform for crowdsourcing campaigns. The platform allows the design, configuration, step-by-step execution, and analysis of customized tasks, worker profiles, and crowdsourcing strategies. The platform will be demonstrated through a crowd-mapping example in two cities, which will highlight the utility...
ACM UMAP - User Modelling, Adaptation and Personalization is the premier international conference for researchers and practitioners working on systems that adapt to individual users, to groups of users, and that collect, represent, and model user information. The Theory, Opinion and Reflection (TOR) track at UMAP is designed to highlight emerging a...
Knowledge about the organization of the main physical elements (e.g. streets) and objects (e.g. trees) that structure cities is important in the maintenance of city infrastructure and the planning of future urban interventions. In this paper, a novel approach to crowd-mapping urban objects is proposed. Our method capitalizes on strategies for gener...
The study of learning is grounded in theories and research. Since learning is complex and not directly observable, it is often inferred by collecting and analysing data based on the things learners do or say. By virtue, theories are developed from the analyses of data collected. With the proliferation of technology, large amounts of data are genera...
Scientific evidence for effective learning strategies is primarily derived from studies conducted in controlled laboratory or classroom settings with homogeneous populations. Large-scale online learning environments such as MOOCs provide an opportunity to evaluate the efficacy of these learning strategies in an informal learning context with a dive...
Dialog agents like digital assistants and automated chat interfaces (e.g. chatbots) are becoming increasingly popular as users adapt to conversing with their devices as they would with humans. In this article we present approaches and available tools for dialog management, a component of dialog agents that handles dialog context and decides the next action...
Crowdsourcing has emerged as an effective method of scaling-up tasks previously reserved for a small set of experts. Accordingly, researchers in the large-scale online learning space have begun to employ crowdworkers to conduct research about large-scale, open online learning. We here report results from a crowdsourcing study (N=135) to evaluate th...
Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches emp...
Past research in large-scale learning environments has found one of the most inhibiting factors to learners’ success to be their inability to effectively self-regulate their learning efforts. In traditional small-scale learning environments, personalized feedback (on progress, content, behavior, etc.) has been found to be an effective solution to t...
Massive Open Online Courses (MOOCs) allow learning to take place anytime and anywhere with little external monitoring by teachers. Characteristically, highly diverse groups of learners enrolled in MOOCs are required to make decisions related to their own learning activities to achieve academic success. Therefore, it is considered important to suppo...
This paper applies theory and methodology from the learning design literature to large-scale learning environments through quantitative modeling of the structure and design of Massive Open Online Courses. For two institutions of higher education, we automate the task of encoding pedagogy and learning design principles for 177 courses (which account...
We present LearningQ, a challenging educational question generation dataset containing over 230K document-question pairs. It includes 7K instructor-designed questions assessing knowledge concepts being taught and 223K learner-generated questions seeking in-depth understanding of the taught concepts. We show that, compared to existing datasets that...
City-scale events attract large numbers of attendees in temporarily re-purposed urban environments. In this setting, the real-time measurement of the density of attendees stationing in – or moving through – the event terrain is central to applications such as crowd management, emergency support, and quality of service evaluation. Sensing or communi...
A chatbot is an example of a text-based conversational agent. While natural language understanding and machine learning techniques advance rapidly, current fully automated chatbots still struggle to serve their users well. Human intelligence, brought by crowd workers, freelancers or even full-time employees can be embodied in the chatbot logic to f...
Taking advantage of the vast history of theoretical and empirical findings in the learning literature we have inherited, this research offers a synthesis of prior findings in the domain of empirically evaluated active learning strategies in digital learning environments. The primary concern of the present study is to evaluate these findings with an...
This demo presents SmartPub, a novel web-based platform that supports the exploration and visualization of shallow meta-data (e.g., author list, keywords) and deep meta-data (long-tail named entities, which are rare and often relevant only in a specific knowledge domain) from scientific publications. The platform collects documents from different sou...
Retrieval practice has been established in the learning sciences as one of the most effective strategies to facilitate robust learning in traditional classroom contexts. The cognitive theory underpinning the "testing effect" states that actively recalling information is more effective than passively revisiting materials for storing information in l...
With the increasing amount of scientific publications in digital libraries, it is crucial to capture “deep meta-data” to facilitate more effective search and discovery, like search by topics, research methods, or data sets used in a publication. Such meta-data can also help to better understand and visualize the evolution of research topics or rese...
This brief introduction begins with an overview of the types of research that are relevant to the special issue on Big Personal Data in Interactive Intelligent Systems. The overarching question is: How can big personal data be collected, analyzed, and exploited so as to provide new or improved forms of interaction with intelligent systems, and what...
Massive Open Online Courses (MOOCs) play an ever more central role in open education. However, in contrast to traditional classroom settings, many aspects of learners' behaviour in MOOCs are not well researched. In this work, we focus on modelling learner behaviour in the context of continuous assessments with completion certificates, the most comm...
Data processing pipelines are a core object of interest for data scientists and practitioners operating in a variety of data-related application domains. To effectively capitalise on the experience gained in the creation and adoption of such pipelines, the need arises for mechanisms able to capture knowledge about datasets of interest, data processi...
Social comparison theory asserts that we establish our social and personal worth by comparing ourselves to others. In in-person learning environments, social comparison offers students critical feedback on how to behave and be successful. By contrast, online learning environments afford fewer social cues to facilitate social comparison. Can increas...
Massive Open Online Courses (MOOCs) aim to educate the world, especially learners from developing countries. While MOOCs are certainly available to the masses, they are not yet fully accessible. Although all course content is just clicks away, deeply engaging with a MOOC requires a substantial time commitment, which frequently becomes a barrier to...
The rise of Big Data analytics has been a disruptive game changer for many application domains, allowing the integration into domain-specific applications and systems of insights and knowledge extracted from external big data sets. The effective "injection" of external Big Data demands an understanding of the properties of available data sets, and...
Massive Open Online Courses (MOOCs) aim to educate the world. More often than not, however, MOOCs fall short of this goal — a majority of learners are already highly educated (with a Bachelor degree or more) and come from specific parts of the (developed) world. Learners from developing countries without a higher degree are underrepresented, though...
Massive Open Online Courses (MOOCs) are successful in delivering educational resources to the masses; however, the current retention rates (well below 10%) indicate that they fall short in helping their audience become effective MOOC learners. In this paper, we report two MOOC studies we conducted in order to test the effectiveness of pedagogical st...
Massive Open Online Courses (MOOCs) have gained considerable momentum since their inception in 2011. They are, however, plagued by two issues that threaten their future: learner engagement and learner retention. MOOCs regularly attract tens of thousands of learners, though only a very small percentage complete them successfully. In the traditional...
The successful execution of knowledge crowdsourcing (KC) tasks requires contributors to possess knowledge or mastery in a specific domain. The need for expert contributors limits the capacity of online crowdsourcing marketplaces to cope with KC tasks. While online social platforms emerge as a viable alternative source of expert contributors, how to...
Social media has emerged as one of the data backbones of urban analytics systems. Thanks to geo-located microposts (text-, image-, and video-based) created and shared through portals such as Twitter and Instagram, scientists and practitioners can capitalise on the availability of real-time and semantically rich data sources to perform studies relat...
This demo presents the Crowd Knowledge Curator (CroKnow), a novel web-based platform that streamlines the processes required to enrich existing knowledge bases (e.g. Wikis) by tapping on the latent knowledge of expert contributors in online platforms. The platform integrates a number of tools aimed at supporting the identification of missing data f...
Massive Open Online Courses (MOOCs) have enabled millions of learners across the globe to increase their levels of expertise in a wide variety of subjects. Research efforts surrounding MOOCs are typically focused on improving the learning experience, as the current retention rates (less than 7% of registered learners complete a MOOC) show a large g...
The rising number of Massive Open Online Courses (MOOCs) enables people to advance their knowledge and competencies in a wide range of fields. Learning, though, is only the first step; the transfer of the taught concepts into practice is equally important and often neglected in the investigation of MOOCs. In this paper, we consider the specific case o...
In recent years, gamification, “the use of game design elements in non-game contexts”, has drawn the attention of an increasing number of scientists. Although several studies highlighted the benefits of gamification in many applications, its potential in the enterprise environment still needs to be fully understood.
This work contributes to...
In the Social Web, a large number of individuals store and share private data in social networks like Facebook and Twitter. By agreeing with their license agreements that support a revenue model, which is mostly advertising, occasionally combined with (premium) subscription and transactions, these individuals transfer data ownership to these soci...
For companies across the globe, building and sustaining a talent pipeline has become a top priority. Job satisfaction is a core reason for employee retention and has been shown to be more dependent on the organisational climate, which includes aspects such as working conditions, leadership and inclusion, than on variables such as structure, size, and pay,...
Crowdsourcing has recently become a powerful computational tool for data collection and augmentation. Although crowdsourcing has been extensively applied in diverse domains, most tasks are of low complexity such that workers are assumed to be endless, anonymous and disposable. By unlocking the value of human knowledge-related features, e.g., experi...
This demo presents E-WISE, an expertise-driven recommendation platform built upon Web Question Answering (QA) systems to assist askers in the question-answering process. Although crowdsourcing knowledge (e.g., on-line question-answering) is becoming increasingly important, accelerating the process remains a big challenge. E-WISE blends the rece...
Thanks to reputation and gamification mechanisms, collaborative question answering systems coordinate the process of topical knowledge creation of thousands of users. While successful, these systems face many challenges: on one hand, the volume of submitted questions overgrows the amount of new users willing, and capable, of answering them. On the...
This demo presents Social Glass, a novel web-based platform that supports the analysis, valorisation, integration, and visualisation of large-scale and heterogeneous urban data in the domains of city planning and decision-making. The platform systematically combines publicly available social datasets from municipalities together with social media s...
Social bookmarking communities are now major content production platforms. There, millions of users interact every day on a great variety of knowledge domains, creating new contents, linking to existing ones, and engaging in constructive discussions. Relevant domain-specific content is often mixed with less useful contributions, and domain experts...
Hypertext 2015 will be held in Cyprus from September 2 to 4. This newsletter article briefly introduces the conference and the venue.
We hope to meet you all in Cyprus in September!
http://ht.acm.org/ht2015/
https://www.facebook.com/Hypertext2015
https://twitter.com/Hypertext2015
Conducting analytics over data generated by Social Web portals such as Twitter is challenging, due to the volume, variety and velocity of the data. Commonly, ad hoc pipelines are used, each solving a particular use case. In this paper, we generalize across a range of typical Twitter-data use cases and determine a set of common characteristics. Based on...
This book constitutes the refereed proceedings of the 15th International Conference on Web Engineering, ICWE 2015, held in Rotterdam, The Netherlands, in June 2015.
The 26 full research papers, 11 short papers, 7 industry papers, 11 demonstrations, 6 posters and 4 contributions to the PhD symposium presented were carefully reviewed and selected fro...
The authors' Continuous Predictive Social Media Analytics system operates in real time on social media streams and graphs to recommend venues to visitors of geo- and temporally bounded city-scale events. By combining deductive and inductive stream reasoning techniques with visitor-modeling functionalities, this system semantically analyzes and link...
Collaborative Question Answering (cQA) platforms are a very popular repository of crowd-generated knowledge. By formulating questions, users express needs that other members of the cQA community try to collaboratively satisfy. Poorly formulated questions are less likely to receive useful responses, thus hindering the overall knowledge generation pr...
Question Answering platforms are becoming an important repository of crowd-generated knowledge. In these systems a relatively small subset of users is responsible for the majority of the contributions, and ultimately, for the success of the Q/A system itself. However, due to built-in incentivization mechanisms, standard expert identification method...
From a liberal perspective, pluralism and viewpoint diversity are seen as a necessary condition for a well-functioning democracy. Recently, there have been claims that viewpoint diversity is diminishing in online social networks, putting users in a "bubble", where they receive political information which they agree with. The contributions from our...
Large datasets such as Cultural Heritage collections require detailed annotations when digitised and made available online. Annotating different aspects of such collections requires a variety of knowledge and expertise which is not always possessed by the collection curators. Artwork annotation is an example of a knowledge intensive image annotatio...
Cultural institutions are increasingly contributing content to social media platforms to raise awareness and promote use of their collections. Furthermore, they are often the recipients of user comments containing information that may be incorporated in their catalog records. However, not all user-generated comments can be used for the purpose of e...
The results of our exploratory study provide new insights to crowdsourcing knowledge intensive tasks. We designed and performed an annotation task on a print collection of the Rijksmuseum Amsterdam, involving experts and crowd workers in the domain-specific description of depicted flowers. We created a testbed to collect annotations from flower exp...
Understanding the impact of corporate information publicly distributed on the Web is becoming more and more crucial. In this paper we report the result of a study that involved 130 IBM employees: we explored the correctness and extent of organisational information that can be observed from the online profiles of a company's employees. Our work cont...
Students and researchers of media and communication sciences study the role of media in our society. They frequently search through media archives to manually select items that cover a certain event. When this is done for large time spans and across media-outlets, this task can however be challenging and laborious. Therefore, up until now the focus...
How can the search process on Twitter be improved to better meet the various information needs of its users? As an answer to this question, we have developed the Twinder framework, a scalable search system for Twitter streams. Twinder contains algorithms to determine the relevance of tweets in relation to search requests, as well as components to d...
Queries that users pose to search engines are often ambiguous - either because different users express different query intents with the same query terms or because the query is underspecified and it is unclear which aspect of a particular query the user is interested in. In the Web search setting, search result diversification, whose goal is the cr...
In this work we present an in-depth analysis of the user behaviors on different Social Sharing systems. We consider three popular platforms, Flickr, Delicious and StumbleUpon, and, by combining techniques from social network analysis with techniques from semantic analysis, we characterize the tagging behavior as well as the tendency to create frien...
Cultural institutions are increasingly opening up their repositories and contributing digital objects to social media platforms such as Flickr. In return they often receive user comments containing information that could be incorporated in their catalog records. Since judging the usefulness of a large number of user comments is a labor-intensive task...
Politics and media are heavily intertwined and both play a role in the discussion on policy proposals and current affairs. However, a dataset that allows a joint analysis of the two does not yet exist. In this paper we take the first step by discovering links between parliamentary debates in a political dataset and newspaper articles in a media dat...
With more than 340 million messages that are posted on Twitter every day, the amount of duplicate content as well as the demand for appropriate duplicate detection mechanisms is increasing tremendously. Yet there exists little research that aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-dupli...
Diversity and profundity of the topics in cultural heritage collections make experts from outside the institution indispensable for acquiring qualitative and comprehensive annotations. We define the concept of nichesourcing and present challenges in the process of obtaining qualitative annotations from people in these niches. We believe that ex...
The inherent heterogeneity of datasets on the Semantic Web has created a need to interlink them, and several tools have emerged that automate this task. In this paper we are interested in what happens if we enrich these matching tools with knowledge of the domain of the ontologies. We explore how to express the notion of a domain in terms usable fo...
Adaptation and personalization of e-learning and technology-enhanced learning (TEL) systems in general have become a key factor for learning success with such systems. In order to provide adaptation, the system needs to have access to relevant data about the learner. This paper describes a preliminary study with the goal to infer a...
Estimating the geographic location of images is a task which has received increasing attention recently. Large numbers of images uploaded to platforms such as Flickr do not contain GPS-based latitude/longitude coordinates. Obtaining such geographic information is beneficial for a variety of applications including travelogues, visual place descripti...
Various applications are developed today on top of microblogging services like Twitter. In order to engineer Web applications which operate on microblogging data, there is a need for appropriate filtering techniques to identify messages. In this paper, we focus on detecting Twitter messages (tweets) that report on social events. We introduce a filt...
How can one effectively identify relevant messages in the hundreds of millions of Twitter messages that are posted every day? In this paper, we aim to answer this fundamental research question and introduce Twinder, a scalable search engine for Twitter streams. The Twinder search engine exploits various features to estimate the relevance of Twitter...
Social Web applications such as Twitter and Flickr are widely used services that generate large volumes of usage data. The challenge of modeling the use and the users of such Social Web services based on their data has received a lot of attention in recent years. In this paper, we go a step further and investigate how the Linked Open Data (LOD) clo...
In this article, we analyze and compare user behavior on two different microblogging platforms: (1) Sina Weibo which is the most popular microblogging service in China and (2) Twitter. Such a comparison has not been done before at this scale and is therefore essential for understanding user behavior on microblogging services. In our study, we analy...
Automatically filtering relevant information about a real-world incident from Social Web streams and making the information accessible and findable in the given context of the incident are non-trivial scientific challenges. In this paper, we engineer and evaluate solutions that analyze the semantics of Social Web data streams to solve these challen...
In this paper, we present Twitcident, a framework and Web-based system for filtering, searching and analyzing information about real-world incidents or crises. Twitcident connects to emergency broadcasting services and automatically starts tracking and filtering information from Social Web streams (Twitter) when a new incident occurs. It enriches t...