Jakub Simko

Kempelen Institute of Intelligent Technologies

Doctor of Philosophy

Publications: 62
Reads: 9,908
Citations: 465

Publications (62)
Preprint
Full-text available
The generative large language models (LLMs) are increasingly used for data augmentation tasks, where text samples are paraphrased (or generated anew) and then used for classifier fine-tuning. Existing works on augmentation leverage the few-shot scenarios, where samples are given to LLMs as part of prompts, leading to better augmentations. Yet, the...
Preprint
Full-text available
Generative large language models (LLMs) are increasingly being used for data augmentation tasks, where text samples are LLM-paraphrased and then used for classifier fine-tuning. However, research that would confirm a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study if (and when) is t...
Preprint
Full-text available
With the goal of uncovering the challenges faced by European AI students during their research endeavors, we surveyed 28 AI doctoral candidates from 13 European countries. The outcomes underscore challenges in three key areas: (1) the findability and quality of AI resources such as datasets, models, and experiments; (2) the difficulties in replicat...
Preprint
Full-text available
While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples,...
Preprint
Full-text available
The latest generative large language models (LLMs) have found their application in data augmentation tasks, where small numbers of text samples are LLM-paraphrased and then used to fine-tune downstream models. However, more research is needed to assess how different prompts, seed data selection strategies, filtering methods, or model settings affec...
Preprint
Full-text available
The emergence of generative large language models (LLMs) raises the question: what will be its impact on crowdsourcing? Traditionally, crowdsourcing has been used for acquiring solutions to a wide variety of human-intelligence tasks, including ones involving text generation, manipulation or evaluation. For some of these tasks, models like ChatGPT c...
Preprint
Full-text available
Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper introduces a new multilingual dataset -- MultiClaim -- for previously fact-checked claim retrieval. We collected 28k posts in 27 lan...
Conference Paper
Full-text available
The emergence of generative large language models (LLMs) raises the question: what will be its impact on crowdsourcing? Traditionally, crowdsourcing has been used for acquiring solutions to a wide variety of human-intelligence tasks, including ones involving text generation, modification or evaluation. For some of these tasks, models like ChatGPT ca...
Preprint
Full-text available
To mitigate the negative effects of false information more effectively, the development of automated AI (artificial intelligence) tools assisting fact-checkers is needed. Despite the existing research, there is still a gap between the fact-checking practitioners' needs and pains and the current AI research. We aspire to bridge this gap by employing...
Article
In this paper, we present results of an auditing study performed over YouTube aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to “burst the bubble”, i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve i...
Conference Paper
Full-text available
False information detection models are susceptible to adversarial attacks. Such susceptibility is a critical weakness of detection models. Automated creation of adversarial samples can ultimately help to augment training sets and create more robust detection models. However, automatically generated adversarial samples often do not preserve the info...
Preprint
Full-text available
In this paper, we present results of an auditing study performed over YouTube aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to "burst the bubble", i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve i...
Preprint
Full-text available
False information detection models are susceptible to adversarial attacks. Such susceptibility is a critical weakness of detection models. Automated creation of adversarial samples can ultimately help to augment training sets and create more robust detection models. However, automatically generated adversarial samples often do not preserve the info...
Conference Paper
Full-text available
In this paper, we describe a black-box sockpuppeting audit which we carried out to investigate the creation and bursting dynamics of misinformation filter bubbles on YouTube. Pre-programmed agents acting as YouTube users stimulated YouTube's recommender systems: they first watched a series of misinformation promoting videos (bubble creation) and th...
Preprint
Full-text available
False information has a significant negative influence on individuals as well as on the whole society. Especially in the current COVID-19 era, we witness an unprecedented growth of medical misinformation. To help tackle this problem with machine learning approaches, we are publishing a feature-rich dataset of approx. 317k medical news articles/blog...
Preprint
The negative effects of misinformation filter bubbles in adaptive systems have been known to researchers for some time. Several studies investigated, most prominently on YouTube, how fast a user can get into a misinformation filter bubble simply by selecting wrong choices from the items offered. Yet, no studies so far have investigated what it take...
Preprint
Full-text available
The online spreading of fake news is a major issue threatening entire societies. Much of this spreading is enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to answer this by characterizing the fake news and devising automated methods for detecting them. The detection methods...
Article
Full-text available
The online spreading of fake news is a major issue threatening entire societies. Much of this spreading is enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to answer this by characterising the fake news and devising automated methods for detecting them. The detection methods...
Conference Paper
Full-text available
The online spreading of fake news (and misinformation in general) has been recently identified as a major issue threatening entire societies. Much of this spreading was enabled by new media formats, namely social networks and online media sites. Researchers and practitioners have been trying to answer this by characterizing the fake news and devisi...
Chapter
Eye-tracking data provide many new options in the domain of user modeling. In our work, we focus on the automatic detection of web-navigation skill from eye-tracking data. We strive to gain a comprehensive view of the impact of navigation skills on addressing specific user studies and overall interaction on the Web. We proposed an approach for estimatin...
Conference Paper
Onboarding users to a complex application or a new functionality can be a serious issue, especially for organizations that need to train their new employees. Using a complex application without proper training or guidance can lead to users' confusion and frustration. In this paper, we introduce the onboarding platform YesElf intended for web applic...
Article
Full-text available
In usability studies involving eye-tracking, quantitative analysis of gaze data requires information about so-called scene occurrences. Scene occurrences are time segments during which the application user interface remains more or less static, so gaze events (e.g., fixations) can be mapped to the particular areas of interest (user interface elemen...
Conference Paper
Full-text available
When analyzing user implicit feedback in recommender systems, several biases need to be taken into account. A user is influenced by the position (i.e., position bias) or by the appeal of the items (i.e., visual bias). Since images have become an essential part of the Web, the study of their impact on user behavior during the decision-making tasks i...
Article
Full-text available
The costs of eye-tracking technologies steadily decrease. This allows research institutions to obtain multiple eye-tracking devices. Already, several multiple eye-tracker laboratories have been established. Researchers begin to recognize the subfield of group eye-tracking. In comparison to the single-participant eye-tracking, group eye-tracking bri...
Conference Paper
Full-text available
Modern constructivist approaches to education dictate active experimentation with the study material and have been linked with improved learning outcomes in STEM fields. During classroom time we believe it is important for students to experiment with the lecture material since active recall helps them to start the memory encoding process as well as...
Conference Paper
Full-text available
The study of emotions in human-computer interaction has increased in recent years. With successful classification of emotions, we could get instant feedback from users, gain a better understanding of human behavior while using information technologies, and thus make systems and user interfaces more empathic and intelligent. In our work...
Conference Paper
Full-text available
When creating intelligent systems, we often need proper knowledge bases and resources annotated with metadata. Sometimes, we have no option other than to utilize crowdsourcing to acquire the data in the necessary quantity. Crowdsourcing is a costly endeavor, always with room for improvement in task-solving quantity and quality. Studies show that co...
Conference Paper
Gathering proper descriptive metadata for multimedia resources is nowadays essential for effective information processing, recommendation and personalization. We still need to employ a human workforce in crowdsourcing scenarios to solve particular metadata acquisition tasks. In this paper we present a human computation game, which acquires me...
Chapter
This chapter presents our semantics acquisition game called CityLights and experiments related to it. Its purpose is to validate existing tags for musical resources in datasets where tag quality (correctness, descriptiveness) is not ensured. The game takes the form of a music quiz, where the player first hears a music track and is then asked to s...
Chapter
In this chapter, we review the state-of-the-art of semantics acquisition games (SAGs), focusing primarily on the purposes these games fulfill. We first define the terms: crowdsourcing game, human-computation game, semantics acquisition game, serious game and game with a purpose. Then, we present a classification of semantics acquisition games’ purp...
Chapter
In this chapter we review the field of semantics acquisition to provide ground for further discussion on the semantics acquisition games. First, we cover the necessary definitions and review the main “client” approaches for semantics utilization—the information retrieval applications. Then, we move through three major groups of semantics acquisitio...
Chapter
We follow up on our semantics acquisition game (SAG) design classification, presented in the previous chapter. In this chapter, we focus on our own contributions to this design space: new “design patterns”. We demonstrate them on our semantics acquisition games: the Little Search Game, PexAce and CityLights. The contributions include novel “helper art...
Chapter
In this chapter, we review our game-based approach for term relationship acquisition, called the Little Search Game. The game is focused, in particular, on the discovery of term relationships that are hard to acquire by automated means. The principal game mechanic is the formulation of search queries by the player. The goal of the player is, by uti...
Chapter
In this chapter, we present a semantics acquisition game for image tag acquisition, called PexAce. In this game, the player’s task is to (consecutively, in turns) disclose and conceal card pairs laid down on the game board, remember images on them and correctly identify identical image pairs. The game acquires the image descriptions through a game...
Chapter
This chapter discusses the state of the art of semantics acquisition games (SAGs) from the perspective of their design, abstracting from their purposes. Our primary concern is the lack of a methodology for the difficult task of SAG creation. At the same time, we are aware that even for “regular” games, no such methodology exists. However, we aimed...
Chapter
Crowdsourcing and semantics acquisition games have been with us for almost a decade. They have helped solve many human intelligence tasks and attracted many researchers and practitioners. They have demonstrated their potential, but also revealed weak points, among others, inefficient incentives. As a meta-problem, the lack of holistic design methodology...
Article
Full-text available
Effective acquisition of descriptive semantics for images is still an open issue today. Crowd-based human computation represents a family of approaches able to provide large scale metadata with decent quality. Within this field, games with a purpose (GWAP) have become increasingly important, as they have the potential to motivate contributors to th...
Conference Paper
Full-text available
Web 2.0 principles are reflected in the learning domain and provide means for interactivity and collaboration. Student activities during learning in this environment can be utilized to gather data usable for learning corpora enrichment. It is now a research issue to examine to what extent the student crowd is reliable in delivering useful artifacts an...
Chapter
The effective acquisition of (semantic) metadata is crucial for many present day applications. Games with a purpose address this issue by transforming computational problems into computer games. The authors present a novel approach to metadata acquisition via Little Search Game (LSG) – a competitive web search game, whose purpose is the creation of...
Article
Full-text available
The quantity of music metadata on the Web is sufficient; music recommendation and online repository systems are proof of it. However, it has become a real challenge to keep the quality of these metadata at a reasonable level, as the cost of manual validation is too high and current automatic approaches are inaccurate. In this paper we present a game with a purpos...
Article
Full-text available
An effective search and organization of personal multimedia repositories demands very specific, owner-related metadata (e.g. person names, places, events). Only the resource owners and their social circles are able to provide these metadata, but are often not motivated to do so. To increase their motivation, we introduce a game-based personal image...
Article
Full-text available
The effective acquisition of semantic metadata is crucial for many present day applications. Games with a purpose address this issue by transforming computational problems into computer games. The authors present a novel approach to metadata acquisition via Little Search Game (LSG), a competitive web search game, whose purpose is the creation of a ter...
Conference Paper
Full-text available
Semantic structures, ranging from ontologies to flat folksonomies, are widely used on the Web despite the fact that their creation in sufficient quality is often a costly task. We propose a new approach for acquiring a lightweight network of related terms via the Little Search Game - a competitive browser game in search query formulation. The forma...
Conference Paper
Full-text available
With the proliferation of mobile devices, managing the growing volume of personal user-generated multimedia content is becoming more demanding. Proper organization of this content requires manual metadata authoring, since automated or crowdsourcing approaches are inapplicable in the case of personal content or content of a small social group (e.g. family). Recently...
Conference Paper
Full-text available
We present a novel approach intended to reduce user effort required to retrieve and/or revisit previously discovered information by exploiting web search and navigation history. In our approach, we collect streams of user actions during search and navigation sessions, identify individual user goals and construct and persistently store visual trees...
Article
Full-text available
Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Ilkovičova 3, 842 16 Bratislava. Abstract: Energy consumption constantly demands our attention. Resources are not inexhaustible, so building new power plants cannot be the only solution. Smart households in particular have the potential to contribute to the optimization of...
Article
Full-text available
The Semantic Web has to overcome several challenges ranging from web resource annotation to domain modelling. Acquiring general or domain knowledge ontologies is done by automated approaches with only limited success or is left to costly human experts. Games with a Purpose offer an opportunity to employ a broader crowd of laypeople in these tasks by trans...
Article
Full-text available
Some of today's computational tasks are still subject to human labor, because computational machinery paradigms are unable to deal with them (esp. in terms of quality). These tasks include metadata acquisition and domain modeling, the two essential processes needed for enabling effective and adaptive hypermedia systems. Hence, a whole field of...
