Karl Aberer

Karl Aberer
École Polytechnique Fédérale de Lausanne | EPFL · School of Computer and Communication Sciences

PhD

About

693
Publications
80,666
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
16,160
Citations

Publications

Publications (693)
Preprint
Full-text available
Multilingual pre-trained language models perform remarkably well on cross-lingual transfer for downstream tasks. Despite their impressive performance, our understanding of their language neutrality (i.e., the extent to which they use shared representations to encode similar phenomena across languages) and its role in achieving such performance rema...
Preprint
Full-text available
The COVID-19 pandemic has fueled the spread of misinformation on social media and the Web as a whole. The phenomenon dubbed `infodemic' has taken the challenges of information veracity and trust to new heights by massively introducing seemingly scientific and technical elements into misleading content. Despite the existing body of work on modeling...
Technical Report
Full-text available
Graph neural networks take node features and graph structure as input to build representations for nodes and graphs. While there are a lot of focus on GNN models, understanding the impact of node features and graph structure to GNN performance has received less attention. In this paper, we propose an explanation for the connection between features...
Article
Full-text available
Graph neural networks take node features and graph structure as input to build representations for nodes and graphs. While there are a lot of focus on GNN models, understanding the impact of node features and graph structure to GNN performance has received less attention. In this paper, we propose an explanation for the connection between features...
Preprint
Full-text available
Mapping the technology landscape is crucial for market actors to take informed investment decisions. However, given the large amount of data on the Web and its subsequent information overload, manually retrieving information is a seemingly ineffective and incomplete approach. In this work, we propose an end-to-end recommendation based retrieval app...
Article
Graph embedding aims at learning a vector-based representation of vertices that incorporates the structure of the graph. This representation then enables inference of graph properties. Existing graph embedding techniques, however, do not scale well to large graphs. While several techniques to scale graph embedding using compute clusters have been p...
Preprint
Full-text available
Deep learning-based Natural Language Processing methods, especially transformers, have achieved impressive performance in the last few years. Applying those state-of-the-art NLP methods to legal activities to automate or simplify some simple work is of great value. This work investigates the value of domain adaptive pre-training and language adapte...
Preprint
Full-text available
The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints in edge devices, there has been a rising interest in the compression of these models to improve their inference time and memory footprint. This paper presents a novel los...
Preprint
Full-text available
In the era of misinformation and information inflation, the credibility assessment of the produced news is of the essence. However, fact-checking can be challenging considering the limited references presented in the news. This challenge can be transcended by utilizing the knowledge graph that is related to the news articles. In this work, we prese...
Article
The heterogeneity of today's Web sources requires information retrieval (IR) systems to handle multi-modal queries. Such queries define a user's information needs by different data modalities, such as keywords, hashtags, user profiles, and other media. Recent IR systems answer such a multi-modal query by considering it as a set of separate uni-moda...
Article
Queries to detect isomorphic subgraphs are important in graph-based data management. While the problem of subgraph isomorphism search has received considerable attention for the static setting of a single query, or a batch thereof, existing approaches do not scale to a dynamic setting of a continuous stream of queries. In this paper, we address the...
Technical Report
Full-text available
The heterogeneity of today's Web sources requires information retrieval (IR) systems to handle multi-modal queries. Such queries define a user's information needs by different data modalities, such as keywords, hashtags, user profiles, and other media. Recent IR systems answer such a multi-modal query by considering it as a set of separate uni-moda...
Technical Report
Full-text available
Queries to detect isomorphic subgraphs are important in graph-based data management. While the problem of subgraph isomorphism search has received considerable attention for the static setting of a single query, or a batch thereof, existing approaches do not scale to a dynamic setting of a continuous stream of queries. In this paper, we address the...
Preprint
Full-text available
We demonstrate the SciLens News Platform, a novel system for evaluating the quality of news articles. The SciLens News Platform automatically collects contextual information about news articles in real-time and provides quality indicators about their validity and trustworthiness. These quality indicators derive from i) social media discussions rega...
Article
We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with char-level word embeddings to help both modeling morphologically rich languages, e.g., German, and transfer...
Preprint
This paper presents our approach for SwissText & KONVENS 2020 shared task 2, which is a multi-stage neural model for Swiss German (GSW) identification on Twitter. Our model outputs either GSW or non-GSW and is not meant to be used as a generic language identifier. Our architecture consists of two independent filters where the first one favors recal...
Preprint
Full-text available
We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with char-level word embeddings to help both modeling morphologically rich languages, e.g. German, and transferr...
Preprint
Graph neural network (GNN) is a deep model for graph representation learning. One advantage of graph neural network is its ability to incorporate node features into the learning process. However, this prevents graph neural network from being applied into featureless graphs. In this paper, we first analyze the effects of node features on the perform...
Preprint
Full-text available
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embeddings between existing languages in our model. We sh...
Preprint
Full-text available
Graph embedding aims at learning a vector-based representation of vertices that incorporates the structure of the graph. This representation then enables inference of graph properties. Existing graph embedding techniques, however, do not scale well to large graphs. We therefore propose a framework for parallel computation of a graph embedding using...
Article
Data models capture the structure and characteristic properties of data entities, e.g., in terms of a database schema or an ontology. They are the backbone of diverse applications, reaching from information integration, through peer-to-peer systems and electronic commerce to social networking. Many of these applications involve models of diverse da...
Preprint
Information about world events is disseminated through a wide variety of news channels, each with specific considerations in the choice of their reporting. Although the multiplicity of these outlets should ensure a variety of viewpoints, recent reports suggest that the rising concentration of media ownership may void this assumption. This observati...
Preprint
Full-text available
This paper describes, develops, and validates SciLens, a method to evaluate the quality of scientific news articles. The starting point for our work are structured methodologies that define a series of quality aspects for manually evaluating news. Based on these aspects, we describe a series of indicators of news quality. According to our experimen...
Article
Full-text available
Background: With increased specialization of health care services and high levels of patient mobility, accessing health care services across multiple hospitals or clinics has become very common for diagnosis and treatment, particularly for patients with chronic diseases such as cancer. With informed knowledge of a patient’s history, physicians can...
Preprint
BACKGROUND With increased specialization of health care services and high levels of patient mobility, accessing health care services across multiple hospitals or clinics has become very common for diagnosis and treatment, particularly for patients with chronic diseases such as cancer. With informed knowledge of a patient’s history, physicians can m...
Conference Paper
Full-text available
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embedding between existing languages in our model. We sho...
Conference Paper
Full-text available
In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embedding between existing languages in our model. We sho...
Preprint
Full-text available
In this work, we introduce a novel framework that employs cluster annotation to boost active learning by reducing the number of human interactions required to train deep neural networks. Instead of annotating single samples individually, humans can also label clusters, producing a higher number of annotated samples with the cost of a small label er...
Article
The Web became the central medium for valuable sources of information fusion applications. However, such user-generated resources are often plagued by inaccuracies and misinformation as a result of the inherent openness and uncertainty of the Web. While finding objective data is non-trivial, assessing their credibility with a high confidence is eve...
Conference Paper
News entities must select and filter the coverage they broadcast through their respective channels since the set of world events is too large to be treated exhaustively. The subjective nature of this filtering induces biases due to, among other things, resource constraints, editorial guidelines, ideological affinities, or even the fragmented nature...
Article
Full-text available
Many Web platforms rely on user collaboration to generate high-quality content: Wiki, Q&A communities, etc. Understanding and modeling the different collaborative behaviors is therefore critical. However, collaboration patterns are difficult to capture when the relationships between users are not directly observable, since they need to be inferred...
Article
Distributing data over multiple cloud storage providers automatically provides users with a certain degree of information leakage control, for no single point of attack can leak all the information. However, unplanned distribution of data chunks can lead to high information disclosure even while using multiple clouds. In this paper, we study an imp...
Article
Privacy policies are the primary channel through which companies inform users about their data collection and sharing practices. In their current form, policies remain long and difficult to comprehend, thus merely serving the goal of legally protecting the companies. Short notices based on information extracted from privacy policies have been shown...
Article
Full-text available
Crowdsourcing has been established as an essential means to scale human computation in diverse Web applications, reaching from data integration to information retrieval. Yet, crowd workers have wide-ranging levels of expertise. Large worker populations are heterogeneous and comprise a significant amount of faulty workers. As a consequence, quality...
Conference Paper
Full-text available
Automatically extracting information from social media is challenging given that social content is often noisy, ambiguous, and inconsistent. However, as many stories break on social channels first before being picked up by mainstream media, developing methods to better handle social content is of utmost importance. In this paper, we propose a robus...
Conference Paper
We propose a novel, semi-supervised approach towards domain taxonomy induction from an input vocabulary of seed terms. Unlike all previous approaches, which typically extract direct hypernym edges for terms, our approach utilizes a novel probabilistic framework to extract hypernym subsequences. Taxonomy induction from extracted subsequences is cast...
Conference Paper
Therapeutic Drug Monitoring (TDM) is a key concept in precision medicine. The goal of TDM is to avoid therapeutic failure or toxic effects of a drug due to insufficient or excessive circulating concentration exposure related to between-patient variability in the drug's disposition. We present TUCUXI - an intelligent system for TDM. By making use of...
Article
Full-text available
Classification of social media data is an important approach in understanding user behavior on the Web. Although information on social media can be of different modalities such as texts, images, audio or videos, traditional approaches in classification usually leverage only one prominent modality. Techniques that are able to leverage multiple modal...
Article
Full-text available
The amount of controversial issues being discussed on the Web has been growing dramatically. In articles, blogs, and wikis, people express their points of view in the form of arguments, i.e., claims that are supported by evidence. Discovery of arguments has a large potential for informing decision-making. However, argument discovery is hindered by...
Conference Paper
The trend of time series characterizes the intermediate upward and downward behaviour of time series. Learning and forecasting the trend in time series data play an important role in many real applications, ranging from resource allocation in data centers, load schedule in smart grid, and so on. Inspired by the recent successes of neural networks,...
Chapter
In this chapter, we study the socioeconomic issues that can arise in distributed computing environments such as distributed and open, participatory sensing systems. Due to the decentralized nature of such systems, they present many challenges, some of which are equally socioeconomic and technical in essence. Three such major challenges arise in par...
Article
We propose a simple, yet effective, approach towards inducing multilingual taxonomies from Wikipedia. Given an English taxonomy, our approach leverages the interlanguage links of Wikipedia followed by character-level classifiers to induce high-precision, high-coverage taxonomies in other languages. Through experiments, we demonstrate that our appro...
Article
We propose a novel, semi-supervised approach towards domain taxonomy induction from an input vocabulary of seed terms. Unlike most previous approaches, which typically extract direct hypernym edges for terms, our approach utilizes a novel probabilistic framework to extract hypernym subsequences. Taxonomy induction from extracted subsequences is cas...
Conference Paper
Full-text available
Applying classical time-series analysis techniques to online content is challenging, as web data tends to have data quality issues and is often incomplete, noisy, or poorly aligned. In this paper, we tackle the problem of predicting the evolution of a time series of user activity on the web in a manner that is both accurate and interpretable, using...
Conference Paper
Cloud storage services, like Dropbox and Google Drive, have growing ecosystems of 3rd party apps that are designed to work with users' cloud files. Such apps often request full access to users' files, including files shared with collaborators. Hence, whenever a user grants access to a new vendor, she is inflicting a privacy loss on herself and on h...
Article
Full-text available
The sustained growth of urban settlements in the last years has had an inherent impact on the environment and the quality of life of their inhabitants. In order to support sustainability and improve quality of life in this context, we advocate the fostering of ICT-empowered initiatives that allow citizens to self-monitor their environment and asses...
Article
Diabetes Type 1 is a metabolic disease which results in a lack of insulin production, causing high glucose levels in the blood. It is crucial for diabetic patients to balance this glucose level, and they depend on external substances to do so. In order to keep this level under control, they usually need to resort to invasive glucose control methods...
Chapter
In the recent years, smartphones became part of everyday life for most people. Their computational power and their sensing capabilities unlocked a new universe of possibilities for mobile developers. However, mobile development is still a young field and various pitfalls need to be avoided. In this chapter, the authors present several aspects of mo...
Chapter
Erratum to: R. Meersman et al. (Eds.) Semantic Issues in E-Commerce Systems DOI: 10.1007/978-0-387-35658-7
Article
Full-text available
We provide an up-to-date view on the knowledge management system ScienceWISE (SW) and address issues related to the automatic assignment of articles to research topics. So far, SW has been proven to be an effective platform for managing large volumes of technical articles by means of ontological concept-based browsing. However, as the publication o...
Article
Understanding the severity of vulnerabilities within cloud services is particularly important for today service administrators.Although many systems, e.g., CVSS, have been built to evaluate and score the severity of vulnerabilities for administrators, the scoring schemes employed by these systems fail to take into account the contextual information...
Conference Paper
Full-text available
Social media has become an important instrument for running various types of public campaigns and mobilizing people. Yet, the dynamics of public campaigns on social networking platforms still remain largely unexplored. In this paper, we present an in-depth analysis of over one hundred large-scale campaigns on social media platforms covering more th...
Conference Paper
While social data is being widely used in various applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data summarization which transforms the original dataset into a smaller, yet still useful, subset. Existing...
Conference Paper
Full-text available
Processing data streams is increasingly gaining momentum, given the need to process these flows of information in real-time and at Web scale. In this context, RDF Stream Processing (RSP) and Stream Reasoning (SR) have emerged as solutions to combine semantic technologies with stream and event processing techniques. Research in these areas has propo...
Conference Paper
The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications large datasets are inevitably noisy and contain outliers. These outliers can dramatically degrade the performance of standard machi...
Article
Third party apps that work on top of personal cloud services, such as Google Drive and Drop-box, require access to the user’s data in order to provide some functionality. Through detailed analysis of a hundred popular Google Drive apps from Google’s Chrome store, we discover that the existing permission model is quite often misused: around two-thir...
Conference Paper
Traditional mechanisms for delivering notice and enabling choice have so far failed to protect users’ privacy. Users are continuously frustrated by complex privacy policies, unreachable privacy settings, and a multitude of emerging standards. The miniaturization trend of smart devices and the emergence of the Internet of Things (IoTs) will exacerba...
Conference Paper
Third party applications work on top of existing platforms that host users’ data. Although these apps access this data to provide users with specific services, they can also use it for monetization or profiling purposes. In practice, there is a significant gap between users’ privacy expectations and the actual access levels of 3rd party apps, which...