Magnus Sahlgren's research while affiliated with RISE Research Institutes of Sweden and other places

Publications (108)

Book
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring mea...
Chapter
The distributional representation of a lexical item is typically a vector representing its co-occurrences with linguistic contexts. This chapter introduces the basic notions to construct distributional semantic representations from corpora. We present (i) the major types of linguistic contexts used to characterize the distributional properties of l...
Chapter
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring mea...
Chapter
This chapter focuses on the evaluation of distributional semantic models (DSMs). Distributional semantics has usually favored intrinsic methods that test DSMs for their ability to model various kinds of semantic similarity and relatedness. Recently, extrinsic evaluation has also become very popular: the distributional vectors are fed into a downstr...
Chapter
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring mea...
Chapter
This chapter presents current research in compositional distributional semantics, which aims at designing methods to construct the interpretation of complex linguistic expressions from the distributional representations of the lexical items they contain. This theme includes two major questions that we are going to explore: What is the distributiona...
Chapter
Lexical semantic competence is a multifaceted and complex reality, which includes the ability of drawing inferences, distinguishing different word senses, referring to the entities in the world, and so on. A long-standing tradition of research in linguistics and cognitive science has investigated these issues using symbolic representations. The aim...
Chapter
Distributional semantics is the study of how distributional information can be used to model semantic facts. Its theoretical foundation has become known as the Distributional Hypothesis: Lexemes with similar linguistic contexts have similar meanings. This chapter presents the epistemological principles of distributional semantics. First, we explore...
Chapter
In this chapter, we review random encoding models that directly reduce the dimensionality of distributional data without first building a co-occurrence matrix. While matrix distributional semantic models (DSMs) output either explicit or implicit distributional vectors, random encoding models only produce low-dimensional embeddings, and emphasize ef...
Chapter
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring mea...
Chapter
The most recent development in distributional semantics is represented by models based on artificial neural networks. In this chapter, we focus on the use of neural networks to build static embeddings. Like random encoding models, neural networks incrementally learn embeddings by reducing the high dimensionality of distributional data without build...
Chapter
This chapter discusses the major types of matrix models, a rich and multifarious family of distributional semantic models (DSMs) that extend and generalize the vector space model in information retrieval from which they derive the use of co-occurrence matrices to represent distributional information. We first focus on a group of matrix DSMs (e.g.,...
Preprint
This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can s...
Preprint
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as Nordic ones, where the availability of text corpora is limited. In order to facilitate the...
Article
The lack of effective, scalable solutions for lifestyle treatment is a global clinical problem, causing severe morbidity and mortality. We developed a method for lifestyle treatment that promotes self-reflection and iterative behavioral change, provided as a digital tool, and evaluated its effect in 370 patients with type 2 diabetes (ClinicalTrials...
Article
Linguistic Explorations of Societies (LES) is an interdisciplinary research project with scholars from the fields of political science, computer science, and computational linguistics. The overarching ambition of LES has been to contribute to the survey-based comparative scholarship by compiling and analyzing online text data within and between lan...
Article
Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by neural language models. Although an extensive body of research has been devoted to Distributional Semantic M...
Preprint
In this paper, we identify the state of data as being an important reason for failure in applied Natural Language Processing (NLP) projects. We argue that there is a gap between academic research in NLP and its application to problems outside academia, and that this gap is rooted in poor mutual understanding between academic researchers and their n...
Preprint
Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the eff...
Article
This paper discusses the current critique against neural network-based Natural Language Understanding solutions known as language models. We argue that much of the current debate revolves around an argumentation error that we refer to as the singleton fallacy: the assumption that a concept (in this case, language, meaning, and understanding) refe...
Preprint
Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by Transformer neural language models. Although an extensive body of research has been devoted to Distributiona...
Preprint
Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling t...
Preprint
Large scale contextual representation models have significantly advanced NLP in recent years, understanding the semantics of text to a degree never seen before. However, they need to process large amounts of data to achieve high-quality results. Joining and accessing all these data from multiple sources can be extremely challenging due to privacy a...
Preprint
This paper discusses the current critique against neural network-based Natural Language Understanding (NLU) solutions known as language models. We argue that much of the current debate rests on an argumentation error that we will refer to as the singleton fallacy: the assumption that language, meaning, and understanding are single and uniform pheno...
Preprint
This paper presents the first Swedish evaluation benchmark for textual semantic similarity. The benchmark is compiled by simply running the English STS-B dataset through the Google machine translation API. This paper discusses potential problems with using such a simple approach to compile a Swedish evaluation benchmark, including translation error...
Preprint
This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods. The contents of the document are based on the practical challenges and frequently asked questions we h...
Article
Dynamic Topic Modeling (DTM) is the ultimate solution for extracting topics from short texts generated in Online Social Networks (OSNs) like Twitter. It needs to be scalable and to account for the sparsity and dynamicity of short texts. Current solutions combine probabilistic mixture models like Dirichlet Multinomial or Pitman-Yor Process...
Article
We propose a novel method for enriching word embeddings without the need for a labeled corpus. Instead, we show that relying on a regressor – trained with a small lexicon to predict pseudo-labels – significantly improves performance over current techniques that rely on human-derived sentence-level labels for an entire corpus. Our approach enables e...
Article
A learning machine, in the form of a gating network that governs a finite number of different machine learning methods, is described at the conceptual level with examples of concrete prediction subtasks. A historical data set with data from over 5000 patients in Internet-based psychological treatment will be used to equip healthcare staff with deci...
Preprint
Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as "what is being talked about, regarding X" and "what do people feel, regarding X". In this paper, we investigate another venue for social media monitoring, namely issue ownership and agenda setting, which are conce...
Preprint
This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and impact on the frequency distribution of text representations. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and mu...
Conference Paper
This paper engages with the challenges of designing "implicit interaction", systems (or system features) in which actions are not actively guided or chosen by users but instead come from inference driven system activity. We discuss the difficulty of designing for such systems and outline three Research through Design approaches we have engaged with...
Conference Paper
This paper introduces the notion of a smart data layer for the Internet of Everything. The smart data layer can be seen as an AI that learns a generic representation from heterogeneous data streams with the goal of understanding the state of the user. The smart data layer can be used both as materials for design processes and as the foundation for...
Article
Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments. This is especially true for journalists, politicians, artists, and other public figures. This paper describes how hate directed towards individuals can be measured in online environments using a simple dictionary-based...
Conference Paper
This paper describes design principles for and the implementation of Gavagai Explorer, a new application that builds on interactive text clustering to extract themes from topically coherent text sets such as open text answers to surveys or questionnaires. An automated system is quick, consistent, and has full coverage over the study material. A s...
Article
This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantics models where a user labels terms as belonging to some sought after term set, and a system uses...
Article
The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers. We also explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diver...
Article
The automatic detection and classification of stance (e.g., certainty or agreement) in text data using natural language processing and machine-learning methods creates an opportunity to gain insight into the speakers’ attitudes toward their own and other people’s utterances. However, identifying stance in text presents many challenges related to tr...
Article
Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a s...
Article
The visual exploration of large and complex network structures remains a challenge for many application fields. Moreover, a growing number of real-world networks is multivariate and often interconnected with each other. Entities in a network may have relationships with elements of other related datasets, which do not necessarily have to be networks...
Chapter
The ability to disseminate information instantaneously over vast geographical regions makes the Internet a key facilitator in the radicalisation process and preparations for terrorist attacks. This can be both an asset and a challenge for security agencies. One of the main challenges for security agencies is the sheer amount of information availabl...
Conference Paper
Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Activ...
Article
This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small,...
Conference Paper
The automatic detection and classification of stance taking in text data using natural language processing and machine learning methods create an opportunity to gain insight about the writers' feelings and attitudes towards their own and other people's utterances. However, this task presents multiple challenges related to the training data collecti...
Article
Online social media are a perfect text source for stance analysis. Stance in human communication is concerned with speaker attitudes, beliefs, feelings and opinions. Expressions of stance are associated with the speakers' view of what they are talking about and what is up for discussion and negotiation in the intersubjective exchange. Taking stance...
Conference Paper
This paper introduces a parameterization for word embeddings produced by the Random Indexing framework. The parameterization introduces position specific weights in the context windows, and the approach is shown to improve the performance in both word similarity and sentiment classification tasks. We also demonstrate the relation between Random Ind...
Conference Paper
Machine learning offers significant benefits for systems that process and understand natural language: (a) lower maintenance and upkeep costs than when using manually-constructed resources, (b) easier portability to new domains, tasks, or languages, and (c) robust and timely adaptation to situation-specific settings. However, the behaviour of an ad...
Article
Encoding information about the order in which words typically appear has been shown to improve the performance of high-dimensional semantic space models. This requires an encoding operation capable of binding together vectors in an order-sensitive way, and efficient enough to scale to large text corpora. Although both circular convolution and rando...
Article
This paper is concerned with nearest neighbor search in distributional semantic models. A normal nearest neighbor search only returns a ranked list of neighbors, with no information about the structure or topology of the local neighborhood. This is a potentially serious shortcoming of the mode of querying a distributional semantic model, since a ra...
Conference Paper
A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for w...
Conference Paper
Stance in human communication is a linguistic concept relating to expressions of subjectivity such as the speakers' attitudes and emotions. Taking stance is crucial for the social construction of meaning and can be useful for many application fields such as business intelligence, security analytics, or social media monitoring. In order to process l...
Conference Paper
In this paper we present our experiments on the RepLab 2014 Reputation Dimension task. RepLab is a competitive challenge for Reputation Management Systems. RepLab 2014's reputation dimensions task focuses on categorization of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approac...
Conference Paper
What can text sentiment analysis technology be used for, and does a more usage-informed view on sentiment analysis pose new requirements on technology development?
Article
We present an incremental, scalable and efficient dimension reduction technique for tensors that is based on sparse random linear coding. Data is stored in a compactified representation with fixed size, which makes memory requirements low and predictable. Component encoding and decoding are performed on-line without computationally expensive re-ana...
Conference Paper
This paper describes experiments to use non-terminological information to find attitudinal expressions in written English text. The experiments are based on an analysis of text with respect to not only the vocabulary of content terms present in it (which most other approaches use as a basis for analysis) but also with respect to presence of structu...
Conference Paper
The highly variable and dynamic word usage in social media presents serious challenges for both research and those commercial applications that are geared towards blogs or other user-generated non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environmen...
Conference Paper
Word space models, in the sense of vector space models built on distributional data taken from texts, are used to model semantic relations between words. We argue that the high dimensionality of typical vector space models leads to unintuitive effects on modeling likeness of meaning and that the local structure of word spaces is where interesting se...
Conference Paper
This paper discusses the task of tracking mentions of some topically interesting textual entity from a continuously and dynamically changing flow of text, such as a news feed, the output from an Internet crawler or a similar text source — a task sometimes referred to as buzz monitoring. Standard approaches from the field of information access for i...
Article
Distributional approaches to meaning acquisition utilize distributional properties of linguistic entities as the building blocks of semantics. In doing so, they rely fundamentally on a set of assumptions about the nature of language and meaning referred to as "the distributional hypothesis". The main point of this hypothesis is that there is a corr...
Article
This paper presents work in progress on implementing an embodied question answering system, Dr. Cecilia, in the form of a virtual caregiver, for use in the treatment of eating disorders. The rationale for the system is grounded in one of the few effective treatments for anorexia and bulimia nervosa. The questions and answers databas...
Article
We show that sequence information can be encoded into high-dimensional fixed-width vectors using permutations of coordinates. Computational models of language often represent words with high-dimensional semantic vectors compiled from word-use statistics. A word's semantic vector usually encodes the contexts in which the word appears in a large bod...
Article
This paper reports on an experiment to identify the emotional loading (the "valence") of news headlines. The experiment reported is based on a resource-thrifty approach for valence annotation based on a word-space model and a set of seed words. The model was trained on newsprint, and valence was computed using proximity to one of two manuall...
Article
This paper discusses evaluation methodologies for a particular kind of meaning models known as word-space models, which use distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computat...
Conference Paper
This paper introduces a measure of corpus homogeneity that indicates the amount of topical dispersion in a corpus. The measure is based on the density of neighborhoods in semantic word spaces. We evaluate the measure by comparing the results for five different corpora. Our initial results indicate that the proposed density measure can indeed identi...
Conference Paper
We present four approaches to the Amharic-French bilingual track at CLEF 2005. All experiments use a dictionary-based approach to translate the Amharic queries into French bags-of-words, but while one approach uses word sense discrimination on the translated side of the queries, the other one includes all senses of a translated word in the query...
Conference Paper
This year, the SICS team has concentrated on query processing and on the internal topical structure of the query, specifically compound translation. Compound translation is non-trivial due to dependencies between compound elements. This year, we have investigated topical dependencies between query terms: if a query term happens to be non-topical or...
Article
This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach, which uses the Random Indexing vector space methodology, is based on finding correlations between terms based on their distributional characteristics. The approach requires a minimum of preprocessing and ling...
Conference Paper
The study presented involves several different contextual aspects and is the latest in a continuing series of exploratory experiments on information access behaviour in a multi-lingual context [1, 2]. This year’s interactive cross-lingual information access experiment was designed to measure three parameters we expected would affect the performance...
Conference Paper
This article describes an automatic evaluation procedure for NLP system robustness under the strain of noisy and ill-formed input. The procedure requires no manual work or annotated resources. It is language and annotation scheme independent and produces reliable estimates on the robustness of NLP systems. The only requirement is an estimate on the...
Article
This paper proposes a novel method for automatically acquiring multilingual lexica from non-parallel data and reports some initial experiments to prove the viability of the approach. Using established techniques for building monolingual vector spaces, two independent semantic vector spaces are built from textual data. These vector spaces are rela...
Conference Paper
This experiment tests a simple, scalable, and effective approach to building a domain-specific translation lexicon using distributional statistics over parallelized bilingual corpora. A bilingual lexicon is extracted from aligned Swedish-French data, used to translate CLEF topics from Swedish to French, which resulting French queries are then in t...
Article
AthosMail is a multilingual spoken dialogue system for reading e-mail messages. The key features of the application are adaptivity and the integration of different approaches for spoken interaction. The application has a flexible system structure supporting multiple components for both different and same purposes. The AthosMail system includes com...