Magnus Sahlgren's research while affiliated with RISE Research Institutes of Sweden and other places
What is this page?
This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.
It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.
Publications (108)
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring mea...
The distributional representation of a lexical item is typically a vector representing its co-occurrences with linguistic contexts. This chapter introduces the basic notions to construct distributional semantic representations from corpora. We present (i) the major types of linguistic contexts used to characterize the distributional properties of l...
This chapter focuses on the evaluation of distributional semantic models (DSMs). Distributional semantics has usually favored intrinsic methods that test DSMs for their ability to model various kinds of semantic similarity and relatedness. Recently, extrinsic evaluation has also become very popular: the distributional vectors are fed into a downstr...
This chapter presents current research in compositional distributional semantics, which aims at designing methods to construct the interpretation of complex linguistic expressions from the distributional representations of the lexical items they contain. This theme includes two major questions that we are going to explore: What is the distributiona...
Lexical semantic competence is a multifaceted and complex reality, which includes the ability of drawing inferences, distinguishing different word senses, referring to the entities in the world, and so on. A long-standing tradition of research in linguistics and cognitive science has investigated these issues using symbolic representations. The aim...
Distributional semantics is the study of how distributional information can be used to model semantic facts. Its theoretical foundation has become known as the Distributional Hypothesis: Lexemes with similar linguistic contexts have similar meanings. This chapter presents the epistemological principles of distributional semantics. First, we explore...
In this chapter, we review random encoding models that directly reduce the dimensionality of distributional data without first building a co-occurrence matrix. While matrix distributional semantic models (DSMs) output either explicit or implicit distributional vectors, random encoding models only produce low-dimensional embeddings, and emphasize ef...
The most recent development in distributional semantics is represented by models based on artificial neural networks. In this chapter, we focus on the use of neural networks to build static embeddings. Like random encoding models, neural networks incrementally learn embeddings by reducing the high dimensionality of distributional data without build...
This chapter discusses the major types of matrix models, a rich and multifarious family of distributional semantic models (DSMs) that extend and generalize the vector space model in information retrieval from which they derive the use of co-occurrence matrices to represent distributional information. We first focus on a group of matrix DSMs (e.g.,...
This paper details the process of developing the first native large generative language model for the Nordic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation and considerations for release strategies. We hope that this paper can s...
Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of the LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as Nordic ones, where the availability of text corpora is limited. In order to facilitate the...
The lack of effective, scalable solutions for lifestyle treatment is a global clinical problem, causing severe morbidity and mortality. We developed a method for lifestyle treatment that promotes self-reflection and iterative behavioral change, provided as a digital tool, and evaluated its effect in 370 patients with type 2 diabetes (ClinicalTrials...
Linguistic Explorations of Societies (LES) is an interdisciplinary research project with scholars from the fields of political science, computer science, and computational linguistics. The overarching ambition of LES has been to contribute to the survey-based comparative scholarship by compiling and analyzing online text data within and between lan...
Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by neural language models. Although an extensive body of research has been devoted to Distributional Semantic M...
In this paper, we identify the state of data as being an important reason for failure in applied Natural Language Processing (NLP) projects. We argue that there is a gap between academic research in NLP and its application to problems outside academia, and that this gap is rooted in poor mutual understanding between academic researchers and their n...
Recent studies in zero-shot cross-lingual learning using multilingual models have falsified the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. Inspired by this advancement, we introduce a cross-lingual transfer method for monolingual models based on domain adaptation. We study the eff...
This paper discusses the current critique against neural network-based Natural Language Understanding solutions known as language models. We argue that much of the current debate revolves around an argumentation error that we refer to as the singleton fallacy: the assumption that a concept (in this case, language, meaning, and understanding) refe...
Distributional semantics has deeply changed in the last decades. First, predict models stole the thunder from traditional count ones, and more recently both of them were replaced in many NLP applications by contextualized vectors produced by Transformer neural language models. Although an extensive body of research has been devoted to Distributiona...
Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling t...
Large scale contextual representation models have significantly advanced NLP in recent years, understanding the semantics of text to a degree never seen before. However, they need to process large amounts of data to achieve high-quality results. Joining and accessing all these data from multiple sources can be extremely challenging due to privacy a...
This paper discusses the current critique against neural network-based Natural Language Understanding (NLU) solutions known as language models. We argue that much of the current debate rests on an argumentation error that we will refer to as the singleton fallacy: the assumption that language, meaning, and understanding are single and uniform pheno...
This paper presents the first Swedish evaluation benchmark for textual semantic similarity. The benchmark is compiled by simply running the English STS-B dataset through the Google machine translation API. This paper discusses potential problems with using such a simple approach to compile a Swedish evaluation benchmark, including translation error...
This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis methods. The contents of the document are based on the practical challenges and frequently asked questions we h...
Dynamic Topic Modeling (DTM) is the ultimate solution for extracting topics from short texts generated in Online Social Networks (OSNs) like Twitter. It needs to be scalable and to be able to account for the sparsity and dynamicity of short texts. Current solutions combine probabilistic mixture models like Dirichlet Multinomial or Pitman-Yor Process...
We propose a novel method for enriching word-embeddings without the need of a labeled corpus. Instead, we show that relying on a regressor – trained with a small lexicon to predict pseudo-labels – significantly improves performance over current techniques that rely on human-derived sentence-level labels for an entire corpus. Our approach enables e...
A learning machine, in the form of a gating network that governs a finite number of different machine learning methods, is described at the conceptual level with examples of concrete prediction subtasks. A historical data set with data from over 5000 patients in Internet-based psychological treatment will be used to equip healthcare staff with deci...
Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as, "what is being talked about, regarding X", and "what do people feel, regarding X". In this paper, we investigate another venue for social media monitoring, namely issue ownership and agenda setting, which are conce...
This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and impact on the frequency distribution of text representations. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and mu...
This paper engages with the challenges of designing "implicit interaction", systems (or system features) in which actions are not actively guided or chosen by users but instead come from inference driven system activity. We discuss the difficulty of designing for such systems and outline three Research through Design approaches we have engaged with...
This paper introduces the notion of a smart data layer for the Internet of Everything. The smart data layer can be seen as an AI that learns a generic representation from heterogeneous data streams with the goal of understanding the state of the user. The smart data layer can be used both as materials for design processes and as the foundation for...
Hateful comments, swearwords and sometimes even death threats are becoming a reality for many people today in online environments. This is especially true for journalists, politicians, artists, and other public figures. This paper describes how hate directed towards individuals can be measured in online environments using a simple dictionary-based...
This paper describes design principles for and the implementation of Gavagai Explorer---a new application which builds on interactive text clustering to extract themes from topically coherent text sets such as open text answers to surveys or questionnaires. An automated system is quick, consistent, and has full coverage over the study material. A s...
This paper is a short empirical study of the performance of centrality and classification based iterative term set expansion methods for distributional semantic models. Iterative term set expansion is an interactive process using distributional semantics models where a user labels terms as belonging to some sought after term set, and a system uses...
The aim of this study is to explore the possibility of identifying speaker stance in discourse, provide an analytical resource for it and an evaluation of the level of agreement across speakers. We also explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diver...
The automatic detection and classification of stance (e.g., certainty or agreement) in text data using natural language processing and machine-learning methods creates an opportunity to gain insight into the speakers’ attitudes toward their own and other people’s utterances. However, identifying stance in text presents many challenges related to tr...
Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a s...
The visual exploration of large and complex network structures remains a challenge for many application fields. Moreover, a growing number of real-world networks is multivariate and often interconnected with each other. Entities in a network may have relationships with elements of other related datasets, which do not necessarily have to be networks...
The ability to disseminate information instantaneously over vast geographical regions makes the Internet a key facilitator in the radicalisation process and preparations for terrorist attacks. This can be both an asset and a challenge for security agencies. One of the main challenges for security agencies is the sheer amount of information availabl...
Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Activ...
This paper investigates the effects of data size and frequency range on distributional semantic models. We compare the performance of a number of representative models for several test settings over data of varying sizes, and over test items of various frequency. Our results show that neural network-based models underperform when the data is small,...
The automatic detection and classification of stance taking in text data using natural language processing and machine learning methods create an opportunity to gain insight about the writers' feelings and attitudes towards their own and other people's utterances. However, this task presents multiple challenges related to the training data collecti...
Online social media are a perfect text source for stance analysis. Stance in human communication is concerned with speaker attitudes, beliefs, feelings and opinions. Expressions of stance are associated with the speakers' view of what they are talking about and what is up for discussion and negotiation in the intersubjective exchange. Taking stance...
This paper introduces a parameterization for word embeddings produced by the Random Indexing framework. The parameterization introduces position specific weights in the context windows, and the approach is shown to improve the performance in both word similarity and sentiment classification tasks. We also demonstrate the relation between Random Ind...
Machine learning offers significant benefits for systems that process and understand natural language: (a) lower maintenance and upkeep costs than when using manually-constructed resources, (b) easier portability to new domains, tasks, or languages, and (c) robust and timely adaptation to situation-specific settings. However, the behaviour of an ad...
Encoding information about the order in which words typically appear has been shown to improve the performance of high-dimensional semantic space models. This requires an encoding operation capable of binding together vectors in an order-sensitive way, and efficient enough to scale to large text corpora. Although both circular convolution and rando...
This paper is concerned with nearest neighbor search in distributional semantic models. A normal nearest neighbor search only returns a ranked list of neighbors, with no information about the structure or topology of the local neighborhood. This is a potentially serious shortcoming of the mode of querying a distributional semantic model, since a ra...
A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for w...
Stance in human communication is a linguistic concept relating to expressions of subjectivity such as the speakers' attitudes and emotions. Taking stance is crucial for the social construction of meaning and can be useful for many application fields such as business intelligence, security analytics, or social media monitoring. In order to process l...
In this paper we present our experiments on the RepLab 2014 Reputation Dimension task. RepLab is a competitive challenge for Reputation Management Systems. RepLab 2014's reputation dimensions task focuses on categorization of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approac...
What can text sentiment analysis technology be used for, and does a more usage-informed view on sentiment analysis pose new requirements on technology development?
We present an incremental, scalable and efficient dimension reduction technique for tensors that is based on sparse random linear coding. Data is stored in a compactified representation with fixed size, which makes memory requirements low and predictable. Component encoding and decoding are performed on-line without computationally expensive re-ana...
This paper describes experiments to use non-terminological information to find attitudinal expressions in written English text. The experiments are based on an analysis of text with respect to not only the vocabulary of content terms present in it (which most other approaches use as a basis for analysis) but also with respect to presence of structu...
The highly variable and dynamic word usage in social media presents serious challenges for both research and those commercial applications that are geared towards blogs or other user-generated non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environmen...
Word space models, in the sense of vector space models built on distributional data taken from texts, are used to model semantic relations between words. We argue that the high dimensionality of typical vector space models leads to unintuitive effects on modeling likeness of meaning and that the local structure of word spaces is where interesting se...
This paper discusses the task of tracking mentions of some topically interesting textual entity from a continuously and dynamically changing flow of text, such as a news feed, the output from an Internet crawler or a similar text source — a task sometimes referred to as buzz monitoring. Standard approaches from the field of information access for i...
Distributional approaches to meaning acquisition utilize distributional properties of linguistic entities as the building blocks of semantics. In doing so, they rely fundamentally on a set of assumptions about the nature of language and meaning referred to as "the distributional hypothesis". The main point of this hypothesis is that there is a corr...
This paper presents work in progress on implementing an embodied question answering system, Dr. Cecilia, in the form of a virtual caregiver, for use in the treatment of eating disorders. The rationale for the system is grounded in one of the few effective treatments for anorexia and bulimia nervosa. The questions and answers databas...
We show that sequence information can be encoded into high-dimensional fixed-width vectors using permutations of coordinates. Computational models of language often represent words with high-dimensional semantic vectors compiled from word-use statistics. A word's semantic vector usually encodes the contexts in which the word appears in a large bod...
This paper reports on an experiment to identify the emotional loading (the "valence") of news headlines. The experiment reported is based on a resource-thrifty approach for valence annotation based on a word-space model and a set of seed words. The model was trained on newsprint, and valence was computed using proximity to one of two manuall...
This paper discusses evaluation methodologies for a particular kind of meaning models known as word-space models, which use distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computat...
This paper introduces a measure of corpus homogeneity that indicates the amount of topical dispersion in a corpus. The measure is based on the density of neighborhoods in semantic word spaces. We evaluate the measure by comparing the results for five different corpora. Our initial results indicate that the proposed density measure can indeed identi...
We present four approaches to the Amharic-French bilingual track at CLEF 2005. All experiments use a dictionary based approach to translate the Amharic queries into French bags-of-words, but while one approach uses word sense discrimination on the translated side of the queries, the other one includes all senses of a translated word in the query...
This year, the SICS team has concentrated on query processing and on the internal topical structure of the query, specifically compound translation. Compound translation is non-trivial due to dependencies between compound elements. This year, we have investigated topical dependencies between query terms: if a query term happens to be non-topical or...
This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach, which uses the Random Indexing vector space methodology, is based on finding correlations between terms based on their distributional characteristics. The approach requires a minimum of preprocessing and ling...
The study presented involves several different contextual aspects and is the latest in a continuing series of exploratory experiments on information access behaviour in a multi-lingual context [1, 2]. This year’s interactive cross-lingual information access experiment was designed to measure three parameters we expected would affect the performance...
This article describes an automatic evaluation procedure for NLP system robustness under the strain of noisy and ill-formed input. The procedure requires no manual work or annotated resources. It is language and annotation scheme independent and produces reliable estimates on the robustness of NLP systems. The only requirement is an estimate on the...
This paper proposes a novel method for automatically acquiring multi-lingual lexica from non-parallel data and reports some initial experiments to prove the viability of the approach. Using established techniques for building mono-lingual vector spaces two independent semantic vector spaces are built from textual data. These vector spaces are rela...
This experiment tests a simple, scalable, and effective approach to building a domain-specific translation lexicon using distributional statistics over parallelized bilingual corpora. A bilingual lexicon is extracted from aligned Swedish-French data, used to translate CLEF topics from Swedish to French, which resulting French queries are then in t...
AthosMail is a multilingual spoken dialogue system for reading of e-mail messages. The key features of the application are adaptivity and the integration of different approaches for spoken interaction. The application has a flexible system structure supporting multiple components for both different and same purposes. The AthosMail system includes com...