Lars Borin

University of Gothenburg | GU · Språkbanken

Publications: 111
Reads: 10,989
Citations: 1,004

Publications (111)
Conference Paper
Full-text available
Swedish computational lexicography has a long history at the University of Gothenburg, both in its primary role as a central aspect of the scientific study of vocabulary and also as an infrastructural component for conducting research based on language data. Starting in the 1960s, the Språkdata research group pioneered corpus-supported lexicography...
Article
Argumentation has long been studied in a number of disciplines, including several branches of linguistics. In recent years, computational processing of argumentation has been added to the list, reflecting a general interest from the field of natural language processing (NLP) in building natural language understanding systems for increasingly intric...
Article
Full-text available
In this article, hesitancy towards COVID-19 vaccinations is investigated as a phenomenon touching upon existential questions. We argue that it encompasses ideas of illness and health, and also of dying and fear of suffering. Building on a specific strand within anti-vaccination studies, we conjecture that vaccine hesitancy is, to some extent, reaso...
Chapter
Full-text available
Swedish speech and language technology (LT) research goes back over 70 years. This has paid off: there is a national research infrastructure, as well as significant research projects, and Swedish is well-endowed with language resources (LRs) and tools. However, there are gaps that need to be filled, especially high-quality gold-standard LRs required...
Chapter
Frame semantics is a theory of meaning in natural language, which defines the structure of the lexical semantic resources known as framenets. Both framenets and frame semantics have proved useful for a number of natural language processing (NLP) tasks. However, in this connection framenets have often been criticized for their limited coverage. A pr...
Chapter
Full-text available
In pursuing the historical emergence of the discourse on terrorism, this study trawls the “digital Gulf of Bothnia” in the form of a corpus of combined Swedish and Finnish digitized newspaper texts. Through a cross-lingual exploration of the uses of the concept of terrorism in historical Swedish and Finnish news, we examine meanings anchored in the...
Chapter
Full-text available
In this chapter we focus on a shared innovation of Kanashi and Kinnauri, the addition of characteristic suffixes - or adaptive markers - to Indo-Aryan loan nouns and adjectives. The distribution of this adaptive mechanism in related and unrelated languages in the western Himalayas is investigated, and we also discuss possible sources for the adapti...
Article
Full-text available
Kanashi is an indigenous language of India spoken by some 2,000 individuals in one single village in the Indian Himalayas. It is a Sino-Tibetan language, separated from the other Sino-Tibetan speaking communities in the region by a girdle of Indo-Aryan speaking villages. In the present volume we contribute to the documentation of Kanashi with a pho...
Chapter
Full-text available
Kanashi exhibits a great deal of variation on several linguistic levels, which raises questions of a theoretical and methodological nature relevant to the formulation of useful and faithful linguistic descriptions of Kanashi. In this chapter, we address such questions in connection with working out a description of the phonology of Kanashi as part...
Chapter
Full-text available
Despite a long history of physical and social isolation from its surrounding communities, Kanashi exhibits several layers of borrowing from genealogically unrelated Indo-Aryan languages, which contribute substantially to the phenomenon described and discussed in this chapter, Kanashi’s surprisingly rich array of mechanisms for forming numerals. The...
Chapter
Full-text available
In this chapter, the findings from the loanword adaptation studies presented in previous chapters are combined with data on other linguistic features, socio-cultural phenomena, population genetics, and geography, in order to draw some conclusions about the genealogical and areal relationships of Kanashi to other languages of the region, about the i...
Chapter
Full-text available
Kanashi is an indigenous language of India spoken by some 2,000 individuals in one single village in the Indian Himalayas. It is a Sino-Tibetan language, separated from the other Sino-Tibetan speaking communities in the region by a girdle of Indo-Aryan speaking villages. In the present volume we contribute to the documentation of Kanashi with a pho...
Chapter
Full-text available
In this chapter, we extend the investigation of common loanword adaptation patterns noted in Kanashi and Kinnauri to the verbal domain, where both languages use dedicated transitivity-signalling morphology exclusively on Indo-Aryan loan verbs. In the same way as with the nominal adaptive markers, we investigate the distribution of this adaptive me...
Chapter
Full-text available
This chapter presents a grammar sketch of Kanashi, covering its main features and contrasting it with its closely related and better described sister language Kinnauri.
Chapter
Large computational lexicons are central NLP resources. Swedish FrameNet++ aims to be a versatile full-scale lexical resource for NLP containing many kinds of linguistic information. Although focused on Swedish, this ongoing effort, which includes building a new Swedish framenet and recycling existing lexicons, has offered valuable insights into ge...
Article
Full-text available
We present initial exploratory work on illuminating the long-standing question of areal versus genealogical connections in South Asia using computational data visualization tools. With respect to genealogy, we focus on the subclassification of Indo-Aryan, the most ubiquitous language family of South Asia. The intent here is methodological: we explo...
Article
Full-text available
Aspect-Based Sentiment Analysis constitutes a more fine-grained alternative to traditional sentiment analysis at sentence level. In addition to a sentiment value denoting how positive or negative a particular opinion or sentiment expression is, it identifies additional aspects or ‘slots’ that characterize the opinion. Some typical aspects are targe...
Conference Paper
Full-text available
We present NordiCon, a database containing medieval Nordic personal names attested in Continental sources. The database combines formally interpreted and richly interlinked onomastic data with digitized versions of the medieval manuscripts from which the data originate and information on the tokens' context. The structure of NordiCon is inspired by...
Article
Full-text available
We use a gold standard under construction for sentiment analysis in Swedish to explore how attitudes towards immigration change across time and media. We track the evolution of attitude starting from the year 2000 for three different Swedish media: the national newspapers Aftonbladet and Svenska Dagbladet, representing different halves of the left–...
Article
Full-text available
We process and visualize Swedish parliamentary data using methods from statistics and machine learning, which allows us to obtain insight into the political processes behind the data. We produce plots that let us infer the relative stance of political parties and their members on different topics. In addition, we can infer the degree of homogeneity...
Preprint
Our languages are in constant flux driven by external factors such as cultural, societal and technological changes, as well as by only partially understood internal motivations. Words acquire new meanings and lose old senses, new words are coined or borrowed from other languages and obsolete words slide into obscurity. Understanding the characteris...
Chapter
In constructionist theory, a constructicon is an inventory of constructions making up the full set of linguistic units in a language. In applied practice, it is a set of construction descriptions – a “dictionary of constructions”. The development of constructicons in the latter sense typically means combining principles of both construction grammar...
Chapter
In the field of language technology, researchers are starting to pay more attention to various interactional aspects of language – a development prompted by a confluence of factors, and one which applies equally to the processing of written and spoken language. Notably, the so-called ‘phatic’ aspects of linguistic communication are coming into focu...
Article
Full-text available
We introduce an expanded version of the Swedish research resource Språkbanken (the Swedish Language Bank). In 2018, Språkbanken, which has supported national and international research for over four decades, adds two branches, one focusing on speech and one on societal aspects of language, to its existing organization, which targets text.
Article
Full-text available
There is an increasing demand for multilingual sentiment analysis, and most work on sentiment lexicons is still carried out based on English lexicons like WordNet. In addition, many of the non-English sentiment lexicons that do exist have been compiled by (machine) translation from English resources, thereby arguably obscuring possible language-spe...
Chapter
We present our work aiming at turning the linguistic material available in Grierson’s classical Linguistic Survey of India (LSI) from a printed discursive textual description into a formally structured digital language resource, a database suitable for a broad array of linguistic investigations of the languages of South Asia. While doing so, we dev...
Article
Full-text available
We present a framework and its implementation relying on Natural Language Processing methods, which aims at the identification of exercise item candidates from corpora. The hybrid system combining heuristics and machine learning methods includes a number of relevant selection criteria. We focus on two fundamental aspects: linguistic complexity and...
Conference Paper
The present paper describes experiments on automatically extracting typological linguistic features of natural languages from traditional written descriptive grammars. The feature-extraction task has high potential value in typological, genealogical, historical, and other related areas of linguistics that make use of databases of structural feature...
Conference Paper
Full-text available
Despite many years of research on Swedish language technology, there is still no well-documented standard for Swedish word processing covering the whole spectrum from low-level tokenization to morphological analysis and disambiguation. SWORD is a new initiative within the SWE-CLARIN consortium aiming to develop documented standards for Swedish word...
Article
Full-text available
The concept of culturomics was born out of the availability of massive amounts of textual data and the interest to make sense of cultural and language phenomena over time. Thus far however, culturomics has only made use of, and shown the great potential of, statistical methods. In this paper, we present a vision for a knowledge-based culturomics th...
Article
As we were preparing this special issue, Charles J. "Chuck" Fillmore (1929-2014), the intellectual father of all framenets and all constructicons, a brilliant linguist and the most gentle and amiable of individuals, passed away at the much too young age of 84 after a long period of illness. Although physically already quite weakened by this illness...
Article
Full-text available
We present an experiment where natural language processing tools are used to automatically identify potential constructions in a corpus. The experiment was conducted as part of the ongoing efforts to develop a Swedish constructicon. Using an automatic method to suggest constructions has advantages not only for efficiency but also methodologically:...
Article
This article describes the development of a geographical information system (GIS) at Språkbanken as part of a visualization solution to be used in an archive of historical Swedish literary texts. The research problems we are aiming to address concern orthographic and morphological variation, missing place names, and missing place name coordinates....
Conference Paper
The present paper describes the process of identifying lexical bundles, i.e., frequently recurring word sequences such as "by means of" and "in the end of", in secondary school history and physics textbooks. In its determination of finding genuine lexical bundles, i.e. the word boundaries between lexical bundles and surrounding arbitrary words, it prop...
Conference Paper
Full-text available
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress a...
Conference Paper
Full-text available
We present Lärka, the language learning platform of Språkbanken (the Swedish Language Bank). It consists of an exercise generator which reuses resources available through Språkbanken: mainly Korp, the corpus infrastructure, and Karp, the lexical infrastructure. Through Lärka we reach new user groups – students and teachers of Linguistics as well as...
Article
In this article, we investigate the properties of phoneme N-grams across half of the world's languages. We investigate if the sizes of three different N-gram distributions of the world's language families obey a power law. Further, the N-gram distributions of language families parallel the sizes of the families, which seem to obey a power law distr...
Article
In this paper, we apply an information theoretic measure, self-entropy of phoneme n-gram distributions, for quantifying the amount of phonological variation in words for the same concepts across languages, thereby investigating the stability of concepts in a standardized concept list – based on the 100-item Swadesh list – specifically designed for...
Article
The paper presents an ongoing project which aims to publish Swedish lexical-semantic resources using Semantic Web and Linked Data technologies. In this article, we highlight the practical conversion methods and challenges of converting three of the Swedish language resources in RDF with lemon.
Article
Swesaurus is a freely available (under a CC-BY license) Swedish wordnet under construction, built primarily by scavenging and recycling information from a number of existing lexical resources. Among its more unusual characteristics are graded lexical-semantic relations and inclusion of all parts of speech, not only open-class items.
Article
The English-language Princeton WordNet (PWN) and some wordnets for other languages have been extensively used as lexical–semantic knowledge sources in language technology applications, due to their free availability and their size. The ubiquitousness of PWN-type wordnets tends to overshadow the fact that they represent one out of many possible choi...
Conference Paper
The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to rele...
Article
Full-text available
This paper reports on the ongoing international project System architecture for ICALL and the progress made by the Swedish partner. The Swedish team is developing a web-based exercise generator reusing available annotated corpora and lexical resources. Apart from the technical issues like implementation of the user interface and the underlying proc...
Conference Paper
Full-text available
In this paper, we present an on-going project whose overall aim is to develop open-source system architecture for supporting ICALL systems that will facilitate re-use of existing NLP tools and resources on a plug-and-play basis. We introduce the project, describe the approaches adopted by the two language teams, and present two applications being d...
Conference Paper
Full-text available
It is a surprising fact that, despite the existence of various mature Natural Language Processing (NLP) tools and resources that can potentially benefit language learning, very few projects are devoted to development of Intelligent Computer-Assisted Language Learning (ICALL) applications. This paper presents an on-going collaborative project whose...
Conference Paper
Full-text available
This paper reports on the ongoing international project System architecture for ICALL and the progress made by the Swedish partner. The Swedish team is developing a web-based exercise generator reusing available annotated corpora and lexical resources. Apart from the technical issues like implementation of the user interface and the underlying proc...
Article
This paper is a theoretical and empirical investigation into the use of the notion core vocabulary in some areas of linguistics and related disciplines, originally prompted by the concrete task of compiling core vocabularies in two research projects growing out of two quite different research traditions: (1) lexicostatistics, where core vocabularie...
Conference Paper
Today museums and other cultural heritage institutions are increasingly storing object descriptions using semantic web domain ontologies. To make this content accessible in a multilingual world, it will need to be conveyed in many languages, a language generation task which is domain specific and language dependent. This paper describes how semanti...
Article
Full-text available
We present our ongoing work on Karp, Språkbanken's (the Swedish Language Bank) open lexical infrastructure, which has two main functions: (1) to support the work on creating, curating, and integrating our various lexical resources; and (2) to publish daily versions of the resources, making them searchable and downloadable. An important requirement...
Article
This project report describes a multilingual wordnet initiative embarked in the META-NORD project and concerned with the validation and pilot linking between Nordic and Baltic wordnets. The builders of these wordnets have applied very different compilation strategies: The Danish, Icelandic and Swedish wordnets are being developed via monolingual di...
Article
Full-text available
We show how the lexicographic task of finding informative and diverse example sentences can be cast as a search result diversification problem, where an objective based on relevance and diversity is maximized. This problem has been studied intensively in the information retrieval community during recent years, and efficient algorithms have been...
Article
Full-text available
This article surveys work on Unsupervised Learning of Morphology. We define Unsupervised Learning of Morphology as the problem of inducing a description (of some kind, even if only morpheme-segmentation) of how orthographic words are built up given only raw text data of a language. We briefly go through the history and motivation of this proble...
Article
Full-text available
We present a computational morphological description of Old Swedish implemented in Functional Morphology. The objective of the work is concrete - connecting word forms in real text to entries in electronic dictionaries, for use in an online reading aid for students learning Old Swedish. The challenge we face is to find an appropriate model of Old...
Chapter
The goal of the work presented in this chapter is to create a set of computational lexical resources, interlinked on the lexical sense level using the persistent sense identifiers designed for the Present-Day Swedish lexical resource SALDO. In this way, all the diverse linguistic information available in our individual lexical resources – modern an...
Conference Paper
Full-text available
This paper introduces the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications....
Conference Paper
Full-text available
We present our ongoing work on language technology-based e-science in the humanities, social sciences and education, with a focus on text-based research in the historical sciences. An important aspect of language technology is the research infrastructure known by the acronym BLARK (Basic LAnguage Resource Kit). A BLARK as normally presented in the...
Conference Paper
Full-text available
Currently, research infrastructures are being designed and established in many disciplines, all partly to address the problem that they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration an...
Chapter
In this chapter, the authors describe the development and application of language technology for intelligent information access to the content of digitized cultural heritage collections in the form of Swedish classical literary works. This technology offers sophisticated and flexible support functions to literary scholars and researchers. The autho...
Article
Full-text available
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), 1-5. © 2009 The editors and contributors. Published by Northern European As...
Article
Full-text available
For the Snark’s a peculiar creature, that won’t / Be caught in a commonplace way. / Do all that you know, and try all that you don’t: / Not a chance must be wasted to-day! (Lewis Carroll: The Hunting of the Snark)
Article
Full-text available
In this paper we present a pilot study on the development of a FrameNet-like annotation of a sample of Swedish medical corpora, for a selected set of verbal predicates. We explore and exploit a number of linguistic tools for the provision of much of the necessary annotations required by such a semantic scheme. Particular attention is paid to the sy...
Article
We describe the background and motivation for an e-learning project, IT-based Collaborative Learning in Grammar, where NLP resource reuse has become an important issue. The resources are of several kinds: POS-tagged and syntactically annotated corpora (treebanks), parsing systems and grammar writer's workbenches, and visualization and manipulation...
Article
This study focuses on how to automatically locate text sources published on the World Wide Web in order to produce adequate and up-to-date learning materials for second language learners of Nordic languages. The Web is an excellent source of authentic text materials. However, the large amount of information available on the Web makes search service...
Article
Parallel and comparable corpora are playing an increasingly important role in linguistics and computational linguistics. This introduction aims at providing an overview of the state of the art of parallel and comparable corpus research, paying particular attention to the situation in Scandinavia. The existence of two distinct and partly separate re...
Article
It is sometimes said that part of speech (POS) tags are likely to be the same for translation equivalent words. If this is correct, we could formulate the following hypothesis: It should be possible to use POS tagging for one language in combination with a word alignment system, in order to obtain a (partial) POS tagging for another language. This...
Article
In this paper, we first show some examples of the browser, which we have named ETAP-WebTEq (ETAP project Web-based browser of Translation Equivalents), at work on the IVT2 ETAP subcorpus. This is a parallel corpus of newswire text in 6 languages, or, rather, 5 language pairs: Swedish (the SL) - Finnish, Polish, Serbian-Bosnian-Croatian, Spanish, and...
Article
this paper is an ongoing effort to exploit combinations of existing natural language processing (NLP) resources in order to reach part-of-speech (POS) tagging performance in excess of that which any single resource is able to provide
Article
In this paper, we describe work in progress in the Swedish Learning Lab and at Uppsala University, Sweden, on the development of Didax, the Digital Interactive Diagnostic Administering and Correction System. In the remainder of this section, we describe the background and general motivation for the development of Didax. Section 2 elaborates upon the di...
Article
We present the LingoNet project for creating a 'web-based language lab', a website where resources for web-based language training will be collected and made available for use in foreign language education at the university level. One of the most pressing needs in this connection is to develop guidelines, procedures, and tools for the (summative an...
Article
While language-independent sentence alignment programs typically achieve a recall in the 90 percent range, the same cannot be said about word alignment systems, where normal recall figures tend to fall somewhere between 20 and 40 percent, in the language-independent case. As words (and phrases) for various reasons are more interesting to align than...
Article
Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part of speech (POS) tag...
Article
Finnish Romani is a language with a fairly recent written tradition; for all practical purposes it is a 20th century phenomenon. An official orthography was created in 1971, and it is mostly from the 1970's onwards that we see texts of the kind which we normally associate with a written language variety. The text corpus described here is being comp...
Article
Word alignment of parallel texts is typically carried out using many kinds of knowledge, or information sources, in concert, i.e., it is profitably viewed as a kind of cooperative process, where e.g. distribution, string similarity, cooccurrence statistics, and other information sources are used together. We investigate a novel such information sou...
Conference Paper
In the past, so-called translationese has been investigated mainly as a lexical phenomenon, despite suggestions that it also must have a syntactic dimension. In this article, we explore the use of part-of-speech (POS) tagged parallel and comparable corpora as one means of investigating translation effects in the syntactic domain. We suggest a meth...
Conference Paper
While language-independent [...], which is the use of one or more additional languages to improve bilingual word alignment. The conclusion is that in a multilingual parallel corpus, pivot alignment is a safe way to increase word alignment recall without lowering the precision.
Article
Full-text available
This position paper presents the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and application...
