Conference Paper

Towards A Welsh Semantic Annotation System


Abstract

Automatic semantic annotation of natural language data is an important task in Natural Language Processing, and a variety of semantic taggers have been developed for this task, particularly for English. However, for many languages, particularly low-resource languages, such tools are yet to be developed. In this paper, we report on the development of an automatic Welsh semantic annotation tool (named CySemTagger) in the CorCenCC Project, which will facilitate semantic-level analysis of Welsh language data on a large scale. Based on Lancaster's USAS semantic tagger framework, this tool tags words in Welsh texts with semantic tags from a semantic classification scheme, and is designed to be compatible with multiple Welsh POS taggers and POS tagsets by mapping different tagsets into a core shared POS tagset that is used internally by CySemTagger. Our initial evaluation shows that the tagger can cover up to 91.78% of words in Welsh text. This tagger is under continuous development, and will provide a critical tool for Welsh language corpus and information processing at the semantic level.
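The abstract describes two mechanisms: mapping the output of different Welsh POS taggers onto a shared core tagset, and assigning semantic field tags to words via a POS-sensitive lexicon. The sketch below illustrates those two steps; the tag names, mappings and lexicon entries are invented for illustration and are not CySemTagger's actual resources (only the USAS-style use of Z99 for unmatched items follows a real convention).

```python
# Minimal sketch (not the CySemTagger implementation) of mapping
# tagger-specific POS tags to a shared core tagset, then looking words up
# in a POS-sensitive semantic lexicon. All tag names, mappings and lexicon
# entries are illustrative.

# Hypothetical mappings from two different Welsh POS tagsets to a core tagset.
CORE_TAG_MAP = {
    "cytag": {"E": "NOUN", "B": "VERB", "Ans": "ADJ"},
    "wnlt": {"NN": "NOUN", "VB": "VERB", "JJ": "ADJ"},
}

# Hypothetical semantic lexicon keyed on (lemma, core POS).
SEM_LEXICON = {
    ("amser", "NOUN"): ["T1"],   # time
    ("bwyd", "NOUN"): ["F1"],    # food/farming
    ("hapus", "ADJ"): ["E4.1"],  # emotion
}

def tag_token(lemma: str, pos: str, tagset: str) -> list[str]:
    """Map a tagger-specific POS tag to the core tagset, then look the
    (lemma, core POS) pair up in the semantic lexicon."""
    core_pos = CORE_TAG_MAP.get(tagset, {}).get(pos)
    if core_pos is None:
        return ["Z99"]  # unmatched: unknown POS tag
    return SEM_LEXICON.get((lemma, core_pos), ["Z99"])

# The same word receives the same semantic tag regardless of which POS
# tagger (and hence which tagset) produced its POS tag.
print(tag_token("amser", "E", "cytag"))  # ['T1']
print(tag_token("amser", "NN", "wnlt"))  # ['T1']
```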
... The WNLT Welsh part-of-speech tagger was employed in work to develop an automatic Welsh language semantic annotation system, CySemTagger (Piao et al., 2018), as part of the wide-ranging CorCenCC project to create a national corpus of contemporary Welsh. CySemTagger builds on part-of-speech tagging in order to produce broader annotations, such as General/Abstract, Food/Farming, Emotion, Time. ...
... Subsequently the CyTag part-of-speech tagger was developed as the Welsh part-of-speech tagger for the CorCenCC project (Neale et al., 2018). In an evaluation of text coverage (the percentage of words in the test corpus identified by the two taggers), CyTag achieved higher coverage than the WNLT part-of-speech tagger (92% vs 73% respectively), mainly due to its lemmatisation performance, and was adopted by CySemTagger (Piao et al., 2018). CyTag employs a two-stage rule-based constraint grammar method, developed previously for the multilingual Bangor Autoglosser tagger (Donnelly and Deuchar 2011), which is likely to be a significant element in CyTag's higher text coverage. ...
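The text coverage measure used in this comparison is the percentage of words in a test corpus for which a tagger returns something other than an "unmatched" tag. A minimal sketch of that calculation, assuming a hypothetical tagger callable and the USAS convention of Z99 for unmatched tokens:

```python
# Minimal sketch of the text coverage measure: the percentage of tokens in a
# test corpus that receive something other than the "unmatched" tag (Z99 in
# the USAS convention). The tagger here is a hypothetical callable.

def text_coverage(tokens, tagger) -> float:
    """Return the percentage of tokens assigned a real (non-Z99) tag."""
    if not tokens:
        return 0.0
    matched = sum(1 for tok in tokens if tagger(tok) != "Z99")
    return 100.0 * matched / len(tokens)

# Toy tagger that only knows two words.
toy_lexicon = {"amser": "T1", "bwyd": "F1"}
print(text_coverage(["Amser", "bwyd", "xyzzy"],
                    lambda tok: toy_lexicon.get(tok.lower(), "Z99")))  # ~66.7
```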
Article
Full-text available
Language technology is becoming increasingly important across a variety of application domains which have become commonplace in large, well-resourced languages. However, there is a danger that small, under-resourced languages are being increasingly pushed to the technological margins. Under-resourced languages face significant challenges in delivering the underlying language resources necessary to support such applications. This paper describes the development of a natural language processing toolkit for an under-resourced language, Cymraeg (Welsh). Rather than creating the Welsh Natural Language Toolkit (WNLT) from scratch, the approach involved adapting and enhancing the language processing functionality provided for other languages within an existing framework and making use of external language resources where available. This paper begins by introducing the GATE NLP framework, which was used as the development platform for the WNLT. It then describes each of the core modules of the WNLT in turn, detailing the extensions and adaptations required for Welsh language processing. An evaluation of the WNLT is then reported. Following this, two demonstration applications are presented. The first is a simple text mining application that analyses wedding announcements. The second describes the development of a Twitter NLP application, which extends the core WNLT pipeline. As a relatively small-scale project, the WNLT makes use of existing external language resources where possible, rather than creating new resources. This approach of adaptation and reuse can provide a practical and achievable route to developing language resources for under-resourced languages.
... It organises data into multiple facets, which can be used to study sublanguages as defined by [4]. All data are also annotated with different types of linguistic information, including morphological units, tokens, part-of-speech (POS) [5] and semantic categories [6,7]. In addition to linguistic research, the corpus can support a range of other applications, such as the learning and teaching of Welsh, as well as NLP. ...
... Members of the CorCenCC team also developed downstream NLP methods for multiword term recognition [12] and semantic tagging [6,7]. These methods were originally developed for English and successfully adapted for Welsh [13][14][15]. ...
Article
Full-text available
Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of this language. These methods exploit the principles of distributional semantics and, therefore, require a large corpus to be trained on. However, Welsh is a minoritised language, hence significantly less Welsh language data are publicly available in comparison to English. Consequently, assembling a sufficiently large text corpus is not a straightforward endeavour. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh. The relative complexity of Welsh punctuation made the tokenisation of this corpus challenging, as punctuation could not be used for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and, therefore, can more effectively relate different surface forms during the training phase. We conducted both qualitative and quantitative evaluation of the resulting word embeddings, which outperformed previously described Welsh word embeddings produced as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
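The subword-based approach referred to above can be illustrated with gensim's fastText implementation, which learns character n-gram vectors alongside word vectors and can therefore relate inflected or mutated surface forms. This is a minimal sketch under assumed settings, not the authors' pipeline; the corpus file name, tokenisation and hyperparameters are placeholders.

```python
# Minimal sketch (not the authors' pipeline) of learning subword-aware Welsh
# embeddings with gensim's fastText implementation. The corpus file name,
# tokenisation and hyperparameters are illustrative assumptions.
from gensim.models import FastText

# Assume one pre-tokenised sentence per line in a plain-text corpus file.
with open("welsh_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = FastText(
    sentences=sentences,
    vector_size=300,    # embedding dimensionality
    window=5,
    min_count=5,
    sg=1,               # skip-gram
    min_n=3, max_n=6,   # character n-gram range used for subword vectors
)

# Because vectors are built from character n-grams, inflected or mutated
# surface forms that were rare in training still receive sensible vectors.
print(model.wv.most_similar("iaith", topn=5))  # "iaith" = "language"
```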
... Although state-of-the-art POS taggers for the English language use deep learning, the authors argue there is insufficient Welsh language data to use such an approach for the Welsh language. The same authors later developed a rule-based semantic tagger, entitled CySemTagger [15]. Both of these tools are available under a free software (GPL version 3) licence (https://github.com/CorCenCC, ...
Article
Full-text available
Cross-lingual embeddings are vector space representations where word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, where a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, including word2vec and fastText. Three cross-language alignment strategies were explored, including cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
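The supervised linear alignment step described above can be sketched as an orthogonal Procrustes problem: given source- and target-language embeddings for the seed dictionary pairs, find the orthogonal matrix that maps one onto the other. The example below is a toy illustration with random matrices, not the paper's setup; the CSLS retrieval criterion is omitted for brevity.

```python
# Minimal sketch of the supervised linear alignment step: learn an orthogonal
# map W (the Procrustes solution) so that source-language embeddings of the
# seed dictionary pairs land on their target-language counterparts. The
# matrices below are random toys, not real Welsh/English embeddings.
import numpy as np

def procrustes_alignment(X_src: np.ndarray, Y_tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimising ||X_src @ W - Y_tgt||_F, where row i
    of X_src and Y_tgt are embeddings of the i-th dictionary pair."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Toy example: 500 dictionary pairs of 300-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))                        # source-language vectors
true_W = np.linalg.qr(rng.normal(size=(300, 300)))[0]  # a random orthogonal map
Y = X @ true_W                                         # target-language vectors

W = procrustes_alignment(X, Y)
print(np.allclose(W, true_W, atol=1e-8))  # True: the mapping is recovered
```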
Chapter
Moving on to the final key stages of building a corpus, this chapter provides a brief exploration of approaches to processing and (re)presenting language data for future analysis. The chapter first details the importance of the transcription phase in spoken corpus development and documents how bespoke transcription conventions can be developed for, and employed in, a given target language. The tagging and collation of corpus data (via corpus managers) are also briefly explored, before some exemplar corpus querying tools and CorCenCC’s novel pedagogic toolkit, Y Tiwtiadur, are presented.
Conference Paper
Annotated corpora allow researchers to carry out studies of linguistic phenomena such as grammatical categories, or even of knowledge, by means of semantic annotation. In corpus linguistics, semantic annotation is often added to every lexical unit of the corpus using a word-labelling tool. Resources such as lexicons or ontologies are very useful for this purpose. However, the current sizes of these resources for languages other than English are not always adequate. This paper shows that an updated version of an existing NLP baseline can be extended to different domains and, as a consequence, is able to improve Spanish token accuracy by up to 30% in comparison with using only a small lexicon.
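The evaluation idea here, improving token-level accuracy by extending a small general lexicon with domain-specific entries, can be sketched as follows. The lexicons, tags and test tokens are invented for illustration and are far smaller than the resources the paper describes.

```python
# Minimal sketch of the evaluation idea: token-level tagging accuracy of a
# lexicon-based annotator before and after extending a small general lexicon
# with domain-specific entries. Lexicons, tags and tokens are invented.

def tag_tokens(tokens, lexicon, default="UNKNOWN"):
    return [lexicon.get(tok.lower(), default) for tok in tokens]

def token_accuracy(predicted, gold) -> float:
    return 100.0 * sum(p == g for p, g in zip(predicted, gold)) / len(gold)

small_lexicon = {"casa": "BUILDING", "comer": "FOOD"}
domain_lexicon = {"hipoteca": "MONEY", "interés": "MONEY"}  # e.g. a finance domain
extended_lexicon = {**small_lexicon, **domain_lexicon}

tokens = ["casa", "hipoteca", "interés", "comer"]
gold = ["BUILDING", "MONEY", "MONEY", "FOOD"]

print(token_accuracy(tag_tokens(tokens, small_lexicon), gold))     # 50.0
print(token_accuracy(tag_tokens(tokens, extended_lexicon), gold))  # 100.0
```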
Chapter
As noted in Chap. 1.1, CorCenCC’s vision rested on two guiding principles: first, the language data captured within the corpus should be as representative as possible of the ways in which Welsh is currently used and encountered. Second, the corpus infrastructure should address the needs of the multiple community user groups operating in and/or engaging with the Welsh language. In this chapter we evaluate the extent to which these principles are represented in the project outcomes. First, we reflect on CorCenCC’s vision of representativeness, by considering the size and scope of the corpus, and the provenance of data in terms of language topic, context, genre and mode on the one hand, and contributor profiles on the other. We then document some of the specific challenges, including unanticipated ones, encountered during the project, and the steps taken to mitigate their impact. Next, we consider the extent to which the project has identified, and addressed, the specific needs of user group communities. This necessitates a reflection on our engagement and communication with potential corpus user groups throughout the project. Finally, we summarise the technical tools and resources developed to support the realisation of the CorCenCC vision. Throughout the chapter we report on the dynamic decision-making regarding method and approach that shaped the project as it progressed.
Chapter
As noted in Chapter 2.1, CorCenCC's vision rested on two guiding principles: first, the language data captured in the corpus should represent, as well as possible, the ways in which Welsh is currently used and encountered. Second, the corpus infrastructure should address the needs of the various community user groups operating in Welsh or engaging with the Welsh language. In this chapter we evaluate the extent to which these principles are represented in the project outcomes. First, we reflect on CorCenCC's vision of representativeness, by considering the size and scope of the corpus, and the provenance of the data in terms of topic, context, genre and language mode on the one hand, and contributor profiles on the other. We then document some of the specific challenges encountered during the project, including some that were not anticipated, and the steps taken to mitigate their impact. Next, we consider the extent to which the project has identified, and addressed, the specific needs of user group communities. This requires a reflection on how we engaged and communicated with potential corpus user groups throughout the project. Finally, we summarise the technical tools and resources developed to support the realisation of the CorCenCC vision. Throughout the chapter we report on the dynamic decision-making regarding method and approach that shaped the project as it progressed.
Preprint
This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that arose during the course of the project, outlining the ways in which they were answered, the impact of these decisions on the resource that has been produced and the longer-term contribution they will make to practices in corpus-building. Finally, we discuss some of the applications and the utility of the work, outlining the impact that CorCenCC is set to have on a range of different individuals and user groups.
Conference Paper
Full-text available
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
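Two of the ingredients mentioned above, ranked candidate tags in the lexicon (general likelihood ranking) and precision measured against a manually tagged corpus, can be sketched as follows. The lexicon entries and the simple "take the top-ranked tag" fallback are illustrative; USAS itself combines several disambiguation methods rather than this single heuristic.

```python
# Minimal sketch of two ideas above: lexicon entries list candidate semantic
# tags in rank order (general likelihood), and precision is measured against
# manually assigned gold tags. Entries and the fallback rule are illustrative.

# Candidate tags per (word, POS), ordered from most to least likely overall.
RANKED_LEXICON = {
    ("bank", "NOUN"): ["I1.1", "W3"],    # money sense ranked above landscape
    ("spring", "NOUN"): ["T1.3", "W3"],  # time period ranked above water source
}

def most_likely_tag(word, pos):
    candidates = RANKED_LEXICON.get((word, pos))
    return candidates[0] if candidates else "Z99"  # Z99 = unmatched

def precision(predicted, gold):
    """Proportion of tagged (non-Z99) tokens whose tag matches the gold tag."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if p != "Z99"]
    return 100.0 * sum(p == g for p, g in pairs) / len(pairs)

gold = ["I1.1", "W3"]  # "bank" as money, "spring" as a geographical feature
pred = [most_likely_tag("bank", "NOUN"), most_likely_tag("spring", "NOUN")]
print(pred, precision(pred, gold))  # ['I1.1', 'T1.3'] 50.0
```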
Conference Paper
Full-text available
This paper reports on the current status and evaluation of a Finnish semantic tagger (hereafter FST), which was developed in the EU-funded Benedict Project. In this project, we have ported the Lancaster English semantic tagger (USAS) to the Finnish language. We have re-used the existing software architecture of USAS, and applied the same semantic field taxonomy developed for English to Finnish. The Finnish lexical resources have been compiled using various corpus-based techniques, and the resulting lexicons have then been manually tagged and used for the FST prototype. At present, the lexicons contain 33,627 single lexical items and 8,912 multi-word expression templates. In the evaluation, we used two sets of test data. The first test set is from the domain of Finnish cooking, which is both sufficiently compact and sufficiently versatile. The second is from Helsingin Sanomat, the biggest Finnish daily newspaper. As a result, the FST produced a lexical coverage of 94.1% and a precision of 83.03% on the cooking test data, and a lexical coverage of 90.7% on the newspaper data. While there is much room for improvement, this is an encouraging result for a prototype tool. The FST will be continually improved by expanding the semantic lexical resources and improving the disambiguation algorithms.
Conference Paper
Full-text available
The KIM platform provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature infrastructure for scalable and customizable information extraction (IE) as well as annotation and document management, based on GATE. In order to provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. The ontologies and knowledge bases involved are handled using cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. This paper presents the KIM platform, with emphasis on its architecture, interfaces, tools, and other technical issues.
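The general pattern of ontology-backed semantic annotation described above, linking a mention in a document to an entity in a knowledge base expressed as RDF triples, can be sketched with rdflib. This is not the KIM platform's own API; the namespace URIs, class and property names below are illustrative assumptions.

```python
# Minimal sketch (not the KIM platform's API) of ontology-backed semantic
# annotation as RDF triples: a mention in a document is linked to an entity
# in a knowledge base. Namespace URIs, class and property names are invented.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

KB = Namespace("http://example.org/kb/")    # hypothetical knowledge base
ANN = Namespace("http://example.org/ann/")  # hypothetical annotation schema

g = Graph()

# An entity of general importance, typed against an upper-level ontology class.
entity = URIRef(KB["Cardiff"])
g.add((entity, RDF.type, KB.City))
g.add((entity, KB.label, Literal("Cardiff")))

# An annotation linking a character span in a document to that entity.
mention = URIRef(ANN["doc42#chars_120_127"])
g.add((mention, ANN.refersTo, entity))
g.add((mention, ANN.inDocument, URIRef(ANN["doc42"])))

print(g.serialize(format="turtle"))
```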
Article
Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only to modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically valid, highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories. In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
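The time-sensitive aspect described above can be illustrated with a simple filter: each candidate sense carries attestation dates, and senses not in use at the text's publication date are discarded before disambiguation. The sense labels and dates below are invented for illustration, not taken from the Historical Thesaurus.

```python
# Minimal sketch of time-sensitive sense filtering: each candidate sense
# carries attestation dates, and senses not in use at the text's publication
# date are discarded before disambiguation. Labels and dates are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sense:
    label: str
    first_year: int
    last_year: Optional[int] = None  # None means still in current use

# Hypothetical candidate senses for one word, in general likelihood order.
CANDIDATES = {
    "car": [
        Sense("motor vehicle", 1896),
        Sense("wheeled carriage or cart", 1300, 1900),
    ],
}

def senses_valid_at(word: str, year: int) -> list:
    """Keep only senses attested at the text's publication date."""
    return [
        s for s in CANDIDATES.get(word, [])
        if s.first_year <= year and (s.last_year is None or year <= s.last_year)
    ]

print([s.label for s in senses_valid_at("car", 1750)])  # carriage sense only
print([s.label for s in senses_valid_at("car", 2000)])  # motor vehicle only
```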