
Paul Rayson
- PhD
- Professor (Full) at Lancaster University
About
230 Publications
83,752 Reads
7,541 Citations
Publications (230)
Mental health services in the UK are increasingly pressured with long waiting times. Meanwhile, online forums for mental health, set up by charities, NHS services, and individual volunteers, have increased in popularity. Little is known, however, about who is using them and why. This study aimed to investigate this, using multiple methods to captur...
Background
Bipolar is a severe mental health condition affecting at least 2% of the global population, with clinical observations suggesting that individuals experiencing elevated mood states, such as mania or hypomania, may have an increased propensity for engaging in risk-taking behaviors, including hypersexuality. Hypersexuality has historically...
The paper presents the results of the ParlaMint II project, which comprise comparable corpora of parliamentary debates of 29 European countries and autonomous regions, covering at least the period from 2015 to 2022, and containing over 1 billion words. The corpora are uniformly encoded, contain rich metadata about their 24 thousand speakers, and ar...
Objective
Clinical observations suggest that individuals with a diagnosis of bipolar face difficulties regulating emotions and impairments to their cognitive processing, which can contribute to high-risk behaviours. However, there are few studies which explore the types of risk-taking behaviour that manifest in reality and evidence suggests that th...
Background
Online forums are widely used for mental health peer support. However, evidence of their safety and effectiveness is mixed. Further research focused on articulating the contexts in which positive and negative impacts emerge from forum use is required to inform innovations in implementation.
Objective
This study aimed to develop a realis...
The use of metaphors to talk about cancer experiences has attracted much research and debate, especially in the case of military metaphors. However, questions remain about what metaphors are used by different populations for different aspects of the cancer experience. For further information: Method A scoping literature review Review questions: 1)...
Background
There has been an increase in the use of online mental health forums to support mental health. These forums are often moderated by trained moderators to ensure a safe, therapeutic environment. While the moderator role is rewarding, it can also be challenging. There is a need to understand the impact of the role on moderators and how they...
Background
Personal recovery is of particular value in bipolar disorder, where symptoms often persist despite treatment. We previously defined the POETIC (Purpose and Meaning, Optimism and Hope, Empowerment, Tensions, Identity, Connectedness) framework for personal recovery in bipolar disorder. So far, personal recovery has only been studied in res...
Background
Mental health (MH) peer online forums offer robust support where internet access is common, but healthcare is not, e.g., in countries with under-resourced MH support, rural areas, and during pandemics. Despite their widespread use, little is known about who posts in such forums, and in what mood states. The discussion platform Reddit is...
Introduction
Peer online mental health forums are commonly used and offer accessible support. Positive and negative impacts have been reported by forum members and moderators, but it is unclear why these impacts occur, for whom and in which forums. This multiple method realist study explores underlying mechanisms to understand how forums work for d...
More than 2,000 companies, divided into 11 sectors, are listed on the UK’s London Stock Exchange and are required to communicate their financial results at least twice in a single financial year. UK annual reports are very lengthy documents, averaging around 80 pages. In this study, we aim to benchmark a variety of summarisation methods o...
In recent years, the problem of Cross-Lingual Text Reuse Detection (CLTRD) has gained the interest of the research community due to the availability of large digital repositories and automatic Machine Translation (MT) systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages but hard to...
Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as natural language processing, corpus linguistics, information retrieval, and data science. An important aspect of such automatic information extraction and analysis is the annotation of language data u...
[This corrects the article DOI: 10.2196/27670.].
The key‐words procedure originating in corpus linguistics is extremely widely used for the comparative analysis of language alongside other standard methods such as frequency profiling, concordancing, n‐grams, and collocation. This entry briefly describes the method for creating a contingency table for each word in the joint frequency lists of two...
This work presents a standard Igbo named entity recognition (IgboNER) dataset as well as the results from training and fine-tuning state-of-the-art transformer IgboNER models. We discuss the process of our dataset creation: data collection, annotation, and quality checking. We also present experimental processes involved in building an IgboBERT la...
Electronic word-of-mouth communication in the form of online reviews influences people’s product or service choices. People use text features to add or emphasise feelings and emotions in their text. The text emphasis can come in as capital letters, letter repetition, exclamation marks and emoticons. The existing literature has not paid sufficient a...
Introduction: Computer-use behaviours can provide useful information about an individual's cognitive and functional abilities. However, little research has evaluated unaided and non-directed home computer-use. In this proof of principle study, we explored whether computer-use behaviours recorded during routine home computer-use i) could discriminat...
This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language—Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, despite it bein...
Current approaches to the expansion of semantic lexicons for corpus annotation are somewhat ad hoc in nature and do not generally offer a systematic means of identifying areas for development within one’s lexicon. The present paper sets forward a domain based approach to semantic lexicon expansion, targeting UCREL’s Semantic Analysis System (USAS)....
There is a lack of concrete knowledge about floristic change in Britain before the mid-20th century. Relevant evidence is available, but it is principally contained in disparate historical sources. In this article, we demonstrate how such sources can be efficiently collated and analysed through the implementation of state-of-the-art computational-l...
Keyness is a commonly used method in corpus linguistics and is assumed to identify key items that are characteristic of one corpus when compared to another. This paper puts this assumption to the test by comparing case study corpora in the fields of genetic, immunological and psychiatric biomedical association studies, using what we refer to as a ‘K-...
We take a step towards addressing the underrepresentation of the African continent in NLP research by creating the first large publicly available high quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand th...
We propose MUMBO, the first high-performing yet computationally efficient acquisition function for multi-task Bayesian optimization. Here, the challenge is to perform efficient optimization by evaluating low-cost functions somehow related to our true target function. This is a broad class of problems including the popular task of multi-fidelity opt...
This paper describes a general-purpose extension of max-value entropy search, a popular approach for Bayesian Optimisation (BO). A novel approximation is proposed for the information gain -- an information-theoretic quantity central to solving a range of BO problems, including noisy, multi-fidelity and batch optimisations across both continuous and...
Background
Twitter is a real time messaging platform widely used by people and organisations to share information on many topics. It could potentially be useful to analyse tweets for infectious disease monitoring purposes in order to reduce reporting lag time, and to provide an independent complementary source of data, compared to traditional...
In March 2020, the World Health Organization announced the COVID-19 outbreak as a pandemic. Most previous social media related research has been on English tweets and COVID-19. In this study, we collect approximately 1 million Arabic tweets from the Twitter streaming API related to COVID-19. Focussing on outcomes that we believe will be useful for...
This report provides an overview of the CorCenCC project and the online corpus resource that was developed as a result of work on the project. The report lays out the theoretical underpinnings of the research, demonstrating how the project has built on and extended this theory. We also raise and discuss some of the key operational questions that ar...
This article develops a Bayesian optimization (BO) method which acts directly over raw strings, proposing the first uses of string kernels and genetic algorithms within BO loops. Recent applications of BO over strings have been hindered by the need to map inputs into a smooth and unconstrained latent space. Learning this projection is computational...
Deployments of Bayesian Optimization (BO) for functions with stochastic evaluations, such as parameter tuning via cross validation and simulation optimization, typically optimize an average of a fixed set of noisy realizations of the objective function. However, disregarding the true objective function in this manner finds a high-precision optimum...
We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central ar...
Building ontologies is a crucial part of the semantic web endeavour. In recent years, research interest has grown rapidly in supporting languages such as Arabic in NLP in general but there has been very little research on medical ontologies for Arabic. We present a new Arabic ontology in the infectious disease domain to support various important ap...
Although researchers and practitioners are pushing the boundaries and enhancing the capacities of NLP tools and methods, work on African languages is lagging. Much of the focus is on well-resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese etc. Over 97% of the world's 7000 languages, including African languages, are...
We report experience in requirements elicitation of domain knowledge from experts in clinical and cognitive neurosciences. The elicitation target was a causal model for early signs of dementia indicated by changes in user behaviour and errors apparent in logs of computer activity. A Delphi-style process consisting of workshops with experts followed...
The aim of word sense disambiguation (WSD) is to correctly identify the meaning of a word in context. All natural languages exhibit word sense ambiguities and these are often hard to resolve automatically. Consequently WSD is considered an important problem in natural language processing (NLP). Standard evaluation resources are needed to develop, e...
While the application of word embedding models to downstream Natural Language Processing (NLP) tasks has been shown to be successful, the benefits for low-resource languages are somewhat limited due to lack of adequate data for training the models. However, NLP research efforts for low-resource languages have focused on constantly seeking ways to ha...
We provide a methodological contribution by developing, describing and evaluating a method for automatically retrieving and analysing text from digital PDF annual report files published by firms listed on the London Stock Exchange (LSE). The retrieval method retains information on document structure, enabling clear delineation between narrative and...
There is vast untapped potential in relation to the use of social media for monitoring the spread of infectious diseases around the world. Much previous research has focussed on English only, but the Arabic Twitter universe has been much less studied. Motivated by important issues related to levels of trust, quality and reliability of...
We present FIESTA, a model selection approach that significantly reduces the computational resources required to reliably identify state-of-the-art performance from large collections of candidate models. Despite being known to produce unreliable comparisons, it is still common practice to compare model evaluations based on single choices of random...
Advances in Empirical Translation Studies, edited by Meng Ji, June 2019
We critically assess mainstream accounting and finance research applying methods from computational linguistics (CL) to study financial discourse. We also review common themes and innovations in the literature and assess the incremental contributions of work applying CL methods over manual content analysis. Key conclusions emerging from our analysi...
Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi‐lingual content on the Web has increased cross‐language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailab...
Lack of repeatability and generalisability are two significant threats to continuing scientific development in Natural Language Processing. Language models and learning methods are so complex that scientific conference papers no longer contain enough space for the technical depth required for replication or reproduction. Taking Target Dependent Sen...
In many areas of academic publishing, there is an explosion of literature, and subdivision of fields into subfields, leading to stove-piping where sub-communities of expertise become disconnected from each other. This is especially true in the genetics literature over the last 10 years where researchers are no longer able to maintain knowledge of p...
Automatic semantic annotation of natural language data is an important task in Natural Language Processing, and a variety of semantic taggers have been developed for this task, particularly for English. However, for many languages, particularly for low-resource languages, such tools are yet to be developed. In this paper, we report on the developme...
Objective:
To determine whether multiple computer use behaviours can distinguish between cognitively healthy older adults and those in the early stages of cognitive decline, and to investigate whether these behaviours are associated with cognitive and functional ability.
Methods:
Older adults with cognitive impairment (n = 20) and healthy contro...
This book presents the methodology, findings and implications of a large-scale corpus-based study of the metaphors used to talk about cancer and the end of life (including care at the end of life) in the UK. It focuses on metaphor as a central linguistic and cognitive tool that is frequently used to talk and think about sensitive and subjective exp...
This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, a...
The poster is at: http://wp.lancs.ac.uk/btm/2017/09/15/poster-presented-at-iges-2017-international-genetic-epidemiology-society/
Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language da...
This poster seeks to describe the creation of a Spanish lexicon with semantic annotation in order to analyse more extensive corpora in the Spanish language. The semantic resources most employed nowadays are WordNet, FrameNet, PDEV and USAS, but they have been used mainly for English language research. The creation of a large Spanish lexicon will pe...
Identity resolution capability for social networking profiles is important for a range of purposes, from open-source intelligence applications to forming semantic web connections. Yet replication of research in this area is hampered by the lack of access to ground-truth data linking the identities of profiles from different networks. Almost all dat...
lexiDB is a scalable corpus database management system designed to fulfill corpus linguistics retrieval queries on multi-billion-word multiply-annotated corpora. It is based on a distributed architecture that allows the system to scale out to support ever larger text collections. This paper presents an overview of the architecture behind lexiDB as...
We present a desktop monitoring application that combines keyboard, mouse, desktop and application-level activities. It has been developed to discover differences in cognitive functioning amongst older computer users indicative of mild cognitive impairment (MCI). Following requirements capture from clinical domain experts, the tool collects all Mic...
Technology advancement in social media software allows users to include elements of visual communication in textual settings. Emoticons are widely used as visual representations of emotion and body expressions. However, the assignment of values to the “emoticons” in current sentiment analysis tools is still at a very early stage. This paper present...
As part of a larger project where we are examining the relationship and influence of news and social media on stock price, here we investigate the potential links between the sentiment of news articles about companies and stock price change of those companies. We describe a method to adapt sentiment word lists based on news articles about specific...
In this paper, we describe the open-source SAMS framework whose novelty lies in bringing together both data collection (keystrokes, mouse movements, application pathways) and text collection (email, documents, diaries) and analysis methodologies. The aim of SAMS is to provide a non-invasive method for large scale collection, secure storage, retriev...
It is increasingly acknowledged that the Digital Humanities have placed too much emphasis on data creation and that the major priority should be turning digital sources into contributions to knowledge. While this sounds relatively simple, doing it involves intermediate stages of research that enhance digital sources, develop new methodologies and e...
We develop, describe and evaluate a web-based software tool for batch extraction and analysis of digital PDF annual report files. The retrieval method retains information on document structure thereby enabling clear delineation between narrative and financial statement components of reports, and between individual sections within the narratives com...
There are various factors that affect the sentiment level expressed in textual comments. Capitalization of letters tends to mark something for attention and repetition of letters tends to strengthen the emotion. Emoticons are used to help visualize facial expressions which can affect understanding of text. In this paper, we show the effect of the nu...
The use of metaphor in popular science is widespread to aid readers’ conceptions of the scientific concepts under discussion. Almost all research in this area has been done by careful close reading of the text(s) in question, but this article describes—for the first time—a digital ‘distant reading’ analysis of popular science, using a system create...