Serge Sharoff’s research while affiliated with University of Leeds and other places


Publications (128)


Figure 1: Diastratic Translation Strategies distributed along a continuum, from most deduction of text (-4) to most addition of text (+4)
Figure 4: Top 20 Words Identified as Complex and Removed in Easy Version
Snapshot of Scottish Government Dataset Statistics
Hyperparameters and Training Configuration
Classification Report for Typology Prediction


Reading Between the Lines: A dataset and a study on why some texts are tougher than others
  • Preprint
  • File available

January 2025 · 3 Reads

Nouran Khallaf · Carlo Eugeni · Serge Sharoff

Our research aims at better understanding what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties, based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from parallel texts (standard English and Easy-to-Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform multiclass classification, predicting the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences. The resources are available from https://github.com/Nouran-Khallaf/why-tough
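The strategy-prediction task can be pictured with a toy rule-based baseline. The labels and thresholds below are invented for illustration only; the actual system described above fine-tunes pre-trained transformers on the annotated dataset rather than using surface cues.

```python
# Toy baseline for multiclass simplification-strategy prediction.
# Labels and thresholds are hypothetical, not the paper's annotation scheme.

STRATEGIES = ["keep", "lexical_substitution", "sentence_split", "deletion"]

def predict_strategy(sentence: str, frequent_words: set) -> str:
    """Assign a simplification strategy from crude surface cues."""
    tokens = sentence.lower().split()
    rare = [t for t in tokens if t.strip(".,;") not in frequent_words]
    if len(tokens) > 25:
        return "sentence_split"          # overly long sentences get split
    if rare and len(rare) == len(tokens):
        return "deletion"                # nothing familiar: drop the sentence
    if len(rare) > len(tokens) // 3:
        return "lexical_substitution"    # many rare words: replace them
    return "keep"
```

A fine-tuned classifier would replace these heuristics with learned decision boundaries, but the input/output contract (sentence in, strategy label out) is the same.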


Figure 2: Accuracy comparison between GPT-4o baseline and detailed control prompts across different numbers of demonstration examples (shots).
Classification accuracy for the ablated genre classification and re-phrased versions of our prompts with GPT-4o.
Corpus of natural genre annotation (from Roussinov and Sharoff, 2023)
Keywords from ukWac for the topic model with 25 topics (from Roussinov and Sharoff, 2023). Seemingly similar names (e.g., Politics1 vs. Politics2) still have very different keywords (International vs. Domestic). Additionally, any similarity does not pose a limitation to our method, as we split texts into on-topic/off-topic groups based on their topic-model scores (close vs. distant from that topic), thus we are not relying on 'orthogonality' between the topics.
Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection

This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: 1) genre classification and 2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.
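The idea of excluding topical indicators can be sketched outside the LLM setting. In the study the control is exercised through prompt instructions; the hypothetical preprocessing step below makes the same point mechanically, masking content words so that only style-bearing function words remain visible to a downstream classifier.

```python
# Sketch of suppressing topical features for non-topical classification.
# The function-word list is a small illustrative sample; the paper's actual
# method instructs the LLM via prompts rather than editing the text.

FUNCTION_WORDS = {"the", "a", "of", "to", "in", "and", "is", "was", "it", "that"}

def mask_topical(text: str, keep: set = FUNCTION_WORDS) -> str:
    """Replace content-bearing tokens with a placeholder, keeping the
    function words that carry genre/style signal across domains."""
    out = []
    for tok in text.lower().split():
        out.append(tok if tok.strip(".,") in keep else "_")
    return " ".join(out)
```

After masking, two texts from different domains (travel vs. history) look alike topically, so any remaining classification signal is stylistic.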


Frequency of Harmful-Error causing POS (Info)
Frequency of Harmful-Error causing POS (PROF)
Example cases of inaccurate machine translation of VPs and LVCs
Error severity by word senses (Info)
Error severity by no. of domains (Info)
Quantifying the Contribution of MWEs and Polysemy in Translation Errors for English-Igbo MT

In spite of recent successes in improving Machine Translation (MT) quality overall, MT engines require a large amount of resources, which leads to markedly lower quality for lesser-resourced languages. This study explores the case of translation from English into Igbo, a very low-resource language spoken by about 45 million speakers. With the aim of improving MT quality in this scenario, we investigate methods for guided detection of critical/harmful MT errors, more specifically those caused by non-compositional multi-word expressions and polysemy. We have designed diagnostic tests for these cases and applied them to collections of medical texts from CDC, Cochrane, NCDC, NHS and WHO.
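A diagnostic of the kind described can be sketched as a simple lookup: flag source sentences containing known non-compositional expressions, so their machine translations can be checked for word-by-word (harmful) renderings. The expression list here is a toy placeholder, not the paper's actual diagnostic inventory.

```python
# Illustrative MWE diagnostic for guided MT error detection.
# The inventory below is a toy stand-in for a curated list of
# non-compositional multi-word expressions.

MWES = ("kick the bucket", "give up", "run out of")

def flag_mwes(sentence: str, mwes=MWES) -> list:
    """Return non-compositional expressions found in a source sentence,
    marking it for manual inspection of the MT output."""
    s = " ".join(sentence.lower().split())
    return [m for m in mwes if m in s]
```

A parallel test for polysemy would look up source words with multiple senses and check whether the engine picked a sense appropriate to the medical domain.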


Figure 1: An overall framework of our proposed Cross-modal multi-task learning network
Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning

Image-to-text generation involves automatically generating descriptive text from images and has applications in medical report generation. However, traditional approaches often exhibit a semantic gap between visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and non-imaging data for generating radiology reports. Along with chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially with text generation prioritisation, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in auxiliary tasks compared to single-task models. Qualitative analysis showed logically coherent narratives and accurate identification of findings, though some repetition and disjointed phrasing remained. This work demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.
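The multi-task objective described above amounts to a weighted sum of per-task losses, with the text-generation task prioritised. A minimal sketch, with invented task names and weights (the model's actual weighting scheme is not reproduced here):

```python
# Minimal sketch of a weighted multi-task training objective.
# Task names, loss values, and weights are illustrative only.

def multitask_loss(losses: dict, weights: dict) -> float:
    """Combine per-task losses into one scalar training objective."""
    return sum(weights[task] * loss for task, loss in losses.items())

total = multitask_loss(
    {"generation": 2.0, "severity": 0.5, "findings": 0.8},
    {"generation": 1.0, "severity": 0.3, "findings": 0.3},  # generation prioritised
)
```

Giving the auxiliary tasks smaller weights lets them regularise the shared representation (mitigating their own overfitting) without pulling training away from report generation.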



Figure 3: List of variables.
Singular model results.
Beyond images: an integrative multi-modal approach to chest x-ray report generation

February 2024 · 101 Reads · 1 Citation

Frontiers in Radiology

Serge Sharoff · Selcuk Baser · [...]
Image-to-text radiology report generation aims to automatically produce radiology reports that describe the findings in medical images. Most existing methods focus solely on the image data, disregarding the other patient information accessible to radiologists. In this paper, we present a novel multi-modal deep neural network framework for generating chest x-ray reports by integrating structured patient data, such as vital signs and symptoms, alongside unstructured clinical notes. We introduce a conditioned cross-multi-head attention module to fuse these heterogeneous data modalities, bridging the semantic gap between visual and textual data. Experiments demonstrate substantial improvements from using additional modalities compared to relying on images alone. Notably, our model achieves the highest reported performance on the ROUGE-L metric compared to relevant state-of-the-art models in the literature. Furthermore, we employed both human evaluation and clinical semantic similarity measurement alongside word-overlap metrics to improve the depth of quantitative analysis. A human evaluation, conducted by a board-certified radiologist, confirms the model's accuracy in identifying high-level findings; however, it also highlights that more improvement is needed to capture nuanced details and clinical context.


Building Comparable Corpora

August 2023 · 22 Reads · 2 Citations

Synthesis Lectures on Human Language Technologies

In a parallel corpus we know by design which document is a translation of which. If the link between documents in different languages is not known, it needs to be established. In this chapter we will discuss methods for measuring document similarity across languages and how to evaluate the results. Then we will proceed to discuss methods for building comparable corpora of different degrees of comparability and for different tasks.
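One elementary cross-lingual similarity measure maps one document's tokens through a bilingual lexicon and compares bag-of-words vectors by cosine. The three-entry lexicon below is a toy stand-in for the dictionary- and embedding-based methods the chapter covers.

```python
# Sketch of cross-lingual document similarity via a bilingual lexicon.
# The French-English lexicon is a toy placeholder.
from collections import Counter
from math import sqrt

LEXICON = {"chat": "cat", "noir": "black", "maison": "house"}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def crosslingual_sim(en_doc: str, fr_doc: str) -> float:
    """Translate the French tokens into English, then compare
    bag-of-words vectors in the shared (English) space."""
    en = Counter(en_doc.lower().split())
    fr = Counter(LEXICON.get(t, t) for t in fr_doc.lower().split())
    return cosine(en, fr)
```

Documents scoring above a threshold can then be paired as comparable, with the threshold controlling the degree of comparability of the resulting corpus.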


Comparable and Parallel Corpora for Machine Translation

August 2023 · 17 Reads

Synthesis Lectures on Human Language Technologies

With the advent of Neural Machine Translation (NMT), a breakthrough has been achieved in translation quality when compared to previous approaches such as rule-based, example-based, and statistical machine translation (MT). However, NMT systems tend to be treated as black boxes, and it is not easy to predict their behavior.


Induction of Bilingual Dictionaries

August 2023 · 11 Reads

Synthesis Lectures on Human Language Technologies

The aim of the Bilingual Lexicon Induction (BLI) task is to produce a bilingual lexicon using a pair of comparable corpora and either a small set of seed translations (a supervised setting) or no seeds at all (an unsupervised setting). A traditional bilingual dictionary usually offers a structure of senses and conditions for their translations, as well as POS tags for disambiguation. In contrast, the lexicons built in the BLI task involve a number of simplifications.
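The retrieval step at the heart of BLI can be illustrated in miniature: once source and target embedding spaces have been aligned (e.g. via a seed dictionary and a linear transform), a translation is induced by nearest-neighbour search. The vectors below are invented two-dimensional toys, already assumed to live in a shared space.

```python
# Toy illustration of the BLI retrieval step. Vectors are invented and
# assumed to be already aligned; real BLI first maps independently
# trained spaces together, e.g. with a seed dictionary.
from math import sqrt

EN = {"cat": (1.0, 0.1), "house": (0.1, 1.0)}
FR = {"chat": (0.9, 0.2), "maison": (0.2, 0.9)}

def cos(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def induce(word: str, src=EN, tgt=FR) -> str:
    """Translate by nearest neighbour in the shared embedding space."""
    return max(tgt, key=lambda w: cos(src[word], tgt[w]))
```

The simplification relative to a traditional dictionary is visible here: the output is a flat word-to-word mapping, with no sense structure, translation conditions, or POS information.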


Conclusions and Future Research

August 2023 · 3 Reads

Synthesis Lectures on Human Language Technologies

At the beginning of the 2000s, the use of comparable corpora was on the margins of NLP research. Existing MT systems were nearly always based on fully parallel corpora, while NLP applications were mostly built separately in each language, without the advantages of cross-lingual transfer.


Citations (59)


... The performance of PTMs often fluctuates depending on some characteristics of the target dataset (Schaffer, 1994), usually referred to as meta-features (Rivolli et al., 2022). For example, a model that performs exceptionally well on carefully written news articles may encounter difficulties with the brevity and slang commonly found in social media posts (Zheng and Yang, 2019; Shushkevich et al., 2022; Roussinov and Sharoff, 2023). Therefore, the challenge is to decide, for a particular dataset, which PTM is expected to perform the best after fine-tuning. ...

Reference:

Characterizing Text Datasets with Psycholinguistic Features
BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification
  • Citing Conference Paper
  • January 2023

... • It is essential to develop new parallel datasets [248] for the formality transfer in Arabic text. To expand the current parallel datasets, it is important to investigate techniques for extracting parallel sentences [249]. • It is important to investigate how formality transfer relates to other style transfer tasks, like sentiment transfer [189,190,205,250] and code-switching style transfer [14]. ...

Building and Using Comparable Corpora for Multilingual Natural Language Processing
  • Citing Book
  • January 2023

Synthesis Lectures on Human Language Technologies

... Many existing studies utilize English datasets with translations that often lack cross-cultural relevance [17]. Additionally, the reliance on machine translation in prior models raises issues of accuracy and cultural sensitivity [18]. Addressing these gaps, our work introduces the AraImg2k Dataset, a comprehensive collection of 2000 images embodying Arab culture, each paired with five carefully crafted captions in Modern Standard Arabic (MSA). ...

Building Comparable Corpora
  • Citing Chapter
  • August 2023

Synthesis Lectures on Human Language Technologies

... Despite its prevalent utilization, the BLEU score has constraints in fully capturing all dimensions of paraphrase quality, such as semantic similarity, fluency, and sensitivity to tokenization [ 8 ]. • BERTScore (Bidirectional Encoder Representations from Transformers) utilizes contextual embeddings from BERT to measure the similarity between tokens. It evaluates precision, recall, and F1 scores to gauge the alignment between generated and reference texts, offering a nuanced assessment that considers semantic similarity [ 12 ]. ...

Towards Arabic Sentence Simplification via Classification and Generative Approaches
  • Citing Conference Paper
  • January 2022

... The variant models BERT, GPT, and T5 have practical applications in the field of translation technology. GPT has been used for question-answering systems and can be applied to further NLP tasks such as text classification, Named Entity Recognition (NER), and language translation (Dai, 2023). According to Zaki (26), NER is "the procedure that a machine follows in finding the name entities". ...

Syntactic Knowledge via Graph Attention with BERT in Machine Translation
  • Citing Preprint
  • May 2023

... Recently, the paradigm has been revamped with neural networks able to profile a text and identify its complexity [7,29,34]. A certain disadvantage of this approach is the lack of interpretability of its assessment results, as they do not provide information on the parameters affecting its complexity. ...

What neural networks know about linguistic complexity

Russian Journal of Linguistics

... Concerning the Arabic language, the works are even scarcer. In 2021, Khallaf et al. [16] presented an approach to predict the difficulty of MSA sentences. They compared the performance of different types of sentence embedding (fastText, mBERT, XLM-R and Arabic-BERT) and compared them to traditional linguistic features, such as PoS tags, dependency trees, readability scores and frequency lists for language learners. ...

Automatic Difficulty Classification of Arabic Sentences

... Laippala et al. (2023) experimented with various machine learning methods and classification settings to achieve the best possible identification scores for CORE. They found that using a multi-label approach, where multiple labels can be assigned to each instance, yields significantly higher scores than a single-label, multi-class approach (see also Santini 2007; Vidulin, Luštrek, and Gams 2007; Madjarov et al. 2019; Sharoff 2021; Kuzman, Rupnik, and Ljubešić 2022; Egbert, Biber, and Davies 2015). In their methodology, each register label, both the main and subregister labels, was assigned independently, and every subregister label was always assigned together with its main register label. ...

Genre Annotation for the Web: text-external and text-internal perspectives
  • Citing Article
  • December 2020

Register Studies

... Another promising research direction concerns the ability to transfer prediction models to less resourced languages via multilingual transfer, for example via multilingual embeddings (Sogaard et al., 2019), especially when we want to transfer our models between closely related languages, such as French and Italian (Sharoff, 2020). It is known that multilingual PLMs share enough information between languages to make this successful (Lample and Conneau, 2019). ...

Finding next of kin: Cross-lingual embedding spaces for related languages
  • Citing Article
  • September 2019

Natural Language Engineering

... For this purpose, most methods rely on comparable corpora [15], taking a measure of common words in their contexts as first clue to the similarity between a word and its translation. However, in this type of resources the position and the frequency of the source and target words are not comparable, and the translation of a word might not exist in a given pair of comparable documents [4]. ...

New Areas of Application of Comparable Corpora: Methods and Protocols
  • Citing Chapter
  • January 2019