Claire Brierley's research while affiliated with University of Leeds and other places

Publications (40)

Article
Full-text available
This paper presents two sets of lexical items automatically extracted from the Arabic Quran, and denoting two different notions of linguistic salience: keyness and prosodic prominence. Our novel hypothesis investigates a possible correlation between them. Our novel findings discover distributionally significant keywords that also occur strategicall...
Article
Full-text available
Natural Language Processing Working Together with Arabic and Islamic Studies is a 2-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodic-syntactic mark-up in the Quran (Atwell et al 2013). Tajwīd or correct Quranic recitation is very important in Islam. The original insight informing this projec...
Conference Paper
Full-text available
Inspired by the Oxford Children's Corpus, we have developed a prototype corpus of Arabic texts written and/or selected for children. Our Arabic Children's Corpus of 2950 documents and nearly 2 million words has been collected manually from the web during a 3-month project. It is of high quality, and contains a range of different children's genres b...
Conference Paper
Full-text available
Natural Language Processing Working Together with Arabic and Islamic Studies is a 2-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodic-syntactic mark-up in the Quran (Atwell et al 2013). Tajwīd or correct Quranic recitation is very important in Islam. The original insight informing this projec...
Conference Paper
Full-text available
In this paper, we focus on the prosodic effect of qalqalah or "vibration" applied to a subset of Arabic consonants under certain constraints during correct Qur'anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al 2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified ort...
Article
Full-text available
Semantic Pathways is a corpus exploration tool with a unique visual interface in which keyword extraction and keyword-based document clustering have been implemented in order to facilitate insight forming. Semantic Pathways combines corpus comparison techniques from Corpus Linguistics with aesthetically-driven design and interaction to produce flui...
Conference Paper
Full-text available
We train and test two probabilistic taggers for Arabic phrase break prediction on a purpose-built, "gold standard", boundary-annotated and PoS-tagged Qur'an corpus of 77430 words and 8230 sentences. In a related LREC paper (Brierley et al., 2012), we cover dataset build. Here we report on comparative experiments with off-the-shelf N-gram and HMM ta...
Conference Paper
Full-text available
A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwīd (r...
Article
Full-text available
A phrase break classifier is needed to predict natural prosodic pauses in text to be read out loud by humans or machines. To develop phrase break classifiers, we need a boundary-annotated and part-of-speech tagged corpus. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener; manual an...
Conference Paper
Full-text available
Prosodic-syntactic chunking is a language universal [1]: we process linguistic content by splitting it up into meaningful stand-alone units, where each chunk includes at least one accented word with pitch variation on the stressed syllable(s). In English, punctuation in text plus pauses and inflections in speech signify phrase breaks or boundaries...
Conference Paper
Full-text available
We review a range of Artificial Intelligence and Corpus Linguistics research at Leeds University on Arabic and the Quran, which has produced a range of software and corpus datasets for research on Modern Standard Arabic and more recently Quranic Arabic. Our work on Quranic Arabic corpus linguistics has attracted widespread interest, not only from A...
Conference Paper
Full-text available
We plan to apply Text Analytics techniques honed on English for corpus-based exploration of Arabic, for Arabic Text-to-Speech (TTS) and other applications. Such techniques depend on a corpus or sample of naturally-occurring language texts capturing empirical data on the phenomena being studied, for example prosodic-syntactic patterns at phrase junc...
Conference Paper
This report describes the joint entry from Middlesex University and the University of Leeds for Mini Challenge 3 for the VAST Challenge 2011. In order to address the challenge question, the primary tool we used was Middlesex University’s Interactive Visual Search and Query Environment (INVISQUE), which served as the user interface to the Mini-Chall...
Conference Paper
Full-text available
We have found empirical evidence of a correlation in English between words containing complex vowels (diphthongs and triphthongs) and 'gold-standard' phrase break annotations in datasets as apparently different as seventeenth-century verse and a Reith lecture transcript on economics from the late twentieth-century. Spontaneous speech in the form of...
Article
Full-text available
We report on a significant correlation between lexical items containing complex vowels in their present day canonical forms, and prosodic-syntactic boundaries in Milton’s Paradise Lost, where all line terminals, whether end-stopped or run-on, plus line-medials with associated punctuation, constitute boundary tokens and equate to gold-standard phras...
Conference Paper
Full-text available
It is universally recognized that humans process speech and language in chunks, each meaningful in itself. Any two renditions or assimilations of a given sentence will exhibit similarities and discrepancies in the distribution of phrase breaks. Automated phrase break prediction assigns pauses to plain text as input, evaluated against human performa...
Conference Paper
Full-text available
We have previously reported on ProPOSEL, a purpose-built Prosody and PoS English Lexicon compatible with the Python Natural Language ToolKit. ProPOSEC is a new corpus research resource built using this lexicon, intended for distribution with the Aix-MARSEC dataset. ProPOSEC comprises multi-level parallel annotations, juxtaposing prosodic and syntac...
Conference Paper
Full-text available
Real-world knowledge of syntax is seen as integral to the machine learning task of phrase break prediction but there is a deficiency of a priori knowledge of prosody in both rule-based and data-driven classifiers. Speech recognition has established that pauses affect vowel duration in preceding words. Based on the observation that complex vowels oc...
Conference Paper
Full-text available
ProPOSEL is a prosody and PoS English lexicon, purpose-built to integrate and leverage domain knowledge from several well-established lexical resources for machine learning and NLP applications. The lexicon of 104049 separate entries is in accessible text file format, is human and machine-readable, and is intended for open source distribution with...
Conference Paper
Full-text available
ProPOSEL is a prototype prosody and PoS (part-of-speech) English lexicon for Language Engineering, derived from the following language resources: the computer-usable dictionary CUVPlus, the CELEX-2 database, the Carnegie-Mellon Pronouncing Dictionary, and the BNC, LOB and Penn Treebank PoS-tagged corpora. The lexicon is designed for the target appl...
Article
Full-text available
Prosodic phrasing is the means by which speakers of any given language break up an utterance into meaningful chunks. The term "prosody" itself refers to the tune or intonation of an utterance, and therefore prosodic phrases literally signal the end of one tune and the beginning of another. This study uses phrase break annotations in the Aix-MARSEC...
Conference Paper
Full-text available
An automatic phrase break prediction system aims to identify prosodic-syntactic boundaries in text which correspond to the way a native speaker might process or chunk that same text as speech. In computational linguistics, Machine Learning from hand-annotated corpus data has become the de-facto standard approach to text annotation problems such as...
Article
Full-text available
The goal of automatic phrase break prediction is to identify prosodic-syntactic boundaries in text which correspond to the way a native speaker might process or chunk that same text as speech. This is treated as a classification task in machine learning and output predictions from language models are evaluated against a ‘gold standard’: human-label...

Citations

... Adopting two successive modules was to facilitate the implementation and the verification of the second process (the Tajweed rule verification). The rules structure is quite similar to the works of (Bellegdi & Muhtaseb, 2015; El-Imam, 2004; Saidane, Zrighi & Ben Ahmed 2004; Sawalha, Brierley, Atwell, & Dickins 2017) but we grouped and applied them depending on the case of the grapheme (its position, its type, etc.) to speed up the entire process. ...
... Brierley et al. [56] have worked on a set of consonants in the Qur'an that provide the prosodic effect of Qalqalah (vibration). Prosody is the pattern of rhythm and of how the speaker's voice rises and falls while speaking. ...
... In [5], Sulaiti et al. describe a corpus of reading material targeted at children. Although the texts are classified by some categories, they are not leveled by difficulty and it is not clear that the corpus is available or ever will be. ...
... A further aspect of our experimental work, and a means of familiarization with the corpus, was to compare the first author's intuitive prosodic phrasing to that of expert annotators and to mark out longer prosodic phrases in response to Liberman and Church's own criticism of the chink/chunk rule in their original paper [19]. Due to space constraints, this work is not included here, but detailed in [13]. ...
... This is original research in that: (i) our goal is to derive chunking algorithms for Arabic speech and language applications from traditional prosodic mark-up in the Qur'an; and (ii) our underpinning question is whether Qur'anic Arabic speech rhythms still inform native speaker intuition and judgment when processing Modern Standard Arabic. Our two papers for LREC 2012, along with an earlier paper (Brierley et al., 2011), represent groundwork for a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic. ...
... However, an alternative approach has since been used by the authors (Brierley and Atwell, 2009b) and has now been applied to both datasets in this study. The customised verse tokenizer in Listing 3 uses a regular expression (cf. ...
... We present an overview of work related to a two-year project (2013-2015) funded by the UK Engineering and Physical Sciences Research Council (EPSRC) (Atwell et al 2013). The project is entitled: Natural Language Processing Working Together with Arabic and Islamic Studies, and involves a core interdisciplinary and international research team from Computing and Arabic at the Universities of Leeds and Jordan, developing Islamic applications for the Arabic Quran as core text. ...
... To achieve these goals, usage of modern punctuation marks was investigated. Then, machine learning algorithms were trained and tested on the Boundary Annotated Qur'an (BAQ) Corpus (Sawalha, et al., 2012) after some modification. Modern punctuation marks were extracted from Sayyid Qutb's في ظلال القرآن "fī ẓilāl al-qur'ān" (Qutb, 1991) and then inserted in the BAQ Corpus. ...
... We are pioneering phrase break prediction for Arabic. In Sawalha et al (2012a, 2012b) we use trigram and HMM taggers from the Natural Language Toolkit (Bird et al 2009) to predict boundaries in a discrete Quran test set of 7318 words and 849 sentences, using both sets of syntactic features and break types in the Boundary-Annotated Quran. This test set comprises Quran chapters where Meccan/Medinan provenance is disputed, and constitutes a fair test for a classifier trained on both styles. ...