Michael Oakes

University of Wolverhampton · Research Institute in Information and Language Processing (RIILP)

Doctor of Philosophy, Computer Science

About

110 Publications
23,554 Reads
1,552 Citations
Introduction
Together with Meng Ji, I am guest editing a special issue of the journal "META" on Corpus-Based Translation Studies. I have also been working with Alois Pichler on the stylometry of Wittgenstein.
Additional affiliations
January 2001 - September 2013
University of Sunderland
Position
  • Professor (Associate)

Publications (110)
Article
Full-text available
Background: Emotionally unstable personality disorder (EUPD) is a challenging condition with a prevalence of 20% in inpatient services. Psychotherapy is the preferred treatment; nevertheless, off-license medications are widely used. Objectives: To identify socio-demographic, clinical and service-delivery characteristics of people with EUPD admi...
Chapter
Full-text available
Whether you wish to deliver on a promise, take a walk down memory lane or even on the wild side, phraseological units (also often referred to as phrasemes or multiword expressions) are present in most communicative situations and in all the world’s languages. Phraseology, the study of phraseological units, has therefore become a rare unifying theme acr...
Presentation
Full-text available
In 1899 Qassim Amin introduced his book “The Liberation of Women” and used both rational Islamic arguments and emotional arguments to put forward his view. In his book, Amin called for women’s education, removing the veil, and reformation of marriage and divorce laws. Mohamed Emara stated in his book “Islam and Women in Mohammed Abdu’s opinion” (E...
Presentation
Full-text available
Aaidh ibn Abdullah Al-Qarni was born in 1960 in Saudi Arabia and became one of the most respected scholars in his country. He is a famous writer, having written over 80 books in a very short time. In 2003 he published the book (لا تحزن) “Don’t be sad” which sold over 10 million copies and was widely translated. In 2012 Al-Qarni was accused of plag...
Chapter
Full-text available
Advances in Empirical Translation Studies - edited by Meng Ji June 2019
Chapter
Full-text available
Advances in Empirical Translation Studies - edited by Meng Ji June 2019
Conference Paper
Full-text available
In this paper we investigate differences and similarities between dialects using unsupervised learning. We used a binary phonetic representation to cluster utterances from different Arabic and English dialects. This phonetic representation aims to capture phonetic patterns such as vowel and consonant length. We tested this representation on an Arab...
Article
Full-text available
The Indus Script originates from the culture known as the Indus Valley Civilization, which flourished from approximately 2600 to 1900 BC. Several thousand objects bearing these signs have been found over a wide area of Northern India and Pakistan. In 1977, Iravatham Mahadevan published a concordance of all of the scripts that had been discovered so...
Article
Full-text available
This article looks at the provenance of the unfinished novel The Dark Tower, generally attributed to C. S. Lewis. The manuscript was purportedly rescued from a bonfire shortly after Lewis’s death by his literary executor Walter Hooper, but the quality of the text is hardly vintage Lewis. Using computer stylometric programs made available by Eder et...
Article
Full-text available
In this paper, we present experiments using the Linguistic Inquiry and Word Count (LIWC) program, a 'closed-class keyword' (CCK) analysis and a 'correspondence analysis' (CA) to examine whether the Scientology texts of L. Ron Hubbard are linguistically and conceptually like those of other religions. A Kruskal-Wallis test comparing the frequencies o...
Article
Full-text available
The field of Cross-Language Information Retrieval relates techniques close to both the Machine Translation and Information Retrieval fields, although in a context involving characteristics of its own. The present study looks to widen our knowledge about the effectiveness and applicability to that field of non-classical translation mechanisms that w...
Article
Full-text available
Nowadays, documents are increasingly associated with multi-level category hierarchies rather than a flat category scheme. As the volume and diversity of documents grow, so do the size and complexity of the corresponding category hierarchies. To be able to access such hierarchically classified documents in real-time, we need fast automatic methods t...
Article
Full-text available
It has been shown that online health-related discussions significantly influence the attitudes and behavioral intentions of the discussion participants. Although empirical evidence strongly supports the importance of emotions in health-related online discussions, there are few studies of the relationship between a subjective language and online dis...
Conference Paper
Full-text available
This work studies sentiment and factual transitions on an online medical forum where users correspond in English. We work with discussions dedicated to reproductive technologies, an emotionally-charged issue. In several learning problems, we demonstrate that multi-class sentiment classification significantly improves when messages are represented b...
Conference Paper
Many Information Retrieval (IR) and Natural Language Processing (NLP) systems require textual similarity measurement in order to function, and do so with the help of similarity measures. Similarity measures behave differently: some measures which work better on highly similar texts do not always do so well on highly dissimilar texts. In this pape...
Article
Full-text available
Background: Suicide is a major public health problem, with mental disorders being one of its major risk factors. The high incidence of suicide on the Isle of Wight has motivated this study, the first of its kind on suicide in this small geographic area. Aim: The aim of the study was to identify socio-demographic and clinical risk factors for suicid...
Book
Computational linguistics can be used to uncover mysteries in text which are not always obvious to visual inspection. For example, the computer analysis of writing style can show who might be the true author of a text in cases of disputed authorship or suspected plagiarism. The theoretical background to authorship attribution is presented in a step...
Conference Paper
Full-text available
Presented is a comparative study of two machine learning models (MLP Neural Network and Bayesian Network) as part of a decision support system for prescribing ITE (in the ear) and BTE (behind the ear) aids for people with hearing difficulties. The models are developed/trained and evaluated on a large set of patient records from major NHS audiology...
Article
Full-text available
The purpose of this research is to mine a large set of heterogeneous audiology data to create a decision support system (DSS) to choose between two hearing aid types (ITE and BTE aid). This research is based on the data analysis of audiology data using various statistical and data mining techniques. It uses the data of a large NHS (National He...
Conference Paper
Full-text available
We perform data mining on the publicly available Tinnitus Archive. A number of statistically significant associations with gender were found using the Chi-squared test. These were age, onset rapidity, tinnitus localisation, number of tinnitus sounds heard, sleep interference due to tinnitus, feeling tired and ill because of tinnitus, index of noise...
Article
Full-text available
Using techniques from computational stylometry we will examine some of the dictated writings of Ludwig Wittgenstein which have been made available by the Wittgenstein Archives at the University of Bergen. Our purpose is to give an example of how computational stylometry can be used to help answer concrete questions of Wittgenstein research, and thu...
Article
Full-text available
Many organizations are nowadays keeping their data in the form of multi-level categories for easier manageability. An example of this is the Reuters Corpus which has news items categorized in a hierarchy of up to five levels. The volume and diversity of documents available in such category hierarchies is also increasing daily. As such, it becomes d...
Article
Full-text available
This paper describes the analysis of a database of over 180,000 patient records, collected from over 23,000 patients, by the hearing aid clinic at James Cook University Hospital in Middlesbrough, UK. These records consist of audiograms (graphs of the faintest sounds audible to the patient at six different pitches), categorical data (such as age, ge...
Book
This is a comprehensive guidebook to the quantitative methods needed for Corpus-Based Translation Studies (CBTS). It provides a systematic description of the various statistical tests used in Corpus Linguistics which can be used in translation research. In Part 1, Theoretical Explorations, the interplay between quantitative and qualitative methodol...
Chapter
Full-text available
This is a comprehensive guidebook to the quantitative methods needed for Corpus-Based Translation Studies (CBTS). It provides a systematic description of the various statistical tests used in Corpus Linguistics which can be used in translation research. In Part 1, Theoretical Explorations, the interplay between quantitative and qualitative methodol...
Chapter
Full-text available
Through a detailed comparison of early English translations of the Chinese novel, this paper demonstrates how a set of bivariate statistics, commonly used for the comparison of corpora, can be applied in translation studies. Our corpus study uncovered a number of textual phenomena that have been rarely discussed before. We found that while the use...
Conference Paper
Full-text available
In today's world, the number of electronic documents made available to us is increasing day by day. It is therefore important to look at methods which speed up document search and reduce classifier training times. The data available to us is frequently divided into several broad domains with many sub-category levels. Each of these domains of data c...
Article
Full-text available
Corpora with high-quality linguistic annotations are an essential component in many NLP applications and a valuable resource for linguistic research. For obtaining these annotations, a large amount of manual effort is needed, making the creation of these ...
Conference Paper
Full-text available
The volume and diversity of documents available in today's world is increasing daily. It is therefore difficult for a single classifier to efficiently handle multi-level categorization of such a varied document space. In this paper we analyse methods to enhance the efficiency of a single classifier for two-level classification by combining it with...
Conference Paper
Full-text available
In this paper we describe our analysis of a database of over 180,000 patient records, collected from over 23,000 patients, by the hearing aid clinic at James Cook University Hospital in Middlesbrough, UK. These records consist of audiograms (graphs of the faintest sounds audible to the patient at six different pitches), categorical data (such as ag...
Chapter
Full-text available
In this chapter, we have used the chi-squared test and Yule’s Q measure to discover associations in tables of patient audiology data. These records are examples of heterogeneous medical records, since they contain audiograms, textual notes and typical relational fields. In our first experiment we used the chi-squared measure to discover association...
Conference Paper
Full-text available
Subspace learning is very important in today’s world of information overload. Distinguishing between categories within a subset of a large data repository such as the web and the ability to do so in real time is critical for a successful search technique. The characteristics of data belonging to different domains are also varying widely. This merit...
Article
Full-text available
A vast data repository such as the web contains many broad domains of data which are quite distinct from each other e.g. medicine, education, sports and politics. Each of these domains constitutes a subspace of the data within which the documents are similar to each other but quite distinct from the documents in another subspace. The data within th...
Conference Paper
Full-text available
This work describes a variation on the traditional Information Retrieval paradigm, where instead of text documents being indexed according to their content, they are indexed according to the search terms previous users have used in finding them. We determine the effectiveness of this approach by indexing a sample of query logs from the European Lib...
Conference Paper
Subspace detection and processing is receiving more attention nowadays as a method to speed up search and reduce processing overload. Subspace Learning algorithms try to detect low dimensional subspaces in the data which minimize the intra-class separation while maximizing the inter-class separation. In this paper we present a novel technique using...
Article
Full-text available
How to bridge the semantic gap is currently a major research problem in Content-Based Image Retrieval (CBIR). Most applications are based on supervised machine-learning classifiers to match images with their related categories. Noisy training information has resulted in current systems having low accuracy, especially when using large numbers of voc...
Article
Full-text available
In this paper, we have used the chi-squared test and Yule's Q measure to discover associations in tables of patient audiology data. These records are examples of heterogeneous medical records, since they contain audiograms, textual notes and typical relational fields. In our first experiment we used the chi-squared measure to discover associations...
Conference Paper
In this paper we describe new results of statistical and neural data mining of audiology patient records, with the ultimate aim of looking for factors influencing which patients would most benefit from being fitted with a hearing aid. We describe how a combination of neural and statistical techniques can usefully subdivide a set of patients into cl...
Article
Full-text available
This paper reports an experiment to evaluate a Cross Language Information Retrieval (CLIR) system that uses a multilingual ontology to improve query translation in the travel domain. The ontology-based approach significantly outperformed the Machine Readable Dictionary translation baseline using Mean Average Precision as a metric in a user-centered...
Conference Paper
Choosing the optimal terms to represent a search engine query is not trivial, and may involve an iterative process such as relevance feedback, repeated unaided attempts by the user or the automatic suggestion of additional terms, which the user may select or reject. This is particularly true of a multimedia search engine which searches on concepts...
Conference Paper
Full-text available
We make use of search logs provided by the Belga News Agency to recommend images downloaded by previous users to new users. Each search session in the logs consists of a session ID number, the ID of the images which were downloaded at the conclusion of that session, and the various search terms which were input leading up to the selection and downl...
Article
Full-text available
Ontologies are the backbone of the semantic web and allow software agents to interoperate effectively. An ontology is able to represent and to clarify concepts and inter-concept relationships and can be used as a framework to represent underlying domain concepts expressed in many different languages. One way to do this is by mapping Ontologies in d...
Article
Full-text available
This paper describes the automatic assignment of images into classes described by individual keywords provided with the Corel data set. Automatic image annotation technology aims to provide an efficient and effective searching environment for users to query their images more easily, but current image retrieval systems are still not very accurate wh...
Conference Paper
Full-text available
We present a method to automatically acquire a set of keywords that characterise a large multimedia collection. Our method compares captions associated with pictures in the collection with a model of general English language. The words that deviate from the model are very specific to the captions and thus make appropriate keywords. Professional ann...
Chapter
This volume brings together revised versions of a selection of papers presented at the Sixth International Conference on “Recent Advances in Natural Language Processing” (RANLP) held in Borovets, Bulgaria, 27–29 September 2007. These papers cover a wide variety of Natural Language Processing (NLP) topics: ontologies, named entity extraction, transl...
Article
Grid information retrieval (GIR) means using a grid system for retrieving relevant documents that satisfy the user need from within a large-scale data collection. A data collection can be text, audio, video, etc. The grid provides powerful computation while information retrieval provides techniques for retrieving useful information. In previous wor...
Conference Paper
Information retrieval (IR) systems for large-scale data collections must build an index in order to provide efficient retrieval that meets the user's needs. In distributed IR systems, query response time is affected by the way in which the data collection is partitioned across nodes. There are three types of collection partitioning: document-bas...
Conference Paper
Full-text available
This paper describes how similarity matrices are being used to help indexing, browsing, and relevance feedback in a multi-media search engine. We describe the creation of a matrix of the pairwise similarities of image captions, which was used to provide a cartography of related image clusters. A matrix of semantic concept similarities was also prod...
Conference Paper
In this paper we describe a system for storing and retrieving digital images from personal collections. The images can either be manually annotated with a set of keywords chosen by the owner of the collection, or keywords can be automatically inferred from the time and location stamps associated with the image and the Geographic Names Data Base gaz...
Conference Paper
Full-text available
This paper describes work performed at the University of Sunderland as part of the EU-funded VITALAS project. Text feature vectors, extracted from the TRECVID video data set, were submitted to an SVM-light implementation of support vector machine, which aimed to label each video shot with the relevant concepts from the 101-concept MediaMill set. Su...
Conference Paper
Full-text available
In this paper, we describe the use of a Boosting algorithm, Real AdaBoost, for content-based image retrieval (CBIR) on a large number (190) of keyword categories. Previous work with Boosting for image orientation detection has involved only a few categories, such as a simple outdoor vs. indoor scene dichotomy. Other work with CBIR has incorporated...
Conference Paper
Full-text available
In parallel Information Retrieval (IR) systems, the query response time is limited by the time of the slowest node in the system, so distributing the load equally across the nodes is a very important issue. In this paper, we propose improving the load balance for term-based partitioning by classifying the terms based on their length and then distributing them equally across no...
Article
Kilgarriff (2001) gives a number of reasons for comparing corpora which are relevant to the theme of this workshop. In particular, he considers the difficulty, measurable in terms of time and cost, in porting a new corpus to an existing NLP system. Different types of corpora which have been compared in the past include samples of English spoken in...
Conference Paper
Full-text available
This paper describes an extension of our work presented in the robust English-to-French bilingual task of the CLEF 2007 workshop, a knowledge-light approach for query translation in Cross-Language Information Retrieval systems. Our work is based on the direct translation of character n-grams, avoiding the need for word normalization during indexing...
Conference Paper
Full-text available
This paper describes the technique for translation of character n-grams we developed for our participation in CLEF 2006. This solution avoids the need for word normalization during indexing or translation, and it can also deal with out-of-vocabulary words. Since it does not rely on language-specific processing, it can be applied to very different l...
Conference Paper
Full-text available
This paper describes a new technique for the direct translation of character n-grams for use in Cross-Language Information Retrieval systems. This solution avoids the need for word normalization during indexing or translation, and it can also deal with out-of-vocabulary words. This knowledge-light approach does not rely on language-specific process...
Chapter
Full-text available
Wilks’ early English/French Machine Translation system was based on a notion called Preference Semantics. There were two key components of Preference Semantics. First was the notion of combining elementary meaning units of some kind (in Wilks’ case effectively surrogates for Roget thesaurus’ categories) in structures of arbitrary complexity and fin...
Conference Paper
Full-text available
A new Relevance Feedback (RF) technique called Weight Propagation has been developed which provides greater retrieval effectiveness and computational efficiency than previously described techniques. Documents judged relevant by the user propagate positive weights to documents close by in vector similarity space, while documents judged not relevant...
Article
Full-text available
This work is an extension of our proposal originally presented in CLEF 2006, which, unfortunately, could not be ready on time for the workshop. We describe here a knowledge-light approach for query translation in Cross-Language Information Retrieval systems. This proposal itself can be considered as an extension of the previous work of the Johns H...
Conference Paper
Full-text available
cameras with software tools for personal image management. We originally developed a time and location based clustering model to produce a browsing tool for collections of images stamped with GPS metadata [1]. The idea was to group the images into discrete events, where two images belonged to the same event if they were taken at similar times in ne...
Conference Paper
Full-text available
Improving the accuracy of assigning new email messages to small folders can reduce the likelihood of users creating duplicate folders for some topics. In this paper we present a hybrid classification model, PERC, and use the Enron Email Corpus to investigate the performance of kNN, SVM and PERC in a simulation of a real-time situation. Our result...
Conference Paper
Full-text available
In this paper we consider episodic memory for system design in image retrieval. Time and location are the main factors in episodic memory, and these types of data were combined for image event clustering. We conducted a user study to compare five image browsing systems using searching time and user satisfaction as criteria for success. Our result...
Conference Paper
Full-text available
A new Relevance Feedback (RF) technique is developed to improve upon the efficiency and performance of existing techniques. This is based on propagating positive and negative weights from documents judged relevant and not relevant respectively, to other documents, which are deemed similar according to one of a number of criteria. The performance an...
Article
Full-text available
The chi-squared test is used to find the vocabulary most typical of seven different ICAME corpora, each representing the English used in a particular country. In a closely related study, Leech and Fallon (1992, Computer corpora - what do they tell us about culture? ICAME Journal, 16: 29-50) found differences in the vocabulary used in the Brown Corp...
Article
Full-text available
In this, our first joint participation as the CoLesIR group, our team has participated in the Portuguese monolingual ad-hoc task and in all robust ad-hoc tasks: all monolingual tasks, the English-to-German bilingual task, and the multilingual task. We have developed an n-gram model inspired by the previous work of the Johns Hopkins University Appli...
Conference Paper
Full-text available
Classical Information Retrieval (IR) is the sifting out of the documents most relevant to a user's information requirement (expressed as a "query"), from a large electronic store of documents. A search engine performs IR by retrieving relevant web pages from the internet. Rather than regarding foreign-language documents simply as unwanted "noise",...
Conference Paper
Full-text available
This paper describes the text mining of personal document collections in order to learn the categories of the documents in the collection, and to assign a suitable text label to each category. In the first experiment we make use of a pre-classified collection of documents from which we extract a text label for each category. In the second experime...
Conference Paper
In our project AudioMine we wish to address the problem that we need to understand more of the underlying factors influencing which patients would benefit from being fitted with a hearing aid. We describe some results from our pilot study, in which two data mining techniques, the chi-squared test and self-organising maps, were used to discover asso...
Conference Paper
We report on the results of a pilot study in which a data-mining tool was developed for mining audiology records. The records were heterogeneous in that they contained numeric, category and textual data. The tools developed are designed to observe associations between any field in the records and any other field. The techniques employed...
Article
This paper describes the use of Ant Colony Optimisation for the classification of works of disputed authorship, in this case the Federalist Papers. Classification accuracy was 79.1%, which compares reasonably well with previous work on the same data set using neural networks and genetic algorithms. Although statistical approaches have performed much...
Chapter
We report on the results of a pilot study in which a data-mining tool was developed for mining audiology records. The records were heterogeneous in that they contained numeric, category and textual data. The tools developed are designed to observe associations between any field in the records and any other field. The techniques employed were the st...
Conference Paper
The aim of this project is the automatic conversion of query terms in one language into their equivalents in a second, historically related, language, so that documents in the second language can be retrieved. The method is to compile lists of regular sound changes which occur between related words of a language pair, and substitute these in the so...
Conference Paper
Full-text available
Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the application of automated word sense disambiguation, the question remains as to whether less than 90...
Article
Full-text available
In this paper we show how two standard outputs from information extraction (IE) systems -- named entity annotations and scenario templates -- can be used to enhance access to text collections via a standard text browser. We describe how this information is used in a prototype system designed to support information workers' access to a pharmaceutica...
Conference Paper
Full-text available
We introduce a method for document classification based on using the chi-square test to identify characteristic vocabulary of document classes.
Article
Full-text available
this paper is that of Gale and Church (1993). Once the texts have been aligned, they can then be displayed to the translator as required using Scott's (1996) "WordSmith" concordancing tool. Using WordSmith, sentences and their translations can be retrieved and shown to the translator if they contain specified words, phrases or word fragments. The p...
Article
Full-text available
In this paper we show how two standard outputs from information extraction (IE) systems - named entity annotations and scenario templates - can be used to enhance access to text collections via a standard text browser. We describe how this information is used in a prototype system designed to support information workers' access to a pharmaceutica...
Article
Full-text available
This first collection of selected articles from researchers in automatic analysis, storage, and use of terminology, and specialists in applied linguistics, computational linguistics, information retrieval, and artificial intelligence offers new insights on computational terminology. The recent needs for intelligent information access, automatic que...
Article
Full-text available
Even if no written records of a protolanguage remain, it is possible to estimate what some of the words in that language might have been, by comparison of its reflexes in the more recent daughter languages. This method of protolanguage reconstruction is called the ‘comparative method’, and is described by Crowley (1992, Chapter 5). Although long pr...
Article
Full-text available
GATE, a General Architecture for Text Engineering, aims to provide a software infrastructure for researchers and developers working in NLP. GATE has now been widely available for four years. In this paper, we review the objectives which motivated the creation of GATE and the functionality and design of the current system. We describe some of the wa...
Conference Paper
Full-text available
Our goal is the automatic abstraction of journal articles, initially in the field of crop protection. We build a set of templates against which the original text is compared. The templates are designed so that they match the text at points of high information content, where inferences can be made about which expressions best reflect the content of...
Article
Full-text available
We report on the design and construction of features of an automated query system which will assist pharmacologists who are not information specialists to access the Derwent Drug File (DDF) pharmacological database. Our approach was to first elucidate those search skills of the search intermediary which might prove tractable to automation. Modules...