About
69 Publications
22,384 Reads
1,172 Citations
Introduction
Scott Piao currently works at the School of Computing and Communications, Lancaster University, UK. His research interests span Natural Language Processing, Text Mining, Social Computing, Corpus Linguistics and their application to practical tasks.
Publications (69)
Electronic word-of-mouth communication in the form of online reviews influences people’s product or service choices. People use text features to add or emphasise feelings and emotions in their writing; such emphasis can take the form of capital letters, letter repetition, exclamation marks and emoticons. The existing literature has not paid sufficient a...
Plastic pollution is one of the most significant environmental issues in the world. The rapid increase of the cumulative amount of plastic waste has caused alarm, and the public have called for actions to mitigate its impacts on the environment. Numerous governments and social activists from various non-profit organisations have set up policies and...
We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central ar...
While the application of word embedding models to downstream Natural Language Processing (NLP) tasks has been shown to be successful, the benefits for low-resource languages are somewhat limited due to a lack of adequate data for training the models. However, NLP research efforts for low-resource languages have focused on constantly seeking ways to ha...
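To make the training step concrete, here is a minimal sketch using the gensim library; the tokenised sentences, the language they stand in for, and all parameter values are placeholders of my own, not the data or settings used in this work.

```python
from gensim.models import Word2Vec

# Hypothetical tokenised sentences standing in for a small low-resource corpus;
# in practice these would be read from corpus files, not listed inline.
sentences = [
    ["mwana", "anapita", "kusukulu"],
    ["mwana", "anawerenga", "buku"],
]

# Small vector size and min_count=1 because the illustrative corpus is tiny.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=20)

# Nearest neighbours of a word in the learned embedding space.
print(model.wv.most_similar("mwana", topn=3))
```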
In many areas of academic publishing, there is an explosion of literature, and subdivision of fields into subfields, leading to stove-piping where sub-communities of expertise become disconnected from each other. This is especially true in the genetics literature over the last 10 years where researchers are no longer able to maintain knowledge of p...
Automatic semantic annotation of natural language data is an important task in Natural Language Processing, and a variety of semantic taggers have been developed for this task, particularly for English. However, for many languages, particularly for low-resource languages, such tools are yet to be developed. In this paper, we report on the developme...
The poster is at: http://wp.lancs.ac.uk/btm/2017/09/15/poster-presented-at-iges-2017-international-genetic-epidemiology-society/
Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language da...
MLCT is a tool for building, processing and analysing multilingual corpora. It has a range of functionalities such as text formatting, searching, substituting, frequency extraction, collocation extraction and concordancing. For further details of this tool, see website: https://sites.google.com/site/scottpiaosite/software/mlct.
This poster seeks to describe the creation of a Spanish lexicon with semantic annotation in order to analyse more extensive corpora in the Spanish language. The semantic resources most employed nowadays are WordNet, FrameNet, PDEV and USAS, but they have been used mainly for English language research. The creation of a large Spanish lexicon will pe...
Technology advancement in social media software allows users to include elements of visual communication in textual settings. Emoticons are widely used as visual representations of emotion and body expressions. However, the assignment of values to the “emoticons” in current sentiment analysis tools is still at a very early stage. This paper present...
There are various factors that affect the sentiment level expressed in textual comments. Capitalization of letters tends to mark something for attention and repeating of letters tends to strengthen the emotion. Emoticons are used to help visualize facial expressions which can affect understanding of text. In this paper, we show the effect of the nu...
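As a rough illustration of the kind of surface features these studies draw on, the sketch below counts capitalised words, repeated letters, exclamation marks and a few common emoticons in a comment. It is purely illustrative; the feature names, the emoticon list and the example comment are my own assumptions, not the authors' implementation.

```python
import re

# A small, assumed set of emoticons; real inventories are much larger.
EMOTICONS = [":)", ":-)", ":(", ":-(", ":D", ";)", ":P"]

def emphasis_features(text):
    """Count surface cues often used to emphasise sentiment in comments."""
    tokens = text.split()
    return {
        # Fully capitalised words of length >= 2, e.g. "GREAT"
        "caps_words": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        # Letters repeated three or more times, e.g. "soooo"
        "letter_repeats": len(re.findall(r"(\w)\1{2,}", text)),
        "exclamations": text.count("!"),
        "emoticons": sum(text.count(e) for e in EMOTICONS),
    }

print(emphasis_features("This phone is GREAT, I love it soooo much!!! :)"))
# {'caps_words': 1, 'letter_repeats': 1, 'exclamations': 3, 'emoticons': 1}
```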
The use of metaphor in popular science is widespread to aid readers’ conceptions of the scientific concepts under discussion. Almost all research in this area has been done by careful close reading of the text(s) in question, but this article describes—for the first time—a digital ‘distant reading’ analysis of popular science, using a system create...
We hypothesise that it is possible to determine a fine-grained set of sentiment values over and above the simple three-way positive/neutral/negative or binary Like/Dislike distinctions by examining textual formatting features. We show that this is possible for online comments about ten different categories of products. In the context of online shop...
This paper presents our research on the feasibility of extracting Twitter users' interests for suggesting serendipitous connections using natural language processing (NLP) technology. Defined by Andel [1] as the art of making an unsought finding, serendipity has a positive role in scientific research and people's daily lives. Applications that faci...
Much has been documented in the literature on sentiment analysis and document summarisation. Much of this applies to long, structured text in the form of documents and blog posts. With a shift in social media towards short commentary (see Facebook status updates and Twitter tweets), the difference in comment structure may affect the accuracy of sent...
Today, academic researchers face a flood of information. Full text search provides an important way of finding useful information from mountains of publications, but it generally suffers from low precision, or low quality of document retrieval. A full text search algorithm typically examines every word in a given text, trying to find the query word...
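To illustrate the precision problem this abstract starts from, the toy example below shows a bag-of-words keyword search matching documents regardless of word sense. The document collection and the search function are invented for illustration and are unrelated to the system described.

```python
# Toy collection: only one document uses 'java' in the intended (programming) sense.
DOCS = {
    1: "The Java programming language runs on the JVM",
    2: "Java is an island of Indonesia with active volcanoes",
    3: "I drank a strong cup of java this morning",
}

def keyword_search(query, docs):
    """Return ids of documents containing every query word (bag-of-words match)."""
    q = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if q <= set(text.lower().split())]

print(keyword_search("java", DOCS))  # [1, 2, 3]: all match, two are irrelevant
```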
In this paper, we propose a corpus annotation scheme and lexicon for Chinese kinship terms. We modify existing traditional Chinese kinship schemes into a comprehensive semantic field framework that covers kinship semantic categories in contemporary Chinese. The scheme is inspired by the Lancaster USAS (UCREL Semantic Analysis System) taxonomy, whic...
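To make the idea of a kinship lexicon concrete, here is a toy sketch; the gloss strings and hierarchical field codes below are invented for illustration and are not the actual categories of the proposed scheme or of the USAS taxonomy.

```python
# Toy kinship lexicon: term -> (gloss, hypothetical semantic field code).
# Real schemes distinguish more dimensions (paternal/maternal, elder/younger, etc.).
KINSHIP_LEXICON = {
    "爸爸": ("father", "KIN/PARENT/MALE"),
    "妈妈": ("mother", "KIN/PARENT/FEMALE"),
    "哥哥": ("elder brother", "KIN/SIBLING/MALE/ELDER"),
    "妹妹": ("younger sister", "KIN/SIBLING/FEMALE/YOUNGER"),
    "外婆": ("maternal grandmother", "KIN/GRANDPARENT/FEMALE/MATERNAL"),
}

def tag_kinship(tokens):
    """Attach a kinship field code to any token found in the lexicon."""
    return [(t, KINSHIP_LEXICON.get(t, (None, None))[1]) for t in tokens]

print(tag_kinship(["我", "的", "哥哥", "和", "外婆"]))
```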
Automatic sentiment analysis is an important and challenging topic in Human Language Technology (HLT) and text mining, with several applications for social sciences. Over recent years, much effort has been devoted to this subject. Many published works on this subject employ various machine learning techniques. In our work, we investigate the feasib...
It is a challenging task to match similar or related terms/expressions in NLP and Text Mining applications. Two typical areas in need of such work are terminology and ontology construction, where terms and concepts are extracted and organized into certain structures with various semantic relations. In the EU BOOTSTrep Project we test various tech...
This chapter aims at bridging the functionalist theoretical perspective on word usage with corpus-based studies. We are dealing with the issue of construction of reliable lists of what is called 'phraseological units' in general linguistics literature or 'multi-word expressions' (MWEs) in literature on computational linguistics. The two groups of c...
Annotation of information in corpora is an important aspect of text mining. It bridges between the information hidden in natural language texts and the semantic search queries for the information desired by users. Due to the complex nature of the information needed for text mining, it is essential to design comprehensive annotation schemes to encod...
Since the manual construction of ontologies is time-consuming and expensive, an increasing number of initiatives to ease the construction by automatic or semi-automatic means have been published. Most initiatives combine a certain level of NLP techniques with machine learning approaches to find concepts and relationships. However, a challenging iss...
In this paper, we discuss the issue of implementing the interoperability of natural language annotation tools for text mining with the Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004; http://incubator.apache.org/uima). In particular, we discuss the practical issue of designing UIMA annotation schemes for text minin...
In this paper we investigate variation in the translation of three vague quantifiers, many, some and a few between English and Chinese. Studies of 'linguistic vagueness' (sometimes called 'language vagueness' in previous research) regard vagueness as a general phenomenon in language. In this type of study, vagueness is often discussed – but not lim...
We introduce an annotation type system for a data-driven NLP core system. The specifications cover formal document structure and document meta information, as well as the linguistic levels of morphology, syntax and semantics. The type system is embedded in the framework of the Unstructured Information Management Architecture (UIMA).
Opinion mining has been receiving increasing attention recently, and various approaches have been suggested for mining sentiment information, such as mining attitudes or opinions about a topic or product etc. However, as far as we know, little work has been reported on citation
This paper reports on an experiment in which we explore a new approach to the automatic measurement of multi-word expression (MWE) compositionality. We propose an algorithm which ranks MWEs by their compositionality relative to a semantic field taxonomy based on the Lancaster English semantic lexicon (Piao et al., 2005a). The semantic information p...
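A crude sketch of the general idea, not the actual algorithm of Piao et al.: score an MWE as more compositional when its constituent words share semantic field tags with the expression as a whole. The mini-lexicon and the tag assignments below are invented for illustration.

```python
# Invented mini-lexicon: word or MWE -> set of semantic field tags.
LEXICON = {
    "kick the bucket": {"L1-"},        # death
    "kick": {"E3-", "M1"},             # violence, movement
    "bucket": {"O2"},                  # objects
    "red wine": {"F2"},                # drinks
    "red": {"O4.3"},                   # colour
    "wine": {"F2"},                    # drinks
}

def compositionality(mwe):
    """Fraction of constituent words sharing a tag with the whole expression."""
    mwe_tags = LEXICON.get(mwe, set())
    words = mwe.split()
    if not mwe_tags or not words:
        return 0.0
    shared = sum(1 for w in words if LEXICON.get(w, set()) & mwe_tags)
    return shared / len(words)

print(compositionality("red wine"))         # 0.5: 'wine' shares F2, 'red' does not
print(compositionality("kick the bucket"))  # 0.0: no constituent shares a tag
```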
The problem we address in this paper is that of providing contextual examples of translation equivalents for words from the general lexicon using comparable corpora and semantic annotation that is uniform for the source and target languages. For a sentence, phrase or a query expression in the source language the tool detects the semantic type of t...
In this paper, we report on our experiment to extract Chinese multiword expressions from corpus resources as part of a larger research effort to improve a machine translation (MT) system. For existing MT systems, the issue of multiword expression (MWE) identification and accurate interpretation from source to target language remains an unso...
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction still remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL S...
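For readers unfamiliar with the statistically driven side of MWE extraction, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), one of the standard association measures; it is illustrative only and is not the specific method evaluated in this paper. The sample token list is invented.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        p_xy = c / (n - 1)
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("machine translation is hard but machine translation helps "
          "and translation is useful").split()
print(pmi_bigrams(tokens))  # ('machine', 'translation') ranks among the top pairs
```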
This paper reports on the current status and evaluation of a Finnish semantic tagger (hereafter FST), which was developed in the EU-funded Benedict Project. In this project, we have ported the Lancaster English semantic tagger (USAS) to the Finnish language. We have re-used the existing software architecture of USAS, and applied the same semantic f...
Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Different from other major semantic lexicons in existence, such as WordNet, EuroWordNet and HowNet, in which lexemes are clustered and linked via the relationsh...
A phraseological expression in a language may have equivalent expressions in other languages with different morpho-syntactic structures and semantic properties. Our recent experience in the Benedict Project (EU IST-2001-34237), in which a Finnish semantic lexicon compatible with the Lancaster English semantic lexicon (Rayson et al., 2004) has been bu...
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical re...
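The sketch below illustrates the basic lexicon-lookup step behind a tagger of this kind. The tag codes loosely follow USAS notation, but the mini-lexicon and the single-tag-per-word simplification are my own assumptions; the real system also handles multi-word expressions, part-of-speech information and disambiguation.

```python
# Tiny illustrative lexicon: word -> candidate semantic field tags,
# most likely tag first. Not the real USAS lexicon.
SEM_LEXICON = {
    "doctor": ["B3", "P1"],      # medicine; education
    "visited": ["M1", "S3.1"],   # movement; social relations
    "hospital": ["B3", "H1"],    # medicine; buildings
    "the": ["Z5"],               # grammatical words
}

def usas_like_tag(tokens):
    """Assign the first (most likely) candidate tag; Z99 marks unknown words."""
    return [(t, SEM_LEXICON.get(t.lower(), ["Z99"])[0]) for t in tokens]

print(usas_like_tag("The doctor visited the hospital yesterday".split()))
# [('The', 'Z5'), ('doctor', 'B3'), ('visited', 'M1'),
#  ('the', 'Z5'), ('hospital', 'B3'), ('yesterday', 'Z99')]
```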
Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted fr...
Annotation schemes for semantic field analysis use abstract concepts to classify words and phrases in a given text. The use of such schemes within lexicography is increasing. Indeed, our own UCREL semantic annotation system (USAS) is to form part of a web-based 'intelligent' dictionary (Herpiö 2002). As USAS was originally designed to enable automa...
Text reuse is commonplace in academia and the media. An efficient algorithm for automatically detecting and measuring similar/related texts would have applications in corpus linguistics, historical studies and natural language engineering. In an effort to explore the issue of text reuse, a tool, named Crouch, has been developed based on the TESA...
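As a rough illustration of how verbatim reuse can be quantified, the sketch below computes trigram containment between a candidate text and a source text. It is not the algorithm behind the tool described above; the function names and the example sentences are invented.

```python
def ngrams(tokens, n=3):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(candidate, source, n=3):
    """Share of the candidate's word trigrams that also occur in the source."""
    cand = ngrams(candidate.lower().split(), n)
    src = ngrams(source.lower().split(), n)
    return len(cand & src) / len(cand) if cand else 0.0

pa = "The minister resigned on Tuesday after weeks of pressure from backbenchers"
paper = "After weeks of pressure from backbenchers the minister resigned on Tuesday"
print(round(containment(paper, pa), 2))  # 0.78: most trigrams are reused verbatim
```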
The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from...
Automatic extraction of multiword expressions (MWE) presents a tough challenge for the NLP community and corpus linguistics. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction still remains an unsolved issue. In this paper, we present our research work in which we tested appr...
Semantic annotation is an important and challenging issue in corpus linguistics and language engineering. While such a tool is available for English in Lancaster (Wilson and Rayson 1993), few such tools have been reported for other languages. In a joint Benedict project funded by the European Community under the 'Information Society Technologies Pr...
In this paper we present results from the METER (MEasuring TExt Reuse) project whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation using existing computa...
Word alignment in bilingual or multilingual parallel corpora has been a challenging issue for natural language engineering. An efficient algorithm for automatically aligning word translation equivalents across different languages will be of use for a number of practical applications such as multilingual lexical construction, machine translation, et...
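A simplified sketch of the co-occurrence idea behind many word-alignment methods, not the algorithm presented in this paper: score candidate translation pairs by the Dice coefficient over aligned sentence pairs. The tiny English-Spanish "corpus" below is invented for illustration.

```python
from collections import Counter
from itertools import product

# Invented aligned sentence pairs standing in for a parallel corpus.
PARALLEL = [
    ("the house is red".split(), "la casa es roja".split()),
    ("the house is big".split(), "la casa es grande".split()),
    ("the dog is big".split(), "el perro es grande".split()),
]

def dice_scores(pairs):
    """Dice coefficient for every source/target word pair seen together."""
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_freq.update(set(src))
        tgt_freq.update(set(tgt))
        co_freq.update(product(set(src), set(tgt)))
    return {
        (s, t): 2 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in co_freq.items()
    }

scores = dice_scores(PARALLEL)
print(round(scores[("house", "casa")], 2))    # 1.0: always co-occur
print(round(scores[("house", "grande")], 2))  # 0.5: weaker association
```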
In this paper we present the METER Corpus, a novel resource for the study and analysis of journalistic text reuse. The corpus consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers. In some cases the newspaper stor...
As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addi...
Besides measuring verbatim copy of text, which is a basic approach to detecting text reuse, another important issue in METER project is to detect textual rewrites in which PA copy is changed or modified. Because reporters tend to conform to a particular style of a given newspaper, or have to modify source copy to fit into a limited newspaper space,...