Article

Xara: an XML aware tool for corpus searching


Abstract

From SARA to Xara. Xara is the working name for a new version of SARA, the 'SGML aware retrieval application' originally developed for use with the British National Corpus (BNC) in 1994. The system has been completely rewritten as a general purpose tool for searching large XML corpora, with a particular focus on the needs of corpus linguists, with close attention to new XML-based encoding standards, and with the benefit of hindsight derived from a decade of feedback from hundreds of SARA users worldwide. The Xara system combines the following components: (1) an indexer, which creates inverted file style indexes to a large collection of discrete XML documents; (2) a server, which handles all interaction between the client programs and the data files; (3) a Windows client, which handles interaction between the server and the user. The modularity of this architecture has several advantages, permitting, for example, the development of multiple specialized client programs for different applications or styles of usage. In addition, an index building utility, called Indextools, is supplied with Xara, which simplifies the process of constructing a Xara database. Its chief function is to collect information about the corpus beyond that already present in any pre-existing corpus header, and to produce a validated and extended form of the corpus header. It can also be used to run the indexer and test its output.
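As a rough illustration of what an "inverted file style" index over discrete XML documents involves, the following minimal Python sketch maps each word to the documents and token positions where it occurs. It is not Xara's indexer: the function name `build_inverted_index`, the sample documents, and the simple word-splitting rule are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to a list of (doc_id, position) postings.

    `documents` is a dict of doc_id -> XML string (hypothetical input shape).
    """
    index = defaultdict(list)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # Concatenate all text nodes, ignoring the markup itself.
        text = " ".join(root.itertext())
        # Naive tokenisation: runs of word characters (an assumption).
        for position, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token].append((doc_id, position))
    return dict(index)


# Two tiny illustrative documents (not taken from any real corpus).
docs = {
    "d1": "<text><s>The corpus is large.</s></text>",
    "d2": "<text><s>Search the corpus quickly.</s></text>",
}
index = build_inverted_index(docs)
print(index["corpus"])
```

A lookup in such an index is a dictionary access, so a server component can answer word queries without rescanning the documents; a real indexer would also record element context so that searches can be restricted by markup.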


... Only a few tools have as yet been developed for TimeML, mostly focusing on the annotation task, such as TTK (Verhagen and Pustejovsky, 2008), which does not support analysis. From the second category, in the absence of other software, the TimeML-using community is restricted to generic XML analysis tools, such as Xaira (Burnard and Dodd, 2003) or LT-XML 1 , as well as similar format-specific tools (TEI). These generic corpus tools are powerful applications, but require substantial effort to apply to TimeML data. ...
Article
We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and also validates a temporal annotation to ensure that it is logically consistent and sufficiently annotated. Uniquely, CAVaT provides analysis specific to TimeML-annotated temporal information. TimeML is a standard for annotating temporal information in natural language text. In this paper, we present the reporting part of CAVaT, and then its error-checking ability, including the workings of several novel TimeML document verification methods. This is followed by the execution of some example tasks using the tool to show relations between times, events, signals and links. We also demonstrate inconsistencies in a TimeML corpus (TimeBank) that have been detected with CAVaT.
... In Table 2 the key-cluster for each presidency is shown, with its raw frequency, its normalised (per thousand words) frequency and the keyness value. Having identified the key clusters, we carried out the remainder of the analysis using Xaira (Burnard and Dodd 2003), since it allows the occurrences of a cluster to be retrieved only in a specific utterance (utterances by the podium for the purpose of this analysis). The software also makes it possible to track each cluster variation in terms of frequency distribution in the five presidential terms, highlighting similarities and differences in terms of relative frequency of use. ...
... The concordance tools that make use of POS information substantially alleviate the tedious task of corpus exploration. Examples of such concordancers include SARA (Aston and Burnard, 1998), XAIRA (Burnard and Dodd, 2003), BNCweb (Hoffmann et al., 2008), and PIE (Fletcher, 2011), which are designed for accessing the POS-tagged version of the British National Corpus (BNC, 1994), or WordSmith (Scott, 2004), MonoConc (Barlow, 2004), and Wmatrix (Rayson, 2009), which can be used with other corpora as well. In this paper, we argue that, given the state of the art in language technologies, the linguistic pre-processing of corpora could also be performed at the syntactic level, not only at the lexical level. ...
Article
Concordancers are tools that display the contexts of a given word in a corpus. Also called key word in context (KWIC), these tools are nowadays indispensable in the work of lexicographers, linguists, and translators. We present an enhanced type of concordancer that integrates syntactic information on sentence structure as well as statistical information on word cooccurrence in order to detect and display those words from the context that are most strongly related to the word under investigation. This tool considerably alleviates the users' task, by highlighting syntactically well-formed word combinations that are likely to form complex lexical units, i.e., multi-word expressions. One of the key distinctive features of the tool is its multilingualism, as syntax-based multi-word expression detection is available for multiple languages and parallel concordancing enables users to consult the version of a source context in another language, when multilingual parallel corpora are available. In this article, we describe the underlying methodology and resources used by the system, its architecture, and its recently developed online version. We also provide relevant performance evaluation results for the main system components, focusing on the comparison between syntax-based and syntax-free approaches.
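The basic key word in context display that this abstract takes as its starting point can be sketched in a few lines of Python. This is a plain, syntax-free illustration only; it does not reproduce the syntax-based detection the authors describe, and the function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of `keyword`.

    `width` is the number of tokens shown on each side (hypothetical default).
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Illustrative sentence, split on whitespace.
tokens = "the cat sat on the mat near the door".split()
hits = kwic(tokens, "the", width=2)
for left, match, right in hits:
    # Right-align the left context so the key word lines up in a column.
    print(f"{left:>10} [{match}] {right}")
```

Aligning the key word in a single column is what makes recurring neighbours easy to spot by eye; a syntax-aware concordancer would instead rank or highlight context words by their grammatical relation to the key word.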
... Xaira is no longer tied to the one corpus, instead opting for a more generalised application. It takes advantage of Unicode and XML technologies to achieve this (Burnard and Dodd, 2003). At the time of writing, its first version is still in beta testing. ...
Article
There is currently a surge of activity surrounding Arabic corpus linguistics. As the number of available Arabic corpora continues to grow, there is an increasing need for robust tools that can process this data, whether for research or teaching. One such tool that is useful for both of these purposes is the concordancer - a simple tool for displaying a specified target word in its context. However, obtaining one that can reliably cope with the Arabic language had proved difficult. Also, there was a desire to add some novel features to the standard concordancer to enhance its usefulness within the classroom - easy-to-use root- and stem-based concordance and integration with corpus clustering algorithms are two examples. Therefore, aConCorde was created to provide such a tool to the community.
... CLAWS used to PoS-tag the LOB and BNC corpora (Leech et al. 1994); and retrieve data from a corpus, e.g. XAIRA, a web based concordance application, developed for use with the British National Corpus (BNC) (Burnard and Dodd 2003); or lemmatised and unlemmatised frequency lists generated by Kilgarriff (1996). In this paper, we present the chatbot system as a tool to explore or visualize different types of English language used in the BNC corpus in a qualitative manner in contrast to tools such as Wmatrix which visualises a corpus in terms of quantitative statistics. ...
... As EMILLE and LCMC are marked up respectively in SGML and XML, non-markup-aware concordancers will not allow users to easily exploit these corpora fully. Two Unicode-compliant markup-aware corpus tools that are available, Xara (Burnard and Dodd 2003) and WordSmith version 4 (Scott 2003), are at the final stage of beta testing at the moment and will be released soon. Using WordSmith 4 to explore the two corpora is quite straightforward, though the LCMC Corpus needs to be converted from UTF-8 to UTF-16 first using a built-in utility of WordSmith. ...
Article
This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University – the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and other corpus tools.
... Only a few tools have as yet been developed for TimeML, mostly focusing on the annotation task, such as TTK (Verhagen and Pustejovsky, 2008), which does not support analysis. From the second category, in the absence of other software, the TimeML-using community is restricted to generic XML analysis tools, such as Xaira (Burnard and Dodd, 2003) or LT-XML 1 , as well as similar format-specific tools (TEI). These generic corpus tools are powerful applications, but require substantial effort to apply to TimeML data. ...
Conference Paper
We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and also validates a temporal annotation to ensure that it is logically consistent and sufficiently annotated. Uniquely, CAVaT provides analysis specific to TimeML-annotated temporal information. TimeML is a standard for annotating temporal information in natural language text. In this paper, we present the reporting part of CAVaT, and then its error-checking ability, including the workings of several novel TimeML document verification methods. This is followed by the execution of some example tasks using the tool to show relations between times, events, signals and links. We also demonstrate inconsistencies in a TimeML corpus (TimeBank) that have been detected with CAVaT.
... Both GB2312 and Big5 are double-byte encoding systems. Although the original corpus texts were encoded in GB2312, we decided to convert the encoding to Unicode (UTF-8) for the following reasons: (1) to ensure the compatibility of a non-Chinese operating system and Chinese characters; (2) to take advantage of the latest Unicode-compliant concordancers such as Xara (Burnard and Dodd, 2003) and WordSmith Tools version 4.0. To make it more convenient for users of our corpus with an operating system earlier than Windows 2000 and no language support pack to use our data, we have produced a Romanized Pinyin version of the LCMC corpus in addition to the standard version containing Chinese characters. ...
Article
This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. We first discuss the major decisions we took when building the corpus. These relate to sampling, text collection, mark-up, and annotation. Following from this we use the corpus to study aspect marking in Chinese and British/American English. The study shows that although Chinese and English are typologically different, aspect markers in the two languages show a strikingly similar distribution pattern, especially across the two broad categories of narrative and expository texts. The study also reveals some important differences in the distribution of aspect markers in Chinese versus English and British versus American English across fifteen text categories, and provides an account of these differences.
Conference Paper
Arabic corpus linguistics is currently enjoying a surge in activity. As the growth in the number of available Arabic corpora continues, there is an increased need for robust tools that can process this data, whether it be for research or teaching. One such tool that is useful for both groups is the concordancer, a simple tool for displaying a specified target word in its context. However, obtaining one that can reliably cope with the Arabic language had proved extremely difficult. Therefore, aConCorde was created to provide such a tool to the community.