May 2024 · 7 Reads · 1 Citation
April 2018 · 10 Reads · 4 Citations
We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 x 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 x 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
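The core measurement the abstract describes — tallying URIs that match persistent-identifier prefixes across a stream of crawled links — can be sketched as follows. The prefix list, the function name `count_pids`, and the flat-list input format are illustrative assumptions, not the authors' actual tooling:

```python
# Hypothetical sketch: tally persistent-identifier (PID) URIs in a stream
# of links extracted from a web crawl. The prefixes below are common PID
# resolvers chosen for illustration only.
from collections import Counter

PID_PREFIXES = (
    "https://doi.org/",
    "http://dx.doi.org/",
    "http://hdl.handle.net/",
    "http://purl.org/",
)

def count_pids(uris):
    """Return per-prefix counts for URIs that start with a PID prefix."""
    counts = Counter()
    for uri in uris:
        for prefix in PID_PREFIXES:
            if uri.startswith(prefix):
                counts[prefix] += 1
                break  # a URI matches at most one prefix
    return counts

links = [
    "https://doi.org/10.1000/xyz123",
    "http://example.com/page",
    "http://hdl.handle.net/2027/mdp.39015",
]
print(count_pids(links))
```

At the scale the paper reports (10^12 URIs), such a scan would of course be distributed over the crawl's link-metadata files rather than run over an in-memory list; the sketch only shows the per-URI classification step.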
January 2018 · 12 Reads · 1 Citation
We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 x 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 x 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
May 2013 · 27 Reads
Agreement on exactly how to understand the use of URIs in data to provide information has been hard to come by. As no consensus seems likely to emerge, we propose instead to improve interoperability by standardizing metadocumentation, without taking a position on any of the underlying debates about standards or terminology.
January 2008 · 3 Reads · 1 Citation
Lecture Notes in Computer Science
In 2004 the Technical Architecture Group (TAG) of the World Wide Web Consortium (W3C) published The Architecture of the World Wide Web (AWWW) [1]. This document was the result of a careful after-the-fact analysis of the distinctive properties of the distributed information system known as the Web. By describing these properties, the TAG was able to define a number of ‘principles’, ‘constraints’ and ‘best practices’ which users and authors of and for the Web should observe in order to preserve and enhance its value to everyone.
November 1999 · 15 Reads · 4 Citations
Technology in Society
We can identify three areas of interaction between our understanding of computer systems and moral and spiritual issues: (a) the moral and technical issues involved in empowering computer systems in contexts with significant impact, direct or indirect, on human well-being; (b) the scientific/technical questions in the way of introducing an explicit moral sensibility into computer systems; (c) the theological insights to be gained from a consideration of decision-making in existing and envisageable computers. In this article we make this concrete by reference to the parable of the Good Samaritan: we imagine that the innkeeper fetched a barefoot doctor for the injured man, who consulted a medical expert system via a satellite up-link; that the robbers were caught and brought before an automated justice machine; that the Samaritan was in fact a robot; and finally that Paul himself rethought the significance of the parable on the basis of this reformulation. We also use examples from Isaac Asimov's stories about his "Three Laws of Robotics" to tease out pre-existing understanding of computational morality and the insights its consideration may offer into our own moral self-understanding.
March 1997 · 6 Reads
Journal of Linguistics
January 1996 · 5 Reads
International Journal of Corpus Linguistics
An overview is given of work on the creation, collection, preparation, and publication of electronic corpora of written and spoken language undertaken at the Human Communication Research Centre at the Universities of Edinburgh and Glasgow. Four major efforts are described: the HCRC Map Task Corpus, the ECI/MC1, the MLCC project and work on document architectures and processing regimes for SGML-encoded corpora.
January 1993 · 765 Reads · 32 Citations
The HCRC Map Task corpus has been collected and transcribed in Glasgow and Edinburgh, and recently published on CD-ROM. This effort was made possible by funding from the British Economic and Social Research Council. The corpus is composed of 128 two-person conversations in both high-quality digital audio and orthographic transcriptions, amounting to 18 hours and 150,000 words respectively. The experimental design is quite detailed and complex, allowing a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. The corpus is a uniquely valuable resource for speech recognition research in particular, as we move from developing systems intended for controlled use by familiar users to systems intended for less constrained circumstances and naive or occasional users. Examples supporting this claim are given, including preliminary evidence of the phonetic consequences of second mention and the impact of different styles of referent negotiation on communicative efficacy.
October 1991 · 167 Reads · 1,028 Citations
Language and Speech
This paper describes a corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels. The corpus uses the Map Task (Brown, Anderson, Yule, and Shillcock, 1983) in which speakers must collaborate verbally to reproduce on one participant's map a route printed on the other's. In all, the corpus includes four conversations from each of 64 young adults and manipulates the following variables: familiarity of speakers, eye contact between speakers, matching between landmarks on the participants’ maps, opportunities for contrastive stress, and phonological characteristics of landmark names. The motivations for the design are set out and basic corpus statistics are presented.
... Both papers also conclude that tabular chart parsers do not seem to be appropriate for parallelization on the Meiko. Thompson (1991) presents two implementations of a parallel parser for two different systems. The target architecture for the first implementation was the Connection Machine, a large scale SIMD machine. ...
January 1991
... A rapid rate of technical progress leaves formal standardization efforts slow to catch up, if the standards are formulated by relatively slow moving and deliberate standard-setting bodies. In the case of web services, the underlying technologies are relatively new and still evolving, with some apprehension that the technology evolution is still trying to catch up to the marketing hype (Thompson 2000). There is also a need to create consensus across multiple stakeholders among different organizations that are impacted by the standards. ...
... In Natural Language research, software systems implementing a particular linguistic theory or formalism have been used for three distinct tasks:
• during the development of a linguistic theory, to ensure that the theory and analyses based on it remain consistent when they are modified, and during investigation of the theory, to help elucidate the manner in which the various aspects of the theory interact with one another;
• to provide a formalism for encoding linguistic information in a uniform way, in order to be able to compare and evaluate alternative linguistic theories; and
• to support the development of large, wide-coverage NL grammars.
Examples of the first of these types of system include the Grammar-Writer's Workbench (Kiparsky, 1985) for LFG, and ProGram (Evans, 1985) and GPSGP (Phillips & Thompson, 1985) for versions of GPSG. These systems are direct computational implementations of particular versions of the respective linguistic theories, and thus cannot be used to compare alternative theories since, in general, a grammar expressed in one theory cannot be encoded in exactly the same way in another. ...
January 1985
Linguistics
... The Yi language belongs to the Yi branch of the Tibeto-Burman group of the Sino-Tibetan family, and has six dialects. With the IPA (Thompson, Henry S, 1991) (Meng Zunxian et al, 1991), its 47 consonants, 12 vowels, 3 tones and tight-vowel symbol can accurately record the Yi language (Minnis, Stephen, 1991). The total population of China using the Yi language is more than 650 million, with a population of more than 400 million in Yunnan Province (Nirenburg, S. ed, 1987), accounting for 65% of the total population. ...
... While this diversity is valuable, the uncontrolled recording environments often result in inconsistent audio quality. The HCRC Map Task Corpus [19] approached conversational speech collection through a structured task-based framework, recording 128 unscripted dialogues where participants collaborated on map-related tasks. However, this task-oriented approach may not fully reflect natural conversation patterns. ...
October 1991
Language and Speech
... • Intuitive elicitation Task: The goal is to elicit an exchange that would be as representative as possible of child-parent spontaneous conversations. Researchers have traditionally used physical prompts to elicit conversations, such as the maze game, the map task, or the spot-the-difference tasks (Anderson et al., 1993;Garrod and Anderson, 1987;Van Engen et al., 2010). However, we realized in piloting that such prompts tend to absorb children's attention, making the face-to-face multimodal interaction sub-optimal. ...
January 1993
... Another measure, duration-independent overlap rate [18] was employed to evaluate alignment accuracy. For this measure, 100 randomly selected word instances were manually time aligned across the corpus (50 within the segments that were used for MAP adaptation, and 50 in the segments that were used for evaluation only). ...
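One common formulation of a duration-independent overlap measure between a reference and a hypothesized word alignment is the intersection-over-union of the two time intervals; whether this matches the exact measure cited as [18] above is an assumption, and the function name `overlap_rate` is illustrative:

```python
def overlap_rate(ref, hyp):
    """Intersection-over-union of two (start, end) time intervals.

    One common 'duration-independent' formulation: the shared duration
    divided by the total span covered by either interval, so the score
    is in [0, 1] regardless of the words' absolute durations.
    """
    (rs, re_), (hs, he) = ref, hyp
    common = max(0.0, min(re_, he) - max(rs, hs))   # shared duration
    union = (re_ - rs) + (he - hs) - common          # duration of the union
    return common / union if union > 0 else 0.0

# A hypothesis shifted half a word off the reference:
print(overlap_rate((0.0, 1.0), (0.5, 1.5)))
```

Identical intervals score 1.0, disjoint intervals score 0.0, which makes the measure easy to average over a sample of manually aligned words such as the 100 instances described above.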
January 1991
... Regular expression composition is very similar to composition of finite state transducers [37]. Sets A and B represent, respectively, the input and the output of the first transducer; sets B and C represent, respectively, the input and the output of the second transducer. ...
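The set-based description above — pairs from A×B composing with pairs from B×C to yield pairs from A×C — is relational composition, the operation underlying transducer composition. A toy illustration over finite relations (not any particular FST library):

```python
# Relational composition: (a, b) in r1 and (b, c) in r2 yield (a, c).
# This is the finite, set-valued analogue of composing two transducers
# whose shared alphabet is the 'B' side.
def compose(r1, r2):
    """Compose two relations given as sets of (input, output) pairs."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

t1 = {("cat", "CAT"), ("dog", "DOG")}        # relation from A to B
t2 = {("CAT", "feline"), ("DOG", "canine")}  # relation from B to C
print(compose(t1, t2))
```

Real transducer composition does the same matching lazily over paths through two automata rather than over enumerated pairs, but the input/output bookkeeping is exactly this.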
January 1988
... Since a left recursive structure can be arbitrarily nested, we cannot predict the correct connection-path incrementally. There are a few practical and psycholinguistically motivated solutions in the literature [19], but in the current work we have resorted to an immediate approach which is extensively implemented in the Penn Treebank schema: namely we flatten the tree structure and avoid the left recursion issue altogether. Consider as an example the application of the flattening procedure to a local tree like 1 that produces as a result a tree like 2: ...
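The flattening step described — splicing a left-recursive child's daughters directly under the parent — can be sketched on trees represented as nested lists with the label first. This is an illustrative toy, not the actual Penn Treebank tooling:

```python
# Sketch of flattening left recursion: a tree like
#   (NP (NP the man) (PP with a hat))
# becomes the flat
#   (NP the man (PP with a hat)).
# Trees are nested lists whose first element is the node label.
def flatten_left_recursion(tree):
    if not isinstance(tree, list):
        return tree  # a leaf (word) is returned unchanged
    label = tree[0]
    children = [flatten_left_recursion(c) for c in tree[1:]]
    # If the leftmost child bears the same label, splice its daughters in.
    if children and isinstance(children[0], list) and children[0][0] == label:
        children = children[0][1:] + children[1:]
    return [label] + children

t = ["NP", ["NP", "the", "man"], ["PP", "with", "a", "hat"]]
print(flatten_left_recursion(t))
```

Because children are flattened bottom-up before the splice, arbitrarily deep left-recursive nesting collapses into a single flat level, which is what removes the incremental-prediction problem the snippet describes.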
January 1991
... These tests and simulations, taken together, constitute a high level of confidence (Thompson, 1985) (Cherniak, 1988). Fletcher (1984) gave 10 million 'lines of code' as an upper bound for the entire SDI project. ...
January 1985