Henry S. Thompson’s research while affiliated with the University of Edinburgh and other places


Publications (20)


Improved methodology for longitudinal Web analytics using Common Crawl
  • Conference Paper

May 2024 · 7 Reads · 1 Citation

Henry S Thompson

Can Common Crawl Reliably Track Persistent Identifier (PID) Use Over Time?

April 2018 · 10 Reads · 4 Citations

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 × 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 × 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.
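The core measurement in this kind of study is classifying URIs extracted from crawl records by whether they use a persistent-identifier scheme. A minimal sketch of that classification step is below; the host list (`doi.org`, `hdl.handle.net`, etc.) and the function names are illustrative assumptions, not the criteria actually used in the papers.

```python
from urllib.parse import urlparse
from collections import Counter

# Hosts commonly associated with persistent identifiers.
# Illustrative list only -- the studies' actual criteria are not given here.
PID_HOSTS = {"doi.org", "dx.doi.org", "hdl.handle.net", "purl.org"}

def pid_kind(uri):
    """Return the PID host if the URI looks like a persistent identifier,
    otherwise None."""
    host = urlparse(uri).netloc.lower()
    return host if host in PID_HOSTS else None

def tally_pids(uris):
    """Count PID URIs by host, as one might over URIs pulled from
    Common Crawl link records."""
    counts = Counter()
    for uri in uris:
        kind = pid_kind(uri)
        if kind:
            counts[kind] += 1
    return counts

sample = [
    "https://doi.org/10.1000/182",
    "http://hdl.handle.net/2027/xyz",
    "https://example.com/page",
    "https://dx.doi.org/10.5555/12345678",
]
print(dict(sorted(tally_pids(sample).items())))
# {'doi.org': 1, 'dx.doi.org': 1, 'hdl.handle.net': 1}
```

Run over successive yearly crawls, per-host tallies like these give the longitudinal usage curves the studies are after, which is why crawl-to-crawl consistency of the underlying data matters so much.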


Can Common Crawl reliably track persistent identifier (PID) use over time?

January 2018 · 12 Reads · 1 Citation

We report here on the results of two studies using two and four monthly web crawls respectively from the Common Crawl (CC) initiative between 2014 and 2017, whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. This paper focusses on the tooling needed for dealing with CC data, and the problems we found with it. The first study is based on over 10^12 URIs from over 5 × 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 × 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.



Web Architecture and Naming for Knowledge Resources

January 2008 · 3 Reads · 1 Citation

Lecture Notes in Computer Science

In 2004 the Technical Architecture Group (TAG) of the World Wide Web Consortium (W3C) published The Architecture of the World Wide Web (AWWW) [1]. This document was the result of a careful after-the-fact analysis of the distinctive properties of the distributed information system known as the Web. By describing these properties, the TAG was able to define a number of ‘principles’, ‘constraints’ and ‘best practices’ which users and authors of and for the Web should observe in order to preserve and enhance its value to everyone.


Computational systems, responsibility, and moral sensibility

November 1999 · 15 Reads · 4 Citations

Technology in Society

We can identify three areas of interaction between our understanding of computer systems and moral and spiritual issues: (a) the moral and technical issues involved in empowering computer systems in contexts with significant impact, direct or indirect, on human well-being; (b) the scientific/technical questions in the way of introducing an explicit moral sensibility into computer systems; (c) the theological insights to be gained from a consideration of decision-making in existing and envisagable computers. In this article we make this concrete by reference to the parable of the Good Samaritan: imagine that the innkeeper fetched for the injured man a barefoot doctor who consulted a medical expert system via a satellite up-link, that the robbers were caught and brought before an automated justice machine, that the Samaritan was in fact a robot, and finally that Paul himself rethought the significance of the parable on the basis of this reformulation. We also use examples from Isaac Asimov's stories about his “Three Laws of Robotics” to tease out pre-existing understanding of computational morality and the insights its consideration may offer into our own moral self-understanding.



Corpus work at HCRC

January 1996 · 5 Reads

International Journal of Corpus Linguistics

An overview is given of work on the creation, collection, preparation, and publication of electronic corpora of written and spoken language undertaken at the Human Communication Research Centre at the Universities of Edinburgh and Glasgow. Four major efforts are described: the HCRC Map Task Corpus, the ECI/MCI, the MLCC project and work on document architectures and processing regimes for SGML-encoded corpora.


The HCRC Map Task corpus: natural dialogue for speech recognition
  • Article
  • Full-text available

January 1993 · 765 Reads · 32 Citations

Henry S Thompson · [...] · Cathy Sotillo

The HCRC Map Task corpus has been collected and transcribed in Glasgow and Edinburgh, and recently published on CD-ROM. This effort was made possible by funding from the British Economic and Social Research Council. The corpus is composed of 128 two-person conversations in both high-quality digital audio and orthographic transcriptions, amounting to 18 hours and 150,000 words respectively. The experimental design is quite detailed and complex, allowing a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. The corpus is a uniquely valuable resource for speech recognition research in particular, as we move from developing systems intended for controlled use by familiar users to systems intended for less constrained circumstances and naive or occasional users. Examples supporting this claim are given, including preliminary evidence of the phonetic consequences of second mention and the impact of different styles of referent negotiation on communicative efficacy.


The HCRC Map Task Corpus

October 1991 · 167 Reads · 1,028 Citations

Language and Speech

This paper describes a corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels. The corpus uses the Map Task (Brown, Anderson, Yule, and Shillcock, 1983) in which speakers must collaborate verbally to reproduce on one participant's map a route printed on the other's. In all, the corpus includes four conversations from each of 64 young adults and manipulates the following variables: familiarity of speakers, eye contact between speakers, matching between landmarks on the participants’ maps, opportunities for contrastive stress, and phonological characteristics of landmark names. The motivations for the design are set out and basic corpus statistics are presented.


Citations (10)


... Both papers also conclude that tabular chart parsers do not seem to be appropriate for parallelization on the Meiko. Thompson (1991) presents two implementations of a parallel parser for two different systems. The target architecture for the first implementation was the Connection Machine, a large scale SIMD machine. ...

Reference:

Parallel Natural Language Parsing: From Analysis to Speedup
Chart Parsing for Loosely Coupled Parallel Systems
  • Citing Chapter
  • January 1991

... A rapid rate of technical progress leaves formal standardization efforts slow to catch up, if the standards are formulated by relatively slow moving and deliberate standard-setting bodies. In the case of web services, the underlying technologies are relatively new and still evolving, with some apprehension that the technology evolution is still trying to catch up to the marketing hype (Thompson 2000). There is also a need to create consensus across multiple stakeholders among different organizations that are impacted by the standards. ...

Web Services and the Semantic Web: Separating Hype from Reality
  • Citing Article

... In Natural Language research, software systems implementing a particular linguistic theory or formalism have been used for three distinct tasks: during the development of a linguistic theory, to ensure that the theory and analyses based on it remain consistent when they are modified, and during investigation of the theory, to help elucidate the manner in which the various aspects of the theory interact with one another; to provide a formalism for encoding linguistic information in a uniform way, in order to be able to compare and evaluate alternative linguistic theories; and to support the development of large, wide-coverage NL grammars. Examples of the first of these types of system include the Grammar-Writer's Workbench (Kiparsky, 1985) for LFG, and ProGram (Evans, 1985) and GPSGP (Phillips & Thompson, 1985) for versions of GPSG. These systems are direct computational implementations of particular versions of the respective linguistic theories, and thus cannot be used to compare alternative theories since, in general, a grammar expressed in one theory cannot be encoded in exactly the same way in another. ...

GPSGP — a parser for generalized phrase structure grammars
  • Citing Article
  • January 1985

Linguistics

... The Yi language belongs to the Yi branch of Tibeto-Burman within Sino-Tibetan, and there are six kinds of dialects. With the IPA (Thompson, Henry S, 1991) (Meng Zunxian et al, 1991), its 47 consonants, 12 vowels, 3 tones and a tense-voice symbol can accurately record the Yi language (Minnis, Stephen, 1991). The total population in China using the Yi language is more than 650 million, with a population of more than 400 million in Yunnan Province (Nirenburg, S. ed, 1987), accounting for 65% of the total population. ...

Automatic Evaluation of Translation Quality: Outline of Methodology and Report on Pilot Experiment
  • Citing Article

... While this diversity is valuable, the uncontrolled recording environments often result in inconsistent audio quality. The HCRC Map Task Corpus [19] approached conversational speech collection through a structured task-based framework, recording 128 unscripted dialogues where participants collaborated on map-related tasks. However, this task-oriented approach may not fully reflect natural conversation patterns. ...

The HCRC Map Task Corpus
  • Citing Article
  • October 1991

Language and Speech

... • Intuitive elicitation task: The goal is to elicit an exchange that would be as representative as possible of child-parent spontaneous conversations. Researchers have traditionally used physical prompts to elicit conversations, such as the maze game, the map task, or the spot-the-difference tasks (Anderson et al., 1993; Garrod and Anderson, 1987; Van Engen et al., 2010). However, we realized in piloting that such prompts tend to absorb children's attention, making the face-to-face multimodal interaction sub-optimal. ...

The HCRC Map Task corpus: natural dialogue for speech recognition

... Since a left recursive structure can be arbitrarily nested, we cannot predict the correct connection-path incrementally. There are a few practical and psycholinguistically motivated solutions in the literature [19], but in the current work we have resorted to an immediate approach which is extensively implemented in the Penn Treebank schema: namely we flatten the tree structure and avoid the left recursion issue altogether. Consider as an example the application of the flattening procedure to a local tree like 1 that produces as a result a tree like 2: ...

Compose-Reduce Parsing.
  • Citing Conference Paper
  • January 1991