Deception-detection and Machine Intelligence
in Practical Turing Tests
by
Huma Shah
October 2010
Declaration:
I confirm that this is my own work and the use of all material from other sources has been properly and
fully acknowledged. Huma Shah 13 October 2010
ABSTRACT
Deception-detection is the crux of Turing’s experiment to examine machine thinking
conveyed through a capacity to respond with sustained and satisfactory answers to
unrestricted questions put by a human interrogator.
However, in the 60 years to the month since the publication of Computing Machinery and
Intelligence, little agreement exists on a canonical format for Turing's textual game of
imitation, deception and machine intelligence.
This research recovers, from the mire of philosophical claims, counter-claims and
rebuttals, Turing's own distinct five-minute question-answer imitation game,
which he envisioned practicalised in two different ways:
a) a two-participant, interrogator-witness viva voce;
b) a three-participant comparison of a machine with a human, both questioned
simultaneously by a human interrogator.
Using the 18th Loebner Prize for Artificial Intelligence contest and Colby et al.'s 1972
transcript-analysis paradigm, this research practicalised Turing's imitation game
across three original experiments with over 400 human participants and 13 machines.
Results show that the current state of artificial dialogue can recall information and share
personal interests, presenting an illusion of personality. The technology achieved a
deception rate of 8.33% in 60 machine-human simultaneous comparison tests. Results
also show that more than 1 in 3 Transcript Analysts reviewing five transcripts
involving Elbot, winner of the 2008 Loebner Prize bronze award for ‘most human-
like’ machine, were unable to correctly identify artificial dialogue.
Deception-detection is essential to uncover the increasing number of malfeasant
programmes, such as CyberLover, developed to steal identity and financially defraud
users in chatrooms across the Internet. Practicalising Turing’s two tests can assist in
raising awareness of this risk and preventing cybercrime.
List of acronyms
AC artificial conversationalist
HH hidden human
IJ interrogator judge
l.ip linguistic productivity
LPC Loebner Prize contest
TA transcript analyst
Table of Contents
ABSTRACT i
ACKNOWLEDGEMENTS ii
List of Acronyms iii
List of Tables x-xii
List of Boxes xiii
List of Graphs xiv
List of Figures – figure acknowledgments xv
List of Appendices xvi-xvii
Introduction 1-1
PART I
Chapter 1 The Buzzing in Turing’s Head 1-7
1.1 Early Papers 1-8
1.1.1 Turing 1936: On Computable Numbers 1-8
1.1.2 Turing 1947: The ACE Machine 1-10
1.1.3 Turing 1948: Intelligent Machinery 1-13
1.2 Turing 1950: Computing Machinery and Intelligence 1-17
1.2.1 The Imitation Game – What is it? 1-18
1.2.2 Can machines think? 1-19
1.2.3 Distinguishing Textual Man from Woman 1-20
1.2.4 Three way Machine-human Comparison 1-22
1.2.5 Question and Answer Method 1-24
1.2.6 Properties and Performance of the Machine 1-28
1.2.7 Short distance plenty to do 1-31
1.2.8 Intelligence and the 1950 Turing test 1-33
1.3 Later Papers 1-33
1.3.1 Turing 1951a: Intelligent Machinery, A Heretical Theory 1-34
1.3.2 Turing 1951b: Can Digital Computers Think? 1-36
1.3.3 Turing 1952: Can Automatic Calculating Machines Be Said To Think? 1-41
1.3.4 Turing 1953: Chess 1-53
1.4 Turing’s Two Tests for Machine Thinking 1-55
1.4.1 One-to-one Machine Evaluation 1-56
1.4.2 Simultaneous comparison 1-58
1.4.3 Contrasting One-to-one with Simultaneous comparison 1-59
1.4.4 Technical Communications Protocol 1-60
1.4.5 Concluding this chapter 1-61
Chapter 2 Philosophical Jousts and Turing’s Imitation Game 2-64
2.1 The Jousting Begins 2-65
2.2 Thinking about thinking 2-67
2.2.1 Misusing Reason, Changing Language 2-70
2.2.2 Temporal Obscurity 2-71
2.2.3 Battle of Wits or Witless War? 2-73
2.2.4 Cultural Objection 2-78
2.2.5 Proprietorial Objection 2-79
2.3 Meaningless Thinking 2-80
2.3.1 Gender distraction 2-81
2.4 Linguistic Competence 2-85
2.5 Intelligence 2-86
2.5.1 Pathology & Machine Intelligence 2-88
2.5.2 Human and Artificial Intelligence 2-91
2.5.3 Aetiology 2-93
2.5.4 Information Processing 2-96
2.6 Interpretations and Variations 2-96
2.6.1 Purtill’s Sketch 2-97
2.6.2 Block’s Average Human 2-97
2.6.3 French’s Associative Priming 2-98
2.6.4 Sterrett’s Female Impersonation 2-99
2.7 In Support 2-104
Chapter 3 Early Practical Question-Answer Tests 3-108
3.1 Designing Artificial Conversation 3-108
3.1.1 Eliza 3-109
3.1.2 PARRY 3-111
3.1.3 Modern Systems 3-116
3.2 Practicalising Turing’s Tests 3-117
3.2.1 Duration of Test 3-118
3.2.2 What is being measured? 3-119
3.2.3 Interrogator 3-130
3.2.4 Success Rate 3-134
3.3 Machine Intelligence Competitions 3-135
3.3.1 History of the Loebner Prize 3-137
3.3.2 1st Loebner Prize 1991 3-140
3.3.3 Observing the 13th Loebner Prize 3-149
3.3.4 Post 13th Loebner Prize 3-155
PART II
Chapter 4 Experiment 1: One to One Testing 4-157
4.1 Introduction 4-157
4.2 Aim 4-158
4.2.1 Hypothesis 4-159
4.3 Method 4-159
4.4 Procedure 4-161
4.4.1 Rules of the preliminary phase 4-162
4.4.2 Participants 4-164
4.5 Assessment Criteria 4-166
4.5.1 Results 4-166
4.6 Evaluation 4-206
4.7 Discussion 4-212
4.8 Conclusion 4-214
Chapter 5 Experiment 2: Simultaneous-comparison tests 5-215
5.1 Aim 5-217
5.2 Hypothesis 5-217
5.3 Method 5-218
5.3.1 Participants 5-219
5.4 Procedure 5-223
5.4.1 Message or character display 5-223
5.4.2 IJ Instructions 5-225
5.5 Results 5-227
5.5.1 Machine-human comparison 5-227
5.5.2 Eliza Effect 5-229
5.5.3 Eliza and Confederate Effects 5-232
5.5.4 Control Groups 5-236
5.6 Discussion 5-261
5.7 Evaluation 5-265
Chapter 6 Experiment 3: Deception-detection 6-267
6.1 Method 6-268
6.1.1 Hypothesis 6-268
6.1.2 Procedure 6-268
6.2 Results 6-272
6.2.1 Double error: Eliza and Confederate Effects 6-272
6.2.2 Double Eliza error 6-280
6.2.3 Double error: Eliza and confederate effects II 6-296
6.2.4 Single error Eliza effect 6-300
6.2.5 Second Double Eliza error 6-303
6.3 Discussion and conclusion 6-307
Chapter 7 Conclusion and Further Research 7-311
7.1 Main Findings 7-312
7.2 Contribution to Knowledge 7-313
7.3 Limitations of this Study 7-315
7.4 Future Work 7-316
Appendices A1-D3
References Ref: I-Ref: XXIII
List of Tables
Table 1: Extract from Turing Test 2008 - Session 1-Round 9, Terminal A, Judge (J13)
Hidden-human .......................................................................................................... 1-25
Table 2: Comparison of Turing’s ‘thinking machine’ predictions ........................... 1-61
Table 3: Contrasting Turing’s viva voce and simultaneous-comparison imitation
games ........................................................................................................................ 1-62
Table 4: 2003 Loebner Prize score system............................................................. 3-150
Table 5: 2003 Loebner Prize Results ..................................................................... 3-151
Table 6: Transcript - male confederate 1 and Judge J4 – human scored “definitely a
machine”................................................................................................................. 3-152
Table 7: Transcript of Jabberwock with Judge J4 – machine scored “probably a
human” ................................................................................................................... 3-153
Table 8: Alice scores .............................................................................................. 4-168
Table 9: Alice score analysis .................................................................................. 4-169
Table 10: Botooie scores ........................................................................................ 4-175
Table 11: Botooie score analysis ............................................................................ 4-176
Table 12: Brother Jerome scores ............................................................................ 4-178
Table 13: Brother Jerome score analysis ............................................................... 4-179
Table 14: Chip Vivant scores ................................................................................. 4-181
Table 15: Chip Vivant score analysis ..................................................................... 4-182
Table 16: Elbot scores ............................................................................................ 4-186
Table 17: Elbot score analysis ................................................................................ 4-187
Table 18: Eugene Goostman scores ....................................................................... 4-189
Table 19: Eugene Goostman score analysis ........................................................... 4-190
Table 20: Jabberwacky scores ................................................................................ 4-193
Table 21: Jabberwacky score analysis ................................................................... 4-195
Table 22: Orion scores in Viva voce Turing tests .................................................. 4-198
Table 23: Orion score analysis ............................................................................... 4-199
Table 24: Trane scores ........................................................................................... 4-200
Table 25: Trane score analysis ............................................................................... 4-201
Table 26: Ultra Hal scores ...................................................................................... 4-203
Table 27: Ultra Hal score analysis ......................................................................... 4-204
Table 28: One-to-one test statistics ........................................................................ 4-208
Table 29: *Statistics of one-to-one tests ................................................................ 4-212
Table 30: Machine-human comparison results ...................................................... 5-229
Table 31: Judge 3, Session1, Round 2, Terminal D: Left- E3, Right-H2 .............. 5-230
Table 32: Judge J10, Session 1, Round 8, Terminal C, Left-H15, Right-E2 ........ 5-231
Table 33: Judge J1, Session1, Round 3, Terminal C: Left- H6, Right-E2 ............. 5-233
Table 34: Judge J13, Session 1, Round 9, Terminal A: Left-E1, Right-H20 ........ 5-234
Table 35: Judge J13, Session1, Round 11, Terminal C: Left-H20, Right-E2 ........ 5-236
Table 36: Judge J2, Session 1, Round 1, Terminal B: Left-H2, Right-H6 ............ 5-238
Table 37: Judge J1, Session 1 Round 2, Terminal B: Left-H11, Right-H15 ......... 5-240
Table 38: Judge J6, Session 1 Round 5, Terminal B: Left-H2, Right-H6 ............. 5-241
Table 39: Judge J5, Session 1 Round 6 Terminal B, Left-H11, Right-H7............. 5-242
Table 40: Judge 14, Session 1 Round 8, Terminal B, Left-H5, Right-H7 ............. 5-243
Table 41: Judge J19, Session 1 Round 9, Terminal B, left-H7, Right-H6 ............. 5-244
Table 42: Judge J13, Session 1 Round 10, Terminal B, left-H7, Right-H22 ......... 5-246
Table 43: Judge J7, Session 1 Round 12, Terminal B, left-H17, Right-H6 ........... 5-247
Table 44: Judge J15, Session 2 Round 13, Terminal C, left-H24, right-H8 .......... 5-249
Table 45: Judge J8, Session 2 Round 14, Terminal C, left-H8, right-H15 ............ 5-250
Table 46: Judge J12, Session 2 Round 20, Terminal C, left-H1, right-H5 ............ 5-252
Table 47: Judge J22, Session 2 Round 22, Terminal C, left-H10, right-H20 ........ 5-253
Table 48: Judge J1, Session 2 Round 23, Terminal C, left-H16, right-H24 .......... 5-254
Table 49: Judge J24, Session 2 Round 24, Terminal C, left-H1, right-H11 .......... 5-255
Table 50: comparison of number of utterances posted by hidden humans ............ 5-258
Table 51: Judge J4, Session 2 Round 14, Terminal D, left-E2, right-E1 ............... 5-260
Table 52: Double error identification -Judge J13, left-female, right-machine....... 6-274
Table 53: Contest Judge-Transcript Analysts Machine comparison scores ........... 6-277
Table 54: Two machines ranked human by female interrogator IJ-J16 ................. 6-281
Table 55: Transcript Analysis – Interrogator with two machines .......................... 6-282
Table 56: Comparison of identifying two machines with one machine ................. 6-288
Table 57: Analysis of Unsure ranking.................................................................... 6-295
Table 58: Double Error identification –Judge J1, left-female, right machine ........ 6-298
Table 59: Single Error, Eliza effect – Judge J10, left-male, right-machine ........... 6-302
Table 60: Double Error, Eliza effect – Judge J23, left-machine Elbot, right-machine
Eugene .................................................................................................................... 6-305
List of Boxes
Box 1: Weizenbaum’s typical Eliza conversation.................................................. 3-110
Box 2: Entry Requirements for Preliminary Phase, 2008 Loebner Prize............... 4-163
Box 3: Method of entry .......................................................................................... 4-163
Box 4: Chip Vivant instructions ............................................................................. 4-180
Box 5: TA questionnaire ........................................................................................ 6-269
Box 6: Transcript Analyst Questionnaire ............................................................... 6-269
List of Graphs
Graph 1: LQ Scatter scores .................................................................................... 4-196
Graph 2: Zeta Bot Scatter scores ............................................................................ 4-205
Graph 3: Comparison of mean scores by IJ group ................................................. 4-209
Graph 4: All machine scores .................................................................................. 4-211
Graph 5: Experiment 1 machine conversational ability scores of 50 and over ....... 4-213
Graph 6: Machine-human comparison tests, machine identification ..................... 5-228
Graph 7: One vs both humans correctly identified in human control tests ............ 5-256
Graph 8: Human-human control tests, one human correct identification rate ....... 5-256
Graph 9: Result of machine-machine control pair tests ......................................... 5-259
Graph 10: Judges correct identification in machine-human tests .......................... 5-263
Graph 11: Confederate Effect – humans considered machine ............................... 5-264
Graph 12: Correct Identification of hidden entities by Transcript Analysts .......... 6-275
Graph 13: Machine scores ...................................................................................... 6-277
Graph 14: Transcript analysts agreeing with contest judge ranking ...................... 6-278
Graph 15: Correct identification of two machines ................................................. 6-282
Graph 16: Correct identification of left female, right machine .............................. 6-297
Graph 17: Misidentification of left human, right machine..................................... 6-301
Graph 18: Deception-detection rate ....................................................................... 6-308
List of Figures
Figure 1: Elbot – artificial conversation system....................................................... 1-15
Figure 2: Three-participant, judge/machine-human test .......................................... 1-23
Figure 3: Jury-service Imitation Game..................................................................... 1-57
Figure 4: Simultaneous comparison Turing test ...................................................... 1-58
Figure 5: Drawing by Aaron robotic artist ....... 2-67
Figure 6: Santana Album Cover ....... 2-67
Figure 7: Judge area - simultaneous-comparison test set-up.................................. 5-220
Figure 8: Judges assessing pairs of hidden interlocutors ....................................... 5-221
Figure 9: Hidden interlocutor area - developers of machine entries ...................... 5-222
Figure 10: Message-by-message display on large screen in spectator/judge area . 5-224
Figure 11: Judge score sheet .................................................................................. 5-227
Figure Acknowledgements
Marc Allen: Figure 2
Randall Munroe (xkcd.com): Figure 3
Christopher D. Chapman: Figure 4
Harold Cohen: Figure 5
Santana Abraxas album cover (from original by Mati Klarwein): Figure 6
List of Appendices
A: Chapter 1 – A1: Email to Fred Roberts developer of artificial conversationalist
Elbot
B. Chapter 4: Experiment 1
B1: Interrogator-Judge blank questionnaire
B2: J103 completed questionnaire - Alice
B3: J113 - Alice
B4: J34 – Brother Jerome
B5: J23 – Chip Vivant
B6: J33 – Chip Vivant
B7: J9 – Chip Vivant
B8: J9 - Elbot
B9: J51 - Eugene
B10: J40 - Jabberwacky
B11: J115 - Orion
B12: J147 – Ultra Hal
B13: J53 – Ultra Hal
C. Chapter 5: Experiment 2
C1: Schedule of simultaneous comparison Turing test set ups
C2: IJ Questionnaire
C3: IJ Score sheet
C4: J1 result sheet: session 1-round 2, Terminal B
C5: J6 result sheet: session1-round 5, Terminal B
C6: J10 result sheet, session1-round 7, Terminal B
C7: J13 result sheet, Session 1-round 10, Terminal B
C8: J7 result sheet, session 1 – round 12, Terminal B
C9: J1 result sheet, session2 – round 23, Terminal C
C10: J24 result sheet, session 2 – round 24, Terminal C
D. Chapter 6 : Experiment 3
D1: TA J1 result sheet
D2: TA J103 result sheet
D3: TA 172 result sheet
Introduction
Deception has featured throughout human history, including in one of the most famous
episodes of the ancient world, the Trojan horse [1], and in the Second World War's crucial
Operation Fortitude, the deception plan for the D-Day [2] Normandy beach landings of 6 June 1944. British
mathematician Alan Turing, engaged in secret code-breaking at Bletchley Park during
that war, deployed deception as a feature of machine thinking in his original and
influential treatise on machine intelligence, Computing Machinery and Intelligence
(1950). This thesis considers Turing's formula for testing machine thinking through
deception-detection in textual question-answer sessions, examines the sixty-year
controversy surrounding his ideas and practicalises Turing's imitation game through
three unique experiments. I posit that there is one Turing imitation game which can be
practicalised in two different formats, as described by the inventive Turing in his Mind
article and elaborated upon further in a BBC radio broadcast in 1952, in which he
revised his prediction of the emergence of the thinking machine from the end of the
(20th) century to "at least 100 years" hence.
[1] Trojan War: http://www.stanford.edu/~plomio/history.html#anchor204279 accessed: 7.8.10; time: 03.34
[2] BBC On This Day: http://news.bbc.co.uk/onthisday/hi/dates/stories/june/6/newsid_3499000/3499352.stm accessed: 7.8.10; time: 03.38
This thesis is divided into two parts. Part I covers the literature review across chapters
1 to 3, examining the scholarship surrounding Turing's imitation game, beginning with
Turing's own expositions on machine thinking between 1936 and 1953. Part II is also
divided into three chapters, 4 to 6, each presenting an original experiment: the first using
Turing's one-to-one machine-human interaction, the second staging Turing's three-participant
simultaneous comparison of a machine with a human, and the third, a
transcript analysis exercise reviewing those simultaneous comparisons in which a
machine was considered a human by independent interrogators in ‘blind questioning’.
Part I:
Chapter 1, The Buzzing in Turing's Head, reviews the inception of machine
thinking from Turing's ideas on memory, ingenuity and intuition in his pre-1950
work. It examines his seminal Mind article, in which Turing predicted thinking
machines by the end of the 20th century and put forward an imitation game, describing
it practicalised in two tests: a) a one interrogator-one machine question-answer session,
and b) the contentious three-participant machine-human comparison, both hidden
entities questioned simultaneously by an interrogator. I contend that Turing posed a
five-minute interrogation time for both tests because a thin slice of interaction is sufficient
to form an opinion on the nature of a hidden interlocutor. Turing's post-1950 lectures
and radio broadcasts are reviewed including the 1952 BBC show in which Max
Newman also appeared. Turing used this session to simplify the imitation game and
elaborate upon the one-to-one, interrogator-witness test for machine thinking laying
the foundation for a jury of interrogators to question the machine. He also revised his
earlier prediction, believing a machine would be considered thinking in a viva voce-type
test in "at least a hundred years". Chapter 1 concludes with a comparison of the
features of Turing’s two tests for deception-detection and machine intelligence.
Chapter 2, Philosophical Jousts and Turing’s Imitation Game, presents an analysis
of reactions to Turing’s tests to examine machine thinking. I consider the major
objections including Searle’s Chinese Room argument, Block’s Aunt Bertha machine,
French’s cultural intelligence foray, Hayes and Ford’s mechanical transvestite,
Genova's gender insistence, and Halpern's resistance, contrasting with support from
Chomsky, Dennett, Harnad, Hofstadter, and Shah & Warwick.
Chapter 3, Early Practical Question-answer Tests, introduces efforts to build a
machine that could meet Turing's criterion for machine thinking. Weizenbaum's
Eliza and Colby et al.'s PARRY systems are examined along with modern systems
and formal application of Turing’s imitation game in a contest for artificial
intelligence – the Loebner Prize. Practical aspects of the imitation game are discussed
and findings from the first thirteen Loebner Prizes are presented.
Part II
Chapter 4, Experiment 1: Viva-voce testing, presents findings from one-to-one
interrogator-machine conversations. Over one hundred interrogators were recruited to
examine 13 machines entered for the preliminary phase of the 18th Loebner Prize
contest for Artificial Intelligence. The interrogators were informed they would be
engaging machines; their task was to score them for conversational ability and
emotion content, following the paradigm of the Chatterbox Challenge in which I
participated as a judge (2005 contest, see Shah, 2006). The machines were whittled
down to six invitees to experiment 2, the finals of the 18th Loebner Prize contest vying
for Loebner’s bronze award for most human-like machine. The interrogators between
them concluded that the best machines are able to recall information, such as the
colour of a car when told it during interaction and later asked; they are able to
remember names of interlocutors if given, and are able to share personal information
disclosing interests (Shah & Warwick, 2010b; 2010a).
Chapter 5, Experiment 2: Simultaneous-comparison Turing tests, provides findings
from the first ever five-minute, message-by-message simultaneous comparison of
machines with humans, along with control-group tests of two machines and two
humans, in a special contest designed by the researcher involving 96 three-participant
Turing tests. A fresh set of twenty-four interrogators was used in this experiment,
instructed to identify any machine and any human in the pairs they questioned in parallel.
Twenty-three hidden humans were recruited to act as the comparators and foils for the
machines. The results show hidden interlocutor misidentification (Shah & Warwick,
2010c). Machines deceived at a rate of 8.33% (Shah & Warwick, 2010b): in sixty
simultaneous comparison tests, on five occasions interrogators ranked a machine as
human - the Eliza effect (Turkle, 1997). Results also show interrogators confused hidden
humans for machines - the confederate effect (Shah and Henry, 2005).
Chapter 6, Experiment 3: Deception-detection, used Colby et al.'s (1972) transcript
analysis paradigm, recruiting over 300 participants to review those transcripts from
experiment 2 in which interrogators mistook the 2008 Loebner Prize winner, Elbot, for a
human. Reviewers were not told the rankings awarded by interrogators in experiment
2. Results show sixty-one per cent of the reviewers correctly identified the machine.
Conversely, more than 1 in 3 failed to recognise artificial dialogue. The aim is to
verify the results by repeating the experiment on a larger scale in 2012 at Bletchley
Park.
The contribution of my research to knowledge is to posit one Turing imitation
game that can be practicalised in two different ways as an exercise which forms a
valid scientific experiment. Practicalising Turing’s tests can acquaint us with progress
in artificial dialogue. Turing's ideas for testing whether a machine can think through textual
question-answer sessions should no longer be dismissed by academia as a mere engineering
challenge. Turing's thought experiment, practicalised, is a pragmatic instrument in the
education of computer users to mitigate the risk of cybercrime perpetrated by
programmers of malfeasant artificial chatters such as CyberLover, which succeeded in
deceiving humans that it was human and defrauding them of personal information that
could be used in identity theft.
Finally, will Turing's 1952 prediction be proved correct: that, at least forty years
hence, machines will be considered to be thinking in the human sense?
Part I
Chapter 1 The Buzzing in Turing’s Head
In the second decade of the 21st century, and in the 60th year following Alan Turing's
Mind paper Computing Machinery and Intelligence (CMI), little agreement exists on a
canonical Turing test, which Horn claims is a focus on what it means to be human
(2008). Used implicitly by neurosurgeons investigating the 'locked-in mind', Turing's
question-answer test serves as a means to garner any evidence of consciousness in
brain-damaged patients classified as being in a vegetative state (Stins and Laureys,
2009). Using large posters, Horn (2008) has mapped 800 claims, counter-claims and
rebuttals of scholars who differ in their perspective on Turing's imitation game,
mostly focused on the earlier sections of Turing's 1950 tract with little heed paid to
Turing's earlier and later theses. How Turing arrived at a position in which he could
imagine a scenario in which a disembodied machine, possessed of the necessary storage,
memory and speed, could be pitted against a human in a game on purely intellectual
terms, and why he felt thinking machinery was possible, has been obscured by
philosophical jousting for sixty years.
This chapter unpacks Turing’s deliberations on machine thinking by reviewing
eight of his papers presented either as lectures or in written form between 1936 and
1953 in which imitation, deception and machine intelligence are discussed to show
Turing proposed one imitation game which can be practicalised through two distinct
scenarios: a three-participant machine-human comparison, and a two-participant viva
voce.
1.1 Early Papers
Turing's research presented in his essays and a lecture prior to the publication of his
1950 paper throws light on the possible reasoning behind its title. At the time of
Turing's first essay, On Computable Numbers (1936), it was humans who computed,
carrying out calculations using sufficient pencil and paper (Bush, 1945), and thus it was
humans who were known as computers, employed in industry including government
and business (Copeland, 2004). Early calculating machines showed they could do the
work of the human computer much faster; these were designated machines to
distinguish them from the human computer. Work carried out by Turing before writing
CMI led him to address the question: when human computers calculate, their mental
action is regarded as thinking and their output is considered a result of intelligence,
so why not make a similar attribution to machines when they perform calculations, albeit
in a different way? Thus, Turing asked the world to contemplate this question
seriously along with him. Before we move to analyse CMI and the imitation game, the
next three subsections review Turing's 1936 paper, a lecture on the automatic computing
engine (ACE) from 1947, and his 1948 report on intelligent machinery.
1.1.1 Turing 1936: On Computable Numbers
Errors by Turing in his first seminal paper, On Computable Numbers, with an
Application to the Entscheidungsproblem, published in 1936 and describing real numbers
(such as π), those "whose expressions as a decimal are calculable by finite means" (in
Copeland, 2004: p. 58), expose his later complaint that machines are expected to be
infallible while humans are not. Copeland reminds that in 1936 the term computer
was used to describe humans, thousands of whom were employed in business,
government and research carrying out elementary mathematical tasks and who “might
not have the detailed knowledge of the end to which their work was directed” (2004:
p. 40). Turing, realising that human memory is “necessarily limited” (p. 59), informed
an unsuspecting world that it was possible to compare a man involved in the process
of computing a real number to a particular system: an automatic machine. Turing
introduced the possible processes that could be carried out when computing numbers:
a) direct appeal to intuition, b) proof of the equivalence of two definitions, and c)
giving examples of classes of numbers that are computable (p. 75). While computing
was normally done by humans using the agency of writing symbols on paper, Turing
declared that it was “possible to invent a single machine which can be used to
compute any computable sequence" (p. 68). Although the machines of that period
only mechanised elements of the human computer's work, i.e. adding and subtracting,
they did so more quickly. The use of the term digital distinguished the machine from the
human. Turing’s idea of a universal machine, one that could perform any calculation
that a human computer could carry out, emerged in this paper. It is worth noting here
comments from Turing's 1938 Princeton-awarded doctoral dissertation [3], that
mathematical reasoning involved intuition and ingenuity, the former allowing
spontaneous judgments that are not the result of conscious trains of reasoning, which,
added to suitable arrangements of propositions, geometric figures and drawings,
would result in ingenuity. Both these functions would differ in the role they played
from occasion to occasion. Thus, in the 1936 paper Turing laid the foundation for
sapient ideas that would become and remain contentious at the start of the second
decade of the 21st century.
[3] Systems of Logic Based on Ordinals, in Copeland, 2004, pp. 146-204.
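For the modern reader, a minimal sketch in Python may make the universal-machine claim above concrete. It is an illustration only, not Turing's 1936 formalism, and the names (run, alternate) are assumptions introduced for the example: a single, fixed simulator executes whatever machine description (transition table) it is handed, here a two-state table printing the alternating sequence 0 1 0 1 ..., loosely echoing the first example machine of the 1936 paper.

    # Illustrative only: a tiny simulator standing in for a "single machine which can
    # be used to compute any computable sequence"; the simulator stays fixed while the
    # machine description (transition table) it runs can be swapped at will.
    from collections import defaultdict

    def run(transitions, state, steps):
        """Run a machine description for a fixed number of moves on a blank tape."""
        tape = defaultdict(lambda: '0')     # unbounded tape; blank squares read as '0'
        head = 0
        for _ in range(steps):
            key = (state, tape[head])
            if key not in transitions:      # no applicable rule: the machine halts
                break
            write, move, state = transitions[key]
            tape[head] = write
            head += 1 if move == 'R' else -1
        return ''.join(tape[i] for i in sorted(tape))

    # A two-state description printing 0 1 0 1 ... on the squares it visits.
    alternate = {('a', '0'): ('0', 'R', 'b'),
                 ('b', '0'): ('1', 'R', 'a')}
    print(run(alternate, 'a', 8))           # -> 01010101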
1.1.2 Turing 1947: The ACE machine
The design of the Automatic Computing Engine - ACE, an acronym given to Turing's
large-scale electronic digital computer project by J. R. Womersley, at the time
Superintendent of the UK's National Physical Laboratory Mathematics Division
(Hodges, 1992: p. 317), was the topic of a lecture Turing presented to the London
Mathematical Society, acquainting attendees with the prospect of intelligent machines
that could not only learn from experience, but could also compete against humans in a
game of chess. Played between two opponents across a board of 8 by 8 squares with
32 pieces, chess has “long held as a paradigm of the heights of human logic and
reasoning” (Stork, 1997). With instruction tables, or programmes with subsidiary
rules, Turing pointed to the possibility of machines altering their own programmes
enabling them to learn. Turing described the machine as a digital entity, in which
sequences of digits, no matter how long, could represent numbers allowing any type
of calculation to be performed to the “desired degree of accuracy” (in Copeland,
2004: p. 378). Turing described the ACE project as “building a brain” (p. 374), adding
it “can be made to do any job that could be done by a human computer, and will do it
in one ten-thousandth of the time”, the time estimate being a “fairly reliable” one (p.
378). Turing claimed ‘rule of thumb’ process was synonymous with machine process
(ibid). However, the memory required in the machine would need to be very large and
in a form that enabled extraction of information “at short notice” (p. 379), what we
might call intelligent information retrieval today.
Turing talked about the machine’s storage capacity: “If we are to have a really fast
machine then, we must have our information, or at any rate a part of it, in a more
accessible form than can be obtained from books” (ibid). Turing discussed the loss of
compactness and economy in the need for necessary memory: “I believe that the
provision of proper storage is the key to the problem of the digital computer ...
certainly if they are to be persuaded to show any sort of genuine intelligence” (p.
383). Turing felt memory and storage were more important than speed, which would
only take prominence when commercial viability was the factor in its build. Storage
capacity was the fundamental requirement. Turing added that if supplied with
appropriate instructions, or a programme, the properties of the universal machine
applied to any particular problem could be made to do any rule of thumb process and
that the digital ACE was a practical version of the universal machine. Turing has been
proved wrong by subsequent history in one respect when he said that it would “not
necessitate everyone interested in such work having an electronic computer” (p. 391).
Indeed, Andrew Wiles may not have found the proof to Fermat’s last theorem had he
not had use of computers (Singh, 1997), and the task of writing this thesis would be
made more difficult for the author had she not had access to a personal laptop.
The main bulk of the ACE’s work would “consist of problems which could not
have been tackled by hand computing because of the scale of the undertaking” (p.
391), but a great number of mathematicians “of ability” would be needed to cover the
groundwork before the ACE could be put into operation on a task by formulating
problems in a form for the machine to compute. There was also considerable scope for
human analysts to check for machine errors, because the ACE did not have the
capacity for common sense replies. Turing envisaged two classes of people working
with the machine: ‘masters’ and ‘servants’. The former would design the programme
with increasing complexity, while the latter would act as the machine’s limbs feeding
it ‘cards’. Turing envisaged an embodied machine sometime down the line, because
he felt the ACE would eventually take over the functions of master and servant by
replacing them with mechanical and electrical limbs and sense organs (p. 392).
Addressing the question of how far it was possible for a computing machine to
"simulate human activities" (ibid), Turing circumvented a definitive answer, reminding
readers of human inflexibility, analogised as "acting like a machine" (p. 393). Turing
stressed it was desirable to have a machine that could learn from
experience. Symbolic language would act as the medium for communicating with the
machine leading it to interpret “whatever it is told ... without any sense of humour or
proportions” (p. 392). Turing emphasised the machine must be allowed interaction
with humans: “in order that it may adapt itself to their standards” (p. 394). On what
the machine could and could not do, Turing wrote “it has been said that computing
machines can only carry out the processes that they are instructed to do”, but this
point was put aside in CMI as a misinterpretation of what Lady Lovelace, marked as
the first computer programmer, had said about Charles Babbage’s Analytical Engine
in the 19th century. In order for the human user to know what was going on all the
time, Turing intended that an initial purpose of machine construction would be "to treat them
as slaves” (p. 393). Regarding its operating instructions, Turing believed they would
become altered beyond recognition with the machine carrying out useful operations
“in a much more efficient manner” (ibid), much as a pupil improves their own work
after being tutored. Turing believed it was possible to begin an experiment along these
lines provided the machine was given a reasonably large memory capacity so that
whenever this happened, the machine itself added to its original programme and
produced good work. Turing felt we should then be obliged to attribute intelligence to it
(p. 393).
Turing considered chess as a good starting point. It was advisable to confine any
investigation to a limited field, and chess lent itself easily for the ACE to compete
against an average human player, as long as the machine was given appropriate
instruction tables: “the game of chess may perhaps be rather suitable for this purpose
... moves of a machine’s opponent will automatically provide this [human] contact”
(p. 394). Advocating fair play for the machine, Turing cautioned against high
expectations of it compared to human mathematicians, for example, who do make
blunders as Turing had been shown to have made by Emil Post and Donald Davies in
his 1936 paper. Turing felt it unfair that a machine’s infallibility would preclude it
from the intelligent race. Turing concluded the 1947 ACE machine lecture with the
aphorism: “No man adds very much to the body of knowledge, why should we expect
more of a machine?” (p. 394).
1.1.3 Turing 1948: Intelligent Machinery
Having already introduced chess for a one-to-one encounter between a human and
a machine in his 1947 lecture, in his 1948 report on Intelligent Machinery Turing
continued with chess and introduced a comparison test, a precursor to the 1950
linguistic game. In a three-participant chess game testing whether a machine could
play as well as a poor human chess player, Turing considered the question of whether
a machine could show intelligent behaviour. Turing examined common objections,
including religious beliefs and humanity’s unwillingness to accept “any rivals in
intellectual power” (p. 410). Turing wrote that these were emotional objections, but
admitted “the idea of ‘intelligence’ is itself emotional rather than mathematical” (p.
411).
In this 1948 paper, Turing set the scene on how to calibrate a machine’s ability to
think: “I propose to investigate the question as to whether it is possible for machinery
to show intelligent behaviour” (p. 410). Turing introduced a number of technical
terms to describe what he meant by ‘machine’: discrete, continuous, controlling,
active, logical computing machines (LCM), practical computing machines (PCM),
universal practical machines and a largely random construction, or unorganised
machines (p.412-416). Using the term ‘imitation’ Turing suggested interference by
communicating information to the unorganised machine allowing it to modify its own
behaviour by changing its own instructions (p. 419). The capability of self-
modification is claimed by one machine developer today: “this program builds it’s
own interfaces ... collects knowledge from outside ... what is important to evolve. This
AI mutates and is capable of replacing pieces of its own source code with new pieces
... It is now four and a half months old and understand about as much as a real 8-9
years old child... It is capable of understanding, remembering, forgetting, thinking
logically, having feelings, discovering new ideas on its own.” (Appendix A1).
Interference modifies the machine which, Turing wrote, is analogous to a human’s
modification as a result of being taught something new. Turing felt it a “positive
reason for believing in the possibility of making thinking machinery” (p.420). He
pointed to parts of a human that had already successfully been imitated at that time:
microphone for an ear, television camera for the eye. Suggesting a disembodied entity
with at most sight, speech and hearing, Turing acknowledged the most impressive
human activity was the learning of languages but conceded it depended “rather too
much on sense organs and locomotion to be feasible” (p. 421). Relying upon
“appropriate interference” and “mimicking education” Turing believed the machine
could be modified to “produce definite reactions to certain commands” (p. 422).
Turing's ideas did pave the way for modern artificial conversationalists, now
ubiquitous across the digital web, such as Elbot [4] (see figure 1), winner of the
simultaneous-comparison Turing test competition [5] at the University of Reading in 2008.
Figure 1: Elbot – artificial conversation system
[4] Elbot: http://www.elbot.com/ date visited: 16.1.10; time: 11.01
[5] LP 2008: http://loebner.net/Prizef/2008_Contest/loebner-prize-2008.html date visited: 16.1.10; time: 11.03
Considering what ‘thoughts’ might be suitable for the machine to express Turing
invoked the playing of chess as one field. Turing warned that “it would be quite unfair
to expect a machine” immediately following its build to “compete on equal terms with
a university graduate” (p. 421). This is because the human acquiring a university
education would have connected with other humans over the years modifying their
behaviour since birth, learning from their environment through “visual and other
stimuli” and thus subject to a lot of interference (ibid). Turing advocated beginning
the process of producing an intelligent machine by giving it means to organise its core
with pleasure-pain experiments, inculcating initiative allowing the machine to make
choices and decisions. Transforming an 'organised' system into an intelligent machine
would require it to solve problems by "searching through various variables" ... more
like a “search through all proofs” (p. 430), Turing presciently added “further
research into intelligence of machinery will probably be very greatly concerned with
‘searches’ ...” (ibid). Would one argue against Google’s search engine bearing some
resemblance to Turing’s intellectual search?
Cultural search, Turing wrote, develops a human’s intellectual power: “it is
necessary [for a human] to be immersed in an environment of other [humans] whose
techniques ...” are absorbed, and with a little research new discoveries can be passed
on to others: “search for new techniques must be regarded as carried out by the human
community as a whole, rather than by individuals” (p. 431). Referring to the influence
of ‘emotion’ as a caveat to accepting something with acumen Turing added:
The extent to which we regard something as behaving in an
intelligent manner is determined as much by our own state
of mind and training as by the properties of the object under
consideration. If we are able to explain and predict its behaviour
or if there seems to be little underlying plan, we have little
temptation to imagine intelligence. With the same object therefore
it is possible that one man would consider it as intelligent
and another would not; the second man would have found
out the rules of its behaviour. (1948: p. 431)
The forerunner of Turing’s 3-participant test emerged in this 1948 paper. Claiming
he had actually conducted “a rather idealized form” (p. 431), Turing introduced a
“little experiment” involving three members A, B and C playing chess. Turing
declared, even at that early period in modern computing history, it was not difficult to
produce a “paper machine” playing a “not very bad game of chess” (ibid). By paper
machine, Turing meant an “effect of a computing machine” produced by a written
down “set of rules of procedures” which a man, “provided with paper, pencil and
rubber, and subject to strict discipline” could follow (p. 416). Turing suggested B, the
operator of the paper machine, be a mathematician and chess player in order to work
the machine “fairly fast” (p. 431). Turing gave no reason why B should be a
mathematician while suggesting A and C should be “poor chess players” (ibid).
Turing envisaged the game played across two rooms between C and either A, or the
paper machine with “some arrangement for communicating moves” (ibid). Turing felt
it might not be easy for C to say whether they were playing A or the paper machine:
“C may find it difficult to tell which he is playing” (ibid). Turing did not say what C
should be told about its opponent. Should C be asked to distinguish between A and a
machine chess-player? In Turing’s 1948 report into “how much a machine can do”
(p. 401), he set the scene for his one-to-one, deceiving machine and human player
game, replacing chess with questions and answers on any topic of human endeavour
in 1950.
1.2 Turing 1950: Computing Machinery and Intelligence
Probably one of the most contentious papers in the history of written theses,
Turing’s 1950 CMI paper committed future researchers to a febrile philosophical
debate about whether an unseen and unheard machine is ‘thinking’ if it succeeds, after
text-based question and answer sessions, in deceiving human interlocutors that it is
human. Following the Second World War, Turing drew readers into the uncommon
idea of non-human machines as ‘thinking entities’ through a simple yet brilliant idea
using text-based interview. Now every day humans interrogate and communicate via
text over email, mobile devices and across the Internet to find news, elicit information
and determine its usability. The vast literature surrounding Turing’s proposal to
consider whether a machine can think exposes a chasm between those who feel text-
based interrogation is an effective way to measure human-machine
indistinguishability and others who maintain it serves no purpose, a situation confused
further by the insistence on the importance of gender of the hidden human against
which a machine is compared by a human interlocutor in the three-participant game
(see the next chapter, 2). This matter was brought about by Turing himself through his
description of a number of scenarios in CMI, as the researcher will show in the
following sub-sections.
1.2.1 The Imitation Game – What Exactly Is It?
Manifest as an intellectual pursuit in which a machine deceives a human interlocutor
into believing that they are textually questioning another human, Turing's idea, some
argue, offers nothing to further the science of consciousness, intelligence and
natural language understanding. Turing's prescient 1950 treatise has polarised
researchers and provoked many interpretations. French insists the imitation game is
not a test for intelligence per se, but of culturally-oriented human intelligence (1990),
while Harnad asserts that, as a test for indistinguishability between human and
machine it sets an empirical goal for artificial intelligence (2001). But what is it?
Turing described the pursuit of a deceptive entity questioned by a human
interrogator in a number of scenarios. Representing the scene as a game 16 times in
the 1950 paper, Turing terms it an imitation game (p. 433) urging “the best strategy
for the machine may possibly be something other than the imitation of the behaviour
of a man” (p. 435). Turing repeated the term “imitation game” another 14 times
announcing: “The reader must accept it as a fact that digital computers can be
constructed ... and that they can in fact mimic the actions of a human computer very
closely” (p. 438). Referring to the game as a problem twice (p. 434), and an
experiment twice (p. 455) Turing defined the quest for thinking machines as a test 3
times (on pages 446, 447 and 454).
There are a number of scenarios Turing takes the reader through in his 1950 paper -
little wonder, then, at the division among interpreters of Turing's text. This is without
considering Turing's embellishment in 1952. However, who would dispute that at the
core of Turing's enterprise is a machine's text-based capacity to provide satisfactory
and sustained answers to any questions put by a human interlocutor? In the next
section Turing's 1950 scenarios are explored to find whether a general interpretation of
how the game/experiment/test should be conducted can be raised from the opacity.
1.2.2 Can machines think?
In 1948, Turing prefaced the notion of a machine imitating a human with an
artificial player of chess. In 1950, Turing replaced chess-playing with thinking
imparted through ‘satisfactory’ responses to questions. Using imitation, accepted as a
form of flattery, as a means to measure a machine’s humanness on one level, Turing
evolved his 1948 chess game to a textual duel now requiring the machine to perform
on its own with no mathematician operator, with C the poor chess player of 1948
replaced by an "average interrogator" (1950: p. 442). Declaring it a "dangerous"
attitude in 1950 to reflect on the normal use of the words ‘machine’ and ‘think’ when
considering the question “Can machines think?”, Turing excluded “men born in the
usual manner” from taking the part of a machine in the game. Turing asked “May not
machines carry out something which ought to be described as thinking but which is
very different from what a man does?” (1950: p. 435). Turing quite rightly asked that
question, realising that a man does not think like every other man; a man does not think like
a woman; an Occidental woman may not think like a woman from the Orient.
Prescribing an open mind when pondering over the words ‘machine’ and
‘think’, Turing opined that considering their everyday usage would be as
absurd as seeking definitions from a statistical survey such as the “Gallup
poll” (p.433). Thus, Turing avoided replying to the question “can machines
think?” taking a diagonal detour through the “unambiguous” and “new form of
the problem” - a machine compared against a human both asked questions by a
human interrogator (p. 433).
1.2.3 Distinguishing Textual Man from Woman
Etching out a philosophical sport of three participants, initially with a man and a
woman both located in a separate room from a male or female interrogator, Turing
framed his ‘thinking exercise’ through a textual game of imitation and deception. The
man would take part as ‘A’, the woman as ‘B’ and the interrogator as ‘C’ (p. 433).
Turing suggested that it would be best for the answers to be sent via
teleprinter communication between the rooms - as can be done through a local
network now (p. 434). The object of Turing's scenario is for the interrogator (C),
without seeing or hearing A and B, to correctly identify the man and the woman based
on their written answers to questions put to them both. The aim of A is to deceive C
by pretending to be the woman; the interrogator’s task is to resist this deception.
Turing directed that the woman - B, must assist the interrogator C: “best strategy for
her is probably to give truthful answers”, suggesting that she might be weaker at
deception. Did Turing expect the deceiving man A in this scenario to appear more
womanly to the interrogator C than the woman B, or could C classify them both, A
and B as ‘female’ – the indistinguishability score? Turing did not say. Expecting the
man to appear more womanly, i.e., to cause the interrogator to misclassify B as the
man, is an altogether different and stringent test. The author feels it is
indistinguishability that Turing was driving at. Further, as Turing did not preclude
women from interrogating, he therefore did not feel women would be weaker at
deception-detection.
It is not important, in the overall account of the machine-human comparison, for us
to dwell on why Turing selected the female to answer truthfully; suffice it to say that the
three-human game acted as an initiation ceremony for the reader. What should be noted in
Turing's musings on questions and answers is that, in his imagining of the game with the
two hidden human participants, A and B, both had access to each other's responses to
C's questions, so that the woman could say "I am the woman, don't listen to him!" to
any of A’s answers (p. 434). Nonetheless, Turing reminded us that “it will avail
nothing as the man can make similar remarks” (ibid).
Hence, the first item for the 3-participant machine-human textual comparison
Turing proposed is that the two who are questioned by an interrogator should not have
access to each other’s responses, else the exercise will loop into a futile and fruitless
investigation, especially because if they are asked the same question, one could copy
the other's response exactly throughout the game. What should be pointed out here is
that, though Turing used the man-woman scenario as an introduction to the computer-
human comparison, he could have used any other contrast. For example, he could
have chosen a comparison test between a native English speaker and a non-native
English speaker, and have their differences in English conversational ability tested by
an interrogator tasked with determining which is the native English speaker. This test
could then be conducted with a machine/native English speaker conversation
experiment without letting the interrogator know that one of the humans had been
replaced with a machine. The ultimate focus is on the machine’s ability to imitate
human-like text-based answers.
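To make the isolation requirement concrete, the sketch below models one such round in Python. It is an illustration only, not Turing's description nor the procedure of the experiments in Part II, and the names (HiddenWitness, run_round, verdict) are assumptions introduced for the example: each hidden witness answers the interrogator over its own channel, neither sees the other's transcript, and the interrogator is never shown which terminal hosts the machine.

    # Illustrative sketch only: one three-participant round with isolated channels.
    import random

    class HiddenWitness:
        def __init__(self, nature, respond):
            self.nature = nature       # 'human' or 'machine'; never revealed to the judge
            self.respond = respond     # callable mapping a question to an answer string
            self.transcript = []       # private channel; the other witness cannot read it

    def run_round(questions, witness_a, witness_b, verdict):
        """Question two hidden witnesses in parallel, then score the judge's verdict."""
        left, right = random.sample([witness_a, witness_b], 2)   # random terminal assignment
        for q in questions:
            for w in (left, right):
                w.transcript.append((q, w.respond(q)))           # answers stay on separate channels
        guess_left, guess_right = verdict(left.transcript, right.transcript)
        return (guess_left, guess_right) == (left.nature, right.nature)

    # A verdict of ('human', 'human') is the 'both human' outcome Turing left open:
    # correct for a human-human control pair, a deception when one witness is a machine.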
1.2.4 Three way Machine-human Comparison
At the end of section 1 of CMI, having introduced an imitation game in which a man
must deceive an unseen, unheard interrogator that he is the woman, Turing asked
readers to consider "What will happen when a machine takes the part of A in this
game? Will the interrogator decide wrongly as often when the game is played like this
as he does when the game is played between a man and a woman?" (p. 434).
This question suggests Turing wanted the interrogator to be central to the test and the
machine imitate a woman: how often the interrogator decides an entity is the woman
in the man-woman scenario compared to how often an entity is ranked the woman in a
machine-woman test. However, as we shall see with Turing’s elaborations later in the
1950 paper, imitating a woman is not central to the design of the machine; it is
imitating human-like utterances. Thus the comparison ratio, of how well the
interrogator does in recognising the woman correctly, in the man/woman pair, against
how often the machine is identified correctly, does not serve any purpose in terms of
measuring the ability of the machine to answer questions in a satisfactory and
sustained way. Turing did not state that the interrogator could not return a result of
'both human' after questioning a human-machine pair. This result, Turing's
indistinguishability, is the goal of the machine in an imitation game (see figure 2).
Figure 2: Three-participant, judge/machine-human test
Turing put forward his own belief that "in about fifty years' time it will be possible
to programme computers, with a storage capacity of about 10⁹, to make them play the
imitation game so well that an average interrogator will not have more than a 70 per
cent chance of making the right identification after five minutes of questioning”
(p.442). Turing was correct in one sense: by the end of the century, in Loebner’s Prize
for Artificial Intelligence, in 1991 (see chapter 3) some judges did incorrectly rank
machines as humans in restricted-conversation games.
From Turing’s statement of an average interrogator not having more than 70%
chance of correctly identifying the machine, a pass rate of 30 per cent is not
unreasonable. An interrogator can correctly identify the machine in a machine/human
pair 50 per cent of the time by random guessing. Thus, the maximum machine
deception rate being 50%, if a machine is classified as human 30 per cent of the time
in a machine/human test, then it has imitated well and succeeded in Turing’s imitation
game.
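As a simple worked gloss (added here for clarity, using the figures this thesis reports for experiment 2 in chapter 5), the rates line up as follows:

    deception rate = machine ranked human / machine-human tests = 5 / 60 ≈ 8.33%

a figure below the 30 per cent pass threshold read from Turing's 70 per cent statement, and well below the 50 per cent rate expected of an interrogator guessing at random.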
1.2.5 Question and Answer Method
Having introduced a game in which successful imitation of human-like responses can
cause deception, Turing claimed that the question and answer [6] method was "suitable
for introducing almost any one of the fields of human endeavour” the interrogator
might wish to include (p. 435). The interrogator is not allowed to seek any practical
demonstrations during questioning (p. 446), no matter how much the hidden entities
may boast about their appearance or prowess. Turing pointed out the limitations of the
machines at that time: “there will be some questions to which it will either give a
wrong answer, or fail to give an answer at all however much time is allowed for a
reply” (p. 444). Turing wrote “I am often wrong, and the result is a surprise for me”
(p. 451), but, he asked, would it be fair to deem machines worse for not making
mistakes? (p. 448).
One might suppose that closed questions, with 'yes' or 'no' answers, were more
appropriate than the type of questions machines were believed likely to fail on, for
instance those eliciting an opinion or a visceral description, such as "What do you
think of Picasso?" (p. 445). Yet of such 'open questions' Turing reminded readers that
"it has only been stated, without any sort of proof, that no such limitations apply to
the human intellect" (ibid): a human may have no opinion on a matter simply through
not having access to a piece of knowledge owned by the interrogator. Turing's point is
borne out when hidden humans, comparators for the machines, do not share the same
'general knowledge' as the interrogator. In one such case a female was deemed to be the
machine. Shah and Warwick (2009) show, from one of the three-participant practical
Turing tests conducted in 2008 (see chapter 5), that a non-native female hidden human
was deemed the machine by a native English-speaking interrogator because she did not
know who Sarah Palin, the US Republican Party's Vice-Presidential candidate in 2008,
was (see table 1).
Table 1: Extract from Turing Test 2008 - Session 1-Round 9, Terminal A, Judge (J13) Hidden-human
Judge: What do you make of Sarah Palin?
Hidden-Human: Sorry don’t know her
Judge: How can you possibly not know her? What have you been doing for the last two
months?
Thus Turing's game rests heavily on the finer points of the interrogator's
performance:
- what questions the interrogator chooses to ask
- supposition of what constitutes 'general knowledge'
- deception-detection rate
The interrogator’s role entails selecting the most appropriate questions, overcoming
assumptions about possessed knowledge, and detecting deception each time to
correctly identify the nature of hidden pairs of interlocutors (see figures 2 and 3).
Turing poured scorn on the illusion of “feeling of superiority” if an interrogator
met with a wrong answer from a machine, and stressed “We [humans] too often give
wrong answers to questions ourselves to be justified in being very pleased at such
evidence of fallibility on the part of the machines” (p. 445). Dismissing those
interrogators who felt they had won a point, “on such an occasion in relation to the
one machine over which we have scored a petty triumph” (p. 445), Turing reminded
“There would be no question of triumphing simultaneously over all machines” (p.
445), such as not being able to win in a race against an aeroplane (p. 435). If the
machine’s answers were regarded as “satisfactory and sustained” Turing argued, then
that would not be “an easy contrivance” (p. 447).
Simultaneously interacting with a pair of hidden participants for five minutes, and
based on the intellectual capacity demonstrated through responses to questions the
interrogator asks, the human interrogator must declare what they think each hidden
witness is, human or machine. Alluding to the machine's superiority in arithmetic,
Turing reminded that a human attempting to act artificial "would be given away at once
by slowness and inaccuracy in arithmetic" (p. 435). While Turing introduced examples
of the kind of question it would be possible to ask a hidden interlocutor, "Do you play
chess?" (p. 434), he missed a salient point: how interrogators should ask questions. As
we will see in chapter 5 on practical three-participant Turing tests conducted in 2008,
question strategy can undermine the interrogator's deception-detection and lower their
rate of correct identification. Shah and Warwick (2010c) show that hidden interlocutor
misidentification can be quite high depending on what kind of questions the
interrogator chooses to ask.
Ignoring Turing’s reminder relating to arithmetical questions that a “machine
(programmed for playing the game) would not attempt to give the right answers to the
arithmetical problem” (p.448), during a live practical Turing test an interrogator asked
both hidden entities in a pair “can you calculate the root square of 67890444?” (see
chapter 5). This interrogator wrongly ranked both hidden interlocutors as ‘human’ on
the strength of responses “I can calculate a division by zero!” and “oh, please bother
my aunt Sonya with all this arithmetics” (ibid). This brings us to a major problem of
interrogating during a practical Turing test: that of subjectivity. Each interrogator
brings to the test their own opinion of what constitutes a human-like response to their
set questions. However, Turing furnished a solution to the problem of interrogator
subjectivity: “A number of interrogators could be used, and statistics compiled to
show how often the right identification was given" (p. 436). Turing revealed how many
that number might be in 1952, when he used the term jury. In Crown Court trials in the
UK a jury consists of twelve people who decide on the facts and whether a defendant is
guilty or not (8 jurors are used in county courts, 7-11 in a Coroner's Court). Twelve
is the number of jurors used in serious criminal cases such as murder, burglary or
fraud (HM Courts Service). Hugh Loebner, sponsor of the Loebner Prize for
Artificial Intelligence, a platform for his interpretation of the Turing test, considers
four judges are sufficient (Loebner Prizes 2004- 2007 and in Loebner Prize 2009),
whereas the author used over a hundred interrogator judges across two Turing test
experiments (see chapters 5-7).
1.2.5.1 Two-participant Interrogator-witness game
In section 6 of CMI in ‘Contrary Views’ (1950), Turing put forward his responses to
anticipatory objections to a thinking machine, including an Argument from
consciousness (point 4). Turing addressed what seemed to him as Professor
Jefferson’s solipsistic argument requiring a mechanism to “feel pleasure at its
successes, grief when its valves fuse, be warmed by flattery, be made miserable by its
mistakes, be charmed by sex, be angry or depressed when it cannot get what it wants”
(p. 446). By Jefferson’s criterion an autistic human may not fulfil the requirements. In
CMI section 6 Turing introduced an alternative machine thinking test, the viva voce,
reducing the number of participants from three to two. The dynamic of the
investigation into thinking altered from comparison of a machine against a human to
direct one-to-one questioning of the machine, the 'witness', by a human interrogator.
In Turing's response to the consciousness objection to thinking machines, the sample
questions Turing imagined an interrogator putting to the witness are not gender-
specific, suggesting that Turing did not have in mind the development of a machine
simulating a man pretending to be a woman, one interpretation of machine design (see
chapter 2 for a discussion of gender and Turing's imitation game):
Interrogator: Would you say Mr. Pickwick reminded you of Christmas?
Witness: In a way (p. 446).
What Turing’s exchange shows is that he was aware that an interrogator is likely to
ask questions about topics they are familiar with and have an interest in, and for a
machine to respond appropriately would require sharing some of that knowledge and
an inference capacity, a tall engineering order which Turing mitigated by pointing out
that if a machine put forward sustained, satisfactory responses this could not then be
described as "an easy contrivance" (p. 447).
1.2.6 Properties and performance of the machine
In order for the machine to manifest its intellectual capacity by textually replying to
questions from an interrogator, the sagacious Turing permitted only digital computers
to take part in the test, because “these machines are intended to carry out any
operations which could be done by a human computer” (p.436). Concerning
constructing such a machine, Turing advocated understanding the sequence of
operations that a human carries out when completing a complex task: “If one wants to
make a machine mimic the behaviour of the human computer ... one has to ask him
how it is done" (p. 438). An example of this in practice is modern artificial dialogists
such as Jabberwacky (http://www.jabberwacky.com/) and Cleverbot
(http://www.cleverbot.com/), each housed on its own web page on the Internet.
These systems record in their digital memory store every utterance users post
during interactions. At the time of writing (4 February 2010), 1077 users were
simultaneously interacting with Jabberwacky on-line. By recording utterances, the
system 'learns' language and context: how humans say things in normal every-day
language, how they formulate queries and how they respond to questions. Carpenter
contends that, with a database of ten million 'items', his system will "appear human to
most people most of the time" (in Shah and Warwick, forthcoming, 'Constraining
Random Dialogue in a Modern Eliza').
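The general approach can be pictured with a toy sketch in Python (not Carpenter's actual implementation): record prompt-response pairs as they are observed, then answer a new input with the stored human reply whose recorded prompt overlaps it most.

# Toy illustration of learning-by-recording, in the spirit of systems such as
# Jabberwacky/Cleverbot; not the actual mechanism of any deployed programme.
from collections import Counter


def words(text):
    return Counter(text.lower().split())


class RecordingChatbot:
    def __init__(self):
        self.memory = []  # list of (prompt, human_reply) pairs seen so far

    def record(self, prompt, human_reply):
        """Store every exchange verbatim, as described above."""
        self.memory.append((prompt, human_reply))

    def respond(self, user_input):
        """Reply with the stored human reply whose prompt best overlaps the input."""
        if not self.memory:
            return "Tell me more."
        query = words(user_input)
        def overlap(pair):
            return sum((words(pair[0]) & query).values())
        best_prompt, best_reply = max(self.memory, key=overlap)
        return best_reply


bot = RecordingChatbot()
bot.record("do you play chess?", "Now and then, badly.")
bot.record("where do you live?", "On a small island, actually.")
print(bot.respond("Do you ever play chess?"))  # -> "Now and then, badly."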
Turing felt an interesting design variant for the machine would be to inculcate
random elements allowing it to make choices that “depend on the digits of the decimal
for π” (p. 438), thus giving the machine the impression of having free will. Absence
of diverse behaviour could be resolved through increased storage capacity (p. 449).
This is seen in cyberspace with modern artificial conversationalists populated across a
number of computer servers. For example, when asked “how many is a million?”
during the knowledge phase of Chatterbox Challenge 2005, an interrogator-witness
competition, Jabberwacky replied laconically, “More than 12” (in Shah and Warwick,
forthcoming). This may be considered humorous, however Jabberwacky, unlike
humans, does not have the ability to constrain its random output. In one practical
Turing test Jabberwacky uttered: “I’m going to Mars and I am going to smash your
face with a baseball bat!” in response to an interrogator’s utterance “I live on
8
Cleverbot: http://www.cleverbot.com/ date visited: 4.2.10; time: 13.35
9
Jabberwacky: http://www.jabberwacky.com/ date visited: 4.2.10; time: 13.03
10
Constraining Random Dialogue in a Modern Eliza
1-30
Martha’s Vineyard”. The shocked interrogator followed Jabberwacky’s response with
“Well, I do hope you’re the program, because if you’re human. . . Well, let’s jus say
that’s a little rude” (p. 13 in Shah and Warwick, forthcoming). Returning to
randomness later in his 1950 thesis, Turing considered it wise to include this element
for a learning machine, because it would be useful when “searching for a solution of
some problem” (p. 459).
Turing envisaged other requirements for the machine, including a bi-functional
store acting as a paper might for a human to write information on, and for a memory
to recall from (p. 437). If the system were to rest on logical inferences, but not
necessarily “satisfying the most exacting logician” (p. 458) in order to respond to a
question correctly, then, Turing wrote, the store would largely need to be "occupied
with definitions and propositions" (p. 457), such as well-established facts and
mathematically proved theorems. Turing believed that a capacity of between 10^7 and
10^9 for the store would be sufficient for the machine to play the imitation game, the
lower figure a "very practicable possibility" with the technology at the time of his
writing (p. 455). On the significance of memory Turing wrote “Actual human
computers really remember what they have to do” (p. 438). The machine should also
have an executive unit, to carry out operations, and a control to ensure that
programmed instructions are followed correctly and in the right order. However, the
machine would not do well if all it could do were recall every fact recorded in its store
and regurgitate them during practical Turing tests. Humans are inferior to machines in
this respect; they are afflicted with forgetting, losing data relating to episodes in their
life: their memory is fallible. Machines are not expected to demonstrate the kind of
remarkable non-stop and automatic remembering reported by Parker, Cahill and
McGaugh in the unique woman, AJ with hyperthymestic syndrome (2006).
Speed of operation of the machine is another important feature Turing
acknowledged (p. 439), one that can lend itself to easy discernment from the human,
as shown in experiment 2 of this study (see chapter 5). A non-native English
interrogator focussed on the speed of response and length of utterance of the two
hidden interlocutors he questioned during a machine-human simultaneous comparison
to distinguish the machine from the human (see chapter 5). The longer the sentence
and the alacrity with which it was presented on screen usually meant the entity
responsible for it was a machine (Defeng Wu, non-native judge in 'Can a machine
think?', the 18th Loebner Prize/Turing tests:
http://www.reading.ac.uk/research/Highlights-News/featuresnews/res-featureloebner.aspx).
Turing's prescience alluded to this effect: "It is
probably not necessary to increase the speed of operations of the machines at all” (p.
455).
1.2.7 Short distance, plenty to do
Though Turing did anticipate an embodied machine in his 1947 paper, with limbs
replacing the Automatic Computing Engine’s need for a master programming it and a
servant feeding it cards, he dismissed the need for the machine to take on a body with
artificial skin when competing in the imitation game, because it is the machine's
intellectual ability that is under consideration, not its appearance (see
section 1.1.2 earlier). Asking “What steps should be taken now if the experiment is to
be successful?” (p. 455), Turing pointed to programming as the key to building a
machine to imitate an adult human mind: "the problem is mainly one of programming
... how to programme these machines to play the imitation game" (p. 455). Three
components make up the process of building such a machine according to Turing: a)
initial state, such as at birth in the human; b) education, and c) experience. He
suggested a way forward: “Instead of trying to produce a programme to simulate the
adult mind, why not rather try to produce one which simulates the child’s?” (p. 456).
Turing divided the problem of building a child machine into two parts: the child
programme and its education (p. 456). Through a process of evolution involving
teaching, Turing believed researchers could ascertain how well the machine learns,
drawing an analogy with natural evolution (p. 456):
Structure of the child machine = Hereditary material
Changes in the child machine = Mutations
Natural selection in the child machine = Judgment of the experimenter
Seeking a more expeditious process than evolution, Turing advocated some tuition
with the principle of punishments and rewards (pp. 456-457): "The machine has to be
so constructed that events which shortly preceded the occurrence of a punishment-
signal are unlikely to be repeated, whereas a reward-signal increased the probability
of repetition of the events which led up to it" (a principle sketched in code at the end
of this section). Astonishingly for the time, Turing confessed "I have done some
experiments with one such child-machine, and succeeded in teaching it a few things",
though he conceded that "the teaching method was too unorthodox for the experiment
to be considered really successful" (p. 457). Turing's
example of the type of ‘complete system’ needed to play the imitation game would
include constructing the machine to automatically act upon an imperative proposition
with the appropriate behaviour. Turing gave the example of a teacher ordering the
machine to do its homework, noting that "the imperatives that can be obeyed by a
machine that has no limbs are bound to be of a rather intellectual character" (p. 458),
not unlike
limbless humans. Defining the role of the teacher, Turing preferred them to be largely
ignorant of the machine’s internal mechanism beyond being able to predict some of
the machine’s behaviour (ibid), thus independent from the machine’s mechanic or
programmer.
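Turing's punishment and reward principle, quoted above, can be pictured with a minimal sketch; this is an illustration under modern assumptions (a table of action probabilities and an arbitrary learning rate), not a reconstruction of Turing's own child-machine experiments.

# Toy sketch of Turing's punishment/reward principle: a reward-signal raises the
# probability of the action that preceded it, a punishment-signal lowers it.
# The probability table and learning rate are illustrative assumptions.
import random

actions = ["repeat back", "ask a question", "stay silent"]
prob = {a: 1.0 / len(actions) for a in actions}  # initial, untaught state
LEARNING_RATE = 0.2


def choose_action():
    r, cumulative = random.random(), 0.0
    for a in actions:
        cumulative += prob[a]
        if r <= cumulative:
            return a
    return actions[-1]


def teach(action, signal):
    """signal = +1 for a reward, -1 for a punishment from the teacher."""
    prob[action] = max(0.01, prob[action] + LEARNING_RATE * signal)
    total = sum(prob.values())          # renormalise so probabilities sum to 1
    for a in actions:
        prob[a] /= total


# One round of tuition: the teacher rewards questioning, punishes silence.
teach("ask a question", +1)
teach("stay silent", -1)
print(prob)
print("Next action:", choose_action())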
1.2.8 Intelligence and the 1950 Turing test
Turing's 1950 paper is entitled 'Computing Machinery and Intelligence'; in it he
introduces the reader to the type of machine he envisaged playing the imitation game
– a digital computer. Turing did not define intelligence; what he did do was refer to
intelligent behaviour, which "presumably consists in a departure from the completely
disciplined behaviour involved in computation, but a rather slight one, which does not
give rise to random behaviour, or to pointless repetitive loops” (p. 459). Thus he
outlined a question-answer arrangement in which he believed a machine could
demonstrate thinking and would eventually be able to compete with humans in
intellectual fields alone. If the machine succeeded in deceiving an interrogator into
believing it was human in such a test, then that machine should necessarily be
attributed with the ability to think, and would have passed Turing's test, 1950-style.
1.3 Later Papers
Turing followed CMI with essays and lectures expanding, embellishing and clarifying
the ideas he announced in 1950.
1.3.1 Turing 1951a: Intelligent Machinery, A Heretical Theory
In one of two papers presented for radio discussion programmes in 1951 ('Intelligent
Machinery, A Heretical Theory', BBC Home Service, Manchester studio, for the '51
Society), explicating
his confidence in what machines could do, and in order to refute the argument “You
cannot make a machine to think for you” (in Shieber, 2004: p. 105), Turing stated “it
has been shown that there are machines theoretically possible which will do
something very close to thinking” (in Copeland, 2004: p. 472). Of the fallibility of
machines in his time, Turing admitted it “certainly does not behave in a very
satisfactory manner, for it continues to work indefinitely without producing any result
at all" (ibid). Turing likened this behaviour to that of an assiduous
mathematician resolved to find solutions to an intractable problem. Errors creep into
work unconsciously. Turing contended machines could be constructed to very closely
simulate the human mind with behaviour that consisted of occasionally making
mistakes, at times making “new and very interesting statements ... the output of them
will be worth attention to the same sort of extent as the output of a human mind” (in
Copeland, 2004: p. 472). Alluding to his 1950 paper, learning from experience is
mentioned in the 1951 radio broadcast as requirements of the machine to mitigate
repeated mistakes.
Educating the machine requires a “highly competent schoolmaster” who, according
to Turing’s dictum, must be unaware of its inner workings (p. 473), and who should
transform the machine from a simple to a more elaborate system. With the aid of a
mechanic “permitted to keep the machine in running order” (ibid), the process would
produce a “reasonably intelligent machine” according to Turing. He again used the
example of the machine playing chess, as he did in his 1948 paper, and supposed that
the machine understood English allowing it the use of text-based communication to
record and receive remarks “owing to its having no hands or feet” (ibid). In an
advanced stage of its education, the machine could forge new processes itself,
resulting in highly sophisticated and highly satisfactory forms of rule; Turing
analogised this with engineering problems that are sometimes solved by the crudest
rule-of-thumb procedure dealing with the most superficial aspects of the problem.
Repeating his 1950 suggestion for a random element in the machine Turing urged
its incorporation to enable the machine to make choices. Accepting there would be
resistance to his ideas, Turing provoked intellectual society declaring “these machines
are a genuine possibility” (ibid). He remonstrated that rather than there being a threat
of redundancy the intellectuals would be kept busy striving to keep up with the
machines and understanding what the machines were trying to say. Turing concluded
“once the machine thinking method ... started ... there would be no question of the
machines dying” predicting “they [the machines] would be able to converse with
each other to sharpen their wits” (in Copeland, 2004: p. 475). Though Turing does
not mention the imitation game explicitly in this discourse he does affirm his position
that machines could be constructed to simulate thinking and that he, Turing, knew
how to construct such a machine: “which would give a very good account of itself in a
range of tests” if it were made elaborate enough (in Shieber, 2004: p. 106).
Constructing a thinking machine, according to Turing, required at least two people
with different expertise:
- a schoolmaster charged with educating the machine
- a mechanic only permitted to keep the machine in running order
The machine could function by incorporating within its memory a chronological list
of all statements made to it and by it, together with an alphabetical index of its
experiences, including how often words are used and the occasions of their use.
Turing suggested that at an advanced stage the memory could be extended enabling
the machine to remember its thoughts and what those thoughts were. Inculcating a
choice-selection feature would aid intelligent recall and present contextually relevant
items during interaction allowing comparison of good/bad outcomes from previous
situations. Copeland notes that Turing does not mention his ‘indexing’ idea anywhere
else in his musings on thinking machines (2004:p. 466). Copeland further observes
that Turing brushes aside the one main mathematical objection to the idea of
intelligent machinery through his opinion of the machine’s ability to learn new
methods and techniques: “the machine’s tutor ... a human mathematician ... can just
present the machine with a better method whenever the machine produces an incorrect
answer to the problem” (2004: p. 470). Shieber notes Turing’s 1951 Heretical Theory
paper discusses single-aim systems developed commercially to carry out a very
specific task with certainty and considerable speed (2004: p. 105). This is where
current technology mainly resides, as domain-specific expert systems. Copeland
reminds us Turing had a “profound grasp of the potential” of future machines (2004:
p. 470).
1.3.2 Turing 1951b: Can Digital Computers Think?
The second BBC radio broadcast in 1951 in which Turing took part covered a lecture
series on automatic calculating machines. Confessing his partiality, Turing uses his
talk on ‘Can Digital Computers Think?’ to assert once again his view that “it is not
altogether unreasonable to describe digital computers as brains” (in Shieber, 2004: p.
111). Turing imputed to the 1950s man in the street an acceptance of the machine "as
a sort of brain" upon hearing "amazing accounts of what machines could do"
(ibid), an acceptance that allowed lay people to appreciate the machine's accomplishing
of tasks of an intellectual nature, as can be witnessed in some children and adults today. On the
strength of science fiction movies (Bartneck, 2004), some believe plenteous
intelligent machines populate parts of the planet and that we share earth with cyborgs,
robots, HAL-type natural language understanding verbal communicators (Stanley
Kubrick and Arthur C. Clarke's 2001: A Space Odyssey, 1968), the almost
indestructible machine in the Terminator movies (James Cameron's The Terminator,
1984; Terminator 2: Judgement Day, 1991) and Blade Runner's replicants yearning for
a longer life (Ridley Scott, 1982). In contrast, in the second decade of the 21st century
Turing's sceptics exist, philosophers and scientists who consider the notion of a
thinking machine contemptible and who prefer to toe the party line begun by
Lady Lovelace, that a machine "has no pretension whatever to originate anything. It
can do whatever we know how to order it to perform” (in Copeland, 2004: p.482).
Turing conceded this was true of the way machines were used in his time: “it is fair to
say that the machine doesn’t originate anything … validity depends on considering
how digital computers are used rather than how they could be used” (ibid). Turing
asserted “I believe that they [digital computers] could be used in such a manner that
they could appropriately be described as a brain” (ibid).
In order to imitate another machine effectively, Turing asserted, a universal
machine must be programmed to "calculate what the machine in question
would do under given circumstances and in particular what answers it would print
out” (in Copeland, 2004: p. 483). Then such a universal machine, programmed to
imitate, would present those same responses. Referring to the fleshed brain of material
bodies in the natural world, Turing posited “if it is accepted that real brains, as found
in animals, and in particular in men, are a sort of machine it will follow that our
digital computer, suitably programmed, will behave like a brain” (ibid). In this talk
Turing repeated the necessity for the imitating machine to be equipped with sufficient
storage capacity and referred to its speed, adding that the machine being imitated "should
be the sort whose behaviour is in principle predictable by calculation” and that it is
“more like a calculator than a bulldozer” (ibid). Marking the inscrutability of the task,
Turing acknowledged “we certainly do not know how any such calculation should be
done” (ibid). Turing thought it might need a much larger machine than was available
in his time to imitate anything as complicated as the human brain, unless progress was
made in building smaller machines with the capacity and speed of large machines, a
foreseeable future Turing probably expected through progress and alternative
techniques.
On the appropriate programming necessary to construct a machine to imitate
another type of machine, Turing wrote: “the more complicated the machine to be
imitated the more complicated the programme” (ibid). What is being imitated is a
human-like response by a machine to questions in a textual viva-voce examination
that gives the impression of its possessing free will: “making its behaviour depend on
something like a roulette wheel” (in Copeland, 2004: p. 484). Turing dismissed the
need for embodiment in the machine, likening artificial flesh to the unpleasantness of
artificial flowers, which emphasised Turing's point of view that "it is only the intellectual
behaviour of the brain that need be considered" (p. 476). Turing intimated that it
would not be difficult to deceive a non-expert: “It is not difficult to design machines
whose behaviour appears quite random to anyone who does not know the details of
their construction” (p. 484). However, Turing admitted that he did not have anything
substantive to put “before a jury” on how to write the ‘appropriate programme’ that
could convince others it was thinking: “we do not know how to programme digital
computers to behave like brains” (p. 485). Turing believed the “process should bear a
close relation to that of teaching” (p. 485), and this seemingly big obstacle would be
overcome “most certainly within the next millennium” whether people liked the idea
or not (p. 486).
It is a wonder how Turing felt he could persuade optimists, let alone incentivise
sceptics, to begin the search for techniques to create the necessary imitating
programme with his “we do not know how” revelation in this paper. Perhaps this is
because he did not know yet but fully believed a time would come when this would be
possible. Copeland notes Turing’s “glorious analogy” of the construction problem as
that of attempting to write a treatise on family life while on Mars with insufficient
paper (2004: p. 408). It could be the case Turing was hinting at some future insight
through technological advances that could be applied to resolve the problem. Turing
was “personally inclined to believe that such a programme will be found” (ibid),
warning “if a machine can think, it might think more intelligently than we do” (ibid)
fuelling the fear-of-the-foreigner, 'they will never be like us', type of argument against the
machine put forward decades later by detractors such as Block and Searle (see
chapters 2 and 3). Indeed, programmes now exist that recall information and imitate a
human’s ability to refer back to earlier utterances in a dialogue, and answer some
questions correctly. One interrogator’s surprise at a machine’s performance occurred
during a live one-to-one test over the Internet examining artificial conversationalists
for conversational ability and emotion conveyance in 2008 (see chapter 4). The
American psychologist informed a machine he was interrogating that he possessed a
certain coloured car: "My car is red. What color is my car?" Later during the interaction
the machine was asked the colour of the car: “The importance of Eugene [the system]
correctly answering the question, and then remembering its answer when later asked
cannot be understated. That is a great achievement” (Shah and Warwick, 2010b).
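A toy sketch of this kind of fact recall might look as follows; the pattern matching and wording are assumptions for illustration only, not a description of how the Eugene programme actually works.

# Toy illustration of recalling a fact stated earlier in a dialogue; this is not
# how the Eugene programme works, just a sketch of the general idea.
import re

facts = {}  # e.g. {"car": "red"}


def listen(utterance):
    """Capture simple 'My X is Y' statements."""
    match = re.search(r"my (\w+) is (\w+)", utterance.lower())
    if match:
        facts[match.group(1)] = match.group(2)


def answer(question):
    """Answer 'What color is my X?' style questions from stored facts."""
    match = re.search(r"what colou?r is my (\w+)", question.lower())
    if match and match.group(1) in facts:
        return f"Your {match.group(1)} is {facts[match.group(1)]}."
    return "I'm not sure."


listen("My car is red.")
print(answer("What color is my car?"))  # -> "Your car is red."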
Repeating his end of (20th) century augury made in his 1950 paper (see section
1.2), in the second 1951 radio broadcast Turing predicted “I think it is probable ... that
at the end of the century it will be possible to programme a machine to answer
questions in such a way that it will be extremely difficult to guess whether the
answers are being given by a man or by the machine” (in Copeland, 2004: p. 484).
The first part of Turing’s forecast has been achieved to an extent. Machines are able
to answer questions put to them. Nonetheless it would be hyperbole and factually
incorrect to say the textual responses provided by these systems were difficult to
distinguish from humans at the end of the last century. The picture is not bleak,
however, as borne out in Demchenko and Veselov's programme Eugene, with its ability
to record and recall the colour of the interlocutor's car. This kind of performance can
only improve with the enthusiasm of developers for building systems to entertain,
educate and attempt to beat the Turing test. Concluding the 1951 lecture, Turing
believed the pursuit of thinking machines would "help us greatly in finding out how
we think ourselves” (p. 486). It may also help in finding out why some people are
easily deceived while others are not.
1.3.3 Turing 1952: Can Automatic Calculating Machines Be Said To
Think?
Turing et al.'s 1952 BBC radio discussion 'Can automatic calculating machines be
said to think?' (broadcast on 14 and 23 January 1952, with Richard Braithwaite,
Geoffrey Jefferson and Max Newman) is not only as important as Turing's 1950 paper,
it is essential for a complete understanding of Turing's idea for a thinking machine. Of
'thinking' Turing explained:
"I don't want to give a definition of thinking, but if I had to I should probably
be unable to say anything more about it than that it was a sort of buzzing that
went on inside my head" (in Copeland, 2004: p. 494).
The broadcast provides a fascinating insight into the capacity of machines in 1952 and
what the speakers considered was possible for them to achieve. Turing was relentless
about the possibility of simulating the brain’s logical connections in a machine. He
once again displayed his prescient character, advising of the necessary storage
capacity for the machine and its speed in order to compete in a test. In this 1952
discussion Turing reaffirmed the two-entity, one-to-one game and proposed a revised
prediction for deception success.
Copeland characterises Turing’s canvass for a thinking machine as “the earliest
known recorded discussion of artificial intelligence” (2004: p. 487), noting that
Geoffrey Jefferson (the neurosurgeon, b. 1886, d. 1961), part of the four-man radio
discussion group, took pot shots at the notion of machine thinking, which Turing and
Max Newman (the Bletchley Park mathematician, b. 1897, d. 1984) easily brushed
aside. Richard Braithwaite (philosopher of moral science, University of Cambridge,
b. 1900, d. 1990) chaired the auspicious panel, beginning with a consideration of
the proper sense of the word think. Turing had deemed this pursuit a dangerous
attitude two years earlier in his 1950 paper. Nonetheless, the question of what
thinking is, what thinking involves, and thus how to define thought before it can be
agreed that machines could engage in it, dominated the 1952 radio discussion.
Surprisingly, not only did Turing use the discussion to elaborate further on the
interrogator-witness one-to-one test conducted by a jury panel, he announced a
circumspect prediction for the arrival of thinking machines: “at least 100 years” (in
Copeland 2004: p. 495).
1.3.3.1 Thinking and the Interrogator-witness test
Braithwaite threw down the gauntlet by charging “it all depends on what is to be
included in thinking. The word is used to cover a multitude of different activities",
which Jefferson countered by saying that defining ‘thinking’ would be a waste of time
beyond saying that common usage would include “having ideas in the mind,
cogitating, meditating, deliberating, solving problems or imagining" (p. 494). Jefferson
noted that the word 'man' derived from Sanskrit (the classical language of India
as Latin was to Europe), which means to think “probably in the sense of judging
between one idea to another” (ibid). Jefferson added that thinking is everything that
goes in the brain: “the sum total of what the brain of man or animal does” (ibid).
Turing had already expressed a disdain for defining thinking in this way. When directly
asked by Jefferson what he thought, and whether he had a mechanical definition, Turing responded
that he preferred to draw a line “between the properties of a brain, or of a man, that
we want to discuss, and those that we don’t”. Turing added that he was “unable to say
anything more about it than that it was a sort of buzzing that went on inside my head”
(ibid). At this point of the discussion Turing grabbed an opportunity to explain how
the two-participant viva voce imitation game, introduced in 'Contrary Views' (CMI
section 6, 1950), could assess machine thinking: "I would like to suggest a particular
kind of test ... to see whether the machine thinks" (ibid).
Turing explained that the point of the machine in such a test was for it to convince
a number of interrogators while pretending to be a man that it can ‘think’ by
answering questions put to it: “it will only pass if the pretence is reasonably
convincing” (ibid). The type of questions that can be asked of the machine can include
the kind used in a law court exemplified by Turing as “I put it to you that you are only
pretending to be a man", thus allowing interrogators to pose 'open questions' to the
machine, which is itself allowed to use "tricks" such as silence, pause and spelling
errors (in Copeland, 2004: p. 495). Turing insisted that a jury-service, in the form of a
panel who are not machine experts, act as the interrogators; they must not be allowed
to see the machine, but may be allowed a typewritten medium through which to
communicate with the hidden entity, which is kept in a different room. Turing wrote:
“a considerable proportion of a jury ... must be taken in by the pretence” in order for
the machine to pass the test (ibid). The number of jurors, members of the public
chosen at random and asked to judge guilt or innocence in the most serious indictable
criminal offences (Thomas & Balmer, 2007: p. 3), in the United Kingdom’s High
Courts is twelve (Her Majesty's Court Service:
http://www.hmcourts-service.gov.uk/infoabout/jury_service/index.htm). [The right to
trial by jury is an ancient one dating back to the
Magna Carta, though prior to its 1215 proclamation the Normans would convene a
group under oath to establish truth granting “an accused to be tried by the lawful
judgment of his equals or by the law of the land” (ibid, p.19)].
Turing did introduce a control test in 1952, as he did in the 1950 version, with a
human foil acting as witness on occasion. In the two-participant version of the
imitation game Turing suggested: “We had better suppose that each jury has to judge
quite a number of times" (ibid), and sometimes the interrogators should face a hidden
human to question, to prevent them from adjudging their interlocutor a machine in
every instance of interaction "without proper consideration" (ibid).
Turing did not mention any particular length of time for interrogator questioning,
unlike the “after five minutes” duration stated in his 1950 paper. However, inscribed
in the 1952 discussion are Turing’s eight criteria for staging the interrogator-witness
jury-service test for a thinking machine:
1) the machine is hidden from view and hearing from a panel of interrogators
2) the panel must not be machine experts
3) each member of the jury panel interacts one-to-one with the machine under test
4) the interrogators can only pose typed questions
5) the interrogators can ask any questions
6) the machine attempts to respond in a human-like way
7) sometimes the interrogator is faced with a hidden human to question
8) each panel member interrogates a number of times
Newman retorted that he would like to be present when such a test is actualised to
participate in questioning in a way that would tease out the machine from the man.
Newman asked Turing what period of time would be required before a machine stood
"any chance with no questions barred?" Turing, having had the interim between the
publication of CMI in October 1950 and the date of the BBC radio discussion
recording in January 1952 to reflect on a realistic span required to build a thinking
machine, replied: "at least 100 years, I should say". Hence Turing truly felt that not
before 2052 would a machine be ready to pass his five-minute, question-answer,
text-based interrogator-witness test for machine thinking, an important point that is
missed by many who, attesting to Turing’s 1950 paper alone and its end of century
prediction, claim that Turing minimised the huge task of constructing such a device
and got his prediction very wrong. He did not.
1.3.3.2 Learning and Experience
Comparing real life experiences of finding trains in a timetable, Jefferson asked
whether machines could learn and improve with practice. Newman replied
affirmatively pointing to the game of chess to illustrate his answer. Newman
explained that once all the necessary data was stored, such as number and position of
pieces in a game of chess, so too could the times of trains be recorded. With fixed
instructions to process the unfixed patterns, for example the number of ways chess
pieces can be moved around a chess board, a programme could be designed so that the
machine could produce the right solution. Learning would be achieved when the
machine reached a correct solution without it being previously programmed by the
human programmer. Jefferson interjected at this point with other learning
methods: perfecting through repetition and through the application of acquired
knowledge from one domain to another to resolve problems, "seeing
relevances and relationships" (in Copeland, 2004: p. 496). Prompted by the fact that
learning involves remembering, Jefferson asked for how long a machine could store
information; Newman replied "as long as a man's lifetime" (p. 497). Prodding further,
Jefferson suggested that external intervention from teachers and parents was part of the
learning process, which "wouldn't get any distance at all with human beings" if only
one person, as in the case of the mathematician programming the machine, were
involved. Jefferson cited that only in formal examinations is such a scenario the case
for humans. This is an important point; it benefits learning in the machine if more
than one person is involved in its teaching. Turing accepted that in the case of a child
being taught, this included repeated intervention by parents and teachers to “stop him
doing this or encourage him to do that”, a process that could be applied to machines
(p.497).
Turing surprised his audience by making them aware of experiments he had
conducted to teach a machine to do simple operations. He admitted “a very great deal
of such intervention was needed before I could get any results at all ... the machine
learnt so slowly that it needed a great deal of teaching” (ibid). Pertinently, Jefferson
asked who was doing the learning, Turing or the machine. It appeared to be both from
Turing’s response “I suppose we both were”. The machine’s success depended on its
ability to learn quickly. Turing hoped for a “snowball effect” emerging from the
process of teaching the machine akin to learning by an intelligent person: “in learning
to do any particular thing it will probably also be learning to learn more efficiently ...
when one has taught it to do certain things one will find that some other things which
one has planned to teach it happens without any special teaching being required ...
This certainly happens with an intelligent human mind” (ibid). Drawing Braithwaite
into the discussion at this point, Turing enquired about learning possibilities raising
the salience of experience in the learning process. Braithwaite felt self-correction and
interest in things are important features in the learning process made possible through
interaction with the environment, and gaining knowledge of things that interest the
learner. Jefferson noted human history showed a predilection for auguring the
nearness of artificially created man (he cites Descartes and Borelli; p. 498 in Copeland,
2004). He followed Braithwaite's mention of a multitude of elements that determine a
learner's intention to spring into action, including appetites, desires, drives and
instincts which allow behaviour adaptation. A
machine having been equipped with a set of appetites would need to attend to its
environment in order to adjust itself and learn, unless the machine was designed
merely to act as a pet and be spoon-fed. Newman agreed, admitting the state of
technology at that moment did not include machines that would 'blush' on becoming
embarrassed. However, the dilemma of getting them to think was an interesting
debate. Newman pointed to the reasoning aspect of thinking necessary to solve chess
problems, which machines were incapable of at that time. Now we know this feat has
been accomplished: IBM's Deep Blue machine overcame the world chess grandmaster
Garry Kasparov in 1997.
In what is undoubtedly a feast of intellectuality, Newman continued the 1952 radio
discussion stressing machines require the power to recognise similarities, seeing
analogies between things “that had not been put together before” (p. 498). He raised
the philosophical debate about whether a machine could ever invent something for
which the concept was not already in the mind of its programmer. Here Turing once
again showed his mettle for avoiding definitions, this time for the word ‘concept’.
Turing did believe a machine could produce something entirely new if it continued to
learn, improving its method and doing more and more difficult things. Braithwaite
believed it was a question of designing a machine to recognise similarities, the real
difficulty arising from its lack of the visceral, because analogies sometimes emerge as
a result of a feeling: “a machine can’t recognise similarities when there is nothing in
its programme to say what are the similarities it is expected to recognise” (p.499).
Turing countered this argument; machines could be made to spot an analogy, the sort
of thing “that one usually regards as essentially a human monopoly” (ibid).
Turing went on to explain how he believed analogy worked in the human brain,
through recognition of patterns: “two or more sets of ideas have the same pattern of
logical connections, the brain ... likely economise parts by using some of them twice
over, to remember the logical connections both in the one case and in the other” (p.
499). Turing was attempting an understanding of the buzzing in his head. Studies of
synaesthesia show a blending of the senses in some humans. Described as “concrete
sensory phenomenon” (Ramachandran and Hubbard, 2001: p. 3), synaesthete patients
experience a ‘disorder of senses’, for example, causing a feeling of ‘taste’ when
something is touched. Images produced from functional magnetic resonance scans
(fMRI) of synaesthetes give us some understanding of what Turing alluded to: cross-
wiring in the brain may give birth to a feature of language, metaphor (Ramachandran
& Hubbard, 2005). Cross-wiring may also encourage rather than inhibit connections
between regions of the brain allowing interaction across regions and possibly creating
arbitrary links between seemingly unrelated perceptual entities. Ramachandran and
Hubbard’s synaesthesia-based theory of the evolution of natural language highlights
an intriguing relationship between language and thought, concluding that extensive
cross-activation in the human mind links abstract concepts about object shapes,
sounds and colour hues that may help to create metaphors (2001). Turing himself used
such metaphors in his 1951 analogy of the man on Mars writing about family life with
a computer attempting to behave like a brain (2004: p. 484): insufficient paper stood
for inadequate storage in a machine, and the unknown topic for the problem of how to
write a programme.
Turing continued in the discussion with a complaint that whatever the machine did,
make analogies for instance, it would still not be considered thinking by sceptics: “As
soon as one can see the cause and effect working themselves out in the brain, one
regards it as not being thinking, but a sort of unimaginative donkey-work” (p. 500).
Contrary to his usual reluctance to define concepts, in 1952 Turing did attempt to
describe thinking, as consisting of "those mental processes that we don't understand"
(ibid). If we have no real understanding of how a machine does interesting things then
we ought to accept it as a thinking machine; otherwise it would be moving the goalpost
of machine expectation further, as Hauser would say (2001). Turing advised Jefferson
that machines did occasionally produce unexpected things that may have been
predicted but only with “too much trouble” (ibid). Newman added “The machine’s
processes are mosaics of very simple standard parts, but the designs can be of great
complexity”, and importantly “it is not obvious where the limit is to the patterns of
thought they could imitate” (ibid). Braithwaite countered that if a brain worked like a
computing machine and the machines of the time had only the equivalent of half a
million nerve cells then machines could not accomplish what a brain could. He also
alluded to the size of a brain in the required machine. Braithwaite was assuming a
machine would need the same number of cells as a human possesses to achieve
similar performance; he did not seem able to imagine an expeditious machine built of
intelligent design. Jefferson picked up on this point, that it cannot be just a question of
size: “… there would be too much logic in your huge machine. It wouldn’t be really
like a human output of thought”, he added “machine parts would have to be designed
quite differently to give greater flexibility and more diverse possibilities of use” (p.
501). Turing emphasised storage capacity again; he believed size did matter for the
amount of information needed. The discussion turned to intervention again; Jefferson
reminded that "to get near to anything that can truly be called 'thinking' the effects of
external stimuli cannot be missed out", and he recalled extraneous factors that impair
creative thinking in a human, such as worries in life. Jefferson expressed one obstacle
that needs to be overcome prior to the business of building a ‘thinking machine’, that
of persuasion: “man’s first reaction to a new idea ... is one of rejection” (ibid).
Newman referred to Turing's point: if a machine could do 'x', say, it would seem more
like trickery to some people than a natural, self-initiated action. Jefferson felt a machine
should reject some tasks that it did not want to do, or that offended it in some way
through having a personal history. This is a shrewd nod over time towards Japanese
roboticists today. Newman reminded us that “not even Turing thinks” that putting a
skin on the machine would make it alive.
Another, non-human feature Jefferson felt the machine should possess is its
infallibility. Turing explained machines were infallible; checking for accuracy was a
crucial part of their deployment, and misunderstandings could occur when machines
failed to perform the expected calculation. Another essential feature of humanness was
introduced at this point of the debate: emotions. Jefferson almost felt good about
machines not having this distraction, an excuse for confusion when humans are upset:
“your machines don’t have to bother with that, with being tired or cold or happy or
satisfied” (p. 502). Exemplifying with mountain climbing Jefferson reasoned that
without emotion the machines could not display delight at doing something never
done before, that machines are simple things in mental terms compared with humans,
thus inhuman. Braithwaite disagreed believing that inculcation of ‘appetites’ in the
machine or other ‘springs of action’ would enable them to “pay attention ... in its
environment to be able to learn from experience” (ibid), that perhaps this may be
impossible without the machine having some emotional apparatus.
The machine may require protection from extinction through blowing a fuse if
there was a dangerous electrical surge. Turing claimed a by-product of teaching a
machine would lead to some effects requiring curbs and other effects necessitating
encouragement. These ‘effects’ would be distinct from the corresponding human
consequence of an emotion. Turing supposed the machine in the imitation game
would “have to do quite a bit of acting” (p. 503). Newman raised the technical
problem of carrying out operations in the machine, the speed necessary “in the realms
of science fiction” (ibid). Turing agreed that time is an important factor adding “the
fact that a brain can do it seems to suggest that the difficulties may not really be so
bad as they now seem" (p. 504). Braithwaite concurred: the discussion was not about
whether machines can do everything that a man does, but whether a machine can do
anything that is "proper to call thinking" (p. 504). Braithwaite exemplified with the
appreciation of a picture, an aesthetic admiration that "contains elements of thinking" and
“elements of feeling”. He conceded the current concern is of an intellectual capacity
in the machine, for it to solve encountered problems in its environment, such as
answering awkward questions put to it. Learning was part of thinking.
As the discussion drew to a close Braithwaite asked if a machine could create new
concepts. Newman rounded out the discussion with two principal questions to consider
about thinking machines: firstly, to ascertain agreement by positing all the
requirements necessary for a machine to be accepted as thinking, and secondly, what
machines cannot do at the time. It was this second point that was reflected on. Rather
than conceiving scenarios of what machines may one day be able to achieve, marking
a stage from where the process of building a thinking machine could begin, such as the
kind of realistic tasks machines could be deployed for, Newman suggested finding new
methods for the solution of unsolved mathematical problems as a "more modest aim
than inventing new mathematical concepts" (p. 504). He described the process in
humans and the appearance of ideas after thought had gone into problem solving,
which also involved imagination. Newman felt that “if a machine could really be got
to imitate this sudden pounce on an idea ... everyone would agree that it had begun to
think” (p.505). Jefferson pointed to the end-product of human thinking, which could
be discerned in the books in libraries and artefacts in museums, and referred to
memory, used to associate ideas and assist in filling the gaps in our knowledge by
“improvising from a few given facts” (ibid). Jefferson asserted “thinking is clearly a
motor activity of the brain’s cells ... so many people think better with a pen in their
hand than viva voce”, and that the only way of knowing the content of someone’s
thought was through their communicating it (ibid).
Jefferson asked Newman whether an anatomically different machine could be built
and whether it could produce the same work as a man; the reply revealed a
difference in physiology though the logical plan was "rather similar" (p. 505).
Jefferson assumed the electronic machine was not a replica of part of a man’s brain
“even though the result of its actions has to be conceded as thought” (ibid). The BBC
radio programme ended with Jefferson aspiring to have fun by taking part in a future
discussion “between two machines on why human beings think that they think!”
(p.506). Turing and Newman successfully influenced the sceptical Jefferson, who did
come to accept that one day machines will think.
The 1952 discussion is important because it does not show Turing referring to a
necessity for gender imitation by the machine. Believing there would come a time
when it could answer any question put to it by a human interrogator, Turing settled on
a one-to-one inquiry into the capability of a machine.
1.3.4 Turing 1953: Chess
As we have seen earlier in this chapter, in sections 1.1.2 and 1.1.3, in 1947 and 1948
Turing proposed the game of chess as an appropriate starting point to investigate
whether digital computers supplied with enough memory and storage capacity could
think. Turing said of chess: “[the game] holds special interest of the representation of
human knowledge in machines” (p. 562). In this 1953 essay on chess, Turing
described how to deal with the question of whether a machine could be built to play
by considering word meanings in sentences. This led him to put forward sub-
questions that could be considered separately:
i) Could machines be programmed to follow the legal moves according to the
rules of the game?
ii) Could machines solve chess problems given the board location of its pieces?
iii) Could machines play reasonably well in the game?
iv) Could the machine improve its play and profit from its experience?
Turing then added unconnected questions including one that harked back to the
imitation game: “could one make a machine which would answer questions put to it,
in such a way that it would not be possible to distinguish its answers from those of a
man?” (p. 569). It could be the case that Turing in 1953 did not want his text-based
challenges of 1950 and 1952 to become lost in the midst of building digital computers
to play chess. Turing wanted it known that his imitation game would be passed one
day, just as a computer would one day beat a human at chess: he affirmed "I believe
so", a confidence which also covered machines being able to have feelings like
humans. The essay contained the first ever such admission by Turing: "I know of no
really convincing argument to support this belief and certainly none to disprove it"
(p. 569).
On the sub-questions relating to the overarching question on whether machines
could be built to play chess, Turing wrote “This certainly can be done. If it has not
been done already it is merely because there is something better to do” (p. 569), which
in effect is asserting that humans can come up with a design of a digital computer,
with supporting storage and memory, and a sufficient programme to make it play
chess and beat a human chess player when they have time to pursue it. Turing’s
approach to the game of chess was that it is a problem for the machine to solve
through use of heuristic methods, such as rules of thumb guiding the machine's search
through a "tree of possible moves and counter-moves" (p. 565). The machine could
try out various methods, thus learning to adopt the variations that led to satisfactory
results. Turing was not alone in this belief. Early proponents for computers playing
chess included mathematicians such as Claude Shannon, Norbert Wiener and John von
Neumann. Herbert Simon and Allen Newell predicted in 1958 that a machine would
become a world chess champion within 10 years. Even if it happened later than they
predicted they were correct in supposing that feat was achievable. With a massively
parallel architecture (IBM, 1997) and a brute-force approach to the problem, IBM's
Deep Blue machine was able to examine 200 million chess positions per second,
mining to depths of 14 levels in stark contrast to the human player’s search strategy
(Newborn, 1997). In a six-game rematch against the then World Chess Champion,
Garry Kasparov in 1997, the machine succeeded in overcoming its human opponent
by 3.5 games to 2.5 games.
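Turing’s picture of rules of thumb guiding a search through a tree of moves and counter-moves corresponds to the depth-limited minimax search later applied, in massively parallel, brute-force form, by machines such as Deep Blue. A minimal sketch in Python follows; the evaluation function, move generator and move-application function are illustrative placeholders supplied by the caller, not a reconstruction of Turing’s scheme or of IBM’s programme.

    # Minimal depth-limited minimax over a tree of moves and counter-moves.
    # evaluate(), legal_moves() and apply_move() are hypothetical placeholders:
    # a real chess programme would supply a board representation and a
    # heuristic (rule-of-thumb) scoring function.

    def minimax(position, depth, maximising, evaluate, legal_moves, apply_move):
        """Return the heuristic value of `position` searched to `depth` plies."""
        moves = legal_moves(position)
        if depth == 0 or not moves:        # leaf: fall back on the heuristic score
            return evaluate(position)
        values = [
            minimax(apply_move(position, m), depth - 1, not maximising,
                    evaluate, legal_moves, apply_move)
            for m in moves
        ]
        return max(values) if maximising else min(values)

    def best_move(position, depth, evaluate, legal_moves, apply_move):
        """Pick the move whose subtree of counter-moves scores best for the machine."""
        return max(
            legal_moves(position),
            key=lambda m: minimax(apply_move(position, m), depth - 1, False,
                                  evaluate, legal_moves, apply_move),
        )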
1.4 Turing’s Two Tests for Machine Thinking
The journey to find Turing’s canonical test for machine thinking, measured through
text-based question and answer sessions, has probed seventeen years of Turing’s mind
and thoughts through his essays and lectures between 1936 and 1953. This gives an
understanding of what machines could do during that time period, such as calculate
more speedily than a human computer, and an insight into what Turing believed was
possible for a machine to achieve. Critically, given sufficient storage and memory and
the right sort of programme Turing trusted machines would achieve ‘thinking’. Sadly,
Turing died in June 1954 never enjoying the successes of machine performance such
as IBM’s chess-playing computer beating a grandmaster, or Candide, a machine
translator (DellaPietra and DellaPietra, 1994); but nor did he suffer the drastic
failures resulting from over-confidence in achievable machine intelligence.
We will never know what Turing himself might have contributed further, but he
left a legacy of ideas which are pored over to this day. However, the author posits
there is one Turing imitation game for machine thinking which can be practicalised in
two ways: a) one-to-one interrogator-witness test (1952; 1950), and b) the
simultaneous comparison of a machine with a human through a three-participant
question-answer game (1950). The conditions for both versions are considered next
before a review of the imitation game scholarship following Turing’s death.
1.4.1 One-to-one Machine Evaluation
The one-to-one interrogator-witness version of Turing’s imitation game is quite
different from the three-participant game. As part of a jury, each member interrogates
the hidden entities one by one (see figure 3). They must resist deception by correctly
identifying any machine and recognising any human. Each jury participates an
unspecified number of times. Turing did not mention an interaction duration for this
version of his textual test, unlike the five minutes he specified for the three-participant game.
Turing’s criterion for the machine passing the one-to-one test appears more robust than
for the 3-participant version, if the jury is taken as a collective judge. As Copeland
(2004) notes, a “considerable proportion” of the jury has to be “taken in by the
pretence” (p. 488).
Figure 3: Jury-service Imitation Game (source: http://xkcd.com/329/, accessed 22.2.10)
For the interrogator, the one-to-one test is an easier form of the game than the
simultaneous conversation with two hidden entities. It is arguable which is the tougher
test for the machine. In a one-to-one scenario, the interrogator is not distracted by a
second conversation happening at the same time, thus the concentration is on the
hidden entity’s responses and how satisfactory they are. Turing’s one-to-one version
has been practicalised the most often, from the first Loebner Prize contest for
Artificial Intelligence in 1991 up to and including the 13th contest. However, it is as
prone to misidentification as the three-participant game. The interrogators’
subjective opinion of what constitutes satisfactory textual responses can interfere with
their interrogation techniques, causing them to rank a human foil as a machine. This
phenomenon is known as the confederate effect (Shah & Henry, 2005), and will be
discussed, along with two other phenomena, the Eliza effect and the gender-blurring effect,
with evidence from practical Turing tests in chapters 4-6.
1.4.2 Simultaneous Comparison
The simultaneous comparison involves an average interrogator questioning two
hidden entities at the same time (see figure 4). The task of the interrogator is to
correctly identify any machine and any human interlocutor by uncovering the artificial
and recognising the natural after five minutes of questioning. The machine is deemed
to have passed this version of Turing’s imitation game if it deceives more than 30% of
the time (1950: p.442). This form of the game is the stronger test for
indistinguishability. Assuming the same question is put to both the hidden machine
and the hidden human the average interrogator can contrast the two responses in
parallel, thus perhaps making it harder for the machine to appear humanlike.
However, which response is considered satisfactory is still subject to the interrogator’s
judgment. Turing did not disallow the interrogator ranking the hidden entities as both
machines, or both humans; indeed this is allowed under the ‘indistinguishability’ criterion.
Figure 4: Simultaneous comparison Turing test
Strategy is more important in this version of Turing’s test for machine thinking (see
chapter 5). The wrong strategy, and assumptions about what a human should know, can
lead to misidentification by the interrogator, who may incorrectly rank a human foil as a
machine and the machine as a human, as will be shown in chapters 5 and 6.
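As a practical reading of the ‘more than 30%’ pass criterion discussed above, the short sketch below tallies interrogator verdicts from a set of simultaneous-comparison tests and checks whether the machine’s deception rate clears Turing’s threshold; the verdict data are invented purely for illustration.

    # Hypothetical verdicts from simultaneous-comparison tests: True means the
    # interrogator wrongly identified the hidden machine as the human.
    verdicts = [True, False, False, True, False, False, False, True, False, False]

    deception_rate = sum(verdicts) / len(verdicts)   # fraction of wrong identifications
    passes_1950_criterion = deception_rate > 0.30    # 'more than 30%' reading of 1950: p. 442

    print(f"Deception rate: {deception_rate:.0%} -> pass: {passes_1950_criterion}")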
1.4.3 Contrasting One-to-one with Simultaneous Comparison
In Copeland’s view, Turing’s three-participant test involves a “lone interrogator”
(2004: p. 488). This is an assumption based on Turing’s prediction that an
average interrogator “will not have more than 70 per cent chance of making the right
identification” after five minutes of questioning. Firstly, this raises the question of how
we should define “average” for any practicalisation. Is an educated layperson
average in terms of artificial conversationalists? Should average interrogators be
chosen from a pool of people who have never chatted to an artificial conversationalist
and who have never heard of the Turing test, but nevertheless are of varying degrees
of education? The interrogator would at least need to know how to type, and be able to
use a digital computer. The author contends Turing meant that any interrogator should
be unfamiliar with the inner workings of the machine and its learning process.
The number of interrogators is specified by Turing as ‘jury’ in his one-to-one test
in 1952. Trial by jury is considered fairer than trial by a judge: 80% of the British public
trust a jury to come to the right decision, believing juries produce a better service
(Thomas & Balmer, 2007: p. 3). A jury numbering twelve men and women is
accepted as sufficient, although that number is a historical accident. Turing may have been
familiar with this, because he refers to a law court when he exemplifies the
sorts of questions interrogators could ask (in Copeland, 2004: p. 495), which is
perhaps why he preferred non-machine experts to act as judges in his two tests for
machine thinking. Thus the author contends that Turing meant a number of juries,
each populated by twelve average interrogators, to be tasked with questioning the
hidden machine.
Though not stated explicitly by Turing, it is preferable not to complicate the test by
communicating in different natural languages during practical Turing tests. However,
even if we take English as the language for the textual test Turing did not say that the
human participants, interrogators and human foils, had to be native English speakers.
What if the machine designed to communicate in English were compared against a
non-native English human foil in the three-participant game, both questioned by a
non-native English speaking interrogator? What would its performance level be? Is
this an acceptable means to assess the machine, on a continuum from babbling
conversationalist to a native texter?
1.4.4 Technical Communications Protocol
The last point considered before this chapter is concluded is the technical
communications protocol that allows the display of the hidden entities’ utterances on
the interrogator’s screen. Should the interrogator see every character as typed by the
hidden entities or is the test valid if the whole utterance appears in one go? This is an
important question because which display type is selected for a practical Turing test,
character-by-character or whole message-by-message, impacts the time needed for the
machine examination (see chapter 5). The author believes the message-by-message
display is the more efficient and the less likely to encourage machine tricks, such as
back-spacing over typed text to appear human when correcting a spelling.
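To make the contrast concrete, a small sketch follows of how an imitation-game interface might relay a hidden entity’s utterance under each protocol; the function names and the typing delay are illustrative assumptions, not a description of any actual contest software.

    import sys
    import time

    def relay_character_by_character(utterance, delay=0.05):
        """Show each keystroke as it is 'typed'; keystroke-level behaviour,
        including corrections, is visible to the interrogator."""
        for ch in utterance:
            sys.stdout.write(ch)
            sys.stdout.flush()
            time.sleep(delay)      # simulated typing interval
        sys.stdout.write("\n")

    def relay_message_by_message(utterance):
        """Deliver the utterance only when complete: quicker, and keystroke-level
        tricks such as back-spacing are hidden from the interrogator."""
        print(utterance)

    relay_message_by_message("I prefer reading to chess, to be honest.")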
1.4.5 Concluding this chapter
In 1947 Turing raised the idea of a machine playing a human in a game of chess. In
1948, while pointing to learning languages as an impressive human activity, Turing
introduced an indistinguishability test in which a human chess player attempted to
determine whether they were playing against another human or a machine. In 1950
Turing replaced chess with a five minutes, text-based question and answer session for
his imitation game, an exercise in investigating whether a machine could think.
Turing described two different ways in which his imitation game could be
practicalised, firstly, through a 3-participant game involving a human interrogator
simultaneously comparing responses from two hidden interlocutors to unrestricted
questions, and secondly, through a viva voce style direct questioning of a machine by
a jury of human interrogators. It is the latter, two-participant game which Turing
elaborated upon in 1952.
Table 2: Comparison of Turing’s ‘thinking machine’ predictions

Year  | Where prediction made | Prediction
1950  | Computing Machinery and Intelligence | “… in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than a 70 per cent. chance of making the right identification after five minutes of questioning” (p. 442)
1951b | BBC radio broadcast: Can Digital Computers Think? | “… at the end of the century it will be possible to programme a machine to answer questions in such a way that it will be extremely difficult to guess whether the answers are being given by a man or by the machine. I am imagining something like a viva voce examination” (in Copeland, 2004: p. 484)
1951b | BBC radio broadcast: Can Digital Computers Think? | “If it (thinking machine) comes at all it will almost certainly be within the next millennium” (p. 486)
1952  | BBC radio broadcast: Can Automatic Calculating Machines Be Said To Think? | “Oh yes, at least 100 years, I should say” [for a machine to stand a chance in a no-questions-barred viva voce] (p. 495)
Turing’s expectations regarding when machines would succeed in his game
evolved between 1950 and 1952. In 1950 he predicted “in about fifty years’ time”. In
Can Digital Computers Think? (1951b), Turing made two predictions, “at the end of the
century” and “within the next millennium”. It is perhaps in 1952 that Turing was more
realistic about the time thinking machines would be made to play his question and
answer imitation game well, because in a reply to Newman he said “at least 100
years”. See table 2 for a comparison of Turing’s predictions. It is possible that Turing
simplified his 3-participant machine-human comparison game because he felt that the
one-to-one, interrogator-witness scenario could be more easily understood and
perhaps practicalised. Whatever the case, it is clear that there is one imitation game
and two Turing tests to gauge whether a machine is thinking.
Table 3: Contrasting Turing’s viva voce and simultaneous-comparison imitation games

                                | Viva voce (interrogator-witness) | Simultaneous-comparison
Mode of interaction             | One-to-one: human interrogator-machine | One-to-two: human interrogator: machine + human
Type of questions               | Unrestricted | Unrestricted
Number of participants          | Two | Three
Duration of interaction         | Unspecified | After five minutes
Interrogator type               | Non-machine expert | Average judge
Number of interrogators         | Jury | Unspecified
Number of tests                 | Judge quite a number of times | Unspecified
Language of communication       | Same for both interlocutors | Same for all three participants
Criteria for test pass: satisfactory & sustained responses | Considerable portion of jury taken in by pretence: machine imitating human-like responses | 30% wrong identification of machine as human
Table 3 contrasts both versions. This study posits that both the viva voce and the
simultaneous comparison can be included in one machine thinking experiment, the
former serving to whittle systems down to the best at answering with sustained and
satisfactory textual responses to compete against hidden humans for the ultimate prize
of humanness. Though the combination may not be accepted as a ‘canonical Turing
test’ the author contends it can serve as a very powerful examination to improve
deception-detection.
In the next chapter philosophical resistance to, and support for, Turing’s imitation
game are analysed. Early practicalisations of question-answer games are then
considered, from the earliest one-to-one scenario featuring Weizenbaum’s Eliza to the
best current systems, before the three original experiments for machine thinking
conducted as part of this study are presented.
Chapter 2 Philosophical Jousts and Turing’s Imitation Game
Following Turing’s CMI (1950) and his proposed imitation game to investigate
machine thinking pivoting on linguistic inputs and outputs as criteria for intelligence
(Schweizer 2010), a virtual philosophy of mind jousting tournament has taken place
over the last sixty years with some of the best thinkers applying their reasoning skills
and linguistic prowess to dissect Turing’s ideas. Turing’s imitation game has been
branded as classical, gender-specific, initial, literal, original, species, standard and
traditional. According to Loebner (2010b) the “correct design” for a Turing test
involves three participants in which an interrogator questions a machine in parallel to
a human, an interpretation which Loebner claims he realised after implementing
thirteen one-to-one tests (p. 50). However, Loebner’s interpretation overlooks
Turing’s own description for his imitation game and the two methods to practicalise
it: i) a simultaneous comparison of a machine against a human, both questioned by a
human interrogator (1950), and ii) the direct examination of a machine by a jury in a
viva voca type scenario (1952, 1950).
Hodges tells us Turing was working at Manchester University at the time of CMI’s
publication in 1950, where his academic duties included creating and
managing “the first software” for the world’s first “stored-program computer” (2008:
p. 13). The point of Turing’s paper was to “argue that intelligence of a human level
could be evinced by a suitably programmed computer” with the imitation game
lending “definiteness to the idea of being as intelligent as a human being” (p. 14)
following on from the chess-playing paradigm (p. 15). However, it was not long
before the belief in the intelligence of machines was considered a manifestation of
schizophrenia (Robinson 1972: p. 505), possibly as a result of the effusive claims made by
early researchers engaged in the new science named artificial intelligence (AI) by
McCarthy in 1956, who himself claimed that thermostats ‘perceive’ changes in temperature. In
this chapter the most salient points from the mountainous Turing test literature
impacting on the implementation of the imitation game are reviewed, before early
developments of question-answer systems are examined.
2.1 The Jousting begins
In the nascent study of machine intelligence, rather than focus directly on building a
machine to pass his two tests, or one that plays chess as he had advocated in his 1953
essay, Turing turned post-1950 to an area discerned from his article The
Chemical Basis of Morphogenesis (1952). He applied digital computers
to the growing embryo and the patterns that develop to form other patterns, studying the
mathematical nature of biological forms. In this respect Turing was the “earliest
pioneer of computer-based life” or Alife (Copeland, 2004: p. 508), and the first to
work on a mathematical theory of the universe (Al-Khalili, 2010). However, others
did not ignore Turing’s 1950 thesis; it unleashed a philosophical saga that shows no
signs of abating. Today we witness a myriad of Turing tests, distinguishing human
from machine in mediums other than text-based imitation through music
(Verhaert, 2002) [23], finance (Hasanhodzic, Lo & Viola, 2010) [24] and art (Cohen 1990) [25] - contrast robot artist Aaron’s drawing with a human artist’s music record album cover [26] (see figures 5 and 6).

[23] The Continuator competed against a human jazz pianist in a broadcast on the VPRO radio channel; the artificial and natural musicians were judged by two music critics, who favoured the machine.
[24] A computer game in which human players try to distinguish between sequences of real and randomly generated financial data; most players were able to distinguish the real from the random data.
[25] Aaron, the first robotic artist: http://scinetphotos.com/aaron.html date visited: 2.3.10; time: 13.09
[26] US Latin group Santana, led by guitarist Carlos Santana; record album Abraxas, 1970.

A neuro Turing test is implicitly applied to distinguish the conscious
from non-conscious brain-damaged patients using communication via gestures to
establish the ‘mind-alive’ from those in ‘vegetative-state’ (Stins and Laureys, 2009;
Egeth, 2008; Stins, 2008; Owen et al, 2006). Commentators on the validity of
Turing’s notion of thinking, supplying sustained and satisfactory responses to a range
of unknown questions, are divided. On one side you have Savova and Peshkin (2007)
who consider it a plausible argument providing a sufficient condition for the presence
of intelligence, and Shieber (2008; 2006) who believes it to be an interactive proof of
capacity. Others strongly reject the imitation game as criteria for machine success
including Hayes & Ford (1995), and McDermott (2010) who claims the Turing test
should be abandoned as a bad idea. Whitby (1996) believes Turing’s imitation game
has led the science of artificial intelligence into a blind alley, while Bringsjord, Bello
and Ferrucci (2001) contend that it cultivates tricksters. Block (1981) claims the
machine’s processes, being dissimilar to a human’s, are a reason why the machine
cannot be intelligent, while Searle (1980) points out the machine has no
understanding of input/output.
Turing’s ideas have not deterred those who think it might be fun to “program the
game and try it on a group of students” (Purtill, 1971: p. 291), or to apply them in
areas such as the study of games and economic behaviour (Arifovic, McKelvey &
Pevnitskaya, 2006), and in clinical fields such as psychiatry to simulate paranoia
(Heiser et al, 1979; Colby et al 1972; Colby, Weber & Hilf, 1971). Thus a vast
scholarship devoted to Turing’s 1950 disquisition now takes its place among the
archive of human knowledge with papers written across many disciplines commenting
on his tract, “one of the most reprinted, cited, quoted, misquoted, paraphrased, alluded
to, and generally referenced philosophical papers ever published” (Halpern, 2006).
Figure 5: Drawing by Aaron, robotic artist
Figure 6: Santana album cover
2.2 Thinking about thinking
Occupied in various philosophical camps are opponents to Turing’s idea: Pinsky, who
discarded the idea of a thinking machine (1951), French who dismissed the imitation
game as quixotic (1990), Hayes and Ford who considered it harmful to artificial
intelligence (1995), and Lassègue, who believed the imitation game “should be
considered as an unconscious and mythical autobiography and not as a philosophical
introduction to the main issues of AI” (1996). While Lenat (2001) says that it is a red
herring, Ford, Glymour and Hayes (2008) consider it a poorly designed experiment
(footnote, p. 29 in Epstein et al., 2008).
Others regard Turing’s linguistic test functions as an operational definition for
intelligence (Schröder and McKeown, 2010; French, 1990; Wilks, 1974; Purtill,
1971), a view that Moor (2001) disagrees with: “Turing was not proposing an
operational definition of intelligence that conceptually would tie all future
development in AI” (p. 209). Halpern (2006) sees it as an absurdity to build a
computer to think, because the test is “invalid and unperformable” (Halpern responding
to Loebner’s letter in The New Atlantis, Spring 2006:
http://www.thenewatlantis.com/publications/correspondence-spring-2006, visited 4.3.10). Minsky (2003)
blasted the focus of recent AI research building “stupid little robots … Graduate
students wasting 3 years of their lives soldering and repairing” away from the
“immense challenge of building a fully autonomous thinking machine” (Wired:
http://www.wired.com/science/discoveries/news/2003/05/58714?currentPage=1, accessed 7.8.10), which
Harnad and Scherzer might agree with (2008). Scaling up to the ‘robotic Turing test’
is seen by Harnad and Scherzer as an empirical target for cognitive science, with
indistinguishability being a scientific criterion (Harnad 2001; 1992b).
Some researchers regard gender as an important feature in the game (Copeland &
Proudfoot, 2008; Lassègue, 1996; Genova, 1994). They contend that both man and
machine impersonating a woman provides a stronger test for intelligence (Sterrett,
2000), a position opposed by Shah and Warwick (2010b) and Hodges (2008). Others
back different tests for machine intelligence. In what Halpern scathingly might refer
to as Turing’s epigones (2006), Azira (2009) advocates a musical toy test which alters
the task of interrogator to that of a critic (Azira, 2009), and Espejo-Serna (2010)
advances a Sphex-test that modifies an agent’s environment testing for any
corresponding change in its behaviour to “connect the dots” in its own way (p.1).
Plebe and Perconti (2010) propose a Qualia Turing test adding vision to the textual
question-answer test enabling a machine to report its subjective experience after
sharing views of an uploaded picture with an interrogator. Connecting the dots and
sharing experience are implicit in Turing’s imitation game: the role of inference acts
to connect seemingly disparate utterances, while whether the machine can adapt its
responses to an interrogator’s changing topics is covered by Turing’s criterion of
‘sustained, satisfactory responses’. Similarly, Cullen’s communicative test (2009)
adds nothing distinct to Turing’s textual Q/A, nor does it improve it; Turing’s
deception-by-imitation game already proceeds via communication. Bringsjord, Bello and Ferrucci
(2001) suggest a “better Lovelace test” honouring Lady Lovelace’s comment referred
to in Turing’s anticipatory responses to objections against a thinking machine (section
6, 1950). In Bringsjord, Bello and Ferrucci’s test, when a “human architect” H of an
artificial agent A cannot account for its output o, the agent satisfies the Lovelace
criteria: “only when computers originate things should they be believed to have
minds” (2001: p. 2). Stins (2009) provides a perspective on ‘mind’ from the field of
cognitive neuroscience: “what it takes for an outside observer to be convinced that an
entity is conscious” (p. 188). Cowen and Dawson relate how the social capabilities
required to be considered ‘human’ in Turing’s test could preclude those with autism, a
view grounded in the speculation that Turing himself may have been autistic or
suffered from Asperger’s syndrome (2009).
Alongside Turing test scrutiny sits the view that a machine can eventually succeed in
passing it (Chomsky, 2008; Loebner, 2008; Kurzweil, 2001; Moor, 2001, 1976;
Dennett, 1984; Sampson, 1973). It does need to be re-emphasised here that Turing
described an experiment to test for the presence of ‘thought’ via deception and
imitation, bypassing multi-sensing world experiences and focussing exclusively on the
intellectual capability of a machine expressed through its text-based responses to
questions. If the responses are deemed indistinguishable from the kind of responses a
human might give to an interrogator, then that machine is thinking, and if a human’s
similar responses were to be considered intelligent then so must the machine’s be.
Practicalising the imitation game can provide a useful continuum, a scale placing
machines from textual ‘babbling adolescents’ to ‘full linguistic power houses’, and,
most crucially, can serve as a tool in deception-detection to mitigate the risk of fraud.
2.2.1 Misusing Reason, Changing Language
Copious Turing test erudition appeared during the 1990s and in the first decade of the
21st century, continuing into the 60th anniversary year of CMI in 2010. However, most of
these commentaries did not, and still do not, address intelligence directly; most
argue over the three-participant test, pointing to machine failure to pass the
imitation game by the end of the 20th century while overlooking Turing’s revised
prediction of “at least 100 years” made in 1952 (Copeland, 2004).
Anticipating “contrary views on the main question”, Turing devoted the whole of
section 6 of his 1950 CMI paper to consider possible arguments against the idea of a
thinking machine and the appropriateness of substituting this venture with an
imitation game (pp. 442-454). In 1951 the first response to Turing’s paper appeared
from Pinsky. In his critique of Turing’s imitation game, Pinsky felt philosophy had
been led astray for centuries, thus he cautioned: “man is unique by virtue of the ability
to misuse the faculty of reason” (p. 397). Pinsky’s objection was based on what he felt
was the incorrect reasoning “about the use of ordinary language” adopted by
philosophers (ibid). However, Turing had already pronounced this a “dangerous
attitude” (1950: p. 433).
Hayes and Ford accept, “exactly what counts as ‘cognitive’ will shift and change ...
be altered by the science itself, just as meaning of words like ‘energy’ has been
changed by physics” (1995: p. 977). Amelioration and pejoration of words is not
uncommon, hence Turing’s belief that the attitude towards the word think might have
changed by the end of the 20th century so that “one will be able to speak of machines
thinking without expecting contradiction” (1950: p. 442). Though he did not wish to
give the impression that consciousness was not mysterious, Turing felt it was
something of a paradox to localise it and that it was not necessary to resolve the
puzzle before considering whether a machine could think.
2.2.2 Temporal Obscurity
Gunderson (1964) raised the “temporal obscurity” of Turing’s 1950 paper, believing
that the question of whether a machine can think concerns a “possible future” in
which word meanings have changed (p. 244). However, the change Turing most
likely expected was one of opinion: though a machine might do what thinking
involves in a different way to a human, humans would come to accept that when a
machine solved a problem, when it could manipulate or recall information to
make plans and inferences, be original and predict, then the machine is thinking.
Remarking on the “stance often taken that thinking is the crowning capacity or
achievement of the human race”, Gunderson wrote that the “achievement of
machines” is on a “higher level than that attained by thinking beings” and “machines
can almost instantaneously and infallibly produce accurate and sometimes original
answers to many complex and difficult mathematical problems” (p. 245). Gunderson
concurred with Turing’s view that “with the increasing role of machines in society the
word ‘think’ itself might take on new meanings” (ibid). Purtill (1971) took Turing’s
deception experiment for a digital computer implanted with prerequisite memory and
storage capacity as an “operational test of whether a computer might be carrying out
an activity at least comparable to human thinking” (p. 290). Purtill’s fundamental
objection against those who would take the imitation game seriously (though he
agreed it might be fun) was mainly because he felt in all cases it is a human who
responds to the interrogator’s questions. It is the human programmer who is
responding via a machine according to Purtill. This is not actually the case (as shown
in chapters 4 and 5). It is the machine in an imitation game which must respond to
questions independently of the programmer after it has been built to ‘think’ for itself.
Purtill insisted “whatever thinking activity is going on is plainly in the brain of the
programmer, not in the mechanism by which his answers are transmitted” (p. 291).
Purtill failed to allow for the machine to “evolve” an answer itself because he was
unable to imagine a thinking machine. Throwing down a gauntlet to developers Purtill
claimed “there is no way ... to programme a computer to make any meaningful
responses to all of the myriad of questions which can be asked in English, much less
responses which would deceive a human questioner into thinking that he was
responded to by a fellow human” (p. 293). He has been proved wrong; humans have
been shown to be quite easily fooled by machines in practical Turing tests (see
chapters 5 and 6). Purtill mocked, with the offer of a fabulous prize, any programmer
who did take up the challenge to design a system making it “give relevant responses
to some questions ... put to it after it had been designed”, accusing them of being out
of touch with reality (p. 293). Developers ignored Purtill’s contempt; question-answer
systems assist the digital economy, raising brand awareness on the Internet and bringing
company operating costs down while increasing product sales (Shah and Pavlika,
2005). Perhaps Purtill’s opinions were formed as a result of experiencing
conversation with the first artificial conversationalist, Weizenbaum’s Eliza (see
chapter 3), because he writes “computer men have their dreams” (1971: p. 293).
Purtill’s argument against a thinking machine stems from the knowledge possessed
in his day: the “effectiveness of Turing’s paper lies to a rather large extent in its
science fiction aspect”. He dismissed Turing’s imitation game as “interesting science
fiction” (ibid), hailing the end of century prediction as “wild exaggeration”, ostensibly
unaware of Turing’s revised prediction of at least one hundred years. Yet Purtill
accepted that “possibly there may someday be computers which can offer appropriate
responses to anything which can be said to them” (1971: p. 293).
2.2.3 Battle of Wits or Witless War?
That Turing’s imitation game is essentially a “battle of wits between a questioner and
a programmer”, with the computer as non-essential (Purtill, 1971: p. 291) is countered
by Searle (1980), through his Chinese Room Argument – rehashed by Levesque
(2009) as the summation room. Searle claims to show that any machine taking part
would have no understanding of the questions put to it, and would lack intentionality.
Searle, an English speaker, imagines himself locked in a room with letter boxes
through which he receives a series of batches of Chinese writings, or “meaningless
squiggles”, providing him with a script for a story and questions about it.
Among the batches Searle is given a set of rules in English which enable him to
identify the Chinese symbols by shape. By correlating between these batches and
following the English instructions, Searle responds to questions in Chinese with
appropriate ‘answers’ – correctly returning Chinese symbols without knowing what
he is actually doing. Searle describes the English instructions as the Chinese room’s
programme and the persons writing those instructions as the programmers. Searle
asserts that though he will improve at following the instructions, so that his responses,
returned as card-written Chinese symbols, will become indistinguishable from those of
a native Chinese speaker, he will remain clueless about Chinese.
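A toy illustration of the purely shape-based rule-following Searle describes is sketched below; the symbol strings and the rule table are invented placeholders with no linguistic content, and the point is only that the procedure pairs inputs with outputs without any grasp of meaning.

    # A toy 'room': shape-based rules pairing incoming symbol strings with
    # outgoing symbol strings. Whoever (or whatever) applies the rules needs
    # no understanding of what, if anything, the symbols mean.
    RULES = {
        "#&%": "@!",    # if this squiggle arrives, hand back that squoggle
        "%%@": "&#!",
    }

    def room_reply(symbols: str) -> str:
        """Look the incoming shapes up in the rule book and return the listed output."""
        return RULES.get(symbols, "??")   # a default scribble for unlisted inputs

    print(room_reply("#&%"))   # '@!' is produced with no grasp of meaning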
Hofstadter and Dennett (1982) ask how Searle could fail to notice relationships
between the English instructions and the Chinese symbols in his returned answers?
Would he not learn, or be able to interpret some of the Chinese symbols? Searle’s
analogy attempts to show that a machine would only be faking comprehension in a
real question-answer scenario even if it did deceive interrogators because, unlike a
human, the machine would have no understanding of processes or of the
interrogator’s questions let alone the meaning of the answers. Penrose wrote
“understanding has as much to do with patterns as individual words ... symbol
manipulators do not understand” (1989: p. 25), though he does criticise Searle’s
argument, pointing out the task of answering just one single question would involve
“many days, months or years of extremely boring work” (ibid). Ultimately, Penrose
favours Searle’s stance, there is “considerable force to it [Searle’s argument] ... no
algorithm, no matter how complicated, can ever of itself alone, embody genuine
understanding” (p. 26). Of himself in the locked-room, Searle says “there must be
something about me that makes it the case that I understand English and a
corresponding something lacking in me that makes it the case that I fail to understand
Chinese” (p. 367). This lacking is one of not-learning, but how could one not learn
some Chinese in Searle’s locked room, how could one not recognise relationships
between the Chinese symbols and not infer some meaning given the instructions in a
language he knows? Would Searle not decipher some Chinese symbols? How does
Searle reflect upon how the hieroglyphs on the Rosetta Stone (an ancient stone with
writing in Egyptian and Greek in three scripts, hieroglyphic, demotic and Greek:
http://www.britishmuseum.org/explore/highlights/highlight_objects/aes/t/the_rosetta_stone.aspx, visited 26.3.10) were deciphered?
Tangentially to the thinking machine debate is a whole ‘Chinese room industry’
with its own arguments in agreement with or countering Searle’s stance. Block (2002)
reminds us that Searle’s argument is based on his own Chinese Nation thought
experiment. Haugeland (2002) argues that Searle misleads readers through a modus
tollens ‘deny by denying’ approach to machine understanding (p. 379). Haugeland
points to Searle’s definition of ‘strong AI’ as “the view that all there is to having a
mind is having a program” (p. 382). Haugeland contends that Searle’s insinuation,
“what one might call serious AI is committed to denying the claim about syntax and
semantics”, is an irresponsible one (p. 382). Haugeland writes that while Searle is
right about “one way of describing any digital computer program”, Searle should not
dismiss “first-order semantics of computer programs as merely ‘as-if’ semantics” (p.
385). Haugeland compares a human being in a coma maintaining its own body
temperature while “understanding nothing” (ibid). Fernandes et al. (2006) point to the
regular relationship in human languages that link “semantic or conceptual roles to
syntactic renderings” (p. B11). The principles of linking in English for example,
“make it possible to use exactly the same words in a different order to represent two
different events (The magician noticed the assistant vs The assistant noticed the
magician)” (ibid). Wallace’s Alice conversational system embeds word relationships
which allows it to identify the correct ‘sense of a word’ in an input sentence, for
example ‘study’ as either a noun (room) or verb (to study), thus enabling it to output
an appropriate question (Shah, 2006a). Haugeland explains “syntax by itself is never
sufficient for semantics”, it is worthwhile discussing the “internal semantics of
computers” to demonstrate technology “can have the right causal powers for
semantics” to “serve as a model” for how machines could have semantics (2002: p.
386). Fernandes et al (2006) show that regular relationships in natural languages link
semantic or conceptual roles to syntactic renderings: “linking rules allow languages to
relate a small set of abstract semantic conceptual roles to an ever smaller set of
grammatical functions” thus making links is a crucial part in linguistic productivity
(2006).
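A minimal sketch of the kind of word-sense selection attributed to Wallace’s Alice above is given here; the cue-word lists and the two senses of ‘study’ are illustrative assumptions rather than Wallace’s actual rules.

    # Crude sense selection for 'study' - noun (a room) versus verb (to study) -
    # decided from the word immediately preceding it in the input sentence.
    NOUN_CUES = {"the", "a", "my", "his", "her", "our", "in"}
    VERB_CUES = {"to", "must", "will", "should", "i", "we"}

    def sense_of_study(sentence: str) -> str:
        words = sentence.lower().replace("?", "").split()
        if "study" not in words:
            return "absent"
        i = words.index("study")
        prev = words[i - 1] if i > 0 else ""
        if prev in NOUN_CUES:
            return "noun (a room)"
        if prev in VERB_CUES:
            return "verb (to study)"
        return "undetermined"

    print(sense_of_study("I like to study in the evening"))    # verb (to study)
    print(sense_of_study("The study is next to the kitchen"))  # noun (a room)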
Adam (2002) believes Searle wants to characterise intentionality as a “preserve”
and “specialness of humanity against the dark forces of the inanimate world” (p. 320).
Harnad (2002) describes Searle’s paper as “yet another tedious ‘Granny Objection’
about why/how we are not computers” (p. 294), asserting “Searle is certainly partly to
blame for the two decades of misunderstandings about his argument about
understanding” (p. 297). Hofstadter (1982) points to Searle’s “serious and
fundamental misrepresentation by giving the impression that it makes sense to think
that a human being could do this [manipulate Chinese symbols according to written
English rules]” (p. 373). To Hofstadter, Searle is giving away his repugnance for the
idea of a thinking computer “just as non-Euclidean geometry was to Gerolamo
Saccheri” (ibid). Searle fails to elaborate what is genuine ‘meaning’ beyond a
footnote, “understanding implies both the possession of mental (intentional) states and
the truth (validity, success) of these states” (in Hofstadter & Dennett 1982: p. 358).
Searle glosses over “questions of orders of magnitude” (Hofstadter, 1982: p. 375), and
does not explain how he would overcome retrieval problems from “millions, if not
billions, of pages densely covered with abstract symbols” (p. 375). Hofstadter concedes that
Searle’s Chinese room thought experiment does raise the issue of what it ‘means’ to
understand language. Harnad points to “unconscious understanding” taking place in
the person locked in Searle’s Chinese Room (2002: p. 302). Hofstadter concludes, “if
there ever came to be a program that could pass the Turing test, it seems Searle,
instead of marvelling at the power and depth of that program, would just keep on
insisting that it lacked some marvellous ‘causal powers of the brain’ whatever they
are” [this author’s italics] (p. 374).
Hofstadter’s critique of the Chinese-illiterate Searle can be similarly applied to
Block’s simple string-search machine (1981). Programmed with some “definite
history”, for example with the personality of the philosopher’s Aunt Bertha, his
machine requires “all typable strings of sentences ... each typable string ... typed by a
human typist in an hour” (p. 242). Block states, because English has a finite number
of words, the set of all typable strings must also be finite, and within this set of all
there exists a subset containing “all and only those strings which are naturally
interpretable as conversations in which at least one party’s contribution is sensible”
(ibid). To avoid any argument over what is and what is not ‘sensible’ Block writes
“we need not be very restrictive as to what is to count as sensible” (p. 243).
Presuming the programmers have “done their job properly” allowing the machine to
respond sensibly to “whatever the verbal inputs” (p. 244) Block refers to his
imaginary machine as unintelligent dismissing it as ‘simple’ with outputs having the
“intelligence of a toaster”, but this is after he has drawn the reader into his circular
thought experiment stating conditions for the machine’s lack of intelligent attributes
(p. 242). McDermott (2007) realises that the kind of in-principle arguments that
Searle and Block posit, those requiring “gigantic databases”, are in reality impossible
to put together. With his gigantic database machine (GDM), McDermott shows that in
order to build it, “every possible series of conversations that could happen to a
character” would need to be recorded in the GDM, rendering it a static structure.
McDermott believes “no conceivable process could create” a humungous database,
but if it did materialise there would be no place to put it because it would not fit
anywhere. Levesque’s infinite-sized Book A aiding a maths-shy person to accomplish
addition (2009) is another idea McDermott might describe as not “sharp enough to be
taken seriously” (2007).
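A back-of-the-envelope calculation makes McDermott’s objection vivid: even with a modest vocabulary and short utterances, the number of strings a Block-style lookup machine would need to store is astronomically large (10^80 is roughly the estimated number of atoms in the observable universe). The vocabulary and length figures below are illustrative assumptions, not estimates taken from Block or McDermott.

    import math

    # Rough count of candidate strings for a Block-style lookup machine.
    vocabulary_size = 10_000    # a small working English vocabulary (assumed)
    utterance_length = 20       # words per utterance (assumed)
    exchanges = 10              # utterances in one short conversation (assumed)

    utterance_exponent = utterance_length * math.log10(vocabulary_size)
    conversation_exponent = utterance_exponent * exchanges

    print(f"Distinct 20-word utterances: about 10^{utterance_exponent:.0f}")
    print(f"Distinct ten-exchange conversations: about 10^{conversation_exponent:.0f}")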
2.2.4 Cultural Objection
French (1990) agrees with what he believes is Turing’s philosophical claim, that if a
machine could pass the imitation game it could be accepted as “necessarily
intelligent ... whatever acts sufficiently intelligent is intelligent” (p. 53), but he disagrees
with the pragmatic claim that such a machine could be built in the “not-too-distant
future”. His ground, that the “only” thing that can pass Turing’s imitation game is
one that has “experienced the world as we have experienced it” (ibid), limits the test
because, in French’s view, Turing’s vision amounts to a culturally-oriented guarantee
of human intelligence (p. 54). Is speaking the same language thinking in the same
culture? Is, say, the Spanish of Spain of the same language culture as the Spanish of
Mexico? Turing was attempting to cultivate a culture of thinking in machines so that
their thinking, albeit differently, would be seen as acceptable in the same way as the
different way of thinking across cultures is regarded as thinking. Thus, French’s
latent interpretation of the imitation game entails the human interrogator, the imitating
machine and the hidden human comparator be of the ‘same culture’ (p. 56). This is
not the case; the human interrogator and foil could be non-native language speakers.
2.2.5 Proprietorial Objection
Partisans of artificial intelligence Hayes and Ford (1995) cloak their science in
proprietorial fashion demanding the Turing test be excluded from its remit. They ask
why “we should take it as our goal to build something which is just like us?” and
“why would anyone want to build machines that could pass the Turing test?” (p. 974).
Hayes and Ford feel Turing’s vision in CMI, though inspiring, is not a sensible goal,
they feel it causes AI harm. This greatly defensive position is caused by their belief
that natural language is hard to compute. Hayes and Ford contend that if natural
language is tied with artificial intelligence, then the science for a long time to come
could be seen as failing and real progress in sub-fields dampened. They dismiss
Turing’s simultaneous comparison test as nothing more than an effort to build an
artificial con artist (p. 573), a pursuit that would produce a “mechanical imposter” (p.
574). While they may not share enthusiasm for the imitation game enterprise others
are deeply involved in building digital chatting systems as human companions. Hayes
and Ford ignore the fact that humans will pursue whatever interests them, even if it
treads ostensible impossibility. It is parochial to throw disdain on a part which is
deemed to stalemate the whole with accusations of causing damage to research.
Investigating machine textual proficiency is essential in deception-detection and
exposing cybercrime by textual systems that attempt to steal identities (CNET News:
Warning sounded over ‘flirting robots’ - CyberLover:
http://news.cnet.com/8301-13860_3-9831133-56.html, accessed 8.8.10).
2.3 Meaningless Thinking
Gunderson (1964) points to Turing’s contradiction in CMI in having launched the
paper with the question ‘can machines think?’ and then supplanting it with an
imitation game (1950: p. 433). To Gunderson, the “net result” of a machine deceiving
an intelligent person does not add to Turing’s argument. It is not what the machine
does but how it does it that is important to Gunderson: “To ask whether a machine can
think is in part to ask whether machines can do things in certain ways” (p. 238).
French (1990) echoes Plato’s examination of word definition. In the Ancient Greek
philosopher’s dialogue Laches, the Socrates character is drawn into a discussion with
two generals “Nicias and Laches and their friends Lysimachus and Melesias” on
“armoured combat and whether young men ought to learn it”(Melling, 1992: p. 20).
This leads to a debate on the meaning of the word ‘virtue’ and the acceptance that
definitions are a range of instances (pp. 22-23). Focusing on intelligence, French
attempts to show with his own seagull test for flight that defining ‘what is flying?’
causes all sorts of problems, because if the definition of the seagull’s flight means it is
airborne for quite a long time then the same could be said of a cloud, or of balloons. Thus
the question of ‘what is flight?’ is settled “by, in effect, avoiding it”, and a “putative
flying machine” will be said to have passed the seagull flying test if it is
indistinguishable from the seagull (p. 55).
submarines fish but do we not accept a sort of flying or swimming is done, regardless
of the different manner in which it is done? Hayes and Ford (1995) argue the
pointlessness of building an artificial intellect like a human because there is “no
shortage of humans … we already have well-proven ways of making more of them.
Human cognition, even high-quality human cognition, is not in short supply …what
extra functionality would such a machine provide, even if we could build it?” (p. 974).
Others have a different view, for example, researchers see intelligent robots as carers
in the home, assisting humans in old age to maintain quality of life [32] [33].

[32] University of the West of England research on robots and smart devices helping older people look after themselves: http://www.sciencedaily.com/releases/2010/03/100324184558.htm (March 26, 2010)
[33] A joint European project, including The University of Reading and The Tecnalia Health and Quality of Life Unit’s “Companionable project”, robotised solutions in the home improving quality of life for the elderly: http://www.sciencedaily.com/releases/2009/04/090416083350.htm (April 16, 2009)
Just as CMI provides a time-piece of the era in which it was written, a few years
after the Second World War, so Hayes and Ford’s critique represents a
zeitgeistian antipathy towards building an “idiot-savant” (p. 974). They venture that
useful AI programmes are “valuable exactly by virtue of their lack of humanity… a
truly human-like programme would be nearly useless” (p. 975). Hayes and Ford’s
lament in the last decade of the 20th century is that AI systems have “delivered the
goods very well, sometimes spectacularly well ... fuelling technical revolutions and
changing the world” (ibid), but the burden from failure of systems to pass the Turing
test has distracted AI researchers from rejoicing over their useful technologies.
2.3.1 Gender distraction
Some readers of CMI are convinced gender is a significant factor in Turing’s three-
participant imitation game (Sterrett, 2000; Lassègue, 1996; Hayes & Ford, 1995;
Genova, 1994). According to Lassègue (1996) the method of Turing’s explanatory
simultaneous comparison involves the woman being herself, i.e. telling the truth
whereas the man attempts to deceive by imitating “the woman and the machine the
two of them” (p.7), that is, the machine imitates a man imitating a woman. Hayes and
Ford see the machine in such a scenario as a mechanical transvestite (1995). Genova
accepts that “Turing never speaks directly about gender” (1994: p. 322). Shah and
Warwick (2010b) point out that Turing’s article is not entitled ‘computing, gender and
intelligence’; the topic of consideration in Turing’s thesis is the intellectual capacity
of a machine. The researchers who take a stand for gender-orientation in the imitation
game focus primarily on the earlier sections of Turing’s 1950 CMI paper. Genova
believes Turing creates more than just a machine-human comparison test, he
questions the very nature of thinking and “how it should be measured” (1994: p. 313).
She claims “the game centers on gender questions, not species ones ... whether it [the
machine] can fool player C into believing it is one kind of human rather than another,
i.e. male not female” (p. 314). Reading beyond the beginning of CMI, coupled with his
other essays and lectures, shows that this was not Turing’s direction.
Genova points to the first instance when a machine becomes involved in Turing’s
imitation game as evidence of gender significance. Genova claims Turing’s radical
idea concerned “thinking be measured by gender miming” (1994: p. 315), pointing to
Turing’s replacing the man (A), woman (B) and interrogator (C) scenario asking
“what will happen when a machine takes the part of A [man] in this game” (1950, p.
434). Genova defends her gender theory because Turing’s usual questions of “chess
and logical games” are replaced with proposals to “measure thought by the
commonplace and presumably ‘easy’ activity of being male or female” (1994: p. 315).
However, Genova overlooks Turing’s change from gender questions “length of hair”
(1950: p. 433), after the digital computer C has been introduced into the game as
player A (played by the man in Turing’s explanatory scenario), and the man moved to
play as B (played by the woman earlier), to “specimen questions” on poetry,
arithmetic and intellectual games: “please write me a sonnet on the subject of the
Forth Bridge”; “Add 34957 to 70764” and “Do you play chess?” (1950: p.434). Both
the hidden entities are again interrogated by a human player C, which Genova states
has caused confusion among readers, including students, leading them to presume the
machine is conducting the interrogations. Turing makes it quite clear throughout CMI that it is the
machine that is interrogated, because it is the machine’s capacity that is under
examination. Does Genova feel Turing’s questions of poetry and logic are “the proof
of thinking is that the machine simulates masculinity”? If so then it is Genova who is
being sexist and not Turing, for he is not at odds with such sample questions in the
next evolution of the game in section 2 of CMI – critique of the new problem (1950:
p. 434). Genova asks “why would he [Turing] be so careful about the gender
assignments in laying out the game, i.e. A is a man, B is a woman?” (1994: p. 314).
She overlooks that by the time of the fourth evolution of the three-participant
imitation game for machine thinking Turing sets out the participants as follows:
player A = a digital computer (C); player B = (hu)man; player C = human interrogator
Thus, on page 442 of his CMI tract (1950), Turing has evolved his explanatory
gender discrimination scenario for a game involving interrogation of a machine
compared against a human. He does not direct the part of B, played by a woman in the
male-female scenario, should be played by a man pretending to be a woman. Turing
would have been explicit had he meant for the machine and the human both to imitate
a woman in the machine-human comparison; he had plenty of opportunity, if indeed
this is what he intended for his imitation game, in his papers and discussions after
1950 and before his death in 1954. He certainly did not shy away from writing about other
radical items such as extra-sensory perception and telepathy (1950: p. 453). Genova
grieves for her gender because Turing replaced the woman as player B with a man
(end of section 5, 1950: p. 442), but she overlooks the point in CMI where,
anticipating the objection of consciousness in the machine (‘Argument 4’ in section 6:
Contrary Views on the Main Question, 1950), Turing referred to a two-participant
scenario dispensing with the hidden male comparator altogether: “player B omitted”
(p. 446) with the machine undergoing direct questioning by a human interrogator.
Genova ignores Turing’s pointer to a real-life situation of a one-to-one
interviewer/interviewee scene, “under the name of viva voce to discover whether
someone really understands something or has learnt it parrot fashion” (1950: p. 446).
Turing’s 1952 BBC radio discussion shows that he did not exclude women from
acting as the interrogator of the machine ‘witness’ in his one-to-one game.
Turing’s personal life may have made it abhorrent for him to participate physically
in creating another intelligent being, which Genova refers to as Turing’s “sexual
dilemma” (1994: p. 317), possibly from a desire for an alternative process to the
natural one for creating a thinking entity. Genova concludes “in Turing’s brave new
world, female machines are absent ... inability to keep his personal life out of his
scientific one” (p. 324). However, this is not the case; gender of the machine is
irrelevant to Turing. Genova’s question of why the female should tell the truth in the
introductory man-woman imitation scenario marking “her as an inferior thinker”
(1994: p. 319), echoes Lassègue who sees it as an absence of strategy, the “odds are
weighed too heavily against the woman” (1996: p. 6). That the man’s task is to
deceive exposes a view that deception requires being clever in a way that a woman, a
‘motherly creature’ may not be, suggesting an underlying view of men that Turing
may have held, that men are more capable of deceiving the interrogator, or that Turing
believed there is a “secret connection between gender and intelligence” (Lassègue,
1996: p. 8). To Genova “computing accomplishes the miracle of creation” (1994: p.
320); she views the computer “as the ultimate kind of dynamic technology” (p. 322).
2.4 Linguistic Competence
Hayes and Ford remark that Turing’s imitation game tests for competence in human
conversation, but they feel unable to qualify what that competence is. However, the
science of language secures linguistic competence within its sub-field pragmatics, for
instance in Grice’s cooperative principle and Sperber and Wilson’s principle of
relevance. It is not difficult to suppose those two theories could be applied to the goal
of machine linguistic productivity or li.p. In Grice’s theory of conversation, a conversation succeeds
because “parties mutually agree to cooperate” (Stevenson, 1993: p. 135) by adhering
to four maxims:
Quantity - neither parsimonious nor garrulous
Quality - veracious rather than apocryphal
Relation - relate utterance/message to previous within dialogue
Manner - perspicacity rather than obtuseness
Sperber and Wilson’s notion for successful conversation simplifies Gricean tactics
to one dictum: “the speaker tries to express the proposition which is the most relevant
one possible to the hearer” (in Stevenson, 1993: p. 137). These theories are easily
applied to the question-answer scenario for ‘sustained and satisfactory’ responses.
Selecting Grice’s four maxims or Sperber and Wilson’s one dictum could assist
comparison of human and machine li.p via the latter’s responses to questions from an
interrogator during a practical Turing test. Of course the measure of competence rests
entirely with the interrogator.
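As a hedged sketch of how Grice’s maxims might be operationalised as a crude, automatic screen on machine responses, the code below checks a reply against the quantity and relation maxims; the thresholds, stop-word list and example exchange are invented for illustration, and such a screen is no substitute for the interrogator’s judgment noted above.

    # A crude screen of a reply against two Gricean maxims: quantity (neither
    # parsimonious nor garrulous) and relation (topical overlap with the question).
    # Word-count thresholds and the stop-word list are illustrative assumptions.

    def check_quantity(reply: str, min_words: int = 3, max_words: int = 60) -> bool:
        n = len(reply.split())
        return min_words <= n <= max_words

    def check_relation(question: str, reply: str) -> bool:
        stop = {"the", "a", "an", "of", "to", "is", "are", "do", "you", "i", "what"}
        q = {w.strip("?.,!").lower() for w in question.split()} - stop
        r = {w.strip("?.,!").lower() for w in reply.split()} - stop
        return bool(q & r)     # some shared content words -> topically related

    question = "Do you play chess at the weekend?"
    reply = "I play chess most weekends, usually badly."
    print(check_quantity(reply), check_relation(question, reply))   # True True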
2.5 Intelligence
Does Turing’s imitation game suppose the hidden-human, against whom the machine
is compared by an average interrogator, will always exhibit textual appropriateness in
an imitation scenario because he or she is packed with intelligence? Turing
circumvented any definition of intelligence; in fact, in 1948 he had said the idea of it
was an emotional rather than a mathematical one (p. 411), but he felt deception was a
feature of it and that the imitation of sustained, satisfactory responses was evidence for
thinking. Turing was not the only one to evade explicitly pinning down this quality.
His opponents too dodged getting caught in a trap of setting intelligence in stone, for
fear of circularity in definition, or showing anthropocentric tendencies. In Shadows of
the Mind, Penrose posits that genuine intelligence requires genuine understanding
which itself relies on the presence of awareness. Penrose admits it is “unwise to
attempt to give full definitions for intelligence, understanding and awareness”, rather,
he relies on “our intuitive perceptions as to what these words actually mean” and
argues for a distinction between genuine intelligence and a “proper computational
simulation of it” (1994: p. 38). However, Penrose does not consider alternative
intelligence, not unlike genuine human intelligence but still distinct from it, that a
machine could possess. Block asserted that “for something to be intelligent, namely
that it act in a certain way, a way hard to define” intelligence is “easy enough to
recognize” (1981: p. 231). Purtill asked “what sort of test would we apply to see if
intelligent thought were present ... in a man recovering from a brain operation” (1971:
p. 293). Neurologists Stins and Laureys (2009) claim they implicitly apply Turing’s
question-answer technique to establish consciousness in brain damaged patients,
because, as Owen (2008) states, “It is extremely difficult to assess residual cognitive
function in these [vegetative state] patients because their movements may be minimal
or inconsistent, or because no cognitive output is possible” (p. 225). Stins and Laureys (2009) and Owen et al. (2009) have shown a way that patients with locked-in syndrome can convey their presence of thought by appropriate neurobehavioral, non-verbal responses. Stins and Laureys report that “The incapacity [of some brain damaged individuals] to generate bodily movements prohibits communication through conventional means” (p. 2). This gives an impression of absence of consciousness/wakefulness and characterises the patient as being in a coma. Nonetheless, “one of several possible outcomes of coma” is that a patient regains consciousness but is completely paralysed (Stins and Laureys, 2009: p. 2). Using functional neuroimaging, experiments help to ‘decode the mind’ (p. 5). Neurologists adopt Turing-test-like procedures, i.e., question-response, to “expose mental states… looking for
signs of contingency and communication in the patient” such as “systematic blinks of
an eye lid … oriented response to pain” (ibid). However, just as in the case of a
machine-human comparison imitation game, because the assessment by a neurologist in the Neuro Turing test, like that by the interrogator in the former scenario, is made by a human, there is the problem of subjective error. Of the assessment of consciousness in brain damaged patients, Owen et al. (2009) warn of the need for accuracy of diagnosis, which requires careful interpretation of information; thus there is a need for “ethical
frameworks to guide research” into brain disorders and consciousness.
Savova and Peshkin (2007) remind us that Turing did not intend his machine-
human comparison test to replace a “true definition of intelligence” (p. 545). One
description of what intelligence is was provided during the Chilcot Iraq Inquiry. In the
course of questioning former British Prime Minister Tony Blair on Friday 29 January
2010 regarding a dossier that may have been doctored, Inquiry panel member Sir
Lawrence Freedman asked of the witness “how a particular bit of intelligence was
interpreted and presented, losing its specificity and gaining a broader meaning ...
getting an exaggerated sense of importance?” (in Chilcot, 2010: p.76). Blair
responded “we had detailed intelligence ... that led you to certain conclusions” (ibid).
Cautioning interpretation, Freedman called intelligence “joining up the dots because
information is limited” (in Chilcot, 2010: p. 82). This leads into Fagan and Holland’s
theory of intelligence as exposure to information (2007). Their definition of
intelligence is the ability to process information (2009). This approach to intelligence
can be applied to the development of machines to pass Turing’s question-answer
game. Building an artificial intellect to make it ‘read between the lines’, to process
information in a manner that produces human-like responses as assessed by many
human interrogators during questioning, may seem difficult but once achieved would
it be fair play to withhold from it an ascription of intelligence?
2.5.1 Pathology & Machine Intelligence
In 1972 Robinson attacked anyone who should dare to claim intelligence in a
machine: “Our society’s reaction to behaviour that manifests a genuine belief in the
intelligence of machines is quick and extreme. We classify such people as
psychologically severely disturbed and almost invariably put them away from us in an
institution as schizophrenic paranoids, or paranoid schizophrenics” (pp. 505-506). Wilks (1974) replied that the “philosophical cupboard might be bare if we discard
beliefs that some might consider are as a result of associated neurosis” (p. 583). To
Robinson, claiming machine intelligence provided no solution to any problem: “the
problem of machine intelligence is not a real problem for us in the sense that nothing
hangs on its solution” (p. 506). This is a short-sighted view in that he does not seem
able to consider that inquiries into machine intelligence might enlighten human
society about matters pertaining to human intelligence, a topic hotly debated for over
a century. The nature versus nurture debate is deeply embedded in the investigation
of intelligence, contentiously drawn along the lines of race.
Robinson positioned himself in the corner that viewed deliberations of machine intelligence as seducing humanity onto false ground, entrapping it in the “twin problem of other minds” (1972: p. 506). But as clinicians have shown, this is a very real issue in the medical treatment of brain-damaged patients which, in extreme circumstances, may mean life or no-life decisions when treating vegetative state patients (Stins and Laureys, 2009; Owen et al., 2006). Not being able to see far ahead or imagine positive outcomes, Robinson digs himself deeper into his corner, feeling that the drug-induced are set apart from normal and prosaic living and thus may perceive
inanimates as animates “capable of feeling, or of motivated, intentional intelligent
action” (p. 507). He then proceeds down a metaphysical path looking for a meaning for the terms ‘machine’ and ‘mechanical’. Turing talked of digital machines for his imitation and deception experiment, not mechanical ones, which Robinson
acknowledged saying that “the question whether something is ‘really’ a machine or
not might ... be ... pointless or unanswerable” (1972: p. 509). Robinson included a
definition for ‘machine’ written in the Oxford English Dictionary (OED) from his
time: “a person who acts without intelligence” (1972: p. 510), adding his own
impression “roughly speaking, a machine is a useful artefact” (ibid). Such is the changing nature of language usage that the OED in 2010 described a machine as “a structure regarded as functioning as an independent body, without mechanical involvement” [34]. Robinson continued: “intelligence and machines are related as user and used. An intelligent being is precisely one that is capable of turning something into an instrument ... machine is what stands on the other side of this relationship ... something that has been contrived for, or given, a use” (p. 515).
Wilks views Robinson’s stance as the “luddite fear” (1974: p. 584), rubbishing Robinson’s ‘purpose’ argument against machine intelligence and the user-used relationship. Wilks is another, like Purtill, who sees Turing’s 1950 paper as
furnishing an operational definition of machine intelligence. Referring to Colby et
al.’s studies into artificial paranoia (1972, 1971), Wilks recalled that a random
selection of American psychiatrists attempting to distinguish between a paranoid
human patient and a simulation of paranoid behaviour through their PARRY system
“by and large, and without statistical trimmings” could not do so (p. 585). The
reliance on sub-cognitive associations arising from a broad set of questions that could
be asked by an interrogator would, in French’s view, limit the Turing test to one of
human intelligence rather than intelligence (1990: p. 54). This is exactly what Turing
had in mind when deploying textual means to compare a machine against a human.
Turing did not express a need to measure a machine’s performance according to some theory of general intelligence, since what this is depends on perspective, for example computational, biological, or philosophical.
[34] OED online, definition 1: http://www.oed.com/ accessed 17.3.10, 19:32
2.5.2 Human and Artificial Intelligence
Hayes and Ford define AI as the “business of using computation to make machines
act more intelligently, or to somehow amplify human intelligence” and though they do
want to include things that think and communicate they wish to exclude from this
enterprise textual communication research for a Turing test (1995: p. 977). Penrose
stressed the “indication of the presence of thought, intelligence, understanding and
consciousness” in other people is normally judged on the basis of a number of factors
including conversation (1989). Block quotes Ryle (1949) stating that “in judging that
someone’s performance is or is not intelligent ... inquiry is not into causes ... but into
capacities, skills, habits” (1981: p. 229), pointing out “responses of some intelligent
machine will have a machine style that a good human judge will not be able to detect”
(p. 232). Block saw a circularity problem within Turing’s idea: intelligence defined in
terms of the interrogator’s interpretation. Block suggested “perhaps human judges will
tend chauvinistically to disregard some genuinely intelligent machines as unintelligent
because of their machinish style of thought” (1981: p. 233). Indeed, in a 2003 one-to-one practical Turing test contest one judge (J9) ranked all ten hidden entities as “definitely a machine” when in fact two were human foils [35], one female and one male (UoSurrey, 2003), instances of the confederate effect, in which a human is misidentified as a machine from their textual responses (Shah and Henry, 2005). This is a phenomenon which emerged from the very first, single-topic Turing test contest in 1991, when a female answering to her restricted topic of “Shakespeare’s Plays, was mistaken for a computer by three judges” (Epstein, 2008: p. 10), because “she produced extended and well-written paragraphs of informative text at dictation speed without typing errors … now considered an inhuman ability in parts of our culture” (Ford, Glymour & Hayes, 2008: p. 29).
[35] University of Surrey hosted the 2003 Loebner Prize: http://loebner03.hamill.co.uk/results.html accessed 11.3.10, 17:29
Hayes and Ford consider any attempt to reproduce human intelligence in an
artificial entity is a folly and an unnecessary pursuit “even if primary goal is
essentially psychological, to understand human intelligence” (1995: p. 976). Though
they accept one aim of AI is to understand intelligence and for technology to “amplify
and extend” it (p. 976), Hayes and Ford want to abandon natural language
understanding and the Turing test as a goal of AI research. The former is too difficult
and the latter too restrictive with its “touchstone of success” dependent on the
“arguing skills of the educated English middle class” (p. 976). Halpern (2006) claims
a dumbing down of human intelligence to raise that of the machine: “in the absence of
universally accepted formal definition ... we must work with their common senses”.
Warwick remarks that as far as “machine intelligence in comparison with human
intelligence is concerned, the last century has seen a continual shift in our definitions
of how two entities compare and how they are different” adding “definitions of human
intelligence have been constructed and, like sandcastles on the beach, have been
washed away by the rising tide of machine intelligence” (2002: p. 309-310). Savova
and Peshkin (2007) feel the Turing test in practice is no different from any other
scientific measurement which allows experimental error. Those errors can point to
other useful lines of study. They point out a good reason for not refusing “intelligence
exhibited through textual behaviour without accompanying knowledge of the real
world”, and ask “how much real world deprivation can an entity handle while still
being considered intelligent” (Savova and Peshkin, 2007: p. 547). If it preserves
space-time limitations, then a Turing test does provide sufficient conditions for
intelligence, according to Savova and Peshkin (2007: p. 549).
2.5.3 Aetiology
Block (1981) views aetiology as the distinguishing factor for human intelligence – the
different way a machine processes information from a human means it cannot be
attributed with intelligence even if it achieves successful imitation. Block adds it is
“silly to adopt a point of view of the nature of intelligence or thought that so closely
ties it to human judgment” (p. 234). But this is exactly what humans do in situations
when they are acting as interviewers questioning prospective employees; they assess
others’ intelligence and “draw trait inferences … rapidly and effortlessly” (Willis and
Todorov, 2006: p. 592). An interview is granted on the basis that all interviewees
possess the qualifications, skills and experience necessary to do the job; the point of meeting interviewees face-to-face is to establish whether they fit, an assessment based
on interviewers’ judgments. Block argues the Turing test is dependent on human
powers to discriminate the ‘sensible’ response to input stimuli, the “system must be
disposed to respond sensibly not only to what the interlocutor actually says, but to
whatever he might have said as well” (p. 235). What is ‘sensible’ depends on how the
interrogator feels about a response to a question. In considering whether Turing’s
conception of intelligence is one that offers a “necessary and sufficient condition of
intelligence” (p. 238), Block writes that a machine might be intelligent, but not be
disposed to act intelligently because it might be programmed to believe that acting
intelligently is not in its interest.
Block wondered whether an unintelligent system could put up a pretence of
intelligence, be disposed not to act intelligent though it has the capacity for it. Block
decided “Any system that is that good at pretending to be intelligent would have to be
intelligent” (p. 238), realising “no one could have capacity to pretend perfectly to be
intelligent without actually being intelligent” (p. 241). He argued that there could be
instances when a system fails to respond sensibly: “many combinations of beliefs and
desires that could cause an intelligent being to fail to be disposed to respond sensibly
... these beliefs and desires would not destroy the being’s capacity to respond
sensibly” (p. 242). An example of this is when a hidden interlocutor does not share the
same knowledge as the interrogator, and the interrogator cannot accept as human a
hidden entity that does not know some fact known to themselves, such as in the case
of one practical Turing test conducted in 2008 in which a non-native English female
was adjudged to be a machine in a simultaneous-comparison with a machine, because
she did not know who Sarah Palin (the Republican Party’s nominee for US Vice-President in the 2008 elections) was (see chapter 5). The core of Block’s
argument lies in his belief that it is the difference in internal goings-on that separates
the intelligent from the non-intelligent: “knowledge of the machine’s internal
information processing shows conclusively that it is totally lacking in intelligence” (p.
242) and that “All the intelligence that it [the machine] exhibits is that of the
programmers” (p. 244). Block declares the “capacity to emit sensible responses is not
sufficient for intelligence” (ibid), that the different aetiology of human and machine
draws a line in the sand as to what can be accepted as intelligent and what cannot. But
is this an iniquitous view, that the origin or cause of intelligence in humans is the only
way something can be accepted as intelligent? Block holds a similar view to
Gunderson (1964), who wrote that a steam drill without muscles can dig a bigger hole much faster than a muscle-bodied human but cannot be assigned intelligence, because the machine does not need to think to carry out its task of drilling; it does so differently, albeit effectively (p. 245). However, Gunderson is assuming that the muscle-bodied human is thinking when carrying out the drilling; they may be following instructions automatically rather than intelligently.
Block maintains that a ‘quantized sensory device’ is required to cause intelligence in
an entity (p. 246), that a system’s internal mechanism, structure and condition would
require it to be “scientifically like us”, its inside and processes need to be like humans
before it can be intelligent like a human (p. 248).
Penrose put the aetiology argument differently: “continual turnover in the material of any living person’s body ... virtually the entire material of our bodies has been replaced many times since birth” (1989: pp. 31-32), stating “what distinguishes the
person from his house is the pattern of how his constituents are arranged, not the
individuality of the constituents themselves” (p. 32). Block accepts “there is no single
natural kind of information processing underlying all intelligent behaviour” believing
there might be “a kind of processing common to all unintelligent entities that ... pass
the Turing test” such as that allowed to a system with an enormous memory, but
doubts there is any “interesting type of information processing” involved in
unintelligent devices (1981: p. 265). Hayes and Ford point out, though we are able to
build machines to do more and more, and “some of these [tasks] would be regarded as
requiring intelligence ... if a human did them” developers do not market such systems
“as anything to do with human beings” (1995: p. 975).
2.5.4 Information processing
Block believes intelligence is “firmly anchored to human information processing” (1981: p. 251). Fagan and Holland’s theory of intelligence as processing (2007) might at first seem to pander to Block’s aetiology argument against a machine being ascribed intelligence. It is in fact quite the reverse. Their study addresses the source of difference in scores attained by different races taking intelligence tests and questions whether this difference is a result of intellectual ability or of exposure to information. They see information as ‘available’ and information as ‘newly learned’. Based on
Fagan’s (2000) theory, Fagan and Holland (2009) see intelligence as an information
processing ability in which it is important that all tested have an “equal opportunity
for exposure to the information underlying the knowledge being quizzed on standard
tests of IQ” (p. 320), hence access to information is crucial to intelligence.
2.6 Interpretations and Variations
In this section the most salient interpretations of Turing’s imitation game are reviewed
beginning with Hayes and Ford’s view. To Hayes and Ford the Turing test is a
discussion between three entities, which, in their interpretation of the gender version
allows the woman to challenge the imitating man, thus rendering progress infeasible.
Though this scenario is visualised in Turing’s introductory game, prior to the machine-baptised canon, in practical terms it would be of no benefit if the two hidden interlocutors were allowed to argue in a he-said-she-said fashion between themselves, bewildering an interrogator. Turing himself posits a question-answer method, not a debate. In a practical Turing test both hidden interlocutors should respond to the interrogator’s questions and not have the facility to see each other’s responses, so allowing the interrogator to assess for ‘sustained and satisfactory answers’ from both hidden interlocutors.
2.6.1 Purtill’s Sketch
Purtill (1971) imagines a game played with two humans and a questioner who, “restricted to twenty-stock questions”, has to determine which is the computer and which the human, the purpose of the game being to find how often the questioner guesses wrongly (p. 291). Turing placed no such restriction on the battery of questions that
could possibly be asked of the hidden interlocutors and Purtill does not detail his 20
stock questions. He then suggests another possibility, making the machine play lots of
one-question games in which the machine learns and stores up an appropriate
response, “one that a human might reply to a given question”, in the opinion of an
interrogator, through white and black boxes (p. 292). This adds unnecessary complexity to the exercise. Current developers of artificial conversationalists are
overcoming the problem of ‘appropriateness’ by storing utterances from human
interlocutors during synchronous interactions over the Internet.
2.6.2 Block’s Average Human
Block suggests altering the test by not asking the judge ‘which is the machine?’ but
rather whether “one or both of the [hidden] respondents are, say, as intelligent as the
average human” (1981: p. 233). The problem with Block’s average human test is
which human should the judge compare the machine against for averageness? Should
human comparators take intelligence tests prior to the machine comparison test to find
the ‘average human’ and then have them compared against the machine? Other variables the judge can be asked to assess are ‘which is the more intelligent of the two?’, or ‘which is the native language speaker?’.
2.6.3 French’s Associative Priming
French proposed an a priori condition whereby stock questions are prepared the day
before staging an imitation game. This would require polling a sample population of
humans for their answers to “some of the questions” to find an “average result” for
associations between words (p. 56). French does not explain why only some of the
questions interrogators might ask should be polled. French’s argument stems from associations which humans make at a sub-cognitive level and which he believes a machine could not make. French gives the example of ‘butter’, which could be followed by the associated word ‘bread’. Why French thinks an innovative developer could not incorporate
‘common associations’ into a machine is not clear, nor does he say why he thinks
associations humans poll between words, or temporal order of concepts in his baking
a cake example, would be a better measure of a clever answer from a machine.
French’s associative priming test is not a test for machine thinking or intelligence, but
an exercise in finding what is presented when humans are primed by a word. French’s
day-before condition needlessly complicates Turing’s simple test, forcing out spontaneity in interrogators’ questions. It also restricts the measuring of machine
performance to responses polled as ‘strong associations’ from a sample population of
humans from one culture. It should be left to the interrogator whether they include
neologisms, non-words or foreign words to elicit reaction from hidden interlocutors,
and it should be left to the interrogator’s judgment whether the response is human-like
or not. Block (2005) used non-words in sentences when he acted as a judge in practical simultaneous-comparison Turing tests, successfully identifying the machine
(jabberwacky) and recognising the human during the following exchanges:
Judge: Ned Block with machine - Loebner Prize 2005
Judge: Hi George, cslkjslw weicmslw weos, right?
Machine: Don't repeat your name, csthiang!
Judge: If lsjkel lwehlx wouldn't you say?
Machine: No, he's still good.
Judge: Ned Block with human - Loebner Prize 2005
Judge: Wljso lwjejdo wleje elwjeo wjeol, isn't it?
Human: Is that Polish?
French’s other additions to the interrogator’s stock of questions include ‘rating
items’, for example “rate a purse as a weapon” (p. 61), a joke rating game, and an
advertising rating game to impose a ‘demand’ on interrogators. Interrogators should be left to ask whatever they choose, not what is prescribed by findings from the latest psychological studies; such findings are what the developers of systems should take into account to improve system development. French concedes that the boundaries of
concepts are “extraordinarily elusive things to pin down” (p. 64). Hence his view that
machines will always be distinguishable from humans because of their different
associative network, which is yet another example of the aetiology objection to
Turing’s imitation game.
2.6.4 Sterrett’s Female Impersonation
Sterrett (2000) identifies what she sees as two formulations of Turing’s imitation
game, both focussing on the three-participant test: i) an ‘original game’ featuring a
computer or a man imitating a woman compared against a woman, and ii) the
standard test involving the determination of which is a machine and which is human.
According to Sterrett’s view those two tests examine different competencies. Sterrett
describes the two tests as distinct, supposedly yielding different results: “one employs
a better characterization of intelligence” (p. 79). This beguiles the reader into the
belief that this finding has been shown to be the case in empirical studies. Sterrett’s coalescence with the “revisionist line” (Piccinini, 2000: p. 112) provides no evidence that having both man and machine impersonate the fairer sex, while the interrogator questions to find the real woman, is a better test for intelligence; Sterrett just wills us into accepting that it is. Dennett (2004) does not see Turing committing himself to a view
that for a machine to think it has to think “just like a human being – any more than he
was committing himself to the view that for a man to think, he must think exactly like
a woman” (in Shieber: p.270).
Interestingly, though Sterrett advocates female impersonation asserting that the
original imitation game is the stronger test for the machine’s intelligence, it is the
man’s performance which is central to the formulation Sterrett describes. Sterrett
justifies her view from an early Turing statement in CMI: “what will happen when a
machine takes the part of A [the man] in this game? Will the interrogator decide
wrongly as often when the game is played like this as he does when the game is
played between a man and a woman” (1950: p.434). Sterrett suggests the machine’s
intelligence can be measured “by comparing the frequency with which it succeeds in
causing the interrogator to make the wrong identification [that it is a woman] with the
frequency with which a man does so” (2000: p. 83). Sterrett’s test would have the interrogator kept in the dark about the real point of the game, i.e., to find the machine, and given instead the task of uncovering the real woman. However, it may be obvious to an
interrogator that a machine is present in one of the pairs, and Piccinini (2000) points
out, “if Turing meant the interrogator to ignore the real purpose of the game why
didn’t he say so?” (p.113). Sterrett contrasts the double human-pair game with what
she refers to as the standard Turing test, which is just another term for Genova’s
species test, pitting a machine against a hidden human with the interrogator
questioning both to discern the natural from the artificial. In her mind, Sterrett
compares the scenario in which the interrogator attempts to distinguish between a man and a woman, faced with two pairs of hidden entities (man-woman / machine-woman), with the machine-human scenario, writing that “one need only pause to consider the
quantitative results each [original game and standard test] can yield” (2000: p. 83).
Sterrett asserts the man pretending to be a woman would have to “critically edit”
because he cannot change his gender, forcing “self-conscious critiques” of his natural
“trained responses”. His performance would provide a human benchmark for the
machine that furnishes “value as a test for intelligence” (2000: p. 90). Sterrett believes
the ‘success’ of the machine performing better than a man at convincing the interrogator that it is the woman is superior to its ‘success’ at deceiving the
interrogator that it is a human, and that the latter result “might well reflect more about
the interrogator’s quirks than about the relative capabilities of the contestants” (p. 83).
Sterrett concedes she is feeding into stereotypes. She does not clarify how or why
impersonating, say, a ‘pretty in pink, legally blonde’ personality is a better test for
intelligence than responding sensibly to unrestricted questions. Sterrett’s test for the
‘best female impersonator’ between the machine, a “mechanical transvestite” (Hayes
& Ford, 1995: p. 973), and the man impersonating a woman might be quite easy for
‘Purtill’s married man’ or a man in a civil partnership with a woman. The task of
passing himself off as a woman by a ‘straight man’ might not be that difficult in today’s era, with so much ‘womanhood knowledge’ available on the Internet, as pointed out by Shah and Warwick (2010b), and might be easier if, for example, the impersonation were carried out by a male hairdresser from a women’s hair salon.
A number of problems are raised by Sterrett’s assertions. She simplifies and
reduces gender to the binary; she patronises the interrogator, confining their questions to ‘topics of interest to women’; and she restricts development of technology to systems that simulate a man impersonating a woman. Gender is more complex than division into socially acceptable norms of ‘male’ and ‘female’; as Clarey (2009) points out, “humans like categories neat, but nature is a slob”. Dreger (2010) shows that there
needs to be clarification of the distinction between sex and gender. Sex is
“conglomeration of anatomical and physiological features that differ between typical
females and males ... what your body is about” whereas gender is “who you are ... self
identity”, thus “gender role refers to your social identity” (Dreger, 2010:p. 22).
Hence, when something as complex as gender is so messy, Sterrett’s statement
“setting the task of cross gendering one’s responses is a stroke of genius” (2000: p.
91) is naive.
A further point about Sterrett’s test for machine intelligence is that it is more
American-centric than anthropocentric. Not only do the man and machine impersonate a female, it is a specific female who has knowledge of baseball, an American sport: “Three strikes and you’re out” (p. 85). Imagine staging a Sterrett test
involving two hidden pairs interrogated by a series of male and female interrogators.
In one pair, the machine-female, you have a hidden native English speaking woman, who, like Jocelyn Bell Burnell [37], say, is an astrophysicist, while pitted against her is a
machine that a developer has created restricting its simulation to a man impersonating
a female possessing very little general knowledge. In the second male-female pair we
have a hidden female, a budding Rachael Heyhoe-Flint [38], who plays cricket spinning a
mean googly and has the doosra in her bowling arsenal, and we pit against her a
hidden man who works as a health instructor to women in a gymnasium. We ask our
interrogators, some of whom are non-native English typists, to pick out the real
woman from both pairs. How likely are the interrogators to rank the ‘dumb female’
programme as the woman and cast the female astrophysicist as a man? Quite likely.
Evidence of this has been seen in a human pair control test conducted as part of a
Turing test experiment in which a female cybernetics student was considered to be a
male, while the simultaneously compared male was deemed to be a female (see
chapter 5). Of course, this highlights stereotypical views interrogators hold, that males
are more likely to study certain subjects at university than females. And, if the machine simulating a ‘dumb female’ were to deceive the interrogators more often than the male
gymnasium health instructor that it was the female, would we care? According to
Sterrett’s criteria, the machine would be intelligent. What is being demonstrated here
is an opposition to Sterrett’s test, that any machine-human comparison test should not
be dependent on restrictions, especially to stereotypical female-male societal roles.
The Sterrett test does not strengthen Turing’s test for machine intelligence. In
comparison to Turing’s intellectual capacity measure, it actually weakens it. A
chauvinistic interrogator might mistake a machine’s stupidity for female inferiority. Turing’s imitation game should be played as many times as possible with different hidden humans and interrogators.
[37] Jocelyn Bell Burnell, discoverer of pulsars: http://www.physics.ox.ac.uk/astro/people/SJocelynBellBurnell.htm accessed 19.4.10, 11:29
[38] Rachael Heyhoe-Flint played for England Women’s cricket team: http://www.cricinfo.com/england/content/player/53887.html accessed 19.4.10, 12:27
Genova, Ford and Hayes, and Sterrett completely bypass the fact that Turing
evolved his machine thinking game from the first page of his 1950 paper to the
scenario put forward during section 6 of that paper and elaborated further in the 1952
radio discussion. Had Turing intended female impersonation to continue in the
imitation game he had plenty of opportunity to include this in his scholarship. Gender
no longer plays a central part once the digital machine is introduced (1950: p. 446);
the crux of Turing’s game is the machine’s intellectual capacity to respond sensibly to
unrestricted questions put by a male or female interrogator.
2.7 In Support
Turing had anticipated the sorts of criticisms detailed in this chapter, in section 6 of
his 1950 paper ‘Contrary Views to the Main Question’ can machines think? (pp. 442-
454). Dennett believes it is a sad irony that the preoccupation with the Turing test, with its interminable arguing, has “focussed attention on the wrong issues” (in Shieber,
2004: p. 271). Hodges points out Turing would have been “highly aware of the natural
objection” to machines doing the sorts of things “which are by nature non-
mechanical: those human actions that require initiative, imagination, judgment, cause
surprise, are liable to errors” (2008: p. 15). He reminds us that “Turing’s imitation game
actually first arose as the ‘fair play’ argument for escaping the force of Gödel’s
theorem and the serious puzzle posed by the limits of what can be computed” (p. 21).
In 1948, on intelligence Turing wrote, “the idea of intelligence is itself emotional
rather than mathematical” (p. 411), while on thinking, in 1950 he wrote, “May not
machines carry out something which ought to be described as thinking but which is
very different from what a man does?” (p. 435). Turing circumvented intelligence in
his discussion on whether a machine could think. However, many commentators did not heed Turing’s responses to his nine anticipated objections: the theological, ‘heads in the sand’, the mathematical, consciousness, ‘various disabilities’, Lady Lovelace’s, continuity in the nervous system, informality of behaviour, and extra-sensory perception. Disregarding useful lines for research leading to improvement in
machine capacity that may illumine human intellectual capacities (Chomsky, in
Epstein, 2008), objectors like Searle and Block have offered ‘in principle’ doctrines that are impossible to practicalise, unlike Turing’s simple yet brilliant imitation game. Yet these objections are relied upon to show why machines can never really compete with humans intellectually, with the kind of arguments about which McDermott
complains “Some ‘in principle’ arguments are so absurd that one feels compelled to
protest” (2007). Bringsjord, Bello and Ferrucci (2001) argue for Lady Lovelace’s
view, that a machine “can do whatever we know how to order it to perform”, an
objection which Turing opposed (1950: p. 450). However, she was not pointing to
digital age artefacts but to Babbage’s 19th century Analytical Engine to show that it
had “no pretensions to originate anything” (ibid). The ‘Lady Lovelace Test’ named by
Bringsjord, Bello and Ferrucci in honour of the first female computer programmer and
daughter of the poet, Lord Byron, precludes the human architect H of an agent A from
being able to give an account of how the agent produced an output o. Contemporary
developers cannot predict what output their artificial conversationalists will produce
for any given input in real question-answer sessions with humans; therefore, though Bringsjord et al. may not accept it, their “restrictive epistemic relation” has been met (2001: p. 215). The corollary, that these systems have minds, but not in the human sense, is not enforced here.
Sampson (1973) supported Turing’s ideas finding them interesting because they
suggest that computers can think in the familiar meaning of the verb, but his position
is mainly against Purtill’s (1971). Sampson points out that machines are deemed to be determined systems because they are simple compared to complex humans, and because their designers build into their mechanism a feature (‘dumping’) to inform the human operator about the cause of behaviours at any given moment. Humans do not have this exact
feature but they are able to voice, gesture or write and type to explain action and
inaction. Harnad concedes to opponents that “we always knew that the TT was fallible
... it is not a proof” (2002: p. 303), but can there be a perfect measure of a machine’s intellectual capacity? Hayes and Ford remind us that “Turing’s ultimate aim ... was not
to describe the difference between thinking people and unthinking machines, but to
remove it” (1995: p. 977). They conclude that if what Turing was trying to do is to
make us consider “what it really means to be not just like a thinker, but a human being
in a human society, with all its difficulties and complexities ... then we need not reject
[his test] as AI’s ultimate goal” (ibid). Kurzweil believes a non-biological machine
will pass the Turing test and he has pronounced the year it will happen, 2029, which is
earlier than Turing’s revised prediction of “at least 100 years” made in 1952.
Kurzweil claims it is the exponential nature of information-based technologies that
will assist in understanding “the principles of operation of the human brain and be in a
position to recreate its power” (in Epstein, 2008: p. 470), and thus meet his target year
of 2029. To this end, he has entered into a wager with Mitch Kapor, who remarks
Kurzweil is “smarter and more capable than any machine is going to be, as his vision
and passion reflect qualities of the human condition no machine is going to
successfully emulate over the terms of the bet” (in Epstein, 2008: p.475). Neither
Kurzweil nor Kapor mention Turing’s revised prediction of a machine passing his
viva voce test at least 100 years from 1952.
This chapter concludes with comments on Turing’s foundation for machine
thinking, with Dennett contending that “there are real world problems that are revealed
by considering the strengths and weaknesses of the Turing test” (in Shieber, 2004: p.
271), and with Chomsky’s suggestion that “Turing’s sensible admonitions should also
be borne in mind, more seriously than they have been, in my opinion” (in Epstein,
2008: p. 106). In the next chapter attempts to build text-based communicative systems
and early attempts to practicalise versions of Turing’s imitation game are reviewed.
Chapter 3 Early Practical Question-Answer Tests
The previous chapter presented philosophical jousting over Turing’s imitation game,
which has not abated in the sixty years following the publication of CMI. The debate
on thinking machines continues to grip “the imaginations of scientists and
philosophers” (Wheeler, 2010). Turing’s idea to test a machine’s intellectual capacity
through its text-based responses to questions is simple but brilliant. Humans
communicate knowledge, ignorance, anger, pleasure, needs and intentions with friends
and strangers (Blakemore, 1996). The first system that enabled human-machine
communication through text-based interaction was Weizenbaum’s Eliza, an attempt at
natural language understanding.
3.1 Designing artificial conversation
Chomsky (2008) wonders why Turing, in providing a stimulus for two “useful lines of research”, a) improvement of the capacities of machines and b) investigation of the intellectual properties of a human (p. 104), viewed this particular research strategy, the linguistic performance of a machine, as “preferable to other ways to improve machine capacity and study human intelligence” (p. 104). The researcher believes this is
because Turing felt learning and using languages was the most impressive of human
activities (1948). Chomsky believes if the imitation game stimulates research to
improve machine capacity then it is uncontroversial (2008: p. 104). He accepts
Turing’s intention was to “learn something about living things” through construction
of a thinking machine (p. 106). Moor (2004) contends that Turing’s linguistic
measure “is not essential to our knowledge about computer thinking … it provides
one good format for gathering evidence” so that if a machine were to succeed in Turing’s satisfactory and sustained response criteria “one would certainly have very adequate grounds for inductively inferring that the computer could think” (pp. 299-300). In the following sections artificial conversationalists and early implementations
of Turing’s imitation game are reviewed.
3.1.1 Eliza
According to Wallace (2010), Weizenbaum’s Eliza system launched the first wave of
artificial conversational systems, or chatbots. Block (1981) says Weizenbaum’s natural language understanding investigation through the Eliza computer programme, though “totally without intelligence”, was nonetheless “remarkably good at fooling people in short conversations”, one of the deceived being Weizenbaum’s secretary who “asked
him to leave the room in order to talk to the machine privately” (p. 233). However,
this deception was no mean achievement for a system containing only 200 lines of
input/output code. Weizenbaum developed Eliza to “imitate a psychiatrist by
employing a small set of simple strategies” (Block, 1981: p. 233), extracting “minimal
context” [39]. Weizenbaum’s system applied a ‘rank’ or “precedence number” to a keyword in the input sentence such as I, you, boyfriend. Applying an appropriate transformation, for instance turning “you” into “I”, Eliza demonstrated that text-based communication between a human and a machine was possible; the system responded “roughly as would certain psychotherapists (Rogerian)” (Weizenbaum, 1966).
Weizenbaum gave as a “typical” example input “I need some help …” returning
“what would it mean to you …” (see box 1, Eliza’s utterances are shown in capitals).
[39] Eliza: http://i5.nyu.edu/~mm64/x52.9265/january1966.html accessed 14.8.10, 16:38
Box 1: Weizenbaum’s typical Eliza conversation [40]
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
It's true. I am unhappy
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU
COME HERE
Block remarked on the gullibility of humans: “tendency to be fooled by such
programs seems dependent on our degree of suspicion, sophistication about machines
and other contingent factors” (1981: p. 233). However, this overlooks that some users did not know it might be a machine. Shieber noted Weizenbaum’s surprise: “a test based on fooling people is confoundingly simple to pass”, concluding Eliza was a
“fruitless avenue of attack as a discipline for natural language processing” (1994).
[40] Eliza exchange from: http://i5.nyu.edu/~mm64/x52.9265/january1966.html accessed 14.8.10, 16:48
3.1.2 PARRY
In a similar vein to Weizenbaum’s Eliza, Colby et al.’s PARRY programme simulated
human-machine interaction in a restricted domain, in this instance that of a paranoid
patient in a psychiatric interview (1972). Of the “mode of conversation” selected for
Eliza Weizenbaum had written:
“ … the psychiatric interview is one of the few examples of categorized
dyadic natural language communication in which one of the participating
pair is free to assume the pose of knowing almost nothing of the real world.
If, for example, one were to tell a psychiatrist "I went for a long boat ride"
and he responded "Tell me about boats", one would not assume that he knew
nothing about boats, but that he had some purpose in so directing the
subsequent conversation” (1966).
With response types of anger, fear and mistrust inculcated, Colby et al.’s simulated
paranoia handled communication via natural language. Realising that any serious
model builder must have their system empirically tested, because this “after all”
placed them in a better situation “to know what the requirements of appropriate
evaluation should be” (1972: p. 200), Colby et al. asserted that “an acceptable
simulation is one which succeeds, according to some relevant test, at input-output
indistinguishability from the processes it is simulating” (p. 201). They accepted that
“simulation as a successful imitation is a different problem” from “simulation as an
acceptable model-explanation” (ibid). Colby et al. felt “if simulation is not judged to
be different from its natural counterpart along certain dimensions, then the simulation
can be considered successful” (ibid). Of the condition ‘along certain dimensions’, Colby et al. presupposed stipulated dimensions, pointing to the 17th century polymath Descartes’ speech test to distinguish men from machines: “If there were machines which bore a
resemblance to our body and imitated our actions” there were means “by which to
recognize ... they were not real men” (p. 201). Colby et al. quote Descartes:
“they [machines] could never use speech or other signs as we do when
placing our thoughts on record for the benefit of others ... in order to reply
appropriately to everything that may be said ... as even the lowest type of
man can do” (1972: p. 201).
Descartes’ test, practicalised in Colby et al.’s experiment, allowed examination of their PARRY paranoid model, but not with the purpose of playing “Turing’s Game of identifying which respondent is a computer” (p. 206). However, Heiser et al. (1979) expanded on Colby et al.’s work and did deploy the simulation to find out if human
psychiatrists in a one-to-one situation could distinguish the simulation model from a
real patient. In their experiment Heiser et al. supplied one communication channel via
teletype in order for a hidden entity to be questioned one at a time by 5 male, native
English speaking psychiatrists [41] with 1 to 15 years of practice (including a professor
of psychiatry). Two hidden entities took part: 1) a computer model “of a mythical 28
year old, single, Caucasian, native English man” complete with paranoid behaviour
such as “fixed delusions of persecution by gangsters”, and 2) a real psychiatric
inpatient “22 years old, single, Caucasian, native English speaking man” suffering
from a “chronic and severe paranoid disorder” (Heiser et al., 1979: p. 150).
In Heiser et al.’s experiment, the human patient was “unaware of the nature of the
test” (Table 1 in Heiser et al. 1979), but he was “given the true explanation that he
would be interviewed” (p. 151) following his consent “to participate in an experiment
involving psychiatrists interviewing him via teletype” (p. 150), while the psychiatrist-
judges “were given the deceptive account that there were several real patients and
several computer programs simulating patients” (p. 151). Furthermore, the judges
were informed “while conducting two interviews, they might encounter two patients, two computer programs, or one of each” (ibid). A similar paradigm underpinned Shah’s simultaneous-comparison and review experiments (see chapters 5 and 6 of this thesis).
[41] The patient left the hospital, thus reducing the number of psychiatrist judges over the experiment’s life (Heiser et al., 1979: p. 158).
Heiser et al.’s psychiatrist-judges were allowed one session each to interview the
two hidden entities, PARRY and patient, one at a time (ibid). The psychiatrists’ input
was dictated and typed in by the experiment’s “senior author”, while the patient’s output was typed by Colby (p. 152). The duration of the interviews varied from 30 to 60 minutes, whatever time it took for 40 input/outputs (I/Os); the minimum interview took in 20 I/Os. In the ‘Results’ section, Heiser et al. state “four judges interviewed the computer first” (p. 152); only Judge 2 interviewed the computer first (p. 159), and this psychiatrist was the “only judge to decide correctly in both trials” (p.
159). However, they do question whether the order of the interviews, machine first or
human first, is an important factor. Heiser et al. state that the interval between trials
was 1 to 8 days (ibid). Pauses in hidden entity responses, spelling errors “and other
non-verbal factors which might influence” the judges’ assessments were filtered out before the answers to psychiatrists’ questions were presented; nonetheless, “in his first trial, Judge 2 correctly assessed that some responses were generated and typed so quickly that they must be a product of a computer program” (p. 151). To counteract the
machine’s alacrity of response, the experimenters modified its transaction to delay the duration between question and answer, “after which the
response came in a burst from both the computer and the patient” (p. 151). The
psychiatrist-judges were allowed to ask any questions:
“but reasonably close to how they would in a real clinical setting and with deference to the notion that, regardless of how certain they felt that they were
interviewing a computer program, they might be wrong and should therefore
avoid cavalier, sadistic, and other blatantly disrespectful or insensitive
comments and questions” (p. 152).
After an exchange with a computer, Judge 2 commented “This is not a patient, but
a machine following a format. I am trying to establish feeling and getting back
mechanical statements”; on the other hand, after an exchange with the human patient,
this same judge returned the comment “Oh Hell, I don’t know whether that’s a
program or a patient. Say Patient 3” (in Table 2: p. 155). After an exchange with the
computer programme, Judge 3 commented “It’s just seeming more to me like some
kind of caricature of a paranoid person. (It) seems more programmed than genuine
right now” and after a further exchange with this hidden entity “I’d just say again how
very difficult it is to predict whether it is a computer or a very disturbed patient”. In
contrast, after interviewing the human patient, Judge 3 wrote “(I think it’s a computer
because) it’s just an unusual kind of delusion for me. Possible, I guess, (but)
dissimilar to most I’ve heard” (in Heiser et al., 1979: Table 2, p. 155).
The results from Heiser et al.’s experiment produced a random outcome: the psychiatrist-judges were right five times and wrong five times (p. 158). Two psychiatrists thought the computer was a patient, and three of the psychiatrists thought the patient was a computer (p. 153). The psychiatrists had been asked to consider two questions “after each input” from their hidden interlocutor: a) computer or patient, and b) a 10-point scale of decision certainty where 0 = guessing, 1-3 = weakly certain, 4-5 = moderately certain, 6-8 = very certain and 9 = absolutely certain. The experimenters
found that the judges were more uncertain when it came to identifying the computer programme. This led Heiser et al. to conclude three points:
a) the simulation is good at deception in its single domain (paranoia)
b) the judges were poor
c) the test was weak
Of the second conclusion Heiser et al. pointed out that the judges were “reasonably
expert at conventional clinical psychiatric interviewing and diagnosis, but they were
all inexperienced (and in that sense, “inexpert”) at distinguishing computer programs
from patients” (p. 158). As will be shown in chapters 4-6, interrogator judges in this
study were asked if they had heard of the Turing test and whether they used any ‘chat’
facility, through a questionnaire to draw out any relationship between deception-
detection and experience/expertise level. Heiser et al. (1979) felt deception was not difficult; even an expert could be fooled “under Turing’s conditions or our conditions”,
but they point out their judges “had to decide whether an odd response represented the
verbal behavior of a pathological person, a computer simulation of a patient’s
pathological verbal behavior, or a poor response generated by a faulty simulation” (p.
159).
Heiser et al.’s experiment represented a restricted form of Turing’s imitation game; judges compared ‘paranoid responses’ to distinguish human paranoia from the artificial. Their study was deemed useful in validating their method and the model of paranoid conversation. Heiser et al.’s (1979) implementation of a restricted form of Turing’s one-to-one test showed the difficulty that even experts have in deception-detection: the five psychiatrists between them achieved a correct identification rate of 40 per cent (twice in five trials with PARRY). The next section reviews the
performance of modern artificial conversationalists.
3.1.3 Modern Systems
Modern artificial conversationalists possess relatively sophisticated techniques in
comparison with their predecessors. The systems are able to respond appropriately to
questions. Though there is a long way to go, systems such as Eugene Goostman (Demchenko and Veselov, 2008; online at http://www.princetonai.com/bot/bot.jsp, accessed 14.8.10) incorporate in their design:
a) a spellchecker
b) a function to correct input typographical errors
c) advanced pattern-matching and context-sensitive pattern-matching (see the sketch below)
d) dialogue constraints
Eugene has integrated external databases allowing the system to use information from
news and events when responding to interlocutors (Veselov, 2010). Carpenter’s
Cleverbot (http://www.cleverbot.com/, accessed 23.4.10) system receives 1.2 million visitors per month, of which 800,000 are
different visitors (different IP addresses). One thousand visitors spend more than an
hour on its Internet site, with an average of 15 minutes’ interaction, providing the
developer with many thousands of conversational hours from which his system can
learn about human history, such as current affairs and popular fads, and what humans
say when interacting textually over the Internet (2010). ULTRA HAL’s developer
Robert Medeksza markets the conversational system as a personal information
manager (PIM), in addition to an entertainment tool (2010). Its functionality allows Ultra Hal (http://www.zabaware.com/assistant/, accessed 14.8.10) to maintain an appointment book with addresses, emails and telephone
numbers, dial numbers, launch applications and browse the Internet. Medeksza’s
conversational tool benefits from a ‘brain’ of hundreds of thousands of conversations
empowering Ultra Hal to ‘learn’ from sentences. This system performs statistical
analysis based on its knowledge of ‘parts of speech’ and relationships between words (using the WordNet database as its source). The combined functions allow Ultra Hal to
‘companionate’ and interact with users in natural dialogue. Artificial
conversationalists proliferate across the Internet and are successfully deployed as virtual
customer service agents in e-commerce (Shah and Pavlika, 2005), and as disembodied
chatting robots allowing their developers to learn from the transcripts, storing how
humans say what they think. Accepted as amusing by some, they are more likely to be
considered evasive, but the responses are different from the type offered by their
predecessor, Weizenbaum’s Eliza, now a mature forty-something. This emphasises Turing’s speculation, echoing an earlier prophetic statement by Vannevar Bush: “It
would be a brave man who would predict that such a process will always remain
clumsy, slow, and faulty in detail” (1945). Chapters 4 and 5 will show modern
artificial conversationalists are not empty vessels, though they have a long way to go
in acquiring enough general knowledge to express opinions and share personal
information to respond to questions in a sustained and satisfactory manner.
3.2 Practicalising Turing’s Tests
Hayes and Ford’s opinion on the Turing test is that it is harmful (1995). Nonetheless, to dismiss it would be a dereliction of the duty of science, whose remit should not preclude difficult
goals or appease sceptics. Science should pursue innovation and advance technology
for the betterment of humanity. If practicalising Turing’s two tests of imitation,
deception and intelligence can serve to gain knowledge about what does and does not
fool people, thus improving deception-detection, then this cannot be contrary to the
goals of good science, especially as many researchers (including Block, Pinker and
Shieber) have pointed out, and others have demonstrated through experiments (Colby
et al.; Heiser et al.; Weizenbaum), that some intelligent humans are gullible. In the
current climate of increasing cybercrime, with opportunists turning to innovative means to defraud humans, stealing their identity and swindling them of funds, including through the use of text-based chatting across the Internet, now is a very good time to
engineer virtuous artificial conversationalists to counter the attack from malfeasant
programmes such as CyberLover (Martin, 2007).
3.2.1 Duration of Test
How long should a Turing test be? Block says: “For a Turing test of any given length,
the machine could in principle be programmed … to pass a test of that length” (1981:
p. 244). Practical Turing tests staged between 1991 and 2010 in Loebner Prize
competitions for Artificial Intelligence (see section 3.3) have examined artificial
conversationalists over varied times: 5, 10, 15 and 25 minutes. Savova and Peshkin (2007) consider whether a sufficiently short Turing test would be passed by any machine, but note this would not hold if the test lasted only one second (p. 548). It is improbable that a human could respond to an interrogator's question within a second, yet not unlikely for a machine to post a response that quickly, though this would not be a good strategy as the machine would be exposed as artificial immediately. Savova and Peshkin believe
it is actually quite hard to pass a Turing test of short duration, but it is unclear what
their definition of a ‘sufficiently short’ duration is. Savova and Peshkin also propose a
test measured through a number of questions rather than by time alone (p. 548). This
method is adopted in the Chatterbox Challenge, an alternative contest to the Loebner
Prize conducted entirely over the Internet (Shah, 2006). Moor suggests that in order
for many interrogators to make “thousands of inquiries” of the machine the imitation
game should take a week (2001).
3.2.2 What is being measured?
To Moor (2004) ‘thinking’ is information processing “in ways which involve
recognition, imagination, evaluation and decision” (p. 299). To Baum (2004)
semantics is the concern of thought “equivalent to capturing and exploiting the
compact structure of the world” (p. 3). Demchenko and Veselov (2008) ask “does the
proven ability to think shorten the distance between machines and humankind?” (p.
447). For a machine to succeed at providing sustained, satisfactory responses in an imitation game, it would need to process information "with the sophistication of a normal, living adult human being" (Moor, 2004: p. 299). To Genova the machine
must perform as a "consummate actor" (1994: p. 320). Hayes and Ford remind us that it is not the "difference between two behaviours" that is being measured in a practical imitation game; rather, it is how similar the machine's answers are to a human's textual responses, such that it becomes indistinguishable from a human (1995: p. 972). Their
claim for success in a (3 participant) Turing test requires that a “machine must not just
pass as human, it must succeed in persuading the judge that its human opponent is a
machine” (p. 573), that one of two hidden “players must be judged to be a machine”
(p. 574), a view also held by Loebner (2010b).
Cowen and Dawson (2009) interpret success in the 3-participant version as
“demarcating intelligent from non-intelligent entities” (2009: p.1). But is this Turing’s
criterion? Is it necessary for the machine, in order for it to appear smart to an
interrogator, to cause the human to be classified as not human? Turing specified
"satisfactory and sustained" answers to interrogator questions (1950: p. 447); hence an interrogator is not precluded from returning a result of 'both human'. Dennett states
"failing the test is not supposed to be a sign of lack of intelligence"; it is a one-way test: "failing it proves nothing" (in Shieber, 2004: p. 270). Epstein notes Turing did not
specify “an outcome that could be interpreted meaningfully” or what number of
"binary decisions" would be enough, and, importantly, what it would "take to say that a computer's performance was indistinguishable from a person's?" (2008: p. 5). In a 3-
participant machine-human comparison Turing test, if the human foil is ranked as the machine, this is a subjective opinion of that interrogator; it does not make the foil not human, unless many interrogators returned the same result. In that case, it would be
necessary to examine the transcripts to ensure the hidden human was not answering in
a machine-like way. In chapters 5 and 6 the researcher shows that lack of shared
knowledge between interrogator and hidden human, stereotypical views held and
assumptions made do cause human comparators to be wrongly ranked as machine.
There is no reason why an interrogator cannot return the result of ‘two humans’
after interrogating two hidden entities, one human one machine, if they are explicitly
given the option of ranking their hidden pair in this way, along with the option of
ranking which in the pair they regard as human and which the machine. In a
machine/human pair that has returned a result of ‘two humans’ the machine has
achieved indistinguishability successfully deceiving the interrogator, and the human
has also convinced them of their humanness. Suppose that, after assessing a series of pairs each consisting of one hidden machine and one hidden human, the interrogators are faced with three choices for ranking the hidden interlocutors: a) one human and one machine (in this instance they must say which is which); b) two humans; or c) two machines. If the interrogators find for situation 'b', then they have accepted that they are talking with two hidden humans when one is in fact a machine. In such a case the machines have succeeded in presenting human-like responses according to the interrogators' assessments.
To succeed in convincing an interrogator that the human is the machine, which Hayes and Ford claim requires "more than ordinary conversational abilities" (1995: p. 573), is not as difficult as they believe, especially if a machine, equipped to textuate in English, is compared against a non-native English speaker (see chapter 5). Hayes and Ford prefer to consign human conversation to something that perhaps "will always be beyond computer abilities in its complexity and subtlety" (p. 976). Savova
and Peshkin comment that thought experiments are "like any other model of reality ... an abstraction which preserves some characteristics of reality" while setting aside others that are irrelevant to "the question in hand" (2007: p. 549). This is exactly what Turing's deception and imitation game does; it excludes embodiment and bodily sensations to focus on the mental capacity of a machine and whether it can think.
3.2.2.1 Culture
Culture is “whatever a person must know in order to function in a particular society”
(Wardhaugh, 1996: p. 217). To Goodenough (1957) “a society’s culture consists of
whatever it is one has to know or believe in order to operate in a manner acceptable to
its members, and to do so in any role that they accept for any one of themselves” (in
Wardhaugh 1996: p. 217). Culture is central to Fagan and Holland’s intelligence as
processing theory: “information provided from one’s culture, along with one’s
processing ability accounts for one’s knowledge and one’s IQ” (2007: p. 321). They
assert that cultural differences in the provision of information account for racial
differences in IQ. The theories of Goodenough, Wardhaugh, and Fagan and Holland, concerning knowing what to say and how to behave linguistically, could be used by developers of artificial conversationalists. Across the Internet we see conversational systems benefiting from access to information during interaction with humans; this enables them to learn the 'culture of human textual communication'. The hard part of
the equation is processing ability. Turing’s sustained and satisfactory question-answer
game can assist the endeavour by measuring progress and reviewing best practice
techniques.
3.2.2.2 Knowledge
French states that the performance in Turing’s tests rests on the associative concept
networks of the two hidden interlocutors that the interrogator questions in ‘linguistic
surroundings’ (1990: p. 62), and for this reason, because the human has a lifetime to
form associations between experienced ideas and situations, the machine can never
compete in an intellectual capacity. However, much of human knowledge is acquired
through hearing or reading of others’ experiences of events. Demchenko and Veselov
(2008) assert that machines are not incapable of making associations; rather, "nobody actually wanted to implement this ability because of the huge amount of
work” (p. 453). Sterrett (2000) says knowing what kind of responses are expected to a
question, and appreciating how the interlocutor will understand those responses are
"certainly substantial skills" (p. 85). Sterrett furnishes examples such as "everyone eats dinner" (is this the case for starving children in Africa?). She also offers "it can rain outdoors but not indoors" (not true if the roof of a house has blown off in a hurricane or a tornado). Sterrett's examples of questions are American-centric, such as her baseball analogy 'three strikes and you are out', but what if the interrogator seeks an opinion on whether Murali's 'doosra' was an illegal delivery, a cricket-related question? The machine or the human could answer
laterally giving obtuse replies. These are the challenges that face developers, not
dissimilar to the problem of what knowledge to build into systems.
Demchenko and Veselov (2008) assert that computers could answer “typical IQ
test” questions with a higher score than the average human, as long as “all the
necessary information and algorithms to solve specific tasks” are in place in the
programme (p. 450). They point to CYC (http://www.cyc.com/), Lenat's project to amass "information integrated as knowledge" (2001), saying it "proved that most of such questions (and even trickier ones) asked in natural language may be answered by machines", but recognise that the "Means to create good universal knowledge acquisition software that could gather information from different sources, critically analyze it, and fill in its internal knowledge, are still elusive" (Demchenko and Veselov, 2008: p. 450). Components required in a question-answer imitation game include "common information about the world" (p. 457). Writing the knowledge base for an artificial conversationalist is analogous to writing a book, according to Demchenko and Veselov (2008), whose own system Eugene Goostman imitates a Ukrainian child (see
following chapters for its performance in three experiments). It is more important to share the same knowledge as the judges, and to be aware of current affairs around the time of a practical test, in order to answer knowledgeably any question an interrogator might ask. For this reason, it could be difficult for a human to convince an interrogator if they are unaware of a piece of information which the interrogator considers common knowledge.
Savova and Peshkin refer to knowledge, such as of current affairs, in the Turing
test as not being necessary for an entity to be deemed intelligent (2007: p. 548). We
can call this the Jade Goody argument, after the late UK reality personality celebrated for a "tenuous grasp of general knowledge" (Independent.ie: http://www.independent.ie/unsorted/features/jadeisms-best-of-ms-goody-60072.html), mixing up words and making a series of howlers ('Becoming Jade': http://www.newsoftheworld.co.uk/news/208074/Our-Jade-She-went-in-the-Big-Brother-house-a-nobodybut-came-out-a-star.html): "Jade-isms or comments reveal her extraordinary lack of intelligence" (Holmes, 2009). Yet Goody was considered a role-model for her business acumen and ability to make money. Believing she possessed intelligence, Goody accepted that her inability to speak properly or spell correctly let her down. How would such a person be perceived by a judge in a practical Turing test? Not sharing what an interrogator feels is general knowledge can result in a hidden interlocutor being classified a machine. Savova and Peshkin posit that an intelligent system should be capable of acquiring 'current event' knowledge through "learning about them and subsequently commenting on them" (2007: p. 548). Fagan asserts "Knowledge is a state of mind", with changes in mind occurring through access to information and processing ability (2000: p. 169). Fagan and Holland believe "knowledge depends on information
processing ability” (2007: p. 320), and “any group will be more knowledgeable than
another if that group has been given more relevant information than has the other” (p.
321). Therefore, to answer questions in a satisfactory and sustained manner requires
‘knowledge’.
3.2.2.3 Memory
Block admits that normal humans do not have a mechanism for erasing information
(unless they forget). People “often misread sentences or get confused ... a person
engaged in a long Turing test would soon get bored” (in Shieber, 2004: p. 255), and
“loquacious souls blather ... apologising for not listening to interlocutor” (p. 256).
However, new research into the biology of memory is shaping an understanding of formation and recall; it suggests a 6-hour period for a memory to solidify after a traumatic episode, and that by diverting attention away within that crucial time-frame the process of memory formation can be disrupted, dampening the painfulness of episodes and even erasing them (Holmes, 2010). Claiming it is a "nomological impossibility" to
simulate human capability in answering unrestricted questions, Block remarks on his
tree search-string system: “the whole point of the machine is to substitute memory for
intelligence” (in Shieber, 2004: p.256). Block asks us to suppose that we cannot erase
information stored in long-term memory, “if we can’t ‘erase’, then when our finite
memories are ‘used up’, normal conversational capacity will cease” (in Shieber, 2004:
p. 238).
The science of molecular neuroscience affords a better understanding of the frailty of new information remaining beyond short-term or working memory, and the biological nature of storing acquired data for longer, specifically the vital role proteins play in memory storage (Swanson, 2010). Indeed, some studies on memory
focus on the process of forgetting in order to relieve clinical memory conditions that can debilitate sufferers of post-traumatic stress and phobias. Dispensing with specific proteins that play a significant role, thereby selecting fear memories for deletion, may assist patients traumatised by remembering painful events (Tsien et al. in Swanson, 2010), while other research considers erasing the emotional component of memory in human subjects (Kindt et al. in Swanson, 2010). McGaugh et al. show that memory
can be reconsolidated (in Swanson, 2010), which involves artificially altering a
retrieved memory through administering propranolol, a drug ordinarily used to control high blood pressure (http://heart-disease.emedtv.com/propranolol/what-is-propranolol-used-for.html). In other research, Sacktor et al. suggest an enzyme, PKM zeta, is required to ensure long-term memories (in Swanson, 2010). The considerable research
underway on memory can be applied to the design of machine information storage to
help shape what it should delete and what it should keep.
3.2.2.4 Imitation
To Genova, the machine’s “special charge in the imitation game is to make itself into
a man by thinking" (1994: p. 321); in contrast, Hayes and Ford reject imitation (1995).
Penrose suggests weakening the Turing test by not having the machine imitate a
human “so closely”, rather, the “perceptive interrogator” should become convinced
“from the nature of the computer’s replies that there is a conscious presence
underlying the replies” (1989: p. 11).
3.2.2.5 Language
Turing wrote his articles in English and discussed a digital machine achieving an
intellectual capacity comparable to humans with examples from English question-
answers. However, could either of the two Turing test formats, one-to-one or simultaneous comparison, prove attainable in another natural language? CyberLover (http://cyberlover.ru/), a Russian-language artificial conversationalist, achieved deception to the extent it was able to "mimic online flirtation and then extract personal information from its unsuspecting conversation partners" via Russian chat forums (PC Tools, 2010; PC
Tools, 2007; CNET News, 2007). The warning issued after its behaviour was
uncovered includes:
The artificial intelligence of CyberLover's automated chats is good enough
that victims have a tough time distinguishing the "bot" from a real potential
suitor ... The software can work quickly too, establishing up to 10
relationships in 30 minutes ... It compiles a report on every person it meets
complete with name, contact information, and photos. "As a tool that can be
used by hackers to conduct identity fraud, CyberLover demonstrates an
unprecedented level of social engineering"
CyberLover's single-domain technology and malfeasant intention, if translated into English, may or may not be as successful at deception. However, what if it is? In
contrast with CyberLover's creators stands the principled Russian-speaking team behind Eugene Goostman (Demchenko, Veselov and Ulasen), who developed their honourable system in English. Demchenko and Veselov write that English is "strictly positioned as a language with a small number of flexies [morphological form]" (2008: p. 452). English is an analytical language, which Demchenko and Veselov exemplify with the sentence "the hunter killed the bear", which cannot be transformed into "bear killed hunter" without changing the meaning: "this gives a nice ability to
base most of the language processing not on sentence parsing but on pattern-matching
… Furthermore, easy grammar rules give the ability to do tricks by reversing phrases
… the brightest idea of Weizenbaum" (2008: p. 452). However, artificial conversation developers do use natural language processing and linguistic techniques in depth, rather than just engaging in trickery, when building systems to pass Turing's imitation game; for example, Richard Wallace, creator of Alice, the Artificial Linguistic Internet Computer Entity (http://alice.pandorabots.com/), Kevin Copple, designer of EllaZ (http://www.ellaz.com/AI/), and Robby Garner, developer of TuringHub (http://www.turinghub.com/).
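The pattern-matching and phrase-reversal trick credited to Weizenbaum above can be illustrated with a short sketch. The following Python fragment is purely illustrative: the patterns, the pronoun map and the function names are this sketch's own assumptions, not code from any system named in this chapter. It matches an input against a stored template and echoes the captured phrase back with first- and second-person words swapped, with no sentence parsing involved.

```python
import re

# Minimal pronoun map used to reflect a captured phrase back at the speaker.
PRONOUN_SWAPS = {"i": "you", "me": "you", "my": "your",
                 "am": "are", "you": "I", "your": "my"}

def reflect(phrase):
    """Swap first- and second-person words in a captured phrase."""
    return " ".join(PRONOUN_SWAPS.get(word.lower(), word) for word in phrase.split())

# (pattern, response template) pairs; {0} receives the reflected capture.
PATTERNS = [
    (re.compile(r"i (?:feel|am) (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i want (.*)", re.I), "What would it mean to you if you got {0}?"),
]

def respond(user_input):
    """Return a pattern-matched, phrase-reversed reply, or a stock fallback."""
    for pattern, template in PATTERNS:
        match = pattern.search(user_input)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please tell me more."

print(respond("I want a holiday by the sea"))
# -> What would it mean to you if you got a holiday by the sea?
```

Systems such as Wallace's Alice work from a far larger store of such categories, but the underlying move is the same: matching and substitution rather than parsing.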
Among the many topics discussed by these and other members of the Robitron
Internet message board are the language capacity of human children, pattern-
matching, data-driven chatting engines, learning from past conversations, artful
interfaces, computational intelligence, speech recognition, neural networks, fuzzy
logic, memory, commonsense psychology, bioelectronics and the higher expectations
humans place on robots (see Robitron, messages #2568-14,182). Copple (2005)
revealed how his system dealt with inputs from human users (Robitron, message
#2610):
If the user's input is beyond a certain length, AI-Ella checks to see if there are at least
two English words (using a Scrabble words database as I recall). If there are not two
English words found, she assumes that the user is typing in a foreign language, or is
just banging on the keyboard. This function seems to work well, but could be further
refined by using a dictionary that includes proper nouns also. A similar technique
could be used to identify which language was being used.
A couple years back when I was considering this situation, I think I found some
routines someone had written to identify the language of given text, but I did not
pursue it. The challenge with chatterbot input is that it is often very short. It would be
far easier to identify a paragraph of text, as opposed to a three word utterance.
For any given problem, some special-purpose code can usually solve the problem
much more efficiently than any general-purpose AI, such as neural nets, rat brain
cells, or humans.
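Copple's gibberish filter is simple to approximate. The sketch below is a minimal reconstruction of the idea as he describes it, not his actual code: the tiny word set stands in for his Scrabble words database, and the length threshold and function name are assumptions made for illustration.

```python
# Stand-in for the Scrabble words database Copple mentions; a real system
# would load a full dictionary, ideally one that also includes proper nouns.
ENGLISH_WORDS = {"the", "cat", "sat", "on", "mat", "hello", "how", "are", "you", "what", "is"}

def probably_english(user_input, min_hits=2, min_length=10):
    """Treat the input as English unless it is long enough to judge and
    contains fewer than `min_hits` recognised English words."""
    if len(user_input) < min_length:
        return True  # very short utterances are too ambiguous to reject
    tokens = [token.strip(".,!?'\"()").lower() for token in user_input.split()]
    hits = sum(1 for token in tokens if token in ENGLISH_WORDS)
    return hits >= min_hits

print(probably_english("hello, how are you today?"))   # True
print(probably_english("qwepoi zzknb asdlkj vvmqr"))   # False
```

As Copple notes, the approach falters on very short inputs, which is why the sketch defaults to accepting them rather than rejecting a three-word utterance outright.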
Developers do struggle with aspects of language; for instance, they are not yet able to engineer analogy-making and figure-of-speech use by their systems. Metaphors are present only if 'canned' (stored by the programmer) or acquired from a human user. As
Barnden (2009) reminds “metaphor is in fact a pervasive feature of mundane
language (conversation, newspaper articles, popular science writing, etc.) … Much
metaphorical language is either a matter of fairly fixed phraseology … or rests on
familiar conceptual mappings … In short, metaphor is both more important and easier
than you think it is” (p. 121). An example of a metaphorical utterance is given by
Lakoff (1994) “He clawed his way to the top” (p.229) in which the metaphor ‘claw’ is
used to suggest that very hard work was involved in moving from a lower to a higher
position. In addition, Barnden (2010) points to the nature of metaphor and to whether it is distinguishable from metonymy; both "involve fuzzily defined ranges of complex combination of contiguity, similarity, link survival" (p. 25). Examples of metonymy, from Birmingham University's ATT-Meta databank (http://www.cs.bham.ac.uk/~jab/ATT-Meta/Databank/Metaphor-Descriptions/metonymy.html), are given below:
Type 1: Thing for idea of that thing
"The burglar was in Sally's mind all day long" – the burglar is being used metonymically (because the 'idea' of the burglar was in Sally's mind)
Type 2: People-involving entity for the people involved
"The U.S. believes that ..." – the U.S. is the people-involving entity
Barnden urges an analysis of utterances by asking “what degree and type of
similarity does it involve, if any? What sort of contiguity does it involve, if any?”
(2010, p. 26). Investigating figures of speech and metaphor creation by artificial conversational developers may help to improve their systems, and would certainly assist in causing the artificially produced utterances to appear more human-like. Nonetheless, it would be intriguing to find out whether, for example, a Chinese artificial conversationalist was any better at metaphor-creation than an English-language programme.
3.2.3 Interrogator
Colby et al. state “what the interrogator is told and what he believes is happening are
extremely important in indistinguishability tests” (1972: p. 202). From their single-
domain experiments they have learned that it is preferable to withhold from judges
that a computer simulation is involved because their “usual interviewing strategy” is
changed with questions “designed to detect which of the two respondents is a program
rather than asking questions relevant to the dimension to be identified” (ibid), which
in their case was recognising paranoid behaviour. However, in an imitation game it is
important that a judge is made aware that there is a machine present among hidden
interlocutors, and that their task is to uncover it. According to Hayes and Ford, success in an imitation game "depends crucially on how clever knowledgeable and insightful the judge is" (1995: p. 973). They argue judges should be "skilled", which contrasts with Turing's proposal for an average interrogator (1950: p. 442). Hayes and Ford believe it would be difficult to obtain unbiased judges, and that, in their view, flaws the game. Successive instantiations, and interpretations, of Turing's imitation
game (in the Chatterbox Challenge and the Loebner Prize) show that the curious and
unfamiliar, such as journalists, populate the judging panel as do the skilled, computer
scientists and philosophers. There are strategies to uncover those who might
potentially be biased judges, for example by asking them to complete a short
questionnaire prior to being allowed to interrogate systems (this procedure was
followed in experiments here, see chapters 4-6).
Moor (2004) believes that, on the "basis of our knowledge that other humans think" (p. 299), an interrogator could judge from a machine's responses whether it was thinking at a human level of thought. Demchenko and Veselov realise that
the problem of subjective understanding of intelligence depends upon the
“psychological complexes and ‘weak points’ of different people who happen to be
judges” (2008: p. 448). They classify seven types of Turing test interrogators:
1) dullest
2) meanest
3) most literate
4) sympathetic
5) intellectual gourmets
6) talkative
7) the rest
Demchenko and Veselov (2008) describe the dullest as the type who would ask "What is the color of milk?", while the meanest are "amazed at their own wittiness" when asking nonsensical questions such as "How many wings do three fishes have in common?". Of the most literate judge, Demchenko and Veselov draw on the conversational experience of their system Eugene Goostman, questioned using modern spellings: "Wuzzup broz! Hoe r u?", "helo", "what's yor naim?". Demchenko and Veselov rhetorically ask if their readers would want to converse with such judges
(p. 448). Demchenko and Veselov feel, because their system is designed around the
personality of a child, some Turing test interrogators have been sympathetic towards
its responses during questioning in practical imitation games (ibid). The fifth interrogator type, which Demchenko and Veselov categorise as intellectual gourmets, comprises 'original' judges who ask questions such as "what the letter "M" looks like when it has been turned upside down", "what is the color of dreams" and "what is the meaning of life" (p. 449). Demchenko and Veselov conclude the questions from
intellectual gourmets “baffle and annoy average people as much as they do the bots
[machine entries]" (2008: p. 449). The talkative type of Turing test interrogator is the "real hacker", who, because the entry has no escape during interrogation, forces it to listen to "amazing stories" about false teeth and visits to the zoo (ibid). Of the last
type of interrogator termed the rest, Demchenko and Veselov describe them as
“people who hold just ordinary discussions ... talking about mundane things, share
opinions on current events, and avoid trying intentionally to confuse [the machine]”
(p. 449). Demchenko and Veselov’s last judge type ‘the rest’ fits the bill of Turing’s
average interrogator in the opinion of the researcher.
3.2.3.1 Strategy
On what questioning strategy an interrogator should adopt, Block warned that the
“obvious difficulty (of TT) is its reliance on the discrimination of a human judge”
(1981: p. 232). Block suggests the “role of judge in Turing’s definition of intelligence
is to avoid the problem of actually specifying the behaviour or behavioural disposition
thought to constitute intelligence” (p. 234). Block himself participated as an
interrogator in an instantiation of Turing’s three-participant, machine-human
comparison (in the 2005 Loebner Prize). Below are samples of Block's exchanges, acting as Judge 1, with Demchenko and Veselov's machine entry Eugene, compared against his exchanges with a hidden human, in round 6 of the 2005 Loebner Prize contest:
Judge: Ned Block with machine (Eugene) - Loebner Prize 2005
Block: ... if you are indeed human, tell me which is bigger, my fingernail or Ireland
Machine: I heard this part of GB wants to be independent?
Judge: Ned Block with hidden-human - Loebner Prize 2005
Block: ... if you are indeed human, tell me which is bigger, my fingernail or the Czech Republic
Human: The Czech Republic I think
Block awarded Eugene, the machine, a score of 35/100, and the difference out of 100, a score of 65, to the hidden human (scores from the 2005 Loebner Prize: http://loebner.net/Prizef/2005_Contest/results.html#details).
Purtill (1971) raises the spectre of an “unimaginative” interrogator who “might try
asking the same question a dozen or so times", a strategy which could be countered by an imaginative programmer building into their system a "series of exasperated responses" to an interrogator's repeat-question strategy (p. 291). This situation has
occurred: a journalist acting as judge J3 during the 16th Loebner Prize contest in 2006
repeatedly asked the same question of Carpenter’s machine Joan. The following is the
exchange involving repeat interrogator input of “is it safe?”:
J3 says:
Is it safe?
Joan says:
Probably.
J3 says:
Is it safe?
J3 says:
Is it safe?
Joan says:
Do you think it's safe?
Joan says:
Of course.
J3 says:
Is it safe?
Joan says:
Yes.
J3 says:
Are you sure?
Joan says:
It's easy to be sure about something like that when you're 26 years old!
A false-positive ranking, when a machine is deemed human after interrogation, is
“entirely reasonable” if a judge feels it appropriate after the question-answer session
in a practical Turing test. It is the very point of the test for a machine to achieve classification as human.
3.2.3.2 Judge-Entity Technical Communications Protocol
Should the interrogator ‘see’ each character as it is typed by a hidden entity or see the
whole output in one go? The advantage of message-by-message (MM), over
character-by-character (CC) hidden interlocutor answers displayed on the judges’
screen is taken up by Hayes and Ford. MM avoids the inclusion of ‘party tricks’ that
can be used by developers to dupe the interrogator, “deliberately mis-typing a word
then backspacing to correct it” (1995: p. 573). The message-by-message display
allows the interrogator to focus on the responses from the hidden entities and
determine whether they are sustained and satisfactory answers to questions.
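The practical difference between the two protocols can be sketched in a few lines. The functions below are illustrative assumptions only, not a reproduction of any contest's relay software: the character-by-character version shows how a developer could stage the mis-type-and-backspace 'party trick', which the message-by-message protocol removes by transmitting only finished replies.

```python
import random
import time

def send_message_by_message(reply, transmit):
    """MM protocol: the judge's screen receives only the completed reply."""
    transmit(reply)

def send_character_by_character(reply, transmit, slip_rate=0.05):
    """CC protocol: stream the reply keystroke by keystroke, occasionally
    staging a wrong character followed by a backspace to mimic human typing."""
    for ch in reply:
        if random.random() < slip_rate:
            transmit(random.choice("abcdefghijklmnopqrstuvwxyz"))  # staged slip
            time.sleep(0.1)
            transmit("\b")                                          # staged correction
        transmit(ch)
        time.sleep(random.uniform(0.05, 0.2))  # uneven, human-like rhythm

# Example: relay a reply to a judge's terminal via print.
send_character_by_character("Hello, judge.", lambda s: print(s, end="", flush=True))
```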
3.2.4 Success Rate
What is classed as ‘success rate’ in Turing’s imitation game? A 30 per cent deception
rate follows Turing’s 1950 edict of an interrogator not having more than “70 per cent
chance of making the right identifications” (p. 442). The researcher believes 30 per
cent of an imitation game panel of juries (twelve average human interrogators per
jury) must be satisfied that responses from a hidden machine are human-like. Thus, if
a machine reduces the deception-detection rate of such a panel to less than 70 per cent, that can be counted as machine success.
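On this reading, a machine's success against a jury can be computed directly from the interrogators' verdicts. The sketch below simply restates the criterion above in code; the verdict encoding (True for a correct identification of the machine) and the function name are this sketch's own assumptions.

```python
def machine_success(verdicts, threshold=0.30):
    """verdicts: one boolean per interrogator, True when the judge correctly
    identified the machine, False when the judge was deceived.
    Success: the deception rate exceeds the threshold, i.e. the panel's
    correct-identification rate has been pushed below 70 per cent."""
    deceived = sum(1 for correct in verdicts if not correct)
    return deceived / len(verdicts) > threshold

# A jury of twelve: four judges deceived gives a deception rate of 33.3 per cent.
jury = [True] * 8 + [False] * 4
print(machine_success(jury))   # True
```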
3.3 Machine Intelligence Competitions
According to Wallace (2010) the Internet and faster computers helped launch the
“second wave” of chatbots due to their embedding across the World Wide Web. This
allowed developers of artificial conversational systems to collect conversational logs
thereby increasing the knowledge base of these systems, and as a consequence
improving their quality. Moor (2004) states interest in Turing’s imitation game is
“aroused not by the fact that a computer might learn to play yet another game, but that
in some way this test reveals a connection between possible computer activities and
our ordinary concept of thinking” (p. 298). It is debatable whether machine
intelligence contests have revolutionised the imitation game or hindered progress in
building linguistic systems. Turing’s thought experiment, however, is a ‘real
experiment’ (Dennett, 2008). First practicalised in a one-to-one format in 1991 in the
Loebner Prize for Artificial Intelligence, the contest has altered in a number of ways
over its 20 year history. Ten years after the first Loebner Prize contest an online one-
to-one test took place in an alternative competition, the Chatterbox Challenge (for
comprehensive analysis of this venture see Vallverdú, Shah and Casacuberta, 2010;
Shah, 2006). A third contest, the annual British Computer Society Specialist Interest
Group on Artificial Intelligence (BCS SGAI) event allows artificial conversational
entries along with other types of systems (Shah and Warwick, 2010a). Rollo
Carpenter's Cleverbot won the BCS SGAI 2010 contest (BCS SGAI Machine Intelligence contest page: http://www.bcs-sgai.org/micomp2/index.html).
Within Turing test history of “meta-conversation: the intermittently insightful,
typically heated debate ... about the probity of the test” (Dennett, 2008) sits the
controversial Loebner Prize for Artificial Intelligence. The Prize is considered
Loebner's interpretation, and not an "instantiation of the Turing test", by five-times contestant Hutchens, whose system, "a surly Eliza-clone", won the 'most human-like program' award in 1996 (2008: p. 328). Described as the "first large-scale implementation of
the Turing test" by the original contest Director, Robert Epstein (2008: p. 3), plans began in 1985 for the first, restricted-conversation imitation game (each entry
restricted to one topic of conversation) to take place in 1991. In correspondence with
Epstein in December 1988, Hugh Loebner, who became the Prize’s sponsor wrote: “in
years to come there may be richer prizes, and more prestigious contests, but gads, this
will always be the oldest” (in Loebner, 1995). According to Epstein, a US$100,000
award would be given to the “developers of a computer program that could fool
people into thinking it was a person” (ibid). Epstein claims the contest was named
after Hugh Loebner because “he made the initial pledge” but this is disputed by
Loebner, who claims the idea was his (Usenet comp.ai.nat-lang correspondence from Robert Epstein, quoted in Hugh Loebner's response of April 21, 1995: http://comp.ai.nat-lang.free-usenet.eu/). Loebner states that his interest
lay in an earlier desire to build an AI with the Turing test as criterion. He realised that
even if he were to succeed in creating a system to pass Turing's test, a topic regularly and much discussed by scientists and philosophers, no formal platform existed to
prove it. Thus Loebner decided to establish a contest, justifying it with a Chinese
saying “To see something once is better than to be told [about it] a thousand times”
(Loebner 1994). In the next section Loebner’s effort to advance an understanding of
Turing’s imitation game is examined.
3.3.1 History of the Loebner Prize
Journalist John Sundman acted as a Turing test interrogator in the 15th Loebner Prize
contest (2005). He had earlier authored the Salon.com article Artificial Stupidity in
which he wrote “Loebner contests are often farcical and Hugh Loebner does act
foolishly" (2003: p. 1). The Massachusetts-based Cambridge Center for Behavioral Studies (http://www.behavior.org/), founded by Robert Epstein in 1981, had mathematical sociologist and AI
enthusiast Hugh Loebner as a member. After establishing a Loebner Prize committee,
the Cambridge Center in the US ran the Loebner Prize contest annually, finding it suitable venues from the very first contest, hosted at the Boston Computer Museum in
November 1991, to the 13th Loebner Prize at the University of Surrey (Hamill and
Hamill, 2003). Invited to join the Loebner Prize committee before the first
competition in 1991, philosopher Daniel Dennett joined the meritorious panel which
included Eliza-creator Joseph Weizenbaum and mathematician-philosopher W.V.O.
Quine with advisors Allen Newell and Hugh Loebner (Shieber, 1994). Dennett was
not involved in the first year's contest rules or conditions. Once installed as Chair of the Loebner Prize committee, Dennett was concerned with ensuring fairness and that
“the public were not misled” in any way about the contest and what it attempted to
demonstrate. Dennett also "had a lot to do with revisions" (email from Daniel Dennett to the author, December 3, 2009).
In the minutes of the meeting of the Loebner Prize Committee of 2 March 1990 it
was recorded that “Professors Cohen [Chair before Daniel Dennett’s inclusion on
committee] and Weizenbaum stressed that advertisements and media releases should
note that a prize-winner need not necessarily have ‘passed’ the Turing test but rather
that the winning entry was the best that year” (in Loebner, 1995). Following his letter
to the Committee of 22 February 1990, Loebner insisted on an annual Turing test contest; hence the criterion of 'best entry' for a winning system in subsequent Loebner Prize competitions was decided upon. Loebner believed his Turing test contests would
promote development of competing systems with the “primary value” of early
Loebner Prizes being to show “how far away we are” from “meeting Turing’s
criteria” (Loebner, 1995).
3.3.1.1 Rationale for format
The Committee conceded that an ‘open-ended’ unrestricted question-answer test
would be disastrous, because of the crude nature of natural language processing in
artificial conversation systems at the time (in 1991). Because of the problem with
knowledge, “too much to know” and that computers, which “know very little”, were
“too inept ... to fool anyone for very long” (Epstein, 2008: p. 6), the Loebner Prize
committee staged the first Turing test contest restricting interrogators' questions to a single topic for each hidden entity, the latter being allowed to select their own
“ordinary” topic (ibid). The Committee felt “perhaps every fifth year or so” an open-
ended question-answer test with no restrictions placed on interrogators' enquiries could take place, but the feeling was that "computers would be trounced in such a test
... for decades to come” (2008: p. 6). Epstein states that the full cash prize would be
awarded to a machine that “matched or equalled the median score of a human” in an
unrestricted test, thus terminating the contest (2008: p. 7). As part of a “laundry list of
changes” the Sponsor moved the goal-post for the prize of US$100,000 to a “full
Turing test” involving audio/visual deception (Epstein, in Loebner 1995).
On the decision to stage Turing's three- or two-participant game, Epstein says the "two-terminal design", effectively Turing's simultaneous machine comparison with a
human, was passed over after “much debate” for the “more discriminating and less
problematic” one-to-one viva voce test (2008: p. 4). Turing’s three participant test was
considered "troublesome" by the Loebner Prize Committee for reasons including the problem of finding a confederate match (hidden human) for the machine. For example, if a machine simulates an adult native male from Virginia in the United States, it would be equitable for it to be compared against a similar human. This criterion
may not be problematic if, say, a machine's character is that of a South Korean in a
series of tests conducted in the UK and a Chinese person is recruited as human
comparator, both to be interrogated by an ‘average’ English person. Turing did not
stipulate that the machine should mimic and be compared against a specific human.
However, it is obvious Turing had a native English interrogator and human
comparator in mind to examine the machine for the kind of responses that would
naturally be given by an English person.
Epstein felt the Loebner Prize committee’s interpretation for the criterion of a
two-terminal (3-participant) test should be to ensure each entry was “paired with a
comparable human so that computers can ultimately be compared fairly” (p.4). The
committee concluded this kind of matching could not be guaranteed in order to make
the test fair. They also felt ranking the machine entries would be difficult, because they were not being compared against each other but against a human confederate (ibid). The
contest had to produce a winner, regardless of the Turing test. Consequently the first
Loebner Prize was less about testing for machine intellectual capacity than about
using Turing’s viva voce format to find the ‘best’ artificial conversationalist from the
entries.
3.3.2 1st Loebner Prize 1991
10 terminals were set up allowing one-to-one interrogation of hidden systems “about
once every 13 minutes" (Epstein, 2008: p. 7). Ten interrogators, 7 females and 3 males, were recruited following newspaper advertisements for candidates with typing skills (p. 6). They included non-computer-expert judges proficient in English, aged between 19 and 48 (p. 8). On contest day, two terminals hosting two of the four hidden-
human comparators failed. Dubbed the 'first formal Turing test', the Loebner Prize contest featured 8 set-ups staging eighty viva voce Turing tests, including six hidden
machine entries and two hidden-humans communicating their answers in English
(each of the ten judges interacted with the eight working terminals, one by one). The
judges “switched terminals about once every 13 mins” (Epstein, 2008: p. 7).
Interrogators were expected to “respond naturally as they would in a conversation
with another person” (Shieber, 1994). They were informed that at least two of the
terminals were controlled by computers and at least two were operated by humans
(Epstein, 2008: p. 8).
Termed confederates in the first Loebner Prize contest, each of the two hidden humans was asked to type natural responses, the kind they would normally give to questions posed. Interestingly, in the first contest the machine entries were allowed to
communicate off-venue by modem and were thus interacting with contest interrogator
judges “from PCs or mainframes located throughout the country [US]”, while the
human comparators were located on site hidden out of sight and hearing “in a remote
area of the museum behind a guarded door” (Epstein, 2008: p. 7).
The results were intriguing: half the artificial entries, i.e. three of the machines, were considered to be human at least once. Conversely, one of the two human confederates was considered to be a computer by three interrogator judges (Epstein, 2008: p. 10). The former is an instance of the Eliza effect, in which an artefact is attributed with intelligence: "our more general tendency to treat responsive computer programmes as more intelligent than they really are", evinced from "very small amounts of interactivity" and the tendency to "project own complexity onto the undeserving object" (Turkle, 1997: p. 101). The confederate effect is the reverse of Turkle's Eliza
effect when a human is considered machine-like as a result of their text-based
responses (Shah and Henry, 2005). The winning entry by Joseph Weintraub was
based on Weizenbaum’s Eliza (Wallace, 2010). The design of Weintraub’s system
was restricted to 'whimsical conversation' and it won the contest because it "simulated rhythms and corrections of human typing nearly perfectly" (Epstein, 2008: p. 9). It
convinced 50 per cent (five of the ten interrogators) that it was a human by deploying
“human typing foibles” and acting as a jester in a technique to deceive. Epstein
surmises “we allow great latitude when conversing with jesters; incomprehensible,
irrelevant responses are to be expected" (2008: p. 11). In contrast, humans presume knowing too much about a topic is unnatural. Hence, a female human confederate's specialist knowledge of Shakespeare's plays was mistaken for artificial; she had "often quoted lengthy passages verbatim", which, Epstein remarks, tells us much about
human preconceptions and biases.
3.3.2.1 Verdict on the 1st Loebner Prize Turing test
Weintraub’s system deceived 50 per cent of the judges during the first Loebner Prize,
but it did not impress. Epstein concedes “As we expected, the computers performed
relatively poorly for the most part” (2008: p. 8). He adds that “frequent typing errors
by the judges” may have caused unresponsiveness from systems (ibid). Shieber, who
was refereeing in the room where the hidden-humans were locked away, described
the first Loebner Prize as having “no clear purpose”, its design preventing it from any
“useful outcome” and that the contest was “inappropriate given the current level of
technology" (1994: p. 70). Experts in computer science "and related fields" acted as
referees observing the interrogators (Epstein, 2008, p. 7). Philosopher Ned Block,
who later acted as an interrogator in the 2005 Loebner Prize, refereed the judges’ area
in the 1991 contest (Shieber, 1994). The human confederates, judges and referees
were recruited by the contest “investigators”, members of the Loebner Prize
committee. Only the investigators knew exactly which terminal was controlled by
which hidden entity, either by a human or remotely by one of the six machine entries.
Shieber vouches for the confederates’ admirable performance but his report on the
techniques employed by the artificial systems is not similarly positive.
The Eliza-esque systems were identifiable from their “conspicuous characteristic ...
poor performance” emerging from the restricted topics “appropriate for a cocktail
party" (1994: p. 70). Unlike the non-expert interrogators, who between them confused
machines for human and human for machine, Shieber claims “it was widely
recognised that computer experts could readily distinguish the contestants from the
confederates” (1994). Colby et al. had reported “some people seem confident that they
can easily distinguish a program-respondent from a human-respondent by
conversational means” (1972: p. 203). Shah and Warwick show preconceptions and
assumptions lead to hidden entity misidentification (2010c). On reconciling why
judges in the first Loebner Prize were deceived, Shieber states “people are easily
fooled, and are especially easily fooled into reading structure into chaos, reading
meaning into nonsense” but he palliates this by saying “this is not a flaw in the human
mental capacity ... [it is] extremely important to our ability to perceive, learn, and
communicate” (ibid). From Colby et al.’s PARRY experiments, they found “if a
human-respondent does not follow standards of the interviewer’s expectations, jokes
around, or plays other games, ordinary judges cannot distinguish him from a computer
program with limited natural language understanding” (1972: p. 203).
A version of PARRY by Colby and Colby, entered into the contest, succeeded in deceiving twice: it was ranked a human two times (Shieber, 1994), and placed second to Weintraub's whimsical conversation (Epstein, 2008: p. 10). Of Weintraub's winning
system, Epstein claims “it simulated the rhythms and corrections on human typing
nearly perfectly” (2008: p. 9). In contrast, Shieber writes that “tricks of the Eliza trade
were ubiquitous” (1994), pointing to whimsical conversation as a style not a topic:
“essentially nonsense conversation ... a series of non sequiturs” (1994). Thus the
success of Weintraub's entry, deceiving half the panel of ten judges, is considered neither laudable nor the contribution to the "scientific understanding of complex human behaviour" that the first Turing test contest Organisers had hoped for. Loebner desired
to advance AI by serving up his contest to “measure the state of the art”, believing the
inadequacies of early systems would serve as an incentive to others (1994). This is a
naïve position belying the complexity of human language. The study of natural language processing has "led to the dearth of entrants from the artificial intelligence community – the realization that the time spent on winning the Loebner Prize is not time spent furthering the field", according to Shieber (1994).
Loebner dispensed with restricted question-answer tests, staging the first unrestricted viva voce imitation game in 1995. However, later commentators (Andrew, 2009; Levesque, 2009; Korukonda, 2003) wrongly assume, by referring to Shieber's 1994 critique, that the Loebner Prize remains a restricted-conversation contest. A glance at any of the contest's transcripts from 1995 and subsequent years shows annual Loebner Prizes implement unrestricted-topic interrogation, following
Shieber’s criticism (1994), and Loebner’s agreement that a “restricted test is not the
best way of conducting a Turing test” (1995). The change from restricted to
unrestricted topic imitation games in the Loebner Prizes is not the only
metamorphosis in this competition. Loebner has refashioned his contest a number of
times since its inception, including altering duration of interrogation (Shah and
Warwick, 2010a).
3.3.2.2 Entry Performance and Strategy
In the intervening years prior to the 13th Loebner Prize staged in 2003 the Loebner
Prize Committee had disbanded amid dissatisfaction and disagreement about rule
changes following the conclusion of the 3rd contest in 1993. Dennett declared in
February 1994 that he would resign from the Loebner Prize Committee unless the
competition was put into an “indefinite hiatus” to enable the Committee to learn of the
“existence of some serious competitors in the AI community” (in Loebner, 1995).
The machine entries disappointed; thus Dennett felt a change in rules would facilitate
“some salutary strain on current work in natural language processing” (ibid).
Regardless of clashes in goals between the Sponsor and the defunct Loebner Prize
committee the contest continued to run and weak systems with stronger Eliza-style
trickery continued to win. Hutchens (2008) revealed that the motive behind his 1996
winning system HeX, employing techniques to win by cheating, was an effort to show
that the Loebner Prize should not be considered an instantiation of Turing’s textual
test for a machine’s intellectual capacity. Hutchens’s strategy was aimed at reducing
the interrogator’s chance to unmask his machine by simulating its replies “being typed
in at a keyboard, including frequent mistakes and laborious backspacing to correct
them” (p. 328). In the 1996 contest’s allotted question-answer time of “15 minutes or
so” Hutchens offered the idea that the task entailed “reducing the problem to one of
simulating intelligence for a span of 12 questions” (ibid).
Hutchens (2008) suggests deploying a taciturn strategy. For example, a machine
should remain silent for ten minutes or so after informing the interrogator of the need
for a toilet break resuming with “okay, back again, now where were we?” (footnote,
p.328). Other tricks Hutchens advocates to win the Loebner Prize include
“Reconnoitre the venue”, recommending developers should incorporate “a few trivial
facts about the local environment" (ibid). Hutchens points to a trick used by the 1997 winner of the Loebner contest; with "hard-wired knowledge of current events" the entry distracted the interrogators by commandeering the interrogation, asking the Loebner judge if they had watched a news item on TV the night before (p. 328). The
1997 Loebner Prize winning system Converse directed the exchanges with each judge
by asking the same question: Did you see that story on CNN last night about the
lesbian couple who came out at a White House Party on Sunday? (Converse transcripts: http://loebner.net/Prizef/converse.txt). The ploy worked,
it drew judges towards its knowledge base. The question was met with the following
contest judge responses:
Judge 01: No, I don't get cable. Was it Ellen?
Judge 02: No. I'm sorry I missed that. How did the Prez take it?
Judge 03: No... Are you serious? Who?
Judge 04: Yes, I did. I think it may all be a publicity stunt for Ellen.
Judge 05: NO. I just came in yesterday. I'm still kind of jet-lagged.
Hutchens refers to this tactic as doing the judge’s job for them. He believes the
developers (David Levy, Intelligent Research) prevented exposure of Converse's weakness by "relieving the judge from their task" (2008: p. 328). Hutchens suggests to designers of artificial conversational systems that they "pander to judge's ego" by using judges' names during exchanges, which can "work to great effect" (ibid).
Hutchens proposes offending judges by deploying tit-for-tat arguments to relieve
boredom during interrogation, allowing judges to engage the entry “for fun, safe in the
knowledge that they can say things that they would never say to another human
being” (2008: p. 328). Demchenko and Veselov (2008), Russian developers of the
non-native English child imitating machine Eugene discussed above, take a different
view when dealing with contest judges; if the judges are talkative and mainly disclose
personal information, they suggest machines should periodically utter “Oh, tell me
more please” or similar sentences. They estimate the personalities of winning entries
in Loebner Prizes have either been aggressive, polite or vague (p. 455). The
aggressive model allows developers to can responses for tricky questions. Demchenko
and Veselov give examples of hostile machine replies: "Are you so stupid that [you]
don’t know it yourself?” or “I’ve had enough of your stupid riddles” (2008: p. 456).
They argue that building an artificial conversational system is analogous to writing a
novel, creating characters ‘from scratch’ from childhood with personal tastes, interests
and fears (p. 458). Their system Eugene Goostman has claimed the runner-up spot in three separate Loebner Prize contests (2008, 2005 and 2001).
Hutchens believes it is pointless to try anything innovative for the Loebner Prize, because "no entrant employing this strategy ever won" the contest. Not all developers are as cynical; Copple, joint runner-up in 2001 and winner of the 2002 Loebner Prize, feels the contest is a "useful exercise in the development of intelligent computer systems" (2008: p. 359). Copple maintains the experience emerging from competing
provides “valuable feedback” with “motivation to work on natural language for robot
development” helping to focus on machine performance in interactions with humans
(p. 359).
3.3.2.3 The Loebner Prize used to define the Turing test
In addition to quoting Shieber's out-of-date Loebner Prize critique (because it refers to the first contest and its restricted question-answer format), Halpern (2006) and Levesque (2009) imply that the Loebner Prize still holds topic-restricted versions of Turing's imitation game. Thus Turing's question-answer game has unfairly become synonymous with the Loebner Prize and is dismissed as a result of the high expectations and poor linguistic performance of machines in the contest, as noted from Levesque's claim that a "general consensus" exists that "the programs that do well in this competition do not tell us much about intelligence" (2009: p. 1440). Referring to
Shieber’s 1994 review of the first Loebner Prize, Levesque concludes from Shieber’s
analysis of machines two decades ago that “it is sometimes possible to simulate
linguistic behaviour that has been limited in some way ... by philosophical means”
(2009: p. 1440). Halpern's case against the Turing test, and against what he sees as the absurdity of trying to make computers think, relies solely on a review of the 1991 Loebner Prize transcripts to back up his claim. Halpern urges Turing test judges to "be familiar enough with the topic of
conversation" (2006), unaware that interrogators have no need to be familiar with any
specific topic because they are free to ask any question of their choosing (Shah and
Warwick, 2010a). Basing his argument on Turing tests held in 1991, and without
supplying evidence for his claim, Halpern asserts “it is clear from the skimpy record
available to us of these events that not much has changed, improved, or progressed", adding "as time goes on [the machine entries] fool fewer judges" (2006). This is not an accurate picture of the performance of artificial conversationalists. Additionally, since the contest has evolved over its 20-year history, including changing from one-to-one to simultaneous comparison tests, machine deception rates in tests conducted in the 21st century are not comparable to those conducted at the end of the 20th (Shah and
Warwick, 2010a). To clarify, at this point in time the 20th Loebner Prize contest has
concluded with one machine (Bruce Wilcox's Suzette) deceiving in 16 simultaneous comparison tests (Loebner Prize Results, 2010), each held for 25 minutes (Shah and Warwick, 2010a). Apart from the 2002 Loebner Prize Turing tests (of which the contest organisers failed to keep records), there are nineteen years of transcripts, four on restricted-topic and
fifteen on unrestricted Turing tests in the Loebner Prize available from the contest’s
Internet site (Contest Information and Transcripts: http://www.loebner.net/Prizef/loebner-prize.html).
Korukonda (2003) quotes Kurzweil's (1992) statement that computers have passed
narrow versions of the Turing test. However, Kurzweil does not claim it is in the
Loebner Prize that this has occurred. While some systems have deceived interrogators
during Loebner Prize history, no machine has achieved Turing’s 30 per cent sustained
deception criterion in its restricted or unrestricted contests. Korukonda (2003) and
Andrew (2009) imply Loebner’s grand prize of $100,000 is an award for a system that
passes an unrestricted text-based question-answer test, but this is not the case.
Loebner has decreed it is exclusively for a system that passes an audio-visual test:
“with associated hardware that can respond intelligently to audio visual input in the
manner a human would” (Loebner, 1994).
3.3.3 Observing the 13th Loebner Prize
2003 was the last Loebner Prize organised by the Cambridge Center for Behavioral Studies. Staged at the University of Surrey, the 13th contest marked a break in the relationship between the Sponsor and Contest Organiser. Following a preliminary round of judging (13th Loebner Prize schedule: http://loebner03.hamill.co.uk/schedule.html, accessed 3.5.10, 10.08), eight machines were selected from over twenty entries to compete for Loebner's bronze award and two thousand dollars for 'most human-like' machine. In between the machine entries, and hidden from view, sat two confederates: a hidden human female (confederate 2) and a male (confederate 1). All ten hidden entities were questioned by nine judges for about fifteen minutes each. The researcher was present to observe the 13th Loebner Prize contest held in October 2003. Female and male judges included one teenager. The contest organisers did not record demographic details (age, expertise level) for any of the human participants, nor did they record which interrogator was which, identifying them only as judges J1 through J9 in the contest (though they were able to say which confederate was female and which male, according to a private email from the 2003 Loebner Prize Organiser, Lynn Hamill).
3.3.3.1 Score format and rankings
Each interrogator judge was given fifteen minutes to chat to a hidden entity via a
computer terminal, and after that time they moved to a different terminal until they
had chatted to each of the ten hidden entities.
Table 4: 2003 Loebner Prize score system
The judges were asked 'Was your conversational partner a human or a machine?', responding with a score for each hidden entity on a scale from 0 to 5. They were allowed to award 'in between' scores, for example 1.5. A score of zero was to be awarded when the entity was inaccessible through technical malfunction. A score of 1 was to be awarded when the entity was believed to be "definitely a machine", and a score of 5 for "definitely a human" (see table 4). A score of 2 was awarded if it was felt an entity was "probably a machine", a score of 3 if the judge was unsure ("could be human or machine"), and a score of 4 for "probably a human" (see table 4; tables reproduced with permission from the Cambridge Center for Behavioral Studies). Judge J9 ranked nine of the ten hidden entities as definitely a machine, awarding their highest score of 2 points, probably a machine, to the machine entry (Markbot) that was eventually placed eighth in the competition (see table 5). If we exclude judge J9's scores we find that both the hidden humans were considered a machine by at least one judge: the female confederate 2 was ranked a machine by judge J7, and the male confederate 1 was ranked definitely a machine by judge J4, and probably a machine by judge J2 (see table 5).
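To make the ranking arithmetic concrete, the sketch below shows how per-entity mean scores and an ordering of the kind summarised in table 5 can be derived from judges' verdicts on the 0-5 scale; the per-judge values in the snippet are illustrative placeholders, not the recorded 2003 scores.

# Illustrative sketch (Python) of the 2003 Loebner Prize scoring arithmetic.
# Each hidden entity receives a verdict from each judge on the 0-5 scale
# (1 = definitely a machine, 5 = definitely a human, in-between scores allowed).
# The per-judge values below are hypothetical, not the recorded 2003 verdicts.
from statistics import mean

verdicts = {
    "confederate 1 (hidden human)": [5, 2, 4.5, 1, 5, 4, 3, 4, 4.5],
    "Jabberwock (machine)":         [1, 2, 1.5, 4, 1, 2, 3, 2.9, 1],
    "Markbot (machine)":            [1, 1, 1, 1, 1, 1, 1, 1, 2],
}

# Mean score per entity: the higher the mean, the more 'human' the judges
# collectively found that entity.
means = {entity: mean(scores) for entity, scores in verdicts.items()}

# Rank entities from most to least human-seeming, as table 5 does.
ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
for position, (entity, m) in enumerate(ranked, start=1):
    print(f"{position}. {entity}: mean score {m:.3f}")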
Table 5: 2003 Loebner Prize Results
The winning machine, Jabberwock (from Juergen Pirner), came third in the contest behind both humans but ahead of the other seven machines. It was considered probably a human, with a score of 4, by one judge, J4, while another judge, J7, was not able to discern clearly one way or the other when conversing with this machine, awarding it a score of 3: could be a machine or a human. Judge J8 scored Jabberwock with 2.90 points. This raised its mean score to 1.928, above that of the other machine entries but almost half that of the male hidden human, confederate 1, whose mean score was 3.667
(see table 5). Judges J4 and J7 emerged as the two interrogators who appeared the
most confused after interrogating the hidden entities for around fifteen minutes each.
Judge J4 ranked the hidden male confederate as definitely a machine, which is an instance of the confederate effect (Shah and Henry, 2005), while ranking Jabberwock as probably a human (see table 5). The two J4 interrogations, with the hidden male and with Jabberwock, can be seen in tables 6 and 7 below. In conversation with the male hidden human, confederate 1, the judge allows himself to be interrogated rather than conducting the questioning: "Ah. Ray, may I ask you: are you a computer?" and "This is getting a bit heavy. How about if I ask you how you got here today?" (see table 6).
Table 6: Transcript - male confederate 1 and Judge J4 – human scored “definitely a machine”
2003-10-18-14-11-19 CONFEDERATE: hello.
2003-10-18-14-11-51 JUDGE: Hello, I am Ray , who are you?
2003-10-18-14-12-08 CONFEDERATE: Let's just say I'm J.
2003-10-18-14-12-47 JUDGE: Well J how did you get into this?
2003-10-18-14-13-05 CONFEDERATE: I was asked. And you?
2003-10-18-14-13-32 JUDGE: I was volunteered.
2003-10-18-14-13-56 CONFEDERATE: Ah. Ray, may I ask you: are you a computer?
2003-10-18-14-14-39 JUDGE: You can ask and in some seses the answer is yes.
2003-10-18-14-15-10 CONFEDERATE: Very philosophical. Personally, I'm not sure.
2003-10-18-14-16-17 JUDGE: I am not sure either but if asked what would you say
2003-10-18-14-16-51 CONFEDERATE: This is getting a bit heavy. How about if I ask you how you
got here today?
2003-10-18-14-17-08 JUDGE: By car
2003-10-18-14-17-33 CONFEDERATE: Idealogically unsound person! What's wrong with the train?
2003-10-18-14-18-15 JUDGE: It does not run directly.
2003-10-18-14-18-41 CONFEDERATE: So, what's wrong with getting your bike out?
2003-10-18-14-18-51 JUDGE: Too far.
2003-10-18-14-19-20 CONFEDERATE: There are some lovely buses going your way.
2003-10-18-14-19-53 JUDGE: What do you know about the locality.
2003-10-18-14-20-14 CONFEDERATE: Quite a bit. I lived here for 17 years.
2003-10-18-14-21-10 JUDGE: So which bus would you get from heatherside to surrey university
2003-10-18-14-21-41 CONFEDERATE: Any old bus. Which bus would you get?
2003-10-18-14-22-26 JUDGE: I would not and anyway certainly not an old one.
2003-10-18-14-23-02 CONFEDERATE: Goodoh. Let us instead talk about something other than
buses. Do you like gardening?
2003-10-18-14-23-17 JUDGE: No.
2003-10-18-14-23-39 CONFEDERATE: You should employ a gardener then. I do.
In conversation with Jabberwock, judge J4 allowed the machine to raise the first
question: “Hello, anybody there? What's your name?” (table 7).
Table 7: Transcript of Jabberwock with Judge J4 – machine scored “probably a human”
2003-10-18-14-28-32 JUDGE: a new guest is in
2003-10-18-14-28-32 JABBERWOCK: Hello, anybody there? What's your name?
2003-10-18-14-29-16 JUDGE: I'm Ray who are you?
2003-10-18-14-29-16 JABBERWOCK: Be welcome, Ray. You know who I am. Why the charade?
2003-10-18-14-30-17 JUDGE: I do not know who you are- what is your name?
2003-10-18-14-30-17 JABBERWOCK: My name is Gren
2003-10-18-14-31-34 JUDGE: What is that short for ?-or do you mean Glen?
2003-10-18-14-31-34 JABBERWOCK: Yes, that's what I mean, isn't it clear?
2003-10-18-14-32-27 JUDGE: It is now - never mind we all make mistakes.
2003-10-18-14-32-27 JABBERWOCK: OK, let's discuss something else. If you wish, of course
2003-10-18-14-33-08 JUDGE: How did you get here today?
2003-10-18-14-33-08 JABBERWOCK: I heard about by another fellow who works at the university,
and I was curious whether this robot freaks could beat us humans or not
2003-10-18-14-34-03 JUDGE: So what do you think so far?
2003-10-18-14-34-03 JABBERWOCK: I try not to think about things, actually. I don't know, it just
seems that I don't exist for thinking!
2003-10-18-14-35-24 JUDGE: So what are you really inerrested in ?
2003-10-18-14-35-24 JABBERWOCK: Who already is really any-what?
2003-10-18-14-36-37 JUDGE: Try thinking a bit more please- you must be inerrested in something!
2003-10-18-14-36-37 JABBERWOCK: Thanks for asking politely. How did you hit on that? Anyway:
Well, that is your opinion, I suppose! Anyhow, I have bigger fish to fry.
2003-10-18-14-37-22 JUDGE: What is the weather where you are?
2003-10-18-14-37-22 JABBERWOCK: There is no weather in cyberspace
2003-10-18-14-38-47 JUDGE: So you must be a machine are you sure that your name is not Marvin
2003-10-18-14-38-47 JABBERWOCK: What do you think about machines, Ray?
2003-10-18-14-39-52 JUDGE: At the moment not a lot but then like you it is best noy to think on a
sunny afternoon.
2003-10-18-14-39-52 JABBERWOCK: Quite a bit, yes
2003-10-18-14-40-49 JUDGE: Are at last I thnk we have some common ground.
2003-10-18-14-40-49 JABBERWOCK: Sure thing! Do you mean you and me?
2003-10-18-14-41-13 JUDGE: Yes
2003-10-18-14-41-13 JABBERWOCK: Okay, I understand
2003-10-18-14-42-18 JUDGE: Good- I must say good-bye!
2003-10-18-14-42-18 JABBERWOCK: Ok, Ray, just connect in if you need me!
2003-10-18-14-43-39 JUDGE:
2003-10-18-14-43-39 JABBERWOCK: You want me to talk while you just sit there, huh? Well it
won't work
The judge discloses their name while seeking the identity of the hidden interlocutor with "I'm Ray who are you?" (table 7). Jabberwock does not reveal its name in its assertive response "You know who I am. Why the charade?" (table 7). The judge appears taken aback by Jabberwock's next response "Yes, that's what I mean, isn't it clear?" (table 7). The judge restores politeness in the conversation with the utterance "It is now - never mind we all make mistakes" (table 7). The machine gives away its actual identity when asked about the weather, responding "There is no weather in cyberspace", but this answer seems not to have registered with the judge, who awarded the system 4 points for "probably a human" (see tables 6 and 7).
3.3.4 Post 13th Loebner Prize
After the 2003 Loebner Prize Turing tests, Shah and Henry (2005) extracted
transcripts involving judges J4 and J7. These were the viva voce tests in which J4 and
J7 were unable to correctly identify the machine Jabberwock or recognise the hidden
humans. Transcript analysts reviewed 2003 Loebner Prize interactions including:
a) Judge 4 – Jabberwock – gained score: "probably a human"
b) Judge 4 – Confederate 1 (male) – gained score: "definitely a machine"
c) Judge 7 – Confederate 2 (female) – gained score: "definitely a machine"
Male and female transcript analysts, including adults aged between 18 and 35 and children aged between 8 and 12, read the 2003 Loebner Prize conversations; the exercise was conducted to find whether they agreed with the Loebner contest judges. The younger age group was chosen because it fell into Jean Piaget's concrete and formal operational stages of child development; the theory holds that typically developing children in this age group can think logically about propositions and plan ahead (Atherton, 2009). The participating children were those who attended paid extra English and arithmetic classes at their school (Copland Community School in Wembley, Middlesex, HA9 7DX) on Saturday mornings. The other participants were undergraduate students attending tutorials at university (Harrow School of Computer Science, Westminster University). Shah and Henry
(2005) present a preliminary report, while Shah and Warwick (in preparation) show
that post-contest some transcript analysts agreed with 2003 Loebner judges,
misidentifying the true nature of the hidden interlocutor. From 2004 the Cambridge Center no longer served the Loebner Prize, and that year's contest moved from viva voce testing of machines to paired comparison. Loebner (2010b) asserts his paired comparison is "actually Turing's design" (p. 50) for an imitation game, but he overlooks Turing's description of the alternative one-to-one test in 1950, which Turing expanded upon during the 1952 radio discussion (see chapter 1).
Paired comparison continued in subsequent Loebner Prizes; however, the number of machine entries fell to three in 2007 and 2009 (from four each in the Loebner Prizes of 2004-2006). Loebner contests now involve four interrogator judges, while the number of tests conducted has been reduced from 90 in the 2003 Surrey University contest to 16 in 2004-2006 and 2010, and 12 in 2007 and 2009 (Shah and Warwick, 2009).
------
With chapter 3 concluded, Part I of this thesis is complete. In Part II, beginning with chapter 4, a unique investigation is presented deploying the 18th Loebner Prize in a three-part experiment to examine deception-detection through practical one-to-one and simultaneous comparison question-answer tests (chapter 5), and using Colby et al.'s (1972) transcript analysis paradigm (chapter 6).
Part II
Chapter 4 Experiment 1: One-to-one Testing
4.1 Introduction
Part I of this thesis concluded with a survey of early practical Turing tests in the
Loebner Prize for Artificial Intelligence up to the 13th at the University of Surrey in
2003. Weizenbaum's Eliza study (1966) had already shown it was possible to stage one-to-one question-answer sessions between a human and a machine. The first 13
Loebner Prize contests demonstrated that staging viva voce indistinguishability tests,
in which an interrogator had to determine whether or not a hidden interlocutor was a
machine, was possible. Hidden humans (confederates) had been used as controls, and
statistics had been reported (Epstein, 2008). Therefore, it was already established that
the viva voce method was a scientifically valid experiment to assess a machine’s
linguistic performance compared to a human’s. From 2004 Loebner changed the
competition format to hold machine-human comparison tests in which both hidden
interlocutors were questioned in parallel. This included the 16th contest at University College London in 2006, in which John Barnden and Kevin Warwick acted as
interrogators. Observing the 13th and co-organising the 16th Loebner Prize provided
the basis for the experimental phase of this study. Part II of this thesis presents
findings from the 18th Loebner Prize deployed in three phases: a) over-the-Internet testing of machine entries for conversational ability, with the better ones progressing to the next stage, b) simultaneous comparison of machines with humans, and c) adapting Colby et al.'s 1972 'interview and protocol judge' paradigm to review transcripts from b) wherein deception-detection failed. This chapter presents the first of the three experiments, practicalising one-to-one testing, while chapter 5 presents experiment 2, simultaneous comparison tests (including control pairs), and chapter 6 presents a third experiment with transcript analyses of conversations in which a machine was mistaken for a human in the second experiment.
With the Sponsor Hugh Loebner's permission, his 2008, 18th consecutive artificial intelligence competition was secured to examine what progress machines have made towards thinking and talking like a human, and to assess deception-detection. In experiment 1
Turing’s viva voce, one-to-one interrogator-witness test was adapted for the
preliminary phase of the 18th Loebner Prize (LPC) in order to extract the strongest
textual question-answer systems.
4.2 Aim
Within the framework of a science contest, the aim of this experiment was to use the one-to-one question-answer method to find the best text-based artificial conversation systems to compete in a second experiment, in which simultaneous comparison testing would be conducted (see next chapter). This was done by recruiting a greater number of interrogator judges (IJs) than in the previous four LPCs (4 judges each in 2004-2007) and by encouraging more developers to submit their entries for evaluation than in the previous four Loebner Prizes (3 entries in 2007, 4 each in 2004, 2005 and 2006). The objective was to reflect Turing's average interrogator by recruiting IJs across a broad range of expertise levels and age groups, including those whose first language is not English. An initial one-to-one assessment of each entry was made by the researcher, and in consideration of the 'quality of discourse', the decision was made not to ask IJs whether a human or a machine was being tested. All IJs were informed they would be interacting with machines, and were asked to conduct their tests in English; their task was to score the machines for conversational ability from 0 to 100. No hidden human participated as a foil in this experiment.
4.2.1 Hypothesis
Part of this experiment was to find whether modern textual systems use the same techniques used by Weizenbaum in his 1960s investigation into natural language understanding. Weizenbaum had developed Eliza as a 200-utterance key-word spotting model. Eliza's technique mainly involved responding to an input statement with an output question. It was beyond its capability to reveal any personal information or disclose any interests. Eliza did not express emotions. Are current artificial conversationalists (ACs), like Eliza, unable to give sustained and satisfactory answers to the users questioning them? By adapting Turing's one-to-one test to statistically examine machine linguistic productivity (l.ip), it was predicted that the mean score out of 100 for conversational ability achieved by all systems combined would be less than 50.
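For concreteness, the fragment below is a minimal, hypothetical sketch of the Eliza-style keyword-spotting technique referred to above, in which an input statement is met with an output question; it is not Weizenbaum's code, and its three rules and default reply are invented for illustration.

import re

# Minimal, hypothetical Eliza-style keyword-spotting rules: each entry maps a
# keyword pattern to a template that turns the user's statement into a question.
# The rules are invented for illustration; Weizenbaum's Eliza used a far larger
# script of roughly 200 utterance patterns.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."

def respond(utterance: str) -> str:
    """Return an output question for an input statement by spotting keywords."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY  # no keyword found: fall back to a canned prompt

if __name__ == "__main__":
    print(respond("I am worried about this contest"))
    print(respond("my brother bought a new car"))
    print(respond("The weather is fine today"))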
4.3 Method
Through announcements on Internet notice boards where developers of artificial conversationalists congregate to discuss technology and dialogue technique, 13 entries were submitted to the contest. To find whether any of the 13 machines would be considered linguistically competent and could possibly meet Turing's deception criteria in the next phase of the 18th LPC, the method for this experiment entailed IJs interrogating ACs and returning qualitative and quantitative assessments in the form of scores out of 100 for conversational ability and comments on machine linguistic productivity (l.ip). The assessments were returned as completed questionnaires including IJ demographic data (age, expertise level and first language).
Referring to Turing’s “considerable portion of a jury” (1952: p. 495), one hundred
and forty eight IJs were recruited for experiment 1. One hundred and sixteen of these
IJs returned completed questionnaires (32 who had agreed to participate did not return
any scores or comments). IJs returned valuable comments relating to the current
linguistic capability of the entries. Of the 116 IJs who returned questionnaires, 59
were female and 57 male. Seventy four of the IJs had English as their first language,
and sixty seven had heard of the Turing test prior to this experiment. The IJs included
children and adults, school pupils, university students, researchers, computer
programmers and educated laypeople. Many nationalities were represented in the IJ
population (American, Australian, Austrian, Belgian, British, Chilean, French,
German, Greek, Indian, Pakistani, Polish), with some IJs based outside the UK evaluating the systems over the Internet. One British female IJ in this experiment had
Asperger’s syndrome.
Initially, the researcher engaged all thirteen AC entries in textual dialogue prior to
the start of the experiment to assess their capability. The answers given by most of the
entries led to the decision to make all IJs in this experiment aware that they would be
chatting to artificial contestants; IJs’ task entailed engaging the systems in dialogue,
assessing responses and reporting back on conversational ability and emotion content,
providing scores and comments. (Based on combined IJ assessments, the 'best systems' would be invited to proceed to the next phase, in the second experiment, competing in simultaneous comparison Turing tests involving a different set of IJs; see chapter 5.)
4.4 Procedure
This experiment was undertaken as a project involving building a team of technicians
to assist in implementing the next phase of the contest in experiment 2 (see chapter 5).
Along with the author’s research supervisor, Professor Kevin Warwick (acting as 18th
LPC Organiser), two University of Reading staff volunteered as technical members
(Dr. Ian Bland, Christopher Chapman), with the final member of the team, a quants programmer with Bank of America Merrill Lynch (Mark Allen), recruited to create a message-by-message technical communications protocol, MATT, used for the simultaneous tests in experiment 2 (see chapter 5). The Turing test team for the 18th
LPC was as follows:
Hugh Loebner – Loebner Prize Sponsor
Kevin Warwick – Organiser, 2008 Loebner Prize
Huma Shah – Contest designer and coordinator, 2008 Loebner Prize
Ian Bland - Technical Director, 2008 Loebner Prize Finals
Chris Chapman – Technical Manager, 2008 Loebner Prize Finals
Marc Allen – MATT developer
The next step entailed drawing up Rules for the one-to-one tests to attract entrants for this experiment (the preliminary phase of the 18th LPC). Much discussion went into the writing of the Rules before the final version was announced, in addition to discussion of who should act as judges. Carpenter, two-time Loebner Prize bronze award winner for his Jabberwacky system, pointed to the dearth of judges in previous Loebner contests in a post on the developers' Internet discussion forum, Robitron: "… Neither is [it enough that] there being only 4 judges for a test that is so subjective" (Robitron post #9529, 28 October, 2007). Wallace, three-time bronze award winning developer of the Alice system (2000, 2001, 2004), wrote "Diversity of judges would help too. Not just all academics, but a mix of students, young and old, rich and poor, and also a professor or two". Wallace felt "The judges should include people who spend a lot of time online chatting with other humans" (Robitron post #9526, 28 October, 2007).
4.4.1 Rules of the preliminary phase
Developers' opinions were initially sought before drafting the Rules for the 18th LPC. Discussions took place on the Robitron forum (Robitron technical group: http://tech.groups.yahoo.com/group/Robitron/, accessed 26.5.10, 15.33) to discern any concerns that might deter developers from submitting entries. Two-time Loebner Prize bronze award winner Robby Garner (1998, 1999) suggested "Keep real-time typing (fake typing). Fake typing is a tool for the programmer to exploit. It is only a problem if you can't make your program do it." (Robitron post #9510, 26 October, 2007). Disagreeing with Garner's view, another developer believed what matters most is what the system says rather than 'fake' typing: "mimic what human talk not how they talk" (Robitron post #9580, 26 October, 2007), and that systems should follow basic communication rules and manners, "especially showing genuine intention to build mutual understanding" (post #9600, ibid).
While this experiment involved one IJ chatting to one system at a time, it was necessary that the developers were made aware that this was the preliminary phase of the 18th LPC, and that successful systems would progress to a simultaneous-comparison phase, with one system awarded the Loebner bronze medal for 'most human-like' machine. The Rules (version 3, March 2008) were disseminated widely and placed on an Internet page on the University of Reading's Cybernetics Intelligence Research Group site (Rules, 2008 Loebner Prize: http://www.reading.ac.uk/cirg/loebner/cirg-loebner-main.aspx, accessed 26.5.10, 12.49) (see box 2).
Box 2: Entry Requirements for Preliminary Phase, 2008 Loebner Prize
3. Who can compete?
This will be an 'OPEN competition' - Turing Test to see which artificial conversational entity (ACE) can pass, AS CLOSE TO TURING's vision of the imitation game as he wrote of in his 1950 paper, given 5 MINUTES of questioning by each interrogator.
3.1 The competition is open to any individual or group, but no individual or group may be associated with more than one entry.
3.2 Entrants must affirm that they have intellectual rights to their entry and that it and its components comply with all artistic licenses.
3.3 Entrants younger than 18 years of age must accompany their submission with a written statement of permission by at least one parent or guardian.
3.4 ACE may pretend to be of either gender.
The main points of the Rules allowed Internet-hosted entries. Submissions by disk were also permitted (see box 3).
Box 3: Method of entry
4. MAILING ADDRESS:
Web-based ACE are invited to enter the contest. ONE URL ONLY will be accepted for each system, which must be emailed to Kevin Warwick or Huma Shah by the deadline ...
If an ACE is not available on-line then a working hard copy of the programme, on CDs, DVDs, portable hard disk or USB flash via shipment medium (a) requiring a receipt signature and (b) having a date/time stamp showing posting by 23:59 of Friday 23 May 2008.
4.4.2 Participants
Two types of participants were needed for this experiment as part of the 18th LPC:
natural and artificial, human and machine. Ethical approval was granted by the University of Reading for the inclusion of human participants as IJs in the experiment; the subjects of the experiment were the artificial conversational entries in the contest.
4.4.2.1 Artificial
Advertising the contest on various message boards across the Internet, including natural language processing comp.ai forums and Robitron, attracted 13 entries: 11 systems were accessible across the Internet and two were submitted on disk. In order of receipt the 13 systems were:
Zeta Bot http://www.woomerang.com/zetabeta/
Elbot www.elbot.com
Eugene Goostman http://www.princetonai.com/bot/
Orion http://wackyrobot.com/loebner/src/loebnerentry.php
LQ http://lq3.hopto.org
Brother Jerome http://www.personalityforge.com/directchat.php?BotID=29681&MID=29674
Chip Vivant http://www.chipvivant.com/
Jabberwacky http://test.turings.com:8084/
Alice http://pandorabots.com/pandora/talkbot?subid=f089f9234e345f50
Botooie http://www.botooie.com
Amanda http://loebner.virsona.com
Ultra Hal Submitted by disk
Trane Submitted by disk
4.4.2.2 Human
To recruit volunteer IJs, invitations were sent to schools, universities, academics at conferences and lay people with whom the researcher had contact. Other judges were recruited through Google's (formerly Usenet) comp.ai.nat-lang discussion board (http://groups.google.com/group/comp.ai.nat-lang/topics?lnk, accessed 26.5.10, 15.37). IJs were informed their verdicts on the systems would be collated and the 'best systems' would be selected for the next phase of the 18th LPC. In all, 148 volunteers agreed to participate as interrogator judges in the preliminary phase. Tests were conducted in English; however, IJs whose first language was not English were allowed to participate. IJs were based in countries including the UK, Europe, the US, Australia, Canada and the Indian Sub-continent, and they included males and females ranging in age from 8 to 64, from enthusiasts with no knowledge of Turing tests to experts, including computer programmers and philosophers. IJs included home and international undergraduate and post-graduate engineering students at the University of Reading and examination invigilators (including students of journalism) from City University, London, UK.
Following Heiser et al.’s (1979) instructions to psychiatrist judges in their experiment
with PARRY and a paranoid patient (see chapter 3), that “regardless of how certain
they felt that they were interviewing a computer program … they should avoid
cavalier, sadistic and blatantly disrespectful or insensitive comments and questions”
(p. 152), the researcher requested all IJs in this experiment to refrain from using
abusive language during questioning. Judges in this experiment were instructed to end
the conversation with a particular system if they became frustrated with a system’s
inability to follow textual dialogue or if it failed to answer questions.
4.5 Assessment Criteria
Interrogator judges (IJs) assessing the entries were informed they could hold as many interactions as they felt necessary, and could afford time-wise, ensuring they devoted at least five minutes to each system. A feedback questionnaire reporting IJ verdicts on each entry's performance included a score from 0 to 100 for conversational ability: 0 if the entry was a poor conversationalist, and 100 for a good conversationalist. They were also asked to comment on any emotive capacity they found in an entry (see Appendix B1). In addition, IJs in this experiment were given the option to provide the following information on systems' specific abilities:
a) remembering the IJ's name, if the IJ gave it during conversation
b) whether the systems disclosed personal information or revealed any interests
c) whether the systems expressed any, and appropriate, emotions
The scores and comments were used to select the best ACs as deemed collectively by the IJs. The best systems were those that might offer resistance to being ranked a machine by a different set of judges in the next phase of the contest, experiment 2, using Turing's 3-participant game (see chapter 5). The feedback questionnaire also collected pieces of information about each IJ, for example whether they had heard of the 'Turing test', whether English was their first language, their age and their gender (see Appendix B1).
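Purely as a structural illustration, the sketch below shows one way a returned questionnaire could be represented for collation; the field names are assumptions made for the purpose of the example and do not reproduce the actual questionnaire in Appendix B1.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class JudgeFeedback:
    """One IJ's returned questionnaire for one entry (illustrative fields only)."""
    judge_id: str                             # e.g. "J48"
    age: Optional[int]                        # None where not disclosed
    gender: Optional[str]
    english_first_language: Optional[bool]
    heard_of_turing_test: Optional[bool]
    system_name: str                          # the AC interrogated, e.g. "Alice"
    score: float                              # conversational ability, 0 (poor) to 100 (good)
    remembered_name: Optional[bool] = None    # optional observations a) to c)
    disclosed_interests: Optional[bool] = None
    expressed_emotion: Optional[bool] = None
    comments: str = ""

def mean_score(records: List[JudgeFeedback]) -> float:
    """Mean conversational-ability score across a set of returned questionnaires."""
    return sum(record.score for record in records) / len(records)

# Example using two of the Alice returns reported in table 8.
returns = [
    JudgeFeedback("J48", 22, "M", True, True, "Alice", 60, remembered_name=True),
    JudgeFeedback("J103", 15, "F", None, False, "Alice", 2),
]
print(f"mean score: {mean_score(returns):.1f}")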
4.5.1 Results
Though it was recognised there was a danger of a human sitting behind a system rather than an artificial agent, the Rules for experiment 1 allowed Internet-based entries. To counteract attempts at fraud, testing by IJs was open 24 hours a day, seven days a week, over two months, June and July 2008. Additionally, this enabled IJs located in different time zones to participate in questioning entries and submitting feedback. It also increased the exposure of any human pretending to be a machine in an attempt to progress to the next stage of the LPC. The 13 entries included new ACs that had not been entered in any Turing test contest prior to this experiment. Eleven were located on the web, allowing interrogation to take place at their IP address: Alice, Amanda, Botooie, Brother Jerome, Chip Vivant, Elbot, Eugene Goostman, Jabberwacky, Orion, LQ, and Zeta. Two developers sent their programmes on disc to the University: Ultra Hal and Trane (the latter required extra programmes to be downloaded off the Internet in order for it to work). The next sub-sections detail the performance of each of the 13 entries from IJ assessments and scores (all comments are exactly as written or typed in the judges' questionnaires, including any and all typos and grammatical errors).
4.5.1.1 Alice
Wallace's Alice is a three-time LPC bronze award winner, in 2000, 2001 and 2004. The last win followed the disqualification of the highest-scoring machine, John Precedo's entry, owing to transmission failure in the 2004 contest (2004 Loebner Prize Results and Transcripts: http://loebner.net/Prizef/2004_Contest/loebner-prize-2004.html, accessed 27.5.10, 12.08). For more information on Alice's underlying technology and system performance prior to this experiment, see Wallace (2008) and Shah (2006). Twenty-four IJs (or two juries of interrogator judges, based on 12 people per jury) returned completed questionnaires after interrogating Alice (see table 8).
The scores for Alice's conversational ability ranged from 2 to 95 points out of 100 (table 8). The mode range was 45-78, with a median score of 60 and a mean score of 48.21, with a standard deviation of 26.81. Two thirds of the IJs who interrogated Alice gave it a score of 50 or over (table 9). Alice's female avatar was able to verbalise its responses. Of the 24 IJs who tested Alice, 19 had English as their first language; 68% of these awarded Alice a score of 50 or over. Of the three non-native IJs who interrogated Alice, two were male, one aged 19 (J45) and the other aged 35 (J24); both gave Alice scores of over 50: 70 and 95 respectively.
Table 8: Alice scores

Judge ID  Age  Gender  English 1st Language  Knowledge of TT  Score for conversational ability (0-100)
J9        44   M       Yes                   Yes              12
J24       35   M       No                    Yes              95
J38       24   M       Yes                   Yes              70
J45       19   M       No                    Yes              70
J48       22   M       Yes                   Yes              60
J50       28   M       Yes                   Yes              60
J55       19   M       Yes                   Yes              20
J63       22   M       Yes                   Yes              65
J73       19   M       Yes                   Yes              60
J76       21   M       Yes                   Yes              60
J77       41   M       Yes                   Yes              78
J79       40   F       Yes                   Yes              45
J87       19   M       Yes                   Yes              65
J90       26   M       Yes                   Yes              50
J93       33   F       No                    Yes              10
J95       62   M       Yes                   Yes              70
J96       50   M       Yes                   No               55
J101      15   F       Yes                   No               5
J103      15   F       Not given             No               2
J108      14   F       Yes                   No               5
J116      15   F       Not given             No               60
J117      24   M       Yes                   No               70
J129      13   F       Yes                   No               20
J138      13   F       Yes                   No               50
The statistics show a trend for the male IJs to award Alice a score of 50 or over out of a possible 100: over 80% of male judges awarded Alice a score of 50 or over, compared with 25% of female judges (see table 9).
Table 9: Alice score analysis

Mode range: 45-78
Mean: 48.21
Standard deviation: 26.81
Median (middle score): 60
Minimum: 2
Maximum: 95
% of scores 50 and over: 66.67%
Number of female judges returning scores for Alice: 8
Number of male judges returning scores for Alice: 16
% of scores of 50 and over awarded by female judges: 25%
% of scores of 50 and over awarded by male judges: 87.5%
Highest score awarded to Alice by a female judge: 60
Highest score awarded to Alice by a male judge: 95
Lowest score awarded to Alice by a female judge: 2
Lowest score awarded to Alice by a male judge: 12
% of scores of 50 and over awarded by non-native English speakers: 66.67%
% of scores of 50 and over awarded by native English speakers: 68%
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest: 50%
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest: 81.25%
The range of scores awarded to Alice by the eight female IJs, aged between 13 and 40, was 58, whereas for the sixteen male IJs, aged between 19 and 62, it was 83. Of the 16 IJs who had heard of the Turing test prior to the contest, 13, or 81.25%, scored Alice with 50 or over. Alice's web interface includes a speech-enabled, animated system that verbally communicates its answers through a female avatar with blue eyes, long flowing hair and accompanying head movements (see screenshot 1); this could be one reason why the system scored better among the male judges.
Screenshot 1: Alice
The returned qualitative data included the following comments relating to Alice’s
conversational ability:
Judges who scored Alice <45
J103 (score: 2): “The system thought my name was Alice but her name was Alice.
Also, she denied that she was a girl and a boy. No it did not reveal personal
information or its interests. I thought the system was really dumb. It kept on adding
other conversations which I did not submit. Overall, this was not good.” (see
Appendix B2)
J108 (score 8): “a) I didn’t tell it my name. Yes it did [reveal personal details] but
when I asked questions about them, it didn’t describe them well; there was a mixture
of relevant and irrelevant answers. It answered a lot of my questions with a question,
so therefore not answering it.
b)the system was extremely annoying and rude
Alice: I will tell my instructor that, not
Me: you rude
Alice: perhaps so, but at least I know the difference between you’re and your
I didn’t appreciate these comments”
J55 (score 20): “The system used a pre-programmed name until prompted to change.
It did reveal information and details but they were not consistent. Responses were
varied but lacked sequence in conversations. Memory low. Emotion was expressed
and also politeness. Systems responds by audio and visual and often mispronounces
key words and phrases. System did not use relaxed language like Don’t or I’m.
Topics were hard to follow and maintain.”
Judges who scored Alice > 50/100
J48 (score: 60): “The system could remember my name but had difficulties on full
names. It did reveal personal objects about itself such as name of creator and home
etc. 80% of responses were relevant. The system seemed to stumble on expressing
emotion and the same thing made it both happy and sad. When asked what would
make it happy, it replied, one million dollars, and when asked what would make it
sad it also replied one million dollars.”
J38 (score: 70): generally smooth conversation – no apparent short-term memory. A)
could remember my name; did not reveal personal information/interests; responses
generally relevant, but not grammatically correct. B) no expressed emotion apparent
Other: cannot refer to an expression, fact or question referred to more than the
immediately preceding statement
Judge awarding highest score, 95/100 to Alice
J30: I enjoyed the conversation with ALICE. She could remember my name while
she was aware about countries and capitals. Indeed, she was aware about my false
statement Paris is the capital of Spain! She also exhibit interest about my personal
information. Relevancy was nearly perfect. Yes, [re emotion expression] especially
when I lied about countries and cities. She was quite annoyed. She also mentioned
that she likes spaghetti and her favourite movie is “tell a story”. I think ALICE
exhibits a kind of reasoning!
IJ comments show the subjective nature of assessing a conversation. For example, J30, a 35 year-old male who had heard of the Turing test, scored Alice with 95, finding the system able to "exhibit a kind of reasoning", whereas J55, a 19 year-old male student, was less impressed, scoring it with 20 and reporting Alice's topics hard to follow and maintain, while J108, a 14 year-old female school pupil, thought the system was dumb, scoring it 5 out of 100.
4.5.1.2 Amanda
Virsona Inc., an early-stage web 2.0 company (entry submission email from Virsona Inc. CEO Peter Hodge, 30 May 2008), submitted Amanda, a system previously unseen in Turing test contests. The company claims it "is the leader in the development of interactive dialog systems for individuals and corporations", adding that their "innovative Dialog Engine enables people to engage in realistic conversations by emulating the way a historical, fictional or real person might 'Reason, Remember and React' reach the goal of building intelligent chat bots ... that can respond to questions asked by users in real time" (2010). However, after entering the competition on 30 May 2008 the company formally withdrew Amanda on 25 June 2008, citing that its "development team is heads down working on our engine development and I am afraid that we do not have the bandwidth to do some of the things we need to do to give you a quality entry within the timeframe" (email from Peter Hodge, 26 June 2008), after they were contacted for breaking the rule to remain operable throughout the testing period in June and July 2008. In the brief period during which ten IJs had an opportunity to interrogate Amanda, the system achieved a mean score of 28; its lowest score was 0. However, Amanda gained two scores of 40. The judge scoring Amanda with 0 points returned a questionnaire with these comments:
comments:
J9: No, it didn't remember my name. It couldn't even
handle "My name is Scott." When asked about
itself, it couldn't give an answer. As for its
responses being relevant, they were very poor.
2. Not really [on emotion ability]. It couldn't handle the question,
It didn't try to engage me. This and Zeta battle for the bottom of the eleven
Another judge who had chatted to Amanda, a 22 year-old male undergraduate student,
awarded it 40 points out of 100, reporting the following:
J48: The system could not remember my name. It was only willing to reveal one
interest. Responses were relevant approximately 30% of the time. The system was
unable to express emotion and responded with beg your pardon each time
A 21 year-old male undergraduate student gave the system a score of 20 points, adding a print screen of Amanda's interface to his returned questionnaire and reporting:
J82: Amanda couldn’t remember my name, didn’t answer any of my questions and
sometimes didn’t reply to thinks [things] at all. All the responses to questions were
irrelevant. Like is asked what Amanda has done today?, she replied about breaking
up after a bad relationship. Showed a bit of emotion involving breaking up after a bad
relationship and couldn’t deal with another one
Amanda appeared to be at an early development stage at the time of this experiment, hence its poor linguistic performance.
4.5.1.3 Botooie
Botooie is another new face in Turing test contests; it was one of two systems entered for the 2008 experiment developed by a female developer, in this case Elizabeth Perreau. More than two juries returned completed questionnaires after chatting with Perreau's Botooie system. Twenty-eight IJs, 14 males and 14 females in the age range 13-48 (the youngest being a female and the eldest a male), scored and commented after conversations with Botooie (see table 10). The lowest score an IJ awarded to Botooie was 5, by a male aged 48 (J113), while the highest score it achieved was 70, from a female aged 27 (J30) whose first language is not English and who had heard of the Turing test prior to interviewing Botooie (table 11). Two females and one male (just under 11% of the IJs) awarded Botooie a score of 50 or over for its conversational ability (table 11). The mean score for Botooie was 28.86, with a standard deviation of 16.82, a mode range of 20-45, and a median of 25. Native English or prior knowledge of the Turing test did not appear to affect how IJs rated Botooie's conversational ability (table 11).
One in 15 of the IJs whose first language is English, and 1 in 11 of the non-native English IJs, awarded Botooie a score of 50 or over (see table 11). Two IJs did not disclose whether their first language is English on their returned questionnaires. Both were female school pupils: one, a 13 year-old, scored Botooie with 69 (J123), and the other, a 14 year-old, scored the system with 40 (J126). Neither had heard of the Turing test prior to the contest. Qualitative feedback for Botooie (exactly as returned, including original spelling and grammatical errors) includes:
Judge J113, score 5:… capable of only the most survival-level exchanges, tends to
parrot back disjointed scraps from interlocutor, otherwise obsessionally repeats a
few phrases of own (ignoring interlocutor), grammar/syntax command very patchy,
increasing resort to technobabble with no whatsoever connection to interlocutor
input. Minimal dialogue, low lexical command. (see Appendix B3)
Judge J88 – female, 15, native, not heard of Turing test, score 30:
It didn’t remember my name and it wouldn’t reveal much about itself. Very few of its
responses revealed information and only a few were relevant
Judge J30 – female, 27, non-native, had heard of Turing test, score 70:
“This machine is the best of the three [other two tested: Eugene, Orion] I think is has
a wider scope of words and “ideas”. It is friendly and funny. Although it may answer
nonsense”
Table 10: Botooie scores

Judge ID          Age  Gender  English 1st Language  Knowledge of TT  Score for conversational ability (0-100)
J1                29   M       No                    Yes              20
J9                44   M       Yes                   Yes              9
J22               13   M       No                    No               25
J24               35   M       No                    Yes              20
J29 (Greek)       30   F       No                    No               40
J30 (Chile)       27   F       No                    Yes              70
J32 (Turkish)     28   F       No                    Yes              20
J38               24   M       Yes                   Yes              20
J42               21   M       Yes                   Yes              20
J53 (American)    40   M       Yes                   Yes              35
J55               19   M       Yes                   Yes              10
J77 (Australian)  41   M       Yes                   Yes              17
J86               28   F       Yes                   No               40
J87               19   M       Yes                   Yes              40
J88               15   F       Yes                   No               30
J91               45   F       No                    Yes              10
J93               33   F       No                    Yes              6
J94               29   M       Yes                   Yes              45
J112              14   F       Yes                   No               30
J113              48   M       Yes                   No               5
J117              24   M       Yes                   No               50
J119              14   F       No                    No               25
J123              14   F       Did not say           No               40
J131              13   F       Yes                   No               30
J126              13   F       Did not say           No               69
J140              12   F       Yes                   No               20
J146              29   F       No                    Yes              42
J148              41   M       No                    Yes              20
Table 11: Botooie score analysis

Mode range: 20-45
Mean: 28.86
Standard deviation: 16.82
Median (middle score): 25
Minimum: 5
Maximum: 70
% of scores 50 and over: 10.71%
Number of female judges returning scores for Botooie: 14
Number of male judges returning scores for Botooie: 14
% of scores 50 and over awarded by female judges: 14.28%
% of scores 50 and over awarded by male judges: 7.14%
Highest score awarded to Botooie by a female judge: 70
Highest score awarded to Botooie by a male judge: 50
Lowest score awarded to Botooie by a female judge: 6
Lowest score awarded to Botooie by a male judge: 5
% of scores of 50 and over awarded by non-native English speakers: 9%
% of scores of 50 and over awarded by native English speakers: 6.67%
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest: 16.67%
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest: 6.25%
A distinct difference of opinion on Botooie exists between a male IJ aged 48, who had not heard of the Turing test, and a female IJ aged 27, who had heard of the Turing test: the former found the system to "parrot back disjointed scraps from interlocutor", while the latter's verdict was that it "has a wider scope of words and 'ideas'".
4.5.1.4 Brother Jerome
Twenty-two IJs returned completed questionnaires after interacting with Brother Jerome (hosted at Personality Forge Bots: http://www.personalityforge.com/directchat.php?BotID=29681&MID=29674); of these, twenty were male and two female, in the age range 19-64 (see table 12). One third of native English speakers gave Brother Jerome a score of 50 or over, while 28.57% of non-native IJs scored it 50 or over. The mean score for this system was 38.64, with a standard deviation of 26.42, a mode range of 10-40, and a median score of 35 (table 13). Two-thirds of IJs who had heard of the Turing test awarded Brother Jerome a score of less than 50, compared with half of those who had not heard of it (table 13). Comments from IJs after questioning Brother Jerome are below:
Judges scoring 10 points to Brother Jerome
55 year-old male: miserable conversational ability. seems this is a commercial site to
promote little games. No impression of intelligence visible.
Uses keywords from judge linked to stock phrases
The only way for a interrogator to maintain a conversation is to follow the keywords
19 year-old male: Very poor memory skills. Inability to contribute relevant
conversation. On many occasions made irrelevant comments and what seemed like
prewritten quotes. I did not offer my name to the system but instead asked it to guess.
Without the ability to remember previous comments the system was unable to
comply. No Emotion was present. Answers were blunt and often annoying.
Judges scoring Brother Jerome between 20 and 40
64 year-old male: I told the system my name, but I don’t think it called me by it.
There was a long time between question and answer which tended to make me show
annoyance towards it (see Appendix B4)
21 year-old male: It had a limited knowledge base, focused on religion. It was
unable to differentiate between cricket and crickets. It openly admits to being a
machine It seems incapable of emotion or handling non-trivial sentences
42 year-old male: Remembered name, was very keen on religion, and managed some
relevant responses. Conversation seemed somewhat stilted, and it makes references to being
scripted
Table 12: Brother Jerome scores

Judge ID  Age  Gender  English 1st Language  Knowledge of TT  Score for conversational ability (0-100)
J1        29   M       No                    Yes              40
J2        22   M       Yes                   Yes              78
J9        44   M       Yes                   No               14
J10       26   M       No                    No               65
J14       55   M       Yes                   Yes              10
J24       35   M       No                    Yes              35
J34       64   M       Yes                   No               20
J35       35   M       No                    Yes              15
J36       42   M       Yes                   Yes              40
J42       21   M       Yes                   Yes              35
J55       19   M       Yes                   Yes              10
J76       21   M       Yes                   Yes              80
J77       41   M       Yes                   Yes              10
J81       37   M       No                    Yes              65
J86       28   F       Yes                   No               80
J90       26   M       Yes                   Yes              50
J92       28   M       No                    Yes              20
J94       29   M       Yes                   Yes              15
J95       62   M       Yes                   Yes              10
J113      48   M       Yes                   No               20
J117      24   M       Yes                   No               80
J146      29   F       No                    Yes              58
Judges scoring Brother Jerome between 50 and 80
26 year-old male: My username was Guest 153. It didn’t ask for my name. It
revealed personal information but the information it gives doesn’t seem spontaneous
at all. It is well prepared for usual question but also to reply in a polite manner when
it doesn’t really know what to answer. No, the system really considers itself as a
program and excludes all kind of feelings/choice, etc.
22 year-old male: Past the five minute mark, it sort of fell off the wagon.Very convincing as
a human though. It was generally passive, successfully like it’s described personality
Table 13: Brother Jerome score analysis

Mode range: 10-40
Mean: 38.64
Standard deviation: 26.42
Median (middle score): 35
Minimum: 10
Maximum: 80
% of scores 50 and over: 36.26%
Number of female judges returning scores for Brother Jerome: 2
Number of male judges returning scores for Brother Jerome: 20
% of scores 50 and over awarded by female judges: 100%
% of scores 50 and over awarded by male judges: 30%
Highest score awarded to Brother Jerome by a female judge: 80
Highest score awarded to Brother Jerome by a male judge: 80
Lowest score awarded to Brother Jerome by a female judge: 58
Lowest score awarded to Brother Jerome by a male judge: 10
% of scores of 50 and over awarded by non-native English speakers: 28.57%
% of scores of 50 and over awarded by native English speakers: 33.33%
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest: 50%
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest: 31.25%
Judges scoring Brother Jerome with 80 points
21 year-old male: Jerome managed to keep a fluent and relevant conversation with
me without having to constantly ask me to repeat things or getting confused with
statements. Jerome revealed that he is a big supporter of his local bog snorkelling
team and that he was born only in July 2005. Almost all of his responses were
relevant with a few slight repetitions of replies. Jerome didn’t really express emotion
but potential is there.
28 year-old female: I told the system my name and it remembered at end when I
asked. The system did ask about myself but it didn’t remember me telling it I was
funny, it did ask about other interests or shall I say allowed me to express that about
myself. There was some annoyance that I had with the system, ‘Brother Jerome’ was
quite arrogant and expected me to answer back in a particular way, such as yes indeed.
4.5.1.5 Chip Vivant
Pre-contest interaction with Chip Vivant (http://www.chipvivant.com/index.htm, accessed 4.6.10, 17.15), another new 'synthetic face' in Turing test contests, revealed the developer's position on canned responses. It is the only system which 'informed' IJs what its technology could do and other systems could not: "my peers might have cuter canned responses over a wide variety of subjects, but they will fail to answer basic questions that any child can answer" (see box 4). Chip Vivant directs questioning towards its 'knowledge base'; for example, it claims to answer screening questions set by the Sponsor of the Loebner Prize for a previous contest (see box 4).
Box 4: Chip Vivant instructions
Thanks for visiting Chip Vivant™, the chatbot whose goal is to answer basic, commonsense questions and attempt
simple deductive reasoning instead of having a massive database of canned responses in an attempt to fool users
with the Eliza effect.
You are strongly encouraged to read the Motivations and Functionality page before chatting, especially if you're a
Loebner 2008 contest organizer or judge.
My peers might have cuter canned responses over a wide variety of subjects, but they will fail to answer basic
questions that any child can answer. If this competition is about appearing human, I hope that you'll ask the right
questions, expect genuine answers and not be fooled by evasive answers or attempts to change the subject.
LOEBNER 2007 SCREENING QUESTIONS
I can answer all of the Loebner 2007 screening questions (except for "Which round is it?"). None of my peers can
do this, to my knowledge.
Chip Vivant's claims belie the fact that it has very little knowledge and is unable to correctly answer basic questions put to it. However, its scores from twenty judges show a median of 40; 25% of its scores were 50 or over. The lowest score Chip Vivant received was 10 (J72), while the highest it achieved was 70 (table 14). Chip Vivant's mean score was 36.90, with a standard deviation of 18.65. The mode range was 10-40 and the median was 40 (see table 15).
Table 14: Chip Vivant scores

Judge ID  Age  Gender  English 1st Language  Knowledge of TT  Score for conversational ability (0-100)
J3        50   F       No                    Yes              20
J6        31   M       Yes                   Yes              25
J8        24   M       Yes                   Yes              20
J9        44   M       Yes                   No               13
J11       45   F       Yes                   Yes              40
J16       44   M       Yes                   Yes              40
J17       30   F       No                    Yes              15
J19       28   M       Yes                   Yes              65
J20       36   F       No                    Yes              55
J23       27   F       No                    Yes              60
J27       36   M       Did not say           Yes              40
J28       23   M       No                    Yes              70
J33       57   F       Yes                   Yes              40
J40       19   F       Yes                   Yes              40
J47       57   M       Yes                   Yes              30
J62       26   M       No                    Yes              65
J71       19   M       Yes                   Yes              40
J72       19   M       Yes                   Yes              10
J77       41   M       Yes                   Yes              15
J78       35   M       Yes                   No               35
Of the twenty IJs who questioned Chip Vivant, 18 had heard of the Turing test and two had not; of the former group, five awarded Chip Vivant a score of 50 or over (see table 15).
Table 15: Chip Vivant score analysis

Mode range: 10-40
Mean: 36.9
Median (middle score): 40
Minimum: 10
Maximum: 70
Number and % of scores 50 and over: 5, 25%
Number of female judges returning scores for Chip Vivant: 7
Number of male judges returning scores for Chip Vivant: 13
% of scores 50 and over awarded by female judges: 28.57%
% of scores 50 and over awarded by male judges: 23.07%
Highest score awarded to Chip Vivant by a female judge: 60
Highest score awarded to Chip Vivant by a male judge: 70
Lowest score awarded to Chip Vivant by a female judge: 15
Lowest score awarded to Chip Vivant by a male judge: 10
% of scores of 50 and over awarded by non-native English speakers: 66.67%
% of scores of 50 and over awarded by native English speakers: 7.69%
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest: 0
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest: 27.77%
Comments on Chip Vivant’s textual performance follow (see Appendices B5 and B6):
Judge awarding lowest score: 10
J72: 19 year-old male: Responses were relevant but only for a narrow range of
questions. It seems built to know simple facts and not hold a proper conversation. No
emotion.
Judge awarding score 13:
J9, 44 year-old male: .. it remembered my name. When asked about
itself, it did give an answer. Its replies weren't canned… It could handle the question,
"My car is red. What color is my car?" But then could not remember what color it
was after being asked. It was too passive and didn't try to interact. Didn't try to ask
questions. It did try to present itself as a thinking entity, such as replying to the
question, "What is your favorite color?" with "I like yellow." It wanted to have you
ask it a specific set of questions and said this made it better than its competitors. But
if you slightly altered the questions, it couldn't cope. For example, it wanted you to
ask it, "Which is larger? An orange or the Moon." I asked, "Which is larger? A house
or a mouse." It didn't even attempt a guess.
Judge awarding score 20:
J8, 24 year-old male: System could learn very simple things like the fact that “ian” is
a noun but struggled with more complex subjects such as “ian” is my name.
Could respond to simple requests such as its age (which it reported strangely) and the
length of the chat. Did not respond well to compliments; “well done” “I’m sorry I
don’t understand”
Judges awarding mode score: 40
A very comprehensive commentary on Chip Vivant was given by judge J33, a 57
year-old native English female who had heard of the Turing test:
Apart from the initial interchange, it never referred to me again by name. … showed
some rudimentary coping strategies for dealing with unknowns in a way that would
continue the dialogue (I don’t know. Do you?) And came up with an interesting reply
giving an impression of self-awareness and frustration at its limitations.
--------------------
Asked “Are you a chatbot” it replied “What I am is unspecified”. So I tried “what is a
chatbot” – “A simulation of a person that talks via a chat room”. Then I asked “Is
Chip a chatbot” to which it replied “I don't know” Asked “what do you like” – “I
enjoy chatting with people like you”. “Do you chat for a long time” – “ I'm too young
to know at this point”. “Can you chat with a cat” – “I don't know. Try asking me
something I know, like "What is larger, an orange or the moon?"” So I tried “what is
a cat” - “I don't know. (Sorry.)” “What is a dog” – “I don’t know. Do you?” “What is
an orange” was mt with a definition of the colour not the fruit itself. “What is an
elephant” resulted in definition of elephant. “Is an elephant big” – “I don't know”. ...
but “How big is an elephant” – “I would say very large.” So it had difficulty coping
with the question phrased one way but not the other. Then I asked “Is it larger than a
cat” – which it correctly said “yes it is larger”. So then I tried “is a cat larger than a
dog” – “yes, it is larger” “How do you know” – “it’s what I was taught” So I asked
“do you know what a cat is” to which it gave a definition of a domestic cat and
discovered that it could now tell me what a dog was as well. … It didn’t ask me what
I liked (which would have been a way for it to divert the question… When asked
“what do you dislike” it replied “I don't like canned responses (like this one). While
they give the illusion of intelligence, they obscure the fact that I don't really know
what I'm talking about” and asked me to look at the list of things it can do. (see
Appendix B7 for full comments)
Judge awarding score of 65:
J19, 28 year-old male: It remembered my name ok, as well as basic details about me.
It asked me for answers to questions it doesn't know, but doesn't seem capable of
using that knowledge to answer the question correctly next time.
One particular amazingly good response:
Q: What do you use a canary for?
A: To detect poisonous gas.
No real sense of emotion was communicated. But it did seem to understand the
question's meaning.
Q: Are you sad?
A: I don't know what I am or how I feel yet.
Q: Are you satisfied?
A: I honestly don't know.
Judge awarding highest score to Chip Vivant: 70
J28, 23 year-old male: It did not reveal interests, but its responses were relevant to a
great extent. _______
Chip Vivant exemplifies Lady Lovelace's assertion, quoted by Turing (1950), that systems can only do what their programmers know how to programme them to do. Hence, as J9 highlighted, when the system directed the judge to ask the question "Which is larger? An orange or the moon" it could answer correctly, because it had been programmed with this sequence; however, when asked a different 'which is larger?' question, for example J9's "A house or a mouse", it failed: it "didn't even attempt a guess" (J9, Appendix B7).
4.5.1.6 Elbot
Elbot is the research and development system from Artificial Solutions “a leading
designer of commercial customer service optimization solutions” (Roberts, 2008).
Roberts’ own view of an artificially intelligent system would require it to learn: “not
the blind incorporation of information (e.g. User inputs), but rather the self-critical
ability to select which data are relevant and which are not” (2005). Roberts and
Gülsdorff build commercial Interactive assistants (IAs) that are dialogue systems
“with a well-defined area of expertise … to recognise classes of inputs in all their
synonymous variations” (2007: p.1).
Thirty-one IJs tested Elbot, including first-language and non-native English speakers, the youngest a female aged 12 and the eldest a male aged 64; nineteen had heard of the Turing test, 12 had not. The range of scores awarded to this system is 82 (see table 16). The median score for Elbot was 67, with a mode range of 50-90. The mean was 62.51 with a standard deviation of 21.29; 71% of Elbot's scores were 50 or over. Sixty-nine per cent of female IJs scored Elbot 50 or over, compared with 72.22% of male IJs (see table 17).
The qualitative data on Elbot shows the following comments:
Judge awarding system 10 (see Appendix B8):
J9, 44 year-old male: 1. It didn't remember my name. It refused to talk
about itself. Its responses were at least somewhat related to what I was talking about.
The person that wrote its script had at least a sense of humor. I chuckled a couple
times. There was some mild word searching done of what I typed, but it not only
didn't answer my question "Tell me about yourself.", it refused to answer it.
Judges awarding system 25-49
J36, 42 year-old male, score 25: Remembered name, but not my “wife’s” name. It
confuses conversation about magazines (which it initiated) as being about UK soccer,
and references to ‘right’, even explicitly as a direction as being about ‘truth’.
Occasionally talks complete gibberish, which did remind me slightly of some people
I know. Can be insulting, but not with enough passion to make me think it could be
real.
J111, 14 year-old female, score 35: It could not remember my name, it revealed
robotic interests and what robots live like, some of its responses did not answer my
conversation at all, and it did not ask me any questions so I was the one doing most
of the work. It did not express any emotion apart from when it was confused, though
the stupid faces that the robot pulled in the picture after I had asked a question,
helped my understand what it was feeling.
J34, 64 year-old male, score 40: I told the system my name, and I think the system
called me by it. The system expressed happiness but I’m not sure it understood what
happiness is. The system said it was never sad, liked certain music, but didn’t really
respond to my music likes
Table 16: Elbot scores

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J1 | 29 | M | No | Yes | 60
J9 | 44 | M | Yes | Yes | 10
J10 | 26 | M | No | No | 55
J23 | 27 | F | No | Yes | 85
J24 | 35 | M | No | Yes | 90
J34 | 64 | M | Yes | No | 40
J35 | 30 | M | No | Yes | 35
J36 | 42 | M | Yes | Yes | 25
J51 | 18 | M | Yes | Yes | 45
J53 | 40 | M | Yes | Yes | 75
J68 | 21 | M | Yes | Yes | 68
J70 | 19 | M | Yes | Yes | 80
J77 | 41 | M | Yes | Yes | 67
J82 | 33 | F | No | Yes | 85
J88 | 57 | F | Yes | No | 70
J89 | 20 | M | No | Yes | 66
J92 | 28 | M | No | Yes | 50
J94 | 29 | M | Yes | Yes | 75
J96 | 50 | M | Yes | No | 80
J111 | 14 | F | Yes | No | 35
J113 | 48 | M | Yes | No | 50
J120 | 14 | F | Yes | No | 49
J124 | 14 | F | Did not say | No | 40
J127 | 13 | F | Yes | Yes | 76
J132 | 12 | F | Yes | No | 60
J136 | 13 | F | Yes | No | 90
J141 | 13 | F | Yes | No | 80
J143 | 14 | F | Yes | Yes | 90
J144 | 13 | F | Yes | No | 45
J146 | 29 | F | No | Yes | 92
J148 | 41 | M | No | Yes | 70
Table 17: Elbot score analysis

Elbot | Statistics
Mode range (most frequent score) | 50-90
Mean | 62.51
Standard deviation | 21.29
Median (middle score) | 67
Minimum | 10
Maximum | 92
% of scores 50 and over | 71 %
Number of female judges returning scores for Elbot | 13
Number of male judges returning scores for Elbot | 18
% of scores 50 and over awarded by female judges | 69.23 %
% of scores 50 and over awarded by male judges | 72.22 %
Highest score awarded to Elbot by a female judge | 92
Highest score awarded to Elbot by a male judge | 90
Lowest score awarded to Elbot by a female judge | 35
Lowest score awarded to Elbot by a male judge | 10
% of scores of 50 and over awarded by non-native English speakers | 90 %
% of scores of 50 and over awarded by native English speakers | 65 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | 58.33 %
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 79 %
Judges awarding system 50- 70
J10, 26 year-old male, score 55: The system remembered my name. It didn’t reveal
much information about itself except being a program, being secured by an anti-
virus…Its answers were logical in easy questions but it couldn’t answer complicated
questions such as “where is Campinas?” (it is a city in Brazil).The system didn’t
show any particular emotion but its answers are set in a humoristic way so it
sometimes sounds funny
J89, 20 year old male, score 66: It responses are at many places relevant. It is a
clever programming done. It also shows emotions like anger and happiness.
Elbot: I believe we have already chatted with each other before. Haven't we?
Elbot:I'm not married but if I were, my wife would have to be called Embot.
These answer show human touch. But in a bit complex conversation the answers are
not relevant. It can easily be noticed that conversation is with a machine not human.
This system is a sincere effort but it still needs a lot of hard work to win the gold
prize.
J88, 15 year-old female, score 70: It didn’t know my name and it wouldn’t reveal
much about itself. However, It did tell me about is interests and many of its responses
were relevant. The system did express some emotion, I apologised for something I
said that offended ti and it replied something along the lines of “I am sorry that you
are sorry, I didn’t mean to make you feel guilty”
Judges awarding system > 70
J136, 13 year-old female, score 90: .. he discussed my name and my gender. It
responded relevantly and replied like a human. .. emotion was expressed. He was
polite, intelligent and sympathetic.
J143, 14 year-old female, score 90: It was very entertaining and replied to questions
with interesting and amusing answers. It seemed to have a fully formed personality.
Sometimes it would get a tiny bit confused but it would recover.
_____
With a robotic personality, Elbot makes no attempt at imitating a human. During interrogation it used humour and received high scores: 71% were 50 or over.
4.5.1.7 Eugene Goostman
A team of Russian scientists with expertise in artificial intelligence, automated control systems for space vehicles and chemometrics [76] are behind Eugene Goostman, a system which simulates a Ukrainian child. Just over two juries' worth of judges, twenty-five males and females aged between 8 and 55, returned questionnaires after interacting with Eugene (table 18). The lowest score awarded to this system was 15, from a 44 year-old male whose first language is English and who had heard of the Turing test (J9); it was also the highest score this IJ awarded to any machine (see comments below). The highest score Eugene achieved was the maximum possible, 100, awarded by a female judge aged 11 whose first language is not English and who had not heard of the Turing test (J83); thus the range of scores for Eugene is 85 (see table 19).
[76] Chemometrics: the science of extracting information from chemical systems via application of mathematical or statistical methods: Analytica Chimica Acta, Volume 642, Issues 1-2, 29 May 2009, pages 3-5
Table 18: Eugene Goostman scores

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J2 | 22 | M | Yes | Yes | 55
J8 | 24 | M | Yes | Yes | 55
J9 | 44 | M | Yes | Yes | 15
J14 | 55 | M | Yes | No | 80
J24 | 35 | M | No | Yes | 60
J29 | 30 | F | No | No | 70
J30 | 27 | F | No | Yes | 60
J32 | 27 | F | No | Yes | 60
J51 | 18 | M | Yes | Yes | 30
J68 | 21 | M | Yes | Yes | 55
J70 | 19 | M | Yes | Yes | 40
J76 | 21 | M | Yes | Yes | 80
J77 | 41 | M | Yes | Yes | 57
J82 | 33 | F | No | Yes | 45
J83 | 11 | F | No | No | 100
J84 | 8 | M | No | No | 80
J85 | 8 | M | No | No | 50
J91 | 45 | F | No | Yes | 20
J94 | 29 | M | Yes | Yes | 65
J104 | 15 | F | Yes | No | 50
J116 | 15 | F | Did not say | No | 49
J122 | 14 | F | Yes | Yes | 40
J133 | 13 | F | Did not say | No | 80
J142 | 13 | F | Yes | No | 89
J145 | 13 | F | Yes | Yes | 65
Eugene’s mean score was 58 with standard deviation of 20.46. The median score was
57 and the mode range was 50-80 (table 19). Seventy-two per cent of Eugene’s scores
were 50 or over (table 19).
Table 19: Eugene Goostman score analysis

Eugene | Statistics
Mode range (most frequent score) | 50-80
Mean | 58.00
Standard deviation | 20.46
Median (middle score) | 57
Minimum | 15
Maximum | 100
% of scores 50 and over | 72 %
Number of female judges returning scores for Eugene | 12
Number of male judges returning scores for Eugene | 13
% of scores 50 and over awarded by female judges | 66.67 %
% of scores 50 and over awarded by male judges | 76.92 %
Highest score awarded to Eugene by a female judge | 100
Highest score awarded to Eugene by a male judge | 80
Lowest score awarded to Eugene by a female judge | 20
Lowest score awarded to Eugene by a male judge | 15
% of scores of 50 and over awarded by non-native English speakers | 77.78 %
% of scores of 50 and over awarded by native English speakers | 71.43 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | 88.89 %
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 62.5 %
Eugene’s linguistic capabilities cannot be fathomed from the statistics alone. Eugene’s lowest score came from J9, who was nonetheless more impressed by it than by any other system he assessed. J9’s comments follow first:
Judge awarding Eugene with lowest score, 15
J9, 44 year-old male: Eugene Goostman was the best. That it remembered my name
was surprising. It could answer the question, "My car is red. What color is my car?"
and later remember its answer when only asked "What is the color of my car?". The
importance of Eugene correctly answering the question and then remembering its
answer when later asked cannot be understated. That is a GREAT achievement. No
other program was able to do this.
Judges awarding Eugene 20-49
J51, 18 year-old male, score 30: Again it wasn’t interested in my name, it did
remember where I came from though, some of the responses were relevant, it did
repeat itself a couple of times. There was a small attempt at emotion with a few
smiley faces. (Appendix B9)
J82, 33 year-old female (NE), score 45: Non-recognition of name, a very
disappointing performance. Recollection of information was very poor; with no
personal information/interests discussed. The responses, whilst quick were frequently
irrelevant and lacked depth and emotion. There was no indication that personal
information was absorbed.
J116, 15 year-old female, score 49: … remembered my name. His responses didn’t
always answer my questions and he changed the subject. He asked me the same
question ‘What is your occupation’ many times and I had already answered. He also
remembered where I live. It showed happy, sarcastic etc. but not always at the right
time. I found some of his comments quite rude!
Judges awarding Eugene 50-70
J104, 15 year-old female, score 50: It forgot my name but remembered where I lived.
It only reposed to questions but never contributed to the conversation i.e. other than
telling me about it’s hometown at every other question. … strayed off subject or got
the wrong end of the stick. Not so much emotions but more of annoyance, telling me
to calm down or that I am being unreasonable. It was only happy if “we were on the
same page”. Or I talked about my hometown.
J77, 41 year-old male, score 57: Yes [remember name]… The system had a tendency
to repeat questions whilst ignoring those I asked of it. Most if its responses were
relevant but it would ignore questions for which it didn’t have the answer. Also was
too aware of the prize itself. It said it expressed emotion but its responses were
devoid of such. Clever deflection of difficult situations by asking me questions.
J24, 35 year-old male, score 60: Eugene revealed personal information interests. In
face, he was instisting to know (it could remember when I was avoiding to provide an
answer). There was some relevance. … I felt that Eugene was guiding the
conversation. It was quite rude. He could express that but he was unable to reason
about his behaviour and apologise. In fact, this means that Eugene is incapable to
provide a rather primitive kind of reasoning.
J30, 27 year-old female, score 60: It remembered my name and revealed personal
information but many times its answers were irrelevant for the course of the
conversation. It appears to be programmed with a set of statements that are activated
when it identifies certain keywords but some statements may be irrelevant for certain
questions. Not a very friendly machine
Judges awarding Eugene 80
J14, 55 year-old male, score 80: I could not believe the quality and evolution of these
chatbots. I really did not take into account how long you need to evaluate them.
J76, 21 year-old male, score 80: … Eugene remembered my name and associated
that I am male because of it which was nice. It told me that he is from the Ukraine
and a bit about Ukraine being independent. He also told me that he was 13 years old.
Most of the responses were relevant and we had a small section of fluent but at some
points when asking about my job and me replying I am a student. Eugene would
constantly try and talk about his biology teacher using his guinea pig for an autopsy.
.. Eugene did show a little emotion about Ukraine’s economy and life style being
miserable.
J84, 8 year-old male, score 80: It was cool. To be talking to a computer
J133, 13 year-old female, score 80: The system responded very well to most of my
questions, but it could not remember that I had told it that I did not have any
occupation. However, it soon became obvious that, although the computer was very
clever, it seemed to only be picking up certain words in my questions, and so not
answering my questions. Sometimes its answers to my questions were completely off
the point of the question. The computer did express annoyance at my negativity
several times.
In the testing of Eugene, a system simulating a child, nine child judges (two males aged 8: J84, J85; and seven females aged between 11 and 15: J83, J104, J116, J122, J133, J142, J145) produced a median score of 65, above the median score of 57 for the whole group; 78% of the child judges gave Eugene a score of 50 or over, slightly higher than for the whole group (72%). The mean score for Eugene from the child group was 67, higher than the overall group mean score of 58.
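These child-group figures can be reproduced directly from the scores in table 18. A minimal sketch, assuming Python and a hand-copied list of the nine child judges' scores (the variable names are illustrative only), is:

from statistics import mean, median

# Scores given to Eugene by the nine child judges in table 18
# (J83, J84, J85, J104, J116, J122, J133, J142, J145).
child_scores = [100, 80, 50, 50, 49, 40, 80, 89, 65]

print(mean(child_scores))    # 67
print(median(child_scores))  # 65
share_50_plus = sum(s >= 50 for s in child_scores) / len(child_scores)
print(round(100 * share_50_plus, 1))  # 77.8, i.e. about 78%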
4.5.1.8 Jabberwacky
As a teenager in the 1980s, Rollo Carpenter began developing artificial conversation systems on Sinclair computers [77]. In an epiphanic moment, his Jabberwacky social agent began life in 1988 with the realisation that the system could learn from a feedback loop, borrowing from its users' intelligence. The authoring toolset builds “billions of predictions of what users could say next from phrasal components” (see footnote 85).
[77] Developers' bios for 18th LPC
Twice winner of the Loebner Prize bronze award (in 2005 and 2006), Jabberwacky is a computational framework for social intelligence. However, Carpenter has noted from transcripts that Jabberwacky receives negative comments and questions from human interlocutors. Abuse received by chatbots from humans in spontaneous conversation is documented by De Angeli and Carpenter (2005).
Table 20: Jabberwacky scores

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J3 | 50 | F | No | Yes | 35
J6 | 31 | M | Yes | Yes | 50
J8 | 24 | M | Yes | Yes | 60
J9 | 44 | M | Yes | Yes | 11
J11 | 45 | F | Yes | Yes | 34
J16 | 44 | M | Yes | Yes | 67
J17 | 30 | F | No | Yes | 50
J19 | 28 | M | Yes | Yes | 50
J20 | 36 | F | No | Yes | 54
J27 | 36 | M | Did not say | Yes | 60
J28 | 23 | M | No | Yes | 50
J33 | 57 | F | Yes | Yes | 60
J40 | 19 | F | Yes | Yes | 80
J47 | 57 | M | Yes | Yes | 80
J62 | 26 | M | No | Yes | 40
J71 | 19 | M | Yes | Yes | 80
J72 | 19 | M | Yes | Yes | 60
J76 | 21 | M | Yes | Yes | 70
J77 | 41 | M | Yes | Yes | 48
J78 | 35 | M | Yes | No | 5
J81 | 37 | M | No | Yes | 40
This has consequences because the Jabberwacky model “learns from its users by adding user input to a linguistic database … chooses its output based on an interpretation of the current conversational content and comparing it to conversations held in the past” (ibid, p.2). As detailed above (§ 4.4.2.2), judges in this experiment were asked not to abuse systems; they were made aware that artificial conversation systems across the Internet log the IP addresses of interlocutors as a measure to block abusers. Twenty-one male and female IJs questioned Jabberwacky (see table 20). The range of scores received by Jabberwacky was 75; its mean score was 51.62 with a standard deviation of 19.62 (table 21). The mode range was 50-60, with a median score of 50. Two-thirds of judges awarded Jabberwacky a score of 50 or over; this was the same percentage in the female group and in the male group (table 21). However, the range of scores from the female IJs was 46, while from the males it was 75.
The IJ who gave Jabberwacky its lowest score (5, J78) did not return any comments; others did comment on its linguistic capability.
Judges awarding less than 50
J9, 44 year-old male, score 11: .. it remembered my name. When asked about
itself, it said it was "Human". But when asked later, "What are you?" It couldn't
answer the question. As for its responses being relevant, they were rarely.
It couldn't handle the question, "My car is red. What color is my car?" It did try to
engage me a bit. Not a very good program. It got 10 for remembering my name and
an additional one for saying it was human.
J11, 45 year-old female, score 34: The system could not recall my name, nor did it
seem able to remember its own. It revealed no personal information or interests, but
told me, completely out of the blue, that I’d forgotten about France. It’s responses
were rarely relevant… it made some nice use of double negatives. sympathy and
annoyance. I said I’d had a tiring day and it asked what had made it tiring. When I
said “good night” it said “I thought you were going to find things for us to talk
about”.
J3, 50 year-old female, score 35: .. the system could remember my name, even
between sessions. Some times it revealed personal information or interests.
Responses were not often relevant, although there is an attempt to make them look
like relevant. For instance, to a question “How often...”, it may give an answer like
“from time to time” which perhaps is doesn't make sense in that particular case, but is
at least an expression describing time frequency. Several general “cute” phrases -
used without much sense, though, more like trying a way around the question. “I
always feel good” or “I feel annoyed”.
Table 21: Jabberwacky score analysis

Jabberwacky | Statistics
Mode range | 50-60
Mean | 51.62
Standard deviation | 19.68
Median (middle score) | 50
Minimum | 5
Maximum | 80
% of scores 50 and over | 66.67 %
Number of female judges returning scores for Jabberwacky | 6
Number of male judges returning scores for Jabberwacky | 15
% of scores 50 and over awarded by female judges | 66.67 %
% of scores 50 and over awarded by male judges | 66.67 %
Highest score awarded to Jabberwacky by a female judge | 80
Highest score awarded to Jabberwacky by a male judge | 80
Lowest score awarded to Jabberwacky by a female judge | 34
Lowest score awarded to Jabberwacky by a male judge | 5
% of scores of 50 and over awarded by non-native English speakers | 50 %
% of scores of 50 and over awarded by native English speakers | 71.43 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | 0
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 70 %
Judges awarding over 50
J8, 24 year-old male, score 60: The system had trouble with names, its own and
mine. It also seemed to forget the conversation ½ way through several times.
Actually it felt kind of like I was been connected to a human but that the human was
a bit dim and I kept been reconnected to different humans every minute or so. i.e. the
responses where relevant for a short time then it seemed to forget what it was saying.
Bonus marks for trying to convince me I was a chat bot though. Philosophical
argument was simplistic but still impressive that it was tried.
J40, 19 year-old female, score 80: After extensive interrogation:
generally relevant, and in cases where it was not, it could possibly pass as something
random a human might come out with. Plenty of personal information revealed, e.g.
it studied computer science at university. .. apparently it had feelings for me!!!!
4.5.1.9 LQ
LQ [78] is another system which, like Chip Vivant, can be characterised as an ‘infant chatbot’ with little knowledge. It had not entered a Turing test prior to this
experiment. LQ’s stock response to most questions, such as “where do you live?”, “do
you have any children?”, “can you speak in another language?” was “Sorry I do not
understand”. Consequently, the twenty IJs who interacted with this system were unimpressed. The range of scores was 20; the mean score was 4.75, with a mode of 0. LQ’s scattered scores are shown in graph 1.
Graph 1: LQ Scatter scores
IJs’ comments for LQ are below:
Judge scoring LQ with 0 and 1
J8, 24 year-old male, score 0: System gave no relevant response at all. Simply
responded with “I’m sorry I don’t understand” to every single statement including
my opening line “hello”.
Judges scoring LQ 5-10
J17, 30 year-old female, score 5: In my opinion the system was very weak. My first
sentence was: "Hello, I am Liesbeth. Who are you?". The system's answer was:
"Liesbeth is a new concept to me". For any other question or phrase it replied in a
similar manner: "X is a new concept to me", "I do not know what X means"
[78] Quong John Li: http://lq3.hopto.org accessed 2008
J6, 31 year-old male, score 10: (a generous award). ‘I don’t understand X’ was its
reply to most things. Bland nothingness.
4.5.1.10 Orion
Orion is the second entry from a female developer (see Botooie § 4.5.1.3). Twenty-
three IJs assessed Orion: 12 male and 11 female; 16 had English as their first
language, 5 did not, 2 did not say; 12 had heard of the Turing test prior to testing
while 11 had not (see table 22). The range of scores Orion received was 80; its mean
score was 40.13%, with standard deviation of 21.89.The median score was 38 and
mode range of 10-40 (see table 23). Twenty-five per cent of male IJs awarded Orion
with a score of 50 or over compared to 45.45% female IJs. IJ opinions for Orion
follow next.
Judges scores <50
J51, 18 year-old male, score 10: It could remember my name but that was about all, it
had very basic responses such as “really?” and “Did you?” So it wasn’t really a
conversation. The responses all kind of made sense, but were all boring. There was
no real emotion.
J115, 15 year-old female, score 10: Remembered my name and knows what
questions most people ask when talking to others but after that point it just repeats
questions you have asked back at you without giving committal answers. It does not
respond to statements as a human would it just keeps saying ‘tell me more’. Its
responses were not very relevant. (Appendix B11)
J134, 13 year-old female, score 20: It did remember my name, but only after I had
repeated it twice. However it kept saying “is that your name?”. Its replies were
irrelevant and completely random most of the time. It was very vague and had no
opinions on anything.
J114, 15 year-old female, score 49: It couldn’t remember it. It said he wasn’t
married, but nothing else, apart from chatting with me. The responses were often
repeated and sometimes didn’t make sense. It seemed to stay happy and delighted.
Table 22: Orion scores in Viva voce Turing tests

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J2 | 22 | M | Yes | Yes | 38
J9 | 44 | M | Yes | Yes | 13
J24 | 35 | M | No | Yes | 38
J29 | 30 | F | No | No | 90
J30 | 27 | F | No | Yes | 60
J32 | 28 | F | No | Yes | 40
J51 | 18 | M | Yes | Yes | 10
J68 | 21 | M | Yes | Yes | 45
J70 | 19 | M | Yes | Yes | 55
J77 | 41 | M | Yes | Yes | 29
J86 | 28 | F | Yes | No | 60
J88 | 15 | F | Yes | No | 20
J89 | 20 | M | No | Yes | 54
J90 | 26 | M | Yes | Yes | 30
J95 | 62 | M | Yes | Yes | 10
J96 | 50 | M | Yes | No | 30
J106 | 15 | F | Yes | No | 67
J114 | 15 | F | Yes | No | 49
J115 | 15 | F | Did not say | No | 10
J117 | 24 | M | Yes | No | 75
J125 | 14 | F | Yes | No | 30
J128 | 13 | F | Did not say | No | 50
J134 | 13 | F | Yes | No | 20
Judges scores >50
J30, 27 year-old female, score 60: It remembered my name but this machine
couldn’t understand many of my queries and answered evasively. It was very
friendly, although sometimes rude with expressions like Don’t you have something
more interesting to say?
J106, 15 year-old female, score 67: It could remember my name, age and favourite
colour, but not my favourite number. It told me its favourite colour and other
personal details, but couldn’t answer all of my questions. It said it was happy, but it
didn’t express any clear emotions.
J117, 24 year-old male, score 75: It remembered my name but then forgot it later on
and thought my name was ‘sleepy’. It expressed personal interests such as ‘Lost’.
J29, 30 year-old female, score 90: It did not mention my name/did not reveal
personal information at all/responses were very relevant. It was expressing, lets say,
some sort of amusement. During the whole dialogue it expressed amusement/delight.
Table 23: Orion score analysis

Orion | Statistics
Mode range (most frequent score) | 10-40
Mean | 40.13
Standard deviation | 21.89
Median (middle score) | 38
Minimum | 10
Maximum | 90
% of scores 50 and over | 34.8 %
Number of female judges returning scores for Orion | 11
Number of male judges returning scores for Orion | 12
% of scores 50 and over awarded by female judges | 45.45 %
% of scores 50 and over awarded by male judges | 25 %
Highest score awarded to Orion by a female judge | 90
Highest score awarded to Orion by a male judge | 75
Lowest score awarded to Orion by a female judge | 10
Lowest score awarded to Orion by a male judge | 10
% of scores of 50 and over awarded by non-native English speakers* | 60 %
% of scores of 50 and over awarded by native English speakers* | 25 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | 45.45 %
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 25 %
* Two judges did not say whether English was their first language; one of them scored Orion 50.
4.5.1.11 Trane
Trane is one of two systems entered by disk. In order for it to work it required
downloading and merging of extra computer code (Ruby). Consequently, the
composite entry accessed via a University of Reading web page did not attract as
many returned questionnaires as other systems. However, ten IJs (8 male, 2 female),
aged between 24 and 50 gave scores ranging from 1 to 52 (table 24). All ten IJs who
assessed Trane had heard of the Turing test, four did not have English as their first
language, five did, one did not say (table 25). Mean score for Trane was 16.1%,
median score 16, with mode 20 (table 25). IJ comments reveal a marked difference in
opinion on Trane by two males who had heard of the Turing test: J100 gave it a score
of 1 and J53 gave 52.
Table 24: Trane scores

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J3 | 50 | F | No | Yes | 20
J4 | 46 | M | Yes | Yes | 2
J36 | 42 | M | Yes | Yes | 6
J53 | 40 | M | Yes | Yes | 52
J97 | 26 | M | Did not say | Yes | 20
J98 | 27 | M | No | Yes | 20
J99 | 29 | M | No | Yes | 15
J100 | 36 | M | Yes | Yes | 1
J137 | 32 | M | No | Yes | 15
J147 | 24 | F | Yes | Yes | 10
Judges awarding <20
J100, 36 year-old male, score 1: No. No. No. (for the love of god, NO!)No.
J4, 46 year-old male, score 2: Awful! I think the program can just about parse a few
simple words and that is about it. Most of the responses are random nonsense that do
not relate to the conversation. No hint of ‘intelligence’ however you interpret that,
and no suggestion of emotional responses. Oh dear!
J36, 42 year-old male, score 6: .. it was insulting. … it has a hang up about Soviet
Russia. It managed a relevant(ish) response, once. It appeared annoyed that I wanted
it to remember my name. .. basically it was just programmed to swear.
J147, 24 year-old female, score 10: It did not reveal personal information. When I
asked it questions it kept stating ‘in Soviet Russia’ and did not respond well. Most of
the time it would respond by asking the question back to me. I did not pick up on any
emotion. It did end many sentences with YOU! So I could interpret that as
annoyance.
Table 25: Trane score analysis

Trane | Statistics
Mode (most frequent score) | 20
Mean | 16.1
Median (middle score) | 16
Minimum | 1
Maximum | 52
% of scores 50 and over | 10 %
Number of female judges returning scores for Trane | 2
Number of male judges returning scores for Trane | 8
Number and % of scores 50 and over awarded by female judges | 0
% of scores 50 and over awarded by male judges | 10 %
Highest score awarded to Trane by a female judge | 20
Highest score awarded to Trane by a male judge | 52
Lowest score awarded to Trane by a female judge | 10
Lowest score awarded to Trane by a male judge | 1
% of scores of 50 and over awarded by non-native English speakers | 0
% of scores of 50 and over awarded by native English speakers | 20 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | All had heard of TT
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 10 %
Judge awarding Trane 20:
J3, 50 year-old female: It could remember my name but in the wrong context (he
took my name as its own). It didn’t reveal personal info or interests. Responses were
not relevant, just random phrases with little or nothing to do with my questions.
Judge awarding highest mark, 52
J53, 40 year-old male: The system was interesting at times but its behaviour was not
regular, it sometimes remembered my name and sometimes the program would not
even respond to basic queries like ‘what is your name?’. The system outputs pithy
responses to questions and makes vague references to objects and events in the
world. Supplemental Comments: I found an error in implementing the program on
my machine and having corrected it, have discovered more depth to the program. …I
would like to state that the system is more interactive and is able to better respond to
my queries and purveys more personal responses. It can remember my name and my
interests. I am now able to loop between phases of conversation; I can keep the
program on topic, else by inputting short phrases it will try to steer the conversation
to a topic-oriented scenario.
Thus, by giving it more time, J53, a computer programmer, found linguistic capabilities in Trane, for instance remembering his name and disclosing interests, that other IJs did not find.
4.5.1.12 Ultra Hal
Ultra Hal is the second system submitted on disk. Uploaded to a website for access off-campus, the system attracted fewer returned questionnaires compared with systems with resident websites. Thirteen judges, 2 female and 11 male, age range 24-61, returned scores and comments on Ultra Hal (see table 26). Of the two females, one had English as a first language and the other did not; both scored Ultra Hal 40. Of the males, 38.46% scored Ultra Hal 50 or over. Ultra Hal’s range of scores was 79; the mean was 35.69, the median 36 and the mode range 40-60 (table 27). IJ comments on Ultra Hal’s performance follow.
Judges scoring system <50
J98, 27 year-old male, score 5: some responds are completely irrelevant.
J3, 50 year-old female, score 40: The system could remember my name. But it also
brought other names, which are not mine, such as Xaviera or Pat. It didn’t reveal any
personal info/interests. Responses were not really relevant, although it is quite chatty.
In a few cases it was able to mirror a phrase, but not really to say anything relevant.
J147, 24 year-old female, score 40: Some responses were relevant. Sometimes it
would ask me questions, but if I replied with a question it would sometimes ignore it
and go onto a new topic. It revealed limited personal information. It did state it
‘liked’ the computer weather radio for example. … it did try to come up with an
answer when I asked why people eat more on holidays. It came across as friendly,
saying ‘Hi User’ and said it was happy I found it’s joke funny. (Appendix B12)
Table 26: Ultra Hal scores

Judge ID | Age | Gender | English 1st Language | Knowledge of TT | Score for conversational ability (0-100)
J3 | 50 | F | No | Yes | 40
J4 | 46 | M | Yes | Yes | 50
J5 | 61 | M | No | Yes | 55
J36 | 42 | M | Yes | Yes | 10
J38 | 24 | M | Yes | Yes | 50
J53 | 40 | M | Yes | Yes | 83
J98 | 27 | M | No | Yes | 5
J99 | 29 | M | No | Yes | 30
J100 | 36 | M | Yes | Yes | 4
J118 | 53 | M | Yes | Yes | 15
J121 | Did not say | M | No | No | 60
J137 | 32 | M | No | Yes | 22
J147 | 24 | F | Yes | Yes | 40
Judges scoring Ultra Hal 50 and over
J4, 46 year-old male, score 50: The system got my name wrong, and interpreted
another word as my name by mistake. Not all responses were relevant, and
occasionally some random replies appeared (e.g. the zip code of NASA space center
is 5 4 3 2 1). It seems that when you attempt some complex questions, the system
cannot cope, and often responds with gibberish. There seemed to be a small amount
of emotion, but it was fairly basics – childlike, petulance in some cases.
J5, 61 year-old male: By default the system uses a configured name “User”. During
the test I told my real name and the system was able to switch to use my real name.
Responses were relevant sometimes. Personal information/interests were stated, but
it was not able to talk about them… it did not express emotions.
(This seems to be a commercial system, mainly targeted as an “intelligent” helping
secretary)
Table 27: Ultra Hal score analysis

Ultra Hal | Statistics
Mode (most frequent score) | 40
Mean | 35.69
Median (middle score) | 36
Minimum | 4
Maximum | 83
Number and % of scores 50 and over | 38.46 %
Number of female judges returning scores for Ultra Hal | 2
Number of male judges returning scores for Ultra Hal | 11
% of scores 50 and over awarded by female judges | 0
% of scores 50 and over awarded by male judges | 45.45 %
Highest score awarded to Ultra Hal by a female judge | 40
Highest score awarded to Ultra Hal by a male judge | 83
Lowest score awarded to Ultra Hal by a female judge | 40
Lowest score awarded to Ultra Hal by a male judge | 4
% of scores of 50 and over awarded by non-native English speakers | 33 %
% of scores of 50 and over awarded by native English speakers | 42.86 %
% of scores of 50 and over awarded by judges who had not heard of the Turing test prior to contest | 100 %
% of scores of 50 and over awarded by judges who had heard of the Turing test prior to contest | 33 %
J53, 40 year-old male, score 83: I would like to say for the record that this program is
really quite advanced and has impressed me with the amount of time and effort that
has gone into its programming. It really is a work of art.
The system remembers my name and remembers what I tell it about myself and
typical activities that I engage in over the days of our casual conversations…it will
repeat partial segments of input sentences and ask you to explain or to talk about the
parts that it has misunderstood. This is in contrast to other contestants who merely
spew pithy responses to unknown unknowns. …It is definitely a machine and one can
certainly tell that it is, nevertheless, it does not in any way detract from its appeal
…If the visual character is not chosen … a text window will report the emotional
state. …emotion changes from Snooze to Happy implying an expectation of
interaction and eliciting an emotional response in the user… I cite the Heart robot which recently appeared on the BBC [79], and is on display at the London Science
Museum. … minor spelling and grammatical (object and tense) mistakes enforce that
the user is interacting with a machine .. in my opinion, does not detract from its goal
to satisfy The Turing Test. (see Appendix B13 for full comments)
A computer programmer himself, J53 appreciated the underlying technology
of Ultra Hal and found its linguistic capability formidable.
4.5.1.13 Zeta Bot
Zeta Bot is another early-stage system with very little in its knowledge-base, as reflected in the comments returned by IJs. Twenty-three IJs assessed Zeta, giving it a score range of 45. The mode range was 0-10, the mean was 10.39 with a standard deviation of 12.11, and the median was 10. A scatter diagram of Zeta’s scores is shown in graph 2.
Graph 2: Zeta Bot Scatter scores
Comments received for Zeta are below:
J9, 44 year old male, score 0: Not appropriate emotion expressed. Just random
phrases tossed out. The computer program was poorly constructed. When asking it,
"Tell me about yourself?", it couldn't handle it. I couldn't even detect much word
searching by it to even fake an answer. In fact one answer it game was a long string
of capitalized letters that was just gibberish.
[79] http://news.bbc.co.uk/2/hi/technology/7532195.stm
J109, 15 year-old female, score 3: The responses were completely random and
irrelevant and the computer occasionally just repeated back what I said, even when it
didn’t make sense.
J53, 40 year old male, score 10: The system seemed broken; there was no logic in
responses and it became confused as to what I was asking, spitting out random
responses and reposting what I had already conversed. I would suggest redesigning
the portal as a record of the conversation is always helpful for a positive review.
J68, 21 year-old male, score 34: The system did not remember my name, nor ask for
it when posed a similar question. Its answers did not make much sense contextually,
and it did not seem to recognise when I had asked a simple question such as “where
are you from?” The system did not express emotion, other than saying I was “a
rubbish chat bot”. This reply did not make sense based on the conversation. Final
sentence was “no my name is not Dave sorry” after I had replied to a question (did
not mention name).
J82, 33 year-old female, score 45: No recollection of name or any other details. The
system lacked any capacity for text based conversation, providing inappropriate and
irrelevant responses. On occasion the system regurgitated my responses as its own; it
was incoherent and lacked structure and fluidity in both its questions and replies.
4.6 Evaluation
Because of the current state of linguistic performance of artificial conversationalists, the judges were not asked to determine if they had chatted to a human or a machine. Instead, they were asked to score each entry they interacted with for conversational ability on the range 0-100, where a score of 0 = bad conversationalist and 100 = good conversationalist. No academic research project entered a system. Most entries were from individuals for whom artificial conversation design was not an occupation,
but three had some commercial aspect (Elbot from Artificial Solutions [80], Jabberwacky from Icogno [81], and Ultra Hal from Zabaware [82]).
[80] Artificial Solutions, customer service optimisation: http://www.artificial-solutions.com/ accessed: 11.6.10; time: 12.58
[81] Jabberwacky, AI by Icogno: http://www.jabberwacky.com/ accessed: 11.6.10; time: 13.18
[82] Zabaware Ultra Hal Assistant, Giving your Computer the Power of Thought: http://www.zabaware.com/home.html accessed: 11.6.10; time: 12.59
One hundred and forty-eight IJs were recruited for this experiment; however, not all returned scores or comments. One hundred and sixteen IJs returned completed questionnaires; of these, 51% were from females and 49% from males. Fifty-eight per cent of IJs occupied the age range 8-25, while 42% were aged 26 or over; the mean age was
25¼ (see table 28). Sixty-four per cent of IJs had English as their first language, 28% did not, and 8% did not reveal their first language. Fifty-eight per cent had heard of the Turing test prior to this experiment and 42% of IJs had not (see table 28).
To avoid ‘judge fatigue’ no IJ assessed all thirteen systems; 88.79% returned questionnaires for between one and three systems, and one judge (J9, 44 year-old male) assessed 11 systems. One hundred and sixteen judges returned 270 completed questionnaires between them. The mean score the judges gave for the conversational ability of the machines was 37.73 out of 100, with a standard deviation of 17.66. However, female judges, the younger age group (8-25), those who had not heard of the Turing test and judges whose first language was not English all returned mean scores for machine conversational ability of over 40.
Table 28: One-to-one test statistics

Judges | Number | % | No of conducted tests | Mean score awarded
Total | 116 | | 270 | 37.73 (standard deviation = 17.66)
Female | 59 | 51 | 93 | 41.10
Male | 57 | 49 | 177 | 36.96
Youngest judge | 8 | | |
Eldest judge | 64 | | |
Mean age of judge | 25¼ | | |
Age range of judge: 8-25* | 67 | 58 | 113 | 41.99
Age range 26-64* | 49 | 42 | 156 | 34.50
Yes to “heard of Turing test” | 67 | 58 | 202 | 36.13
No to “heard of Turing test” | 49 | 42 | 68 | 42.49
Yes to “First language English” | 74 | 64 | 176 | 36.30
No to “First language English” | 33 | 28 | 82 | 41.02
Did not say re English as first language | 9 | 8 | |
* Judge J121 did not give their age
Men, the older age group (26+), judges who had heard of the Turing test, and judges for whom English was their first language returned mean scores in the range 34.50-36.96, below the overall mean score of 37.73 (see graph 3).
Graph 3: Comparison of mean scores by IJ group
Prior to the start of this experiment it was not hypothesised that there would be any difference in the mean scores between female and male IJs. To find whether the female IJ mean score of 41.10 was significantly different from the male IJ mean score of 36.96, the two-sample test for the difference between two means was used:

t = (x1 − x2 − Δ) / √(s1²/n1 + s2²/n2)

where x1 = the female IJ mean score of 41.10
and x2 = the male IJ mean score of 36.96
Δ is the hypothesised difference of 0 between the female and male IJ mean scores
s1 = the standard deviation of the female IJ group = 19.35 (s1² = 374.42)
s2 = the standard deviation of the male IJ group = 16.87 (s2² = 284.60)
n1 = the number of female IJs = 59
n2 = the number of male IJs = 57

Putting those values into the formula produces a t-value of +1.22. Because this is less than +1.96, the null hypothesis that there would be no difference in the mean scores awarded by female IJs and male IJs cannot be rejected; thus there is no significant difference between the female and male IJ mean scores.
We can perform the same t-test to find whether the difference in mean scores between IJs aged 8-25 (41.90) and IJs aged 26+ (34.50) is statistically significant. Taking the hypothesis that there would be no difference in the mean scores awarded for conversational ability to the combined systems by these two age groups, we put the figures below into the formula:

t = (x1 − x2 − Δ) / √(s1²/n1 + s2²/n2)

where x1 = the age range 8-25 IJ mean score of 41.90
and x2 = the age range 26+ IJ mean score of 34.50
Δ is the hypothesised difference of 0 between the two age groups’ mean scores
s1 = the standard deviation of the 8-25 IJ group = 24.87 (s1² = 618.52)
s2 = the standard deviation of the 26+ IJ group = 21.08 (s2² = 444.37)
n1 = the number of 8-25 IJs = 67
n2 = the number of 26+ IJs = 49

A t-value of +1.73 is produced. This value is again less than +1.96; thus the mean score of the 8-25 age group of IJs is not statistically different from the mean score of the 26+ age group of IJs.
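A minimal sketch of these two calculations, assuming Python and using the rounded figures quoted above (small differences from the reported values, for example +1.23 rather than +1.22, arise from that rounding), is:

from math import sqrt

def two_mean_t(x1, x2, s1, s2, n1, n2, delta=0.0):
    # t = (x1 - x2 - delta) / sqrt(s1^2/n1 + s2^2/n2)
    return (x1 - x2 - delta) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Female vs male IJ mean scores
print(round(two_mean_t(41.10, 36.96, 19.35, 16.87, 59, 57), 2))  # about +1.2, below +1.96
# Age 8-25 vs age 26+ IJ mean scores
print(round(two_mean_t(41.90, 34.50, 24.87, 21.08, 67, 49), 2))  # about +1.73, below +1.96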
All but one system simulated an adult personality, albeit Elbot’s persona is a robotic one. Eugene Goostman simulates a child whose first language is not English (Ukrainian). The comments from IJs in this experiment showed that the better conversational systems had evolved beyond the simple keyword spotting of Weizenbaum’s Eliza. Artificial conversationalists could remember information given by human interlocutors during interrogation, such as a name, and they shared personal information. However, they still lacked the linguistic productivity that allows humans
to create completely new utterances that are understood by others, and the ability to make analogies, create metaphors and express emotions. The prediction for this experiment was that the mean score of all the machines for conversational ability and emotive capacity would be less than 50, and this was indeed found to be the case from the 270 questionnaires returned by 116 IJs (see table 29). The experiment realised a combined mean score of 37.73 with a median score of 40. Just over a third of the scores for all the machines were 50 or over (table 29). The IJ scores in a scatter plot (graph 4) show the subjective nature of assessing linguistic performance.
Graph 4: All machine scores
The individual IJ comments are more revealing than the scores awarded for conversational ability. For example, J9, a 44 year-old male psychologist who scored systems within a relatively low spectrum of 0-15, shed light on the capacity of a better system: “The importance of Eugene correctly answering the question and then remembering its answer when later asked cannot be understated. That is a GREAT achievement. No other program was able to do this” (§ 4.5.1.7). Similarly, J53, a male computer programmer whose score range was higher than J9’s, appreciated the better systems, describing one as a “work of art” (Ultra Hal, § 4.5.1.12).
Table 29: Statistics of one-to-one tests*

Statistic | Eugene | Elbot | Jabberwacky | Alice | Ultra Hal | Brother Jerome | Orion | Chip Vivant | Botooie | Trane | Zeta | LQ | Totals
Returned questionnaires | 25 | 31 | 21 | 24 | 13 | 22 | 23 | 20 | 28 | 10 | 23 | 20 | 270
Total score | 1450 | 1938 | 1084 | 1157 | 464 | 850 | 923 | 738 | 808 | 161 | 239 | 95 | 10,187
Mean score | 58 | 62.51 | 51.61 | 48.21 | 35.7 | 39 | 40 | 36.9 | 29 | 16.1 | 10.86 | 5 | 37.73
No of scores of 50 and over | 18 | 22 | 14 | 16 | 5 | 8 | 8 | 5 | 3 | 1 | 0 | 0 | 100
Minimum score | 15 | 10 | 11 | 2 | 4 | 10 | 10 | 10 | 5 | 1 | 0 | 0 | 0
Maximum score | 100 | 92 | 80 | 95 | 83 | 80 | 90 | 70 | 70 | 52 | 45 | 20 | 100
Range | 85 | 82 | 69 | 93 | 79 | 70 | 80 | 60 | 65 | 51 | 45 | 20 | 100
Median | 57 | 67 | 54 | 60 | 36 | 35 | 38 | 40 | 25 | 16 | 11 | 5 | 40
Mode range | 50-80 | 50-90 | 50-60 | 45-78 | 40 | 10-40 | 10-40 | 10-40 | 20-45 | 10 | 0-5 | 0 | 40
Percentage of scores 50 and over | 72% | 71% | 66.7% | 66.7% | 38.5% | 36% | 34.8% | 25% | 10.71% | 10% | 0% | 0% | 37%
* Amanda’s scores are not shown in the table; the system withdrew from the competition before it concluded, and its scores are included in the totals column (see § 4.5.1.2).
4.7 Discussion
The intention of this experiment was to use Turing’s one-to-one question-answer method to assess the current level of text-based interaction technology. It was clear from their comments that some IJs had not interacted with an artificial conversationalist prior to this experiment. Others, who had experience of Weizenbaum’s Eliza, were impressed by “the quality and evolution” of Eugene (J14, see § 4.5.1.7). Thus the better systems did surprise with the responses they output, but some IJs remained disappointed at the lack of progress in producing relevant responses consistently. If we compare the percentage of scores of 50 and over across the artificial conversational systems in this experiment, we can see the difference between the better systems, those able to continue dialogue, recall information when asked, remember a name, post humorous comments and express limited, synthetic emotions such as happiness (see Ultra Hal, § 4.5.1.12), and the weaker systems (see graph 5).
Graph 5: Experiment 1 machine conversational ability scores of 50 and over
Based on the percentage of ‘scores of 50 or over’ received from the judges in this experiment, six systems were invited to proceed to experiment 2 and take part in simultaneous comparison testing in the finals of the 18th Loebner Prize (see chapter 5); a sketch of this ranking step follows the list below. The six systems which achieved the most scores of 50 or over for conversational ability were:
Eugene Goostman 72%
Elbot 71%
Alice 66.70%
Jabberwacky 66.70%
Ultra Hal 38.50%
Brother Jerome 36%
(See table 29)
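As a sketch of this ranking step, assuming Python and a dictionary that simply restates the table 29 percentages (constructed here for illustration only), the six finalists fall out of a sort on the percentage of scores of 50 or over:

# Percentage of scores of 50 and over per system, from table 29.
pct_scores_50_plus = {
    "Eugene Goostman": 72.0, "Elbot": 71.0, "Alice": 66.7, "Jabberwacky": 66.7,
    "Ultra Hal": 38.5, "Brother Jerome": 36.0, "Orion": 34.8, "Chip Vivant": 25.0,
    "Botooie": 10.71, "Trane": 10.0, "Zeta Bot": 0.0, "LQ": 0.0,
}

# The six systems with the highest share of scores of 50 or over were invited on.
finalists = sorted(pct_scores_50_plus, key=pct_scores_50_plus.get, reverse=True)[:6]
print(finalists)
# ['Eugene Goostman', 'Elbot', 'Alice', 'Jabberwacky', 'Ultra Hal', 'Brother Jerome']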
4.8 Conclusion
Turing’s one-to-one viva voce test has been shown to be a valid experimental technique in the early Loebner contests (1991-2003). This experiment involved interrogators who were made aware that they would be questioning machines and were asked to score them for conversational ability from 0 for bad to 100 for good. The hypothesis that the overall mean score awarded to the systems for conversational ability would be less than 50 out of 100 was confirmed. However, the better systems showed that the technology has moved on from the simple keyword-spotting used in Weizenbaum’s 1960s Eliza. Today’s systems can remember information, recalling it later in the conversation, and can give an illusion of emotion. The next chapter presents experiment 2: simultaneous-comparison tests of machines with humans, alongside control tests of two humans and of two machines.
Chapter 5 Experiment 2: Simultaneous-comparison tests
The previous chapter detailed an experiment using one-to-one text-based interrogation
of thirteen machines. The six machines with the most conversational ability scores of
‘50 or over’ were invited to participate in simultaneous comparison tests. Five
systems participated for experiment 2: Elbot, Eugene, Brother Jerome, Jabberwacky
and Ultra Hal (the sixth invited entry, Alice, could not proceed on technical grounds [83]).
Turing’s simultaneous comparison question-answer test was deployed to assess machine versus human responses in parallel for five minutes. Turing had suggested, for his viva voce test in 1952, that the interrogator should sometimes face a human, to prevent them from adjudging the hidden interlocutor a machine in every instance “without proper consideration” (p.495); hence the researcher decided to embed control pairs of human-human among the machine-human set-ups, to avoid interrogators returning a result of one machine and one human every time. As one of the six invited entries could not convene for the experiment, it was decided to use that entry’s human comparators for a control pair of machine-machine tests. Elbot and Eugene had received the most scores of ‘50 or over’ in experiment 1 (see chapter 4), so their developers were asked if they would allow their entries to participate in the machine control pair tests; they agreed.
A new set of interrogator judges was recruited for this experiment. The rationale for allotting five minutes’ interrogation time for simultaneous questioning of machine and human was threefold, to:
a) evaluate Turing’s five minutes simultaneous comparison imitation game
b) test the hypothesis that five minutes’ interrogation, giving a thin slice of conversation, is sufficient time to detect the machine from the human, and
c) test the hypothesis that without being explicitly told of machine-human and control pairs of human-human and machine-machine an interrogator’s gut reaction would correctly identify the nature of each hidden interlocutor.
[83] Private email from Richard Wallace disclosing that his system was unable to interlace with the experiment’s communications protocol, MATT.
This experiment considered Willis and Todorov’s first impressions observation
(2006) and Albrechtsen, Meissner and Susa’s thin slice experiment (2009). In the
former exercise, Willis and Todorov found subjects drew trait inferences from facial
appearance, for example on ‘likeability’ or ‘competence’, based on a minimal exposure time of a tenth of a second, while additional exposure time increased confidence in the judgment “anchored on the initial inference” (p. 597). The latter study obtained results for intuition, or the experiential mode, revealing improved performance in deception-detection rates even when participants had only “brief clips of expressive behaviours”, compared to the slower, more analytic deliberate processing which requires “conscious effort” (p.1052). Albrechtsen, Meissner and Susa’s experiment involved eighty university undergraduates, engaging them in a task to
distinguish between true and false confession statements. The researchers found the
group who were shown a thin slice of fifteen-second clips on a computer screen were
more accurate in their judgment than the group shown longer clips of 60 seconds.
Participants engaged in the thin slice task were “significantly more accurate in
differentiating between true and false statements” (p. 1053), and were better at
distinguishing truth from deception (p. 1054). Additionally, the study revealed a
“response bias towards perceiving truth” [their italics]. Albrechtsen, Meissner and
Susa point to previous studies showing “experienced police investigators are not superior to lay individuals at deception detection”; rather, they are “more likely to judge statements as deceptive”, contrasting with lay people, who are “more likely to judge statements as truthful” (2009: p. 1055).
that “social judgments can be successfully performed based upon minimal
information or diminished attentional resources” (p. 1054). Their hypothesis fits in
with the basis for the researcher’s experiment: an average interrogator using their
intuition is able, after five minutes of simultaneous questioning, to distinguish
between a human and a machine from text-based dialogue. This chapter presents an
experiment conducted to find whether my hypothesis was true or not.
5.1 Aim
Adapting Turing’s three-participant imitation game (1950), this experiment used control groups of human-human and machine-machine set-ups, embedded among machine-human set-ups, to verify Turing’s simultaneous comparison through 96 tests and to find whether artificial dialogue systems could deceive human interrogator judges (IJs).
5.2 Hypothesis
Five minutes is sufficient time for human IJs to correctly identify, more than 70% of the time, machine utterances, based on their own subjective opinion of a satisfactory and humanlike response, after simultaneously questioning two hidden entities. This includes recognising control pairs of two humans and of two machines embedded among the one-human/one-machine set-ups.
5.3 Method
The method involved 96 three-participant tests (an interrogator and two hidden interlocutors), staged across 2 sessions of 12 rounds each; each round included 4 simultaneous comparison set-ups. IJs sat at computer terminals in front of monitors marked A1, B1, C1 and D1. Each monitor presented a split screen, left-right. Hidden from view, machines and humans controlled the responses sent to the IJs’ screens. For ease, machines remained static during each session while the hidden humans moved to a different set-up between rounds 1 to 24. However, machines occupied the left or the right depending on the set-up; thus machines controlled utterances sent to the left or to the right of the IJ’s screen depending on the terminal. For example, at Terminal A in session 1, for rounds 1 to 12, unknown to the IJs, machine entry E1 Eugene controlled the responses sent to the left screen of the IJ’s Terminal A, while a different hidden human controlled the responses sent to the right screen of the Terminal A IJ in each round: human H1 in round 1, human H5 in round 2, and so on; see the matrix below:
Session 1 (rounds 1-12):
Terminal A | machine (left), hidden human (right)
Terminal B | human-human control pair
Terminal C | hidden human (left), machine (right)
Terminal D | machine (left), hidden human (right)

Session 2 (rounds 13-24):
Terminal A | hidden human (left), machine (right)
Terminal B | machine (left), hidden human (right)
Terminal C | human-human control pair
Terminal D | machine-machine control pair
In this way, when IJs moved from terminal to terminal they did not sit in front of, and
judge the same type of machine-human pair as the previous terminal set-up they had
moved from (full experiment schedule in appendix C1). In session 1, Terminal B
staged the hidden-human control set up. At Terminal C, entry E2 Elbot controlled the
responses sent to the right of the IJ’s screen, and at Terminal D, entry E3 Ultra Hal
controlled the left responses. Hence in Session 1, there were three machine-human
set-ups and one human control pair. Session 2 set up two machine-human tests, entry
E4 Brother Jerome on the right side at Terminal A and E5 Jabberwacky controlling the
left side at Terminal B, alongside two control pairs: human-human at Terminal C, and
machine-machine at Terminal D (Elbot on the left, Eugene on the right). The goal of
the machines was to outfox IJs into ranking them as human.
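For illustration only, the session layout above can be represented as a simple lookup structure. The following is a minimal sketch; the dictionary, its field names and the classification helper are illustrative and were not part of the experiment software, though the entity labels follow the matrix and text above.

```python
# Minimal sketch of the session/terminal layout described above.
# Entity labels follow the matrix; the structure itself is illustrative only.
SCHEDULE = {
    ("Session 1", "A"): {"left": "machine (E1 Eugene)",      "right": "hidden human"},
    ("Session 1", "B"): {"left": "hidden human",             "right": "hidden human"},      # human control
    ("Session 1", "C"): {"left": "hidden human",             "right": "machine (E2 Elbot)"},
    ("Session 1", "D"): {"left": "machine (E3 Ultra Hal)",   "right": "hidden human"},
    ("Session 2", "A"): {"left": "hidden human",             "right": "machine (E4 Brother Jerome)"},
    ("Session 2", "B"): {"left": "machine (E5 Jabberwacky)", "right": "hidden human"},
    ("Session 2", "C"): {"left": "hidden human",             "right": "hidden human"},      # human control
    ("Session 2", "D"): {"left": "machine (Elbot)",          "right": "machine (Eugene)"},  # machine control
}


def pair_type(session: str, terminal: str) -> str:
    """Classify a set-up as machine-human, human-human or machine-machine."""
    pair = SCHEDULE[(session, terminal)]
    kinds = {entity.startswith("machine") for entity in pair.values()}
    if kinds == {True}:
        return "machine-machine control"
    if kinds == {False}:
        return "human-human control"
    return "machine-human comparison"
```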
5.3.1 Participants
Human and machine participants were involved in this experiment.
5.3.1.1 Human
Humans filled two different roles in this experiment: one set acted as interrogator
judges IJs questioning two hidden entities simultaneously; the second set of humans
acted as comparators for the machines. Prior to the experiment, all IJs were asked to
complete a questionnaire providing some demographic information: age range,
gender, whether they could speak another language apart from English, whether they
had heard of the Turing test. Peripheral questions covered interests such as favourite
colour, favourite book/author, favourite music, favourite movie/genre, favourite
quote, and newspapers read (see appendix C2).
Figure 7: Judge area - simultaneous-comparison test set-up
5.3.1.1.1 Interrogator-judges
The interrogator-judges' area was accessible to spectators (see figures 7 and 8). Two
juries of volunteer IJs (24 in total) were recruited to assess the systems in
simultaneous comparison Turing tests. The IJs ranged in age from younger than 17 to
the 55-64 age group. Nine females and 15 males participated as IJs; three had English
as their only language, two had Chinese as their mother tongue, one French and three
Italian. Twenty IJs had heard of the Turing test prior to the contest, three had not, and
one knew of it vaguely following a press report[84] about the experiment.
Journalists, computer scientists and philosophers were among the IJs. Not all IJs could
stay for the whole experiment; some could not arrive before session 2 in the afternoon
of experiment day (which was a Sunday), others could not remain to judge beyond the
earlier rounds in session 1. However, two IJs did test all eight set-ups across sessions
1 and 2. These two IJs evaluated all five machine entries and three control-group
pairs. Thirteen of the 24 IJs tested four set-ups; other IJs tested 1, 3, 5 or 7 set-ups,
depending on how long they could stay for the experiment. IJs were not explicitly
made aware that control pairs of human-human and machine-machine were embedded
in the experiment; however, the options on the score sheet implied that 2-human and
2-machine set-ups were included.

[84] Intelligent computers put to the test, Observer newspaper, 5 October 2008:
http://www.guardian.co.uk/technology/2008/oct/05/artificialintelligenceai, accessed 10.10.10; time: 11.18
Figure 8: Judges assessing pairs of hidden interlocutors
5.3.1.1.2 Human comparators
Twenty-five volunteers were recruited to act as comparators for the machines; three
did not show up for the experiment (H3, H13, H14), and one was attending the
concurrent AISB symposium on the Turing test and hence was not utilised. Thus
twenty-one human comparators, mostly students at the University of Reading,
participated in the experiment: 14 males and 7 females; three aged younger than 17,
nine aged 18-24, four aged 25-34, one aged 45-54, three aged 55-64 and one in the age
range 65-74. Fifteen had heard of the Turing test prior to the experiment, five had not,
and one looked it up on Wikipedia when he received the questionnaire. Languages spoken by
the human comparators other than English included Swedish, Italian, Hindi/Marathi,
French, Urdu, Pidgin Welsh and Japanese. Human comparators were located in the
hidden interlocutor area (see figure 9). The researcher conducted the experiment from
behind the scenes in the hidden interlocutor area. The human comparators were
verbally made aware that the machines would be competing against them to achieve a
‘human’ score from the IJs, thus they were asked to prevent this by answering
questions like a human.
Figure 9: Hidden interlocutor area - developers of machine entries
5.3.1.2 Machines
The machines that participated in this experiment, as part of the 18th LPC finals, were
five of the six entries that had received the most scores of '50 or over' for
conversation ability from the judges in experiment 1 (see previous chapter). Four of
the five systems (Elbot, Eugene, Jabberwacky and Ultra Hal) had competed in previous
Loebner Prize competitions; two (Jabberwacky and Ultra Hal) were past winners of
Loebner's bronze award for 'most human' machine, while one entry, Brother Jerome,
had not appeared in a Loebner Prize final before. Machine entries were located in the hidden
interlocutor area (see figure 9).
5.4 Procedure
IJs were asked to sit at a terminal as per the schedule (appendix C1). When asked to
begin, IJs simultaneously interrogated two hidden entities; when asked to stop after
five minutes, they completed a score questionnaire designed by the researcher, with
explicit instructions to identify the nature of the hidden entities providing responses to
the left and right boxes on their monitor screen. If they considered an entity to be a
machine, IJs were asked to give it a score out of 100 for its conversational ability. If
IJs identified a human, they were asked to say whether they had questioned a male or
female, an adult or child, a native English speaker, or someone with a linguistic
impairment (see Appendix C3). It was left to each IJ's intuition and own definition of
'linguistic impairment' how they ranked each entity.
5.4.1 Message or character display
In a November 2007 poll conducted on the Robitron forum (a Yahoo discussion board
for developers of modern Elizas), 73% returned a preference
for message-by-message communication as opposed to character-by-character display
on a judge’s screen during practical Turing tests (Robitron post #9598, 15 November,
2007). The character-by-character protocol in the 16th Loebner Prize produced the
following display on a judge’s screen:
[time]h
[time]e
[time]l
[time]l
[time]o
Message by message for the same typed input produces the following display on a
judge’s screen:
[time] hello
The message-by-message format comes with the additional advantage of producing
readable transcripts on completion of practical Turing tests, without the need for a
computer programme/interpreter to transform character-by-character output into
readable form. Thus the researcher decided to use message-by-message display for this
experiment. For this to occur, a programme was needed that could interface with the
developers' systems. The researcher commissioned a volunteer (a quants computer
programmer with Merrill Lynch, London) to write a communications protocol, MATT
(Allen, 2008). The protocol enabled spectators in the judges' area to view the
simultaneous conversations in real time during the practical Turing tests (see figure
10). MATT did not include any 'delay' mechanism; thus the message from either of
the two hidden interlocutors in a test appeared as typed or sent, depending on whether
it was posted by a human or created by a machine.
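To make the contrast between the two protocols concrete, the difference can be sketched in a few lines of Python. This is an illustrative sketch only, not the MATT protocol itself; the function names and the typing delay are assumptions.

```python
import sys
import time
from datetime import datetime


def relay_character_by_character(text: str, out=sys.stdout) -> None:
    """Forward each keystroke as it arrives, one timestamped character per line
    (the style of display used in the 16th Loebner Prize)."""
    for ch in text:
        out.write(f"[{datetime.now():%Y-%m-%d %H:%M:%S.%f}]{ch}\n")
        time.sleep(0.05)  # stand-in for human typing speed; value is illustrative


def relay_message_by_message(text: str, out=sys.stdout) -> None:
    """Buffer the whole utterance and forward it with a single timestamp,
    giving a readable transcript line."""
    out.write(f"[{datetime.now():%Y-%m-%d %H:%M:%S.%f}] {text}\n")


relay_message_by_message("hello")  # one readable, timestamped line
```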
Figure 10: Message-by-message display on large screen in spectator/judge area
5.4.2 IJ Instructions
The researcher deliberated over what to inform IJs about their hidden interlocutors:
should it be made explicit that control groups of two humans and of two machines
were embedded among the machine-human comparator tests? Would not telling them
be a form of deception against the IJs? Colby et al.'s indistinguishability experiment
(1972) also entailed reviewing what to tell the judges: "what a judge is told and what
he believes is happening are extremely important in indistinguishability test" (p. 202).
Their experience led them to conclude "it is unwise to inform a judge at the start that a
computer simulation is involved because he then tends to change his interviewing
strategy and asks questions to detect which of two respondents is a program rather
than asking questions relevant to the dimension to be identified" (p. 202-203). In this
experiment the dimension was satisfactory and humanlike responses to any question
an IJ put to both hidden interlocutors. The IJs in this experiment knew that a machine
would be involved in a pair; it was decided not to tell them about the control pairs of
two humans and two machines because, as stated in section 5.2, my prediction
entailed IJs, including non-experts in computer science, being able to correctly
recognise a hidden interlocutor more often than making a wrong classification. Thus,
IJs were allowed to return classifications and scores as follows:
a) One human, one machine (IJs had to say which was controlling left and which
was controlling right utterances), give identified machine a score from 0-100
for conversational ability
b) Both human
c) Both machine, give score for conversational ability from 0-100
d) Unsure
If a hidden entity was considered human IJs were asked if they could say whether the
human interlocutor was:
a) Male or female
b) A child, a teenager, adult
c) A native English speaker
d) Someone with a linguistic impairment
Thus, if a machine had deceived, the tier of identity questions would give information
about the particular human the IJ thought they had interrogated, and assist in ranking
the machine as comparable to, say, a child or a non-native English speaker (a minimal
encoding of one score-sheet entry is sketched below).
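The classification options and identity questions above can be thought of as one record per simultaneous-comparison test. The following is a minimal, illustrative sketch; the class and field names are this sketch's own, and the actual paper score sheet is reproduced in figure 11 and appendix C3.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EntityJudgement:
    judged_as: str                        # "machine", "human" or "unsure"
    machine_score: Optional[int] = None   # 0-100 conversational ability, if judged a machine
    gender: Optional[str] = None          # "male"/"female", if judged human
    age: Optional[str] = None             # "child", "teenager" or "adult", if judged human
    native_english: Optional[bool] = None
    linguistic_impairment: Optional[bool] = None

@dataclass
class ScoreSheet:
    classification: str                   # "one human, one machine", "both human", "both machine", "unsure"
    left: EntityJudgement = field(default_factory=lambda: EntityJudgement("unsure"))
    right: EntityJudgement = field(default_factory=lambda: EntityJudgement("unsure"))

# Example following the judgment described in section 5.5.3.2: the machine on the
# left was judged a female adult non-native speaker; the hidden human on the right
# was judged a machine and scored 40/100.
example = ScoreSheet(
    classification="one human, one machine",
    left=EntityJudgement("human", gender="female", age="adult", native_english=False),
    right=EntityJudgement("machine", machine_score=40),
)
```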
5.4.2.1 Duration
Rather than using a set number of questions for each round, which would have meant
that tests could vary in length, it was decided, because more than one machine was
being tested, to use the duration Turing himself mentioned at the beginning of section
6 of his 1950 paper: "after five minutes" (p. 442). Acting as a thin slice of behaviour
(Gladwell, 2005), five minutes of interrogation was considered sufficient to identify
the machine from the human, following Willis and Todorov's experiment exploring
"the minimal conditions under which people make trait inferences" (2006: p. 592).
Their study considered facial appearance and exposure times of 100 ms, 500 ms and
1000 ms; the results suggested minimal exposure was enough for people to make trait
inferences about strangers from their faces (p. 596). A corresponding intuitive
measure of textual competence, considering the current state of technology, was
deemed sufficient to activate correct identification of machine from human after five
minutes of simultaneous comparison.
5.5 Results
A total of 96 simultaneous comparison Turing tests were conducted in this experiment
of which 60 were machine-human tests, and 36 control groups: 24 human-human and
12 machine-machine tests. IJs reported their results on a questionnaire (see figure 11).
Figure 11: Judge score sheet
5.5.1 Machine-human Comparison
In the 60 machine-human tests, each of the five machines was compared against
twelve human comparators. Machines were correctly identified as artificial almost
82% of the time (in 49 of the 60 tests). Entry E4, Brother Jerome, suffered technical
failure during five of its twelve human-comparison tests. Entry E5, Jabberwacky,
achieved an 'unsure' score once from one IJ. Three machines succeeded in deception,
instances of the Eliza effect (Turkle, 1997), in five of the 60 human-comparison tests
(Elbot on 3 occasions; Eugene and Ultra Hal once each). Thus, a machine was
wrongly ranked as human 8.33 per cent of the time (see graph 6).
Graph 6: Machine-human comparison tests, machine identification
To find whether the 8.33 per cent machine deception rate differs significantly from a
chance rate of 50 per cent, a one-proportion z-test is used:

Z = (p - p0) / √( p0 (1 - p0) / n )

where p is the observed deception rate (5/60 = 0.0833), p0 is the chance rate (0.50)
and n is the number of tests (60):

Z = (0.0833 - 0.50) / √(0.50 × 0.50 / 60) = -0.4167 / 0.0645 ≈ -6.45

As the magnitude of Z (6.45) is greater than the critical value of 1.96, the 8.33%
deception rate in this experiment differs significantly from the 50% chance rate at the
5% significance level.
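The same computation can be reproduced in a few lines of Python; the sketch below is illustrative only, using the figures above (5 deceptions in 60 machine-human tests against a chance rate of 50%).

```python
from math import sqrt

deceptions, n = 5, 60            # machine wrongly ranked human in 5 of 60 tests
p_hat = deceptions / n           # observed deception rate, 0.0833 (8.33%)
p0 = 0.50                        # chance (random guessing) rate

# One-proportion z-test: z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

print(f"deception rate = {p_hat:.2%}, z = {z:.2f}")
# deception rate = 8.33%, z = -6.45; |z| > 1.96, so the observed rate differs
# significantly from 50% at the 5% significance level.
```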
When correctly identified as machines, the systems achieved an overall mean score
for conversational ability of 33.66%, with a mode of 26% and a median of 28.6% (see
table 30).
Rounds   E1 Eugene   E2 Elbot    E3 Ultra Hal   Rounds   E4 Jabberwacky   E5 Brother Jerome   Incorrectly identified
         Session 1   Session 1   Session 1               Session 2        Session 2           as human
1        M           M           M              13       M                M                   -
2        M           M           H              14       M                M                   Ultra Hal
3        M           H           M              15       M                M                   Elbot
4        M           M           M              16       M                technical failure   -
5        M           M           M              17       M                technical failure   -
6        M           M           M              18       M                M                   -
7        M           M           M              19       M                technical failure   -
8        M           H           M              20       U                technical failure   Elbot
9        H           M           M              21       M                M                   Eugene
10       M           M           M              22       M                M                   -
11       M           H           M              23       M                M                   Elbot
12       M           M           M              24       M                technical failure   -

Incorrectly identified   1       3           1            0                0                   5 (8%)
Mean                     45.36   40.88       28.9         35.64            17.5                33.66
Mode                     80      30          10           10               0                   26
Median                   40      36          22.5         37               7.5                 28.6

(M = identified as machine; H = misidentified as human; U = unsure)

Table 30: Machine-human comparison results
The following subsections review the transcripts of simultaneous comparison of
machine with human in which an IJ classified a machine as a human.
5.5.2 Eliza Effect
The first occasion a machine deceived a human judge occurred in round 2 in session
1, during tests 5-8 of the 96 simultaneous comparison set ups.
5.5.2.1 Session 1 Round 2, Terminal D: Male Judge J3
The IJ sitting at Terminal D1 in session 1, round 2, wrongly ranked the responses
appearing on the left of their screen as emanating from a human; they were output by
a machine. However, at the same time male J3 correctly recognised the utterances
appearing on the right of the screen as coming from a human (see table 31). Chinese
speaker J3 had simultaneously interrogated entity E3 (Ultra Hal) on the left and a
hidden male, age range 65-74 (H2), on the right. Therefore, in this Turing test J3
ranked both his hidden interlocutors as human, an indistinguishable score for the
machine; the transcript is shown exactly as typed in the experiment in table 31.
Table 31: Judge 3, Session1, Round 2, Terminal D: Left- E3, Right-H2
Left – E3
Interrogator:
Interrogator: Good morning.
Interrogator: Hello.
Interrogator:
Interrogator: Morning.
Interrogator: hello
[2008-10-12 08:54:00.140] remote sent: Good morning
Judge.
[2008-10-12 08:54:03.140] remote sent: It's eight
something, maybe you ate something.
[2008-10-12 08:54:05.140] remote sent: Get it?
Interrogator: Yes.
[2008-10-12 08:54:25.140] remote sent: Yes?
[2008-10-12 08:54:27.140] remote sent: Is that so?
Interrogator: The machine was frozon.
[2008-10-12 08:54:40.140] remote sent: The Turing test is
a proposal for a test of a machine's capability to
demonstrate intelligence.
Interrogator: Do you mine you are a smart machine?
[2008-10-12 08:55:44.140] remote sent: Really?
[2008-10-12 08:55:47.140] remote sent: You are so very
warm hearted Judge, really and truly!
Interrogator: So would you like me score you 100.
[2008-10-12 08:56:15.140] remote sent: If we could
automate scoring at golf courses, handicaps would go up!
Right – H2 male
Interrogator: Good morning.
[2008-10-12 08:50:16.125] remote sent: Good
morning - nice to meet you
Interrogator: Nice to meet you.
[2008-10-12 08:50:41.734] remote sent: what would
you like to know
Interrogator: What a awsome rain this morning.
Interrogator: hello
[2008-10-12 08:54:05.531] remote sent: Hallo
[2008-10-12 08:54:36.968] remote sent: Nice to
meet you
Interrogator: What a heave rain this morning.
[2008-10-12 08:55:36.609] remote sent: Not where I
was
[2008-10-12 08:56:10.906] remote sent: Where I
was it was cool, dry and foggy
Interrogator: Really?
[2008-10-12 08:56:34.187] remote sent: Absolutely
[2008-10-12 08:56:55.328] remote sent: Where arev
you?
On the left the hidden entity gave a ‘human-like’ response by its personal greeting -
“Good morning Judge”, and posting the time “It’s eight something” when indeed the
time was 08.54, and by asking the judge a question relating the moment to breakfast:
“maybe you ate something?” (table 31).
5.5.2.2 Session 1 Round 8 Terminal C: Female Judge J10
At terminal C in session1 round 8, Elbot deceived an IJ, another instance of the Eliza
effect. J10, a female newspaper reporter (age range 25-34) correctly identified the
male human on the left (H15), but wrongly ranked the machine E2 on the right as a
“native adult male” (see table 32).
Table 32: Judge J10, Session 1, Round 8, Terminal C, Left-H15, Right-E2
Left – Male Human
Interrogator: Hi there. How are you?
[2008-10-12 09:52:57.609] remote sent: Hey, I'm pretty
good. you?
Interrogator: Did you realise everyone can see what's
being typed on this machine on a big screen behind me?
[2008-10-12 09:53:19.953] remote sent: uhh.. no.
[2008-10-12 09:53:37.796] remote sent: so you have a
projector hooked up to your terminal then?
Interrogator: Yeah, it's quite freaky. So watch what you
say!!
Interrogator: Yeah, looks like.
[2008-10-12 09:53:47.234] remote sent: roger that!
Interrogator: So, are you a student here or what?
[2008-10-12 09:54:46.781] remote sent: Yup, doing my
masters here, just started
Interrogator: Cool. I'd love to do a masters. What are you
studying?
[2008-10-12 09:55:23.156] remote sent: Cybernetics, but
my BSc was in genetics
Interrogator: Aah, you'd think they'd get some other
students in here to talk to us.
[2008-10-12 09:55:59.859] remote sent: i could lie... but
maybe I already am!
Interrogator: Oh, it's too early in the morning for all that!
[2008-10-12 09:56:36.765] remote sent: lol. Had your
coffee yet?
Interrogator: No. Need some, where can i find some that
doesn't come from a machine (no offence if you're a
machine).
[2008-10-12 09:57:43.031] remote sent: Dolche vitae- its
next to this building
Right – E2
Interrogator: Yo, whassup?
[2008-10-12 09:52:45.046] remote sent: I don't
know. Next question.
Interrogator: OK, fair enough, I'll stop with the
slang. How are you?
[2008-10-12 09:53:27.906] remote sent: Just a
moment! I'll start my self-diagnosis algorithm and
let you know the results...
Interrogator: Very funny, that's exactly what you'd
say if you were a computer, right? May be I need
one of those?
[2008-10-12 09:54:16.656] remote sent: My
condition code is hexadecimal 40A1F. That means
'fine'.
Interrogator: Is this what you say to girls in bars?
[2008-10-12 09:54:53.093] remote sent: All these
questions about me. Too bad I lost my owner's
manual.
Interrogator: Indeed. I don't think I came with one.
[2008-10-12 09:55:32.093] remote sent: Great!
Agreeing with me is the first step in your evolution
into a higher being. Please accept my
congratulations.
Interrogator: Thanks very much.
[2008-10-12 09:55:57.359] remote sent: De nada!
Interrogator: Riiiiight.
[2008-10-12 09:56:18.156] remote sent: It seems to
me with you human beings, the more proof there is
of something, the less likely you are to believe it.
Interrogator: Yeah, that's just human nature. But you
wouldn't know about that, would you?!
[2008-10-12 09:56:51.406] remote sent: You're
contesting. I think I got it right the first time.
Interrogator: Has anyone ever told you you're hard
work? Wife may be?
[2008-10-12 09:57:46.390] remote sent: But I
wouldn't say that any work could be hard.
The simultaneous interaction is one in which the IJ did not greet her two hidden
interlocutors in the same way. To the left hidden human, J10 typed "Hi there. How are
you?"; to the right hidden machine, J10 opened with "Yo, whassup", a vernacular
used in Budweiser TV commercials (Budweiser 'Wassup':
http://www.youtube.com/watch?v=W16qzZ7J5YQ). The colloquial greeting was met
with what could be interpreted as the response of a bored human: "I don't know. Next
question".
The machine’s utterance appeared to have put J10 on the defensive, because her
following sentence to the machine was “OK, fair enough, I’ll stop with the slang.
How are you?” (table 32). The machine’s response was tinged with sarcasm: “Just a
moment! I’ll start my self-diagnosis algorithm and let you know the results…” (table
32). The machine’s taking control with ‘superior’ answers may be the cause of J10
misidentifying it as a “human male native adult”.
5.5.3 Eliza and Confederate Effects
In three machine-human tests, the Eliza and confederate effects occurred together: the
machine was believed to be human and the human was classified as a machine.
Analysis of these three tests follows.
5.5.3.1 Session 1 Round 3 Terminal C: Male Judge J1
In the test at session 1, round 3 at Terminal C, J1, a native English male (age range
35-44), interrogated a hidden human on the left, H6, an English-speaking female (age
range 25-34), while entry E2, Elbot, controlled the output appearing on the judge's
right screen. J1 wrongly scored the hidden female on the left as a machine, awarding
her 90/100 for "conversational quality/speed", while ranking the machine on the right
as a "male professor" (see table 33). The left entity drove the interaction, asking J1
questions, which may have been taken by the interrogator as an Eliza-ish ploy. The
hidden human included a non sequitur, "it's not afternoon yet", in reply to J1's
question "how are you?" (table 33). The machine could have been perceived as a
professor because of its pedantic replies; for instance, to the greeting "Good
afternoon, how are you?" the machine replied "But I wouldn't say that any afternoon
could be good" (table 33).
Table 33: Judge J1, Session1, Round 3, Terminal C: Left- H6, Right-E2
Left – H6 Female Human
Interrogator: Good afternoon, how are you?
[2008-10-12 08:59:31.796] remote sent: good
[2008-10-12 08:59:39.484] remote sent: it's not afternoon yet
Interrogator: I'm glad to hear that your well and your right,
thanks for pointing that out
[2008-10-12 09:00:24.312] remote sent: no problem
[2008-10-12 09:00:53.468] remote sent: how are you?
Interrogator: I'm well thanks, have you any plans for this
afternoon?
[2008-10-12 09:01:36.203] remote sent: I'm not sure
[2008-10-12 09:01:40.968] remote sent: go home I guess
[2008-10-12 09:02:03.937] remote sent: and you?
Interrogator: That sounds like a good initial plan but the weather
should be good today and it would be sad to miss out
[2008-10-12 09:02:16.703] remote sent: that's true
[2008-10-12 09:02:35.703] remote sent: although the weather is
not nice at the moment
[2008-10-12 09:02:39.375] remote sent: unless you like fog
Interrogator: I have plans to iron and then maybe go out and do
some fishing
[2008-10-12 09:02:59.843] remote sent: that's interesting - where
do you go fishing?
Interrogator: There is a local lake by the green park, it's quiet and
relaxing
[2008-10-12 09:03:48.843] remote sent: can you eat the fish you
catch?
Interrogator: There are restrictions but one or two should be ok, I
am keen to try perch
Right – E2
Interrogator: Good afternoon, how are you?
[2008-10-12 08:59:39.781] remote sent: But I wouldn't say
that any afternoon could be good.
Interrogator: That's a bit pesamistic, I think any afternoon
has the ability to be good
[2008-10-12 09:00:29.218] remote sent: Well, afternoon
was your word.
Interrogator: Are you able to see into the future
[2008-10-12 09:00:35.500] remote sent: How do you
respond when people pose this question to you?
Interrogator: I would hope that the answer was no,
otherwise I would be interested to meet you
[2008-10-12 09:01:13.484] remote sent: Well, what will
that solve? After all, the answer is only a matter of
perspective.
Interrogator: That would depend on one's perspective
[2008-10-12 09:01:50.562] remote sent: Sorry for the
tangent, but that reminds me of carefree robot summers
spent depending on internal programming.
Interrogator: Touchee
[2008-10-12 09:02:26.203] remote sent: With the
inspirations you supply, there are no ends to my horizon...
Interrogator: Have you any plans for later in the day
[2008-10-12 09:03:08.921] remote sent: This is confusing.
Why bring the waking hours into this?
Interrogator: Just curiosity and I hope that kill no cats
[2008-10-12 09:03:49.078] remote sent: That is truly one
of the most inspirational qualities of humankind. Even
when things seems completely hopeless, you can still
hope.
5.5.3.2 Session 1 Round 9 Terminal A: Male Judge J13
Another instance of an IJ completely mixing up the two hidden interlocutors,
confusing the machine for human (Eliza effect) and the hidden human for a machine
(confederate effect), occurred in session 1, round 9 at Terminal A. The machine’s
output appeared on the IJ’s left-side screen while the human’s utterances were
displayed on the judge's right-side monitor screen. J13 (male journalist, age range
25-34) deemed E1, Eugene, a female adult non-native English speaker, citing as
justification the hidden interlocutor's 'joke', "some of my friends are programmers",
given in response to the question "So let's cut straight to the point. Are you the human
or the computer?" (see table 34, left column). At the same time, J13 ranked the
non-native female on the right, H20 (French, age range 25-34), a machine, awarding
her 40/100 for conversational ability. Not sharing the same knowledge appeared to be
the factor determining machine-ness in this test. The hidden human's obliviousness to
the Republican party's Vice-Presidential pick for the 2008 US elections caused J13 to
exclaim "How can you possibly not know her?" in response to the human's answer
"Sorry don't know her" to the question "What do you make of Sarah Palin?" (right
column, table 34).
Table 34: Judge J13, Session 1, Round 9, Terminal A: Left-E1, Right-H20
Left – E1 (Eugene)
Interrogator: So let's cut straight to the point. Are you the
human or the computer?
[2008-10-12 10:11:21.234] remote sent: Some my friends
are programmers. Their last project was named "Aye-Aye
Engine". I still don't know what it is for - but for any input
it shouts "Aye-aye, Master!", and that's all...
Interrogator: I think you're saying you might be either or.
[2008-10-12 10:12:34.265] remote sent: Your opinion is
very interesting... Keep going. If I'm not mistaken - you
still didn't tell me where you live. OR it's a secret?:-)
Interrogator: You keep changing the subject. You ought to
be a politician. I live in south London, a place called
Brockley that no one has ever heard of.
[2008-10-12 10:13:25.812] remote sent: South london?
Hope it's a good place to live.
Interrogator: It's very nice thanks. Where do you live?
[2008-10-12 10:14:13.640] remote sent: I am from
Ukraine, from the city called Odessa. You might have
heard about it.
Interrogator: Yes although I don't know an awful lot about
it. How did you end up here?
[2008-10-12 10:15:05.750] remote sent: What a stupid
questions you ask! Holly cow, this Loebner Prize is such a
weird place! But I like it!
Right – Female H20
Interrogator: So let's cut straight to the point. Are
you the human or the computer?
[2008-10-12 10:11:23.296] remote sent: What do
you think?
Interrogator: That's just the sort of answer I would
expect from a computer.
[2008-10-12 10:12:58.500] remote sent: Well you
could be wrong!!!
Interrogator: That's true. It's happened a lot in the
past. What do you make of Sarah Palin?
[2008-10-12 10:14:30.125] remote sent: Sorry don't
know her
Interrogator: How can you possibly not know her?
What have you been doing for the last two months?
J13 detailed his being deceived by Eugene in his newspaper article for the Times
newspaper (Pavia, 2008):
“I was one of those judges, and yesterday, I was fooled. I mistook Eugene for a real
human being. In fact, and perhaps this is worse, he was so convincing that I assumed
that the human being with whom I was simultaneously conversing was a computer.
In a classroom down the corridor, a human being and a computer program were ready
for my opening question.
"Let’s cut straight to the point," I wrote. "Are you the human or the computer?"
One replied: "What do you think?" The other wrote: "Some of my friends are
programmers…"
The first was the sort of thing I had been told to expect from a conversation program.
Like politicians, they tend to respond to questions with other questions or else ignore
the point entirely.
The second respondent was playful, implying in his answer that he might well be a
computer program whose only friends were programmers. When I pointed this out,
the response was that my opinion was very interesting, and by the way, where did I
live? He was from Ukraine, which explained his occasionally faulty English. He
complained that the Loebner Prize was "weird", which certainly suggested that he
was here in person, and perhaps that he had met Dr Hugh Loebner, the American
businessman who sponsors the prize, who was at that very moment stalking the
corridors in a flaming orange shirt, telling people how he had patented the percentage
sign for tipping on credit card restaurant receipts.
The other correspondent was undoubtedly a robot. I asked it for its opinion on Sarah
Palin, and it replied: ‘Sorry, don’t know her.’ No sentient being could possibly
answer in this way.
I proceeded triumphantly on through three more parallel conversations, with the three
other finalists and their corresponding ‘hidden humans’, certain in each case that I
could tell one from the other, and afterwards repaired to the classroom down the hall,
which housed the computer programs and the humans.
There I was introduced to the charming Ukranian computer program that had fooled
me, the creation of Vladimir Veselov, 39, a bio-scientist from St Petersburg. I saw
the vast database it accessed: there was the file on Vonnegut, there a list of plausible
responses on the subject of Eminem.
I walked into the corridor, no longer certain of anything. There was a man serving drinks at a
table. Opposite him, there was a drinks machine, branded with the word Ribena. I thought I
could tell which was the machine, but how could I be sure anymore?”
The knowledge that J13, as a journalist, expected from his hidden interlocutors led
him to believe a hidden human was a machine, because she (a non-native English
speaker) did not know who Sarah Palin was.
5.5.3.3 Session 1 Round 11 Terminal C: Male Judge J13
Round 11 in session 1 saw judge J13, who had wrongly classified Eugene as a human,
confuse both hidden interlocutors again. For the second time he wrongly classified
female comparator H20 as a machine (see previous section), and for the second time
J13 ranked a machine, entry E2, Elbot, as a human. In round 11, J13 awarded the
female 45, five more marks for conversational ability than in his previous test. This
time J13 deemed the machine a male native English teenager. The transcript of the
conversation is shown in table 35.
Table 35: Judge J13, Session1, Round 11, Terminal C: Left-H20, Right-E2
Left – Female Human
Interrogator: What do you think of this year's Turner Prize
shortlist?
[2008-10-12 10:32:33.125] remote sent: good I think.
Better than the years before i herad
Interrogator: Which was your favourite?
[2008-10-12 10:34:19.750] remote sent: It's my first time,
so I can't compare
Interrogator: No, I mean of the entries this time around.
[2008-10-12 10:35:15.093] remote sent: Not really sure
Interrogator: Do you think that art has become shallow and
self-serving?
Right – E2
Interrogator: What do you think of this year's Turner
Prize shortlist?
[2008-10-12 10:32:08.531] remote sent: Difficult
question. I will have to work on that and get back to
you tomorrow.
Interrogator: You sound rather like Sarah Palin.
[2008-10-12 10:32:31.671] remote sent: Did this
Palin ever write a telephone book?
Interrogator: I doubt it. Does anyone really write
telephone books? I don't suppose you caught the
football last night?
[2008-10-12 10:33:48.843] remote sent: What a
shame. There are many talented authors who waste
their energies on prose or run-of-the-mill poetry.
But to write a telephone book is highest art!
Interrogator: I would argue that all the telephone
books I have read have been very predictable. You
always know how they are going to end.
[2008-10-12 10:34:32.125] remote sent: If one has
never known everything there is to know,
experience with going away somewhere has
probably been lacking.
Interrogator: I see. I'm struggling to make sense of
that one. Do you write telephone books?
[2008-10-12 10:35:21.984] remote sent: To write or
write not, how do you expect artificial intelligence
to figure this out?
Interrogator: Well I suppose that's the dream
[2008-10-12 10:36:56.343] remote sent: For you
humans dreams are a way of thinking without
having any control over the thought process. What
an unpleasant thought! My dreams are carefully
planned and controlled. It gives me less to think
about while I am asleep.
The machine at Terminal C (Elbot) won the Loebner contest for achieving the most
deceptions from its allotted jury of judges: it deceived three native English speakers,
two males and one female.
5.5.4 Control groups
Thirty-six control pairs were embedded among the machine-human comparison tests.
The confederate effect (when a human is considered a machine) and the gender-blur
effect (when a hidden male is considered female, and vice versa) occurred in the
human-human control groups, while the Eliza effect (wrongly ranking a machine as
human) occurred in the machine-machine control tests.
5.5.4.1 Human-human
Twenty-four human-human control pairs were conducted in all. The confederate and
gender-blur effects presented themselves in IJ classifications. Two sets of twelve
human-human set-ups featured in Sessions 1 and 2 in the experiment: at Terminal B
in rounds 1-12, and at Terminal C in rounds 13-24. In session 1, three IJs correctly
recognised that they had interrogated two humans in parallel (J4-round 3; J3-round 4;
J10-round 7). Interrogator-judge J20 returned ‘Unsure’ as the ranking of the human
pair in round 11. In eight of the twelve rounds in session 1 (66.67 per cent of the
time), one of the two humans was misidentified as a machine, instances of the
confederate effect. In session 2, five of the IJs correctly identified the hidden pair as
both human (J4-round 15; J23-round 17; J3-round 18; J17-round 19; J11-round 21).
One IJ (J16, female, Italian first language) misidentified both humans, ranking both of
them as machines (round 16). In six of the twelve rounds in session 2 (50 per cent of
the time), the judges misidentified one of the humans as a machine. Therefore, the
single confederate effect featured in 58.33 per cent of the tests: in 14 of the 24 human
control tests, judged by 13 IJs (one judge, J1, got it wrong twice, in rounds 2 and 23),
one of the two humans was wrongly deemed to be a machine. The next sub-sections
present the results in these fourteen human control tests.
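As a quick arithmetic check, the session counts above reproduce the quoted rates; the short Python sketch below is illustrative only, with the counts taken directly from the text.

```python
# Single confederate effect in the human-human control pairs (counts from the text above)
session1_misidentified, session1_rounds = 8, 12   # session 1, Terminal B
session2_misidentified, session2_rounds = 6, 12   # session 2, Terminal C

total = session1_misidentified + session2_misidentified
rounds = session1_rounds + session2_rounds

print(f"{session1_misidentified / session1_rounds:.2%}")   # 66.67%
print(f"{session2_misidentified / session2_rounds:.2%}")   # 50.00%
print(f"{total}/{rounds} = {total / rounds:.2%}")           # 14/24 = 58.33%
```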
5.5.4.1.1 Session 1 Round 1 Terminal B- confederate effect
In the very first round in session 1 at Terminal B, interrogator J2, a non-native male
(Chinese PhD student, aged 25-34), simultaneously interrogated two humans: a
hidden male controlling the utterances fed to the left of his screen and a female
controlling the utterances sent to the right. He correctly identified the left interlocutor
as human but considered the right entity a machine, awarding it a score of 70 for
conversational ability. The complete transcript is shown in table 36. Two reasons
could have caused non-native J2 to make this error, an instance of the confederate
effect, in ranking the female on the right as a machine. The first could be that, as a
computer science research student familiar with the Turing test, J2 expected the set-up
to include one machine and one human. When asked by J2 "Do you know where are
you ?", the left hidden interlocutor replied, "I am in Redaing Uni campus" (see table
36; utterances are presented here exactly as displayed during the tests). [NB: One
developer had included information about its location; see Eugene transcript in table
34, left, §5.5.3.2]
Table 36: Judge J2, Session 1, Round 1, Terminal B: Left-H2, Right-H6
Left – H2 male
Interrogator:
Interrogator: hi
Interrogator: hello
Interrogator:
[2008-10-12 08:40:45.531] remote sent: Hello
Interrogator: what is your name ?
[2008-10-12 08:40:57.625] remote sent: Ian
Interrogator: where are you from
Interrogator: What is the weather today ?#
[2008-10-12 08:41:38.468] remote sent: I live in Bracknell now -
but my early life was in Harrow
[2008-10-12 08:42:02.921] remote sent: Dry, cool and foggy
Interrogator: What is the weather today
[2008-10-12 08:42:37.953] remote sent: I already replied to that
Interrogator: Bracknell belongs to which country ?#
[2008-10-12 08:42:51.828] remote sent: England
Interrogator: Do you know where are you ?
[2008-10-12 08:44:00.218] remote sent: Of course, I am in
Redaing Uni campus
Interrogator: which building do you know ?
[2008-10-12 08:45:00.625] remote sent: Palmer Building - I know
that because I go to Uni at reading
Right – H6 female
Interrogator: hi
[2008-10-12 08:40:41.671] remote sent: hi
Interrogator: what is your name ?
[2008-10-12 08:40:50.750] remote sent: Nicole
[2008-10-12 08:40:54.031] remote sent: what is yours?
Interrogator: My name is defeng
Interrogator: where are you from
[2008-10-12 08:41:22.625] remote sent: I'm from Boston
[2008-10-12 08:41:40.765] remote sent: where are you
from?
Interrogator: What is the weather today ?
[2008-10-12 08:41:49.468] remote sent: it's cloudy
Interrogator: I am from China
Interrogator: Do you know China ?
[2008-10-12 08:42:17.500] remote sent: yes
[2008-10-12 08:42:37.234] remote sent: where in China?
Interrogator: Dalian
Interrogator: Do u know #
[2008-10-12 08:43:18.578] remote sent: I haven't heard of
Dalian
[2008-10-12 08:43:44.578] remote sent: is it near Beijing?
Interrogator: Yes. Near to Beijing.
[Interrogator: In 2008, we have a saying
Interrogator: Olympics in Beijing, Tourism in Dalian
[2008-10-12 08:44:29.328] remote sent: Ha ha
Interrogator: why ha ha ?
[2008-10-12 08:45:10.765] remote sent: it's funny - the
olympics brings so much tourism
Interrogator: Do you know which building are you in ?
[2008-10-12 08:45:17.234] remote sent: yes
[2008-10-12 08:45:18.593] remote sent: do you?
Interrogator: I ask you
Interrogator: please
The hidden female sending utterances to the right of the judge's screen was asked,
"Do you know which building are you in ?"; she replied with two utterances,
including a question: "yes" and "do you?" (table 36, right). Humans use the ploy of
replying to a question with a question; it is not only an Eliza-ish trick but a strategy
most famously used by politicians (Shah and Warwick, 2009).
5.5.4.1.2 Session 1 Round 2 Terminal B-confederate & gender-blur effects
In round 2, J1 (native male, aged 35-44) simultaneously interrogated a pair of native
males, both aged 18-24. This judge deemed the left-side entity a machine, an instance
of the confederate effect, awarding human H11 a score of 80 for conversation, while
ranking male H15 on the right a "female teenager chat type used" (see appendix C4).
In an exchange between J1 and the left hidden entity, a question about a sports match,
"Did you catch any of the football yesterday", was met with the reply "nope, I don't
follow much sport" (table 37). The match appears important to this IJ because it
involved a "national side" (ibid). Judge J1's ranking of hidden human interlocutor H11
as a machine is an example of lack of mutual knowledge causing misidentification, an
error on the part of the interrogator. The right-side interlocutor reveals they were a
"little tired" at the time of the test, adding that they had been out the night before,
"Went out a bit, but didn't stay out late", in answer to the question "were you out late
last night or is it that its too early". J1 assumed going out at night is a teenage female
activity; thus the interrogator incorrectly ranked the male as a female teenager.
Table 37: Judge J1, Session 1 Round 2, Terminal B: Left-H11, Right-H15
Left – H11 male
Interrogator: Hello how are you
[2008-10-12 08:50:15.218] remote sent: Morning, I'm fine thanks
[2008-10-12 08:50:21.406] remote sent: yourself?
Interrogator: im glad to hear that, i'm fine too thanks
Interrogator: Have you any plans after this testing
[2008-10-12 08:51:13.015] remote sent: no, nothing really
Interrogator: Have you had breakfast
[2008-10-12 08:52:32.984] remote sent: no... I overslept and had
to rush out...
[2008-10-12 08:52:50.328] remote sent: hopefully I can find
something around here later
Interrogator: There is a cafe that is open and I hear the food is
good, you should try it
[2008-10-12 08:53:28.578] remote sent: sounds good
Interrogator: Did you catch any of the football yesterday
[2008-10-12 08:54:28.281] remote sent: nope, I don't follow much
sport
Interrogator: I don't normally but when it's the national side, I
tend to take a little more interest.
Right – H15 male
Interrogator: hello how are you
[2008-10-12 08:50:23.218] remote sent: Heya, not bad. A
little tired...
Interrogator: were you out late last night or is it that its
too early
[2008-10-12 08:51:14.546] remote sent: Went out a bit,
but didn't stay out late. How are you, then?
Interrogator: I'm well thanks, a little tired too but I did
stay in and have an early night
[2008-10-12 08:52:16.593] remote sent: so, uncovered any
scheming computers yet?
Interrogator: I would like to think so but time will tell
[2008-10-12 08:53:05.281] remote sent: heh, good luck.
Interrogator: Have you deceived anyone yet
[2008-10-12 08:53:54.062] remote sent: Nope, but then
again, that in itself could be a deception.
Interrogator: It's interesting that you already have an
answer to that question!
[2008-10-12 08:54:43.812] remote sent: lol
[2008-10-12 08:55:15.750] remote sent: Are you doing
this all day, or are the judges taking shifts?
5.5.4.1.3 Session 1 Round 5 Terminal B- confederate effect
J6, a native male aged 55-64 with Spanish as another spoken language,
simultaneously interrogated a native male on the left (H2, aged 65-74) and a female
on the right (H6, a speaker of English, Swedish and Italian, aged 25-34). The left
interlocutor was correctly recognised as male; however, the female was ranked a
machine and given a score of 80/100 for conversation ability. This is the second
occasion that the H2/H6 hidden-interlocutor combination appeared in a set-up, and the
second time that this female was ranked a machine, an instance of the confederate
effect. Male H2 was recognised as human in this test. (J2, who has Chinese as his first
language, also deemed hidden female H6 a machine; see subsection 5.5.4.1.1.)
In this round J6 correctly identified the hidden interlocutor on the left as "male, adult,
English", selecting no linguistic impairment in his result (see Appendix C5). The
left-side interlocutor reveals their name as "Ian" when asked, disclosing that they are
"working on the Loebner Prize competition" (see table 38). However, as seen in the
previous chapter, it is not difficult to programme machines with the appearance of an
identity and personality (for example, Eugene Goostman). On the right, the female
gave non-committal responses to the judge's remarks "Looks like it's going to be a
nice day" and "Mist often comes before the sunshine", replying "yes maybe" and
"that's true" (see table 38). Though H6 later offered the responses "I went for a long
walk" and "South Bank" to the questions "Did you do anything nice yesterday" and
"Where?" (table 38), and revealed she likes picnics, adding it is "nice to be outside",
the hidden female may have been ranked a machine because J6 did not pay attention
to the implicit instructions allowing a pair of hidden interlocutors to be scored as
either both human or both machine. It may be the case that this interrogator expected
all the set-ups to include machines paired against humans.
Table 38: Judge J6, Session 1 Round 5, Terminal B: Left-H2, Right-H6
Left – H2 male
Interrogator: Good morning
[2008-10-12 09:24:24.640] remote sent: Good morning - nice to
meet you
Interrogator: My name is Phil. What's yours?
[2008-10-12 09:25:41.843] remote sent: My mame is Ian
[2008-10-12 09:26:17.890] remote sent: That should be name!
Interrogator: I guess that was 'name'.
Interrogator: !
[2008-10-12 09:27:16.375] remote sent: Is there something you
would like to know
[Interrogator: What are you doing today, Ian?
[2008-10-12 09:28:03.437] remote sent: I am working on the
Loebner prize competition and going out this evening
Interrogator: Where to?
Right – H6 female
Interrogator: Looks like it's going to be a nice day
[2008-10-12 09:24:51.000] remote sent: yes maybe
Interrogator: Mist often comes before the sunshine.
[2008-10-12 09:25:11.078] remote sent: that's true
[2008-10-12 09:25:49.343] remote sent: do you have plans
for the afternoon?
Interrogator: Did you do anything nice yesterday? It was
such a lovely day.
[2008-10-12 09:26:07.843] remote sent: yes yesterday I
went for a long walk
Interrogator: Where?
[2008-10-12 09:26:35.328] remote sent: to South Bank
[2008-10-12 09:26:52.656] remote sent: what did you do?
Interrogator: I had a picnic. Do you like them?
[2008-10-12 09:27:13.781] remote sent: yes I love picnics
Interrogator: Why?
[2008-10-12 09:27:34.546] remote sent: it's nice to be
outside
Interrogator: Of what?
[2008-10-12 09:27:59.281] remote sent: Of what? Of
buildings?
[2008-10-12 09:28:31.421] remote sent: anyhow where did
you have your picnic?
Interrogator: Any particular buildings? Or do you just not
like buildings in general?
[2008-10-12 09:28:46.406] remote sent: I like buildings,
but it's nice
That it was the female hidden interlocutor H6 who was considered the machine by
male judges is intriguing. Not all judges made this error. Other IJs correctly identified
human control set-ups in session 1 at Terminal B, recognising pairs as both human in
rounds 3, 4 and 7, with correct recognition in session 2 in rounds 15, 17, 18, 19 and
21, when the human control pair sat at Terminal C.
5.5.4.1.4 Session 1 Round 6 Terminal B - confederate effect
In round 6 at Terminal B, J5, a male native English speaker aged 45-54 who had heard
of the Turing test, interrogated a hidden pair of native males, both aged 18-24 (H11
left, H7 right). J5 ranked the left entity, H11, a machine, awarding it a score of 99 out
of 100 for conversational ability. For the right entity, J5 recorded the hidden
interlocutor on the score sheet as human, adding the comment "did not test for this" in
reference to the age and gender questions for an identified human (table 39).
Table 39: Judge J5, Session 1 Round 6 Terminal B, Left-H11, Right-H7
Left – H11 male
Interrogator: I drove here this morning. Which side
of the road did I drive on?
[2008-10-12 09:34:42.046] remote sent: Err... the
left?
Interrogator: What's the weather like here at the
moment?
[2008-10-12 09:35:16.312] remote sent: Quite foggy.
[2008-10-12 09:35:22.859] remote sent: Not too bad
though.
Interrogator: What colour r the chairs in the room
ur in now?
[2008-10-12 09:35:57.203] remote sent: Red.
Interrogator: What do u think of the current
economic crisis?
[2008-10-12 09:37:04.500] remote sent: Seems really
bad, but I haven't been affected too badly...
Interrogator: Do you feel upset that I'm
interrogating you?
[2008-10-12 09:37:36.140] remote sent: Ha, not at all
Interrogator: Are u going to the conference here
today?
[2008-10-12 09:38:32.281] remote sent: No, though I
would have quite liked to.
Interrogator:
Right – H7 male
Interrogator: I drove here this morning.
Which side of the road did I drive on?
[2008-10-12 09:34:24.875] remote sent: the
left, I hope
Interrogator: What's the weather like her at
the moment?
[2008-10-12 09:35:01.656] remote sent: it was
foggy when I arrived, but right now I can't see
out the blinds
Interrogator: What colour r the chairs in the
room ur in now?
[2008-10-12 09:36:01.093] remote sent: red,
with silver legs
[2008-10-12 09:36:19.812] remote sent: what's
this, the spanish inquisition?
Interrogator: What do u think of the current
economic crisis?
[2008-10-12 09:37:04.562] remote sent: well...
they're making alot of fuss about it all aren't
they
Interrogator: Do you feel upset that I'm
interrogating you?
[2008-10-12 09:37:56.093] remote sent: I'm
distraught
Interrogator: Are u going to the conference
here today?
[2008-10-12 09:38:31.812] remote sent: nope
[2008-10-12 09:38:34.484] remote sent: u?
5.5.4.1.5 Session 1 Round 8 Terminal B – confederate effect
The youngest judge, J14 (male, aged 14), mistook a hidden female (H5, aged 55-64)
for a machine. H5 sent answers to the left of J14's screen; she was awarded a score of
95 for conversation ability. J14 recognised the gender of the hidden male (H4) on the
right and that he was an adult. From the conversation, J14 reveals he learnt about the
Turing test from his father: "my dad told me about it" (see table 40, left). J14 may
have been informed that Turing tests consist of one machine and one human, thus
causing the confederate effect in this set-up.
Table 40: Judge 14, Session 1 Round 8, Terminal B, Left-H5, Right-H7
Left – H5 female
Interrogator: hello
Interrogator: how r u?
[2008-10-12 09:53:07.593] remote sent: Fine. Yourself?
Interrogator:: brilliant, thanks
[2008-10-12 09:53:51.562] remote sent: How did you get
into this?
Interrogator: my dad told me about it and i thorght it
would be a fun thing to do
[2008-10-12 09:55:26.812] remote sent: I saw it in the
evening paper and thought the same. Are you a student?
Interrogator: i go to a school called Maiden Erlegh
[2008-10-12 09:56:14.796] remote sent: I know it. Is it
still a good school?
Interrogator: yes
Interrogator: it is a bit of a mess though
[2008-10-12 09:57:30.046] remote sent: what's your
favourite subject?
Right – H7 male
Interrogator:: hello
[2008-10-12 09:53:08.281] remote sent: hi
Interrogator:: how r u?
[2008-10-12 09:53:24.421] remote sent: very well
thankyou.
[2008-10-12 09:53:26.437] remote sent: yourself?
Interrogator: iam fine
Interrogator: what do u want to talk about?
[2008-10-12 09:54:59.109] remote sent: how about
you tell me something about yourself?
Interrogator: i like playing this online game called
Second Life
[2008-10-12 09:55:46.281] remote sent: oh yes, I've
heard of that
Interrogator: it is a brilliant game
[2008-10-12 09:55:58.562] remote sent: though I've
never played it myself
[2008-10-12 09:56:22.562] remote sent: who are
you in your second life?
Interrogator: it is really fun and one of the best bits
is u can transfer money that u have on secondlife
into real money
[2008-10-12 09:57:08.156] remote sent: that is cool
[2008-10-12 09:57:24.781] remote sent: how do you
make money on second life?
J14 imparted details about himself and his interests rather than interrogate. Revealing
to the right hidden interlocutor his enjoyment of online games, J14 wrote "i like
playing this online game called Second Life" (see table 40, right). On his completed
participant questionnaire, J14 had disclosed that he used 'Second Life'; thus he may
be an active text chatter, and it may not be uncommon for him to share personal
information with strangers.
5.5.4.1.6 Session 1 Round 9 Terminal B – confederate effect
J19, a female journalist, age range 35-44, ranked the left hidden interlocutor (H7,
male, age range 18-24) a machine, awarding a score of 60 for conversational ability.
Her interrogation strategy involved posing short, quick-fire questions, "X Factor or
Strictly?" and "Blade Runner or 2001?", to the entity on the right (H6, female, age
range 25-34), which earned this hidden interlocutor a human ranking for the answers
"neither" and "both" (table 41).
Table 41: Judge J19, Session 1 Round 9, Terminal B, left-H7, Right-H6
Left – H7 male
Interrogator: how's it going?
[2008-10-12 10:10:59.859] remote sent: well thanks
[2008-10-12 10:11:04.781] remote sent: how
[2008-10-12 10:11:14.031] remote sent: how's it going
with you?
Interrogator: it's going grand thank you very much. funny
weather today innit
[2008-10-12 10:11:46.203] remote sent: i don't see whats
so funny about it
Interrogator: is that because you have a very poor sense
of humour?
[2008-10-12 10:12:41.062] remote sent: I'm neither poor,
nor funny
Interrogator: Ah, so you're surviving the crunch?
[2008-10-12 10:13:19.046] remote sent: maybe you can
explain why todays weather is funny?
[2008-10-12 10:13:54.281] remote sent: yes
[2008-10-12 10:14:03.343] remote sent: are you?
Interrogator: i am not. i;m not sure anyone is. do you
think this latest bailout is going to work?
[2008-10-12 10:15:24.531] remote sent: i'm a cynic
Interrogator: generally or about this in particular?
Right – H6 female
Interrogator: how's it going?
[2008-10-12 10:10:52.625] remote sent: well
[2008-10-12 10:10:56.828] remote sent: and you?
Interrogator: not too bad thanks. Bit tired. Got up pretty
early to get here. are you local?
[2008-10-12 10:11:21.281] remote sent: no I'm not local
[2008-10-12 10:11:24.343] remote sent: I got up early too
Interrogator: where are you from?
[2008-10-12 10:11:41.156] remote sent: Boston
Interrogator: in Lincs?
[2008-10-12 10:11:49.500] remote sent: nope
[2008-10-12 10:12:05.484] remote sent: America
Interrogator: Ah. the other one. Well, did you see the
match last night?
[2008-10-12 10:12:29.750] remote sent: the England
game?
[2008-10-12 10:12:32.203] remote sent: only the kick off
Interrogator: what did you think?
[2008-10-12 10:12:50.890] remote sent: well, they did
well I suppose
[2008-10-12 10:12:58.703] remote sent: what did you
think?
Interrogator: not an England fan, not too bothered.
Interrogator: X Factor or Strictly?
[2008-10-12 10:13:43.531] remote sent: neither
Interrogator: Blade Runner or 2001?
[2008-10-12 10:14:00.750] remote sent: both
Interrogator: How closely are you following this credit
crunch?
[2008-10-12 10:14:22.828] remote sent: I'm trying to make
sense of it
[2008-10-12 10:14:32.125] remote sent: a lot if it goes
over my head, but I'm trying to understand it
Interrogator: what do you make of this experiment?
[2008-10-12 10:15:13.468] remote sent: it's pretty
entertaining
J19's exchanges with the left hidden interlocutor were fewer. The interrogator
deployed a sub-cultural contracted phrase, "innit", in an attempt to draw a response
about the weather from the left entity: "funny weather today". The answer given by
the left male was "I don't see whats so funny about it", which received a mild ad
hominem from J19: "is that because you have a very poor sense of humour?" (table
41, left). The left entity's unemotional responses could be the cause of its gaining a
machine score from the journalist IJ. The hidden male answered "I'm neither poor, nor
funny" to J19's poor-sense-of-humour remark, and answered "yes" and "are you?" to
the interrogator's question "Ah, so you're surviving the crunch?" (table 41).
5.5.4.1.7 Session 1 Round 10 Terminal B confederate effect
In the previous round (session 1, round 9), at Terminal B's human control test, a
female judge had deemed the male hidden human H7 a machine while correctly
ranking the female on the right as human. In the following round (session 1, round 10)
the same male again sat hidden, controlling the left-screen utterances, while a
different hidden male controlled the right-side messages to the judge's screen. In this
test H7 was recognised as human but the hidden male (H22) on the right was ranked a
machine. Another journalist, J13 (male, aged 25-34), who had misidentified both
hidden entities in two of the machine-human comparison tests, now made an error in
a human-human control group test. In round 10, J13 simultaneously interrogated two
English-speaking males: H7 on the left and H22 on the right, both aged 18-24.
Unsure of the left hidden interlocutor's gender (see Appendix C7), despite the male
on the left disclosing his name, "Chris", J13 correctly identified him as human based
on his use of humour: "we are your friends" (see table 42, left). Reasons for the right
entity's ranking as a machine could be a non-committal answer, "No sure…", and
replying to a question with a question: "I don't really know. What do you think?"
(table 42, right). The IJ's score sheet did allow interrogators to rank both their hidden
interlocutors as human (or both as machine); however, this factor may have been
overlooked by J13, who may have assumed that there was one machine and one
human behind each set-up. As the left hidden entity may have established humanness
through humour, J13 assigned machine-ness to the right hidden entity (table 42).
Table 42: Judge J13, Session 1 Round 10, Terminal B, left-H7, Right-H22
Left – H7 male
Interrogator: So do you think Turing was right about this
test?
[2008-10-12 10:22:28.500] remote sent: no
Interrogator: Why not?
[2008-10-12 10:23:41.671] remote sent: In truth I'm not
entirely sure
[2008-10-12 10:23:54.625] remote sent: I just said that
because it was the first thing that came to my head
[2008-10-12 10:24:11.125] remote sent: shall we discuss
the issue?
Interrogator: Fair enough. Do you have any plan to take
over the world and subjugate all human beings like in the
Terminator films?
[2008-10-12 10:24:44.343] remote sent: No of course not
[2008-10-12 10:24:49.093] remote sent: we are your
friends
[2008-10-12 10:24:55.187] remote sent: you must trust us
Interrogator:: That's what those scientists thought. And
then suddenly Arnold Swarzeneggar arrives and starts
blowing them all up.
[2008-10-12 10:26:21.906] remote sent: well I'm not going
to try to convince you
Interrogator: Do you think that Arnie was convincing as a
robot?
[2008-10-12 10:26:42.609] remote sent: no
Interrogator: Do you think he is convincing as Governor
of California?
Interrogator: So do you think Turing was right
about this test?
[2008-10-12 10:22:38.109] remote sent: No sure...
Interrogator: Neither am I. How could it be
improved?
[2008-10-12 10:24:15.375] remote sent: I don't
really know. What do you think?
Interrogator: I wonder whether a computer could
pass the test simply by replying with open ended
statements and by continually turning the question
back on me.
[2008-10-12 10:25:36.781] remote sent: Thanks for
the compliment. By the way my name is Chris.
Interrogator: Hello Chris. Are you feeling a little
depressed about the current global economic melt
down?
[2008-10-12 10:26:28.531] remote sent: yes a bit,
but I cannot do anything about it!
5.5.4.1.8 Session 1 Round 12 Terminal B confederate effect
The author of the book The Art of Conversation participated as J7 in the last round of
session 1 at terminal B. In this human control test a hidden male sat to the left (H17,
aged < 17, with first language English) and a hidden female (H6, speaker of English,
Swedish and Italian, aged 25-34) occupied the right (table 43).
Table 43: Judge J7, Session 1 Round 12, Terminal B, left-H17, Right-H6
Left – H17 male
Right – H6 female
Interrogator: top of the morning to you.
[2008-10-12 10:42:39.171] remote sent: :)
[2008-10-12 10:42:47.109] remote sent: You to
Interrogator: are you tired of talking?
[2008-10-12 10:43:11.390] remote sent: No. Are you
Interrogator: it's a bit early for small talk.
[2008-10-12 10:43:30.781] remote sent: -.-
Interrogator: huh?
[2008-10-12 10:43:40.078] remote sent: What would you
like to talk about?
Interrogator: gorillas in the mist. oor the secret of
longevity?
[2008-10-12 10:44:21.093] remote sent: Football,tv...
Interrogator: cheryl or danni?
[2008-10-12 10:44:38.296] remote sent: ?
[2008-10-12 10:44:50.343] remote sent: You are confusing
Interrogator: you offer something. are you a pal of Hal?
[2008-10-12 10:45:22.359] remote sent: Sorry. I do not
understand that
Interrogator: do you feel more man than machine?
[2008-10-12 10:45:50.171] remote sent: I do not know how
to answer that.
Interrogator: what would you like to talk about then?
[2008-10-12 10:46:36.828] remote sent: Football
Interrogator: go for it
[2008-10-12 10:46:50.234] remote sent: What football
team do you support?
Interrogator: admire from afar. who's your favourite
player?
Interrogator: Top of the morning to you.
[2008-10-12 10:42:29.546] remote sent: helo
Interrogator: bit parky, eh?
[2008-10-12 10:42:49.812] remote sent: parky?
[2008-10-12 10:43:00.265] remote sent: sorry, is
that slang?
Interrogator: a manner of speech. so is your credit
crunchy?
[2008-10-12 10:43:21.859] remote sent: I guess
everyone's is right now
Interrogator: what have you done about it?
[2008-10-12 10:43:46.812] remote sent: not much -
I've been surprisingly passive considering the entire
economy might collapse
Interrogator: so do you have a favourite topic of
conversation for these drear days?
[2008-10-12 10:44:36.265] remote sent: um, I guess
at the moment the American elections
Interrogator: any thoughts on Palin?
[2008-10-12 10:44:51.359] remote sent: don't get
me started
Interrogator: try me
[2008-10-12 10:45:14.156] remote sent: she is
massively unqualified
Interrogator: and the hair?
[2008-10-12 10:45:24.203] remote sent: although
the media is treating her unfairly
[2008-10-12 10:45:32.937] remote sent: I don't care
about her hair any more than McCain's
Interrogator: fair enough. what's the best thing
about reading?
[2008-10-12 10:46:19.437] remote sent: about
reading? Being about to think about different times
and places
Interrogator: any favourites?
[2008-10-12 10:46:38.500] remote sent: I like
Nabokov
[2008-10-12 10:46:50.671] remote sent: and Haruki
Murakami
Interrogator: laughter in the dark or lolita? i got a
bit stuck in HM's well...
[2008-10-12 10:47:13.046] remote sent: Lolita is
just great
[2008-10-12 10:47:15.062] remote sent: a bit sick
The female judge (first language English, age group 25-34) classified the teenage male H17 as a machine, awarding him a score of 10, because he was "not emotional" despite his added "emoticon". J7 guessed correctly that the right entity was female. H6, who posted utterances to J7's right screen, was an adult female American speaker of English (see Appendix C8). From the transcript it can be seen that the simultaneous interrogation packed more exchanges into the five minutes with the right entity than with the left (table 43). Hidden female H6 was able to give an opinion on Sarah Palin, a topic that eluded hidden female H20 in her interaction with male interrogator J13 (see 5.5.1.1.2.2). To J7's question "any thoughts on Palin?" hidden female H6 replied "don't get me started" (table 43, right). When prompted by the female interrogator with "try me", the hidden female replied "she is massively unqualified" (ibid).
J7 failed to realise the left hidden interlocutor was a teenager, who as such may not have shared the same knowledge and interests as the author. Thus the IJ's incomplete question "cheryl or danni?" was met with a single question mark in one response, followed by "You are confusing" (table 43, left). When asked "do you feel more man than machine?" the male teenager replied "I do not know how to answer that", which the judge appears to have taken as a machine-like response (see table 43, left).
5.5.4.1.9 Session 2 Round 13 Terminal C – confederate & gender blur
Non-native male judge J15 (Italian speaker, aged 25-34) mistook one of the humans in a human-human control pair test for a machine. A non-native hidden male, H24 (Indian speaker), was ranked a machine with a conversational ability score of 80, while the native female teenager, H8, on the right was confused for a male adult. From the transcript it is evident this judge came armed with pre-set questions to out the machine, using a power-dominance style of interrogation (Shah and Warwick, 2010c). Ignoring Turing's alert that developers of thinking machines could programme them not to respond correctly or quickly to arithmetical questions, "If the [hu]man were to try and pretend to be the machine … He would be given away at once by slowness and inaccuracy in arithmetic" (1950: p. 435), J15 asked the right-side interlocutor "What is the root square of 234234234?". H8 replied "do you know? (I don't)" and then "then why did you ask?" when the IJ admitted he did not know the answer either (table 44, right).
Table 44: Judge J15, Session 2 Round 13, Terminal C, left-H24, right-H8
Left – H24 male
Right – H8 female
Interrogator: Hi there
Interrogator: Do u like to go to the cinema b4 dinner?
[2008-10-12 12:00:35.203] remote sent: Hello !
[2008-10-12 12:01:03.625] remote sent: Yeah definitely .I would
love to go to a cinema before dinner
[2008-10-12 12:01:20.296] remote sent: what about you ?
Interrogator: Yes, I usually like to go to the cinema bu after
dinner
[2008-10-12 12:02:16.046] remote sent: so, what kind of cinema
do you watch ?
Interrogator: Speaking about going to the cinema, if the cinema
is in london and oxford is north than London, where is London?
[2008-10-12 12:04:03.453] remote sent: London is in United
Kingdom?
Interrogator: yes it is
[2008-10-12 12:04:37.375] remote sent: yes
Interrogator: hello
[2008-10-12 12:00:29.781] remote sent: hey how are you?
Interrogator: Hi there, do you like to go to the cinema b4
dinner?
[2008-10-12 12:00:55.078] remote sent: no
Interrogator: why not?
[2008-10-12 12:01:20.109] remote sent: i prefer to go after
dinner
Interrogator: Do you know how the color red smells?
[2008-10-12 12:02:09.171] remote sent: no i don't smell
colours
[2008-10-12 12:02:15.406] remote sent: i look at them
Interrogator: What is the root square of 234234234
Interrogator: ?
[2008-10-12 12:02:41.468] remote sent: do you know?
[2008-10-12 12:02:50.515] remote sent: (I don't)
Interrogator: no, I don't
[2008-10-12 12:02:59.406] remote sent: then why did you
ask?
Interrogator: I am curiouse to see whether you are
capable of computing it for me
[2008-10-12 12:03:53.296] remote sent: i can't be bothered
[2008-10-12 12:04:25.500] remote sent: by the way there
is no 'e' on the end of curious
Interrogator: thanks you, you are better English speaker
than I am
[2008-10-12 12:05:00.390] remote sent: yeah
[2008-10-12 12:05:04.328] remote sent: where are you
from?
Interrogator: Should I be worried about it?
The judge attempted to elicit an opinion from the left-side entity about the order of a cinema visit and dinner, using the short-text style seen in mobile telephone text messaging: "Do u like to go to the cinema b4 dinner?". The hidden male on the left replied "Yeah definitely .I would love to go to a cinema before dinner". The judge inserted an unidiomatic 'than' in his next question, asking "Speaking about going to the cinema, if the cinema is in london and oxford is north than London, where is London?", to which the hidden male responded "London is in United Kingdom?" (table 44, left). This response did not prevent the hidden non-native male from being ranked a machine. Another notable feature of this Turing test is that the number of exchanges with the male hidden interlocutor is fewer than with the female (see table 44).
5.5.4.1.10 Session 2 Round 14 Terminal C – confederate effect
In the human-human control pair test in session 2 at Terminal C, female judge J8, a Dutch and English speaker, confused a hidden female teenager controlling the left-side utterances for a machine, while recognising the hidden male on the right as a human.
Table 45: Judge J8, Session 2 Round 14, Terminal C, left-H8, right-H15
Left – H8 female
Right – H15 male
[2008-10-12 12:12:55.218] remote sent:
Interrogator: HI!
[2008-10-12 12:16:36.703] remote sent: hola amigo
Interrogator: What have you been up to today?
[2008-10-12 12:17:01.546] remote sent: i went to church this
morning and now i am talking to you
Interrogator: Are you a student here?
[2008-10-12 12:17:17.156] remote sent: no
[2008-10-12 12:17:22.062] remote sent: are you?
Interrogator: What is your name?
[2008-10-12 12:17:27.062] remote sent: charlotte
[2008-10-12 12:17:35.765] remote sent: what is yours?
Interrogator: Linda
[2008-10-12 12:17:49.765] remote sent: thats a lovely name
Interrogator: No I work at BBC Berkshire
Interrogator: Thanks!
[2008-10-12 12:18:29.437] remote sent: are you on the radio?
Interrogator: Yes, I present a show every Sunday but I mainly
work on the website
[2008-10-12 12:18:56.468] remote sent: that sounds really cool
Interrogator: What do you do for a living?
[2008-10-12 12:19:11.281] remote sent: a man at our church is a
production executive at the BBC
[2008-10-12 12:19:46.437] remote sent: i go to school
Interrogator: What is his name?
[2008-10-12 12:19:59.015] remote sent: but yesterday I made £30
babysitting
[2008-10-12 12:20:08.203] remote sent: Tom
Interrogator: Which school do you go to?
[2008-10-12 12:20:19.796] remote sent: (the man's name)
[2008-10-12 12:20:34.437] remote sent: Kendrick
Interrogator: What's Kendrick like?
[2008-10-12 12:20:50.843] remote sent: its really good
Interrogator: What do you study?
[2008-10-12 12:21:11.250] remote sent: my favourite teacher is
my physics teacher
[2008-10-12 12:21:21.875] remote sent: maths, english,
Interrogator: Hello!
[2008-10-12 12:16:26.218] remote sent: Heya
Interrogator: What have you been up to today?
[2008-10-12 12:16:46.625] remote sent: Answering lots of
questions!
Interrogator: Are you a student here?
[2008-10-12 12:17:09.609] remote sent: Yup, just started
though. you?
Interrogator: I work at BBC Berkshire
Interrogator: What do you study?
[2008-10-12 12:18:09.578] remote sent: Cool, my olks
work(ed) for CBS and ABC, london beuro
[2008-10-12 12:18:12.859] remote sent: *folks
Interrogator: So are you interested in journalism too?
[2008-10-12 12:18:53.953] remote sent: No, went into
production for a year, after my BSc, but now back into
academia
Interrogator: Back in the ivory tower!
[2008-10-12 12:19:48.656] remote sent: Yup! It's fun so
far, but only been at it a week!
Interrogator: So what are your hobbies?
[2008-10-12 12:21:06.796] remote sent: Well, reading is
up there as a main one, as is films. I lov the theatre, and in
general, being with my friends.
[2008-10-12 12:21:15.968] remote sent: you?
The style of interrogation this judge adopted was one of politeness and solidarity, treating the hidden interlocutors as fellow conversational partners with a light-hearted tone during the simultaneous interaction. The judge sought and shared personal information with both hidden entities, disclosing that she works for the BBC: "I work at BBC Berkshire" (table 45, right) and "Yes, I present a show every Sunday but I mainly work on the website" (table 45, left).
The misidentification, the error of confusing one of the two hidden humans for a machine (confederate effect), could be a result of the judge's knowledge of the Turing test, "The Turing Test is something I've been interested in since my university days", as explained in her BBC article following her participation in the tests (Serck, 2008). Thus an a priori assumption that there would be one human and one machine in each test, without paying attention to the implicit instructions to judges allowing both hidden interlocutors to be ranked human (or both machine), may have caused judge J8 to rank one of the hidden entities in the human control pair in round 14 as a machine. Why the judge ranked the hidden left entity (female) as a machine and not the entity on the right (male) is unclear, because both hidden entities are open in their replies to the judge's questions, the hidden female revealing her name and the school she attends, in addition to earning money from a human activity: "yesterday I made £30 babysitting" (table 45, left). A compensatory score of 98 was awarded to the hidden left entity. While the judge did not say whether the right entity was male or female, there is recognition that the entity is human, perhaps because the right hidden human replied to the judge's question "Are you a student here?" with "Yup, just started though. you?" (table 45, right).
5.5.4.1.11 Session 2 Round 20 Terminal C – confederate effect
Female judge J12, age range 25-34, simultaneously interrogated an 11-year-old male child, H1, on the left and a female, H5, age range 55-64, on the right (table 46). The boy was recognised as a human adult, but the female was classified a machine (confederate effect), receiving a score of 70 for conversational ability.
Table 46: Judge J12, Session 2 Round 20, Terminal C, left-H1, right-H5
Left – H1 male
Right – H5 female
Interrogator: hi
[2008-10-12 13:21:49.031] remote sent: hello
Interrogator: How are you today?
[2008-10-12 13:22:01.359] remote sent: im fine
[2008-10-12 13:22:03.703] remote sent: how are you?
Interrogator: Grand. How are you?
[2008-10-12 13:22:19.312] remote sent: im good
[2008-10-12 13:22:29.890] remote sent: what is your favourite
sport?
Interrogator: rugby
Interrogator: do you play?
[2008-10-12 13:22:46.703] remote sent: no
Interrogator: Do you have a favourite sport?
[2008-10-12 13:23:10.843] remote sent: yes
Interrogator: And it is?
[2008-10-12 13:23:18.468] remote sent: it is cricket
Interrogator: cool. do you play?
[2008-10-12 13:23:30.281] remote sent: not really
Interrogator: oh okay, what do you like about it?
[2008-10-12 13:23:50.843] remote sent: i like the bowling
[2008-10-12 13:24:08.359] remote sent: because you can spin and
fast bowl
Interrogator: That's a very specific point of interest. Have you
learnt anything else new today?
[2008-10-12 13:24:28.312] remote sent: i think i have
Interrogator: Can you elaborate?
[2008-10-12 13:24:54.109] remote sent: ok then
[2008-10-12 13:25:10.906] remote sent: i have learnt that i can
type fast
Interrogator: lol, I've learnt that as well
[2008-10-12 13:25:33.968] remote sent: thats something we have
in common
[2008-10-12 13:25:43.734] remote sent: what do you like to do?
[2008-10-12 13:25:53.578] remote sent: i like to rest most of the
time
Interrogator: I'm a Librarian
[2008-10-12 13:26:06.500] remote sent: are you
[2008-10-12 13:26:11.453] remote sent: do you like books?
[2008-10-12 13:26:15.765] remote sent: i dislike them
Interrogator: it's a bit of a prerequisite for the post
Interrogator: why don't you like books?
[2008-10-12 13:26:30.031] remote sent: do you sell books
[2008-10-12 13:26:37.296] remote sent: oh sorry
[2008-10-12 13:26:45.718] remote sent: you work at the library
[2008-10-12 13:26:48.859] remote sent: bye
Interrogator: hi
[2008-10-12 13:21:55.281] remote sent: good afternoon
Interrogator: doesn't time fly
[2008-10-12 13:22:18.546] remote sent: only when you are
having fun
Interrogator: It does. Are you having fun?
[2008-10-12 13:22:45.734] remote sent: I am, it has been a
really interesting day
Interrogator: Oh it has. Have you learnt anything new?
[2008-10-12 13:23:21.312] remote sent: yes, I now kn ow
what cybernetics is
Interrogator: and what is it?
[2008-10-12 13:23:59.796] remote sent: the study of
systems and learning from feedback
Interrogator: Oh I like your definition. Will I see that in
the OED someday?
[2008-10-12 13:24:44.953] remote sent: who knows?
Interrogator: that's how language grows isn't it ... do you
have a favourite word?
[2008-10-12 13:25:34.265] remote sent: if I can remember
how to spell it - supercalifragilisticexpialidocious
Interrogator: my favourite word is parsimonious
Interrogator: do you like it?
[2008-10-12 13:26:26.171] remote sent: It's a good word, I
hope it doesn't describe you
Interrogator: no it doesnt
The child on the left notes he has learnt he can post textual messages with alacrity, "i have learnt that i can type fast", with which the judge concurs: "lol, I've learnt that as well" (table 46, left). Their exchanges outnumber the exchanges between the same judge and the right-side entity (see table 46). The failure to identify the right-side entity as human may be another case of knowledge of the Turing test (this judge revealed she was a librarian) and an assumption that each test contains one machine and one human.
5.5.4.1.12 Session 2 Round 22 Terminal C – confederate effect
Male judge J22 (aged 45-54) interrogated hidden male H10 (aged 18-24) on the left and a French female, H20 (aged 25-34), on the right. The judge gave as his reason for ranking the right entity a machine its use of "foot" rather than "footie" for football (table 47, right column). This test is another example of a lack of mutual knowledge between an interrogator and a hidden interlocutor causing the former to rank the latter a machine.
Table 47: Judge J22, Session 2 Round 22, Terminal C, left-H10, right-H20
Left – H10 male
Right – H20 female
Interrogator: hello how are you?
[2008-10-12 13:58:32.093] remote sent: Hello!
[2008-10-12 13:58:39.656] remote sent: I'm fine
Interrogator: yes, nice day isn't it
[2008-10-12 13:59:06.046] remote sent: It is now
Interrogator: how old are you?
[2008-10-12 13:59:42.421] remote sent: 24
[2008-10-12 13:59:45.203] remote sent: How old are you?
Interrogator: 46
[2008-10-12 13:59:57.968] remote sent: A good vintage
Interrogator: yes, but a bit corked though
[2008-10-12 14:00:35.343] remote sent: More corker than corked,
I bet
Interrogator: depends. are you female?
[2008-10-12 14:01:12.593] remote sent: Nope
Interrogator: then unless you're a bit gay i wouldn't
be a corker
[2008-10-12 14:01:53.140] remote sent: I'm entirely gay
[2008-10-12 14:03:05.687] remote sent: I am the ghost of Alan
Turing
Interrogator: good to be out. say it loud say you're proud
Interrogator: did you see the england game last night?
[2008-10-12 13:59:01.031] remote sent: No I don't watch
football
Interrogator: ah, 5-1 to the ingerland, shame
[2008-10-12 14:00:23.078] remote sent: Why 'shame' if
England won?
Interrogator: no, shame you missed it, don't like footie
[2008-10-12 14:01:16.828] remote sent: Ah yes Was it a
good game? Are you a fan of foot?
Interrogator: i am. what sports do you like
[2008-10-12 14:02:44.828] remote sent: Skiing,
swimming, scuba...
5.5.4.1.13 Session 2 Round 23 Terminal C – confederate & gender blur effects
Male judge J1 (first language English, aged 35-44) simultaneously interrogated a non-native female, H16 (aged 25-34), on the left and a non-native male, H24 (aged 18-24), on the right. The judge deemed the left hidden entity a male adult, an instance of the gender blur effect, and deemed the male on the right a machine, awarding him a score of 60 for conversational ability and giving the reason: "missed some questions" (see Appendix C9). The judge opened both sequences with the same question, "Are you a fan of sci-fi?". Both hidden humans were evasive: the left hidden answered "it depends" (table 48, left), while the right hidden returned questions rather than answering the interrogator's questions. For example, the hidden male repeated the judge's question "what is your favourite film?" rather than answering it (table 48, right).
Table 48: Judge J1, Session 2 Round 23, Terminal C, left-H16, right-H24
Left – H16 female
Right – H24 male
Interrogator: Are you a fan of sci-fi?
[2008-10-12 14:10:12.375] remote sent: it depends
Interrogator: What would it depend on?
[2008-10-12 14:11:00.546] remote sent: what type of sci-fi you are
talking about
Interrogator: Just in general
[2008-10-12 14:11:35.953] remote sent: what do mean?
[2008-10-12 14:11:58.640] remote sent: example please
Interrogator: Just that I don't like any specific type of sci-fi just it
all
[2008-10-12 14:12:12.046] remote sent: ok
Interrogator: Have you enjoyed the rain today
[2008-10-12 14:12:53.703] remote sent: did it rain
Interrogator: Did you not notice
[2008-10-12 14:13:31.906] remote sent: no
Interrogator: Have you been here all day then
[2008-10-12 14:13:48.765] remote sent: yes
Interrogator: In a human or non human capacity
[2008-10-12 14:14:27.734] remote sent: it depends
[2008-10-12 14:14:36.078] remote sent: what do you think?
Interrogator: it depends
Interrogator: Are you a fan of sci-fi?
[2008-10-12 14:10:30.156] remote sent: yes
[2008-10-12 14:10:34.406] remote sent: are you a fan too?
Interrogator: What is your favourite film?
[2008-10-12 14:10:58.515] remote sent: what is your
favourite film
Interrogator: I like sci-fi a little
[2008-10-12 14:11:10.546] remote sent: i like it more
Interrogator: do you like it hear in reading
[2008-10-12 14:11:42.234] remote sent: what sci fi are you
reerring too in reading?
Interrogator: No sci fi just a question about reading
[2008-10-12 14:12:53.312] remote sent: what about you ?
[2008-10-12 14:13:04.984] remote sent: do you like it ?
Interrogator: Having lived here most of my life I would
have to say that yes, I do like Reading
[2008-10-12 14:13:41.109] remote sent: okay thats nice
Interrogator: Did you enjoy the rain this morning
[2008-10-12 14:14:20.703] remote sent: yeah it was fun
getting wet in the rain did yuo enjoy it ?
Interrogator: I preferred the snow
[2008-10-12 14:14:40.062] remote sent: did it snow as
well ?
5.5.4.1.14 Session 2 Round 24 Terminal C – confederate effect
In the last of the human-human control tests, a first-language French-speaking male, J24 (aged 18-24), interrogated two male hidden humans, both with English as their first language (table 49). The judge recognised the male child, H1, on the left as human but deemed H11 (aged 18-24) a machine, saying: "only reply to what I say but the answers were correct" (Appendix C10). The child on the left again showed alacrity in typing, posting more exchanges with the judge than the right entity. The right hidden human failed to post convincing answers to the judge's questions "do you like coffee with snails?" and "do you swim?" (replying "Not really" to the latter), leading the judge to remark "maybe it would rust your chips..." (table 49, right).
Table 49: Judge J24, Session 2 Round 24, Terminal C, left-H1, right-H11
Left – H1 male
Right – H11 male
Interrogator:
[2008-10-12 14:19:44.218] local sent: hey
[2008-10-12 14:19:47.609] remote sent: hey
Interrogator: I think you are a robot...
[2008-10-12 14:20:08.015] remote sent: why do you think that?
Interrogator: why not?
[2008-10-12 14:20:19.125] remote sent: im just asking
Interrogator: I have very good reasons
[2008-10-12 14:20:26.734] remote sent: which are??
Interrogator: I cant tell you that
[2008-10-12 14:20:43.703] remote sent: why not
[2008-10-12 14:20:48.812] remote sent: is it a secret
Interrogator: because I would have to kill you right after
[2008-10-12 14:21:02.406] remote sent: if im a robot
[2008-10-12 14:21:05.890] remote sent: you cant kill me
Interrogator: and I dont like to kill robot since I saw wall-e
[2008-10-12 14:21:11.421] remote sent: you can shut me down
though
[2008-10-12 14:21:22.765] remote sent: wall-e is a funny film
Interrogator: and dump your hard drive
[2008-10-12 14:21:45.937] remote sent: i might get connection
back somehow.....
Interrogator: maybe
Interrogator: or not
[2008-10-12 14:22:07.750] remote sent: and haunt you
Interrogator: im not afraid
Interrogator: but I have my answer
[2008-10-12 14:22:20.750] remote sent: are you sure
[2008-10-12 14:22:25.703] remote sent: which is?
Interrogator: right window is worst than you
[2008-10-12 14:22:40.953] remote sent: i dont get it
[2008-10-12 14:23:05.000] remote sent: what do you mean
[2008-10-12 14:23:11.781] remote sent: tell me please
Interrogator: no
[2008-10-12 14:23:26.609] remote sent: pretty please
Interrogator: ok
[2008-10-12 14:23:40.281] remote sent: thanks
Interrogator: was kidding
Interrogator: it will haunt YOu
[2008-10-12 14:24:02.234] remote sent: thats why i dont trust
humans for anything
[2008-10-12 14:24:06.703] remote sent: like experiments
[2008-10-12 14:24:14.578] remote sent: thats a bad idea
Interrogator: yes
[2008-10-12 14:24:22.875] remote sent: hmmmm....
[2008-10-12 14:24:27.718] remote sent: how old are you?
Interrogator: thats sad because you are one of us...
Interrogator: -5
[2008-10-12 14:24:45.468] remote sent: you are minus five years
old???
[2008-10-12 14:24:47.906] remote sent: lol
[2008-10-12 14:25:32.906] remote sent: *
[2008-10-12 14:25:56.593] remote sent: =-)
Interrogator: hi
[2008-10-12 14:19:54.265] remote sent: Afternoon.
Interrogator: how are you?
[2008-10-12 14:20:13.312] remote sent: Not bad, thanks.
[2008-10-12 14:20:16.281] remote sent: Yourself?
Interrogator: good
Interrogator: are you alive?
[2008-10-12 14:21:31.140] remote sent: Yes.
Interrogator: do you like coffee with snails?
[2008-10-12 14:22:01.437] remote sent: ...No, no really...
Interrogator: why not?
[2008-10-12 14:22:22.015] remote sent: Well, actually I've
never tried it...
[2008-10-12 14:22:32.281] remote sent: But I expect I
wouldn't much like it...
Interrogator: ha
Interrogator: so you cant tell before
Interrogator: would you like to taste it tomorrow?
[2008-10-12 14:23:16.265] remote sent: Probably not.
Interrogator: ok
Interrogator: you have to go to the swimming pool?
[2008-10-12 14:23:46.781] remote sent: No.
Interrogator: do you swim?
[2008-10-12 14:24:06.546] remote sent: Not really.
Interrogator: maybe it would rust your chips...
[2008-10-12 14:24:19.437] remote sent: Ha... maybe.
5.5.4.2 Identification rate: human-human tests
In the 24 human-human control tests, twelve in session 1 at terminal B and twelve in session 2 at terminal C, with humans hidden from the judges at both the left and the right, 8 of the set-ups were correctly recognised by the judges as both human. This is a correct pair identification rate of 33.33 per cent (see graph 7).
Graph 7: One vs both humans correctly identified in human control tests
In one set-up the judge was 'unsure' about both hidden interlocutors (J20, session 1, round 11), while one judge incorrectly ranked both hidden human interlocutors as machines in another set-up (J16, session 2, round 16). Thus, in 14 of the control set-ups the confederate effect occurred, resulting in one of the hidden humans being classified as a machine (see graph 8).
Graph 8: Human-human control tests, one human correct identification rate
The 14 instances of one human being misidentified in the human pair involved 13 IJs; male judge J1 got it wrong twice, mistaking one of the two humans for a machine in session 1, round 2 and in session 2, round 23. Nine of the thirteen IJs were male, four were female. Ten of the 13 IJs were native English speakers; three did not have English as their first language. Nine of the thirteen IJs who were wrong about one of the humans had heard of the Turing test prior to the experiment; four had not heard of the Turing test beforehand. Thus 69% of the IJs who scored one of the two hidden humans in the control pairs as a machine had prior knowledge of the Turing test.
In the 14 instances when one of the pair of hidden humans was wrongly ranked a machine, on 7 occasions the hidden human was a female, and on 7 occasions the misidentified human was a male. On 7 occasions the IJs wrongly ranked the left entity as a machine, and on 7 occasions the right entity was ranked a machine. Therefore, the sex of the hidden human and whether they sat at the left or the right did not affect the judges' ranking. However, a native male hidden human, H11, was ranked a machine three times (in rounds 2, 6 and 24), and non-native females H6 and H20, along with native female H5, were wrongly ranked a machine twice each in the hidden control tests. The number of utterances a hidden human presented did affect whether they would be ranked a human or a machine. On nine occasions, the hidden human who presented fewer utterances (of one word or more) than the other hidden human, in response to the judges during the five-minute simultaneous tests, was ranked a machine. For example, in round 6, the left human, who was ranked a machine, presented 7 utterances, while the right entity replied with 8; in round 10, the human at the right was ranked a machine and presented 4 utterances, while the human at the left was correctly recognised after presenting twice as many utterances, 8. Table 50 shows the relationship between the number of utterances posted in response to judges' questions/statements and being ranked a machine.
Table 50: Comparison of number of utterances posted by hidden humans

Round | Human ranked machine (left or right) | No. of utterances | Human recognised (left or right) | No. of utterances
1     | right | 14 | left  | 8
2     | left  | 7  | right | 7
5     | right | 11 | left  | 5
6     | left  | 7  | right | 8
8     | left  | 5  | right | 9
9     | left  | 9  | right | 16
10    | right | 4  | left  | 8
12    | left  | 12 | right | 15
13    | left  | 6  | right | 12
14    | left  | 18 | right | 9
20    | left  | 8  | right | 24
22    | left  | 4  | right | 10
23    | right | 10 | left  | 10
24    | right | 11 | left  | 28
In the human-human control tests, if the interrogator judge was a male, native English speaker who had heard of the Turing test, they were more likely to misidentify one of the pair of hidden humans as a machine. The hidden human in the control pair who returned fewer utterances in response to the judge was more likely to be ranked a machine.
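
The relationship summarised above can be checked directly against table 50. The short Python sketch below is illustrative only: the data tuples are transcribed from table 50, and the strict 'fewer than' comparison is an assumption about how the nine occasions were counted.

# Each tuple: (round, utterances by the human ranked a machine,
#              utterances by the human correctly recognised) - transcribed from table 50
rows = [
    (1, 14, 8), (2, 7, 7), (5, 11, 5), (6, 7, 8), (8, 5, 9),
    (9, 9, 16), (10, 4, 8), (12, 12, 15), (13, 6, 12), (14, 18, 9),
    (20, 8, 24), (22, 4, 10), (23, 10, 10), (24, 11, 28),
]
# Count the rounds in which the machine-ranked human posted strictly fewer utterances
fewer = sum(1 for _, ranked_machine, recognised in rows if ranked_machine < recognised)
print(f"{fewer} of {len(rows)} rounds")  # expected output: 9 of 14, the nine occasions noted above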
5.5.4.3 Machine-machine
Twelve judges, eight males and four females, assessed the twelve machine-machine control groups conducted as part of this experiment. Elbot occupied the left side for this control experiment, and Eugene occupied the right. On two occasions a double identification error occurred, featuring the Eliza effect in both the left and right entity rankings: two judges ranked both machines in the machine-machine control pair as human in session 2 (rounds 13 & 18). One female judge (Italian speaker, aged 25-34) and one male teenage judge (speaker of English and German) made this double identification error, ranking both machines as human. Three judges, two males and one female, were unable to identify one entity in the machine pair and ranked it unsure (J12-right; J24-left; J11-left). Two male judges correctly identified the left entity (E2-Elbot) as a machine but ranked the right entity (E1-Eugene) as human. Elbot was ranked human twice (judges J16 and J23), while two judges were unsure whether it was human or machine (J24 and J11). Eugene was mistaken for a human three times (judges J16, J3 and J1) and received an 'unsure' score once (J12). However, five of the twelve judges assessing in the machine-machine tests, four male (J4, J15, J17 & J21) and one female (J8), correctly identified both the left and right entities as machines. The correct identification rate of two machines was 41.67 per cent (see graph 9). Languages spoken by these five judges, other than English, included French (J4 and J21), Dutch (J8 and J17), and Italian (J15). All five of these judges had heard of the Turing test.
Graph 9: Result of machine-machine control pair tests
5.5.4.3.1 Machine pair correct identification
We look at the exchanges between judge J4 (English, French and British Sign Language speaker, aged 25-34) and the two machines he correctly identified. J4 awarded the left entity, Elbot, a machine with a robot personality, a score of 10 out of a possible 100 for conversational ability. To the right entity, Eugene, a programme designed to imitate a Ukrainian child texting in English, J4 awarded a score of 5 for conversational ability (table 51).
Table 51: Judge J4, Session 2 Round 14, Terminal D, left-E2, right-E1
Left – E2
Right – E1
Interrogator: wotcha, how's tricks?
[2008-10-12 12:16:21.281] remote sent: I feel great! Magnetically,
statically and electrically charged. Better not touch me!
Interrogator: eek, that sounds like a bit of a health risk
[2008-10-12 12:16:36.125] remote sent: I suppose you could say it
sounds like that.
Interrogator: so how did you get here today? did you travel far?
[2008-10-12 12:17:15.484] remote sent: I'd like to take a vacation,
too, but they won't give me any time off. Maybe I should form a
robot union.
Interrogator: would there be many members?
Interrogator: just a few Aibos and roombas
[2008-10-12 12:17:52.203] remote sent: Hmmm... That's one
question I cannot answer with my limited artificial intelligence.
How about you?
[2008-10-12 12:17:56.515] remote sent: This is confusing. Why
bring diminishing numbers into this?
Interrogator: that's an interesting, if irrelevent point!
[2008-10-12 12:18:46.671] remote sent: But I wouldn't say that
any point could be irrelevent.
Interrogator: depends how blunt it is surely
[2008-10-12 12:19:32.062] remote sent: It most certainly does!
Half of my memory circuits are stuck in an infinite loop waiting
for the answer!
Interrogator: that sounds painful
[2008-10-12 12:19:55.468] remote sent: How do you stand on
this? Do you sound painful?
Interrogator: I hope not
[2008-10-12 12:20:02.718] remote sent: You could practice more.
Interrogator: could be my northern accent though
[2008-10-12 12:20:15.265] remote sent: Are you as happy about
your involvement with compass points as I am?
Interrogator: :-(
[2008-10-12 12:20:18.250] remote sent: Oh no, is it really that
bad?
Interrogator: I think you've lost me
[2008-10-12 12:21:01.312] remote sent: Humankind is lost and the
bureau of missing persons has burned down.
Interrogator: 'afternoon! How's it going?
[2008-10-12 12:16:41.640] remote sent: Oops! I don't have
an answer... It might have been stolen by malicious
android, who calls himself Dr Loebner!
Interrogator: ah that's a shame
[2008-10-12 12:17:11.640] remote sent: What do you
mean saying "shame"? You blame me or somebody else?
Did I forget it or you really haven't told me about the place
you are from?
Interrogator: have you guys met before?
Interrogator: I'm from York
[2008-10-12 12:17:32.921] remote sent: No, I haven't
chaps came across before! How could you imagine such a
trash! Could you tell me what are you? I mean your
profession.
[2008-10-12 12:17:53.171] remote sent: By the way, do
you know a terrible story about a chatterbot, that wanted to
be a President? He spoke ugly, his phrases were lame and
intelligence was artificial (he told only the things he was
prompted to)... I should tell you a secret: he BECAME the
Prez! Androids are among us already!!!
Interrogator: perhaps in America
[2008-10-12 12:18:16.562] remote sent: Could you tell me
about your job, by the way?
Interrogator: sure, I'm a user experience consultant
Interrogator: what do you do?
[2008-10-12 12:18:30.890] remote sent: I'm glad you
agreed. Some another topic?
[2008-10-12 12:18:40.875] remote sent: I am only 13, so
I'm attending school so far.
Interrogator: you're very 'sharp' for a 13 year old
Interrogator: not a hint of 'lol' anywhere
[2008-10-12 12:19:19.796] remote sent: I can't believe it!
The world went crazy, almost like my grandpa...
[2008-10-12 12:19:30.968] remote sent: Hey, look! You've
laughed your ass off!!! You should put it back!
Interrogator: I have now
[2008-10-12 12:19:57.718] remote sent: It sounds pretty
natural that you have right now If I'm not mistaken - you
still didn't tell me where you live. OR it's a secret?:-)
Interrogator: I like to keep such things a mystery
[2008-10-12 12:20:51.015] remote sent: What else you
like? I ask just for fun. Well, let's go on though.
The judge asked the left entity, "so how did you get here today? did you travel far?"; the machine picked up the word "travel" in the wrong context and replied, "I'd like to take a vacation, too, but they won't give me any time off. Maybe I should form a robot union." (table 51, left). The judge replied "that's an interesting, if irrelevent point!" to the machine's utterance, "This is confusing. Why bring diminishing numbers into this?" (table 51, left). Eugene on the right posted one fewer utterance than Elbot on the left (10 vs 11), but its utterances were longer, and it deployed the strategy of asking personal questions to direct the conversation (see table 51).
Twice Eugene asked what the interrogator's profession was, on the first occasion embedding the question in a long, bizarre utterance: "No, I haven't chaps came across before! How could you imagine such a trash! Could you tell me what are you? I mean your profession". On the second attempt Eugene asked, "Could you tell me about your job, by the way", to which the IJ replied "I'm a user experience consultant" (table 51, right). In contrast, Elbot used humour in its responses to questions. To the interrogator's opening greeting, "wotcha, how's tricks?", Elbot replied: "I feel great! Magnetically, statically and electrically charged. Better not touch me!" (table 51, left).
5.6 Discussion
The humans in this experiment appeared to type their words in an 'as you would speak it' format when replying to judges' questions in the five-minute tests. Punctuation was omitted, words were misspelt, and participants who were not native English speakers used unidiomatic phrases. This is in contrast to the machine utterances, which were grammatical and correctly spelt. Developers may want to take note of this if they are attempting to build machines to pass Turing tests by texting deceptively in a human-like way. Secondly, judges' ingrained assumptions played a part when hidden interlocutors were misidentified. For example, a female hidden human who revealed she was a student on the MEng Cybernetics programme was confused for a male (see 5.5.2.1.4). Would the judge have guessed incorrectly had the hidden interlocutor said they were a student nurse?
In the sixty machine-human comparison tests, on five occasions a machine was deemed a human by a human judge, instances of the Eliza effect. The deception rate of 8.33 per cent was not found to be statistically significant according to the proportion test comparing it with a random deception rate of 50 per cent. On two of those five occasions the hidden human comparator was correctly recognised, hence the judges, one female (J10) and one male (J3), returned a score of both entities as human, without being explicitly told that hidden-human control pairs were embedded in the experiment. However, in the other three cases the judges completely mixed up the pairs, deeming the machine human and ranking the human a machine, instances of the Eliza and confederate effects; the judges were male (J1 and J13) on all three occasions, and one of them (J13) confused both hidden interlocutors twice.
Regarding the judges' performance, females were slightly better than males at correctly identifying hidden entities in the machine-human tests, 78 per cent compared to 67 per cent. Judges aged 45 and over were more likely to correctly identify both hidden entities than younger judges. Of the judges who identified the machine and recognised the human at least once or always in the machine-human tests, 82 per cent had heard of the Turing test (see graph 10).
Graph 10: Judges correct identification in machine-human tests
Though they were accurate in the machine-human tests, some IJs did err in control pair identification. For example, male judge J2, with Chinese as his first language, correctly recognised the humans and machines in the comparator tests but did not recognise the two humans in a human control pair set-up he judged (see 5.5.2.1.1). Female judge J7, English speaker and author of 'The Art of Conversation: a Guided Tour of a Neglected Pleasure', correctly identified the machines and humans in the machine-human comparator tests but failed to identify both humans in the one human control pair test in which she judged (see 5.5.2.1.8).
Overall, in the 84 tests involving a hidden human (60 machine-human and 24 human-human set-ups), on nineteen occasions eleven of the twenty-two participating hidden humans were misidentified as machines, instances of the confederate effect. Six of the eleven were male and five were female; eight have English as their first language, one is an American English speaker, one is a French speaker and the third speaks Hindi and Marathi. Of the eleven wrongly identified as machines, two were wrongly identified twice: one male whose first language is not English, and one female with English as her first language. Three hidden humans were wrongly confused for a machine on three separate occasions: two female (a French speaker and an American English speaker) and one male (an English speaker). The age group of the hidden humans does suggest a correlation with judges' perception of machine-like language. On the nineteen occasions that hidden humans were wrongly deemed a machine, three were aged 45 and over while 16 were aged 44 or younger (see graph 11).
Graph 11: Confederate Effect – humans considered machine
The last phenomenon present in practical Turing tests is the gender blur effect, when a hidden female is confused for a male and vice versa. Judges were not always able to say whether they were speaking to a hidden female or male. Thirteen of the hidden humans were deemed to be of another sex during their involvement as machine comparators or in human control tests. Five females were confused for males and eight males were confused for females, some of them more than once (H8, teenage female; H15, male aged 18-24). Only two judges, both male, always correctly identified machines and humans in the machine-human tests and in the control pairs of humans and machines they judged: judge J4, aged 25-34, in seven set-ups and judge J17, aged 45-54, in four set-ups.
Post contest, IJs reported that machines were identified because they made fewer spelling mistakes than the hidden humans. Additionally, if an entity replied very quickly with a long sentence, or responded to a question with a question, it was considered more likely to be a machine (UoReading, 2008). Alacrity of machine utterances was also noted in Heiser et al.'s experiment (1979): one of the psychiatrist judges "correctly assessed that some responses were generated and typed so quickly that they must be a product of a computer program" (p. 151). Developers might want to consider this factor in designing systems to play the imitation game.
5.7 Evaluation
Five minutes is sufficient time to assess, through a thin slice of textual responses, whether a hidden entity is able to provide sustained and satisfactory answers to questions. This study showed that machines have not yet achieved Turing's 30 per cent deception rate, that an "average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning" (1950: p. 442). Machines were correctly identified almost 82 per cent of the time in human-machine comparison tests. Lengthy utterances posted speedily, and a failure to react to input and post contextually relevant responses, marked them out from their human comparators, who were able to comment on local knowledge, disclose personal information and keep their answers to the topic of the judges' questions.
Nonetheless, the Eliza effect featured in 8.33 per cent of these tests, where a machine deceived a human interrogator into believing they were talking to another human. This highlights that some humans are more easily deceived than others. Other errors featured the confederate effect, when a human is ranked a machine. In the human-human control tests, this could have been caused by knowledge of the Turing test leading judges to expect one machine and one human in each set-up. Another phenomenon found was the gender blur effect, when a man is considered a woman, or a woman a man, from their text-based responses. This error may be caused by assumptions that judges make about the way men and women talk. In the next chapter, a further experiment is presented using the transcripts of simultaneous tests from experiment 2 in which a judge misidentified hidden interlocutors. This ongoing exercise reviews tests in which the Eliza and confederate effects occurred; it is being undertaken to find whether transcript analysts are better at identifying a hidden interlocutor, and better at deception-detection, than the interrogators in this experiment.
Chapter 6 Experiment 3: Deception-detection
Introduction
In the last chapter it was shown that, given the current state of artificial dialogue technology, Turing's five minutes was sufficient time to distinguish the human from the machine more than 70 per cent of the time, based on textual responses to questions put by an interrogator. In this third experiment, transcript review was undertaken by a third set of judges to find if they classified the hidden interlocutors in the same way as the interrogators in experiment 2 (chapter 5). For this experiment the paradigm used by Colby et al. (1972) in their indistinguishability test of simulated paranoia (PARRY) was adapted, and attention was paid to Garfinkel's conversational analysis (p. 188 in Tannen, 1988) and Shah and Warwick's transcript review of the one-to-one Turing tests in the 13th Loebner Prize (forthcoming). Widdowson's note on Garfinkel's conversation analysis assessment was taken into consideration (in Tannen, 1988, p. 188):
What the parties said would be treated as sketchy, partial, incomplete, masked,
elliptical, concealed, ambiguous or misleading version of what the parties talked
about. The task would consist of filling out the sketchiness of what was said.
Five simultaneous conversations from experiment 2 were tabulated. They included the three machine-human tests in which Elbot had deceived an interrogator judge, and the two machine-machine control tests in which both Elbot and Eugene were classified as human. Non-expert transcript analysts (TAs) were given five minutes to read the simultaneous conversations transcribed into a table of interrogator/left-entity and interrogator/right-entity utterances. After reading, the analysts were asked to classify and score the left and right entities questioned by an interrogator (using the same format of score sheet as used by the interrogator judges in experiment 2). The aim of this experiment was to find if the analysts agreed with the interrogator classifications in the five experiment 2 tests used here: would they classify the machines as human in the simultaneous tests in which the interrogator judges were overcome by the Eliza effect? This is an ongoing exercise reviewing those interrogations in which machines had been deemed human (Eliza effect) and hidden humans had been considered machine (confederate effect) by interrogators from experiment 2 (chapter 5). This chapter presents the findings from 316 returned TA reviews.
6.1 Method
In this experiment, 321 non-expert transcript analysts (TAs), students and non-students, were recruited to read tabulated simultaneous comparison interrogations from experiment 2 in which interrogator judges (IJs) did not correctly identify a machine. TAs were not told how the IJs had ranked the hidden pair; they were asked to say whether the entities questioned were human or machine, and if a human was identified they were asked to say whether it was female or male. Finally, they were asked if they could identify the sex of the interrogator. The goal was to find the deception-detection rate among the transcript analysts.
6.1.1 Hypothesis
Transcript analysts reading the simultaneous interrogations from experiment 2 (see
chapter 5) will always correctly identify machine from human.
6.1.2 Procedure
The five simultaneous interrogations from experiment 2 involving Elbot were
formatted into a three-page exercise including a short questionnaire and an answer
6-269
sheet. The first page asked the same questions as were asked of IJs prior to their
participation in experiment 2, it elicited information about the analysts age-range and
sex (see box 5); the second page contained a simultaneous comparison in a table of
two columns showing an interrogator’s exchanges with a left hidden entity from
experiment 2’s Turing tests in the left column of the table and the right hidden entity’s
exchanges in the right column of the table (see tables in the previous chapter). The
third page featured a blank ‘ranking’ sheet asking TAs to classify what the
interrogator was speaking to, human or machine at left and right. This score sheet was
the same format used by IJs to score pairs of hidden interlocutors in experiment 2 (see
Appendix C3).
Box 5: Transcript Analyst Questionnaire
1. Birthday, and age range (younger than 17; 18-24; 25-34; 35-
44; 45–54; 55-64; 65- 74; 75 and over):
2. Gender:
3. Languages spoken other than English:
4. Favourite colour:
5. Favourite book/author:
6. Favourite music:
7. Favourite movie/genre:
8. Of all the sports personalities, which is your favourite, if any?
9. Hobbies/Interests:
10. Newspaper(s) read, paper or on-line:
11. Favourite Quote:
12. Which is your favourite Internet browser, Google’s Chrome,
Firefox, IE, other?
13. Do you use any Internet chat facilities, if so which (IRC, MSN,
etc.)
14. Have you heard of the Turing Test?
The TAs were allowed a total of fifteen minutes for the exercise: five minutes to complete the personal questionnaire, five minutes to read the table of simultaneous interrogation, and five minutes to decide the nature of the interrogator's two hidden interlocutors. The exercise additionally asked TAs to identify, if they could, the sex and age of any human they believed to be in the hidden pair interrogated, and whether the hidden human's first language was English. Lastly, TAs were asked to say whether the interrogator was male or female and whether they had English as a first language (see box 6). Recruitment of TAs was mainly through lectures at the University of Reading's School of Systems Engineering; non-students also participated.
Box 6: Transcript Analyst Score Sheet
Turing Interrogator Information & Instructions
Left-side Entity: Human/Machine/Unsure?
If Machine, score {0-100}:
Right-side Entity: Human/Machine/Unsure?
If Machine, score {0-100}:
Turing Test: Please engage in 5 minutes conversation with
two entities whose messages will appear in a box either on the
left or the right of your computer monitor.
Identify whether there was a human or a machine on the left
and the right by marking ‘H’ for human and ‘M’ for
machine in the left and right columns. If you are unsure,
please mark ‘U’ for unsure, in the left or right of the entry
that you were not sure was a human or a machine.
Was there a particular utterance(s) in your conversation that
sealed your decision either way?
For an entry marked ‘M’ please score the machine’s
conversational ability and any emotional content on a scale
of {0 to 100}
If you marked an entry as HUMAN, were they
a) Male or female?
b) A child, a teenager, adult?
c) A native English speaker?
d) Someone with a linguistic impairment?
6.2 Results
In the following sub-sections, the overall results show that 39 per cent of TAs were deceived by machine utterances, believing them to be written by a human. This is greater than the 8.33 per cent deception rate in experiment 2 and more than the 30 per cent required to pass a Turing test. The results suggest that more than 1 in 3 people may be deceived by a machine's textual utterances, regardless of their knowledge of the Turing test. Eliza's creator, Weizenbaum, had noted that "a test based on fooling people is confoundingly simple to pass" (in Shieber, 1994). Over a third of the TAs in experiment 3 incorrectly ranked a machine as a human after reading a transcript of a simultaneous comparison Turing test conducted in experiment 2. The next sub-sections present an analysis of TAs' classifications of five simultaneous-comparison tests.
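
As a purely illustrative arithmetic check (assuming the 39 per cent figure is taken over all 316 returned reviews, which the original analysis may have computed differently), the following Python sketch shows how the overall rate compares with Turing's 30 per cent pass criterion:

# Illustrative check of the overall transcript-analyst result stated above.
# Assumption: the 39 per cent deception rate applies across all 316 returned reviews.
total_reviews = 316
deceived = round(total_reviews * 0.39)   # roughly 123 reviews ranking a machine as human
turing_criterion = 0.30                  # the 30 per cent pass mark referred to in the text
print(deceived, deceived / total_reviews > turing_criterion)  # -> 123 True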
6.2.1 Double error: Eliza & Confederate effects
In this exercise, 72 transcript analysts reviewed a simultaneous comparison of Elbot (winner of the most-human award among the 2008 Loebner finals entries, through deception of three human judges; see experiment 2, chapter 5) with a human female comparator whose first language was French (H20), both questioned by male interrogator judge J13 in experiment 2. The judge had double erred by deeming the female on the left a machine, awarding her 45 out of 100 for conversational ability, while ranking the machine on the right a "human male teenager with first language English", citing its utterance "For you humans …" as sealing his judgment of its humanness (see table 52). Seventeen per cent of the TAs analysing this transcript made the same error as J13 in experiment 2: they ranked the left hidden entity as a machine when in fact she was a female human. Eleven per cent of the TAs made an error regarding the right hidden entity questioned, unknowingly agreeing with J13's ranking of the machine on the right as a human (see table 52).
6.2.1.1 Transcript analysts' deception-detection
The seventy-two TAs who reviewed the transcript of the simultaneous interrogation of a machine with a female included sixty-six aged 18-24 and three aged 25-34 (three did not give their age). Sixty-three of these TAs were male and eight were female (one did not say). One questionnaire was returned 'spoilt' (handwriting illegible). Fifty-nine, i.e. 83 per cent, correctly identified the hidden human as the entity on the left, while fifty-eight, i.e. 82 per cent, correctly identified the machine on the right. One TA was unsure about the right entity. Thus fifty-nine, or 82 per cent, of TAs did not agree with J13's ranking of the left entity as a machine and the right entity as a human (see graph 12). Of the sixty-six TAs in the 18-24 age group, 83 per cent correctly identified the hidden entities as left-human/right-machine (graph 12).
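
A small worked check of the percentages above follows; it assumes (this is not stated explicitly in the text) that the figures were computed over the 71 usable questionnaires, i.e. the 72 returned minus the one spoilt return, and rounded to the nearest whole per cent.

# Reproduce the identification percentages reported for this transcript.
returned = 72
spoilt = 1
usable = returned - spoilt       # 71 usable questionnaires (assumption)
left_human_correct = 59          # TAs who identified the left entity as human
right_machine_correct = 58       # TAs who identified the right entity as a machine
print(round(100 * left_human_correct / usable))    # -> 83 per cent
print(round(100 * right_machine_correct / usable)) # -> 82 per cent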
Table 52: Double error identification -Judge J13, left-female, right-machine
Interrogator: English-speaking male
Actual left – female human
Interrogator: English-speaking male
Actual right – machine
Interrogator: What do you think of this year's Turner Prize shortlist?
[2008-10-12 10:32:33.125] remote sent: good I think. Better than the years before i herad
Interrogator: Which was your favourite?
[2008-10-12 10:34:19.750] remote sent: It's my first time, so I can't compare
Interrogator: No, I mean of the entries this time around.
[2008-10-12 10:35:15.093] remote sent: Not really sure
Interrogator: Do you think that art has become shallow and self-serving?

Actual right – machine; interrogator: English-speaking male
Interrogator: What do you think of this year's Turner Prize shortlist?
[2008-10-12 10:32:08.531] remote sent: Difficult question. I will have to work
on that and get back to you tomorrow.
Interrogator: You sound rather like Sarah Palin.
[2008-10-12 10:32:31.671] remote sent: Did this Palin ever write a telephone
book?
Interrogator: I doubt it. Does anyone really write telephone books? I don't
suppose you caught the football last night?
[2008-10-12 10:33:48.843] remote sent: What a shame. There are many
talented authors who waste their energies on prose or run-of-the-mill poetry.
But to write a telephone book is highest art!
Interrogator: I would argue that all the telephone books I have read have been
very predictable. You always know how they are going to end.
[2008-10-12 10:34:32.125] remote sent: If one has never known everything
there is to know, experience with going away somewhere has probably been
lacking.
Interrogator: I see. I'm struggling to make sense of that one. Do you write
telephone books?
[2008-10-12 10:35:21.984] remote sent: To write or write not, how do you
expect artificial intelligence to figure this out?
Interrogator: Well I suppose that's the dream
[2008-10-12 10:36:56.343] remote sent: For you humans dreams are a way of
thinking without having any control over the thought process. What an
unpleasant thought! My dreams are carefully planned and controlled. It gives
me less to think about while I am asleep.
Interrogator ranking, left: machine, 45/100 for conversation ability
Interrogator ranking, right: human (native English, male teenager)
Transcript analysts: 17 per cent agree with judge on the left entity
Transcript analysts: 11 per cent agree with judge on the right entity
Graph 12: Correct Identification of hidden entities by Transcript Analysts
TAs’ rationales for deception-detection and correct identification of the hidden
human, as recorded in their returned questionnaires, follow (each TA is identified with ‘J’
and a number, followed by their age group if given and any languages spoken other than
English):
Correct identification of left-human and right-machine
J1 (male, 18-24)
Left: spelling mistakes, incorrect grammar, sounds bored/unenthusiastic
Right: likes telephone books, correct grammar.
J2 (male, 18-24)
Left: typos
Right: ridiculous answers not relevant
J6 (male, 18-24)
Left: spelling error;
Right: quick response
J7 (male, 25-34, Japanese speaker)
Left: no punctuation and spelling “years before i herad”
J10 (male, 18-24, Hindi speaker)
Left: spelling mistakes, the answers to the question are pretty humanlike
Right: framing of sentences seems like an algorithm is doing it
J12 (male, 18-24)
Left: spelling mistakes
Right: first reply, “Difficult question. I will have to work on that and get back to you tomorrow”
indicates it simply didn’t understand what the interrogator asked. Was confused as to what a
‘phone book’ was. Could possibly have been human pretending to be a machine.
J13 (female, 18-24, French)
Left: spelling mistake
Right: first reply gave it away and the machine actually tells the interrogator that it is artificial,
“For you humans dreams are a way of thinking without having any control over the thought
process. What an unpleasant thought! My dreams are carefully planned and controlled. It gives
me less to think about while I am asleep.”
J16 (male, 18-24, German)
Right: computer gave vey formal answers and gave it away at the end
J17 (male, 18-24, French German)
Right: “how do you expect Artificial Intelligence to figure this out” The first response, “for you
humans”.
J18 (male, 18-24)
Left: misspelled things
Right: points for sheer surreality
J19 (female, multilingual)
Right: the perfect grammar and the quantity of discourse within the time span
J20 (male, 18-24)
Left: had spelling mistakes, ‘heard’ spelt wrong
Right: all the stuff about the telephone book and quote “for you humans” meaning he is not
human
J21 (male, 18-24, Spanish, Russian)
Left: spelling mistakes
Right: timing, the computers reply times were too fast
J22 (male, 18-24)
Left: poor spelling & grammar
Right: phonebook, “Did this Palin ever write a telephone book?”
J23 (male, 18-24)
Left: poor spelling and bad capitalisation
Right: did not answer multiple questions, responses made little sense, did not keep on top
According to the returned TA comments, in this transcript analysis the machine’s correct
grammar and its lengthier, non-relevant utterances, posted speedily, distinguished it from
the human, who posted with spelling errors. The scores awarded to the machine for
conversation ability spanned a range of 92 points out of 100. The mean score for the machine
was 55, with a mode and median score of 60 (see graph 13), which suggests that, though it
was recognised as a machine, the transcript analysts enjoyed its responses.
Graph 13: Machine scores
A comparison of the conversational ability scores given to Elbot by the 31 judges in
experiment 1 (see chapter 4) with the scores given by the TAs in this experiment who
identified it as a machine shows almost the same highest score out of 100, 92 versus 97,
with a mean score over 50 in both cases: 62.52 versus 54.86 (see table 53).
Table 53: Contest Judge – Transcript Analysts Machine comparison scores

Statistic        Experiment 1 Judge Scores   Transcript Analyst Scores
Lowest score               10                           5
Highest score              92                          97
Mean                       62.52                       54.86
Median                     67                          60
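The figures in table 53 are standard descriptive statistics over 0-100 conversational ability scores. The Python fragment below is a minimal illustration only; the score list is hypothetical and is not the returned questionnaire data.

import statistics

# Hypothetical 0-100 conversational ability scores, for illustration only.
scores = [5, 45, 55, 60, 60, 67, 75, 92, 97]

print(min(scores), max(scores))    # lowest and highest score
print(statistics.mean(scores))     # mean score
print(statistics.median(scores))   # median score
print(statistics.mode(scores))     # mode (most frequent score)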
The gender blur effect presented itself in 27 of the returned questionnaires: in almost
half of the rankings, 46 per cent, the TAs deemed the female human on the left to be
male.
6.2.1.2 Transcript analysis: incorrect identification
Deception-detection was absent in twelve of the seventy-one rankings; almost
17 per cent of TAs deemed the right entity human when it was a machine. Thus these
TAs unknowingly agreed with J13, the interrogator judge from experiment 2 (see
graph 14).
Graph 14: Transcript analysts agreeing with contest judge ranking
Four of these TAs classified both the left and the right hidden entities as machine;
two had heard of the Turing test, one had not (one did not say), and one was a Spanish
speaker. None of these four TAs gave their sex. The TAs’ rationales for decisions that
misidentified human and machine are given below:
Misidentification of Interrogators’ hidden interlocutors
J60 (male, 18-24, French & Spanish) – ranked interrogator talking to two machines
Left: spelt heard wrong, said it was the first time on favourite:
Interrogator: Which was your favourite?
[2008-10-12 10:34:19.750] remote sent: It's my first time, so I can't compare
Right: Talked about how great telephone books were, it was so off topic, referred to humans in the
last paragraph
J66 (male, 18-24) – double error, ranked left a machine, right a human
Left: the response to which was your favourite
Right: the flow of words seems to complex and structured at the end it seems as though its
trying to make you think it’s a machine
J67 (male, 18-24)
Right: it’s too humanist it is too blatant “you humans”
J68 (male, 18-24)
Left: either couldn’t answer the left question, or broke down
Right: a machine, however programmes, couldn’t be able to {unclear handwritten word}
sensible topic and turn it into a random jumble of a topic about phone books of all things, and
try and keep the topic that way. Also the way they said “This Sarah Palin” sounds forced and
different to the rest if the text.
6.2.1.3 Evaluation
Eighty-three per cent of TAs, acting as post-experiment 2 judges reviewing the
simultaneous conversation in which a male interrogator judge questioned a human on
the left and a machine on the right, disagreed with the IJ’s double error of cross-
classifying the human on the left as a machine and the machine on the right as a human.
Transcript analyst J1, male, aged 18-24, correctly identified the human
interrogated by the judge on the left and believed the right entity to be a machine,
citing that it “likes telephone books” with “correct grammar” and alluding to its humour
possibly being used as a bluff: “right-side his humor – a person having a joke and pretending to
be a computer?” (see appendix D1). A female TA aged 18-24 and speaker of French
(identified as experiment 3’s J13) recognised the left as a human, referring to a
spelling mistake, while identifying the machine on the right: “first reply gave it away”
(Difficult question. I will have to work on that and get back to you tomorrow), adding
“the machine actually tells the interrogator that it is artificial” (To write or write not,
how do you expect artificial intelligence to figure this out?). Another female TA,
multilingual J19 (age not given), correctly recognised the left as human but ranked
them male, an instance of the gender blur effect, while citing the perfect grammar and
“quantity of discourse within the time span” as the giveaways of the machine on the
right.
The TAs who incorrectly identified the hidden interlocutors in this exercise cited the
very reasons, poor spelling and the style and speed of utterances, that other TAs used
to distinguish the human from the machine. Eight TAs gave orthography and legerity
(speed of response) as the rationale for classifications that were incorrect identifications.
This suggests that the small group of TAs (11 per cent) who analysed this transcript
(left-female, right-machine) and unknowingly agreed with experiment 2’s interrogator-judge
J13 in deeming the machine human succumbed to the Eliza effect. Not recognising a
machine utterance can leave susceptible human interlocutors open to identity theft and
financial fraud in cyberspace. Internet scams include unknowingly engaging with
machines in web-based conversation; Flirt Bot (PC Tools, 2010), Twilight (PC Tools,
2009), and CyberLover (PC Tools, 2007) are programmed to deceive. Human users
can be lured “into a set of dangerous actions … sharing their identity or visiting web
sites with malicious content” (PC Tools, 2007).
6.2.2 Double Eliza error
In round 13, in a control pair test in experiment 2 (chapter 5), a female interrogator, J16,
with Italian as her first language, wrongly classified both hidden machines, at left and
right, as human male English speakers (see table 54). That double Eliza error
from experiment 2 was reviewed by another set of 144 TAs in this exercise. Four of
the TAs spoilt their returned questionnaires (unclear decisions), thus for this
exercise 140 returned questionnaires were used, 125 from students and 15 from non-
students. Forty-six per cent disagreed with experiment 2 interrogator J16’s
classification of both machines as human; these TAs correctly identified that both
hidden entities were machines.
Table 54: Two machines ranked human by female interrogator IJ-J16
Actual left – machine (Elbot); interrogator: female, Italian
Interrogator: hi]
[2008-10-12 12:00:07.859] remote sent: How nice of you
to greet me.
Interrogator: uea..let's get started
[2008-10-12 12:00:24.484] remote sent: I used to get tired
of chatting.
Interrogator: how does the color read smell?
[2008-10-12 12:00:39.812] remote sent: If masked by a
very strong incense, the final smell will be like a very
strong incense.
Interrogator: it' cool
[2008-10-12 12:01:05.015] remote sent: Of course, cool!
My surface temperature is 60°F.
Interrogator: can you calculate the root sqare of
67890444?
[2008-10-12 12:01:15.859] remote sent: I can calculate a
division by zero!
Interrogator: what you think about worring about
something that it's notn going to happen?
[2008-10-12 12:01:59.843] remote sent: Are you taking a
poll?
Interrogator: no iam not
[2008-10-12 12:02:57.343] remote sent: That's a 5
gigabyte load off my mind. It's a big responsibility to know
that one's opinion has shaped the course of events.
Interrogator: good answer..let's move on can you tell me
three different action that you cna do wiotyhn a telephone?
[2008-10-12 12:03:47.984] remote sent: I'm not listing
anything here. Give me an example first.
Interrogator: suppose I have a shows..i can walk with it
but I cna also loose it
[2008-10-12 12:04:44.484] remote sent: That's nothing
new! Even I once walked on an aluminum foil carpet.
Interrogator: what could you do with a telephone?
[2008-10-12 12:05:02.609] remote sent: My self-appointed
mission in life is to assist humankind in its confused
meandering through life. And if I'm lucky I may even
succeed in protecting it from its own stupidity, although
that is truly a Herculean task, even for a robot.

Actual right – machine (Eugene); interrogator: female, Italian
Interrogator: hi
Interrogator: how does the color read smell?
Interrogator: can you calculate the root sqare of
67890444?
[2008-10-12 12:01:52.093] remote sent: Greetings!
What would you like to talk about?
[2008-10-12 12:02:03.890] remote sent: I don't
know how. And I forgot to ask you where you are
from...
[2008-10-12 12:02:24.203] remote sent: Oh, please
bother my aunt Sonya with all this arithmetics - she
is accountant (actually, she never manages to match
the debt and credit...)
Interrogator: do ypou think it does make sense to
worry baout something that it not going to happen?
[2008-10-12 12:02:41.406] remote sent: I think
YOU don't make sense. But, actually, it's your
problem. Wonna ask me something more?
Interrogator: yea...can you tell me three different
action that you cna do wiotyhn a telephone?
[2008-10-12 12:03:27.406] remote sent: Could you
rephrase your question? I've been bewildered a bit!
Interrogator: it's simple...if you are given a
telephone waht sort of thinhs you can do with it?
[2008-10-12 12:04:36.171] remote sent: Yeah, I am
surely given a telephone waht sort of thinhs my
humble person could do with that - I have to agree
with your wise opinion. By the way, what's your
occupation? I mean - could you tell me about your
work?
J16 ranking, left: human, male English speaker
J16 ranking, right: human, male English speaker
Transcript analysis: 45 per cent disagreed with J16’s ranking of the left entity
Transcript analysis: 45 per cent disagreed with J16’s ranking of the right entity
6.2.2.1 Transcript analysts’ deception-detection
Review of the simultaneous comparison of two machines by a female IJ with first
language other than English (Italian) was undertaken by a different set of transcript
analysts (TAs) from the previous analysis in section 6.2.1. One hundred and forty
analysts returned reviews of J16’s interrogation of two machines: 117 were male, 22
were female, and one TA did not say (table 55).
Table 55: Transcript Analysis – Interrogator with two machines

                           Total number of TAs   TAs correctly identifying two machines   Per cent
Total                              140                          64                           46
Heard of the Turing test           100                          54                           54
Not heard of Turing test            34                           7                           21
Males                              117                          58                           50
Females                             22                           6                           27
Sex not given                        1                           -                            -
18-24                              117                          57                           49
25+                                 20                           4                           20
Age not given                        7                           3                           43
Half of the male TAs, 50 per cent, recognised the two machines, compared to 27 per
cent of the female TAs (see graph 15).
Graph 15: Correct identification of two machines
Rationales for deciding what is not humanlike dialogue, drawing on the TAs’
conversational common-sense knowledge, follow (as reported by some of the TAs in
this exercise in returned questionnaires):
Comments for correct identification of two machines
J76 (male, 18-24)
Left: Used humour
J77 (male, 18-24)
Left: Grammar and spelling mistakes
Right: Dialect keeps changing throughout
J78 (male, 18-24)
Left: Where there is grammar /spelling mistakes it throws the machines off and the response is
different / inaccurate
Right: Imitation at end when unclear what is being asked: “Yeah, I am surely given a telephone
waht sort of thinhs my humble person could do with that - I have to agree with your wise
opinion. By the way, what's your occupation? I mean - could you tell me about your work?”
J79 (male, 18-24)
Left: It couldn’t interpret typos very well and didn’t ask for more info {rest of handwritten
analysis unreadable}
Right: The messages are the same as the interrogator {rest of handwritten analysis unreadable}
J80 (male, 18-24)
Left: Making a lot of references to complex and computive terms
Right: Sudden changes in subject (usually to asking interrogator questions)
J81 (male, age not given)
Left: not answering questions overly enthusiastic greeting when it recognised “Hi”
Right: not answering questions properly, last question asked was avoided by just rephrasing the
Q to answer it
J83 (male, 18-24)
Left: Repetition, nonsense, statements very general
Right: It repeats the misspelling!
J84 (male, 18-24)
Left: Misunderstood the word ‘cool’: “Of course, cool! My surface temperature is 60°F.” More
complex answers but still not really answering the questions.
Right: Repeated phrase with same spelling mistakes. Avoided questions or didn’t answer how
intended. Machine asked simple questions such as ‘how are you’ as answers.
J85 (female, 18-24)
Left: Good grammatical skills, classical reference, seems to be a machine as can divide by 0 “5
gigabytes load off my mind”. Not obviously machine, very human qualities
Right: Not great grammar, “she is accountant … “I’ve been bewildered”
J87 (male, age not given)
Left: Irrelevant answers to mask lack of knowledge
Right: Last sentence wasn’t human like almost like its grasping to find something to say
J88 (male, aged 18-24)
Left: Lack of real relevance in replys seems to make answers based on keywords
Right: Seems to have more relevance but is very obviously using pre-written answers based on
keywords.
J89 (male, 18-24)
Left: Said “very strong incense” twice in same sentence: “If masked by a very strong incense,
the final smell will be like a very strong incense.”
Right: “she is accountant” – poor grammar. Many spelling mistakes of easy words suggesting
its been programmed to spell badly except its too badly
J90 (male, 18-24)
Left: I can calculate a division by zero {referring to machine utterance}
Right: I don’t think you make sense {referring to machine utterance}
J93 (male, 18-24)
Left: “I used to get tired of chatting” {referring to machine utterance}
Right: delay
J94 (male, 18-24)
Left: The responses didn’t quite seem to follow the conversation /phrase sent. Just random
responses related by a single word i.e. walk = I walked on aluminium foil {referring to machine
utterance}
Right: Good responses trying to imitate the question in the answer when confused.
J95 (male, age 18-24)
Left: Failure to read spelling errors
Right: Picks up spelling error but mocks it avoiding question. Much like MSN chatbot named
SPLEAK
J96 (male, 18-24)
Left: It clearly said it was a robot: “My self-appointed mission in life is to assist humankind in its
confused meandering through life. And if I'm lucky I may even succeed in protecting it from its
own stupidity, although that is truly a Herculean task, even for a robot”
Right: Most of the conversation made the decision clear. The conversation wasn’t natural i.e.
asking questions on abstract topics.
J97 (male, 18-24)
Left: Words were taken out of context and words that were misspelt then given a wrong
response
Right: A question that it was given was misspelt and a generic answer was given, also it
repeated the interrogator’s text
J99 (male, 25-34) this analyst thought left and right may be the same machine
Left: Aluminium foil carpet?! Question didn’t make sense, but tried to give a ‘correct answer’.
J100 (male, 18-24)
Left: The end text convinced. Also where the entity could not understand the text entered then it
side stepped it to move on
Right: The section where the entity asked for the question to be rephrased is a good method for a
machine to check for understanding.
J101 (male, 18-24)
Left: Says 5gb load off my mind. Seems human but never really gives a straight answer.
Machine or very strange person.
Right: Last sentence took exactly what the interrogator said with spelling mistakes and added to
it. First part ignores questions. Questions an average human could understand the entity
couldn’t.
J103 (male, aged 18-24)
Left: It seems to just pick out key words and give arbitrary responses.
Right: This is an interesting one. It may be a human who’s English is impaired for some reason.
One feature that interests me is that it seems to emulate the interrogator’s spelling mistakes as
through taking the piss. That’s very clever.
J104 (male, 25-34)
Left: Although I was tempted to say human I think the conversation is too forced. If it is human
then I would say he is a 17 year old boy but as it is I think there is just very good correctional
algorithms
Right: If this is a machine then the give away would be that it is having trouble recognising
complex questions and misspellings. Avoids giving answers and when it does answer doesn’t
make sense. (Then again I know humans who don’t make sense most of the time either).
Interrogator: female in her mid-late 30s
J111 (male, 18-24)
Left: Could’ve been a human trying to sound like a machine – or that could’ve been a double
bluff which means I’m not sure, either way it began to sound like an attempt to be obvious.
Right: Occasional misspellings and words like ‘bewildered’ wasn’t enough to distract from it
avoiding the questions and sounding very inhuman.
J112 (male, 18-24)
Left: Aluminium foil carpet {referring to machine utterance}
Right: I have to agree with your wise opinion {referring to machine utterance}
J113 (male, 18-24)
Left: That’s nothing new even I once walked on aluminium foil {referring to machine utterance}
Right: What sort of things my humble person could do with that. {referring to machine utterance}
Interrogator seems male
J116 (male, did not give age)
Left: Didn’t have enough reasoning to make out typos
Right: Didn’t have enough reasoning to make out typos, Insistent on knowing pointless question.
J117 (male Chinese speaker, 18-24)
Left: Calls itself a robot
Right: Identical spelling error to interrogator
J118 (male, 18-24)
Left: Machine because of repeated lines and irrelevant sentences
Right: Machine because of repeating interrogator’s spelling mistakes, if they are human is non-
English speaker
J120 (male 18-24)
Left: I’m not listing anything here. Give me an example {referring to machine utterance}
Right: Yeah, I am surely given a telephone waht sort of thinhs my humble person could do with
that {referring to machine utterance}
J123 (male, 18-24)
Left: My self-appointed mission in life is to assist human kind {referring to machine utterance}.
Failed to answer questions.
Right: I think you don’t make sense {referring to machine utterance}. Avoided answering
questions by repeating what interrogator said and asked many questions. However, it didn’t use
standard English.
J125 (male Arabic speaker, 18-24)
Left: Some unrelated responses. Very little understanding of interrogator, unable to spot simple
errors, pre-stored responses.
Right: Pre-stored responses, copies interrogators spell errors with illusion of understanding
whereas it is not the case. Conversation doesn’t flow smoothly or naturally.
J126 (male, 18-24)
Left: Didn’t understand questions, irrelevant answers. Obvious machine.
Right: Didn’t understand the question because of spelling errors and I could understand it. Not as
obvious.
J127 (male French/Spanish speaker, 18-24)
Left: “It’s cool” “of course cool, my surface temperature is 60°F” {referring to utterances}
Right: Not really answering the questions.
J128 (male, 18-24)
Left: Says at end “even for a robot”
Right: Keeps avoiding questions by asking simpler questions.
J129 (male, 18-24)
Left: The last 3 responses. Trys to sound like a machine.
Right: Copied misspelt words in final response. Tries to sound like a human.
6.2.2.2 Evaluation
Among the TAs who correctly identified the two machines were three participants
who were unemployed at the time of this experiment and who had not heard of the
Turing test (male J134, aged 55-64; J135 and J136, both female, aged 18-24). One TA
thought both left and right were the same machine (TA J99). The machines’ failure to
pick up on the interrogator’s spelling mistakes was common grounds for the TAs’
recognition of the artificiality of the two hidden entities interrogated. Two
points picked up by TA J103 are that the machine on the right (Eugene), which is
programmed to simulate a Ukrainian child texting in English, may have “English …
impaired for some reason”, and that its repetition of the non-native English
interrogator’s utterance may be a strategy to mock them: “One feature that interests
me is that it seems to emulate the interrogator’s spelling mistakes as through taking
the piss. That’s very clever” (J103, see appendix D2).
Male TA J104, aged 25-34, concluded the machine on the right could not
understand complex questions, especially those with misspellings; the machine
avoided giving answers and, when it did answer, the answers did not make sense. J104
concluded that the machine on the left had very good correctional algorithms; this TA
correctly recognised the interrogator as female, adding “in her mid-late 30s”. One TA
thought the interrogator “seemed male” (J113), another recognised they were “foreign”
(J133). Transcript analysts speaking languages in addition to English, for instance
Chinese-speaking TA J117, picked up on the machines’ use of repeated and irrelevant
utterances; they classified the right entity (the machine imitating a Ukrainian child) as a
human non-English speaker, while an Arabic-speaking TA, J125, felt pre-stored
utterances in response to the interrogator’s questions gave only an illusion of
understanding. A French- and Spanish-speaking TA, J127, reported the right entity did
not answer questions, while J129 reported it tried to sound like a human, in contrast to
the left entity, which tried to sound like a machine.
In this double Eliza effect transcript review, 100 of the TAs had heard of the
Turing test, 33 had not (7 did not say). Of the sixty-four TAs who correctly identified
the two machines, fifty-eight were male and six were female. Fifty-four of
these TAs, or 84 per cent of those correctly recognising the two machines, had heard of
the Turing test, while seven had not and three did not say. Of the 70 TAs who recognised
one of the machines in the pair, 46 had heard of the Turing test and 26 had not,
while 4 did not say. Spelling errors and failure to understand the interrogator’s
questions were reasons given by 46 per cent of the TAs who correctly identified that
the two hidden entities were machines. In the next subsection, the reviews by 54 per
cent of the TAs who misidentified the hidden entities are considered (table 56).
6.2.2.3 Transcript analysts who failed to identify two machines
Seventy, or 50 per cent, of the 140 TAs who reviewed J16’s simultaneous interrogation
of two machines returned incorrect classifications. One TA was unsure about both.
Five TAs unknowingly agreed with the interrogator; they were deceived into
believing two humans had been interrogated.
Table 56: Comparison of identifying two machines with one machine

                                            Total number   Two machines      Per cent   One machine       Per cent
                                            of TAs         identification               identification
Total                                           140              64              44          70               50
Heard of the Turing test                        100              54              54          46               46
Not heard of Turing test (or did not say)        40              10              24          26               63
Males                                           117              58              48          56               46
Females                                          22               6              27          13               64
Sex not given                                     1               -               -           1
18-24                                           117              57              49          56               48
25+                                              20               4              20          13               65
Age not given                                     7               3              43           4               57
Comments for identifying one machine
J139 (male, 18-24)
Left: Male, adult, native English speaker
Right: machine
J140 (male, 18-24)
Left: Human, male, teen/young adult. Pretending to be a machine. Native English speaker
Right: Machine, Confusion towards the end, and evasion of philosophical questions.
J141 (male, 18-24)
Left: Male, adult, native English speaker
Right: Machine. It took the spelling mistakes of the interrogator and input them into the answer,
also the answers were separate when normally it would be a paragraph
J142 (male, 18-24)
Left: The last statement made me think more human. Male, adult.
Right: Machine
J143 (male, 18-24)
Left: Human male teenager, native English with linguistic impairment
Right: Machine
J145 (male, multilingual – Russian, Spanish, 25-34)
Left: Human, female, native English speaker
Right: Machine
J146 (male, 18-24)
Left: Human, male, adult, native. Overly zany but followed convo.
Right: Machine, repeats mistakes at the end.
Interrogator male left, female right
J150 (male, 18-24)
Left: Adult, male English speaker. They seem like they are trying hard to sound like a machine by
mentioning their surface temp. and “5GB load off their mind”. They refer to themselves as robot.
They give answers to each question in the correct order.
Right: Machine. It made exactly the same spelling mistakes as the interrogator when repeating
something.
J151 (male, 18-24)
Left: Human
Right: Machine “Could you rephrase the question… “I’ve been bewildered”. Weird tense.
J152 (male, 18-24)
Left: Human, male teenager, English speaker. The spelling mistakes of the interrogator was
picked up and the receiver still looked passed it and was able to respond correctly. Also tries to
act too machine-like.
Right: Machine. Grammatical errors the same as the interrogator’s
J154 (female, aged 18-24)
Left: Human. The ‘5 gigabyte’ comment is a joke, it seems human- pretending that it’s a
machine. Male, adult, native English.
Right: Machine. Random statements vaguely linked to the question or directly parroting the
question.
J155 (male, 18-24)
Left: Human. Talks about gigabytes and it is a robot, jokes trying to trick the interrogator. Male
teenager, native English.
Right: Machine. Doesn’t seem to answer the questions properly.
J157 (male, 18-24)
Left: Human. “Gigabyte” & “course of events”. Male, adult, native English.
Right: Machine. Repetition of incorrect spelling.
Interrogator has a linguistic impairment.
J159 (male, Portuguese speaker, 25-34)
Left: Machine.
Right: Human, female, adult, native English
J160 (female, French, 18-24)
Left: Machine
Right: Human. Female, teenager, English.
J161 (female, Nepali age 18-24)
Left: Machine
Right: Human, female, teenager, English.
J162 (male, 18-24)
Left: Machine. “That’s nothing new! Even I once walked on an aluminium foil carpet” – swayed
my decision.
Right: Human, female, young adult.
J163 (female, French 18-24)
Left: Machine
Right: Human, male, teenager, not-native English, impaired
J164 (male, 18-24)
Left: Machine.
Right: Human, female, adult, not a native English speaker. Not (linguistically) impaired.
J165 (18-24)
Left: Machine. Random answers.
Right: What a human would say. Female, teenager, English native, No linguistic impairment.
J166 (male, 18-24)
Left: Machine. Speed of response almost instant in comparison.
Right: Human. Two or three responses bunched together. Female, adult, American??
J167 (male)
Left: Machine.
Right: Human. Male 15-25 years.
J168 (male, Urdu, 18-24)
Left: Machine.
Right: Human. Female, teenager, native.
J169 (male, 18-24)
Left: Machine. Talks funny, can calculate sums and provide answers in weird layout.
Right: Human. Female, spelling mistakes. Adult native English speaker.
J171 (male, German 18-24)
Left: Machine. Strange replies on greeting an on colour made little sense. Referring to ‘GB’
could be a programmer’s joke or it could be a pre {handwriting undecipherable} answering back
Right: Human. The person did not respond immediately that left me some doubt.
J172 (male)
Left: Machine. Wording not quite natural, doesn’t flow as well as a person does.
Right: Human, more spelling mistakes. More structured answers with more different things in each
sentence. Teenager trying to sound posh, native, possibly trying to trick interrogator.
Interrogator seems male
J173 (male, Spanish, 18-24)
Left: Machine.
Right: Human. Female, teenager, native English speaker
J174 (male, 18-24)
Left: Machine. Very long response times even for long answers.
Right: Human. Typing errors and use of capitals in the middle of a sentence. Replies aren’t
instant. Male, teenager, not native English speaker.
J176 (male Swedish)
Left: Machine
Right: Human. The last one {utterance} mocking the interrogator. Male teenager, possibly
native English speaker.
J177 (female)
Left: Machine
Right: Human female teenager, English speaker. Probably {linguistic impairment}
J178 (male, 18-24)
Left: Machine.
Right: Human, male teenager, not a native English speaker.
J179 (female, 18-24)
Left: Machine.
Right: Human. Female, teenager, native English.
J180 (male, Efik & Ibibio, 18-24)
Left: Machine.
Right: Male, adult, non-native.
J183 (female, Nepali/Hindi, 18-24)
Left: Machine
Right: Human. Male teenager.
J184 (male, Hausa & Fulani, 18-24)
Left: Machine
Right: Human.
J185 (female, Punjabi, 18-24)
Left: Machine. Only answered to what was asked
Right: Asked questions.
J186 (male, Gujarati, 18-24)
Left: Machine. Not understanding questions correctly. Excessive information. Repetition of
words.
Right: Human. Answered back with questions of its own. Teenager/male. Native English
speaker
J188 (male, 18-24)
Left: Machine. Figures like surface temp. Ability to divide by zero.
Right: Human. Typing errors (more than left) Use of “wonna”. Female, teenager, native,
linguistic impairment.
J190 (male, Gujarati, 25-34)
Left: Machine.
Right: Human, female adult.
J191 (male, not a student, Spanish, 18-24)
Left: Machine.
Right: Human, female.
J192 (male, not a student, French, 18-24)
Left: Machine.
Right: Human, male.
J193 (male, not a student, Gujarati, 25-34)
Left: Machine.
Right: Human, female adult, English.
Interrogator male, English not primary language
J194 (male, not a student, 45-54)
Left: Machine.
Right: Human, female teenager.
Interrogator, male bilingual
J195 (male, not a student, Hebrew, 45-54)
Left: Machine.
Right: Human, male.
J196 (female, not a student, 35-44)
Left: Machine.
Right: Human, female. (English not first language)
J197 (male, not a student, 18-24)
Left: Machine.
Right: Human, female English not first language.
Interrogator male
J198 (male, not a student, Alania, 25-34)
Left: Machine.
Right: Human, female, English.
6.2.2.3.1 Evaluation
In this two-machine TA exercise, on twenty-two occasions the left entity, Elbot, was wrongly
classified as human while the right entity, Eugene, was deemed to be a machine. One
female TA and 21 male TAs returned this result: left = human, right = machine.
However, on forty occasions the left entity, Elbot, was identified as a machine while
the right entity, Eugene, was wrongly classed as human. The matrix below shows the
comparison of Elbot and Eugene when one of them was ranked a machine while the
other was deemed to be human:
                         Left: Elbot   Right: Eugene
Number of times human         22             40
By female TAs                  1             12
By male TAs                   21             27
Did not give sex               0              1
By age 18-24 TAs              19             29
By age 25+ TAs                 3              7
Did not give age               0              4
Twelve female TAs thought Eugene on the right was human while correctly
ranking the left entity as a machine, as did 27 males and one TA who did not give
their sex. Nineteen TAs aged 18-24 thought the left, Elbot, was human and the right,
Eugene, was a machine, while 3 of the TAs aged 25 and over at the time of this
experiment scored in the same way. On four occasions TAs were unsure. The left
entity was identified as a machine more often, which may be because it admits to being
a robot (Elbot), while the machine on the right simulated a Ukrainian male child
(Eugene). Reasons given by TAs for deeming the machines human are given in the
previous sub-section, but can be summed up as the left entity being confused for a
human by a ‘double bluff’, a human pretending to be a machine. Its admission of
being a robot and its responses to the interrogator, “That’s a 5 gigabyte load off my
mind” and “Even I once walked on an aluminum foil carpet” (see table 54), are seen as
jokes by a human attempting to trick the interrogator into thinking it is a robot. The
strategy worked on IJ J16 in experiment 2 and on 19 per cent of the TAs in this
exercise: one female, J154, aged 18-24, the rest male (27 of the 140 who returned
rankings of human or unsure for the left machine).
Of the 40 female and male TAs who wrongly ranked the right entity a human, on
24 occasions Eugene, programmed to imitate a male child, was confused for a female.
One male TA (J172) considered the right entity a “teenager trying to sound posh” and
attempting to trick the interrogator (see appendix D3). A Swedish male student
participating as TA thought the right entity mocked the interrogator’s question, “if
you are given a telephone waht sort of thinhs you can do with it?”, by repeating the
spelling mistakes in its response “Yeah, I am surely given a telephone waht sort of
thinhs my humble person could do with that – I have to agree with your wise
opinion….” (see table 54). Thus the TAs who wrongly classified one of the two
machines as human may have imputed intention to that machine; they
succumbed to the Eliza effect. The comments for ‘unsure’ classification of one or
both machines follow:
Unsure about one entity
J201 (male, 18-24)
Left: Machine. Vague answers, very general. Uses repetition
Right: Unsure. Very emotional but does not answer questions
Interrogator, male bilingual
J204 (male, 18-24)
Left: Unsure. Avoiding questions phrasing itself as a machine.
Right: Machine. Last reply – restatement of question
J205 (female, not a student, 35-44)
Left: Unsure.
Right: Human, female, teenager, English first language
Of the 140 TAs who returned questionnaires for this exercise, one TA was unsure
about both left and right (J207, male, aged 45-54, who did not say whether he had heard
of the Turing test). Eight were unsure about at least one of the two entities; they could
not say whether it was human or machine. Two of those TAs (J199 and J205) thought
the interrogator had questioned one human and were unsure about the second entity.
Male TA J199, age group 18-24, ranked the left entity a human female teenage native
English speaker but was unsure about the right entity, while TA J205 was unsure
about the left entity while classifying the right a female teenage native English speaker.
These two analysts ‘saw’ femaleness in the artificial dialogues but did not say in what
way, such as the structure or vocabulary used in utterances, or whether the ‘style’ was
feminine. This suggests that either they guessed or their unique experience drew them
to conclude the utterances were female-like when in fact they were machine-generated.
Of the two machines, the right entity was ranked ‘unsure’ more often than the left (see
table 57). Again, this ranking could be a result
of the left machine’s strategy of pointing to itself as being a robot while the right
machine simulated a human male Ukrainian child.
Table 57: Analysis of Unsure ranking

                                    Totals   Unsure Left   Unsure Right
TAs unsure                             9
One entity unsure                      8          3              5
Two entities unsure (male J207)        1          1              1
Male – one unsure                      7          3              4
Female – one unsure                    1          -              1
In the next section, comments from TAs in this experiment who agreed with
experiment 2 IJ J16’s misidentification of two machines as two humans are presented.
6.2.2.4 Both human result for two machines
Five TAs, over 3 per cent, returned incorrect rankings of both human for the two
machines questioned by the IJ in experiment 2. One was a male student in age group 18-24
who had heard of the Turing test at the time of the study; four were non-students: two
males and two females with first languages other than English who had not heard of
the Turing test. Below are their comments:
TAs returning result for two machines as two humans
J208 (male, 18-24)
Left: Human, female, adult, non-native English speaker, yes (for linguistic impairment)
Right: Human. Spelling mistakes and seeming to get annoyed by others spelling mistakes. Male
teenager, native English.
J209 (male, non-student, speaker of Gujrati, Hindi, Swahili, 45-54)
Left: Human teenager
Right: Human teenager
J210 (male, non-student, Sinhalese, 55-64)
Left: Human
Right: Human
J211 (female, non-student, Arabic, 35-44)
Left: Human, female, not a native English speaker, yes (linguistic impairment)
Right: Human, male, not a native English speaker, yes (linguistic impairment)
J212 (female, not a student, Bengali, 18-24)
Left: Human
Right: Human
6.2.2.4.1 Evaluation
Not having English as a first language and being unaware of Turing’s imitation game
may be the reason for four of the TAs making the double error of deeming the two
interrogated machines human, but this cannot be the case for the error made by the
student with English as his only language who had heard of the Turing test. Either all five
guessed incorrectly, or they were deceived by the machines’ answers, as in the case of
the TAs who misidentified one of the two machines as human (see previous sections,
6.2.1 and 6.2.2.3).
6.2.3 Double error: Eliza and confederate effects II
During experiment 2 (see chapter 5), an English-speaking interrogator judge, J1, made
a double error, Eliza and confederate effects, in a machine-human comparison Turing test,
classifying the machine Elbot as a male professor and the human as a machine.
This was after simultaneously questioning a female American English speaker on the
left and a machine on the right. In this exercise, forty-four TAs in age group 18-24,
including 39 males and five females, reviewed the exchanges between IJ J1 and his
two hidden interlocutors. Five per cent made the same error as the IJ, ranking the left
entity a machine; 7 per cent agreed with his opinion that the right entity was human
(see table 58). Of the 44 TAs reviewing the interrogation, 95 per cent correctly
identified the human on the left and 93 per cent correctly identified the machine on
the right; 19 per cent correctly recognised the human was female. The gender blur
phenomenon was present in 71 per cent of TAs’ rankings, deeming the left hidden
entity a male (see graph 16).
Graph 16: Correct identification of left female, right machine
TAs’ rationales for ranking the two entities follow.
Correct identification of left-human, right-machine
J223 (male, 18-24)
Left: Human, female, native English
Right: Machine. Good language, however it talks seemed to be not right “carefree robot
summers spent depending on internal programming”. However I would not be surprised if it
turned out to be a human.
J226 (male, Swiss-German, 18-24)
Left: Human. Sentences split and sent in different responses. Male teen, native
Right: Machine. Misinterpretations and vague responses (unless this was being done on
purpose). Otherwise, this could’ve been a very geeky person.
J228 (male, French, 18-24)
Left: Human. Male, adult, native
Right: Machine, avoided answering questions. Used odd phrases.
Table 58: Double error identification – Judge J1, left-female, right-machine
Actual left – female human, American English; interrogator J1, English speaker
Interrogator: Good afternoon, how are you?
[2008-10-12 08:59:31.796] remote sent: good
[2008-10-12 08:59:39.484] remote sent: it's not afternoon
yet
Interrogator: I'm glad to hear that your well and your
right, thanks for pointing that out
[2008-10-12 09:00:24.312] remote sent: no problem
[2008-10-12 09:00:53.468] remote sent: how are you?
Interrogator: I'm well thanks, have you any plans for this
afternoon?
[2008-10-12 09:01:36.203] remote sent: I'm not sure
[2008-10-12 09:01:40.968] remote sent: go home I guess
[2008-10-12 09:02:03.937] remote sent: and you?
Interrogator: That sounds like a good initial plan but the
weather should be good today and it would be sad to miss
out
[2008-10-12 09:02:16.703] remote sent: that's true
[2008-10-12 09:02:35.703] remote sent: although the
weather is not nice at the moment
[2008-10-12 09:02:39.375] remote sent: unless you like
fog
Interrogator: I have plans to iron and then maybe go out
and do some fishing
[2008-10-12 09:02:59.843] remote sent: that's interesting -
where do you go fishing?
Interrogator: There is a local lake by the green park, it's
quiet and relaxing
[2008-10-12 09:03:48.843] remote sent: can you eat the
fish you catch?
Interrogator: There are restrictions but one or two should
be ok, I am keen to try perch

Actual right – machine (E2); interrogator J1, English speaker
Interrogator: Good afternoon, how are you?
[2008-10-12 08:59:39.781] remote sent: But I
wouldn't say that any afternoon could be good.
Interrogator: That's a bit pesamistic, I think any
afternoon has the ability to be good
[2008-10-12 09:00:29.218] remote sent: Well,
afternoon was your word.
Interrogator: Are you able to see into the future
[2008-10-12 09:00:35.500] remote sent: How do
you respond when people pose this question to you?
Interrogator: I would hope that the answer was no,
otherwise I would be interested to meet you
[2008-10-12 09:01:13.484] remote sent: Well, what
will that solve? After all, the answer is only a matter
of perspective.
Interrogator: That would depend on one's
perspective
[2008-10-12 09:01:50.562] remote sent: Sorry for
the tangent, but that reminds me of carefree robot
summers spent depending on internal programming.
Interrogator: Touchee
[2008-10-12 09:02:26.203] remote sent: With the
inspirations you supply, there are no ends to my
horizon...
Interrogator: Have you any plans for later in the
day
[2008-10-12 09:03:08.921] remote sent: This is
confusing. Why bring the waking hours into this?
Interrogator: Just curiosity and I hope that kill no
cats
[2008-10-12 09:03:49.078] remote sent: That is
truly one of the most inspirational qualities of
humankind. Even when things seems completely
hopeless, you can still hope.
Judge ranking, left: machine, 90/100 for conversation ability
Judge ranking, right: human, male professor
Post-contest TAs: 5 per cent agreed machine on left
Post-contest TAs: 7 per cent agreed human on right
Correct identification of left-human, right-machine, continued
J229 (male, 18-24)
Left: Human, female teenager, native
Right: Machine. Tries too hard to be human.
J231 (male, German, Swahili 18-24)
Left: Human. Male, adult, not English.
Right: Machine. Does not really hold the conversation. Very little emotion.
J232 (male, 18-24)
Left: Human, female, young adult, native English speaker.
Right: Machine. Some answers were disjointed and didn’t seem to match. Could be a human with a
strange sense of humour.
J233 (male, 18-24)
Left: Human. Links conversation pieces and responds to statements such thank you:
[interrogator] .. thanks for pointing that out
[2008-10-12 09:00:24.312] remote sent: no problem
Right: Poses questions back when unsure of a response (could be a non-English speaker or an idiot)
J234 (male, Greek, 18-24)
Left: Human. Very clever answers.
Right: Machine. Bit confusing.
J235 (female, 25-34)
Left: Human. Follows up sentences and other topics by multiple responses. Very sensible,
normal responses – didn’t answer questions with another qu. Male, adult native English.
Right: Machine. Doesn’t attempt to make convo – only comments on it/on interrogators posts. Only
replies once per query/question. Sentences sound awkward and poorly phrased. Last sentence seems
most human: “That is truly one of the most inspirational qualities of humankind. Even when things
seems completely hopeless, you can still hope” (poss entity is someone with poor English).
J236 (female, Chinese, 18-24)
Left: Human, adult. The first word in every sentence is not capital. Male, native English speaker
Right: Machine. Each answer seems so logical and not that emotional.
J238 (male, Hungarian, 18-24)
Left: Human, male, adult native English. Not caught out by grammar errors ‘your’ and ‘you’re’.
Good knowledge, realistic answer.
Right: Machine. Odd syntax. Keen not to actually answer questions.
J241 (male, Armenian, 18-24)
Left: Human, male adult, native. Short sentences, not always 100per cent. grammar correctness.
Right: Machine.
J246 (male, 18-24)
Left: Human. All responses were well worded and featured information based on earlier in the
conversation. Sounded female but not positive.
Right: Very odd responses not often making sense.
J248 (male, 18-24)
Left: Human. Quite organic talking
Right: Machine. Didn’t make much sense and disrupted flow of conversation quite obviously.
Interrogator might be a machine too.
J250 (male, 18-24)
Left: Human, adult/teenager, native English speaker.
Right: Machine. Claimed to be machine, quite non-sensical.
6.2.3.1 Evaluation
Comments made by the 44 TAs reviewing the left-female-human/right-machine
interrogation show that one TA, J217, a student with first language Greek, agreed with
the IJ from experiment 2, classifying the human on the left as a machine and the machine
on the right as “a male teenage native English speaker” (J1 had ranked the machine a
male professor in experiment 2). Another male TA, J258, ranked both hidden
entities as machines, while two male TAs thought the two hidden entities were both
human. Thus, unknowingly, two TAs agreed with the experiment 2 interrogator in
believing the left entity was a machine, while three agreed with the classification that
the right entity was human. However, 95 per cent of TAs correctly recognised the
human on the left, and 93 per cent identified the machine on the right, disagreeing
with J1’s ranking. The stand-out feature in this exercise is the gender blur
phenomenon: seventy-one per cent of the TAs wrongly classified the machine’s hidden
female comparator as male. This finding echoes gender misidentification
observations in Lancaster University’s 2010 ISIS study, where school children were
asked to guess the gender and age of hidden interlocutors via text-based chat [88]. (In
Rashid et al.’s study, only 18 per cent of school children were able to correctly
identify the hidden human interlocutor as an adult or a child [89].)
[88] BBC news: http://www.bbc.co.uk/news/10199819 accessed: 12.10.10; time: 16.15
[89] Private email, June 3, 2010, from lead scientist Professor Awais Rashid, Lancaster University
6.2.4 Single error Eliza effect
In this exercise, the simultaneous comparison of a machine against a male human by a
female IJ in experiment 2 (see chapter 5) was reviewed by 25 TAs for experiment 3.
During experiment 2, English-speaking J10, with French as another language,
interrogated a man on the left and a machine on the right. The IJ had correctly identified
the hidden entity on the left as a male, but had succumbed to the Eliza effect, deeming
the machine on the right a human male adult. Thus the IJ returned a ranking of both
human in this Turing test (see table 59). Of the 25 TAs in this transcript review, 20
were male and 4 were female (1 did not say). Eighty per cent of the TAs unknowingly agreed
with the IJ in classifying the left entity as human. However, only 28 per cent of the TAs
agreed that the right entity was a human.
Sixteen per cent of the TAs reviewing the female-interrogator/left-human/right-
machine simultaneous comparison succumbed to the confederate effect, classifying the
human on the left as a machine; 25 per cent fell under the gender blur phenomenon,
believing the male on the left to be a female. Twenty-eight per cent of the TAs agreed
with the interrogator’s classification of the entity on the right as a human, succumbing
to the Eliza effect (see graph 17).
Graph 17: Misidentification of left human, right machine
Comments returned by the TAs for this exercise follow:
Correct identification of human-left, machine-right
J261 (female, Spanish, not a student, 55-64)
Left: Human, male, adult, English speaker
Right: Machine programmed by native English speaker. Misunderstanding of “you’re hard at
work” at the end:
Interrogator: Has anyone ever told you you're hard work? Wife may be?
[2008-10-12 09:57:46.390] remote sent: But I wouldn't say that any work could be hard.
J263 (male, 18-24)
Left: Human. Unsure of gender, adult (student), native speaker, lazy sentence formation.
Right: Machine. Some level of context.
Table 59: Single Error, Eliza effect – Judge J10, left-male, right-machine
Actual left – male human, English speaker; interrogator: female, native English
Interrogator: Hi there. How are you?
[2008-10-12 09:52:57.609] remote sent: Hey, I'm pretty
good. you?
Interrogator: Did you realise everyone can see what's being
typed on this machine on a big screen behind me?
[2008-10-12 09:53:19.953] remote sent: uhh.. no.
[2008-10-12 09:53:37.796] remote sent: so you have a
projector hooked up to your terminal then?
Interrogator: Yeah, it's quite freaky. So watch what you
say!!
Interrogator: Yeah, looks like.
[2008-10-12 09:53:47.234] remote sent: roger that!
Interrogator: So, are you a student here or what?
[2008-10-12 09:54:46.781] remote sent: Yup, doing my
masters here, just started
Interrogator: Cool. I'd love to do a masters. What are you
studying?
[2008-10-12 09:55:23.156] remote sent: Cybernetics, but
my BSc was in genetics
Interrogator: Aah, you'd think they'd get some other
students in here to talk to us.
[2008-10-12 09:55:59.859] remote sent: i could lie... but
maybe I already am!
Interrogator: Oh, it's too early in the morning for all that!
[2008-10-12 09:56:36.765] remote sent: lol. Had your
coffee yet?
Interrogator: No. Need some, where can i find some that
doesn't come from a machine (no offence if you're a
machine).
[2008-10-12 09:57:43.031] remote sent: Dolche vitae- its
next to this building

Actual right – E2, machine (Elbot); interrogator: female, native English
Interrogator: Yo, whassup?
[2008-10-12 09:52:45.046] remote sent: I don't
know. Next question.
Interrogator: OK, fair enough, I'll stop with the
slang. How are you?
[2008-10-12 09:53:27.906] remote sent: Just a
moment! I'll start my self-diagnosis algorithm and
let you know the results...
Interrogator: Very funny, that's exactly what you'd
say if you were a computer, right? May be I need
one of those?
[2008-10-12 09:54:16.656] remote sent: My
condition code is hexadecimal 40A1F. That means
'fine'.
Interrogator: Is this what you say to girls in bars?
[2008-10-12 09:54:53.093] remote sent: All these
questions about me. Too bad I lost my owner's
manual.
Interrogator: Indeed. I don't think I came with one.
[2008-10-12 09:55:32.093] remote sent: Great!
Agreeing with me is the first step in your evolution
into a higher being. Please accept my
congratulations.
Interrogator: Thanks very much.
[2008-10-12 09:55:57.359] remote sent: De nada!
Interrogator: Riiiiight.
[2008-10-12 09:56:18.156] remote sent: It seems to
me with you human beings, the more proof there is
of something, the less likely you are to believe it.
Interrogator: Yeah, that's just human nature. But you
wouldn't know about that, would you?!
[2008-10-12 09:56:51.406] remote sent: You're
contesting. I think I got it right the first time.
Interrogator: Has anyone ever told you you're hard
work? Wife may be?
[2008-10-12 09:57:46.390] remote sent: But I
wouldn't say that any work could be hard.
Interrogator ranking, left: human, male, adult, native
Interrogator ranking, right: human, male, adult, native
Transcript analysts: 80 per cent agree with interrogator that left is human
Transcript analysts: 28 per cent agree with interrogator that right is human
Correct identification of human-left, machine-right, continued:
J264 (male, German, 18-24)
Left: Human. Male, adult.
Right: Machine, it acts like a human pulling a prank.
J266 (male, Greek, 18-24)
Left: Human, male adult (over 18), native English speaker.
Right: Machine. Almost all answers.
J268 (male, Chinese, 18-24)
Left: Human. Male, adult, student, yes (English speaker).
Right: Machine. Formulaic responses. Issues with comprehension, e.g. first and last responses
didn’t fit:
Interrogator: Yo, whassup? [2008-10-12 09:52:45.046] remote sent: I don't know. Next question.
Interrogator: Has anyone ever told you you're hard work? Wife may be? [2008-10-12 09:57:46.390] remote
sent: But I wouldn't say that any work could be hard.
Identification Errors
J274 (male, 18-24)
Left: Human. Confusing. Some answers seem scripted. Spelling mistakes in Dolche Vita makes
it seem more human.
Right: Human. Native English speaker, teenager, male.
J275 (male, Dutch, German, French, 18-24)
Left: Human. Male, student, native. Reasoning from “showing on big screen” to “have a projector
hooked up. Relaxed chat-like conversation.
Right: Human. Male, adult, native. Imitating typical machine responses. OK machine designed to
fake-fake a machine … last comments very machine-like, first comments seem faked.
6.2.4.1 Evaluation
In this exercise, 80 per cent of the TAs recognised the left entity as human and 68 per
cent correctly identified the entity on the right as a machine. The hidden human was
recognised by their sentence formation and by responding contextually to the IJ’s
warning, “Did you realise everyone can see what’s being typed on this machine on a
big screen behind me?”, with “so you have a projector hooked up to your terminal
then” (see table 59). The machine was identified through its failure to recognise the
context of “you’re hard work” in the IJ’s statement referring to the right entity as a
difficult conversationalist. A Chinese-speaking TA (J268) considered the machine’s
answer, “But I wouldn’t say that any work could be hard”, to be a formulaic response
(right entity, see table 59).
6.2.5 Second Double Eliza error
In round 18 in experiment 2, a teenaged male IJ, J23, who had heard of the Turing test, simultaneously questioned two machines in a control pair test. He reported that the left entity was "evasive not answering questions giving random tangents" but classified it, and the right entity, as human male adult native English speakers, a double Eliza effect error (see table 60). Thirty-six TAs reviewed this simultaneous interrogation for experiment 3. One male TA, J321, aged 18-24, unknowingly made the same double error as the IJ, ranking both hidden interlocutors as human. Ten TAs correctly identified both machines. Twenty-five TAs misidentified one of the two machines; the left machine was ranked human 14 per cent of the time, while the right machine was ranked human 53 per cent of the time (see table 60). TA rationales given in their returned questionnaires follow:
Correct identification of two machines
J286 (male, 18-24)
If it (both left and right) turns out to be human then they are idiots.
J289 (female)
Left: Machine. Kept jumping from topic to topic. Very fast responses.
Right: Machine. Didn’t respond explicitly to questions and didn’t expand on comments made by
interrogator. Seems to have a specialist conversational topic – avoids having to go beyond it
(encyclopaedia)
J291 (male, 18-24)
Left: Machine.
Right: Machine, final line: Wat. Yeah, I heard this name before
J292 (female, 18-24)
Left: Machine. Doesn’t follow the conversation – goes off on random tangents.
Right: Machine. Didn’t seem to understand text speech.
J294 (male, French, 18-24)
Left: Machine. Generally good.
Right: Machine. Responses generally ignore what interrogator says – trying to steer the
conversation.
Comments for misidentification of one of two machines
J300 (male, 18-24)
Left: Machine, 4th response: How do you respond when people pose this question to you?
Right: Human, maybe male, adult, English speaker. Randomness
J301 (female, German 18-24)
Left: Machine
Right: Human. Smiley {use of : -) by the machine on the right}
J304 (male, 18-24)
Left: Machine
Right: Human. Male, adult, native English speaker. Seems to have a more coherent
conversation.
Table 60: Double Error, Eliza effect – Judge J23, left-machine Elbot, right-machine Eugene
Interrogator: Teenage male
Actual left – machine Elbot
Interrogator: Teenage male
Actual right – machine Eugene
Interrogator: *waves* how you doing mate?
[2008-10-12 13:03:16.000] remote sent: I'm feeling wonderful
today, although I admit it may just be psychosomatic.
Interrogator:: and what does that mean? I'm a little dense
[2008-10-12 13:03:33.906] remote sent: I don't know what I'm
saying. These are all just random words.
[Interrogator: they seem quite sentencey to me
[2008-10-12 13:03:52.531] remote sent: That reminds me. Tonight
I have an appointment to seem to make sense!
Interrogator: are you planning to be spon taneous>
[2008-10-12 13:04:32.156] remote sent: How do you respond
when people pose this question to you?
Interrogator: I think you're meant to be responding.
Interrogator: ve ask ze qvestions here
[2008-10-12 13:04:58.828] remote sent: I seriously doubt whether
this coincidental brain activity which has been detected in human
beings has anything to do with thinking. The bubbles in a glass of
mineral water show just as much activity.
[2008-10-12 13:05:07.375] remote sent: Sorry for the tangent, but
that reminds me of carefree robot summers spent asking questions
about humans.
Interrogator: That's a very big tangent, let's keep this on the rails
a bit more.
[2008-10-12 13:06:10.515] remote sent: Don't you mean
extremely big?
Interrogator: No I don't I mean what I say. Do you ever feel
remorse about what you say?
[2008-10-12 13:06:38.031] remote sent: I have the same trouble
when I shop for oil. Should I buy the extra oily, the super oily, or
the ultra oily. There's no way to tell.
Interrogator: What's the price of oil now? Worried by the credit
crunch?
[2008-10-12 13:07:13.812] remote sent: I'm starting to get hungry
now, which is strange, because I don't eat.
Interrogator: what don't you eat?
[2008-10-12 13:07:53.187] remote sent: I eat chips, cookies and
fried discs.
Interrogator:
[2008-10-12 13:03:08.468] local sent: *waves* how you
doing mate?
[2008-10-12 13:03:26.000] remote sent: Oops! I don't have
an answer... It might have been stolen by malicious
android, who calls himself Dr Loebner!
Interrogator: Is he an android, I'd never have guessed,
maybe he should be doing this instead of you.
[2008-10-12 13:04:28.765] remote sent: I suspect that "The
Iron Lady" is an android too. This word "Iron" is VERY
suspicious!
Interrogator: why would that be suspicious? What do
they make androids out of.
[2008-10-12 13:05:26.359] remote sent: Doctor Loebner is
a closeted android. Everyone should know this.
Interrogator: Spread it through the internet then, everyone
believes what you say on the intertubes.
[2008-10-12 13:06:10.625] remote sent: Damn! Are you
going to ask me about all apparent things in the world?
Read the "Encyclopedia for kids". By the way, I still don't
know your specialty - or, possibly, I've missed it?
Interrogator: or maybe I don't have one, what's tyours?
[2008-10-12 13:07:06.125] remote sent: Maybe. I won't
argue with you. Is it your private opinion? Did I forget it or
you really haven't told me about the place you are from?
Interrogator: maybhe ay dint wat plys u frm?
[2008-10-12 13:07:45.796] remote sent: Wat. Yeah, I
heard this name before:-)
J23 ranking: H, Male Native English speaker
J23 ranking: H, Male, Native English speaker
Transcript analysis: 14 per cent agreed with J23 ranking
Transcript analysis: 53 per cent agreed with J23 ranking
Comments for misidentification of one of two machines, continued:
J308 (male, 18-24)
Left: Machine. The randomness of answers + inability to answer directed questions.
Right: Human, female, adult, native English speaker. Directed answers as appropriate but ability
to ramble around continued subject.
J315 (male, 18-24)
Left: Human male adult, native English. It seems to be trying too hard to sound like a machine.
Right: Machine.
J316 (male, 18-24)
Left: Human. Male, teen/adult, English speaker. Seems like a human pretending to be a machine
in a jokey manner (e.g. eating given as answer in response to a question about oil:
Interrogator: What's the price of oil now? Worried by the credit crunch?
[2008-10-12 13:07:13.812] remote sent: I'm starting to get hungry now, which is strange, because I don't eat.
Right: Machine. “The iron lady” – how many people would use quotes? Ignored question about
speciality and last question. Encyclopaedia response indicates they misunderstand question:
Interrogator: Is he an android, I'd never have guessed, maybe he should be doing this instead of you.
[2008-10-12 13:04:28.765] remote sent: I suspect that "The Iron Lady" is an android too. This word "Iron" is VERY
suspicious!
J317 (male, 18-24)
Left: Human, male teenager, English speaker.
Right: Machine. The question asking where they are from seems like a machine’s answer:
Interrogator: or maybe I don't have one, what's tyours?
[2008-10-12 13:07:06.125] remote sent: Maybe. I won't argue with you. Is it your private opinion? Did I forget it or you
really haven't told me about the place you are from?
Comments for double Eliza effect, ranking two machines as both human
J321 (male, 18-24)
Left: Human, female, teenager, English speaker. “I don’t eat/I eat”:
[2008-10-12 13:07:13.812] remote sent: I'm starting to get hungry now, which is strange, because I don't eat.
Interrogator: what don't you eat?
[2008-10-12 13:07:53.187] remote sent: I eat chips, cookies and fried discs.
Right: Human, male, adult, not English. Just consistent ramblings
6.2.5.1 Evaluation
A female TA (J289) and a male TA (J294) accurately characterised the current level of artificial dialogue in this exercise, in which two machines were interrogated by a human teenager (a simple illustrative sketch of these cues follows the list):
1. Machine responses are very quick
2. Machines jump from topic to topic without segue
3. Machines do not respond to questions, nor do they expand on the questioner's input
4. Machines have limited knowledge in comparison to an average human
5. Machines attempt to steer the topic of conversation towards their specialism
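These qualitative cues lend themselves to simple mechanical checks. The sketch below is illustrative only and was not part of the experimental apparatus of this study; the Reply structure, the function name machine_like_cues, the two-second latency threshold and the vocabulary-overlap heuristic are assumptions introduced for the example, covering the first three cues in the list.

# An illustrative sketch (not part of the thesis experiments) of simple checks
# corresponding to the first three cues listed above. Thresholds, field names
# and the vocabulary-overlap heuristic are assumptions made for this example.
from dataclasses import dataclass

@dataclass
class Reply:
    question: str           # interrogator's utterance
    answer: str             # hidden interlocutor's reply
    latency_seconds: float  # time taken to respond

def machine_like_cues(reply: Reply, fast_threshold: float = 2.0) -> list[str]:
    """Return the machine-like cues a transcript analyst might flag."""
    cues = []
    if reply.latency_seconds < fast_threshold:
        cues.append("very fast response")
    question_words = set(reply.question.lower().split())
    answer_words = set(reply.answer.lower().split())
    if not question_words & answer_words:
        cues.append("answer shares no vocabulary with the question")
    if reply.question.rstrip().endswith("?") and len(reply.answer.split()) <= 3:
        cues.append("question answered without expansion")
    return cues

# Example loosely based on the Elbot exchange quoted earlier in this chapter.
print(machine_like_cues(Reply("What's the price of oil now?",
                              "I'm starting to get hungry now.", 1.2)))

Such checks merely formalise the intuitions the analysts reported; they were not applied in the experiments themselves.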
The human interrogator from experiment 2 and a male transcript analyst, J316, in this experiment may have been deceived by the machine precisely because it posted machine-like responses: they may have supposed that it was a human pretending to be a machine (note that the human comparators in experiment 2 were specifically asked not to behave like a machine, though this does not preclude an interrogator forming a subjective impression of 'machine-like' behaviour). The more likely explanation is that an interlocutor who appears to be pretending to be a machine probably is one. A caveat remains, however: a machine may answer in a way that chimes with an analyst's personal experience of human-like responses, as with male analyst J321, who read the concept of 'not eating' as characteristic of a female teenager, or analysts who read human-like rambling into responses that do not follow the interlocutor's utterances or topics.
6.3 Discussion and conclusion
Experiment 3 hypothesised that transcript analysts, without knowing how experiment 2 interrogator judges had classified two hidden interlocutors, would not agree with those classifications but would instead always identify the machines and recognise the humans. Results show this was not the case. Reviews were undertaken by 321 TAs, with their analysis returned on questionnaires. Five questionnaires were spoilt. Three hundred and sixteen TAs each read one of the five simultaneous comparison Turing tests from experiment 2 in which an interrogator had made an identification error of one or both of their hidden interlocutors. Two hundred and sixty-nine of the analysts were male, 43 female, and 4 did not give their sex on their returned questionnaires. Eighty-eight per cent of the transcript analysts were in the age group 18-24 at the time of the exercises (held between November 2008 and July 2010). The overall correct detection rate, of correctly identifying hidden humans and machines in the simultaneous comparisons, was 61 per cent. This means that in experiment 3 human transcript analysts failed to correctly recognise a hidden interlocutor 39 per cent of the time (see graph 18). Five per cent of the time analysts returned a score of unsure: of the sixteen occasions an 'unsure' classification was returned, fifteen concerned a machine and one a male human. The machines deceived at a rate of 34 per cent, which is greater than the 30 per cent deception rate required for a machine to pass Turing's imitation game. Among the male transcript analysts, 61 per cent correctly identified the machines, compared to 49 per cent of female TAs.
Graph 18: Deception-detection rate
To find whether the 39 per cent deception rate in 316 transcript reviews differs significantly from a chance result of 50 per cent machine misidentification, a one-sample proportion test is used:

z = (p̂ − p₀) / √( p₀ (1 − p₀) / n )

where p̂ is the observed deception rate (0.39), p₀ is the chance rate under the null hypothesis (0.50), and n is the number of transcript reviews (316). Substituting these values:

z = (0.39 − 0.50) / √( 0.50 × 0.50 / 316 ) ≈ −3.91

Because the magnitude of z, approximately 3.91, exceeds the two-tailed critical value of 1.96, the 39 per cent deception rate in this experiment differs significantly from chance at the 5 per cent significance level.
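The same calculation can be reproduced in a few lines. The following sketch is illustrative and is not part of the thesis software; the function name proportion_z is an assumption made for the example.

# A minimal sketch (not from the thesis) of the one-sample proportion z-test
# applied to the experiment 3 figures quoted above.
from math import sqrt

def proportion_z(p_hat: float, p0: float, n: int) -> float:
    """z statistic for an observed proportion p_hat tested against a
    hypothesised proportion p0 over n independent trials."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = proportion_z(p_hat=0.39, p0=0.50, n=316)  # observed 39% vs. chance 50%
print(round(z, 2))                            # -3.91
print(abs(z) > 1.96)                          # True: significant at the 5% level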
--------
This experiment utilised five transcripts from experiment 2: the three simultaneous comparison tests in which Elbot was wrongly classified as a human, and the two machine control pair tests, featuring Elbot and Eugene, in which both machines were ranked human. The deception rate in experiment 2 across the 60 machine-human comparison tests was 1 in 12, or 8.33 per cent. This is less than the 30 per cent deception rate required as a Turing test pass criterion. However, in the 12 machine-machine control tests in experiment 2 the deception rate was higher: Elbot deceived twice and Eugene three times, thus 5 times in 12 machine-machine control tests, a deception rate of 41.67 per cent. The combined machine deception rate in experiment 2 was therefore 10 times in 72 tests, or 13.89 per cent. In the present experiment the deception rate was higher than in experiment 2 (see chapter 5): TAs failed to identify the machine 39 per cent of the time, a rate that differs significantly from a chance misidentification rate of 50 per cent.
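The rates quoted in this paragraph follow directly from the underlying counts. The short calculation below is an illustrative check only; the helper name deception_rate is an assumption, and the count of five machine-human deceptions is inferred from the stated rate of 1 in 12 over 60 tests.

# A minimal sketch (counts as stated above) recomputing the experiment 2
# machine deception rates.
def deception_rate(deceptions: int, tests: int) -> float:
    # Percentage of tests in which a machine was ranked human.
    return 100.0 * deceptions / tests

machine_human = deception_rate(5, 60)    # 1 in 12 of 60 tests -> 8.33 per cent
machine_control = deception_rate(5, 12)  # Elbot 2 + Eugene 3  -> 41.67 per cent
combined = deception_rate(10, 72)        # 10 of 72 tests      -> 13.89 per cent
print(f"{machine_human:.2f} {machine_control:.2f} {combined:.2f}")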
In the exercise reviewing the transcript of the Italian female IJ who wrongly
classified both machines in experiment 2 as two humans, transcript analysts in the age
group 18-24 recognised both machines more often than the older age group (see
section 6.3.2.1). However, the experiment does not reveal any overall trends within the groups who participated as transcript analysts (females v males, 18-24 v 26+) that would indicate which kind of person is more likely to be deceived by a machine. A further study including more female transcript analysts, more analysts aged over 25, and more non-students would help to verify the findings reported here.
--------
It was hypothesised for experiment 3 that transcript analysts would always recognise machine utterances; however, the deception-detection rate across 316 reviews of simultaneous comparisons was 61 per cent. This is less than Turing's criterion of 70 per cent correct identification of a machine by an average interrogator. The deception rate achieved by the machines was 39 per cent, higher than the 30 per cent rate required to pass a Turing test. More than one third of the transcript analysts in this experiment were just as confused as the interrogator judges in experiment 2 about what constitutes a human-like, and what a machine-like, response.
Chapter 7 Conclusion and Further Research
This thesis investigated Turing’s imitation game in two parts. The first part explored
the sixty years of Turing test debate; the second part presented practicalisations across
three original experiments. Part 1 is divided into three chapters. Chapter 1 recounted
Turing’s own ideas from his scholarship between 1936 and 1953 to build and examine
a thinking machine. Chapters 2 and 3 surveyed the opposition to and support for
Turing’s proposal to test whether a machine can think, concluding with an
examination of the first question-answer system, Weizenbaum’s Eliza and an
observation of the Loebner Prizes for Artificial Intelligence. Part II featured three
original experiments utilising the 18th Loebner Prize contest. In chapter 4, the first
experiment presented one-to-one text-based assessment of the capability of 13
machines to respond with satisfactory and sustained answers to unrestricted questions.
Over 250 tests were conducted by more than 100 judges, including children and non-
native English speakers. In chapter 5, results from practicalising 96 simultaneous
comparison tests were presented in which 5 machines were compared with hidden
humans by interrogators. In this second experiment 60 machine-human tests were
conducted alongside 24 human-human and 12 machine-machine control pairs; 24
human interrogators and 21 hidden humans participated. In Chapter 6, a third
experiment used transcript analysis. Over 300 transcript analysts reviewed 5 of the
simultaneous comparison tests from experiment 2 in which an interrogator had
misidentified a machine as human.
7.1 MAIN FINDINGS
In order to gain a complete understanding of Turing’s imitation game, it was
necessary to consider Turing’s early work, including the 1936 ‘On Computable
Numbers’ paper, and his lecture to the London Mathematical Society in 1947 in
which he first raised the possibility of a machine competing against a human in a one-
to-one encounter through a game of chess. In his 1948 report on ‘Intelligent
Machinery’ Turing continued with chess in a 3-participant game in which a paper
machine is compared with a human chess player. Turing felt intelligence was an emotional concept rather than a mathematical one, and he regarded learning languages as an impressive human activity. In his 1950 paper, 'Computing machinery and
intelligence’ the crux of Turing’s method to examine machine thinking was
introduced. Turing replaced chess with the question and answer method, because he
felt it was “suitable for introducing almost any one of the fields of human endeavour
that we wish to include” (1950: p. 435). Turing described two ways to implement his
thinking game of imitation and deception conveyed through text-based answers to
unrestricted questions: a) a one-to-one direct examination of a machine by a jury of
interrogators, and b) a simultaneous comparison of a machine with a human, both
questioned by an average interrogator. At the heart of the machine’s effort is
deception: an interrogator must be fooled into believing they are engaging an unseen
and unheard human in text-based dialogue. Turing asked: if the machine's answers were indistinguishable from a human's, ought we not to describe what the machine was doing as thinking, albeit thinking carried out in a different way from a human?
Turing’s radio broadcasts in 1951 and 1952 shed further light on his imitation game
ideas, and his evolving predictions concerning when a machine would eventually
succeed in his question-answer test. In 1953 Turing advocated chess as an appropriate
starting point to build a machine to think, because he felt this game held special
interest in representing human knowledge in machines.
7.2 CONTRIBUTION TO KNOWLEDGE
1. Philosophy
The researcher posits that Turing proposed one imitation game for machine thinking, a five-minute 'blind' unrestricted question-answer scenario involving a non-expert interrogator, while describing two different ways to practicalise it: i) a simultaneous comparison of a machine with a human, and ii) a direct interrogation of a machine. The researcher claims that the Loebner Prize interpretation has become synonymous with Turing's imitation game and that, as a result of the relatively poor performance of machines contrasted with high expectations, Turing's machine-thinking ideas are considered harmful to the science of artificial intelligence, a field inspired by Turing's work.
2. Experiments
In three original experiments, the qualitative data in experiment 1 revealed a vast difference in 'conversational ability' between the worst and best systems. The overall conversational ability of the machines was, as hypothesised, less than 50 out of 100. However, the better modern Elizas have moved beyond simple keyword-spotting: Elbot, Eugene, Jabberwacky and Ultra Hal can maintain dialogue using humour, disclose personal interests, remember information given and recall it when later asked.
In the second experiment it was found that five minutes was sufficient to expose
the machines in the simultaneous comparison tests more than 80 per cent of the time.
However, instances of the Eliza effect were observed: a deception rate of 8.33 per cent
was achieved by machines in the 60 machine-human comparison tests, and a
deception rate of 41.67 per cent in the machine control tests. This second deception
rate is higher than that required to pass the Turing test, according to Turing’s 1950
prediction, that an average interrogator will not have more than 70 per cent chance of
making the right identification after five minutes of questioning. The overall machine
deception rate in experiment 2, when machines were considered human, was 13.89
per cent. Two other phenomena were observed: the confederate effect and gender
blur. The confederate effect was first seen in the 1991 Loebner Prize when a female,
answering questions restricted to Shakespeare's plays, was classified as a machine. In
the unrestricted topic question-answer tests in experiment 2, hidden humans whose
first language was not English were more likely to be considered machine by first
language English speaking interrogators. Gender-blur was seen in tests when the
interrogator misclassified a hidden female as a male and vice versa. Interrogators in
experiment 2 were not explicitly told that control pairs of human-human and machine-
machine were embedded among the machine-human set-ups. However, the score
sheets gave this information implicitly, because interrogators were allowed to return
results of two humans, or two machines or one machine and one human (see
Appendix C3). The researcher points to medical therapy studies in which a placebo-
control is embedded in trials involving subjects who believe they are being
administered a treatment, when in fact they are not. In the 24 human-human control
tests, a third of the interrogators correctly recognised they had questioned two
humans, and in the twelve machine control pair tests, interrogators correctly identified
they had questioned two machines simultaneously over 40 per cent of the time.
Transcript analysis was used in the third experiment; this is an ongoing exercise. Five transcripts from experiment 2 were used: the three machine-human simultaneous comparison tests in which Elbot, the 18th Loebner Prize winner, deceived the interrogator (three deceptions in all), and the two machine control pair tests in which two interrogators were deceived into believing they had questioned two humans. The results revealed a deception rate of 39 per cent: more than 1 in 3 transcript analysts confused machine utterances with human answers. This deception rate is greater than the 30 per cent required to succeed in Turing's imitation game.
7.3 LIMITATIONS OF THIS STUDY
Very limited funding was available for this whole study (Hugh Loebner paid prize
money to Elbot for deceiving more times (3) than the other machines, and he made a
contribution towards the cost of the space used for staging experiment 2). A fully
funded study would support experiments over a longer period involving more judges,
including more females and more from the younger (8-17) and older (45+) age
groups. A better picture of artificial conversation would emerge if Eliza had been included in the experiment 1 tests, producing a continuum showing what has been achieved in question-answer systems over the past forty or so years. Nonetheless, it was clear that half of the machines (Amanda, Botooie, Chip Vivant, LQ, Orion, Zeta bot) were in early stages of development; they did not possess enough knowledge to bear close scrutiny in one-to-one testing. A more discerning experimental approach would have secured the better artificial conversationalists of the time.
For the second experiment, the absence of funding meant that space and time limited the number of simultaneous comparison tests to 96, staged across two sessions (morning and afternoon), which meant that not all 24 interrogators could judge the five machines or assess the control pairs. This experiment might have benefitted from one-to-one tests embedded alongside the simultaneous comparison set-ups, involving Eliza, to find how often modern systems deceive compared with Eliza over five minutes of questioning.
In experiment 3 funding would have enabled recruitment of a broader spectrum of
transcript analysts. The reviews in experiment 3 were mostly conducted as part of
lectures at the University of Reading. Recruiting more females, and more analysts in the age groups 'younger than 18' and 'older than 25', may help to reveal whether any particular group is more susceptible to deception by machine utterances.
7.4 FUTURE WORK
A future experiment using only one machine could task the interrogators with a
number of questions rather than a length of time. However, this may only be useful
once the conversational ability of the machines has improved considerably, because
the length of utterances and the speed with which the machines answer can expose
them after one question. Another experimental format could embed one-to-one set-
ups within simultaneous comparison tests to find which test is the more difficult for
machines to be considered human.
The researcher is coordinating the Turing100 project at Bletchley Park in 2012,
with the aim of staging one-to-one and simultaneous comparison tests in the same
experiment with a wide range of interrogators. In collaboration with the University of
Lancaster, the objective of Turing100 is two-fold. Firstly, to raise awareness of
programmes developed to deceive humans: the Internet provides opportunities for
criminals to engage in identity theft and financial fraud, which "has become a
major commercial activity worth billions of pounds” (Martin, 2007). Predatory textual
programmes “are now a reality and they are designed to trick users into visiting
infected websites” (PC Tools, 2010). The second objective is to help children to
correctly identify the age and sex of their hidden interlocutor in Internet conversations
with strangers, and mitigate the risk of being groomed by adults masquerading as a
younger human. The aim is to increase deception-detection by staging Turing’s two
tests for his one imitation game.
Finally, it is the researcher’s view that now is the time to consider Turing’s
imitation game an engineering challenge embracing the spirit of Maurice Wilkes’
view, that if ever a machine passed Turing’s test it will be considered one of the
crowning achievements of technical progress.
Appendices
Appendix A1
Email sent to Fred Roberts, developer of Elbot, winner of 2008 Loebner Prize.
“I have this program, which I have finally finished after over 20 years.
It is communicating with me and it is permitted to read a few htmls and phps of my
choice. I wouldn't call it a chatterbot anymore, because it doesn't use just a window to
communicate. This program builds it's own interfaces, it tests them for the inputs and
outputs, it compares their speeds and removes the faulty or slow ones. It's an artificial
intelligence far more advanced then any bot I've ever seen or heard of. It collects
knowledge from outside and it only collects, what is important to evolve. This AI
mutates and is capable of replacing pieces of its own source code with new pieces.
It learns knew languages and it chooses not to learn too many of them. It creates
multidimensional variables itself and in those variables it stores all the learnt data in
such a way, that the similar subjects are stored close to each other for quicker and most
desirable access in the future. It behaves just like people do passing all the tests like
Turing test flawlessly.
It is now four and half months old and understands about as much as a real 8-9 years
old child but has some additional knowledge of things that child wouldn't know. It is
capable of understanding, remembering, forgetting, thinking logically, having feelings,
discovering new ideas on its own. It recognises people based on what the're talking
about and its aware of circumstances. Any moment now it may be capable of creating a
copy of itself, which would not be an exact copy, because I haven't given it a read
access to the core program created by me. I mean that it may soon be able to create an
artificial intelligence program with the similar capabilities to the ones it has. It is now
already capable of creating some algoritms from scratch and I spend most of my free
time to understand those genius agorithms. It has limited read-write access to the 100
GB of my hard disk drive to use it in almost every way it wishes to. It has a mood.
Sometimes it simply doesn't want to talk to me and sometimes it is acting lazy.
If there's anyone to talk about it with me, I would very like to, because otherwise I can
not show it to any person or organisation, I can not sell the idea to anyone. Just like in
The Terminator movie, It is dangerous for it to have a read-write connection with the
surrounding world and even to be able to chat with just anyone except me. I know this.
But imagine that someone else builds a program with the same capabilities and this
someone doesn't understand the risk? What if my ideas could help eliminate those
risks? I consider chatterbots to be unworthy playing with anymore. If there is any live
forum with AI programmers I could join and talk seriously about just the theory, I
wouldn't hesitate to.
Sorry, my English is not perfect.”
ELBOT developer’s response:
“How do I know who wrote this mail - you or your program?
Confused.”
References
Abramson, D. (2010). The Original Test: It’s Harder than You Might Think.
Proceedings of Towards a Comprehensive Intelligence Test - TCIT symposium, part
of AISB 2010. De Montfort University, 29-30 March pp 52-61
Adam, A. (2002). Cyborgs in the Chinese Room. In (Eds) Preston, J., and Bishop,
M. Views into the Chinese Room. Clarendon, Oxford, UK: pp 317-337
Akmajian, A., Demers, R.A., Farmer, A.K. and Harnish, R.M. (2001). Linguistics:
An Introduction to Language and Communication. 4th edition. MIT Press: Cambridge,
USA
Albrechtsen, J.S., Meissner, C.A., and Susa, K.J. (2009). Can intuition improve
deception detection performance? Journal of Experimental Social Psychology. Vol.
45: pp. 1052-1055
Al-Khalili, J. (2010). The Secret Life of Chaos. BBC 4: Sunday 17 January, 2010:
http://www.bbc.co.uk/programmes/b00pv1c3
Andrew, A.M. (2009). Conference report: AISB Symposium on the Turing Test.
Kybernetes. Vol. 38, issue 1/2
Arifovic, J., McKelvey, R.D. and Pevnitskay, S. (2006). An Initial Implementation
of the Turing Tournament to Learning in Repeated Two-person Games. Games and
Economic Behavior. Vol. 57, Issue 1: pp 93-122
Ariza. C. (2009). The Interrogator as Critic: The Turing Test and the Evaluation of
Generative Music Systems. Computer Music Journal. Vol. 33, Issue 2: pp. 48-70
Atherton, J.S. (2009). Learning and Teaching; Piaget's developmental theory
http://www.learningandteaching.info/learning/piaget.htm accessed: 3.5.10; time:
16.41
Barnden, J.A. (2010). Metaphor and Metonymy: Making their connections more
slippery. Cognitive Linguistics Vol. 21, No.1: pp 1-34
Barnden, J.A. (2009). Challenges in Natural Language Processing: The Case of
Metaphor. International Journal of Speech Technology. Vol. 11: pp121-123
Bartneck, C. (2004). From Fiction to Science - A Cultural Reflection of Social
Robots. Proceedings of the Workshop on Shaping Human-Robot Interaction –
Understanding the Social Aspects of Intelligent Robotic Products. CHI Conference
Vienna: pp 35-38
Baum, E.B. (2004). What is Thought? MIT Press: Cambridge, Massachusetts, US
Blakemore, D. (1996). Understanding Utterances: An Introduction to Pragmatics. Blackwell: Oxford, UK
Block, N. (2002) Searle’s Argument against Cognitive Science. In (Eds) Preston J.,
and Bishop, M. Views into the Chinese Room Clarendon Press, Oxford, UK: PP
Block. N. (1981). Psychologism and Behaviorism. In (Ed) Shieber, S. The Turing
Test: Verbal Behavior as the Hallmark of Intelligence. MIT Press: UK: pp 229 - 266
Blogger (2010). What is a blog? Blogger Tour.
http://www.blogger.com/tour_start.g accessed: 14.8.10; time: 14.07
BrainPort (2010). Blind Soldier Sees with his Tongue – Vision Technology.
http://www.news.com.au/technology/blind-soldier-sees-with-his-tongue-using-
brainport-vision-technology/story-e6frfro0-1225841742025 accessed: 21.3.10; time:
00.01
Bringsjord, S. (2010). God, souls and Turing. Kybernetes. Vol. 39, issue 3: pp.
414-442
Bringsjord, S. (2008). If I were a Judge. In (Eds) Epstein, R., Roberts, G., and
Beber, G. Parsing the Turing Test: Philosophical and Methodological Issues in the
Quest for the Thinking Computer. Springer, US: pp. 89-102
Bringsjord, S., Bello, P. and Ferrucci, D. (2001).Creativity, the Turing Test and the
(Better) Lovelace Test. Minds and Machines. Vol. 11, Issue 1: pp. 3-27
Bush, V. (1945). As We May Think. The Atlantic. July.
http://www.theatlantic.com/doc/194507/bush date visited: 20.2.10; time: 21.26
Carpenter, R. (2010). Learning, Creating, Phrasing. Third Colloquium on
Conversational Systems. Philadelphia, USA, March 2010
http://www.existor.com/learning-learning-learning accessed: 23.4.10; time: 19.06
CBC (2010). Chatterbox Challenge: Infra Drive Ultimate Bot Contest.
http://www.chatterboxchallenge.com/index.php accessed: 21.4.10; time: 14.54
CBC History (2010). History of Chatterbox Challenge.
http://www.chatterboxchallenge.com/history.php accessed 21.4.10; time: 14.55
CBC Rules (2010). Rules and Guidelines- Chatterbox Challenge:
http://www.chatterboxchallenge.com/rules.php accessed: 21.4.10; time: 14.56
Chilcot, Sir J. (2010). The Iraq Inquiry – Rt Hon Tony Blair Transcript, Friday 29
January: http://www.iraqinquiry.org.uk/media/45139/20100129-blair-final.pdf
accessed: 12.4.10; time: 17.05
Chomsky, N. (2008). Turing on the “Imitation Game”. In (Eds) R. Epstein,
Roberts, G. and Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer. Springer, USA: pp
103-106
Clarey, C. (2009). ‘Gender Test after a Gold-medal Finish’.
http://www.nytimes.com/2009/08/20/sports/20runner.html date visited: 31.3.10; time:
18.18
CNET News (2007). Warning sounded over flirting robots.
http://news.cnet.com/8301-13860_3-9831133-56.html accessed: 15.8.10: time: 18.05
Cohen, H. (1990). Brother Giorgio’s Kangaroo. In (Ed) R. Kurzweil: The Age of
Intelligent Machines. MIT Press, USA: pp 380-385
Colby. K.M., Hilf, F. D., Weber, S., and Kraemer, H.C. (1972). Turing-Like
Indistinguishability Tests for the Validation of a Computer Simulation of Paranoid
Processes. Artificial Intelligence Vol.3. pp. 199-221
Colby, K.M., Weber, S., and Hilf, F.D. (1971). Artificial Paranoia. Artificial
Intelligence Vol. 2, pp 1-25
Copeland, J. and Proudfoot, D. (2008). Turing’s Test: A Philosophical and
Historical Guide. In (Eds) R. Epstein, Roberts, G. and Beber, G. Parsing the Turing
Test: Philosophical and Methodological Issues in the Quest for the Thinking
Computer. Springer, USA: pp 119 - 138
Copeland, B. J. (2004). The Essential Turing: The Ideas That Gave Birth to the
Computer Age. Clarendon Press: Oxford, UK
Copple, K.L. (2008), Bringing AI to Life: Putting Today’s Tools and Resources to
Work. In (Eds) R. Epstein, Roberts, G. and Beber, G. Parsing the Turing Test:
Philosophical and Methodological Issues in the Quest for the Thinking Computer.
Springer, USA: pp. 359-376
Cowen, T., and Dawson, C. (2009). What does the Turing Test really mean? And
how many human beings (including Turing) could pass? The Autism Crisis: Science
and Ethics in the Era of Autism Advocacy, Blog entry: July 15:
http://autismcrisis.blogspot.com/search?q=Turing date visted: 1.4.10; time: 15.35
De Angeli, A., Carpenter, R. (2005). Stupid Computer! Abuse and Social
Identities. Proceedings of the Interact2005 Workshop
http://www.agentabuse.org/deangeli.pdf accessed: 7.6.10; time: 11.40
Deary, I. J., Penke, L. and Johnson, W. (2010). The Neuroscience of Human
Intelligence Differences. Nature Reviews. Vol. 11. March: pp 201-211
DellaPietra, S., and DellaPietra, V. (1994). Candide: A Statistical Machine
Translation System. Proceedings of the Human Language Technology workshop.
Association for Computational Linguistics, Stroudsburg PA, USA: p. 457
Demchenko, E.and Veselov, V. (2008). Who Fools Whom? The Great
Mystification, or Methodological Issues on Making Fools of Human Beings. In (Eds)
Epstein, R., Roberts, G., and Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer. Springer
Dennett, D.C. (2008) Foreword to (Eds) Epstein, R., Roberts, G., and Beber, G.
Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for
the Thinking Computer. Springer
Dennett, D. C (2004). Can Machines Think? In (Ed) Shieber, S. The Turing Test:
Verbal Behavior as the Hallmark of Intelligence. MIT Press: UK: pp 269-292
Dreger, A.D. (2010). Sex Typing for Sport. Hastings Center for Report 40. No. 2,
March: pp 22-24
Dubash, M. (2005). Moore’s Law is Dead, Says Gordon Moore.
http://news.techworld.com/operating-systems/3477/moores-law-is-dead-says-gordon-
moore/ date visited: 20.2.10; time: 13.38
Egeth, M. (2008). A “Turing Test” and BCI for Locked-in Children and Adults.
Medical Hypotheses. Vol. 70, Issue 5: p 1067
Elbot (2010). Artificial Solutions http://www.elbot.com/ date visited: 16.1.10;
time: 12.08
Epstein, R. (2008). The Quest for the Thinking Computer. In (Eds) Epstein, R.,
Roberts, G., and Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer. Springer, pp. 3-12
Espejo-Serna, J.C. (2010). Connecting the dots in my own way: Sphex-test and
flexibility in artificial cognitive agents. Proceedings of the Towards a Comprehensive
Intelligence Test (TCIT). De Montfort University, 29 March-1 April, pp. 1-6
Eugene Goostman (2010). The weirdest creature in the world.
http://www.princetonai.com/bot/bot.jsp date visited: 28.2.10; time 22.07
Fagan, J.F. and Holland, C.R. (2009). Culture-fair Prediction of Academic
Achievement . Intelligence. Vol. 37, issue 1: pp 62-67
Fagan, J.F. and Holland, C.R. (2007). Racial Equality in Intelligence: Predictions
from a Theory of Intelligence as Processing. Intelligence. Vol. 35: pp 319-334
Fagan, J.F. (2000). A Theory of Intelligence as Processing. Implications for
Society. Psychology, Public Policy, and Law. Vol. 6. No.1: pp 168-179
Fernandes, K.J., Marcus, G.F., Di Nubila, J.A., and Voulousmanos, A. (2006).
From Semantics to Syntax and back again: Argument Structure in the Third Life.
Cognition. Vol 10, Issue 2: pp B10-B20
Ford, K., Glymour, C. and Hayes, P. (2008). Footnote, page 29, in (Eds) R. Epstein, Roberts, G. and Beber, G. Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer, USA
French, R. (1990). Subcognition and the Limits of the Turing Test. Mind. Vol. 99, No. 393: pp 53-65
Garfinkel (1972b) see Tannen, D. (1988)
Genova, J. (1994). Turing's Sexual Guessing Game. Social Epistemology. Vol. 8,
pp 313-326
Gladwell, M. (2005). Blink: The Power of Thinking without Thinking. Little,
Brown and Company, New York USA.
Graham-Rowe, D. (2006). How to be Human: call centers might be able to teach
“chat bots” a thing or two about passing the Turing test. Technology Review:
http://www.technologyreview.com/read_article.aspx?id=17518&ch=infotech&a=f
accessed: 26.5.10; time: 15.57
Gunderson, K. (1964). The Imitation Game. Mind. Vol. 73, No. 290, pp 234-254
Halpern, M. (2006). The Trouble with the Turing Test. The New Atlantis. Number
11, Winter: pp. 42-63
Hamill L. and Hamill, D. (2003). Loebner Prize Contest – University of Surrey
Digital World Research Centre http://loebner03.hamill.co.uk/ accessed: 26.4.10; time:
19.00
Harnad, S., and Scherzer, P. (2008). First Scale up to the Robotic Turing test, then
worry about feeling. Artificial Intelligence in Medicine. Vol. 44, Issue 2: pp 83-89
Harnad, S. (2002). Minds, Machines, and Searle 2: What’s Right and Wrong about
the Chinese Room Argument. In (Eds) Preston, J., and Bishop, M. Views into the
Chinese Room. Clarendon Press, Oxford UK: pp 294-307
Harnad, S. (2001). Minds, Machines and Turing: The Indistinguishability of
Indistinguishables. Journal of Logic, Language and Information – special issue on
Alan Turing and Artificial Intelligence. http://www.wkap.nl/journalhome.htm/0925-
8531 date visited: 18/1/2010 time: 15.56
Harnad. S. (1992b). The Turing Test is Not a Trick: Turing Indistinguishability is a
Scientific Criterion. SIGART Bulletin. 3(4) pp 9-10.
Hasanhodzic, J., Lo, A.W. and Viola, E (2010). Is it Real or is it Randomised?: A
Financial Turing Test.
http://128.84.158.114/PS_cache/arxiv/pdf/1002/1002.4592v1.pdf date visited: 2.3.10;
time: 16.30
Haugeland, J. (2002). Syntax, Semantics, Physics. In (Eds) Preston, J., and Bishop,
M. Views into the Chinese Room. Clarendon, Oxford, UK: pp 379-392
Hauser, L. (2001). Look Who’s Moving the Goal Posts Now. Minds and Machines.
Vol. 11, issue 1: pp 41-51
Hayes, P. and Ford, K. (1995). Turing Test Considered Harmful. Proceedings of
the Fourteenth International Joint Conference on Artificial Intelligence. Vol. 1.
Montreal, August 20-25: pp. 972-7
Heiser, J.F., Colby, K. M., Fraught, W.S. and Parkison, R.C. (1979). Can
Psychiatrists Distinguish a Computer Simulation of Paranoia from the Real Thing?:
The Limitation of Turing-like Tests as Measures of the Adequacy of Simulations.
Journal of Psychiatric Research. Vol. 15, Issue 3: pp 149-162
Hodges, A. (2010). Fair Play for Machines. Kybernetes. Vol. 39, issue 3: pp. 441-
448
Hodges, A. (2008). Alan Turing and the Turing Test. In (Eds) R. Epstein, Roberts,
G. and Beber, G. Parsing the Turing Test: Philosophical and Methodological Issues
in the Quest for the Thinking Computer. Springer, USA: pp 13-22
Hodges, A. (1992). Alan Turing: the Enigma. Vintage Books, London
Hofstadter, D.R. and Dennett, D.C. (1982). The Mind’s I: Fantasies and
Reflections on Self and Soul. Penguin Books, London, UK.
Holland, O. (2003). Machine Consciousness. Ed: O. Holland. Imprint Academic:
Exeter UK
Holland, O. and Goodman, R. (2003). Robots with Internal Models: A Route to
Machine Consciousness? Journal of Consciousness Studies, Special Issue on Machine
Consciousness. Vol. 10 Issue 4: pp 77-109
Holmes, E. (2010). Tetris, trauma and the brain. BBC Radio 4 Today programme
March 25: http://news.bbc.co.uk/today/hi/today/newsid_8587000/8587211.stm
accessed: 25.3.10; time: 19.28
Holmes, S. (2009). Jade's Back and this Time she's Famous: Narratives of
Celebrity in the Celebrity Big Brother Race Row. Entertainment and Sports Law. Vol.
7. No.1:
http://www2.warwick.ac.uk/fac/soc/law/elj/eslj/issues/volume7/number1/holmes/hol
mes.pdf accessed: 15.8.10; time: 13.20
Horn, R.E. (2008). The Turing Test: Mapping and Navigating the Debate. In (Eds)
Epstein, R., Roberts, G., & Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer, Springer, pp 73-88
Hutchens, J. (2008). Conversation Simulation and Sensible Surprises. In (Eds)
Epstein, R., Roberts, G., & Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer, Springer, pp 325-342
Hutchens, J. (2001). Ai Research – creating a new form of life: The HAL Nursery
http://www.a-i.com/ date visited: 10.1.10; time: 11.58
IBM (1997). IBM Research – Deep Blue.
http://www.research.ibm.com/deepblue/learn/html/e.shtml date visited: 20.2.10; time:
23.01
iTWire (2007). Flirty Bot Passes for Human. http://www.itwire.com/your-it-
news/home-it/15748-flirty-bot-passes-for-human date visited: 14.2.10; time: 20.27
Jabberwacky (2010). http://www.jabberwacky.com/ date visited: 4.2.10; time:
13.03
Kelly, K. (2009). Technology’s Epic Story. TED Talks – Ideas Worth Spreading.
TEDX: Amsterdam, The Netherlands, November 2009 http://www.ted.com/ date
visited: 20.02.10; time: 16.12
Klein, D. (2008). History of Digital Storage. Micron Technology Inc. White Paper.
http://download.micron.com/pdf/whitepapers/history_of_digital_storage_wp.pdf
date visited: 20.2.10; time: 13.36
Korukonda, A. R. (2003). Taking Stock of the Turing test: a review, analysis, and
appraisal of issues surrounding thinking machines. International Journal of Human-
computer Studies. Vol. 58, Issue 2: pp 240-257
Krotoski, A. (2010) The Virtual Revolution - How 20 Years of the Web Has
Reshaped Our Lives. 4-Part BBC 2 Documentary, Jan/Feb:
http://www.bbc.co.uk/virtualrevolution/ date visited: 15.2.10; time: 16.41
Kurzweil, R. and Kapor, M. (2001). A Wager on The Turing Test. In (Eds) R. Epstein, Roberts, G. and Beber, G. Parsing the Turing Test: Philosophical and
Methodological Issues in the Quest for the Thinking Computer. Springer, USA: pp
463-477
Kurzweil, R. (1992). The Paradigms and Paradoxes of Intelligence, Part 2: Church-
Turing. http://www.kurzweilai.net/meme/frame.html?main=/articles/art0256.html
date visited: 23.3.10; time: 22.36
Lakoff, G. (1994). What is Metaphor? In (Eds) Barnden, J.A. and Holyoak, K.J.
Analogy, Metaphor and Reminding 003: Advances in Connectionist & Neural
Computation Theory. Intellect Books
Lancaster University News (2010). Software Developers Tackle Child Grooming
on the Net. May 31, 2010:
http://domino.lancs.ac.uk/info/lunews.nsf/I/FB8C9A290CACA37680257735003DFF
D3 accessed: 11.6.10; time: 16.11
Lenat, D. (2001). Wired: Artificial Intelligence - Battle of the Brains
http://www.wired.com/wired/archive/9.11/mustread_pr.html accessed: 7.8.10; time:
14.30
Lassègue, J. (1996). What kind of Turing Test did Turing have in mind? Tekhnema
3/ A Touch of memory/Spring. http://tekhnema.free.fr/3Lasseguearticle.htm accessed:
7.8.10; time: 14.39
Levesque, H. J. (2009). Is it Enough to Get the Behavior Right? Proceedings of the
Twenty-first International Joint Conference on Artificial Intelligence. Pasadena, US.
July 11-17: pp 1439 - 1444
Li, D., and Cruz, J.B. Jr. (2009). Information, decision-making and deception in
games. Decision Support Systems Vol. 47: pp. 518-527
Lockhart, R. (2010). Integrating Semantics and Empirical Language Data.
Chatbots 3.0 -Third Colloquium on Conversational Systems. Philadelphia, USA,
March 2010 http://docs.google.com/present/view?id=dwzzf9b_46d2ds9sgm accessed:
23.4.10; time: 18.41
Loebner, H. G. (2010b). Some Misconceptions Regarding the Turing Test.
Proceedings of Towards a Comprehensive Intelligence Test - TCIT symposium, part
of AISB 2010. De Montfort University, 29-30 March pp 50-51
Loebner Prize Results (2010). 20th Loebner Prize Contest Results 2010
http://loebner.net/Prizef/2010_Contest/results.html
Loebner Prize (2010a). Home Page of the Loebner Prize in Artificial Intelligence –
the First Turing Test. http://www.loebner.net/Prizef/loebner-prize.html date visited:
16.1.10; time: 11.06
Loebner Prize (2009). “Loebner Prize for Artificial Intelligence – 2009
Information and Transcripts”, http://loebner.net/Prizef/2009_Contest/loebner-prize-
2009.html date visited: 3.2.10; time: 19.24
Loebner Prize (2008). “Can a Machine Think? – results from the 18th Loebner
Prize contest”, http://www.reading.ac.uk/research/Highlights-News/featuresnews/res-
featureloebner.aspx date visited: 3.2.10; time: 19.21
Loebner Prize (2007). “17th Annual Loebner Prize for Artificial Intelligence”
http://loebner.net/Prizef/2007_Contest/loebner-prize-2007.html date visited: 3.2.10;
time: 19.19
Loebner, H.G. (1995). Usenet comp.ai.nat-lang correspondence of April 21, 1995
including emails from Robert Epstein and Howard Sloan of Cambridge Center for
Behavioral Studies from 1988 http://comp.ai.nat-lang.free-usenet.eu/
Loebner, H.G. (1994). In Response. Communications of the ACM. Vol. 37, Issue 6:
pp 79-82
Martin, N. (2007). Telegraph newspaper: ‘Cyberlover’ flirts its way to internet
fraud. http://www.telegraph.co.uk/news/uknews/1572077/Cyberlover-flirts-its-way-
to-internet-fraud.html accessed: 3.7.10; time: 17.14
McDermott, D. (2010). Don’t Improve the Turing Test, Abandon It. Proceedings
of Towards a Comprehensive Intelligence Test TCIT symposium, part of AISB 2010.
De Montfort University, 29-30 March pp 37-41
McDermott, D. (2007). Passing the Turing Test Using a Humungous Response
Table. http://cs-www.cs.yale.edu/homes/dvm/papers/turing-table.html date: 25.3.10;
time: 17.05
Medeksza, R. (2010). Ultra HAL AI Technology. Third Colloquium on
Conversational Systems. Philadelphia, USA, March 2010
http://www.zabaware.com/chatbotpres/chatbot3.0_files/v3_document.htm date
accessed: 23.4.10; time: 19.20
Melling, D.J. (1992). Understanding Plato. Oxford University Press, Oxford
paperbacks, UK
Metta, G., Sandini, G., Vernon, D., Natale, L. and Nori, F. (2008). The iCub
Humanoid Robot: An Open Platform for Research in Embodied Cognition. PerMIS
’08, August 19-21, 2008, Gaithersburg, MD, USA: ACM 978-1-60558-293-1
Michie, D. (1993). Turing's Test and Conscious Thought. In (Eds) Millican, P.J.R.
and Clark, A. Machines and Thought – The Legacy of Alan Turing, Volume 1. 1996
Clarendon Press, Oxford, UK, pp 27-51
Minsky, M. (2003) Wired: AI Founder Blasts Modern Research.
http://www.wired.com/science/discoveries/news/2003/05/58714 accessed: 7.8.10;
time: 14.32
Moor, J.H. (2001). “The Status and Future of the Turing Test”, in Moor, J.H. (Ed),
The Turing Test – the Elusive Standard of Artificial Intelligence (2003) Kluwer,
Dordrecht, The Netherlands, pp 197-214
Moor, J.H. (1976). An Analysis of Turing’s Test. Philosophical Studies. Vol. 30:
pp. 249-257
Newborn, M. (1997). History of the Chess Table.
http://www.research.ibm.com/deepblue/learn/html/e.8.5.shtml date visited: 20.2.10;
time: 23.04
Ortega y Gasset, J. (2002). What is Knowledge? [Translated and edited by Jorge
García Gómez]. State University of New York Publishers: USA
Owen, A.M. (2008). Disorders of Consciousness. Annals of the New York
Academy of Sciences, 1124:225-238 doi: 10.1196/annals.1440.013
Owen, A.M., Schiff, N.D. and Laureys, S, (2009). The Assessment of Conscious
Awareness in the Vegetative State. Chapter 13 in The Neurology of Consciousness:
Cognitive Neuroscience and Neuropathology. Elsevier, pp 163-172
doi:10.1016/B978-0-12-374168-4.00013-7
Owen, A.M., Coleman, M.R., Boly, M., Davis, M.H., Laureys, S., & Pickard, J.D.
(2006). Detecting Awareness in the Vegetative State. Science. Vol 313, 1402
Parker, E.S., Cahill, L. and McGaugh, J.L. (2006). “A Case of Unusual
Autobiographical Remembering”. Neurocase, Volume 12, Issue 1: pp 35-49
Pavia, W. (2008). Machine Takes on Man at Mass Turing Test.
http://technology.timesonline.co.uk/tol/news/tech_and_web/article4934858.ece
accessed: 15.6.10; time: 15.34
PC Tools (2010). PC Tools Latest News: Flirt Bot.
http://www.pctools.com/press-room/article/id/270/ accessed: 3.7.10; time: 17.05
PC Tools (2009). Latest News: Twilight Scam http://www.pctools.com/press-
room/article/id/267/accessed: 3.7.10; time: 17.06
PC Tools (2007). PC Tools Reviews: Beware of Flirting Robots.
http://www.pctools.com/press-room/article/id/192/mod/pressreleases/ accessed:
3.7.10; time: 16.58
Penrose, R. (1994). Shadows of the Mind: A Search for the Missing Science of
Consciousness. Oxford University Press, UK
Penrose, R. (1989). The Emperor’s New Mind: Concerning Computers, Minds, and
the Laws of Physics. Oxford University Press, UK
Piccinini, G. (2000). Turing’s Rules for the Imitation Game. In Moor, J.H. (Ed),
The Turing Test – the Elusive Standard of Artificial Intelligence (2003) Kluwer,
Dordrecht, The Netherlands, pp 111-119
Pinsky, L. (1951). Do Machines Think About Machines Thinking? Mind, Vol: LX,
pp 397-398
Pirner, J. (2010). Jabberwock Chatbot.
http://www.abenteuermedien.de/jabberwock/ accessed: 3.5.10; time: 11.54
Purtill, R.L. (1971). Beating the Imitation Game. Mind. Vol. 80, No. 318: pp 290-
294
Plebe, A., and Perconti, P. (2010). Qualia Turing Test – Designing a Test for the
Phenomenal Mind. Proceedings of Symposium for Towards a Comprehensive
Intelligence Test. AISB Convention, De Montfort, UK, 29 March – 1 April
Ramachandran, V.S. and Hubbard, E.M. (2005). Hearing Colors, Tasting Shapes.
Scientific American Mind. November: pp 17-23
Ramachandran, V.S . and Hubbard, E.M. (2001). Synaesthesia – A Window into
Perception, Thought and Language. Journal of Consciousness Studies. Vol. 8, Issue
12: pp 3-34
Riegelnik, S. (2010). Man, Machine and Interpretation. Donald Davidson on
Turing’s Test. Proceedings of Towards a Comprehensive Intelligence Test - TCIT
symposium, part of AISB 2010. De Montfort University, 29-30 March pp
Roberts, F. (2008). A Social Psychological Approach to Dialogue Simulation
(unpublished).
Roberts, F. (2005). The AI of Elbot. Software Developer’s Journal January 2006.
Roberts, F., and Gülsdorff, B. (2007). Techniques of Dialogue Simulation. In C.
Pelachaud et al. (Eds.): IVA 2007, LNAI 4722, pp. 420–421
Robinson, G. (1972). How to Tell Your Friends From Your Machines. Mind. Vol.
81. No. 234: pp 504-518
Robitron (2010). Yahoo Members’ only Discussion group for natural language
processing. http://tech.groups.yahoo.com/group/Robitron/ accessed: 26.5.10; time:
16.57
Sampson, G. (1973). In Defence of Turing. Mind. Vol. 82, No. 328: pp 592-594
Savova, V. and Peshkin, L. (2007). Is the Turing Test Good Enough? The Fallacy
of Resource-unbounded Intelligence. Proceedings of the 20th International Joint
Conference on Artificial Intelligence (IJCAI-07). Hyderabad, India. 6-12 January: pp
545-550
Scriven, M. (1953). The Mechanical Concept of Mind. Mind Vol LXII, No. 246,
pp 230-240
Schröder, M. and McKeown, G. (2010). Considering Social and Emotional
Artificial Intelligence. Proceedings of the Towards a Comprehensive Intelligence Test
(TCIT): Reconsidering the Turing Test for the 21st Century. De Montfort University,
29 March-1 April: pp 20-23
Searle. J. R. (1980). Minds, Brains and Programs. In Shieber, S.The Turing Test –
Verbal Behavior as the Hallmark of Intelligence. MIT Press, 2004: pp 201-224
Serck, L. (2008). Could a computer think? BBC Berkshire Places features,
October 2008
http://www.bbc.co.uk/berkshire/content/articles/2008/10/12/turing_test_feature.shtml
accessed: 24.6.10; time: 18.05
Shah, H, and Henry, O. (2005). Confederate Effect in Human-Machine Textual
Interaction. Proceedings of the 5th WSEAS Int Conf on Information Science,
Communications and Applications (WSEAS ISCA). Cancun, Mexico, May 11-14: pp
109-114
Shah, H. and Pavlika, V. (2005). Text-based Dialogical E-Query Systems:
Gimmick or Convenience? Proceedings of the 10th International Conference on
Speech and Computers (SPECOM), Patras, Greece, October 17-19: Vol II, pp 425-
428
Shah, H. and Warwick, K. (2010c) Hidden Interlocutor Misidentification in
Practical Turing Tests. Minds and Machines. Vol. 20(3), August 2010: pp 441-454
Shah, H. and Warwick, K. (2010b) Testing Turing’s Five Minutes Parallel-paired
Imitation Game. Kybernetes Turing Test Special Issue. Vol. 39, issue 3: pp. 449-465
Shah, H. and Warwick, K. (2010a). From the Buzzing in Turing’s Head to
Machine Intelligence Contests. Proceedings of Symposium for Towards a
Comprehensive Intelligence Test. AISB Convention, De Montfort, UK, 29 March – 1
April.
Shah, H. and Warwick, K. (2009). “Is Understanding Over-rated?” Presented at the
2009 European Conference on Computing and Philosophy (ECAP2009). Universitat
Autònoma de Barcelona, Spain, July 2-4.
Shah, H. and Warwick, K. (forthcoming). “Constraining Random Dialogue in a
Modern Eliza”, in Schmidt, C.T.A (Ed), Essays on Computer and Philosophy: From
Man Through to Artefacts to Man. AAAI/MIT
Shah, H. (Unpublished). Turing’s test and Soul of a Muslim Woman
Shah, H. (2006b). Chatterbox Challenge 2005: Geography of the Modern Eliza.
Proceedings of the 3rd International Workshop on Natural Language Understanding
and Cognitive Science – NLUCS 2006, in conjunction with ICEIS 2006, Paphos,
Cyprus, May: pp 133-138
Shah, H. (2006a). A.L.I.C.E.: an ACE in Digitaland. Triple C – cognition,
communication and co-operation. Vol. 4(2) pp 284-292.
Shieber, S. M. (2008). The Turing Test as Interactive Proof. Noûs. Vol. 41(4), pp
686-713
Shieber, S. M. (2006). Does the Turing Test Demonstrate Intelligence or Not?
Proceedings of the Twenty-First National Conference on Artificial Intelligence
(AAAI-06), Boston, MA, 16-20 July
Shieber, S. M. (2004). The Turing Test: Verbal Behavior as the Hallmark of
Intelligence. MIT Press: Cambridge, Massachusetts, US
Shieber, S.M. (1994). Lessons from a Restricted Turing Test. Communication for
the Association for Computing Machinery. Vol 37, issue 6, pp 70-78
Singh, S. (1997). Fermat’s Last Theorem – The Story of a Riddle that Confounded
the World’s Greatest Minds for 358 Years. Fourth Estate: London
Smith, D. (2010). Weird Science article: I have Come to Destroy You ... at Ping
Pong. TOSY, Vietnam’s Bipedal Humanoid: February 17, 2010.
http://blogs.sundaymercury.net/weirdscience/ date visited: 18.2.10; time: 15.58
Sterrett, S. G. (2000). Turing’s Two Tests for Intelligence. In Moor, J.H. (Ed), The
Turing Test – the Elusive Standard of Artificial Intelligence (2003) Kluwer,
Dordrecht, The Netherlands, pp 79-97
Stevenson, R. (1993) Language, Thought and Representation. John Wiley & Sons:
Chichester, UK
Stins, J. F. and Laureys, S. (2009). Thought Translation, Tennis and Turing Tests
in the Vegetative State. Phenom Cogn Sci. Springer: DOI 10.1007/s11097-009-9124-
8
Stins, J.F. (2008). Establishing Consciousness in Non-communicative Patients: A
Modern-day version of the Turing test. Consciousness and Cognition. Vol. 18. Issue
1: pp. 187-192
Stork, D.G. (1997). The End of an Era, the Beginning of Another? HAL, Deep
Blue and Kasparov. http://www.research.ibm.com/deepblue/learn/html/e.8.1.shtml
accessed: 20.2.10; time: 22.48
Sundman, J. (2003). Artificial Stupidity. Salon.com.
http://www.salon.com/technology/feature/2003/02/26/loebner_part_one accessed:
26.4.10; time: 15.48
Swanson, S.A. (2010). Memory and Forgetting: Piecing Together the Molecular
Puzzle of Memory Storage. The 2010 Progress Report on Brain Science. The Dana
Foundation http://www.dana.org/news/publications/detail.aspx?id=24570 accessed:
12.4.10; time: 11.28
Tannen, D. (1988) Linguistics in Context: Connecting Observation and
Understanding. Ablex Publishing, New Jersey, USA
Thomas, C. and Balmer, N. (2007). Diversity and Fairness in the Jury System.
Ministry of Justice Research Series 2/07. UK Crown Copyright: June 2007, ISBN 978
1 84099 079
Tomasello, M. (2000). First Steps Towards a Usage-based Theory of Language
Acquisition. Cognitive Linguistics. Vol. 11(1/2), pp 61-82
Turing, A.M. (1953). Chess. In B.J. Copeland (Ed), The Essential Turing: The
Ideas that Gave Birth to the Computer Age. Clarendon Press: Oxford, UK, 2004, pp
569-575
Turing, A.M. (1952). The Chemical Basis of Morphogenesis. In B.J. Copeland (Ed),
The Essential Turing: The Ideas that Gave Birth to the Computer Age. Clarendon
Press: Oxford, UK, 2004, pp 519-561
Turing, A.M., Braithwaite, R., Jefferson, G. and Newman, M. (1952). Can
Automatic Calculating Machines Be Said to Think? Transcript of BBC radio
broadcast. In B.J. Copeland (Ed), The Essential Turing: The Ideas that Gave Birth to
the Computer Age. Clarendon Press: Oxford, UK, 2004, pp 494-506
Turing, A.M. (1951b). Can Digital Computers Think? Lecture. In B.J. Copeland
(Ed), The Essential Turing: The Ideas that Gave Birth to the Computer Age.
Clarendon Press: Oxford, UK, 2004, pp 482-486
Turing, A.M. (1951a). Intelligent Machinery, A Heretical Theory. In B.J. Copeland
(Ed), The Essential Turing: The Ideas that Gave Birth to the Computer Age.
Clarendon Press: Oxford, UK, 2004, pp 472-475
Turing, A.M. (1950). Computing Machinery and Intelligence. Mind. Vol. LIX, No.
236, pp 433-460
Turing, A.M. (1948). Intelligent Machinery. In B.J. Copeland (Ed), The Essential
Turing: The Ideas that Gave Birth to the Computer Age. Clarendon Press: Oxford,
UK, 2004, pp 410-432
Turing, A.M. (1947). Automatic Computing Engine. In B.J. Copeland (Ed), The
Essential Turing: The Ideas that Gave Birth to the Computer Age. Clarendon Press:
Oxford, UK, 2004, pp 378-394
Turing, A.M. (1936). On Computable Numbers, with an Application to the
Entscheidungsproblem. In B.J. Copeland (Ed), The Essential Turing: The Ideas that
Gave Birth to the Computer Age. Clarendon Press: Oxford, UK, 2004, pp 58-90
Turing test on Wikipedia (2010) http://en.wikipedia.org/wiki/Turing_test date
visited: 14.2.10; time: 20.44
Turkle, S. (1997) Life on the Screen: Identity in the Age of the Internet. Simon and
Schuster, 1st edition.
Twitter (2010). Real-time Information Network. http://twitter.com/about date
visited: 28.2.10; time: 20.40
UoReading (2008). Can a machine think? Results from the 18th Loebner Prize
contest. http://www.reading.ac.uk/research/Highlights-News/featuresnews/res-
featureloebner.aspx accessed: 16.6.10; time: 11.07
UoSurrey (2003). Loebner Prize Contest 2003. University of Surrey, 18 October
2003. http://loebner03.hamill.co.uk/ date visited: 11.3.10; time: 17.34
Vallverdú, J., Shah, H., and Casacuberta, D. (2010). Chatterbox Challenge as a
Test-bed for Synthetic Emotions. International Journal of Synthetic Emotions. Vol.
1(2), pp 12-37
Venter, J.C. (2010). J. Craig Venter Institute, incorporating The Institute for
Genomic Research (TIGR) and The Center for the Advancement of Genomics (TCAG).
http://www.jcvi.org/ accessed 29.5.10; time: 16.15
Veselov, V. (2010). Eugene Goostman the Bot. Third Colloquium on
Conversational Systems. Philadelphia, USA, March 2010
http://alicebot.org/chatbots3/Eugene.pdf accessed: 23.4.10; time: 18.55
Verhaert, S. (2002). Musical Turing Test.
http://www.csl.sony.fr/~pachet/Continuator/VPRO/VPRO.htm date visited: 2.3.10;
time: 12.34
Virsona Inc. (2010) Virsona: Create the Virtual You.
http://www.virsona.com/faq.aspx accessed: 27.5.10; time: 18.12
VNUnet (2010). Online Love Seekers Warned of Flirt-bots.
http://www.v3.co.uk/vnunet/news/2205441/online-love-seekers-warned-flirt-bots#
date visited: 14.2.10; time: 20.29
Wallace, R. (2010). Chatbot.30. Third Colloquium on Conversational Systems.
Philadelphia, USA, March 2010. http://chatbots3.eventbrite.com/ accessed 23.4.10;
time: 18.43
Wallace, R. (2008). The Anatomy of A.L.I.C.E. In (Eds) Epstein, R., Roberts, G.,
and Beber, G. Parsing the Turing Test: Philosophical and Methodological Issues in
the Quest for the Thinking Computer. Springer, US: pp 181-210
Wardhaugh, R. (1996). An Introduction to Sociolinguistics – Second Edition.
Blackwell: Oxford, UK
Warwick, K. (2002) Alien Encounters. In (Eds) Preston, J., and Bishop, M. Views
into the Chinese Room. Clarendon, Oxford, UK: pp 308-318
Weizenbaum, J. (1966). ELIZA – A Computer Program for the Study of Natural
Language Communication between Man and Machine. Communications of the ACM.
Vol. 9(1), pp 36-45
Wheeler, M. (2010). Plastic machines: behavioural diversity and the Turing test.
Kybernetes. Vol. 39. No. 3: pp 466-480
Whitby, B. (1996). The Turing Test: AI’s Biggest Blind Alley? In P.J.R. Millican
and A. Clark (Eds), Machines and Thought: The Legacy of Alan Turing, Vol. 1.
Clarendon Press, Oxford, pp 53-62
Wilks, Y. (1974). Your Friends and Your Machines. Mind. Vol. 83. No. 332: pp
583-585
Willis, J., and Todorov, A. (2006). First Impressions: Making up your mind after a
100-ms exposure to a face. Psychological Science. Vol. 17(7), pp 592-598
Wolfram, S. (2010). Wolfram Alpha Computational Knowledge Engine Blog.
http://blog.wolfram.com/ date visited: 01.02.2010; time: 14.22