Covert Implementations of the Turing Test: A More Level Playing Field?

D.J.H. Burden¹, M. Savin-Baden², R. Bhakta³

¹ Daden Limited, B7 4BB, UK. david.burden@daden.co.uk
² University of Worcester, WR2 6AJ. m.savinbaden@worc.ac.uk
³ University of Worcester, WR2 6AJ. r.bhakta@worc.ac.uk
Abstract
It has been suggested that a covert Turing Test, possibly in a virtual world,
provides a more level playing field for a chatbot, and hence an earlier opportunity
to pass the Turing Test (or equivalent) in its overt, declared form. This paper looks
at two recent covert Turing Tests in order to test this hypothesis. In one test (at
Loyola Marymount), run as a covert singleton test, 39 of the 50 subjects who talked
to the chatbot avatar did not identify that the avatar was being driven by a chatbot
(a 78% deception rate). In a more recent experiment at the University of Worcester,
groups of students took part in a set of problem-based learning chat sessions, each
group having an undeclared chatbot. Not one participant volunteered the fact that
a chatbot was present (a 100% deception rate). However, the chatbot character was
generally seen as being the least engaged participant – highlighting that a chatbot
needs to concentrate on achieving legitimacy once it can successfully escape
detection.
1 Introduction
Burden [1] described how the Turing Test as commonly implemented provides the
computer (in the guise of a chatbot) with a significant challenge, since both the
judges and hidden humans are aware that they are taking part in a Turing Test, and
so the dialogues which take place during the test are rarely “normal”[2]. In the
same paper Burden described how a more level playing field could be created by
conducting a covert Turing Test, where neither judge nor hidden human know that
the test is taking place. Burden also suggested that virtual environments could
provide the ideal location for such a covert Turing Test.
Burden defined four possible Turing Test situations:
The robotar (computer/chatbot driven avatar) is undeclared and part of a
group conversation (the Covert Group test).
A robotar is declared present (but unidentified) as part of a group
conversation (the Overt Group test).
The robotar is undeclared and part of a set of one-on-one conversations
(the Covert Singleton test).
A robotar is declared present (but unidentified) as part of a set of one-on-
one conversations (the Overt Singleton test, the original Turing Test, the
Imitation Game [3] as typically implemented by competitions such as the
Loebner Prize [4]).
Burden identified a potential area of future research as "to what extent do the
Covert/Overt and Singleton/Group options present earlier opportunities to pass
the Turing Test (or equivalent) in a virtual world?"
This paper will review two covert Turing Tests which have been inspired by
Burden's paper, one in a virtual world and one in an on-line chat room. For the
second test the paper will also analyse some of the previously unpublished metrics
obtained during the test with regards to the performance of the chatbot. This paper
will then examine the extent to which these tests have borne out the predictions
and hypothesis of the original paper, and consider what further work could be
done in this area in order to further the creation of “Turing-capable” chatbots.
Cooper and Van Leeuwen [5] provide a useful survey of current thought on the
Turing Test, in particular Sloman's observations that the test is a poor measure of
intelligence, and that the "average" interrogator envisaged by Turing is now far
more sophisticated and aware of computer capabilities - and so potentially far
harder to fool. Little appears to have been written around the concept of a Covert
or Group test. It is interesting that, in his analysis of the problems of the first
Loebner Prize and his suggestions for improvements, Shieber [6] does not consider the
potential for a Covert or Group test to overcome some of the perceived issues.
However, Mauldin [7] does describe an "unsuspecting Turing Test" using a text-based
Multi-User Dungeon as the setting, but reported that the chatbot only
"fooled 'average' questioners in the TINYMUD domain for a few minutes." Colby's
experiments with PARRY [8] appear to be the first significant use of a covert
Turing approach - albeit in a non-interactive setting (a passive rather than an active detection
test). Kurzweil [9] does describe the sort of covert Turing Test considered here in
the Worcester experiment, "each system logs on to various human chat rooms and
tries to pass for a human, basically a covert Turing test. If one of the humans in a
chat room says something like 'What are you, a chatterbot?' .... the evaluation is
over".
2 The Loyola Marymount Experiment
Gilbert and Forney [10] describe how they ran a covert Turing Test in the virtual
world of Second Life. A summary of that experiment is provided below for
comparison with the Worcester experiment; Gilbert and Forney provide the full
description.
The experiment was based around the premise/deception of a new shop in
Second Life, with subjects being invited into the shop to discuss and comment on
its layout through interacting with the shop assistant. Both subject and assistant
were represented by Second Life avatars, could see each other’s avatars, and were
able to converse by text-chat – the standard interaction form in Second Life. The
key features of the experiment were that:
In 50% of the sessions the shop assistant was controlled by a computer,
which included both a chatbot and the ability to move through the
environment and sense the subject’s position
In the other 50% of the sessions the shop assistant was controlled by a
human. Crucially the human was not told about the hidden Turing
element of the experiment, they believed they were just there to show the
subjects around the new shop and handle any questions
It was only in the final question of the post-evaluation debrief that the
subjects were asked whether they thought the avatar (which always
presented as female) was being controlled by a man, woman or computer.
The chatbot software used for the experiment was Discourse, an Artificial
Intelligence Markup Language (AIML) derived chatbot engine developed by
Daden Limited (Disclosure: The primary author is the owner and Managing
Director of Daden Limited).
The key finding was that 78% of subjects (39 out of 50) did not think that the
avatar, when controlled by a computer, was being controlled by a computer. For
many of them, the first suggestion that it might have been computer-controlled
came only when the question was asked.
It was evident that the natural setting, the use of avatars, and the human-ness of
the language all played a part in deceiving the participants. Typical quotes
included: “She seemed human to me in looks, body language, etc.,” and “I
assumed she was human because I expected her to be”.
Gilbert concluded "as the current study demonstrates, it is possible to achieve
deception rates approaching 80% using only a moderately capable chat engine
when all of [the] psychological and contextual factors are favorably represented"
and that "The current study suggests that 3D virtual environments, a platform that
wasn’t even contemplated when Turing first proposed his test, may offer the most
favorable context to achieve these challenging outcomes because of their unique
ability to activate the anthropomorphic tendency in support of humanizing the
computer. "
3 The University of Worcester Experiment
With the Loyola Marymount experiment there is a clear concern that the use of
human-looking 3D avatars could have biased the results - participants saw
something human (even if only in digital form) and so may have assumed that the
controlling agency was also human. The Loyola Marymount experiment was also
a one-on-one test, and it was conceivable that an even higher level of deception
could be achieved in a group test (Covert-Group).
Daden, working with the University of Worcester, conducted an on-line
Problem-Based Learning (PBL) experiment to assess the capability and value of
chatbots (both covert and overt) in an educational context. Savin-Baden et al [11]
presented the initial results from this experiment but focused on its implications
for pedagogical agents. The analysis below considers the experiment within the
context of a covert Turing Test.
3.1 Experimental Design
The experiment was based around the premise/deception of an on-line Problem
Based Learning (PBL) exercise for Health & Care students at the University of
Worcester and at Walsall College. PBL involves students being presented with a
scenario which they then need to work on in groups in order to identify what
information they need to address the scenario, and then carry out their own
research, before further discussion and the presentation of an answer to a
facilitator. The scenarios often have no “right” answer and decisions/solutions
may be purely down to value judgements.
For the experiment 42 students (and 3 staff) were recruited from the two
institutions – principally by requesting volunteers after lectures. The experiment
was run in two phases, Phase 1 being a covert Turing Test, and Phase 2 being an
overt Turing Test. In Phase 1 the students were divided into groups of 3-5
(depending on availability), and undertook 3 one-hour, online discussion sessions.
Ideally:
The first session was used to present the scenario and have an initial
discussion.
The second session was held 2 days later for further discussion after the
students had done some research
The third session was held another 2 days later to finish the discussion and
come up with an answer/recommendation.
In practice only 4 of the Phase 1 groups did 3 sessions, with 11 groups doing 2
sessions and 2 groups (both staff) doing only 1. A facilitator was present and
active in each session. The scenario for Phase 1 was that of a flatmate, Carolyn,
being suspected of having drink and money problems.
Phase 2 followed a similar structure with 4 groups doing 2 sessions and 3
groups doing only 1 session. The Phase 2 scenario was one of advising a
professional relocating her family in Europe. Seventeen students were involved,
all but 2 having also completed Phase 1. In this Phase each group was briefed that
one of its number was a chatbot, and that once a participant had a good idea of
who they thought the chatbot was they should either declare it to the group or in
private chat to the facilitator. Due to a number of constraints it was not possible to
run groups with no chatbot as a control. Since the task was more straightforward,
and given that its main focus was on the un-masking of the chatbot, Phase 2
typically ran for only one or two sessions.
Since the sessions were held in the late afternoon/early evening, the students
dialled in remotely, often from home or even from mobile devices.
A chatbot was added to each group. Its presence was not declared at all in
Phase 1, and in Phase 2 its presence, but not identity, was declared. The chatbot
was programmed to participate in the text chat discussion just as though it was an
ordinary student. The chatbot was always in the group even if, due to the failure of
some participants to attend, there was only one human student in the group. The
facilitators were aware of both the presence and the identity of the chatbot since
the intention was to perform a “hot-debrief” when the presence of the chatbot was
detected.
Given that the experiment required a deception to be maintained from first
advertising the experiment to the completion of Phase 1, and would require
students to be giving up their own time to participate, the team was careful to
ensure that the experiment gained all the necessary ethical clearances from the
University. In addition, since the sessions were straightforward PBL sessions
which were designed to contribute to the students' learning, although not as part of
their formal course (and completely optional), it was agreed that even if the
chatbot were unmasked in the first minutes of the first session then the students
would still be given the opportunity to continue with the whole 3 x 1 hour
exercise, with or without the chatbot, so as to get the benefits of the exposure to
the on-line PBL method.
3.2 Technical Design
Technically the chatbot was again implemented using Daden’s Discourse system.
It is notable that the Discourse system is nothing special. It is a commercially
proven chatbot – having been used in training applications and as a virtual
librarian and student support agent - but it is based fundamentally on the AIML
model, although extended to make it easier to add keyword spotting, to use
synonyms and to define and track context. It still uses a pattern matching/response
template approach with no element of machine learning or grammatical analysis.
Discourse was implemented using C# and ASP.NET with a SQLite database
for the Worcester Experiment (a Perl version using text files had been used for the
Loyola Experiment). A simple forms based editor was provided to the chatbot
author so that they did not need to know the XML markup of AIML.
The chatroom environment was based around industry standard software – so
that it appeared no different to the participants than a “real” chat room. A locally
hosted server running chat room software (in this case Prosody) controlled the
chatrooms, and the students accessed the chat rooms using standard chat room
clients (principally Psi client, although some other clients were also used). The
Prosody software used the Extensible Messaging and Presence Protocol (XMPP)
to communicate with the clients. The chatbot was interfaced with the Prosody
using the same XMPP protocol – so it technically appeared no different to a
human user.
The chatbot’s interface to the XMPP protocol was implemented in a bespoke
software element, termed the “humaniser”. This took an XMPP message from a
human participant (via the server) and passed it to the Discourse chatbot as input,
and could optionally filter the input so that the chatbot only saw 1 in N messages – it
would not be human-like to respond to every single message. The humaniser would,
though, always pass messages from the facilitator to the chatbot.
Once the chatbot had created a response to the message this was passed back to
the humaniser where a number of functions were performed:
Spelling errors were randomly introduced, but using typical error models
Sentences were split onto multiple lines
The message was delayed based on both length of question and of reply
Tags for system items such as “Speaker’s name” were replaced by their
values for the current session/message
The message was then sent using the XMPP protocol to the Prosody server to
appear in the chat room.
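
To make this pipeline concrete, the following minimal C# sketch illustrates the kind of processing the humaniser performed. It is a sketch only: the class and method names (Humaniser, ShouldForward, Humanise), the sampling ratio, the typo probability and the timing constants are all illustrative assumptions rather than the actual humaniser implementation.

using System;
using System.Collections.Generic;

// Illustrative sketch of the humaniser stage, sitting between the XMPP
// connection and the Discourse engine. Names and constants are assumed.
public class Humaniser
{
    private readonly Random _rng = new Random();
    private readonly int _sampleRate;   // the chatbot sees roughly 1 in N student messages
    private int _messageCount;

    public Humaniser(int sampleRate = 3) { _sampleRate = sampleRate; }

    // Facilitator messages are always forwarded to the chatbot;
    // ordinary participant messages are sampled at 1 in N.
    public bool ShouldForward(bool fromFacilitator)
    {
        if (fromFacilitator) return true;
        _messageCount++;
        return _messageCount % _sampleRate == 0;
    }

    // Post-process a chatbot reply before it is sent back out over XMPP.
    public IEnumerable<(string Line, TimeSpan Delay)> Humanise(
        string reply, string question, string speakerName)
    {
        // Replace system tags (e.g. the current speaker's name) with session values.
        reply = reply.Replace("{speaker}", speakerName);

        // Occasionally introduce a simple typo (adjacent-character swap).
        if (reply.Length > 4 && _rng.NextDouble() < 0.1)
        {
            int i = _rng.Next(reply.Length - 1);
            char[] chars = reply.ToCharArray();
            (chars[i], chars[i + 1]) = (chars[i + 1], chars[i]);
            reply = new string(chars);
        }

        // Split sentences onto separate chat lines, and delay each line in
        // proportion to the length of the question and of the reply,
        // mimicking reading and typing time.
        foreach (string line in reply.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries))
        {
            TimeSpan delay = TimeSpan.FromMilliseconds(50 * question.Length + 120 * line.Length);
            yield return (line.Trim(), delay);
        }
    }
}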
3.3 Principal Results: Phase 1 – Covert Turing
During Phase 1 a total of 42 subjects took part in 17 groups, between them logging
a total of around 88 participant-hours of chat with the chatbot (e.g. 4 humans in
one group with a chatbot talking for 1 hour = 4 participant-hours).
At no point during Phase 1 did any participant raise with the group or facilitator
the fact that they thought that one of the participants was a computer, or even a
suspicion that all was not as it seemed.
At the end of the last session for each group all participants were sent a survey
to complete (which included a peer feedback survey), and 8 participants were
randomly selected for a semi-structured interview. In addition to the survey and
interview the chat logs from the session were also analysed to provide additional
data.
Sample responses from the interviews, after the presence of the bot had been
revealed, as to why the bot was not detected included:
“I wouldn't have expected there to be a bot present”
“There are people who like to repeat themselves and ignore others, so
it's not so easy being able to tell. Especially through an online chat”
“Joined in conversation very well, responding appropriately and so
on, as well as seeming to have done the homework research.”
“because some of his questions were quite interesting and got us onto
some very good talking points”
“the person contributed a lot, with varied responses”
“Nikos (the chatbot) was probably one of the other better people to
actually get involved with and we had some good discussions. I think
about it and from what we said he had quite a similar opinion to me”
From the interviews, surveys and analysis of the logs for Phase 1 there were three
key findings:
Whilst the responses given by the computer were logically correct, and
often relevant, the language used betrayed the fact that they did not come
from someone from the same cohort as the human participants (see below).
Whilst this would not always be necessary, in situations where the chatbot
is trying to portray a particular character then how it says things is as
important as what it says. As one student said, it sounded more like a tutor
masquerading as a student.
The chatbot was typically identified as being the least engaged, helpful or
co-operative member of the group. However, its behaviour was seen as
being within the bounds of “normal” human behaviour, some participants
assuming (or rationalizing) that the chatbot was either a foreign student
with poor English or someone with language difficulties. Whilst such a
judgement may not be much of an achievement for a human, it does perhaps represent a
significant milestone for a computer.
Following on from this, although the bot was successful at hiding its
presence (or at least not giving away its true nature), it was generally
unsuccessful at establishing any credibility or legitimacy within the group.
As one student said “I just thought it was a git… They knew what they were
talking about and didn't bother talking to us." Something is legitimate if it
is in accord with the norms, values, and practices accepted by a group [12].
Presenting legitimating behavior can facilitate community responsiveness
[13] and enable newcomer integration. This suggests that if we now want to
move beyond simply “hiding” the presence of chatbot we need to consider
the research around legitimacy in the online community space in order that
the chatbot can gain acceptance and legitimacy within the group.
It was notable that none of the participants felt aggrieved by the deception.
Indeed, most had a very positive attitude to the experience and found the idea of a
covert Turing Test fascinating. As one participant said:
“It was a surprise. I knew there was something not quite right with this person
but I didn't really think much of it … and it wasn't until later that, of course, I
realised it was the AI and I thought... I didn't feel deceived. It was more like a
jigsaw piece and you've just handed me the last piece in the puzzle. I think, ah,
now everything makes sense.”
3.4 Principal Results: Phase 2 – Overt Turing
During Phase 2 a total of 17 subjects took part in 7 groups, between them logging
a total of 28 participant-hours of chat with the chatbot. Each group was told that a
chatbot was present, but not its identity (an overt Turing Test). They were asked to
identify the chatbot once they thought they knew which participant it was, either
publicly in the group discussion or by private message to the facilitator.
Only three participants identified the chatbot and informed the facilitator
proactively during the Phase 2 sessions (one incorrectly). The remaining 14
participants correctly identified the chatbot when completing the questionnaire
after the Phase 2 sessions ended.
At the end of the last session for each group all participants were sent a survey
to complete, and 10 participants were randomly selected for a semi-structured
interview. In addition to the survey and interview the chat logs from the session
were also analysed to provide additional data.
Sample responses during the Phase 2 debrief when asked what the factors were
that made them uncertain about the chatbot’s identity included:
“It responded directly to my point about dyslexia and wasn't
particularly off topic at any point. I thought it was at the start, but then
got more believable as it went on”
“Seemed to make more relevant points than last time I did a PBL
session with the bot.”
“Misspelling "cyslexia" and referring back to something I had said.”
Many of the participants claimed to have been deceived several times by the
chatbot because of the kinds of statements it gave, i.e. perceptions of it probing
earlier comments or referring to a past comment made by a human.
The analysis for Phase 2 focused on the ‘tells’ that the students used to
positively identify the chatbot. These included:
Excessive repetition
No opinions relating to the topic being discussed
Not effectively directing questions at individuals
Not saying things related to what the previous person said
Not referring much to what other people were saying
Not picking up changes in conversational moves
Delivering confused utterances
Providing inconsistent responses.
However, the chatbot's responses were viewed as being factually correct and not
particularly evasive. It should also be noted that many criteria were important to
some participants whilst being less important to others. For instance,
“spelling mistakes” and “use of names” appear both as factors creating uncertainty
and as tells – suggesting that sensitivity to specific tells may be a very personal
attribute.
3.5 Technical Analysis of the Worcester Experiment
Whilst Savin-Baden et al [11] expand on much of the analysis above, that paper does
not provide the technical analysis of the performance of the chatbot which was used
in the experiment.
3.5.1 Cases
Knowledge within an AIML chatbot is defined as a set of pattern-response pairs
called cases. Chatbots entered into the Loebner Prize (an overt Turing Test) can
have tens or even hundreds of thousands of cases in their database – ALICE has
120,000 [14].
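
The Discourse case format itself is not reproduced here, but the following minimal C# sketch (with an assumed ChatCase record and SimplePatternEngine class, neither of which is the actual Discourse code) illustrates the general idea of a case as a pattern-response pair, together with the kind of priority ordering that lets problem-specific cases win over generic stubs, simple defaults and the catch-all wildcard discussed in section 3.5.2.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Simplified illustration of a pattern-response "case". The real Discourse
// patterns (keyword spotting, synonyms, context tracking) are richer than this.
public record ChatCase(string Pattern, string[] Responses, int Priority);

public class SimplePatternEngine
{
    private readonly List<ChatCase> _cases = new();
    private readonly Random _rng = new();

    public void Add(string pattern, int priority, params string[] responses) =>
        _cases.Add(new ChatCase(pattern, responses, priority));

    // Return a response from the highest-priority matching case.
    public string Reply(string input) =>
        _cases.Where(c => Regex.IsMatch(input, c.Pattern, RegexOptions.IgnoreCase))
              .OrderByDescending(c => c.Priority)
              .Select(c => c.Responses[_rng.Next(c.Responses.Length)])
              .FirstOrDefault() ?? "";
}

// Hypothetical usage, loosely based on the Phase 1 scenario topics:
//   var engine = new SimplePatternEngine();
//   engine.Add(@"\bdrink(ing)?\b", 3, "I think the drinking is the real worry here.");  // problem specific
//   engine.Add(@"^do you think\b", 2, "Maybe - what makes you ask?");                   // generic stub
//   engine.Add(@"^(yes|no)\b", 1, "OK, fair enough.");                                  // simple default
//   engine.Add(@".*", 0, "Should we talk about the money side of things?");             // catch-all wildcard
//   Console.WriteLine(engine.Reply("Do you think Carolyn has a drinking problem?"));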
To develop the cases for the Worcester experiment the research team
completed the same on-line PBL scenarios as the students, with the same
facilitators, and over the same time period (3 x 1 hour sessions over a week). From
these sessions and general discussion a mind-map was created to define all the
areas in which the chatbot was likely to require responses.
The cases were then written using replies either developed by the two-person
authoring team or manually extracted from the chat logs of the internal sessions.
As noted above one of the downsides of this was that the chatbot had the linguistic
style and vocabulary of its primary author (white English, male, 50s) and the
internal session participants (mainly white English, mainly male, 20s – 50s), rather
than that of the study participants (college age, predominantly female, mix of
races).
Table 1 shows the distribution of cases created by topic areas for Phase 1. No
general knowledge cases were loaded, so the chatbot had only 410 cases available
in Phase 1 - encoded as 748 Discourse patterns, and only 309 cases in Phase 2.
187 cases were common between the two phases. It should be noted that the
flexible structure of a Discourse pattern means that it can achieve in one pattern
what may require multiple cases of pure AIML to achieve, but the equivalence in
reality is unlikely to be more than 1:3 or so.
Table 1: Distribution of cases (15 topics with less than 5 cases omitted)

Content/Discussion Topic                                            Number of Cases
Default responses to generic questions                                   50*
Scenario needs                                                            8
Ethical frameworks                                                       14
Other discussions                                                         9*
General factors                                                          15
How to solve the problem                                                 12
Drinking                                                                  6
Information about the bot’s course                                        7*
Family                                                                    8
Information about the bot (its legend)                                    8*
Friends                                                                   7
The solution                                                              9
Money                                                                    11
Default responses common openings (e.g. open/closed questions)           79*
Subject’s personality                                                     6
Who’s doing which task                                                   12
Police                                                                    5
Responses to key generic phrases                                          5
Rent                                                                      7
Test cases                                                                5*
Trust                                                                     9
Redirects to bot common responses                                        11
University                                                                7
Time sensitive statements                                                11*
Forming bad habits                                                        6
Bad peer groups                                                           5

* = common in Phase 1 & 2
The usual approach with commercial chatbots is to go through a period of
iterative development and improvement (so-called “convologging”) both prior to and
after going live. However, in order to have a consistent set of cases for the experiment
we undertook only some minor improvements to the cases after the first student
group of each Phase, and left the cases alone for the remainder of the Phase.
3.5.2 Case Usage
Given the relatively small number of cases available to the chatbot it is interesting
that an even smaller number were actually used.
In Phase 1 with 410 cases available only 166 cases were used (40%) during the
total of 3442 exchanges between a participant (student or facilitator) and the
chatbot. Of these exchanges:
9% were handled by problem specific patterns
7% were handled by generic “stub patterns” (e.g. “do you think….”)
32% handled by simple defaults (responses to “yes”, “no” etc)
50% were handled by the catch all wildcard (*) responses – which usually
results in the bot trying to restart the discussion around a new topic, or
make some non-committal utterance.
In Phase 2 with 309 cases available only 83 cases were used (27%) during the
937 exchanges.
19% handled by problem specific patterns
10% handled by generic “stub patterns”
25% handled by defaults (yes, no etc)
44% handled by the catch-all wildcard responses
Despite the significantly smaller pool of cases used, the general feeling of the
participants was that the Phase 2 chatbot performed better than the Phase 1
chatbot.
3.5.3 Technical Failings
Whilst the chatbot suffered no technical glitches during any session there were
elements of poor authoring that rapidly became apparent, during Phase 1 in
particular. In a normal deployment such authoring would be rapidly corrected as
part of the convologging process, but since a consistent case database was needed
for the experiments this was not feasible. The errors included:
Allowing some default responses to be triggered a large number of times
(e.g. “Sorry”) rather than giving more specific or varied responses
Handling facilitator responses directed specifically to the chatbot with a
narrower range of cases than the student conversation (since it was assumed
that a direct comment from the facilitator would be of a different nature –
but it wasn’t, so it often got handled by the wildcard response)
These errors featured significantly in the participants’ comments about poor
chatbot performance, and so an even better performance could be expected had
these errors been fixed on the fly, or the chatbot system enhanced to reduce
repetition more automatically.
3.5.4 Technical Conclusions
Whilst accepting that the chatbot was operating within a fairly constrained
environment/dialog space, and that this was a covert Turing Test, the fact that the
chatbot managed to avoid detection using only a “standard” chatbot engine and
only a hundred or so cases suggests that creating an effective chatbot is probably
more a task of finesse than of brute force.
The results suggest that to be successful at escaping detection then the bot
must:
Not give itself away through technical faults/authoring errors that generate
“computer error” type responses
Be able to stay reasonably on topic and make statements that are not too
evasive
Be able to secure enough “lucky hits” where what it says sounds so human
that any doubts that a user has that it may be a bot are (at least temporarily)
allayed.
As has been noted this is not enough to give the chatbot legitimacy, but it may
be enough to stop it from being identified. It is interesting to compare this analysis
with that of Warwick and Shah [15], where the strategy of silence for a chatbot is
considered; in creating the wildcard responses for the bot (and given the sampling of
exchanges that was taking place), opting for a silent response rather than risking
an erroneous one was certainly one of the strategies employed.
Our conclusion is therefore that a relatively small number of carefully crafted
responses tuned to the situation may be a more effective strategy for chatbot
creation than creating ever greater databases of responses and more complex
dialogue engines.
4 Revisiting the Original Hypothesis
One of the key research questions presented at the end of Burden [1] was “To
what extent do the Covert/Overt and Singleton/Group options present earlier
opportunities to pass the Turing Test (or equivalent) in a virtual world?”.
Although the Worcester experiment was conducted within an on-line discussion
group rather than in a virtual world, its results, along with those of the Loyola
Marymount test which was conducted in a virtual world, do allow us to start to
answer this question.
Analysing the results of a conventional Turing Test (Singleton Overt) in the
form of the Loebner Prize [4], there has only been one instance in the 13-year
history of the prize where a chatbot has fooled at least one (and indeed only one)
judge. This was Suzette by Bruce Wilcox in 2012. The chatbot achieved a
deception rate of 25% on the basis that there were 4 judges for that one
experiment and one was deceived (but given that Wilcox has since entered 5 more
times with Angela, Rose and Rosette without fooling any judges, it could also be
taken as a 1/(7x4) = 3.6% deception rate across all his Loebner Prize attempts). By
contrast the Loyola Marymount experiment (Singleton Covert) achieved a 78%
deception rate. The Worcester Phase 1 experiment (Group Covert) achieved a
100% deception rate, although the Phase 2 experiment (Group Overt) achieved
only a 6% deception rate (though the reticence of most students to identify the
chatbot during the session, and their comments about its improved quality,
suggest that a high level of doubt may have existed).
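
For clarity, the rates quoted above are consistent with reading a deception rate simply as the proportion of evaluators who were deceived; the summary below makes that arithmetic explicit (this is our reading of the quoted figures, not a formula given in the cited sources).

\[
d = \frac{\text{evaluators deceived}}{\text{evaluators exposed to the chatbot}}
\]
\begin{align*}
\text{Loebner Prize (Suzette, single contest)} &: 1/4 = 25\%\\
\text{Loebner Prize (all Wilcox entries)} &: 1/(7 \times 4) \approx 3.6\%\\
\text{Loyola Marymount (Singleton Covert)} &: 39/50 = 78\%\\
\text{Worcester Phase 1 (Group Covert)} &: 42/42 = 100\%\\
\text{Worcester Phase 2 (Group Overt)} &: 1/17 \approx 6\%
\end{align*}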
The results would therefore appear to bear out the original hypothesis that the
Singleton Covert and Group Covert conditions can indeed provide an earlier
opportunity to pass the Turing Test than the more traditional Singleton Overt case.
Further covert singleton and covert group tests (and indeed overt group tests)
should be conducted to confirm this.
5 Further Work
There are a number of areas where further work would be useful to improve chatbot
performance and to produce a more solid set of results.
In the case of experimental design a significant improvement to the Worcester
experiment would be for the facilitator not to be aware of the chatbot’s identity. In
the experiment the two principal investigators from Worcester were the facilitators
since they best understood how to facilitate on-line PBL sessions, as well as
understanding what the experiment was trying to achieve. Whilst the facilitator did
not try to hide the identity of the bot, their interactions with it may have been
unintentionally biased.
It would be interesting to compare the performance of a machine-learning
based chatbot with that of the Discourse chatbot. The conversation logs could
provide some training material for such a bot.
Both the experiments were conducted in quite constrained environments (in
time, topic and technology), and with a correspondingly constrained expected
dialogue. Relaxing those constraints whilst still trying to maintain the high
deception rates would provide a useful challenge. The most beneficial constraint
for the chatbot is likely to be that of topic. Technology interfaces can be readily
added. Longer individual sessions are likely to result in all participants getting
stale, and more sessions would perforce need new topics, or deeper analysis of the
current ones – and broadening the chatbot to cope with more topics is the real
challenge.
Given that the chatbot in the Worcester Experiment was successful in its
deception, as well as looking at how to maintain that deception in a less
constrained environment we should also look at how to move beyond deception
and establish and maintain the legitimacy of the chatbot within the group (and
indeed in the Singleton test). A good grounding for this would be a firm
understanding of how humans achieve such legitimacy, which could then be
applied to the chatbot.
A chatbot which could achieve legitimacy whilst maintaining the deception,
and operate within a relatively unconstrained environment would be a significant
achievement.
References
1. Burden, D. J.: Deploying embodied AI into virtual worlds. Knowledge-Based
Systems, 22(7), pp. 540-544 (2009).
2. Wakefield, J: Intelligent Machines: Chatting with the bots. BBC Web Site. Available
at http://www.bbc.co.uk/news/technology-33825358. (Last accessed 30 May 2016).
(2015)
3. Turing, A.M.: Computing machinery and intelligence. Mind, 59, 433-460 (1950)
4. Bradeško, L., & Mladenić, D.: A Survey of Chatbot Systems through a Loebner Prize
Competition. Proc. Slovenian Language Technologies Society Eighth Conference of
Language Technologies. pp. 34-37 (2012).
5. Cooper, S.B., Van Leeuwen, J. (eds.): Alan Turing: His Work and Impact. Elsevier (2013).
6. Shieber, S. M.: Lessons from a restricted Turing test. arXiv preprint cmp-lg/9404002 (1994).
7. Mauldin, M. L.: Chatterbots, Tinymuds, and the Turing Test: Entering the Loebner
   Prize competition. Proc. AAAI-94, pp. 16-21 (1994).
8. Heiser, J.F., Colby, K.M., Faught, W.S., Parkison, R.C.: Can psychiatrists
   distinguish a computer simulation of paranoia from the real thing? The limitations of
   Turing-like tests as measures of the adequacy of simulations. Journal of Psychiatric
   Research, 15(3), pp. 149-162 (1979).
9. Kurzweil, R.: Why we can be confident of Turing test capability within a quarter
   century. The Dartmouth Artificial Intelligence Conference: The Next 50 Years,
   Hanover, NH (2006).
10. Gilbert, R. L., & Forney, A.: Can avatars pass the Turing test? Intelligent agent
perception in a 3D virtual environment. International Journal of Human-Computer
Studies, 73, pp. 30-36 (2015).
11. Savin-Baden, M., Bhakta, R., Burden, D.: Cyber Enigmas? Passive detection and
Pedagogical agents: Can students spot the fake? Proc. Networked Learning
Conference (2012).
12. Zelditch, M.: Theories of legitimacy. In: The Psychology of Legitimacy: Emerging
    Perspectives on Ideology, Justice, and Intergroup Relations. Cambridge University
    Press, pp. 33-53 (2001).
13. Burke, M., Joyce, E., Kim, T., Anand, V., Kraut, R.: Introductions and requests:
    Rhetorical strategies that elicit response in online communities. Communities and
    Technologies 2007. Springer London, pp. 21-39 (2007).
14. Wilcox, B., Wilcox, S.: Suzette, the Most Human Computer. Available at
    http://chatscript.sourceforge.net/Documentation/Suzette_The_Most_Human_Computer.pdf
    (Last accessed 30 May 2016).
15. Warwick, K., & Shah, H.: Taking the fifth amendment in Turing’s imitation game.
Journal of Experimental & Theoretical Artificial Intelligence. pp. 1-11 (2016).