Covert Implementations of the Turing Test: A More Level Playing Field?
D.J.H. Burden (1), M. Savin-Baden (2), R. Bhakta (3)

(1) Daden Limited, B7 4BB, UK. david.burden@daden.co.uk
(2) University of Worcester, WR2 6AJ. m.savinbaden@worc.ac.uk
(3) University of Worcester, WR2 6AJ. r.bhakta@worc.ac.uk
Abstract
It has been suggested that a covert Turing Test, possibly in a virtual world,
provides a more level playing field for a chatbot, and hence an earlier opportunity
to pass the Turing Test (or equivalent) in its overt, declared form. This paper looks
at two recent covert Turing Tests in order to test this hypothesis. In one test (at
Loyola Marymount) run as a covert-singleton test, of 50 subjects who talked to the
chatbot avatar, 39 (78% deception) did not identify that the avatar was being
driven by a chatbot. In a more recent experiment at the University of Worcester
groups of students took part in a set of problem-based learning chat sessions, each
group having an undeclared chatbot. Not one participant volunteered the fact that
a chatbot was present (a 100% deception rate). However, the chatbot character was
generally seen as being the least engaged participant, highlighting that a chatbot
needs to concentrate on achieving legitimacy once it can successfully escape
detection.
1 Introduction
Burden [1] described how the Turing Test as commonly implemented provides the
computer (in the guise of a chatbot) with a significant challenge, since both the
judges and hidden humans are aware that they are taking part in a Turing Test, and
so the dialogues which take place during the test are rarely “normal” [2]. In the
same paper Burden described how a more level playing field could be created by
conducting a covert Turing Test, where neither judge nor hidden human know that
the test is taking place. Burden also suggested that virtual environments could
provide the ideal location for such a covert Turing Test.
Burden defined four possible Turing Test situations:
- The robotar (computer/chatbot driven avatar) is undeclared and part of a group conversation (the Covert Group test).
- A robotar is declared present (but unidentified) as part of a group conversation (the Overt Group test).
- The robotar is undeclared and part of a set of one-on-one conversations (the Covert Singleton test).
- A robotar is declared present (but unidentified) as part of a set of one-on-one conversations (the Overt Singleton test, the original Turing Test, the Imitation Game [3] as typically implemented by competitions such as the Loebner Prize [4]).
Burden identified a potential area of future research as "to what extent do the
Covert/Overt and Singleton/Group options present earlier opportunities to pass
the Turing Test (or equivalent) in a virtual world?"
This paper will review two covert Turing Tests which have been inspired by
Burden's paper, one in a virtual world and one in an on-line chat room. For the
second test the paper will also analyse some of the previously unpublished metrics
obtained during the test with regards to the performance of the chatbot. This paper
will then examine the extent to which these tests have borne out the predictions
and hypothesis of the original paper, and consider what further work could be
done in this area in order to further the creation of “Turing-capable” chatbots.
Cooper and Van Leeuwen [5] provide a useful survey of current thought on the
Turing Test, in particular Sloman's observations that the test is a poor measure of
intelligence, and that the "average" interrogator envisaged by Turing is now far
more sophisticated and aware of computer capabilities - and so potentially far
harder to fool. Little appears to have been written around the concept of a Covert
or Group test. It is interesting that, in his analysis of the problems of the first
Loebner Prize and his suggestions for improvements, Shieber [6] does not consider the
potential for a Covert or Group test to overcome some of the perceived issues.
However, Mauldin [7] does describe an "unsuspecting Turing Test" using a
text-based Multi-User Dungeon as the setting, but reported that the chatbot only
"fooled 'average' questioners in the TINYMUD domain for a few minutes." Colby's
experiments with PARRY [8] appear to be the first significant use of a covert
Turing approach - albeit in a non-interactive setting (a passive cf. active detection
test). Kurzweil [9] does describe the sort of covert Turing Test considered here in
the Worcester experiment, "each system logs on to various human chat rooms and
tries to pass for a human, basically a covert Turing test. If one of the humans in a
chat room says something like 'What are you, a chatterbot?' .... the evaluation is
over".
2 The Loyola Marymount Experiment
Gilbert and Forney [10] describe how they ran a covert Turing Test in the virtual
world of Second Life. A summary of that experiment is provided below for
comparison with the Worcester experiment, and Gilbert and Forney otherwise
provide a full description of their experiment.
The experiment was based around the premise/deception of a new shop in
Second Life, with subjects being invited into the shop to discuss and comment on
its layout through interacting with the shop assistant. Both subject and assistant
were represented by Second Life avatars, could see each other's avatars, and were
able to converse by text chat, the standard form of interaction in Second Life. The
key features of the experiment were that:
- In 50% of the sessions the shop assistant was controlled by a computer, which included both a chatbot and the ability to move through the environment and sense the subject's position.
- In the other 50% of the sessions the shop assistant was controlled by a human. Crucially, the human was not told about the hidden Turing element of the experiment; they believed they were just there to show the subjects around the new shop and handle any questions.
- It was only in the final question of the post-evaluation debrief that the subjects were asked whether they thought the avatar (which always presented as female) was being controlled by a man, a woman or a computer.
The chatbot software used for the experiment was Discourse, an Artificial
Intelligence Markup Language (AIML) derived chatbot engine developed by
Daden Limited (Disclosure: The primary author is the owner and Managing
Director of Daden Limited).
The key finding was that 78% of subjects (39 out of 50) did not think that the
avatar, when controlled by a computer, was being controlled by a computer. The
possibility that it could have been controlled by a computer only occurred to many
of them when the final debrief question was asked.
It was evident that the natural setting, the use of avatars, and the human-ness of
the language all played a part in deceiving the participants. Typical quotes
included: “She seemed human to me in looks, body language, etc.,” and “I
assumed she was human because I expected her to be”.
Gilbert concluded "as the current study demonstrates, it is possible to achieve
deception rates approaching 80% using only a moderately capable chat engine
when all of [the] psychological and contextual factors are favorably represented"
and that "The current study suggests that 3D virtual environments, a platform that
wasn’t even contemplated when Turing first proposed his test, may offer the most
favorable context to achieve these challenging outcomes because of their unique
ability to activate the anthropomorphic tendency in support of humanizing the
computer. "
3 The University of Worcester Experiment
With the Loyola Marymount experiment there is a clear challenge that the use of
human-looking 3D avatars could have biased the results - participants saw
something human (even if only in digital form) and so may have assumed that the
controlling agency was also human. The Loyola Marymount experiment was also
a one-on-one test, and it was possible that an even higher level of deception could
be achieved in a group test (Covert-Group).
Daden, working with the University of Worcester, conducted an on-line
Problem-Based Learning (PBL) experiment to assess the capability and value of
chatbots (both covert and overt) in an educational context. Savin-Baden et al [11]
presented the initial results from this experiment but focused on its implications
for pedagogical agents. The analysis below considers the experiment within the
context of a covert Turing Test.
3.1 Experimental Design
The experiment was based around the premise/deception of an on-line Problem
Based Learning (PBL) exercise for Health & Care students at the University of
Worcester and at Walsall College. PBL involves students being presented with a
scenario which they then need to work on in groups in order to identify what
information they need to address the scenario, and then carry out their own
research, before further discussion and the presentation of an answer to a
facilitator. The scenarios often have no “right” answer and decisions/solutions
may be purely down to value judgements.
For the experiment 42 students (and 3 staff) were recruited from the two
institutions principally by requesting volunteers after lectures. The experiment
was run in two phases, Phase 1 being a covert Turing Test, and Phase 2 being an
overt Turing Test. In Phase 1 the students were divided into groups of 3-5
(depending on availability), and undertook 3 one-hour, online discussion sessions.
Ideally:
- The first session was used to present the scenario and have an initial discussion.
- The second session was held 2 days later for further discussion after the students had done some research.
- The third session was held another 2 days later to finish the discussion and come up with an answer/recommendation.
In practice only 4 of the Phase 1 groups did 3 sessions, with 11 groups doing 2
sessions and 2 groups (both staff) doing only 1. A facilitator was present and
active in each session. The scenario for Phase 1 was that of a flatmate, Carolyn,
being suspected of having drink and money problems.
Phase 2 followed a similar structure with 4 groups doing 2 sessions and 3
groups doing only 1 session. The Phase 2 scenario was one of advising a
professional relocating her family in Europe. Seventeen students were involved,
all but 2 having also completed Phase 1. In this Phase each group was briefed that
one of its number was a chatbot, and that once a participant had a good idea of
who they thought the chatbot was they should either declare it to the group or in
private chat to the facilitator. Due to a number of constraints it was not possible to
run groups with no chatbot as a control. Since the task was more straightforward,
and given that its main focus was on the unmasking of the chatbot, Phase 2
typically ran for only 1 or 2 sessions.
Since the sessions were held in the late afternoon/early evening, the students
dialled in remotely, often from home or even from mobile devices.
A chatbot was added to each group. Its presence was not declared at all in
Phase 1, and in Phase 2 its presence, but not identity, was declared. The chatbot
was programmed to participate in the text chat discussion just as though it was an
ordinary student. The chatbot was always in the group even if, due to failure of
some participants to attend, there was only one human student in the group. The
facilitators were aware of both the presence and the identity of the chatbot since
the intention was to perform a “hot-debrief” when the presence of the chatbot was
detected.
Given that the experiment required a deception to be maintained from first
advertising the experiment to the completion of Phase 1, and would require
students to be giving up their own time to participate, the team was careful to
ensure that the experiment gained all the necessary ethical clearances from the
University. In addition, since the sessions were straightforward PBL sessions
which were designed to contribute to the students' learning, although not as part of
their formal course (and completely optional), it was agreed that even if the
chatbot were unmasked in the first minutes of the first session then the students
would still be given the opportunity to continue with the whole 3 x 1 hour
exercise, with or without the chatbot, so as to get the benefits of the exposure to
the on-line PBL method.
3.2 Technical Design
Technically the chatbot was again implemented using Daden’s Discourse system.
It is notable that the Discourse system is nothing special. It is a commercially
proven chatbot engine, having been used in training applications and as a virtual
librarian and student support agent, but it is based fundamentally on the AIML
model, although extended to make it easier to add keyword spotting, to use
synonyms, and to define and track context. It still uses a pattern matching/response
template approach with no element of machine learning or grammatical analysis.
Discourse was implemented using C# and ASP.NET with a SQLite database
for the Worcester Experiment (a Perl version using text files had been used for the
Loyola Experiment). A simple forms-based editor was provided to the chatbot
author so that they did not need to know the XML markup of AIML.
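
To make the pattern-matching/response-template approach concrete, a minimal sketch of such a case engine is given below in C#. It is purely illustrative and is not the Discourse implementation: the Case class, the "{star}" substitution tag and the example cases are all assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Minimal sketch of an AIML-style case engine: each case pairs an input
// pattern (with "*" wildcards) with a response template. This is an
// illustrative reconstruction, not the Discourse engine itself.
class Case
{
    public string Pattern;   // e.g. "DO YOU THINK *"
    public string Template;  // e.g. "I'm not sure about {star}, what does everyone else think?"
}

class CaseEngine
{
    private readonly List<Case> cases;
    public CaseEngine(List<Case> cases) { this.cases = cases; }

    public string Respond(string input)
    {
        string normalised = input.Trim().ToUpperInvariant();
        foreach (var c in cases)
        {
            // Turn the wildcard pattern into a regex and capture the "*" text.
            string regex = "^" + Regex.Escape(c.Pattern.ToUpperInvariant())
                                      .Replace(@"\*", "(.+)") + "$";
            var match = Regex.Match(normalised, regex);
            if (match.Success)
            {
                string star = match.Groups.Count > 1
                    ? match.Groups[1].Value.ToLowerInvariant() : "";
                return c.Template.Replace("{star}", star);
            }
        }
        // Catch-all wildcard response: try to restart discussion on a new topic.
        return "Maybe we should think about what Carolyn's friends would say?";
    }
}

class Program
{
    static void Main()
    {
        var engine = new CaseEngine(new List<Case>
        {
            new Case { Pattern = "DO YOU THINK *",
                       Template = "I'm not sure about {star}, what does everyone else think?" },
            new Case { Pattern = "YES",
                       Template = "OK, that makes sense to me." }
        });
        Console.WriteLine(engine.Respond("Do you think Carolyn has a drink problem?"));
    }
}
```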
The chatroom environment was based around industry-standard software so
that it appeared no different to the participants from a real chat room. A locally
hosted server running chat room software (in this case Prosody) controlled the
chatrooms, and the students accessed the chat rooms using standard chat room
clients (principally the Psi client, although some other clients were also used). The
Prosody software used the Extensible Messaging and Presence Protocol (XMPP)
to communicate with the clients. The chatbot was interfaced with the Prosody
server using the same XMPP protocol, so technically it appeared no different from
a human user.
The chatbot's interface to the XMPP protocol was implemented in a bespoke
software element, termed the “humaniser”. This took an XMPP message from a
human participant (via the server) and passed it to the Discourse chatbot as input,
and could optionally sample the input so that the chatbot saw only 1 in N messages,
since it would not be human-like to respond to every single message. The humaniser
would, though, always pass messages from the facilitator to the chatbot.
Once the chatbot had created a response to the message this was passed back to
the humaniser where a number of functions were performed:
- Spelling errors were randomly introduced, but using typical error models.
- Sentences were split onto multiple lines.
- The message was delayed based on both the length of the question and of the reply.
- Tags for system items such as “Speaker’s name” were replaced by their values for the current session/message.
The message was then sent using the XMPP protocol to the Prosody server to
appear in the chat room.
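
A simplified sketch of the humaniser stage, reconstructed from the description above, is shown below in C#. The 1-in-N sampling, the adjacent-letter typo model, the delay formula and the "{speaker}" tag name are assumptions made for illustration (the paper notes only that typical error models and length-based delays were used), and delivery to the Prosody server is abstracted behind a callback rather than using any particular XMPP library.

```csharp
using System;
using System.Threading;

// Sketch of the "humaniser" post-processing stage described above.
// The typo model, delay formula and tag names are illustrative assumptions.
class Humaniser
{
    private readonly Random rng = new Random();
    private readonly int sampleEveryN;   // only pass 1 in N student messages to the bot
    private int messageCount = 0;

    public Humaniser(int sampleEveryN) { this.sampleEveryN = sampleEveryN; }

    // Decide whether an incoming message should be passed to the chatbot.
    // Messages from the facilitator are always passed through.
    public bool ShouldPassToBot(bool fromFacilitator)
    {
        if (fromFacilitator) return true;
        messageCount++;
        return messageCount % sampleEveryN == 0;
    }

    // Post-process the chatbot's reply before sending it to the chat room.
    public void Send(string reply, string questionText, string speakerName,
                     Action<string> sendToChatRoom)
    {
        // Replace system tags (e.g. the current speaker's name) with their values.
        reply = reply.Replace("{speaker}", speakerName);

        // Occasionally introduce a simple typo (swap two adjacent letters);
        // a crude stand-in for the "typical error models" mentioned in the paper.
        if (rng.NextDouble() < 0.1 && reply.Length > 3)
        {
            int i = rng.Next(reply.Length - 1);
            char[] chars = reply.ToCharArray();
            (chars[i], chars[i + 1]) = (chars[i + 1], chars[i]);
            reply = new string(chars);
        }

        // Delay based on the length of both question and reply, to simulate
        // reading and typing time (constants are arbitrary for the sketch).
        int delayMs = 1000 + 30 * questionText.Length + 60 * reply.Length;
        Thread.Sleep(delayMs);

        // Split longer replies onto multiple chat lines at sentence boundaries.
        foreach (string line in reply.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries))
        {
            sendToChatRoom(line.Trim());
        }
    }
}
```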
3.3 Principal Results: Phase 1 Covert Turing
During Phase 1 a total of 42 subjects took part in 17 groups, between them logging
a total of around 88 participant-hours of chat with the chatbot (e.g. 4 humans in
one group with a chatbot talking for 1 hour = 4 participant-hours).
At no point during Phase 1 did any participant raise with the group or facilitator
the fact that they thought that one of the participants was a computer, or even a
suspicion that all was not as it seemed.
At the end of the last session for each group all participants were sent a survey
to complete (which included a peer feedback survey), and 8 participants were
randomly selected for a semi-structured interview. In addition to the survey and
interview the chat logs from the session were also analysed to provide additional
data.
Sample responses from the interviews, after the presence of the bot had been
revealed, as to why the bot was not detected included:
- “I wouldn't have expected there to be a bot present”
- “There are people who like to repeat themselves and ignore others, so it's not so easy being able to tell. Especially through an online chat”
- “Joined in conversation very well, responding appropriately and so on, as well as seeming to have done the homework research.”
- “because some of his questions were quite interesting and got us onto some very good talking points”
- “the person contributed a lot, with varied responses”
- “Nikos (the chatbot) was probably one of the other better people to actually get involved with and we had some good discussions. I think about it and from what we said he had quite a similar opinion to me”
From the interviews, surveys and analysis of the logs for Phase 1 there were 3
key findings of significance:
- Whilst the responses given by the computer were logically correct, and often relevant, the language used betrayed the fact that they did not come from someone from the same cohort as the human participants (see below). Whilst this would not always be necessary, in situations where the chatbot is trying to portray a particular character then how it says things is as important as what it says. As one student said, it sounded more like a tutor masquerading as a student.
- The chatbot was typically identified as being the least engaged, helpful or co-operative member of the group. However, its behaviour was seen as being within the bounds of “normal” human behaviour, some participants assuming (or rationalizing) that the chatbot was either a foreign student with poor English or someone with language difficulties. Whilst such a judgement may not be much of a compliment for a human it does perhaps represent a significant milestone for a computer.
- Following on from this, although the bot was successful at hiding its presence (or at least not giving away its true nature), it was generally unsuccessful at establishing any credibility or legitimacy within the group. As one student said, “I just thought it was a git… They knew what they were talking about and didn't bother talking to us.” Something is legitimate if it is in accord with the norms, values, and practices accepted by a group [12]. Presenting legitimating behavior can facilitate community responsiveness [13] and enable newcomer integration. This suggests that if we now want to move beyond simply “hiding” the presence of the chatbot we need to consider the research around legitimacy in the online community space in order that the chatbot can gain acceptance and legitimacy within the group.
It was notable that none of the participants felt aggrieved by the deception.
Indeed, most had a very positive attitude to the experience and found the idea of a
covert Turing Test fascinating. As one participant said:
It was a surprise. I knew there was something not quite right with this person
but I didn't really think much of it … and it wasn't until later that, of course, I
realised it was the AI and I thought... I didn't feel deceived. It was more like a
jigsaw piece and you've just handed me the last piece in the puzzle. I think, ah,
now everything makes sense.
3.4 Principal Results: Phase 2 Overt Turing
During Phase 2 a total of 17 subjects took part in 7 groups, between them logging
a total of 28 participant-hours of chat with the chatbot. Each group was told that a
chatbot was present, but not its identity (an overt Turing Test). They were asked to
identify the chatbot once they thought they knew which participant it was, either
publicly in the group discussion or by private message to the facilitator.
Only three participants identified the chatbot and informed the facilitator
proactively during the Phase 2 sessions (one incorrectly). The remaining 14
participants correctly identified the chatbot when completing the questionnaire
after the Phase 2 sessions ended.
At the end of the last session for each group all participants were sent a survey
to complete, and 10 participants were randomly selected for a semi-structured
interview. In addition to the survey and interview the chat logs from the session
were also analysed to provide additional data.
Sample responses during the Phase 2 debrief, when asked what factors made
them uncertain about the chatbot's identity, included:
- “It responded directly to my point about dyslexia and wasn't particularly off topic at any point. I thought it was at the start, but then got more believable as it went on”
- “Seemed to make more relevant points than last time I did a PBL session with the bot.”
- “Misspelling 'cyslexia' and referring back to something I had said.”
Many of the participants claimed to have been deceived several times by the
chatbot because of the kinds of statements it gave, i.e. perceptions of it probing
earlier comments or referring to a past comment made by a human.
The analysis for Phase 2 focused on the “tells” that the students used to
positively identify the chatbot. These included:
- Excessive repetition
- No opinions relating to the topic being discussed
- Not effectively directing questions at individuals
- Not saying things related to what the previous person said
- Not referring much to what other people were saying
- Not picking up changes in conversational moves
- Delivering confused utterances
- Providing inconsistent responses.
However, responses were viewed as being factually correct and not particularly
evasive. It should also be noted that there were many criteria that were
important to some participants whilst being less important to others. For instance,
“spelling mistakes” and “use of names” appear on both lists – suggesting that
sensitivity to specific tells may be a very personal attribute.
3.5 Technical Analysis of the Worcester Experiment
Whilst Savin-Baden et al. [11] expand on much of the analysis above, they do not
provide a technical analysis of the performance of the chatbot which was used in
the experiment.
3.5.1 Cases
Knowledge within an AIML chatbot is defined as a set of pattern-response pairs
called cases. Chatbots entered into the Loebner Prize (an overt Turing Test) can
have tens or even hundreds of thousands of cases in their databases; ALICE has
120,000 [14].
To develop the cases for the Worcester experiment the research team
completed the same on-line PBL scenarios as the students, with the same
facilitators, and over the same time period (3 x 1 hour sessions over a week). From
these sessions and general discussion a mind-map was created to define all the
areas in which the chatbot was likely to require responses.
The cases were then written either using replies developed by the 2-person
authoring team, or manually extracted from the chat logs of the internal session.
As noted above one of the downsides of this was that the chatbot had the linguistic
style and vocabulary of its primary author (white English, male, 50s) and the
internal session participants (mainly white English, mainly male, 20s-50s), rather
than that of the study participants (college age, predominantly female, mix of
races).
Table 1 shows the distribution of cases created by topic areas for Phase 1. No
general knowledge cases were loaded, so the chatbot had only 410 cases available
in Phase 1 - encoded as 748 Discourse patterns, and only 309 cases in Phase 2.
187 cases were common between the two phases. It should be noted that the
flexible structure of a Discourse pattern means that it can achieve in one pattern
what may require multiple cases of pure AIML to achieve, but the equivalence in
reality is unlikely to be more than 1:3 or so.
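
To illustrate why a single flexible pattern can stand in for several pure AIML cases, the sketch below expands a pattern containing a synonym group into the equivalent set of plain patterns. The "{a|b|c}" notation is hypothetical and is not the actual Discourse syntax; it is used only to show how a roughly 1:3 equivalence can arise.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative only: expands a pattern containing hypothetical synonym groups,
// e.g. "DOES CAROLYN HAVE A {DRINK|DRINKING|ALCOHOL} PROBLEM *", into the
// equivalent set of plain AIML-style patterns. Not the real Discourse syntax;
// just a sketch of why one flexible pattern can equal several AIML cases.
class Program
{
    static IEnumerable<string> Expand(string pattern)
    {
        var match = Regex.Match(pattern, @"\{([^}]*)\}");
        if (!match.Success)
        {
            yield return pattern;   // no synonym group left: emit as-is
            yield break;
        }
        foreach (string option in match.Groups[1].Value.Split('|'))
        {
            // Substitute one synonym and recurse in case of further groups.
            string substituted = pattern.Substring(0, match.Index) + option +
                                 pattern.Substring(match.Index + match.Length);
            foreach (string expanded in Expand(substituted))
                yield return expanded;
        }
    }

    static void Main()
    {
        foreach (string p in Expand("DOES CAROLYN HAVE A {DRINK|DRINKING|ALCOHOL} PROBLEM *"))
            Console.WriteLine(p);
        // Prints three plain patterns, i.e. a 1:3 equivalence for this example.
    }
}
```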
Table 1: Distribution of cases by Content/Discussion Topic (15 topics with less than 5 cases omitted; * = common in Phase 1 & 2)

Default responses to generic questions: 50*
Ethical frameworks: 14
General factors: 15
Drinking: 6
Family: 8
Friends: 7
Money: 11
Subject's personality: 6
Police: 5
Rent: 7
Trust: 9
University: 7
Forming bad habits: 6
Scenario needs: 8
Other discussions: 9*
How to solve the problem: 12
Information about the bot's course: 7*
Information about the bot (its legend): 8*
The solution: 9
Default responses to common openings (e.g. open/closed questions): 79*
Who's doing which task: 12
Responses to key generic phrases: 5
Test cases: 5*
Redirects to bot common responses: 11
Time sensitive statements: 11*
Bad peer groups: 5
The usual approach with commercial chatbots is to go through a period of
iterative development and improvement (so-called convologging) both prior to and
after going live. However, in order to have a consistent set of cases for the experiment
we made only some minor improvements to the cases after the first student
group of each Phase, and left the cases alone for the remainder of the Phase.
3.5.2 Case Usage
Given the relatively small number of cases available to the chatbot it is interesting
that an even smaller number were actually used.
In Phase 1, with 410 cases available, only 166 cases were used (40%) during the
total of 3442 exchanges between a participant (student or facilitator) and the
chatbot. Of these exchanges:
- 9% were handled by problem-specific patterns
- 7% were handled by generic “stub patterns” (e.g. “do you think….”)
- 32% were handled by simple defaults (responses to “yes”, “no” etc.)
- 50% were handled by the catch-all wildcard (*) responses, which usually result in the bot trying to restart the discussion around a new topic, or making some non-committal utterance.
In Phase 2, with 309 cases available, only 83 cases were used (27%) during the
937 exchanges. Of these exchanges:
- 19% were handled by problem-specific patterns
- 10% were handled by generic “stub patterns”
- 25% were handled by simple defaults (responses to “yes”, “no” etc.)
- 44% were handled by the catch-all wildcard responses.
Despite the significantly smaller pool of cases used, the general feeling of the
participants was that the Phase 2 chatbot performed better than the Phase 1
chatbot.
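
The usage breakdown above amounts to classifying each exchange by the kind of case that handled it and tallying the proportions. A minimal sketch of such a tally is given below; the log format, case ids and category names are assumptions made for illustration rather than the actual analysis code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the case-usage analysis: each exchange in the log is assumed to be
// annotated with the id of the case that handled it and that case's category
// (problem-specific, generic stub, simple default, or wildcard).
class Exchange
{
    public string CaseId;
    public string Category;
}

class Program
{
    static void Main()
    {
        // Hypothetical annotated log; in the experiment this information came
        // from the session logs (3442 exchanges in Phase 1, 937 in Phase 2).
        var log = new List<Exchange>
        {
            new Exchange { CaseId = "drinking-01",      Category = "problem-specific" },
            new Exchange { CaseId = "default-yes",      Category = "simple default" },
            new Exchange { CaseId = "wildcard",         Category = "wildcard" },
            new Exchange { CaseId = "stub-doyouthink",  Category = "generic stub" },
            new Exchange { CaseId = "wildcard",         Category = "wildcard" },
        };

        // Distinct cases actually used, as a fraction of those available.
        int casesAvailable = 410;   // Phase 1 figure from the paper
        int casesUsed = log.Select(e => e.CaseId).Distinct().Count();
        Console.WriteLine($"Cases used: {casesUsed}/{casesAvailable} " +
                          $"({100.0 * casesUsed / casesAvailable:F0}%)");

        // Percentage of exchanges handled by each category of case.
        foreach (var group in log.GroupBy(e => e.Category))
        {
            Console.WriteLine($"{group.Key}: {100.0 * group.Count() / log.Count:F0}% of exchanges");
        }
    }
}
```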
3.5.3 Technical Failings
Whilst the chatbot suffered no technical glitches during any session there were
elements of poor authoring that rapidly became apparent, during Phase 1 in
particular. In a normal deployment such authoring would be rapidly corrected as
part of the convologging process, but since a consistent case database was needed
for the experiments this was not feasible. The errors included:
- Allowing some default responses to be triggered a large number of times (e.g. “Sorry”) rather than giving more specific or varied responses.
- Handling facilitator responses directed specifically to the chatbot with a narrower range of cases than the student conversation (since it was assumed that a direct comment from the facilitator would be of a different nature, but it wasn't, so it often got handled by the wildcard response).
These errors featured significantly in the participants' comments about poor
chatbot performance, and so an even better performance could be expected had
these errors been fixed on the fly, or had the chatbot system been enhanced to reduce
repetition more automatically.
3.5.4 Technical Conclusions
Whilst accepting that the chatbot was operating within a fairly constrained
environment/dialog space, and that this was a covert Turing Test, the fact that the
chatbot managed to avoid detection using only a “standard” chatbot engine and
only a hundred or so cases suggests that creating an effective chatbot is probably
more a task of finesse than of brute force.
The results suggest that to be successful at escaping detection the bot must:
- Not give itself away through technical faults/authoring errors that generate “computer error” type responses.
- Be able to stay reasonably on topic and make statements that are not too evasive.
- Be able to secure enough “lucky hits”, where what it says sounds so human that any doubts a user has that it may be a bot are (at least temporarily) allayed.
As has been noted this is not enough to give the chatbot legitimacy, but it may
be enough to stop it from being identified. It is interesting to compare this analysis
with that of Shah [15], where the strategy of silence for a chatbot is considered;
certainly, in creating the wildcard responses for the bot (and in the sampling of
exchanges that was taking place), having a silence response rather than risking
an erroneous response was one of the strategies employed.
Our conclusion is therefore that a relatively small number of carefully crafted
responses tuned to the situation may be a more effective strategy for chatbot
creation than creating ever greater databases of responses and more complex
dialogue engines.
4 Revisiting the Original Hypothesis
One of the key research questions presented at the end of Burden [1] was “To
what extent do the Covert/Overt and Singleton/Group options present earlier
opportunities to pass the Turing Test (or equivalent) in a virtual world?”.
Although the Worcester experiment was conducted within an on-line discussion
group rather than in a virtual world, its results, along with those of the Loyola
Marymount test (which was conducted in a virtual world), do allow us to start to
answer this question.
Analysing the results of a conventional Turing Test (Singleton Overt) in the
form of the Loebner Prize [4], there has only been one instance in the 13-year
history of the prize where a chatbot has fooled at least one (and indeed only one)
judge. This was Suzette by Bruce Wilcox in 2012. The chatbot achieved a
deception rate of 25% on the basis that there were 4 judges for that one
experiment and one was deceived (though given that Wilcox has entered 5 more times
since with Angela, Rose and Rosette and failed to fool any judges, it could also be
taken as a 1/(7x4) = 3.6% deception rate across all his Loebner Prize attempts). By
contrast the Loyola Marymount experiment (Singleton Covert) achieved a 78%
deception rate. The Worcester Phase 1 experiment (Group Covert) achieved a
100% deception rate, although the Phase 2 experiment (Group Overt) achieved
only a 6% deception rate (though the reticence of most students to identify the
chatbot during the session, and their comments about its improved quality,
suggest that a high level of doubt may have existed).
The results would therefore appear to bear out the original hypothesis that the
Singleton Covert and Group Covert conditions can indeed provide an earlier
opportunity to pass the Turing Test than the more traditional Singleton Overt case.
Further covert singleton and covert group tests (and indeed overt group tests)
should be conducted to confirm this.
5 Further Work
There are a number of areas where further work would be useful to improve chatbot
performance and to produce a more solid set of results.
In the case of experimental design a significant improvement to the Worcester
experiment would be not to have the facilitator aware of the chatbot's identity. In
the experiment the two principal investigators from Worcester were the facilitators
since they best understood how to facilitate on-line PBL sessions, as well as
understanding what the experiment was trying to achieve. Whilst the facilitator did
not try to hide the identity of the bot, their interactions with it may have been
unintentionally biased.
It would be interesting to compare the performance of a machine-learning
based chatbot with that of the Discourse chatbot. The conversation logs could
provide some training material for such a bot.
Both the experiments were conducted in quite constrained environments (in
time, topic and technology), and with a correspondingly constrained expected
dialogue. Relaxing those constraints whilst still trying to maintain the high
deception rates would provide a useful challenge. The most beneficial constraint
for the chatbot is likely to be that of topic. Technology interfaces can be readily
added, and longer individual sessions are likely to result in all participants getting
stale, so more sessions would perforce need new topics or deeper analysis of the
current ones; broadening the chatbot to cope with more topics is the real
challenge.
Given that the chatbot in the Worcester experiment was successful in the
deception, as well as looking at how to maintain that deception in a less
constrained environment we should also look at how to move beyond deception
and focus on how to establish and maintain the legitimacy of the chatbot within
the group (and indeed in the Singleton test). A good grounding for this would be a
firm understanding of how humans achieve such legitimacy, which could then be
applied to the chatbot.
A chatbot which could achieve legitimacy whilst maintaining the deception,
and operate within a relatively unconstrained environment would be a significant
achievement.
References
1. Burden, D. J.: Deploying embodied AI into virtual worlds. Knowledge-Based
Systems, 22(7), pp. 540-544 (2009).
2. Wakefield, J: Intelligent Machines: Chatting with the bots. BBC Web Site. Available
at http://www.bbc.co.uk/news/technology-33825358. (Last accessed 30 May 2016).
(2015)
3. Turing, A.M.: Computing machinery and intelligence. Mind, 59, 433-460 (1950)
4. Bradeško, L., & Mladenić, D.: A Survey of Chatbot Systems through a Loebner Prize
Competition. Proc. Slovenian Language Technologies Society Eighth Conference of
Language Technologies. pp. 34-37 (2012).
5. Cooper, S.B. & Van Leeuwen, J. (eds.): Alan Turing: His Work and Impact. Elsevier
(2013)
6. Shieber, S. M.: Lessons from a restricted Turing test. arXiv preprint cmp-lg/9404002
(1994)
7. Mauldin, M. L.: Chatterbots, Tinymuds, and the Turing Test: Entering the Loebner
Prize competition. In: AAAI, vol. 94, pp. 16-21 (1994)
8. Heiser, J.F., Colby, K.M., Faught, W.S. and Parkison, R.C., Can psychiatrists
distinguish a computer simulation of paranoia from the real thing?: The limitations of
turing-like tests as measures of the adequacy of simulations. Journal of Psychiatric
Research, 15(3), pp.149-162. (1979)
9. Kurzweil, R.: Why we can be confident of Turing test capability within a quarter
century. The Dartmouth Artificial Intelligence Conference: The Next 50 Years,
Hanover, NH (2006)
10. Gilbert, R. L., & Forney, A.: Can avatars pass the Turing test? Intelligent agent
perception in a 3D virtual environment. International Journal of Human-Computer
Studies, 73, pp. 30-36 (2015).
11. Savin-Baden, M., Bhakta, R., Burden, D.: Cyber Enigmas? Passive detection and
Pedagogical agents: Can students spot the fake? Proc. Networked Learning
Conference (2012).
12. Zelditch, M.: Theories of legitimacy. In: The Psychology of Legitimacy: Emerging
Perspectives on Ideology, Justice, and Intergroup Relations. Cambridge University
Press, pp. 33-53 (2001)
13. Burke, M., Joyce, E., Kim, T., Anand, V. and Kraut, R.: Introductions and requests:
Rhetorical strategies that elicit response in online communities. Communities and
Technologies 2007. Springer London, pp. 21-39 (2007)
14. Wilcox, B. & Wilcox, S.: Suzette, the Most Human Computer. Available from
http://chatscript.sourceforge.net/Documentation/Suzette_The_Most_Human_Computer.pdf
(Last accessed 30 May 2016)
15. Warwick, K., & Shah, H.: Taking the fifth amendment in Turing’s imitation game.
Journal of Experimental & Theoretical Artificial Intelligence. pp. 1-11 (2016).