Updated in March 2019
Transcribear: Introducing a secure online transcription and annotation tool
Yu-Hua Chen1, Radovan Bruncak2
1School of English, University of Nottingham Ningbo China
Corresponding author: Yu-Hua Chen yu-hua.chen@transcribear.com
2Independent Computer Scientist
Abstract
Reliable high-quality transcription and/or annotation (a.k.a. ‘coding’) is essential for research in a
variety of areas in Humanities and Social Sciences which make use of qualitative data such as
interviews, focus groups, classroom observations or any other audio/video recordings. Because
transcription and annotation are notoriously time-consuming and challenging, a good tool can
greatly facilitate the work. However, our survey indicates that few existing tools accommodate
the requirements for transcription and annotation (e.g. audio/video playback, spelling checks,
keyboard shortcuts, adding annotation tags) in one place, so that a user does not need to constantly
switch between multiple windows, for example, an audio player and a text editor. ‘Transcribear’
(https://transcribear.com) was therefore developed as an easy-to-use online tool which supports
transcription and annotation on the same interface. Once loaded, this web tool operates offline, so that a user’s
recordings and transcripts remain secure and confidential. To minimize human errors,
tag validation is also provided. Originally designed for a multimodal corpus project,
UNNC CAWSE, this browser-based application can be customized for individual users’ needs in
terms of the annotation scheme and corresponding shortcut keys. This paper explains how this
new tool can make tedious and repetitive manual work faster and easier and at the same time improve
the quality of outputs, as the process of transcription and annotation is prone to human errors.
The limitations of Transcribear and future work are also discussed.
1. Introduction
Reliable high-quality transcription and/or annotation (a.k.a. coding) is essential for research
in a variety of areas in Humanities and Social Sciences which make use of qualitative data
such as interviews, focus groups, classroom observations or any other audio/video recordings.
With rapid developments in computer technology, much larger datasets of samples are
generally expected for academic research, particularly in the area of Corpus Linguistics,
where spoken corpora often have to be manually transcribed and annotated. According to a
survey we conducted (which will be discussed in the next section), however, few existing
tools can accommodate the requirements for transcription and annotation (e.g. audio/video
playback, shortcut keys, annotation, validation) in one place so that a user does not need to
constantly switch between multiple windows, for example, an audio player and a text editor.
‘Transcribear’ (https://transcribear.com) was therefore developed as an easy-to-use online tool
that supports both transcription and annotation on the same interface. This paper
introduces the functionality of Transcribear as well as the background to the development
of this secure browser-based tool.
Many types of text data often need to be annotated for follow-up analysis, that is, adding
interpretive information to the data by, for example, adding tags (Leech, 2005, p. 19).
Below (1) is an example of annotating an instance of a lexico-grammatical deviation
(typically called ‘errors’ in second language research or learner corpus research) from an L2
multimodal corpus, UNNC Corpus of Academic Written and Spoken Corpus (UNNC
CAWSE) (Chen, Harrison, Oakey, Stevens, Yang, Ioratim-Uba, Zhou & Bruncak, 2018).
This tagset of deviation is composed of an opening tag <dv> and a closing tag </dv> with the
correction indicated in the curly brackets {}.
(1) Maybe it's very bad for the <dv>economic{economy}</dv> to the country.
Typing such tags manually across a large amount of data is prone to errors. For
example, a transcriber may delete part of the tag by accident or misspell the code that
indicates a specific feature. In the case above, the code <dv> indicates a deviation, and if
it is misspelled, this instance will not appear in a query for this specific tag.
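The effect of a misspelled code on later queries can be illustrated with a minimal Python sketch. It is not part of Transcribear or of any corpus tool mentioned here; the pattern, the function name find_deviations and the misspelled code <dc> are assumptions made purely for illustration.

```python
import re

# Assumed pattern for a well-formed deviation tag: <dv>form{correction}</dv>
DEVIATION = re.compile(r"<dv>([^<{]+)\{([^}]*)\}</dv>")

def find_deviations(line):
    """Return (observed form, correction) pairs for every well-formed <dv> tag."""
    return DEVIATION.findall(line)

correct = "Maybe it's very bad for the <dv>economic{economy}</dv> to the country."
# Hypothetical typing error: the code is keyed in as <dc> instead of <dv>
misspelled = "Maybe it's very bad for the <dc>economic{economy}</dc> to the country."

print(find_deviations(correct))     # [('economic', 'economy')]
print(find_deviations(misspelled))  # []
```

The second query returns nothing, so the instance would silently drop out of any count of deviations unless the tag is validated at the time of entry.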
In the context of spoken data such as interviews or conversations, the audio or video
recordings will also need to be transcribed. Although there are existing annotation tools
where a user can select a tag for annotation from a menu, they do not always have the
functionality for transcribing speech data, e.g. an audio/video player with a text editor, and
the tasks of annotation and transcription, therefore, are often carried out separately. This
means that a tool may be used for playing audio/video files and transcribing speech at the
first stage, and then data are annotated independently on the transcripts with or without
another tool at the second stage. In our experience of building the L2 corpus, UNNC CAWSE
(Chen, Harrison, Oakey, Ioratim-Uba, Stevens & Yang, under review), however, we found
that it is more efficient to transcribe and annotate data simultaneously rather than treating
them as two independent tasks. This is because L2 speech is often characterized by a large
number of instances of codeswitching, hesitation (indicated by pauses), self-correction
(indicated by truncation and/or false starts), unintelligible utterances, and deviations of the kind
exemplified earlier. Those features occur so frequently that transcribers often have to play
the audio/video recordings back and forth multiple times in order to transcribe them as
faithfully as possible, and it is therefore sensible to annotate those features
while transcribing. See (2) and (3) below for the tagging of such
examples of L2 speech in the UNNC CAWSE corpus (CAWSE hereafter), where
codeswitching is indicated by the tagset of <cs></cs> with English translation in the curly
brackets {}, unintelligible speech by <ut></ut> and timed pauses (in seconds) by parentheses
() such as (1.4) indicating a pause of 1.4 seconds.
(2) or maybe on some fa- some face <cs n="zh">表情怎么说啊{how to say ‘facial expression’ in English}</cs>
(3) yeah and: it's a very convenient and very hh erm modern modern school small
<ut>x</ut> and every <ut>x</ut> is very: hh (1.4) is very good and erm the: (1.3) people
here is very friendly and: they are all they are all: very kind
Note: For the detailed transcription conventions used for CAWSE, please see the project website
https://www.nottingham.edu.cn/en/english/research/cawse/transcription.aspx.
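Because the notation in (2) and (3) is linear plain text, such features can be counted with ordinary pattern matching. The following Python sketch is an illustration only, assuming the conventions shown above (timed pauses in parentheses, <ut> for unintelligible speech, <cs> for codeswitching); the function name summarize and the patterns are hypothetical and do not reproduce any CAWSE script.

```python
import re

PAUSE = re.compile(r"\((\d+\.\d+)\)")                        # timed pause, e.g. (1.4)
UNINTELLIGIBLE = re.compile(r"<ut>.*?</ut>")                 # unintelligible speech
CODESWITCH = re.compile(r"<cs\b[^>]*>.*?</cs>", re.DOTALL)   # codeswitching

def summarize(transcript):
    """Count the annotated features in a linear transcript."""
    pauses = [float(seconds) for seconds in PAUSE.findall(transcript)]
    return {
        "pauses": len(pauses),
        "pause_seconds": round(sum(pauses), 1),
        "unintelligible": len(UNINTELLIGIBLE.findall(transcript)),
        "codeswitches": len(CODESWITCH.findall(transcript)),
    }

example_3 = (
    "yeah and: it's a very convenient and very hh erm modern modern school small "
    "<ut>x</ut> and every <ut>x</ut> is very: hh (1.4) is very good and erm the: (1.3) "
    "people here is very friendly and: they are all they are all: very kind"
)
print(summarize(example_3))
```

For example (3) this reports two timed pauses, two unintelligible stretches and no codeswitching.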
The CAWSE corpus is designed so that users can access its plain text files with a
corpus tool such as WordSmith Tools (Scott, 2008) or AntConc (Anthony, 2018). This requires a
linear transcription and annotation system, instead of the multi-tier transcription that some
tools offer (which will be discussed later). This type of notation system is closer to the
transcription tradition of Conversation Analysis (e.g. see Jefferson, 2004; Swann, 2010), and
such transcription data can be converted to XML format at a later stage, which is the same
approach adopted by the new Spoken BNC2014 corpus (Love, Dembry, Hardie, Brezina, &
McEnery, 2017). Transcription encoded in XML defined by TEI Guidelines (see
http://www.tei-c.org/Guidelines/) is widely recognized in the field of Corpus Linguistics for
data exchangeability and computer readability. Figure 1 below is an example of XML
markup from the BASE corpus (Nesi & Thompson, 2000-2005), where similar instances of
timed pauses and overlaps are annotated as well as truncated utterances and speaker turns. As
can be seen, it follows a similar linear yet far more complex system and, as a result, is
hard for humans to read. For our project, XML was therefore not considered for
manual annotation because it is too cumbersome for direct data entry (Love et al.,
2017, p. 338).
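A later conversion of the linear notation to XML could proceed roughly as in the Python sketch below. This is a minimal illustration under stated assumptions: the element and attribute names (u, pause, dur, gloss) are chosen for the example and are not taken from the CAWSE project, the Spoken BNC2014 scheme or a validated TEI schema, and the function linear_to_xml is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Timed pause in the linear notation, e.g. (1.4)
PAUSE = re.compile(r"\((\d+(?:\.\d+)?)\)")
# Tag whose content ends with a braced gloss, e.g. <dv>economic{economy}</dv>
GLOSSED = re.compile(r"<(dv|cs)([^>]*)>([^<{]*)\{([^}]*)\}</\1>")

def _gloss_to_attribute(match):
    """Move the braced gloss into an (assumed) gloss attribute."""
    tag, attrs, text, gloss = match.groups()
    gloss = gloss.replace('"', "&quot;")
    return f'<{tag}{attrs} gloss="{gloss}">{text}</{tag}>'

def linear_to_xml(line):
    """Convert one line of linear transcription into a well-formed <u> element."""
    line = line.replace("&", "&amp;")               # escape bare ampersands first
    line = GLOSSED.sub(_gloss_to_attribute, line)   # braced gloss becomes an attribute
    line = PAUSE.sub(r'<pause dur="\1"/>', line)    # (1.4) becomes an empty element
    return f"<u>{line}</u>"

xml_line = linear_to_xml("the <dv>economic{economy}</dv> is very: hh (1.4) is very good")
root = ET.fromstring(xml_line)  # raises ParseError if the result is not well-formed
print(xml_line)
```

Parsing the result with a standard XML parser, as in the last lines, gives a basic well-formedness check; it does not check the output against any schema.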
We therefore conducted a survey of existing tools to identify whether there was any integrated
tool that could accommodate the requirements of both transcription and annotation in a linear
system, but unfortunately did not find any that fit the needs of our project. This survey,
however, provided us with essential information to define the specifications when we later
decided to develop the online transcription and annotation tool ‘Transcribear’. Originally
designed for our L2 corpus project, this browser-based application now supports
customization: users can change the annotation scheme and the
corresponding shortcut keys for their own projects. With a user-friendly interface and built-in
validation and spelling checks, Transcribear can also be used for Conversation Analysis or
similar projects that require transcription and/or annotation. In the next section, we will
summarize the results of the survey where existing similar tools were evaluated and
compared in terms of available functionality. Then more detail will be provided regarding
how the survey informed the design of Transcribear. The limitations of Transcribear and
future work will also be discussed.
Fig. 1 An example of XML markup from BASE
2. Survey of existing tools
In total twelve computer programs were chosen for the survey, and two types of tools were
distinguished: one for transcription, particularly tools featuring speech
recognition, and the other for annotation, which refers to the addition of interpretive
information to an orthographic transcription by inserting defined tags. Some of the annotation
tools, however, also include basic functions for transcription such as audio/video playback
with a text editor. This survey provides an overview of available features, and the evaluation
criteria used for the survey were developed from our pilot work of transcribing and
annotating recording data in CAWSE. Those criteria include features such as audio/video
playback, shortcut keys, speech recognition, spelling checks, data confidentiality, tag
insertion and tag customization. Because a team of student interns, including both undergraduate
and postgraduate students, was recruited to transcribe data for CAWSE in addition to a full-time
assistant, our primary aim was to find an easy-to-use tool which would not require
extensive training or impose additional costs while still supporting good-quality work.
The details of the evaluation for the first type of tools (i.e. used for transcription) can be found
in Table 1 and for the second type (used for annotation) in Table 2. Among the six
transcription tools that we surveyed (Table 1), four offer
speech-to-text or dictation (VoxSigma, Dragon, Transcribe and Express Scribe).
Most of the six tools are commercial and charged a fee at the time of writing;
the exceptions are the free-of-charge app oTranscribe and Express Scribe (which offers a free trial of
basic functions). Two members of our team, one a native speaker of American English and the
other a fluent non-native speaker, then experimented with the dictation function available in
some of the tools, i.e. by reading aloud. Although the applications with a speech-recognition
engine generally seemed to perform better with the native speaker, our conclusion
was that the engines often responded slowly and inaccurately, and it was therefore too
time-consuming and tiring to repeatedly dictate and revise the machine-generated script. The
automatic speech-to-text applications were also trialed, that is, we uploaded an audio file to a
web application and a script was then generated. On average the accuracy rate reached
approximately 30%, which was considered too low, because it would still require a large
amount of effort to edit those scripts to acceptable standards. The poor results are probably related to the
fact that our L2 data contains recordings of interviews or conversations involving
multiple speakers and that many L2 speakers in the data are not very fluent.
We therefore decided to transcribe and annotate the audio/video
recordings manually rather than using speech-recognition tools.
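The measure behind the accuracy figure is not specified above. One common way to quantify the accuracy of a machine-generated script, shown here only as an assumed illustration and not as the procedure used in the survey, is word accuracy: one minus the word-level edit distance between the script and a manual reference transcript, divided by the length of the reference. The function name and the sample sentences in this Python sketch are hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - (word-level edit distance / reference length), floored at 0."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Standard dynamic-programming edit distance, computed over words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return max(0.0, 1 - previous[-1] / len(ref))

# Invented machine output for a short stretch of speech
manual = "the people here is very friendly"
machine = "the people hear very friend"
print(word_accuracy(manual, machine))  # 0.5
```

In this invented case two substitutions and one deletion against a six-word reference give a word accuracy of 0.5; every one of those errors would still need to be located and corrected by hand.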
The tools described above, however, have a number of advantages identified during the
survey. For example, an interface that integrates a text editor with an audio/video player is
user-friendly, and some of the tools offer spelling checks and
keep data confidential: although they run in
a browser, no audio/video files are transmitted to a server. Some tools such as Transcribe also provide keyboard shortcuts for
audio/video playback so that the mouse is not needed, which we consider a useful design for
improving efficiency since the keyboard can be used for the entire transcription process.
Because those tools appear to have been designed for professionals such as journalists or
lawyers rather than linguists, transcription is the primary function, and none of them offers any
annotation facility.
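Table 1 Transcription tools reviewed in the survey

VoxSigma
  Online or desktop: Desktop and online
  Fee (at the time of writing): For generic systems and large quantities, the price on the online order is 0.01 Euro (or USD$0.01) per minute. More detailed pricing needs to be discussed with VoxSigma.
  Audio/video playback and shortcut keys: Automatic transcription linked to the server
  Confidentiality and privacy: Not indicated
  Spelling checks: Not mentioned as presumably not needed. Supporting automatic speech to text.
  Speech recognition: Yes
  Website: http://www.vocapia.com/

Dragon
  Online or desktop: Desktop
  Fee (at the time of writing): USD$74-$500
  Audio/video playback and shortcut keys: Speech recognition (speech to text)
  Confidentiality and privacy: Not indicated
  Spelling checks: Not mentioned as presumably not needed. Supporting automatic speech to text.
  Speech recognition: Yes
  Website: http://www.nuance.com/dragon/index.htm

oTranscribe
  Online or desktop: Online
  Fee (at the time of writing): Free
  Audio/video playback: Yes
  Shortcut keys: Yes
  Confidentiality and privacy: Yes. Audio files and transcripts stay on the user’s computer.
  Spelling checks: Yes
  Speech recognition: No
  Website: http://otranscribe.com/

Transcribe
  Online or desktop: Online
  Fee (at the time of writing): USD$20/year for the integrated engine with a player, an editor and dictation. USD$6 for a 60-minute auto transcription.
  Audio/video playback: Yes
  Shortcut keys: Yes
  Confidentiality and privacy: Yes. Audio files and transcripts stay on the user’s computer.
  Spelling checks: Yes, but it seems to work only for a certain length of the transcript.
  Speech recognition: Yes
  Website: https://transcribe.wreally.com/

InqScribe
  Online or desktop: Desktop
  Fee (at the time of writing): USD$99 full; USD$69 education/nonprofit; USD$39 student
  Audio/video playback: Yes
  Shortcut keys: Yes
  Confidentiality and privacy: Not indicated
  Spelling checks: No
  Speech recognition: No
  Website: https://www.inqscribe.com/

Express Scribe
  Online or desktop: Desktop
  Fee (at the time of writing): USD$25-$159. Discount available.
  Audio/video playback: Yes
  Shortcut keys: Yes
  Confidentiality and privacy: Not indicated
  Spelling checks: No
  Speech recognition: Yes. Speech to text requires a SAPI speech-to-text engine to be installed on the user’s computer.
  Website: http://www.nch.com.au/scribe/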
Table 2 Annotation tools reviewed in the survey

1. VoiceScribe
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: Yes
  Spelling checks: No
  Shortcut keys: Yes. A list for shortcuts available.
  Validation: Yes. Yellow color indicates recognized tags.
  Supporting XML: Yes
  Tag customization: No
  Website: https://sourceforge.net/projects/voicescribe/

2. EXMARaLDA Partitur-Editor 1.6
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: Can play each segment of the recording
  Spelling checks: No
  Shortcut keys: Yes. Shortcuts for audio play, segments, etc.
  Validation: Yes. An error list is available for structure errors, annotation mismatch, etc.
  Supporting XML: Yes. Some TEI-compliant symbols (<dur=1>, {codeswitch}, etc.) can be selected.
  Tag customization: Yes. Preferred symbols can be selected and used for annotation.
  Website: http://exmaralda.org/en/2017/04/27/new-official-version/

3. FOLKER 1.2
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: Can play each segment of the recording
  Spelling checks: No
  Shortcut keys: Yes. Shortcuts for media play, segment selection, segment view, etc.
  Validation: Yes. A red cross indicates incorrect syntax, etc.
  Supporting XML: Yes
  Tag customization: Yes
  Website: http://agd.ids-mannheim.de/folker_en.shtml

4. ELAN 4.7.3
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: Can play each segment of the recording
  Spelling checks: No, but Version 5 seems to include this function.
  Shortcut keys: Yes. Shortcuts for management of files, selection of segment, annotation, etc.
  Validation: N/A
  Supporting XML: Yes
  Tag customization: Yes. Available under Controlled Vocabularies, but it is saved in a different tier from speech transcription.
  Website: https://tla.mpi.nl/tools/tla-tools/elan/

5. Transcriber AG 2.0.0-b1
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: Yes
  Spelling checks: Yes
  Shortcut keys: Yes. Shortcuts for audio play, selection, annotation, etc.
  Validation: N/A
  Supporting XML: No
  Tag customization: Yes. Available under Configuration.
  Website: http://transag.sourceforge.net/index.php?content=presentation

6. UAM CorpusTool version 3.3
  Online or desktop: Desktop
  Fee: Free
  Audio/video playback: No (as it is not designed for transcription)
  Spelling checks: Not needed as pre-defined tags are provided which can be selected from a menu.
  Shortcut keys: No shortcut keys, but different functions can be selected from lists.
  Validation: N/A
  Supporting XML: Yes. All annotation is stored in XML.
  Tag customization: Yes. An easy-to-use interface is available to create and modify coding schemes.
  Website: http://www.corpustool.com/download.html
The second type of tools is primarily designed for annotation,
e.g. adding tags to a transcript, although basic functions of transcription such as audio/video
playback are included in some of the tools such as VoiceScribe (see Table 2). All those tools
are desktop apps that users must install on individual devices. Among the six annotation
tools that we reviewed, two were designed exclusively for a specific corpus: VoiceScribe for
the VOICE corpus (a corpus of spoken ELF) (VOICE, 2013) and FOLKER for the FOLK
corpus (a corpus of spoken German), although FOLKER is a simplified version adapted from
EXMARaLDA (Schmidt, 2016, p. 407). Note that those tools are designed for different
purposes, and our evaluation results do not imply any flaws in their design. For
example, UAM CorpusTool is intended as an annotation suite instead of a transcription tool,
hence the absence of an audio/video player and spelling checks.
One fundamental difference between the transcription tools described earlier (Table 1) and
the annotation tools (Table 2) is that the latter often allow annotation notation to be added
from a pre-defined file of conventions while the former require such
notations to be typed manually each time. Because the CAWSE corpus has its own unique transcription
conventions, it was important for us to find out whether any of those annotation tools allow
users to define their own markup systems for annotation rather than having to adopt existing
conventions built into the tools. While some tools such as ELAN do provide this option
(called ‘Controlled Vocabularies’, see Tacchetti, 2017), others require certain IT skills to
rewrite the code of the tools (e.g. VoiceScribe or TranscriberAG). It was also found that
those annotation tools generally do not seem to support spelling checks (TranscriberAG being a
probable exception), which are important for the accuracy and reliability of transcription
work. Again, this is most likely because many of the tools were designed for
adding annotation from a menu of pre-established schemes, so that spelling checks may not have
seemed necessary. One of the major issues in
considering those tools for the CAWSE project, however, is that the transcription and
annotation system developed for the corpus is linear, as discussed earlier, in the form of plain
text files which can be searched using existing corpus tools. Some of the annotation tools
adopt a multi-tier annotation hierarchy (such as ELAN or TranscriberAG; see Figure
2) and also require audio/video data to be segmented, which appears rather complex for
our purposes. Another issue is that VoiceScribe and FOLKER only support audio data in wav
files while some of the data in CAWSE are currently saved in mp3 format. While it is
possible to convert the data format, this certainly adds more complexity to customizing an
existing tool. Those annotation tools often have a variety of functions available such as XML
support, and probably because of this, our perception is that they are more suitable for
tech-savvy users or experienced researchers. The introduction of those established tools would
therefore require extensive training, and yet many of our transcribers and annotators were
student interns who worked on the project for no more than one year.
Fig. 2 A multi-tier annotation system from ELAN
The above survey indicates that some of the transcription tools are easy to use but do not
provide certain facilities such as adding annotation, whereas most of the annotation tools are
powerful but require a significant amount of training and experience for users to master them.
The results of the survey reported here informed the design of the new transcription and
annotation tool ‘Transcribear’, which will be described in the next section. We also
acknowledge that there may be many other relevant computer programs available, and the
scope of the survey here may be somewhat limited. Those
additional programs, however, tend to be designed for specific purposes, such as Praat (for phonetic
annotation) or NVivo (for qualitative textual and audiovisual data), and they therefore do not
fit our purposes.
3. Developing the new tool ‘Transcribear’
After experimenting with the tools reported in the previous section, it became clear that we
needed to develop our own software as no existing tools could cater for our needs. Yet it has
to be acknowledged that the survey provides essential information about possible utilities of a
transcription and annotation tool required for a corpus-building project like ours.
We opted for an online tool rather than a stand-alone desktop one because a
browser-based tool can be used without administrator rights, which are often required to
install software on institutional computers such as those at universities. A web tool also allows constant updates
to improve the functionality without the users having to reinstall the software repeatedly.
Another advantage of a web-browser application is that it can be used across different
operating systems such as Mac OS, Linux or Windows, and the development and
maintenance costs are therefore kept lower. On the other hand, being an online system
does not have to compromise data confidentiality. For example, researchers may need to
transcribe confidential data which is not supposed to be shared with third parties. We
therefore took privacy and confidentiality into consideration in the design of the online
application by choosing to use the programming language JavaScript. This means that when a
user visits the website to access Transcribear, a JavaScript application is downloaded into the
user’s web browser which provides the necessary functionality for the user to transcribe the
audio or video file, to insert tags into a transcript, or to have the transcript checked or
validated in real time. Once loaded, the online tool therefore works offline, which
means the application does not require the local computer to send any transcripts or audio/video
files to the server to facilitate transcription or annotation. The whole process is thus private,
and the tool can be used to transcribe or annotate confidential data.
It is also essential for the tool to have a built-in validation function, i.e. automatically
checking the tags against pre-defined tagsets. During the earlier stages of our project,
when data transcription was carried out for several months without a dedicated tool,
transcription errors often involved the incorrect use of tags, e.g. malformed tags. For
example, any component of a tagset such as <ol></ol> (indicating
‘overlap’) might be accidentally deleted or misspelled by the human transcribers; such
illegitimate tags are now flagged by Transcribear. In addition, spelling checks,
which are important for transcription, are also included in the tool. The
addition of the above functionality is in line with the principle of ‘validation’, emphasized
multiple times across a number of chapters in the edited volume ‘Developing Linguistic
Corpora: A Guide to Good Practice’ (Wynne, 2005), as accuracy and consistency are
important criteria for evaluating the quality of transcription and annotation in any research
project.
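The kind of check described above can be sketched in a few lines of Python. This is an illustration of the general technique only (matching opening and closing tags against a pre-defined tagset with a stack), not the JavaScript implementation used in Transcribear; the function name validate, the tagset contents and the message wording are assumptions.

```python
import re

# Matches <name ...> and </name>; stray fragments such as "ol>" are not matched
TAG = re.compile(r"<(/?)([A-Za-z]+)([^<>]*)>")

def validate(line, tagset):
    """Return (position, message) problems; an empty list means the line passes."""
    problems = []
    open_tags = []  # opening tags still waiting for their closing tag
    for match in TAG.finditer(line):
        slash, name = match.group(1), match.group(2)
        if name not in tagset:
            problems.append((match.start(), f"unknown tag <{slash}{name}>"))
        elif not slash:
            open_tags.append((name, match.start()))
        elif open_tags and open_tags[-1][0] == name:
            open_tags.pop()
        else:
            problems.append((match.start(), f"closing tag </{name}> has no matching opening tag"))
    problems.extend((pos, f"opening tag <{name}> is never closed") for name, pos in open_tags)
    return sorted(problems)

TAGSET = {"dv", "cs", "ut", "ol"}  # assumed subset of a project's tagset
print(validate("we <ol>can go</ol> now", TAGSET))  # []
print(validate("we <ol>can go<ol> now", TAGSET))   # both <ol> tags reported as never closed
print(validate("we ol>can go</ol> now", TAGSET))   # </ol> reported as unmatched
```

A sketch like this flags misspelled codes, a missing ‘/’ and a partly deleted opening tag; it would not notice a lone ‘<’ with nothing after it, which a production validator would also need to handle.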
Based on the specifications discussed above, the online tool Transcribear offers
the following features:
- A text editor integrated with an audio/video player which supports a variety of formats
  including mp3, mp4, wav and ogg;
- Shortcut keys for audio/video play, pause, slow, fast, fast-forward/backward and
  timestamp, as well as for frequently used tags;
- An offline mode to ensure data confidentiality;
- Customizable annotation tags and corresponding shortcut keys for faster typing;
- A validation function to automatically identify malformed tags which do not
  conform to the pre-defined format.
A screenshot of Transcribear is presented in Figure 3. Figure 4 shows an example of validation
in which the symbol ‘/’ is missing from the closing tag </ol>, and the illegitimate
tagset is highlighted in red by the engine.
Fig. 3 The interface of Transcribear
Fig. 4 An example of validation where an illegitimate tagset is highlighted
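To make the idea of customizable tags and shortcut keys more concrete, the Python sketch below models a user-defined annotation scheme and the insertion of a tag pair at the cursor. Everything in it is hypothetical: the class TagShortcut, the key combinations and the functions check_scheme and insert_tag are invented for illustration and do not describe Transcribear's actual settings format or code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagShortcut:
    key: str   # hypothetical key combination, e.g. "Alt+D"
    name: str  # tag name without angle brackets, e.g. "dv"

def check_scheme(scheme):
    """Raise ValueError if a shortcut key or a tag name is defined twice."""
    keys, names = set(), set()
    for entry in scheme:
        if entry.key in keys:
            raise ValueError(f"shortcut {entry.key} is assigned more than once")
        if entry.name in names:
            raise ValueError(f"tag {entry.name} is defined more than once")
        keys.add(entry.key)
        names.add(entry.name)

def insert_tag(text, cursor, name):
    """Insert <name></name> at the cursor; return the new text and a cursor between the tags."""
    opening, closing = f"<{name}>", f"</{name}>"
    return text[:cursor] + opening + closing + text[cursor:], cursor + len(opening)

# Hypothetical scheme for a project that only needs two tags
scheme = [TagShortcut("Alt+D", "dv"), TagShortcut("Alt+U", "ut")]
check_scheme(scheme)
print(insert_tag("it is ", 6, "ut"))  # ('it is <ut></ut>', 10)
```

Inserting both halves of a tagset in one keystroke, with the cursor left between them, is one way a shortcut can remove the opportunity to mistype or omit a closing tag.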
4. Conclusion
A good tool can facilitate the work of transcription and annotation, as it is notoriously
time-consuming and challenging to manually transcribe or annotate data. After the Transcribear
tool was introduced to the CAWSE project, it was used over the course of nine months by
more than ten assistants/interns to transcribe or validate thousands of scripts. On the basis
of their feedback, it was estimated that the introduction of this tool saved approximately
15-20% of working time as a result of its shortcut keys and built-in validation and
spelling checks. One of the team members also used the tool to double check the quality of
transcripts produced before Transcribear was introduced, and with the validation function, a large number
of typing errors were identified and corrected. We concluded that Transcribear has
considerably enhanced the productivity of the team as well as the quality of transcription and
annotation outputs. The customizable settings of Transcribear also make it possible for any
other projects which require transcription and/or annotation to take advantage of this tool.
Audio/video recordings are used in many disciplines in Humanities and Social Sciences,
and qualitative research often requires data transcription and/or annotation, although the latter
may be termed differently, for example, ‘coding’ (e.g. Charmaz, 2006; Strauss & Corbin,
1998) instead of ‘annotation’. Even for research which does not require
transcription, users can still load their text into Transcribear to add their own
annotation. For instance, any existing electronic text can be copied and pasted into the online
interface, and a researcher can annotate target features in the text systematically on the basis
of a framework defined in their own project (e.g. marketing strategies in a business study,
types of feedback in educational research, or collocation errors in a second language writing project).
The Transcribear tool can therefore be used in a much wider range of contexts than just
Corpus Linguistics, and based on our experience, the introduction of this new tool can make
the tedious and laborious task of transcription and/or annotation easier and faster.
In terms of limitations of the current design, first of all, Transcribear does not support a
multi-tier structure, which means all transcription and annotation tags are aligned on the same
tier. This could be problematic for multimodal analysis because multiple tiers are essential to
visualize the temporal coordination across modes. In addition, although this tool supports a
range of media file types, more file types such as flv may be considered if such needs arise.
For future development, speech recognition, which could speed up
transcription, may be introduced when this technology is mature enough for naturally occurring
language data. The integration of XML format may also be considered at a later stage.
Currently Transcribear is freely available with CAWSE-specific transcription and
annotation conventions as the default settings. The development of this tool is a collaboration
between a linguist (the CAWSE project director, who designs the specifications of the tool)
and a computer scientist (the developer, who implements the design). Most similar
annotation tools appear to have been developed by academics and are often freely available
(as can be seen in Table 2), presumably funded by their institutions or projects. The computer
scientist who develops this tool, however, is not an academic and has been working on it
voluntarily. The development and maintenance of this web tool involves recurring costs
for a domain name, server rental, a server certificate for secure communications between a
user’s web browser and the server via the HTTPS protocol, constant updates and bug fixes,
among others, let alone the developer’s assiduous (and unpaid) work for at least several
months to get a beta version running. To make this tool sustainable and to constantly improve
user experience as well as enhance the functionality, a small subscription fee may be
considered in the future. Free trials, however, will be available for those who wish to
experiment with the tool or those who may just need to use the tool for a shorter period of
time.
References
Anthony, L. (2018). AntConc (Version 3.5.2) [computer software]. Tokyo, Japan: Waseda University. Retrieved from http://www.laurenceanthony.net/software
Charmaz, K. (2006). Constructing grounded theory: a practical guide through qualitative analysis. London: SAGE.
Chen, Y. H., Harrison, S., Oakey, D., Stevens, M. P., Yang, S., Ioratim-Uba, G., Zhou, Q. & Bruncak, R. (2018). UNNC Corpus of Academic Written and Spoken Corpus (UNNC CAWSE) Version 1.0. Ningbo, China: University of Nottingham Ningbo China.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation Analysis: Studies from the First Generation (pp. 13-31). Amsterdam: John Benjamins.
Leech, G. (2005). Adding Linguistic Annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 17-29). Oxford: Oxbow Books.
Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014. International Journal of Corpus Linguistics, 22(3), 319-344. doi: 10.1075/ijcl.22.3.02lov
Nesi, H., & Thompson, P. (2000-2005). British Academic Spoken English Corpus (BASE). Retrieved from: http://www.coventry.ac.uk/research/research-directories/current-projects/2015/british-academic-spoken-english-corpus-base/
Schmidt, T. (2016). Good practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German. International Journal of Corpus Linguistics, 21(3), 396-418. doi: 10.1075/ijcl.21.3.05sch
Scott, M. (2008). WordSmith Tools 5.0 [computer software]. Liverpool: Lexical Analysis Software.
Strauss, A. L., & Corbin, J. (1998). Basics of qualitative research: techniques and procedures for developing grounded theory. Thousand Oaks: SAGE.
Swann, J. (2010). Transcribing spoken interaction. In S. Hunston & D. Oakey (Eds.), Introducing Applied Linguistics: Concepts and Skills (pp. 163-176). New York: Routledge.
Tacchetti, M. (2017). User's Guide for ELAN Linguistic Annotator. http://tla.mpi.nl/tools/tla-tools/elan/
VOICE. (2013). The Vienna-Oxford International Corpus of English (version 2.0 XML). Director: Barbara Seidlhofer; researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael Radeka. http://www.univie.ac.at/voice/
Wynne, M. (Ed.). (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books.