All content in this area was uploaded by Daniel W. Hieber on May 31, 2019
What is Digital Linguistics (DLx)?
Digital Linguistics (DLx) is the science of digital data
management for linguistics, including the digital storage,
representation, manipulation, and dissemination of
linguistic data. It concerns itself with how to represent
linguistic data in digital form and with best practices for
working with that data, while remaining attentive to
established practice and ethical concerns in language
documentation, sociocultural linguistics, and language
revitalization.
Data Management
Types of things called "data" in linguistics:
audiovisual media
(time-aligned) annotations
metadata
lexical databases
corpora
publications containing any of the above
Metadata
Data that describes another set of data.
location(s)
date(s)
speakers / researchers
sociocultural context
documentary context
folder/repository structure
file formats / naming conventions
terminology / glossary / abbreviations
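As a minimal sketch, metadata of this kind could be kept in a plain-text "sidecar" file stored next to each recording. The field names and values below are hypothetical placeholders chosen to mirror the list above; they are not a DLx standard.

```python
import json

# Hypothetical metadata record for one recording session.
# Every field name and value here is an illustrative placeholder.
session_metadata = {
    "locations": ["PLACEHOLDER location"],
    "dates": ["2000-01-01"],
    "speakers": ["Speaker A"],
    "researchers": ["Researcher B"],
    "sociocultural_context": "PLACEHOLDER: who was present, what occasion",
    "documentary_context": "PLACEHOLDER: elicitation, narrative, conversation",
    "file_naming_convention": "YYYY-MM-DD_speaker_topic.wav",
    "abbreviations": {"1SG": "first person singular"},
}


def to_sidecar_text(metadata: dict) -> str:
    """Serialize metadata as indented, human-readable JSON text."""
    return json.dumps(metadata, indent=2, ensure_ascii=False, sort_keys=True)


sidecar_text = to_sidecar_text(session_metadata)
print(sidecar_text)
```

Because the result is plain text, it can be opened in any editor and tracked alongside the data it describes.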
Data Lifecycle
1. data entry
2. data cleaning
3. data editing
4. data use
Data Workflow
1. recording
2. metadata
3. (time-aligned) annotation
4. presentation
Back up and/or archive at every stage and at every version
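One way to make a backup at each stage routine is a small script that copies a file into a per-stage folder and verifies the copy. The following is a sketch only: the function names, the folder layout, and the stage label "1-recording" are assumptions for illustration, and the demo writes placeholder bytes to a temporary directory rather than touching real recordings.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path, stage: str) -> Path:
    """Copy source into backup_dir/stage/ and verify the copy by checksum."""
    target_dir = backup_dir / stage
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Demo in a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    recording = Path(tmp) / "session.wav"
    recording.write_bytes(b"placeholder audio bytes")
    copy = back_up(recording, Path(tmp) / "backup", stage="1-recording")
    copy_matches = sha256_of(copy) == sha256_of(recording)
    print(copy.name, copy_matches)
```

A local copy like this does not replace depositing materials with a language archive; it only guards against losing work between stages.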
Primary ("Raw") Data
audiovisual recordings
images / scans
Data are stored in "binary" files (i.e. non-text files)
Require specialized software to read
Not human-readable
Images: .jpg, .jpeg, .png, .svg
Scans / Documents: .pdf, .docx
Audio: .wav, .mp3, .wma
Video: .mpeg, .avi, .mov, .mp4
Databases: .xlsx, .accdb, .fmp
JPEG file
JPEG file (as text)
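The contrast between a binary file and a text file can be sketched in a few lines. The example below does not read a real image: it builds only the first few bytes that a JPEG/JFIF file typically begins with (the start-of-image marker FF D8, followed by an application segment), and the helper name is_valid_utf8 is an assumption for illustration.

```python
# The opening bytes of a typical JPEG/JFIF file (a fragment, not a valid image).
jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]) + b"JFIF\x00"

# A small piece of structured text, encoded as UTF-8 bytes.
text_data = '{"form": "hello"}'.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(text_data))   # the text decodes cleanly
print(is_valid_utf8(jpeg_start))  # the JPEG bytes do not
print(jpeg_start.hex(" "))        # what specialized software actually sees
print(repr(jpeg_start.decode("latin-1")))  # what a text editor would show
```

Opening binary data in a text editor forces each byte to be shown as a character, which is why the result is unreadable.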
Structured Data (Text)
Markup
Non-Proprietary
.txt (Text)
.md (Markdown)
.json (JavaScript Object Notation / JSON)
.sql (Structured Query Language / SQL)
.yml (YAML)
.xml (Extensible Markup Language / XML)
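To illustrate what these text formats have in common, the sketch below writes one lexical entry as both JSON and XML using only the Python standard library. The entry and its field names (lemma, gloss, part_of_speech) are hypothetical and do not follow any particular schema.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical lexical entry; field names and values are illustrative only.
entry = {
    "lemma": "example",
    "gloss": "a sample entry",
    "part_of_speech": "noun",
}

# The entry as JSON text.
as_json = json.dumps(entry, ensure_ascii=False, indent=2)

# The same entry as XML text, one child element per field.
root = ET.Element("entry")
for field, value in entry.items():
    ET.SubElement(root, field).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Both outputs are human-readable, can be opened in any text editor, and can be parsed back into the same data without specialized software.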
Structured Data (Text)
Proprietary
EAF (ELAN)
FlexText (FLEx)
SFM (Toolbox)
TextGrid (Praat)
Saymore
Data Workflow + Tools
Problems
operating system-specific
task-specific
variety of formats
access / licensing
do not synchronize (easily)
few backup / archiving solutions
not easily citeable / shareable
Data Workflow + Tools
Recommendations
version control
single source of truth
document your workflow (for yourself as much as others)
document your formats / fields
avoid manual transformations / processes
write scripts (document their inputs and outputs carefully)
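As a sketch of what a scripted, documented transformation might look like, the example below converts a tab-separated word list to JSON and states its input and output formats up front. The two-column layout, the function name wordlist_to_json, and the sample rows are assumptions made for illustration.

```python
"""Convert a tab-separated word list to JSON.

Input:  text with one entry per line in the form `form<TAB>gloss`,
        with no header row. Blank lines are ignored.
Output: a JSON array of objects with the keys "form" and "gloss".
"""
import csv
import io
import json


def wordlist_to_json(tsv_text: str) -> str:
    """Return the word list in tsv_text as pretty-printed JSON text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    entries = []
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            # Fail loudly instead of silently guessing at malformed input.
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(row)}"
            )
        form, gloss = (field.strip() for field in row)
        entries.append({"form": form, "gloss": gloss})
    return json.dumps(entries, ensure_ascii=False, indent=2)


# Placeholder rows, not real language data.
sample = "form-one\tfirst gloss\nform-two\tsecond gloss\n"
converted = wordlist_to_json(sample)
print(converted)
```

Because the conversion is scripted rather than done by hand, it can be rerun whenever the source file changes, and the docstring records what the script expects and produces.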