Digital Linguistics for language documentation

Authors:
  • Chitimacha Tribe of Louisiana

Abstract

Best practices in Digital Linguistics for language documentation. Slides available at https://slides.com/dwhieb/digital-linguistics-for-language-documentation.
Digital Linguistics for Language Documentation
Daniel W. Hieber
University of California, Santa Barbara
May 24, 2019
Slides available at:
https://slides.com/dwhieb/digital-linguistics-for-language-documentation
What is Digital Linguistics (DLx)?
Digital Linguistics (DLx) is the science of digital data
management for linguistics, including the digital storage,
representation, manipulation, and dissemination of
linguistic data. It concerns how to represent linguistic
data in digital form and how best to work with that data,
while staying attentive to best practices and ethical
concerns in language documentation, sociocultural
linguistics, and language revitalization.
DLx Resources
digitallinguistics.io
GitHub projects
bibliography
CLARIN-D
slides
repository
Data Management
Types of things called "data" in linguistics:
audiovisual media
(time-aligned) annotations
metadata
lexical databases
corpora
publications containing any of the above
Metadata
Data that describes another set of data.
location(s)
date(s)
speakers / researchers
sociocultural context
documentary context
folder/repository structure
file formats / naming conventions
terminology / glossary / abbreviations
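As a minimal sketch of how the kinds of metadata listed above could be kept in a plain-text, machine-readable form, the Python below builds one record and serializes it as JSON. The field names, the build_metadata helper, and every value are hypothetical placeholders; they do not follow OLAC, IMDI, DaFoDiL, or any other standard.

```python
import json


def build_metadata(title, locations, dates, speakers, researchers, notes=""):
    """Return a plain dict describing one recording session.

    The schema is hypothetical and exists only for illustration.
    """
    return {
        "title": title,
        "locations": list(locations),
        "dates": list(dates),
        "speakers": list(speakers),
        "researchers": list(researchers),
        "notes": notes,
    }


# Placeholder values only; none of this describes a real session.
record = build_metadata(
    title="Example session",
    locations=["Example Town"],
    dates=["2000-01-01"],
    speakers=["Speaker A"],
    researchers=["Researcher B"],
    notes="Placeholder values for illustration only.",
)

# ensure_ascii=False keeps non-ASCII characters readable in the output file.
metadata_json = json.dumps(record, indent=2, ensure_ascii=False, sort_keys=True)
print(metadata_json)
```

Storing such a record next to the recording it describes keeps the metadata readable by both people and scripts.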
Metadata Standards
Open Language Archives Community (OLAC)
ISLE Metadata Initiative (IMDI)
Data Format for Digital Linguistics (DaFoDiL)
Different tools utilize different metadata formats, or just
use their own
Data Management Plan (DMP)
Required by most funding organizations
Current practice has a focus on archiving
Good DMPs plan for the entire lifecycle of the data
Linguistic Data Consortium (LDC)
Workshop on DMPs for Linguistic Research
CLARIN-D
Data Lifecycle
1. data entry
2. data cleaning
3. data editing
4. data use
Data Workflow
1. recording
2. metadata
3. (time-aligned) annotation
4. presentation
Back up and/or archive at every stage
Back up and/or archive at every version
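One way to make the backup step routine is to script it. The sketch below copies a file into a backup folder under a version label and verifies the copy with a SHA-256 checksum. The function names, the folder layout, and the ".vN" naming scheme are assumptions made for illustration; a script like this complements, and does not replace, version control or deposit in an archive.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source, backup_dir, version):
    """Copy `source` into `backup_dir` with a version label and verify the copy.

    Hypothetical naming scheme: "session.wav" at version 2 becomes
    "session.v2.wav". Returns the path of the backup copy.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{source.stem}.v{version}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target
```

Calling backup("session.wav", "backups", 2) would, under these assumptions, create backups/session.v2.wav and raise an error if the copy differs from the original.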
Primary ("Raw") Data
audiovisual recordings
images / scans
Data are stored in "binary" (i.e. non-text) files
Require specialized software to read
Not human-readable
Images: .jpg, .jpeg, .png, .svg
Scans / Documents: .pdf, .docx
Audio: .wav, .mp3, .wma
Video: .mpeg, .avi, .mov, .mp4
Databases: .xlsx, .accdb, .fmp
Figure: JPEG file
Figure: JPEG file (as text)
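The point of the two JPEG slides can be checked with a few lines of Python. The sketch below relies on two well-known facts: JPEG files begin with the bytes FF D8 FF, and the byte FF never occurs in valid UTF-8 text. The function names and the sample byte strings are made up for illustration and are not taken from a real file.

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # the first three bytes of a JPEG file


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the bytes start with the JPEG signature."""
    return data.startswith(JPEG_MAGIC)


def is_utf8_text(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative samples: a typical JPEG header and a small JSON snippet.
sample_jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
sample_text = '{"lemma": "example"}'.encode("utf-8")

print(looks_like_jpeg(sample_jpeg_header), is_utf8_text(sample_jpeg_header))
print(looks_like_jpeg(sample_text), is_utf8_text(sample_text))
```

A text editor that opens a binary file is effectively forcing such a decode, which is why the result looks like noise.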
Structured Data (Text)
Markup
Non-Proprietary
.txt (Text)
.md (Markdown)
.json (JavaScript Object Notation / JSON)
.sql (Structured Query Language / SQL)
.yml (YAML)
.xml (Extensible Markup Language / XML)
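To illustrate why these text formats are convenient, the sketch below writes the same small record as JSON and as XML using only the Python standard library. The entry and its field names (lemma, gloss, partOfSpeech) are hypothetical and do not follow any particular schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical lexical entry; the field names are illustrative only.
entry = {"lemma": "example", "gloss": "illustration", "partOfSpeech": "noun"}

# JSON: the dict maps directly onto a JSON object.
as_json = json.dumps(entry, indent=2, ensure_ascii=False)

# XML: one child element per field under a root <entry> element.
root = ET.Element("entry")
for key, value in entry.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

Both outputs are plain text, so they can be opened in any text editor, compared line by line, and tracked in version control.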
Structured Data (Text)
Proprietary
EAF (ELAN)
FlexText (FLEx)
SFM (Toolbox)
TextGrid (Praat)
SayMore
Tools
Audacity
database software (Access, FileMaker Pro)
ELAN
FLEx
keyboards (Keyman, typeit.org)
Lexique Pro
open source projects (DLx)
Elpis
Kratylos
Praat
SayMore
scripts (JavaScript, Python, R)
spreadsheet software (Excel, OpenOffice)
SQL (HeidiSQL)
text editors (Atom, Notepad++)
Toolbox
Transcriber
Webonary
WeSay
Data Workflow + Tools
Problems
operating system-specific
task-specific
variety of formats
access / licensing
do not synchronize (easily)
few backup / archiving solutions
not easily citeable / shareable
Data Workflow + Tools
Recommendations
version control
single source of truth
document your workflow (for yourself as much as others)
document your formats / fields
avoid manual transformations / processes
write scripts (document their inputs and outputs carefully)
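As a sketch of what a small, documented conversion script might look like, the Python below parses Toolbox-style SFM text into records and prints them as JSON. It assumes the common convention in which each entry starts with a \lx line, with \ps and \ge as further markers; the parse_sfm function and the sample entries are hypothetical. Lines without a marker, such as wrapped continuation lines, are skipped, which a real script would need to handle.

```python
import json


def parse_sfm(text, record_marker="lx"):
    """Parse SFM text into a list of records.

    Input:  a string of lines shaped like "\\marker value".
    Output: a list of dicts mapping each marker to a list of its values.
    A new record starts at every `record_marker` line.
    """
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and lines without a marker
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker or current is None:
            current = {}
            records.append(current)
        current.setdefault(marker, []).append(value.strip())
    return records


# Made-up sample data for illustration only.
sample = "\\lx example\n\\ps n\n\\ge illustration\n\n\\lx another\n\\ps v\n"
records = parse_sfm(sample)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Because the docstring states the input and output formats, the same conversion can be rerun on every new version of the data instead of being repeated by hand.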
Principles
Open Web Platform
open source
web-based
standards-based
discoverable / open access
Goals
data format (JSON)
open-source tools
ecosystem
education
Austin Principles of Data Citation in Linguistics
Formats
Scription
DaFoDiL
Tools
scripts ( )
converters
transliterator
tools.digitallinguistics.io
app.digitallinguistics.io
GitHub
Contact
Danny Hieber
dhieber@ucsb.edu
danielhieber.com