Ryan Georgi
University of Washington Seattle | UW · Department of Linguistics
PhD in Computational Linguistics
About
12 Publications
1,609 Reads
67 Citations
Introduction
Additional affiliations
June 2012 - present
June 2009 - September 2016
February 2008 - September 2008
Publications (12)
Extracting semi-structured text from scientific writing in PDF files is a difficult task that researchers have faced for decades. In the 1990s, this task was largely a computer vision and OCR problem, as PDF files were often the result of scanning printed documents. Today, PDFs have standardized digital typesetting without the need for OCR, but ext...
The current release of the ODIN (Online Database of Interlinear Text) database contains over 150,000 linguistic examples, from nearly 1,500 languages, extracted from PDFs found on the web, representing a significant source of data for language research, particularly for low-resource languages. Errors introduced during PDF-to-text conversion or poor...
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves boot...
The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swathe of the world’s languages. In many cases this involves boot...
In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN...
Obtaining syntactic parses is an important step in many NLP pipelines. However, most of the world’s languages do not have a large amount of syntactically annotated data available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora consisting of resource-poor and resource-rich language pairs,...
Syntactic parses can provide valuable information for many NLP tasks, such as machine translation and semantic analysis. However, most of the world's languages do not have large amounts of syntactically annotated corpora available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora between...
Recent studies have shown the potential benefits of leveraging resources for resource-rich languages to build tools for similar, but resource-poor languages. We examine what constitutes "similarity" by comparing traditional phylogenetic language groups, which are motivated largely by genetic relationships, with language groupings formed by clusteri...
In this thesis, we propose that instances of interlinear glossed text (IGT), as found in a wide range of linguistic papers, represent enriched content similar to partially annotated corpora. With such a type of data readily available for many languages for which little to no other data is available, we attempt to create a system which utilizes this...