Nathan Schneider’s research while affiliated with Georgetown University and other places


Publications (119)


Correction: Cross-linguistically consistent semantic and syntactic annotation of child-directed speech
  • Article
  • Full-text available

September 2024 · 5 Reads · Language Resources and Evaluation

Ida Szubert · Omri Abend · Nathan Schneider · [...] · Mark Steedman

Figures:
  • Main stages of the proposed annotation methodology
  • Example AMR (top) and UCCA (bottom) graphs for the sentence "Do you think the baby whale might want some milk?" Abbreviations: Part. (Participant), Elab. (Elaborator), Quant. (Quantity), Adv. (Adverbial)
  • (a) UD parse; (b) tree transformation to subcategorize verb POS, remove punctuation, and combine a verb with its particle; (c) LF assignment to nodes and edges
  • Derivation of the LF for the sentence "Pick up that blue pencil", starting after α-conversion of the LF expressions. Reduction proceeds by applying the LF of the dependency relation to the LF of the head, then applying the resulting LF to the LF of the dependent; the red numbers mark the order of composition determined in the tree binarization step
  • A screenshot from the Arborator annotation interface, displaying an automatically converted UD parse that was later hand-corrected


Cross-linguistically consistent semantic and syntactic annotation of child-directed speech

May 2024 · 37 Reads · 3 Citations · Language Resources and Evaluation

Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown's Adam corpus (English; we annotate ≈80% of its child-directed utterances) and all child-directed utterances from Berman's Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
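
The UD-to-LF transduction step lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch of the general idea (the toy composition rule and data are ours, not the paper's transducer): walk a small CoNLL-U-style dependency parse and compose a predicate-argument logical form from the core dependents.

    # Minimal sketch of transducing a logical form from a UD parse. The toy
    # composition rule below (predicate applied to its core arguments) is
    # illustrative only, not the paper's actual transduction algorithm.
    from collections import defaultdict

    # Toy UD parse of "You want milk" as (id, form, head, deprel) tuples,
    # following the CoNLL-U convention that the root has HEAD = 0.
    TOKENS = [
        (1, "You",  2, "nsubj"),
        (2, "want", 0, "root"),
        (3, "milk", 2, "obj"),
    ]

    FORMS = {tid: form.lower() for tid, form, _, _ in TOKENS}

    def build_children(tokens):
        children = defaultdict(list)
        for tid, _, head, deprel in tokens:
            children[head].append((tid, deprel))
        return children

    def to_lf(tid, children):
        """Compose a predicate-argument LF for the subtree rooted at tid."""
        args = [to_lf(child, children)
                for child, deprel in children[tid]
                if deprel in ("nsubj", "obj", "iobj")]  # core arguments only
        return f"{FORMS[tid]}({', '.join(args)})" if args else FORMS[tid]

    children = build_children(TOKENS)
    root = children[0][0][0]          # the token attached to HEAD = 0
    print(to_lf(root, children))      # -> want(you, milk)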





Assessing the Cross-linguistic Utility of Abstract Meaning Representation

December 2023 · 28 Reads · 4 Citations · Computational Linguistics

Semantic representations capture the meaning of a text. Abstract Meaning Representation (AMR), a type of semantic representation, focuses on predicate-argument structure and abstracts away from surface form. Though AMR was developed initially for English, it has now been adapted to a multitude of languages in the form of non-English annotation schemas, cross-lingual text-to-AMR parsing, and AMR-to-(non-English) text generation. We advance prior work on cross-lingual AMR by thoroughly investigating the amount, types, and causes of differences which appear in AMRs of different languages. Further, we compare how AMR captures meaning in cross-lingual pairs versus strings, and show that AMR graphs are able to draw out fine-grained differences between parallel sentences. We explore three primary research questions: (1) What are the types and causes of differences in parallel AMRs? (2) How can we measure the amount of difference between AMR pairs in different languages? (3) Given that AMR structure is affected by language and exhibits cross-lingual differences, how do cross-lingual AMR pairs compare to string-based representations of cross-lingual sentence pairs? We find that the source language itself does have a measurable impact on AMR structure, and that translation divergences and annotator choices also lead to differences in cross-lingual AMR pairs. We explore the implications of this finding throughout our study, concluding that, while AMR is useful to capture meaning across languages, evaluations need to take into account source language influences if they are to paint an accurate picture of system output, and meaning generally.
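
Quantifying the difference between a pair of AMRs typically reduces to finding a variable mapping that maximizes triple overlap, as in the Smatch metric. The sketch below illustrates that idea, assuming the penman library for decoding AMR strings; the brute-force search over mappings is only feasible for toy graphs (real Smatch uses hill-climbing), and the example graphs are invented.

    # Smatch-style comparison of two AMRs: search for the variable mapping
    # that maximizes shared triples, then report F1. Brute-force search is
    # exponential, so this sketch only suits toy graphs; the real Smatch
    # metric uses hill-climbing instead.
    from itertools import permutations
    import penman

    def mapped_triples(graph, mapping):
        """Return the graph's triples with variables renamed per `mapping`."""
        return {(mapping.get(s, s), role, mapping.get(t, t))
                for s, role, t in graph.triples}

    def smatch_like_f1(amr1, amr2):
        g1, g2 = penman.decode(amr1), penman.decode(amr2)
        v1, v2 = list(g1.variables()), list(g2.variables())
        gold = mapped_triples(g2, {})
        best = 0
        for perm in permutations(v2, len(v1)):  # injective variable mappings
            best = max(best, len(mapped_triples(g1, dict(zip(v1, perm))) & gold))
        p, r = best / len(g1.triples), best / len(g2.triples)
        return 2 * p * r / (p + r) if p + r else 0.0

    en = "(w / want-01 :ARG0 (b / baby) :ARG1 (m / milk))"
    es = "(q / querer-01 :ARG0 (b / bebé) :ARG1 (l / leche))"
    print(round(smatch_like_f1(en, en), 2))  # identical graphs -> 1.0
    print(round(smatch_like_f1(en, es), 2))  # concept mismatches lower the score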


AMR4NLI: Interpretable and robust NLI measures from semantic graphs

June 2023 · 7 Reads · 1 Citation

The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.
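
The substructure intuition admits a very direct graph-based score: how much of the hypothesis AMR is contained in the premise AMR. A hedged sketch (again assuming the penman library; the simple recall-style score and example graphs are illustrative, not the paper's metrics):

    # Illustrative entailment-as-containment score: the fraction of the
    # hypothesis AMR's triples that also appear in the premise AMR. Unlike
    # a symmetric F1, this score is directional, mirroring the premise ->
    # hypothesis asymmetry of NLI. Variables are assumed pre-aligned here
    # for brevity; a real metric must align them first.
    import penman

    def containment(premise_amr, hypothesis_amr):
        prem = set(penman.decode(premise_amr).triples)
        hyp = set(penman.decode(hypothesis_amr).triples)
        return len(hyp & prem) / len(hyp)

    premise = "(e / eat-01 :ARG0 (d / dog) :ARG1 (b / bone) :time (t / today))"
    hypothesis = "(e / eat-01 :ARG0 (d / dog) :ARG1 (b / bone))"
    print(containment(premise, hypothesis))  # 1.0: hypothesis is a substructure
    print(containment(hypothesis, premise))  # < 1.0: premise adds :time info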



Figures:
  • Annotation categories
  • Category-based micro-average F1 for each annotator, computed with all other annotators' annotations treated as gold
CuRIAM: Corpus re Interpretation and Metalanguage in U.S. Supreme Court Opinions

May 2023 · 52 Reads

Most judicial decisions involve the interpretation of legal texts; as such, judicial opinion requires the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59k tokens. We remark on several patterns observed in the kinds of metalanguage used by the justices.
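
The agreement figure above scores each annotator's micro-averaged F1 against the pooled annotations of all the other annotators. A minimal sketch of that leave-one-out computation, with hypothetical span annotations (the data format and category names are ours, not the released corpus schema):

    # Leave-one-out agreement sketch: each annotator's (start, end, category)
    # spans are scored against the union of all other annotators' spans,
    # micro-averaging over spans. Data format and categories are invented
    # for illustration.
    ANNOTATIONS = {
        "A": {(0, 2, "Quote"), (5, 7, "Definition")},
        "B": {(0, 2, "Quote"), (5, 7, "Definition"), (9, 10, "Metalanguage")},
        "C": {(0, 2, "Quote"), (9, 10, "Metalanguage")},
    }

    def micro_f1(annotator):
        mine = ANNOTATIONS[annotator]
        gold = set().union(*(spans for name, spans in ANNOTATIONS.items()
                             if name != annotator))
        tp = len(mine & gold)
        p = tp / len(mine) if mine else 0.0
        r = tp / len(gold) if gold else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    for name in ANNOTATIONS:
        print(name, round(micro_f1(name), 3))  # e.g. A 0.8, B 1.0, C 0.8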


Translationese Reduction using Abstract Meaning Representation

April 2023 · 13 Reads

Translated texts or utterances bear several hallmarks that distinguish them from texts originally written in that language. This phenomenon, known as translationese, is well documented, and when found in training or test sets it can affect model performance. Still, work to mitigate the effect of translationese in human-translated text is understudied. We hypothesize that Abstract Meaning Representation (AMR), a semantic representation which abstracts away from the surface form, can be used as an interlingua to reduce the amount of translationese in translated texts. By parsing English translations into AMR graphs and then generating text from those graphs, we obtain texts that more closely resemble non-translationese by macro-level measures. We show, across four metrics and qualitatively, that using AMR as an interlingua reduces translationese, and we compare our results to two additional approaches: one based on round-trip machine translation and one based on syntactically controlled generation.
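
The parse-then-generate round trip is straightforward to prototype. The sketch below assumes the amrlib library's documented interface (pretrained parse and generate models must be installed separately, and exact signatures may vary by version):

    # Sketch of the AMR round trip used to reduce translationese: parse an
    # English translation into AMR, then generate fresh English from the
    # graph, discarding the translated surface form. Assumes amrlib's
    # documented interface; model names and return types may vary by version.
    import amrlib

    stog = amrlib.load_stog_model()   # sentence-to-graph: AMR parser
    gtos = amrlib.load_gtos_model()   # graph-to-sentence: AMR generator

    translated = ["Yesterday he gave to the committee his resignation."]
    graphs = stog.parse_sents(translated)   # English -> AMR strings
    regenerated, _ = gtos.generate(graphs)  # AMR -> English, surface renewed
    print(regenerated[0])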


Citations (52)


... We test this model on two languages: English, on which it was originally tested, and Hebrew. The data we use is comprised of real child-directed utterances, taken from the CHILDES corpus (MacWhinney, 1998), coupled with a recent method for converting universal dependency annotations to logical forms (Szubert et al., 2024). ...

Reference:

A Language-agnostic Model of Child Language Acquisition
Cross-linguistically consistent semantic and syntactic annotation of child-directed speech

Language Resources and Evaluation

... The subgraphs then guide the generation of simpler sentences which form the final output. AMR is chosen because it is the meaning representation that has received the most attention in recent developments in treebanks (Knight et al., 2020), parsing, text generation (Bai et al., 2022), and cross-lingual adaptation (Wein and Schneider, 2024), and it reflects the state of the art in graph-based meaning representation. We demonstrate that with a well-developed semantic graph like AMR, a syntactic simplification system can be derived from simple rules as a lightweight alternative to LLMs. ...

Assessing the Cross-linguistic Utility of Abstract Meaning Representation

Computational Linguistics

... Firstly, their computational efficiency hinders the comparison of large AMRs extracted from documents (Naseem et al., 2022). Secondly, these metrics struggle to accurately capture the semantic similarity of the underlying text from which AMRs are derived (Leung et al., 2022a). Additionally, while recent efforts like BAMBOO (Opitz et al., 2021) have evaluated metrics on AMR transformations, we still lack a large-scale benchmark to systematically evaluate the ability of AMR metrics to capture structural similarity. ...

Semantic Similarity as a Window into Vector- and Graph-Based Metrics
  • Citing Conference Paper
  • January 2022

... CGELBank (Reynolds et al., 2023) is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language (CGEL; Huddleston and Pullum, 2002). It is hosted on GitHub at https://github.com/nert-nlp/cgel. ...

Unified Syntactic Annotation of English in the CGEL Framework

... Our first experiment shows how metalinguistic judgments about reform language in LLMs reflect politically biased language ideologies. This extends research both on political bias (e.g., Feng et al., 2023) and on metalinguistic statements (e.g., Behzad et al., 2023; Hu and Levy, 2023) in LLMs, highlighting that ideas about "correctness" or "naturalness" of language are not neutral, and may impact use of socially-relevant reform language. Our second experiment assesses internal consistency, finding that LLMs use reform language at different rates depending on whether and how much metalinguistic context is provided. ...

ELQA: A Corpus of Metalinguistic Questions and Answers about English

... Future research should explore the full potential of AMRs for natural language understanding. Natural Language Inference (NLI) is a prime example, where AMR-based systems have already shown promise (Opitz et al., 2023). An even more intriguing direction would be to develop methods that perform NLI solely through AMR matching, capitalizing on the rich structure and semantics encoded within AMRs. ...

AMR4NLI: Interpretable and robust NLI measures from semantic graphs
  • Citing Preprint
  • June 2023

... In recent years, there has been a significant increase in research dedicated to the investigation and analysis of multi-word expressions (MWE) (Ramisch, 2015;Constant et al., 2017;Kahane et al., 2018;Savary et al., 2023;Mel'čuk, 2023), especially within Natural Language Processing (NLP). ...

PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions

Northern European Journal of Language Technology

... Other work in spatial and situated AMR (unrelated to human-robot interaction) has also accounted for the necessity of altering AMR to include grounding language. Datasets of AMR graphs for multimodal dialogue have incorporated gestures (Donatelli et al., 2022;Lai et al., 2024) and spatial information (Bonn et al., 2020;Dan et al., 2020) into the AMR schema. Martin et al. ...

Spanish Abstract Meaning Representation: Annotation of a General Corpus

Northern European Journal of Language Technology

... It has been widely observed in the BERTology literature that structural information (including POS information, which is critical to the expectation of a given word class hypothesized above) is encoded in the middle layers of the transformer architecture (e.g., Tenney et al., 2019; Liu et al., 2019a, inter alia). Aoyama and Schneider (2022) also corroborate this hypothesis directly through a language modeling task, showing that the model learns to predict a word with the 'correct' (i.e., same as the target) POS most actively in middle layers. ...

Probe-Less Probing of BERT’s Layer-Wise Linguistic Knowledge with Masked Word Prediction
  • Citing Conference Paper
  • January 2022