September 2024 · 5 Reads · Language Resources and Evaluation
May 2024 · 37 Reads · 3 Citations · Language Resources and Evaluation
Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown’s Adam corpus (English; we annotate ≈ 80% of its child-directed utterances) and all child-directed utterances from Berman’s Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
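To make the UD-to-LF transduction step concrete, here is a minimal sketch that reads a toy CoNLL-U parse of a child-directed utterance and composes a predicate-argument logical form from the root verb and its core dependents. It assumes the third-party conllu package; it is only an illustration of the idea, not the transduction method used for the corpora, which handles far richer structure (quantification, modification, questions, and so on).

# Illustrative only: compose a simple predicate-argument LF from a UD parse.
# Assumes `pip install conllu`; the example sentence and LF format are toy choices.
from conllu import parse

CONLLU = """\
# text = you want the ball
1\tyou\tyou\tPRON\t_\t_\t2\tnsubj\t_\t_
2\twant\twant\tVERB\t_\t_\t0\troot\t_\t_
3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_
4\tball\tball\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def ud_to_lf(sentence):
    """Return predicate(arg, ...) built from the root verb and its core arguments."""
    root = next(tok for tok in sentence if tok["deprel"] == "root")
    args = [tok["lemma"] for tok in sentence
            if tok["head"] == root["id"] and tok["deprel"] in ("nsubj", "obj", "iobj")]
    return f"{root['lemma']}({', '.join(args)})"

sentence = parse(CONLLU)[0]
print(ud_to_lf(sentence))  # -> want(you, ball)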
January 2024 · 7 Reads
January 2024
January 2024
December 2023 · 28 Reads · 4 Citations · Computational Linguistics
Semantic representations capture the meaning of a text. Abstract Meaning Representation (AMR), a type of semantic representation, focuses on predicate-argument structure and abstracts away from surface form. Though AMR was developed initially for English, it has now been adapted to a multitude of languages in the form of non-English annotation schemas, cross-lingual text-to-AMR parsing, and AMR-to-(non-English) text generation. We advance prior work on cross-lingual AMR by thoroughly investigating the amount, types, and causes of differences which appear in AMRs of different languages. Further, we compare how AMR captures meaning in cross-lingual pairs versus strings, and show that AMR graphs are able to draw out fine-grained differences between parallel sentences. We explore three primary research questions: (1) What are the types and causes of differences in parallel AMRs? (2) How can we measure the amount of difference between AMR pairs in different languages? (3) Given that AMR structure is affected by language and exhibits cross-lingual differences, how do cross-lingual AMR pairs compare to string-based representations of cross-lingual sentence pairs? We find that the source language itself does have a measurable impact on AMR structure, and that translation divergences and annotator choices also lead to differences in cross-lingual AMR pairs. We explore the implications of this finding throughout our study, concluding that, while AMR is useful to capture meaning across languages, evaluations need to take into account source language influences if they are to paint an accurate picture of system output, and meaning generally.
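To give a concrete sense of how the amount of difference between an AMR pair might be quantified, the sketch below decodes two graphs with the penman package and scores their overlap in concept-level triples. This is a deliberately simplified stand-in for alignment-based metrics such as Smatch (variables are simply replaced by their concepts, so repeated concepts can be conflated) and is not the measure used in the paper.

# Illustrative only: triple-overlap F1 between two AMR graphs, with variables
# replaced by their concept labels. Assumes `pip install penman`; the example
# graphs are toy English/Spanish-style annotations, not data from the paper.
import penman

def concept_triples(amr_str):
    g = penman.decode(amr_str)
    concept = {var: c for var, _, c in g.instances()}  # variable -> concept label
    return {(concept.get(s, s), role, concept.get(t, t)) for s, role, t in g.triples}

def overlap_f1(amr1, amr2):
    t1, t2 = concept_triples(amr1), concept_triples(amr2)
    both = len(t1 & t2)
    if not both:
        return 0.0
    precision, recall = both / len(t2), both / len(t1)
    return 2 * precision * recall / (precision + recall)

en = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
es = "(q / querer-01 :ARG0 (n / niño) :ARG1 (i / ir-01 :ARG0 n))"
print(overlap_f1(en, en))  # identical graphs score 1.0
print(overlap_f1(es, en))  # cross-lingual pair scores lower: the concept labels differ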
June 2023 · 7 Reads · 1 Citation
The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.
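As a small illustration of the embedding half of this idea, the sketch below scores how well each contextualized hypothesis token is "covered" by some premise token (maximum cosine similarity, averaged over the hypothesis). It assumes the transformers and torch packages and the bert-base-uncased checkpoint; it is only a coverage-style toy metric, not the paper's hybrid model or its AMR-graph substructure test.

# Illustrative only: hypothesis-coverage score over contextualized token embeddings.
# Assumes `pip install torch transformers` and the bert-base-uncased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (num_tokens, hidden_dim)
    return torch.nn.functional.normalize(hidden, dim=-1)  # unit-length rows

def coverage(premise, hypothesis):
    p, h = token_embeddings(premise), token_embeddings(hypothesis)
    sims = h @ p.T                    # cosine similarities, rows = hypothesis tokens
    return sims.max(dim=1).values.mean().item()

print(coverage("A man is playing a guitar on stage.", "A man plays an instrument."))
print(coverage("A man is playing a guitar on stage.", "A woman is swimming."))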
May 2023 · 7 Reads
CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.
May 2023 · 52 Reads
Most judicial decisions involve the interpretation of legal texts; as such, judicial opinion requires the use of language as a medium to comment on or draw attention to other language. Language used this way is called metalanguage. We develop an annotation schema for categorizing types of legal metalanguage and apply our schema to a set of U.S. Supreme Court opinions, yielding a corpus totaling 59k tokens. We remark on several patterns observed in the kinds of metalanguage used by the justices.
April 2023 · 13 Reads
Translated texts or utterances bear several hallmarks that distinguish them from texts originally written in that language. This phenomenon, known as translationese, is well-documented, and when found in training or test sets can affect model performance. Still, work to mitigate the effect of translationese in human-translated text is understudied. We hypothesize that Abstract Meaning Representation (AMR), a semantic representation which abstracts away from the surface form, can be used as an interlingua to reduce the amount of translationese in translated texts. By parsing English translations into an AMR graph and then generating text from that AMR, we obtain texts that more closely resemble non-translationese by macro-level measures. We show that across four metrics, and qualitatively, using AMR as an interlingua enables the reduction of translationese, and we compare our results to two additional approaches: one based on round-trip machine translation and one based on syntactically controlled generation.
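A minimal sketch of this round-trip idea follows: parse an English translation into an AMR graph, then regenerate English text from the graph so that surface-level translationese is abstracted away. It assumes the third-party amrlib package and its default parse (load_stog_model) and generation (load_gtos_model) models; it illustrates the pipeline only, not the paper's models, data, or metrics.

# Illustrative only: text -> AMR -> text round trip with amrlib.
# Assumes `pip install amrlib` and that its default parse/generate models are installed.
import amrlib

stog = amrlib.load_stog_model()   # sentence-to-graph (AMR parsing) model
gtos = amrlib.load_gtos_model()   # graph-to-sentence (AMR-to-text) model

translated = ["It is in this spirit that a majority of the members have given their support."]
graphs = stog.parse_sents(translated)    # AMR graphs as PENMAN strings
regenerated, _ = gtos.generate(graphs)   # text regenerated from the graphs
print(regenerated[0])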
... We test this model on two languages: English, on which it was originally tested, and Hebrew. The data we use consists of real child-directed utterances taken from the CHILDES corpus (MacWhinney, 1998), coupled with a recent method for converting universal dependency annotations to logical forms (Szubert et al., 2024). ...
May 2024
Language Resources and Evaluation
... The subgraphs then guide the generation of simpler sentences, which form the final output. AMR is chosen as it is the meaning representation that has received the most attention in recent developments of treebanks (Knight et al., 2020), parsing, text generation (Bai et al., 2022), and cross-lingual adaptation (Wein and Schneider, 2024), and it reflects the state of the art in graph-based meaning representation. We demonstrate that with a well-developed semantic graph like AMR, a syntactic simplification system can be derived from simple rules as a lightweight alternative to LLMs. ...
December 2023
Computational Linguistics
... accessed August 2023. Non-deterministic result counts seem to be an artefact of the ACL Anthology's use of Google's Programmable Search Engine (Bollmann et al., 2023). ...
January 2023
... Firstly, their limited computational efficiency hinders the comparison of large AMRs extracted from documents (Naseem et al., 2022). Secondly, these metrics struggle to accurately capture the semantic similarity of the underlying text from which AMRs are derived (Leung et al., 2022a). Additionally, while recent efforts like BAMBOO (Opitz et al., 2021) have evaluated metrics on AMR transformations, we still lack a large-scale benchmark to systematically evaluate the ability of AMR metrics to capture structural similarity. ...
January 2022
... CGELBank (Reynolds et al., 2023) is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language (CGEL; Huddleston and Pullum, 2002). It is hosted on GitHub at https://github.com/nert-nlp/cgel. ...
Reference:
CGELBank Annotation Manual v1.0
January 2023
... Our first experiment shows how metalinguistic judgments about reform language in LLMs reflect politically biased language ideologies. This extends research both on political bias (e.g., Feng et al., 2023) and on metalinguistic statements (e.g., Behzad et al., 2023; Hu and Levy, 2023) in LLMs, highlighting that ideas about "correctness" or "naturalness" of language are not neutral, and may impact use of socially-relevant reform language. Our second experiment assesses internal consistency, finding that LLMs use reform language at different rates depending on whether and how much metalinguistic context is provided. ...
January 2023
... Future research should explore the full potential of AMRs for natural language understanding. Natural Language Inference (NLI) is a prime example, where AMR-based systems have already shown promise (Opitz et al., 2023). An even more intriguing direction would be to develop methods that perform NLI solely through AMR matching, capitalizing on the rich structure and semantics encoded within AMRs. ...
June 2023
... In recent years, there has been a significant increase in research dedicated to the investigation and analysis of multi-word expressions (MWE) (Ramisch, 2015; Constant et al., 2017; Kahane et al., 2018; Savary et al., 2023; Mel'čuk, 2023), especially within Natural Language Processing (NLP). ...
February 2023
Northern European Journal of Language Technology
... Other work in spatial and situated AMR (unrelated to human-robot interaction) has also accounted for the necessity of altering AMR to include grounding language. Datasets of AMR graphs for multimodal dialogue have incorporated gestures (Donatelli et al., 2022;Lai et al., 2024) and spatial information (Bonn et al., 2020;Dan et al., 2020) into the AMR schema. Martin et al. ...
Reference:
A Survey of AMR Applications
November 2022
Northern European Journal of Language Technology
... It has been widely observed in the BERTology literature that structural information (including POS information, which is critical to the expectation of a given word class hypothesized above) is encoded in the middle layers of the transformer architecture (e.g., Tenney et al., 2019; Liu et al., 2019a, inter alia). Aoyama and Schneider (2022) also corroborate this hypothesis directly through a language modeling task, showing that the model learns to predict a word with the 'correct' (i.e., same as the target) POS most actively in the middle layers. ...
January 2022