Shuly Wintner's research while affiliated with University of Haifa and other places

Publications (134)

Conference Paper
Full-text available
We present the Hebrew Essay Corpus: an annotated corpus of Hebrew language argumentative essays authored by prospective higher-education students. The corpus includes both essays by native speakers, written as part of the psychometric exam that is used to assess their future success in academic studies; and essays authored by non-native speakers, w...
Preprint
Full-text available
Natural language processing (NLP) models trained on people-generated data can be unreliable because, without any constraints, they can learn from spurious correlations that are not relevant to the task. We hypothesize that enriching models with speaker information in a controlled, educated way can guide them to pick up on relevant inductive biases....
Preprint
State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from conte...
Article
Full-text available
Acronyms—words formed from the initial letters of a phrase—are important for various natural language processing applications, including information retrieval and machine translation. While hand-crafted acronym dictionaries exist, they are limited and require frequent updates. We present a new machine-learning-based approach to automatically build...
Preprint
Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which...
Article
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native la...
Preprint
Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and "fake news". Here, we draw on two concepts from the political science literature to explore subtler strategies for government media manipulation: agenda-setting (selecting what topics to cover) and framing (deciding how topics are cove...
Preprint
We present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native la...
Preprint
Full-text available
This work distinguishes between translated and original text in the UN protocol corpus. By modeling the problem as a classification problem, we can achieve up to 95% classification accuracy. We begin by deriving a parallel corpus for different language-pairs annotated for translation direction, and then classify the data by using various feature extr...
Article
Full-text available
Modern Standard Arabic (MSA) has several embedded clause constructions, some of which resemble control in English (and other languages). However, these constructions exhibit some notable differences. Chief among them is the fact that the embedded verb carries agreement features that can indicate both coreference and disjoint reference between a mat...
Conference Paper
Translated texts, in any language, have unique characteristics that set them apart from texts originally written in the same language. Translation studies is a research field that focuses on investigating these characteristics. Until recently, research in computational linguistics, and specifically in machine translation, has been entirely divorced...
Article
Translation has played an important role in trade, law, commerce, politics, and literature for thousands of years. Translators have always tried to be invisible; ideal translations should look as if they were written originally in the target language. We show that traces of the source language remain in the translation product to the extent that it...
Conference Paper
Full-text available
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that the author's gender has a powerful, clear signal in ori...
Article
Full-text available
In this paper we investigate the status of control constructions in Modern Standard Arabic (MSA). MSA has several embedded clause constructions, some of which resemble control in English (and other languages). However, these constructions exhibit some notable differences. Chief among them is the fact that the embedded verb carries agreement feature...
Article
Full-text available
The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait, gender, and study how it is manifested in original texts and in translations. We show that gender has a powerful, clear signal in originals, but this signal...
Conference Paper
We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) t...
Article
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifi...
Article
Existing approaches to the representation of argument structure in grammar tend to focus either on semantics or on syntax. Our goal in this paper is to strike the right balance between the two levels by proposing an analysis that maintains the independence of the syntactic and semantic aspects of argument structure, and, at the same time, captures...
Article
Full-text available
Multi-word expressions (MWEs) are challenging for grammatical theories and grammar development since they blur the traditional distinction between the lexicon and the grammar, and vary in the degree of idiosyncrasy with respect to their semantic, syntactic, and morphological behavior. Nevertheless, the need to incorporate MWEs into grammars is unqu...
Article
Full-text available
We show how linguistic grammars of two different yet related languages can be developed and implemented in parallel, with language-independent fragments serving as shared resources, and language-specific ones defined separately for each language. The two grammars in the focus of this paper are of Modern Hebrew and Modern Standard Arabic, and the ba...
Article
We describe bilingual English-French and English-German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpus is diverse, consisting of parliamentary proceedings, literary works, transcripts of TED talks and political commentary. It will be instrumental for research of translationese and its applica...
Article
We use text classification to distinguish automatically between original and translated texts in Hebrew, a morphologically complex language. To this end, we design several linguistically informed feature sets that capture word-level and sub-word-level (in particular, morphological) properties of Hebrew. Such features are abstract enough to allow fo...
Chapter
We present in this chapter some basic linguistic facts about Semitic languages, covering orthography, morphology, and syntax. We focus on Arabic (both standard and dialectal), Ethiopian languages (specifically, Amharic), Hebrew, Maltese and Syriac. We conclude the chapter with a contrastive analysis of some of these phenomena across the various lan...
Chapter
This chapter addresses morphological processing of Semitic languages. In light of the complex morphology and problematic orthography of many of the Semitic languages, the chapter begins with a recapitulation of the challenges these phenomena pose for computational applications. It then discusses the approaches that were suggested to cope with these...
Article
Much research in translation studies indicates that translated texts are ontologically different from original non-translated ones. Translated texts, in any language, can be considered a dialect of that language, known as ‘translationese’. Several characteristics of translationese have been proposed as universal in a series of hypotheses. In this w...
Article
Full-text available
We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the stan...
Article
Translation models used for statistical machine translation are compiled from parallel corpora that are manually translated. The common assumption is that parallel texts are symmetrical: The direction of translation is deemed irrelevant and is consequently ignored. Much research in Translation Studies indicates that the direction of translation mat...
Article
Full-text available
Nonverbal predicates in Modern Hebrew have been the subject of investigation in a number of studies. However, to our knowledge, none of them was corpus-based. Corpus searches reveal that the nonverbal constructions which are most commonly addressed in the literature are not the most commonly used ones. Once a broader range of data is considered add...
Article
Full-text available
We present a verb–complement dictionary of Modern Hebrew, automatically extracted from text corpora. Carefully examining a large set of examples, we defined ten types of verb complements that cover the vast majority of the occurrences of verb complements in the corpora. We explored several collocation measures as indicators of the strength of the a...
Article
Full-text available
In spite of the surging interest in multiword expressions (MWEs) in recent years, it is still unclear how such expressions should be stored in computational lexicons. This problem is amplified in morphologically-complex languages, where the unique properties of MWEs interact with non-trivial morphological processes. We propose an architecture for l...
Article
Full-text available
We present a syntactic parser of (transcripts of) spoken Hebrew: a dependency parser of the Hebrew CHILDES database. CHILDES is a corpus of child–adult linguistic interactions. Its Hebrew section has recently been morphologically analyzed and disambiguated, paving the way for syntactic annotation. This paper describes a novel annotation scheme of d...
Article
Full-text available
Several models of language acquisition have emerged in recent years that rely on computational algorithms for simulation and evaluation. Computational models are formal and precise, and can thus provide mathematically well-motivated insights into the process of language acquisition. Such models are amenable to robust computational evaluati...
Article
Unification grammars (UGs) are a grammatical formalism that underlies several contemporary linguistic theories, including lexical-functional grammar and head-driven phrase-structure grammar. UG is an especially attractive formalism because of its expressivity, which facilitates the expression of complex linguistic structures and relations. Formally...
Conference Paper
We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the stan...
Conference Paper
Translation models used for statistical machine translation are compiled from parallel corpora; such corpora are manually translated, but the direction of translation is usually unknown, and is consequently ignored. However, much research in Translation Studies indicates that the direction of translation matters, as translated language (translation...
Conference Paper
Prepositions are hard to translate, because their meaning is often vague, and the choice of the correct preposition is often arbitrary. At the same time, making the correct choice is often critical to the coherence of the output text. In the context of statistical machine translation, this difficulty is enhanced due to the possible long distance be...
Article
Full-text available
Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their im...
Article
Full-text available
Development of large-scale grammars for natural languages is a complicated endeavor: Grammars are developed collaboratively by teams of linguists, computational linguists, and computer scientists, in a process very similar to the development of large-scale software. Grammars are written in grammatical formalisms that resemble very-high-level progra...
Article
Full-text available
We compare translations of single words, made by bilingual speakers in a laboratory setting, with contextualized translation choices of the same items, made by professional translators and extracted from parallel language corpora. The translation choices in both cases show moderate convergence, demonstrating that decontextualized translation probab...
Conference Paper
We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and h...
Article
Grammars of natural languages can be expressed as mathematical objects, similar to computer programs. Such a formal presentation of grammars facilitates mathematical reasoning with grammars (and the languages they denote), as well as computational implementation of grammar processors. This book presents one of the most commonly used grammatical for...
Conference Paper
We propose an architecture for expressing various linguistically-motivated features that help identify multi-word expressions in natural language texts. The architecture combines various linguistically-motivated classification features in a Bayesian Network. We introduce novel ways for computing many of these features, and manually define linguisti...
Conference Paper
Transliteration is the rendering in one language of terms from another language (and, possibly, another writing system), approximating spelling and/or phonetic equivalents between the two languages. A transliteration dictionary is a crucial resource for a variety of natural language applications, most notably machine translation. We describe a gene...
Conference Paper
Full-text available
We present a corpus of transcribed spoken Hebrew that forms an integral part of a comprehensive data system that has been developed to suit the specific needs and interests of child language researchers: CHILDES (Child Language Data Exchange System). We introduce a dedicated transcription scheme for the spoken Hebrew data that is aware both of the...
Conference Paper
Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles ret...
Conference Paper
We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional w...
Conference Paper
Multi-word expressions constitute a significant portion of the lexicon of every natural language, and handling them correctly is mandatory for various NLP applications. Yet such entities are notoriously hard to define, and are consequently missing from standard lexicons and dictionaries. Multi-word expressions exhibit idiosyncratic behavior on vari...
Chapter
Introduction; Basic Notions; Language Classes and Linguistic Formalisms; Regular Languages; Context-Free Languages; The Chomsky Hierarchy; Mildly Context-Sensitive Languages; Further Reading
Conference Paper
Child language acquisition, one of Nature's most fascinating phenomena, is to a large extent still a puzzle. Experimental evidence seems to support the view that early language is highly formulaic, consisting for the most part of frozen items with limited productivity. Fairly quickly, however, children find patterns in the ambient language and gene...
Article
Full-text available
Corpora of child language are essential for research in child language acquisition and psycholinguistics. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe a project whose goal is to annotate the English section of the CHILDES database...
Article
Full-text available
Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their im...
Article
This article discusses the transition from annotated data to a gold standard, that is, a subset that is sufficiently noise-free with high confidence. Unless appropriately reinterpreted, agreement coefficients do not indicate the quality of the data set ...
Article
Polarized unification grammar (PUG) is a linguistic formalism which uses polarities to better control the way grammar fragments interact. The grammar combination operation of PUG was conjectured to be associative. We show that PUG grammar combination is not associative, and even attaching polarities to objects does not make it order-independent. Mo...
Chapter
Full-text available
We propose to model the development of language by a series of formal grammars, accounting for the linguistic capacity of children at the very early stages of mastering language. This approach provides a testbed for evaluating theories of language acquisition, in particular with respect to the extent to which innate, language-specific mechanisms mu...
Conference Paper
We present a Hebrew to English transliteration method in the context of a machine translation system. Our method uses machine learning to determine which terms are to be transliterated rather than translated. The training corpus for this purpose includes only positive examples, acquired semi-automatically. Our classifier reduces more than 38%...
Book
This Festschrift volume, published in honor of Nissim Francez on the occasion of his 65th birthday, contains 15 papers, written by friends and colleagues, many of whom congregated at a celebratory symposium held on May 24-25, 2009, in Haifa, Israel. The theme of the symposium was Languages: From Formal to Natural, reflecting the focus of Nissim Fra...
Article
Finite-state technology is considered the preferred model for representing the phonology and morphology of natural languages. The attractiveness of this technology for natural language processing stems from four sources: modularity of the design, due to the closure properties of regular languages and relations; the compact representation that is ac...
Article
Words in Semitic languages are formed by combining two morphemes: a root and a pattern. The root consists of consonants only, by default three, and the pattern is a combination of vowels and consonants, with non-consecutive "slots" into which the root consonants are inserted. Identifying the root of a given word is an important task, considered to...
Article
Morphological analysis is a crucial component of several natural language processing tasks, especially for languages with a highly productive morphology, where stipulating a full lexicon of surface forms is not feasible. We describe HAMSAH (HAifa Morphological System for Analyzing Hebrew), a morphological processor for Modern Hebrew, based on...
Article
Full-text available
We describe a suite of standards, resources and tools for computational encoding and processing of Modern Hebrew texts. These include an array of XML schemas for representing linguistic resources; a variety of text corpora, raw, automatically processed and manually annotated; lexical databases, including a broad-coverage monolingual lexicon, a bi...
Conference Paper
Finite-state technology is considered the preferred model for representing the phonology and morphology of natural languages. The attractiveness of this technology for natural language processing stems from four sources: modularity of the design, due to the closure properties of regular languages and relations; the compact representation that is ac...
Conference Paper
Polarities are used to sanction grammar fragment combination in high-level tree-based formalisms such as eXtensible MetaGrammar (XMG) and polarized unification grammars (PUG). We show that attaching polarities to tree nodes renders the combination operation non-associative, and in practice leads to overgeneration. We first provide some examp...
Conference Paper
Morphological analysis and disambiguation are crucial stages in a variety of natural language processing applications, especially when languages with complex morphology are concerned. We present a system which disambiguates the output of a morphological analyzer for Hebrew. It consists of several simple classifiers and a module which combi...
Article
We report on the creation of a medium-scale WordNet for Hebrew. We address this task as an instance of building a lexical resource for a new language (Hebrew) in a setting where similar resources exist for other languages, and multilingual requirements call for an alignment of the new resource with the existing ones. We compare the two main paradi...
Article
Full-text available
Corpora of child language are essential for psycholinguistic research. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe an ongoing project that aims to annotate the English section of the CHILDES database with grammatical relatio...
Chapter
The morphology of Semitic languages is unique in the sense that the major word-formation mechanism is an inherently non-concatenative process of interdigitation, whereby two morphemes, a root and a pattern, are interwoven. Identifying the root of a given word in a Semitic language is an important task, in some cases a crucial part of morphological...
Article
We introduce finite-state registered automata (FSRAs), a new computational device within the framework of finite-state technology, specifically tailored for implementing non-concatenative morphological processes. This model extends and augments existing finite-state techniques, which are presently not optimized for describing this kind of phenomena. We first...
Article
Unification grammars are widely accepted as an expressive means for describing the structure of natural languages. In general, the recognition problem is undecidable for unification grammars. Even with restricted variants of the formalism, off-line parsable grammars, the problem is computationally hard. We present two natural constraints on unifica...
Conference Paper
This work provides the essential foundations for modular construction of (typed) unification grammars for natural languages. Much of the information in such grammars is encoded in the signature, and hence the key is facilitating a modularized development of type signatures. We introduce a definition of signature modules and show how two modul...
Article
Computational lexicons are among the most important resources for natural language processing (NLP). Their importance is even greater in languages with rich morphology, where the lexicon is expected to provide morphological analyzers with enough information to enable them to correctly process intricately inflected forms. We describe the Haifa Lexi...
Conference Paper
We extend finite-state registered automata (FSRA) to account for medium-distance dependencies in natural languages. We provide an extended regular expression language whose expressions denote arbitrary FSRAs and use it to describe some morphological and phonological phenomena. We also define several dedicated operators which support an easy and efficient...
Article
This paper introduces xfst2fsa, a compiler which translates grammars expressed in the syntax of the XFST finite-state tool-box to grammars in the language of the FSA Utilities package. Compilation to FSA facilitates the use of grammars developed with the proprietary XFST tool-box on a publicly available platform. The paper describes the non-trivial...
Article
Full-text available
Unification grammars are known to be Turing-equivalent; given a grammar G and a word w, it is undecidable whether w ∈ L(G). In order to ensure decidability, several constraints on grammars, commonly known as off-line parsability (OLP), were suggested, such that the recognition problem is decidable for grammars which satisfy OLP. An open questio...
Article
Natural languages encode gender distinctions in various ways. We investigate the differences between English and Hebrew in this respect, our departure point being the relations that are defined between the feminine and the masculine realizations of nouns in the English WordNet. We define a number of distinct classes of English nouns which differ in...
Article
We present a computational system for morphological analysis and annotation of the Qur’an, for research and teaching purposes. The system facilitates a variety of queries on the Qur’anic text that make reference not only to the words but also to their linguistic attributes. The core of the system is a set of finite-state based rules which de...