Conference Paper

North American News Text Corpus

Authors: David Graff
... I collected sample data pseudo-randomly from various genres: two newspapers of the same date (July 1st, 1996), the first two discussion articles of the same month of the year (July 1996) in CQ Researcher Online (http://library2.cqpress.com/cqresearcher), conversation data from the Switchboard Corpus (Graff et al. 1998: files sw2001 through sw2019.txt), and narrative data from Netlibrary (http://www.netlibrary.com/) ...
... I've seen that, , that's, uh, that was a really good movie. (Graff et al. 1998) The speaker in these examples uses the present perfect to negotiate a topic she wants to talk about. She does so by asking the addressee whether an epistemic pre-condition for having a conversation on her chosen topic is satisfied, e.g., by asking about the extent of the addressee's experience with or knowledge of the topic. ...
... No, I haven't been camping since I was about sixteen. (Graff et al. 1998) (5.28) a. Have you seen DANCING WITH WOLVES? (X = I want to talk about this movie.) ...
... In this paper we describe an implementation of this approach for natural language data, and report on experiments that show that the approach provides quantifiable benefits. The experiments are performed on a natural language corpus of a half million sentences (Graff 1995). We measure performance on a natural language task of the following form: ...
... The scenes employed in the experiments reported are based on text taken from the North American News Text Corpus (Graff 1995), a series of plaintext newspaper articles. Six months worth of articles, comprising approximately a half million sentences, were employed. ...
... The available part of the North American News Text Corpus (Graff 1995) was split into two equal halves, containing different sets of articles. The first half was further split and was used for parameter fitting during the design of the experiments. ...
Conference Paper
Full-text available
A central goal of Artificial Intelligence is to create systems that embody commonsense knowledge in a reliable enough form that it can be used for reasoning in novel situations. Knowledge Infusion is an approach to this problem in which the commonsense knowledge is acquired by learning. In this paper we report on experiments on a corpus of a half million sentences of natural language text that test whether commonsense knowledge can be usefully acquired through this approach. We examine the task of predicting a deleted word from the remainder of a sentence for some 268 target words. As baseline we consider how well this task can be performed using learned rules based on the words within a fixed distance of the target word and their parts of speech. This captures an approach that has been previously demonstrated to be highly successful for a variety of natural language tasks. We then go on to learn from the corpus rules that embody commonsense knowledge, additional to the knowledge used in the baseline case. We show that chaining learned commonsense rules together leads to measurable improvements in prediction performance on our task as compared with the baseline. This is apparently the first experimental demonstration that commonsense knowledge can be learned from natural inputs on a massive scale reliably enough that chaining the learned rules is efficacious for reasoning.
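The baseline described in this abstract (predicting a deleted word from the words within a fixed distance of it) can be illustrated with a small, runnable sketch. The sketch below substitutes a scikit-learn logistic-regression classifier over bag-of-context features for the paper's learned rules; the sentences, window size, and target words are invented stand-ins rather than the authors' data.

    # Illustrative sketch only: a context-window baseline for predicting a
    # deleted word, using logistic regression in place of the paper's learned
    # rules.  Sentences, window size and targets are invented stand-ins.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def window_features(tokens, gap_index, k=3):
        # words within k positions of the gap, keyed by their offset
        feats = {}
        for offset in range(-k, k + 1):
            j = gap_index + offset
            if offset != 0 and 0 <= j < len(tokens):
                feats[f"w[{offset}]={tokens[j]}"] = 1
        return feats

    # training data: sentences with one word deleted; the label is the deleted word
    examples = [("the _ barked at the mailman".split(), 1, "dog"),
                ("a _ chased the ball".split(), 1, "dog"),
                ("the _ meowed on the fence".split(), 1, "cat"),
                ("a _ slept by the fire".split(), 1, "cat")]

    X = [window_features(toks, i) for toks, i, _ in examples]
    y = [label for _, _, label in examples]
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    test = window_features("the _ barked at a squirrel".split(), 1)
    print(clf.predict(vec.transform([test])))   # expected to favour 'dog'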
... The current New York Times corpora [10,13] are too small to satisfy the need for large-scale datasets for fine-tuning BERT. Specifically, [13] contains only every New York Times front-page story from 1996 to 2006, while [10] consists of sample materials from the New York Times Syndicate between July 1994 and December 1996. The current shortcomings of the New York Times corpus are: first, the time span of the corpus is short, so it cannot provide sufficient data to ensure that the language model learns the pattern of policy changes while also ensuring that the model converges. ...
Article
Full-text available
With the process of economic globalization and political multi-polarization accelerating, it is especially important to predict policy change in the United States. While current research has not taken advantage of the rapid advancement in natural language processing or of the relationship between news media and policy change, we propose a BERT-based model to predict policy change in the United States, using news published by the New York Times. Specifically, we propose a large-scale news corpus from the New York Times that covers the period from 2006 to 2018. We then use the corpus to fine-tune the pre-trained BERT language model to determine whether a news item is on the front page, which corresponds to policy priority. We propose a BERT-based Policy Change Index (BPCI) for the United States to predict policy change over a short future period. Experimental results on the New York Times corpus demonstrate the validity of the proposed method.
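As a concrete illustration of the fine-tuning step this abstract describes, the sketch below sets up BERT for binary front-page classification with the Hugging Face transformers library. The model name, toy headlines, labels, and hyperparameters are all assumptions made for illustration; they are not the authors' configuration.

    # Illustrative sketch: fine-tune BERT to predict whether a news item is a
    # front-page story (a binary proxy for policy priority).  All data and
    # hyperparameters below are invented for illustration.
    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import BertTokenizerFast, BertForSequenceClassification

    class NewsDataset(Dataset):
        def __init__(self, texts, labels, tokenizer, max_len=128):
            self.enc = tokenizer(texts, truncation=True, padding="max_length",
                                 max_length=max_len, return_tensors="pt")
            self.labels = torch.tensor(labels)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: v[i] for k, v in self.enc.items()}
            item["labels"] = self.labels[i]
            return item

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)

    train = NewsDataset(["Senate passes sweeping budget bill",      # 1 = front page
                         "Local bake sale raises record funds"],    # 0 = not
                        [1, 0], tokenizer)
    loader = DataLoader(train, batch_size=2, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for batch in loader:          # a single illustrative pass over the data
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()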
... For each verb in T1 and T2, we extracted all the occurrences (up to 10,000) from the raw corpus data gathered originally for constructing the VALEX lexicon (Korhonen et al., 2006a). The data was gathered from five corpora, including the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences (TREC-4 and TREC-5). The average frequency of verbs in T1 was 1448 and T2 2166, showing that T1 is a more sparse dataset. ...
... For each verb appearing in T3-T5, we extracted all the occurrences (up to 10,000) from the raw corpus data used for constructing VALEX (Korhonen et al., 2006a), including the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences (TREC-4 and TREC-5). ...
Thesis
Verb classifications have attracted a great deal of interest in both linguistics and natural language processing (NLP). They have proved useful for important tasks and applications, including e.g. computational lexicography, parsing, word sense disambiguation, semantic role labelling, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Kipper et al., 2008; Zapirain et al., 2008; Rios et al., 2011). Particularly useful are classes which capture generalizations about a range of linguistic properties (e.g. lexical, (morpho-)syntactic, semantic), such as those proposed by Beth Levin (1993). However, full exploitation of such classes in real-world tasks has been limited because no comprehensive or domain-specific lexical classification is available. This thesis investigates how Levin-style lexical semantic classes could be learned automatically from corpus data. Automatic acquisition is cost-effective when it involves either no or minimal supervision and it can be applied to any domain of interest where adequate corpus data is available. We improve on earlier work on automatic verb clustering. We introduce new features and new clustering methods to improve the accuracy and coverage. We evaluate our methods and features on well-established cross-domain datasets in English, on a specific domain of English (the biomedical) and on another language (French), reporting promising results. Finally, our task-based evaluation demonstrates that the automatically acquired lexical classes enable new approaches to some NLP tasks (e.g. metaphor identification) and help to improve the accuracy of existing ones (e.g. argumentative zoning).
... Our approach significantly improves upon the work of Minnen et al. (2000). We also use additional automatically parsed data from the North American News Text Corpus (Graff, 1995), further improving our results. ...
... As with Minnen et al. (2000), we train the language model on the Penn Treebank (Marcus et al., 1993). As far as we know, language modeling always improves with additional training data, so we add data from the North American News Text Corpus (NANC) (Graff, 1995), automatically parsed with the Charniak parser (McClosky et al., 2006), to train our language model on up to 20 million additional words. ...
Conference Paper
We present a method for automatic determiner selection, based on an existing language model. We train on the Penn Treebank and also use additional data from the North American News Text Corpus. Our results are a significant improvement over the previous best.
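The abstract does not spell out the selection procedure, but the general recipe (score each candidate determiner in context with a language model and keep the best) can be sketched with a toy add-one-smoothed bigram model. The corpus, candidate set, and scoring details below are assumptions, not the paper's actual model.

    # Illustrative sketch of determiner selection by language-model scoring.
    # A tiny add-one-smoothed bigram model stands in for the real LM; the
    # training sentences and candidate set are invented.
    import math
    from collections import Counter

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["a", "dog", "barked", "at", "the", "cat"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    V = len(unigrams)

    def avg_logprob(sent):
        # average log-probability per bigram, so that candidates which change
        # the sentence length remain comparable
        toks = ["<s>"] + sent + ["</s>"]
        pairs = list(zip(toks, toks[1:]))
        return sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + V))
                   for p in pairs) / len(pairs)

    def choose_determiner(left, right, candidates=("the", "a", "an", None)):
        # None means "no determiner"; keep the variant the LM scores highest
        scored = [(avg_logprob(left + ([d] if d else []) + right), d)
                  for d in candidates]
        return max(scored, key=lambda x: x[0])[1]

    print(choose_determiner(["dog", "barked", "at"], ["cat"]))   # -> 'the'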
... Articles are provided in an XML format, with the majority of the articles tagged for named entities (persons, places, organizations, titles and topics) so that these named entities are consistent across articles. The LDC also offers the North American News Corpus [9], assembled from varied sources, including the New York Times, the Los Angeles Times, the Wall Street Journal and others. The primary goals of this corpus are support for information retrieval and language modelling, so the count of "words" (almost 350 million tokens) is more important than the number of articles. ...
Preprint
Full-text available
In this research, we continuously collect data from the RSS feeds of traditional news sources. We apply several pre-trained implementations of named entity recognition (NER) tools, quantifying the success of each implementation. We also perform sentiment analysis of each news article at the document, paragraph and sentence level, with the goal of creating a corpus of tagged news articles that is made available to the public through a web interface. Finally, we show how the data in this corpus could be used to identify bias in news reporting.
... In Michael and Valiant's empirical demonstration of knowledge infusion, they create "teaching materials" by turning text from news sources into scenes, as per the Robust Logic framework. The dataset is the North American News Text Corpus [8], comprising six months' worth of articles (roughly 500,000 sentences). Sentences were annotated with the Semantic Role Labeler, an automated tagger [9]; sentence fragments were then passed through the Collins Head Rules, which extract keywords to summarize the sentence. ...
Preprint
Learning and logic are distinct and remarkable approaches to prediction. Machine learning has experienced a surge in popularity because it is robust to noise and achieves high performance; however, ML has many issues with knowledge transfer and extrapolation. In contrast, logic is easily interpreted, and logical rules are easy to chain and transfer between systems; however, inductive logic is brittle to noise. We then explore the premise of combining learning with inductive logic into AI Reasoning Systems. Specifically, we summarize findings from PAC learning (conceptual graphs, robust logics, knowledge infusion) and deep learning (DSRL, $\partial$ILP, DeepLogic) by reproducing proofs of tractability, presenting algorithms in pseudocode, highlighting results, and synthesizing between fields. We conclude with suggestions for integrated models by combining the modules listed above and with a list of unsolved (likely intractable) problems.
... Corpus-based label propagation (CLP) is one of the most commonly used methods for sentiment lexicon generation that uses co-occurrence statistics aggregated from different corpora (news articles, Twitter, etc.) to build the similarity graph in the label propagation algorithms. We used n-gram features from the Signal Media (SM) one million news articles dataset, which contains ∼265K blog articles and ∼734K news articles (Corney et al., 2016), and the North American News (NAN) text corpus (Graff, 1995), which has ∼931K articles from a variety of news sources. The co-occurrence matrix R was computed with a window size of four words. ...
Article
In this paper, we propose an extension to graph-based sentiment lexicon induction methods by incorporating distributed and semantic word representations in building the similarity graph to expand a three-dimensional sentiment lexicon. We also implemented and evaluated the label propagation using four different word representations and similarity metrics. Our comprehensive evaluation of the four approaches was performed on a single data set, demonstrating that all four methods can generate a significant number of new sentiment assignments with high accuracy. The highest correlations (tau=0.51) and the lowest error (mean absolute error < 1.1%), obtained by combining both the semantic and the distributional features, outperformed the distributional-based and semantic-based label-propagation models and approached a supervised algorithm.
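The corpus-based label propagation step quoted in the excerpt above can be sketched end to end: build a co-occurrence matrix with a four-word window, turn it into a similarity graph, and propagate seed polarities over the graph. The toy sentences, seed words, and iteration count below are illustrative assumptions; only the window size of four follows the excerpt.

    # Illustrative sketch of corpus-based label propagation for sentiment
    # lexicon expansion.  Toy data; only the 4-word window follows the paper.
    import numpy as np

    sentences = [["the", "service", "was", "excellent", "and", "friendly"],
                 ["the", "food", "was", "awful", "and", "cold"],
                 ["excellent", "food", "friendly", "staff"]]
    window = 4

    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    R = np.zeros((len(vocab), len(vocab)))           # co-occurrence matrix
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    R[idx[w], idx[s[j]]] += 1

    # cosine similarity between co-occurrence rows defines the edge weights
    rows = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
    W = rows @ rows.T
    np.fill_diagonal(W, 0.0)

    # propagate seed polarities over the graph, keeping the seeds clamped
    seeds = {"excellent": 1.0, "awful": -1.0}
    scores = np.array([seeds.get(w, 0.0) for w in vocab])
    is_seed = np.array([w in seeds for w in vocab])
    for _ in range(20):
        spread = W @ scores / (W.sum(axis=1) + 1e-12)
        scores = np.where(is_seed, scores, spread)

    print({w: round(s, 2) for w, s in zip(vocab, scores)})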
... All embeddings are trained on 22 million tokens from the North American News Text (NANT) corpus (Graff, 1995). We use an initial vocabulary of 50,000 words, with a special UNK token for words that are not among the 50,000 most common. ...
... McClosky et al. (2006) expanded the domain of the standard Penn Treebank ( Marcus et al., 1993) trained BLLIP model, applying self-training to 2.5M sentences from the NANC corpus (Graff, 1995). The resulting model has a large vocabulary, with reliable estimates of probabilities for many words, which provides a useful basis for our investigations. ...
Article
Full-text available
Word embeddings allow natural language processing systems to share statistical information across related words. These embeddings are typically based on distributional statistics, making it difficult for them to generalize to rare or unseen words. We propose to improve word embeddings by incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, we combine morphological and distributional information in a unified probabilistic framework, in which the word embedding is a latent variable. The morphological information provides a prior distribution on the latent word embeddings, which in turn condition a likelihood function over an observed corpus. This approach yields improvements on intrinsic word similarity evaluations, and also in the downstream task of part-of-speech tagging.
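The vocabulary set-up mentioned in the excerpts (a 50,000-word vocabulary with a special UNK token for everything else) is a standard preprocessing step; the short sketch below shows one common way of doing it, with an invented toy corpus.

    # Illustrative sketch: cap the vocabulary at the 50,000 most frequent
    # words and map the rest to an UNK token before training embeddings.
    from collections import Counter

    VOCAB_SIZE = 50_000
    UNK = "<unk>"

    def build_vocab(tokenized_sentences, size=VOCAB_SIZE):
        counts = Counter(tok for sent in tokenized_sentences for tok in sent)
        return {w for w, _ in counts.most_common(size)}

    def map_unk(sentence, keep):
        return [tok if tok in keep else UNK for tok in sentence]

    corpus = [["colorless", "green", "ideas", "sleep", "furiously"]]
    keep = build_vocab(corpus)
    print(map_unk(["green", "gryphons", "sleep"], keep))   # unseen word -> <unk>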
... Different monetary-unit PSIs evoke different perspectives (e.g., that of the buyer versus the seller), represent distinct ranges of FEs (e.g., Earnings versus Assets) and have distinct valences (positive versus negative evaluation). We analyse instances of the target expressions retrieved from the British National Corpus (BNC; Burnard, 2000), ukWaC (Ferraresi et al., 2008), and the North American Newstext corpus (NNTC; Graff, 1995). Table 4 describes the corpora used. ...
Article
Full-text available
A polarity-sensitive item (PSI), as traditionally defined, is an expression that is restricted to either an affirmative or negative context. PSIs like lift a finger and all the time in the world subserve discourse routines like understatement and emphasis. Lexical-semantic classes are increasingly invoked in descriptions of the properties of PSIs. Here, we use English corpus data and the tools of Frame Semantics [Fillmore 1982, 1985] to explore Israel's [2011] observation that the semantic role of a PSI determines how the expression fits into a contextually constructed scalar model. We focus on a class of exceptions implied by Israel's model: cases in which a given PSI displays two countervailing patterns of polarity sensitivity, with attendant differences in scalar entailments. We offer a set of case studies of polarity-sensitive expressions—including verbs of attraction and aversion like can live without, monetary units like a red cent, comparative adjectives and time-span adverbials—that demonstrate that the interpretation of a given PSI in a given polar context is based on multiple considerations, including the speaker's perspective on and affective stance toward the described event, available inferences about causality and, perhaps most critically, particulars of the predication, including the verb or adjective's frame membership, the presence or absence of an ability modal like can, the grammatical construction used and the range of contingencies evoked by the utterance.
... We have used their corpus data for our experiments, and adopt their number classifications. They labelled four corpora in different domains using this classification system, the most extensive being a subset of the North American News Text Corpus (NANTC) (Graff, 1995). ...
Article
Word Sense Disambiguation is a well studied field, with a range of successful methods. However, there has been little work on examining the analogue for numbers, classifying them into senses ('year', 'date', 'telephone number' etc.) based on their context, which is potentially useful for Text to Speech and Information Extraction systems. We extend the semi-supervised Decision List model described by David Yarowsky (1994), bringing the model to a problem on which little work has been done. We report promising results and present a thorough error analysis which highlights several areas where the current methodology needs to be extended to deal better with number senses. We conclude by proposing several directions for future work.
... Two different genres were collected: newspaper articles and accident reports written by government officials. The newspaper articles are from the North American News Corpus [Graff, 1995], with the topic of the articles being earthquakes. The accident reports consist of aviation accidents and are taken from the National Transportation Safety Board aviation accident database 2 . ...
... We use the standard divisions: Sections 2 through 21 are used for training, section 24 is held-out development, and section 23 is used for final testing. Our unlabeled data is the North American News Text corpus, NANC (Graff, 1995), which is approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information. ...
Conference Paper
Full-text available
We present a simple, but surprisingly effective, method of self-training a two-phase parser-reranker system using readily available unlabeled data. We show that this type of bootstrapping is possible for parsing when the bootstrapped parses are processed by a discriminative reranker. Our improved model achieves an f-score of 92.1%, an absolute 1.1% improvement (12% error reduction) over the previous best result for Wall Street Journal parsing. Finally, we provide some analysis to better understand the phenomenon.
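The parser experiments above cannot be reproduced in a few lines, but the self-training recipe itself is easy to illustrate. The runnable toy below uses a Naive Bayes text classifier in place of the parser-reranker pair: label the unlabeled pool with the current model, keep only the confident predictions, add them to the training set, and retrain. The documents, labels, and confidence threshold are invented for illustration.

    # Toy analogue of self-training, with a Naive Bayes text classifier
    # standing in for the parser/reranker.  All data here is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled = ["stocks rose sharply", "the team won the match",
               "shares fell on weak earnings", "the striker scored twice"]
    labels = ["finance", "sport", "finance", "sport"]
    unlabeled = ["stocks fell on earnings news", "the team scored in the match"]

    vec = CountVectorizer().fit(labeled + unlabeled)
    model = MultinomialNB().fit(vec.transform(labeled), labels)

    # one round of self-training: trust only confident predictions on the pool
    probs = model.predict_proba(vec.transform(unlabeled))
    preds = model.predict(vec.transform(unlabeled))
    confident = probs.max(axis=1) > 0.6
    aug_texts = labeled + [s for s, c in zip(unlabeled, confident) if c]
    aug_labels = labels + [p for p, c in zip(preds, confident) if c]

    model = MultinomialNB().fit(vec.transform(aug_texts), aug_labels)
    print(sum(confident), "pool sentences added to the training set")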
... The models consisted of 6015 clustered states and 96,240 Gaussian densities. The baseline LM used for the experiments conducted here was built from more than 350 million words obtained from Broadcast news [10], North American news corpus [11] and Canadian English language Hansard corpus [12]. The dictionary was built from the 20000 most frequently occurring words in the aforementioned database. ...
Conference Paper
Full-text available
This paper is concerned with combining models for decoding an optimum translation for a dictation based machine aided human translation (MAHT) task. Statistical language model (SLM) probabilities in automatic speech recognition (ASR) are updated using statistical machine translation (SMT) model probabilities. The effect of this procedure is evaluated for utterances from human translators dictating translations of source language documents. It is shown that computational complexity is significantly reduced while at the same time word error rate is reduced by 30%.
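The combination idea (updating the recognizer's language-model scores with translation-model evidence from the source document) can be illustrated with a minimal rescoring example; the hypotheses, scores, and interpolation weight below are invented numbers, not the paper's models.

    # Illustrative sketch: rescore ASR hypotheses by interpolating ASR and
    # SMT log-probabilities.  All numbers below are invented.
    asr_scores = {"the house rose": -4.1,        # ASR log-probabilities
                  "the house arose": -5.0,
                  "the houses rows": -3.9}       # ASR 1-best, but a mistranscription
    smt_scores = {"the house rose": -2.0,        # translation-model log-probabilities
                  "the house arose": -2.3,       # given the source-language sentence
                  "the houses rows": -7.5}

    lam = 0.5                                    # interpolation weight
    combined = {h: (1 - lam) * asr_scores[h] + lam * smt_scores[h]
                for h in asr_scores}
    best = max(combined, key=combined.get)
    print(best)   # SMT evidence promotes "the house rose" over the ASR 1-best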
... In Bod (2007) we have shown how a parse forest of binary trees can be converted into a compact PCFG in the vein of Goodman (2003), which we summarize in the next section. The PCFG reduction of parse forests allowed us to induce trees for very large corpora in Bod (2007), such as the four-million-sentence NANC corpus (Graff 1995). These large experiments could also be accomplished thanks to an efficient estimator known as DOP* (Zollmann and Sima'an 2005). ...
Article
Full-text available
We present a new unsupervised syntax-based MT system, termed U-DOT, which uses the unsupervised U-DOP model for learning paired trees, and which computes the most probable target sentence from the relative frequencies of paired subtrees. We test U-DOT on the German-English Europarl corpus, showing that it outperforms the state-of-the-art phrase-based Pharaoh system. We demonstrate that the inclusion of noncontiguous phrases significantly improves the translation accuracy. This paper presents the first translation results with the data-oriented translation (DOT) model on the Europarl corpus, to the best of our knowledge.
... While it is common practice to smooth a document's LM with the collection LM (as a prior) to make the LM more robust, as shown above, the collection LM is also estimated by maximum likelihood and so may also suffer from sparse data problems in the case of small collections. To investigate whether collections smoothing could help, we tried linearly mixing the collection with two larger text corpora: 40,000 sentences from the Wall Street Journal as found in the Penn Treebank [11], and 450,000 sentences (with automatically induced sentence boundaries) taken from the North American News Corpus (NANC) [6]. This introduced three additional hyper-parameters specifying integer mixing ratios between the collection, WSJ, and NANC corpora. ...
Article
Brown's entry to the Cross-Language Speech Retrieval (CL-SR) track at the 2007 Cross Language Evaluation Forum (CLEF) was based on the language model (LM) paradigm for retrieval [17]. For English, our system introduced two minor enhancements to the basic unigram: we extended Dirichlet smoothing (popular with unigram modeling) to bigrams, and we smoothed the collection LM to compensate for the small collection size. For Czech, time-constraints restricted us to using a basic unigram model, though we did apply Czech-specific stemming. While our English system performed well in the evaluation and showed the utility of our enhancements, several aspects of it were rushed and need to be addressed in future work. Our Czech system did not perform competitively but did provide us with a useful first experience in non-English retrieval.
... 1. The British National Corpus (BNC) 2. The North American News Text Corpus (NANT) (Graff, 1995) 3. The Guardian corpus 4. The Reuters corpus (Rose et al., 2002) 5. The data used for two Text Retrieval Evaluation Conferences: TREC-4 and TREC-5 ...
Article
Full-text available
We introduce a large computational subcategorization lexicon which includes subcategorization frame (SCF) and frequency information for 6,397 English verbs. This extensive lexicon was acquired automatically from five corpora and the Web using the current version of the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997). The lexicon is provided freely for research use, along with a script which can be used to filter and build sub-lexicons suited for different natural language processing (NLP) purposes. Documentation is also provided which explains each sub-lexicon option and evaluates its accuracy.
... One interesting idea is to use a free open source word list such as "linux.words" with 20,000 entries or the CMU (Carnegie Mellon University) word list with 120,000 words [32], or even a more massive corpus such as the North American News Corpus, which contains over 500,000 unique tokens [33]. Despite their availability, these word lists do not contain word statistics and data counts about word sequences such as n-grams. ...
Article
Spell-checking is the process of detecting and sometimes providing suggestions for incorrectly spelled words in a text. Basically, the larger the dictionary of a spell-checker is, the higher is the error detection rate; otherwise, misspellings would pass undetected. Unfortunately, traditional dictionaries suffer from out-of-vocabulary and data sparseness problems as they do not encompass large vocabulary of words indispensable to cover proper names, domain-specific terms, technical jargons, special acronyms, and terminologies. As a result, spell-checkers will incur low error detection and correction rate and will fail to flag all errors in the text. This paper proposes a new parallel shared-memory spell-checking algorithm that uses rich real-world word statistics from Yahoo! N-Grams Dataset to correct non-word and real-word errors in computer text. Essentially, the proposed algorithm can be divided into three sub-algorithms that run in a parallel fashion: The error detection algorithm that detects misspellings, the candidates generation algorithm that generates correction suggestions, and the error correction algorithm that performs contextual error correction. Experiments conducted on a set of text articles containing misspellings, showed a remarkable spelling error correction rate that resulted in a radical reduction of both non-word and real-word errors in electronic text. In a further study, the proposed algorithm is to be optimized for message-passing systems so as to become more flexible and less costly to scale over distributed machines.
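The three stages named in the abstract (error detection, candidate generation, contextual correction) can be sketched sequentially in a few lines. The toy below detects non-words against a small dictionary, generates edit-distance-1 candidates, and ranks them with bigram counts; the word list and counts are invented stand-ins for the Yahoo! N-Grams statistics, and the sketch is serial rather than parallel.

    # Illustrative, serial sketch of dictionary-plus-n-gram spell checking.
    # The dictionary and bigram counts are tiny invented stand-ins.
    from collections import Counter

    dictionary = {"the", "cat", "sat", "on", "mat", "hat"}
    bigram_counts = Counter({("the", "cat"): 20, ("the", "hat"): 5,
                             ("the", "mat"): 8})

    def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
        # all strings reachable by one deletion, substitution or insertion
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
        inserts = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + replaces + inserts)

    def correct(prev_word, word):
        if word in dictionary:                    # detection: known word, keep it
            return word
        candidates = edits1(word) & dictionary    # generation: in-dictionary edits
        if not candidates:
            return word
        # contextual correction: prefer the candidate most often seen after prev_word
        return max(candidates, key=lambda c: bigram_counts[(prev_word, c)])

    print(correct("the", "cqt"))   # -> "cat"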
... In our case of collection expansion, we hope to compensate for collection sparsity by drawing upon "similar" data from external corpora. For this work, we simply leveraged two broad English newspaper corpora: the Wall Street Journal (WSJ) and the North American News Corpus (NANC) [2]. Specifically, we expanded the collection as a linear mixture with 40K sentences (830K words) from WSJ (as found in the Penn Treebank [7]) and 450K sentences (9.5M words) from NANC, with tunable hyperparameters specifying integer mixing ratios between corpora. ...
Conference Paper
We present two simple but effective smoothing techniques for the standard language model (LM) approach to information retrieval (12). First, we extend the unigram Dirichlet smoothing technique popular in IR (17) to bigram modeling (16). Second, we propose a method of collection expansion for more robust estimation of the LM prior, particularly intended for sparse collections. Retrieval experiments on the MALACH archive (9) of automatically transcribed and manually summarized spontaneous speech interviews demonstrate strong overall system performance and the relative contribution of our extensions.
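Both ideas in this abstract, Dirichlet smoothing of a document model against a collection model and expanding a sparse collection model by mixing in larger external corpora, fit in a short runnable sketch. The documents, the stand-ins for WSJ and NANC, the mixing ratios, and the value of mu below are illustrative assumptions.

    # Illustrative sketch: Dirichlet-smoothed document language model over a
    # collection model that has itself been expanded with external corpora.
    from collections import Counter

    def mle(counter):
        total = sum(counter.values())
        return {w: c / total for w, c in counter.items()}

    doc = Counter("the survivors described the earthquake".split())
    collection = Counter("the interviews describe wartime experiences".split())
    wsj = Counter("the market fell after the earthquake report".split())     # stand-in
    nanc = Counter("officials described damage from the earthquake".split()) # stand-in

    def mix(models, weights):
        # linear mixture of unigram models with integer mixing ratios
        vocab = set().union(*models)
        z = sum(weights)
        return {w: sum(wt * m.get(w, 0.0) for m, wt in zip(models, weights)) / z
                for w in vocab}

    # "collection expansion": mix the sparse collection with WSJ and NANC (2:1:1)
    p_coll = mix([mle(collection), mle(wsj), mle(nanc)], [2, 1, 1])

    # Dirichlet smoothing: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
    mu = 100.0
    doc_len = sum(doc.values())
    def p_w_given_d(w):
        return (doc[w] + mu * p_coll.get(w, 0.0)) / (doc_len + mu)

    for w in ["earthquake", "market", "unseen"]:
        print(w, round(p_w_given_d(w), 5))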
... We use the Charniak and Johnson reranking parser (outlined below), though we expect many of these results to generalize to other generative parsers and discriminative rerankers. Our corpora consist of WSJ for labeled data and NANC (North American News Text Corpus, Graff (1995)) for unlabeled data. We use the standard WSJ division for parsing: sections 2-21 for training (39,382 sentences) and section 24 for development (1,346 sentences). ...
Conference Paper
Full-text available
Self-training has been shown capable of improving on state-of-the-art parser performance (McClosky et al., 2006) despite the conventional wisdom on the matter and several studies to the contrary (Charniak, 1997; Steedman et al., 2003). However, it has remained unclear when and why self-training is helpful. In this paper, we test four hypotheses (namely, presence of a phase transition, impact of search errors, value of non-generative reranker features, and effects of unknown words). From these experiments, we gain a better understanding of why self-training works for parsing. Since improvements from self-training are correlated with unknown bigrams and biheads but not unknown words, the benefit of self-training appears most influenced by seeing known words in new combinations.
... T3 provides classification of 357 verbs into 11 top level, 14 second level, and 32 third level classes. For each verb appearing in T1-T3, we extracted all the occurrences (up to 10,000) from the British National Corpus (Leech, 1992) and North American News Text Corpus (Graff, 1995). ...
Conference Paper
Most previous research on verb clustering has focussed on acquiring flat classifications from corpus data, although many manually built classifications are taxonomic in nature. Natural Language Processing (NLP) applications also benefit from taxonomic classifications because they vary in terms of the granularity they require from a classification. We introduce a new clustering method called Hierarchical Graph Factorization Clustering (HGFC) and extend it so that it is optimal for the task. Our results show that HGFC outperforms the frequently used agglomerative clustering on a hierarchical test set extracted from VerbNet, and that it yields state-of-the-art performance also on a flat test set. We demonstrate how the method can be used to acquire novel classifications as well as to extend existing ones on the basis of some prior knowledge about the classification.
... Otherwise, in the case of a large class imbalance, the evaluation measure would be dominated by the classes with a large population. ... (Korhonen et al., 2006). The data was gathered from five corpora, including e.g. the British National Corpus (Leech, 1992) and the North American News Text Corpus (Graff, 1995). The average frequency of verbs in T1 was 1448 and T2 2166, showing that T1 is a more sparse data set. ...
Conference Paper
In previous research in automatic verb classification, syntactic features have proved the most useful features, although manual classifications rely heavily on semantic features. We show, in contrast with previous work, that considerable additional improvement can be obtained by using semantic features in automatic classification: verb selectional preferences acquired from corpus data using a fully unsupervised method. We report these promising results using a new framework for verb clustering which incorporates a recent subcategorization acquisition system, rich syntactic-semantic feature sets, and a variation of spectral clustering which performs particularly well in high dimensional feature space.
... For English, we have processed the North American News Text corpus (Graff, 1995) (without the WSJ section) with the Stanford segmenter and tokenizer (Toutanova et al., 2003). For Czech, we have used the SYN2005 part of the Czech National Corpus (CNC, 2005) (with the original segmentation and tokenization). ...
Conference Paper
Full-text available
This paper describes POS tagging experiments with semi-supervised training as an extension to the (supervised) averaged perceptron algorithm, first introduced for this task by (Collins, 2002). Experiments with an iterative training on a standard-sized supervised (manually annotated) dataset (10^6 tokens) combined with a relatively modest (in the order of 10^8 tokens) unsupervised (plain) dataset in a bagging-like fashion showed significant improvement of the POS classification task on typologically different languages, yielding better than state-of-the-art results for English and Czech (4.12% and 4.86% relative error reduction, respectively; absolute accuracies being 97.44% and 95.89%).
... Our algorithm requires large amounts of data to gather argument structure and collocation patterns . For the statistics gathering phase of the clause detection algorithm, we used 4.5M sentences of the NANC (Graff, 1995 ) corpus, bounding their length in the same manner. In order to extract collocations, we used 2M sentences from the British National Corpus (Burnard, 2000) and about 29M sentences from the Dmoz corpus (Gabrilovich and Markovitch, 2005). ...
Conference Paper
The task of Semantic Role Labeling (SRL) is often divided into two sub-tasks: verb argument identification, and argument classification. Current SRL algorithms show lower results on the identification sub-task. Moreover, most SRL algorithms are supervised, relying on large amounts of manually created data. In this paper we present an unsupervised algorithm for identifying verb arguments, where the only type of annotation required is POS tagging. The algorithm makes use of a fully unsupervised syntactic parser, using its output in order to detect clauses and gather candidate argument collocation statistics. We evaluate our algorithm on PropBank10, achieving a precision of 56%, as opposed to 47% of a strong baseline. We also obtain an 8% increase in precision for a Spanish corpus. This is the first paper that tackles unsupervised verb argument identification without using manually encoded rules or extensive lexical or syntactic resources.
... While we do not achieve as high an f-score as the UML-DOP model in Bod (2006), we will show that U-DOP* can operate without subtree sampling, and that the model can be trained on corpora that are two orders of magnitude larger than in Bod (2006). We will extend our experiments to 4 million sentences from the NANC corpus (Graff 1995), showing that an f-score of 70.7% can be obtained on the standard Penn WSJ test set by means of unsupervised parsing. Moreover, U-DOP* can be directly put to use in bootstrapping structures for concrete applications such as syntax-based machine translation and speech recognition. ...
Conference Paper
Full-text available
How far can we get with unsupervised parsing if we make our training corpus several orders of magnitude larger than has hitherto been attempted? We present a new algorithm for unsupervised parsing using an all-subtrees model, termed U-DOP*, which parses directly with packed forests of all binary trees. We train both on Penn's WSJ data and on the (much larger) NANC corpus, showing that U-DOP* outperforms a treebank-PCFG on the standard WSJ test set. While U-DOP* performs worse than state-of-the-art supervised parsers on hand-annotated sentences, we show that the model outperforms supervised parsers when evaluated as a language model in syntax-based machine translation on Europarl. We argue that supervised parsers miss the fluidity between constituents and non-constituents and that in the field of syntax-based language modeling the end of supervised parsing has come in sight.
... We also find that sentences with technical content are more likely to be related than non-technical sentences. We label an utterance as technical if it contains a web address, a long string of digits, or a term present in a guide for novice Linux users but not in a large news corpus (Graff, 1995). This is a light-weight way to capture one "semantic dimension" or cluster of related words, in a corpus which is not amenable to full LSA or similar techniques. ...
Conference Paper
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. This is, to our knowledge, the first such corpus for internet chat. We propose a graph-theoretic model for disentanglement, using discourse-based features which have not been previously applied to this task. The model's predicted disentanglements are highly correlated with manual annotations.
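The "technical utterance" heuristic quoted in the excerpt above is simple enough to show directly: flag an utterance if it contains a URL, a long digit string, or a word that appears in a technical word list but not in a general news vocabulary. The two word lists below are tiny invented stand-ins for the Linux guide and the news corpus.

    # Illustrative sketch of the technical-utterance heuristic; the word
    # lists are tiny stand-ins for the Linux guide and the news vocabulary.
    import re

    URL = re.compile(r"https?://\S+|www\.\S+")
    LONG_DIGITS = re.compile(r"\d{5,}")

    linux_terms = {"grub", "kernel", "sudo", "xorg", "initrd"}       # stand-in
    news_vocab = {"the", "kernel", "government", "said", "market"}   # stand-in
    tech_only = linux_terms - news_vocab      # technical terms absent from news text

    def is_technical(utterance):
        if URL.search(utterance) or LONG_DIGITS.search(utterance):
            return True
        tokens = re.findall(r"[a-z]+", utterance.lower())
        return any(tok in tech_only for tok in tokens)

    print(is_technical("edit /boot/grub/menu.lst and run sudo update-grub"))  # True
    print(is_technical("did you see the match last night?"))                  # False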
... This simple heuristic yields very high performance on punctuation, scoring (when all other words are assumed to be tagged perfectly) 99.6% (99.1%) 1-to-1 accuracy when evaluated against the English fine (coarse) POS tag sets, and 97.2% when evaluated against the German POS tag set. For English, we trained our model on the 39832 sentences which constitute sections 2-21 of the PTB-WSJ and on the 500K sentences from the NYT section of the NANC newswire corpus (Graff, 1995). ...
Conference Paper
We present a novel fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these clusters. We utilize morphological and distributional representations computed in a fully unsupervised manner. We evaluate our algorithm on English and German, achieving the best reported results for this task.
... In addition to labeled news data, we make use of a large quantity of unlabeled news data. The unlabeled data is the North American News Corpus, NANC (Graff, 1995), which is approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information and sentence boundaries are induced by a simple discriminative model. ...
Conference Paper
Full-text available
Statistical parsers trained and tested on the Penn Wall Street Journal (WSJ) treebank have shown vast improvements over the last 10 years. Much of this improvement, however, is based upon an ever-increasing number of features to be trained on (typically) the WSJ treebank data. This has led to concern that such parsers may be too finely tuned to this corpus at the expense of portability to other genres. Such worries have merit. The standard "Charniak parser" checks in at a labeled precision-recall f-measure of 89.7% on the Penn WSJ test set, but only 82.9% on the test set from the Brown treebank corpus.This paper should allay these fears. In particular, we show that the reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%. Furthermore, use of the self-training techniques described in (McClosky et al., 2006) raise this to 87.8% (an error reduction of 28%) again without any use of labeled Brown data. This is remarkable since training the parser and reranker on labeled Brown data achieves only 88.4%.
... Sections 00 and 23 were used for development and test evaluation . A further 113,346,430 tokens (4,566,241 sentences) of raw data from the Wall Street Journal section of the North American News Corpus (Graff, 1995) were parsed to produce the training data for adaptation. This text was tokenised using the C&C tools tokeniser and parsed using our baseline models. ...
Conference Paper
We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy. The idea is to train the supertagger on large amounts of parser output, so that the supertagger can learn to supply the supertags that the parser will eventually choose as part of the highest-scoring derivation. Since the supertagger supplies fewer supertags overall, the parsing speed is increased. We demonstrate the effectiveness of the method using a CCG supertagger and parser, obtaining significant speed increases on newspaper text with no loss in accuracy. We also show that the method can be used to adapt the CCG parser to new domains, obtaining accuracy and speed improvements for Wikipedia and biomedical text.
... The average number of words per argument is 5.1. The NANC (Graff, 1995) corpus was used as a training set. Only sentences of length not greater than 10 excluding punctuation were used (see Section 3.2), totaling 4955181 sentences. ...
Conference Paper
The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupervised algorithm for the task that uses no supervised models, utilizing instead state-of-the-art syntactic induction algorithms. This is the first work to tackle this task in a fully unsupervised scenario.
... We use the news articles portion of the Wall Street Journal corpus (WSJ) from the Penn Treebank (Marcus et al., 1993) in conjunction with the self-trained North American News Text Corpus (NANC, Graff (1995)). The English Translation Treebank, ETT (Bies, 2007), is the translation of broadcast news in Arabic. ...
Conference Paper
Full-text available
Current statistical parsers tend to perform well only on their training domain and nearby genres. While strong performance on a few related domains is sufficient for many situations, it is advantageous for parsers to be able to generalize to a wide variety of domains. When parsing document collections involving heterogeneous domains (e.g. the web), the optimal parsing model for each document is typically not obvious. We study this problem as a new task: multiple source parser adaptation. Our system trains on corpora from many different domains. It learns not only statistics of those domains but quantitative measures of domain differences and how those differences affect parsing accuracy. Given a specific target text, the resulting system proposes linear combinations of parsing models trained on the source corpora. Tested across six domains, our system outperforms all non-oracle baselines including the best domain-independent parsing model. Thus, we are able to demonstrate the value of customizing parsing models to specific domains.
... The baseline word error rate (WER) obtained for this model using the 5,000 word WSJ bigram language model on the WSJ test set was 6.7%. The baseline language model (LM) used for the dictation system was a trigram LM trained from more than 350 million words of broadcast news [24], the North American news corpus [25], and the Canadian English language Hansard corpus [26]. A 20,000 word vocabulary of the most frequently occurring words in the corpus was used. ...
Article
Full-text available
This paper presents a model for machine-aided human translation (MAHT) that integrates source language text and target language acoustic information to produce the text translation of source language document. It is evaluated on a scenario where a human translator dictates a first draft target language translation of a source language document. Information obtained from the source language document, including translation probabilities derived from statistical machine translation (SMT) and named entity tags derived from named entity recognition (NER), is incorporated with acoustic phonetic information obtained from an automatic speech recognition (ASR) system. One advantage of the system combination used here is that words that are not included in the ASR vocabulary can be correctly decoded by the combined system. The MAHT model and system implementation is presented. It is shown that a relative decrease in word error rate of 29% can be obtained by this combined system relative to the baseline ASR performance on a French to English document translation task in the Hansard domain. In addition, it is shown that transcriptions obtained by using the combined system show a relative increase in NIST score of 34% compared to transcriptions obtained from the baseline ASR system.
... Recognition III (CSR-III); the North American News Text Corpus (NANTC, Graff, 1995); the NANTC supplement (NANTS, MacIntyre, 1998); and the ACQUAINT Corpus (Graff, 2002). The components and their sizes (including punctuation) are given in Table 3. The LDC has recently released the English Gigaword corpus (Graff, 2003), including most of the corpora listed above. ...
Article
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance.
... To demonstrate this we compiled a homogeneous corpus of 1.145 billion words of newspaper and newswire text from three existing corpora: the North American News Text Corpus, NANC (Graff, 1995), the NANC Supplement (MacIntyre, 1998) and the Reuters Corpus Volume 1, RCV1 (Rose et al., 2002). The number of words in each corpus is shown in Table 1. ...
Article
Banko and Brill (2001) suggested that the development of very large training corpora may be more effective for progress in empirical Natural Language Processing than improving methods that use existing smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. When using one billion words, as expected, we do find that many of our estimates do converge to their eventual value. However, we also find that for some words, no such convergence occurs. This leads us to conclude that simply relying upon large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.
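The convergence check the abstract describes amounts to estimating unigram relative frequencies on growing prefixes of the corpus and watching whether each word's estimate settles down. The sketch below does this on a synthetic token stream that stands in for the billion-word corpus; the words, weights, and checkpoints are invented.

    # Illustrative sketch of tracking unigram-probability convergence on
    # growing corpus prefixes.  The token stream is synthetic.
    import random
    from collections import Counter

    random.seed(0)
    tokens = random.choices(["the", "of", "earthquake", "zygote"],
                            weights=[50, 30, 5, 1], k=100_000)

    checkpoints = {1_000, 10_000, 100_000}
    tracked = ["the", "earthquake", "zygote"]
    counts = Counter()
    estimates = {w: [] for w in tracked}

    for i, tok in enumerate(tokens, 1):
        counts[tok] += 1
        if i in checkpoints:
            for w in tracked:
                estimates[w].append(counts[w] / i)   # relative-frequency estimate

    for w, est in estimates.items():
        # frequent words settle quickly; rare words keep drifting for longer
        print(w, [round(e, 5) for e in est])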
Thesis
Full-text available
This research is aimed at studying the role played by discourse structure in text comprehension. In a series of three experiments, volunteer subjects are asked to reorder a set of randomly presented sentences so as to form a coherent text: a narrative text composed of five sentences (first experiment), an expository text composed of five sentences (second experiment), or an expository text composed of nine sentences (third experiment). Each subject is presented with a specific and unique set of scrambled sentences. Results from the first experiment show that referential relationships among entities and/or events contribute to determining text coherence. In a text, sentences organized according to local coherence contain lexical and non-lexical cues that allow subjects to link every pair of sentences to create a particular structure; sentences organized according to global coherence contain lexical and non-lexical cues that allow subjects to relate every sentence to a common topic. Under these conditions a different strategy is applied by subjects in the second and third experiments. Results from the second and third experiments reveal that human subjects are willing to exploit information from the way the text's sentence order is constrained, but in a different way for five-sentence texts and for nine-sentence texts: in five-sentence texts through local coherence and concrete semantic relations, and in nine-sentence texts through global coherence and abstract semantic relations. This difference arises from the fact that, contrary to five-sentence texts, nine-sentence texts represent a heavier cognitive processing load for working memory. Referential relations play an identical role in all the experiments. The results are discussed in relation to Latent Semantic Analysis, Rhetorical Structure Theory and Centering Theory models. No theoretical model can account for all the experimental findings but, setting aside the role played by reference, they are most compatible with a Centering Theory model, since this model is the only one which makes specific predictions about the role played by sentence order.
Chapter
Full-text available
In English and several other European languages, the perfect tense is a complex morpho-syntactic construction made of an auxiliary ("have," "be") followed by a past participle, as in "Jamie has eaten all the chocolate biscuits." The auxiliary appears in the past, present, and future tenses, thus creating past, present, and future perfects. The perfect has been a problematic category for scholars across time due to the multiplicity of its meanings/uses within a given language and to the variation in meanings/uses of what has been labeled "perfect" across languages. In an attempt to provide a clearer understanding of this complex semantic category, this article looks at theories of the perfect, focusing on its semantics and pragmatics. It also discusses whether the perfect is a tense, an aspect, or both; how pragmatic factors and discourse relations affect the use of the perfect; and concludes by examining the place of the perfect in a tense/aspect system more generally, focusing on how it relates to categories such as resultatives and the simple past tense, as well as to habituals and prospectives.
Article
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. We propose a graph-based clustering model for disentanglement, using lexical, timing, and discourse-based features. The model's predicted disentanglements are highly correlated with manual annotations. We conclude by discussing two extensions to the model, specificity tuning and conversation start detection, both of which are promising but do not currently yield practical improvements.
Article
Two experiments investigated sensory/motor-based functional knowledge of man-made objects: manipulation features associated with the actual usage of objects. In Experiment 1, a series of prime-target pairs was presented auditorily, and participants were asked to make a lexical decision on the target word. Participants made a significantly faster decision about the target word (e.g. 'typewriter') following a related prime that shared manipulation features with the target (e.g. 'piano') than an unrelated prime (e.g. 'blanket'). In Experiment 2, participants' eye movements were monitored when they viewed a visual display on a computer screen while listening to a concurrent auditory input. Participants were instructed to simply identify the auditory input and touch the corresponding object on the computer display. Participants fixated an object picture (e.g. "typewriter") related to a target word (e.g. 'piano') significantly more often than an unrelated object picture (e.g. "bucket") as well as a visually matched control (e.g. "couch"). Results of the two experiments suggest that manipulation knowledge of words is retrieved without conscious effort and that manipulation knowledge constitutes a part of the lexical-semantic representation of objects.