Conference Paper

North American News Text Corpus

Authors: David Graff
... I collected sample data pseudo-randomly from various genres: two newspapers of the same date (July 1st, 1996), the first two discussion articles of the same month of the year (July 1996) in CQ Researcher Online (http://library2.cqpress.com/cqresearcher), conversation data from the Switchboard Corpus (Graff et al. 1998: files sw2001 through sw2019.txt), and narrative data from Netlibrary (http://www.netlibrary.com/) ...
... I've seen that, , that's, uh, that was a really good movie. (Graff et al. 1998) The speaker in these examples uses the present perfect to negotiate a topic she wants to talk about. She does so by asking the addressee whether an epistemic pre-condition for having a conversation on her chosen topic is satisfied, e.g., by asking about the extent of the addressee's experience with or knowledge of the topic. ...
... No, I haven't been camping since I was about sixteen. (Graff et al. 1998) (5.28) a. Have you seen DANCING WITH WOLVES? (X = I want to talk about this movie.) ...
... In this paper we describe an implementation of this approach for natural language data, and report on experiments that show that the approach provides quantifiable benefits. The experiments are performed on a natural language corpus of a half million sentences (Graff 1995). We measure performance on a natural language task of the following form: ...
... The scenes employed in the experiments reported are based on text taken from the North American News Text Corpus (Graff 1995), a series of plaintext newspaper articles. Six months worth of articles, comprising approximately a half million sentences, were employed. ...
... The available part of the North American News Text Corpus (Graff 1995) was split into two equal halves, containing different sets of articles. The first half was further split and was used for parameter fitting during the design of the experiments. ...
Conference Paper
Full-text available
A central goal of Artificial Intelligence is to create systems that embody commonsense knowledge in a reliable enough form that it can be used for reasoning in novel situations. Knowledge Infusion is an approach to this problem in which the commonsense knowledge is acquired by learning. In this paper we report on experiments on a corpus of a half million sentences of natural language text that test whether commonsense knowledge can be usefully acquired through this approach. We examine the task of predicting a deleted word from the remainder of a sentence for some 268 target words. As baseline we consider how well this task can be performed using learned rules based on the words within a fixed distance of the target word and their parts of speech. This captures an approach that has been previously demonstrated to be highly successful for a variety of natural language tasks. We then go on to learn from the corpus rules that embody commonsense knowledge, additional to the knowledge used in the baseline case. We show that chaining learned commonsense rules together leads to measurable improvements in prediction performance on our task as compared with the baseline. This is apparently the first experimental demonstration that commonsense knowledge can be learned from natural inputs on a massive scale reliably enough that chaining the learned rules is efficacious for reasoning.
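The baseline described in this abstract (predicting a deleted word from the words within a fixed distance of it) can be illustrated with a small, runnable sketch. The sketch below substitutes a scikit-learn logistic-regression classifier over bag-of-context features for the paper's learned rules; the sentences, window size, and target words are invented stand-ins rather than the authors' data.

    # Illustrative sketch only: a context-window baseline for predicting a
    # deleted word, using logistic regression in place of the paper's learned
    # rules.  Sentences, window size and targets are invented stand-ins.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def window_features(tokens, gap_index, k=3):
        # words within k positions of the gap, keyed by their offset
        feats = {}
        for offset in range(-k, k + 1):
            j = gap_index + offset
            if offset != 0 and 0 <= j < len(tokens):
                feats[f"w[{offset}]={tokens[j]}"] = 1
        return feats

    # training data: sentences with one word deleted; the label is the deleted word
    examples = [("the _ barked at the mailman".split(), 1, "dog"),
                ("a _ chased the ball".split(), 1, "dog"),
                ("the _ meowed on the fence".split(), 1, "cat"),
                ("a _ slept by the fire".split(), 1, "cat")]

    X = [window_features(toks, i) for toks, i, _ in examples]
    y = [label for _, _, label in examples]
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

    test = window_features("the _ barked at a squirrel".split(), 1)
    print(clf.predict(vec.transform([test])))   # expected to favour 'dog'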
... The current New York Times corpora [10,13] are too small to satisfy the need for large-scale datasets for fine-tuning BERT. Specifically, [13] contains only every New York Times front-page story from 1996 to 2006, while [10] consists of sample materials from the New York Times Syndicate between July 1994 and December 1996. The current shortcomings of the New York Times corpus are: first, the time span of the corpus is short, so it cannot provide sufficient data to ensure that the language model learns the pattern of policy changes while also ensuring that the model converges. ...
Article
Full-text available
With the process of economic globalization and political multi-polarization accelerating, it is especially important to predict policy change in the United States. While current research has not taken advantage of the rapid advancement in natural language processing or of the relationship between news media and policy change, we propose a BERT-based model to predict policy change in the United States, using news published by the New York Times. Specifically, we propose a large-scale news corpus from the New York Times that covers the period from 2006 to 2018. We then use the corpus to fine-tune the pre-trained BERT language model to determine whether a news item is on the front page, which corresponds to policy priority. We propose a BERT-based Policy Change Index (BPCI) for the United States to predict policy change over a short future period. Experimental results on the New York Times corpus demonstrate the validity of the proposed method.
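As a concrete illustration of the fine-tuning step this abstract describes, the sketch below sets up BERT for binary front-page classification with the Hugging Face transformers library. The model name, toy headlines, labels, and hyperparameters are all assumptions made for illustration; they are not the authors' configuration.

    # Illustrative sketch: fine-tune BERT to predict whether a news item is a
    # front-page story (a binary proxy for policy priority).  All data and
    # hyperparameters below are invented for illustration.
    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import BertTokenizerFast, BertForSequenceClassification

    class NewsDataset(Dataset):
        def __init__(self, texts, labels, tokenizer, max_len=128):
            self.enc = tokenizer(texts, truncation=True, padding="max_length",
                                 max_length=max_len, return_tensors="pt")
            self.labels = torch.tensor(labels)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: v[i] for k, v in self.enc.items()}
            item["labels"] = self.labels[i]
            return item

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)

    train = NewsDataset(["Senate passes sweeping budget bill",      # 1 = front page
                         "Local bake sale raises record funds"],    # 0 = not
                        [1, 0], tokenizer)
    loader = DataLoader(train, batch_size=2, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for batch in loader:          # a single illustrative pass over the data
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()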
... For each verb in T1 and T2, we extracted all the occurrences (up to 10,000) from the raw corpus data gathered originally for constructing the VALEX lexicon (Korhonen et al., 2006a). The data was gathered from five corpora, including the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences (TREC-4 and TREC-5). The average frequency of verbs in T1 was 1448 and T2 2166, showing that T1 is a more sparse dataset. ...
... For each verb appearing in T3-T5, we extracted all the occurrences (up to 10,000) from the raw corpus data used for constructing VALEX (Korhonen et al., 2006a), including the BNC (Leech, 1992), the Guardian corpus, the Reuters corpus (Rose et al., 2002), the North American News Text Corpus (Graff, 1995) and the data used for two Text Retrieval Evaluation Conferences (TREC-4 and TREC-5). ...
Thesis
Verb classifications have attracted a great deal of interest in both linguistics and natural language processing (NLP). They have proved useful for important tasks and applications, including e.g. computational lexicography, parsing, word sense disambiguation, semantic role labelling, information extraction, question-answering, and machine translation (Swier and Stevenson, 2004; Dang, 2004; Shi and Mihalcea, 2005; Kipper et al., 2008; Zapirain et al., 2008; Rios et al., 2011). Particularly useful are classes which capture generalizations about a range of linguistic properties (e.g. lexical, (morpho-)syntactic, semantic), such as those proposed by Beth Levin (1993). However, full exploitation of such classes in real-world tasks has been limited because no comprehensive or domain-specific lexical classification is available. This thesis investigates how Levin-style lexical semantic classes could be learned automatically from corpus data. Automatic acquisition is cost-effective when it involves either no or minimal supervision and it can be applied to any domain of interest where adequate corpus data is available. We improve on earlier work on automatic verb clustering. We introduce new features and new clustering methods to improve the accuracy and coverage. We evaluate our methods and features on well-established cross-domain datasets in English, on a specific domain of English (the biomedical) and on another language (French), reporting promising results. Finally, our task-based evaluation demonstrates that the automatically acquired lexical classes enable new approaches to some NLP tasks (e.g. metaphor identification) and help to improve the accuracy of existing ones (e.g. argumentative zoning).
... Our approach significantly improves upon the work of Minnen et al. (2000). We also use additional automatically parsed data from the North American News Text Corpus (Graff, 1995), further improving our results. ...
... As with Minnen et al. (2000), we train the language model on the Penn Treebank (Marcus et al., 1993). As far as we know, language modeling always improves with additional training data, so we add data from the North American News Text Corpus (NANC) (Graff, 1995), automatically parsed with the Charniak parser (McClosky et al., 2006), to train our language model on up to 20 million additional words. ...
Conference Paper
We present a method for automatic determiner selection, based on an existing language model. We train on the Penn Treebank and also use additional data from the North American News Text Corpus. Our results are a significant improvement over the previous best.
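The abstract does not spell out the selection procedure, but the general recipe (score each candidate determiner in context with a language model and keep the best) can be sketched with a toy add-one-smoothed bigram model. The corpus, candidate set, and scoring details below are assumptions, not the paper's actual model.

    # Illustrative sketch of determiner selection by language-model scoring.
    # A tiny add-one-smoothed bigram model stands in for the real LM; the
    # training sentences and candidate set are invented.
    import math
    from collections import Counter

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["a", "dog", "barked", "at", "the", "cat"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    V = len(unigrams)

    def avg_logprob(sent):
        # average log-probability per bigram, so that candidates which change
        # the sentence length remain comparable
        toks = ["<s>"] + sent + ["</s>"]
        pairs = list(zip(toks, toks[1:]))
        return sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + V))
                   for p in pairs) / len(pairs)

    def choose_determiner(left, right, candidates=("the", "a", "an", None)):
        # None means "no determiner"; keep the variant the LM scores highest
        scored = [(avg_logprob(left + ([d] if d else []) + right), d)
                  for d in candidates]
        return max(scored, key=lambda x: x[0])[1]

    print(choose_determiner(["dog", "barked", "at"], ["cat"]))   # -> 'the'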
... Articles are provided in an XML format, with the majority of the articles tagged for named entities (persons, places, organizations, titles and topics) so that these named entities are consistent across articles. The LDC also offers the North American News Corpus [9], assembled from varied sources, including the New York Times, the Los Angeles Times, the Wall Street Journal and others. The primary goals of this corpus are support for information retrieval and language modelling, so the count of "words" (almost 350 million tokens) is more important than the number of articles. ...
Preprint
Full-text available
In this research, we continuously collect data from the RSS feeds of traditional news sources. We apply several pre-trained implementations of named entity recognition (NER) tools, quantifying the success of each implementation. We also perform sentiment analysis of each news article at the document, paragraph and sentence level, with the goal of creating a corpus of tagged news articles that is made available to the public through a web interface. Finally, we show how the data in this corpus could be used to identify bias in news reporting.
... In Michael and Valiant's empirical demonstration of knowledge infusion, they create "teaching materials" by turning text from news sources into scenes, as per the Robust Logic framework. The dataset is the North American News Text Corpus [8], comprising six months' worth of articles (roughly 500,000 sentences). Sentences were annotated with the Semantic Role Labeler, an automated tagger [9]; sentence fragments were then passed through the Collins Head Rules, which extract keywords to summarize the sentence. ...
Preprint
Learning and logic are distinct and remarkable approaches to prediction. Machine learning has experienced a surge in popularity because it is robust to noise and achieves high performance; however, ML has many issues with knowledge transfer and extrapolation. In contrast, logic is easily interpreted, and logical rules are easy to chain and transfer between systems; however, inductive logic is brittle to noise. We then explore the premise of combining learning with inductive logic into AI Reasoning Systems. Specifically, we summarize findings from PAC learning (conceptual graphs, robust logics, knowledge infusion) and deep learning (DSRL, $\partial$ILP, DeepLogic) by reproducing proofs of tractability, presenting algorithms in pseudocode, highlighting results, and synthesizing between fields. We conclude with suggestions for integrated models by combining the modules listed above and with a list of unsolved (likely intractable) problems.
... Corpus-based label propagation (CLP) is one of the most commonly used methods for sentiment lexicon generation that uses co-occurrence statistics aggregated from different corpora (news articles, Twitter, etc.) to build the similarity graph in the label propagation algorithms. We used n-gram features from the Signal Media (SM) one million news articles dataset, which contains ∼265K blog articles and ∼734K news articles (Corney et al., 2016), and the North American News (NAN) text corpus (Graff, 1995), which has ∼931K articles from a variety of news sources. The co-occurrence matrix R was computed with a window size of four words. ...
Article
In this paper, we propose an extension to graph-based sentiment lexicon induction methods by incorporating distributed and semantic word representations in building the similarity graph to expand a three-dimensional sentiment lexicon. We also implemented and evaluated the label propagation using four different word representations and similarity metrics. Our comprehensive evaluation of the four approaches was performed on a single data set, demonstrating that all four methods can generate a significant number of new sentiment assignments with high accuracy. The highest correlations (tau=0.51) and the lowest error (mean absolute error < 1.1%), obtained by combining both the semantic and the distributional features, outperformed the distributional-based and semantic-based label-propagation models and approached a supervised algorithm.
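The corpus-based label propagation step quoted in the excerpt above can be sketched end to end: build a co-occurrence matrix with a four-word window, turn it into a similarity graph, and propagate seed polarities over the graph. The toy sentences, seed words, and iteration count below are illustrative assumptions; only the window size of four follows the excerpt.

    # Illustrative sketch of corpus-based label propagation for sentiment
    # lexicon expansion.  Toy data; only the 4-word window follows the paper.
    import numpy as np

    sentences = [["the", "service", "was", "excellent", "and", "friendly"],
                 ["the", "food", "was", "awful", "and", "cold"],
                 ["excellent", "food", "friendly", "staff"]]
    window = 4

    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    R = np.zeros((len(vocab), len(vocab)))           # co-occurrence matrix
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    R[idx[w], idx[s[j]]] += 1

    # cosine similarity between co-occurrence rows defines the edge weights
    rows = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-12)
    W = rows @ rows.T
    np.fill_diagonal(W, 0.0)

    # propagate seed polarities over the graph, keeping the seeds clamped
    seeds = {"excellent": 1.0, "awful": -1.0}
    scores = np.array([seeds.get(w, 0.0) for w in vocab])
    is_seed = np.array([w in seeds for w in vocab])
    for _ in range(20):
        spread = W @ scores / (W.sum(axis=1) + 1e-12)
        scores = np.where(is_seed, scores, spread)

    print({w: round(s, 2) for w, s in zip(vocab, scores)})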
... All embeddings are trained on 22 million tokens from the North American News Text (NANT) corpus (Graff, 1995). We use an initial vocabulary of 50,000 words, with a special UNK token for words that are not among the 50,000 most common. ...
... McClosky et al. (2006) expanded the domain of the standard Penn Treebank ( Marcus et al., 1993) trained BLLIP model, applying self-training to 2.5M sentences from the NANC corpus (Graff, 1995). The resulting model has a large vocabulary, with reliable estimates of probabilities for many words, which provides a useful basis for our investigations. ...
Article
Full-text available
Word embeddings allow natural language processing systems to share statistical information across related words. These embeddings are typically based on distributional statistics, making it difficult for them to generalize to rare or unseen words. We propose to improve word embeddings by incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, we combine morphological and distributional information in a unified probabilistic framework, in which the word embedding is a latent variable. The morphological information provides a prior distribution on the latent word embeddings, which in turn condition a likelihood function over an observed corpus. This approach yields improvements on intrinsic word similarity evaluations, and also in the downstream task of part-of-speech tagging.
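The vocabulary set-up mentioned in the excerpts (a 50,000-word vocabulary with a special UNK token for everything else) is a standard preprocessing step; the short sketch below shows one common way of doing it, with an invented toy corpus.

    # Illustrative sketch: cap the vocabulary at the 50,000 most frequent
    # words and map the rest to an UNK token before training embeddings.
    from collections import Counter

    VOCAB_SIZE = 50_000
    UNK = "<unk>"

    def build_vocab(tokenized_sentences, size=VOCAB_SIZE):
        counts = Counter(tok for sent in tokenized_sentences for tok in sent)
        return {w for w, _ in counts.most_common(size)}

    def map_unk(sentence, keep):
        return [tok if tok in keep else UNK for tok in sentence]

    corpus = [["colorless", "green", "ideas", "sleep", "furiously"]]
    keep = build_vocab(corpus)
    print(map_unk(["green", "gryphons", "sleep"], keep))   # unseen word -> <unk>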
... Different monetary-unit PSIs evoke different perspectives (e.g., that of the buyer versus the seller), represent distinct ranges of FEs (e.g., Earnings versus Assets) and have distinct valences (positive versus negative evaluation). We analyse instances of the target expressions retrieved from the British National Corpus (BNC; Burnard, 2000), ukWaC (Ferraresi et al., 2008), and the North American Newstext corpus (NNTC; Graff, 1995). Table 4 describes the corpora used. ...
Article
Full-text available
A polarity-sensitive item (PSI), as traditionally defined, is an expression that is restricted to either an affirmative or negative context. PSIs like lift a finger and all the time in the world subserve discourse routines like understatement and emphasis. Lexical-semantic classes are increasingly invoked in descriptions of the properties of PSIs. Here, we use English corpus data and the tools of Frame Semantics [Fillmore 1982, 1985] to explore Israel's [2011] observation that the semantic role of a PSI determines how the expression fits into a contextually constructed scalar model. We focus on a class of exceptions implied by Israel's model: cases in which a given PSI displays two countervailing patterns of polarity sensitivity, with attendant differences in scalar entailments. We offer a set of case studies of polarity-sensitive expressions—including verbs of attraction and aversion like can live without, monetary units like a red cent, comparative adjectives and time-span adverbials—that demonstrate that the interpretation of a given PSI in a given polar context is based on multiple considerations, including the speaker's perspective on and affective stance toward the described event, available inferences about causality and, perhaps most critically, particulars of the predication, including the verb or adjective's frame membership, the presence or absence of an ability modal like can, the grammatical construction used and the range of contingencies evoked by the utterance.
... We have used their corpus data for our experiments, and adopt their number classifications. They labelled four corpora in different domains using this classification system, the most extensive being a subset of the North American News Text Corpus (NANTC) (Graff, 1995). ...
Article
Word Sense Disambiguation is a well studied field, with a range of successful methods. However, there has been little work on examining the analogue for numbers, classifying them into senses ('year', 'date', 'telephone number' etc.) based on their context, which is potentially useful for Text to Speech and Information Extraction systems. We extend the semi-supervised Decision List model described by David Yarowsky (1994), bringing the model to a problem on which little work has been done. We report promising results and present a thorough error analysis which highlights several areas where the current methodology needs to be extended to deal better with number senses. We conclude by proposing several directions for future work.
... Two different genres were collected: newspaper articles and accident reports written by government officials. The newspaper articles are from the North American News Corpus [Graff, 1995], with the topic of the articles being earthquakes. The accident reports consist of aviation accidents and are taken from the National Transportation Safety Board aviation accident database 2 . ...
... We use the standard divisions: Sections 2 through 21 are used for training, section 24 is held-out development, and section 23 is used for final testing. Our unlabeled data is the North American News Text corpus, NANC (Graff, 1995), which is approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information. ...
Conference Paper
Full-text available
We present a simple, but surprisingly effective, method of self-training a two-phase parser-reranker system using readily available unlabeled data. We show that this type of bootstrapping is possible for parsing when the bootstrapped parses are processed by a discriminative reranker. Our improved model achieves an f-score of 92.1%, an absolute 1.1% improvement (12% error reduction) over the previous best result for Wall Street Journal parsing. Finally, we provide some analysis to better understand the phenomenon.
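The parser experiments above cannot be reproduced in a few lines, but the self-training recipe itself is easy to illustrate. The runnable toy below uses a Naive Bayes text classifier in place of the parser-reranker pair: label the unlabeled pool with the current model, keep only the confident predictions, add them to the training set, and retrain. The documents, labels, and confidence threshold are invented for illustration.

    # Toy analogue of self-training, with a Naive Bayes text classifier
    # standing in for the parser/reranker.  All data here is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled = ["stocks rose sharply", "the team won the match",
               "shares fell on weak earnings", "the striker scored twice"]
    labels = ["finance", "sport", "finance", "sport"]
    unlabeled = ["stocks fell on earnings news", "the team scored in the match"]

    vec = CountVectorizer().fit(labeled + unlabeled)
    model = MultinomialNB().fit(vec.transform(labeled), labels)

    # one round of self-training: trust only confident predictions on the pool
    probs = model.predict_proba(vec.transform(unlabeled))
    preds = model.predict(vec.transform(unlabeled))
    confident = probs.max(axis=1) > 0.6
    aug_texts = labeled + [s for s, c in zip(unlabeled, confident) if c]
    aug_labels = labels + [p for p, c in zip(preds, confident) if c]

    model = MultinomialNB().fit(vec.transform(aug_texts), aug_labels)
    print(sum(confident), "pool sentences added to the training set")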
... The models consisted of 6015 clustered states and 96,240 Gaussian densities. The baseline LM used for the experiments conducted here was built from more than 350 million words obtained from Broadcast news [10], North American news corpus [11] and Canadian English language Hansard corpus [12]. The dictionary was built from the 20000 most frequently occurring words in the aforementioned database. ...
Conference Paper
Full-text available
This paper is concerned with combining models for decoding an optimum translation for a dictation based machine aided human translation (MAHT) task. Statistical language model (SLM) probabilities in automatic speech recognition (ASR) are updated using statistical machine translation (SMT) model probabilities. The effect of this procedure is evaluated for utterances from human translators dictating translations of source language documents. It is shown that computational complexity is significantly reduced while at the same time word error rate is reduced by 30%.
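The combination idea (updating the recognizer's language-model scores with translation-model evidence from the source document) can be illustrated with a minimal rescoring example; the hypotheses, scores, and interpolation weight below are invented numbers, not the paper's models.

    # Illustrative sketch: rescore ASR hypotheses by interpolating ASR and
    # SMT log-probabilities.  All numbers below are invented.
    asr_scores = {"the house rose": -4.1,        # ASR log-probabilities
                  "the house arose": -5.0,
                  "the houses rows": -3.9}       # ASR 1-best, but a mistranscription
    smt_scores = {"the house rose": -2.0,        # translation-model log-probabilities
                  "the house arose": -2.3,       # given the source-language sentence
                  "the houses rows": -7.5}

    lam = 0.5                                    # interpolation weight
    combined = {h: (1 - lam) * asr_scores[h] + lam * smt_scores[h]
                for h in asr_scores}
    best = max(combined, key=combined.get)
    print(best)   # SMT evidence promotes "the house rose" over the ASR 1-best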
... In Bod (2007) we have shown how a parse forest of binary trees can be converted into a compact PCFG in the vein of Goodman (2003), which we summarize in the next section. The PCFG reduction of parse forests allowed us to induce trees for very large corpora in Bod (2007), such as the four-million-sentence NANC corpus (Graff 1995). These large experiments could also be accomplished thanks to an efficient estimator known as DOP* (Zollmann and Sima'an 2005). ...
Article
Full-text available
We present a new unsupervised syntax-based MT system, termed U-DOT, which uses the unsupervised U-DOP model for learning paired trees, and which computes the most probable target sentence from the relative frequencies of paired subtrees. We test U-DOT on the German-English Europarl corpus, showing that it outperforms the state-of-the-art phrase-based Pharaoh system. We demonstrate that the inclusion of noncontiguous phrases significantly improves the translation accuracy. This paper presents the first translation results with the data-oriented translation (DOT) model on the Europarl corpus, to the best of our knowledge.
... While it is common practice to smooth a document's LM with the collection LM (as a prior) to make the LM more robust, as shown above, the collection LM is also estimated by maximum likelihood and so may also suffer from sparse data problems in the case of small collections. To investigate whether collections smoothing could help, we tried linearly mixing the collection with two larger text corpora: 40,000 sentences from the Wall Street Journal as found in the Penn Treebank [11], and 450,000 sentences (with automatically induced sentence boundaries) taken from the North American News Corpus (NANC) [6]. This introduced three additional hyper-parameters specifying integer mixing ratios between the collection, WSJ, and NANC corpora. ...
Article
Brown's entry to the Cross-Language Speech Retrieval (CL-SR) track at the 2007 Cross Language Evaluation Forum (CLEF) was based on the language model (LM) paradigm for retrieval [17]. For English, our system introduced two minor enhancements to the basic unigram: we extended Dirichlet smoothing (popular with unigram modeling) to bigrams, and we smoothed the collection LM to compensate for the small collection size. For Czech, time-constraints restricted us to using a basic unigram model, though we did apply Czech-specific stemming. While our English system performed well in the evaluation and showed the utility of our enhancements, several aspects of it were rushed and need to be addressed in future work. Our Czech system did not perform competitively but did provide us with a useful first experience in non-English retrieval.
... 1. The British National Corpus (BNC) 2. The North American News Text Corpus (NANT) (Graff, 1995) 3. The Guardian corpus 4. The Reuters corpus (Rose et al., 2002) 5. The data used for two Text Retrieval Evaluation Conferences: TREC-4 and TREC-5 ...
Article
Full-text available
We introduce a large computational subcategorization lexicon which includes subcategorization frame (SCF) and frequency information for 6,397 English verbs. This extensive lexicon was acquired automatically from five corpora and the Web using the current version of the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997). The lexicon is provided freely for research use, along with a script which can be used to filter and build sub-lexicons suited for different natural language processing (NLP) purposes. Documentation is also provided which explains each sub-lexicon option and evaluates its accuracy.
... One interesting idea is to use a free open source word list such as "linux.words" with 20,000 entries or the CMU (Carnegie Mellon University) word list with 120,000 words [32], or even a more massive corpus such as the North American News Corpus, which contains over 500,000 unique tokens [33]. Despite their availability, these word lists do not contain word statistics and data counts about word sequences such as n-grams. ...
Article
Spell-checking is the process of detecting and sometimes providing suggestions for incorrectly spelled words in a text. Basically, the larger the dictionary of a spell-checker is, the higher is the error detection rate; otherwise, misspellings would pass undetected. Unfortunately, traditional dictionaries suffer from out-of-vocabulary and data sparseness problems as they do not encompass large vocabulary of words indispensable to cover proper names, domain-specific terms, technical jargons, special acronyms, and terminologies. As a result, spell-checkers will incur low error detection and correction rate and will fail to flag all errors in the text. This paper proposes a new parallel shared-memory spell-checking algorithm that uses rich real-world word statistics from Yahoo! N-Grams Dataset to correct non-word and real-word errors in computer text. Essentially, the proposed algorithm can be divided into three sub-algorithms that run in a parallel fashion: The error detection algorithm that detects misspellings, the candidates generation algorithm that generates correction suggestions, and the error correction algorithm that performs contextual error correction. Experiments conducted on a set of text articles containing misspellings, showed a remarkable spelling error correction rate that resulted in a radical reduction of both non-word and real-word errors in electronic text. In a further study, the proposed algorithm is to be optimized for message-passing systems so as to become more flexible and less costly to scale over distributed machines.
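The three stages named in the abstract (error detection, candidate generation, contextual correction) can be sketched sequentially in a few lines. The toy below detects non-words against a small dictionary, generates edit-distance-1 candidates, and ranks them with bigram counts; the word list and counts are invented stand-ins for the Yahoo! N-Grams statistics, and the sketch is serial rather than parallel.

    # Illustrative, serial sketch of dictionary-plus-n-gram spell checking.
    # The dictionary and bigram counts are tiny invented stand-ins.
    from collections import Counter

    dictionary = {"the", "cat", "sat", "on", "mat", "hat"}
    bigram_counts = Counter({("the", "cat"): 20, ("the", "hat"): 5,
                             ("the", "mat"): 8})

    def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
        # all strings reachable by one deletion, substitution or insertion
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
        inserts = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + replaces + inserts)

    def correct(prev_word, word):
        if word in dictionary:                    # detection: known word, keep it
            return word
        candidates = edits1(word) & dictionary    # generation: in-dictionary edits
        if not candidates:
            return word
        # contextual correction: prefer the candidate most often seen after prev_word
        return max(candidates, key=lambda c: bigram_counts[(prev_word, c)])

    print(correct("the", "cqt"))   # -> "cat"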
... In our case of collection expansion, we hope to compensate for collection sparsity by drawing upon "similar" data from external corpora. For this work, we simply leveraged two broad English newspaper corpora: the Wall Street Journal (WSJ) and the North American News Corpus (NANC) [2]. Specifically, we expanded the collection as a linear mixture with 40K sentences (830K words) from WSJ (as found in the Penn Treebank [7]) and 450K sentences (9.5M words) from NANC, with tunable hyperparameters specifying integer mixing ratios between corpora. ...
Conference Paper
We present two simple but effective smoothing techniques for the standard language model (LM) approach to information retrieval (12). First, we extend the unigram Dirichlet smoothing technique popular in IR (17) to bigram modeling (16). Second, we propose a method of collection expansion for more robust estimation of the LM prior, particularly intended for sparse collections. Retrieval experiments on the MALACH archive (9) of automatically transcribed and manually summarized spontaneous speech interviews demonstrate strong overall system performance and the relative contribution of our extensions.
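Both ideas in this abstract, Dirichlet smoothing of a document model against a collection model and expanding a sparse collection model by mixing in larger external corpora, fit in a short runnable sketch. The documents, the stand-ins for WSJ and NANC, the mixing ratios, and the value of mu below are illustrative assumptions.

    # Illustrative sketch: Dirichlet-smoothed document language model over a
    # collection model that has itself been expanded with external corpora.
    from collections import Counter

    def mle(counter):
        total = sum(counter.values())
        return {w: c / total for w, c in counter.items()}

    doc = Counter("the survivors described the earthquake".split())
    collection = Counter("the interviews describe wartime experiences".split())
    wsj = Counter("the market fell after the earthquake report".split())     # stand-in
    nanc = Counter("officials described damage from the earthquake".split()) # stand-in

    def mix(models, weights):
        # linear mixture of unigram models with integer mixing ratios
        vocab = set().union(*models)
        z = sum(weights)
        return {w: sum(wt * m.get(w, 0.0) for m, wt in zip(models, weights)) / z
                for w in vocab}

    # "collection expansion": mix the sparse collection with WSJ and NANC (2:1:1)
    p_coll = mix([mle(collection), mle(wsj), mle(nanc)], [2, 1, 1])

    # Dirichlet smoothing: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
    mu = 100.0
    doc_len = sum(doc.values())
    def p_w_given_d(w):
        return (doc[w] + mu * p_coll.get(w, 0.0)) / (doc_len + mu)

    for w in ["earthquake", "market", "unseen"]:
        print(w, round(p_w_given_d(w), 5))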
... We use the Charniak and Johnson reranking parser (outlined below), though we expect many of these results to generalize to other generative parsers and discriminative rerankers. Our corpora consist of WSJ for labeled data and NANC (North American News Text Corpus, Graff (1995)) for unlabeled data. We use the standard WSJ division for parsing: sections 2-21 for training (39,382 sentences) and section 24 for development (1,346 sentences). ...
Conference Paper
Full-text available
Self-training has been shown capable of improving on state-of-the-art parser performance (McClosky et al., 2006) despite the conventional wisdom on the matter and several studies to the contrary (Charniak, 1997; Steedman et al., 2003). However, it has remained unclear when and why self-training is helpful. In this paper, we test four hypotheses (namely, presence of a phase transition, impact of search errors, value of non-generative reranker features, and effects of unknown words). From these experiments, we gain a better understanding of why self-training works for parsing. Since improvements from self-training are correlated with unknown bigrams and biheads but not unknown words, the benefit of self-training appears most influenced by seeing known words in new combinations.
... T3 provides classification of 357 verbs into 11 top level, 14 second level, and 32 third level classes. For each verb appearing in T1-T3, we extracted all the occurrences (up to 10,000) from the British National Corpus (Leech, 1992) and North American News Text Corpus (Graff, 1995). ...
Conference Paper
Most previous research on verb clustering has focussed on acquiring flat classifications from corpus data, although many manually built classifications are taxonomic in nature. Natural Language Processing (NLP) applications also benefit from taxonomic classifications because they vary in terms of the granularity they require from a classification. We introduce a new clustering method called Hierarchical Graph Factorization Clustering (HGFC) and extend it so that it is optimal for the task. Our results show that HGFC outperforms the frequently used agglomerative clustering on a hierarchical test set extracted from VerbNet, and that it yields state-of-the-art performance also on a flat test set. We demonstrate how the method can be used to acquire novel classifications as well as to extend existing ones on the basis of some prior knowledge about the classification.
... Otherwise, in the case of a large class imbalance, the evaluation measure would be dominated by the classes with a large population. ... (Korhonen et al., 2006). The data was gathered from five corpora, including e.g. the British National Corpus (Leech, 1992) and the North American News Text Corpus (Graff, 1995). The average frequency of verbs in T1 was 1448 and T2 2166, showing that T1 is a more sparse data set. ...
Conference Paper
In previous research in automatic verb classification, syntactic features have proved the most useful features, although manual classifications rely heavily on semantic features. We show, in contrast with previous work, that considerable additional improvement can be obtained by using semantic features in automatic classification: verb selectional preferences acquired from corpus data using a fully unsupervised method. We report these promising results using a new framework for verb clustering which incorporates a recent subcategorization acquisition system, rich syntactic-semantic feature sets, and a variation of spectral clustering which performs particularly well in high dimensional feature space.
... For English, we have processed the North American News Text corpus (Graff, 1995) (without the WSJ section) with the Stanford segmenter and tokenizer (Toutanova et al., 2003). For Czech, we have used the SYN2005 part of the Czech National Corpus (CNC, 2005) (with the original segmentation and tokenization). ...
Conference Paper
Full-text available
This paper describes POS tagging experiments with semi-supervised training as an extension to the (supervised) averaged perceptron algorithm, first introduced for this task by (Collins, 2002). Experiments with an iterative training on a standard-sized supervised (manually annotated) dataset (10^6 tokens) combined with a relatively modest (in the order of 10^8 tokens) unsupervised (plain) dataset in a bagging-like fashion showed significant improvement of the POS classification task on typologically different languages, yielding better than state-of-the-art results for English and Czech (4.12% and 4.86% relative error reduction, respectively; absolute accuracies being 97.44% and 95.89%).
... Our algorithm requires large amounts of data to gather argument structure and collocation patterns . For the statistics gathering phase of the clause detection algorithm, we used 4.5M sentences of the NANC (Graff, 1995 ) corpus, bounding their length in the same manner. In order to extract collocations, we used 2M sentences from the British National Corpus (Burnard, 2000) and about 29M sentences from the Dmoz corpus (Gabrilovich and Markovitch, 2005). ...
Conference Paper
The task of Semantic Role Labeling (SRL) is often divided into two sub-tasks: verb argument identification, and argument classification. Current SRL algorithms show lower results on the identification sub-task. Moreover, most SRL algorithms are supervised, relying on large amounts of manually created data. In this paper we present an unsupervised algorithm for identifying verb arguments, where the only type of annotation required is POS tagging. The algorithm makes use of a fully unsupervised syntactic parser, using its output in order to detect clauses and gather candidate argument collocation statistics. We evaluate our algorithm on PropBank10, achieving a precision of 56%, as opposed to 47% of a strong baseline. We also obtain an 8% increase in precision for a Spanish corpus. This is the first paper that tackles unsupervised verb argument identification without using manually encoded rules or extensive lexical or syntactic resources.
... While we do not achieve as high an f-score as the UML-DOP model in Bod (2006), we will show that U-DOP* can operate without subtree sampling, and that the model can be trained on corpora that are two orders of magnitude larger than in Bod (2006). We will extend our experiments to 4 million sentences from the NANC corpus (Graff 1995), showing that an f-score of 70.7% can be obtained on the standard Penn WSJ test set by means of unsupervised parsing. Moreover, U-DOP* can be directly put to use in bootstrapping structures for concrete applications such as syntax-based machine translation and speech recognition. ...
Conference Paper
Full-text available
How far can we get with unsupervised parsing if we make our training corpus several orders of magnitude larger than has hitherto been attempted? We present a new algorithm for unsupervised parsing using an all-subtrees model, termed U-DOP*, which parses directly with packed forests of all binary trees. We train both on Penn's WSJ data and on the (much larger) NANC corpus, showing that U-DOP* outperforms a treebank-PCFG on the standard WSJ test set. While U-DOP* performs worse than state-of-the-art supervised parsers on hand-annotated sentences, we show that the model outperforms supervised parsers when evaluated as a language model in syntax-based machine translation on Europarl. We argue that supervised parsers miss the fluidity between constituents and non-constituents and that in the field of syntax-based language modeling the end of supervised parsing has come in sight.
... We also find that sentences with technical content are more likely to be related than non-technical sentences. We label an utterance as technical if it contains a web address, a long string of digits, or a term present in a guide for novice Linux users but not in a large news corpus (Graff, 1995). This is a light-weight way to capture one "semantic dimension" or cluster of related words, in a corpus which is not amenable to full LSA or similar techniques. ...
Conference Paper
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. This is, to our knowledge, the first such corpus for internet chat. We propose a graph-theoretic model for disentanglement, using discourse-based features which have not been previously applied to this task. The model's predicted disentanglements are highly correlated with manual annotations.
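The "technical utterance" heuristic quoted in the excerpt above is simple enough to show directly: flag an utterance if it contains a URL, a long digit string, or a word that appears in a technical word list but not in a general news vocabulary. The two word lists below are tiny invented stand-ins for the Linux guide and the news corpus.

    # Illustrative sketch of the technical-utterance heuristic; the word
    # lists are tiny stand-ins for the Linux guide and the news vocabulary.
    import re

    URL = re.compile(r"https?://\S+|www\.\S+")
    LONG_DIGITS = re.compile(r"\d{5,}")

    linux_terms = {"grub", "kernel", "sudo", "xorg", "initrd"}       # stand-in
    news_vocab = {"the", "kernel", "government", "said", "market"}   # stand-in
    tech_only = linux_terms - news_vocab      # technical terms absent from news text

    def is_technical(utterance):
        if URL.search(utterance) or LONG_DIGITS.search(utterance):
            return True
        tokens = re.findall(r"[a-z]+", utterance.lower())
        return any(tok in tech_only for tok in tokens)

    print(is_technical("edit /boot/grub/menu.lst and run sudo update-grub"))  # True
    print(is_technical("did you see the match last night?"))                  # False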
... This simple heuristic yields very high performance on punctuation, scoring (when all other words are assumed to be tagged perfectly) 99.6% (99.1%) 1-to-1 accuracy when evaluated against the English fine (coarse) POS tag sets, and 97.2% when evaluated against the German POS tag set. For English, we trained our model on the 39832 sentences which constitute sections 2-21 of the PTB-WSJ and on the 500K sentences from the NYT section of the NANC newswire corpus (Graff, 1995). ...
Conference Paper
We present a novel fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these clusters. We utilize morphological and distributional representations computed in a fully unsupervised manner. We evaluate our algorithm on English and German, achieving the best reported results for this task.
... In addition to labeled news data, we make use of a large quantity of unlabeled news data. The unlabeled data is the North American News Corpus, NANC (Graff, 1995), which is approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information and sentence boundaries are induced by a simple discriminative model. ...
Conference Paper
Full-text available
Statistical parsers trained and tested on the Penn Wall Street Journal (WSJ) treebank have shown vast improvements over the last 10 years. Much of this improvement, however, is based upon an ever-increasing number of features to be trained on (typically) the WSJ treebank data. This has led to concern that such parsers may be too finely tuned to this corpus at the expense of portability to other genres. Such worries have merit. The standard "Charniak parser" checks in at a labeled precision-recall f-measure of 89.7% on the Penn WSJ test set, but only 82.9% on the test set from the Brown treebank corpus.This paper should allay these fears. In particular, we show that the reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%. Furthermore, use of the self-training techniques described in (McClosky et al., 2006) raise this to 87.8% (an error reduction of 28%) again without any use of labeled Brown data. This is remarkable since training the parser and reranker on labeled Brown data achieves only 88.4%.
... Sections 00 and 23 were used for development and test evaluation . A further 113,346,430 tokens (4,566,241 sentences) of raw data from the Wall Street Journal section of the North American News Corpus (Graff, 1995) were parsed to produce the training data for adaptation. This text was tokenised using the C&C tools tokeniser and parsed using our baseline models. ...
Conference Paper
We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy. The idea is to train the supertagger on large amounts of parser output, so that the supertagger can learn to supply the supertags that the parser will eventually choose as part of the highest-scoring derivation. Since the supertagger supplies fewer supertags overall, the parsing speed is increased. We demonstrate the effectiveness of the method using a CCG supertagger and parser, obtaining significant speed increases on newspaper text with no loss in accuracy. We also show that the method can be used to adapt the CCG parser to new domains, obtaining accuracy and speed improvements for Wikipedia and biomedical text.
... The average number of words per argument is 5.1. The NANC (Graff, 1995) corpus was used as a training set. Only sentences of length not greater than 10 excluding punctuation were used (see Section 3.2), totaling 4955181 sentences. ...
Conference Paper
The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupervised algorithm for the task that uses no supervised models, utilizing instead state-of-the-art syntactic induction algorithms. This is the first work to tackle this task in a fully unsupervised scenario.
... We use the news articles portion of the Wall Street Journal corpus (WSJ) from the Penn Treebank (Marcus et al., 1993) in conjunction with the self-trained North American News Text Corpus (NANC, Graff (1995)). The English Translation Treebank, ETT (Bies, 2007), is the translation of broadcast news in Arabic. ...
Conference Paper
Full-text available
Current statistical parsers tend to perform well only on their training domain and nearby genres. While strong performance on a few related domains is sufficient for many situations, it is advantageous for parsers to be able to generalize to a wide variety of domains. When parsing document collections involving heterogeneous domains (e.g. the web), the optimal parsing model for each document is typically not obvious. We study this problem as a new task: multiple source parser adaptation. Our system trains on corpora from many different domains. It learns not only statistics of those domains but quantitative measures of domain differences and how those differences affect parsing accuracy. Given a specific target text, the resulting system proposes linear combinations of parsing models trained on the source corpora. Tested across six domains, our system outperforms all non-oracle baselines including the best domain-independent parsing model. Thus, we are able to demonstrate the value of customizing parsing models to specific domains.
... The baseline word error rate (WER) obtained for this model using the 5,000 word WSJ bigram language model on the WSJ test set was 6.7%. The baseline language model (LM) used for the dictation system was a trigram LM trained from more than 350 million words of broadcast news [24], the North American news corpus [25], and the Canadian English language Hansard corpus [26]. A 20,000 word vocabulary of the most frequently occurring words in the corpus was used. ...
Article
Full-text available
This paper presents a model for machine-aided human translation (MAHT) that integrates source language text and target language acoustic information to produce the text translation of source language document. It is evaluated on a scenario where a human translator dictates a first draft target language translation of a source language document. Information obtained from the source language document, including translation probabilities derived from statistical machine translation (SMT) and named entity tags derived from named entity recognition (NER), is incorporated with acoustic phonetic information obtained from an automatic speech recognition (ASR) system. One advantage of the system combination used here is that words that are not included in the ASR vocabulary can be correctly decoded by the combined system. The MAHT model and system implementation is presented. It is shown that a relative decrease in word error rate of 29% can be obtained by this combined system relative to the baseline ASR performance on a French to English document translation task in the Hansard domain. In addition, it is shown that transcriptions obtained by using the combined system show a relative increase in NIST score of 34% compared to transcriptions obtained from the baseline ASR system.
... Recognition III (CSR-III); the North American News Text Corpus (NANTC, Graff, 1995); the NANTC supplement (NANTS, MacIntyre, 1998); and the ACQUAINT Corpus (Graff, 2002). The components and their sizes (including punctuation) are given in Table 3. The LDC has recently released the English Gigaword corpus (Graff, 2003), including most of the corpora listed above. ...
Article
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated into a wide range of applications in Natural Language Processing. However they are very difficult and expensive to create and maintain, and their usefulness has been severely hampered by their limited coverage, bias and inconsistency. Automated and semi-automated methods for developing such resources are therefore crucial for further resource development and improved application performance.
... To demonstrate this we compiled a homogeneous corpus of 1.145 billion words of newspaper and newswire text from three existing corpora: the North American News Text Corpus, NANC (Graff, 1995), the NANC Supplement (MacIntyre, 1998) and the Reuters Corpus Volume 1, RCV1 (Rose et al., 2002). The number of words in each corpus is shown in Table 1. ...
Article
Banko and Brill (2001) suggested that the development of very large training corpora may be more effective for progress in empirical Natural Language Processing than improving methods that use existing smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one billion word corpus. When using one billion words, as expected, we do find that many of our estimates do converge to their eventual value. However, we also find that for some words, no such convergence occurs. This leads us to conclude that simply relying upon large corpora is not in itself sufficient: we must pay attention to the statistical modelling as well.
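The convergence check the abstract describes amounts to estimating unigram relative frequencies on growing prefixes of the corpus and watching whether each word's estimate settles down. The sketch below does this on a synthetic token stream that stands in for the billion-word corpus; the words, weights, and checkpoints are invented.

    # Illustrative sketch of tracking unigram-probability convergence on
    # growing corpus prefixes.  The token stream is synthetic.
    import random
    from collections import Counter

    random.seed(0)
    tokens = random.choices(["the", "of", "earthquake", "zygote"],
                            weights=[50, 30, 5, 1], k=100_000)

    checkpoints = {1_000, 10_000, 100_000}
    tracked = ["the", "earthquake", "zygote"]
    counts = Counter()
    estimates = {w: [] for w in tracked}

    for i, tok in enumerate(tokens, 1):
        counts[tok] += 1
        if i in checkpoints:
            for w in tracked:
                estimates[w].append(counts[w] / i)   # relative-frequency estimate

    for w, est in estimates.items():
        # frequent words settle quickly; rare words keep drifting for longer
        print(w, [round(e, 5) for e in est])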
Thesis
Full-text available
This research is aimed at studying the role played by discourse structure in text comprehension. In a series of three experiments, volunteer subjects are asked to reorder a set of randomly presented sentences so as to form a coherent text: a narrative text composed of five sentences (first experiment), an expository text composed of five sentences (second experiment), or an expository text composed of nine sentences (third experiment). Each subject is presented with a specific and unique set of scrambled sentences. Results from the first experiment show that referential relationships among entities and/or events contribute to determining text coherence. In a text, sentences organized according to local coherence contain lexical and non-lexical cues that allow subjects to link every pair of sentences to create a particular structure; sentences organized according to global coherence contain lexical and non-lexical cues that allow subjects to relate every sentence to a common topic. Under these conditions a different strategy is applied by subjects in the second and third experiments. Results from the second and third experiments reveal that human subjects are willing to exploit information from the way the text's sentence order is constrained, but in a different way for five-sentence texts and for nine-sentence texts: in five-sentence texts through local coherence and concrete semantic relations, and in nine-sentence texts through global coherence and abstract semantic relations. This difference arises from the fact that, contrary to five-sentence texts, nine-sentence texts represent a heavier cognitive processing load for working memory. Referential relations play an identical role in all the experiments. The results are discussed in relation to Latent Semantic Analysis, Rhetorical Structure Theory and Centering Theory models. No theoretical model can account for all the experimental findings but, setting aside the role played by reference, they are most compatible with a Centering Theory model, since this model is the only one which makes specific predictions about the role played by sentence order.
Chapter
Full-text available
In English and several other European languages, the perfect tense is a complex morpho-syntactic construction made of an auxiliary ("have," "be") followed by a past participle, as in "Jamie has eaten all the chocolate biscuits." The auxiliary appears in the past, present, and future tenses, thus creating past, present, and future perfects. The perfect has been a problematic category for scholars across time due to the multiplicity of its meanings/uses within a given language and to the variation in meanings/uses of what has been labeled "perfect" across languages. In an attempt to provide a clearer understanding of this complex semantic category, this article looks at theories of the perfect, focusing on its semantics and pragmatics. It also discusses whether the perfect is a tense, an aspect, or both; how pragmatic factors and discourse relations affect the use of the perfect; and concludes by examining the place of the perfect in a tense/aspect system more generally, focusing on how it relates to categories such as resultatives and the simple past tense, as well as to habituals and prospectives.
Article
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. We propose a graph-based clustering model for disentanglement, using lexical, timing, and discourse-based features. The model's predicted disentanglements are highly correlated with manual annotations. We conclude by discussing two extensions to the model, specificity tuning and conversation start detection, both of which are promising but do not currently yield practical improvements.
Article
Two experiments investigated sensory/motor-based functional knowledge of man-made objects: manipulation features associated with the actual usage of objects. In Experiment 1, a series of prime-target pairs was presented auditorily, and participants were asked to make a lexical decision on the target word. Participants made a significantly faster decision about the target word (e.g. 'typewriter') following a related prime that shared manipulation features with the target (e.g. 'piano') than an unrelated prime (e.g. 'blanket'). In Experiment 2, participants' eye movements were monitored when they viewed a visual display on a computer screen while listening to a concurrent auditory input. Participants were instructed to simply identify the auditory input and touch the corresponding object on the computer display. Participants fixated an object picture (e.g. "typewriter") related to a target word (e.g. 'piano') significantly more often than an unrelated object picture (e.g. "bucket") as well as a visually matched control (e.g. "couch"). Results of the two experiments suggest that manipulation knowledge of words is retrieved without conscious effort and that manipulation knowledge constitutes a part of the lexical-semantic representation of objects.