Article

Event Detection in Blogs using Temporal Random Indexing

Authors: David Jurgens, Keith Stevens

Abstract

Automatic event detection aims to identify novel, interesting topics as they are published online. While existing algorithms for event detection have focused on newswire releases, we examine how event detection can work on less structured corpora of blogs. The proliferation of blogs and other forms of self-published media has given rise to an ever-growing corpus of news, commentary and opinion texts. Blogs offer a major advantage for event detection as their content may be rapidly updated. However, blog texts also pose a significant challenge in that the described events may be less easy to detect given the variety of topics, writing styles and possible author biases. We propose a new way of detecting events in this medium by looking for changes in word semantics. We first outline a new algorithm that makes use of a temporally-annotated semantic space for tracking how words change semantics. Then we demonstrate how the identified changes can be used to detect new events and their associated blog entries.
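The core idea in the abstract (fixed random index vectors shared across time slices, per-period context vectors, and a shift score that flags candidate events) can be sketched roughly as follows. This is a minimal illustration under assumed parameters (dimensionality, sparsity, window size) and an assumed input format, not the authors' implementation.

```python
# Rough sketch of temporal random indexing for event detection.
# Assumes blog entries arrive as (period, tokens) pairs; all names and
# parameters here are illustrative, not taken from the paper.
import numpy as np
from collections import defaultdict

DIM, NONZERO, WINDOW = 2000, 8, 2
rng = np.random.default_rng(0)
index_vectors = {}                      # one fixed sparse ternary vector per word

def index_vector(word):
    if word not in index_vectors:
        v = np.zeros(DIM)
        v[rng.choice(DIM, NONZERO, replace=False)] = rng.choice([-1.0, 1.0], NONZERO)
        index_vectors[word] = v
    return index_vectors[word]

# context_vectors[period][word] accumulates co-occurrence information;
# because the index vectors are shared, vectors from different periods
# live in the same space and can be compared directly.
context_vectors = defaultdict(lambda: defaultdict(lambda: np.zeros(DIM)))

def update(period, tokens):
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                context_vectors[period][w] += index_vector(tokens[j])

def semantic_shift(word, p1, p2):
    a, b = context_vectors[p1][word], context_vectors[p2][word]
    if not a.any() or not b.any():
        return 0.0
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Words with a large shift between consecutive periods are candidate event markers, and the later period's entries containing them are the associated blog posts.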


... Random indexing has recently been applied in a variety of applications, such as indexing of literature databases [23], event detection in blogs [6] and graph searching for the semantic web [3]. The idea of random indexing [10] originates from Pentti Kanerva's work on sparse distributed memory [8], and related work on the mathematics of brain-inspired information processing with hyperdimensional symbols, see [9] for a recent review. ...
... To break this symmetry, e.g., for the purpose of coding time evolution or order, one has to introduce an additional tensor index for that degree of freedom. Temporal random indexing [6] is an example, which is used to analyze the time evolution of word semantics to detect novel events in on-line texts. ...
... Temporal RI [6]) or linguistic relations [1,22], or by incorporating structural information in distributed representations [2,26]. However, there have been few attempts at extending traditional matrix-based natural language processing methods to tensors. ...
Article
We present an incremental, scalable and efficient dimension reduction technique for tensors that is based on sparse random linear coding. Data is stored in a compactified representation with fixed size, which makes memory requirements low and predictable. Component encoding and decoding are performed on-line without computationally expensive re-analysis of the data set. The range of tensor indices can be extended dynamically without modifying the component representation. This idea originates from a mathematical model of semantic memory and a method known as random indexing in natural language processing. We generalize the random-indexing algorithm to tensors and present signal-to-noise-ratio simulations for representations of vectors and matrices. We also present a mathematical analysis of the approximate orthogonality of high-dimensional ternary vectors, which is a property that underpins this and other similar random-coding approaches to dimension reduction. To further demonstrate the properties of random indexing we present results of a synonym identification task. The method presented here has some similarities with random projection and Tucker decomposition, but it performs well at high dimensionality only (n>10^3). Random indexing is useful for a range of complex practical problems, e.g., in natural language processing, data mining, pattern recognition, event detection, graph searching and search engines. Prototype software is provided. It supports encoding and decoding of tensors of order >= 1 in a unified framework, i.e., vectors, matrices and higher order tensors.
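As a toy illustration of the encode/decode idea described in this abstract (sparse ternary index vectors, a fixed-size distributed store updated incrementally, approximate retrieval by inner products), one could write something like the sketch below. Sizes, the ternary-vector construction and the decoding normalisation are assumptions, not the paper's reference implementation.

```python
# Toy second-order (matrix) random indexing: entries are encoded into a
# fixed-size array via outer products of sparse ternary index vectors and
# decoded approximately via inner products. Illustrative only.
import numpy as np

N1, N2, K = 1024, 1024, 8        # storage dimensions; K nonzeros per index vector
ROWS, COLS = 200, 200            # logical size of the encoded matrix
rng = np.random.default_rng(1)

def ternary(n, k):
    v = np.zeros(n)
    v[rng.choice(n, k, replace=False)] = rng.choice([-1.0, 1.0], k)
    return v

r = [ternary(N1, K) for _ in range(ROWS)]     # row index vectors
c = [ternary(N2, K) for _ in range(COLS)]     # column index vectors
S = np.zeros((N1, N2))                        # fixed-size distributed storage

# incremental, order-independent encoding of sparse (i, j, value) entries
for i, j, value in [(3, 7, 2.5), (10, 42, -1.0), (3, 8, 0.5)]:
    S += value * np.outer(r[i], c[j])

# approximate decoding: inner product with the corresponding index vectors
approx = r[3] @ S @ c[7] / (K * K)            # close to 2.5 for large N1, N2
```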
... However, a WordSpace represents a snapshot of a specific corpus and it does not take into account temporal information. For this reason, we rely on a particular method, called Temporal Random Indexing (TRI), that enables the analysis of the time evolution of the meaning of a word [4,16]. TRI is able to efficiently build WordSpaces taking into account temporal information. ...
... The TRI framework provides all the necessary tools to build WordSpaces over different time periods and perform such temporal linguistic analysis. The system has been tested on several domains, such as a collection of Italian books, English scientific papers [3], the Italian version of the Google N-gram dataset [2], and Twitter [16]. ...
... All previous works based on word embeddings have in common the fact that they build a different semantic space for each period taken into consideration; this approach does not guarantee that each dimension bears the same semantics in different spaces [16], especially when embedding techniques are employed. In order to overcome this limitation, Jurgens and Stevens [16] introduced the Temporal Random Indexing technique as a means to discover semantic changes associated with different events in a blog stream. ...
Chapter
Full-text available
Detecting significant linguistic shifts in the meaning and usage of words has gained more attention over the last few years. Linguistic shifts are especially prevalent on the Internet, where words’ meaning can change rapidly. In this work, we describe the construction of a large diachronic corpus that relies on the UK Web Archive and we propose a preliminary analysis of semantic change detection exploiting a particular technique called Temporal Random Indexing. Results of the evaluation are promising and give us important insights for further investigations.
... In this paper we show how one such DSM technique, called Random Indexing (RI) (Sahlgren 2005, 2006), can be easily extended to allow the analysis of semantic changes of words over time (Jurgens and Stevens 2009). The ultimate aim is to provide a tool which enables the understanding of how words change their meanings within a document corpus as a function of time. ...
... The classical RI does not take into account temporal information, but it can be easily adapted to the methodology proposed in (Jurgens and Stevens 2009) for our purposes. Specifically, given a document collection D annotated with metadata containing information about the year in which the document was written, we can split the collection into different time periods D1, D2, ... ...
... Then, they applied a co-occurrence-graph-based clustering algorithm in order to cluster words according to senses in different time periods: the difference between clusters is exploited to detect changes in senses. All these works have in common the fact that they build a different semantic space for each period taken into consideration; this approach does not guarantee that each dimension bears the same semantics in different spaces (Jurgens and Stevens 2009), especially when reduction techniques are employed. In order to overcome this limitation, Jurgens and Stevens (2009) introduced the Temporal Random Indexing technique as a means to discover semantic changes associated with different events in a blog stream. ...
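The split into periods D1, D2, ... mentioned in these excerpts is straightforward once each document carries a year in its metadata; a minimal sketch (field names and period length are assumptions) might look like this:

```python
# Group documents into time periods D1, D2, ... by a year field in their
# metadata before building one WordSpace per period. Illustrative only.
from collections import defaultdict

def split_by_period(documents, period_length=10):
    periods = defaultdict(list)
    for doc in documents:
        start = (doc["year"] // period_length) * period_length
        periods[start].append(doc)
    return dict(sorted(periods.items()))

# e.g. split_by_period(corpus, 10) -> {1990: [...], 2000: [...], 2010: [...]}
```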
Article
Full-text available
During the last decade the surge in available data spanning different epochs has inspired a new analysis of cultural, social, and linguistic phenomena from a temporal perspective. This paper describes a method that enables the analysis of the time evolution of the meaning of a word. We propose Temporal Random Indexing (TRI), a method for building WordSpaces that takes into account temporal information. We exploit this methodology in order to build geometrical spaces of word meanings that consider several periods of time. The TRI framework provides all the necessary tools to build WordSpaces over different time periods and perform such temporal linguistic analysis. We propose some examples of usage of our tool by analysing word meanings in two corpora: a collection of Italian books and English scientific papers about computational linguistics. This analysis enables the detection of linguistic events that emerge in specific time intervals and that can be related to social or cultural phenomena.
... For example, the accuracy of RI is comparable to SVD-based methods in a TOEFL synonym identification task [31], and that result has been further improved in the case of RI [45]. RI of co-occurrence matrices for semantic analysis works surprisingly well [11,30,43,52], and the method has been adopted in other applications, such as indexing of literature databases [54], event detection in blogs [26], web user clustering and page prefetching [57], graph searching for the semantic web [12], diagnosis code assignment to patients [24], predicting speculation in biomedical literature [55] and failure prediction [19]. In general, there is an increasing interest in randomisation in information processing because it enables the use of simple algorithms, which can be organised to exploit parallel computation in an efficient way [6,22]. ...
... The practical usefulness of RI is also demonstrated by several implementations in public software packages such as the S-Space Package [27] and the Semantic Vectors Package [58], and extensions of the basic method to new domains and problems [26,54]. Therefore, it is natural to ask whether the RI algorithm can be generalised to higher-order relationships and distributional arrays. ...
... In summary, RI is an incremental dimension reduction method that is computationally lightweight and well suited for online processing of streaming data not feasible to analyse with other, more accurate and complex methods [50]. For example, standard co-occurrence matrices in natural language processing applications can be extended with temporal information [26], linguistic relations [3,53] and structural information in distributed representations [9,59]. There have been few attempts at extending traditional matrix-based natural language processing methods to higher-order arrays due to the high computational cost involved. ...
Article
Full-text available
Random indexing (RI) is a lightweight dimension reduction method, which is used, for example, to approximate vector semantic relationships in online natural language processing systems. Here we generalise RI to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary RI and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of RI is feasible, including comparisons with ordinary RI and principal component analysis. The RI method is well suited for online processing of data streams because relationship weights can be updated incrementally in a fixed-size distributed representation, and inner products can be approximated on the fly at low computational cost. An open source implementation of generalised RI is provided.
... The correlation between authors and topics is computed by exploiting two different techniques, inspired by the Google Books N-gram Viewer (https://books.google.com/ngrams) and by Explicit Semantic Analysis [4], respectively. The topic semantics is studied by means of a framework named Temporal Random Indexing [5], which outlines how the usage of a particular term evolves over time. T-RecS is implemented as a web application and empowers users who want to know, for example, the authors who studied a given topic in a specific time interval, the most used recommendation paradigm in a given period, or the evolution of the semantics of a specific ...
... The Temporal Random Indexing (TRI) Analyzer extends a Distributional Semantic Model technique called Random Indexing (RI) to allow the analysis of semantic changes of words over time [5]. The semantic vector for a word in a given time period is the result of its co-occurrences with other words in the same time interval, but the usage of RI for building the word representations over different times guarantees their comparability along the timeline. ...
Conference Paper
This paper presents T-RecS (Temporal analysis of Recommender Systems conference proceedings), a framework that supplies services to analyze the Recommender Systems Conference proceedings from the first edition, held in 2007, to the last one, held in 2015, from a temporal point of view. The idea behind T-RecS is to identify linguistic phenomena that reflect some interesting variations for the research community, such as topic drift, or how the correlation between two terms changed over time, or how the similarity between two authors evolved over time.
... Friday, a court ordered a suspect in her killing to be kept in custody for two weeks while police gather evidence. Additionally, methods similar to NED have been developed based on semantic space models (see [15], [16], [17]). These methods detect new events by detecting shifts in a term vector space for a given query. ...
... Some of the new events will also indicate the emergence of unexpected processes, states, or other series of events. Jurgens and Stevens [17] describe how the launch of a toy during the holiday season may lead to the emergence of a toy recall and eventual lawsuits due to toxicity associated with the toy. Initially, the launch of the toy is just a new event, but later reports of toxic chemicals used in its manufacturing may lead to the development of this emerging event. ...
Conference Paper
Full-text available
Recognizing new and emerging events in a stream of news documents requires understanding the semantic structure of news reported in natural language. New event detection (NED) is the task of recognizing when a news document discusses a completely novel event. To be successful at this task, we argue a NED method must extract and represent the type of event and its participants as well as the temporal and spatial properties of the event. Our NED methods produce a 25% cost reduction over a bag-of-words baseline and a 13% cost reduction over an existing state-of-the-art approach. Additionally, we discuss our method for recognizing emerging events: the tracking and categorization of unexpected or novel events.
... Tracking News Stories. To examine the propagation of variations of phrases in news articles, Leskovec et al. (2009) developed a framework to identify and adaptively track the evolution of unique phrases using a graph-based approach. In (Chong and Chua 2013), a search and summarization framework was proposed to construct summaries of events of interest. ...
... There are several popular ways of representing individual words or documents in a semantic space. Most do not address the temporal nature of documents, but a notable method that does is described by Jurgens and Stevens (2009), adding a temporal dimension to Random Indexing for the purpose of event detection. Our approach focuses on summarization rather than event detection; however, the concept of using word co-occurrence to learn word representations is very similar. ...
Conference Paper
Full-text available
Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard text similarity metrics often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive text similarity mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on the ROUGE metric indicate that the adaptive similarity mechanism is best suited for tracking evolving stories on Twitter.
... Fixed Duration Temporal Random Indexing, introduced by Jurgens and Stevens (2009), attempts to bypass the computational difficulty inherent in singular value decomposition through the use of random projections onto lower dimensional space. Similar to BEAGLE, each word has an environmental vector, although FDTRI vectors are generated to be sparse. ...
... The classical RI does not take into account temporal information, but it can be easily adapted to the methodology proposed in (Jurgens and Stevens, 2009) for our purposes. In particular, we need to add to each document in C metadata containing information about the year in which the document was written. ...
Conference Paper
Full-text available
English. This paper proposes an approach to the construction of WordSpaces which takes into account temporal information. The proposed method is able to build a geometrical space considering several periods of time. This methodology enables the analysis of the time evolution of the meaning of a word. Exploiting this approach, we build a framework, called Temporal Random Indexing (TRI), that provides all the necessary tools for building WordSpaces and performing such linguistic analysis. We propose some examples of usage of our tool by analysing word meanings in two corpora: a collection of Italian books and English scientific papers about computational linguistics. Italian. In this work we propose an approach for the construction of WordSpaces that take temporal information into account. The proposed method builds geometrical spaces considering different time intervals. This methodology makes it possible to study how the meaning of words evolves over time. Using this approach we built a tool, called Temporal Random Indexing (TRI), which allows the construction of WordSpaces and provides instruments for linguistic analysis. In the paper we give some examples of the use of our tool by analysing word meanings in two corpora: one of books in Italian, the other of scientific articles in English in the field of computational linguistics.
... Earlier studies consisted primarily of corpus-based analysis ([15, 29] among many others), and used raw word frequencies to detect semantic shifts. However, there were already applications of distributional methods, for example, Temporal Random Indexing [16], co-occurrence matrices weighted by Local Mutual Information [11], graph-based methods [31] and others. ...
Chapter
Full-text available
We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the degree of their semantic change over time. We use several methods for aggregation of contextualized embeddings from these architectures and evaluate their performance. Finally, we compare unsupervised and supervised techniques in this task.
... The dynamic user-generated content of blogs makes them a very good platform for event detection. One study proposed a novel event detection algorithm that makes use of a temporally annotated semantic space to track how the semantics of words change over time (Jurgens and Stevens 2009). A semantic space model is an automated method of building distributed word representations. ...
Article
Full-text available
Online social networks (OSNs) have become an important platform for detecting real-world events in recent years. These real-world events are detected by analyzing the huge social-stream data available on different OSN platforms. Event detection has become significant because social streams contain substantial information describing different scenarios during events or crises. This information further helps to enable contextual decision making regarding the event location, content and temporal specifications. Several studies exist which offer a plethora of frameworks and tools for detecting and analyzing events, used for applications like crisis management, monitoring and predicting events on different OSN platforms. In this paper, a survey is done of event detection techniques in OSNs based on social text streams—newswire, web forums, emails, blogs and microblogs—for natural disasters, trending or emerging topics and public-opinion-based events. The work done and the open problems are explicitly mentioned for each social stream. Further, this paper elucidates the list of event detection tools available to researchers. Read Only: http://rdcu.be/mJF9
... The reliability of the empirically derived semantics needs to be analyzed in order to determine the extent to which domain specificity and temporal synchronization of the training corpus affect results. Further investigation of this effect is beyond the scope of the current paper (however, see for example [68,69]). ...
Article
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text.
... In the work of Gorman and Curran (2006) different weighting criteria for random index vectors in the dictionary were proven useful for improving the matrix representation. RI has been tested in different tasks, such as search (Rangan, 2011), query expansion (Sahlgren, Karlgren, Cöster, & Järvinen, 2002), image and text compression (Bingham & Mannila, 2001), and event detection (Jurgens & Stevens, 2009). Fradkin and Madigan (2003) showed that, since in RI distances are approximately preserved, distance-based learners such as k-Nearest Neighbours (k-NN) and SVMs are preferable when learning from randomly indexed instances. ...
Article
Full-text available
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the L monolingual classifiers by also leveraging the training documents written in the other (L - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly L times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well-known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
... The correlation between authors and topics is computed by exploiting two different techniques, inspired by the Google Books N-gram Viewer and by Explicit Semantic Analysis [2], respectively. The topic semantics is studied by means of a framework named Temporal Random Indexing [5], which outlines how the usage of a particular term evolves over time. Thanks to these techniques we are able to answer questions like: What are the authors who studied the influence of emotions in recommender systems?, What was the most used recommendation paradigm in 2007? ...
Conference Paper
T-RecS is a system which implements several computational linguistic techniques for analyzing word usage variations over time periods in a document collection. We analyzed ACM RecSys conference proceedings from the first edition, held in 2007, to the one held in 2015. The idea is to identify linguistic phenomena that reflect some interesting variations for the research community, such as a topic shift, or how the correlation between two terms changed over time, or how the similarity between two authors evolved over time. T-RecS is a web application accessible via http://193.204.187.192/recsys/.
... The key challenge in deriving a novelty measure from the textual information of scientific literature is how to effectively and efficiently represent the semantics and the semantic changes of research topics without information loss. Word embedding techniques such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) have proved their utility in representing the semantics of words, and techniques for learning semantic changes were also developed (Jurgens and Stevens, 2009; Hamilton et al., 2016). Hamilton et al. (2016) developed a temporal word embedding method to understand how the semantics of words changed over time, by aligning word embeddings across different periods. ...
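The cross-period alignment this excerpt attributes to Hamilton et al. (2016) is typically done with an orthogonal Procrustes rotation over the shared vocabulary; the hedged sketch below assumes two dictionaries of same-dimensional word vectors as input and is not their exact code.

```python
# Align an "old" embedding space onto a "new" one with orthogonal Procrustes,
# then score a word's semantic change as the cosine distance between its
# aligned old vector and its new vector. Input layout is an assumption.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def semantic_change(emb_old, emb_new, word):
    shared = sorted(set(emb_old) & set(emb_new))
    A = np.vstack([emb_old[w] for w in shared])
    B = np.vstack([emb_new[w] for w in shared])
    R, _ = orthogonal_procrustes(A, B)        # rotation mapping old onto new
    a, b = emb_old[word] @ R, emb_new[word]
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```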
Article
Full-text available
Novel scientific knowledge is constantly produced by the scientific community. Understanding the level of novelty characterized by scientific literature is key for modeling scientific dynamics and analyzing the growth mechanisms of scientific knowledge. Metrics derived from bibliometrics and citation analysis were effectively used to characterize the novelty in scientific development. However, time is required before we can observe links between documents such as citation links or patterns derived from the links, which makes these techniques more effective for retrospective analysis than predictive analysis. In this study, we present a new approach to measuring the novelty of a research topic in a scientific community over a specific period by tracking semantic changes of the terms and characterizing the research topic in their usage context. The semantic changes are derived from the text data of scientific literature by temporal embedding learning techniques. We validated the effects of the proposed novelty metric on predicting the future growth of scientific publications and investigated the relations between novelty and growth by panel data analysis applied in a large-scale publication dataset (MEDLINE/PubMed). Key findings based on the statistical investigation indicate that the novelty metric has significant predictive effects on the growth of scientific literature and the predictive effects may last for more than ten years. We demonstrated the effectiveness and practical implications of the novelty metric in three case studies.
... It paved the way for quantitatively comparing not only words with regard to their meaning, but also different stages in the development of word meaning over time. Jurgens and Stevens (2009) employed the Random Indexing (RI) algorithm (Kanerva et al., 2000) to create word vectors. Two years later, Gulordava and Baroni (2011) used explicit count-based models, consisting of sparse co-occurrence matrices weighted by Local Mutual Information, while Sagi et al. (2011) turned to Latent Semantic Analysis (Deerwester et al., 1990). ...
Preprint
Full-text available
Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic research related to diachronic word embeddings and semantic shifts detection. We start with discussing the notion of semantic shifts, and then continue with an overview of the existing methods for tracing such time-related shifts with word embedding models. We propose several axes along which these methods can be compared, and outline the main challenges before this emerging subfield of NLP, as well as prospects and possible applications.
... In [10], the authors proposed a method for event detection in blogs called temporal random indexing. Blogs are less structured than newswire, but they can be updated very quickly. ...
Chapter
Today, social media play a very important role in sharing real-world information, everyday stories, and thoughts through virtual communities and networks. Different types of social media, such as Twitter, blogs, and news archives, carry heterogeneous information in various formats. This information is useful for real-time events such as disasters, power outages, and traffic. Analyzing and understanding such information across different social media is a challenging task due to the presence of noisy data, unrelated data, and data in different formats. Hence, this paper focuses on various event detection methods in different types of social media and categorizes them according to the media type. Moreover, the features and data sets of various social media are also explained in this paper.
... The topic is heating up because of increasing interest in temporal dynamics [8][9][10][11] and its anticipated connection with the Hamiltonian, a typically quantum interaction (QI) consideration. As proposed earlier, in both CM and QM, it is the Hamiltonian which describes the energy stored in a system, and in order to approach it, finding a way to compute term "mass" is the key. ...
Conference Paper
Full-text available
With insight from linguistics that degrees of text cohesion are similar to forces in physics, and the frequent use of the energy concept in text categorization by machine learning, we consider the applicability of particle-wave duality to semantic content inherent in index terms. Wave-like interpretations go back to the regional nature of such content, utilizing functions for its representation, whereas content as a particle can be conveniently modelled by position vectors. Interestingly, wave packets behave like particles, lending credibility to the duality hypothesis. We show in a classical mechanics framework how metaphorical term mass can be computed.
Conference Paper
We present the S-Space Package, an open source framework for developing and evaluating word space algorithms. The package implements well-known word space algorithms, such as LSA, and provides a comprehensive set of matrix utilities and data structures for extending new or existing models. The package also includes word space benchmarks for evaluation. Both algorithms and libraries are designed for high concurrency and scalability. We demonstrate the efficiency of the reference implementations and also provide their results on six benchmarks.
Chapter
We present a Divide-and-Learn machine learning methodology to investigate a new class of attribute inference attacks against Online Social Networks (OSN) users. Our methodology analyzes commenters’ preferences related to some user publications (e.g., posts or pictures) to infer sensitive attributes of that user. For classification performance, we tune Random Indexing (RI) to compute several embeddings for textual units (e.g., word, emoji), each one depending on a specific attribute value. RI guarantees the comparability of the generated vectors for the different values. To validate the approach, we consider three Facebook attributes: gender, age category and relationship status, which are highly relevant for targeted advertising or privacy threatening applications. By using an XGBoost classifier, we show that we can infer Facebook users’ attributes from commenters’ reactions to their publications with AUC from 94% to 98%, depending on the traits.
Conference Paper
Twitter has become popular among researchers as a means to detect various kinds of events. Several attempts have been made to detect trends, real-world events, news, earthquakes and others with satisfying results. However, these approaches do not perform well at finding local events such as release parties, musicians in a park, or art exhibitions. Many of the local events found by existing algorithms were not related to an actual event but to locations, global events, or just common words. In this paper, we introduce Event Radar, a novel local event detection method that improves precision by analyzing seven days of historic Tweet data. We estimate the average Tweet frequency of keywords per day in and around a potential event area and use these estimates to classify whether the keywords are related to a local event. The proposed scheme achieves a precision rate of 68%, which is a significant improvement compared to related work that reports a precision rate of 25.5%.
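The frequency test the abstract describes, comparing a keyword's count on the current day with its seven-day average in and around a candidate area, reduces to a simple burst check; the sketch below is only an illustration, with an assumed data layout and thresholds.

```python
# Flag a keyword in a candidate event area as "bursting" if today's count
# clearly exceeds its seven-day average there. Thresholds are assumptions.
def is_bursting(last_7_day_counts, todays_count, ratio=3.0, min_count=10):
    baseline = sum(last_7_day_counts) / max(len(last_7_day_counts), 1)
    return todays_count >= min_count and todays_count >= ratio * max(baseline, 1.0)

# is_bursting([2, 1, 0, 3, 2, 1, 2], 25) -> True
```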
Article
Recognizing new and emerging events in a stream of news documents requires understanding the semantic structure of news reported in natural language. New event detection (NED) is the task of recognizing when a news document discusses a completely novel event. To be successful at this task, we believe a NED method must extract and represent four principal components of an event: its type, participants, temporal, and spatial properties. These components must then be compared in a semantically robust manner to detect novelty. We further propose event centrality, a method for determining the most important participants in an event. Our NED methods produce a 29% cost reduction over a bag-of-words baseline and a 17% cost reduction over an existing state-of-the-art approach. Additionally, we discuss our method for recognizing emerging events: the tracking and categorization of unexpected or novel events.
Chapter
This chapter is concerned with geometric representations of biomedical data. In it, we discuss how data elements with multiple features can be considered as vectors in a high-dimensional space, enabling the application of distance metrics as a means to estimate their similarity or relatedness for the purpose of information retrieval, exploration, or classification. In terms of applications, the emphasis in this chapter is on the representation of biomedical text. However, the methods are broadly applicable, and examples from other cases of biomedical research will at times be provided to illustrate this point.
Chapter
Today, online social network services are challenging state-of-the-art social media mining algorithms and techniques due to their real-time nature, their scale and the amount of unstructured data they generate. The continuous interactions between online social network participants generate streams of unbounded text content and evolving network structures within the social streams, which make classical text mining and network analysis techniques obsolete and unsuitable for such new challenges. Performing event detection on online social networks is no exception: state-of-the-art algorithms rely on text mining techniques applied to known datasets that are processed with no restrictions on computational complexity or on the execution time required per document. Moreover, network analysis algorithms used to extract knowledge from users' relations and interactions were not designed to handle evolving networks of such an order of magnitude in terms of the number of nodes and edges. This specific problem of event detection becomes even more serious due to the real-time nature of online social networks. New or unforeseen events need to be identified and tracked in real time, providing accurate results as quickly as possible. It makes no sense to have an algorithm that reports detected events a few hours after they have been announced by traditional newswire.
Conference Paper
Computers understand little about the meaning of human language. Vector space models of semantics are beginning to overcome these limits. In this regard, one of the current challenges is handling high-dimensional data, which can be formulated as tensors. Also, due to the growing amount of information and text, automatic text summarization has become one of the most important issues in information retrieval and natural language processing. In this paper, we propose a new method that uses higher-order singular value decomposition (HOSVD) to extract the concepts of words from a three-dimensional word-document-time tensor and then selects the sentences with the highest cosine similarity to these concepts. We then measure WordNet-based semantic similarity between sentences and remove redundant, less important sentences. The proposed method is evaluated with ROUGE on the standard DUC 2007 data set, and the results indicate that it outperforms many strong systems.
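For readers unfamiliar with HOSVD, the decomposition of a third-order word-document-time tensor can be sketched with plain NumPy as below. This is a generic Tucker-style HOSVD under assumed tensor sizes and ranks, not the summarizer described in the abstract.

```python
# Generic HOSVD of a 3-way tensor: one factor matrix per mode (leading left
# singular vectors of each unfolding) plus a small core tensor.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = T
    for m, Um in enumerate(U):          # project the tensor onto each factor
        core = np.moveaxis(np.tensordot(Um.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, U

T = np.random.rand(30, 20, 5)           # words x documents x time (toy sizes)
core, (U_words, U_docs, U_time) = hosvd(T, ranks=(5, 5, 2))
```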
Article
Full-text available
Topic Detection and Tracking (TDT) is a research initiative that aims at techniques to organize news documents in terms of news events. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms that have the meaning of the same type. Such a group can be associated with an external ontology. This ontology is used to determine the similarity of two terms in the given group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. Measuring the similarity of two documents is conducted by comparing a pair of their corresponding sub-vectors at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and the temporal similarity measures need to be improved. Especially the vagueness of spatial and temporal terms needs to be addressed.
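The document representation this abstract describes, split into semantic classes with a learned weight per class, boils down to a weighted combination of per-class similarities; a minimal sketch with assumed class names and externally supplied weights is given below (the perceptron that tunes the weights is omitted).

```python
# Compare two documents represented as per-class sub-vectors (proper names,
# locations, temporal expressions, normal terms); the overall similarity is a
# weighted sum of per-class cosines. Class names and weights are assumptions.
import numpy as np

CLASSES = ("names", "locations", "temporal", "terms")

def cosine(a, b):
    if not a.any() or not b.any():
        return 0.0
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def doc_similarity(doc1, doc2, weights):
    return sum(weights[c] * cosine(doc1[c], doc2[c]) for c in CLASSES)
```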
Article
Full-text available
The lexical semantic system is an important component of human language and cognitive processing. One approach to modeling semantic knowledge makes use of hand-constructed networks or trees of interconnected word senses (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990; Jarmasz & Szpakowicz, 2003). An alternative approach seeks to model word meanings as high-dimensional vectors, which are derived from the co-occurrence of words in unlabeled text corpora (Landauer & Dumais, 1997; Burgess & Lund, 1997a). This paper introduces a new vector-space method for deriving word-meanings from large corpora that was inspired by the HAL and LSA models, but which achieves better and more consistent results in predicting human similarity judgments. We explain the new model, known as COALS, and how it relates to prior methods, and then evaluate the various models on a range of tasks, including a novel set of semantic similarity ratings involving both semantically and morphologically related terms.
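A distinctive step in COALS-like models is the normalisation of raw co-occurrence counts into correlations, with negative values discarded and the square root taken; the snippet below is a hedged sketch of that step only (the ramped window weighting and column selection are omitted).

```python
# Correlation-style normalisation of a word-by-word co-occurrence count
# matrix, in the spirit of COALS; a sketch, not the reference implementation.
import numpy as np

def coals_normalise(counts):
    counts = counts.astype(float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    corr = (total * counts - row * col) / np.sqrt(row * (total - row) * col * (total - col))
    corr = np.nan_to_num(corr)                    # guard against empty rows/columns
    return np.sqrt(np.clip(corr, 0.0, None))      # drop negatives, take square root
```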
Article
Full-text available
In this paper we present a linguistic resource for the lexical representation of affective knowledge. This resource (named WORDNET-AFFECT) was developed starting from WORDNET, through the selection and labeling of a subset of synsets representing affective meanings. Affective computing is advancing as a field that allows a new form of human computer interaction, in addition to the use of natural language. There is a wide perception that the future of human-computer interaction is in themes such as entertainment, emotions, aesthetic pleasure, motivation, attention, engagement, etc. Studying the relation between natural language and affective information and dealing with its computational treatment is becoming crucial. For the development of WORDNET-AFFECT, we considered as a starting point WORDNET DOMAINS (Magnini and Cavaglia, 2000), a multilingual extension of WordNet developed at ITC-irst. In WORDNET DOMAINS each synset has been annotated with at least one domain label (e.g. SPORT, POLITICS, MEDICINE), selected from a set of about two hundred labels hierarchically organized. A domain may include synsets of different syntactic categories: for instance the domain MEDICINE groups together senses from nouns, such as doctor#1 (i.e. the first sense of the word doctor) and hospital#1, and from verbs such as operate#7. For WORDNET-AFFECT, our goal was to have an additional hierarchy of "affective domain labels", independent from the domain hierarchy, with which the synsets representing affective concepts are annotated.
Article
Full-text available
The Europe Media Monitor system (EMM) gathers and aggregates an average of 50,000 newspaper articles per day in over 40 languages. To manage the information overflow, it was decided to group similar articles per day and per language into clusters and to link daily clusters over time into stories. A story automatically comes into existence when related groups of articles occur within a 7-day window. While cross-lingual links across 19 languages for individual news clusters have been displayed since 2004 as part of a freely accessible online application (http://press.jrc.it/NewsExplorer), the newest development is work on linking entire stories across languages. The evaluation of the monolingual aggregation of historical clusters into stories and of the linking of stories across languages yielded mostly satisfying results.
Article
Full-text available
In this paper we present the DLSIAUES team's participation in the TAC 2008 Opinion Pilot and Recognizing Textual Entailment tasks. Structured in two distinct parts corresponding to these tasks, the paper presents the opinion and textual entailment systems, their components, as well as the tools and methods used to implement the approaches taken. Moreover, we describe the difficulties encountered at different steps and the distinct solutions that were adopted. We present the results of the evaluations performed within TAC 2008, analyze them and comment upon their significance. Finally, we conclude on the performed experiments and present some of the lines for future work.
Article
Full-text available
This paper presents a new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis. By comparing the density of semantic vector clusters this method allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning. Possible applications of this method are then illustrated in tracing the semantic change of 'dog', 'do', and 'deer' in early English and examining and comparing phonaesthemes.
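The cluster-density comparison this abstract relies on can be approximated by the average pairwise cosine of a word's context vectors within a period, compared across periods; the sketch below illustrates that idea under assumed inputs and is not the paper's LSA pipeline.

```python
# "Density" of a word's usage in one period: average pairwise cosine of the
# vectors of the contexts in which the word occurs. Comparing densities
# across periods hints at broadening or narrowing of meaning. Illustrative.
import numpy as np
from itertools import combinations

def cluster_density(context_vectors):
    sims = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in combinations(context_vectors, 2)]
    return float(np.mean(sims)) if sims else 1.0
```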
Article
Full-text available
The CCNU summarization system, PUSMS (Proceeding to Using Semantic Method for Summarization), joins TAC (formerly DUC) for the first time. For the update summarization tasks, we used syntactic-based anaphora resolution and sentence compression algorithms in our system. Term significance was then obtained from frequency-related topic significance and from query-related significance based on co-occurrence information with query terms. For the pilot QA summarization task, a semantic orientation recognition module, which uses WordNet::Similarity::Vector to obtain the similarity of all main part-of-speech terms with benchmark words derived from the General Inquirer, is used in the PUSMS pilot system. We also developed a document classifier and a snippets-related content extraction module for the pilot tasks. In all, our initial work boils down to introducing semantic methods into our former statistical summarization system. By analyzing the evaluation results, we found that we were proceeding toward the right target but still have a long way to go.
Article
Full-text available
We present a fully automatic approach for summarization evaluation that does not require the creation of human model summaries. Our work capitalizes on the fact that a summary contains the most representative information from the input and so it is reasonable to expect that the distribution of terms in the input and a good summary are similar to each other. To compare the term distributions, we use KL and Jensen-Shannon divergence, cosine similarity, as well as unigram and multinomial models of text. Our results on a large scale evaluation from the Text Analysis Conference show that input-summary comparisons can be very effective. They can be used to rank participating systems very similarly to manual model-based evaluations (pyramid evaluation) as well as to manual human judgments of summary quality without reference to a model. Our best feature, Jensen-Shannon divergence, leads to a correlation as high as 0.9 with manual evaluations.
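The abstract's best-performing feature, the Jensen-Shannon divergence between the input's and the summary's term distributions, is easy to compute; here is a minimal, hedged sketch (the smoothing scheme is an assumption, not the paper's exact setup).

```python
# Jensen-Shannon divergence between the term distributions of the input
# documents and of a candidate summary; lower means more input-like.
import math
from collections import Counter

def js_divergence(input_tokens, summary_tokens, eps=1e-12):
    vocab = set(input_tokens) | set(summary_tokens)
    pc, qc = Counter(input_tokens), Counter(summary_tokens)
    p = {w: (pc[w] + eps) / (len(input_tokens) + eps * len(vocab)) for w in vocab}
    q = {w: (qc[w] + eps) / (len(summary_tokens) + eps * len(vocab)) for w in vocab}
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in vocab)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```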
Article
Full-text available
We describe a system that accurately detects emerging trends in a corpus of time-stamped textual documents. Our approach uses a Singular Value Decomposition, a dimensionality reduction approach that is used for other information retrieval and text mining applications, such as Latent Semantic Indexing (LSI). We create a term by document matrix, and apply our dimensionality reduction algorithm in combination with a similarity function to identify clusters of closely associated noun phrases. Use of our approach enables the detection of 92% of the emerging trends, on average, for the five collections we tested.
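The pipeline this abstract outlines (a term-by-document matrix, an SVD-based dimensionality reduction, and a similarity function to group closely associated terms) can be sketched as follows; ranks and thresholds are illustrative assumptions, not the system's actual settings.

```python
# Reduce a term-by-document matrix with a truncated SVD and report pairs of
# terms whose reduced vectors are highly similar. Illustrative only.
import numpy as np

def reduced_term_vectors(term_doc_matrix, k=50):
    U, s, _ = np.linalg.svd(term_doc_matrix, full_matrices=False)
    k = min(k, s.size)
    return U[:, :k] * s[:k]               # one k-dimensional vector per term

def similar_term_pairs(vectors, threshold=0.8):
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12)
    sims = unit @ unit.T
    i, j = np.where(np.triu(sims, 1) > threshold)
    return list(zip(i.tolist(), j.tolist()))
```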
Conference Paper
Full-text available
Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.
Conference Paper
Full-text available
There is an exploding amount of user-generated content on the Web due to the emergence of "Web 2.0" services, such as Blogger, MySpace, Flickr, and del.icio.us. The participation of a large number of users in sharing their opinion on the Web has inspired researchers to build an effective "information filter" by aggregating these independent opinions. However, given the diverse groups of users on the Web nowadays, the global aggregation of the information may not be of much interest to different groups of users. In this paper, we explore the possibility of computing personalized aggregation over the opinions expressed on the Web based on a user's indication of trust over the information sources. The hope is that by employing such "personalized" aggregation, we can make the recommendation more likely to be interesting to the users. We address the challenging scalability issues by proposing an efficient method that utilizes two core techniques, Non-Negative Matrix Factorization and the Threshold Algorithm, to compute personalized aggregations when there are potentially millions of users and millions of sources within a system. We show, through experiments on a real-life dataset, that our personalized aggregation approach indeed makes a significant difference in the items that are recommended and that it reduces the query computational cost significantly, often by more than 75%, while the result of personalized aggregation is kept accurate enough.
Conference Paper
Full-text available
This paper discusses the task of tracking mentions of some topically interesting textual entity from a continuously and dynamically changing flow of text, such as a news feed, the output from an Internet crawler or a similar text source — a task sometimes referred to as buzz monitoring. Standard approaches from the field of information access for identifying salient textual entities are reviewed, and it is argued that the dynamics of buzz monitoring calls for more accomplished analysis mechanisms than the typical text analysis tools provide today. The notion of word space is introduced, and it is argued that word spaces can be used to select the most salient markers for topicality, find associations those observations engender, and that they constitute an attractive foundation for building a representation well suited for the tracking and monitoring of mentions of the entity under consideration.
Conference Paper
Full-text available
This paper surveys current text and speech summarization evaluation approaches. It discusses advantages and disadvantages of these, with the goal of identifying summarization techniques most suitable to speech summarization. Precision/recall schemes, as well as summary accuracy measures which incorporate weightings based on multiple human decisions, are suggested as particularly suitable in evaluating speech summaries. Index Terms: evaluation, text and speech summarization
Conference Paper
Full-text available
Current results in basic Information Extraction tasks such as Named Entity Recognition or Event Extraction suggest that we are close to achieving a stage where the fundamental units for text understanding are put together; namely, predicates and their arguments. However, other layers of information, such as event modality, are essential for understanding, since the inferences derivable from factual events are obviously different from those judged as possible or non-existent. In this paper, we first map out the scope of modality in natural language; we propose a specification language for annotating this information in text; and finally we describe two tools that automatically recognize modal information in natural language text.
Conference Paper
Full-text available
Opinion mining is the task of extracting from a set of documents opinions expressed by a source on a specified target. This article presents a comparative study on the methods and resources that can be employed for mining opinions from quotations (reported speech) in newspaper articles. We show the difficulty of this task, motivated by the presence of different possible targets and the large variety of affect phenomena that quotes contain. We evaluate our approaches using annotated quotations extracted from news provided by the EMM news gathering engine. We conclude that a generic opinion mining system requires both the use of large lexicons, as well as specialised training and testing data.
Conference Paper
Full-text available
In this paper, we analyze the state of current human and automatic evaluation of topic-focused summarization in the Document Understanding Conference main task for 2005-2007. The analyses show that while ROUGE has very strong correlation with responsiveness for both human and automatic summaries, there is a significant gap in responsiveness between humans and systems which is not accounted for by the ROUGE metrics. In addition to teasing out gaps in the current automatic evaluation, we propose a method to maximize the strength of current automatic evaluations by using the method of canonical correlation. We apply this new evaluation method, which we call ROSE (ROUGE Optimal Summarization Evaluation), to find the optimal linear combination of ROUGE scores to maximize correlation with human responsiveness.
Conference Paper
Full-text available
This note describes a scoring scheme for the coreference task in MUC6. It improves on the original approach by: (1) grounding the scoring scheme in terms of a model; (2) producing more intuitive recall and precision scores; and (3) not requiring explicit computation of the transitive closure of coreference. The principal conceptual difference is that we have moved from a syntactic scoring model based on following coreference links to an approach defined by the model theory of those links.
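The model-theoretic MUC link-based score this note describes can be condensed into a few lines: recall counts, for each key chain, how many links survive the partition induced by the response chains, and precision swaps the roles. The sketch below is a hedged reading of that definition with an assumed input format (chains as lists of mention identifiers), not the official scorer.

```python
# MUC-style coreference recall/precision over chains of mention identifiers.
def muc_recall(key_chains, response_chains):
    num = den = 0
    for chain in map(set, key_chains):
        parts = {frozenset(chain & set(r)) for r in response_chains if chain & set(r)}
        missing = chain - set().union(*parts) if parts else chain
        partition_size = len(parts) + len(missing)   # unmatched mentions are singletons
        num += len(chain) - partition_size           # correct links
        den += len(chain) - 1                        # links in the key chain
    return num / den if den else 0.0

def muc_precision(key_chains, response_chains):
    return muc_recall(response_chains, key_chains)   # roles reversed
```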
Conference Paper
Full-text available
We present an empirically grounded method for evaluating content selection in summarization. It incorporates the idea that no single best model summary for a collection of documents exists. Our method quantifies the relative importance of facts to be conveyed. We argue that it is reliable, predictive and diagnostic, thus improves considerably over the shortcomings of the human evaluation method currently used in the Document Understanding Conference.
Conference Paper
Full-text available
The paper proposes a Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference resolution. The metric is computed by aligning reference and system entities (or coreference chains) with the constraint that a system (reference) entity is aligned with at most one reference (system) entity. We show that the best alignment is a maximum bipartite matching problem which can be solved by the Kuhn-Munkres algorithm. Comparative experiments are conducted to show that the widely-known MUC F-measure has serious flaws in evaluating a coreference system. The proposed metric is also compared with the ACE-Value, the official evaluation metric in the Automatic Content Extraction (ACE) task, and we conclude that the proposed metric possesses some properties such as symmetry and better interpretability missing in the ACE-Value.
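The alignment step at the heart of CEAF, a one-to-one matching between key and response entities that maximizes total similarity via the Kuhn-Munkres algorithm, can be sketched as below. This uses the mention-overlap similarity (the phi3-style variant) and an assumed chain format; it is an illustration, not the official metric implementation.

```python
# CEAF-style alignment: align key and response entities one-to-one so that
# total mention overlap is maximal, then derive recall and precision.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_mention_scores(key_entities, response_entities):
    phi = np.array([[len(set(k) & set(r)) for r in response_entities]
                    for k in key_entities], dtype=float)
    rows, cols = linear_sum_assignment(-phi)       # maximise total similarity
    best = phi[rows, cols].sum()
    recall = best / sum(len(k) for k in key_entities)
    precision = best / sum(len(r) for r in response_entities)
    return recall, precision
```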
Chapter
In order to realize the semantic web vision, the creation of semantic annotation, the linking of web pages to ontologies, and the creation, evolution and interrelation of ontologies must become automatic or semi-automatic processes. Natural Language Generation (NLG) takes structured data in a knowledge base as input and produces natural language text, tailored to the presentational context and the target reader. NLG techniques use and build models of the context and the user and use them to select appropriate presentation strategies. In the context of Semantic Web or knowledge management, NLG can be applied to provide automated documentation of ontologies and knowledge bases. Unlike human-written texts, an automatic approach will constantly keep the text up-to-date which is vitally important in the Semantic Web context where knowledge is dynamic and is updated frequently. This chapter presents several Natural Language Generation (NLG) techniques that produce textual summaries from Semantic Web ontologies. The main contribution is in showing how existing NLG tools can be adapted to take Semantic Web ontologies as their input, in a way which minimizes the customization effort. A major factor in the quality of the generated text is the content of the ontology itself. For instance, the use of string datatype properties with implicit semantics leads to the generation of text with missing semantic information. Three approaches to overcome this problem are presented and users can choose the one that suits their application best.
Article
This paper proposes a new bootstrapping framework using cross-lingual information projection. We demonstrate that this framework is particularly effective for a challenging NLP task which is situated at the end of a pipeline, and thus suffers from errors propagated from upstream processing and has a low-performance baseline. Using Chinese event extraction as a case study and bitexts as a new source of information, we present three bootstrapping techniques. We first conclude that the standard mono-lingual bootstrapping approach is not very effective. Then we exploit a second approach that potentially benefits from the extra information captured by an English event extraction system and projected into Chinese. Such a cross-lingual scheme produces a significant performance gain. Finally, we show that combining mono-lingual and cross-lingual information in bootstrapping can further enhance performance. Ultimately this new framework obtained a 10.1% relative improvement in trigger labeling (F-measure) and a 9.5% relative improvement in argument labeling.
Article
This article reports our progress in the classification of expressions of emotion in network-based chat conversations. Emotion detection of this nature is currently an active area of research (8) (9). We detail a linguistic approach to the tagging of chat conversation with appropriate emotion tags. In our approach, textual chat messages are automatically converted into speech and then instance vectors are generated from frequency counts of speech phonemes present in each message. In combination with other statistically derived attributes, the instance vectors are used in various machine-learning frameworks to build classifiers for emotional content. Based on the standard metrics of precision and recall, we report results exceeding 90% accuracy when employing k-nearest-neighbor learning. Our approach has thus shown promise in discriminating emotional from non-emotional content in independent testing.
Article
One of the essential characteristics of blogs is their subjectivity, which makes blogs a particularly interesting domain for research on automatic sentiment determination. In this paper, we explore the properties of the two most common subgenres of blogs - personal diaries and "notebooks" - and the effects that these properties have on the performance of an automatic sentiment annotation system, which we developed for binary (positive vs. negative) and ternary (positive vs. negative vs. neutral) classification of sentiment at the sentence level. We also investigate the differential effect of inclusion of negations and other valence shifters on the performance of our system on these two subgenres of blogs.
Article
This paper describes an application of statistical NLP for the extraction of significant topics from time-dependent data. Text from daily national newspapers is taken as an example. Based on a comparison with a large reference corpus, a small number of terms is selected, categorized and clustered in order to describe characteristic topics of the analyzed texts by these terms. Statistical word co-occurrences are then used extensively to visualize the course of events. The result can be regarded as an enriched type of electronic press report and archive.
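As an illustration of the term-selection step, the sketch below ranks terms by how strongly their frequency in the analyzed texts deviates from a reference corpus, using Dunning's log-likelihood ratio as one common choice (the counts are invented and the statistic is not necessarily the one used in the paper).

# A minimal sketch of reference-corpus term selection with toy counts.
import math
from collections import Counter

day_counts = Counter({"election": 40, "storm": 5, "the": 900})
ref_counts = Counter({"election": 50, "storm": 60, "the": 90000})
day_total = sum(day_counts.values())
ref_total = sum(ref_counts.values())

def log_likelihood(a, b, c, d):
    """G^2 for a term seen a times in the day corpus (size c)
    and b times in the reference corpus (size d)."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

scored = {w: log_likelihood(day_counts[w], ref_counts.get(w, 0),
                            day_total, ref_total)
          for w in day_counts}
for word, score in sorted(scored.items(), key=lambda x: -x[1]):
    print(word, round(score, 1))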
Article
In this report we present the overall architecture for the NYU English ACE 2005 system. We focus on two components which were evaluated this year: reference resolution, where we experimented with features for relating anaphor and antecedent, and event recognition, where we sought to take advantage of a rich combination of logical grammatical structure and predicate-argument structure, building on the recent work on PropBank and NomBank.
Article
We show that sequence information can be encoded into high-dimensional fixed-width vectors using permutations of coordinates. Computational models of language often represent words with high-dimensional semantic vectors compiled from word-use statistics. A word's semantic vector usually encodes the contexts in which the word appears in a large body of text but ignores word order. However, word order often signals a word's grammatical role in a sentence and thus carries information about the word's meaning. Jones and Mewhort (2007) show that word order can be included in the semantic vectors using holographic reduced representation and convolution. We show here that the order information can also be captured by permuting vector coordinates, thus providing a general and computationally light alternative to convolution.
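The permutation idea can be illustrated in a few lines. In the sketch below (toy corpus and parameters; np.roll stands in for an arbitrary fixed permutation), rotated sparse index vectors are accumulated so that a neighbour one step to the left contributes evidence distinguishable from the same neighbour one step to the right.

# A minimal sketch of order-aware random indexing on a toy corpus.
import numpy as np

rng = np.random.default_rng(1)
dim, nonzeros = 1000, 10
corpus = "the cat sat on the mat while the dog sat on the rug".split()
vocab = sorted(set(corpus))

def random_index_vector():
    """Sparse ternary vector: a few random +1/-1 entries, rest zeros."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzeros, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzeros)
    return v

index = {w: random_index_vector() for w in vocab}
context = {w: np.zeros(dim) for w in vocab}

window = 2
for i, w in enumerate(corpus):
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(corpus):
            continue
        # Rotating the neighbour's index vector by its relative position
        # keeps left and right occurrences distinct in the sum.
        context[w] += np.roll(index[corpus[j]], offset)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print("cat ~ dog:", round(cosine(context["cat"], context["dog"]), 2))
print("cat ~ on :", round(cosine(context["cat"], context["on"]), 2))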
Article
Event detection and recognition is a complex task consisting of multiple sub-tasks of varying difficulty. In this paper, we present a simple, modular approach to event extraction that allows us to experiment with a variety of machine learning methods for these sub-tasks, as well as to evaluate the impact these sub-tasks have on the performance of the overall task.
Conference Paper
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures - ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S - included in the ROUGE summarization evaluation package, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
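The core ROUGE-N computation is clipped n-gram recall against a set of reference summaries. The sketch below shows only that computation (the released ROUGE package adds stemming, stopword options and jackknifing); the example texts are invented.

# A minimal sketch of ROUGE-N as clipped n-gram recall.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """Clipped n-gram recall of `candidate` against `references` (token lists)."""
    cand = ngrams(candidate, n)
    overlap = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        overlap += sum(min(c, cand.get(g, 0)) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

refs = ["the police arrested the suspect on friday".split()]
cand = "police arrested a suspect friday".split()
print(round(rouge_n(cand, refs, n=1), 3))
print(round(rouge_n(cand, refs, n=2), 3))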
Article
Current concerns about plagiarism participate in a culture-wide anxiety that mirrors the cultural climate in previous textual revolutions. In today's revolution, the Internet is described as the cause of a perceived increase in plagiarism, and plagiarism-detecting services (PDSs) are described as the best solution. The role of the Internet should be understood, however, not just in terms of access to text but also in terms of textual relationships. Synthesizing representations of iText with literary theories of intertextuality suggests that all writers work intertextually, all readers interpret texts intertextually, and new media not only increase the number of texts through which both writers and readers work but also offer interactive information technologies in which unacknowledged appropriation from sources does not necessarily invalidate the text. Plagiarism-detecting services, in contrast, describe textual appropriation solely in terms of individual ethics. The best response to concerns about plagiarism is revised institutional plagiarism policies combined with authentic pedagogy that derives from an understanding of IText, intertextuality, and new media.
Article
This paper presents a new method for computing a thesaurus from a text corpus. Each word is represented as a vector in a multi-dimensional space that captures cooccurrence information. Words are defined to be similar if they have similar cooccurrence patterns. Two different methods for using these thesaurus vectors in information retrieval are shown to significantly improve performance over the Tipster reference corpus as compared to a term vector space baseline.
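The underlying idea can be shown on a toy corpus: represent each word by its window co-occurrence counts and call two words similar when their count vectors have high cosine similarity (unlike the paper, no dimensionality reduction or retrieval integration is attempted here).

# A minimal sketch of co-occurrence-based word similarity on a toy corpus.
import numpy as np

corpus = ("the doctor treated the patient the nurse treated the patient "
          "the engineer built the bridge the engineer built the road").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
window = 2

counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            counts[idx[w], idx[corpus[j]]] += 1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print("doctor ~ nurse   :", round(cosine(counts[idx["doctor"]], counts[idx["nurse"]]), 2))
print("doctor ~ engineer:", round(cosine(counts[idx["doctor"]], counts[idx["engineer"]]), 2))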
Conference Paper
In this paper, we address the problem of event coreference resolution as specified in the Automatic Content Extraction (ACE) program. In contrast to entity coreference resolution, event coreference resolution has not received great attention from researchers. In this paper, we first demonstrate the diverse scenarios of event coreference by an example. We then model event coreference resolution as a spectral graph clustering problem and evaluate the clustering algorithm on ground-truth event mentions using the ECM F-Measure. We obtain ECM-F scores of 0.8363 and 0.8312, respectively, by using two methods for computing coreference matrices.
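A minimal way to reproduce the clustering step is shown below: a hypothetical pairwise coreference-affinity matrix over event mentions is clustered with off-the-shelf spectral clustering. The mention names, affinity values and cluster count are all invented, and no claim is made that this matches the paper's matrix construction.

# A minimal sketch of event coreference as spectral clustering of an affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

mentions = ["attack-1", "attack-2", "bombing-1", "meeting-1", "meeting-2"]
# Hypothetical symmetric pairwise coreference likelihoods (1.0 on the diagonal).
affinity = np.array([
    [1.0, 0.9, 0.7, 0.1, 0.1],
    [0.9, 1.0, 0.6, 0.1, 0.1],
    [0.7, 0.6, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
])

clustering = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = clustering.fit_predict(affinity)
for mention, label in zip(mentions, labels):
    print(label, mention)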
Conference Paper
New Event Detection is a challenging task that still offers scope for great improvement after years of effort. In this paper we show how performance on New Event Detection (NED) can be improved by the use of text classification techniques as well as by using named entities in a new way. We explore modifications to the document representation in a vector space-based NED system. We also show that addressing named entities preferentially is useful only in certain situations. A combination of all the above results in a multi-stage NED system that performs much better than baseline single-stage NED systems.
Conference Paper
One of the major problems in question answering (QA) is that the queries are either too brief or often do not contain most relevant terms in the target corpus. In order to overcome this problem, our earlier work integrates external knowledge extracted from the Web and WordNet to perform Event-based QA on the TREC-11 task. This paper extends our approach to perform event-based QA by uncovering the structure within the external knowledge. The knowledge structure loosely models different facets of QA events, and is used in conjunction with successive constraint relaxation algorithm to achieve effective QA. Our results obtained on TREC-11 QA corpus indicate that the new approach is more effective and able to attain a confidence-weighted score of above 80%.
Conference Paper
We present a new method and system for performing the New Event Detection task, i.e., in one or multiple streams of news stories, all stories on a previously unseen (new) event are marked. The method is based on an incremental TF-IDF model. Our extensions include: generation of source-specific models, similarity score normalization based on document-specific averages, similarity score normalization based on source-pair specific averages, term reweighting based on inverse event frequencies, and segmentation of the documents. We also report on extensions that did not improve results. The system performs very well on TDT3 and TDT4 test data and scored second in the TDT-2002 evaluation.
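The incremental TF-IDF idea can be sketched as follows (toy stories and threshold; none of the paper's source-specific models, score normalizations or reweighting schemes are included): each incoming story is compared with the stories seen so far and flagged as a new event when its best similarity stays low.

# A minimal sketch of new-event detection with an incrementally updated TF-IDF model.
import math
from collections import Counter

df = Counter()          # document frequencies, updated as stories arrive
seen = []               # term-frequency Counters of previously processed stories
n_docs = 0
THRESHOLD = 0.1         # hypothetical decision threshold

def tfidf(tf):
    return {t: c * math.log((n_docs + 1) / (df[t] + 0.5)) for t, c in tf.items()}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

stream = [
    "earthquake strikes coastal city hundreds evacuated",
    "coastal city earthquake toll rises rescue continues",
    "parliament passes new budget after long debate",
]
for story in stream:
    tf = Counter(story.split())
    weights = tfidf(tf)
    best = max((cosine(weights, tfidf(prev)) for prev in seen), default=0.0)
    print("NEW EVENT" if best < THRESHOLD else "old event", "|", story)
    # Incremental update: the story enters the model only after the decision.
    n_docs += 1
    df.update(tf.keys())
    seen.append(tf)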
Conference Paper
Blogs and formal news sources both monitor the events of the day, but with substantially different frames of reference. In this paper, we report on experiments comparing over 500,000 blog postings with the contents of 66 daily newspapers over the same six week period. We compare the prevalence of popular topics in the blogspace and news, and in particular analyze lead/lag relationships in frequency time series of 197 entities in the two corpora. The correlation between news and blog references proved substantially higher when adjusting for lead/lag shifts, although the direction of these shifts varied for different entities.
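The lead/lag analysis amounts to correlating two frequency time series at different shifts and keeping the shift with the highest correlation. The sketch below does this on synthetic series in which blog mentions trail news mentions by roughly two days; it is an illustration of the idea, not the paper's pipeline.

# A minimal sketch of lead/lag estimation between two mention-count series.
import numpy as np

rng = np.random.default_rng(3)
days = 42
news = rng.poisson(5, size=days).astype(float)
blogs = np.roll(news, 2) + rng.normal(0, 1, size=days)   # blogs trail news by ~2 days

def lagged_corr(x, y, lag):
    """Pearson correlation of x[t] with y[t + lag]."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

scores = {lag: lagged_corr(news, blogs, lag) for lag in range(-7, 8)}
best = max(scores, key=scores.get)
print("best lag (days):", best, "correlation:", round(scores[best], 2))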
Conference Paper
The volume of discussion about a product in weblogs has recently been shown to correlate with the product's financial performance. In this paper, we study whether applying sentiment analysis methods to weblog data results in better correlation than volume only, in the domain of movies. Our main finding is that positive sentiment is indeed a better predictor for movie success when applied to a limited context around references to the movie in weblogs, posted prior to its release. "If my film makes one more person miserable, I've done my job."
Conference Paper
In this paper, we investigate the emotion classification of web blog corpora using support vector machine (SVM) and conditional random field (CRF) machine learning techniques. The emotion classifiers are trained at the sentence level and applied to the document level. Our methods also determine an emotion category by taking the context of a sentence into account. Experiments show that CRF classifiers outperform SVM classifiers. When applying emotion classification to a blog at the document level, the emotion of the last sentence in a document plays an important role in determining the overall emotion.
Conference Paper
This paper presents a Text Summarisation approach which combines three different features (word frequency, textual entailment, and the Code Quantity Principle) in order to produce extracts from newswire documents in English. Experiments show that the proposed combination is appropriate for generating summaries, improving the system's performance by 10% over the best DUC 2002 participant. Moreover, a preliminary analysis of the suitability of these features for domain-independent documents has been carried out, also obtaining encouraging results.
Conference Paper
The goal of the Blog track is to explore the information seeking behaviour in the blogosphere. It aims to create the required infrastructure to facilitate research into the blogosphere and to study retrieval from blogs and other related applied tasks. The track was
Conference Paper
We present the S-Space Package, an open source framework for developing and evaluating word space algorithms. The package implements well-known word space algorithms, such as LSA, and provides a comprehensive set of matrix utilities and data structures for extending new or existing models. The package also includes word space benchmarks for evaluation. Both algorithms and libraries are designed for high concurrency and scalability. We demonstrate the efficiency of the reference implementations and also provide their results on six benchmarks.
Conference Paper
We apply the hypothesis of "One Sense Per Discourse" (Yarowsky, 1995) to information extraction (IE), and extend the scope of "dis- course" from one single document to a cluster of topically-related documents. We employ a similar approach to propagate consistent event arguments across sentences and documents. Combining global evidence from related doc- uments with local decisions, we design a sim- ple scheme to conduct cross-document inference for improving the ACE event ex- traction task 1 . Without using any additional labeled data this new approach obtained 7.6% higher F-Measure in trigger labeling and 6% higher F-Measure in argument labeling over a state-of-the-art IE system which extracts events independently for each sentence.
Conference Paper
This article presents a study of the challenges and possible solutions to the issues raised by opinion (multi-perspective) question answering in a non-traditional textual genre setting. We show why this task is more difficult than traditional question answering and what additional methods, tools and resources are needed to solve it. We test our different hypotheses on question classification, answer retrieval and validation on mixed fact/opinion question sets, and on opinion questions and annotated answers from two different genres: OpQA, a corpus of newspaper articles, and EmotiBlog, the blog post corpus we created and annotated. We discuss our findings, drawing conclusions and outlining lines for future work.
Conference Paper
In this paper, we present a Chinese event extraction system. We point out a language-specific issue in Chinese trigger labeling, and then discuss the contributions of lexical, syntactic and semantic features applied in trigger labeling and argument labeling. As a result, we achieved competitive performance: specifically, an F-measure of 59.9 in trigger labeling and an F-measure of 43.8 in argument labeling.
Article
The optimal amount of information needed in a given decision-making situation lies somewhere along a continuum from "not enough" to "too much". Ackoff proposed that information systems often hinder the decision-making process by creating information overload. To deal with this problem, he called for systems that could filter and condense data so that only relevant information reached the decision maker. The potential for information overload is especially critical in text-based information. The purpose of this research is to investigate the effects and theoretical limitations of extract condensing as a text processing tool in terms of recipient performance. In the experiment described here, an environment is created in which the effects of text condensing are isolated from the effects of message and individual recipient differences. The data show no difference in reading comprehension performance between the condensed forms and the original document. This indicates that condensed forms can be produced that are equally as informative as the original document. These results suggest that it is possible to apply a relatively simple computer algorithm to text and produce extracts that capture enough of the information contained in the original document so that the recipient can perform as if he or she had read the original. These results also identify a methodology for assessing the effectiveness of text condensing schemes. The research presented here contributes to a small but growing body of work on text-based information systems and, specifically, text condensing.