Article

Event graphs for information retrieval and multi-document summarization

Authors:
Goran Glavaš, Jan Šnajder

Abstract

With the number of documents describing real-world events and event-oriented information needs rapidly growing on a daily basis, the need for efficient retrieval and concise presentation of event-related information is becoming apparent. Nonetheless, the majority of information retrieval and text summarization methods rely on shallow document representations that do not account for the semantics of events. In this article, we present event graphs, a novel event-based document representation model that filters and structures the information about events described in text. To construct the event graphs, we combine machine learning and rule-based models to extract sentence-level event mentions and determine the temporal relations between them. Building on event graphs, we present novel models for information retrieval and multi-document summarization. The information retrieval model measures the similarity between queries and documents by computing graph kernels over event graphs. The extractive multi-document summarization model selects sentences based on the relevance of the individual event mentions and the temporal structure of events. Experimental evaluation shows that our retrieval model significantly outperforms well-established retrieval models on event-oriented test collections, while the summarization model outperforms competitive models from shared multi-document summarization tasks.
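To make the kernel-based retrieval idea above concrete, here is a minimal Python sketch of a geometric random-walk kernel computed on the direct product of two toy event graphs. The graphs, event labels, and discount factor lam are invented for illustration; the paper's actual kernel family and event-graph construction may differ.

```python
import numpy as np

def product_adjacency(a1, labels1, a2, labels2):
    """Direct product graph over label-compatible node pairs."""
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    ax = np.zeros((len(pairs), len(pairs)))
    for p, (i, j) in enumerate(pairs):
        for q, (k, l) in enumerate(pairs):
            ax[p, q] = a1[i, k] * a2[j, l]  # walk step must exist in both graphs
    return ax

def random_walk_kernel(a1, labels1, a2, labels2, lam=0.1):
    """Counts common walks of every length, discounted geometrically by lam."""
    ax = product_adjacency(a1, labels1, a2, labels2)
    n = ax.shape[0]
    if n == 0:
        return 0.0
    # sum_k lam^k * A_x^k  ==  (I - lam * A_x)^(-1)  for small enough lam
    return float(np.ones(n) @ np.linalg.inv(np.eye(n) - lam * ax) @ np.ones(n))

# Toy event graphs: nodes are event mentions (labelled by their anchor word),
# directed edges encode the temporal relation "before".
g1 = np.array([[0., 1.],
               [0., 0.]])                      # erupt -> evacuate
g2 = np.array([[0., 1., 0.],
               [0., 0., 1.],
               [0., 0., 0.]])                  # erupt -> evacuate -> return
print(random_walk_kernel(g1, ["erupt", "evacuate"],
                         g2, ["erupt", "evacuate", "return"]))
```

The kernel grows with the number of temporally ordered event pairs the two graphs share, which is the intuition behind matching event-oriented queries against documents.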


... For example, Chen et al. [14] defined a graphic event as a semantic network extracted from a documental event. Glavaš and Šnajder [250] defined a graphic event as a temporal structure. In the EVIN system [251], a graphic event is defined by its participants, e.g., persons and organizations. ...
... Depending on the type of node, many techniques are available. If a node is defined as a named entity (e.g., [12], [62], [239]) or an event mention (e.g., [250]), recognizing it is usually cast as a sequence labelling task. If a node is defined as a documental event, document clustering and classification are adopted [248], [249]. ...
... Furthermore, it may lead to the "semantic drift" problem caused by error accumulation [261]. The chronological order relation is mainly used in text streams, where the time stamp is usually available [248], [250]. For subevent (inclusion) relations between nodes, the edges are commonly measured by a similarity function. ...
Article
Full-text available
There are large and growing amounts of textual data that contain information about human activities. Mining interesting knowledge from this textual data is a challenging task because it consists of unstructured or semi-structured text written in natural language. In the field of artificial intelligence, event-oriented techniques are helpful in addressing this problem, where information retrieval (IR), information extraction (IE) and graph methods (GMs) are three of the most important paradigms in supporting event-oriented processing. In recent years, due to the information explosion, textual event detection and recognition have received extensive research attention and achieved great success. Many surveys have been conducted to retrospectively assess the development of event detection. However, until now, all of these surveys have focused on only a single aspect of IR, IE or GMs. There is no research that provides a complete introduction to, or a comparison of, IR, IE, and GMs. In this article, a survey of these techniques is provided from a broader perspective, and a convenient and comprehensive comparison of these techniques is given. The hallmark of this article is that it is the first survey that combines IR, IE and GMs in a single frame and will therefore benefit researchers by acting as a reference in this field.
... Recent research (Glavas and Snajder, 2014) has demonstrated that traditional IR yields low accuracy when applied to documents centered on events, such as police reports, medical records, and breaking news. As one can imagine, documents centered on events occur in large quantities and often contain very valuable information. ...
... In fact, in practice this use case may be even more complicated if the computer is connected to the router through other devices, in which case identifying a match requires reasoning about how the loss of connectivity propagates recursively to any device connected to the router. Glavas and Snajder (2014) proposed a new approach, called event-centered IR, which succeeded in increasing match accuracy by means of some level of semantic analysis. However, their approach was limited to matching events mentioned in both queries and sources. ...
... For illustration purposes, we start with a simple problem, which we progressively elaborate. An exhaustive evaluation of our approach over realistic examples is beyond the scope of the paper: as discussed by Glavas and Snajder (2014), the existing benchmarks for IR tasks are not suitable for the evaluation of semantic-level matching approaches, and the development of suitable datasets is a research project in its own right, which we will tackle and document separately. However, we conducted a preliminary investigation of the scalability of our approach, whose results are discussed later in the paper. ...
Article
Full-text available
Information retrieval (IR) aims at retrieving documents that are most relevant to a query provided by a user. Traditional techniques rely mostly on syntactic methods. In some cases, however, links at a deeper semantic level must be considered. In this paper, we explore a type of IR task in which documents describe sequences of events, and queries are about the state of the world after such events. In this context, successfully matching documents and query requires considering the events’ possibly implicit uncertain effects and side effects. We begin by analyzing the problem, then propose an action language-based formalization, and finally automate the corresponding IR task using answer set programming.
... In order to access information in minimal time, it is necessary to represent the information in a more compact format. Automatic Text Summarization (ATS) is one solution that addresses this need; research on text summarization goes back over five decades [5], [6]. Text summarization is the process of condensing a text in such a way that the valuable information is not missed in the generated summary while the redundancy of the original is avoided [1], [2]. ...
... Multi-document summarization handles cases where the information is spread over multiple sources and documents. For instance, the same content may be covered by multiple sources, so a number of documents may be available that give insight into the same event [5]. In this regard, a multi-document summary becomes a representation of the information contained in a cluster of documents, which helps users understand the gist of those documents [9], [10]. ...
Article
Full-text available
With the tremendous growth in the number of electronic documents, it is becoming challenging to manage the volume of information. Much research has focused on automatically summarizing the information available in the documents. Multi-Document Summarization (MDS) is one approach that aims to extract the information from the available documents in such a concise way that none of the important points are missed from the summary while avoiding the redundancy of information at the same time. This study presents an extensive survey of extractive MDS over the last decade to show the progress of research in this field. We present different techniques of extractive MDS and compare their strengths and weaknesses. Research work is presented by category and evaluated to help the reader understand the work in this field and to guide them in defining their own research directions. Benchmark datasets and standard evaluation techniques are also presented. This study concludes that most of the extractive MDS techniques are successful in developing salient and information-rich summaries of the documents provided.
... [11] applies a graph-based approach for enhanced bibliographic retrieval to a co-citation network that incorporates citation context information; the method is based on a graph similarity calculation algorithm and the Random Walk with Restart (RWR) algorithm. The authors of [12] describe and structure the events of a document in order to build the text summary. The method consists of building the event graph by combining machine learning and rule-based methods. ...
... Indeed, [18] identifies a complete set of temporal relations which can exist between two intervals. In [12], a document is represented as an event graph that records not only the events themselves but also the temporal connections between them. ...
... So far, the summarization problem has been modeled using different linguistic, probabilistic, machine learning, and graph-based approaches [3,5]. Previous studies indicate that the graph-based approach has great potential to be adopted for summarization of different types of general and domain-specific documents [6][7][8][9][10][11][12]. However, there are still two main challenges that need to be addressed in graph-based summarization. ...
... Various studies addressed the summarization problem by adopting a graph-theoretic approach [6][7][8][9][10][11][12]. GraphSum [12] constructs a correlation graph in which the nodes are terms and the edges represent co-occurrence relations between the terms. ...
Article
Full-text available
Text summarization tools can help biomedical researchers and clinicians reduce the time and effort needed for acquiring important information from numerous documents. It has been shown that the input text can be modeled as a graph, and important sentences can be selected by identifying central nodes within the graph. However, the effective representation of documents, quantifying the relatedness of sentences, and selecting the most informative sentences are main challenges that need to be addressed in graph-based summarization. In this paper, we address these challenges in the context of biomedical text summarization. We evaluate the efficacy of a graph-based summarizer using different types of context-free and contextualized embeddings. The word representations are produced by pre-training neural language models on large corpora of biomedical texts. The summarizer models the input text as a graph in which the strength of relations between sentences is measured using the domain specific vector representations. We also assess the usefulness of different graph ranking techniques in the sentence selection step of our summarization method. Using the common Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics, we evaluate the performance of our summarizer against various comparison methods. The results show that when the summarizer utilizes proper combinations of context-free and contextualized embeddings, along with an effective ranking method, it can outperform the other methods. We demonstrate that the best settings of our graph-based summarizer can efficiently improve the informative content of summaries and decrease the redundancy.
... It somehow triggers the emergence of knowledge representation forms for events and their relations. In 2014, Glavaš and Šnajder [23] proposed the event graph as a novel event-based document representation model that filters and structures the information about events described in texts, to address the need for efficient retrieval and concise presentation of event-related information. In this event graph, nodes denote events consisting of triggers and arguments (only subject, object, time, and location are considered), and edges indicate temporal relations between events. ...
... As presented in Table 1, event evolutionary graph [26] and event logic graph [28] only focus on schema-level event knowledge. Nodes in event graph [23], [24] and event logic graph are composite structures, which are difficult to handle. Moreover, these EKG-related concepts all consider specific and limited argument roles, as well as specific and limited relations between events. ...
Preprint
Full-text available
Besides entity-centric knowledge, usually organized as a Knowledge Graph (KG), events are also an essential kind of knowledge in the world, which triggers the emergence of event-centric knowledge representation forms like the Event KG (EKG). It plays an increasingly important role in many machine learning and artificial intelligence applications, such as intelligent search, question-answering, recommendation, and text generation. This paper provides a comprehensive survey of EKG from the history, ontology, instance, and application views. Specifically, to characterize EKG thoroughly, we focus on its history, definitions, schema induction, acquisition, related representative graphs/systems, and applications. The development processes and trends are studied therein. We further summarize perspective directions to facilitate future research on EKG.
... Nguyen-Hoang et al. [40] employed a graph-based PageRank algorithm for summarization of Vietnamese documents. An extractive approach based on event graphs [41] used hand-crafted rules for the generation of multi-document summaries. ...
Article
Full-text available
Huge amounts of web data come from discussion forums, which contain millions of threads. Discussion threads are a valuable source of knowledge for Internet users, as they have information about numerous topics. A discussion thread related to a single topic comprises a huge number of reply posts, which makes it hard for forum users to scan all the replies and determine the most relevant ones in the thread. At the same time, it is also hard for forum users to manually summarize the bulk of reply posts in order to get the gist of the discussion thread. Thus, automatically extracting the most relevant replies from a discussion thread and combining them into a summary is a challenging task. With this motivation, this study proposes a sentence-embedding-based clustering approach for discussion thread summarization. The proposed approach works in the following fashion: First, a word2vec model is employed to represent reply sentences in the discussion thread through sentence embeddings/sentence vectors. Next, the K-medoids clustering algorithm is applied to group semantically similar reply sentences in order to reduce the overlapping reply sentences. Finally, different quality text features are utilized to rank the reply sentences in different clusters, and then the high-ranked reply sentences are picked out from all clusters to form the thread summary. Two standard forum datasets are used to assess the effectiveness of the suggested approach. Empirical results confirm that the proposed sentence-based clustering approach performs better than other summarization methods in terms of mean precision, recall, and F-measure.
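A minimal sketch of the clustering step this abstract describes, assuming the reply sentences are already embedded (the paper averages word2vec vectors; random vectors stand in here) and using a plain k-medoids on cosine distance. The feature-based ranking inside each cluster is reduced to "take the medoid" for brevity.

```python
import numpy as np

def cosine_distances(x):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T

def k_medoids(dist, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(dist[:, medoids], axis=1)       # nearest medoid id
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if len(members):                               # most central member
                inner = dist[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(inner)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, assign

# Stand-in embeddings for 40 reply sentences (rows).
vectors = np.random.default_rng(1).normal(size=(40, 100))
medoids, assign = k_medoids(cosine_distances(vectors), k=5)
print(sorted(medoids))   # one representative reply per cluster
```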
... The authors in [46] presented a graph-based method for multi-document summarization of Vietnamese documents and employed the traditional PageRank algorithm to rank the important sentences. The authors in [47] demonstrated an event graph-based approach for multi-document extractive summarization. However, the approach requires the construction of hand-crafted rules for argument extraction, which is a time-consuming process and may limit its application to a specific domain. ...
Article
Full-text available
With the growing information on the web, online movie reviews are becoming a significant information resource for Internet users. However, online users post thousands of movie reviews on a daily basis, and it is hard for them to manually summarize the reviews. Movie review mining and summarization is one of the challenging tasks in natural language processing. Therefore, an automatic approach is desirable to summarize the lengthy movie reviews, and it will allow users to quickly recognize the positive and negative aspects of a movie. This study employs a feature extraction technique called bag of words (BoW) to extract features from movie reviews and represent the reviews as a vector space model or feature vector. The next phase uses the Naïve Bayes machine learning algorithm to classify the movie reviews (represented as feature vectors) into positive and negative. Next, an undirected weighted graph is constructed from the pairwise semantic similarities between classified review sentences, in such a way that the graph nodes represent review sentences while the graph edges carry semantic similarity weights. The weighted graph-based ranking algorithm (WGRA) is applied to compute a rank score for each review sentence in the graph. Finally, the top-ranked sentences (graph nodes) are chosen based on the highest rank scores to produce the extractive summary. Experimental results reveal that the proposed approach is superior to other state-of-the-art approaches.
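The graph-ranking stage described above can be approximated with a PageRank-style power iteration over the weighted sentence-similarity graph. A minimal sketch follows; the random symmetric matrix is a stand-in for the pairwise semantic similarities, and the actual WGRA weighting may differ.

```python
import numpy as np

def weighted_graph_rank(sim, d=0.85, tol=1e-6, max_iter=200):
    """PageRank-style scores on a weighted, undirected sentence graph."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)                 # no self-loops
    out = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, out, out=np.zeros_like(sim), where=out > 0)
    n = len(sim)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (trans.T @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

rng = np.random.default_rng(0)
sim = np.abs(rng.normal(size=(12, 12)))
sim = (sim + sim.T) / 2                        # symmetric similarity weights
scores = weighted_graph_rank(sim)
print(np.argsort(scores)[::-1][:3])            # 3 highest-ranked sentences
```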
... Its aim is to automatically extract specific knowledge of certain incidents identified in texts [4], in the form of who is involved, in what, when and where [5]. This task can be very beneficial in a variety of domains including question answering [6,7], information retrieval [8], summarization [9][10][11][12], timeline extraction [13,14], news recommendation [15,16], knowledge base construction [9,17], and online monitoring systems such as ones for health, life, disease, cyber-attack, stock markets, accident and robbery [18][19][20][21][22][23][24]. ...
Preprint
Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as knowledge bases, question answering, information retrieval and summarization tasks, to name a few. The problem of extracting event information from texts is longstanding and usually relies on elaborately designed lexical and syntactic features, which, however, take a large amount of human effort and lack generalization. More recently, deep neural network approaches have been adopted as a means to learn underlying features automatically. However, existing networks do not make full use of syntactic features, which play a fundamental role in capturing very long-range dependencies. Also, most approaches extract each argument of an event separately, without considering the associations between arguments, which ultimately leads to low efficiency, especially in sentences with multiple events. To address these two problems, we propose a novel joint event extraction framework that aims to extract multiple event triggers and arguments simultaneously by introducing the shortest dependency path (SDP) in the dependency graph. We do this by eliminating irrelevant words in the sentence, thus capturing long-range dependencies. Also, an attention-based graph convolutional network is proposed to carry syntactically related information along the shortest paths between argument candidates, capturing and aggregating the latent associations between arguments, a problem that has been overlooked by most of the literature. Our results show a substantial improvement over state-of-the-art methods.
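As a small illustration of the shortest-dependency-path idea, the sketch below prunes a sentence down to the SDP between a trigger and a candidate argument, assuming the dependency parse is already available; the hand-written dependency edges are illustrative, not output of any parser.

```python
import networkx as nx

# Hand-written dependency edges for:
# "The cameraman died when a tank fired on the hotel."
edges = [("died", "cameraman"), ("cameraman", "The"), ("died", "when"),
         ("died", "fired"), ("fired", "tank"), ("tank", "a"),
         ("fired", "on"), ("on", "hotel"), ("hotel", "the")]

graph = nx.Graph(edges)                 # undirected view of the parse tree
sdp = nx.shortest_path(graph, source="fired", target="cameraman")
print(sdp)   # ['fired', 'died', 'cameraman'] -- irrelevant words pruned away
```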
... Event extraction is an important technique in natural language processing. It is involved in many fields, such as question answering and information retrieval [2]. An event indicates that a state transition has occurred. ...
Conference Paper
While hearing a case, the judge must fully understand the case and make clear the disputed issues between the parties, which is the cornerstone of a fair trial. However, manually mining the key issues of a case from the statements of the litigating parties is a bottleneck, currently relying on methods like keyword searching and regular-expression matching. To complete this time-consuming and laborious task, judges need sufficient prior knowledge of cases belonging to different causes of action. We try to apply event extraction technology to capture the focus of a case faster. However, there is no proper definition of events covering the types of focus in the judicial field, and existing event extraction methods cannot handle multiple events sharing the same arguments or trigger words in a single sentence, which is very common in case materials. In this paper, we present a mechanism to define focus events, and a two-level labeling approach that can handle multiple events sharing the same argument or trigger words, to automatically extract focus events from case materials. Experimental results demonstrate that the method can capture the focus of a case accurately. As far as we know, this is the first time that event extraction technology has been applied to the judicial field.
... Reference [29] demonstrated an event graph-based approach for multi-document extractive summarization (MDES), which combines machine learning with hand-crafted rules to extract sentence-level event mentions, and employs a supervised model to determine temporal relations between them. However, the construction of hand-crafted rules for argument extraction is a time-consuming task and may limit the approach to a particular domain. ...
Article
Full-text available
The goal of abstractive summarization of multiple documents is to automatically produce a condensed version of the document text while maintaining the significant information. Most of the graph-based extractive methods represent sentences as bags of words and utilize a content similarity measure, which might fail to detect semantically equivalent redundant sentences. On the other hand, graph-based abstractive methods depend on a domain expert to build a semantic graph from a manually created ontology, which requires time and effort. This work presents a semantic graph approach with an improved ranking algorithm for abstractive summarization of multiple documents. The semantic graph is built from the source documents in such a manner that the graph nodes denote the predicate argument structures (PASs), the semantic structure of a sentence, which is automatically identified by using semantic role labeling, while the graph edges represent similarity weights, which are computed from PAS-to-PAS semantic similarity. In order to reflect the impact of both the document and the document set on PASs, the edges of the semantic graph are further augmented with PAS-to-document and PAS-to-document-set relationships. The important graph nodes (PASs) are ranked using the improved graph ranking algorithm. Redundant PASs are reduced by using maximal marginal relevance for re-ranking the PASs, and finally summary sentences are generated from the top-ranked PASs using language generation. The experiments of this research are conducted on DUC-2002, a standard dataset for document summarization. Experimental findings show that the proposed approach performs better than other summarization approaches.
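The maximal-marginal-relevance re-ranking step mentioned in this abstract is easy to sketch. The salience scores and similarity matrix below are toy stand-ins for the paper's PAS scores and PAS-to-PAS similarities; lam trades relevance off against redundancy.

```python
import numpy as np

def mmr_select(salience, sim, k=3, lam=0.7):
    """Greedily pick k units trading off salience against redundancy."""
    selected, candidates = [], list(range(len(salience)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * salience[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

salience = np.array([0.9, 0.85, 0.3, 0.6])
sim = np.array([[1.0, 0.95, 0.1, 0.2],   # units 0 and 1 are near-duplicates
                [0.95, 1.0, 0.1, 0.2],
                [0.1, 0.1, 1.0, 0.3],
                [0.2, 0.2, 0.3, 1.0]])
print(mmr_select(salience, sim))   # picks 0, then prefers 3 over redundant 1
```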
... They observed that the SRL-based sentence scoring approach outperformed a term-to-term-based sentence scoring system. By considering the semantics of events, Glavas and Snajder [43] used event graphs to obtain summaries from multiple texts. In their model, sentences are selected based on event importance, which is determined by participant importance, event informativeness, and temporal relations among events. ...
... All aforementioned summarization works were primarily aimed at summarization of news articles. There can also be other summarization types, like summarization of emails (Carenini et al., 2008; Yousefi-Azar and Hamey, 2017), event-based summarization (Glavaš and Šnajder, 2014; Kedzie et al., 2015), personalized summarization (Díaz and Gervás, 2007; Moro and Bielikova, 2012) and also sentiment-based or opinion summarization described in the next section. ...
Conference Paper
In recent years, the number of texts has grown rapidly. For example, most review-based portals, like Yelp or Amazon, contain thousands of user-generated reviews. It is impossible for any human reader to process even the most relevant of these documents. The most promising tool to solve this task is text summarization. Most existing approaches, however, work on small, homogeneous, English datasets, and do not account for multilinguality, opinion shift, and domain effects. In this paper, we introduce our research plan to use neural networks on user-generated travel reviews to generate summaries that take into account shifting opinions over time. We outline future directions in summarization to address all of these issues. By resolving the existing problems, we will make it easier for users of review sites to make more informed decisions.
... Information retrieval problems have been discussed for several decades (van Rijsbergen, 1977; Bolchini et al., 2013; Lin et al., 2014; Chen et al., 2015; Glavaš & Šnajder, 2014; Manning et al., 2009) and remain one of the open research areas nowadays (Gupta & Bendersky, 2015; Janowicz et al., 2011). Different approaches are discussed in this area, but a semantic-based approach is accepted as a need for information systems (Janowicz et al., 2011; Malhotra & Nair, 2015; Zhai, 2015). ...
Article
Full-text available
With the evolution of technology, software development has shifted from desktop applications to mobile and web environments, bringing in users with very different physical and cognitive conditions; the quality of these programs should therefore be excellent, as there are many users with varied needs, abilities and skills. The purpose of this research, beyond understanding the criteria required to develop usable software, was to build a tool that automates heuristic evaluations of usability and User-Centered Design (UCD) for web and mobile applications from different points of view. The research used a mixed (qualitative and quantitative) methodological approach, together with several methods and techniques of inquiry, and was descriptive-propositional in nature. For the development of the mobile application, the team decided to program under the object-oriented paradigm, and the methodology that guided the project is RUP (Rational Unified Process). The research produced three instruments for heuristic evaluations and a mobile application that includes such tests and statistical results to determine the level of compliance with ISO 25010 and Nielsen's usability principles. In conclusion, the mobile application is intended as a tool to assess and deliver statistical results for different usability principles and standards.
... It generates an ordered summary by optimizing coherence and salience factors, where the coherence of the text is evaluated by an approximated discourse graph. Glavaš and Šnajder (2014) introduce an event-based summarization model that exploits the strengths of machine learning and rule-based approaches, and that performs effectively on event-oriented document collections. Complex networks are also based on graph theory. ...
... The authors of [27] adopted a vector space model and TF-IDF to generate the document vectors, and the similarity of documents was measured by cosine similarity to describe the evolution relationships between the events, supplemented by features of temporal proximity and document distributional proximity. Combining machine learning and rule-based models, [24] extracted events at the sentence level, determined the evolutionary relationship between pairs of events, and demonstrated a new model for information retrieval and summary generation. The similarity between the content of the documents was incorporated with other features to evaluate the strength of the evolutionary relationship between news events. ...
Conference Paper
Full-text available
Microblogs like Twitter and Sina Weibo have become important information sources for event detection and monitoring. In many decision-making scenarios, it is not enough to only provide a structural tuple for an event, e.g., a 5W1H record like <who, where, when, what, whom, how>. In addition to event structural tuples, people need to know the evolution lifecycle of an event. The lifecycle description of an event is more helpful for decision making because people can focus on the progress and trend of events. In this paper, we propose a novel method for efficiently detecting and tracking event evolution on microblogging platforms. The major features of our study are: (1) It provides a novel event-type-driven method to extract event tuples, which forms the foundation for event evolution analysis. (2) It describes the lifecycle of an event by a staged model, and provides effective algorithms for detecting the stages of an event. (3) It offers emotional analysis over the stages of an event, through which people are able to know the public emotional tendency over a specific event at different time periods. We build a prototype system and present its architecture and implementation details in the paper. In addition, we conduct experiments on real microblog datasets, and the results in terms of precision, recall, and F-measure suggest the effectiveness and efficiency of our proposal.
... The uninformative parts in the documents were eliminated based on the event representations. Glavas and Snajder (2014) convert documents into event graphs in which nodes correspond to event mentions and edges to temporal relations between events. They use a logistic regression classifier for determining the event words and a set of manually built rules to retrieve the arguments of events. ...
Article
Full-text available
This paper describes a question answering framework that can answer student questions given in natural language. We suggest a methodology that makes use of reliable resources only, provides the answer in the form of a multi-document summary for both factoid and open-ended questions, and produces an answer also from foreign resources by translating into the native language. The resources are compiled using a question database in the selected domains based on reliability and coverage metrics. A question is parsed using a dependency parser, important parts are extracted by rule-based and statistical methods, the question is converted into a representation, and a query is built. Documents relevant to the query are retrieved from the set of resources. The documents are summarized and the answers to the question together with other relevant information about the topic of the question are shown to the user. A summary answer from the foreign resources is also built by the translation of the input question and the retrieved documents. The proposed approach was applied to the Turkish language and it was tested with several experiments and a pilot study. The experiments have shown that the summaries returned include the answer for about 50–60 percent of the questions. The data bank built for factoid and open-ended questions in the two domains covered is made publicly available.
... The granularity of these methods can also be finer than the document. For instance, Glavaš and Šnajder (2014) extract from documents the predicates corresponding to elementary events, and then organize these predicates according to their causal links. Choudhary et al. (2008) were among the first to propose this approach, building a graph in which documents mentioning the same persons are linked. ...
Thesis
This computer science thesis addresses the structuring and exploration of collections of news documents. It draws on several research fields: social sciences, through the study of journalistic production; ergonomics; natural language processing and information retrieval; and multimedia, notably multimedia information retrieval. A branch of multimedia information retrieval called hyperlinking forms the basis on which this thesis is built. Hyperlinking consists of automatically building links between multimedia documents. We extend this concept by applying it to an entire collection in order to obtain a hypergraph, and we study in particular its topological characteristics. In this thesis we propose improvements over the state of the art along three main axes: a structuring of news collections using multi-source and multimodal graphs based on the creation of inter-document links; its combination with a high diversity of links, so as to represent the wide variety of interests different users may have; and finally the typing of the created links, making explicit the relation that holds between two documents. These contributions are reinforced by user studies demonstrating their respective benefits.
... Kannada is a resource-poor, morphologically rich Dravidian language with about 45 million speakers, mostly located in southern India. Automatic extraction of events has gained sizable attention in subfields of NLP and information retrieval such as automatic summarization, question answering and knowledge graph embeddings [25], [37], as events are a representation of temporal information and sequences in text. ...
Thesis
Full-text available
This thesis is a culmination of the work done on event detection, annotation and analysis. We present here the development of the detection of events on a large scale for low-resource languages in two ways, from a computational and a linguistic perspective. From the linguistic perspective, we discuss the creation of a language-specific event annotation and representation task for Kannada, a morphologically rich, resource-poor Dravidian language, and Hindi, a popular Indo-Aryan language. From a computational perspective, we look into leveraging information from resource-rich languages and use transfer learning in order to detect events in a resource-poor environment as well. We analyze events in Hindi briefly and Kannada in depth. As most information retrieval and extraction tasks are resource intensive, very little work has been done on Kannada NLP, with almost no efforts in discourse analysis and dataset creation for representing events or other semantic annotations in the text. In this thesis, we linguistically analyze what constitutes an event in this language, and the challenges faced with discourse-level annotation and representation due to the rich derivational morphology of the language, which allows free word order, numerous multi-word expressions, adverbial participle constructions and constraints on subject-verb relations. Therefore, this is one of the first attempts at a large-scale discourse-level annotation for Kannada, which can be used for semantic annotation and corpus development for other tasks in the language. On the other hand, from a processing viewpoint, detection of TimeML events in text has traditionally been done on corpora such as TimeBanks. Traditional architectures revolve around highly feature-engineered, language-specific statistical models. In this thesis, we also present a Language Invariant Neural Event Detection (ALINED) architecture. ALINED uses an aggregation of both sub-word level features as well as lexical and structural information. This is achieved by combining convolution over character embeddings with recurrent layers over contextual word embeddings. We find that our model extracts relevant features for event span identification without relying on language-specific features. We compare the performance of our language-invariant model to the current state of the art in English, Spanish, Italian and French. We outperform the F1-score of the state of the art in English by 1.65 points. We achieve F1-scores of 84.96, 80.87 and 74.81 on Spanish, Italian and French respectively, which is comparable to the current states of the art for these languages. We also introduce the automatic annotation of events in Hindi and Kannada with an F1-score of 77.13 and 67.30 respectively.
... A graph-based method for summarizing several Vietnamese documents is presented by the authors in [47]. An event graph method for extractive multi-document summarization is demonstrated by the authors in [48]. However, it involves the creation of handcrafted argument extraction rules, which is a tedious task. ...
Article
Full-text available
Information is exploding on the web at an exponential pace, so online movie reviews are becoming a substantial information resource for online users. However, users post millions of movie reviews on a regular basis, and it is not possible for them to summarize the reviews manually. Movie review classification and summarization is one of the challenging tasks in natural language processing. Therefore, an automatic approach is needed to summarize the vast amount of movie reviews, and it will allow the users to speedily distinguish the positive and negative aspects of a movie. This study has proposed an approach for movie review classification and summarization. For movie review classification, the bag-of-words feature extraction technique is used to extract unigrams, bigrams, and trigrams as a feature set from given review documents, and to represent the review documents as a vector space model. Next, the Naïve Bayes algorithm is employed to classify the movie reviews (represented as feature vectors) into positive and negative reviews. For the task of movie review summarization, the Word2vec feature extraction technique is used to extract features from classified movie review sentences, and then a semantic clustering technique is used to cluster semantically related review sentences. Different text features are used to calculate the salience score of each review sentence in the clusters. Finally, the top-ranked sentences are chosen based on the highest salience scores to produce the extractive summary of movie reviews. Experimental results reveal that the proposed machine learning approach is superior to other state-of-the-art approaches.
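A minimal sketch of the classification stage described above: unigram-to-trigram bag-of-words features fed to multinomial Naïve Bayes with scikit-learn. The four toy reviews and labels are invented stand-ins for the paper's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["the plot was brilliant and moving",
           "a dull script and wooden acting",
           "wonderful performances throughout",
           "i want those two hours back"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),   # unigrams, bigrams, trigrams
    MultinomialNB(),
)
clf.fit(reviews, labels)
print(clf.predict(["brilliant acting and a moving script"]))
```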
... Results obtained from the experiments justified the importance of their model. Glavaš and Šnajder (2014) combined rule-based models with machine learning to extract sentence-level event mentions, in an approach named 'event graphs for multi-document summarization'. Similarity between documents and queries is calculated by computing graph kernels over event graphs. ...
Article
Full-text available
The quality of the human-readable summaries available in existing datasets is not up to the mark, leading to issues in creating an accurate model for text summarization. Although recent works have largely built upon this issue and set up a strong platform for further improvements, they still have many limitations. Looking in this direction, the paper proposes a novel methodology for summarizing a corpus of documents to generate a coherent summary using topic modeling and classification techniques. The objectives of the proposed work are highlighted below: (1) a novel heuristic approach is introduced to find the actual number of topics that exist in a corpus of documents, which handles the stochastic nature of latent Dirichlet allocation; (2) a large corpus of documents is handled by reducing the huge set of sentences to a small set without losing the important ones, thus providing a concise and information-rich summary; (3) the sentences in the coherent summary are arranged according to their importance. Results of the experiment are compared with state-of-the-art summary systems. The outcomes of the empirical work show that the proposed model is more promising than well-known text summarization models.
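The paper's heuristic for finding the number of topics is its own contribution; as a generic stand-in, the sketch below sweeps candidate topic counts and averages held-out perplexity over several random seeds to damp LDA's stochasticity, keeping the count that scores best. The eight toy documents are invented.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = ["the volcano erupted and lava reached the village",
        "residents were evacuated before the eruption",
        "the election results were announced on sunday",
        "voters went to the polls to elect a new parliament",
        "the team won the championship after extra time",
        "fans celebrated the victory in the stadium",
        "ash clouds from the eruption grounded flights",
        "the opposition disputed the election outcome"]

X = CountVectorizer(stop_words="english").fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

def avg_perplexity(k, seeds=(0, 1, 2)):
    """Average held-out perplexity over seeds to damp LDA's randomness."""
    return np.mean([
        LatentDirichletAllocation(n_components=k, random_state=s)
        .fit(X_train).perplexity(X_test)
        for s in seeds
    ])

candidates = [2, 3, 4, 5]
best_k = min(candidates, key=avg_perplexity)
print("chosen number of topics:", best_k)
```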
... Event extraction is a hot spot in the information extraction area; it aims to accurately identify events and the relevant persons, times, places and other elements in Chinese text. It is widely used in information retrieval [7] and question answering systems [8], and it is also one of the construction processes for knowledge graphs [9]. There are two main types of methods: rule-based methods and machine learning methods. ...
Chapter
Event element recognition is a significant task in event-based information extraction. In this paper, we propose an event element recognition model based on character-level embeddings with semantic features. By extracting character-level features, the proposed model can capture more information about words. Our results show that jointly using a character-level Convolutional Neural Network (CNN) and a character-level Bi-directional Long Short-Term Memory network (Bi-LSTM) is superior to a single character-level model. In addition, adding semantic features such as POS (part-of-speech) and DP (dependency parsing) tends to improve recognition. We evaluated different methods on CEC (Chinese Emergency Corpus), and the experimental results show that our model achieves good performance, with an F-value of 77.17% for element recognition.
... Event extraction (EE) plays an important role in various real-life applications, such as information retrieval and news summarization (Glavas and Snajder, 2014; Daniel et al., 2003). It aims to discover events with triggers and their corresponding arguments. ...
... Event extraction plays an important role in various NLP applications including question answering and information retrieval (Yang et al. 2003; Glavaš and Šnajder 2014). Figure 1 shows an example of the event extraction task, which aims to discover events (die and attack) with triggering words (died and fired) and their corresponding arguments (e.g., Baghdad, cameraman). ...
Article
Event extraction plays an important role in natural language processing (NLP) applications including question answering and information retrieval. Traditional event extraction relies heavily on lexical and syntactic features, which require intensive human engineering and may not generalize to different datasets. Deep neural networks, on the other hand, are able to automatically learn underlying features, but existing networks do not make full use of syntactic relations. In this paper, we propose a novel dependency bridge recurrent neural network (dbRNN) for event extraction. We build our model upon a recurrent neural network, but enhance it with dependency bridges, which carry syntactically related information when modeling each word. We illustrate that simultaneously applying tree structure and sequence structure in an RNN brings much better performance than using only a sequential RNN. In addition, we use a tensor layer to simultaneously capture the various types of latent interaction between candidate arguments, as well as to identify/classify all arguments of an event. Experiments show that our approach achieves competitive results compared with previous work.
... The task of Temporal Relation Extraction focuses on finding the chronology of events (e.g., Before, After, Overlaps) in text. Extracting temporal relations is useful for various downstream tasks: curating structured clinical data (Savova et al., 2010; Soysal et al., 2018), text summarization (Glavas and Snajder, 2014; Kedzie et al., 2015), question answering (Llorens et al., 2015; Zhou et al., 2019), etc. The task is most commonly viewed as a classification task where, given a pair of events and its textual context, the temporal relation between them needs to be identified. ...
Preprint
We present LOME, a system for performing multilingual information extraction. Given a text document as input, our core system identifies spans of textual entity and event mentions with a FrameNet (Baker et al., 1998) parser. It subsequently performs coreference resolution, fine-grained entity typing, and temporal relation prediction between events. By doing so, the system constructs an event and entity focused knowledge graph. We can further apply third-party modules for other types of annotation, like relation extraction. Our (multilingual) first-party modules either outperform or are competitive with the (monolingual) state-of-the-art. We achieve this through the use of multilingual encoders like XLM-R (Conneau et al., 2020) and leveraging multilingual training data. LOME is available as a Docker container on Docker Hub. In addition, a lightweight version of the system is accessible as a web demo.
... Event-based Summarizer Pre-training. Previous studies reveal that event information can be an effective building block for models to perform text generation [12,20], so we attempt to obtain a Summarizer with the ability to generate event-related text in an unsupervised way. Concretely, we pre-train a sequence-to-sequence model in the following steps: 1) randomly select a few sentences from the text; 2) extract events in these selected sentences; 3) mask these sentences in the source document; 4) take events and masked text as input, and use these selected sentences as the target for the model. ...
Preprint
Text summarization is a personalized and customized task, i.e., for one document, users often have different preferences for the summary. As a key aspect of customization in summarization, granularity is used to measure the semantic coverage between the summary and the source document. Coarse-grained summaries can only contain the most central event in the original text, while fine-grained summaries cover more sub-events and corresponding details. However, previous studies mostly develop systems for the single-granularity scenario, and models that can generate summaries with customizable semantic coverage remain an under-explored topic. In this paper, we propose the first unsupervised multi-granularity summarization framework, GranuSum. We take events as the basic semantic units of the source documents and propose to rank these events by their salience. We also develop a model to summarize input documents with given events as anchors and hints. By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner. Meanwhile, to evaluate multi-granularity summarization models, we annotate a new benchmark GranuDUC, in which we write multiple summaries of different granularities for each document cluster. Experimental results confirm the substantial superiority of GranuSum on multi-granularity summarization over several baseline systems. Furthermore, by experimenting on conventional unsupervised abstractive summarization tasks, we find that GranuSum, by exploiting the event information, can also achieve new state-of-the-art results under this scenario, outperforming strong baselines.
... It can be seen that much of the work in automatic text summarization does not involve the temporal aspect of the corpora within the summarization framework. Temporal text summarization works are mostly based on real-world events, such as police reports, news, stories, and product reviews [22,23,24]. Temporal summarization approaches for scholarly literature, however, have not been properly established yet. ...
Article
Full-text available
The number of scholarly publications has dramatically increased over the last decades. For anyone new to a particular science domain, it is not easy to understand the major trends and significant changes that the domain has undergone over time. Temporal summarization and related approaches should then be useful for making sense of temporal scholarly collections. In this paper we demonstrate an approach to analyzing a dataset of research papers by providing a high-level overview of important changes that occurred over time in this dataset. The novelty of our approach lies in the adaptation of methods used for semantic term evolution analysis. However, we analyze not just the semantic evolution of single words independently; we estimate common semantic drifts shared by groups of semantically converging words. As an example dataset we use the ACL Anthology Reference Corpus, which spans from 1974 to 2015 and contains 22,878 scholarly articles.
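A common building block in semantic term-evolution analysis is aligning embeddings trained on different epochs and measuring how far each word moved; the sketch below does that with orthogonal Procrustes on random stand-in matrices. Note the paper goes further, estimating drifts shared by groups of converging words rather than per-word drift.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
vocab = ["neural", "parsing", "corpus", "grammar"]
emb_1980s = rng.normal(size=(4, 50))     # stand-ins for per-epoch embeddings
emb_2010s = rng.normal(size=(4, 50))

# Rotate the newer space onto the older one (vocabularies assumed aligned).
R, _ = orthogonal_procrustes(emb_2010s, emb_1980s)
aligned = emb_2010s @ R

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

drift = {w: 1 - cosine(emb_1980s[i], aligned[i]) for i, w in enumerate(vocab)}
print(sorted(drift.items(), key=lambda kv: -kv[1]))   # most-drifted first
```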
... Event extraction is typically an upstream step in pipelines for financial applications: it has been used for news summarization of single (C. S. Lee et al., 2003; Marujo et al., 2017) or multiple documents (Glavaš & Šnajder, 2014; M. Liu et al., 2007), forecasting and market analysis (Bholat et al., 2015; D. ...
Thesis
Full-text available
To capture the vast knowledge expressed in written language, the field of Information Extraction within Natural Language Processing aims to obtain structured information on facts and opinions from unstructured text. Event extraction, on the one hand, is the task of automatically collecting the factual `who, what, where, when, why and how' of recent occurrences in news or social media. The processing of subjective opinions, on the other hand, is performed with sentiment analysis systems, where positive or negative attitudes are detected towards products, persons, or organizations. In this dissertation, we present the construction of an extensive dataset for fine-grained event extraction and sentiment analysis in economic news, named SENTiVENT. Subsequently, we validated our novel resource with machine learning experiments in which we apply state-of-the-art deep learning models to check the feasibility of our task. We defined economic events as prototypical schemata in which words expressing an event of a certain type (e.g., product releases, revenue increases, security value movements, deals) are linked to the participating persons, companies, and entities that play a role in the event (e.g., a product, the amount of increase in stock price, the main companies involved in a deal). Event processing in financial text has historically been largely knowledge- or pattern-based, relying on manually created rules for matching phrases to events. Other approaches rely on approximate heuristics for automatically gathering event phrases such as the presence of dates or similarity to existing rule-sets. This over-reliance on knowledge-based methods stems from a lack of gold-standard annotated data in the field of financial event processing. Fine-grained sentiment analysis detects which attitude is expressed towards a target entity. Usually, the field is focused on user-generated and opinionated text genres where sentiment is explicitly expressed, such as reviews. However, in objective genres such as business news, indirect expressions of implicit sentiment are common. Here, a positive or negative attitude can be inferred by the reader through common-sense, connotational world knowledge. The field of implicit sentiment analysis is currently lacking in fine-grained resources in which the opinion and target entity words are labeled for their implied sentiment value. Financial markets prove to be especially sensitive to news coverage and opinionated reporting. Combining the fine-grained extraction of events and their investor sentiment in company-specific news enables financial applications such as stock forecasting, identifying macro-economic trends, and event studies. Supervised machine learning methods rely on annotated training data, so to enable data-driven, supervised extraction of economic events and implicit sentiment, a substantial amount of annotations is required. We constructed a representative English corpus which was manually annotated with our novel annotation scheme, obtaining over 6200 event schemata in 288 company-specific news articles for 18 economic types. Next, we annotated the positive, neutral, or negative investor sentiment value on top of these events and added separate opinion and target word annotations, obtaining one of the largest fine-grained targeted datasets with 12,400 sentiment tuples.
After verifying the quality of annotations in agreement studies, we applied deep learning models that obtain good performance on comparable tasks to check the portability of these methods to our SENTiVENT dataset in coarse- and fine-grained experiments. For the coarse-grained experiments we preprocessed our token-level annotation to sentences or clauses for event detection or implicit sentiment value classification, which obtained good results. The fine-grained, token-level extraction of abstract semantic categories such as economic events and implicit sentiment proved to be highly challenging, even to advanced current transformer-based transfer-learning methods. We have shown in error analyses that our dataset contains large lexical variation within the extracted categories. This highlights a weakness of strictly supervised data-driven approaches: even though our dataset is on par with or larger than current reference sets for the fine-grained tasks, knowledge bases and distantly-supervised methods for instance enhancement and expansion should be introduced to alleviate data scarcity. We concluded that our corpus construction efforts resulted in a qualitative and rich resource that fills the need for data-driven approaches in financial event and implicit sentiment processing.
Article
In this paper, three methods for extracting single-document summaries by combining supervised learning with unsupervised learning are proposed. The purpose of these three methods is to measure the importance of sentences by combining the statistical features of sentences with the relationships between sentences. The first method uses a supervised model and a graph model to score sentences separately, and then a linear combination of the scores is used as the final score of each sentence. In the second method, the graph model is used as an independent feature of the supervised model to evaluate the importance of sentences. The third method scores the importance of sentences with the supervised model, uses these scores as prior values of the nodes in the graph model, and finally scores sentences with the biased graph model. On the DUC2001 and DUC2002 datasets, with the ROUGE method as the evaluation criterion, the three methods achieve good results and are superior to methods that use only supervised learning or only unsupervised learning to extract summaries. We also validate that prior knowledge can improve the accuracy of key sentence selection in the graph model.
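The third method (supervised scores as node priors in a biased graph model) can be sketched as a personalized PageRank whose teleport vector is the normalized supervised scores; apart from swapping in that prior, it mirrors a standard power iteration. All data below are stand-ins.

```python
import numpy as np

def biased_graph_rank(sim, prior, d=0.85, iters=100):
    """Power iteration with supervised scores as the teleport (prior) vector."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)
    trans = sim / sim.sum(axis=1, keepdims=True)
    prior = prior / prior.sum()
    r = prior.copy()
    for _ in range(iters):
        r = (1 - d) * prior + d * (trans.T @ r)
    return r

rng = np.random.default_rng(0)
sim = np.abs(rng.normal(size=(10, 10)))
sim = (sim + sim.T) / 2                  # sentence-similarity graph
supervised = rng.uniform(size=10)        # e.g., regression scores per sentence
print(np.argsort(biased_graph_rank(sim, supervised))[::-1][:3])
```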
Chapter
With the advent of communication technology, a tremendous amount of data is generated. The availability of a vast amount of data provides information and presents the challenge of extracting knowledge from it. The solution to such an issue is text summarization. The documents are examined, and a thorough, compact, and relevant summary is generated using text summarization. It is classified into two forms based on the approach used: extractive and abstractive summarization. Extractive summarization selects words and sentences from an existing document to create a summary. Abstractive summarization performs semantic analysis and employs new words and phrases to construct the summary. We have gone over the many types of text summarization in detail in this paper, along with a discussion of the various research approaches that have been used so far.

Keywords: Text summarization · Abstractive text summarization · Extractive text summarization · Deep learning
Article
Full-text available
In this modern age, the internet is a powerful source of information. Roughly one-third of the world population spends a significant amount of their time and money on surfing the internet. In every field of life, people gain vast information from it, such as learning, amusement, communication, shopping, etc. For this purpose, users tend to visit websites and provide their remarks or views on any product, service, event, etc. based on their experience, which might be useful for other users. In this manner, a huge amount of feedback in the form of textual data accumulates on these websites, and this data can be explored, evaluated, and used for the decision-making process. Opinion Mining (OM) is a type of Natural Language Processing (NLP) that extracts the theme or idea from users' opinions in the form of positive, negative, and neutral comments. Therefore, researchers try to present information in the form of a summary that would be useful for different users. The research community has worked on generating automatic summaries from the 1950s until now, and these automation processes are divided into two categories: abstractive and extractive methods. This paper presents an overview of the useful methods in OM and explains the idea of OM with regard to summarization and its automation process.
Article
The rise in the amount of textual resources available on the Internet has created the need for tools of automatic document summarization. The main challenges of query-oriented extractive summarization are (1) to identify the topics of the documents and (2) to recover query-relevant sentences of the documents that together cover these topics. Existing graph- or hypergraph-based summarizers use graph-based ranking algorithms to produce individual scores of relevance for the sentences. Hence, these systems fail to measure the topics jointly covered by the sentences forming the summary, which tends to produce redundant summaries. To address the issue of selecting non-redundant sentences jointly covering the main query-relevant topics of a corpus, we propose a new method using the powerful theory of hypergraph transversals. First, we introduce a new topic model based on the semantic clustering of terms in order to discover the topics present in a corpus. Second, these topics are modeled as the hyperedges of a hypergraph in which the nodes are the sentences. A summary is then produced by generating a transversal of nodes in the hypergraph. Algorithms based on the theory of submodular functions are proposed to generate the transversals and to build the summaries. The proposed summarizer outperforms existing graph- or hypergraph-based summarizers by at least 6% of ROUGE-SU4 F-measure on the DUC 2007 dataset. It is moreover cheaper than existing hypergraph-based summarizers in terms of computational time complexity.
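A minimal sketch of the transversal idea under simplifying assumptions: topics are given as plain sets of sentence indices (hyperedges), and a plain greedy coverage heuristic stands in for the paper's submodular-function algorithms.

```python
def greedy_transversal(topics, budget):
    """Greedy hypergraph transversal: pick sentences until every topic
    (hyperedge) is covered or the sentence budget is spent.

    topics: list of sets of sentence indices (hyperedges over sentences).
    budget: maximum number of sentences in the summary.
    """
    uncovered = [set(t) for t in topics]
    candidates = set().union(*uncovered) if uncovered else set()
    summary = []
    while uncovered and candidates and len(summary) < budget:
        # Greedy step: the sentence hitting the most uncovered topics.
        best = max(candidates, key=lambda s: sum(s in t for t in uncovered))
        summary.append(best)
        candidates.discard(best)
        uncovered = [t for t in uncovered if best not in t]
    return summary

# Example: three topical hyperedges over six sentences.
print(greedy_transversal([{0, 1}, {1, 2, 3}, {4, 5}], budget=3))  # [1, 4] or [1, 5]
```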
Article
The key to document summarization is semantic representation of documents. This paper investigates the role of the Semantic Link Network in representing and understanding documents for multi-document summarization. We propose a novel abstractive multi-document summarization framework that first transforms documents into a Semantic Link Network of concepts and events, and then transforms the Semantic Link Network into a summary of the documents by selecting important concepts and events while maintaining semantic coherence. Experiments on benchmark datasets show that the proposed approach significantly outperforms relevant state-of-the-art baselines and that the Semantic Link Network plays an important role in representing and understanding documents.
Preprint
The manifold ranking method is effective at ranking unknown data against known data using a weighted network, so many researchers use it to solve the document summarization task. However, their models consider only the original features and ignore the semantic features of sentences when constructing the weighted networks for the manifold ranking method. To solve this problem, we propose two improved models based on the manifold ranking method. One combines a topic model with the manifold ranking method (JTMMR) to solve the document summarization task. This model uses not only the original features but also semantic features to represent the document, which can improve the accuracy of the manifold ranking method. The other combines a lifelong topic model with the manifold ranking method (JLTMMR). On the basis of JTMMR, this model adds a knowledge constraint to improve the quality of the topics. At the same time, we also add a constraint on the relationship between documents to extract better document semantic features. The JTMMR model improves the effect of the manifold ranking method by using these better semantic features. Experiments show that our models achieve better results than other baseline models on the multi-document summarization task. At the same time, our models also perform well on the single-document summarization task. After combining with a few basic surface features, our model significantly outperforms some recent deep learning-based models. Finally, we conduct an exploratory study of lifelong machine learning by analyzing the effect of adding feedback. Experiments show that the effect of adding feedback to our model is significant.
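A minimal numpy sketch of the underlying manifold ranking iteration; the affinity matrix `W` and query indicator `y` below are hypothetical inputs, and the published JTMMR/JLTMMR models add topic and knowledge constraints on top of this core loop.

```python
import numpy as np

def manifold_rank(W, y, alpha=0.85, iters=100):
    """Iterate f <- alpha * S f + (1 - alpha) * y on a weighted sentence graph.

    W: symmetric affinity matrix over sentences (zero diagonal).
    y: prior relevance vector, e.g. 1.0 for query/topic sentences, else 0.0.
    """
    d = np.maximum(W.sum(axis=1), 1e-12)
    S = W / np.sqrt(np.outer(d, d))   # symmetric normalization D^-1/2 W D^-1/2
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f  # higher score = sentence sits closer to the query on the manifold

W = np.array([[0, .8, .1], [.8, 0, .5], [.1, .5, 0]])
print(manifold_rank(W, np.array([1.0, 0.0, 0.0])))
```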
Article
Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as knowledge base construction, question answering and summarization, to name a few. The problem of extracting event information from texts is longstanding and usually relies on elaborately designed lexical and syntactic features, which, however, require a large amount of human effort and lack generalization. More recently, deep neural network approaches have been adopted as a means to learn underlying features automatically. However, existing networks do not make full use of syntactic features, which play a fundamental role in capturing very long-range dependencies. Also, most approaches extract each argument of an event separately, without considering associations between arguments, which ultimately leads to low efficiency, especially in sentences with multiple events. To address the above problems, we propose a novel joint event extraction framework that aims to extract multiple event triggers and arguments simultaneously by introducing the shortest dependency path in the dependency graph. We do this by eliminating irrelevant words in the sentence, thus capturing long-range dependencies. Also, an attention-based graph convolutional network is proposed to carry syntactically related information along the shortest paths between argument candidates, which captures and aggregates the latent associations between arguments; a problem that has been overlooked by most of the literature. Our results show a substantial improvement over state-of-the-art methods on two datasets, namely ACE 2005 and TAC KBP 2015.
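To illustrate the core preprocessing step, here is a small sketch that prunes a sentence to the shortest dependency path between two candidate words; the toy dependency edges are invented for the example, whereas a real system would take them from a parser.

```python
from collections import deque

def shortest_dependency_path(edges, start, goal):
    """BFS over an undirected view of dependency edges; returns the word path."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy parse of "The company fired the employee after the merger".
edges = [("fired", "company"), ("fired", "employee"),
         ("fired", "after"), ("after", "merger")]
print(shortest_dependency_path(edges, "company", "merger"))
# ['company', 'fired', 'after', 'merger'] -- irrelevant words are pruned away
```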
Preprint
Event extraction (EE) has considerably benefited from pre-trained language models (PLMs) by fine-tuning. However, existing pre-training methods have not involved modeling event characteristics, so the developed EE models cannot take full advantage of large-scale unsupervised data. To this end, we propose CLEVE, a contrastive pre-training framework for EE to better learn event knowledge from large unsupervised data and its semantic structures (e.g. AMR) obtained with automatic parsers. CLEVE contains a text encoder to learn event semantics and a graph encoder to learn event structures, respectively. Specifically, the text encoder learns event semantic representations by self-supervised contrastive learning, representing the words of the same event as closer than unrelated words; the graph encoder learns event structure representations by graph contrastive pre-training on parsed event-related semantic structures. The two complementary representations then work together to improve both conventional supervised EE and unsupervised "liberal" EE, which requires jointly extracting events and discovering event schemata without any annotated data. Experiments on the ACE 2005 and MAVEN datasets show that CLEVE achieves significant improvements, especially in the challenging unsupervised setting. The source code and pre-trained checkpoints can be obtained from https://github.com/THU-KEG/CLEVE.
Chapter
Named Entity Recognition (NER) is a computational linguistics task that seeks to classify every word in a document into a category. NER serves as an important component of many domain-specific expert systems. Software engineering is one such domain where very little work has been done on identifying domain-specific entities. In this paper, we present NERSE, a tool that enables the user to identify software-specific entities. It is developed with machine learning algorithms trained on software-specific entity categories using Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF). NERSE identifies 22 different categories of entities specific to the software engineering domain, with F-scores of 0.85 and 0.95 for the CRF (source code for the Named Entity Recognition Model CRF is available at https://github.com/prathapreddymv/NERSE) and BiLSTM-CRF (source code for the Named Entity Recognition Model BiLSTM-CRF is available at https://github.com/prathapreddymv/NERSE) models respectively.
Conference Paper
Full-text available
Detection of TimeML events in text has traditionally been done on corpora such as TimeBank. However, deep learning methods have not been applied to these corpora, because these datasets seldom contain more than 10,000 event mentions. Traditional architectures revolve around highly feature-engineered, language-specific statistical models. In this paper, we present a Language Invariant Neural Event Detection (ALINED) architecture. ALINED uses an aggregation of sub-word level features as well as lexical and structural information. This is achieved by combining convolution over character embeddings with recurrent layers over contextual word embeddings. We find that our model extracts relevant features for event span identification without relying on language-specific features. We compare the performance of our language-invariant model to the current state of the art in English, Spanish, Italian and French. We outperform the F1-score of the state of the art in English by 1.65 points. We achieve F1-scores of 84.96, 80.87 and 74.81 on Spanish, Italian and French respectively, which is comparable to the current state of the art for these languages. We also introduce the automatic annotation of events in Hindi, a low-resource language, with an F1-score of 77.13.
Article
Nowadays, it is necessary that users have access to information in a concise form without losing any critical information. Document summarization is the automatic process of generating a short form of a document. In itemset-based document summarization, the weights of all terms are considered the same. In this paper, a new approach is proposed for multi-document summarization based on weighted patterns and term association measures. In the present study, the weights of the terms are not equal in the context and are computed based on weighted frequent itemset mining. Indeed, the proposed method enriches frequent itemset mining by weighting the terms in the corpus. In addition, the relationships among the terms in the corpus are considered using term association measures. Also, statistical features such as sentence length and sentence position have been modified and matched to generate a summary based on a greedy method. Based on results on the DUC 2002 and DUC 2004 datasets obtained with the ROUGE toolkit, the proposed approach can significantly outperform the state-of-the-art approaches.
Article
Full-text available
In extraction-based automatic text summarization (ATS) applications, feature scoring is the cornerstone of the summarization process, since it is used for selecting the candidate summary sentences. Handling all features equally leads to generating disqualified summaries. Feature Weighting (FW) is an important approach used to weight the scores of the features based on their importance in the current context. Therefore, some ATS researchers have proposed evolutionary machine learning methods, such as Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA), to extract superior weights for their assigned features. The extracted weights are then used to tune the scored features in order to generate a highly qualified summary. In this paper, the Differential Evolution (DE) algorithm is proposed as a feature weighting machine learning method for extraction-based ATS problems. To enable the DE to represent and control the assigned features in a binary dimension space, it was modulated into a binary-coded format. Simple mathematical calculation features were selected from the literature and employed in this study. The sentences in the documents are first clustered according to a multi-objective clustering concept. The DE approach simultaneously optimizes two objective functions, measuring the compactness and the separation of the sentence clusters. In order to automatically detect the number of sentence clusters contained in a document, representative sentences from the various clusters are chosen with certain sentence scoring features to produce the summary. The method was tested and trained using the DUC 2002 dataset to learn the weight of each feature. To create comparative and competitive findings, the proposed DE method was compared with the evolutionary methods PSO and GA, as well as against the best and worst benchmark systems in DUC 2002. The BiDETS model scores 49% (human performance: 52%) in ROUGE-1; 26% (above the human performance of 23%) in ROUGE-2; and 45% (human performance: 48%) in ROUGE-L. These results show that the proposed method outperformed all other methods in terms of F-measure using the ROUGE evaluation tool.
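A compact sketch of the classic DE/rand/1/bin loop that such feature-weighting approaches build on; the fitness function below is a made-up stand-in (BiDETS would instead evaluate summary quality for a candidate weight vector), and the binary coding is omitted.

```python
import random

def differential_evolution(fitness, dim, pop_size=20, F=0.5, CR=0.9, gens=100):
    """Minimize `fitness` over [0, 1]^dim with DE/rand/1/bin."""
    pop = [[random.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            # Mutation: perturb a with the scaled difference of b and c.
            mutant = [min(1.0, max(0.0, a[k] + F * (b[k] - c[k])))
                      for k in range(dim)]
            # Binomial crossover: mix mutant and current individual.
            j_rand = random.randrange(dim)
            trial = [mutant[k] if (random.random() < CR or k == j_rand)
                     else pop[i][k] for k in range(dim)]
            # Selection: keep whichever candidate scores better.
            if fitness(trial) <= fitness(pop[i]):
                pop[i] = trial
    return min(pop, key=fitness)

# Toy fitness: prefer feature weights close to a hypothetical target profile.
target = [0.9, 0.1, 0.5, 0.7]
best = differential_evolution(
    lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target)), dim=4)
print([round(w, 2) for w in best])
```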
Article
Full-text available
Extractive multi-document summarization receives a set of documents and extracts the important sentences to form a summary. This paper proposes a novel multi-document summarization method based on sentence overlapping. First, we preprocess the multi-document set and calculate 12 features for each sentence. This paper suggests four new features: the ROUGE-1 and ROUGE-2 scores between the sentence and a single document, the ROUGE-1 and ROUGE-2 scores between the sentence and multiple documents, and also a new definition of the sentence overlapping feature. Then, we assign each sentence a score using the learned model. We calculate pairwise overlapping between the sentences and finally select the sentences with higher scores and less redundancy. These sentences form the final summary under a length constraint. Our method is language-independent, and it can be applied to other languages with minor changes. The proposed method is tested on the DUC 2006 and 2007 datasets. The effectiveness of this technique is measured using the ROUGE score, and the results are promising when compared with some existing methods.
Chapter
Eventuality-centric knowledge graphs are essential resources for many downstream applications. However, current knowledge graphs mainly focus on knowledge about entities while ignoring real-world eventualities (including events and states). To fill this gap, we propose to build ECCKG, a high-quality eventuality-centric commonsense knowledge graph. We argue that rule-based methods are of great value for knowledge graph construction, but must be used in conjunction with other techniques such as crowdsourcing. We thus create ECCKG by combining rule-based reasoning with crowdsourcing. We first acquire a seed ECCKG by manually filtering out the incorrect and duplicate eventuality-related commonsense assertions in ConceptNet 5.5. Then we enrich the seed ECCKG with a set of logical rules iteratively. Finally, we generate new commonsense assertions by instantiating the existing eventualities. The resulting ECCKG contains more than 1.3 million eventuality-centric commonsense knowledge tuples, which is about 15 times larger than ConceptNet 5.5. A manual evaluation shows that ECCKG outperforms other eventuality-centric commonsense knowledge graphs in terms of both quality and quantity. We also demonstrate the usefulness of ECCKG with the extrinsic use case of commonsense knowledge acquisition. ECCKG is available at https://zenodo.org/record/6084081.

Keywords: Eventuality-centric knowledge graph · Knowledge graph completion · Commonsense knowledge acquisition
Article
The World Wide Web (WWW) plays a vital role in sharing dynamic knowledge in every field of life. The information on the web comprises a huge amount of data in different forms: structured, semi-structured, or entirely unstructured. Due to the huge size of this information, searching larger textual data about a specific topic or getting precise information is a challenging task. All this leads to the problem of word sense ambiguity (WSA). An Urdu-language information retrieval system using different techniques related to a Web Semantic Search Engine architecture is proposed to efficiently retrieve relevant information and solve the problem of WSA. The proposed system has an average precision ratio of 96% for single-word queries, compared to average precision ratios of 74% and 75% for Bing and Google respectively. For long text queries, our system achieves 92% accuracy and outperforms famous search engines such as Bing and Google, which have 16.50% and 16% accuracy respectively. Similarly, for single-word queries, the proposed system's recall ratio is 32.25%, compared to 25% for both Bing and Google. The recall results for long text queries are improved as well, showing 6.38% compared to 6.20% and 4.8% for Bing and Google respectively. The results showed that the proposed system gives better and more efficient results than existing systems for the Urdu language.
Conference Paper
Full-text available
The TempEval task proposes a simple way to evaluate the automatic extraction of temporal relations. It avoids the pitfalls of evaluating a graph of inter-related labels by defining three subtasks that allow pairwise evaluation of temporal relations. The task not only allows straightforward evaluation, but also avoids the complexities of full temporal parsing.
Article
Full-text available
Topic Detection and Tracking (TDT) is a research initiative that aims at techniques to organize news documents in terms of news events. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms that have the meaning of the same type. Such a group can be associated with an external ontology. This ontology is used to determine the similarity of two terms in the given group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. The similarity of two documents is measured by comparing their corresponding sub-vectors one pair at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and the temporal similarity measures need to be improved. Especially the vagueness of spatial and temporal terms needs to be addressed.
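A minimal sketch of the split-vector similarity idea; the semantic classes, example documents, and per-class weights below are all invented for illustration, and the paper learns the weights with a perceptron rather than fixing them by hand.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def doc_similarity(doc1, doc2, weights):
    """Compare documents class by class (names, locations, times, terms)."""
    return sum(w * cosine(Counter(doc1.get(cls, [])), Counter(doc2.get(cls, [])))
               for cls, w in weights.items())

d1 = {"names": ["obama"], "locations": ["paris"], "terms": ["summit", "talks"]}
d2 = {"names": ["obama"], "locations": ["london"], "terms": ["summit", "visit"]}
weights = {"names": 0.4, "locations": 0.2, "times": 0.1, "terms": 0.3}
print(round(doc_similarity(d1, d2, weights), 3))  # 0.55
```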
Article
Full-text available
Most existing research on applying matrix factorization approaches to query-focused multi-document summarization (Q-MDS) explores either soft/hard clustering or low-rank approximation methods. We employ a different kind of matrix factorization method, namely weighted archetypal analysis (wAA), for Q-MDS. In query-focused summarization, given a graph representation of a set of sentences weighted by similarity to the given query, positively and/or negatively salient sentences are values on the weighted data set boundary. We choose to use wAA to compute these extreme values, archetypes, and hence to estimate the importance of sentences in the target document set. We investigate the impact of using the multi-element graph model for query-focused summarization via wAA. We conducted experiments on the data of the Document Understanding Conference (DUC) 2005 and 2006. Experimental results evidence the improvement of the proposed approach over other closely related methods and many state-of-the-art systems.
Article
Full-text available
In this paper, a new multi-document summarization framework which combines rhetorical roles and corpus-based semantic analysis is proposed. The approach is able to capture the semantic and rhetorical relationships between sentences so as to combine them to produce coherent summaries. Experiments were conducted on datasets extracted from web-based news using standard evaluation methods. Results show the promise of our proposed model as compared to state-of-the-art approaches.
Article
Full-text available
In the American political process, news discourse concerning public policy issues is carefully constructed. This occurs in part because both politicians and interest groups take an increasingly proactive approach to amplify their views of what an issue is about. However, news media also play an active role in framing public policy issues. Thus, in this article, news discourse is conceived as a sociocognitive process involving all three players: sources, journalists, and audience members operating in the universe of shared culture and on the basis of socially defined roles. Framing analysis is presented as a constructivist approach to examine news discourse, with the primary focus on conceptualizing news texts into empirically operationalizable dimensions – syntactical, script, thematic, and rhetorical structures – so that evidence of the news media's framing of issues in news texts may be gathered. This is considered an initial step toward analyzing the news discourse process as a whole. Finally, an extended empirical example is provided to illustrate the applications of this conceptual framework of news texts.
Conference Paper
Full-text available
We propose a new approach to characterizing the timeline of a text: temporal dependency structures, where all the events of a narrative are linked via partial ordering relations like BEFORE, AFTER, OVERLAP and IDENTITY. We annotate a corpus of children's stories with temporal dependency trees, achieving agreement (Krippendorff's Alpha) of 0.856 on the event words, 0.822 on the links between events, and of 0.700 on the ordering relation labels. We compare two parsing models for temporal dependency structures, and show that a deterministic non-projective dependency parser outperforms a graph-based maximum spanning tree parser, achieving labeled attachment accuracy of 0.647 and labeled tree edit distance of 0.596. Our analysis of the dependency parser errors gives some insights into future research directions.
Conference Paper
Full-text available
This paper will focus on the semantic representation of verbs in computer systems and its impact on lexical selection problems in machine translation (MT). Two groups of English and Chinese verbs are examined to show that lexical selection must be based on interpretation of the sentences as well as selection restrictions placed on the verb arguments. A novel representation scheme is suggested, and is compared to representations with selection restrictions used in transfer-based MT. We see our approach as closely aligned with knowledge-based MT approaches (KBMT), and as a separate component that could be incorporated into existing systems. Examples and experimental results will show that, using this scheme, inexact matches can achieve correct lexical selection.
Article
Full-text available
Most multi-document summarizers utilize term frequency related features to determine sentence importance. No empirical studies, however, have been carried out to isolate the contribution made by frequency information from that of other features. Here, we examine the impact of frequency on various aspects of summarization and the role of frequency in the design of a summarization system. We describe SumBasic, a summarization system that exploits frequency exclusively to create summaries. SumBasic outperforms many of the summarization systems in DUC 2004, and performs very well in the 2005 MSE evaluation, confirming that frequency alone is a powerful feature in summary creation. We also demonstrate how a frequency-based summarizer can incorporate context adjustment in a natural way, and show that this adjustment contributes to the good performance of the summarizer and is sufficient means for duplication removal in multi-document summarization.
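A condensed sketch of the frequency-and-update loop, under simplifying assumptions: tokenization is naive whitespace splitting, sentences are assumed non-empty, and the context adjustment is reduced to the standard probability-squaring update.

```python
from collections import Counter

def sumbasic(sentences, max_sents=3):
    """Frequency-only extractive summarization with a SumBasic-style update."""
    tokenized = [s.lower().split() for s in sentences]  # assumes non-empty sentences
    counts = Counter(w for sent in tokenized for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen = []
    while len(chosen) < min(max_sents, len(sentences)):
        # Score each unchosen sentence by its average word probability.
        scores = {i: sum(prob[w] for w in sent) / len(sent)
                  for i, sent in enumerate(tokenized) if i not in chosen}
        best = max(scores, key=scores.get)
        chosen.append(best)
        # Update: squaring probabilities of used words discourages repetition.
        for w in tokenized[best]:
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in chosen]

docs = ["the hurricane hit the coast", "rescue teams reached the coast",
        "officials promised aid", "the hurricane weakened at sea"]
print(sumbasic(docs, max_sents=2))
# ['the hurricane hit the coast', 'officials promised aid']
```

Note how the second pick avoids the already-covered "hurricane"/"coast" vocabulary: the squaring update is what performs duplication removal.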
Article
Full-text available
Tempeval-2 comprises evaluation tasks for time expressions, events and temporal relations, the latter of which was split into four subtasks, motivated by the notion that smaller subtasks would make both data preparation and temporal relation extraction easier. Manually annotated data were provided for six languages: Chinese, English, French, Italian, Korean and Spanish.
Article
Full-text available
This paper presents TIPSem, a system to extract temporal information from natural language texts for English and Spanish. TIPSem learns CRF models from training data. Although the features used include different language analysis levels, the approach is focused on semantic information. For Spanish, TIPSem achieved the best F1 score in all the tasks. For English, it obtained the best F1 in tasks B (events) and D (event-dct links), and was among the best systems in the rest.
Article
Full-text available
We describe the Edinburgh information extraction system which we are currently adapting for analysis of newspaper text as part of the SYNC3 project. Our most recent focus is geospatial and temporal grounding of entities and it has been useful to participate in TempEval-2 to measure the performance of our system and to guide further development. We took part in Tasks A and B for English.
Article
Full-text available
While the mechanisms for conveying temporal information in language have been extensively studied by linguists, very little of this work has been done in the tradition of corpus linguistics. In this paper we discuss the outcomes of a research effort to build a corpus, called TIMEBANK, which is richly annotated to indicate events, times, and temporal relations. We describe the annotation scheme, the corpus sources and tools used in the annotation process, and then report some preliminary figures about the distribution of various phenomena across the corpus. TIMEBANK represents the most fine-grained and extensively temporally annotated corpus to date, and will be a valuable resource both for corpus linguists interested in time and language, and for language engineers interested in applications such as question answering and information extraction, for which accurate knowledge of the position and ordering of events in time is of key importance.
Article
Full-text available
Given the advance of Internet technologies, we can now easily extract hundreds or thousands of news stories about any ongoing incident from newswires such as CNN.com, but the volume of information is too large for us to capture the blueprint. Information retrieval techniques such as topic detection and tracking are able to organize news stories as events, in a flat hierarchical structure, within a topic. However, they are incapable of presenting the complex evolution relationships between the events. We are interested in learning not only what the major events are but also how they develop within the topic. It is beneficial to identify the seminal events, the intermediary and ending events, and the evolution of these events. In this paper, we propose to utilize the event timestamp, event content similarity, temporal proximity, and document distributional proximity to model the event evolution relationships between events in an incident. An event evolution graph is constructed to present the underlying structure of events for efficient browsing and extraction of information. A case study and experiments are presented to illustrate and show the performance of our proposed technique. It is found that our proposed technique outperforms the baseline technique and other comparable techniques from previous work.
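A minimal sketch of the edge-scoring idea; the similarity function, decay constant, and threshold below are invented placeholders, whereas the paper combines content similarity, temporal proximity, and document distributional proximity.

```python
from math import exp

def build_event_evolution_graph(events, sim, tau=7.0, threshold=0.25):
    """Add edge i -> j when j follows i in time and the decayed similarity is high.

    events: list of (event_id, timestamp_in_days, feature) tuples.
    sim: function scoring content similarity of two event features in [0, 1].
    tau: temporal decay constant in days; threshold: minimum edge score.
    """
    edges = []
    for i, (id_i, t_i, f_i) in enumerate(events):
        for id_j, t_j, f_j in events[i + 1:]:
            if t_j <= t_i:
                continue  # evolution edges only point forward in time
            score = sim(f_i, f_j) * exp(-(t_j - t_i) / tau)
            if score >= threshold:
                edges.append((id_i, id_j, round(score, 3)))
    return edges

events = [("quake", 0, {"quake", "rescue"}),
          ("rescue", 1, {"rescue", "aid"}),
          ("rebuild", 30, {"aid", "rebuild"})]
jaccard = lambda a, b: len(a & b) / len(a | b)
print(build_event_evolution_graph(events, jaccard))
# [('quake', 'rescue', 0.289)] -- the month-later event decays below threshold
```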
Conference Paper
Full-text available
As most ‘real-world’ data is structured, research in kernel methods has begun investigating kernels for various kinds of structured data. One of the most widely used tools for modeling structured data are graphs. An interesting and important challenge is thus to investigate kernels on instances that are represented by graphs. So far, only very specific graphs such as trees and strings have been considered. This paper investigates kernels on labeled directed graphs with general structure. It is shown that computing a strictly positive definite graph kernel is at least as hard as solving the graph isomorphism problem. It is also shown that computing an inner product in a feature space indexed by all possible graphs, where each feature counts the number of subgraphs isomorphic to that graph, is NP-hard. On the other hand, inner products in an alternative feature space, based on walks in the graph, can be computed in polynomial time. Such kernels are defined in this paper.
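A small numpy sketch of the polynomial-time, walk-based idea: a geometric random-walk kernel on the direct product of two labeled graphs. The example graphs and decay factor are invented; the decay must stay below the reciprocal of the product graph's spectral radius for the closed form to converge.

```python
import numpy as np

def walk_kernel(adj1, labels1, adj2, labels2, lam=0.1):
    """Geometric walk kernel: count label-matching common walks of all lengths.

    Builds the direct product graph (nodes are label-matching node pairs) and
    sums lam^k over k-step walks via the closed form 1^T (I - lam*A)^-1 1.
    """
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    n = len(pairs)
    A = np.zeros((n, n))
    for p, (i, j) in enumerate(pairs):
        for q, (k, l) in enumerate(pairs):
            A[p, q] = adj1[i][k] * adj2[j][l]   # edge must exist in both graphs
    ones = np.ones(n)
    return ones @ np.linalg.solve(np.eye(n) - lam * A, ones)

# Two tiny labeled graphs: an a-b-c path versus an a-b edge.
g1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
g2 = np.array([[0, 1], [1, 0]], dtype=float)
print(walk_kernel(g1, ["a", "b", "c"], g2, ["a", "b"]))  # ~2.222
```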
Conference Paper
Full-text available
In this paper we provide a description of TimeML, a rich specification language for event and temporal expressions in natural language text, developed in the context of the AQUAINT program on Question Answering Systems. Unlike most previous work on event annotation, TimeML captures three distinct phenomena in temporal markup: (1) it systematically anchors event predicates to a broad range of temporally denotating expressions; (2) it orders event expressions in text relative to one another, both intrasententially and in discourse; and (3) it allows for a delayed (underspecified) interpretation of partially determined temporal expressions. We demonstrate the expressiveness of TimeML for a broad range of syntactic and semantic contexts, including aspectual predication, modal subordination, and an initial treatment of lexical and constructional causation in text.
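For concreteness, here is a toy, simplified TimeML-style fragment parsed with Python's standard library; the markup below is illustrative only, not a complete or validated TimeML document.

```python
import xml.etree.ElementTree as ET

doc = """<TimeML>
  The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> the deal
  <TIMEX3 tid="t1" type="DATE" value="2013-05-02">yesterday</TIMEX3>.
  <TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
</TimeML>"""

root = ET.fromstring(doc)
events = {e.get("eid"): e.text for e in root.iter("EVENT")}
times = {t.get("tid"): t.get("value") for t in root.iter("TIMEX3")}
for link in root.iter("TLINK"):
    # TLINKs order events relative to times (or other events).
    print(events[link.get("eventID")], link.get("relType"),
          times[link.get("relatedToTime")])
# announced IS_INCLUDED 2013-05-02
```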
Conference Paper
Full-text available
This paper presents a near real-time multilingual news monitoring and analysis system that forms the backbone of our research work. The system integrates technologies to address the problems related to information extraction and analysis of open source intelligence on the World Wide Web. By chaining together different techniques in text mining, automated machine learning and statistical analysis, we can automatically determine who, where and, to a certain extent, what is being reported in news articles.
Conference Paper
Full-text available
We consider the problem of constructing a directed acyclic graph that encodes temporal relations found in a text. The unit of our analysis is a temporal segment, a fragment of text that maintains temporal coherence. The strength of our approach lies in its ability to simultaneously optimize pairwise ordering preferences and global constraints on the graph topology. Our learning method achieves 83% F-measure in temporal segmentation and 84% accuracy in inferring temporal relations between two segments.
Conference Paper
Full-text available
Event-based summarization attempts to select and organize the sentences in a summary with respect to the events or the sub-events that the sentences describe. Each event has its own internal structure, and meanwhile often relates to other events semantically, temporally, spatially, causally or conditionally. In this paper, we define an event as one or more event terms along with the associated named entities, and present a novel approach to derive intra- and inter-event relevance using the information of internal association, semantic relatedness, distributional similarity and named entity clustering. We then apply the PageRank algorithm to estimate the significance of an event for inclusion in a summary from the derived event relevance. Experiments on the DUC 2001 test data show that the relevance of the named entities involved in events achieves better results when their relevance is derived from the event terms with which they are associated. It also reveals that the topic-specific relevance from the documents themselves outperforms the semantic relevance from a general-purpose knowledge base like WordNet.
Conference Paper
Full-text available
In this paper, we present a linguistic resource that annotates event structures in texts. We consider an event structure as a collection of events that interact with each other in a given situation. We interpret the interactions between events as event relations. In this regard, we propose and annotate a set of six relations that best capture the concept of event structure. These relations are: subevent, reason, purpose, enablement, precedence and related. A document from this resource can encode multiple event structures and an event structure can be described across multiple documents. In order to unify event structures, we also annotate inter- and intra-document event coreference. Moreover, we provide methodologies for automatic discovery of event structures from texts. First, we group the events that constitute an event structure into event clusters and then, we use supervised learning frameworks to classify the relations that exist between events from the same cluster.
Article
This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.
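The best-known outcome of this line of work is the Robertson/Sparck Jones relevance weight; a standard transcription with the usual 0.5 smoothing looks like this (the example counts are invented):

```python
from math import log

def rsj_weight(N, R, n, r):
    """Robertson/Sparck Jones relevance weight for one search term.

    N: documents in the collection        R: known relevant documents
    n: documents containing the term      r: relevant documents containing it
    (the 0.5 smoothing keeps the weight finite for small counts)
    """
    return log(((r + 0.5) * (N - n - R + r + 0.5)) /
               ((n - r + 0.5) * (R - r + 0.5)))

# A term seen in 8 of 10 relevant docs but only 100 of 10,000 docs overall.
print(round(rsj_weight(N=10_000, R=10, n=100, r=8), 2))  # 5.9
```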
Article
Identifying news stories that discuss the same real-world events is important for news tracking and retrieval. Most existing approaches rely on the traditional vector space model. We propose an approach for recognizing identical real-world events based on a structured, event-oriented document representation. We structure documents as graphs of event mentions and use graph kernels to measure the similarity between document pairs. Our experiments indicate that the proposed graph-based approach can outperform the traditional vector space model, and is especially suitable for distinguishing between topically similar, yet non-identical events.
Conference Paper
This paper reports on a large-scale, end-to-end relation and event extraction system. At present, the system extracts a total of 100 types of relations and events, which represents a much wider coverage than is typical of extraction systems. The system consists of three specialized pattern-based tagging modules, a high-precision coreference resolution module, and a configurable template generation module. We report quantitative evaluation results, analyze the results in detail, and discuss future directions.
Article
Minimum spanning trees (MST) and single linkage cluster analysis (SLCA) are explained and it is shown that all the information required for the SLCA of a set of points is contained in their MST. Known algorithms for finding the MST are discussed. They are efficient even when there are very many points; this makes a SLCA practicable when other methods of cluster analysis are not. The relevant computing procedures are published in the Algorithm section of the same issue of Applied Statistics. The use of the MST in the interpretation of vector diagrams arising in multivariate analysis is illustrated by an example.
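A small sketch of the MST/SLCA connection: Prim's algorithm on a toy distance matrix, followed by cutting every MST edge longer than a threshold, which yields exactly the single-linkage clusters at that distance.

```python
def prim_mst(dist):
    """Return MST edges (i, j, d) for a symmetric distance matrix."""
    n = len(dist)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

def single_linkage_clusters(dist, cut):
    """Clusters = connected components after removing MST edges above `cut`."""
    n = len(dist)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j, d in prim_mst(dist):
        if d <= cut:
            parent[find(i)] = find(j)   # union the two components
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

dist = [[0, 1, 9, 9], [1, 0, 8, 9], [9, 8, 0, 2], [9, 9, 2, 0]]
print(single_linkage_clusters(dist, cut=3))  # [[0, 1], [2, 3]]
```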
Conference Paper
Because event mentions in text may be referentially ambiguous, event coreferentiality often involves uncertainty. In this paper we consider event coreference uncertainty and explore how it is affected by the context. We develop a supervised event coreference resolution model based on the comparison of generically extracted event mentions. We analyse event coreference uncertainty in both human annotations and predictions of the model, and in both within-document and cross-document setting. We frame event coreference as a classification task when full context is available and no uncertainty is involved, and a regression task in a limited context setting that involves uncertainty. We show how a rich set of features based on argument comparison can be utilized in both settings. Experimental results on English data suggest that our approach is especially suitable for resolving cross-document event coreference. Results also suggest that modelling human coreference uncertainty in the case of limited context is feasible.
Article
Most information retrieval systems use keywords entered by the user as the search criteria to find documents. However, the language used in documents is often complicated and ambiguous, and thus the results obtained by using keywords are often inaccurate. To address this problem, this study developed a semantic-based content mapping mechanism for an information retrieval system. This approach employs the semantic features and ontological structure of the content as the basis for constructing a content map, thus simplifying the search process and improving the accuracy of the returned results.
Article
Sentence-based multi-document summarization is the task of generating a succinct summary of a document collection, which consists of the most salient document sentences. In recent years, the increasing availability of semantics-based models (e.g., ontologies and taxonomies) has prompted researchers to investigate their usefulness for improving summarizer performance. However, semantics-based document analysis is often applied as a preprocessing step, rather than integrating the discovered knowledge into the summarization process. This paper proposes a novel summarizer, namely Yago-based Summarizer, that relies on an ontology-based evaluation and selection of the document sentences. To capture the actual meaning and context of the document sentences and generate sound document summaries, an established entity recognition and disambiguation step based on the Yago ontology is integrated into the summarization process. The experimental results, which were achieved on the DUC'04 benchmark collections, demonstrate the effectiveness of the proposed approach compared to a large number of competitors as well as the qualitative soundness of the generated summaries.
Article
This paper presents a methodology and a prototype for extracting and indexing knowledge from natural language documents. The underlying domain model relies on a conceptual level (described by means of a domain ontology), which represents the domain knowledge, and a lexical level (based on WordNet), which represents the domain vocabulary. A stochastic model (the ME-2L-HMM2, which mixes – in a novel way – HMM and maximum entropy models) stores the mapping between such levels, taking into account the linguistic context of words. Not only does such a context contain the surrounding words; it also contains morphologic and syntactic information extracted using natural language processing tools. The stochastic model is then used, during the document indexing phase, to disambiguate word meanings. The semantic information retrieval engine we developed supports simple keyword-based queries, as well as natural language-based queries. The engine is also able to extend the domain knowledge, discovering new and relevant concepts to add to the domain model. The validation tests indicate that the system is able to disambiguate and extract concepts with good accuracy. A comparison between our prototype and a classic search engine shows that the proposed approach is effective in providing better accuracy.
Article
The summarization track at the Text Analysis Conference (TAC) is a direct continuation of the Document Understanding Conference (DUC) series of workshops, focused on providing a common data and evaluation framework for research in automatic summarization. In the TAC 2008 summarization track, the main task was to produce two 100-word summaries from two related sets of 10 documents, where the second summary was an update summary. While all of the 71 submitted runs were automatically scored with the ROUGE and BE metrics, NIST assessors manually evaluated only 57 of the submitted runs for readability, content, and overall responsiveness.
Article
Most approaches to extractive summarization define a set of features upon which the selection of sentences is based, using algorithms independent of the features themselves. We propose a new set of features based on low-level, atomic events that describe relationships between important actors in a document or set of documents. We investigate the effect this new feature has on extractive summarization, compared with a baseline feature set consisting of the words in the input documents, and with state-of-the-art summarization systems. Our experimental results indicate not only that the event-based features offer an improvement in summary quality over words as features, but that this effect is more pronounced for more sophisticated summarization methods that avoid redundancy in the output.
Article
Event detection and recognition is a complex task consisting of multiple sub-tasks of varying difficulty. In this paper, we present a simple, modular approach to event extraction that allows us to experiment with a variety of machine learning methods for these sub-tasks, as well as to evaluate the impact these sub-tasks have on the performance of the overall task.
Conference Paper
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
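A minimal sketch of the ROUGE-N recall computation: clipped n-gram overlap against a single reference. The real package adds multi-reference support, stemming, stopword handling, and confidence intervals.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Recall: clipped overlapping n-grams / total n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))  # 5/6
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))  # 3/5
```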
Article
An interval-based temporal logic is introduced together with a computationally effective reasoning algorithm based on constraint propagation. This system is notable in offering a delicate balance between expressive power and the efficiency of its deductive engine. A notion of reference intervals is introduced which captures the temporal hierarchy implicit in many domains, and which can be used to precisely control the amount of deduction performed automatically by the system. Examples are provided for a database containing historical data, a database used for modeling processes and process interaction, and a database for an interactive system where the present moment is continually being updated.
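A compact sketch of the pairwise relations underlying such an interval algebra: classifying two intervals into one of Allen's thirteen basic relations. Full constraint propagation over the composition table is beyond this snippet; the intervals are assumed proper (start strictly before end).

```python
def allen_relation(a, b):
    """Classify intervals a = (start, end) and b into one of Allen's 13 relations."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1: return "before"
    if b2 < a1: return "after"
    if a2 == b1: return "meets"
    if b2 == a1: return "met-by"
    if a1 == b1 and a2 == b2: return "equal"
    if a1 == b1: return "starts" if a2 < b2 else "started-by"
    if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2: return "during"
    if a1 < b1 and b2 < a2: return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"

print(allen_relation((1, 3), (3, 6)))   # meets
print(allen_relation((2, 4), (1, 10)))  # during
```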
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
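A minimal power-iteration sketch of the random-surfer computation on a dense toy graph; production implementations work on sparse matrices and treat dangling nodes more carefully.

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """Power iteration on the random-surfer Markov chain."""
    n = adj.shape[0]
    out = np.maximum(adj.sum(axis=1, keepdims=True), 1)  # avoid /0 on dangling nodes
    M = adj / out                                        # row-stochastic transitions
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        # With prob. d follow a link, otherwise jump to a uniformly random page.
        rank = (1 - d) / n + d * M.T @ rank
    return rank

# Toy web: page 2 is linked by both 0 and 1, and links back to 0.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(pagerank(adj).round(3))
```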
Article
The relationship between diseases and their causative genes can be complex, especially in the case of polygenic diseases. Further exacerbating the challenges in their study is that many genes may be causally related to multiple diseases. This study explored the relationship between diseases through the adaptation of an approach pioneered in the context of information retrieval: vector space models. A vector space model approach was developed that bridges gene-disease knowledge inferred across three knowledge bases: Online Mendelian Inheritance in Man, GenBank, and Medline. The approach was then used to identify potentially related diseases for two target diseases: Alzheimer disease and Prader-Willi syndrome. In the case of both diseases, a set of plausible related diseases was identified that may warrant further exploration. This study furthers seminal work by Swanson et al. that demonstrated the potential for mining literature for putative correlations. Using a vector space modeling approach, information from both biomedical literature and genomic resources (like GenBank) can be combined towards the identification of putative correlations of interest. To this end, the relevance of the diseases predicted in this study using the vector space modeling approach was validated based on supporting literature. The results of this study suggest that a vector space model approach may be a useful means to identify potential relationships between complex diseases, and thereby enable the coordination of gene-based findings across multiple complex diseases.
Conference Paper
With the overwhelming volume of online news available today, there is an increasing need for automatic techniques to analyze and present news to the user in a meaningful and efficient manner. Previous research focused only on organizing news stories by their topics into a flat hierarchy. We believe viewing a news topic as a flat collection of stories is too restrictive and inefficient for a user to understand the topic quickly. In this work, we attempt to capture the rich structure of events and their dependencies in a news topic through our event models. We call the process of recognizing events and their dependencies event threading. We believe our perspective of modeling the structure of a topic is more effective in capturing its semantics than a flat list of on-topic stories. We formally define the novel problem, suggest evaluation metrics and present a few techniques for solving the problem. Besides the standard word-based features, our approaches take into account novel features such as temporal locality of stories for event recognition and time-ordering for capturing dependencies. Our experiments on manually labeled data sets show that our models effectively identify the events and capture dependencies among them.
Conference Paper
Update summarization aims to create a summary over a topic-related multi-document dataset based on the assumption that the user has already read a set of earlier documents of the same topic. Beyond the problems (i.e., topic relevance, salience, and diversity in extracted information) tackled by topic-focused multi-document summarization, the update summarization must address the novelty problem as well. In this paper, we propose a novel extractive approach based on manifold ranking with sink points for update summarization. Specifically, our approach leverages a manifold ranking process over the sentence manifold to find topic relevant and salient sentences. More important, by introducing the sink points into sentence manifold, the ranking process can further capture the novelty and diversity based on the intrinsic sentence manifold. Therefore, we are able to address the four challenging problems above for update summarization in a unified way. Experiments on benchmarks of TAC are performed and the evaluation results show that our approach can achieve comparative performance to the existing best performing systems in TAC tasks.
Conference Paper
Information retrieval evaluation based on the pooling method is inherently biased against systems that did not contribute to the pool of judged documents. This may distort the results obtained about the relative quality of the systems evaluated and thus lead to incorrect conclusions about the performance of a particular ranking technique. We examine the magnitude of this effect and explore how it can be countered by automatically building an unbiased set of judgements from the original, biased judgements obtained through pooling. We compare the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and show that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.
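For reference, a small sketch of the bpref measure mentioned above, following the standard formulation: only judged documents count, and unjudged documents in the ranking are simply skipped.

```python
def bpref(ranking, relevant, nonrelevant):
    """bpref = (1/R) * sum over relevant r of 1 - min(#nonrel above r, R) / min(R, N)."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0 or N == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            score += 1 - min(nonrel_seen, R) / min(R, N)
        # unjudged documents contribute nothing either way
    return score / R

ranking = ["d1", "d9", "d2", "d3", "d4"]  # d9 is unjudged
print(bpref(ranking, relevant={"d1", "d3"}, nonrelevant={"d2", "d4"}))  # 0.75
```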
Conference Paper
New Event Detection is a challenging task that still offers scope for great improvement after years of effort. In this paper we show how performance on New Event Detection (NED) can be improved by the use of text classification techniques as well as by using named entities in a new way. We explore modifications to the document representation in a vector space-based NED system. We also show that addressing named entities preferentially is useful only in certain situations. A combination of all the above results in a multi-stage NED system that performs much better than baseline single-stage NED systems.
Conference Paper
This paper proposes a novel method to translate tags attached to multimedia contents for cross-language retrieval. The main issue in this problem is the sense disambiguation of tags given with few textual contexts. In order to solve this problem, the proposed method represents both tags and its translation candidates as networks of co-occurring tags since a network allows richer expression of contexts than other expressions such as co-occurrence vectors. The method translates a tag by selecting the optimal one from possible candidates based on a network similarity even when neither the textual contexts nor sophisticated language resources are available. The experiments on the MIR Flickr-2008 test set show that the proposed method achieves 90.44% accuracy in translating tags from English into German, which is significantly higher than the baseline methods of a frequency based translation and a co-occurrence-based translation.
Conference Paper
We have recently completed the sixth in a series of "Message Understanding Conferences" which are designed to promote and evaluate research in information extraction. MUC-6 introduced several innovations over prior MUCs, most notably in the range of different tasks for which evaluations were conducted.