About
94 Publications
14,250 Reads
990 Citations
Introduction
Marina Litvak currently works at the Department of Software Engineering, Shamoon College of Engineering. Marina does research in Data Mining, Information Science and Artificial Intelligence. Their current project is 'Multilingual text analysis'.
Publications (94)
The purpose of this survey is to provide a comprehensive overview of recent advancements in text line segmentation and baseline detection techniques within the analysis of historical document images. Text line extraction is an essential step in the historical documents image analysis pipeline, as its results significantly impact the accuracy of sub...
Improving factual consistency in abstractive summarization has been a focus of recent research. One promising approach is the post-editing method. However, previous works have yet to make sufficient use of factual factors in summaries and suffer from the negative effect of the training datasets. In this paper, we first propose a novel factual error...
The gender identification of authors in literary texts is a compelling research area at the intersection of computational linguistics and natural language processing, offering insights into historical biases and socio-cultural dynamics while enriching our understanding of literary traditions. This study is inspired by the historical context of wome...
This paper presents a streamlined taxonomy for categorizing offensive language in Arabic, specifically Modern Standard Arabic (MSA) and the Levantine dialect. Addressing a gap in the existing literature, which has mainly focused on Indo-European languages, our taxonomy divides offensive language into seven levels (six explicit and one implicit). We...
Gender identification of authors in literary texts is a compelling area of research within computational linguistics and natural language processing. Analyzing the gender of authors can uncover biases and socio-cultural dynamics of the past, deepening our understanding of historical texts. Inspired by the historical context where women often used m...
The purpose of this survey is to provide a comprehensive overview of recent advancements in text line segmentation and baseline detection techniques within the analysis of historic document images. Text line extraction is an essential step in the historical documents image analysis pipeline, as its results significantly impact the accuracy of subse...
The Seventh International Workshop on Narrative Extraction from Texts (Text2Story'24) was held on March 24th, 2024, in conjunction with the 46th European Conference on Information Retrieval (ECIR 2024) in Glasgow, Scotland. Over the day, more than 50 attendees engaged in discussions and presentations focused on recent advancements in narrative r...
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide ra...
The Text2Story Workshop series, dedicated to Narrative Extraction from Texts, has been running successfully since 2018. Over the past six years, significant progress, largely propelled by Transformers and Large Language Models, has advanced our understanding of natural language text. Nevertheless, the representation, analysis, generation, and compr...
The first edition of the International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT'23) was held on July 27th, 2023, in conjunction with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) in Taipei, Taiwan. To support both online and in-person particip...
This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in the literature that has, until now, largely focused on Indo-European languages. Our taxonomy divides offensive language into seven levels (six explicit and one implicit level). We based our work on the simplified offensive language (SOL)...
The issue of factual consistency in abstractive summarization has received extensive attention in recent years, and the evaluation of factual consistency between summary and document has become an important and urgent task. Most of the current metrics are adopted from the question answering (QA) or natural language inference (NLI) task. However, th...
The Sixth International Workshop on Narrative Extraction from Texts (Text2Story'23) was held on April 2nd, 2023, in conjunction with the 45th European Conference on Information Retrieval (ECIR 2023) in Dublin, Ireland. Continuing the tradition of past years, the workshop was held as a hybrid event. Online participation was allowed using the Zoom...
Over these past five years, significant breakthroughs, led by Transformers and large language models, have been made in understanding natural language text. However, the ability to capture contextual nuances in longer texts is still an elusive goal, let alone the understanding of consistent fine-grained narrative structures in text. These unsolved...
Automatic text summarization aims at producing a shorter version of a document (or a document set). Extractive summarizers compile summaries by extracting a subset of sentences from a given text, while abstractive summarizers generate new sentences. Both types of summarizers strive to preserve the meaning of the original document as much as possibl...
We present a novel supervised approach to sentence compression, based on classification and removal of word sequences generated from subtrees of the original sentence dependency tree. Our system may use any known classifier like Support Vector Machines or Logistic Model Tree to identify word sequences that can be removed without compromising the gr...
The Fifth International Workshop on Narrative Extraction from Texts (Text2Story'22) was held on April 10th, 2022, in conjunction with the 44th European Conference on Information Retrieval (ECIR 2022) in Stavanger, Norway. Due to the COVID-19 restrictions that are still active in some countries, the workshop was held as a hybrid event, combi...
Automatically identifying the gender of a writer from a handwritten sample is an essential task in various domains, including historical document analysis, handwriting biometrics, and psychology. Technological advances in computer vision and image analysis have yielded various techniques suitable for this task, each with its own merits and limitati...
This work focuses on automatic gender and age prediction tasks from handwritten documents. This problem is of interest in a variety of fields, such as historical document analysis and forensic investigations. The challenge for automatic gender and age classification can be demonstrated by the relatively low performances of the existing methods. In...
The issue of factual consistency in abstractive summarization has attracted much attention in recent years, and the evaluation of factual consistency between summary and document has become an important and urgent task. Most of the current evaluation metrics are adopted from question answering (QA). However, the application of QA-based metrics...
Plant classification requires the eye of an expert in botany when subtle differences in stem or petals distinguish one species from another. Hence, accurate automatic plant classification might be of great assistance to a person who studies agriculture, travels, or explores rare species. This paper focuses on a specific task of urban pl...
This paper reports an approach for summarizing financial texts that combines several techniques for sentence representation and neural document modeling. Our approach is extractive, and it follows the classic pipeline of ranking and subsequently selecting the top-ranked text chunks. We evaluate our method on the financial reports provided in the Fin...
Narrative extraction, understanding, verification, and visualization are currently popular topics for users interested in achieving a deeper understanding of text, researchers who want to develop accurate methods for text mining, and commercial companies that strive to provide efficient tools for that. Information Retrieval (IR), Natural Language P...
Using a handwritten sample to automatically classify the writer’s gender is an essential task in a wide range of areas, e.g., psychology, historical documents classification, and forensic analysis. The challenge of gender prediction from offline handwriting can be demonstrated by the relatively low (below 90%) performance of state-of-the-art system...
Definitions are extremely important for efficient learning of new materials. In particular, mathematical definitions are necessary for understanding mathematics-related areas. Automated extraction of definitions could be very useful for automated indexing of educational materials, building taxonomies of relevant concepts, and more. For definitions tha...
Facing the COVID-19 pandemic, governments have implemented a wide range of policies to contain the spread of the virus. During the pandemic, large amounts of COVID-19-related tweets emerge every day. Real-time processing of daily tweets may offer insights for monitoring public opinion about intervention measures implemented. In this work, lockdow...
Due to the subjectivity of summarization, it is good practice to have more than one gold summary for each training document. However, many modern large-scale abstractive summarization datasets have only one-to-one samples written by different humans with different styles. The impact of this phenomenon is understudied. We formulate the differen...
Automatic definition extraction from texts is an important task that has numerous applications in several natural language processing fields such as summarization, analysis of scientific texts, automatic taxonomy generation, ontology generation, concept identification, and question answering. For definitions that are contained within a single sente...
Event detection in social media is a broad and well-addressed research topic, but the characteristics and sheer volume of Twitter messages with high amounts of noise in them make it a difficult task for Twitter. Tweets reporting real-life events are usually overwhelmed by a flood of meaningless information. This paper describes the TWItter event Su...
The objective of the 2019 RANLP Multilingual Headline Generation (HG) Task is to explore some of the challenges highlighted by current state-of-the-art approaches on creating informative headlines to news articles: non-descriptive headlines, out-of-domain training data, generating headlines from long documents which are not well represented by the h...
Various Seq2Seq learning models designed for machine translation have recently been applied to the abstractive summarization task. Although these models provide high ROUGE scores, they are limited in generating comprehensive summaries with a high level of abstraction due to their degenerated attention distribution. We introduce Diverse Convolutional Seq2Seq Mod...
This paper introduces a novel perspective on unlabeled data driven technology for extractive summarization. Because unsupervised autoencoders, combined with neural network language models, help to capture deep semantic features for sentence quality, we propose to integrate autoencoders with sampling method based on Determinantal point processes (DP...
Automatic summarization is typically aimed at selecting as much information as possible from text documents using a predefined number of words. Extracting complete sentences into a summary is not an optimal way to solve this problem due to redundant information that is contained in some sentences. Removing the redundant information and compiling a...
Authorship verification is the task of determining whether a specific individual did or did not write a text, which very naturally can be reduced to the binary-classification problem. This paper deals with the authorship verification of short email messages. Hereafter, we use “message” to identify the content of the information that is transmitted...
There are organized groups that disseminate similar messages in online forums and social media; they respond to real-time events or as persistent policy, and operate with state-level or organizational funding. Identifying these groups is of vital importance for preventing distribution of sponsored propaganda and misinformation. This paper presents...
Authorship verification is the task of determining whether a specific individual did or did not write a text, which very naturally can be reduced to the binary-classification problem. This paper deals with the authorship verification of short email messages. Hereafter, we use "message" to identify the content of the information that is transmitted...
We present a novel supervised approach to sentence compression, based on classification and removal of word sequences generated from subtrees of the original sentence dependency tree. Our system may use any known classifier like Support Vector Machines or Logistic Model Tree to identify word sequences that can be removed without compromising the gr...
Extractive text summarization aims at selecting a small subset of sentences so that the contents and meaning of the original document are best preserved. In this paper we describe an unsupervised approach to extractive summarization. It combines hierarchical topic modeling (TM) with the Minimum Description Length (MDL) principle and applies them to...
Linguistic mimicry, the adoption of another’s language patterns, is a subconscious behavior with pro-social benefits. However, some professions advocate its conscious use in empathic communication. This involves mutual mimicry; effective communicators mimic their interlocutors, who also mimic them back. Since mimicry has often been studied in face-...
Automated text summarization is aimed at extracting essential information from original text and presenting it in a minimal, often predefined, number of words. In this paper, we introduce a new approach for unsupervised extractive summarization, based on the Minimum Description Length (MDL) principle, using the Krimp dataset compression algorithm...
The problem of extractive text summarization for a collection of documents is defined as the problem of selecting a small subset of sentences so that the contents and meaning of the original document set are preserved in the best possible way. In this paper we describe the linear programming-based global optimization model to rank and extract the m...
The problem of extractive text summarization for a collection of documents is defined as selecting a small subset of sentences so the contents and meaning of the original document set are preserved in the best possible way. In this paper we present a new model for the problem of extractive summarization, where we strive to obtain a summary that preserves...
The problem of extractive summarization for a collection of documents is defined as the problem of selecting a small subset of sentences so that the contents and meaning of the original document set are preserved in the extract in the best possible way. In this chapter, the authors present a linear model for the problem of extractive text summarization...
The invention relates to a multilingual method for summarizing an article, which comprises an offline stage in which a weights vector is determined using, among others, a plurality of predefined metrics, a collection of documents and expert prepared summaries, subjection of all the document sentences to all said metrics, guess of a population of weig...
The increasing trend of cross-border globalization and acculturation requires text summarization techniques to work equally well for multiple languages. However, only some of the automated summarization methods can be defined as “language-independent,” i.e., not based on any language-specific knowledge. Such methods can be used for multilingual sum...
The problem of text summarization for a collection of documents is defined as the problem of selecting a small subset of sentences so that the contents and meaning of the original document set are preserved in the best possible way. In this paper we present a linear model for the problem of text summarization, where we strive to obtain a summary th...
In this paper, we introduce DegExt, a graph-based language-independent keyphrase extractor, which extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17–24, 2008)...
The Text Analysis Conference MultiLing Pilot of 2011 posed a multi-lingual summarization task to the summarization community, aiming to quantify and measure the performance of multi-lingual, multi-document summarization systems. The task was to create a 240-250 word summary from 10 news texts, describing a given topic. The texts of each topic were...
In this paper, we introduce DegExt, a graph-based language-independent keyphrase extractor, which extends the keyword extraction method described in [6]. We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx [11] and TextRank [8]. Our experiments on a collection of benchmark summaries show that DegExt outperforms Tex...
The Text Analysis Conference MultiLing Pilot of 2011 posed a multi-lingual summarization task to the summarization community, aiming to quantify and measure the performance of multi-lingual, multi-document summarization systems. The task was to create a 240–250 word summary from 10 news texts, describing a given topic. The texts of each topic were...
Automated summarization methods can be defined as "language-independent," if they are not based on any language-specific knowledge. Such methods can be used for multilingual summarization defined by Mani (2001) as "processing several languages, with summary in the same language as input." In this paper, we introduce MUSE, a language-independent app...
In this paper, we introduce and compare two novel approaches, supervised and unsupervised, for identifying the keywords to be used in extractive summarization of text documents. Both our approaches are based on the graph-based syntactic representation of text and web documents, which enhances the traditional vector-space model by taking int...
In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present the ontology-based web content mining methodology that contains such main stages as creation of ontology for the specified domain, collecting a training set of labeled documents, building a classification...
In this paper, we deal with the problem of analyzing and classifying web documents to several major categories/classes in a given domain using domain ontology. We present the ontology-based web content mining methodology that contains such main stages as collecting a training set of labeled documents from a given domain, building a classification...
Data mining consists of finding interesting trends or patterns in large datasets, in order to guide decisions about future activities. There is a general expectation that data mining tools should be able to identify these patterns in the data with minimal user input. The patterns identified by such tools can give a data analyst useful and unexpecte...
Text summarization is the process of distilling the most important information from source/sources to produce an abridged version for a particular user/users and task/tasks. Automatically generated summaries can significantly reduce the information overload on intelligence analysts in their daily work. Moreover, automated text summarization can...