Conference Paper

Intelligent Combination of Approaches Towards Improved Bangla Text Summarization

... ROUGE: ROUGE was introduced in 2004. ROUGE is a set of evaluation metrics used in NLP and text summarization to measure the quality and similarity of generated summaries to reference summaries [20]. ROUGE-1 measures the overlap of unigrams, ROUGE-2 measures the overlap of bigrams, and ROUGE-L considers the longest common subsequence. ...
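The three metrics named in the excerpt above can be illustrated with a minimal sketch. It assumes whitespace tokenization and reports recall only; the function names are hypothetical, and published ROUGE implementations typically also report precision and F-scores and apply their own preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter intersection takes the minimum count
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref) if ref else 0.0
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of six reference unigrams and three of five reference bigrams are matched, and the longest common subsequence has five tokens.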
Conference Paper
In today's fast-paced world, readers expect information quickly, and the internet spreads news almost instantly. Not all of that news is important, however. News summarization helps by providing a short version of each story, so that readers can easily decide which stories they want to read. There are two main types of summarization: abstractive text summarization and extractive text summarization. Abstractive summarization is considerably more complex than extractive summarization. This study proposes a model that first generates extractive summaries and then uses them as input for generating abstractive summaries. The model uses the Bengali Text Summarization (BenSumm) model for extractive summarization and the Bangla Text-to-Text Transfer Transformer (BanglaT5) for abstractive summarization. The research also compares summaries obtained directly from the BanglaT5 model with summaries obtained via the proposed model. In this research, abstractive summarization of Bengali is performed using the Text-to-Text Transfer Transformer (T5). Although abstractive summarization of Bengali has been attempted over the years with a variety of techniques, the use of T5 for this task is recent, and a wide range of opportunities remains to be explored. The study has achieved promising results.
Article
Summarization methods fall into two categories, extractive and abstractive. Extractive summarization selects the salient sentences from the original document to form a summary, while abstractive summarization interprets the original document and generates the summary in its own words. Both kinds of summarization have been studied with different approaches in the literature, including statistical, graph-based, and deep learning-based approaches. Deep learning has achieved promising performance in comparison to the classical approaches, and with the advancement of neural architectures such as the attention network (commonly known as the transformer), there are potential areas of improvement for the summarization task. The introduction of the transformer architecture and its encoder model “BERT” improved performance on downstream NLP tasks. BERT (Bidirectional Encoder Representations from Transformers) is modeled as a stack of transformer encoders. BERT comes in different sizes, such as BERT-base with 12 encoders and BERT-large with 24 encoders, but we focus on BERT-base for the purpose of this study. The objective of this paper is to study the performance of variants of BERT-based models on text summarization through a series of experiments, and to propose “SqueezeBERTSum”, a summarization model fine-tuned with the SqueezeBERT encoder variant, which achieved competitive ROUGE scores, retaining 98% of the BERTSum baseline model's performance with 49% fewer trainable parameters.
Article
The massive daily flow of information requires automated summarization methods to extract the most important information. Manual summarization of large text documents is complicated and time-intensive for human beings. Many methods rely on sentence scoring, which assigns a rating to each input sentence; the higher-rated sentences are then used as part of the summary. In extractive automated text summarization, locating the relevant sentences is an essential problem for researchers, and evolutionary algorithms have been applied as a solution. This paper presents a hybrid approach (CSOGA) for text summarization that combines the effectiveness and convergence behavior of chicken swarm optimization (CSO) and a genetic algorithm (GA) in order to reach an optimal solution. The proposed algorithms are evaluated on the standard CNN / Daily Mail dataset and measured by the Recall-Oriented Understudy for Gisting Evaluation (ROUGE). The performance of the proposed method is then compared with other methods. The results show that the new hybrid approach (CSOGA) has the best performance on text summarization quality. The proposed method achieved better accuracy than the other algorithms on ROUGE-1, ROUGE-2 and ROUGE-L. Compared with the highest accuracy of the other extractive models, the proposed method improved ROUGE-1 by 4.4%, ROUGE-2 by 12.01%, and ROUGE-L by 9.8%.
Article
Automatic text summarization is needed to concisely extract, from a large text, a small subset of sentences that are more significant than the other sentences in the text. Although there have been many approaches to English text summarization, very little work has been done on automatic Bengali text summarization. For evaluation purposes, a dataset was built from scratch with Bengali news documents from two reputed newspapers. The evaluation dataset was classified into four different classes, with benchmark summary texts generated by a group of random human contributors for each of the documents. The current work presents a hybrid approach to summarizing Bengali text documents. The hybrid model is introduced with the goal of improving the overall accuracy of summary generation. The proposed model generates a summary based on keyword scoring, sentiment analysis, and the interconnection of sentences. Evaluated on this dataset, the proposed system achieves an average Recall of 0.77, Precision of 0.57, and F-measure of 0.64. Empirical comparison with other similar systems shows that the proposed model can be used as an alternative system to address the text summarization problem for Bengali documents.
Article
Though plenty of research has been done on stop word and stop phrase detection, no work has been done on Bengali stop words and stop phrases. This research introduces a definition and classification of Bengali stop words and phrases and implements two approaches to identify them. The first is a corpus-based approach, while the second is based on a finite-state automaton. The performance of both approaches is measured and compared. Result analysis shows that the corpus-based method outperforms the finite-state automaton-based method. The corpus-based and finite-state automaton-based methods show 90% and 80% accuracy, respectively, for stop word detection, and 80% and 70% accuracy, respectively, for stop phrase detection.
Conference Paper
Nowadays, efficient access to information in text documents with a high degree of semantic information has become more difficult due to the diversity of vocabulary and the rapid growth of the Internet. Traditional text clustering algorithms are widely used to organize a large text document into smaller, manageable groups of sentences, but they do not consider the semantic relationships among the words present in the document. Lexical chains try to identify cohesion links between words by identifying their semantic relationships. They link words in a document that are thought to describe the same concept in order to gather information. This method of text summarization helps to process the linguistic features of the document, which are otherwise ignored in statistical summarization approaches. In this paper, we propose a text summarization technique that constructs lexical chains and defines a coherence metric to select the summary sentences.
Conference Paper
At present, text summarization has become an important tool for users to retrieve the required information quickly. Many extractive text summarization techniques have been developed for English text. However, few works address Bengali text summarization. In this paper, an improved extractive Bengali text summarization technique is proposed that enhances the word scoring process, the position value heuristics and the summary procedure of the existing summarizer. In the word scoring technique, each word is preprocessed using noise removal, tokenization, stop word removal and stemming. Then, a heuristic is proposed that computes the word score by checking the word in all the input documents. Moreover, a modified heuristic is proposed for sentence scoring, which gives the highest priority to the middle sentence and progressively less emphasis to the sentences above and below it. Finally, the top k sentences are extracted from each cluster of sentences and sorted in the order of their appearance in the original document(s). Thus, the final summary is synchronized with the original document(s). The experimental results show that the proposed technique produces better summaries than the preceding method.
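The position heuristic and the final reordering step described above might be sketched as follows. This is only an illustration under stated assumptions: the linear decay in `position_weight`, the function names, and the precomputed `word_scores` dictionary are hypothetical, and the clustering step from the abstract is omitted.

```python
def position_weight(index, n_sentences):
    """Weight 1.0 for the middle sentence, decaying linearly towards both ends.

    The linear decay is an assumption; the abstract only states that the
    middle sentence receives the highest priority.
    """
    mid = (n_sentences - 1) / 2
    return 1.0 - abs(index - mid) / (mid + 1)


def top_k_in_document_order(sentences, word_scores, k):
    """Pick the k best-scoring sentences and return them in original order."""
    scored = []
    for i, sentence in enumerate(sentences):
        base = sum(word_scores.get(word, 0.0) for word in sentence.split())
        scored.append((base * position_weight(i, len(sentences)), i))
    best = sorted(scored, reverse=True)[:k]
    # Re-sort by index so the summary follows the source document.
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```

Sorting the selected sentences by their original index is what keeps the summary aligned with the source text, as the abstract describes.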
Article
Categories of parts of speech have both semantic and structural aspects. The two sets of features are essentially incommensurate, since the semantic features derive from the functions of language in communication and cognition, while the structural features are essentially based in the combinatorial potential of signs in a text. Consequently, the two sets of features are largely independent of each other. Their combination in a language yields sets of parts of speech whose systematicity is largely language-internal. To the extent that there is a functional motivation for parts of speech, three restrictions must be made: 1) It is not, in the first place, a cognitive, but rather a communicative motivation. 2) The functional motivation of word classes is not direct, but mediated by semantic and syntactic categories of higher order. 3) Only the primary parts of speech (verb and noun) are motivated in this way. The secondary parts of speech (adjectives, adverbs etc.) and the minor parts of speech (pronouns, subordinators etc.) increasingly have a system-internal structural rather than a universal functional motivation. Given these heterogeneous functions and constraints, there is no uniform nature to all parts of speech.
Article
Text summarization is the process of producing an abstract or a summary by selecting a significant portion of the information from one or more texts. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter, less redundant extract or abstract of the original text(s). Many techniques have been developed for summarizing English text(s), but very few attempts have been made at Bengali text summarization. This paper presents a method for Bengali text summarization which extracts important sentences from a Bengali document to produce a summary.
Article
This paper proposes an automatic method to summarize Bangla news documents. In the proposed approach, pronoun replacement is performed for the first time to minimize dangling pronouns in the summary. After pronoun replacement, sentences are ranked using term frequency, sentence frequency, numerical figures and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be presented both in words and in digits, in a variety of forms. All these forms are identified to assess the importance of sentences. We have used a rule-based system in this approach, together with a hidden Markov model and a Markov chain model. To derive the rules, we analyzed 3,000 Bangla news documents and studied some Bangla grammar books. A series of experiments is performed on 200 Bangla news documents and 600 summaries (3 summaries for each document). The evaluation results demonstrate the effectiveness of the proposed technique over the four latest methods.
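The redundancy rule described above (drop the shorter of two sentences whose cosine similarity is at least 60%) can be sketched roughly as below. The bag-of-words vectors, whitespace tokenization, and function names are assumptions for illustration; the sketch does not model the frequency boost given to the longer sentence or any of the paper's other rules.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def drop_redundant(sentences, threshold=0.6):
    """Keep longer sentences first; skip any sentence too similar to a kept one."""
    by_length = sorted(
        range(len(sentences)), key=lambda i: -len(sentences[i].split())
    )
    kept = []
    for i in by_length:
        if all(cosine(sentences[i], sentences[j]) < threshold for j in kept):
            kept.append(i)
    return [sentences[i] for i in sorted(kept)]  # restore document order
```

Visiting sentences from longest to shortest means that, within any similar pair, the longer sentence is the one retained.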
Article
With the rapid growth of the World Wide Web, information overload is becoming a problem for an increasingly large number of people. Since summarization helps humans digest the main contents of a text document very rapidly, there is a need for an effective and powerful tool that can automatically summarize text. In this paper, we present a keyphrase-based approach to single-document summarization that first extracts a set of keyphrases from a document, then uses the extracted keyphrases to choose sentences from the document, and finally forms an extractive summary from the chosen sentences. We view keyphrases (single or multi-word) as the important concepts, and we assume that an extractive summary of a document is an elaboration of the important concepts contained in the document to some permissible extent, controlled by the given summary length. We have tested our proposed keyphrase-based summarization approach on two different datasets: one for English and another for Bengali. The experimental results show that the performance of the proposed system is comparable to some state-of-the-art summarization systems.
Article
This paper describes a system that produces extractive summaries of Bengali news documents. The ultimate objective of the produced summaries is to help readers determine whether they would be interested in reading a particular document. To this end, the summary aims to give the reader an idea of the theme of a document without revealing the in-depth detail. The approach presented here has four major steps: (1) preprocessing, (2) extraction of candidate summary sentences, (3) ranking of the candidate summary sentences, and (4) summary generation. The proposed approach defines the TF*IDF, position and sentence length features in a more effective way, which helps to improve summarization performance. The experimental results show that the proposed text summarization approach outperforms the lead baseline and a more sophisticated baseline that uses both TF*IDF and position features.
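A ranking step that combines TF*IDF, position and sentence length features could look like the sketch below. It is not the paper's formulation: the smoothed IDF, the `1 / (i + 1)` position bonus, the `min_len` cutoff, the function names, and the `background_docs` argument (a list of token sets standing in for a reference corpus) are all assumptions.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, background_docs):
    """TF*IDF weight per word; IDF is smoothed so unseen words stay finite."""
    tf = Counter(doc_tokens)
    n = len(background_docs)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in background_docs if word in doc)
        weights[word] = count * math.log((n + 1) / (df + 1))
    return weights


def rank_sentences(sentences, background_docs, min_len=3):
    """Return sentence indices ordered from most to least important."""
    tokens = [s.split() for s in sentences]
    weights = tfidf_weights([w for t in tokens for w in t], background_docs)
    scored = []
    for i, toks in enumerate(tokens):
        if len(toks) < min_len:  # length feature: skip very short sentences
            continue
        tfidf_score = sum(weights[w] for w in toks)
        position_score = 1.0 / (i + 1)  # earlier sentences get a larger bonus
        scored.append((tfidf_score + position_score, i))
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

A summary would then be formed by taking the first few indices and emitting the corresponding sentences in document order.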
Conference Paper
Text summarization is a technique that automatically creates an abstract or summary of a text. The technique has been under development for many years, so a survey of different summarization techniques has been carried out. No work in this area has been done for the Bangla language. This paper presents a text summarizer for Bangla, which uses some extraction methods for text summarization.
Unsupervised Bengali text summarization using sentence embedding and spectral clustering
  • S Roychowdhury
  • K Sarkar
  • A Maji
The method dynavic tf-idf
  • O Barabash
  • O Laptiev
  • O Kovtun
  • O Leshchenko
  • K Dukhnovska
  • A Biehun
Hybrid text summarizer for Bangla document
  • M Islam
  • F N Majumdar
  • A Galib
  • M M Hoque