Intelligent Combination of Approaches Towards
Improved Bangla Text Summarization
Alam Khan, Sanjida Akter Ishita, Fariha Zaman, Ashiqul Islam Ashik, and Md Moinul Hoque
Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
Email: alamkhaan1997@gmail.com, sanjidaakterishita@gmail.com, fariha.aust.99@gmail.com, ashiqulislam170204070@gmail.com, moinul@aust.edu
Abstract—Text summarization is a technique to extract the main concept from a large document. It turns a big document into a smaller one without changing the main context. Text summarization is a widely researched area nowadays. There are two types of text summarization: one generates an extractive summary and the other generates an abstractive summary. In this paper, an intelligent model is proposed which can produce an extractive summary from a given document. After completing some preprocessing steps on the document, a useful combination of methods is applied, namely Named Entity-based scoring, keyword-based scoring, parts-of-speech-based scoring, and word- and sentence-based analysis, to rank the sentences of the passage. These methods combined together generate the final summary. The proposed model is compared with multiple human-made summaries, and the evaluation was performed with respect to precision, recall, and F-measure. The model is also compared with state-of-the-art approaches and shows its effectiveness with respect to the Precision (0.606) and F-measure (0.6177) evaluation measures.
Index Terms—Extractive summary, data preprocessing, TF-IDF, Sentence Scoring, Keyword Scoring, POS Tagging, Positional Value
I. Introduction
Bangla text summarization has a wide research scope, but the problem is that there is no proper dataset for Bangla text summarization that can be considered sufficient. We composed and used four datasets, where each dataset has eight categories and each category has a total of 30 passages. Each passage has 3 different human-made summaries, so we had a total of 960 passages and 2880 summaries for the evaluation process. Moreover, it has been checked that every passage is unique. We have also made a tool using Python (tkinter) to gather data and generate human-made summaries with the help of random volunteers. We have categorized the news documents of our own dataset into 8 categories: 1. Accident, 2. Bangladesh, 3. Crime, 4. Economics, 5. Entertainment, 6. International, 7. Politics, 8. Sports. We have collected the passages from online news portals, and some passages are collected by web scraping from online newspaper websites. We have checked the uniqueness of the passages, i.e., that a passage is not used in more than one document, and we have also inspected whether the passages are relevant to their categories. Some summaries are generated manually and some are generated by using our own desktop app built with the Python tkinter interface, which can generate human-made summaries very quickly and accurately. We have checked the passages and summaries manually to verify that each summary contains the central idea of the source document. A check was also performed to see whether the summary covers 40% of the source document. In addition, we have used the standard BNLP dataset [1] for comparing the effectiveness of our system with other similar systems. The next section describes the state of the art for the presented problem.
II. Related Works
As English is an international language, most research work has been done on the English language, but there is also some research on the Bangla language.
Kamal Sarkar [2] proposed a model for generating an extractive summary from a passage. Three major steps were used to construct the model, and the model finally produces a machine-generated extractive summary of a passage. Scoring of sentences was done using TF-IDF; the score S_k of a sentence k is calculated as per Equation 1:

S_k = Σ_w TF-IDF(w, k)   (1)

where the sum runs over the words w of sentence k. The summary is generated in the same sequential order as the passage. Recently, Kamal Sarkar [3] published a research article on unsupervised Bengali text summarization in which a spectral clustering algorithm is applied to identify the various topics covered in a Bengali document, and a summary is generated by selecting significant sentences from the identified topics (Rouge-1 recall score of 0.4481).
Jingzhou Liu [4] proposed a graph-based, single-document, unsupervised extractive method that constructs a distance-augmented sentence graph from a document, which enables the model to perform more fine-grained modeling of sentences and better characterize the original
document structures. The approach automatically constructs a sentence graph from each document and selects sentences for summarization based on both the similarities and the relative distances in the neighborhood of each sentence. Although the approach was novel, the quality of the generated summaries was poor.
P. Tumpa et al. [5] discussed a model for an improved extractive summarization technique. The summary is generated using the K-means clustering algorithm, which lacks synchronization.
Chandro et al. [6] discussed extraction-based summarization techniques that collaborate individual word and sentence scoring. Passages for experimentation were collected from different popular Bengali daily newspapers.
Uddin and Khan [7] presented an approach in which importance is given to sentence location, cue phrase presence, title word presence, term frequency, and numerical data. They focused on the importance of the first and last sentences of the passages and achieved an average accuracy of 71.3 percent.
Zhu et al. [8] presented a neural system combination strategy for sentence summarization using the sequence-to-sequence model and its encoder-decoder framework. In this process, bidirectional Gated Recurrent Units are used.
Mahimul Islam [9] described a hybrid Bangla text summarization model in which three methods were used to generate a summary. The top 40 percent of the original passage was selected as the generated summary based on the combined weighted score of sentiment score, keyword ranking, and text ranking. The average precision, recall, and F-measure of the evaluation are 0.57, 0.77, and 0.64, respectively.
Mallick et al. [10] presented a lexical-chain-based approach to extractive summarization using WordNet. Their F1-score (Rouge-2) was 0.490, which could have been improved if anaphora and cataphora resolution had been used in their semantic relationship detection, yielding a much better lexical chain.
Gamal et al. [11] used Chicken Swarm Optimization and a Genetic Algorithm for text summarization, with flaws such as slow convergence speed and low text summary accuracy.
Shehab Abdel-Salam proposed a BERT-based model [12] for extractive text summarization. Different sizes of BERT were considered, for instance BERT-base with 12 encoders and BERT-large with 24 encoders, but the work focused on BERT-base. Unfortunately, no hyperparameter tuning was performed to generate a better summary.
III. Proposed Model
In our model, we have added Named Entity Recognition (Banner model) based scoring and parts-of-speech based scoring, along with an improved keyword-based scoring (TF-IDF) and scoring based on sentence positions. Each of the approaches was initially assigned a seed weight value, and the weight values were then optimized to retrieve the best possible set of sentences for the final summary. We take forty percent of the sentences of the input text into our machine-generated summary. The steps of the model (Figure 1) are explained in the subsequent subsections.

Figure 1. The Proposed Model for the Bengali Text Summarization.
A. Input Pre-processing Layer
In this step, we eliminate from the input passage special characters such as '—', '!', ',' and '/'. Phrases like "Title" and blank or white spaces have also been removed in this layer.
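A minimal sketch of this cleanup step is given below; the exact character set and the handling of the "Title" marker are illustrative assumptions, since the text only lists a few example characters.

import re

def preprocess(passage):
    # Remove special characters such as '—', '!', ',' and '/' (the set is illustrative).
    text = re.sub(r"[—!,/]", " ", passage)
    # Drop the "Title" marker phrase if present.
    text = text.replace("Title", " ")
    # Collapse blank and white spaces.
    return re.sub(r"\s+", " ", text).strip()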
B. Category Labeling Layer
There are 8 categories of passages in our dataset, as mentioned in the introduction section. In this layer, the input passages are categorized and labeled as one of those eight types.
C. Task-Specific Pre-processing Layer
Every document then passes through this task-specific preprocessing layer. Some documents may need different preprocessing, depending on their category label, for the sentence-scoring step. For example, in the Named Entity Recognition-based scoring approach, we do not stem the sentences before counting the named entities in a sentence. We also removed common Bengali stop words with a tool that is available online [13] to increase the performance of our model for some of the scoring approaches, which are discussed in a later section.
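A short sketch of the stop word removal step is shown below; the stop word list here is a stand-in for the list produced by the external tool [13].

def remove_stop_words(tokens, stop_words):
    # Keep only the tokens that are not in the Bengali stop word list.
    return [tok for tok in tokens if tok not in stop_words]

# stop_words would be loaded from the list provided by the tool in [13].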
Our five approaches for sentence scoring are described in the following subsections.
1) Sentence Scoring: Sentence scoring is a scoring criterion based on a similarity-based ranking model which can be used to find the most relevant sentences of a passage. This model tokenizes each sentence of the training dataset and converts it into a sentence vector. We trained a model on a dataset of 3,248,295 sentences containing 42,466,428 words, of which 461,498 are unique. Here we used a word2vec model pre-trained on a Bengali news dataset, tuned it with our dataset, and used it to generate a summary. Each sentence is represented as a vector. After that, each vector is compared with all other vectors present in the text, and the similarity score of each pair is calculated via the cosine similarity technique.
SentenceScore = (V1 · V2) / (|V1| |V2|)   (2)
Here,
V1 = vector representation of sentence 1
V2 = vector representation of sentence 2
A sample scoring snippet using gensim is shown below (reconstructed as a sketch; the model path and the whitespace tokenization are illustrative assumptions):

import gensim
import numpy as np

# Load the word2vec model pre-trained on Bengali news text (path is illustrative).
model = gensim.models.Word2Vec.load("bengali_news_word2vec.model")

def sentence_vector(sentence):
    # Average the word vectors of the in-vocabulary tokens of the sentence.
    words = [w for w in sentence.split() if w in model.wv]
    return np.mean([model.wv[w] for w in words], axis=0)

def sentence_score(sentence_1, sentence_2):
    # Cosine similarity between the two sentence vectors (Equation 2).
    v1, v2 = sentence_vector(sentence_1), sentence_vector(sentence_2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
2) Named Entity Recognition: Named Entity Recognition (NER) is a technique to identify the named entities in a chunk of words or a sentence. Here, the main idea is to identify the persons, locations, organizations, and objects mentioned in the sentences of the passage. There are several techniques for NER tagging; here we have used the BIOES scheme for tagging the named entities.
Before NER tagging, the input is the raw Bengali sentence; after NER tagging, each token of the sentence is annotated with its BIOES entity tag. The NER score of the example sentence is 5.
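A small sketch of how the NER score of a sentence can be computed from the tagger output is given below; the (token, tag) interface is an assumption, since the exact API of the NER model is not described in the text.

def ner_score(tagged_tokens):
    # tagged_tokens: list of (token, BIOES_tag) pairs produced by the NER tagger.
    # Count every token that carries an entity tag (anything other than 'O').
    return sum(1 for _token, tag in tagged_tokens if tag != "O")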
3) Keyword Scoring: A word whose appearance makes a sentence valuable is called a keyword. A sentence with more keywords has a higher chance of being in the summary, so detecting and scoring keywords is important. For keyword scoring, we first preprocessed our dataset with the task-specific preprocessing layer, e.g., by eliminating stop words. Then we applied the TF-IDF scoring method from Kamal Sarkar's approach [14]. Here, the Term Frequency value is measured by the repetition of a particular word in a document. The IDF (Inverse Document Frequency) value is computed as log(N/df), where N = number of documents in the dataset and df = document frequency (the number of documents in which a word occurred). After calculating the values of all keywords in a sentence, we add them to find the overall score of the sentence. After scoring all the sentences, we take at most 40% of the candidate sentences with the highest TF-IDF scores from the passage to construct a summary.
As an example of the keyword-based scoring method, the TF-IDF scores of the keywords of a Bengali sentence are computed after tuning the model with our dataset and are then summed; the final score of the example sentence is 0.1037.
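A minimal sketch of the keyword (TF-IDF) scoring is given below; the tokenization and the helper names are illustrative, not the exact implementation used in the paper.

import math
from collections import Counter

def tf_idf_maps(documents):
    # documents: list of token lists. Returns one {word: TF-IDF} map per document,
    # with IDF computed as log(N / df) as described above.
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    maps = []
    for doc in documents:
        tf = Counter(doc)
        maps.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return maps

def keyword_score(sentence_tokens, tfidf_map):
    # The sentence score is the sum of the TF-IDF values of its keywords.
    return sum(tfidf_map.get(w, 0.0) for w in sentence_tokens)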
4) Sentence Scoring based on Parts of Speech Tagging: The parts of speech in a sentence show us how its words relate to each other. We have used a parts-of-speech (POS) tagging model developed by Sagor Sarker, a pre-trained CRF-based model, to detect POS tags such as nouns, pronouns, adjectives, and verbs in a sentence.
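The text does not spell out how the POS tags are converted into a sentence score, so the sketch below is only one plausible reading that counts content-word tags; the tag names and the choice of tags are assumptions.

# Assumption: content-word tags (nouns, verbs, adjectives, pronouns) contribute one
# point each; the tag names follow a generic tagset and may differ from the model's.
CONTENT_TAGS = {"NN", "NNP", "PRP", "JJ", "VB"}

def pos_score(tagged_tokens):
    # tagged_tokens: list of (token, pos_tag) pairs from the POS tagger.
    return sum(1 for _token, tag in tagged_tokens if tag in CONTENT_TAGS)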
As an example of the POS-based scoring method, the tagger is applied to a Bengali passage; after applying the POS tagger model, every word of the sentences is annotated with its part of speech (noun, pronoun, adjective, verb, and so on).
5) Scoring based on Positional Value: The position of a sentence in a large text plays a very significant role in extractive summarization. For this reason, we assigned scores to the sentences of a passage according to their positions. For example, for the news article category, we saw that the first sentence is the title of the whole news article, so the first sentence always appears in the gold summaries. For this reason, we applied the scoring method below (Equation 3):

Positional Value = sqrt(len) / line²   (3)

where len is the number of sentences (lines) in the passage and line is the position of the sentence. For a better understanding, suppose a passage has 14 lines; then
Positional value of sentence 1 = sqrt(14)/1² = 3.741657
Positional value of sentence 2 = sqrt(14)/2² = 0.935412
...
Positional value of sentence 14 = sqrt(14)/14² = 0.0190
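Equation 3 can be computed directly, as in the short sketch below.

import math

def positional_values(num_sentences):
    # Positional Value = sqrt(len) / line^2 (Equation 3), for line = 1 .. len.
    return [math.sqrt(num_sentences) / (line ** 2)
            for line in range(1, num_sentences + 1)]

# positional_values(14)[0]  -> 3.7416...   (sentence 1)
# positional_values(14)[13] -> 0.0190...   (sentence 14)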
6) Final Score Generation for a Sentence: After getting candidate summary texts from each of the five above-mentioned approaches, we multiply the scores a sentence obtains from the different approaches by different weight values, and the final score of a candidate sentence is thus calculated (Equation 4). Finally, the top 40% of unique sentences are merged to form the final summary.

Final Score of a candidate sentence =
(W1 * Keyword Score) + (W2 * NER Score)
+ (W3 * POS Tagger Score) + (W4 * Sentence Score)
+ (W5 * Positional Score)
(4)
Here,
W1= Optimized weight of Keyword Scoring
W2= Optimized weight of Named Entity Recognition
Scoring
W3= Optimized weight of POS tagger Scoring
W4= Optimized weight of Sentence Scoring
W5= Optimized weight of Positional Scoring
7) Weight Optimization: Initially, we assumed W1 = W2 = W3 = W4 = W5 = 1, and we then repeatedly applied a regression approach to increase the weight values (W1 to W5) until the summary generation score started to decline on the candidate test set. After several iterations, we found the optimized weights to be W1 = 2, W2 = 1, W3 = 6, W4 = 1, W5 = 7 for our dataset, and we obtained the best result with these weights.
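A sketch of the final combination (Equation 4) with these optimized weights is shown below; the dictionary layout of the per-approach scores is an illustrative assumption.

# Optimized weights from the paper: W1=2, W2=1, W3=6, W4=1, W5=7.
WEIGHTS = {"keyword": 2, "ner": 1, "pos": 6, "sentence": 1, "positional": 7}

def final_score(scores):
    # scores: per-approach scores of one sentence, keyed like WEIGHTS (Equation 4).
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def select_summary(sentences, per_sentence_scores, ratio=0.4):
    # Keep the top 40% of unique sentences ranked by their final score.
    ranked = sorted(zip(sentences, per_sentence_scores),
                    key=lambda pair: final_score(pair[1]), reverse=True)
    keep = max(1, int(len(sentences) * ratio))
    return [sentence for sentence, _ in ranked[:keep]]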
IV. Results and Discussion
Our approach to finding human-like summaries by machine is tested with supporting experiments. To do the experiments, we split our dataset into 4 parts. For each part, we report our experimental results, and we then combine the results of all the experiments.
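The evaluation below reports precision, recall, and F-measure for Rouge-1, Rouge-2, and Rouge-L. As a reference for how these numbers relate, the following sketch computes the unigram (Rouge-1) case by hand; it is not the exact evaluation script used for the tables.

from collections import Counter

def rouge1(candidate_tokens, reference_tokens):
    # Unigram overlap between a generated summary and a human reference summary.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    precision = overlap / max(len(candidate_tokens), 1)
    recall = overlap / max(len(reference_tokens), 1)
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure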
A. Evaluation of Named Entity Recognition based summary generation
Table I shows the Rouge-1 scores of the generated summaries compared with the gold-standard summaries when Sagor Sarker's CRF-based NER identifier is used. Table II shows the results based on Banner's BERT-based NER identifier.
Table I
Rouge-1 scores based on Sagor Sarker's CRF-based NER identifier only
Model F-Measure Precision Recall
Haque [15] 0.6166 0.5757 0.6819
Kamal Sarkar [16] 0.5496 0.5603 0.5515
Mahimul Islam [9] 0.6487 0.5658 0.7745
Our Model 0.5854 0.6153 0.5730
Table II
Rouge-1 Scores based on Banner’s BERT-based NER
identifier
Model F-Measure Precision Recall
Haque [15] 0.6166 0.5757 0.6819
Kamal Sarkar [16] 0.5496 0.5603 0.5515
Mahimul Islam [9] 0.6487 0.5658 0.7745
Our Model 0.6645 0.6561 0.6849
From the results, we can see that Banner's BERT-based NER identifier performed better than Sagor Sarker's CRF-based NER identifier [17]. We have therefore selected Banner's model for this purpose. Note that this comparison is based on the BNLPC dataset.
B. Evaluation of the Parts of Speech Tagging Model
In this section, we compare two models and figure out which model fits best for our dataset. The two models that we tuned on our dataset are given below:
1. Bengali Parts of Speech tagger by Sagor Sarker [18].
2. Bengali Parts of Speech tagger by ashwoolford/bnltk
[19]
The comparison of these two POS-tagging models (based on Rouge-2) on our dataset is shown in Table III and Table IV.
Table III
Rouge-2 score of the POS tagger based on Sagor Sarker's model
Model F-Measure Precision Recall
Haque [15] 0.5830 0.5459 0.6433
Kamal Sarkar [16] 0.5060 0.5165 0.5075
Mahimul Islam [9] 0.5777 0.4958 0.7065
Our Model 0.6067 0.6011 0.6253
After analyzing the results, we selected the Bengali POS tagger by ashwoolford/bnltk, which performed comparatively better.
Table IV
Rouge-2 score based on Bengali POS tagger by
ashwoolford/bnltk
Model F-Measure Precision Recall
Haque [15] 0.5830 0.5459 0.6433
Kamal Sarkar [16] 0.5060 0.5165 0.5075
Mahimul Islam [9] 0.5777 0.4958 0.7065
Our Model 0.6254 0.5672 0.7138
C. Evaluation of category-based Keyword Scoring
Here we first used our developed dataset to train a category-based TF-IDF model. Then, for each keyword in the corresponding category, we calculate the score of the keywords appearing in a sentence based on their TF-IDF values, and the score of a sentence is the summation of all keyword scores inside that sentence. The top 40 percent of sentences are selected according to this score. The comparison of our system with others, based on the BNLPC dataset, is shown in the following tables (Table V and Table VI).
Table V
Rouge-1 score for the Keyword-based scoring
Model F-Measure Precision Recall
Haque [15] 0.6166 0.5757 0.6819
Kamal Sarkar [16] 0.5496 0.5603 0.5515
Mahimul Islam [9] 0.6487 0.5658 0.7745
Our Model 0.6467 0.6593 0.6517
Table VI
Rouge-2 score for the Keyword-based scoring
Model F-Measure Precision Recall
Haque [15] 0.5830 0.5459 0.6433
Kamal Sarkar [16] 0.5060 0.5165 0.5075
Mahimul Islam [9] 0.5777 0.4958 0.7065
Our Model 0.5815 0.5891 0.5907
D. Model Combination
Our hybrid model combines NER-based scoring, POS-tagging-based scoring, keyword-based sentence scoring, word2vec-based sentence scoring, and the positional value of the sentences to generate the final summary. This combined approach produces the best results.
E. Comparing our combined model with the existing models
based on the BNLPC dataset
The following tables (Table VII and Table VIII) show the Precision, Recall, and F-measure values of our combined system and of the existing models. Our model has better precision and better F-measure values than the other existing methods.
F. Performance of our model based on category-specific passages
The performance of our model based on category-specific passages was assessed with the dataset split 60%
Table VII
Rouge-1 Score of the Combined approach
Model F-Measure Precision Recall
Haque [15] 0.6166 0.5757 0.6819
Kamal Sarkar [16] 0.5496 0.5603 0.5515
Mahimul Islam [9] 0.6487 0.5658 0.7745
Our Model 0.6760 0.6665 0.6975
Table VIII
Rouge-2 Score of the Combined Approach
Model F-Measure Precision Recall
Haque [15] 0.5830 0.5459 0.6433
Kamal Sarkar [16] 0.5060 0.5165 0.5075
Mahimul Islam [9] 0.5777 0.4958 0.7065
Our Model 0.6177 0.6069 0.6411
for training and 40% for testing. Here, we have used our own dataset only. The results are shown in Table IX (Rouge-1 score), Table X (Rouge-2 score), and Table XI (Rouge-L score).
Table IX
Rouge-1 Score for Category-specific passages
Category F-Measure Precision Recall
Accident 0.7214 0.7050 0.7466
Bangladesh 0.7202 0.6955 0.7554
Crime 0.7180 0.6932 0.7517
Economics 0.6946 0.6842 0.7132
Entertainment 0.7019 0.6894 0.7231
International 0.6526 0.6419 0.6725
Politics 0.7279 0.7156 0.7467
Sports 0.6755 0.6622 0.6962
Combined Passages 0.7015 0.6859 0.7257
Table X
Rouge-2 Score for Category Specific passages
Category F-Measure Precision Recall
Accident 0.6559 0.6337 0.6874
Bangladesh 0.6534 0.6282 0.6892
Crime 0.6529 0.6267 0.6877
Economics 0.6208 0.6059 0.6434
Entertainment 0.6387 0.6265 0.6605
International 0.5791 0.5675 0.5981
Politics 0.6671 0.6551 0.6858
Sports 0.6056 0.5910 0.6275
Combined Passages 0.6342 0.6168 0.6600
G. Performance of our combined approach based on the
BNLPC Dataset
To assess this, we split the BNLPC dataset at a 70%/30% train-test ratio; the summary results are given in Table XII.
In the future, we shall build a model that can perform Bengali sentiment analysis on news-related data and incorporate it into the scoring mechanism. We shall also try to use the Dynamic TF-IDF model proposed by Oleg Barabash et al. [20].
Table XI
Rouge-L Score for Category Specific passages
Category F-Measure Precision Recall
Accident 0.7064 0.6907 0.7306
Bangladesh 0.7041 0.6800 0.7384
Crime 0.7049 0.6808 0.7376
Economics 0.6732 0.6634 0.6907
Entertainment 0.6881 0.6759 0.7090
International 0.6322 0.6221 0.6512
Politics 0.7150 0.7029 0.7335
Sports 0.6591 0.6463 0.6790
Combined Passages 0.6854 0.6703 0.7087
Table XII
Rouge-1, Rouge-2, and Rouge-L scores based on the BNLPC dataset
Metric Dataset F-Measure Precision Recall
Rouge-1 BNLPC-1 0.7065 0.6891 0.7341
BNLPC-2 0.6456 0.6439 0.6609
Rouge-2 BNLPC-1 0.6470 0.6277 0.6781
BNLPC-2 0.5884 0.5861 0.6042
Rouge-L BNLPC-1 0.6931 0.6758 0.7202
BNLPC-2 0.6305 0.6294 0.6443
V. Conclusion and Future Works
In this work, our focus was to create a system that generates an extractive summary the way a human being would. As there were not enough datasets in the Bangla language, we built an enriched dataset with unique, categorized passages. Different approaches were intelligently combined for scoring sentences. We measured the performance of our model by comparing its summaries with the human-generated ones. To compare the similarities, we used standard evaluation measures, namely precision, recall, and F-measure for Rouge-1, Rouge-2, and Rouge-L. The score of a sentence could be further improved by adding a text-sentiment-based scoring approach. Analyzing the human-made summaries, we found that sentences with a negative sentiment get more priority in human-generated summaries of the Accident and Crime categories, whereas in the Entertainment and Sports categories, sentences with a positive sentiment appear more often in the gold summaries. In the future, we can add this approach to our overall model. Moreover, we are also working on increasing the dataset with summaries from textbooks and on applying deep learning approaches to see if the summarization performance can be further improved.
Acknowledgement
The authors of this paper acknowledge the help and support of Md. Ashraful Haque at the different stages of this research. His insights and encouragement were invaluable in completing this work.
References
[1] “Bengali natural language processing (BNLP),” https://bnlp.readthedocs.io/en/latest/, accessed July 22, 2020.
[2] K. Sarkar, “Bengali text summarization by sentence extrac-
tion,” arXiv preprint arXiv:1201.2240, 2012.
[3] S. Roychowdhury, K. Sarkar, and A. Maji, “Unsupervised ben-
gali text summarization using sentence embedding and spectral
clustering,” in Proceedings of the 19th International Conference
on Natural Language Processing (ICON), 2022, pp. 337–346.
[4] J. Liu, D. J. Hughes, and Y. Yang, “Unsupervised extractive
text summarization with distance-augmented sentence graphs,”
in 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, 2021, pp. 2313–2317.
[5] P. Tumpa, S. Yeasmin, A. Nitu, M. Uddin, M. Afjal, and
M. Mamun, “An improved extractive summarization technique
for bengali text (s),” in 2018 International Conference on
Computer, Communication, Chemical, Material and Electronic
Engineering (IC4ME2). IEEE, 2018, pp. 1–4.
[6] P. Chandro, M. F. H. Arif, M. M. Rahman, M. S. Siddik, M. S.
Rahman, and M. A. Rahman, “Automated bengali document
summarization by collaborating individual word & sentence
scoring,” in 2018 21st International Conference of Computer
and Information Technology (ICCIT). IEEE, 2018, pp. 1–6.
[7] M. N. Uddin and S. A. Khan, “A study on text summarization
techniques and implement few of them for bangla language,” in
2007 10th international conference on computer and informa-
tion technology. IEEE, 2007, pp. 1–4.
[8] J. Zhu, L. Zhou, H. Li, J. Zhang, Y. Zhou, and C. Zong,
“Augmenting neural sentence summarization through extractive
summarization,” in National CCF Conference on Natural Lan-
guage Processing and Chinese Computing, 2017, pp. 16–28.
[9] M. Islam, F. N. Majumdar, A. Galib, and M. M. Hoque, “Hybrid
text summarizer for bangla document,” Int J Comput Vis Sig
Process, vol. 1, no. 1, pp. 27–38, 2020.
[10] C. Mallick, M. Dutta, A. K. Das, A. Sarkar, and A. K. Das,
“Extractive summarization of a document using lexical chains,”
pp. 825–836, 2019.
[11] M. Gamal, A. Elsawy, and A. Abu El Atta, “Hybrid algorithm
based on chicken swarm optimization and genetic algorithm
for text summarization,” International Journal of Intelligent
Engineering and Systems, vol. 14, pp. 319–131, 05 2021.
[12] S. Abdel-Salam and A. Rafea, “Performance study on extractive
text summarization using bert models,” Information, vol. 13,
no. 2, p. 67, 2022.
[13] R. U. Haque, M. Mridha, M. Hamid, M. Abdullah-Al-Wadud,
M. Islam et al., “Bengali stop word and phrase detection mech-
anism,” Arabian Journal for Science and Engineering, vol. 45,
no. 4, pp. 3355–3368, 2020.
[14] K. Sarkar, “An approach to summarizing bengali news docu-
ments,” in proceedings of the International Conference on Ad-
vances in Computing, Communications and Informatics, 2012,
pp. 857–862.
[15] P. S. Haque, M.M and Z. Begum, “An innovative approach of
bangla text summarization by introducing pronoun replacement
and improved sentence ranking,” Journal of Information Pro-
cessing Systems, vol. 13:4, pp. 752–777, 2017.
[16] K. Sarkar, “A keyphrase-based approach to text summarization for english and bengali documents,” International Journal of Technology Diffusion (IJTD), vol. 5, no. 2, pp. 28–38, 2014.
[17] S. Kamal, “NER model,” https://github.com/sagorbrur/bnlp/blob/master/model/bn_ner.pkl, 2020, accessed June 22, 2022.
[18] C. Lehmann, “The nature of parts of speech,” STUF-Language
Typology and Universals, vol. 66, no. 2, pp. 141–177, 2013.
[19] S. Kamal, “POS model,” https://github.com/sagorbrur/bnlp/blob/master/model/bn_pos.pkl, 2020, accessed July 22, 2020.
[20] O. Barabash, O. Laptiev, O. Kovtun, O. Leshchenko,
K. Dukhnovska, and A. Biehun, “The method dynavic tf-
idf,” International Journal of Emerging Trends in Engineering
Research, vol. 8:9, pp. 5712–5718, 2020.