Figure - uploaded by Farzana Kabir Ahmad
Sample of text data scores

Source publication
Article
Analyzing and extracting important information from a text document is crucial and has generated interest in the areas of text mining and information retrieval. This process is used to notice particular content in the text. Furthermore, readers tend to read almost everything in text documents to find some specific in...

Contexts in source publication

Context 1
... term has its own rank and its own TF, IDF, and TF-IDF scores. Table 4 shows the score and rank of each term. Later, each term's score is calculated using the TF-IDF formula. ...
Context 2
... each term's score is calculated using the TF-IDF formula. In Table 4, the highest score is shown in the second-to-last column, and the last column shows the rank. ...
Context 3
... 1 shows the highest rank (highest TF-IDF score), while the lowest score holds the largest rank value. The comparison of the TF-IDF scores with the TF and IDF scores is shown in Table 4. From the rank column, the term with the highest frequency score is taken as the subject. ...
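The scoring and ranking described in these contexts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the excerpts do not give the exact formulas, so the sketch assumes TF is a term's relative frequency over the whole collection and IDF is log10(N / document frequency). The function name `tf_idf_ranking` and the tie-breaking rule (alphabetical order) are hypothetical.

```python
import math
from collections import Counter


def tf_idf_ranking(documents):
    """Return (term, tf, idf, tf_idf, rank) rows; rank 1 is the highest TF-IDF score.

    Assumed definitions (not confirmed by the source):
      tf  = occurrences of the term / total tokens in the collection
      idf = log10(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count every occurrence of each term across the collection.
    counts = Counter(tok for doc in tokenized for tok in doc)
    total_tokens = sum(counts.values())

    # Count the number of documents in which each term appears at least once.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

    rows = []
    for term, count in counts.items():
        tf = count / total_tokens
        idf = math.log10(n_docs / doc_freq[term])
        rows.append((term, tf, idf, tf * idf))

    # Highest score first; ties are broken alphabetically (an assumption).
    rows.sort(key=lambda row: (-row[3], row[0]))
    return [row + (rank,) for rank, row in enumerate(rows, start=1)]


if __name__ == "__main__":
    # Toy documents, invented purely for illustration.
    toy_docs = ["mercy mercy light", "light guidance", "light path"]
    for term, tf, idf, score, rank in tf_idf_ranking(toy_docs):
        print(f"{rank}  {term:<10} tf={tf:.3f} idf={idf:.3f} tf-idf={score:.3f}")
```

Under these assumptions, a term that occurs in every document gets an IDF of zero and therefore the largest rank value, which matches the description that rank 1 corresponds to the highest TF-IDF score.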

Similar publications

Article
Text mining is a rapidly growing field in computer science that is used to extract meaningful information from text data. This information can be used for various applications, such as categorizing research abstracts based on their content. This study focuses on the use of text mining techniques. The goal was to determine which algorithm was more a...
Preprint
Information silos have been an oft-maligned feature of scientific research for introducing a bias towards knowledge that is produced within a scientist's own community. The vastness of the scientific literature has been commonly blamed for this phenomenon, despite recent improvements in information retrieval and text mining. Its actual negative imp...
Article
To alleviate the impact of fake news on our society, predicting the popularity of fake news posts on social media is a crucial problem worthy of study. However, most related studies on fake news emphasize detection only. In this paper, we focus on the issue of fake news influence prediction, i.e., inferring how popular a fake news post might become...
Article
Patent Entity and Relation Extraction (PERE) aims to extract entities and entity-relation triples from unstructured patent texts. PERE is one of the fundamental tasks in patent text mining, providing crucial technical support for patent retrieval and technology opportunity discovery. Previous works struggle to capture the implicit semantic informat...
Article
Text information extraction is an important natural language processing (NLP) task, which aims to automatically identify, extract, and represent information from text. In this context, event extraction plays a relevant role, allowing actions, agents, objects, places, and time periods to be identified and represented. The extracted information can b...

Citations

... It is the most critical phase after data gathering. The pre-processing techniques are adopted from the study conducted by [26]. In addition, the research in [27] is considered, as it briefly examines the pre-processing, feature extraction, and other aspects of the English translation of Quranic verses. ...
... In addition, to avoid the same word being treated as distinct terms, the characters are transformed into lowercase. This step is crucial because it is required to verify the precise number of repeated terms in the verses of the Holy Quran, specifically in the English translation [26]. A phrase is made up of words, and tokenization divides a sentence into a list of terms that may be used to reconstruct it. ...
... The pre-processing procedures for the English translation dataset of the Holy Quran are based on the study conducted by [26]. The best results for the topic modeling of this dataset are acquired when the number of topics is set to 7 and alpha is set to auto. ...
Article
This study aims to assess the effectiveness of topic modeling in the English translation of the Holy Quran. Topic modeling is a popular text mining technique for uncovering latent semantic patterns in the collection of textual documents and helps to annotate the documents based on these topics. This study identifies the most significant topics in each document as well as grasping an understanding of the topic distribution throughout the document sets. Different steps are performed to acquire the dominant topics in each document and identify the distribution of topics across documents. In this context, the present research work chose to employ Latent Dirichlet Allocation as an unsupervised approach for topic modeling since there is no requirement for a training phase as hidden topics can be discovered throughout the topic modeling process. For this, the word cloud is generated to understand and interpret the results after pre-processing. A dictionary and corpus are created to extract the features from the dataset using the Bag of Words approach. The results are evaluated by calculating the perplexity and coherence score, where high coherence indicates the goodness of well-structured topic models and low perplexity score indicates the correctness of prediction made by the topic models. Lastly, the visualization step is performed.
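The pre-processing and Bag of Words steps mentioned in this abstract and its citation contexts (lowercasing, tokenization, building a dictionary and a corpus) can be sketched in plain Python. This is an illustrative sketch only: the excerpts do not name the library the authors used, and the helper names `tokenize`, `build_dictionary`, and `to_bow` as well as the sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_dictionary(tokenized_docs):
    """Map each distinct term to an integer id, assigned in first-seen order."""
    dictionary = {}
    for doc in tokenized_docs:
        for token in doc:
            # Assign the next free id only if the token has not been seen yet.
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def to_bow(tokens, dictionary):
    """Convert one tokenized document into sorted (term_id, count) pairs."""
    counts = Counter(dictionary[token] for token in tokens)
    return sorted(counts.items())


if __name__ == "__main__":
    # Toy sentences, invented purely for illustration.
    docs = ["The River meets the sea", "The sea is wide"]
    tokenized = [tokenize(doc) for doc in docs]
    dictionary = build_dictionary(tokenized)
    corpus = [to_bow(tokens, dictionary) for tokens in tokenized]
    print(dictionary)
    print(corpus)
```

Lowercasing before counting is what lets "The" and "the" map to the same dictionary id, so repeated terms are counted together. A corpus in this form is the usual input to a topic model such as Latent Dirichlet Allocation.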
... Various open source NLP packages, such as SpaCy and NLTK, are used for their built-in general functions for text preprocessing tasks (e.g., tokenization, stemming, and lemmatization) and for NER analysis and clustering. Some guiding principles for the implementation of tailored text processing algorithms based on standard components are adopted from the subject identification method proposed by (Jamil et al. 2017). ...
Article
The constantly growing body of global environmental legislation necessitates that corporate environmental compliance managers frequently assess the relevance of new regulations and regulation revisions for each of their sites. Companies are pressured to streamline and automate this crucial task through digital workflows and specialized IT-based assistance systems. This has recently piqued the interest of researchers working in different disciplines, such as intelligent systems, machine learning, and natural language processing. The article describes the latest results of our long-term research program on IT-based support for corporate compliance management, offering insights for these and other disciplines. The context and the main aspects of environmental regulation announcements and the relevance assessment task are analyzed. An extensive conceptual data model is developed that serves as a foundation for tailoring a generic method to perform a relevance assessment that considers site-specific individual environmental compliance facts. The method uses heuristic data operations and various text processing techniques from the field of natural language understanding. To exemplify the method, two application scenarios are described in which the relevance of new waste management directives is assessed for a multi-site production company.
... The field of Holy Quran study has witnessed quite a number of research works. These include: text classification applications of the Holy Quran [1][2][3][4][5][6]; ontology-based applications [7][8][9][10]; digitized Holy Quran applications [11][12][13][14]. Furthermore, the techniques that have been widely applied to text classification problems include the Bayes probabilistic approach [15], decision trees [16], neural networks [17], support vector machines [18], and k-nearest neighbor [19]. ...
... The input verses are classified into three predefined labels: "iman, ibadah, and akhlak". These class labels are from the most fundamental aspects of Islam [1,3]. ...
Article
Feature selection (FS) is an integral phase in text classification problems. It is primarily applied in preprocessing text data prior to labeling. However, FS techniques have some limitations. Filter-based FS techniques have the drawback of lower accuracy, while wrapper-based techniques are computationally expensive. In this paper, a two-step FS method is presented. In the first step, a chi-square (CH) filter-based technique is used to reduce the dimensionality of the feature set; in the second step, a wrapper correlation-based (CFS) technique is employed to further select the most relevant features from the reduced feature set. The ultimate aim is to reduce the computational runtime while achieving high classification accuracy. The proposed method was then applied to label instances of the input data (Quranic verses) using standard classifiers: naïve Bayes (NB), support vector machine (SVM), and decision trees (J48). The results show that the proposed method achieved an accuracy of 93.6% in 4.17 seconds.
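The first, filter-based step of such a method can be illustrated with the standard chi-square statistic for a term and a class, computed from a 2x2 contingency table of document counts. This sketch covers only the chi-square filter, not the CFS step, and is not the authors' implementation; the function names `chi_square` and `select_top_k`, the toy documents, and the labels "pos" and "neg" are all hypothetical.

```python
def chi_square(term, target_class, docs, labels):
    """Chi-square score of a term for one class, from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target_class
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or the class never varies; score it 0.
    return 0.0 if denominator == 0 else n * (a * d - b * c) ** 2 / denominator


def select_top_k(docs, labels, target_class, k):
    """Keep the k terms with the highest chi-square score (ties: alphabetical)."""
    vocabulary = sorted({token for tokens in docs for token in tokens})
    ranked = sorted(
        vocabulary,
        key=lambda term: (-chi_square(term, target_class, docs, labels), term),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy documents as token sets, invented purely for illustration.
    docs = [{"alpha", "gamma"}, {"alpha", "delta"}, {"beta", "epsilon"}, {"beta", "zeta"}]
    labels = ["pos", "pos", "neg", "neg"]
    print(select_top_k(docs, labels, "pos", k=2))
```

A filter like this scores each term independently of any classifier, which is why it is cheap to run. In the two-step method described above, the reduced term set would then be passed to the second selection step.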
... The field of Holy Quran study has witnessed quite a number of research works, among which are: text classification applications on the Holy Quran [1], [5], [6], [7], [8], [9]; ontology-based applications [10], [11], [12], [13]; digitized Holy Quran applications [14], [15], [16], [17]. The techniques that have been widely applied to text classification problems include the Bayes probabilistic approach [4], decision trees [18], neural networks [19], support vector machines [20], and k-nearest neighbour [21]. ...
Article
Text classification is primarily applied in document labeling. However, the major setbacks of existing feature selection (FS) techniques are the high computational runtime associated with wrapper-based FS techniques and the low classification accuracy associated with filter-based FS techniques. In this paper, a hybrid feature selection technique is proposed. The proposed hybrid technique is a combination of the filter-based information gain (IG) and wrapper-based CFS algorithms. The specific purpose of this combination is to achieve high classification accuracy (associated with wrappers) at lower computational runtime (associated with filters). The proposed IG-CFS technique is then applied to label Quranic verses of al-Baqara and al-Anaam from two major references, the English translation and commentary (tafsir). StringToWordVector with the weighted TF-IDF method was used for preprocessing the textual data, and four classifiers were tested: naïve Bayes, libSVM, k-NN, and decision trees (J48). The overall highest classification accuracy of 94.5% was achieved with a runtime of 3.89 seconds using the proposed IG-CFS technique.
... Several studies focused on classifying Al-Quran verses in Arabic [4]- [6]. Jamil et al. [7] proposed the use of term frequency in subject identification in order to classify text groups into a certain subject. Their dataset for the experiment consisted of 286 Al-Quran verses, and the verses contain 16 keywords. ...
Article
In Islam, the Quran is the holy book that was revealed to the Prophet Muhammad. It functions as a complete code of life for Muslims. It contains the words of Allah, more than 77,000 in total, which were passed down to mankind through the Prophet Muhammad over 23 years starting in 610 CE. The Quran is divided into 114 chapters, and its original text is in Arabic. Muslims across the world need to find the meaning of the Quran in order to understand its content. Understanding the Quran is of interest not only to Muslims but also to millions of people of other faiths. Over the generations, Muslim scholars have disseminated a great deal of content related to the Quran in the form of tafsirs, translations, and books of hadith. The current problem is that most Muslims in Malaysia do not understand the sentences of the Quran because of the language barrier. The purpose of this research is to classify the topic of each verse of the Quran based on its specific theme. It involves the objectives of text mining, which are based on linguistic information and domain. The use of a corpus helps to perform various data mining tasks, including information extraction, text categorization, concept relationships, association discovery, and pattern evaluation and assessment. This research project aims to create a computing environment that enables text mining of the Quran. The classification experiment uses the Support Vector Machine to find themes in Juz’ Baqarah. The SVM performance is then compared against other classification algorithms such as Naive Bayes, J48 Decision Tree, and K-Nearest Neighbours. This research project aims at creating an enabling computational environment for text mining the Qur’an and at helping users understand every verse in Juz’ Baqarah. © 2018 Institute of Advanced Engineering and Science All rights reserved.
... Jamil et al. [40]: Subject identification method based on term frequency technique. They proposed a technique for subject identification. It may be helpful in identifying the subject of groups of text. ...
... Jamil et al. [14] proposed a subject identification method based on term frequency to categorize groups of text into specified subjects. The dataset for the experimental work comprises 224 verses of the Holy Quran, with the verses containing 16 chosen keywords related to females. ...
Chapter
Most existing feature selection approaches are limited to determining features from a single source of data. In this paper, a feature selection approach is proposed that considers multiple sources of textual data. The proposed GBFS approach is then applied to label Quranic verses based on two major references, the English translation and tafsir (Commentary). The verses were selected from two chapters, Surah Al-Baqarah and Surah Al-Anaam. The verses are classified into three categories: Faith, Worship, and Etiquette. The textual data from the translation and commentary were preprocessed using StringToWordVector with weighted TF-IDF. The feature selection algorithms information gain, chi square, Pearson correlation coefficient, relief, and correlation-based were tested on four classifiers: naïve Bayes, libSVM, k-NN, and decision trees (J48). The proposed group-based feature selection approach has shown promising results in terms of Accuracy and Area under the Receiver Operating Characteristics (ROC) curve (AUC), achieving an Accuracy of 94.5% and an AUC of 0.944.
Conference Paper
Real-world datasets are usually imbalanced; the number of samples in their classes is not equal. Classifying these datasets makes classifiers pay more attention to the classes with more samples than to the classes with fewer samples. The Qur'anic dataset can be considered an imbalanced dataset because the numbers of verses for the Qur'anic topics are not equal. Many studies have been performed to classify Qur'anic text using different classifiers. However, few studies have classified the Qur'anic verses based on Imbalanced Learning (IL). So, this work aims to classify the Qur'anic text using the Ensemble methods Boosting and Bagging. The base classifiers of these methods were LibSVM, Naïve Bayes, KNN, and J48. Three techniques are applied in this paper based on the standard classifiers: implementing the base classifiers alone, implementing these classifiers with the Boosting method, and implementing the classifiers with the Bagging method. The results showed that Qur'anic classification performance improved when the ensemble methods were applied to the imbalanced Qur'anic verses with the standard classifiers.
Article
Imbalanced Learning (IL) is considered a special case of text classification. It is applied to classify imbalanced classes, that is, classes that are not equal in the number of samples. Many studies have classified Quranic text using different classification methods. However, no study has classified the Quranic topics based on Imbalanced Learning. So, this paper aims to apply the concept of IL to assign corresponding topics to the Quranic verses according to their contents. In this paper, two Quranic datasets have been classified consecutively using Imbalanced Learning: the first dataset consists of Unification of God "Tawheed" and Polytheism of God "Shirk" verses, and the second dataset consists of Meccan and Medinan chapters. Imbalanced Classification is applied here since these topics have imbalanced classes, which cannot be classified correctly by traditional methods. The results showed that applying Imbalanced Classification produces better outcomes than those obtained without Imbalanced Classification techniques.