The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
Abstract
Text mining tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. In addition to providing an in-depth examination of core text mining and link detection algorithms and operations, this book examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches. Finally, it explores current real-world, mission-critical applications of text mining and link detection in such varied fields as M&A business intelligence, genomics research and counter-terrorism activities.
... Using a variety of NLP and statistical analysis techniques, text mining transforms unstructured data into knowledge. With the exponential growth of textual data on the internet and in corporate environments, this approach has become essential in several areas, including healthcare, finance, and public administration (MINER et al., 2012; FELDMAN; SANGER, 2007). A fundamental component of the text mining process is data pre-processing. ...
Indicators play an essential role in the effective administration of resources and the monitoring of public policies. In the state of Mato Grosso, the Strategic Planning Management Program aims to improve public governance through the adoption of standardized indicators. This study explores the use of text mining and clustering methods to analyze 160 indicators distributed across 10 strategic dimensions: Learning and Knowledge; Economic Development; Education; Sports, Culture, and Leisure; Organizational Structure; Fiscal; Infrastructure, Basic Sanitation, and Environment; Societal Satisfaction; Health; and Social Vulnerability. Using Ward's method for clustering and the cosine distance metric, the indicators were grouped into two clusters formed by textual characteristics. The indicators of the Education dimension fell mostly into the first cluster, while the remaining indicators were aggregated in the second cluster. The results of this work are useful to public managers in strategic decision making, promoting improvements in municipal public policies.
... Text mining is the process of exploring and analyzing data in text form with the goal of identifying concepts, patterns, keywords, and other attributes. Text mining is the process of mining data in the form of text, where the data source is usually obtained from documents, and the goal is to find words that can represent the content of those documents so that the relationships between documents can be analyzed [3]. The preprocessing stages in the text mining process are, in general: first tokenizing, second filtering, third stemming, fourth tagging, and finally analyzing [4]. ...
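The first three preprocessing stages listed above (tokenizing, filtering, stemming) can be sketched in plain Python. This is a minimal illustration, not the pipeline used in the cited work: the stop-word list is a small placeholder and the stemmer is a crude suffix stripper, far simpler than a real algorithm such as Porter's.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real filtering stage would use a full one.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    """Tokenizing: split lowercased text into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(tokens):
    """Filtering: drop stop words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Stemming: crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run tokenizing -> filtering -> stemming, then count the terms."""
    return Counter(stem(t) for t in filter_tokens(tokenize(text)))

counts = preprocess("The documents contain words that represent the documents")
```

The resulting term counts are what later stages (tagging, analyzing) would operate on.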
Social media are technologies that allow the sharing or exchange of information, ideas, interests, etc., via virtual communities and networks. Social media are often used for chatting, whether in private chats or in comments on posts; the most frequently used applications in Indonesia include Facebook, WhatsApp, Instagram, and Twitter. People habitually type abbreviations instead of full words, which can cause misunderstanding for others. A descriptive qualitative method was used to collect data. Text mining is a data science technique that mines data in the form of text and looks for words that can represent or analyze the content of a document; a network of terms can be built as a graph to reveal interactions between words in a document. In this study, applying data preprocessing in the text mining process is expected to remove unnecessary words or text, making it easier to identify abbreviations within captions or comments.
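The "network of terms" idea described here can be sketched as a co-occurrence graph: words become nodes, and an edge's weight counts the documents in which two words appear together. The toy captions below are invented for illustration ("btw" stands in for an abbreviation under study); this is a minimal sketch, not the cited study's method.

```python
from collections import defaultdict
from itertools import combinations

def term_network(documents):
    """Build an undirected term co-occurrence graph: nodes are words and
    an edge's weight counts the documents containing both words."""
    edges = defaultdict(int)
    for doc in documents:
        # Sort the unique words so each unordered pair has one canonical key.
        words = sorted(set(doc.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges

# Hypothetical captions/comments, for illustration only.
docs = ["btw see you tomorrow", "btw thanks", "see you soon"]
graph = term_network(docs)
```

Heavily weighted edges around an abbreviation show which words it habitually interacts with, which is the kind of signal the study uses to interpret shorthand in captions and comments.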
... A key element of the digital transformation in sports marketing is the ability to collect and analyze user data, using tools such as machine learning and artificial intelligence to extract insights into fan preferences (Feldman & Sanger, 2007; He, 2013). The ability to segment and personalize interactions has opened new opportunities to improve fan engagement and adapt communication strategies to different cultural and demographic preferences (Pine & Gilmore, 1999). ...
... Thanks to AI, sports organizations can collect and analyze large volumes of data to understand the preferences and behaviors of their fans. Using tools such as machine learning and predictive analytics, teams can segment their audience by variables such as age, content preferences, and interaction frequency, tailoring the fan experience in an increasingly personalized way (Feldman & Sanger, 2007; Davenport & Ronanki, 2018). ...
... Following techniques proposed by Feldman and Sanger [3] and adapted by Min and Kim [4], NLP preprocessing was conducted using spaCy, NLTK, and scikit-learn. These included text tokenization, lemmatization, and part-of-speech tagging. ...
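The three preprocessing steps named here (tokenization, lemmatization, part-of-speech tagging) can be illustrated without the spaCy/NLTK machinery, using a toy lexicon. The entries below are hand-picked stand-ins: a real pipeline would supply lemmas and tags for the whole vocabulary rather than a four-word dictionary.

```python
# Toy lexicon standing in for a trained NLP pipeline; each surface form
# maps to (lemma, part-of-speech). Entries are illustrative only.
LEXICON = {
    "conducted": ("conduct", "VERB"),
    "studies":   ("study",   "NOUN"),
    "were":      ("be",      "AUX"),
    "three":     ("three",   "NUM"),
}

def analyze(sentence):
    """Tokenize on whitespace, then attach lemma and POS from the lexicon.
    Unknown tokens fall back to themselves with the placeholder tag 'X'."""
    tokens = sentence.lower().split()
    return [(t, *LEXICON.get(t, (t, "X"))) for t in tokens]

rows = analyze("Three studies were conducted")
```

Each row is a (token, lemma, tag) triple, the same shape of annotation spaCy's `Doc` objects expose per token.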
Exit interviews are an underutilized but critical tool for capturing organizational feedback, yet traditional analysis methods often fail to generate meaningful insights. This study investigates the application of artificial intelligence (specifically natural language processing, sentiment analysis, and topic modeling) to interpret qualitative exit interview data within SAP SuccessFactors. Using a mixed-methods design and data extracted from a large multinational enterprise over an 18-month period, the research reveals latent patterns in attrition reasons, identifies hidden organizational issues, and proposes actionable insights for HR leadership. Results demonstrate that AI-enhanced exit analytics uncover unstructured feedback trends more reliably than manual reviews, with significantly higher accuracy in detecting dissatisfaction themes. This paper contributes to social science research by positioning exit interviews as institutional diagnostic tools, offering a predictive lens into workforce behavior. The study concludes by recommending an integrative model for AI-powered offboarding intelligence that can be replicated across enterprise HR platforms.
... Moreover, large-scale, data-driven methods are underutilized for dynamic policy impact assessment. Text mining, a computational technique for extracting meaningful patterns from unstructured texts, has gained increasing attention in policy research [40]. Scholars have shown great interest in large-scale policy documents, including government policy texts, legal records, policy news and media data [41][42][43]. ...
Background
Language policy serves as an essential tool for governments to guide and regulate language development. However, China’s current language policy faces challenges like outdated analytical methods, inefficiencies caused by policy misalignment, and the absence of predictive frameworks. This study provides a comprehensive overview of China’s language policy by identifying key topics and predicting future trends.
Methods
We employ the Latent Dirichlet Allocation topic model and the Autoregressive Integrated Moving Average model to systematically analyze and predict the evolution of China's language policy. By gathering a large-scale textual dataset of 1,420 policy texts from 2001–2023 from official websites, we achieve both topic extraction and evolution prediction.
Results
This study reveals that: (1) Language life, language education, and language resources have high popularity indexes, and language education and language planning exhibit high expected values. (2) The theme intensity of most topics has shown a significant upward trend since 2014, with significant fluctuations during T1-T2. (3) From 2001 to 2023, the actual and fitted values show an overall positive trend. In 2024–2028, the predicted value of language resources stabilizes after a brief decline in 2024, while other topics show upward trends.
Conclusions
This study extracts 1,420 policy texts from official websites and outlines the following findings: (1) Language policies focus on maintaining a harmonious linguistic environment, addressing educational inequality, and protecting language resources. (2) Since 2014, most topics have exhibited a fluctuating yet sustained growth trend, particularly in language education and research. (3) Except for language resources, the predicted values of the remaining six topics will show a growing trend from 2024 to 2028. Based on these findings, we propose policy recommendations such as strengthening language research, developing a multilingual education system, and optimizing language resource management.
... We used the NLTK (Natural Language Toolkit) library [134], which includes a dictionary of common English stop words to remove (e.g., the, in, a, an) and lemmatize the corpus to reduce the dimensionality of the dataset. The lemma of a word includes its base form plus inflected forms [69,231]. For example, the words models , modeled and modeling have model as their lemma. ...
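The models/modeled/modeling example above can be reproduced with a minimal dictionary-backed lemmatizer. This is a sketch of the idea only: NLTK's WordNetLemmatizer consults the full WordNet lexicon, whereas the vocabulary here contains a single hand-chosen base form.

```python
def lemmatize(word, vocabulary=frozenset({"model"})):
    """Reduce an inflected form to its lemma by stripping common suffixes
    until a known base form is found; otherwise return the word unchanged.
    The one-entry vocabulary is an illustrative stand-in for WordNet."""
    if word in vocabulary:
        return word
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in vocabulary:
            return word[: -len(suffix)]
    return word

lemmas = [lemmatize(w) for w in ("models", "modeled", "modeling")]
```

Collapsing all three inflected forms onto the single lemma "model" is exactly the dimensionality reduction the passage describes.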
Conceptual modeling is an important part of information systems development and use that involves identifying and representing relevant aspects of reality. Although the past decades have experienced continuous digitalization of services and products that impact business and society, conceptual modeling efforts are still required to support new technologies as they emerge. This paper surveys research on conceptual modeling over the past five decades and shows how its topics and trends continue to evolve to accommodate emerging technologies, while remaining grounded in basic constructs. We survey over 5,300 papers that address conceptual modeling topics from the 1970s to the present, which are collected from 35 multidisciplinary journals and conferences, and use them as the basis from which to analyze the progression of conceptual modeling. The important role that conceptual modeling should play in our evolving digital world is discussed, and future research directions proposed.
... Computational methods have greatly improved bibliometric analyses, enabling the effective utilization of the vast reservoir of scientific knowledge present in literature. Text mining is a computational text analysis tool that employs methods from data mining, machine learning, natural language processing, information retrieval, and knowledge management to extract patterns and valuable insights from unstructured text (Aggarwal, 2018;Feldman & Sanger, 2006). Text mining has recently been applied in the fields of ecology and evolutionary biology for tasks such as investigating research patterns and subjects, synthesizing evidence and conducting literature reviews, enlarging datasets based on literature, and extracting as well as incorporating primary biodiversity information (Farrell et al., 2022). ...
Understanding the trajectory of microbial biotechnology research is essential for identifying novel processes, techniques, and applications to enhance the efficiency and sustainability of bioeconomic activities. This paper provides a comprehensive overview of global research on microbial applications in forestry-related industries to elucidate key research themes and trends within this domain. Through topic modeling of publications on microbial applications in wood and wood-based products, we identified 14 distinct topics from a dataset of 805 abstracts containing 152,265 terms. A continuing surge of research was found, particularly on microbial enzymes employed primarily in pulp and paper production. There was also a rising publication trend related to microbe applications in bioenergy and agarwood, reflecting an increasing interest in diversifying forest-based bioeconomy. Most scientific publications originated from major producers and traders of forest-based products. To advance bioeconomic objectives, it is critical to foster increased collaborative research on microbe-based technologies within the forestry industry.
... Econometric models (typically multivariate linear regression) are used due to their simplicity and the low signal-to-noise ratio of the data [14]. Yet economic and financial data have evolved beyond infrequent macro and low-dimensional market information (prices, volumes, market trades) to include alternative data: unstructured textual information [4,8], voice recordings, news articles, social media posts, and satellite images [25]. ...
Institutional investors are looking for opportunities that will have a positive social and environmental impact as well as positive returns. Compared to the largely speculative trading in equity markets and financial markets, this is a significant change. Investing with impact requires a long-term perspective. Impact investments are becoming increasingly popular as governments expect institutional investors to contribute to the achievement of UN sustainable development goals by 2030. Several characteristics are included in these goals, including environmental sustainability, social inclusion, integration, competitiveness, and resilience. In this paper, new AI methods are presented that utilize network theory, complex fitness dynamics of networks, and machine learning techniques for sourcing investments more effectively and forecasting their likely impact more accurately. The paper discusses ethical considerations and safeguards that should be observed when deploying artificial intelligence for impact investment.
... They also perform unguided data clustering. Gene expression data can be processed by unsupervised clustering algorithms to discover groups of patients who share identical molecular characteristics (Feldman & Sanger, 2006). Latent variable models help discover gene co-expression modules, which consist of genes that probably interact together or correspond to the same biological mechanisms and pathways (Hancock & Zvelebil, 2006). ...
Medical research is being fundamentally transformed by Artificial Intelligence (AI), which particularly optimizes the analysis of neurodegenerative disorders such as Alzheimer's disease (AD), Parkinson's disease (PD), and Multiple Sclerosis (MS). Through machine learning (ML) and deep learning (DL) algorithms, AI systems can process and analyze large, intricate datasets containing medical images, genetic data, speech patterns, and clinical files. These technological tools give healthcare professionals opportunities to detect diseases at their onset, achieve precise diagnoses, estimate disease progression, and craft individualized therapies. AI-driven analysis of nervous system scans, genetic information, and cognitive screening tests helps identify early symptoms of cognitive decline and predict how mild cognitive impairment (MCI) progresses to dementia. In PD, AI models recognize preclinical markers, such as early nocturnal breathing problems and motor control issues, before symptoms emerge clinically. In MS, fast diagnosis and long-term monitoring are possible because AI combines MRI evaluation with fluid biomarker evaluation. AI models also facilitate biomarker research and enable medical staff to create decision support systems for evaluating therapeutic outcomes, forecasting results, and adapting individual patient treatments. Disease evaluation and patient stratification become more effective through machine learning approaches that use supervised, unsupervised, and reinforcement learning strategies. Neurodegenerative disorder patients benefit from AI through improved diagnostic accuracy and
... Unstructured data refers to information that lacks a specific structure. Examples include bitmap images, text, customer records, and product lists that are not part of a database [30]. Emails, while stored in a database, are also considered unstructured data due to their text-based, unorganized format. ...
The accurate and efficient classification of network traffic, including malicious traffic, is essential for effective network management, cybersecurity, and resource optimization. However, traffic classification methods in modern, complex, and dynamic networks face significant challenges, particularly at the network edge, where resources are limited and issues such as privacy concerns and concept drift arise. Condensation techniques offer a solution by reducing the data size, simplifying complex models, and transferring knowledge from traffic data. This paper explores data and knowledge condensation methods—such as coreset selection, data compression, knowledge distillation, and dataset distillation—within the context of traffic classification tasks. It clarifies the relationship between these techniques and network traffic classification, introducing each method and its typical applications. This paper also outlines potential scenarios for applying each condensation technique, highlighting the associated challenges and open research issues. To the best of our knowledge, this is the first comprehensive summary of condensation techniques specifically tailored for network traffic classification tasks.
... Text mining, also called computational linguistics or natural language processing, is defined as a process that uses algorithms and methods from the fields of statistics and machine learning to find meaningful patterns in text data (Hotho et al., 2005). Specifically, text mining makes it possible to extract valuable information, look for relationships, and detect patterns using computer techniques (Feldman and Sanger, 2007; Gentzkow et al., 2019; Senave et al., 2023). Gentzkow et al. (2019) draw attention to the relevance of textual data (also in economics) and note that computational linguistics in this field will become increasingly important. ...
The purpose of the article. Sustainability development issues, particularly Environmental, Social, and Governance (ESG) factors, are becoming increasingly relevant in corporate reporting, driven by rising environmental awareness and regulatory requirements. The aim of the study is to evaluate the ESG disclosures of selected Polish banks. The paper fills a gap in the Polish academic literature in economics and finance by analyzing the volume and size of non-financial ESG disclosures through computer text mining techniques. Methodology. This study applies text mining techniques to evaluate ESG volume and size in 107 financial reports issued by Polish banks between 2006 and 2023. For this purpose, selected tools for computer-based analysis of textual data (text mining) are used. The primary methods include sentiment (emotional attitude) analysis, analysis of the number of ESG-related words, and analysis of the readability of the ESG content contained in company reports. Results of the research. The study reveals that ESG excerpts are more neutral or less optimistic compared to integrated reports, which tend to have a more positive tone. Additionally, sustainability disclosures are written in complex language, and the volume of these reports has been increasing over time, likely due to new regulations and growing awareness of sustainability issues. The study focuses on Polish banks but suggests expanding future research to other sectors.
... Naïve Bayes: The Naïve Bayes classifier is a probabilistic and supervised learning algorithm based on the Bayes' theorem commonly used to solve classification problems. Feldman and Sanger [12] refer to the Naïve Bayes classifier as an "independent feature model" because it is based on the assumption that each feature contributes independently and equally to the classification outcome. This assumption has a positive impact on computation speed. ...
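The "independent feature model" described here can be made concrete with a minimal multinomial Naive Bayes classifier. The training documents below are invented toy data; a production system would use a library implementation such as scikit-learn's MultinomialNB rather than this sketch.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Multinomial Naive Bayes training: record class priors and
    per-class word counts from (word-list, label) pairs."""
    priors, word_counts, totals = Counter(), defaultdict(Counter), Counter()
    for words, label in labeled_docs:
        priors[label] += 1
        word_counts[label].update(words)
        totals[label] += len(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, totals, vocab

def predict(model, words):
    """Score each class as log P(class) + sum of log P(word | class).
    The plain sum over words embodies the 'naive' independence assumption,
    and it is also why scoring is fast."""
    priors, word_counts, totals, vocab = model
    n = sum(priors.values())
    best, best_score = None, -math.inf
    for label in priors:
        score = math.log(priors[label] / n)
        for w in words:
            # Laplace smoothing keeps unseen words from zeroing out a class.
            score += math.log((word_counts[label][w] + 1)
                              / (totals[label] + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical labeled feedback, for illustration only.
docs = [(["great", "fast", "accurate"], "pos"),
        (["slow", "buggy"], "neg"),
        (["great", "helpful"], "pos")]
model = train(docs)
label = predict(model, ["great", "accurate"])
```

Because each word contributes one additive term, training and prediction are both linear in the number of tokens, which is the computation-speed benefit the passage notes.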
Despite the average of 500 million tweets per day, little research has been conducted to categorize tweets and sentiment polarity so that tweets can be analysed based on user preferences. The objective of this paper is two-fold. The first is to perform comparative experiments for Tweet topic classification using six machine learning algorithms: Random Forest, K-Nearest Neighbours, Naive Bayes, Logistic Regression, Decision Tree and Support Vector Machine. The second is to investigate the impact of sentiment polarity information in the Tweets data on the topic classification experiment. The model performance is evaluated based on sensitivity, specificity, precision, false positive rate and accuracy. The experimental results showed that the Support Vector Machine (SVM) produced the highest accuracy of 84% in topic classification. After embedding sentiment polarity into the dataset, the accuracy of the topic classification model continued to improve to 93%. In the future, these results can be further enhanced through ensemble machine learning algorithms and potential semantic as well as pragmatic features.
... Text mining (TM) is the third prominently used QDA in the selected papers. Often called "automatic content analysis", TM is the process of extracting useful information from unstructured written data such as chat messages, text files, emails, and HTML files to discover useful trends, rules, patterns, or models (Feldman & Sanger, 2007). Accelerating QDA processes by reducing the time and resources needed to process qualitative data, TM can be very advantageous for qualitative researchers (Roberts et al., 2014). ...
... Advancements in technology have facilitated the integration of computational techniques like Text Mining (TM) to extract GOs automatically, offering a unique advantage in creating visual representations tailored to specific texts (Feldman and Sanger, 2007). TM involves extracting meaningful patterns and knowledge from large amounts of textual data, combining techniques from data mining, machine learning, natural language processing, and statistics. ...
The aesthetics of reading have received relatively little research attention, particularly in the context of foreign language readers. In this study, we investigate the impact of text mining-powered graphic organizers (GOs) on aesthetic reading experience with English as a foreign language (EFL) readers. Shusterman's framework of aesthetics was applied to measure reading comprehension, experience, and literary beauty perception. A between-group experiment design ( N = 52) was conducted, where Norwegian students enrolled in the International Baccalaureate classes of Lillestrøm High School were recruited. Participants in the experimental condition interacted with GOs before reading the first three chapters of English versions of Pride & Prejudice , while those in the control condition solely read the same texts without interacting with GOs. A statistically significant enhancement in comprehension scores across all subdomains —summarization, vocabulary, and overall comprehension—was associated with the use of GOs. However, the introduction of GOs did not improve or hinder the reading experience or the perceived literary beauty of the text. These findings highlight the efficacy of automatically extracted GOs in improving specific aspects of the aesthetic reading experience. The implications of such findings for individual domains of reading aesthetics and foreign language readers are discussed.
... Within the realm of text mining of online user-generated data, Latent Dirichlet Allocation (LDA) topic analysis is a noteworthy method for text information mining and processing. The LDA model is an algorithm widely used for text topic analysis, capable of revealing hidden thematic structures within textual data, integrating a suite of analytical techniques such as word categorization, degree centrality, and frequency analysis (Feldman & Sanger, 2007) for document classification, information retrieval, and recommendation systems (Choi & Lee, 2020). The analysis enables keyword extraction, hot topic analysis, topic evolution, and thematic cluster analysis of the mined textual information (Zou et al., 2022a). ...
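LDA's mechanics can be sketched with a tiny collapsed Gibbs sampler: every word token carries a topic label, and labels are resampled from the current (document, topic) and (topic, word) counts. This is a bare-bones illustration on an invented four-comment corpus, not the model or data of the cited study, and real analyses would use a tuned library implementation (e.g. gensim or scikit-learn).

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * n_topics for _ in docs]   # per-document topic counts
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            t = rng.randrange(n_topics)
            z.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(z)
    # Resample each token's topic from the counts with itself removed.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + beta * vocab_size)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
    return doc_topic, topic_word

# Hypothetical sports-shoe comments reduced to keyword lists.
docs = [["shoe", "size", "fit"], ["shoe", "price", "discount"],
        ["size", "fit", "return"], ["price", "discount", "sale"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

The top-weighted words per topic in `topic_word` are the "thematic elements" the passage describes; on this toy corpus the sampler tends to separate fit-related from price-related vocabulary.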
Digital technologies represented by AR (Augmented Reality), VR (Virtual Reality), and digital twins, along with the expansion of metaverse platforms and digital marketing concepts, have attracted the attention of numerous sports fashion product consumers and brands, particularly in the category of sports shoes. Therefore, in the context of digital technologies, understanding the factors that affect consumer experience and preferences in the online purchasing process of sports shoes is very important. This study employs Latent Dirichlet Allocation topic analysis to analyze 44,110 online user posts and comments from social platforms, extracting thematic elements of consumer experience needs for purchasing sports shoes online. The information obtained is further encoded and designed into a questionnaire, which is then utilized alongside the Kano model to analyze the overall preferences of consumer experience needs. The results indicate that webpage design and basic product information are considered Must-be attributes for user experience needs; providing information on after-sales service policies and product comments, products' special feature information, and online size testing are recognized as Performance attributes. Additionally, high-tech interaction methods, visual presentation, personalized customization, virtual try-on, apparel matching recommendations, and dressing scenario recommendations are identified as Attractive attributes. The study reveals that in the context of new digital technology development, the online shopping experience for sports shoes is enhanced across four dimensions: platform experience augmentation, product experience augmentation, user demand augmentation, and interactive experience augmentation. These four dimensions collectively constitute the holistic experience design for the online retail platform.
Therefore, this research provides case references and theoretical insights for researchers and developers in the fields of brand marketing, experience design, and product service innovation.
... For example, Han et al. (2012) describe data mining as involving "data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation." Similarly, Feldman and Sanger (2006) highlight that text mining aims to "extract useful information from data sources through the identification and exploration of interesting patterns." ...
Training generative AI models requires extensive amounts of data. A common practice is to collect such data through web scraping. Yet, much of what has been and is collected is copyright protected. Its use may be copyright infringement. In the USA, AI developers rely on "fair use" and in Europe, the prevailing view is that the exception for "Text and Data Mining" (TDM) applies. In a recent interdisciplinary tandem-study, we have argued in detail that this is actually not the case because generative AI training fundamentally differs from TDM. In this article, we share our main findings and the implications for both public and corporate research on generative models. We further discuss how the phenomenon of training data memorization leads to copyright issues independently from the "fair use" and TDM exceptions. Finally, we outline how the ISMIR could contribute to the ongoing discussion about fair practices with respect to generative AI that satisfy all stakeholders.
... Text mining is a technique used to extract valuable information from unstructured text. The information extraction process includes steps such as pre-processing, tokenization, stemming, and stop word removal [14]. Feature extraction involves representing text in a form usable by text classification algorithms, such as the Bag of Words (BoW) model and the term frequency-inverse document frequency (TF-IDF) method [12]. ...
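The TF-IDF representation named above can be sketched in a few lines: a term's weight in a document is its frequency there, scaled down by how many documents contain it. The three toy feedback documents are invented for illustration; libraries such as scikit-learn's TfidfVectorizer add normalization and smoothing variants beyond this basic form.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term as tf * log(N / df): term frequency within the
    document times the inverse document frequency across the corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf})
    return weights

# Hypothetical tokenized feedback comments.
docs = [["good", "service"], ["bad", "service"], ["good", "trainer"]]
w = tf_idf(docs)
```

Terms appearing in many documents (like "service") receive low weights, while distinctive terms (like "bad") score higher, which is what makes TF-IDF vectors useful features for the classifiers compared in the study below.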
Comments in questionnaire feedback carry sentiment meanings, such as positive, negative, or neutral. Each review comment on training services requires prompt and accurate follow-up to improve service quality. However, sentiment classification often demands significant time and effort to determine sentiments accurately. This study aims to enhance efficiency and accuracy in sentiment classification for training questionnaires. A comparative analysis was conducted using three algorithms: Naïve Bayes, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM). The results indicate that SVM is the fastest and most accurate algorithm, with a training time of 3.067 seconds, 100 milliseconds faster than Naïve Bayes and 45.8 seconds faster than KNN. SVM achieved an accuracy of 60.81%, with an average sensitivity of 61%, specificity of 80%, and precision of 63%. Subsequently, this study integrated the Optimized Weight Evolutionary method to enhance SVM's accuracy and address attribute selection. Testing results showed a 2.16% improvement in SVM accuracy, bringing it to 63.10%. The training process was conducted on a dataset of 1,153 comments, with 90% of the data used for algorithm training. The combination of SVM and Optimized Weight Evolutionary proved effective in achieving more accurate sentiment classification. This study provides new insights into the application of sentiment classification, particularly for training feedback. Optimizing the algorithm can help training companies respond more effectively to comments and improve overall service quality.
... The analysis of open-ended responses from Google Forms employed text exploration techniques, specifically focusing on word frequency analysis. This process entailed quantifying frequently utilized words and phrases to discern prevailing preferences or sentiments, as suggested by Feldman and Sanger (2007). The results were presented through a word cloud to enhance readability and comprehension. ...
... In the research field of digital consumer behavior, artificial intelligence and data science methods such as deep learning (Chapman and Feit 2019;Feldman and Sanger 2006;Igual Muñoz and Seguí Mesquida 2017) are increasingly used alongside classical statistical methods to identify relevant information in user data and to derive behavioral patterns. However, such methods are typically available only to companies and data scientists; there is a lack of usable solutions for consumers that enable automated analysis for different fields of application (Fischer et al. 2016). ...
... Research depth refers to the sophistication of TM-NLP techniques, which encompasses the complexity of the algorithms used, the extent of contextual understanding they provide, and their ability to produce nuanced insights (Blei et al., 2003a;Jurafsky, 2000;Manning et al., 2008). The level of research depth is critical because it determines how effectively these tools can go beyond surface-level analysis to uncover intricate details within SRs (Feldman & Sanger, 2007;Miner et al., 2012). For instance, sophisticated algorithms can parse through dense corporate language, identify nuanced differences in how sustainability practices are reported, and detect the sentiment behind the narratives presented. ...
Automated text analysis approaches such as text mining (TM) and natural language processing (NLP) hold great promise for dealing with the growing volume, diversity, and complexity of the data found within corporate sustainability reports (SRs). However, given the novelty of these approaches, we know little about how and how well research studies have utilized these new tools: specifically, the methods employed, the research objectives addressed, and the progress made in applying these tools in a manner that would allow us to fully maximize the insights generated from the growing wealth of sustainability data. Consequently, we conduct a systematic literature review (SLR) in order to synthesize and assess the literature utilizing TM-NLP to study SRs. Our contribution is threefold: First, we provide an overview of the methodologies and techniques that have been employed in the analysis of SRs. Second, we review the research objectives pursued by scholars employing TM-NLP in the analysis of SRs. Third, based on these, we present a critical assessment of the literature to date. Findings reveal that while there has been some progress, issues related to research depth, breadth, and methodological transparency are evident in the body of literature to date. As such, we argue that the potential of TM-NLP to generate significant insights from SR big data remains largely unrealized, and we offer suggestions for future research.
... The methodology rests on a text mining design oriented toward ontology learning and automatic ontology population. Text mining (text analytics) is a field of data science characterized by the application of artificial intelligence techniques to large textual corpora in order to extract new knowledge from them (Feldman and Sanger, 2006; Jo, 2018; Weiss et al., 2005). The design proceeds through the following stages. ...
Because of their ability to formalize the knowledge of a domain into an explicit, computer-interpretable specification (Gruber, 1995), the ontology representation model is ideally suited to supporting knowledge synthesis methods such as systematic literature reviews (Sahlab et al., 2022). Given the rapid and continuous growth of scholarly output, automating the production of these knowledge products has become indispensable (Al-Aswadi et al., 2020). This presentation reports the first milestones of ongoing research aiming to evaluate the contribution of ontology learning methods to supporting literature review syntheses.
... Analyzing each individual course as a case in addition to analyzing the collective whole, categories of concepts and themes were derived from the data through conceptualization and category definition (Corbin & Strauss, 2008; Creswell & Poth, 2018). A text mining process involving information extraction, topic tracking, concept linking, and categorization/classification was used (Feldman & Sanger, 2007; Tan, 1999). Interview data were professionally transcribed and analyzed using two processes of analysis derived from a grounded theoretical perspective: open coding (developing categories of concepts and themes) and axial coding (building connections within categories) (Strauss & Corbin, 1998). ...
Purpose : This study explored how physical education-specific technology courses within physical education teacher education (PETE) programs address students’ Technological, Pedagogical, and Content Knowledge (TPACK), given that successful technology integration in PETE remains unclear. Method : A case study design was used to examine six PETE programs’ physical education-specific technology courses within the United States. Semistructured interviews with course instructors and course materials (e.g., syllabi, assignments) were inductively analyzed to explore how they addressed TPACK components and PETE technology standards. Results : All courses addressed varying components of TPACK and PETE technology standards. Additional thematic analysis showed that all courses encouraged (a) hands-on practical experience, (b) technology exploration, (c) empowering teachers, and (d) digital citizenship advocacy. Discussion : PETE programs currently with or those considering a physical education-specific course should ensure they enhance students’ TPACK by specifically addressing technology use in lesson planning, lesson delivery, assessment, and advocacy in program and course experiences.
Various paradigms, including Dew Computing (Dew-C) and Cloud Computing (Cloud-C), have arisen within the domain of computing. Dew-C addresses the constraints of Cloud-C, such as bandwidth reliance and elevated latency, by employing a distributed, lightweight architecture tailored for peripheral computing environments. This study examines the essential concepts, varied applications, and prospective future uses of Dew-C through Latent Semantic Analysis (LSA) to discern key research themes and problems. The data for this study were sourced from the Scopus database using the query string TITLE-ABS-KEY ("dew computing") in compliance with PRISMA requirements. LSA, a prominent natural language processing technique, was applied to perform term frequency-inverse document frequency (TF-IDF) analysis, in conjunction with bibliometric analysis, to derive quantitative and statistical insights. The experiments used an augmented dataset of 191 articles published between 2016 and 2024, employing open-source tools such as KNIME and VOSviewer. K-means clustering identified five thematic clusters indicative of potential future trajectories in the Dew-C area. Dew Computing is a distributed paradigm that focuses on edge computing and integrates seamlessly with Cyber-Physical Systems (CPS) to facilitate real-time data processing and autonomous control across diverse applications, including healthcare and smart cities. Notwithstanding its benefits, Dew-C faces considerable obstacles regarding data privacy, security, and connectivity because of its reliance on lightweight, resource-constrained devices. Dew-C has the capacity to markedly enhance distributed computing owing to its scalability, energy efficiency, and small environmental footprint.
Future research should primarily concentrate on the creation of domain-specific applications, resource management approaches, and robust security systems in sectors such as healthcare, environmental monitoring, and smart infrastructure.
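The thematic-clustering step used above can be illustrated with a plain k-means. This sketch uses Euclidean distance on toy 2-D points standing in for LSA-reduced TF-IDF document vectors; it is not the study's actual pipeline:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance; a stand-in for clustering
    LSA-reduced document vectors into thematic groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# two well-separated groups in a toy 2-D "topic space"
points = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = kmeans(points, 2)
```

On real abstract data the points would be rows of a TF-IDF matrix after dimensionality reduction, and k would be chosen by inspecting cluster quality (here the study settled on five).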
The HJ-Biplot, introduced by Galindo in 1986, is a multivariate analysis technique that enables the simultaneous representation of rows and columns with high-quality visualization. This systematic review synthesizes findings from 121 studies on the HJ-Biplot, spanning from 1986 to December 2024. Studies were sourced from Scopus, Web of Science, and other bibliographic repositories. This review aims to examine the theoretical advancements, methodological extensions, and diverse applications of the HJ-Biplot across disciplines. Text mining was performed using IRAMUTEQ software, and Canonical Biplot analysis was conducted to identify four key evolutionary periods of the technique. A total of 121 studies revealed that health (14.9%), sustainability (11.6%), and environmental sciences (12.4%) are the primary areas of application. Canonical Biplot analysis showed that two main dimensions explained 80.24% of the variability in the dataset with Group 4 (2016–2024) achieving the highest cumulative representation (98.1%). Recent innovations, such as the Sparse HJ-Biplot and Cenet HJ-Biplot, have been associated with contemporary topics like COVID-19, food security, and sustainability. Artificial intelligence (ChatGPT 3.5) enriched the analysis by generating a detailed timeline and identifying emerging trends. The findings highlight the HJ-Biplot’s adaptability in addressing complex problems with significant contributions to health, management, and socioeconomic studies. We recommend future research explore hybrid applications of the HJ-Biplot with machine learning and artificial intelligence to further enhance its analytical capabilities and address its current limitations.
Recently, social media influencers have promoted social campaigns and movements, contributing to heightened interest and concern among the public about social causes. This study explores influencers’ social cause communication to understand message attributes, which may be catalysts in increasing public attention to the message and engagement in social causes. Data from Instagram were collected, and data mining with sentiment and semantic network analyses was conducted to discover the patterns of sentiments and themes used in influencers’ social cause communication. The results were then compared with the corporate social responsibility (CSR) communication framework proposed by a previous study. Influencers’ message contents comparable to effective CSR communication were found, as well as their distinct features (e.g. emotional expressions with sentimental words and product promotions). The study provides a foundational understanding of influencers’ communication practices, which may contribute to future research investigating the impacts of those message factors on public perceptions.
This study aims to explore the major topics in the recent artificial intelligence (AI) ecosystem literature and identify and categorize those topics into categories of AI ecosystems. The study analyzed 149 publications from Google Scholar using two text mining techniques: latent Dirichlet allocation (LDA) and exploratory factor analysis (EFA). The LDA identified 12 major topics, while the EFA grouped them into six common factors: (a) human resources-driven, (b) technology and algorithm-based, (c) business and entrepreneurial-driven, (d) legal, ethical, privacy, and regulatory framework, (e) innovation-based, and (f) government-supported. The goal is to suggest various AI ecosystems and their best fit for a country or region based on its characteristics and resources. Understanding these types of AI ecosystems can provide valuable insights for government agencies, policymakers, businesses, educational institutions, and other stakeholders to align strategies with resources for developing successful AI-driven ecosystems.
This study analyzes the environmental impacts of Small Hydropower Plants (SHPs) in Mato Grosso by applying text mining and clustering techniques to examine Environmental Impact Assessments (EIAs) of these plants. With the expansion of SHPs as lower-impact alternatives, there is a growing demand for specific evaluations of their effects on fauna, flora, soil, and water resources. Using Doc2Vec to generate semantic vectors from the texts, the documents were grouped into three clusters reflecting distinct approaches: Cluster 1 focuses on broad environmental impacts and mitigation measures; Cluster 2 emphasizes water quality monitoring and erosion control; and Cluster 3 prioritizes rapid responses to soil and socioeconomic impacts. This analysis reveals how EIAs address the environmental challenges of SHPs, highlighting the importance of public policies and mitigation strategies tailored to each ecological context and providing insights for more effective and sustainable environmental planning in Mato Grosso.
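Grouping documents by semantic vectors, as done here with Doc2Vec, rests on a vector similarity measure. A minimal cosine similarity, with hand-written toy vectors standing in for learned embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two document embeddings
    (e.g. vectors produced by a Doc2Vec model)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# parallel vectors score 1.0; orthogonal vectors score 0.0
print(cosine([1.0, 0.0], [2.0, 0.0]), cosine([1.0, 0.0], [0.0, 3.0]))
```

Clustering EIA documents then amounts to running an algorithm such as k-means or hierarchical clustering over the pairwise similarities (or the raw vectors) to obtain the three clusters reported.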
Online shopping is the process of buying goods and services from merchants who sell or present them on the internet. Consumers can visit online shopping sites comfortably from home or the office while sitting in front of a computer or smartphone. Some 43% agree that social media serves as a tool for meeting knowledge needs in the form of product reviews and forum discussions, helping them make purchase decisions. Product reviews and forum discussions are conveyed through comments on social media containing complaints, praise, or opinions about a product or an online shopping site. In Indonesia, several e-commerce sites are used by consumers. In this case, the authors focus on only three e-commerce sites: Shopee, Tokopedia, and Lazada. The authors chose these three sites as research objects because they attract many reviews, particularly on the social medium Facebook, and in order to obtain a general picture of how Indonesian Facebook users perceive the online marketplaces Lazada, Shopee, and Tokopedia. The analysis shows that sentiment toward these three online stores tends to be positive, based on comments on Facebook.
A dynamic perspective of second language (L2) development enquires into time-intensive data that are longitudinal, dense, non-linear, and individual. This article reports a collection of quantitative methods that could capture time-intensive data, termed Time-Intensive Methods (TIMs). We reviewed empirical studies published from 2008 to 2023 that have used TIMs to investigate non-linear L2 development. Seventy-eight studies were included to be further analyzed regarding their chronological trends, adopted TIMs, research topics, and theoretical contributions. Three major contributions of TIMs are identified: capturing the non-linear developmental paths, detecting the emergent group patterns, and revealing the dynamic interactions over time. Methodological rigor is discussed against how TIMs address their corresponding theoretical concepts and research questions. The findings provide insights into the current state of TIM application in L2 development research, and encourage researchers to explore a wider range of TIMs that could enhance future studies inspired by the dynamic paradigm.
As many systems have transitioned to the digital age with developing technology, fraud reports have increased, and this increase requires banks to develop strong systems against these threats. The inadequacy and slowness of the manual method currently used motivated us to address this problem. As a solution to the fraud problem, this chapter presents an innovative method for periodic data analysis and automatic classification of fraud reports using text mining methods and machine learning algorithms. In this study, the authors aim to automate the labeling of incoming fraud reports as "social engineering" or "phishing" with the developed model. By comparing the performance of different algorithms, the random forest algorithm is selected as the most effective model. These results are valuable in terms of practical applicability.
The integration of Artificial Intelligence (AI) with predictive analytics has transformed business decision-making by leveraging historical data to forecast future trends. This paper examines foundational concepts and applications of AI-driven predictive analytics in business prior to 2013. By analyzing various use cases such as customer behavior analysis, demand forecasting, and risk management, this study highlights the methodologies employed, including machine learning algorithms, statistical modeling, and data mining techniques. Experimental results from benchmark studies demonstrate the potential of predictive analytics to enhance business performance, streamline operations, and improve customer satisfaction.
The research is a qualitative study exploring the influence of service quality on purchase intentions, considering e-WOM and consumer well-being as mediators and brand hate as a moderator, based on SHEIN customers' reviews posted on Reviews. Specifically, using text mining tools in Python, 181 reviews were analyzed to extract emotional and topic characteristics regarding service quality, consumers' responses, and brand perception. The results showed a negative relationship between poor service quality and purchase intentions, mediated by negative e-WOM and lowered consumer well-being. Moreover, the presence of brand hate considerably strengthens this negative impact. The findings highlight nuanced aspects of consumer behavior that could help online retail firms refine their service and brand management efforts.
This study analyzed research trends in the beauty industry from 2000 to 2024. It targeted 621 KCI-indexed papers selected by the keyword 'beauty industry', and the analysis combined text mining techniques (TF-IDF analysis and LDA topic modeling) using Python libraries such as pandas, KoNLPy, scikit-learn, and Gensim with traditional content analysis. The results show that research on the beauty industry has grown significantly since 2016, confirming beauty research as the main field. Average citation growth analysis identified the most influential papers, and the main research topics were education, consumer behavior, organizational management, marketing, and brand image. This study provides academic insights linked to ways to revitalize the beauty industry in Korea and contributes methodologically to the objective and in-depth analysis of research trends. In conclusion, this study presents a multifaceted overview of research on the beauty industry and suggests future research directions. Emphasizing the importance of varied academic approaches and sustainability, it can have a practical impact on government policy and industrial strategy.
This study aims to design a system that benefits both users and producers on e-commerce platforms. Within this scope, a system was developed to extract summary information from millions of product or service reviews written in natural language on Google Play, using the review datasets of Turkey's three leading e-commerce companies. Using the Python language, a large dataset of reviews obtained through web scraping was evaluated with natural language processing and sentiment analysis techniques on the Google Colab platform. The results were examined using sentiment analysis and text mining methods.
The study aims to enable e-commerce platforms to analyze customer feedback more effectively and use the information obtained from these analyses to improve the user experience, enhance service quality, and identify potential issues in advance. Evaluating the customer feedback from Site 1, Site 2, and Site 3 revealed both shared and distinct points of satisfaction and complaint. The results obtained with text mining methods provided an opportunity to examine in depth the level of user satisfaction and the emotional responses in customer feedback for each e-commerce site.
Context
In collaborative software development, the peer code review process proves beneficial only when the reviewers provide useful comments.
Objective
This paper investigates the usefulness of Code Review Comments (CR comments) through textual feature-based and featureless approaches.
Method
We select three available datasets from both open-source and commercial projects. Additionally, we introduce new features from software and non-software domains. Moreover, we experiment with the presence of jargon, voice, and codes in CR Comments and classify the usefulness of CR Comments through featurization, bag-of-words, and transfer learning techniques.
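A featurization step of the kind described might look like the following sketch. The specific features (code fragments, question form, length) are illustrative assumptions in the spirit of the method, not the paper's actual feature set:

```python
import re

def cr_features(comment):
    """Extract simple hand-crafted features from a code review comment.
    Feature names are hypothetical, chosen for illustration only."""
    return {
        # crude proxy for embedded code: backtick spans or code-like tokens
        "has_code": bool(re.search(r"`[^`]+`|\(\)|==|->", comment)),
        "is_question": comment.strip().endswith("?"),
        "n_words": len(comment.split()),
    }

print(cr_features("Why not use `strcmp()` here?"))
```

Feature dictionaries like this would then be vectorized and fed to a classifier, alongside (or in place of) the bag-of-words and transfer learning representations the paper compares.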
Results
Our models outperform the baseline by achieving state-of-the-art performance. Furthermore, the results demonstrate that both a large commercial LLM, GPT-4o, and a non-commercial, naive featureless approach, Bag-of-Words with TF-IDF, are more effective for predicting the usefulness of CR Comments.
Conclusion
The significant improvement in predicting usefulness solely from CR Comments escalates research on this task. Our analyses portray the similarities and differences of domains, projects, datasets, models, and features for predicting the usefulness of CR Comments.
When companies need to build decision making reports on huge amounts of data, data mining is their go-to method. Data mining is like being a detective who finds hidden clues in a sea of information. Data engineers use clever techniques to turn raw data into useful insights that businesses can act on. This paper explores various data mining techniques like association rule learning, clustering, classification, regression analysis, and text mining. Additionally, the paper highlights the importance of new technologies like machine learning, AI, and big data platforms, and how these advancements make data processing more efficient and insightful.
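Of the techniques listed, association rule learning is the easiest to illustrate end to end. This brute-force sketch mines one-item-to-one-item rules by support and confidence over toy shopping baskets (real miners such as Apriori prune the search space instead):

```python
from itertools import combinations
from collections import Counter

def association_rules(transactions, min_support=0.5, min_conf=0.7):
    """Mine 1 -> 1 association rules by brute force (illustrative only).
    Returns tuples (antecedent, consequent, support, confidence)."""
    n = len(transactions)
    items = Counter(i for t in transactions for i in set(t))
    pairs = Counter(frozenset(p) for t in transactions
                    for p in combinations(sorted(set(t)), 2))
    rules = []
    for pair, c in pairs.items():
        if c / n < min_support:          # prune infrequent itemsets
            continue
        a, b = tuple(pair)
        for x, y in ((a, b), (b, a)):    # try both rule directions
            conf = c / items[x]
            if conf >= min_conf:
                rules.append((x, y, c / n, conf))
    return rules

baskets = [["bread", "milk"], ["bread", "milk", "eggs"],
           ["milk"], ["bread", "milk"]]
rules = association_rules(baskets)
# bread -> milk holds with support 0.75 and confidence 1.0
```

The support/confidence thresholds are the usual tuning knobs: lowering them surfaces more (but weaker) rules, which is exactly the precision/recall trade-off analysts face when turning raw transactions into actionable insights.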
In the past 20 years, the ecological risks arising from climate change have attracted increasing attention. Understanding its research progress and the evolution of hot topics is paramount. However, efficient, in-depth, and robust analysis of massive and complex unstructured literature is difficult. This study employs a novel approach integrating data mining, bibliometrics, and systematic review to analyze 9,122 interdisciplinary publications from 2000 to 2023. Our findings reveal a consistent annual increase in publications, with a marked acceleration post-2015. The United States and China have emerged as leading contributors to this field. Over time, the field has traversed three pivotal hotspots and more than 100 hot words. We have summarized research progress from early to late stages into nine aspects: (1) species and population responses; (2) ecosystem impacts; (3) social-ecological system risks; (4) land use/cover change interactions; (5) ecological processes; (6) ecosystem services; (7) sustainable development goals; (8) ecosystem conservation, management, and adaptation; and (9) ecological risk assessment and major models. Additionally, we summarized international policies and efforts to combat climate risks, research gaps, and potential directions for future progress: establish a unified and comparable regional risk assessment framework; strengthen research on ecological processes, multiple sources of pressure, and composite risks; enhance research on the space-time dynamics and flow of risks; establish high-precision basic datasets; and improve communication and cooperation among multiple stakeholders. This study provides a systematic and comprehensive review of climate change's ecological risks, which may inspire researchers interested in this field.