Conference Paper

Media Coverage and Public Perception of Distance Learning During the COVID-19 Pandemic: A Topic Modeling Approach Based on BERTopic

... Social media data are created spontaneously: individuals with diverse backgrounds, personal experiences, and viewpoints share their most recent ideas honestly, since social media is an easily accessible and comfortable tool [19]. However, how AI and its possible future impacts are discussed on social media, which reflects the public's opinions and feelings [11], [13], [20], remains unexplored. Thus, in this article, we analyze social media data to explore society's opinions, attitudes, perceptions, and expectations about the future of AI. ...
... This is because manually analyzing and summarizing society's attitudes and perceptions is very challenging given the volume of conversations produced on social media. To address this challenge, the application of Natural Language Processing (NLP) techniques is crucial when analyzing large volumes of text data [11], [17], [20], [23], [24], [25], [26], [27]. This research therefore applies topic modeling and sentiment analysis methods. ...
... Even though the literature review demonstrated the superiority of BERTopic and BERT models over other similar methods, few studies have explored the potential of BERTopic and BERT models in real-life case studies [20], [27], [45]. Additionally, the majority of the published papers analyze Twitter data [24], [34]. ...
Article
Full-text available
Today’s AI technology has various applications in many fields, thus creating opportunities to improve different aspects of daily life and optimize business operations. However, there are also societal expectations and concerns regarding AI and its future impacts. Investigating such societal opinions and feelings is essential for social acceptance, further development and distribution of such technology, regulation, and adaptation to changes and policies. Despite this situation, such an exploration has not been sufficiently conducted in the existing literature and the most appropriate methods for such an exploration have not been sufficiently investigated. To contribute to addressing this limitation in the literature, this study applies topic modeling and sentiment analysis approaches to investigate societal opinions and feelings about the future of AI on social media, which includes conversations from various segments of society. A corpus consisting of 16,611 comments and 998 unique Reddit post titles was analyzed with a customized BERTopic model for topic modeling and a BERT sentiment classification model. This study highlights the significant advantages of using BERTopic and BERT models in analyzing a large sample of social media discussions. The results of this study can help realize the potential of text analytics methods through transformer-based language models to derive empirical findings from large-scale data samples.
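As a rough illustration of the pipeline this abstract describes (a BERTopic model for topics plus a BERT classifier for sentiment), the following Python sketch combines the bertopic and transformers libraries; the reddit_texts variable, the model settings, and the sentiment checkpoint are illustrative assumptions, not the authors' configuration.

from bertopic import BERTopic
from transformers import pipeline

# reddit_texts: placeholder list of Reddit post titles and comments
# (the study analyzed 998 titles and 16,611 comments)
topic_model = BERTopic(language="english", min_topic_size=20)  # settings assumed
topics, _ = topic_model.fit_transform(reddit_texts)
print(topic_model.get_topic_info().head())

# BERT-based sentiment classification; this checkpoint is an assumption
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
# rough character-level truncation to stay within the model's input limit
labels = [sentiment(text[:512])[0]["label"] for text in reddit_texts]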
... Though BERTopic itself was not yet peer-reviewed (at the date of submission), many researchers [48,51,53,54,82,88,109,116] have adopted it at peer-reviewed venues as a topic modeling approach to overcome drawbacks of traditional topic modeling methods such as LDA and NMF (see Section 4.2). For example, Hristova and Netov [51] use BERTopic to identify themes in media coverage and public perception during the COVID-19 pandemic, Hutama et al. [53] use BERTopic to analyze themes emerging from hoax news articles, Jeon et al. [54] study the perceptions of therapeutic technologies, and Uncovska [109] understands themes emerging from reviews of German mHealth applications. Even the CHI community has recently reiterated the advantages of using BERTopic [48,82,88] over traditional methods such as LDA and NMF, further reinforcing its validity and efficiency. ...
Article
Full-text available
Today's videogames are often intertwined with profound religious narratives, content, characters, and artifacts - but how do player communities understand the role of religion in games, and what is the impression of believers on the use, portrayal, and representation of the religion they practice? This paper dives deep into this confluence by rigorously analyzing discussions from religion and videogame subreddits that often depict the most popular communication channel for their communities. Following topic modeling, we cluster posts and comments from the three major religious groups and the most popular related videogame subreddits, bringing forth representative cultural sentiments and patterns of players and religious stakeholders. Through this, we discovered 22 distinct sub-themes, further categorized into the three overarching themes of (a) Blasphemous Elements of Videogames, (b) Religion as a Design Space for Videogames, and (c) Videogames for Religious Education and Community Forming.
... In [9], an approach based on sentiment analysis and topic modeling techniques is proposed to understand public opinion on governments implementing digital contact tracing to prevent the spread of COVID-19. In [10], the advantages and limitations of using a novel topic modeling algorithm for big data analysis in the public domain are studied, drawing useful insights into the general perception of distance learning in Bulgaria in light of the COVID-19 pandemic. ...
... Although not as fast as in the field of sentiment analysis, new approaches for topic modeling based on transformer language models also emerged. The potential of the novel BERTopic algorithm [22] has been demonstrated in an increasing volume of studies [10,17,23,24,25,26]. ...
Chapter
Full-text available
We live in an era of digital revolution not only in the industry, but also in the public sector. User opinion is key in e-services development. Currently the most established approaches for analyzing citizens’ opinions are surveys and personal interviews. However, governments should focus not only on developing public e-services but also on implementing modern solutions for data analysis based on machine learning and artificial intelligence. The main aim of the current study is to engage state-of-the-art natural language processing technologies to develop an analytical approach for public opinion analysis. We utilize transformer-based language models to derive valuable insights into citizens’ interests and expressed sentiments and emotions towards digitalization of educational, administrative and health public services. Our research brings empirical evidence on the practical usefulness of such methods in the government domain.
... 1) Topic Modeling: Topic modeling reveals the thematic structure within a text corpus by categorizing documents into distinct topics based on related words. This technique is crucial for understanding context, community types, user expertise, and content in forums [61], [23], [57]. Various text analysis methods, such as Latent Semantic Indexing (LSI) [48], Probabilistic Latent Semantic Analysis (PLSA) [6], Latent Dirichlet Allocation (LDA) [26], Non-Negative Matrix Factorization (NMF) [31], Top2Vec [3], and BERTopic [20], are suited to different text analysis scenarios. ...
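For orientation, here is a minimal Python sketch of the two classical baselines named above, using scikit-learn; the docs variable and topic counts are placeholders, not settings from the cited work.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# docs: placeholder list of forum-post strings
# LDA is fit on raw term counts (its probabilistic model assumes counts)
counts = CountVectorizer(stop_words="english", min_df=2).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

# NMF is conventionally fit on TF-IDF weights instead
tfidf = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(docs)
nmf = NMF(n_components=10, random_state=0).fit(tfidf)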
Preprint
Full-text available
Underground forums serve as hubs for cybercriminal activities, offering a space for anonymity and evasion of conventional online oversight. In these hidden communities, malicious actors collaborate to exchange illicit knowledge, tools, and tactics, driving a range of cyber threats from hacking techniques to the sale of stolen data, malware, and zero-day exploits. Identifying the key instigators (i.e., key hackers), behind these operations is essential but remains a complex challenge. This paper presents a novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence. This sequence is processed through a large language model (LLM) for domain-specific adaptation, with LLMs acting as feature extractors. These extracted features are then fed into a Graph Neural Network (GNN) to model user structural relationships, significantly improving identification accuracy. Furthermore, we employ BERTopic (Bidirectional Encoder Representations from Transformers Topic Modeling) to extract personalized topics from user-generated content, enabling multiple textual representations per user and optimizing the selection of the most representative sequence. Our study demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers. Additionally, when combined with GNNs, our model achieves significant improvements, resulting in approximately 6% and 10% increases in accuracy and F1-score, respectively, over existing methods. EUREKHA was tested on the Hack-Forums dataset, and we provide open-source access to our code.
... BERTopic is a topic modeling technique that generates topic representations through three main steps: (i) document embeddings, (ii) document clustering, and (iii) topic representation. It has been applied in various studies, such as analyzing COVID-19's impact on the education system using data from news and social media (Hristova & Netov, 2022). ...
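The three steps listed in this excerpt map directly onto BERTopic's modular API. A minimal sketch, assuming the commonly used sentence-transformers, umap-learn, and hdbscan components (the parameter values are illustrative defaults, not those of the cited study):

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# (i) document embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# (ii) dimensionality reduction followed by density-based clustering
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean")
# (iii) the c-TF-IDF topic representation is computed internally by BERTopic
topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model)
topics, _ = topic_model.fit_transform(docs)  # docs: placeholder corpus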
Article
Full-text available
The rapid shift to online education during the COVID-19 crisis has presented both challenges and opportunities for educators worldwide. This paper aims to analyze free-text comments from instructors at Greek Higher Education Institutions (HEIs), gathered during a quantitative survey on the impact of the intensive use of ICT during the lockdown. Topics emerging from extensive survey comments highlight specific barriers and suggestions, offering valuable recommendations for policymakers in similar situations. The majority of barriers were expressed regarding remote teaching, followed by remote examinations, while the section on HEIs support and remote teaching garnered more suggestions. Advanced topic modeling techniques employing Large Language Models (LLMs) are applied to analyze the collected responses. The results reveal that numerous instructors encountered difficulties stemming from insufficient infrastructure, technical assistance, and personnel, along with challenges related to student interaction and remote examination administration. Instructors put forth various suggestions for enhancement, including the provision of equipment, technical support, remote learning workshops, and interactive educational modules. The study’s findings provide insights for guiding the formulation of strategies aimed at better supporting instructors, enriching the teaching and learning process, and shaping policy decisions concerning the utilization of ICTs in higher education during crisis situations, ultimately enhancing education’s effectiveness in preparing students for the knowledge-based economy.
... Since BERTopic is a probabilistic method, each learning objective is associated with each topic with a certain probability, resulting in a 7-dimensional vector with values in the 0-1 range. BERTopic was used as a state-of-the-art topic modelling method that outperformed alternative methods in a variety of settings (see, e.g., Egger & Yu, 2022; Hristova & Netov, 2022). ...
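In BERTopic, per-document topic probabilities of the kind described here can be obtained by enabling soft assignments; a minimal sketch (the objectives variable is a placeholder, and the number of topics depends on the data rather than being fixed in advance):

from bertopic import BERTopic

# objectives: placeholder list of learning-objective strings
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(objectives)
# If the model finds 7 topics, each row of `probs` is a 7-dimensional
# vector of topic probabilities in the 0-1 range for one objective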
Article
Full-text available
Higher education institutions are increasingly seeking ways to leverage the available educational data to make program and course quality improvements. The development of automated curriculum analytics can play a substantial role in this effort by bringing novel and timely insights into course and program quality. However, the adoption of curriculum analytics for program quality assurance has been impeded by a lack of accessible and scalable data-informed methods that can be employed to evaluate assessment practices and ensure their alignment with the curriculum objectives. Presently, this work remains a manual and resource-intensive endeavour. In response to this challenge, we present an exploratory curriculum analytics approach that allows for scalable, semi-automated examination of the alignment between assessments and learning objectives at the program level. The method employs a comprehensive representation of assessment objectives (i.e., learning objectives associated with assessments) to encode the domain-specific and general knowledge, as well as the specific skills, the implemented assessments are designed to measure. The proposed method uses this representation for clustering assessment objectives within a study program, and proceeds with an exploratory analysis of the resulting clusters of objectives in relation to the corresponding assessment types and student assessment grades. We demonstrate and discuss the capacity of the proposed method to offer an initial insight into the alignment of assessment objectives and practice, using the assessment-related data from an undergraduate study program in information systems.
... Further, it was instrumental in revealing COVID-19's impact on Bulgaria's education system, highlighting the effectiveness of its clustering capabilities even with noisy datasets [11]. BERTopic also demonstrated superior performance in accuracy and efficiency when applied to X data [12]. ...
Article
Indonesia, known for its diverse biodiversity, faces critical challenges such as habitat degradation and species loss. This study delves into public opinion regarding Indonesian government biodiversity policies by analyzing text data from the X social media platform. Leveraging BERTopic, an advanced topic modeling technique, we uncover nuanced topics related to biodiversity within tweets. Our research uniquely contributes by exploring diverse combinations of BERTopic parameters on Indonesian text, assessing their efficacy through coherence values and manual content evaluation. Notably, our findings highlight the optimal combination of sentence embedding, cluster model, and dimension reduction parameters, with Model 5 demonstrating the highest coherence score of 0.7733. Moreover, we elucidate the impact of outlier reduction techniques when applying BERTopic in an Indonesian context. Our study serves as a foundational model for categorizing Indonesian-language topics using BERTopic, showcasing the significance of tailored text processing techniques. We also reveal that while standard preprocessing methods enhance clustering outcomes, certain dataset characteristics, such as the inclusion of hashtags and mentions, can influence coherence differently across models. This work not only provides insights into public perceptions of biodiversity policies but also offers methodological guidance for text analysis in similar contexts.
... In addition to its performance, BERTopic has several other advantages over traditional topic modeling methods. Firstly, BERTopic is highly scalable and can be applied to large datasets, making it ideal for big data applications [22]. Secondly, BERTopic is easy to use and does not require extensive hyperparameter tuning, making it a user-friendly tool for topic modeling. ...
Article
Full-text available
Extractive text summarization has been a popular research area for many years. The goal of this task is to generate a compact and coherent summary of a given document, preserving the most important information. However, current extractive summarization methods still face several challenges, such as semantic drift, repetition, redundancy, and lack of coherence. A novel approach is presented in this paper to improve the performance of an extractive summarization model based on Bidirectional Encoder Representations from Transformers (BERT) by incorporating topic modeling using the BERTopic model. Our method first utilizes BERTopic to identify the dominant topics in a document and then employs a BERT-based deep neural network to extract the most salient sentences related to those topics. Our experiments on the Cable News Network (CNN)/Daily Mail dataset demonstrate that our proposed method outperforms state-of-the-art BERT-based extractive summarization models in terms of Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, yielding increases of 32.53% in ROUGE-1, 47.55% in ROUGE-2, and 16.63% in ROUGE-L over baseline BERT-based extractive summarization models. This paper contributes to the field of extractive text summarization, highlights the potential of topic modeling in improving summarization results, and provides a new direction for future research.
... BERTopic demonstrates notable advantages over other topic modeling methods, particularly in pre-trained embeddings, the c-TF-IDF process, and automatic reduction of topic numbers [19]. The clustering results exhibit higher levels of topic coherence, diversity, and interpretability [20] and have been widely employed in discovering themes within complex short-text datasets [21,22]. Considering the intended thematic clustering of academic paper abstracts in this study, given their concise and complex content, as well as semantic richness, the BERTopic model is employed to cluster 6003 articles that contain abstract information. ...
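For reference, the c-TF-IDF weighting mentioned in this excerpt is defined in the BERTopic paper (Grootendorst, 2022; see the reference list below) as

W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{\mathrm{tf}_t}\right),

where tf_{t,c} is the frequency of term t in class (topic cluster) c, tf_t is the frequency of term t across all classes, and A is the average number of words per class.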
Chapter
Full-text available
Traditionally, scientific evaluation leaning on quantity and citation metrics rarely places a study within a specific context or a particular historical process for examination, making it difficult to fully reveal its substantial contributions. In a specific scientific field, each study focuses on a certain topic and corresponds to a certain evolutionary stage of that topic. However, few studies analyze research contributions from context-oriented and process-oriented perspectives. This study investigates the contributions of research under several representative topics in the field of quantitative science studies, using articles published in the international journal Scientometrics as samples. The BERTopic model is employed for topic clustering, and four research topics are selected for in-depth analysis. In order to unveil research contributions to knowledge production and to different audiences, various metrics, including disruptiveness, citation impact, and altmetrics, are combined for indicator-level analysis, and articles are classified into different categories according to knowledge contribution types and research orientations for content-level analysis. Results reveal that representative research topics exhibit greater disruptiveness and research impact compared to the overall sample. However, as research topics develop, there is a declining trend in introducing new knowledge and producing impact within academia. Simultaneously, there is a certain degree of enhancement in their impact beyond academia, and also a shift in knowledge contribution types and research orientations. Our findings contribute to a contextual and processual understanding of diverse research contributions, serving as a reference for the evaluation practices of research outcomes oriented towards contribution assessment.
Article
Full-text available
Natural climate solutions (NCS) play a critical role in climate change mitigation. NCS can generate win–win co-benefits for biodiversity and human well-being, but they can also involve trade-offs (co-impacts). However, the massive evidence base on NCS co-benefits and possible trade-offs is poorly understood. We employ large language models to assess over 2 million published journal articles, primarily written in English, finding 257,266 relevant studies on NCS co-impacts. Using machine learning methods to extract data (for example, study location, species and other key variables), we create a global evidence map on NCS co-impacts. We find that global evidence on NCS co-impacts has grown approximately tenfold in three decades, and some of the most abundant evidence relates to NCS that have lower mitigation potential. Studies often examine multiple NCS, indicating some natural complementarities. Finally, we identify countries with high carbon mitigation potential but a relatively weak body of evidence on NCS co-impacts. Through effective methods and systematic and representative data on NCS co-impacts, we provide timely insights to inform NCS-related research and action globally.
Article
Purpose Research on artificial intelligence (AI) and its potential effects on the workplace is increasing. How AI and the futures of work are framed in traditional media has been examined in prior studies, but current research has not gone far enough in examining how AI is framed on social media. This paper aims to fill this gap by examining how people frame the futures of work and intelligent machines when they post on social media. Design/methodology/approach We investigate public interpretations, assumptions and expectations, referring to framing expressed in social media conversations. We also coded the emotions and attitudes expressed in the text data. A corpus consisting of 998 unique Reddit post titles and their corresponding 16,611 comments was analyzed using computer-aided textual analysis comprising a BERTopic model and two BERT text classification models, one for emotion and the other for sentiment analysis, supported by human judgment. Findings Different interpretations, assumptions and expectations were found in the conversations. Three subframes were analyzed in detail under the overarching frame of the New World of Work: (1) general impacts of intelligent machines on society, (2) undertaking of tasks (augmentation and substitution) and (3) loss of jobs. The general attitude observed in conversations was slightly positive, and the most common emotion category was curiosity. Originality/value Findings from this research can uncover public needs and expectations regarding the future of work with intelligent machines. The findings may also help shape research directions about futures of work. Furthermore, firms, organizations or industries may employ framing methods to analyze customers’ or workers’ responses or even influence the responses. Another contribution of this work is the application of framing theory to interpreting how people conceptualize the future of work with intelligent machines.
Article
As problems of injustice observed in the decarbonization process arose, energy scholars have recently sought remedies to address social justice concerns under the banner of just transition. What remains elusive in the existing literature is the role of communication between proponents of policy ideas and the public in fostering social consensus around just transition, particularly within non-Western contexts. The research presented here aims to fill the aforementioned knowledge gap by investigating reasons behind the vanished momentum of a just transition policy in South Korea, despite a public atmosphere accepting of the need for low-carbon energy transition. Employing natural language processing on 2022 news articles and 32,211 online comments, our research reveals that the public perception of just transition has been influenced heavily by ideologically-driven interpretations of the meaning of justice. This is due primarily to the failure of the speakers of just transition to effectively communicate its intended scope and content. The findings underscore the importance of communication in building a shared understanding of just transition aligned with deep core beliefs of a society to ensure its public acceptance and long-term viability.
Chapter
Transformer models have achieved state-of-the-art results for news classification tasks, but remain difficult to modify to yield the desired class probabilities in a multi-class setting. Using a neural topic model to create dense topic clusters helps with generating these class probabilities. The presented work uses the BERTopic clustered embeddings model as a preprocessor to eliminate documents that do not belong to any distinct cluster or topic. By combining the resulting embeddings with a Sentence Transformer fine-tuned with SetFit, we obtain a prompt-free framework that demonstrates competitive performance even with few-shot labeled data. Our findings show that incorporating BERTopic in the preprocessing stage leads to a notable improvement in the classification accuracy of news documents. Furthermore, our method outperforms hybrid approaches that combine text and images for news document classification.
Article
Full-text available
The richness of social media data has opened a new avenue for social science research to gain insights into human behaviors and experiences. In particular, emerging data-driven approaches relying on topic models provide entirely new perspectives on interpreting social phenomena. However, the short, text-heavy, and unstructured nature of social media content often leads to methodological challenges in both data collection and analysis. In order to bridge the developing field of computational science and empirical social research, this study aims to evaluate the performance of four topic modeling techniques, namely latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), Top2Vec, and BERTopic. In view of the interplay between human relations and digital media, this research takes Twitter posts as the reference point and assesses the performance of different algorithms concerning their strengths and weaknesses in a social science context. Based on certain details during the analytical procedures and on quality issues, this research sheds light on the efficacy of using BERTopic and NMF to analyze Twitter data.
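Comparisons like this one are typically scored with topic coherence. As an illustration of how such a comparison can be run (not the authors' exact protocol), gensim's CoherenceModel can score the top words of any of the four models on the same tokenized corpus; the tokenized and topic_words variables are placeholders:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# tokenized: placeholder list of tokenized tweets
# topic_words: top-N word lists per topic, taken from any of the four models
dictionary = Dictionary(tokenized)
cm = CoherenceModel(topics=topic_words, texts=tokenized,
                    dictionary=dictionary, coherence="c_npmi")
print(cm.get_coherence())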
Article
Full-text available
Text analytics is becoming an integral part of modern business and economic research and analysis. However, the extent to which its application is possible and accessible varies for different languages. The main goal of this paper is to outline fundamental research on text analytics applied on data in Bulgarian. A review of key research articles in two main directions is provided – development of language resources for Bulgarian and experimenting with Bulgarian text data in practical applications. By summarizing the results of a large literature review, we draw conclusions about the degree of development of the field, the availability of language resources for the Bulgarian language and the extent to which text analytics has been applied in practical problems. Future directions for research are outlined. To the best of the author’s knowledge, this is the first study providing a comprehensive overview of progress in the field of text analytics in Bulgarian.
Article
Full-text available
In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large-size models have achieved unprecedented performances, they come at high computational costs. Consequently, some of the recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while keeping nearly similar performances as achieved by their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extricate explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.
Article
Full-text available
Background COVID-19 is a scientifically and medically novel disease that is not fully understood because it has yet to be consistently and deeply studied. Among the gaps in research on the COVID-19 outbreak, there is a lack of sufficient infoveillance data. Objective The aim of this study was to increase understanding of public awareness of COVID-19 pandemic trends and uncover meaningful themes of concern posted by Twitter users in the English language during the pandemic. Methods Data mining was conducted on Twitter to collect a total of 107,990 tweets related to COVID-19 between December 13 and March 9, 2020. The analyses included frequency of keywords, sentiment analysis, and topic modeling to identify and explore discussion topics over time. A natural language processing approach and the latent Dirichlet allocation algorithm were used to identify the most common tweet topics as well as to categorize clusters and identify themes based on the keyword analysis. Results The results indicate three main aspects of public awareness and concern regarding the COVID-19 pandemic. First, the trend of the spread and symptoms of COVID-19 can be divided into three stages. Second, the results of the sentiment analysis showed that people have a negative outlook toward COVID-19. Third, based on topic modeling, the themes relating to COVID-19 and the outbreak were divided into three categories: the COVID-19 pandemic emergency, how to control COVID-19, and reports on COVID-19. Conclusions Sentiment analysis and topic modeling can produce useful information about the trends in the discussion of the COVID-19 pandemic on social media as well as alternative perspectives to investigate the COVID-19 crisis, which has created considerable public awareness. This study shows that Twitter is a good communication channel for understanding both public concern and public awareness about COVID-19. These findings can help health departments communicate information to alleviate specific public concerns about the disease.
Article
Full-text available
Research on user satisfaction has increased substantially in recent years. To date, most studies have tested the significance of pre‐defined factors thought to influence user satisfaction, with no scalable means of verifying the validity of their assumptions. Digital technology has created new methods of collecting user feedback where service users post comments. As topic models can analyze large volumes of feedback, they have been proposed as a feasible approach to aggregating user opinions. This novel approach has been applied to process reviews of primary‐care practices in England. Findings from an analysis of more than 200,000 reviews show that the quality of interactions with staff and bureaucratic exigencies are the key drivers of user satisfaction. In addition, patient satisfaction is strongly influenced by factors that are not measured by state‐of‐the‐art patient surveys. These results highlight the potential benefits of text mining and machine learning for public administration.
Article
Full-text available
Topic modelling is the new revolution in text mining. It is a statistical technique for revealing the underlying semantic structure in a large collection of documents. After analysing approximately 300 research articles on topic modelling, a comprehensive survey is presented in this paper. It includes a classification hierarchy, topic modelling methods, posterior inference techniques, different evolution models of latent Dirichlet allocation (LDA), and its applications in different areas of technology, including scientific literature, bioinformatics, software engineering, and social network analysis. A quantitative evaluation of topic modelling techniques is also presented in detail for a better understanding of the concept of topic modelling. The paper concludes with a detailed discussion of the challenges of topic modelling, which will give researchers insight for further research.
Conference Paper
Full-text available
Big Data is, clearly, an integral part of modern information societies. A vast amount of data is produced daily, and it is estimated that this volume will grow dramatically in the years to come. In an effort to transform the hidden information in this ocean of data into useful knowledge, the use of advanced technologies, such as Machine Learning, is deemed appropriate. Machine Learning is a technology that can handle Big Data classification for statistical or even more complex purposes, such as decision making. This fits perfectly with the scope of the new generation of government, Government 3.0, which explores all the new opportunities to tackle any challenge faced by contemporary societies by utilizing new technologies for data-driven decision making. Boosted by these opportunities, Machine Learning can enable more and more governments to participate in the development of such applications in different governmental domains. But is Machine Learning only beneficial for the public sector? Although there is a huge number of studies in the literature related to Machine Learning applications, there is a lack of a comprehensive study focusing on the usage of this technology within governmental applications. The current paper addresses this research question by conducting a comprehensive analysis of the use of Machine Learning by governments. Through the analysis, quite interesting findings have been identified, covering both benefits and barriers from the public sector's perspective and pinpointing a wide adoption of Machine Learning approaches in the public sector.
Article
Full-text available
User satisfaction is an important aspect to consider in any public transport system, and as such, regular and sound measurements of its levels are fundamental. However, typical evaluation schemes involve costly and time-consuming surveys. As a consequence, their frequency is not enough to properly characterize the satisfaction of the users in a timely manner. In this work, we propose a methodology, based on Twitter data, to capture the satisfaction of a large mass of users of public transport, allowing us to improve the characterization and location of their satisfaction level. We analyzed a massive volume of tweets referring to the public transport system in Santiago, Chile (Transantiago) using text mining techniques, such as sentiment analysis and topic modeling, in order to capture and group bus users’ expressions. Results show that, although the level of detail and variety of answers obtained from surveys are higher than those obtained by our method, the number of bus stops and bus services covered by the proposed scheme is larger. Moreover, the proposed methodology can be effectively used to diagnose problems in a timely manner, as it is able to identify and locate trends and issues related to bus-operating firms, whereas surveys tend to produce average answers. Based on the consistency and logic of the results, we argue that the proposed methodology can be used as a valuable complement to surveys, as both present different, but compatible, characteristics.
Conference Paper
Full-text available
Social media (i.e., Twitter, Facebook, Flickr, YouTube) and other services with user-generated content have made a staggering amount of information (and misinformation) available. Government officials seek to leverage these resources to improve services and communication with citizens. Yet, the sheer volume of social data streams generates substantial noise that must be filtered. Nonetheless, potential exists to identify issues in real time, such that emergency management can monitor and respond to issues concerning public safety. By detecting meaningful patterns and trends in the stream of messages and information flow, events can be identified as spikes in activity, while meaning can be deciphered through changes in content. This paper presents findings from a pilot study we conducted between June and December 2010 with government officials in Arlington, Virginia (and the greater National Capitol Region around Washington, DC) with a view to understanding the use of social media by government officials as well as community organizations, businesses and the public. We are especially interested in understanding social media use in crisis situations (whether severe or fairly common, such as traffic or weather crises).
Article
Topic modeling is an unsupervised machine learning technique for finding abstract topics in a large collection of documents. It helps in organizing, understanding, and summarizing large collections of textual information and discovering the latent topics that vary among documents in a given corpus. Latent Dirichlet allocation (LDA) and Non-Negative Matrix Factorization (NMF) are two of the most popular topic modeling techniques. LDA uses a probabilistic approach, whereas NMF uses a matrix factorization approach; however, newer techniques based on BERT also exist for topic modeling. In this paper, we experiment with BERTopic using different pre-trained Arabic language models as embeddings and compare its results against the LDA and NMF techniques. We used the Normalized Pointwise Mutual Information (NPMI) measure to evaluate the results of the topic modeling techniques. Overall, BERTopic generated better results compared to NMF and LDA.
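For reference, the NPMI coherence measure used here scores a pair of topic words by how much more often they co-occur than chance would predict, normalized to the range -1 to 1:

\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}

Higher values indicate more coherent topics; the word probabilities are estimated from co-occurrence counts in a reference corpus.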
Article
Transfer learning (TL) has been successfully applied to many real-world problems that traditional machine learning (ML) cannot handle, such as image processing, speech recognition, and natural language processing (NLP). Commonly, TL tends to address three main problems of traditional machine learning: (1) insufficient labeled data, (2) incompatible computation power, and (3) distribution mismatch. In general, TL can be organized into four categories: transductive learning, inductive learning, unsupervised learning, and negative learning. Furthermore, each category can be organized into four learning types: learning on instances, learning on features, learning on parameters, and learning on relations. This article presents a comprehensive survey on TL. In addition, this article presents the state of the art, current trends, applications, and open challenges.
Article
Background Since the World Health Organization (WHO) declared COVID-19 a Public Health Emergency of International Concern (PHEIC) on January 31, 2020, governments have faced a crisis demanding timely responses. The efficacy of these responses depends directly on the social behaviors of the target society. People react to these actions according to the information they receive from different channels, such as news and social networks. Thus, analyzing news provides a brief view of the information users received during the outbreak. Methods The raw data used in this study were collected from official news channels of news wires and agencies on the Telegram messenger, exceeding 2,400,000 posts. The posts quoted by NCRC members were collected, cleaned, and divided into sentences. Topic modeling and tracking are utilized in a two-stage framework, customized for this problem to separate miscellaneous sentences from those presenting concerns. The first stage is fed with embedding vectors of sentences, which are grouped by the Mapper algorithm; sentences belonging to singleton nodes are labeled as miscellaneous. The remaining sentences are vectorized using a TF-IDF weighting scheme in the second stage and topically modeled by the LDA method. Finally, relevant topics are aligned to the list of policies and actions, named topic themes, that were set up by the NCRC. Results Our results show that the major concerns, presented in about half of the sentences, are (1) PCR lab tests, diagnosis, and screening, (2) closure of the education system, and (3) awareness actions about washing hands and wearing face masks. Among the eight themes, intra-provincial travel and traffic restrictions, as well as briefings on the national and provincial status, are under-represented. The timeline of concerns, annotated with the preventive actions, illustrates the changes in concerns addressed by the NCRC. This timeline shows that although the announcements and public responses did not lag behind events, they cannot be considered timely. Furthermore, the fluctuating series of concerns reveals that the NCRC does not have a long-term response map, and members react to the most recently announced policy or act. Conclusion The results of our study can be used as a quantitative indicator for evaluating the timeliness of public responses by Iran's NCRC during the first three months of the outbreak. Moreover, they can be used in comparative studies to investigate the differences between awareness actions in various countries. Results of our custom-designed framework showed that about one-third of the discussions of the NCRC's members cover miscellaneous topics that must be removed from the data.
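The second stage described above (TF-IDF weighting followed by LDA) can be sketched in a few lines with gensim; this is a hedged illustration under assumed variable names and an assumed topic count, not the authors' implementation:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LdaModel

# sentences: placeholder list of tokenized, non-miscellaneous sentences
# retained after the Mapper-based first stage
dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]
tfidf = TfidfModel(bow)  # TF-IDF weighting scheme, as the abstract describes
lda = LdaModel(tfidf[bow], id2word=dictionary, num_topics=10)  # topic count assumed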
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
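At the core of the Transformer described here is scaled dot-product attention, which the paper defines as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; the 1/\sqrt{d_k} scaling keeps large dot products from pushing the softmax into regions with vanishing gradients.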
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
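In generative-process notation, LDA draws, for each document d, a topic mixture and then a topic and word for each position n:

\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim p(w \mid z_{d,n}, \beta),

where \alpha parameterizes the document-level topic mixtures and \beta holds the per-topic word probabilities.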
BERTopic: neural topic modeling with a class-based TF-IDF procedure
  • M Grootendorst
M. Grootendorst, "BERTopic: neural topic modeling with a class-based TF-IDF procedure," arXiv:2203.05794, 2022.
Using Twitter to infer user satisfaction with public transport: the case of Santiago, Chile
  • J T Méndez
  • H Lobel
  • D Parra
  • J C Herrera
J. T. Méndez, H. Lobel, D. Parra and J. C. Herrera, "Using Twitter to infer user satisfaction with public transport: the case of Santiago, Chile," IEEE Access, vol. 7, pp. 60255-60263, 2019.
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
  • N Reimers
  • I Gurevych
N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv:1908.10084, 2019.