Figure - available via license: Creative Commons Attribution 4.0 International
Source publication
Topic modeling is a probabilistic graphical approach for discovering latent topics in text corpora by modeling each topic as a multinomial distribution over words. Topic labeling assigns meaningful labels to the discovered topics. In this paper, we present a new topic labeling method that uses automatic term recognition to discover and assign relev...
Context in source publication
Similar publications
Artificial intelligence is changing the world, especially the interaction between machines and humans. Machines that learn, interpret, and respond to natural language have paved the way for many technologies and applications. The amalgam of machine learning, deep learning, and natural language processing has helped Conversational Artificial Intelligence (AI...
Citations
... To extract the keywords, we employ TLATR [48] using the following pipeline: ...
Real-time social media data can provide useful information on evolving hazards. Alongside traditional methods of disaster detection, the integration of social media data can considerably enhance disaster management. In this paper, we investigate the problem of detecting geolocation-content communities on Twitter and propose a novel distributed system that provides near real-time information on hazard-related events and their evolution. We show that content-based community analysis can lead to better and faster dissemination of hazard-related reports than using only traditional methods, such as satellite or airborne sensing platforms. Our distributed disaster reporting system analyzes the social relationships among worldwide geolocated tweets and applies topic modeling to group tweets by topics. Considering, for each tweet, the user, timestamp, geolocation, retweets, and replies, we create a publisher-subscriber distribution model for topics. We use content similarity and the proximity of nodes to create a new model for geolocation-content based communities. Users can subscribe to different topics in specific geographical areas or worldwide and receive real-time reports regarding these topics. As misinformation can lead to increased damage if propagated in hazard-related tweets, we propose a new deep learning model to detect fake news. The misinformed tweets are then removed from display. We also show empirically the scalability capabilities of the proposed system.
... In many cases, researchers manually choose the title to assign to the topic. In other cases, the assignment is automatic (Lau et al., 2011; Kozono & Saga, 2020; Truică & Apostol, 2021) and is based on word embedding (Bhatia et al., 2016). Word embedding represents words as vectors in an n-dimensional space and allows several mathematical operations that cannot be performed on raw text. ...
In this paper, we propose TAWE (title assignment with word embedding), a new method to automatically assign titles to topics inferred from sets of documents. This method combines the results obtained from topic modeling performed with, e.g., latent Dirichlet allocation (LDA) or other suitable methods and the word embedding representation of words in a vector space. This representation preserves the meaning of the words while making it possible to find the most suitable word to represent the topic. The procedure is twofold: first, a cleaned text is used to build the LDA model and infer a desirable number of latent topics; second, a reasonable number of words and their weights are extracted from each topic and represented in n-dimensional space using word embedding. Based on the selected weighted words, a centroid is computed, and the closest word is chosen as the title of the topic. To test the method, we used a collection of tweets about climate change downloaded from some of the main newspaper accounts on Twitter. Results showed that TAWE is a suitable method for automatically assigning a topic title.
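The centroid-based title assignment described above can be sketched in a few lines of plain Python. The 2-D embedding table, vocabulary, and topic weights below are invented for illustration only; a real system would use pretrained, high-dimensional vectors such as Word2Vec:

```python
import math

# Toy word-embedding lookup (hypothetical 2-D vectors invented for this
# example; real embeddings have hundreds of dimensions).
embeddings = {
    "climate":  (0.9, 0.1),
    "warming":  (0.8, 0.2),
    "weather":  (0.7, 0.3),
    "football": (0.1, 0.9),
}

def weighted_centroid(topic_words):
    """Compute the weighted centroid of a topic's top words."""
    total = sum(w for _, w in topic_words)
    dims = len(next(iter(embeddings.values())))
    centroid = [0.0] * dims
    for word, weight in topic_words:
        vec = embeddings[word]
        for i in range(dims):
            centroid[i] += weight / total * vec[i]
    return centroid

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def title_for_topic(topic_words):
    """Pick the vocabulary word closest to the topic centroid as its title."""
    c = weighted_centroid(topic_words)
    return max(embeddings, key=lambda w: cosine(embeddings[w], c))

# Top words (with hypothetical LDA weights) for one topic.
topic = [("climate", 0.5), ("warming", 0.3), ("weather", 0.2)]
title = title_for_topic(topic)  # "warming" lies closest to the centroid here
```

Note that the chosen title need not be the highest-weighted word: the centroid averages all top words, so a middle-ground word can win, which is exactly the behavior the embedding-based approach is after.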
... However, this is not the only method for analyzing Twitter data based on keywords, as the literature offers different options that have been considered in other works [55][56][57]. Likewise, there are also approaches to clustering documents using embeddings or topic labels [58,59]. However, employing embeddings alongside a non-specialized text clustering algorithm introduces complexity and extends the computational time. ...
Neurological disorders represent the primary cause of disability and the secondary cause of mortality globally. The incidence and prevalence of the most notable neurological disorders are growing rapidly. Considering their social and public perception on platforms like Twitter can have a huge impact on the patients, relatives, caregivers and professionals involved in the multidisciplinary management of neurological disorders. In this study, we collected and analyzed all tweets posted in English or Spanish, between 2007 and 2023, referring to headache disorders, dementia, epilepsy, multiple sclerosis, spinal cord injury or Parkinson’s disease using a search engine that has access to 100% of the publicly available tweets. The aim of our work was to deepen our understanding of the public perception of neurological disorders by addressing three major objectives: (1) analyzing the number and temporal evolution of both English and Spanish tweets discussing the most notable neurological disorders (dementias, Parkinson’s disease, multiple sclerosis, spinal cord injury, epilepsy and headache disorders); (2) determining the main thematic content of the Twitter posts and the interest they generated temporally by using topic modeling; and (3) analyzing the sentiments associated with the different topics that were previously collected. Our results show that dementias were, by far, the most commonly discussed neurological disorders on Twitter. The most discussed topics in the tweets included the impact of neurological diseases on patients and relatives, calls to increase public awareness, social support and research, activities to ameliorate disease development, and existing or potential treatments or approaches to neurological disorders. A significant number of the tweets showed negative emotions like fear, anger and sadness, while some also demonstrated positive emotions like joy.
Thus, our study shows that not only is Twitter an important and active platform implicated in the dissemination and normalization of neurological disorders, but also that the number of tweets discussing these different entities is quite inequitable, and that a greater intervention and more accurate dissemination of information by different figures and professionals on social media could help to convey a better understanding of the current state, and to project the future state, of neurological diseases for the general public.
... Event detection traditionally relies on using topic modeling and topic labeling techniques [72], [70] on sentence-level information. However, events often unfold across sentences. ...
As global digitization continues to grow, technology becomes more affordable and easier to use, and social media platforms thrive, becoming the new means of spreading information and news. Communities are built around sharing and discussing current events. Within these communities, users can share their opinions about each event. Using Sentiment Analysis to understand the polarity of each message belonging to an event, as well as the entire event, can help to better understand the general and individual feelings about significant trends and the dynamics of online social networks. In this context, we propose a new ensemble architecture, EDSA-Ensemble (Event Detection Sentiment Analysis Ensemble), that uses Event Detection and Sentiment Analysis to improve the detection of the polarity of current events from Social Media. For Event Detection, we use techniques based on Information Diffusion taking into account both the time span and the topics. To detect the polarity of each event, we preprocess the text and employ several Machine and Deep Learning models to create an ensemble model. The preprocessing step includes several word representation models: raw frequency, TF-IDF, Word2Vec, and Transformers. The proposed EDSA-Ensemble architecture improves the event sentiment classification over the individual Machine and Deep Learning models.
... If a corpus contains documents from different domains, the frequency of individual domain-specific terms is diminished, thus affecting the accuracy of the C-Value score. Equation (1) presents the mathematical formula of C-Value for extracting both single- and multi-word domain-specific terms (i.e., a) as proposed in [17], where:
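The excerpt's Equation (1) is not reproduced here. The classic C-Value formula of Frantzi et al., which the cited work presumably builds on, scores a candidate term a as log2|a| * f(a) when a is not nested inside longer candidates, and as log2|a| * (f(a) - (1/|T_a|) * sum of f(b) over b in T_a) when it is, where |a| is the term length in words, f(a) its frequency, and T_a the set of longer candidates containing a. A minimal sketch with invented frequencies (single-word terms need a smoothed variant of the log factor, since log2(1) = 0):

```python
import math

def c_value(candidates):
    """Classic C-Value scoring (Frantzi et al.) for multi-word candidates.

    `candidates` maps a term (a tuple of words) to its corpus frequency.
    A term is 'nested' if it occurs inside a longer candidate term.
    """
    scores = {}
    for a, freq in candidates.items():
        # Frequencies of longer candidates containing `a` as a contiguous run.
        nests = [f for b, f in candidates.items()
                 if len(b) > len(a)
                 and any(b[i:i + len(a)] == a
                         for i in range(len(b) - len(a) + 1))]
        base = math.log2(len(a))  # log2 |a|; single words need a smoothed variant
        if nests:
            scores[a] = base * (freq - sum(nests) / len(nests))
        else:
            scores[a] = base * freq
    return scores

# Hypothetical candidate-term frequencies, invented for illustration.
freqs = {
    ("topic", "modeling"): 12,
    ("probabilistic", "topic", "modeling"): 4,
    ("latent", "topic"): 7,
}
scores = c_value(freqs)
# ("topic", "modeling") is nested in the 3-word term, so its frequency
# is discounted: log2(2) * (12 - 4) = 8.0
```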
Automatic Term Recognition extracts terms that belong to a given domain. To be accurate, these corpus- and language-dependent methods require large volumes of textual data that need to be processed to extract candidate terms, which are afterward scored according to a given metric. To improve text preprocessing and candidate term extraction and scoring, we propose a distributed Spark-based architecture to automatically extract domain-specific terms. The main contributions are as follows: (1) propose a novel distributed automatic domain-specific multi-word term recognition architecture built on top of the Spark ecosystem; (2) perform an in-depth analysis of our architecture in terms of accuracy and scalability; (3) design an easy-to-integrate Python implementation that enables the use of Big Data processing in fields such as Computational Linguistics and Natural Language Processing. We prove empirically the feasibility of our architecture by performing experiments on two real-world datasets.
... For each new type of hazard, we add a new entry containing a list of top-k keywords and hashtags for the chosen hazardous event. [Algorithm 4: CommunityGraphs - geolocation-content based communities extraction; Input: the undirected topic graph θ and the proximity threshold ε.] To extract the keywords, we employ TLATR [55] using the following pipeline: 1) extract the topics and keywords, including hashtags; 2) label the topics. ...
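The quoted two-step pipeline can be illustrated with a toy stand-in. The real TLATR pipeline derives keywords via topic modeling and automatic term recognition, so the frequency-based extractor and the "top keyword as label" rule below are simplifications invented for the example:

```python
from collections import Counter

# Minimal stopword list for the toy example.
STOPWORDS = {"the", "a", "is", "in", "of", "and", "near", "on"}

def extract_keywords(tweets, k=3):
    """Step 1 (toy stand-in): top-k keywords and hashtags by frequency.
    TLATR itself extracts keywords via topic modeling + term recognition."""
    counts = Counter()
    for tweet in tweets:
        for tok in tweet.lower().split():
            if tok not in STOPWORDS:
                counts[tok] += 1
    return [w for w, _ in counts.most_common(k)]

def label_topic(keywords):
    """Step 2 (toy stand-in): label the topic with its top-ranked keyword."""
    return keywords[0]

# Hypothetical hazard-related tweets, invented for illustration.
tweets = [
    "Flood warning in the city #flood",
    "The flood is near the river #flood",
    "River levels rising after the flood",
]
keywords = extract_keywords(tweets)  # hashtags survive as ordinary tokens
label = label_topic(keywords)        # "flood"
```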
Real-time social media data can provide useful information on evolving hazards. Alongside traditional methods of disaster detection, the integration of social media data can considerably enhance disaster management. In this paper, we investigate the problem of detecting geolocation-content communities on Twitter and propose a novel distributed system that provides near real-time information on hazard-related events and their evolution. We show that content-based community analysis leads to better and faster dissemination of reports on hazards. Our distributed disaster reporting system analyzes the social relationships among worldwide geolocated tweets and applies topic modeling to group tweets by topics. Considering, for each tweet, the user, timestamp, geolocation, retweets, and replies, we create a publisher-subscriber distribution model for topics. We use content similarity and the proximity of nodes to create a new model for geolocation-content based communities. Users can subscribe to different topics in specific geographical areas or worldwide and receive real-time reports regarding these topics. As misinformation can lead to increased damage if propagated in hazard-related tweets, we propose a new deep learning model to detect fake news. The misinformed tweets are then removed from display. We also show empirically the scalability capabilities of the proposed system.
... Traditional event detection methods mainly use sentence-level information to identify events, usually relying on topic modeling and topic labeling techniques [64,63]. However, the information needed for detecting events is usually spread across multiple sentences, and sentence-level information is often insufficient to resolve ambiguities for some types of events. ...
As global digitization continues to grow, technology becomes more affordable and easier to use, and social media platforms thrive, becoming the new means of spreading information and news. Communities are built around sharing and discussing current events. Within these communities, users can share their opinions about each event. Using Sentiment Analysis to understand the polarity of each message belonging to an event, as well as the entire event, can help to better understand the general and individual feelings about significant trends and the dynamics of online social networks. In this context, we propose a new ensemble architecture, EDSA-Ensemble (Event Detection Sentiment Analysis Ensemble), that uses Event Detection and Sentiment Analysis to improve the detection of the polarity of current events from Social Media. For Event Detection, we use techniques based on Information Diffusion taking into account both the time span and the topics. To detect the polarity of each event, we preprocess the text and employ several Machine and Deep Learning models to create an ensemble model. The preprocessing step includes several word representation models, i.e., raw frequency, TF-IDF, Word2Vec, and Transformers. The proposed EDSA-Ensemble architecture improves the event sentiment classification over the individual Machine and Deep Learning models.
... The authors created a directed weighted graph using relevance centrality, coverage centrality, and discrimination centrality. Authors in [26] proposed TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 [27] based information extraction methods, along with a ranking method inspired by [28], over LDA and NMF topic modeling. Authors in [29] proposed an ensemble learning [30] based methodology over a truth discovery algorithm [31] that consists of a graph with topics, words, Wikipedia articles, and candidate terms as nodes, and the relationships among them as edges. ...
... Authors in [44] created clusters and, assuming the centroid of a cluster to be closest to all the entities in that cluster, considered the centroid as the candidate. The authors in [26] used an information extraction method based on BM25 and TF-IDF [27] to extract the candidate terms. Authors in [25] rely on graph-based methods for ranking the candidate terms. ...
... Authors in [43] used Kullback-Leibler Divergence [59] to rank the candidates. Authors in [26] used the C-Value method, inspired by [28], to rank the domain-specific candidate terms. Authors in [39] and [16] utilise neural embeddings for candidate ranking. ...
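As an illustration of the TF-IDF-based candidate scoring that [26] and [27] are cited for, the toy scorer below sums a smoothed TF-IDF weight per term over a handful of invented documents; real systems differ in tokenization, normalization, and smoothing, so this is only a sketch:

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Rank terms by summed TF-IDF across documents (smoothed IDF)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        for term, count in tf.items():
            # Smoothed IDF, similar in spirit to common library defaults.
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] += (count / len(toks)) * idf
    return scores

# Tiny invented corpus; "topic" recurs across documents and ranks first.
docs = [
    "topic modeling finds latent topics",
    "topic labeling assigns labels",
    "term recognition extracts terms",
]
ranking = tfidf_scores(docs).most_common()
```

Candidate terms scored this way can then be re-ranked by a domain-specificity measure such as C-Value, which is the combination the cited work describes.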
Hierarchical Topic Modeling is a probabilistic approach for discovering latent topics distributed hierarchically among documents. Each discovered topic is represented by its topic terms, but drawing an unambiguous conclusion from a topic's term distribution is a challenge for readers. Hierarchical topic labeling eases this challenge by providing an individual, appropriate label for each topic at every level. In this work, we propose a BERT-embedding-inspired methodology for labeling hierarchical topics in short text corpora. Short texts have gained significant popularity on multiple platforms in diverse domains, but the limited information they contain makes them difficult to deal with. In our work, we use three diverse short text datasets that include both structured and unstructured instances; such diversity ensures the broad application scope of this work. Considering the relevancy of the labels, the proposed methodology was compared against both automatic and human annotators and outperformed the benchmark with average scores of 0.4185, 49.50, and 49.16 for cosine similarity, exact match, and partial match, respectively.
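The exact-match and partial-match figures reported above can be understood through one simple formulation (this particular definition is an assumption for illustration; the paper may compute the metrics differently): exact match checks identity between predicted and gold label, while partial match checks for any shared token, each averaged over label pairs as a percentage.

```python
def exact_match(pred, gold):
    """1 if the predicted label equals the gold label exactly, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def partial_match(pred, gold):
    """1 if the predicted and gold labels share at least one token, else 0."""
    return int(bool(set(pred.lower().split()) & set(gold.lower().split())))

def score(pairs, metric):
    """Average a 0/1 metric over (predicted, gold) label pairs, in percent."""
    return 100 * sum(metric(p, g) for p, g in pairs) / len(pairs)

# Hypothetical (predicted, gold) topic-label pairs, invented for the example.
pairs = [
    ("climate change", "climate change"),   # exact and partial match
    ("global warming", "climate warming"),  # partial match only ("warming")
    ("sports", "politics"),                 # neither
]
em = score(pairs, exact_match)    # 1 of 3 pairs -> ~33.3
pm = score(pairs, partial_match)  # 2 of 3 pairs -> ~66.7
```

Under this reading, scores like 49.50 for exact match and 49.16 for partial match would mean roughly half of the predicted labels agree with the gold labels at the respective strictness level.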
... Variations are still being developed and used more recently as well, e.g., (Kosa et al., 2020; Steingrímsson et al., 2020; Truică and Apostol, 2021). However, (supervised) machine learning methods have become more popular for automatic term extraction, just as for most other areas in natural language processing. ...
This contribution presents D-Terminer: an open access, online demo for monolingual and multilingual automatic term extraction from parallel corpora. The monolingual term extraction is based on a recurrent neural network, with a supervised methodology that relies on pretrained embeddings. Candidate terms can be tagged in their original context and there is no need for a large corpus, as the methodology will work even for single sentences. With the bilingual term extraction from parallel corpora, potentially equivalent candidate term pairs are extracted from translation memories and manual annotation of the results shows that good equivalents are found for most candidate terms. Accompanying the release of the demo is an updated version of the ACTER Annotated Corpora for Term Extraction Research (version 1.5).