Article

# A statistical interpretation of term specificity and its application in retrieval

Author: Karen Spärck Jones

## No full-text available

... The term frequency-inverse document frequency (tf-idf) metric is employed to evaluate the domain relevance of each noun after preprocessing [13,52,53]. Note that after the preprocessing steps, the documents contain only nouns. ...
... Tf-idf treats each word as a unigram; word order and document order are not considered. The tf-idf weighting w of a word t, in a document d belonging to a corpus of documents c, is a value computed as [13,52,53] ...
... where count_{t,d} is the total count of a word t in a given document d, and the inverse document frequency idf_{t,c} is given by [13,53] idf_{t,c} = log10(N_c / (1 + df_{t,c})) + 1 (3), where N_c is the total number of documents in the corpus c, and the document frequency df_{t,c} is the number of documents in the corpus c that contain the word t [13]. In this study, the tf-idf weighting is normalised to the range 0 ≤ w_{t,d} ≤ 1 by dividing by the largest tf-idf weighting in a document. ...
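The weighting described in this excerpt can be sketched in a few lines of Python; the toy corpus of noun-only documents and the tokenisation are illustrative, not the paper's data:

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """Tf-idf as described in the excerpt: w_{t,d} = count_{t,d} * idf_{t,c},
    with idf_{t,c} = log10(N_c / (1 + df_{t,c})) + 1, then normalised so the
    largest weight in each document is 1."""
    n_docs = len(corpus)
    df = Counter()                      # document frequency df_{t,c}
    for doc in corpus:
        df.update(set(doc))
    weighted = []
    for doc in corpus:
        counts = Counter(doc)           # count_{t,d}
        w = {t: c * (math.log10(n_docs / (1 + df[t])) + 1)
             for t, c in counts.items()}
        top = max(w.values())           # normalise to 0 <= w_{t,d} <= 1
        weighted.append({t: v / top for t, v in w.items()})
    return weighted

# Toy corpus of pre-processed (noun-only) documents, purely illustrative.
docs = [["engine", "turbine", "engine"],
        ["turbine", "blade"],
        ["engine", "report"]]
weights = tfidf_weights(docs)
```

Note that with the `1 + df` smoothing in the denominator, a term occurring in every document can get an idf slightly below 1, which is exactly why the study renormalises per document.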
Preprint
Full-text available
The design of complex engineering systems is an often long and articulated process that relies heavily on engineers' expertise and professional judgment. As such, the typical pitfalls of activities involving the human factor often manifest themselves in terms of lack of completeness or exhaustiveness of the analysis, inconsistencies across design choices or documentation, as well as an implicit degree of subjectivity. An approach is proposed to assist systems engineers in the automatic generation of systems diagrams from unstructured natural language text. Natural Language Processing (NLP) techniques are used to extract entities and their relationships from textual resources (e.g., specifications, manuals, technical reports, maintenance reports) available within an organisation, and convert them into Systems Modelling Language (SysML) diagrams, with particular focus on structure and requirement diagrams. The intention is to provide users with a more standardised, comprehensive and automated starting point onto which to subsequently refine and adapt the diagrams according to their needs. The proposed approach is flexible and open-domain. It consists of six steps which leverage open-access tools, and it leads to an automatic generation of SysML diagrams without intermediate modelling requirements, but through the specification of a set of parameters by the user. The applicability and benefits of the proposed approach are shown through six case studies having different textual sources as inputs, and benchmarked against manually defined diagram elements.
... Bergelid (2018) compared the performance of several classical ML algorithms, namely linear Support Vector Machine (LSVM) (Cortes & Vapnik, 1995), Multinomial Naive Bayes (MNB) (Kibriya et al., 2004), k-Nearest Neighbors (KNN) (Fix et al., 1951) and Random Forest (RF) (Rokach, 2010), on a corpus consisting of 25,441 English song lyrics (13% of which were marked as explicit). Lyrics were initially preprocessed with vectorization techniques (TF-IDF (Sparck Jones, 1988), Doc2Vec (Le & Mikolov, 2014)) in order to extract the features for the classifiers. The highest scores were achieved with LSVM and MNB, starting from TF-IDF vectors (F1 = 0.677). ...
... Lyrics have to be converted into numerical features, as LR expects fixed-size numerical vectors in input. A common strategy is to apply a bag-of-words (BOW) vectorization technique such as TF-IDF (Sparck Jones, 1988). ...
Article
Full-text available
Preventing the reproduction of songs whose textual content is offensive or inappropriate for kids is an important issue in the music industry. In this paper, we investigate the problem of assessing whether music lyrics contain content unsuitable for children (a.k.a., explicit content). Previous works that have computationally tackled this problem have dealt with English or Korean songs, comparing the performance of various machine learning approaches. We investigate the automatic detection of explicit lyrics for Italian songs, complementing previous analyses performed on different languages. We assess the performance of many classifiers, including those–not fully exploited so far for this task–leveraging neural language models, i.e., rich language representations built from textual corpora in an unsupervised way, that can be fine-tuned on various natural language processing tasks, including text classification. For the comparison of the different systems, we exploit a novel dataset we contribute, consisting of approximately 34K songs, annotated with labels indicating explicit content. The evaluation shows that, on this dataset, most of the classifiers built on top of neural language models perform substantially better than non-neural approaches. We also provide further analyses, including: a qualitative assessment of the predictions produced by the classifiers, an assessment of the performance of the best performing classifier in a few-shot learning scenario, and the impact of dataset balancing.
... This version used a score ranging from A+ to C to measure the effectiveness of the Level Check, which was removed from the framework for the GRI G4 version. As such, Liu et al. [42] utilise the term frequency-inverse document frequency (TF-IDF) [43] method to obtain important and specific terms for different analytical algorithms and shallow machine learning models. The previously described methods, and other more recent ones, have been applied successfully to other problems, such as textual similarity in legal court case reports [44], biomedical texts from scholarly articles and medical databases [45,46], or network analytic approaches for assessing the performance of family businesses in tourism [47]. ...
... A comparison by topic cannot be made, due to the total imposition of one topic over the others, as in the standard emissions and the example company. The topics are very similar and difficult to catalogue at first inspection. ...
Preprint
This paper investigates if Corporate Social Responsibility (CSR) reports published by a selected group of Nordic companies are aligned with the Global Reporting Initiative (GRI) standards. To achieve this goal, several natural language processing, and text mining techniques were implemented and tested. We extracted strings, corpus, and hybrid semantic similarities from the reports and evaluated the models through the intrinsic assessment methodology. A quantitative ranking score based on index matching was developed to complement the semantic valuation. The final results show that Latent Semantic Analysis (LSA) and Global Vectors for Word Representation (GloVE) are the best methods for our study. Our findings will open the door to the automatic evaluation of sustainability reports which could have a strong impact on the environment.
... To determine the trending topics, this paper uses a new method adopted from a popular classic technique in natural language processing (NLP) called TF-IDF. The main idea behind TF-IDF is to determine the frequency of words as well as the commonness of each word [3] [12]. This paper uses the same idea for calculating commonness in the domain of trending topics. ...
... The measure of term specificity first proposed in 1972 by Karen Spärck Jones later became known as inverse document frequency, or IDF. The paper Karen Spärck Jones published was called "A statistical interpretation of term specificity and its application in retrieval" [12]. Based on the term frequency normalization method proposed there, IDF and several other methods for document sorting and scoring have been developed. ...
Preprint
Full-text available
A comprehensive literature review has always been an essential first step of every meaningful research. In recent years, however, the availability of a vast amount of information in both open-access and subscription-based literature in every field has made it difficult, if not impossible, to be certain about the comprehensiveness of one's survey. This subsequently can lead to reviewers' questioning of the novelties of the research directions proposed, regardless of the quality of the actual work presented. In this situation, statistics derived from the published literature data can provide valuable quantitative and visual information about research trends, knowledge gaps, and research networks and hubs in different fields. Our tool provides an automatic and rapid way of generating insight for systematic reviews in any research area.
... In addition to the representations obtained from the graph encoder, we use additional features from text data to better learn the relations between entities. Here, we consider three types of features: (a) relevance scores between the descriptions of node pairs obtained from information retrieval (IR) algorithms; we use BM-25 (Robertson et al., 1995), classic TF/IDF (Jones, 1972), as well as the DFR-H and DFR-Z (Amati and Van Rijsbergen, 2002) models. These IR models capture lexical similarities and relevance between node pairs through different approaches; (b) we also use the initial text embeddings of nodes (x_i, ∀i) as additional features, because the direct use of these embeddings at the prediction layer can avoid information loss in the iterative process of learning node embeddings for graph data; and (c) if other textual information exists for a given node pair, e.g., a sentence mentioning the node pair as in Figure 1, we use the embeddings of such information as additional features. ...
Preprint
Full-text available
We present a generic and trend-aware curriculum learning approach for graph neural networks. It extends existing approaches by incorporating sample-level loss trends to better discriminate easier from harder samples and schedule them for training. The model effectively integrates textual and structural information for relation extraction in text graphs. Experimental results show that the model provides robust estimations of sample difficulty and shows sizable improvement over the state-of-the-art approaches across several datasets.
... The considered "School corpus" consists of 731,156 words, of which 47,165 are unique words. The TF-IDF (term frequency-inverse document frequency) [13] method was used to automatically detect stopwords. To do this, for each of the 47,165 unique words, its frequency was determined (the number of occurrences in the texts of the School Corpus), along with the inverse document frequency IDF(word) = ln(n/m), where n = 25 is the number of documents and m is the number of documents containing the unique word among the 25 documents. ...
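The stopword-detection idea in this excerpt reduces to flagging words with a small IDF(word) = ln(n/m). A minimal sketch, where the toy documents (sets of words) and the threshold are illustrative, not the paper's actual corpus or cut-off:

```python
import math

def low_idf_words(documents, threshold=0.2):
    """Candidate stopwords are words whose IDF(word) = ln(n/m) is small,
    i.e. words that occur in (almost) all documents. The corpus and
    threshold used below are illustrative, not the paper's values."""
    n = len(documents)
    idf = {}
    for word in set().union(*documents):
        m = sum(1 for doc in documents if word in doc)  # docs containing word
        idf[word] = math.log(n / m)
    return sorted(w for w, v in idf.items() if v < threshold)

# Three toy "documents" as word sets; "va" ("and") appears in all of them.
docs = [{"va", "maktab", "kitob"}, {"va", "maktab"}, {"va", "daraxt"}]
stopwords = low_idf_words(docs)
```

A word present in every document gets IDF = ln(1) = 0 and is flagged; rarer words get strictly positive IDF and survive.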
Article
Full-text available
Filtering stop words is an important task when processing text queries to search for information in large data sets. It enables a reduction of the search space without losing the semantic meaning. The stop words, which have only grammatical roles and do not contribute to information content, still add to the complexity of the query. Existing mathematical models that are used to tackle this problem are not suitable for all families of natural languages [1]. For example, they do not cover families of languages to which Uzbek can be included. In the present work, the collocation method for this problem is offered for families of languages that include the Uzbek language as well. This method concerns the so-called agglutinative languages, in which the task of recognizing stop words is much more difficult, since the stop words are "masked" in the text. In this work the unigram, the bigram and the collocation methods are applied to the "School corpus", which corresponds to the type of languages being studied.
... We trained a Spark CountVectorizer Model, which encodes each tokenized tweet as a sparse vector, where the presence of a word is encoded in binary (i.e. 1 for present, 0 otherwise). Although we tested Term Frequency-Inverse Document Frequency (TF-IDF), a term weighting strategy developed in the early 1970s and still used in the majority of NLP applications today [59], as well as normal (absolute) vector encodings of the dataset, the decision was made to utilize one-hot vector encoding as a result of two factors: our dataset is made up of short (typically <140 character) tweets which do not typically repeat terms, and, as LDA is a word-generating model, TF-IDF score representations are not readily interpretable. Additionally, the CountVectorizer Model excludes words appearing above and below a specified threshold of presence. ...
Article
Full-text available
In an effort to gauge the global pandemic’s impact on social thoughts and behavior, it is important to answer the following questions: (1) What kinds of topics are individuals and groups vocalizing in relation to the pandemic? (2) Are there any noticeable topic trends and if so how do these topics change over time and in response to major events? In this paper, through the advanced Sequential Latent Dirichlet Allocation model, we identified twelve of the most popular topics present in a Twitter dataset collected over the period spanning April 3rd to April 13th, 2020 in the United States and discussed their growth and changes over time. These topics were both robust, in that they covered specific domains, not simply events, and dynamic, in that they were able to change over time in response to rising trends in our dataset. They spanned politics, healthcare, community, and the economy, and experienced macro-level growth over time, while also exhibiting micro-level changes in topic composition. Our approach differentiated itself in both scale and scope to study the emerging topics concerning COVID-19 at a scale that few works have been able to achieve. We contributed to the cross-sectional field of urban studies and big data. Whereas we are optimistic towards the future, we also understand that this is an unprecedented time that will have lasting impacts on individuals and society at large, impacting not only the economy or geo-politics, but human behavior and psychology. Therefore, in more ways than one, this research is just beginning to scratch the surface of what will be a concerted research effort into studying the history and repercussions of COVID-19.
... An overview of this process is illustrated in Fig. 7. However, as several visual words may occur more frequently than others, the term-frequency inverse-document-frequency (TF-IDF) scheme [191] has been adopted to weight each database element. This way, each visual word is associated with a product proportional to the number of occurrences in a given image (term frequency) and inversely proportional to its instances in the training set (inverse document frequency). ...
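The visual-word weighting this excerpt describes can be sketched directly: each histogram entry becomes the word's frequency in the image times the log of its inverse frequency over the image set. The histograms below are illustrative (index = visual word id), and ln is one common choice of logarithm for this scheme:

```python
import math

def tfidf_weight_histograms(histograms):
    """Weight bag-of-visual-words histograms: each entry becomes the
    visual word's frequency in the image (term frequency) times
    ln(N / N_i), where N is the number of images and N_i is the number
    of images containing that visual word (inverse document frequency)."""
    n_images = len(histograms)
    n_words = len(histograms[0])
    # N_i: images in which each visual word occurs at least once
    occ = [sum(1 for h in histograms if h[i] > 0) for i in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h)
        weighted.append([(h[i] / total) * math.log(n_images / occ[i])
                         if occ[i] else 0.0
                         for i in range(n_words)])
    return weighted

hists = [[2, 1, 0], [1, 0, 1], [3, 0, 0]]   # 3 images, 3 visual words
weighted = tfidf_weight_histograms(hists)
```

A visual word occurring in every image receives weight ln(1) = 0, so ubiquitous words stop contributing to matching, which is the point of the scheme.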
Article
Full-text available
Where am I? This is one of the most critical questions that any intelligent system should answer to decide whether it navigates to a previously visited area. This problem has long been acknowledged for its challenging nature in simultaneous localization and mapping (SLAM), wherein the robot needs to correctly associate the incoming sensory data to the database allowing consistent map generation. The significant advances in computer vision achieved over the last 20 years, the increased computational power, and the growing demand for long-term exploration contributed to efficiently performing such a complex task with inexpensive perception sensors. In this article, visual loop closure detection, which formulates a solution based solely on appearance input data, is surveyed. We start by briefly introducing place recognition and SLAM concepts in robotics. Then, we describe a loop closure detection system’s structure, covering an extensive collection of topics, including the feature extraction, the environment representation, the decision-making step, and the evaluation process. We conclude by discussing open and new research challenges, particularly concerning the robustness in dynamic environments, the computational complexity, and scalability in long-term operations. The article aims to serve as a tutorial and a position paper for newcomers to visual loop closure detection.
... This is modelled by the inverse document frequency (IDF) [123,141]. The idea was first introduced by Spärck Jones already in 1972 [155], who defines the 'specificity' of a term as a function that is inverse to the number of documents in which the term occurs. Given a corpus of documents (a collection of word sequences) C = {S_{W,1}, S_{W,2}, . . . ...
Thesis
Full-text available
Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms as they require a static and fixed-length input. Moreover, also for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired from the bag-of-words method in natural language processing, forming a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis.
The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits, while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and their fusion improves the performance of a machine listening system.
... Since the late 1950s, the field of IR has evolved through several relevant works, such as: H.P. Luhn's work [52]; the SMART system by Salton and his students [92], where important IR concepts (such as the vector space model and relevance feedback) were developed; Cleverdon's Cranfield evaluation model [11]; Sparck Jones' development of idf [40]; and the probabilistic models by Robertson [89] and Croft [14,103]. In 1992, the beginning of the Text REtrieval Conference (TREC) [65], which provided the necessary infrastructure for large-scale evaluation, allowed the modification of old models/techniques and the proposal of new ones. ...
Preprint
Full-text available
This report provides an overview of the field of Information Retrieval (IR) in healthcare. It does not aim to introduce general concepts and theories of IR but to present and describe specific aspects of Health Information Retrieval (HIR). After a brief introduction to the more broader field of IR, the significance of HIR at current times is discussed. Specific characteristics of Health Information, its classification and the main existing representations for health concepts are described together with the main products and services in the area (e.g.: databases of health bibliographic content, health specific search engines and others). Recent research work is discussed and the most active researchers, projects and research groups are also presented. Main organizations and journals are also identified.
... To address this binary classification problem, i.e. to identify SMSs with (spam) or without (ham) spam content, we will need to transform the sentences into vectors. We will use the vectorisation technique of Term Frequency-Inverse Document Frequency (TF-IDF) [72], which is widely acknowledged by many researchers because it is simple and efficient. ...
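The vectorisation step this excerpt relies on can be sketched without any library: build a vocabulary, then map each message to a fixed-size vector of term counts scaled by idf. The example messages, the whitespace tokenisation and the smoothed idf form ln(N / (1 + df)) + 1 are illustrative choices, not the paper's exact setup:

```python
import math
from collections import Counter

def tfidf_vectors(messages):
    """Turn short messages into fixed-size tf-idf vectors suitable as
    classifier input (e.g. for a spam/ham classifier)."""
    tokenized = [m.lower().split() for m in messages]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # df: number of messages containing each word; idf is smoothed
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    idf = {w: math.log(n / (1 + df[w])) + 1 for w in vocab}
    return vocab, [[Counter(doc)[w] * idf[w] for w in vocab]
                   for doc in tokenized]

msgs = ["win a free prize now", "are you free for lunch", "call now to win"]
vocab, vectors = tfidf_vectors(msgs)
```

Each vector has one component per vocabulary word, so every message maps to the same dimensionality regardless of its length, which is exactly what a classifier expecting fixed-size input needs.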
Article
Full-text available
Artificial Intelligence (AI) is having an enormous impact on the rise of technology in every sector. Indeed, AI-powered systems are monitoring and deciding on sensitive economic and societal issues. The future is moving towards automation, and we must not prevent it. Many people, though, have opposing views because of the fear of uncontrollable AI systems. This concern could be reasonable if it originated from considerations associated with social issues, like gender-biased or obscure decision-making systems. Explainable AI (XAI) is a tremendous step towards reliable systems, enhancing the trust of people in AI. Interpretable machine learning (IML), a subfield of XAI, is also an urgent topic of research. This paper presents a small but significant contribution to the IML community. We focus on a local-based, neural-specific interpretation process applied to textual and time series data. Therefore, the proposed technique, which we call “LioNets”, introduces novel approaches to present feature importance-based interpretations. We propose an innovative way to produce counterfactual words in textual datasets. Through a set of quantitative and qualitative experiments, we present competitiveness of LioNets compared to other techniques and suggest its usefulness.
... First, Term Frequency-Inverse Document Frequency (TF-IDF), a term weighting scheme (TWS), topped the list in chronological order of research related to assigning weights to terms [24]. IDF is one of the pioneering strategies for setting term weights, adapted from studies in the area of information retrieval and used in the text classification task. As proposed by Karen Spärck Jones, it implies that the assignment of weights to terms should take the collection frequency factor (IDF) into account in order to utilize the terms effectively. ...
Article
Full-text available
The increased volume of data due to advancements in the internet and relevant technology makes text classification of text documents a popular demand. Providing better representations of the feature vector by setting appropriate term weight values using supervised term weighting schemes improves classification performance in classifying text documents. A state-of-the-art term weighting scheme, MONO, with variants TF-MONO and SRTF-MONO, improves text classification by considering the values of non-occurrences. However, the MONO strategy suffers setbacks in weighting terms with non-uniformity values in its terms' interclass distinguishing power. In this study, extended max-occurrence with normalized non-occurrence (EMONO), with variants TF-EMONO and SRTF-EMONO, is proposed, where the EMO value is determined as an interclass extension of MO. This improvement addresses the problematic weighting behavior of MONO, which neglects the occurrence of classes with short-distance document frequency in non-uniformity values. The proposed schemes' classification performance is compared with the MONO variants on the Reuters-21578 dataset with the KNN classifier. Chi-square-max was used to conduct experiments on different feature sizes using micro-F1 and macro-F1. The results of the experiments explicitly showed that the proposed EMONO outperforms the variants of the MONO strategy at all feature sizes, with an EMO parameter value of 2 setting the number of classes in the MO extension. SRTF-EMONO showed the best performance, with micro-F1 scores of 94.85% and 95.19% for the smallest and largest feature sizes, respectively. Moreover, this study also emphasizes the significance of interclass document frequency values in improving text classification, aside from non-occurrence values in term weighting schemes.
... From the pre-processed meta-tag words, a unigram bag of words (BOW) [15] is weighted with each word's inverse document frequency [17] to yield a 12244-D meta-tag feature vector for each domain. As described in Section 3.5, this feature vector is then pruned down to the top 500 most diagnostic words. ...
Preprint
Full-text available
How, in 20 short years, did we go from the promise of the internet to democratize access to knowledge and make the world more understanding and enlightened, to the litany of daily horrors that is today's internet? We are awash in disinformation consisting of lies, conspiracies, and general nonsense, all with real-world implications ranging from horrific human rights violations to threats to our democracy and global public health. Although the internet is vast, the peddlers of disinformation appear to be more localized. To this end, we describe a domain-level analysis for predicting if a domain is complicit in distributing or amplifying disinformation. This process analyzes the underlying domain content and the hyperlinking connectivity between domains to predict if a domain is peddling in disinformation. These basic insights extend to an analysis of disinformation on Telegram and Twitter. From these insights, we propose that search engines and social-media recommendation algorithms can systematically discover and demote the worst disinformation offenders, returning some trust and sanity to our online communities.
... This matrix determines the relevancy of words in a given folktale by combining the terms' frequencies, and the inverse document frequency which measures the rarity of a word across documents. Tf-idf was originally developed by Luhn (1957) and Jones (1972). Principal Component Analysis (PCA), which was originally proposed by Hotelling, is then performed on the matrix (Hotelling, 1933). ...
Preprint
Full-text available
This paper employs two major natural language processing techniques, topic modeling and clustering, to find patterns in folktales and reveal cultural relationships between regions. In particular, we used Latent Dirichlet Allocation and BERTopic to extract the recurring elements, as well as K-means clustering to group folktales. Our paper tries to answer the question of what the similarities and differences between folktales are, and what they say about culture. Here we show that the common trends between folktales are family, food, traditional gender roles, mythological figures, and animals. Also, folktale topics differ based on geographical location, with folktales found in different regions having different animals and environments. We were not surprised to find that religious figures and animals are some of the common topics in all cultures. However, we were surprised that European and Asian folktales were often paired together. Our results demonstrate the prevalence of certain elements in cultures across the world. We anticipate our work to be a resource for future research on folktales and an example of using natural language processing to analyze documents in specific domains. Furthermore, since we only analyzed the documents based on their topics, more work could be done in analyzing the structure, sentiment, and characters of these folktales.
... Explicitly identifying and removing this information would have been preferable, yet such an approach was obstructed by the (variable) structure of judgments on BAILII (see Sect. 4). TF-IDF is a well-established approach in the field of NLP (Jones, 1972) that has been frequently applied to judgment classification tasks (see Sect. 2). The approach is based on a bag-of-words assumption, which takes no account of the relationship between terms. ...
Article
Full-text available
Judgments concerning animals have arisen across a variety of established practice areas. There is, however, no publicly available repository of judgments concerning the emerging practice area of animal protection law. This has hindered the identification of individual animal protection law judgments and comprehension of the scale of animal protection law made by courts. Thus, we detail the creation of an initial animal protection law repository using natural language processing and machine learning techniques. This involved domain expert classification of 500 judgments according to whether or not they were concerned with animal protection law. 400 of these judgments were used to train various models, each of which was used to predict the classification of the remaining 100 judgments. The predictions of each model were superior to a baseline measure intended to mimic current searching practice, with the best performing model being a support vector machine (SVM) approach that classified judgments according to term frequency—inverse document frequency (TF-IDF) values. Investigation of this model consisted of considering its most influential features and conducting an error analysis of all incorrectly predicted judgments. This showed the features indicative of animal protection law judgments to include terms such as ‘welfare’, ‘hunt’ and ‘cull’, and that incorrectly predicted judgments were often deemed marginal decisions by the domain expert. The TF-IDF SVM was then used to classify non-labelled judgments, resulting in an initial animal protection law repository. Inspection of this repository suggested that there were 175 animal protection judgments between January 2000 and December 2020 from the Privy Council, House of Lords, Supreme Court and upper England and Wales courts.
... Considering city c1 and using the Term Frequency-Inverse Document Frequency (tf-idf) measure [10], it is possible to extract the most informative terms contained in each review document, with respect to the total collection of review documents for that city. Similarly, the most informative terms of each review document in city c2 can be extracted with respect to the total collection of review documents for that city. ...
... We compare against two kinds of strong baselines to give a comprehensive evaluation of the performance of HyperMatch: unsupervised keyphrase extraction models (e.g., TextRank (Mihalcea and Tarau, 2004) and TFIDF (Jones, 2004)) and supervised keyphrase extraction models (e.g., classification- and ranking-based variants of BERT (Sun et al., 2020)). Notably, HyperMatch extracts keyphrases without using additional features on the OpenKP dataset. ...
Preprint
Keyphrase extraction is a fundamental task in natural language processing and information retrieval that aims to extract a set of phrases with important information from a source document. Identifying important keyphrases is the central component of the keyphrase extraction task, and its main challenge is how to represent information comprehensively and discriminate importance accurately. In this paper, to address these issues, we design a new hyperbolic matching model (HyperMatch) to represent phrases and documents in the same hyperbolic space and explicitly estimate the phrase-document relevance via the Poincaré distance as the importance score of each phrase. Specifically, to capture the hierarchical syntactic and semantic structure information, HyperMatch takes advantage of the hidden representations in multiple layers of RoBERTa and integrates them as the word embeddings via an adaptive mixing layer. Meanwhile, considering the hierarchical structure hidden in the document, HyperMatch embeds both phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. This strategy can further enhance the estimation of phrase-document relevance due to the good properties of hyperbolic space. In this setting, keyphrase extraction can be cast as a matching problem and effectively implemented by minimizing a hyperbolic margin-based triplet loss. Extensive experiments are conducted on six benchmarks and demonstrate that HyperMatch outperforms the state-of-the-art baselines.
... Lerch and Mezini [5] propose to use the approach based on the term frequency and the inverse document frequency (TF-IDF) [12]. In their work, each stack trace is tokenized, and the similarity value for the incoming stack trace q and the given stack trace d is calculated according to the following formula: ...
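The formula itself is truncated in the excerpt above; a common instantiation of the idea, sketched here with invented frame names, weights each stack frame by tf-idf and ranks stored traces by cosine similarity to the incoming one.

```python
import math
from collections import Counter

# Invented corpus of stack traces, each a list of frame identifiers.
traces = [
    ["java.util.ArrayList.get", "com.app.Cache.load", "com.app.Main.run"],
    ["java.util.ArrayList.get", "com.app.Db.query", "com.app.Main.run"],
    ["com.app.Ui.render", "com.app.Ui.paint", "com.app.Main.run"],
]
n = len(traces)
df = Counter(frame for trace in traces for frame in set(trace))
idf = {frame: math.log(1 + n / df[frame]) for frame in df}

def weights(trace):
    """tf-idf weight per frame; rare frames dominate the comparison."""
    tf = Counter(trace)
    return {f: tf[f] * idf[f] for f in tf}

def similarity(q, d):
    """Cosine similarity between two tf-idf-weighted stack traces."""
    wq, wd = weights(q), weights(d)
    dot = sum(w * wd.get(f, 0.0) for f, w in wq.items())
    nq = math.sqrt(sum(w * w for w in wq.values()))
    nd = math.sqrt(sum(w * w for w in wd.values()))
    return dot / (nq * nd)

incoming = ["java.util.ArrayList.get", "com.app.Cache.load", "com.app.Main.run"]
best = max(traces, key=lambda d: similarity(incoming, d))
```

The `log(1 + n/df)` smoothing is one of several variants; the cited work uses a Lucene-style scoring, so treat this only as the general shape of the computation.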
Preprint
Full-text available
The automatic collection of stack traces in bug tracking systems is an integral part of many software projects and their maintenance. However, such reports often contain a lot of duplicates, and the problem of de-duplicating them into groups arises. In this paper, we propose a new approach to solve the deduplication task and report on its use on the real-world data from JetBrains, a leading developer of IDEs and other software. Unlike most of the existing methods, which assign the incoming stack trace to a particular group in which a single most similar stack trace is located, we use the information about all the calculated similarities to the group, as well as the information about the timestamp of the stack traces. This approach to aggregating all available information shows significantly better results compared to existing solutions. The aggregation improved the results over the state-of-the-art solutions by 15 percentage points in the Recall Rate Top-1 metric on the existing NetBeans dataset and by 8 percentage points on the JetBrains data. Additionally, we evaluated a simpler k-Nearest Neighbors approach to aggregation and showed that it cannot reach the same levels of improvement. Finally, we studied what features from the aggregation contributed the most towards better quality to understand which of them to develop further. We publish the implementation of the suggested approach, and will release the newly collected industrial dataset upon acceptance to facilitate further research in the area.
... Now, with the selected terms from the DFS method, the initial feature vectors are formed. The weight of each feature (term) in a feature vector would then be computed with the Term Frequency-Inverse Document Frequency (TF-IDF) (Jones, 2004; Lan et al., 2009) weighting method, as presented in equation 3. ...
Article
Full-text available
Due to the rapid growth of the Internet, large amounts of unlabelled textual data are produced daily. Clearly, finding the subject of a text document is a primary source of information in text processing applications. In this paper, a text classification method is presented and evaluated for Persian and English. The proposed technique utilizes the variance of fuzzy similarity besides discriminative and semantic feature selection methods. Discriminative features are those that distinguish categories with higher power, and the concept of semantic features brings into the calculation the similarity between features and documents by using only the available documents. In the proposed method, incorporating fuzzy weighting as a measure of similarity is presented. The fuzzy weights are derived from the concept of fuzzy similarity, which is defined as the variance of membership values of a document to all categories, such that a document can have some membership value in each category at the same time, with the sum of these membership values equal to 1. The proposed document classification method is evaluated on three datasets (one Persian and two English datasets) and two classification methods, support vector machine (SVM) and artificial neural network (ANN), are used. A comparison of the results with other text classification methods demonstrates the consistent superiority of the proposed technique in all cases. The weighted average F-measures of our method are 82% and 97.8% in the classification of Persian and English documents, respectively.
... The proposed conceptual framework by [30] argues for a categorization of retrieval models into two dimensions: supervised vs. unsupervised and dense vs. sparse representations. An unsupervised sparse representation model such as BM25 [51] or TF-IDF [23] represents each document and query with a sparse vector with the dimension of the collection's vocabulary, having many zero weights due to non-occurring terms. Since the weights of each term are calculated using term statistics, they are considered unsupervised methods. ...
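The sparsity of such vocabulary-dimensional vectors is easy to see in a toy sketch; the documents are invented and raw counts stand in for the term weights.

```python
from collections import Counter

# Invented mini-collection; the vocabulary spans all documents.
docs = [
    "retrieval models rank documents for a query",
    "dense retrieval encodes queries with neural networks",
    "sparse vectors store one weight per vocabulary term",
]
vocab = sorted({t for d in docs for t in d.split()})

def to_vector(doc):
    """One dimension per vocabulary term; raw counts stand in for tf-idf weights."""
    tf = Counter(doc.split())
    return [tf.get(t, 0) for t in vocab]

vec = to_vector(docs[0])
zeros = sum(1 for w in vec if w == 0)
print(len(vocab), zeros)  # prints: 21 14
```

Even in this tiny collection two thirds of each vector is zero; with a realistic vocabulary of hundreds of thousands of terms, almost every weight is zero, which is why such models are stored as sparse (inverted-index) structures.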
Preprint
Full-text available
Ranking responses for a given dialogue context is a popular benchmark in which the setup is to re-rank the ground-truth response over a limited set of $n$ responses, where $n$ is typically 10. The predominance of this setup in conversation response ranking has led to a great deal of attention to building neural re-rankers, while the first-stage retrieval step has been overlooked. Since the correct answer is always available in the candidate list of $n$ responses, this artificial evaluation setup assumes that there is a first-stage retrieval step which is always able to rank the correct response in its top-$n$ list. In this paper we focus on the more realistic task of full-rank retrieval of responses, where $n$ can be up to millions of responses. We investigate both dialogue context and response expansion techniques for sparse retrieval, as well as zero-shot and fine-tuned dense retrieval approaches. Our findings based on three different information-seeking dialogue datasets reveal that a learned response expansion technique is a solid baseline for sparse retrieval. We find the best performing method overall to be dense retrieval with intermediate training, i.e. a step after the language model pre-training where sentence representations are learned, followed by fine-tuning on the target conversational data. We also investigate the intriguing phenomenon that harder negative sampling techniques lead to worse results for the fine-tuned dense retrieval models. The code and datasets are available at https://github.com/Guzpenha/transformer_rankers/tree/full_rank_retrieval_dialogues.
... (1) General word importance. Frequently occurring words usually have limited discriminative information [13]. To downweight frequent words and upweight rare words, we compute general word importance as s(w_i) = a / (a + P(w_i)), where a = 10^-5 and P(w_i) is the unigram likelihood of the i-th word over all the seen data. ...
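That down-weighting is straightforward to reproduce; in this sketch the corpus is invented, `importance` is our own name for the weighting function, and a = 10^-5 matches the constant given in the excerpt.

```python
from collections import Counter

# Invented corpus; P(w) is the unigram likelihood over all seen tokens.
corpus = ["the flight was delayed", "book the flight", "the weather was bad"]
tokens = [t for sent in corpus for t in sent.split()]
counts = Counter(tokens)
total = len(tokens)

a = 1e-5  # same constant as in the excerpt

def importance(word):
    """s(w) = a / (a + P(w)): frequent words get small weights, rare words large ones."""
    p = counts[word] / total
    return a / (a + p)

print(importance("the") < importance("delayed"))  # prints: True
```

Because P("the") is an order of magnitude larger than P("delayed"), the common word's weight is proportionally smaller, achieving the same effect as an idf term without document boundaries.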
Preprint
Zero-shot intent classification is a vital and challenging task in dialogue systems, which aims to deal with numerous fast-emerging unacquainted intents without annotated training data. To obtain more satisfactory performance, the crucial points lie in two aspects: extracting better utterance features and strengthening the model generalization ability. In this paper, we propose a simple yet effective meta-learning paradigm for zero-shot intent classification. To learn better semantic representations for utterances, we introduce a new mixture attention mechanism, which encodes the pertinent word occurrence patterns by leveraging the distributional signature attention and multi-layer perceptron attention simultaneously. To strengthen the transfer ability of the model from seen classes to unseen classes, we reformulate zero-shot intent classification with a meta-learning strategy, which trains the model by simulating multiple zero-shot classification tasks on seen categories, and promotes the model generalization ability with a meta-adapting procedure on mimic unseen categories. Extensive experiments on two real-world dialogue datasets in different languages show that our model outperforms other strong baselines on both standard and generalized zero-shot intent classification tasks.
... Some of the extracted information can be represented as sets; for example, the sets of installed drivers or software programs. We have applied a natural language processing (NLP) technique called tf-idf [21] to transform temporal information about sets into covariates. Consider installed drivers as an example. ...
Preprint
Full-text available
Predictive maintenance (PdM) is the task of scheduling maintenance operations based on a statistical analysis of the system's condition. We propose a human-in-the-loop PdM approach in which a machine learning system predicts future problems in sets of workstations (computers, laptops, and servers). Our system interacts with domain experts to improve predictions and elicit their knowledge. In our approach, domain experts are included in the loop not only as providers of correct labels, as in traditional active learning, but as a source of explicit decision rule feedback. The system is automated and designed to be easily extended to novel domains, such as maintaining workstations of several organizations. In addition, we develop a simulator for reproducible experiments in a controlled environment and deploy the system in a large-scale case of real-life workstations PdM with thousands of workstations for dozens of companies.
... Owing to technical noises and intrinsic biological variability, scRNA-seq data almost always possess high sparsity, known as the zero-inflation phenomenon [54; 69; 52]. Another example is the word occurrence matrix, whose elements are the so-called TF-IDF values [42]. Here, TF-IDF is short for the term frequency-inverse document frequency, which is calculated by multiplying two metrics, i.e., how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. ...
Preprint
The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample. Such an approach has been used pervasively in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome the limitation, we develop a novel element-wise subset selection approach, called core-elements. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mathrm{nnz}(\mathbf{X})+rp^2)$ computational cost, where $\mathbf{X}\in\mathbb{R}^{n\times p}$ is the predictor matrix, $r$ is the number of elements selected from each column of $\mathbf{X}$, and $\mathrm{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and real-world datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
... Note that many tokens are common across documents but may not be relevant to the SR search criteria, such as articles, prepositions, and certain common verbs (e.g., "give", "perform"). Thus we scaled token frequency within a document by the inverse of a token's frequency across all documents, which is called term-frequency-inverse-document-frequency (TFIDF) [19]: tfidf_{i,d} = tf_{i,d} × idf_i, where tf_{i,d} is the term frequency for token i in document d and idf_i is the inverse document frequency for token i. For our study, we used the logarithmically scaled inverse fraction of documents containing token i: idf_i = log(N / df(i)), where N is the total number of documents and df(i) is the number of documents containing token i. ...
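The idf definition above can be checked directly on toy data (the tokenized documents below are invented): tokens occurring in every document receive an idf of zero, and hence a zero TFIDF weight, which is exactly the down-weighting of articles and prepositions the excerpt motivates.

```python
import math

# Invented tokenized documents standing in for the screened abstracts.
documents = [
    ["randomized", "trial", "of", "hiv", "prevention"],
    ["cohort", "study", "of", "violence", "outcomes"],
    ["hiv", "treatment", "of", "sex", "workers"],
]
N = len(documents)

def idf(token):
    """idf_i = log(N / df(i)); assumes the token occurs in at least one document."""
    df = sum(1 for doc in documents if token in doc)
    return math.log(N / df)

def tfidf(token, doc):
    """tf_{i,d} * idf_i, with raw counts as the term frequency."""
    return doc.count(token) * idf(token)

print(idf("of"), idf("hiv"))  # "of" occurs in every document, so its idf is 0.0
```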
Article
Full-text available
There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an automatic document classification approach to a systematic review to significantly reduce the effort in reviewing documents and optimize empiric decision making. We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that utilizes a large set of feature tokens. This approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation in terms of area under the curve of the receiver operating characteristic and the precision-recall plot, and found that the random forest approach reduces the amount of manual reading for our example by 80%; in sensitivity analysis, we found that even when trained with only 10% of the data, the classifier can still avoid reading 75% of future documents (68% of total) while retaining 80% of relevant documents. In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid document screening on a tight schedule, such as COVID-related work during the crisis.
... This is done with respect to the vocabulary extracted from the tweet-corpus. A TF-IDF weight captures the respective term's importance to a document, relative to the term's usage in the corpus [29]. The stylistic part of the feature-set includes the occurrences of elongated words, fully capitalized words, consecutive punctuation marks, hashtags, as well as the percentage of the profile's messages that were retweets or replies, and the number of URLs that the user has shared. ...
Article
Full-text available
We demonstrate a system for predicting gaming related properties from Twitter accounts. Our system predicts various traits of users based on the tweets publicly available in their profiles. Such inferred traits include degrees of tech-savviness and knowledge on computer games, actual gaming performance, preferred platform, degree of originality, humor and influence on others. Our system is based on machine learning models trained on crowd-sourced data. It allows people to select Twitter accounts of their fellow gamers, examine the trait predictions made by our system, and the main drivers of these predictions. We present empirical results on the performance of our system based on its accuracy on our crowd-sourced dataset.
... It can further provide an unambiguous semantic measure as to the actual meaning of the terms. TF-IDF (Jones, 1972) is a typical approach. Knowledge-based semantic approaches are computationally simple and can easily be extended to compute sentence-to-sentence similarity measures. ...
Thesis
This thesis aims to automate the retrieval of the inventive problem-solving knowledge contained in patent documents by using Natural Language Processing (NLP) techniques. We propose three main contributions: i) two similar-problem retrieval models, IDM-Similar based on Word2vec neural networks and SAMIDM based on LSTM neural networks, to retrieve similar problems from patents in different domains; ii) a problem-solution matching model named IDM-Matching, based on XLNet neural networks, to build connections between problems and solutions in patent documents; iii) an inventive solutions ranking model called PatRIS, based on a multiple-criteria decision analysis approach, to rank potential inventive solutions. These models have been evaluated on both benchmark and real-world patent datasets.
... The minimum range of n-grams is one (so-called unigrams), while its upper limit strongly depends on the length of the analyzed documents. Additionally, the weights of individual words can be normalized for a specific document using Term Frequency (tf) [21], or for the entire analyzed corpus, using Term Frequency-Inverse Document Frequency (tf-idf) [14]. Another noteworthy approach to extracting features from text is Latent Dirichlet Allocation (lda), which is a generative probabilistic model for topic modeling [13]. ...
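The n-gram range mentioned above can be sketched with a plain word n-gram extractor (whitespace tokenization is an assumption; real pipelines typically hand this off to a vectorizer's `ngram_range` parameter).

```python
def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(ngrams("fake news spreads fast"))
# → ['fake', 'news', 'spreads', 'fast', 'fake news', 'news spreads', 'spreads fast']
```

Each resulting gram then becomes one feature, weighted by tf or tf-idf as the excerpt describes.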
Preprint
Full-text available
The abundance of information in digital media, which in today's world is the main source of knowledge about current events for the masses, makes it possible to spread disinformation on a larger scale than ever before. Consequently, there is a need to develop novel fake news detection approaches capable of adapting to changing factual contexts and generalizing previously or concurrently acquired knowledge. To deal with this problem, we propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages and the mutual transfer of knowledge acquired in each of them. Both classical feature extractors, such as Term frequency-inverse document frequency or Latent Dirichlet Allocation, and integrated deep NLP (Natural Language Processing) BERT (Bidirectional Encoder Representations from Transformers) models paired with MLP (Multilayer Perceptron) classifier, were employed. The results of experiments conducted on two datasets dedicated to the fake news classification task (in English and Spanish, respectively), supported by statistical analysis, confirmed that utilization of additional languages could improve performance for traditional methods. Also, in some cases supplementing the deep learning method with classical ones can positively impact obtained results. The ability of models to generalize the knowledge acquired between the analyzed languages was also observed.
... Over the past decades, research into monolingual automatic term extraction first evolved from linguistic (e.g., (Justeson and Katz, 1995)) and statistical (Sparck Jones, 1972) methodologies to hybrid methodologies. These rule-based hybrid methodologies combine linguistic information, like part-of-speech patterns, with statistical metrics used to calculate termhood and unithood (Kageura and Umino, 1996), which measure how related the candidate term (CT) is to the domain, and, in the case of candidate multi-word terms, whether the individual components form a cohesive unit. ...
Conference Paper
Full-text available
This contribution presents D-Terminer: an open access, online demo for monolingual and multilingual automatic term extraction from parallel corpora. The monolingual term extraction is based on a recurrent neural network, with a supervised methodology that relies on pretrained embeddings. Candidate terms can be tagged in their original context and there is no need for a large corpus, as the methodology will work even for single sentences. With the bilingual term extraction from parallel corpora, potentially equivalent candidate term pairs are extracted from translation memories and manual annotation of the results shows that good equivalents are found for most candidate terms. Accompanying the release of the demo is an updated version of the ACTER Annotated Corpora for Term Extraction Research (version 1.5).
... An extension of this method, called TF-IDF [65] (Term Frequency - Inverse Document Frequency), assigns weights to words. The more frequent a word is in a document, and the rarer it is across the corpus, the higher its weight. ...
Thesis
To classify user intents, a rigorous annotation must be conducted. In order to overcome the problem of a lack of annotated data, we use few-shot classification methods. In a first step, this thesis focuses on a new comparison of few-shot classification methods. The methods were previously compared with different text encoders, which led to a biased comparison. When each method is equipped with the same transformer-based sentence encoder (BERT), older few-shot classification methods take the lead. Next, we study pseudo-labeling, i.e. the automatic assignment of pseudo-labels to unlabeled data. In this context, we introduce a new pseudo-labeling method inspired by hierarchical clustering. Our method uses no hyper-parameters and can ignore unlabeled examples that are too far from the known distribution. We also show that it is complementary to other existing methods. As a final contribution, we introduce ProtAugment, a meta-learning architecture for the intent detection problem. This extension trains the model to retrieve the original sentence based on prototypes computed from paraphrases. We also introduce our own method for generating paraphrases, and show that the way these paraphrases are generated plays an important role. All the code used to run the experiments presented in this thesis is available on my github account (https://github.com/tdopierre/).
... 1. Make text lower case; 2. Tokenize words; 3. Remove punctuation; 4. Remove stop words; 5. Replace words with a common synonym; 6. Conduct stemming analysis. After this preprocessing, we converted the text into numerical vectors based on the TFIDF vectorization (Sparck Jones 1972). Finally, we computed the cosine similarity (scikit-learn developers 2020) between issue reports and commits, and if the similarity value is over a certain threshold, we tagged such pairs as linked pairs. ...
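The preprocessing steps and the thresholded cosine link can be sketched as follows; the stop-word list, the crude suffix-stripping "stemmer", the 0.5 threshold, and the issue/commit texts are all illustrative stand-ins, synonym replacement is omitted, and raw term counts replace the TFIDF weights for brevity.

```python
import math
import string
from collections import Counter

STOP = {"the", "a", "an", "of", "in", "is", "was"}  # illustrative stop-word list

def preprocess(text):
    """Lower-case, tokenize, strip punctuation and stop words, crude stemming."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP]
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # stand-in stemmer

def cosine(a, b):
    """Cosine over term counts (counts stand in for the TFIDF weights here)."""
    ca, cb = Counter(preprocess(a)), Counter(preprocess(b))
    dot = sum(n * cb.get(t, 0) for t, n in ca.items())
    na = math.sqrt(sum(n * n for n in ca.values()))
    nb = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

issue = "Crash when saving reports: null pointer in the report writer."
commit = "Fix null pointer on report save"
linked = cosine(issue, commit) > 0.5  # illustrative threshold
```

The shared stems ("null", "pointer", "report") push the similarity over the threshold, so the pair would be tagged as linked.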
Article
Full-text available
The accuracy of the SZZ algorithm is pivotal for just-in-time defect prediction because most prior studies have used the SZZ algorithm to detect defect-inducing commits to construct and evaluate their defect prediction models. The SZZ algorithm has two phases to detect defect-inducing commits: (1) linking issue reports in an issue-tracking system to possible defect-fixing commits in a version control system by using an issue-link algorithm (ILA); and (2) tracing the modifications of defect-fixing commits back to possible defect-inducing commits. Researchers and practitioners can address the second phase by using existing solutions such as a tool called cregit. In contrast, although various ILAs have been proposed for the first phase, no large-scale studies exist in which such ILAs are evaluated under the same experimental conditions. Hence, we still have no conclusions regarding the best-performing ILA for the first phase. In this paper, we compare 10 ILAs collected from our systematic literature study with regard to the accuracy of detecting defect-fixing commits. In addition, we compare the defect prediction performance of ILAs and their combinations that can detect defect-fixing commits accurately. We conducted experiments on five open-source software projects. We found that all ILAs and their combinations prevented the defect prediction model from being affected by missing defect-fixing commits. In particular, the combination of a natural language text similarity approach, Phantom heuristics, a random forest approach, and a support vector machine approach statistically significantly reduces the absolute differences from the ground-truth defect prediction performance. We summarize guidelines for using ILAs as our recommendations.
... Besides, SPICE uses the model pre-trained on ImageNet for ImageNet-10/Dogs (denoted by "( )" in Table 3), while all other methods including ours train the model from scratch. For text clustering, we compare the proposed TCL with 11 benchmarks, including TF/TF-IDF (Jones, 1972), Bag-of-Words (BOW) (Harris, 1954), SkipVec (Kiros et al., 2015), Para2Vec (Le & Mikolov, 2014), GSDPMM, RecNN (Socher et al., 2011), STCC (Xu et al., 2017b), HAC-SD (Rakib et al., 2020), ECIC (Rakib et al., 2020), and SCCL (Zhang et al., 2021a). Similarly, the vanilla k-means is conducted on the extracted features to cluster the data for those representation-based methods, including BOW, TF/TF-IDF, SkipVec, Para2Vec, and RecNN. ...
Article
Full-text available
This paper proposes to perform online clustering by conducting twin contrastive learning (TCL) at the instance and cluster level. Specifically, we find that when the data is projected into a feature space with a dimensionality of the target cluster number, the rows and columns of its feature matrix correspond to the instance and cluster representation, respectively. Based on the observation, for a given dataset, the proposed TCL first constructs positive and negative pairs through data augmentations. Thereafter, in the row and column space of the feature matrix, instance- and cluster-level contrastive learning are respectively conducted by pulling together positive pairs while pushing apart the negatives. To alleviate the influence of intrinsic false-negative pairs and rectify cluster assignments, we adopt a confidence-based criterion to select pseudo-labels for boosting both the instance- and cluster-level contrastive learning. As a result, the clustering performance is further improved. Besides the elegant idea of twin contrastive learning, another advantage of TCL is that it could independently predict the cluster assignment for each instance, thus effortlessly fitting online scenarios. Extensive experiments on six widely-used image and text benchmarks demonstrate the effectiveness of TCL. The code is released on https://pengxi.me.
... On the video stream, it results in seeing (or not) the rear passenger. For the text modality, we calculate the frequency distribution of words and the term frequency-inverse document frequency (TF-IDF) [23] to find whether there are specific distributions of words associated with a given scenario. These approaches are very common in text mining and analysis. ...
Conference Paper
The use of audio, video and text modalities to simultaneously analyze human interactions is a recent trend in the field of deep learning. Multimodality tends to create computationally expensive models. Our in-vehicle specific context required recording a database to validate our approach. Twenty-two participants playing three different scenarios ("curious", "argued refusal" and "not argued refusal") of interactions between a driver and a passenger were recorded. We propose two different models to identify tense situations in a car cabin. One is based on an end-to-end approach and the other one is a hybrid model using handcrafted features for the audio and video modalities. We obtain similar results (around 81% balanced accuracy) with the two architectures, but we highlight their complementarity. We also provide details regarding the benefits of combining different sensor channels.
... Frequency has long been known to correlate with the semantic generality of a word (Caraballo and Charniak, 1999), and this property is used in fundamental algorithms like TF-IDF (Spärck Jones, 1972). ...
Preprint
Full-text available
The diversity and Zipfian frequency distribution of natural language predicates in corpora leads to sparsity when learning Entailment Graphs. As symbolic models for natural language inference, an EG cannot recover if missing a novel premise or hypothesis at test-time. In this paper we approach the problem of vertex sparsity by introducing a new method of graph smoothing, using a Language Model to find the nearest approximations of missing predicates. We improve recall by 25.1 and 16.3 absolute percentage points on two difficult directional entailment datasets while exceeding average precision, and show a complementarity with other improvements to edge sparsity. We further analyze language model embeddings and discuss why they are naturally suitable for premise-smoothing, but not hypothesis-smoothing. Finally, we formalize a theory for smoothing a symbolic inference method by constructing transitive chains to smooth both the premise and hypothesis.
... To better understand the kinds of business method innovations in each category, we also parsed the patent documents to obtain the most important words that appear in the abstracts of the patents in the category. For example, the top keywords in the abstracts of patents in customer targeting put specific emphasis on words such as "digital," "user," "content," "device," "product," "key," "media," "customer," "transaction," "consumer," and "image" (as measured by tf-idf, "term-frequency inverse-document-frequency"; Jones 1972). Broadly, patents in this list cover innovations that involve customer targeting functions, typically before ...
Article
Full-text available
... Previous work has shown that retrieval improves performance across a variety of tasks such as question answering (Voorhees et al., 1999;Chen et al., 2017;Kwiatkowski et al., 2019), fact checking (Thorne et al., 2018), dialogue (Dinan et al., 2019) or citation recommendation . Historically, this information retrieval step was implemented using term-matching methods, such as TF-IDF or BM25 (Jones, 1972;Robertson et al., 1995). For open-domain question answering (Voorhees et al., 1999), documents are often retrieved from Wikipedia (Chen et al., 2017). ...
Preprint
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters.
... The centroid principle summarizer utilizes the centroids of clusters to identify the sentences that are most central to the topics of the cluster. These centroids are words with a TF-IDF value [28] above a predefined threshold. The LexRank [29] summarizer identifies the key sentences in an input document using a ranking model based on graph analysis. ...
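A compact sketch of the centroid principle on invented sentences: words whose corpus-level weight exceeds a made-up threshold form the centroid, and the highest-scoring sentence is taken as a one-line extractive summary. The stop-word list, weighting, and threshold are all illustrative stand-ins.

```python
import math
from collections import Counter

STOP = {"the", "was", "by", "were", "in"}  # illustrative stop-word list

sentences = [
    "the tumor expression study identifies tumor subtypes",
    "the study was reviewed by the committee",
    "tumor expression changes were found in the study",
]
tokenized = [[t for t in s.split() if t not in STOP] for s in sentences]

# Corpus-level weight: total count * log(N / df), one "document" per sentence.
n = len(sentences)
df = Counter(t for toks in tokenized for t in set(toks))
count = Counter(t for toks in tokenized for t in toks)
weight = {t: count[t] * math.log(n / df[t]) for t in df}

THRESHOLD = 0.5  # invented cut-off for centroid membership
centroid = {t for t, w in weight.items() if w > THRESHOLD}

def score(tokens):
    """A sentence accumulates the weights of the centroid words it contains."""
    return sum(weight[t] for t in tokens if t in centroid)

summary = sentences[max(range(n), key=lambda i: score(tokenized[i]))]
```

Note how "study", which occurs in every sentence, gets weight zero and is excluded from the centroid, so the boilerplate second sentence scores lowest.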
Article
Full-text available
The difficulty of deriving value from the vast available scientific literature in a condensed form led us to look for a proficient theme-based summarization solution that can preserve precise biomedical content. The study aims to analyze the impact of combining semantic biomedical concept extraction, frequent item-set mining, and clustering techniques on information retention, objective functions, and ROUGE values for the final summary. The suggested frequent item-set mining and clustering (FRI-CL) graph-based framework uses the UMLS Metathesaurus and BERT-based semantic embeddings to identify domain-relevant concepts. The identified concepts are mined according to their relationships with neighbors and their frequency via an amended FP-Growth model. The framework utilizes S-DPMM clustering, a probabilistic mixture model that aids in identifying and grouping complex relevant patterns to increase coverage of important sub-themes. The sentences containing the frequent concepts are scored via PageRank to form an efficient and compelling summary. Experiments on 100 sample biomedical documents taken from PubMed archives are evaluated by calculating ROUGE scores, coverage, readability, non-redundancy, memory utilization, and information retention of the summary output. The results with the FRI-CL summarization system showed a 10% ROUGE performance improvement and are on par with the other baseline methods. On average, a 30-40% improvement in memory utilization is observed, with up to 50% information retention, when experiments are performed using S-DPMM clustering. The research indicates that fusing semantic mapping, clustering, and frequent item-set mining of biomedical concepts enhances the overall correlated information, covering all sub-themes.
Thesis
An IT infrastructure comprises all of an organisation's hardware (servers, switches, routers, workstations, peripherals, ...) and software (cloud, ERP, CRM, email, ...). These elements are interconnected to deliver services to end users (employees, partners and/or customers). The interruption of such a service can result in significant financial penalties and/or a loss of user confidence. It is therefore essential to react quickly, to keep the downtime as short as possible and limit the financial impact. Consequently, to ensure the proper functioning, availability and security of any IT infrastructure, we must be proactive in detecting the incidents that disturb the stability of the organisation. This mission also involves advising users, responding to specific requests and continuously adapting to the organisation's evolving needs. The main objective of our research work is to propose an effective diagnostic platform for an IT infrastructure, called MAITD, to help company technicians find solutions in a short time when faced with a decision-making problem.
Thesis
Full-text available
Stop words list generation contributes to reducing the size of the corpus vector space and the indexing structure, achieving a high compression rate, speeding up computation, and increasing the accuracy of Information Retrieval (IR) systems. IR requires matching a query to the most appropriate documents, which can cause additional memory overhead, low document recall, and ambiguous results if a standard stop words list is not used. One recent line of work used intersection theory (IT) to aggregate Frequency Analysis, Word Distribution Analysis, and Word Entropy Measure for stop words generation; however, the intersection of set numbers is computed arbitrarily and is difficult to manipulate for generating stop words. This study designed a Yoruba Stop Words Generator (YSWG) using the Inclusion-Exclusion Principle to generalize the aggregated methods together with Term Frequency-Inverse Document Frequency. Each of the two methods generated its own stop words list after passing through a text preprocessing stage, including diacritization of the Yoruba Language corpus. A cosine similarity measure was applied to the Yoruba and English stop words generated by the two methods, used as user queries against the two corpora. The YSWG system used Multinomial Naïve Bayes for updating the system's library, especially in the event of an evolving new word. The YSWG was implemented alongside the IT method using the Python Programming Language and HyperText Markup Language with the 8-bit Unicode Transformation Format (UTF-8) encoding standard for proper recognition and decoding of diacritized characters. A dataset of a Yoruba corpus with 1,388,050 tokens and 40,212 distinct words cutting across different domains was processed by the YSWG together with the IT method to determine the scalability of the system. The YSWG produced a precise, generalized, and standardized list of 255 Yoruba Language stop words, while the existing IT method produced 230 stop words.
The YSWG was able to capture and preserve the diacritic character properties of the Yoruba language and the non-diacritic characters of English. The YSWG and IT methods were evaluated using precision, recall, accuracy, error, F-measure, and execution time. The results showed that the YSWG performed better, with precision, recall, accuracy, error, F-measure, and execution time of 95%, 98%, 99%, 0.005%, 0.96, and 0.1119 ms, compared with the IT method's values of 83%, 91%, 86%, 0.02%, 0.86, and 0.1121 ms, respectively. These results imply that the YSWG was more efficient and reliable than the IT method in generating stop words. A text compression rate of 65% was achieved after removing the stop words generated by the YSWG, compared with 49% for the IT method. The YSWG also yielded a higher cosine similarity value than the IT method, indicating that the new design would retrieve exact user query results. The YSWG method with the Inclusion-Exclusion Principle performed better than the IT method in the categorization and identification of stop words.
Thesis
Natural Language Processing (NLP) techniques and qualitative methods for analysing textual data share a certain epistemological affinity. Despite this, qualitative analysis does not fully benefit from the potential contributions of NLP. In particular, work aiming at genuinely automating the qualitative coding of data remains rather rare. This thesis sets out to investigate the potential of various NLP techniques toward this goal, for a task that requires a certain degree of human expertise. It aims to create a tool usable in an industrial context and for a specific analysis method that assesses the acceptability of innovations. This method relies on a grid of 20 codes of higher semantic complexity than those traditionally used in tool-assisted qualitative analysis. We explore ways to accomplish this task through a bottom-up and then a top-down approach. For the former, we perform a lexicometric exploration of a corpus of data from past studies to define the lexical profile of the data expected for each code. We then treat the problem as a classification task, testing statistical classifiers of several kinds. We also investigate the possibilities offered by projecting a syntactico-semantic resource onto the corpus. We then follow a less conventional top-down approach, for which we build an expert model of the qualitative acceptability-assessment interview paradigm in the form of an ontology, paired with an ad hoc lexicon.
We thus propose an original architecture for a semantic analysis tool in which the ontology's triples support interpretation and subsets of them constitute classification rules. We obtain a hyper-specialised analysis tool whose performance exceeds that obtained by machine learning on our training corpus. The tool is carried through to operational use, via its integration into an auto-coding platform, with a view to setting up a continuous learning process.
Article
Sentiment analysis is an important task in the field of natural language processing that aims to gauge and predict people’s opinions from large amounts of data. In particular, gender-based sentiment analysis can influence stakeholders and drug developers in real-world markets. In this work, we present a gender-based multi-aspect sentiment detection model using multilabel learning algorithms. We divide the Abilify and Celebrex datasets into three groups based on gender information, namely: male, female, and mixed. We then represent bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), and global vectors for word representation (GloVe) based features for each group. Next, we apply problem transformation approaches and multichannel recurrent neural networks with an attention mechanism. Results show that traditional multilabel transformation methods achieve better performance for small amounts of data and long-range sequences in terms of samples and labels, and that deep learning models achieve better performance in terms of mean test accuracy, AUC score, RL, and average precision using GloVe word embedding features on both datasets.
Conference Paper
The use of artificial intelligence (AI) within the finance industry can be considered a transformative approach, as it enables financial institutions to enhance their performance capacity. AI helps the industry streamline processes and optimise management efficiently for various types of operations pertaining to credit decision-making, financial risk assessment and management, and quantitative trading. The paper aims to analyse the proactive approach that can be taken with the use of AI in order to enhance effective management within the financial sector. The empirical study conducted in the paper utilizes various types of secondary materials with a qualitative approach. The findings of the study demonstrate the enhanced capacity of AI for a proactive approach, used to assess risks or threats prior to any mismanagement incident. In this regard, fintech companies such as Enova, Ocrolus, ZestFinance, and DataRobot have taken a predominant position in helping the financial industry use AI-based systems that aid the management process. However, the inclusion of AI within the financial sector faces certain challenges, such as a lack of knowledge regarding technological infrastructure, poor financial investment (especially for government-aided banks), a lack of awareness among employees, and weak collaboration with the IT industry. Regardless, AI technologies have achieved great advancement in recent years, enhancing their capacity to assist effective management within the financial sector.
Article
Determining if the lyrics of a given song could be hurtful or inappropriate for children is of utmost importance to prevent the reproduction of songs whose textual content is unsuitable for them. This problem can be computationally tackled as a binary classification task, and in the last couple of years various machine learning approaches have been applied to perform this task automatically. In this work, we investigate the automatic detection of explicit song lyrics by leveraging transformer-based language models, i.e., large language representations built in an unsupervised manner from huge textual corpora, that can be fine-tuned on various natural language processing tasks, such as text classification. We assess the performance of various transformer-based language model classifiers on a dataset consisting of more than 800K lyrics, marked with explicit information. The evaluation shows that while the classifiers built with these powerful tools achieve state-of-the-art performance, they do not outperform lighter and computationally less demanding approaches. We complement this empirical evaluation with further analyses, including an assessment of the performance of these classifiers in a few-shot learning scenario, where they are trained with just a few thousand samples.
Article
In the present-day technology-driven world, the information reaching an individual’s doorstep sometimes becomes complex, haphazard, and difficult to classify for insights. The end consumer of the information requires processed information that is contextually suited to their needs and interests and is properly formatted and categorised. Interest- and need-based categorization of news and stories would enable users to evaluate information more deeply beforehand, for instance, deciding which current-affairs issues and news to read or not to read. This research work proposes an advanced current affairs classification model based on deep learning approaches, called Intelligent Current Affairs Classification Using Deep Learning (iCACD). The proposed model improves on previously proposed machine-learning-based models, compared on accuracy and performance criteria, in the following ways. First, it is based on an advanced deep neural network architecture. Second, it processes both the headline and the body of news articles rather than only headlines. Third, a detailed comparative analysis and discussion of accuracy and performance against other machine learning models is also presented.
Article
Nowadays, the explosive growth in text data emphasizes the need for developing new and computationally efficient methods and credible theoretical support tailored for analyzing such large‐scale data. Given the vast amount of this kind of unstructured data, the majority of it is not classified, hence unsupervised learning techniques prove to be useful in this field. Document clustering has proven to be an efficient tool in organizing textual documents and it has been widely applied in different areas, from information retrieval to topic modeling. Before introducing the proposed document clustering algorithms, the principal steps of the whole process, including the mathematical representation of documents and the preprocessing phase, are discussed. Then, the main clustering algorithms used for text data are critically analyzed, considering prototype‐based, graph‐based, hierarchical, and model‐based approaches. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining; Data: Types and Structure > Text Data. Document clustering: prototype‐based, graph‐based, hierarchical, and model‐based methods.
Article
With the advancement of Web 2.0 and the development of the Internet of Things (IoT), all tasks can be handled with the help of handheld devices. Web APIs or web services provide immense power to the IoT and work as the backbone of its successful journey. Web services can perform any task on a single click event, and they are available over the internet in great quantity, quality, and variety. This leads to the requirement of service management in the service repository. A well-managed and structured service repository remains challenging, as services are dynamic and documentation is limited. It is also not trivial to discover, select, and recommend services from a pool of services. Web service clustering (WSC) plays a vital role in enhancing the service discovery, selection, and recommendation process by analyzing the similarity among services. In this paper, a total of 84 research papers are selected through a systematic process, and different state-of-the-art techniques based on web service clustering are investigated and analyzed. Furthermore, this Systematic Literature Review (SLR) also presents the various mandatory and optional steps of WSC, evaluation measures, and datasets. Research challenges and future directions are also identified, which will help researchers provide innovative solutions in this area.