Article

# Jones, K.S.: A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation 28(1), 11-21 (1972)


## Abstract

The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
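The weighting argued for in the abstract can be illustrated with a minimal sketch. It assumes the logarithmic form log(N / n) that later literature commonly uses for collection-frequency weights (the abstract itself gives no formula); the toy collection, the query, and the function names are hypothetical.

```python
import math
from collections import Counter

def collection_frequency_weights(docs):
    """Map each term to log(N / n), where n is the number of documents containing it.

    Assumption: log(N / n) is one common formulation, not necessarily the
    exact weight used in the paper.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

def weighted_match_score(query_terms, doc, weights):
    """Sum the weights of the query terms that occur in the document."""
    return sum(weights.get(t, 0.0) for t in set(query_terms) if t in doc)

# Hypothetical toy collection: each document is a set of index terms.
docs = [
    {"retrieval", "index", "term"},
    {"retrieval", "statistics"},
    {"retrieval", "specificity", "term"},
    {"retrieval", "index"},
]
weights = collection_frequency_weights(docs)
scores = [weighted_match_score(["retrieval", "specificity"], d, weights) for d in docs]
```

In this toy collection a match on `specificity` (one document) contributes log 4, while a match on `retrieval` (every document) contributes nothing, so the third document ranks first.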

## No full-text available

... The outlier handling strategies for each work can be seen in Table 3. [62,61,50,29] Working with text data requires embedding of words, sentences, or documents into a vector space model as numerical vectors to enable optimization over the data. Among the reviewed works, Term Frequency-Inverse Document Frequency (TF-IDF) [82,83,84] was the most common vectorization technique, used by 17 (30%) works, followed by 11 (19%) works using Bag-of-words (BOW) or Bag-of-n-grams, and word2vec [85] and [86] was found in 9 (16%) works. ...
... Another popular representation is the Term Frequency-Inverse Document Frequency (TF-IDF) [82,83,84], which is directly proportional to the term frequency within each document and inversely proportional, on a logarithmic scale, to how often a term appears across all documents in the corpus. Since the precise definitions of the TF and IDF terms vary in the literature, a particular formulation for the TF-IDF score is shown in Equation 14. ...
Preprint
Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.
... Tf×Idf (Jones, 1972). Producing keywords for a document consists first of identifying the important concepts, then of retaining only a certain number of these concepts according to various criteria, and finally of choosing a name for these concepts from among all possible linguistic forms. ...
... The most widely used statistical method is Tf×Idf (Jones, 1972). It is a very popular baseline method in most natural language processing tasks. ...
Thesis
The number of scientific documents in digital libraries keeps growing. Keywords, which enrich the indexing of these documents, cannot be annotated manually given the volume of documents to process. Automatic keyword production is therefore an important challenge. The most widely used evaluation framework for this task suffers from numerous weaknesses that make the evaluation of new neural methods unreliable. Our goal is to identify these weaknesses precisely and to address them along three lines. First, we introduce KPTimes, a dataset from the news domain. It allows us to analyze the generalization ability of neural methods. Surprisingly, our experiments show that the worst-performing model is the one that generalizes best. Second, we carry out a systematic comparison of state-of-the-art methods using a strict experimental framework. This comparison indicates that baseline methods such as TF×IDF remain competitive and that the quality of the reference keywords has a strong impact on the reliability of the evaluation. Finally, we present a new extrinsic evaluation protocol based on information retrieval. It allows us to evaluate the usefulness of keywords, a question little addressed so far. This evaluation allows us to better identify the keywords that matter for the automatic keyword production task and to guide future work.
... The measure called term frequency (TF), the frequency of a term in a document, found its way into almost all term weighting schemes. Following Jones [18], terms can be words or possibly phrases or word stems; assuming that there are N documents in a collection and that the term t_i appears in n_i of them, the proposed measure is a weight applied to the term t_i, described in Equation (1) and also known as the inverse document frequency (IDF); this formulation is one of the most widely used (Robertson, 2004). ...
... Properly assigning weights to individual terms produces retrieval results superior to those obtained with other text techniques, depending on the term weighting system chosen [45,46]. This study uses the TF-IDF term weighting, a metric in which TF provides a direct estimate of the probability of occurrence of a term, normalized by the total frequency of the document [47], and in which this indicator is multiplied by the IDF, which in turn can be interpreted as the amount of information, given as the log of the inverse probability [18]. ...
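The two readings in this passage can be written out as a short worked formulation. The notation is an assumption made for illustration: $f_{t,d}$ is the raw count of term $t$ in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents containing $t$. The cited study's own equations are not reproduced in the excerpt and may differ.

```latex
\[
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t} = -\log\frac{n_t}{N}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
```

Reading $n_t / N$ as the probability that a randomly chosen document contains $t$, $\mathrm{idf}(t)$ is the log of the inverse of that probability. For example, with $N = 1000$ and $n_t = 10$, $\mathrm{idf}(t) = \ln 100 \approx 4.6$ using natural logarithms.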
Article
The objective of this work is to generate an HJ-biplot representation for the content analysis obtained by latent Dirichlet allocation (LDA) of the headlines of three Spanish newspapers, in their web versions, on the pandemic caused by the SARS-CoV-2 virus (COVID-19), which has affected more than 500 million people and caused almost six million deaths to date. The HJ-biplot is used to give an extra analytical boost to the model. It is an easy-to-interpret multivariate technique that does not require in-depth knowledge of statistics and captures the relationship between the COVID-19 news topics and the three digital newspapers. Compared with LDAvis and heatmap representations, the HJ-biplot provides a better representation and visualization, allowing us to analyze the relationship between each newspaper (column markers represented by vectors) and the 14 topics obtained from the LDA model (row markers represented by points), represented in the plane with the greatest informative capacity. It is concluded that the newspapers El Mundo and 20 M present greater homogeneity between the topics published during the pandemic, while El País presents topics that are less related to the other two newspapers, highlighting topics such as t_12 (Government_Madrid) and t_13 (Government_millions).
... For this reason, she extracts the keywords and topics (cf. tf-idf 39 and LDA 40 ), sentiment values (cf. VADER 41 ), number of likes, and usernames for tweets. ...
Article
Multidimensional data is often visualized using coordinated multiple views in an interactive dashboard. However, unlike in infographics where text is often a central part of the presentation, there is currently little knowledge of how to best integrate text and annotations in a visualization dashboard. In this paper, we explore a technique called FacetNotes for presenting these textual annotations on top of any visualization within a dashboard irrespective of the scale of data shown or the design of visual representation itself. FacetNotes does so by grouping and ordering the textual annotations based on properties of (1) the individual data points associated with the annotations, and (2) the target visual representation on which they should be shown. We present this technique along with a set of user interface features and guidelines to apply it to visualization interfaces. We also demonstrate FacetNotes in a custom visual dashboard interface. Finally, results from a user study of FacetNotes show that the technique improves the scope and complexity of insights developed during visual exploration.
... Processing unstructured data is a challenge for researchers in NLP, and therefore many researchers have proposed formulas for transforming the unstructured data into a more structured format. A bag of words has been proposed to transform textual data into numerical data by exploiting the frequency of each word in a sentence, resulting in the concepts of Term Frequency and Inverse Document Frequency proposed in [12]. Word embedding methods have been proposed to obtain more redundant data in order to represent words in consideration of their relationship to other words in many contexts. ...
Article
Arabic is one of the official languages recognized by the United Nations (UN) and is widely used in the middle east, and parts of Asia, Africa, and other countries. Social media activity currently dominates the textual communication on the Internet and potentially represents people’s views about specific issues. Opinion mining is an important task for understanding public opinion polarity towards an issue. Understanding public opinion leads to better decisions in many fields, such as public services and business. Language background plays a vital role in understanding opinion polarity. Variation is not only due to the vocabulary but also cultural background. The sentence is a time series signal; therefore, sequence gives a significant correlation to the meaning of the text. A recurrent neural network (RNN) is a variant of deep learning where the sequence is considered. Long short-term memory (LSTM) is an implementation of RNN with a particular gate to keep or ignore specific word signals during a sequence of inputs. Text is unstructured data, and it cannot be processed further by a machine unless an algorithm transforms the representation into a readable machine learning format as a vector of numerical values. Transformation algorithms range from the Term Frequency–Inverse Document Frequency (TF-IDF) transform to advanced word embedding. Word embedding methods include GloVe, word2vec, BERT, and fastText. This research experimented with those algorithms to perform vector transformation of the Arabic text dataset. This study implements and compares the GloVe and fastText word embedding algorithms and long short-term memory (LSTM) implemented in single-, double-, and triple-layer architectures. Finally, this research compares their accuracy for opinion mining on an Arabic dataset. It evaluates the proposed algorithm with the ASAD dataset of 55,000 annotated tweets in three classes. 
The dataset was augmented to achieve equal proportions of positive, negative, and neutral classes. According to the evaluation results, the triple-layer LSTM with fastText word embedding achieved the best testing accuracy, at 90.9%, surpassing all other experimental scenarios.
... Term Frequency-Inverse Document Frequency (TF-IDF) is another vectorization technique that considers the importance of each word within the corpus [28,29]. In TF-IDF, a document is represented as a vector whose length equals the number of distinct words in the corpus. ...
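The vector representation described here can be sketched as follows. This is an illustrative formulation only: it assumes relative term frequency multiplied by log(N / n), non-empty pre-tokenized documents, and hypothetical names and data; the cited work may define the scores differently.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors); each vector has one entry per vocabulary word.

    Assumes every document contains at least one token.
    """
    vocabulary = sorted({w for doc in tokenized_docs for w in doc})
    n_docs = len(tokenized_docs)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Relative term frequency times idf; words absent from the document get 0.
        vectors.append([counts[w] / len(doc) * idf[w] for w in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus of three tokenized documents.
docs = [["malicious", "macro", "macro"], ["plain", "text"], ["plain", "macro"]]
vocabulary, vectors = tfidf_vectors(docs)
```

Every vector has the same length as the vocabulary (four words here), so the documents can be compared position by position even though each contains only a few of the words.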
Article
Cyberattacks widely occur by using malicious documents. A malicious document is an electronic document containing malicious codes along with some plain-text data that is human-readable. In this paper, we propose a novel framework that takes advantage of such plaintext data to determine whether a given document is malicious. We extracted plaintext features from the corpus of electronic documents and utilized them to train a classification model for detecting malicious documents. Our extensive experimental results with different combinations of three well-known vectorization strategies and three popular classification methods on five types of electronic documents demonstrate that our framework provides high prediction accuracy in detecting malicious documents.
... Finally, a natural language processing analysis of the dictionary and of the five documents containing the sentences in relation with the ROCI-II prevalent dimension (class) of each negotiation style of the participant was conducted. The similarity between the documents and the term frequency-inverse document frequency (TF-IDF) scores were calculated [40]. The TF-IDF can be measured for each word w in a document d with vocabulary V d and weighs the frequency of that word in a document by the occurrence of it across all documents D, as shown below: ...
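One common way to compare documents once their TF-IDF scores are available is cosine similarity. The excerpt does not say which similarity measure was used, so the choice of cosine similarity, the function name, and the score vectors below are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A document with no weighted terms is treated as dissimilar to everything.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF score vectors over a shared four-word vocabulary.
doc_a = [0.4, 0.0, 0.1, 0.0]
doc_b = [0.2, 0.0, 0.05, 0.0]
doc_c = [0.0, 0.3, 0.0, 0.5]
sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Documents that weight the same words in the same proportions (`doc_a` and `doc_b`) score close to 1, while documents with no weighted words in common (`doc_a` and `doc_c`) score 0.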
Article
Negotiation constitutes a fundamental skill that applies to several daily life contexts; however, providing a reliable assessment and definition of it is still an open challenge. The aim of this research is to present an in-depth analysis of the negotiations occurring in a role-play simulation between users and virtual agents using Natural Language Processing. Users were asked to interact with virtual characters in a serious game that helps practice negotiation skills and to complete a psychological test that assesses conflict management skills on five dimensions. The dialogues of 425 participants with virtual agents were recorded, and a dataset comprising 4250 sentences was built. An analysis of the personal pronouns, word context, sentence length and text similarity revealed an overall consistency between the negotiation profiles and the user verbal choices. Integrating and Compromising users displayed a greater tendency to involve the other party in the negotiation using relational pronouns; on the other hand, Dominating individuals tended to use mostly single person pronouns, while Obliging and Avoiding individuals were shown to generally use fewer pronouns. Users with high Integrating and Compromising scores adopted longer sentences and chose words aimed at increasing the other party’s involvement, while more self-concerned profiles showed the opposite pattern.
... Different algorithms interpret the term ''unusual'' in different ways. For example, TF-IDF (Sparck Jones, 1972) identifies a word with a high frequency in a few specific documents rather than being evenly distributed over the corpus. Similarly, Likely (Paukkeri et al., 2008) selects phrases by taking the ratio of a rank value of a phrase in the documents to its rank value in the referenced corpus. ...
Article
A document’s keywords provide high-level descriptions of the content that summarize the document’s central themes, concepts, ideas, or arguments. These descriptive phrases make it easier for algorithms to find relevant information quickly and efficiently. It plays a vital role in document processing, such as indexing, classification, clustering, and summarization. Traditional keyword extraction approaches rely on statistical distributions of key terms in a document for the most part. According to contemporary technological breakthroughs, contextual information is critical in deciding the semantics of the work at hand. Similarly, context-based features may be beneficial in the job of keyword extraction. For example, simply indicating the previous or next word of the phrase of interest might be used to describe the context of a phrase. This research presents several experiments to validate that context-based key extraction is significant compared to traditional methods. Additionally, the KeyBERT proposed methodology also results in improved results. The proposed work relies on identifying a group of important words or phrases from the document’s content that can reflect the authors’ main ideas, concepts, or arguments. It also uses contextual word embedding to extract keywords. Finally, the findings are compared to those obtained using older approaches such as Text Rank, Rake, Gensim, Yake, and TF-IDF. The Journals of Universal Computer (JUCS) dataset was employed in our research. Only data from abstracts were used to produce keywords for the research article, and the KeyBERT model outperformed traditional approaches in producing similar keywords to the authors’ provided keywords. The average similarity of our approach with author-assigned keywords is 51%.
... To translate raw text inputs into representations exploitable by ML models, a vectorization must be performed. One of the classical NLP approaches to perform text vectorization is using Term Frequency-Inverse Document Frequency (TF-IDF), which consists of multiplying the frequency with which a term occurs in a phrase by a weight, in this case determined through a measure of its rarity in the vocabulary (Spärck Jones, 1972). In such a way, trivial terms are valued less than rare words that may be semantically more important. ...
Thesis
In the era of Industry 4.0, exploiting the data stored in information systems is one way to improve production systems. Indeed, these databases contain information that can be used by machine learning (ML) models to react better to future production disruptions. In the case of maintenance, data are frequently collected through reports written by operators. These reports are often written using free-text input fields, resulting in unstructured and complex data: they contain irregularities such as acronyms, jargon, typos, etc. In addition, maintenance data often exhibit skewed statistical distributions: some events occur more often than others. This phenomenon is known as "class imbalance" and can hinder the training of ML models, because they tend to learn the most frequent events better while ignoring the rarest. Finally, the deployment of Industry 4.0 technologies must ensure that humans remain in the decision-making loop. If this is not respected, companies may be reluctant to adopt these new technologies. This thesis is structured around the general objective of exploiting maintenance data to react better to production disruptions. To meet this objective, we used two strategies. On the one hand, we conducted a systematic literature review to identify trends and research perspectives concerning ML applied to production planning and control. This literature study allowed us to understand that predictive maintenance can benefit from unstructured data coming from operators.
Their use can contribute to including humans in the application of new technologies. On the other hand, we addressed some of the identified perspectives through case studies using data from real production systems. These case studies exploited textual data provided by operators that exhibited class imbalance. We explored the use of techniques to mitigate the effect of imbalanced data and proposed using a recent architecture called the "transformer" for natural language processing.
... Both BOW [12] and TF-IDF [13] are based on word counts. They are among the simple word vectorization algorithms which are widely used to classify text. ...
Preprint
Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver. Life-threatening outcomes, including liver failure or death, have also been reported in severe DILI cases. Therefore, DILI-related events are strictly monitored for all approved drugs, and liver toxicity has become an important assessment for new drug candidates. These DILI-related reports are documented in hospital records, in clinical trial results, and also in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, data extraction from previous publications relies heavily on resource-demanding manual labelling, which considerably decreases the efficiency of the information extraction process. The recent development of artificial intelligence, particularly the rise of natural language processing (NLP) techniques, has enabled the automatic processing of biomedical texts. In this study, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked model performance on filtering DILI literature. Among four word vectorization techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed the others, with an accuracy of 0.957 on our in-house test set. Furthermore, an ensemble model with similar overall performance was implemented and fine-tuned to reduce false negatives, so as to avoid neglecting potential DILI reports. The ensemble model achieved a high accuracy of 0.954 and an F1 score of 0.955 on the hold-out validation data provided by the CAMDA committee. Moreover, important words in positive/negative predictions were identified via model interpretation. Overall, the ensemble model reached satisfactory classification results and can be used by researchers to rapidly filter DILI-related literature.
... The measures Goodall 1 (G1), Goodall 2 (G2), Goodall 3 (G3), Goodall 4 (G4), which are based on the original Goodall measure (Goodall, 1966), and Lin 1 (LIN1) were introduced by Boriah et al. (2008). The Eskin (ES) measure was proposed by Eskin et al. (2002), the Lin (LIN) measure by Lin (1998), the simple matching coefficient (SM) by Sokal and Michener (1958), the measures occurrence frequency (OF) and inverse occurrence frequency (IOF) by Spärck Jones (1972), and the measures variable entropy (VE) and variable mutability (VM) by Šulc and Řezanková (2019). ...
Article
In this paper, we present the second generation of the nomclust R package, which we developed for the hierarchical clustering of data containing nominal variables (nominal data). The package completely covers the hierarchical clustering process, from dissimilarity matrix calculation, over the choice of a clustering method, to the evaluation of the final clusters. Through the whole clustering process, similarity measures, clustering methods, and evaluation criteria developed solely for nominal data are used, which makes this package unique. In the first part of the paper, the theoretical background of the methods used in the package is described. In the second part, the functionality of the package is demonstrated in several examples. The second generation of the package is completely rewritten to be more natural for the workflow of R users. It includes new similarity measures and evaluation criteria. We also added several graphical outputs and support for S3 generic functions. Finally, due to code optimizations, the calculation time of dissimilarity matrix calculation was substantially reduced.
... These representations can be useful to model similarity between documents based on the assumption that similar documents contain the same subsets of words. They can be extended to word frequencies, possibly normalized to give more importance to words occurring more rarely like in the TF-IDF model (Sparck Jones, 1972). ...
Preprint
During the past decade, neural networks have become prominent in Natural Language Processing (NLP), notably for their capacity to learn relevant word representations from large unlabeled corpora. These word embeddings can then be transferred and finetuned for diverse end applications during a supervised training phase. More recently, in 2018, the transfer of entire pretrained Language Models and the preservation of their contextualization capacities enabled to reach unprecedented performance on virtually every NLP benchmark, sometimes even outperforming human baselines. However, as models reach such impressive scores, their comprehension abilities still appear as shallow, which reveal limitations of benchmarks to provide useful insights on their factors of performance and to accurately measure understanding capabilities. In this thesis, we study the behaviour of state-of-the-art models regarding generalization to facts unseen during training in two important Information Extraction tasks: Named Entity Recognition (NER) and Relation Extraction (RE). Indeed, traditional benchmarks present important lexical overlap between mentions and relations used for training and evaluating models, whereas the main interest of Information Extraction is to extract previously unknown information. We propose empirical studies to separate performance based on mention and relation overlap with the training set and find that pretrained Language Models are mainly beneficial to detect unseen mentions, in particular out-of-domain. While this makes them suited for real use cases, there is still a gap in performance between seen and unseen mentions that hurts generalization to new facts. In particular, even state-of-the-art ERE models rely on a shallow retention heuristic, basing their prediction more on arguments surface forms than context.
... A term-weighting factor was optionally applied to the database to enhance the clustering result. Term frequency-inverse document frequency (tf-idf) is a universal numerical statistic that reflects the importance of the terms in a corpus that are often used for information retrieval and text mining [188]. Term frequency, tf(t,d), is the frequency of term t, where ft,d is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. ...
Thesis
Stagnant productivity and workforce shortage are global problems in the Architecture, Engineering and Construction (AEC) industry. Particularly in Japan, slow digital transformation causes lag in leveraging technologies. Building Information Modeling (BIM), a novel digital platform internationally spreading, is expected to enhance productivity. Still, its implementation and collaboration in practice have remained major issues for the last decade. The research question of the thesis is threefold. First, how can a layperson decipher the BIM activity in a data-driven manner? Second, what are the traits of BIM activities in large-scale projects? Last, how can the key BIM cooperator in collaborative projects be specified? The literature review revealed that BIM log mining, a machine-learning-based process mining method, is an emerging and plausible approach. Preparatory studies discovered that outsourcing the modeling workforce could rather disengage project architects from the activities executed in BIM. The proposed methodology introduced visual analytics to assess the result for laypersons. Three different datasets comprise extensive and multidisciplinary BIM records collected from a broad range of organizations, including supplemental big data to overcome the drawbacks of the existing methods. The classification process and visualization are incrementally tested through the empirical chapters. The devised method identified a group of collaborative BIM users in the corporation despite considerable dependence on external BIM workforce termed as BIM operators. Those players can be interpreted as keystone species in the corporate BIM environment. The interview further revealed that mutual respect motivates practitioners for successful BIM learning. The proposed BIM log mining approach is novel, versatile, and comprehensive; different datasets proved its utility. 
The thesis recommends BIM education aiming for such cooperative BIM practitioners and further research on BIM operators for more successful interpretation to the local ecosystem.
... Arguably, the most common document representations are Bag of Words (BoW) and TF-IDF (Harris 1954; Sparck Jones 1972). Although commonly used in the literature, these representations result in complex models that are difficult to interpret for several reasons, such as the high dimensionality of feature vectors, ... (Footnote 3: We refer to a document as a distinct text or a minimum piece of information.) ...
Article
Extremist online networks reportedly tend to use Twitter and other Social Networking Sites (SNS) in order to issue propaganda and recruitment statements. Traditional machine learning models may encounter problems when used in such a context, due to the peculiarities of microblogging sites and the manner in which these networks interact (both between themselves and with other networks). Moreover, state-of-the-art approaches have focused on non-transparent techniques that cannot be audited; so, despite the fact that they are top performing techniques, it is impossible to check if the models are actually fair. In this paper, we present a semi-supervised methodology that uses our Discriminatory Expressions algorithm for feature selection to detect expressions that are biased towards extremist content (Francisco and Castro 2020). With the help of human experts, the relevant expressions are filtered and used to retrieve further extremist content in order to iteratively provide a set of relevant and accurate expressions. These discriminatory expressions have been proved to produce less complex models that are easier to comprehend, and thus improve model transparency. In the following, we present close to 70 expressions that were discovered by using this method alongside the validation test of the algorithm in several different contexts.
... This version used a score ranging from A+ to C to measure the effectiveness of the level check, which was removed from the framework for the GRI G4 version. As such, Liu et al. [42] utilised the term frequency-inverse document frequency (TF-IDF) [43] method to obtain important and specific terms for different analytical algorithms and shallow machine-learning models. The previously described methods and other more recent ones have been applied successfully in other problems, such as textual similarity in legal-court-case reports [44], biomedical texts from scholarly articles and medical databases [45,46] or network-analytic approaches for assessing the performance of family businesses in tourism [47]. ...
Article
This paper aims to evaluate the degree of affinity among Nordic companies’ reports published under the Global Reporting Initiative (GRI) framework. Several natural language processing and text-mining techniques were implemented and tested to achieve this goal. We extracted string-based, corpus-based, and hybrid semantic similarities from the reports and evaluated the models through the intrinsic assessment methodology. A quantitative ranking score based on index matching was developed to complement the semantic valuation. The final results show that Latent Semantic Analysis (LSA) and Global Vectors for word representation (GloVe) are the best methods for our study. Our findings will open the door to the automatic evaluation of sustainability reports which could have a substantial impact on the environment.
... It has implementations for all twenty-five selected retrieval models used in the empirical investigation proposed in this article. These include TF-IDF [44,61,63,68], LemurTF_IDF [80], DLH [3], DLH13 [3,45], DPH [3,5], BM25 [34,48,61,69], DFR_BM25 [2,4], InL2 [2,4], InB2 [44], In_expB2 [2,4], In_expC2 [2,4], IFB2 [2,4], PL2 [2,4], BB2 [2,4], DFIC [22,39], DFIZ [22,39], DFRee [44], DFReeKLIM [5,44], DirichletLM [59,81], HiemstraLM [27], LGD [19,20], ML2 [58], MDL2 [58], PL2F [46], and BM25F [79]. To make these models easier to understand, Table 4 of Appendix 1 lists all the symbols used in their mathematical representations along with their intended meaning and usage. ...
Article
Full-text available
Social Book Search (SBS) studies how the Social Web impacts book retrieval. This impact is studied in two steps. In the first step, called the baseline run, the search index holding bibliographic descriptions (professional metadata) and user-generated content (social metadata) is searched against the search queries and ranked using a retrieval model. In the second step, called re-ranking, the baseline search results are re-ordered using social metadata to see if the search relevance improves. However, this improvement in the search relevance can only be justified if the baseline run is made stronger by considering the contribution of the query, index, and retrieval model. Although existing studies have thoroughly explored the role of query formulation and document representation, only a few considered the contribution of the retrieval models, and those experimented with only a few retrieval models. This article fills this gap in the literature. It identifies the best retrieval model in the SBS context by experimenting with twenty-five retrieval models using the Terrier IR platform on the Amazon/LibraryThing dataset holding topic sets, relevance judgments, and a book corpus of 2.8 million records. The findings suggest that these retrieval models behave differently with changes in query and document representation. DirichletLM and InL2 are the best-performing retrieval models for a majority of the retrieval runs. The previous best-performing SBS studies would have produced better results if they had tested multiple retrieval models in selecting baseline runs. The findings confirm that the retrieval model plays a vital role in developing stronger baseline runs.
... This study generated TF-IDF scores from the corpus to identify articles weighted highly compared to others and to extract the terms with high TF-IDF scores. TF-IDF is a technique used under the umbrella of NLP for classification and text summarization (Jones 1972; Ramos 2003; Wu et al. 2008). This technique derives the importance of terms extracted from the selected corpus. ...
Article
Full-text available
Issues of the environmental crisis are being addressed by researchers, government, and organizations alike. GHRM is one such field that is receiving lots of research focus since it is targeted at greening the firms and making them eco-friendly. This research reviews 317 articles from the Scopus database published on green human resource management (GHRM) from 2008 to 2021. The study applies text mining, latent semantic analysis (LSA), and network analysis to explore the trends in the research field in GHRM and establish the relationship between the quantitative and qualitative literature of GHRM. The study has been carried out using KNIME and VOSviewer tools. As a result, the research identifies five recent research trends in GHRM using K-means clustering. Future researchers can work upon these identified trends to solve environmental issues, make the environment eco-friendly, and motivate firms to implement GHRM in their practices.
... Many rule-based classification systems have been designed; however, rule development by domain experts is time-consuming and may not generalize well across text input sources. 1 Term frequency-inverse document frequency and Latent Dirichlet Allocation are two powerful unsupervised text classification methods which utilize bag-of-words unigram text representations and do not require expert involvement for rule creation. 2,3 However, bag-of-words approaches are limited in that they require large vocabularies, do not handle misspellings, and assume word independence; preprocessing can improve performance. 4 Word embeddings, such as Word2Vec and GloVe, have accelerated the field by encoding semantic similarity in word vector representations, enabling higher level language features such as analogy. ...
Article
Selecting appropriate consultations for self-referred patients to tertiary medical centers is a time and resource intensive task. Deep learning with natural language processing can potentially augment this task and reduce clinician workload. Appointment request forms for 8168 patients self-referred to General Internal Medicine were reviewed and recommended downstream appointments from manual triage were tabulated. This paper describes the development and performance of thirty-nine deep learning algorithms for multi-label text classification, including convolutional neural networks, recurrent neural networks, and pretrained language models with transformer and reformer architectures, implemented using PyTorch and trained on a single graphics processing unit. A model with multiple convolutional neural networks with various kernel sizes (1-7 words) and 300 dimensional FastText word embeddings performed best (AUC 0.949, MCC 0.734, F1 0.775). Generally, models with convolutional networks were highest performers. Highly performing models may be candidates for implementation to augment clinician workflow.
... The average, minimum and maximum values of these distances are taken for nouns and verbs found in GermaNet. IDF: Another set of features is based on the inverse document frequency (IDF) of a word w (Sparck Jones, 1972), defined as $\log(N/n)$, where N is the number of documents in a corpus, and n is the number of documents that contain the word w. We used 3 million German sentences taken from newspaper texts in 2015 4 from the Leipzig Corpus Collection 3 https://spacy.io ...
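The IDF definition quoted above can be illustrated with a minimal sketch. The function name `inverse_document_frequency` and the toy corpus are hypothetical and are not taken from the cited work; the natural logarithm is assumed, since the snippet does not state a base, and each document is assumed to be already tokenized into a list of words.

```python
import math


def inverse_document_frequency(documents):
    """Return {word: log(N / n)} for every word seen in the tokenized documents.

    N is the number of documents; n is the number of documents containing the word.
    """
    total_docs = len(documents)  # N
    doc_counts = {}
    for tokens in documents:
        # set() ensures a word is counted once per document, not once per occurrence
        for word in set(tokens):
            doc_counts[word] = doc_counts.get(word, 0) + 1  # n
    return {word: math.log(total_docs / n) for word, n in doc_counts.items()}


# Hypothetical toy corpus of four tokenized "documents"
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
    ["the", "bird", "sang"],
]
idf = inverse_document_frequency(corpus)
```

In this toy corpus, "the" occurs in all four documents and so receives an IDF of 0, "cat" occurs in two and receives log 2, and "dog" occurs in one and receives log 4. Rarer words therefore get larger weights, which is the behavior the definition is meant to capture.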
Conference Paper
Full-text available
With the growth of online learning through MOOCs and other educational applications, it has become increasingly difficult for course providers to offer personalized feedback to students. Therefore asking students to provide feedback to each other has become one way to support learning. This peer-to-peer feedback has become increasingly important whether in MOOCs to provide feedback to thousands of students or in large-scale classes at universities. One of the challenges when allowing peer-to-peer feedback is that the feedback should be perceived as helpful, and an important factor determining helpfulness is how specific the feedback is. However, in classes including thousands of students, instructors do not have the resources to check the specificity of every piece of feedback between students. Therefore, we present an automatic classification model to measure sentence specificity in written feedback. The model was trained and tested on student feedback texts written in German where sentences have been labelled as general or specific. We find that we can automatically classify the sentences with an accuracy of 76.7% using a conventional feature-based approach, whereas transfer learning with BERT for German gives a classification accuracy of 81.1%. However, the feature-based approach comes with lower computational costs and preserves human interpretability of the coefficients. In addition we show that specificity of sentences in feedback texts has a weak positive correlation with perceptions of helpfulness. This indicates that specificity is one of the ingredients of good feedback, and invites further investigation.
Article
Full-text available
Bursts and collective emotion have been widely studied in the field of social physics, where researchers use mathematical models to understand human social dynamics. However, few studies recognize and separately analyze the internal and external influences on burst behaviors. To bridge this gap, we introduce a non-parametric approach to classify an interevent time series into five scenarios: random arrival, endogenous burst, endogenous non-burst, exogenous burst and exogenous non-burst. In order to process large-scale social media data, we first segment the interevent time series into sections by detecting change points. Then we use a rule-based algorithm to classify the time series based on its distribution. To validate our model, we analyze 27.2 million COVID-19 related comments collected from Chinese social media between January and October 2020. We adopt the emotion category called Profile of Mood States which consists of six emotions: Anger, Depression, Fatigue, Vigor, Tension and Confusion. This enables us to compare the burst features of different collective emotions during the COVID-19 period. The burst detection and classification approach introduced in this paper can also be applied to analyzing other complex systems, including but not limited to social media, financial market and signal processing.
Article
The success and popularity of deep learning is on the rise, partially due to powerful deep learning frameworks such as TensorFlow and PyTorch that make it easier to develop deep learning models. However, these libraries also come with steep learning curves, since programming in these frameworks is quite different from traditional imperative programming with explicit loops and conditionals. In this work, we present a tool called TF-Coder for programming by example in TensorFlow. TF-Coder uses a bottom-up weighted enumerative search, with value-based pruning of equivalent expressions and flexible type- and value-based filtering to ensure that expressions adhere to various requirements imposed by the TensorFlow library. We train models to predict TensorFlow operations from features of the input and output tensors and natural language descriptions of tasks, to prioritize relevant operations during search. TF-Coder solves 63 of 70 real-world tasks within 5 minutes, sometimes finding simpler solutions in less time compared to experienced human programmers.
Article
Purpose: The performance of behavioral targeting (BT) mainly relies on the effectiveness of user classification since advertisers always want to target their advertisements to the most relevant users. In this paper, the authors frame the BT as a user classification problem and describe a machine learning–based approach for solving it. Design/methodology/approach: To perform such a study, two major research questions are investigated: the first question is how to represent a user’s online behavior. A good representation strategy should be able to effectively classify users based on their online activities. The second question is how different representation strategies affect the targeting performance. The authors propose three user behavior representation methods and compare them empirically using the area under the receiver operating characteristic curve (AUC) as a performance measure. Findings: The experimental results indicate that ad campaign effectiveness can be significantly improved by combining user search queries, clicked URLs and clicked ads as a user profile. In addition, the authors also explore the temporal aspect of user behavior history by investigating the effect of history length on targeting performance. The authors note that an improvement of approximately 6.5% in AUC is achieved when user history is extended from 1 day to 14 days, which is substantial in targeting performance. Originality/value: This paper confirms the effectiveness of BT on user classification and provides a validation of BT for Internet advertising.
Article
A BERT-MDLP-Bayesian Network model (BMB) is proposed to analyze the improvement strategy of e-commerce products based on user generated content (UGC). The proposed model can be divided into four parts: clearing redundant data on the obtained UGC, extracting product attributes and word vector to generate product attributes, establishing product attribute Bayesian network corresponding to UGC, and inferring the causal relationship between product attributes. In order to verify the effectiveness of the proposed model, an Amazon tablet product is used for empirical analysis. Compared with the traditional model, the BMB model has better performance in product feature mining in three aspects: feature diversity, feature long tail and attribute difference. In application, the model can effectively describe the core problems of products, and provide suggestions for e-commerce to modify marketing strategies and determine the new direction of product development.
Article
There is an ongoing debate on whether wine reviews provide meaningful information on wine properties and quality. However, few studies have been conducted aiming directly at comparing the utility of wine reviews and numeric measurements in wine data analysis. Based on data from close to 300,000 wines reviewed by Wine Spectator, we use logistic regression models to investigate whether wine reviews are useful in predicting a wine's quality classification. We group our sample into one of two binary quality brackets, wines with a critical rating of 90 or above and the other group with ratings of 89 or below. This binary outcome constitutes our dependent variable. The explanatory variables include different combinations of numerical covariates such as the price and age of wines and numerical representations of text reviews. By comparing the explanatory accuracy of the models, our results suggest that wine review descriptors are more accurate in predicting binary wine quality classifications than are various numerical covariates, including the wine's price. In the study, we include three different feature extraction methods in text analysis: latent Dirichlet allocation, term frequency-inverse document frequency, and Doc2Vec text embedding. We find that Doc2Vec is the best performing feature extraction method that produces the highest classification accuracy due to its capability of using contextual information from text documents. (JEL Classifications: C45, C88, D83)
Article
Many migrants are vulnerable due to noncitizenship, linguistic or cultural barriers, and inadequate safety-net infrastructures. Immigrant-oriented nonprofits can play an important role in improving immigrant well-being. However, progress on systematically evaluating the impact of nonprofits has been hampered by the difficulty in efficiently and accurately identifying immigrant-oriented nonprofits in large administrative data sets. We tackle this challenge by employing natural language processing (NLP) and machine learning (ML) techniques. Seven NLP algorithms are applied and trained in supervised ML models. The bidirectional encoder representations from transformers (BERT) technique offers the best performance, with an impressive accuracy of .89. Indeed, the model outperformed two nonmachine methods used in existing research, namely, identification of organizations via National Taxonomy of Exempt Entities codes or keyword searches of nonprofit names. We thus demonstrate the viability of computer-based identification of hard-to-identify nonprofits using organizational name data, a technique that may be applicable to other research requiring categorization based on short labels. We also highlight limitations and areas for improvement.
Chapter
Information retrieval is the process of satisfying user information needs that are expressed as textual queries. Search engines represent a Web-specific example of the information retrieval paradigm. The problem of Web search has many additional challenges, such as the collection of Web resources, the organization of these resources, and the use of hyperlinks to aid the search. Whereas traditional information retrieval only uses the content of documents to retrieve results of queries, the Web requires stronger mechanisms for quality control because of its open nature. Furthermore, Web documents contain significant meta-information and zoned text, such as title, author, or anchor text, which can be leveraged to improve retrieval accuracy.
Article
We are living in the age of recommendations: it has been estimated that two-thirds of the films viewed on Netflix come from recommendations, while 35% of Amazon sales involve goods suggested to users. There are many factors to consider when providing a new suggestion: in addition to being useful, it should also be relevant and serendipitous, starting from historical data previously collected. In particular, the notion of context has to be considered since it induces some dynamic aspects in the definition of user preferences. The role of context becomes particularly important when we shift from single (myopic) suggestions to be provided to an individual user, to sequences of recommendations for groups of users. When the preferences of individual users are combined to define the preference of a new ephemeral group, dynamic contextual concerns have to be considered in order to provide the best possible experience and extend the group life, preventing the defection of some members because their preferences are not balanced. In this paper we introduce our proposal for producing sequences of recommendations for groups of users which is based on the Multi-Objective Simulated Annealing optimization technique and takes into account dynamic aspects. Moreover, we propose some strategies for extracting the required dynamic information from log data typically available and present the experimental results of the application of our approach in some real-world case studies.
Article
The fusion of two or more different data sources is a widely accepted technique in remote sensing while becoming increasingly important due to the availability of big Earth Observation satellite data. As a complementary source of geo-information to satellite data, massive text messages from social media form a temporally quasi-seamless, spatially multi-perspective stream, but with unknown and diverse quality. Despite the uncontrolled quality, can linguistic features extracted from geo-referenced tweets support remote sensing tasks? This work presents a straightforward decision fusion framework for very high-resolution remote sensing images and Twitter text messages. We apply our proposed fusion framework to a land-use classification task – the building function classification task – in which we classify building functions like commercial or residential based on linguistic features derived from tweets and remote sensing images. Using building tags from OpenStreetMap (OSM), we labeled tweets and very high-resolution (VHR) images from Google Maps. We collected English tweets from San Francisco, New York City, Los Angeles, and Washington D.C. and trained a stacked bi-directional LSTM neural network with these tweets. For the aerial images, we predicted building functions with state-of-the-art Convolutional Neural Network (CNN) architectures fine-tuned from ImageNet on the given task. After predicting each modality separately, we combined the prediction probabilities of both models building-wise at a decision level. We show that the proposed fusion framework can improve the classification results of the building type classification task. To the best of our knowledge, we are the first to use semantic contents of Twitter messages and fuse them with remote sensing images to classify building functions at a single building level.
Article
Existing methods to measure sentence similarity are faced with two challenges: (1) labeled datasets are usually limited in size, making them insufficient to train supervised neural models; and (2) there is a training-test gap for unsupervised language modeling (LM) based models to compute semantic scores between sentences, since sentence-level semantics are not explicitly modeled at training. This results in inferior performance on this task. In this work, we propose a new framework to address these two issues. The proposed framework is based on the core idea that the meaning of a sentence should be defined by its contexts, and that sentence similarity can be measured by comparing the probabilities of generating two sentences given the same context. The proposed framework is able to generate high-quality, large-scale datasets with semantic similarity scores between two sentences in an unsupervised manner, with which the train-test gap can be largely bridged. Extensive experiments show that the proposed framework achieves significant performance boosts over existing baselines under both the supervised and unsupervised settings across different datasets.
Article
Nowadays, the explosive growth in text data emphasizes the need for developing new and computationally efficient methods and credible theoretical support tailored for analyzing such large‐scale data. Given the vast amount of this kind of unstructured data, the majority of it is not classified, hence unsupervised learning techniques prove to be useful in this field. Document clustering has proven to be an efficient tool in organizing textual documents and it has been widely applied in different areas from information retrieval to topic modeling. Before introducing the proposals of document clustering algorithms, the principal steps of the whole process, including the mathematical representation of documents and the preprocessing phase, are discussed. Then, the main clustering algorithms used for text data are critically analyzed, considering prototype‐based, graph‐based, hierarchical, and model‐based approaches.
Chapter
In this study, the authors give both theoretical and experimental information about text mining, which is one of the natural language processing topics. Three different text mining problems such as news classification, sentiment analysis, and author recognition are discussed for Turkish. They aim to reduce the running time and increase the performance of machine learning algorithms. Four different machine learning algorithms and two different feature selection metrics are used to solve these text classification problems. Classification algorithms are random forest (RF), logistic regression (LR), naive bayes (NB), and sequential minimal optimization (SMO). Chi-square and information gain metrics are used as the feature selection method. The highest classification performance achieved in this study is 0.895 according to the F-measure metric. This result is obtained by using the SMO classifier and information gain metric for news classification. This study is important in terms of comparing the performances of classification algorithms and feature selection methods.
Article
Full-text available
The difficulty of deriving value from the vast available scientific literature in a condensed form led us to look for a proficient theme-based summarization solution that can preserve precise biomedical content. The study aims to analyze the impact of combining semantic biomedical concept extraction, frequent item-set mining and clustering techniques on information retention, objective functions and ROUGE values for the final summary. The suggested frequent item-set mining and clustering (FRI-CL) graph-based framework uses the UMLS Metathesaurus and BERT-based semantic embeddings to identify domain-relevant concepts. The scrutinized concepts are mined according to their relationship with neighbors and frequency via an amended FP-Growth model. The framework utilizes S-DPMM clustering, which is a probabilistic mixture model and aids in the identification and clubbing of complex relevant patterns to increase coverage of important sub-themes. The sentences with the frequent concepts are scored via PageRank to form an efficient and compelling summary. The research experiments on the 100 sample biomedical documents taken from PubMed archives are evaluated via calculation of ROUGE scores, coverage, readability, non-redundancy, memory utilization and information retention from the summary output. The results with the FRI-CL summarization system showcased 10% ROUGE performance improvement and are at par with the other baseline methods. On average, a 30–40% improvement in memory utilization is observed with up to 50% information retention when experiments are performed using S-DPMM clustering. The research indicates that the fusion of semantic mapping, clustering, and frequent item-set mining of biomedical concepts enhances the overall correlated information covering all sub-themes.
Article
This article explores three aspects of Masterman's language work and applies them to questions of spiritual intelligence: metaphor, coherence, and ambiguity. First, metaphor, which is ubiquitous in ordinary language, both leads and misleads in religious and scientific understanding. Masterman's case for a “dual‐approach” to thinking, both speculative and critical, is explored and tied to concepts of moral‐spiritual development per Pierre Hadot and Hannah Arendt. Second, Masterman's work on machine translation presents semantic disambiguation as an emerging coherence wherein one gradually hones in on meaning through features of ordinary language (like redundancy and repetition). This is applied to the problem of comprehending difficult spiritual language, and tied to spiritual stretching and spiritual cartography. Third, Masterman's work with thesauri, rather than relying on words as having fixed meanings, appeals to a concept of semantic spaces, nebulae of variously interconnected meanings. This is constructed into an exhortation to reambiguate overfamiliar religious language, to reinvest one's quotidian surroundings with spiritual meaning through defamiliarization.
Article
Hate speech is any kind of communication that attacks a person or a group based on their characteristics, such as gender, religion and race. Due to the availability of online platforms where people can express their (hateful) opinions, the amount of hate speech is steadily increasing, which often leads to offline hate crimes. This paper focuses on understanding and detecting hate speech in underground hacking and extremist forums where cybercriminals and extremists, respectively, communicate with each other, and some of them are associated with criminal activity. Moreover, because posts can be lengthy, it would be beneficial to identify the specific span of text containing hateful content in order to assist site moderators with the removal of hate speech. This paper describes a hate speech dataset composed of posts extracted from HackForums, an online hacking forum, and Stormfront and Incels.co, two extremist forums. We combined our dataset with a Twitter hate speech dataset to train a multi-platform classifier. Our evaluation shows that a classifier trained on multiple sources of data does not always improve the performance compared to a mono-platform classifier. Finally, this is the first work on extracting hate speech spans from longer texts. The paper fine-tunes BERT (Bidirectional Encoder Representations from Transformers) and adopts two approaches – span prediction and sequence labelling. Both approaches successfully extract hateful spans and achieve an F1-score of at least 69%.
Chapter
Automatic indexing is a challenging task in which computers must emulate the behaviour of professional indexers to assign to a document some keywords or keyphrases that represent concisely the content of the document. While most of the existing algorithms are based on a select-and-rank strategy, it has been shown that selecting only keywords from text is not ideal as human annotators tend to assign keywords that are not present in the source. This problem is more evident in scholarly literature. In this work we leverage a transformer-based language model to approach the automatic indexing task from a generative point of view. In this way we overcome the problem of keywords that are not in the original document, as the neural language models can rely on knowledge acquired during their training process. We apply our method to a French collection of annotated scientific articles.
Conference Paper
Content-based semantics-driven recommender systems are often used in the small-scale news recommendation domain, founded on the TF-IDF measure but also taking into account domain semantics through semantic lexicons or ontologies. This work explores the application of content-based semantics-driven recommender systems to large-scale recommendations, using the movie domain as an example. We propose methods to extract semantic features from various item descriptions, including images. In particular, we use computer vision to extract semantic features from images and use these for recommendation together with various features extracted from textual information. The semantics-driven approach is scaled up with pre-computation of the cosine similarities and gradient learning of the model. The results of the study on a large-scale MovieLens dataset of user ratings demonstrate that semantics-driven recommenders can be extended to more complex domains and outperform TF-IDF on ROC, PR, F1, and Kappa metrics.
Article
Full-text available
In an effort to gauge the global pandemic’s impact on social thoughts and behavior, it is important to answer the following questions: (1) What kinds of topics are individuals and groups vocalizing in relation to the pandemic? (2) Are there any noticeable topic trends and if so how do these topics change over time and in response to major events? In this paper, through the advanced Sequential Latent Dirichlet Allocation model, we identified twelve of the most popular topics present in a Twitter dataset collected over the period spanning April 3rd to April 13th, 2020 in the United States and discussed their growth and changes over time. These topics were both robust, in that they covered specific domains, not simply events, and dynamic, in that they were able to change over time in response to rising trends in our dataset. They spanned politics, healthcare, community, and the economy, and experienced macro-level growth over time, while also exhibiting micro-level changes in topic composition. Our approach differentiated itself in both scale and scope to study the emerging topics concerning COVID-19 at a scale that few works have been able to achieve. We contributed to the cross-sectional field of urban studies and big data. Whereas we are optimistic towards the future, we also understand that this is an unprecedented time that will have lasting impacts on individuals and society at large, impacting not only the economy or geo-politics, but human behavior and psychology. Therefore, in more ways than one, this research is just beginning to scratch the surface of what will be a concerted research effort into studying the history and repercussions of COVID-19.
Article
Full-text available
Judgments concerning animals have arisen across a variety of established practice areas. There is, however, no publicly available repository of judgments concerning the emerging practice area of animal protection law. This has hindered the identification of individual animal protection law judgments and comprehension of the scale of animal protection law made by courts. Thus, we detail the creation of an initial animal protection law repository using natural language processing and machine learning techniques. This involved domain expert classification of 500 judgments according to whether or not they were concerned with animal protection law. 400 of these judgments were used to train various models, each of which was used to predict the classification of the remaining 100 judgments. The predictions of each model were superior to a baseline measure intended to mimic current searching practice, with the best performing model being a support vector machine (SVM) approach that classified judgments according to term frequency—inverse document frequency (TF-IDF) values. Investigation of this model consisted of considering its most influential features and conducting an error analysis of all incorrectly predicted judgments. This showed the features indicative of animal protection law judgments to include terms such as ‘welfare’, ‘hunt’ and ‘cull’, and that incorrectly predicted judgments were often deemed marginal decisions by the domain expert. The TF-IDF SVM was then used to classify non-labelled judgments, resulting in an initial animal protection law repository. Inspection of this repository suggested that there were 175 animal protection judgments between January 2000 and December 2020 from the Privy Council, House of Lords, Supreme Court and upper England and Wales courts.
Article
We have recently witnessed a radical shift towards digitisation in many aspects of our daily life, including law, public administration and governance. This has sometimes been done with the aim of reducing costs and human errors by improving data analysis and management, but not without raising major technological challenges. One of these challenges is certainly the need to cope with relatively small amounts of data, without sacrificing performance. Indeed, cutting-edge approaches to (natural) language processing and understanding are often data-hungry, especially those based on deep learning. With this paper we seek to address the problem of data scarcity in automatic Legalese (or legal English) processing and understanding. What we propose is an ensemble of shallow and deep learning techniques called SyntagmTuner, designed to combine the accuracy of deep learning with the ability of shallow learning to work with little data. Our contribution is based on the assumption that Legalese differs from its spoken language in the way the meaning is encoded by the structure of the text and the co-occurrence of words. As a result, we show with SyntagmTuner how we can perform important tasks for e-governance, such as multi-label classification of the United Nations General Assembly (UNGA) Resolutions or legal question answering, with data sets of roughly 100 samples or even fewer.
Article
Existing visualization recommendation systems commonly rely on a single snapshot of a dataset to suggest visualizations to users. However, exploratory data analysis involves a series of related interactions with a dataset over time rather than one‐off analytical steps. We present Solas, a tool that tracks the history of a user's data analysis, models their interest in each column, and uses this information to provide visualization recommendations, all within the user's native analytical environment. Recommending with analysis history improves visualizations in three primary ways: task‐specific visualizations use the provenance of data to provide sensible encodings for common analysis functions, aggregated history is used to rank visualizations by our model of a user's interest in each column, and column data types are inferred based on applied operations. We present a usage scenario and a user evaluation demonstrating how leveraging analysis history improves in situ visualization recommendations on real‐world analysis tasks.
Article
Compared with common intelligent services, full-scene intelligent service is unique in its high integration, synergy, and technological spillover. However, traditional service or business model theories cannot precisely elaborate its sociotechnical contextual nature and value creation logic. To fill this knowledge gap, we provide initial insights into the value co-creation logic in full-scene intelligent service by exploring the value co-creation elements using a data-driven text mining approach. We analyzed 171 business reports on full-scene intelligent service by topic modeling with Latent Dirichlet Allocation (LDA). The findings reveal three main clusters: value proposition, participants, and connection platform. This study presents a theoretical framework for further exploratory case studies and quantitative research on full-scene intelligent service. This study also helps small and medium-sized enterprises to explore and exploit value co-creation opportunities.
Article
Full-text available
Artificial Intelligence (AI) is having an enormous impact on the rise of technology in every sector. Indeed, AI-powered systems are monitoring and deciding on sensitive economic and societal issues. The future is moving towards automation, and we must not prevent it. Many people, though, have opposing views because of the fear of uncontrollable AI systems. This concern could be reasonable if it originated from considerations associated with social issues, like gender-biased or obscure decision-making systems. Explainable AI (XAI) is a tremendous step towards reliable systems, enhancing the trust of people in AI. Interpretable machine learning (IML), a subfield of XAI, is also an urgent topic of research. This paper presents a small but significant contribution to the IML community. We focus on a local-based, neural-specific interpretation process applied to textual and time series data. Therefore, the proposed technique, which we call “LioNets”, introduces novel approaches to present feature importance-based interpretations. We propose an innovative way to produce counterfactual words in textual datasets. Through a set of quantitative and qualitative experiments, we present competitiveness of LioNets compared to other techniques and suggest its usefulness.
Article
Context: Documented goals-of-care discussions are an important quality metric for patients with serious illness. Natural language processing (NLP) is a promising approach for identifying goals-of-care discussions in the electronic health record (EHR). Objectives: To compare three NLP modeling approaches for identifying EHR documentation of goals-of-care discussions and generate hypotheses about differences in performance. Methods: We conducted a mixed-methods study to evaluate performance and misclassification for three NLP featurization approaches modeled with regularized logistic regression: bag-of-words (BOW), rule-based, and a hybrid approach. From a prospective cohort of 150 patients hospitalized with serious illness over 2018-2020, we collected 4,391 inpatient EHR notes; 99 (2.3%) contained documented goals-of-care discussions. We used leave-one-out cross-validation to estimate performance by comparing pooled NLP predictions to human abstractors with receiver-operating-characteristic (ROC) and precision-recall (PR) analyses. We qualitatively examined a purposive sample of 70 NLP-misclassified notes using content analysis to identify linguistic features that allowed us to generate hypotheses underpinning misclassification. Results: All three modeling approaches discriminated between notes with and without goals-of-care discussions (AUCROC: BOW, 0.907; rule-based, 0.948; hybrid, 0.965). Precision and recall were only moderate (precision at 70% recall: BOW, 16.2%; rule-based, 50.4%; hybrid, 49.3%; AUCPR: BOW, 0.505; rule-based, 0.579; hybrid, 0.599). Qualitative analysis revealed patterns underlying performance differences between BOW and rule-based approaches. Conclusion: NLP holds promise for identifying EHR-documented goals-of-care discussions. However, the rarity of goals-of-care content in EHR data limits performance. Our findings highlight opportunities to optimize NLP modeling approaches, and support further exploration of different NLP approaches to identify goals-of-care discussions.
Article
Full-text available
Ontology mapping is a crucial task for the facilitation of information exchange and data integration. A mapping system can use a variety of similarity measures to determine concept correspondences. This paper proposes the integration of word-sense disambiguation techniques into lexical similarity measures. We propose a disambiguation methodology which entails the creation of virtual documents from concept and sense definitions, including their neighbourhoods. The specific terms are weighted according to their origin within their respective ontology. The document similarities between the concept document and sense documents are used to disambiguate the concept meanings. First, we evaluate to what extent the proposed disambiguation method can improve the performance of a lexical similarity metric. We observe that the disambiguation method improves the performance of each tested lexical similarity metric. Next, we demonstrate the potential of a mapping system utilizing the proposed approach through the comparison with contemporary ontology mapping systems. We observe a high performance on a real-world data set. Finally, we evaluate how the application of several term-weighting techniques on the virtual documents can affect the quality of the generated alignments. Here, we observe that weighting terms according to their ontology origin leads to the highest performance.
Article
Full-text available
Given a user-selected seed author, a unique experimental system called AuthorWeb can return the 24 authors most frequently co-cited with the seed in a 10-year segment of the Arts and Humanities Citation Index. The Web-based system can then instantly display the seed and the others as a Pathfinder network, a Kohonen self-organizing map, or a pennant diagram. Each display gives a somewhat different overview of the literature cited with the seed in a specialty (e.g., Thomas Mann studies). Each is also a live interface for retrieving (1) the documents that co-cite the seed with another user-selected author, and (2) the works by the seed and the other author that are co-cited. This article describes the Pathfinder and Kohonen maps, but focuses much more on AuthorWeb pennant diagrams, exhibited here for the first time. Pennants are interesting because they unite ego-centered co-citation data from bibliometrics, the TF*IDF formula from information retrieval, and Sperber and Wilson’s relevance theory (RT) from linguistic pragmatics. RT provides a cognitive interpretation of TF*IDF weighting. By making people’s inferential processes a primary concern, RT also yields insights into both topical and non-topical relevance, central matters in information science. Pennants for several authors in the humanities demonstrate these insights.
Article
Full-text available
The latest stealth techniques, such as metamorphism, allow malware to evade detection by today’s signature-based anti-malware programs. Current techniques for detecting malware are compute intensive and unsuitable for real-time detection. Techniques based on opcode patterns have the potential to be used for real-time malware detection, but have the following issues: (1) The frequencies of opcodes can change by using different compilers, compiler optimizations and operating systems. (2) Obfuscations introduced by polymorphic and metamorphic malware can change the opcode distributions. (3) Selecting too many features (patterns) results in a high detection rate but also increases the runtime and vice versa. In this paper we present a novel technique named SWOD-CFWeight (Sliding Window of Difference and Control Flow Weight) that helps mitigate these effects and provides a solution to these problems. The SWOD size can be changed; this property gives anti-malware tool developers the ability to select appropriate parameters to further optimize malware detection. The CFWeight feature captures control flow information to an extent that helps detect metamorphic malware in real-time. Experimental evaluation of the proposed scheme using an existing dataset yields a malware detection rate of 99.08 % and a false positive rate of 0.93 %.
Article
Full-text available
We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human written, crowdsourced descriptions—plot synopses—of the story conveyed in the video. We obtain such synopses from websites such as Wikipedia and propose various methods to align each sentence of the plot to shots in the video. Thus, the semantic story-based video retrieval problem is transformed into a much simpler text-based search. Finally, we return the set of shots aligned to the sentences as the video snippet corresponding to the query. The alignment is performed by first computing a similarity score between every shot and sentence through cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of a TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
Article
Full-text available
Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call ‘anchor texts’ to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
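The contrast this abstract draws between cosine similarity and correlational measures can be made concrete with a small sketch. It assumes each text is represented as a vector of term counts over a shared vocabulary; the vectors `text_a` and `text_b` and the function names are hypothetical and not taken from the article. Pearson r is the cosine of the mean-centred vectors, so the two measures differ only in whether the mean count is subtracted first.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson r, computed as the cosine of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical term-count vectors over a shared four-word vocabulary.
text_a = [3, 0, 1, 2]
text_b = [2, 1, 0, 2]

print(round(cosine(text_a, text_b), 4))
print(round(pearson(text_a, text_b), 4))
```

Because term counts are non-negative, cosine on raw counts cannot fall below zero, whereas Pearson r can, which is one reason the two measures may rank text pairs differently.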
Article
Full-text available
Relevant documents are retrieved from large data sets with the help of a ranking function in an Information Retrieval system. In this paper, a new fuzzy logic based ranking function is proposed and implemented to enhance the performance of Information Retrieval systems. The proposed ranking function is based on the computation of different components of the term-weighting schema, such as term frequency, inverse document frequency and normalization. In the present work, fuzzy logic is used at two levels to compute the relevance score of a document with respect to the query. All the experiments are performed on the CACM and CISI benchmark data sets. The experimental results reveal that the proposed ranking function performs much better than the fuzzy based ranking function developed by Rubens, as well as the widely used ranking function Okapi-BM25, in terms of precision, recall and F-measure.
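The three components named in this abstract (term frequency, inverse document frequency and normalization) can be illustrated with a minimal sketch. This is one common textbook formulation, assumed here for illustration: raw counts for TF, `log(N / df)` for IDF, and division by the Euclidean length of the document vector. It is not the fuzzy ranking function the abstract proposes, and the function name `tf_idf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one length-normalized TF-IDF weight dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed formulation: raw term count times log(N / df).
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors


# Hypothetical toy collection of three tokenized documents.
docs = [
    ["fuzzy", "ranking", "ranking"],
    ["fuzzy", "retrieval"],
    ["fuzzy", "logic", "retrieval"],
]
vectors = tf_idf_vectors(docs)
for vec in vectors:
    print(vec)
```

In this toy collection "fuzzy" occurs in every document, so its IDF is zero and it contributes nothing to any document vector, which is the collection-frequency effect described in the Jones abstract at the top of this page.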
Article
Full-text available
Objectives: (1) To develop an automated eligibility screening (ES) approach for clinical trials in an urban tertiary care pediatric emergency department (ED); (2) to assess the effectiveness of natural language processing (NLP), information extraction (IE), and machine learning (ML) techniques on real-world clinical data and trials. Data and methods: We collected eligibility criteria for 13 randomly selected, disease-specific clinical trials actively enrolling patients between January 1, 2010 and August 31, 2012. In parallel, we retrospectively selected data fields including demographics, laboratory data, and clinical notes from the electronic health record (EHR) to represent profiles of all 202795 patients visiting the ED during the same period. Leveraging NLP, IE, and ML technologies, the automated ES algorithms identified patients whose profiles matched the trial criteria to reduce the pool of candidates for staff screening. The performance was validated on both a physician-generated gold standard of trial-patient matches and a reference standard of historical trial-patient enrollment decisions, where workload, mean average precision (MAP), and recall were assessed. Results: Compared with the case without automation, the workload with automated ES was reduced by 92% on the gold standard set, with a MAP of 62.9%. The automated ES achieved a 450% increase in trial screening efficiency. The findings on the gold standard set were confirmed by large-scale evaluation on the reference set of trial-patient matches. Discussion and conclusion: By exploiting the text of trial criteria and the content of EHRs, we demonstrated that NLP-, IE-, and ML-based automated ES could successfully identify patients for clinical trials.
Article
Full-text available
Optical character recognition (OCR) is an important application in the field of pattern recognition. It extracts text from an image document and saves it in an editable form. Examples where OCR is used include library digitization and text searching in scanned documents. Web-based applications are the main tools for data processing over the net. However, implementing such applications in dedicated hardware systems would increase performance and reliability manyfold over a software implementation. In this paper, we present a detailed hardware implementation of the feature extraction and character matching units of an Arabic optical character recognition (AOCR) system. The hardware implementation of each of these two units is described in VerilogHDL and functionally tested using ISim from Xilinx. Furthermore, each implementation is synthesized using Xilinx ISE 13.1 targeting the Xilinx Spartan6 FPGA family. Experimental results show a significant speedup of the hardware implementations over the software ones. We further explore the possibility of accessing these systems over the Web, which would make them beneficial to a wider range of people.
Article
Full-text available
On 20 text categorization data sets, the research investigated different variations of the VSM using the KNN algorithm and compared different term weighting approaches in terms of the F1 measure. The experimental results provide evidence that the Dice and Jaccard Coefficients outperformed the Cosine Coefficient approach with regard to F1 results, and that the Dice-based TF-IDF achieved the highest average scores.
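The three coefficients compared in this abstract can be sketched for weighted term vectors as follows. The abstract does not state which variants were used, so the extended (vector) forms of Dice and Jaccard shown here are an assumption, and the vectors `u` and `v` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def dice(u, v):
    """Extended Dice coefficient: 2(u.v) / (|u|^2 + |v|^2)."""
    return 2 * dot(u, v) / (dot(u, u) + dot(v, v))


def jaccard(u, v):
    """Extended Jaccard (Tanimoto) coefficient: (u.v) / (|u|^2 + |v|^2 - u.v)."""
    return dot(u, v) / (dot(u, u) + dot(v, v) - dot(u, v))


def cosine(u, v):
    """Cosine coefficient: (u.v) / (|u| |v|)."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Hypothetical term-weight vectors for two documents.
u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

print(dice(u, v), jaccard(u, v), cosine(u, v))
```

All three share the same numerator up to a constant and differ only in how they normalize by vector length, so in a KNN classifier they can produce different neighbour rankings when document lengths vary.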
Article
Full-text available
In this article, we introduce an out-of-the-box automatic term weighting method for information retrieval. The method is based on measuring the degree of divergence from independence of terms from documents in terms of their frequency of occurrence. Divergence from independence has a well-established underlying statistical theory. It provides a plain, mathematically tractable, and nonparametric way of term weighting requiring no means of term frequency normalization. Besides its sound theoretical background, the results of the experiments performed on TREC test collections show that its performance is comparable to that of the state-of-the-art term weighting methods in general. The theoretical and practical aspects together bring it forward as a simple but powerful baseline alternative to the state-of-the-art methods.
Chapter
Interviews and questionnaires are the basis for collecting information about the opinions, concerns and needs of people. Analysis of those texts is crucial to understand the kansei of people. Text mining is an approach to discover useful and interesting patterns, knowledge and information from texts. This chapter contains two sections on text mining for beginners. The first section gives a brief survey of basic text mining techniques, such as keyword extraction, word graphs, clustering of texts and association rule mining. The second section demonstrates an example of text mining applied to interview analysis. Two text mining systems - the concept graph system and the matrix search system - are applied to analyze 2,409 remarks about products and services from 19 people. The analysis shows that text mining systems with a search function achieve interactive analysis of texts and an examination of various problems that we targeted.
Article
This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.
Article
In recent years, text visualization has been widely acknowledged as an effective approach for understanding the structure and patterns hidden in complicated textual information. In this paper, we propose a new visualization system called TextInsight with two of our contributions. Firstly, a textual entropy theory is introduced to encode the semantic importance distribution in the corpus. Based on the proposed multidimensional joint probability histogram in vector fields, the improved algorithm provides a novel way to position valuable information in massive short texts accurately. Secondly, a map-like metaphor is generated to visualize the textual topics and their relationships. For the problem of over-segmentation in the layout and clustering procedure, we propose an optimization algorithm combining Affinity Propagation (AP) and MultiDimensional Scaling (MDS), and the improved geographical representation is more comprehensible and aesthetically appealing. Our experimental results and initial user feedback suggest that this system is effective in aiding text analysis.
Article
The objectives of this paper are to introduce the metric (called “absolute document incidences”), to compare this metric with the metric used by Braun and Schubert and Gu (called “document frequencies”), and to briefly present some potential uses and limitations of absolute document incidences. This paper builds on the feedback on my presentation at Collnet 2006. The research presented briefly in this paper aims to develop and apply a metric for estimating trends in databases of articles. The motivation for estimating trends in these databases is that these trends are possibly of interest in research and science policy. This research is part of my Doctoral project on the use of bibliometric data in the field of research policy.
Article
Social Media can be used as a thermometer to measure how society perceives different news and topics. With the advent of mobile devices, users can interact with Social Media platforms anytime/anywhere, increasing the proportion of geo-located Social Media interactions and opening new doors to localized insights. This article suggests a new method built upon the industry standard Recency, Frequency and Monetary model to quantify the impact of a topic on a defined geographical location during a given period of time. We model each component with a set of metrics analyzing how users in the location actively engage with the topic and how they are exposed to the interactions in their Social Media network related to the topic. Our method implements a full fledged information extraction system consuming geo-localized Social Media interactions and generating on a regular basis the impact quantification metrics. To validate our approach, we analyze its performance in two real-world cases using geo-located tweets.
Article
Motivation: Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. Results: Our best unsupervised algorithm (multilevel-weighted graph clustering algorithm) performs very well on the task, obtaining over 0.70 F scores for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. Availability and implementation: The annotated corpus and software are available at http://www.cl.cam.ac.uk/∼dk427/bio14info.html.
Article
Semantic search is gradually establishing itself as the next generation search paradigm, which meets better a wider range of information needs, as compared to traditional full-text search. At the same time, however, expanding search towards document structure and external, formal knowledge sources (e.g. LOD resources) remains challenging, especially with respect to efficiency, usability, and scalability. This paper introduces Mímir, an open-source framework for integrated semantic search over text, document structure, linguistic annotations, and formal semantic knowledge. Mímir supports complex structural queries, as well as basic keyword search. Exploratory search and sense-making are supported through information visualisation interfaces, such as co-occurrence matrices and term clouds. There is also an interactive retrieval interface, where users can save, refine, and analyse the results of a semantic search over time. The more well-studied precision-oriented information seeking searches are also well supported. The generic and extensible nature of the Mímir platform is demonstrated through three different, real-world applications, one of which required indexing and search over tens of millions of documents and fifty to hundred times as many semantic annotations. Scaling up to over 150 million documents was also accomplished, via index federation and cloud-based deployment.
Article
Accelerated by the technological advances in the biomedical domain, the size of its literature has been growing very rapidly. As a consequence, it is not feasible for individual researchers to comprehend and synthesize all the information related to their interests. Therefore, it is conceivable to discover hidden knowledge, or hypotheses, by linking fragments of information independently described in the literature. In fact, such hypotheses have been reported in the literature mining community, some of which have even been corroborated by experiments. This paper mainly focuses on hypothesis ranking and investigates an approach to identifying reasonable ones based on semantic similarities between events which lead to respective hypotheses. Our assumption is that hypotheses generated from semantically similar events are more reasonable. We developed a prototype system called Hypothesis Explorer and conducted evaluative experiments through which the validity of our approach is demonstrated in comparison with those based on term frequencies, often adopted in the previous work.
Article
In this paper we present an efficient speech recognition approach for multitopic speech by combining information retrieval techniques and topic-based language modeling. Information retrieval based techniques, such as topic identification by means of Latent Semantic Analysis, are used to identify the topic in a recognized transcription of an audio segment. According to the confidence on the topics that have been identified, we propose a dynamic language model adaptation in order to improve the recognition performance in ‘a two stages’ automatic speech recognition system. The scheme used for the adaptation of the language model is a linear interpolation between a background general LM and a topic dependent LM. We have studied different approaches to generate the topic dependent LM and also for determining the interpolation weight of this model with the background model. In one of these approaches we use the given topic labels in the training dataset to obtain the topic models. In the other approach we separate the documents in the training dataset into topic clusters by using the k-means algorithm. For strengthening the adaptation models we also use topic identification techniques to group non topic-labeled documents from the EUROPARL text database in order to increase the amount of data for training specific topic based language models. For the evaluation of the proposed system we are using the Spanish partition of the European Parliament Plenary Sessions (EPPS) Database; we selected a subset of the database with 67 labeled topics for the evaluation. For the task of topic identification our experiments show a relative reduction in topic identification error of 44.94% when compared to the baseline method, the Generalized Vector Model with a classic TF–IDF weighting scheme. For the task of dynamic adaptation of LMs applied to ASR we have achieved a relative reduction in WER of 13.52% over a single background language model.
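The adaptation scheme this abstract describes, a linear interpolation between a background language model and a topic-dependent one, can be sketched for unigram probabilities as below. The function name `interpolate`, the toy probability tables and the weight `lam` are hypothetical; the abstract does not give the model order or how the interpolation weight is derived from topic confidence.

```python
def interpolate(p_background, p_topic, lam):
    """Mix two word distributions: lam * P_topic(w) + (1 - lam) * P_background(w)."""
    vocab = set(p_background) | set(p_topic)
    return {
        w: lam * p_topic.get(w, 0.0) + (1.0 - lam) * p_background.get(w, 0.0)
        for w in vocab
    }


# Hypothetical unigram distributions over a three-word vocabulary.
background = {"the": 0.5, "vote": 0.1, "budget": 0.4}
topic = {"the": 0.3, "vote": 0.6, "budget": 0.1}

# lam stands in for a weight derived from topic-identification confidence.
adapted = interpolate(background, topic, lam=0.25)
print(adapted)
```

Because both inputs sum to one and the weights `lam` and `1 - lam` sum to one, the adapted distribution also sums to one, so it can replace the background model directly.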
Article
Name ambiguity in the context of bibliographic citation affects the quality of services in digital libraries. Previous methods are not widely applied in practice because of their high computational complexity and their strong dependency on excessive attributes, such as institutional affiliation, research area, address, etc., which are difficult to obtain in practice. To solve this problem, we propose a novel coarse-to-fine framework for name disambiguation which sequentially employs 3 common and easily accessible attributes (i.e., coauthor name, article title, and publication venue). Our proposed framework is based on multiple clustering and consists of 3 steps: (a) clustering articles by coauthorship and obtaining rough clusters, that is fragments; (b) clustering fragments obtained in step 1 by title information and getting bigger fragments; (c) and clustering fragments obtained in step 2 by the latent relations among venues. Experimental results on a Digital Bibliography and Library Project (DBLP) data set show that our method outperforms the existing state-of-the-art methods by 2.4% to 22.7% on the average pairwise F1 score and is 10 to 100 times faster in terms of execution time.
Article
We studied the effectiveness of a new class of context-dependent term weights for information retrieval. Unlike the traditional term frequency–inverse document frequency (TF–IDF), the new weighting of a term t in a document d depends not only on the occurrence statistics of t alone but also on the terms found within a text window (or “document-context”) centered on t. We introduce a Boost and Discount (B&D) procedure which utilizes partial relevance information to compute the context-dependent term weights of query terms according to a logistic regression model. We investigate the effectiveness of the new term weights compared with the context-independent BM25 weights in the setting of relevance feedback. We performed experiments with title queries of the TREC-6, -7, -8, and 2005 collections, comparing the residual Mean Average Precision (MAP) measures obtained using B&D term weights and those obtained by a baseline using BM25 weights. Given either 10 or 20 relevance judgments of the top retrieved documents, using the new term weights yields improvement over the baseline for all collections tested. The MAP obtained with the new weights has relative improvement over the baseline by 3.3 to 15.2%, with statistical significance at the 95% confidence level across all four collections.
Article
For a mobile robot, recognizing its current location is essential to autonomous navigation. In particular, loop closing detection, in which the robot recognizes a location it has visited before, is a core problem in localization. A considerable amount of research has been conducted on appearance-based loop closing detection and localization, because vision sensors are inexpensive and support a variety of approaches to the problem. In scenes that consist of repeated structures, such as corridors, perceptual aliasing, in which two different locations are recognized as the same, occurs frequently. In this paper, we propose an improved method for recognizing locations in scenes that have similar structures. We extract salient regions from images using a visual attention model and calculate weights using distinctive features in the salient regions. This makes it possible to emphasize unique features in the scene and so distinguish similar-looking locations. In corridor recognition experiments, the proposed method showed improved recognition performance, with an accuracy of 78.2% for single-floor corridor recognition and 71.5% for multi-floor corridor recognition.
Protein surface motifs, which can be defined as commonly appearing patterns of shape and physical properties in protein molecular surfaces, can be considered "possible active sites". We have developed a system for mining surface motifs: SUMOMO which consists of two phases: surface motif extraction and surface motif filtering. In the extraction phase, a given set of protein molecular surface data is divided into small surfaces called unit surfaces. After extracting several common unit surfaces as candidate motifs, they are repetitively merged into surface motifs. However, a large amount of surface motifs is extracted in this phase, making it difficult to distinguish whether the extracted motifs are significant to be considered active sites. Since active sites from proteins with a particular function have similar shape and physical properties, proteins can be classified based on similarity among local surfaces. Thus, in the filtering phase, local surfaces extracted from proteins of the same group are considered significant motifs, and the rest are filtered out. The proposed method was applied to discover surface motifs from 15 proteins belonging to four function groups. Motifs corresponding to all 4 known functional sites were recognised.
Article
The ongoing exponential growth of online information sources has led to a need for reliable and efficient algorithms for text clustering. In this paper, we propose a novel text model called the relational text model that represents each sentence as a binary multirelation over a concept space $\mathcal{C}$. Through usage of the smart indexing engine (SIE), a patented technology of the Belgian company i.Know, the concept space adopted by the text model can be constructed dynamically. This means that there is no need for an a priori knowledge base such as an ontology, which makes our approach context independent. The concepts resulting from SIE possess the property that frequency of concepts is a measure for relevance. We exploit this property with the development of the CR-algorithm. Our approach relies on the representation of a data set $\mathcal{D}$ as a multirelation, of which k-cuts can be taken. These cuts can be seen as sets of relevant patterns with respect to the topics that are described by documents. Analysis of dependencies between patterns makes it possible to produce clusters such that precision is sufficiently high. The best k-cut is the one that best approximates the estimated number of clusters to ensure recall. Experimental results on Dutch news fragments show that our approach outperforms both basic and advanced methods. © 2012 Wiley Periodicals, Inc.
Article
Extracting semantic associations from text corpora is an important problem with several applications. It is well understood that semantic associations from text can be discerned by observing patterns of co-occurrences of terms. However, much of the work in this direction has been piecemeal, addressing specific kinds of semantic associations. In this work, we propose a generic framework with which several kinds of semantic associations can be mined. The framework comprises a co-occurrence graph of terms, along with a set of graph operators. A methodology for using this framework is also proposed, where the properties of a given semantic association can be hypothesized and tested over the framework. To show the generic nature of the proposed model, four different semantic associations are mined over a corpus comprising Wikipedia articles. The design of the proposed framework is inspired by cognitive science, specifically the interplay between semantic and episodic memory in humans.
Article
Although Boolean searching has been the standard model for commercial information retrieval systems for the past three decades, natural language input and partial-match weighted retrieval have recently emerged from the laboratories to become a searching option in several well-known online systems. The purpose of this investigation is to compare the performance of one of these partial match options, LEXIS/NEXIS's Freestyle, with that of traditional Boolean retrieval. To create a context for the investigation, the definition of natural language and the natural language search engines currently available are discussed. Although the Boolean searches had better results more often than the Freestyle searches, neither mechanism demonstrated superior performance for every query. These results do not in any way prove the superiority of partial match techniques or exact match techniques, but they do suggest that different queries demand different techniques. Further study and analysis are needed to determine which elements of a query make it best suited for partial match or exact match retrieval.
Article
Term weighting is a strategy that assigns weights to terms to improve the performance of sentiment analysis and other text mining tasks. In this paper, we propose a supervised term weighting scheme based on two basic factors, the importance of a term in a document (ITD) and the importance of a term for expressing sentiment (ITS), to improve the performance of analysis. For ITD, we explore three definitions based on term frequency. Then, seven statistical functions are employed to learn the ITS of each term from training documents with category labels. Compared with the previous unsupervised term weighting schemes originating from information retrieval, our scheme can make full use of the available labeling information to assign appropriate weights to terms. We have experimentally evaluated the proposed method against the state-of-the-art method. The experimental results show that our method outperforms that method and produces the best accuracy on two of three data sets.
Article
A Data Warehouse is a huge multidimensional repository used for statistical analysis of historical data. In a data warehouse, events are modeled as multidimensional cubes where cells store numerical indicators while dimensions describe the events from different points of view. Dimensions are typically described at different levels of detail through hierarchies of concepts. Computing the distance/similarity between two cells has several applications in this domain. In this context, distance is typically based on the least common ancestor between attribute values, but the effectiveness of such distance functions varies according to the structure and the number of the hierarchies involved. In this paper, we propose a characterization of hierarchy types based on their structure and expressiveness, provide a characterization of the different types of distance functions, and verify their effectiveness on different types of hierarchies in terms of their intrinsic discriminant capacity.
Article
Applying learning techniques to acquire action models is an area of intense research interest. Most previous work in this area has assumed that there is a significant amount of training data available in a planning domain of interest. However, it is often difficult to acquire sufficient training data to ensure the learnt action models are of high quality. In this paper, we seek to explore a novel algorithm framework, called TRAMP, to learn action models with limited training data in a target domain, via transferring as much of the available information from other domains (called source domains) as possible to help the learning task, assuming action models in source domains can be transferred to the target domain. TRAMP transfers knowledge from source domains by first building structure mappings between source and target domains, and then exploiting extra knowledge from Web search to bridge and transfer knowledge from sources. Specifically, TRAMP first encodes training data with a set of propositions, and formulates the transferred knowledge as a set of weighted formulas. After that it learns action models for the target domain to best explain the set of propositions and the transferred knowledge. We empirically evaluate TRAMP in different settings to see their advantages and disadvantages in six planning domains, including four International Planning Competition (IPC) domains and two synthetic domains.
Article
We describe a project undertaken by an interdisciplinary team combining researchers in sleep psychology and in Natural Language Processing/Machine Learning. The goal is sentiment analysis on a corpus containing short textual descriptions of dreams. Dreams are categorized on a four-level scale of positive and negative sentiments. We chose a four-level annotation scale to reflect sentiment strength while remaining simple. The approach is based on a novel representation, taking into account the leading themes of the dream and the sequential unfolding of associated sentiments during the dream. The dream representation is based on three combined parts, two of which are automatically produced from the description of the dream. The first part consists of a co-occurrence vector representation of dreams, used to detect sentiment levels in the dream texts. Those vectors, unlike the standard Bag-of-words model, capture non-local relationships between the meanings of words in a corpus. The second part introduces the dynamic representation that captures the sentimental changes throughout the progress of the dream. The third part is the self-reported assessment of the dream by the dreamer according to eight given attributes (self-assessment is different in many respects from the dream’s sentiment classification). The three representations are subject to aggressive feature selection. Using an ensemble of classifiers on the combined three-part representation, the agreement between machine ratings and human judges' scores on the four-level scale was 64%, which is in the range of human experts’ consensus in that domain. The accuracy of the system was 14% higher than previous results on the same task.
Article
Recommender systems can mitigate the information overload problem and help workers retrieve knowledge based on their preferences. In a knowledge-intensive environment, knowledge workers need to access task-related codified knowledge (documents) to perform tasks. A worker's document referencing behavior can be modeled as a knowledge flow (KF) to represent the evolution of his or her information needs over time. Document recommendation methods can proactively support knowledge workers in the performance of tasks by recommending appropriate documents to meet their information needs. However, most traditional recommendation methods do not consider workers’ KFs or the information needs of the majority of a group of workers with similar KFs. A group's needs may partially reflect the needs of an individual worker that cannot be inferred from his or her past referencing behavior. In other words, the group's knowledge complements that of the individual worker. Thus, we leverage the group perspective to complement the personal perspective by using hybrid approaches, which combine the KF-based group recommendation method (KFGR) with traditional personalized-recommendation methods. The proposed hybrid methods achieve a trade-off between the group-based and personalized methods by exploiting the strengths of both. The results of our experiment show that the proposed methods can enhance the quality of recommendations made by traditional methods.
Article
Measuring the semantic similarity between sentences is an essential issue for many applications, such as text summarization, Web page retrieval, question-answer models, image extraction, and so forth. A few studies have explored this issue using several techniques, e.g., knowledge-based strategies, corpus-based strategies, hybrid strategies, etc. Most of these studies focus on how to improve effectiveness. In this paper, we address the efficiency issue, i.e., for a given sentence collection, how to efficiently discover the top-k sentences most semantically similar to a query. The previous methods cannot handle big data efficiently, i.e., applying such strategies directly is time consuming because every candidate sentence needs to be tested. In this paper, we propose efficient strategies to tackle this problem based on a general framework. The basic idea is that, for each similarity, we build a corresponding index during preprocessing. Traversing these indices during querying avoids testing many candidates and so improves efficiency. Moreover, an optimal aggregation algorithm is introduced to assemble these similarities. Our framework is general enough that many similarity metrics can be incorporated, as will be discussed in the paper. We conduct extensive experimental evaluation on three real datasets to evaluate the efficiency of our proposal. In addition, we illustrate the trade-off between effectiveness and efficiency. The experimental results demonstrate that our proposal outperforms the state-of-the-art techniques in efficiency while maintaining the same high precision.
Article
In this paper an adaptive hierarchical fuzzy clustering algorithm is presented, named Hierarchical Data Divisive Soft Clustering (H2D-SC). The main novelty of the proposed algorithm is that it is a quality driven algorithm, since it dynamically evaluates a multi-dimensional quality measure of the clusters to drive the generation of the soft hierarchy. Specifically, it generates a hierarchy in which each node is split into a variable number of sub-nodes, determined by an innovative quality assessment of soft clusters, based on the evaluation of multiple dimensions such as the cluster’s cohesion, its cardinality, its mass, and its fuzziness, as well as the partition’s entropy. Clusters at the same hierarchical level share a minimum quality value: clusters in the lower levels of the hierarchy have a higher quality; this way more specific clusters (lower level clusters) have a higher quality than more general clusters (upper level clusters). Further, since the algorithm generates a soft partition, a document can belong to several sub-clusters with distinct membership degrees. The proposed algorithm is divisive, and it is based on a combination of a modified bisecting K-Means algorithm with a flat soft clustering algorithm used to partition each node. The paper describes the algorithm and its evaluation on two standard collections.
Article
This chapter explores the problem of topic identification from text. It is first argued that the conventional representation of text as bag-of-words vectors will always have limited success in arriving at the underlying meaning of text until the more fundamental issues of feature independence in vector-space and ambiguity of natural language are addressed. Next, a groundbreaking approach to text representation and topic identification that deviates radically from current techniques used for document classification, text clustering, and concept discovery is proposed. This approach is inspired by human cognition, which allows 'meaning' to emerge naturally from the activation and decay of unstructured text information retrieved from the Web. This paradigm shift allows for the exploitation rather than avoidance of dependence between terms to derive meaning without the complexity introduced by conventional natural language processing techniques. Using the unstructured texts in Web pages as a source of knowledge alleviates the laborious handcrafting of formal knowledge bases and ontologies that are required by many existing techniques. Some initial experiments have been conducted, and the results are presented in this chapter to illustrate the power of this new approach.
Article
In this paper, we propose a new keyword extraction method for generating a user profile from collected papers without using a large corpus. We assume that a user’s interests are reflected in the papers the user collects. Our method extracts keywords that express the user’s interests from the papers in which those interests appear. Our method can be used to enhance the paper collection and sharing system MiDoc. In MiDoc, user profiles are automatically constructed using the method. We conducted several experiments to show how effectively our method can extract keywords that represent the user’s interests. In the experiments, our method was compared with existing methods. The results lead to the conclusion that the method can effectively extract keywords that represent the user’s interests. In this paper, we define a user profile as the keywords that express the user’s interests.
Article
In this article, we identify, compare, and contrast theoretical constructs for the fields of information searching and information retrieval to emphasize the uniqueness of and synergy between the fields. Theoretical constructs are the foundational elements that underpin a field's core theories, models, assumptions, methodologies, and evaluation metrics. We provide a framework to compare and contrast the theoretical constructs in the fields of information searching and information retrieval using intellectual perspective and theoretical orientation. The intellectual perspectives are information searching, information retrieval, and cross-cutting; and the theoretical orientations are information, people, and technology. Using this framework, we identify 17 significant constructs in these fields, contrasting the differences and comparing the similarities. We discuss the impact of the interplay among these constructs for moving research forward within both fields. Although there is tension between the fields due to contradictory constructs, an examination shows a trend toward convergence. We discuss the implications for future research within the information searching and information retrieval fields.
Article