Book

Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA


... To better reflect users' preferences and to interpret the meaning of items, in this paper we explored the effectiveness of utilizing semantic knowledge (meaningful relationships between items) learned through bottom-up models based on the distributional hypothesis, such as the Doc2vec [43] and TF-IDF [53] methods, which consider the context of item usage (e.g., their co-occurrence in purchase sequences) to learn the semantic relationships between products (e.g., products co-purchased and co-reviewed, along with semantic similarities computed from their textual features). This semantic knowledge can then be integrated into the Markov process for personalized sequential recommendation by (i) learning semantic associations between items, (ii) creating an item transition probability matrix by extracting the sequential co-occurrences of product pairs and normalizing them, and (iii) fusing the semantic knowledge into the transition probability matrix and using it with users' preferences (a personalized vector) to generate semantically similar, sequential next-item recommendations. ...
... The extracted semantic knowledge is then utilized for (i) learning semantic and sequential relationships between items, (ii) generating potential candidate items and (iii) generating semantically rich, sequential next-item recommendations for the target customers. The rationale for obtaining product embeddings by aggregating two models (TF-IDF [53] and Doc2Vec [43]) over product sequences was to capture semantic information from the purchase sequences at both the local and the global level: TF-IDF [53] captures information at the local level by extracting tokens (keywords) present in the products' textual metadata, while Doc2vec [43] provides global context, since in the Doc2Vec model the words (products) and the paragraph (products' metadata in a purchase sequence) are trained jointly and a document embedding is generated (where a document is the collection of all product descriptions, titles and brands in a list-of-lists format, each list element representing the description, title and brand of a purchased product, along with a document ID for each document). ...
Article
Full-text available
To model sequential relationships between items, Markov models build a transition probability matrix P of size n × n, where n is the number of states (items) and each matrix entry p(i,j) represents the transition probability from state i to state j. Existing systems such as factorized personalized Markov chains (FPMC) and Fossil either combine sequential information with user preference information or add the high-order Markov chain concept. However, they suffer from (i) model complexity: an increase in the Markov model's order (number of states) and the separation of sequential pattern and user preference matrices, (ii) a sparse transition probability matrix: few product purchases from thousands of available products, (iii) ambiguous prediction: multiple states (items) having the same transition probability from the current state and (iv) lack of semantic knowledge: transition to the next state (item) depends only on the probabilities of items' purchase frequency. To alleviate the sparsity and ambiguous prediction problems, this paper proposes the semantic-enabled Markov model recommendation (SEMMRec) system, which takes customers' purchase history and products' metadata (e.g., title, description and brand) as input and extracts products' sequential and semantic knowledge according to (i) their usage (e.g., products co-purchased or co-reviewed) and (ii) their textual features, by finding similarity between products based on their characteristics using distributional hypothesis methods (Doc2vec and TF-IDF) which consider the context of items' usage. Next, this extracted knowledge is integrated into the transition probability matrix P to generate personalized, sequential and semantically rich next-item recommendations. Experimental results on various e-commerce data sets exhibit improved performance by the proposed model.
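The abstract above describes fusing item-level semantic similarity into a first-order transition matrix. The sketch below is a minimal illustration of that idea, not the authors' SEMMRec implementation; the blending weight alpha, the toy purchase sequences and the stand-in similarity matrix are all assumptions.

```python
# Minimal sketch: build a row-normalized transition matrix from purchase
# sequences and blend in a semantic similarity matrix (illustrative only).
import numpy as np

def transition_matrix(sequences, n_items):
    """Count sequential co-occurrences of item pairs and row-normalize."""
    counts = np.zeros((n_items, n_items))
    for seq in sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero for unseen items
    return counts / row_sums

def fuse(P, S, alpha=0.5):
    """Blend transition probabilities P with a row-normalized similarity matrix S."""
    S_norm = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    fused = alpha * P + (1 - alpha) * S_norm
    return fused / fused.sum(axis=1, keepdims=True)

# toy purchase sequences over 4 items
sequences = [[0, 1, 2], [0, 2, 3], [1, 2, 3]]
P = transition_matrix(sequences, n_items=4)
S = np.eye(4) + 0.3                         # stand-in for Doc2Vec/TF-IDF similarities
P_sem = fuse(P, S, alpha=0.6)
last_item = 2
print("next-item scores from item 2:", P_sem[last_item])
```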
... In order to represent the item more precisely, clever term-weighting techniques are needed. TF-IDF (Term Frequency-Inverse Document Frequency) is one of the most widely used weighting schemes (Salton, Wong and Yang, 1975; Salton, 1989). ...
... Prentice-Dunn and Rogers believe that a reduction in the level of public self-awareness is not linked to deindividuation, because in that situation individuals can still clearly perceive their own behaviour. Conversely, a reduction in the level of private self-awareness is linked to deindividuation because, at that point, the individual no longer perceives his or her own thoughts, emotions or other internal processes (Prentice-Dunn and Rogers, 1980, 1982, 1989). ...
... The second category is called the "contemporary theory of deindividuation". It corresponds to the studies of Prentice-Dunn and Rogers (Prentice-Dunn and Rogers, 1982, 1989), who propose the reduction of private self-awareness as the main variable inducing deindividuation. The third category consists of the SIDE model, since it is the only theory that distinguishes group norms from social norms. ...
Thesis
The development of the Internet and of Web 2.0 technology, which adds user-generated content to the ease of publication, puts at users' disposal a variety of information whose volume is constantly growing. Faced with this information overload, it is difficult for users to orient themselves and locate information that meets their needs. Many information filtering systems have been developed to address this problem; one of them is the recommender system. The main objective of recommender systems is to provide users with personalized content suggestions. The underlying principle is to infer the user's information needs, then identify in the system the information that meets those needs and recommend it to the user. Recommender systems, widely used in various domains, can also be integrated into social networks. Most social networks are characterized both by a large number of interactions and by the anonymity of users. These characteristics correspond to the conditions described in social psychology for a state of deindividuation to be triggered. Users of social networks are likely to find themselves in a situation where the group identity is significantly heightened and their individual identity restricted. Their thoughts, behaviours and even their preferences are strongly influenced by group norms, including, of course, their feedback on the information they receive. This feedback could be biased, that is, it may not reflect users' true individual preferences. Recommendations based on such biased feedback would therefore run counter to the original intent of personalized recommendation. This thesis is devoted to exploring the phenomenon of deindividuation that may exist in social networks and its impact on users' rating behaviour, while also taking cultural differences into account. We chose movie recommender systems as our field of study, which led us to examine users of four platforms for film enthusiasts through their movie rating behaviour. The results confirm the existence of the deindividuation phenomenon in social networks and its significant impact on users' rating behaviour. Cultural difference is also an important factor influencing rating behaviour. On this basis, we argue that recommender systems applied in social networks must pay attention to this, and that measures aimed at individuating users should be taken before collecting and analysing user feedback.
... For the algorithms to be applied, the text collections must first undergo preprocessing so that they can be structured and thus interpreted by the machine learning algorithms. In this project, texts were structured and represented using the vector space model [Salton 1989] with the bag-of-words technique, since it offers good performance and low complexity and is widely used in the literature. In this structure, each document is represented by a vector, and each position holds the weight of a term that represents a feature of the collection. ...
... In this structure, each document is represented by a vector, and each position holds the weight of a term that represents a feature of the collection. To determine the term weights in this project, the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme [Salton 1989] was used. ...
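Since several of the excerpts above rely on the vector space model with TF-IDF weighting, here is a minimal sketch of that representation using scikit-learn; the toy corpus is an assumption and the cited project's preprocessing details are not reproduced.

```python
# Minimal bag-of-words/TF-IDF sketch: one row per document, one column per term.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "recommender systems filter information for users",
    "spam filtering classifies unwanted email messages",
    "text clustering groups similar documents together",
]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)          # sparse term-document matrix
print(X.shape)                              # (3, number_of_terms)
print(vectorizer.get_feature_names_out()[:5])
```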
Preprint
Full-text available
There is currently a massive amount of text being produced in the digital universe. This large collection of texts may contain knowledge useful to many areas, both academic and commercial. One way to extract knowledge from and manage large volumes of text is automatic classification. A way to make automatic classification more attractive and feasible is One-Class Learning (OCL), in which a classification model is learned considering only documents of the user's class of interest. However, even when OCL techniques are used, a large number of labeled examples of the class of interest must be provided for accurate classification, which can still make the practical use of OCL unfeasible. One can then resort to Positive and Unlabeled Learning (PUL), a form of one-class semi-supervised learning that uses unlabeled examples to improve classification performance. However, the PUL techniques found in the literature rely on algorithms that, in general, do not achieve classification performance that is satisfactory or superior to that of other semi-supervised learning algorithms. Given this, the goal of this project is the implementation and use of semi-supervised learning techniques better suited to text classification, such as network-based ones. Experiments were run on 10 text collections considering different numbers of labeled examples. We observed that choosing more suitable semi-supervised algorithms produced gains for all text collections. Moreover, better results were obtained in comparison with baseline algorithms, and better results than OCL algorithms when only one labeled example was used, for most collections, showing that the use of unlabeled examples in PUL algorithms contributes to increased classification performance. Keywords: semi-supervised learning, one-class learning, text classification, positive and unlabeled learning.
... In addition, the MapReduce framework is not appropriate for iterative algorithms, since it requires reading and writing data from disk at each iteration. On the other hand, applying these methods to text clustering requires converting the collection of documents into numerical form using a text representation method such as the Vector Space Model (VSM) [36], one of the most commonly used text representations [28]. In fact, constructing this representation is time consuming, especially when dealing with large documents. ...
... Once the vocabulary is built, the numerical encoding of the term-document matrix can be performed following the Vector Space Model (VSM) [36], one of the most widely used text representation methods in real-life applications [28]. In VSM, a document d_i is represented as a vector d_i = (x_1 . . . ...
Article
Full-text available
Clustering textual data has become an important task in data analytics since several applications require to automatically organizing large amounts of textual documents into homogeneous topics. The increasing growth of available textual data from web, social networks and open platforms have challenged this task. It becomes important to design scalable clustering method able to effectively organize huge amount of textual data into topics. In this context, we propose a new parallel text clustering method based on Spark framework and hashing. The proposed method deals simultaneously with the issue of clustering huge amount of documents and the issue of high dimensionality of textual data by respectively integrating the divide and conquer approach and implementing a new document hashing strategy. These two facts have shown an important improvement of scalability and a good approximation of clustering quality results. Experiments performed on several large collections of documents have shown the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
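As a rough illustration of the hashing idea mentioned in the abstract (bounding dimensionality without building a vocabulary), the sketch below uses scikit-learn's HashingVectorizer; the paper's own Spark-based document hashing strategy is not reproduced here, and the toy documents are assumptions.

```python
# Illustrative only: hash document terms into a fixed-size feature space,
# avoiding the cost of constructing and storing a vocabulary.
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["big text collections need scalable clustering",
        "hashing bounds the dimensionality of document vectors"]
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = hasher.transform(docs)                  # sparse matrix with at most 1024 columns
print(X.shape)
```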
... To remove the bias that nodes with higher degree typically have a larger number of common interacted items with other nodes, we normalize the common-neighbor count |N_u ∩ N_v| by a normalization term that is a function of u's and v's neighborhood sets. Depending on the specific normalization used here, the CIR could express many existing graph topological metrics for measuring common neighbors, such as Jaccard Similarity (JC) [19], Salton Cosine Similarity (SC) [26], Leicht-Holme-Newman (LHN) [17] and Common Neighbors (CN) [21]. The detailed computation of these four metrics is attached in Appendix A.1. ...
... • Salton Cosine Similarity (SC) [26]: The SC score measures the cosine similarity between the neighborhood sets of two nodes: ...
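For reference, here is a hedged sketch of the four common-neighbor metrics named in the excerpts (CN, Jaccard, Salton cosine and LHN) over two toy neighborhood sets; the exact CIR normalization used in the cited paper is not reproduced.

```python
# Common-neighbor similarity metrics between two nodes' neighborhood sets.
import math

def metrics(neigh_u, neigh_v):
    common = len(neigh_u & neigh_v)
    cn  = common                                                           # Common Neighbors
    jc  = common / len(neigh_u | neigh_v) if neigh_u | neigh_v else 0.0    # Jaccard
    sc  = common / math.sqrt(len(neigh_u) * len(neigh_v)) if neigh_u and neigh_v else 0.0  # Salton cosine
    lhn = common / (len(neigh_u) * len(neigh_v)) if neigh_u and neigh_v else 0.0           # Leicht-Holme-Newman
    return {"CN": cn, "JC": jc, "SC": sc, "LHN": lhn}

# toy neighborhood sets of two users in a user-item interaction graph
N_u = {"i1", "i2", "i3"}
N_v = {"i2", "i3", "i4", "i5"}
print(metrics(N_u, N_v))
```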
Preprint
Full-text available
By virtue of the message-passing that implicitly injects collaborative effect into the embedding process, Graph Neural Networks (GNNs) have been successfully adopted in recommendation systems (e.g., LightGCN, GTN, and UltraGCN). Nevertheless, most of the existing message-passing mechanisms in recommendation are directly inherited from GNNs without any recommendation-tailored modification. Although some efforts have been made towards simplifying GNNs to improve the performance/efficiency of recommendation, no study has comprehensively scrutinized how message-passing captures collaborative effect and whether the captured effect would benefit the prediction of user preferences over items. Therefore, in this work we aim to demystify the collaborative effect captured by message-passing in GNNs and develop new insights towards customizing message-passing for recommendation. First, we theoretically analyze how message-passing captures and leverages the collaborative effect in predicting user preferences. Then, to determine whether the captured collaborative effect would benefit the prediction of user preferences, we propose a recommendation-oriented topological metric, Common Interacted Ratio (CIR), which measures the level of interaction between a specific neighbor of a node with the rest of its neighborhood set. Inspired by our theoretical and empirical analysis, we propose a recommendation-tailored GNN, Augmented Collaboration-Aware Graph Convolutional Network (CAGCN*), that extends upon the LightGCN framework and is able to selectively pass information of neighbors based on their CIR via the Collaboration-Aware Graph Convolution. Experimental results on six benchmark datasets show that CAGCN* outperforms the most representative GNN-based recommendation model, LightGCN, by 9% in Recall@20 and also achieves more than 79% speedup. Our code is publicly available at: https://github.com/YuWVandy/CAGCN
... In order to generate representations for the OCL and PUL algorithms used in the comparison, we used two textual models that consider the complete text of the news to transform them into structured data: Bag-of-Words (BoW) (Salton, 1989) and document embeddings generated with Doc2Vec (D2V) (Le & Mikolov, 2014). The same representations were used to compute similarities and generate relations among documents in a network. ...
... (Salton, 1989) or document embeddings (Le & Mikolov, 2014) must be adopted to transform news into structured data. ...
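The two representations mentioned above (Bag-of-Words and Doc2Vec) could be produced along the following lines; the toy news snippets and hyperparameters are illustrative assumptions, not the cited study's setup.

```python
# Minimal sketch of Bag-of-Words and Doc2Vec document representations.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import CountVectorizer

news = ["government announces new economic policy",
        "celebrity spotted at local restaurant",
        "scientists publish study on climate data"]

# Bag-of-Words: sparse term-count matrix
bow = CountVectorizer().fit_transform(news)

# Doc2Vec: dense document embeddings trained jointly with word vectors
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(news)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
vec = d2v.infer_vector("new policy on climate".split())
print(bow.shape, vec.shape)
```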
Article
Full-text available
Fake news can spread rapidly among internet users and deceive a large audience. Because of these characteristics, it can have a direct impact on political and economic events. Machine learning approaches have been used to assist fake news identification. However, since the spectrum of real news is broad, hard to characterize, and expensive to label due to the high update frequency, One-Class Learning (OCL) and Positive and Unlabeled Learning (PUL) emerge as interesting approaches for content-based fake news detection, using a smaller set of labeled data than traditional machine learning techniques. In particular, network-based approaches are well suited to fake news detection since they allow information from different aspects of a publication to be incorporated into the problem modeling. In this paper, we propose a network-based approach built on Positive and Unlabeled Learning by Label Propagation (PU-LP), a one-class, transductive semi-supervised learning algorithm that first identifies potential interest and non-interest documents in the unlabeled data; a label propagation approach is then employed to classify the remaining unlabeled documents. We assessed the performance of our proposal considering homogeneous (only documents) and heterogeneous (documents and terms) networks. Our comparative analysis considered four OCL algorithms extensively employed in one-class text classification (k-Means, density-based k-Nearest Neighbors, One-Class Support Vector Machine, and Dense Autoencoder), and another traditional PUL algorithm (Rocchio Support Vector Machine). The algorithms were evaluated on three news collections, considering balanced and extremely unbalanced scenarios. We used Bag-of-Words and Doc2Vec models to transform news into structured data. Results indicated that PU-LP approaches are more stable and achieve better results than other PUL and OCL approaches in most scenarios, performing similarly to semi-supervised binary algorithms. Also, the inclusion of terms in the news network yielded better results, especially when the news items are distributed in the feature space according to veracity and subject. News representation using Doc2Vec achieved better results than the Bag-of-Words model both for the algorithms based on the vector space model and for those based on the document similarity network.
... The preprocessed data are represented in term vector or bag-of-words format. Each term of the document is weighted and represented in a tf-idf (term frequency-inverse document frequency) matrix [19]. Another form of data representation is named entity representation [12]. ...
Article
Full-text available
Microblogs are web applications which act as a broadcast medium in which people can share their statuses, information, links, images, videos and opinions in short messages. They offer a lightweight, easy and fast way of communicating. Some of the prevalent microblogging services are Twitter, Facebook, Google+, etc. The information posted on Twitter is called tweets. These messages provide information that varies from daily life events to the latest worldwide news and events. Analyzing such a rich source of user-generated data can yield unprecedentedly valuable information. Mining such valuable information helps to identify events that occurred over space and time. Event detection from Twitter data poses many new challenges compared to event detection from traditional media. This paper provides a survey of various techniques used for detecting events from Twitter.
... • CBF-Keywords: a classic content-based algorithm that recommends items by matching users' profiles with the keywords of items they have not yet observed or acquired. • PerceptRank (Ficel et al., 2018b): a real-time learning-to-rank algorithm that formulates users' perceived value analogously to the TF-IDF measure (Salton, 1988). • BPR (Bayesian Personalized Ranking) (Rendle et al., 2009): used as a representative of approaches based on periodic batch learning. ...
Article
Highly interactive and dynamic marketplaces shake the underlying hypothesis of established recommendation approaches assuming static offerings and sparse interaction data. Such markets show different dynamics due to goods/contents volatility and public accessibility. This imposes new challenges to recommender systems since they are required to handle unbounded data streams with high velocity, volume and variability while operating at scale and in real-time. Moreover, due to the high rates of new offerings introduction and obsolescence, low latency modeling and inference are needed to keep an up-to-date understanding of the market. In this paper, we propose a recommendation approach addressing the specific challenges of real-time stream recommendation in highly interactive marketplaces. The approach is based on several psychological theories to model consumers’ perceived value towards items. Besides, a ranking measure is defined on a heterogeneous information network to infer the factors that drive consumers’ decisions. With data relationships at its center, this data model is strongly efficient while ensuring free and flexible knowledge evolution as data evolves. It allows low latency incremental learning at scale while providing dynamic recommendations. Several comparative experiments were conducted to validate the potential of this approach in different use cases requiring offline, online, static or dynamic recommendations.
... Text: Traditionally, before the era of deep learning, Term Frequency-Inverse Document Frequency (TF-IDF) [98] was used to identify relevant text segments [25,26,115]. Due to significant advancements in feature extraction, almost all MMS tasks in the past five years either use pre-trained embeddings like word2vec [73] or GloVe [85], or train similar embeddings on their own datasets [133,134] (refer to Feature Extraction in Section 4.2.1). ...
Preprint
Full-text available
The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the task of automatic multi-modal summarization (MMS) essential. In this paper, we present a comprehensive survey of the existing research in the area of MMS.
... The history of automatic summarization began more than 60 years ago with Luhn (Luhn, 1958), followed by Edmundson (Edmundson, 1969) a decade later. The golden age of this field was the 1990s ((Salton, 1989), (Kupiec, Pedersen, & Chen, 1995)). Most of the research in automatic summarization has addressed extraction, but at the beginning of the millennium researchers started to focus on the problem of generating coherent summaries ((Mani, 1999), (Knight & Marcu, 2000)) using different approaches. ...
Article
Various approaches to text simplification have been proposed in an attempt to increase text readability. The rephrasing of syntactically and semantically complex structures is still challenging. A pedagogically motivated simplified version of the same text can have both positive and negative side effects. On the one hand, it can facilitate reading comprehension because of much shorter sentences and a limited vocabulary, but on the other hand, the simplified text often lacks coherence, unity and style. Therefore, reasonable trade-offs among linguistic simplicity, naturalness and informativeness are highly needed. This is a survey paper that discusses state-of-the-art approaches to sentence/text simplification and evaluation methods, along with an empirical evaluation of our approach. The quality of sentence splitting using the knowledge extraction tool SAAT was compared to that of state-of-the-art syntactic simplification systems. The research was carried out on the WikiSplit, the HSplit and the MinWikiSplit simplification corpora. Automatic metrics for the HSplit showed that the SAAT outperformed other TS systems in all categories. For the WikiSplit dataset, automatic metric scores were slightly lower than those of the baseline system DisSim. However, the human evaluation showed that DisSim outperformed the SAAT in terms of simplicity and grammar. The quality of AG18copy output corresponded to that of the SAAT. The inter-annotator agreement was calculated. Research limitations and suggestions for future research are also provided. https://authors.elsevier.com/a/1dnHg3PiGTH-vI
... tf-idf is computed for each word of a message and used as a feature. It is the most commonly adopted representation of a set of messages and is widely used in the vector space model [10]. It indicates the importance of terms in identifying a document. ...
Conference Paper
Full-text available
Emails are used in most fields of education and business. They can be classified into ham and spam, and with their increasing use, the proportion of spam is increasing day by day. There are several machine learning techniques that provide spam mail filtering methods, such as clustering, J48 and Naïve Bayes. This paper considers different classification techniques using WEKA to filter spam mails. Results show that the Naïve Bayes technique provides good accuracy (close to the highest) and takes the least time among the techniques considered. A comparative study of each technique in terms of accuracy and time taken is also provided. Keywords: spam mail filtering; blacklists; true positive rate; true negative rate
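The study compares classifiers in WEKA; as a rough Python analogue (an assumption, not the authors' setup), a Naïve Bayes spam filter over tf-idf features can be sketched as follows.

```python
# Toy spam/ham classifier: tf-idf features feeding a multinomial Naive Bayes model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mails  = ["win a free prize now", "meeting rescheduled to monday",
          "cheap loans click here", "please review the attached report"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(mails, labels)
print(model.predict(["free prize meeting"]))
```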
... Feature Vector Space. The feature vector space was proposed based on the idea of partial matching [17]. It assigns each independent item in the text a weight in order to characterise text data sets [3]. ...
Article
Full-text available
The K-means algorithm has been extensively investigated in the field of text clustering because of its linear time complexity and adaptation to sparse matrix data. However, it has two main problems, namely, the determination of the number of clusters and the location of the initial cluster centres. In this study, we propose an improved K-means++ algorithm based on the Davies-Bouldin index (DBI) and the largest sum of distance called the SDK-means++ algorithm. Firstly, we use the term frequency-inverse document frequency to represent the data set. Secondly, we measure the distance between objects by cosine similarity. Thirdly, the initial cluster centres are selected by comparing the distance to existing initial cluster centres and the maximum density. Fourthly, clustering results are obtained using the K-means++ method. Lastly, DBI is used to obtain optimal clustering results automatically. Experimental results on real bank transaction volume data sets show that the SDK-means++ algorithm is more effective and efficient than two other algorithms in organising large financial text data sets. The F-measure value of the proposed algorithm is 0.97. The running time of the SDK-means++ algorithm is reduced by 42.9% and 22.4% compared with that for K-means and K-means++ algorithms, respectively.
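Here is a minimal sketch of the role the Davies-Bouldin index plays in the abstract above, using scikit-learn's k-means++ initialization; the SDK-means++ centre selection based on distance and maximum density is not reproduced, and the toy corpus is an assumption.

```python
# Use the Davies-Bouldin index (lower is better) to pick the number of clusters
# for k-means++ over tf-idf vectors (illustrative, not the SDK-means++ algorithm).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

docs = ["bank transfer received", "loan application approved",
        "transfer to savings account", "credit card payment due",
        "loan repayment schedule", "card payment declined"]
X = TfidfVectorizer().fit_transform(docs).toarray()

best_k, best_dbi = None, float("inf")
for k in range(2, 5):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X)
    dbi = davies_bouldin_score(X, labels)
    if dbi < best_dbi:
        best_k, best_dbi = k, dbi
print(best_k, round(best_dbi, 3))
```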
... In the introduction, we spoke of "movement" at the level of language. One of the most classical representation methods in the field of NLP is based on the TF-IDF weighting measure [86]. We set out in Section 1.2 the mathematical formulas used to compute it. ...
Thesis
The work presented in this thesis, carried out in partnership with the company Électricité de France (EDF), aims to develop models for novelty detection in streams of textual data. For EDF, this is part of an effort to anticipate customer needs. We present the different novelty detection approaches existing in the literature, which allows us to define precisely the tasks we want to solve. These definitions allow us to set up evaluation methods based either on simulated data or on real data. Modifying real data allows us to simulate scenarios in which novelty appears and thus to measure the effectiveness of existing methods. We present two models for detecting novel elements, the first using probabilistic topic models. The second model is CEND, an algorithm based on the movements of words in high-dimensional representation spaces. This type of model allows us to distinguish between words linked to abrupt events and slowly emerging topics. We also present a model for monitoring the dynamics of classification schemes. By combining time series forecasting methods and sequential analysis, we can estimate when a temporal signal changes dynamics. We test these methods on press article data and on EDF industrial data.
... Three performance metrics were used in this paper: Precision, Recall, and Rank Power. Precision and recall are standard performance metrics for an information system's results that can be used to evaluate the performance of an outlier detection method (Baeza-Yates & Ribeiro-Neto, 1999; Salton, 1989). To define rank power (RP), let t_o be the number of true outliers identified in the top t instances and R_i be the rank of the i-th true outlier; then rank power (RP) can be calculated as follows: ...
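Since the excerpt elides the rank-power equation, the sketch below uses one commonly cited form, RP = t_o(t_o + 1) / (2 Σ R_i), as an assumption, alongside standard precision and recall.

```python
# Precision, recall and (assumed) rank power for a ranked outlier list.
def precision_recall_rank_power(ranked_ids, true_outliers, t):
    top = ranked_ids[:t]
    hits = [rank + 1 for rank, oid in enumerate(top) if oid in true_outliers]
    t_o = len(hits)                                   # true outliers in the top t
    precision = t_o / t
    recall = t_o / len(true_outliers)
    rank_power = t_o * (t_o + 1) / (2 * sum(hits)) if hits else 0.0
    return precision, recall, rank_power

ranked = ["a", "b", "c", "d", "e"]                    # ids sorted by outlier-ness score
print(precision_recall_rank_power(ranked, true_outliers={"a", "c"}, t=3))
```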
Article
Full-text available
An outlier has a significant impact on data quality and the efficiency of data mining. The outlier identification algorithm observes only data points that do not follow clearly defined meanings of projected behaviour in a data set. Several techniques for identifying outliers have been presented in recent years, but if outliers are located in areas where neighbourhood density varies substantially, it can result in an imprecise estimate. To address this problem, we provide a ‘Relative Density-based Outlier Factor (RDOF)’ algorithm based on the concept of mutual proximity between a data point and its neighbours. The proposed approach is divided into two stages: an influential space is created at a test point in the first stage. In the later stage, a test point is assigned an outlier-ness score. We have conducted experiments on three real-world data sets, namely the Johns Hopkins University Ionosphere, the Iris Plant, and Wisconsin Breast Cancer data sets. We have investigated three performance metrics for comparison: precision, recall, and rank power. In addition, we have compared our proposed method against a set of relevant baseline methods. The experimental results reveal that our proposed method detected all (i.e., 100%) outlier class objects with higher rank power than baseline approaches over these experimental data sets.
... (3)), with different similarity measures. We use Salton index (Salton, 1989), Sorensen index (Sorensen, 1948), and Leicht-Holme-Newman index (LHNI) (Leicht, Holme, & Newman, 2006) for the Eq. ...
Article
Parallel and distributed community detection in large-scale complex networks, such as social networks, is a challenging task. Parallel and distributed algorithm with high accuracy and low computational complexity is one of the essential issues in the community detection field. In this paper, we propose a novel fast, and accurate Spark-based parallel label diffusion and label selection-based (PLDLS) community detection algorithm with two-step of label diffusion of core nodes along with a new label selection (propagation) method. We have used multi-factor criteria for computing node's importance and adopted a new method for selecting core nodes. In the first phase, utilizing the fact that nodes forming triangles, tend to be in the same community, parallel label diffusion of core nodes is performed to diffuse labels up to two levels. In the second phase, through an iterative and parallel process, the most appropriate labels are assigned to the remaining nodes. PLDLS proposes an improved robust version of LPA by putting aside randomness parameter tuning. Furthermore, we utilize a fast and parallel merge phase to get even more dense and accurate communities. Conducted experiments on real-world and artificial networks, indicates the better accuracy and low execution time of PLDLS in comparison with other examined methods.
... The thought that unites them is something like a "tabula rasa", according to which our knowledge comes from experience provided through the senses, arguments reminiscent of the ML approach. With the aim of making text computable, [26] proposed encoding the presence of text by counting occurrences. This approach turned out to be very superficial; in fact, some time later [27][28][29] built large distributed representations of text using purpose-built neural networks and large corpora. ...
Article
Full-text available
Modern AI technologies make use of statistical learners that lead to self-empiricist logic, which, unlike human minds, use learned non-symbolic representations. Nevertheless, it seems that it is not the right way to progress in AI. The structure of symbols—the operations by which the intellectual solution is realized—and the search for strategic reference points evoke important issues in the analysis of AI. Studying how knowledge can be represented through methods of theoretical generalization and empirical observation is only the latest step in a long process of evolution. For many years, humans, seeing language as innate, have carried out symbolic theories. Everything seems to have skipped ahead with the advent of Machine Learning. In this paper, after a long analysis of history, the rule-based and the learning-based vision, we would investigate the syntax as possible meeting point between the different learning theories. Finally, we propose a new vision of knowledge in AI models based on a combination of rules, learning, and human knowledge.
... Some techniques produce measures of similarity between vertices of a network, and according to [43], [44], [34] such measures are the basis of link prediction theory. Examples include similarity based on random walk processes [33], [30], [34], the cosine similarity [45], the Jaccard similarity [46], and the inverse log-weighted similarity [47]. ...
Article
Full-text available
This article presents Maximum Visibility Approach (MVA), a new time series forecasting method based on the Complex Network theory. MVA initially maps time series data into a complex network using the visibility graph method. Then, based on the similarity measures between the nodes in the network, MVA calculates the one-step-ahead forecasts. MVA does not use all past terms in the forecasting process, but only the most significant observations, which are indicated as a result of the autocorrelation function. This method was applied to five different groups of data, most of them showing trend characteristics, seasonal variations and/or non-stationary behavior. We calculated error measures to evaluate the performance of MVA. The results of statistical tests and error measures revealed that MVA has a good performance compared to the accuracy obtained by the benchmarks considered in this work. In all cases, MVA surpassed other forecasting methods in the literature, which confirms that this work will contribute to the field of time series forecasting not only in the theoretical aspect, but also in practice.
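As an illustration of MVA's first step, the sketch below builds a natural visibility graph from a toy series using the standard visibility criterion (Lacasa et al.); the similarity-based forecasting stage described in the abstract is not reproduced.

```python
# Natural visibility graph: points a and b are connected if every intermediate
# point c lies strictly below the straight line joining (a, y_a) and (b, y_b).
def visibility_edges(series):
    n = len(series)
    edges = []
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.append((a, b))
    return edges

series = [3.0, 1.0, 2.5, 0.5, 4.0]
print(visibility_edges(series))
```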
... Cf. the general figure on the evaluation of automatic tagging presented in the introduction. For a general introduction to the procedures and aims of information retrieval in textual databases, see Salton (1989). ...
... One can also cite [Castagnos, 2008] [Adomavicius and Tuzhilin, 2005]. This field has made, and still makes, extensive use of precision and recall measures [Salton, 1989], so it is natural that the first recommender systems used these performance measures. Second, very early on in the field of recommender systems, data corpora such as MovieLens, thanks to GroupLens [Resnick et al., 1994], were made available to everyone ...
Thesis
Recommender systems represent a fundamental research field situated at the intersection of several major disciplines such as: machine learning, human-computer interaction and cognitive sciences. The objective of these systems is to improve interactions between the user and information access or retrieval systems. Facing heterogeneous and ever-increasing data, indeed it has become difficult for a user to access relevant information which would be satisfying his requests. Current systems have proven their added value, and they rely on various learning techniques. Nevertheless, despite temporal and spatial modeling has been made possible, state of the art models which are dealing with the order of recommendations or with the quality of a recommendation sequence are still too rare. In this thesis, we will focus on defining a new formalism and a methodological framework allowing: (1) the definition of human factors leading to decision making and user satisfaction; (2) the construction of a generic and multi-criteria model integrating these human factors with the aim of recommending relevant resources in a coherent sequence; (3) a holistic evaluation of user satisfaction with their recommendation path. The evaluation of recommendations, all domains combined, is currently done on a recommendation-by-recommendation basis, with each evaluation metric taken independently. The aim is to propose a more complete framework measuring the evolutivity and comprehensiveness of the path. Such a multi-criteria recommendation model has many application areas. For example, it can be used in the context of online music listening with the recommendation of intelligent and adaptive playlists. It can also be useful to adapt the recommendation path to the learner's progress and to the teacher's pedagogical scenario in an e-learning context.
... We mapped the co-occurrence relationships between keywords. Before beginning the analysis, the TLAB software did a "linguistic normalization" (Salton, 1989) to correct ambiguous words (typing errors, slang terms, abbreviations), carry out cleaning actions (e.g., the elimination of excess blank spaces, apostrophes, and additional spaces after punctuation marks, etc.), and convert multi-words into unitary strings (e.g., "Opera Duomo Museum" became "Opera_Duomo_Museum"). Then, it executed the text "lemmatization" (Karypis et al., 2000): it has turned words into entries corresponding to lemmas. ...
Article
Full-text available
The paper develops a research approach that combines digital ethnography with text mining to explore consumers’ perception of a brand and the degree of alignment between brand identity and image. In particular, the paper investigates the alignment between the art museum’s brand identity and the brand image emerging from visitors’ narratives of their experience. The study adopts a mixed methodology based on netnography and text mining techniques. The analysis concerns an art museum’s brand, with the case of the “Opera del Duomo Museum” in Florence. The methodological approach enables a combined investigation of user-generated content in online communities and the company’s online brand communication, contributing to identifying branding actions that can be taken to increase the brand alignment. It also enables the measurement of the degree of alignment between museums and visitors among common brand themes. Specific indicators of alignment are provided. A key point is the replicability of the model in other contexts of analysis in which the content produced by consumers in online contexts are relevant and readily available, such as fashion or food.
... That same year, Brian Pinkerton created WebCrawler, a search engine that, unlike the then existing crawlers, indexed the entire content of the web (Seymour et al., 2011;Sonnenreich, 1997). In addition, it used a vector-based information retrieval model (Salton, 1989), which improved the results by displaying them according to relevance. This model took into account the frequency and weight of the query terms (Mauldin, 1997;Pinkerton, 1994), something that has since become the standard. ...
Chapter
Full-text available
Search engines have become one of the main channels for accessing information on the Web. Their widespread use means that the media, companies, institutions or any agent whose objective is to be visible to or to attract digital audiences is obliged to ensure its website is ranked at the top of the results pages. In Spain alone, 88% of people use search engines daily, with Google being the most used to conduct these searches, claiming a national market share of 95% and more than 90% worldwide. Given their current importance, an in-depth understanding of these tools becomes essential and it is worth pondering just how we have reached this current situation. What exactly are the origins of search engines and how has Google come to exercise its quasi-monopoly? The objective of this study is to explore the origin of search engines and to describe the main milestones in their evolution. To do so, a bibliographic review has been carried out using the main academic databases in the social sciences comprising a narrative search, complemented by an examination of the grey literature. The result is a journey through the history of search engines from their origins and subsequent technology developments to the creation of the World Wide Web. Likewise, a study is made of the original search engines and their main characteristics, with a particular emphasis on the path taken by Google given its current position of supremacy. Understanding the past has direct implications not only for our understanding of the present reality of web searches and access to information, but it is also essential for managing the continuous digital transformation to which we are all exposed, which has repercussions for all areas of the economy and society and, of course, for communication.
... In document retrieval applications this matching process tends to be rather complex. The characterisation of documents is known to be a hard problem ( [Mar77], [Cra86]), although newly developed approaches turn out to be quite successful ( [Sal89]). In information systems the matching process is less complex as the objects in the information base have a more clear characterisation (the identification). ...
Preprint
Full-text available
Effective information disclosure in the context of databases with a large conceptual schema is known to be a non-trivial problem. In particular, the formulation of ad-hoc queries is a major problem in such contexts. Existing approaches for tackling this problem include graphical query interfaces, query by navigation, query by construction, and point-to-point queries. In this article we propose the spider query mechanism as a final cornerstone for an easy-to-use, computer-supported query formulation mechanism for InfoAssistant. The basic idea behind a spider query is to build a (partial) query of all information considered to be relevant with respect to a given object type. The result of this process is always a tree that fans out over the existing conceptual schema (a spider). We also provide a brief discussion of the integration of the spider query mechanism with the existing query by navigation, query by construction, and point-to-point queries.
... In document retrieval applications this matching process tends to be rather complex. The characterisation of documents is known to be a hard problem ( [Mar77], [Cra86]), although newly developed approaches turn out to be quite successful ( [Sal89]). In information systems the matching process is less complex as the objects in the information base have a more clear characterisation (the identification). ...
Preprint
Full-text available
Most present day organisations make use of some automated information system. This usually means that a large body of vital corporate information is stored in these information systems. As a result, an essential function of information systems should be the support of disclosure of this information. We purposely use the term "information disclosure" in this context. When using the term information disclosure we envision a computer supported mechanism that allows for an easy and intuitive formulation of queries in a language that is as close to the user's perception of the universe of discourse as possible. From this point of view, it is only obvious that we do not consider a simple query mechanism where users have to enter complex queries manually and look up what information is stored in a set of relational tables. Without a set of adequate information disclosure avenues an information system becomes worthless since there is no use in storing information that will never be retrieved.
... Smith & Humphreys 2006, 263. Computation is based on a naïve Bayesian co-occurrence metric as discussed in Salton (1989) and Dumais et al. (1998). ... additional terms to the previous concept definition. ...
... The content-based approach suggests items that are similar to those the user liked in the past, using item features and user interests to build a user profile. Term frequency-inverse document frequency (TF-IDF) is a popular technique used in information retrieval [111] that applies a heuristic similarity for measuring item-to-item likeness. The model proposed in [70] adopts semantic analysis and clustering techniques on social user profiles for recommendations. ...
Article
Full-text available
Context-aware recommender systems dedicated to online social networks experienced noticeable growth in the last few years. This has led to more research being done in this area stimulated by the omnipresence of smartphones and the latest web technologies. These systems are able to detect specific user needs and adapt recommendations to actual user context. In this research, we present a comprehensive review of context-aware recommender systems developed for social networks. For this purpose, we used a systematic literature review methodology which clearly defined the scope, the objective, the timeframe, the methods, and the tools to undertake this research. Our focus is to investigate approaches and techniques used in the development of context-aware recommender systems for social networks and identify the research gaps, challenges, and opportunities in this field. In order to have a clear vision of the research potential in the field, we considered research articles published between 2015 and 2020 and used a research portal giving access to major scientific research databases. Primary research articles selected are reviewed and the recommendation process is analyzed to identify the approach, the techniques, and the context elements employed in the development of the recommendation systems. The paper presents the detail of the review study, provides a synthesis of the results, proposes an evaluation based on measurable evaluation tools developed in this study, and advocates future research and development pathways in this interesting field.
... To represent textual documents, the "vector space model" (or simply "vector model") is most often used. In this model, each document is represented by a vector, and each position in the vector corresponds to a dimension called an attribute/term of the document collection (SALTON, 1989). These attributes usually represent single words, but they can also represent sets of words or phrases. ...
Thesis
Full-text available
The amount of knowledge accumulated in scientific articles leads researchers to deal with an expressive number of publications and their fragmentation in different fields of specialties or disciplines. However, it is possible to make the connection between these areas through Literature-Based Discovery. This approach aims to relate different specialties in order to find implicit relationships potentially usable for raising new scientific hypotheses. To make this process feasible and make it faster and more effective, Literature-Based Discovery relies on the help of Text Mining techniques. Despite all the progress made in these areas, researchers still have to deal with the lack of logical explanations for the relationships found. Recent researches have shown several advances in this direction with the help of techniques based on linguistic analysis, with a focus on semantic approaches. However, incorporating an approach that considers and explains the cause and effect relationships between concepts is still a challenge to be overcome. In this context, this doctoral thesis was motivated by the potential of verbal semantics and knowledge representation in concept maps, in order to provide detailed explanations about the mechanisms of causal interaction between concepts. The development of this work had the general purpose of advancing research in the field of Literature-Based Discovery with a focus on detecting causal relationships. For this, a hybrid approach was developed, based on statistical and linguistic analysis. Experiments carried out revealed that statistical techniques based on association rules and complex network metrics enable the selection of the most representative concepts of the corpus, while techniques based on linguistic analysis, focusing on verbal semantics, favor the extraction of causal relationships. These relationships, when represented in concept maps, compose a logical chain of connections, providing an easily interpretable output. This representation model aids the detection of hidden links and knowledge discovery by the user. The results reported in this thesis provide evidence that the approach is effective in reconstructing and explaining discovery hypotheses based on the historical literature, in addition to facilitating the testing and generation of new hypotheses. These results show the benefits that a hybrid Text Mining approach can provide to Literature-Based Discovery.
... However, our approach does not consider the order of the words; instead, it treats them as a bag of words. The choice of a similarity metric is not prescribed theoretically [36], so we study the performance and accuracy of the well-known vector similarity metrics cosine, Dice, Jaccard (the Tanimoto coefficient), and overlap to find the most suitable one in this application domain (see Tab. 1). ...
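Here is a hedged sketch of the four similarity metrics listed in the excerpt (cosine, Dice, Jaccard and overlap), computed on bag-of-words term sets; the exact weighting used in the cited work may differ.

```python
# Set-based similarity metrics between two token bags.
import math

def similarities(a_tokens, b_tokens):
    A, B = set(a_tokens), set(b_tokens)
    inter = len(A & B)
    return {
        "cosine":  inter / math.sqrt(len(A) * len(B)) if A and B else 0.0,
        "dice":    2 * inter / (len(A) + len(B)) if A or B else 0.0,
        "jaccard": inter / len(A | B) if A | B else 0.0,        # Tanimoto coefficient
        "overlap": inter / min(len(A), len(B)) if A and B else 0.0,
    }

print(similarities("the cat sat on the mat".split(),
                   "a cat on a red mat".split()))
```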
Book
Full-text available
This textbook covers the basic questions of computational linguistics, from the theory of linguistic and mathematical modelling to variants of technological solutions. A linguistic interpretation of the main linguistic objects and units of analysis is given. It provides the information needed to build the individual subsystems responsible for the analysis of natural language texts. Questions of building systems for the classification and clustering of textual data, and the foundations of the fractal theory of textual information, are considered. Intended for undergraduate and postgraduate students of higher education institutions working in the field of natural language text processing.
Article
The content of climate change disclosures of large, global companies evolved from 2007−2016. Within that window, the same set of firms started measuring and disclosing their supply chain carbon emissions. Does carbon footprinting influence the nature and content of a firm’s disclosure on the climate change risks that are expected to affect its business? We explore this question using more than 10,925 climate change disclosures collected by the CDP (formerly the Carbon Disclosure Project) from 2,003 firms worldwide. We use singular value decomposition and text‐similarity scores to quantitatively examine the content of the CDP disclosures from 2007−2016. Using fixed‐effects and dynamic panel models, we find that measuring supply chain carbon emissions (Scope 3) explains a substantial shift in the content and nature of the disclosures. We find no evidence that measuring and disclosing direct emissions (Scope 1) are associated with substantial changes in the content of the disclosures. One explanation for this is that most of the climate change‐related risks are in the supply chain, not within the company boundaries of large, global firms. Our results show the importance of encouraging firms to voluntarily measure their supply chain carbon emissions if they are not yet aware of their contribution and exposure to climate change. Our work shows that firms’ response to climate change is dynamic, and it may take a decade to detect these shifts.
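The text-analysis step described above (singular value decomposition plus text-similarity scores) could look roughly like the following sketch; the toy disclosures and the choice of cosine similarity as the text-similarity score are assumptions, not the authors' pipeline.

```python
# TF-IDF followed by truncated SVD, then cosine similarity between disclosure texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

disclosures = [
    "regulatory risk from carbon pricing affects operations",
    "supply chain emissions expose the firm to climate risk",
    "physical risks such as flooding disrupt logistics and suppliers",
]
X = TfidfVectorizer().fit_transform(disclosures)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(Z[0:1], Z[1:2]))    # similarity between two disclosure texts
```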
Chapter
Information and communication technology (ICT) encompasses all forms of technology involved in information dissemination. ICT has been used in different sectors, including the oil and gas industry, banking, education, real estate, and marketing. This work seeks to assess how much ICT bears on the work efficiency of graduates. The research was carried out through a survey, and one hundred and one respondents completed the questionnaire. The results show that all of the respondents agreed that ICT has improved their work efficiency.
Article
Bipartite graphs have been widely used to model the relationship between entities of different types, where vertices are partitioned into two disjoint sets/sides. Finding dense subgraphs in a bipartite graph is of great significance and encompasses many applications. However, none of the existing dense bipartite subgraph models consider similarity between vertices from the same side, and as a result, the identified results may include vertices that are not similar to each other. In this paper, we formulate the notion of similar-biclique which is a special kind of biclique where all vertices from a designated side are similar to each other, and aim to enumerate all similar-bicliques. The naive approach of first enumerating all maximal bicliques and then extracting all maximal similar-bicliques from them is inefficient, as enumerating maximal bicliques is time consuming. We propose a backtracking algorithm MSBE to directly enumerate maximal similar-bicliques, and power it by vertex reduction and optimization techniques. Furthermore, we design a novel index structure to speed up a time-critical operation of MSBE, as well as to speed up vertex reduction. Efficient index construction algorithms are also developed. Extensive experiments on 17 bipartite graphs as well as case studies are conducted to demonstrate the effectiveness and efficiency of our model and algorithms.
Article
Full-text available
Content analysis is a common and flexible technique to quantify and make sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques to psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm to validly predict those codes, at generally acceptable standards. We then conduct a Monte-Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations to psychologists on the necessary dataset characteristics to achieve valid prediction of content codes to guide researchers on the use of NLP models to replace human coders in content analysis research.
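A minimal sketch, under assumptions, of the general idea of predicting human content codes with an NLP classifier: TF-IDF features feeding a logistic regression, checked on a held-out split. The tiny labeled corpus and the code labels below are invented for illustration and do not come from the study.

```python
# Sketch only: predicting content codes from text with a simple supervised pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = ["I felt supported by my team", "Deadlines caused constant stress",
         "My manager listened to my concerns", "Workload left me exhausted"] * 10
codes = ["support", "strain", "support", "strain"] * 10   # invented human content codes

X_train, X_test, y_train, y_test = train_test_split(texts, codes, test_size=0.25, random_state=0)
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Agreement between predicted and (invented) human codes on held-out cases.
print(classification_report(y_test, model.predict(X_test)))
```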
Thesis
Making the most of the informational assets of a manufacturing company is an important challenge: it enables informed decision-making and the detection of new value-added opportunities. When digitized, these informational assets consist of heterogeneous data distributed across the company's various silos, making a holistic view of the information difficult. This thesis proposes to access the company's heterogeneous and distributed information through an information retrieval system. The originality of the proposal lies in considering and modeling all of the company's structured and unstructured data in a single graph. In addition, the information need is expressed as a query composed of two variables, the 'what' and the 'about what', and returns as a result a list of documents or records, a list of property values, or a list of sentences. Applying the approach to a case study made it possible to identify a list of key issues to address in order to improve the usual performance criteria in information retrieval, namely the ability to return all relevant results (recall) and only relevant results (precision). The four issues to consider are: (i) handling the syntactic specificities of the data, (ii) semantically extending the terms used in the search, (iii) filtering out results of low relevance and (iv) detecting implicit links between the data. An enrichment of the proposal is then put forward to address all of these issues, including the transformation of tables in unstructured documents into a graph, a semantic extension of the search terms using a knowledge graph, and additional filtering steps for assessing the relevance of results. Finally, the enriched approach is applied to a second case study in order to validate the proposal.
Thesis
This thesis addresses the problem of knowledge transfer in mediated environments in the era of massive data. We propose a multicriteria decision-aid method, MAI2P (Multicriteria Approach for the Incremental Periodic Prediction), for the periodic and incremental prediction of the decision class to which an action is likely to belong. The MAI2P method relies on three phases. The first phase consists of three steps: building a family of criteria to characterize the actions; building a set of "reference actions" representative of each decision class; and building a decision table. The second phase relies on the DRSA-Incremental algorithm that we propose for inferring and updating the set of decision rules as the set of "reference actions" is incremented sequentially. The third phase classifies the "potential actions" into one of the decision classes using the inferred set of decision rules. The MAI2P method is validated in the context of MOOCs (Massive Open Online Courses), which are online courses characterized by a large volume of data exchanged among a massive number of learners. It enabled the weekly prediction of three decision classes: Cl1, "at-risk learners" likely to drop out of the MOOC; Cl2, "struggling learners" who do not, however, intend to drop out; and Cl3, "leader learners" likely to support the other two classes of learners by passing on the information they need. The prediction is based on the data from all previous weeks of the MOOC in order to predict a learner's profile for the following week. A recommender system, KTI-MOOC (Recommender system for the Knowledge Transfer Improvement within a MOOC), is developed to recommend to each "at-risk learner" or "struggling learner" a personalized list of "leader learners". The KTI-MOOC system is based on demographic filtering and aims to foster each learner's individual appropriation of the information exchanged.
Chapter
This research surveys a range of techniques, challenges, and areas of investigation that are important to data mining technology. Many multinational corporations and large organisations operate from multiple locations in different countries, and each location can generate huge amounts of data; corporate executives draw on every individual source to make vital decisions. The data warehouse is widely used in large enterprises to enhance the effectiveness of administrative decision-making, and in an uncertain and highly competitive business environment the value of strategic information systems is clearly recognized: efficiency or frequency alone is not the key to competitiveness. Data volumes on the order of terabytes to petabytes have profoundly changed scientific fields and their requirements. To analyze, monitor, and make decisions over this massive quantity of information, data-mining-based methodologies are required, and these are evolving in many areas. The chapter also reviews current classification and information retrieval strategies that apply algorithms to large document collections. Accordingly, uncovering knowledge in texts is chosen as the appropriate methodology, and the steps to reach unsupervised document classification are described. After conducting an experiment with three of the best-known strategies for unsupervised document classification and evaluating the results with the Silhouette index, the best grouping was obtained with four clusters, whose main characteristic was to cover subjects such as information management, systems management, artificial intelligence, and digital image processing.
Article
Making the life-saving treatment of transplantation available to patients who need it requires the cooperation of individuals and families who decide to donate organs. Healthcare workers navigate organizational, bureaucratic and relational aspects of this process, including cases in which a deceased individual has not specified a wish about organ donation and their surviving family members must be asked for consent to donate during a delicate phase of mourning. This research aims to understand the experience of these health workers regarding their work. We collected 18 interviews from organ donation healthcare workers in five of the major hospitals in Rome. The transcripts underwent a multivariate text analysis to identify the representations of organ donation and the symbolic categories organizing the practice of these workers. This research elucidated a symbolic space constructed of four factors: the "Context", involving family and health workers; the "Work purposes", including the procedures and the relationships; the "Transplant", which involves omnipotence and limits; the "Donation", which involves ideals versus reality. The characterizing elements of these representations, belonging to organ donation workers, are the prestige, the certification of brain death, the communication, the transplant, and the salvation. In the lives of these workers, to be a "bridge between life and death" evokes feelings of prestige rather than difficult feelings associated with confronting one's limitations. These aspects concern the difficulties met by the health staff in their work, and they are useful elements to design a focused training and support program for organ donation workers.
Article
Full-text available
Community detection is a broad area of study in network science, in which correct detection helps to obtain information about groups and the relationships between their nodes. Community detection algorithms use the available snapshot of a network to detect its underlying communities. However, if this snapshot is incomplete, the algorithms may not recover the correct communities. This work proposes a set of link prediction heuristics using different network properties to estimate a more complete version of the network and improve community detection algorithms. Each heuristic returns the edges most likely to be observed in a future version of the network. We performed experiments on real-world and artificial networks with different insertion sizes, comparing the results with two approaches: (i) without using edge insertion and (ii) using the EdgeBoost algorithm, based on node similarity measures. The experiments show that some of our proposed heuristics improve the results of traditional community detection algorithms. This improvement is even more prominent for networks with poorly defined structures.
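As a hedged illustration of the general strategy (not the paper's specific heuristics), the sketch below scores missing edges with a standard node-similarity link predictor, inserts the top-ranked ones, and then runs a conventional community detection algorithm on the densified graph. The example graph, the predictor and the insertion size are assumptions.

```python
# Sketch only: densify an incomplete network with predicted edges before community detection.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                          # stand-in for an incomplete snapshot
scores = nx.jaccard_coefficient(G, nx.non_edges(G))  # similarity-based link prediction
top_edges = sorted(scores, key=lambda t: t[2], reverse=True)[:10]

H = G.copy()
H.add_edges_from((u, v) for u, v, _ in top_edges)    # insert the most likely missing edges

print(greedy_modularity_communities(H))              # communities on the densified graph
```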
Chapter
Recommendation systems are widely used on almost all websites to help customers make decisions in the face of information overload. These systems provide users with personalized recommendations and help them make the right choices. Customer opinion analysis is a vast area of research, especially in recommendation systems, which justifies the growing importance of opinion analysis. Among the strong classifiers for this task is the wide-margin separator (SVM) classifier, which gives encouraging results in the literature. Despite these classification results, there remains a problem of ambiguity in the meaning of words, which affects recommendation systems based on opinion analysis: an Arabic word can have several meanings, so comments containing ambiguous words risk being misclassified and leading to false recommendations. After a comparative study of the methods proposed in the literature, we present a method that performs disambiguation before the classification phase, which we carry out using the SVM algorithm. Our proposed system achieves its best accuracy of 97.1%. Keywords: Recommender system, Disambiguation, Collaborative filtering
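A minimal sketch, assuming scikit-learn, of the classification step described above: a wide-margin separator (linear SVM) over TF-IDF features. The toy English reviews stand in for the Arabic comments, and the disambiguation step the chapter proposes is not reproduced here.

```python
# Sketch only: SVM opinion classification over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

reviews = ["excellent product, highly recommended", "terrible quality, do not buy",
           "works as described, very happy", "broke after one day, disappointed"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(reviews, labels)
print(clf.predict(["happy with this purchase"]))   # expected: ['positive']
```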
Preprint
Full-text available
Documenting cultural heritage using artificial intelligence (AI) is crucial for preserving the memory of the past and a key point for future knowledge. However, modern AI technologies rely on statistical learners that lead to a self-empiricist logic which, unlike human minds, uses learned non-symbolic representations. Nevertheless, this does not seem to be the right way to progress in AI. If we want to rely on AI for these tasks, it is essential to understand what lies behind these models. Among the ways to discover AI are the senses and the intellect. We could consider AI as an intelligence. Intelligence has an essence, but we do not know whether it can be considered "something" or "someone". Important issues in the analysis of AI concern the structure of symbols (the operations with which the intellectual solution is carried out) and the search for strategic reference points, aspiring to create models with human-like intelligence. For many years, humans, seeing language as innate, developed symbolic theories. Everything seems to have changed with the advent of Machine Learning. In this paper, after a long analysis of the history, the rule-based and the learning-based visions, we propose KERMIT as a unit of investigation for a possible meeting point between the different learning theories. Finally, we propose a new vision of knowledge in AI models based on a combination of rules, learning and human knowledge.
Article
Full-text available
Collaborative Filtering (CF)-based recommendation methods suffer from (i) sparsity (few user–item interactions) and (ii) cold start (an item cannot be recommended if no ratings exist). Systems using clustering and pattern mining (frequent and sequential) with similarity measures between clicks and purchases for next-item recommendation cannot perform well when the matrix is sparse, due to the rapid increase in the number of items. Additionally, they suffer from (i) lack of personalization: patterns are not targeted for a specific customer; and (ii) lack of semantics among recommended items: they can only recommend items that exist as a result of a matching rule generated from frequent sequential purchase pattern(s). To better understand users' preferences and to infer the inherent meaning of items, this paper proposes a method to explore semantic associations between items, obtained by utilizing item (product) metadata such as title, description and brand based on their semantic context (co-purchased and co-reviewed products). The semantics of these interactions are obtained through the distributional hypothesis, which learns an item's representation by analyzing the context (neighborhood) in which it is used. The idea is that items co-occurring in a context are likely to be semantically similar to each other (e.g., items in a user purchase sequence). The semantics are then integrated into different phases of the recommendation process: (i) preprocessing, to learn associations between items; (ii) candidate generation, while mining sequential patterns and in collaborative filtering to select top-N neighbors; and (iii) output (recommendation). Experiments performed on a publicly available e-commerce dataset show that the proposed model performed well and reflected user preferences by recommending semantically similar and sequential products.
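The following is an illustrative sketch only, in the spirit of the approach described above: learn item representations from product metadata with gensim's Doc2Vec (global context) and TF-IDF (local context), then fuse the two cosine-similarity matrices to score semantic relatedness between items. The product metadata, model parameters and the simple averaging scheme are assumptions, not the authors' exact configuration.

```python
# Sketch only: semantic item relatedness from textual metadata (assumes gensim 4.x, scikit-learn).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "p1": "wireless optical mouse ergonomic usb receiver",
    "p2": "bluetooth wireless mouse rechargeable silent click",
    "p3": "stainless steel water bottle insulated leak proof",
}
ids, texts = list(items), list(items.values())

# Global context: Doc2Vec over tokenized metadata, one tagged document per product.
docs = [TaggedDocument(t.split(), [i]) for i, t in zip(ids, texts)]
d2v = Doc2Vec(docs, vector_size=32, min_count=1, epochs=50, seed=1)
d2v_sim = cosine_similarity([d2v.dv[i] for i in ids])

# Local context: TF-IDF over the same metadata.
tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(texts))

# Simple fusion: average the two similarity matrices (equal weighting is an assumption).
fused = 0.5 * d2v_sim + 0.5 * tfidf_sim
print(dict(zip(ids, fused[0])))   # relatedness of each item to p1
```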
Article
Full-text available
Semantics and XML are used in document classification to develop a tree-based classification method for XML data. Document classification plays a central role in information management and retrieval and is a learning problem. It also has a major role in many applications, especially in classifying, organizing, searching and concisely representing large volumes of information. A swarm-optimized, tree-based association rule approach is presented for the classification of semi-structured data using soft computing. To improve document classification, a tree pruning technique is proposed to prune weak and infrequent rules, together with a binary particle swarm optimization (BPSO) method to optimize tree construction. The method was evaluated on the Reuters dataset. Results show that the new method performs well in precision and recall compared with current methods.