
... The historical method proposes to reduce the dimension of the document-term matrix by factorizing it. This is Latent Semantic Analysis (Deerwester et al., 1990). It relies on the singular value decomposition (SVD) of the initial data matrix. ...
... Earlier, less-cited works had proposed learning low-dimensional vector representations of words. Some already used neural networks (Bengio et al., 2003), while others extend the approach of (Deerwester et al., 1990), such as (Schütze, 1993). ...
... The first approach, proposed in 1988 (Dumais et al., 1988) and then in 1990 (Deerwester et al., 1990), applies a simple SVD to the Tf or Tf-idf matrix. The result can then be truncated (Le and Mikolov, 2014). ...
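A minimal sketch of the truncated-SVD pipeline these excerpts describe, assuming a toy corpus, default TF-IDF settings, and two retained components (none of which come from the cited works):

```python
# LSA sketch: build a TF-IDF document-term matrix and factorize it with a truncated SVD.
# The corpus and the number of components are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "latent semantic analysis reduces the document term matrix",
    "singular value decomposition factorizes the matrix",
    "word embeddings give low dimensional word representations",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # sparse documents x terms matrix

svd = TruncatedSVD(n_components=2)     # keep only the 2 largest singular values
doc_vectors = svd.fit_transform(X)     # dense low-dimensional document representations
print(doc_vectors.shape)               # (3, 2)
```

The rows of doc_vectors are the low-dimensional document representations that LSA then compares.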
Thesis
The digital revolution has led to exponential growth in the amount of information stored over the long term. A significant portion of this information is textual (web pages, social media, etc.). Natural language processing (NLP) models, which make it possible to classify or cluster this information, need text to be represented as mathematical objects: this is known as representation learning. The goal of representation learning is to build representations of textual objects (words, documents, authors) in a low-dimensional vector space. The similarity between the vector representations of these objects should reflect their semantic proximity or their stylistic similarity. In addition to the text itself, documents are often associated with metadata. They may be linked (for example, by hypertext references), associated with their authors, and timestamped. This information has been shown to improve the quality of document representations. Nevertheless, incorporating such metadata is not trivial. Moreover, natural language has evolved rapidly over the past decades. Representation models are now trained on massive amounts of textual data and fine-tuned for specific tasks. These models are of great interest when working with small datasets, as they allow knowledge to be transferred from relevant sources of information. It is therefore crucial to develop representation learning models that can incorporate these pre-trained representations. Most prior work learns a point representation. This is a serious limitation because language is more complex than that: words are often polysemous, and documents usually cover several topics. A branch of the literature proposes learning probability distributions in a semantic space to circumvent this problem. In this thesis, we first present machine learning theory, as well as a general overview of existing work on word and document representation learning (without metadata). We then focus on learning representations of linked documents. We review prior work in the field and propose two contributions: the RLE (Regularized Linear Embedding) model and the GELD (Gaussian Embedding of Linked Documents) model. Next, we explore learning author and document representations in the same vector space. We present the most recent work and our contribution, VADE (Variational Authors and Documents Embedding). Finally, we study the problem of learning dynamic author representations: their representations must evolve over time. We first present existing models and then propose an original contribution, DGEA (Dynamic Gaussian Embedding of Authors). In addition, we propose several scientific directions to improve our contributions, and some open questions for future research.
... To overcome the lexical mismatch challenge, dense embeddings are used to represent queries and documents. The main idea of this method was proposed with the LSI approach [3]. ...
... So the range [highest_score-threshold, highest_score] would not be large enough to cover many of the relevant articles. - Queries with (3,5,7) relevant articles give the systems significantly lower recall@20 than queries with (2,4,6) relevant articles. Most of the systems' recall@20 results are pulled down because of queries with (2,4,6) relevant articles. ...
Preprint
This study deals with the problem of information retrieval (IR) for Vietnamese legal texts. Despite being well researched in many languages, information retrieval has still not received much attention from the Vietnamese research community. This is especially true for the case of legal documents, which are hard to process. This study proposes a new approach for information retrieval for Vietnamese legal documents using sentence-transformer. Besides, various experiments are conducted to make comparisons between different transformer models, ranking scores, syllable-level, and word-level training. The experiment results show that the proposed model outperforms models used in current research on information retrieval for Vietnamese documents.
... This is because our purpose is to estimate intuitively understandable topics from many documents and to estimate how the estimated topics fluctuate over time in proportion. Topic modeling is a generic name for various techniques such as latent semantic analysis (LSA) 28 , probabilistic latent semantic analysis (PLSA) 29 , and latent Dirichlet allocation (LDA) 30 . We decided to utilize STM considering the advantages and disadvantages of various topic modeling techniques. ...
... Therefore, the result cannot be interpreted as a probability and is difficult to understand intuitively. Because of these attributes, the researchers who proposed LSA stated that the factors estimated by LSA are not meant for verbal description 28. ...
Article
Full-text available
Inappropriate information on a deadly and rare disease can make people vulnerable to problematic decisions, leading to irreversible bad outcomes. This study explored online information exchanges on pancreatic cancer. We collected 35,596 questions and 83,888 answers related to pancreatic cancer from January 1, 2003 to May 31, 2020, from Naver, the most popular Korean web portal. We also collected 8495 news articles related to pancreatic cancer during the same period. The study methods employed were structural topic modeling, keyword frequency analysis, and qualitative coding of medical professionals. The number of questions and news articles increased over time. In Naver’s questions, topics on symptoms and diagnostic tests regarding pancreatic cancer increased in proportion. The news topics on new technologies related to pancreatic cancer from various companies increased as well. The use of words related to back pain—which is not an important early symptom in pancreatic cancer—and biomarker tests using blood increased over time in Naver’s questions. Based on 100 question samples related to symptoms and diagnostic tests and an analysis of the threaded answers’ appropriateness, there was considerable misinformation and commercialized information in both categories.
... There are several approaches proposed by the authors in [90][91][92] for performing topic modelling on unstructured text data. The most widely used approaches for topic modelling in the literature are: 1) Latent Semantic Analysis (LSA) [90], 2) Non-Negative Matrix Factorization (NNMF) [91], 3) Probabilistic Latent Semantic Analysis (PLSA) [92] and 4) Latent Dirichlet Allocation (LDA) [93]. Although the traditional approaches show very promising results and were used in a variety of studies for performing topic modelling on social media data [94][95][96][97]. ...
Article
Measuring and analyzing user perceptions and behaviors in order to make user-centric decisions has been a topic of research for a long time, even before the invention of social media platforms. In the past, the main approaches for measuring user perceptions were conducting surveys, interviewing experts, and collecting data through questionnaires. But the main challenge with these methods was that the extracted perceptions were only able to represent a small group of people and not the whole public. This challenge was resolved when social media platforms like Twitter and Facebook were introduced and users started to share their perceptions about any product, topic, or event using these platforms. As these platforms became popular, the amount of data being shared on them started to grow exponentially, and this growth led to another challenge: analyzing this huge amount of data to understand or measure user perceptions. Computational techniques are used to address this challenge. This paper briefly describes artificial intelligence (AI) techniques, one of the types of computational techniques available for analyzing social media data. Along with brief information about AI techniques, this paper also surveys state-of-the-art studies that utilize AI techniques for measuring user perceptions from social media data.
... This opens the possibility of using mathematical tools to calculate the similarity of two components by measuring the distance between their corresponding vectors. Kintsch [6] uses latent semantic analysis (LSA) [2] for modeling the vector space. They generate term vectors that highly correlate with both the topic and the vehicle; correlation is measured by cosine similarity over the LSA vectors. ...
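A minimal sketch of the cosine-similarity scoring mentioned in this excerpt, with made-up stand-ins for the LSA term vectors (the values and dimensionality are assumptions):

```python
# Cosine similarity between two (hypothetical) LSA term vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

topic_vec = np.array([0.8, 0.1, 0.3])    # placeholder LSA vector for the topic term
vehicle_vec = np.array([0.7, 0.2, 0.2])  # placeholder LSA vector for the vehicle term
print(cosine(topic_vec, vehicle_vec))    # value close to 1 means high correlation
```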
... Specifically, we use DepCC, a dependency-parsed "web-scale corpus" based on Common Crawl. Every sentence is provided with word dependencies discovered by MaltParser [17]. We only use a fraction of the corpus, containing some 1.7B tokens. ...
... The large feature sizes of our CWMs in Table 3.8 were 1,649 for Confidentiality, Availability and Access Complexity; 4,154 for Integrity and Access Vector; 3,062 for Authentication; and 5,104 for Severity. To address this challenge in RQ4, we investigated a dimensionality reduction method (i.e., Latent Semantic Analysis (LSA) [245]) and recent sub-word embeddings (e.g., fastText [196,246]) for SV assessment. fastText is an extension of Word2Vec [202] word embeddings, in which character-level features are also considered. ...
... To what extent can a low-dimensional model retain the original performance? The features of our proposed model in RQ3 are high-dimensional and sparse. Hence, we evaluate a dimensionality reduction technique (i.e., Latent Semantic Analysis [245]) ...
Preprint
Full-text available
The thesis advances the field of software security by providing knowledge and automation support for software vulnerability assessment using data-driven approaches. Software vulnerability assessment provides important and multifaceted information to prevent and mitigate dangerous cyber-attacks in the wild. The key contributions include a systematisation of knowledge, along with a suite of novel data-driven techniques and practical recommendations for researchers and practitioners in the area. The thesis results help improve the understanding and inform the practice of assessing ever-increasing vulnerabilities in real-world software systems. This in turn enables more thorough and timely fixing prioritisation and planning of these critical security issues.
... LSA is a technique that extracts and demonstrates the contextual usage of words by using statistical computations (Landauer et al. 1998). LSA addresses the semantic structure of documents, improving the extraction of relevant documents, and it finds relationships between the documents (Deerwester et al. 1990). In other words, LSA determines the similarity of word meanings. ...
... The initial term-by-document matrix is approximated using the n largest singular values and their associated singular vectors, as seen in Eq. 2 (Deerwester et al. 1990). ...
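In the usual LSA formulation, the rank-n approximation this excerpt refers to as Eq. 2 is the truncated SVD of the term-by-document matrix X; a standard way to write it (notation assumed here, not copied from the cited text) is:

```latex
X \;\approx\; X_n \;=\; U_n \, \Sigma_n \, V_n^{\top}
```

where U_n and V_n hold the singular vectors associated with the n largest singular values, and Σ_n is the diagonal matrix of those singular values.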
Article
Full-text available
Topic detection from Twitter is a significant task that provides insight into real-time information. Recently, word embedding methods and topic modeling techniques have been utilized to find latent topics in various fields. Detecting topics leads to effective semantic structure and provides a better understanding of users. In the proposed study, different types of topic detection techniques are utilized, which are latent semantic analysis (LSA), Word2Vec, and latent Dirichlet allocation (LDA), and their performances are evaluated by the implementation of the K-means clustering technique on a real life application. In this case study, tweets were gathered after an earthquake with a magnitude of 6.6 on the Richter scale that took place on October 30, 2020, on the coast of the Aegean Sea (İzmir), Turkey. Tweets are clustered under fifteen hashtags separately, and the aforementioned techniques are applied to data-sets which vary in size. Therefore, the novelty of the proposed paper can be expressed as the comparison of different topic models and word embedding methods implemented for different sizes of documents in order to demonstrate the performance of these methods. While Word2Vec gives good results in small data-sets, LDA generally gives better results than Word2Vec and LSA in medium and large data-sets. Another aim of the proposed study is to provide information to decision makers for supporting victims and society. Therefore, the general situation of society is analyzed and society's attitude is demonstrated for decision-makers to take actionable activities such as psychological support, educational support, financial support, and political activities, etc.
... Based on this hypothesis, early methods attempt to learn a fixed set of vectors for word representations. Before the advent of neural models, researchers mostly used corpus-based cooccurrence statistics (Deerwester et al., 1990;Schütze, 1992;Brown et al., 1992;Lund and Burgess, 1996). These word vectors have been found to be helpful for intrinsic evaluation tasks (e.g., word similarities (Rubenstein and Goodenough, 1965;Miller and Charles, 1991) and analogies (Turney and Littman, 2005;Turney, 2006)) and for various NLP applications as features (e.g., named entity recognition (Miller et al., 2004) and semantic role labeling (Erk, 2007)). ...
Preprint
Full-text available
Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on a massive amount of plain text. While powerful, deriving supervision from textual resources is still an open question. For example, language model pretraining often neglects the rich, freely-available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision. We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks. Specifically, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve. We design an intermediate finetuning step that uses self-supervised training to promote models' ability in cross-task generalization. Then we describe methods to leverage the structures in Wikipedia and paraphrases. In particular, we propose training losses to exploit hyperlinks, article structures, and article category graphs for entity-, discourse-, entailment-related knowledge. We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations. We extend the framework for a novel generation task that controls the syntax of output text with a sentential exemplar. Lastly, we discuss our work on tailoring textual resources for establishing challenging evaluation tasks. We introduce three datasets by defining novel tasks using various fan-contributed websites, including a long-form data-to-text generation dataset, a screenplay summarization dataset, and a long-form story generation dataset. These datasets have unique characteristics offering challenges to future work in their respective task settings.
... Natural language processing is the first driving force for automatic reviewer assignment. With the invention of Latent Semantic Indexing (LSI) (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), Dumais and Nielsen (1992) explored the automatic reviewer assignment problem. In the literature, researchers adopted some other terms to refer to this problem, such as Paper-Reviewer Assignment (PRA) (Long, Wong, Peng, & Ye, 2013), Reviewer Assignment Problem (RAP) (Nguyen, Sánchez-Hernández, Agell, Rovira, & Angulo, 2018; Wang, Shi and Chen, 2010; Wang, Zhou, & Shi, 2013), Committee Review Assignment (CRA) (Karimzadehgan & Zhai, 2012) and Conference Paper Assignment Problem (CPAP) (Goldsmith & Sloan, 2007; Lian, Mattei, Noble, & Walsh, 2018). ...
Article
Assigning papers to suitable reviewers is of great significance to ensure the accuracy and fairness of peer review results. In the past three decades, many researchers have produced a wealth of achievements on the reviewer assignment problem (RAP). In this survey, we provide a comprehensive review of the primary research achievements on reviewer assignment algorithms from 1992 to 2022. Specifically, this survey first discusses the background and necessity of automatic reviewer assignment, and then systematically summarizes the existing research work from three aspects, i.e., construction of the candidate reviewer database, computation of the matching degree between reviewers and papers, and reviewer assignment optimization algorithms, with objective comments on the advantages and disadvantages of the current algorithms. Afterwards, the evaluation metrics and datasets of reviewer assignment algorithms are summarized. To conclude, we discuss potential research directions for RAP. Since there have been few comprehensive survey papers on reviewer assignment algorithms in the past ten years, this survey can serve as a valuable reference for related researchers and peer review organizers.
... Such relaxations bypass the fact that the basic problem is NP-hard. RPCA is proving useful and effective in many applications such as video surveillance [13], image and video processing [14], speech recognition [15], or latent semantic indexing [16], just to name a few. ...
Preprint
Full-text available
We propose a robust principal component analysis (RPCA) framework to recover low-rank and sparse matrices from temporal observations. We develop an online version of the batch temporal algorithm in order to process larger datasets or streaming data. We empirically compare the proposed approaches with different RPCA frameworks and show their effectiveness in practical situations.
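For context, the convex relaxation underlying most RPCA work, often called principal component pursuit, decomposes an observed matrix M into a low-rank part L and a sparse part S; a standard formulation in the RPCA literature (stated generically here, not taken from this preprint) is:

```latex
\min_{L,\,S} \; \lVert L \rVert_{*} + \lambda \lVert S \rVert_{1}
\quad \text{subject to} \quad L + S = M
```

where ||·||_* is the nuclear norm (sum of singular values), ||·||_1 the entrywise ℓ1 norm, and λ a trade-off weight, often set to 1/√max(m, n) for an m×n matrix.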
... The text representation is transformed into a numeric vector by a process called word2vec (Goldberg and Levy 2014; Rong 2014; Kenneth 2017) or doc2vec (Lau and Baldwin 2016; Kim et al. 2019), and once the vector is obtained, the process to follow is identical to that of qCBR. In many cases (see references Khrennikov et al. (2019), Bruza and Cole (2006), Lund and Burgess (1996), Deerwester et al. (1990)), QIR and NLP already predefine the classes to be analysed, e.g., Pop, Rock, etc. By predefining that each axis of the Hilbert space corresponds to a type, this process is similar to the qCBR but without the synthesiser's ability. ...
Article
Full-text available
Case-Based Reasoning (CBR) is an artificial intelligence approach to problem-solving with a good record of success. This article proposes using Quantum Computing to improve some of the key processes of CBR, such that a quantum case-based reasoning (qCBR) paradigm can be defined. The focus is set on designing and implementing a qCBR based on the variational principle that improves its classical counterpart in terms of average accuracy, scalability and tolerance to overlapping. A comparative study of the proposed qCBR with a classic CBR is performed for the case of the social workers’ problem as a sample of a combinatorial optimization problem with overlapping. The algorithm’s quantum feasibility is modelled with docplex and tested on IBMQ computers, and experimented on the Qibo framework.
... (e) Latent semantic analysis (LSA) similarity [2]: The sentence vectors are decomposed by singular value decomposition (SVD), and the similarity is then computed by the cosine similarity of the two matrices. ...
Conference Paper
Full-text available
In this paper, we describe our submission to the NTCIR-12 Short Text Conversation task. We consider short text conversation as a community Question-Answering problem, hence we solve this task in three steps: First, we retrieve a set of candidate posts from a pre-built indexing service. Second, these candidate posts are ranked according to their similarity with the original input post. Finally, we rank the comments to the top-ranked posts and output these comments as answers. Two ranking models and three comment selection strategies have been introduced to generate five runs. Among them, our best approach receives performances of mean nDCG@1 0.2767, mean P+ 0.4284 and mean nERR@10 0.4095.
... For the identification and labeling of concepts, the data are organized into a TDM (term-document matrix), and the quality of this matrix is then improved by applying latent semantic analysis. To apply this technique: first, the eigenvalues of the normalized TDM were extracted, this extraction being carried out via singular value decomposition (SVD) (Deerwester et al., 1990); second, the k eigenvalues greater than one were selected; third, the truncated version of the TDM was obtained, i.e., from the matrices resulting from the SVD, the k singular values that met the selection criterion, together with their corresponding columns, were kept. ...
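A rough sketch of the truncation criterion described in this excerpt, using a random stand-in for the term-document matrix and leaving out the normalization step, whose exact form is not specified here:

```python
# Keep only the SVD components whose singular value exceeds one, as in the
# selection criterion described above. The matrix is a random placeholder.
import numpy as np

rng = np.random.default_rng(0)
tdm = rng.random((50, 20))                    # hypothetical term-document matrix

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
k = int(np.sum(s > 1.0))                      # number of singular values greater than one
tdm_truncated = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(k, tdm_truncated.shape)                 # rank-k approximation of the TDM
```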
Chapter
Full-text available
The project “Reconstruyendo la historia desde adentro - Una experiencia significativa en la Ciudad de Pereira, Risaralda: San Isidro, Puerto Caldas” was an initiative of the research incubator “Familia, Educación y Comunidad”, attached to the research group “Educación y Desarrollo Humano”, carried out in 2019. It consisted of articulating three concepts considered fundamental to holistic education, namely family, education, and community, which in this case were framed within the reconstruction of the life history of the San Isidro community, a community that has been supported by various institutions that include community development among their areas of social responsibility.
... Then, we use singular value decomposition (SVD) [24] to reduce the dimensionality of the SPAM to 300 and obtain a SPAM with dimensions of 31,625×300, which contains the sentiment polarity information f_pol of the candidate words. ...
Article
Full-text available
Sentiment analysis is an important research area in natural language processing (NLP), and the performance of sentiment analysis models is largely influenced by the quality of sentiment lexicons. Existing sentiment lexicons contain only the sentiment information of words. In this paper, we propose an approach for automatically constructing a fine-grained sentiment lexicon that contains both emotion information and sentiment information to solve the problem that the emotion and sentiment of texts cannot be jointly analyzed. We design an emotion-sentiment transfer method and construct a fine-grained sentiment seed lexicon, and we then expand the sentiment seed lexicon by applying the graph dissemination method to the synonym set. Subsequently, we propose a multi-information fusion method based on neural network to expand the sentiment lexicon based on a corpus. Finally, we generate the Fine-Grained Sentiment Lexicon (FGSL), which contains 40,554 words. FGSL achieves F1 values of 61.97%, 69.58%, and 66.99% on three emotion datasets and 88.19%, 89.31%, and 86.88% on three sentiment datasets. Experimental results on multiple public benchmark datasets illustrate that FGSL achieves significantly better performance in both emotion analysis and sentiment analysis tasks.
... Learning representations of language. From nearly the earliest days of the field, natural language processing researchers observed that representations of words derived from distributional statistics in large text corpora serve as useful features for downstream tasks [7,10]. The earliest versions of these representation learning schemes focused on isolated word forms [25,26]. ...
Article
Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. We then examine how our framework may be used in environments without pre-collected expert data. To do this, we integrate an active data gathering procedure into pre-trained LMs. The agent iteratively learns by interacting with the environment, relabeling the language goal of past 'failed' experiences, and updating the policy in a self-supervised loop. The active data gathering procedure also enables effective combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and favorable weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans.
... They observed an improvement of 10% in the mean average precision value. Deerwester et al. [20] proposed the singular value decomposition-based Latent Semantic Indexing (LSI) technique to extract the most informative terms for query expansion. Singh and Sharan [21] proposed a fuzzy logic-based technique over the top-k ranked documents for query expansion tasks. ...
Article
Full-text available
Machine learning techniques have been widely used in almost every area of arts, science, and technology for the last two decades. Document analysis and query expansion also use machine learning techniques at a broad scale for information retrieval tasks. State-of-the-art models like the Bo1 model, Bo2 model, KL divergence model, and chi-square model are probabilistic, and they work on DFR-based retrieval models. These models focus heavily on term frequency and do not account for the semantic relationships among terms. The proposed model applies a semantic method to find the semantic similarity among terms to expand the query. The proposed method uses relevance feedback, which selects, with user assistance, the most relevant document from the top-k initially retrieved documents, and then applies a deep neural network technique to select the most informative terms related to the original query terms. The results are evaluated on the FIRE 2011 ad hoc English test collection. The mean average precision of the proposed method is 0.3568. The proposed method is also compared with the state-of-the-art models. The proposed model observed improvements of 19.77% and 8.05% in the mean average precision (MAP) parameter with respect to the original query and the Bo1 model, respectively.
... To further reduce the dimensionality of these matrices, we applied SVD/LSA (Deerwester et al., 1990; Landauer and Dumais, 1997) to them to transform syntactically constrained VSMs into their latent semantic space models. In their investigation of using LSA to find synonyms, Landauer and Dumais (1997) claimed that the optimal performance was subject to variation in the number of singular values or principal components. ...
Article
Full-text available
Recent advances on the Vector Space Model have significantly improved some NLP applications such as neural machine translation and natural language generation. Although word co-occurrences in context have been widely used in counting-/predicting-based distributional models, the role of syntactic dependencies in deriving distributional semantics has not yet been thoroughly investigated. By comparing various Vector Space Models in detecting synonyms in TOEFL, we systematically study the salience of syntactic dependencies in accounting for distributional similarity. We separate syntactic dependencies into different groups according to their various grammatical roles and then use context-counting to construct their corresponding raw and SVD-compressed matrices. Moreover, using the same training hyperparameters and corpora, we study typical neural embeddings in the evaluation. We further study the effectiveness of injecting human-compiled semantic knowledge into neural embeddings on computing distributional similarity. Our results show that the syntactically conditioned contexts can interpret lexical semantics better than the unconditioned ones, whereas retrofitting neural embeddings with semantic knowledge can significantly improve synonym detection.
... Text features range from traditional methods such as latent semantic analysis [12] to pre-trained word embeddings (e.g., Fasttext [6]) and deep learning approaches [2,15,42]. ...
Article
Full-text available
The Brazilian Supreme Court receives tens of thousands of cases each semester. Court employees spend thousands of hours to execute the initial analysis and classification of those cases—which takes effort away from posterior, more complex stages of the case management workflow. In this paper, we explore multimodal classification of documents from Brazil’s Supreme Court. We train and evaluate our methods on a novel multimodal dataset of 6510 lawsuits (339,478 pages) with manual annotation assigning each page to one of six classes. Each lawsuit is an ordered sequence of pages, which are stored both as an image and as a corresponding text extracted through optical character recognition. We first train two unimodal classifiers: A ResNet pre-trained on ImageNet is fine-tuned on the images, and a convolutional network with filters of multiple kernel sizes is trained from scratch on document texts. We use them as extractors of visual and textual features, which are then combined through our proposed fusion module. Our fusion module can handle missing textual or visual input by using learned embeddings for missing data. Moreover, we experiment with bidirectional long short-term memory (biLSTM) networks and linear-chain conditional random fields to model the sequential nature of the pages. The multimodal approaches outperform both textual and visual classifiers, especially when leveraging the sequential nature of the pages.
... LDA topic modeling is a technique that explores topics and themes hidden inside a set of corpora using a set of algorithms [5]. However, prior to the introduction of LDA, Latent Semantic Indexing (LSI) was proposed [10], which uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance. ...
Article
Background and aims The COVID-19 pandemic outbreak has created severe public health crises and economic consequences across the globe. This study used text analytics techniques to investigate the key concerns of Indian citizens raised in social media during the second wave of COVID-19. Methods In this study, we performed a sentiment and emotion analysis of tweets to understand the attitude of Indian citizens during the second wave of COVID-19. Moreover, we performed topic modeling to understand the significant issues and concerns related to COVID-19. Results Our results show that most social media posts had a neutral tone, and the percentage of posts showing positive sentiment was low. Furthermore, emotion analysis results show that ‘Fear’ and ‘Surprise’ were the prominent emotions expressed by the citizens. Topic modeling results reveal that ‘High crowd’ and ‘political rallies’ are the two primary topics of concern raised by Indian citizens during the second wave of COVID-19. Conclusions Hence, Indian government agencies should communicate crisis information and combating strategies to citizens more effectively in order to minimize fear and anxiety amongst the public.
... Applied to ranking, BERT could potentially build deep interactions between queries and documents that allow uncovering complex relevance patterns bringing us one step closer to the vision for future retrieval systems of Metzler et al. [17] in "Making Domain Experts out of Dilettantes". In contrast to this, another possible explanation could be that BERT lines up alongside other NLP techniques [4,18,21] exploiting the distributional properties of natural language [11] by merely learning simple term distributions. ...
Preprint
Even though term-based methods such as BM25 provide strong baselines in ranking, under certain conditions they are dominated by large pre-trained masked language models (MLMs) such as BERT. To date, the source of their effectiveness remains unclear. Is it their ability to truly understand the meaning through modeling syntactic aspects? We answer this by manipulating the input order and position information in a way that destroys the natural sequence order of query and passage, and show that the model still achieves comparable performance. Overall, our results highlight that syntactic aspects do not play a critical role in the effectiveness of re-ranking with BERT. We point to other mechanisms, such as query-passage cross-attention and richer embeddings that capture word meanings based on aggregated context regardless of the word order, as the main contributors to its superior performance.
... Image features range from fixed descriptors such as pixel density at different locations and scales [20] to approaches based on convolutional neural networks [13,8,1,30,17] such as VGG-16 [24] and MobileNetV2 [22]. Text features range from traditional methods such as latent semantic analysis [6] to pretrained word embeddings (e.g. Fasttext [2]) and deep learning approaches [8,1,30]. ...
Preprint
Full-text available
The Brazilian Supreme Court receives tens of thousands of cases each semester. Court employees spend thousands of hours to execute the initial analysis and classification of those cases -- which takes effort away from posterior, more complex stages of the case management workflow. In this paper, we explore multimodal classification of documents from Brazil's Supreme Court. We train and evaluate our methods on a novel multimodal dataset of 6,510 lawsuits (339,478 pages) with manual annotation assigning each page to one of six classes. Each lawsuit is an ordered sequence of pages, which are stored both as an image and as a corresponding text extracted through optical character recognition. We first train two unimodal classifiers: a ResNet pre-trained on ImageNet is fine-tuned on the images, and a convolutional network with filters of multiple kernel sizes is trained from scratch on document texts. We use them as extractors of visual and textual features, which are then combined through our proposed Fusion Module. Our Fusion Module can handle missing textual or visual input by using learned embeddings for missing data. Moreover, we experiment with bi-directional Long Short-Term Memory (biLSTM) networks and linear-chain conditional random fields to model the sequential nature of the pages. The multimodal approaches outperform both textual and visual classifiers, especially when leveraging the sequential nature of the pages.
... Cluster labels are selected from noun phrases and index terms following three different algorithms: Log-Likelihood Ratio (LLR) (C. Chen, 2014), Mutual Information (MI) (Zheng, 2019), and Latent Semantic Indexing (LSI) (Deerwester et al., 1990). The three algorithms use different methods to identify the cluster themes. ...
Article
Full-text available
Gamification, which refers to the use of game design elements in non-game contexts, provides similar experiences and motivations as games do; this makes gamification a useful approach to promote positive behaviors. As a useful tool for keeping users motivated, engaged and active, there is a wide interest in adopting gamification solutions for supporting and promoting positive behaviors and behavior change (e.g. quit smoking, ecological behaviors, food choices, civic engagement, mental healthcare, sustainability, etc.). In this study, we use the CiteSpace software to examine 984 publications and their 46,609 unique references on gamification applied for behavior change. The corpus of studies was downloaded from the Scopus database and refers to studies published between 2011 and the beginning of 2022. Several methods were used to analyze these data: (1) document co-citation analysis (DCA) was performed to identify the pivotal researches and the research areas; (2) author cocitation analysis (ACA) was performed to identify the main authors; (3) and keyword analysis was performed to detect the most influential keywords and their change over time. The results of the analysis provide an overview of the influential documents, authors and keywords that have given shape to the literature of the field, and how it has evolved, showing an initial interest in motivational and persuasion techniques, and in the gamification design, and subsequently in the development of more rigorous methodologies for both design and use. As the first scientometric review of gamification applied to behavior change, this study will be of interest to junior and senior researchers, graduate students, and professors seeking to identify research trends, topics, major publications, and influential scholars.
... The calculation efficiency is low. Landauer et al. [32] proposed the latent semantic analysis (LSA) method, whose basic idea is to reduce the dimension of the high-dimensional sparse matrix that represents the text by using singular value decomposition, so that the resulting data no longer suffer from high-dimensional sparsity and can better represent text information. Hofmann [33] proposed the PLSA model as an improvement on LSA; the model uses an expectation-maximization algorithm to calculate text topics. ...
Article
Full-text available
Legal judgment prediction (LJP) and decision support aim to enable machines to predict the verdict of legal cases after reading the description of facts, which is an application of artificial intelligence in the legal field. This paper proposes a legal judgment prediction model based on process supervision for the sequential dependence of each subtask in the legal judgment prediction task. Experimental results verify the effectiveness of the model framework and process monitoring mechanism adopted in this model. First, the convolutional neural network (CNN) algorithm was used to extract text features, and the principal component analysis (PCA) algorithm was used to reduce the dimension of data features. Next, the prediction model based on process supervision is proposed for the first time. When modeling the dependency relationship between sequential sub-data sets, process supervision is introduced to ensure the accuracy of the obtained dependency information, and genetic algorithm (GA) is introduced to optimize the parameters so as to improve the final prediction performance. Compared to our benchmark method, our algorithm achieved the best results on four different legal open data sets (CAIL2018_Small, CAIL2018_Large, CAIL2019_Small, and CAIL2019_Large). The realization of automatic prediction of legal judgment can not only assist judges, lawyers, and other professionals to make more efficient legal judgment but also provide legal aid for people who lack legal expertise.
... In fact, matching algorithms based on lexical overlap have great limitations because they ignore grammatical and structural features. Latent Semantic Analysis (LSA) [22], which became popular in the 1990s, opened up a new direction. By mapping utterances to a low-dimensional continuous space of equal length, similarity computation can be performed in this implicit latent semantic space. ...
Preprint
Despite continuous efforts to improve both the effectiveness and efficiency of code search, two issues remain unsolved. First, programming languages have inherent strong structural linkages, and mining features from code as plain text omits the structural information contained inside it. Second, there is a potential semantic relationship between code and query, and it is challenging to align code and text across sequences so that vectors are spatially consistent during similarity matching. To tackle both issues, in this paper, a code search model named CSSAM (Code Semantics and Structures Attention Matching) is proposed. By introducing semantic and structural matching mechanisms, CSSAM effectively extracts and fuses multidimensional code features. Specifically, the cross and residual layer was developed to facilitate high-latitude spatial alignment of code and query at the token level. By leveraging the residual interaction, a matching module is designed to preserve more code semantics and descriptive features, which enhances the adhesion between the code and its corresponding query text. Besides, to improve the model's comprehension of the code's inherent structure, a code representation structure named CSRG (Code Semantic Representation Graph) is proposed for jointly representing abstract syntax tree nodes and the data flow of the code. According to the experimental results on two publicly available datasets containing 540k and 330k code segments, CSSAM significantly outperforms the baselines in terms of achieving the highest SR@1/5/10, MRR, and NDCG@50 on both datasets. Moreover, an ablation study is conducted to quantitatively measure the impact of each key component of CSSAM on the efficiency and effectiveness of code search, which offers insights into the improvement of advanced code search solutions.
... Deerwester et al. believed that a text generally contains several topics [15], and that the similarity of text semantics can be approximated as the similarity of topics. A dimensionality-reduced latent semantic space was constructed and applied to text classification tasks by performing singular value decomposition on the vocabulary-text matrix. ...
Article
Full-text available
Due to the rapidly growing volume of data on the Internet, methods for efficiently and accurately processing massive amounts of text information have been a focus of research. In natural language processing theory, sentence embedding representation is an important method. This paper proposes a new sentence embedding learning model called BRFP (Factorization Process with Bidirectional Restraints) that fuses syntactic information: it uses matrix decomposition to learn syntactic information and combines it with word vectors to obtain the embedded representation of sentences. In the experimental chapter, text similarity experiments are conducted to verify the rationality and effectiveness of the model, the experimental results on Chinese and English texts are analyzed against current mainstream learning methods, and potential improvement directions are summarized. The experimental results on Chinese and English datasets, including STS, AFQMC, and LCQMC, show that the model proposed in this paper outperforms the CNN method in terms of accuracy and F1 value by 7.6% and 4.8. The comparison experiment with the word vector weighted model shows that when the sentence length is longer, or the corresponding syntactic structure is complex, the advantages of the model proposed in this paper are more prominent than those of the TF-IDF and SIF methods. Compared with the TF-IDF method, the effect improved by 14.4%. Compared with the SIF method, it has a maximum advantage of 7.9%, and the overall improvement in each comparative experimental task is between 4 and 6 percentage points. In the neural network model comparison experiment, the model in this paper is compared with the CNN, RNN, LSTM, ST, QT, and InferSent models, and the effect significantly improved on the 14’OnWN, 14’Tweet-news, and 15’Ans.-forum datasets. For example, on the 14’OnWN dataset, the BRFP method shows a 10.9% improvement over the ST method, on the 14’Tweet-news dataset a 22.9% advantage over the LSTM method, and on the 15’Ans.-forum dataset a 24.07% improvement over the RNN method. The article also demonstrates the generality of the model, showing that the model proposed in this paper is also a universal learning framework.
... They focus on the implicit relations intertwined between the words and are therefore able to find and present the semantic concepts that existed in the documents, often referred to as conceptual models. Latent semantic indexing (LSI) and probabilistic latent semantic indexing (pLSI) are classic conceptual models [12,23], which can partially capture the information of synonyms and polysemes but have difficulties in interpreting the output and parameters respectively [44]. Latent Dirichlet allocation (LDA), as a probability-generating conceptual model, has better interpretability by introducing the Dirichlet prior [6]. ...
Article
Full-text available
Dynamic topic analysis can examine data from different perspectives and thoroughly reveal how data with different degrees of correlation are distributed. It is a challenge to perform dynamic topic analysis on domain text data due to the smaller semantic differences among subtopics. This paper proposes a method of dynamically constructing a topic hierarchy, which uses formal concept analysis (FCA)-based information retrieval (IR) as the technical basis and sememes as the semantic basis to perform hierarchical processing from fine-grained to coarse-grained on Chinese domain text data according to the topics of the user’s query. It can meet the user’s need for different scales of query results, and realize multi-angle inspection of the whole dataset and high-precision retrieval for the query. Taking sememes as formal attributes reduces the size of the concept lattice and expands the application of FCA technology to large-scale text data. The sememe-based word meaning identification (WMI) algorithm and the semantic similarity measurement method for long text enable the topic hierarchy to be fine-grained, and the coarse-and-fine filtering strategy renders the FCA-based method more efficient. Experimental results based on the open dataset show that the proposed method is an efficient and flexible topic-based hierarchical approach.
... The goal is to segment a set of documents on the basis of structural similarities between them. Commonly used topic modeling techniques include Latent Semantic Analysis (LSA) [7], Probabilistic Latent Semantic Analysis (PLSA) [10], Latent Dirichlet Allocation (LDA) [5], and Non-negative Matrix Factorization (NMF) [25]. ...
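As a rough, generic illustration of the NMF variant listed in this excerpt (not the SeNMFk-SPLIT method of the preprint below; the corpus and settings are made up), scikit-learn can factor a TF-IDF matrix into non-negative document-topic and topic-term factors:

```python
# Generic NMF topic-modeling sketch (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "matrix factorization extracts latent topics",
    "topics summarize large document collections",
    "word co-occurrence carries semantic structure",
]

X = TfidfVectorizer().fit_transform(docs)      # documents x terms TF-IDF matrix
nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(X)                       # document-topic weights
H = nmf.components_                            # topic-term weights
print(W.shape, H.shape)
```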
Preprint
As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.
... One of the most popular methods in the QA domain is query expansion. Query expansion tries to augment a search query with other terms such as relevant synonyms or semantically related terms [5,20]. This process is widely used in information retrieval systems to improve the results and increase recall [28]. ...
Chapter
This research work presents a new augmentation model for knowledge graphs (KGs) that increases the accuracy of knowledge graph question answering (KGQA) systems. Currently, large KGs can represent millions of facts. However, the many nuances of human language mean that the answer to a given question cannot always be found, or that the results found are not always correct. Frequently, this problem occurs because the way the question is formulated does not fit the information represented in the KG. Therefore, KGQA systems need to be improved to address this problem. We present a suite of augmentation techniques so that a wide variety of KGs can be automatically augmented, thus increasing the chances of finding the correct answer to a question. The first results from an extensive empirical study seem promising.
... In order to enhance the capabilities of IRQA systems and to avoid the vocabulary mismatch problem in the extractive QA task [125], several techniques have been proposed. For example, instead of using sparse word-count vectors for edit-distance or exact matching between the word sequences of the query and the answer, dense embedding vectors were introduced to find the semantic similarity between the candidate answer and the query [126,127]. ...
Preprint
Full-text available
Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP). One reason for this is that a QA system allows humans to interact more naturally with a machine, e.g., via a virtual assistant or search engine. In the last decades, many QA systems have been proposed to address the requirements of different question-answering tasks. Furthermore, many error scores have been introduced, e.g., based on n-gram matching, word embeddings, or contextual embeddings, to measure the performance of a QA system. This survey attempts to provide a systematic overview of the general framework of QA, QA paradigms, benchmark datasets, and assessment techniques for a quantitative evaluation of QA systems. The latter is particularly important because not only the construction of a QA system but also its evaluation is complex. We hypothesize that one reason for this is that the quantitative formalization of human judgment is an open problem.
... When the system receives a query, the sets corresponding to the query words are selected and then used to choose the relevant documents. Readers interested in the mathematical foundations of the LSA model are invited to consult the following article [41]. ...
Thesis
Because of their great potential for improving safety, comfort, productivity, and energy savings, connected environments have become ubiquitous in our daily lives; they have had an impact on various sectors, such as hospitals, shopping centers, farms, and vehicles. To further improve the quality of life in these environments, many applications offering services based on the exploitation of data collected by sensors have emerged. Event detection is one of these services (for example, detecting a fire, detecting a stroke in patients, detecting air pollution). Generally, when an event is triggered in a connected environment, the natural reaction of the person in charge is to try to understand what happened and why the event was triggered. To find answers to these questions, the traditional approach is to manually query the different data sources (the sensor-network information system and the document-corpus information system), which can be very tedious, very time-consuming, and requires an enormous compilation effort. This thesis focuses on explaining events detected in connected environments, and more precisely those occurring in environments with heterogeneous information systems (IS) (document IS and sensor-network IS). We propose a framework called ISEE (Information System for Event Explanation). ISEE is based on: (i) a model for defining events in hybrid environments, which allows users to define the events they wish to detect along different description axes (documents and sensor networks); (ii) a process for the targeted interconnection of the ..... , whose role is to exploit the data resulting from events (definition and triggering) to build context-aware connections (event explanations); these connections serve to bring the different information sources closer together and to guide the explanation-search process; (iii) a model inspired by the 5W1H technique (what, who, when, where, how, why) to structure explanations in a simple, intuitive way that is easy to understand for any type of user. We propose a generic solution that can be applied in different business application domains. Nevertheless, three experiments were conducted to validate this proposal in the context of a large research building.
... ensures that their columns are orthonormal. Note that each column of U^(i), V^(i) represents a semantic concept, i.e., user taste (Nathanson, Bitton, and Goldberg 2007) in collaborative filtering or document theme (Deerwester et al. 1990) in information retrieval. Those columns are the principle coordinates in the low-dimensional space, and for this reason, we call our approach the coordinate system transfer. ...
Article
Data sparsity is a major problem for collaborative filtering (CF) techniques in recommender systems, especially for new users and items. We observe that, while our target data are sparse for CF systems, related and relatively dense auxiliary data may already exist in some other more mature application domains. In this paper, we address the data sparsity problem in a target domain by transferring knowledge about both users and items from auxiliary data sources. We observe that in different domains the user feedbacks are often heterogeneous such as ratings vs. clicks. Our solution is to integrate both user and item knowledge in auxiliary data sources through a principled matrix-based transfer learning framework that takes into account the data heterogeneity. In particular, we discover the principle coordinates of both users and items in the auxiliary data matrices, and transfer them to the target domain in order to reduce the effect of data sparsity. We describe our method, which is known as coordinate system transfer or CST, and demonstrate its effectiveness in alleviating the data sparsity problem in collaborative filtering. We show that our proposed method can significantly outperform several state-of-the-art solutions for this problem.
... Matrix factorization is an important technique in machine learning which has proven to be effective for collaborative filtering (Koren 2008), information retrieval (Deerwester et al. 1990), image analysis (Lee and Seung 1999), and many other areas. A drawback of standard matrix factorization algorithms is that they are susceptible to overfitting on the training data and require careful tuning of the regularization parameters and the number of optimization steps. ...
Article
Matrix factorization is a fundamental technique in machine learning that is applicable to collaborative filtering, information retrieval and many other areas. In collaborative filtering and many other tasks, the objective is to fill in missing elements of a sparse data matrix. One of the biggest challenges in this case is filling in a column or row of the matrix with very few observations. In this paper we introduce a Bayesian matrix factorization model that performs regression against side information known about the data in addition to the observations. The side information helps by adding observed entries to the factored matrices. We also introduce a nonparametric mixture model for the prior of the rows and columns of the factored matrices that gives a different regularization for each latent class. Besides providing a richer prior, the posterior distribution of mixture assignments reveals the latent classes. Using Gibbs sampling for inference, we apply our model to the Netflix Prize problem of predicting movie ratings given an incomplete user-movie ratings matrix. Incorporating rating information with gathered metadata information, our Bayesian approach outperforms other matrix factorization techniques even when using fewer dimensions.
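The citation context of this entry points out that standard matrix factorization is sensitive to the regularization weight and the number of optimization steps. The sketch below is a generic regularized factorization trained by SGD on a partially observed matrix, included only to make that point concrete; it is not the Bayesian model with side information described in the abstract, and all sizes and hyper-parameters are assumptions.

```python
# Generic regularized matrix factorization on a sparse (partially observed)
# matrix, trained with plain SGD. Performance depends on hand-tuned values of
# the regularization weight `lam`, the learning rate, and the epoch count --
# exactly the tuning burden the citation context mentions.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 100, 80, 5
P_true = rng.normal(size=(n_users, k))
Q_true = rng.normal(size=(n_items, k))
R = P_true @ Q_true.T                          # ground-truth ratings
mask = rng.random((n_users, n_items)) < 0.1    # only ~10% of entries observed
obs = np.argwhere(mask)

P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_items, k))
lr, lam = 0.01, 0.05                           # learning rate, regularization weight

for epoch in range(30):
    rng.shuffle(obs)                           # shuffle observed (user, item) pairs
    for u, i in obs:
        pu = P[u].copy()
        err = R[u, i] - pu @ Q[i]
        P[u] += lr * (err * Q[i] - lam * pu)
        Q[i] += lr * (err * pu - lam * Q[i])

rmse = np.sqrt(np.mean((R[mask] - (P @ Q.T)[mask]) ** 2))
print(f"training RMSE on observed entries: {rmse:.3f}")
```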
... Matrix/tensor decomposition approaches serve as both data compression and unsupervised learning techniques. They have been successfully applied in a broad range of artificial intelligence and machine learning applications, including document analysis (Deerwester et al. 1990), bioinformatics (Homayouni et al. 2005), computer vision (Lathauwer, Moor, and Vandewalle 2000; Ding, Huang, and Luo 2008; Ye 2004), inference under uncertainty (Wood and Griffiths 2006), and approximate reasoning (Smets 2002). Many other applications are reviewed by Acar and Yener (2008) and Kolda and Bader (2008). ...
Article
A central challenge for many machine learning and data mining applications is that the number of data points and features is very large, so that low-rank approximations of the original data are often required for efficient computation. We propose new multi-level, clustering-based low-rank matrix approximations which are comparable to, and even more compact than, the Singular Value Decomposition (SVD). We utilize the cluster indicators of data clustering results to form the subspaces; hence our decomposition results are more interpretable. We further generalize our clustering-based matrix decompositions to tensor decompositions that are useful in high-order data analysis. We also provide an upper bound for the approximation error of our tensor decomposition algorithm. In all experimental results, our methods significantly outperform traditional decomposition methods such as SVD and high-order SVD.
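For reference, the baseline these clustering-based decompositions are compared against is the rank-k truncated SVD, which by the Eckart-Young theorem is the best rank-k approximation in Frobenius norm. The sketch below only computes that baseline and its reconstruction error on a random matrix; it is not an implementation of the clustering-based method in the abstract.

```python
# Rank-k truncated SVD baseline and its relative reconstruction error.
# (The clustering-based decompositions above trade some of this optimality
# for interpretability and compactness of the factors.)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 120))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
for k in (5, 20, 50):
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]       # best rank-k reconstruction
    rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
    print(f"rank {k:3d}: relative Frobenius error = {rel_err:.3f}")
```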
... To identify neuroadaptations that persist after chronic drug exposure during the withdrawal stage, we collected amygdala tissues after 4 weeks of abstinence from cocaine IVSA (Fig. 1a). We purified nuclei and measured the gene expression and open chromatin profiles of individual nuclei by performing snRNA-seq and snATAC-seq with the 10X Genomics Chromium workflow (see Methods). We performed these experiments on high and low AI rats, as well as naive rats (never exposed to cocaine). ...
Preprint
Full-text available
The amygdala contributes to negative emotional states associated with relapse to drug seeking, but the cell type-specific gene regulatory programs that are involved in addiction are unknown. Here we generate an atlas of single nucleus gene expression and chromatin accessibility in the amygdala of outbred rats with low and high cocaine addiction-like behaviors following a prolonged period of abstinence. Between rats with different addiction indexes, there are thousands of cell type-specific differentially expressed genes and these are enriched for molecular pathways including GABAergic synapse in astrocytes, excitatory, and somatostatin neurons. We find that rats with higher addiction severity have excessive GABAergic inhibition in the amygdala, and that hyperpolarizing GABAergic transmission and relapse-like behavior are reversed by pharmacological manipulation of the metabolite methylglyoxal, a GABAA receptor agonist. By analyzing chromatin accessibility, we identify thousands of cell type-specific chromatin sites and transcription factor (TF) motifs where accessibility is associated with addiction-like behaviors, most notably at motifs for pioneer TFs in the FOX, SOX, and helix-loop-helix families.
... Cluster labels are selected from noun phrases and index terms following three different algorithms: Log-Likelihood Ratio (LLR) (C. Chen, 2014), Mutual Information (MI) (Zheng, 2019), and Latent Semantic Indexing (LSI) (Deerwester et al., 1990). The three algorithms use different methods to identify the cluster themes. ...
Preprint
Gamification, which refers to the use of game design elements in non-game contexts, provides experiences and motivations similar to those of games; this makes gamification a useful approach to promote positive behaviors. As a useful tool for keeping users motivated, engaged and active, there is wide interest in adopting gamification solutions for supporting and promoting positive behaviors and behavior change (e.g. quitting smoking, ecological behaviors, food choices, civic engagement, mental healthcare, sustainability). In this study, we use the CiteSpace software to examine 984 publications and their 46,609 unique references on gamification applied to behavior change. The corpus of studies was downloaded from the Scopus database and refers to studies published between 2011 and the beginning of 2022. Several methods were used to analyze these data: (1) document co-citation analysis (DCA) was performed to identify the pivotal studies and research areas; (2) author co-citation analysis (ACA) was performed to identify the main authors; (3) and keyword analysis was performed to detect the most influential keywords and their change over time. The results of the analysis provide an overview of the influential documents, authors and keywords that have given shape to the literature of the field, and of how it has evolved, showing an initial interest in motivational and persuasion techniques and in gamification design, and subsequently in the development of more rigorous methodologies for both design and use. As the first scientometric review of gamification applied to behavior change, this study will be of interest to junior and senior researchers, graduate students, and professors seeking to identify research trends, topics, major publications, and influential scholars.
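The citation context of this entry mentions log-likelihood ratio (LLR) scoring as one of the three cluster-labeling strategies. Below is a hedged sketch of Dunning-style LLR scoring of candidate label terms for one cluster against a background corpus; the toy documents are invented, and CiteSpace's exact scoring may differ in its details.

```python
# Dunning's log-likelihood ratio (G^2) for scoring how strongly a term is
# associated with a cluster relative to the rest of the corpus. High-scoring
# terms are candidate cluster labels.
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table (0 * log 0 treated as 0)."""
    n = k11 + k12 + k21 + k22
    def term(k, row, col):
        expected = row * col / n
        return k * math.log(k / expected) if k > 0 else 0.0
    return 2.0 * (term(k11, k11 + k12, k11 + k21)
                  + term(k12, k11 + k12, k12 + k22)
                  + term(k21, k21 + k22, k11 + k21)
                  + term(k22, k21 + k22, k12 + k22))

cluster_docs = [["game", "points", "badges"], ["game", "reward", "points"]]
other_docs = [["survey", "health"], ["health", "policy", "game"]]

in_c = Counter(w for d in cluster_docs for w in d)
out_c = Counter(w for d in other_docs for w in d)
N_in, N_out = sum(in_c.values()), sum(out_c.values())

scores = {t: llr(in_c[t], N_in - in_c[t], out_c[t], N_out - out_c[t]) for t in in_c}
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])   # top candidate labels
```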
... Then, we consider every dimension in the lower-rank matrix as a latent topic. A straightforward application of this principle is the Latent Semantic Indexing model (LSI) [50], which uses Singular Value Decomposition to approximate the term-document matrix (potentially mediated by TF-IDF) by one with fewer rows, each representing a latent semantic dimension in the data, while preserving the similarity structure among columns (documents). Non-negative Matrix Factorisation (NMF) [153] exploits the fact that the term-document matrix is non-negative, thus not only producing a denser representation of the term-document distribution through the matrix factorisation but also guaranteeing that the membership of a document in each topic is represented by a positive coefficient. ...
Thesis
Whether on TV or on the internet, video content production is seeing an unprecedented rise. Not only is video the dominant medium for entertainment purposes, but it is also reckoned to be the future of education, information and leisure. Nevertheless, the traditional paradigm for multimedia management proves incapable of keeping pace with the scale brought about by the sheer volume of content created every day across disparate distribution channels. Thus, routine tasks like archiving, editing, content organization and retrieval by multimedia creators become prohibitively costly. On the user side, too, the amount of multimedia content produced daily can be simply overwhelming; the need for shorter and more personalized content has never been more pronounced. To advance the state of the art on both fronts, a certain level of multimedia understanding has to be achieved by our computers. In this research thesis, we aim to address the multiple challenges facing automatic media content processing and analysis, mainly gearing our exploration to three axes: 1. Representing multimedia: With all its richness and variety, modeling and representing multimedia content can be a challenge in itself. 2. Describing multimedia: The textual component of multimedia can be capitalized on to generate high-level descriptors, or annotations, for the content at hand. 3. Summarizing multimedia: We investigate the possibility of extracting highlights from media content, both for narrative-focused summarization and for maximising memorability.
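The citation context of this entry contrasts LSI (truncated SVD of a possibly TF-IDF-weighted term-document matrix) with NMF (non-negative factors, hence non-negative topic memberships). The sketch below shows both with scikit-learn; the corpus, number of topics, and settings are illustrative assumptions only.

```python
# LSI (TruncatedSVD on TF-IDF) vs NMF on a toy corpus: LSI factors may contain
# negative values, while NMF memberships are guaranteed non-negative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # documents x terms

lsi = TruncatedSVD(n_components=2, random_state=0)
doc_topics_lsi = lsi.fit_transform(X)          # may contain negative coefficients

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
doc_topics_nmf = nmf.fit_transform(X)          # non-negative topic memberships

print(doc_topics_lsi.round(2))
print(doc_topics_nmf.round(2))
```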
... Topic modelling is one of the most extensively used methods in natural language processing for finding relationships across text documents, discovering and clustering topics, and extracting semantic meaning from a corpus of unstructured text. Early approaches include Latent Semantic Analysis (Deerwester et al. 1990) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann 1999) for extracting semantic topic clusters from a corpus of data. In the last decade, Latent Dirichlet Allocation (LDA) has become a successful and standard technique for inferring topic clusters from texts for various applications such as opinion mining (Zhai et al. 2011), social media analysis (Cohen and Ruths 2013) and event detection (Lin et al. 2010); consequently, various variants of LDA have also been developed (Blei and McAuliffe 2010; ...). ...
Article
Full-text available
Transcending the binary categorization of racist texts, our study takes cues from social science theories to develop a multidimensional model for racism detection, namely stigmatization, offensiveness, blame, and exclusion. With the aid of BERT and topic modelling, this categorical detection enables insights into the underlying subtlety of racist discussion on digital platforms during COVID-19. Our study contributes to enriching the scholarly discussion on deviant racist behaviours on social media. First, a stage-wise analysis is applied to capture the dynamics of the topic changes across the early stages of COVID-19 which transformed from a domestic epidemic to an international public health emergency and later to a global pandemic. Furthermore, mapping this trend enables a more accurate prediction of public opinion evolvement concerning racism in the offline world, and meanwhile, the enactment of specified intervention strategies to combat the upsurge of racism during the global public health crisis like COVID-19. In addition, this interdisciplinary research also points out a direction for future studies on social network analysis and mining. Integration of social science perspectives into the development of computational methods provides insights into more accurate data detection and analytics.
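The citation context of this entry describes LDA as the standard technique for inferring topic clusters from text. A minimal LDA sketch (scikit-learn's online variational implementation) is shown below; the corpus, number of topics, and vectorizer settings are illustrative, not the study's actual setup.

```python
# Minimal LDA pipeline: count vectorization, topic inference, and inspection of
# the top terms per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "vaccine rollout and public health policy",
    "hospital capacity during the pandemic",
    "online hate speech on social platforms",
    "moderation of abusive posts on social media",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)               # per-document topic proportions

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[::-1][:4]             # highest-weighted terms per topic
    print(f"topic {k}: {[terms[i] for i in top]}")
```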
... (2) GloVe [22]: This model combines the merits of LSA [23] and Word2vec. It uses the co-occurrence matrix to model local and global semantic information at the same time. ...
Article
With the development of IoT and 5G technologies, more and more online resources are presented in multimodal data forms over the Internet. Hence, effectively processing multimodal information is significant to the development of various online applications, including e-learning and digital health, to name just a few. However, most AI-driven systems or models can only handle limited forms of information. In this study, we investigate the correlation between natural language processing (NLP) and pattern recognition, trying to apply the mainstream approaches and models used in computer vision (CV) to NLP tasks. Based on two different Twitter datasets, we propose a convolutional neural network based model to interpret the content of short texts with different goals and application backgrounds. The experiments demonstrate that our proposed model shows fairly competitive performance compared to mainstream recurrent neural network based NLP models such as bidirectional long short-term memory (Bi-LSTM) and bidirectional gated recurrent unit (Bi-GRU). Moreover, the experimental results also demonstrate that the proposed model can precisely locate the key information in the given text.
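The citation context of this entry notes that GloVe is trained from a word-word co-occurrence matrix. The sketch below only builds that matrix with the distance-weighted counting GloVe uses; it does not fit the GloVe weighted least-squares objective itself, and the toy corpus and window size are assumptions.

```python
# Build a distance-weighted word-word co-occurrence table (the input statistics
# GloVe is trained on). Co-occurrences at distance d contribute 1/d, as in the
# original GloVe counting scheme.
from collections import defaultdict

corpus = ["the quick brown fox jumps over the lazy dog".split(),
          "the quick dog runs over the lazy fox".split()]
window = 2

cooc = defaultdict(float)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[(w, sent[j])] += 1.0 / abs(i - j)

print(cooc[("quick", "brown")], cooc[("lazy", "dog")])
```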
... In the second step, we perform feature extraction on the gathered data. In our previous work [20], we conducted a preliminary study using Latent Dirichlet Allocation (LDA) [21] and Latent Semantic Indexing (LSI) [22] on abstracts versus full-text articles and the information they contain. The topic modeling methods were complemented with a Correlated Topic Model (CTM) [23] approach, and the coherence [24] and u-mass [25] of all three algorithms were measured. ...
Conference Paper
Systematic reviews play an essential role in various disciplines. Particularly, in biomedical sciences, systematic reviews according to a predefined schema and protocol are how related literature is analyzed. Although a protocol-based systematic review is replicable and provides the required information to reproduce each step and refine them, such a systematic review is time-consuming and may get complex. To face this challenge, automatic methods can be applied that support researchers in their systematic analysis process. The combination of artificial intelligence for automatic information extraction from scientific literature with interactive visualizations as a Visual Analytics system can lead to sophisticated analysis and protocoling of the review process. We introduce in this paper a novel Visual Analytics approach and system that enables researchers to visually search and explore scientific publications and generate a protocol based on the PRISMA protocol and the PRISMA statement.
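The citation context of this entry mentions comparing topic models via coherence and u-mass scores. The sketch below shows one common way such a UMass coherence score can be computed with gensim; the toy corpus and model settings are placeholders, not the authors' experimental setup.

```python
# Fit a small LDA model with gensim and compute its UMass topic coherence.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["systematic", "review", "protocol", "screening"],
         ["abstract", "screening", "inclusion", "criteria"],
         ["full", "text", "articles", "protocol"],
         ["topic", "model", "coherence", "evaluation"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
print("UMass coherence:", cm.get_coherence())
```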
... It is a fully automated statistical method for extracting relationships between words based on their contexts of use in documents or sentences. LSA is an unsupervised learning technique (mathematical details: [13]). Fundamentally, the model is realized through four main steps: (1) term-document matrix calculation, also known as the bag-of-words model; (2) transformation of the term-document matrix; (3) dimension reduction [14]; and (4) retrieval in the reduced space. ...
Article
Full-text available
With the recent advances in deep learning, different approaches to improving pre-trained language models (PLMs) have been proposed. PLMs have advanced state-of-the-art (SOTA) performance on various natural language processing (NLP) tasks such as machine translation, text classification, question answering, text summarization, information retrieval, recommendation systems, named entity recognition, etc. In this paper, we provide a comprehensive review of prior embedding models as well as current breakthroughs in the field of PLMs. Then, we analyse and contrast the various models and provide an analysis of the way they have been built (number of parameters, compression techniques, etc.). Finally, we discuss the major issues and future directions for each of the main points.
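The citation context of this entry lists the four LSA steps: building the term-document count matrix, weighting it, reducing its dimensionality, and retrieving in the reduced space. The sketch below walks through those four steps on a toy corpus, including folding a query into the reduced space; the corpus and rank are illustrative.

```python
# Four-step LSA sketch: (1) bag-of-words counts, (2) TF-IDF weighting,
# (3) truncated SVD for dimension reduction, (4) retrieval by comparing a
# folded-in query with documents in the reduced space.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval with latent semantic analysis",
        "semantic indexing of text documents",
        "deep learning for image recognition",
        "convolutional networks recognize images"]

vec = CountVectorizer()                              # (1) term-document counts
counts = vec.fit_transform(docs)
tfidf = TfidfTransformer()                           # (2) weighting
X = tfidf.fit_transform(counts)
svd = TruncatedSVD(n_components=2, random_state=0)   # (3) dimension reduction
doc_vecs = svd.fit_transform(X)

query = ["semantic analysis of documents"]
q_vec = svd.transform(tfidf.transform(vec.transform(query)))   # (4) fold query in
print(cosine_similarity(q_vec, doc_vecs).ravel())    # similarity to each document
```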
... After pre-processing the tags, we have partially dealt with synonymy and polysemy, the two issues often arising in semantic analysis (Deerwester et al., 1990). However, there still exist related tags. ...
Preprint
We consider the noir classification problem by exploring noir attributes and what films are likely to be regarded as noirish from the perspective of a wide Internet audience. We use a dataset consisting of more than 30,000 films with relevant tags added by users of MovieLens, a web-based recommendation system. Based on this data, we develop a statistical model to identify films with noir characteristics using these free-form tags. After retrieving information for describing films from tags, we implement a one-class nearest neighbors algorithm to recognize noirish films by learning from IMDb-labeled noirs. Our analysis evidences film noirs' close relationship with German Expressionism, French Poetic Realism, British thrillers, and American pre-code crime pictures, revealing the similarities and differences between neo noirs after 1960 and noirs in the classic period.
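The abstract above describes a one-class nearest-neighbours rule learned only from IMDb-labelled noirs. The sketch below illustrates the general shape of such a rule on synthetic tag vectors; the features, the distance threshold, and the data are toy assumptions and not the paper's actual model.

```python
# Simplified one-class nearest-neighbour decision rule: a candidate film is
# flagged as "noirish" if its tag vector lies within a distance threshold of
# some labelled noir.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
noir_train = rng.normal(loc=1.0, scale=0.3, size=(50, 8))     # tag vectors of labelled noirs
candidates = np.vstack([rng.normal(loc=1.0, scale=0.3, size=(5, 8)),    # noir-like
                        rng.normal(loc=-1.0, scale=0.3, size=(5, 8))])  # clearly not

nn = NearestNeighbors(n_neighbors=1).fit(noir_train)
dist, _ = nn.kneighbors(candidates)
threshold = 1.5                                               # assumed cut-off
print((dist.ravel() <= threshold).astype(int))                # 1 = predicted noirish
```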
... Token counts are generally represented in a lower-dimensional space, so that a word can be represented by a set of dimensions and words may be meaningfully compared along those dimensions. The past few years have seen a strong effort in developing methods for modeling language for text representation, growing from classic approaches that model document co-occurrences (Blei et al., 2003; Deerwester et al., 1990), to approaches that learn from a word in its immediate context (Mikolov et al., 2013; Pennington et al., 2014), to models that incorporate the contextual sequences of words (Devlin et al., 2018; Vaswani et al., 2017). All of these approaches seek to model language relationships, so that similar texts can be understood as such, even when they use different language, as with Robinson Crusoe in Figure 3. ...
Technical Report
Full-text available
A new class of massive-scale scanned-book digital libraries (DLs) preserves and mediates access to millions of individual books, a glimpse inside an unprecedented portion of the published record. Such digital libraries present tremendous value to the study of history, language, culture. Corpus-wide learning may even improve the archive itself, lending insights about books and book metadata by situating them within the broader archive. Unfortunately, repeating or duplicated text presents challenges to learning from digital libraries as a whole. Repeating text may bias language models that seek to accurately learn from the collection. However, variant copies of texts make it non-trivial to account for text duplication. The Similarities and Duplicates in Digital Libraries project (SaDDL) addresses the challenge of duplication. Specifically, it identified near-duplicate and similar-work relationships in the context of the 17 million-volume HathiTrust Digital Library, employing machine learning over book context distributed by the HathiTrust Research Center. SaDDL’s general tagging of different relationships offers information that is often not in library cataloging metadata and can be used to enrich it. This report outlines the SaDDL project.
... Early approaches to vector representations for words created word-document co-occurrence matrices (Schütze & Pedersen, 1997). These vectors have high dimensionality and can be very sparse, an issue addressed by Latent Semantic Analysis (LSA) (Deerwester et al., 1990), which performs singular value decomposition on the word-document matrix. Finch & Chater (1992) and Gauch, Wang & Rachakonda (1999) developed word embeddings based on word-word co-occurrences. ...
Article
Full-text available
Word vector representations open up new opportunities to extract useful information from unstructured text. Defining a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it. Word vector representations have been used in many applications such as word synonym detection, word analogy, syntactic parsing, and many others. GloVe, based on word contexts and matrix vectorization, is an effective vector-learning algorithm. It improves on previous vector-learning algorithms. However, the GloVe model fails to explicitly consider the order in which words appear within their contexts. In this paper, multiple methods of incorporating word order into GloVe word embeddings are proposed. Experimental results show that our Word Order Vector (WOVe) word embeddings approach outperforms unmodified GloVe on the natural language tasks of analogy completion and word similarity. WOVe with direct concatenation slightly outperformed GloVe on the word similarity task, increasing the average rank by 2%. However, it greatly improved on the GloVe baseline on a word analogy task, achieving an average 36.34% improvement in accuracy.
Book
Full-text available
Big Data Analytics for Large Scale Wireless Body Area Networks: Challenges and Applications (page 423)
Thesis
Engineering design optimization is an emerging technology whose application both tends to shorten design-cycle time and finds new designs that are not only feasible, but also nearer to optimum, based on specified design criteria. Its gain in attention in the field of complex designs is fuelled by advancing computing power, which now allows increasingly accurate analysis codes to be deployed. Unfortunately, the optimization of complex engineering design problems remains a difficult task, due to the complexity of the cost surfaces and the human expertise necessary to achieve high-quality results. This research is concerned with the effective use of past experiences and chronicled data from previous designs to mitigate some of the limitations of the present engineering design optimization process. In particular, the present work leverages well-established artificial intelligence technologies and extends recent theoretical and empirical advances, particularly in machine learning, adaptive hybrid evolutionary computation, surrogate modelling, radial basis functions and transductive inference, to mitigate the issues of (i) the choice of optimization methods and (ii) dealing with expensive design problems. The resulting approaches are studied using commonly employed benchmark functions. Further demonstrations on realistic aerodynamic aircraft and ship design problems reveal that the proposed techniques not only generate robust design performance, they can also greatly decrease the cost of design space search and arrive at better designs compared to conventional approaches.
Article
Full-text available
One of the best-known and most frequently used measures of creative idea generation is the Torrance Test of Creative Thinking (TTCT). The TTCT Verbal, assessing verbal ideation, contains two forms created to be used interchangeably by researchers and practitioners. However, the parallel forms reliability of the two versions of the TTCT Verbal has not been examined for over 50 years. This study provides a long-needed evaluation of the parallel forms reliability of the TTCT Verbal by correlating publisher generated and text-mining-based scores across the forms. The relatively weak relationship between the two forms, ranging between .21 and .40 for the overall TTCT Verbal and ranging between .03 and .33 for the individual TTCT Verbal tasks, suggests that caution should be exercised when researchers and practitioners use the two forms as equivalent measures of verbal creative idea generation.
Article
Full-text available
Background: In the context of large-scale educational assessments, the effort required to code open-ended text responses is considerably more expensive and time-consuming than the evaluation of multiple-choice responses because it requires trained personnel and long manual coding sessions. Aim: Our semi-supervised coding method eco (exploring coding assistant) dynamically supports human raters by automatically coding a subset of the responses. Method: We map normalized response texts into a semantic space and cluster response vectors based on their semantic similarity. Assuming that similar codes represent semantically similar responses, we propagate codes to responses in optimally homogeneous clusters. Cluster homogeneity is assessed by strategically querying informative responses and presenting them to a human rater. Following each manual coding, the method estimates the code distribution respecting a certainty interval and assumes a homogeneous distribution if certainty exceeds a predefined threshold. If a cluster is determined to certainly comprise homogeneous responses, all remaining responses are coded accordingly automatically. We evaluated the method in a simulation using different data sets. Results: With an average miscoding of about 3%, the method reduced the manual coding effort by an average of about 52%. Conclusion: Combining the advantages of automatic and manual coding produces considerable coding accuracy and reduces the required manual effort.