Conference Paper

Efficient Estimation of Word Representations in Vector Space

Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
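The paper itself contains no code; as a rough illustration of the semantic and syntactic regularities mentioned in the abstract, the following sketch trains word vectors with the gensim library (version 4.x assumed) on a toy corpus. The corpus, dimensions, and query words are placeholders, not the paper's setup.

```python
# Minimal sketch (not the authors' code): probing word-vector regularities,
# assuming gensim 4.x and a toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Cosine similarity between two words.
print(model.wv.similarity("king", "queen"))

# Analogy-style query: vector("king") - vector("man") + vector("woman").
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```

With such a tiny corpus the numbers are meaningless; on large corpora these queries surface the similarity and analogy structure the abstract refers to.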


... In this regard, Place2Vec [Yan et al. 2017] is one of the pioneers in representing POI types using the co-occurrence of POIs in a region along with the Word2Vec model [Mikolov et al. 2013]. In Word2Vec, words in a text are represented as real-valued vectors (named word embeddings, vector embeddings or embeddings) based on the co-occurrence of the words [Bengio et al. 2003, Mikolov et al. 2013]. Concerning POIs, Place2Vec [Yan et al. 2017] applied this technique to extract and measure the relation of POI types in a region according to the relation between their vectors (called POI type embeddings). ...
... Word2Vec, proposed by [Mikolov et al. 2013], is a technique developed for Natural Language Processing that is capable of generating predictive models from raw text. Word2Vec contains two architectures, as illustrated in Figure 1. ...
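The two architectures referred to here are the Continuous Bag-of-Words (CBOW) and Skip-gram models. As a hedged illustration (not the cited implementation), gensim 4.x exposes the choice through its sg flag; the sentences and hyperparameters below are arbitrary placeholders.

```python
# Illustrative only: the two Word2Vec architectures as exposed by gensim 4.x.
from gensim.models import Word2Vec

sentences = [["restaurant", "bar", "cafe"], ["museum", "park", "theatre"]]

# CBOW (sg=0): predict a word from its surrounding context.
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)

# Skip-gram (sg=1): predict the context from a word.
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

print(cbow.wv["cafe"][:5], skipgram.wv["cafe"][:5])
```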
Full-text available
Conference Paper
Point of Interest (POI) types are one of the most researched aspects of urban data. Several works have successfully modeled POI types considering POI co-occurrences in different spatial regions along with statistical models based on the Word2Vec technique from Natural Language Processing. Although these works have presented good results, they do not consider the spatial distance among related POIs as a feature to represent POI types. In this context, we present an approach based on Word2Vec sentences that includes such a distance to generate POI type embeddings, providing an improved POI type representation. Experiments based on similarity assessments between POI types revealed that our representation provides values close to human judgment.
... We focus here on two-dimensional dynamics because of their preponderance in nature, ranging from classical non-linear models of population growth like the Lotka-Volterra model to increasingly important models of climate dynamics (for applications to both, see 4). Similarly to many word embeddings approaches (Mikolov et al., 2013), which approximate the meaning of words using statistical regularity instead of formal semantics, our dynamical embeddings seek to model the "semantics" of dynamical systems from data instead of via analytical investigation. phase2vec is so-called since it is based on a vector of convolutional features extracted from the vector field representing the dynamics in phase space of input data. ...
... We reason that a good map, ψ, is one that produces embeddings, z i , of testing data from which governing equations can be faithfully decoded. Much in the way word embeddings are learned by optimizing an embedding map on a self-supervised auxiliary task of context prediction (Mikolov et al., 2013), we train our embedding map on a self-supervised auxiliary task of governing equation prediction. ...
Preprint
Dynamical systems are found in innumerable forms across the physical and biological sciences, yet all these systems fall naturally into universal equivalence classes: conservative or dissipative, stable or unstable, compressible or incompressible. Predicting these classes from data remains an essential open challenge in computational physics at which existing time-series classification methods struggle. Here, we propose phase2vec, an embedding method that learns high-quality, physically-meaningful representations of 2D dynamical systems without supervision. Our embeddings are produced by a convolutional backbone that extracts geometric features from flow data and minimizes a physically-informed vector field reconstruction loss. In an auxiliary training period, embeddings are optimized so that they robustly encode the equations of unseen data over and above the performance of a per-equation fitting method. The trained architecture can not only predict the equations of unseen data, but also, crucially, learns embeddings that respect the underlying semantics of the embedded physical systems. We validate the quality of learned embeddings by investigating the extent to which physical categories of input data can be decoded from embeddings compared to standard blackbox classifiers and state-of-the-art time series classification techniques. We find that our embeddings encode important physical properties of the underlying data, including the stability of fixed points, conservation of energy, and the incompressibility of flows, with greater fidelity than competing methods. We finally apply our embeddings to the analysis of meteorological data, showing that we can detect climatically meaningful features. Collectively, our results demonstrate the viability of embedding approaches for the discovery of dynamical features in physical systems.
... Word2vec [343,342], GloVe [385], Fasttext [157] Strongly supervised ...
... First uses of unlabeled text to learn word representations were introduced by Collobert and Weston [85], Turian et al. [506] and Collobert et al. [86]. It became dominant in NLP when latent semantic analysis revealed powerful properties learned by the self-supervised Word2vec [343,342,429,163], GloVe [385] and Fasttext [157]. ...
Thesis
Natural Language Processing is motivated by applications where computers should gain a semantic and syntactic understanding of human language. Recently, the field has been impacted by a paradigm shift. Deep learning architectures coupled with self-supervised training have become the core of state-of-the-art models used in Natural Language Understanding and Natural Language Generation. Sometimes considered as foundation models, these systems pave the way for novel use cases. Driven by an academic-industrial partnership between the Institut Polytechnique de Paris and Google AI Research, the present research has focused on investigating how pretrained neural Natural Language Processing models could be leveraged to improve online interactions. This thesis first explored how self-supervised style transfer could be applied to the toxic-to-civil rephrasing of offensive comments found in online conversations. In the context of toxic content moderation online, we proposed to fine-tune a pretrained text-to-text model (T5) with a denoising and cyclic auto-encoder loss. The system, called CAE-T5, was trained on the largest toxicity detection dataset to date (Civil Comments) and generates sentences that are more fluent and better at preserving the initial content compared to earlier text style transfer systems, according to several scoring systems and human evaluation. Moreover, the approach showed it could be generalized to additional style transfer tasks, such as sentiment transfer. Then, a subsequent work investigated the human labeling and automatic detection of toxic spans in online conversations. Contrary to toxicity detection datasets and models which classify whole posts as toxic or not, toxic spans detection aims at highlighting toxic spans, that is to say the spans that make a text toxic, when detecting such spans is possible. We released a new labeled dataset to train and evaluate systems, which led to a shared task at the 15th International Workshop on Semantic Evaluation. Systems proposed to address the task include strongly supervised models trained using annotations at the span level as well as weakly supervised approaches, known as rationale extraction, using classifiers trained on potentially larger external datasets of posts manually annotated as toxic or not, without toxic span annotations. Furthermore, the ToxicSpans dataset and systems proved useful to analyze the performances of humans and automatic systems on toxic-to-civil rephrasing. Finally, we developed a recommender system based on online reviews of items, contributing to the topic of explaining the users' tastes behind the predicted recommendations. The method uses textual semantic similarity models to represent a user's preferences as a graph of textual snippets, where the edges are defined by semantic similarity. This textual, memory-based approach to rating prediction holds out the possibility of improved explanations for recommendations. The method is evaluated quantitatively, highlighting that leveraging text in this way can outperform both memory-based and model-based collaborative filtering baselines.
... This is because of their ability to capture the syntactic and semantic relations among words [14]. Word embedding models based on deep learning include Word2Vec [15], Global Vectors (GloVe) [16], FastText [17] and the Bidirectional Encoder Representations from Transformers (BERT) model [18]. Although these word embedding methods are very effective compared to conventional NLP-based methods [19,20], they have some limitations and thus need improvement. ...
... According to Mikolov [28] [30,31]. Word embeddings are better than the normal bag of words representation since they cater for synonyms and produce vectors with lower dimensionality than the bag of words [14,15]. Garg [32] did research on word embeddings and established that Word2Vec embeddings performed better than the other word embedding algorithms. ...
Full-text available
Preprint
Sentiment analysis has become an important area of research in natural language processing. This technique has a wide range of applications such as comprehending user preferences in ecommerce feedback portals, politics, and governance. However, accurate sentiment analysis requires robust text representation techniques that can convert words into precise vectors that represent the input text. There are two categories of text representation techniques: lexicon-based techniques and machine learning-based techniques. From research, both techniques have limitations. For instance, pre-trained word embeddings such as Word2Vec, GloVe and Bidirectional Encoder Representations from Transformers (BERT) generate vectors by considering word distances, similarities and occurrences, ignoring other aspects such as word sentiment orientation. Aiming at such limitations, this paper presents a sentiment classification model (named LeBERT) combining Sentiment Lexicon, N-grams, BERT and CNN. In the model, Sentiment Lexicon, N-grams and BERT are used to vectorize words selected from a section of the input text. CNN is used as the deep neural network classifier for feature mapping and giving the output sentiment class. The proposed model is evaluated on Yelp’s three datasets (movie, restaurant and products’ reviews) using accuracy, precision and F-measure as performance metrics. The experimental results indicate that the proposed LeBERT model outperforms the existing state-of-the-art models with an F-measure score of 88.73% in binary sentiment classification.
... Next, text data is transformed into feature vectors using Word2Vec. Word2Vec is a neural network which predicts the context of the word, and returns a feature vector representing said context [29,30]. Finally, the obtained features are passed through a model which combines the semantic information obtained from Support Vector Machines (SVM) with the temporal information obtained from Hidden Markov Models (HMM). ...
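A generic sketch, not the cited pipeline, of one common way to turn Word2Vec vectors into fixed-size features by averaging them per utterance and feeding them to an SVM; the utterances, labels, and dimensions are hypothetical.

```python
# Sketch: averaged Word2Vec vectors as features for an SVM classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

utterances = [["cut", "the", "cystic", "duct"], ["retrieve", "the", "gallbladder"]]
labels = [0, 1]  # hypothetical phase labels

w2v = Word2Vec(utterances, vector_size=32, window=2, min_count=1, epochs=50)

def embed(tokens):
    # Average the word vectors of the tokens present in the vocabulary.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([embed(u) for u in utterances])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```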
Full-text available
Article
Automatic surgical workflow analysis (SWA) plays an important role in the modelling of surgical processes. Current automatic approaches for SWA use videos (with accuracies varying from 0.8 to 0.9), but they do not incorporate speech (inherently linked to the ongoing cognitive process). The approach followed in this study uses both video and speech to classify the phases of laparoscopic cholecystectomy, based on neural networks and machine learning. The automatic application implemented in this study uses this information to calculate the total time spent in surgery, the time spent in each phase, the number of occurrences, the minimal, maximal and average time whenever there is more than one occurrence, the timeline of the surgery and the transition probability between phases. This information can be used as an assessment method for surgical procedural skills.
... In this work, we directly take the token representation from the last hidden layers in order to generate the product representation; it will be interesting to experiment with other layers in the transformer model and understand their learning in terms of the syntax and semantics of the tokens. There are other interesting possibilities, for example taking other auxiliary tasks into consideration, like Next Sentence Prediction and Sentence Order Prediction, while pre-training in order to see their impact on the learned product representations. It will also be interesting to incorporate other modalities like product images along with rich product attributes while learning product representations. ...
Full-text available
Preprint
Learning low-dimensional representations for the large number of products present in an e-commerce catalogue plays a vital role, as they are helpful in tasks like product ranking, product recommendation, finding similar products, modelling user behaviour, etc. Recently, a lot of tasks in the NLP field are being tackled using Transformer-based models, and these deep models are widely applicable in industry settings to solve various problems. With this motivation, we apply a transformer-based model for learning contextual representations of products in an e-commerce setting. In this work, we propose a novel approach of pre-training a transformer-based model on a user-generated sessions dataset obtained from a large fashion e-commerce platform to obtain latent product representations. Once pre-trained, we show that the low-dimensional representation of the products can be obtained given the product attribute information as a textual sentence. We mainly pre-train BERT, RoBERTa, ALBERT and XLNET variants of the transformer model and show a quantitative analysis of the product representations obtained from these models with respect to Next Product Recommendation (NPR) and Content Ranking (CR) tasks. For both tasks, we collect evaluation data from the fashion e-commerce platform and observe that the XLNET model outperforms other variants with an MRR of 0.5 for NPR and an NDCG of 0.634 for CR. The XLNET model also outperforms the Word2Vec-based non-transformer baseline on both downstream tasks. To the best of our knowledge, this is the first work on pre-training transformer-based models using user-generated session data containing products represented with rich attribute information for adoption in an e-commerce setting. These models can be further fine-tuned in order to solve various downstream tasks in e-commerce, thereby eliminating the need to train a model from scratch.
... To get word-level transcription-translation alignment pairs, we use the fast_align [22] toolkit. We also use word2vec [23] to search for the five closest words of each noun from the transcriptions. Training details: We use the Adam optimizer [24] to optimize the parameters in our model. ...
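A small sketch, assuming gensim 4.x, of how the "five closest words of each noun" step could look; the toy transcripts and the noun list are placeholders, not the cited data.

```python
# Sketch: nearest-neighbour lookup for a list of nouns with Word2Vec.
from gensim.models import Word2Vec

transcripts = [
    ["the", "train", "leaves", "the", "station"],
    ["buy", "a", "ticket", "at", "the", "station"],
    ["the", "train", "arrives", "late"],
]
model = Word2Vec(transcripts, vector_size=32, window=3, min_count=1, epochs=100)

nouns = ["train", "station", "ticket"]  # hypothetical nouns from the transcriptions
for noun in nouns:
    if noun in model.wv:
        # Five closest words by cosine similarity in the embedding space.
        print(noun, model.wv.most_similar(noun, topn=5))
```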
Preprint
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. At the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
... In our previous article (under peer review), we suggested that "brain infarction" and "myocardial infarction" have different vectors and that some vectors share greater similarity if they are diseases of the same organ. Word2Vec is an unsupervised learning tool that uses neural networks to compute distributed representations of words [7]. Word2Vec is effective in capturing semantic relatedness and similarity relations among medical terms [8]. ...
Full-text available
Article
Background The pivot and cluster strategy (PCS) is a diagnostic reasoning strategy that automatically elicits disease clusters similar to a differential diagnosis in a batch. Although physicians know empirically which disease clusters are similar, there has been no quantitative evaluation. This study aimed to determine whether inter-disease distances between word embedding vectors using the PCS are a valid quantitative representation of similar disease groups in a limited domain. Methods Abstracts were extracted from the Ichushi Web database and subjected to morphological analysis and training using Word2Vec, FastText, and GloVe. Consequently, word embedding vectors were obtained. For words including “infarction,” we calculated the cophenetic correlation coefficient (CCC) as an internal validity measure and the adjusted rand index (ARI), normalized mutual information (NMI), and adjusted mutual information (AMI) with ICD-10 codes as the external validity measures. This was performed for each combination of metric and hierarchical clustering method. Results Seventy-one words included “infarction,” of which 38 diseases matched the ICD-10 standard with the appearance of 21 unique ICD-10 codes. When using Word2Vec, the CCC was most significant at 0.8690 (metric and method: euclidean and centroid), whereas the AMI was maximal at 0.4109 (metric and method: cosine and correlation, and average and weighted). The NMI and ARI were maximal at 0.8463 and 0.3593, respectively (metric and method: cosine and complete). FastText and GloVe generally resulted in the same trend as Word2Vec, and the metric and method that maximized CCC differed from the ones that maximized the external validity measures. Conclusions The metric and method that maximized the internal validity measure differed from those that maximized the external validity measures; both produced different results. The cosine distance should be used when considering ICD-10, and the Euclidean distance when considering the frequency of word occurrence. The distributed representation, when trained by Word2Vec on the “infarction” domain from a Japanese academic corpus, provides an objective inter-disease distance used in PCS.
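The validation procedure described above (cophenetic correlation as internal validity; ARI, NMI, and AMI against ICD-10 codes as external validity, across metric and linkage combinations) can be sketched with SciPy and scikit-learn as below. The embedding matrix and label vector are random placeholders standing in for the actual word vectors and ICD-10 codes.

```python
# Sketch of internal/external validation of hierarchical clustering of embeddings.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(38, 100))        # placeholder word embeddings
icd10_codes = rng.integers(0, 21, size=38)  # placeholder ICD-10 labels

dists = pdist(vectors, metric="cosine")     # one choice of metric
Z = linkage(dists, method="complete")       # one choice of linkage method

ccc, _ = cophenet(Z, dists)                 # internal validity: cophenetic correlation

clusters = fcluster(Z, t=21, criterion="maxclust")  # external validity vs. ICD-10
print(ccc,
      adjusted_rand_score(icd10_codes, clusters),
      normalized_mutual_info_score(icd10_codes, clusters),
      adjusted_mutual_info_score(icd10_codes, clusters))
```

Looping the metric and method arguments over the combinations mentioned in the abstract reproduces the kind of grid the authors evaluate.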
... The Friend Of A Friend (FOAF) [39] is a music recommender system that makes recommendations based on the user's interests. Keyword matching, or Term Frequency/Inverse Document Frequency (TF-IDF), and Word2Vec (W2V) [108] are the typical approaches in this family of recommender systems. In the past decade, deep learning (DL) has attracted much attention compared to conventional models, relying on its ability to deal with problems. ...
Preprint
Any organization needs to improve its products, services, and processes. In this context, engaging with customers and understanding their journey is essential. Organizations have leveraged various techniques and technologies to support customer engagement, from call centres to chatbots and virtual agents. Recently, these systems have used Machine Learning (ML) and Natural Language Processing (NLP) to analyze large volumes of customer feedback and engagement data. The goal is to understand customers in context and provide meaningful answers across various channels. Despite multiple advances in Conversational Artificial Intelligence (AI) and Recommender Systems (RS), it is still challenging to understand the intent behind customer questions during the customer journey. To address this challenge, in this paper, we study and analyze the recent work in Conversational Recommender Systems (CRS) in general and, more specifically, in chatbot-based CRS. We introduce a pipeline to contextualize the input utterances in conversations. We then take the next step towards leveraging reverse feature engineering to link the contextualized input and learning model to support intent recognition. Since performance evaluation is achieved based on different ML models, we use transformer-based models to evaluate the proposed approach using a labelled dialogue dataset (MSDialogue) of question-answering interactions between information seekers and answer providers.
... We follow [76,86] and use supervised word embeddings as a baseline. Word embeddings are most well-known in the context of unsupervised training on raw text as in [87], yet they can also be used to score message-response pairs. The embedding vectors are trained directly for this goal. ...
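A minimal sketch of scoring message-response pairs with word embeddings as described above: both sides are averaged into vectors and compared with a dot product. The vocabulary, embedding table, and data are placeholders, and the supervised training of the table is omitted.

```python
# Sketch: dot-product scoring of message-response pairs with an embedding table.
import numpy as np

vocab = {"hi": 0, "hello": 1, "there": 2, "bye": 3}
E = np.random.default_rng(0).normal(size=(len(vocab), 16))  # stand-in for trained embeddings

def encode(tokens):
    # Average the embeddings of known tokens into one vector.
    return E[[vocab[t] for t in tokens if t in vocab]].mean(axis=0)

message = ["hi", "there"]
responses = [["hello"], ["bye"]]
scores = [encode(message) @ encode(r) for r in responses]  # higher score = better match
print(int(np.argmax(scores)))
```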
Full-text available
Preprint
The goal of building dialogue agents that can converse with humans naturally has been a long-standing dream of researchers since the early days of artificial intelligence. The well-known Turing Test proposed to judge the ultimate validity of an artificial intelligence agent on the indistinguishability of its dialogues from humans'. It should come as no surprise that human-level dialogue systems are very challenging to build. But, while early effort on rule-based systems found limited success, the emergence of deep learning enabled great advance on this topic. In this thesis, we focus on methods that address the numerous issues that have been imposing the gap between artificial conversational agents and human-level interlocutors. These methods were proposed and experimented with in ways that were inspired by general state-of-the-art AI methodologies. But they also targeted the characteristics that dialogue systems possess.
... The sequence needs to be expressed as word embeddings before it can be used as the input of the sentiment analysis model. Among the commonly used pre-trained models, we use pre-trained Word2Vec [17] word vectors, considering the training time of the model. ...
Full-text available
Preprint
[Purpose/significance] Weibo is a Chinese social media platform based on user relationships. Users can comment on hot topics to express their emotions and emotional tendencies. Mining users' emotions and attitudes from Weibo topic comments is helpful for Weibo public opinion analysis, which has practical significance. [Method/process] This paper proposes a sentiment analysis method that combines the bidirectional gated recurrent unit neural network (BiGRU) with the attention mechanism to construct an emotion classification model that takes into account training speed and accuracy. Experiments on public datasets show that compared with the classical neural network model, the model in this paper can effectively improve the accuracy of sentiment classification. Experiments are carried out on topic comment data from 10 different fields, and the results verify the feasibility and generalization ability of the model. [Result/conclusion] Finally, a quick and effective sentiment prediction is made for the comments on hot topics, and the opinions and attitudes of netizens on hot topics are analyzed according to the calculated sentiment index and word cloud graph, so as to grasp the public’s emotional attitude towards public opinion events in time and provide a reference for the subsequent application of popular public event prediction and public opinion analysis and research. [Limitations] There are also some shortcomings in this study, such as errors in sentiment analysis of comment sentences with complex semantics such as irony, and no fine-grained sentiment analysis, which still needs further research and improvement.
... These statistical representations are currently complemented by dense vector representations, called word embeddings, based on deep learning approaches. The authors of (Mikolov et al., 2013) introduce the Word2vec model, a neural approach that associates a word with a vector computed depending on the context in which the word appears in the training set. Thus, the vector representing a word contains information about it. ...
Thesis
The objective of this thesis is to design an event detection system on social networks to assist people in charge of decision-making in industrial contexts. The event detection system must be able to detect both targeted, domain-specific events and general events. In particular, we are interested in the application of this system to supply chains and more specifically those related to raw materials. The challenge is to build such a detection system, but also to determine which events are potentially influencing the raw materials supply chains. This synthesis summarizes the different stages of research conducted to address these problems.

Architecture of an event detection system: First, we introduce the different building blocks of an event detection system. These systems are classically composed of a data filtering and cleaning step, ensuring the quality of the data processed by the system. Then, these data are embedded in such a way that they can be clustered by similarity. Once these data clusters are created, they are analyzed in order to know whether the documents constituting them discuss an event or not. Finally, the evolution of these events is tracked. In this thesis, we have proposed to study the problems specific to each of these steps.

Textual representation of documents from social networks: We compared different text representation models in the context of our event detection system. We also compared the performance of our event detection system to the First Story Detection (FSD) algorithm, an algorithm with the same objectives. We first demonstrated that our proposed system performs better than FSD, but also that recent neural network architectures perform better than TF-IDF in our context, contrary to what was shown in the context of FSD. We then proposed to combine different textual representations in order to jointly exploit their strengths.

Event detection, monitoring, and evaluation: We have proposed different approaches for event detection and event tracking. In particular, we use the entropy and user diversity introduced in ... to evaluate the clusters. We then track their evolution over time by making comparisons between clusters at different times, in order to create chains of clusters. Finally, we studied how to evaluate event detection systems in contexts where only few human-annotated data are available. We proposed a method to automatically evaluate event detection systems by exploiting partially annotated data.

Application to the commodities context: In order to specify the types of events to supervise, we conducted a historical study of events that have impacted the price of raw materials. In particular, we focused on phosphate, a strategic raw material. We studied the different factors having an influence, and proposed a reproducible method that can be applied to other raw materials or other fields. Finally, we drew up a list of elements to supervise to enable experts to anticipate price variations.
... Recently, analyzing data by using neural-network-inspired language models has gained attention and has become an important part of modern NLP systems [6]. In particular, word embedding methods that use a dense vector space to represent words are prominent instances [24]. In this context, models such as bidirectional encoder representations from transformers (BERT) [8] have become the standard basis models in all machine learning tasks that have natural language as input. ...
Preprint
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the textual domain, pre-trained language models like BERT have been shown to be helpful, by providing a good baseline for further fine-tuning. However, due to the domain knowledge and many technical terms in cybersecurity, general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available.
... The Gini importance vectors can be interpreted as high-dimensional representations for the products, like word embeddings (33) in natural language processing (34). Indeed, they contain information about the productive background that the Random Forest algorithm recognizes as necessary or highly predictive for their future export. ...
Preprint
Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability to export a target product in the near future. The explainers naturally identify a low dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers.
... They labelled the sentiment in three-class and five-class shows with 65.97% and 54.24% accuracy. They applied the Word2vec algorithm (Mikolov et al., 2013) to the vector representation of each sentence and also implemented the Continuous Bag of Words (CBOW) and Skip-Gram (SG) models. After completing all preparation, they applied two methods for both sentiment and emotion identification. ...
Full-text available
Article
Sentiment Analysis (SA) is the method of studying a person’s comments and statements through computational means. It is a sub-domain of Natural Language Processing. A sentiment analysis (SA) system is created by training on a significant number of positive, negative or neutral sentence datasets. Although there are many research papers on this subject in English, the work of sentiment analysis in Bengali has not become very popular due to the complexity of the Bangla language and the insufficient presence of the Bangla language online. But now the use of Bengali language has increased in online news, social media, and blogs. At the same time, the number of researches on Bengali NLP is also increasing. But the current state of sentiment analysis, its limitations, and whether there is still room for improvement in some places are not being properly reviewed, and research is lagging behind. We have conducted a review on sentiment analysis so that future researchers of sentiment analysis can easily find out about the current state of sentiment analysis. In this research, we have tried to survey the current context of sentiment analysis (SA) and at the same time, we have created a sequence of comparatively better research from the existing ones. To create this sequence we followed a method called TOPSIS. We have also discussed the challenges to overcome for improving the sentiment analyzer.
... Graph2vec is a neural embedding approach that learns representations of graphs. Inspired by doc2vec, which is a document embedding method, graph2vec views an entire graph as a document and the rooted subgraphs as words, and then learns the representations of graphs through the doc2vec skip-gram training process [66]. As shown in Fig. 1a, a document is composed of words, while the graph in Fig. 1b is composed of rooted subgraphs. ...
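A hedged illustration of the doc2vec view behind graph2vec: each graph is treated as a "document" whose "words" are rooted-subgraph labels, trained here with gensim's Doc2Vec in PV-DBOW mode (dm=0), the doc2vec analogue of skip-gram. The graph identifiers and subgraph labels below are invented placeholders, not the cited implementation.

```python
# Sketch: graphs as documents, rooted-subgraph labels as words, via gensim Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

graph_subgraph_labels = {
    "g1": ["wl_0_A", "wl_1_AB", "wl_2_ABC"],
    "g2": ["wl_0_A", "wl_1_AC", "wl_2_ACD"],
}
docs = [TaggedDocument(words=labels, tags=[gid])
        for gid, labels in graph_subgraph_labels.items()]

model = Doc2Vec(docs, dm=0, vector_size=64, min_count=1, epochs=50)
print(model.dv["g1"][:5])  # learned embedding of graph g1
```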
Full-text available
Article
To accelerate the performance estimation in neural architecture search, recently proposed algorithms adopt surrogate models to predict the performance of neural architectures instead of training the network from scratch. However, it is time-consuming to collect sufficient labeled architectures for surrogate model training. To enhance the capability of surrogate models using a small amount of training data, we propose a surrogate-assisted evolutionary algorithm with network embedding for neural architecture search (SAENAS-NE). Here, an unsupervised learning method is used to generate a meaningful representation of each architecture, and the architectures with more similar structures are closer in the embedding space, which considerably benefits the training of surrogate models. In addition, a new environmental selection based on a reference population is designed to keep the diversity of the population in each generation, and an infill criterion for handling the trade-off between convergence and model uncertainty is proposed for re-evaluation. Experimental results on three different NASBench and DARTS search spaces illustrate that network embedding makes the surrogate model achieve comparable or superior performance. The superiority of our proposed method SAENAS-NE over other state-of-the-art neural architecture search algorithms has been verified in the experiments.
... Mikolov et al. [8] came up with the Word2Vec model for learning vector word representations, or embeddings. The model learns by predicting the co-occurrence of words, implemented through a shallow neural network. ...
Full-text available
Article
Word embeddings are vector representations for words that capture the syntactic as well as semantic content of the language. This paper is an attempt at generating pre-trained word embeddings for the Khasi language, a language spoken in one of the states of India–Meghalaya as well as in some parts of Assam and Bangladesh. In the context of natural language processing (NLP), Khasi is considered as one of the low-resource languages. Because of this limitation, not much has been accomplished in the domain of NLP for the Khasi language so far. For languages with abundant resource sets like German, English, French, etc., deep learning has been hailed as a breakthrough in NLP. Many deep learning models for these languages have demonstrated state-of-the-art levels of performance. This work is an attempt to produce word embeddings for the Khasi language using deep learning. The contextualized deep learning model, bidirectional encoder representations from transformers (BERT), has been used for this purpose. The findings of the word prediction task used to evaluate the model are discussed at length in the paper. In addition, a comparison with Word2Vec is made to demonstrate the efficacy of each embedding based on tasks such as word analogies and similarity grouping, and comparable results are obtained. In order to facilitate their use in related downstream NLP tasks, the learnt and optimized embeddings are generated and made available as an outcome of this work.
... Pioneering works in SSL proposed to exploit spatial cues to generate pretext tasks [22,28,35,37,45,46,52]. Notably, inspired by word2vec [42], Doersch et al. [22] train a network to predict the relative position of a pair of patches from the same image while Noroozi and Favaro [46] extend this approach to solving "jigsaw puzzles" by rearranging a set of shuffled crops of an image. These approaches were developed with Convnets and very little work has revisited them in the scope of Transformers [71]. ...
Full-text available
Preprint
Pixel-level labels are particularly expensive to acquire. Hence, pretraining is a critical step to improve models on a task like semantic segmentation. However, prominent algorithms for pretraining neural networks use image-level objectives, e.g. image classification, image-text alignment a la CLIP, or self-supervised contrastive learning. These objectives do not model spatial information, which might be suboptimal when finetuning on downstream tasks with spatial reasoning. In this work, we propose to pretrain networks for semantic segmentation by predicting the relative location of image parts. We formulate this task as a classification problem where each patch in a query view has to predict its position relative to another reference view. We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query. Our experiments show that this location-aware (LOCA) self-supervised pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
... Amel-Zadeh et al. [24] provided a proof of concept for the use of ML and NLP to detect companies' alignment with SDGs based on their CSR reports. Their proposed method with binary outcomes used Word2Vec [25] and Doc2Vec models for training a logistic regression classifier, a fully-connected neural network, and an SVM which, with a Doc2Vec [26] embedding, achieved the highest average accuracy of 83.5% for predicting alignment. Guisiano et al. [27,28] proposed a multi-label classification system using BERT and an online tool "SDG-meter" to automate this task. ...
Full-text available
Article
There is a strong need and demand from the United Nations, public institutions, and the private sector for classifying government publications, policy briefs, academic literature, and corporate social responsibility reports according to their relevance to the Sustainable Development Goals (SDGs). It is well understood that the SDGs play a major role in the strategic objectives of various entities. However, linking projects and activities to the SDGs has not always been straightforward or possible with existing methodologies. Natural language processing (NLP) techniques offer a new avenue to identify linkages for SDGs from text data. This research examines various machine learning approaches optimized for NLP-based text classification tasks for their success in classifying reports according to their relevance to the SDGs. Extensive experiments have been performed with the recently released Open Source SDG (OSDG) Community Dataset, which contains texts with their related SDG label as validated by community volunteers. Results demonstrate that especially fine-tuned RoBERTa achieves very high performance in the attempted task, which is promising for automated processing of large collections of sustainability reports for detection of relevance to SDGs.
... Another solution available to us would have been to use a neural network to generate an embedding for each record (Mikolov et al., 2013), similar to doc2vec (Gensim, 2002). We decided against that approach for three reasons: (a) the difference between records can more easily be parallelized, (b) the use of both neural networks and TDA, both of which are part of artificial intelligence, would have made it difficult to judge the performance of TDA alone, and (c) pursue more easily explainable AI by avoiding embedding solutions, which rely on neural networks whose decisions are not easily explainable to a human analyst. ...
Full-text available
Article
Data quality problems may occur in various forms in structured and semi-structured data sources. This paper details an unsupervised method of analyzing data quality that is agnostic to the semantics of the data, the format of the encoding, or the internal structure of the dataset. A distance function is used to transform each record of a dataset into an n-dimensional vector of real numbers, which effectively transforms the original data into a high-dimensional point cloud. The shape of the point cloud is then efficiently examined via topological data analysis to find high-dimensional anomalies that may signal quality issues. The specific quality fault examined in this paper is the presence of records that, while not exactly the same, refer to the same entity. Our algorithm, based on topological data analysis, provides similar accuracy for both higher and lower quality data and performs better than a baseline approach for data with poor quality.
... The columns "customer", "lost" and "his" should be correlated respectively with the value of columns "client", "mislaid" and "her", since they represent the same concepts. To consider semantic similarity, an adjustment can be applied to the document-by-term matrix using word embedding techniques, such as word2vec (Mikolov et al., 2013). ...
Full-text available
Preprint
Financial institutions manage operational risk by carrying out the activities required by regulation, such as collecting loss data, calculating capital requirements, and reporting. The information necessary for this purpose is then collected in the OpRisk databases. Recorded for each OpRisk event are loss amounts, dates, organizational units involved, event types and descriptions. In recent years, operational risk functions have been required to go beyond their regulatory tasks to proactively manage the operational risk, preventing or mitigating its impact. As OpRisk databases also contain event descriptions, usually defined as free text fields, an area of opportunity is the valorization of all the information contained in such records. As far as we are aware, the present work is the first one that has addressed the application of text analysis techniques to OpRisk event descriptions. In this way, we have complemented and enriched the established framework of statistical methods based on quantitative data. Specifically, we have applied text analysis methodologies to extract information from the descriptions in the OpRisk database. After delicate tasks like data cleaning, text vectorization, and semantic adjustment, we apply methods of dimensionality reduction and several clustering models and algorithms to develop a comparison of their performances and weaknesses. Our results improve retrospective knowledge of loss events and make it possible to mitigate future risks.
... Thus, a text containing K words will be converted into K vectors of size d, with d ∈ ℕ* depending on the method used. These methods are called word embedding methods [84,85,86,87], and can be separated into two categories: non-contextual methods and contextual methods. ...
Thesis
For a number of years, many probabilistic models, such as Naive Bayes or the hidden Markov chain, have seen a great loss of interest for classification with supervised learning. These models, described as generative, are criticized because their induced classifier must take into account the distribution of the observations, which can be very complex to learn when the number of features is high. This is notably the case in Natural Language Processing, where recent algorithms convert words into large numerical vectors to achieve better performance. In this thesis, we show that any generative model can define its classifier without taking into account the distribution of the observations. This proposal calls into question the known categorization of probabilistic models and their induced classifiers, into generative and discriminative classes, and opens the way to a large number of possible applications. Thus, the hidden Markov chain can be applied without constraints to the syntactic parsing of texts, or Naive Bayes to sentiment analysis. We go further, since this proposal makes it possible to compute the classifier of a generative probabilistic model with neural networks. Consequently, we "neuralize" the models mentioned above as well as a large number of their extensions. The models thus obtained achieve relevant scores on various Natural Language Processing tasks while being interpretable, requiring little training data, and being simple to put into production.
... Thus, a text containing K words will be converted into K vectors of size d, with d ∈ ℕ* depending on the method used. These methods are called word embedding methods [84,85,86,87], and can be separated into two categories: non-contextual methods and contextual methods. ...
Thesis
Microservice architectures contribute to building complex distributed systems as sets of independent microservices. The decoupling and modularity of distributed microservices facilitates their independent replacement and upgradeability. Since the emergence of agile DevOps and CI/CD, there is a trend towards more frequent and rapid evolutionary changes of the running microservice-based applications in response to various evolution requirements. Applying changes to microservice architectures is performed by an evolution process of moving from the current application version to a new version. The maintenance and evolution costs of these distributed systems increase rapidly with the number of microservices. The objective of this thesis is to address the following issues: How to help engineers to build a unified and efficient version management for microservices and how to trace changes in microservice-based applications? When can microservice-based applications, especially those with long-running activities, be dynamically updated without stopping the execution of the whole system? How should the safe updating be performed to ensure service continuity and maintain system consistency? In response to these questions, this thesis proposes two main contributions. The first contribution is runtime models and an evolution graph for modelling and tracing version management of microservices. These models are built at design time and used at runtime. They help engineers abstract architectural evolution in order to manage reconfiguration deployments, and they provide the knowledge base to be manipulated by an autonomic manager middleware in various evolution activities. The second contribution is a snapshot-based approach for dynamic software updating (DSU) of microservices. The consistent distributed snapshots of microservice-based applications are constructed to be used for specifying continuity of service, evaluating the safe update conditions and realising the update strategies. The message complexity of the DSU algorithm is not the message complexity of the distributed application, but the complexity of the consistent distributed snapshot algorithm.
... However, tracking semantic evolution is not possible using these techniques because they do not generate language models. Language Models: The state-of-the-art technique for language modeling is word2vec, introduced by Mikolov et al. [5,6]. This method generates a static language model where every word is represented as a vector (also called embedding) by training a neural network to mimic the contextual patterns observed in a text corpus. ...
Full-text available
Preprint
Semantics in natural language processing is largely dependent on contextual relationships between words and entities in a document collection. The context of a word may evolve. For example, the word "apple" currently has two contexts -- a fruit and a technology company. The changes in the context of words or entities in text data such as scientific publications and news articles can help us understand the evolution of innovation or events of interest. In this work, we present a new diffusion-based temporal word embedding model that can capture short and long-term changes in the semantics of entities in different domains. Our model captures how the context of each entity shifts over time. Existing temporal word embeddings capture semantic evolution at a discrete/granular level, aiming to study how a language developed over a long period. Unlike existing temporal embedding methods, our approach provides temporally smooth embeddings, facilitating prediction and trend analysis better than those of existing models. Extensive evaluations demonstrate that our proposed temporal embedding model performs better in sense-making and predicting relationships between entities in the future compared to other existing models.
... However, we want to limit our approach to methods also applicable to low resource languages like Middle High German, where no syntax parsing is available. Thus, we assume that only part-of-speech tags and token-based word embeddings like word2vec (Mikolov et al., 2013) or fastText (Bojanowski et al., 2017) are obtainable. We do not rely on methods requiring large amounts of training data like transformer models or syntax parsers. ...
Full-text available
Conference Paper
In this work, we present a novel unsupervised method for adjective-noun metaphor detection on low resource languages. We propose two new approaches: First, a way of artificially generating metaphor training examples and second, a novel way to find metaphors relying only on word embeddings. The latter enables application for low resource languages. Our method is based on a transformation of word embedding vectors into another vector space, in which the distance between the adjective word vector and the noun word vector represents the metaphoricity of the word pair. We train this method in a zero-shot pseudo-supervised manner by generating artificial metaphor examples and show that our approach can be used to generate a metaphor dataset with low annotation cost. It can then be used to finetune the system in a few-shot manner. In our experiments we show the capabilities of the method in its unsupervised and in its supervised version. Additionally, we test it against a comparable unsupervised baseline method and a supervised variation of it.
... Let us point out that BERT generates contextual embeddings, meaning that the input of the embedding model should be a sentence rather than a single word. In our model, although most textual data is represented by entire sentences, it also relies on identifying relevant trends, which are usually represented as single tokens (single terms or short n-grams); in this case, a context-independent model (e.g., Word2Vec [14]) could seem more appropriate. Nevertheless, the main weakness is that such models usually do not address out-of-vocabulary (OOV) words, meaning that they can compute embedding vectors only for words included in the training vocabulary. ...
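A hedged illustration of the OOV point above, assuming gensim 4.x: a Word2Vec model has no vector for an unseen token, while FastText composes one from character n-grams. The toy corpus and the probe word are placeholders.

```python
# Sketch: out-of-vocabulary behaviour of Word2Vec vs. FastText.
from gensim.models import Word2Vec, FastText

corpus = [["cultural", "heritage", "museum"], ["street", "art", "festival"]]

w2v = Word2Vec(corpus, vector_size=32, min_count=1)
ft = FastText(corpus, vector_size=32, min_count=1)

print("culturally" in w2v.wv)   # False: Word2Vec has no vector for an unseen word
print(ft.wv["culturally"][:5])  # FastText still returns a vector built from n-grams
```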
Full-text available
Article
In this paper, we propose an innovative tool able to enrich cultural and creative spots (gems, hereinafter) extracted from the European Commission Cultural Gems portal, by suggesting relevant keywords (tags) and YouTube videos (represented with proper thumbnails). On the one hand, the system queries the YouTube search portal, selects the videos most related to the given gem, and extracts a set of meaningful thumbnails for each video. On the other hand, each tag is selected by identifying semantically related popular search queries (i.e., trends). In particular, trends are retrieved by querying the Google Trends platform. A further novelty is that our system suggests contents in a dynamic way. Indeed, as for both YouTube and Google Trends platforms the results of a given query include the most popular videos/trends, such that a gem may constantly be updated with trendy content by periodically running the tool. The system has been tested on a set of gems and evaluated with the support of human annotators. The results highlighted the effectiveness of our proposal.
... The score is provided by the cosine similarity between the embedding vectors of the current guess and the target word. The embedding vectors can be generated by arbitrary embedding methods, such as Word2Vec (Mikolov et al., 2013b), Skip-gram (Mikolov et al., 2013a), or Glove (Pennington et al., 2014). The score transforms the 0 to 1 scale of cosine similarity to a 0 to 100 scale. ...
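A minimal sketch of the scoring rule described above: cosine similarity between the guess and target embeddings, linearly rescaled from the 0 to 1 range to a 0 to 100 scale. The vectors here are random placeholders.

```python
# Sketch: cosine similarity between guess and target, rescaled to 0-100.
import numpy as np

def score(guess_vec, target_vec):
    cos = float(np.dot(guess_vec, target_vec) /
                (np.linalg.norm(guess_vec) * np.linalg.norm(target_vec)))
    # Assumption: negative similarities are clamped before linear rescaling.
    return 100.0 * max(cos, 0.0)

rng = np.random.default_rng(0)
print(score(rng.normal(size=50), rng.normal(size=50)))
```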
Full-text available
Preprint
If scientific discovery is one of the main driving forces of human progress, insight is the fuel for the engine, which has long attracted behavior-level research to understand and model its underlying cognitive process. However, current tasks that abstract scientific discovery mostly focus on the emergence of insight, ignoring the special role played by domain knowledge. In this concept paper, we view scientific discovery as an interplay between thinking out of the box that actively seeks insightful solutions and thinking inside the box that generalizes on conceptual domain knowledge to keep correct. Accordingly, we propose Mindle, a semantic searching game that triggers scientific-discovery-like thinking spontaneously, as infrastructure for exploring scientific discovery on a large scale. On this basis, the meta-strategies for insights and the usage of concepts can be investigated reciprocally. In the pilot studies, several interesting observations inspire elaborated hypotheses on meta-strategies, context, and individual diversity for further investigations.
... Then, the processed tokens pass through the backbone of the text modality. This backbone consists of several stages: Word2Vec [33] embeddings of the tokens are obtained, then the embeddings pass through a linear layer with a non-linear activation function. Lastly, a max-pooling layer is applied on the learned representations. ...
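A rough PyTorch sketch of the text-modality backbone described above (frozen Word2Vec embedding lookup, a linear layer with a non-linear activation, and max-pooling over tokens); all dimensions and the random embedding matrix are assumptions, not the cited model.

```python
# Sketch: embedding lookup -> linear + ReLU -> max-pool over tokens.
import torch
import torch.nn as nn

class TextBackbone(nn.Module):
    def __init__(self, w2v_weights: torch.Tensor, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(w2v_weights, freeze=True)
        self.proj = nn.Sequential(nn.Linear(w2v_weights.size(1), hidden), nn.ReLU())

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.proj(self.embed(token_ids))  # (batch, seq_len, hidden)
        return x.max(dim=1).values            # max-pool over the token dimension

backbone = TextBackbone(torch.randn(1000, 300))  # random stand-in for Word2Vec weights
print(backbone(torch.randint(0, 1000, (2, 12))).shape)  # torch.Size([2, 256])
```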
Full-text available
Preprint
Video synthesis methods rapidly improved in recent years, allowing easy creation of synthetic humans. This poses a problem, especially in the era of social media, as synthetic videos of speaking humans can be used to spread misinformation in a convincing manner. Thus, there is a pressing need for accurate and robust deepfake detection methods that can detect forgery techniques not seen during training. In this work, we explore whether this can be done by leveraging a multi-modal, out-of-domain backbone trained in a self-supervised manner, adapted to the video deepfake domain. We propose FakeOut, a novel approach that relies on multi-modal data throughout both the pre-training phase and the adaptation phase. We demonstrate the efficacy and robustness of FakeOut in detecting various types of deepfakes, especially manipulations which were not seen during training. Our method achieves state-of-the-art results in cross-manipulation and cross-dataset generalization. This study shows that, perhaps surprisingly, training on out-of-domain videos (i.e., videos with no speaking humans), can lead to better deepfake detection systems. Code is available on GitHub.
... This concept was first proposed in 2013 by Mikolov et al. in [31] and [32]. The underlying idea of word2vec is to encode the relations between words in a body of text using a shallow neural network. ... (A minimal training sketch appears after this entry's abstract.)
Preprint
Modern software systems are able to record vast amounts of user actions, which are stored for later analysis. One of the main types of such user interaction data is click data: the digital trace of the actions of a user through the graphical elements of an application, website or software. While readily available, click data is often missing a case notion: an attribute linking events from user interactions to a specific process instance in the software. In this paper, we propose a neural network-based technique to determine a case notion for click data, thus enabling process mining and other process analysis techniques on user interaction data. We describe our method, show that it scales to large datasets, and validate its efficacy through a user study based on the segmented event log obtained from the interaction data of a mobility sharing company. Interviews with domain experts in the company demonstrate that the case notion obtained by our method can lead to actionable process insights.
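For readers unfamiliar with word2vec training, a minimal sketch using the gensim library (assumed to be installed) is shown below; the toy corpus and hyperparameters are illustrative only and unrelated to the cited work's data.

from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large tokenized text collection.
sentences = [
    ["the", "user", "clicked", "the", "button"],
    ["the", "user", "opened", "the", "menu"],
    ["the", "button", "opened", "a", "dialog"],
]

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50, seed=1)

print(model.wv["button"].shape)                 # (100,) word vector
print(model.wv.most_similar("button", topn=3))  # nearest neighbours in the learned space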
... Such vectorial representations are learned from a large corpus, leading to a semantic vector space wherein words with similar meanings are mapped to geometrically close vectors [1]. Word2vec [18] was the pioneering self-supervised neural framework for learning this kind of embedding. FastText [19] is a notable advance over Word2vec, representing each word by a bag of character n-grams to strike an attractive balance between predictive performance and vocabulary size. ... (A sketch of FastText-style character n-gram extraction appears after this entry's abstract.)
Full-text available
Preprint
Text classification is a natural language processing (NLP) task relevant to many commercial applications, like e-commerce and customer service. Classifying such excerpts accurately often represents a challenge due to intrinsic language aspects, like irony and nuance. To accomplish this task, one must provide a robust numerical representation for documents, a process known as embedding. Embedding is a key NLP field nowadays, having seen significant advances in the last decade, especially after the introduction of the word-to-vector concept and the popularization of Deep Learning models for solving NLP tasks, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based Language Models (TLMs). Despite the impressive achievements in this field, the literature coverage regarding generating embeddings for Brazilian Portuguese texts is scarce, especially when considering commercial user reviews. Therefore, this work aims to provide a comprehensive experimental study of embedding approaches targeting binary sentiment classification of user reviews in Brazilian Portuguese. The study ranges from classical (Bag-of-Words) to state-of-the-art (Transformer-based) NLP models. The methods are evaluated on five open-source databases with pre-defined data partitions made available in an open digital repository to encourage reproducibility. The fine-tuned TLMs achieved the best results in all cases, followed by the feature-based TLM, LSTM, and CNN, with alternating ranks depending on the database under analysis.
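The character n-gram idea attributed to FastText in the excerpt above can be illustrated as follows. The 3-to-6 n-gram range and the '<' and '>' boundary markers follow the common FastText convention; the function name is made up for this sketch.

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style bag of character n-grams, with < and > marking word boundaries."""
    marked = "<" + word + ">"
    grams = {marked[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)}
    grams.add(marked)   # the full word (with boundaries) is kept as its own feature
    return grams

print(sorted(char_ngrams("where")))

A word's vector is then taken as the sum of the vectors of its n-grams, so rare or unseen words still receive a representation.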
... Researchers have developed model aggregation techniques for updating deep learning models such as LSTMs and CNNs (Wang et al., 2018). The most common word embedding methods are GloVe (Pennington et al., 2014) and Word2vec (Mikolov et al., 2013a). Shing et al. (2018) implemented a CNN at the user level with various filter sizes to encode users' posts. ...
... Such words are similar at the semantic level but differ at the surface level, which is known as the term-mismatch problem. Handling the term-mismatch problem is one of the open issues in the field of IR (Huang et al., 2012; Mikolov et al., 2013a, 2013b). ...
Full-text available
Article
The tremendous proliferation of multi-modal data and the flexible needs of users have drawn attention to the field of Cross-Modal Retrieval (CMR), which can perform image-sketch matching, text-image matching, audio-video matching and near infrared-visual image matching. Such retrieval is useful in many applications like criminal investigation, recommendation systems and person re-identification. The real challenge in CMR is to preserve semantic similarities between the various modalities of data. To preserve semantic similarities, existing deep learning-based approaches use pairwise labels and generate binary-valued representations. The generated binary-valued representations provide fast retrieval with low storage requirements. However, the relative similarity between heterogeneous data is ignored. The objective of this work is therefore to reduce the modality gap by preserving relative semantic similarities among the various modalities. To this end, a model named "Deep Cross-Modal Retrieval (DCMR)" is proposed, which takes triplet labels as input and generates binary-valued representations. Triplet labels place semantically similar data points closer together and dissimilar points farther apart in the vector space. Extensive experiments are performed and the results are compared with deep learning-based approaches, showing that the performance of DCMR increases by 2% to 3% for Image→Text retrieval and by 2% to 5% for Text→Image retrieval in mean average precision (mAP) on the MSCOCO, XMedia, and NUS-WIDE datasets. Thus, the binary-valued representations generated from triplet labels preserve relative semantic similarities better than those generated from pairwise labels.
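The triplet objective mentioned in the abstract above can be illustrated with a generic triplet-margin loss. This is a standard formulation sketched in PyTorch, not the exact DCMR training objective; the margin value and embedding sizes are placeholders.

import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet objective: pull the positive closer to the anchor than the negative."""
    d_pos = F.pairwise_distance(anchor, positive)   # distance to a semantically similar item
    d_neg = F.pairwise_distance(anchor, negative)   # distance to a dissimilar item
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# Illustrative usage with random embeddings standing in for modality encodings.
a, p, n = torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 64)
print(triplet_margin_loss(a, p, n).item())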
Preprint
Script event prediction aims to predict the subsequent event given the context. This requires the capability to infer the correlations between events. Recent works have attempted to improve event correlation reasoning by using pretrained language models and incorporating external knowledge (e.g., discourse relations). Though promising results have been achieved, some challenges still remain. First, the pretrained language models adopted by current works ignore event-level knowledge, resulting in an inability to capture the correlations between events well. Second, modeling correlations between events with discourse relations is limited because it can only capture explicit correlations between events with discourse markers, and cannot capture many implicit correlations. To this end, we propose a novel generative approach for this task, in which a pretrained language model is fine-tuned with an event-centric pretraining objective and predicts the next event within a generative paradigm. Specifically, we first introduce a novel event-level blank infilling strategy as the learning objective to inject event-level knowledge into the pretrained language model, and then design a likelihood-based contrastive loss for fine-tuning the generative model. Instead of using an additional prediction layer, we perform prediction by using sequence likelihoods generated by the generative model. Our approach models correlations between events in a soft way without any external knowledge. The likelihood-based prediction eliminates the need to use additional networks to make predictions and is somewhat interpretable since it scores each word in the event. Experimental results on the multi-choice narrative cloze (MCNC) task demonstrate that our approach achieves better results than other state-of-the-art baselines. Our code will be available at https://github.com/zhufq00/mcnc.
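The likelihood-based prediction described above can be sketched generically: score each candidate event by its sequence log-likelihood under a causal language model and pick the highest-scoring one. The sketch below uses an off-the-shelf gpt2 model from the transformers library purely as a stand-in; it is not the authors' event-centric fine-tuned model, and the example texts are made up.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in model only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(text):
    """Total log-likelihood of a text under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss                   # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

context = "Tom ordered a coffee. He paid the cashier."
candidates = ["He sat down and drank it.", "He launched a satellite."]
print(max(candidates, key=lambda c: log_likelihood(context + " " + c)))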
Full-text available
Chapter
Our goal is to develop a semantic theory that is equally suitable for the lexical material (words) and for the larger constructions (sentences) put together from these. In 2.1 we begin with the system of lexical categories that are routinely used in generative grammar as preterminals mediating between syntax and the lexicon. Morphology is discussed in 2.2, where subdirect composition is introduced. This notion is further developed in 2.3, where the geometric view is expanded from the standard word vectors and the voronoids introduced in Chapter 1 to include non-vectorial elements that express binary relations.
Full-text available
Chapter
Until this point, we concentrated on the lexicon, conceived of as the repository of shared linguistic information. In 8.1 we take on the problem of integrating real-world knowledge, nowadays typically stored in knowledge graphs as billions of RDF triples, and linguistic knowledge, stored in a much smaller dictionary, typically compressible to a few megabytes. We present proper names as point vectors (rather than the polytopes we use for common nouns and most other lexical entries), and introduce the notion of content continuations, algorithms that extend the lexical entries to more detailed hypergraphs that can refer to technical nodes, such as Date, FloatingPointNumber, or Obligation (see 9.1) that are missing from the core lexicon.
Full-text available
Chapter
Adjectives are present in most, though not necessarily all, natural languages. In 7.1 we begin by discussing the major properties of adjectival roots and the vector semantics associated with the base, comparative, and superlative forms. We discuss the logic associated with these, and extend the analysis to intensifiers.
Full-text available
Chapter
We started with Lewin’s aphorism, “there is nothing as practical as a good theory”. Vector semantics, the broad theory that was raised from a Firthian slogan to a computational theory by Schütze (1993), has clearly proven its practicality on a wide range of tasks, from Named Entity Recognition (see 8.1) to sentiment analysis. But the farther we move from basic labeling and classification tasks, the more indirect the impact becomes, until we reach a point where some conceptual model needs to be fitted to the text. Perhaps the best-known such problem is time extraction and normalization, where our target model is the standard (Gregorian) calendar rather than the simple (naive) model we discussed in 3.2. In 9.1, based almost entirely on the work of Gábor Recski and his co-workers at TU Wien, we outline a system that probes for matches with a far more complex conceptual model, that of the building codes and regulations in effect in the city of Vienna.
Full-text available
Chapter
In this chapter we describe a rational, but low-resolution, model of probability. We do this for two reasons: first, to show how a naive theory, using only discrete categories, can still explain how people think about uncertainty, and second, as a model for fitting discrete theories of valuation (which arise in many other contexts, from moral judgments to household finance) into the overall 4lang framework.
Full-text available
Chapter
The notion of modality is almost inextricably intertwined with metaphysics, some kind of theory of what is real, what exists, and why (a theory of ‘first causes’). At the center of the commonsensical theory is the real world, but the idea is that there exist, or at least there can exist, other worlds.
Full-text available
Chapter
Our goal in this chapter is to provide a formal theory of negation in ordinary language, as opposed to the formal theory of negation in logic and mathematics. In order to provide a linguistically and cognitively sound theory of negation, we argue for the introduction of a dyadic negation predicate, lack, and a force-dynamic account of affirmation and negation in general.
Article
We investigate a novel dataset of more than half a million 15-second transcribed audio snippets containing COVID-19 mentions from major US TV stations throughout 2020. Using Latent Dirichlet Allocation (LDA), an unsupervised machine learning algorithm, we identify seven COVID-19-related topics discussed in US TV news. We find that several topics identified by the LDA predict significant and economically meaningful market reactions on the next day, even after controlling for the general TV tone derived from a field-specific COVID-19 tone dictionary. Our results suggest that COVID-19-related TV content had non-negligible effects on financial markets during the pandemic.
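As a rough illustration of the topic-modeling step mentioned above, the sketch below fits an LDA model with the gensim library on a toy corpus. The cited study fit seven topics on more than half a million transcribed TV snippets; the toy example uses two topics and made-up token lists.

from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized "snippets" standing in for the transcribed TV data.
docs = [
    ["covid", "vaccine", "trial", "results"],
    ["market", "stocks", "fell", "covid", "fears"],
    ["vaccine", "rollout", "hospital", "cases"],
    ["stocks", "rally", "stimulus", "market"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Two topics for the toy corpus; the cited study identified seven.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0, passes=20)
for topic_id, terms in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in terms])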
Article
The lexicalization of morphologically complex words, i.e. their inclusion in the lexicon, can involve a loss of semantic compositionality. This phenomenon, called demotivation, has been overlooked in both morphological and lexical studies, notably regarding its gradual nature. This paper compares two measures of demotivation based on experimental and distributional semantics approaches. It builds on the evaluation of 78 pairs of French verbs and derived nouns selected to represent three levels of demotivation. The comparison of the two approaches, using speakers' judgements and word-vector similarity, indicates convergence on the identification of demotivation degrees within a continuum, while also highlighting specific aspects of each method. The study provides direction for further research on morphosemantic demotivation, bridging semantic, morphological and methodological considerations.
Article
Table summarization can be of great help: it generates a concise and informative overview of a table to help users understand the table easily and unambiguously. A high-quality summary needs two desirable properties: it presents the notable entities in the table, and it achieves broad coverage and high diversity across domains. However, notability and domain are often neglected in table summarization. Thus, in this paper, we present a framework for domain-aware table summarization that is able to: (1) identify notable entities using a popularity-sensitive notability evaluation algorithm, (2) find core domains with a measure of domain centrality, and (3) output the final high-quality summary using a three-phase clustering-based algorithm. The experimental results show that our summarization method outperforms state-of-the-art methods by 9.62%, 2.78% and 6.77% on the coverage, diversity, and notability metrics, respectively. We also conduct a user study demonstrating that, with the help of our summarization technique, people improve the accuracy of their understanding of tables by 17%.
Chapter
How to find high-quality articles among many articles is the topic of this competition and also a problem that many enterprises want to solve. For this classification problem, approaches ranging from TF-IDF and word2vec to RNNs and LSTMs, and now to transformer-based models such as BERT, have achieved great improvements on NLU tasks. However, for many specific problems, such as recognizing high-quality articles, directly feeding the text into a transformer model does not yield the optimal solution, and further optimizations are needed. In this paper, we add statistical features of articles and knowledge graphs, and inject knowledge-graph entity names into a BERT-based model; the specific methods are described in Sect. 2 and Sect. 3. Finally, our model achieved an F1-score of 83.6 on the official test set and ranked first among all teams in task 2 of CCKS-2022. This paper is divided into four parts: 1) the introduction of our task; 2) the main ideas of our model; 3) other innovation strategies; 4) experiments and results.
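Combining handcrafted statistical features with a transformer text representation, as described above, commonly follows the fusion pattern sketched below. This is a generic illustration, not the authors' exact CCKS-2022 model; the model name, feature count, and class count are placeholders.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class FeatureFusedClassifier(nn.Module):
    """Concatenate handcrafted statistical features with the [CLS] text representation."""
    def __init__(self, model_name="bert-base-chinese", n_stat_features=8, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden + n_stat_features, n_classes)

    def forward(self, input_ids, attention_mask, stat_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                        # [CLS] token representation
        return self.head(torch.cat([cls, stat_features], dim=-1))

# Illustrative usage: a single article plus made-up statistical features.
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["An example article body."], return_tensors="pt", padding=True, truncation=True)
model = FeatureFusedClassifier()
logits = model(batch["input_ids"], batch["attention_mask"], torch.randn(1, 8))
print(logits.shape)   # torch.Size([1, 2])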
Full-text available
Preprint
Typical drug discovery and development processes are costly, time consuming and often biased by expert opinion. Aptamers are short, single-stranded oligonucleotides (RNA/DNA) that bind to target proteins and other types of biomolecules. Compared with small-molecule drugs, aptamers can bind to their targets with high affinity (binding strength) and specificity (uniquely interacting with the target only). The conventional development process for aptamers uses a manual procedure known as Systematic Evolution of Ligands by Exponential Enrichment (SELEX), which is costly, slow, dependent on library choice, and often produces aptamers that are not optimized. To address these challenges, in this research we create an intelligent approach, named DAPTEV, for generating and evolving aptamer sequences to support aptamer-based drug discovery and development. Using the COVID-19 spike protein as a target, our computational results suggest that DAPTEV is able to produce structurally complex aptamers with strong binding affinities. Author summary: Compared with small-molecule drugs, aptamer drugs are short RNAs/DNAs that can specifically bind to targets with high strength. Motivated by the goal of discovering novel aptamer drugs as an alternative way to address the long-lasting COVID-19 pandemic, in this research we developed an artificial intelligence (AI) framework for the in silico design of novel aptamer drugs that can prevent the SARS-CoV-2 virus from entering human cells. Our research is valuable as we explore a novel approach for the treatment of SARS-CoV-2 infection, and the AI framework could be applied to address future health crises.