Conference Paper

Efficient Estimation of Word Representations in Vector Space

Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
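The regularities described in the abstract can be probed directly once a set of word vectors is available. Below is a minimal sketch using gensim; the vector file name is a placeholder, and any word2vec-format file (or vectors trained with the models above) would work.

```python
# Minimal sketch (not from the paper): probing semantic/syntactic regularities
# of word vectors with gensim. Assumes gensim is installed and a word2vec-format
# vector file is available locally; "vectors.bin" is a hypothetical path.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Semantic regularity: vector("king") - vector("man") + vector("woman") is close to vector("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Plain pairwise similarity, as used in word-similarity evaluations.
print(wv.similarity("car", "truck"))
```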


... This model consists of shared layers and specific task layers and constructs a shared-layer network based on an attention mechanism to learn the key features of smart contract opcode sequences [20]. CBGRU [21] is a hybrid deep learning model that uses Word2Vec [22] and FastText [23] for word embedding. These serve as two input branches for the feature extraction layer, and finally, in the classification layer, the features extracted from both branches are fused through a connection layer. ...
... Leaf nodes correspond to values associated with each type. For the initialization of the word-embedding matrix, two primary approaches are considered: random values or pretrained vectors obtained from models like CodeBERT [35], Word2Vec [22], GloVe [36], FastText [37], ELMo [38], etc. ...
... While many methods are originally designed for natural language text, audio, video sequences, or knowledge graphs, CodeBERT stands out as a pretrained model uniquely suited for handling bimodal data, accommodating both natural language text and programming languages. ...
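The two initialization strategies discussed in the excerpts above can be illustrated with a short sketch. This is not taken from the cited work; the vocabulary and the "pretrained" vectors are toy stand-ins, and PyTorch is assumed only for illustration.

```python
# Illustrative sketch of the two embedding-matrix initialization strategies:
# random values vs. seeding from pretrained vectors (e.g., Word2Vec/GloVe).
import numpy as np
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "transfer": 1, "balance": 2}   # hypothetical vocabulary
dim = 8

# Option 1: random initialization, learned from scratch during training.
random_emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=dim)

# Option 2: copy pretrained vectors into the matrix where available.
pretrained = {"transfer": np.random.rand(dim), "balance": np.random.rand(dim)}  # stand-in for real vectors
weights = np.random.normal(scale=0.1, size=(len(vocab), dim))
for word, idx in vocab.items():
    if word in pretrained:
        weights[idx] = pretrained[word]
pretrained_emb = nn.Embedding.from_pretrained(torch.tensor(weights, dtype=torch.float32), freeze=False)
```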
Article
Full-text available
With the proliferation of blockchain technology in decentralized applications like decentralized finance, supply chain, and identity management, smart contracts operating on a blockchain frequently encounter security issues such as reentrancy vulnerabilities, timestamp dependency vulnerabilities, tx.origin vulnerabilities, and integer overflow vulnerabilities. These security concerns pose a significant risk of causing substantial losses to user accounts. Consequently, the detection of vulnerabilities in smart contracts has become a prominent area of research. Existing research exhibits limitations, including low detection accuracy in traditional smart contract vulnerability detection approaches and the tendency of deep learning-based solutions to focus on a single type of vulnerability. To address these constraints, this paper introduces a smart contract vulnerability detection method founded on multimodal feature fusion. This method adopts a multimodal perspective to extract three modal features from the lifecycle of smart contracts, leveraging both static and dynamic features comprehensively. Through deep learning models like Graph Convolutional Networks (GCNs) and bidirectional Long Short-Term Memory networks (bi-LSTMs), effective detection of vulnerabilities in smart contracts is achieved. Experimental results demonstrate that the proposed method attains detection accuracies of 85.73% for reentrancy vulnerabilities, 85.41% for timestamp dependency vulnerabilities, 83.58% for tx.origin vulnerabilities, and 90.96% for integer overflow vulnerabilities. Furthermore, ablation experiments confirm the efficacy of the newly introduced modal features, highlighting the significance of fusing dynamic and static features in enhancing detection accuracy.
... Therefore, researchers have proposed various automated methods to assist in RCA, including statistical methods [2], [3], machine learning [4]- [7], and deep learning [8], [9]. Among them, deep learning has been proven to be a powerful tool in a variety of applications, including image classification [10], natural language processing [11], and speech recognition [12]. However, deep learning techniques are not designed explicitly to handle the uncertainty and ambiguity that frequently arises in network analysis. ...
... The fuzzy rule layer is responsible for matching the conditions of fuzzy rules and calculating the utility of each rule. The membership function layer consists of I groups of membership functions, which are combined by selecting one membership function from each group without repetition to form the nodes of this layer. Without loss of generality, Fig. 3 represents the j-th node of the q-th layer of the fuzzy neural network, where the node's input is given by Eq. (11) and the output of the node is ...
Article
Full-text available
In the realm of communication networks, root cause analysis plays a vital role in maintaining efficient and reliable operation. However, existing root cause analysis methods face limitations and drawbacks, including their inability to handle complex data and disturbances, as well as inaccuracies in identifying root causes. To this end, we present the Deep Fuzzy Neural Network approach as an innovative solution. It integrates the strengths of deep learning and fuzzy logic inference: the deep learning component uses a parallel computing fusion of a convolutional neural network and a long short-term memory network to extract spatio-temporal features from sophisticated fault data of the communication network. By leveraging this parallel computing fusion module, the proposed framework effectively addresses the flaws of traditional root cause analysis methods. Furthermore, the incorporation of fuzzy logic enables our model to manage disturbances such as uncertainty and noise inherent in the data, ensuring robust performance. Experimental results also demonstrate that our proposed deep fuzzy neural network approach is an effective method for network root cause analysis, overcoming limitations inherent in existing methods and providing superior accuracy and resilience.
... The Vector Space Model (VSM) [44] refers to the data structure for each document, and BoW refers to what kind of information we can extract from a document. They are different aspects of characterizing texts. Other popular representations include embedding words into vectors, as done in word2vec [45] and fastText [46]. These embeddings are used to capture similarities between words and can be used to train a classifier that achieves good performance in a very short time. ...
... For both deep learning methods (RNN and LSTM), we used Word2Vec embeddings that were trained using the gensim API. Word2vec uses text to build a word vocabulary, which it then uses to learn vector representations with the help of neural networks [45]. The two well-known neural network architectures for generating Word2Vec embeddings are Continuous Bag of Words (CBOW) and Skip-gram. ...
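As a rough illustration of the two training modes mentioned above, the following gensim sketch trains both a CBOW and a Skip-gram Word2Vec model; the two-sentence corpus and the hyperparameters are placeholders, not the settings used in the cited study.

```python
# A minimal gensim sketch of the two Word2Vec training modes (CBOW and Skip-gram).
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog", "sleeps"]]   # toy corpus

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["fox"].shape)                      # (100,)
print(skipgram.wv.most_similar("fox", topn=2))   # nearest neighbours in the toy space
```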
Article
Full-text available
Text classification, also known as text categorization, is a classical task in natural language processing. The main aim of text classification is assigning one or more predefined classes or categories to text documents. Text classification has a wide variety of applications, such as email classification, opinion classification and news article classification. Traditionally, several researchers have proposed different types of approaches for text classification in different domains. In general, most of the approaches contain a sequence of steps: training data collection, pre-processing of training data, extraction of features, selection of features, representation of training documents and selection of classification algorithms. Among these steps, feature selection is one important step to identify the important features in the process of text classification. In this work, a new feature selection technique is proposed to identify the prominent features to improve the accuracy of text classification. The experiment was conducted on the AG news article dataset using three classifiers: Support Vector Machine, Naive Bayes Multinomial and Random Forest. The Random Forest classifier attained good accuracy for text classification among the three classifiers. To improve the accuracy, the experiment continued with two deep learning techniques, Long Short Term Memory and Recurrent Neural Networks, and it was observed that the former achieved good accuracy for text classification. The accuracies obtained in this work are more promising than those of most approaches in text classification.
... It focuses on every single character via a transformer encoder and decodes a series of character distribution probability vectors v_t ∈ R^{w×N_t} through a transformer decoder, where the encoder and decoder each consist of a multi-head attention layer [30] with 8 heads. The final recognized text can be obtained from v_t through the vec2word algorithm in [20]. ...
... The word-embedding method of [20] is used to embed each character into a vector representation v^gt_t ∈ R^{w×N_t}, where w and N_t are the length of the flattened features and the vector dimension. For the recognition tasks of symbol, text, and panel, we use a cross-entropy loss to formulate the optimization function: ...
... To uncover every weakness of the system, the capabilities of the algorithm in question must be examined thoroughly. Hackers and text-based attacks can threaten the security of the ChatGPT system (Mikolov, Chen, Corrado, & Dean, 2013). The control group will use ChatGPT without an NLP algorithm, while the treatment group will receive treatment with an NLP algorithm to improve the security of the information system in ChatGPT (Graves, 2013). The research design used in this study is a posttest-only control group design. ...
Article
Full-text available
ChatGPT is a natural language generation model that can enhance human cognition and produce high-quality text. However, the security of the information system in ChatGPT is a major concern, given the potential for misuse by parties whose intentions are uncertain. The aim of this study is therefore to examine the information system security of ChatGPT using Natural Language Processing (NLP) algorithms. The research method used is a survey with a sample of 100 people. The data were analyzed using descriptive and inferential statistical techniques. The results show that using ChatGPT is safe for most respondents. However, there are certain security issues, such as potential data breaches and misuse of information. To improve the security of the ChatGPT information system, this study recommends developing NLP algorithms to analyze and decode text, as well as using encryption technology to protect user information. In addition, steps should be taken to raise user awareness of the need to protect personal information.
... One example of such models is Word2Vec [36], a neural network model that learns the embeddings of words by using the context (e.g., their neighbouring words) in which the word occurs. These embeddings can then be averaged to compute the embedding of a sentence. ...
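The averaging step described in this excerpt can be sketched in a few lines. The function below assumes `wv` is a gensim KeyedVectors object holding pretrained Word2Vec vectors; it is an illustrative sketch, not the cited paper's implementation.

```python
# Sketch of sentence embedding as the mean of its word vectors.
# `wv` is any gensim KeyedVectors object (assumption).
import numpy as np

def sentence_embedding(tokens, wv):
    vecs = [wv[t] for t in tokens if t in wv]   # skip out-of-vocabulary words
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.mean(vecs, axis=0)

# Example usage (assuming `wv` has been loaded):
# emb = sentence_embedding("new code bases are hard to read".split(), wv)
```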
Article
Full-text available
One of the most time-consuming tasks for developers is the comprehension of new code bases. An effective approach to aid this process is to label source code files with meaningful annotations, which can help developers understand the content and functionality of a code base quicker. However, most existing solutions for code annotation focus on project-level classification: manually labelling individual files is time-consuming, error-prone and hard to scale. The work presented in this paper aims to automate the annotation of files by leveraging project-level labels, and to use the file-level annotations to annotate items at larger levels of granularity, for example, packages and the whole project. We propose a novel approach to annotate source code files using a weak labelling approach and a subsequent hierarchical aggregation. We investigate whether this approach is effective in achieving multi-granular annotations of software projects, which can aid developers in understanding the content and functionalities of a code base more quickly. Our evaluation uses a combination of human assessment and automated metrics to evaluate the annotations’ quality. Our approach correctly annotated 50% of files and more than 50% of packages. Moreover, the information captured at the file level allowed us to identify, on average, three new relevant labels for any given project. We can conclude that the proposed approach is a convenient and promising way to generate noisy (not precise) annotations for files. Furthermore, hierarchical aggregation effectively preserves the information captured at file level, and it can be propagated to packages and the overall project itself.
... The utility of word vector models is that they represent the meaning of words in a geometric form and in ways that have been shown to accurately reflect human semantic cognition (e.g., Dumais et al., 1996; Jones and Mewhort, 2007). Even in contextually uninformed models, words that are semantically similar to one another based on their word vectors cluster closer together in word vector space (Pennington, Socher & Manning, 2014; Mikolov et al., 2013; Devlin, Chang, Lee & Toutanova, 2019). In contextually aware models like BERT (and all subsequent transformer models), words that have similar word senses cluster separately from other word senses. ...
Article
When communicating, individuals alter their language to fulfill a myriad of social functions. In particular, linguistic convergence and divergence are fundamental in establishing and maintaining group identity. Quantitatively characterizing linguistic convergence is important when testing hypotheses surrounding language, including interpersonal and group communication. We provide a quantitative interpretation of linguistic convergence grounded in information theory. We then construct a computational model, built on top of a neural network model of language, that can be deployed to measure and test hypotheses about linguistic convergence in “big data.” We demonstrate the utility of our convergence measurement in two case studies: (1) showing that our measurement is indeed sensitive to linguistic convergence across turns in dyadic conversation, and (2) showing that our convergence measurement is sensitive to social factors that mediate convergence in Internet-based communities (specifically, r/MensRights and r/MensLib). Our measurement also captures differences in which social factors influence web-based communities. We conclude by discussing methodological and theoretical implications of this semantic convergence analysis. (link to full text: https://rdcu.be/dsmYm)
... We also include features extracted during the previous study in the starting datasets. As a text feature, we include a 300-dimensional Word2Vec (Mikolov et al., 2013) embedding vector for each tweet record where each tweet is represented by the average word embedding vectors of the words that make up the tweet. This feature was created using a Word2Vec model trained on the Google News corpus. ...
Article
Full-text available
With the evolution of social media, cyberspace has become the de-facto medium for users to communicate during high-impact events such as natural disasters, terrorist attacks, and periods of political unrest. However, during such high-impact events, misinformation can spread rapidly on social media, affecting decision-making and creating social unrest. Identifying the spread of misinformation during high-impact events is a significant data challenge, given the multi-modal data associated with social media posts. Advances in multi-modal learning have shown promise for detecting misinformation; however, key limitations still make this a significant challenge. These limitations include the explicit and efficient modeling of the underlying non-linear associations of multi-modal data geared at misinformation detection. This paper presents a novel avenue of work that demonstrates how to frame the problem of misinformation detection in social media using multi-modal latent variable modeling and presents two novel algorithms capable of modeling the underlying associations of multi-modal data. We demonstrate the effectiveness of the proposed algorithms using simulated data and study their performance in the context of misinformation detection using a popular multi-modal dataset that consists of tweets published during several high-impact events.
... Word vectors can maintain semantic relationships, grammatical information and contextual information. At present, the mainstream word embedding models are Word2Vec [49,50] (such as Skip-gram and CBOW), SENNA [51], GloVe [52] and FastText [53]. The word embedding method is the most popular method to measure the semantic similarity of texts. ...
Article
Full-text available
Spatial keyword query is a classical query processing mode for spatio-textual data, which aims to provide users the spatio-textual objects with the highest spatial proximity and textual similarity to the given query. However, the top-k result objects obtained by using the spatial keyword query mode are often similar to each other, while users hope that the system can pick top-k typicality results from the candidate query results in order to make users understand the representative features of the candidate result set. To deal with the problem of typicality analysis and typical object selection of spatio-textual data query results, a typicality evaluation and top-k approximate selection approach is proposed. First, the approach calculates the synthetic distances on dimensions of geographic location, textual semantics, and numeric attributes between all spatio-textual objects. And then, a hybrid index structure that can simultaneously support the location, text, and numeric multi-dimension matching is presented in order to expeditiously obtain the candidate query results. According to the synthetic distances between spatio-textual objects, a Gaussian kernel probability density estimation-based method for measuring the typicality of query results is proposed. To facilitate the query result analysis and top-k typical object selection, the Tournament strategy-based and local neighborhood-based top-k typical object approximate selection algorithms are presented, respectively. The experimental results demonstrated that the text semantic relevancy measuring method for spatio-textual objects is accurate and reasonable, and the local neighborhood-based top-k typicality result approximate selection algorithm achieved both the low error rate and high execution efficiency. The source code and datasets used in this paper are available to be accessed from https://github.com/JiaShengS/Typicality_analysis/.
... In recent years, several network representation learning techniques, including DeepWalk [9], Large-scale Information Network Embedding (LINE) [10], and Node2vec [11], have been introduced primarily for homogeneous information networks. All these models are inspired by the skip-gram word2vec model from natural language processing, which automatically learns feature representations of words [12]. Normally, in homogeneous networks, a random walk-based search method is used to generate random walks as a corpus of sequences of connected nodes within the network, from which feature representations are learned with the skip-gram model. ...
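A rough DeepWalk-style sketch of this idea is given below: random walks over a graph are treated as sentences and fed to the skip-gram model. The graph, walk settings, and hyperparameters are illustrative only.

```python
# DeepWalk-style sketch: random walks over a graph become "sentences"
# for skip-gram Word2Vec. Uses networkx and gensim.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()   # small example graph

def random_walk(G, start, length=10):
    walk = [start]
    while len(walk) < length:
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]   # node ids as "words"

walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]
model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1)  # skip-gram over walks
print(model.wv[str(0)][:5])   # first few dimensions of node 0's embedding
```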
Article
Full-text available
Heterogeneous Information Networks (HINs) consist of multiple categories of nodes and edges and encompass rich semantic information. Representing HINs in a low-dimensional feature space is challenging due to its complex structure and rich semantics. In this paper, we focus on link prediction and node classification by learning efficient low-dimensional feature representations of HINs. Metapath-guided walkers have been extensively studied in the literature for learning feature representations. However, the metapath walker does not control the length of random walks, resulting in weak structural and semantic information embeddings. In this work, we present an influence propagation controlled metapath-guided random walk model (called IPCMetapath2Vec) for representation learning in HINs. The model works in three phases: first, we perform node transition to generate a metapath-guided random walk, which is conditioned on two factors: (i) type mapping of the next node according to the metapath, and (ii) compute influence propagation score for each node and detect potential influencers on the walk by a threshold based filter. Next, we provide the collected random walks as input to the skip-gram model to learn each node’s feature representation. Lastly, we employ an attention mechanism that aggregates the learned feature representations of each node from various semantic metapath-guided walks, preserving the importance of different semantics. We use these network representation features to address link prediction and multi-label node classification tasks. Experimental results on two public HIN datasets, namely DBLP and IMDB, show that our model outperforms the state-of-the-art representation learning models such as DeepWalk, Node2vec, Metapath2Vec, and HIN2Vec by 4.5% to 17.2% in terms of micro-F1 score for multi-label node classification and 4% to 14.50% in terms of AUC-ROC score for link prediction.
... We represented each tweet/user's text in four different ways that focused on the content of the text. These methods include the classic bag-of-words (BoW) approach, as well as three state-of-the-art distributed text representation techniques: Word2Vec [48], FastText [43], and BERT [49]. ...
Article
Full-text available
In this study, we present the acquisition and categorization of a geographically-informed, multi-dialectal Albanian National Corpus, derived from Twitter data. The primary dialects from three distinct regions—Albania, Kosovo, and North Macedonia—are considered. The assembled publicly available dataset encompasses anonymized user information, user-generated tweets, auxiliary tweet-related data, and annotations corresponding to dialect categories. Utilizing a highly automated scraping approach, we initially identified over 1,000 Twitter users with discernible locations who actively employ at least one of the targeted Albanian dialects. Subsequent data extraction phases yielded an augmentation of the preliminary dataset with an additional 1,500 Twitterers. The study also explores the application of advanced geotagging techniques to expedite corpus generation. Alongside experimentation with diverse classification methodologies, comprehensive feature engineering and feature selection investigations were conducted. A subjective assessment is conducted using human annotators, which demonstrates that humans achieve significantly lower accuracy rates in comparison to machine learning (ML) models. Our findings indicate that machine learning algorithms are proficient in accurately differentiating various Albanian dialects, even when analyzing individual tweets. A meticulous evaluation of the most salient attributes of top-performing algorithms provides insights into the decision-making mechanisms utilized by these models. Remarkably, our investigation revealed numerous dialectal patterns that, despite being familiar to human annotators, have not been widely acknowledged within the broader scientific community.
... Several language models were used in this study with the Aesop's Fables corpus, including an N-gram model [9], a neural network model (specifically, a CBOW model) [10], a GRU (gated recurrent unit) model [11], a pretrained GPT-2 model [12], and a fine-tuned GPT-2 model. These models were used for text generation. ...
Article
Full-text available
The aim of this study was to analyze the correlation among different automatic evaluation metrics for text generation. In the study, texts were generated from short stories using different language models: an N-gram model, a Continuous Bag-of-Words (CBOW) model, a Gated Recurrent Unit (GRU) model, and a Generative Pre-trained Transformer 2 (GPT-2) model. All models were trained on short Aesop’s fables. The quality of the generated text was measured with various metrics: Perplexity, BLEU score, the number of grammatical errors, Self-BLEU score, ROUGE score, BERTScore, and Word Mover’s Distance (WMD). The resulting correlation analysis of the evaluation metrics showed four groups of correlated metrics. Firstly, perplexity and grammatical errors were moderately correlated. Secondly, BLEU, ROUGE and BERTScore were highly correlated. Next, WMD was negatively correlated with BLEU, ROUGE and BERTScore. On the other hand, Self-BLEU, which measures text diversity within the model, did not correlate with the other metrics. In conclusion, to evaluate text generation, a combination of various metrics should be used to measure different aspects of the generated text.
... Work by Chen et al. [1] explored the potential of LLMs in learning on Text Attributed Graphs (TAGs). They describe the limitations of the shallow embeddings produced by Bag-of-Words [4] and Word2Vec [7], and their difficulty in processing polysemous words [12]. With the advent of LLMs and the breakthrough of products like ChatGPT, they posed two hypotheses for how LLMs could tackle graph learning and, more specifically, node classification. ...
Preprint
Full-text available
Financial cybercrime prevention is an issue of increasing concern for many organisations and governments. As deep learning models have progressed to identify illicit activity on various financial and social networks, the explainability behind the model decisions has been lacklustre, even though the investigative analyst is at the heart of any deep learning platform. In our paper, we present a state-of-the-art, novel multimodal proactive approach to addressing XAI in financial cybercrime detection. We leverage a triad of deep learning models designed to distill essential representations from transaction sequencing, subgraph connectivity, and narrative generation to significantly streamline the analyst's investigative process. Our narrative generation proposal leverages an LLM to ingest transaction details and output a contextual narrative for an analyst to understand a transaction and its metadata much further.
... • The LSTM model incorporates word embeddings, leveraging sophisticated algorithms like Word2Vec or GloVe to encode semantic information into dense vector representations, thereby enriching the model's input feature space [3]. ...
Preprint
Full-text available
In this study, we conduct a comparative analysis of Logistic Regression and Long Short-Term Memory (LSTM) models for sentiment classification across two distinct datasets: the IMDB movie review database and a collection of Twitter posts. The objective is to evaluate the performance and applicability of these models in different textual environments. The study demonstrates that while both models perform competently on the structured IMDB dataset, their effectiveness varies significantly when applied to the more informal and diverse Twitter dataset. This variance highlights the impact of textual characteristics on model performance. Our findings provide insights into the adaptability of Logistic Regression and LSTM models for sentiment analysis tasks and emphasize the importance of considering the nature of the dataset in model selection. This research contributes to the field of natural language processing by underscoring the need for context-aware approaches in sentiment classification.
... The field of semantic vector spaces has evolved through the use of neural models, building upon previous foundational work [41]. While there are many word embedding models available, one prevailing paradigm that utilizes neural networks is known as Word2Vec (W2V) [42,43]. The W2V model generates word embeddings using two main methodologies: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG). ...
Preprint
Full-text available
The construction industry in Australia is characterized by its intricate supply chains and vulnerability to myriad risks. As such, effective supply chain risk management (SCRM) becomes imperative. This paper employs different transformer models and trains them for Named Entity Recognition (NER) in the context of Australian construction SCRM. Utilizing NER, the transformer models identify and classify specific risk-associated entities in news articles, offering a detailed insight into supply chain vulnerabilities. By analysing news articles through different transformer models, we can extract relevant entities and insights related to specific risk taxonomies local (milieu) to the Australian construction landscape. This research emphasises the potential of NLP-driven solutions, like transformer models, in revolutionising SCRM for construction in geo-media specific contexts.
... The recent deep learning-based approaches, including using pre-trained language models (Mikolov et al., 2013;Rossiello et al., 2019;Ushio et al., 2021), are able to generate analogies to some extent, but are currently limited to simple word-level and proportional analogies, such as (ostrich:bird :: lion:?). In contrast, we aim to generate and explain more complex analogies of concepts, e.g. ...
... In Wang and Yang (2015), they considered word embeddings with K-Nearest-Neighbor (KNN) and cosine similarity to search for and substitute similar words. Other pre-trained word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) have been leveraged for that purpose. Furthermore, the authors in Wei and Zou (2019) generated synthetic texts by changing the words through synonym replacement or random insertions, substitutions and deletions, whereas Shou et al. (2022) additionally include the abstract meaning representation graph for the STS task. ...
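The embedding-based synonym replacement described above can be sketched as follows; `wv` is assumed to be a pretrained gensim KeyedVectors object, and real pipelines would typically add filters (POS constraints, stop-word lists, similarity thresholds).

```python
# Sketch of embedding-based synonym replacement: swap a word for one of its
# nearest neighbours in the vector space. `wv` is a gensim KeyedVectors object
# (assumption); p controls the replacement probability per token.
import random

def augment(tokens, wv, p=0.2, topn=5):
    out = []
    for tok in tokens:
        if tok in wv and random.random() < p:
            candidates = [w for w, _ in wv.most_similar(tok, topn=topn)]
            out.append(random.choice(candidates))
        else:
            out.append(tok)
    return out

# Example usage (assuming `wv` has been loaded):
# augment("the movie was surprisingly good".split(), wv)
```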
... So, words like "car" and "airplane" would have close vectors, unlike "computer" and "bottle". Word2Vec [11] is the most widely used family of models for lexical extension. Although it is very good at detecting semantic similarities between words [12,13], it has limitations. ...
Chapter
In Africa, there are several linguistic and dialect clusters used mainly by young people. These youths are excluded from technologies involving their mother tongues. Most research on Natural Language Processing (NLP) tends to focus on languages with large corpora, such as English, Spanish, or other European languages. Although this "good representation" is not seen for African languages, it is not the only reason for the "poor representation" of these languages in the field of NLP compared with European languages. In addition, the globalization of the latter, up to their adoption in some African countries, has created a lack of consideration among local populations towards their native idioms, and the non-standardization of these idioms makes their use quite problematic. However, salutary initiatives and effective works have attempted to find solutions that take into account the lack of data and the particular properties of African languages. This paper presents a review of current methods focused on them, while emphasizing their limits and proposing thoughts on potential solutions.
... This method was created for bioinformatics applications. It uses the skip-gram model of Word2Vec [99] to represent each protein sequence as a single dense n-dimensional vector. Skip-gram is a self-supervised learning technique used to find the words most related to a given word. ...
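A ProtVec-style sketch of this idea: protein sequences are split into overlapping 3-mers, a skip-gram Word2Vec model is trained over them, and a protein is represented by aggregating its k-mer vectors. The sequences and parameters below are toy placeholders, not the thesis's actual setup.

```python
# ProtVec-style sketch: overlapping 3-mers of protein sequences are treated
# as "words" for a skip-gram Word2Vec model.
from gensim.models import Word2Vec

def to_kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["MKVLAAGIV", "MKTAYIAKQR"]            # stand-in protein sequences
corpus = [to_kmers(s) for s in sequences]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)  # skip-gram
# One dense vector per protein: the mean of its k-mer embeddings.
protein_vec = sum(model.wv[kmer] for kmer in corpus[0]) / len(corpus[0])
```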
Thesis
Full-text available
One of the most important tasks in the drug discovery process is to identify possible drug-target interactions (DTIs). The research and development of novel drugs frequently consumes billions of dollars and more than a decade of effort with high failure frequency. As a result, it is critical for pharmaceutical companies to identify novel drug–target interactions (DTIs) by leveraging known DTIs. Existing drugs have well-known qualities and are confirmed to be safe. However, biochemical tests for identifying novel DTIs have limitations in terms of coverage and throughput. As a result, computer approaches for the prediction of DTIs have garnered considerable interest. Computational prediction of DTI has been a prominent topic in the bioinformatics sector for the past decade, and it has substantially sped up drug development. Existing methods can be divided into two categories: network-based and classification-based methods. This thesis focuses on the network-based category that employs methods from the graph representation learning field. We propose two different models fulfilling the tasks of the drug-target binding affinity (DTA) prediction (regression) and the drug target interaction prediction (classification), utilizing graph embeddings and graph convolutions. Under these specifications, a data collection process is described for retrieving drug and protein features from public biological databases. Specifically, we extract SMILES of drugs and amino acid sequences of proteins in FASTA form. Then using several programming tools, we transform the drugs to graphs (where nodes correspond to atoms and edges to atom bonds), and the proteins to vectors using Word2Vec. The DTA approach leverages graph convolutions and 1D convolutions to transform drug and target features respectively, and concatenates the two outputs to consequently predict the binding affinity. In the DTI approach, drug features are extracted using graph convolutions, and then both drugs and targets are used to form a heterogeneous graph. This graph is then transformed by a graph auto-encoder, which generates the predicted interactions. Finally, we thoroughly present the models and their results, and compare them with recent state-of-the-art methods, which demonstrates the effectiveness of our approaches.
Article
Full-text available
Hansen solubility parameters (HSPs) have three components, δd, δp and δh accounting for dispersion forces, polar forces, and hydrogen bonding of a molecule, which were designed to better understand how...
Article
In this study, we present an algorithmic framework integrated within the created software platform tailored for the discovery of novel small-molecule anti-tumor agents. Our approach was exemplified in the context of combatting lung cancer. In the initial phase, target identification for therapeutic intervention was accomplished. Leveraging deep learning, we scrutinized gene expression profiles, focusing on those associated with adverse clinical outcomes in lung cancer patients. Augmenting this, generative adversarial neural (GAN) networks were employed to amass additional patient data. This effort yielded a subset of genes definitively linked to unfavorable prognoses. We further employed deep learning to delineate genes capable of discriminating between normal and tumor tissues based on expression patterns. The remaining genes were earmarked as potential targets for precision lung cancer therapy. Subsequently, a dedicated module was formulated to predict the interactions between inhibitors and proteins. To achieve this, protein amino acid sequences and chemical compound formulations engaged in protein interactions were encoded into vectorized representations. Additionally, a deep learning-based component was developed to forecast IC50 values through experimentation on cell lines. Virtual pre-clinical trials employing these inhibitors facilitated the selection of pertinent cell lines for subsequent laboratory assays. In summary, our study culminated in the derivation of several small-molecule formulas projected to bind selectively to specific proteins. This algorithmic platform holds promise in accelerating the identification and design of anti-tumor compounds, a critical pursuit in advancing targeted cancer therapies.
Article
Full-text available
Background Due to the large volume of online health information, whose quality remains dubious, understanding how artificial intelligence can evaluate health information and surpass human-level performance is crucial. However, the existing studies still lack a comprehensive review highlighting the vital machine and deep learning techniques for the automatic health information evaluation process. Objective Therefore, this study outlines the most recent developments and the current state of the art regarding evaluating the quality of online health information on web pages and specifies the direction of future research. Methods In this article, a systematic literature review is conducted according to the PRISMA statement in eight online databases (PubMed, Science Direct, Scopus, ACM, Springer Link, Wiley Online Library, Emerald Insight, and Web of Science) to identify all empirical studies that use machine and deep learning models for evaluating online health information quality. Furthermore, the selected techniques are compared based on their characteristics, such as health quality criteria, quality measurement tools, algorithm type, and achieved performance. Results The included papers evaluate health information on web pages using over 100 quality criteria. The results show that there are no universal quality dimensions used by health professionals and machine or deep learning practitioners when evaluating health information quality. In addition, the metrics used to assess model performance are not the same as those used to evaluate human performance. Conclusions This systematic review offers a novel perspective on approaching health information quality in web pages that can be used by machine and deep learning practitioners to tackle the problem more effectively.
Conference Paper
Citation count is commonly used as a straightforward metric for measuring the impact of a paper. However, since all citations are treated equally, citation count does not accurately capture the true influence of a particular cited paper on the citing paper. To accurately measure the individual impact of cited papers, it is required to identify those that have a high influence on a citing paper. This paper proposes a method to identify the influential citations using the text of citation contexts, specifically the citing sentences. Citing sentences contain the descriptions of the cited papers and the relationship between the citing paper and each cited paper. The proposed method extracts the descriptions of cited papers from the citing sentences and utilizes them to identify influential references. Experimental results have shown the benefits of using the extracted description of each cited paper.
Article
Volumetric Distributed Denial of Service (DDoS) attacks have been a severe threat to the Internet for more than two decades. Some success in mitigation has been achieved based on numerous defensive techniques created by the research community, implemented by the industry, and deployed by network operators. However, evolution is not a privilege of mitigations, and DDoS attackers have found better strategies and continue to cause harm. A key challenge in winning this race is understanding the various characteristics of DDoS attacks in network traffic at scale and in a realistic manner. In this paper, we propose DDoS2Vec, a novel approach to characterise DDoS attacks in real-world Internet traffic using Natural Language Processing (NLP) techniques. DDoS2Vec is a domain-specific application of Latent Semantic Analysis that learns vector representations of potential DDoS attacks. We look into the link between natural language and computer network communication in a way that has not been previously studied. Our approach is evaluated on a large-scale dataset of flow samples collected from an Internet eXchange Point (IXP) in one year. We evaluate the performance of DDoS2Vec via multi-label classification in a Machine Learning (ML) scenario. DDoS2Vec characterises DDoS attacks more clearly than other baselines - including NLP-based approaches inspired by recent networks research and a basic non-NLP solution.
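For readers unfamiliar with the underlying technique, a generic Latent Semantic Analysis sketch is shown below (this is not the DDoS2Vec pipeline itself): tokenised "flow sentences" are turned into a count matrix and factorised with truncated SVD to obtain dense vector representations. The data and dimensions are toy values.

```python
# Generic LSA sketch: document-term counts factorised with truncated SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["udp 53 large-payload", "tcp syn flood 80", "udp 123 amplification"]  # stand-in flow "sentences"
X = CountVectorizer().fit_transform(docs)                 # sparse count matrix
vectors = TruncatedSVD(n_components=2).fit_transform(X)   # low-rank LSA representations
print(vectors.shape)                                      # (3, 2)
```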
Article
Graph representation learning has been a very active research area in recent years. The goal of graph representation learning is to generate graph representation vectors that capture the structure and features of large graphs accurately. This is especially important because the quality of the graph representation vectors will affect the performance of these vectors in downstream tasks such as node classification, link prediction and anomaly detection. Many techniques have been proposed for generating effective graph representation vectors, which generally fall into two categories: traditional graph embedding methods and graph neural network (GNN)-based methods. These methods can be applied to both static and dynamic graphs. A static graph is a single fixed graph, while a dynamic graph evolves over time and its nodes and edges can be added or deleted from the graph. In this survey, we review the graph embedding methods in both traditional and GNN-based categories for both static and dynamic graphs and include the recent papers published until the time of submission. In addition, we summarize a number of limitations of GNNs and the proposed solutions to these limitations. Such a summary has not been provided in previous surveys. Finally, we explore some open and ongoing research directions for future work.
Article
Full-text available
Introduction Linking free-text addresses to unique identifiers in a structural address database [the Ordnance Survey unique property reference number (UPRN) in the United Kingdom (UK)] is a necessary step for downstream geospatial analysis in many digital health systems, e.g., for identification of care home residents, understanding housing transitions in later life, and informing decision making on geographical health and social care resource distribution. However, there is a lack of open-source tools for this task with performance validated in a test data set. Methods In this article, we propose a generalisable solution (A Framework for Linking free-text Addresses to the Ordnance Survey UPRN database, FLAP) based on a machine learning–based matching classifier coupled with a fuzzy aligning algorithm for feature generation, with better performance than existing tools. The framework is implemented in Python as an open-source tool (available at Link). We tested the framework in a real-world scenario of linking individuals’ (n = 771,588) addresses recorded as free text in the Community Health Index (CHI) of National Health Service (NHS) Tayside and NHS Fife to the Unique Property Reference Number database (UPRN DB). Results We achieved an adjusted matching accuracy of 0.992 in a test data set randomly sampled (n = 3,876) from NHS Tayside and NHS Fife CHI addresses. FLAP showed robustness against input variations including typographical errors, alternative formats, and partially incorrect information. It has also improved usability compared to existing solutions, allowing the use of a customised threshold of matching confidence and selection of top n candidate records. The use of machine learning also provides better adaptability of the tool to new data and enables continuous improvement. Discussion In conclusion, we have developed a framework, FLAP, for linking free-text UK addresses to the UPRN DB with good performance and usability in a real-world task.
Chapter
This study explores the use of Evolutionary Game Theory (EGT) for the task of sentiment analysis. The proposed approach involves the use of EGT concepts to disambiguate the particular sense of a word and analyze the context in which it is used. Methods involving Evolutionary Game Theory are employed to learn the associations between different words and synsets. Each word is treated as a player and its synset space as its strategy space. The model aims to find the Nash Equilibria to correctly disambiguate all tokens. The SentiWordNet lexicon is used to identify the sentiment of each sentence. The effectiveness of the proposed approach is evaluated using labeled Twitter datasets, in which the WSD (EGT)-wup-lesk-word2vec variation showed an accuracy of 80.6%. The results demonstrate the efficacy of using pre-trained word embeddings and the potential of WSD for sentiment analysis.
Chapter
Blockchain technology has garnered a lot of interest recently, but it has also become a breeding ground for various network crimes. Cryptocurrency, for example, has suffered losses due to network phishing scams, posing a serious threat to the security of blockchain ecosystem transactions. To create a favorable investment environment, we propose a community-enhanced phishing scam detection model in this paper. We approach network phishing detection as a graph classification task and introduce a network phishing detection graph neural network framework. Firstly, we construct an Ethereum transaction network and extract transaction subgraphs, and corresponding content features from it. Based on this, we propose a community-enhanced graph convolutional network (GCN)-based detection model. It extracts more reasonable node representations in the GCN neighborhoods and explores the advanced semantics of the graph by defining community structure and measuring the similarity of nodes in the community. This distinguishes normal accounts from phishing accounts. Experiments on different large-scale real-data sets of Ethereum consistently demonstrate that our proposed model performs better than related methods.
Chapter
Solving object similarity remains a persistent challenge in the field of data science. In the context of e-commerce retail, the identification of substitutable and similar products involves similarity measures. Leveraging the multimodal learning derived from real-world experiences, humans can recognize similar products based solely on their titles, even in cases where significant literal differences exist. Motivated by this intuition, we propose a self-supervised mechanism that extracts strong prior knowledge from product image-title pairs. This mechanism serves to enhance the encoder’s capacity for learning product representations in a multimodal framework. The similarity between products can be reflected by the distance between their respective representations. Additionally, we introduce a novel attention regularization to effectively direct attention toward product category-related signals. The proposed model exhibits wide applicability as it can be effectively employed in unimodal tasks where only free-text inputs are available. To validate our approach, we evaluate our model on two key tasks: product similarity matching and retrieval. These evaluations are conducted on a real-world dataset consisting of thousands of diverse products. Experimental results demonstrate that multimodal learning significantly enhances the language understanding capabilities within the e-commerce domain. Moreover, our approach outperforms strong unimodal baselines and recently proposed multimodal methods, further validating its superiority.
Chapter
Estimating the time of arrival is a crucial task in intelligent transportation systems. The task poses challenges due to the dynamic nature and complex spatio-temporal dependencies of traffic networks. Existing studies have primarily focused on learning the dependencies between adjacent links on a route, often overlooking a deeper understanding of the links within the traffic network. To address this limitation, we propose DeepLink, a novel approach for travel time estimation that leverages a comprehensive understanding of the spatio-temporal dynamics of road segments from different perspectives. DeepLink introduces triplet embedding, enabling the learning of both the topology and potential semantics of the traffic network, leading to an improved understanding of links’ static information. Then, a spatio-temporal dynamic representation learning module integrates the triplet embedding and real-time information, which effectively models the dynamic traffic conditions. Additionally, a local-global attention mechanism captures both the local dependencies of adjacent road segments and the global information of the entire route. Extensive experiments conducted on a large-scale real-world dataset demonstrate the superior performance of DeepLink compared to state-of-the-art methods.
Chapter
In this paper we explore the application of text similarity for building text-rich knowledge graphs, where nodes describe concepts that relate semantically to each other. Semantic text similarity is a basic task in natural language processing (NLP) that aims at measuring the semantic relatedness of two texts. Transformer-based encoders like BERT combined with techniques like contrastive learning are currently the state-of-the-art methods in the literature. However, these methods act as black boxes where the similarity score between two texts cannot be directly explained from their components (e.g., words or sentences). In this work, we propose a method for similarity explainability for texts that are semantically connected to each other in a knowledge graph. To demonstrate the usefulness of this method, we use the Agenda 2030 which consists of a graph of sustainable development goals (SDGs), their subgoals and the indicators proposed for their achievement. Experiments carried out on this dataset show that the proposed explanations not only provide us with explanations about the computed similarity score but also they allow us to improve the accuracy of the predicted links between concepts.
Chapter
The advancement of position acquisition technology has enabled studies based on vehicle trajectories. However, limitations in equipment and environmental factors often result in missing track records, significantly impacting trajectory data quality. Restoring the missing vehicle tracks within the traffic network structure is therefore a fundamental task. Existing research has attempted to address this issue through the construction of neural network models. However, these methods neglect the significance of the bidirectional information of the trajectory and the embedded representation of the trajectory unit. In view of the above problems, we propose a Seq2Seq-based trajectory recovery model that effectively utilizes bidirectional information and generates embedded representations of trajectory units to enhance trajectory recovery performance: the Pre-Training and Bidirectional Semantic enhanced Trajectory Recovery model (PBTR). Specifically, representations of the road network that incorporate time factors are captured by a pre-training technique, and a bidirectional semantics encoder is employed to enhance the expressiveness of the model, followed by an attentive recurrent network to reconstruct the trajectory. The efficacy of our model is demonstrated through its superior performance on two real-world datasets.
Chapter
Offensive content in social media has become a serious issue, due to which its automatic detection is a crucial task. Deep learning approaches for Natural Language Processing (or NLP) have proven to be on or even above human-level accuracy for offensive language detection tasks. Due to this, the deployment of deep learning models for these tasks is justified. However, there is one key aspect that these models lack, which is explainability, in contrast to humans. In this paper, we provide an explainable model for offensive language detection in the case of multi-task learning. Our model achieved an F1 score of 0.78 on the OLID dataset and 0.85 on the SOLID dataset. We also provide a detailed analysis of the model interpretability.
Chapter
Cancer is a complex disease marked by uncontrolled cell growth, potentially leading to tumors and metastases. Identifying cancer types is crucial for treatment decisions and patient outcomes. T cell receptors (TCRs) are vital proteins in adaptive immunity, specifically recognizing antigens and playing a pivotal role in immune responses, including against cancer. TCR diversity makes them promising for targeting cancer cells, aided by advanced sequencing revealing potent anti-cancer TCRs and TCR-based therapies. Effectively analyzing these complex biomolecules requires representations that capture their structural and functional essence. We explore sparse coding for multi-class classification of TCR protein sequences with cancer categories as targets. Sparse coding, a machine learning technique, represents data with informative features, capturing intricate amino acid relationships and subtle sequence patterns. We compute TCR sequence k-mers, applying sparse coding to extract key features. Domain knowledge integration improves the predictive embeddings, incorporating cancer properties like human leukocyte antigen (HLA) types, gene mutations, clinical traits, immunological features, and epigenetic changes. Our embedding method, applied to a TCR benchmark dataset, significantly outperforms baselines, achieving 99.8% accuracy. Our study underscores sparse coding’s potential in dissecting TCR protein sequences in cancer research.
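A hedged sketch of the general k-mer plus sparse-coding idea (not the chapter's exact pipeline): count vectors over amino-acid 2-mers are computed for each sequence, and scikit-learn's DictionaryLearning produces sparse codes that serve as features. The sequences and sizes are toy values.

```python
# Sketch: k-mer count vectors for TCR-like sequences, followed by sparse coding.
from itertools import product
import numpy as np
from sklearn.decomposition import DictionaryLearning

def kmer_counts(seq, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1
    return vec

seqs = ["CASSLGTDTQYF", "CASSPGQGNYGYTF", "CASRRGSSYEQYF"]   # stand-in CDR3 sequences
X = np.stack([kmer_counts(s) for s in seqs])

coder = DictionaryLearning(n_components=2, transform_algorithm="lasso_lars", random_state=0)
codes = coder.fit_transform(X)     # sparse feature representation per sequence
print(codes.shape)                 # (3, 2)
```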
Chapter
This chapter explores the realm of sentiment analysis, covering diverse domains such as text, audio, and facial expressions. It introduces novel approaches that address the limitations of existing methods, emphasizing the significance of semantic relationships, effective fusion of heterogeneous features, and the benefits of multitask learning. The proposed techniques, including word2vec and SVMperf for sentiment classification, RCMSA and CHFFM for audio sentiment analysis, and CMCNN with CCAM and SCAM for facial expression recognition, exhibit superior performance in their respective domains. This chapter sets the stage for the book, showcasing innovative methods that advance the field of sentiment analysis and provide valuable insights for researchers and practitioners alike.
Chapter
This chapter covers text-to-data translation. It begins by providing a historical perspective, briefly covering early relevant work on semantic parsing and meaning representation before moving to newer approaches that utilize word embedding, semantic compositionality, knowledge graph, and large language models. Three key topics are covered in depth: (1) converting sentences to structured form, (2) neural semantic parsing, and (3) recent text-to-SQL models, systems, and benchmarks.