Conference Paper

Efficient Estimation of Word Representations in Vector Space


Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

... This IDF factor emphasizes rare words that potentially carry greater significance. The final word vector representation is produced by multiplying these two components (TF × IDF), creating a comprehensive representation that balances word frequency with contextual importance. Other deep learning-based approaches like Word2Vec [4], GloVe [5], Doc2Vec [6], etc., are used for feature extraction. They are discussed in detail in subsection 2.3. ...
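The TF × IDF weighting described in the excerpt above can be illustrated with a short sketch; the snippet below uses scikit-learn's TfidfVectorizer on a toy corpus and is purely illustrative (the corpus and settings are assumptions, not the cited authors' pipeline).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; in the cited setting this would be the full document collection.
corpus = [
    "word embeddings capture semantic similarity",
    "tf idf emphasizes rare but informative words",
    "frequent words receive lower idf weights",
]

# TfidfVectorizer multiplies term frequency by inverse document frequency internally.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix of TF*IDF weights

# Inspect the weight of each term in the first document.
terms = vectorizer.get_feature_names_out()
for idx in X[0].nonzero()[1]:
    print(f"{terms[idx]}: {X[0, idx]:.3f}")
```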
... Word2vec, introduced in [4], uses two approaches, CBOW (Continuous Bag-of-Words) and Skip-gram, to calculate word vectors or embeddings. Word2vec is a two-layer feed-forward neural network whose input is a one-hot encoding over a vocabulary of size V. ...
... These models generate word embeddings such that semantically closer words are closer to each other in n-dimensional space. The authors of [4] trained word vectors on the Google News corpus, which contains about 6 billion tokens, and evaluated the model on the Semantic-Syntactic Word Relationship test set. The authors found that accuracy improves with the dimensionality of the word vectors and the size of the dataset. ...
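As a concrete illustration of the two architectures named in these excerpts, the sketch below trains CBOW and Skip-gram models with gensim on a toy corpus; the corpus, vector size, and window are placeholder values rather than the settings used in the original paper.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; the original work used corpora with billions of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 selects CBOW (predict the centre word from its context);
# sg=1 selects Skip-gram (predict context words from the centre word).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["king"].shape)                    # (100,)
print(skipgram.wv.similarity("king", "queen"))  # cosine similarity of the two vectors
```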
Article
Full-text available
In today’s competitive recruitment landscape, crafting impactful job outreach messages is essential for attracting top talent. This study presents a novel machine learning and NLP-driven framework for predicting recruiter message quality on professional platforms like LinkedIn, aiming to enhance response rates and hiring success. Our approach leverages a multi-label text classification framework that identifies five critical message attributes: call to action, common ground, credibility, incentives, and personalization. Using a labeled dataset of 97,710 messages annotated across these five categories, we benchmark various machine learning and deep learning models, including Decision Trees, Linear SVM, Logistic Regression, Random Forest, LSTM, and customized transformer-based BERT models. The dataset was meticulously curated to address generalization challenges, with 94,010 samples for training and 3,700 samples in a diversified test set. Model performance was assessed using accuracy, with the customized BERT model achieving 95.67%. Our findings underscore the potential of this framework to enhance recruiter outreach strategies, providing actionable insights to refine message quality and improve candidate engagement. Received: 14 August 2024 | Revised: 29 October 2024 | Accepted: 16 December 2024 Conflicts of Interest The author declares that he has no conflicts of interest to this work. Data Availability Statement Data sharing is not applicable to this article as no new data were created or analyzed in this study. Author Contribution Statement Shaida Muhammad: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration.
... Computational analysis of social groups Word embeddings, which represent lexical items as dense vectors in high-dimensional space (Mikolov, 2013; Pennington et al., 2014), have emerged as powerful analytical tools for quantitative social science (Kozlowski et al., 2019). These computational methods encode both explicit and implicit word relationships, proving particularly valuable for revealing nuanced shifts in societal attitudes and social group representations across different cultural and political contexts. ...
... We employed the skip-gram with negative sampling model (Mikolov, 2013) to train word embeddings for our diachronic analysis. For both corpora, we generated 300-dimensional word vectors with a context window of 3 words. ...
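A minimal gensim configuration matching the setup described in this excerpt (skip-gram with negative sampling, 300-dimensional vectors, a 3-word context window) could look as follows; `tokenized_corpus` is a placeholder for the diachronic corpora.

```python
from gensim.models import Word2Vec

# `tokenized_corpus` stands in for one of the diachronic corpora: a list of tokenized sentences.
tokenized_corpus = [["example", "tokenized", "sentence"], ["another", "short", "sentence"]]

model = Word2Vec(
    tokenized_corpus,
    sg=1,             # skip-gram architecture
    negative=5,       # negative sampling with 5 noise words per positive pair
    hs=0,             # disable hierarchical softmax so negative sampling is used
    vector_size=300,  # 300-dimensional word vectors
    window=3,         # 3-word context window
    min_count=1,      # a real corpus would use a higher frequency cutoff
)
model.save("diachronic_sgns.model")
```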
Preprint
Full-text available
Language encodes societal beliefs about social groups through word patterns. While computational methods like word embeddings enable quantitative analysis of these patterns, studies have primarily examined gradual shifts in Western contexts. We present the first large-scale computational analysis of Chinese state-controlled media (1950-2019) to examine how revolutionary social transformations are reflected in official linguistic representations of social groups. Using diachronic word embeddings at multiple temporal resolutions, we find that Chinese representations differ significantly from Western counterparts, particularly regarding economic status, ethnicity, and gender. These representations show distinct evolutionary dynamics: while stereotypes of ethnicity, age, and body type remain remarkably stable across political upheavals, representations of gender and economic classes undergo dramatic shifts tracking historical transformations. This work advances our understanding of how officially sanctioned discourse encodes social structure through language while highlighting the importance of non-Western perspectives in computational social science.
... In this paper, we use a Word2vec model to obtain the real-valued vector representation (also known as word embeddings in NLP) of the domain names. Word2vec [45], [46] is one of several word embedding techniques, and it uses a shallow neural network to convert each word to a vector of real numbers. Word2vec captures semantic relationships between words by learning from large amounts of text data, and the resulting real-valued vectors depend on the context in which each word appears within the text. ...
... For Skip-gram, the target word is used to predict the surrounding context words within a specified window size. Between CBOW and Skip-gram, we chose CBOW as it is computationally less expensive and faster to train [46]. ...
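The following sketch shows one way such domain-name embeddings could be set up with the CBOW architecture; splitting on dots and the chosen hyperparameters are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from gensim.models import Word2Vec

# Illustrative domain names; splitting on "." treats each label as a "word".
domains = ["api.iot-vendor.example.com", "cdn.news.example.org", "updates.iot-vendor.example.com"]
tokenized = [d.lower().split(".") for d in domains]

# sg=0 selects the CBOW architecture, which is cheaper and faster to train than Skip-gram.
model = Word2Vec(tokenized, sg=0, vector_size=64, window=2, min_count=1)

# Represent a whole domain name as the mean of its label vectors (one common convention).
domain_vec = np.mean([model.wv[label] for label in tokenized[0]], axis=0)
print(domain_vec.shape)  # (64,)
```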
Article
Full-text available
In this paper, we study IoT domain names, the domain names of backend servers on the Internet that are accessed by IoT devices. We investigate how they compare to non-IoT domain names based on their statistical and DNS properties, and the feasibility of classifying these two classes of domain names using machine learning (ML). By surveying past studies that used testbeds with real IoT devices, we construct a dataset of IoT domain names. For the non-IoT dataset, we use two lists of top-visited websites. We study the statistical properties of the domain name lists and their DNS properties. We also leverage machine learning and train six machine learning models to perform the classification between the two classes of domain names. The word embedding technique we use to get the real-value representation of the domain names is Word2vec. Our statistical analysis highlights significant differences in domain name length, label frequency, and compliance with typical domain name guidelines, while our DNS analysis reveals notable variations in resource record availability and configuration between IoT and non-IoT DNS zones. As for the classification of IoT and non-IoT domain names using machine learning, among the models we train, Random Forest achieves the highest performance, yielding the highest accuracy, precision, recall, and F1 score. Our work offers novel insights into IoT, potentially informing protocol design and aiding in network security and performance monitoring.
... The above-referenced method by Troyer et al. 11 has been refined by Kim et al. 31, who developed an automated scoring method to assess the switching of sub-categories. The automated scoring method uses distributional representations to calculate the similarity between two consecutive words based on the word2vec model 33, with switching counted when the similarity between two consecutive words falls below a predefined threshold 31. Further, Ovando-Tellez et al. 34 developed a more efficient and clearer interpretation method to assess semantic memory. ...
... To calculate cosine similarity, word vector representations (numerical vectors that capture the semantic meaning of words) were derived from chiVe, a pre-trained Japanese word2vec model 33,42. This model was trained on a large-scale Japanese web corpus consisting of approximately 100 million web pages, allowing for accurate semantic comparisons between words. ...
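A minimal sketch of the threshold-based switch counting described in these excerpts, assuming a pre-trained word2vec model is available in word2vec text format (the path and threshold value are placeholders; the cited study used the Japanese chiVe model):

```python
from gensim.models import KeyedVectors

# Placeholder path; the cited study used the pre-trained Japanese chiVe word2vec model.
wv = KeyedVectors.load_word2vec_format("word2vec_vectors.txt")

def count_switches(words, threshold=0.4):
    """Count a switch whenever the cosine similarity of two consecutive
    words falls below the threshold (threshold value is illustrative)."""
    switches = 0
    for prev, curr in zip(words, words[1:]):
        if wv.similarity(prev, curr) < threshold:
            switches += 1
    return switches

# Example verbal-fluency sequence; tokens must exist in the loaded vocabulary.
print(count_switches(["cat", "dog", "guitar", "drum"]))
```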
... Though there exists extensive research on collage layout generation, the human-centric concern has not been taken into consideration. Based on studies in the field of cognitive science (Mikolov et al. 2013; Perls, Hefferline, and Goodman 1951) explaining humans' reading habits on informative collage designs, we first introduce two concepts for cognitive-coherence collage layout generation. ...
... tories (Moen and Fee 2000; Mikolov et al. 2013), processing the collage regions row by row from top-left to bottom-right. Therefore, a cognitively coherent layout should bring the logical order (illustrated with blue circles) of screenshots close to the F-pattern reading order (illustrated with orange circles). ...
Article
To enhance the processing of complex multi-modal documents (e.g. e-books, long web pages, etc.), it is an efficient way for users to take digital screenshots of key parts and reorganize them into a new collage E-Note. Existing methods for assisting collage layout design primarily employ a semantic relevance-first strategy, arranging related content together. Though capable, they cannot ensure the visual readability of screenshots and may conflict with natural human reading patterns. In this paper, we introduce CollageNoter for real-time collage layout design that adapts to various devices (e.g. laptop, tablet, phone, etc.), offering users visually and cognitively well-organized screenshot-based E-Notes. Specifically, we construct a novel two-stage pipeline for collage design, including 1) readability-first layout generation and 2) cognitive-driven layout adjustment. In addition, to achieve real-time response and adaptive model training, we propose a cascade transformer-based layout generator named CollageFormer and a size-aware collage layout builder for automatic dataset construction. Extensive experimental results have confirmed the effectiveness of our CollageNoter.
... Text vectorization was performed using the word2vec model, known for its low dimensionality, fixed vector dimensions, low computational cost, and effective word similarity representation [35]. Word2vec includes Skip-gram and CBOW (Continuous Bag-of-Words) models [36]. Skip-gram was chosen for this study due to its context prediction capabilities. ...
Article
Full-text available
Urban forest parks play a vital role in promoting physical activities (PAs) and providing cultural ecosystem services (CESs) that enhance citizens’ well-being. This study aims to reevaluate CESs by focusing on the physical activity experiences of park visitors to optimize park management and enhance citizen satisfaction. This study utilized social media data and employed natural language processing techniques and text analysis tools to examine experiences related to physical activities in Beijing Olympic Forest Park, Xishan Forest Park, and Beigong Forest Park. A specialized sports activity dictionary was developed to filter and analyze comments related to PA, emphasizing the impact of natural environments on enjoyment and participation in PA. The importance–performance analysis (IPA) method was used to assess the service characteristics of each park. The findings reveal that urban forest parks are highly valued by citizens, particularly for their natural landscapes, leisure activities, and the emotional fulfillment derived from PA, with 82.58% of comments expressing positive sentiments. Notably, appreciation for natural landscapes was exceptionally high, as evidenced by the frequent mentions of key terms such as ‘scenery’ (mentioned 2871 times), ‘autumn’ (mentioned 2314 times), and ‘forest’ (mentioned 1439 times), which significantly influence park usage. However, 17.11% of the reviews highlighted dissatisfaction, primarily with the management of facilities and services during sports and cultural activities. These insights underscore the need for performance improvements in ecological environments and sports facilities. This study provides a novel perspective on assessing and optimizing urban forest parks’ functions, particularly in supporting active physical engagement. The rich CESs offered by these parks enhance physical activity experiences and overall satisfaction. The findings offer strategic insights for park managers to better meet citizens’ needs and improve park functionality.
... Both supervised and unsupervised learning algorithms are employed to train models on large-scale biomedical datasets, creating adaptive representation vectors that capture both local and global patterns. Small-scale protein representation learning methods [18][19][20][21][22], inspired by natural language processing techniques like word2vec [23], focused on local patterns but were limited in capturing broader contexts. Deep learning techniques, including convolutional neural networks (CNNs) [24], long short-term memory (LSTM) networks [24][25][26][27][28], and larger-scale transformer-based [29] architectures [24,[30][31][32][33], have since advanced to capture both local and global features, recognising long-range dependencies across entire protein sequences. ...
Preprint
Full-text available
Proteins play a crucial role in almost all biological processes, serving as the building blocks of life and mediating various cellular functions, from enzymatic reactions to immune responses. Accurate annotation of protein functions is essential for advancing our understanding of biological systems and developing innovative biotechnological applications and therapeutic strategies. To predict protein function, researchers primarily rely on classical homology-based methods, which use evolutionary relationships, and increasingly on machine learning (ML) approaches. Lately, protein language models (PLMs) have gained prominence; these models leverage specialised deep learning architectures to effectively capture intricate relationships between sequence, structure, and function. We recently conducted a comprehensive benchmarking study to evaluate diverse protein representations (i.e., classical approaches and PLMs) and discuss their trade-offs. The current work introduces the Protein Representation Benchmark – PROBE tool, a benchmarking framework designed to evaluate protein representations on function-related prediction tasks. Here, we provide a detailed protocol for running the framework via the GitHub repository and accessing our newly developed user-friendly web service. PROBE encompasses four core tasks: semantic similarity inference, ontology-based function prediction, drug target family classification, and protein-protein binding affinity estimation. We demonstrate PROBE’s usage through a new use case evaluating ESM2 and three recent multimodal PLMs—ESM3, ProstT5, and SaProt—highlighting their ability to integrate diverse data types, including sequence and structural information. This study underscores the potential of protein language models in advancing protein function prediction and serves as a valuable tool for both PLM developers and users.
... Therefore, the trade-off between reusability and frugality should be considered when training such generalized models. Smaller but reusable pretrained models, such as word2vec [MCCD13], should be encouraged. ...
Preprint
Full-text available
This research paper, a survey, raises awareness of various related issues and poses open questions for further research in the field of Frugal AI. It is a resource for readers interested in understanding the intersection of AI, sustainability, and innovation. The document highlights the concept of Frugal AI as a means to innovate sustainably and cost-effectively in resource-constrained environments. It emphasizes the environmental impact of AI technologies, the need for optimization, and the importance of governance to ensure responsible AI deployment. The document also poses a series of research questions to stimulate further investigation into the implications of Frugal AI in the economic, social, and environmental spheres.
... This approach enables efficient inference through pre-embedding, making it suitable for real-world scenarios. Frome et al. [9] pioneered this approach by combining CNN and Skip-Gram [27] architectures for visual and language feature extraction, respectively. Faghri et al. [8] proposed the VSE++ method, which integrated online hard negative mining strategy into the triplet loss function. ...
Preprint
Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
... When and why linear structure emerges without explicit bias has been of considerable interest since the era of static word embeddings. Work on skip-gram models (Mikolov et al., 2013a) found that vector space models of language learn regularities which allow performing vector arithmetic between word embeddings to calculate semantic relationships (e.g., Paris − France + Spain ≈ Madrid) (Mikolov et al., 2013b; Pennington et al., 2014). This property was subject to much debate, as it was not clear why word analogies would appear for some relations and not others (Köper et al., 2015; Karpinska et al., 2018; Gladkova et al., 2016). ...
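The vector arithmetic mentioned in this excerpt can be reproduced with gensim's `most_similar`; the sketch below assumes pretrained static vectors downloaded via `gensim.downloader` (any static embedding exposing the same interface works the same way).

```python
import gensim.downloader as api

# Downloads a small set of pretrained GloVe vectors (an assumption for illustration).
wv = api.load("glove-wiki-gigaword-100")

# Madrid ≈ Paris − France + Spain: add "paris" and "spain", subtract "france".
result = wv.most_similar(positive=["paris", "spain"], negative=["france"], topn=1)
print(result)  # e.g. [('madrid', ...)] if the regularity holds in this embedding space
```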
Preprint
Full-text available
Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded `linearly' in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models' linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models' pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models' training data to meet specific frequency thresholds.
... The scaling factor controls the sensitivity of the node reconstruction loss to feature discrepancies. E+ refers to the set of observed (positive) edges in the graph, while E− represents a set of negative edges generated through negative sampling [22,26]. The structural decoder is parameterized and denoted as edge(·). ...
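As a small illustration of the negative sampling of edges referenced here, the sketch below draws node pairs that do not appear in the observed edge set E+; the uniform sampling scheme and function names are assumptions for illustration, not the paper's exact procedure.

```python
import random

def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Draw node pairs uniformly at random that are not in the observed edge set E+."""
    rng = random.Random(seed)
    positive = set(positive_edges) | {(v, u) for u, v in positive_edges}  # treat edges as undirected
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in positive and (u, v) not in negatives and (v, u) not in negatives:
            negatives.add((u, v))
    return list(negatives)

E_pos = [(0, 1), (1, 2), (2, 3)]
print(sample_negative_edges(num_nodes=5, positive_edges=E_pos, num_samples=3))
```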
Preprint
Graph self-supervised learning has gained significant attention recently. However, many existing approaches heavily depend on perturbations, and inappropriate perturbations may corrupt the graph's inherent information. The Vector Quantized Variational Autoencoder (VQ-VAE) is a powerful autoencoder extensively used in fields such as computer vision; however, its application to graph data remains underexplored. In this paper, we provide an empirical analysis of vector quantization in the context of graph autoencoders, demonstrating its significant enhancement of the model's capacity to capture graph topology. Furthermore, we identify two key challenges associated with vector quantization when applying in graph data: codebook underutilization and codebook space sparsity. For the first challenge, we propose an annealing-based encoding strategy that promotes broad code utilization in the early stages of training, gradually shifting focus toward the most effective codes as training progresses. For the second challenge, we introduce a hierarchical two-layer codebook that captures relationships between embeddings through clustering. The second layer codebook links similar codes, encouraging the model to learn closer embeddings for nodes with similar features and structural topology in the graph. Our proposed model outperforms 16 representative baseline methods in self-supervised link prediction and node classification tasks across multiple datasets.
... TF_IDF(w,t) denotes the TF-IDF of word w at t. For vector acquisition, word2vec [36] is used. ...
Article
Full-text available
In an academic paper search to confirm the novelty of a research project, it is important to improve the recall score for the number of search results that users can check, in order to comprehensively collect research papers related to the user’s information need. However, it may be necessary to check papers in the lower ranks to cover the relevant papers comprehensively when using a single search method. To improve the recall score for the number of papers that users can check, we considered that it would be effective to integrate ranking results from multiple search methods with different approaches. This is based on the idea that relevant papers that do not appear in the higher ranks for one method can be found using other methods, and that this effect would be amplified by integrating more ranking results. As the methods to be integrated, we used the ranking methods of the vector space model, the query likelihood model, and a newly proposed method. Our method is based on a topic-based Boolean search that uses the topic analysis result from latent Dirichlet allocation, and ranks papers in descending order of the number of times they are included in each search result. We performed an evaluation using the NTCIR-1 and NTCIR-2 datasets, and confirmed that our topic-based search method showed different trends from rankings based on the conventional vector space model and query likelihood model. Furthermore, the re-ranking using these three methods showed the best performance compared with other combinations.
... CRISPR-Cas9 principally uses three encoding methods: one-hot encoding, word embedding, and DNA2vec. A specific word or string is represented by a distinct vector representation during embedding, as in word2vec or DNA2vec [79,80]. A vector representation can be created from an sgRNA-DNA sequence by splitting it into substrings of length k, or k-mers. ...
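The k-mer splitting described in this excerpt can be sketched as follows: sequences are tokenized into overlapping substrings of length k and fed to a word2vec-style model (a simplified, DNA2vec-like illustration; k and the hyperparameters are placeholders).

```python
from gensim.models import Word2Vec

def kmers(sequence, k=3):
    """Split a nucleotide sequence into overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Illustrative sequences; real pipelines use large collections of guide-target pairs.
sequences = ["GACGTTACGGATCC", "TTACGGATCCGACG", "ACGGATCCGACGTT"]
corpus = [kmers(seq, k=3) for seq in sequences]

# Train word2vec-style embeddings over the k-mer "vocabulary".
model = Word2Vec(corpus, vector_size=32, window=4, min_count=1, sg=1)
print(model.wv["GAC"][:5])  # first few dimensions of the 3-mer embedding
```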
Article
Full-text available
The precision and user-friendliness of genome-editing tools, particularly clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein 9 (Cas9) (CRISPR/Cas9), have transformed biological research and medicinal development. This study covers additional genome-editing mechanisms, such as zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and CRISPR/Cas9. It explores the workings of CRISPR/Cas9, detailing its evolution as an adaptive defense system in bacteria and its ability to break DNA at specific sites to create precise double-strand breaks (DSB) at predetermined genomic locations. The principal objective is to improve the precision and efficacy of guide RNA (gRNA) designs, as this is essential for reducing off-target effects (OFTEs) and optimizing on-target effects (OTEs). Recent advancements in computational models employing machine learning (ML) and deep learning (DL) techniques are addressed to more correctly anticipate gRNA outcomes. A comprehensive overview and comparative assessment of traditional ML and advanced DL techniques, including graph convolution networks (GCN) utilized in CRISPR/Cas9 applications, are provided. Major research problems and future directions in the prediction of target activity are highlighted along with recent advancements in single-guide RNA (sgRNA)–DNA interaction encodings that bolster current models for both OTE and OFTE. The potential of different ML and DL algorithms to refine the precision of genetic changes by fine-tuning gRNA’s specificity is examined. The study also emphasizes the critical epigenetic data in predicting the outcomes of CRISPR/Cas9 treatments. Current challenges are examined, and directions for further study to enhance the precision of OTE and OFTE predictions are outlined. This work serves as an essential resource for genomic engineering researchers, providing a critical evaluation of existing approaches and suggestions for future developments in genome-editing technology.
... Shift Positive Pointwise Mutual Information (SPPMI) matrix, proposed by [31], aggregates all -step probability transition matrices to store contextual information of words and avoid sparsity problem, and utilizes a linear transformation on the aggregated matrix that achieves promising performance in NLP. Deriving SPPMI is equivalent to optimizing random walk models based on skip-gram with negative sampling model [32], implying that SPPMI matrix can preserve structural information as much as possible. Therefore, we treat network nodes and their structural information as "words" and "contextual semantic information", respectively, and then adopt SPPMI to store multiscale structural information in NRL. ...
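For reference, a shifted positive PMI matrix can be derived from a word-context co-occurrence count matrix as in the sketch below; the toy counts and shift value k are illustrative, and this is not the cited paper's implementation.

```python
import numpy as np

def sppmi(cooc, k=5):
    """Shifted Positive PMI: max(PMI(w, c) - log k, 0), computed from a
    word-context co-occurrence count matrix (k is the negative-sampling shift)."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # word marginals
    col = cooc.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0            # zero out -inf/NaN from empty cells
    return np.maximum(pmi - np.log(k), 0.0)

cooc = np.array([[10.0, 2.0, 0.0],
                 [2.0, 8.0, 1.0],
                 [0.0, 1.0, 6.0]])
print(sppmi(cooc, k=2))
```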
Article
Full-text available
In recent years, network representation learning (NRL) has attracted increasing attention due to its efficiency and effectiveness in analyzing network structural data. NRL aims to learn low-dimensional representations of nodes while preserving their structural information, and preserving multiscale structural information of nodes is important for NRL. Deep learning-based algorithms are popular owing to their good performance in learning network representations, but they lack sufficient interpretability as black boxes. In this study, we propose a novel algorithm called Multiscale structural information-based Laplacian generative adversarial Network Representation Learning (MLNRL). This algorithm consists of two components: 1) a multiscale structural information preserving component, where a shift positive pointwise mutual information matrix (SPPMI) is calculated for storing multiscale structural information; 2) a Laplacian generative adversarial learning component, where the ideas of Laplacian pyramid and generative adversarial networks are leveraged to generate robust and meaningful representations. We apply our model to three downstream tasks on real-world datasets for evaluation, and the results show that our model outperforms the baselines in almost all cases. Then, we performed an ablation study and verified the necessity of both components. We also investigate the hyperparameter sensitivity to prove the robustness of MLNRL.
... To compute semantic similarity, all object images were first given names at the basic category level (e.g., "cat", "monkey", "guitar", "bag"). The semantic similarity of these names was then determined using Word2Vec (Mikolov et al., 2013; Řehůřek & Sojka, 2010), a method of natural language processing (NLP). Through modelling text within large corpora, Word2Vec uses word co-occurrence patterns to identify words with similar contexts and then maps them to vectors that can be considered in terms of their cosine similarity. ...
Preprint
Full-text available
Visual memory search involves comparing a probe item against multiple memorized items. Previous work has shown that distractor probes from a different object category than the objects in the memory set are rejected more quickly than distractor probes from the same category. Because objects belonging to the same category usually share both visual and semantic features compared with objects of different categories, it is unclear whether the category effects reported in previous studies reflected category-level selection, visual similarity, and/or semantic target-distractor similarity. Here, we employed old/new recognition tasks to examine the role of categorical, semantic, and visual similarity in short- and long-term memory search. Participants (N=64) performed visual long-term memory (LTM) or short-term memory (STM) search tasks involving animate and inanimate objects. Trial-wise RT variability to distractor probes in LTM and STM search was modelled using regression analyses that included predictors capturing categorical target-distractor similarity (same or different category), semantic target-distractor similarity (from a distributional semantic model), and visual target-distractor similarity (from a deep neural network). We found that for both memory tasks, categorical, semantic, and visual similarity all explained unique variance in trial-wise memory search performance. However, their respective contributions varied with memory set size and task, with STM performance being relatively more strongly influenced by visual and categorical similarity and LTM performance being relatively more strongly influenced by semantic similarity. These results clarify the nature of the representations used in memory search and reveal both similarities and differences between search in STM and LTM.
... Text embeddings, such as GloVe [59] and SentenceBERT [62], are popular vector-based representations of text in NLP. In the embedding space, texts with similar semantic meanings (e.g., month names like "january", "february", etc.) tend to cluster closely together, while those with unrelated meanings (e.g., "january" and color names like "yellow") are positioned further apart [53,59,62]. ...
Preprint
Full-text available
Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain-experts to first manually specify data-quality constraints specific to a given table, before data cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any tables, without requiring domain-experts to manually specify on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code are available at https://github.com/qixuchen/AutoTest to facilitate future research.
... Adding a phosphate or alcohol group to two different molecules should change both coordinates in a similar manner. Our method for lipids is a reimplemented and adjusted version of Mol2Vec [8], a technique from the small-molecule literature which is, in turn, based on Word2Vec [9], a word embedding method from natural language processing. To embed words, one first defines a vocabulary and gives each word a unique token. ...
Article
Full-text available
A shallow neural network was used to embed lipid structures in a 2- or 3-dimensional space with the goal that structurally similar species have similar vectors. Tests on complete lipid databanks show that the method automatically produces distributions which follow conventional lipid classifications. The embedding is accompanied by the web-based software, Lipidome Projector. This displays user lipidomes as 2D or 3D scatterplots for quick exploratory analysis, quantitative comparison and interpretation at a structural level. Examples of published data sets were used for a qualitative comparison with literature interpretation.
... The CoT reasoning method enables students to break down and logically progress through complex problems, ultimately leading to improved comprehension, problem-solving abilities, and cognitive engagement. 14 As shown in Fig. 1, the Experimental System Architecture includes four parts: ...
Article
Full-text available
In the context of rapid advancements in educational technology, Active Interactive Learning Environments (ILEs) have emerged as key tools for enhancing instructional outcomes and student engagement. This paper presents an innovative active interactive learning system based on multi-modal chain-of-thought (CoT) reasoning, aiming to optimize personalized support in the learning process by integrating multi-modal data (including text, images, and videos) with CoT techniques. The system utilizes advanced deep learning technologies such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Retrieval-Augmented Generation (RAG) to achieve dynamic, real-time personalized content generation and feedback mechanisms. Experimental results indicate that this system significantly improves student performance and engagement in the “Computer Networks” course, demonstrating its effectiveness in practical teaching settings. This study provides a solid theoretical foundation and practical guidance for further research and application of intelligent educational systems, highlighting the potential for driving future educational innovation.
... In this process, the walkers stop when they reach the given path length. The paths are treated as sentences and used in the SkipGram model [41] to learn the latent representation. The original DeepWalk was unbiased and was later extended to a biased version by Cochez et al. [10]. ...
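The walk-as-sentence idea described here can be sketched with a few lines of Python: unbiased random walks are generated over a toy graph and passed to gensim's SkipGram implementation (the graph, walk length, and hyperparameters are placeholders).

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length, rng):
    """Unbiased random walk of fixed length, in the spirit of the original DeepWalk."""
    walk = [start]
    while len(walk) < length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return [str(node) for node in walk]  # gensim expects string tokens

rng = random.Random(0)
G = nx.karate_club_graph()  # toy graph for illustration
walks = [random_walk(G, n, length=10, rng=rng) for n in G.nodes() for _ in range(5)]

# The walks are treated as "sentences" and fed to the SkipGram model (sg=1).
model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv[str(0)].shape)  # 64-dimensional latent representation of node 0
```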
Preprint
Full-text available
Random walks are a primary means for extracting information from large-scale graphs. While most real-world graphs are inherently dynamic, state-of-the-art random walk engines fail to efficiently support such a critical use case. This paper takes the initiative to build a general random walk engine for dynamically changing graphs with two key principles: (i) This system should support both low-latency streaming updates and high-throughput batched updates. (ii) This system should achieve fast sampling speed while maintaining acceptable space consumption to support dynamic graph updates. Upholding both standards, we introduce Bingo, a GPU-based random walk engine for dynamically changing graphs. First, we propose a novel radix-based bias factorization algorithm to support constant time sampling complexity while supporting fast streaming updates. Second, we present a group-adaption design to reduce space consumption dramatically. Third, we incorporate GPU-aware designs to support high-throughput batched graph updates on massively parallel platforms. Together, Bingo outperforms existing efforts across various applications, settings, and datasets, achieving up to a 271.11x speedup compared to the state-of-the-art efforts.
... In language models, words are converted into dense vectors (i.e., semantic embeddings) through the use of techniques such as word2vec or Bidirectional Encoder Representations from Transformers (BERT) trained on large corpora (e.g., Wikipedia) [19,33]. These embeddings can capture complex word semantics and greatly benefit downstream tasks, especially those with a small sample size [34]. ...
Preprint
Full-text available
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
... • Doc2Vec+SVM [64] is a paragraph embedding technique based on Word2Vec [65]. It learns representation vectors using skip-gram and CBOW models and is considered an unsupervised learning method for learning latent document representations. ...
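A minimal sketch of the Doc2Vec+SVM baseline pattern mentioned above, using gensim's Doc2Vec and a linear SVM from scikit-learn; the toy documents, labels, and hyperparameters are placeholders rather than the cited configuration.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC

# Toy documents and binary labels (placeholders for the news corpus used in the paper).
docs = ["breaking news about the election", "celebrity spotted at airport",
        "fabricated claim spreads online", "official report released today"]
labels = [0, 0, 1, 0]

tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

# Unsupervised paragraph embeddings (PV-DM by default; dm=0 would give PV-DBOW).
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
X = [d2v.dv[i] for i in range(len(docs))]

# A linear SVM is then trained on the learned document vectors.
clf = LinearSVC().fit(X, labels)
print(clf.predict([d2v.infer_vector("new unverified claim circulates".split())]))
```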
Article
Full-text available
This paper proposes the joint learning model Multi-Granularity Semantic Relation Learning and Meta-Path Structure Interaction Learning for fake news detection (MGMP). The MGMP improves global semantic relation learning through a multi-granularity process involving coarse-grained and fine-grained learning modules, along with meta-path based global interaction learning. It begins by refining global semantic recognition accuracy at the word-level and document-level through attention mechanisms and convolutional neural networks. Furthermore, it enhances global interaction learning by enhancing meta-path instance representations with various meta-paths and employing multi-head self-attention mechanisms within the network structure. Experimental findings on real datasets confirm the effectiveness of the MGMP in fake news detection by enhancing global semantic recognition accuracy in news nodes and recognizing network structural characteristics.
... For instance, some models such as BERT or RoBERTa feature 12 layers, whereas other models can feature more layers. This is a name borrowed from the tradition of word embeddings, retroactively called 'static embeddings', which is a popular distributional method for representing words in NLP research, e.g. [109]. ...
Preprint
Full-text available
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
... In the feature extraction step, the text is transformed into numeric vectors. Techniques based on frequency, such as Bag-of-Words, also known as Count Vectorizer (CV), and TF-IDF, word embedding methods like Word2Vec [20], GloVe [21], and FastText [22], as well as language models like RoBERTa [23] and Falcon [24], were utilized. The selected textual representation methods are widely used in fake news detection tasks and have achieved strong performance. ...
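To make the feature-extraction step concrete, the sketch below produces three of the textual views named in this excerpt (Count Vectorizer, TF-IDF, and averaged Word2Vec vectors) on a toy corpus; it is illustrative only and not the paper's pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

texts = ["shocking claim goes viral", "official agency denies report", "experts verify the statement"]
tokens = [t.split() for t in texts]

cv_view = CountVectorizer().fit_transform(texts).toarray()     # Bag-of-Words / Count Vectorizer view
tfidf_view = TfidfVectorizer().fit_transform(texts).toarray()  # TF-IDF view

# Word2Vec view: average the word vectors of each document (a common convention).
w2v = Word2Vec(tokens, vector_size=50, min_count=1)
w2v_view = np.array([np.mean([w2v.wv[w] for w in doc], axis=0) for doc in tokens])

print(cv_view.shape, tfidf_view.shape, w2v_view.shape)
```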
Preprint
Full-text available
Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.
... Published Reference Models We found two published reference models for the room classification task. The second version of the Hydra model utilizes pre-trained word2vec (Mikolov et al. 2013) vectors to represent object semantic labels and concatenates them with the geometric feature vectors to perform room classification. It has been tested on the Matterport3D dataset. ...
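The concatenation step described here (semantic label embeddings joined with geometric features) can be sketched as follows; the pretrained GloVe vectors, object labels, and geometric feature values are placeholders standing in for the reference model's inputs.

```python
import numpy as np
import gensim.downloader as api

# Pretrained static vectors stand in for the word2vec vectors used by the reference model.
wv = api.load("glove-wiki-gigaword-100")

def room_feature(object_labels, geometric_features):
    """Concatenate the mean semantic embedding of the objects' labels
    with a geometric feature vector (both are illustrative placeholders)."""
    semantic = np.mean([wv[label] for label in object_labels if label in wv], axis=0)
    return np.concatenate([semantic, geometric_features])

feat = room_feature(["sofa", "television", "lamp"], np.array([2.5, 0.8, 0.3]))
print(feat.shape)  # (103,) = 100 semantic dims + 3 geometric dims
```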
Article
The concept of function and affordance is a critical aspect of 3D scene understanding and supports task-oriented objectives. In this work, we develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph representing the spatial organization of a scene. The varying functional affordance is designed to integrate with the varying spatial context of the graph. More specifically, we develop an algorithm that learns to construct a 3D hierarchical scene graph (3DHSG) that captures the spatial organization of the scene. Starting from segmented object point clouds and object semantic labels, we develop a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grand-child nodes indicating object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground truth data for local spatial regions with region-specific affordances and also object-specific affordances for each object. We employ a Transformer Based Hierarchical Scene Understanding (TB-HSU) model to learn the 3DHSG. We use a multi-task learning framework that learns both room classification and the definition of spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
Article
Full-text available
Authorship attribution is a critical task in natural language processing that involves identifying the author of a given text based on writing style, linguistic patterns, and structural features. This research presents a deep learning-based approach combining Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to accurately attribute authorship. Using the Reuters-50-50 dataset, we extract syntactic and structural information such as part-of-speech tags, punctuation frequency, and average sentence length, which help capture the unique stylistic traits of individual authors. The text is cleaned, transformed into numerical vectors, and used to train the proposed model. Experimental results demonstrate that the hybrid CNN-BiLSTM architecture achieves a high accuracy of 96% in identifying authors from unseen text samples. The model also performs well across other metrics such as precision, recall, and F1-score, showing its robustness and effectiveness in capturing deep textual patterns. This work contributes to the fields of authorship verification, plagiarism detection, and digital forensics, offering a scalable and reliable solution for text-based author identification.
INTRODUCTION
Authorship attribution is the task of identifying the writer of a given piece of text by analyzing writing patterns, linguistic cues, and stylistic features. It has wide-ranging applications in areas such as forensic investigations, digital content moderation, literary analysis, academic integrity verification, and cybersecurity. In a digital world where anonymous and pseudonymous writing is increasingly prevalent, reliable authorship identification has become an essential component of content authentication and intellectual property protection. Conventional methods for authorship attribution typically involve statistical feature engineering, such as analyzing word frequencies, sentence lengths, punctuation patterns, and syntactic structures. While these approaches can be effective for small or controlled datasets, they often fail to generalize well on large-scale or noisy data due to their reliance on shallow features. Moreover, hand-crafted features may miss deeper semantic and contextual patterns that differentiate writing styles among authors. Recent advancements in deep learning have opened up new possibilities in natural language processing (NLP), offering models capable of learning hierarchical, semantic, and sequential representations directly from raw text. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks and their bidirectional variants, have demonstrated significant success in capturing the temporal dynamics of language. Likewise, Convolutional Neural Networks (CNNs), though traditionally used in image processing, have shown impressive performance in capturing local syntactic patterns when applied to text data. In this study, we propose a hybrid model that combines the strengths of CNNs and Bidirectional LSTMs for robust authorship attribution. The model is trained on the Reuters-50-50 dataset, a benchmark corpus comprising text from 50 different authors. To enhance performance, we extract both structural and syntactic features, including part-of-speech tags, average word/sentence lengths, punctuation usage, and TF-IDF-based word vectors, which are then transformed into numerical formats for training.
The processed data is used to train the hybrid CNN-BiLSTM model to learn and distinguish between subtle stylistic features unique to each author. The primary aim of this research is to leverage deep learning to minimize manual feature engineering and improve classification performance across varied text samples. Our proposed model achieves high accuracy and generalizes well on unseen data, demonstrating its applicability to real-world scenarios where text authorship needs to be reliably established.
Article
Text summarization research has undergone several significant transformations with the advent of deep neural networks, pre-trained language models (PLMs), and recent large language models (LLMs). This survey thus provides a comprehensive review of the research progress and evolution in text summarization through the lens of these paradigm shifts. It is organized into two main parts: (1) a detailed overview of datasets, evaluation metrics, and summarization methods before the LLM era, encompassing traditional statistical methods, deep learning approaches, and PLM fine-tuning techniques, and (2) the first detailed examination of recent advancements in benchmarking, modeling, and evaluating summarization in the LLM era. By synthesizing existing literature and presenting a cohesive overview, this survey also discusses research trends, open challenges, and proposes promising research directions in summarization, aiming to guide researchers through the evolving landscape of summarization research.
Article
Full-text available
The exponential growth of unstructured data across industries presents a persistent challenge for efficient and intelligent data management. Recent advances in Generative Artificial Intelligence (GenAI), particularly large language models (LLMs), have introduced novel possibilities for automated knowledge extraction, semantic reasoning, and task execution within data management systems. This paper explores how GenAI is integrated into intelligent data management frameworks, enabling automated workflows, data classification, and ontology population. Through a comprehensive review of literature and an analysis of emerging trends, we assess the transformative potential and challenges of GenAI in this domain. We further present an illustrative line graph on automation accuracy over time and a comparative table of GenAI capabilities across applications.
Article
Full-text available
Natural Language Processing (NLP) has witnessed remarkable advancements in recent years, particularly with the introduction of transformer-based models like BERT (Bidirectional Encoder Representations from Transformers). These models have revolutionized NLP tasks by overcoming limitations of previous architectures, such as RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), primarily due to their ability to capture long-range dependencies and parallelize training. BERT, in particular, introduced a new paradigm by leveraging transformer-based pretraining on large corpora followed by fine-tuning on task-specific data. This paper explores the development of BERT and transformer models, highlighting their significance in NLP tasks such as sentiment analysis, machine translation, and question answering. We also discuss the evolution of transformer-based models and how BERT has become the foundation for many subsequent models in the field.
Article
Full-text available
Global climate change has led to frequent and widespread flood disasters in China. Traditional flood disaster investigations mainly focus on major flood events, and small‐scale flood events are often overlooked. This study utilized the Sina Weibo social media platform to detect flood events in 370 cities in China from 2012 to 2023. We downloaded 73.52 million Weibo posts and developed a two‐step flood detection algorithm. In the first step, the algorithm initially identifies 956 flood events based on changes in posting frequency. In the second step, an LDA topic model is used to detect topics for these flood events and automatically filter out false events, resulting in 729 flood events. Verification of these events confirmed that 629 of the 729 were real flood events, achieving a detection accuracy of 86.28%. In the end, after excluding all false flood events and reinstating the mistakenly removed real ones, we obtained a total of 674 verified flood events. Among these 370 cities, 194 cities experienced flood disasters, accounting for 52.43% of the total. Additionally, we compared our findings with online news reports, as well as the flood data sets from the GDACS and EM‐DAT. We found that our study had a high detection rate for urban waterlogging events. However, there were cases of missed detection for flash floods and small watershed flood disasters. Nevertheless, this study represents the most comprehensive publicly available detection of flood events in China to date, which is of great significance for the government's flood management and decision‐making.
Article
Named entity recognition (NER) is a critical task in natural language processing. It extracts entity information such as person, location, and organization by predicting various categories of label types and entity spans in text. Nowadays, NER has achieved good recognition results on English text using machine learning. However, satisfactory recognition results cannot be achieved when processing Japanese text, due to the diversity of the text composition and the particularity of the language itself. Compared with English text, in which words are separated by spaces, Japanese has no clear separation mark between two words. Moreover, Japanese text includes three types of representation methods, unlike English text, which consists only of the English alphabet. To solve the above problems, a feature integration network with BERT called FINB is introduced in this paper based on multi-feature integration, which can integrate pronunciation features and glyph features of Japanese into the model to obtain more semantic information. The experiments for verification are conducted on the Kyoto University Web Document Leads Corpus, called KWDLC, and the Japanese Wikipedia dataset, which both prove that the proposed method can effectively improve the recognition of named entities in Japanese.
Preprint
Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, due to remote sensing (RS) imagery containing abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited in overly narrow interpretation levels and interaction manner, hindering their applicability in real-world scenarios. To address those challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.
Article
Full-text available
The proliferation of digital communication platforms has brought convenience but also a surge in unsolicited and potentially harmful spam messages. These messages not only compromise user experience but may also pose security threats. To address this issue, the proposed work leverages a deep learning-based approach using Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, to accurately classify and filter spam from legitimate (ham) SMS messages. The model is trained on a publicly available SMS spam dataset, where extensive preprocessing, including stop word removal, stemming, and lemmatization using the Natural Language Toolkit (NLTK), is performed to standardize the input text. The cleaned messages are vectorized and normalized before being split into training and testing subsets (80:20 ratio). An LSTM-based architecture is designed and trained with optimized hyperparameters, such as batch size and number of epochs, to balance model accuracy and training efficiency. Upon evaluation, the model demonstrates robust classification performance, achieving an accuracy exceeding 95%, along with strong precision, recall, and F1-score metrics. The implementation, developed in the Jupyter Notebook environment, highlights the potential of LSTM networks in natural language processing tasks, particularly spam detection. This approach provides a reliable and scalable solution for mitigating spam-related issues in messaging systems.
INTRODUCTION: The rapid growth of digital communication technologies has revolutionized the way individuals interact, conduct business, and share information. Among the most widely used communication methods is Short Message Service (SMS), due to its simplicity, low cost, and widespread availability. However, this convenience has also opened the door to misuse, with spam messages becoming increasingly prevalent across mobile networks. These spam messages, often unsolicited and irrelevant, not only interrupt the user experience but may also pose significant risks such as phishing attacks, financial scams, and the spread of malware.
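A minimal Keras sketch of the described pipeline (tokenize, pad, 80:20 split, LSTM classifier) is given below; the vocabulary size, sequence length, and layer widths are illustrative choices rather than the paper's exact hyperparameters.

# Sketch of an LSTM spam/ham classifier for preprocessed SMS text.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_spam_lstm(vocab_size=10000, max_len=100, embed_dim=64):
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),   # learn word embeddings from scratch
        layers.LSTM(64),                           # sequence encoder
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),     # spam probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Assumed usage with `texts` (cleaned SMS strings) and `labels` (0 = ham, 1 = spam):
# from sklearn.model_selection import train_test_split
# tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
# tokenizer.fit_on_texts(texts)
# X = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)
# X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
# build_spam_lstm().fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))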
Preprint
Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure is exclusive to Germanic verbs. Viewed as a cognitive model, however, such etymology-based generalizations face a learnability challenge, since the historical origins of words are presumably inaccessible to general language learners. In this study, we present computational evidence that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature for the corresponding etymological classes. Our findings further uncovered previously unrecognized features of the quasi-etymological clusters, offering novel hypotheses for future experimental studies.
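A toy version of this kind of phonotactics-based clustering can be built from character n-gram counts and k-means; the feature choice and the handful of example words below are illustrative assumptions, not the authors' exact pipeline.

# Represent each word by character n-gram counts and cluster the vocabulary into two
# groups, then inspect whether the clusters align with Germanic vs. Latinate words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

words = ["begin", "forget", "understand", "receive", "permit", "contain"]  # toy vocabulary
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for word, c in zip(words, clusters):
    print(word, "-> cluster", c)

In the study itself the features are phonotactic and the vocabulary is corpus-scale, but the overall recipe of unsupervised clustering over word-internal features is the same.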
Article
Full-text available
Automated Short Answer Grading (ASAG) plays a crucial role in modern e-learning systems by ensuring the efficient, accurate, and consistent assessment of student responses in online education. However, many existing ASAG models struggle with generalization across different domains and question complexities, often facing challenges such as limited training data, high computational costs, and variations in the length of student answers (SA) relative to reference answers (RA). This paper introduces a Universal ASAG Model that combines multiple natural language processing (NLP) techniques, including Sentence-BERT (SBERT), Transformer-based Attention, BERT, LSTMs, and BM25-based Term Weighting. The model features a length-adaptive architecture that categorizes answers into five groups—very short, short, medium, long, and very long—based on their relative length percentages (e.g., very short: 0–30% shorter than the RA). Each category undergoes customized processing to enhance both accuracy and computational efficiency. We provide a comprehensive breakdown of the model’s architecture, detailing its processing pipeline, pseudo-code implementation, mathematical foundations, hyperparameter tuning strategies, and experimental evaluation using benchmark datasets such as SciEntsBank and SemEval-2013. Our model achieves state-of-the-art results, including an F1-score of 91.2%, a Pearson correlation of 0.90, and an RMSE of 0.18, outperforming existing approaches. Additionally, we review recent advancements in ASAG, discussing key contributions, ongoing challenges, and potential future directions.
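The length-adaptive routing can be sketched as a simple dispatcher that buckets a student answer by its length relative to the reference answer and hands it to a category-specific scorer; the bucket boundaries and the scorers mapping below are illustrative assumptions, not the paper's exact configuration.

# Sketch of length-adaptive routing for ASAG: bucket the student answer (SA) by its
# length relative to the reference answer (RA), then dispatch to a per-category scorer.
def length_category(student_answer: str, reference_answer: str) -> str:
    sa_len, ra_len = len(student_answer.split()), len(reference_answer.split())
    ratio = sa_len / max(ra_len, 1)          # relative length of SA vs. RA
    if ratio < 0.3:
        return "very_short"
    elif ratio < 0.7:
        return "short"
    elif ratio < 1.3:
        return "medium"
    elif ratio < 2.0:
        return "long"
    return "very_long"

def grade(student_answer, reference_answer, scorers):
    # `scorers` maps each category to a callable, e.g. an SBERT-similarity or BERT-based scorer
    category = length_category(student_answer, reference_answer)
    return scorers[category](student_answer, reference_answer)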
Thesis
Full-text available
The rapid expansion of unstructured and semi-structured textual data in technical documentation, industrial datasheets, and regulatory reports has created an urgent need for automated knowledge extraction and representation systems. Traditional rule-based and keyword-driven approaches often fail to capture semantic relationships, hierarchical structures, and contextual dependencies, limiting their effectiveness in structured data retrieval. This thesis explores AI-driven structured knowledge extraction using Large Language Models (LLMs), specifically GPT-4o and Gemini 2.0 Flash, to generate XML-based knowledge graphs from unstructured PDFs. The proposed methodology consists of a multi-stage AI pipeline that integrates text extraction, structured representation, confidence-aware entity extraction, and question-answering (QA) capabilities:
• Text Extraction and Preprocessing: A layout-aware text extraction using pdfplumber accurately retrieves textual content from multi-column, tabular, and graphically embedded PDFs. The system ensures context preservation, structural consistency, and efficient handling of complex document formats.
• Structured Knowledge Graph Generation: Extracted text is processed using GPT-4o and Gemini 2.0 Flash to transform unstructured content into hierarchically structured XML representations, ensuring that extracted information is machine-readable and semantically rich.
• Confidence-Based Entity Extraction: Gemini 2.0 Flash introduces a confidence-aware extraction framework, where each extracted attribute is assigned a confidence score (0.0–1.0), allowing for uncertainty estimation, ranking of high-confidence attributes, and filtering of unreliable extractions.
• Question-Answering (QA) over Structured Data: The thesis implements two QA systems: (i) Rule-Based Querying, which directly maps structured queries to XML elements for fast and precise information retrieval, and (ii) AI-Powered Semantic QA using GPT-4o and Gemini 2.0 Flash, which interprets natural language queries by extracting relevant information dynamically from structured knowledge graphs.
• Performance Benchmarking and Evaluation: The structured extraction and QA models are evaluated using: (i) precision, recall, and F1-score to assess extraction accuracy, (ii) processing time and scalability to measure computational efficiency, (iii) schema compliance to ensure adherence to predefined XML structures, and (iv) confidence-score reliability to validate uncertainty estimation in entity extraction.
Key Findings and Contributions: Experimental results demonstrate that GPT-4o excels in structured knowledge graph generation, producing highly accurate and semantically coherent XML representations. Gemini 2.0 Flash is more computationally efficient and introduces confidence-based entity extraction, improving reliability in large-scale document processing. However, challenges remain:
• Schema Inconsistencies: AI-generated XML structures sometimes deviate from predefined schema formats, requiring post-processing validation.
• Numerical Misinterpretation: Models occasionally misinterpret numerical attributes, unit conversions, and measurement relationships.
• Ambiguity in Query-Based Extraction: AI-powered QA systems struggle with vague or multi-context queries, requiring a hybrid retrieval approach.
• Scalability Constraints: Processing large document corpora with LLM-based structured extraction incurs high computational costs, necessitating optimization strategies.
To address these limitations, this thesis explores post-processing validation techniques, fine-tuning strategies, and confidence-based entity ranking to improve AI-driven structured extraction. Future research directions include:
• Hybrid AI-Rule-Based Knowledge Extraction: Combining deep learning models with schema-enforced rule-based validation to ensure structural consistency and interpretability.
• Reinforcement Learning for Schema Compliance: Optimizing model outputs to match predefined XML standards dynamically.
• Adaptive Domain-Specific Tuning: Fine-tuning LLMs on specialized corpora (e.g., technical manuals, financial reports, scientific research) to improve extraction accuracy.
• Scalable Processing Pipelines: Implementing distributed and parallelized extraction workflows to enable large-scale structured document processing.
Impact and Applications: The findings of this research contribute to the advancement of AI-powered structured knowledge extraction, offering a scalable, interpretable, and queryable approach to information retrieval. This methodology has significant applications in:
• Enterprise Knowledge Management – Automating the structuring and retrieval of technical and regulatory documentation.
• AI-Assisted Information Retrieval – Enabling intelligent question-answering for business intelligence and decision-support systems.
• Semantic Search and Knowledge Graph Integration – Enhancing ontology-based search engines and domain-specific knowledge bases.
• Scientific Research and Digital Archiving – Facilitating structured extraction from research papers, patents, and legal documents.
This thesis bridges the gap between unstructured document content and structured knowledge representation, paving the way for scalable AI-driven knowledge management systems. Future refinements in hybrid AI pipelines, confidence-aware ranking, and model explainability will further enhance the accuracy, efficiency, and reliability of automated structured information extraction.
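As a small illustration of the rule-based querying and confidence filtering described above, the sketch below parses a confidence-annotated XML fragment with Python's standard library; the tag names and attribute layout are assumed for the example and are not the thesis's actual schema.

# Rule-based querying over a confidence-annotated XML knowledge fragment.
import xml.etree.ElementTree as ET

sample_xml = """
<device name="SensorX">
  <attribute name="operating_voltage" value="3.3 V" confidence="0.95"/>
  <attribute name="temperature_range" value="-40 to 85 C" confidence="0.62"/>
</device>
"""

def query_attributes(xml_text: str, min_confidence: float = 0.8):
    """Return (name, value) pairs whose extraction confidence meets the threshold."""
    root = ET.fromstring(xml_text)
    results = []
    for attr in root.findall("attribute"):
        if float(attr.get("confidence", 0.0)) >= min_confidence:
            results.append((attr.get("name"), attr.get("value")))
    return results

print(query_attributes(sample_xml))  # [('operating_voltage', '3.3 V')]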
Article
Full-text available
Customer Relationship Management (CRM) plays a pivotal role in ensuring businesses optimize customer engagement, retention, and satisfaction. Traditional CRM systems have typically relied on rule-based approaches or simple algorithms for customer interaction, which may fail to capture the dynamic and evolving nature of customer behavior. In this paper, we introduce a novel application of Transformer networks, a state-of-the-art deep learning architecture, to enhance CRM systems by generating personalized, multi-step engagement sequences and predicting customer churn risk. Our approach leverages two specialized Transformer models: a Sequence Transformer for the task of generating multi-step engagement plans and a Churn Transformer for predicting the risk of customer churn. These models harness the power of self-attention mechanisms to understand the sequential and contextual dynamics of customer behavior across time. To evaluate the effectiveness of these models, we use simulated datasets inspired by real-world benchmarks, such as MovieLens, Amazon Product Data, and Kaggle Customer Churn. The Sequence Transformer is trained to predict a series of actions for customer engagement based on historical interactions, while the Churn Transformer estimates the likelihood of customer attrition based on behavioral and demographic data. The results of our experiments show that after 10 epochs of training, the Sequence Transformer achieves an accuracy of 0.0167, while the Churn Transformer reaches an accuracy of 0.4000. Despite modest accuracy values, the models exhibit steady improvement, with training losses decreasing consistently from an initial value of 4.0456 to 3.7837 for the Sequence Transformer, and from 0.8096 to 0.7047 for the Churn Transformer. The mathematical foundation behind the Sequence Transformer involves minimizing the average cross-entropy loss over the predicted engagement sequence steps. Specifically, the loss function is defined as: $L_{\mathrm{seq}} = \frac{1}{3}\sum_{i=1}^{3} \mathrm{CrossEntropy}(\hat{y}_i, y_i)$, (1) where $\hat{y}_i$ represents the predicted action for step $i$, and $y_i$ is the true action for the corresponding step in the sequence. Similarly, the Churn Transformer optimizes the binary cross-entropy loss to estimate the likelihood of customer churn. The loss function is defined as: $L_{\mathrm{churn}} = -\frac{1}{N}\sum_{j=1}^{N}\left[\, y_j \log(\hat{y}_j) + (1 - y_j)\log(1 - \hat{y}_j) \,\right]$, (2) where $y_j$ is the true churn label for customer $j$, and $\hat{y}_j$ is the predicted churn probability. Through detailed visualizations, including sample engagement plans, attention weight heatmaps, and ROC curves, this paper illustrates the performance of the models and highlights the potential of Transformer networks in revolutionizing proactive, context-aware CRM strategies. While the accuracy results are constrained by the limitations of simulated datasets, the work lays a solid foundation for future enhancements, including the use of real-world data and more complex Transformer variants, ultimately contributing to more effective customer engagement and retention strategies.
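Equations (1) and (2) correspond to standard multi-class and binary cross-entropy objectives; a short PyTorch sketch with assumed shapes (batch of 8, 3 engagement steps, 10 candidate actions) is shown below for illustration.

# Computing the two training objectives from equations (1) and (2) with standard PyTorch losses.
import torch
import torch.nn.functional as F

# Sequence Transformer: average cross-entropy over a 3-step engagement plan
step_logits = torch.randn(8, 3, 10)            # (batch, steps, num_actions)
step_targets = torch.randint(0, 10, (8, 3))    # true action index per step
L_seq = sum(F.cross_entropy(step_logits[:, i, :], step_targets[:, i]) for i in range(3)) / 3

# Churn Transformer: binary cross-entropy on churn probabilities
churn_probs = torch.sigmoid(torch.randn(8))    # predicted churn probability per customer
churn_labels = torch.randint(0, 2, (8,)).float()
L_churn = F.binary_cross_entropy(churn_probs, churn_labels)

print(L_seq.item(), L_churn.item())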
Preprint
Full-text available
Johnson and Lindenstrauss (Contemporary Mathematics, 1984) showed that for $n > m$, a scaled random projection $\mathbf{A}$ from $\mathbb{R}^n$ to $\mathbb{R}^m$ is an approximate isometry on any set $S$ of size at most exponential in $m$. If $S$ is larger, however, its points can contract arbitrarily under $\mathbf{A}$. In particular, the hypergrid $([-B, B] \cap \mathbb{Z})^n$ is expected to contain a point that is contracted by a factor of $\kappa_{\mathsf{stat}} = \Theta(B)^{-1/\alpha}$, where $\alpha = m/n$. We give evidence that finding such a point exhibits a statistical-computational gap precisely up to $\kappa_{\mathsf{comp}} = \widetilde{\Theta}(\sqrt{\alpha}/B)$. On the algorithmic side, we design an online algorithm achieving $\kappa_{\mathsf{comp}}$, inspired by a discrepancy minimization algorithm of Bansal and Spencer (Random Structures & Algorithms, 2020). On the hardness side, we show evidence via a multiple overlap gap property (mOGP), which in particular captures online algorithms; and a reduction-based lower bound, which shows hardness under standard worst-case lattice assumptions. As a cryptographic application, we show that the rounded Johnson-Lindenstrauss embedding is a robust property-preserving hash function (Boyle, Lavigne and Vaikuntanathan, TCC 2019) on the hypergrid for the Euclidean metric in the computationally hard regime. Such hash functions compress data while preserving $\ell_2$ distances between inputs up to some distortion factor, with the guarantee that even knowing the hash function, no computationally bounded adversary can find any pair of points that violates the distortion bound.
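A quick numerical illustration of the scaled projection (not the paper's construction details) in NumPy: a Gaussian matrix scaled by $1/\sqrt{m}$ keeps the norms of a small point set close to their original values, which is the approximate-isometry behavior the statement above starts from.

# Toy check that a scaled Gaussian projection approximately preserves Euclidean norms
# for a small set of points in R^n.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 100                               # ambient and projected dimensions (alpha = m/n = 0.1)
A = rng.standard_normal((m, n)) / np.sqrt(m)   # scaled random projection

points = rng.standard_normal((20, n))          # a small set S of points in R^n
ratios = np.linalg.norm(points @ A.T, axis=1) / np.linalg.norm(points, axis=1)
print("norm ratios after projection:", ratios.round(2))   # all close to 1 for small |S|

For exponentially larger structured sets such as the hypergrid, the abstract's point is that some vectors are contracted far more strongly, and finding them appears computationally hard.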