Contexts in source publication

Context 1
... for each remaining lemma, we concatenated all the lexical meanings and examples of usage of each separate homonym. The resulting dataset consisted of 2,882 homonym samples, each sample including the lemma, its possible meanings, and examples for each meaning (see Table 2). We used this dataset for further model evaluation. ...
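The aggregation step described in this context can be pictured as one record per lemma: all lexical meanings of the separate homonyms, each paired with its usage examples. A minimal sketch of such a sample structure, assuming a plain Python representation (the HomonymSample class and build_sample helper are hypothetical, not the authors' actual code):

```python
from dataclasses import dataclass, field

@dataclass
class HomonymSample:
    """One of the 2,882 samples: a lemma with its candidate senses."""
    lemma: str
    # Each sense pairs a lexical meaning (gloss) with its usage examples.
    senses: list = field(default_factory=list)  # list of (gloss, [examples])

def build_sample(lemma: str, dictionary_entries: list) -> HomonymSample:
    """Concatenate all meanings and usage examples of the separate homonyms
    sharing the given lemma into a single evaluation sample."""
    sample = HomonymSample(lemma=lemma)
    for entry in dictionary_entries:        # one entry per homonym
        for meaning, examples in entry:     # (gloss, [example sentences])
            sample.senses.append((meaning, examples))
    return sample
```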

Similar publications

Article
Full-text available
Due to the development of pre-trained language models, automated code generation techniques have shown great promise in recent years. However, the generated code will not always adhere to syntactic constraints of the target language, especially in the case of Turducken-style code, where declarative code snippets are embedded within imperative progr...

Citations

... Furthermore, the issue of ambiguity, demonstrated in rows 1 and 3, presented challenges by offering multiple possible senses. Although employing specialized Word Sense Disambiguation systems, as suggested by Laba et al. (2023), could mitigate this issue, exploring such solutions falls beyond the scope of this paper. ...
Conference Paper
Full-text available
WordNet is a crucial resource in linguistics and natural language processing, providing a detailed and expansive set of lexico-semantic relationships among words in a language. The trend toward automated construction and expansion of WordNets has become increasingly popular due to the high costs of manual development. This study aims to automate the development of the Ukrainian WordNet, explicitly concentrating on hypo-hypernym relations that are crucial building blocks of the hierarchical structure of WordNet. Utilizing the linking between Princeton WordNet, Wikidata, and multilingual resources from Wikipedia, the proposed approach successfully mapped 17% of Princeton WordNet (PWN) content to Ukrainian Wikipedia. Furthermore, the study introduces three innovative strategies for generating new entries to fill in the gaps of the Ukrainian WordNet: machine translation, the Hypernym Discovery model, and the Hypernym Instruction-Following LLaMA model. The latter model shows a high level of effectiveness, evidenced by a 41.61% performance on the Mean Overlap Coefficient (MOC) metric. With the proposed approach that combines automated techniques with expert human input, we provide a reliable basis for creating the Ukrainian WordNet.
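The Mean Overlap Coefficient (MOC) reported above can be read as an averaged set overlap between generated and reference entries. A minimal sketch, assuming MOC is the standard overlap coefficient |A ∩ B| / min(|A|, |B|) averaged over evaluation items; this definition is an assumption, not taken from the cited paper:

```python
def overlap_coefficient(predicted: set, reference: set) -> float:
    """Overlap coefficient between two sets: |A ∩ B| / min(|A|, |B|)."""
    if not predicted or not reference:
        return 0.0
    return len(predicted & reference) / min(len(predicted), len(reference))

def mean_overlap_coefficient(pairs) -> float:
    """Average the overlap coefficient over (predicted, reference) pairs."""
    scores = [overlap_coefficient(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: predicted vs. gold hypernym sets for two synsets.
pairs = [({"тварина", "ссавець"}, {"ссавець"}),
         ({"рослина"}, {"дерево", "рослина"})]
print(f"MOC = {mean_overlap_coefficient(pairs):.2%}")
```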
... Researchers also used context divergence as a measure to assess the disparity between retrieved information and the user's intended meaning [26,27,28]. The evaluation included context-specific submetrics as part of standard test methodologies [29,2,30]. Additionally, there was significant research effort directed towards assessing the factuality, reliability, fairness, and appropriateness of retrieved information given the assumed query context [10,31,32,33]. This research area has highlighted the ongoing challenge of contextually evaluating the relevancy of information retrieval in LLMs. ...
Preprint
Full-text available
This study focused on the development and evaluation of an Adaptive Query Contextualization Algorithm (AQCA) within the Alpaca Large Language Model (LLM) framework. The AQCA was designed to enhance the model's capability in information retrieval by employing a novel context encoding methodology that dynamically adapted to multifaceted contextual signals derived from user search history and interaction patterns. The algorithm's efficacy was rigorously tested across various metrics, including Contextual Relevance Score (CRS), Word Prediction Accuracy (WPA), Information Retrieval Fidelity (IRF), and Response Coherence Measure (RCM). Significant improvements were observed in the augmented Alpaca LLM's performance, especially in complex scenarios such as metaphorical language understanding and domain-specific knowledge integration. Challenges related to scalability, adaptability to multilingual contexts, and integration with diverse LLM architectures were identified, emphasizing the need for continued research in these areas. The study concluded that while the AQCA marked a substantial advancement in LLMs for context-aware information retrieval, it also opened avenues for future innovations focusing on technical enhancements and ethical considerations.
... Ukrainian remains one of the low-resource languages, with few practical applications in machine learning and deep learning. Many studies of the Ukrainian language are conducted in multilingual settings, such as training multilingual large language models [14], [18] and transformers [23], [6], or abstractive summarization [10]. We offer a corpus analysis tool for the Ukrainian language, the StyloMetrix. ...
... • building the first flair embeddings (Akbik et al., 2018) of the Ukrainian language and training compact downstream models like POS and NER on these embeddings;
• training high-quality fastText vectors;
• training lean language models for a Ukrainian speech-to-text project;
• training models for punctuation restoration;
• training GPT-2 models of different sizes for the Ukrainian language and fine-tuning them for various tasks using instructions (Kyrylov and Chaplynskyi, 2023);
• fine-tuning the paraphrase-multilingual-mpnet-base-v2 sentence transformer on sentences mined from the corpus to achieve better performance on the WSD task (Laba et al., 2023); a minimal sketch of this usage follows below.
We cooperate with teams of researchers to train transformer models like GPT-2 proposed by Radford et al. (2019), BERT by Devlin et al. (2019), RoBERTa by Liu et al. (2019), and ELECTRA by Clark et al. (2020) and are open to further collaborations. ...
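As a rough illustration of how a fine-tuned sentence transformer can serve the WSD task mentioned in the last item, the context of an ambiguous word can be compared against each candidate sense gloss by embedding similarity. A minimal sketch, assuming the sentence-transformers library and the public paraphrase-multilingual-mpnet-base-v2 checkpoint (the disambiguate helper is hypothetical, not the cited pipeline):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual checkpoint; the cited work fine-tunes it on corpus sentences.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def disambiguate(context: str, sense_glosses: list) -> int:
    """Return the index of the sense gloss closest to the usage context."""
    ctx_emb = model.encode(context, convert_to_tensor=True)
    gloss_embs = model.encode(sense_glosses, convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, gloss_embs)[0]
    return int(scores.argmax())

# Toy usage with a Ukrainian homonym ("коса": braid vs. scythe).
senses = ["заплетене волосся", "сільськогосподарське знаряддя для косіння"]
print(disambiguate("Дівчина заплела довгу косу", senses))
```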
... There is a need to develop methods that do not depend this heavily on their training corpus. [14] shows competitive results on MD tasks by leveraging the task's similarity to Word Sense Disambiguation (WSD) [33]. The successful use of LLMs for solving the WSD task is shown in [34]. Thus, cross-domain knowledge can be utilized to apply similar techniques in LLM-centric approaches to MD. ...
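One way to carry a WSD-style formulation over to metaphor detection with an LLM is to present the candidate literal senses and ask whether the contextual usage matches any of them or is figurative. A minimal sketch of such prompt construction, assuming a generic llm(prompt) -> str callable; the prompt wording and helpers are hypothetical, not taken from the cited works:

```python
def build_metaphor_prompt(sentence: str, word: str, senses: list) -> str:
    """Frame metaphor detection as choosing between literal senses
    and a figurative reading, mirroring a WSD-style prompt."""
    options = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(senses))
    options += f"\n{len(senses) + 1}. None of the above: the word is used figuratively."
    return (
        f'In the sentence: "{sentence}"\n'
        f'which option best describes the use of the word "{word}"?\n'
        f"{options}\nAnswer with the option number only."
    )

def detect_metaphor(llm, sentence: str, word: str, senses: list) -> bool:
    """Return True if the model picks the figurative option."""
    answer = llm(build_metaphor_prompt(sentence, word, senses))
    return answer.strip().startswith(str(len(senses) + 1))
```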
Chapter
Metaphor Detection is a crucial area of study in computational linguistics and natural language processing, as it enables the understanding and communication of abstract ideas through the use of concrete imagery. This survey paper aims to provide an overview of the current state-of-the-art approaches that tackle this issue and analyze trends in the domain across the years. The survey recapitulates the existing methodologies for metaphor detection, highlighting their key contributions and limitations. The methods are assigned three broad categories: feature-engineering-based, traditional deep learning-based, and transformer-based approaches. An analysis of the strengths and weaknesses of each category is showcased. Furthermore, the paper explores the annotated corpora that have been developed to facilitate the development and evaluation of metaphor detection models. By providing a comprehensive overview of the work already done and the research gaps present in pre-existing literature, this survey paper hopes to help future research endeavors, and thus contribute to the advancement of metaphor detection methodologies.
Article
Full-text available
Ambiguity is considered an indispensable attribute of all natural languages. The process of associating the precise interpretation with an ambiguous word, taking into consideration the context in which it occurs, is known as word sense disambiguation (WSD). Supervised approaches to WSD show better performance than their counterparts. These approaches, however, require a sense-annotated corpus to carry out the disambiguation process. This paper presents the first-ever standard WSD dataset for the Kashmiri language. The raw corpus used to develop the sense-annotated dataset is collected from different resources and contains about 1 M tokens. The sense-annotated corpus is then created from this raw corpus for 124 commonly used ambiguous Kashmiri words. Kashmiri WordNet, an important lexical resource for the Kashmiri language, is used for obtaining the senses used in the annotation process. The developed sense-tagged corpus is multifarious in nature and has 19,854 sentences. Based on this annotated corpus, the Lexical Sample WSD task for Kashmiri is carried out using different machine-learning algorithms (J48, IBk, Naive Bayes, Dl4jMlpClassifier, SVM). To train these models for the WSD task, bag-of-words (BoW) features and word embeddings obtained using the Word2Vec model are used. We used different standard measures, viz. accuracy, precision, recall, and F1-measure, to calculate the performance of these algorithms. The algorithms reported different values for these measures when using different features. In the case of the BoW model, SVM reported better results than the other algorithms used, whereas Dl4jMlpClassifier performed better with word embeddings.
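The lexical-sample setup described above (BoW features, classical classifiers such as SVM, evaluated with accuracy, precision, recall, and F1) can be reproduced in a few lines. A minimal sketch, assuming scikit-learn and a toy set of sense-annotated contexts for a single ambiguous word; the data are placeholders, not the Kashmiri corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy lexical-sample data: contexts of one ambiguous word with sense labels.
contexts = [
    "he deposited money in the bank",
    "the bank approved the loan",
    "they sat on the bank of the river",
    "fish swam near the river bank",
]
senses = ["FINANCE", "FINANCE", "RIVER", "RIVER"]

# Bag-of-words features fed into a linear SVM, one classifier per target word.
clf = make_pipeline(CountVectorizer(), LinearSVC())
scores = cross_val_score(clf, contexts, senses, cv=2, scoring="f1_macro")
print(f"macro-F1 = {scores.mean():.2f}")
```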
Preprint
Full-text available
This paper provides an overview of a text mining tool the StyloMetrix developed initially for the Polish language and further extended for English and recently for Ukrainian. The StyloMetrix is built upon various metrics crafted manually by computational linguists and researchers from literary studies to analyze grammatical, stylistic, and syntactic patterns. The idea of constructing the statistical evaluation of syntactic and grammar features is straightforward and familiar for the languages like English, Spanish, German, and others; it is yet to be developed for low-resource languages like Ukrainian. We describe the StyloMetrix pipeline and provide some experiments with this tool for the text classification task. We also describe our package's main limitations and the metrics' evaluation procedure.