Fig 1 - uploaded by Iskander Akhmetov
Language groups representation in the data we used: total word pairs

Source publication
Article
Lemmatization is a process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source language-independent lemmatizer based on the Random Forest classification model. This model is a supervised...

Contexts in source publication

Context 2
... language group representation of our data is unbalanced, with the majority of languages being Romance and Slavic / Baltic, followed by the Gaelic and Germanic languages. The distribution of the data we have collected, by the number of words per language group, is presented in Fig. 1 and Table 2. We can observe that the Uralic / Altaic group, represented by only two languages, is larger than groups such as Germanic and Gaelic in the number of wordform-lemma pairs. This is because of the enormous amount of Turkish language data. The same effect can be observed for the Slavic / Baltic language group, mainly because of the Russian language ...


Citations

... The diversity within this language is so high that it is tough to create a system that can understand what means what. Every character is unique in its way, and the melange of characters creates different meaningful outputs in various ways. We hope this system will help other researchers build more NLP systems for the Bangla language and make it more technology-friendly and more accessible to the whole world. ...
... Akhmetov et al. [1] proposed a random forest classification model for language-independent lemmatization. In this paper, the authors propose creating a character co-occurrence embedding from inflected word-lemma pairs of 25 languages. ...
... In order to report how our proposed model performs, Benlem [5], BLSTM-BLSTM [6] and Akhmetov et al. [1] have been used as base references. Although the character accuracy rate of BaNeL is 95.75%, the exact match measure EM (Eq. ...
Article
This study presents an efficient framework for deriving the lemma of an inflected Bangla word, using its part of speech as context. Bangla is a morphologically rich Indo-Aryan language in which around 70% of words are inflected, and some words have around 90 different inflected forms, making it one of the most challenging languages for lemmatization. The unavailability of a sufficiently large, appropriate dataset in Bangla makes the task even more strenuous. A reliable, robust Bangla lemmatizer will create new possibilities for dependent fields such as automatic language translation and grammatical correction to flourish in Bangla. In this paper, we describe a new, larger Bangla dataset for lemmatization and an encoder-decoder-based sequence-to-sequence framework for it. After tuning the hyper-parameters, the proposed framework yielded 95.75% character accuracy and 91.81% exact match on the testing split of the prepared dataset, which is significantly higher than other existing approaches to lemmatization in Bangla.
Article Highlights. This article: (i) discusses the lemmatization task in Bangla and demonstrates how it differs from stemming; (ii) presents an efficient artificial-neural-network-based model for lemmatization that performs comparatively better than existing ones; (iii) describes a new large dataset for lemmatization in the Bangla language.
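The two metrics reported in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the excerpt elides the exact formulas, so character accuracy is assumed here to be one minus the total edit distance divided by the total number of gold characters, and the function names and the English sample words are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exact_match(predicted, gold):
    """Fraction of predicted lemmas identical to the gold lemmas."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def character_accuracy(predicted, gold):
    """ASSUMED definition: 1 - total edit distance / total gold characters."""
    errors = sum(levenshtein(p, g) for p, g in zip(predicted, gold))
    total = sum(len(g) for g in gold)
    return 1 - errors / total


# Hypothetical toy data, not taken from the paper's dataset.
gold = ["run", "walk", "study"]
predicted = ["run", "walk", "studi"]
print(exact_match(predicted, gold))         # 2 of 3 lemmas match exactly
print(character_accuracy(predicted, gold))  # 1 wrong character out of 12
```

Under this definition a prediction that is wrong by a single character counts as a full miss for exact match but only a small loss in character accuracy, which is one reason the first figure can sit well above the second.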
... For lemmatization purposes, we considered the approach of Highly language-independent word lemmatization using Machine learning [3]. However, we have found that the approach shows decent results for the Russian language due to the morphological complexity of the language. ...
Article
We consider the problem of sentiment analysis in news media articles cast as a three-way classification task: negative, positive, or neutral. We show that subdividing the training corpus by topic (local news, sports, hi-tech, and others) and training separate sentiment classifiers for each sub-corpus improves classification F1 scores. We use topics since some words carry different sentiments in different domains: e.g., the word "force" is typically positive in the sports domain but negative in the political domain. Our experiments on the Kaggle dataset of sentiment-labeled Kazakhstani news articles in the Russian language, using the Convolutional Neural Network (CNN) model, partially confirmed our hypothesis: for the most prominent "kz" (local news) topic, we achieve an F1 score of 0.70, which is greater than the 0.67 of the baseline model trained without topic-awareness. Topic-awareness improves F1 scores in some topics, but due to the topic/class imbalance further research is needed. However, the performance in terms of F1 over the whole corpus does not improve, or the improvements are very small. Moreover, our approach shows better results on topics with many text samples than on those with relatively few articles.
... More specifically, for a period of 45 years from 1976 to 2021, we have considered 1031 titles of the database domain related research papers as the sentences and extracted unigrams as well as lemmata from these titles. The process of lemmatization is deployed to find the base morphological form of a word [10]. This form is called "lemma" if it is singular, and "lemmas" or "lemmata" if it is plural. ...
... Then the lemmatization is applied. Lemmatization converts the word to its dictionary form [2]. ...
... RQ 1 Is it possible to predict the quality of determining the morphological properties of unknown words? RQ 2 Which highest quality of determining the morphological properties of unknown words can be provided with the modified analogy method? ...
... Syriac [22] or Icelandic [11] and the minority languages [10]. Some works are devoted to Russian language processing [1,2,9,31]; [2] presents the results of an independent evaluation of Russian morphological parsers. Some language-independent word lemmatization approaches (working with many languages) have appeared recently [1,12,28]. ...
Chapter
Lemmatization is an important step in many natural language processing tasks, information retrieval tasks, and information extraction tasks. In this paper, we present a lemmatization approach based on the modified analogy method which utilizes a reversed dictionary. The presented method is very efficient but simple to realize. Modifying the analogy method with fuzzy sets significantly improves the quality of the morphological analysis. We show (i) that the quality of this approach is comparable to state-of-the-art methods and (ii) that we can increase the accuracy of the unknown words analysis. Although the method was developed for the Russian language, it was successfully verified on the German language. Therefore, it is easily extensible to other highly inflected languages. Link: https://link.springer.com/chapter/10.1007/978-3-030-89477-1_57 Cite this paper as: Eltsova M., Gashkov A., Slovikova E. (2022) Applying Reversed Dictionaries to Improve the Automatic Morphological Analysis of Unknown Words. In: Rocha A., Isaeva E. (eds) Science and Global Challenges of the 21st Century - Science and Technology. Perm Forum 2021. Lecture Notes in Networks and Systems, vol 342. Springer, Cham. https://doi.org/10.1007/978-3-030-89477-1_57
... It is the process of converting a word into a normalized form. It consists of removing the suffix of a word [57,58]. For instance, by removing the suffixes of the words "ranked" and "ranks", we get the lemma "rank". This step is very useful in many natural language processing tasks for reducing the size of the vocabulary. ...
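The suffix-removal example in the excerpt above can be illustrated with a small sketch. It is a toy under stated assumptions: the suffix list, the minimum stem length, and the function name are hypothetical, and plain suffix stripping is closer to stemming than to dictionary-based lemmatization, so irregular forms are left unchanged.

```python
# Hypothetical toy suffix list; a real system would need language-specific rules.
SUFFIXES = ("ed", "s")


def strip_suffix(word: str, suffixes=SUFFIXES, min_stem: int = 3) -> str:
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["ranked", "ranks", "rank"]
normalized = [strip_suffix(w) for w in words]
print(normalized)            # all three forms collapse to the same string
print(len(set(normalized)))  # vocabulary shrinks from 3 entries to 1
```

The shrinking vocabulary is the effect the excerpt refers to: distinct surface forms collapse into one entry.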
Article
The paper presents a recommendation model for developing new smart city and smart health projects. The objective is to provide recommendations to citizens about smart city and smart health startups to improve entrepreneurship and leadership. These recommendations may lead to the country’s advancement, improve national income, and reduce unemployment. This work focuses on designing and implementing an approach for processing and analyzing tweets containing data related to smart city and smart health startups and for providing recommended projects together with their required skills and competencies. The approach is based on tweet mining through a machine learning method, the Word2Vec algorithm, combined with a recommendation technique conducted via an ontology-based method. It allows discovering the relevant startup projects in the context of smart cities and links them to the needed skills and competencies of users. A system was implemented to validate this approach. The attained precision, recall, and F-measure are, respectively, 95%, 66%, and 79%, showing that the results are very encouraging.
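For readers who want to relate the three figures above, the standard F-measure is the weighted harmonic mean of precision and recall. The sketch below applies that textbook formula to the rounded percentages quoted in the abstract; the function name is hypothetical, and the abstract does not state how its own F-measure was averaged or rounded.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values quoted in the abstract (95% precision, 66% recall).
score = f_measure(0.95, 0.66)
print(round(score, 4))
```

With these rounded inputs the balanced F-measure comes out at roughly 0.78, close to the reported 79%.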