Conference Paper

Combining Knowledge and CRF-Based Approach to Named Entity Recognition in Russian


Abstract

Current machine-learning approaches to information extraction often include features based on large volumes of knowledge in the form of gazetteers, word clusters, etc. In this paper we consider a CRF-based approach to Russian named entity recognition that relies on multiple lexicons. We test our system on the open Russian collections “Persons-1000” and “Persons-1111”, which are labeled with personal names. We additionally annotated the “Persons-1000” collection with names of organizations, media, locations, and geo-political entities, and we present experimental results for one type of names (Persons, for comparison purposes), for three types (Persons, Organizations, and Locations), and for five types of names. We also compare two labeling schemes for Russian: the IO scheme and the BIO scheme.
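For illustration, the two labeling schemes differ only in whether the first token of each entity receives a dedicated B- prefix. The mini-example below is invented (it is not taken from the Persons-1000 data) and is only a sketch of the tagging conventions:

tokens = ["Президент", "Владимир", "Путин", "посетил", "Казань"]

io_tags  = ["O", "I-PER", "I-PER", "O", "I-LOC"]   # IO: inside/outside only
bio_tags = ["O", "B-PER", "I-PER", "O", "B-LOC"]   # BIO: B- marks the entity start

# The B- prefix lets a tagger separate adjacent entities of the same type,
# at the cost of roughly doubling the tag set.
for tok, io, bio in zip(tokens, io_tags, bio_tags):
    print(f"{tok:12s} {io:8s} {bio}")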


... NER is essentially a multi-class classification or sequence labeling task. The category of each entity can be identified, down to the character level, via machine learning algorithms [10,11]. Commonly used machine learning algorithms include the hidden Markov model (HMM) [12], the support vector machine (SVM) [13] and conditional random fields (CRFs) [14]. ...
Article
Full-text available
Aviation safety reports provide detailed records of past aviation safety accidents, analyze their problems and hidden dangers, and help airlines and other aviation enterprises prevent similar accidents from happening again. We propose to use named entity recognition technology to quickly mine important information from these reports, helping safety personnel work more efficiently. The development of intelligent civil aviation creates demand for the incorporation of big data and artificial intelligence. Because of aviation-specific terms and the complexity of identifying named entity boundaries, mining aviation safety report texts is a challenging domain. This paper proposes a novel method for aviation safety report entity extraction. First, ten kinds of entities and sequences, such as event, company, city, operation, date, aircraft type, personnel, flight number, aircraft registration and aircraft part, were annotated using the BIO format. Second, we present a semantic representation enhancement approach through the fusion of enhanced representation through knowledge integration embedding (ERNIE), pinyin embedding and glyph embedding. Then, in order to improve the accuracy of specific entity extraction, we constructed and utilized an aviation domain dictionary that includes high-frequency technical aviation terms. After that, we adopted bilinear attention networks (BANs), a feature fusion approach originally used in multi-modal analysis, to incorporate features extracted from both iterated dilated convolutional neural network (IDCNN) and bi-directional long short-term memory (BiLSTM) architectures. A case study of specific entity extraction for an aviation safety events dataset was conducted. The experimental results demonstrate that our proposed algorithm, with an F1 score reaching 97.93%, is superior to several baseline and advanced algorithms. The proposed approach therefore offers a robust methodological foundation for relationship extraction and knowledge graph construction from aviation safety reports.
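A minimal sketch of the feature-fusion idea described above, assuming PyTorch: per-token outputs of a BiLSTM branch and an IDCNN branch are combined with a bilinear layer and projected to tag scores. This is only an illustrative fusion, not the exact BAN used in the paper; all dimensions are placeholders.

import torch
import torch.nn as nn

class BilinearFusionTagger(nn.Module):
    # Fuse per-token features from a BiLSTM branch and an IDCNN branch with a
    # bilinear layer, then project to tag scores.
    def __init__(self, d_lstm, d_cnn, d_fused, num_tags):
        super().__init__()
        self.fuse = nn.Bilinear(d_lstm, d_cnn, d_fused)
        self.out = nn.Linear(d_fused, num_tags)

    def forward(self, h_lstm, h_cnn):
        # h_lstm: (batch, seq, d_lstm); h_cnn: (batch, seq, d_cnn)
        fused = torch.tanh(self.fuse(h_lstm, h_cnn))   # (batch, seq, d_fused)
        return self.out(fused)                         # (batch, seq, num_tags)

tagger = BilinearFusionTagger(d_lstm=256, d_cnn=128, d_fused=128, num_tags=21)
scores = tagger(torch.randn(2, 40, 256), torch.randn(2, 40, 128))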
... Secondly, statistical machine learning techniques such as the HMM [16], CRF [17], SVM [18][19][20], and others treat NER as a classification problem. These methods perform well on small datasets but poorly on large, complex ones, and they require a feature matrix predefined by experts. ...
Article
Full-text available
Football is one of the most popular sports in the world, giving rise to a wide range of research topics related to its off- and on-the-pitch performance. The extraction of football entities from football news helps to construct sports frameworks, integrate sports resources, and capture the dynamics of the sport in a timely manner through visual text-mining results, including the connections among football players, football clubs, and football competitions, and it is of great convenience for observing and analyzing the developmental tendencies of football. Therefore, in this paper, we constructed a 1,000,000-word Chinese corpus in the field of football and proposed a BiLSTM-based model for named entity recognition. The ALBERT-BiLSTM deep learning combination model is used for entity extraction from football textual data. Based on the BiLSTM model, we introduced ALBERT as a pre-training model to extract character features and enhance the generalization ability of word embedding vectors. We then compared the results of two different annotation schemes, BIO and BIOE, and two deep learning models, ALBERT-BiLSTM-CRF and ALBERT-BiLSTM. It was verified that BIOE tagging was superior to BIO and that the ALBERT-BiLSTM model was more suitable for football datasets. The precision, recall, and F-score of the model were 85.4%, 83.47%, and 84.37%, respectively.
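To make the BIO/BIOE comparison concrete, here is a small sketch that converts BIO tags to BIOE by relabeling the last token of each multi-token span with an E- tag. This follows one common convention (single-token entities keep their B- tag); the exact convention used in the paper is not specified here.

def bio_to_bioe(tags):
    # Relabel the final token of every multi-token entity span with E-.
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            nxt = tags[i + 1] if i + 1 < len(tags) else "O"
            if not (nxt.startswith("I-") and nxt[2:] == tag[2:]):
                out[i] = "E-" + tag[2:]
    return out

print(bio_to_bioe(["B-PER", "I-PER", "I-PER", "O", "B-ORG"]))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'B-ORG']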
... However, due to the complexity of geographic named entity data and their diverse sources, this method performs poorly when processing such data. Previous research has roughly divided geographic named entity recognition into two categories: spatial statistics-based geographic named entity recognition [4] and deep neural network-based geographic named entity recognition [5]. Many previous studies [6] have achieved high recognition accuracy in geographic named entity recognition while ignoring the correctness of the identified geographic entities; thus, further standardization and precision improvement are needed. ...
Article
Full-text available
Social media is widely used to share real-time information and report accidents during natural disasters. Named entity recognition (NER) is a fundamental task of geospatial information applications that aims to extract location names from natural language text. As a result, there is a growing demand for identifying location names in social media content. Named entity correction (NEC), as a complementary task of NER, plays a crucial role in ensuring the accuracy of location names and further improving the accuracy of NER. Despite numerous methods having been adopted for NER, including text statistics-based and deep learning-based methods, there has been limited research on NEC. To address this gap, we propose the CTRE model, a geospatial named entity recognition and correction model based on the BERT model framework. Our approach enhances the BERT model by introducing incremental pre-training in the pre-training phase, significantly improving the model’s recognition accuracy. Subsequently, we adopt the pre-training fine-tuning mode of the BERT base model and extend the fine-tuning process, incorporating a neural network framework to construct the geospatial named entity recognition model and the geospatial named entity correction model, respectively. The BERT model utilizes data augmentation of VGI (volunteered geographic information) data and social media data for incremental pre-training, leading to an enhancement in model accuracy from 85% to 87%. The F1 score of the geospatial named entity recognition model reaches an impressive 0.9045, while the precision of the geospatial named entity correction model achieves 0.9765. The experimental results robustly demonstrate the effectiveness of our proposed CTRE model, providing a reference for subsequent research on location names.
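A minimal sketch of the general pattern of BERT-based geospatial NER fine-tuning, assuming the Hugging Face transformers library: an encoder plus a token-classification head. The checkpoint name, label set, and example sentence are placeholders, and the head here is untrained; this is not the CTRE model itself.

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-LOC", "I-LOC"]                     # placeholder label set
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels))

enc = tok("暴雨导致杭州市西湖区多处道路积水", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                     # (1, seq_len, num_labels)
pred = logits.argmax(-1)[0].tolist()                 # per-token label ids
print([labels[i] for i in pred])                     # predictions of the untrained head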
... However, a good set of lexical-semantic patterns makes it possible to recognize in texts not only toponyms and their relations, but also the attributes of those relations. In addition, taking into account the peculiarities of the subject area when developing lexical-semantic patterns has a positive effect on the final recognition result [13][14][15]. ...
Chapter
This paper is devoted to the problem of processing natural language texts to recognize spatial information. It describes a technology for recognizing named entities and extracting information about the spatial connectivity of geographical objects from natural language texts. A hybrid approach is used to process the texts: it combines the capabilities of a neural network approach, a rule-based approach and a lexico-semantic pattern-based approach within a single information technology. A peculiarity of the proposed technology is its original way of forming and taking into account the context for recognizing named entities and spatial relationships of geographical objects. The paper also describes a way of creating and using lexico-semantic patterns that takes into account the peculiarities of the language and the subject area. This work is a development of a technology for recognizing and extracting geo-attributed entities from natural language texts and the groundwork for creating a cognitive geovisualization technology that takes into account the specifics of information perception by the end user. One of the prospects of the considered hybrid technology is its use as one of the main elements of a cognitive geographical user interface.
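As a toy illustration of what a lexico-semantic pattern for spatial connectivity can look like (an invented English pattern; the patterns in the paper target Russian and are considerably richer):

import re

# "<toponym> is located <distance> km <direction> of <toponym>"
PATTERN = re.compile(
    r"(?P<obj>[A-Z][A-Za-z-]+) is located "
    r"(?P<dist>\d+) km (?P<dir>north|south|east|west) of "
    r"(?P<ref>[A-Z][A-Za-z-]+)"
)

m = PATTERN.search("Pushkino is located 30 km north of Moscow")
if m:
    # toponym, relation attribute (direction, distance), reference toponym
    print(m.group("obj"), m.group("dir"), m.group("dist"), m.group("ref"))
    # Pushkino north 30 Moscow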
... This method can compensate for the shortcomings of other methods that cannot identify low-frequency terms, using learned models to estimate whether a word string is a term. Common machine learning methods include the maximum entropy model 28 and the conditional random field model [29][30][31]. However, machine learning-based methods place high demands on the scale and quality of the training corpus, and a large-scale manually annotated corpus is required as training data. ...
Article
Full-text available
China's technology is developing rapidly, and the number of patent applications has surged. There is therefore an urgent need for technical managers and researchers to apply computer technology to in-depth mining and analysis of large numbers of Chinese patent documents, in order to use patent information efficiently, support technological innovation and avoid R&D risks. Automatic term extraction is the basis of patent mining and analysis, but many existing approaches focus on extracting domain terms in English and are difficult to extend to Chinese because of the distinctions between the Chinese and English languages. At the same time, some common Chinese technical terminology extraction methods focus on high-frequency characteristics, while the technical-domain correlation characteristic and the unithood feature of terminology receive less attention. To address these problems, this paper proposes a Chinese technical terminology extraction method based on DC-value and information entropy to achieve automatic extraction of technical terminology from Chinese patents. The empirical results show that the presented algorithm can effectively extract technical terminology from Chinese patent literature and performs better than the C-value method, the log-likelihood ratio method and the mutual information method, which has both theoretical significance and practical application value.
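For reference, the classical C-value baseline that the paper compares against is commonly defined as follows (the standard formulation, not the DC-value variant proposed in the paper):

\[
\mathrm{C\text{-}value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested in longer candidates},\\[4pt]
\log_2 |a| \left( f(a) - \dfrac{1}{|T_a|} \displaystyle\sum_{b \in T_a} f(b) \right), & \text{otherwise},
\end{cases}
\]

where |a| is the number of words in candidate term a, f(a) is its corpus frequency, and T_a is the set of longer candidate terms that contain a.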
... While there is a fair amount of research effort studying multi-annotation schemes for the task of NER in languages such as English, Spanish, Dutch, Czech [4], Greek [5], Russian [6], and Punjabi [7], the Arabic language suffers from a lack of such efforts. This work is an attempt to rectify this shortcoming by providing the Arabic NLP research community with a dataset designated for NER tasks in the medical domain. ...
Article
Full-text available
This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition tasks for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated and reached 95.14%. Due to the lack of research efforts in the literature dedicated to studying Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six more annotation schemes that will bridge the gap by allowing researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features, including part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article, are provided for each record in the dataset.
... Automatic named entity recognition (NER) is one of the basic tasks in natural language processing. The majority of well-known NER datasets consist of news documents with three types of named entities labeled: persons, organizations, and locations [1,2]. For these types of named entities, the state-of-the-art NER methods usually give impressive results. ...
Chapter
The paper presents the results of applying the BERT representation model in the named entity recognition task for the cybersecurity domain in Russian. Several variants of the model were investigated. The best results were obtained using the BERT model, trained on the target collection of information security texts. We also explored a new form of data augmentation for the task of named entity recognition.
... Bidirectional recurrent neural networks (RNN) and conditional random fields (CRF) are considered to be among the most powerful models for sequence modeling [4][5][6][7][8][9][10][11][12][13], each one having its own advantages and disadvantages. In a direct RNN application, especially with LSTM or GRU cells, one can get a better model for long sequences of inputs, but the RNN output (a softmax layer) will classify every tag independently. ...
Article
Full-text available
Adverse drug reactions (ADRs) are an essential part of the analysis of drug use, measuring drug use benefits, and making policy decisions. Traditional channels for identifying ADRs are reliable but very slow and only produce a small amount of data. Text reviews, either on specialized web sites or in general-purpose social networks, may lead to a data source of unprecedented size, but identifying ADRs in free-form text is a challenging natural language processing problem. In this work, we propose a novel model for this problem, uniting recurrent neural architectures and conditional random fields. We evaluate our model with a comprehensive experimental study, showing improvements over state-of-the-art methods of ADR extraction.
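A minimal sketch of the kind of recurrent-network-plus-CRF tagger discussed above, assuming PyTorch and the third-party pytorch-crf package; this is a generic BiLSTM-CRF, not the authors' exact architecture.

import torch
import torch.nn as nn
from torchcrf import CRF   # third-party "pytorch-crf" package, assumed installed

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden, num_tags)       # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)    # transition scores + decoding

    def loss(self, tokens, tags, mask):
        # mask: bool tensor marking real (non-padding) tokens
        emissions = self.emit(self.lstm(self.emb(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.emit(self.lstm(self.emb(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # best tag sequence per sentence

model = BiLSTMCRF(vocab_size=5000, num_tags=7)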
Book
Full-text available
4th Industrial Revolution (4IR) technologies have assumed critical importance in the oil and gas industry, enabling data analysis and automation at unprecedented levels. Formation evaluation and reservoir monitoring are crucial areas for optimizing reservoir production, maximizing sweep efficiency and characterizing reservoirs. Automation, robotics and artificial intelligence (AI) have led to tremendous transformations in these areas, in particular in subsurface sensing. We present a novel 4IR-inspired framework for real-time sensor selection for subsurface pressure and temperature monitoring, as well as reservoir evaluation. The framework encompasses a deep learning technique for sensor data uncertainty estimation, which is then integrated into an integer programming framework for the optimal selection of sensors to monitor CO2 penetration in the reservoir. The results are rather promising, showing that a relatively small number of sensors can be utilized to properly monitor the fractured reservoir structure.
Preprint
Named entity recognition (NER) is a fundamental and important task in NLP, aiming at identifying named entities (NEs) from free text. Recently, since the multi-head attention mechanism applied in the Transformer model can effectively capture longer contextual information, Transformer-based models have become the mainstream methods and have achieved significant performance in this task. Unfortunately, although these models can capture effective global context information, they are still limited in the local feature and position information extraction, which is critical in NER. In this paper, to address this limitation, we propose a novel Hero-Gang Neural structure (HGN), including the Hero and Gang module, to leverage both global and local information to promote NER. Specifically, the Hero module is composed of a Transformer-based encoder to maintain the advantage of the self-attention mechanism, and the Gang module utilizes a multi-window recurrent module to extract local features and position information under the guidance of the Hero module. Afterward, the proposed multi-window attention effectively combines global information and multiple local features for predicting entity labels. Experimental results on several benchmark datasets demonstrate the effectiveness of our proposed model.
Article
Full-text available
The conditional random fields (CRF) model plays an important role in the machine learning field. Driven by the development of artificial intelligence, CRF models have advanced considerably. To analyze this recent development, this paper presents a comprehensive review of different versions of CRF models and their applications. After elaborating on the background and definition of CRFs, it analyzes three basic problems faced by CRF models and reviews their latest improvements. On that basis, it presents applications of CRFs in natural language processing, computer vision, biomedicine, Internet intelligence and other relevant fields. Finally, specific analyses and future directions of CRFs are discussed.
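For context, the linear-chain CRF used throughout this line of work defines the conditional probability of a tag sequence y given an observation sequence x as (standard definition):

\[
p(\mathbf{y}\mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\exp\Bigg(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k\big(y_{t-1}, y_t, \mathbf{x}, t\big)\Bigg),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'}\exp\Bigg(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k\big(y'_{t-1}, y'_t, \mathbf{x}, t\big)\Bigg),
\]

where the f_k are feature functions (for example, gazetteer or context indicators) and the λ_k are learned weights.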
Chapter
Large texts that analyze a situation in some domain, for example politics or the economy, are usually full of opinions. In analytical articles, opinions are usually a kind of attitude whose source and target are presented as named entities, both mentioned in the text. We present an application of a specific neural network model to sentiment attitude extraction. The problem is treated as a three-class machine learning task over whole documents. Treating text-level attitudes as a list of related contexts, we first extract the relevant sentiment contexts and then compute the resulting attitude label. For sentiment context extraction, we use a Piecewise Convolutional Neural Network (PCNN). We experiment with a variety of functions for composing the attitude label, including a recurrent neural network, which makes it possible to take additional context aspects into account. For the experiments, the RuSentRel corpus was used; it contains Russian analytical texts in the domain of international relations.
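The distinguishing step of a PCNN is piecewise max pooling: the convolution output is pooled separately over the three segments delimited by the two entity mentions. A small illustrative sketch (entity positions and shapes are assumptions, not the paper's exact implementation):

import torch

def piecewise_max_pool(conv_out, e1_pos, e2_pos):
    # conv_out: (seq_len, channels); pool each of the three segments separately.
    parts = [conv_out[:e1_pos + 1],
             conv_out[e1_pos + 1:e2_pos + 1],
             conv_out[e2_pos + 1:]]
    pooled = [p.max(dim=0).values for p in parts if p.numel() > 0]
    return torch.cat(pooled)          # up to 3 * channels features per context

piecewise_max_pool(torch.randn(12, 64), e1_pos=2, e2_pos=8).shape  # torch.Size([192])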
Chapter
In this paper we study the task of extracting sentiment attitudes from analytical texts. We experiment with the RuSentRel corpus, which contains annotated Russian analytical texts in the sphere of international relations. Each document in the corpus is annotated with sentiments from the author toward mentioned named entities, and with attitudes between the mentioned entities. We consider the problem of extracting sentiment relations between entities for whole documents as a three-class machine learning task.
Article
This paper considers various features for extracting named entities from Russian texts that are used within machine learning-based approaches, including features of the token itself (lexeme) as well as vocabulary, contextual, cluster, and two-stage features. The contribution of each feature to improving the quality of named entity extraction is studied. A CRF classifier is used as the machine learning method in the experiments described in this paper. The contributions of the features are compared on two open collections using the F-measure.
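An illustrative per-token feature dictionary of the kind such CRF taggers consume: token-level, dictionary (gazetteer), cluster and context features. The paper's actual feature set is richer; the feature names and arguments below are assumptions for the sketch.

def token_features(tokens, i, gazetteer=(), clusters=None):
    clusters = clusters or {}
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "suffix3": w.lower()[-3:],
        "in_gazetteer": w.lower() in gazetteer,                   # vocabulary feature
        "cluster": clusters.get(w.lower(), "NA"),                 # word-cluster feature
        "prev.lower": tokens[i - 1].lower() if i > 0 else "BOS",  # left context
        "next.lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "EOS",  # right context
    }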
Conference Paper
Full-text available
We present a new named entity recognizer for the Czech language. It reaches 82.82 F-measure on the Czech Named Entity Corpus 1.0 and significantly outperforms previously published Czech named entity recognizers. On the English CoNLL-2003 shared task, we achieved 89.16 F-measure, reaching results comparable to the English state of the art. The recognizer is based on a Maximum Entropy Markov Model, and a Viterbi algorithm decodes the optimal sequence labeling using probabilities estimated by a maximum entropy classifier. The classification features utilize morphological analysis, two-stage prediction, word clustering and gazetteers.
Conference Paper
Full-text available
Current research efforts in Named Entity Recognition deal mostly with the English language. Even though the interest in multi-language Information Extraction is growing, there are only a few works reporting results for the Russian language. This paper introduces quality baselines for the Russian NER task. We propose a corpus which was manually annotated with organization and person names. The main purpose of this corpus is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The first one comprises several components: dictionary matching, pattern matching and rule-based search for lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we utilized our implementation of a linear-chain CRF which uses a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1 measure), although they do not employ morphological or syntactic analysis.
Conference Paper
Full-text available
In this paper we discuss algorithms for clustering words into classes from unlabelled text using unsupervised algorithms, based on distributional and morphological information. We show how the use of morphological information can improve the performance on rare words, and that this is robust across a wide range of languages.
Article
Full-text available
We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
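In class-based n-gram models of this kind, the word bigram probability factors through word classes; a standard formulation (a sketch of the general idea, not this paper's exact estimation procedure) is:

\[
P(w_i \mid w_{i-1}) \;=\; P\big(c(w_i) \mid c(w_{i-1})\big)\, P\big(w_i \mid c(w_i)\big),
\]

where c(w) denotes the class assigned to word w; class bigram statistics are far less sparse than word bigram statistics.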
Conference Paper
Full-text available
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
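A toy sketch of annealed Gibbs sampling over a tag sequence (not the system in the paper): score(tags) can be any function returning an unnormalized log-score of a complete tag sequence, so non-local consistency constraints can be folded into it, which is exactly what Viterbi decoding cannot do.

import math, random

def gibbs_decode(score, seq_len, num_tags, iters=500, temp0=2.0):
    tags = [random.randrange(num_tags) for _ in range(seq_len)]
    for it in range(iters):
        temp = max(temp0 * (1.0 - it / iters), 1e-3)   # simple annealing schedule
        i = random.randrange(seq_len)
        # resample position i from the conditional induced by `score`
        logits = [score(tags[:i] + [t] + tags[i + 1:]) / temp for t in range(num_tags)]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        r, acc = random.random() * sum(weights), 0.0
        for t, w in enumerate(weights):
            acc += w
            if r <= acc:
                tags[i] = t
                break
    return tags

# Example: a crude non-local score that rewards runs of identical tags.
best = gibbs_decode(lambda ts: sum(a == b for a, b in zip(ts, ts[1:])),
                    seq_len=10, num_tags=3)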
Article
Full-text available
The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”.
Article
The paper describes the structure and current state of RuThes, a thesaurus of the Russian language constructed as a linguistic ontology. We compare the RuThes structure with the WordNet structure, describe the principles for inclusion of multiword expressions and the types of relations, and present experiments and applications based on RuThes. RuThes has long been developed within various NLP and information retrieval projects, and it has now become available for public use.
Conference Paper
We analyze some of the fundamental design challenges and misconceptions that underlie the development of an efficient and robust NER system. In particular, we address issues such as the representation of text chunks, the inference approach needed to combine local NER decisions, the sources of prior knowledge and how to use them within an NER system. In the process of comparing several solutions to these challenges we reach some surprising conclusions, as well as develop an NER system that achieves 90.8 F1 score on the CoNLL-2003 NER shared task, the best reported result for this dataset.
Conference Paper
In this paper we analyse the importance of data generalisation and the usage of local context in the problem of proper name recognition. We present an extended set of features that provides a generalised description of the data and encodes linguistic information. To utilise this rich set of features we applied Conditional Random Fields (CRF), a modern approach to sequence labelling. We present results of an evaluation on a single domain following a cross-validation scheme and a cross-domain evaluation based on training and testing on different corpora. We show that the extended set of features improves the final results for CRF and that this approach outperforms Hidden Markov Models (HMM). On the single domain CRF obtained 92.53% F-measure for 5 categories of proper names, and 67.72% and 72.62% F-measure for the other two corpora in the cross-domain evaluation.
Article
We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
Named entity recognition: exploring features
  • M Tkachenko
  • A Simanovsky
Person name recognition in news articles based on the persons-1000/1111-F collections
  • I V Trofimov
Conditional random field models for the processing of Russian
  • A Y Antonova
  • A N Soloviev
The message about Russian collection for named entity recognition task
  • N A Vlasova
  • E A Suleimanova
  • I V Trofimov
Design challenges and misconceptions in named entity recognition
  • L Ratinov
  • D Roth
Persons recognition using CRF model
  • A V Podobryaev
Rich set of features for proper name recognition in Polish texts
  • M Marcińczuk
  • M Stanek
  • M Piasecki
  • A Musiał
  In: P Bouvry, MA Kłopotek, F Leprévost, M Marciniak, A Mykowiecka, H Rybiński (eds.)