Conference Paper

Name Tagging for Low-resource Incident Languages based on Expectation-driven Learning


... The name tagging results on the LORELEI data set are presented in Table 5. We can see that our approach advances state-of-the-art language-independent methods (Zhang et al., 2016a; Tsai et al., 2016) on the same data sets for most languages, and achieves 6.5%-17.6% lower F-scores than the models trained from manually annotated gold-standard documents that include thousands of name mentions. To fill in this gap, we would need to exploit more linguistic resources. ...
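As a point of reference for the F-scores quoted in these excerpts, the sketch below shows one common way to score a name tagger: a predicted mention counts as correct only if its span and type both match a gold mention. The exact-match criterion, the `(start, end, type)` tuple layout, and the function name `name_tagging_f1` are assumptions made for illustration; the excerpt does not state which scorer the authors used.

```python
def name_tagging_f1(gold, predicted):
    """Exact-match F1 over name mentions.

    Each mention is a hypothetical (start, end, entity_type) tuple.
    A prediction is correct only if all three fields match a gold mention.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data: three gold mentions, two predictions, one of them mistyped.
gold_mentions = {(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")}
predicted_mentions = {(0, 2, "PER"), (5, 6, "ORG")}
score = name_tagging_f1(gold_mentions, predicted_mentions)
```

Here one of two predictions is correct (precision 1/2) and one of three gold mentions is found (recall 1/3), so the mistyped location counts against both.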
... Multi-lingual name tagging: Some recent research (Zhang et al., 2016a; Littell et al., 2016; Tsai et al., 2016) under the DARPA LORELEI program focused on developing name tagging techniques for low-resource languages. These approaches require English annotations for projection (Tsai et al., 2016), or some input from a native speaker, either through manual annotations (Littell et al., 2016) or a linguistic survey (Zhang et al., 2016a). Without using any manual annotations, our name taggers outperform previous methods on the same data sets for many languages. ...
... By the way, NER in the high-resource language (i.e., English) is also called high-resource NER. English NER models [11] with good performance are pretrained thanks in part to these sufficient labeled resources. By contrast, the languages other than English are still not fully studied due to the lack of labeled data. ...
... We use the model proposed in Ref. [11], which is one of the state-of-the-art English NER methods. This model utilizes a BiLSTM network as a character-level language model (CharLM) to capture contextual information. ...
Article
In recent years, great success has been achieved in many tasks of natural language processing (NLP), e.g., named entity recognition (NER), especially in the high-resource language, i.e., English, thanks in part to the considerable amount of labeled resources. More labeled resources yield better word representations. However, most low-resource languages do not have such an abundance of labeled data as high-resource English, leading to poor performance of NER in these low-resource languages due to poor word representations. In this paper, we propose the converse attention network (CAN) to augment word representations in low-resource languages from the high-resource language, improving the performance of NER in low-resource languages by transferring knowledge learned in the high-resource language. CAN first translates sentences in low-resource languages into high-resource English using an attention-based translation module. In the process of translation, CAN obtains the attention matrices that align word representations of the high-resource language space and the low-resource language space. Furthermore, CAN augments word representations learned in the low-resource language space with word representations learned in the high-resource language space using the attention matrices. Experiments on four low-resource NER datasets show that CAN achieves consistent and significant performance improvements, which indicates the effectiveness of CAN.
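The augmentation step described in this abstract can be illustrated with a minimal sketch. It assumes dot-product attention and simple concatenation; the abstract only says that attention matrices from a translation module align the two spaces, so the scoring function, the concatenation, and the name `augment_with_attention` are hypothetical stand-ins rather than the CAN architecture itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def augment_with_attention(low_vecs, high_vecs):
    """Append an attention-weighted high-resource context to each low-resource vector.

    Assumption: attention weights come from dot products between the two
    sets of vectors. In the abstract they come from a translation module.
    """
    dim = len(high_vecs[0])
    augmented = []
    for query in low_vecs:
        scores = [sum(q * k for q, k in zip(query, key)) for key in high_vecs]
        weights = softmax(scores)  # one row of the attention matrix
        context = [
            sum(w * key[j] for w, key in zip(weights, high_vecs))
            for j in range(dim)
        ]
        augmented.append(list(query) + context)  # concatenation is an assumption
    return augmented


# Toy example: two low-resource tokens, three English tokens, dimension 2.
low = [[1.0, 0.0], [0.0, 1.0]]
high = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = augment_with_attention(low, high)
```

Each output row keeps the original low-resource vector in its first half and carries the aligned high-resource context in its second half, so a downstream tagger would see both.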
... It is challenging to perform entity extraction across a massive variety of languages because most languages don't have sufficient data to train a machine learning model. To tackle the low-resource challenge, we developed creative methods of deriving noisy training data from Wikipedia, exploiting non-traditional language-universal resources (Zhang et al., 2016) and cross-lingual transfer learning (Cheung et al., 2017). ...
... Some recent work has also focused on low-resource name tagging (Tsai et al., 2016; Littell et al., 2016; Zhang et al., 2016; Yang et al., 2017) and cross-lingual entity linking (McNamee et al., 2011; Spitkovsky and Chang, 2011; Sil and Florian, 2016), but the system demonstrated in this paper is the first publicly available end-to-end system to perform both tasks and all of the 282 Wikipedia languages. ...
... In this dataset, we also focus on four entity types (Person, Location, Organization, Others). These four entity types are commonly adopted in previous name tagging studies [14,15], especially for low resource language name tagging [34]. In addition, we fed low resource language words to Bing (http://cn.bing.com/dict/) ...
... This method can jointly infer bilingual named entities without using any annotated bilingual corpus. In addition, Zhang et al. [34] conducted a thorough study for low resource language name tagging and on various ways of acquiring, encoding and composing expectations from multiple non-traditional sources. Experiments demonstrate that this framework can be used to build a promising name tagger for a new IL within a few hours. ...
Article
Neural networks have been widely used for English name tagging and have delivered state-of-the-art results. However, for low resource languages, due to the limited resources and lack of training data, taggers tend to have lower performance, in comparison to the English language. In this paper, we tackle this challenging issue by incorporating multi-level cross-lingual knowledge as attention into a neural architecture, which guides low resource name tagging to achieve a better performance. Specifically, we regard entity type distribution as language independent and use bilingual lexicons to bridge cross-lingual semantic mapping. Then, we jointly apply word-level cross-lingual mutual influence and entity-type level monolingual word distributions to enhance low resource name tagging. Experiments on three languages demonstrate the effectiveness of this neural architecture: for Chinese, Uzbek, and Turkish, we are able to yield significant improvements in name tagging over all previous baselines.
... This is a major reason for inclusion in our experiments. We use the same set of test documents as used in Zhang et al. (2016). All other documents in the REFLEX and LORELEI packages are used as the training documents in our monolingual experiments. ...
... For the low-resource languages, we compare our direct transfer model with the expectation learning model proposed in Zhang et al. (2016). This model is not a direct transfer model, but it does not use any training data in the target languages either. ...
... LSTMs are variants of RNNs that can cope with long distance dependencies in the text, and for many applications it is beneficial to have access to left and right context in the sentence through bi-directional LSTMs [20,21]. Moreover, the reference model for several state-of-the-art NER implementations in the English language is the bidirectional LSTM (BiLSTM)-CRF model by Lample et al. [22-24]. Some implementations combine LSTM units with convolutional layers [24,25], and other architectures such as Bidirectional Encoder Representations from Transformers (BERT) [26] have been proposed for several NLP tasks, including NER. ...
Article
Full-text available
Background: Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although currently there are several anonymization strategies for the English language, they are also language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages. Results: We tested 4 neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. Alongside, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. Conclusions: The strategy proposed, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports. It does not require a big training corpus, thus it could be easily extended to other languages and medical texts, such as electronic health records.
... A number of approaches explored document-level features (e.g., temporal and co-occurrence patterns) for event extraction (Chambers and Jurafsky, 2008;Ji and Grishman, 2008;Liao and Grishman, 2010;Do et al., 2012;McClosky and Manning, 2012;Berant et al., 2014;Yang and Mitchell, 2016). Other approaches leveraged features from external resources (e.g., Wiktionary or FrameNet) for low resource name tagging and event extraction (Li et al., 2013;Huang et al., 2016;Liu et al., 2016;Zhang et al., 2016;Cotterell and Duh, 2017;Huang et al., 2018). Yaghoobzadeh and Schütze (2016) aggregated corpus-level contextual information of each entity to predict its type and Narasimhan et al. (2016) incorporated contexts from external information sources (e.g., the documents that contain the desired information) to resolve ambiguities. ...
Preprint
Many name tagging approaches use local contextual information with much success, but fail when the local context is ambiguous or limited. We present a new framework to improve name tagging by utilizing local, document-level, and corpus-level contextual information. We retrieve document-level context from other sentences within the same document and corpus-level context from sentences in other topically related documents. We propose a model that learns to incorporate document-level and corpus-level contextual information alongside local contextual information via global attentions, which dynamically weight their respective contextual information, and gating mechanisms, which determine the influence of this information. Extensive experiments on benchmark datasets show the effectiveness of our approach, which achieves state-of-the-art results for Dutch, German, and Spanish on the CoNLL-2002 and CoNLL-2003 datasets.
... Although we did not discuss them in the previous sections, there are also other ways to work with small batches of data, such as active learning [23,24]. However, as we emphasized in Section 1, the annotators do not iterate on the early batches of the data set, hence we do not consider learning methods derived from those strategies as our baselines. ...
Preprint
In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than the later counterparts. To tackle the situation, we utilize a pre-trained transformer network to preserve and integrate the most salient document information from the earlier batches while focusing on the annotation (presumably with higher quality) from the current batch. Using event extraction as a case study, we demonstrate in the experiments that our proposed framework can perform better than conventional approaches (the improvement ranges from 3.6 to 14.9% absolute F-score gain), especially when there is more noise in the early annotation; and our approach spares 19.1% time with regard to the best conventional method.
... In order to compare with prior work, we used the train/test split from Zhang et al. (2016). We removed mjakob et al., 2018), and presented it to two non-Bengali speaking annotators using the TALEN interface (Mayhew and Roth, 2018). ...
... The above issue motivates a lot of work on name tagging in low-resource languages or domains. A typical line of effort focuses on introducing external knowledge via transfer learning (Fritzler et al., 2018;Hofer et al., 2018), such as the use of crossdomain (Yang et al., 2017), cross-task (Peng and Dredze, 2016;Lin et al., 2018) and cross-lingual resources (Ni et al., 2017;Xie et al., 2018;Zafarian et al., 2015;Zhang et al., 2016;Mayhew et al., 2017;Tsai et al., 2016;Feng et al., 2018;Pan et al., 2017). Although they achieve promising results, there are a large amount of weak annotations on the Web, which have not been well studied (Nothman et al., 2008;Ehrmann et al., 2011). ...
... The problem of name tagging in low-resource languages has received real attention within the last few years. For example, (Zhang et al., 2016) use a variety of non-traditional linguistic resources in order to train a name tagger for use in low-resource languages. ( and (Tsai et al., 2016) both rely on Wikipedia to provide data for training name tagging models for all Wikipedia languages. ...
... Training models on low-resource named entity recognition tasks has been shown to be a challenge [17], especially in industrial applications where deploying updated models is a continuous effort and crucial for business operations. Often in such cases, there is an abundance of unlabeled data; however, labeled data is scarce or unavailable. ...
Preprint
Training models on low-resource named entity recognition tasks has been shown to be a challenge, especially in industrial applications where deploying updated models is a continuous effort and crucial for business operations. Often in such cases, there is an abundance of unlabeled data; however, labeled data is scarce or unavailable. Pre-trained language models trained to extract contextual features from text were shown to improve many natural language processing (NLP) tasks, including scarcely labeled tasks, by leveraging transfer learning. However, such models impose a heavy memory and computational burden, making it a challenge to train and deploy such models for inference use. In this work-in-progress we combine the effectiveness of transfer learning provided by pre-trained masked language models with a semi-supervised approach to train a fast and compact model using labeled and unlabeled examples. Preliminary evaluations show that the compact models achieve competitive accuracy compared to state-of-the-art pre-trained language models with up to 36x compression rate and run significantly faster in inference, thus allowing deployment of such models in production environments or on edge devices.
Preprint
Full-text available
Supervised machine learning assumes the availability of fully-labeled data, but in many cases, such as low-resource languages, the only data available is partially annotated. We study the problem of Named Entity Recognition (NER) with partially annotated training data in which a fraction of the named entities are labeled, and all other tokens, entities or otherwise, are labeled as non-entity by default. In order to train on this noisy dataset, we need to distinguish between the true and false negatives. To this end, we introduce a constraint-driven iterative algorithm that learns to detect false negatives in the noisy set and downweigh them, resulting in a weighted training set. With this set, we train a weighted NER model. We evaluate our algorithm with weighted variants of neural and non-neural NER models on data in 8 languages from several language and script families, showing strong ability to learn from partial data. Finally, to show real-world efficacy, we evaluate on a Bengali NER corpus annotated by non-speakers, outperforming the prior state-of-the-art by over 5 points F1.
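The idea of training on a weighted set can be sketched as a per-token weighted negative log-likelihood, where tokens suspected of being false negatives receive a weight below 1. This is only an illustration of the weighting step: the constraint-driven procedure that produces the weights is not described in enough detail here to reproduce, and the function name `weighted_nll` and the dictionary-based probability format are assumptions.

```python
import math


def weighted_nll(token_probs, gold_labels, weights):
    """Average negative log-likelihood with one weight per token.

    token_probs: list of dicts mapping label -> model probability.
    gold_labels: the (possibly noisy) label assigned to each token.
    weights: values in [0, 1]; a low weight downweights a suspected
             false negative so that it contributes little to the loss.
    """
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # nothing to learn from if every token is masked out
    loss = 0.0
    for probs, label, weight in zip(token_probs, gold_labels, weights):
        loss += -weight * math.log(probs[label])
    return loss / total_weight


# Toy example: the second token is labeled "O" by default, but the model
# believes it is a name, so its weight is reduced to zero.
probs = [{"O": 0.5, "PER": 0.5}, {"O": 0.25, "PER": 0.75}]
labels = ["PER", "O"]
trusted_loss = weighted_nll(probs, labels, [1.0, 0.0])
naive_loss = weighted_nll(probs, labels, [1.0, 1.0])
```

With the second token fully downweighted, only the first token contributes, so the loss is lower than when the default "O" label is trusted at full weight.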
... the annotated corpora is small, especially in the low resource scenario (Zhang et al., 2016), the performance of these methods degrades significantly since the hidden feature representations cannot be learned adequately. ...
Preprint
Full-text available
Name tagging in low-resource languages or domains suffers from inadequate training data. Existing work heavily relies on additional information, while leaving those noisy annotations unexplored that extensively exist on the web. In this paper, we propose a novel neural model for name tagging solely based on weakly labeled (WL) data, so that it can be applied in any low-resource settings. To take the best advantage of all WL sentences, we split them into high-quality and noisy portions for two modules, respectively: (1) a classification module focusing on the large portion of noisy data can efficiently and robustly pretrain the tag classifier by capturing textual context semantics; and (2) a costly sequence labeling module focusing on high-quality data utilizes Partial-CRFs with non-entity sampling to achieve global optimum. Two modules are combined via shared parameters. Extensive experiments involving five low-resource languages and fine-grained food domain demonstrate our superior performance (6% and 7.8% F1 gains on average) as well as efficiency.
... Other studies incorporate a character-level CNN (Ma and Hovy, 2016), global contexts, or language models (Peters et al., 2017, 2018; Devlin et al., 2018) to improve name tagging. In addition, several approaches (Zhang et al., 2016a, 2017a; Al-Badrashiny et al., 2017) attempt to incorporate hand-crafted linguistic features into a Bi-LSTM-CRF to improve low-resource name tagging performance. Recent attempts on cross-lingual transfer for name tagging can be divided into two categories: the first projects annotations from a source language to a target language via parallel corpora (Yarowsky et al., 2001; Zhang et al., 2016b; Fang and Cohn, 2016; Ehrmann et al., 2011; Enghoff et al., 2018; Ni et al., 2017), a bilingual gazetteer (Feng et al., 2017; Zirikly and Hagiwara, 2015), Wikipedia anchor links (Kim et al., 2012; Nothman et al., 2013; Tsai et al., 2016), and language universal representations, including Unicode bytes (Gillick et al., 2016) and cross-lingual word embeddings (Fang and Cohn, 2017; Wang et al., 2017; Xie et al., 2018). ...
... This low-resource scenario calls for new methods for gathering training data. Several works address this with automatic techniques (Tsai et al., 2016;Zhang et al., 2016;Mayhew et al., 2017), but often a good starting point is to elicit manual annotations from annotators who do not speak the target language. ...
... Current state-of-the-art approaches for NER usually base themselves on long short-term memory recurrent neural networks (LSTM RNNs) and a subsequent conditional random field (CRF) to predict the sequence labels (Huang et al., 2015). Performances of neural NER methods are compromised if the training data are not enough (Zhang et al., 2016). This problem is severe for many languages due to a lack of labeled datasets, e.g., German and Spanish. ...
Preprint
In recent years, great success has been achieved in the field of natural language processing (NLP), thanks in part to the considerable amount of annotated resources. For named entity recognition (NER), most languages do not have such an abundance of labeled data, so performance in those languages is comparatively lower. To improve the performance, we propose a general approach called Back Attention Network (BAN). BAN uses a translation system to translate sentences in other languages into English and utilizes a pre-trained English NER model to get task-specific information. After that, BAN applies a new mechanism named back attention knowledge transfer to improve the semantic representation, which aids in generation of the result. Experiments on three different language datasets indicate that our approach outperforms other state-of-the-art methods.
... Most previous active learning algorithms developed for NER tasks are based on one language and then applied to the language itself. Another main difference is that many active learning algorithms use a fixed data selection heuristic, such as uncertainty sampling (Settles and Craven, 2008; Stratos and Collins, 2015; Zhang et al., 2016). However, in our algorithm, we implicitly use uncertainty information as one kind of observation to the RL agent. ...
... Also relevant to our work are approaches for domain adaptation (Sun et al 2016) and active learning (Settles 2010) for extending NER approaches. Notably similar in concept to our approach for automatic annotation selection is Zhang et al (2016), in which the authors use active learning to train a CRF and automatically extracted rules to annotate documents. Table 1 describes the resources we used for named entity recognition and summarizes each resource's approximate size and the evaluation checkpoints in which it was used. ...
Article
Full-text available
We describe a multifaceted approach to named entity recognition that can be deployed with minimal data resources and a handful of hours of non-expert annotation. We describe how this approach was applied in the 2016 LoReHLT evaluation and demonstrate that both statistical and rule-based approaches contribute to our performance. We also demonstrate across many languages the value of selecting the sentences to be annotated when training on small amounts of data.
... Taking into account the data and time constraints we chose to develop a SF pipeline that works on translated English text, rather than one that processes the target language directly. An overview of the overall ELISA pipeline and the SF components is shown in Fig. 3. Describing the NER and MT components of the pipeline is beyond the scope of this paper, but more information about the MT components can be found in Papadopoulos et al. (2017) and about the NER components in Zhang et al. (2016) and and a more task oriented version of the pipeline can be found in Hermjakob et al. (2017). While this paper describes only the development and application of this pipeline to Situation Frame extraction from text, a simplified variant of the same approach was later adapted to perform the same task on speech audio. ...
Article
Full-text available
This paper describes the Situation Frame extraction pipeline developed by team ELISA as a part of the DARPA Low Resource Languages for Emergent Incidents program. Situation Frames are structures describing humanitarian needs, including the type of need and the location affected by it. Situation Frames need to be extracted from text or speech audio in a low resource scenario where little data, including no annotated data, are available for the target language. Our Situation Frame pipeline is the final step of the overall ELISA processing pipeline and accepts as inputs the outputs of the ELISA machine translation and named entity recognition components. The inputs are processed by a combination of neural networks to detect the types of needs mentioned in each document and a second post-processing step connects needs to locations. The resulting Situation Frame system was used during the first yearly evaluation on extracting Situation Frames from text, producing encouraging results and was later successfully adapted to the speech audio version of the same task.
Article
Active learning aims to select a small subset of data for annotation such that a classifier learned on the data is highly accurate. This is usually done using heuristic selection methods, however the effectiveness of such methods is limited and moreover, the performance of heuristics varies between datasets. To address these shortcomings, we introduce a novel formulation by reframing the active learning as a reinforcement learning problem and explicitly learning a data selection policy, where the policy takes the role of the active learning heuristic. Importantly, our method allows the selection policy learned using simulation on one language to be transferred to other languages. We demonstrate our method using cross-lingual named entity recognition, observing uniform improvements over traditional active learning.
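For contrast with a learned selection policy, a fixed heuristic of the kind this abstract refers to can be written in a few lines. The sketch below uses least-confidence uncertainty sampling, a common choice in the active learning literature; the function name `least_confidence_sample` and the input format (one list of class probabilities per unlabeled item) are assumptions for illustration, not taken from the paper.

```python
def least_confidence_sample(probabilities, k):
    """Return indices of the k items the model is least confident about.

    probabilities: one list of class probabilities per unlabeled item.
    Uncertainty is 1 minus the probability of the most likely class.
    Ties are broken by original index so the result is deterministic.
    """
    scored = [(1.0 - max(probs), index) for index, probs in enumerate(probabilities)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most uncertain first
    return [index for _, index in scored[:k]]


# Toy pool of three unlabeled items with binary predictions.
pool = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]]
to_annotate = least_confidence_sample(pool, k=2)
```

The rule is the same for every dataset and every round, which is the property the abstract identifies as limiting; a learned policy would replace the fixed scoring line with a decision made by the agent.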
Article
Mongolian named entity recognition (NER) is not only one of the most crucial and fundamental tasks in Mongolian natural language processing, but also an important step to improve the performance of downstream tasks such as information retrieval, machine translation, and dialog systems. However, traditional Mongolian NER models rely heavily on feature engineering. Even worse, the complex morphological structure of Mongolian words makes the data sparser. To alleviate the feature engineering and data sparsity in Mongolian named entity recognition, we propose a novel NER framework with Multi-Knowledge Enhancement (MKE-NER). Specifically, we introduce both linguistic knowledge through Mongolian morpheme representation and cross-lingual knowledge from a Mongolian-Chinese parallel corpus. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation and cross-lingual annotation projection. Experimental results demonstrate the effectiveness of our MKE-NER model, which outperforms strong baselines and achieves the best performance (94.04% F1 score) on the traditional Mongolian benchmark. In particular, extensive experiments with different data scales highlight the superiority of our method in low-resource scenarios.
Article
Full-text available
Studies on named entity recognition (NER) often require a substantial amount of human-annotated training data. This makes technical domain-specific NER from industry data especially challenging as labelled data are scarce. Despite English as the surface language, technical jargon and writing conventions used in technical documents render the low-resource language challenges where techniques such as transfer learning hardly work. Relieving labour intensive annotations using automatic labelling is thus an important research topic, seeking ways to obtain labelled data quickly and consistently. In this work, we propose an iterative deep learning NER framework using distant supervision for automatic labelling of domain-specific datasets. The framework is applied to mineral exploration reports and produced a large BIO-annotated dataset with six geological categories. This quality-labelled dataset, OzROCK, is made publicly available to support future research on technical domain NER. Experimental results demonstrated the effectiveness of this approach, further confirmed by domain experts. The generalisation ability is verified by applying the framework to two other datasets: one for disease names and the other for chemical names. Overall, our approach can effectively reduce annotation efforts by identifying a much smaller subset, that is challenging for automatic labelling thus requires attention from human experts.
Article
Full-text available
Despite the unavailability of national government support and perpetual scarcity of resource, there are indications that the number of natural language processing (NLP) studies on Nigerian languages is growing. This study collates an inventory of resources, potential institutional structures and financial sources that support the research of the Nigerian languages from published scientific articles. Relevant publications were systematically retrieved from Google, Web of Science and Scopus. Resources, research hubs and funding support data was collected from the contents of NLP publications. Only 14.28% of the retrieved documents shared 18 different resources online, while 27.95% of the articles were funded. Support for the NLP of Nigerian languages was significantly high from outside Nigeria. For instance, most of the funding sources were from outside Nigeria. Secondly, more papers were written by authors that were all affiliated with institutions outside Nigeria than authors from Nigeria. Thirdly, most of the resources were shared by authors that were affiliated with institutions outside Nigeria. Major research hubs on the NLP of Nigerian languages are the Obafemi Awolowo University, Nigeria; The University of Sheffield, United Kingdom; University of Uyo, Nigeria; Africa Languages Technology Initiative, Nigeria and University of Ibadan, Nigeria.
Chapter
Low resource Named Entity Recognition can be solved by transferring knowledge from a high to a low-resource language with shared multilingual embedding spaces. In this paper, we focus on the extreme low-resource NER scenario of unsupervised cross-lingual knowledge transfer, where no labelled training data or parallel corpus is available. We apply word-alignment with the contextualised word embedding and propose an efficient cross-lingual centroid-based space translation mechanism for contextual embedding. We found that the proposed alignment mechanism works well between different languages, compared to current state-of-the-art models. Moreover, word order differences are another problem to be resolved in cross-lingual NER. We alleviate this issue by incorporating a transformer, which relies entirely on an attention mechanism to draw global dependency between input and output. Our method was evaluated against state-of-the-art results, and the evaluation indicates that our approach was better in terms of the performance and the amount of resources. Keywords: Low resource NER; Cross-lingual knowledge transfer
Article
We propose a new architecture for addressing sequence labeling, termed Dual Adversarial Transfer Network (DATNet). Specifically, the proposed DATNet includes two variants, i.e., DATNet-F and DATNet-P, which are proposed to explore effective feature fusion between high and low resource. To address the noisy and imbalanced training data, we propose a novel Generalized Resource-Adversarial Discriminator (GRAD) and adopt adversarial training to boost model generalization. We investigate the effects of different components of DATNet across different domains and languages, and show that significant improvement can be obtained especially for low-resource data. Without augmenting any additional hand-crafted features, we achieve state-of-the-art performances on CoNLL, Twitter, PTB-WSJ, OntoNotes and Universal Dependencies with three popular sequence labeling tasks, i.e. Named entity recognition (NER), Part-of-Speech (POS) Tagging and Chunking.
Article
Full-text available
We describe novel approaches to tackling the problem of natural language processing for low-resource languages. The approaches are embodied in systems for name tagging and machine translation (MT) that we constructed to participate in the NIST LoReHLT evaluation in 2016. Our methods include universal tools, rapid resource and knowledge acquisition, rapid language projection, and joint methods for MT and name tagging.
Conference Paper
Full-text available
Linking named mentions detected in a source document to an existing knowledge base provides disambiguated entity referents for the mentions. This allows better document analysis, knowledge extraction and knowledge base population. Most of the previous research extensively exploited the linguistic features of the source documents in a supervised or semi-supervised way. These systems therefore cannot be easily applied to a new language or domain. In this paper, we present a novel unsupervised algorithm named Quantified Collective Validation that avoids excessive linguistic analysis on the source documents and fully leverages the knowledge base structure for the entity linking task. We show our approach achieves state-of-the-art English entity linking performance and demonstrate successful deployment in a new language (Chinese) and two new domains (Biomedical and Earth Science). All the experiment datasets and system demonstration are available at http://tw.rpi.edu/web/doc/hanwang_emnlp_2015 for research purpose.
Conference Paper
Full-text available
This work studies Named Entity Recognition (NER) for Catalan without making use of annotated resources of this language. The approach presented is based on machine learning techniques and exploits Spanish resources, either by first training models for Spanish and then translating them into Catalan, or by directly training bilingual models. The resulting models are retrained on unlabelled Catalan data using bootstrapping techniques. Exhaustive experimentation has been conducted on real data, showing competitive results for the obtained NER systems.
Conference Paper
Full-text available
Training a statistical named entity recognition system in a new domain requires costly manual annotation of large quantities of in-domain data. Active learning promises to reduce the annotation cost by selecting only highly informative data points. This paper is concerned with a real active learning experiment to bootstrap a named entity recognition system for a new domain of radio astronomical abstracts. We evaluate several committee-based metrics for quantifying the disagreement between classifiers built using multiple views, and demonstrate that the choice of metric can be optimised in simulation experiments with existing annotated data from different domains. A final evaluation shows that we gained substantial savings compared to a randomly sampled baseline.
Article
Full-text available
This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, there is a very small number of possible word forms with a given root word. However, languages like Turkish have very productive agglutinative morphology. Thus, it is an issue to build statistical models for specific tasks using the surface forms of the words, mainly because of the data sparseness problem. In order to alleviate this problem, we used additional syntactic information, i.e. the morphological structure of the words. We have successfully applied statistical methods using both the lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we have modeled the final inflectional groups of the words and combined it with the lexical model, and decreased the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) have been found to be more effective than using the surface forms of the words and we have achieved 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. Then we combined the contextual cues with the lexical model, and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, and reached an F-Measure of 91.56%, according to the MUC evaluation criteria. Our results are important in the sense that, using linguistic information, i.e. morphological analyses of the words, and a corpus large enough to train a statistical model significantly improves these basic information extraction tasks for Turkish.
Conference Paper
Full-text available
Bootstrapping is the process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier. It is often used in situations where labeled training data is scarce but unlabeled data is abundant. In this paper, we consider the problem of domain adaptation: the situation where training data may not be scarce, but belongs to a different domain from the target application domain. As the distribution of unlabeled data is different from the training data, standard bootstrapping often has difficulty selecting informative data to add to the training set. We propose an effective domain adaptive bootstrapping algorithm that selects unlabeled target domain data that are informative about the target domain and easy to automatically label correctly. We call these instances bridges, as they are used to bridge the source domain to the target domain. We show that the method outperforms supervised, transductive and bootstrapping algorithms on the named entity recognition task.
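The generic bootstrapping loop described in the first sentence of this abstract can be sketched as follows. This is a minimal illustration of standard self-training only, not the domain-adaptive "bridge" selection the paper proposes. The toy one-dimensional nearest-centroid classifier, the margin-based confidence, and the names `train_centroids`, `predict`, `bootstrap` and `min_margin` are all assumptions made for the example, and the sketch assumes at least two classes in the seed data.

```python
def train_centroids(labeled):
    """Fit a toy 1-D nearest-centroid classifier: one mean per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, margin); margin is how much closer the best centroid is than the runner-up."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def bootstrap(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Self-training: label confident items with the current model, add them, retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        scored = [(x, *predict(centroids, x)) for x in unlabeled]
        accepted = [(x, y) for x, y, margin in scored if margin >= min_margin]
        if not accepted:
            break  # nothing confident enough is left to add
        labeled.extend(accepted)
        accepted_xs = {x for x, _ in accepted}
        unlabeled = [x for x in unlabeled if x not in accepted_xs]
    return train_centroids(labeled), unlabeled


seed = [(0.0, "neg"), (10.0, "pos")]
pool = [1.0, 2.0, 8.0, 9.0, 5.1]
model, leftover = bootstrap(seed, pool)
```

In this toy run the ambiguous point 5.1 is never added, because its margin stays below `min_margin`; a domain-adaptive variant would change how such candidates are selected.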
Conference Paper
Full-text available
Named-entity recognition (NER) is an important task required in a wide variety of applications. While rule-based systems are appealing due to their well-known "explainability," most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. Motivated by these results, we explore the following natural question in this paper: Are rule-based systems still a viable approach to named-entity recognition? Specifically, we have designed and implemented a high-level language NERL on top of SystemT, a general-purpose algebraic information extraction system. NERL is tuned to the needs of NER tasks and simplifies the process of building, understanding, and customizing complex rule-based named-entity annotators. We show that these customized annotators match or outperform the best published results achieved with machine learning techniques. These results confirm that we can reap the benefits of rule-based extractors' explainability without sacrificing accuracy. We conclude by discussing lessons learned while building and customizing complex rule-based annotators and outlining several research directions towards facilitating rule development.
Conference Paper
Full-text available
This paper proposes a Hidden Markov Model (HMM) and an HMM-based chunk tagger, from which a named entity (NE) recognition (NER) system is built to recognize and classify names, times and numerical quantities. Through the HMM, our system is able to apply and integrate four types of internal and external evidences: 1) simple deterministic internal feature of the words, such as capitalization and digitalization; 2) internal semantic feature of important triggers; 3) internal gazetteer feature; 4) external macro context feature. In this way, the NER problem can be resolved effectively. Evaluation of our system on MUC-6 and MUC-7 English NE tasks achieves F-measures of 96.6% and 94.1% respectively. It shows that the performance is significantly better than reported by any other machine-learning system. Moreover, the performance is even consistently better than those based on handcrafted rules.
Conference Paper
Full-text available
A novel bootstrapping approach to Named Entity (NE) tagging using concept-based seeds and successive learners is presented. This approach only requires a few common noun or pronoun seeds that correspond to the concept for the targeted NE, e.g. for PERSON NE. The bootstrapping procedure is implemented as training two successive learners. First, decision list is used to learn the parsing-based NE rules. Then, a Hidden Markov Model is trained on a corpus automatically tagged by the first learner. The resulting NE system approaches supervised NE performance for some NE types.
Conference Paper
Full-text available
Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where segmentation granularity contradicts the results of morphological analysis and the building units of NEs, so that extraction of some NEs is inherently impossible in this setting. To cope with the unit problem, we propose a character-based chunking method. Firstly, the input sentence is analyzed redundantly by a statistical morphological analyzer to produce multiple (n-best) answers. Finally, a support vector machine-based chunker picks up some portions of the input sentence as NEs. This method introduces richer information to the chunker than previous methods that are based on a single morphological analysis result. We apply our method to the IREX NE extraction task. The cross-validation result, an F-measure of 87.2, shows the superiority and effectiveness of the method.
Conference Paper
Full-text available
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
Article
Full-text available
Named Entities provide critical information for many NLP applications. Named Entity recognition and classification (NERC) in text is recognized as one of the important sub-tasks of Information Extraction (IE). The seven papers in this volume cover various interesting and informative aspects of NERC research. Nadeau & Sekine provide an extensive survey of past NERC technologies, which should be a very useful resource for new researchers in this field. Smith & Osborne describe a machine learning model which tries to solve the over-fitting problem. Mazur & Dale tackle a common problem of NE and conjunction; as conjunctions are often a part of NEs or appear close to NEs, this is an important practical problem. A further three papers describe analyses and implementations of NERC for different languages: Spanish (Galicia-Haro & Gelbukh), Bengali (Ekbal, Naskar & Bandyopadhyay), and Serbian (Vitas, Krstev & Maurel). Finally, Steinberger & Pouliquen report on a real Web application where multilingual NERC technology is used to identify occurrences of people, locations and organizations in newspapers in different languages. The contributions to this volume were previously published in Lingvisticae Investigationes 30:1 (2007).
Conference Paper
Full-text available
In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s architecture and compare its performance with a supervised system. We experimentally evaluate the system on a standard corpus, with the three classical named-entity types, and also on a new corpus, with a new named-entity type (car brands).
Conference Paper
Full-text available
An entropy-based active learning scheme with support vector machines (SVMs) is proposed for relevance feedback in content-based image retrieval. The main issue in active learning for image retrieval is how to choose images for the user to label in the next interaction. According to information theory, we proposed an entropy-based criterion for good request selection. To apply the criterion with SVMs, probabilistic outputs are required. Since standard SVMs do not provide such outputs, two techniques are used to produce probabilities. One is to train the parameters of an additional sigmoid function. The other is to use the notion of version space. Experimental results on a database of 10,000 general-purpose images demonstrate the effectiveness of the proposed active learning scheme.
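The entropy-based request selection mentioned in this abstract can be illustrated with a small sketch. It assumes the class probabilities for each unlabeled item are already available (the abstract obtains them from SVM outputs, which is not reproduced here); the names `entropy`, `select_queries` and the toy `pool` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_queries(prob_by_item, k):
    """Pick the k items whose predicted class distribution is most uncertain."""
    ranked = sorted(
        prob_by_item,
        key=lambda item: entropy(prob_by_item[item]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical relevance probabilities (relevant, not relevant) per image.
pool = {
    "img_a": [0.5, 0.5],
    "img_b": [0.9, 0.1],
    "img_c": [0.99, 0.01],
}
queries = select_queries(pool, 2)
```

Items near a 50/50 split have the highest entropy, so they are the ones a user would be asked to label next.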
Article
Full-text available
We present and compare various methods for computing word alignments using statistical or heuristic models. We consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. We present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. As evaluation criterion, we use the quality of the resulting Viterbi alignment compared to a manually produced reference alignment. We evaluate the models on the German-English Verbmobil task and the French-English Hansards task. We perform a detailed analysis of various design decisions of our statistical alignment system and evaluate these on training corpora of various sizes. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models. In the Appendix, we present an efficient training algorithm for the alignment models presented.
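As a rough illustration of a Dice-based heuristic of the kind this abstract compares against, the sketch below scores word pairs by the standard Dice coefficient over sentence-level co-occurrence counts. It is an assumption-laden toy: the exact formulation and counting scheme used in the paper may differ, and `dice_scores` and the three-sentence `bitext` are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Dice coefficient 2*C(s,t) / (C(s) + C(t)) for each co-occurring word pair.

    Counts are per sentence pair: a word counts once per sentence it appears in.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in bitext:
        src_types, tgt_types = set(src_sent), set(tgt_sent)
        src_count.update(src_types)
        tgt_count.update(tgt_types)
        pair_count.update(product(src_types, tgt_types))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


bitext = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
scores = dice_scores(bitext)
```

A heuristic aligner would then link each source word to the target word with the highest score in the same sentence pair; here ("das", "the") scores higher than ("das", "house").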
Article
Full-text available
We describe a Named Entity Recognition system for Dutch that combines gazetteers, handcrafted rules, and machine learning on the basis of seed material. We used gazetteers and a corpus to construct training material for Ripper, a rule learner. Instead of using Ripper to train a complete system, we used many different runs of Ripper in order to derive rules which we then interpreted and implemented in our own, hand-crafted system. This speeded up the building of a hand-crafted system, and allowed us to use many different rule sets in order to improve performance. We discuss the advantages of using machine learning software as a tool in knowledge acquisition, and evaluate the resulting system for Dutch.
Conference Paper
Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.
Conference Paper
Traditional isolated monolingual name taggers tend to yield inconsistent results across two languages. In this paper, we propose two novel approaches to jointly and consistently extract names from parallel corpora. The first approach uses standard linear-chain Conditional Random Fields (CRFs) as the learning framework, incorporating cross-lingual features propagated between two languages. The second approach is based on a joint CRFs model to jointly decode sentence pairs, incorporating bilingual factors based on word alignment. Experiments on Chinese-English parallel corpora demonstrated that the proposed methods significantly outperformed monolingual name taggers, were robust to automatic alignment noise and achieved state-of-the-art performance. With only 20% of the training data, our proposed methods can already achieve better performance compared to the baseline learned from the whole training set.
Article
We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into NE annotations by projecting the target article's classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman and Schone, 2008 [61], Mika et al., 2008 [46]); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.
Article
The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which it learns. An active learner may ask queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator). Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for active learning, a summary of several problem setting variants, and a discussion of related topics in machine learning research are also presented.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Article
The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision? This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., “chemist” and “biologist” are identified as sub-classes of “scientist”). List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Conference Paper
In this paper we will present an approach to acquisition of some classes of compound words from large corpora, as well as a method for semi-automatic generation of appropriate linguistic models, that can be further used for compound word recognition and for completion of compound word dictionaries. The approach is intended for a highly inflective language such as Serbo-Croatian. Generated linguistic models are represented by local grammars.
Conference Paper
Active learning is well-suited to many problems in natural language processing, where unlabeled data may be abundant but annotation is slow and expensive. This paper aims to shed light on the best active learning approaches for sequence labeling tasks such as information extraction and document segmentation. We survey previously used query selection strategies for sequence models, and propose several novel algorithms to address their shortcomings. We also conduct a large-scale empirical comparison using multiple corpora, which demonstrates that our proposed methods advance the state of the art.
Conference Paper
Implemented methods for proper names recognition rely on large gazetteers of common proper nouns and a set of heuristic rules (e.g. Mr. as an indicator of a PERSON entity type). Though the performance of current PN recognizers is very high (over 90%), it is important to note that this problem is by no means a "solved problem". Existing systems perform extremely well on newswire corpora by virtue of the availability of large gazetteers and rule bases designed for specific tasks (e.g. recognition of Organization and Person entity types as specified in recent Message Understanding Conferences, MUC). However, large gazetteers are not available for most languages and applications other than newswire texts and, in any case, proper nouns are an open class. In this paper we describe a context-based method to assign an entity type to unknown proper names (PNs). Like many others, our system relies on a gazetteer and a set of context-dependent heuristics to classify proper nouns. However, due to the unavailability of large gazetteers in Italian, over 20% of detected PNs cannot be semantically tagged. The algorithm that we propose assigns an entity type to an unknown PN based on the analysis of syntactically and semantically similar contexts already seen in the application corpus. The performance of the algorithm is evaluated not only in terms of precision, following the tradition of MUC conferences, but also in terms of Information Gain, an information theoretic measure that takes into account the complexity of the classification task.
Conference Paper
This paper describes a supervised learning method to automatically select from a set of noun phrases, embedding proper names of different semantic classes, their most distinctive features. The result of the learning process is a decision tree which classifies an unknown proper name on the basis of its context of occurrence. This classifier is used to estimate the probability distribution of an out of vocabulary proper name over a tagset. This probability distribution is itself used to estimate the parameters of a stochastic part of speech tagger.
Conference Paper
Information extraction systems incorporate multiple stages of linguistic analysis. Although errors are typically compounded from stage to stage, it is possible to reduce the errors in one stage by harnessing the results of the other stages. We demonstrate this by using the results of coreference analysis and relation extraction to reduce the errors produced by a Chinese name tagger. We use an N-best approach to generate multiple hypotheses and have them re-ranked by subsequent stages of processing. We obtained thereby a reduction of 24% in spurious and incorrect name tags, and a reduction of 14% in missed tags.
Article
This paper describes our application of conditional random fields with feature induction to a Hindi named entity recognition task. With only five days development time and little knowledge of this language, we automatically discover relevant features by providing a large array of lexical tests and using feature induction to automatically construct the features that most increase conditional likelihood. In an effort to reduce overfitting, we use a combination of a Gaussian prior and early stopping based on the results of 10-fold cross validation.
Conference Paper
This paper presents a feature induction method for CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase loglikelihood, the approach builds on that of Della Pietra et al (1997), but is altered to work with conditional rather than joint probabilities, and with a mean-field approximation and other additional modifications that improve efficiency specifically for a sequence model. In comparison with traditional approaches, automated feature...
Article
This article can be viewed as an attempt to explore the consequences of two propositions. (1) Intentionality in human beings (and animals) is a product of causal features of the brain. I assume this is an empirical fact about the actual causal relations between mental processes and brains. It says simply that certain brain processes are sufficient for intentionality. (2) Instantiating a computer program is never by itself a sufficient condition of intentionality. The main argument of this paper is directed at establishing this claim. The form of the argument is to show how a human agent could instantiate the program and still not have the relevant intentionality. These two propositions have the following consequences: (3) The explanation of how the brain produces intentionality cannot be that it does it by instantiating a computer program. This is a strict logical consequence of 1 and 2. (4) Any mechanism capable of producing intentionality must have causal powers equal to those of the brain. This is meant to be a trivial consequence of 1. (5) Any attempt literally to create intentionality artificially (strong AI) could not succeed just by designing programs but would have to duplicate the causal powers of the human brain. This follows from 2 and 4. “Could a machine think?” On the argument advanced here only a machine could think, and only very special kinds of machines, namely brains and machines with internal causal powers equivalent to those of brains. And that is why strong AI has little to tell us about thinking, since it is not about machines but about programs, and no program by itself is sufficient for thinking.
Article
Clinicians and microbiologists will participate in voluntary national reporting of HIV infections and AIDS only if they have confidence in the scheme's confidentiality. At the same time, if the data are to be accurate, it must be possible to recognise reports that refer to the same individual. The use of surname 'soundex' code in combination with date of birth meets both requirements. We describe its use in the database of reported HIV infections held at the PHLS AIDS Centre. By the end of 1994 over 93% of the 20,407 reports on the database were soundex coded, and 70% of AIDS reports were linked to independent reports of HIV infection from microbiologists. In 1994, 22% of the reports of HIV infection were recognised as duplicating earlier reports of infection. Coding surnames using soundex is an acceptable and practical tool in surveillance of an infection for which confidentiality is a prime concern.
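To make the linkage idea concrete, the sketch below implements the common American Soundex coding and pairs it with a date of birth. The abstract does not say which Soundex variant or record format the database uses, so the coding rules here, the `soundex` and `linkage_key` helpers, and the ISO date string are all assumptions for illustration.

```python
# Letter-to-digit table for the common American Soundex variant.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(surname):
    """Return a 4-character Soundex code, or "" if the name has no A-Z letters."""
    letters = [ch for ch in surname.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]


def linkage_key(surname, date_of_birth):
    """Hypothetical pseudonymous key: Soundex code plus date of birth."""
    return (soundex(surname), date_of_birth)


same_person = linkage_key("Robert", "1960-05-17") == linkage_key("Rupert", "1960-05-17")
```

Similar-sounding surnames collapse to the same code, so two reports can be matched without storing the surname itself; the trade-off is that distinct people with similar names and the same birth date would also collide.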
Article
This paper presents an incremental method for the tagging of proper names in German newspaper texts. The tagging is performed by the analysis of the syntactic and textual contexts of proper names together with a morphological analysis. The proper names selected by this process supply new contexts which can be used for finding new proper names, and so on. This procedure was applied to a small German corpus (50,000 words) and correctly disambiguated 65% of the capitalized words, which should improve when it is applied to a very large corpus. 1 Introduction The recognition of proper names constitutes one of the major problems for the wealth of tagging systems developed in the last few years. Most of these systems are statistically based and make use of statistical properties which are acquired from a large manually tagged training corpus. The formation of new proper names, especially personal names, is very productive, and it is not feasible to list them in a static lexicon. As Church (C...
Related Work Name Tagging is a well-studied problem. Many types of frameworks have been used, including rules (Farmakiotou et al., 2000; Nadeau and Sekine, 2007), supervised models using monolingual labeled data (Zhou and Su, 2002; Chieu and Ng, 2002; Rizzo and Troncy, 2012; McCallum and Li, 2003; Li and McCallum, 2003), bilingual labeled data (Li et al., 2012; Kim et al., 2012; Che et al., 2013; Wang et al., 2013) or naturally partially annotated data such as Wikipedia (Nothman et al., 2013), bootstrapping (Agichtein and Gravano, 2000; Niu et al., 2003;
Nadeau et al., 2006; Nadeau and Sekine, 2007; Ji and Lin, 2009). Name tagging has been explored for many non-English languages such as in Chinese (Ji and Grishman, 2005; Li et al., 2014), Japanese (Asahara and Matsumoto, 2003; Li et al., 2014), Arabic (Maloney and Niv, 1998), Catalan (Carreras et al., 2003), Bulgarian (Osenova and Kolkovska, 2002), Dutch (De Meulder et al., 2002), French (Béchet
References
Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries.
A proposal for wide-coverage Spanish named entity recognition
  • Montse Arévalo
  • Xavier Carreras
  • Lluís Màrquez
  • María Antònia Martí
  • Lluís Padró
  • María José Simón
Montse Arévalo, Xavier Carreras, Lluís Màrquez, María Antònia Martí, Lluís Padró, and María José Simón. 2002. A proposal for wide-coverage Spanish named entity recognition. Procesamiento del lenguaje natural.
Named entity recognition with bilingual constraints
  • Wanxiang Che
  • Mengqiu Wang
  • Christopher D Manning
  • Ting Liu
Wanxiang Che, Mengqiu Wang, Christopher D Manning, and Ting Liu. 2013. Named entity recognition with bilingual constraints. In Proceedings of HLT-NAACL.
Named entity recognition: a maximum entropy approach using global information
  • Hai Leong Chieu
  • Hwee Tou Ng
Hai Leong Chieu and Hwee Tou Ng. 2002. Named entity recognition: a maximum entropy approach using global information. In Proceedings of the 19th international conference on Computational linguistics.
Unsupervised models for named entity classification
  • Michael Collins
  • Yoram Singer
Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the joint SIGDAT conference on empirical methods in natural language processing and very large corpora.
Rule-based named entity recognition for Greek financial texts
  • Dimitra Farmakiotou
  • Vangelis Karkaletsis
  • John Koutsias
  • George Sigletos
  • Constantine D Spyropoulos
  • Panagiotis Stamatopoulos
Dimitra Farmakiotou, Vangelis Karkaletsis, John Koutsias, George Sigletos, Constantine D Spyropoulos, and Panagiotis Stamatopoulos. 2000. Rule-based named entity recognition for Greek financial texts. In Proceedings of the Workshop on Computational lexicography and Multimedia Dictionaries (COMLEX 2000).
Tagging Portuguese with a Spanish tagger using cognates
  • Jirka Hana
  • Anna Feldman
  • Chris Brew
  • Luiz Amaral
Jirka Hana, Anna Feldman, Chris Brew, and Luiz Amaral. 2006. Tagging Portuguese with a Spanish tagger using cognates. In Proceedings of the International Workshop on Cross-Language Knowledge Induction.
Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection
  • Heng Ji
  • Dekang Lin
Heng Ji and Dekang Lin. 2009. Gender and animacy knowledge discovery from web-scale n-grams for unsupervised person mention detection. In Proceedings of PACLIC2009.
Named-entity recognition from Greek and English texts
  • Vangelis Karkaletsis
  • Georgios Paliouras
  • Georgios Petasis
  • Natasa Manousopoulou
  • Constantine D Spyropoulos
Vangelis Karkaletsis, Georgios Paliouras, Georgios Petasis, Natasa Manousopoulou, and Constantine D Spyropoulos. 1999. Named-entity recognition from Greek and English texts. Journal of Intelligent and Robotic Systems.
Multilingual named entity recognition using parallel data and metadata from Wikipedia
  • Sungchul Kim
  • Kristina Toutanova
  • Hwanjo Yu
Sungchul Kim, Kristina Toutanova, and Hwanjo Yu. 2012. Multilingual named entity recognition using parallel data and metadata from Wikipedia. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics.
Comparison of the impact of word segmentation on name tagging for Chinese and Japanese
  • Haibo Li
  • Masato Hagiwara
  • Qi Li
  • Heng Ji
Haibo Li, Masato Hagiwara, Qi Li, and Heng Ji. 2014. Comparison of the impact of word segmentation on name tagging for Chinese and Japanese. In Proceedings of LREC2014.
Tagarab: a fast, accurate Arabic name recognizer using high-precision morphological analysis
  • John Maloney
  • Michael Niv
John Maloney and Michael Niv. 1998. Tagarab: a fast, accurate Arabic name recognizer using high-precision morphological analysis. In Proceedings of the Workshop on Computational Approaches to Semitic Languages.
Named entity recognition without gazetteers
  • Andrei Mikheev
  • Marc Moens
  • Claire Grover
Andrei Mikheev, Marc Moens, and Claire Grover. 1999. Named entity recognition without gazetteers. In Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics.
Combining the named-entity recognition task and NP chunking strategy for robust pre-processing
  • Petya Osenova
  • Sia Kolkovska
Petya Osenova and Sia Kolkovska. 2002. Combining the named-entity recognition task and NP chunking strategy for robust pre-processing. In Proceedings of the Workshop on Treebanks and Linguistic Theories, September.
Using Soundex codes for indexing names in ASR documents
  • Hema Raghavan
  • James Allan
Hema Raghavan and James Allan. 2004. Using Soundex codes for indexing names in ASR documents. In Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004.