Conference Paper

Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities


... Named-entity recognition (NER) is a sub-task of IE which aims to identify words or phrases in text and assign them to predefined classes that describe the entities of a given domain [13]. There exist several types of NER methods: terminology-driven [12], rule-based [4], corpus-based [16], active learning (AL) methods [6], and deep neural network (DNN) methods [17]. ...
... Creating an annotated corpus for a new domain is a time-consuming task that requires a huge effort from domain experts. One way to minimize the annotation cost is to apply active learning [6], an iterative form of supervised learning in which the algorithm interactively queries the user to obtain the desired outputs at new data points. Because corpus-based NER methods rely on costly handcrafted features to train the NER model, much recent work addresses NER with deep neural networks (DNNs) [17]. ...
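The active learning loop described in the snippet above can be made concrete with a short sketch. The following Python fragment is illustrative only: the scikit-learn classifier, the pool and seed arrays, and the batch size are assumptions standing in for whatever model and corpus a given project uses, with a simple least-confidence query criterion.

# A minimal sketch of the pool-based active learning loop described above.
# The classifier, pool, and query size are illustrative assumptions, not
# the setup of any specific paper cited on this page.
from sklearn.linear_model import LogisticRegression
import numpy as np

def uncertainty_sampling_loop(X_pool, y_oracle, X_seed, y_seed,
                              rounds=10, batch=20):
    """Iteratively query the labels the current model is least sure about."""
    X_train, y_train = X_seed.copy(), y_seed.copy()
    pool_idx = np.arange(len(X_pool))
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_train, y_train)
        probs = model.predict_proba(X_pool[pool_idx])
        # Least-confidence score: 1 - probability of the most likely class.
        scores = 1.0 - probs.max(axis=1)
        query = pool_idx[np.argsort(-scores)[:batch]]
        # In practice a human annotator supplies these labels; here the
        # oracle array stands in for that step.
        X_train = np.vstack([X_train, X_pool[query]])
        y_train = np.concatenate([y_train, y_oracle[query]])
        pool_idx = np.setdiff1d(pool_idx, query)
    return model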
... The evaluation results showed that the individual performance of each NER system is corpus-dependent. To improve corpus-based NER, the authors of [6] proposed an active learning solution. ...
Conference Paper
Full-text available
For the creation of digital textual corpora of preserved historical sources, automatic or semi-automatic extraction of specific types of information is becoming a requested tool for many researchers active in the field of digital humanities. Such tools greatly aid the efforts in digitization and semantic annotation. For this reason, we propose a rule-based named-entity recognition system for location extraction from Latin text (so-called LOCALE). It is based on a set of computational-linguistics rules for the Latin language. Experimental results obtained on a set of 100 documents, which were further manually evaluated by human experts, are very promising.
... In this domain, texts are noisy: they were written in times when orthography was rather incidental, and they suffer from OCR and transcription errors (see Fig. 1). Tools like named entity recognizers are unavailable or perform poorly (Erdmann et al., 2019). ...
... We do not evaluate on a holdout set. Instead, we follow Erdmann et al. (2019) and simulate annotating the complete corpus, evaluating on the very same data, as we are interested in how an annotated subset helps to annotate the rest of the data, not in how well the model generalizes. We assume that users annotate mention spans perfectly, i.e. we use gold spans. ...
... Since AL prioritizes the data to be labeled in order to maximize the impact for training a supervised model, it performs better than other ML strategies with substantially fewer resources. This justifies the interest in it as an underlying learning guideline for dealing with low-resource scenarios [61-64] and specifically in the area of POS tagging [54, 65-69]. ...
Article
Full-text available
The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened the interest in traditional machine learning algorithms, which have proved still to be competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so when we talk about processes involving domains where the training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of POS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.
... (1) Random (Erdmann et al. 2019): an intuitive baseline that queries unlabeled samples with equal probability. ...
Article
Automated medical named entity recognition and normalization are fundamental for constructing knowledge graphs and building QA systems. For medical text, annotation demands expertise and professionalism. Existing methods utilize active learning to reduce the cost of corpus annotation, as well as multi-task learning strategies to model the correlations between different tasks. However, existing models take neither task-specific features for different tasks nor the diversity of query samples into account. To address these limitations, this paper proposes a multi-task adversarial active learning model for medical named entity recognition and normalization. In our model, adversarial learning keeps the effectiveness of the multi-task learning module and the active learning module. The task discriminator eliminates the influence of irregular task-specific features, and the diversity discriminator exploits the heterogeneity between samples to meet the diversity constraint. The empirical results on two medical benchmarks demonstrate the effectiveness of our model against existing methods.
... This modified version of LC works slightly better for sequence tagging tasks (Shen et al., 2017), and is adopted in many other works on DAL (Siddhant and Lipton, 2018; Erdmann et al., 2019; Shelmanov et al., 2021). ...
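For reference, the standard least confidence (LC) criterion and the length-normalized modification referred to above (the Maximum Normalized Log-Probability of Shen et al., 2017) can be written in LaTeX, for a tag sequence of length n, as follows; the notation is the conventional one, not quoted from any of the papers here:

% Least confidence over the whole sequence:
\mathrm{LC}(x) = 1 - \max_{y_1,\dots,y_n} P(y_1,\dots,y_n \mid x)
% Length-normalized modification (MNLP), which avoids biasing
% selection toward long sentences:
\mathrm{MNLP}(x) = -\max_{y_1,\dots,y_n} \frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid y_1,\dots,y_{i-1}, x)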
Preprint
Full-text available
Active learning (AL) is a prominent technique for reducing the annotation effort required for training machine learning models. Deep learning offers a solution for several essential obstacles to deploying AL in practice but introduces many others. One such problem is the excessive computational resources required to train an acquisition model and estimate its uncertainty on instances in the unlabeled pool. We propose two techniques that tackle this issue for text classification and tagging tasks, offering a substantial reduction of the AL iteration duration and the computational overhead introduced by deep acquisition models in AL. We also demonstrate that our algorithm, which leverages pseudo-labeling and distilled models, overcomes one of the essential obstacles revealed previously in the literature: due to differences between the acquisition model used to select instances during AL and the successor model trained on the labeled data, the benefits of AL can diminish. We show that our algorithm, despite using a smaller and faster acquisition model, is capable of training a more expressive successor model with higher performance.
... Another option is active learning, where an ML system asks an oracle (or a user) to select the most relevant examples to consider, thereby lowering the number of data points required to learn a model. This is the approach adopted by Erdmann et al. [60] to recognise entities in various Latin classical texts, based on an active learning pipeline able to predict how many and which sentences need to be annotated to achieve a certain degree of accuracy, later released as a toolkit for building custom NER models for the humanities [61]. Finally, another strategy is data augmentation, where an existing data set is expanded via the transformation of training instances without changing their label. ...
Preprint
Full-text available
After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.
... A diversity-based acquisition function can be either cold-start or warm-start. There are also hybrid approaches that aim to select data based on both uncertainty and diversity sampling (He et al., 2014; Yang et al., 2015; Erdmann et al., 2019; Yuan et al., 2020), and other methods that use reinforcement learning (Fang et al., 2017; Liu et al., 2018). ...
Preprint
Full-text available
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data, aiming to achieve better model performance than random selection. Previous AL approaches in Natural Language Processing (NLP) have been limited to either task-specific models that are trained from scratch at each iteration using only the labeled data at hand or using off-the-shelf pretrained language models (LMs) that are not adapted effectively to the downstream task. In this paper, we address these limitations by introducing BALM: Bayesian Active Learning with pretrained language Models. We first propose to adapt the pretrained LM to the downstream task by continuing training with all the available unlabeled data and then use it for AL. We also suggest a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low- and high-resource scenarios during AL. We finally apply Monte Carlo dropout to the downstream model to obtain well-calibrated confidence scores for data selection with uncertainty sampling. Our experiments in five standard natural language understanding tasks demonstrate that BALM provides substantial data efficiency improvements compared to various combinations of acquisition functions, models and fine-tuning methods proposed in recent AL literature.
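A minimal sketch of the Monte Carlo dropout step mentioned in this abstract, in Python with PyTorch; the model interface and the number of forward passes are illustrative assumptions, not details from the paper.

# Monte Carlo dropout for acquisition scoring: dropout stays active at
# inference time and softmax outputs are averaged over several stochastic
# forward passes. Assumes the model's only stochastic layers are dropout
# (train() would also switch batch-norm behavior).
import torch

def mc_dropout_probs(model, inputs, passes=10):
    model.train()  # keep dropout layers active during inference
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(inputs), dim=-1) for _ in range(passes)
        ])
    return probs.mean(dim=0)  # approximately calibrated class probabilities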
... To alleviate this problem, active learning is proposed to achieve better performance with fewer labeled training instances (Settles, 2009). Instead of randomly selecting instances, active learning can measure the whole set of candidate instances according to some criteria, and then select more efficient instances for annotation (Shen et al., 2017; Erdmann et al., 2019; Kasai et al., 2019; Xu et al., 2018). However, previous active learning approaches in natural language processing mainly depend on the entropy-based uncertainty criterion (Settles, 2009), and ignore the characteristics of natural language. ...
Preprint
Active learning is able to significantly reduce the annotation cost for data-driven techniques. However, previous active learning approaches for natural language processing mainly depend on the entropy-based uncertainty criterion, and ignore the characteristics of natural language. In this paper, we propose a pre-trained language model based active learning approach for sentence matching. Unlike previous active learning approaches, it can provide linguistic criteria to measure instances and help select more efficient instances for annotation. Experiments demonstrate our approach can achieve greater accuracy with fewer labeled training instances.
Chapter
This chapter guides the reader through the key stages of creating language resources. After explaining the difference between linguistic corpora and other text collections, the authors briefly introduce the typology of corpora created by corpus linguists and the concept of corpus annotation. Basic terminology from natural language processing (NLP) and corpus linguistics is introduced, alongside an explanation of the main components of an NLP pipeline and tools, including pre-processing, part-of-speech tagging, lemmatization, and entity extraction.
Conference Paper
As a form of distant reading, mapping texts allows scholars to read classical works anew. Using 17th-century French theatre as a test case, we describe an easily reproducible and fully open-source workflow for extracting and mapping place names, then reach conclusions on literary influences and the strength of genre during the Grand Siècle.
Article
Full-text available
Active learning is a promising way to reduce labeling cost: starting from a limited number of labeled training samples, it iteratively selects the most valuable samples from a large pool of unlabeled data for labeling, in order to construct a powerful classifier. The goal of active learning is to keep the labeled dataset as free of redundancy as possible. Uncertainty and diversity are two important criteria for active learning, and combining them is a promising direction. However, many existing methods are designed for the binary-class setting or apply an uncertainty-then-diversity strategy, and are therefore ill-suited to selecting the most valuable samples in multi-class problems while accounting for diversity and uncertainty simultaneously. In this paper, we integrate uncertainty and diversity into one formula in a multi-class setting. Uncertainty is measured by the minimum margin, while diversity is measured by the maximum mean discrepancy, a popular measure of the distance between two distributions. By minimizing an upper bound on the true risk of the integrated formula, we find samples that are not only uncertain but also diverse from each other. We conduct experiments on 12 benchmark UCI datasets, and the experimental results demonstrate that the proposed method performs better than several state-of-the-art methods.
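The two criteria named in this abstract have standard formulations, reproduced here for reference in our own notation (not quoted from the paper):

% Minimum-margin uncertainty: the gap between the two most probable
% classes \hat{y}_1 and \hat{y}_2; a smaller margin means more uncertainty.
m(x) = P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x)
% Maximum mean discrepancy between a candidate set S and a reference
% set Q, with feature map \phi into an RKHS \mathcal{H}:
\mathrm{MMD}(S, Q) = \Big\| \frac{1}{|S|} \sum_{x \in S} \phi(x) - \frac{1}{|Q|} \sum_{x' \in Q} \phi(x') \Big\|_{\mathcal{H}}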
Article
Full-text available
Recogito 2 is an open source annotation tool currently under development by Pelagios, an international initiative aimed at facilitating better linkages between online resources documenting the past. With Recogito 2, we aim to provide an environment for efficient semantic annotation—i.e., the task of enriching content with references to controlled vocabularies—in order to facilitate links between online data. At the same time, we address a perceived gap in the performance of existing tools, by emphasizing the development of mechanisms for manual intervention and editorial control that support the curation of quality data. While Recogito 2 provides an online workspace for general-purpose document annotation, it is particularly well-suited for geo-annotation, in other words annotating documents with references to gazetteers, and supports the annotation of both texts and images (i.e., digitized maps). Already available for testing at http://recogito.pelagios.org, its formal release to the public occurred in December 2016.
Article
Full-text available
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report the first large-scale trials and evaluation of NER with data from Digi, a digitized Finnish historical newspaper collection. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771-1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70-75% (Kettunen and Pääkkönen, 2016). Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This research reports the first published large-scale results of NER on a historical Finnish OCRed newspaper collection. The results supplement NER results for other languages with similarly noisy data.
Conference Paper
Full-text available
State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields, and another that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus, and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.
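To make the first architecture concrete, here is a minimal PyTorch sketch of the BiLSTM component that produces per-token tag scores; the hyperparameters are illustrative, and the character-based word representations and the CRF output layer from the abstract are noted but omitted for brevity.

# BiLSTM emission scorer, the encoder half of a BiLSTM-CRF tagger.
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional LSTM; each direction gets half the hidden size so
        # the concatenated output has dimension `hidden`.
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)

    def forward(self, token_ids):          # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))
        return self.proj(h)                # per-token tag scores (emissions)

# A CRF layer (e.g. the pytorch-crf package) would consume these emission
# scores and jointly model tag-transition constraints, as in the paper.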
Article
Full-text available
As a way to relieve the tedious work of manual annotation, active learning plays an important role in many applications of visual concept recognition. In typical active learning scenarios, the number of labelled data points in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data, which often leads to over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary-class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch-mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data points are often similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
Article
Full-text available
Big data from the Internet of Things may create big challenges for data classification. Most active learning approaches select either uncertain or representative unlabeled instances to query their labels. Although several active learning algorithms have been proposed to combine the two criteria for query selection, they are usually ad hoc in finding unlabeled instances that are both informative and representative, and fail to take the diversity of instances into account. We address this challenge by presenting a new active learning framework which considers uncertainty, representativeness, and diversity. The proposed approach provides a systematic way of measuring and combining the uncertainty, representativeness, and diversity of an instance: first, instances' uncertainty and representativeness are used to constitute the most informative set; then the kernel k-means clustering algorithm filters out redundant samples, and the resulting samples are queried for labels. Extensive experimental results show that the proposed approach outperforms several state-of-the-art active learning approaches.
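An illustrative Python sketch of the two-stage selection described above; plain k-means stands in for kernel k-means, and the uncertainty scores are assumed to be given by some upstream model, so this is a sketch of the idea rather than the paper's exact method.

# Two-stage batch selection: take the most informative candidates first,
# then cluster them to filter out redundant (near-duplicate) samples.
import numpy as np
from sklearn.cluster import KMeans

def select_batch(X_pool, uncertainty, batch=16, candidates=128):
    # Stage 1: most informative set by uncertainty (representativeness
    # could be folded into this score as well).
    top = np.argsort(-uncertainty)[:candidates]
    # Stage 2: cluster the candidates and keep one sample per cluster.
    km = KMeans(n_clusters=batch, n_init=10).fit(X_pool[top])
    picks = []
    for c in range(batch):
        members = top[km.labels_ == c]
        if len(members):
            # take the most uncertain member of each cluster
            picks.append(members[np.argmax(uncertainty[members])])
    return np.array(picks)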
Article
Full-text available
Active learning is a supervised machine learning technique in which the learner is in control of the data used for learning. That control is utilized by the learner to ask an oracle, typically a human with extensive knowledge of the domain at hand, about the classes of the instances for which the model learned so far makes unreliable predictions. The active learning process takes as input a set of labeled examples, as well as a larger set of unlabeled examples, and produces a classifier and a relatively small set of newly labeled data. The overall goal is to create as good a classifier as possible, without having to mark-up and supply the learner with more data than necessary. The learning process aims at keeping the human annotation effort to a minimum, only asking for advice where the training utility of the result of such a query is high. Active learning has been successfully applied to a number of natural language processing tasks, such as, information extraction, named entity recognition, text categorization, part-of-speech tagging, parsing, and word sense disambiguation. This report is a literature survey of active learning from the perspective of natural language processing.
Conference Paper
Full-text available
Recent reinforcement-learning (RL) algorithms with polynomial sample-complexity guarantees (e.g. Kearns & Singh, 2002) rely on distinguishing between instances that have been learned with sufficient accuracy and those whose outputs are still unknown. This partitioning allows algorithms to directly manage the exploration/exploitation tradeoff. However, prior frameworks for measuring sample complexity in supervised learning, such as Probably Approximately Correct (PAC) and Mistake Bound (MB), do not necessarily maintain such distinctions, so efficient algorithms for learning a model in these paradigms can be insufficient for efficient RL. In this work, we describe the Knows What It Knows (KWIK) framework (introduced by Li et al. (2008)), which intrinsically relies upon the known/unknown partitioning, embodying the sufficient conditions for sample-efficient exploration in RL. We show that several widely studied RL models are KWIK-learnable and derive polynomial sample-complexity upper bounds within this framework. A KWIK algorithm begins with an input set X, output set Y, and observation set Z. The hypothesis class H consists of a set of functions from X to Y: H ⊆ (X → Y). The target function h* ∈ H is unknown to the learner. H and parameters ε and δ are known to both the learner and the environment. The environment selects a target function h* ∈ H adversarially.
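The KWIK requirements sketched in this abstract can be reconstructed as follows; this is our paraphrase of the standard definition in LaTeX notation, not text from the paper:

% On each adversarially chosen input x_t, the learner outputs either a
% prediction or "I don't know":
\hat{y}_t \in Y \cup \{\bot\}
% Accuracy: any committed prediction must already be reliable,
\hat{y}_t \neq \bot \;\Rightarrow\; \text{accurate to within } \epsilon \text{ (with probability at least } 1-\delta)
% Sample complexity: the total number of \bot responses must be bounded
% by a polynomial in 1/\epsilon, 1/\delta, and the size of H.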
Conference Paper
Full-text available
In this paper, we propose a multi-criteria-based active learning approach and effectively apply it to named entity recognition. Active learning aims to minimize human annotation effort by selecting examples for labeling. To maximize the contribution of the selected examples, we consider multiple criteria: informativeness, representativeness and diversity, and propose measures to quantify them. More comprehensively, we incorporate all the criteria using two selection strategies, both of which result in less labeling cost than single-criterion-based methods. The results for named entity recognition on both MUC-6 and GENIA show that the labeling cost can be reduced by at least 80% without degrading the performance.
Conference Paper
Full-text available
Most current statistical natural language processing models use only local features so as to permit dynamic programming in inference, but this makes them unable to fully account for the long distance structure that is prevalent in language use. We show how to solve this dilemma with Gibbs sampling, a simple Monte Carlo method used to perform approximate inference in factored probabilistic models. By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference. We use this technique to augment an existing CRF-based information extraction system with long-distance dependency models, enforcing label consistency and extraction template consistency constraints. This technique results in an error reduction of up to 9% over state-of-the-art systems on two established information extraction tasks.
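The annealing step described above amounts to sharpening the Gibbs conditionals with a decreasing temperature; schematically, in our own LaTeX notation:

% Resample one label y_i at a time from its conditional given all the
% others, raised to an inverse temperature that grows over time. As
% T_t -> 0, the chain concentrates on the most probable assignment,
% recovering a MAP-like decode that still respects non-local factors.
P_{T_t}(y_i \mid y_{-i}, x) \;\propto\; P(y_i \mid y_{-i}, x)^{1/T_t}, \qquad T_t \rightarrow 0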
Article
Full-text available
We report on two JISC-funded projects that aimed to enrich the metadata of digitized historical collections with georeferences and other information automatically computed using geoparsing and related information extraction technologies. Understanding location is a critical part of any historical research, and the nature of the collections makes them an interesting case study for testing automated methodologies for extracting content. The two projects (GeoDigRef and Embedding GeoCrossWalk) have looked at how automatic georeferencing of resources might be useful in developing improved geographical search capacities across collections. In this paper, we describe the work that was undertaken to configure the geoparser for the collections as well as the evaluations that were performed.
Chapter
This paper presents the application of a neural architecture to the identification of place names in English historical texts. We test the impact of different word embeddings and we compare the results to the ones obtained with the Stanford NER module of CoreNLP before and after the retraining using a novel corpus of manually annotated historical travel writings.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
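A toy Python sketch of the subword scheme in this abstract; the n-gram range and the vector table are illustrative assumptions (fastText additionally includes a vector for the full word itself).

# Bag-of-character-n-grams word representation: a word vector is the sum
# of the vectors of its character n-grams, with boundary symbols.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                     # boundary symbols, as in fastText
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word, ngram_vectors, dim=300):
    """Sum the learned n-gram vectors; unseen n-grams contribute nothing."""
    v = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            v += ngram_vectors[g]
    return v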
Article
In computer vision and pattern recognition applications, there is usually a vast amount of unlabelled data whereas the labelled data are very limited. Active learning is a kind of method that selects the most representative or informative examples for labelling and training, so that the best prediction accuracy can be achieved. A novel active learning algorithm is proposed here, based on a one-versus-one support vector machine (SVM) strategy, to solve multi-class image classification. A new uncertainty measure is proposed based on a set of binary SVM classifiers, and some of the most uncertain examples are selected from the SVM output. To ensure that the selected examples are diverse from each other, a Gaussian kernel is adopted to measure the similarity between any two examples. From the previously selected examples, a batch of diverse and uncertain examples is selected by dynamic programming for labelling. The experimental results on two datasets demonstrate the effectiveness of the proposed algorithm.
Article
Unstructured metadata fields such as ‘description’ offer tremendous value for users to understand cultural heritage objects. However, this type of narrative information is of little direct use within a machine-readable context due to its unstructured nature. This article explores the possibilities and limitations of named-entity recognition (NER) and term extraction (TE) to mine such unstructured metadata for meaningful concepts. These concepts can be used to leverage otherwise limited searching and browsing operations, but they can also play an important role to foster Digital Humanities research. To catalyze experimentation with NER and TE, the article proposes an evaluation of the performance of three third-party entity extraction services through a comprehensive case study, based on the descriptive fields of the Smithsonian Cooper-Hewitt National Design Museum in New York. To cover both NER and TE, we first offer a quantitative analysis of named entities retrieved by the services in terms of precision and recall compared with a manually annotated gold-standard corpus, and then complement this approach with a more qualitative assessment of relevant terms extracted. Based on the outcomes of this double analysis, the conclusions present the added value of entity extraction services, but also indicate the dangers of uncritically using NER and/or TE, and by extension Linked Data principles, within the Digital Humanities. All metadata and tools used within the article are freely available, making it possible for researchers and practitioners to repeat the methodology. By doing so, the article offers a significant contribution towards understanding the value of entity recognition and disambiguation for the Digital Humanities.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
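For reference, the linear-chain form of the model introduced in this abstract is conventionally written as follows (standard notation, not quoted from the paper):

% Conditional probability of a label sequence y given observations x,
% with feature functions f_k, weights \lambda_k, and partition
% function Z(x) normalizing over all possible label sequences.
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big)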
Conference Paper
Geographic interfaces provide natural, scalable visualizations for many digital library collections, but the wide range of data in digital libraries presents some particular problems for identifying and disambiguating place names. We describe the toponym-disambiguation system in the Perseus digital library and evaluate its performance. Name categorization varies significantly among different types of documents, but toponym disambiguation performs at a high level of precision and recall with a gazetteer an order of magnitude larger than most other applications.
Conference Paper
An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.
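The harmonic mean-field solution mentioned above has a closed matrix form; in the standard notation of graph-based semi-supervised learning (our notation, not the paper's):

% With weight matrix W, diagonal degree matrix D, and the vertices split
% into labeled (l) and unlabeled (u) blocks, the harmonic function on
% the unlabeled vertices is
f_u = (D_{uu} - W_{uu})^{-1} W_{ul}\, f_l
% i.e. each unlabeled value is the weighted average of its neighbors' values.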
Conference Paper
This paper addresses two issues of active learning. Firstly, to solve the problem that uncertainty sampling often fails by selecting outliers, this paper presents a new selective sampling technique, sampling by uncertainty and density (SUD), in which a k-Nearest-Neighbor-based density measure is adopted to determine whether an unlabeled example is an outlier. Secondly, a technique of sampling by clustering (SBC) is applied to build a representative initial training data set for active learning. Finally, we implement a new algorithm of active learning with the SUD and SBC techniques. The experimental results on three real-world data sets show that our method outperforms competing methods, particularly at the early stages of active learning.
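A short Python sketch of the SUD scoring idea described above; the cosine similarity function and the value of k are illustrative choices, not the paper's exact configuration.

# Sampling by uncertainty and density: weight an example's uncertainty by
# a k-nearest-neighbour density estimate, so that uncertain outliers are
# not selected for annotation.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sud_scores(X_pool, uncertainty, k=10):
    sim = cosine_similarity(X_pool)
    np.fill_diagonal(sim, -np.inf)      # exclude self-similarity
    # Density of x = average similarity to its k nearest neighbours.
    knn_sim = np.sort(sim, axis=1)[:, -k:]
    density = knn_sim.mean(axis=1)
    return uncertainty * density        # high = uncertain AND not an outlier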
Conference Paper
Active learning is well-suited to many problems in natural language processing, where unlabeled data may be abundant but annotation is slow and expensive. This paper aims to shed light on the best active learning approaches for sequence labeling tasks such as information extraction and document segmentation. We survey previously used query selection strategies for sequence models, and propose several novel algorithms to address their shortcomings. We also conduct a large-scale empirical comparison using multiple corpora, which demonstrates that our proposed methods advance the state of the art.
Article
The Perseus digital library is a substantial test bed of materials on archaic and classical Greece, the early Roman empire, and early modern Europe. The Perseus architecture includes tools that fit the needs of humanists: linguistic analysis for heavily inflected languages, linking and alignment with canonical citation schemes, and terminological, spatial, and visual databases for document contextualization. These tools provide both the scalability to connect disparate entities in the digital library and a groundwork for performance of the synthetic scholarship of the humanities.