Figure 2: An excerpt of the token trie after inserting several company names.
Source publication
While named entity recognition is a widely addressed research topic, recognizing companies in text is particularly difficult. Company names are extremely heterogeneous in structure: a given company can be referenced in many different ways, and names may include person names, locations, acronyms, numbers, and other unusual tokens. Further, instead of...
Contexts in source publication
Context 1
... To make use of the information contained in a dictionary during the CRF training process, we create a feature that encodes whether the currently classified token is part of a company name contained in one of the dictionaries. To efficiently match token sequences in a text against a particular dictionary, we tokenize a company's official name and all its aliases and insert the generated tokens, in sequence, into a trie data structure. During insertion, we mark the last token of each token sequence with a flag denoting the end of the inserted name. In this manner, we insert all company names into the token trie. Figure 2 shows an excerpt of such a token trie after inserting some company names. After its creation, the token trie functions as a finite state automaton (FSA) for efficiently parsing and annotating token sequences in texts as companies. ...
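This construction and lookup procedure lends itself to a compact implementation. Below is a minimal Python sketch of such a token trie; the class and method names are illustrative, not the authors' code. Insertion walks token sequences downward and flags the node where a name ends; annotation then traverses the trie like an FSA, keeping the longest accepting match starting at each position.

```python
class TokenTrie:
    """Token-level trie: company names are inserted token by token, and the
    last token of each name is flagged as an accepting state."""

    def __init__(self):
        self.children = {}         # token -> TokenTrie
        self.is_name_end = False   # flag: a full company name ends here

    def insert(self, tokens):
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_name_end = True

    def annotate(self, tokens):
        """Yield (start, end) spans that match an inserted company name."""
        i = 0
        while i < len(tokens):
            node, j, last_match = self, i, None
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.is_name_end:
                    last_match = j    # remember the longest accepting state
            if last_match is not None:
                yield (i, last_match)
                i = last_match        # continue after the annotated span
            else:
                i += 1

# Official names and aliases naturally share prefixes in the trie.
trie = TokenTrie()
for name in ["Dell Inc", "Dell Technologies", "Apple Inc"]:
    trie.insert(name.split())

text = "Shares of Dell Technologies and Apple Inc rose on Monday".split()
print([" ".join(text[s:e]) for s, e in trie.annotate(text)])
# -> ['Dell Technologies', 'Apple Inc']
```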
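The dictionary feature itself can then be derived from the trie's matches. The sketch below reuses the TokenTrie from the previous snippet and encodes each token's dictionary status as a BIO-style value; the feature names and encoding are illustrative assumptions rather than the paper's exact setup, but per-token feature dictionaries of this shape are what a CRF library such as sklearn-crfsuite consumes during training.

```python
def dictionary_features(tokens, trie):
    """Build per-token CRF features, including whether the token is part of a
    dictionary-matched company name (illustrative encoding)."""
    in_dict = ["O"] * len(tokens)
    for start, end in trie.annotate(tokens):
        in_dict[start] = "B-COMPANY"         # first token of a matched name
        for k in range(start + 1, end):
            in_dict[k] = "I-COMPANY"         # continuation tokens

    return [{
        "token.lower": token.lower(),
        "token.isupper": token.isupper(),
        "token.istitle": token.istitle(),
        "dict.match": in_dict[i],            # the dictionary feature
    } for i, token in enumerate(tokens)]
```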
Citations
... In the early stages of research, ontology-based approaches and statistical learning methods over large-scale risk-domain corpora were the main approaches to financial entity extraction, laying the foundation for entity extraction research in related domains [15,29]. On this basis, combining the conditional random field (CRF) model with dictionary rules has also become a choice for scholars [30,31]. This approach is simple, efficient, and suitable for colloquial text corpora. ...
Enterprise risk management is of significant importance for fostering the sustainable growth of businesses and serves as a critical element for regulatory bodies to uphold market order. Amidst the challenges posed by intricate and unpredictable risk factors, knowledge graph technology is effectively driving risk management, leveraging its ability to associate and infer knowledge from diverse sources. This review aims to comprehensively summarize the construction techniques of enterprise risk knowledge graphs and their prominent applications across various business scenarios. First, employing bibliometric methods, we uncover the developmental trends and current research hotspots within the domain of enterprise risk knowledge graphs. We then systematically delineate the technical methods for knowledge extraction and fusion in the standardized construction process of enterprise risk knowledge graphs. Objectively comparing and summarizing the strengths and weaknesses of each method, we provide recommendations for addressing the existing challenges in the construction process. Subsequently, we categorize the applied research on enterprise risk knowledge graphs based on research hotspots and risk category standards, and give a detailed exposition of the applicability of technical routes and methods. Finally, we discuss future research directions that still need to be explored for enterprise risk knowledge graphs and propose relevant improvement suggestions. On this basis, practitioners and researchers can gain insights into the technical theory and practical guidance for constructing enterprise risk knowledge graphs.
... Furthermore, the findings underscore the importance of NER in enhancing supply chain resilience and proactive risk management in the construction industry. As highlighted by [104], recognizing company names from textual data is challenging due to the diverse ways a company can be referenced. NER systems that can accurately identify these entities are crucial in risk management, especially for non-exchange-listed entities, where obtaining timely information is difficult. ...
In the Australian construction industry, effective supply chain risk management (SCRM) is critical due to its complex networks and susceptibility to various risks. This study explores the application of transformer models like BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA for Named Entity Recognition (NER) in this context. Utilizing these models, we analyzed news articles to identify and classify entities related to supply chain risks, providing insights into the vulnerabilities within this sector. Among the evaluated models, RoBERTa achieved the highest average F1 score of 0.8580, demonstrating its superior balance in precision and recall for NER in the Australian construction supply chain context. Our findings highlight the potential of NLP-driven solutions to revolutionize SCRM, particularly in geo-specific settings.
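For illustration, applying a pretrained transformer to NER takes only a few lines with the Hugging Face pipeline API. The sketch below uses the publicly available dslim/bert-base-NER model purely as a stand-in; the study above fine-tuned its own models on supply chain risk entities.

```python
from transformers import pipeline

# Generic pretrained NER model as a placeholder for a fine-tuned one.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge word-piece tokens

for entity in ner("Lendlease faces steel supply delays in Sydney."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```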
... This is because domain-specific texts contain NE categories that are (1) detailed variants of the standard NE categories, e.g., "Person" is replaced with the domain-specific sub-categories "Players" and "Coaches" [27], (2) standard NE categories extended with a small number of new categories, e.g., "Trigger of a traffic jam" [11,19,22], and (3) domain-derived NE categories, e.g., "Proteins" in biology or "Reactions" in chemistry domains [9,18,25,30]. Most domain-derived NE categories originate from structured classifications or dictionaries [9,12,25] or are derived by manually unifying multiple of them [5]. In sum, creating domain-specific datasets for NER requires expert knowledge and is time-consuming. ...
Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its common and viable use in many use cases, NER is barely applicable in domains where general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator that assists human annotators in creating domain-specific NER corpora for German text collections, given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., it annotates text datasets with domain (named) entities.
... Loster et al. [65,36] dedicate a series of papers to the recognition of financial entities in text. In particular, they focus on correctly determining the full extent of a mention by using tries, which are tree structures, to improve dictionary-based approaches [39]. ...
In our modern society, almost all events, processes, and decisions in a corporation are documented by internal written communication, legal filings, or business and financial news. The valuable knowledge in such collections is not directly accessible by computers, as they mostly consist of unstructured text. This chapter provides an overview of corpora commonly used in research and highlights related work and state-of-the-art approaches to extract and represent financial entities and relations. The second part of this chapter considers applications based on knowledge graphs of automatically extracted facts. Traditional information retrieval systems typically require the user to have prior knowledge of the data. Suitable visualization techniques can overcome this requirement and enable users to explore large sets of documents. Furthermore, data mining techniques can be used to enrich or filter knowledge graphs. This information can augment source documents and guide exploration processes. Systems for document exploration are tailored to specific tasks, such as investigative work in audits or legal discovery, monitoring compliance, or providing information in a retrieval system to support decisions.
... CRFs are trained such that the most discriminative word or phrase in a company name is extracted as a short name. The importance of short names for linking company entities is also demonstrated in [28]. ...
Data integration has been studied extensively for decades and approached from different angles. However, this domain still remains largely rule-driven and lacks universal automation. Recent development in machine learning and in particular deep learning has opened the way to more general and more efficient solutions to data integration problems. In this work, we propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as unstructured sources, such as free text from news articles. Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible. This is achieved by combining siamese and graph neural networks to propagate information between connected entities and support high scalability. We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.
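As a rough illustration of the siamese-plus-graph idea, the PyTorch sketch below encodes an entity together with its neighbors using shared weights and scores a candidate pair by cosine similarity. The single mean-aggregation layer and the dimensions are simplified assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanGNNLayer(nn.Module):
    """One round of message passing: mix each node with its neighbors' mean."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = adj @ x / deg
        return F.relu(self.linear(x + neighbor_mean))

class SiameseGNN(nn.Module):
    """Encode two entities with shared weights; score by cosine similarity."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.project = nn.Linear(in_dim, hidden_dim)
        self.gnn = MeanGNNLayer(hidden_dim)

    def encode(self, features, adj):
        return self.gnn(self.project(features), adj)

    def forward(self, feats_a, adj_a, idx_a, feats_b, adj_b, idx_b):
        za = self.encode(feats_a, adj_a)[idx_a]  # entity a with its context
        zb = self.encode(feats_b, adj_b)[idx_b]  # entity b with its context
        return F.cosine_similarity(za, zb, dim=-1)

# Toy usage: two small graphs; score whether node 0 in each is the same entity.
feats = torch.randn(3, 8)
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
model = SiameseGNN(in_dim=8, hidden_dim=16)
print(model(feats, adj, 0, feats.clone(), adj, 0))
```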
... Various record linkage systems have been proposed in recent decades [2], [12], [16], [17], [20]. As mentioned in the introduction, they can usually be divided into rule-based and machine learning-based systems. ...
... It has been shown by Loster et al. [20] that taking short (colloquial) company names into account is greatly beneficial for company record linkage. However, the company entity matching system described in [20] used a manually created short company name corpus, whereas in this work we focus on automated short name extraction. ...
Record linkage is an essential part of nearly all real-world systems that consume structured and unstructured data coming from different sources. Typically no common key is available for connecting records. Massive data integration processes often have to be completed before any data analytics and further processing can be performed. In this work we focus on company entity matching, where company name, location and industry are taken into account. Our contribution is a highly scalable, enterprise-grade end-to-end system that uses rule-based linkage algorithms in combination with a machine learning approach to account for short company names. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. Based on real-world ground truth datasets, we show that our approach reaches a recall of 91% compared to 73% for baseline approaches, while scaling linearly with the number of nodes used in the system.
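To illustrate the MinHash-based decomposition, the from-scratch Python sketch below assigns records to blocks via banded MinHash signatures, so that only records sharing a block become linkage candidates. The hash construction, band size, and example records are illustrative choices, not the system's actual configuration.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=16):
    """One min-hash value per seeded hash function over the token set."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def block_records(records, band_size=4):
    """Group record ids by signature bands (LSH-style blocking)."""
    blocks = defaultdict(set)
    for rid, name in records:
        sig = minhash_signature(name.lower().split())
        for b in range(0, len(sig), band_size):
            blocks[(b, tuple(sig[b:b + band_size]))].add(rid)
    # only blocks with at least two records produce candidate pairs
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

records = [(1, "Dell Inc"), (2, "Dell Technologies Inc"), (3, "Apple Inc")]
print(block_records(records))  # similar names tend to share a block
```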
... Various systems to perform record linkage have been proposed over the last decades [16,22,26,6,23]. As mentioned in the introduction, they can usually be divided into rule-based and machine learning-based systems. ...
... Loster et al. [26] already showed that there is a great benefit for company record linkage in taking short (colloquial) company names into account. However, the company entity matching system described in [26] used a manually created short company name corpus, whereas in this work we focus on automated short name extraction. In our deployment, the availability of short company names helps both the efficiency and the accuracy of the RL system, as short names lead to smaller and more descriptive blocks on the one hand, and give more attention or weight to the most discriminative part of a company name on the other. ...
Record Linkage is an essential part of almost all real-world systems that consume data coming from different sources, structured and unstructured. Typically no common key is available to connect the records. Often, massive data cleaning and data integration processes have to be completed before any data analytics and further processing can be performed. Though record linkage is often seen as a somewhat tedious but necessary step, it is able to reveal valuable insights into the data at hand. These insights guide further analytic approaches over the data and support data visualization. In this work we focus on company entity matching, where company name, location, and industry are taken into account. The matching is done on the fly to accommodate real-time processing of streamed data. Our contribution is a system that uses rule-based matching algorithms for scoring operations, which we extend with a machine learning approach to account for short company names. We propose an end-to-end, highly scalable, enterprise-grade system. Linkage time is greatly reduced by an efficient decomposition of the search space using MinHash. High linkage accuracy is reached by the proposed thorough scoring process of the matching candidates. Based on two real-world ground truth datasets, we show that our approach reaches a recall of 91% compared to 86% for baseline approaches. These results are achieved while scaling linearly with the number of nodes used in the system.
... For example, the entity "Apple Inc" will get an alias "Apple" (Loster et al., 2017). ...
Entity Linking is the task of linking entity mentions to their corresponding entities in a knowledge base. Entity Linking is essential in many NLP tasks, such as improving the performance of knowledge network construction, knowledge fusion, information retrieval, and knowledge base population.
A large percentage of web data is in the form of natural language, which is highly ambiguous, especially its named entities. To make ambiguous named entities mentioned on the web machine-readable, we need to link them to structured databases with clean semantics.
A named entity referring to a company can occur in many variations: a list of company names might contain "Dell Inc", but that company might also be referred to as "DELL" or "Dell Technologies". Furthermore, the named entity "dell" may have different meanings depending on the context; two different companies may both be referred to by the word "dell".
There is little research specifically on company named entity linking.
In this work, we study the performance of neural networks for entity linking of company names. We examine the impact of different neural components used in current neural entity linking systems, such as mention embeddings, entity embeddings, and attention mechanisms in candidate ranking.
We compare the effect of traditional static word embeddings like word2vec or GloVe with more recent contextual embeddings such as ELMo, as well as character embeddings. We also analyze entity linking methods specific to company names, such as how to generate aliases of a company name, how to measure the ambiguity of company named entities, and possible reasons for the incorrect disambiguation of a company named entity.
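As a concrete example of alias generation, the small sketch below derives aliases by stripping trailing legal-form tokens, so that "Dell Technologies Inc" also yields "Dell Technologies" and "Dell". The suffix list is a tiny illustrative sample, and suffix stripping is only one of several heuristics one might analyze.

```python
# Illustrative, non-exhaustive list of legal-form and descriptor suffixes.
LEGAL_SUFFIXES = {"inc", "inc.", "corp", "corp.", "ltd", "ltd.",
                  "llc", "gmbh", "ag", "co", "co.", "technologies"}

def generate_aliases(name):
    """Return the name plus aliases with trailing suffix tokens removed."""
    aliases = {name}
    tokens = name.split()
    while len(tokens) > 1 and tokens[-1].lower() in LEGAL_SUFFIXES:
        tokens = tokens[:-1]
        aliases.add(" ".join(tokens))
    return aliases

print(generate_aliases("Dell Technologies Inc"))
# -> {'Dell Technologies Inc', 'Dell Technologies', 'Dell'}
```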
... We adopted a dictionary-based approach for recognizing named entities. We developed a dictionary of donor agency names and their aliases by adapting the steps in [42]: ...
Funding acknowledgements in research papers are an important resource for studying the impact of funding on research; they are also useful for mapping the funding landscape of a particular discipline, the portfolios of certain funding agencies, their co-funding activities, and the research performance of scientists and their organizations. However, the usage and detail of funding acknowledgements in publications vary in depth and lack consistency. As it becomes increasingly important for research organizations to show the impact of their work to donor agencies, grey literature such as technical project reports can serve as a valuable source of information to complement missing funding acknowledgements in their research information management systems. This paper presents the results of a study on extracting funding-related information from grey literature through text mining and complementing the corresponding MARC 21 bibliographic records of an international agroforestry research repository.
... The NER component is used for the discovery of named entities in unstructured texts. Here we use an approach similar to Loster et al. [3], which first creates large dictionaries of externally available knowledge and then integrates them into the training process of a conditional random field classifier. ...
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks such as the Paradise or Panama Papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort.
With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.