Article

Evaluation of the DEFINDER system for fully automatic glossary construction

Authors: Judith L. Klavans and Smaranda Muresan

Abstract

In this paper we present a quantitative and qualitative evaluation of DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. The quantitative evaluation shows that in terms of precision and recall as measured against human performance, DEFINDER obtained 87% and 75% respectively, thereby revealing the incompleteness of existing resources and the ability of DEFINDER to address these gaps. Our basis for comparison is definitions from on-line dictionaries, including the UMLS Metathesaurus. Qualitative evaluation shows that the definitions extracted by our system are ranked higher in terms of user-centered criteria of usability and readability than are definitions from on-line specialized dictionaries. The output of DEFINDER can be used to enhance these dictionaries. DEFINDER output is being incorporated in a system to clarify technical terms for non-specialist users in understandable non-technical language.
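DEFINDER combines shallow cue-phrase matching with deeper grammatical analysis. As a rough illustration only, and not DEFINDER's actual rule set, a minimal cue-phrase matcher for patterns such as "X, also called Y" or "X is defined as Y" could look like the following sketch; all patterns and example sentences are hypothetical.

```python
import re

# Illustrative cue-phrase patterns (hypothetical, not DEFINDER's actual rules).
# Each pattern captures a candidate term and the text that defines it.
CUE_PATTERNS = [
    # "Aspirin, also called acetylsalicylic acid, ..."
    re.compile(r"(?P<term>[A-Z][\w-]*(?:\s[\w-]+){0,3}?),?\s+also\s+(?:called|known\s+as)\s+"
               r"(?P<definition>[^,.;]+)", re.I),
    # "Hypertension is defined as persistently high blood pressure."
    re.compile(r"(?P<term>[A-Z][\w-]*(?:\s[\w-]+){0,3}?)\s+(?:is|are)\s+defined\s+as\s+"
               r"(?P<definition>[^.;]+)", re.I),
    # "An arrhythmia is an irregular heartbeat."
    re.compile(r"\b(?:a|an|the)\s+(?P<term>[\w-]+(?:\s[\w-]+){0,2}?)\s+is\s+(?:a|an)\s+"
               r"(?P<definition>[^.;]+)", re.I),
]

def extract_definitions(sentences):
    """Return (term, definition) pairs for sentences matching a cue pattern."""
    pairs = []
    for sent in sentences:
        for pattern in CUE_PATTERNS:
            match = pattern.search(sent)
            if match:
                pairs.append((match.group("term").strip(), match.group("definition").strip()))
                break  # one extraction per sentence is enough for this sketch
    return pairs

if __name__ == "__main__":
    demo = [
        "Aspirin, also called acetylsalicylic acid, relieves pain and reduces fever.",
        "An arrhythmia is an irregular heartbeat.",
        "The patient was discharged on Tuesday.",  # no definition expected here
    ]
    for term, definition in extract_definitions(demo):
        print(f"{term} -> {definition}")
```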


... DEFINDER [6] is based on a methodology that uses only lexical and syntactic information. It is an automatic definition extraction system targeting the medical domain, where the data consist of consumer-oriented medical articles. ...
... Many authors have pointed out that a quantitative evaluation such as the one we carried out in this work may not be completely appropriate [12]. A qualitative approach to evaluation also has to be taken into account (see for example [6]). For this reason we are planning to evaluate the effectiveness and usefulness of the system with real users, by means of a user-scenario methodology. ...
... In our work we do not have such prior domain knowledge; the system is expected to work with documents belonging to different domains. DEFINDER [6] is based on a methodology that uses only lexical and syntactic information. It is an automatic definition extraction system targeting the medical domain, where the data consist of consumer-oriented medical articles. ...
Article
Full-text available
This paper reports preliminary work on automatic glossary extraction for e-learning purposes. Glossaries are an important resource for learners: they not only facilitate access to learning documents but also represent an important learning resource in themselves. The work presented here was carried out within the LT4eL project, whose aim is to improve the e-learning experience by means of natural language and semantic techniques. This work focuses on a system that automatically extracts glossaries from learning objects; in particular, the system extracts definitions from morpho-syntactically annotated documents using a rule-based grammar. In order to develop such a system, a corpus composed of a collection of Learning Objects covering three different domains was collected and annotated. A quantitative evaluation was carried out comparing the definitions retrieved by the system against the manually marked definitions. On average, we obtain 14% precision, 86% recall and an F2 score of 0.33.
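For reference, precision, recall and the recall-weighted F2 score quoted above are related by the standard F-beta formula, sketched below with hypothetical counts. Note that an average of per-domain F2 scores need not equal the F2 computed from averaged precision and recall, so the sketch illustrates the formula rather than reproducing the 0.33 figure.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision and recall from raw extraction counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall twice as heavily as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Hypothetical counts for a single domain, not the LT4eL figures themselves.
p, r = precision_recall(true_positives=43, false_positives=264, false_negatives=7)
print(round(p, 2), round(r, 2), round(f_beta(p, r), 2))  # roughly 0.14, 0.86, 0.42
```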
... One of the most effective systems, DEFINDER (Klavans and Muresan 2001), combines simple cue phrases and structural indicators introducing the definitions and the defined term. The corpus used to support the development of the rules consists of well-structured medical documents, where 60 per cent of the definitions are introduced by a set of limited text markers. ...
... If we now turn to systems based only on pattern matching ensured by handcrafted rules, the state of the art in the area is represented by systems such as DEFINDER (Klavans and Muresan 2001), which is reported to have an F-measure of 0.80. Though not strictly comparable due to the use of different experimental conditions, including different datasets for evaluation, our approach seems to deliver results above this performance by a large margin, with scores in the range of 0.90-0.99. ...
Article
Full-text available
This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that relies solely on machine learning techniques, and by focusing on the issue of the imbalance of relevant datasets. We obtained a breakthrough in the automatic extraction of definitions by extensively and systematically experimenting with different sampling techniques and their combination, as well as with a range of different types of classifiers. Performance consistently scored in the range of 0.95–0.99 area under the receiver operating characteristic curve, a notable improvement of between 17 and 22 percentage points over the baseline of 0.73–0.77, for datasets with different rates of imbalance. Thus, the present paper also represents a contribution to the seminal work in natural language processing that points toward the importance of exploring the research path of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thus greatly improving the performance of a large range of tools that rely on them.
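A minimal sketch of the kind of experiment described above, random oversampling of the minority class followed by evaluation with area under the ROC curve, could look as follows with scikit-learn. The synthetic data and the single classifier are stand-ins for the paper's actual datasets and range of models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for an imbalanced definition-extraction dataset
# (roughly 5% positive "definition" sentences).
X, y = make_classification(n_samples=5000, n_features=50, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class examples until both classes are equally frequent."""
    minority, majority = X[y == 1], X[y == 0]
    minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=random_state)
    X_bal = np.vstack([majority, minority_up])
    y_bal = np.concatenate([np.zeros(len(majority)), np.ones(len(minority_up))])
    return X_bal, y_bal

baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
X_bal, y_bal = oversample_minority(X_train, y_train)
resampled = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

for name, model in [("no resampling", baseline), ("random oversampling", resampled)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```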
... The first step of such a task relies on the identification of what are called definitional sentences, i.e., sentences that contain at least one hypernym relation. This subtask is important in itself for many tasks such as Question Answering (Cui et al., 2007), construction of glossaries (Klavans and Muresan, 2001), extraction of taxonomic and non-taxonomic relations (Navigli, 2009; Snow et al., 2004), enrichment of concepts (Gangemi et al., 2003; Cataldi et al., 2009), and so forth. ...
... The work of (Klavans and Muresan, 2001) relies on a rule-based system that makes use of "cue phrases" and structural indicators that frequently introduce definitions, reaching 87% precision and 75% recall on a small, domain-specific corpus. ...
Article
Full-text available
In this paper we present a technique to reveal definitions and hypernym relations from text. Instead of using pattern-matching methods that rely on lexico-syntactic patterns, we propose a technique which only uses syntactic dependencies between terms extracted with a syntactic parser. The assumption is that syntactic information is more robust than patterns when coping with the length and complexity of sentences. Afterwards, we transform such syntactic contexts into abstract representations, which are then fed into a Support Vector Machine classifier. The results on an annotated dataset of definitional sentences demonstrate the validity of our approach, outperforming current state-of-the-art techniques.
... Definition extraction is an important NLP task, most frequently a subtask of terminology extraction (Pearson, 1996), the automatic creation of glossaries (Klavans and Muresan, 2000; Klavans and Muresan, 2001), question answering (Miliaraki and Androutsopoulos, 2004; Fahmi and Bouma, 2006), learning lexical semantic relations (Malaisé et al., 2004; Storrer and Wellinghoff, 2006) and automatic construction of ontologies (Walter and Pinkal, 2006). Tools for definition extraction are invariably language-specific and involve shallow or deep processing, with most work done for English (Pearson, 1996; Klavans and Muresan, 2000; Klavans and Muresan, 2001) and other Germanic languages (Fahmi and Bouma, 2006; Storrer and Wellinghoff, 2006; Walter and Pinkal, 2006), as well as French (Malaisé et al., 2004). To the best of our knowledge, no previous attempts at definition extraction have been made for Slavic, with the exception of some work on Bulgarian (Tanev, 2004; Simov and Osenova, 2005). ...
Article
Full-text available
This paper presents the results of the preliminary experiments in the automatic extraction of definitions (for semi-automatic glossary construction) from usually unstructured or only weakly structured e-learning texts in Bulgarian, Czech and Polish. The extraction is performed by regular grammars over XML-encoded morphosyntactically-annotated documents. The results are less than satisfying and we claim that the reason for that is the intrinsic difficulty of the task, as measured by the low interannotator agreement, which calls for more sophisticated deeper linguistic processing, as well as for the use of machine learning classification techniques.
... Other theoretical-descriptive works can be found in Feliu (2004) and Bach (2001 & 2005). Applied investigations, on the other hand, depart from theoretical-descriptive studies with the aim of elaborating methodologies for the automatic extraction of DCs, more specifically for the extraction of definitions in medical texts (Klavans & Muresan, 2001), for the extraction of metalinguistic information (Rodríguez, 2004), and for the automatic elaboration of ontologies (Malaisé, 2005). In general terms, those studies employ definitional patterns as a common starting point for the extraction of knowledge about terms. ...
Article
Full-text available
One of the main goals of terminological work is the identification of knowledge about terms in specialised texts. In order to compile dictionaries, glossaries or ontologies, terminographers usually search for definitions of the terms they intend to define. This search can be done in specialised corpora, where definitions usually appear in definitional contexts, i.e. text fragments where an author explicitly defines a term. In this paper we present research focused on the automatic extraction of those definitional contexts. Our methodology includes three different processes: the extraction of definitional patterns, the automatic filtering of non-definitional contexts, and the automatic identification of constitutive elements, i.e., terms and definitions.
... Research in the context of open-domain definitional question answering has mainly focused on applying handcrafted lexico-syntactic patterns (e.g., "<TERM>, ?(is|was)?Also?<RB>?called|named|known+as<NP>") to identify definitional sentences [23][24][25]. Similarly, Klavans and Muresan [26] extracted glossaries from medical text using a set of manually annotated surface cues (e.g., "also called"). In contrast to other systems [23][24][25][26], MedQA implements a set of lexico-syntactic patterns that are generated automatically. ...
... Similarly, Klavans and Muresan [26] extracted glossaries from medical text using a set of manually annotated surface cues (e.g., "also called"). In contrast to other systems [23][24][25][26], MedQA implements a set of lexico-syntactic patterns that are generated automatically. Additionally, MedQA is built upon four advanced techniques, namely question analysis, information retrieval, answer extraction, and summarization, to generate a coherent answer to definitional questions. ...
Article
Full-text available
Physicians have many questions when caring for patients, and frequently need to seek answers to those questions. Information retrieval systems (e.g., PubMed) typically return a list of documents in response to a user's query. Frequently the number of returned documents is large and makes physicians' information seeking "practical only 'after hours' and not in the clinical settings". Question answering techniques are based on automatically analyzing thousands of electronic documents to generate short-text answers in response to clinical questions posed by physicians. The authors address physicians' information needs and describe the design, implementation, and evaluation of the medical question answering system (MedQA). Although our long-term goal is to enable MedQA to answer all types of medical questions, we currently implemented MedQA to integrate information retrieval, extraction, and summarization techniques to automatically generate paragraph-level text for definitional questions (i.e., "What is X?"). MedQA can be accessed at http://www.dbmi.columbia.edu/~yuh9001/research/MedQA.html.
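As an illustration of the kind of lexico-syntactic cue mentioned in the citing passages above ("<TERM> ... also called/named/known as <NP>"), a single hand-written pattern could be instantiated as below. This is a hypothetical sketch, not MedQA's automatically generated pattern set.

```python
import re

# One illustrative instantiation of the "<TERM> ... also called/named/known as <NP>"
# cue described above; a sketch only, not MedQA's automatically generated patterns.
ALSO_KNOWN_AS = re.compile(
    r"(?P<term>[A-Z][\w-]*(?:\s[\w-]+){0,4}?),?\s+(?:is|was)?\s*(?:also\s+)?"
    r"(?:called|named|known\s+as)\s+(?P<np>[^,.;]+)",
    re.IGNORECASE,
)

def match_definitional(sentence: str):
    """Return (term, defining noun phrase) if the sentence matches the cue, else None."""
    m = ALSO_KNOWN_AS.search(sentence)
    return (m.group("term").strip(), m.group("np").strip()) if m else None

print(match_definitional("Myocardial infarction, also known as heart attack, occurs when blood flow stops."))
print(match_definitional("Vitamin B12 is also called cobalamin."))
```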
... The task of term and definition extraction has a long history because it is tightly related to the desire to structure the information contained in various texts. Starting from rule-based systems [9], which rely on handcrafted rules, it evolved to statistical methods [10][11][12] and lastly to deep learning systems [13,14]. The statistical methods operate either by automatically mining the patterns from the dataset or by constructing a set of features that is fed into a machine learning model. ...
Article
Full-text available
The decision-making processes that govern R&D rely on information related to current trends in particular research areas. In this work, we investigated how one can use large language models (LLMs) to transfer a dataset and its annotation from one language to another. This is crucial since sharing knowledge between different languages could boost certain under-resourced directions in the target language, saving a great deal of effort in data annotation or quick prototyping. We experiment with an English-Russian pair, translating the DEFT (Definition Extraction from Texts) corpus. This corpus contains three layers of annotation dedicated to term-definition pair mining, which is a rare annotation type for Russian. The presence of such a dataset is beneficial for natural language processing methods of trend analysis in science, since terms and definitions are the basic blocks of any scientific field. We provide a pipeline for the annotation transfer using LLMs. In the end, we train BERT-based models on the translated dataset to establish a baseline.
... The task of term and definition extraction has a long history because it is tightly related to the desire to structure the information contained in various texts. Starting from rule-based systems [6], which rely on handcrafted rules, it evolved to statistical methods [7,8,9] and lastly to deep learning systems [10,11]. The statistical methods operate either by automatically mining the patterns from the dataset or by constructing a set of features that is fed into a machine learning model. ...
Preprint
Full-text available
In this work, we investigated how one can use an LLM to transfer a dataset and its annotation from one language to another. This is crucial since sharing knowledge between different languages could boost certain under-resourced directions in the target language, saving a great deal of effort in data annotation or quick prototyping. We experiment with an English-Russian pair, translating the DEFT corpus. This corpus contains three layers of annotation dedicated to term-definition pair mining, which is a rare annotation type for Russian. We provide a pipeline for the annotation transfer using ChatGPT3.5-turbo and Llama-3.1-8b as core LLMs. In the end, we train BERT-based models on the translated dataset to establish a baseline.
... Definition extraction, which aims to extract definitions from corpus automatically, has been studied for a long period. Existing works for definition extraction can be roughly divided into three categories: 1) rule-based, which extracts definitions with defined linguistic rules and templates (Klavans and Muresan, 2001;Cui et al., 2004;Fahmi and Bouma, 2006); 2) machine learning-based, which extracts definitions by statistical machine learning with carefully designed features (Westerhout, 2009;Jin et al., 2013); 3) deep learning-based, the state-of-the-art approach for definition extraction, which is based on deep learning models such as CNN, LSTM, and BERT (Anke and Schockaert, 2018; Veyseh et al., 2020;Kang et al., 2020;Vanetik et al., 2020). ...
... Drawing on computational linguistics, the search strategy used cue phrases to indicate the likely presence of a definition. 33,34 These cue phrases were detected through the use of proximity operators that specify the occurrence of verbs commonly used in definitions (i.e. means, define, is, etc.) and the word to be defined (definiendum) within a certain number of words (e.g., [informatics NEAR/5 defin*]). ...
Article
Introduction: Health informatics curricular content, while beneficial to the spectrum of education in physical therapy, is currently only required in physical therapist education programs, and even there, it is only crudely defined. The purpose of our study was to use the techniques of concept analysis and concept mapping to provide an outline of informatics content that can be the foundation for curriculum development and the construction of informatics competencies for physical therapy. Review of literature: There is no established consensus on the definition of health informatics. Medical and nursing informatics literature that clarifies and agrees on the attributes of health informatics is insufficient for curriculum development. Concept analysis is an approach commonly used in nursing and other health professions to analyze and deconstruct a term, in this case, health informatics, in order to provide clarity on its meaning. Subjects: A total of 73 definitions of health informatics were extracted from articles that met search criteria. Methods: We used an 8-step methodology from the literature for concept analysis, which included 1) selecting a concept; 2) determining the aims of the analysis; 3) identifying uses of the concept; 4) determining the defining attributes of the concept; 5) identifying a model case; 6) identifying related and illegitimate cases; 7) identifying antecedents and consequences; and 8) defining empirical referents. In addition, concept mapping was used to develop a visual representation of the thematic attributes and the elements that make them up. Results: We provide a visual map of the concept we now term "informatics in human health and health care" and clarify its attributes of data, disciplinary lens, multidisciplinary science, technology, and application. We also provide clarification through the presentation of a model case and a contrary case. Discussion and conclusion: Concept analysis and mapping of informatics in human health and health care provided clarity on content that should be addressed across the continuum of physical therapy education. The next steps from this work will be to develop competencies for all levels of physical therapy education.
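Outside a search engine, a NEAR/n proximity operator such as the [informatics NEAR/5 defin*] query mentioned in the citing passage above can be approximated with a simple token-window check. The sketch below is an assumption-laden emulation, not the syntax of any particular retrieval system.

```python
import re

def near(text: str, pattern_a: str, pattern_b: str, window: int = 5) -> bool:
    """Rough emulation of a NEAR/window proximity operator:
    True if a token matching pattern_a occurs within `window` tokens
    of a token matching pattern_b (e.g. 'informatics' NEAR/5 'defin*')."""
    tokens = re.findall(r"\w+", text.lower())
    hits_a = [i for i, t in enumerate(tokens) if re.fullmatch(pattern_a, t)]
    hits_b = [i for i, t in enumerate(tokens) if re.fullmatch(pattern_b, t)]
    return any(abs(i - j) <= window for i in hits_a for j in hits_b)

# The wildcard 'defin*' becomes the regex r'defin\w*'; the sentences are illustrative.
print(near("Health informatics is defined as the interdisciplinary study of ...", r"informatics", r"defin\w*"))
print(near("Informatics curricula vary widely across physical therapy programs.", r"informatics", r"defin\w*"))
```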
... Definition Extraction. Existing works for definition extraction can be roughly divided into three categories: 1) rule-based, which extracts definitions with defined linguistic rules and templates (Klavans and Muresan, 2001; Cui et al., 2004; Fahmi and Bouma, 2006); 2) machine learning-based, which extracts definitions by statistical machine learning with carefully designed features (Westerhout, 2009; Jin et al., 2013); 3) deep learning-based, the state-of-the-art approach for definition extraction, which is based on deep learning models such as CNN, LSTM, and BERT (Anke and Schockaert, 2018; Veyseh et al., 2020; Kang et al., 2020). ...
Preprint
Definitions are essential for term understanding. Recently, there has been increasing interest in extracting and generating definitions of terms automatically. However, existing approaches for this task are either extractive or abstractive: definitions are either extracted from a corpus or generated by a language generation model. In this paper, we propose to combine extraction and generation for definition modeling: we first extract self- and correlative definitional information of target terms from the Web and then generate the final definitions by incorporating the extracted definitional information. Experiments demonstrate that our framework can generate high-quality definitions for technical terms and significantly outperform state-of-the-art models for definition modeling.
... Previous approaches (Klavans and Muresan, 2001; Fahmi and Bouma, 2006; Zhang and Jiang, 2009) to recognizing definition sentences mainly focused on the use of linguistic clues (e.g., "is", "means", "are", "a", or "()"). However, these studies fail to classify sentences containing definitions in which such linguistic clues are absent. ...
... These works rely mainly on manually crafted rules based on linguistic parameters. Klavans and Muresan [3] presented DEFINDER, a rule-based system that mines consumer-oriented full text articles to extract definitions and the terms they define; the system is evaluated on definitions from on-line dictionaries such as the UMLS Metathesaurus [4]. Xu et al. [2] used various linguistic tools to extract kernel facts for the definitional question-answering task in the Text REtrieval Conference (TREC) 2003. ...
Article
Full-text available
Definitions are extremely important for efficient learning of new materials. In particular, mathematical definitions are necessary for understanding mathematics-related areas. Automated extraction of definitions could be very useful for automatically indexing educational materials, building taxonomies of relevant concepts, and more. For definitions that are contained within a single sentence, this problem can be viewed as a binary classification of sentences into definitions and non-definitions. In this paper, we focus on automatic detection of one-sentence definitions in mathematical and general texts. We experiment with different classification models arranged in an ensemble and applied to a sentence representation containing syntactic and semantic information, in order to classify sentences. Our ensemble model is applied to data adjusted with oversampling. Our experiments demonstrate the superiority of our approach over state-of-the-art methods in both general and mathematical domains.
... These works rely mainly on manually crafted rules based on linguistic parameters. Klavans & Muresan (2001) presented DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define; the system is evaluated on definitions from on-line dictionaries such as the UMLS Metathesaurus (Schuyler et al., 1993). Xu et al. (2003) used various linguistic tools to extract kernel facts for the definitional question-answering task in TREC 2003. ...
Preprint
Automatic definition extraction from texts is an important task that has numerous applications in several natural language processing fields such as summarization, analysis of scientific texts, automatic taxonomy generation, ontology generation, concept identification, and question answering. For definitions that are contained within a single sentence, this problem can be viewed as a binary classification of sentences into definitions and non-definitions. In this paper, we focus on automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from surrounding text. We experiment with several data representations, which include sentence syntactic structure and word embeddings, and apply deep learning methods such as the Convolutional Neural Network (CNN) and the Long Short-Term Memory network (LSTM), in order to identify mathematical definitions. Our experiments demonstrate the superiority of CNN and its combination with LSTM, when applied on the syntactically-enriched input representation. We also present a new dataset for definition extraction from mathematical texts. We demonstrate that this dataset is beneficial for training supervised models aimed at extraction of mathematical definitions. Our experiments with different domains demonstrate that mathematical definitions require special treatment, and that using cross-domain learning is inefficient for that task.
... Early attempts to solve this task relied on rule-based methods (Klavans and Muresan, 2001;Cui et al., 2005). However, such methods are typically only able to detect explicit, direct and structured definitions, which usually contain definitor verb phrases such as means, is, is defined as. ...
Conference Paper
Full-text available
We describe the system submitted to SemEval-2020 Task 6, Subtask 1. The aim of this subtask is to predict whether a given sentence contains a definition or not. Unsurprisingly, we found that strong results can be achieved by fine-tuning a pre-trained BERT language model. In this paper, we analyze the performance of this strategy. Among others, we show that results can be improved by using a two-step fine-tuning process, in which the BERT model is first fine-tuned on the full training set, and then further specialized towards a target domain.
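A generic sketch of such a two-step fine-tuning setup with the Hugging Face transformers and datasets libraries might look as follows. The toy sentences, dataset names and hyperparameters are illustrative assumptions, not the submitted system's configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-ins for the real data: step 1 uses the full training set (all domains),
# step 2 specializes the same model on target-domain sentences only.
full_train = Dataset.from_dict({
    "text": ["A neuron is a cell that transmits nerve impulses.",
             "The meeting was postponed until Friday.",
             "Entropy is a measure of disorder in a system.",
             "She walked to the station."],
    "label": [1, 0, 1, 0],
})
domain_train = Dataset.from_dict({
    "text": ["A tensor is a multi-dimensional array of numbers.",
             "The dataset was released last year."],
    "label": [1, 0],
})

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

full_train = full_train.map(tokenize, batched=True)
domain_train = domain_train.map(tokenize, batched=True)

def fine_tune(train_set, output_dir, epochs):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=2, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train_set).train()

fine_tune(full_train, "step1_full", epochs=1)      # step 1: fine-tune on everything
fine_tune(domain_train, "step2_domain", epochs=1)  # step 2: specialize on the target domain
```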
... Previous approaches (Klavans and Muresan, 2001; Fahmi and Bouma, 2006; Zhang and Jiang, 2009) to recognizing definition sentences mainly focused on the use of linguistic clues (e.g., "is", "means", "are", "a", or "()"). However, these studies fail to classify sentences containing definitions in which such linguistic clues are absent. ...
Preprint
Full-text available
This work presents our contribution in the context of the 6th task of SemEval-2020: Extracting Definitions from Free Text in Textbooks (DeftEval). This competition consists of three subtasks with different levels of granularity: (1) classification of sentences as definitional or non-definitional, (2) labeling of definitional sentences, and (3) relation classification. We use various pretrained language models (i.e., BERT, XLNet, RoBERTa, SciBERT, and ALBERT) to solve each of the three subtasks of the competition. Specifically, for each language model variant, we experiment by both freezing its weights and fine-tuning them. We also explore a multi-task architecture that was trained to jointly predict the outputs for the second and the third subtasks. Our best performing model evaluated on the DeftEval dataset obtains the 32nd place for the first subtask and the 37th place for the second subtask. The code is available for further research at: https://github.com/avramandrei/DeftEval.
... We can categorize the previous work on DE into three categories: 1) the rule-based approach: the first attempts at DE defined linguistic rules and templates to capture patterns expressing term-definition relations (Klavans and Muresan 2001; Cui, Kan, and Chua 2004; Fahmi and Bouma 2006). While the rule-based approach is intuitive and has high precision, it suffers from low recall; 2) the feature engineering approach: this approach addresses the low recall issue by relying on statistical machine learning models (i.e., SVM and CRF) with carefully designed features (i.e., syntax and semantics) to solve DE (Jin et al. 2013; Westerhout 2009). ...
Article
Full-text available
Definition Extraction (DE) is one of the well-known topics in Information Extraction that aims to identify terms and their corresponding definitions in unstructured texts. This task can be formalized either as a sentence classification task (i.e., containing term-definition pairs or not) or a sequential labeling task (i.e., identifying the boundaries of the terms and definitions). The previous works for DE have only focused on one of the two approaches, failing to model the inter-dependencies between the two tasks. In this work, we propose a novel model for DE that simultaneously performs the two tasks in a single framework to benefit from their inter-dependencies. Our model features deep learning architectures to exploit the global structures of the input sentences as well as the semantic consistencies between the terms and the definitions, thereby improving the quality of the representation vectors for DE. Besides the joint inference between sentence classification and sequential labeling, the proposed model is fundamentally different from the prior work for DE in that the prior work has only employed the local structures of the input sentences (i.e., word-to-word relations), and not yet considered the semantic consistencies between terms and definitions. In order to implement these novel ideas, our model presents a multi-task learning framework that employs graph convolutional neural networks and predicts the dependency paths between the terms and the definitions. We also seek to enforce the consistency between the representations of the terms and definitions both globally (i.e., increasing semantic consistency between the representations of the entire sentences and the terms/definitions) and locally (i.e., promoting the similarity between the representations of the terms and the definitions). The extensive experiments on three benchmark datasets demonstrate the effectiveness of our approach.
... In this context, systems able to address the problem of Definition Extraction (DE), i.e., identifying definitional information in free text, are of great value both for computational lexicography and for NLP. In the early days of DE, rule-based approaches leveraged linguistic cues observed in definitional data (Rebeyrolle and Tanguy, 2000; Klavans and Muresan, 2001; Malaisé et al., 2004; Saggion and Gaizauskas, 2004; Storrer and Wellinghoff, 2006). However, in order to deal with problems like language dependence and domain specificity, machine learning was incorporated in more recent contributions (Del Gaudio et al., 2013), which focused on encoding informative lexico-syntactic patterns in feature vectors (Cui et al., 2005; Fahmi and Bouma, 2006; Westerhout and Monachesi, 2007; Borg et al., 2009), both in supervised and semi-supervised settings (Reiplinger et al., 2012). ...
... - Definder [19] is a rule-based system for the automatic extraction of definitions from medical texts in English. - GlossExtractor [29] works on the Web, mainly on online glossaries and specialised Web documents, for the automatic extraction of definitions, but starting from a list of predefined terms. ...
Article
The paper presents LEXIK, an intelligent terminological architecture that is able to efficiently obtain specialized lexical resources for elaborating dictionaries and providing lexical support for different expert tasks. LEXIK is designed as a powerful tool to create a rich knowledge base for lexicography. It will process large amounts of data in a modular system that combines several applications and techniques for terminology extraction, definition generation, example extraction and term banks, which have so far been partially developed. Such integration is a challenge for the area, which lacks an integrated system for extracting and defining terms from a non-preprocessed corpus.
... Among the approaches which extract unrestricted textual definitions from open text, Fujii and Ishikawa (2000) determine the definitional nature of text fragments by using an n-gram model, whereas Klavans and Muresan (2001) apply pattern matching techniques at the lexical level guided by cue phrases such as "is called" and "is defined as". More recently, a domain-independent supervised approach, named Word-Class Lattices (WCLs), was presented which learns lattice-based definition classifiers applied to candidate sentences containing the input terms (Navigli and Velardi, 2010). ...
Article
Full-text available
In this paper we present a minimally-supervised approach to the multi-domain acquisition of wide-coverage glossaries. We start from a small number of hypernymy relation seeds and bootstrap glossaries from the Web for dozens of domains using Probabilistic Topic Models. Our experiments show that we are able to extract high-precision glossaries comprising thousands of terms and definitions.
... We assume that the best sources for finding hyponymy-hyperonymy relations are the definitions expressed in specialized texts, following Sager and Ndi-Kimbi (1995), Pearson (1998), Meyer (2001), as well as Klavans and Muresan (2001). In order to achieve this goal, we take into account the approach proposed by Acosta et al. (2011). ...
Article
Full-text available
We expose a method for extracting hyponyms and hypernyms from analytical definitions, focusing on the relation observed between hypernyms and relational adjectives (e.g., cardiovascular disease). These adjectives introduce a set of specialized features according to a categorization proper to a particular knowledge domain. For detecting these sequences of hypernyms associated to relational adjectives, we apply a set of linguistic heuristics for distinguishing such adjectives from others (e.g. psychological/ugly disorder). In our case, we applied linguistic heuristics for identifying such sequences in medical texts in Spanish. The use of these heuristics allows a trade-off between precision & recall, which is an important advance that complements other works.
... There are several techniques in the literature for the automated acquisition of definitional knowledge. Fujii and Ishikawa (2000) use an n-gram model to determine the definitional nature of text fragments, whereas Klavans and Muresan (2001) apply pattern matching techniques at the lexical level guided by cue phrases such as "is called" and "is defined as". Cafarella et al. (2005) developed a Web search engine which handles more general and complex patterns like "cities such as ProperNoun(Head(NP))" in which it is possible to constrain the results with syntactic properties. ...
Conference Paper
Full-text available
We present GlossBoot, an effective minimally-supervised approach to acquiring wide-coverage domain glossaries for many languages. For each language of interest, given a small number of hypernymy relation seeds concerning a target domain, we bootstrap a glossary from the Web for that domain by means of iteratively acquired term/gloss extraction patterns. Our experiments show high performance in the acquisition of domain terminologies and glossaries for three different languages.
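In the same spirit, a toy seed-driven bootstrapping loop (learn an extraction pattern from a seed term/gloss pair, apply it to harvest new pairs, repeat) can be sketched over an in-memory corpus instead of the Web. The sentences, the seed and the pattern generalization below are all illustrative assumptions, not GlossBoot's actual procedure.

```python
import re

# Tiny in-memory "corpus"; all sentences and the single seed are illustrative.
corpus = [
    "An enzyme is a protein that catalyses biochemical reactions.",
    "A catalyst is a substance that speeds up a chemical reaction.",
    "A substrate is a molecule upon which an enzyme acts.",
    "The laboratory was renovated in 2019.",
]
seeds = {("enzyme", "protein that catalyses biochemical reactions")}

def learn_patterns(pairs, sentences):
    """Turn each (term, gloss) seed occurrence into a regex by keeping its middle context."""
    patterns = set()
    for term, gloss in pairs:
        for sent in sentences:
            if term in sent and gloss in sent:
                middle = sent[sent.index(term) + len(term):sent.index(gloss)]
                patterns.add(re.compile(r"(?P<term>\w+)" + re.escape(middle) + r"(?P<gloss>[^.]+)"))
    return patterns

def apply_patterns(patterns, sentences):
    """Extract new (term, gloss) candidates with the learned patterns."""
    found = set()
    for sent in sentences:
        for pat in patterns:
            m = pat.search(sent)
            if m:
                found.add((m.group("term"), m.group("gloss").strip()))
    return found

for _ in range(2):  # two bootstrapping iterations suffice for this toy corpus
    seeds |= apply_patterns(learn_patterns(seeds, corpus), corpus)

for term, gloss in sorted(seeds):
    print(f"{term}: {gloss}")
```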
... Systems based on linguistic knowledge (Heid et al., 1996; Klavans and Muresan, 2000, 2001a, 2001b) use different resources containing different kinds of linguistic information for term extraction. This linguistic information concerns: lexicographic information (term dictionaries and lists of auxiliary words, i.e. stopwords); morphological information (patterns of word-internal structure); morphosyntactic information (morphosyntactic categories and syntactic functions); semantic information (semantic classifications); and pragmatic information (typographic representations and information on the position of the term in the text). ...
Article
Full-text available
This work shows initial results of a project focused on the evaluation of automatic term extraction methods from the linguistic, statistical and hybrid approaches, in the Ceramic Tiles domain. Three points are developed in this paper: 1) the comparison between manual and statistical processes of term extraction; 2) the comparison between automatic and manual procedures; and 3) the comparison among lexical measures employed in the statistical process.
... Applied investigations depart from theoretical-descriptive studies with the aim of elaborating methodologies for the automatic extraction of DCs. Some of those applied investigations are the extraction of definitions in medical texts (Klavans & Muresan 2001), the extraction of metalinguistic information (Rodríguez 2004), and the automatic elaboration of ontologies (Malaisé 2005). In general terms, those studies employ definitional patterns as a common starting point for the extraction of knowledge about terms. ...
Article
Full-text available
In this paper we present a pattern-based approach to the automatic extraction of definitional knowledge from specialised Spanish texts. Our methodology is based on the search of definitional verbal patterns to extract definitional contexts related to different kinds of definitions: analytic, extensional, functional and synonymic. This system could be a helpful tool in the process of elaborating specialised dictionaries, glossaries and ontologies. http://hdl.handle.net/10230/23552
... The work of [23] relies on a rule-based system that makes use of "cue phrases" and structural indicators that frequently introduce definitions, reaching 87% precision and 75% recall on a small, domain-specific corpus. ...
Article
Full-text available
Nowadays, there is a huge amount of textual data coming from on-line social communities like Twitter or from encyclopedic resources such as Wikipedia and similar platforms. This Big Data era has created novel challenges that must be faced in order to make sense of large data storages as well as to efficiently find specific information within them. In a more domain-specific scenario like the management of legal documents, the extraction of semantic knowledge can help domain engineers to find relevant information more rapidly, and can provide assistance in the process of constructing application-based legal ontologies. In this work, we face the problem of automatically extracting structured knowledge to improve semantic search and ontology creation on textual databases. To achieve this goal, we propose an approach that first relies on well-known Natural Language Processing techniques like Part-Of-Speech tagging and Syntactic Parsing. Then, we transform this information into generalized features that aim at capturing the surrounding linguistic variability of the target semantic units. These new featured data are finally fed into a Support Vector Machine classifier that computes a model to automate the semantic annotation. We first tested our technique on the problem of automatically extracting semantic entities and involved objects within legal texts. Then, we focused on the identification of hypernym relations and definitional sentences, demonstrating the validity of the approach on different tasks and domains.
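A minimal sketch of the general idea, generalize the linguistic context around candidate units and feed the result to an SVM, is shown below with scikit-learn. The coarse word-class mapping stands in for real POS tagging and parsing, and the tiny legal-sounding training set is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Definitional cue words are kept verbatim; all other tokens are mapped to coarse
# classes as a stand-in for real POS tagging / syntactic parsing.
KEEP = {"is", "are", "means", "defined", "known", "called", "a", "an", "the"}

def generalize(sentence: str) -> str:
    out = []
    for tok in sentence.lower().rstrip(".").split():
        if tok in KEEP:
            out.append(tok)
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append("TOKEN")  # everything else is abstracted away
    return " ".join(out)

# Tiny illustrative training set (labels: 1 = definitional sentence, 0 = not).
sentences = [
    "A contract is an agreement enforceable by law.",
    "Negligence means a failure to exercise reasonable care.",
    "The hearing was adjourned to 10 June.",
    "The parties signed the agreement yesterday.",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b"),
                    LinearSVC())
clf.fit([generalize(s) for s in sentences], labels)

test = "An easement is a right to cross another person's land."
print(clf.predict([generalize(test)]))  # likely [1] for this toy model
```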
... Rule-based approaches to definition extraction tend to use a combination of linguistic information and cue phrases to identify definitions. In [KM01] and [SW06] the corpora used are technical texts, where definitions are more likely to be well-structured and thus easier to identify. Other work attempts definition extraction from eLearning texts [WM07] and [GB07] and the Internet [KPP03]. ...
Article
Full-text available
The automatic extraction of definitions from natural language texts has various applications such as the creation of glossaries and question-answering systems. In this paper we look at the extraction of definitions from non-technical texts using parser combinators in Haskell. We argue that this approach gives a general and compositional way of characterising natural language definitions. The parsers we develop are shown to be highly effective in the identification of definitions. Furthermore, we show how we can also automatically transform these parsers into other formats to be readily available for use within an eLearning system.
Article
Full-text available
In this paper, a terminological framework, both theoretical and methodological, backed by empirical data, is proposed in order to highlight the particular questions to which attention should be paid when conceiving an evaluation scheme for definition extraction (DE) in terminology. The premise is that not just any information is relevant to defining a given concept in a given expert domain. Therefore, evaluation guidelines applicable to DE should integrate some understanding of what is relevant for terminographic definitions and in which cases. This, in turn, requires some understanding of the mechanisms of feature selection. An explanatory hypothesis of feature relevance is then put forward and one of its aspects examined, to see to what extent the example considered may serve as a relevance referential. To conclude, a few methodological proposals for automating the application of relevance tests are discussed. The overall objective is to explore ways of empirically testing broader theoretical hypotheses and principles that should orient the conception of general guidelines to evaluate DE for terminographic purposes.
Article
Full-text available
In this paper we present a description and evaluation of a pattern-based approach for definition extraction in Spanish specialised texts. The system is based on the search for definitional verbal patterns related to four different kinds of definitions: analytical, extensional, functional and synonymic. This system could be helpful in the development of ontologies, databases of lexical knowledge, glossaries or specialised dictionaries.
Article
Full-text available
Books and other text-based learning material contain implicit information which can aid the learner but which usually can only be accessed through a semantic analysis of the text. Definitions of new concepts appearing in the text are one such instance. If extracted and presented to the learner in form of a glossary, they can provide an excellent reference for the study of the main text. One way of extracting definitions is by reading through the text and annotating definitions manually --- a tedious and boring job. In this paper, we explore the use of machine learning to extract definitions from nontechnical texts, reducing human expert input to a minimum. We report on experiments we have conducted on the use of genetic programming to learn the typical linguistic forms of definitions and a genetic algorithm to learn the relative importance of these forms. Results are very positive, showing the feasibility of exploring further the use of these techniques in definition extraction. The genetic program is able to learn similar rules derived by a human linguistic expert, and the genetic algorithm is able to rank candidate definitions in an order of confidence.
Article
Full-text available
In this paper we present the implementation of definition extraction from multilingual corpora of scientific articles. We establish relations between the definitions and authors by using indexed references in the text. Our method is based on a linguistic ontology designed for this purpose. We propose two evaluations of the annotations.
Article
Full-text available
Example sentences provide an intuitive means of grasping the meaning of a word, and are frequently used to complement conventional word definitions. When a word has multiple meanings, it is useful to have example sentences for specific senses (and hence definitions) of that word rather than indiscriminately lumping all of them together. In this paper, we investigate to what extent such sense-specific example sentences can be extracted from parallel corpora using lexical knowledge bases for multiple languages as a sense index. We use word sense disambiguation heuristics and a cross-lingual measure of semantic similarity to link example sentences to specific word senses. From the sentences found for a given sense, an algorithm then selects a smaller subset that can be presented to end users, taking into account both representativeness and diversity. Preliminary results show that a precision of around 80% can be obtained for a reasonable number of word senses, and that the subset selection yields convincing results.
Article
Full-text available
This paper outlines a formal description of grammatical relations between definitions and verbal predications found in Definitional Contexts in Spanish. It can be situated within the framework of Predication Theory, a model derived from Government & Binding Grammar. We use this model to describe: (i) the syntactic patterns that establish the relationship between definitions and predications; (ii) how useful these patterns are for the identification of definitions in technical corpora.
... This latter idea was later reinforced by the study of Meyer [7], who argues that in a specialised text the definitional patterns that connect terms with their definitions can also introduce cues that make it possible to automatically recognize the type of definition present in the DCs, as well as to automatically build a conceptual network. On the other hand, there are applied investigations that have started from theoretical-descriptive studies in order to elaborate methodologies for the automatic extraction of DCs, specifically: for the automatic recognition of definitions in medical texts [4], for the automatic identification of definitions for question answering systems [10], for the automatic extraction of metalinguistic information for terminology [9], and for the automatic elaboration of ontologies [5]. What the applied studies have in common is that they start from the search for definitional patterns in order to extract relevant information about terms. ...
Article
... definitional, by means of lexico-metalinguistic rules, in order to obtain specialised knowledge that allows us to define terms.
... Since the 1990s, several authors have proposed methods to identify lexico-syntactic patterns [4, 5]. DEFINDER [6] is an automatic definition extraction system that combines simple cue phrases and structural indicators introducing the definitions and the defined term. It was developed with a well-structured medical corpus, where 60% of the definitions are introduced by a limited set of text markers. ...
Article
Full-text available
Definition extraction is an important task in the NLP and IR fields in the context of e.g. question answering, ontology learning, and dictionary and glossary construction. When addressed with learning algorithms, it turns out to be a challenging task due to the structure of the data set, the reason being that definition-bearing sentences are much fewer than sentences that are not definitions. In this paper, we present results from experiments that seek to obtain optimal solutions for this problem by using a corpus written in the Portuguese language. Our results show an improvement of 29 points in the AUC metric and more than 60 points when considering the F-measure.
... This method was adopted by [19] to cover other types of relations. DEFINDER [13] is considered a state-of-the-art system. It combines simple cue phrases and structural indicators introducing the definitions and the defined term. ...
Article
Full-text available
In this paper we report on the performance of different learning algorithms and different sampling techniques applied to a definition extraction task, using data sets in different languages. We compare our results with those obtained by handcrafted rules to extract definitions. When Definition Extraction is handled with machine learning algorithms, two different issues arise. On the one hand, in most cases the data set used to extract definitions is unbalanced, and this means that it is necessary to deal with this characteristic with specific techniques. On the other hand, it is possible to use the same methods to extract definitions from documents in different corpora, making the classifier language-independent.
... Current research indicates that they constructed a medical term ontology and identified semantic relations between compound words. Klavans and Muresan [3] carried out a quantitative and qualitative evaluation of DEFINDER, an automatic lexicon construction system, by comparing its output to the definitions in an online technical terminology dictionary. DEFINDER is a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. ...
Article
Full-text available
This paper describes a way to develop the Psychoms process, a mental health patient management and variance analysis system that has been developed using artificial intelligence. Although it is agreed that there is a need for clinical pathway variance analysis, methods for creating such a system are less well defined. The procedure and systematic process described aim to improve patients' quality of life through consistent and timely care. Ultimately, its potential influence is to assist in the improvement of quality health care services. This paper illustrates a method of outcomes management and variance analysis as the prospective development of future research.
... • Term Definition Snippets: readable definitions of concepts of interest, extracted from consumer-oriented online full text articles by combining shallow natural language processing with deep grammatical analysis. See [29,30] for more details on the term definition extraction approach. Figure 7: system for generating coordinated multimedia summaries. Figure 8 shows a snapshot of the user interface of CMS. ...
Conference Paper
Full-text available
Healthcare is a data-rich but information-poor domain. Terabytes of multimedia medical data are being generated on a monthly basis in a typical healthcare organization in order to document patients' health status and the care process. Government and health-related organizations are pushing for fully electronic, cross-institution, integrated Electronic Health Records to provide better, more cost-effective and more complete access to this data. However, efficient access to the content of such records for timely and decision-enabling information extraction will not be available. Such a capability is essential for providing efficient decision support and objective evidence to clinicians. In addition, researchers, medical students, patients, and payers could also benefit from it. We present the idea of concept-based multimedia health records, which aims at organizing health records at the information level. We will explore the opportunities and possibilities that such an organization will provide, what role the field of multimedia content management could play in materializing this type of health record organization, and what the challenges will be in the quest for realizing the idea. We believe that the field of multimedia can play a very active role in taking healthcare information systems to the next level by facilitating access to decision-enabling information for different types of users in healthcare. Our goal is to share with the community our thoughts on where the field of multimedia content management research should be focusing its attention to have a fundamental impact on the practice of medicine.
... These language-specific local grammars are applied to a test set from the same language in order to estimate their coverage (cf. also [10]). A major issue will be the precision of these methods: it has been shown that too simple local grammars also capture text snippets which are not definitions (cf. ...
Conference Paper
Full-text available
Given the huge amount of static and dynamic content created for eLearning tasks, the major challenge for extending their use is to improve the effectiveness of retrieval and accessibility by making use of Learning Management Systems. The aim of the European project Language Technology for eLearning is to tackle this problem by providing Language Technology based functionalities and by integrating semantic knowledge to facilitate the management, distribution and retrieval of the learning material.
... Applied investigations, on the other hand, depart from theoretical-descriptive studies with the aim of elaborating methodologies for the automatic extraction of DCs, more specifically for the extraction of definitions in medical texts [5], for the extraction of definitions for question answering systems [6], for the automatic elaboration of ontologies [7], for the extraction of semantic relations from specialised texts [8], as well as for the extraction of relevant information for eLearning purposes [9], [10]. ...
Conference Paper
Full-text available
Terminological work aims to identify knowledge about terms in specialised texts in order to compile dictionaries, glossaries or ontologies. Searching for definitions of the terms that terminographers intend to define is therefore an essential task. This search can be done in specialised corpora, where definitions usually appear in definitional contexts, i.e. text fragments where an author explicitly defines a term. We present research focused on the automatic extraction of those definitional contexts. The methodology includes three different processes: the extraction of definitional patterns, the automatic filtering of non-relevant contexts, and the automatic identification of constitutive elements, i.e., terms and definitions. http://hdl.handle.net/10230/24631
... VI. PREVIOUS WORK There is substantial previous work on definition extraction, as this is a subtask of many applications, including terminology extraction [8], the automatic creation of glossaries [9], [10], question answering [11], [12], learning lexical semantic relations [13], [14] and the automatic construction of ontologies [15]. Despite the current dominance of the ML paradigm in NLP, tools for definition extraction are invariably language-specific and involve shallow or deep processing, with most work done for English and other Germanic languages, as well as French. ...
Conference Paper
Full-text available
The article discusses methods of improving the ways of applying balanced random forests (BRFs), a machine learning classification algorithm, used to extract definitions from written texts. These methods include different approaches to selecting attributes, optimising the classifier prediction threshold for the task of definition extraction and initial filtering by a very simple grammar.
... Definition Extraction. A great deal of work is concerned with definition extraction in several languages (Klavans and Muresan, 2001;Storrer and Wellinghoff, 2006;Gaudio and Branco, 2007;Iftene et al., 2007;Westerhout and Monachesi, 2007;Przepiórkowski et al., 2007;Degórski et al., 2008). The majority of these approaches use symbolic methods that depend on lexico-syntactic patterns or features, which are manually crafted or semi-automatically learned (Zhang and Jiang, 2009;Hovy et al., 2003;Fahmi and Bouma, 2006;Westerhout, 2009). ...
Conference Paper
Full-text available
Definition extraction is the task of automatically identifying definitional sentences within texts. The task has proven useful in many research areas including ontology learning, relation extraction and question answering. However, current approaches -- mostly focused on lexicosyntactic patterns -- suffer from both low recall and precision, as definitional sentences occur in highly variable syntactic structures. In this paper, we propose Word-Class Lattices (WCLs), a generalization of word lattices that we use to model textual definitions. Lattices are learned from a dataset of definitions from Wikipedia. Our method is applied to the task of definition and hypernym extraction and compares favorably to other pattern generalization methods proposed in the literature.
Article
Full-text available
Terms represent the concepts of a domain, and understanding them gives access to the knowledge contained in specialised texts. Understanding the meaning of terms is therefore of great importance, not only so that researchers can share their studies and discoveries, but also so that professionals and students in various fields can make use of specialised information in study and work contexts. The rapid evolution of knowledge often does not allow the terminology created to designate concepts to be recorded in dictionaries with the necessary speed. This can represent a major challenge for those who need access to specialised knowledge. Given this context, this study starts from a review of approaches used for the automatic extraction of definitional features (TDs) and definitional contexts (CDs) and proposes the use of the Corpus Query Language (CQL) tool for extracting information that helps in understanding the terminology used in specialised texts. In particular, we verify the usefulness of search syntaxes built with CQL for this purpose by applying them to the Corpus COVID-19. The path presented in this study may help not only medical specialists but also translators, lexicographers and teachers to process the knowledge contained in specialised texts more quickly and accurately.
Conference Paper
Definition extraction is the task of automatically identifying definitional sentences in unstructured text. It is useful for ontology generation, relation extraction and question answering. Previous methods use handcrafted features generated from the dependency structure of a sentence; during this process only part of the dependency structure is used, causing information loss. We model definition extraction as a supervised sequence classification task and propose a new way to generate sentence features automatically using a Long Short-Term Memory (LSTM) neural network. Our method learns features directly from the raw sentence and the corresponding part-of-speech sequence, making full use of the whole sentence. We experiment on the Wikipedia benchmark dataset and obtain an F1 score of 91.2%, outperforming the current state-of-the-art methods by 5.8%. We also show the effectiveness of our method on other languages by testing on a Chinese dataset, obtaining an F1 score of 85.7%.
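A minimal sketch, assuming PyTorch, of an LSTM sentence classifier over word and part-of-speech indices; the dimensions, bidirectionality and data pipeline below are placeholders rather than the configuration reported in the paper:

# Sketch of an LSTM sentence classifier for definition detection.
import torch
import torch.nn as nn

class DefinitionLSTM(nn.Module):
    def __init__(self, vocab_size, pos_size, emb_dim=100, pos_dim=25, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.pos_emb = nn.Embedding(pos_size, pos_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim + pos_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)  # definition vs. non-definition

    def forward(self, words, pos_tags):
        x = torch.cat([self.word_emb(words), self.pos_emb(pos_tags)], dim=-1)
        _, (h, _) = self.lstm(x)
        sent = torch.cat([h[-2], h[-1]], dim=-1)  # final states, both directions
        return self.out(sent)

# Toy forward pass: batch of 2 sentences, 12 tokens each (already indexed).
model = DefinitionLSTM(vocab_size=5000, pos_size=50)
words = torch.randint(1, 5000, (2, 12))
pos = torch.randint(1, 50, (2, 12))
logits = model(words, pos)
print(logits.shape)  # torch.Size([2, 2])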
Article
Full-text available
Ontologies are used for representing information units that convey a shared semantic understanding of varied real-world situations. To systematize the terminological data of a domain, computational tools for term extraction are essential. This work presents an evaluation of statistical, linguistic and hybrid approaches to automatic term extraction for ontology construction. The evaluation was carried out against a reference list of terms from the Ecology domain, using precision and recall metrics. The OntoEco ontology covers three Ecology subdomains: ecosystems, populations and communities. For extracting the ontological lexical units, an Ecology corpus, the CórpusEco, was built. After the ontology was delineated into classes, subclasses and instances, the data were stored in the Protégé-2000 tool.
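A small sketch of the evaluation step described above, comparing an automatically extracted term list against a reference list using precision and recall; the term lists are invented placeholders, not the Ecology data used in the study:

# Precision and recall of extracted terms against a gold reference list.
def precision_recall(extracted, reference):
    extracted, reference = set(extracted), set(reference)
    tp = len(extracted & reference)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

extracted = ["ecosystem", "food chain", "biomass", "sampling error"]
reference = ["ecosystem", "food chain", "population density", "biomass"]
p, r = precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75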
Conference Paper
In this paper we address the problem of automatically constructing structured knowledge from plain text. In particular, we present a supervised learning technique that first identifies definitions in text data and then finds hypernym relations within them, making use of extracted syntactic structures. Instead of using pattern matching methods that rely on lexico-syntactic patterns, we propose a method that uses only the syntactic dependencies between terms extracted with a syntactic parser. Our assumption is that syntax is more robust than patterns when coping with the length and complexity of the texts. We transform the syntactic context of each noun into a coarse-grained textual representation, which is then fed into hyponym/hypernym-centered Support Vector Machine classifiers. The results on an annotated dataset of definitional sentences demonstrate the validity of our approach, which outperforms the current state of the art.
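A rough sketch of the general idea of feeding the syntactic contexts of nouns to a linear classifier, assuming spaCy (with the en_core_web_sm model installed) and scikit-learn; the noun_contexts helper, feature design and toy training data are illustrative, not the paper's:

# Represent each noun by its dependency relation and head lemma, then train a
# linear SVM on those coarse-grained contexts (toy example).
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

nlp = spacy.load("en_core_web_sm")

def noun_contexts(sentence):
    doc = nlp(sentence)
    return " ".join(f"{t.dep_}_{t.head.lemma_}"
                    for t in doc if t.pos_ in ("NOUN", "PROPN"))

sentences = [
    "A computer is a machine that processes data.",      # definitional
    "An enzyme is a protein that catalyses reactions.",  # definitional
    "The computer crashed twice yesterday.",             # non-definitional
    "She bought a new protein supplement.",              # non-definitional
]
labels = [1, 1, 0, 0]

vec = CountVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(noun_contexts(s) for s in sentences)
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform([noun_contexts("A gene is a unit of heredity.")])))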
Article
Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefits from identifying these synonymous names. We have developed a method to extract automatically synonymous gene and protein names from MEDLINE and journal articles. We first identified patterns authors use to list synonymous gene and protein names. We developed SGPE (for synonym extraction of gene and protein names), a software program that recognizes the patterns and extracts candidate synonymous terms from MEDLINE abstracts and full-text journal articles. SGPE then applies a sequence of filters that automatically screen out those terms that are not gene and protein names. We evaluated our method to have an overall precision of 71% on both MEDLINE and journal articles, and 90% precision on the more suitable full-text articles alone.
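An illustrative sketch of the pattern-plus-filter idea, with a single hand-written regular expression and a crude stopword filter standing in for SGPE's pattern set and filter sequence:

# One synonym-listing pattern of the kind described, e.g. "TNF (also known as
# cachectin)", followed by a trivial filter. Not the actual SGPE patterns.
import re

SYNONYM_PATTERN = re.compile(
    r"(?P<name>[A-Za-z0-9-]+)\s*\((?:also (?:known as|called)|a\.k\.a\.)\s+"
    r"(?P<synonym>[A-Za-z0-9 -]+?)\)"
)

STOPWORDS = {"protein", "gene", "factor"}  # crude filter for non-names

def extract_synonyms(text):
    pairs = []
    for m in SYNONYM_PATTERN.finditer(text):
        name, syn = m.group("name"), m.group("synonym").strip()
        if syn.lower() not in STOPWORDS:
            pairs.append((name, syn))
    return pairs

text = "TNF (also known as cachectin) activates NF-kB (also called p65)."
print(extract_synonyms(text))  # [('TNF', 'cachectin'), ('NF-kB', 'p65')]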
Article
Full-text available
One of the main goals of terminographic work is the identification of knowledge about the terms that appear in specialized texts. To compile dictionaries, glossaries or ontologies, terminographers usually look for definitions of the terms they intend to define. The search for definitions can be carried out on specialized corpora, where they normally appear in defining contexts, that is, in text fragments in which an author explicitly defines the term in question. There is currently growing interest in automating this process, based on searching for defining patterns in morphosyntactically annotated specialized corpora. In this article we present research focused on the automatic extraction of defining contexts. We present a methodology that includes three different automatic processes: the extraction of occurrences of defining patterns, the filtering of non-relevant contexts, and the identification of their constituent elements, that is, terms, definitions and pragmatic patterns. http://repositori.upf.edu/handle/10230/16965
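A minimal sketch of the three steps described above (matching defining patterns, filtering weak hits, and identifying the constituents: term, connector, definition); the patterns and filter are illustrative, not those of the cited work:

# Pattern matching with named groups, a simple filter, and constituent output.
import re

PATTERNS = [
    re.compile(r"^(?P<term>[A-Z][\w -]{1,60}?)\s+(?P<verb>is|are)\s+(?:a|an|the)\s+(?P<definition>.+)\.$"),
    re.compile(r"^(?P<term>[\w -]{1,60}?)\s+(?P<verb>is defined as)\s+(?P<definition>.+)\.$", re.I),
]

def extract_definitional_contexts(sentences):
    results = []
    for s in sentences:
        for pat in PATTERNS:
            m = pat.match(s)
            # Simple filter: discard anaphoric subjects that cannot be terms.
            if m and m.group("term").lower() not in {"it", "this", "that", "there"}:
                results.append({"term": m.group("term"),
                                "connector": m.group("verb"),
                                "definition": m.group("definition")})
                break
    return results

sents = ["Hypoxia is defined as a shortage of oxygen in body tissues.",
         "It is a big problem.",
         "An antigen is a substance that triggers an immune response."]
print(extract_definitional_contexts(sents))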
Conference Paper
We propose a novel machine learning approach to the task of identifying definitions in Polish documents. Specifics of the problem domain and characteristics of the available dataset have been taken into consideration, by carefully choosing and adapting a classification method to highly imbalanced and noisy data. We evaluate the performance of a Random Forest-based classifier in extracting definitional sentences from natural language text and give a comparison with previous work.
Conference Paper
Full-text available
In the domain of genomic research, understanding a specific gene name is a portal to most Information Retrieval (IR) and Information Extraction (IE) systems. In this paper we present an automatic method to extract a genomic glossary triggered by the initial gene name in a query. LocusLink gene names and MEDLINE abstracts are employed in our system, playing the roles of query triggers and genomic corpus respectively. The extracted glossary is evaluated through query expansion in the TREC 2003 Genomics Track ad hoc retrieval task, and the experimental results provide evidence that 90.15% recall can be achieved.
Article
Full-text available
Natural languages are full of collocations, recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and nontechnical genres. Several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. These techniques automatically produce large numbers of collocations along with statistical figures intended to reflect the relevance of the associations. However, none of these techniques provides functional information along with the collocation, and the results produced often contain improper word associations that reflect some spurious aspect of the training corpus rather than true collocations. In this paper, we describe a set of techniques based on statistical methods for retrieving and identifying collocations from large textual corpora. These techniques produce a wide range of collocations and are based on original filtering methods that allow the production of richer and higher-precision output. They have been implemented in a lexicographic tool, Xtract. The techniques are described and some results are presented on a 10 million-word corpus of stock market news reports. A lexicographic evaluation of Xtract as a collocation retrieval tool has been carried out, and its estimated precision is 80%.
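A sketch of the statistical core of such techniques: counting adjacent word pairs and ranking them by pointwise mutual information, with frequency and association thresholds as a simple filter. The toy corpus and thresholds are illustrative and do not reproduce the Xtract pipeline:

# Bigram counts plus PMI ranking over a toy corpus.
import math
from collections import Counter

corpus = ("the stock market rose sharply . the stock market fell . "
          "interest rates rose sharply . the market closed higher .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def pmi(pair):
    w1, w2 = pair
    p_pair = bigrams[pair] / (total - 1)
    return math.log2(p_pair / ((unigrams[w1] / total) * (unigrams[w2] / total)))

collocations = [(pair, bigrams[pair], pmi(pair))
                for pair in bigrams
                if bigrams[pair] >= 2 and pmi(pair) > 1.0]
for pair, freq, score in sorted(collocations, key=lambda x: -x[2]):
    print(pair, freq, round(score, 2))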
Article
Full-text available
This paper identifies some linguistic properties of technical terminology and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially the multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase. The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves effective regardless of the domain of the text to which it is applied.
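A sketch of the two ideas above, assuming NLTK's default English tagger: candidate terms are runs of adjectives and nouns, and multi-word candidates are kept only if they recur. The tag set and frequency threshold are arbitrary choices, not the published algorithm:

# Simple noun-phrase runs plus a repetition filter for term candidates.
# (Requires the NLTK tokenizer and tagger models to be downloaded.)
from collections import Counter
import nltk

NP_TAGS = {"JJ", "NN", "NNS", "NNP", "NNPS"}

def candidate_terms(sentences, min_freq=2):
    counts = Counter()
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        run = []
        for word, tag in tagged + [(".", ".")]:   # sentinel flushes the last run
            if tag in NP_TAGS:
                run.append(word.lower())
            else:
                if len(run) > 1:                  # multi-word candidates only
                    counts[" ".join(run)] += 1
                run = []
    return [t for t, c in counts.items() if c >= min_freq]

docs = ["The expert system uses a knowledge base.",
        "Each expert system maintains its own knowledge base.",
        "A knowledge base stores domain rules."]
print(candidate_terms(docs))  # likely ['expert system', 'knowledge base']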
Article
We consider the role of textual structures in medical texts. In particular, we examine the impact that the lack of recognition of text phenomena has on the validity of medical knowledge bases fed by a natural language understanding front-end. First, we review the results of an empirical study on a sample of medical texts considering various forms of local coherence phenomena (anaphora and textual ellipses). We then discuss the representation bias that is likely to emerge in the text knowledge base when these phenomena are not dealt with, mainly the emergence of referentially incoherent and invalid representations. Finally, we turn to a medical text understanding system designed to account for local text coherence.
Article
There are a wide variety of computer applications that deal with various aspects of medical language: concept representation, controlled vocabulary, natural language processing, and information retrieval. While technical and theoretical methods appear to differ, all approaches investigate different aspects of the same phenomenon: medical sublanguage. This paper surveys the properties of medical sublanguage from a formal perspective, based on detailed analyses cited in the literature. A review of several computer systems based on sublanguage approaches shows some of the difficulties in addressing the interaction between the syntactic and semantic aspects of sublanguage. A formalism called Conceptual Graph Grammar is presented that attempts to combine both syntax and semantics into a single notation by extending standard Conceptual Graph notation. Examples from the domain of pathology diagnoses are provided to illustrate the use of this formalism in medical language analysis. The strengths and weaknesses of the approach are then considered. Conceptual Graph Grammar is an attempt to synthesize the common properties of different approaches to sublanguage into a single formalism, and to begin to define a common foundation for language-related research in medical informatics.
Article
We recently conducted a study of a subset of the data collected in the NLM/AHCPR Large Scale Vocabulary Test (LSVT). We studied the 11,387 terms in the LSVT data that were narrower in meaning than the UMLS concepts they were mapped to. We hypothesized that when one term is narrower in meaning than another, the two terms are likely to differ primarily by modification. We compared three lexical processing methods of increasing sophistication and measured the ability of each to correctly identify the modifiers in the data set. The results indicate that, when using the most powerful of the methods, 63% of the term pairs were found to differ only by premodification, by postmodification or by both, 31% share some lexical material, and the remaining 6% have no lexical items in common. The implications of the study are discussed.
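A toy sketch of this kind of lexical comparison, classifying a term pair as differing by premodification, postmodification, both, or only partial overlap; the normalization steps used in the actual study are omitted:

# Classify how a narrower term differs lexically from the concept it maps to.
def modification_type(narrow_term, concept):
    narrow, core = narrow_term.lower().split(), concept.lower().split()
    n = len(core)
    for start in range(len(narrow) - n + 1):
        if narrow[start:start + n] == core:
            pre, post = start > 0, start + n < len(narrow)
            if pre and post:
                return "pre- and postmodification"
            if pre:
                return "premodification"
            if post:
                return "postmodification"
            return "identical"
    shared = set(narrow) & set(core)
    return "partial lexical overlap" if shared else "no lexical overlap"

print(modification_type("acute viral hepatitis", "hepatitis"))           # premodification
print(modification_type("hepatitis of newborn", "hepatitis"))            # postmodification
print(modification_type("chronic hepatitis B infection", "hepatitis B")) # pre- and postmodification
print(modification_type("liver disease", "hepatitis"))                   # no lexical overlap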
Article
Natural Language Processing (NLP) is a tool for transforming natural text into codable form. The success of NLP systems is contingent on a well-constructed semantic lexicon. However, the creation and maintenance of these lexicons is difficult, costly and time-consuming. The UMLS contains semantic and syntactic information on medical terms, which may be used to automate some of this task. Using UMLS resources, we have observed that it is possible to define one semantic type by its syntactic combinations with other types in a corpus of discharge summaries. These patterns of combination can then be used to classify words which are not in the lexicon. The technique was applied to a corpus for a single semantic type and generated a list of 875 words which matched the classification criteria for that type. The words were ranked by the number of patterns matched, and the top 95 words were correctly typed with 80% accuracy.
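A highly simplified sketch of the idea of typing unknown words by the combination patterns they share with words of a known semantic type; the types, patterns and counts below are invented for illustration only:

# Characterise a target type by its neighbouring-type patterns, then score
# untyped words by how many of those patterns they occur in.
from collections import Counter

# (left-neighbour type, right-neighbour type) contexts observed for words
# already known to be of type "Medication".
known_contexts = [("Dosage", "Route"), ("Dosage", "Frequency"),
                  ("Dosage", "Route"), ("None", "Dosage")]
medication_patterns = Counter(known_contexts)

# Contexts in which unknown (untyped) words were observed.
unknown_words = {
    "lisinopril": [("Dosage", "Route"), ("Dosage", "Frequency")],
    "yesterday":  [("None", "None")],
}

def score(contexts):
    return sum(medication_patterns[c] for c in contexts)

ranked = sorted(unknown_words, key=lambda w: -score(unknown_words[w]))
for w in ranked:
    print(w, score(unknown_words[w]))   # lisinopril 3, yesterday 0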
Article
Automatic part-of-speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. In this paper, we present a simple rule-based part-of-speech tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. The rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules, ease of finding and implementing improvements to the tagger, and better portability from one tag set, corpus genre or language to another. Perhaps the biggest contribution of this work is in demonstrating that the stochastic method is not the only viable method for part-of-speech tagging. The fact that a simple rule-based tagger that automatically learns its rules can perform so well should offer encouragement for researchers to further explore rule-based tagging, searching for a better and more expressive set of rule templates and other variations on the simple but effective theme described below.
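A sketch of the core of transformation-based tagging: assign each word its most frequent tag, then apply contextual rules that patch systematic errors. The lexicon and single rule are illustrative, not a learned rule list:

# Initial most-frequent-tag assignment followed by one contextual rule.
MOST_FREQUENT_TAG = {"they": "PRP", "expect": "VBP", "to": "TO", "race": "NN"}

# Rule format: change tag FROM to TO when the previous tag is PREV.
RULES = [("NN", "VB", "TO")]   # e.g. "to race" -> tag "race" as a verb

def tag(words):
    tags = [MOST_FREQUENT_TAG.get(w.lower(), "NN") for w in words]
    for frm, to, prev in RULES:
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return list(zip(words, tags))

print(tag("they expect to race".split()))
# [('they', 'PRP'), ('expect', 'VBP'), ('to', 'TO'), ('race', 'VB')]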
McKeown KR, Chang S-F, Cimino JJ, Starren J, Klavans JL. PERSIVAL system. Available from: URL: http://www.cs.columbia.edu/diglib/Persival/