Sanda M. Harabagiu

University of Texas at Dallas, Richardson, Texas, United States

Publications (131) · 21.56 Total impact

  • Cosmin A Bejan, Sanda M. Harabagiu
    ABSTRACT: The task of event coreference resolution plays a critical role in many natural language processing applications such as information extraction, question answering, and topic detection and tracking. In this article, we describe a new class of unsupervised, nonparametric Bayesian models with the purpose of probabilistically inferring coreference clusters of event mentions from a collection of unlabeled documents. In order to infer these clusters, we automatically extract various lexical, syntactic, and semantic features for each event mention from the document collection. Extracting a rich set of features for each event mention allows us to cast event coreference resolution as the task of grouping together the mentions that share the same features (they have the same participating entities, share the same location, happen at the same time, etc.). Some of the most important challenges posed by the resolution of event coreference in an unsupervised way stem from (a) the choice of representing event mentions through a rich set of features and (b) the ability to model events described both within the same document and across multiple documents. Our first unsupervised model that addresses these challenges is a generalization of the hierarchical Dirichlet process. This new extension presents the hierarchical Dirichlet process’s ability to capture the uncertainty regarding the number of clustering components and, additionally, takes into account any finite number of features associated with each event mention. Furthermore, to overcome some of the limitations of this extension, we devised a new hybrid model, which combines an infinite latent class model with a discrete time series model. The main advantage of this hybrid model lies in its capability to automatically infer the number of features associated with each event mention from data and, at the same time, to perform an automatic selection of the most informative features for the task of event coreference. The evaluation performed for solving both within- and cross-document event coreference shows significant improvements of these models when compared against two baselines for this task.
    Computational Linguistics 06/2014; 40(2). · 0.94 Impact Factor
  • Travis Goodwin, Sanda Harabagiu
    Language Resources and Evaluation Conference (LREC); 05/2014
  • Travis Goodwin, Sanda M. Harabagiu
    ABSTRACT: The introduction of electronic medical records (EMRs) enabled access to unprecedented volumes of clinical data, both in structured and unstructured formats. A significant amount of this clinical data is expressed within the narrative portion of the EMRs, requiring natural language processing techniques to unlock the medical knowledge referred to by physicians. This knowledge, derived from the practice of medical care, complements medical knowledge already encoded in various structured biomedical ontologies. Moreover, the clinical knowledge derived from EMRs also exhibits relational information between medical concepts, derived from the cohesion property of clinical text, which is an attractive attribute that is currently missing from the vast biomedical knowledge bases. In this paper, we describe an automatic method of generating a graph of clinically related medical concepts by considering the belief values associated with those concepts. The belief value is an expression of the clinician's assertion that the concept is qualified as present, absent, suggested, hypothetical, ongoing, etc. Because the method detailed in this paper takes into account the hedging used by physicians when authoring EMRs, the resulting graph encodes qualified medical knowledge wherein each medical concept has an associated assertion (or belief value) and such qualified medical concepts are spanned by relations of different strengths, derived from the clinical contexts in which concepts are used. In this paper, we discuss the construction of a qualified medical knowledge graph (QMKG) and treat it as a Big Data problem addressed by using MapReduce for deriving the weighted edges of the graph. To be able to assess the value of the QMKG, we demonstrate its usage for retrieving patient cohorts by enabling query expansion that produces greatly enhanced results against state-of-the-art methods.
    International Journal of Semantic Computing 05/2014; 07(04).
  • Kirk Roberts, Bryan Rink, Sanda M Harabagiu
    ABSTRACT: OBJECTIVE: To provide a natural language processing method for the automatic recognition of events, temporal expressions, and temporal relations in clinical records. MATERIALS AND METHODS: A combination of supervised, unsupervised, and rule-based methods was used. Supervised methods include conditional random fields and support vector machines. A flexible automated feature selection technique was used to select the best subset of features for each supervised task. Unsupervised methods include Brown clustering on several corpora, which results in our method being considered semisupervised. RESULTS: On the 2012 Informatics for Integrating Biology and the Bedside (i2b2) shared task data, we achieved an overall event F1-measure of 0.8045, an overall temporal expression F1-measure of 0.6154, an overall temporal link detection F1-measure of 0.5594, and an end-to-end temporal link detection F1-measure of 0.5258. The most competitive system was our event recognition method, which ranked third out of the 14 participants in the event task. DISCUSSION: Analysis reveals the event recognition method has difficulty determining which modifiers to include/exclude in the event span. The temporal expression recognition method requires significantly more normalization rules, although many of these rules apply only to a small number of cases. Finally, the temporal relation recognition method requires more advanced medical knowledge and could be improved by separating the single discourse relation classifier into multiple, more targeted component classifiers. CONCLUSIONS: Recognizing events and temporal expressions can be achieved accurately by combining supervised and unsupervised methods, even when only minimal medical knowledge is available. Temporal normalization and temporal relation recognition, however, are far more dependent on the modeling of medical knowledge.
    Journal of the American Medical Informatics Association 05/2013; · 3.57 Impact Factor
  • ABSTRACT: Radiology reports often contain findings about the condition of a patient which should be acted upon quickly. These actionable findings in a radiology report can be automatically detected to ensure that the referring physician is notified about such findings and to provide feedback to the radiologist that further action has been taken. In this paper we investigate a method for detecting actionable findings of appendicitis in radiology reports. The method identifies both individual assertions regarding the presence of appendicitis and other findings related to appendicitis using syntactic dependency patterns. All relevant individual statements from a report are collectively considered to determine whether the report is consistent with appendicitis. Evaluation on a corpus of 400 radiology reports annotated by two expert radiologists showed that our approach achieves a precision of 91%, a recall of 83%, and an F1-measure of 87%.
    AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science. 01/2013; 2013:221.
  • Travis Goodwin, Sanda M Harabagiu
    Information Access Evaluation. Multilinguality, Multimodality, and Visualization, 01/2013: pages 155-166; Springer.
  • Travis Goodwin, Sanda M. Harabagiu
    ABSTRACT: An extraordinary amount of clinical information is available within Electronic Medical Records. However, interpreting this knowledge typically demands a significant level of clinical understanding. This can be facilitated by access to structured knowledge bases. However, even if vast, biomedical knowledge bases have very limited relational information available. In contrast, clinical text expresses many relations between concepts using an extraordinary amount of variation regarding the author's belief state - whether a medical concept is present, uncertain, or absent. In this paper, we propose a method for automatically constructing a graph of clinically related concepts based on their belief state. For this purpose, we first devise a method for classifying the belief state of certain medical concepts. Second, we design a technique for constructing a graph of related medical concepts qualified by the physician's belief value. Third, we demonstrate several techniques for inferring the similarity between qualified medical concepts, and present a generalized algorithm for determining the second-order similarity between qualified medical concepts. Finally, we show that incorporating the knowledge encoded in this graph yields competitive results when applied to query expansion for the retrieval of hospital patient cohorts.
    Semantic Computing (ICSC), 2013 IEEE Seventh International Conference on; 01/2013
  • Travis Goodwin, Kirk Roberts, Sanda Harabagiu
    Text REtrieval Conference (TREC), Gaithersburg, Maryland USA; 11/2012
  • Kirk Roberts, Sanda M. Harabagiu
    ABSTRACT: Spatial queries in the form of natural language questions have typically been assumed to have unconstrained geographic answers. However, analysis of prototypical spatial questions reveals two important types of constraints that must be considered by spatial question answering systems. First, locational relativity constraints limit answers to a particular location or the user's implied location. Second, domain constraints specify non-geographic locations such as web pages or anatomical sites. In order to detect these constraints, we have conducted a crowd-sourced annotation effort for a set of over 1,200 questions gathered from a community question answering website. We utilize machine learning techniques trained on this data to automatically classify these two types of constraints. We report results nearing 90% accuracy at locational relativity detection and 76% accuracy at domain classification using this approach.
    Proceedings of the 20th International Conference on Advances in Geographic Information Systems; 11/2012
  • Travis Goodwin, Bryan Rink, Kirk Roberts, Sanda M. Harabagiu
    ABSTRACT: The Choice of Plausible Alternatives (COPA) task in SemEval-2012 presents a series of forced-choice questions wherein each question provides a premise and two viable cause or effect scenarios. The correct answer is the cause or effect that is the most plausible. This paper describes the COPACETIC system developed by the University of Texas at Dallas (UTD) for this task. We approach this task by casting it as a classification problem and using features derived from bigram co-occurrences, TimeML temporal links between events, single-word polarities from the Harvard General Inquirer, and causal syntactic dependency structures within the gigaword corpus. Additionally, we show that although each of these components improves our score for this evaluation, the difference in accuracy between using all of these features and using bigram co-occurrence information alone is not statistically significant.
    Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation; 06/2012
  • Kirk Roberts, Sanda M. Harabagiu
    ABSTRACT: We present a joint approach for recognizing spatial roles in SemEval-2012 Task 3. Candidate spatial relations, in the form of triples, are heuristically extracted from sentences with high recall. The joint classification of spatial roles is then cast as a binary classification over the candidates. This joint approach allows for a rich feature set based on the complete relation instead of individual relation arguments. Our best official submission achieves an F1-measure of 0.573 on relation recognition, best in the task and outperforming the previous best result on the same data set (0.500).
    Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation; 06/2012
  • Bryan Rink, Sanda Harabagiu
    ABSTRACT: In this paper we present our approach for assigning degrees of relational similarity to pairs of words in the SemEval-2012 Task 2. To measure relational similarity we employed lexical patterns that can match against word pairs within a large corpus of 12 million documents. Patterns are weighted by obtaining statistically estimated lower bounds on their precision for extracting word pairs from a given relation. Finally, word pairs are ranked based on a model predicting the probability that they belong to the relation of interest. This approach achieved the best results on the SemEval 2012 Task 2, obtaining a Spearman correlation of 0.229 and an accuracy on reproducing human answers to MaxDiff questions of 39.4%.
    Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation; 06/2012
  • Kirk Roberts, Travis Goodwin, Sanda M Harabagiu
    ABSTRACT: A significant amount of spatial information in textual documents is hidden within the relationship between events. While humans have an intuitive understanding of these relationships that allows us to recover an object's or event's location, currently no annotated data exists to allow automatic discovery of spatial containment relations between events. We present our process for building such a corpus of manually annotated spatial relations between events. Events form complex predicate-argument structures that model the participants in the event, their roles, as well as the temporal and spatial grounding. In addition, events are not presented in isolation in text; there are explicit and implicit interactions between events that often participate in event structures. In this paper, we focus on five spatial containment relations that may exist between events: (1) SAME, (2) CONTAINS, (3) OVERLAPS, (4) NEAR, and (5) DIFFERENT. Using the transitive closure across these spatial relations, the implicit location of many events and their participants can be discovered. We discuss our annotation schema for spatial containment relations, placing it within the pre-existing theories of spatial representation. We also discuss our annotation guidelines for maintaining annotation quality as well as our process for augmenting SpatialML with spatial containment relations between events. Additionally, we outline some baseline experiments to evaluate the feasibility of developing supervised systems based on this corpus. These results indicate that although the task is challenging, automated methods are capable of discovering spatial containment relations between events.
    05/2012;
  • Bryan Rink, Kirk Roberts, Sanda M Harabagiu
    ABSTRACT: A method for the automatic resolution of coreference between medical concepts in clinical records. A multiple pass sieve approach utilizing support vector machines (SVMs) at each pass was used to resolve coreference. Information such as lexical similarity, recency of a concept mention, synonymy based on Wikipedia redirects, and local lexical context were used to inform the method. Results were evaluated using an unweighted average of MUC, CEAF, and B³ coreference evaluation metrics. The datasets used in these research experiments were made available through the 2011 i2b2/VA Shared Task on Coreference. The method achieved an average F score of 0.821 on the ODIE dataset, with a precision of 0.802 and a recall of 0.845. These results compare favorably to the best-performing system with a reported F score of 0.827 on the dataset and the median system F score of 0.800 among the eight teams that participated in the 2011 i2b2/VA Shared Task on Coreference. On the i2b2 dataset, the method achieved an average F score of 0.906, with a precision of 0.895 and a recall of 0.918 compared to the best F score of 0.915 and the median of 0.859 among the 16 participating teams. Post hoc analysis revealed significant performance degradation on pathology reports. The pathology reports were characterized by complex synonymy and very few patient mentions. The use of several simple lexical matching methods had the most impact on achieving competitive performance on the task of coreference resolution. Moreover, the ability to detect patients in electronic medical records helped to improve coreference resolution more than other linguistic analysis.
    Journal of the American Medical Informatics Association 05/2012; 19(5):875-82. · 3.57 Impact Factor
  • Kirk Roberts, Sanda M. Harabagiu
    ABSTRACT: Recognizing new and emerging events in a stream of news documents requires understanding the semantic structure of news reported in natural language. New event detection (NED) is the task of recognizing when a news document discusses a completely novel event. To be successful at this task, we believe a NED method must extract and represent four principal components of an event: its type, participants, temporal, and spatial properties. These components must then be compared in a semantically robust manner to detect novelty. We further propose event centrality, a method for determining the most important participants in an event. Our NED methods produce a 29% cost reduction over a bag-of-words baseline and a 17% cost reduction over an existing state-of-the-art approach. Additionally, we discuss our method for recognizing emerging events: the tracking and categorization of unexpected or novel events.
    International Journal of Semantic Computing 04/2012; 05(04).
  • Kirk Roberts, Sanda M Harabagiu
    ABSTRACT: In this paper we report on the approaches that we developed for the 2011 i2b2 Shared Task on Sentiment Analysis of Suicide Notes. We have cast the problem of detecting emotions in suicide notes as a supervised multi-label classification problem. Our classifiers use a variety of features based on (a) lexical indicators, (b) topic scores, and (c) similarity measures. Our best submission has a precision of 0.551, a recall of 0.485, and an F-measure of 0.516.
    Biomedical Informatics Insights 01/2012; 5(Suppl. 1):195-204.
  • ABSTRACT: Recognizing the anatomical location of actionable findings in radiology reports is an important part of the communication of critical test results between caregivers. One of the difficulties of identifying anatomical locations of actionable findings stems from the fact that anatomical locations are not always stated in a simple, easy-to-identify manner. Natural language processing techniques are capable of recognizing the relevant anatomical location by processing a diverse set of lexical and syntactic contexts that correspond to the various ways that radiologists represent spatial relations. We report a precision of 86.2%, recall of 85.9%, and F1-measure of 86.0% for extracting the anatomical site of an actionable finding. Additionally, we report a precision of 73.8%, recall of 69.8%, and F1-measure of 71.8% for extracting an additional anatomical site that grounds underspecified locations. This demonstrates promising results for identifying locations, while error analysis reveals challenges under certain contexts. Future work will focus on incorporating new forms of medical language processing to improve performance and transitioning our method to new types of clinical data.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:779-88.
  • ABSTRACT: This paper describes the system created by the University of Texas at Dallas for content-based medical record retrieval submitted to the TREC 2011 Medical Records Track. Our system builds a query by extracting keywords from a given topic using a Wikipedia-based approach; we use regular expressions to extract age, gender, and negation requirements. Each query is then expanded by relying on UMLS, SNOMED, Wikipedia, and PubMed co-occurrence data for retrieval. Four runs were submitted: two based on Lucene with varying scoring methods, and two based on a hybrid approach with varying negation detection techniques. Our highest scoring submission achieved a MAP score of 40.8.
    Text REtrieval Conference (TREC), Gaithersburg, Maryland USA; 11/2011
  • K. Roberts, S.M. Harabagiu
    ABSTRACT: Recognizing new and emerging events in a stream of news documents requires understanding the semantic structure of news reported in natural language. New event detection (NED) is the task of recognizing when a news document discusses a completely novel event. To be successful at this task, we argue a NED method must extract and represent the type of event and its participants as well as the temporal and spatial properties of the event. Our NED methods produce a 25% cost reduction over a bag-of-words baseline and a 13% cost reduction over an existing state-of-the-art approach. Additionally, we discuss our method for recognizing emerging events: the tracking and categorization of unexpected or novel events.
    Semantic Computing (ICSC), 2011 Fifth IEEE International Conference on; 10/2011
  • Bryan Rink, Sanda Harabagiu, Kirk Roberts
    ABSTRACT: A supervised machine learning approach to discover relations between medical problems, treatments, and tests mentioned in electronic medical records. A single support vector machine classifier was used to identify relations between concepts and to assign their semantic type. Several resources such as Wikipedia, WordNet, General Inquirer, and a relation similarity metric inform the classifier. The techniques reported in this paper were evaluated in the 2010 i2b2 Challenge and obtained the highest F1 score for the relation extraction task. When gold standard data for concepts and assertions were available, F1 was 73.7, precision was 72.0, and recall was 75.3. F1 is defined as 2*Precision*Recall/(Precision+Recall). Alternatively, when concepts and assertions were discovered automatically, F1 was 48.4, precision was 57.6, and recall was 41.7. Although a rich set of features was developed for the classifiers presented in this paper, little knowledge mining was performed from medical ontologies such as those found in UMLS. Future studies should incorporate features extracted from such knowledge sources, which we expect to further improve the results. Moreover, each relation discovery was treated independently. Joint classification of relations may further improve the quality of results. Also, joint learning of the discovery of concepts, assertions, and relations may also improve the results of automatic relation extraction. Lexical and contextual features proved to be very important in relation extraction from medical texts. When they are not available to the classifier, the F1 score decreases by 3.7%. In addition, features based on similarity contribute to a decrease of 1.1% when they are not available.
    Journal of the American Medical Informatics Association 09/2011; 18(5):594-600. · 3.57 Impact Factor

Publication Stats

2k Citations
21.56 Total Impact Points

Institutions

  • 2002–2013
    • University of Texas at Dallas
      • Department of Computer Science
      Richardson, Texas, United States
    • The University of Sheffield
      Sheffield, England, United Kingdom
  • 2001–2003
    • University of Texas at Austin
      • Department of Computer Science
      Austin, Texas, United States
  • 1998–2001
    • Southern Methodist University
      • Department of Computer Science and Engineering
      Dallas, Texas, United States
  • 1996–1998
    • University of Southern California
      • Department of Electrical Engineering
      Los Angeles, California, United States