Judith L. Klavans

  • Ph.D. Linguistics
  • University of Maryland, College Park

About

159
Publications
17,928
Reads
3,751
Citations
Current institution
University of Maryland, College Park
Additional affiliations
January 2005 - May 2016
University of Maryland, College Park
Position
  • Senior Researcher

Publications (159)
Article
Full-text available
This study demonstrates an improved conceptual foundation to support well-structured analysis of image topicality. First we present a conceptual framework for analyzing image topicality, explicating the layers, the perspectives, and the topical relevance relationships involved in modeling the topicality of art images. We adapt a generic relevance t...
Technical Report
A new opportunity to explore and leverage the power of computational linguistic methods and analysis in ensuring effective Cybersecurity is presented. This White Paper discusses some of the specific emerging research opportunities, covering human language technologies such as language identification, topic modeling, and information extraction for k...
Article
In recent years, cultural heritage institutions have increasingly used social tagging. To better understand the nature of these tags, we analyzed tags assigned to a collection of 100 images of art (provided by the steve.museum project) using subject matter categorization. Our results show that the majority of tags describe the people and objects in...
Article
Specialized medical ontologies and terminologies, such as SNOMED CT and the Unified Medical Language System (UMLS), have been successfully leveraged in medical information systems to provide a standard web-accessible medium for interoperability, access, and reuse. However, these clinically oriented terminologies and ontologies cannot provide suffic...
Article
Keeping up with rapidly growing research fields, especially when there are multiple interdisciplinary sources, requires substantial effort for researchers, program managers, or venture capital investors. Current theories and tools are directed at finding a paper or website, not gaining an understanding of the key papers, authors, controversies, and...
Article
Information retrieval (IR) involves retrieving information from stored data, through user queries or pre-formulated user profiles. The information can be in any format. IR typically advances over four broad stages viz., identification of text types, document preprocessing, document indexing, and query processing and matching the same to documents....
Conference Paper
Full-text available
Action Science Explorer (ASE) is a tool designed to support users in rapidly generating readily consumable summaries of academic literature. It uses citation network visualization, ranking and filtering papers by network statistics, and automatic clustering and summarization techniques. We describe how early formative evaluations of ASE led to a ma...
Article
Full-text available
Action Science Explorer (ASE) is a tool designed to support users in rapidly generating readily consumable summaries of academic literature. The authors describe ASE and report on how early formative evaluations led to a mature system evaluation, consisting of an in-depth empirical evaluation with 4 domain expert participants. The user study tasks...
Article
Full-text available
This paper reports on the linguistic analysis of a tag set of nearly 50,000 tags collected as part of the steve.museum project. The tags describe images of objects in museum collections. We present our results on morphological, part of speech and semantic analysis. We demonstrate that deeper tag processing provides valuable information for organizi...
Article
Full-text available
Keeping up with rapidly growing research fields, especially when there are multiple interdisciplinary sources, requires substantial effort for researchers, program managers, or venture capital investors. Current theories and tools are directed at finding a paper or website, not gaining an understanding of the key papers, authors, controversies, and...
Conference Paper
Full-text available
This paper reports on the linguistic analysis of a tag set of nearly 50,000 tags collected as part of the steve.museum project. The tags describe images of objects in museum collections. We present our results on morphological, part of speech and semantic analysis. We demonstrate that deeper tag processing provides valuable information for organizi...
Article
Full-text available
In this paper, we present our results on comparing the language of social tags with text-mined terms for images. We have developed a novel modification of the standard term frequency/inverse document frequency metric (tf*idf) (Salton & Buckley 1988) over tags and terms to identify and filter terms which discriminate images for searchers. Since tags...
Conference Paper
Full-text available
In this paper, we report on a new project, "T3: Text, Tags, and Trust to Improve Image Access for Museums and Libraries", the goal of which is to improve access to digital image collections in museums and libraries for art historians, museum professionals, and the general public. T3 combines text mining, social tagging, and trust inferencing...
Article
Full-text available
In this paper we extend existing sense-making models by adding detail on types of conceptual change and cognitive mechanisms taken from theories of cognition and learning. Our extended model aims to offer a more complete picture of the cognitive processes of sense-making, including the underlying cognitive mechanisms and different types of conceptu...
Article
In this paper, we present a system using computational linguistic techniques to extract metadata for image access. We discuss the implementation, functionality and evaluation of an image catalogers’ toolkit, developed in the Computational Linguistics for Metadata Building (CLiMB) research project. We have tested components of the system, including...
Conference Paper
In this paper, we describe a downloadable text-mining tool for enhancing subject access to image collections in digital libraries.
Chapter
This chapter describes the progress of the Digital Government Research Center in tackling the challenges of integrating and accessing the massive amount of statistical and text data available from government agencies. In particular, we address the issues of database heterogeneity, size, distribution, and control of terminology. In this chapter we p...
Article
Full-text available
The CLiMB project investigates semi-automatic methods to extract descriptive metadata from texts for indexing digital image collections. We developed a set of functional semantic categories to classify text extracts that describe images. Each semantic category names a functional relation between an image depicting a work of art historical signific...
Article
Full-text available
In this paper, we present a fully implemented system using computational linguistic techniques to apply automatic text mining for the extraction of metadata for image access. We describe the implementation of a workbench created for, and evaluated by, image catalogers. We discuss the current functionality and future goals for this image catalogers'...
Conference Paper
Full-text available
This paper describes a formative evaluation of an integrated multilingual, multimedia information system, a series of user studies designed to guide system development. The system includes automatic speech recognition for English, Chinese, and Arabic, automatic translation from Chinese and Arabic into English, and query-based and profile-based se...
Book
This report is about the value of a specific area of planning and about how the United States might make improvements in that specific area. Geospatial data and tools are currently used for emergency response, but recent events have demonstrated the many ways in which our geospatial data and tools and the use we make of them fail us, both in prepar...
Article
Full-text available
The issue of sentence ordering is an important one for natural language tasks such as multi-document summarization, yet there has not been a quantitative exploration of the range of acceptable sentence orderings for short texts. We present results of a sentence reordering experiment with three experimental conditions. Our findings indicate a very h...
Conference Paper
Full-text available
This panel will explore the broad DELOS vision of digital libraries and its implications for research and teaching. What is envisioned is “a much broader system of highly interlinked information and services that will provide very rich functionality, supporting new ways of intellectual work, communication, and process execution in business, govern...
Article
We present the summarization system in the PErsonalized Retrieval and Summarization of Images, Video and Language (PERSIVAL) medical digital library. Although we discuss the context of our summarization research within the PERSIVAL platform, the primary focus of this article is on strategies to define and generate customized summaries. Our summariz...
Conference Paper
Full-text available
Digital government research spans disciplinary fields that are exceptionally diverse. Although digital government research emerged primarily from efforts to engage computer scientists in research that would improve the functions of government, digital government research today addresses a far broader set of issues including the adoption and use of...
Article
Full-text available
Columbia's Newsblaster tracking and summarization system is a robust system that clusters news into events, categorizes events into broad topics and summarizes multiple articles on each event. Here we outline our most current work on tracking events over days, producing summaries that update a user on new information about an event, outlining the p...
Article
Full-text available
We present the new multilingual version of the Columbia Newsblaster news summarization system. The system addresses the problem of user access to browsing news from multiple languages from multiple sites on the internet. The system automatically collects, organizes, and summarizes news in multiple source languages, allowing the user to browse new...
Article
Full-text available
Digital image collections in libraries and other curatorial institutions grow too rapidly to create new descriptive metadata for subject matter search or browsing. CLiMB (Computational Linguistics for Metadata Building) was a project designed to address this dilemma that involved computer scientists, linguists, librarians, and art librarians. The C...
Article
Full-text available
We present a relational learning framework for grammar induction that is able to learn meaning as well as syntax. We introduce a type of constraint-based grammar, lexicalized well-founded grammar (lwfg), and we prove that it can always be learned from a small set of semantically annotated examples, given a set of assumptions. The semantic repr...
Conference Paper
Full-text available
We present the new multilingual version of the Columbia Newsblaster news summarization system. The system addresses the problem of user access to browsing news from multiple languages from multiple sites on the internet. The system automatically collects, organizes, and summarizes news in multiple source languages, allowing the user to browse news...
Article
Full-text available
The task of creating indicative summaries that help a searcher decide whether to read a particular document is a difficult task. This paper examines the indicative summarization task from a generation perspective, by first analyzing its required content via published guidelines and corpus analysis. We show how these summaries can be factored into a...
Article
Full-text available
We present a system for the automatic extraction of salient information from email messages, thus providing the gist of their meaning. Dealing with email raises several challenges that we address in this paper: heterogeneous data in terms of length and topic. Our method combines shallow linguistic processing with machine learning to extract phrasal...
Article
Full-text available
If Natural Language Processing (NLP) systems are viewed as intelligent systems then we should be able to make use of verification and validation (V&V) approaches and methods that have been developed in the intelligent systems community.
Article
this paper is to present the entire multilingual Columbia Newsblaster system as a platform for multilingual multi-document summarization experiments. The phases in the multilingual version of Columbia Newsblaster have been modified to take language and character encoding into account, and a new phase, translation, has been added. Figure 1 depicts t...
Article
Full-text available
We have developed a multilingual version of Columbia Newsblaster as a testbed for multilingual multi-document summarization. The system collects, clusters, and summarizes news documents from sources all over the world daily. It crawls news sites in many different countries, written in different languages, extracts the news text from the HTML pages,...
Article
Full-text available
We present a statistical similarity measuring and clustering tool, SIMFINDER, that organizes small pieces of text from one or multiple documents into tight clusters. By placing highly related text units in the same cluster, SIMFINDER enables a subsequent content selection/generation component to reduce each cluster to a single sentence, either by e...
Article
Full-text available
Recently, there have been significant advances in several areas of language technology, including clustering, text categorization, and summarization. However, efforts to combine technology from these areas in a practical system for information access have been limited. In this paper, we present Columbia's Newsblaster system for online news summariz...
Article
Full-text available
Domain specific texts often have implicit rules on content and organization. We introduce a novel method for synthesizing this topical structure. The system uses corpus examples and recursively merges their topics to build a hierarchical tree. A subjective cross domain evaluation showed that the system performed well in combining related topics and...
Article
Full-text available
unmarked context. The values derived from the corpus are variable, that is, values of degree. Linguistic tests were automatically applied to the analyzed corpora in order to determine the initial aspect value for stativity, for a set of frequent verbs representing more than 90% of the verb occurrences in a...
Article
Full-text available
This paper presents the results of an experiment using machine-readable dictionaries (MRDs) and corpora for building concatenative units for text to speech (TTS) systems. Theoretical questions concerning the nature of phonemic data in dictionaries are raised; phonemic dictionary data is viewed as a representative corpus over which to extract n-gr...
Conference Paper
Full-text available
We describe an interactive system, built within the context of the CLiMB project, which permits a user to locate the occurrences of named entities within a given text. The named entity tool was developed to identify references to a single art object (e.g. a particular building) with high precision in text related to images of that object in a digital c...
Conference Paper
Full-text available
An obstacle to understanding results across heterogeneous databases is the ability to determine conceptual connections between differing terminologies. In this paper, we present the two-step approach which we have used to build a terminological database in order to address this issue. First we automatically built a heterogeneous collectio...
Conference Paper
Full-text available
Metadata descriptions of database contents are required to build and use systems that access and deliver data in response to user requests. When numerous heterogeneous databases are brought together in a single system, their various metadata formalizations must be homogenized and integrated in order to support the access planning and delivery syste...
Article
Full-text available
This paper addresses the problem of developing methods to be used in the identification and extraction of meaningful semantic components from large online glossaries. We present two sets of results. First, we report on the algorithm, ParseGloss, which was used to analyze definitions, and extract the main concept, or genus phrase. We ran the system...
Article
Full-text available
We present a new composite similarity metric that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units. Several potential features are investigated and an optimal combination is selected via machine learning. We discuss a more restrictive definition of similarity than traditio...
Article
Full-text available
We present a new method for discovering a segmental discourse structure of a document while categorizing each segment's function and importance. Segments are determined by a zero-sum weighting scheme, used on occurrences of noun phrases and pronominal forms retrieved from the document. Segment roles are then calculated from the distribution of the...
Article
This paper describes the evaluation of a new automated text summarization system, Centrifuser. This system provides information to patients and families relevant to their specific health questions. Centrifuser accepts queries about health conditions, and produces a summary of information from articles retrieved by a standard search engine that is t...
Conference Paper
Full-text available
A current application of automatic text summarization is to provide an overview of relevant documents coming from an information retrieval (IR) system. This paper examines how Centrifuser, one such summarization system, was designed with respect to methods used in the library community. We have reviewed these librarian expert techniques to assist i...
Article
language itself. He argues that stochastic grammars play an important role in the handling of a number of theoretical linguistic issues. The principal emphasis of the paper is on syntactic issues, with little description given of the incorporation of semantic and pragmatic factors into a statistical system, or of their interaction with syntax. In...
Article
Full-text available
We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well-covered by other summary corpora, and motivate why they constitute an important...
Article
Full-text available
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part of speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribu...
Article
Full-text available
We present an evaluation of domain-independent natural language tools for use in the identification of significant concepts in documents. Using qualitative evaluation, we compare three shallow processing methods for extracting index terms, i.e., terms that can be used to model the content of documents. We focus on two criteria: quality and coverage....
Article
Full-text available
This paper describes the comparative evaluation of an experimental automated text summarization system, Centrifuser and three conventional search engines - Google, Yahoo and About.com. Centrifuser provides information to patients and families relevant to their questions about specific health conditions. It then produces a multidocument summary of a...
Conference Paper
The Digital Government Research Center (DGRC) has completed phase one of the Energy Data Collection (EDC) project. In this paper, we present the results of building and evaluating system components, along with plans for phase two of the project. Phase one focused on data about petroleum products' prices and volumes, provided by the Energy Informati...
Article
Full-text available
This paper describes a method toward automatically building dictionaries from text. We present DEFINDER, a rule-based system for extraction of definitions from on-line consumer-oriented medical articles. We provide an extensive evaluation on three dimensions: i) performance of the definition extraction technique in terms of precision and recall, ii...
Article
Full-text available
In this paper, we propose the use of multidocument summarization as a post-processing step in document retrieval. We examine the use of the summary as a replacement to the standard ranked list. The form of the summary is novel because it has both informative and indicative elements, designed to help different users perform their tasks better. Our summ...
Preprint
The task of creating indicative summaries that help a searcher decide whether to read a particular document is a difficult task. This paper examines the indicative summarization task from a generation perspective, by first analyzing its required content via published guidelines and corpus analysis. We show how these summaries can be factored into a...
Article
Full-text available
This paper shows that linguistic techniques along with machine learning can extract high quality noun phrases for the purpose of providing the gist or summary of email messages. We describe a set of comparative experiments using several machine learning algorithms for the task of salient noun phrase extraction. Three main conclusions can be drawn f...
Article
Full-text available
In healthcare settings, patients need access to online information that can help them understand their medical situation. Physicians need information that is clinically relevant to an individual patient. In this paper, we present our progress...
Article
Full-text available
The massive amount of statistical and text data available from government agencies has created a set of daunting challenges to both the research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center (www.dgrc.org) we are investigating solutions to th...
Article
Full-text available
Using technology developed at the Digital Government Research Center, a team of researchers is seeking to make government statistical data more accessible through the Internet. In collaboration with government experts, they are conducting research into advanced information systems, developing standards, interfaces and a shared infrastructure, and b...
Article
In this paper we present a quantitative and qualitative evaluation of DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. The quantitative evaluation shows that in terms of precision and recall as measured against human performance, DEFINDER obtained 87% and 75% re...
Article
Full-text available
The massive amount of statistical and text data available from government agencies has created a set of daunting challenges to both research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center we are investigating solutions to these key problems. I...
Conference Paper
Full-text available
In this paper we present DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. This research is part of the Digital Library Project at Columbia University, entitled PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video and Language resources) [5]. One goal...
Conference Paper
The potential of automatically generated indexes for information access has been recognized for several decades (e.g., Bush 1945 [2], Edmundson and Wyllys 1961 [4]), but the quantity of text and the ambiguity of natural language processing have made progress at this task more difficult than was originally foreseen. Recently, a body of work on deve...
Conference Paper
Full-text available
We present a system which extracts the genus word and phrase from free-form definition text, entitled LEXING, for Lexical Information from Glossaries. The extractions will be used to build automatically a lexical knowledge base from on-line domain specific glossary sources. We combine statistical and semantic processes to extract these terms, and...
Conference Paper
Full-text available
The needs of society have long been addressed through government research support for new technologies, the Internet representing one example. Today, under the rubric of digital government, federal agencies as well as state and local units of governments at all levels have begun to leverage the fruits of these research investments to better serve t...
Conference Paper
Full-text available
In healthcare settings, patients need access to online information that can help them understand their medical situation. Physicians need information that is clinically relevant to an individual patient. In this paper, we present our progress on developing a system, PERSIVAL, that is designed to provide personalized access to a distributed patient...
Article
vote a whole (and extra-long) issue of Elsnews to the topics covered at LREC, and indeed to the conference itself. One of the key questions in Evaluation is: how far can the evaluation-driven methodology, which has proved so fruitful in the field of speech recognition, be generalized to other areas of language and speech technology? After a scene-s...
Article
Full-text available
By synthesizing information common to retrieved documents, multi-document summarization can help users of information retrieval systems to find relevant documents with a minimal amount of reading. We are developing a multidocument summarization system to automatically generate a concise summary by identifying and synthesizing similarities across a...
Article
Full-text available
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the pr...
Conference Paper
This paper provides both an overview of the Digital Library Research Program at Columbia University, along with some detail on three selected projects. First, a review of the interdisciplinary approach to Digital Library Research from the point of view of the Center for Research on Information Access will be presented. Second, technical aspects wil...
Conference Paper
Full-text available
We present a linguistically-motivated technique for the recognition and grouping of simplex noun phrases (SNPs) called LinkIT. Our system has two key features: (1) we efficiently gather minimal NPs, i.e. SNPs, as precisely and linguistically defined and motivated in our paper; (2) we apply a refined set of postprocessing rules to these SNPs to link...
Conference Paper
Full-text available
The massive amount of statistical and text data available from government agencies has created a set of daunting challenges to both research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center we are investigating solutions to these key problems. I...
Conference Paper
Full-text available
Government at all levels is a major collector and provider of data and user of information technologies. The goal of the Digital Government Program is to fund research at the intersection of the computer and information sciences research communities ...
Article
Full-text available
The massive amount of statistical and text data available from Federal Agencies has created a set of daunting challenges to both research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. We are investigating solutions to three key problems, namely, (1) ontological mappings for terminolo...
Article
Full-text available
Evaluation of natural language processing tools and systems must focus on two complementary aspects: first, evaluation of the accuracy of the output, and second, evaluation of the functionality of the output as embedded in an application. This paper presents evaluations of two aspects of LinkIT, a tool for noun phrase identification...
Article
Full-text available
ves sufficient; statistical approaches are robust and achieve broad coverage but lack the insightful domain knowledge that symbolic methods can provide. In addition, we now have available cheap and powerful computing hardware, large amounts of online corpora, and high quality online dictionaries and thesauri, all of which make empirical statistica...
Article
This paper presents a new similarity metric which measures distance between pairs of small textual units using a set of linguistic indicators and a machine learning algorithm. We define and motivate a more restrictive definition of similarity than traditional definitions. We evaluate the performance of our system by comparing to TF*IDF, and show th...
Article
o approaches may have been viewed as contradictory by many researchers, yet the claim to the contrary does not seem very controversial at the current time. Exploration into the type of symbolic and statistical models which can complement each other is however extremely worthwhile as the NLP community still probes for answers in this direction. All...
Article
Full-text available
We report on two corpora to be used in the evaluation of component systems for the tasks of (1) linear segmentation of text and (2) summary-directed sentence extraction. We present characteristics of the corpora, methods used in the collection of user judgments, and an overview of the application of the corpora to evaluating the component system. F...
Article
Full-text available
This paper presents a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The unique contribution of the research is in using these linguistically based tools with filters in order to avoid the problems of semantic degradation typical...
