
Judith L. Klavans
- Ph.D. Linguistics
- University of Maryland, College Park
About
159 Publications
17,928 Reads
3,751 Citations
Current institution
- University of Maryland, College Park
Additional affiliations
January 2005 - May 2016
Publications (159)
This study demonstrates an improved conceptual foundation to support well-structured analysis of image topicality. First we present a conceptual framework for analyzing image topicality, explicating the layers, the perspectives, and the topical relevance relationships involved in modeling the topicality of art images. We adapt a generic relevance t...
A new opportunity to explore and leverage the power of computational linguistic methods and analysis in ensuring effective Cybersecurity is presented. This White Paper discusses some of the specific emerging research opportunities, covering human language technologies such as language identification, topic modeling, and information extraction for k...
In recent years, cultural heritage institutions have increasingly used social tagging. To better understand the nature of these tags, we analyzed tags assigned to a collection of 100 images of art (provided by the steve.museum project) using subject matter categorization. Our results show that the majority of tags describe the people and objects in...
Specialized medical ontologies and terminologies, such as SNOMED CT and the Unified Medical Language System (UMLS), have been successfully leveraged in medical information systems to provide a standard web-accessible medium for interoperability, access, and reuse. However, these clinically oriented terminologies and ontologies cannot provide suffic...
Keeping up with rapidly growing research fields, especially when there are multiple interdisciplinary sources, requires substantial effort for researchers, program managers, or venture capital investors. Current theories and tools are directed at finding a paper or website, not gaining an understanding of the key papers, authors, controversies, and...
Information retrieval (IR) involves retrieving information from stored data, through user queries or pre-formulated user profiles. The information can be in any format. IR typically proceeds in four broad stages: identification of text types, document preprocessing, document indexing, and query processing, in which queries are matched against documents....
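As an illustration of the indexing and query-matching stages described above, here is a minimal sketch of a toy inverted index; the documents, function names, and matching policy are hypothetical, not drawn from the publication.

```python
from collections import defaultdict

def build_index(docs):
    """Document indexing stage: map each token to the ids of the
    documents containing it (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # toy preprocessing
            index[token].add(doc_id)
    return index

def match(index, query):
    """Query processing stage: documents containing every query token."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {1: "text retrieval from stored data",
        2: "user profiles and queries",
        3: "retrieval through user queries"}
index = build_index(docs)
print(match(index, "user queries"))  # {2, 3}
```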
Action Science Explorer (ASE) is a tool designed to support users in rapidly generating readily consumable summaries of academic literature. It uses citation network visualization, ranking and filtering papers by network statistics, and automatic clustering and summarization techniques. We describe how early formative evaluations of ASE led to a ma...
Action Science Explorer (ASE) is a tool designed to support users in rapidly generating readily consumable summaries of academic literature. The authors describe ASE and report on how early formative evaluations led to a mature system evaluation, consisting of an in-depth empirical evaluation with 4 domain expert participants. The user study tasks...
This paper reports on the linguistic analysis of a tag set of nearly 50,000 tags collected as part of the steve.museum project. The tags describe images of objects in museum collections. We present our results on morphological, part of speech and semantic analysis. We demonstrate that deeper tag processing provides valuable information for organizi...
In this paper, we present our results on comparing the language of social tags with text-mined terms for images. We have developed a novel modification of the standard term frequency/inverse document frequency metric (tf*idf) (Salton & Buckley 1988) over tags and terms to identify and filter terms which discriminate images for searchers. Since tags...
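The modified metric itself is not reproduced in the snippet above; for orientation, here is a minimal sketch of the standard tf*idf weighting (Salton & Buckley 1988) it builds on. The tag sets and function name are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute standard tf*idf weights for a list of token lists.

    Each 'document' here is the bag of tags/terms attached to one image;
    the weighting favors tags frequent for an image but rare across the
    collection, and hence discriminative for searchers."""
    n = len(docs)
    # document frequency: number of images whose tag set contains the term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (1 + math.log(c)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Toy example: tag sets for three images (illustrative data only).
images = [["portrait", "woman", "oil"],
          ["landscape", "oil", "trees"],
          ["portrait", "man", "oil"]]
for w in tf_idf(images):
    print(sorted(w.items(), key=lambda kv: -kv[1]))
```

Note how "oil", which occurs with every image, receives weight zero: it cannot discriminate one image from another.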
In this paper, we report on a new project, “T3: Text, Tags, and Trust to Improve Image Access for Museums and Libraries”, the goal of which is to improve access to digital image collections in museums and libraries for art historians, museum professionals, and the general public. T3 combines text mining, social tagging, and trust inferencing...
In this paper we extend existing sense-making models by adding detail on types of conceptual change and cognitive mechanisms taken from theories of cognition and learning. Our extended model aims to offer a more complete picture of the cognitive processes of sense-making, including the underlying cognitive mechanisms and different types of conceptu...
In this paper, we present a system using computational linguistic techniques to extract metadata for image access. We discuss the implementation, functionality and evaluation of an image catalogers’ toolkit, developed in the Computational Linguistics for Metadata Building (CLiMB) research project. We have tested components of the system, including...
In this paper, we describe a downloadable text-mining tool for enhancing subject access to image collections in digital libraries.
This chapter describes the progress of the Digital Government Research Center in tackling the challenges of integrating and accessing the massive amount of statistical and text data available from government agencies. In particular, we address the issues of database heterogeneity, size, distribution, and control of terminology. In this chapter we p...
The CLiMB project investigates semi-automatic methods to extract descriptive metadata from texts for indexing digital image collections. We developed a set of functional semantic categories to classify text extracts that describe images. Each semantic category names a functional relation between an image depicting a work of art historical signific...
In this paper, we present a fully implemented system using computational linguistic techniques to apply automatic text mining for the extraction of metadata for image access. We describe the implementation of a workbench created for, and evaluated by, image catalogers. We discuss the current functionality and future goals for this image catalogers'...
This paper describes a formative evaluation of an integrated multilingual, multimedia information system, a series of user studies designed to guide system development. The system includes automatic speech recognition for English, Chinese, and Arabic, automatic translation from Chinese and Arabic into English, and query-based and profile-based se...
This report is about the value of a specific area of planning and about how the United States might make improvements in that specific area. Geospatial data and tools are currently used for emergency response, but recent events have demonstrated the many ways in which our geospatial data and tools and the use we make of them fail us, both in prepar...
The issue of sentence ordering is an important one for natural language tasks such as multi-document summarization, yet there has not been a quantitative exploration of the range of acceptable sentence orderings for short texts. We present results of a sentence reordering experiment with three experimental conditions. Our findings indicate a very h...
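The snippet does not state how agreement between orderings was scored; Kendall's tau is one standard rank-correlation measure for comparing two sentence orderings, sketched here on hypothetical annotator data.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Rank correlation between two orderings of the same sentences.

    1.0 means identical order, -1.0 fully reversed; values near 1 mean
    two annotators placed the sentences in nearly the same sequence."""
    pos = {s: i for i, s in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    concordant = sum(1 for x, y in pairs if pos[x] < pos[y])
    return (2 * concordant - len(pairs)) / len(pairs)

# Two hypothetical orderings of five sentences, differing in one swap.
print(kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 0.8
```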
This panel will explore the broad DELOS vision of digital libraries and its implications for research and teaching. What is envisioned is “a much broader system of highly interlinked information and services that will provide very rich functionality, supporting new ways of intellectual work, communication, and process execution in business, govern...
We present the summarization system in the PErsonalized Retrieval and Summarization of Images, Video and Language (PERSIVAL) medical digital library. Although we discuss the context of our summarization research within the PERSIVAL platform, the primary focus of this article is on strategies to define and generate customized summaries. Our summariz...
Digital government research spans disciplinary fields that are exceptionally diverse. Although digital government research emerged primarily from efforts to engage computer scientists in research that would improve the functions of government, digital government research today addresses a far broader set of issues including the adoption and use of...
Columbia's Newsblaster tracking and summarization system is a robust system that clusters news into events, categorizes events into broad topics and summarizes multiple articles on each event. Here we outline our most current work on tracking events over days, producing summaries that update a user on new information about an event, outlining the p...
Digital image collections in libraries and other curatorial institutions grow too rapidly to create new descriptive metadata for subject matter search or browsing. CLiMB (Computational Linguistics for Metadata Building) was a project designed to address this dilemma that involved computer scientists, linguists, librarians, and art librarians. The C...
We present a relational learning framework for grammar induction that is able to learn meaning as well as syntax. We introduce a type of constraint-based grammar, lexicalized well-founded grammar (lwfg), and we prove that it can always be learned from a small set of semantically annotated examples, given a set of assumptions. The semantic repr...
We present the new multilingual version of the Columbia Newsblaster news summarization system. The system addresses the problem of user access to browsing news from multiple languages from multiple sites on the internet. The system automatically collects, organizes, and summarizes news in multiple source languages, allowing the user to browse news...
The task of creating indicative summaries that help a searcher decide whether to read a particular document is a difficult task. This paper examines the indicative summarization task from a generation perspective, by first analyzing its required content via published guidelines and corpus analysis. We show how these summaries can be factored into a...
We present a system for the automatic extraction of salient information from email messages, thus providing the gist of their meaning. Dealing with email raises several challenges that we address in this paper: heterogeneous data in terms of length and topic. Our method combines shallow linguistic processing with machine learning to extract phrasal...
If Natural Language Processing (NLP) systems are viewed as intelligent systems then we should be able to make use of verification and validation (V&V) approaches and methods that have been developed in the intelligent systems community.
The goal of this paper is to present the entire multilingual Columbia Newsblaster system as a platform for multilingual multi-document summarization experiments. The phases in the multilingual version of Columbia Newsblaster have been modified to take language and character encoding into account, and a new phase, translation, has been added. Figure 1 depicts t...
We have developed a multilingual version of Columbia Newsblaster as a testbed for multilingual multi-document summarization. The system collects, clusters, and summarizes news documents from sources all over the world daily. It crawls news sites in many different countries, written in different languages, extracts the news text from the HTML pages,...
We present a statistical similarity measuring and clustering tool, SIMFINDER, that organizes small pieces of text from one or multiple documents into tight clusters. By placing highly related text units in the same cluster, SIMFINDER enables a subsequent content selection/generation component to reduce each cluster to a single sentence, either by e...
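SIMFINDER's similarity features are richer than any single measure; the sketch below only illustrates the general idea of grouping small text units into tight clusters, assuming a simple word-overlap similarity in place of the system's actual metric.

```python
def word_overlap(a, b):
    """Jaccard overlap between the word sets of two text units."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(units, threshold=0.3):
    """Greedy single-link clustering: a unit joins the first cluster
    containing a sufficiently similar unit, else starts its own."""
    clusters = []
    for u in units:
        for c in clusters:
            if any(word_overlap(u, v) >= threshold for v in c):
                c.append(u)
                break
        else:
            clusters.append([u])
    return clusters

sentences = ["the senate passed the bill",
             "the bill passed in the senate",
             "rain is expected on tuesday"]
print(cluster(sentences))  # two clusters: bill sentences vs. weather
```

A downstream generation component could then reduce each cluster to a single sentence, as the abstract describes.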
Recently, there have been significant advances in several areas of language technology, including clustering, text categorization, and summarization. However, efforts to combine technology from these areas in a practical system for information access have been limited. In this paper, we present Columbia's Newsblaster system for online news summariz...
Domain specific texts often have implicit rules on content and organization. We introduce a novel method for synthesizing this topical structure. The system uses corpus examples and recursively merges their topics to build a hierarchical tree. A subjective cross domain evaluation showed that the system performed well in combining related topics and...
unmarked context. The values derived from the corpus are variable, that is, default values. Linguistic tests were automatically applied to the parsed corpora in order to determine the initial aspectual value for stativity, for a set of frequent verbs representing more than 90% of the verb occurrences in a...
This paper presents the results of an experiment using machine-readable dictionaries (MRDs) and corpora for building concatenative units for text-to-speech (TTS) systems. Theoretical questions concerning the nature of phonemic data in dictionaries are raised; phonemic dictionary data is viewed as a representative corpus over which to extract n-gr...
We describe an interactive system, built within the context of CLiMB project, which permits a user to locate the occurrences of named entities within a given text. The named entity tool was developed to identify references to a single art object (e.g. a particular building) with high precision in text related to images of that object in a digital c...
An obstacle to understanding results across heterogeneous databases is the ability to determine conceptual connections between differing terminologies. In this paper, we present the two-step approach which we have used to build a terminological database in order to address this issue. First we automatically built a heterogeneous collectio...
Metadata descriptions of database contents are required to build and use systems that access and deliver data in response to user requests. When numerous heterogeneous databases are brought together in a single system, their various metadata formalizations must be homogenized and integrated in order to support the access planning and delivery syste...
This paper addresses the problem of developing methods to be used in the identification and extraction of meaningful semantic components from large online glossaries. We present two sets of results. First, we report on the algorithm, ParseGloss, which was used to analyze definitions, and extract the main concept, or genus phrase. We ran the system...
We present a new composite similarity metric that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units. Several potential features are investigated and an optimal combination is selected via machine learning. We discuss a more restrictive definition of similarity than traditio...
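A minimal sketch of the approach, assuming two toy indicators and scikit-learn's logistic regression as the combiner; the paper's actual feature set and learner are not specified in the snippet above.

```python
from sklearn.linear_model import LogisticRegression

def indicators(a, b):
    """Two toy linguistic indicators for a pair of short text units:
    word overlap and length ratio (stand-ins for the richer set the
    paper investigates)."""
    wa, wb = set(a.split()), set(b.split())
    overlap = len(wa & wb) / len(wa | wb)
    length_ratio = min(len(wa), len(wb)) / max(len(wa), len(wb))
    return [overlap, length_ratio]

# Hypothetical training pairs labeled similar (1) or not (0).
pairs = [("the stock fell sharply", "shares dropped sharply", 1),
         ("the stock fell sharply", "he enjoys long walks", 0),
         ("rain expected tuesday", "showers likely tuesday", 1),
         ("rain expected tuesday", "the senate passed a bill", 0)]
X = [indicators(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

# The learner finds a weighting of the indicators from labeled data.
model = LogisticRegression().fit(X, y)
print(model.predict_proba([indicators("stocks fell", "shares fell")]))
```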
We present a new method for discovering a segmental discourse structure of a document while categorizing each segment's function and importance. Segments are determined by a zero-sum weighting scheme, used on occurrences of noun phrases and pronominal forms retrieved from the document. Segment roles are then calculated from the distribution of the...
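The defining property of a zero-sum weighting scheme, that each term's weights across the document sum to zero, can be sketched as follows; this is a simplification, and the exact scheme in the paper differs.

```python
def zero_sum_weights(term_positions, n_sentences):
    """Weight a term's presence per sentence so the weights sum to zero:
    +1 where the term occurs, a compensating negative weight elsewhere.
    (Simplified stand-in for the paper's scheme; assumes the term does
    not occur in every sentence.)"""
    occurs = set(term_positions)
    penalty = -len(occurs) / (n_sentences - len(occurs))
    return [1.0 if i in occurs else penalty for i in range(n_sentences)]

# A term confined to sentences 0-2 of a 6-sentence text: the sign flip
# after sentence 2 is evidence for a segment boundary there.
print(zero_sum_weights([0, 1, 2], 6))  # [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
```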
This paper describes the evaluation of a new automated text summarization system, Centrifuser. This system provides information to patients and families relevant to their specific health questions. Centrifuser accepts queries about health conditions, and produces a summary of information from articles retrieved by a standard search engine that is t...
A current application of automatic text summarization is to provide an overview of relevant documents coming from an information retrieval (IR) system. This paper examines how Centrifuser, one such summarization system, was designed with respect to methods used in the library community. We have reviewed these librarian expert techniques to assist i...
language itself. He argues that stochastic grammars play an important role in the handling of a number of theoretical linguistic issues. The principal emphasis of the paper is on syntactic issues, with little description given of the incorporation of semantic and pragmatic factors into a statistical system, or of their interaction with syntax. In...
We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well-covered by other summary corpora, and motivate why they constitute an important...
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part of speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribu...
We present an evaluation of domain-independent natural language tools for use in the identification of significant concepts in documents. Using qualitative evaluation, we compare three shallow processing methods for extracting index terms, i.e., terms that can be used to model the content of documents. We focus on two criteria: quality and coverage....
This paper describes the comparative evaluation of an experimental automated text summarization system, Centrifuser and three conventional search engines - Google, Yahoo and About.com. Centrifuser provides information to patients and families relevant to their questions about specific health conditions. It then produces a multidocument summary of a...
The Digital Government Research Center (DGRC) has completed phase one of the Energy Data Collection (EDC) project. In this paper, we present the results of building and evaluating system components, along with plans for phase two of the project. Phase one focused on data about petroleum products' prices and volumes, provided by the Energy Informati...
This paper describes a method toward automatically building dictionaries from text. We present DEFINDER, a rule-based system for extraction of definitions from on-line consumer-oriented medical articles. We provide an extensive evaluation on three dimensions: i) performance of the definition extraction technique in terms of precision and recall, ii...
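DEFINDER's rules operate over linguistically analyzed text; the regular-expression sketch below only illustrates the kind of cue-phrase patterns involved, with hypothetical patterns and sample text.

```python
import re

# Hypothetical cue-phrase patterns; the actual DEFINDER rules are
# richer and apply to processed text rather than raw strings.
PATTERNS = [
    re.compile(r"(?P<term>[A-Z][\w ]+?), (?:also )?(?:called|known as) (?P<defn>[^,.]+)"),
    re.compile(r"(?P<term>[A-Z][\w ]+?) is (?P<defn>an? [^.]+)\."),
]

def extract_definitions(text):
    """Return (term, definition) pairs matched by any cue pattern."""
    return [(m.group("term"), m.group("defn"))
            for pat in PATTERNS for m in pat.finditer(text)]

sample = ("Hypertension is a condition in which blood pressure is "
          "persistently elevated. Myopia, also known as nearsightedness, "
          "affects distance vision.")
print(extract_definitions(sample))
```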
In this paper, we propose the use of multidocument summarization as a post-processing step in document retrieval. We examine the use of the summary as a replacement to the standard ranked list. The form of the summary is novel because it has both informative and indicative elements, designed to help different users perform their tasks better. Our summ...
This paper shows that linguistic techniques along with machine learning can extract high quality noun phrases for the purpose of providing the gist or summary of email messages. We describe a set of comparative experiments using several machine learning algorithms for the task of salient noun phrase extraction. Three main conclusions can be drawn f...
The massive amount of statistical and text data available from government agencies has created a set of daunting challenges to both the research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center (www.dgrc.org) we are investigating solutions to th...
Using technology developed at the Digital Government Research Center, a team of researchers is seeking to make government statistical data more accessible through the Internet. In collaboration with government experts, they are conducting research into advanced information systems, developing standards, interfaces and a shared infrastructure, and b...
In this paper we present a quantitative and qualitative evaluation of DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. The quantitative evaluation shows that in terms of precision and recall as measured against human performance, DEFINDER obtained 87% and 75% re...
The massive amount of statistical and text data available from government agencies has created a set of daunting challenges to both research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. At the Digital Government Research Center we are investigating solutions to these key problems. I...
In this paper we present DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. This research is part of the Digital Library Project at Columbia University, entitled PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video and Language resources) [5]. One goal...
The potential of automatically generated indexes for information access has been recognized for several decades (e.g., Bush 1945 [2], Edmundson and Wyllys 1961 [4]), but the quantity of text and the ambiguity of natural language processing have made progress at this task more difficult than was originally foreseen. Recently, a body of work on deve...
We present a system which extracts the genus word and phrase from free-form definition text, entitled LEXING, for Lexical Information from Glossaries. The extractions will be used to build automatically a lexical knowledge base from on-line domain specific glossary sources. We combine statistical and semantic processes to extract these terms, and...
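A crude heuristic stand-in for genus extraction, assuming definitions of the form "a/an/the <genus phrase> that/of/with ..."; LEXING itself combines statistical and semantic processing rather than this simple rule.

```python
import re

def genus_phrase(definition):
    """Heuristically pull the genus phrase from a free-form definition:
    strip a leading determiner, then cut at the first relative clause
    or preposition. (Illustrative only.)"""
    d = definition.strip().lower()
    d = re.sub(r"^(a|an|the)\s+", "", d)
    cut = re.split(r"\b(that|which|who|of|in|with|used|for)\b", d, maxsplit=1)
    return cut[0].strip()

print(genus_phrase("An instrument that measures atmospheric pressure"))
# -> 'instrument'
print(genus_phrase("A chronic disease of the liver"))
# -> 'chronic disease'
```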
The needs of society have long been addressed through government research support for new technologies-the Internet representing one example. Today, under the rubric of digital government, federal agencies as well as state and local units of governments at all levels have begun to leverage the fruits of these research investments to better serve t...
In healthcare settings, patients need access to online information that can help them understand their medical situation. Physicians need information that is clinically relevant to an individual patient. In this paper, we present our progress on developing a system, PERSIVAL, that is designed to provide personalized access to a distributed patient...
devote a whole (and extra-long) issue of Elsnews to the topics covered at LREC, and indeed to the conference itself. One of the key questions in Evaluation is: how far can the evaluation-driven methodology, which has proved so fruitful in the field of speech recognition, be generalized to other areas of language and speech technology? After a scene-s...
By synthesizing information common to retrieved documents, multi-document summarization can help users of information retrieval systems to find relevant documents with a minimal amount of reading. We are developing a multidocument summarization system to automatically generate a concise summary by identifying and synthesizing similarities across a...
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the pr...
This paper provides both an overview of the Digital Library Research Program at Columbia University, along with some detail on three selected projects. First, a review of the interdisciplinary approach to Digital Library Research from the point of view of the Center for Research on Information Access will be presented. Second, technical aspects wil...
We present a linguistically-motivated technique for the recognition and grouping of simplex noun phrases (SNPs) called LinkIT. Our system has two key features: (1) we efficiently gather minimal NPs, i.e. SNPs, as precisely and linguistically defined and motivated in our paper; (2) we apply a refined set of postprocessing rules to these SNPs to link...
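A minimal sketch of simplex noun phrase chunking over POS-tagged input; LinkIT's actual SNP definition and its linking rules are considerably more refined than this illustrative tag pattern.

```python
# Chunk a maximal run of determiner/adjective/noun tags ending in a
# noun; the tag inventory and trimming rule here are illustrative.
SNP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
NOUN_TAGS = {"NN", "NNS", "NNP"}

def simplex_nps(tagged):
    chunks, current = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last chunk
        if tag in SNP_TAGS:
            current.append((word, tag))
        else:
            # trim any non-noun tail so every chunk ends in a noun
            while current and current[-1][1] not in NOUN_TAGS:
                current.pop()
            if current:
                chunks.append(" ".join(w for w, _ in current))
            current = []
    return chunks

sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
            ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(simplex_nps(sentence))  # ['the quick brown fox', 'the lazy dog']
```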
Government at all levels is a major collector and provider of data and user of information technologies. The goal of the Digital Government Program is to fund research at the intersection of the computer and information sciences research communities ...
The massive amount of statistical and text data available from Federal Agencies has created a set of daunting challenges to both research and analysis communities. These problems include heterogeneity, size, distribution, and control of terminology. We are investigating solutions to three key problems, namely, (1) ontological mappings for terminolo...
Evaluation of natural language processing tools and systems must focus on two complementary aspects: first, evaluation of the accuracy of the output, and second, evaluation of the functionality of the output as embedded in an application. This paper presents evaluations of two aspects of LinkIT, a tool for noun phrase identification...
themselves sufficient; statistical approaches are robust and achieve broad coverage but lack the insightful domain knowledge that symbolic methods can provide. In addition, we now have available cheap and powerful computing hardware, large amounts of online corpora, and high quality online dictionaries and thesauri, all of which make empirical statistica...
This paper presents a new similarity metric which measures distance between pairs of small textual units using a set of linguistic indicators and a machine learning algorithm. We define and motivate a more restrictive definition of similarity than traditional definitions. We evaluate the performance of our system by comparing to TF*IDF, and show th...
The two approaches may have been viewed as contradictory by many researchers, yet the claim to the contrary does not seem very controversial at the current time. Exploration into the type of symbolic and statistical models which can complement each other is however extremely worthwhile as the NLP community still probes for answers in this direction. All...
We report on two corpora to be used in the evaluation of component systems for the tasks of (1) linear segmentation of text and (2) summary-directed sentence extraction. We present characteristics of the corpora, methods used in the collection of user judgments, and an overview of the application of the corpora to evaluating the component system. F...
This paper presents a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The unique contribution of the research is in using these linguistically based tools with filters in order to avoid the problems of semantic degradation typical...
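A toy sketch of the expansion idea, with a hand-built derivational table standing in for the full morphological system, and without the filters the paper describes.

```python
# Toy derivational families; the real system derives these with a
# morphological analyzer and filters out degraded expansions.
FAMILIES = {
    "retrieval": ["retrieve", "retrieved", "retrieving"],
    "information": ["inform", "informational"],
}

def expand_term(term):
    """Expand a multi-word index term into morphological variants of
    each word, keeping the original phrase first."""
    variants = [term]
    for word in term.split():
        for variant in FAMILIES.get(word, []):
            variants.append(term.replace(word, variant))
    return variants

print(expand_term("information retrieval"))
```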