Gregory Grefenstette
  • PhD
  • Senior Researcher at Florida Institute for Human and Machine Cognition

About

Publications: 167
Reads: 44,366
Citations: 5,480
Introduction
Developing domain specific vocabulary for faceted indexing of personal semantics and personal knowledge
Current institution
Florida Institute for Human and Machine Cognition
Current position
  • Senior Researcher
Additional affiliations
September 1993 - March 2001
Xerox Corporation
Position
  • Senior Researcher
November 2013 - November 2015
National Institute for Research in Computer Science and Control
Position
  • Advanced Researcher
August 2008 - November 2013
Dassault Systèmes
Position
  • Research Director

Publications

Publications (167)
Article
At least since the invention of writing, people have been troubled by the problem of what a word means. Dictionaries have traditionally been written with numbered word senses, giving the impression that the different senses of a word are fixed abstract entities, which can be used to separate usages into neat piles according to their different meani...
Patent
Full-text available
The present invention relates to an automatic translation method. With a sentence in a source language being translated into a sentence in a target language, the method comprises: a step (1) of extracting the set of sentence parts of the target language, in a textual database, that correspond to a total or partial translation of...
Article
Full-text available
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its shortcomings and strengths. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making...
Article
Full-text available
This paper reports the findings of the Dagstuhl Perspectives Workshop 17442 on performance modeling and prediction in the domains of Information Retrieval, Natural Language Processing and Recommender Systems. We present a framework for further research, which identifies five major problem areas: understanding measures, performance analysis, making...
Article
Full-text available
This report briefly describes the organization and the plenary talks given during the Dagstuhl Perspectives Workshop 17442. The goal of this workshop was to investigate the state-of-the-art and to delineate a roadmap and research challenges for performance modeling and prediction in three neighbouring domains, namely information retrieval (IR), recom...
Article
Full-text available
With the proliferation of online information sources, it has become more and more difficult to judge the trustworthiness of news found on the Web. The beauty of the web is its openness, but this openness has led to a proliferation of false and unreliable information, whose presentation makes it difficult to detect. It may be impossible to detect w...
Article
Full-text available
In this paper we present and compare two methodologies for rapidly inducing multiple subject-specific taxonomies from crawled data. The first method involves a sentence-level words co-occurrence frequency method for building the taxonomy, while the second involves the bootstrapping of a Word2Vec based algorithm with a directed crawler. We exploit t...
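A minimal sketch of the sentence-level co-occurrence counting that the first method builds on, under the simplifying assumption that candidate terms are single lower-cased words (the paper's pipeline is certainly richer than this):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, terms):
    """Count how often each pair of candidate terms shares a sentence."""
    term_set = {t.lower() for t in terms}
    pair_counts = Counter()
    for sentence in sentences:
        present = sorted({w for w in sentence.lower().split() if w in term_set})
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical toy data, for illustration only.
sentences = ["Merlot is a red wine grape", "Chardonnay is a white wine grape"]
print(cooccurrence_counts(sentences, ["merlot", "chardonnay", "wine", "grape"]))
```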
Article
Full-text available
Current research in lifelog data has not paid enough attention to analysis of cognitive activities in comparison to physical activities. We argue that as we look into the future, wearable devices are going to be cheaper and more prevalent and textual data will play a more significant role. Data captured by lifelogging devices will increasingly incl...
Article
Full-text available
Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary, or have meanings that differ from ordinary language. The first step in creating a specialized dictionary involves detecting the characteristic vocabulary of the domain in question. Classical method...
Conference Paper
Full-text available
Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary, or have meanings that differ from ordinary language. The first step in creating a specialized dictionary involves detecting the characteristic vocabulary of the domain in question. Classical method...
Conference Paper
Full-text available
Wikipedia is widely used for finding general information about a wide variety of topics. Its vocation is not to provide local information. For example, it provides plot, cast, and production information about a given movie, but not showing times in your local movie theatre. Here we describe how we can connect local information to Wikipedia, without...
Article
Full-text available
Finding experts for a given problem is recognized as a difficult task. Even when a taxonomy of subject expertise exists, and is associated with a group of experts, it can be hard to exploit by users who have not internalized the taxonomy. Here we present a method for both attaching experts to a domain ontology, and hiding this fact from the end use...
Article
Full-text available
Given a set of terms from a given domain, how can we structure them into a taxonomy without manual intervention? This is the task 17 of SemEval 2015. Here we present our simple taxonomy structuring techniques which, despite their simplicity, ranked first in this 2015 benchmark. We use large quantities of text (English Wikipedia) and simple heuristi...
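One family of heuristics of the kind alluded to can be illustrated with a head-word rule: a compound term whose last word is itself a term in the set is attached as a hyponym of that term. This is an assumed, simplified example, not the ranked SemEval submission itself:

```python
def substring_hypernym_pairs(terms):
    """Attach 'apple juice' under 'juice' when both appear in the term set."""
    pairs = []
    for term in terms:
        for candidate in terms:
            if term != candidate and term.endswith(" " + candidate):
                pairs.append((term, candidate))  # (hyponym, hypernym)
    return pairs

print(substring_hypernym_pairs(["juice", "apple juice", "fruit juice"]))
# [('apple juice', 'juice'), ('fruit juice', 'juice')]
```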
Article
Full-text available
Quantified self, life logging, digital eyeglasses: technology is advancing rapidly to a point where people can gather masses of data about their own persons and their own life. Large-scale models of what people are doing are being built by credit companies, advertising agencies, and national security agencies, using digital traces that people leave...
Conference Paper
Full-text available
Online news comes in rich multimedia form: video, audio, photos, in addition to traditional text. Recent advances in semantically-rich text processing, in speech-to-text processing, and in image processing allows us to develop new ways of presenting and enriching news stories. Here we present MEWS, a Multimedia nEWS platform, which enriches news br...
Patent
Full-text available
The present invention relates to an automatic translation method. When a sentence in a source language is translated into a sentence in a target language, the method comprises: a step (1) of extracting the set of sentence portions of the target language from a textual database that correspond to a total or partial translation of the source sentenc...
Conference Paper
It is easy to find large-scale, free-to-use, computer readable geographical databases nowadays, and many people are examining how to exploit them, crossing them with other linked data to produce new applications. We present here a publicly accessible web application for searching for movies based on location, mashing up information from DBpedia and...
Conference Paper
Full-text available
Video search is still largely based on text search over human-supplied metadata, sometimes supplemented by extracted thumbnails. We have been developing a broadcast news search system based on recent progress in automatic speech recognition (ASR), natural language processing (NLP) and video and image processing to provide a rich content-based search to...
Conference Paper
Full-text available
Most news organizations provide immediate access to topical news broadcasts through RSS streams or podcasts. Until recently, applications have not permitted a user to perform content based search within a longer spoken broadcast to find the segment that might interest them. Recent progress in both automatic speech recognition (ASR) and natural lang...
Article
Full-text available
One important class of online videos is that of news broadcasts. Most news organisations provide near-immediate access to topical news broadcasts over the Internet, through RSS streams or podcasts. Until lately, technology has not made it possible for a user to automatically go to the smaller parts, within a longer broadcast, that might interest th...
Conference Paper
Full-text available
Photo sharing platform users often annotate their trip photos with landmark names. These annotations can be aggregated in order to recommend lists of popular visitor attractions similar to those found in classical tourist guides. However, individual tourist preferences can vary significantly so good recommendations should be tailored to individual...
Conference Paper
Full-text available
People often try to find an image using a short query and images are usually indexed using short annotations. Matching the query vocabulary with the indexing vocabulary is a difficult problem when little text is available. Textual user generated content in Web 2.0 platforms contains a wealth of data that can help solve this problem. Here we describ...
Book
We are poised at a major turning point in the history of information management via computers. Recent evolutions in computing, communications, and commerce are fundamentally reshaping the ways in which we humans interact with information, and generating enormous volumes of electronic data along the way. As a result of these forces, what will data m...
Article
One important class of online videos is news broadcasts. Most news organizations provide immediate access to topical news broadcasts over the Internet, through RSS streams or podcasts. Until lately, technology has not made it possible for a user to automatically find, within a longer broadcast, the smaller parts that might interest them. Recent adv...
Conference Paper
Full-text available
Quaero, a research and innovation program addressing automatic processing of multimedia and multilingual content, fosters the development of new tools for navigation in large volumes of audiovisual content. Quaero projects (automatic information retrieval, analysis, segmentation and classification of text, speech, music, image and video) are suppor...
Conference Paper
Full-text available
Semantics has many different definitions in science. In natural language processing, there has been much research over the past three decades involving extracting the semantics, the meaning, of natural texts. This has led to entity recognition (people, places, companies, prices, dates, and events), and more recently into sentiment analysis, explori...
Article
Full-text available
Social computing sites constitute a valuable source of user-generated content for user modeling. Whereas user generated content and the mining of such content are well studied, little attention has been given in the literature to modeling the relationship between users' personal information and content. Here we analyze the relation of user gende...
Article
Full-text available
Beyond current textual search, there remains a need for greater variety in query modality and in media input for querying. The I-SEARCH project aims at developing an experimental platform for new types of querying of multimedia document sources. This article presents an overview of the I-SEARCH project, shows some typical use case scenarios for the...
Chapter
Full-text available
Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as word sense disambiguation, machine translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are bu...
Conference Paper
Full-text available
Space and time are important dimensions in the representation of a large number of concepts. However there exists no available resource that provides spatiotemporal mappings of generic concepts. Here we present a link-analysis based method for extracting the main locations and periods associated to all Wikipedia concepts. Relevant locations are sel...
Conference Paper
Full-text available
Personal photos and their associated metadata reveal different aspects of our lives and, when shared online, let others have an idea about us. Automating the extraction of personal information is an arduous task but it contributes to better understanding and serving users. Here we present methods for analyzing textual metadata associated to Flickr...
Article
Personal photos and their associated metadata reveal different aspects of our lives and, when shared online, let others have an idea about us. Automating the extraction of personal information is an arduous task but it contributes to better understanding and serving users. Here we present methods for analyzing textual metadata associated to Flickr...
Conference Paper
Video is poised to largely replace both text and images as the media for transmitting information in the coming years. The challenge of the Information Processing community is how to index the information found in this voluminous and dynamic media stream. Most of the linguistic information is encoded in the audio channel of video data, which, once...
Conference Paper
Full-text available
Tourist photographs constitute a large part of the images uploaded to photo sharing platforms. But filtering methods are needed before one can extract useful knowledge from noisy user-supplied metadata. Here we show how to extract clean trip related information (what people visit, for how long, panoramic spots) from Flickr metadata. We illustrate o...
Conference Paper
Full-text available
We present an interface to video and audio podcasts that extracts semantics from the speech content, and packages the extracted information in a variety of navigation tools. The user can jump to the relevant sections and browse from relevant section to relevant section. This interface is related to the Yahoo! Challenge: Robust Automatic Segmentatio...
Conference Paper
Full-text available
Uploading tourist photos is a popular activity on photo sharing platforms. These photographs and their associated metadata (tags, geo-tags, and temporal information) should be useful for mining information about the sites visited. However, user-supplied metadata are often noisy and efficient filtering methods are needed before extracting useful kno...
Conference Paper
Enterprise search and web searching have different goals and characteristics. Whereas internet browsing can sometimes be seen as a form of entertainment, enterprise search involves activities in which search is mainly a tool. People have work they want to get done. In this context, the idea of relevance in documents is different. Community can beco...
Conference Paper
Full-text available
Geographical gazetteers are necessary in a wide variety of applications. In the past, the construction of such gazetteers has been a tedious, manual process and only recently have the first attempts to automate the gazetteers creation been made. Here we describe our approach for mining accurate but large-scale multilingual geographic information by...
Conference Paper
Finding geographically-based information constitutes a common use of Web search engines, for a variety of user needs. With the rapid growth of the volume of geographically-related information on the Web, efficient and adaptable ways of tagging, browsing and accessing relevant documents still need to be found. Structuring and mashing-up geographic...
Conference Paper
Full-text available
Classic search engines accept a user query and return a list of ranked results. Two independent phenomena may soon make this response seem archaic. One is that younger users are used to seeing all their information always present, in different configurations maybe, but present, accessible. The other phenomenon is increasing sophistication of s...
Conference Paper
Full-text available
Geolocalized databases are becoming necessary in a wide variety of application domains. Thus far, the creation of such databases has been a costly, manual process. This drawback has stimulated interest in automating their construction, for example, by mining geographical information from the Web. Here we present and evaluate a new automated techniq...
Chapter
Full-text available
Many people use the Internet to find pictures of things. When extraneous images appear in response to simple queries on a search engine, the user has a hard time understanding why his seemingly clear request was not properly satisfied. If the computer could only understand what he wanted better, then maybe the results would be more precise. The int...
Patent
Full-text available
Machine Translation using Cross-Language Information Retrieval
Chapter
This book collects and introduces some of the best and most useful work in practical lexicography. It has been designed as a resource for students and scholars of lexicography and lexicology and to be an essential reference for professional lexicographers. It focusses on central issues in the field and covers topics hotly debated in lexicography ci...
Conference Paper
Full-text available
Detecting the tone or emotive content of a text message is increasingly important in many natural language processing applications. Examples of such applications are rating new books or movies or products, judging the mood of a customer e-mail and routing it accordingly, measuring the reputation that a person or a product has in the blogosphere. While...
Conference Paper
Full-text available
People use the Internet to find a wide variety of images. Existing image search engines do not understand the pictures they return. The introduction of semantic layers in information retrieval frameworks may enhance the quality of the results compared to existing systems. One important challenge in the field is to develop architectures that fit...
Conference Paper
Full-text available
Words can be associated with images in different ways. Google and Yahoo use text found around a photo on a web page, Flickr image uploaders add their own tags. How do the annotations differ when they are extracted from text and when they are manually created? How do these language populations compare to written text? Here we continue our explorat...
Conference Paper
Full-text available
Certain components in images can be recognized with high accuracy, for example, backgrounds such as leaves, grass, snow, sky, water. These components provide the human eye with context for identifying items in the foreground. Likewise for the machine, the identification of background should help in the recognition of foreground objects. But, in thi...
Conference Paper
Full-text available
The management and exchange of multimedia data is challenging due to the variety of formats, standards and intended applications. In addition, production of multimedia data is rapidly increasing due to the availability of off-the-shelf, modern digital devices that can be used by even inexperienced users. It is likely that this volume of information...
Conference Paper
Full-text available
Dictionaries only contain some of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the decline in price of memory storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete language model for a language usi...
Conference Paper
Full-text available
Many people use the Internet to find pictures of things. When extraneous images appear in response to simple queries on a search engine, the user has a hard time understanding why his seemingly clear request was not properly satisfied. If the computer could only understand what he wanted better, then maybe the results would be more precise. We b...
Conference Paper
Full-text available
For a computer to recognize objects, persons, situations or actions in multimedia, it needs to have learned models of each thing beforehand. For the moment, no large, general collection of training examples exists for the wide variety of things that we would want to automatically recognize in multimedia, video and still images. We believe that the...
Chapter
Full-text available
The usual approach to finding information on the WWW via existing Web browsers is to use a one or two word query. Browsers return a number of documents containing these words, and the user examines those documents, or their abstracts, sees how the word or words in their query are being used and alters their initial query accordingly. This contrasts...
Conference Paper
Full-text available
In our effort to contribute to the closing of the "semantic gap" between images and their semantic description, we are building a large-scale ontology of images of objects. This visual catalog will contain a large number of images of objects, structured in a hierarchical catalog, allowing image processing researchers to derive signatures for wide c...
Conference Paper
Full-text available
The rapid growth of the Internet information sources has led to organizing proposals, such as the Semantic Web initiative, with its ontological level providing a formal structuring for this disparate data. But given the amount of information to be treated even in a restricted domain, manual organization becomes rapidly unmanageable, and automati...
Article
Full-text available
In this paper, we propose to improve our previous work on automatically filling an image ontology via clustering using images from the web. This work showed how we can automatically create and populate an image ontology using the WordNet textual ontology as a basis, pruning it to keep only portrayable objects, and clustering to get representative i...
Article
Full-text available
Physical objects are often described in dictionaries by visual features. But the information needed by computer applications for image analysis is not always found in dictionaries, nor in a complete form in any other publicly available information source. This article describes some first steps in finding more complete visual information about obje...
Conference Paper
Full-text available
For cross-language text retrieval systems that rely on bilingual dictionaries for bridging the language gap between the source query language and the target document language, good bilingual dictionary coverage is imperative. For terms with missing translations, most systems employ some approaches for expanding the existing translation dictionaries...
Conference Paper
Full-text available
The goal of many natural language processing platforms is to be able to someday correctly treat all languages. Each new language, especially one from a new language family, provokes some modification and design changes. Here we present the changes that we had to introduce into our platform designed for European languages in order to handle a Se...
Article
At the NTCIR-4 workshop, Justsystem Corporation (JSC) and Clairvoyance Corporation (CC) collaborated in the cross-language retrieval task (CLIR). Our goal was to evaluate the performance and robustness of our recently developed commercial-grade CLIR systems for English and Asian languages. The main contribution of this article is the investigation...
Conference Paper
Full-text available
Cross-language information retrieval performance depends on the quality of the translation resources used to pass from a user's source language query to target language documents. Translation lists of proper names are rare but vital resources for cross-language retrieval between languages using different character sets. Named entities translation d...
Article
Full-text available
We describe building a large-scale image ontology using the WordNet lexical resources. This ontology is based on English words identifying portrayable objects. We reviewed the upper structure and interconnections of WordNet and selected only the branches leading to portrayable objects. This article explains our pruning approach to WordNet. The word...
Conference Paper
Full-text available
The idea behind the semantic web is that documents will contain additional markup that makes explicit the information content of unstructured media. We present here the Document Souls system which allows documents to become animate, actively searching the wider world for more information about their own contents, attaching relevant information to...
Conference Paper
Full-text available
Today, much of product feedback is provided by customers/critics online through websites, discussion boards, mailing lists, and blogs. People trying to make strategic decisions (e.g., a product launch, a purchase) will find that a web search will return many useful but heterogeneous and, increasingly, multilingual opinions on a product. Generally...
Article
Cross-language information retrieval over non parallel text requires a translation phase between a source language query and a target language document. In order to achieve the same performance as a monolingual target language query, good translations for all terms in a source language query must be found. Unfortunately, available translation dicti...
Article
Full-text available
The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the size of a language-specific corpus given the frequency of commonly occurring words in the corpus. We apply this...
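The estimation idea can be sketched as follows, assuming we already know each common word's relative frequency in the language and how often it occurs in the index; the words and figures below are invented for illustration:

```python
def estimate_corpus_size(hit_counts, relative_freqs):
    """Average the per-word estimates: observed hits / expected relative frequency."""
    estimates = [hit_counts[w] / relative_freqs[w] for w in hit_counts]
    return sum(estimates) / len(estimates)

relative_freqs = {"le": 0.020, "de": 0.030}      # assumed tokens per running word
hit_counts = {"le": 2_000_000, "de": 3_300_000}  # assumed index counts
print(f"{estimate_corpus_size(hit_counts, relative_freqs):,.0f} tokens")  # 105,000,000 tokens
```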
Conference Paper
The Web provides the largest, exploitable collection of language use. If we can mine the Web to build abstract models of language use, these models may have many applications. Here we present one example of using the implicit intelligence of language use to solve an important problem for machine translation programs and cross-lingual applications....
Conference Paper
Full-text available
Multilingual applications frequently involve dealing with proper names, but names are often missing in bilingual lexicons. This problem is exacerbated for applications involving translation between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK) where simple string copying is not a solution. We present a nove...
Conference Paper
Full-text available
Newspapers generally attempt to present the news objectively. But textual affect analysis shows that many words carry positive or negative emotional charge. In this article, we show that coupling niche browsing technology and affect analysis technology allows us to create a new application that measures the slant in opinion given to public figures...
Article
Full-text available
At the NTCIR-4 workshop, Justsystem Corporation and Clairvoyance Corporation collaborated in participating in the Cross-Language Retrieval Task (CLIR). We submitted results to the sub-tracks of SLIR and BLIR. For the SLIR track, we submitted Chinese, English, and Japanese monolingual runs. For the BLIR track, we submitted Japanese-English and Chine...
Conference Paper
Full-text available
In the bilingual track of CLEF 2002, focusing on word translation ambiguity, we experimented with several techniques for choosing the best target translation for each source query word by using co-occurrence statistics in a reference corpus consisting of documents in the target language. Our techniques give one best translation per source query wor...
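A hedged sketch of the selection step (one best translation per source query word, scored by co-occurrence with the other words' candidate translations in a target-language reference corpus); the statistics and weighting actually used in the experiments may differ:

```python
def choose_translations(candidates, cooc):
    """candidates: {source_word: [target_options]}; cooc: {(t1, t2): count} with sorted keys."""
    chosen = {}
    for src, options in candidates.items():
        others = [t for s, opts in candidates.items() if s != src for t in opts]
        def score(option):
            return sum(cooc.get(tuple(sorted((option, o))), 0) for o in others)
        chosen[src] = max(options, key=score)
    return chosen

# Invented toy example.
candidates = {"banque": ["bank", "bench"], "argent": ["money", "silver"]}
cooc = {("bank", "money"): 50, ("bench", "money"): 2, ("bank", "silver"): 3}
print(choose_translations(candidates, cooc))  # {'banque': 'bank', 'argent': 'money'}
```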
Article
Full-text available
The Web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. This special issue of Computational Linguistics explores ways in which this playground is being put to use.
Conference Paper
Full-text available
In CLEF 2003, Clairvoyance participated in the bilingual retrieval track with the German and Italian language pair. As we did not have any German-to-Italian translation resources, we used the Babel Fish translation service provided by Altavista.com for translating German topics into Italian, with English as a pivot language. Then the translated Ita...
Conference Paper
Full-text available
For cross language information retrieval (CLIR) based on bilingual translation dictionaries, good performance depends upon lexical coverage in the dictionary. This is especially true for languages possessing few inter-language cognates, such as between Japanese and English. In this paper, we describe a method for automatically creating and validati...
Conference Paper
Full-text available
Every time a user engaged in work reads or writes, the user spontaneously generates new information needs: to understand the text he or she is reading or to supply more substance to the arguments he or she is creating. Simultaneously, each Information Object (IO) (i.e., word, entity, term, concept, phrase, proposition, sentence, paragraph, sectio...
Article
Full-text available
As large on-line corpora become more prevalent, a number of attempts have been made to automatically extract thesaurus-like relations directly from text using knowledge poor methods. In the absence of any specific application, comparing the results of these attempts is difficult. Here we propose an evaluation method using gold standards, i.e., pre-ex...
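A minimal sketch of this kind of gold-standard scoring, assuming both the extracted relations and the gold standard are reduced to sets of word pairs (the paper's actual protocol is more detailed):

```python
def precision_recall(extracted, gold):
    """Score extracted (word, related_word) pairs against a gold-standard set."""
    hits = extracted & gold
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall

# Invented toy data.
gold = {("car", "automobile"), ("car", "vehicle"), ("dog", "canine")}
extracted = {("car", "automobile"), ("car", "truck")}
print(precision_recall(extracted, gold))  # (0.5, 0.3333...)
```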
Article
Full-text available
The same word can have many different meanings depending on the context in which it is used. Discovering the meaning of a word, given the text around it, has been an interesting problem for both the psychology and the artificial intelligence research communities. In this article, we present a series of experiments, using methods which have proven t...
Article
Over the past decade, the World Wide Web has been providing access to ever-increasing multilingual corpora. At the same time, computational linguists have been creating a wide range of linguistically motivated text abstraction techniques. These two phenomena permit the creation of extremely large collections of abstracted exemplars of text. One app...
Article
Full-text available
One of the bottlenecks in Natural Language Processing for a given language is creating a lexicon that covers the language. The morphological lexicon provides two important pieces of information for NLP applications: 1) the normalization of a word, its lemmatization, which allows the application to recognize two variants of the same word; and 2) the...
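As an illustration of the two pieces of information mentioned, a morphological lexicon entry can be viewed as mapping a surface form to its lemma plus grammatical features; the toy entries below are invented, not drawn from the paper:

```python
# Assumed toy lexicon: surface form -> lemma and part of speech.
LEXICON = {
    "ran":  {"lemma": "run",   "pos": "VERB"},
    "runs": {"lemma": "run",   "pos": "VERB"},
    "mice": {"lemma": "mouse", "pos": "NOUN"},
}

def analyze(token):
    """Return lemma and part of speech for a known word form."""
    return LEXICON.get(token.lower(), {"lemma": token, "pos": "UNKNOWN"})

print(analyze("Mice"))  # {'lemma': 'mouse', 'pos': 'NOUN'}
```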
Article
Nominalization is a highly productive phenomenon in most languages.
Article
Full-text available
For a very long time, it has been considered that the only way of automatically extracting similar groups of words from a text collection for which no semantic information exists is to use document co-occurrence data. But, with robust syntactic parsers that are becoming more frequently available, syntactically recognizable phenomena about word usag...
Article
Full-text available
Binding constraints form one of the most robust modules of grammatical knowledge. Despite their crosslinguistic generality and practical relevance for anaphor resolution, they have resisted full integration into grammar processing. The ultimate reason ...
Chapter
Full-text available
Until the appearance of the Brown Corpus with its 1 million words in the 1960s and then, on a larger scale, the British National Corpus (the BNC) with its 100 million words, the lexicographer had to rely pretty much on his or her intuition (and amassed scraps of papers) to describe how words were used. Since the task of a lexicographer was to summa...
Article
Choosing the correct target words is a difficult problem for machine translation. In cross-language information retrieval, this problem of choice is mitigated since more than one translation alternative can be retained in the target query. Between choosing just one word as a translation and keeping all the possible translations for each source word...
Article
Research on a number of developments in language technologies, targeted at improving patent processing procedures within patent offices and in subsequent patent database search systems, is described. Aspects of patent processing covered are (1) OCR correction, to assist the conversion of paper documents to electronic versions, and (2) text classifi...
