Anticipating Information Needs:
Everyday Applications as Interfaces to Internet
Information Resources
Jay Budzik, Kristian Hammond, Cameron Marlow, and Andrei
Scheinkman
Intelligent Information Laboratory
Department of Computer Science
The University of Chicago
1100 E. 58th St.
Chicago, IL 60637 USA
{budzik, hammond, camarlow, andrei}@cs.uchicago.edu
http://infolab.cs.uchicago.edu/
Motivation and Introduction
In recent years, we have experienced an explosion in the amount of information available
online. Unfortunately, tools which allow users to access this information are still quite
rudimentary. Users are forced to express their information needs in boolean query
languages. Moreover, the results returned are often unnecessarily redundant and poor in quality,
partly because the user is unable to specify his needs well enough in terms of a query, and
partly because of the stupidity of the software servicing his query. More intelligent
systems allow users to pose their information needs in the form of a question [Burke, et
al., 1997]. Nonetheless, these kinds of systems still require the user to make his
information needs explicit to the system. Thus, while Internet search engines provide a
first step toward solving this information access problem, most of them not only fail to
produce good results reliably, but are also hard to use. Question-answering systems
provide a solution to part of this problem, yet remain inconvenient.
In response to the problems posed by the current state of information retrieval systems,
we are working on a class of systems we call Personal Information Management
Assistants (PIMAs). PIMAs observe user interaction with everyday applications, use
these observations to anticipate a user's information needs, and then automatically fulfill
these needs by accessing Internet information sources, filtering the results and presenting
them to the user. Essentially, they allow everyday applications to serve as interfaces for
Internet information systems. For the remainder of this paper, we present our preliminary
work on an architecture for this class of systems, and our progress implementing such a
system. Finally, we discuss our preliminary results and survey directions for future work.
One of the main insights driving our work is that information-seeking behavior, such as
posing a query to a search engine, is goal-directed behavior. In this view, posing a query
to a search engine is a step in a plan that satisfies the goal of finding information about a
certain topic. Given that finding information is usually in service of some goal, we can
construct a library of information-consumption scripts (using "script" in the sense of
[Schank and Abelson, 1977]) associated with satisfying a user's goals. Scripts are
knowledge structures that house information about highly routine situations. In an
appropriate context, they serve the purpose of making strong predictions about situations
and sequences of events. For a PIMA, knowledge of a user's information-consumption
scripts means the ability to anticipate information-seeking goals and the ability to
automatically fulfill them.
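For concreteness, the sketch below (in Java, the language of our prototype) shows a highly
simplified representation of such a script: a name, a trigger describing the situation in which
it applies, and the ordered steps the assistant carries out. Our actual knowledge structures
carry more contextual information than this illustration suggests.

    import java.util.List;

    // Simplified illustration of an information-consumption script: a trigger
    // plus the ordered steps the assistant performs when the trigger fires.
    // This is an expository sketch, not our actual encoding.
    public class InformationConsumptionScript {
        public final String name;        // e.g. "FIND-RELATED-PAGES"
        public final String trigger;     // the situation in which the script applies
        public final List<String> steps; // ordered steps the assistant executes

        public InformationConsumptionScript(String name, String trigger,
                                            List<String> steps) {
            this.name = name;
            this.trigger = trigger;
            this.steps = steps;
        }
    }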
Architectural Overview
We have built a prototype PIMA that observes user interaction with everyday
applications (e.g., Navigator, Explorer, and Word), and, using very preliminary
knowledge of information-consumption scripts, is able to anticipate a user's information
needs. It then attempts to automatically fulfill them using common Internet information
resources. Given the requirements that it must observe several applications and that it
must also use multiple Internet information resources, we have adopted a five-tiered
architecture (see Figure 1).
Figure 1: PIMA Architecture
The user interacts with the sundry applications shown at the bottom of the diagram, and
the information management application in the middle. Through a series of adapters in
the green layers, the assistant application communicates with the existing software
applications through the operating system's IPC facilities. The assistant then interprets
user behavior in these applications, and constructs a query which it sends off to
information sources at the top. It collects the results and applies several heuristics that
allow it to present the user with a concise, high-quality list of suggestions. These suggestions
are presented in a window for the user to browse. Eventually, we plan to give our PIMA a
memory of user interests and expertise, as well as the ability to communicate with other
users' assistants, in order to personalize and improve the quality of the results.
Implementation
Currently, our PIMA observes user interaction in unmodified versions of Microsoft
Internet Explorer and Microsoft Word, as well as a modified version of Mozilla
(Netscape's free-source version of Navigator). The PIMA communicates with Microsoft
Word and Internet Explorer through COM (PC only), and with Mozilla through BSD-
style sockets (UNIX and PC). We designed our architecture with the idea that application
interfaces should be the only OS- and software-dependent components. We implemented the
assistant application in Java, for maximum portability. These design decisions afford us
the ability to extend the PIMA's field of observation relatively easily, without having to
change the core application code.
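As an illustration of this division of labor, the boundary between the adapters and the
assistant core can be sketched as follows; the interface and method names are chosen for
exposition and are not the ones in our prototype. Only classes implementing the adapter
interface know about COM or sockets, while the core sees a uniform stream of document events.

    // Hypothetical sketch of the adapter boundary. Adapter implementations hide
    // the OS-specific transport (COM, sockets); the assistant core is pure Java.
    public interface ApplicationAdapter {

        /** Implemented by the assistant core; adapters call it when the user acts. */
        interface Observer {
            void documentChanged(String applicationName, String documentText);
        }

        /** Begin forwarding events from the host application to the given observer. */
        void startObserving(Observer observer);

        void stopObserving();

        /** A human-readable description, e.g. "Microsoft Word (COM)". */
        String description();
    }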
Finding Relevant Pages
The simplest of the information-consumption scripts we have identified is associated with
finding related web pages. The FIND-RELATED-PAGES script is composed of the
following basic steps:
1. Summarize the document in terms of a few words.
2. Pose a query to a search engine using these words.
3. Sift through the results, searching for ones that are actually related.
It is usually applied when the user is interested in augmenting their knowledge of the
subject of the document at hand. We make the assumption that the user is usually
interested in documents relevant to the one he is reading or composing. For the two Web
browsers, our PIMA recognizes when a user navigates to a new web site, either by
clicking on a link, or by explicitly opening a URL. In Microsoft Word, it recognizes
when a user has opened a document or changed it significantly. The PIMA responds to
these user behaviors by anticipating that the user will want to know about Web sites like the
document he is looking at. There are essentially two processes associated with retrieving
relevant documents: query construction and information filtering.
Query Construction
In order to retrieve relevant sites, the PIMA constructs a query based on the contents of
the current web page, and sends it off to AltaVista [AltaVista, 1998] (see Figure 2).
Figure 2: PIMA Prototype - Suggesting Web Documents
In order to construct a query, the PIMA uses three techniques to decide which words
should be included: a standard stop list and two heuristics for rating the importance of a
word. The first heuristic is that words at the top of the page tend to be more important than
those at the bottom. The second is that words that occur with high frequency (and are not in
the stop list) are usually representative of the document. The specifics of our term-
weighting function are still under development, and as such, beyond the scope of this
paper. The terms with the top 20 weights are sent to AltaVista.
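Since the weighting function itself is still evolving, the sketch below shows only the general
shape of the computation; the stop list, the position discount, and the constants are
illustrative assumptions rather than the values our prototype uses.

    import java.util.*;

    // Illustrative term selection: stop-list filtering plus the two heuristics
    // (position in the document and frequency of occurrence).
    public class QueryConstructor {

        // A tiny stop list for illustration; the prototype uses a standard, larger one.
        private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
                "the", "a", "an", "of", "and", "or", "to", "in", "on", "is", "for"));

        /** Returns up to maxTerms terms, weighted by frequency and position. */
        public static List<String> selectTerms(String text, int maxTerms) {
            String[] tokens = text.toLowerCase().split("\\W+");
            Map<String, Double> weight = new HashMap<>();
            for (int i = 0; i < tokens.length; i++) {
                String term = tokens[i];
                if (term.isEmpty() || STOP_WORDS.contains(term)) continue;
                // Heuristic 1: words near the top of the document count for more.
                double positionBonus = 1.0 - 0.5 * ((double) i / tokens.length);
                // Heuristic 2: every occurrence adds to the term's weight (frequency).
                weight.merge(term, positionBonus, Double::sum);
            }
            List<Map.Entry<String, Double>> ranked = new ArrayList<>(weight.entrySet());
            ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
            List<String> terms = new ArrayList<>();
            for (int i = 0; i < ranked.size() && i < maxTerms; i++)
                terms.add(ranked.get(i).getKey());
            return terms;
        }
    }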
Information Filtering
Because the results returned from AltaVista are often redundant, containing copies of the
same page or similar pages from the same server, the PIMA must filter the results so as
not to add to a user's feeling of information overload [Maes, 1994]. If these similarities
are not accounted for, some of the more interesting pages returned by AltaVista may be
missed. Moreover, we constantly face the risk of annoying the user instead of helping
him. As a result, we actively attempt to reduce the amount of spurious information
presented, and in doing so address some of the problems associated with constantly-
updating interfaces (like [Lieberman, 1997]). To this end, we have designed our
prototype to collect search engine results and cluster similar pages, displaying a single
representative from each cluster for the user to browse.
For this task, we currently use three pieces of information AltaVista returns for each
document: the document's title, its URL, and the date on which it was last modified. For
each of these pieces we have developed a heuristic to numerically describe the degree of
similarity of two pages. The similarity of two titles is represented numerically by the
percentage of words they share in common; two URLs are compared by examining their
host, port and directory structure; and two dates are judged by the number of days
separating them. The combination of these three heuristic similarity metrics is sufficient
to determine the uniqueness of the documents returned, allowing us to provide more
intelligent output to the user.
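The sketch below illustrates the three similarity heuristics and a simple greedy clustering
pass over the results. The relative weights of the metrics and the similarity threshold are
illustrative assumptions, not tuned values from our prototype.

    import java.net.URL;
    import java.time.LocalDate;
    import java.time.temporal.ChronoUnit;
    import java.util.*;

    // Sketch of the title, URL, and date similarity heuristics and a greedy
    // clustering pass; the weights and the threshold are illustrative only.
    public class ResultClusterer {

        static class Result {
            String title; URL url; LocalDate modified;
            Result(String title, String url, LocalDate modified) throws Exception {
                this.title = title; this.url = new URL(url); this.modified = modified;
            }
        }

        // Fraction of words the two titles share.
        static double titleSimilarity(String a, String b) {
            Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
            Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
            Set<String> shared = new HashSet<>(wa);
            shared.retainAll(wb);
            return wa.isEmpty() || wb.isEmpty()
                    ? 0.0 : (double) shared.size() / Math.max(wa.size(), wb.size());
        }

        // Compare host, port, and leading directory components of the two URLs.
        static double urlSimilarity(URL a, URL b) {
            double s = 0.0;
            if (a.getHost().equalsIgnoreCase(b.getHost())) s += 0.5;
            if (a.getPort() == b.getPort()) s += 0.1;
            String[] pa = a.getPath().split("/"), pb = b.getPath().split("/");
            int common = 0, n = Math.min(pa.length, pb.length);
            while (common < n && pa[common].equals(pb[common])) common++;
            s += 0.4 * common / Math.max(1, Math.max(pa.length, pb.length));
            return s;
        }

        // Dates that are close together score close to 1.0.
        static double dateSimilarity(LocalDate a, LocalDate b) {
            long days = Math.abs(ChronoUnit.DAYS.between(a, b));
            return 1.0 / (1.0 + days);
        }

        static double similarity(Result a, Result b) {
            return 0.5 * titleSimilarity(a.title, b.title)
                 + 0.4 * urlSimilarity(a.url, b.url)
                 + 0.1 * dateSimilarity(a.modified, b.modified);
        }

        /** Greedy clustering: each result joins the first cluster it is close enough to. */
        static List<List<Result>> cluster(List<Result> results, double threshold) {
            List<List<Result>> clusters = new ArrayList<>();
            for (Result r : results) {
                List<Result> home = null;
                for (List<Result> c : clusters)
                    if (similarity(c.get(0), r) >= threshold) { home = c; break; }
                if (home == null) { home = new ArrayList<>(); clusters.add(home); }
                home.add(r);
            }
            return clusters;  // the first member of each cluster serves as its displayed representative
        }
    }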
Table 1 shows a typical response from AltaVista generated by a query posed in response
to a page on Java Standardization (we have deleted several long ones for brevity). Notice
there are a number of mirror sites, as well as logical duplicates (they may have different
URLs, but they are the same file). Table 2 shows these URLs after clustering. Instead of
presenting the user with 20 sites, we present him with 10, effectively removing the
duplicates and mirrors.
Sun Speaks Out On Java Standardization
    http://www.html.co.nz/news/110605.htm
Java Standardization
    http://aidu.cs.nthu.edu.tw/java/JavaSoft/www.javasoft.com/aboutJava/standardization/
Java Standardization Update - SnapShot
    http://www.psgroup.com/snapshot/1997/ss109706.htm
Informal XML API Standardization for Java
    http://xml.datachannel.com/xml/dev/XAPIJ1p0.html
International Organization For Standardization Gives Java The Nod
    http://techweb1.web.cerf.net/wire/news/1997/11/1118java.html
Java Standardization - A Whitepaper
    http://aidu.cs.nthu.edu.tw/java/JavaSoft/www.javasoft.com/aboutJava/standardization/javastd.html
Java Standardization
    http://java.sun.com:81/aboutJava/standardization/index.html
Sun Moves Java Standardization Forward
    http://techweb4.web.cerf.net/wire/news/1997/09/0922standard.html
Java Standardization
    http://java.sun.com/aboutJava/standardization/index.html
Java Standardization
    http://www.javasoft.com/aboutJava/standardization/index.html
Informal XML API Standardization for Java
    http://www.datachannel.com/xml/dev/Commonality.html
Java Standardization
    http://www.intel.se/design/news/javastand.htm
The impact of Java standardization
    http://www.idg.net/new_docids/find/java/suns/standardization/developers/submitter/approval/affects/new_docid_9-48305.html
Java Standardization - A Whitepaper
    http://java.sun.com/aboutJava/standardization/javastd.html
Table 1: Output of a query (Titles and URLs) generated from a page on Java Standardization
Sun Speaks Out On Java Standardization
    http://www.html.co.nz/news/110605.htm
Java Standardization
    http://aidu.cs.nthu.edu.tw/java/JavaSoft/www.javasoft.com/aboutJava/standardization/
    http://aidu.cs.nthu.edu.tw/java/JavaSoft/www.javasoft.com/aboutJava/standardization/javastd.html
    http://java.sun.com:81/aboutJava/standardization/index.html
    http://java.sun.com/aboutJava/standardization/index.html
    http://www.javasoft.com/aboutJava/standardization/index.html
    http://www.intel.se/design/news/javastand.htm
    http://www.idg.net/new_docids/find/java/suns/standardization/developers/submitter/approval/affects/new_docid_9-48305.html
    http://java.sun.com/aboutJava/standardization/javastd.html
Java Standardization Update - SnapShot
    http://www.psgroup.com/snapshot/1997/ss109706.htm
Informal XML API Standardization for Java
    http://xml.datachannel.com/xml/dev/XAPIJ1p0.html
    http://www.datachannel.com/xml/dev/Commonality.html
International Organization For Standardization Gives Java The Nod
    http://techweb1.web.cerf.net/wire/news/1997/11/1118java.html
Sun Moves Java Standardization Forward
    http://techweb4.web.cerf.net/wire/news/1997/09/0922standard.html
change nothing - JavaWorld - October 19
    http://www.javaworld.com/javaworld/jw-10-1997/jw-10-iso.html
Table 2: Results of clustering search engine responses.
Exploiting Structural Context: An Example
To demonstrate the power of this paradigm of interaction, we have programmed our
PIMA to recognize when a user inserts a caption. The script associated with this situation
suggests a different class of information-consumption behaviors we can anticipate: those
that are dependent on the structural context of the active document. The FIND-
RELATED-IMAGES script is applied when the user has inserted a caption with no image
to fill it (and probably others---we make no claims about being exhaustive here). It
contains the following steps:
1. Summarize the desired picture in terms of a few words.
2. Send these words off to an image search engine.
3. Sift through the results and choose the best one.
In response to the above situation, the PIMA applies this script and predicts that the user
will require an image. It then sends off a query to Arriba Vista [Arriba Vista, 1998], a
commercial implementation of WebSeer [Frankel, et al., 1996], an image search engine.
The PIMA constructs the query using a piece of knowledge about the structure of
documents, in general: that the caption of an image is a verbal summary of that image.
Hence the query it constructs is simply the conjunction of the caption terms that survive
the stop list. The results are presented in a web browser window, from which the user can
drag-and-drop the images into Microsoft Word (see Figure 3).
Figure 3: PIMA Prototype - Suggesting Images
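A minimal sketch of this caption-to-query step appears below; the class name and the stop
list are illustrative, and the prototype's actual stop list is considerably larger.

    import java.util.*;

    // Illustrative caption-to-query step for FIND-RELATED-IMAGES: keep the caption
    // terms that survive the stop list and join them as a conjunction.
    public class CaptionQuery {

        private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
                "figure", "the", "a", "an", "of", "and", "in", "on", "at", "for"));

        public static String fromCaption(String caption) {
            StringBuilder query = new StringBuilder();
            Set<String> seen = new HashSet<>();
            for (String t : caption.toLowerCase().split("\\W+")) {
                if (t.isEmpty() || t.matches("\\d+") || STOP_WORDS.contains(t) || !seen.add(t))
                    continue;
                if (query.length() > 0) query.append(" AND ");
                query.append(t);
            }
            return query.toString();
        }

        public static void main(String[] args) {
            // Prints: sailboat AND lake AND michigan AND dusk
            System.out.println(fromCaption("Figure 3: A sailboat on Lake Michigan at dusk"));
        }
    }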
Discussion and Directions for Future Research
Our initial observations suggest that the combination of our heuristics for query
generation and for response clustering produce very high-quality results. Our hypothesis
is that this is due to the fact that the query generation algorithm we apply to documents
roughly mirrors the process of document indexing, and that the clustering heuristics are
effective. While our initial results are promising, the system has much room for
improvement. Most obviously, our library of scripts is very sparse. Augmenting it so it
understands more user/application interactions (and thus is able to anticipate more kinds
of information needs) will be of primary concern. Moreover, the query construction
process ignores the structure of the documents from which queries are produced. Applying
heuristics that improve the query construction algorithm based on document structure will
be fairly straightforward to implement. In addition, queries frequently include terms that
are of little information value to vector-space retrieval systems like AltaVista.
Composing a table of term frequencies from a random sample of web documents and
using this table to negatively weight terms with very high frequencies will increase the
number of "quality" query terms AltaVista receives. As a further improvement, we plan
on adding support for more information resources and developing a vocabulary for
expressing the kind of information available, as well as a means by which the assistant
can be made aware of new information resources, similar to [Doorenbos, 1997]. Finally,
our prototype is reactive in the strictest sense---it has no memory, and knows nothing
about what the user prefers. Giving our PIMA the ability to learn user preferences and
leverage this knowledge as it attempts to anticipate information needs and select
appropriate sources is sure to improve the quality of suggestions dramatically. Clearly
there is much more to be done.
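As a rough sketch of the term-frequency adjustment proposed above (we have not yet built
the frequency table, so the formula below is an assumption rather than a result), a term's
weight could simply be scaled down in proportion to how common the term is in the sampled
documents:

    import java.util.Map;

    // Illustrative corpus-frequency penalty: a term that appears in a large
    // fraction of sampled web documents contributes little to a query, so its
    // weight is scaled down accordingly. The linear penalty is an assumption.
    public class CorpusWeighting {

        /** documentFrequency maps a term to the fraction of sampled documents containing it. */
        public static double adjust(double rawWeight, String term,
                                    Map<String, Double> documentFrequency) {
            double df = documentFrequency.getOrDefault(term, 0.0);
            return rawWeight * (1.0 - df);  // a term found in every sampled document contributes nothing
        }
    }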
Conclusion
In summary, we have outlined several major problems associated with contemporary
information access paradigms. In response, we presented an architecture for a class of
systems we call Personal Information Management Assistants. These systems observe user
interactions with everyday applications, anticipate information needs, and automatically
fulfill them using Internet information resources, essentially turning everyday
applications into interfaces for information sources. We presented our initial work on a
prototype of this kind of system, and closed with directions for future research.
References
[AltaVista, 1998] AltaVista, A Compaq Internet Service, 1998.
http://altavista.digital.com
[Arriba Vista, 1998] Arriba Vista Co., 1998. http://www.arribavista.com
[Burke, et al., 1997] Burke, R.; Hammond, K.; Kulyukin, V.; Lytinen, S.; Tomuro N.;
Schoenberg, S. 1997. Question Answering from Frequently-Asked Question Files:
Experiences with the FAQ Finder System. Technical Report TR-97-05, The University of
Chicago, Department of Computer Science.
[Doorenbos, 1997] Doorenbos, R.; Etzioni, O.; Weld, D. 1997. A Scalable Comparison-
Shopping Agent for the World-Wide Web. In Proc. Autonomous Agents '97.
[Frankel, et al., 1996] Frankel, C.; Swain, M., and Athitsos, V. 1996. WebSeer: An Image
Search Engine for the World Wide Web. Technical Report TR-96-14, The University of
Chicago, Department of Computer Science.
[Lieberman, 1997] Lieberman, H. 1997. Autonomous Interface Agents. In Proc. CHI-97.
[Maes, 1994] Maes, P. 1994. Agents that Reduce Work and Information Overload. In
Communications of the ACM 37(7).
[Schank and Abelson, 1977] Schank, R.; and Abelson R. 1977. Scripts, Plans, Goals and
Understanding. New Jersey: Lawrence Erlbaum Associates.