Marti A. Hearst's research while affiliated with University of California, Berkeley and other places

Publications (113)

Preprint
This work describes an automatic news chatbot that draws content from a diverse set of news articles and creates conversations with a user about the news. Key components of the system include the automatic organization of news articles into topical chatrooms, integration of automatically generated questions into the conversation, and a novel method...
Preprint
This work presents a new approach to unsupervised abstractive summarization based on maximizing a combination of coverage and fluency for a given length constraint. It introduces a novel method that encourages the inclusion of key terms from the original document into the summary: key terms are masked out of the original document and must be filled...
Preprint
Recent progress in Natural Language Understanding (NLU) has seen the latest models outperform human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations, and iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding data...
Preprint
Full-text available
Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next...
Preprint
Word clouds continue to be a popular tool for summarizing textual information, despite their well-documented deficiencies for analytic tasks. Much of their popularity rests on their playful visual appeal. In this paper, we present the results of a series of controlled experiments that show that layouts in which words are arranged into semantically...
Article
Word clouds continue to be a popular tool for summarizing textual information, despite their well-documented deficiencies for analytic tasks. Much of their popularity rests on their playful visual appeal. In this paper, we present the results of a series of controlled experiments that show that layouts in which words are arranged into semantically...
Article
We report the results of interviewing thirty professional data analysts working in a range of industrial, academic, and regulatory environments. This study focuses on participants' descriptions of exploratory activities and tool usage in these activities. Highlights of the findings include: distinctions between exploration as a precursor to more di...
Conference Paper
This study analyzes the use of paper exams in college-level STEM courses. It leverages a unique dataset of nearly 1,800 exams, which were scanned into a web application, then processed by a team of annotators to yield a detailed snapshot of the way instructors currently structure exams. The focus of the investigation is on the variety of question f...
Article
Full-text available
Most United States Patent and Trademark Office (USPTO) patent documents contain drawing pages which describe inventions graphically. By convention and by rule, these drawings contain figures and parts that are annotated with numbered labels but not with text. As a result, readers must scan the document to find the description of a given part label....
Conference Paper
In an instructional setting it can be difficult to accurately assess the quality of information visualizations of several variables. Instead of a standard design critique, an alternative is to ask potential readers of the chart to answer questions about it. A controlled study with 47 participants shows a good correlation between aggregated novice h...
Article
Modern Web data is highly structured in terms of entities and relations from large knowledge resources, geo-temporal references and social network structure, resulting in a massive multidimensional graph. This graph essentially unifies both the searcher and the information resources that played a fundamentally different role in traditional IR, and "...
Conference Paper
Modern Web data is highly structured in terms of entities and relations from large knowledge resources, geo-temporal references and social network structure, resulting in a massive multidimensional graph. This graph essentially unifies both the searcher and the information resources that played a fundamentally different role in traditional IR, and...
Article
The results of a study of online peer learning suggest that it may be advantageous to automatically assign students to small peer learning groups based on how many students initially get answers to questions correct.
Article
Modern Web data is highly structured in terms of entities and relations from large knowledge resources, geo-temporal references and social network structure, resulting in a massive multidimensional graph. This graph essentially unifies both the searcher and the information resources that played a fundamentally different role in traditional IR, and...
Article
Full-text available
We report the findings of a month-long online competition in which participants developed algorithms for augmenting the digital version of patent documents published by the United States Patent and Trademark Office (USPTO). The goal was to detect figures and part labels in U.S. patent drawing pages. The challenge drew 232 teams of two, of which 70...
Article
It is rare for a new user interface to break through and become successful, especially in information-intensive tasks like search, coming to consensus or building up knowledge. Most complex interfaces end up going unused. Often the successful solution lies in a previously unexplored part of the interface design space that is simple in a new way tha...
Conference Paper
Full-text available
A common task in qualitative data analysis is to characterize the usage of a linguistic entity by issuing queries over syntactic relations between words. Previous interfaces for searching over syntactic structures require programming-style queries. User interface research suggests that it is easier to recognize a pattern than to compose it from scr...
Article
News articles, reports, blog posts and academic papers often include graphical charts that serve to visually reinforce arguments presented in the text. To help readers better understand the relation between the text and the chart, we present a crowdsourcing pipeline to extract the references between them. Specifically, we give crowd workers paragra...
Conference Paper
Peer learning, in which students discuss questions in small groups, has been widely reported to improve learning outcomes in traditional classroom settings. Classroom-based peer learning relies on students being in the same place at the same time to form peer discussion groups, but this is rarely true for online students in MOOCs. We built a softwa...
Article
Research shows that people frequently try to search for information with other people. The fact that no user interface for collaborative searching has yet caught fire suggests that the best parts of the design space have yet to be investigated.
Conference Paper
With massive amounts of data being generated and stored ubiquitously in every discipline and every aspect of our daily life, how to handle such big data poses many challenging issues to researchers in data and information systems. The participants of CIKM 2013 are active researchers on large scale data, information and knowledge management, from mu...
Article
We describe WordSeer, a tool whose goal is to help scholars and analysts discover patterns and formulate and test hypotheses about the contents of text collections, midway between what humanities scholars call a traditional "close read" and the new "distant read" or "culturomics" approach. To this end, WordSeer allows for highly flexible "slicing...
Conference Paper
Professional search activities such as patent and legal search are often time sensitive and consist of rich information needs with multiple aspects or subtopics. This paper proposes a 3D water filling model to describe this search process, and derives a new evaluation metric, the Cube Test, to encompass the complex nature of professional search. Th...
Conference Paper
This paper presents a usability-tested interface design that enables time-constrained analysts to organize their search results in a lightweight manner during and immediately following their search sessions. The research literature suggests that users want to lay out search results spatially in overlapping "piles," but a pilot study with a flexible...
Conference Paper
Data exploration is largely manual and labor intensive. Although there are various tools and statistical techniques that can be applied to data sets, there is little help to identify what questions to ask of a data set, let alone what domain knowledge is useful in answering the questions. In this paper, we study user queries against production data...
Article
We present WordSeer, an exploratory analysis environment for literary text. Literature study is a cycle of reading, interpretation, exploration, and understanding. While there is now abundant technological support for reading and interpreting literary text in new ways through text-processing algorithms, the other parts of the cycle—exploration and...
Conference Paper
This paper examines how social networks can be used to recruit and promote a crowdsourced citizen science project and compares this recruiting method to the use of traditional media channels including press releases, news stories, and participation campaigns. The target studied is Creek Watch, a citizen science project that allows anyone with an i...
Article
We present a sensemaking environment for literary text analysis. Literature study is a cycle of reading, interpretation, exploration, and understanding. While there is now abundant technological support for reading and interpreting literary text in new ways through text-processing algorithms, the other parts of the cycle - exploration and understan...
Article
This book focuses on the human users of search engines and the tool they use to interact with them: the search user interface. The truly worldwide reach of the Web has brought with it a new realization among computer scientists and laypeople of the enormous importance of usability and user interface design. In the last ten years, much has become un...
Article
Full-text available
Increasing numbers of primary and secondary source texts in the humanities have been digitized in recent years. Humanities scholars who want to study these new collections in depth need computational assistance because of their large scale. We have built WordSeer, a text analysis tool that includes visualizations and works on the grammatical struct...
Conference Paper
What does the future hold for search user interfaces? Following on a recently completed book on this topic, this talk identifies some important trends in the use of information technology and suggests how these may affect search in the future. This includes a notable trend towards more "natural" user interfaces, a trend towards social rather than sol...
Conference Paper
Back in the heady days of 1999 and WWW8 (Toronto) we held a panel titled "Finding Anything in the Billion Page Web: Are Algorithms the Key?" In retrospect the answer to this question seems laughably obvious - the search industry has burgeoned on a foundation of algorithms, cloud computing and machine learning. As we move into the second decade of t...
Article
Faceted navigation is a proven technique for supporting exploration and discovery within an information collection. The underlying data model is simple enough to make navigation understandable while at the same time rich enough to make navigation flexible in a wide range of domains. Nonetheless, there remain issues in both the presentation of nav...
Conference Paper
This paper contributes to the study of self-presentation in online dating systems by performing a factor analysis on the text portions of online profiles. Findings include a similarity in the overall factor structures between male and female profiles, including a use of tentative words by men, which supports earlier findings that men femininize...
Conference Paper
The School of Information at UC Berkeley (also known as the iSchool) is an interdisciplinary program, and the newest professional program on the UC Berkeley campus. The program has 12 ladder faculty members, some of whom are shared with other departments on campus, and several prominent adjunct faculty. The educational component consists of a profe...
Conference Paper
Full-text available
Web search engines today typically show results as a list of titles and short snippets that summarize how the retrieved documents are related to the query. However, recent research suggests that longer summaries can be preferable for certain types of queries. This paper presents empirical evidence that judges can predict appropriate search re...
Conference Paper
Full-text available
We examine the recent information visualization phenomenon known as tag clouds, which are an interesting combination of data visualization, web design element, and social marker. Using qualitative methods, we find evidence that those who use tag clouds do so primarily because they are perceived as having an inherently social or personal compo...
Conference Paper
Full-text available
Online dating systems play a prominent role in the social lives of millions of their users, but little research has considered how users perceive one another through their personal profiles. We examined how users perceive attractiveness in online dating profiles, which provide their first exposure to a potential partner. Participants rated whole pr...
Conference Paper
Full-text available
The model of search as a turn-taking dialogue between the user and an intermediary has remained unchanged for decades. However, there is growing interest within the search community in evolving this model to support search-driven information exploration activities. So-called "exploratory search" describes a class of search activities that move bey...
Article
Citations have great potential to be a valuable resource in mining the bioscience literature (Nakov et al., 2004). The text around citations (or citances) tends to state biological facts with reference to the original papers that discovered them. The cited facts are typically stated in a more concise way in the citing papers than in the original....
Conference Paper
To build systems shielding users from fraudulent (or phishing) websites, designers need to know which attack strategies work and why. This paper provides the first empirical evidence about which malicious strategies are successful at deceiving general users. We first analyzed a large set of captured phishing attacks and developed a set of hypothese...
Article
The role of clustering and faceted categories to design interfaces for supporting information exploration is discussed. Clustering is fully automatable and can easily be applied to any text collection and is useful for clarifying vague queries. This method also works well for disambiguating ambiguous queries but lacks predictability and the difficu...
Article
We propose a cross-species approach for assigning Gene Ontology terms to LocusLink genes based on evidence extracted from biomedical journal articles. We make use of information from orthologous genes to derive and merge two sets of GO codes for a given target gene. For the first set, we restrict GO code assignments to be selected from only those c...
Article
In Fall 2004 I introduced a new course called Applied Natural Language Processing, in which students acquire an understanding of which text analysis techniques are currently feasible for practical applications. The class was intended for interdisciplinary students with a somewhat technical background. This paper describes the topics cover...
Conference Paper
Full-text available
Many aircraft accidents each year are caused by encounters with invisible airflow hazards. Recent advances in aviation sensor technology offer the potential for aircraft-based sensors that can gather large amounts of airflow velocity data in real-time. With this influx of data comes the need to study how best to present it to the pilot - a cognitiv...
Conference Paper
ImproViz is a visualization technique for diagramming music that brings to light the signature patterns of a jazz musician's improvisational style. ImproViz consists of two parts: (1) melodic landscapes show the general contours of musical phrasing; and (2) harmonic palettes represent the musician's tendency to use a particular combination of notes...
Conference Paper
In Fall 2004 I introduced a new course called Applied Natural Language Processing, in which students acquire an understanding of which text analysis techniques are currently feasible for practical applications. The class was intended for interdisciplinary students with a somewhat technical background. This paper describes the topics covered and the...
Article
This paper presents TextTiling, a method for partitioning full-length text documents into coherent multiparagraph units. The layout of text tiles is meant to reflect the pattern of subtopics contained in an expository text. The approach uses lexical analyses based on tf.idf, an information retrieval measurement, to determine the extent of the tiles...
Article
We discuss a method for augmenting and rearranging a structured lexicon in order to make it more suitable for a topic labeling task, by making use of lexical association information from a large text corpus. We first describe an algorithm for converting the hierarchical structure of WordNet [13] into a set of flat categories. We then use lexical co...
Article
Full-text available
We argue that the advent of large volumes of full-length text, as opposed to short texts like abstracts and newswire, should be accompanied by corresponding new approaches to information access. Toward this end, we discuss the merits of imposing structure on full-length text documents; that is, a partition of the text into coherent multi-paragraph...
Article
This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes.
Article
The Pk evaluation metric, initially proposed by Beeferman, Berger, and Lafferty (1997), is becoming the standard measure for assessing text segmentation algorithms. However, a theoretical analysis of the metric finds several problems: the metric penalizes false negatives more heavily than false positives, overpenalizes near misses, and is affected...
Article
Context-aware systems are ones that have a greater awareness of the physical and social worlds we live in. Such systems make use of sensing technologies, recognition algorithms, and wireless networking to enhance human safety, add convenience, improve efficiency, and augment how we find and remember information. However, context-aware systems are c...
Article
We are developing corpus-based techniques for identifying semantic relations at an intermediate level of description (more specific than those used in case frames, but more general than those used in traditional knowledge representation systems). In this paper we describe a classification algorithm for identifying relationships between two-word...
Article
Full-text available
This panel debates a topic that has been popping up recently as a consequence of different disciplines rubbing up against each other in a new field: can the quality of an information architecture be measured quantitatively? And if so, how can this analysis be verified? Information architects and HCI professionals already are discussing this issue re...
Article
The current state of web search is most successful at directing users to appropriate web sites. Once at the site, the user has a choice of following hyperlinks or using site search, but the latter is notoriously problematic. One solution is to develop specialized search interfaces that explicitly support the types of tasks users perform using the i...
Article
Two long, full-length texts are not likely to discuss all, or almost all, of the same subtopics or subpoints. Even if the documents contain many of the same terms, the ways the terms are grouped to form subtopical discussions still might be quite different. A solution is to create a description of a document which lists all of its subtopical discus...
Article
This paper presents TextTiling, a method for partitioning full-length text documents into coherent multiparagraph units. The layout of text tiles is meant to reflect the pattern of subtopics contained in an expository text. The approach uses lexical analyses based on tf.idf, an information retrieval measurement, to determine the extent of the tiles...
Article
We look at a controversy: the use of computers for automated and semiautomated grading of exams. K. Kukich, the director of the Natural Language Processing group at Educational Testing Service, provides an insider's view of the history of the field of automated essay grading and describes how ETS is currently using computer programs to supplement h...
Article
This paper introduces a novel user interface that integrates search and browsing of very large category hierarchies with their associated text collections. A key component is the separate but simultaneous display of the representations of the categories and the retrieved documents. Another key component is the display of multiple selected categorie...
Article
We describe a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. Two goals motivate the approach: (i) avoidance of the need for pre-encoded knowledge and (ii) applicability across a wide range of text. We identify a set of lexicosyntactic patterns that are easily recognizable, that occur frequently and acr...
Article
The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full text documents, and presents a visualization paradig...
Article
We show that two simple constraints, when applied to short user queries (on the order of 5-10 words), can yield precision scores comparable to or better than those achieved using long queries (50-85 words) at low document cutoff levels. These constraints are meant to detect documents that have subtopic passages that include the most important com...
Article
Full-text available
this document sample than one would expect by chance. The terms are selected according to a binomial likelihood ratio test [10], comparing their occurrence in the first 20 documents to their occurrence in the rest of the collection. The selected terms are then weighted in proportion to the significance of their occurrence in the sampled documents....
Article
This paper describes an accurate, relatively inexpensive method for the disambiguation of noun homographs using large text corpora. The algorithm checks the context surrounding the target noun against that of previously observed instances and chooses the sense for which the most evidence is found, where evidence consists of a set of orthographic, s...
Article
The transition to the next millennium gives us an opportunity to reflect on the past and project the future. In this spirit, we have asked a set of distinguished scholars and practitioners who were involved in AI's formative stages to describe, in just a few paragraphs, the most notable trend or controversy (or nontrend or noncontroversy) during...
Article
A Text-Based Intelligent System should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. Instead of focusing on discovering documents that pertain to some topic of interes...
Article
Full-text available
When people use computer-based tools to find answers to general questions, they often are faced with a daunting list of search results or "hits" returned by the search engine. Many search tools address this problem by helping users to make their searches more specific. However, when dozens or hundreds of documents are relevant to their question, us...
Article
Society and information technology are rapidly co-evolving, and often in surprising ways. The paper considers different views on how society and networked information technology are changing one another. Becoming socialized means learning what kinds of behaviour are appropriate in a given social situation. The increasing trend of digitizing and sto...
Conference Paper
Full-text available
Although search over World Wide Web pages has recently received much academic and commercial attention, surprisingly little research has been done on how to search the web pages within large, diverse intranets. Intranets contain the information associated with the internal workings of an organization. A standard search engine retrieves web pages...
Article
An important problem for information access systems is that of organizing large sets of documents that have been retrieved in response to a query. Text categorization and text clustering are two natural language processing tasks whose results can be applied to document organization. This chapter describes user interfaces that use categories and clu...
Article
Looks at one aspect of the trend toward more natural forms of interaction: the use of sketch-based interfaces for the design of intelligent systems. The authors contend that current computer interfaces are too formal and precise for creative tasks such as design. When working out ideas and brainstorming, people often sketch their thoughts informall...
Article
Examines three different innovations in electronic academic publishing of interest to practitioners in the field of intelligent systems: (1) JAIR (The Journal of Artificial Intelligence Research), an electronic journal by and for the AI research community; (2) medical publishing and AI; and (3) how can we get high-quality electronic journals?
Conference Paper
Two stages in measurement of techniques for information retrieval are gathering of documents for relevance assessment and use of the assessments to numerically evaluate effectiveness. We consider both of these stages in the context of the TREC experiments, ...
Article
We consider the status of Bayesian statistical methods for the analysis of complex real-world problems. Most AI problems require a method for representing uncertainty. Currently, there is intense interest in the use of Bayesian reasoning, Bayesian belief networks, and other Bayesian methods for the modeling of uncertainty in many avenues of AI rese...
Article
An important aspect of a complex intelligent system is the human-computer interface. The paper discusses the less familiar topic of audio interfaces. The controversy surrounds which of the three competing audio interface approaches is most effective: sonification, earcons, or auditory icons.
Article
TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the sub...

Citations

... The source article groups are collected using an existing news dataset [33] based on the live feed of around 20 international news sources in English, using an NLP-based clustering algorithm [31]. ...
... The CIMA collection [246] includes tutoring dialogues between crowd workers playing the role of students and tutors. The tutoring utterances include educational strategies, such as hint provision and questions asked to check the student's understanding. ...