Michael J. Cafarella
  • University of Michigan

About

Publications: 60
Reads: 11,072
Citations: 7,117
Current institution: University of Michigan

Publications (60)
Article
In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activities around the WebTables project itself, as well as the broad topic of informal online structured data. In this paper, we...
Preprint
Querying the content of images, video, and other non-textual data sources requires expensive content extraction methods. Modern extraction techniques are based on deep convolutional neural networks (CNNs) and can classify objects within images with astounding accuracy. Unfortunately, these methods are slow: processing a single image can take about...
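To make the cost asymmetry concrete, a common mitigation in this line of work is a filter cascade: a cheap model screens inputs so the expensive CNN runs only on survivors. The sketch below is illustrative only; cheap_score, expensive_cnn, and the rejection threshold are hypothetical stand-ins, not this preprint's actual system.

    # Two-stage filter cascade for expensive image predicates (illustrative;
    # cheap_score and expensive_cnn are hypothetical stand-ins).

    def cheap_score(image):
        # Hypothetical fast proxy, e.g. a tiny linear model over pixels.
        return sum(image) / len(image)

    def expensive_cnn(image):
        # Hypothetical slow, accurate classifier standing in for a deep CNN.
        return sum(image) / len(image) > 0.5

    def filtered_query(images, reject_below=0.3):
        # Run the CNN only on images the cheap filter cannot safely reject.
        hits = []
        for img in images:
            if cheap_score(img) < reject_below:
                continue                      # confident cheap rejection
            if expensive_cnn(img):
                hits.append(img)
        return hits

    print(filtered_query([[0.1, 0.2], [0.6, 0.9], [0.4, 0.8]]))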
Article
Data analysts often need to transform an existing dataset, such as with filtering, into a new dataset for downstream analysis. Even the most trivial of mistakes in this phase can introduce bias and lead to the formation of invalid conclusions. For example, consider a researcher identifying subjects for trials of a new statin drug. She might identif...
Article
The dark data extraction or knowledge base construction (KBC) problem is to populate a relational database with information from unstructured data sources, such as emails, webpages, and PDFs. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a syste...
Article
In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their...
Article
Nowcasting is the practice of using social media data to quantify ongoing real-world phenomena. It has been used by researchers to measure flu activity, unemployment behavior, and more. However, the typical nowcasting workflow requires either slow and tedious manual searching of relevant social media messages or automated statistical approaches tha...
Conference Paper
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data --- scientific papers, Web classified ads, customer service notes, and so on --- were instead in a relational d...
Data
EdgeBoost Paired With InfoMap. Performance of EdgeBoost (solid) and the baseline InfoMap algorithm (dashed) on LFR benchmarks. The purple shaded region shows the improvement of EdgeBoost for NMI. The bottom row of plots shows the relative error of the partition size. (TIF)
Data
Distribution of community sizes for Facebook ego networks. Nodes were given community labels by ego users as part of a user study. (TIF)
Data
EdgeBoost Paired With Surprise. Performance of EdgeBoost (solid) and the baseline Surprise algorithm (dashed) on LFR benchmarks. The purple shaded region shows the improvement of EdgeBoost for NMI. The bottom row shows the relative error of the partition size. (TIF)
Data
RE heat map of six community detection algorithms. The parameters μ and δ are represented on the x and y axes, respectively. Each square is labeled with the corresponding RE value. (TIF)
Data
EdgeBoost Paired With WalkTrap. Performance of EdgeBoost (solid) and the baseline WalkTrap algorithm (dashed) on LFR benchmarks. The purple shaded region shows the improvement of EdgeBoost for NMI. The bottom row shows the relative error of the partition size. (TIF)
Data
NMI heat map of six community detection algorithms. The parameters μ and δ are represented on the x and y axes, respectively. Each square is labeled with the corresponding NMI value. (TIF)
Data
EdgeBoost Paired With Significance. Performance of EdgeBoost (solid) and the baseline Significance algorithm (dashed) on LFR benchmarks. The purple shaded region shows the improvement of EdgeBoost for NMI. The bottom row shows the relative error of the partition size. (TIF)
Data
EdgeBoost Paired With Label-Propagation. Performance of EdgeBoost (solid) and the baseline Label-Propagation algorithm (dashed) on LFR benchmarks. The purple shaded region shows the improvement of EdgeBoost for NMI. The bottom row shows the relative error of the partition size. (TIF)
Article
Approximate kNN (k-nearest neighbor) techniques using binary hash functions are among the most commonly used approaches for overcoming the prohibitive cost of performing exact kNN queries. However, the success of these techniques largely depends on their hash functions' ability to distinguish kNN items; that is, the kNN items retrieved based on dat...
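For context, the baseline such methods build on is binary hashing, for example random-hyperplane (SimHash-style) codes. The sketch below shows that generic baseline, not the hash functions this paper proposes.

    import numpy as np

    # Random-hyperplane (SimHash-style) codes: a generic binary-hash kNN
    # baseline, not the learned hash functions the paper studies.

    rng = np.random.default_rng(0)

    def make_hasher(dim, n_bits):
        planes = rng.normal(size=(n_bits, dim))
        return lambda v: tuple(planes @ v > 0)      # n_bits-bit binary code

    def build_index(vectors, hasher):
        buckets = {}
        for i, v in enumerate(vectors):
            buckets.setdefault(hasher(v), []).append(i)
        return buckets

    def approx_knn(q, vectors, buckets, hasher, k=3):
        # Exact distances are computed only over the (small) colliding bucket.
        cand = buckets.get(hasher(q), [])
        return sorted(cand, key=lambda i: np.linalg.norm(vectors[i] - q))[:k]

    data = rng.normal(size=(1000, 16))
    h = make_hasher(16, 8)
    print(approx_knn(data[0], data, build_index(data, h), h))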
Article
Interactive visualizations are crucial in ad hoc data exploration and analysis. However, with the growing number of massive datasets, generating visualizations in interactive timescales is increasingly challenging. One approach for improving the speed of the visualization tool is via data reduction in order to reduce the computational overhead, but...
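The simplest form of data reduction is sampling before rendering. The sketch below uses a generic uniform reservoir sample purely to make the idea concrete; the paper's own reduction strategy is presumably visualization-aware, which this is not.

    import random

    # Uniform reservoir sample (Algorithm R) as a generic stand-in for
    # pre-plot data reduction.

    def reservoir_sample(stream, k, seed=0):
        rnd = random.Random(seed)
        sample = []
        for n, item in enumerate(stream):
            if n < k:
                sample.append(item)
            else:
                j = rnd.randint(0, n)   # item survives with probability k/(n+1)
                if j < k:
                    sample[j] = item
        return sample

    points = ((x, x * x) for x in range(1_000_000))
    print(len(reservoir_sample(points, k=1000)))   # plot 1,000 points, not 1M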
Article
End-to-end knowledge base construction systems using statistical inference are enabling more people to automatically extract high-quality domain-specific information from unstructured data. As a result of deploying the DeepDive framework across several domains, we found new challenges in debugging and improving such end-to-end systems to construct high...
Article
Full-text available
Many real networks that are inferred or collected from data are incomplete due to missing edges. Missing edges can be inherent to the dataset (Facebook friend links will never be complete) or the result of sampling (one may only have access to a portion of the data). The consequence is that downstream analyses that consume the network will often yi...
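A typical remedy in this setting is to impute likely missing edges with a link predictor before running downstream analyses. The snippet below shows one imputation round using a common-neighbors score; the single-round setup and the choice of predictor are illustrative assumptions, not the paper's full procedure (which aggregates results over many imputed networks).

    from itertools import combinations

    # One round of link-prediction-based edge imputation (illustrative only).

    def impute_edges(adj, n_add):
        # adj: dict mapping node -> set of neighbors (undirected graph).
        scored = []
        for u, v in combinations(adj, 2):
            if v not in adj[u]:
                scored.append((len(adj[u] & adj[v]), u, v))
        # Add the n_add missing edges with the most common neighbors.
        for _, u, v in sorted(scored, reverse=True)[:n_add]:
            adj[u].add(v)
            adj[v].add(u)
        return adj

    g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    print(impute_edges(g, n_add=1))   # adds the highest-scoring missing edge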
Conference Paper
Machine learning seems to be eating the world with a new breed of high-value data-driven applications in image analysis, search, voice recognition, mobile, and office productivity products. To paraphrase Mike Stonebraker, machine learning is no longer a zero-billion-dollar business. As the home of high-value, data-driven applications for over four...
Conference Paper
A large amount of data is available only through data-driven diagrams such as bar charts and scatterplots. These diagrams are stylized mixtures of graphics and text and are the result of complicated data-centric production pipelines. Unfortunately, neither text nor image search engines exploit these diagram-specific properties, making it difficult...
Patent
Full-text available
To implement open information extraction, a new extraction paradigm has been developed in which a system makes a single data-driven pass over a corpus of text, extracting a large set of relational tuples without requiring any human input. Using training data, a Self-Supervised Learner employs a parser and heuristics to determine criteria that will...
Article
Social media nowcasting--using online user activity to describe real-world phenomena--is an active area of research to supplement more traditional and costly data collection methods such as phone surveys. Given the potential impact of such research, we would expect general-purpose nowcasting systems to quickly become a standard tool among noncomput...
Article
Full-text available
A new generation of data processing systems, including web search, Google's Knowledge Graph, IBM's Watson, and several different recommendation systems, combine rich databases with software driven by machine learning. The spectacular successes of these trained systems have been among the most notable in all of computing and have generated exciteme...
Conference Paper
Web Data Management (or WDM) refers to a body of work concerned with leveraging the large collections of structured data that can be extracted from the Web. Over the past few years, several research and commercial efforts have explored these collections of data with the goal of improving Web search and developing mechanisms for surfacing different...
Article
Full-text available
The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style te...
Article
Google's Web Tables and Deep Web Crawler identify and deliver this otherwise inaccessible resource directly to end users.
Conference Paper
Full-text available
The MapReduce distributed programming framework is very popular, but currently lacks the optimization techniques that have been standard with relational database systems for many years. This paper proposes Manimal, which uses static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce pro...
Article
The Web contains a vast amount of structured information such as HTML tables, HTML lists and deep-web databases; there is enormous potential in combining and re-purposing this data in creative ways. However, integrating data from this relational web raises several challenges that are not addressed by current data integration systems or mash-up tool...
Article
Full-text available
Many of the largest database-driven web sites use custom web-scale data managers (WDMs). On the surface, these WDMs are being applied to problems that are well-suited for relational database systems. Some examples are the following: • Map-Reduce [5], Hadoop [7], and Dryad [9] are used to process queries on large data sets using sequential scan and...
Conference Paper
Recent research in domain-independent information extraction holds the promise of an automatically-constructed structured database derived from the Web. A query system based on this database would offer the same breadth as a Web search engine, but with much more sophisticated query tools than are common today. Unfortunately, these domain-independ...
Article
A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (t...
Article
The Semantic Web’s need for machine understandable content has led researchers to attempt to automatically acquire such content from a number of sources, including the web. To date, such research has focused on “document-driven” systems that individually process a small set of documents, annotating each with respect to a given ontology. This articl...
Article
Full-text available
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each rela...
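In miniature, the pipeline is: parse HTML, collect table elements, and keep those that look relational. The sketch below uses Python's stdlib parser and a crude rectangularity heuristic as a stand-in for the statistical classifiers the abstract describes.

    from html.parser import HTMLParser

    # Toy WebTables-style pass: extract <table> contents, then keep tables
    # that look relational (heuristic stand-in for the paper's classifiers).

    class TableGrabber(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tables, self._rows, self._row, self._cell = [], None, None, False

        def handle_starttag(self, tag, attrs):
            if tag == "table":
                self._rows = []
            elif tag == "tr" and self._rows is not None:
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._row.append("")
                self._cell = True

        def handle_data(self, data):
            if self._cell:
                self._row[-1] += data.strip()

        def handle_endtag(self, tag):
            if tag in ("td", "th"):
                self._cell = False
            elif tag == "tr" and self._row is not None:
                self._rows.append(self._row)
                self._row = None
            elif tag == "table" and self._rows is not None:
                self.tables.append(self._rows)
                self._rows = None

    def looks_relational(rows):
        # Keep multi-row, rectangular tables (every row has the same width).
        return len(rows) > 1 and len({len(r) for r in rows}) == 1

    page = ("<table><tr><th>city</th><th>pop</th></tr>"
            "<tr><td>Ann Arbor</td><td>123,851</td></tr></table>")
    p = TableGrabber()
    p.feed(page)
    print([t for t in p.tables if looks_relational(t)])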
Article
Full-text available
This article describes some of the ongoing research projects related to structured data management at Google today. The organization of Google encourages research scientists to work closely with engineering teams. As a result, the research projects tend to be motivated by real needs faced by Google's products and services, and solutions are put int...
Conference Paper
Full-text available
The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extr...
Conference Paper
Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag n...
Conference Paper
Traditional information extraction systems have focused on satisfying precise, narrow, pre-specified requests from small, homogeneous corpora. In contrast, the TextRunner system demonstrates a new kind of information extraction, called Open Information Extraction (OIE), in which the system makes a single, data-driven pass over the entire corpus and...
Conference Paper
Full-text available
Open Information Extraction (OIE) is a recently-introduced type of information extraction that extracts small individual pieces of data from input text without any domain-specific guidance such as special training data or extraction rules. For example, an OIE system might discover the triple (Frenzy, year, 1972) from a set of documents about mov...
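A toy illustration of the OIE output format: a single hand-written pattern pulling (arg1, relation, arg2) triples from raw text. Real OIE systems learn their extractors; the regex here is only a stand-in.

    import re

    # One hand-written pattern standing in for a learned OIE extractor.

    TRIPLE = re.compile(
        r"(?P<arg1>[A-Z][\w ]*?)\s+"
        r"(?P<rel>was directed by|was released in)\s+"
        r"(?P<arg2>[\w ]+)")

    def extract_triples(text):
        return [(m["arg1"], m["rel"], m["arg2"]) for m in TRIPLE.finditer(text)]

    print(extract_triples(
        "Frenzy was released in 1972. Frenzy was directed by Hitchcock."))
    # -> [('Frenzy', 'was released in', '1972'),
    #     ('Frenzy', 'was directed by', 'Hitchcock')]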
Conference Paper
We propose two new online methods for estimating the size of a backtracking search tree. The first method is based on a weighted sample of the branches visited by chronological backtracking. The second is a recursive method based on assuming that the ...
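The first method is in the spirit of Knuth's classic probing estimator: follow one random root-to-leaf path and multiply the branching factors seen along the way. A sketch under that assumption (not the paper's weighted-sample refinement):

    import random

    # Knuth-style probe: one random root-to-leaf walk yields an unbiased
    # tree-size estimate; averaging many probes reduces variance.

    def probe(children, node, rnd):
        estimate, weight = 1, 1
        while children(node):
            kids = children(node)
            weight *= len(kids)        # inverse probability of this path
            estimate += weight
            node = rnd.choice(kids)
        return estimate

    def estimate_tree_size(children, root, n_probes=1000, seed=0):
        rnd = random.Random(seed)
        return sum(probe(children, root, rnd) for _ in range(n_probes)) / n_probes

    # Complete binary tree of depth 3, encoded with heap indices 1..15.
    kids = lambda i: [2 * i, 2 * i + 1] if i < 8 else []
    print(estimate_tree_size(kids, root=1))   # exactly 15.0 on this tree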
Article
Full-text available
The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could en...
Conference Paper
The Semantic Web's need for machine understandable content has led researchers to attempt to automatically acquire such content from a number of sources, including the web. To date, such research has focused on "document-driven" systems that individually process a small set of documents, annotating each with respect to a given ontology. This pape...
Conference Paper
Many modern natural language-processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries---they are not well suited to support such applications. As a result, these applications are forced to issue mil...
Article
Full-text available
The Web contains a vast amount of text that can only be queried using simple keywords-in, documents-out search queries. But Web text often contains structured elements, such as hotel location and price pairs embedded in a set of hotel reviews. Queries that process these structural text elements would be much more powerful than our current document...
Article
The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract...
Conference Paper
Numerous NLP applications rely on search-engine queries, both to extract information from and to compute statistics over the Web corpus. But search engines often limit the number of available queries. As a result, query-intensive NLP applications such as Information Extraction (IE) distribute their query load over several days, making IE a sl...
Article
Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner.
Conference Paper
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a...
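KnowItAll's extraction step can be illustrated with a Hearst-style pattern such as "C such as X, Y". The sketch below shows that step only; the system's search-engine-based confidence assessment is omitted.

    import re

    # One Hearst-style pattern in the KnowItAll spirit (illustrative only).

    def instances_of(class_name, text):
        pat = re.compile(rf"{class_name}\s+such as\s+((?:[A-Z][\w.]*,?\s*)+)")
        names = []
        for m in pat.finditer(text):
            names += [w.strip(",") for w in m.group(1).split() if w[0].isupper()]
        return names

    doc = "She cited scientists such as Curie, Einstein and several athletes."
    print(instances_of("scientists", doc))   # ['Curie', 'Einstein']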
Article
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KNOWITAL...
Article
Search engines are as critical to Internet use as any other part of the network infrastructure, but they differ from other components in two important ways. First, their internal workings are secret, unlike, say, the workings of the DNS (domain name system). Second, they hold political and cultural power, as users increasingly rely on them to navig...
Conference Paper
Our KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an autonomous, domain-independent, and scalable manner. In its first major run, KNOWITALL extracted over 50,000 facts with high precision, but suggested a challenge: How can we improve K...
Article
Facts are naturally organized in terms of entities, classes, and their relationships as in an entity-relationship diagram or a semantic network. Search engines have eschewed such structures because, in the past, their creation and processing have not been practical at Web scale. This paper introduces the extraction graph, a textual approximation t...