Panos Ipeirotis

PhD, Columbia University 2004
Current institution: New York University | NYU
Current position: Professor (Associate)

    Research Items (129)
    The emergence of online paid micro-crowdsourcing platforms, such as Amazon Mechanical Turk, allows on-demand and at-scale distribution of tasks to human workers around the world. In such settings, online workers come and complete small tasks posted by employers, working for as long or as little as they wish, a process that eliminates the overhead of hiring (and dismissal). This flexibility introduces a different set of inefficiencies: verifying the quality of every submitted piece of work is an expensive operation that often requires the same level of effort as performing the task itself. A number of research challenges arise in such settings. How can we ensure that the submitted work is accurate? What allocation strategies can be employed to make the best use of the available labor force? How can we appropriately assess the performance of individual workers? In this paper, we consider labeling tasks and develop a comprehensive scheme for managing the quality of crowd labeling: First, we present several algorithms for inferring the true classes of objects and the quality of participating workers, assuming the labels are collected all at once before the inference. Next, we allow employers to adaptively decide which object to assign to the next arriving worker and propose several heuristic-based dynamic label allocation strategies to achieve the desired data quality with significantly fewer labels. Experimental results on both simulated and real data confirm the superior performance of the proposed allocation strategies over other existing policies. Finally, we introduce two novel metrics that can be used to objectively rank the performance of crowdsourced workers after fixing correctable worker errors and taking into account the costs of different classification errors. In particular, the worker value metric directly measures the monetary value contributed by each label of a worker toward meeting the quality requirements and provides a basis for the design of fair and efficient compensation schemes.
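A minimal sketch of the kind of inference the first part of this abstract describes: alternate between estimating each object's class from the collected labels and estimating each worker's confusion matrix from the current class estimates (an EM scheme in the spirit of Dawid and Skene). The data layout, smoothing constants, and iteration count below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def em_label_aggregation(labels, n_classes, n_iter=50):
    """labels: list of (worker_id, object_id, label) triples with 0-based ids.
    Returns class posteriors per object and a confusion matrix per worker."""
    W = max(w for w, _, _ in labels) + 1
    O = max(o for _, o, _ in labels) + 1
    K = n_classes

    # Initialize the class posteriors with a soft majority vote per object.
    post = np.full((O, K), 1e-6)
    for _, o, l in labels:
        post[o, l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: re-estimate worker confusion matrices and class priors.
        conf = np.full((W, K, K), 1e-6)          # conf[w, true_class, given_label]
        for w, o, l in labels:
            conf[w, :, l] += post[o]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = post.mean(axis=0)

        # E-step: posterior over each object's true class given all its labels.
        log_post = np.tile(np.log(prior), (O, 1))
        for w, o, l in labels:
            log_post[o] += np.log(conf[w, :, l])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf

# Tiny illustration: worker 2 disagrees with the mutually consistent workers 0 and 1.
votes = [(0, 0, 1), (1, 0, 1), (2, 0, 0),
         (0, 1, 0), (1, 1, 0), (2, 1, 1)]
posteriors, confusion = em_label_aggregation(votes, n_classes=2)
print(np.round(posteriors, 2))
```

The estimated confusion matrices are also the raw material for the worker-value style metrics mentioned at the end of the abstract, since they separate systematic disagreement from random error.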
    Searching digital libraries refers to searching and retrieving information from remote databases of digitized or digital objects. These databases may hold either the metadata for an object of interest (e.g., author and title), or a complete object such as a book or a video.
    When Jim Cramer offers investment advice on his CNBC show Mad Money, he influences market prices (Engelberg et al., 2009). By analyzing text from transcripts of the show, we explore the relationship between what Cramer says and the magnitude and direction of his price effect. We demonstrate that Cramer’s influence is more complex than simply drawing investor attention to particular stocks and is in fact related to the content of his recommendations. A cursory viewing of Mad Money reveals that Cramer generally provides no new information about stocks, but instead argues that they may be mispriced by investors with access to identical information. The puzzle of the Cramer effect is why, despite containing little new information about stock fundamentals, Cramer’s advice influences investors to alter their valuations and thus the stock price. An intuitive explanation is that markets are informationally incomplete, that investors are not aware of all the securities they could trade, and that when Cramer recommends a stock, he simply draws attention to it. Had investors known about the stock, they would have incorporated this knowledge into their decisions and the stock would have been priced appropriately. Merton (1987) formalized this explanation in his “investor recognition hypothesis.” In his model, stocks with low investor recognition earn higher returns to compensate holders for being imperfectly diversified. Indeed, stocks with no media coverage earn higher returns when controlling for common risk factors (Fang and Peress, 2008), and increased investor attention to a particular Cramer recommendation (as measured by Nielsen television ratings) significantly increases the market’s response to Cramer’s advice (Engelberg et al., 2009). The story behind this hypothesis is that Cramer simply draws attention to stocks which lacked investor awareness and were therefore earning higher returns. Another potential explanation for the Cramer effect is that markets are affected by noise traders who, unlike rational investors who only consider fundamentals, irrationally act on noise coming from media coverage, pundits, and their own generally uninformed research (DeLong et al., 1990). These noise traders are swayed by media content that expresses optimistic or pessimistic sentiment about stocks without providing any new information on fundamentals. There is some empirical evidence that media content affects stock prices. For example, Tetlock (Forthcoming) conducted a simple binary text analysis of a daily Wall Street Journal column and found, consistent with the theoretical predictions of DeLong et al. (1990), that pessimistic media content induces downward pressure on stock prices and that the price impact of this pressure reverses itself over time. A similar trend is evident in the price impact of Cramer’s recommendations. When he mentions a stock on his show, it initially undergoes a significant price change which reverses over the next 30 days (Engelberg et al., 2009). As Cramer rarely discusses obscure stocks, it could be that the magnitude and direction of his influence on the market is not simply attentional, but rather related to the content of what he says: essentially, that the content of his recommendations creates changes in sentiment that move the market.
To explore the source of Cramer’s price effect and to extend work on sentiment analysis beyond simple binary characterizations of positive and negative coverage, we constructed a model of Cramer’s influence on investor sentiment based on content features derived from Mad Money transcripts. Applying recent developments in generative text analysis (Blei et al., 2003), we estimated posterior probabilities that Cramer discussed specific topics in his recommendations and assessed the relative impact of these different topics on the magnitude and direction of Cramer’s influence on stock prices. Our analysis suggests that the topics of Cramer’s discourse explain a significant amount of the variance in the abnormal returns generated the day after he recommends a stock. The results imply that Cramer is more influential when he presents specific kinds of arguments or discusses particular rationales for investments, demonstrating the influence of topical information content on individual economic decisions and aggregate market outcomes.
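A hedged sketch of the analysis pipeline described above: fit LDA to the transcript of each recommendation to obtain topic proportions, then regress next-day abnormal returns on those proportions. The scikit-learn components are standard; the transcript snippets, return values, and the choice of 25 topics are placeholders rather than the paper's specification.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

# Placeholder inputs: one transcript segment per recommendation, aligned with
# the next-day abnormal return of the recommended stock (values invented).
transcripts = ["text of one Mad Money recommendation about earnings and management",
               "text of another recommendation about sector momentum and valuation"]
abnormal_returns = np.array([0.021, -0.004])

# 1. Estimate topic proportions for each recommendation with LDA.
counts = CountVectorizer(stop_words="english", max_features=5000).fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=25, random_state=0)
topic_props = lda.fit_transform(counts)          # each row sums to ~1

# 2. Regress next-day abnormal returns on topic proportions to gauge how much
#    of the price effect the content of the speech explains.
reg = LinearRegression().fit(topic_props, abnormal_returns)
print("R^2 of topics vs. abnormal returns:", reg.score(topic_props, abnormal_returns))
```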
    Online workplaces such as oDesk, Amazon Mechanical Turk, and TaskRabbit have been growing in importance over the last few years. In such markets, employers post tasks on which remote contractors work and deliver the product of their work online. As in most online marketplaces, reputation mechanisms play a very important role in facilitating transactions, since they instill trust and are often predictive of the employer's future satisfaction. However, labor markets are usually highly heterogeneous in terms of available task categories; in such scenarios, past performance may not be an accurate signal of future performance. To account for this natural heterogeneity, in this work, we build models that predict the performance of a worker based on prior, category-specific feedback. Our models assume that each worker has a category-specific quality, which is latent and not directly observable; what is observable, though, is the set of feedback ratings of the worker and of other contractors with similar work histories. Based on this information, we provide a series of models of increasing complexity that successfully estimate the worker's quality. We start by building a binomial model and a multinomial model under the implicit assumption that the latent qualities of the workers are static. Next, we remove this assumption, and we build linear dynamic systems that capture the evolution of these latent qualities over time. We evaluate our models on a large corpus of over a million transactions (completed tasks) from oDesk, an online labor market with hundreds of millions of dollars in transaction volume. Our results show an improved accuracy of up to 25% compared to feedback baselines and significant improvement over the commonly used collaborative filtering approach. Our study clearly illustrates that reputation systems should present different reputation scores, depending on the context in which the worker has been previously evaluated and the job for which the worker is applying.
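A minimal sketch of the simplest model in the progression described here: a static, binomial model in which each worker's category-specific quality is a latent success probability with a shared Beta prior, updated from binary feedback. The prior pseudo-counts and example ratings are invented for illustration.

```python
from collections import defaultdict

# Hypothetical global prior on quality; in practice it could be fit from all workers.
ALPHA0, BETA0 = 2.0, 2.0

def category_quality(feedback):
    """feedback: list of (worker, category, success) with success in {0, 1}.
    Returns the posterior mean quality per (worker, category) under a
    Beta-Binomial model with a shared Beta(ALPHA0, BETA0) prior."""
    counts = defaultdict(lambda: [0, 0])        # (worker, category) -> [successes, trials]
    for worker, cat, success in feedback:
        counts[(worker, cat)][0] += success
        counts[(worker, cat)][1] += 1
    return {key: (ALPHA0 + s) / (ALPHA0 + BETA0 + n)
            for key, (s, n) in counts.items()}

ratings = [("w1", "web-dev", 1), ("w1", "web-dev", 1), ("w1", "writing", 0)]
print(category_quality(ratings))
# w1 looks strong in web-dev but, with a single bad rating, weaker in writing.
```

The dynamic variants in the abstract replace the static latent quality with a state that evolves over time (e.g., a linear dynamic system), but the category-specific structure stays the same.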
    The emergence of online labor platforms, online crowdsourcing sites, and even Massive Open Online Courses (MOOCs), has created an increasing need for reliably evaluating the skills of the participating users (e.g., “does a candidate know Java”) in a scalable way. Many platforms already allow job candidates to take online tests to assess their competence in a variety of technical topics. However, the existing approaches face many problems. First, cheating is very common in online testing without supervision, as the test questions often “leak” and become easily available online along with the answers. Second, technical skills, such as programming, require the tests to be frequently updated in order to reflect the current state-of-the-art. Third, there is very limited evaluation of the tests themselves, and of how effectively they measure the skill that the users are tested for.
    We introduce tools and methodologies to collect high quality, large scale fine-grained computer vision datasets using citizen scientists - crowd annotators who are passionate and knowledgeable about specific domains such as birds or airplanes. We worked with citizen scientists and domain experts to collect NABirds, a new high quality dataset containing 48,562 images of North American birds with 555 categories, part annotations and bounding boxes. We find that citizen scientists are significantly more accurate than Mechanical Turkers at zero cost. We worked with bird experts to measure the quality of popular datasets like CUB-200-2011 and ImageNet and found class label error rates of at least 4%. Nevertheless, we found that learning algorithms are surprisingly robust to annotation errors and this level of training data corruption can lead to an acceptably small increase in test error if the training set has sufficient size. At the same time, we found that an expert-curated high quality test set like NABirds is necessary to accurately measure the performance of fine-grained computer vision systems. We used NABirds to train a publicly available bird recognition service deployed on the web site of the Cornell Lab of Ornithology.
    In crowdsourcing systems, the interests of contributing participants and system stakeholders are often not fully aligned. Participants seek to learn, be entertained, and perform easy tasks, which offer them instant gratification; system stakeholders want users to complete more difficult tasks, which bring higher value to the crowdsourced application. We directly address this problem by presenting techniques that optimize the crowdsourcing process by jointly maximizing the user longevity in the system and the true value that the system derives from user participation. We first present models that predict the "survival probability" of a user at any given moment, that is, the probability that a user will proceed to the next task offered by the system. We then leverage this survival model to dynamically decide what task to assign and what motivating goals to present to the user. This allows us to jointly optimize for the short term (getting difficult tasks done) and for the long term (keeping users engaged for longer periods of time). We show that dynamically assigning tasks significantly increases the value of a crowdsourcing system. In an extensive empirical evaluation, we observed that our task allocation strategy increases the amount of information collected by up to 117.8%. We also explore the utility of motivating users with goals. We demonstrate that setting specific, static goals can be highly detrimental to the long-term user participation, as the completion of a goal (e.g., earning a badge) is also a common drop-off point for many users. We show that setting the goals dynamically, in conjunction with judicious allocation of tasks, increases the amount of information collected by the crowdsourcing system by up to 249%, compared to the existing baselines that use fixed objectives.
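A small sketch of the two pieces described above, under assumed features and values: a logistic "survival" model for whether a user continues after being shown a task, and a one-step look-ahead allocation rule that trades off a task's immediate value against the survival-weighted value of keeping the user engaged. The feature set, training data, and continuation value are placeholders, not the paper's estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 1. Survival model: does the user continue after being shown a task?
#    Features (task difficulty, tasks completed so far) and the tiny training
#    set are stand-ins for real session logs.
X_history = np.array([[0.2, 1], [0.8, 1], [0.3, 5], [0.9, 5]])
continued = np.array([1, 0, 1, 1])
survival_model = LogisticRegression().fit(X_history, continued)

def pick_task(candidate_tasks, tasks_done, continuation_value=5.0):
    """candidate_tasks: list of (task_id, difficulty, immediate_value).
    One-step look-ahead heuristic: choose the task that maximizes immediate
    value plus the survival-weighted value of keeping the user in the system."""
    def expected_value(task):
        _, difficulty, value = task
        p_stay = survival_model.predict_proba([[difficulty, tasks_done]])[0, 1]
        return value + p_stay * continuation_value
    return max(candidate_tasks, key=expected_value)

print(pick_task([("easy", 0.2, 1.0), ("hard", 0.9, 4.0)], tasks_done=3))
```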
    This paper studies the relation between activity on Twitter and sales. While research exists into the relation between Tweets and movie and book sales, this paper shows that the same relations do not hold for products that receive less attention on social media. For such products, classification of Tweets is far more important to determine a relation. Also, for such products advanced statistical relations, in addition to correlation, are required to relate Twitter activity and sales. In a case study that involves Tweets and sales from a company in four countries, the paper shows how, by classifying Tweets, such relations can be identified. In particular, the paper shows evidence that positive Tweets by persons (as opposed to companies) can be used to forecast sales and that peaks in positive Tweets by persons are strongly related to an increase in sales. These results can be used to improve sales forecasts and to increase sales in marketing campaigns.
    We present techniques for gathering data that expose errors of automatic predictive models. In certain common settings, traditional methods for evaluating predictive models tend to miss rare but important errors—most importantly, cases for which the model is confident of its prediction (but wrong). In this article, we present a system that, in a game-like setting, asks humans to identify cases that will cause the predictive model-based system to fail. Such techniques are valuable in discovering problematic cases that may not reveal themselves during the normal operation of the system and may include cases that are rare but catastrophic. We describe the design of the system, including design iterations that did not quite work. In particular, the system incentivizes humans to provide examples that are difficult for the model to handle by providing a reward proportional to the magnitude of the predictive model's error. The humans are asked to “Beat the Machine” and find cases where the automatic model (“the Machine”) is wrong. Experiments show that the humans using Beat the Machine identify more errors than do traditional techniques for discovering errors in predictive models, and, indeed, they identify many more errors where the machine is (wrongly) confident it is correct. Furthermore, those cases the humans identify seem to be not simply outliers, but coherent areas missed completely by the model. Beat the Machine identifies the “unknown unknowns.” Beat the Machine has been deployed at an industrial scale by several companies. The main impact has been that firms are changing their perspective on and practice of evaluating predictive models. “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.” --Donald Rumsfeld
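A hedged sketch of the reward rule described here: pay the human in proportion to the predictive model's error on the submitted case, so that confidently wrong cases pay the most and correctly handled cases pay nothing. The cutoff and dollar scale are illustrative, not the deployed system's.

```python
def btm_reward(model_prob_positive, true_label, max_reward=5.00):
    """Pay for a submitted case in proportion to the model's error on it.

    model_prob_positive: the model's predicted probability that the case is
    positive; true_label: the human-verified label (0 or 1). Cases the model
    essentially gets right earn nothing; confidently wrong cases earn the most.
    The 0.5 cutoff and payout scale are assumptions for illustration."""
    error = abs(true_label - model_prob_positive)
    return round(max_reward * error, 2) if error > 0.5 else 0.0

print(btm_reward(0.95, 0))   # confidently wrong: pays 4.75
print(btm_reward(0.60, 0))   # mildly wrong:      pays 3.00
print(btm_reward(0.10, 0))   # model was right:   pays 0.00
```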
    In an online labor marketplace, employers post jobs, receive freelancer applications and make hiring decisions. These hiring decisions are based on the freelancer’s observed (e.g., education) and latent (e.g., ability) characteristics. Because of the heterogeneity that appears in the observed characteristics, and the existence of latent ones, identifying and hiring the best possible applicant is a very challenging task. In this work we study and model the employer’s hiring behavior. We assume that employers are utility maximizers and make rational decisions by hiring the best possible applicant at hand. Based on this premise, we propose a series of probabilistic models that estimate the hiring probability of each applicant. We train and test our models on more than 600,000 job applications obtained from oDesk.com, and we show evidence that the proposed models outperform currently in-use baselines. To get further insights, we conduct an econometric analysis and observe that the attributes that are strongly correlated with the hiring probability are whether or not the freelancer and the employer have previously worked together, the available information on the freelancer’s profile, the countries of the employer and the freelancer, and the skillset of the freelancer. Finally, we find that the faster a freelancer applies to an opening, the higher the probability of getting the job.
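A sketch of the utility-maximizing employer model described above, cast as a conditional logit: each applicant receives a utility score from observed attributes, and the hiring probability is the softmax over the applicants to the same opening. The coefficients and feature names are invented for illustration, not estimates from the oDesk data.

```python
import numpy as np

# Illustrative coefficients for a conditional-logit hiring model; in the paper
# these would be estimated from historical applications.
COEF = {"worked_together_before": 2.1, "profile_completeness": 0.8,
        "same_country": 0.4, "hours_until_application": -0.05}

def hiring_probabilities(applicants):
    """applicants: list of dicts with the feature names above.
    Returns the softmax over the applicants' utility scores, i.e. the
    probability that a utility-maximizing employer hires each one."""
    utilities = np.array([sum(COEF[k] * a[k] for k in COEF) for a in applicants])
    expu = np.exp(utilities - utilities.max())     # numerically stabilized softmax
    return expu / expu.sum()

opening = [
    {"worked_together_before": 1, "profile_completeness": 0.9,
     "same_country": 0, "hours_until_application": 2},
    {"worked_together_before": 0, "profile_completeness": 1.0,
     "same_country": 1, "hours_until_application": 30},
]
print(hiring_probabilities(opening))
```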
    We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user's competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement. Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.
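A minimal sketch of the information-theoretic accounting mentioned above, under simplifying assumptions: when a user of estimated competence answers an "unknown answer" question, the contribution can be measured as the entropy reduction of the posterior over candidate answers after a Bayesian update. The uniform-error likelihood is an assumption of this sketch, not necessarily the system's exact model.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def information_gain(prior, answer_idx, user_accuracy):
    """prior: current probability distribution over the candidate answers of a
    question without a known answer; user_accuracy: estimated probability that
    this user answers correctly (from calibration questions). Returns the
    entropy reduction after a Bayesian update on the user's answer."""
    k = len(prior)
    # Likelihood of the observed answer under each candidate truth: the user
    # picks the truth with prob user_accuracy, otherwise errs uniformly.
    like = [user_accuracy if i == answer_idx else (1 - user_accuracy) / (k - 1)
            for i in range(k)]
    post = [l * p for l, p in zip(like, prior)]
    z = sum(post)
    post = [p / z for p in post]
    return entropy(prior) - entropy(post)

# A 90%-accurate user picking option 2 out of four resolves ~1.4 bits.
print(information_gain([0.25] * 4, answer_idx=2, user_accuracy=0.9))
```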
    The largest publicly available knowledge repositories, such as Wikipedia and Freebase, owe their existence and growth to volunteer contributors around the globe. While the majority of contributions are correct, errors can still creep in, due to editors' carelessness, misunderstanding of the schema, malice, or even lack of accepted ground truth. If left undetected, inaccuracies often degrade the experience of users and the performance of applications that rely on these knowledge repositories. We present a new method, CQUAL, for automatically predicting the quality of contributions submitted to a knowledge base. Significantly expanding upon previous work, our method holistically exploits a variety of signals, including the user's domains of expertise as reflected in her prior contribution history, and the historical accuracy rates of different types of facts. In a large-scale human evaluation, our method exhibits precision of 91% at 80% recall. Our model verifies whether a contribution is correct immediately after it is submitted, significantly alleviating the need for post-submission human reviewing.
    A large number of organizations today generate and share textual descriptions of their products, services, and actions. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction of structured relations, they are often expensive and inaccurate, especially when operating on top of text that does not contain any instances of the targeted structured information. We present a novel alternative approach that facilitates the generation of the structured metadata by identifying documents that are likely to contain information of interest, information that will subsequently be useful for querying the database. Our approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata when such information actually exists in the document, instead of naively prompting users to fill in forms with information that is not available in the document. As a major contribution of this paper, we present algorithms that identify structured attributes that are likely to appear within the document, by jointly utilizing the content of the text and the query workload. Our experimental evaluation shows that our approach generates superior results compared to approaches that rely only on the textual content or only on the query workload to identify attributes of interest.
    In this work we define the utility of having a certain skill in an online labor market (OLM), and we propose that this utility is strongly correlated with the level of expertise of a given worker. However, the actual level of expertise for a given skill and a given worker is both latent and dynamic. What is observable is a series of characteristics that are intuitively correlated with the level of expertise of a given skill. We propose to build a Hidden Markov Model (HMM), which estimates the latent and dynamic levels of expertise, based on the observed characteristics. We build and evaluate our approaches on a unique transactional dataset from oDesk.com. Finally, we estimate the utility of a series of skills and discuss how certain skills (e.g. 'editing') provide a higher expected payoff, once a person masters them, than others (e.g. 'microsoftexcel').
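A minimal sketch of the HMM filtering step, with made-up transition and emission matrices: three discrete expertise levels evolve over time, per-contract feedback scores are emitted from the current level, and the forward algorithm yields the filtered belief over the worker's latent expertise after each observation. The matrices would be learned from data in the paper; here they are assumptions.

```python
import numpy as np

# Three latent expertise levels: novice, intermediate, expert (illustrative).
TRANS = np.array([[0.90, 0.10, 0.00],      # expertise tends to drift upward slowly
                  [0.05, 0.85, 0.10],
                  [0.00, 0.05, 0.95]])
# Emission: probability of observing a low / medium / high feedback score.
EMIT = np.array([[0.60, 0.30, 0.10],
                 [0.20, 0.50, 0.30],
                 [0.05, 0.25, 0.70]])
INIT = np.array([0.70, 0.25, 0.05])

def filtered_expertise(observations):
    """Forward algorithm: returns P(expertise level | observations so far)
    after each observed feedback score (0 = low, 1 = medium, 2 = high)."""
    belief = INIT * EMIT[:, observations[0]]
    belief /= belief.sum()
    history = [belief]
    for obs in observations[1:]:
        belief = (TRANS.T @ belief) * EMIT[:, obs]
        belief /= belief.sum()
        history.append(belief)
    return history

for step, b in enumerate(filtered_expertise([0, 1, 2, 2, 2]), 1):
    print(f"after contract {step}: {np.round(b, 3)}")
```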
    Many online services like Twitter and GNIP offer streaming programming interfaces that allow real-time information filtering based on keyword or other conditions. However, all these services specify strict access constraints, or charge a cost based on the usage. We refer to such streams as "hidden streams" to draw a parallel to the well-studied hidden Web, which similarly restricts access to the contents of a database through a querying interface. At the same time, the users' interest is often captured by complex classification models that, implicitly or explicitly, specify hundreds of keyword-based rules, along with the rules' accuracies. In this paper, we study how to best utilize a constrained streaming access interface to maximize the number of retrieved relevant items, with respect to a classifier, expressed as a set of rules. We consider two problem variants. The static version assumes that the popularity of the keywords is known and constant across time. The dynamic version lifts this assumption, and can be viewed as an exploration-vs.-exploitation problem. We show that both problems are NP-hard, and propose exact and bounded approximation algorithms for various settings, including various access constraint types. We experimentally evaluate our algorithms on real Twitter data.
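A hedged sketch of the static variant under one simple constraint type (a cap on the number of tracked keywords): greedily fill the slots with the classifier rules expected to deliver the most relevant items per hour, combining keyword popularity with rule precision. This simplification treats rules as non-overlapping and is not the paper's exact algorithm; the rules and numbers are invented.

```python
def select_keywords(rules, max_keywords):
    """rules: list of (keyword, matches_per_hour, precision) derived from a
    keyword-based classifier. Greedily fills the limited keyword slots of a
    streaming API with the rules expected to yield the most relevant items
    per hour (simplification: rules are assumed not to overlap)."""
    ranked = sorted(rules, key=lambda r: r[1] * r[2], reverse=True)
    chosen, expected_relevant = [], 0.0
    for keyword, volume, precision in ranked:
        if len(chosen) == max_keywords:
            break
        chosen.append(keyword)
        expected_relevant += volume * precision
    return chosen, expected_relevant

rules = [("world cup", 5000, 0.10), ("#javascript", 300, 0.85),
         ("node.js", 120, 0.90), ("coding bootcamp", 80, 0.70)]
print(select_keywords(rules, max_keywords=2))
```

Under constraints on the total number of retrieved items (rather than on keyword slots), a high-volume, low-precision keyword such as "world cup" becomes much less attractive, which is one reason the paper distinguishes between access constraint types.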
    A system combines inputs from human processing and machine processing, and employs machine learning to improve processing of individual tasks based on comparison of human processing results. Once performance of a particular task by machine processing reaches a threshold, the level of human processing used on that task is reduced.
    Online labor markets such as oDesk and Amazon Mechanical Turk have been growing in importance over the last few years. In these markets, employers post tasks on which remote contractors work and deliver the product of their work. As in most online marketplaces, reputation mechanisms play a very important role in facilitating transactions, since they instill trust and are often predictive of the future satisfaction of the employer. However, labor markets are usually highly heterogeneous in terms of available task categories; in such scenarios, past performance may not be a representative signal of future performance. To account for this heterogeneity, in our work, we build models that predict the performance of a worker based on prior, category-specific feedback. Our models assume that each worker has a category-specific quality, which is latent and not directly observable; what is observable, though, is the set of feedback ratings of the worker and of other contractors with similar work histories. Based on this information, we build a multi-level, hierarchical scheme that deals effectively with the data sparseness, which is inherent in many cases of interest (i.e., contractors with relatively brief work histories). We evaluate our models on a large corpus of real transactional data from oDesk, an online labor market with hundreds of millions of dollars in transaction volume. Our results show an improved accuracy of up to 47% compared to the existing baseline.
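A minimal sketch of the multi-level idea under assumed pooling weights: a contractor's category-specific estimate is shrunk toward the contractor's overall mean, which is itself shrunk toward the marketplace mean, so sparse category histories borrow strength from the levels above. The pseudo-counts and ratings below are illustrative, not fitted values.

```python
def shrunk_quality(cat_scores, all_scores, market_mean, k_cat=5.0, k_worker=10.0):
    """Hierarchical (empirical-Bayes-style) estimate of a worker's quality in
    one category. cat_scores: feedback in that category; all_scores: all of the
    worker's feedback; k_cat / k_worker: pseudo-counts controlling how quickly
    the estimate moves away from the parent level."""
    worker_mean = ((sum(all_scores) + k_worker * market_mean)
                   / (len(all_scores) + k_worker))
    return ((sum(cat_scores) + k_cat * worker_mean)
            / (len(cat_scores) + k_cat))

# A contractor with a long, strong history overall but only one (bad) rating
# in a new category is not immediately written off in that category.
print(shrunk_quality(cat_scores=[2.0],
                     all_scores=[5.0] * 30 + [2.0],
                     market_mean=4.3))
```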
    In this paper, we study the effects of three different kinds of search engine rankings on consumer behavior and search engine revenues: direct ranking effect, interaction effect between ranking and product ratings, and personalized ranking effect. We combine a hierarchical Bayesian Model estimated on approximately one million online sessions from Travelocity, together with randomized experiments using a real-world hotel search engine application. Our archival data analysis and randomized experiments are consistent in demonstrating the following: (1) a consumer utility-based ranking mechanism can lead to a significant increase in overall search engine revenue. (2) Significant interplay occurs between search engine ranking and product ratings. An inferior position on the search engine affects “higher-class” hotels more adversely. On the other hand, hotels with a lower customer rating are more likely to benefit from being placed on the top of the screen. These findings illustrate that product search engines could benefit from directly incorporating signals from social media into their ranking algorithms. (3) Our randomized experiments also reveal that an “active” (wherein users can interact with and customize the ranking algorithm) personalized ranking system leads to higher clicks but lower purchase propensities and lower search engine revenue compared to a “passive” (wherein users cannot interact with the ranking algorithm) personalized ranking system. This result suggests that providing more information during the decision-making process may lead to fewer consumer purchases because of information overload. Therefore, product search engines should not adopt personalized ranking systems by default. Overall, our study unravels the economic impact of ranking and its interaction with social media on product search engines.
    The emergence of online paid micro-crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), allows on-demand and at scale distribution of tasks to human workers around the world. In such settings, online workers come and complete small tasks posted by a company, working for as long or as little as they wish. Such temporary employer-employee relationships give rise to adverse selection, moral hazard, and many other challenges. How can we ensure that the submitted work is accurate, especially when the verification cost is comparable to the cost of performing the task? How can we estimate the exhibited quality of the workers? What pricing strategies should be used to induce the effort of workers with varying ability levels? We develop a comprehensive framework for managing the quality in such micro crowdsourcing settings: First, we describe an algorithm for estimating the error rates of the participating workers, and show how to separate systematic worker biases from unrecoverable errors and generate an unbiased “worker quality” measurement. Next, we present a selective repeated-labeling algorithm that acquires labels in a way so that quality requirements can be met at minimum cost. Then, we propose a quality-adjusted pricing scheme that adjusts the payment level according to the contributed value by each worker. We test our compensation scheme in a principal-agent setting in which workers respond to incentives by varying their effort. Our simulation results demonstrate that the proposed pricing scheme is able to induce workers to exert higher levels of effort and yield larger profits for employers compared to the commonly adopted uniform pricing schemes. We also describe strategies that build on our quality control and pricing framework, to tackle crowdsourced tasks of increasingly higher complexity, while still maintaining a tight quality control of the process.
    This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire. For certain label-quality/cost regimes, the benefit is substantial.
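A small sketch of the selective part, using one simplified notion of uncertainty: score each example by how unsettled its current multiset of labels is (via the mean of the Beta posterior over the positive-label rate) and request the next label for the least settled example. The paper combines several such notions; this shows only one, with invented counts.

```python
def label_uncertainty(pos, neg):
    """How unsettled the current labels leave an example: 1.0 for an even
    split, approaching 0.0 as the Beta(pos+1, neg+1) posterior mean moves
    away from 0.5 (a deliberately simple proxy)."""
    mean = (pos + 1) / (pos + neg + 2)
    return 1.0 - 2.0 * abs(mean - 0.5)

def next_to_relabel(label_counts):
    """label_counts: {example_id: (positive_labels, negative_labels)}.
    Returns the example that should receive the next repeated label."""
    return max(label_counts, key=lambda e: label_uncertainty(*label_counts[e]))

counts = {"ex1": (5, 0), "ex2": (3, 2), "ex3": (1, 1)}
print(next_to_relabel(counts))   # "ex3": a 1-1 split is the least settled
```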
    A non-transitory computer-readable medium, method and system for providing results associated with a ranking of a plurality of items of a particular item type can be provided. For example, for each respective item of a plurality of items having an associated cost, it is possible to (i) determine an item utility value for the respective item of the items based on aggregate data associated with a plurality of users without requiring utilization of information particular to each of the users, and (ii) determine a surplus value for the respective item as the item utility value less a cost utility value associated with the cost of the respective item. Further, it is possible to provide the results, based on the respective surplus values, to a particular user of the users.
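A minimal sketch of the surplus computation the claims describe: an item utility derived from aggregate data, minus a utility associated with the item's cost, with items ranked by the difference. The utility values, prices, and price-sensitivity weight below are placeholders on an arbitrary common scale.

```python
PRICE_WEIGHT = 0.8   # illustrative cost-utility weight

def surplus(utility, price):
    """Surplus = item utility (from aggregate data) minus the utility cost of its price."""
    return utility - PRICE_WEIGHT * price

def rank_by_surplus(items):
    """items: list of (name, utility, price); returns them best-value-first."""
    return sorted(items, key=lambda it: surplus(it[1], it[2]), reverse=True)

hotels = [("Harbor Inn", 7.4, 3.1), ("Grand Plaza", 9.0, 6.5), ("Budget Stay", 5.0, 1.2)]
for name, u, p in rank_by_surplus(hotels):
    print(f"{name}: surplus = {surplus(u, p):.2f}")
```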
    User-generated content on social media platforms and product search engines is changing the way consumers shop for goods online. However, current product search engines fail to effectively leverage information created across diverse social media platforms. Moreover, current ranking algorithms in these product search engines tend to induce consumers to focus on one single product characteristic dimension (e.g., price, star rating). This approach largely ignores consumers' multidimensional preferences for products. In this paper, we propose to generate a ranking system that recommends products that provide, on average, the best value for the consumer's money. The key idea is that products that provide a higher surplus should be ranked higher on the screen in response to consumer queries. We use a unique data set of U.S. hotel reservations made over a three-month period through Travelocity, which we supplement with data from various social media sources using techniques from text mining, image classification, social geotagging, human annotations, and geomapping. We propose a random coefficient hybrid structural model, taking into consideration the two sources of consumer heterogeneity the different travel occasions and different hotel characteristics introduce. Based on the estimates from the model, we infer the economic impact of various location and service characteristics of hotels. We then propose a new hotel ranking system based on the average utility gain a consumer receives from staying in a particular hotel. By doing so, we can provide customers with the "best-value" hotels early on. Our user studies, using ranking comparisons from several thousand users, validate the superiority of our ranking system relative to existing systems on several travel search engines. On a broader note, this paper illustrates how social media can be mined and incorporated into a demand estimation model in order to generate a new ranking system in product search engines. We thus highlight the tight linkages between user behavior on social media and search engines. Our interdisciplinary approach provides several insights for using machine learning techniques in economics and marketing research.
    Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for “recency” queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
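A hedged sketch of the integration step, with an assumed decay and blending weight: documents inside a query's important time intervals get full temporal credit, credit decays with distance to the nearest interval, and the final score is a convex combination of topical similarity and the temporal score. The interval, decay constant, and lambda are illustrative, not the paper's tuned values.

```python
from datetime import date

def temporal_score(pub_date, important_intervals):
    """1.0 inside an interval of interest for the query, decaying with the
    distance in days to the nearest interval otherwise."""
    if any(start <= pub_date <= end for start, end in important_intervals):
        return 1.0
    days_away = min(min(abs((pub_date - start).days), abs((pub_date - end).days))
                    for start, end in important_intervals)
    return 1.0 / (1.0 + days_away / 30.0)

def blended_score(topic_sim, pub_date, important_intervals, lam=0.6):
    """Final ranking score: convex combination of topical similarity and the
    temporal score (lambda chosen here arbitrarily)."""
    return lam * topic_sim + (1 - lam) * temporal_score(pub_date, important_intervals)

# e.g., an interval detected for a query like "tsunami" over a news archive
intervals = [(date(2004, 12, 24), date(2005, 1, 15))]
print(blended_score(0.8, date(2005, 1, 2), intervals))   # inside the interval
print(blended_score(0.8, date(2006, 6, 1), intervals))   # far from the interval
```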
    The papers in these proceedings were presented at the 13th ACM Conference on Electronic Commerce (EC'12), held June 4-8 in Valencia, Spain. Since 1999 the ACM Special Interest Group on Electronic Commerce (SIGecom) has sponsored the leading scientific conference on advances in theory, systems, and applications for electronic commerce. The natural focus of the conference is on computer science issues, but the conference is interdisciplinary in nature, including research in economics and research related to (but not limited to) the following three non-exclusive focus areas: TF: Theory and Foundations (Computer Science Theory; Economic Theory) AI: Artificial Intelligence (AI, Agents, Machine Learning, Data Mining) EA: Experimental and Applications (Empirical Research, Experience with E-Commerce Applications) In addition to the main technical program, EC'12 featured four workshops and five tutorials. EC'12 was also co-located with the Autonomous Agents and Multiagent Systems (AAMAS) 2012 Conference. The call for papers attracted 219 submissions from authors in academia and industry all around the world --- a new record for the conference, and indeed a 16% increase over the previous record. Each paper was reviewed by at least three program committee members (all of whom were active researchers with PhDs) and two senior program committee members (who were all prominent, senior researchers) on the basis of scientific novelty, technical quality, and importance to the field. This matching was performed algorithmically, and offered the guarantee that there existed no "blocking pairs" of a reviewer who preferred a different paper and a paper that preferred a different reviewer. After extensive discussion and deliberation among the program committee, senior program committee and program chairs, 73 papers were selected for presentation at the conference. 57 of these are published in these proceedings. For the remaining 16, at the authors' request, only abstracts are included along with pointers to full working papers. This option accommodates the practices of fields outside of computer science in which conference publishing can preclude journal publishing. We expect that many of the papers in these proceedings will appear in a more polished and complete form in scientific journals in the future. For the first time, authors were allowed to explicitly align their papers with one or two of the conference's three focus areas, with the guarantee that they would be reviewed by SPC members and PC members aligned with the same area(s). Overall, the conference accepted submissions in every one of the six "tracks" induced by the three focus areas, with the TF, AI and EA tags being chosen by 147, 62 and 47 submissions and 50, 22 and 13 accepted papers respectively. (Note: because some papers chose two tags, these numbers sum to more than 219 and 73 respectively.) These "tracks" existed solely to provide a fair review process across different communities. To emphasize commonalities among the problems studied at EC, and to facilitate interchange at the conference, sessions were organized by topic rather than by focus area, and no indication of a paper's focus area(s) was given at the conference or appears in these proceedings. Also for the first time at EC, a third of the papers were presented in plenary sessions, with the other two thirds in parallel sessions. (Thus, attendees spent half their time in plenary sessions.) 
Quality was a necessary but not sufficient condition for getting a plenary slot; it was also necessary for reviewers to judge that a paper had broad appeal. Some of the conference's technically strongest work addressed smaller cross-sections of the community, and so appeared in parallel sessions. We had one overlap day with AAMAS, our co-located sister conference. We had a variety of joint activities: two invited talks (Colin Camerer and Moshe Tennenholtz), common coffee breaks, and two shared poster sessions (featuring EC papers, AAMAS papers, and EC-relevant papers published in the broader community over the last year). The latter poster session was another innovation this year: we solicited posters for papers relevant to the EC community that had been published in other venues during the past year. We accepted 22 posters for this session, and also featured posters from all EC authors who wished to present their paper in this additional format.
    With the proliferation of social media, consumers’ cognitive costs during information-seeking can become non-trivial during an online shopping session. We propose a dynamic structural model of limited consumer search that combines an optimal stopping framework with an individual-level choice model. We estimate the parameters of the model using a dataset of approximately 1 million online search sessions resulting in bookings in 2117 U.S. hotels. The model allows us to estimate the monetary value of the search costs incurred by users of product search engines in a social media context. On average, searching an extra page on a search engine costs consumers $39.15, and examining an additional offer within the same page costs $6.24. A good recommendation saves consumers, on average, $9.38, whereas a bad one costs $18.54. Our policy experiment strongly supports this finding by showing that the quality of ranking can have a significant impact on consumers’ search efforts, and customized ranking recommendations tend to polarize the distribution of consumer search intensity. Our model-fit comparison demonstrates that the dynamic search model provides the highest overall predictive power compared to the baseline static models. Our dynamic model indicates that consumers have lower price sensitivity than a static model would have predicted, implying that consumers pay a lot of attention to non-price factors during an online hotel search.
    The emergence of online crowdsourcing sites has opened up new channels for third parties and companies to solicit paid reviews from people. In this paper, we investigate 1) how the introduction of monetary payments affects review quality, and 2) the impact of bonus rewards, sponsorship disclosure, and choice freedom on the quality of paid reviews. We conduct a 2×2×2 between-subjects experiment on Amazon Mechanical Turk. Our results indicate that there are no significant quality differences between paid and unpaid reviews. The quality of paid reviews improves with both the presence of additional performance-contingent rewards and the requirement to add disclosure text about material connections, and deteriorates with the restrictions imposed on the product set to be reviewed. These results have implications for websites and companies who are seeking legitimate reviews for their online products from paid workers.
    The overload of social media content today can lead to significant latency in the delivery of results displayed to users on product search engines. We propose a dynamic structural model whose output can facilitate digital content analytics by search engines by helping predict consumers' online search paths. Such predictive prowess can facilitate web caching of the "most likely-to-be-visited" web pages and reduce latency. Our model combines an optimal stopping framework with an individual-level random utility choice model. It allows us to jointly estimate consumers' heterogeneous preferences and search costs in the context of product search engines, and predict a probability-based search path for each consumer. We estimate the parameters of the model using a dataset of approximately 1 million online search sessions resulting in room bookings in 2117 U.S. hotels. We find that search engine ranking can polarize search costs incurred by users. A good ranking saves consumers, on average, $9.38, whereas a bad one costs $18.54. Our model prediction results demonstrate that the proposed dynamic structural model provides the best overall performance in predicting the probabilities of consumers' online search paths compared to several baseline models that do not include a formal structure or dynamics in them.
    Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an information retrieval (IR) ranking function. A naive approach would be to submit a disjunctive query with all query keywords, retrieve all the returned matching documents, and then re-rank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.
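A sketch of the core idea under a simplified scoring model (a document's score is the sum of the weights of the query keywords it contains): issue conjunctive subqueries in decreasing order of their weight sum, which bounds the best possible score of any not-yet-retrieved document, and stop once the current top-k all beat that bound. The boolean_search callable stands in for the source's Boolean interface; the scoring function and stopping bound are simplifications of the paper's IR-style setting.

```python
from itertools import combinations

def topk_over_boolean_source(keywords, weights, boolean_search, k=10):
    """keywords: the user's query terms; weights: dict of per-keyword
    (idf-like) weights; boolean_search(term_tuple): the source's interface,
    returning (doc_id, doc_terms) pairs for documents matching ALL terms.
    Assumes a document's score is the sum of the weights of the query terms
    it contains."""
    seen, scores = set(), {}
    # Issue conjunctive subqueries in decreasing order of their weight sum;
    # that sum bounds the best score any not-yet-retrieved document can have.
    subqueries = sorted((c for r in range(len(keywords), 0, -1)
                         for c in combinations(keywords, r)),
                        key=lambda c: sum(weights[t] for t in c), reverse=True)
    for q in subqueries:
        bound = sum(weights[t] for t in q)
        if len(scores) >= k:
            kth_best = sorted(scores.values(), reverse=True)[k - 1]
            if kth_best >= bound:
                break                      # the current top-k can no longer change
        for doc_id, doc_terms in boolean_search(q):
            if doc_id not in seen:
                seen.add(doc_id)
                scores[doc_id] = sum(weights[t] for t in doc_terms if t in weights)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

With a mock boolean_search over a local collection, this returns the same top-k as scoring every matching document, but it typically stops after issuing only the heaviest, most selective conjunctive queries.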
    With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes it harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we reexamine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective, and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: “reviewer-related” features, “review subjectivity” features, and “review readability” features, and find that using any of the three feature sets results in a statistically equivalent performance as in the case of using all available features. This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.
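A hedged sketch of the predictive part: derive a few crude readability-style features from review text and fit a Random Forest to predict whether a review is voted helpful. The features are simple stand-ins for the subjectivity, readability, and reviewer features used in the paper, and the training examples are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def review_features(text):
    """Crude stand-ins for the paper's readability/subjectivity features."""
    words = text.split()
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    avg_sentence_len = len(words) / sentences
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    exclamations = text.count("!")
    return [len(words), avg_sentence_len, avg_word_len, exclamations]

# Placeholder training data: review text and whether users voted it helpful.
reviews = ["Great camera. The lens is sharp and the battery lasts all day.",
           "bad!!! do not buy!!!",
           "Solid build quality, though the menu system takes time to learn.",
           "meh"]
helpful = [1, 0, 1, 0]

X = np.array([review_features(r) for r in reviews])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, helpful)
print(clf.predict([review_features("The zoom is fast and photos look crisp in low light.")]))
```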
    User-Generated Content (UGC) on social media platforms and product search engines is changing the way consumers shop for goods online. However, current product search engines fail to effectively leverage information created across diverse social media platforms. Moreover, current ranking algorithms in these product search engines tend to induce consumers to focus on one single product characteristic dimension (e.g., price, star rating). This approach largely ignores consumers’ multidimensional preferences for products. In this paper, we propose to generate a ranking system that recommends products that provide, on average, the best value for the consumer’s money. The key idea is that products that provide a higher surplus should be ranked higher on the screen in response to consumer queries. We use a unique dataset of U.S. hotel reservations made over a three-month period through Travelocity, which we supplement with data from various social media sources using techniques from text mining, image classification, social geotagging, human annotations, and geo-mapping. We propose a random coefficient hybrid structural model, taking into consideration the two sources of consumer heterogeneity the different travel occasions and different hotel characteristics introduce. Based on the estimates from the model, we infer the economic impact of various location and service characteristics of hotels. We then propose a new hotel ranking system based on the average utility gain a consumer receives from staying in a particular hotel.
    With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes it harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we re-examine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective, and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. Further, reviews that rate products negatively can be associated with increased product sales when the review text is informative and detailed. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: 'reviewer-related' features, 'review subjectivity' features, and 'review readability' features, and find that using any of the three feature sets results in a statistically equivalent performance as in the case of using all available features. This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.
    Many practitioners currently use rules of thumb to price tasks on online labor markets. Incorrect pricing leads to task starvation or inefficient use of capital. Formal pricing policies can address these challenges. In this paper we argue that a pricing policy can be based on the trade-off between price and desired completion time. We show how this duality can lead to a better pricing policy for tasks in online labor markets. This paper makes three contributions. First, we devise an algorithm for job pricing using a survival analysis model. We then show that worker arrivals can be modeled as a non-homogeneous Poisson Process (NHPP). Finally using NHPP for worker arrivals and discrete choice models we present an abstract mathematical model that captures the dynamics of the market when full market information is presented to the task requester. This model can be used to predict completion times and pricing policies for both public and private crowds.
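A minimal sketch of the arrival-process piece, with an invented intensity function and acceptance curve: simulate worker arrivals as a non-homogeneous Poisson process via thinning, let each arriving worker take a task with a price-dependent probability, and read off the time to complete a batch at a given price. The intensity, acceptance model, and numbers are assumptions, not fitted market parameters.

```python
import math
import random

def hourly_rate(t_hours):
    """Illustrative NHPP intensity: worker arrivals per hour, peaking mid-day."""
    return 20 + 15 * math.sin(2 * math.pi * ((t_hours % 24) - 6) / 24)

def simulate_completion_time(n_tasks, price, accept_prob, max_rate=35, seed=0):
    """Thinning algorithm for the NHPP, with each arriving worker taking one
    task with probability accept_prob(price). Returns hours to finish."""
    rng = random.Random(seed)
    t, remaining = 0.0, n_tasks
    while remaining > 0:
        t += rng.expovariate(max_rate)                 # candidate arrival
        if rng.random() < hourly_rate(t) / max_rate:   # thinning step
            if rng.random() < accept_prob(price):      # price-driven choice
                remaining -= 1
    return t

accept = lambda price: min(1.0, price / 0.25)          # toy discrete-choice curve
for p in (0.05, 0.10, 0.25):
    print(f"price ${p:.2f}: ~{simulate_completion_time(500, p, accept):.1f} hours")
```

Inverting this relationship (finding the lowest price whose predicted completion time meets a deadline) is the price/completion-time trade-off the abstract refers to.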
    The tutorial covers an emerging topic of wide interest: Crowdsourcing. Specifically, we cover areas of crowdsourcing related to managing structured and unstructured data in a web-related context. Many researchers and practitioners today see the great opportunity that becomes available through easily-available crowdsourcing platforms. However, most newcomers face the same questions: How can we manage the (noisy) crowds to generate high quality output? How to estimate the quality of the contributors? How can we best structure the tasks? How can we get results in small amounts of time while minimizing the necessary resources? How to set up the incentives? How should such crowdsourcing markets be set up? The presented material will cover topics from a variety of fields, including computer science, statistics, economics, and psychology. Furthermore, the material will include real-life examples and case studies from years of experience in running and managing crowdsourcing applications in business settings.
    There is increased participation by the developing world in the global manufacturing marketplace: the sewing machine in Bangladesh can be a means to support an entire family. Crowdsourcing for cognitive tasks consists of asking humans questions that are otherwise impossible for algorithms to answer, e.g., is this image pornographic, are these two addresses the same, what is the translation for this text in French? In the last five years, there has been an exponential growth in the size of the global cognitive marketplace: Amazon.com's Mechanical Turk has an estimated 500,000 active workers in over 100 countries, and there are dozens of other companies in this space. This turns the computer into a modern-day sewing machine, where cognitive work of various levels of difficulty will pay anywhere from 5 to 50 dollars a day. Unlike outsourcing, which usually requires a college education, competence at these tasks might require a month or even less of training. At its best, this could be a powerful bootstrap for a billion people. At its worst, this can lead to unprecedented exploitation. In this panel, we discuss the technical, social and economic questions and implications that a global cognitive marketplace raises.
    With the growing pervasiveness of the Internet, online search for products and services is constantly increasing. Most product search engines are based on adaptations of theoretical models devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different than the process of locating relevant documents or objects. We propose a theory model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest surplus, after the purchase. In a sense, the top ranked products are the "best value for money" for a specific user. Our approach builds on research on "demand estimation" from economics and presents a solid theoretical foundation on which further research can build. We build algorithms that take into account consumer demographics, heterogeneity of consumer preferences, and also account for the varying price of the products. We show how to achieve this without knowing the demographics or purchasing histories of individual consumers but by using aggregate demand data. We evaluate our work by applying the techniques to hotel search. Our extensive user studies, using more than 15,000 user-provided ranking comparisons, demonstrate an overwhelming preference for the rankings generated by our techniques, compared to a large number of existing strong state-of-the-art baselines.
    Most product search engines today build on models of relevance devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different than the process of locating relevant documents or objects. We propose a theory model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest surplus, after the purchase. We instantiate our research by building a demo search engine for hotels that takes into account heterogeneous consumer preferences, and also accounts for the varying hotel price. Moreover, we achieve this without explicitly asking the preferences or purchasing histories of individual consumers but by using aggregate demand data. This new ranking system is able to recommend consumers products with "best value for money" in a privacy-preserving manner. The demo is accessible at http://nyuhotels.appspot.com/
    Managers and researchers alike suspect that the vast amounts of qualitative information found in blogs, product reviews, real estate listings, news stories, analyst reports and experts’ advice influence consumer behavior. But, do these kinds of qualitative information impact or rather reflect consumer choices? We argue that message content and consumer choice are endogenous, and that non-random selection and the conflation of awareness and persuasion complicate causal estimation of the impact of message content on economic decisions and outcomes. Using data on the transcribed content of 2,397 stock recommendations provided by Jim Cramer on his CNBC show Mad Money from 2005 to 2008, combined with data on Internet search volume, the content of prior news, and prior stock price and trading volume data, we show that selection bias in the stocks Cramer chooses to recommend and prior product awareness on the part of his audience create measurable upward bias in estimates of the impact of Cramer’s advice on stock prices. Using Latent Dirichlet Allocation (LDA) to characterize the topical content of Cramer’s speech and the content of prior news, we show that he is less persuasive when he supports his recommendations with arguments that have themselves been recently mentioned in the news. We argue that the classic sales skill of “knowing what a customer needs to hear” can significantly enhance the influence of qualitative information precisely because what the consumer already knows affects how they evaluate messages. The tools and techniques we develop can be put to practical use in a variety of settings where marketers can present subjects with different messages depending on what they already know.
    We present techniques for gathering data that expose errors of automatic predictive models. In certain common settings, traditional methods for evaluating predictive models tend to miss rare-but-important errors: most importantly, rare cases for which the model is confident of its prediction (but wrong). In this paper we present a system that, in a game-like setting, asks humans to identify cases that will cause the predictive-model-based system to fail. Such techniques are valuable in discovering problematic cases that do not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe the design of the system, including design iterations that did not quite work. In particular, the system incentivizes humans to provide examples that are difficult for the model to handle, by providing a reward proportional to the magnitude of the predictive model's error. The humans are asked to "Beat the Machine" and find cases where the automatic model ("the Machine") is wrong. Experiments show that humans using Beat the Machine identify more errors than traditional techniques for discovering errors in predictive models, and indeed, they identify many more errors where the machine is confident it is correct. Further, the cases the humans identify seem to be not simply outliers, but coherent areas missed completely by the model. Beat the Machine identifies the "unknown unknowns."
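    The incentive at the heart of Beat the Machine, paying more for cases where the model is both wrong and confident, can be sketched as follows. The exact payment rule used in the deployed system is not given here; the snippet only illustrates one plausible rule in which the reward grows with the magnitude of the model's error on the submitted case.

```python
# Sketch of a "Beat the Machine"-style reward: pay workers in proportion to how
# badly the model errs on the case they submit. The payment schedule below is
# illustrative, not the deployed one.

def reward(model_prob_positive: float, true_label: int,
           base_payment: float = 0.05, bonus_scale: float = 1.00) -> float:
    """model_prob_positive: model's predicted probability that the case is positive.
    true_label: 1 or 0, established after (human) verification of the submitted case."""
    prob_true_class = model_prob_positive if true_label == 1 else 1.0 - model_prob_positive
    error = 1.0 - prob_true_class          # near 0 when the model is right and sure,
                                           # near 1 when it is confidently wrong
    if error <= 0.5:                       # model got it right: only the base payment
        return base_payment
    return base_payment + bonus_scale * error   # bigger bonus for confident mistakes

# A confidently wrong case earns far more than a borderline one.
print(reward(model_prob_positive=0.95, true_label=0))  # 1.00
print(reward(model_prob_positive=0.55, true_label=0))  # 0.60
```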
    INVITED TALK The Smarter Crowd: Active Learning, Knowledge Corroboration, and Collective IQs Thore Graepel Microsoft Research Cambridge, UK thoreg@microsoft.com Abstract Crowdsourcing mechanisms such as Amazon Mechanical Turk (AMT) or the ESP game are now routinely being used for labelling data for machine learning and other computational intelligence applications. I will discuss three important aspects of crowdsourcing which can help us tap into this powerful new resource in a more efficient way. When obtaining training data from a crowdsourcing system for the purpose of machine learning we can either collect all the training data in one batch or proceed sequentially and decide which labels to obtain based on the model learnt from the data labelled so far, a method often referred to as active learning. I will discuss which criteria can be used for selecting new examples to be labelled and demonstrate how this approach has been used in the FUSE/MSRC news recommender system projectemporia.com to categorise news stories in a cost-efficient way. Data obtained from crowdsourcing systems is typically plentiful and cheap, but noisy. The redundancy in the data can be used to improve the quality of the inferred labels based on models that take into account the reliability and expertise of the workers as well as the nature and difficulty of the tasks. I will present an algorithm for such a corroboration process based on graphical models, and show its application on the example of verifying the truth values of facts in the entity-relationship knowledge base Yago. Finally, I will talk about some very recent results on the effects of parameters of crowd-sourcing marketplaces (such as price and required track record for participation) on the quality of results. This work is based on methods from psychometrics, effectively measuring the IQ of the Mechanical Turk when viewed as a form of collective intelligence. This is joint work with Ralf Herbrich, Ulrich Paquet, David Stern, Jurgen Van Gael, Gjergji Kasneci, and Michal Kosinksi.
    Crowdsourcing has shown itself to be well-suited for the accomplishment of certain kinds of small tasks, yet many crowdsourceable tasks still require extensive structuring and managerial effort before using a crowd is feasible. We argue that this overhead could be substantially reduced via standardization. In the same way that task standardization enabled the mass production of physical goods, standardization of basic “building block” tasks would make crowdsourcing more scalable. Standardization would make it easier to set prices, spread best practices, build meaningful reputation systems and track quality. All of this would increase the demand for paid crowdsourcing, a development we argue is positive on both efficiency and welfare grounds. Standardization would also allow more complex processes to be built out of simpler tasks while still being able to predict quality, cost and time to completion. Realizing this vision will require interdisciplinary research effort as well as buy-in from online labor platforms.
    I will discuss the repeated acquisition of "labels" for data items when the labeling is imperfect. Labels are values provided by humans for specified variables on data items, such as "PG-13" for "Adult Content Rating on this Web Page." With the increasing popularity of micro-outsourcing systems, such as Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction.
    In this paper, we examine how different ranking and personalization mechanisms on product search engines influence consumer online search and purchase behavior. To investigate these effects, we combine archival data analysis with randomized field experiments. Our archival data analysis is based on a unique dataset containing approximately 1 million online sessions from Travelocity over a 3-month period. Using a hierarchical Bayesian model, we first jointly estimate the relationship among consumer click and purchase behavior, and search engine ranking decisions. To evaluate the causal effect of search engine interface on user behavior, we conduct randomized field experiments. The field experiments are based on a real-world hotel search engine application designed and built by us. By manipulating the default ranking method of search results, and by enabling or disabling a variety of personalization features on the hotel search engine website, we are able to empirically identify the causal impact of search engines on consumers’ online click and purchase behavior. The archival data analysis and the randomized experiments are consistent in demonstrating that ranking has a significant effect on consumer click and purchase behavior. We find that hotels with a higher reputation for providing superior services are more adversely affected by an inferior screen position. In addition, a consumer utility-based ranking mechanism yields the highest click and purchase propensities in comparison to existing benchmark systems such as ranking based on price or customer ratings. Our randomized experiments on the impact of active vs. passive personalization mechanisms on user behavior indicate that although active personalization (wherein users can interact with the recommendation algorithm) can lead to a higher click-through rate compared to passive personalization, it leads to a lower conversion rate when consumers have a planned purchase beforehand. This finding suggests that active personalization strategies should not be adopted ubiquitously by product search engines. On a broader note, our inter-disciplinary approach provides a methodological framework for how econometric modeling, randomized field experiments, and IT-based artifacts can be integrated in the same study towards deriving causal relationships between variables of interest.
    The emergence of online crowdsourcing services such as Amazon Mechanical Turk presents huge opportunities to distribute micro-tasks at an unprecedented rate and scale. Unfortunately, the high verification cost and the unstable employment relationship give rise to opportunistic behaviors of workers, which in turn exposes the requesters to quality risks. Currently, most requesters rely on redundancy to identify the correct answers. However, existing techniques cannot separate the true (unrecoverable) error rates from the (recoverable) biases that some workers exhibit, which leads to incorrect assessments of worker quality. Furthermore, massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. In this paper, we present an algorithm that can easily separate the true error rates from the biases. We also describe how to seamlessly integrate "gold" data for learning the quality of workers. Next, we present an approach for actively testing worker quality in order to quickly identify spammers or malicious workers. Finally, we present experimental results to demonstrate the performance of our proposed algorithm.
    The proposed tutorial covers an emerging topic of wide interest: Crowdsourcing. Specifically, we cover areas of crowdsourcing related to managing structured and unstructured data in a web-related context. Many researchers and practitioners today see the great opportunity that becomes available through easily-available crowdsourcing platforms. However, most newcomers face the same questions: How can we manage the (noisy) crowds to generate high-quality output? How do we estimate the quality of the contributors? How can we best structure the tasks? How can we get results in small amounts of time while minimizing the necessary resources? How do we set up the incentives? How should such crowdsourcing markets be set up? The presented material will cover topics from a variety of fields, including computer science, statistics, economics, and psychology. Furthermore, the material will include real-life examples and case studies from years of experience in running and managing crowdsourcing applications in business settings. The tutorial presenters have extensive academic and systems-building experience and will provide the audience with data sets that can be used for hands-on tasks.
    Since the concept of crowdsourcing is relatively new, many potential participants have questions about the AMT marketplace. For example, a common set of questions that pop up in an 'introduction to crowdsourcing and AMT' session are the following: What type of tasks can be completed in the marketplace? How much does it cost? How fast can I get results back? How big is the AMT marketplace? The answers to these questions remain largely anecdotal and based on personal observations and experiences. To understand better what types of tasks are being completed today using crowdsourcing techniques, we started collecting data about the AMT marketplace. We present a preliminary analysis of the dataset and provide directions for interesting future research.
    In the last decade, prediction markets became popular forecasting tools in areas ranging from election results to movie revenues and Oscar nominations. One of the features that make prediction markets particularly attractive for decision support applications is that they can be used to answer what-if questions and estimate probabilities of complex events. The traditional approach to answering such questions involves running a combinatorial prediction market, which is not always possible. In this paper, we present an alternative, statistical approach to pricing complex claims, which is based on analyzing co-movements of prediction market prices for basis events. Experimental evaluation of our technique on a collection of 51 InTrade contracts representing the Democratic Party Nominee winning the Electoral College votes of a particular state shows that the approach outperforms traditional forecasting methods such as price and return regressions and can be used to extract meaningful business intelligence from raw price data.
    A system, method, software arrangement, and computer-accessible medium can be provided for incorporating quantitative and qualitative information into an economic model. An exemplary method for analyzing qualitative information associated with a characteristic of at least one entity based on associated quantitative information includes: obtaining first information which contains at least in part qualitative information relating to at least one of the at least one entity; determining second information associated with at least one attribute of the characteristic obtained from the first information; obtaining third information which contains at least in part quantitative information associated with at least one of the at least one entity; and establishing fourth information as a function of the second information and the third information to determine which of the at least one attribute affects the characteristic. For example, an observable economic variable can be characterized using numerical and qualitative information associated with one or more of the entities. The influence of the quantitative and qualitative information on the observable economic variable for a given entity relative to other entities may be determined using statistical regressions.
    Crowdsourcing services, such as Amazon Mechanical Turk, allow for easy distribution of small tasks to a large number of workers. Unfortunately, since manually verifying the quality of the submitted results is hard, malicious workers often take advantage of the verification difficulty and submit answers of low quality. Currently, most requesters rely on redundancy to identify the correct answers. However, redundancy is not a panacea. Massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. Therefore, we need techniques that will accurately estimate the quality of the workers, allowing for the rejection and blocking of the low-performing workers and spammers. However, existing techniques cannot separate the true (unrecoverable) error rate from the (recoverable) biases that some workers exhibit. This lack of separation leads to incorrect assessments of a worker's quality. We present algorithms that improve the existing state-of-the-art techniques, enabling the separation of bias and error. Our algorithm generates a scalar score representing the inherent quality of each worker. We illustrate how to incorporate cost-sensitive classification errors in the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring the quality of the workers. We present experimental results demonstrating the performance of the proposed algorithm under a variety of settings.
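    The distinction between a recoverable bias and a true error rate can be made concrete with worker confusion matrices. In the sketch below, a worker who systematically flips the two classes is fully "correctable" (the posterior soft labels are still sharp), whereas a worker who answers at random is not. The quality score shown, the expected misclassification cost of acting on the worker's bias-corrected soft labels, follows the spirit of the algorithm described above, but the scoring details and the estimation of the confusion matrices themselves (done via EM-style inference in practice) are omitted; priors and cost matrix are illustrative.

```python
# Sketch: separating recoverable bias from true error using a worker's
# confusion matrix. conf[true_class][assigned_label] = P(label | true class).
import numpy as np

priors = np.array([0.5, 0.5])                 # class priors, assumed known here
cost = np.array([[0.0, 1.0], [1.0, 0.0]])     # symmetric misclassification cost

def soft_label(conf, assigned):
    """Posterior over true classes given that the worker assigned `assigned`."""
    post = priors * conf[:, assigned]
    return post / post.sum()

def expected_cost(conf):
    """Average cost of acting on the worker's (bias-corrected) soft labels."""
    total = 0.0
    for assigned in range(len(priors)):
        p_assigned = float(priors @ conf[:, assigned])   # how often this label occurs
        post = soft_label(conf, assigned)
        # cost of classifying into the cheapest class under the posterior
        total += p_assigned * min(post @ cost[:, c] for c in range(len(priors)))
    return total

biased  = np.array([[0.1, 0.9],    # (almost) always flips the classes:
                    [0.9, 0.1]])   # recoverable, soft labels stay informative
spammer = np.array([[0.5, 0.5],    # answers at random: unrecoverable
                    [0.5, 0.5]])
perfect = np.eye(2)

for name, conf in [("perfect", perfect), ("biased", biased), ("spammer", spammer)]:
    print(name, round(expected_cost(conf), 3))
# perfect ~0.0, biased ~0.1, spammer ~0.5: bias barely hurts, randomness does.
```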
    Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross-cultural, and prescreening designs, and offer some advice on how to best manage a common subject pool.
    Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an Information Retrieval (IR) ranking function. The naive approach would be to submit a disjunctive query with all query keywords, retrieve the returned documents, and then re-rank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement over current baseline approaches.
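    To make the query-generation idea concrete, the sketch below uses a toy in-memory "Boolean source" and a simple additive relevance function (sum of per-keyword weights). It issues conjunctive subqueries in decreasing order of their maximum attainable score and stops once the current top-k results provably cannot be displaced. The paper's actual algorithms are more sophisticated (they avoid enumerating all subsets, model result-set sizes, etc.), so treat this purely as an illustration of the candidate-generation and stopping logic; the corpus and weights are made up.

```python
# Toy sketch: IR-style top-k over a source that only supports Boolean AND queries.
# Score(doc) = sum of weights of query keywords the doc contains (monotone).
from itertools import combinations

DOCS = {
    1: {"cancer", "therapy", "gene"},
    2: {"cancer", "therapy"},
    3: {"gene", "mutation"},
    4: {"cancer", "gene", "mutation", "therapy"},
    5: {"therapy"},
}
WEIGHTS = {"cancer": 3.0, "gene": 2.0, "therapy": 1.0, "mutation": 2.5}

def boolean_and(terms):
    """Simulates the source: returns ids of documents containing ALL terms (unranked)."""
    return [d for d, words in DOCS.items() if set(terms) <= words]

def score(doc_id, query):
    return sum(WEIGHTS[t] for t in query if t in DOCS[doc_id])

def top_k(query, k=2):
    # All conjunctive subqueries, most "promising" (highest total weight) first.
    subsets = [c for r in range(len(query), 0, -1) for c in combinations(query, r)]
    subsets.sort(key=lambda s: sum(WEIGHTS[t] for t in s), reverse=True)
    seen, results = set(), {}
    for i, sub in enumerate(subsets):
        for d in boolean_and(sub):
            if d not in seen:
                seen.add(d)
                results[d] = score(d, query)
        # Any unseen doc matches only subqueries not yet issued, so its score is
        # bounded by the weight of the next subquery: safe to stop early.
        bound = sum(WEIGHTS[t] for t in subsets[i + 1]) if i + 1 < len(subsets) else 0.0
        top = sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and top[-1][1] >= bound:
            return top
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(top_k(["cancer", "gene", "therapy"], k=2))   # -> [(1, 6.0), (4, 6.0)]
```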
    We present the results of a survey that collected information about the demographics of participants on Amazon Mechanical Turk, together with information about their level of activity and motivation for working on Amazon Mechanical Turk. We find that approximately 50% of the workers come from the United States and 40% come from India. Country of origin tends to change the motivating reasons for workers to participate in the marketplace. Significantly more workers from India participate on Mechanical Turk because the online marketplace is a primary source of income, while in the US most workers consider Mechanical Turk a secondary source of income. While money is a primary motivating reason for workers to participate in the marketplace, workers also cite a variety of other motivating reasons, including entertainment and education.
    Increasingly, user-generated product reviews serve as a valuable source of information for customers making product choices online. The existing literature typically incorporates the impact of product reviews on sales based on numeric variables representing the valence and volume of reviews. In this paper, we posit that the information embedded in product reviews cannot be captured by a single scalar value. Rather, we argue that product reviews are multifaceted, and hence the textual content of product reviews is an important determinant of consumers' choices, over and above the valence and volume of reviews. To demonstrate this, we use text mining to incorporate review text in a consumer choice model by decomposing textual reviews into segments describing different product features. We estimate our model based on a unique data set from Amazon containing sales data and consumer review data for two different groups of products (digital cameras and camcorders) over a 15-month period. We alleviate the problems of data sparsity and of omitted variables by providing two experimental techniques: clustering rare textual opinions based on pointwise mutual information and using externally imposed review semantics. This paper demonstrates how textual data can be used to learn consumers' relative preferences for different product features and also how text can be used for predictive modeling of future changes in sales. This paper was accepted by Ramayya Krishnan, information systems.
    With the growing pervasiveness of the Internet, online search for commercial goods and services is constantly increasing, as more and more people search for and purchase goods on the Internet. Most of the current algorithms for product search are based on adaptations of theoretical models devised for "classic" information retrieval. However, the decision mechanism that underlies the process of buying a product is different from the process of judging a document as relevant or not. So, applying theories of relevance to the task of product search may not be the best approach. We propose a theoretical model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest consumer surplus after the purchase. In a sense, we rank highest the products that are the "best value for money" for a specific user. Our approach naturally builds on decades of research in the field of economics and presents a solid theoretical foundation on which further research can build. We instantiate our research by building a search engine for hotels, and show how we can build algorithms that naturally take into account consumer demographics, heterogeneity of consumer preferences, and the varying price of the hotel rooms. Our extensive user studies demonstrate an overwhelming preference for the rankings generated by our techniques, compared to a large number of existing strong baselines.
    Information seeking in an online shopping environment is different from classical relevance-based information retrieval. In this paper, we focus on understanding how humans seek information and make economic decisions, when interacting with an array of choices in an online shopping environment. Our study is instantiated on a unique dataset of US hotel reservations from Travelocity.com. Current travel search engines display results with only rudimentary interfaces by using a single ranking criterion. This largely ignores consumers’ multi-dimensional preferences and is constrained by human cognitive limitations towards multiple and diverse online data sources. This paper proposes to improve the search interface using an inter-disciplinary approach. It combines text mining, image classification, social geo-tagging and field experiments with structural modeling techniques, and designs an interface whereby each hotel of interest is ranked according to its average consumer surplus on the travel search site. This system would display the hotels with the “best value for money” for a given consumer at the top of the computer screen during the travel search process in response to a query. The final outcome based on several user studies shows that our proposed ranking system outperforms the existing baselines. This suggests that our inter-disciplinary approach has the potential to enhance user search experience for information seeking in the context of economic decision-making.
    The Sarbanes-Oxley (SOX) Act of 2002 is one of the, if not the, most important pieces of legislation affecting corporations traded on the U.S. stock exchanges. While SOX does not explicitly address the issue of information security, the definition of internal control provided by the SEC, combined with the fact that the reporting systems of all firms required to comply with SOX are based on systems that promote information security and integrity, does imply that a stronger focus on information security is a necessary compliance requirement. Using a dataset of stock market abnormal returns covering 300 firms over the period 2000-2006, we examine how the stock market reaction varies between 8-K filings and news media releases, and how this reaction has changed since the passage of the SOX Act. We hypothesize that the greater timeliness of the 8-K filings induced by SOX improves the quality of their information disclosure and accelerates its dissemination in the market. Further, we classify news articles into press- and firm-initiated articles and hypothesize that the press-initiated coverage of material events has increased in the post-SOX period. We find that firm-initiated media coverage had a significant negative impact on the measures of informativeness relative to press-initiated coverage, suggesting that the media played a significant role during the scandal-ridden period between 2002 and 2004, when firms had a poor information environment. We also find that the timeliness of the release of media articles determines the level of informativeness, suggesting that the media acts as an information intermediary and as a substitute for the firm's existing information disclosure environment.
    This chapter discusses the design of taxonomies to be used in dynamic taxonomy systems. Although the only actual requirement of dynamic taxonomies is a multidimensional classification, an organization by facets is normally used. The first section provides guidelines for the design of DT taxonomies, which include the automatic construction from structured data, and the retrofitting of traditional monodimensional taxonomies. The second section shows how a faceted taxonomy can be automatically extracted from the infobase itself when objects are textual or are described by textual captions or tags.
    Large amounts of structured information are buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect, and their output has imperfect precision and recall (i.e., it contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as "knobs" to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad-hoc procedure, based mainly on heuristics. In this paper, we show how to use receiver operating characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the run time and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts the output quality accurately and selects the fastest execution plan that satisfies the output quality restrictions.
    In this paper, we analyze how different dimensions of a seller's reputation affect pricing power in electronic markets. Given the interplay between buyers' trust and sellers' pricing power, we use text mining techniques to identify and structure dimensions of importance from feedback posted on reputation systems. By aggregating and scoring these dimensions based on the sentiment they contain, we use them to estimate a series of econometric models associating reputation with price premiums. We find that different dimensions do indeed affect pricing power differentially, and that a negative reputation hurts more than a positive one helps on some dimensions but not on others. We provide evidence that sellers of identical products in electronic markets differentiate themselves based on a distinguishing dimension of strength, and that buyers vary in the relative importance they place on different fulfillment characteristics. We highlight the importance of textual reputation feedback further by demonstrating that it substantially improves the performance of a classifier we have trained to predict future sales. Our results also suggest that online sellers distinguish themselves on specific and varying fulfillment characteristics that resemble the unique selling points highlighted by successful brands. We conclude by providing explicit examples of IT artifacts (buyer and seller tools) that use our interdisciplinary approach to enhance buyer trust and seller efficiency in online environments. This paper is the first study that integrates econometric, text mining and predictive modeling techniques toward a more complete analysis of the information captured by reputation systems, and it presents new evidence of the importance of their effective and judicious design in online markets.
    Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors, and predicts the output quality - and, of course, the execution time - of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
    An important use of the Internet today is in providing a platform for consumers to disseminate information about products and services they buy, and to share experiences about the merchants with whom they transact. Increasingly, online markets develop into social shopping channels and facilitate the creation of online communities and social networks. To date, businesses, government organisations and customers have not fully incorporated such information into their decision-making and policy-formulation processes, either because the potential value of this intellectual capital has not been recognized or because appropriate techniques for measuring that value have not been identified. Increasingly, though, this publicly available digital content has concrete economic value that is often hidden beneath the surface. For example, online product reviews affect the buying behaviour of customers, as well as the volume of sales, positively or negatively. Similarly, user feedback on sites such as eBay and Amazon affects the reputation of online merchants and, in turn, their ability to sell their products and services. Our research on the EconoMining project studies the economic value of user-generated content in such online settings. In our research program, we combine established techniques from economics and marketing with text mining algorithms from computer science to measure the economic value of each text snippet. In this paper, we describe the foundational blocks of our techniques, demonstrate the value of user-generated content in a variety of areas, including reputation systems and online product reviews, and present examples of how such areas are of immediate importance to the travel industry.
    The first Human Computation Workshop (HComp2009) was held on June 28th, 2009, in Paris, France, collocated with SIGKDD 2009. This report summarizes the workshop, with details of the papers, demos and posters presented. The report also includes common themes, issues, and open questions that came up in the workshop.
    Nowadays, there is significant experimental evidence of excellent ex-post predictive accuracy in certain types of prediction markets, such as markets for elections. This evidence shows that prediction markets are efficient mechanisms for aggregating information and are more accurate in forecasting events than traditional forecasting methods, such as polls. The interpretation of prediction market prices as probabilities has been extensively studied in the literature; however, little attention has so far been given to understanding the volatility of prediction market prices. In this paper, we present a model of a prediction market with a binary payoff on a competitive event involving two parties. In our model, each party has some underlying "ability" process that describes its ability to win and evolves as an Ito diffusion. We show that if the prediction market for this event is efficient and accurate, the price of the corresponding contract will also follow a diffusion and its instantaneous volatility is a particular function of the current claim price and its time to expiration. We generalize our results to competitive events involving more than two parties and show that volatilities of prediction market contracts for such events are again functions of the current claim prices and the time to expiration, as well as of several additional parameters (ternary correlations of the underlying Brownian motions). In the experimental section, we validate our model on a set of InTrade prediction markets and show that it is consistent with observed volatilities of contract returns and outperforms the well-known GARCH model in predicting future contract volatility from historical price data. To demonstrate the practical value of our model, we apply it to pricing options on prediction market contracts, such as those recently introduced by InTrade. Other potential applications of this model include detection of significant market moves and improving forecast standard errors.
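    For the two-party case, the "particular function of the current claim price and its time to expiration" can be derived in a few lines under a concrete parametrization. The derivation below assumes the ability gap is a driftless Brownian motion and the contract pays 1 if the gap is positive at expiration; here Φ and φ denote the standard normal cdf and pdf. The paper's general setup (Ito diffusions for each party, multi-party events, ternary correlations) is broader, so this is only an illustrative special case.

```latex
% Two-party special case: ability gap X_t with dX_t = \sigma \, dW_t, and a
% contract paying 1\{X_T > 0\}; an efficient market prices it at the
% conditional probability of that event:
\[
  p_t \;=\; \Pr\!\left[X_T > 0 \mid X_t\right]
       \;=\; \Phi\!\left(\frac{X_t}{\sigma\sqrt{T-t}}\right).
\]
% Because p_t is a martingale, Ito's lemma leaves only the diffusion term:
\[
  dp_t \;=\; \frac{\partial p_t}{\partial X_t}\,\sigma\, dW_t
        \;=\; \frac{\varphi\!\left(\Phi^{-1}(p_t)\right)}{\sqrt{T-t}}\, dW_t,
\]
% so the instantaneous volatility depends only on the current price and the
% time to expiration:
\[
  \operatorname{vol}(p_t,\,T-t) \;=\; \frac{\varphi\!\left(\Phi^{-1}(p_t)\right)}{\sqrt{T-t}}.
\]
```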
    We are experiencing an unprecedented increase of content contributed by users in forums such as blogs, social networking sites and micro-blogging services. Such abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given such a plethora of information, there is a pressing need to cross-reference information across textual services. For example, commonly we read a news item and we wonder if there are any blogs reporting related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given "query document" to be used as queries to search interfaces, with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia, and introduce an algorithm called RelevanceRank for this purpose. We discuss both of these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon's Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating the retrieval of documents related to a query document.
    In this paper, we empirically estimate the economic value of different hotel characteristics, especially the location-based and service-based characteristics given the associated local infrastructure. We build a random coefficients-based structural model taking into consideration the multiple-levels of consumer heterogeneity introduced by different travel contexts and different hotel characteristics. We estimate this econometric model with a unique dataset of hotel reservations located in the US over 3 months and user-generated content data that was processed based on techniques from text mining, image classification, and on-demand annotations. This enables us to infer the economic significance of various hotel characteristics. We then propose to design a new hotel ranking system based on the empirical estimates that take into account the multi-dimensional preferences of customers and imputes consumer surplus from transactions for a given hotel. By doing so, we are able to provide customers with the “best value for money” hotels. Based on blind tests of users from Amazon Mechanical Turk, we test our ranking system with some benchmark hotel ranking systems. We find that our system performs significantly better than existing ones. This suggests that our inter-disciplinary approach has the potential to improve the quality of hotel search.
    Research in data mining and knowledge discovery relies heavily on the availability of datasets. However, compared to the amount of work in the field on techniques for pattern discovery and knowledge extraction, there has been relatively little effort directed at the study of effective methods for collecting and evaluating the quality of data. Human computation is a relatively new research area that studies the process of channeling the vast internet population to perform tasks or provide data towards solving difficult problems that no known efficient computer algorithms can yet solve. There has been a lot of work on games with a purpose (e.g., the ESP Game) that specifically target online gamers who, in the process of playing an enjoyable game, generate useful data (e.g., image tags). There has been considerable interest and research on crowdsourcing marketplaces (e.g. Amazon Mechanical Turk), which are essentially human computation applications that coordinate workers to perform tasks in exchange for monetary rewards. While there have been significant research challenges, increasing business interest and active work in human computation, till last year there was no dedicated forum to discuss these ideas. The first Human Computation Workshop (HComp2009) was held on June 28th, 2009, in Paris, France, collocated with KDD 2009. With HComp2010, we hope to continue to serve the needs of researchers and practitioners interested in this area, integrating work from a number of fields including KDD, information retrieval, gaming, machine learning, game theory, and Human Computer Interaction. Learning from HComp 2009, we expanded the topics of relevance to the workshop. We have also updated the organizing and program committees, bringing new people in, while keeping enough of the last year's team for continuity. This year, our call for papers resulted in 35 submissions (18 long papers, 11 short papers and 6 demos) from a range of perspectives. All submissions were thoroughly reviewed, typically with 3 reviews each, by the organizing and program committees and external reviewers. Even though we had a number of submissions of high quality, given that we had only a half-day workshop, we could accept only 4 long papers, 4 short papers, 5 posters and 5 demos; these are the submissions that appear in the proceedings. The overall themes that emerged from last year's workshop were very clear: on the one hand, there is the experimental side of human computation, with research on new incentives for users to participate, new types of actions, and new modes of interaction. On the more theoretic side, we have research modeling these actions and incentives to examine what theory predicts about these designs. Finally, last year's program focused on how to best handle noise, identify labeler expertise, and use the generated data for data mining purposes. This year's submissions demonstrated the continuation and evolution of these themes with the accepted papers divided into three sessions: "Market Design", "Human Computation in Practice", and "Task and Process Design". In the first session, the authors discuss the economic context of human computation and the practical lessons learned from building a human computation resource. In the second session we see many practical examples of human computation in fields as far-flung as virtual world construction to language translation and beyond. In the final session, the authors focus on how to get the best quality results out of the crowdsourced work. 
Given the strong submissions for demos both last year and this, we see -- and the community seems to recognize -- the poster and demo session as an integral part of the workshop, where participants can showcase their human computation applications.
    Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
    Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We extensively evaluated our techniques using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examined several alternatives for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. Our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
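    One simple way to realize the "important time intervals" idea is sketched below: take the publication dates of the top topically-ranked documents, treat the densest intervals as a temporal prior, and blend that prior with the topical score. The interval-detection and scoring methods in the paper are more principled; the histogram-based prior, the month granularity, and the linear blend with weight lambda used here are illustrative assumptions, as are the toy results.

```python
# Sketch: blend topical similarity with a temporal prior derived from where the
# topically best documents cluster in time (a stand-in for "important intervals").
from collections import Counter

# (doc_id, topical_similarity, publication_month) -- toy pseudo-relevant results
results = [
    ("d1", 0.92, "2004-11"), ("d2", 0.90, "2004-11"), ("d3", 0.88, "2004-11"),
    ("d4", 0.87, "2005-03"), ("d5", 0.60, "2004-11"), ("d6", 0.85, "2001-02"),
]

def temporal_prior(results, top_n=4):
    """P(month) estimated from the publication dates of the top-n topical hits."""
    months = [m for _, _, m in sorted(results, key=lambda r: r[1], reverse=True)[:top_n]]
    counts = Counter(months)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}

def rerank(results, lam=0.7):
    prior = temporal_prior(results)
    scored = [(doc, lam * sim + (1 - lam) * prior.get(month, 0.0))
              for doc, sim, month in results]
    return sorted(scored, key=lambda x: x[1], reverse=True)

for doc, s in rerank(results):
    print(doc, round(s, 3))
# d6 is topically strong but falls outside the query's important interval,
# so it drops below documents published in late 2004.
```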
    This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
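    A minimal version of selective repeated labeling is sketched below: integrate multiple noisy labels by majority vote, and spend the next labeling budget on the examples whose current label multisets are most uncertain. The uncertainty measure here (a Beta-posterior probability that the majority is wrong) is just one of the notions of uncertainty the paper combines; the label counts are illustrative.

```python
# Sketch: majority-vote label integration plus selective relabeling of the
# examples whose current label sets are most uncertain.
from scipy.stats import beta

examples = {                      # example id -> binary labels collected so far
    "a": [1, 1, 1, 1, 1],
    "b": [1, 0, 1, 0, 1],
    "c": [0, 0, 1],
}

def integrated_label(labels):
    return int(sum(labels) >= len(labels) / 2)

def uncertainty(labels):
    """P(underlying positive rate is on the 'wrong' side of 0.5), Beta(1+pos, 1+neg)."""
    pos, neg = sum(labels), len(labels) - sum(labels)
    p_below_half = beta.cdf(0.5, 1 + pos, 1 + neg)
    return min(p_below_half, 1 - p_below_half)   # small when the vote is lopsided

# Decide where the next labels go: the most uncertain examples first.
order = sorted(examples, key=lambda e: uncertainty(examples[e]), reverse=True)
for e in order:
    print(e, integrated_label(examples[e]), round(uncertainty(examples[e]), 3))
# "b" and "c" get relabeled before "a", whose five unanimous labels are already safe.
```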
    Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a new powerful paradigm that proved to be a successful complement to keyword searching. Thus far, the identification of the facets was either a manual procedure, or relied on apriori knowledge of the facets that can potentially appear in the underlying collection. In this paper, we present an unsupervised technique for automatic extraction of facets useful for browsing text databases. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful facet terms. For this, we first identify important phrases in each document. Then, we expand each phrase with "context" phrases using external resources, such as WordNet and Wikipedia, causing facet terms to appear in the expanded database. Finally, we compare the term distributions in the original database and the expanded database to identify the terms that can be used to construct browsing facets. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.
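    The core signal used to find facet terms, a term that is rare in the original documents but common after context expansion, can be sketched as a comparison of the two term distributions. The snippet below scores candidate terms by a smoothed log-ratio of their frequencies in the expanded versus the original collection; the expansion step itself (WordNet/Wikipedia lookups) and the paper's exact statistics are not reproduced, and the tiny corpora are illustrative.

```python
# Sketch: identify facet-term candidates by comparing term frequencies in the
# original collection vs. the collection expanded with external "context" phrases.
import math
from collections import Counter

original = [
    "nikon d80 10mp slr",
    "canon powershot compact zoom",
    "leica m8 rangefinder body",
]
# The same documents after a (hypothetical) expansion with WordNet/Wikipedia context.
expanded = [
    "nikon d80 10mp slr digital camera photography equipment",
    "canon powershot compact zoom digital camera photography equipment",
    "leica m8 rangefinder body digital camera photography equipment brand",
]

def freqs(docs):
    c = Counter(w for d in docs for w in d.split())
    return c, sum(c.values())

orig_c, orig_n = freqs(original)
exp_c, exp_n = freqs(expanded)

def facet_score(term, smoothing=1.0):
    """High when a term is frequent after expansion but rare in the original docs."""
    p_exp = (exp_c[term] + smoothing) / (exp_n + smoothing)
    p_orig = (orig_c[term] + smoothing) / (orig_n + smoothing)
    return math.log(p_exp / p_orig)

candidates = sorted(exp_c, key=facet_score, reverse=True)[:5]
print(candidates)   # broad terms such as "camera", "photography", "equipment"
                    # surface as facet candidates; specific model names do not.
```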
    Many valuable text databases on the web have non-crawlable contents that are ``hidden'' behind search interfaces. Metasearchers are helpful tools for searching over multiple such ``hidden-web'' text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and the associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying. We present a novel ``focused probing'' sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first that constructs content summaries that include the frequencies of the words in the database. Unfortunately, Zipf's law practically guarantees that, for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best category for a query, and then sends the query to the appropriate databases in the chosen category. The second algorithm uses ``shrinkage,'' a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to modify existing database selection algorithms to adaptively decide --at run-time-- whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms.
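    The shrinkage step can be written down compactly: the word probabilities estimated from a small document sample of a database are mixed with the word probabilities of the database's topic category, so category-specific words missing from the sample still receive non-zero probability. The fixed mixing weight and the toy counts below are illustrative; the paper selects the weight (and decides whether to shrink at all) adaptively per database and query.

```python
# Sketch of shrinkage for content summaries: mix the sparse sample-based word
# distribution of a database with the distribution of its topic category.
sample_counts = {"cancer": 40, "therapy": 25, "gene": 10}           # from sampled docs
category_counts = {"cancer": 900, "therapy": 400, "gene": 300,      # aggregated over all
                   "oncology": 250, "metastasis": 120}              # databases in "Health"

def distribution(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def shrunk_summary(sample, category, lam=0.7):
    """p_hat(w|DB) = lam * p_sample(w|DB) + (1 - lam) * p(w|category)."""
    p_s, p_c = distribution(sample), distribution(category)
    vocab = set(p_s) | set(p_c)
    return {w: lam * p_s.get(w, 0.0) + (1 - lam) * p_c.get(w, 0.0) for w in vocab}

summary = shrunk_summary(sample_counts, category_counts)
for w in sorted(summary, key=summary.get, reverse=True):
    print(f"{w:12s} {summary[w]:.3f}")
# "oncology" and "metastasis" were absent from the sample but now get non-zero
# probability, which helps database selection for queries with such rare words.
```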
    One of the common Web searches that have a strong local component is the search for hotel accommodation. Customers try to identify hotels that satisfy particular criteria, such as service, food quality, and so on. Unfortunately, today, the travel search engines provide only rudimentary ranking facilities, typically using a single ranking criterion such as distance from the city center, number of stars, price per night, or, more recently, customer reviews. This approach has obvious shortcomings. First, it ignores the multidimensional preferences of the consumer and, second, it largely ignores characteristics related to the location of the hotel, for instance, proximity to the beach or proximity to a downtown shopping area. These location-based features represent important characteristics that influence the desirability of a particular hotel. However, currently there are no established metrics that can isolate the importance of the location characteristics of hotels. In our work, we use the fact that the overall desirability of the hotel is reflected in the price of the rooms; therefore, using hedonic regressions, an established technique from econometrics, we estimate the weight that consumers place on different hotel characteristics.
    One of the common Web searches that have a strong local component is the search for hotels. Customers try to identify hotels that satisfy particular criteria, such as food quality, service, and so on. A crucial feature is the location of the hotel. For example, everything else being equal, a hotel close to the beach is typically more desirable than a hotel that is separated by a highway from the beachfront. Similarly, a hotel located in the downtown area can charge higher prices than that located at the outskirts of a city, and still make a sale. Such location-based features like “proximity to the beach” or “near downtown” represent important characteristics that influence the desirability of a particular hotel, and in turn can influence the corresponding rate of that hotel. However, currently there are no established economic metrics that can isolate the economic impact associated with these various features of a local hotel. Our goal is to empirically estimate the economic value of different hotel features, especially the location-based features given the associated local infrastructure. We aim to do so by combining state-of-the art econometric modeling with spatial data and image classification methods. Then, after inferring the economic significance of each feature, we will incorporate these features in a ranking function to improve the local search for hotels. This will result in a real-world application of our research.
    Nowadays, there is significant experimental evidence of excellent ex-post predictive accuracy in certain types of prediction markets, such as markets for presidential elections. This evidence shows that prediction markets are efficient mechanisms for aggregating information and are more accurate in forecasting events than traditional forecasting methods, such as polls. In this paper, we present a model of a prediction market with a binary payoff on a competitive event involving two parties, where the underlying "abilities" of the competing parties evolve as Ito diffusions. We show that if the prediction market for this event is efficient and accurate, the price of the corresponding contract will also follow a diffusion and its instantaneous volatility is a function of only the current claim price and its time to expiration. We then generalize our results to competitive events involving more than two parties and show that volatilities of prediction market contracts for such events are again functions of only the current claim prices, the time to expiration, and several additional parameters (ternary correlations). Finally, we validate our model on a set of InTrade prediction markets and show that it is consistent with observed volatilities of contract returns. Practical applications of our model may include event detection and ranking, and improving forecast standard errors in political markets.
    The increasing pervasiveness of the Internet has dramatically changed the way that consumers shop for goods. Consumer-generated product reviews have become a valuable source of information for customers, who read the reviews and decide whether to buy the product based on the information provided. In this paper, we use techniques that decompose the reviews into segments that evaluate the individual characteristics of a product (e.g., image quality and battery life for a digital camera). Then, as a major contribution of this paper, we adapt methods from the econometrics literature, specifically the hedonic regression concept, to estimate: (a) the weight that customers place on each individual product feature, (b) the implicit evaluation score that customers assign to each feature, and (c) how these evaluations affect the revenue for a given product. Towards this goal, we develop a novel hybrid technique combining text mining and econometrics that models consumer product reviews as elements in a tensor product of feature and evaluation spaces. We then impute the quantitative impact of consumer reviews on product demand as a linear functional from this tensor product space. We demonstrate how to use a low-dimension approximation of this functional to significantly reduce the number of model parameters, while still providing good experimental results. We evaluate our technique using a data set from Amazon.com consisting of sales data and the related consumer reviews posted over a 15-month period for 242 products. Our experimental evaluation shows that we can extract actionable business intelligence from the data and better understand the customer preferences and actions. We also show that the textual portion of the reviews can improve product sales prediction compared to a baseline technique that simply relies on numeric data.
    Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
    With the rapid growth of the Internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. Consumers naturally gravitate to reading reviews in order to decide whether to buy a product. However, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. Similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. In this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. Our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. We show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. Our results can have several implications for the market design of online opinion forums.
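    A hedged sketch of what a consumer-oriented ranking signal might look like: score each review by a crude subjectivity measure and rank by it. The tiny lexicon, the scoring rule, and the sample reviews below are invented; the paper's approach combines econometric analysis with trained subjectivity classifiers rather than a word list.

```python
# Illustrative review-ranking signal: rank reviews by how subjective their
# sentences look. The lexicon and rule are stand-ins, not a trained model.
import re

SUBJECTIVE = {"love", "hate", "great", "terrible", "awful",
              "amazing", "disappointing", "perfect", "worst"}

def subjectivity(sentence: str) -> float:
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(w in SUBJECTIVE for w in words) / max(len(words), 1)

def review_score(review: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    scores = [subjectivity(s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0

reviews = [
    "Battery lasts two days. The lens is 24-70mm. Ships with a charger.",
    "I love this camera, the pictures are amazing and the zoom is great!",
    "Terrible autofocus, awful menus, the worst purchase I have made.",
]
for r in sorted(reviews, key=review_score, reverse=True):
    print(f"{review_score(r):.2f}  {r[:50]}...")
```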
    Large amounts of (often valuable) information are stored in web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
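    The sketch below shows how a Cox proportional-hazards model can be fit to fabricated content-summary change histories with the lifelines library, and how the resulting risk scores could prioritize which databases to re-contact first. The covariates, data, and scheduling rule are illustrative assumptions, not the study's actual model inputs.

```python
# Sketch: fit a Cox proportional-hazards model on made-up database histories
# to estimate how quickly each content summary becomes stale, then rank
# databases by predicted risk of change. Feature names are illustrative.
import pandas as pd
from lifelines import CoxPHFitter   # pip install lifelines

# weeks_to_change: weeks until the summary changed enough to need an update
# changed: 1 if a change was observed during the study, 0 if censored
history = pd.DataFrame({
    "weeks_to_change":  [3, 10, 25, 6, 40, 15, 8, 30],
    "changed":          [1,  1,  1, 1,  0,  1, 1,  0],
    "db_size_log":      [4.2, 5.1, 6.3, 4.8, 6.9, 5.5, 4.5, 6.1],
    "past_change_rate": [0.6, 0.3, 0.1, 0.5, 0.05, 0.2, 0.4, 0.1],
})

cph = CoxPHFitter()
cph.fit(history, duration_col="weeks_to_change", event_col="changed")
cph.print_summary()

# Higher predicted hazard -> schedule the summary update sooner.
risk = cph.predict_partial_hazard(history[["db_size_log", "past_change_rate"]])
print(history.assign(risk=risk).sort_values("risk", ascending=False))
```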
    Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
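    As a minimal illustration of field-level similarity matching for duplicate detection, the sketch below uses the standard library's SequenceMatcher as a stand-in for the similarity metrics surveyed in the paper (edit distance, Jaro-Winkler, and so on); the field weights and threshold are arbitrary.

```python
# Field-level similarity for duplicate record detection, using SequenceMatcher
# as a stand-in metric. Weights and threshold are illustrative only.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def records_match(r1: dict, r2: dict, weights: dict, threshold: float = 0.8) -> bool:
    # Weighted average of per-field similarities, compared to a cutoff.
    score = sum(w * field_sim(r1[f], r2[f]) for f, w in weights.items())
    return score / sum(weights.values()) >= threshold

r1 = {"name": "Jon Smith",  "addr": "12 Main Street, NY"}
r2 = {"name": "John Smith", "addr": "12 Main St., New York"}
print(records_match(r1, r2, weights={"name": 0.6, "addr": 0.4}))
```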
    Deriving the polarity and strength of opinions is an important research topic, attracting significant attention over the last few years. In this work, to measure the strength and polarity of an opinion, we consider the economic context in which the opinion is evaluated, instead of using human annotators or linguistic resources. We rely on the fact that text in on-line systems influences the behavior of humans and this effect can be observed using some easy-to-measure economic variables, such as revenues or product prices. By reversing the logic, we infer the semantic orientation and strength of an opinion by tracing the changes in the associated economic variable. In effect, we use econometrics to identify the "economic value of text" and assign a "dollar value" to each opinion phrase, measuring sentiment effectively and without the need for manual labeling. We argue that by interpreting opinions using econometrics, we have the first objective, quantifiable, and context-sensitive evaluation of opinions. We make the discussion concrete by presenting results on the reputation system of Amazon.com. We show that user feedback affects the pricing power of merchants and by measuring their pricing power we can infer the polarity and strength of the underlying feedback postings.
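    A toy version of the "economic value of text" idea: regress invented merchant price premiums on counts of opinion phrases in their feedback, so each phrase receives a fitted dollar value whose sign and magnitude act as polarity and strength. This only sketches the intuition, not the paper's econometric specification; all phrases, counts, and premiums are fabricated.

```python
# Toy "dollar value of a phrase" regression with made-up data.
import numpy as np

phrases = ["fast shipping", "never arrived", "great packaging", "rude seller"]
# Rows = merchants; columns = how often each phrase appears in their feedback.
counts = np.array([
    [12,  0,  5,  0],
    [ 2,  4,  1,  3],
    [ 8,  1,  6,  0],
    [ 0,  6,  0,  5],
    [15,  0,  9,  1],
])
premium = np.array([3.1, -1.8, 2.2, -3.5, 4.0])   # observed price premiums ($)

# Least-squares fit: each coefficient is the estimated premium change
# associated with one additional mention of the phrase.
A = np.column_stack([np.ones(len(counts)), counts])
coef, *_ = np.linalg.lstsq(A, premium, rcond=None)
for phrase, dollars in zip(phrases, coef[1:]):
    print(f"{phrase!r}: ${dollars:+.2f} per mention")
```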
    We demonstrate a fully working system for multifaceted browsing over large collections of text-annotated data, such as annotated images, that are stored in relational databases. Typically, such databases can be browsed across multiple facets (by topic, genre, location, and so on) and previous user studies showed that multifaceted interfaces substantially improve the ability of users to identify items of interest in the database. We demonstrate a scalable system that automatically generates multifaceted browsing hierarchies on top of a relational database that stores the underlying text-annotated objects. Our system supports a wide range of ranking alternatives for selecting and displaying the best facets and the best portions of the generated hierarchies, to facilitate browsing. We combine our ranking schemes with Rapid Serial Visual Presentation (RSVP), an advanced visualization technique that further enhances the browsing experience, and we demonstrate how to use prefetching techniques to overcome the latency issues that are inherent when browsing the contents of a relational database using multifaceted interfaces.
    The increasing pervasiveness of the Internet has dramatically changed the way that consumers shop for goods. Consumer-generated product reviews have become a valuable source of information for customers, who read the reviews and decide whether to buy the product based on the information provided. In this paper, we use techniques that decompose the reviews into segments that evaluate the individual characteristics of a product (e.g., image quality and battery life for a digital camera). Then, as a major contribution of this paper, we adapt methods from the econometrics literature, specifically the hedonic regression concept, to estimate: (a) the weight that customers place on each individual product feature, (b) the implicit evaluation score that customers assign to each feature, and (c) how these evaluations affect the revenue for a given product. Towards this goal, we develop a novel hybrid technique combining text mining and econometrics that models consumer product reviews as elements in a tensor product of feature and evaluation spaces. We then impute the quantitative impact of consumer reviews on product demand as a linear functional from this tensor product space. We demonstrate how to use a low-dimension approximation of this functional to significantly reduce the number of model parameters, while still providing good experimental results. We evaluate our technique using a data set from Amazon.com consisting of sales data and the related consumer reviews posted over a 15-month period for 242 products. Our experimental evaluation shows that we can extract actionable business intelligence from the data and better understand the customer preferences and actions. We also show that the textual portion of the reviews can improve product sales prediction compared to a baseline technique that simply relies on numeric data.
    Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
    With the rapid growth of the Internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. Consumers naturally gravitate to reading reviews in order to decide whether to buy a product. However, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. Similarly, the manufacturer of a product wants to identify the reviews that influence the customer base, and examine the content of these reviews. In this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. Our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. We show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. Our results can have several implications for the market design of online opinion forums.
    Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a new powerful paradigm which has been proven to be a successful complement to keyword searching. Thus far, the generation of faceted interfaces relied either on manual identification of the facets, or on a priori knowledge of the facets that can potentially appear in the underlying database. In this paper, we present our ongoing research towards automatic identification of facets that can be used to browse a collection of free-text documents. We present some preliminary results on building facets on top of a news archive. The results are promising and suggest directions for future research.
    Large amounts of (often valuable) information are stored in Web-accessible text databases. "Metasearchers" provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real Web databases evolved over a period of 52 weeks. Then, we show how to use "survival analysis" techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real Web databases.
    Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted hierarchies represent a new powerful browsing paradigm which has been proven to be a successful complement to keyword searching. Thus far, multifaceted hierarchies have been created manually or semi-automatically, making it difficult to deploy multifaceted interfaces over a large number of databases. We present automatic and scalable methods for creation of multifaceted interfaces. Our methods are integrated with traditional relational databases and can scale well for large databases. Furthermore, we present methods for selecting the best portions of the generated hierarchies when the screen space is not sufficient for displaying all the hierarchy at once. We apply our technique to a range of large data sets, including annotated images, television programming schedules, and web pages. The results are promising and suggest directions for future research.
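    One simplified way to bootstrap candidate facets from free-text annotations, loosely in the spirit of the work above, is to map annotation terms to WordNet hypernyms and keep the supertypes that cover many items. The sketch below does exactly that on invented annotations; it is not the papers' exact algorithm or ranking scheme.

```python
# Hedged sketch: derive candidate facets by promoting WordNet hypernyms of
# annotation terms and scoring them by how many items they cover.
from collections import Counter
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

annotations = {          # invented text annotations for four images
    "img1": ["lion", "savanna"],
    "img2": ["tiger", "forest"],
    "img3": ["guitar", "concert"],
    "img4": ["violin", "orchestra"],
}

coverage = Counter()
for item, terms in annotations.items():
    supertypes = set()
    for term in terms:
        for syn in wn.synsets(term, pos=wn.NOUN):
            for hyper in syn.hypernyms():
                supertypes.add(hyper.lemma_names()[0])
    for s in supertypes:
        coverage[s] += 1      # count each supertype once per item

# Candidate facets = supertypes that cover the most items.
print(coverage.most_common(5))
```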
    Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
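    The core of the shrinkage idea can be written as a simple mixture, p_shrunk(w) = λ·p_db(w) + (1−λ)·p_category(w), so that words absent from a small sample still receive non-zero probability from the category. The sketch below applies this to a tiny fabricated sample; the fixed λ here is a guess, whereas the paper chooses the mixing more carefully.

```python
# Minimal shrinkage-style smoothing of a sampled content summary with its
# category's word distribution. Counts and lambda are fabricated.
def shrink(db_counts: dict, category_counts: dict, lam: float = 0.7) -> dict:
    db_total = sum(db_counts.values())
    cat_total = sum(category_counts.values())
    vocab = set(db_counts) | set(category_counts)
    return {
        w: lam * db_counts.get(w, 0) / db_total
           + (1 - lam) * category_counts.get(w, 0) / cat_total
        for w in vocab
    }

sample_summary = {"cancer": 40, "treatment": 25, "trial": 10}     # small sample
health_category = {"cancer": 900, "vaccine": 400, "metastasis": 120,
                   "treatment": 700, "trial": 300}

smoothed = shrink(sample_summary, health_category)
print(smoothed["metastasis"])   # > 0 even though it never appeared in the sample
```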
    An organization’s data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these “text joins” are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. In this paper, we propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
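    For context, the quantity being joined on is the cosine similarity between token-weight vectors of the textual attributes. The plain-Python sketch below computes the exact metric on toy strings (plain term frequencies rather than the tf.idf weights typically used), while the paper's contribution is approximating this join with sampling inside an unmodified RDBMS.

```python
# Exact cosine-similarity text join on toy data (illustration only; the
# paper's method is an approximate, sampling-based version inside SQL).
import math
from collections import Counter

def weights(text: str) -> dict:
    # Term-frequency vector, L2-normalized (tf.idf would be used in practice).
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def cosine(a: str, b: str) -> float:
    wa, wb = weights(a), weights(b)
    return sum(wa[t] * wb.get(t, 0.0) for t in wa)

r1 = ["ACME Inc New York", "Widget Corp Boston"]
r2 = ["ACME Incorporated New York NY", "Widgets Corporation Boston MA"]
for a in r1:
    best = max(r2, key=lambda b: cosine(a, b))
    print(f"{a!r} -> {best!r} ({cosine(a, best):.2f})")
```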
    The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases “hidden” behind search interfaces. For example, current search engines largely ignore the contents of the Library of Congress, the US Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not “crawlable.” However, users should be able to find the information that they need with as little effort as possible, regardless of whether this information is crawlable or not. As a significant step towards this goal, we have designed algorithms that support browsing and searching -the two dominant ways of finding information on the web- over “hidden-web” text databases. To support browsing, we have developed QProber, a system that automatically categorizes hidden-web text databases in a classification scheme, according to their topical focus. QProber categorizes databases without retrieving any document. Instead, QProber uses just the number of matches generated from a small number of topically focused query probes. The query probes are automatically generated using state-of-the-art supervised machine learning techniques and are typically short. QProber’s classification approach is sometimes orders of magnitude faster than approaches that require document retrieval. To support searching, we have developed crucial building blocks for constructing sophisticated metasearchers, which search over many text databases at once through a unified query interface. For scalability and effectiveness, it is crucial for a metasearcher to have a good database selection component and send queries only to databases with relevant content. Usually, database selection algorithms rely on statistics that characterize the contents of each database. Unfortunately, many hidden-web text databases are completely autonomous and do not report any summaries of their contents. To build content summaries for such databases, we extract a small, topically focused document sample from each database during categorization and use it to build the respective content summaries. A potential problem with content summaries derived from document samples is that any reasonably small sample will suffer from data sparseness and will not contain many words that appear in the database. To enhance the sparse samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects first the best category for a query and then sends the query to the appropriate databases in the chosen category. The second database selection algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. The shrinkage-enhanced summaries characterize the database contents better than their “unshrunk” counterparts do, and in turn help produce significantly more relevant database selection decisions and overall search results. Content summaries of static databases do not need to change over time. 
However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. To understand how real-world databases change over time and how these changes propagate to the database content summaries, we studied how the content summaries of 152 real web databases changed every week, for a period of 52 weeks. Then, we used “survival analysis” techniques to examine which parameters can help predict when the content summaries need to be updated. Based on the results of this study, we designed algorithms that analyze various characteristics of the databases and their update history to predict when the content summaries need to be modified, thus avoiding overloading the databases unnecessarily. In summary, this thesis presents building blocks that are critical to enable access to the often valuable contents of hidden-web text databases, hopefully approaching the goal of making access to these databases as easy and efficient as over regular web pages.
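    A highly simplified QProber-style probing loop is sketched below: classification rules become boolean query probes, only the number of matches per probe is requested from the database, and categories with enough probe support are assigned. The probes, threshold, and the fake match-count backend are invented for illustration.

```python
# Simplified query-probing classification of a hidden-web database: issue
# rule-derived probes, read only the hit counts, and assign categories.
PROBES = {
    "Health":    ["cancer AND treatment", "vaccine", "metastasis"],
    "Computers": ["linux AND kernel", "compiler", "ethernet"],
    "Sports":    ["playoffs", "goalkeeper", "marathon AND runner"],
}

def count_matches(probe: str) -> int:
    # Stand-in for sending the probe to a search interface and reading the
    # reported number of matches; here we fake a medical database.
    fake_counts = {"cancer AND treatment": 5400, "vaccine": 2100,
                   "metastasis": 900}
    return fake_counts.get(probe, 3)

def classify(threshold: int = 100) -> list:
    totals = {cat: sum(count_matches(p) for p in probes)
              for cat, probes in PROBES.items()}
    return [cat for cat, total in totals.items() if total >= threshold]

print(classify())   # -> ['Health'] for this simulated database
```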
    Applications that require access to such databases often resort to querying to extract relevant documents, for two main reasons. First, some text databases on the web are not "crawlable," and hence the only way to retrieve their documents is via querying. Second, applications often require only a small fraction of a database's contents, so retrieving relevant documents via querying is an attractive choice from an efficiency viewpoint, even for crawlable databases. Often an application's query-based strategy starts with a small number of user-provided queries. Then, new queries are extracted, in an application-dependent way, from the documents in the initial query results, and the process iterates. The success of this common type of strategy relies on retrieved documents "contributing" new queries. If new documents fail to produce new queries, then the process might stall before all relevant documents are retrieved. In this paper, we develop a graph-based "reachability" metric that allows us to characterize when an application's query-based strategy will successfully "reach" all documents that the application needs. We complement our metric with an efficient sampling-based technique that accurately estimates the reachability associated with a text database and an application's query-based strategy. We report preliminary experiments backing the usefulness of our metric and the accuracy of the associated estimation technique over real text databases and for two applications.
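    The sketch below illustrates the reachability intuition on a tiny invented query-document graph: starting from seed queries, walk to the documents they retrieve and to the new queries those documents contribute, and report the fraction of relevant documents ever reached. It is a toy walk-through, not the paper's metric or its sampling-based estimator.

```python
# Toy reachability computation over an invented query-document graph.
from collections import deque

queries_from_doc = {          # queries an application would extract from each doc
    "d1": ["q2"], "d2": ["q3"], "d3": [], "d4": ["q1"],
}
docs_from_query = {           # documents each query retrieves
    "q1": ["d1", "d2"], "q2": ["d2"], "q3": ["d3"],
}
relevant = {"d1", "d2", "d3", "d4"}

def reachable(seed_queries):
    seen_docs, frontier = set(), deque(seed_queries)
    issued = set(frontier)
    while frontier:
        q = frontier.popleft()
        for d in docs_from_query.get(q, []):
            if d not in seen_docs:
                seen_docs.add(d)
                for nq in queries_from_doc.get(d, []):
                    if nq not in issued:
                        issued.add(nq)
                        frontier.append(nq)
    return seen_docs

reached = reachable(["q1"])
print(f"reachability = {len(reached & relevant) / len(relevant):.2f}")  # 0.75 here
```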
    case the result returned by the Figure 1 query is incomplete and suffers from "false negatives," in contrast to our claim to the contrary in [GIJ 01b]. In general, the string pairs that are omitted are pairs of short strings. Even when these strings match within a small edit distance, the match tends to be meaningless (e.g., "IBM" matches "ACM" within edit distance 2). However, when it is absolutely necessary to have no false negatives, we can make the appropriate modifications to the SQL query in Figure 1 so that it produces the correct results. Since the false negatives are only pairs of short strings, we can join all pairs of these small strings, using only the length filter, and UNION the result with the result of the SQL query described in [GIJ 01b]. We list the modified query in Figure 2. We now experimentally measure the number of false negatives from which the query in [GIJ 01b] (Figure 1) can suffer. For the experiments we use the same thre
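    At a high level, the fix described above amounts to computing the main approximate join and then unioning in an exhaustive comparison of the short-string pairs that pass the length filter. The Python sketch below mirrors that logic with an illustrative edit-distance routine, cutoff, and toy data; the actual fix is the modified SQL query shown in Figure 2, not this code.

```python
# Sketch of the "union in the short-string pairs" fix: compare all pairs of
# short strings (length filter first, then edit distance) and add them to the
# main approximate-join result. Cutoffs and data are illustrative.
from itertools import product

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def short_pairs(strings1, strings2, k=2, short_len=5):
    out = set()
    for a, b in product(strings1, strings2):
        if len(a) <= short_len and len(b) <= short_len \
                and abs(len(a) - len(b)) <= k \
                and edit_distance(a, b) <= k:
            out.add((a, b))
    return out

approx_join_result = {("database", "databases")}   # from the main approximate query
final = approx_join_result | short_pairs(["IBM", "ACM"], ["IBM Corp", "ACM"])
print(final)
```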