High-Speed Idea Filtering With the Bag of Lemons
Mark Klein a,b
Ana Cristina Bicharra Garciac
a Massachusetts Institute of Technology, 77 Massachusetts Avenue, NE25-754, Cambridge MA USA 02139
b University of Zurich, Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
c Universidade Federal Fluminense, Rua Passos da Patria, 156, Niteroi, RJ, Brazil
Open innovation platforms (web sites where crowds post ideas in a shared space) enable us to elicit huge
volumes of potentially valuable solutions for problems we care about, but identifying the best ideas in
these collections can be prohibitively expensive and time-consuming. This paper presents an approach,
called the "bag of lemons", which enables crowds to filter ideas with accuracy superior to conventional
(Likert scale) rating approaches, but in only a fraction of the time. The key insight behind this approach
is that crowds are much better at eliminating bad ideas than at identifying good ones.
Key Words: collective intelligence; open innovation; social computing; idea filtering
"Open innovation" is the concept of going outside your organization (e.g. to customers, suppliers,
stakeholders, and other interested parties) to get ideas for how to solve challenging problems
(Chesbrough, 2003). Increasingly, organizations are using web-based open innovation software
platforms (such as Spigit, Imaginitik, Nosco, BrightIdea, Salesforce, and Ideascale), also sometimes referred to as "idea management" or "social ideation" systems, as a powerful tool to
solicit ideas from open communities (Bailey and Horvitz, 2010) (Bjelland and Wood, 2008)
(Chesbrough et al., 2008) (Morgan and Wang, 2010) (Westerski et al., 2011) (Hrastinski et al., 2010)
(Piller and Walcher, 2006) (Enkel et al., 2005) (Gassmann, 2006) (Lakhani and Panetta, 2007) (Ogawa
and Piller, 2006) (West and Lakhani, 2008) (Von Hippel, 2009). In domains ranging from government
to industry to NGOs, they have been finding that crowds are willing and able to volunteer ideas, for
questions they care about, in vast volumes. The six-day IBM "Innovation Jam" in 2006, for example,
involved over 150,000 participants from 104 countries in identifying 46,000 product ideas for the
company (Bjelland and Wood, 2008). Dell's ongoing Ideastorm website (Di Gangi and Wasko, 2009)
has received, to date, over 20,000 suggestions for improved Dell products and services. In the early
weeks of his first term, President Obama asked citizens to submit questions on his web site,
and promised to answer the top 5 questions in each category in a major press conference (Phillips). Over
70,000 questions were submitted. Google’s 10 to the 100th project received over 150,000 suggestions on
how to channel Google's charitable contributions (Buskirk, 2010), while the 8,000 participants in the
2009 Singapore Thinkathon generated 454,000 ideas (Butterworth, 2005). This kind of engagement thus
gives organizations access, at very low cost, to a much broader selection of ideas, increasing the
likelihood that they will encounter truly superior "out-of-the-box" solutions (Lakhani and Jeppesen).
This very success has, however, raised a new dilemma: screening this outpouring of ideas to identify the
ones most worth implementing (Toubia and Florès, 2007). Open innovation engagements tend to
generate idea corpuses that are large, highly redundant, and of highly variable quality (Riedl et al., 2010)
(Schulze et al., 2012) (Westerski et al., 2013) (Blohm et al., 2011a) (Bjelland and Wood, 2008) (Di
Gangi and Wasko, 2009). Previous research suggests that about 10-30% of the ideas from open
innovation engagements are considered, by the customers, as being high quality (Blohm et al., 2011a).
Convening a group of experts to identify the best ideas, from these corpuses, can be prohibitively
expensive and slow. Nearly 100 senior executives at IBM, for example, had to spend weeks sifting
through the tens of thousands of postings generated by their Web Jam (Bjelland and Wood, 2008).
Google had to recruit 3000 Google employees to filter the unexpected deluge of ideas for the 10 to the
100th project, a process that put them 9 months behind schedule (Buskirk, 2010). The change.gov
website, finally, had to be shut down prematurely because the huge volume of contributions
overwhelmed the staff's ability to meaningfully process it. It has been estimated that it takes about $500
and four hours to evaluate one idea in a Fortune 100 company (Robinson and Schroeder, 2009).
In response to this, organizations have turned to crowds to not just generate ideas, but also filter them, so
only the best ideas need be considered by the decision makers. It has in fact been shown that crowds,
under the right circumstances, can solve classification problems of this kind with accuracy equal to or even
better than that of experts (Surowiecki, 2005). But this approach, in practice, has been no panacea. As
we will see below, existing filtering approaches, when faced with large idea corpuses, tend to fare
poorly in terms of accuracy, and can make unrealistic demands on crowd participants in terms of time
and cognitive complexity.
In this paper we will present a novel crowd-based idea filtering technique for meeting this important
challenge. We will review the strengths and shortcomings of existing idea filtering techniques, describe
our own approach to the problem, present an empirical evaluation conducted as part of a "real-world"
open innovation engagement, and discuss the contributions and future directions for this work.
To provide context for our approach, we review existing techniques that can be applied to idea filtering
in open innovation settings. These techniques fall into several categories (Figure 1):
Figure 1. A taxonomy of idea filtering techniques.
Author-based filtering filters out ideas based on who contributed them. Authors can be excluded, for
example, based on their previous behavior (i.e. based on their reputation) (Kittur et al., 2013). This
requires substantial prior knowledge about the authors, however, and may result in low recall, since
good ideas can often come from unexpected quarters. Authors can also be excluded based on "gold
questions", wherein contributors are asked, before submitting an idea, to perform a simple task with a
known answer in order to assess whether or not they have a basic level of competence (Oleson et al.,
2011). This approach may help filter out some of the worst content but has only, as far as we are aware,
been applied to estimating quality in crowd-sourced micro-tasks, rather than for filtering ideas from
open innovation engagements.
Content-based filtering distinguishes among ideas based on their content, rather than their author. One
approach is algorithmic, wherein we use software to derive metrics for idea quality based on such
features as word frequency statistics. Walter et al, for example, measure the presence of rarely-used
words to estimate the creativity of a contribution (Walter and Back, 2013). Westerski derived metrics
based on manually- as well as machine-generated idea annotations (e.g. concerning what triggered the
idea) (Westerski et al., 2013). Such techniques are also fundamentally limited by the fact that current
natural language processing algorithms have only a shallow understanding of natural language, and thus
can be easily fooled. In the Westerski work, for example, the automatically-generated idea quality
metrics only achieved a correlation of 0.1 with the quality scores given by human experts. We can also
use machine learning to define idea filtering rules, if we have examples of desired and non-desired
content. A learning-based approach has proven useful in contexts, such as movie recommendations or
email spam filtering, where very large training sets are readily available (Brennan et al., 2010)
(Adomavicius and Tuzhilin, 2005) (Caruana and Li, 2012). Finding such training sets is problematic for
open innovation engagements, however, because creating them requires exhaustive human evaluation of
large idea corpuses, and the rules learned for one particular idea corpus and set of evaluation criteria
may not apply equally well in other contexts.
For this reason, much attention has been given to crowd-based filtering, where human participants are
asked to select the top ideas, since they can potentially understand the ideas much more deeply than
current software can.
Figure 2. Examples of crowd-based filtering methods.
This can be done in many ways, including (Figure 2):
Voting, where participants vote for which ideas should be selected
Rating, where participants give ideas a numeric quality score
Ranking, where participants place the ideas into a full or partial order, and
Prediction markets, where users buy and sell stocks representing predicted winners, knowing they
will receive a payoff if they own stocks that are eventually selected as winners: the stock prices then
represent the crowd's assessment of the likelihood of the associated prediction
Voting systems ask crowd members to vote for the ideas that they think merit adoption. In an idea-
filtering context, a multi-voting (Kessler, 1995) approach is typically taken, where users are asked to
allocate a budget of N votes to the best ideas in the corpus, e.g. as in (Bao et al., 2011). Voting systems
are simple to use but face well-known practical as well as theoretical limitations, especially when
applied to large option sets (Arrow, 1963) (Kostakos, 2009).
Rating systems (Likert, 1932) (Salminen and Harmaakorpi, 2012) can gather useful feedback with
moderate numbers of options, but are prone to several challenging dysfunctions. One is that rating
systems tend to elicit mainly average scores from raters, and thus tend to do a poor job of distinguishing
between good and excellent ideas (Bao et al., 2011). Another is that rating systems tend to lock into
fairly static and arbitrary rankings with large option sets: people do not have time to rate all the options
and thus tend to consider only those that have already received good ratings, creating positive feedback
loops (Salganik et al., 2006) (Riedl et al., 2013) (Bjelland and Wood, 2008) (Yang, 2012). This problem
can be alleviated somewhat by using algorithms that adaptively assign ideas to raters (e.g. focusing on
the ideas most likely to have been misclassified) to maximize rating accuracy (Toubia and Florès, 2007).
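One simple instance of such adaptive assignment can be sketched as follows (the routing heuristic, names, and threshold are our own illustration, not the actual Toubia and Florès algorithm): route each new rating request to the idea whose current mean rating lies closest to the accept/reject boundary, since such ideas are the most likely to be misclassified.

```python
def next_idea_to_rate(ratings, threshold=3.0):
    """Route the next rater to the most uncertain idea.

    ratings: {idea_id: [scores gathered so far]}. An idea whose mean score
    sits closest to the accept/reject threshold benefits most from another
    rating; unrated ideas (distance 0.0, count 0) are served first.
    """
    def priority(item):
        idea, scores = item
        if not scores:
            return (0.0, 0)  # unrated: maximal uncertainty
        mean = sum(scores) / len(scores)
        return (abs(mean - threshold), len(scores))
    return min(ratings.items(), key=priority)[0]
```

For example, an idea hovering at a mean of 3.0 would be rated again before ideas already clearly above or below the boundary.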
Ranking systems ask participants to provide relative rankings of idea pairs (Salganik and Levy, 2012)
(Miller et al., 2009) (Baez and Convertino, 2012) (Saaty, 1990) rather than rating ideas individually.
Such systems can help alleviate rating lock, but the number of required comparisons increases as the
square of the number of ideas, creating severe scaling challenges for large idea sets.
All the approaches mentioned so far are prone to a disconnect between the criteria used by raters
and those used by decision-makers (Forsythe et al., 1999). Raters often assess ideas based on their personal
criteria, or even self-interest, rather than on the potential value of the idea to the open innovation
customer (Spears et al., 2009) (Newell et al., 2004). This can be alleviated by asking participants to
evaluate ideas with respect to multiple customer-defined criteria (Dean et al., 2006) (Riedl et al., 2010)
(Riedl et al., 2013) but at the cost of increasing time and cognitive complexity demands for the raters
and thus potentially reducing participation.
Prediction markets (Bothos et al., 2009; Slamka et al., 2012) (Blohm et al., 2011b) (Berg and Rietz,
2003) (Wolfers and Zitzewitz, 2004) (Soukhoroukova et al., 2012) (Dahan et al., 2011) provide
incentives for raters to use the same criteria as decision-makers, but they face serious practical
problems. They require traders to participate in cognitively complex and time-consuming tasks (Blohm
et al., 2011b) (Bothos et al., 2012), are prone to manipulation by traders (Forsythe et al., 1999)
(Hanson, 2004) (Wolfers and Zitzewitz, 2004) (Wolfers and Leigh, 2002), and face substantial
scalability challenges, since, ideally, all traders should compare every idea to find the best ones. They
also often generate too little trading activity to yield meaningful stock prices, especially since the
benefits to the traders of getting the correct portfolio are often too nominal to merit a substantial
ongoing time investment (Hanson, 2003; Healy et al., 2010). Initial studies have in fact
reported low correlations between idea market share prices and rankings by human experts, e.g. 0.33 in
(Blohm et al., 2011b), 0.43 in (LaComb et al., 2007) and 0.10, 0.39 and 0.46 in (Soukhoroukova et al.,
2012). Dahan et al reported a correlation of 0.86 (Dahan et al., 2011), but this approach was applied to
very small idea markets with only eight stocks. Blohm et al reported that multi-criteria rating out-
performed prediction markets in terms of evaluation accuracy (Blohm et al., 2011b). Thus, despite their
promise, there is little evidence that idea prediction markets out-perform other crowd-based techniques
for idea filtering for open innovation engagements (Soukhoroukova et al., 2012).
We designed our approach to address the limitations of existing idea filtering techniques. The idea is
simple. Raters are provided with the list of candidate ideas, as well as a clear description of the decision-
makers' selection criteria. They are then given a limited number of "votes", and asked to allocate them to
ideas based on whether or not they believe the ideas will be selected as top candidates by the decision
makers. The more confident they feel about a judgment, the more votes they can allocate to that idea
(within the limits of the overall vote budget). Raters are given financial incentives for allocating votes
accurately. Ideas can then be filtered based on the number of votes each idea received. This approach is
potentially attractive, we believe, for at least the following reasons:
incentive alignment: just as in prediction markets, crowd participants are given incentives to make idea
evaluations that align with those of the decision makers.
time demands: rather than having to figure out the correct rating for every idea, participants need
only identify the small subset that they think are most (or least, see below) likely to be selected by
the decision makers.
cognitive complexity: participants are not required to deal with the cognitive overhead of trading and
monitoring stock prices (as in idea prediction markets). In addition, as we will discuss below, the trick
of asking users to assign votes to the worse ideas (rather than the best ones) further simplifies the
evaluation process.
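The mechanics of the approach can be sketched as follows (a minimal illustration; the function names, budget constant, and default keep fraction are ours, not part of any published system). Each rater submits a ballot allocating lemons from a fixed budget; ideas are then ranked by total lemons received, fewest first, and the top fraction is passed on to the decision makers:

```python
from collections import Counter

LEMON_BUDGET = 10  # tokens per rater, roughly 20% of the idea corpus size

def filter_with_lemons(ideas, ballots, keep_fraction=0.4):
    """ideas: list of idea ids; ballots: one {idea: lemons} dict per rater.
    Returns the ideas with the fewest lemons, i.e. least likely to be bad."""
    tally = Counter({idea: 0 for idea in ideas})
    for ballot in ballots:
        assert sum(ballot.values()) <= LEMON_BUDGET  # enforce the vote budget
        tally.update(ballot)
    ranked = sorted(ideas, key=lambda idea: tally[idea])  # fewest lemons first
    keep = max(1, round(keep_fraction * len(ideas)))
    return ranked[:keep]
```

A bag-of-stars variant would be identical except that ideas would be ranked by most stars first.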
Our hypothesis is that these innovations will allow crowds to achieve at least comparable levels of
accuracy in filtering idea sets, while requiring less rater time, than existing techniques. The design and
results of our experiments to test this hypothesis are discussed in the sections below.
The experiments occurred in the context of an R&D lab in Fluminense Federal University in Brazil
which, for the last 17 years, has been developing software solutions for complex problems in the
petroleum exploration and production domain. The lab employs 70 students, professors, managers,
programmers and staff with expertise in such areas as computer science, engineering, statistics, physics,
linguistics, and management.
Productivity has been a challenge for the lab, since competition with other research labs is growing, as is
the complexity of the problems presented by the clients. The lab director decided, accordingly, to try
soliciting productivity-enhancement suggestions from the lab members themselves, because of several
potential advantages:
it broadens the circle of people providing ideas, increasing the possibility of finding superior "out-of-
the-box" solutions
the suggestions come from people who are intimately familiar with the lab's structure and challenges,
and are thus more likely to address its particular needs
it may reduce the resistance of the lab members to deploying new practices, since the ideas would
come from the lab members themselves
it alleviates confidentiality issues by keeping the idea generation process within the lab itself
Our experimental evaluation consisted of two stages: an open innovation engagement that collected a
corpus of productivity enhancement ideas from members of the lab, followed by an idea filtering
engagement which compared our multi-voting approach with a widely-used idea rating scheme - the
five-level Likert Scale (Likert, 1932) - as techniques for identifying the best ideas from within the
idea corpus. We will discuss these elements in the paragraphs below.
Open Innovation Engagement: Ideas were gathered from lab members using the Deliberatorium, an open
innovation system developed for use with large user communities (Klein, 2012). The Deliberatorium
asks participants to enter their ideas in the form of deliberation maps i.e. tree structures consisting of
issues (problems to be solved), ideas (potential solutions to these problems), and arguments (pros and
cons for the different candidate ideas) (Figure 3):
Figure 3. A snapshot of the Deliberatorium.
A mediator ensured that the deliberation map remained well-structured, and clustered similar ideas
under shared generalizations.
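The issue/idea/argument tree can be represented with a simple recursive structure (a sketch; the node type and the example content below are our invention, not the Deliberatorium's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str   # "issue", "idea", "pro", or "con"
    text: str
    children: List["Node"] = field(default_factory=list)

# A tiny hypothetical deliberation map (content invented for illustration):
root = Node("issue", "How can we improve lab productivity?", [
    Node("idea", "Hold weekly cross-team reviews", [
        Node("pro", "Spreads knowledge across projects"),
        Node("con", "Takes time away from development"),
    ]),
    Node("idea", "Adopt a shared code repository", [
        Node("pro", "Reduces duplicated effort"),
        Node("con", "Requires migration work"),
    ]),
])

def count_kind(node, kind):
    """Count map nodes of a given kind (issue/idea/pro/con)."""
    return (node.kind == kind) + sum(count_kind(c, kind) for c in node.children)
```

The tree shape makes the mediator's job concrete: merging redundant ideas means re-parenting subtrees under a shared generalization.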
The research lab's open innovation engagement was framed as a contest, with financial rewards offered
to the authors of the three top ideas selected by the committee to be implemented in the lab (roughly
US$440 for 1st place, US$220 for 2nd place, and US$130 for 3rd place). The contest stayed open for
one month: in the end, 23 authors had generated a total of 48 ideas. Every idea had at least 2 pro and 2
con arguments that enumerated the advantages and disadvantages of adopting that idea.
The ideas were evaluated by a committee made up of four highly experienced research managers
(Table 1):

Member    Role                                                           Experience (years)   Knowledge about the Research Lab
Member 1  Director of a Petroleum Company                                > 15                 > 50
Member 2  CEO of a University Foundation                                                      > 50
Member 3  CEO of a University Foundation                                 1 to 5
Member 4  Member of the National academic committee to sponsor research                       > 50

Table 1: Committee members' profile
The experts were given the descriptions, including pro and con arguments, for all ideas, and were
asked to rate the ideas as "bad", "average", "good", or "excellent", basing their decisions on the
following three criteria:
cost for implementing the idea (lower is better)
productivity benefit of the idea (higher is better)
time needed to experience the benefit (lower is better)
The identity of the idea authors was not revealed to the judges, so ideas were assessed based on
their own merits rather than on who wrote them.
The figures below give examples of ideas submitted as part of the open innovation engagement:
Figure 4a. An open innovation idea that received a high rating from the expert committee.
Figure 4b. An open innovation idea that received a low rating from the expert committee.
The distribution of the expert scores for the ideas was as follows (Figure 5):
Figure 5. Distribution of expert scores for ideas. The scores were calculated by summing the
ratings from the four experts, with a "bad" score counted as 0 points, an "average" as 1 point, a
"good" as 2 points, and an "excellent" as 3 points.
Nineteen ideas were rated "good" or better by a super-majority (three or more) of the experts.
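The scoring scheme just described can be made concrete with a short sketch (the function names and the example label lists are illustrative, not the actual committee data). Each idea's score sums the four experts' ratings using the point scheme of Figure 5, and the shortlist contains ideas rated "good" or better by a super-majority:

```python
RATING_POINTS = {"bad": 0, "average": 1, "good": 2, "excellent": 3}

def score_idea(expert_labels):
    """Sum the experts' ratings: bad=0, average=1, good=2, excellent=3."""
    return sum(RATING_POINTS[label] for label in expert_labels)

def super_majority_good(expert_labels, quorum=3):
    """True if at least `quorum` experts rated the idea good or better."""
    return sum(label in ("good", "excellent") for label in expert_labels) >= quorum
```
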
The experts then went through a four-round Delphi process (Rowe and Wright, 2001) to pick the
three winning ideas. The experts did not know the identity of the idea authors or of the other
committee members, so ideas were judged on their own merits.
Idea Filtering Engagement: Current and former lab members were then asked to identify the best
productivity-enhancement ideas from the open innovation engagement. They were informed that
the ideas had been reviewed by an expert committee, and that their job was to predict which
ideas had been selected as winners by these experts. The raters were given the criteria used by
the expert committee, as well as the general profile for the committee members. The raters were
not told, however, the identity of the experts or idea authors, so ideas would be judged on their
own merit rather than on who authored them. The three most accurate raters (i.e. the raters who
did the best job of identifying ideas that were highly ranked by the expert committee) received
financial rewards (roughly US$220 for 1st place, US$130 for 2nd place, and US$65 for 3rd place).
The idea filtering participants overlapped significantly with the idea generation participants. While
this may in theory lead to bias in idea selection – users may, for example, be tempted to up-rate
their own ideas – we did not consider this an issue for this study since (1) we wished to compare
our technique with existing open innovation systems, which typically draw ideas and ratings
from the same community, and (2) this was true for all the conditions in our experiment, so it did
not introduce a bias for or against any particular idea filtering approach.
The idea filtering participants were split into three groups of roughly 20 members each. This
assignment was done randomly, with minor adjustments to ensure that the groups were balanced
with respect to age, education level, gender, and length of tenure in the lab (Figure 6):
Figure 6a-d. Rater demographics for the three experimental conditions.
Each group tried a different idea filtering approach:
Likert: The 23 participants in this condition were asked to rate each idea using a 5-point Likert
scale, ranging from 1 (very unlikely to be selected as a winner by the committee) to 5 (highly
likely to be selected) (Figure 7):
Figure 7. The Likert idea rating interface. Users click on a yellow star to indicate their rating.
Bag of stars (BOS): The 20 participants in this condition were asked to distribute a budget of 10
"stars" to the ideas they felt were most likely to be selected as winners by the expert committee
(Figure 8). Recall that roughly 10-30% of ideas from open innovation engagements are typically
considered, by the engagement customers, to be high quality. We therefore elected to give each
user 10 stars, since this represents 20% of the idea corpus size.
Figure 8. The bag of stars idea rating interface. Users clicked on the + and - icons next to the
ideas to move their tokens (represented as gold coins) between their "treasure chest" (of
unallocated tokens) and the ideas.
Bag of Lemons (BOL): The 23 participants in this condition were asked to distribute a budget of 10
"lemons" to the ideas they felt were least likely to be selected as winners by the expert
committee. The focus was thus on eliminating bad ideas, rather than identifying good ones.
In all conditions, users could click on the "reveal more" button for an idea to see its associated
pro and con arguments.
Every group was given one week to enter their idea ratings. All the idea filtering engagements
took place in parallel; participants could not see each other's ratings, and were asked not to
discuss their evaluations with each other during the experiment, to help assure the rater
independence that is required for accurate crowd classification (Ladha, 1992). The identity of the
winning ideas was kept secret until after the completion of the idea filtering engagement. All
user interactions with the system were recorded and time-stamped.
Recall that our goal was not to see whether a crowd can replace the selection committee, but
rather to see whether the crowd can prune the idea corpus so that the selection committee need
not waste time reviewing poor ideas. We assessed, accordingly, how accurate the raters were at
identifying the ideas that were considered good or excellent by at least three members of the
expert committee. Nineteen ideas (40% of the idea corpus) met that criterion. An ideal filter would
select all and only these ideas. It would have, in other words, a "true positive" rate of 100% (it
would find all the positives = perfect recall), and a "false positive" rate of 0% (it would not select
any non-positives = perfect precision). We used a standard technique known as ROC curves
(Fawcett, 2004) to assess how close our idea filtering methods came to this ideal. ROC curves
plot the true positive rate vs. the false positive rate for a filter. Our idea filters all work by
providing ideas with numeric scores, and then filtering out the ideas whose score falls below a
given threshold. Because of this, our ROC curves consist of multiple points, each corresponding
to the true and false positive rates for different selection thresholds. The area under the ROC
curves is then a measure of accuracy: a perfectly accurate idea filter would have an area of 1.0,
while a random selection filter would have an area of 0.5.
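For concreteness, this ROC construction can be sketched as a simple threshold sweep (the function name and example data are ours, not the paper's actual analysis code). Each filter assigns a numeric score per idea; each selection threshold yields one (false positive rate, true positive rate) point, and the area is computed with the trapezoid rule:

```python
def roc_auc(scores, positives):
    """scores: {idea: score}, higher = more likely to be good;
    positives: the set of ideas the expert committee rated good or better.
    Returns the area under the ROC curve traced by sweeping the threshold."""
    pos = [i for i in scores if i in positives]
    neg = [i for i in scores if i not in positives]
    points = {(0.0, 0.0), (1.0, 1.0)}
    for t in set(scores.values()):  # one ROC point per selection threshold
        tpr = sum(scores[i] >= t for i in pos) / len(pos)
        fpr = sum(scores[i] >= t for i in neg) / len(neg)
        points.add((fpr, tpr))
    pts = sorted(points)
    # area under the curve via the trapezoid rule
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A filter whose scores perfectly separate the expert-approved ideas yields an area of 1.0, while a filter that cannot distinguish them at all yields 0.5.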
The individual raters' accuracies had, as one might expect, a roughly Gaussian distribution (Figure 9):
Figure 9. Distribution of idea filtering accuracy for raters.
We found no clear effects of the rater's demographics (age, gender, educational level, years of
work experience) on their idea filtering accuracy (Figure 10):
Figure 10a-d. Impact of demographics on rater accuracy.
The error bars show +/- 1 standard deviation.
The accuracy scores showed a strong "wisdom of the crowds" (Surowiecki, 2005) effect, i.e.,
larger groups were more accurate than smaller ones (Figure 11):
Figure 11. Filtering accuracy as a function of subgroup size, for 5000 random subgroups. The
error bars show +/- 1 standard deviation.
Remarkably, the average idea filtering accuracy for individual raters (left-most edge of the
accuracy curves) was only slightly better than random (0.5), while the groups achieved
accuracies as high as 0.9.
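This size effect is what a Condorcet-style analysis (Ladha, 1992) predicts: if each rater independently classifies an idea correctly with probability only slightly above chance, a majority vote over many raters is right far more often. A toy simulation (all parameters invented for illustration, not fit to our data):

```python
import random

def majority_accuracy(p_correct, group_size, trials=20000, seed=42):
    """Each of `group_size` raters independently classifies an idea correctly
    with probability p_correct; the group answer is the majority vote.
    Returns the estimated probability that the majority is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < p_correct for _ in range(group_size))
        hits += correct_votes > group_size / 2
    return hits / trials
```

With p_correct = 0.55, for example, a lone rater is right about 55% of the time, while a group of 51 is right roughly three-quarters of the time, mirroring the jump from near-random individual accuracy to high group accuracy seen in Figure 11.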
The ROC curves for the three idea filtering conditions were as follows (Figure 12):
Figure 12. ROC curves for the idea filtering conditions.
The accuracy scores for these conditions were (Figure 13):
Figure 13. Idea filtering accuracy for each condition.
The error bars show +/- 1 standard deviation.
BOL thus had the highest accuracy, followed by Likert and then BOS. All conditions performed
better than a random filter (which would have an accuracy of 0.5), and all these differences were
statistically significant at p<0.05.
The following figure gives the average amount of time the participants spent, in minutes, doing
the ratings in each idea filtering condition (Figure 14):
Figure 14. Average per-rater time, in minutes, for each idea filtering condition.
The error bars show +/- 1 standard deviation.
BOS and BOL required roughly 1/3rd the rater time of the Likert approach (p<0.05). The
difference between BOS and BOL was not statistically significant.
In order to better understand these time differences, we also analyzed the number of times
participants clicked on the "reveal more" button for ideas (Figure 15):
Figure 15. Average per-rater clicks on "reveal more", for each idea filtering condition.
The error bars show +/- 1 standard deviation.
BOS and BOL raters clicked on the "reveal more" button significantly fewer times than Likert
raters (p<0.05), while the difference between BOS and BOL was not statistically significant.
We also used Krippendorff's alpha (Hayes and Krippendorff, 2007) to assess the degree of
agreement between the raters in each condition.
An alpha of 0.8 or above is considered to represent an acceptable level of inter-rater reliability
for multi-rater coding. Inter-rater reliability was thus very low for the BOS and BOL conditions,
and acceptable for Likert. The inter-rater reliability score for Likert was significantly higher than
that for BOS or BOL (p < 0.02), but the BOS and BOL scores were not significantly different.
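For reference, a minimal implementation of Krippendorff's alpha for nominal data with no missing values is sketched below (our own sketch; production analyses should follow Hayes and Krippendorff's full procedure, which also handles ordinal and interval data and missing ratings):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of units (e.g. ideas), each a list of the labels it received.
    Returns alpha = 1 - observed/expected disagreement (nominal metric)."""
    o = Counter()  # coincidence matrix over ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with one label contributes no coincidences
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in o.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k]
                   for c, k in permutations(n_c, 2)) / (n * (n - 1))
    return 1.0 if expected == 0 else 1.0 - observed / expected
```

Perfect agreement yields an alpha of 1.0; values below the conventional 0.8 threshold, as we observed for BOS and BOL, indicate unreliable multi-rater coding.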
Our data allows us to reach the following conclusions:
The bag of lemons (BOL) approach provided substantially (about 33%) greater idea
filtering accuracy (as measured by ROC curve area) than the conventional Likert
approach, while requiring only about one third of the rater time.
Our crowds were much (about 60%) more accurate at eliminating bad ideas (BOL) than
selecting good ones (BOS).
Our hypothesis (that our approach will achieve at least comparable idea filtering accuracy as
Likert rating, while requiring less rater time) was thus validated for the Bag of Lemons, but not
for the Bag of Stars.
How can we understand these results? The multi-voting conditions require less rater time, we
hypothesize, because raters are encouraged, by these approaches, to make snap judgments. Since
they have only a limited number of tokens to distribute, and are no doubt motivated to not spend
undue time and energy on the rating task, it makes sense that they would quickly skip past ideas
that clearly are not at the higher (BOS) or lower (BOL) extremes of the quality spectrum. This
explanation is supported by the data on how often raters used the "reveal more" button to
examine an idea in more detail: they did so for almost all the ideas in the Likert condition, but for
only about half of them in the BOS/BOL conditions.
But why is BOL so much more accurate than the other approaches? One possible explanation is
that it is easier for raters to agree on which ideas are terrible, than to agree on which ideas are
great. This explanation is not supported, however, by the data. Inter-rater reliability was very low
for the multi-voting conditions, which means that the raters diverged widely in terms of deciding
which ideas deserved being tagged as either very good, or very bad.
Another possible explanation is that the BOL raters simply spent more time examining the ideas
than did the BOS and Likert raters. This also, however, is not supported by the data. BOL raters
took no more time than BOS raters, and were significantly faster than Likert raters.
Our hypothesis is that the accuracy results reflect the fact that idea filtering requires finding ideas
that are exceptional with respect to all of the customer's criteria (e.g. feasibility, value, and cost).
If we ask raters to find the best ideas, they therefore need to judge all the ideas with respect to all
the criteria. This may force raters to make judgments that they are not well-qualified to make. A
rater, for example, may have a good sense of the potential benefits of an idea, but not of how
quick or costly it would be to implement. The Bag of Lemons approach, by contrast, only
requires that people identify ideas that are clearly deficient with respect to at least one criterion,
because that is all it takes to eliminate an idea from consideration. The incentives and limited
vote budget, in addition, encourage raters to focus only on the ideas they feel they can evaluate
quickly and well. As long as the rater community is diverse enough so that every criterion has at
least some raters who can evaluate it, the bag of lemons can thus achieve greater idea filtering
accuracy than any member could achieve on his/her own.
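This conjunctive-criteria hypothesis can be illustrated with a toy simulation (the parameters and the rater model are entirely our invention, intended only to exhibit the asymmetry, not to reproduce our experiment). Each idea receives a random score on three criteria and is a "winner" only if all three are high; each rater can judge just one criterion, and allocates tokens to the extremes of that criterion:

```python
import random

def simulate_bol_vs_bos(n_ideas=48, n_raters=60, n_criteria=3,
                        budget=10, cutoff=0.4, seed=1):
    rng = random.Random(seed)
    ideas = [[rng.random() for _ in range(n_criteria)] for _ in range(n_ideas)]
    # winners must be strong on ALL criteria (conjunctive selection)
    winners = {i for i, crit in enumerate(ideas) if min(crit) > cutoff}

    bol = [0.0] * n_ideas  # higher = fewer lemons = better
    bos = [0.0] * n_ideas  # higher = more stars = better
    for r in range(n_raters):
        k = r % n_criteria  # this rater can only judge criterion k
        order = sorted(range(n_ideas), key=lambda i: ideas[i][k])
        for i in order[:budget]:   # lemon the clearly deficient ideas
            bol[i] -= 1
        for i in order[-budget:]:  # star the ideas that excel on criterion k
            bos[i] += 1

    def pairwise_accuracy(score):
        # probability a random (winner, non-winner) pair is ranked correctly
        pairs = [(w, l) for w in winners
                 for l in range(n_ideas) if l not in winners]
        credit = sum(1.0 if score[w] > score[l] else
                     0.5 if score[w] == score[l] else 0.0
                     for w, l in pairs)
        return credit / len(pairs)

    return pairwise_accuracy(bol), pairwise_accuracy(bos)
```

In this model, winners are essentially never at the bottom of any single criterion, so lemons cleanly separate them from deficient ideas, whereas stars reward ideas that excel on one criterion regardless of their weaknesses on the others.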
One final question raised by this work is why the per-rater BOS and BOL times were the same.
If BOS rating is in fact inherently a more complex task than BOL rating, why didn't the raters
take longer to do it? We hypothesize that this time parity was the result of a well-known effect,
from the psychology literature, that appears when people are asked to do forced-choice tasks
such as rating. In such conditions, people make strategic choices, trying to find the point on their
speed-accuracy tradeoff that maximizes their expected payoff per amount of time invested
(Bogacz et al., 2006). If they spend too little time, their accuracy is too low. If they spend more
time, their accuracy increases, but eventually reaches a point of diminishing returns. The
BOS and BOL raters had no a priori information about what their speed-accuracy tradeoff was
for their token-assignment tasks, so it is likely, we believe, that raters in these conditions
allocated roughly the same amount of time to their tasks. The lower accuracy scores for BOS
then simply reflect the fact that BOS is a more cognitively complex task than BOL, and thus
yields lower accuracy for a given time investment.
The key challenge for future work in this area is to explore how accurate, high-speed, crowd-
based idea filtering can be extended to much larger idea corpuses than the ones we used in this
paper. We have identified several promising strategies for scaling up our approach:
Figure 16. Promising avenues for future work.
These strategies include (1) avoiding redundancy in the contributed ideas (reducing needless
rater effort), (2) rating idea clusters (instead of just individual ideas), (3) adaptive token budgets
(whose size is adapted to the raters and idea set), (4) information-theoretic rater assignment (to
assign raters where they can do the most good) and (5) integration with other filtering techniques
(to find the best ideas after eliminating the worst).
Redundancy avoidance: Conventional open innovation systems do little or nothing to prune the
corpus during the idea generation process, meaning that many of the submissions can represent
duplications or small variants of one another, greatly increasing the amount of effort needed to
filter the corpus later. We can ask the crowd to detect and eliminate redundant entries post-hoc,
as is done for example in (Salganik and Levy, 2012), but this can be expensive, since the
number of pairwise comparisons this requires grows quadratically with the size of the idea
corpus. One possible alternative, based on our previous work on crowd-based deliberation
(Klein, 2012), would be to ask authors to place the ideas they contribute within an evolving,
logically-organized taxonomy which groups related ideas together and thus makes redundancy
easy to detect and avoid.
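The cost argument can be made concrete with a quick calculation. The sketch below compares exhaustive pairwise duplicate checking against checking only within taxonomy categories; the corpus and category sizes are hypothetical:

```python
def pairwise_comparisons(n):
    # Exhaustive duplicate detection compares every pair of ideas: n*(n-1)/2.
    return n * (n - 1) // 2

def taxonomy_comparisons(category_sizes):
    # If authors file each idea under a taxonomy category, duplicates only
    # need to be checked against ideas in the same category.
    return sum(pairwise_comparisons(size) for size in category_sizes)

print(pairwise_comparisons(1000))        # 499500 comparisons
print(taxonomy_comparisons([100] * 10))  # 49500 -- a tenfold reduction
```

A corpus of 1,000 ideas split evenly across ten categories thus requires an order of magnitude fewer comparisons than the unstructured corpus.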
Idea clustering: Open innovation engagements frequently generate clusters of related ideas,
clusters that can be identified by human moderators or by document clustering algorithms. Raters
can then be assigned to idea clusters based on their expertise, reducing the number of ideas each
rater need see. We can also allow raters to assign scores to entire clusters, as opposed to just the
individual ideas that make them up. This would allow them to eliminate entire clusters of related,
but unpromising, ideas in a single operation. This approach, if successful, would mean that users
only need rate a small subset of the entire corpus, thereby increasing the scale of corpuses that
can be handled. This process will make most sense, of course, if good ideas tend to co-occur in
clusters: otherwise, giving a cluster a good or bad rating would provide no information about the
ideas that lie within it. While this remains an open question, we did observe substantial
clustering, in our experiment, of ideas rated as good or excellent by the expert committee:
Figure 17. Chart of the idea clusters generated by our open innovation engagement, where black
nodes represent ideas rated as good or excellent by a super-majority of the expert committee.
This suggests that rating idea clusters is potentially a viable approach for scalable idea filtering.
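A cluster-level pass might then work as in the sketch below, where whole clusters that attract too many lemon tokens are dropped before any individual rating occurs (the cluster names and the threshold are hypothetical):

```python
def filter_by_cluster(clusters, cluster_lemons, max_lemons):
    """Return the ideas in clusters that survived cluster-level rating.

    clusters: dict mapping cluster_id -> list of member idea ids
    cluster_lemons: dict mapping cluster_id -> lemon tokens it received
    max_lemons: elimination threshold (an assumed parameter)
    """
    survivors = []
    for cluster_id, ideas in clusters.items():
        if cluster_lemons.get(cluster_id, 0) <= max_lemons:
            survivors.extend(ideas)  # only these need individual rating
    return survivors

clusters = {"ui": ["i1", "i2", "i3"], "pricing": ["i4", "i5"], "misc": ["i6"]}
lemons = {"ui": 1, "pricing": 7}  # nobody tagged the "misc" cluster
print(filter_by_cluster(clusters, lemons, max_lemons=2))
# ['i1', 'i2', 'i3', 'i6'] -- the 'pricing' cluster is eliminated wholesale
```

Only the surviving ideas would then be passed on to the per-idea filtering stage.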
Adaptive token budgets: Future research should examine what the optimal token budget for each
rater should be, based on such attributes as the past performance of the raters, the number of
ideas in the corpus, the number of raters, and the a priori estimates of the percentage of excellent
ideas in the corpus being filtered. We hypothesize, for example, that the utility of the crowd will
be increased if the token budgets are a good match to the distribution of bad, borderline, and
good ideas in the idea corpus.
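As one example of such a heuristic, offered purely as a hypothesis rather than a formula we have validated, the budget could be sized so that the crowd's total tokens suffice to give every expected bad idea a few votes:

```python
import math

def token_budget(n_ideas, n_raters, bad_fraction, votes_per_bad=3):
    """Heuristic per-rater budget: enough total tokens so that each
    expected bad idea can receive `votes_per_bad` tokens. All parameter
    values here are assumptions for illustration."""
    tokens_needed = bad_fraction * n_ideas * votes_per_bad
    return max(1, math.ceil(tokens_needed / n_raters))

# 50 ideas, 15 raters, and a prior guess that half the ideas are bad:
print(token_budget(n_ideas=50, n_raters=15, bad_fraction=0.5))  # 5
```

Such a rule would automatically shrink per-rater budgets as the rater pool grows, and expand them when the prior suggests the corpus contains many lemons.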
Information-theoretic rater assignment: Most open innovation systems allow users to select
which ideas they do or do not rate, which can result in rating gaps (if users tend to only rate
recently-created ideas) as well as rating dysfunctions (if users tend only to rate ideas that already
have high average ratings) (Salganik et al., 2006; Klein and Convertino, 2014). A promising
direction is to define algorithms, based on information-theoretic concepts, that assign raters to
ideas so that the system converges as quickly as possible to a well-founded identification of the
most promising ideas. Such algorithms have been explored for pairwise idea ranking (Salganik
and Levy, 2012) as well as for multi-voting for top ideas (Toubia and Florès, 2007), but we are
unaware of this being applied to a multi-voting approach like ours.
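As a simple illustration of the idea (assuming, for concreteness, binary good/bad votes rather than our token scheme), raters could be routed to the idea whose current vote split is most uncertain, i.e. has the highest entropy:

```python
import math

def entropy(vote_counts):
    """Shannon entropy of an idea's current vote distribution."""
    total = sum(vote_counts.values())
    if total == 0:
        return float("inf")  # unrated ideas are maximally uncertain
    probs = [c / total for c in vote_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def next_idea_to_rate(votes):
    """Route the next rater to the idea whose rating is most uncertain,
    so that each new judgment carries the most information."""
    return max(votes, key=lambda idea: entropy(votes[idea]))

votes = {
    "i1": {"good": 9, "bad": 1},  # near-consensus: little left to learn
    "i2": {"good": 5, "bad": 5},  # evenly split: maximally informative
    "i3": {"good": 8, "bad": 2},
}
print(next_idea_to_rate(votes))  # i2
```

Under this policy, rater effort flows away from ideas the crowd has already settled on and toward the contested middle of the corpus.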
Integration with other filtering techniques: One potentially promising direction is to use the bag
of lemons as a pre-filter for other techniques (such as idea markets or multi-criteria rating) that
are more time-consuming but may be superior at finding the very best ideas when applied to
small idea corpuses. Another possibility is to create a hybrid of multi-voting and multi-criteria
rating, wherein crowd members are asked to assign tokens to the ideas that most (or least)
effectively achieve a given criterion.
In addition to these technology development directions, empirical research is needed to compare
BOL with other idea filtering techniques such as ranking, as well as to better understand the
mechanisms underlying BOL's superior performance compared to Likert and BOS.
This work represents, we believe, a novel and important contribution to the literature on idea
filtering for open innovation systems. The success of open innovation systems is crucially
dependent on being able to identify the best ideas from the wealth of suggestions that are now
available at very low cost. Crowd-based idea filtering represents a promising approach to this
challenge, but must be able to do so with high accuracy (to avoid unreasonable burdens on the
customers of open innovation engagements) as well as low participant costs (since reducing such
costs is crucial to maintaining high levels of participation (Benkler, 2006)). Our work
demonstrates substantial progress towards meeting these goals, building on an approach which
incorporates three key strategies:
raters are asked to distribute a limited budget of "judgment" tokens
the task is to identify the worst, rather than the best, ideas
raters are given incentives for using the customer's evaluation criteria
While other efforts have used limited-budget multi-voting for idea filtering (Bao et al., 2011), or
provided incentives for idea filtering accuracy (e.g. in idea markets), we are aware of no previous
work that combines all three concepts.
This approach, of course, is not a panacea. It is prone to the same limitations as any other crowd-
based classification task: it requires that the members of the crowd be diverse, independent, have
non-zero knowledge of the domain, and be in sufficient numbers (Surowiecki, 2005). It can thus
fail, for example, in the presence of collusion (where a substantial fraction of raters work
together to push their favorites to the top), sparsity (where there are too few participants in the
rating pool compared to the size of the idea corpus), ignorance (the crowd members do not have
the expertise to assess ideas with respect to the relevant criteria), or homogeneity (raters
represent a single bias and/or type of expertise, so their errors tend to aggregate rather than
cancel each other out). These problems can be managed, to a considerable extent, by appropriate
management of the rater pool.
We'd like to thank all the people who participated in the experiments, as well as the four experts
that evaluated the ideas from the open innovation engagement. We also thank Fabian Lang for
many fruitful discussions about the ideas underlying this work.
A.C.B. Garcia was supported by grant 10121-12-9 from the Brazilian government research
agency (CAPES). M. Klein was supported by grant 6611188 from the European Union Seventh
Framework (FP7) Program - the CATALYST project.
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749.
Arrow, K. J. (1963). Social choice and individual values. Wiley.
Baez, M., & Convertino, G. (2012). Designing a facilitator's cockpit for an idea management system. Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work Companion, 59-62.
Bailey, B. P., & Horvitz, E. (2010). What's your idea? A case study of a grassroots innovation pipeline within a large software company. Proceedings of the Computer-Human Interaction Conference.
Bao, J., Sakamoto, Y., & Nickerson, J. V. (2011). Evaluating design solutions using crowds. Proceedings of the Seventeenth Americas Conference on Information Systems.
Benkler, Y. (2006). The wealth of networks: How social production transforms markets and freedom. Yale University Press.
Berg, J. E., & Rietz, T. A. (2003). Prediction markets as decision support systems. Information Systems Frontiers, 5(1), 79-93.
Bjelland, O. M., & Wood, R. C. (2008). An inside view of IBM's 'Innovation Jam'. Sloan Management Review, 50(1).
Blohm, I., Bretschneider, U., Leimeister, J. M., & Krcmar, H. (2011a). Does collaboration among participants lead to better ideas in IT-based idea competitions? An empirical investigation. International Journal of Networking and Virtual Organisations, 9(2), 106-
Blohm, I., Riedl, C., Leimeister, J. M., & Krcmar, H. (2011b). Idea evaluation mechanisms for collective intelligence in open innovation communities: Do traders outperform raters? Proceedings of the International Conference on Information Systems.
Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113(4), 700.
Bothos, E., Apostolou, D., & Mentzas, G. (2009). IDEM: A prediction market for idea management. In Designing e-business systems: Markets, services, and networks (pp. 1-13). Springer.
Bothos, E., Apostolou, D., & Mentzas, G. (2012). Collective intelligence with web-based information aggregation markets: The role of market facilitation in idea management. Expert Systems with Applications, 39(1), 1333-1345.
Brennan, M. R., Wrazien, S., & Greenstadt, R. (2010). Learning to extract quality discourse in online communities. Proceedings of WikiAI-10: The AAAI-2010 Workshop on Collaboratively-built Knowledge Sources and Artificial Intelligence.
Buskirk, E. V. (2010). Google struggles to give away $10 million. Wired Magazine.
Butterworth, M. S. J. (2005). Using mass idea brainstorming as an organizational approach to jumpstarting innovation initiative. Unpublished PhD thesis, University of South Australia.
Caruana, G., & Li, M. (2012). A survey of emerging approaches to spam filtering. ACM Computing Surveys, 44(2), 9.
Chesbrough, H. W. (2003). Open innovation: The new imperative for creating and profiting from technology. Harvard Business Press.
Chesbrough, H., Vanhaverbeke, W., & West, J. (2008). Open innovation: Researching a new paradigm. Oxford University Press.
Dahan, E., Kim, A. J., Lo, A. W., Poggio, T., & Chan, N. (2011). Securities trading of concepts (STOC). Journal of Marketing Research, 48(3), 497-517.
Dean, D. L., Hender, J. M., Rodgers, T. L., & Santanen, E. L. (2006). Identifying quality, novel, and creative ideas: Constructs and scales for idea evaluation. Journal of the Association for Information Systems, 7(1), 30.
Di Gangi, P. M., & Wasko, M. (2009). Steal my idea! Organizational adoption of user innovations from a user innovation community: A case study of Dell IdeaStorm. Decision Support Systems, 48(1), 303-312.
Enkel, E., Perez-Freije, J., & Gassmann, O. (2005). Minimizing market risks through customer integration in new product development: Learning from bad practice. Creativity and Innovation Management, 14(4), 425-437.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31(1), 1-38.
Forsythe, R., Rietz, T. A., & Ross, T. W. (1999). Wishes, expectations and actions: A survey on price formation in election stock markets. Journal of Economic Behavior & Organization, 39(1), 83-110.
Gassmann, O. (2006). Opening up the innovation process: Towards an agenda. R&D Management, 36(3), 223-228.
Hanson, R. (2003). Combinatorial information market design. Information Systems Frontiers, 5(1), 107-119.
Hanson, R. (2004). Foul play in information markets. In R. W. Hahn & P. C. Tetlock (Eds.), Information markets: A new way of making decisions (pp. 126-141). Washington, DC: AEI.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Healy, P. J., Linardi, S., Lowery, J. R., & Ledyard, J. O. (2010). Prediction markets: Alternative mechanisms for complex environments with few traders. Management Science, 56(11).
Hrastinski, S., Kviselius, N. Z., Ozan, H., & Edenius, M. (2010). A review of technologies for open innovation: Characteristics and future trends. 1-10.
Kessler, F. (1995). Team decision making: Pitfalls and procedures. Management Development Review, 8(5), 38-40.
Kittur, A., Nickerson, J. V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., Lease, M., & Horton, J. (2013). The future of crowd work. Proceedings of the International Conference on Computer Supported Cooperative Work, 1301-1318.
Klein, M. (2012). Enabling large-scale deliberation using attention-mediation metrics. Computer Supported Cooperative Work, 21(4), 449-473.
Klein, M., & Convertino, G. (2014). An embarrassment of riches: A critical review of open innovation systems. Communications of the ACM, 57(11), 40-42.
Kostakos, V. (2009). Is the crowd's wisdom biased? A quantitative analysis of three online communities. Proceedings of the Conference on Computational Science and Engineering.
LaComb, C. A., Barnett, J. A., & Pan, Q. (2007). The imagination market. Information Systems Frontiers, 9(2-3), 245-256.
Ladha, K. K. (1992). The Condorcet jury theorem, free speech, and correlated votes. American Journal of Political Science, 617-634.
Lakhani, K. R., & Jeppesen, L. B. (2007). R&D: Getting unusual suspects to solve R&D puzzles. Harvard Business Review, 85(5), 30-32.
Lakhani, K. R., & Panetta, J. A. (2007). The principles of distributed innovation. Innovations.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
Miller, B., Hemmer, P., Steyvers, M., & Lee, M. D. (2009). The wisdom of crowds in rank ordering problems. Proceedings of the International Conference on Cognitive Modeling.
Morgan, J., & Wang, R. (2010). Tournaments for ideas. California Management Review, 52(2), 77.
Newell, B. R., Rakow, T., Weston, N. J., & Shanks, D. R. (2004). Search strategies in decision making: The success of "success". Journal of Behavioral Decision Making, 17(2), 117-
Ogawa, S., & Piller, F. T. (2006). Reducing the risks of new product development. MIT Sloan Management Review, 47(2), 65.
Oleson, D., Sorokin, A., Laughlin, G. P., Hester, V., Le, J., & Biewald, L. (2011). Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. Human Computation.
Phillips, M. Open for questions: President Obama to answer your questions on Thursday.
Piller, F. T., & Walcher, D. (2006). Toolkits for idea competitions: A novel method to integrate users in new product development. R&D Management, 36(3), 307-318.
Riedl, C., Blohm, I., Leimeister, J. M., & Krcmar, H. (2010). Rating scales for collective intelligence in innovation communities: Why quick and easy decision making does not get it right. Proceedings of the International Conference on Information Systems.
Riedl, C., Blohm, I., Leimeister, J. M., & Krcmar, H. (2013). The effect of rating scales on decision quality and user attitudes in online innovation communities. International Journal of Electronic Commerce, 17(3), 7-36.
Robinson, A. G., & Schroeder, D. M. (2009). Ideas are free: How the idea revolution is liberating people and transforming organizations. Berrett-Koehler Publishers.
Rowe, G., & Wright, G. (2001). Expert opinions in forecasting: The role of the Delphi technique. In Principles of forecasting (pp. 125-144). Springer.
Saaty, T. (1990). The analytic hierarchy process. New York: McGraw-Hill.
Salganik, M. J., Dodds, P. S., & Watts, D. J. (2006). Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762), 854-856.
Salganik, M. J., & Levy, K. E. C. (2012). Wiki surveys: Open and quantifiable social data.
Salminen, J., & Harmaakorpi, V. (2012). Collective intelligence and practice-based innovation: An idea evaluation method based on collective intelligence. In Practice-based innovation: Insights, applications and policy implications (pp. 213-232). Springer.
Schulze, T., Indulska, M., Geiger, D., & Korthaus, A. (2012). Idea assessment in open innovation: A state of practice. Proceedings of the European Conference on Information Systems.
Slamka, C., Jank, W., & Skiera, B. (2012). Second-generation prediction markets for information aggregation: A comparison of payoff mechanisms. Journal of Forecasting, 31(6), 469-489.
Soukhoroukova, A., Spann, M., & Skiera, B. (2012). Sourcing, filtering, and evaluating new product ideas: An empirical exploration of the performance of idea markets. Journal of Product Innovation Management, 29(1), 100-112.
Spears, B., LaComb, C., Interrante, J., Barnett, J., & Senturk-Dogonaksoy, D. (2009). Examining trader behavior in idea markets: An implementation of GE's imagination markets. The Journal of Prediction Markets, 3(1), 17.
Surowiecki, J. (2005). The wisdom of crowds. Anchor.
Toubia, O., & Florès, L. (2007). Adaptive idea screening using consumers. Marketing Science, 26(3), 342-360.
Von Hippel, E. (2009). Democratizing innovation. MIT Press.
Walter, T. P., & Back, A. (2013). A text mining approach to evaluate submissions to crowdsourcing contests. 3109-3118.
West, J., & Lakhani, K. R. (2008). Getting clear about communities in open innovation. Industry and Innovation, 15(2), 223-231.
Westerski, A., Iglesias, C. A., & Nagle, T. (2011). The road from community ideas to organisational innovation: A life cycle survey of idea management systems. International Journal of Web Based Communities, 7(4), 493-506.
Westerski, A., Dalamagas, T., & Iglesias, C. A. (2013). Classifying and comparing community innovation in idea management systems. Decision Support Systems, 54(3), 1316-1326.
Wolfers, J., & Leigh, A. (2002). Three tools for forecasting federal elections: Lessons from 2001. Australian Journal of Political Science, 37(2), 223-240.
Wolfers, J., & Zitzewitz, E. (2004). Prediction markets. Journal of Economic Perspectives, 18(2), 107-126.
Yang, Y. (2012). Open innovation contests in online markets: Idea generation and idea evaluation with collective intelligence. Temple University.
... While individual humans still have the cognitive limitations discussed in previous chapters, these can be minimized through the mechanism of collective intelligence (Larrick et al. 2011). This approach aggregates the judgments of a larger group of humans to reduce the noise and bias of individual evaluations (Klein and Garcia 2015;Blohm et al. 2016;Leimeister et al. 2009). For making judgments in uncertain settings, the value of crowds over individual experts can be explained by two basic principles: error reduction and knowledge aggregation (Larrick et al. 2011). ...
... Second, our results indicate a possible application of collective intelligence in more complex and knowledge-intensive tasks. While previous work (e.g., Blohm et al. 2016;Klein and Garcia 2015) utilized the wisdom of the crowd in rather basic decision support settings such as filtering novel product ideas without considering explicit expertise requirements, our findings indicate the potential of applying collective intelligence in uncertain decision tasks. Addressing the concrete expertise requirements of humans, decisional guidance is based on the heterogenous domain knowledge of experts and reduces misleading biases and heuristics. ...
Full-text available
One of the most critical tasks for startups is to validate their business model. Therefore, entrepreneurs try to collect information such as feedback from other actors to assess the validity of their assumptions and make decisions. However, previous work on decisional guidance for business model validation provides no solution for the highly uncertain and complex context of earlystage startups. The purpose of this paper is, thus, to develop design principles for a Hybrid Intelligence decision support system (HI-DSS) that combines the complementary capabilities of human and machine intelligence. We follow a design science research approach to design a prototype artifact and a set of design principles. Our study provides prescriptive knowledge for HI-DSS and contributes to previous work on decision support for business models, the applications of complementary strengths of humans and machines for making decisions, and support systems for extremely uncertain decision-making problems.
... Aufgrund des hohen Datenvolumens besteht ein besonderer Bedarf an Messgrößen, die mit minimaler menschlicher Intervention aus Community-Daten erstellt werden können. Wenn menschliche Urteile benötigt werden, wird der Prozess prohibitiv teuer und langsam (Klein & Garcia, 2015). Deshalb ist die Entwicklung von Indikatoren notwendig, die automatisch aus vorhandenen Daten der Online-Communities generiert werden können. ...
... Dai et al. [44] developed a structural approach that strategically aggregates multiple contributions to improve the output quality. Klein and Garcia [45] proposed an approach, which is named ''bag of lemons,'' to leverage crowd members to filter bad contributions. Lampe et al. [43] developed a system tool to deal with information overload when too many contributions are made. ...
Full-text available
Nowadays, crowdsourcing has become a popular way of sourcing. As intermediaries that connect crowdsourcers and crowds, crowdsourcing platforms integrate state-of-the-art information technologies and specialized organizational functions to host and govern crowdsourcing projects. The extant literature on crowdsourcing has investigated numerous aspects of crowdsourcing platforms. However, a majority of studies are project-oriented and short-term focused. There is a lack of a holistic view of crowdsourcing platforms as enterprises with a developmental perspective. This study aims to address this issue by investigating business sustainability of crowdsourcing platforms. By considering temporal dimensions and multiple interpretations of business sustainability, a conceptual framework is proposed to investigate the sustainability of a crowdsourcing platform by analyzing the key business process, value co-creation, and business development, which is a major theoretical contribution of the study. A case study of LEGO Ideas is presented to illustrate the practical implementation of the proposed framework. Both theoretical and practical implications are discussed.
... vote for the best idea), ranking (sort the ideas in order of preference), pairwise comparison (the analytic hierarchy process), and multi-voting (where the raters distribute a fixed budget of tokens to good ideas, with better ideas getting more tokens). A study by the PI (Klein & Garcia, 2015) prototyped a novel form of idea assessment called negative multi-voting, where raters allocate a fixed budget of tokens to bad ideas, where worse ideas get more tokens). Perhaps counter-intuitively, crowds using negative multi-voting had much higher accuracy in their idea quality assessment (when compared with expert judgments) than Likert rating and conventional multi-voting. ...
Full-text available
This paper reviews the recent progress that has been made by the author and his colleagues on developing technology to enable effective crowd-scale deliberation. The paper includes the following five sections: 1. Goals: what our research is working to achieve 2. Context: the limitations of existing technologies 3. Maps: the core knowledge schema in our approach 4. Architecture: the algorithmic building blocks of our system 5. Results: what this research has achieved so far in building this architecture The first four sections provide the context needed to understand the role and value of the results we describe in section 5.
... it is also important to increase the diversity and reduce redundancy in submitted ideations [11,42,68,72]. Prior methods to avoid redundancy include iterative or adaptive task workflows [86], constructing a taxonomy of the idea space [39], visualizing a concept map of peer ideas [72], and directing crowdworkers towards diverse prompts and away from prior ideas with language model embedding distances [20]. ...
Full-text available
Feedback can help crowdworkers to improve their ideations. However, current feedback methods require human assessment from facilitators or peers. This is not scalable to large crowds. We propose Interpretable Directed Diversity to automatically predict ideation quality and diversity scores, and provide AI explanations - Attribution, Contrastive Attribution, and Counterfactual Suggestions - for deeper feedback on why ideations were scored (low), and how to get higher scores. These explanations provide multi-faceted feedback as users iteratively improve their ideation. We conducted think aloud and controlled user studies to understand how various explanations are used, and evaluated whether explanations improve ideation diversity and quality. Users appreciated that explanation feedback helped focus their efforts and provided directions for improvement. This resulted in explanations improving diversity compared to no feedback or feedback with predictions only. Hence, our approach opens opportunities for explainable AI towards scalable and rich feedback for iterative crowd ideation.
Full-text available
Many of humanity's most pressing and challenging problems-such as environmental degradation, physical and economic security, and public health-are inherently complex (involve many different interacting components) as well as widely impactful (effect many diverse stakeholders). Solving such problems requires crowd-scale deliberation in order to cover all the types of disciplinary expertise needed, as well as to take into account the many impacts the decision will have. Current approaches to group decision-making, however, fail at scale, producing outcomes that are needlessly sub-optimal for all the parties involved. This chapter will investigate why group decision-making fails in this way, explaining the problems of achieving Pareto optimality and noting the tendency to miss win-win solutions that are not the "dream choices" of any participant. It will go on to describe how recent advances in social computing technology can address these failings, for example through the use of deliberation maps, idea filtering, and crowd-scale complex negotiation.
Online collaborations allow teams to pool knowledge from multiple domains, often across dispersed geographic locations to find innovative solutions for complex, multi-faceted problems. However, motivating individuals within online groups can prove difficult, as individual contributions are easily missed or forgotten. This study introduces the concept of creative ancestry, which describes the extent to which collaborative outputs can be traced back to the individual contributions that preceded them. We build a laboratory experiment to demonstrate the impact of creative ancestry on perceptions of fairness and output quality in online collaborations. Results from this experiment suggest the addition of creative ancestry has a positive impact on these variables and is associated with increasing perceptions of procedural justice and possibly interactional and distributive justice, dependent on the level of perceived creativity and cognitive consensus.
As organizations grow, it is often challenging to maintain levels of efficiency. It can also be difficult to identify, prioritize and resolve inefficiencies in large, hierarchical organizations. Collaborative crowdsourcing systems can enable workers to contribute to improving their own organizations and working conditions, saving costs and increasing worker empowerment. In this paper, we briefly review relevant research and innovations in collaborative crowdsourcing and describe our experience researching and developing a collaborative crowdsourcing system for large organizations. We present the challenges that we faced and the lessons we learned from our effort. We conclude with a set of implications for researchers, leaders, and workers to support the rise of collective intelligence in the workplace.
The availability of social media-based data creates opportunities to obtain information about consumers, trends, companies and technologies using text mining techniques. However, the quality of the data is a significant concern for social media-based analyses. The aim of this study was to mine tweets (microblogs) to explore trends and retrieve ideas for various purposes such as product development, technology and sustainability-oriented considerations. The core methodological approach was to create a classification model to identify tweets that contained an idea. This classification model was used as a pre-processing step so the query results obtained from the application programming interface were cleared from the messages that contained the search terms used in the query but did not contain an idea. The results of this study demonstrate that our method based on text mining, and supervised or semi-supervised classification methods, can extract ideas from social media. The social media data mining process illustrated in our study can be utilised as a decision-making tool to detect innovative ideas or solutions about a product or service and summarise them into meaningful clusters. We believe that our findings are significant for the sustainability, tech mining and innovation management communities.
Full-text available
Humanity now finds itself faced with a range of highly complex and controversial challenges—such as climate change, the spread of disease, international security, scientific collaborations, product development, and so on—that call upon us to bring together large numbers of experts and stakeholders to deliberate collectively on a global scale. Collocated meetings can however be impractically expensive, severely limit the concurrency and thus breadth of interaction, and are prone to serious dysfunctions such as polarization and hidden profiles. Social media such as email, blogs, wikis, chat rooms, and web forums provide unprecedented opportunities for interacting on a massive scale, but have yet to realize their potential for helping people deliberate effectively, typically generating poorly-organized, unsystematic and highly redundant contributions of widely varying quality. Large-scale argumentation systems represent a promising approach for addressing these challenges, by virtue of providing a simple systematic structure that radically reduces redundancy and encourages clarity. They do, however, raise an important challenge. How can we ensure that the attention of the deliberation participants is drawn, especially in large complex argument maps, to where it can best serve the goals of the deliberation? How can users, for example, find the issues they can best contribute to, assess whether some intervention is needed, or identify the results that are mature and ready to " harvest " ? Can we enable, for large-scale distributed discussions, the ready understanding that participants typically have about the progress and needs of small-scale, collocated discussions?. 
This paper will address these important questions, discussing (1) the strengths and limitations of current deliberation technologies, (2) how argumentation technology can help address these limitations, and (3) how we can use attention-mediation metrics to enhance the effectiveness of large-scale argumentation-based deliberations.
Open innovation systems have provided organizations, ranging from businesses and governments to universities and NGOs, with unprecedented access to the "wisdom of the crowd", allowing them to collect candidate solutions, for problems they care about, from potentially thousands of individuals, at very low cost. These systems, however, face important open challenges deriving, ironically, from their very success: they can elicit such huge levels of participation that it becomes very difficult to guide the crowd in productive ways, and pick out the best of what they have created. This viewpoint article reviews the key challenges facing open innovation systems and issues a call-to-arms describing how the research community can move forward on this important topic.
Receiver Operating Characteristics (ROC) graphs are useful for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been used increasingly in machine learning and data mining research. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.
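To make the construction concrete, here is a minimal sketch of how ROC points and the area under the curve (AUC) are obtained from a scored test set, using the standard threshold-sweep: sort examples by classifier score, then lower the threshold one example at a time. The function names and the assumption of distinct scores are illustrative, not taken from the article.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold down through the sorted scores,
    emitting one (FPR, TPR) point per example.  Labels are 0/1."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    p = sum(labels)            # number of positive examples
    n = len(labels) - p        # number of negative examples
    tp = fp = 0
    points = [(0.0, 0.0)]      # threshold above every score: predict all negative
    for _, y in pairs:
        if y == 1:
            tp += 1            # this example now counts as a true positive
        else:
            fp += 1            # ...or a false positive
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A ranking that puts every positive ahead of every negative yields an AUC of 1.0, while a random ranking hovers near 0.5; this is one of the summary statistics the article discusses.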
A critical review of open innovation systems.
This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods, which are usually classified into three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, a deeper understanding of users and items, incorporation of contextual information into the recommendation process, support for multicriteria ratings, and provision of more flexible and less intrusive types of recommendations.
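Of the three categories above, the collaborative approach can be sketched in a few lines: predict a user's rating for an unseen item as a similarity-weighted average of other users' ratings for that item. The `predict_rating` helper, the cosine-over-common-items similarity, and the sample ratings are illustrative assumptions, not the paper's method.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(ratings, user, item):
    """User-based collaborative filtering: similarity-weighted
    average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_sim(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

# Toy rating matrix: alice has not yet rated item "i3".
ratings = {
    "alice": {"i1": 5, "i2": 4},
    "bob":   {"i1": 5, "i2": 4, "i3": 2},
    "carol": {"i1": 1, "i2": 2, "i3": 5},
}
pred = predict_rating(ratings, "alice", "i3")
```

Note that cosine similarity over raw positive ratings tends to overstate agreement (even dissimilar users score fairly high), which is one of the limitations motivating the extensions the paper surveys; practical systems often mean-centre ratings (Pearson correlation) or use model-based methods instead.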
Your problem solvers could be not only out in the wider world but also in the wide reaches of your own organization.
Open innovation has become a fruitful approach to increasing the potential of innovation in organisations. Similar to traditional innovation, an open innovation approach can be characterised in three phases, namely idea generation, idea assessment and idea implementation/diffusion. While the academic community has begun to provide initial guidance for improving the various stages of the open innovation process, still little is known about how organisations currently assess ideas once they are collected. The potentially vast quantity of ideas collected through an open innovation approach is of limited benefit to an organisation that is unable to categorise and assess them. Accordingly, in this study we carry out an exploratory survey among 331 managers to obtain a better understanding of idea assessment in practice. Our findings show, among other things, that organisations investing in information systems for idea management report higher satisfaction with the effectiveness of idea assessment, which is, in turn, associated with higher satisfaction with the innovation process overall.