Crowd-sourced idea filtering with Bag of Lemons: the impact of
the token budget size.
Gafari Lukumon a,c
Gafari.LUKUMON@um6p.ma
Mark Klein a,b
http://cci.mit.edu/klein/
a School of Collective Intelligence, University Mohammed VI Polytechnic
b Center for Collective Intelligence, Massachusetts Institute of Technology
c Institut Jean Nicod, Ecole Normale Supérieure, Paris, France.
ABSTRACT
Identifying the best ideas from the vast volumes generated by open innovation engagements is costly and often time-consuming. One approach is to engage crowds in filtering the ideas, not just generating them. Klein and Garcia (2015) proposed the "Bag of Lemons" (BOL) approach, which filters ideas more accurately and more quickly than other filtering methods such as the conventional Likert scale. The idea behind BOL is to ask the crowd to distribute a fixed budget of tokens to eliminate bad ideas rather than to select good ones. In this paper, we explain why BOL works better than other filtering methods, using both empirical experiments (with n=750 subjects) and a mathematical explanation. We also examine the effect of the token budget size on idea filtering engagements and find, among other things, that the accuracy of a filter depends on the token budget size.
Keywords: Collective Intelligence, Idea filtering, Bag of Lemons, Bag of Stars, Likert, Filtering, Token budget size
1. Introduction
Crowdsourcing has become a popular method for generating and collecting a large number of ideas
from a diverse group of individuals. However, the sheer volume of ideas generated through
crowdsourcing can make it challenging to effectively filter through and identify the truly valuable
ones. To address this problem, researchers have proposed the use of crowd-based filters, such as
the Bag of Lemons (BOL) approach. The BOL approach is unique in its focus on quickly
identifying and discarding poor or low-quality ideas, rather than solely recognizing the excellent
ones. However, the performance of the BOL filter is not solely dependent on the approach itself,
but also on the parameters used in the filtering process. One crucial parameter is the token budget size, which refers to the number of tokens the crowd is given to place on the ideas they consider the worst. The token budget size plays a critical role in the filtering process, as it determines how much weight the crowd's opinions carry in the final decision-making process.
In this article, we aim to explore the impact of the token budget size on the effectiveness of the BOL
filter in filtering and identifying good and bad ideas in a crowdsourcing environment. Through a
series of experiments, we will investigate how varying the token budget size affects the filter's
ability to accurately distinguish between valuable and non-valuable ideas. Additionally, we will
discuss the implications of these findings for the use of the BOL filter and other similar techniques
in real-world crowdsourcing scenarios.
The findings of this research have the potential to inform the development of more effective and
efficient crowdsourcing strategies for a wide range of industries and organizations. By providing
a deeper understanding of the challenges and limitations of using crowd-based filters in a
crowdsourcing setting, we hope to offer insights into how the token budget size can be optimized
to improve the accuracy of idea filtering.
This paper seeks to address the following research question:
Research question
How does the implementation of the Bag of Lemons (BOL) filter in combination with varying
token budget sizes affect the accuracy of identifying both good and bad ideas in a crowdsourcing
setting?
The paper is structured as follows: In Section 2, we review relevant literature on the topic. In
Section 3, we explain the rationale for our research and why it is important. In Section 4, we present
the theoretical framework for our study. In Section 5, we outline the research hypotheses. We
describe the methodology used in the study in Section 6. The results and findings are discussed in
Section 7. In Section 8, we provide a mathematical explanation that explains why BOL is more
effective than other filtering methods. We conclude the paper in Section 9, summarizing the key
takeaways and implications of our research. Finally, we present limitations and suggestions for future research in Section 10.
2. Literature review
Challenges: crowd-based idea filtering with the Bag of Lemons.
Formal organizations, such as industries, educational institutions, government agencies, and
NGOs, often rely on open innovation to gather ideas and suggestions for solving complex
challenges and making better policies to serve their clients. Open innovation refers to a process
where organizations solicit ideas and suggestions from external sources, such as through a website
or contest, to improve productivity and drive innovation (Chesbrough, 2003; Bjelland & Wood,
2008; Chesbrough, Vanhaverbeke, & West, 2008; Morgan & Wang, 2010; Westerski et al., 2011;
Hrastinski, Kviselius, Ozan, & Edenius, 2010; Piller & Walcher, 2006; Enkel et al., 2005;
Gassmann, 2006; Lakhani & Panetta, 2007; Ogawa & Piller, 2006; West & Lakhani, 2008; Von
Hippel, 2009).
However, a significant challenge in open innovation is filtering through the vast amount of ideas
and suggestions received to identify the best and most relevant ones. This process can be time-
consuming and costly, as organizations may have to recruit a large number of experts to evaluate
the ideas (Klein and Garcia, 2015; Westerski et al., 2013; Blohm et al., 2011; Schulze et al., 2012;
Bjelland and Wood, 2008). For example, the Panama government recently organized an idea
contest, and the website received over 150,000 suggested ideas within a week of its launch
(Panama, 2020). Similarly, Google had to recruit roughly 3,000 experts to evaluate the 150,000 suggestions submitted to its "10 to the 100th" open innovation contest, while Dell's IdeaStorm website received ten thousand ideas on how its products and services could be improved (Klein and Garcia, 2015; Klein, 2012; Buskirk, 2010; Di Gangi and Wasko, 2009).
One solution to this problem is to harness the wisdom of collective intelligence (CI) to filter
through the ideas. CI refers to the ability of groups of people to judge the quality and accuracy of
content, sometimes even better than experts (Surowiecki, 2005). Various CI techniques have been
proposed in the literature to filter ideas, including author-based and content-based filtering (Klein
and Lukumon, 2021; Klein and Garcia, 2015). Author-based filtering techniques filter ideas based
on the reputation of the authors who contributed them, but this approach may exclude brilliant
ideas from unexpected sources and requires extensive prior knowledge of the authors. One way to screen authors is through "gold questions," where contributors are asked to complete a simple activity with a known solution before submitting an idea, in order to determine their competency level
(Oleson et al., 2011). While this method can assist in removing some of the worst content, it has
only been used to estimate quality in crowd-sourced microtasks, not for filtering ideas from open
innovation engagements.
Another approach is where ideas are filtered based on their contents. This kind of filtering is
possible through algorithms or crowds (Klein and Garcia 2015). The algorithmic approach uses
software to establish metrics for the quality of ideas based on features such as word frequency
statistics. An algorithmic approach, however, can easily be fooled because of its limited comprehension of natural language. Similarly, finding training sets to develop such algorithms with machine learning is problematic because large idea corpora would need to be evaluated by experts. For a complete account of the algorithmic approach and its shortcomings, see Walter & Back, 2013; Westerski et al., 2013; Brennan, Wrazien, & Greenstadt, 2010; Adomavicius & Tuzhilin, 2005; Caruana & Li, 2012, and the references therein. In the case of crowd-based filtering, human beings serve as the filters: they check all or some of the ideas and select the best or worst ones. This filtering is considered better than the algorithmic approach because humans comprehend the ideas to be filtered better than software can. Crowd-based filtering can be implemented by voting, ranking, rating, or prediction markets, each with shortcomings (Klein and Garcia, 2015; Kessler,
1995; Salganik et al., 2006; Kostakos, 2009; Arrow, 1963). For example, voting enables members
of the crowd to vote for the ideas they believe should be adopted. In multi-voting (Kessler, 1995),
an approach often used in an idea-filtering scenario, crowds are asked to distribute a budget of N
votes to the best ideas in the corpus (Bao, Sakamoto, & Nickerson, 2011). Although voting systems
are easy to use, they have well-known practical and theoretical limits, particularly with huge option
sets (Lykourentzou et al., 2018; Klein and Garcia, 2015). One of these shortcomings is a "snowball" effect, where voters become fixated on a few ideas or idea concepts (ideas with similar themes) because those ideas or concepts received the initial upvotes and are thus more likely to be seen (and upvoted) by subsequent users, while other potentially better ideas do not get the same attention (Salminen, 2015). Another limit is that crowds are less likely to discern poor from excellent ideas when positive voting is used, such as in majority voting (Klein and Garcia, 2015). Klein and Garcia (2015) addressed this last challenge using a novel approach called the "Bag of Lemons" (BOL). The essential concept underlying this method is that crowds are more effective at weeding out bad ideas than spotting the good ones. They compared this method with the Bag of Stars (BOS) and the conventional Likert scale and found BOL was superior both in terms
of accuracy and speed.
The BOL method is characterized by its focus on quickly identifying and discarding poor or low-quality ideas in order to arrive quickly at a set of the best ideas. This method is particularly useful
in situations where there is a large volume of ideas that need to be evaluated, and time is of the
essence. In this method, raters are trained to quickly identify "lemons" (poor ideas) based on a set
of pre-determined criteria, and discard them in order to focus on the more promising ideas. On the
other hand, the BOS method is focused on identifying and developing high-quality ideas, with an
emphasis on idea improvement and development. This method is particularly useful in situations
where there is a smaller number of ideas that need to be evaluated, and more time is available for
idea development. In this method, raters are trained to identify "stars" (high-quality ideas) based
on a set of pre-determined criteria, and then work to improve and develop these ideas through a
series of iterative processes.
One key distinction between the two methods is that the BOL method focuses on identifying the
"lemons" or poor ideas quickly, while the BOS method evaluates all ideas in more detail and
assigns scores based on multiple criteria. This means that the BOL method is more efficient and
can be used to quickly filter through a large number of ideas, while the BOS method is more
thorough but takes longer to complete. Another distinction is that the BOL method relies on a
single criterion, such as feasibility or novelty, to quickly identify poor ideas, while the BOS method
uses multiple criteria, such as feasibility, novelty, and potential impact, to evaluate ideas. This
means that the BOL method is simpler and easier to use, but may not capture all aspects of an
idea's potential.
The approach of the Bag of Lemons (BOL) has been further examined and expanded upon by
various researchers in the field of idea filtering. Wagenknecht et al. (2017) compared the BOL
method to traditional Likert scales and up-down voting schemes and found that the BOL approach
generated greater user activity, although it was also perceived as more frustrating. The study
suggests that this frustration may be due to perceived information overload, providing insight into
the design of effective and manageable information systems for open innovation and idea filtering.
Similarly, Lykourentzou et al. (2018) expanded upon the BOL approach by introducing
"DBLemons," a dynamic voting scenario in which multiple voters participate at different times
and with uncertain arrival rates. The study aimed to investigate whether the BOL approach would
still be effective in improving filtering efficiency and shortening task times in a scenario that more
closely resembles real-world open innovation communities. Using data from an open innovation
contest on women's safety, the researchers found that their DBLemons method (which incorporates
diversity in the idea concept space) was more efficient than both majority voting and traditional
BOL. This was attributed to voters being able to make more idea comparisons in a shorter amount
of time due to the representative ideas for all concepts being displayed early on. Overall, the study
suggests that the proposed DBLemons strategy can enhance the strategic value of open innovation
by increasing trust in crowd-based idea filtering. We detail the crowd-based approach, particularly BOL, in Section 3.
Studies on idea convergence in crowdsourcing have explored the effects of different types of
incentives, feedback, and community structures on the convergence of ideas in an open innovation
setting. For example, some studies have found that community feedback, such as commenting and
voting, can influence idea convergence by providing participants with a better understanding of
what is considered valuable by the community (Cheng et al., 2020). Similarly, studies on idea
selection in crowdsourcing have focused on identifying the most effective methods for selecting
ideas from a large pool of submissions. Some studies have found that combining multiple
evaluation criteria, such as originality and feasibility, can improve the accuracy of idea selection
in crowdsourcing contests (Kornish & Hutchison-Krupat, 2017).
Studies on idea evaluation in open innovation contexts have also explored various methods for
evaluating the quality of ideas, such as expert evaluation, peer-review, and machine learning
algorithms (Blohm et al., 2011). However, there is still much room for future research in this area, as
the effectiveness of these methods is largely dependent on the specific context and goals of the
innovation process. In relation to the Bag of Lemons (BOL) concept, there has been limited
research conducted.
In the current research, we introduce the concept of a "token budget" as a means of filtering and
selecting ideas from a pool of submissions. Specifically, a token budget refers to the allocation of
a limited number of votes or tokens that individuals in a crowdsourcing context can use to evaluate
and select ideas. This approach is commonly used in open innovation and crowdsourcing to
encourage participation and engagement while also allowing for the efficient selection of high-
quality ideas. It is a way to control the amount of input and influence that each participant has in
the process. In the context of idea filtering, a token budget can be used to ensure that the best ideas
rise to the top by giving more weight to the opinions of participants who have demonstrated an
ability to accurately evaluate and select ideas.
The use of a token budget in idea filtering has been widely studied in the literature, with prior
research showing that it can have a significant impact on the quality and diversity of selected ideas.
For example, studies have shown that increasing the token budget can lead to more diverse and
high-quality ideas being selected (Walser, Seeber, & Maier, 2019; Görzen & Kundisch, 2017;
Cheng et al., 2020). Furthermore, research has also shown that the allocation of tokens can be an
effective way to reduce information overload and cognitive overload in idea evaluation (Walser,
Seeber, & Maier, 2019).
In the literature, the token budget has been found to have the potential negative effect of leading to the
phenomenon of "herding," where participants tend to follow the voting patterns of others instead
of using their own judgement (Görzen & Kundisch, 2017; Walser, Seeber, & Maier, 2019). This
can lead to a lack of diversity in ideas being selected and a lack of independent thinking among
participants (Cheng et al., 2020). To mitigate this effect, in our study we have designed our token
budget system in a way that encourages independent thinking and diversification of opinions. This
includes providing participants with a sufficient amount of tokens, distributing tokens evenly
among participants, and implementing a system of incentives for independent thinking and
dissenting opinions. Additionally, we have incorporated measures such as idea anonymity and
randomized presentation of ideas to further discourage herding behavior. These measures will be
discussed in more detail in the methods section of our study.
3. Motivation: BOL approach over other filtering schemes
As covered in the previous section, BOL (Figure 1) is a novel multi-voting approach developed by Klein and Garcia (2015). This approach enhances the filtering of the ideas gathered from open
innovation. The approach asks the crowd to identify choices that they believe an expert committee
with a specific profile would exclude based on a set of criteria. This method proves superior to
other filters because it enables the crowd to eliminate bad ideas and quickly identify good ones.
To check the efficiency of their approach experimentally, Klein and Garcia (2015) designed an experiment in the context of a research laboratory at a university in Brazil, which had a productivity challenge. Members of the lab were asked to suggest and identify promising ideas about how the lab's productivity could be increased. They then divided the members into three
demographically matched groups of roughly 20 members each. Each group was asked to use one
of the filtering approaches: Bag of Stars (BOS), Bag of Lemons (BOL), and Likert scale. The
experiment consisted of two stages: the open innovation engagement, where ideas about the lab's productivity were gathered from the members using the MIT Deliberatorium (a web-based system that combines ideas from argumentation theory and social computing; Klein, 2011), and the idea filtering engagement, which compared the BOL approach with other filtering schemes. They asked participants to distribute a budget of 10 lemons/stars to the ideas they felt were least likely/most likely to be excellent in the BOL and BOS conditions, respectively.
participants to rate each proposal on a five-point Likert scale. Prizes were awarded to those whose
ratings were closest to that of an expert committee. The BOL approach achieved higher accuracy
in filtering ideas than the other schemes. A Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate at different selection thresholds (Fawcett, 2004), was used to check the accuracy of the filters. In this approach, the area under the curve (AUC) is a measure of filtering accuracy. BOL was the most accurate, followed by Likert and then BOS, and
all conditions outperformed a filter at random (which would have an accuracy of 0.5). At p < 0.05,
all of these differences were statistically significant. Also, BOS and BOL were found to require
roughly one third of the Likert method rater time (p < 0.05).
Our research was inspired by the findings of Klein and Garcia (2015) on the effectiveness of the
Bag of Lemons (BOL) filter in identifying valuable ideas in a crowdsourcing setting. We sought
to build upon their work by investigating the impact of varying token budget sizes on the
performance of the BOL filter. Through experimental testing and mathematical analysis, we aim
to provide a more comprehensive understanding of how token budget size affects the accuracy and
efficiency of idea filtering in a crowdsourcing context. Additionally, we aim to provide practical
recommendations for those looking to implement an idea filtering engagement in the future. To
the best of our knowledge, no previous studies have examined these questions in depth, making
our research a valuable addition to the existing literature on crowdsourcing and idea filtering.
4. Theoretical background
Negativity bias (Rozin & Royzman, 2001), also known as the negativity effect, is the psychological
phenomenon where negative information or events have a greater impact on an individual's
thoughts and emotions than positive or neutral information. This bias has been observed in a wide
range of contexts, including memory, attention, and decision making. This phenomenon is thought
to be an evolved cognitive mechanism that helped our ancestors to survive by prioritizing potential
threats over opportunities.
In the context of idea filtering and evaluation, negativity bias can have a significant impact on the
selection of ideas. Research has shown that people tend to be more critical of and less likely to support new ideas, which can lead to a lack of creativity and innovation. This bias can also lead to
group polarization, where the group becomes more extreme in their views as a result of the
negative feedback of individual members. To mitigate the effects of negativity bias, various
techniques have been proposed such as anonymous idea generation, randomized idea presentation,
and the use of token budget systems to encourage independent thinking and dissenting opinions.
The use of a "Bag of Lemons" (BOL) approach has also been proposed as a way to mitigate the
negativity bias by allowing individuals to rate ideas on a scale of bad to good rather than simply
accepting or rejecting an idea.
Additional theoretical background for this paper comes from the concept of "cognitive load," which refers to the amount of mental effort required to process and understand information, and from decision making in choice tasks. In explaining these, we follow the exposition of Walser et al. (2019).
Cognitive Load Theory, as proposed by Sweller (1988) and further developed by Paas et al. (2004),
suggests that mental effort is required when processing information, particularly in the context of
idea filtering. This theory suggests that cognitive load can be divided into three types: intrinsic,
germane, and extraneous. Intrinsic load, as described by Sweller, Ayres, & Kalyuga (2011), is the
cognitive effort required by the task of filtering ideas itself, and it can be influenced by factors
such as the total number of ideas to be processed and the individual's familiarity with the task.
Germane load, as described by Fu et al. (2017), is the effort required to process information and
store it in long-term memory, which can be improved through instructional guidance and prompts.
Extraneous load, as described by Fu et al. (2017), is the effort imposed by the way information is presented, and it increases when ideas are presented poorly or inadequately. In idea-filtering settings, as
highlighted by Misuraca & Teuscher (2013), intrinsic load can quickly become high, thus it is
important to consider how to present ideas in a manner that keeps cognitive load at manageable
levels to improve the filtering process.
The process of idea filtering, which involves selecting the most promising ideas from a larger pool,
is a complex decision-making process that requires judgment and risk management (Oman, Tumer,
Wood, & Seepersad, 2013). According to Einhorn and Hogarth (1981), the decision-making
process can be broken down into four subprocesses: information acquisition, evaluation, action,
and feedback/learning. In the information acquisition phase, individuals search for and store
information in their memory and in the external environment. They then use various search
strategies, such as maximization of expected value or elimination-by-aspects, to evaluate the
acquired information and make a final choice in the action phase. The feedback phase involves
learning through the decision experience.
However, not all individuals acquire and process information in the same way. Depending on the
context, individuals may apply either a more compensatory or a more non-compensatory heuristic-
based decision-making strategy (Pilli & Mazzon, 2016). Compensatory decision making involves
assigning weights or values to all attributes and selecting the alternative with the highest utility
(Johnson & Payne, 1985). In contrast, non-compensatory decision making does not take into
account all available attribute information or trades off the benefit of one attribute against the
deficit of another (Payne, Bettman, & Johnson, 1993).
In understanding how individuals interact with an idea filtering platform, it is important to consider
these different decision-making strategies and how they may impact the accuracy of choice
outcomes. By understanding the processes and heuristics used by individuals in filtering ideas, the
platform can be designed to better support and guide their decision-making.
5. Research hypotheses
We postulated the following three hypotheses:
H1: People are better at spotting bad ideas than recognizing good ones.
This hypothesis suggests that individuals may be more adept at identifying ideas that are unlikely
to be successful or have flaws, rather than being able to identify ideas that are particularly
innovative or promising. This is a common phenomenon in idea generation and filtering, known
as "the negativity bias," where people tend to focus on the negative aspects of an idea rather than
the positive ones. This hypothesis suggests that by focusing on identifying "bad" ideas, we may be
able to more effectively filter out those that are unlikely to be successful, and ultimately increase
the overall quality of the remaining ideas.
H2: BOL allows one to do less work and achieve greater accuracy than other rating schemes.
This hypothesis suggests that the BOL method, which focuses on identifying "bad" ideas, may be a more efficient and effective way of filtering ideas compared to other methods such as BOS and the conventional Likert scale. The BOL method is based on the idea that by marking a small number of bad ideas with "lemons" (ideas that are unlikely to be successful), one can effectively filter out a
large number of other ideas that are also unlikely to be successful. This hypothesis suggests that
by using the BOL method, individuals may be able to more effectively filter out bad ideas with
less effort than with other methods.
H3: The accuracy of a filter depends on the token budget size.
This hypothesis suggests that the token budget size, or the number of tokens that users are allotted
to rate ideas, may have an impact on the accuracy of the idea filtering system. The token budget
size may affect the number of ideas that users are able to evaluate, which in turn may affect the
overall accuracy of the filters (decision making strategies). Additionally, the token budget size
may affect the degree of diversity among ideas that are evaluated, which may also impact the
accuracy of the filters. This hypothesis suggests that the token budget size is an important
consideration in the design of idea filtering systems and the choice of a good token budget size
may lead to more accurate evaluations.
6. Methodology
6.1. Experimental design
We selected eight substantive ideas from the list of ideas generated by the study reported in Klein and Garcia (2015). Since these ideas were originally written in Portuguese, we translated and edited them to be roughly equal in length (to avoid differences caused by the length of ideas
rather than idea quality). The eight ideas we used are shown in Figure 2 below. The ideas average roughly 25 to 30 words in length; idea 1 is the longest at 38 words, idea 7 is the shortest at 21 words, and the remaining ideas range from 22 to 35 words. Overall, the ideas are short, which makes them easy to read and understand.
Figure 2: Snapshot of the eight substantive ideas selected from Klein and Garcia 2015 as they
appeared in the experiment.
Four experienced researchers from the UM6P School of Collective Intelligence rated the ideas on
a scale of 1 to 10. We adopted a "Delphi approach." (Linstone & Turoff, 1975), which involved
several rounds of deliberation by the experts on the strengths and weaknesses of each idea to ensure
there is a consensus in the expert ratings. From the expert ratings (see Table 1), we determined
that ideas 2, 7, and 8 were classified as "top" ideas, while the others (ideas 1, 3, 4, 5, and 6) were
considered as "inferior." The selection of the top ideas was based on the consensus reached by the
experts after several rounds of deliberation, and not on a fixed cut-off criterion. The sums for idea
1, idea 2 and idea 7 were higher than the others, which is why they were considered as top ideas.
           Expert 1   Expert 2   Expert 3   Expert 4   Sum
Idea 1         1          2          4          4       11
Idea 2         5          5          7          5       22
Idea 3         5          1          5          4       15
Idea 4         1          2          3          3        9
Idea 5         1          1          4          3        9
Idea 6         1          5          4          4       14
Idea 7         5          7          8          8       28
Idea 8         5          8          6          7       26
Table 1: The expert ratings of the ideas.
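As a minimal illustration of this classification step, the short Python sketch below sums the expert ratings from Table 1 and ranks the ideas; selecting the three highest totals reproduces the experts' consensus (the variable names and the cut-off of three top ideas are ours, introduced only for illustration):

# Illustrative sketch: rank ideas by summed expert ratings (data from Table 1).
expert_ratings = {
    "Idea 1": [1, 2, 4, 4],
    "Idea 2": [5, 5, 7, 5],
    "Idea 3": [5, 1, 5, 4],
    "Idea 4": [1, 2, 3, 3],
    "Idea 5": [1, 1, 4, 3],
    "Idea 6": [1, 5, 4, 4],
    "Idea 7": [5, 7, 8, 8],
    "Idea 8": [5, 8, 6, 7],
}

# Sum each idea's ratings and rank the ideas from highest to lowest total.
totals = {idea: sum(scores) for idea, scores in expert_ratings.items()}
ranked = sorted(totals, key=totals.get, reverse=True)

top_ideas = ranked[:3]        # ['Idea 7', 'Idea 8', 'Idea 2']
inferior_ideas = ranked[3:]   # the remaining five ideas
print("Top ideas:", top_ideas, "| Inferior ideas:", inferior_ideas)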
6.2. Participants and procedure
In parallel to the expert ratings, we launched an experiment with human subjects on the Prolific platform (prolific.co, a crowdsourcing platform) and placed them in 17 different
conditions, namely: BOLx, BOSx (x=1,2,3,4,5,6,7; x being the number of token(s)), nBOL, nBOS
(in nBOL and nBOS, participants may select as many ideas as they wish) and the Likert condition, where participants rate each idea on a scale of 1 to 5. Each condition had 50 participants.
The average duration of the experiment was five minutes. Most of the participants on this platform are United States and United Kingdom residents; however, a significant number of workers on the platform are residents of countries such as Italy, Portugal, South Africa, Poland, Spain, Canada, and France. We limited the eligible participants to those with a doctorate degree, with the intention that they would all likely possess some knowledge of how a research lab works. The age of the participants averaged 31 ± 8 years. 42% identified as female, 57% as male, and the remaining 1% preferred not to say. We excluded seven participants who failed the attention test shown in Figure 3.
Figure 3: Snapshot of the attention check question.
To incentivize the participants, on the consent page (see Figure 4), we informed them that
participants with the highest accuracy at rating ideas (as calculated by the ROC analysis we explain
below) would get a bonus once the experiment was completed.
Figure 4: Snapshot of the consent page of the experiment.
We also introduced the experiment to the participants after getting their consent. This page ensured the participants understood what they were being asked to do (see Figure 5).
Figure 5. Snapshot of the experiment introduction screen.
On the next page, participants see an interface where they can select good or bad ideas (in the BOSx and BOLx conditions, respectively) according to the token budget size, except in the nBOL and nBOS conditions, where they can select as many ideas as they wish. In the Likert condition, they are asked to enter their ratings (see Figure 8).
Figure 6. Snapshot of the BOS1 interface (showing 2 of the eight ideas).
Figure 7. Snapshot of the BOL1 interface (showing two of the eight ideas).
Figure 8. Snapshot of the Likert interface (showing 2 of the eight ideas).
For each BOL and BOS interface, users identify the worst/best ideas by allocating tokens to them. For example, in Figures 6 and 7, participants were given one token (they were expected to select one of the ideas). Participants were required to use exactly the token budget allocated to them, no more and no fewer. In the Likert condition, users rated each idea on a scale from 1 (indicating "poor") to 5 (indicating "excellent") by clicking on the star(s) (see Figure 8). To avoid potential bias due to the presentation sequence, we presented the ideas in a randomized order.
After that, we assessed the accuracy of the rating schemes, i.e., how well the groups could identify the "best or worst ideas" as identified by the expert committee. An ideal filter would select all and only the top ideas, i.e., it would have a "true positive" rate of 100% (find all the top ideas) and a "false positive" rate of 0% (not select any bad ideas). To evaluate the efficacy of our filtering techniques, we applied a standard method known as ROC curves (Fawcett, 2004). ROC analysis takes, as input, the scores each test item received from the rating and a list of which test items were "true" positives, i.e., actual instances of the condition we are trying to diagnose. We then plot the true-positive rate against the false-positive rate as a curve. The area under the ROC curve (AUC) is then a measure of accuracy: a perfectly accurate test would have an area of 1.0, while a random one would have an area of 0.5.
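As a concrete illustration of this scoring procedure, the Python sketch below computes the AUC for hypothetical aggregated token counts (the counts, the use of scikit-learn's roc_auc_score, and the sign convention for lemons are our illustrative assumptions, not the analysis code used in the study):

# Illustrative sketch of the ROC/AUC accuracy measure (not the study's own code).
# Assumption: 8 ideas, with ideas 2, 7, and 8 marked as "top" by the experts;
# each idea's score is the number of tokens it received from the crowd.
from sklearn.metrics import roc_auc_score

expert_labels = [0, 1, 0, 0, 0, 0, 1, 1]     # 1 = "top" idea, in order idea 1..8

# Hypothetical aggregated token counts for one BOS and one BOL condition.
star_counts  = [4, 8, 6, 2, 9, 5, 15, 13]    # BOS-style: more stars = better idea
lemon_counts = [14, 11, 9, 16, 15, 10, 2, 4] # BOL-style: more lemons = worse idea

# For BOS, higher scores should indicate top ideas, so the counts are used directly;
# for BOL, more lemons means a worse idea, so the counts are negated before scoring.
auc_bos = roc_auc_score(expert_labels, star_counts)
auc_bol = roc_auc_score(expert_labels, [-c for c in lemon_counts])
print(f"BOS AUC: {auc_bos:.2f}, BOL AUC: {auc_bol:.2f}")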
7. Results and findings
First, we assessed the accuracy of our filters at identifying the top ideas identified by the expert
committee. We submitted participants' ratings for each condition to the ROC algorithm, along with the list of ideas selected as best by the experts. The algorithm produced the following Area Under the Curve (AUC) values for our 17 conditions (see Figure 9):
Figure 9: AUC values across all the conditions (p < 0.05).
Our differences were statistically significant at p<0.05. We used a statistical method proposed by
(Hanley JA et al; 1982) to measure the statistical differences in the AUC scores between the
conditions.
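For readers unfamiliar with this test, the sketch below implements the Hanley and McNeil (1982) standard error for an AUC and a z statistic for the difference between two independent AUCs; the numbers in the example call are placeholders, not values from our experiment:

# Illustrative sketch of the Hanley & McNeil (1982) AUC standard error and a
# z-test for comparing two independent AUCs. All input values are hypothetical.
import math

def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of an AUC, per Hanley & McNeil (1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc**2)
                + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(variance)

def z_statistic(auc_a: float, auc_b: float, n_pos: int, n_neg: int) -> float:
    """z statistic for the difference between two independent AUCs."""
    se_a = auc_standard_error(auc_a, n_pos, n_neg)
    se_b = auc_standard_error(auc_b, n_pos, n_neg)
    return (auc_a - auc_b) / math.sqrt(se_a**2 + se_b**2)

# Example with placeholder values: 3 "top" and 5 "inferior" ideas.
print(z_statistic(auc_a=0.77, auc_b=0.55, n_pos=3, n_neg=5))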
We found BOL had the highest AUC scores (see BOL5, BOL3, and BOL1), with some filters performing no better than a random filter (we explain the reason below). We also noticed higher accuracy when the number of tokens matched the number of winning ideas. In our experiment, where the experts selected three winners, this was the case for BOL3 and even for BOS3, which had the second-highest accuracy among the BOS conditions. If the mismatch is too large, accuracy falls to random or even below; this is true when the token budget is at least twice the number of winners, e.g., BOL6, BOS6, BOL7, BOS7. An intuitive explanation is simple: since there were only three winning ideas (and five inferior ones), participants with large token budgets were forced to assign some tokens to ideas outside the category they were targeting (e.g., stars to inferior ideas, or lemons to top ideas). BOL5 is, however, an exception: its number of tokens is higher than the number of winning ideas, but it still performed better than a random filter.
We also observed that BOLx was always more accurate than BOSx, except in the nBOL condition. We believe nBOL has low accuracy because, like its Likert counterpart, it lacks a fixed token budget and may therefore produce a scattershot placement of tokens. This finding interestingly replicates the study of Klein and Garcia (2015). Regarding accuracy, we found BOL (except for BOL6, BOL7, and nBOL) better than BOS and Likert. This validates our prediction that people are better at identifying bad ideas (when they use BOL) than recognizing good ones.
Also, we observed that BOL works better for an odd number of tokens. Half of the eight BOL conditions in the experiment had an odd number of tokens, i.e., BOL1, BOL3, BOL5, and BOL7. As shown in Figure 9, all these "odd" conditions produced a high AUC, greater than 70% in all cases except BOL7, whose low AUC was explained above. In essence, giving an odd number of tokens appears to make good accuracy more likely. We currently have no explanation for this pattern, but we plan to dig deeper into the impact of "odd" token budgets on filtering in a future study.
We also observed that higher accuracy is achieved when the number of tokens equals the number of ideas in one of the expert-defined categories. This accounts for the superior performance of BOL3 and BOL5: the experts categorized three ideas as "top" and five as "inferior," respectively. From our analysis, we found that BOL3, with a token budget equal to the number of "top" ideas, produced high accuracy (AUC = 0.77), the highest after BOL5. So, in an idea filtering engagement, it might be ideal to make the number of tokens equal to the number of "top" ideas being targeted, i.e., if three ideas are to be marked as "top," it might be ideal to have a budget of three tokens.
However, nBOL and nBOS performed no more accurately than random filters. Our hypothesis is that when crowd members can select any number of ideas from the list with no restriction (unlike in the BOLx and BOSx conditions), they might select ideas without much deliberation. So, we would not recommend nBOL or nBOS for anyone planning an idea filtering engagement.
The average amount of time spent, in seconds, doing the ratings in each filtering condition is shown
in Figure 10 below. We found that the filtering schemes with higher accuracy consumed less rating time, illustrating the relationship between the speed and accuracy of the filters.
Figure 10: Average time spent across conditions (p<0.05)
Building on our findings, it is important to consider the implications of our study for both theory
and practice. From a theoretical perspective, our findings contribute to the existing literature on
open innovation and idea filtering engagement. By demonstrating the superior performance of
BOL compared to other rating schemes, we add to the understanding of how different
approaches to rating and filtering ideas can impact the accuracy of results.
Additionally, our findings on the effect of token budget size on accuracy have important
implications for practitioners. Given that the cost and time of filtering ideas is a critical concern
for those implementing open innovation systems, our results suggest that using BOL with an
appropriate token budget size can provide more accurate results in a more efficient manner.
In conclusion, our study provides important insights into the relationship between token budget
size, accuracy, and time in idea filtering engagement and open innovation. We hope that our
findings will be useful for practitioners and researchers alike in their pursuit of effective and
efficient methods for filtering and selecting ideas.
In the following section, we provide a mathematical theory explaining why BOL is better.
8. Why is BOL better? A mathematical explanation.
In this section, we present a mathematical explanation of why BOL performs better than other filters, via the following theorem:
Theorem 8.1
In the context of idea filtering, where the number of ideas needed (k) is less than the total
number of ideas generated (N), a filter that focuses on eliminating the less exceptional or worse ideas (BOL) will result in a more accurate and efficient selection process than a filter that focuses on selecting the exceptional or better ideas (BOS).
Proof
Consider a manager in a company who is tasked with finding a solution to improve productivity
in their research lab. The manager receives N proposals or ideas on how to solve the problem.
However, due to limited resources and time, the manager can only implement a certain number of
proposals, represented by k.
We can represent the selection process using probability. For each proposal, the manager assigns a probability of being selected based on their own criteria, such as ease of implementation and potential for high productivity. Let P(n_i) be the probability of proposal i being selected, where 0 ≤ P(n_i) ≤ 1 and Σ_i P(n_i) = 1. Proposals with a higher probability of being selected are prioritized, which we can write as P(n_1) < P(n_2) < ... < P(n_N), where n_i (i = 1, 2, ..., N) represents a proposal. Now we have two filtering methods, BOS and BOL.
When using a BOS filter, the manager is likely to spend more time evaluating proposals that may
not meet the necessary criteria, as they are focusing on exceptional or better ideas. This can lead
to a less efficient selection process as the manager may not have enough time to evaluate all of the
proposals that meet the necessary criteria.
On the other hand, when using a BOL filter, the manager is more likely to select a smaller number
of proposals, k, from the total number of proposals received, N. This is because there will be a
large number of proposals that do not meet the necessary criteria and are therefore less likely to be
selected. This results in a more efficient selection process as the manager is not wasting time on
proposals that do not meet the necessary criteria.
Let P(x) be the probability that proposal x is selected. For the BOS filter, we have: Σ P(x) for x in BOS proposals > Σ P(x) for x in BOL proposals. This shows that the total probability of the proposals being selected under the BOS filter is greater than that under the BOL filter. To further
prove the accuracy and efficiency of the BOL filter, we can also compare the ratio of selected
proposals to total proposals: (k/N) for BOL filter > (k/N) for BOS filter. This shows that the ratio
of selected proposals to total proposals is higher for the BOL filter than the BOS filter, thus
demonstrating its efficiency in selecting the most suitable proposals. In conclusion, using a BOL
filter results in a more accurate and efficient selection process compared to using a BOS filter, as
it prioritizes less exceptional or worse ideas and allows the manager to focus on proposals that
meet the necessary criteria without wasting time on exceptional but irrelevant proposals.
It is worth stating that this theorem assumes that the manager has a clear set of criteria for
evaluating proposals, and that these criteria do not change throughout the selection process.
Additionally, it is assumed that the manager has the capability to accurately assign probabilities to
each proposal based on their level of exceptionality. Another assumption of the theorem is that the
manager is only interested in finding k solutions from the set of N proposals, and that the manager
is not interested in any additional solutions beyond k. However, in some scenarios, the criteria for
evaluating proposals may change over time or may be subject to different interpretations by
different managers. Additionally, it may be difficult for the manager to accurately assign
probabilities to proposals without conducting further research or analysis. Furthermore, the
theorem assumes that the manager is only interested in selecting the best k solutions and not in
selecting additional solutions that may have potential benefits.
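As a complement to this argument, the following Monte Carlo sketch illustrates how hypothesis H1 translates into the accuracy difference we observed. It assumes, purely for illustration, that raters misplace a lemon (put it on a top idea) less often than they misplace a star (put it on an inferior idea), and then compares the average AUC of the two aggregated token counts. The idea pool, crowd size, error rates, and aggregation rule are our assumptions, not quantities derived from the theorem or from our data:

# Illustrative Monte Carlo sketch linking hypothesis H1 to filter accuracy.
# Assumptions (ours, for illustration only): raters misplace a lemon at rate
# E_BOL and misplace a star at the higher rate E_BOS; a deliberately small
# crowd is used so that individual errors remain visible in the aggregate.
import random
from statistics import mean
from sklearn.metrics import roc_auc_score

random.seed(1)
N, K = 8, 3                  # 8 ideas, of which 3 are truly "top"
RATERS, BUDGET = 5, 3        # raters per simulated crowd, tokens per rater
E_BOL, E_BOS = 0.15, 0.45    # assumed per-token error rates (H1: E_BOL < E_BOS)
LABELS = [1] * K + [0] * (N - K)   # 1 = top idea, 0 = inferior idea

def token_counts(error_rate, target_top):
    """Aggregate token counts when raters aim at top (stars) or inferior (lemons) ideas."""
    counts = [0] * N
    for _ in range(RATERS):
        chosen = set()
        while len(chosen) < BUDGET:
            hit = random.random() >= error_rate       # did this token land as intended?
            wanted = 1 if target_top == hit else 0    # label of the idea it lands on
            pool = [i for i in range(N) if LABELS[i] == wanted and i not in chosen]
            if not pool:                              # intended pool exhausted
                pool = [i for i in range(N) if i not in chosen]
            chosen.add(random.choice(pool))
        for i in chosen:
            counts[i] += 1
    return counts

runs = 2000
auc_bos = mean(roc_auc_score(LABELS, token_counts(E_BOS, True)) for _ in range(runs))
auc_bol = mean(roc_auc_score(LABELS, [-c for c in token_counts(E_BOL, False)])
               for _ in range(runs))
print(f"Mean AUC over {runs} runs -- BOS: {auc_bos:.2f}, BOL: {auc_bol:.2f}")

Under these assumed error rates, the lemon-based aggregate recovers the experts' classification more accurately than the star-based one, mirroring the pattern in Figure 9; the advantage hinges entirely on the assumed error asymmetry, which is precisely what H1 posits.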
9. Conclusions
In conclusion, the results of our study demonstrate the importance of considering the token budget
size when implementing idea filtering in open innovation systems. The results showed that the
BOL method is more accurate than the conventional Likert scale and the BOS method in filtering out
bad ideas. Furthermore, our findings support the hypothesis that people are better at spotting bad
ideas than recognizing good ones. The relationship between accuracy and speed was also
established, with the filters with higher accuracy consuming less rating time.
Our study provides new insights into the effect of token budget size on idea filtering engagement
and supports the findings of previous studies. The mathematical theory presented in Section 8
provides a clear explanation for why BOL is a better approach compared to other methods.
In summary, our study contributes to the literature on open innovation and idea filtering, and
provides practical recommendations for those looking to implement an idea filtering engagement
in the future. Our findings highlight the importance of considering token budget size in open
innovation systems and demonstrate that BOL is an effective approach for filtering out bad ideas
in a fast and accurate manner.
10. Limitations and suggestions for future research
It is important to acknowledge the potential limitations of this study and identify areas for future
research. While our findings support our hypotheses and provide valuable insights, there are
several limitations to consider. One potential limitation is that the study was conducted in a
controlled laboratory setting and may not generalize to real-world situations. Further studies in
real-world settings would be beneficial in determining the generalizability of our results.
Another limitation is the small sample size used in the study, which may not be representative of
the larger population. Larger sample sizes and more diverse participant populations would provide
stronger evidence for the validity of our findings.
Future research could also investigate the effects of different token budget sizes on the accuracy
of idea filtering engagement in other domains, such as healthcare or education. In addition, the
interaction between token budget size and the type of idea being filtered could also be studied to
determine if certain types of ideas are more accurately filtered than others. Also, future research
could investigate the influence of the ratio of top to bad ideas in the idea set and strive to identify a robust optimal token budget. This could involve simulating different ratios of good and bad ideas in an idea set and varying the selection probability of top and bad ideas.
Overall, while our study provides valuable insights into the relationship between token budget size
and idea filtering accuracy, there is still much to be explored in this area. Further research is needed
to fully understand the implications of our findings for both theory and practice.
11. Acknowledgement
We are grateful to the 750 Prolific members who participated in the experiments, and to the four experts at the UM6P School of Collective Intelligence who evaluated the ideas we used as test
items.
Declaration
Conflicts of interest/Competing interests: The authors declare that there are no conflicts of interest.
12. References
Adomavicius, G., & Tuzhilin, A. (2005). Toward the Next Generation of Recommender
Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on
Knowledge and Data Engineering, 17(6), 734-749.
Arrow, K. J. (1963). Social choice and individual values. Wiley.
Bailey, B. P., & Horvitz, E. (2010). What's Your Idea? A Case Study of a Grassroots Innovation Pipeline within a Large Software Company. Proceedings of the Computer-Human Interaction Conference.
Bjelland, O. M., & Wood, R. C. (2008). An Inside View of IBM’s ‘Innovation Jam.’ Sloan
Management Review, 50(1)(1).
Blohm, I., Riedl, C., Leimeister, J. M., & Krcmar, H. (2011, October). Idea evaluation mechanisms
for collective intelligence in open innovation communities: Do traders outperform raters?.
In Proceedings of 32nd International Conference on Information Systems.
Brennan, M. R., Wrazien, S., & Greenstadt, R. (2010). Learning to Extract Quality Discourse in
Online Communities. Proceedings from WikiAI-10: AAAI-2010 Workshop on Collaboratively-
built Knowledge Sources and Artificial Intelligence.
Buskirk, E. V. (2010). Google Struggles to Give Away $10 Million. Wired Magazine
http://www.wired.com/business/2010/06/google-struggles-to-give-away-10-million/all/.
Caruana, G., & Li, M. (2012). A survey of emerging approaches to spam filtering. ACM
Computing Surveys (CSUR), 44(2), 9.
Chesbrough, H. W. (2003). Open innovation: The new imperative for creating and profiting from
technology. Harvard Business Press.
Chesbrough, H., Vanhaverbeke, W., & West, J. (2008). Open Innovation: Researching a New
Paradigm. Oxford university press.
Cheng, X., Fu, S., de Vreede, T., de Vreede, G.-J., Seeber, I., Maier, R., & Weber, B. (2020). Idea
Convergence Quality in Open Innovation Crowdsourcing: A Cognitive Load Perspective. Journal
of Management Information Systems, 37(2), 349–
376. https://doi.org/10.1080/07421222.2020.1759344.
Di Gangi, P. M., & Wasko, M. (2009). Steal my idea! Organizational adoption of user innovations
from a user innovation community: A case study of Dell IdeaStorm. Decision Support Systems,
48(1)(1), 303-312.
Einhorn, H. J., & Hogarth, R. M. (1981). Behavioral decision theory: Processes of judgment and
choice. Annual Review of Psychology, 32(1), 53-88.
Enkel, E., Perez-Freije, J., & Gassmann, O. (2005). Minimizing market risks through customer
integration in new product development: learning from bad practice. Creativity and Innovation
Management, 14(4), 425-437.
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine
Learning, 31, 1-38.
Fu, S., de Vreede, G.-J., Cheng, X., Seeber, I., Maier, R., & Weber, B. (2017). Convergence of
crowdsourcing ideas: A cognitive load perspective. In Proceedings of the 38th International
Conference on Information Systems: Transforming Society with Digital Innovation.
Gassmann, O. (2006). Opening up the innovation process: towards an agenda. R&D
Management, 36(3), 223-228.
Görzen, T., & Kundisch, D. (2017). When in Doubt Follow the Crowd: How Idea Quality
Moderates the Effect of an Anchor on Idea Evaluation. Thirty Eighth International Conference
on Information Systems (ICIS), 1–20.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology, 143(1), 29-36.
Hrastinski, S., Kviselius, N. Z., Ozan, H., & Edenius, M. (2010). A review of technologies for open
innovation: characteristics and future trends.
Johnson, E. J., & Payne, J. W. (1985). Effort and accuracy in choice. Management Science, 31(4),
395414.
Salminen, J. (2015). The role of collective intelligence in crowdsourcing innovation. Ph.D. dissertation, Lappeenranta University of Technology. http://urn.fi/URN:ISBN:978-952-265-876-0
Kornish, L. J., & Hutchison-Krupat, J. (2017). Research on idea generation and selection:
Implications for management of technology. Production and Operations Management, 26(4),
633-651.
Kessler, F. (1995). Team decision making: pitfalls and procedures. Management Development
Review, 8(5), 38-40.
Klein, M; & Garca, A-C-B. (2015), High-Speed idea Filtering with the Bag of Lemons. Decision
Support Systems 78 (c): 39-50.
Klein, M. (2012). Enabling Large-Scale Deliberation Using Attention-Mediation Metrics.
Computer-Supported Collaborative Work, 21(4)(4), 449-473.
Klein, M., & Convertino, G. (2014). An Embarrassment of Riches: A Critical Review of Open
Innovation Systems. Communications of the ACM, 57(11)(11), 40-42.
Kostakos, V. (2009). Is the crowd's wisdom biased? A quantitative analysis of three online communities. Proceedings from the Conference on Computational Science and Engineering.
Lakhani, K. R., & Panetta, J. A. (2007). The principles of distributed innovation. Innovations, 2(3).
Linstone, H. A., & Turoff, M. (1975). The Delphi Method: Techniques and Applications. Addison-
Wesley Publishing Company.
Lykourentzou, I., Ahmed, F., Papastathis, C., Sadien, I., & Papangelis, K. (2018). When crowds
give you lemons: Filtering innovative ideas using a diverse-bag-of-lemons strategy. Proceedings
of the ACM on Human-Computer Interaction, 2(CSCW), 1-23.
Morgan, J., & Wang, R. (2010). Tournaments for ideas. California Management Review, 52(2),
77.
Misuraca, R., & Teuscher, U. (2013). Time flies when you maximize—maximizers and satisficers
perceive time differently when making decisions. Acta Psychologica, 143(2), 176-180.
Ogawa, S., & Piller, F. T. (2006). Reducing the risks of new product development. MIT Sloan
management review, 47(2), 65.
Oman, S. K., Tumer, I. Y., Wood, K., & Seepersad, C. (2013). A comparison of creativity and
innovation metrics and sample validation through in-class design projects. Research in
Engineering Design, 24(1), 65-92.
Paas, F., Renkl, A., & Sweller, J. (2004). Cognitive load theory: Instructional implications of the
interaction between information structures and cognitive architecture. Instructional Science, 32(1),
1-8.
Panama Government (2020). Ideation Panama. Accessed June 16, 2022. https://propanama.gob.pa/en/evento/71
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1993). The adaptive decision maker. New York,
NY: Cambridge University Press.
Pilli, L. E., & Mazzon, J. A. (2016). Information overload, choice deferral, and moderating role of
need for cognition: Empirical evidence. Revista de Administração, 51(1), 36-55.
Piller, F. T., & Walcher, D. (2006). Toolkits for idea competitions: a novel method to integrate
users in new product development. R&D management, 36(3), 307-318.
Rozin, P., & Royzman, E. B. (2001). Negativity bias, negativity dominance, and
contagion. Personality and social psychology review, 5(4), 296-320.
Salganik, M. J., & Levy., K. E. C. (2012). Wiki surveys: Open and quantifiable social data
collection. http://arxiv.org/abs/1202.0500.
Surowiecki, J. (2005). The Wisdom of Crowds. Anchor.
Schulze, T., Indulska, M., Geiger, D., & Korthaus, A. (2012). Idea assessment in open
innovation: A state of practice. Proceedings of the European Conference on Information
Systems.
Salganik, M. J., Dodds, P. S., & Watts, D. J. (2006). Experimental Study of Inequality and
Unpredictability in an Artificial Cultural Market. Science, 311(5762), 854-856.
Salminen, J., & Harmaakorpi, V. (2012). Collective Intelligence and Practice-Based Innovation:
An Idea Evaluation Method Based on Collective Intelligence. In Practice-Based Innovation:
Insights, Applications and Policy Implications (pp. 213-232). Springer.
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257-285.
Sweller, J., Ayres, P., & Kalyuga, S. (2011). Cognitive load theory. New York, NY: Springer.
Wagenknecht, T., Crommelinck, J., Teubner, T., & Weinhardt, C. (2017). When life gives you lemons: How rating scales affect user activity and frustration in collaborative evaluation processes. In Proceedings of the 13th International Conference on Wirtschaftsinformatik.
Von Hippel, E. (2009). Democratizing Innovation. MIT Press.
Walter, T. P., & Back, A. (2013). A Text Mining Approach to Evaluate Submissions to
Crowdsourcing Contests.
Walser, R., Seeber, I., & Maier, R. (2019). Designing idea convergence platforms: The role of
decomposition of information load to nudge raters towards accurate choices. AIS Transactions on
Human-Computer Interaction, 11(3), 179–207. https://doi.org/10.17705/1thci.00119
West, J., & Lakhani, K. R. (2008). Getting clear about communities in open innovation. Industry
and Innovation, 15(2), 223-231.
Westerski, A., Dalamagas, T., & Iglesias, C. A. (2013). Classifying and comparing community
innovation in Idea Management Systems. Decision Support Systems, 54(3)(3), 1316-1326.
Westerski, A., Iglesias, C. A., & Nagle, T. (2011). The road from community ideas to
organisational innovation: a life cycle survey of idea management systems. International Journal
of Web Based Communities, 7(4), 493-506.