This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2938659, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
Assessment of machine learning performance for decision support in venture capital investments
JAVIER ARROYO1, FRANCESCO COREA2, GUILLERMO JIMENEZ-DIAZ1, and JUAN A.
RECIO-GARCIA1
1Department of Software Engineering and Artificial Intelligence. University Complutense of Madrid, Spain
2Ca’ Foscari University, Venice, Italy. Four Trees Merchant Partners Inc. Madrid, Spain
Corresponding author: Juan A. Recio-Garcia (e-mail: jareciog@fdi.ucm.es).
This work was supported by Four Trees Merchant Partners Inc. and in part by the Spanish Committee on Economy and Competitiveness (TIN2017-87330-R) and by the European Union's H2020 coordination and support actions under grant agreement No 825215.
ABSTRACT The venture capital (VC) industry offers opportunities for investment in early-stage companies where uncertainty is very high. Unfortunately, the tools investors currently have available are not robust enough to reduce risk and help them manage uncertainty better. Machine learning data-driven approaches can bridge this gap, as they already do in the hedge fund industry. These approaches are now possible because data from thousands of companies around the world is available through platforms such as Crunchbase.
Previous academic efforts have focused only on predicting two classes of exits, i.e., being acquired by another company or offering shares to the public, using only one or a few subsets of explanatory variables. These events are typically related to high returns, but also higher risk, making it hard for a venture fund to get repeatable and sustainable returns. In contrast, we try to predict more possible outcomes, including a subsequent funding round or the closure of the company, using a large set of signals. In this way, our approach would provide VC investors with more information to set up a portfolio with lower risk that may eventually achieve higher returns than those based on finding unicorns (i.e., companies with a valuation higher than one billion dollars).
We analyze the performance of several machine learning methods on a dataset of over 120,000 early-stage companies in a realistic setting that tries to predict their progress in a 3-year time window. Results show that machine learning can support venture investors in their decision-making processes to find opportunities and better assess the risk of potential investments.
INDEX TERMS CrunchBase, Decision Support Systems, Investment, Machine Learning, Risk Assessment, Venture Capital.
I. INTRODUCTION
After the last financial crisis, one of the most immediate reactions of financial institutions and regulators was to artificially lower interest rates. Indeed, we now live in a historical period in which interest rates are at their lowest levels in several decades. As a consequence, traditional public markets no longer represent the solution for investors seeking sustainable returns.
In this difficult environment, venture capital (VC) as an
asset class has emerged as one of the potential poles of
attraction for investors that are both looking for financial
returns and innovation sprints. Many of the biggest empires
have indeed been created (and funded) in the last 10-15 years
and the sector itself is evolving at an incredibly high speed
because of the huge interest raised by private and institutional
investors.
Ten years on, this investment frenzy has brought a couple of considerations. First of all, it is getting harder and harder to obtain sustainable returns in VC as well. The industry is therefore polarizing: many bigger funds have been quickly raised in the last two years to invest bigger tickets in faster-scaling companies and to double down on winners. On the other side, a good opportunity still exists for investments in early-stage projects, but the related uncertainty
makes many of those deals hard to finalize. Hence, while on the one hand we have crazy valuations completely detached from any fundamentals, on the other hand the number of seed deals is shrinking.
This creates a gap and therefore an opportunity for smart investors who are willing to bet on early-stage companies with less validation than their more mature peers. Unfortunately, the toolbox investors currently have available is not robust enough to reduce their risk and help them manage uncertainty better. The main rationale of this work is, therefore, to provide the investment community with tools that can make early-stage deals more attractive.
Artificial intelligence and machine learning could be that new tool. Being a data-driven investor is a very well-known concept in the hedge fund industry, but it is not yet as popular in VC, and a lot of work can be done in this space. Machine learning can indeed support VC investors by helping them spot business opportunities, perform better portfolio management, match co-investors and deals, obtain intelligence on the competitors' landscape, identify potential acquirers, and much more. In other words, it has the potential to make venture investors better and more informed, even in the post-investment phase where they need to help companies to grow.
However, there is another specific case we are interested in. A venture capitalist could be either a great financial investor or a great operator. In either case (and often the edges blur), VCs have to possess two other skills in addition to post-investment support abilities: i) they have to be able to generate interesting deal flow and understand where good companies are; ii) they have to be able to identify patterns or signals of potential success in a company and pay the right price for it.
Those skills are hard to acquire and can only be developed by spending years in the industry. We believe, though, that machine learning can speed up the learning curve for an investor. This paper is thus an attempt to establish a data-driven approach that might be useful for early-stage investors to predict the future success of a company and understand the associated risks. We will show that it is possible to draw some insights from mostly qualitative data and that those insights have a positive correlation with the probability that a company progresses. To show this, we test different machine learning methods that could better inform an investor about which company deserves funding. Our dataset consists of over 120,000 early-stage companies retrieved from Crunchbase,¹ a platform that gathers business information about companies, e.g., funding sources, founders, business sector, etc.
While many studies and VCs focus on predicting whether a company will eventually be acquired or go through an initial public offering (IPO), we will try to predict what will happen to the company next, including not only acquisition or IPO, but also obtaining more funds, or closure. In this way, we aim to offer VC investors the opportunity to consider not only high-risk/high-reward companies (such as potential unicorns) but also to be able to set up a portfolio with lower risk.
¹ https://www.crunchbase.com/
The rest of the work is structured as follows: Section II discusses how this work relates to previous similar studies. Section III introduces our main models, techniques and dataset composition and collection. Section IV shows the empirical results of our study, while Section V analyzes the technical challenges and the business implications of our predictions. Finally, Section VI summarizes our main results and discusses future research directions.
II. LITERATURE REVIEW
The literature offers several approaches for predicting the success of early-stage or startup companies. A popular one is the success/failure model based on logistic regression [1]. It considers 15 explanatory variables obtained from the review of 20 previous works. This model has been extended and validated for different markets such as the United States, Chile or Croatia [2], [3]. In the case of the U.S. market, only 4 of these variables were statistically significant (planning, professional advice, education, and staffing), possibly due to the relatively small sample size.
There may also be different ways to take into account and measure the reputation of a VC investor. In fact, using a logit framework, studies show that reputable VC firms are more likely to lead their companies to successful exits [4]. This evidence also holds when analyzing the performance of individual VC investments. In another work, logistic regression analysis is used to study the relation of growth in 200 Finnish firms with the founders' motives, their background characteristics, management styles, etc. [5].
A study with a different approach analyzes the survival
of 181 newly established manufacturing firms in north-east
England [6]. The work uses log-logistic hazard models to
study the relationship between the survival time of the firm
and signals either related to the firm (plant size) or the
macroeconomic aspects.
Another approach models the total amount of VC funding raised with a linear regression [7]. In this case, the dataset contains information about US biotechnology companies created from 1974 to 2011. The variables that define these companies are related to the number of VC investments received, the number of patents, the citations of these patents, and other geographical information.
So far, most of the approaches reviewed are based on
regression analysis, mainly logistic regression. However,
fewer works have explored alternatives based on artificial
intelligence.
One of the earliest works in this stream presents a rule-based expert system that predicts the acquisition of companies [8]. This system achieves a success rate of 70%, although evaluation is limited to a dataset of 200 companies. Even if an expert system can be used to select successful companies, data-driven approaches based on machine learning have been more popular. For example, Wei et al. (2009) propose the use of ensemble classifiers to predict about 600 cases of mergers
and acquisitions in Japan [9]. In this work, the predictors are technological variables from patent analysis and the profiles of both investors and candidate target companies. The authors report a global accuracy of 88% and precision over 40% when predicting an acquisition.
Similarly, Yankov et al. (2014) compare several machine learning methods to predict the success of 142 Bulgarian startups using data from a questionnaire [10]. The authors show that decision trees are the most accurate method and use them to reveal startup success factors, e.g., the presence of competitive advantage, the founders' experience in a similar position, etc. A more sophisticated approach combines supervised (Support Vector Machines) and unsupervised learning (clustering) for the prediction of business models with higher growth expectations and chances of survival [11]. The work considers startups from the USA and Germany and achieves an accuracy of 83.6% when trying to predict the survival of a venture, but again using a small dataset of 181 companies.
Other authors compare the performance of human experts and machine learning techniques when predicting outcomes of early-stage firms [12]. They consider 2,506 Nigerian firms that participated in a business plan competition and conclude that machine learning methods do not achieve significant improvements compared to human experts. They report a 63% success rate for machine learning and 58% for human experts, which led them to conclude that human experts also have difficulty in identifying which firms will succeed.
While machine learning seems a promising avenue, the works presented so far have worked either with a small dataset or with data retrieved ad hoc. More representative samples can be obtained from platforms like Crunchbase, which gathers data from hundreds of thousands of startups, even though the data retrieved is not as rich as in the case of ad hoc datasets.
Xiang et al. (2012) try to predict acquisitions through machine learning for companies founded between 1970 and 2007 [13]. As predictors, they use different kinds of firm descriptors, including information on the management team and finance sources, but also information from TechCrunch news. They highlight the problems related to the sparsity of the dataset, dropping approximately 20,000 companies because of the lack of a complete description. This way, they use a dataset with 60,000 companies described by 22 features and they segment them according to their business sector. They enrich the dataset with the distribution of news for each company over the 5 most representative topics from each business sector, using a corpus of over 38,000 news articles. Unfortunately, only about 5,000 companies had a presence in the corpus. Finally, they show that considering the information provided by the news improves the results. They achieve a precision between 60% and 79.8% for half of the categories, and Bayesian Networks outperform both SVM and logistic regression.
Another work uses a Crunchbase dataset of over 80,000 startups from five states in the US from 1985 to 2014 to predict either an M&A (merger and acquisition) or an IPO (initial public offering) [14]. The author compares the performance of logistic regression, SVM and random forests for the prediction of the success of startups using the data provided by Crunchbase. The proposed approach comprises a data acquisition and selection stage. Next, a pre-processing stage tries to avoid the sparsity problem reported in other works [13]. To address this problem, the author proposes to re-code several variables and generate synthetic variables to represent potentially interesting features. Another problem faced when trying to create the predictive model was the large class imbalance between successful and non-successful companies. After pre-processing, only 16.8% of the dataset consisted of successful companies. Therefore, the author employs an oversampling strategy for the minority class to fix that issue. The author reports a precision close to 92%. However, the impact of the artificial increase of the dataset caused by the oversampling (with 60% more instances) is not taken into consideration when discussing the results. Moreover, the time window precedes the creation of Crunchbase, so a survivorship bias is possible, since successful companies whose foundation precedes that of Crunchbase are over-represented.
As a result, we can conclude that the literature reveals the potential of using machine learning to exploit Crunchbase data to create a decision support system for the prediction of the success of early-stage companies. However, we consider that previous approaches focus on time windows that are too long, which is unrealistic for a VC, and only on IPOs or acquisitions, rather than including a larger set of potential outcomes. The following section presents our approach and how it differs in the use of the Crunchbase dataset to help VC investors screen promising early-stage companies.
III. PREDICTION OF SUCCESS IN EARLY-STAGE COMPANIES
The main goal of this work is the development and evaluation of a data-driven approach that uses machine learning to help VC investors scout and select the best companies to support. As in some previous works, our approach relies on the data provided through Crunchbase.
The main features of our approach are the following:
• Focused on early-stage companies: We define early-stage companies as active companies that are younger than a given age at the start of a simulation window and that are in an early funding stage (earlier than Series C), according to the startup fundraising stages.²
• Time-aware analysis: Our approach defines a realistic time window, which represents a reasonable investment window for a VC investor.
• Multi-class prediction problem: Our target variable represents the next event that a company will achieve in a defined simulation window. The prediction of these events is useful for the VC investor when making an investment decision because they represent the company's progress. Namely: a company closes; it obtains investment through a new funding round; it is acquired by another company; or it goes public via an IPO (allowing the company to raise capital from public markets). We also consider the case where no event is reported for a company in Crunchbase during the simulation window.
FIGURE 1. Warmup and Simulation window. [Timeline with t_c < t_s < t_f. Warmup window (t_c to t_s): company created between t_c and t_s; not acquired, closed, or gone public (IPO); funding round earlier than Series C. Simulation window (t_s to t_f): first event occurred, one of ACQUIRED, CLOSED, IPO, FUNDING ROUND, NO EVENT.]
² There are various types of funding rounds: Seed, Series A, B, and C, and so on. While Seed funding can be obtained through investments from business angels and micro-VCs, in Series A institutional investors and large funds typically kick in. Series B usually comes into play for a startup that is already profitable and that wishes to increase its profit margin. The next round is named Series C. Many companies use Series C funding to help boost their valuation in anticipation of an IPO. However, some companies go on to Series D and even Series E rounds of funding as well, mainly because they are in search of a final push before an IPO.
We want to emphasize that our time-aware approach offers a more realistic prediction than the other works detailed in Section II. Instead of using all the data available in the dataset, our approach uses data from the companies that were early-stage startups before the time window under investigation. The approach aims to predict what will happen first to each company in the window.
More specifically, our approach defines the following time
windows (shown in Figure 1):
• The approach defines a timestamp t_s that represents the moment when the VC investor decides to invest in a startup company.
• The Simulation window represents the temporal window in which the company will show its success. It is defined by a Start timestamp (t_s) and a Finish timestamp (t_f). The interval is selected according to the time that VC investors consider appropriate to evaluate the success of a company.
• The Warmup window represents the period that defines what happened in a company before the VC investor decided to invest in it. As we target startups, our approach only takes into account companies created between the start of the Warmup window (t_c) and the time when the simulation begins (t_s).
In our experiment, we keep the time windows close to the current time. In this way, we want to minimize the survivorship bias that is obviously present in Crunchbase (or in any similar platform). Companies that succeeded are to some extent over-represented because the ones that failed early may never have appeared in the database in the first place. By studying a period close in time, we also aim to learn the way startups and VCs operate nowadays, which might be different from previous best practices.
Furthermore, it also minimizes the problems of considering Crunchbase data as it is when downloaded rather than as it was at the beginning of the simulation. Some Crunchbase variables are not dated and may not reflect the situation of the company at the time the simulation started (e.g., the number of employees, the managers of the company, etc.). While we try to avoid such variables, there is a more subtle bias that is present and that we will mention in Section V. All these aspects help make the results of our experiment more reliable.
The next sections describe how the training and test sets are created according to our temporal constraints, which variables are employed by the decision models, and how these models are evaluated.
A. DATA SELECTION
Our data sample is extracted from a Daily CSV Export of
Crunchbase from August 2018. The full dataset contains
623,232 companies, information about 799,446 company
founders and 227,172 tuples about funding round events
(mainly from the last 20 years).
We define a Simulation window of 3 years (t_s = August 2015 and t_f = August 2018) and a Warmup window of 4 years (t_c = August 2011). These windows mean that an early-stage company is not older than 4 years at the time of the investment and that we expect it to raise new funds no more than 3 years after the prediction (or the VC investment) is made. These windows are considered adequate for a startup given the high failure rate in the early years; for example, the failure rate at four years was about 44 percent in the US.³
According to these windows, we filtered companies using their creation date and removed those that were acquired, were closed, or went public via an IPO during the Warmup window, as well as the companies that closed a funding round above Series C, which are not of interest for early-stage VC investments.
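The selection rule above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `Company` record, its field names, and the `STAGE_ORDER` mapping are hypothetical, the day of the month used for each window boundary is an assumption (the text only gives month and year), and the stage cutoff follows the "earlier than Series C" definition given earlier in this section.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Window boundaries from the text; the day of the month is an assumption.
T_C = date(2011, 8, 1)  # start of the Warmup window
T_S = date(2015, 8, 1)  # start of the Simulation window
T_F = date(2018, 8, 1)  # end of the Simulation window

# Hypothetical ordering of funding stages; None means no round so far.
STAGE_ORDER = {None: 0, "seed": 1, "series_a": 2, "series_b": 3,
               "series_c": 4, "series_d": 5, "series_e": 6}


@dataclass
class Company:
    name: str
    founded_on: date
    # Date of the first acquisition, closure, or IPO, if any (hypothetical field).
    exit_on: Optional[date] = None
    # Latest funding stage reached before T_S (hypothetical field).
    last_stage_before_ts: Optional[str] = None


def is_early_stage(c: Company) -> bool:
    """Return True if the company would be kept in the data sample."""
    created_in_warmup = T_C <= c.founded_on <= T_S
    exited_before_ts = c.exit_on is not None and c.exit_on < T_S
    below_series_c = (STAGE_ORDER[c.last_stage_before_ts]
                      < STAGE_ORDER["series_c"])
    return created_in_warmup and not exited_before_ts and below_series_c
```

A company that exits only after t_s is kept, because that exit is exactly the kind of event the Simulation window is meant to capture.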
The final data sample consists of 120,507 companies with
the company name, sector, country, age, and other additional
information, and 34,180 funding round events about the
selected companies.
B. TARGET VARIABLE
The ultimate goal of a VC investment is to invest in companies that will be acquired or go public via an IPO. However, these companies are rare due to the natural selection inherent in the venture capital process [15]. A VC investor may also look for companies that advance toward new injections of capital and hopefully larger outcomes. Accordingly, we have defined a multi-class target variable whose value is extracted from the events that occurred during the Simulation window. Using only the first event that occurred for a company, we define the following classes:
³ https://smallbiztrends.com/2019/03/startup-statistics-small-business.html
Class                 Frequency    Ratio
CLOSED (CL)                 686    0.57%
ACQUIRED (AC)             3,293    2.73%
FUNDING ROUND (FR)       21,682   17.99%
IPO (IP)                    143    0.12%
NO EVENT (NE)            94,703   78.58%
TABLE 1. Class value distribution of the success measure.
• ACQUIRED (AC): The company is acquired during the simulation window.
• FUNDING ROUND (FR): The company reaches at least another round of funding in the simulation window.
• IPO (IP): The company goes public via an IPO during the simulation window.
• CLOSED (CL): The company is closed during the simulation window. This class is overridden by ACQUIRED when the closed and acquired events occur within a short period of each other (indicating that the company was successfully acquired and then closed by the acquirer).
• NO EVENT (NE): None of the previous events occurred during the simulation window.
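The class definitions above can be sketched as a small labelling function. This is an illustrative sketch under stated assumptions: the event representation is hypothetical, the day of the month for the window boundaries is assumed, and the 90-day `OVERRIDE_GAP` is a made-up tolerance, since the text does not say how short the "short period" between closure and acquisition is.

```python
from datetime import date, timedelta
from typing import List, Tuple

T_S = date(2015, 8, 1)  # Simulation start (day of month is an assumption)
T_F = date(2018, 8, 1)  # Simulation end (day of month is an assumption)

# Hypothetical tolerance for "closed and acquired within a short period";
# the actual threshold is not given in the text.
OVERRIDE_GAP = timedelta(days=90)

# (event date, class code), with class code in {"AC", "FR", "IP", "CL"}.
Event = Tuple[date, str]


def target_class(events: List[Event]) -> str:
    """Return the class of the first event inside the Simulation window."""
    in_window = sorted(e for e in events if T_S <= e[0] <= T_F)
    if not in_window:
        return "NE"  # no event reported during the window
    first_date, first_kind = in_window[0]
    if first_kind == "CL":
        # ACQUIRED overrides CLOSED when an acquisition follows closely.
        for when, kind in in_window:
            if kind == "AC" and when - first_date <= OVERRIDE_GAP:
                return "AC"
    return first_kind
```

Events before t_s are ignored here because they belong to the Warmup window and feed the predictors rather than the target.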
In a perfectly rational environment, it can be argued that subsequent funding may be a proxy for non-success, since it indicates that a company is burning money and needs a new capital injection. In the industry, though, this is not the case, because VCs urge companies to spend money quickly to grow faster, so as to separate the ones that can return the money in a reasonable time frame from the ones that are dead ends.
In the same fashion, an early closure of a venture can be seen as saving money and energy, from both the entrepreneur's perspective and the VC's side. Again, even though this makes perfect sense in a theoretical model setting, it very rarely occurs in practice. In fact, it is hard to disentangle the reasons for product failure, and entrepreneurs tend to stick to their product as long as possible. Furthermore, they prefer to use every penny at their disposal to improve the product, launch a new one, or pivot the company to win the market rather than opting for a preemptive closure. There is no stigma in failing, and even VCs often prefer to try every option before considering the company a failure. Their solution is not, in fact, to withdraw the money, but rather to stop investing in companies that are not profitable.
The value distribution of the target variable in our data sample is shown in Table 1. While a priori only the CL class represents a failure, the percentage of early-stage companies closed after three years is unreasonably low (less than 3%) given the low startup survival rate. We believe that many NE class companies may indeed be closed, but the Crunchbase database does not reflect it, as closure generally happens without any official announcement or regulatory filing. Furthermore, from a VC investor's point of view, a startup that does not show any progress after 3 years (the length of our simulation window) is not a good investment. Hence, we decided to consider that the NE class denotes a failed investment.
As a result, we consider three "success" classes (AC, FR, and IP) that represent roughly 21% of the companies in our sample, and two "failure" classes (CL and NE) that represent the remaining 79%.
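As a quick arithmetic check of the shares quoted above, the class frequencies of Table 1 can be grouped into the success and failure categories. The snippet is only a sketch; the variable and function names are illustrative.

```python
# Class frequencies copied from Table 1.
CLASS_COUNTS = {"CL": 686, "AC": 3293, "FR": 21682, "IP": 143, "NE": 94703}

# Grouping described in the text: AC, FR and IP count as success.
SUCCESS_CLASSES = {"AC", "FR", "IP"}


def is_success(label: str) -> bool:
    """Map a five-class label onto the binary success/failure view."""
    return label in SUCCESS_CLASSES


total = sum(CLASS_COUNTS.values())
success = sum(n for label, n in CLASS_COUNTS.items() if is_success(label))
failure = total - success

print(total)                          # 120507
print(round(100 * success / total))   # 21
print(round(100 * failure / total))   # 79
```

The total matches the 120,507 companies of the final data sample, which confirms that the five classes partition it.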
C. PREDICTOR VARIABLES
After a pre-processing stage for removing unnecessary variables, computing new synthetic variables and cleaning missing values, we compiled a set of 105 variables for each company in the data sample. Most of them were extracted from the events that occurred during the Warmup window, and some variables that may have changed during the simulation window (like the number of employees) were omitted.
Although some of the predictors were selected based on previous research on predicting early-stage company performance (like variables concerning founders' education [2] or company location [7], among others), we included other variables available in Crunchbase.
The set of predictors can be divided into the following
categories.
1) Company information
This category comprises general information about the companies, like location (we considered that country_code provided an acceptable granularity level and a homogeneous set of values) and the business sectors where the company operates, from a list of 46 categories predefined by Crunchbase.
Additionally, we included the company age in months (age_months) at the beginning of the simulation (t_s) and we added variables that measure the presence of the company in social media networks. These binary variables indicate whether the company registered in Crunchbase its contact information (has_email and has_phone) or a Facebook (has_facebook_url), Twitter (has_twitter_url) or LinkedIn (has_linkedin_url) account.
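A minimal sketch of how these company-level variables could be derived is shown below. The output names (age_months and the has_* flags) come from the text; the input record layout, the raw field names, the whole-month rounding, and the day of the month used for t_s are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, Optional

T_S = date(2015, 8, 1)  # Simulation start (day of month is an assumption)

# Raw fields assumed to exist in the company record (hypothetical names).
CONTACT_FIELDS = ("email", "phone", "facebook_url", "twitter_url",
                  "linkedin_url")


def age_months(founded_on: date, at: date = T_S) -> int:
    """Whole months elapsed between the founding date and `at`."""
    months = (at.year - founded_on.year) * 12 + (at.month - founded_on.month)
    if at.day < founded_on.day:
        months -= 1  # the current month is not complete yet
    return months


def company_features(record: Dict[str, Optional[str]],
                     founded_on: date) -> Dict[str, object]:
    """Build the company-information predictors for one company."""
    features: Dict[str, object] = {
        "country_code": record.get("country_code"),
        "age_months": age_months(founded_on),
    }
    for field in CONTACT_FIELDS:
        # 1 if the field is present and non-empty, 0 otherwise.
        features[f"has_{field}"] = int(bool(record.get(field)))
    return features
```

For example, a company founded in August 2011, at the very start of the Warmup window, gets an age of 48 months, the maximum allowed by the 4-year window.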
2) Funding information
The variables in this category summarize the funding events that occurred in a company during the Warmup window. The availability of a time series of funding round events in Crunchbase allowed us to synthesize information about:
• Number of funding rounds that the company achieved before t_s (round_count) and the total amount raised in those rounds (raised_amount_usd).
• Data about the last funding round in the Warmup window: funding round type (last_round_investment_type), amount raised (last_round_raised_amount_usd), company valuation after this funding round (last_round_post_money_valuation) and time elapsed, in months, between this funding round and the beginning of the simulation (t_s) (last_round_timelapse_months).
• Number of (unique) investors who participated in the
  funding rounds during the Warmup window (investor_count)
  and, specifically, in the last funding round
  (last_round_investor_count).
• Qualitative information about the investors. We defined
  a category of renowned investors as the investment
  companies registered in Crunchbase and created variables
  for the number of unique renowned investors who
  participated in the funding rounds during the Warmup
  window (known_investor_count) and, specifically, in the
  last funding round (last_round_known_investor_count).
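As a rough illustration of how the funding variables listed above could be derived from a series of funding round events, consider the sketch below. The record layout (`FundingRound`), the helper names and the rule "rounds announced before ts" are assumptions made for this example; they do not reproduce the Crunchbase schema or the exact Warmup window used in this work.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FundingRound:
    # Hypothetical record layout, not the Crunchbase schema.
    announced_on: date
    investment_type: str
    raised_amount_usd: float
    investors: frozenset


def months_between(earlier, later):
    """Whole calendar months elapsed between two dates."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def funding_features(rounds, ts):
    """Aggregate the rounds announced before ts into funding variables."""
    past = sorted((r for r in rounds if r.announced_on < ts),
                  key=lambda r: r.announced_on)
    if not past:
        # No known round: counts and amounts default to zero.
        return {"round_count": 0, "raised_amount_usd": 0.0, "investor_count": 0}
    last = past[-1]
    return {
        "round_count": len(past),
        "raised_amount_usd": sum(r.raised_amount_usd for r in past),
        "investor_count": len(set().union(*(r.investors for r in past))),
        "last_round_investment_type": last.investment_type,
        "last_round_raised_amount_usd": last.raised_amount_usd,
        "last_round_investor_count": len(last.investors),
        "last_round_timelapse_months": months_between(last.announced_on, ts),
    }


# Toy example: the third round happens after ts and must be ignored.
rounds = [
    FundingRound(date(2014, 3, 10), "seed", 500_000.0, frozenset({"A", "B"})),
    FundingRound(date(2015, 2, 1), "series_a", 2_000_000.0, frozenset({"B", "C"})),
    FundingRound(date(2016, 1, 1), "series_b", 5_000_000.0, frozenset({"D"})),
]
features = funding_features(rounds, ts=date(2015, 8, 1))
print(features)
```

Counting unique investors through a set union avoids double-counting an investor who participates in several rounds, which is what the "(unique)" qualifier above asks for.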
3) Founders information
The last category comprises information about the people
who founded a company. In addition to the number of
founders (founders_count), we synthesized new variables
that describe the heterogeneity of the founders according
to their origin, i.e., the number of different countries
the founders come from (founders_dif_country_count), and
their gender, i.e., the number of male
(founders_male_count) and female (founders_female_count)
founders.
Previous studies highlighted the importance of having a
college education when building a new company [2], [3].
Crunchbase provides information about the education of
most of the founders. However, the way this information is
stored (as free-form text) hinders the synthesis of
qualitative variables about the education received by
company founders. After reviewing the information contained
in Crunchbase and observing that most of the education
entries refer to higher education, we decided to synthesize
quantitative variables about the total number of degrees
obtained by company founders (founders_degree_count_total),
as well as the maximum (founders_degree_count_max) and the
average number of degrees (founders_degree_count_mean)
among them.
The sparsity problem is also evident in this category, as
most of the companies do not have information about their
founders or their education. In this case, we treat the
absence of data as useful information and set the
corresponding variables to 0. A zero means that the company
has not updated the information about its founders in
Crunchbase.
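The founder variables and the zero-filling convention described above can be sketched as follows. The input format (a list of dictionaries with the keys `country`, `gender` and `degrees`) is a hypothetical layout chosen for the example and is not the structure of the Crunchbase export.

```python
def founder_features(founders):
    """Build founder variables; every variable is 0 when nothing is known.

    founders: list of dicts with hypothetical keys 'country', 'gender'
    and 'degrees' (a list of free-form education entries).
    """
    n = len(founders)
    degree_counts = [len(f.get("degrees", [])) for f in founders]
    countries = {f["country"] for f in founders if f.get("country")}
    return {
        "founders_count": n,
        "founders_dif_country_count": len(countries),
        "founders_male_count": sum(f.get("gender") == "male" for f in founders),
        "founders_female_count": sum(f.get("gender") == "female" for f in founders),
        "founders_degree_count_total": sum(degree_counts),
        "founders_degree_count_max": max(degree_counts, default=0),
        "founders_degree_count_mean": sum(degree_counts) / n if n else 0.0,
    }


team = [
    {"country": "ESP", "gender": "female", "degrees": ["BSc", "PhD"]},
    {"country": "ITA", "gender": "male", "degrees": ["MBA"]},
]
known = founder_features(team)
unknown = founder_features([])  # company without founder information
print(known)
print(unknown)
```

With this convention a company without founder data yields a row of zeros, so the classifier can use the absence of information as a signal instead of discarding the company.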
D. MODELS AND ALGORITHMS
We have considered five different machine learning
classifier algorithms (see footnote 4): Support Vector
Machines (SVM), Decision Trees (DT), Random Forests (RF),
Extremely Randomized Trees (ERT) and Gradient Tree
Boosting (GTB).
Decision Trees, Random Forests, Extremely Randomized
Trees and Gradient Tree Boosting are tree-based classifiers.
We considered tree-based classifiers for the following rea-
sons:
• They incorporate feature selection, so they are able to
  cope with a high number of variables of presumably very
  different importance for the classification problem, as
  is the case here.
• They are white-box classifiers that can be interpreted,
  since it is possible to measure the relevance of the
  features used for the classification. This is
  particularly useful because we are interested in
  understanding the classification decision and in
  identifying success drivers.
Footnote 4: Algorithms implemented in Scikit-learn. http://scikit-learn.org/

Classifier                     Accuracy (%)
Decision Trees                 74.6
Random Forests                 81.8
Extremely Randomized Trees     81.9
Gradient Tree Boosting         82.2
Support Vector Machines        81.7
TABLE 2. Global accuracy of the classifiers in percentage
We include decision trees as a baseline and because they
have been used successfully in the literature [10]. However,
decision trees typically have limited success compared with
more sophisticated classifiers, such as the tree-based
ensemble classifiers considered (RF, ERT, and GTB). An
ensemble classifier is known to be more accurate than any of
its members if the classifiers in the ensemble are accurate
and diverse [16].
As a complement to the tree-based classifiers, we use a
classification method based on a different paradigm: Support
Vector Machines (SVM). This method has already shown
good results in plenty of fields, including VC investing [13],
[14]. Although SVMs were originally designed for binary
classification, they can be extended to multi-class
classification using a one-vs-one scheme. The method is also
effective in high-dimensional spaces such as the one we face.
As a disadvantage, it is important to mention the lack of
transparency of its results, contrary to what happens with
tree-based classifiers. Another disadvantage may be its
computational time when using non-linear kernels, so we use
a linear version of the method.
To evaluate the classifiers we use stratified k-fold
cross-validation with k = 5. The validation is stratified to
preserve the proportion of companies of each class in each
fold. This is important because the data is imbalanced. We
also use the same dataset partition (i.e., the same folds)
for training all the classifiers. In this way, we eliminate
the impact of different partitions when comparing their
performance. Naturally, we report the performance of the
classifiers on the validation set (aggregating the k
validation subsamples).
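The evaluation protocol just described can be sketched with Scikit-learn as follows. The synthetic dataset, the hyperparameters, the shuffling seed and the use of `SVC(kernel="linear")` for the linear one-vs-one SVM are assumptions made for illustration; the configuration used in our experiments is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# Build the stratified folds once so that every classifier is trained and
# validated on exactly the same partition.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ERT": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "GTB": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # SVC handles multi-class problems with a one-vs-one scheme.
    "SVM": SVC(kernel="linear"),
}

accuracy = {}
for name, clf in classifiers.items():
    # Out-of-fold predictions: each sample is predicted by the model that
    # was trained on the other four folds, then all folds are aggregated.
    y_pred = cross_val_predict(clf, X, y, cv=folds)
    accuracy[name] = accuracy_score(y, y_pred)

for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Materializing the folds as a list and passing that list to every call is what guarantees that differences in accuracy come from the classifiers and not from the partition.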
IV. RESULTS
Table 2 shows the global accuracy of our classifiers in the
validation subsamples. Unsurprisingly, the classifier that
performed worst was the decision tree, with 74.6%. In
principle, this classifier performs worse than a naive
classifier that always predicts the majority class (i.e.,
“no event”). Such a classifier would have an accuracy of
78.6%, because this value is the frequency of the “no event”
class in the dataset (see Table
1). However, decision tree predictions are more informative
as we will see below when analyzing the performance for
each class in the target variable. The rest of the classifiers
considered performed better than the naive classifier and the
GTB classifier was the best, followed closely by the other
methods considered.
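The naive reference used above is simply the relative frequency of the majority class. A minimal sketch, with a toy label list whose proportions are chosen to mimic the 78.6% share of “no event” companies (the individual class counts below are invented for the example):

```python
from collections import Counter


def majority_baseline(labels):
    """Return the majority class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Toy labels: 1,000 companies, 786 of them "no event" (NE).
labels = ["NE"] * 786 + ["FR"] * 180 + ["AC"] * 27 + ["CL"] * 6 + ["IP"] * 1
baseline_class, baseline_accuracy = majority_baseline(labels)
print(baseline_class, round(baseline_accuracy, 3))
```

Any classifier whose global accuracy falls below this value, as the decision tree does, is only worth keeping if its per-class behaviour is more informative than the constant prediction.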
Global accuracy is a general performance indicator, but
a VC is especially interested in those classes that reflect
a successful scenario (FR, AC, and IP). If the classifier
performs well in flagging companies that belong to those
classes, the VC could become more confident in betting on
those companies. As a result, we analyze below the classifier
performance for each class considered in the target variable.
A. ANALYSIS OF THE RESULTS FOR EACH CLASS
Table 3 shows the performance of the considered machine
learning algorithms for each class in our target variable. As
in an information retrieval problem, the metrics provided
will be precision, recall and F1 score for every class. For a
given class, high precision means that an algorithm returned
substantially more relevant instances than irrelevant ones,
while high recall means that an algorithm returned most of
the relevant instances. The F1 score is the harmonic mean of
the precision and recall.
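For reference, the three per-class metrics can be computed from the true and predicted labels as in the sketch below. The function name and the toy labels are illustrative; returning `None` when a classifier never predicts a class is a convention of this example, chosen to mirror the “-” entries that appear in the results tables.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 score of `label` in a multi-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    # Precision is undefined when the classifier never predicts the class.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None:
        f1 = None
    elif precision + recall == 0:
        f1 = 0.0
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


y_true = ["FR", "FR", "NE", "NE", "AC"]
y_pred = ["FR", "NE", "NE", "FR", "NE"]
fr_metrics = per_class_metrics(y_true, y_pred, "FR")
ne_metrics = per_class_metrics(y_true, y_pred, "NE")
ac_metrics = per_class_metrics(y_true, y_pred, "AC")  # never predicted
print(fr_metrics, ne_metrics, ac_metrics)
```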
In this problem, recall is not a critical measure. VCs do
not need to find all the “interesting” companies available in
the world (or in the database) since they cannot invest or
even consider investing in all of them. However, a minimum
recall is important because the classifier should provide the
VC with a sufficiently large subset of relevant companies to
invest in. In this sense, we suggest going beyond the percent-
age value of recall and considering the number of companies
actually retrieved by the classifier for the considered class.
On the other hand, a VC investor is primarily interested in
increasing the success rate when deciding which company to
invest in. Hence, we need to focus on the precision of the
profitable classes (FR, AC, and IP). Thus, precision will
serve us for establishing the main comparisons among
classifiers, while recall will be used to nuance our analysis.
The F1 score is merely added as a complementary measure for
comparisons, but will not inform our analysis.
Looking at the results for the “closed” companies in Table
3, no classifier performed well. In the case of SVM, the
classifier did not even activate for this class, while in the
rest of the classifiers the activations were false positives
(except for a few true positives in the case of decision
trees). This could be due to the small number of companies
labeled as “closed” in the dataset, but the figures in Table
3 show that this does not apply to the “IPO” companies,
which are even fewer (143 versus 686). Thus, we believe that
the real problem is that the “no event” class includes
companies that are closed, but whose closure has not been
updated in Crunchbase, as we already anticipated in Section
III-B. This fact makes it very difficult for the classifier
to really discriminate between those two classes.
Regarding the performance for the “no event” class, we
do not notice relevant differences in the performance of
the classifiers in terms of precision (all between 0.83 and
0.85). Remarkably, all of them outperform a naive classifier
always predicting the “no event” class (precision of 0.79),
and therefore an investor could discard companies flagged
as “no event” using any of our classifiers. From a practical
perspective, we would choose the classifier with the highest
recall because it will activate more often. Hence, we set
aside the decision trees and regard the rest of the
classifiers as equally good, since they are able to identify
around 95% of the “no event” companies. Bear in mind that,
as mentioned above, we consider the “no event” class as not
interesting for a VC investor, since it probably includes
companies that are not attractive (i.e., not in the startup
cycle) or closed ones whose closure was not reflected in
Crunchbase.
The differences in terms of classifier performance increase
for the “funding round” class. The decision tree has the
worst precision (0.43), while the precision for the rest of
the classifiers varies between 0.6 and 0.64. The highest
precision is offered by SVM and GTB. Since they are very
similar, we could prefer GTB because it is the classifier
with the highest recall of the two (0.4 versus 0.33).
However, in this case, recall is not highly relevant because
the number of “funding round” companies in our dataset is
21,682, and even 33% of them still represents an incredibly
large number of potential investments to analyze for a
single VC.
In the “acquired” class the precision decreases notably.
Random Forest and Extremely Randomized Trees are the
best algorithms, with 0.33 and 0.31, respectively, and the
same recall. While the performance is apparently low, the
“acquired” class accounts for 2.7% of the population and
never exceeds 14% in the VC funnel of US companies [15].
Considering these facts, these classifiers increase by more
than 10 times the chances of finding a company that will be
acquired in the sample, which would represent an
outstanding performance for a VC investor. While the recall
is low in both cases (0.03), in absolute numbers each of
these classifiers activated close to 300 times and got it
right around 90 times. Again, the absolute numbers are
satisfactory for the investment potential of many if not
most VC investors.
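The arithmetic behind the “more than 10 times” statement can be made explicit. The sketch below only uses figures quoted in the text (precision of 0.33, base rate of 2.7%, funnel upper bound of 14% and roughly 90 correct activations); the variable names are illustrative.

```python
precision_rf_acquired = 0.33  # Random Forest precision for "acquired"
base_rate_sample = 0.027      # share of "acquired" companies in the dataset
base_rate_funnel = 0.14       # upper bound reported for the US VC funnel [15]

# Improvement over picking companies at random from the sample.
lift_vs_sample = precision_rf_acquired / base_rate_sample
# Improvement over the most favourable stage of the VC funnel.
lift_vs_funnel = precision_rf_acquired / base_rate_funnel

# Activations implied by about 90 correct predictions at this precision.
approx_true_positives = 90
implied_activations = approx_true_positives / precision_rf_acquired

print(f"lift vs. sample: {lift_vs_sample:.1f}x")
print(f"lift vs. funnel: {lift_vs_funnel:.1f}x")
print(f"implied activations: {implied_activations:.0f}")
```

The ratio is about 12 when the reference is the sample base rate and about 2.4 when the reference is the 14% funnel figure, so the tenfold improvement refers to the former. The implied number of activations (about 273) is consistent with the “close to 300” mentioned above.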
Finally, we analyze the results in the “IPO” class, which
is the least frequent event in our dataset, with only 143
companies. As in the “acquired” class, Random Forest and
Extremely Randomized Trees are the best algorithms, with
a precision of 0.44 and 0.27, respectively. However, their
success rates are 4/9 and 3/11, respectively, and the
difference may be due to mere chance. Moreover, the number
of activations for all the algorithms is very low for this
class, which means that it is a difficult class to
discriminate. This may be due to similarities between the
companies within the “IPO” class and the other classes that
represent success, but also to the fact that the “IPO” class
is algorithmically neglected because of its low frequency.
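One way to check whether success rates of 4/9 and 3/11 can be told apart is a two-sided Fisher exact test on the corresponding 2x2 table. This test is not part of the analysis reported here; it is added only as a rough sanity check, and it treats the two sets of activations as independent samples, which is an approximation because both classifiers were evaluated on the same companies.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k successes in the first row.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Random Forest: 4 hits and 5 misses; Extremely Randomized Trees: 3 hits and 8 misses.
p_value = fisher_exact_two_sided(4, 5, 3, 8)
print(f"p-value: {p_value:.3f}")
```

The resulting p-value is far above any usual significance level, which supports reading the gap between 0.44 and 0.27 as compatible with chance.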
     CL                  NE                  FR                  AC                  IP
     Prec. Rec.  F1      Prec. Rec.  F1      Prec. Rec.  F1      Prec. Rec.  F1      Prec. Rec.  F1
DT   0.02  0.02  0.02    0.85  0.85  0.85    0.43  0.43  0.43    0.09  0.10  0.10    0.04  0.03  0.04
RF   0.00  0.00  0.00    0.85  0.94  0.89    0.60  0.44  0.51    0.33  0.03  0.05    0.44  0.03  0.05
ERT  0.00  0.00  0.00    0.85  0.94  0.89    0.61  0.43  0.51    0.31  0.03  0.05    0.27  0.02  0.04
GTB  0.00  0.00  0.00    0.85  0.95  0.90    0.64  0.40  0.49    0.17  0.003 0.01    0.07  0.01  0.02
SVM  -     -     -       0.83  0.96  0.89    0.64  0.33  0.44    0.00  0.00  0.00    -     -     -
TABLE 3. Precision, recall and F1 score for each class (columns: CL = closed, NE = no event, FR = funding round, AC = acquired, IP = IPO) and each machine learning algorithm (rows).
In summary, the main conclusions of the analysis are:
• The classifiers are of no use for predicting the “closed”
  companies, and of little practical use for the “IPO” ones.
• For the “no event” and “funding round” classes, ensemble
  classifiers provide the best results, with a slight
  preference for Gradient Tree Boosting.
• Random Forests and Extremely Randomized Trees are the
  best classifiers for the “acquired” class, even if their
  recall is low.
• SVM was not able to generalize the features of the
  classes with few instances (CL, AC, and IP), and its
  performance in the rest of the classes was not the best
  one. Hence, we consider that it should not be used in
  the present approach.
We should proceed with caution to select the best model
among the ensemble classifiers. In fact, if we look at global
performance in the majority classes (NE and FR), we would
choose GTB. However, if we also consider the precision of
the classes that represent acquisition, we would choose RF,
as it offers a good balance of precision in the FR and AC
classes.
B. REINTERPRETATION OF THE CLASSIFICATION
ERROR FOR VC INVESTORS
In this section, we turn our attention to classification errors.
In a typical classification problem, errors are undesirable.
However, in this domain, some errors can cause no harm to
the portfolio of a VC investor or can even turn into pleasant
surprises. For example, if a classifier predicts that a given
company will be “acquired”, but it turns out that the company
just obtains a “funding round”, the error is not critical,
because it means that the company keeps growing. Conversely,
finding that a company the classifier predicted would obtain
a “funding round” is instead “acquired” is again a successful
event, because the investor exits the investment with a
profit.
Following this idea, we reinterpret the classification
error, collapsing our multi-class target variable into a
binary variable with two classes: “good” and “bad”. The
“good” class includes the classes “funding round”,
“acquired”, and “IPO”, while the “bad” class includes the
“no event” and “closed” classes. Below we show the
“binarized” classification error of the algorithms
considered.
In terms of global accuracy, the decision tree classifier is
the worst with 0.77, while the rest of the classifiers obtain a
similar result, i.e., 0.83.
In Table 4 we show the “extended” precision values for
each class, that is, we consider that an error only happens
if you predict one of the “good” classes and you get one of
the “bad” ones or vice versa. We also include the number of
“extended” true positives to better contextualize the precision
value. By definition, all precision values must be (and in fact
are) equal to or better than the ones shown in Table 3.
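The “extended” precision just defined can be written down in a few lines. The sketch below uses the class abbreviations of Table 3 and a toy set of predictions; the function names are illustrative and the numbers it prints are not results of this study.

```python
GOOD = {"FR", "AC", "IP"}  # funding round, acquired, IPO; NE and CL are "bad"


def to_binary(label):
    """Collapse a multi-class label into the "good" / "bad" categories."""
    return "good" if label in GOOD else "bad"


def precision(y_true, y_pred, label, extended=False):
    """Return (precision, true positives) for the companies predicted as `label`.

    With extended=True a prediction counts as correct whenever the true
    class falls in the same "good" / "bad" category as the predicted one.
    """
    true_of_predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    if not true_of_predicted:
        return None, 0  # the classifier never activated for this class
    if extended:
        hits = sum(to_binary(t) == to_binary(label) for t in true_of_predicted)
    else:
        hits = sum(t == label for t in true_of_predicted)
    return hits / len(true_of_predicted), hits


y_true = ["FR", "AC", "NE", "FR", "CL", "NE", "IP", "NE"]
y_pred = ["AC", "AC", "AC", "FR", "NE", "NE", "FR", "FR"]
strict_ac = precision(y_true, y_pred, "AC")
extended_ac = precision(y_true, y_pred, "AC", extended=True)
extended_ne = precision(y_true, y_pred, "NE", extended=True)
print(strict_ac, extended_ac, extended_ne)
```

Because every strictly correct prediction is also correct under the relaxed rule, the extended precision can never be lower than the strict one, which is the property stated above.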
The comparisons among methods are similar to those that
could be extracted from the previous table. However, for a
VC investor the chances to do wrong decrease when investing
in companies flagged as “good”.
In Table 5 we aggregate the results of Table 4 to show the
precision of two categories: “good” and “bad”. This table
makes it possible to better compare the global performance
of the classifiers in terms of good and bad investments.
According to the numbers, the precision of GTB and SVM
is superior to that of the other methods for the “good”
category and slightly worse for the “bad” one.
Given the importance of the “good” signal when construct-
ing a portfolio, we consider that GTB and SVM are the
methods that better suit this domain. An investor following
the “good” signals provided by the GTB or SVM would
have a success rate close to 7 out of 10, which is truly
remarkable. Given the better performance of GTB in terms
of recall and precision of the “bad” class, we would finally
recommend using GTB for the portfolio construction using
binary signals.
In experiments not reported here for the sake of brevity,
we trained the classifiers using the binary target variable
with the “good” and “bad” classes. Results were roughly
similar, so there seems to be no advantage in using the
binary target instead of the more nuanced multi-class one.
C. FEATURE IMPORTANCE
In this section, we analyze the feature importance of the
multi-class approach shown in Section IV-A and not the
binary one.
The problem with machine learning classifiers is that they
are usually black-box decision tools. However, decision-
makers need to understand why a decision is suggested or
at least which factors are taken into account.
Among the classifiers used in our work, SVMs are black
boxes, but tree-based classifiers make it possible to
analyze the importance of the different features. Of all of
them, decision trees are the most transparent and
understandable, since the tree itself can be read and
understood. On the other hand, tree-based ensemble
classifiers are more difficult to analyze because they use
multiple trees to classify, but it is still possible to
estimate the importance of the features used for
classification.

                            CL            NE            FR            AC            IP
                            Prec. TP      Prec. TP      Prec. TP      Prec. TP      Prec. TP
Decision Trees              0.74  535     0.86  80638   0.47  10195   0.34  1228    0.39  53
Random Forest               0.33  1       0.86  89397   0.64  10318   0.53  155     0.56  5
Extremely Randomized Trees  0.33  1       0.86  89721   0.65  10039   0.47  128     0.55  6
Gradient Tree Boosting      0.44  24      0.85  91016   0.68  9139    0.55  35      0.41  11
Support Vector Machines     -     0       0.84  91817   0.68  7604    1.00  1       -     0
TABLE 4. Precision values and True Positives (TP) for each machine learning algorithm after reinterpreting the classification error.

Classifier                  Bad    Good
Decision Trees              0.86   0.45
Random Forests              0.86   0.64
Extremely Randomized Trees  0.86   0.64
Gradient Tree Boosting      0.85   0.68
Support Vector Machines     0.84   0.68
TABLE 5. Precision values for each machine learning algorithm after reinterpreting the error and aggregating the original classes into “good” and “bad”.
Figure 2 shows the ten most important features of the
tree-based ensemble classifiers (see footnote 5). In order
to estimate feature relevance, we re-trained the classifiers
with the whole dataset. The figure shows that four of the
variables in the top five are the same for the three
classifiers, and the fifth appears in two of them, namely
age_months, founders_count, has_linkedin_url,
founders_dif_country_count and raised_amount_usd. Given
the frequency of the classes in the dataset and the
performance of the classifiers, these variables most likely
help to discriminate between the “no event” and the
“funding round” classes.
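A ranking of this kind can be obtained from Scikit-learn's tree-based ensembles through their `feature_importances_` attribute, which holds impurity-based importances normalized to sum to one. The sketch below is an assumption about how such a ranking could be produced: the synthetic data, the placeholder feature names and the hyperparameters are invented for the example and do not correspond to our dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

# Placeholder feature names and synthetic data (assumptions).
feature_names = [f"feature_{i}" for i in range(15)]
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "ERT": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "GTB": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

top_features = {}
for name, model in models.items():
    model.fit(X, y)  # re-train on the whole dataset, as described above
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:10]  # ten largest, descending
    top_features[name] = [(feature_names[i], float(importances[i]))
                          for i in order]

for name, ranking in top_features.items():
    print(name, [feature for feature, _ in ranking[:5]])
```

Impurity-based importances tend to favour features with many distinct values, so rankings obtained this way are best read as indicative rather than definitive.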
For example, age_months represents the age of the company
in months. Its relevance makes sense because the older a
startup is, the higher its probability of surviving and
hence of receiving funding or being acquired. Since we
consider companies up to 4 years old, the variable is
related to the maturity of an early-stage company.
The variable has_linkedin_url means that the company has
a LinkedIn profile and that the link appears in Crunchbase.
Since LinkedIn is the most relevant professional networking
website, the presence of the company there and the
appearance of the link in Crunchbase seem to play a relevant
role in signaling potentially attractive companies.
In the case of founders_count, the variable is important and
it signals that some information about the founders exists
in Crunchbase. As mentioned in Section III-C3, if no
information on the founders is available, the respective
value is zero. However, the variable
founders_dif_country_count, which represents the number of
different countries of origin of the founders, provides
different information. It probably speaks of the
internationalization of the company, which is interpreted as
a good sign.
Footnote 5: Decision trees are not shown because they obtained worse results than the ensemble classifiers.
Finally, raised_amount_usd, which is the amount of
money raised in US dollars, is related to the historical
ability of the company to obtain funds. This likely means
that capital is an edge, and having received (or receiving)
funds increases the probability of success.
The rest of the variables are related to the completeness
of the company information in Crunchbase (has_email,
has_phone), to the gender and education of the founders
(founders_male_count, founders_degree_count_mean), or
to the funding rounds (last_round_timelapse_months,
round_count). Interestingly, some business sectors look
more popular and attractive, e.g., Health Care or Science
and Engineering. Some countries of origin also appear to
have a strong relevance. Those countries, unsurprisingly,
represent strong economies, e.g., USA, China, and Sweden.
V. DISCUSSION
We have shown the potential of using machine learning
algorithms to predict different levels of success (and not
only IPOs or acquisitions) of early-stage companies in a
medium-term time window. This approach had not been explored
before in the literature. The results show a global accuracy
of around 82% for the best algorithm, Gradient Tree Boosting.
Looking at the details, most of the algorithms explored
obtain a precision between 64% and 68% when determining that
a company will achieve the next funding round. This result
means that a VC investor can construct a portfolio with less
risk using such classifiers. The best method for flagging
high-gain companies (those that are acquired or go public
through an IPO) is Random Forest, which obtains a precision
of 0.33 and 0.44 for acquisitions and IPOs, respectively. The
result is outstanding given that the percentage of such
companies in the dataset is less than 3%. The recall is low,
but it sums up to over 100 companies between the acquired
ones and the ones that went public, which is more than enough
for most VC firms. In rough terms, a VC firm would need to
invest in 300 companies flagged as acquired or IPO by the
classifier and expect only 100 to turn into high-reward
investments.
These results are certainly promising for VC investors,
who typically do not obtain such success rates. Classifiers
with predictions of different levels of success can be
incorporated in their screening process and their portfolio
construction. However, the results reported might not be
free of potential biases that need to be considered and are
therefore discussed below.

[Figure 2: three horizontal bar charts, one per classifier. The ten features of each panel are listed in the order in which they appear in the figure.]
Left, random forests (importance axis from 0 to 0.2): founders_count, has_linkedin_url, age_months, founders_dif_country_count, raised_amount_usd, investor_count, Health Care, Science and Engineering, founders_degree_count_mean, last_round_timelapse_months.
Center, extremely randomized trees (importance axis from 0 to 0.08): age_months, has_linkedin_url, founders_dif_country_count, founders_male_count, founders_count, has_phone, has_email, Science and Engineering, last_round_timelapse_months, round_count.
Right, gradient tree boosting (importance axis from 0 to 0.1): age_months, has_linkedin_url, founders_count, founders_dif_country_count, raised_amount_usd, last_round_timelapse_months, founders_male_count, last_round_raised_amount_usd, has_phone, has_email.
FIGURE 2. Feature importance for the tree-based ensemble classifiers (left: random forests. center: extremely randomized trees. right: gradient tree boosting)
It is important to bear in mind that the experiment may
be biased, since all the data was retrieved at the end of
the simulation window (tf), whereas our simulation starts
in August 2015 (ts) and we should ideally be using the
information available in Crunchbase at that time. For
example, some of the companies we consider may not have been
in Crunchbase in August 2015 and could have been added
later. It is also reasonable to think that this problem
creates a bias towards successful companies, since those are
more likely to be added to Crunchbase.
A similar and more subtle bias may be present as well. It is
related to the company information available in Crunchbase
at the start of the simulation. It is possible that a company
lacked specific information at that time, such as the profile
of its founders, which proved very relevant for
classification. This information could have been added
later, especially in the case of companies that progressed
and had success to some extent. The Crunchbase data dump
does not record when this information was added, and this
may distort the results.
Finally, it is important to remember that the tool does
not increase the investor’s ability to actually close the
deal; it only augments the ability to process information
and assess companies in the absence of more traditional
financial data. However, we believe that even if these
problems may affect the performance of the classifiers
presented, they should still produce useful advice for VC
investors. The effect of the bias might be evaluated by
simulating the use of the system in real time, which would
take years, given the time window considered.
VI. CONCLUSIONS
This work has shown that machine learning can help
early-stage investors in the baseline screening of potential
investments that have no relevant quantitative data or track
record. Our experiment in a realistic setting demonstrates
that a multi-class machine learning classifier can help to
increase the success rate of an investor.
Clearly, this tool is only one of many in a VC’s toolbox, but
it is very relevant when it comes to quickly skimming through
the thousands of opportunities an investor sees every year.
Being aware of specific features that may signal that a
company will outperform its peers is key to reducing both
risk and uncertainty when investing. It cannot and should not
be the only tool used to evaluate a company, but it is a
natural first step for an investor deciding whether or not
to move a conversation forward.
From the perspective of a venture capitalist, a multi-class
approach such as the one proposed is much more useful than
a binary one, which only gives “good” or “bad” scenarios.
On the one hand, our approach can help to reduce the risk of
the portfolio, as we considered a class that represents
moderate success for a company (proceeding to a subsequent
funding round), which is easy for the classifiers to spot.
On the other hand, since the returns of the best-performing
funds mostly derive from a very small number of companies
that end up producing outsized results, a VC may decide to
focus just on the “acquired” signal, which has a higher risk
but higher potential rewards.
We also showed that some classification errors of the multi-
class approach cannot be considered harmful or dramatic for
the portfolio. The nuanced information from the multi-class
approach can be used to set up a detailed portfolio strategy.
Moreover, the classifier’s information can be combined with
“fundamental” analysis on the company activity, the sector
where it operates, the profile of the founders and managers,
etc.
Furthermore, the work presented here can be refined and
some of its limitations could be addressed. Besides those
related to training the classifiers, such as hyperparameter
optimization or sampling strategies for imbalanced classes,
new variables could be considered. For example, the
information about the founders has proven to be relevant in
our approach, even in the form of rough numerical variables
with high sparsity, and its relevance is documented in the
literature [5], [10], [13]. Since LinkedIn provides data on
founders and managers in a structured way, more informative
variables describing their professional experience or
academic background could be included in our approach, which
could lead to better performance. Such variables could, for
instance, capture the founding team’s diversity (the various
types of degrees instead of the mere number of them).
Additionally, specific classifiers for different countries
and/or business sectors could be trained. These classifiers
could outperform the ones presented and offer additional
insights and signals to spot interesting companies for each
country or sector.
The implications of this first work for practitioners are
potentially huge. We believe that a more refined tool could
change the entire dynamics of the industry. The baseline
assumption any fund makes is that most of its investments
will be a loss and will be written off the books, and that
the entire return will be driven by no more than 10% of the
investments made. The risk-return profile is thus very
unstable, which forces investors to only pursue
opportunities that might turn into moonshots and reject
those investments that, although good, cannot achieve at
least a 10x return. However, if there were a way to better
predict a company’s success, an investor could invest in a
portfolio of companies all producing a 2x return rather than
looking for a single unicorn, eventually achieving a higher
return than what most funds do nowadays. This would also
drive valuations down and potentially increase innovation
and entrepreneurial activity, and therefore should be widely
embraced by policy-makers as well.
REFERENCES
[1] R. N. Lussier and S. Pfeifer, “A Crossnational Prediction Model for
Business Success,” Journal of Small Business Management, vol. 39, pp.
228–239, 2001.
[2] C. E. Halabí and R. N. Lussier, “A model for predicting small firm
performance,” Journal of Small Business and Enterprise Development,
vol. 21, no. 1, pp. 4–25, Feb. 2014.
[3] R. N. Lussier and C. E. Halabi, “A three-country comparison of the busi-
ness success versus failure prediction model,” Journal of Small Business
Management, vol. 48, pp. 360–377, 2010.
[4] R. Nahata, “Venture capital reputation and investment performance,” Jour-
nal of Financial Economics, vol. 90, no. 2, pp. 127–151, 2008.
[5] H. Littunen and H. Niittykangas, “The rapid growth of young firms
during various stages of entrepreneurship,” Journal of Small Business and
Enterprise Development, vol. 17, no. 1, pp. 8–31, 2010.
[6] P. Holmes, A. Hunt, and I. Stone, “An analysis of new firm survival using
a hazard function,” Applied Economics, vol. 42, no. 2, pp. 185–195, 2010.
[7] S. Hoenen, “Do Patents Increase Venture Capital Investments between
Rounds of Financing,” Master’s thesis, Wageningen University and Re-
search Center, the Netherlands, 2012.
[8] S. Ragothaman, B. Naik, and K. Ramakrishnan, “Predicting Corporate Ac-
quisitions: An Application of Uncertain Reasoning Using Rule Induction,”
Information Systems Frontiers, vol. 5, no. 4, pp. 401–412, dec 2003.
[9] C.-P. Wei, Y.-S. Jiang, and C.-S. Yang, “Patent analysis for supporting
merger and acquisition (m&a) prediction: A data mining approach,” in De-
signing E-Business Systems. Markets, Services, and Networks, C. Wein-
hardt, S. Luckner, and J. Stößer, Eds. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2009, pp. 187–200.
[10] B. Yankov, P. Ruskov, and K. Haralampiev, “Models and Tools for Tech-
nology Start-Up Companies Success Analysis,” Economic Alternatives,
no. 3, pp. 15–24, 2014.
[11] M. Böhm, J. Weking, F. Fortunat, S. Müller, and I. Welpe, “The Busi-
ness Model DNA: Towards an Approach for Predicting Business Model
Success,” in Proceedings der 13. Internationalen Tagung Wirtschaftsinfor-
matik, 2017, pp. 1006–1020.
[12] D. J. McKenzie and D. Sansone, “Man vs. Machine in Predicting Success-
ful Entrepreneurs: Evidence from a Business Plan Competition in Nigeria,”
CEPR Discussion Paper No. DP12523, Tech. Rep., 2017.
[13] G. Xiang, Z. Zheng, M. Wen, J. I. Hong, C. P. Rosé, and C. Liu, “A
Supervised Approach to Predict Company Acquisition with Factual and
Topic Features Using Profiles and News Articles on TechCrunch,” ICWSM
2012 - Proceedings of the 6th International AAAI Conference on Weblogs
and Social Media, pp. 2690–2696, 2012.
[14] F. R. d. S. R. Bento, “Predicting start-up success with machine learning,”
Master’s thesis, Universidade Nova do Lisboa, Portugal, 2018.
[15] CB Insights, “Venture Capital funnel shows odds of becoming
a unicorn are about 1%,” https://www.cbinsights.com/research/
venture-capital-funnel-2/, 2018, accessed: 2018-01-11.
[16] L. K. Hansen and P. Solomon, “Neural network ensembles,” IEEE Trans.
Pattern Analysis and Machine Intelligence, no. 12, pp. 903–1002, 1990.
JAVIER ARROYO has been Associate Professor at
Universidad Complutense of Madrid (UCM) since
2013. He received a PhD degree in Computer Science
from Universidad Pontificia Comillas (2008).
He has research experience in time series
forecasting, agent-based simulation, and machine
learning applied to different domains and real-life
problems.
FRANCESCO COREA is Vice President at
Four Trees Merchant Partners and a researcher at
Ca’Foscari University of Venice.
His focus is on venture capital, entrepreneurship,
and artificial intelligence, and he has worked as an
investor and startup advisor for the last few years.
He holds a Ph.D. in Economics from LUISS
University and is a former fellow in Applied Math
at UCLA.
GUILLERMO JIMENEZ-DIAZ is a Computer
Research Scientist and Associate Professor at
Universidad Complutense of Madrid, where he
received his Ph.D. in Computer Science in 2008.
His research concerns Recommender Systems
and their combination with Social Network
Analysis. His main domains of application are
tourism and e-learning, but he is also interested in
Augmented Reality technologies in museums.
JUAN A. RECIO-GARCIA is Head of the Department
of Software Engineering and Artificial Intelligence
at Universidad Complutense of Madrid,
where he obtained a PhD in Computer Science in
2008.
His research has focused on the confluence of
Software Engineering and Case-Based Reasoning
(CBR), developing the COLIBRI platform to build
CBR systems. He is also working in the areas of
context-aware and social Recommender Systems.