To quote this article, please use the following citation:
Francesco Ferrati and Moreno Muffatto (2021), “Entrepreneurial Finance: Emerging Approaches Using Machine
Learning and Big Data”, Foundations and Trends® in Entrepreneurship: Vol. 17: No. 3, pp 232-329.
http://dx.doi.org/10.1561/0300000099
This document is the preprint version of the article.
You can find the published version at this link: http://dx.doi.org/10.1561/0300000099
Published in Foundations and Trends® in Entrepreneurship on 28 April 2021
Entrepreneurial finance: emerging approaches using machine
learning and big data
Francesco Ferrati a,*, Moreno Muffatto b
School of Entrepreneurship (SCENT), Department of Industrial Engineering
University of Padova, Padova, Italy
a francesco.ferrati@unipd.it (*corresponding author)
b moreno.muffatto@unipd.it
Abstract
For equity investors the identification of ventures that most likely will achieve the expected return
on investment is an extremely complex task. To select early-stage companies, venture capitalists
and business angels traditionally rely on a mix of assessment criteria and their own experience.
However, given the high level of risk with new, innovative companies, the number of financially
successful startups within an investment portfolio is generally very low. In this context of
uncertainty, a data-driven approach to investment decision-making can provide more effective
results. Specifically, the application of machine learning techniques can provide equity investors
and scholars in entrepreneurial finance with new insights on patterns common to successful
startups.
This study presents a comprehensive overview of the applications of machine learning algorithms
to the Crunchbase database. We highlight the main research goals that can be addressed and then
we review all the variables and algorithms used for each goal. For each machine learning algorithm,
we analyze the respective performance metrics to identify a baseline model. This study aims to be
a reference for researchers and practitioners on the use of machine learning as an effective tool to
support decision-making processes in equity investments.
Keywords: decision making, startup, investments, venture capital, machine learning, Crunchbase
Acknowledgment: this research was made possible thanks to the support of Crunchbase Inc.
http://www.crunchbase.com
1. Introduction
In recent decades the digital revolution has spread to all sectors of economic and social life.
Increased computational capacity, improved algorithms and the large amounts
of available data have given a new boost to artificial intelligence applications (Agrawal et al.,
2018). Progress has been so significant that we can claim to have entered a "second machine
age", driven by artificial intelligence and big data (Brynjolfsson and McAfee, 2014). The digital
revolution is radically transforming the way people operate and various areas of management are
increasingly benefiting from these profound changes. Intelligent systems are applied in production
and management processes, having a strong impact, for example, within decision making activities.
In this context of change, entrepreneurship is also inevitably being transformed, to the point that
artificial intelligence and big data can be said to have started a new era in entrepreneurship (Obschonka
and Audretsch, 2019). While artificial intelligence has received increasing attention in a variety of
research and application fields, such as economics, economic policy, innovation and management
(Cockburn et al., 2018; George et al., 2014), comparatively little research has been done in
entrepreneurship so far. Interestingly, Obschonka and Audretsch (2019) place emphasis on the co-
evolution and reciprocity of the fields of research and entrepreneurial practice. Smart machines and
algorithms will not only inspire and enhance a new generation of research in this field, but will also
reshape the current real phenomenon of entrepreneurship. It remains to be understood how artificial
intelligence and big data can contribute to a transformation of both research and practice, perhaps
giving rise to “smarter forms of entrepreneurship”.
Being a branch of entrepreneurship, entrepreneurial finance (Cumming and Vismara, 2017) is also
affected by this transformation. Similarly, artificial intelligence and big data can change
entrepreneurial finance research and practice and can lead us into an era of “smarter entrepreneurial
finance”.
In this context, the main stakeholders in artificial intelligence are equity investors, such as venture
capitalists (VCs) and business angels. VCs inject capital into startups in exchange for equity, with
the aim of supporting the venture’s growth and increasing its value in order to make a capital gain
when they sell their equity stake. In general, to achieve an adequate return on investment (ROI),
VCs look for companies that grow exponentially. The high ROI sought by investors is justified by
the high risk of failure associated with technology-driven startups. The average ROI per investment
is generally very low and VCs need to apply portfolio design strategies so that a small number of
companies are able to cover the losses of others and create an overall profit (Metrick and Yasuda,
2010). In order to maximize the chances of generating a profit, VCs should be able to estimate the
investment outcome as accurately as possible. The prediction of future events related to a company,
such as its ability to make an exit (in the form of a merger and acquisition or an initial public
offering) plays an important role in the investment selection process. Prediction is therefore a key
element in the decision-making process as the ability to make reliable predictions can reduce
uncertainty.
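The portfolio logic described above can be illustrated with a small worked example. The figures below are purely hypothetical and are not taken from any of the cited studies; they only show how a single large exit can offset several write-offs while the typical investment still loses money.

```python
# Hypothetical portfolio: ten equal investments of 1.0 (arbitrary units).
# Each value is the ASSUMED multiple returned on the capital invested.
multiples = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 15.0]

invested = float(len(multiples))          # total capital invested
returned = sum(multiples)                 # total proceeds across the portfolio
portfolio_multiple = returned / invested  # overall multiple on invested capital

# Share of companies that returned less than the capital invested in them
losers = sum(1 for m in multiples if m < 1.0) / len(multiples)

# Median outcome of a single investment (ten values: average the middle two)
ordered = sorted(multiples)
median_multiple = (ordered[4] + ordered[5]) / 2

print(portfolio_multiple, losers, median_multiple)
```

In this sketch the portfolio as a whole is profitable even though most of the individual investments are not, which is why an accurate estimate of each investment outcome matters so much.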
The evaluation of investment proposals, especially for early-stage technology-driven startups,
represents a very difficult task. In fact, as in entrepreneurial action (Sarasvathy, 2009),
decision-making in entrepreneurial finance is characterised by a high level of uncertainty and information
asymmetry. Unlike assessments of small and medium-to-large enterprises, the evaluation of
investment opportunities in startup companies cannot be based on economic and financial
considerations, because of the lack of historical financial data (Miloud, Aspelund, and Cabrol,
2012). Traditional evaluation techniques are therefore less suitable for startup companies, given the
specificity of this type of venture (Silva, 2004). To improve the financial performance of an
investment fund, VCs must implement strategies to improve every phase of the investment process,
starting with refining their decision-making process (Zacharakis and Shepherd, 2007). Over the last
30 years, many studies have tried to codify investors’ decision-making into a set of effective criteria
(Ferrati and Muffatto, 2019). The variables considered by investors are many and often with subtle
nuances. It was also found that there is no consensus as to which variables are more relevant, and
that selection methods have so far been based primarily on human judgement and experience, often
on what is normally referred to as "gut feeling".
Sarasvathy pointed out that uncertainty depends, to a large extent, on the absence of reliable and
complete pre-existing data that would help with respect to entrepreneurial tasks (Sarasvathy, 2001).
However, nowadays, large databases are increasingly available, making it possible to apply
artificial intelligence techniques to reduce uncertainty to some extent. Although investors now have a large amount
of data at their disposal, using it to make decisions requires them to analyze a multitude of
variables that one person or a group of people can hardly consider all together. For this
purpose, machine learning can provide essential support.
The focus of the present research is on investment decisions in technology-based companies by
institutional investors. Our goal is to understand how these new tools can bring new perspectives
of analysis so far unexplored in this type of investment. Machine learning is a new tool in
entrepreneurial finance for which there are still few studies available at the moment. The aim of
this study is to analyze the available works in this emerging field and to highlight the possible
questions that machine learning can answer. In fact, the research question that we aim to answer in
this study is: “For what purposes have machine learning tools been applied so far to support equity
investors in their decision-making process?” For this purpose, we analyzed the available works on
the application of machine learning models in investments in technology-based companies. The
product of this research can help researchers to clarify the scope of applications and thus make the
use of these instruments more viable. Since the feasibility of using machine learning techniques is
closely related to the data that is used, in answering the research question, we have selected only the
works related to a specific database, Crunchbase. This work is therefore intended to be an analysis
of the academic state of the art of the contributions that have applied machine learning approaches
using the data provided by Crunchbase.
The choice of Crunchbase as a reference database is due to its growing popularity in research.
Before making this choice, the authors also explored the features of other existing datasets
potentially suitable for the application of machine learning methods in entrepreneurial finance. In
this regard, some existing alternatives focused on startup companies are for example PitchBook,
Dealroom, PrivCo and Venture Scanner. However, as far as the authors know, these companies do
not offer completely free access to their data for research purposes. Crunchbase, in addition to
offering paid access for commercial purposes, is also accessible free of charge for academic research
through a dedicated license called "Academic Research Access". Researchers are eligible for full
or discounted access to the Crunchbase dataset on a case-by-case basis. Once a researcher has
obtained consent to access and has complied with the terms of use, Crunchbase can provide an API
that allows full access to its data in CSV and JSON formats. Considering the large amount of
data collected and the privileged access for research purposes, many researchers have begun to
explore the potential of this database, generally for research in the areas of entrepreneurship and
economics. In order to promote the advancement of this field of research, the authors have therefore
considered a data source potentially freely accessible to researchers. Considering other paid data
sources would in fact limit the progress of research, making the experiments replicable for a limited
number of researchers.
However, since this is still an emerging field in research in entrepreneurial finance (Obschonka and
Audretsch, 2019), the identified works have a pioneering character and this research is proposed as
a detailed but still exploratory study on the subject.
This document is organized as follows. Section 2 provides an introduction to machine learning and
outlines the main differences from a traditional statistical approach. Section 3 provides an overview
of the venture capital firms that have already applied a data-driven approach to their investment
decision-making. Section 4 is an introduction to Crunchbase, one of the most relevant databases on
startup companies and investors. Section 5 describes the scope of this study, focusing on research
contributions that have applied machine learning techniques to Crunchbase data. Section 6
classifies the studies’ research goals and describes the various machine learning approaches.
Section 7 describes an example of how the models proposed by previous studies could be integrated
synergistically into investor decision-making. Section 8 synthesizes all the features or variables
used, which are obtained either directly from Crunchbase or through a feature engineering process.
Section 9 analyses the algorithms used. In Section 10 we discuss the results obtained in previous
research in order to establish a baseline for future work in this field. Finally, Section 11
presents a final discussion of the applicability of machine learning as a tool for data-driven
investments, while conclusions and future developments are presented in Section 12. Appendix 1
details all the features in Crunchbase, while Appendix 2 and Appendix 3 provide, respectively, all
the algorithms and the types of performance metrics considered in previous works.
2. An introduction to machine learning
Since this research explores the application of machine learning techniques to support equity
investors in their decision-making, it is important to provide an introduction to the topic from a
technical point of view. Machine learning is a subset of the wider scientific field of artificial
intelligence (Zomaya and Sakr, 2017). Since the seminal work by Arthur Samuel (Samuel, 1959),
machine learning has been defined as the field of study giving systems the ability to automatically
learn and improve from experience without being explicitly programmed. Traditional programming
requires a human intervention to explicitly code the rules to transform the input data into the desired
output. On the other hand, machine learning uses the input data and output to train an algorithm in
order to automatically formulate the rules from the data and so generate a program (i.e., a model).
Once the algorithm has identified patterns and learned relationships between the input data and
output of a phenomenon, the generated model can be used to calculate the output, starting from a
new input.
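The contrast between the two paradigms can be sketched with a toy example. In the first function a person writes the rule; in the second, a threshold is derived from labeled examples. The data, the single input variable and the function names are hypothetical, and the threshold search is a deliberately simple stand-in for a real learning algorithm.

```python
# Toy labeled data: (input value, known output). All values are hypothetical.
examples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (9.0, 1)]

def hand_coded_rule(x):
    """Traditional programming: a person chooses the threshold explicitly."""
    return 1 if x > 5.0 else 0

def learn_threshold(data):
    """Machine learning: pick the threshold that makes the fewest errors
    on the labeled examples (a one-variable decision stump)."""
    values = sorted(x for x, _ in data)
    # Candidate thresholds are the midpoints between consecutive inputs
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum(1 for x, y in data if (1 if x > t else 0) != y)

    return min(candidates, key=errors)

threshold = learn_threshold(examples)

def learned_model(x):
    """The generated 'program': its rule was formulated from the data."""
    return 1 if x > threshold else 0

print(threshold, learned_model(4.0), learned_model(8.0))
```

Once the threshold has been learned, `learned_model` can be applied to inputs that were not among the training examples, which is the step the paragraph above calls calculating the output from a new input.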
Compared with traditional statistical models, this paradigm shift also raises some considerations
on how machine learning can be used in quantitative research. Breiman (2001) identified two
cultures within statistics: “data modelling culture” and “algorithmic modelling culture”. The “data
modelling culture” must start with a model hypothesis and the validation method is of the type
yes/no, based on goodness-of-fit tests. On the other hand, the “algorithmic modelling culture”
does not need an initial model to test but tries to find an algorithm that, operating on input variables,
is able to predict an output response. In the latter case the validation method is accuracy in
prediction. Unlike traditional statistical methods that require the articulation of hypotheses and
human intuition for the specification of an analysis model, machine learning has less need to specify
in advance what the model should contain and can elaborate much more complex models with
many more interactions between variables. Machine learning therefore allows us to make
predictions based on correlations that are not easily predictable if we rely only on an analyst's
intuition and assumptions. The modelling phase is the most difficult part of the methodology in
inferential statistics. Conversely, machine learning methods allow us to eliminate the hypotheses
related to traditional statistical methodology. With machine learning we try to make the algorithm
identify the model itself directly from the data (Bzdok, Altman and Krzywinski, 2018).
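The idea that an algorithmic model is judged by its predictive accuracy, rather than by a goodness-of-fit test, can be sketched as follows. The example fits a simple linear regression by ordinary least squares on one part of the data and measures the error on held-out observations. All numbers are hypothetical and the helper names are illustrative assumptions.

```python
# Hypothetical observations (x, y), roughly following y = 2x + 1 plus noise.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1),
        (5, 11.0), (6, 12.9), (7, 15.2), (8, 16.9), (9, 19.1)]

# Hold out every third observation for testing; train on the rest.
test = data[2::3]
train = [p for p in data if p not in test]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(points, slope, intercept):
    """Average absolute gap between predicted and observed outputs."""
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)

# The model is fitted on the training data only ...
slope, intercept = fit_line(train)
# ... and validated on observations it has never seen.
test_mae = mean_absolute_error(test, slope, intercept)
print(round(slope, 2), round(intercept, 2), round(test_mae, 2))
```

The key design choice is that the error is computed on `test`, not on `train`: a model that only fits the data it was trained on would say little about its ability to predict new cases.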
We can draw similar conclusions starting from a big data perspective. “When we let the data speak,
we can make connections that we had never thought existed.” (Mayer-Schönberger and Cukier,
2013). Machine learning therefore allows flexibility in the application of various analysis models
and this means that variables can combine with each other in unexpected ways. The models in
machine learning are particularly efficient in determining which of the many variables are most
important and in recognizing that some elements don't matter and others do, even in an
unpredictable way.
It is important to note that the hypothesis-driven approach is mainly focused on “why”
something is happening, and thus on causality. Within a data-driven approach, instead, the focus is
mainly on “what” is happening. “Big data is about what, not why. We don’t always need to know
the cause of a phenomenon; rather, we can let data speak for itself.” (Mayer-Schönberger and
Cukier, 2013). However, it is important to underline that machine learning does not stand in
contrast to traditional statistical analysis tools, but rather aims to be a new tool available to analysts
to highlight relationships and correlations that are not easily discoverable. And this is thanks to the
wide availability of models and algorithms able to operate on the many available variables. “By
telling us which two things are potentially connected, correlations allow us to investigate further
whether a causal relationship is present, and if so, why. […] Through correlations we can catch a
glimpse of the important variables that we then use in experiments to investigate causality.”
(Mayer-Schönberger and Cukier, 2013).
In order to learn, machine learning systems need three key elements: data, variables and
algorithms. Learning is only possible if a large amount of data is available. In machine learning,
the more data we have, the better the system learns and the more accurate its later answers
tend to be. For investment decisions we can now rely on established data sources providing
information about thousands of companies. In this regard, Crunchbase is a highly valuable source
of data as it collects and provides business data about private and public ventures on a global scale.
The database includes data about companies, equity investors, funding rounds and people involved
in the entrepreneurial ecosystem. Before starting to use the data, however, it is necessary to know
that the available datasets have often not been created specifically for use in machine
learning. As in the case of Crunchbase, it is in fact necessary to carry out a preliminary data
processing activity, carefully analyzing and cleaning the content of the dataset. Secondly, the data
used must include useful information to represent the considered population. It is in fact essential
that the variables (or features in the machine learning language) are able to capture the relevant
elements for an exhaustive profiling of the unit of analysis. Specifically, in entrepreneurial finance,
the system should consider the key variables used by equity investors in their decision-making
process. For example, given a dataset of startup companies, the features should contain information
about the team, industry, market, product/service offered, intellectual property and investments
raised. A critical step in the machine learning workflow, prior to the actual implementation of the
algorithms, is the feature selection and feature engineering activity. In fact, in order to improve the
performance of machine learning algorithms, it is essential to identify the most relevant features
and create new ones if necessary. This operation requires both data mining skills and a deep
knowledge of the domain. Once the dataset has been optimized, it is then possible to proceed with
the implementation of the algorithms, properly chosen according to the objective of the work.
Machine learning algorithms are in charge of identifying the relationships between data and thus
generating the output model. Considering the large number of machine learning algorithms
available, at this stage it is important to provide a classification scheme, so as to better understand
their context of use. Machine learning algorithms can be grouped by learning style into three main
classes: supervised learning, unsupervised learning and semi-supervised learning. The main
difference between these three classes of algorithms is the way they model a problem based on
their interaction with the input data.
• Supervised learning techniques use input data (called training data) that has a known label
or result. The output model is generated through a training process. The algorithm attempts
to make predictions by trying to find and learn patterns in the input data and is corrected
when those predictions are not correct. The training phase continues until the model
achieves a desired level of accuracy on the training data. Depending on the type of the target
variable used, supervised algorithms can be applied to regression problems (when the
output variable is a number) or classification problems (when the output variable is a
category). Classification problems can then be distinguished into binary classification
(when instances must be classified into one of two possible classes) and multiclass
classification (when instances must be classified into one of three or more possible classes).
Some examples of regression algorithms are: simple linear regression, lasso regression,
polynomial regression. Examples of classification algorithms are: k-nearest neighbors,
naive Bayes, logistic regression, support vector machine, decision trees, random forest,
boosted decision trees.
• Unsupervised learning techniques, by contrast, do not rely on labeled input data (i.e., no
output or target variable is available). The algorithm generates a model by identifying and
learning patterns from the input data only. Unsupervised learning is widely applied for
clustering, to identify subsets of data points (i.e., clusters) having homogeneous properties
that differ from those of the data points belonging to other clusters. Other examples of
unsupervised problems are dimensionality reduction and association rule learning.
• Semi-supervised learning techniques use a mixture of labeled and unlabeled input data.
While there is a desired prediction problem, the algorithms must
also learn the structures to organize the data and make assumptions about how to model the
unlabeled data.
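The workflow described in this section (feature engineering, followed by a supervised or an unsupervised algorithm) can be sketched on a toy dataset. The sketch below is only illustrative: the records and field names are hypothetical and are not the actual Crunchbase schema, k-nearest neighbors is chosen merely as one of the classification algorithms listed above, and a two-cluster k-means stands in for clustering. Semi-supervised learning is not shown.

```python
import math

# Hypothetical raw records; field names are illustrative, NOT the Crunchbase
# schema. "exited" is the known label (1 = the company made an exit).
raw = [
    {"founded": 2010, "first_round": 2011, "rounds": 4, "raised_musd": 40.0, "exited": 1},
    {"founded": 2012, "first_round": 2013, "rounds": 5, "raised_musd": 55.0, "exited": 1},
    {"founded": 2011, "first_round": 2012, "rounds": 4, "raised_musd": 48.0, "exited": 1},
    {"founded": 2010, "first_round": 2014, "rounds": 1, "raised_musd": 1.0, "exited": 0},
    {"founded": 2013, "first_round": 2016, "rounds": 1, "raised_musd": 0.5, "exited": 0},
    {"founded": 2012, "first_round": 2015, "rounds": 2, "raised_musd": 2.0, "exited": 0},
]

def engineer(rec):
    """Feature engineering: derive new variables from the raw fields."""
    years_to_funding = rec["first_round"] - rec["founded"]
    avg_round_size = rec["raised_musd"] / rec["rounds"]
    # Log-scale the money variable so that it does not dominate distances
    return (float(years_to_funding), math.log1p(avg_round_size))

X = [engineer(r) for r in raw]   # input features
y = [r["exited"] for r in raw]   # labels, used by the supervised model only

def knn_predict(point, X_train, y_train, k=3):
    """Supervised: majority label among the k nearest labeled points."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(point, X_train[i]))
    votes = sum(y_train[i] for i in order[:k])
    return 1 if votes * 2 > k else 0

def two_means(points, iterations=10):
    """Unsupervised: split points into two clusters without any labels."""
    centers = [points[0], points[-1]]  # simple deterministic initialization
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest center
        assignment = [min((0, 1), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        # Move each center to the mean of the points assigned to it
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

# Binary classification of a new, unlabeled (hypothetical) company
new_company = {"founded": 2014, "first_round": 2015, "rounds": 3, "raised_musd": 30.0}
prediction = knn_predict(engineer(new_company), X, y)

# Clustering of the same feature vectors, ignoring the labels entirely
clusters = two_means(X)
print(prediction, clusters)
```

The supervised function needs the label list `y`, whereas the clustering function receives only `X`; this difference in how the algorithm interacts with the input data is what separates the two learning styles.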
3. Data-driven venture capital firms
Since the late 1990s, the venture funding industry has changed in step with the evolution of
technology-driven entrepreneurship. The decreasing costs of telecommunications and computing
have made it possible to start new, disruptive businesses with a relatively small amount of seed
capital. The emergence of new entrepreneurial opportunities has consequently increased the
number of new technology-driven ventures and therefore the number of early-stage proposals
submitted to VCs. The equity investment industry has changed accordingly, having to deal with an
increasing number of projects to be assessed and with a consequent increase in the complexity of
identifying the best proposals. Reviewing a business project, or completing the preliminary
screening phase, is a knowledge-based and time-consuming activity that could be made more
efficient by the synergic use of predictive algorithms capable of processing large amounts of data
in real time.
Traditionally, the venture capital industry has been considered a closed network in which a few
leading firms have enjoyed access to the most disruptive companies. Today, however, as some
authors have pointed out, new players in entrepreneurial finance have emerged (Block, Colombo,
Cumming and Vismara, 2018). According to Scott Anthony (Scott, 2012), the venture capital industry
can be divided into three types of firms: (1) top-tier firms (e.g., Sequoia Capital), (2) incubators
and accelerators (e.g., Y Combinator) and (3) innovative firms that approach the funding process
more scientifically by analyzing large amounts of data using data mining techniques (e.g.,
Correlation Ventures). Since data mining—and specifically machine learning—has already
radically changed many areas of the financial sector (e.g., stock exchanges, insurance and wealth
management), it is foreseeable that in the near future it will have a game-changing impact on the
traditional venture capital industry as well. Using artificial intelligence to identify, evaluate, select
and support the most promising startups could provide investors with a new competitive advantage
capable of redefining the landscape and roles within the venture capital industry.
To date only a small number of investment firms have claimed to use data-driven or quantitative
methods for the assessment of prospective investments (Corea, 2018). These innovative firms could
be an early indicator of a new approach to early-stage funding. The following are some of the main
venture capital firms that are using data science techniques in their investment process (the year of
foundation of each firm is in parentheses): Kleiner Perkins (1972), Ardian (1996), e.ventures
(1998), WR Hambrecht Ventures (1998), Origin Ventures (1999), General Catalyst Partners
(2000), Scale Venture Partners (2000), Nauta Capital (2004), Correlation Ventures (2006),
Georgian Partners (2008), GV (2008), Ulu Ventures (2008), NorthEdge Capital (2009), Right Side
Capital Management (2010), Social Capital (2011), Venture/Science (2012), 645 Ventures (2013),
SignalFire (2013), Redstone (2014), Switch Ventures (2014), Connetic Ventures (2015), EQT
Ventures (2015), Follow[the]seed (2015), Fyrfly (2015), Hone Capital (2015), InReach Ventures
(2015), Fly Ventures (2016) and Hatcher+ (2016). Other emerging companies are developing tools
or providing business intelligence services to support investors and entrepreneurs throughout the
fundraising process. Some examples are WR Hambrecht Ventures (1998), Aingel.ai (2015),
PreSeries (2015), Capital Pilot (2016), Radicle (2016), Kähler AI (2018), Rocket DAO (2018),
Valsys (2018) and Specter (2019). The years of foundation in these two lists show that it is
not only brand-new firms that are active in this field; well-established and traditional
investors are also exploring the potential of a data-driven approach to investing. It can also be noted
that alongside investment firms themselves, business-to-business companies are emerging to offer
tools and services to a new generation of data-driven investors. While large investment firms can
afford the costs to acquire the necessary resources and competencies to develop proprietary
algorithms in-house, smaller firms generally do not have such capital. In this context, boutique
venture capital firms or even business angels can rely on as-a-service solutions developed by
third-party data science companies. It is important to highlight, however, that the success and
reputation of an investment fund is mainly built on its ability to select and support the most valuable
business proposals. In a scenario in which a few effective algorithms select and propose the
same best startups to all investors, the competitive advantage of each firm would decrease
dramatically. Such a solution would trigger substantial new dynamics in the deal flow management,
and it is therefore essential that the proposed models allow individual investment firms to customize
some key parameters to maintain an active role in the decision-making process. Furthermore,
machine learning should not be considered a black-box solution able to replace human experience
but rather a valuable tool that can support practitioners in their decision-making process.
Although all of the companies listed above claim to use a data-driven approach to investment, the
specifications of their models are not publicly disclosed, as they are considered strategic
competitive assets. Since breakthroughs in this field are mainly converted into commercial software
rather than academic publications, this article aims, by analyzing the academic state of the art, to
be a useful reference for tracking the progress of this field of research.
4. Crunchbase
The application of machine learning algorithms in entrepreneurial finance is a very challenging
task. In particular, retrieving data that can be used to properly model an early-stage company has
a huge impact on the final results. In this regard, Crunchbase is an excellent source of information
(Ferrati and Muffatto, 2020 June) and for this reason it has been chosen as the reference database in
this research. Crunchbase is an online platform collecting and providing business data about private
and public companies on a global scale. Originally built to track startups, the database includes
descriptive data about companies, equity investors, funding rounds and people involved in the
entrepreneurial ecosystem.
Compared to other commonly-used startup databases covering similar information and frequently
used for economic research (e.g., ThomsonOne, formerly known as "VentureXpert"), Crunchbase
is not focused only on venture-backed companies, but covers both companies that have been funded
and those that have not yet been. Moreover, aggregate statistics on funding rounds by country and year
are quite similar to those produced with other established sources, which supports the use of
Crunchbase as a reliable source in terms of coverage of funded ventures (Dalle et al., 2017). So far
Crunchbase has been used by a few scholars and for very different research purposes. An
overview of research contributions that used Crunchbase has been provided by some OECD scholars
(Dalle, Den Besten and Menon, 2017; Menon and Tarasconi, 2017). A very limited number of
papers have been devoted to prediction in investments, like an article that proposes a machine
learning approach to predicting investment behaviour (Liang and Yuan, 2016), while some other
papers report experiments with new methods to select investment opportunities (Zhong et al.,
2016a; Liang and Yuan, 2012; Liang and Yuan, 2013; Santana et al., 2014).
Since Crunchbase is partially a crowd-sourced database, it is important to highlight the
methodology with which the company collects and verifies the accuracy of data (Crunchbase,
January, 7, 2020). Crunchbase sources, updates and validates its data on a daily basis, using four
synergistic activities:
• The Crunchbase Venture Program: investors submit monthly portfolio updates in exchange
for discounted access to the Crunchbase API, Excel export, Crunchbase Pro, and
Crunchbase Marketplace. More than 3,500 global investment firms update their profiles
directly. This strategy gives Crunchbase access to the most up-to-date data.
• Active Community Contributors: active users can submit information to the database. The
community helps the database grow and improve over time. It is important to specify that
every submission is subject to registration and social validation, and is often reviewed by a
moderator before being accepted and published.
• Artificial intelligence: to verify the reliability of data, Crunchbase applies machine
learning algorithms to validate data accuracy, scan for anomalies, and alert its data
scientists to data discrepancies.
• In-house data science team: Crunchbase data analysts manually validate the collected data.
In addition, the team develops the algorithms used internally and analyzes the data to
provide business insights, for example in the form of periodic reports.
The combined use of these four strategies of data collection and validation is an element of
innovation and competitive advantage over other databases commonly used in the research field of
entrepreneurship and economics. In fact, thanks to the quantity and quality of the data collected,
Crunchbase is used not only by practitioners (e.g., entrepreneurs, investors or policy makers), but
also by academic researchers who intend to apply a quantitative approach to research on
entrepreneurship and innovation. However, it should be noted that since the dataset is partially
created through a crowd-sourcing approach, data density cannot be guaranteed due to the voluntary
nature of information recording. This is especially true for startups that have not collected
investments yet and that therefore, not being part of the portfolio of any investor, are not reported
by investors participating in the Crunchbase Venture Program. For this reason, to be effectively
used by researchers, Crunchbase data requires an accurate pre-processing activity (Ferrati and
Muffatto, 2020 July).
In order to assess the potential of Crunchbase as a source of data for research in
entrepreneurship, a description of its content and structure is provided. Overall, the database is
organized into seventeen .csv files. Grouping the individual datasets by topic,
Crunchbase covers five macro areas of information, related respectively to organizations,
investment activities, people (e.g., founders, employees, investment partners), exits and public
events. As of May 21, 2019, the database provided information on 760,590 organizations (of which
708,558 companies, 38,740 financial organizations and 13,292 schools and/or universities),
121,509 investors of different types (e.g., venture capital firms, angel investors, etc.), 263,426
funding rounds, 890,429 people, 1,346,357 jobs, 17,068 Initial Public Offerings (IPOs) and 89,959
acquisitions. As the use of Crunchbase by entrepreneurs and investors has grown over time and the
database is now establishing itself as a primary source of information on startups and funding
rounds, it can be expected that the number of companies, investors and people registered voluntarily
will grow even more over time. Considering both the large amount of data provided by Crunchbase
and the type of information it contains, an increasing number of researchers are using this data
source to apply new machine learning approaches to the study of phenomena in entrepreneurship.
The following sections analyze how machine learning and Crunchbase have been used so far in
research.
5. Scope of the study
This study provides an analysis of machine learning applications to support the decision-making
process of equity investors. All the sources analyzed used Crunchbase as their main source of data.
Contributions were then searched online in October 2019 using the search string “machine
learning” AND “crunchbase.” In total, 20 documents were identified. Due to the novelty of the
topic, it was decided to include in the results sources of any type. The works considered included
two journal articles, six conference papers, two doctoral theses, two master’s degree theses, and
eight working papers. This distribution confirms the innovative nature of the topic and opens up
important research opportunities for scholars who wish to apply machine learning in analyzing
Crunchbase data. The first contribution dates to 2012, and out of the 20 identified works, 10 had
been published by 2017. As for geographical distribution, 11 of the studies were developed by
authors affiliated with universities in the United States, eight in Europe and one in East Asia.
California was particularly well represented, with six contributions, all carried out at Stanford
University.
6. Classification of the research objectives in using machine learning and Crunchbase
This work aims to answer the research question “For what purposes have machine learning tools
been applied so far to support equity investors in their decision-making process?”. After a detailed
analysis of previous studies, seven research objectives have been identified (the number of works
addressing each topic is in parentheses): predicting the exit event of a company (6), predicting next
funding events in a given period of time (3), predicting the next key event that a company will
achieve (1), predicting investment relationship between companies and investors (3), generating
investment recommendations for investors (3), predicting a company valuation (1) and classifying
companies by industry (4). Depending on the research objectives, different machine learning
approaches were applied to solve the specific problem. Table 1 summarizes the research goal
addressed by each contribution as well as the machine learning approach applied (indicated by a
capital letter). In particular, 11 works used a supervised learning binary classification (A), two a
supervised learning multiclass classification (B), one a supervised learning multilabel classification
(C), one a supervised learning regression (D), three an unsupervised learning approach (E) and
three a recommender system. The majority of previous studies used a supervised learning approach
for binary classification and have applied it in the context of various research objectives (predicting
the exit event of a company, predicting funding events in a given period of time, predicting
investment relationship between companies and investors, and generating investment
recommendations for investors). The following sections describe in detail the specific research
objectives addressed in previous works. Each section discusses the goal, highlights how the
problem was addressed using Crunchbase and describes the class of machine learning algorithms
used to approach the problem.
[TABLE 1]
6.1 Predicting the exit event of a company
Over the years, the topic that has most attracted researchers’ interest has been the startup success
prediction problem, also known as the picking winners problem. The concept of success for a
startup is not straightforward, and therefore it must be properly formalized. However, in choosing
the most appropriate definition for success, previous authors had also to consider the constraints
imposed by the availability of suitable variables in Crunchbase.
For example, the profit generated by a company over time is a reliable measure of its financial
success. From a financial point of view, generating profits is the ultimate goal of a company and is
what an investor expects when deciding to invest. However, Crunchbase does not disclose any
information on the companies’ financials, and thus this information cannot be considered in this
context.
Another proxy to measure a company’s growth (and indirectly its possible financial success) is the
increase in the number of employees over time. However, Crunchbase does not provide exact
headcount figures. The database includes a categorical variable employee_count to
describe the size of a company. This variable can take a value chosen among nine ranges (1–10,
11–50, 51–100, 101–250, 251–500, 501–1,000, 1,001–5,000, 5,001–10,000 and 10,000+), and the
variation of its value over time cannot be inferred from a single extraction from the dataset on a
certain date. Therefore, this variable also cannot be considered for formalizing the startup success
prediction problem.
It is commonly accepted that the critical milestone that classifies a venture-backed company as
successful is the so-called exit event. A venture-backed company can make an exit through one of
two main strategies: it can either make an IPO or be acquired by a larger company through a merger
and acquisition (M&A). Both strategies give the shareholders (e.g., founders, investors and
sometimes early employees) the opportunity to receive cash in return for their equity stake.
Crunchbase provides information about the status of an organization through a dedicated
categorical variable in the organizations dataset. The variable can assume four different values:
operating, closed, acquired or IPO. Because of the direct availability of this information, the value
of the status variable can be considered as a proxy for the success or lack of success achieved by a
company. It is important to note that not all the studies use the same definition for success. From a
machine learning point of view, this is a problem of binary classification (successful or not
successful). However, the target variable assumes nonhomogeneous values in different studies.
Table 2 reports how different studies have encoded a company’s status into a binary target variable:
successful (equal to one) or not successful (equal to zero). From an operational point of view, all
the identified publications consider acquired companies as successful and closed ones as not
successful. Only one study, focusing only on acquired ventures, does not consider the IPO status
as a successful category (Xiang et al., 2012). It is important to highlight that the rationale used to
label operating ventures as successful or unsuccessful is not shared by all studies, and only one
contribution classifies operating companies as successful (Ünal, 2019). Furthermore, one work
does not reveal the labels it used to identify successful and unsuccessful companies (N.A.), making
the results difficult to interpret (Krishna et al., 2016). Finally, for completeness it should also be
mentioned that one study added an additional Boolean variable to define the success of a company:
whether the company is a unicorn—that is, whether it is valued at $1 billion or more (Bo Guang
Huang, 2016).
[TABLE 2]
It must be noted that acquisition or IPO predictions are not always a perfect proxy for a company’s
success. For example, already profitable startups may decide to not make an exit and remain private
for years, but labeling such companies as unsuccessful would be incorrect. Furthermore, from the
investors’ point of view, a company in their portfolio should be considered successful if they can
sell their equity stake for a profit. Conversely, an exit event does not always translate into a positive
financial return for shareholders (Moeller, Schlingemann and Stulz, 2005), as investors might not
be able to recover their initial investments. Since neither the exit price nor the equity stake acquired
by investors at the time of funding is generally disclosed, classifying a company’s exit with a
successful or unsuccessful label is highly complex. Finally, the motivations leading to the
acquisition of a startup are not always linked to the desire to generate a return on investment but
may depend on strategic decisions of the acquiring company, as in the case of acqui-hiring deals.
Given that only two values are possible for the target variable, works that aim to predict the exit
event of a company applied a supervised binary classification approach.
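As an illustration of the encoding step summarized in Table 2, the following minimal Python sketch maps the status variable to a binary target. The mapping shown (acquired and IPO as successful, closed as not successful, operating left unlabeled) is only one possible convention and is an assumption made here; as noted above, the reviewed studies differ on this point. The function name and the sample records are hypothetical.

```python
from typing import Optional

# Assumed convention (one of several used in the literature): exits count as
# success, closures as failure, and operating companies are left unlabeled.
STATUS_TO_TARGET = {"acquired": 1, "ipo": 1, "closed": 0}


def encode_status(status: str) -> Optional[int]:
    """Return 1 (successful), 0 (not successful) or None (no label)."""
    return STATUS_TO_TARGET.get(status.strip().lower())


# Hypothetical records standing in for rows of the organizations dataset.
companies = [
    {"name": "A", "status": "acquired"},
    {"name": "B", "status": "IPO"},
    {"name": "C", "status": "closed"},
    {"name": "D", "status": "operating"},
]

# Keep only the companies that receive a label under this convention.
labeled = [
    (c["name"], encode_status(c["status"]))
    for c in companies
    if encode_status(c["status"]) is not None
]
print(labeled)
```

Changing the dictionary is enough to reproduce a different labeling rule, for example one that treats operating companies as successful.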
6.2 Predicting funding events in a given period of time
Given the difficulty of unambiguously defining the concept of success for a startup, some authors
focused on the more specific problem of predicting a company's ability to attract a further round of
investment within a given period of time (Dellermann, Lipusch, Ebel, Popp and Leimeister, 2017;
Sharchilev, Roizner, Rumyantsev, Ozornin, Serdyukov and de Rijke, 2018; Gastaud, Carniel, and
Dalle, 2019). This kind of problem is generally formulated as follows: for a given company that
has already secured at least one funding round (e.g., seed or angel investment), predict whether it
will raise an additional round of investment (i.e., Series A, B, etc.) during a given period of time
(e.g., one year). Series A rounds are generally the first round collected from VCs, and they usually
let angel investors exit the company, hopefully making a positive return on their previous
investment. For this reason, series A rounds are considered an important milestone in the life cycle
of a startup company, and insights about the likelihood of their future occurrence can have an
impact on the funding decisions of early-stage investors. This strategy can be extended to consider
different series of funding rounds (e.g., Series B, C, etc.). In addition, the given period
of time could be changed (e.g., to two or three years). In this way it is possible to predict, for example, whether a company
that has already secured a series B round is likely to get a series C round within two years
(Dellermann, Lipusch, Ebel, Popp and Leimeister, 2017).
This prediction problem can be effectively handled using Crunchbase data. In fact, the database
provides information about the number, amount, type and date of every funding round collected by
a company. Specifically, the information about the stage of investment (e.g., seed or series A, B or
C) is provided by the variable investment_type in the funding_rounds dataset and is easily
accessible for every company in the database. Given a company that has already raised a funding
round of a specific type (e.g., seed round) and given its complete funding round history, the target
variable is set equal to one or zero depending on whether the company has collected an additional
round of funding.
Since the target variable can assume only two values (1 or 0), the studies that considered this type
of problem applied a supervised binary classification approach.
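The construction of this target variable can be sketched in a few lines of Python. The sketch assumes that each funding round is available as a record with the investment_type field mentioned above and a date field, here called announced_on (an assumed column name); the function name, the default one-year horizon and the sample history are hypothetical choices made for illustration only.

```python
from datetime import date, timedelta


def raised_followup_round(rounds, base_type="seed", horizon_days=365):
    """Return 1 if a further round follows the first `base_type` round within
    `horizon_days`, 0 if not, and None if the company has no such base round."""
    ordered = sorted(rounds, key=lambda r: r["announced_on"])
    base = next((r for r in ordered if r["investment_type"] == base_type), None)
    if base is None:
        return None
    deadline = base["announced_on"] + timedelta(days=horizon_days)
    for r in ordered:
        # Any round announced strictly after the base round counts.
        if base["announced_on"] < r["announced_on"] <= deadline:
            return 1
    return 0


# Hypothetical funding history of a single company.
history = [
    {"investment_type": "seed", "announced_on": date(2017, 3, 1)},
    {"investment_type": "series_a", "announced_on": date(2018, 1, 15)},
]

print(raised_followup_round(history))                    # one-year horizon
print(raised_followup_round(history, horizon_days=180))  # shorter horizon
```

When labels are built this way, the observation window of each company should have fully elapsed at the extraction date; otherwise a zero may simply mean that the outcome has not been observed yet.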
6.3 Predicting the next event that a company will achieve
In the context of predicting the success of an investment, a more general approach aims to predict
the next key event in a company life cycle. An investor can exit from an investment not only at the
moment when a company is acquired or makes an IPO but also when the venture collects an
additional funding round. In the latter case, investors can sell their equity stake to new entrant
investors (secondary sale), making a capital gain or a capital loss.
To formalize the problem, one study (Arroyo, Corea, Jimenez-Diaz and Recio-Garcia, 2019)
defined a multiclass target variable, given that a first event occurred for a company within a
simulation time window. Specifically, five categorical values were used to label whether the
company was acquired (AC), reached at least another funding round (FR), went for an IPO (IP),
closed (CL) or achieved none of these events (NE). Among the five labels, the study considered
three classes as successful (AC, FR and IP) and two classes as unsuccessful (CL and NE).
Since the target variable can assume more than two possible values, the machine learning models
implemented for this objective applied a supervised multiclass classification approach.
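A minimal sketch of this labeling scheme is given below. It uses the five class codes reported above, while the event names, the inclusive window boundaries and the sample timeline are assumptions introduced for illustration; the sketch is not a reproduction of the procedure used in the cited study.

```python
from datetime import date

# Class codes as reported in the text; the event names are hypothetical.
EVENT_LABELS = {
    "acquisition": "AC",
    "funding_round": "FR",
    "ipo": "IP",
    "closure": "CL",
}
SUCCESS_CLASSES = {"AC", "FR", "IP"}


def next_event_label(events, window_start, window_end):
    """Label a company with the first event falling inside the window,
    or 'NE' when no event occurs in that period."""
    in_window = [e for e in events if window_start <= e[0] <= window_end]
    if not in_window:
        return "NE"
    first_date, first_kind = min(in_window, key=lambda e: e[0])
    return EVENT_LABELS[first_kind]


# Hypothetical timeline of (date, event) pairs for one company.
timeline = [
    (date(2016, 5, 1), "funding_round"),
    (date(2019, 2, 1), "acquisition"),
]

label = next_event_label(timeline, date(2018, 1, 1), date(2020, 12, 31))
print(label, label in SUCCESS_CLASSES)
```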
6.4 Predicting investment relationship between companies and investors
Another research goal is predicting whether a specific investor will invest in a given company
(Shan, Cao and Lin, 2014; Zhang, Chan, and Abdulhamid, 2015; Liang and Yuan, 2016). From a
business point of view, this type of problem can be interesting both for investors and for companies
looking for funding. For entrepreneurs, knowing in advance which investor might be interested in
funding their company can make the process of scouting and identifying target investors more
efficient and effective. For VCs, knowing in advance which other firms might be interested in
supporting a given startup could be valuable information both in a competitive case (to anticipate
the future action of a competitor) and in a syndicate scenario (to act as co-investors in a deal).
In the context of this problem, Crunchbase still represents a valuable data source. In fact, the
database has information on all the investors involved in each funding round. Companies and
investors can therefore be represented as nodes in a graph data structure, where investment
relationships are modeled as edges in a bipartite network, i.e., a network whose nodes are divided
into two sets and where the only connections allowed are those between nodes in different sets. In
this case, machine learning techniques are applied to network analysis methods. The problem of
identifying investment opportunities is reframed as the problem of predicting new links in a
bipartite network. In this context, the target variable assumes a value equal to one or zero depending
on whether or not a link (an investment) exists between the node of a company and that of an
investor.
Since the target variable can assume only two values, the prediction problem can be addressed
through a supervised binary classification approach.
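The link prediction framing can be illustrated with a small Python sketch. Investor and company identifiers are hypothetical, and the feature shown (the number of current backers of a company that already share another portfolio company with the candidate investor) is only an illustrative assumption, not a feature taken from the reviewed studies.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (investor, company) edges extracted from funding rounds.
investments = {("vc_1", "co_a"), ("vc_1", "co_b"), ("vc_2", "co_b")}

portfolio = defaultdict(set)  # investor -> companies funded
backers = defaultdict(set)    # company  -> investors involved
for investor, company in investments:
    portfolio[investor].add(company)
    backers[company].add(investor)

investors = sorted(portfolio)
companies = sorted(backers)

# Target variable: 1 if the edge exists in the bipartite network, else 0.
labeled_pairs = {
    (i, c): int((i, c) in investments) for i, c in product(investors, companies)
}


def shared_backers(investor, company):
    """Count backers of `company` that co-invested with `investor` elsewhere."""
    others = backers[company] - {investor}
    return sum(
        1 for o in others if (portfolio[o] & portfolio[investor]) - {company}
    )


print(labeled_pairs)
print(shared_backers("vc_2", "co_a"))
```

In a realistic setting the candidate pairs would be built from a snapshot of the network at a given date and the labels from the links that appear afterwards, so that the features do not leak information about the edge being predicted.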
6.5 Generating investment recommendations for investors
Generally speaking, recommender systems are designed to learn and predict the preference a user
would assign to an item. These systems are used to guide users in a customized way toward
interesting items in a large space of possible options. Companies such as Amazon, Netflix, Spotify
and LinkedIn have developed recommender systems to suggest to their users, respectively, relevant
products, videos, songs or jobs. In the current context, recommender system techniques are used to
learn the specific investment preferences of a venture capital firm to identify the most suitable
companies to suggest in the screening activity (Stone, 2014; Liu and Wangperawong, 2018; Zhong,
2019). Recommendation methods can be distinguished into three main classes (Adomavicius and
Tuzhilin, 2005): a content-based approach, collaborative filtering and hybrid models. In the
content-based approach, the items that the system recommends to the user are similar to those
he/she preferred in the past. In the collaborative filtering approach, the system recommends items
that other users with similar tastes have preferred in the past. The hybrid approach combines
content-based and collaborative methods.
The considered studies created their recommender systems using a different set of information for
every specific investor, such as (1) the investor’s preferences about startups' geographic locations,
their industry categories, their historical acquisition records and their leading products; (2) the
historical investment deals of the venture capital firm; or (3) the historical investment deals of other
similarly behaving venture capital firms. As the information provided by Crunchbase allows
researchers to create a detailed profile of each venture capital firm and its investment portfolio, the
database is well suited to the implementation of recommender systems.
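To make the collaborative filtering idea concrete, the following Python sketch recommends companies to an investor on the basis of the portfolios of similarly behaving investors. It is a deliberately simple neighborhood-based variant that uses Jaccard similarity between portfolios; the similarity measure, the function names and the sample portfolios are assumptions made for illustration and do not reproduce any of the reviewed systems.

```python
def jaccard(a, b):
    """Similarity between two portfolios, seen as sets of companies."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, portfolios, top_n=3):
    """Rank companies held by similar investors but not yet by `target`."""
    own = portfolios[target]
    scores = {}
    for other, held in portfolios.items():
        if other == target:
            continue
        similarity = jaccard(own, held)
        for company in held - own:
            scores[company] = scores.get(company, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [company for company, score in ranked[:top_n] if score > 0]


# Hypothetical historical deals of three venture capital firms.
portfolios = {
    "vc_1": {"co_a", "co_b", "co_c"},
    "vc_2": {"co_a", "co_b", "co_d"},
    "vc_3": {"co_e"},
}

print(recommend("vc_1", portfolios))
```

A content-based variant would instead compare company attributes (for example location or industry category) with those of the investor's past deals, and a hybrid model would combine the two scores.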
6.6 Predicting a company valuation
Estimating the pre-money valuation of a startup is one of the most crucial tasks for an equity
investor. In fact, given the current funding amount and the pre-money valuation, the new
distribution of the equity stake can be immediately computed. Reaching an agreement on the
company’s pre-money valuation, and therefore on the percentage of equity that the investor will
acquire, is a particularly tricky element of the due-diligence process. From an investor’s point of
view, the value of this percentage is key, as it will determine the actual value of the future ROI.
Three main approaches to company valuation can be identified: asset-based, income-based and
market-based. A company’s valuation can therefore be based on either its assets, its forecasted cash
flows or comparable market transactions. However, the valuation accuracy of private venture
capital-backed companies is still a controversial topic. In this context, a machine learning approach
to the problem can be applied to support the computation.
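The relationship between pre-money valuation, funding amount and acquired equity mentioned above can be written out as a short worked example. The sketch relies on the standard definitions (post-money valuation equals pre-money valuation plus the new investment, and the new investor's share equals the investment divided by the post-money valuation); the figures are hypothetical and the simplification ignores option pools, convertible instruments and other deal terms.

```python
def post_money(pre_money: float, investment: float) -> float:
    """Post-money valuation = pre-money valuation + new investment."""
    return pre_money + investment


def investor_share(pre_money: float, investment: float) -> float:
    """Fraction of the company acquired by the new investor."""
    return investment / post_money(pre_money, investment)


# Hypothetical deal: 1M invested at a 4M pre-money valuation.
pre, amount = 4_000_000.0, 1_000_000.0
share = investor_share(pre, amount)

print(post_money(pre, amount))  # valuation right after the round
print(share)                    # stake of the new investor
print(1 - share)                # combined stake of existing shareholders
```

In this example the new investor obtains 20% of the company and the existing shareholders are diluted to 80%, which shows why a small change in the agreed pre-money valuation directly changes the stake obtained for the same amount.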
Among the considered contributions, only one focused on predicting the valuation of a venture-
backed company (Askaryan and Frost, 2015). It is noteworthy that although Crunchbase provides
a dedicated variable (post_money_valuation_usd) for recording the post-money valuation of a
company after each round of investment, this value is missing for most rounds. This is in line with
the general behavior of companies and investors. Although the funding round amount is usually
disclosed, no information is generally given about the equity stake it represents at the date of the
deal. To find the information necessary for the development of a model, the Askaryan and Frost
study integrated Crunchbase data with additional data collected from other sources (i.e., Data Hub,
DataFox and CBInsights).
Since the target variable for this objective is a continuous numerical variable, the task can be
formalized as a machine learning regression problem.
6.7 Classifying companies by industry
A further area of research concerns the implementation of machine learning models for determining
a company’s industry classification. An accurate company classification by sector has a key role in
many applications, such as identifying similar ventures (for example, for competitor analysis or for
company valuation purposes), matching investors with companies in their specific sector of interest
(useful for the implementation of recommender systems) or providing valuable features to aid
machine learning algorithms in their pattern recognition task.
To allow users to quickly search for companies operating in specific sectors, Crunchbase provides
two variables, called category_list and category_group_list. Specifically, organizations are labeled
using 680 unique categories and 46 category groups. Note that a category can be associated with
more than one category group. For example, the category “mobile advertising” is associated with
two category groups: “advertising” and “sales and marketing.” The associations between categories
and category groups are reported in the two columns category_name and category_group_list of
the category_groups dataset.
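The one-to-many association between categories and category groups can be loaded into a lookup table as sketched below. The column names category_name and category_group_list are those reported above, whereas the comma-separated format of category_group_list, the function name and the inline sample (built from the example in the text plus one placeholder row) are assumptions made for illustration.

```python
import csv
import io

# Inline stand-in for the category_groups dataset (hypothetical content).
SAMPLE_CSV = (
    "category_name,category_group_list\n"
    'Mobile Advertising,"Advertising,Sales and Marketing"\n'
    "Example Category,Example Group\n"
)


def load_category_groups(handle):
    """Map each category to the list of category groups it belongs to."""
    mapping = {}
    for row in csv.DictReader(handle):
        # Assumption: multiple groups are stored as a comma-separated string.
        groups = [g.strip() for g in row["category_group_list"].split(",")]
        mapping[row["category_name"]] = [g for g in groups if g]
    return mapping


category_groups = load_category_groups(io.StringIO(SAMPLE_CSV))
print(category_groups["Mobile Advertising"])
```

With the real file, the same function could be called on an open file handle instead of the in-memory sample.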
Although Crunchbase provides a useful classification scheme, it has some important limitations.
First, the categories used in both the category_list and category_group_list variables contain
nonhomogeneous information. In fact, industries (e.g., “automotive”), business functions (e.g.,
“marketing”) and technologies (e.g., “machine learning”) can appear in the same field. Second, the
classification scheme includes some over-represented and catchall classes (e.g., “software,”
“hardware” and “web”) that could be used to describe the vast majority of companies covered by
Crunchbase. Third, Crunchbase’s categories are not organized in any hierarchy. In fact, all the
categories are considered to belong to the same level, despite their differing degrees of specificity
(e.g., “energy” and “battery” are considered independent labels). Crunchbase’s classification
scheme is rather simplistic compared to other classification methodologies, which often provide an
industry hierarchy (e.g., NAICS). However, these public governmental schemes are often out of
date and thus not generally suitable for accurately analyzing innovation-based companies.
Starting from these considerations, some scholars have tried to improve the categories provided by
Crunchbase by applying text-mining techniques to automate the industry classification of private
companies (Stone, 2014; Batista and Carvalho, 2015; Huang and Shi, 2015; Semeniuta and Ismail,
2017). Previous researchers generally used the unstructured company descriptions provided by
Crunchbase as the information source for classification. However, it is interesting to highlight
the differences between the applied research approaches. Batista and Carvalho (2015) considered
a supervised learning approach to associate every company with one of forty-two possible
labels (multiclass classification problem). Stone (2014) generated a multilabel industry
classification using supervised learning techniques. Finally, Huang and Shi (2015) and Semeniuta
and Ismail (2017) used unsupervised machine learning techniques to cluster companies by
industry or by related area of activity.
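As a rough illustration of the supervised multiclass setting, the sketch below assigns a single industry label to a company description using a bag-of-words representation and a nearest-centroid rule with cosine similarity. This deliberately simple technique stands in for the text-mining methods of the cited works and is not taken from them; the training descriptions, the labels and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_centroids(examples):
    """Sum the word counts of all training descriptions for each label."""
    centroids = defaultdict(Counter)
    for description, label in examples:
        centroids[label].update(tokenize(description))
    return centroids


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(description, centroids):
    """Return the single label whose centroid is most similar to the text."""
    vector = Counter(tokenize(description))
    return max(sorted(centroids), key=lambda lab: cosine(vector, centroids[lab]))


# Hypothetical labeled company descriptions.
training = [
    ("electric cars and vehicle batteries", "automotive"),
    ("online ads platform for mobile marketing", "advertising"),
]
centroids = train_centroids(training)

print(classify("mobile ads network", centroids))
print(classify("electric vehicle maker", centroids))
```

A multilabel variant would return every label whose similarity exceeds a threshold instead of only the best one, while an unsupervised approach would cluster the description vectors without using any labels.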
7. Integrating different machine learning modules to support investors’ decision-making
After analyzing the purposes for which machine learning has been used in previous research, it is
possible to take a step forward and propose an integrated use of the different modules.
To support investors in their decision-making process, the different models in
the literature could be used sequentially, resulting in a single integrated solution. Given a large set of
investment proposals, an investor could use in sequence the models related to the following activities
in order to obtain a relatively small subset of companies to be analyzed in depth.
1. Classifying companies by industry. Given a textual description of the organization's activity,
machine learning models could provide a classification of companies by industry and
technology used. This preliminary step could allow investors to automatically filter only the
proposals in the areas in which they seek to invest.
2. Generating investment recommendations for investors. Once the industry of interest has been
identified, a machine learning recommendation system could detect only those companies that
are most in line with the preferences of the specific investment firm. This selection could be
made considering for example the characteristics of successful investments made in the past or
the preferences of other similar investors.
3. Predicting funding events in a given period of time. Considering the subset of recommended
companies, an additional model could predict the occurrence of a subsequent investment round.
This information could give the investor an insight into the companies' ability to grow over a
certain period of time. Knowing whether the company will raise further investments in the
future may also allow the investor to make assessments about his/her participation in future
rounds. On the other hand, having information about the occurrence of subsequent rounds could
allow the investor to plan a sale of his/her company shares even before an exit occurs.
4. Predicting the exit event of a company. If the investor decides to realize his/her return
on investment only at the time of a possible exit of the company, the machine learning models
could filter the remaining organizations so as to consider only those for which an exit event is
expected.
5. Predicting a company valuation. At this point, once a subset of potential deal companies has
been identified, an investor could use a machine learning model to estimate the value of the
companies in the future. Given the return on investment he/she intends to achieve, he/she could
then select only those companies that can generate the desired multiples. The forecast of the
future value of the firm could also be used to determine the optimal amount of the current
funding round in order to get an adequate equity stake.
6. Predicting investment relationship between companies and investors. Finally, for the remaining
companies after this series of filters, the investor could also identify other potential investors
interested in the deal to contact as possible co-investors.
The proposed sequence is only an example of how the different models could be used in a
synergistic way. At each step, the specific investor could choose appropriate parameters
in order to customize the models in accordance with his/her own strategy. However, it
should be emphasized that these models must be seen as a tool to support the investor’s
decision-making process and cannot replace the final decision of a professional practitioner.
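The chain of filters described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Candidate` fields, the threshold values and the idea that each model returns a single score per company are assumptions made for the example, not elements taken from the reviewed studies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """Hypothetical container for model outputs about one company."""
    name: str
    fit_score: float          # recommender output, assumed to lie in [0, 1]
    p_next_round: float       # predicted probability of a further funding round
    p_exit: float             # predicted probability of an exit event
    expected_multiple: float  # predicted future value divided by current value


Filter = Callable[[Candidate], bool]


def apply_filters(candidates: Iterable[Candidate], filters: List[Filter]) -> List[Candidate]:
    """Keep only the candidates that pass every filter, applied in order."""
    remaining = list(candidates)
    for keep in filters:
        remaining = [c for c in remaining if keep(c)]
    return remaining


# Illustrative thresholds; each investor would tune them to a strategy.
filters: List[Filter] = [
    lambda c: c.fit_score >= 0.6,
    lambda c: c.p_next_round >= 0.5,
    lambda c: c.p_exit >= 0.4,
    lambda c: c.expected_multiple >= 5.0,
]

candidates = [
    Candidate("A", 0.9, 0.7, 0.5, 8.0),
    Candidate("B", 0.8, 0.3, 0.6, 9.0),   # fails the funding-round filter
    Candidate("C", 0.7, 0.6, 0.45, 3.0),  # fails the valuation filter
]

shortlist = apply_filters(candidates, filters)
print([c.name for c in shortlist])
```

Keeping each step as a separate predicate mirrors the idea that an investor can adjust or skip individual filters without touching the others.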
8. Features’ classification
To train machine learning algorithms to support equity investors in their decision-making process,
the data provided as input must represent a set of variables relevant for the evaluation of a startup
company. What statistics calls independent variables are called features in machine learning.
Selecting the most significant features is a key step within the data mining process and
has a great impact on the performance of the final model. Since the model has to support investors
during the screening and evaluation of a startup venture, the features should include the most
relevant information considered among the assessment criteria traditionally used by equity
investors. When evaluating a potential deal, investors consider for example the industry in which
the company operates, the location, the stage of the business life cycle, the characteristics of the
team, the market, the competitors, the competitive advantage of the product/service, the milestones
achieved, the history of funding and the quality of the investors who have already supported the
project. However, to ensure the models are not limited to merely imitating the investors’ decision-
making process, the potential of features not considered in the classical approach should also be
explored. This could allow the discovery of new patterns not yet evident. To provide a useful
reference for future scholars, this article collects all features considered in previous studies. In total,
239 features have been identified, and all have been classified based on their thematic area into 8
generic classes and 36 specific classes. Figure 1 summarizes the results of the classification activity.
The number of features in each class is reported, and an example is provided to make the content
clear. Each percentage is the ratio of the number of features in a class to the total number of
identified features (i.e., 239). As can be seen from Figure 1, the information collected by
Crunchbase allows investigation of most of the areas considered by equity investors in their
decision-making process. In fact, in addition to the company’s basic information, the database
provides data about the founding team and the history of funding rounds. As for the cardinality of
these three generic classes, 74 of the features relate to the company’s basic information, 69 to
funding round information and 59 to founders’ characteristics. The top nine specific classes by
number of features concern the investors involved (19 features), the timing of rounds (18), the round
money amount (16), the founders’ previous work experience (13), the founders’ education (12), the
number of rounds (12), the characteristics of the team (11), the company's location (10) and the
company's lifetime (9).
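As a worked example of the percentages mentioned above, the shares of the three largest generic classes can be recomputed from the counts reported in the text. The dictionary keys are shortened labels chosen for this sketch.

```python
TOTAL_FEATURES = 239  # total number of features reported in the text

# Feature counts for the three largest generic classes, as reported in the text.
generic_counts = {
    "company basic information": 74,
    "funding round information": 69,
    "founders' characteristics": 59,
}

for label, count in generic_counts.items():
    share = 100 * count / TOTAL_FEATURES
    print(f"{label}: {count} features, {share:.1f}% of the total")

covered = sum(generic_counts.values())
covered_share = 100 * covered / TOTAL_FEATURES
print(f"three classes together: {covered} of {TOTAL_FEATURES} ({covered_share:.1f}%)")
```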
In accordance with the proposed classification, Appendix 1 provides a detailed collection of all the
239 features considered in previous studies. For each feature, the research goal for which it was
considered is specified. All features have been reported exactly as they were presented in the
original work. For example, the features “# of years since the company’s foundation” and “# of
months since the company’s foundation” have been listed as two different variables although they
express exactly the same concept (i.e., the company lifespan) in different units of measure. This
level of detail has been maintained in order to highlight the possible features that can be obtained
from Crunchbase as well as to provide useful information to researchers who wish to study the
topic. In fact, a comprehensive map of all considered features allows the replication of experiments
already carried out and can provide inspiration for the feature engineering process in future
research.
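The two lifespan features mentioned above differ only in their unit of measure, so both can be derived from the same pair of dates. The following is a minimal sketch of such a feature engineering step; the function names and the example dates are hypothetical and are not taken from Crunchbase or from the reviewed studies.

```python
from datetime import date


def company_age_months(founded_on: date, as_of: date) -> int:
    """Whole months elapsed between the founding date and the reference date."""
    months = (as_of.year - founded_on.year) * 12 + (as_of.month - founded_on.month)
    if as_of.day < founded_on.day:
        months -= 1  # the current month is not yet complete
    return max(months, 0)


def company_age_years(founded_on: date, as_of: date) -> int:
    """Whole years elapsed, derived from the month count."""
    return company_age_months(founded_on, as_of) // 12


founded = date(2015, 3, 20)   # hypothetical founding date
reference = date(2020, 1, 7)  # hypothetical snapshot date

age_months = company_age_months(founded, reference)
age_years = company_age_years(founded, reference)
print(age_months, age_years)
```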
[FIGURE 1]
9. The algorithms used
In this study we have also analyzed the algorithms used in previous works. Appendix 2 provides a
summary of the machine learning algorithms and data analysis techniques used for different
purposes.
In total, 48 algorithms or data analysis techniques have been identified. As can be easily noted,
most of these were used in a supervised classification context. Regression algorithms, by contrast,
have not been explored as thoroughly; only one study considered a simple linear regression model.
Similarly, only one algorithm has been used to assess unsupervised learning for clustering. As
regards natural language processing, term frequency–inverse document frequency (tf–idf) is the
most used technique. As for recommender systems, 10 techniques have been used, nine of which
were proposed in a single article (Zhong, 2019).
As reported in Figure 2, among the 10 algorithms or techniques used most often in previous
publications, the most frequently used is random forest (applied in 9 studies), followed by logistic
regression classifier (7), support vector machine (7), naive Bayes classifier (4), artificial neural
network (3), decision trees (3), tf–idf (3), Bayesian network (2), K-means clustering (2) and finally
latent Dirichlet allocation (2).
[FIGURE 2]
Considering the large variety of algorithms, it is worth describing their working logic.
To do this, it is useful to identify some classes that group algorithms with similar approaches.
Focusing on supervised learning methods used in previous works for classification, the following
classes can be identified.
• Regression algorithms (e.g. Logistic Regression). Regression methods are used extensively
in statistics and have been embraced by machine learning. The modeled relationship between variables is
iteratively refined using a measure of the error in the model's predictions. Specifically, Logistic
Regression is based on the concept of probability and assigns observations to a discrete set of
classes by transforming its output using the logistic sigmoid function.
• Instance-based algorithms (e.g. k-Nearest Neighbor, Support Vector Machines). These
methods are also called memory-based learning. Starting from the training set, they generate a
database of examples and compare the new instances with those already in memory. To find
the best match and make a prediction, the comparison is made by computing a similarity
(distance) measure between the new instance and the data that have been stored. As they
construct hypotheses directly from the training data itself, these algorithms are able to
continuously adapt the model as new data is added.
• Bayesian Algorithms (e.g. Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes).
These methods explicitly apply Bayes’ Theorem, which means that they compute the probability of
an event given prior knowledge of conditions that might be related to the event.
Specifically, the term naive refers to the assumption that the value of a particular feature
is independent of the value of any other feature, given the class variable. This is a very strong
assumption, which does not apply to most real cases. Nevertheless, the approach works
remarkably well even on data where this assumption is not valid.
• Decision Tree algorithms (e.g. Decision Trees). Using a top-down approach, these algorithms
build a tree by splitting the source dataset according to appropriate rules based on classification
features. Initially, the whole training set is taken as the root node of the tree and is then split
into subsets that become the successor children. The splitting process is recursively repeated on
each subset until the subset at a node has all the same values of the target variable (or until no
value would be added by a further split). To decide the best rule for splitting each node, the
algorithm evaluates candidate splits on all available features and then selects the split that yields
the purest subnodes with respect to the target variable. One of the advantages of these algorithms
is that the resulting tree can be visualized effectively. For this reason, unlike other
algorithms, decision trees need not be considered black boxes.
• Ensemble Algorithms (e.g. Random Forest, AdaBoost). These methods use models made of
multiple weaker models that are independently trained and whose predictions are combined to
obtain a better overall prediction than could be made by any single model alone.
The key element of these models is the selection of weak individual models to use and how
they are combined. This approach is usually very powerful and makes these algorithms
particularly effective.
• Artificial Neural Network Algorithms (e.g. Perceptron, Back-Propagation, Stochastic
Gradient Descent). These algorithms mimic the structure and function of biological neural
networks. Specifically, they reproduce the way in which the human brain is made up of
connected networks of neurons in order to allow computers to learn and decide in a similar way
to humans. To mimic the hierarchical way in which the brain processes information, they are
organized in sequences of layers, so that each layer processes the information, provides insight,
and passes the results to the next one. The simplest form of an ANN has three layers of neurons:
input data enters the system through the input layer, information is processed in the hidden layer
and the decision is taken in the output layer. When an ANN has more than one hidden layer, it
is called a Deep Neural Network (DNN). DNNs are the main structures used in deep learning.
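Three of the mechanisms named in the list above can be illustrated with a compact Python sketch: the logistic sigmoid used by logistic regression, the distance-based vote of an instance-based method (k-nearest neighbors) and the purity measure a decision tree can use to score a split. The choice of Euclidean distance, of Gini impurity as the purity measure, and all function names and toy values are assumptions made for this example; the reviewed studies may have used different settings. The code relies on `math.dist`, available from Python 3.8.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature values, class label)


def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Assign class 1 when the sigmoid of the linear score is at least 0.5."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


def knn_predict(train: List[Row], x: Sequence[float], k: int = 3) -> int:
    """Instance-based prediction: majority label among the k closest stored examples."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], x))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 0.0 means all labels are identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(rows: List[Row], feature: int, threshold: float) -> float:
    """Size-weighted Gini impurity of the two children produced by a split."""
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data invented for illustration.
data: List[Row] = [((1.0, 5.0), 0), ((2.0, 4.0), 0), ((6.0, 1.0), 1), ((7.0, 2.0), 1)]

print(logistic_predict([1.0, -1.0], 0.0, (6.0, 1.0)))
print(knn_predict(data, (6.5, 1.5), k=3))
print(split_impurity(data, feature=0, threshold=4.0))
```

A tree-building procedure would evaluate `split_impurity` for many candidate features and thresholds and keep the split with the lowest value, which corresponds to the purest subnodes.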
The choice of the best type of algorithm to use depends strictly on the type of problem to address
and the structure and characteristics of the data themselves. It is therefore not possible to state in
advance which algorithm is the most suitable for the creation of an effective model. Using different
datasets and different features, the publications analyzed have in fact tested multiple algorithms in
order to identify in each case the most effective solution in the specific context. For example,
instance-based methods are probably not the best choice when the dataset consists of a very large
number of companies, as the computational cost of the algorithm grows with the number of data points.
For Decision Tree methods, by contrast, categorical feature values are preferred, as they allow the
algorithm to create sub-trees more simply. Therefore, before training the model it is usually
necessary to perform a pre-processing activity to prepare the data properly.
However, in analyzing the different algorithms used, it is worth noting that many previous works
have achieved good results using random forest. The random forest classifier is a tree-based
ensemble learning method that creates a multitude of decision trees during training. The final class
of an instance is assigned by selecting the mode of the outputs provided by individual trees. To
generate different results for each tree, two types of randomness are introduced: each tree is built
both on a different subsample of rows (with repetition) and using a randomly selected subset of
columns. This method has some advantages: (1) it incorporates feature selection so it can handle a
large number of variables of presumably differing levels of importance; (2) it is usually robust
against overfitting to the training set (unlike simple decision trees); and (3) it can handle datasets
with highly imbalanced class distributions (as is the case with Crunchbase). Also, unlike most
machine learning classifiers, which are usually black-box models, random forest can be considered
a fairly white-box classifier, since it is possible to estimate the importance of the individual features
used for classification (even though, unlike with a simple decision tree, the relationship between
dependent and independent variables is not immediately detectable). This aspect is particularly
important in the context of decision-making, because investors need to determine which factors are
the most important. Thanks to these advantages and the excellent results generally achieved, in
the most recent works (Arroyo et al., 2019), a random forest algorithm has been used as the baseline
against which to compare the performance of new models.
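The two sources of randomness and the voting rule described above can be sketched in plain Python. To keep the example short, each "tree" here is a decision stump (a tree with a single split) rather than a full decision tree, which is a deliberate simplification and not how the reviewed studies implemented random forest; all names and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]    # (feature values, class label)
Stump = Tuple[int, float, int, int]  # (feature, threshold, left label, right label)


def majority(labels: Sequence[int]) -> int:
    """Mode of a list of class labels."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows: List[Row], features: Sequence[int]) -> Stump:
    """Pick the single split with the fewest training errors on the given features."""
    best = None
    for f in features:
        for x, _ in rows:
            t = x[f]
            left = [y for xi, y in rows if xi[f] <= t]
            right = [y for xi, y in rows if xi[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        lab = majority([y for _, y in rows])
        return (features[0], float("inf"), lab, lab)
    return best[1:]


def fit_forest(rows: List[Row], n_trees: int, n_features: int, seed: int = 0) -> List[Stump]:
    """Train each stump on a bootstrap sample of rows and a random subset of columns."""
    rng = random.Random(seed)
    total = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]    # rows drawn with repetition
        cols = rng.sample(range(total), n_features)  # randomly selected columns
        forest.append(fit_stump(sample, cols))
    return forest


def predict(forest: List[Stump], x: Sequence[float]) -> int:
    """Final class is the mode of the individual predictions."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


rows: List[Row] = [
    ((1.0, 2.0), 0), ((2.0, 1.0), 0), ((1.5, 1.5), 0), ((2.5, 2.5), 0),
    ((7.0, 8.0), 1), ((8.0, 7.0), 1), ((7.5, 9.0), 1), ((9.0, 8.5), 1),
]
forest = fit_forest(rows, n_trees=25, n_features=1)
print(predict(forest, (0.5, 0.5)), predict(forest, (9.5, 9.5)))
```

Because each stump sees different rows and columns, an individual stump may be poor, but the vote across many of them tends to be more stable than any single one.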
From a technical point of view, most authors do not explicitly state the programming languages or
software used for the implementation of models. For company multilabel classification, one work
used RapidMiner by Rapid-I (formerly known as YALE, or Yet Another Learning Environment),
a rapid prototyping environment for data mining (Stone, 2014). To predict investment relationships,
Shan et al. (2014) used the Stanford CoreNLP natural language software. Weka has been used to
classify companies by industry (Batista et al., 2015), to predict the exit event of a company (Krishna
et al., 2016) and in the data preprocessing phase (Bento, 2017). As for programming languages,
two works used Python (Bento, 2017, Arroyo et al., 2019), and another was implemented in R (Bo
Guang Huang, 2016).
10. Comparing the performances of different models
To evaluate and compare the results of previous work, the performance metrics considered by each
contribution were analyzed. This analysis can be useful for future researchers, allowing them to
identify the most relevant metrics to consider for presenting their results. Appendix 3 details all the
metrics used in the context of the different research goals.
In a classification problem, the most relevant metrics are accuracy, precision, recall, F1 score and
the area under the receiver operating characteristic (ROC) curve. In defining these metrics, the following
terminology is used: true positive predictions (TP), true negative predictions (TN), false positive
predictions (FP) and false negative predictions (FN). The meanings of the five metrics are as
follows:
• Accuracy is the ratio of correct predictions (i.e., both true positives and true negatives) to
the total number of predictions (either correct or incorrect).
ACC = (TP + TN) / (TP + TN + FP + FN)
• Precision is the ratio of positive predictions that are correct to the total number of positive
predictions (either correct or incorrect). High precision means that an algorithm returns
substantially more relevant than irrelevant results.
PPV = TP / (TP + FP)
• Recall (or true positive rate or sensitivity) is the ratio of positive predictions that are correct
to the total number of real positive cases. High recall means that an algorithm returns most
of the relevant results.
TPR = TP / (TP + FN)
• F1 score is the harmonic mean of precision and recall.
F1 = 2 * (PPV * TPR) / (PPV + TPR)
• The receiver operating characteristic (ROC) curve represents the performance of a classification model
by plotting true positive rate (y-axis) against false positive rate (x-axis) at all threshold
settings. The area under the ROC curve (AUC) is the integral of the ROC curve between 0
and 1 and provides an aggregate measure of performance across all threshold levels. The AUC of an
excellent model is between 90% and 100%, whereas results between 50% and 60% are
considered very poor because they mean a model is unable to separate classes (Hanley and
McNeil, 1982).
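The five metrics defined above can be computed directly from the confusion-matrix counts and from the model scores. In the sketch below, the AUC is obtained through its rank interpretation (the fraction of positive and negative pairs in which the positive case receives the higher score, with ties counted as one half), which is equivalent to the area under the ROC curve. The function names, counts and scores are invented for illustration.

```python
from typing import Sequence


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """PPV = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """TPR = TP / (TP + FN); defined as 0.0 when there are no real positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(ppv: float, tpr: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical confusion-matrix counts and model scores.
tp, tn, fp, fn = 30, 50, 10, 10
acc = accuracy(tp, tn, fp, fn)
ppv = precision(tp, fp)
tpr = recall(tp, fn)
f1 = f1_score(ppv, tpr)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(acc, ppv, tpr, f1, area)
```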
Considering, for example, the binary classification problem of predicting whether a company will
be successful (e.g., will make an exit by being acquired or going public), accuracy can be defined
as the percentage of companies among the total that are classified correctly as either successful or
unsuccessful. However, when using the Crunchbase dataset, accuracy is a risky choice as a metric
for evaluating the quality of a model. This is because Crunchbase has a large class imbalance
between successful (e.g., status equal to acquired or ipo), unsuccessful (e.g., closed) and uncertain
(e.g., operating) companies. For example, in Bento (2017), after the data preprocessing activity,
only 17% of the companies in the dataset were labeled as successful. When dealing with strongly
unbalanced datasets, most machine learning algorithms tend to classify the least represented class
as the most represented one. For example, in Bento (2017), if all companies had been predicted to
be unsuccessful, the model would still have achieved an accuracy near 83%. In classification
problems with highly unbalanced datasets, therefore, accuracy is not a reliable metric for assessing
the performance of a model.
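The effect described above can be checked with a short calculation. The 17% share of successful companies is the figure reported for Bento (2017); the dataset size of 1,000 companies is a hypothetical value chosen only to make the arithmetic concrete.

```python
n_total = 1000                     # hypothetical dataset size
n_success = round(0.17 * n_total)  # 17% labeled successful, as in Bento (2017)
n_fail = n_total - n_success

# Trivial model: predict "unsuccessful" for every company.
tp, fp = 0, 0
tn, fn = n_fail, n_success

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The trivial model reaches high accuracy while identifying none of the successful companies, which is why accuracy alone is misleading on such data.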
Recall, precision and AUC are more satisfactory metrics when the dataset is unbalanced. In this
context, recall can be defined as the ratio of the number of successful companies correctly identified
as successful (TP) to the actual number of successful companies (TP + FN). On the other hand,
precision can be defined as the ratio of the number of successful companies correctly identified as
successful (TP) to the number of companies classified as successful (TP + FP). In a scenario in
which an investor would like to use a machine learning model as a tool to support his/her decision-
making process, the choice of which metric to apply in selecting the best model to use would
depend on the investment strategy. For example, the model with the highest recall would identify
as many successful companies as possible among all the successful ones in the database. However,
investors who can select only a limited number of companies for a portfolio are likely to be less
focused on identifying all potentially successful companies. Instead, investors are interested in
ensuring the companies predicted to be successful are correctly classified as such and do not fail.
Not investing in a successful company, which constitutes a false negative, results in a missed
opportunity but not a capital loss. On the other hand, investing in a company that is a false positive
translates into an actual capital loss and constitutes a riskier scenario for a venture capital firm.
Hence, from an investor point of view, precision may be a more meaningful metric for measuring
the performance of a machine learning model.
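The trade-off between the two metrics can be made concrete by varying the decision threshold applied to a model's predicted probabilities. The sketch below uses invented scores and labels, and the two thresholds are arbitrary; it only illustrates that a stricter threshold tends to raise precision at the cost of recall.

```python
from typing import Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when companies scoring at least `threshold` are selected."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities of success and true outcomes (1 = successful).
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = precision_recall_at(scores, labels, threshold=0.8)
loose = precision_recall_at(scores, labels, threshold=0.25)
print("strict threshold:", strict)
print("loose threshold:", loose)
```

An investor with a small portfolio would lean toward the strict setting, accepting missed opportunities in exchange for fewer false positives.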
Having identified the most significant metrics for an effective comparison of the results of previous
works, we summarize in Table 3 the results of each study in terms of accuracy, precision,
recall, F1 score and AUC. For consistency, if one of the selected metrics was not considered in a
study, that measure has not been reported for that study.
As can be noted, the research area that produced the best results concerned the problem of
predicting the exit event of a company. Specifically, three works achieved an accuracy, recall and
AUC all greater than 90% (Krishna et al., 2016, Bento, 2017, Ünal, 2019). Work on the other
research goals achieved more modest results.
[TABLE 3]
The obtained results, however, are not sufficient for correctly evaluating and comparing the
performances of different studies; it is necessary to also consider the size of the datasets used. As
reported in Table 4, the studies used datasets of very different sizes. In addition to the dataset size,
another key factor for evaluating the performance of a machine learning classification algorithm is
the number of samples in each class of the training set and test set. Despite the fundamental
importance of these values, only three articles clearly declared the number of samples in each class
(undeclared information is marked in Table 4 as N.A.). The lack of such information makes it
extremely difficult to compare the obtained results.
[TABLE 4]
11. Discussion
We started this study trying to answer the research question: "For what purposes have machine
learning tools been applied so far to support equity investors in their decision-making process?"
For this purpose, we needed to investigate which questions previous researchers have tried to
answer using machine learning models.
Among the seven research objectives identified in this study, two in particular deserve to be
considered as examples in this regard: predicting the exit event of a company and predicting a
funding relationship between an investor and a company. From an investor point of view, the first
task aims to predict the probability that a company will make an exit so that the investor can select
only those companies with the highest probability of success. The second task is meant to predict
whether a target investment firm will invest in a specific venture, a prediction that may allow an
investor to identify syndicate partners.
The use of existing machine learning models can be effective for these objectives, especially in the
latter case. In fact, when considering whether or not to invest in a company, VCs generally seek
social proof of the deal’s value by carefully analyzing whether a credible and effective firm
has already invested in that company or has at least endorsed the company in some way. Since
successful investment firms have a well-known track record, it is possible to measure their
performance. Using these data, a quantitative version of social proof can be implemented, and
machine learning models can be used to identify successful investors potentially interested in a deal
and thus to select syndicate partners. Selecting the proper syndicate partners has a great impact on
investment risk management, and it is therefore important to be able to identify which partners to
involve. Generally, VCs select partners from a small group of firms they already know. Using data
mining techniques (e.g., machine learning, web crawling or text analytics) could enable investors
to broaden their scope of action by identifying the most suitable partners from online platforms and
professional networks.
On the other hand, the use of machine learning with Crunchbase data has some limitations. Despite
the enormous amount of data made available by Crunchbase, the database suffers from a large
amount of missing and sparse data, and thus its data requires careful preprocessing before it can be
used effectively. To obtain a dataset of homogeneous samples, it is also important to carry out a
preliminary filtering operation on the companies selected.
In addition, the information available on Crunchbase is mostly quantitative (e.g., number of years
from the company’s foundation and number of raised funding rounds) or is made measurable (e.g.,
the work experience of each founder is computed by counting the number of companies on
Crunchbase that report each founder as a past employee). Although such information can play a
key role in identifying common patterns for successful companies, previous models have not
considered qualitative information about the entrepreneur, such as personality, motivation and
commitment. This type of data is extremely difficult to collect, since collecting it would require an
in-depth analysis of each individual entrepreneurial team. Similarly, information about products or
services provided by the companies is not directly available, making it difficult to exactly
characterize the value proposition of each venture. It is therefore reasonable to discuss the potential
and the boundaries within which the developed machine learning models should be used, at least
at this point in time.
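The way a qualitative notion such as work experience is made measurable can be sketched as a simple count. The example below counts, for each founder, the number of distinct companies that list him or her as a past employee. The record layout, names and companies are hypothetical and do not reflect the actual Crunchbase schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (person, company) records of past employment.
past_jobs = [
    ("alice", "Acme"),
    ("alice", "Globex"),
    ("bob", "Acme"),
    ("alice", "Acme"),  # duplicate record, counted once
]


def work_experience(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct companies listing each person as a past employee."""
    companies: Dict[str, Set[str]] = defaultdict(set)
    for person, company in records:
        companies[person].add(company)  # a set ignores repeated records
    return {person: len(names) for person, names in companies.items()}


experience = work_experience(past_jobs)
print(experience)
```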
Considering the criteria used by investors in assessing a company for funding, a serious limitation
to effective application of machine learning for evaluating startups is the lack of qualitative data in
current databases. For example, in choosing which startup to invest in, investors carefully assess
the founding team. Although Crunchbase provides data about the team members’ background (in
terms of both education and work experience), information about the personality and motivation of
the founders is difficult to evaluate without direct knowledge of each person. Although sector,
market and product performance can be numerically evaluated, the attitudes of the team members
and their ability to execute the entrepreneurial vision are still difficult to model. It is therefore
important to underline how the available data and the implemented models are currently able to
manage only a simplified version of the investment selection activity.
However, contextualized within the different phases of the investment process, a machine learning
model can be a very useful support tool—for example, in the proposal screening phase. Considering
the large number of projects that VCs must assess, using a set of models to automatically identify
those that are potentially most interesting would allow investors to focus their resources and carry
out an accurate analysis of the most promising companies only. In this context, machine learning
can be used by investors as a tool for screening and preliminarily assessing the quality of a business
proposal, and it can provide valuable insights from the processing of large amounts of quantitative
data.
12. Conclusion and future research
Machine learning can provide equity investors valuable support in screening and evaluating
companies. Some investment firms have already started to apply a data-driven approach to their
decision-making process, combining the use of algorithms with human experience and judgments.
Researchers have recently begun to explore the potential of machine learning in areas related to the
identification of successful startup companies. Specifically, seven main research goals have been
identified thus far: predicting the exit event of a company, predicting funding events within a given
period of time, predicting the next event that a company will achieve, predicting investment
relationships between companies and investors, generating investment recommendations for
investors, predicting a company valuation and classifying companies by industry. Starting from
previous research, an example of how different types of models could be used in a synergistic way
was also discussed.
All the studies considered in this article used Crunchbase as the main source of data. The database
provides information about companies, investors, people and funding rounds. Starting from the raw
data available, some researchers have developed new features both through a process of feature
engineering and through data integration from other data sources. Previous studies have applied
various machine learning algorithms, but the obtained results are hardly comparable to each other
because of the different performance metrics considered in each study and the frequent lack of
information about the ratios of samples in the training set belonging to the different classes.
However, at least for classification problems, to date random forest has generally been the most
promising algorithm to use.
We should also point out the difference between predictions and decisions. Predictions through
machine learning models are better than those made by human beings especially when calculating
the complex interactions between different variables. As the number of such interactions increases,
human beings' ability to form accurate forecasts decreases compared to machines. Prediction is a
key ingredient in decision making under uncertain conditions. However, a prediction is not a
decision; it is only a component of a decision. The human component remains fundamental in many
activities, such as data collection and selection (for example, data regarding teams), evaluation or
judgment, and subsequent actions such as actual investments. Future research should investigate the
relationship between the machine component and the human component in investment decisions.
In conclusion, the use of machine learning as a tool to support the decision-making process of
investors is a topic of great interest and presents much room for further development. This article
aimed to provide scholars with useful elements for entering this highly interdisciplinary research
area and to support and speed up progress in the field.
13. References
Adomavicius, G., and Tuzhilin, A. (2005). Toward the next generation of recommender systems:
A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and
Data Engineering, (6), 734-749.
Agrawal, A., Gans, J., and Goldfarb, A. (2018). Prediction machines: The simple economics of
artificial intelligence. Boston, MA: Harvard Business Press.
Arroyo, J., Corea, F., Jimenez-Diaz, G., and Recio-Garcia, J. A. (2019). Assessment of machine
learning performance for decision support in venture capital investments. IEEE Access.
Askaryan, T., and Frost, C. (2015). Predicting Valuations of Venture Capital Funded Private
Companies. (CS229: Machine Learning, Fall 2015, Stanford University).
Batista, F., and Carvalho, J. P. (2015). Text based classification of companies in CrunchBase. In
2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (pp. 1-7).
Bento, F. R. S. R. (2017). Predicting start-up success with machine learning (Master's thesis,
Universidade Nova de Lisboa).
Block, J. H., Colombo, M. G., Cumming, D. J., and Vismara, S. (2018). New players in
entrepreneurial finance and why they are there. Small Business Economics, 50(2), 239-250.
Bo Guang Huang (2016). Predict startup success using network analysis and machine learning
techniques. (CS224W: Machine Learning with Graphs, Fall 2016, Stanford University).
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the
author). Statistical science, 16(3), 199-231.
Brynjolfsson, E., and McAfee, A. (2014). The second machine age: Work, progress, and prosperity
in a time of brilliant technologies. WW Norton and Company.
Cockburn, I. M., Henderson, R., and Stern, S. (2018). The impact of artificial intelligence on
innovation (No. w24449). National Bureau of Economic Research.
Corea, F. (2018). An Introduction to Data: Everything You Need to Know About AI, Big Data and
Data Science (Vol. 50). Springer.
Crunchbase (2020, January 7). Where does Crunchbase get their data? Accessed January 20, 2020.
https://support.Crunchbase.com/hc/en-us/articles/360009616013
Cumming, D. J., and Vismara, S. (2017). De-segmenting research in entrepreneurial finance. Venture
Capital, 19(1-2), 17-27.
Dalle, J. M., Den Besten, M., and Menon, C. (2017). Using Crunchbase for economic and
managerial research. OECD Science, Technology and Industry Working Papers, 2017/08,
OECD Publishing, Paris.
Dellermann, D., and Calma, A. (2018). Making AI Ready for the Wild: The Hybrid Intelligence
Unicorn Hunter. Available at SSRN 3245383.
Ferrati, F., and Muffatto, M. (2019). A Systematic Literature Review of the Assessment Criteria
Applied by Equity Investors. 14th European Conference on Innovation and Entrepreneurship
(pp. 304-312). Kalamata, Greece.
Ferrati, F., and Muffatto, M. (2020, June). Using Crunchbase for Research in Entrepreneurship: Data
Content and Structure. 19th European Conference on Research Methodology for Business and
Management Studies (ARSO 2020), 342-351.
Ferrati, F., and Muffatto, M. (2020, July). Setting Crunchbase for Data Science: Preprocessing, Data
Integration and Feature Engineering. 3rd International Conference on Advanced Research
Methods and Analytics (CARMA 2020).
Gastaud, C., Carniel, T., and Dalle, J. M. (2019). The varying importance of extrinsic factors in the
success of startup fundraising: competition at early-stage and networks at growth-stage. arXiv
preprint arXiv:1906.03210.
George, G., Haas, M. R., and Pentland, A. (2014). Big data and management. Academy of
Management Journal, 57, 321–332.
Hall, J., and Hofer, C. W. (1993). Venture capitalists' decision criteria in new venture evaluation.
Journal of business venturing, 8(1), 25-42.
Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology, 143(1), 29-36.
Huang, J., and Shi, M. Z. (2015). With a Little Help of My (Former) Employer: Past Employment
and Entrepreneurs’ External Financing. In Academy of Management Proceedings (Vol. 2015,
No. 1, p. 12050). Briarcliff Manor, NY 10510: Academy of Management.
Krishna, A., Agrawal, A., and Choudhary, A. (2016). Predicting the outcome of startups: less
failure, more success. In 2016 IEEE 16th International Conference on Data Mining Workshops
(ICDMW) (pp. 798-805).
Liang, Y. E., and Yuan, S. T. D. (2016). Predicting investor funding behavior using Crunchbase
social network features. Internet Research, 26(1), 74-100.
Liu, X., and Wangperawong, A. (2018). A Collaborative Approach to Angel and Venture Capital
Investment Recommendations. arXiv preprint arXiv:1807.09967.
Mayer-Schönberger, V., and Cukier, K. (2013). Big Data: A Revolution that Will Transform
how We Live, Work, and Think. Houghton Mifflin Harcourt.
Menon, C., and Tarasconi, G. (2017). Matching Crunchbase with patent data, OECD Science,
Technology and Industry Working Papers, 2017/07, OECD Publishing, Paris.
Metrick, A., and Yasuda, A. (2010). Venture capital and the finance of innovation (2nd ed.). John
Wiley and Sons, Inc.
Miloud, T., Aspelund, A., and Cabrol, M. (2012). Startup valuation by venture capitalists: an
empirical study. Venture Capital, 14(2/3), 151-174.
Moeller, S. B., Schlingemann, F. P., and Stulz, R. M. (2005). Wealth destruction on a massive
scale? A study of acquiring‐firm returns in the recent merger wave. The journal of finance,
60(2), 757-782.
Obschonka, M., and Audretsch, D. B. (2019). Artificial intelligence and big data in
entrepreneurship: a new era has begun. Small Business Economics, 1-11.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal
of research and development, 3(3), 210-229.
Sarasvathy, S. D. (2001). Causation and effectuation: Toward a theoretical shift from economic
inevitability to entrepreneurial contingency. Academy of Management Review, 26, 243–263.
Sarasvathy, S. D. (2009). Effectuation: Elements of entrepreneurial expertise. Edward Elgar
Publishing.
Scott, D. A. (2012, June 7). Is Venture Capital Broken? Retrieved from Harvard Business Review:
https://hbr.org/2012/06/is-venture-capital-broken
Semeniuta, D., and Ismail, M. (2017). Clustering Startups Based on Customer-Value Proposition.
Shan, Z., Cao, H., and Lin, Q. (2014). Capital Crunch: Predicting Investments in Tech Companies.
(course project, Fall 2014, Stanford University).
Sharchilev, B., Roizner, M., Rumyantsev, A., Ozornin, D., Serdyukov, P., and de Rijke, M. (2018).
Web-based startup success prediction. In Proceedings of the 27th ACM International
Conference on Information and Knowledge Management (pp. 2283-2291).
Silva, J. (2004). Venture capitalists' decision making in small equity markets: a case study using
participant observation. Venture Capital, 6(2-3), 125-145.
Stone, T. R. (2014). Computational analytics for venture finance (Doctoral thesis, University
College London).
Ünal, C. (2019). Searching for a Unicorn: A Machine Learning Approach Towards Startup Success
Prediction (Master's thesis, Humboldt-Universität zu Berlin).
Xiang, G., Zheng, Z., Wen, M., Hong, J., Rose, C., and Liu, C. (2012). A supervised approach to
predict company acquisition with factual and topic features using profiles and news articles on
TechCrunch. In Sixth International AAAI Conference on Weblogs and Social Media.
Zacharakis, A., and Shepherd, D. A. (2007). The Pre-investment Process: Venture Capitalists’
Decision Policies. In Handbook of Research on Venture Capital (chapter 6). Edward Elgar
Publishing.
Zhang, C., Chan, E., and Abdulhamid, A. (2015). Link prediction in bipartite venture capital
investment networks. CS224-w report, Stanford.
Zhong, H. (2019). Venture capital investment: from rule of thumb to data science (Doctoral thesis,
Rutgers University-Graduate School-Newark).
Zomaya, A. Y., and Sakr, S. (2017). Handbook of big data technologies. Springer.
Table 1: Identified research goals and machine learning approaches
The capital letters indicate the applied machine learning approach: supervised learning binary
classification (A), supervised learning multiclass classification (B), supervised learning multilabel
classification (C), supervised learning regression (D), unsupervised learning (E) and recommender
system (F).

| Study | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|
| Xiang et al., 2012 | A | - | - | - | - | - | - |
| Stone, 2014 | - | - | - | - | F | - | C |
| Shan et al., 2014 | - | - | - | A | - | - | - |
| Batista et al., 2015 | - | - | - | - | - | - | B |
| Zhang et al., 2015 | - | - | - | E | - | - | - |
| Huang et al., 2015 | - | - | - | - | - | - | E |
| Askaryan et al., 2015 | - | - | - | - | - | D | - |
| Krishna et al., 2016 | A | - | - | - | - | - | - |
| Bo Guang Huang, 2016 | A | - | - | - | - | - | - |
| Liang et al., 2016 | - | - | - | A | - | - | - |
| Dellermann et al., 2017 | - | A | - | - | - | - | - |
| Semeniuta et al., 2017 | - | - | - | - | - | - | E |
| Bento, 2017 | A | - | - | - | - | - | - |
| Pan et al., 2018 | A | - | - | - | - | - | - |
| Liu et al., 2018 | - | - | - | - | F | - | - |
| Sharchilev et al., 2018 | - | A | - | - | - | - | - |
| Arroyo et al., 2019 | - | - | B | - | - | - | - |
| Zhong, 2019 | - | - | - | - | F | - | - |
| Ünal, 2019 | A | - | - | - | - | - | - |
| Gastaud et al., 2019 | - | A | - | - | - | - | - |
| Total | 6 | 3 | 1 | 3 | 3 | 1 | 4 |
Table 2: Values of the status variable to label companies as successful (1) or not (0)

| Study | Operating | Acquired | IPO | Closed |
|---|---|---|---|---|
| Xiang et al., 2012 | N.A. | 1 | 0 | 0 |
| Bo Guang Huang, 2016; Bento, 2017; Pan et al., 2018 | 0 | 1 | 1 | 0 |
| Ünal, 2019 | 1 | 1 | 1 | 0 |
| Krishna et al., 2016 | N.A. | N.A. | N.A. | N.A. |
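As a minimal sketch of how the labeling schemes in Table 2 translate into code, the mapping below assigns a binary label to a company's status under each scheme. The lowercase status strings, the scheme keys and the function name are assumptions made for illustration, and N.A. is read here as "no label assigned" (`None`). Krishna et al. (2016) is omitted because all of its entries are N.A.

```python
from typing import Dict, Optional

# Label per status value under each scheme of Table 2 (1 = successful, 0 = not).
# None stands for an N.A. entry, i.e. no label is assigned to that status.
LABELING_SCHEMES: Dict[str, Dict[str, Optional[int]]] = {
    "xiang_2012": {"operating": None, "acquired": 1, "ipo": 0, "closed": 0},
    "huang_bento_pan": {"operating": 0, "acquired": 1, "ipo": 1, "closed": 0},
    "unal_2019": {"operating": 1, "acquired": 1, "ipo": 1, "closed": 0},
}


def label_company(status: str, scheme: str) -> Optional[int]:
    """Return the success label of a company status under the chosen scheme."""
    return LABELING_SCHEMES[scheme][status.strip().lower()]


# The same company can be a positive or a negative sample depending on the scheme.
for scheme in LABELING_SCHEMES:
    print(scheme, label_company("Operating", scheme))
```

The loop makes the practical consequence visible: an operating company is unlabeled, negative or positive depending on the study, so performance figures obtained under different schemes are not directly comparable.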
Table 3: Performance achieved by the models of the considered works

| Study | Best model / Technique | Accuracy | Precision | Recall | F1 score | AUC |
|---|---|---|---|---|---|---|
| Predicting the exit event of a company | | | | | | |
| Xiang et al., 2012 | Bayesian Network | N.A. | N.A. | 0.60 - 0.79 | N.A. | 0.68 - 0.94 |
| Krishna et al., 2016 | Random Forest | 0.99 | 0.96 | 0.97 | N.A. | 0.97 |
| Bo Guang Huang, 2016 | Random Forest | 0.72 | 0.51 | 0.22 | 0.31 | N.A. |
| Bento, 2017 | Random Forest | 0.93 | 0.92 | 0.94 | N.A. | 0.93 |
| Pan et al., 2018 | K-Nearest Neighbors | 0.46 | N.A. | 0.74 | 0.73 | 0.80 |
| Ünal, 2019 | Extreme Gradient Boosting | 0.94 | N.A. | 0.97 | N.A. | 0.93 |
| Predicting funding events in a given period of time | | | | | | |
| Dellermann et al., 2017 | Not implemented yet | N.A. | N.A. | N.A. | N.A. | N.A. |
| Sharchilev et al., 2018 | Logistic Regression + Neural Network + CatBoost | N.A. | N.A. | N.A. | N.A. | 0.85 |
| Gastaud et al., 2019 | Graph Convolutional Networks | N.A. | 0.64 | 0.64 | 0.64 | 0.64 |
| Predicting the next event that a company will achieve | | | | | | |
| Arroyo et al., 2019 | Gradient Tree Boosting | 0.82 | 0.85 | 0.95 | 0.90 | N.A. |
| Predicting investment relationship | | | | | | |
| Shan et al., 2014 | Logistic Regression | N.A. | 0.86 | 0.89 | 0.88 | N.A. |
| Zhang et al., 2015 | Preferential Attachment Link Prediction | N.A. | 0.004 | 0.013 | 0.006 | N.A. |
| Liang et al., 2016 | Support Vector Machine | N.A. | N.A. | 0.90 | N.A. | 0.79 |
| Generating investment recommendations | | | | | | |
| Liu et al., 2018 | Matrix Factorization | 0.13 | N.A. | N.A. | N.A. | N.A. |
| Zhong, 2019 | Bayesian Probabilistic Latent Factor model | 0.62 | N.A. | N.A. | N.A. | N.A. |
| Predicting a company valuation | | | | | | |
| Askaryan et al., 2015 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| Classifying companies by industry | | | | | | |
| Stone, 2014 | Naive Bayes | 0.46 | N.A. | N.A. | N.A. | N.A. |
| Batista et al., 2015 | Fuzzy Fingerprints Methods | 0.44 | N.A. | N.A. | N.A. | N.A. |
| Huang et al., 2015 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| Semeniuta et al., 2017 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
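Since the F1 score is the harmonic mean of precision and recall, the rows of Table 3 that report all three values can be checked for internal consistency. The sketch below is illustrative only: the function name and the 0.01 tolerance are assumptions, and because the inputs are already rounded to two decimals the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the studies in Table 3 that report all three.
reported = {
    "Bo Guang Huang, 2016": (0.51, 0.22, 0.31),
    "Gastaud et al., 2019": (0.64, 0.64, 0.64),
    "Arroyo et al., 2019": (0.85, 0.95, 0.90),
    "Shan et al., 2014": (0.86, 0.89, 0.88),
}

for study, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_score(precision, recall)
    consistent = abs(f1_computed - f1_reported) <= 0.01  # allow for rounding
    print(f"{study}: computed {f1_computed:.3f}, reported {f1_reported:.2f}, "
          f"consistent={consistent}")
```

The same check also illustrates why accuracy alone can mislead: Bo Guang Huang (2016) reports an accuracy of 0.72 together with a recall of 0.22, so the F1 score is a more informative summary for the positive class.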
Table 4: Size of the dataset of the considered studies

| Study | Considered data sources | Dataset size | # positive samples | # negative samples | Dataset split |
|---|---|---|---|---|---|
| Predicting the exit event of a company | | | | | |
| Xiang et al., 2012 | Crunchbase; TechCrunch | N.A. | N.A. | N.A. | N.A. |
| Krishna et al., 2016 | Crunchbase; TechCrunch | 11,000 companies | 7,000 | 4,000 | N.A. |
| Bo Guang Huang, 2016 | Crunchbase; TechCrunch | N.A. | N.A. | N.A. | Training: 0.9; Test: 0.1 |
| Bento, 2017 | Crunchbase | 143,348 companies (after over-sampling using SMOTE) | 70,950 | 72,398 | Training: 0.7; Test: 0.3 |
| Pan et al., 2018 | Crunchbase | 32,700 companies | N.A. | N.A. | Training: 0.9; Test: 0.05; Val: 0.05 |
| Ünal, 2019 | Crunchbase | 44,522 companies | N.A. | N.A. | Training: 0.7; Test: 0.3 |
| Predicting funding events in a given period of time | | | | | |
| Dellermann et al., 2017 | Crunchbase; Mattermark; Dealroom | 1,500 companies | N.A. | N.A. | N.A. |
| Sharchilev et al., 2018 | Crunchbase; LinkedIn; web-crawling | 316,185 samples | N.A. | N.A. | Training: 0.7; Test: 0.3 |
| Gastaud et al., 2019 | Crunchbase | 618,366 companies; 221,299 investment rounds; 783,787 people; 6,363,831 news articles on Crunchbase | N.A. | N.A. | N.A. |
| Predicting the next event that a company will achieve | | | | | |
| Arroyo et al., 2019 | Crunchbase | 120,507 companies; 34,180 funding rounds | N.A. | N.A. | N.A. |
| Predicting investment relationship | | | | | |
| Shan et al., 2014 | Crunchbase | 50,956 companies | 7,749 | 43,207 | Training: 0.75; Test: 0.25 |
| Zhang et al., 2015 | Crunchbase | about 105,000 edges; 55,000 nodes (34,000 companies and 21,000 investors) | N.A. | N.A. | Training: 0.7; Test: 0.3 |
| Liang et al., 2016 | Crunchbase | 5,341 investment activities; 25,165 nodes (11,916 companies, 12,127 people and 1,122 financial organizations) | N.A. | N.A. | Training: 0.4; Test: 0.6 |
| Generating investment recommendations | | | | | |
| Liu et al., 2018 | Crunchbase | 21,417 companies; 16,946 investors; 80,245 investments | N.A. | N.A. | N.A. |
| Zhong, 2019 | Crunchbase; About.me | 4,007 companies; 1,467 investors; 17,485 investments | N.A. | N.A. | N.A. |
| Predicting a company valuation | | | | | |
| Askaryan et al., 2015 | Crunchbase; Data Hub; DataFox; CBInsights | N.A. | N.A. | N.A. | Training: 400 VC-backed companies |
| Classifying companies by industry | | | | | |
| Stone, 2014 | Crunchbase; AngelList; Dow Jones VentureSource | 20,000 company descriptions | Not applicable | Not applicable | N.A. |
| Batista et al., 2015 | Crunchbase | 119,000 companies | N.A. | N.A. | Training: 0.70; Test: 0.15; Dev: 0.15 |
| Huang et al., 2015 | CrunchBase (July 2013) | 155,000 companies | N.A. | N.A. | Not applicable |
| Semeniuta et al., 2017 | Crunchbase; Pitchbook | 70,000 company descriptions; feature set of 90,000 words | N.A. | N.A. | N.A. |
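The figures in Table 4 can be turned into two quantities that matter when comparing studies: the share of positive samples and the number of samples in each split. The following sketch uses only values from the table, with the counts of Shan et al. (2014) read as 7,749 and 43,207, which sum to the stated 50,956 companies. The function names and the rule that the last split absorbs rounding are assumptions made for illustration.

```python
from typing import Dict


def positive_share(n_positive: int, n_negative: int) -> float:
    """Fraction of samples that belong to the positive class."""
    return n_positive / (n_positive + n_negative)


def split_sizes(n_samples: int, fractions: Dict[str, float]) -> Dict[str, int]:
    """Convert split fractions into sample counts; the last split absorbs rounding."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    names = list(fractions)
    sizes = {name: int(n_samples * fractions[name]) for name in names[:-1]}
    sizes[names[-1]] = n_samples - sum(sizes.values())
    return sizes


# Class balance for the studies that report both class counts.
class_counts = {
    "Krishna et al., 2016": (7_000, 4_000),
    "Bento, 2017 (after SMOTE)": (70_950, 72_398),
    "Shan et al., 2014": (7_749, 43_207),
}
for study, (n_pos, n_neg) in class_counts.items():
    print(f"{study}: {positive_share(n_pos, n_neg):.1%} positive")

# Split sizes implied by the reported fractions for Shan et al. (2014).
print(split_sizes(50_956, {"train": 0.75, "test": 0.25}))
```

Read this way, the dataset of Bento (2017) is close to balanced only because of the SMOTE over-sampling, while in Shan et al. (2014) roughly one sample in seven is positive, which is worth keeping in mind when reading the metrics of Table 3.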
| Category | Subcategory | # | % | Example of feature |
|---|---|---|---|---|
| Features | | 239 | 100 | |
| Company's basic information | | 74 | 31.0 | |
| | Company's location | 10 | 4.2 | Country |
| | Company's lifetime | 9 | 3.8 | foundation date |
| | Company's competitors¹ | 8 | 3.3 | number of competitors |
| | Company's product¹ | 8 | 3.3 | number of products |
| | Company's website | 8 | 3.3 | website available |
| | Company's industry and market | 7 | 2.9 | industry |
| | Company's social media presence | 6 | 2.5 | Facebook profile available |
| | Acquisitions made by the company | 5 | 2.1 | number of ventures acquired |
| | Company's business model | 3 | 1.3 | type of business model |
| | Investments made by the company | 3 | 1.3 | number of investments made |
| | Company's customers¹ | 2 | 0.8 | number of customers |
| | Company's milestones¹ | 2 | 0.8 | number of milestones |
| | Company's contact information | 2 | 0.8 | email available |
| | Company's Crunchbase profile updates | 1 | 0.4 | number of revisions |
| Company's people information | | 59 | 24.7 | |
| | Work experience | 13 | 5.4 | founders’ years of experience |
| | Education | 12 | 5.0 | number of graduate founders |
| | Team | 11 | 4.6 | number of members |
| | Founders | 8 | 3.3 | number of founders |
| | Board members | 8 | 3.3 | number of board members |
| | Employees | 4 | 1.7 | number of employees |
| | Founders previous entrep. experience | 3 | 1.3 | number of ventures founded |
| Funding rounds | | 69 | 28.9 | |
| | Investors involved | 19 | 7.9 | number of inv. per round |
| | Timing of rounds | 18 | 7.5 | date of the first funding |
| | Round amount (money) | 16 | 6.7 | amount per round type |
| | Number of rounds | 12 | 5.0 | total number of rounds |
| | Company valuation | 4 | 1.7 | valuation after each round |
| 3rd party support | | 3 | 1.3 | |
| | Commitment of 3rd party support | 3 | 1.3 | knowledge support |
| Mentions | | 9 | 3.8 | |
| | In news | 7 | 2.9 | number of articles in CB |
| | Links | 2 | 0.8 | number of links |
| Features from text | | 6 | 2.5 | |
| | Company's description in Crunchbase | 4 | 1.7 | bag-of-words |
| | Articles in TechCrunch | 2 | 0.8 | features extraction |
| Network analysis attributes | | 19 | 7.9 | |
| | Startup centrality scores | 8 | 3.3 | pageRank |
| | Features based on node neighborhood… | 6 | 2.5 | closeness |
| | Investor-to-startup centrality score | 5 | 2.1 | betweenness |

Figure 1: Taxonomy of the features directly available or derivable from Crunchbase

¹ Data about the companies’ products, competitors, customers and milestones are key elements when assessing a potential investment. Despite
the importance of these variables, the information was provided by a very small number of company profiles and was therefore removed in
the current version of Crunchbase. Data about products, competitors, customers and milestones are therefore no longer directly available and
their identification requires the use of data mining techniques.
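The percentages in Figure 1 are each category's share of the 239 features. As a small worked example, the sketch below recomputes the category-level shares from the counts given in the figure; the variable names are chosen for illustration only.

```python
# Number of features per top-level category, as reported in Figure 1.
category_counts = {
    "Company's basic information": 74,
    "Company's people information": 59,
    "Funding rounds": 69,
    "3rd party support": 3,
    "Mentions": 9,
    "Features from text": 6,
    "Network analysis attributes": 19,
}

total_features = sum(category_counts.values())

# Share of each category, in percent, rounded to one decimal as in the figure.
category_shares = {
    name: round(100 * count / total_features, 1)
    for name, count in category_counts.items()
}

print(f"total features: {total_features}")
for name, share in category_shares.items():
    print(f"{name}: {share}%")
```

The same calculation applies to the subcategories, for example 11 of 239 features for the team subcategory.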
Figure 2: Top 10 algorithms or analysis techniques per number of contributions
Appendix 1 – Classification of the features provided by Crunchbase
The numbers in the table refer to the number of studies that considered each feature (in the rows)
to address a specific research task (in the columns).

| Feature | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|---|
| Number of documents by research goal | | 6 | 3 | 1 | 3 | 2 | 1 | 4 |
| Number of considered features by research goal | | 89 | 108 | 24 | 28 | 24 | 17 | 2 |
| COMPANY'S BASIC INFORMATION | | | | | | | | |
| Company's lifetime | | | | | | | | |
| company foundation (date) | 1 | 1 | - | - | - | - | - | - |
| company foundation (year) | 2 | - | - | - | 1 | - | 1 | - |
| company age (# of years since foundation) | 3 | 2 | 1 | - | - | - | - | - |
| company age (# of months since foundation) | 3 | 2 | - | 1 | - | - | - | - |
| company age (# of days since foundation) | 2 | - | 2 | - | - | - | - | - |
| company closing (date) | 1 | 1 | - | - | - | - | - | - |
| company's offices age (min) | 1 | - | 1 | - | - | - | - | - |
| company's offices age (max) | 1 | - | 1 | - | - | - | - | - |
| company's offices age (average) | 1 | - | 1 | - | - | - | - | - |
| Company's industry and market | | | | | | | | |
| Crunchbase category (overall industry/sector) | 8 | 2 | 2 | 1 | 1 | 1 | 1 | - |
| number of Crunchbase categories | 1 | - | 1 | - | - | - | - | - |
| category group (mapped to the 11 industries classification in S&P 500) | 1 | 1 | - | - | - | - | - | - |
| rough categorization of industries (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| specific market | 2 | - | - | - | - | 1 | 1 | - |
| whether it is a tech company (boolean) | 1 | 1 | - | - | - | - | - | - |
| B2B vs. B2C | 1 | - | 1 | - | - | - | - | - |
| Company's location | | | | | | | | |
| headquarter location (Continent) | 1 | 1 | - | - | - | - | - | - |
| headquarter location (Country) | 4 | 1 | - | 1 | 1 | 1 | - | - |
| headquarter location (Region) | 1 | - | - | - | - | 1 | - | - |
| headquarter location (State) | 2 | 1 | - | - | - | 1 | - | - |
| headquarter location (City) | 2 | - | - | - | 1 | 1 | - | - |
| headquarter location (not specified in detail) | 2 | 2 | - | - | - | - | - | - |
| number of offices | 2 | 1 | 1 | - | - | - | - | - |
| numbers of offices in different countries or cities | 1 | - | 1 | - | - | - | - | - |
| distance between the startup and the investor | 1 | - | - | - | - | 1 | - | - |
| number of startups in a region | 1 | - | 1 | - | - | - | - | - |
| Company's competitors | | | | | | | | |
| whether has at least one competitor (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of competitors | 7 | 2 | 3 | - | 1 | 1 | - | - |
| competitors (not specified in detail) | 1 | - | 1 | - | - | - | - | - |
| whether has at least one competitor that was acquired or made an IPO (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of competitors that were acquired | 1 | 1 | - | - | - | - | - | - |
| number of competitors that were acquired or made an IPO | 1 | 1 | - | - | - | - | - | - |
| money raised by competitors | 1 | - | 1 | - | - | - | - | - |
| money raised by competitors in the last year | 1 | - | 1 | - | - | - | - | - |
| Feature | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|---|
| Company's customers | | | | | | | | |
| whether has at least one customer (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of customers | 1 | 1 | - | - | - | - | - | - |
| Company's milestones | | | | | | | | |
| number of milestones in the CrunchBase profile | 1 | 1 | - | - | - | - | - | - |
| entrepreneurial vision | 1 | - | 1 | - | - | - | - | - |
| Company's products | | | | | | | | |
| whether has at least a proof of concept (boolean) | 1 | - | 1 | - | - | - | - | - |
| number of products | 3 | 1 | 1 | - | - | 1 | - | - |
| age of products (min) | 1 | - | 1 | - | - | - | - | - |
| age of products (max) | 1 | - | 1 | - | - | - | - | - |
| age of products (average) | 1 | - | 1 | - | - | - | - | - |
| technological hype (phase in Gartner hype cycle) | 1 | - | 1 | - | - | - | - | - |
| product innovativeness | 1 | - | 1 | - | - | - | - | - |
| number of providers | 1 | 1 | - | - | - | - | - | - |
| Company's business model | | | | | | | | |
| type of business model | 1 | - | 1 | - | - | - | - | - |
| revenue model | 1 | - | 1 | - | - | - | - | - |
| scalability | 1 | - | 1 | - | - | - | - | - |
| Investments made by the company | | | | | | | | |
| whether has invested in at least one other company (boolean) | 1 | 1 | - | - | - | - | - | - |
| total number of investments made by the company | 3 | 2 | 1 | - | - | - | - | - |
| number of investments made in each company | 1 | - | 1 | - | - | - | - | - |
| Acquisitions made by the company | | | | | | | | |
| whether has acquired at least one other company (boolean) | 1 | 1 | - | - | - | - | - | - |
| total number of acquisitions made | 3 | 2 | - | - | - | 1 | - | - |
| frequency of acquisitions made | 1 | - | - | - | - | 1 | - | - |
| acquisition records (not specified in detail) | 1 | - | - | - | - | 1 | - | - |
| sub-organizations (not specified in detail) | 1 | - | - | - | - | 1 | - | - |
| Company's Crunchbase profile updates | | | | | | | | |
| number of revisions on the company Crunchbase profile | 1 | 1 | - | - | - | - | - | - |
| Company's website | | | | | | | | |
| whether the homepage url is on Crunchbase (boolean) | 1 | - | 1 | - | - | - | - | - |
| number of websites listed on Crunchbase | 2 | - | 1 | - | 1 | - | - | - |
| number of websites created in last 6/12/24 months (3 features) | 1 | - | 1 | - | - | - | - | - |
| estimated monthly web traffic | 1 | - | - | - | - | - | 1 | - |
| website visits | 1 | - | 1 | - | - | - | - | - |
| website average duration | 1 | - | 1 | - | - | - | - | - |
| website backlinks | 1 | - | 1 | - | - | - | - | - |
| website bounce rate | 1 | - | 1 | - | - | - | - | - |
| Company's social media presence | | | | | | | | |
| whether the Facebook url is on Crunchbase (boolean) | 3 | 1 | 1 | 1 | - | - | - | - |
| whether the Twitter url is on Crunchbase (boolean) | 3 | 1 | 1 | 1 | - | - | - | - |
| whether the LinkedIn url is on Crunchbase (boolean) | 2 | - | 1 | 1 | - | - | - | - |
| number of Twitter followers | 2 | - | 1 | - | - | - | 1 | - |
| number of tweets | 1 | - | 1 | - | - | - | - | - |
| sentiment of tweets | 1 | - | 1 | - | - | - | - | - |
| Company's contact information | | | | | | | | |
| whether the email is on Crunchbase (boolean) | 1 | - | - | 1 | - | - | - | - |
| whether the phone number is on Crunchbase (boolean) | 1 | - | - | 1 | - | - | - | - |
| Feature | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|---|
| COMPANY'S PEOPLE INFORMATION | | | | | | | | |
| Team | | | | | | | | |
| team size (number of team members) | 3 | - | 2 | - | 1 | - | - | - |
| company headcount | 1 | - | - | - | - | - | 1 | - |
| number of male team members | 1 | - | 1 | - | - | - | - | - |
| number of female team members | 1 | - | 1 | - | - | - | - | - |
| IDs of current team members on Crunchbase | 1 | - | 1 | - | - | - | - | - |
| members' times on the team (min) | 1 | - | 1 | - | - | - | - | - |
| members' times on the team (max) | 1 | - | 1 | - | - | - | - | - |
| members' times on the team (average) | 1 | - | 1 | - | - | - | - | - |
| number of current/CB registered hired or released staff in the last 6/12/24 months (12 features) | 1 | - | 1 | - | - | - | - | - |
| number of past members | 1 | - | - | - | - | 1 | - | - |
| frequency of member quitting | 1 | - | - | - | - | 1 | - | - |
| Founders | | | | | | | | |
| whether founders are indicated on Crunchbase (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of founders | 5 | 1 | 2 | 1 | - | 1 | - | - |
| number of male founders | 2 | - | 1 | 1 | - | - | - | - |
| number of female founders | 2 | - | 1 | 1 | - | - | - | - |
| number of different countries where the founders come from | 1 | - | - | 1 | - | - | - | - |
| name of founders or CEO | 1 | - | - | - | 1 | - | - | - |
| IDs of company founders on Crunchbase | 1 | - | 1 | - | - | - | - | - |
| founder characteristics (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| Board members | | | | | | | | |
| number of board members | 1 | - | 1 | - | - | - | - | - |
| number of male board members | 1 | - | 1 | - | - | - | - | - |
| number of female board members | 1 | - | 1 | - | - | - | - | - |
| members' time on the board (min) | 1 | - | 1 | - | - | - | - | - |
| members' time on the board (max) | 1 | - | 1 | - | - | - | - | - |
| members' time on the board (average) | 1 | - | 1 | - | - | - | - | - |
| IDs of board members on Crunchbase | 1 | - | 1 | - | - | - | - | - |
| board members (not specified in detail) | 1 | - | - | - | - | 1 | - | - |
| Employees | | | | | | | | |
| number of employees | 3 | 2 | - | - | - | 1 | - | - |
| current employees (not specified in detail) | 1 | - | - | - | - | 1 | - | - |
| past employees (not specified in detail) | 1 | - | - | - | - | 1 | - | - |
| employee characteristics (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| Education (generally referred to founders) | | | | | | | | |
| total number of degrees obtained by founders | 1 | - | - | 1 | - | - | - | - |
| total number of degrees obtained by founders or CEO | 1 | - | - | - | 1 | - | - | - |
| maximum number of degrees obtained by a founder | 1 | - | - | 1 | - | - | - | - |
| average number of degrees obtained by a founder | 1 | - | - | 1 | - | - | - | - |
| University of graduation of founders or CEO | 1 | - | - | - | 1 | - | - | - |
| whether a founder or CEO has obtained an MBA (boolean) | 1 | - | - | - | 1 | - | - | - |
| top schools attended by management/employees | 1 | - | - | - | - | - | 1 | - |
| whether management/employees attended a business school (boolean) | 1 | - | - | - | - | - | 1 | - |
| whether management/employees attended a medical school (boolean) | 1 | - | - | - | - | - | 1 | - |
| whether management/employees attended a law school (boolean) | 1 | - | - | - | - | - | 1 | - |
| number of key persons with financial background | 1 | 1 | - | - | - | - | - | - |
| level of education (not specified in detail) | 1 | - | 1 | - | - | - | - | - |
| Feature | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|---|
| Work experience | | | | | | | | |
| whether the founders have experience (boolean) | 1 | 1 | - | - | - | - | - | - |
| whether the company has experience (boolean) | 1 | 1 | - | - | - | - | - | - |
| total experience of founders (years) | 1 | 1 | - | - | - | - | - | - |
| experience of founders (months) | 1 | 1 | - | - | - | - | - | - |
| total experience of total jobs in the company (years) | 1 | 1 | - | - | - | - | - | - |
| whether the company has at least one job in its profile (boolean) | 1 | 1 | - | - | - | - | - | - |
| total jobs of the company | 1 | 1 | - | - | - | - | - | - |
| company worked in (referred to founders or CEO) | 1 | - | - | - | 1 | - | - | - |
| top previous employers of current management/employees | 1 | - | - | - | - | - | 1 | - |
| number of people with a given job title and company in LinkedIn resume (all job titles used as features) | 1 | - | 1 | - | - | - | - | - |
| founder work history (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| employee work history (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| number of field background of the team | 1 | - | 1 | - | - | - | - | - |
| Founders previous entrepreneurial experience | | | | | | | | |
| whether founders have founded previous companies (boolean) | 1 | - | 1 | - | - | - | - | - |
| number of previous companies founded by the founders | 2 | 1 | 1 | - | - | - | - | - |
| number of successful companies founded by founders | 1 | 1 | - | - | - | - | - | - |
| FUNDING ROUNDS | | | | | | | | |
| Number of rounds | | | | | | | | |
| whether has at least one funding round (boolean) | 1 | 1 | - | - | - | - | - | - |
| whether the company raised a seed round (boolean) | 1 | - | 1 | - | - | - | - | - |
| whether the company raised a series A round (boolean) | 2 | 1 | 1 | - | - | - | - | - |
| whether the company raised a series B round (boolean) | 2 | 1 | 1 | - | - | - | - | - |
| whether the company raised a series C round (boolean) | 2 | 1 | 1 | - | - | - | - | - |
| whether the company raised a series D round (boolean) | 2 | 1 | 1 | - | - | - | - | - |
| number of funding rounds | 10 | 5 | 2 | 1 | - | 1 | 1 | - |
| counts of funding types as given in Crunchbase (e.g., seed, angel, venture etc.) | 1 | - | 1 | - | - | - | - | - |
| last funding round type | 1 | - | - | 1 | - | - | - | - |
| number of investments | 1 | - | 1 | - | - | - | - | - |
| numbers of rounds funded in different currencies | 1 | - | 1 | - | - | - | - | - |
| rough categorization of investments (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| Round amount (money) | | | | | | | | |
| whether has at least one declared funding amount (boolean) | 1 | 1 | - | - | - | - | - | - |
| total funding amount | 10 | 5 | 3 | 1 | - | - | 1 | - |
| amount of investment per funding round | 3 | 2 | - | - | 1 | - | - | - |
| seed round raised amount | 2 | 1 | 1 | - | - | - | - | - |
| round A raised amount | 2 | 2 | - | - | - | - | - | - |
| round B raised amount | 2 | 2 | - | - | - | - | - | - |
| round C raised amount | 2 | 2 | - | - | - | - | - | - |
| round D raised amount | 2 | 2 | - | - | - | - | - | - |
| round E raised amount | 1 | 1 | - | - | - | - | - | - |
| round F raised amount | 1 | 1 | - | - | - | - | - | - |
| round G raised amount | 1 | 1 | - | - | - | - | - | - |
| last funding round raised amount | 2 | - | 1 | 1 | - | - | - | - |
| money raised in different funding types | 1 | - | 1 | - | - | - | - | - |
| monthly funding amount | 1 | - | - | - | - | 1 | - | - |
| number of rounds without declared amount | 1 | - | 1 | - | - | - | - | - |
| burn rate | 1 | 1 | - | - | - | - | - | - |
| Feature | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry |
|---|---|---|---|---|---|---|---|---|
| Timing of rounds | | | | | | | | |
| date of first funding | 1 | - | - | - | - | - | 1 | - |
| time it took to raise seed funding (months) | 1 | 1 | - | - | - | - | - | - |
| time it took to raise first funding (years) | 2 | 2 | - | - | - | - | - | - |
| time it took to raise round A (years) | 1 | 1 | - | - | - | - | - | - |
| time it took to raise round B (years) | 1 | 1 | - | - | - | - | - | - |
| time it took to raise round C (years) | 1 | 1 | - | - | - | - | - | - |
| time it took to raise round D (years) | 1 | 1 | - | - | - | - | - | - |
| time passed since the company received the first funding to date (days) | 1 | - | 1 | - | - | - | - | - |
| date of last funding | 2 | - | - | 1 | - | - | 1 | - |
| time it took to raise last funding (months) | 1 | - | - | 1 | - | - | - | - |
| time passed since the company received the last funding to date (days) | 1 | - | 1 | - | - | - | - | - |
| time passed since the company received the last funding to date (years) | 1 | 1 | - | - | - | - | - | - |
| difference between the date of the first and the last funding rounds | 1 | 1 | - | - | - | - | - | - |
| time passed between the first and the last funding (years) | 1 | 1 | - | - | - | - | - | - |
| age since past rounds (min) | 1 | - | 1 | - | - | - | - | - |
| age since past rounds (max) | 1 | - | 1 | - | - | - | - | - |
| age since past rounds (average) | 1 | - | 1 | - | - | - | - | - |
| timing of investments (not specified in detail) | 1 | - | - | - | 1 | - | - | - |
| Investors involved | | | | | | | | |
| number of investors in previous rounds | 3 | - | 2 | - | - | 1 | - | - |
| whether the company has any form of venture capital (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of venture capital and private equity firms investing in the company | 1 | 1 | - | - | - | - | - | - |
| number of (unique) investors who participated in all the funding rounds | 1 | - | 1 | - | - | - | - | - |
| number of (unique) investors who participated in the last funding round | 1 | - | 1 | - | - | - | - | - |
| number of investors per funding round | 2 | 2 | - | - | - | - | - | - |
| money invested by each investor | 1 | - | 1 | - | - | - | - | - |
| number of investments made by each investor | 1 | - | 1 | - | - | - | - | - |
| max size of the startup's investors' portfolios | 1 | - | 1 | - | - | - | - | - |
| investor_shares (i.e., money invested by each investor, normalized by total raised money) | 1 | - | 1 | - | - | - | - | - |
| time since investors got involved with the company (min) | 1 | - | 1 | - | - | - | - | - |
| time since investors got involved with the company (max) | 1 | - | 1 | - | - | - | - | - |
| time since investors got involved with the company (average) | 1 | - | 1 | - | - | - | - | - |
| top investors | 1 | - | - | - | - | - | 1 | - |
| the company has at least one top 500 investor (boolean) | 1 | 1 | - | - | - | - | - | - |
| number of top 500 investors in the company (top 500 by number of investments made by investor) | 1 | 1 | - | - | - | - | - | - |
| number of unique renowned investors who participated in all the funding rounds | 1 | - | - | 1 | - | - | - | - |
| number of unique renowned investors who participated in the last funding round | 1 | - | - | 1 | - | - | - | - |
| number of people with financial background investing in the company | 1 | 1 | - | - | - | - | - | - |
| Company valuation | | | | | | | | |
| date of the company valuation | 1 | - | - | - | - | - | 1 | - |
| company valuation after each funding round (post-money valuation) | 1 | 1 | - | - | - | - | - | - |
| company valuation after the last funding round (post-money valuation) | 1 | - | - | 1 | - | - | - | - |
| current market value of the company | 1 | 1 | - | - | - | - | - | - |
Commitment of 3rd party support
knowledge support | 1 | - | 1 | - | - | - | - | -
financial support | 1 | - | 1 | - | - | - | - | -
proof of value (i.e. number of pilot customers) | 1 | - | 1 | - | - | - | - | -
Mentions
In news
number of news articles on Crunchbase mentioning the company | 2 | - | 2 | - | - | - | - | -
year-on-year increase in the number of articles on Crunchbase mentioning the company | 1 | - | 1 | - | - | - | - | -
number of news items added to Crunchbase posted in last 6/12/24 months (6 features) | 1 | - | 1 | - | - | - | - | -
number of TechCrunch articles about the company | 1 | 1 | - | - | - | - | - | -
frequency of news on public media | 1 | - | - | - | - | 1 | - | -
counts of mentions on each domain | 1 | - | 1 | - | - | - | - | -
topic model (LDA) features | 1 | - | 1 | - | - | - | - | -
Links
(logarithm of) number of domains/pages mentioning the company in total/last 6/12/18 months (16 features) | 1 | - | 1 | - | - | - | - | -
IDs/counts/log of counts of mentions on each domain in total/in last 6 months | 1 | - | 1 | - | - | - | - | -
Linguistic features from unstructured text
Company's description in Crunchbase
unigrams of lemmatized nouns | 1 | - | - | - | 1 | - | - | -
location phrases | 1 | - | - | - | 1 | - | - | -
bag-of-words | 1 | - | 1 | - | - | - | - | -
text analysis (not specified in detail) | 4 | - | - | - | - | - | - | 4
Articles in TechCrunch
extract topic features | 1 | 1 | - | - | - | - | - | -
text analysis (not specified in detail) | 4 | - | - | - | - | - | - | 4
Network analysis attributes
Startup centrality scores
betweenness | 1 | 1 | - | - | - | - | - | -
maximal betweenness centrality of the startup's investors | 1 | - | 1 | - | - | - | - | -
mean betweenness centrality of the startup's investors | 1 | - | 1 | - | - | - | - | -
sum of the betweenness centrality of the startup's investors | 1 | - | 1 | - | - | - | - | -
closeness | 1 | 1 | - | - | - | - | - | -
degree | 1 | 1 | - | - | - | - | - | -
eigenvalue | 1 | 1 | - | - | - | - | - | -
PageRank | 1 | 1 | - | - | - | - | - | -
Investor-to-startup aggregate centrality score
betweenness | 1 | 1 | - | - | - | - | - | -
closeness | 1 | 1 | - | - | - | - | - | -
degree | 1 | 1 | - | - | - | - | - | -
eigenvalue | 1 | 1 | - | - | - | - | - | -
PageRank | 1 | 1 | - | - | - | - | - | -
Features based on node neighborhoods…
shortest paths | 1 | - | - | - | 1 | - | - | -
Adamic/Adar | 1 | - | - | - | 1 | - | - | -
Jaccard coefficient | 1 | - | - | - | 1 | - | - | -
common neighbors | 1 | - | - | - | 1 | - | - | -
preferential attachment | 1 | - | - | - | 1 | - | - | -
number of shortest paths between an investor and a company | 1 | - | - | - | 1 | - | - | -
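As an illustration of the neighborhood-based features listed at the end of the table above (common neighbors, Jaccard coefficient, Adamic/Adar, preferential attachment), the following minimal Python sketch computes them with their standard textbook definitions on a toy undirected graph. The graph, the node names and the function names are hypothetical; how the reviewed studies actually build their investor-company networks is not specified here.

```python
import math

# Hypothetical toy graph stored as adjacency sets.
# Node names are illustrative and do not come from the reviewed studies.
GRAPH = {
    "investor_a": {"startup_x", "startup_y"},
    "investor_b": {"startup_x", "startup_y", "startup_z"},
    "startup_x": {"investor_a", "investor_b"},
    "startup_y": {"investor_a", "investor_b"},
    "startup_z": {"investor_b"},
}


def common_neighbors(graph, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def jaccard_coefficient(graph, u, v):
    """Shared neighbors divided by all distinct neighbors of u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over shared neighbors (rare neighbors weigh more)."""
    return sum(
        1.0 / math.log(len(graph[z]))
        for z in graph[u] & graph[v]
        if len(graph[z]) > 1  # guard against log(1) = 0
    )


def preferential_attachment(graph, u, v):
    """Product of the degrees of u and v."""
    return len(graph[u]) * len(graph[v])


pair = ("investor_a", "investor_b")
scores = {
    "common_neighbors": common_neighbors(GRAPH, *pair),
    "jaccard": jaccard_coefficient(GRAPH, *pair),
    "adamic_adar": adamic_adar(GRAPH, *pair),
    "preferential_attachment": preferential_attachment(GRAPH, *pair),
}
print(scores)
```

In a link-prediction setting, scores of this kind are typically computed for candidate node pairs and then used as input features for a classifier.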
Appendix 2 – Classification of the algorithms
The numbers in the table refer to the number of studies that used each algorithm (in the rows) to
address a specific research task (in the columns).
Algorithm | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry
Number of documents by research goal |  | 6 | 3 | 1 | 3 | 2 | 1 | 4
Number of considered algorithms by research goal |  | 17 | 8 | 5 | 10 | 10 | 1 | 12
Supervised learning
Classification
random forest | 9 | 5 | 2 | 1 | - | - | - | 1
support vector machine | 7 | 3 | 1 | 1 | 1 | - | - | 2
logistic regression | 7 | 4 | 2 | - | 1 | - | - | -
naive Bayes classifier | 4 | 1 | 1 | - | 1 | - | - | 1
decision trees | 3 | - | - | 1 | 1 | - | - | 1
artificial neural network | 3 | - | 2 | - | - | - | - | 1
k-nearest neighbors | 2 | 1 | - | - | - | - | - | 1
Bayesian network | 2 | 2 | - | - | - | - | - | -
reduced logistic regression | 1 | 1 | - | - | - | - | - | -
simpleLogistic | 1 | 1 | - | - | - | - | - | -
support vector clustering | 1 | 1 | - | - | - | - | - | -
recursive partitioning tree | 1 | 1 | - | - | - | - | - | -
conditional inference tree | 1 | 1 | - | - | - | - | - | -
extremely randomized trees | 1 | - | - | 1 | - | - | - | -
alternating decision tree | 1 | 1 | - | - | - | - | - | -
gradient tree boosting | 1 | - | - | 1 | - | - | - | -
adaptive boosting | 1 | 1 | - | - | - | - | - | -
extreme gradient boosting | 1 | 1 | - | - | - | - | - | -
CatBoost (Categorical + Boosting) | 1 | - | 1 | - | - | - | - | -
lazy associative classifier | 1 | 1 | - | - | - | - | - | -
Regression
linear regression | 1 | - | - | - | - | - | 1 | -
Unsupervised learning
Clustering
K-means clustering | 2 | - | - | - | 1 | - | - | 1
Deep learning
graph convolutional networks | 1 | - | 1 | - | - | - | - | -
Natural language processing
zeroR | 1 | - | - | - | - | - | - | 1
term frequency-inverse document frequency | 3 | - | - | - | - | - | - | 3
multinomial naive Bayes | 1 | - | - | - | - | - | - | 1
latent Dirichlet allocation | 2 | 1 | - | - | - | - | - | 1
fuzzy fingerprint method | 1 | - | - | - | - | - | - | 1
latent semantic analysis | 1 | - | - | - | - | - | - | 1
word2vec | 1 | - | 1 | - | - | - | - | -
Recommender systems
matrix factorization | 1 | - | - | - | - | 1 | - | -
non-negative matrix factorization | 1 | - | - | - | - | 1 | - | -
probabilistic matrix factorization | 1 | - | - | - | - | 1 | - | -
social-adjusted probabilistic matrix factorization | 1 | - | - | - | - | 1 | - | -
singular value decomposition | 1 | - | - | - | - | 1 | - | -
item-based collaborative filtering | 1 | - | - | - | - | 1 | - | -
Bayesian probabilistic latent factor | 1 | - | - | - | - | 1 | - | -
collaborative filtering with social information | 1 | - | - | - | - | 1 | - | -
connected recommendation | 1 | - | - | - | - | 1 | - | -
simplified connected recommendation | 1 | - | - | - | - | 1 | - | -
Other
factor graph / CRF | 1 | - | - | - | 1 | - | - | -
random link prediction | 1 | - | - | - | 1 | - | - | -
preferential attachment link prediction | 1 | - | - | - | 1 | - | - | -
weighted preferential attachment link prediction | 1 | - | - | - | 1 | - | - | -
segmented preferential attachment link prediction | 1 | - | - | - | 1 | - | - | -
supervised random walk | 1 | 1 | - | - | - | - | - | -
Appendix 3 – Classification of the performance metrics
The numbers in the table refer to the number of studies that used each metric (in the rows) to
measure model performance in addressing a specific research task (in the columns).
Metric | Total | Predicting the exit event of a company | Predicting funding events in a given period of time | Predicting the next event that a company will achieve | Predicting investment relationship | Generating investment recommendations | Predicting a company valuation | Classifying companies by industry
Number of documents by research goal |  | 6 | 3 | 1 | 3 | 2 | 1 | 4
Number of considered metrics by research goal |  | 10 | 8 | 4 | 6 | 8 | 1 | 6
accuracy | 10 | 5 | - | 1 | - | 2 | - | 2
error rate | 1 | 1 | - | - | - | - | - | -
precision | 8 | 3 | 1 | 1 | 2 | - | - | 1
recall (or true positive rate or sensitivity) | 12 | 6 | 1 | 1 | 3 | - | - | 1
inverse recall (or false positive rate) | 4 | 3 | - | - | 1 | - | - | -
specificity (or true negative rate or selectivity) | 1 | 1 | - | - | - | - | - | -
F1 score | 6 | 2 | 1 | 1 | 2 | - | - | -
area under the ROC curve | 9 | 5 | 2 | - | 1 | - | - | 1
precision-recall curve | 1 | - | 1 | - | - | - | - | -
confusion matrix | 2 | - | - | - | 1 | - | - | 1
Matthews correlation coefficient | 2 | 1 | 1 | - | - | - | - | -
residual error from cross-validation testing | 1 | - | - | - | - | - | 1 | -
root mean square error | 1 | - | - | - | - | 1 | - | -
mean absolute error | 1 | - | - | - | - | 1 | - | -
training loss | 1 | - | - | - | - | 1 | - | -
precision@K | 1 | - | - | - | - | 1 | - | -
recall@K | 1 | - | - | - | - | 1 | - | -
normalized discounted cumulative gain @K | 1 | - | - | - | - | 1 | - | -
mean average precision | 1 | - | - | - | - | 1 | - | -
reconstruction error | 1 | - | - | - | - | - | - | 1
gain ratio | 1 | 1 | - | - | - | - | - | -
SHAP (SHapley Additive exPlanations) value | 1 | - | 1 | - | - | - | - | -
one-sided Wilcoxon signed-rank test | 1 | - | 1 | - | - | - | - | -
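Several of the classification metrics listed above can be derived from the four cells of a binary confusion matrix. The following minimal Python sketch uses the standard definitions; the function name and the example counts are hypothetical and are not taken from any of the reviewed studies.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0            # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0       # true negative rate
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because successful startups are usually a small minority of the sample, accuracy alone can be misleading, which is one reason to report it alongside precision, recall, F1 or the Matthews correlation coefficient.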