Research Article
Application of Decision Tree Algorithm in Early Entrepreneurial
Project Screening
Yu Min Wang and Lin Xue
School of Innovation, Entrepreneurship and Creation, Minjiang University, Fuzhou, Fujian 350108, China
Correspondence should be addressed to Lin Xue; 2184@mju.edu.cn
Received 26 January 2022; Revised 19 February 2022; Accepted 1 March 2022; Published 31 March 2022
Academic Editor: Tongguang Ni
Copyright © 2022 Yu Min Wang and Lin Xue. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Venture capital firms often face insufficient information and insufficient time when evaluating whether a startup is worth investing in. This paper focuses on how to combine the public information of startups with the decision tree algorithm to assist investors in project screening. By extracting the public information of 1104 AI and big data companies from January 2016 to June 2017 and their financing progress in the following 18 months, this paper finds that six indicators can effectively help investors identify projects that can obtain sustainable financing, that is, high potential projects: having a working background in well-known companies, being reported by well-known media, having patents, being invested in by excellent institutions, having work experience highly related to the venture, and having well-known financing consultants. Considering the problem of data availability in the real world, and drawing on industry experience, this paper mines variables in more detail from semistructured information, such as the team resumes of start-ups, and selects a decision tree algorithm that is insensitive to missing values and has strong interpretability to amplify the value of fragmented information. Finally, by improving the algorithm, a project screening model that can meet the needs of investment practice is designed, and the fusion of public and private information is discussed, which offers more complete guidance for optimizing investment work.
1. Introduction
Venture capital is a form of investment that provides capital to start-ups with high growth potential in exchange for equity. The large number of entrepreneurial projects and the high cost of investment pose a great challenge to fund managers. The first kind of problem investors face is the time constraint: how to screen as many high-quality entrepreneurial companies in the market as possible with limited manpower and time, identify their development potential, and provide support. It is also widely agreed that early-stage investment suffers from a serious lack of information. When facing newly established companies, investors often cannot obtain hard data about the market, products, and finances, and they can only make decisions under imperfect information, drawing on their own experience, industry estimates, and their judgment of the start-up team. The second kind of problem is therefore the information constraint: which factors should be given a higher weight, and how to reduce subjective assumptions and make decisions more rational.
Facing insufficient time and information, an intuitive solution is to let machines help people collect and process more information, capture decision-making knowledge in the form of algorithms, and make the investment process more rational. However, a series of questions, such as which channels to use, what information to extract, what indicators to refine, and what feedback to use to revise the model, can only be answered effectively once the problem is clearly defined. Therefore, this paper isolates early project screening from the whole process of investment activities and simplifies the problem as follows:
Hindawi
Scientific Programming
Volume 2022, Article ID 3584196, 9 pages
https://doi.org/10.1155/2022/3584196
Assume that an investment institution has obtained the profile of an early-stage company disclosed in the venture capital media, including endorsement information from relevant parties, such as “past investors,” and team characteristics, such as “profile of team members.” Can this institution predict the company’s potential to obtain the next round of financing based on the financing development of related companies in the same industry over the past period of time, and thereby decide whether it is worth the time to research and follow up on the investment?
The two types of input required to answer this question have been extensively studied in academia.
In terms of team characteristics, the positive effects of industry-related experience [1, 2], entrepreneurial experience [1, 3, 4], and a work background in well-known companies [5, 6] on the development of startups are supported by most of the literature. However, studies on demographic factors are currently scarce. In the dimension of stakeholders, the endorsement of investment institutions [7, 8] and social capital [9, 10] have a positive effect on the development of the company. The value of financing consultants, venture capital media, and other relevant parties has not been fully explored.
Therefore, this paper attempts to apply academic research results on venture capital decision-making and on the factors influencing startups to real investment activities. Based on semistructured public datasets, it extracts valuable traditional indicators and novel indicators from as many dimensions as possible and tests their predictive power.
2. Data Preparation
2.1. Data Sources and Extraction Methods. Since the potential of a startup is a latent variable, the effectiveness of the predictive model needs to be validated with explicit indicators. Therefore, this paper selects “whether the company obtains the next round of financing,” which is highly related to the company’s potential, as a proxy variable to test the effect of the “team capability signal” and the “relevant party endorsement signal” in building a predictive model. Considering that companies that have never raised funding have less publicly available information, we restrict our research to startups that have raised at least one round of funding [11].
In practice, different industries have different investment analysis frameworks, and timing and market enthusiasm have a significant impact on how difficult it is for entrepreneurial projects to obtain financing. Therefore, this paper first controls the industry and time window: it selects all investment and financing events in the field of artificial intelligence and big data from January 2016 to December 2018, a total of 2527 events. The source of the data is Xinniu Data (https://www.xiniudata.com), a third-party venture capital service platform whose institutional accounts provide a data export function.
The normal financing rhythm for startups is to complete a new round of financing every 12–18 months. Based on this industry experience, this paper splits the dataset and selects all projects that received early (round A and earlier) financing between January 2016 and June 2017. After this selection, a total of 1104 independent companies entered the sample set. Whether these companies received a new round of funding between July 2017 and December 2018 is the variable to be predicted.
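The windowing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event tuple layout, the round names in `EARLY_ROUNDS`, and the function `build_labels` are hypothetical and do not come from the paper or from the data platform's export format.

```python
from datetime import date

# Hypothetical round labels standing in for "round A and earlier".
EARLY_ROUNDS = {"angel", "pre-A", "A"}
OBSERVATION = (date(2016, 1, 1), date(2017, 6, 30))
FOLLOW_UP = (date(2017, 7, 1), date(2018, 12, 31))


def build_labels(events):
    """Map company_id -> 1 if it raised again in the follow-up window, else 0.

    `events` is an iterable of (company_id, round_name, event_date) tuples.
    Only companies with an early round in the observation window are sampled.
    """
    sampled, funded_later = set(), set()
    for company_id, round_name, event_date in events:
        in_observation = OBSERVATION[0] <= event_date <= OBSERVATION[1]
        if round_name in EARLY_ROUNDS and in_observation:
            sampled.add(company_id)
        if FOLLOW_UP[0] <= event_date <= FOLLOW_UP[1]:
            funded_later.add(company_id)
    return {cid: int(cid in funded_later) for cid in sorted(sampled)}


events = [
    ("c1", "angel", date(2016, 3, 1)),
    ("c1", "A", date(2017, 9, 15)),  # new round inside the follow-up window
    ("c2", "A", date(2017, 5, 20)),  # early round, no later round
    ("c3", "B", date(2016, 8, 1)),   # not an early round, so not sampled
]
labels = build_labels(events)
```

Here `labels` holds one entry per sampled company, which mirrors the idea of one record per independent company with a binary financing outcome.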
2.2. Data Field Description. The dataset collected in this paper contains 17 original fields: company name, industry field, one-sentence introduction, the current round of financing time, current round, the current round of financing amount, the current round of investor, company introduction, contact information, establishment time, region, company advantages, industry label, team members and introduction, activity performance, and business information [12]. After anonymizing the company name into a company ID, 4 remaining fields are relevant to this modeling, all of which are semistructured or unstructured information.
By manually labeling the original dataset, we split these four fields into two categories with a total of 10 binary variables. See Table 1 for the specific field extraction and processing.
The structuring of information and the handling of missing values are explained as follows:
(1) Invested by excellent institutions (GoodInvestor): If the investors in the current round include an excellent investment institution, the variable is 1; otherwise, it is 0. An excellent institution is defined as one for which more than 45% of the projects it invested in during 2015 obtained a next round of financing (this proportion is an empirical value and can be adjusted in practice). The information comes from the venture capital institution database of Xinniu Data. If the investment institution is not disclosed, the variable is a missing value, marked with “?”.
(2) Have well-known financing advisors (GoodFA): The variable is 1 if the company advantage field contains “well-known FA”; otherwise, it is 0. Xinniu Data's definition of well-known financing consultants covers well-known financing consultants in the industry, such as Huaxing Capital and Xiaofan Table (this range can be dynamically adjusted in practice). There are no missing values for this item.
(3) Well-known media reports (GoodCoverage): The variable is 1 if the company advantage field contains “famous media coverage,” and 0 otherwise. Xinniu Data applies this label when platforms such as Pencil Road and Entrepreneurship have conducted an exclusive interview with the target company, rather than simply disclosing financing news. In practice, this range can be dynamically adjusted. There are no missing values for this item.
(4) Have patents (Patent): The variable is 1 if the company or its members hold patents; otherwise, it is 0. Here, patents refer only to invention patents and exclude trademarks and the like. There are no missing values for this item.
(5) Working background in a well-known company (CompanyHalo): If the team member profiles mention well-known internet companies (Tencent, Baidu, Alibaba, JD, Xiaomi, Netease, 360, Sina, Google, etc.) or technology enterprises (Microsoft, IBM, Huawei, Lenovo, SAP, Oracle, etc.), the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as “?”.
(6) Core members have a well-known university education background (EduHalo): If the team member profiles mention study experience at well-known universities in the world, the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as “?”.
(7) Core members have entrepreneurial experience (StartupExps): If “continuous entrepreneurship” is mentioned in the team member profiles, the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as “?”.
(8) The members have the same school or work experience (CoworkExps): If it can be inferred from the team member profiles that the core members have studied at the same university or worked at the same company, the variable is 1. If there is no obvious intersection in their main experience, it is 0. If the source field is missing or difficult to judge, it is marked as “?”.
(9) The core team has diverse professional skills (DiverseExps): If the core members’ experience in the team member profiles covers at least two of technology, product, management, marketing, and finance, the variable is 1. If the backgrounds of the core members are similar (for example, the CEO and CTO both come from an R&D background), the variable is 0. If the source field is missing or difficult to judge, it is marked as “?”.
(10) Previous work is highly relevant to this venture (WorkRelated): If the work background mentioned in the team member profiles is highly correlated with the industry or business of the start-up company, the variable is 1, and if it is obviously irrelevant, it is 0. If the source field is missing or difficult to judge, it is marked as “?”.
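The paper labels these variables manually. As a rough illustration of the encoding rules only, the following Python sketch shows how two of them (GoodInvestor and CompanyHalo) could be approximated programmatically. All function names are hypothetical, the keyword tuple simply repeats the companies named in item (5), and naive substring matching is an assumption that would produce false matches on short names such as “JD” or “360” in real text.

```python
MISSING = "?"

# Company names listed in item (5); a real list would be longer.
WELL_KNOWN_COMPANIES = (
    "Tencent", "Baidu", "Alibaba", "JD", "Xiaomi", "Netease", "360", "Sina",
    "Google", "Microsoft", "IBM", "Huawei", "Lenovo", "SAP", "Oracle",
)


def is_excellent_institution(n_invested_2015, n_with_next_round, threshold=0.45):
    """True if more than `threshold` of the 2015 investments raised again."""
    if n_invested_2015 == 0:
        return False
    return n_with_next_round / n_invested_2015 > threshold


def encode_good_investor(investors, excellent_institutions):
    """1 if any disclosed investor is excellent, 0 if none is, '?' if undisclosed."""
    if not investors:
        return MISSING
    return 1 if any(name in excellent_institutions for name in investors) else 0


def encode_company_halo(team_profile):
    """1 if the profile mentions a well-known company, 0 if not, '?' if empty."""
    if team_profile is None or not team_profile.strip():
        return MISSING
    text = team_profile.lower()
    return 1 if any(name.lower() in text for name in WELL_KNOWN_COMPANIES) else 0
```

For example, `encode_company_halo("CTO previously worked at Baidu")` yields 1, while an empty profile yields the missing marker “?” rather than 0, which keeps "unknown" distinct from "absent" as the list above requires.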
3. Research Design and Model Validation
To judge and compare the predictive value of the two types of signals relatively independently, this paper first constructs two basic models using the endorsement signal of relevant parties and the team ability signal, respectively, and compares three algorithms: the Bayesian network, decision tree C4.5, and random forest. Then, the two kinds of signals are fused to build a comprehensive model to test the improvement of key indicators. Finally, considering the significant difference in cost between “rejecting the true” and “accepting the false” in investment practice, this paper introduces a cost matrix to improve the recall rate of the model for high potential projects.
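One common way in which a cost matrix raises recall is the minimum-expected-cost decision rule; the text does not state which cost-sensitive variant is used, so the following derivation is a sketch under that assumption. The symbols $p$, $C_{FN}$, and $C_{FP}$ are introduced here for illustration: $p$ is the predicted probability that a project is positive, $C_{FN}$ is the cost of rejecting a true high potential project, and $C_{FP}$ is the cost of accepting a false one.

```latex
% Expected cost of each decision for a project with positive probability p
\mathbb{E}[\text{cost} \mid \text{predict negative}] = p \, C_{FN}, \qquad
\mathbb{E}[\text{cost} \mid \text{predict positive}] = (1 - p) \, C_{FP}

% Predict positive when it is the cheaper decision
p \, C_{FN} > (1 - p) \, C_{FP}
\;\Longleftrightarrow\;
p > \frac{C_{FP}}{C_{FP} + C_{FN}}

% Illustrative values only: C_{FN} = 3, C_{FP} = 1 lower the threshold from 0.5 to 0.25
\frac{C_{FP}}{C_{FP} + C_{FN}} = \frac{1}{1 + 3} = 0.25
```

With equal costs the threshold is 0.5; making a missed positive more expensive lowers the threshold, so more projects are flagged and recall on positive samples rises at the expense of precision.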
Table 1: Field extraction and processing.
Field classification | Processed field name | Data source
Relevant party endorsement signal | Invested by excellent institutions | Investors in the current round
Relevant party endorsement signal | Well-known financing advisors (FA) | Company advantage
Relevant party endorsement signal | Well-known media reports | Company advantage
Team ability signal | Have patents | Company advantage + team members and profiles
Team ability signal | Working background in a well-known company | Team members and profiles
Team ability signal | Core members have a well-known university education background | Team members and profiles
Team ability signal | Core members have entrepreneurial experience | Team members and profiles
Team ability signal | The members have the same school or work experience | Team members and profiles
Team ability signal | The core team has diverse professional skills | Team members and profiles
Team ability signal | Previous work is highly relevant to this venture | Team members and profiles + company profile
3.1. Basic Model 1: Based on Endorsement Signals of Related Parties. We first select the endorsement signals of three interested parties (invested by excellent institutions, well-known financing consultants, and well-known media reports) as the model input, and take whether the company obtains the next round of financing within 18 months after the first disclosure of financing information as the classification label (yes is 1, no is 0). There are 1104 records in the complete dataset, of which 414 have the next round of financing (defined as positive samples), accounting for 37.5%.
The average performance of the Bayesian network, decision tree C4.5, and random forest algorithms was tested under 10 runs of 10-fold cross-validation. The experimental environment is the open-source data mining software Weka, and the algorithms use the default parameters. Because the problem abstracted from investment practice is how to select a small proportion of samples that covers as many true positive samples as possible, we give the overall classification performance in Table 2 as a reference but focus on comparing the recall and precision of positive samples (i.e., projects with continuous financing).
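The evaluation protocol (repeating 10-fold cross-validation ten times and averaging) can be illustrated without any machine learning library. The sketch below only generates the train/test index splits; it assumes stratified folds, which the text does not confirm, and the function name `repeated_stratified_kfold` is hypothetical rather than part of Weka.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated stratified k-fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal indices round-robin so every fold keeps the class ratio.
            for position, idx in enumerate(shuffled):
                folds[position % n_splits].append(idx)
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            yield repeat, fold, train_idx, sorted(test_idx)


# Class balance taken from the text: 414 positive and 690 negative samples.
labels = [1] * 414 + [0] * 690
splits = list(repeated_stratified_kfold(labels))
```

Each of the resulting train/test pairs would be used to fit and score one model, and the reported figures would be averages over all of them.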
Table 2 shows that the Bayesian network is the model with the highest positive sample precision, and decision tree C4.5 is the model with the highest positive sample recall.
Although the decision tree C4.5 model has slightly lower precision than the Bayesian network, on average it correctly identifies 30 more projects that obtain the next round of financing (153 vs. 123), while only 71 additional projects need to be reviewed (281 vs. 210), which has significant value in practice. For investors, the number of projects to review is reduced from the original 1,104 to 281, which saves significant working time (corresponding to a compression rate of about 75%). Covering 24% more high potential projects is worth some additional project viewing time, since the recall of high potential projects is the major consideration at this stage.
In summary, decision tree C4.5 is the best performing model for this requirement.
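The trade-off discussed above follows directly from four counts per model. The short Python sketch below recomputes it from the counts quoted in the text; the helper name `screening_metrics` is hypothetical, and because the counts are rounded averages over cross-validation runs, the recomputed ratios can differ slightly from the percentages printed in the tables.

```python
def screening_metrics(n_total, n_positive, n_flagged, n_hits):
    """Positive-class precision and recall plus the workload compression rate."""
    return {
        "precision": n_hits / n_flagged,        # share of flagged projects that raise again
        "recall": n_hits / n_positive,          # share of all positives that are flagged
        "compression": 1 - n_flagged / n_total, # share of projects not needing review
    }


# Counts quoted in the text: 1104 projects, 414 of them positive.
c45 = screening_metrics(1104, 414, n_flagged=281, n_hits=153)
bayes = screening_metrics(1104, 414, n_flagged=210, n_hits=123)

# Relative gain in covered positives when moving from Bayesian network to C4.5.
extra_coverage = 153 / 123 - 1
```

Under these counts, C4.5 flags about a quarter of all projects and `extra_coverage` comes out at roughly 0.24, which is the 24% additional coverage mentioned above.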
3.2. Basic Model 2: Based on Team Ability Signal. In basic model 2, we select seven founding team ability signals as the model input: having patents, working background in well-known companies, core members having an educational background in well-known universities at home and abroad, core members having entrepreneurial experience, members having the same school or work experience, the core team having diversified professional skills, and previous work being highly related to this venture. We then repeat the aforementioned experimental process.
Table 3 shows that, in this case, random forest is the model with the highest positive sample precision, while the Bayesian network is the model with the highest positive sample recall. The differences in precision among the algorithms are smaller than the differences in recall. In addition, compared with basic model 1, all three algorithms have improved in the key indicators. The positive sample recall of the Bayesian network improved the most (43.5% vs. 29.7%), and the positive sample precision of the random forest algorithm improved the most (65.0% vs. 52.5%).
A possible explanation is that, compared with the related party endorsement signal, the team capability signal has richer dimensions but also a higher proportion of missing values. On a dataset with these characteristics, the Bayesian network can learn the rules present in a small number of samples from combinations of indicators, whereas the pruned decision tree algorithm is more inclined to learn to discriminate the negative samples, which make up a higher proportion of the dataset. As a result, although the set of samples the decision tree selects is smaller (the compression rate is above 77%), its ability to recall positive samples is weak.
To sum up, in investment practice, if one encounters a situation with rich attributes but many missing values, the Bayesian network may be the better choice among the three.
3.3. Fusion Model. For basic models 1 and 2, the sample compression and precision are acceptable in practice; however, even the best model does not exceed 44% recall on positive samples. For investment institutions, the cost of missing star projects is huge. The model has application value only when the recall of positive samples is high enough.
To this end, we fused the endorsement signal of related parties, which has fewer dimensions, with the team capability signal, which has more dimensions, as the model input and repeated the above modeling and testing process. The results are summarized in Table 4.
Table 2: Basic model 1: endorsement signals of related parties.
Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 64.2% | 63.2% | 61.6%
Overall recall | 65.8% | 64.8% | 63.7%
The number of positive samples judged | 210 | 281 | 256
The actual number of positive samples | 123 | 153 | 135
Positive sample precision | 58.7% | 54.6% | 52.5%
Positive sample recall | 29.7% | 36.9% | 32.5%
Table 3: Basic model 2: team ability signal.
Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 68.3% | 68.0% | 68.2%
Overall recall | 69.3% | 68.9% | 69.0%
The number of positive samples judged | 286 | 249 | 237
The actual number of positive samples | 180 | 160 | 154
Positive sample precision | 63.0% | 64.2% | 65.0%
Positive sample recall | 43.5% | 38.6% | 37.3%
Table 4 shows that, compared with basic models 1 and 2, all three algorithms substantially improve the recall of positive samples, which verifies the value of combining the two types of indicators for improving the predictive effectiveness of the model. Among them, the Bayesian network performs best, decision tree C4.5 is in the middle, and random forest ranks last. Compared with basic model 2, the positive sample recall of decision tree C4.5 improves by about 10 percentage points.
This is of great significance for investment practice. Take decision tree C4.5 as an example. Its practical meaning is as follows: after 1104 projects are screened by the model, 339 are judged as likely to obtain the next round of financing, of which 203 indeed do. From the perspective of project coverage, if investors screen projects according to this method, they can cover 48.9% of high potential projects with 30.7% of the workload. From the perspective of time cost, investors can spend 59.7% of their working time on valuable projects, a significant improvement over the positive sample rate of 37.5% when looking at projects at random.
However, it is worth noting that even under the optimal model, half of the high potential projects are not identified by the algorithm. By tracing back the characteristics of the samples and their classification results, three reasons can be inferred:
(1) Incomplete information. For example, 72 positive samples lack team members’ profiles and also lack positive signals in the dimension of endorsement by related parties, which objectively caps the predictive ability of the model.
(2) The positive and negative proportions of the samples are uneven. The proportion of negative samples is 62.5%, about 1.7 times that of positive samples. A model whose goal is to reduce the error rate is therefore more inclined to learn to discriminate negative samples, and positive samples are insufficiently mined.
(3) There are unobserved influencing factors. Important factors affecting a company’s ability to secure its next round of funding may not be limited to the 10 metrics used in the model.
4. Model Discussion and Extension
4.1. Factor Analysis and Rules Summary. This paper designs and validates a predictive model that can be used for early-stage project screening, but in investment practice, prediction results alone are not enough to support project screening. Investors need to understand the decision-making basis behind the model so that they can cross-check it against their past investment experience, correct possible wrong judgments, or add new screening rules. At the same time, investors also need to know which factors contribute more to the prediction, so that they can consciously collect this information when contacting entrepreneurs or improve the information quality of these dimensions through other channels. Considering the interpretability of the rules, this section selects the pruned decision tree C4.5 model as the analysis object, as shown in Figure 1.
As shown in Figure 1, the four layers of the decision tree contain six indicators: three team capability signals (work background in well-known companies, patents, and previous work experience highly related to this venture) and three relevant party endorsement signals (reported by well-known media, invested by excellent institutions, and well-known financing consultants). They are the most discriminative for judging whether a project can get the next round of financing [3].
There are also observable differences between the 0–1 distribution of these indicators in the whole sample and in the positive samples, as shown in Table 5 (for simplicity, the proportion of missing values is hidden). Except for “well-known financing consultants,” the proportion of each of the remaining five indicators with a value of 1 is 12–19 percentage points higher in the positive samples than in the whole sample.
Combining Figure 1 with the classification results output by the decision tree model, the following four representative investment logics can be summarized, reading the tree from right to left:
(1) 76% of entrepreneurial teams with a working background in well-known companies can obtain the next round of financing (characteristics: strong team ability signal).
(2) 69% of the teams without a working background in well-known companies but with well-known media reports and patents can obtain the next round of financing (characteristics: screened by stakeholders and strong technical ability of the team).
(3) 64% of the teams that have no working background in well-known companies and no well-known media reports but have been invested in by excellent institutions and whose previous work is highly related to this venture can obtain the next round of financing (characteristics: they have been screened by stakeholders and their team ability matches the venture).
(4) 55% of the entrepreneurial teams without a working background in well-known companies, well-known media reports, or investment by excellent institutions, but endorsed by well-known financing consultants, can obtain the next round of financing. If the enterprise does not even have the signal of a well-known financing consultant, there is a 70% probability that it will not obtain the next round of financing (characteristics: when there is no team ability signal, the stakeholder signal can play a certain screening role).
Table 4: Fusion model.
Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 69.7% | 67.6% | 67.0%
Overall recall | 70.5% | 68.5% | 67.9%
The number of positive samples judged | 327 | 339 | 332
The actual number of positive samples | 207 | 203 | 196
Positive sample precision | 63.5% | 59.7% | 59.0%
Positive sample recall | 50.1% | 48.9% | 47.4%
From the above typical rules, it can be found that a working background in well-known companies is the most discriminative signal. A team without this feature needs at least one of the other team ability signals and the endorsement signals of relevant parties to have a higher probability of obtaining the next round of financing.
Translated into the language of investment practice: if the team has outstanding highlights in its work history, it is worthy of investors’ research and follow-up. If the team is average, it needs investment from excellent institutions or coverage by well-known media to be considered.
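The four rules summarized above can be written as a small lookup function, which makes the order of the checks explicit. This is a sketch based only on the rules stated in the text, not a reproduction of the trained tree: the function name `screen_project` is hypothetical, branches that the four rules do not describe return "unspecified", and a missing value "?" is treated like 0, which simplifies how C4.5 actually handles missing values.

```python
def screen_project(features):
    """Return (outcome, share) following the four summarized rules.

    `features` maps variable names to 1, 0, or "?". `share` is the fraction of
    matching teams with that outcome as quoted in the text, or None when the
    text gives no figure for the branch.
    """
    def has(name):
        return features.get(name) == 1

    if has("CompanyHalo"):
        return ("high", 0.76)        # rule 1
    if has("GoodCoverage"):
        if has("Patent"):
            return ("high", 0.69)    # rule 2
        return ("unspecified", None)  # branch not described by the four rules
    if has("GoodInvestor"):
        if has("WorkRelated"):
            return ("high", 0.64)    # rule 3
        return ("unspecified", None)  # branch not described by the four rules
    if has("GoodFA"):
        return ("high", 0.55)        # rule 4
    return ("low", 0.70)             # no signal at all: 70% do not raise again


example = screen_project(
    {"CompanyHalo": 0, "GoodCoverage": 0, "GoodInvestor": 1, "WorkRelated": 1}
)
```

In this form an investor can see at a glance which signal decided a given project and can add or reorder rules based on their own experience.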
4.2. Multiclassification Problem. The so-called multiclassification problem refers to using an empirical loss function to predict the classification results of different categories [13]. From start-up to listing, companies usually need to go through more than four rounds of financing (angel round, round A, round B, and pre-IPO round). However, companies that receive continuous financing in the early stage may still fail to go public or be acquired. For investment institutions, if an investment cannot be successfully exited, there is no return on investment. Therefore, the closer the predicted variable is to the long-term potential of the company, the more valuable the predictive model is to the investment institution.
In this regard, an intuitive idea is as follows: after the first financing, companies that obtain two or more rounds of financing have higher potential and value than companies that have received only one round of financing. In other words, we can classify start-ups into three categories based on the number of follow-up rounds a company has received after its initial funding round: low value (no follow-on financing, recorded as class L), medium value (1 follow-on round, recorded as class M), and high value (2 or more rounds of follow-up financing, recorded as class H).
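The three-way labeling is a simple mapping from the number of follow-on rounds to a class. A minimal Python sketch, with the hypothetical function name `value_class`:

```python
def value_class(n_follow_on_rounds):
    """Map the number of follow-on financing rounds to class L, M, or H."""
    if n_follow_on_rounds < 0:
        raise ValueError("number of follow-on rounds cannot be negative")
    if n_follow_on_rounds == 0:
        return "L"  # low value: no follow-on financing
    if n_follow_on_rounds == 1:
        return "M"  # medium value: exactly one follow-on round
    return "H"      # high value: two or more follow-on rounds


classes = [value_class(n) for n in (0, 1, 2, 5)]
```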
It should be noted that this multiclassification problem has certain particularities. On the one hand, the three categories are not semantically independent but have a progressive relationship: in terms of value, class L < class M < class H. On the other hand, from the perspective of investment practice, both the M and H categories should be selected into the project library to be followed up, although the H category deserves more attention. Therefore, misclassification between class M and class H does not incur excessive costs.
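One way to express these two particularities is an asymmetric cost matrix in which confusing M with H is cheap and missing an M or H project entirely is expensive. The Python sketch below picks the class with the lowest expected cost; the cost values and the names `COST` and `min_cost_class` are purely illustrative assumptions, not figures or methods taken from the paper.

```python
CLASSES = ("L", "M", "H")

# COST[true_class][predicted_class]; all values are hypothetical.
COST = {
    "L": {"L": 0.0, "M": 1.0, "H": 1.0},   # wasted review time
    "M": {"L": 5.0, "M": 0.0, "H": 0.5},   # missing an M project is costly
    "H": {"L": 10.0, "M": 0.5, "H": 0.0},  # missing an H project is costliest
}


def min_cost_class(probabilities):
    """Return the predicted class that minimizes the expected cost.

    `probabilities` maps each true class to its estimated probability.
    """
    expected = {
        predicted: sum(probabilities[true] * COST[true][predicted] for true in CLASSES)
        for predicted in CLASSES
    }
    return min(CLASSES, key=lambda predicted: expected[predicted])


# L is the most probable class, but predicting M is cheaper in expectation.
borderline = min_cost_class({"L": 0.6, "M": 0.3, "H": 0.1})
clear_low = min_cost_class({"L": 0.95, "M": 0.04, "H": 0.01})
```

In the borderline case the expected costs are 2.5 for L, 0.65 for M, and 0.75 for H, so the project is kept in the follow-up library even though L is the single most likely class.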
The original decision tree C4.5 model can handle multiclassification problems but does not take the above two particularities into account. Therefore, this section first uses it as a reference model to measure the general classification effect of the algorithm and then discusses how to improve it.
[Figure 1: Visualization of decision tree rules (after pruning). Nodes: CompanyHalo, GoodCoverage, GoodInvestor, GoodFA, WorkRelated, Patent; branches: =0 / =1; leaves: High potential / Low potential.]
Table 5: 0–1 distribution difference of typical elements in all samples and positive samples.
Variable | All samples: 1 | All samples: 0 | Positive samples only: 1 | Positive samples only: 0
CompanyHalo | 319 (29%) | 785 (71%) | 195 (47%) | 219 (53%)
Patent | 583 (53%) | 521 (47%) | 272 (66%) | 142 (34%)
WorkRelated | 474 (43%) | 68 (6%) | 257 (62%) | 18 (4%)
GoodCoverage | 272 (25%) | 832 (75%) | 155 (37%) | 259 (63%)
GoodInvestor | 222 (20%) | 726 (66%) | 132 (32%) | 239 (58%)
GoodFA | 134 (12%) | 970 (88%) | 71 (17%) | 343 (83%)
In terms of the dataset, this section continues to use all the projects in the artificial intelligence and big data industries that received early financing from January 2016 to June 2017, a total of 1104 projects. The independent variables are the same as in the previous models. However, the variable to be predicted is changed to the number of subsequent financing rounds: a value of 0 represents no subsequent financing (class L), a value of 1 represents one subsequent round of financing (class M), and a value of 2 represents 2 or more follow-on financing rounds (class H). Repeating the modeling process with 10-fold cross-validation, the classification performance of the three categories is shown in Tables 6 and 7.
Tables 6 and 7 show that the decision tree C4.5 algorithm is
biased toward the negative samples, which make up the larger
share of the data (i.e., samples that have not obtained
subsequent financing). In total, 84.3% (931/1104) of the
samples were classified into the low-value L class, and
medium-value (M-class) and high-value (H-class) samples were
severely underidentified. Between the M and H classes, the
algorithm also prefers the more conservative M class (134 vs.
39). From the perspective of investment practice, this
classification effect is not satisfactory.
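The figures quoted above can be re-derived from the confusion matrix in Table 7. The following minimal sketch assumes that rows are actual classes and columns are predicted classes, which is consistent with the row sums matching the sample frequencies in Table 6. The helper name per_class_metrics is illustrative only.

```python
LABELS = ["L", "M", "H"]

# Table 7: rows are actual classes, columns are predicted classes (assumed layout)
CONFUSION = [
    [627, 52, 11],   # actual L
    [234, 45, 15],   # actual M
    [70, 37, 13],    # actual H
]

def per_class_metrics(matrix, labels):
    """Share of predictions, precision, and recall for each class."""
    total = sum(sum(row) for row in matrix)
    metrics = {}
    for i, label in enumerate(labels):
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        correct = matrix[i][i]
        metrics[label] = {
            "predicted_share": predicted / total,
            "precision": correct / predicted,
            "recall": correct / actual,
        }
    return metrics

metrics = per_class_metrics(CONFUSION, LABELS)
for label, m in metrics.items():
    print(label, {k: f"{v:.1%}" for k, v in m.items()})
```

The row labeled "Accuracy" in Table 6 corresponds to per-class precision in this computation (correct predictions divided by the number of samples assigned to the class).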
4.3. Scalability Discussion. It is worth noting that the
models designed and tested above consider only the public
information provided by third-party venture capital service
institutions. In practice, however, investment institutions
can also obtain private information through multiple
channels, such as reading project business plans;
communicating with founders, financing consultants, peers,
and colleagues with relevant experience; and obtaining
third-party research data.
The potential value of this private information is
fourfold:
(1) Supplements the missing values of each indicator in
the original model, and corrects the wrong labels.
(2) Adds new observations to the model to improve the
predictive efficiency of the model.
(3) Provides more information about the research
subjects (i.e., startups), such as funding amounts and
valuations, to design proxy variables that are closer
to the company’s potential.
(4) Builds a closed loop between forecasts and obser-
vations earlier than the market, and iterates models
earlier to improve forecasts and gain a competitive
advantage.
Therefore, this section focuses on how to combine public
information with private information to form a dynamically
updated project screening mechanism. Figure 2 presents a
conceptual framework considering practical feasibility.
As shown in the figure, after the public information
about the project is obtained, it can be analyzed and
processed through the following five steps:
(1) Clean and label the data. The labeling dimensions can
include the ten dimensions proposed in this article and
can be expanded as practical experience accumulates.
(2) For projects with relatively complete information,
use decision trees or other machine learning algorithms
to generate predictions, and correct them based on
expert knowledge within the investment institution,
thereby generating two sets: projects worthy of
follow-up and projects not to be followed up for the
time being.
(3) Projects not followed up for the time being enter the
waiting pool together with projects whose information
was previously incomplete. After investors obtain more
information about the team and project, these projects
re-enter the data cleaning process for relabeling.
(4) Projects worthy of follow-up and projects not to be
followed up enter the observation pool together.
Combined with the financing and development information
obtained by investors through industry exchanges,
projects are divided into four categories: TP (true
positive) refers to projects that the institution judged
worthy of follow-up and that showed real potential in
follow-up observation; FP (false positive) refers to
projects that the institution judged worthy of follow-up
but that turned out to have no development potential or
sustainable financing ability; TN (true negative) refers
to projects that the institution decided not to follow
up for the time being and that were later found to have
no potential; and FN (false negative) refers to projects
that the institution decided not to follow up for the
time being but that were later found to have potential.
(5) The misclassified samples (i.e., FP and FN) are placed
in the check pool, and this dataset is used to revise
the previous algorithm.
Table 6: Three-class classification performance based on the decision tree C4.5 model.
                                    L             M             H
Sample frequency and proportion     690 (62.5%)   294 (26.6%)   120 (10.9%)
Number classified into this class   931           134           39
Of which actually in this class     627           45            13
Accuracy (precision)                67.3%         33.6%         33.3%
Recall                              90.9%         15.3%         10.8%
Table 7: Confusion matrix based on the decision tree C4.5 model (rows: actual class).
Actual class   Classified into L   Classified into M   Classified into H
L              627                 52                  11
M              234                 45                  15
H              70                  37                  13
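The five-step screening loop of Section 4.3 can be sketched as a small data flow between the waiting, observation, and check pools. This is a minimal sketch under stated assumptions rather than the authors' implementation: the Project and Pools classes, the screen and review functions, and the toy rule standing in for the trained decision tree are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Project:
    name: str
    features: dict                     # indicator -> 0/1, or None if missing
    predicted: Optional[bool] = None   # model/expert call: worth follow-up?
    observed: Optional[bool] = None    # later outcome: real potential?

@dataclass
class Pools:
    waiting: list = field(default_factory=list)
    observation: list = field(default_factory=list)
    check: list = field(default_factory=list)

def screen(projects, required, predict, pools):
    """Steps 2-4: predict on complete projects, park the rest."""
    for p in projects:
        if any(p.features.get(k) is None for k in required):
            pools.waiting.append(p)      # incomplete information
            continue
        p.predicted = predict(p.features)
        pools.observation.append(p)      # both outcomes are observed later
        if not p.predicted:
            pools.waiting.append(p)      # not followed up for now

def outcome(p):
    """Classify an observed project as TP, FP, TN, or FN."""
    if p.predicted:
        return "TP" if p.observed else "FP"
    return "FN" if p.observed else "TN"

def review(pools):
    """Step 5: collect misclassified projects (FP, FN) in the check pool."""
    for p in pools.observation:
        if p.observed is None or p in pools.check:
            continue
        if outcome(p) in ("FP", "FN"):
            pools.check.append(p)
    return pools.check

# Hypothetical stand-in for the trained model, not the tree from the paper.
def toy_rule(features):
    return bool(features["CompanyHalo"] or features["Patent"])

required = ["CompanyHalo", "Patent"]
projects = [
    Project("A", {"CompanyHalo": 1, "Patent": 0}),
    Project("B", {"CompanyHalo": 0, "Patent": 0}),
    Project("C", {"CompanyHalo": None, "Patent": 1}),
]
pools = Pools()
screen(projects, required, toy_rule, pools)
projects[0].observed = False   # A was followed up but showed no potential
projects[1].observed = True    # B was skipped but later showed potential
review(pools)
print([(p.name, outcome(p)) for p in pools.check])
```

In a fuller version, the check pool would be fed back as additional training data and the waiting pool re-screened whenever new private information fills in the missing indicators.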
5. Conclusion
This paper mainly studies the application of the machine
learning method represented by a decision tree in early
entrepreneurial project screening. The research process is
divided into four steps.
The first step is to establish a prediction model that can
be used in investment practice. By processing a real
investment and financing event dataset in the field of
artificial intelligence and big data, this paper extracts two
kinds of signals, the endorsement of relevant parties and
team ability, with a total of 10 indicators. After
establishing basic models on each of the two types of
indicators and verifying their prediction effectiveness, this
paper combines all indicators to test whether the fusion
model is significantly improved compared with the basic
models. Then, considering that in practice investors are more
interested in projects that can continuously obtain financing
(i.e., positive samples), this paper introduces a heuristic
misclassification cost matrix to adjust the weight of the two
types of samples and verifies how much the improved model
can raise the recall rate of positive samples.
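To illustrate how a misclassification cost matrix shifts decisions toward positive samples, the sketch below applies a minimum-expected-cost rule to a predicted probability. This is one common way to use a cost matrix and is not necessarily the reweighting mechanism used in the paper. The cost values (a missed positive costing three times a false alarm) are hypothetical and chosen only for illustration.

```python
# COST[actual][predicted]; values are hypothetical, not taken from the paper.
ASYMMETRIC = {
    "pos": {"pos": 0.0, "neg": 3.0},   # missing a high-potential project
    "neg": {"pos": 1.0, "neg": 0.0},   # following up a weak project
}
SYMMETRIC = {
    "pos": {"pos": 0.0, "neg": 1.0},
    "neg": {"pos": 1.0, "neg": 0.0},
}

def min_cost_label(p_pos, cost):
    """Return the label with the lowest expected misclassification cost."""
    probs = {"pos": p_pos, "neg": 1.0 - p_pos}
    expected = {
        predicted: sum(probs[actual] * cost[actual][predicted] for actual in probs)
        for predicted in ("pos", "neg")
    }
    return min(expected, key=expected.get)

# A leaf that estimates a 30% chance of sustained financing:
print(min_cost_label(0.30, SYMMETRIC))    # equal costs
print(min_cost_label(0.30, ASYMMETRIC))   # missed positives cost more
```

With equal costs the decision threshold sits at a probability of 0.5; with the 3:1 costs assumed here it drops to 0.25, so more borderline projects are flagged as positive, which raises recall on positive samples at the expense of more false positives.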
The second step is to interpret the law revealed by the
model and to explore whether this law is still applicable over
time and whether it can be applied to project screening in
similar industries. The selected model is the decision tree
C4.5 model with more interpretable rules. By interpreting
the top-down judgment path of the decision tree, this paper
extracts general rules for guiding project selection.
The third step is to extend the decision tree model from
the binary classification scenario of positive and negative
samples to the multiclassification scenario of project value
segmentation to improve the usefulness of the model in
investment practice.
Finally, in view of the private information available to
investment institutions in reality, this paper discusses how to
incorporate private information into an iterative analysis
process based on the model constructed from public in-
formation to achieve a more accurate prediction of high-
potential projects.
The implications of this paper for management are as
follows:
(1) The value of three team ability indicators in
predicting a company's potential, namely, a working
background in a well-known company, patents, and work
experience highly related to the venture, is supported
by the data. When communicating with entrepreneurs or
conducting project surveys, investors should verify the
authenticity and basis of these indicators against
additional information sources. In addition, based on
founders' working experience and the subsequent
development performance of their companies, a more
accurate basis can be established for defining the scope
of well-known companies, the quality and quantity of
patents, and the relevance of work experience to the
venture, so as to continuously improve the screening
and judgment of projects.
(2) The optimal model and cost matrix settings differ
across classification objectives and across data
dimensions. In practice, investors should fully consider
their own classification needs and the availability of
external data and then select the model with the best
classification effect. In addition, when model
interpretability is not a priority, it is also an option
to introduce algorithms such as Bayesian networks and
random forests, or to use ensemble learning methods such
as Bagging and Boosting, to improve model prediction.
(3) On the issue of early project screening, the dimension
of relevant-party endorsement, which has received less
attention from the academic community, actually has high
signal value. This is also consistent with the basic
common sense of the investment industry. Such experience
should be incorporated into the construction of project
screening mechanisms based on machine learning methods.
[Figure 2 flowchart. Node labels: Open letter; Data cleaning and labeling; Projects with complete information; Projects with incomplete information; Algorithm; Worthy of follow-up (TP, FP); Not followed up (TN, FN); Private information; Expert knowledge; Team & project information; Financing information; Correct; Check pool; Observation pool; Waiting pool.]
Figure 2: Project screening framework that integrates private information and public information.
Data Availability
The dataset can be accessed upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] C. Carpentier and J.-M. Suret, “Angel group members’ de-
cision process and rejection criteria: a longitudinal analysis,”
Journal of Business Venturing, vol. 30, no. 6, pp. 808–821, 2015.
[2] P. Klimas and W. Czakon, “Organizational innovativeness
and coopetition: a study of video game developers,” Review of
Managerial Science, vol. 12, pp. 469–497, 2018.
[3] D. Guo and K. Jiang, “Venture capital investment and the
performance of entrepreneurial firms: evidence from China,”
Journal of Corporate Finance, vol. 22, pp. 375–395, 2013.
[4] J. Hoyos-Iruarrizaga, A. Fern´
andez-Sainz, and M. Saiz-Santos,
“High value-added business angels at post-investment stages:
key predictors,” International Small Business Journal, vol. 35,
pp. 949–968, 2017.
[5] R. Sørheim, “e pre-investment behaviour of business an-
gels: a social capital approach,” Venture Capital, vol. 5, no. 4,
pp. 337–364, 2003.
[6] D. Kirsch, B. Goldfarb, and A. Gera, “Form or substance: the
role of business plans in venture capital decision making,”
Strategic Management Journal, vol. 30, no. 5, pp. 487–515,
2009.
[7] M. Jansson and A. Biel, “Investment institutions’ beliefs about
and attitudes toward socially responsible investment (SRI): a
comparison between SRI and non-SRI management,” Sus-
tainable Development, vol. 22, pp. 33–41, 2014.
[8] M. Brahim and H. Rachdi, “Foreign direct investment, in-
stitutions and economic growth: evidence from the MENA
region,” Journal of Reviews on Global Economics, vol. 3,
pp. 328–339, 2014.
[9] G. D. Leeves and R. Herbert, “Gender differences in social
capital investment: theory and evidence,” Economic Model-
ling, vol. 37, pp. 377–385, 2014.
[10] J. Peir´o-Palomino and E. Tortosa-Ausina, “Social capital,
investment and economic growth: some evidence for Spanish
provinces,” Spatial Economic Analysis, vol. 10, pp. 102–126,
2015.
[11] T. Astebro, “Key success factors for technological entrepre-
neurs’ R&D projects,” IEEE Transactions on Engineering
Management, vol. 51, pp. 314–321, 2004.
[12] A. Croce, J. Mart´
ı, and S. Murtinu, “e impact of venture
capital on the productivity growth of European
entrepreneurial firms: “Screening” or “value added” effect?”
Journal of Business Venturing, vol. 28, pp. 489–510, 2013.
[13] J. Argerich, E. Hormiga, and J. Valls-Pasola, “Financial ser-
vices support for entrepreneurial projects: key issues in the
business angels investment decision process,” Service Indus-
tries Journal, vol. 33, pp. 9-10, 2013.
Scientific Programming 9