Research Article
Application of Decision Tree Algorithm in Early Entrepreneurial
Project Screening
Yu Min Wang and Lin Xue
School of Innovation, Entrepreneurship and Creation, Minjiang University, Fuzhou, Fujian 350108, China
Correspondence should be addressed to Lin Xue; 2184@mju.edu.cn
Received 26 January 2022; Revised 19 February 2022; Accepted 1 March 2022; Published 31 March 2022
Academic Editor: Tongguang Ni
Copyright © 2022 Yu Min Wang and Lin Xue. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Venture capital firms always face insufficient information and insufficient time when evaluating whether startups are worth investing in. This paper focuses on how to combine the public information of startups with the decision tree algorithm to assist investors in project screening. By extracting the public information of 1104 AI and big data companies from January 2016 to June 2017, together with their financing progress in the following 18 months, this paper finds that six indicators (having a working background in a well-known company, being reported by well-known media, having patents, being invested by excellent institutions, having previous work experience highly related to the venture, and having well-known financing consultants) can effectively help investors screen out projects that can obtain sustainable financing, that is, high-potential projects. Considering the problem of data availability in the real world, and drawing on industry experience, this paper carries out more detailed variable mining on semistructured information, such as the team resumes of start-ups, and selects a decision tree algorithm that is insensitive to missing values and has strong interpretability to amplify the value of fragmented information. Finally, by improving the algorithm, a project screening model that meets the needs of investment practice is designed, and the fusion of public and private information is discussed, providing fuller guidance for optimizing investment work.
1. Introduction
Venture capital is a form of investment that provides
capital for start-ups with high growth potential in ex-
change for equity. Massive entrepreneurial projects and
high investment costs pose a great challenge to fund
managers. How to screen as many high-quality entre-
preneurial companies in the market as possible under the
condition of limited manpower and time, identify their
development potential, and provide support is the first
kind of problem faced by investors, i.e., time constraints.
Another consensus is that there is a serious lack of in-
formation in the field of early investment. In the face of
newly established companies, investors often cannot
obtain hard data about the market, products, and finance,
and they can only make decisions under imperfect in-
formation in combination with their own experience,
industry estimates, and the judgment of the start-up team.
In this process, which factors should be given a higher
weight, and how to reduce the components of subjective
assumptions and make decisions more rational are the
second kind of problems faced by investors, i.e., infor-
mation constraints.
Facing the problem of insufficient time and information, an intuitive solution is to let machines assist people in collecting and processing more information, encode decision-making knowledge in the form of algorithms, and rationalize the investment process. However, a series of questions, such as which channels to collect from, what information to extract, what indicators to refine, and what feedback to use to revise the model, need some definition before they can be effectively answered. Therefore, this paper isolates the early project screening step from the whole process of investment activities and simplifies the problem as follows:
Assume that an investment institution has obtained the
profile of an early-stage company disclosed in the venture
capital media, including the endorsement information of
relevant parties, such as “past investors,” and the infor-
mation of team characteristics, such as “profile of team
members.” Can this institution predict the company’s po-
tential to obtain the next round of financing based on the
financing development of related companies in the same
industry over the past period of time? Finally, decide if it is
worth the time to research and follow up on the investment.
The two types of input required to answer this question have been extensively studied in academia.
In terms of team characteristics, the positive effects of
industry-related experience [1, 2], entrepreneurial experi-
ence [1, 3, 4], and well-known company work background
[5, 6] on the development of startups have been supported by
most of the literature. However, studies on demographic factors are currently scarce. In the dimension of stakeholders, the endorsement of investment institutions [7, 8] and social capital [9, 10] have a positive effect on the development of the company. The value of financing consultants, venture capital media, and other relevant parties has not been fully explored.
Therefore, this paper attempts to apply academic research on venture capital decision-making and the factors influencing startups to real investment activities and, based on semistructured public datasets, to extract valuable traditional indicators from as many dimensions as possible, along with novel indicators, and to test their predictive power.
2. Data Preparation
2.1. Data Sources and Extraction Methods. Since the potential of a startup is a latent variable, the effectiveness of the predictive model needs to be validated against explicit indicators. Therefore, this paper selects "whether the company obtains the next round of financing," which is highly related to the company's potential, as a proxy variable to test the effect of the "team capability signal" and the "relevant party endorsement signal" in building a predictive model. Considering that companies that have never raised funding have less publicly available information, we restrict our research to startups that have raised at least one round of funding [11].
In practice, different industries have different investment analysis frameworks, and timing and market enthusiasm have a significant impact on how difficult it is for entrepreneurial projects to obtain financing. Therefore, this paper first controls the industry and time window, selecting all investment and financing events in the field of artificial intelligence and big data from January 2016 to December 2018, a total of 2527 events. The source of the data is Xinniu Data (https://www.xiniudata.com), a third-party venture capital service platform whose institutional account provides a data export function.
The normal financing rhythm for startups is to complete a new round of financing every 12–18 months. Based on this industry experience, this paper splits the dataset and selects all projects that received early (round A and earlier) financing between January 2016 and June 2017. After deduplication, a total of 1104 independent companies entered the sample set. Whether these companies received a new round of funding between July 2017 and December 2018 is the variable to be predicted.
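As an illustration of how this label could be derived from the exported event list, the sketch below assumes a pandas table with hypothetical columns company_id, round, and date; the actual export format of the platform is not specified here, and the early-round filter is omitted for brevity.

```python
import pandas as pd

# Hypothetical event table: one row per disclosed financing event.
events = pd.DataFrame({
    "company_id": ["c1", "c1", "c2", "c3"],
    "round":      ["Angel", "A", "A", "Pre-A"],
    "date": pd.to_datetime(["2016-03-01", "2017-10-15", "2016-08-20", "2017-05-05"]),
})

# Sample window: rounds disclosed between January 2016 and June 2017.
in_window = events[(events["date"] >= "2016-01-01") & (events["date"] <= "2017-06-30")]
sample = in_window.sort_values("date").groupby("company_id", as_index=False).first()

# Label: did the company raise again between July 2017 and December 2018?
follow_on = events[(events["date"] >= "2017-07-01") & (events["date"] <= "2018-12-31")]
sample["next_round"] = sample["company_id"].isin(follow_on["company_id"]).astype(int)
print(sample[["company_id", "next_round"]])
```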
2.2. Data Field Description. The dataset collected in this
paper contains 17 original fields: company name, industry
field, one-sentence introduction, the current round of fi-
nancing time, current round, the current round of financing
amount, the current round of investor, company intro-
duction, contact information, establishment time, region,
company advantages, industry label, team members and
introduction, activity performance, and business informa-
tion [12]. After desensitizing the company name to the
company ID, there are 4 remaining fields related to this
modeling, all of which are semistructured or unstructured
information.
By manually labeling the original dataset, we split the above four fields into two categories, with a total of 10 binary variables. See Table 1 for the specific field extraction and processing.

Table 1: Field extraction and processing.

Field classification | Processed field name | Data source
Relevant party endorsement signal | Invested by excellent institutions | Investors in the current round
Relevant party endorsement signal | Well-known financing advisors (FA) | Company advantage
Relevant party endorsement signal | Well-known media reports | Company advantage
Team ability signal | Have patents | Company advantage + team members and profiles
Team ability signal | Working background in a well-known company | Team members and profiles
Team ability signal | Core members have a well-known university education background | Team members and profiles
Team ability signal | Core members have entrepreneurial experience | Team members and profiles
Team ability signal | The members have the same school or work experience | Team members and profiles
Team ability signal | The core team has diverse professional skills | Team members and profiles
Team ability signal | Previous work is highly relevant to this venture | Team members and profiles + company profile

The structured processing of the information and the handling of missing values are explained as follows (a code sketch of this labeling appears after the list):
(1) Invested by excellent institutions (GoodInvestor): If the investors in the current round include an excellent investment institution, the variable is 1; otherwise, it is 0. An excellent institution is defined as one for which more than 45% of the projects it invested in during 2015 obtained the next round of financing (this proportion is an empirical value and can be adjusted in practice). The information comes from the venture capital institution database of Enniu. If the current round's investors are not disclosed, the variable is a missing value, marked with "?"
(2) Have well-known financing advisors (GoodFA): The variable is 1 if the company advantage field contains "well-known FA"; otherwise, it is 0. Enniu Data defines well-known financing consultants as recognized financing advisors in the industry, such as Huaxing Capital and Xiaofan Table (this range can be dynamically adjusted in practice). There are no missing values for this item.
(3) Well-known media reports (GoodCoverage): The variable is 1 if the company advantage field contains "famous media coverage," and 0 otherwise. Enniu Data assigns this label when platforms such as Pencil Road and Entrepreneurship have run an exclusive interview with the target company, rather than a simple financing news disclosure. In practice, this range can be dynamically adjusted. There are no missing values for this item.
(4) Have patents (Patent): The variable is 1 if the company or its members hold patents, otherwise 0. Here, patents refer only to the company's invention patents, excluding trademarks and the like. There are no missing values for this item.
(5) Working background in a well-known company (CompanyHalo): If the team members' profiles mention well-known Internet companies (Tencent, Baidu, Alibaba, JD, Xiaomi, NetEase, 360, Sina, Google, etc.) or technology enterprises (Microsoft, IBM, Huawei, Lenovo, SAP, Oracle, etc.), the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as "?"
(6) Core members have a well-known university education background (EduHalo): If the team members' profiles mention study experience at well-known universities at home or abroad, the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as "?"
(7) Core members have entrepreneurial experience (StartupExps): If serial entrepreneurship is mentioned in the team members' profiles, the variable is 1; otherwise, it is 0. If the source field is missing, it is marked as "?"
(8) The members have the same school or work experience (CoworkExps): If it can be inferred from the team members' profiles that the core members studied at the same university or worked at the same company, the variable is 1. If there is no obvious intersection in their main experience, it is 0. If the source field is missing or difficult to judge, it is marked as "?"
(9) The core team has diverse professional skills (DiverseExps): If the core members' experience, as described in the team members' profiles, covers at least two of technology, product, management, marketing, and finance, the variable is 1. If the backgrounds of the core members are similar (for example, both the CEO and CTO come from an R&D background), the variable is 0. If the source field is missing or difficult to judge, it is marked as "?"
(10) Previous work is highly relevant to this venture (WorkRelated): If the work background mentioned in the team members' profiles is highly correlated with the industry or business of the start-up company, the variable is 1; if it is obviously irrelevant, it is 0. If the source field is missing or difficult to judge, it is marked as "?"
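As referenced above, a minimal sketch of this labeling is given below. The keyword list and the "?" missing-value marker follow the conventions in this section, while the function names and matching logic are simplified assumptions of ours (in practice the labeling was done manually).

```python
KNOWN_COMPANIES = ["Tencent", "Baidu", "Alibaba", "JD", "Xiaomi", "NetEase",
                   "360", "Sina", "Google", "Microsoft", "IBM", "Huawei",
                   "Lenovo", "SAP", "Oracle"]

def company_halo(team_profile_text):
    """CompanyHalo: 1 if any well-known company appears in the team profile,
    0 otherwise, '?' if the source field itself is missing."""
    if not team_profile_text:          # source field missing
        return "?"
    text = team_profile_text.lower()
    return int(any(name.lower() in text for name in KNOWN_COMPANIES))

def good_fa(company_advantage_tags):
    """GoodFA: 1 if the 'well-known FA' tag is present; this field has no missing values."""
    return int("well-known FA" in (company_advantage_tags or []))

print(company_halo("CTO previously led a search team at Baidu"))   # 1
print(company_halo(None))                                           # ?
print(good_fa(["well-known FA", "famous media coverage"]))          # 1
```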
3. Research Design and Model Validation
To make relatively independent judgments and comparisons of the predictive value of the two types of signals, this paper first constructs two basic models using the relevant party endorsement signal and the team ability signal, respectively, and compares the Bayesian network, decision tree C4.5, and random forest algorithms. Then, the two kinds of signals are fused to build a comprehensive model, and the improvement in key indicators is tested. Finally, considering the significant difference between the costs of "rejecting the true" and "accepting the false" in investment practice, this paper introduces a cost matrix to improve the model's recall rate for high-potential projects.
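The experiments in this paper are run in Weka, where cost sensitivity is typically added by wrapping the base learner with a misclassification cost matrix; the exact cost values are not listed here. As a rough scikit-learn analogue of the same idea, one can weight the positive class more heavily so that missing a fundable project costs more than reviewing an extra one; the 5:1 ratio below is purely an assumed illustration, not the paper's setting.

```python
from sklearn.tree import DecisionTreeClassifier

# Approximate stand-in for a cost-sensitive C4.5: weight the positive class
# (projects that will raise a next round) more heavily than the negative class.
cost_sensitive_tree = DecisionTreeClassifier(
    class_weight={0: 1, 1: 5},   # assumed ratio: false negatives treated as 5x as costly
    min_samples_leaf=20,         # coarse substitute for C4.5's pruning
    random_state=0,
)
# cost_sensitive_tree.fit(X, y) would then favor recall on the positive class.
```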
3.1. Basic Model 1: Based on Endorsement Signals of Related Parties. We first select the endorsement signals of three interested parties (invested by excellent institutions, well-known financing consultants, and well-known media reports) as the model input and take whether the company obtains the next round of financing within 18 months after the first disclosure of financing information as the classification label (yes is 1, no is 0). There are 1104 records in the complete dataset, of which 414 have a next round of financing (defined as positive samples), accounting for 37.5%.
Under ten repetitions of 10-fold cross-validation, the average performance of the Bayesian network, decision tree C4.5, and random forest algorithms was tested. The experimental environment is the open-source data mining software Weka, and the algorithms use their default parameters. Given the problem abstracted from investment practice, namely how to select a small proportion of samples that covers as many of the real positive samples as possible, we give the overall classification performance in Table 2 as a reference but focus
on comparing the recall rate and precision rate of positive samples (i.e., continuously financed projects).

Table 2: Basic model 1: endorsement signals of related parties.

Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 64.2% | 63.2% | 61.6%
Overall recall | 65.8% | 64.8% | 63.7%
Number of samples judged positive | 210 | 281 | 256
Actual positive samples among them | 123 | 153 | 135
Positive sample precision | 58.7% | 54.6% | 52.5%
Positive sample recall | 29.7% | 36.9% | 32.5%
It can be seen from the above table that the Bayesian
network is the model with the highest positive sample
precision, and the decision tree C4.5 is the model with the
highest positive sample recall rate.
Although the decision tree C4.5 model is slightly lower than the Bayesian network in precision, it correctly identifies, on average, 30 more projects that go on to obtain the next round of financing (153 vs. 123), while requiring only 71 additional projects to be reviewed (281 vs. 210), which has significant value in practice. For investors, compared with the original 1,104 projects, the number of projects to review is reduced to 281, which significantly saves working time (a compression rate of roughly 75%). Covering about 24% more high-potential projects in exchange for some additional project-viewing time is a worthwhile trade, and the recall of high-potential projects is the major consideration here.
In summary, decision tree C4.5 is the best performing
model for this requirement.
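The comparison above was produced in Weka with default parameters; the sketch below re-creates a similar protocol (repeated 10-fold cross-validation, reporting positive-class precision and recall) in scikit-learn. Note the stand-ins: scikit-learn provides a CART tree rather than C4.5/J48 and naive Bayes rather than a general Bayesian network, and the data here are synthetic placeholders for the encoded indicator matrix.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 1104 projects, three binary endorsement signals, and a
# synthetic label loosely tied to them (roughly the 37.5% positive base rate).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1104, 3))
y = (X.sum(axis=1) + rng.normal(0, 1, size=1104) > 1.8).astype(int)

models = {
    "Naive Bayes (stand-in for the Bayesian network)": BernoulliNB(),
    "CART tree (stand-in for C4.5/J48)": DecisionTreeClassifier(min_samples_leaf=20, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Ten repetitions of stratified 10-fold cross-validation, as in the Weka runs.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall"])
    print(f"{name}: accuracy={scores['test_accuracy'].mean():.3f}, "
          f"positive precision={scores['test_precision'].mean():.3f}, "
          f"positive recall={scores['test_recall'].mean():.3f}")
```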
3.2. Basic Model 2: Based on Team Ability Signal. In basic model 2, we select the seven founding team ability signals as the model input: having patents, working background in a well-known company, core members having an educational background at well-known universities at home or abroad, core members having entrepreneurial experience, members having the same school or work experience, the core team having diverse professional skills, and previous work being highly related to this venture. We then repeat the experimental process described above.

Table 3: Basic model 2: team ability signal.

Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 68.3% | 68.0% | 68.2%
Overall recall | 69.3% | 68.9% | 69.0%
Number of samples judged positive | 286 | 249 | 237
Actual positive samples among them | 180 | 160 | 154
Positive sample precision | 63.0% | 64.2% | 65.0%
Positive sample recall | 43.5% | 38.6% | 37.3%
It can be seen from Table 3 that random forest is now the model with the highest positive sample precision, while the Bayesian network is the model with the highest positive sample recall. The differences between models in precision are slightly smaller than the differences in recall. In addition, compared with basic model 1, all three algorithms improve on the key indicators. Among them, the positive sample recall of the Bayesian network improves the most (43.5% vs. 29.7%), and the positive sample precision of the random forest algorithm improves the most (65.0% vs. 52.5%).
A possible explanation is that, compared with the related party endorsement signal, the team capability signal has richer dimensions but also a higher proportion of missing values. Under these dataset characteristics, the Bayesian network can learn rules that exist in a small number of samples from combinations of indicators, whereas the pruned decision tree algorithm is more inclined to learn the negative samples, which account for a higher proportion of the dataset. As a result, although the selected sample is smaller (the compression rate is above 77%), its ability to recall positive samples is weak.
To sum up, in investment practice, if one encounters a
situation with rich attributes but many missing values, the
Bayesian network may be the better choice among the three.
3.3. Fusion Model. For basic models 1 and 2, the sample compression and precision are acceptable in practice; however, even the best model does not exceed 44% recall on positive samples. For investment institutions, the cost of missing star projects is huge. Only when the recall rate of positive samples is high enough does the model have application value.
To this end, we fuse the lower-dimensional endorsement signal of related parties with the higher-dimensional team capability signal as the model input and repeat the above modeling and testing process. The results are summarized in Table 4.

Table 4: Fusion model.

Metric | Bayesian network | Decision tree C4.5 | Random forest
Overall accuracy | 69.7% | 67.6% | 67.0%
Overall recall | 70.5% | 68.5% | 67.9%
Number of samples judged positive | 327 | 339 | 332
Actual positive samples among them | 207 | 203 | 196
Positive sample precision | 63.5% | 59.7% | 59.0%
Positive sample recall | 50.1% | 48.9% | 47.4%
It can be seen from the above table that, compared with basic models 1 and 2, all three algorithms show a large improvement in the recall rate of positive samples, which verifies the value of combining the two types of indicators for improving the predictive effectiveness of the model. Among them, the Bayesian network performs best, decision tree C4.5 is in the middle, and random forest comes last. Compared with model 2, the recall
rate of positive samples for decision tree C4.5 is improved by
10 percentage points.
This is of great significance for investment practice. Take decision tree C4.5 as an example. Its practical meaning is as follows: after the 1104 projects are screened by the model, 339 are judged likely to obtain the next round of financing, of which 203 actually do. From the perspective of project coverage, investors who screen projects this way can cover 48.9% of high-potential projects with 30.7% of the workload. From the perspective of time cost, investors can spend 59.7% of their working time on valuable projects, a significant improvement over the positive sample rate of 37.5% obtained when looking at projects at random.
However, it is worth noting that even under the optimal
model, half of the high potential projects will not be screened
out by the algorithm. By tracing back the characteristics of
samples and their classification results, it can be inferred that
there are three reasons, which are as follows:
(1) Incomplete information. For example, 72 positive
samples lack team members’ profiles and lack pos-
itive signals in the dimension of endorsement of
related parties, which objectively limits the upper
limit of the prediction ability of the model.
(2) The positive and negative sample proportions are uneven. Negative samples account for 62.5% of the dataset, 1.7 times the share of positive samples, which makes a model whose goal is to reduce the error rate more inclined to learn how to discriminate negative samples, while the mining of positive samples is insufficient.
(3) There are unobserved influencing factors. Important factors affecting a company's ability to secure its next round of funding may not be limited to the 10 indicators used in the model.
4. Model Discussion and Extension
4.1. Factor Analysis and Rules Summary. This paper designs and validates a predictive model that can be used for early-stage project screening, but in investment practice, prediction results alone are not enough to support project screening. Investors need to understand the decision-making basis behind the model so that they can cross-check it against their past investment experience, correct possible wrong judgments, or add new screening rules. At the same time, investors also need to know which factors contribute most to the prediction, so that they can consciously collect this information when contacting entrepreneurs or improve the information quality of these dimensions through other channels.
Considering the interpretability of the rules, this section selects the pruned decision tree C4.5 model as the analysis object, as shown in Figure 1.

Figure 1: Visualization of decision tree rules (after pruning). The tree splits on CompanyHalo at the root and further on GoodCoverage, Patent, GoodInvestor, WorkRelated, and GoodFA; each leaf is labeled high potential or low potential.
As can be seen from the figure, six indicators appear in the four layers of the decision tree: three team capability signals (working background in a well-known company, patents, and previous work experience highly related to this venture) and three relevant party endorsement signals (reported by well-known media, invested by excellent institutions, and well-known financing consultants). They are the most discriminative indicators for judging whether a project can obtain the next round of financing [3].
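If a comparable tree were fitted with scikit-learn (a CART stand-in for Weka's J48/C4.5), the fitted rules could be dumped as text for this kind of inspection; the data below are random placeholders for the encoded indicators, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["CompanyHalo", "GoodCoverage", "Patent",
                 "GoodInvestor", "WorkRelated", "GoodFA"]

# Random placeholders for the encoded indicator matrix and next-round label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1104, len(feature_names)))
y = (rng.random(1104) < 0.375).astype(int)

# A shallow tree (max_depth=4 mirrors the four layers discussed in this section).
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=30, random_state=0).fit(X, y)

# export_text prints the nested splits with the predicted class at each leaf,
# giving the kind of readable rule set investors can inspect and challenge.
print(export_text(clf, feature_names=feature_names))
```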
There are also observable differences in the 0–1 distributions of these indicators between the whole sample and the positive samples, as shown in Table 5 (for simplicity, the proportion of missing values is omitted). Except for "well-known financing consultants," the proportion of each of the remaining five indicators taking the value 1 is 12–19 percentage points higher in the positive samples than in the whole sample.

Table 5: 0–1 distribution of typical indicators in all samples and in positive samples only (counts and within-group proportions; missing values omitted).

Variable | All samples: 1 | All samples: 0 | Positive samples: 1 | Positive samples: 0
CompanyHalo | 319 (29%) | 785 (71%) | 195 (47%) | 219 (53%)
Patent | 583 (53%) | 521 (47%) | 272 (66%) | 142 (34%)
WorkRelated | 474 (43%) | 68 (6%) | 257 (62%) | 18 (4%)
GoodCoverage | 272 (25%) | 832 (75%) | 155 (37%) | 259 (63%)
GoodInvestor | 222 (20%) | 726 (66%) | 132 (32%) | 239 (58%)
GoodFA | 134 (12%) | 970 (88%) | 71 (17%) | 343 (83%)
Combining Figure 1 with the classification results output by the decision tree model, the following four representative investment logics can be summarized:
(1) 76% of entrepreneurial teams with a working
background in well-known companies can obtain the
next round of financing (characteristics: strong team
ability signal).
(2) 69% of the teams without working background of
well-known companies but with well-known media
reports and patents can obtain the next round of
financing (characteristics: screened by stakeholders
and strong technical ability of the team).
(3) 64% of the teams that have no working background
in well-known companies and no well-known media
reports but have been invested by excellent insti-
tutions and whose previous work is highly related to
this entrepreneurship can obtain the next round of
financing (characteristics: they have been screened
by stakeholders and their team ability matches the
entrepreneurship).
(4) 55% of the entrepreneurial teams without a working background in well-known companies, without well-known media reports, and without investment from excellent institutions, but endorsed by well-known financing consultants, can obtain the next round of financing. If the enterprise does not even have the signal of a well-known financing consultant, there is a 70% probability that it will not be able to obtain the next round of financing (characteristics: when team ability signals are absent, stakeholder signals can still play a certain screening role).
From the above typical rules, it can be seen that a working background in a well-known company is the most discriminating signal. If a team lacks this feature, it needs at least one of the other team ability signals or relevant party endorsement signals to have a higher probability of obtaining the next round of financing.
Translated into the language of investment practice: if the team has outstanding highlights in its work history, it is worth investors' research and follow-up; if the team is average, it needs investment from other excellent institutions or coverage by well-known media before it is worth considering.
4.2. Multiclassification Problem. The so-called multiclassification problem refers to using an empirical loss function to predict classification results across more than two categories [13]. From start-up to listing, companies usually need to go through four or more rounds of financing (angel round, round A, round B, and so on up to pre-IPO rounds).
However, companies that receive continuous financing in the early stage may still fail to go public or be acquired. For investment institutions, if an investment cannot be successfully exited, there is no return on investment. Therefore, the closer the predicted variable is to the long-term potential of the company, the more valuable the predictive model is to the investment institution.
In this regard, an intuitive idea is as follows: after the first financing, companies that obtain two or more further rounds of financing have higher potential and value than companies that obtain only one. In other words, we can classify start-ups into three categories based on the number of follow-on rounds a company receives after its initial funding round: low value (no follow-on financing, recorded as class L), medium value (one follow-on round, recorded as class M), and high value (two or more follow-on rounds, recorded as class H).
It should be noted that this multiclassification problem has certain particularities. On the one hand, the three categories are not semantically independent but have a progressive relationship: in terms of value, class L < class M < class H. On the other hand, from the perspective of investment practice, both the M and H categories should be selected into the project library for follow-up, although the H category deserves more attention. Therefore, misclassification between class M and class H does not incur excessive costs.
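One way to encode these two particularities (ordered value and cheap M-versus-H confusion) is an asymmetric misclassification cost matrix applied to class probabilities at prediction time. The cost values below are illustrative assumptions of ours, not figures from the paper.

```python
import numpy as np

# Rows: true class, columns: predicted class, order (L, M, H).
# Missing a high-value project (true H predicted L) is the most expensive cell;
# confusing M with H is nearly free. These costs are assumed for illustration.
cost = np.array([
    [0.0, 1.0, 1.0],   # true L
    [4.0, 0.0, 0.5],   # true M
    [8.0, 1.0, 0.0],   # true H
])

def min_expected_cost(proba):
    """Pick, for each sample, the class whose expected cost is lowest.
    proba: array of shape (n_samples, 3) of class probabilities (L, M, H)."""
    return np.argmin(proba @ cost, axis=1)

# Example: a sample the model thinks is probably L but possibly H
# is pushed away from the conservative L prediction by the asymmetric costs.
print(min_expected_cost(np.array([[0.6, 0.2, 0.2]])))   # prints [2], i.e., class H
```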
The original decision tree C4.5 model can handle multiclassification problems but does not pay particular attention to the two issues above. Therefore, this section first uses it as a reference model to measure the general classification performance of the algorithm and then discusses how to improve it.
In terms of the dataset, this section keeps all the projects in the artificial intelligence and big data industries that received early financing from January 2016 to June 2017, a total of 1104 projects. The independent variables are the same as before, but the predicted variable is changed to the number of subsequent financings: 0 represents no subsequent financing (class L), 1 represents one subsequent round (class M), and 2 represents two or more follow-on rounds (class H). Repeating the modeling process and 10-fold cross-validation, the classification performance on the three categories is shown in Tables 6 and 7.

Table 6: Three-class classification performance based on the decision tree C4.5 model.

Metric | L | M | H
Sample frequency and proportion | 690 (62.5%) | 294 (26.6%) | 120 (10.9%)
Number of samples classified into this class | 931 | 134 | 39
Of which actually belong to this class | 627 | 45 | 13
Precision | 67.3% | 33.6% | 33.3%
Recall | 90.9% | 15.3% | 10.8%

Table 7: Confusion matrix based on the decision tree C4.5 model.

True class | Classified into L | Classified into M | Classified into H
L | 627 | 52 | 11
M | 234 | 45 | 15
H | 70 | 37 | 13
It can be seen from the above tables that the decision tree C4.5 algorithm is biased toward the negative samples, which account for the larger share (i.e., samples that obtained no subsequent financing): 84.3% (931/1104) of the samples were classified into the low-value L class, and medium-value (class M) and high-value (class H) samples were severely under-identified. Between the M and H classes, the algorithm also prefers the more conservative M class (134 vs. 39). From the perspective of investment practice, this classification performance is not satisfactory.
4.3. Scalability Discussion. It is worth noting that the models designed and tested above consider only the public information provided by third-party venture capital service platforms. In practice, however, investment institutions can also obtain private information through multiple channels, such as reading project business plans, communicating with founders, financing consultants, peers, and colleagues with relevant experience, and obtaining third-party research data.
The potential value of this private information is fourfold:
(1) Supplements the missing values of each indicator in
the original model, and corrects the wrong labels.
(2) Adds new observations to the model to improve the
predictive efficiency of the model.
(3) Provides more information about the research
subjects (i.e., startups), such as funding amounts and
valuations, to design proxy variables that are closer
to the company’s potential.
(4) Builds a closed loop between forecasts and obser-
vations earlier than the market, and iterates models
earlier to improve forecasts and gain a competitive
advantage.
Therefore, this section focuses on how to combine public information with private information to form a dynamically updated project screening mechanism. Figure 2 presents a conceptual framework that takes practical feasibility into account.

Figure 2: Project screening framework that integrates private information and public information. Incoming team, project, and financing information is cleaned and labeled; projects with complete information are scored by the algorithm and corrected with expert knowledge, while projects with incomplete information enter a waiting pool; followed-up and not-followed-up projects enter an observation pool, where later financing information sorts them into TP, FP, TN, and FN; misclassified projects (FP and FN) enter a check pool used to revise the model.
As shown in the figure, after the public information about a project is obtained, it can be analyzed and processed through the following five steps (a code sketch of the routing logic follows the list):
(1) Clean and label the data. The labeling dimensions can include the ten dimensions proposed in this article and can continue to expand as practical experience accumulates.
(2) For projects with relatively complete information,
use decision trees or other machine learning al-
gorithms to generate predictions and make cor-
rections based on expert knowledge in investment
institutions, thereby generating two sets—projects
worthy of follow-up and projects that are not to be
followed up for the time being.
(3) Projects not followed up for the time being enter the waiting pool, together with the projects with incomplete information from the previous step. Once investors obtain more information about the team and the project, these projects re-enter the data cleaning process for relabeling.
(4) Projects worthy of follow-up and projects not followed up enter the observation pool together. Combined with the financing and development information that investors obtain through industry exchanges, projects are divided into four categories: TP (true positive) refers to projects judged worthy of follow-up that are later observed to have real potential; FP (false positive) refers to projects judged worthy of follow-up that turn out to have no development potential or sustainable financing ability; TN (true negative) refers to projects judged not worth following up for the time being that indeed show no potential; and FN (false negative) refers to projects judged not worth following up that are later found to have potential.
(5) The misclassified samples (i.e., FP and FN) are included in the check pool, and this dataset is used to revise the previous algorithm.
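As referenced above, a compact sketch of the routing logic in steps (3)–(5) is given below; the pool names follow Figure 2, while the data structures and function names are hypothetical simplifications (retraining on the check pool is left to whatever model is in use).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ScreeningPools:
    waiting: List[str] = field(default_factory=list)                    # incomplete information
    observation: List[Tuple[str, bool]] = field(default_factory=list)   # awaiting market outcome
    check: List[str] = field(default_factory=list)                      # FP/FN, used to revise the model

def route(pools: ScreeningPools, project: str,
          predicted_followup: bool, observed_followup: Optional[bool] = None) -> None:
    """Record the prediction while the outcome is unknown; once the outcome is
    known, send misclassified projects (FP or FN) to the check pool."""
    if observed_followup is None:
        pools.observation.append((project, predicted_followup))
    elif predicted_followup != observed_followup:
        pools.check.append(project)

pools = ScreeningPools()
route(pools, "c017", predicted_followup=True)                           # enters the observation pool
route(pools, "c017", predicted_followup=True, observed_followup=False)  # FP: enters the check pool
print(len(pools.observation), len(pools.check))  # prints: 1 1
```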
5. Conclusion
This paper studies the application of machine learning methods, represented by the decision tree, to early entrepreneurial project screening. The research process is divided into four steps.
The first step is to establish a prediction model that can be used in investment practice. By processing a real investment and financing event dataset in the field of artificial intelligence and big data, this paper extracts two kinds of signals, relevant party endorsement and team ability, with a total of 10 indicators. After establishing basic models on each type of indicator and verifying their predictive effectiveness, this paper combines all indicators and tests whether the fusion model improves significantly over the basic models. Then, considering that in practice investors are more interested in projects that can continuously obtain financing (i.e., positive samples), this paper introduces a heuristic misclassification cost matrix to adjust the weights of the two types of samples and verifies how much the improved model raises the recall rate of positive samples.
The second step is to interpret the patterns revealed by the model and to explore whether they remain applicable over time and whether they can be applied to project screening in similar industries. The selected model is the decision tree C4.5 model, whose rules are more interpretable. By interpreting the top-down judgment paths of the decision tree, this paper extracts general rules for guiding project selection.
The third step is to extend the decision tree model from the binary classification scenario of positive and negative samples to a multiclass scenario of project value segmentation, to improve the usefulness of the model in investment practice.
Finally, in view of the private information available to
investment institutions in reality, this paper discusses how to
incorporate private information into an iterative analysis
process based on the model constructed from public in-
formation to achieve a more accurate prediction of high-
potential projects.
The implications of this paper for management are as follows:
(1) The value of the three team ability indicators (working background in a well-known company, patents, and previous work experience highly related to the venture) in predicting the potential of a company is supported by the data. When communicating with entrepreneurs or conducting project research, investors would do well to verify the authenticity and basis of these indicators from more information sources. In addition, based on accumulated experience and the subsequent development of portfolio companies, more precise criteria can be established for the scope of well-known companies, the quality and quantity of patents, and the correlation between work experience and the venture, so as to continuously improve project screening and judgment.
(2) The optimal model and cost matrix settings differ across classification objectives, and the same is true across data dimensions. In practice, investors should fully consider their own classification needs and the availability of external data and then select the model with the best classification performance. In addition, when model interpretability is not a priority, it is also an option to introduce algorithms such as Bayesian networks and random forests, or to use ensemble learning
methods, such as Bagging and Boosting, to improve
model prediction.
(3) On the issue of early project screening, the relevant party endorsement dimension, which is less studied by the academic community, actually carries high signal value and is consistent with the basic common sense of the investment industry. This experience should be incorporated into the construction of project screening mechanisms based on machine learning methods.
Data Availability
The dataset can be accessed upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] C. Carpentier and J.-M. Suret, “Angel group members’ de-
cision process and rejection criteria: a longitudinal analysis,”
Journal of Business Venturing, vol. 30, no. 6, pp. 808–821, 2015.
[2] P. Klimas and W. Czakon, “Organizational innovativeness
and coopetition: a study of video game developers,” Review of
Managerial Science, vol. 12, pp. 469–497, 2018.
[3] D. Guo and K. Jiang, “Venture capital investment and the
performance of entrepreneurial firms: evidence from China,”
Journal of Corporate Finance, vol. 22, pp. 375–395, 2013.
[4] J. Hoyos-Iruarrizaga, A. Fernández-Sainz, and M. Saiz-Santos, "High value-added business angels at post-investment stages: key predictors," International Small Business Journal, vol. 35, pp. 949–968, 2017.
[5] R. Sørheim, "The pre-investment behaviour of business angels: a social capital approach," Venture Capital, vol. 5, no. 4, pp. 337–364, 2003.
[6] D. Kirsch, B. Goldfarb, and A. Gera, “Form or substance: the
role of business plans in venture capital decision making,”
Strategic Management Journal, vol. 30, no. 5, pp. 487–515,
2009.
[7] M. Jansson and A. Biel, “Investment institutions’ beliefs about
and attitudes toward socially responsible investment (SRI): a
comparison between SRI and non-SRI management,” Sus-
tainable Development, vol. 22, pp. 33–41, 2014.
[8] M. Brahim and H. Rachdi, “Foreign direct investment, in-
stitutions and economic growth: evidence from the MENA
region,” Journal of Reviews on Global Economics, vol. 3,
pp. 328–339, 2014.
[9] G. D. Leeves and R. Herbert, “Gender differences in social
capital investment: theory and evidence,” Economic Model-
ling, vol. 37, pp. 377–385, 2014.
[10] J. Peiró-Palomino and E. Tortosa-Ausina, "Social capital, investment and economic growth: some evidence for Spanish provinces," Spatial Economic Analysis, vol. 10, pp. 102–126, 2015.
[11] T. Astebro, “Key success factors for technological entrepre-
neurs’ R&D projects,” IEEE Transactions on Engineering
Management, vol. 51, pp. 314–321, 2004.
[12] A. Croce, J. Martí, and S. Murtinu, "The impact of venture capital on the productivity growth of European entrepreneurial firms: "Screening" or "value added" effect?" Journal of Business Venturing, vol. 28, pp. 489–510, 2013.
[13] J. Argerich, E. Hormiga, and J. Valls-Pasola, “Financial ser-
vices support for entrepreneurial projects: key issues in the
business angels investment decision process,” Service Indus-
tries Journal, vol. 33, pp. 9-10, 2013.