Scientic Reports | (2023) 13:1481 | https://doi.org/10.1038/s41598-023-28179-x
www.nature.com/scientificreports
Product progression: a machine learning approach to forecasting industrial upgrading

Giambattista Albora1,2, Luciano Pietronero2, Andrea Tacchella3 & Andrea Zaccaria2,4*
Economic complexity methods, and in particular relatedness measures, lack a systematic evaluation and comparison framework. We argue that out-of-sample forecast exercises should play this role, and we compare various machine learning models to set the prediction benchmark. We find that the key object to forecast is the activation of new products, and that tree-based algorithms clearly outperform both the quite strong auto-correlation benchmark and the other supervised algorithms. Interestingly, we find that the best results are obtained in a cross-validation setting, when data about the predicted country was excluded from the training set. Our approach has direct policy implications, providing a quantitative and scientifically tested measure of the feasibility of introducing a new product in a given country.
In her essay e Impact of Machine Learning on Economics, Susan Athey states: “Prediction tasks [...] are typi-
cally not the problems of greatest interest for empirical research in economics, who instead are concerned with
causal inference ” and “economists typically abandon the goal of accurate prediction of outcomes in pursuit of
an unbiased estimate of a causal parameter of interest ”1. is situation is mainly due to two factors: the need to
ground policy prescriptions2,3 and the intrinsic diculty to make correct predictions in complex systems4,5. e
immediate consequence of this behavior is the ourishing of dierent or even contrasting economic models,
whose concrete application largely relies on the specic skills, or biases, of the scholar or the policymaker6. is
horizontal view, in which models are every time aligned and selected, in contrast with the vertical view of hard
sciences, in which models are selected by comparing them with empirical evidence, leads to the challenging issue
of distinguishing which models are wrong. While this situation can be viewed as a natural feature of economic
and, more in general, complex systems6, a number of scholars coming from hard sciences have recently tackled
these issues, trying to introduce concepts and methods from their disciplines in which models’ falsiability, tested
against empirical evidence, is the key element. is innovative approach, called Economic Fitness and Complex-
ity712 (EFC), combines statistical physics and complex network based algorithms to investigate macroeconomics
with the aim to provide testable and scientically valid results. e EFC methodology studies essentially two
lines of research: indices for the competitiveness of countries and relatedness measures.
e rst one aims at assessing the industrial competitiveness of countries by applying iterative algorithms to
the bipartite network connecting countries to the products they competitively export13. Two examples are the
Economic Complexity Index ECI14 and the Fitness7. In this case, the scientic soundness of either approach can
be assessed by accumulating pieces of evidence: by analyzing the mathematical formulation of the algorithm and
the plausibility of the resulting rankings1518, and by using the indicator to predict other quantities. In particular,
the Fitness index, when used in the so-called Selective Predictability Scheme19, provides GDP growth predictions
that outperform the ones provided by the International Monetary Fund10,20. All these elements concur towards the
plausibility of the Fitness approach; however, a direct way to test the predictive performance of these indicators21
is still lacking. is naturally leads to the consideration of further indices, that can mix the existing ones22 or use
new concepts such as information theory23. We argue that, on the contrary, the scientic validity of relatedness
indicators can be univocally assessed, and this is the purpose of the present work.
The second line of research in EFC investigates the concept of Relatedness24, the idea that two human activities are, in a sense, similar if they share many of the capabilities needed to be competitive in them25. Practical applications are widespread and include international trade11,26, firm technological diversification27,28, regional smart specialization29,30, and the interplay among the scientific, technological, and industrial layers31. Most of these contributions use relatedness not to forecast future quantities, but as an independent variable in a
1 Dipartimento di Fisica, Università Sapienza, Rome, Italy. 2 Centro Ricerche Enrico Fermi, Rome, Italy. 3 Joint Research Centre, Seville, Spain. 4 Istituto dei Sistemi Complessi-CNR, UOS Sapienza, Rome, Italy. *email: andrea.zaccaria@cnr.it
Content courtesy of Springer Nature, terms of use apply. Rights reserved
regression, and so the proximity (or quantities derived from it) is used to explain some observed simultaneous behavior. We point out, moreover, that no shared definition of relatedness exists, despite the widespread use of co-occurrences, since different scholars use different normalizations, null models, and data, so the problem of deciding "which model is wrong" persists. For instance, Hidalgo et al.26 base the goodness of their measure on its correlation with the probability that a country starts to export a product. O'Clery et al.32 test the goodness of their relatedness measure through an in-sample logit regression; in this way, models with a greater number of parameters (as provided, for instance, by the addition of fixed effects on countries and products) tend to have greater scores. Finally, Gnecco et al.33 propose an approach to assess relatedness based on matrix completion. Note that their test of the goodness of their approach is based on the reconstruction of the country-product pairs that have been removed from the data; the approach used here, instead, consists in looking at how well the proposed model guesses new exports of countries after 5 years. So once again the performances are not comparable, as is evident by looking, for instance, at the respective magnitudes of the reported F1 scores.
The examples just discussed clarify why we believe it is fundamental to introduce elements of falsifiability in order to compare the different existing models, and that such a comparison should be made by looking at the performance in out-of-sample forecasting, which is the focus of the present paper. We will consider export as the economic quantity to forecast because most of the indicators used in economic complexity are derived from export data, regarded as a global, summarizing quantity of countries' capabilities10,34, but also for the immediate policy implications of being able, for instance, to predict in which industrial sector a country will be competitive, say, in five years.
In this paper, we propose a procedure to systematically compare different prediction approaches and, as a consequence, to scientifically validate or falsify the underlying models. Indeed, some attempts to use complex networks or econometric approaches to predict exports exist32,35–37, but these methodologies are practically impossible to compare, precisely because of the lack of a common framework to choose how to preprocess data, how to build the training and the test set, or even which indicator to use to evaluate the predictive performance. In the following, we will systematically scrutinize the steps to build a scientifically sound testing procedure to predict the evolution of the export basket of countries. In particular, we will forecast the presence or the activation of a binary matrix element M_cp, which indicates whether the country c competitively exports product p in the Revealed Comparative Advantage sense38 (see "Methods" for a detailed description of the export data).
Given the simultaneous presence in the literature of different approaches to measure relatedness, it is natural to ask whether machine learning algorithms might play a role and provide comparable or even better measures. In particular, given the present ubiquitous and successful use of artificial intelligence in many different contexts, it is natural to use machine learning algorithms to set the benchmark. A relevant by-product of this analysis is the investigation of the statistical properties of the database (namely, the strong auto-correlation and class imbalance), which has deep consequences on the choice of the most suitable algorithms, testing exercises, and performance indicators.
Applying these methods we nd two interesting results:
1. e best performing models for this task are based on decision trees. A fundamental property that separates
these algorithms from the main approaches used in the literature26 is the fact that here the presence of a
product in the export basket of a country can have a negative eect on the probability of exporting the target
product. i.e. decision trees are able to combine Relatedness and Anti-Relatedness signals to provide strong
improvements in the accuracy of predictions39
2. Our best model performs better in a cross-validation setting where we exclude data from the predicted
country from the training set. We interpret this nding by arguing that in cross-validation the model is able
to better learn the actual Relatedness relationships among products, rather than focusing on the very strong
self-correlation of the trade data.
In the "Methods" section we show a detailed comparison between our machine learning based approach and some of the other definitions of relatedness we mentioned.
The present investigation of the predictability of the time evolution of export baskets has a number of practical and theoretical applications. First, predicting the time evolution of the export basket of a country requires, as an intermediate step, an assessment of the likelihood that a single product will be competitively exported by the country in the coming years. This likelihood can be seen as a measure of the feasibility of that product, given the present situation of that country. The possibility to investigate with such a great level of detail which product is relatively close to a country and which one is out of reach has immediate implications in terms of strategic policies40. Second, the study of the time evolution of the country-product bipartite network is key to validating the various attempts to model it41,42. Finally, the present study represents one of the first attempts to systematically investigate how machine learning techniques can be applied in development economics, something still not much discussed in the literature except for very recent works33,39,43.
Results
Statistical properties of the country-product network. A key result of the present investigation is a clear-cut methodology to compare different models or predictive approaches in Economic Complexity. In order to understand the reasons behind some of the choices we made in building the framework, we first discuss some statistical properties of the data we will analyze.
Our database is organized in a set of matrices V whose element V_cp is the amount, expressed in US dollars, of product p exported by country c in a given year. When not otherwise specified, the number of countries is 169, the number of products is 5040, and the time range covered by our analysis is 1996-2018. We use the HS1992,
6-digits classication. e data are obtained from the UN-COMTRADE database and suitably cleaned in order to
take into account the possible disagreements between importers’ and exporters’ declarations (see “Methods”). We
compute the Revealed Comparative Advantage38 to obtain a set of RCA matrices
R
and, by applying a threshold
equal to 1, a set of matrices
M
whose binary elements are equal to 1 if the given country competitively exports
the given product. Here and in the following we use “competitively” in the Balassa sense, that is,
Rcp >1
. In this
paper we will discuss the prediction of two dierent situations: the unconditional presence of a “1” element in
the
M
matrix and the appearance of such an element requiring that the RCA values were below a non-signicance
threshold t=0.25 in all the previous years. We will refer to the rst case as the full matrix and to the new product
event as an activation. e denition of the activation is somehow arbitrary: one could think, for instance, to
change the threshold t or the number of inactive years. We nd however our choice to be a good trade-o to
have both a good numerosity of the test set and avoid the inuence of trivial 0/1 ips. We point out that our nal
aim is to detect, as much as possible, the appearance of really new products in the export basket of countries.
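As a minimal sketch of this construction (function and variable names are ours, and V is a toy export-volume matrix, not the real UN-COMTRADE data), the R and M matrices and the activation candidates can be built as follows:

```python
import numpy as np

def rca(V):
    """Balassa's Revealed Comparative Advantage of a country-product volume matrix V."""
    # product share in the country's basket, divided by the product's share of world trade
    country_share = V / V.sum(axis=1, keepdims=True)
    world_share = V.sum(axis=0, keepdims=True) / V.sum()
    return country_share / world_share

# toy export volumes in US dollars: 2 countries x 3 products (hypothetical numbers)
V = np.array([[10.0,  0.0, 90.0],
              [40.0, 50.0, 10.0]])
R = rca(V)
M = (R > 1).astype(int)   # 1 where the country competitively exports the product

# activation candidates: matrix elements with RCA below t = 0.25 in all previous years
R_past = np.stack([R, R])                # stand-in for the stack of past-year RCA matrices
candidates = (R_past < 0.25).all(axis=0)
```

Here the first country's share of product 3 is 0.9 against a world share of 0.5, so R = 1.8 and the element of M is 1; elements whose RCA stayed below 0.25 are the only ones eligible to become activations.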
In Fig.1, le, we plot the probability that a matrix element
Mcp
in 1996 will change or not change its binary
value in the future years. One can easily see that even aer 5 years the probability that a country remains com-
petitive in a product is relatively high (
0.64
); being the probability that a country remains not competitive
, we conclude that there is a very strong auto-correlation: this is a reection of the persistent nature of
both the capabilities and the market conditions that are needed to competitively export a product. Moreover, the
appearance of a new product in the export basket of a country is a rare event: the empirical frequency is about
0.047 aer 5 years. A consequence of this persistence is that we can safely predict the presence of a 1 in the
M
matrices by simply looking at the previous years, while the appearance of a new product that was not previously
exported by a country is much more dicult and, in a sense, more interesting from an economical point of view,
since it depends more on the presence of suitable, but unrevealed, capabilities in the country; but these capabili-
ties can be traced by looking at the other products that country exports. Not least, an early detection of a future
activation of a new product has a number of practical policy implications. Note in passing that, since machine
learning based smoothing procedures10,44 may introduce extra spurious correlations, they should be avoided in
prediction exercises, and so only the RCA values directly computed from the raw export data are considered.
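The transition probabilities of Fig. 1 can be estimated directly from two binary export matrices; a sketch with toy matrices (the real ones are 169 × 5040, compared five years apart):

```python
import numpy as np

# toy binary export matrices in year y and year y + 5
M0 = np.array([[1, 0, 1],
               [0, 1, 0]])
M5 = np.array([[1, 0, 0],
               [0, 1, 1]])

p_stay_competitive = (M0 & M5).sum() / M0.sum()               # P(1 -> 1)
p_stay_absent = ((1 - M0) & (1 - M5)).sum() / (1 - M0).sum()  # P(0 -> 0)
p_activation = ((1 - M0) & M5).sum() / (1 - M0).sum()         # P(0 -> 1), the rare event
```

On the real data these empirical frequencies give the strong persistence discussed above (P(1 -> 1) ≈ 0.64 and an activation frequency of about 0.047 after 5 years).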
On the right side of Fig. 1 we plot the density of the matrices M, that is, the number of nonzero elements with respect to the total number of elements. This ratio is roughly 10%. This means that both the prediction of the full, unconditional matrix elements and the prediction of the so-called activations (i.e., conditioning on that element being 0 and with RCA below 0.25 in all the previous years) show a strong class imbalance. This has deep consequences regarding the choice of the performance indicators to compare the different predictive algorithms. For instance, the ROC-AUC score45, one of the most used measures of performance for binary classifiers, is well known to suffer from strong biases when a large class imbalance is present46. More details are provided in the "Methods" section.
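A small synthetic illustration of this bias (numbers are ours, chosen only to mimic a roughly 1% positive rate): with the same ranking, the ROC-AUC looks excellent while the average precision, which underlies AUC-PR, exposes the poor performance on the rare class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# 10 true activations among 1010 candidates, roughly the imbalance of the activations task
y_true = np.concatenate([np.ones(10), np.zeros(1000)])
# a classifier that ranks 20 negatives above every positive
scores = np.concatenate([np.full(10, 0.8), np.full(20, 0.9), np.full(980, 0.1)])

print(roc_auc_score(y_true, scores))            # 0.98: looks nearly perfect
print(average_precision_score(y_true, scores))  # ~0.33: the rare class is badly ranked
```

The ROC-AUC is inflated by the huge number of easy true negatives, while the precision-recall view is dominated by the few positives, which is exactly what matters for the activations.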
Recognize the country vs. learning the products' relations. In this section we present the results concerning the application of different supervised learning algorithms. The training and the test procedures are fully described in the "Methods" section. Here we just point out that the training set is composed of the matrices R(y) with y ∈ [1996, ..., 2013], and the test is performed against M(2018), so we try to predict the export basket of countries after 5 years.
The algorithms we tested are XGBoost47,48, a basic Neural Network implemented using the Keras library49, and the following algorithms implemented using the scikit-learn library50: Random Forest51, Support Vector Machines52, Logistic Regression53, a Decision Tree54, ExtraTreesClassifier55, AdaBoost56, and Gaussian Naive Bayes57. For reasons of space, we cannot discuss all these methods here. However, a detailed description can be found in58 and references therein and, in the following sections, we will elaborate more on the algorithms based on decision trees, which turn out to be the best performing ones.
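As a sketch of the per-product setup (a toy stand-in for the real matrices; the sizes, seed, and variable names are ours), each supervised model takes a country's full RCA row as features and the later presence of one target product as the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
C, P = 30, 8                 # toy sizes; the paper uses 169 countries and 5040 products
R_y = rng.random((C, P))     # stand-in for the RCA matrix in a training year

# toy label for one target product p: whether the country exports p five years later
p = 3
y = (R_y[:, p] > np.median(R_y[:, p])).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(R_y, y)
progression = clf.predict_proba(R_y)[:, 1]   # one progression probability per country
```

One such model is trained for each of the P target products, and its scores are the progression probabilities discussed below.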
Figure1. Le: transition probabilities between the binary states of the export matrix M. e strong persistency
implies the importance of the study of the appearance of new products (called activations) with respect to the
unconditional presence of one matrix element (in the following, full matrix). Right: the fraction of nonzero
elements in M as a function of time. A strong class imbalance is present.
In Fig.2 we show an example of the dynamics that our approach is able to unveil. On the le we show the
RCA of Bhutan for the export of Electrical Transformers as a function of time. RCA is zero from 1996 to 2016,
when a sharp increase begins. Was it possible to predict the activation of this matrix element? Let us train our
machine learning algorithm XGBoost using the data from 1996 to 2012 to predict which products will Bhutan
likely export in the future. e result is a set of scores, or progression probabilities, one score for each possible
product. Each of these scores measures the feasibility, or relatedness, between Bhutan and all the products it
does not export. e distribution of such scores is depicted in Fig.2 on the right. e progression probability
for Electrical Transformers was much higher than average, as shown by the arrow: this means that, already in
2012, Bhutan was very close to this product. Indeed, as shown by the gure on the le, Bhutan will start to export
that specic product in about 5 years. Obviously, this is just an example, so we need a set of quantitative tools to
measure the prediction performance on the whole test set on a statistical basis.
In order to quantitatively assess the goodness of the prediction algorithms, a number of performance indicators are available from the machine learning literature on binary classifiers. Here we focus on three of them, and we show the results in Fig. 3, where each row shows a different indicator, while the two columns refer to the two prediction tasks: full matrix (i.e., the presence of a matrix element equal to one) and activations (a zero matrix element, with RCA below 0.25 in previous years, possibly becoming higher than one, that is, the appearance of a new product in the export basket of a country). AUC-PR46 gives a parameter-free, comprehensive assessment of the prediction performance. The F1 score59,60 is the harmonic mean of the Precision and Recall measures61, and so takes into account both False Positives and False Negatives. Finally, mean Precision@10 considers each country separately and computes how many products, on average, are actually exported out of the top 10 predicted. All the indicators we used are discussed in more detail in the "Methods" section.
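Mean Precision@10 is straightforward to implement; a sketch of our own (scores and actual are country × product arrays, with actual the binary outcome matrix):

```python
import numpy as np

def mean_precision_at_10(scores, actual):
    """Average, over countries, of the fraction of the 10 top-scored products
    that are actually exported five years later."""
    per_country = []
    for s, a in zip(scores, actual):
        top10 = np.argsort(s)[::-1][:10]   # indices of the ten highest-scored products
        per_country.append(a[top10].mean())
    return float(np.mean(per_country))
```

For the activations task, only the candidate (previously inactive) matrix elements enter the ranking for each country.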
We highlight in red the RCA benchmark model, which simply uses the RCA values in 2013 to predict the export matrix in 2018. From the analysis of Fig. 3 we can infer the following points:
1. The performance indicators are much higher for the full matrix. This means that predicting the unconditional presence of a product in the export basket of a country is a relatively simple task, being driven by the strong persistence of the M matrices through the years.
2. On the contrary, the performance on the activations is relatively poor: for instance, on average, less than one new product out of the top ten is correctly predicted.
3. Algorithms based on ensembles of trees perform better than the benchmark and the other algorithms on all the indicators.
4. Thanks to the strong autocorrelation of the matrices, the RCA-based prediction represents a very strong benchmark, also in the case of the activations. However, Random Forest, ExtraTreesClassifier and XGBoost perform better both in the full matrix prediction task and in the activations prediction task.
We speculate that the machine learning algorithms perform much better in the full matrix case because, in a sense, they recognize the single country and, when fed a similar export basket, they correctly reproduce the strong auto-correlation of the export matrices. We can deduce that using this approach we are not learning the complex interdependencies among products, as we should, and, as a consequence, we do not correctly predict the new products. In order to overcome this issue, we have to use a k-fold Cross Validation (CV): we separately train our models to predict the outcome of k countries using the remaining C − k, where in our case C = 169 and k = 13. In this way, we prevent the algorithm from recognizing the country, since the learning is performed on disjoint sets; as a consequence, the algorithm learns the relations among the products and is expected to improve the performance on the activations.
Using the cross validation procedure, we trained again the three best performing algorithms, which are Random Forest, ExtraTreesClassifier, and XGBoost. The result is that only the XGBoost algorithm improves
Figure2. An example of successful prediction. On the le, the RCA of Bhutan in electrical transformers as a
function of time. Already in 2012, with RCA stably below 1, the progression probability of that matrix element
was well above its country average, as shown by the histogram in the gure on the right. Bhutan will start to
competitively export electrical transformers aer 5 years.
[Figure 3: six bar charts, one row per indicator (PR-AUC, Best F1, mean Precision@10), with the two columns referring to the full matrix and activations prediction tasks. Each panel compares the ExtraTrees Classifier, XGBoost, Random Forest, Gaussian Naive Bayes, Decision Tree, Logistic Regression, Dense Neural Network, Support Vector Machine, AdaBoost, and the RCA benchmark.]
Figure3. Comparison of the prediction performance of dierent algorithms using three performance
indicators. Tree-based approaches are performing better; the prediction of the activations is a harder task with
respect to the simple future presence of a product.
its scores, which means that in the cross-validation setting it is more capable of learning the inter-dependencies among products. So what is happening is that, if we do not perform the cross validation, the Random Forest tends to recognize the countries better than XGBoost, but if we perform the cross validation, XGBoost learns the inter-dependencies among products better than the Random Forest. This step is crucial if one wants to build a representation of such interdependencies which also has good forecasting power39.
In Fig. 4 (left) we show the relative improvements of various performance indicators when the CV is used to train the XGBoost model and the test is performed on the activations. All indicators improve; in particular, the F1 score and mean Precision@10 increase by more than 10%. On the right, we compare the cross-validated XGBoost predictions with the RCA benchmark, showing a remarkable performance despite the previously noted goodness of the benchmark.
In Table1 we report the values of the performance indicators for the non cross-validated Random Forest,
the cross-validated XGBoost and the RCA benchmark model, once again tested on the activations. e last four
rows represent the confusion matrix, where the threshold on the prediction scores is computed by optimizing
the F1 scores.
e cross validated XGBoost gives the best scores except for the AUC-ROC and the accuracy which are
inuenced by the high class imbalance because of the large number of True Negatives, making these metrics
unsuitable for evaluating the goodness of the predictions. However, the non cross-validated Random Forest is
comparable and in any case shows better scores than the RCA benchmark, so it represents a good alternative,
especially because of the much lower computational cost. Indeed, the inclusion of the cross-validation proce-
dure increases the computational cost by about a factor 13, moreover, if compared with the same number of
trees, Random Forest is 7.7 times faster than XGBoost. So, even if the cross validated XGBoost model is the best
performing, the non cross validated Random Forest is a good compromise to have good predictions in less time.
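The F1-optimal threshold can be selected with scikit-learn's precision-recall machinery; a sketch of our own (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, scores):
    """Return the score threshold maximizing F1, together with that F1 value."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    best = int(np.argmax(f1[:-1]))   # the final (precision=1, recall=0) point has no threshold
    return thr[best], f1[best]
```

Binarizing the scores at this threshold yields the TP/FP/FN/TN counts reported in the confusion-matrix rows of Table 1.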
In general, a desirable output of a classification task is not only a correct prediction, but also an assessment of the likelihood of the label, in this case the activation. This likelihood provides a sort of confidence in the prediction. In order to test whether the scores are correlated with the actual probability of activations, we
Figure4. Le: relative improvement of the prediction performance of XGBoost when the training is cross
validated. e algorithm now can not recognize the country, and so all the performance indicators improve.
Right: relative improvement of the cross validated XGBoost algorithm with respect to the RCA benchmark.
Table 1. Comparison of the predictive performance of XGBoost with cross validation, Random Forest without cross validation, and the RCA benchmark for the activations, using different indicators. The last row indicates the computational cost with respect to the non cross-validated Random Forest; XGBoost is about 100 times slower. The highest values of each indicator are in bold.

Algorithm                  | XGBoost-CV | Random Forest | RCA
AUC-ROC                    | 0.698      | 0.724         | 0.592
F1 score                   | 0.0479     | 0.0476        | 0.0369
mean Precision@10          | 0.059      | 0.045         | 0.039
Precision                  | 0.034      | 0.035         | 0.023
Recall                     | 0.079      | 0.073         | 0.103
MCC                        | 0.043      | 0.042         | 0.035
AUC-PR                     | 0.018      | 0.017         | 0.011
Accuracy                   | 0.981      | 0.982         | 0.967
Negative predictive value  | 0.994      | 0.994         | 0.994
TP                         | 202        | 186           | 263
FP                         | 5663       | 5063          | 11413
FN                         | 2359       | 2375          | 2298
TN                         | 403767     | 404367        | 398017
Computational cost         | 100        | 1             | —
build a calibration curve. In Fig. 5 we show the fraction of positive elements as a function of the output (i.e., the scores) of the XGBoost and Random Forest algorithms in the activations prediction task. We divide the scores into logarithmic bins and then compute the mean and the standard deviation inside each bin. In both cases a clear correlation is present, pointing out that a higher prediction score corresponds to a higher empirical probability that the activation of a new product will actually occur. Moreover, we note that the greater the score produced by the model, the greater the error on the y axis; the reason is that the models tend to assign higher scores to the products already exported by a country, so if we look at the activations the values start to fluctuate, and the statistics become poorer.
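The binning procedure behind the calibration curve can be sketched as follows (a helper of our own; edges are the logarithmic bin boundaries):

```python
import numpy as np

def calibration_in_log_bins(scores, outcomes, edges):
    """Empirical activation frequency inside each logarithmic score bin."""
    freqs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & (scores < hi)
        freqs.append(outcomes[in_bin].mean() if in_bin.any() else np.nan)
    return np.array(freqs)

# logarithmically spaced edges, e.g. np.logspace(-3, 0, 9) for scores in (0.001, 1)
```

Plotting these per-bin frequencies against the bin centers, with the per-bin standard deviation as error bars, reproduces the kind of curve shown in Fig. 5.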
We close this section by mentioning the possibility of training our algorithms by taking explicitly into account the class imbalance, as suggested in62,63. The results of this investigation are reported in Section 2 of the Supplementary Information. We observe a mild decrease of the prediction performance.
Opening the black box. In order to qualitatively motivate the better performance of tree-based algorithms, in this paragraph we elaborate on the operation of Random Forests. As specified in the "Methods" section, in these prediction exercises we train one Random Forest model for each product, and each Random Forest contains 100 decision trees. In Fig. 6 we show one representative decision tree. This tree is obtained by setting the number of features available for each tree equal to P = 5040: this means that we are bootstrap aggregating, or bagging64, the trees, instead of building an actual Random Forest, which instead considers a random subset of the products51 (the decision trees may differ also in this case, since the bagging procedure extracts the training samples with replacement). Moreover, the training procedure is cross validated, so the number of input samples is 156 × 7 (156 countries and 7 years, from 2007 to 2013).
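In scikit-learn terms, the bagging-versus-forest distinction is controlled by `max_features` (a sketch on toy data, not the paper's exact configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# max_features=None: every tree may split on all P products -> bagged decision trees;
# trees still differ because each is fit on a bootstrap resample of the rows
bagged = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
# max_features="sqrt": each split sees a random subset of products -> a true Random Forest
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

rng = np.random.default_rng(0)
X = rng.random((40, 6))                               # toy country-year x product features
y = (X[:, 0] > np.median(X[:, 0])).astype(int)        # toy target product labels
bagged.fit(X, y)
forest.fit(X, y)
```

With `max_features=None`, each individual tree (like the one in Fig. 6) is trained on all the products and is therefore easier to read as a set of product-product splits.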
The decision tree we show refers to the product with HS1992 code 854089; the description is valves and tubes not elsewhere classified in heading no. 8540, where 8540 stands for cold cathode or photo-cathode valves and tubes, like vacuum tubes, cathode-ray tubes, and the like.
The color represents the class imbalance of the leaf (dark orange, many zeros; dark blue, many ones, quantified in the square brackets). The root product, the one which provides the best split, is chromium, which is used,
Figure5. Calibration curves: fraction of positive elements as a function of the scores produced by XGBoost
(le) and Random Forest (right) for the activations prediction task. In both cases a clear positive correlation
is present, indicating that higher scores are associated to higher empirical probabilities that the activation will
actually occur.
[Figure 6 content: the root split is on chromium wrought other than waste and scrap (value = [1006, 86]); the no-export branch splits on microscopes and diffraction apparatus (value = [994, 44]), then on rail locomotives (value = [961, 19]) and narrow woven fabrics (value = [33, 25]); the export branch splits on non-threaded washers of iron and steel (value = [12, 42]), then on particle accelerators (value = [10, 2]) and tanned or crust hides and skins of goats (value = [2, 40]).]
Figure6. A representative decision tree to forecast the export of the product valves and tubes. e root product,
chromium, has a well known technological relation with the target product, and in fact is able to discriminate
against future exporters with high precision.
for instance, in cathode-ray tubes to reduce X-ray leaks. So the Random Forest found a nontrivial connection between chromium and these types of valves and tubes: out of the 1006 country-year pairs that do not export valves and tubes, 994 do not export chromium either (note the negative association). We can explore the network considering that the no-export link is always on the left. Looking at the export direction we find the cut on washers of iron and steel, which works very well: only 2 of the 12 country-year pairs that do not export valves and tubes do export washers, and only 2 of the 42 that export valves and tubes do not export washers. Looking at the other splits, we find some of them more reasonable, like the one on particle accelerators, and some that seem coincidental, like the one on hides and skins of goats.
From this example it is clear that the decision tree is a natural framework to deal with a set of data in which some features (i.e., products) may be by far more informative than others, so that a hierarchical structure is needed to take into account this heterogeneous feature importance.
Feature importance may be evaluated by looking at the normalized average reduction of the impurity at each split that involves that feature50; in our case, we are considering the Gini impurity. In Fig. 7 we plot this assessment of the feature importance to predict the activation of valves and tubes. One can easily see that the average over the different decision trees is even more meaningful than the single decision tree shown before, even if each of the former sees fewer products than the latter: all the top products are reasonably connected with the target product, and so it is natural to expect them to be key elements to decide whether the given country will export valves and tubes or not.
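The Gini-based importance described above is exposed directly by scikit-learn as feature_importances_. A minimal sketch on toy data (all sizes and values hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 5))             # toy RCA-like features
y = (X[:, 2] > 0.6).astype(int)      # only feature 2 is informative here

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Normalized mean decrease of the Gini impurity, averaged over splits and trees
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]   # most informative feature first
```

Sorting products by these importances is what produces rankings like the one in Fig. 7.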
Time dependence. In the procedure discussed above we used a time interval Δ_model equal to 5 years for the training, and we tested our out-of-sample forecasts using the same time interval Δ. Here we investigate how the choice of the forecast horizon Δ affects the quality of the predictions. To make this analysis we used XGBoost models trained with the cross validation method and a lower Δ_model = 3. The machine learning algorithms are trained using data in the range y ∈ [1996 ... 2008] and their output, obtained giving RCA(2008) as input, will be compared with the various M(2008+Δ) by varying Δ. Since 2018 is the last year of available data, we can explore a range of Δs from 1 to 10. All details about the training procedure of the machine learning algorithms are given in the "Methods" section.
The quality of the predictions as a function of the forecast horizon Δ is summarized in Fig. 8, where we normalized the indicators in such a way that they are all equal to 1 at Δ = 1. In the left figure we show the plot for the activations prediction task: both precision and precision@10 increase with Δ, while the negative predictive value decreases and accuracy shows an erratic behavior. This means that our ability to guess positive values improves or, in other words, the longer one waits the higher the probability that a country sooner or later activates the products we predict. This improvement on positive values, however, corresponds to a worsening on negative values, which can be interpreted as the fact that countries develop new capabilities over time and start to export products we cannot predict when the Δ interval is too large.
If we look at a score that includes both the performance on positive and on negative values, like accuracy, we observe a (noisy) worsening as Δ increases.
In the figure on the right we show instead the full matrix prediction task. In this case all the scores decrease with Δ because the algorithm can no longer leverage the strong auto-correlation of the RCA matrix.
[Figure 7 content: horizontal bar chart of feature importance (0 to 0.1) for the target product valves and tubes; the top features are microwave tubes, chromium wrought other than waste and scrap, hydroxide and peroxide of magnesium, navigational instruments and appliances, isobutene-isoprene rubber, valves and tubes receiver or amplifier, threaded screws of iron and steel, microscopes and diffraction apparatus, halogenated derivatives of acyclic hydrocarbons, and television camera tubes.]
Figure7. Feature importance is a measure of how much a product is useful to predict the activation of the
target product. Here we use the average reduction of the Gini impurity at each split. All important products are
reasonably connected with the target.
Note that the steepness of the decreasing curves is higher when we look at precision scores, the reason being the high class imbalance and the large number of true negatives with respect to true positives, as shown in Table 1.
Discussion
One of the key issues in economic complexity and, more generally, in complexity science is the lack of systematic procedures to test, validate, and falsify theoretical models and empirical, data-driven methodologies. In this paper we focus on export data, and in particular on the country-product bipartite network, which is the basis of most literature on economic complexity, and on the likewise widespread concept of relatedness, which is usually associated to an assessment of the proximity between two products or the density or closeness of a country with respect to a target product. As detailed in the Introduction, many competing approaches exist to quantify these concepts; however, a systematic framework to evaluate which approach works better is lacking, and the result is the flourishing of different methodologies, each one tested in a different way and with different purposes. We believe that this situation can be addressed in a quantitative and scientifically sound way by defining a concrete framework to compare the different approaches systematically; the framework we propose is out-of-sample forecasting, and in particular the prediction of the presence or the appearance of products in the future export baskets of countries. This approach has the immediate advantage of avoiding a number of recognized issues65 such as the mathiness of microfounded models66 and the p-hacking in causal inference and regression analyses1,67.
In this paper we systematically compare different machine learning algorithms in the framework of a supervised classification task. We find that the statistical properties of the export data, namely the strong auto-correlation and the class imbalance, imply that the appearance, or activation, of new products should be investigated, and that some indicators of performance, such as ROC-AUC and accuracy, should be considered with extreme care. On the contrary, indicators such as the mean Precision@k have an immediate policy interpretation. In the prediction tasks tree-based models, such as Random Forest and Boosted Trees, clearly outperform the other algorithms and the quite strong benchmark provided by the simple RCA measure. The prediction performance of Boosted Trees can be further improved by training them in a cross validation setting, at the cost of a higher computational effort. The calibration curves, which show a high positive correlation between the machine learning scores and the actual probability of the activation of a new product, provide further support to the correctness of these approaches. A first step towards opening this black box is provided by the visual inspection of a sample decision tree and by the feature importance analysis, which shows that the hierarchical organization of the decision tree is a key element to provide correct predictions, but also insights about which products are more useful in this forecasting task.
From a theoretical perspective, this exercise points out the relevance of context for the appearance of new products, in the spirit of the New Structural Economics68, but it also has immediate policy implications: each country comes with its own endowments and should follow a personalized path, and machine learning approaches are able to efficiently extract this information. In particular, the output of the Random Forest or the Boosted Trees algorithm provides scores, or progression probabilities, that a product will soon be activated by the given country. This represents a quantitative and scientifically tested measure of the feasibility of a product in a country. This measure can be used in very practical contexts of investment design and industrial planning, a key issue after the COVID-related economic crisis69,70.
Conclusion
Measuring the relatedness between countries and products is one of the main topics in the economic complexity literature71, given its importance to assess the feasibility of investments and strategic policies72,73. Starting from 2007 with the Product Space26, many different approaches to measure relatedness have been proposed11,32,35-37,39,43. With all these models in the literature, a major issue is the absence of a scientifically sound procedure to compare them and to quantify how good they are in measuring relatedness.
Figure8. In the plot on the le we show the performance indicators in the case of the activations prediction
task. e performance on positive values improves, while the one on negative values gets worse. On the right we
show the same performance indicators in the case of the full matrix prediction task. All the scores get worse due
to the vanishing auto-correlation of the matrices.
e rst contribution of this work is the proposal of out-of-sample forecasts of new exported products as
a method to compare dierent relatedness models. In this way, the problem of measuring the relatedness can
be casted as a binary classication exercise and, by using standard performance indicators, one can assess the
goodness of a measure and compare them quantitatively. e second contribution of the present paper is the
use of machine learning algorithms to measure the relatedness. We show that decision trees-based algorithms
like Random Forest51 and XGBoost48 provide the best assessment and represent the benchmark for possible new
measures of relatedness.
is paper opens up a number of research lines in various directions. One critical issue of the machine learn-
ing algorithms with respect to traditional network-based approaches is the explainability ot the results, so an
important direction of research is the construction of a model that is fully explainable and do not lose quality
with respect to the measures provided by machine learning algorithms. Another possible direction for future
research is the application of this framework to dierent bipartite networks using dierent databases. Finally,
one could use statistically validated projections31 to build density-based predictions and compare them within
our testing framework. All these studies will be presented in future works.
Methods
Data description. The data we use in this analysis are obtained from the UN-COMTRADE database, Harmonized System 1992 classification (HS 1992), and include the volumes of the export flows between countries. The raw database, however, presents internal inconsistencies: for instance, the import declaration of the buying country might not coincide with the corresponding export declaration of the selling country. The correct exchanged volumes may be inferred using a Bayesian approach10. The data used in this work are obtained from this cleaning procedure. The time range covered is 1996-2018 and for each year we have a matrix V whose element V_cp is the amount, expressed in US dollars, of product p exported by country c. The total number of countries is 169 and the total number of products is 5040.
To binarize the data we determine if a country competitively exports a product by computing the Revealed Comparative Advantage (RCA) introduced by Balassa38. The RCA of a country c in product p in year y is given by:

R^{(y)}_{cp} = \frac{ V^{(y)}_{cp} / \sum_{p'} V^{(y)}_{cp'} }{ \sum_{c'} V^{(y)}_{c'p} / \sum_{c',p'} V^{(y)}_{c'p'} }    (1)

R^{(y)}_{cp} is a continuous value and represents the ratio between the weight of product p in the export basket of country c and the total weight of that product in international trade. Alternatively, the RCA can be seen as the ratio between the market share of country c relative to product p and the weight of country c with respect to total international trade. This is the standard way, in the economic complexity literature, to remove trivial effects due to the size of the country and the size of the total market of the product. In this way, a natural threshold equal to 1 can be used to establish whether country c exports product p in a competitive way or not. As a consequence, we define the matrix M whose binary element M_cp tells us if country c is competitive in the export of product p or not:

M^{(y)}_{cp} = \begin{cases} 1 & \text{if } R^{(y)}_{cp} \geq 1 \\ 0 & \text{if } R^{(y)}_{cp} < 1 \end{cases}    (2)
In this work we will try to predict future values of M_cp using past values of RCA. In Table 2 we report the main features of the country-export bipartite network described by the biadjacency matrix M (in different years). The minimum country degree is zero from 1996 to 2011 due to South Sudan, since it gained its independence in 2011. The minimum degree of the products is always zero because there are some products for which, in some years, no country has an RCA value greater than 1.
A detailed description of the dataset we used is available at74.
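The RCA computation and binarization just described can be sketched in a few lines of numpy (toy volume matrix; values and variable names are ours):

```python
import numpy as np

V = np.array([[10.0, 0.0, 5.0],
              [ 2.0, 8.0, 0.0]])            # toy export volumes V_cp (2 countries, 3 products)

num = V / V.sum(axis=1, keepdims=True)      # weight of product p in country c's export basket
den = V.sum(axis=0) / V.sum()               # weight of product p in total world trade
RCA = num / den                             # Balassa's Revealed Comparative Advantage

M = (RCA >= 1).astype(int)                  # binarization with the natural threshold 1
```

On this toy matrix the first country is competitive in products 0 and 2, the second one in product 1.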
Supervised machine learning and relatedness. Before describing our approach to measure relatedness, here we want to give a quick and intuitive description of how supervised machine learning works. A simple example consists in the construction of a binary classifier that predicts whether a patient is healthy or has contracted COVID-19 starting from their symptoms (called features). A simple approach consists in drawing a space with dimension equal to the number of features (N). Here a patient identifies a specific point in this space. A binary classifier could be a simple hyperplane with dimension N - 1 splitting the space into two distinct areas. A patient is then classified as healthy or sick depending on which of the two areas they belong to. The learning part consists in the definition of the hyperplane. During the training phase we provide the model with some patients, their symptoms, and the information whether they contracted COVID-19 or not. By minimizing a suitable loss function the model finds the optimal hyperplane that separates the healthy from the sick.
This is a very simple example of the functioning of a supervised machine learning binary classifier (which usually does not perform well, except in trivial cases where the positive and negative classes can be linearly separated). The functioning of more complex architectures like the ones we present in this paper is not so different: what we have is always a classifier that learns its task by looking at a set of training samples and their correct output. In our case, we first fix a target product. Thus a sample is a country and its exported products are the features. Looking at past data we show the algorithm whether a country will export the target product after 5 years and, once the training phase has ended, the algorithm can be used to predict whether a country will export that product after 5 years or not, given its present exports. Then this procedure is repeated for all products, each of which thus
needs a dierent training. In Fig.9 we show a schematic diagram with the general functioning of the machine
learning algorithms discussed here. As a rst step, the algorithm is trained receiving the matrix of the RCAs of
countries (
Xtrain
) and the information whether these countries will export a product or not (
Ytrain
). Once the
algorithm is trained, it receives in input the exports of countries in a year y (not used during the training stage)
and its output is the relatedness of countries with a product.
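The fit-then-score flow of Fig. 9 can be sketched with scikit-learn as follows (toy data; the Random Forest here stands in for any of the classifiers we compare, and all sizes are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X_train = rng.random((150, 30))               # RCA rows: toy country-year samples x products
y_train = (X_train[:, 5] > 0.5).astype(int)   # will the country export the target product later?

clf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_train, y_train)

X_new = rng.random((10, 30))                  # export baskets from a year unseen in training
relatedness = clf.predict_proba(X_new)[:, 1]  # progression score of each country for the target product
```

The predicted probability of the positive class is what we read as the relatedness, or progression score.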
Training and testing procedure. We want to guess which products will be exported by a country after Δ years. To do this, we exploit machine learning algorithms with the goal of (implicitly) understanding the capabilities needed to export a product from the analysis of the export baskets of countries. Since each product requires a different set of capabilities, we need to train different models: in this work, we train 5040 different Random Forests, one for each product.
The training procedure is analogous for all the models: they have to connect the RCA values of the products exported by a country in year y with the element M^{(y+Δ)}_{cp}, which tells us if country c in year y + Δ is competitive in the export of product p.
In the general case we have export data that covers a range of years [y0, ylast]. The last year is used for the test of the model and so the training set is built using only the years [y0, ylast - Δ]. In this way, no information about the Δ years preceding ylast is given.
The input of the training set, which we call Xtrain, is the vertical stack of the R^{(y)} matrices from y0 to ylast - 2Δ (see Fig. 10). In such a way we can consider all countries and all years of the training set, and these export baskets will be compared with the corresponding presence or absence of the target product p after Δ years; this is because our machine learning procedure is supervised, that is, during the training we provide a set of answers Ytrain corresponding to each export basket in Xtrain. While Xtrain is the same for all the models (even if they refer to different products), the output of the training set Ytrain changes on the basis of the product we want to predict. If we consider the model associated to product p, to build Ytrain we aggregate the columns corresponding to the target product, C^{(y)}_p, of the M matrices from y0 + Δ to ylast - Δ (so we use the same number of years, all shifted by Δ years with respect to Xtrain). This is graphically represented on the extreme left side of Fig. 10.
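The stacking of the R matrices and of the Δ-shifted target columns can be sketched as follows (toy sizes; R and M are hypothetical per-year arrays built here at random):

```python
import numpy as np

delta = 5                        # forecast horizon Delta, in years
y0, y_last = 1996, 2013          # toy year range
n_c, n_p = 4, 6                  # toy numbers of countries and products
target = 3                       # column index of the target product

rng = np.random.default_rng(3)
R = {y: rng.random((n_c, n_p)) for y in range(y0, y_last + 1)}      # RCA matrices per year
M = {y: (R[y] >= 0.5).astype(int) for y in range(y0, y_last + 1)}   # binarized matrices (toy threshold)

years = range(y0, y_last - 2 * delta + 1)                           # y0 ... y_last - 2*Delta
X_train = np.vstack([R[y] for y in years])                          # stacked export baskets
Y_train = np.concatenate([M[y + delta][:, target] for y in years])  # target column Delta years later
```

Each row of X_train is a country in a given year, paired with the presence or absence of the target product Δ years later.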
Once the model is trained, in order to perform the test we give as input Xtest the matrix R^{(ylast - Δ)}. Each model will give us its prediction for the column p of the matrix M^{(ylast)} and, putting all the results relative to the single products together, we reconstruct the whole matrix of scores M^{(ylast)}_{pred}, which we compare with the empirical one. There are various ways to compare the predictions with the actual outcomes, and these performance metrics are discussed in the following section.
As already mentioned, the same models can be tested against two different prediction tasks: either we look at the full matrix M^{(ylast)}, or we concentrate only on the possible activations, that is, products that were not present in an export basket and that countries possibly start exporting. The set of possible activations is defined as follows:

(c, p) \in \text{activations} \iff R^{(y)}_{cp} < 0.25 \;\; \forall y \in [y_0, y_{last} - \Delta]    (3)
Table 2. Main properties of the country-export bipartite network over the years between 1996 and 2018. Columns: year, number of countries, number of products, number of links, min/max/avg country degree, min/max/avg product degree.
1996 169 5040 83,754 0 2082 496 0 64 16.6
1997 169 5040 83,666 0 2059 495 0 61 16.6
1998 169 5040 84,976 0 2023 503 0 64 16.9
1999 169 5040 86,071 0 2089 509 0 66 17.1
2000 169 5040 90,327 0 2171 534 0 67 17.9
2001 169 5040 89,242 0 2138 528 0 71 17.7
2002 169 5040 88,849 0 2114 526 0 73 17.7
2003 169 5040 88,153 0 2089 522 0 73 17.5
2004 169 5040 88,662 0 2148 525 0 69 17.6
2005 169 5040 90,807 0 2171 537 0 74 18.0
2006 169 5040 90,429 0 2162 535 0 69 17.9
2007 169 5040 90,152 0 2155 533 0 72 17.9
2008 169 5040 90,505 0 2230 536 0 69 18.0
2009 169 5040 89,388 0 2157 529 0 72 17.7
2010 169 5040 88,742 0 2195 525 0 71 17.6
2011 169 5040 87,801 0 2286 520 0 68 17.4
2012 169 5040 88,368 8 2253 523 0 73 17.5
2013 169 5040 87,482 5 2222 518 0 79 17.4
2014 169 5040 85,724 7 2236 507 0 80 17.0
2015 169 5040 83,151 10 2236 492 0 81 16.5
2016 169 5040 83,012 11 2260 491 0 78 16.5
2017 169 5040 82,992 13 2202 491 0 81 16.5
2018 169 5040 81,059 12 2256 480 0 91 16.0
In other words, a pair (c, p) is a possible activation if country c has never been competitive in the export of product p up to year ylast - Δ, that is, its RCA values never exceeded 0.25. This selection of the test set may look too strict; however, it is key to test our algorithms against situations in which countries really start exporting new products. Because of the RCA binarization, there are numerous cases in which a country noisily oscillates around RCA = 1 and, de facto, that country is already competitive in that product; in these cases the RCA benchmark is more than enough for a correct prediction.
The way of training the models we just described performs better on the full matrix than on the activations. The reason is probably that the machine learning algorithms recognize the countries, because the ones in the training set and the ones in the test set are the same. When the algorithms receive as input the export basket of a country they have already seen in the training data, they tend to reproduce the strong autocorrelation of the export matrices. To avoid this problem we used a k-fold cross validation, which means that we split the countries into k groups. Since the number of countries is 169, the natural choice is to use k = 13, so we randomly extract a group α of 13 countries from the training set, which is then composed of the remaining 156 countries, and we use only the countries contained in α for the test. In this way each model is meant to make predictions only on the countries of the group α, so to cover all the 169 countries we need to repeat the procedure 13 times, each time changing the countries in the group α. This different training procedure is depicted on the right part of Fig. 10.
So there will be 13 models associated to a single product and, for this reason, the time required for the training is 13 times longer. As in the previous case, in the training set we aggregate the years in the range [y0, ylast - Δ]: Xtrain is the aggregation of the RCA matrices from y0 to ylast - 2Δ and Ytrain is the aggregation of the column p of the M matrices from y0 + Δ to ylast - Δ. In both cases, the countries in the group α are removed.
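The country-level 13-fold split can be sketched in plain Python (country labels are hypothetical):

```python
import random

# Toy labels for the 169 countries; 169 = 13 groups of 13 countries each
countries = [f"C{i:03d}" for i in range(169)]
k = 13

random.seed(0)
shuffled = countries[:]
random.shuffle(shuffled)
groups = [shuffled[i::k] for i in range(k)]   # 13 disjoint groups of 13 countries

for alpha in groups:
    train = [c for c in shuffled if c not in alpha]
    # here one would fit the per-product models on `train`
    # and predict only the countries in `alpha`
    assert len(train) == 156 and len(alpha) == 13
```

Every country appears in exactly one test group, so each country is always predicted by a model that never saw it during training.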
When we perform the test, each model takes as Xtest the matrix RCA^{(ylast - Δ)} with only the rows corresponding to the 13 countries in group α and gives as output scores the elements of the matrix M^{(ylast)}_{pred}. All the 5040 × 13 models together give as output the whole matrix of scores M^{(ylast)}_{pred}, which will be compared to the actual Ytest = M^{(ylast)}.
Since the output of the machine learning algorithms is a probability, and most of the performance indicators require a binary prediction, in order to establish whether we predict a value of 0 or 1 we have to introduce a threshold. The value of the threshold we use is the one that maximizes the F1-score. We note that the only performance measures that do not require a threshold are the ones that consider areas under curves, since these curves are built precisely by varying the threshold value.
Figure 10 schematically shows the training procedures with and without cross validation.
Figure9. Schematic diagram with the functioning of machine learning algorithms to assess the relatedness
between countries and a target product. During the training phase the model receives an
Xtrain
matrix with the
training samples (countries) and their features (products) for the years from 1996 to 2008; they are compared
with the
Ytrain
vector that contains the corresponding possible exports the target product in 2001–2013 (that is,
a binary label for each sample). Once the model is trained, it can receive in input new data (that is, an export
basket) and will provide the probability that the label (the possible export of the target product)) is 1. is
progression score is our assessment of the relatedness.
Performance indicators. The choice of the performance indicators is a key issue of supervised learning61,75 and, in general, strongly depends on the specific problem under investigation. Here we discuss the practical meaning of the performance indicators we used to compare the ML algorithms. For all the scores but the areas under curves, we need to define a threshold above which the output scores of the ML algorithms are associated with a positive prediction. For this purpose we choose the threshold that maximizes the F1 score76.
Precision. Precision is defined as the ratio between true positives and positives61. In our case, we predict that a number of products will be competitively exported by some countries; these are the positives. The precision is the fraction counting how many of these predicted products are actually exported by the respective countries after Δ years. A high value of precision is associated to a low number of false positives, that is, products that are predicted to appear usually do so.
mean Precision@k (mP@k). This indicator usually corresponds to the fraction of the top k positives that are correctly predicted. We consider only the first k predicted products separately for each country, and then we average over the countries. This is of practical relevance from a policy perspective, because many new products appear in already highly diversified countries, while we would like to be precise also in low and medium income countries. By using mP@k we quantify the correctness of our possible recommendations of k products, on average, for a country.
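A possible implementation of mP@k (a sketch of our own; all names are hypothetical):

```python
def mean_precision_at_k(scores_by_country, truth_by_country, k):
    """Average, over countries, of the fraction of each country's top-k
    scored products that are actually activated afterwards."""
    precisions = []
    for country, s in scores_by_country.items():
        top_k = sorted(s, key=s.get, reverse=True)[:k]        # k best-scored products
        hits = sum(1 for p in top_k if p in truth_by_country[country])
        precisions.append(hits / k)
    return sum(precisions) / len(precisions)

# toy example: scores per country and the products actually activated
scores = {"A": {"p1": 0.9, "p2": 0.4, "p3": 0.1},
          "B": {"p1": 0.2, "p2": 0.8, "p3": 0.6}}
truth = {"A": {"p1"}, "B": {"p2", "p3"}}
mp2 = mean_precision_at_k(scores, truth, k=2)
```

Averaging per country, rather than pooling all predictions, is what keeps less diversified countries from being dominated by the highly diversified ones.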
Recall. Recall is defined as the ratio between true positives and the sum of true positives and false negatives or, in other words, the total number of products that a country will export after Δ years61. So a high recall is associated with a low number of false negatives, that is, if we predict that a country will not start exporting a product, that country will usually not export that product. A negative recommendation is somehow less usual in strategic policy choices.
F1 Score. The F1 score or F-measure59,60 is defined as the harmonic mean of precision and recall. As such, it is possible to obtain a high value of F1 only if both precision and recall are relatively high, so it is a very frequent
Figure10. e training and testing procedure with (right) and without (le) cross validation. See the text for a
detailed explanation.
choice to assess the general behavior of the classifier. As mentioned before, both precision and recall can be trivially varied by changing the scores' binarization threshold; however, the threshold that maximizes the F1 score is far from trivial, since precision and recall quantify different properties and are linked here in a nonlinear way. The Best F1 Score is computed by finding the threshold that maximizes the F1 score.
Area under the PR curve. It is possible to build a curve in the plane defined by precision and recall by varying the threshold that identifies the value above which the scores are associated to positive predictions. This value is not misled by the class imbalance46.
ROC-AUC. The Area Under the Receiver Operating Characteristic Curve77,78 is a widespread indicator that aims at measuring the overall predictive power, in the sense that the user does not need to specify a threshold, as for Precision and Recall. On the contrary, all the scores are considered and ranked, and for each possible threshold both the True and the False Positive Rate (TPR and FPR, respectively) are computed. This procedure allows one to define a curve in the TPR/FPR plane, and the area under this curve represents the probability that a randomly selected positive instance will receive a higher score than a randomly selected negative instance45. For a random classifier, AUC = 0.5. It is well known46,79 that in the case of highly imbalanced data the AUC may give too optimistic results. This is essentially due to its focus on the overall ranking of the scores: in our case, misordering even a large number of not exported products does not affect the prediction performance; one makes correct true negative predictions only because there are a lot of negative predictions to make.
Matthews coefficient. Matthews' correlation coefficient80 takes into account all four classes of the confusion matrix and the class imbalance issue81,82.
Accuracy. Accuracy is the ratio between correct predictions (true positives and true negatives) and the total number of predictions (true positives, false positives, false negatives and true negatives)61. In our prediction exercise we find relatively high values of accuracy essentially because of the overwhelming number of (trivially) true negatives (see Table 1).
Negative predictive value. Negative predictive value is defined as the ratio between true negatives and negatives, that is, the products we predict will not be exported by a country61. Also in this case, a major role is played by the very large number of true negatives, which are however less significant from a policy perspective.
Libraries for the ML models. Most of the models are implemented with scikit-learn 0.24.0 and, as described in the Supplementary Information, we performed a careful hyperparameter optimization; in particular we used (the unspecified hyperparameter values are the default ones):
sklearn.ensemble.RandomForestClassifier(n_estimators = 100, min_samples_leaf = 7)
sklearn.svm.SVC(kernel = "rbf")
sklearn.linear_model.LogisticRegression(solver = "newton-cg")
sklearn.tree.DecisionTreeClassifier()
sklearn.ensemble.ExtraTreesClassifier(n_estimators = 100, min_samples_leaf = 8)
sklearn.ensemble.AdaBoostClassifier(n_estimators = 3)
sklearn.naive_bayes.GaussianNB()
xgboost.XGBClassifier(n_estimators = 15, min_child_weight = 45, reg_lambda = 1.5)
XGBoost is implemented using the library xgboost 1.3.1.
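For concreteness, a minimal sketch of how the first of these classifiers can be trained and used to score candidate links (random placeholder features stand in for the RCA-based inputs; the variable names are ours, not from the paper's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# placeholder features: one row per (country, product) pair
X_train = rng.random((500, 20))
y_train = np.zeros(500, dtype=int)
y_train[:25] = 1  # sparse "activation" labels (placeholder)

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=7)
clf.fit(X_train, y_train)
# predicted probabilities of class 1 play the role of relatedness scores
scores = clf.predict_proba(rng.random((10, 20)))[:, 1]
```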
Finally, the neural network is implemented using keras 2.4.3. It consists of two layers with 64 neurons and ReLU activation function, and a final layer with a single neuron and sigmoid activation. We used rmsprop as optimizer, binary_crossentropy as loss function and accuracy as metric, and we stopped the training at 10 epochs. For a detailed explanation of the choice of the hyperparameters the reader is referred to the Supplementary Information. Note that in our case tree-based models perform better, and it is known in the literature that the random forest default values already provide very good results79,83,84. In our case, the hyperparameter optimization increased our prediction performance by about 10%; in particular, it decreased the number of false positives.
Comparison with other works. Here we compare our Random Forest model with the other approaches presented in the literature that we cited in the introduction section, using a consistent testing framework (4-digit classification, comparison between the relatedness computed in 2013 and the actual new exported products in 2018 that had RCA < 0.25 from 1996 to 2013).
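The RCA threshold in this testing framework refers to Balassa's Revealed Comparative Advantage38: the share of product p in country c's exports, normalized by the share of p in world exports. A minimal numpy sketch (the toy export matrix and names are ours):

```python
import numpy as np

def rca(X):
    """Balassa's RCA for an export matrix X (rows = countries, columns = products)."""
    X = np.asarray(X, dtype=float)
    country_share = X / X.sum(axis=1, keepdims=True)  # share of p in c's exports
    world_share = X.sum(axis=0) / X.sum()             # share of p in world exports
    return country_share / world_share

X = np.array([[10., 0., 5.],
              [ 2., 8., 0.],
              [ 1., 1., 3.]])
R = rca(X)
M = (R >= 1).astype(int)  # binary country-product matrix of competitive exports
```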
Hidalgo etal. in 2007 dene the Product Space26 that is still widely used to measure relatedness37. It is a
projection of the country-product bipartite network into the layer of the products (thus dening a proximity
network of the products). e relatedness between a country and a product is dened as the density of the
former around the latter in the Product Space;
O’Clery etal. in 2021 introduce a new approach to dene the proximity network of the products called
EcoSpace32. From this network they dene the Ecosystem density—that is the likelihood of the appearance
of a product in a country—as a relatedness measure;
Medo etal. compare dierent approaches to perform a link prediction on bipartite nested networks nding
that the two most performing techniques are the Number of violations of the nestedness property (NViol)85
and the preferential attachment (prefA), where the relatedness is the product of the diversication of the
country with the ubiquity of the product36 .
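Two of these baselines can be sketched compactly on a toy binary country-product matrix (our reading of the definitions, with illustrative names): the Product Space density26, where proximity is the minimum of the two conditional co-export probabilities, and the preferential attachment score36.

```python
import numpy as np

M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])  # toy binary country-product matrix

# Product Space density: proximity-weighted share of a country's exports
co = M.T @ M                   # product-product co-occurrences
ubiquity = M.sum(axis=0)       # countries exporting each product
cond = co / ubiquity[:, None]  # P(p' | p)
phi = np.minimum(cond, cond.T)
np.fill_diagonal(phi, 0.0)
density = (M @ phi) / phi.sum(axis=0)

# Preferential attachment: diversification(country) * ubiquity(product)
diversification = M.sum(axis=1)
prefA = np.outer(diversification, ubiquity)
```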
In Table3 we show the AUC-PR, AUC-ROC, Best F1 and mean precision@5 of the dierent models. We nd
that the Random Forest outperforms the other approaches independently from the specic performance metric
used in the comparison.
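Among these indicators, mean precision@5 is perhaps the least standard; a sketch of how we read it (for each country, the fraction of its five top-scored products that were actually activated, averaged over countries; the toy data and names are ours):

```python
import numpy as np

def mean_precision_at_k(scores, activations, k=5):
    """scores, activations: (countries x products) arrays; activations is binary."""
    top_k = np.argsort(-scores, axis=1)[:, :k]             # k best-scored products
    hits = np.take_along_axis(activations, top_k, axis=1)  # were they activated?
    return hits.mean()

rng = np.random.default_rng(1)
scores = rng.random((4, 20))                           # predicted relatedness
activations = (rng.random((4, 20)) < 0.2).astype(int)  # actual new products
p_at_5 = mean_precision_at_k(scores, activations)
```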
Data availability
The data that support the findings of this study are available from UN-COMTRADE, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request to the corresponding author and with permission of UN-COMTRADE. An anonymized and processed version of the data is available at https://github.com/giamba95/SaplingSimilarity/tree/main/data/RCA to permit the full replicability of our study.
Received: 21 June 2022; Accepted: 13 January 2023
References
1. Athey, S. e impact of machine learning on economics. in e Economics of Articial Intelligence: An Agenda. 507–547 (University
of Chicago Press, 2018).
2. Rodrik, D. Diagnostics before prescription. J. Econ. Perspect. 24, 33–44 (2010).
3. Hausmann, R., Rodrik, D. & Velasco, A. Growth diagnostics. in e Washington Consensus Reconsidered: Towards a New Global
Governance. 324–355 (2008).
4. Baldovin, M., Cecconi, F., Cencini, M., Puglisi, A. & Vulpiani, A. e role of data in model building and prediction: A survey
through examples. Entropy 20, 807 (2018).
5. Hosni, H. & Vulpiani, A. Forecasting in light of big data. Philos. Technol. 31, 557–569 (2018).
6. Rodrik, D. Economics Rules: e Rights and Wrongs of the Dismal Science (WW Norton & Company, 2015).
7. Tacchella, A., Cristelli, M., Caldarelli, G., Gabrielli, A. & Pietronero, L. A new metrics for countries’ tness and products’ complex-
it y. Sci. Rep. 2, 723 (2012).
8. Cristelli, M., Gabrielli, A., Tacchella, A., Caldarelli, G. & Pietronero, L. Measuring the intangibles: A metrics for the economic
complexity of countries and products. PloS one 8, e70726 (2013).
9. Tacchella, A., Cristelli, M., Caldarelli, G., Gabrielli, A. & Pietronero, L. Economic complexity: Conceptual grounding of a new
metrics for global competitiveness. J. Econ. Dyn. Control 37, 1683–1691 (2013).
10. Tacchella, A., Mazzilli, D. & Pietronero, L. A dynamical systems approach to gross domestic product forecasting. Nat. Phys. 14,
861–865 (2018).
11. Zaccaria, A., Cristelli, M., Tacchella, A. & Pietronero, L. How the taxonomy of products drives the economic development of
countries. PloS one 9, e113770 (2014).
12. Zaccaria, A., Cristelli, M., Kupers, R., Tacchella, A. & Pietronero, L. A case study for a new metrics for economic complexity: e
Netherlands. J. Econ. Interact. Coord. 11, 151–169 (2016).
13. Gaulier, G. & Zignago, S. Baci: International trade database at the product-level (the 1994–2007 version). inCEPII Working Paper
2010–2023 (2010).
14. Hidalgo, C.A. & Hausmann, R. e building blocks of economic complexity. Proc. Natl. Acad. Sci. 106, 10570–10575 (2009).
15. Albeaik, S., Kaltenberg, M., Alsaleh, M. & Hidalgo, C. Improving the Economic Complexity Index. arXiv preprint arXiv: 1707.
05826 (2017).
16. Gabrielli, A. etal. Why we like the eci+ algorithm. arXiv preprint arXiv: 1708. 01161 (2017).
17. Albeaik, S., Kaltenberg, M., Alsaleh, M. & Hidalgo, C. 729 new measures of economic complexity (addendum to improving the
economic complexity index). arXiv preprint arXiv: 1708. 04107 (2017).
18. Pietronero, L. etal. Economic complexity:“ Buttarla in caciara” vs a constructive approach. arXiv preprint arXiv: 1709. 05272 (2017).
19. Cristelli, M., Tacchella, A. & Pietronero, L. e heterogeneous dynamics of economic complexity. PloS one 10, e0117174 (2015).
20. Cristelli, M., Tacchella, A., Cader, M., Roster, K. & Pietronero, L. On the Predictability of Growth (e World Bank, 2017).
21. Liao, H. & Vidmer, A. A comparative analysis of the predictive abilities of economic complexity metrics using international trade
network. Complexity (2018).
22. Sciarra, C., Chiarotti, G., Ridol, L. & Laio, F. Reconciling contrasting views on economic complexity. Nat. Commun. 11, 1–10
(2020).
23. Frenken, K., Van Oort, F. & Verburg, T. Related variety, unrelated variety and regional economic growth. Region. Stud. 41, 685–697
(2007).
24. Hidalgo, C.A. etal. e principle of relatedness. in International Conference on Complex Systems. 451–457 (Springer, 2018).
25. Teece, D. J., Rumelt, R., Dosi, G. & Winter, S. Understanding corporate coherence: eory and evidence. J. Econ. Behav. Organ.
23, 1–30 (1994).
26. Hidalgo, C. A., Klinger, B., Barabási, A.-L. & Hausmann, R. e product space conditions the development of nations. Science 317,
482–487 (2007).
27. Breschi, S., Lissoni, F. & Malerba, F. Knowledge-relatedness in rm technological diversication. Res. Policy 32, 69–87 (2003).
Table 3. Comparison between our Random Forest model and other approaches proposed in the literature. The Random Forest provides a better assessment of the relatedness with all the performance indicators. The highest values of each indicator are in bold.

Algorithm        AUC-PR   Best F1   AUC-ROC   mean Precision@5
Random Forest    0.015    0.042     0.689     0.049
Product Space    0.010    0.022     0.637     0.032
EcoSpace         0.013    0.035     0.663     0.042
prefA            0.011    0.024     0.645     0.046
NViol            0.011    0.025     0.607     0.046
28. Pugliese, E., Napolitano, L., Zaccaria, A. & Pietronero, L. Coherent diversification in corporate technological portfolios. PloS one 14 (2019).
29. Neffke, F., Henning, M. & Boschma, R. How do regions diversify over time? Industry relatedness and the development of new growth paths in regions. Econ. Geogr. 87, 237–265 (2011).
30. Boschma, R. et al. Technological relatedness and regional branching. in Beyond Territory. Dynamic Geographies of Knowledge Creation, Diffusion and Innovation. 64–68 (2012).
31. Pugliese, E. et al. Unfolding the innovation system for the development of countries: Coevolution of science, technology and production. Sci. Rep. 9, 1–12 (2019).
32. O'Clery, N., Yıldırım, M. A. & Hausmann, R. Productive ecosystems and the arrow of development. Nat. Commun. 12, 1–14 (2021).
33. Gnecco, G., Nutarelli, F. & Riccaboni, M. A machine learning approach to economic complexity based on matrix completion. Sci. Rep. 12, 1–10 (2022).
34. Hausmann, R., Hwang, J. & Rodrik, D. What you export matters. J. Econ. Growth 12, 1–25 (2007).
35. Bustos, S., Gomez, C., Hausmann, R. & Hidalgo, C. A. The dynamics of nestedness predicts the evolution of industrial ecosystems. PloS one 7, e49393 (2012).
36. Medo, M., Mariani, M. S. & Lü, L. Link prediction in bipartite nested networks. Entropy 20, 777 (2018).
37. Zhang, W.-Y., Chen, B.-L., Kong, Y.-X., Shi, G.-Y. & Zhang, Y.-C. Industry upgrading: Recommendations of new products based on world trade network. Entropy 21, 39 (2019).
38. Balassa, B. Trade liberalisation and “revealed” comparative advantage. Manchester Sch. 33, 99–123 (1965).
39. Tacchella, A., Zaccaria, A., Miccheli, M. & Pietronero, L. Relatedness in the era of machine learning. arXiv preprint arXiv:2103.06017 (2021).
40. Hausmann, R. et al. A roadmap for investment promotion and export diversification: The case of Jordan (Technical Report, Center for International Development at Harvard University, 2019).
41. Saracco, F., Di Clemente, R., Gabrielli, A. & Pietronero, L. From innovation to diversification: A simple competitive model. PloS one 10, e0140420 (2015).
42. Tacchella, A., Di Clemente, R., Gabrielli, A. & Pietronero, L. The build-up of diversity in complex ecosystems. arXiv preprint arXiv:1609.03617 (2016).
43. Che, N. X. Intelligent export diversification: An export recommendation system with machine learning (Technical Report, International Monetary Fund, 2020).
44. Angelini, O. & Di Matteo, T. Complexity of products: The effect of data regularisation. Entropy 20, 814 (2018).
45. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
46. Saito, T. & Rehmsmeier, M. The precision–recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one 10, e0118432 (2015).
47. Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 1189–1232 (2001).
48. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794 (2016).
49. Gulli, A. & Pal, S. Deep Learning with Keras (Packt Publishing Ltd, 2017).
50. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
51. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
52. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
53. Hosmer Jr., D. W., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression. Vol. 398 (Wiley, 2013).
54. Quinlan, J. R. Induction of decision trees. Mach. Learn. 1, 81–106 (1986).
55. Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
56. Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
57. John, G. H. & Langley, P. Estimating continuous distributions in Bayesian classifiers. arXiv preprint arXiv:1302.4964 (2013).
58. Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
59. Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26, 297–302 (1945).
60. Van Rijsbergen, C. J. Foundation of evaluation. J. Docum. (1974).
61. Powers, D. M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. (2011).
62. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
63. Géron, A. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O'Reilly Media, Inc., 2019).
64. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
65. Romer, P. The trouble with macroeconomics. Am. Econ. (2016).
66. Romer, P. M. Mathiness in the theory of economic growth. Am. Econ. Rev. 105, 89–93 (2015).
67. Head, M. L., Holman, L., Lanfear, R., Kahn, A. T. & Jennions, M. D. The extent and consequences of p-hacking in science. PLoS Biol. 13, e1002106 (2015).
68. Lin, J. Y. New Structural Economics: A Framework for Rethinking Development and Policy (The World Bank, 2012).
69. Fernandes, N. Economic effects of coronavirus outbreak (COVID-19) on the world economy. Available at SSRN 3557504 (2020).
70. Nana, I. & Starnes, S. When trade falls: Effects of COVID-19 and outlook (Technical Report, International Finance Corporation, World Bank Group, 2020).
71. Hidalgo, C. A. Economic complexity theory and applications. Nat. Rev. Phys. 3, 92–113 (2021).
72. Lin, J., Cader, M. & Pietronero, L. What African industrial development can learn from East Asian successes. in EM Compass 88 (2020).
73. Pugliese, E. & Tacchella, A. Economic complexity for competitiveness and innovation: A novel bottom-up strategy linking global and regional capacities (Technical Report, Joint Research Centre (Seville site), 2020).
74. Patelli, A., Pietronero, L. & Zaccaria, A. Integrated database for economic complexity. Sci. Data 9, 1–13 (2022).
75. Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. in Proceedings of the 23rd International Conference on Machine Learning. 161–168 (2006).
76. Lipton, Z. C., Elkan, C. & Naryanaswamy, B. Optimal thresholding of classifiers to maximize F1 measure. in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 225–239 (Springer, 2014).
77. Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
78. Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
79. Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
80. Matthews, B.W. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochim. Biophys. Acta
(BBA)-Protein Struct. 405, 442–451 (1975).
81. Chicco, D. & Jurman, G. e advantages of the Matthews correlation coecient (mcc) over f1 score and accuracy in binary clas-
sication evaluation. BMC Genomics 21, 6 (2020).
82. Boughorbel, S., Jarray, F. & El-Anbari, M. Optimal classier for imbalanced data using Matthews correlation coecient metric.
PloS one 12, e0177678 (2017).
83. Genuer, R., Poggi, J.-M. & Tuleau, C. Random forests: Some methodological insights. arXiv preprint arXiv: 0811. 3619 (2008).
84. Probst, P., Wright, M. N. & Boulesteix, A.-L. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data
Mining Knowl. Discov. 9, e1301 (2019).
85. Grimm, A. & Tessone, C. J. Analysing the sensitivity of nestedness detection methods. Appl. Netw. Sci. 2, 1–19 (2017).
Author contributions
Conceptualization: A.Z., A.T.; Methodology: all; Investigation, Software: G.A.; Validation: A.Z., A.T., G.A.; Writing, review and editing: all; Supervision: L.P.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1038/s41598-023-28179-x.
Correspondence and requests for materials should be addressed to A.Z.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
© The Author(s) 2023
... In such a way, two products can be defined as close in the sense that they share many of the capabilities needed in order to export them in a competitive way. Co-occurrences based approaches have however a low predictive performance, and this fact favors machine learning approaches as better tools to measure relatedness both at country (18)(19)(20) and firm level (21,22). In (16,23,24), the authors proposed approaches to explicitly model the relationship among products, capabilities, and development. ...
... In (16,23,24), the authors proposed approaches to explicitly model the relationship among products, capabilities, and development. These frameworks naturally lead to the concepts of product progression (16,19,25) and arrow of development (26): the relationship between products is often not undirected, or symmetric, as in the product space (10), but directed: countries starts their development from simple products and gradually enter in more sophisticated markets, following well defined paths of development (16). Obviously, the identification of the specific products enabling countries to competitively export a given target product is a key element to design industrial policies and strategic patterns of development. ...
... To do so, we investigate the mechanisms underlying a machine learning based prediction approach (18). Such approach considers the competitiveness level of each country's export on each product as features (19): obviously, some products will be dominant in the forecast ex-ercise, while others will be practically irrelevant. The feature importance (19,27,28) will be our statistically validated measure of the ability of a product to activate another product. ...
Preprint
Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability to export a target product in the near future. The explainers naturally identify a low dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers.
... Similar themes were also faced using the approach known as Economic Complexity. This approach stands out for the use of tools from complexity science, such as co-occurrence networks 7,19 and machine learning algorithms 20,21 . In 22 this approach is used to study the technological diversification of firms and the relatedness between their activities. ...
... In general, the majority of the M&A literature that builds relatedness measures between acquirers and target firms focuses on correlating such measures with successive performances and not on using them for predictions. However, as pointed out in 21 , we believe that a forecast constitute an important test to compare the goodness of relatedness assessments. Notable forecast exercise includes 41 , in which an ensemble learning algorithm is trained on a set of relative features between companies, built using patent data, to predict future acquisitions, and the attempt to M&A prediction in 42 . ...
... The output of this classifier is a score RF Y f t which represents the likelihood that the link M Y f t is 1. These scores represent an optimal measure of the relatedness 20,21 , in this case, between a company and a technology. ...
Preprint
Full-text available
Mergers and Acquisitions represent important forms of business deals, both because of the volumes involved in the transactions and because of the role of the innovation activity of companies. Nevertheless, Economic Complexity methods have not been applied to the study of this field. By considering the patent activity of about one thousand companies, we develop a method to predict future acquisitions by assuming that companies deal more frequently with technologically related ones. We address both the problem of predicting a pair of companies for a future deal and that of finding a target company given an acquirer. We compare different forecasting methodologies, including machine learning and network-based algorithms, showing that a simple angular distance with the addition of the industry sector information outperforms the other approaches. Finally, we present the Continuous Company Space, a two-dimensional representation of firms to visualize their technological proximity and possible deals. Companies and policymakers can use this approach to identify companies most likely to pursue deals or to explore possible innovation strategies.
... Tacchella et al. [20] and Straccamore et al. [21] have shown that standard co-occurrence methods perform worse than autocorrelation benchmarks, and that tree-based machine learning algorithms such as random forest [22,23] provide the present state-of-the-art with respect to the assessment of relatedness. Albora et al. [24] described this approach in detail, providing a comparison between different machine learning algorithms. ...
... e first aim of our analysis is to compare different approaches to measure the relatedness 2 Complexity between firms and products, that is how much a firm is close to being able to export a product. is is something largely studied when the economic actors are not firms but countries; as discussed in the Introduction, two types of approaches exist: complex networks [3][4][5] and supervised machine learning algorithms [20,21,24,27]. ...
... e measure of relatedness given by the use of machine learning algorithms based on decision trees has been shown to provide a better assessment of the probability for the future exports of countries than networkbased approaches [20]. In particular, it has been shown that random forest [23] and XGBoost [29] are the most performing algorithms for the task of assessing relatedness [24]. In this article, we decided to adopt the random forest (RF) since, even if with country-level data XGBoost gets slightly superior results [24], the computational time required to train a random forest is much lower, so it allows us to make a more complete analysis with a tuning of the hyperparameters and, as we will see, the use of community detection algorithms. ...
Article
Full-text available
The relatedness between a country or a firm and a product is a measure of the feasibility of that economic activity. As such, it is a driver for investments at a private and institutional level. Traditionally, relatedness is measured using networks derived by country-level co-occurrences of product pairs, that is counting how many countries export both. In this work, we compare networks and machine learning algorithms trained not only on country-level data, but also on firms, which is something not much studied due to the low availability of firm-level data. We quantitatively compare the different measures of relatedness, by using them to forecast the exports at the country and firm level, assuming that more related products have a higher likelihood to be exported in the future. Our results show that relatedness is scale dependent: the best assessments are obtained by using machine learning on the same typology of data one wants to predict. Moreover, we found that while relatedness measures based on country data are not suitable for firms, firm-level data are very informative also for the development of countries. In this sense, models built on firm data provide a better assessment of relatedness. We also discuss the effect of using parameter optimization and community detection algorithms to identify clusters of related companies and products, finding that a partition into a higher number of blocks decreases the computational time while maintaining a prediction performance well above the network-based benchmarks.
... To the best of our knowledge, these analyses are all aiming at finding explanatory variables for the present performance and not at forecasting future activity. On the contrary, the approach known as Economic Fitness and Complexity, widely applied at both country and regional level, naturally focuses on forecasting, which represent a natural, scientifically sound framework to validate and falsify the different approaches (Tacchella et al., 2018;Albora et al., 2021;Tacchella et al., 2021). The aim of the present paper is to apply the EFC forecasting methods at firm level, and in particular to the bipartite network of firms and the technological sectors in which they show patenting activity. ...
... To the best of our knowledge, these analyses are all aiming at finding explanatory variables for the present performance and not at forecasting future activity. On the contrary, the approach known as Economic Fitness and Complexity, widely applied at both country and regional level, naturally focuses on forecasting, which represent a natural, scientifically sound framework to validate and falsify the different approaches (Tacchella et al., 2018;Albora et al., 2021;Tacchella et al., 2021). The aim of the present paper is to apply the EFC forecasting methods at firm level, and in particular to the bipartite network of firms and the technological sectors in which they show patenting activity. ...
... • Machine Learning: Since our prediction exercise can be expressed in a supervised classification exercise, we can use the Random Forest algorithm (Breiman, 2001;Albora et al., 2021), and what we call the Continuous Technology Space (CTS). The first is a popular machine learing algorithm based on decision trees, while the CTS is based on the studies of Tacchella et al. (2021), and it is a projection on the space of the technology codes of the scores obtained with the Random Forest. ...
Preprint
Full-text available
We reconstruct the innovation dynamics of about two hundred thousand companies by following their patenting activity for about ten years. We define the technological portfolios of these companies as the set of technological sectors present in the patents they submit. Assuming that companies move more frequently towards related sectors, we leverage their past activity to build network-based and machine learning algorithms that forecast the future submission of patents in new sectors. We compare different evaluation metrics and prediction methodologies, showing that tree-based machine learning algorithms outperform the standard methods based on networks of co-occurrences. This methodology can be applied by firms and policymakers to disentangle, given a company's present innovation activity, the feasible technological sectors from those that are out of reach.
... For instance, the bipartite network representing which country exports which product is the basis of the economic complexity (EC) framework [11]. The main tool of EC is the so-called relatedness [12,13,14], a measure of how close a country is to starting to export a given product. This is a key tool for institutions and policy makers, and a driver for investments [15,16]. ...
Preprint
Many bipartite networks describe systems where a link represents a relation between a user and an item. Measuring the similarity between either users or items is the basis of memory-based collaborative filtering, a widely used method to build a recommender system with the purpose of proposing items to users. When the edges of the network are unweighted, traditional approaches allow only positive similarity values, so neglecting the possibility and the effect of two users (or two items) being very dissimilar. Here we propose a method to compute similarity that allows also negative values, the Sapling Similarity. The key idea is to look at how the information that a user is connected to an item influences our prior estimation of the probability that another user is connected to the same item: if it is reduced, then the similarity between the two users will be negative, otherwise it will be positive. Using different datasets, we show that the Sapling Similarity outperforms other similarity metrics when it is used to recommend new items to users.
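The key idea can be illustrated with a deliberately simplified toy. This is NOT the actual Sapling Similarity formula (which is defined in the paper); it is only a sketch of how conditioning on one user's links can push a similarity estimate for another user below zero.

```python
# Toy illustration of a user-user similarity that can turn negative.
# B[u][i] = 1 if user u is linked to item i (illustrative data).
B = [
    [1, 1, 1, 0, 0],  # user 0
    [1, 1, 0, 0, 0],  # user 1: overlaps user 0's items
    [0, 0, 0, 1, 1],  # user 2: avoids user 0's items
]
n_items = len(B[0])

def toy_similarity(u, v):
    baseline = sum(B[v]) / n_items  # prior: v's link rate over all items
    items_u = [i for i in range(n_items) if B[u][i] == 1]
    # Average change with respect to the prior, restricted to u's items:
    # positive if v is linked to u's items more often than its baseline,
    # negative if less often.
    return sum(B[v][i] - baseline for i in items_u) / len(items_u)

print(toy_similarity(0, 1))  # > 0: users 0 and 1 are similar
print(toy_similarity(0, 2))  # < 0: users 0 and 2 are dissimilar
```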
... Optimal diversification strategies and technology forecasting for MAs at different scales and capabilities. Our theoretical framework can be applied to study the best diversification strategy for MAs, assessing the best technologies to develop in a city, as in 32,60 . However, some metropolitan areas can diversify as much as they want because their size and capabilities are close to those of a whole country; others cannot diversify their technological products as much because they do not have the resources to do so. ...
Preprint
Over the years, the growing availability of extensive datasets about registered patents has allowed researchers to better understand the drivers of technological innovation. In this work, we investigate how the technological contents of patents characterise the development of metropolitan areas and how innovation is related to GDP per capita. Exploiting worldwide data from 1980 to 2014, and through network-based techniques that only use information about patents, we identify coherent, distinct groups of metropolitan areas, either clustered in the same geographical area or similar from an economic point of view. We also extend the concept of coherent diversification to patent production by showing how it represents a decisive factor in the economic growth of metropolitan areas. These results confirm a picture in which technological innovation can lead and steer the economic development of cities, opening the possibility of adopting the tools introduced here to investigate the interplay between urban development and technological innovation.
... With this conception of development, a country's productive structure, identified with a set of nodes within a product space, is not static, because of the continuous appearance of new capabilities and the changing nature of knowledge. Thus, the dynamics of an economy depends on how rapidly productive knowledge is created internally (Albora et al., 2021). ...
When analyzing countries’ medium- and long-term economic performance, it is important to study jointly the dynamics of growth and the industrial evolution that determines how the productive structure changes over time. In this paper, we use the Economic Fitness metric to describe the competitiveness of countries' industrial structure, and we classify economic episodes, using five-year windows, to establish how countries grow (i.e., above or below a trend, and with a dynamic or static industrial structure). We document a complex growth dynamic using data covering two decades (1995-2014) for a large set of countries. This pattern indicates that the observed sequences of spells vary substantially even between countries within the same growth regime (low, medium, and high). Moreover, using a multinomial econometric model, we find a robust statistical relationship between these spells and Economic Fitness. In particular, we show that economies with higher fitness are more resilient, since episodes of below-average growth and the net disappearance of competitive firms are less likely to happen.
Article
Full-text available
Over the years, the growing availability of extensive datasets about registered patents has allowed researchers to get a deeper insight into the drivers of technological innovation. In this work, we investigate how patents’ technological contents characterise metropolitan areas’ development and how innovation is related to GDP per capita. Exploiting worldwide data from 1980 to 2014, and through network-based techniques that only use information about patents, we identify coherent, distinct groups of metropolitan areas, either clustered in the same geographical area or similar in terms of their economic features. Moreover, we extend the notion of coherent diversification to patent production and show how it is linked to the economic growth of metropolitan areas. Our findings draw a picture in which technological innovation can play a key role in the economic development of urban areas. We contend that the tools introduced in this paper can be used to further explore the interplay between urban growth and technological innovation.
Article
Full-text available
Relatedness is a key concept in economic complexity, since assessing the similarity between industrial sectors enables policymakers to design optimal development strategies. However, among the different ways to quantify relatedness, a measure that explicitly takes into account the time correlation structure of exports is still lacking. In this paper, we introduce an asymmetric definition of relatedness based on statistically significant partial correlations between the exports of economic sectors, and we apply it to a recently introduced database that integrates the export of physical goods with the export of services. Our asymmetric relatedness is obtained by generalising a recently introduced correlation-filtering algorithm, the partial correlation planar graph, to allow its application to multi-sample, multivariate datasets, and in particular to bipartite temporal networks. The result is a network of economic activities whose links represent their mutual influence in terms of temporal correlations; we also compute the statistical confidence of the edges in the network via an adapted bootstrapping procedure. We find that the underlying influence structure of the system leads to the formation of intuitively related clusters of economic sectors in the network, and to a relatively strong assortative mixing of sectors according to their complexity. Moreover, hub nodes tend to form more robust connections than those in the periphery.
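A minimal sketch of the first-order partial correlation underlying this kind of construction may help. The paper works with statistically validated partial correlations on multi-sample export data; the toy below (our own illustrative series) shows only the basic quantity: the correlation between two series once a common driver is controlled for.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation: x vs y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

# Toy series: x and y both follow the common trend z, but their
# fluctuations around z are exactly opposite.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]  # z + e, with e = (1, -1, -1, 1)
y = [0, 3, 4, 3]  # z - e
print(pearson(x, y))        # positive, driven by the common trend z
print(partial_corr(x, y, z))  # -1.0: perfectly anti-correlated given z
```

The example makes the motivation concrete: the raw correlation between two sectors can be positive purely because of a shared driver, while the partial correlation isolates their direct relation.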
Article
Full-text available
We present an integrated database suitable for investigating the economic development of countries using the Economic Fitness and Complexity framework. First, we implement machine learning techniques to reconstruct the export flow of services and combine it with the export flow of physical goods, generating a complete view of the international market, which we denote the Integrated database. We then support the technical quality of the database by computing the main metrics of the Economic Fitness and Complexity framework: (i) we build a statistically validated network of economic activities, where preferred paths of development and clusters of High-Tech industries naturally emerge; (ii) we evaluate the Economic Fitness, an algorithmic assessment of the competitiveness of countries, removing the unexpected misbehaviour of economies that are under-represented when only the export of physical goods is considered.
Article
Full-text available
This work applies Matrix Completion (MC) – a class of machine-learning methods commonly used in recommendation systems – to analyze economic complexity. MC is applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, whose elements express the relative advantage of countries in given classes of products, as evidenced by yearly trade flows. A high-accuracy binary classifier is derived from the MC application to discriminate between elements of the RCA matrix that are, respectively, higher or lower than one. We introduce a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) based on MC and related to the degree of predictability of the RCA entries of different countries (the lower the predictability, the higher the complexity). Differently from previously developed economic complexity indices, MONEY takes into account several singular vectors of the matrix reconstructed by MC, whereas other indices are based on only one or two eigenvectors of a suitable symmetric matrix derived from the RCA matrix. Finally, MC is compared with state-of-the-art economic complexity indices, showing that the MC-based classifier achieves better performance than previous methods based on the application of machine learning to economic complexity.
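The RCA matrix that MC reconstructs is the standard Balassa index. A minimal sketch on a toy export matrix (all values and names are illustrative, not from the paper):

```python
# Balassa's Revealed Comparative Advantage (RCA) on a toy export matrix.
# X[c][p] = export value of product p by country c (illustrative numbers).
X = [
    [10.0, 0.0, 5.0],
    [2.0, 8.0, 0.0],
]
n_c, n_p = len(X), len(X[0])
world_total = sum(sum(row) for row in X)
country_total = [sum(row) for row in X]
product_total = [sum(X[c][p] for c in range(n_c)) for p in range(n_p)]

def rca(c, p):
    """(share of p in c's exports) / (share of p in world exports)."""
    return (X[c][p] / country_total[c]) / (product_total[p] / world_total)

# Binary RCA matrix: the object whose >=1 / <1 entries the MC-based
# classifier discriminates.
M = [[1 if rca(c, p) >= 1 else 0 for p in range(n_p)] for c in range(n_c)]
print(M)  # [[1, 0, 1], [0, 1, 0]]
```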
Article
Full-text available
Economic growth is associated with the diversification of economic activities, which can be observed via the evolution of product export baskets. Exporting a new product is dependent on having, and acquiring, a specific set of capabilities, making the diversification process path-dependent. Taking an agnostic view on the identity of the capabilities, here we derive a probabilistic model for the directed dynamical process of capability accumulation and product diversification of countries. Using international trade data, we identify the set of pre-existing products, the product Ecosystem, that enables a product to be exported competitively. We construct a directed network of products, the Eco Space, where the edge weight corresponds to capability overlap. We uncover a modular structure, and show that low- and middle-income countries move from product communities dominated by small Ecosystem products to advanced (large Ecosystem) product clusters over time. Finally, we show that our network model is predictive of product appearances.
Article
Full-text available
Economic complexity methods have become popular tools in economic geography, international development and innovation studies. Here, I review economic complexity theory and applications, with a particular focus on two streams of literature: the literature on relatedness, which focuses on the evolution of specialization patterns, and the literature on metrics of economic complexity, which uses dimensionality reduction techniques to create metrics of economic sophistication that are predictive of variations in income, economic growth, emissions and income inequality.
Article
Full-text available
This paper presents a set of collaborative filtering algorithms that produce product recommendations to diversify and optimize a country's export structure in support of sustainable long-term growth. The recommendation system is able to accurately predict the historical trends in export content and structure for high-growth countries, such as China, India, Poland, and Chile, over 20-year spans. As a contemporary case study, the system is applied to Paraguay, to create recommendations for the country's export diversification strategy.
Article
Full-text available
Summarising the complexity of a country’s economy in a single number is the holy grail for scholars engaging in data-based economics. In a field where the Gross Domestic Product remains the preferred indicator for many, economic complexity measures, aiming at uncovering the productive knowledge of countries, have been stirring the pot in the past few years. The commonly used methodologies to measure economic complexity produce contrasting results, undermining their acceptance and applications. Here we show that these methodologies – apparently conflicting on fundamental aspects – can be reconciled by adopting a neat mathematical perspective based on linear-algebra tools within a bipartite-networks framework. The obtained results shed new light on the potential of economic complexity to trace and forecast countries’ innovation potential and to interpret the temporal dynamics of economic growth, possibly paving the way to a micro-foundation of the field.
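As a concrete anchor for the bipartite-network framework discussed here, the non-linear Fitness-Complexity iteration (Tacchella et al.) can be sketched in plain Python on a toy nested matrix. The matrix and the iteration count are illustrative choices of ours; real applications run the map on the full binary country-product export matrix.

```python
# Plain-Python sketch of the non-linear Fitness-Complexity iteration
# on a toy (perfectly nested) country-product matrix.
M = [
    [1, 1, 1, 1],  # highly diversified country
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # exports only the most ubiquitous product
]
n_c, n_p = len(M), len(M[0])
F = [1.0] * n_c  # country fitness
Q = [1.0] * n_p  # product complexity

for _ in range(100):
    # Fitness: sum of the complexities of the exported products.
    F_new = [sum(M[c][p] * Q[p] for p in range(n_p)) for c in range(n_c)]
    # Complexity: penalised when low-fitness countries export the product.
    Q_new = [1.0 / sum(M[c][p] / F[c] for c in range(n_c)) for p in range(n_p)]
    # Normalise by the mean at each step so the scales stay fixed.
    mF, mQ = sum(F_new) / n_c, sum(Q_new) / n_p
    F = [f / mF for f in F_new]
    Q = [q / mQ for q in Q_new]

print(F)  # the diversified country ends up with the highest fitness
print(Q)  # the ubiquitous product ends up with the lowest complexity
```

Note the asymmetry between the two update rules: diversification raises a country's fitness linearly, while a single low-fitness exporter is enough to drag a product's complexity down via the harmonic-like sum.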