Explaining the Unexplainable: Role
of XAI for Flight Take-Off Time Delay
Prediction
Waleed Jmoona¹, Mobyen Uddin Ahmed¹, Mir Riyanul Islam¹, Shaibal Barua¹, Shahina Begum¹, Ana Ferreira², and Nicola Cavagnetto²
¹ School of Innovation, Design and Engineering, Mälardalen University, 72123 Västerås, Sweden
{waleed.jmoona,mobyen.uddin.ahmed,mir.riyanul.islam,shaibal.barua,shahina.begum}@mdu.se
² Deep Blue, Rome, Italy
{ana.ferreira,nicola.cavagnetto}@dblue.it
Abstract. Flight Take-Off Time (TOT) delay prediction is essential to optimizing capacity-related tasks in Air Traffic Management (ATM) systems. Recently, the ATM domain has put considerable effort into predicting TOT delays using machine learning (ML) algorithms, which are often seen as "black boxes"; as a result, it is difficult for air traffic controllers (ATCOs) to understand how the algorithms reach their decisions. Hence, ATCOs are reluctant to trust the decisions or predictions provided by the algorithms. This research paper explores the use of explainable artificial intelligence (XAI) in explaining flight TOT delay predictions made by ML-based models to ATCOs. Here, three post hoc explanation methods are employed to
explain the models’ predictions. Quantitative and user evaluations are
conducted to assess the acceptability and usability of the XAI meth-
ods in explaining the predictions to ATCOs. The results show that the
post hoc methods can successfully mimic the inference mechanism and
explain the models’ individual predictions. The user evaluation reveals
that user-centric explanation is more usable and preferred by ATCOs.
These findings demonstrate the potential of XAI to improve the trans-
parency and interpretability of ML models in the ATM domain.
Keywords: Explainable Artificial Intelligence · LIME · SHAP · DALEX · Flight Take-off Time Delay Prediction · Air Traffic Management
1 Introduction
Artificial Intelligence (AI) has seen a surge in interest over the past decade due to
the availability of massive volumes of data and the efficient computation of learning algorithms on graphics processing units¹.
¹ https://www.coe.int/en/web/artificial-intelligence/history-of-ai.
© IFIP International Federation for Information Processing 2023
Published by Springer Nature Switzerland AG 2023
I. Maglogiannis et al. (Eds.): AIAI 2023, IFIP AICT 676, pp. 81–93, 2023.
https://doi.org/10.1007/978-3-031-34107-6_7
AI has been applied to various domains, including ATM [8]. The availability of vast amounts of data in
aviation compared to other transportation modes has contributed to this trend
[15]. However, despite previous research on AI for the ATM domain, AI has not become fully operational there, nor has it brought significant benefits to end-users [8].
The slow progress in the application of AI in ATM is due to safety concerns, as
the domain involves critical situations with human lives at stake. Nevertheless,
the use of AI models such as Deep Convolutional Neural Networks (DCNN),
Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Gradient
Boosting Machine (GBM) has gained significant attention for predicting delays
in ATM, as evidenced by recent studies [1,6,9,18]. Understanding the reasons behind congestion, trajectory routes, and delays is essential for improving traffic management. However, one of the challenges in using these algorithms is their lack of
explainability, which has prompted the XAI community to focus on developing
techniques such as Local Interpretable Model-agnostic Explanation (LIME) [14],
Shapley Additive Explanations (SHAP) [13], and Model-Agnostic Language for
Exploration and Explanations (DALEX) [2] to enhance the transparency of these
models. XAI has emerged as a field of research that aims to provide transparency
to high-consequence decisions made by AI systems in various domains, including
healthcare, criminal justice, and ATM [11]. Further, in the XAI domain, the aim
is to empower users in data analysis and decision-making processes.
This research focuses on XAI techniques applied to flight TOT delay predic-
tion, which deals with capacity optimization in the ATM domain. The main con-
tribution of the study is to investigate the acceptability and usability of different
XAI methods using both quantitative and user evaluation processes. Here, RF
and XGBoost are used to develop two predictive models for delay predictions. The
SHAP, LIME, and DALEX are employed to explain the predictions. The perfor-
mance of ML models and the goodness of explanations are compared in the quan-
titative evaluation. Two rounds of experiments were conducted in user evaluation
to compare the acceptability and usability of the three XAI methods in explaining
the prediction to ATCOs. In general, the initial results demonstrate the feasibility
and effectiveness of XAI techniques in enhancing the transparency, interpretabil-
ity, and performance of AI models in ATM, particularly for TOT delay prediction.
2 Materials and Methods
2.1 Dataset
The dataset used in this study was acquired from EUROCONTROL², and contains the flight data messages from the Enhanced Tactical Flow Management System (ETFMS) for all flights during the period of May to October 2019. In particular, the dataset contains approximately 9.5 million ETFMS flight data (EFD) messages. The mes-
sages include basic information about the flights, their status, information on the
previous flight legs, Air Traffic Flow Management (ATFM) regulations, calendar
information, etc. A detailed description of the features can be found in the previous
research work conducted with similar tasks of delay prediction [6].
² https://www.eurocontrol.int/.
2.2 Prediction Models
A vast range of ML and AI models has been developed for ATM tasks. However, RF and XGBoost have been the most frequently used in prediction (regression) tasks, based on our previous literature study reported in [11]. Therefore, in this
paper, two different prediction models were developed with RF and XGBoost.
The RF algorithm, which consists of a collection of randomised decision trees,
is one of the well-known ensemble algorithms of ML utilised in numerous ATM-
linked tasks [3]. The motivation behind using RF is that it generalises well with both numerical and categorical tabular data without assuming the independence of the features [3]. Concurrently, XGBoost was chosen with a view to enhancing the performance of the flight TOT delay prediction models developed with predecessor algorithms in the previous work by Dalmau et al. [7]. In particular, XGBoost is a scalable ML algorithm for tree boosting used in regression and classification applications [5]. It makes the final prediction with an ensemble of weak prediction models, each of which is itself a decision tree.
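To make the modelling setup concrete, the following is a minimal sketch of instantiating and fitting the two regressors, assuming scikit-learn and the xgboost package; the variables X_train, y_train and X_test are illustrative placeholders, not the project's actual code.

```python
# A minimal sketch of the two prediction models, assuming scikit-learn and
# the xgboost package. X_train/y_train/X_test are placeholders for the
# prepared feature matrix and the take-off delay (in minutes) target.
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Bagged ensemble of randomised decision trees (Sect. 2.2)
rf = RandomForestRegressor(n_estimators=500, random_state=42)

# Boosted ensemble of weak tree learners (Sect. 2.2)
xgb = XGBRegressor(n_estimators=500, objective="reg:squarederror")

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
delay_pred = xgb.predict(X_test)  # predicted TOT delay in minutes
```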
2.3 Explanation Models
To explain the flight TOT delay prediction, LIME and SHAP were chosen as the most widely used explanation generation tools for heterogeneous data [11]. These tools produce an explanation of the prediction from the model's perspective. In addition, DALEX was included to let users interact with the explanation generation by selecting their preferred features. LIME [14] is a tool that uses an interpretable model to approximate each individual prediction made by any black-box ML model. LIME uses a three-step process to determine the specific contributions of the chosen features: perturbing the original data points, feeding them to the black-box model, and then observing the related predictions. For example, in our delay prediction case, each prediction of a flight TOT delay is shown together with a list of features that contribute to the delay and their weights in that prediction. SHAP is a mathematical technique
[13] that was developed based on the “Shapley Values” proposed by Shapley
in the cooperative game theory [16]. Shapley values are a mechanism to fairly
assign impact to features that might not have an equal influence on the predic-
tions. To generate additive explanations for predictions from black-box models,
the Shapley value concept was incorporated. In delay prediction, to explain the
decisions from the model (i.e., prediction), SHAP calculates the contribution of
each feature in the prediction from the model. DALEX is a Python library built
upon the software for explainable machine learning proposed by Biecek [2]. The
main goal of the DALEX tool is to create a level of abstraction around a model
that makes it easier to explore and explain the model. Explanations deal with two levels of uncertainty: the model level and the explanation level. The underlying idea
is to capture the contribution of a feature to the model’s prediction by comput-
ing the shift in the expected value of the prediction while fixing the values of
other features. In this paper, for delay prediction, DALEX was used to generate
an interactive Breakdown plot, which detects local interactions of user-selected
features.
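As an illustration of how the three tools are typically invoked, a hedged sketch is given below; it assumes the Python packages lime, shap, and dalex and a trained XGBoost model, with all variable names (xgb, X_train, X_test, y_train) as placeholders rather than the project's actual code.

```python
# A hedged sketch of generating explanations with the three tools, assuming
# the lime, shap and dalex packages and a trained model `xgb` (placeholder).
import shap
import dalex as dx
from lime.lime_tabular import LimeTabularExplainer

# LIME: perturb the instance, query the black box, fit a local surrogate
lime_explainer = LimeTabularExplainer(
    X_train.values, feature_names=list(X_train.columns), mode="regression")
lime_exp = lime_explainer.explain_instance(
    X_test.values[0], xgb.predict, num_features=10)
print(lime_exp.as_list())          # (feature, weight) pairs for one flight

# SHAP: Shapley-value attributions for tree ensembles
shap_explainer = shap.TreeExplainer(xgb)
shap_values = shap_explainer.shap_values(X_test)  # one row per instance

# DALEX: user-selectable break-down of a single prediction
dalex_exp = dx.Explainer(xgb, X_train, y_train)
bd = dalex_exp.predict_parts(X_test.iloc[0], type="break_down_interactions")
bd.plot()                          # interactive Breakdown plot
```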
2.4 Validation Approaches
Evaluation Metrics. In this study, Mean Absolute Error (MAE) is used as the prediction-accuracy metric for both the prediction and explanation models. MAE is computed as the average absolute difference between the actual values $y_i$ and the observed values $\hat{y}_i$ from the model using the following equation:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (1)$$
For the prediction models, MAE was calculated using Eq. 1, considering the
delays from the test dataset as actual values in comparison with the predicted
delays by the developed regression models as the observed values.
To assess the similarity between the inference mechanisms of the prediction and explanation models, an evaluation of the explanation models' predictions was carried out. This involves calculating the MAE by taking the explanation models' predictions as the observed values and the outputs of the regression models as the actual values. This metric is termed the local accuracy in the discussions presented in Sect. 4.
To assess the quality of the feature attribution from the explanation models
with respect to the prediction models, Normalised Discounted Cumulative Gain
(nDCG) scores were examined. By definition, the nDCG score compares the
order of retrieved documents in information retrieval tasks [4,17]. Here, nDCG
scores were used to compare the ordering of the features from the regression and explanation models in terms of the importance scores underlying the corresponding predictions.
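As one plausible reading of these two metrics, the sketch below computes local accuracy and nDCG from SHAP attributions; the exact aggregation used in the study may differ, and `model`, `X_test`, `shap_values` and `expected_value` are placeholders produced by a trained regressor and a SHAP explainer.

```python
# A hedged sketch of the two explanation metrics; one plausible reading of
# the text above, with `model`, `X_test`, `shap_values` and `expected_value`
# as placeholders from a trained regressor and a SHAP explainer.
import numpy as np
from sklearn.metrics import ndcg_score

# Local accuracy (Eq. 1): MAE between the explainer's reconstructed
# prediction (base value + summed attributions) and the model's own output.
reconstructed = expected_value + shap_values.sum(axis=1)
local_accuracy = np.mean(np.abs(model.predict(X_test) - reconstructed))

# nDCG: agreement between the model's global feature-importance ranking
# (used as graded relevance) and each instance's attribution ranking.
relevance = np.tile(model.feature_importances_, (len(X_test), 1))
ndcg = ndcg_score(relevance, np.abs(shap_values))
```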
Table 1. Questionnaires to assess ATCOs' satisfaction and acceptance of the delay prediction XAI tool.

Level of Understanding (Post Condition):
  Question 1: ATCOs understand why the delay value (time) is influenced by the selected features.
  Question 2: ATCOs understand the contribution of each feature to the overall delay value (time).
  Question 3: ATCOs understand why the tool selected the features based on their operational relevance.
  Question 4: Having access to an explanation would increase ATCOs' accuracy in making an impact assessment in operations.

Overall Delay Prediction Outcome (Final Questionnaire):
  Question 5: ATCOs find the information presented clear and understandable.
  Question 6: The unit in which information is presented is usable in operations.
  Question 7: Knowing the features that influence the overall delay would help ATCOs optimize runway use.
User Evaluation on the Satisfaction of Explanation. The objective of the
user evaluation was to assess the impact of different levels of explainability on con-
trollers' acceptance. The hypothesis was that ATCOs' self-reported acceptance would differ between the explanation methods and explanation levels. Hoffman et al. [10] suggested using checklist-style questionnaires to measure user satisfaction, addressing key attributes of explanations, e.g., understandability, sufficiency of detail, and usefulness. Therefore, to assess user satisfaction and accep-
tance of the explanation, the study utilized self-report questionnaires designed
using a Likert scale [12]. The Likert scale used a five-level format ranging from
“Strongly disagree” to “Strongly agree”. Two categories of questionnaires were
administered (see Table 1): one after each condition, i.e., explanations generated
by SHAP, LIME, and DALEX, and one final questionnaire after all tasks were com-
pleted. The “post condition” questionnaire focused on understanding the AI out-
come and the impact on work performance, while the final questionnaire evaluated
the general usability and impact on work performance of a delay prediction tool.
3 Explaining Take-Off Delay Prediction
This section outlines the methodology we followed to answer the research ques-
tion of this study. The methodology consists of three phases: Data Preparation,
Model Development, and Validation. This section provides an overview of each
phase and describes how they are related to each other. Figure 1 illustrates the phases of our research study.
Fig. 1. Experimental methodology of explaining flight TOT delay prediction.
3.1 Phase 1: Data Preparation
The success of any machine learning model depends heavily on the quality and
relevance of the dataset used for training. In this phase, we describe the steps taken in data pre-processing and feature engineering to optimise the models' performances. During pre-processing, all messages in the dataset with missing features or outliers were dropped, so that the retained messages contain complete information about the flights. Considering the duration of the pre-tactical phase, i.e., 6 h [6], only those messages were considered that were received within the interval (0, 360] minutes from the Estimated Off-
Block Time (EOBT). In summary, the dataset was analyzed to extract 42 fea-
tures based on air traffic control experts’ determination, as described in [6].
The target variable was calculated from the difference between the Actual Take-
off Time (ATOT) and the Estimated Take-off Time (ETOT). The features are
classified as categorical or numerical, and whether it changes (dynamic) or not
(static) during the progress of the flight, from the individual flight plan (IFP)
to the ATOT. The final dataset contained 7,613,584 instances for 609,202 flights
flown by 18,214 distinct aircraft with an average of 12 flights per day and approx-
imately 15 min of average take-off delay. Finally, to improve the prediction per-
formance of the trained models, a split of the time series data was performed
where the training set consisted of instances from May to August, while the test
set included instances from September to October. This split was preferred over
the more standard 80/20 training/testing split as it preserved the chronological
order of the data (i.e., each row of data in the time series has a time dependency)
and accounted for any seasonal patterns or trends in the time series, leading to
more accurate predictions.
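As an illustration of this phase, below is a minimal pandas sketch of the filtering, target construction and chronological split; the file name and column names (msg_time, eobt, etot, atot) are illustrative placeholders, not the actual EFD message schema.

```python
# A minimal pandas sketch of the pre-processing above; the file name and
# column names are illustrative placeholders, not the actual EFD schema.
import pandas as pd

df = pd.read_csv("efd_messages.csv",
                 parse_dates=["msg_time", "eobt", "etot", "atot"])
df = df.dropna()  # drop messages with missing features

# Keep only messages received within (0, 360] minutes of the EOBT
mins_to_eobt = (df["eobt"] - df["msg_time"]).dt.total_seconds() / 60
df = df[(mins_to_eobt > 0) & (mins_to_eobt <= 360)]

# Target: take-off delay = ATOT - ETOT, in minutes
df["delay"] = (df["atot"] - df["etot"]).dt.total_seconds() / 60

# Chronological split preserving time dependency and seasonality
train = df[df["msg_time"].dt.month.between(5, 8)]   # May-August
test = df[df["msg_time"].dt.month.between(9, 10)]   # September-October
```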
3.2 Phase 2: Development
In the second phase, the focus is on developing predictive and explainable models
for the study. Once the data has been prepared in Phase 1, it is then processed
in Phase 2 to implement two AI models and three explainability tools.
Both models, RF and XGBoost, were trained with parameter values selected through a grid search with 5-fold cross-validation, using value grids chosen based on experience from previous works. In total, 144 different combinations of parameters were tested for each of the RF and XGBoost models. Finally, the RF was trained with the following hyperparameters: max_depth = 7, max_features = 16 and n_estimators = 500. The final XGBoost model was trained with the following hyperparameters: learning_rate = 0.1, max_depth = 7, min_child_weight = 1, subsample = 0.5, colsample_bytree = 0.5, n_estimators = 500.
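For reference, a sketch of such a grid search for the XGBoost model with scikit-learn is shown below; the value grid is an illustrative subset, not the 144-combination grids actually used, and X_train/y_train are placeholders.

```python
# A sketch of the 5-fold grid search described above, assuming scikit-learn;
# the value grid is an illustrative subset, not the 144 combinations tested.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5, 1.0],
    "colsample_bytree": [0.5, 1.0],
}

search = GridSearchCV(
    XGBRegressor(min_child_weight=1),
    param_grid, cv=5,
    scoring="neg_mean_absolute_error", n_jobs=-1)
search.fit(X_train, y_train)          # X_train/y_train: placeholders
best_xgb = search.best_estimator_     # e.g. the hyperparameters above
```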
To generate explanations on the flight TOT delay prediction, three explain-
ability tools, namely, SHAP, LIME and DALEX were used. Each of the two
trained prediction models and the test dataset were fed into the tools to com-
pute the individual contribution of each feature to the final prediction for each
instance. To present the individual contributions of the features in a comprehen-
sible manner to the users, breakdown plots were drawn. Figure 2c illustrates a
breakdown plot that explains the prediction of 7.31 min. In the figure, the blue
bar corresponds to the predicted delay. The red and green bars with corresponding values represent increases and decreases of the delay time, respectively.
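For instance, a comparable per-flight breakdown-style visualisation can be drawn with SHAP's waterfall plot; this is a hedged sketch, with `xgb` and `X_test` as placeholders for the trained model and test data.

```python
# A hedged sketch of a per-flight breakdown-style visualisation with SHAP;
# `xgb` and `X_test` are placeholders for the trained model and test data.
import shap

explainer = shap.TreeExplainer(xgb)
explanation = explainer(X_test)      # shap.Explanation with base values

# Waterfall plot: SHAP's analogue of the breakdown plot in Fig. 2c, showing
# how each feature pushes the prediction up or down from the base value.
shap.plots.waterfall(explanation[0])
```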
3.3 Phase 3: Validation
The final phase of this methodology involves two validation scenarios: quantita-
tive evaluation and user validation.
Quantitative Evaluation. The evaluation of the take-off delay prediction was
performed from two perspectives: prediction models and explainability meth-
ods. To contextualise the performance of our models (RF and XGBoost), two TOT delay prediction models reported in Dalmau's work [6], the ETFMS and a gradient-boosted decision tree (GBDT), were selected for comparison. The ETFMS was used as a benchmark because it is widely used in aviation for delay prediction, while the GBDT was included to explore its potential in flight delay prediction, given its good results in other time-series prediction tasks. Addi-
tionally, two explainability methods, LIME and SHAP, were used to explain the
predictions of the XGBoost model based on the previous results of the model.
Fig. 2. An example of scenario presentation and associated prediction of delay with
explanation for user validation.
User Evaluation. In the user evaluation phase, a web platform was developed
to conduct the evaluation of three XAI methods used to explain the prediction
results to end users. The selected methods were LIME, SHAP, and DALEX. In
the user validation exercise, a group of nine ATCOs with
varying levels of expertise participated. The participants, who were all male and
aged between 30 and 60, had a minimum of 5 years of experience as ATCOs. For
LIME and SHAP, each participant watched an introductory video and was pre-
sented with a scenario context with narrative (Fig. 2a) and illustration (Fig. 2b)
of a delayed flight. In addition, a prediction of the delay with explanation was
also presented (Fig. 2c) which varied based on the explainability tool used to
generate the explanation. The system based on LIME and SHAP then presented
a breakdown plot that contained the most important features contributing to
the delay. At the end of each scenario, the participant was asked to fill in ques-
tions 1 to 4 in the questionnaire presented in Table 1 to evaluate factors linked
to human performance. For DALEX, each participant watched an introductory
video and was presented with a scenario context of a delayed flight. They were then asked to select five features from a list of features. Based on the chosen
features, the system presented a breakdown plot containing the five selected fea-
tures contributing to the delay. At the end of the scenario, participants were
asked to complete the questionnaire (questions 1 to 4). Once all scenarios were
completed, the participants were asked to answer a final questionnaire (ques-
tions 5 to 7) to evaluate the effectiveness of the XAI methods in explaining the
prediction results. Further, qualitative feedback was collected with open-ended
questions from the representative of ANACNA (Italian Air Traffic Controllers
Association).
4 Results and Discussion
4.1 Quantitative Evaluation
The quantitative evaluation of the delay prediction task was performed from two
perspectives: prediction models and explainability methods.
Performance Comparison of RF and XGBoost Models for TOT Delay
Prediction. Here, the models were trained using flight data from May to August
2019 (5.9M instances) and tested on flight data from September to October 2019
(1.7M instances) to predict the TOT delay. Mean absolute error (MAE) values
were used to assess the performance of the predictors. Both models performed
better than previous similar work [6] in terms of MAE when predicting the take-
off delay on the whole test set. To assess performance in more detail, the dataset
was sliced based on the time remaining in minutes until the EOBT. Table 2 lists
the chunks with the prediction performance of existing ETFMS, GBDT [6], RF,
and XGBoost. The table shows that RF and XGBoost performed better than
previous models for all intervals, with XGBoost outperforming RF.
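A sketch of this sliced evaluation is given below, assuming a pandas test frame with an illustrative minutes-to-EOBT column and arrays of true and predicted delays; `test`, `y_test` and `y_pred` are placeholders.

```python
# A sketch of the sliced MAE evaluation behind Table 2; `test`, `y_test`
# and `y_pred` are placeholders, and `mins_to_eobt` is an assumed column.
import pandas as pd

bins = [0, 15, 30, 60, 90, 120, 180, 240, 360]
chunk = pd.cut(test["mins_to_eobt"], bins=bins)   # (0,15], (15,30], ...

abs_err = (y_test - y_pred).abs()   # y_test as a pandas Series
mae_per_chunk = abs_err.groupby(chunk).mean()
print(mae_per_chunk)                # one MAE per time-to-EOBT interval
```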
Table 2. Comparison of performances on take-off delay prediction from ETFMS, GBDT, RF and XGBoost using the MAE (in minutes); lower is better, and the lowest values are highlighted. The MAE values for the ETFMS and GBDT are taken as a reference from the experimentation performed by Dalmau et al. [6].

Time to EOBT    ETFMS [6]   GBDT [6]   RF      XGBoost
(0, 15]         10.7        8.8        7.22    7.51
(15, 30]        12.4        10.2       8.75    8.47
(30, 60]        13.3        10.5       9.05    9.05
(60, 90]        14.3        10.8       9.46    9.12
(90, 120]       14.3        11.1       9.83    9.50
(120, 180]      19.1        13.5       11.09   10.50
(180, 240]      23.0        15.4       11.58   11.46
(240, 360]      21.2        15.1       11.93   12.02

Interpretation and Explanation of XGBoost Model's Decision-Making Process Using LIME and SHAP. Based on the outcome above, XGBoost outperforms the RF model, so the explanation models were applied to the predictions of XGBoost only. As described in Sect. 2, LIME and SHAP were used to explain the predictions on the flight TOT delay made by the XGBoost model. These tools try to mimic the predictions of the trained model and determine the important features to explain each prediction. To evaluate the similarity between the predictions of the explanation tools and XGBoost, local accuracy was considered as MAE in minutes. The conditions of comparison for LIME
and SHAP are consistent as both generate prediction model-centric explana-
tions. Therefore, the results of their comparison are summarised in Table3for
all instances, top 100k instances, and top 67k instances with the most accurate
predictions, respectively. However, DALEX provides a user-centric explanation
that is subjective and varies between users, unlike LIME and SHAP. Due to
this difference in the characteristics of the explanations, DALEX was excluded
from the quantitative evaluation. Finally, based on the comparison, SHAP was used to generate visualisations explaining the flight TOT delay predictions. Figure 2c illustrates an example of a visualisation explaining a single instance of flight TOT delay prediction.
Table 3. Comparison of local accuracy and nDCG values for SHAP and LIME. For local accuracy, lower is better; for nDCG, higher is better. The best values are highlighted.

Instances   SHAP                          LIME
            Local Accuracy    nDCG        Local Accuracy    nDCG
All         3.3 × 10⁻⁶        0.806       8.62              0.882
100k        1.1 × 10⁻⁶        0.722       4.75              0.847
67k         6.2 × 10⁻⁷        0.717       3.13              0.800
4.2 User Evaluation
The user evaluation of the delay prediction task followed Phase 3 as shown in
Fig. 1.
Post Condition Assessment. The aim was to evaluate the impact of LIME,
SHAP and DALEX on the understanding of the influence of selected features
and their contribution to the final output, as well as their impact on work perfor-
mance. The results showed that LIME and DALEX received the highest positive
feedback for understanding the influence of selected features. Moreover, SHAP
received no negative feedback, resulting in a more balanced condition for the
understanding of the influence of the features on the final delay. In contrast, DALEX was the most effective in conveying the contribution of each feature to the final output. Therefore, we could say that the user-centric selection of the features can positively influence the understanding of each feature's contribution to the final delay value. However, none of the conditions received positive feedback about understanding the selected features' operational relevance. This could mean that even if the influence and contribution of the selected features on the final delay are high (features chosen automatically for SHAP and LIME, and by the users for DALEX), this does not improve the operational relevance of the information received. The rationalization is reflected in
the qualitative interview discussed in the Qualitative Feedback section. Addi-
tionally, the impact on work performance was generally positive for all three
tools, with DALEX being the most useful for operations, confirming the slight
preference expressed by the users in the last understanding item (question 4).
These findings suggest that the user-centric selection of features can improve
understanding of the contribution to the final output value. However, further
efforts are needed to improve the understanding of the operational relevance of
the selected features. The three methods are compared per question (see Fig. 3).
Fig. 3. ATCOs’ response on the level of understanding of the explanation using LIME,
SHAP, and DALEX.
Final Questionnaire. The final questionnaire revealed that the usability of the unit in which information was presented was generally rated positively, with no negative feedback on this first usability item, as shown in Fig. 4. The second
usability item regarding the clarity and understandability of the information
presented also received positive feedback. The question about the impact on
work performance showed that only 22% of participants reported negative feed-
back about the usefulness of a delay prediction tool in optimising the use of a
runway, regardless of their condition.
Fig. 4. Overall Delay Prediction outcomes.
Qualitative Feedback. The open-ended interview questions revealed that most ATCOs found the validation exercise clear and understandable, although some users found certain details and features unclear. The tool has potential for several operational tasks, such as generating the best sequence of departures and optimizing runway usage, optimizing Target Start-up Approval Time (TSAT), Air Traffic Flow and Capacity Management (ATFCM) delay, and airport strategic planning. However, the representative from ANACNA pointed out that the DALEX condition needs improvements to reduce the number of variables involved in selecting features. Also, an important finding was that basic training on AI is required for ATCOs to understand such AI tools, enabling better communication with the end user.
Overall, these results highlight the potential and limitations of XAI tools in
the aviation industry and can inform the development of future AI tools for air
traffic management. Extensive details on user validation can be found in the project's deliverable 6.2³.
³ https://doi.org/10.5281/zenodo.7486982.
5 Conclusions
This study investigates ML and post hoc XAI methods to provide explanations
of TOT delay prediction to the ATCOs, and presents a comparative analy-
sis regarding usability and acceptance of the explanation offered by these XAI
methods in the ATM domain. Here, we have compared the XGBoost and RF
models as prediction models, along with their explanations provided by LIME,
SHAP, and DALEX. The results indicate that XGBoost outperforms RF regard-
ing MAE values for predicting TOT delays. Besides, XGBoost is more scalable,
faster, and better at optimizing errors than RF and GBDT from an algorith-
mic perspective. In addition, the study finds that SHAP is more effective than
LIME in explaining the predictions of the XGBoost model. SHAP presents bet-
ter results when considering the nDCG values. Overall, the study highlights the
advantages of XGBoost as an ML model and SHAP as an explainability method
for predicting TOT delays in the ATM domain.
In terms of user evaluation, three XAI tools (SHAP, LIME, and DALEX)
were evaluated for explaining the decisions of the XGBoost model to human
operators. The results indicate that DALEX is more usable and preferred by
human operators than SHAP and LIME. Feedback from the ATCOs showed that
the information presented by the tool was clear and understandable, attributed
to the videos and narration, which communicated the content intelligibly. How-
ever, the impact of the features on the estimated delay and take-off time was
not self-explanatory for all ATCOs, and the level of clarity was an issue due
to the synchronisation of video, text, and colour. Most ATCOs suggested that
the information provided by the tool would allow them to generate the best
sequence of departures, optimising runway usage, increasing the airside capacity
of the airport, and reducing runway occupation time.
In conclusion, the study demonstrates the potential of XAI in improving
the transparency and interpretability of machine learning models in the ATM
domain, particularly in predicting flight TOT delays. The DALEX tool was found
to be more effective and usable for explaining the decisions of the XGBoost model
to human operators. Further research is needed to identify the most relevant
features that should be shown event by event to the ATCOs.
Acknowledgements. This work was financed by the European Union’s Horizon 2020
within the framework SESAR 2020 research and innovation program under grant agree-
ment N. 894238, project Transparent Artificial Intelligence and Automation to Air
Traffic Management Systems (ARTIMATION) and BrainSafeDrive, co-funded by the
Vetenskapsrådet (The Swedish Research Council) and the Ministero dell'Istruzione, dell'Università e della Ricerca della Repubblica Italiana under the Italy-Sweden Cooperation Program.
References
1. Bardach, M., Gringinger, E., Schrefl, M., Schuetz, C.G.: Predicting flight delay
risk using a random forest classifier based on air traffic scenarios and environmen-
tal conditions. In: 2020 AIAA/IEEE 39th Digital Avionics Systems Conference
(DASC), pp. 1–8. IEEE (2020)
2. Biecek, P.: DALEX: explainers for complex predictive models in R. J. Mach. Learn.
Res. 19(1), 3245–3249 (2018)
3. Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
4. Busa-Fekete, R., Szarvas, G., Eltető, T., Kégl, B.: An apple-to-apple comparison of
learning-to-rank algorithms in terms of normalized discounted cumulative gain. In:
ECAI 2012-20th European Conference on Artificial Intelligence: Preference Learn-
ing: Problems and Applications in AI Workshop, vol. 242. IOS Press (2012)
5. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 785–794 (2016)
6. Dalmau, R., Ballerini, F., Naessens, H., Belkoura, S., Wangnick, S.: An explainable
machine learning approach to improve take-off time predictions. J. Air Transp.
Manag. 95, 102090 (2021)
7. Dalmau Codina, R., Belkoura, S., Naessens, H., Ballerini, F., Wangnick, S.: Improving the predictability of take-off times with machine learning: a case study for the Maastricht upper area control centre area of responsibility. In: Proceedings of the
9th SESAR Innovation Days, pp. 1–8 (2019)
8. Degas, A., et al.: A survey on artificial intelligence (AI) and explainable AI in air
traffic management: current trends and development with future research trajec-
tory. Appl. Sci. 12(3), 1295 (2022)
9. Guo, Z., et al.: SGDAN-a spatio-temporal graph dual-attention neural network for
quantified flight delay prediction. Sensors 20(22), 6433 (2020)
10. Hoffman, R.R., Mueller, S.T., Klein, G., Litman, J.: Metrics for explainable AI:
challenges and prospects. arXiv abs/1812.04608 (2018)
11. Islam, M.R., Ahmed, M.U., Barua, S., Begum, S.: A systematic review of explain-
able artificial intelligence in terms of different application domains and tasks. Appl.
Sci. 12(3), 1353 (2022)
12. Joshi, A., Kale, S., Chandel, S., Pal, D.K.: Likert scale: explored and explained.
Br. J. Appl. Sci. Technol. 7(4), 396 (2015)
13. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions.
In: Advances in Neural Information Processing Systems, vol. 30 (2017)
14. Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?" Explaining the
predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
15. Sanaei, R., Pinto, B.A., Gollnick, V.: Toward ATM resiliency: a deep CNN to
predict number of delayed flights and ATFM delay. Aerospace 8(2), 28 (2021)
16. Shapley, L.S.: A value for n-person games. In: Classics in Game Theory, vol. 69
(1997)
17. Wang, Y., Wang, L., Li, Y., He, D., Liu, T.Y.: A theoretical analysis of NDCG type
ranking measures. In: Conference on Learning Theory, pp. 25–54. PMLR (2013)
18. Yu, B., Guo, Z., Asian, S., Wang, H., Chen, G.: Flight delay prediction for commer-
cial air transport: a deep learning approach. Transp. Res. Part E Logist. Transp.
Rev. 125, 203–221 (2019)