Figure 1 - uploaded by Arash Khoda Bakhshi
Feature selection based on Corrected Impurity Importance (CII).


Source publication
Article
This study bridges the gap between Real-Time Risk Assessment (RTRA) and its practical implications by following the post-hoc interpretability approach and utilizing black-box graphical tools for safety data visualization. The real-time traffic-related crash contributing factors were detected using the matched-case control design on 402-miles Inters...

Contexts in source publication

Context 1
... in addition to MDA and MDI, CII was also investigated to measure the unbiased importance of variables in RF (Wright and Ziegler 2015; Nembrini, König, and Wright 2018; Wright et al. 2019). Figure 1 depicts the result of CII for FS, where the features corresponding to green points must be kept in the model, and the features corresponding to red points must be removed from the model. Furthermore, to deal with nonlinear predictors, GAM can provide a more effective tool in comparison with GLM and GNM (Jones and Almond 1992; Jones and Wrigley 1995). ...
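The idea behind correcting impurity importance can be illustrated with a minimal Python sketch. This is a simplified shadow-feature analogue, not the implementation used in the source publication: it appends a row-permuted copy of the predictors, fits a scikit-learn random forest, and subtracts the impurity importance (MDI) of each permuted copy from that of the original feature. The function name `corrected_impurity_importance`, the synthetic data, and the rule "keep features with positive corrected importance" are all assumptions made for illustration; the rule behind the green and red points in Figure 1 may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def corrected_impurity_importance(X, y, n_trees=200, seed=0):
    """Simplified analogue of corrected impurity importance (hypothetical helper).

    Each feature's MDI is reduced by the MDI earned by a permuted "shadow"
    copy of the same feature, which can only pick up spurious splits.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # One shared row permutation breaks the link to y for every shadow column.
    X_shadow = X[rng.permutation(X.shape[0])]
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(np.hstack([X, X_shadow]), y)
    mdi = forest.feature_importances_
    # First p entries belong to the real features, last p to their shadows.
    return mdi[:p] - mdi[p:]


# Synthetic example (assumed data): only feature 0 carries signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

cii_demo = corrected_impurity_importance(X_demo, y_demo)
keep_demo = cii_demo > 0  # assumed selection rule: keep positive scores
print(cii_demo, keep_demo)
```

Uninformative features should score near zero and can come out slightly negative, which is what makes a sign-based or threshold-based selection possible.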
Context 2
... there is no promising value for K since the most appropriate K depends on the distribution of the predictor of interest (Goldstein et al. 2015; Apley 2016). In Figure 10, HSL_SpMean and LSL_SpMean are nominated to illustrate the effect of the number of intervals in 1D-ALE and 2D-ALE. ...
Context 3
... issue does not pose a notable problem in 1D-ALE plots, since the overall effect of a predictor on the response variable can be similarly observed through a range of K. However, for 2D-ALE plots, different levels of K ...
Figure 10. ALE dependency on the number of intervals (K) in 1D and 2D plots.
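The role of the number of intervals K in a 1D-ALE curve can be made concrete with a small NumPy sketch. It is a minimal illustration under stated assumptions, not the code of the source publication: the helper name `ale_1d`, the quantile-based interval edges, and the toy linear model are all hypothetical choices. For each interval, the sketch averages the change in prediction when the feature is moved from the lower to the upper edge, accumulates these local effects, and centers the curve.

```python
import numpy as np


def ale_1d(predict, X, feature, k=10):
    """First-order ALE of one feature using k quantile-based intervals."""
    x = X[:, feature]
    # Quantile edges give each interval roughly the same number of rows.
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, k + 1)))
    if len(edges) < 2:
        raise ValueError("feature is constant; ALE is undefined")
    # Interval index in 1..len(edges)-1 for every row.
    idx = np.clip(np.searchsorted(edges, x, side="left"), 1, len(edges) - 1)
    local = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for j in range(1, len(edges)):
        rows = X[idx == j]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, feature] = edges[j - 1]
        hi[:, feature] = edges[j]
        # Local effect: mean prediction change across the interval.
        local[j - 1] = np.mean(predict(hi) - predict(lo))
        counts[j - 1] = len(rows)
    ale = np.cumsum(local)  # accumulate the local effects
    ale -= np.sum(ale * counts) / np.sum(counts)  # center on the data
    return edges[1:], ale


# Toy example (assumed model): the true effect of feature 0 has slope 2.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(500, 2))
linear_model = lambda X: 2.0 * X[:, 0] + X[:, 1]
grid_demo, ale_demo = ale_1d(linear_model, X_demo, feature=0, k=10)
print(np.column_stack([grid_demo, ale_demo]))
```

Calling `ale_1d` with several values of `k` on the same model is one way to check how sensitive the curve is to the interval count.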

Citations

... Unlike Partial Dependence Plots (PDPs), the ALE effectively addresses challenges associated with high-dimensionality and inter-feature correlations, which are common in complex datasets. By isolating the local effects of features and aggregating them across the dataset, the ALE provides a more accurate and reliable interpretation of feature contributions, making it particularly useful in scenarios where feature interactions are significant [43]. The ALE is defined by the following equation: ...
Article
The transportation sector is a major contributor to carbon dioxide (CO2) emissions in Canada, making the accurate forecasting of CO2 emissions critical as part of the global push toward carbon neutrality. This study employs interpretable machine learning techniques to predict vehicle CO2 emissions in Canada from 1995 to 2022. Algorithms including K-Nearest Neighbors, Support Vector Regression, Gradient Boosting Machine, Decision Tree, Random Forest, and Lasso Regression were utilized. The Gradient Boosting Machine delivered the best performance, achieving the highest R-squared value (0.9973) and the lowest Root Mean Squared Error (3.3633). To enhance the model interpretability, the SHapley Additive exPlanations (SHAP) and Accumulated Local Effects methods were used to identify key contributing factors, including fuel consumption (city/highway), ethanol (E85), and diesel. These findings provide critical insights for policymakers, underscoring the need for promoting renewable energy, tightening fuel emission standards, and decoupling carbon emissions from economic growth to foster sustainable development. This study contributes to broader discussions on achieving carbon neutrality and the necessary transformations within the transportation sector.
... These predictors describe soil by different attributes that are directly associated with the response variables, such as soil type, pedoclimatic characteristics, or hydro-geochemical properties (Figure S2). The main effect of each predictor in this group on the response variable was assessed using first-order ALE due to its robustness against potential dependency among predictors and its lower computational cost [66,78] (Figure 4). The first-order ALE of each predictor class is represented by a bar in Figure 4. ...
Article
Soil organic matter (SOM) and the ratio of soil organic carbon to total nitrogen (C/N ratio) are fundamental to the ecosystem services provided by soils. Therefore, understanding the spatial distribution and relationships between the SOM components mineral-associated organic matter (MAOM), particulate organic matter (POM), and C/N ratio is crucial. Three ensemble machine learning models were trained to obtain spatial predictions of the C/N ratio, MAOM, and POM in German agricultural topsoil (0–10 cm). Parameter optimization and model evaluation were performed using nested cross-validation. Additionally, a modification to the regressor chain was applied to capture and interpret the interactions among the C/N ratio, MAOM, and POM. The ensemble models yielded mean absolute percent errors (MAPEs) of 8.2% for the C/N ratio, 14.8% for MAOM, and 28.6% for POM. Soil type, pedo-climatic region, hydrological unit, and soilscapes were found to explain 75% of the variance in MAOM and POM, and 50% in the C/N ratio. The modified regressor chain indicated a nonlinear relationship between the C/N ratio and SOM due to the different decomposition rates of SOM as a result of variety in its nutrient quality. These spatial predictions enhance the understanding of soil properties’ distribution in Germany.
... The use of technology in education opens up opportunities for distance learning, online courses, interactive digital content, and the use of learning tools such as simulations and visualization models (Khoda Bakhshi & Ahmed, 2021) that help enrich students' learning experience. It also provides the ability to measure and analyze (Fattahi et al., 2020) student progress more accurately, allowing educators to tailor their methods based on individual needs. ...
... The benefits of Padlet facilitate collaboration between individuals or groups by allowing them to contribute to the same idea board online. Padlet helps organize (Khoda Bakhshi & Ahmed, 2021) content in a visual way, allowing users to see and understand information more clearly. Padlets can be accessed from multiple devices and locations, allowing easy access and contribution from anywhere. ...
Article
This study aims to explore the effectiveness of using the Padlet application to improve writing skills through self-study. Self-directed learning methods are becoming increasingly important in today's educational landscape, with technology acting as a key driver. This study engages students in educational settings to apply independent learning using padlets in developing their writing skills. This research was conducted through a qualitative approach with a classroom action research design. Data collection was carried out through observation, interviews, and document analysis of the activities carried out by students in the learning environment. The results of the study show that the use of the Padlet application is effective in stimulating student learning independence and active participation in the development of writing skills. This app provides an interactive platform that facilitates collaboration, reflection, and feedback between students and other teachers. The results of this study have important implications for the design of learning strategies that encourage student independence and develop writing skills. In addition, this research also highlights the role of technology in creating innovative learning environments and supporting student academic growth. Thus, the application of the Padlet application in independent learning can be considered as the right solution to improve students' writing skills
... These averages are finally connected, and the overall curve is generated. The predictions are then centered by subtracting the mean value from all other values (Galkin et al., 2018; Grace-Martin, 2011; Khoda Bakhshi and Ahmed, 2021). A key advantage of this approach is that, similar to the caret implementation of variable importance analysis, ALE is also model-agnostic when applied using the iml package in R (Molnar and Schratz, 2020). ...
... Most of the ML algorithms such as RF, AdaBoost, XGBoost, and SVM are "blackbox" models which need blackbox visualization tools to unveil their internal workings (Li et al., 2020, 2008). There are many blackbox visualization tools applied in road safety studies including partial dependence plot (PDP), Individual conditional Expectation (ICE), Centered ICE, and Accumulated Local effect (ACE) (Afshar et al., 2022; Bakhshi and Ahmed, 2021). The utilization of model explainability techniques helps in building the trust of the developed ML model for deployment in practical grounds. ...
Article
Speeding is one of the most common aberrant driving behaviors among the driving population. Although research on speeding behavior among drivers has been increased over the decades, little is known about the motivating factors associated with speeding behavior among Long-Haul Truck Drivers (LHTDs), especially in developing nations like India. This study aims to develop a prediction model for speeding behavior and to identify the contributory factors and their influential patterns underlying speeding behavior among LHTDs in India. A cross-sectional study was conducted among LHTDs in Salem city, Tamil Nadu, India. The data were collected through face-to-face interviews using a questionnaire encompassing socio-demographic, work, vehicle, health-related lifestyle, and speeding-related characteristics. A total of 756 valid samples were collected and utilized for analysis purposes. While conventional statistical method like binary logit technique lacked prediction capabilities, machine learning algorithms including Decision tree (DT), Random Forest (RF), Adaptive Boosting (AdaBoost), and Extreme gradient boosting (XGBoost) were employed to model speeding behavior among LHTDs. The analysis results showed that RF demonstrated superior performance in predicting speeding behavior over other competing algorithms with accuracy (0.80), F1 score (0.77), and AUROC (0.81). From the befitting RF model, the importance of factors contributing to speeding behavior among LHTDs were determined through the variable importance plot. Pressured delivery of goods, sleeping duration per day, age of truck, size of truck, monthly income, driving experience, driving duration per day, and age of the driver were identified as the eight topmost critical factors contributing to speeding behavior among LHTDs. Based on the developed RF model, the hidden relationships behind identified critical factors in relation to the speeding behavior were investigated using partial dependence plots (PDPs). 
The outcomes of this research will be useful for road safety authorities and Indian trucking industries to frame suitable policies and to introduce effective strategies for mitigating speeding behavior among LHTDs to promote road safety.
... Although the contribution of each feature predicting fatigue driving among LHTDs in the study sample has been identified using variable importance plot, it does not render the trend of relationship between predictor variables and the outcome feature. There are many blackbox visualization tools utilized in prior studies in transportation domain such as partial dependence plot (PDP), Individual conditional Expectation (ICE), Centered ICE, and Accumulated Local effect (ACE) which can provide interpretability and transparency on internal working of the model (Ding et al., 2018; Bakhshi and Ahmed, 2021; Afshar et al., 2022). Of all, PDP, as proposed by Friedman (2001), has been widely used in many studies to provide an intuitive explanation of how features influence model predictions or model performance (Kidando et al., 2021; Komol et al., 2021). ...
Article
Introduction: Long-haul Truck drivers (LHTDs) have long working hours, insufficient rest, and poor health conditions and often experience fatigue that substantially may lead to crashes and injuries. Despite its potential harmfulness, we have little understanding on non-linear hidden patterns of influential factors on fatigue driving, especially in developing nations like India. Objectives: This paper aimed to predict fatigue driving among LHTDs using four tree-based machine learning techniques including Decision tree (DT), Random Forest (RF), Adaptive boosting (AdaBoost), and Extreme gradient boosting (XGBoost) to analyze the non-linear hidden pattern of most influential variables contributing to fatigue driving among Indian LHTDs. Methods: Using a cross-sectional study design, a face-to-face interview was conducted among LHTDs using a carefully designed questionnaire in Salem, Tamil Nadu, India. A total of 756 responses were obtained from LHTDs based on four aspects including socioeconomic characteristics, work and vehicle characteristics, health-related lifestyle characteristics, and fatigue-inducing characteristics using a convenience sampling technique. Four tree-based machine learning algorithms, namely DT, RF, AdaBoost, and XGBoost, were employed to predict fatigue driving. The influence of predictors on fatigue driving for the most suitable model was determined through variable importance plot and their causality effect on probability of fatigue were examined using Partial dependency plot (PDP). All analyses were carried out using IBM SPSS statistics Version 27.0 and R programming language version 4.2.1. Results: From the analysis, it was found that RF model outperformed other investigated classifiers (Accuracy = 81.2%, F1 score = 58.82%, AUROC = 0.854).
Furthermore, variable importance plot of befitting RF classifier showed that type of commodity carried, pressured delivery of goods, countermeasure mostly followed, and education level of the LHTD as some of most influential predictors causing fatigue. Conclusion: These findings provide insights to state and highway transportation officials and Indian trucking industries for framing effective strategies to promote safety and well-being among LHTDs.
... On the other hand, LIME focuses on the local faithfulness of a model. Although LIME [19] has the desirable property of additivity [18], it has weaknesses regarding the lack of consistency [16], missingness [17], and stability [20,27]. SHAP fulfils these and hence is commonly used. ...
Preprint
Explainability of AI models is an important topic that can have a significant impact in all domains and applications from autonomous driving to healthcare. The existing approaches to explainable AI (XAI) are mainly limited to simple machine learning algorithms, and the research regarding the explainability-accuracy tradeoff is still in its infancy especially when we are concerned about complex machine learning techniques like neural networks and deep learning (DL). In this work, we introduce a new approach for complex models based on the co-relation impact which enhances the explainability considerably while also ensuring the accuracy at a high level. We propose approaches for both scenarios of independent features and dependent features. In addition, we study the uncertainty associated with features and output. Furthermore, we provide an upper bound of the computation complexity of our proposed approach for the dependent features. The complexity bound depends on the order of logarithmic of the number of observations which provides a reliable result considering the higher dimension of dependent feature space with a smaller number of observations.
... Meanwhile, local explanation is achieved by SHAP in Barredo-Arrieta et al. (2019) for traffic flow prediction, Veran et al. (2020) for crash prediction, and Kalatian and Farooq (2021) for pedestrians' wait time prediction. Other common methods for local explanation, such as partial dependence plot (PDP), individual conditional expectation (ICE), and accumulated local effect (ALE) were used in Khoda Bakhshi and Ahmed (2021) for road crash probability prediction. ...
Article
Port state control is the safeguard of maritime transport achieved by inspecting foreign visiting ships and supervising them to rectify the non-compliances detected. One key issue faced by port authorities is to identify ships of higher risk accurately. This study aims to address the ship selection issue by first developing two data-driven ship risk prediction frameworks using features the same as or derived from the current ship selection scheme. Both frameworks are empirically shown to be more efficient than the current ship selection method. Like existing ship risk prediction models, the proposed frameworks are of black-box nature whose working mechanism is opaque. To improve model explainability, local explanation of the prediction of individual ships by the Shapley additive explanations (SHAP) is provided. Furthermore, we innovatively extend the local SHAP model to a near linear-form global surrogate model which is fully-explainable. This demonstrates that the behavior of black-box data-driven models can be as interpretable as white-box models while retaining their prediction accuracy. Numerical experiments demonstrate that the white-box global surrogate models can accurately show the behavior of the original black-box models, shedding light on model validation, fairness verification, and prediction explanation. This study makes the very first attempt in the maritime transport area to quantitatively explain the rationale of black-box prediction models from both local and global perspectives, which facilitates the application of data-driven models and promotes the digital transformation of the traditional shipping industry.