
# Caret: Classification and regression training


## No full-text available

... • The lower part of Entamoeba histolytica glycolysis ( Figure 3A), one of the major metabolic pathways of the parasite (Moreno-Sánchez et al., 2008;Muller et al., 2012;Pineda et al., 2015), through the use of a recently developed model (Lo-Thong et al., 2020); • The peroxide detoxification pathway of Trypanosoma cruzi ( Figure 3B) (González-Chávez et al., 2015, 2019; ...
... To model the metabolic pathway, different machine learning models are developed on RStudio (Version 1.2.5001), with the help of Classification And Regression Training (caret, Version 6.0-86) (Kuhn, 2020). ...
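The excerpt above fits several linear and non-linear regression models through caret's unified `train()` interface. A minimal sketch of that pattern, using invented pathway-like data and the stock "glm" and "rf" methods as illustrative stand-ins for the Bayesian GLM and QRF models compared in the paper (none of this is the authors' code):

```r
library(caret)

set.seed(42)
# Hypothetical data: three enzyme activities predicting a pathway flux
df <- data.frame(e1 = runif(200), e2 = runif(200), e3 = runif(200))
df$flux <- with(df, e1 * e2 + sqrt(e3) + rnorm(200, sd = 0.05))

ctrl <- trainControl(method = "cv", number = 5)

fit_lin <- train(flux ~ ., data = df, method = "glm", trControl = ctrl)
fit_rf  <- train(flux ~ ., data = df, method = "rf",  trControl = ctrl)

# Cross-validated RMSE of each model; on this deliberately non-linear
# response the non-linear model should come out ahead
c(linear = min(fit_lin$results$RMSE), forest = min(fit_rf$results$RMSE))
```

Swapping `method` strings (e.g. to "bayesglm" or "qrf", given the backing packages) is all it takes to compare further learners under the same resampling scheme.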
... The nonlinear aspect of the peroxide detoxification pathway is certainly not negligible, since the average coefficient, when all enzyme activities are varied, is lower than 0.6. These results support those obtained by González-Chávez et al. (2015, 2019), which demonstrate that TXN and TXNPx exert the greatest control on the pathway's flux, while TryR exerts very little control on the flux. ...
Article
Full-text available
The use of machine learning (ML) in life sciences has gained wide interest over the past years, as it speeds up the development of high-performing models. Important modeling tools in biology have proven their worth for pathway design, such as mechanistic models and metabolic networks, as they allow better understanding of mechanisms involved in the functioning of organisms. However, little has been done on the use of ML to model metabolic pathways, and the degree of non-linearity associated with them is not clear. Here, we report the construction of different metabolic pathways with several linear and non-linear ML models. Different types of data are used; they lead to the prediction of important biological data, such as pathway flux and final product concentration. A comparison reveals that the data features impact model performance and highlight the effectiveness of non-linear models (e.g., QRF: RMSE = 0.021 nmol·min−1 and R2 = 1 vs. Bayesian GLM: RMSE = 1.379 nmol·min−1 and R2 = 0.823). It turns out that the greater the degree of non-linearity of the pathway, the better suited a non-linear model will be. Therefore, a decision-making support for pathway modeling is established. These findings generally support the hypothesis that non-linear aspects predominate within the metabolic pathways. This must be taken into account when devising possible applications of these pathways for the identification of biomarkers of diseases (e.g., infections, cancer, neurodegenerative diseases) or the optimization of industrial production processes.
... The classification models that we chose are: "Artificial Neural Network (ANN)" [22], "Random Forest" [23] and "Adaboost". To run these models, we use the "neuralnet" [24], "caret" [25], "dplyr" [26] and "fastAdaboost" [27] packages. The obtained results are shared in Table II. ...
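For the classification setting sketched in this excerpt, caret wraps model-specific packages behind one interface. A hedged, self-contained illustration with synthetic data, where `method = "nnet"` stands in for the ANN and the tuning grid is invented for the example:

```r
library(caret)

set.seed(1)
df <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
df$class <- factor(ifelse(df$x1 + df$x2 + rnorm(150, sd = 0.3) > 0, "yes", "no"))

# A small feed-forward network, tuned over size/decay with 5-fold CV
fit <- train(class ~ ., data = df, method = "nnet",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = expand.grid(size = c(2, 4), decay = c(0, 0.1)),
             trace = FALSE)
fit$bestTune  # the winning (size, decay) pair
```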
Article
Full-text available
Handling datasets nowadays has become a crucial task, since today's world is heavily dependent on data information. However, many data tend to be big and contain redundancy which makes them difficult to deal with. Due to that, data pre-processing became almost necessary before using any data, and one of the main tasks in data pre-processing is dimensionality reduction. In this paper we propose a new approach for dimensionality reduction using feature selection method based on bivariate copulas. This approach is a direct application of copulas to describe and model the inter-correlation between any two dimensions - bivariate analysis. The study will first show how we use the bivariate method to detect redundant dimensions and eliminate them, and then compare the quality of the results against most-known selection methods in term of accuracy, using statistical precision and classification models.
... All the computational processes were performed in R (R Core Team, 2021) with the following packages: "caret" (Kuhn, 2020), "neuralnet" (Fritsch, Guenther, & Wright, 2019), "pls" (Liland, Mevik, & Wehrens, 2021), "ggplot2" (Wickham, 2016), "ggpubr" (Kassambara, 2020), "tidyverse" (Wickham et al., 2019), "ggrepel" (Slowikowski, 2021), "factoextra" (Kassambara & Mundt, 2020), "FactoMineR" (Le, Josse, & Husson, 2008) and "hyperSpec" (Beleites & Sergo, 2020). SL-AI was also implemented in R (R Core Team, 2021). ...
Article
This paper focuses on predicting predawn leaf water potential through a self-learning artificial intelligence (SL-AI) algorithm, a novel spectral processing algorithm that is based on the search for covariance modes, providing a direct relationship between spectral information and plant constituents. The SL-AI algorithm was applied in a dataset containing 847 observations obtained with a handheld hyperspectral spectroradiometer (400–1010 nm), structured as: three grapevine cultivars (Touriga Nacional, Touriga Franca and Tinta Barroca), collected in three years (2014, 2015 and 2017), in two test sites in the renowned Douro Wine Region, northeast of Portugal. The Ψpd SL-AI quantification was tested both in regressive (R2 = 0.97, MAPE = 18.30%) and classification (three classes; overall accuracy = 86.27%) approaches, where the radiation absorption spectrum zones of the chlorophylls, xanthophyll and water were identified along the vegetative growth cycle. The dataset was also tested with Artificial Neural Networks with Principal Component Analysis (ANN-PCA) and Partial Least Square (PLS), which presented worse performance when compared to SL-AI in the regressive (ANN-PCA - R2 = 0.85, MAPE = 43.64%; PLS - R2 = 0.94, MAPE = 28.76%) and classification (ANN-PCA - overall accuracy: 72.37%; PLS - overall accuracy: 73.79%) approaches. The Ψpd modelled with SL-AI demonstrated, through hyperspectral reflectance, a cause-effect of the grapevine's hydric status with the absorbance of bands related to chlorophyll, xanthophylls and water. This cause-effect interaction could be explored to identify cultivars and cultural practices, hydric, heating and lighting stresses.
... In addition, several complementing packages may be needed to perform cross-validation of models, hyperparameter tuning, and computation of accuracy metrics, among others. There are some libraries that seek to integrate a wide range of tools needed for machine learning in one place, such as scikit-learn (Pedregosa et al., 2011) in Python; H2O in Java (with both R and Python versions); and caret (Kuhn, 2016), mlr3 (Lang et al., 2019) and tidymodels (Kuhn and Wickham, 2020) in R. All these options have their own philosophy, and they were designed using diverse approaches to implement machine learning models. ...
Article
Full-text available
The adoption of machine learning frameworks in areas beyond computer science have been facilitated by the development of user-friendly software tools that do not require an advanced understanding of computer programming. In this paper, we present a new package (sparse kernel methods, SKM) software developed in R language for implementing six (generalized boosted machines, generalized linear models, support vector machines, random forest, Bayesian regression models and deep neural networks) of the most popular supervised machine learning algorithms with the optional use of sparse kernels. The SKM focuses on user simplicity, as it does not try to include all the available machine learning algorithms, but rather the most important aspects of these six algorithms in an easy-to-understand format. Another relevant contribution of this package is a function for the computation of seven different kernels. These are Linear, Polynomial, Sigmoid, Gaussian, Exponential, Arc-Cosine 1 and Arc-Cosine L (with L = 2, 3, ...) and their sparse versions, which allow users to create kernel machines without modifying the statistical machine learning algorithm. It is important to point out that the main contribution of our package resides in the functionality for the computation of the sparse version of seven basic kernels, which is indispensable for reducing computational resources to implement kernel machine learning methods without a significant loss in prediction performance. Performance of the SKM is evaluated in a genome-based prediction framework using both a maize and wheat data set. As such, the use of this package is not restricted to genome prediction problems, and can be used in many different applications.
... The following ML algorithms, LASSO, SVM, random forest, XGBoost, and neural network, were used to build diagnostic models for gene expression data, using the R packages "glmnet", "e1071", "randomForest", "caret", and "neuralnet". [21][22][23][24][25] ML models with default parameters, as defined in the Scikit-learn library (http://scikit-learn.org/stable/), and ML models after hyperparameter optimization were applied. Furthermore, the AUC, accuracy, precision, recall, specificity, and F1-score were calculated from confusion matrices to evaluate the classification capability of each model. ...
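The accuracy, precision, recall, specificity, and F1 metrics mentioned here can all be read off caret's `confusionMatrix()`. A tiny worked example with made-up predictions (3 TP, 1 FP, 1 FN, 3 TN, so every listed metric comes out to 0.75):

```r
library(caret)

pred <- factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no"))
obs  <- factor(c("yes", "no",  "no", "no", "yes", "yes", "yes", "no"))

cm <- confusionMatrix(data = pred, reference = obs, positive = "yes")
cm$overall["Accuracy"]                                     # 0.75
cm$byClass[c("Precision", "Recall", "Specificity", "F1")]  # all 0.75
```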
Article
Full-text available
Objective: This study aimed to analyze immune-related genes and immune cell components in the peripheral blood of patients with acute myocardial infarction (AMI). Methods: Six datasets were obtained from the GEO repository comprising 88 healthy samples and 215 AMI samples. We performed weighted gene co-expression network analysis (WGCNA) and five machine learning (ML) methods to identify immune-related genes and construct diagnostic models. The CIBERSORT algorithm was adopted to assess the degree of immune infiltration. Finally, RT-PCR, double immunofluorescence, and immunohistochemistry were conducted to analyze the expression levels of the identified featured immune-related genes and their localization in heart tissue of an AMI mouse model. Results: A total of 496 immune-related DEGs were obtained between AMI and normal samples. WGCNA determined the co-expression modules most significantly positively associated with AMI (r=0.41; P<0.001). Among the five ML models, XGBoost had the highest AUC (0.849) and accuracy (0.812) in discriminating patients with AMI from normal in the validation sets. Furthermore, we found that the proportion of chemokine receptor (CCR), macrophages, neutrophils, and Treg cells in the AMI groups was significantly higher than that in the normal groups. In vitro RT-PCR verification revealed that SOCS3, MMP9, and AQP9 expression increased significantly in the AMI mouse model. Among the 22 immune cells, AQP9, MMP9, and SOCS3 displayed the strongest positive correlation with neutrophils. In MI mice, MPO stained strongly along the lateral cardiomyocytes, whereas it was weaker in sham mice. Combined immunofluorescence was observed in the same parts of the cytoplasm of cardiomyocytes in the myocardial infarction area, indicating co-localization of MPO with MMP9 and SOCS3 in these areas, respectively. Conclusion: Immune-related genes and immune cells are intimately related to AMI.
Constructing different ML models based on these biomarkers could be a valuable approach to diagnosing AMI in clinical practice.
... We drew on the classification algorithm implemented in xgboost (Chen et al., 2021). The hyperparameter grid search of the inner cross-validation loop was performed using caret (Kuhn, 2021). Bi- and trigrams were extracted using ngram (Schmidt & Heckendorf, 2017). ...
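A hyperparameter grid search like the one described can be expressed in caret by handing `train()` an explicit `tuneGrid`; for `method = "xgbTree"` the grid must name all seven tuning parameters. A minimal sketch on caret's own simulated data (the grid values are illustrative, not the cited study's):

```r
library(caret)

set.seed(7)
df <- twoClassSim(200)  # simulated two-class data shipped with caret

grid <- expand.grid(nrounds = c(50, 100), max_depth = c(2, 4),
                    eta = 0.1, gamma = 0, colsample_bytree = 0.8,
                    min_child_weight = 1, subsample = 0.8)

fit <- train(Class ~ ., data = df, method = "xgbTree",
             trControl = trainControl(method = "cv", number = 3),
             tuneGrid = grid)
fit$bestTune  # configuration with the best cross-validated accuracy
```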
Article
Full-text available
Early detection of risk of failure on interactive tasks comes with great potential for better understanding how examinees differ in their initial behavior as well as for adaptively tailoring interactive tasks to examinees' competence levels. Drawing on procedures originating in shopper intent prediction on e-commerce platforms, we introduce and showcase a machine learning-based procedure that leverages early-window clickstream data for systematically investigating early predictability of behavioral outcomes on interactive tasks. We derive features related to the occurrence, frequency, sequentiality, and timing of performed actions from early-window clickstreams and use extreme gradient boosting for classification. Multiple measures are suggested to evaluate the quality and utility of early predictions. The procedure is outlined by investigating early predictability of failure on two PIAAC 2012 Problem Solving in Technology Rich Environments (PSTRE) tasks. We investigated early windows of varying size in terms of time and in terms of actions. We achieved good prediction performance at stages where examinees had, on average, at least two thirds of their solution process ahead of them, and the vast majority of examinees who failed could potentially be detected to be at risk before completing the task. In-depth analyses revealed different features to be indicative of success and failure at different stages of the solution process, thereby highlighting the potential of the applied procedure for gaining a finer-grained understanding of the trajectories of behavioral patterns on interactive tasks.
... Model parameters mtry (i.e., the number of variables to sample at each split) and ntree (i.e., the number of trees, aiming for the minimum number that stabilizes the error) were determined using model optimization procedures in the 'caret' package from R (Kuhn, 2020), and set to 11 and 300 respectively for all fine segmentation models. Variable importance for each RF model was determined using the mean decrease in accuracy. ...
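The mtry/ntree optimization plus mean-decrease-in-accuracy importance described above maps onto caret roughly as follows (synthetic data; the values 11 and 300 echo the excerpt, everything else is illustrative):

```r
library(caret)

set.seed(3)
df <- twoClassSim(300, linearVars = 10)  # simulated two-class data from caret

fit <- train(Class ~ ., data = df, method = "rf",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = expand.grid(mtry = c(2, 6, 11)),  # candidate mtry values
             ntree = 300,            # passed through to randomForest()
             importance = TRUE)      # enable permutation-based importance

fit$bestTune$mtry          # cross-validated choice of mtry
vi <- varImp(fit, type = 1)  # type 1 = mean decrease in accuracy
head(vi$importance)
```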
Article
Full-text available
Accurate maps of biological communities are essential for monitoring and managing marine protected areas but more information on the most effective methods for developing these maps is needed. In this study, we use Wilsons Promontory Marine National Park in southeast Australia as a case study to determine the best combination of variables and scales for producing accurate habitat maps across the site. Wilsons Promontory has full multibeam echosounder (MBES) coverage coupled with towed video, remotely operated underwater vehicle (ROV) and drop video observations. Our study used an image segmentation approach incorporating MBES backscatter angular response curve and bathymetry derivatives to identify benthic community types using a hierarchical habitat classification scheme. The angular response curve data were extracted from MBES data using two different methods: 1) angular range analysis (ARA) and 2) backscatter angular response (AR). Habitat distributions were predicted using a supervised Random Forest approach combining bathymetry, ARA, and AR derivatives. Variable importance metrics indicated that ARA derivatives, such as grain size, impedance and volume heterogeneity were more important to model performance than AR derivatives mean, skewness, and kurtosis. Additionally, this study investigated the impact of segmentation software settings when creating segmented surfaces and their impact on overall model accuracy. We found using fine scale segmentation resulted in the best model performance. These results indicate the importance of incorporating backscatter derivatives into biological habitat maps and the need to consider scale to increase the accuracy of the outputs to help improve the spatial management of marine environments.
... In order to analyse the datasets using the aforementioned methods, various R packages have been utilised. The package caret [51] implemented in R is used for kNN. The R package kknn [52] is used for weighted kNN, while R library rknn [53] is used for random kNN. ...
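caret fits kNN through the same `train()` call as any other learner, tuning k by cross-validation; centering and scaling matter because kNN is distance-based. A minimal sketch on the built-in iris data (the k grid is invented for the example):

```r
library(caret)

set.seed(11)
fit <- train(Species ~ ., data = iris, method = "knn",
             preProcess = c("center", "scale"),   # kNN is distance-based
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = data.frame(k = seq(3, 15, by = 2)))

fit$bestTune$k             # cross-validated choice of k
max(fit$results$Accuracy)  # accuracy at the chosen k
```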
Preprint
Full-text available
kNN based ensemble methods minimise the effect of outliers by identifying a set of data points in the given feature space that are nearest to an unseen observation in order to predict its response by using majority voting. The ordinary ensembles based on kNN find out the k nearest observations in a region (bounded by a sphere) based on a predefined value of k. This scenario, however, might not work in situations when the test observation follows the pattern of the closest data points with the same class that lie on a certain path not contained in the given sphere. This paper proposes a k nearest neighbour ensemble where the neighbours are determined in k steps. Starting from the first nearest observation of the test point, the algorithm identifies a single observation that is closest to the observation at the previous step. At each base learner in the ensemble, this search is extended to k steps on a random bootstrap sample with a random subset of features selected from the feature space. The final predicted class of the test point is determined by using a majority vote in the predicted classes given by all base models. This new ensemble method is applied on 17 benchmark datasets and compared with other classical methods, including kNN based models, in terms of classification accuracy, kappa and Brier score as performance metrics. Boxplots are also utilised to illustrate the difference in the results given by the proposed and other state-of-the-art methods. The proposed method outperformed the rest of the classical methods in the majority of cases. The paper gives a detailed simulation study for further assessment.
... Following Kern et al. (2019), who emphasize the suitability of tree-based machine learning methods for the analysis of survey data, we apply a Gradient Boosting Machine (GBM) for model building in R (R Core Team, 2021). We use the packages gbm (Greenwell et al., 2020), rsample, caret (Kuhn, 2020), foreign (R Core Team, 2020), plotmo (Milborrow, 2020) and ROCR (Sing et al., 2005). We consider the binary outcome variable (vacation ≥5d yes or no) as a classification problem and use the GBM's machine learning technique to solve it. ...
Article
The COVID-19 pandemic led to global disruptions – especially in tourism. As a result, travel participation decreased. Thus, the proportion of Germans (18–85a) travelling (≥5d) between March and December decreased from 76% in 2019 to 56% in 2020. To better understand who travels during the COVID-19 pandemic or not, we used two population-representative surveys of the German-speaking residential population. Applying a Gradient Boosting Machine, we compared the pandemic year 2020 (n = 5,823) with the non-pandemic year 2019 (n = 7,366). Considering 12 sociodemographic variables in two models, we predict their relative influence on the probability of leisure travel participation in the respective years. The 2019 model shows a relatively high accuracy (71%), whereas the accuracy of the 2020 model decreases to 59%, indicating that the variables used have lost importance. Results show, e.g. that household income and age are the two most important predictors for travel participation. However, their importance reversed due to the pandemic, with age being the most relevant predictor for travel participation during COVID-19. Using Partial Dependence Plots, we compare the direction, impact, and functional form of all variables regarding travel participation for both years – and thus identify who travels during the pandemic.
... As such, it has become an attractive alternative among gradient-boosting implementations and is an increasingly popular tool in the agronomic field, outperforming other machine learning alternatives. XGBoost has been used to predict biomass ... We implemented the 'caret' package (Kuhn 2021) in R software (version 4.1.2; R Core Team 2021) for model training, hyper-parameter tuning and validation. ...
Article
Full-text available
Nitrate (NO3) leaching from agriculture represents the primary source of groundwater contamination and freshwater ecosystem degradation. At the field level, NO3 leaching is highly variable due to interactions among soil, weather and crop management factors, but the relative effects of these drivers have not been quantified on a global scale. Using a global database of 82 field studies in temperate rainfed cereal crops with 961 observations, our objectives were to (a) quantify the relative importance of environmental and management variables to identify key leverage points for NO3 mitigation and (b) determine associated changes in crop productivity and potential tradeoffs for high and low NO3 loss scenarios. Machine learning algorithms (XGBoost) and feature importance analysis showed that the amount and intensity of rainfall explained the most variability in NO3 leaching (up to 24 kg N ha−1), followed by nitrogen (N) fertilizer rate and crop N removal. In contrast, other soil and management variables such as soil texture, crop type, tillage and N source, timing and placement had less importance. To reduce N losses from global agriculture under changing weather and climatic conditions, these results highlight the need for better targeting and increased adoption of science-based, locally adapted management practices for improving N use efficiency. Future policy discussions should support this transition through different instruments while also promoting more advanced weather prediction analytics, especially in areas susceptible to extreme climatic variation.
... Lastly, we use calibration plots and compute two calibration metrics, calibration in the large and calibration slope (Steyerberg, 2019). Calibration plots are produced using the caret add-on package for R (Kuhn, 2022). Confidence intervals are obtained via standard binomial tests and averaged over splits. ...
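caret's `calibration()` produces this kind of plot: it bins held-out predicted probabilities and compares each bin's midpoint with the observed event rate. A sketch with simulated predictions that are well calibrated by construction (all names here are invented):

```r
library(caret)   # attaches lattice, which supplies xyplot()

set.seed(5)
# Simulated held-out set: predicted P(event), with outcomes drawn from
# those probabilities so the "model" is calibrated by construction
prob <- runif(500)
obs  <- factor(ifelse(runif(500) < prob, "event", "none"),
               levels = c("event", "none"))
hold <- data.frame(obs, prob)

cal <- calibration(obs ~ prob, data = hold, class = "event", cuts = 10)
xyplot(cal)  # observed event rate vs. midpoint of each probability bin
```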
Preprint
Full-text available
Ensembles improve prediction performance and allow uncertainty quantification by aggregating predictions from multiple models. In deep ensembling, the individual models are usually black box neural networks, or recently, partially interpretable semi-structured deep transformation models. However, interpretability of the ensemble members is generally lost upon aggregation. This is a crucial drawback of deep ensembles in high-stake decision fields, in which interpretable models are desired. We propose a novel transformation ensemble which aggregates probabilistic predictions with the guarantee to preserve interpretability and yield uniformly better predictions than the ensemble members on average. Transformation ensembles are tailored towards interpretable deep transformation models but are applicable to a wider range of probabilistic neural networks. In experiments on several publicly available data sets, we demonstrate that transformation ensembles perform on par with classical deep ensembles in terms of prediction performance, discrimination, and calibration. In addition, we demonstrate how transformation ensembles quantify both aleatoric and epistemic uncertainty, and produce minimax optimal predictions under certain conditions.
... LDA was conducted via R package "MASS" (Venables and Ripley, 2002). The cross validation approach was used to pick out the best K parameter via R package "caret" (Kuhn, 2015). Based on the best K parameter, R packages "class" (Venables and Ripley, 2002) and "kknn" were used for the KNN method. ...
Article
Full-text available
Background: Adrenocortical carcinoma (ACC) is an orphan tumor with a poor prognosis. There is therefore an urgent need to find candidate prognostic biomarkers and provide clinicians with an accurate method for survival prediction of ACC via bioinformatics and machine learning methods. Methods: Eight different methods including differentially expressed gene (DEG) analysis, weighted correlation network analysis (WGCNA), protein-protein interaction (PPI) network construction, survival analysis, expression level comparison, receiver operating characteristic (ROC) analysis, and decision curve analysis (DCA) were used to identify potential prognostic biomarkers for ACC via seven independent datasets. Linear discriminant analysis (LDA), K-nearest neighbor (KNN), support vector machine (SVM), and time-dependent ROC were performed to further identify meaningful prognostic biomarkers (MPBs). Cox regression analyses were performed to screen factors for nomogram construction. Results: We identified nine hub genes correlated to prognosis of patients with ACC. Furthermore, four MPBs (ASPM, BIRC5, CCNB2, and CDK1) with high accuracy of survival prediction were screened out, which were enriched in the cell cycle. We also found that mutations and copy number variants of these MPBs were associated with overall survival (OS) of ACC patients. Moreover, MPB expressions were associated with immune infiltration level. Two nomograms [OS-nomogram and disease-free survival (DFS)-nomogram] were established, which could provide clinicians with an accurate, quick, and visualized method for survival prediction. Conclusion: Four novel MPBs were identified and two nomograms were constructed, which might constitute a breakthrough in treatment and prognosis prediction of patients with ACC.
... Features with ICCs < 0.75 were excluded, with ICCs ranging from 0.75 to 1 considered "excellent" [18]. The Boruta algorithm [19], corrplot by caret [20] and the least absolute shrinkage and selection operator (LASSO) with tenfold cross-validation [21] were performed in a stepwise manner for dimension reduction. Furthermore, both the CA and HCM groups were randomly divided into a training dataset and a testing dataset (7:3). ...
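A common caret step in correlation-based dimension reduction like this is `findCorrelation()`, which flags features whose pairwise correlation exceeds a cutoff so they can be dropped before LASSO. A minimal sketch (synthetic matrix; the 0.9 cutoff is illustrative):

```r
library(caret)

set.seed(9)
x <- matrix(rnorm(100 * 5), ncol = 5)
x <- cbind(x, x[, 1] + rnorm(100, sd = 0.01))  # column 6 nearly duplicates column 1
colnames(x) <- paste0("f", 1:6)

drop_idx <- findCorrelation(cor(x), cutoff = 0.9)  # indices of redundant features
colnames(x)[drop_idx]      # one of the f1/f6 pair is flagged
x_reduced <- x[, -drop_idx]
ncol(x_reduced)            # 5 features remain
```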
Article
Full-text available
Background: To elucidate the value of texture analysis (TA) in detecting and differentiating myocardial tissue alterations on T2-weighted CMR (cardiovascular magnetic resonance imaging) in patients with cardiac amyloidosis (CA) and hypertrophic cardiomyopathy (HCM). Methods: In this retrospective study, 100 CA (58.5 ± 10.7 years; 41 (41%) females) and 217 HCM (50.7 ± 14.8 years, 101 (46.5%) females) patients who underwent CMR scans were included. Regions of interest for TA were delineated by two radiologists independently on T2-weighted imaging (T2WI). Stepwise dimension reduction and texture feature selection based on reproducibility, machine learning algorithms, and correlation analyses were performed to select features. Both the CA and HCM groups were randomly divided into a training dataset and a testing dataset (7:3). After the TA model was established in the training set, the diagnostic performance of the model was validated in the testing set and further validated in a subgroup of patients with similar hypertrophy. Results: The 7 independent texture features provided, in combination, a diagnostic accuracy of 86.0% (AUC = 0.915; 95% CI 0.879-0.951) in the training dataset and 79.2% (AUC = 0.842; 95% CI 0.759-0.924) in the testing dataset. The differential diagnostic accuracy in the similar hypertrophy subgroup was 82.2% (AUC = 0.864, 95% CI 0.805-0.922). The significance of the difference between the AUCs of the TA model and late gadolinium enhancement (LGE) was verified by DeLong's test (p = 0.898). All seven texture features showed significant differences between CA and HCM (all p < 0.001). Conclusions: Our study demonstrated that texture analysis based on T2-weighted images could feasibly differentiate CA from HCM, even in patients with similar hypertrophy. The selected final texture features could achieve a comparable diagnostic capacity to the quantification of LGE.
Trial registration Since this study is a retrospective observational study and no intervention had been involved, trial registration is waived.
... To perform the steps in the predictive modeling process (Shmueli, 2010), we used the caret package, version 6.0.88 (Kuhn, 2021), for partitioning of the data; selection of important predictors, that is, features; building the prediction models; validation and evaluation of the models; and selection of the optimal model. To avoid imbalance bias, we created stratified, balanced splits of the data for each research group (Meinel et al., 2021). ...
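The stratified, balanced splits mentioned here are what `caret::createDataPartition()` provides: it samples the requested proportion within each class. A small sketch with an invented imbalanced outcome:

```r
library(caret)

set.seed(2024)
y <- factor(rep(c("pass", "fail"), times = c(80, 20)))  # imbalanced outcome

idx <- createDataPartition(y, p = 0.7, list = FALSE)  # stratified on y

prop.table(table(y[idx]))   # class balance preserved in the training split
prop.table(table(y[-idx]))  # and in the held-out split
```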
Thesis
Full-text available
In this PhD thesis, we aimed to improve understanding of the study progression and success of autistic students in higher education by comparing them to students with other disabilities and students without disabilities. We studied their background and enrollment characteristics, whether barriers in progression existed, how and when possible barriers manifested themselves in their student journey, and how institutions should address these issues. We found autistic students to be different from their peers but not worse as expected based on existing findings. We expect we counterbalanced differences because we studied a large data set spanning seven cohorts and performed propensity score weighting. Most characteristics of autistic students at enrollment were similar to those of other students, but they were older and more often male. They more often followed an irregular path to higher education than students without disabilities. They expected to study full time and spend no time on extracurricular activities or paid work. They expected to need more support and were at a higher risk of comorbidity than students with other disabilities. We found no difficulties with participation in preparatory activities. Over the first bachelor year, the grade point averages (GPAs) of autistic students were most similar to the GPAs of students without disabilities. Credit accumulation was generally similar except for one of seven periods, and dropout rates revealed no differences. The number of failed examinations and no-shows among autistic students was higher at the end of the first semester. Regarding progression and degree completion, we showed that most outcomes (GPAs, dropout rates, resits, credits, and degree completion) were similar in all three groups. Autistic students had more no-shows in the second year than their peers, which affected degree completion after three years. 
Our analysis of student success prediction clarified what factors predicted their success or lack thereof for each year in their bachelor program. For first-year success, study choice issues were the most important predictors (parallel programs and application timing). Issues with participation in pre-education (absence of grades in pre-educational records) and delays at the beginning of autistic students’ studies (reflected in age) were the most influential predictors of second-year success and delays in the second and final year of their bachelor program. Additionally, academic performance (average grades) was the strongest predictor of degree completion within three years. Our research contributes to increasing equality of opportunities and the development of support in higher education in three ways. First, it provides insights into the extent to which higher education serves the equality of autistic students. Second, it clarifies which differences higher education must accommodate to support the success of autistic students during their student journey. Finally, we used the insights into autistic students’ success to develop a stepped, personalized approach to support their diverse needs and talents, which can be applied using existing offerings.
... All nuisance parameters were estimated with extreme gradient tree boosting where the hyperparameters are chosen from a random grid of size 100 using the caret (Kuhn, 2021) library in R. ...
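A random hyperparameter grid of a given size can be requested in caret with `search = "random"` plus `tuneLength`; the sketch below draws 10 random configurations for gradient tree boosting rather than the 100 used in the cited work (data and sizes are illustrative):

```r
library(caret)

set.seed(13)
df <- twoClassSim(200)  # simulated two-class data shipped with caret

ctrl <- trainControl(method = "cv", number = 3, search = "random")

fit <- train(Class ~ ., data = df, method = "xgbTree",
             trControl = ctrl,
             tuneLength = 10)   # number of random configurations to evaluate

nrow(fit$results)  # one row per sampled configuration
fit$bestTune
```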
Preprint
Full-text available
Recent approaches to causal inference have focused on the identification and estimation of *causal effects*, defined as (properties of) the distribution of counterfactual outcomes under hypothetical actions that alter the nodes of a graphical model. In this article we explore an alternative approach using the concept of *causal influence*, defined through operations that alter the information propagated through the edges of a directed acyclic graph. Causal influence may be more useful than causal effects in settings in which interventions on the causal agents are infeasible or of no substantive interest, for example when considering gender, race, or genetics as a causal agent. Furthermore, the "information transfer" interventions proposed allow us to solve a long-standing problem in causal mediation analysis, namely the non-parametric identification of path-specific effects in the presence of treatment-induced mediator-outcome confounding. We propose efficient non-parametric estimators for a covariance version of the proposed causal influence measures, using data-adaptive regression coupled with semi-parametric efficiency theory to address model misspecification bias while retaining √n-consistency and asymptotic normality. We illustrate the use of our methods in two examples using publicly available data.
... The classifier is 300-tree deep, and the number of variables used to split at each tree is selected automatically based on the cross-validation results. We use the R programming language [13] and the caret package [14]. ...
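A hedged caret sketch of the forest described above, with 300 trees and `mtry` selected by cross-validation (the data are simulated and the fold count and `tuneLength` are assumptions, not the authors' settings):

```r
library(caret)  # also requires the randomForest package for method = "rf"

set.seed(1)
df <- twoClassSim(300)  # simulated stand-in data

fit <- train(Class ~ ., data = df,
             method = "rf",
             ntree = 300,     # 300 trees, passed through to randomForest()
             tuneLength = 5,  # candidate mtry values to evaluate
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune$mtry  # mtry selected automatically from the CV results
```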
Article
Full-text available
When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomics, the comprehensive profiling of metabolite abundances, typically relies on mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) would be Missing At Random (MAR), and that from mechanism (3) would be classified as Missing Completely At Random (MCAR). Two common approaches for handling missing data are the following: (1) omit missing data from the analysis; (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses, such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that leverages a two-step approach to imputing missing values. First, we use a random forest classifier to classify the missing mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR/MCAR or MNAR). Using complete data, we conducted simulations where we imposed different missingness patterns within the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those using only one imputation algorithm for all the missing values.
Consequently, our two-step approach was able to reduce bias for improved downstream analyses.
... A disparity in the frequencies of the observed classes can have a significant negative impact on the fit of the model. To solve this problem, we used the "upsampling" method [37], which randomly samples the minority class (with replacement) until the class frequencies are balanced. F1 balanced metric: the F1 score is 2*((precision*recall)/(precision+recall)). It is also called the F-score or the F-measure. ...
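In caret, upsampling can be applied with `upSample()`, and precision, recall, and F1 are reported by `confusionMatrix()`. A minimal sketch with simulated, deliberately imbalanced data (the classifier and fold count are illustrative assumptions):

```r
library(caret)  # also requires the randomForest package for method = "rf"

set.seed(1)
df <- twoClassSim(500, intercept = -8)  # intercept shifts the class balance
table(df$Class)                         # imbalanced class frequencies

# Randomly re-sample the minority class (with replacement) to balance classes
up <- upSample(x = df[, names(df) != "Class"], y = df$Class, yname = "Class")
table(up$Class)                         # class frequencies are now equal

# F1 = 2 * (precision * recall) / (precision + recall), reported by confusionMatrix
fit  <- train(Class ~ ., data = up, method = "rf",
              trControl = trainControl(method = "cv", number = 5))
pred <- predict(fit, df)
confusionMatrix(pred, df$Class, mode = "prec_recall")  # includes Precision, Recall, F1
```

Note that upsampling only the training folds (via `trainControl(sampling = "up")`) avoids leaking duplicated minority cases into holdout sets.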
Preprint
Full-text available
Background & Aims Hepatocellular carcinoma (HCC) is the most frequent malignant tumor of the liver and its incidence is increasing worldwide. Several treatments are currently available, but predictors of cancer recurrence are poorly characterized. The development of artificial intelligence has recently made available a new tool called machine learning (ML). ML enables strong prediction of several variables after inputting data into dedicated software. This study aimed to create an ML model for predicting HCC recurrence. Patients and methods In this study, we retrospectively analyzed data from 166 patients who were managed at the Bolzano Regional Hospital between 1998 and 2019. In order to find the best predictive model, both non-parametric and parametric models were evaluated. The non-parametric models trained in this study were the following: Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbours (KNN). The parametric model adopted was logistic regression with the elastic net algorithm (ENET). Results In our dataset, the Random Forest model was the most performant (AUC 0.712). Independently of the treatment performed, age at diagnosis, MELD, the absence of previous obesity, type of diagnosis, BMI, and BCLC emerged as significant HCC recurrence predictors. Conclusion ML may be a valuable tool in the prediction of HCC recurrence. Larger sample sizes are needed to create a useful tool for the clinical management of patients with HCC.
... We included stem age as crown and stem age are strongly correlated and stem age was more strongly correlated with range size than crown age (see Supplementary 1, figure 2). We tuned the hyperparameters using caret 6.0-90 (Kuhn 2021) and the BRTs were fit using the gbm package (Greenwell et al. 2020). We used a Gaussian distribution with 1050 trees, a bag fraction of 0.5, a learning rate of 0.01 and interaction depth of 2. BRTs do not yield traditional significance statistics; instead, they estimate each predictor variable's importance and allow construction of partial dependence plots (effect of a predictor holding the others constant). ...
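The gbm settings quoted above (Gaussian distribution, 1050 trees, bag fraction 0.5, learning rate 0.01, interaction depth 2) can be fixed in a caret tuning grid. A sketch, not the authors' code: `SLC14_1` is caret's built-in regression simulator standing in for the trait data, and `n.minobsinnode` and the CV setup are assumptions:

```r
library(caret)
library(gbm)  # boosted regression trees

set.seed(1)
df <- SLC14_1(200)  # simulated regression data; response column is "y"

grid <- expand.grid(n.trees = 1050,
                    interaction.depth = 2,
                    shrinkage = 0.01,     # learning rate
                    n.minobsinnode = 10)  # assumed; not stated in the snippet

fit <- train(y ~ ., data = df,
             method = "gbm",
             distribution = "gaussian",
             bag.fraction = 0.5,          # passed through to gbm()
             tuneGrid = grid,
             trControl = trainControl(method = "cv", number = 5),
             verbose = FALSE)
```

Variable importance (`varImp(fit)`) and partial dependence plots then take the place of traditional significance statistics, as the snippet notes.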
... The EEG preprocessing, brain graph construction and model evaluation are implemented in R 4.1.2 [32] using in-house scripts, and caret [33] for SVM training. The training of CNN and GNN classifiers is implemented using PyTorch 1.10 [34] and PyTorch Geometric 2.0.2 [35]. ...
Preprint
Full-text available
Alzheimer's disease (AD) is the leading form of dementia in the world. AD disrupts neuronal pathways and thus is commonly viewed as a network disorder. Many studies demonstrate the power of functional connectivity (FC) graph-based biomarkers for automated diagnosis of AD using electroencephalography (EEG). However, various FC measures are commonly utilised, as each aims to quantify a unique aspect of brain coupling. Graph neural networks (GNN) provide a powerful framework for learning on graphs. While there is a growing number of studies using GNN to classify EEG brain graphs, it is unclear which method should be utilised to estimate the brain graph. We use eight FC measures to estimate FC brain graphs from sensor-level EEG signals. GNN models are trained in order to compare the performance of the selected FC measures. Additionally, three baseline models based on the literature are trained for comparison. We show that GNN models perform significantly better than the other baseline models. Moreover, using FC measures to estimate brain graphs improves the performance of GNN compared to models trained using a fixed graph based on the spatial distance between the EEG sensors. However, no FC measure performs consistently better than the other measures. The best GNN reaches 0.984 area under the sensitivity-specificity curve (AUC) and 92% accuracy, whereas the best baseline model, a convolutional neural network, has 0.924 AUC and 84.7% accuracy.
... Distance matrices were computed with the cophenetic function applied to dendrograms built for each parameter, using the clustering methods also employed for the host niche dendrograms. We imputed missing data through a k-nearest neighbour algorithm (kNN) in caret (Kuhn, 2008, 2020). ...
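The kNN imputation mentioned above corresponds to caret's `preProcess(method = "knnImpute")`. A minimal sketch with a toy data frame (illustrative only):

```r
library(caret)

# Toy data frame with missing values (illustrative only)
df <- data.frame(a = c(1, 2, NA, 4, 5),
                 b = c(2.1, 3.9, 6.2, 8.1, 9.8),
                 c = c(10, NA, 30, 40, 50))

# method = "knnImpute" fills each NA from the k nearest complete rows;
# note that caret centres and scales the data as part of this step
pp  <- preProcess(df, method = "knnImpute", k = 2)
imp <- predict(pp, df)  # no NAs remain; values are standardized
```

Because `knnImpute` standardizes the columns, any downstream step expecting raw units must back-transform or work on the standardized scale.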
Article
Full-text available
... Additional nodes were added to the network manually based on literature (Bonner et al., 2008;Hauer and Gasser, 2017;Hustedt and Durocher, 2016;Panier and Boulton, 2014;Polo and Jackson, 2011). Classification of genotoxicity was performed by Support Vector Machine ("svmRadial") and Random Forest ("rf") method from R-package caret version 6.0-88 (Kuhn, 2021). The training with 10-fold cross-validation was repeated three times and the final model was chosen according to accuracy. ...
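The setup quoted above — `svmRadial` and `rf` with three repeats of 10-fold cross-validation, final model chosen by accuracy — can be sketched as follows (simulated data, not the authors' code):

```r
library(caret)  # also requires kernlab (svmRadial) and randomForest (rf)

set.seed(1)
df <- twoClassSim(200)  # simulated stand-in data

# 10-fold cross-validation, repeated three times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

svm_fit <- train(Class ~ ., data = df, method = "svmRadial",
                 metric = "Accuracy", trControl = ctrl)
rf_fit  <- train(Class ~ ., data = df, method = "rf",
                 metric = "Accuracy", trControl = ctrl)

# Compare resampled accuracies and keep the better model
summary(resamples(list(svm = svm_fit, rf = rf_fit)))
```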
Article
Full-text available
Toxicological risk assessment is essential in the evaluation and authorization of different classes of chemical substances. Genotoxicity and mutagenicity testing are of highest priority and rely on established in vitro systems with bacterial and mammalian cells, sometimes followed by in vivo testing using rodent animal models. Transcriptomic approaches have recently also shown their value in determining transcript signatures specific for genotoxicity. Here, we studied how transcriptomic data, in combination with in vitro tests with human cells, can be used for the identification of genotoxic properties of test compounds. To this end, we used liver samples from a 28-day oral toxicity study in rats with the pesticidal active substances imazalil, thiacloprid, and clothianidin, a neonicotinoid-type insecticide with, amongst others, known hepatotoxic properties. Transcriptomic results were bioinformatically evaluated and pointed towards a genotoxic potential of clothianidin. In vitro Comet and γH2AX assays in human HepaRG hepatoma cells, complemented by in silico analyses of mutagenicity, were conducted as follow-up experiments to check whether the genotoxicity alert from the transcriptomic study is in line with results from a battery of guideline genotoxicity studies. Our results illustrate the combined use of toxicogenomics, classic toxicological data and new approach methods in risk assessment. By means of a weight-of-evidence decision, we conclude that clothianidin most likely does not pose genotoxic risks to humans.
... The area under the receiver-operating characteristic curve (AUROC) averaged across all classes was used to tune the model. These procedures were implemented in R v3.5.2 using the randomForest v4.6 [53] and caret v.6.0 [49] packages. To work with the best performing classification approach, we evaluated and compared the classification performance of alternative machine learning approaches (l1-/l2-regularized support vector classification, ridge regression) and cross-validation strategies (k-fold, stratified shuffle split). ...
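Tuning on class-averaged AUROC in caret uses a probability-aware summary function such as `multiClassSummary`. A hedged sketch (the simulated data here are two-class for brevity, whereas the cited study was multi-class; the fold count is an assumption):

```r
library(caret)  # multiClassSummary may additionally require the MLmetrics package

set.seed(1)
df <- twoClassSim(300)  # stand-in data; the cited study used multi-class labels

# classProbs = TRUE is required so per-class probabilities are available;
# multiClassSummary reports AUC averaged across classes
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = multiClassSummary)

fit <- train(Class ~ ., data = df,
             method = "rf",
             metric = "AUC",  # tune on the class-averaged AUROC
             trControl = ctrl)
```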
Article
Full-text available
The clinical validity of the distinction between ADHD and ASD is a longstanding discussion. Recent advances in the realm of data-driven analytic techniques now enable us to formally investigate theories aiming to explain the frequent co-occurrence of these neurodevelopmental conditions. In this study, we probe different theoretical positions by means of a pre-registered integrative approach of novel classification, subgrouping, and taxometric techniques in a representative sample (N = 434), and replicate the results in an independent sample (N = 219) of children (ADHD, ASD, and typically developing) aged 7–14 years. First, Random Forest Classification could predict diagnostic groups based on questionnaire data with limited accuracy, suggesting some remaining overlap in behavioral symptoms between them. Second, community detection identified four distinct groups, but none of them showed a symptom profile clearly related to either ADHD or ASD in either the original sample or the replication sample. Third, taxometric analyses showed evidence for a categorical distinction between ASD and typically developing children, a dimensional characterization of the difference between ADHD and typically developing children, and mixed results for the distinction between the diagnostic groups. We present a novel framework of cutting-edge statistical techniques which represent recent advances in both the models and the data used for research in psychiatric nosology. Our results suggest that ASD and ADHD cannot be unambiguously characterized as either two separate clinical entities or opposite ends of a spectrum, and highlight the need to study ADHD and ASD traits in tandem.
... (Friedman et al., 2010) available through the caret package (Version 6.0.88) (Kuhn, 2021) in R (Version 4.1.0) (Team R Development Core, 2021). ...
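An elastic net fit via caret's `glmnet` method can be sketched as follows (illustrative data and tuning length; not the authors' code):

```r
library(caret)
library(glmnet)  # elastic net of Friedman et al. (2010)

set.seed(1)
df <- twoClassSim(300)  # simulated stand-in data

fit <- train(Class ~ ., data = df,
             method = "glmnet",
             tuneLength = 10,  # grid over alpha (mixing) and lambda (penalty strength)
             trControl = trainControl(method = "cv", number = 10))
fit$bestTune  # selected alpha and lambda
```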
Article
Full-text available
Background The identification of predictors of recurrence and persistence of depressive episodes in major depressive disorder (MDD) can inform clinicians and support clinical decisions. Objective The aim of the present study is to predict recurrent or persistent depressive episodes, in addition to predicting severe recurrent or persistent depressive episodes, using a machine learning method. Methods This is a prospective cohort study with three years of follow-up. Individuals diagnosed with MDD in the first phase of the study (2012–2015) were evaluated in the second phase. The sociodemographic, clinical, comorbid disorder, and substance use variables were used as predictors in all predictive models. Initially, the first model predicted recurrence/persistence, including subjects at any severity of depression. The second model predicted recurrence/persistence of depression as in the first model, although it was trained with severely depressed subjects and those without indications of depression. The third model predicted severe depression among depressed patients. Results Area under the curve (AUC) values ranged from 0.65 to 0.81, and accuracies ranged from 62% to 71%. Psychiatric comorbidities, substance abuse/dependence, and family medical history were important features in all three models. Limitations The time between baseline and the second phase of the study was approximately three years, making it difficult to detect depressive symptoms during this time frame. Also, age at depression onset and number of episodes were not included in the model due to the large number of missing data. Conclusions In conclusion, this study adds new information that can help health professionals both in their clinical practice and in public services.
... Pre-processing and analysis of transcriptomic data was performed using the DESeq2 23 and WGCNA 24 software packages, and the classification models using the caret 25 and glmnet 26 packages. ...
Article
Full-text available
Autism Spectrum Disorders (ASD) have a strong, yet heterogeneous, genetic component. Among the various methods that are being developed to help reveal the underlying molecular aetiology of the disease one approach that is gaining popularity is the combination of gene expression and clinical genetic data, often using the SFARI-gene database, which comprises lists of curated genes considered to have causative roles in ASD when mutated in patients. We build a gene co-expression network to study the relationship between ASD-specific transcriptomic data and SFARI genes and then analyse it at different levels of granularity. No significant evidence is found of association between SFARI genes and differential gene expression patterns when comparing ASD samples to a control group, nor statistical enrichment of SFARI genes in gene co-expression network modules that have a strong correlation with ASD diagnosis. However, classification models that incorporate topological information from the whole ASD-specific gene co-expression network can predict novel SFARI candidate genes that share features of existing SFARI genes and have support for roles in ASD in the literature. A statistically significant association is also found between the absolute level of gene expression and SFARI’s genes and Scores, which can confound the analysis if uncorrected. We propose a novel approach to correct for this that is general enough to be applied to other problems affected by continuous sources of bias. It was found that only co-expression network analyses that integrate information from the whole network are able to reveal signatures linked to ASD diagnosis and novel candidate genes for the study of ASD, which individual gene or module analyses fail to do. It was also found that the influence of SFARI genes permeates not only other ASD scoring systems, but also lists of genes believed to be involved in other neurodevelopmental disorders.
... The RF models were generated using the caret package (v. 6.0-88, Kuhn, 2021) in program R. We used five-fold cross-validation with no repeats on the training data. Both models identified mtry = 12 as the optimal number of variables sampled at each split. ...
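A sketch of that setup — five-fold cross-validation with no repeats and `mtry` fixed at 12. The data are simulated (`twoClassSim`'s default 15 predictors make `mtry = 12` a valid choice); this is not the authors' code:

```r
library(caret)  # also requires the randomForest package

set.seed(1)
df <- twoClassSim(300)  # default simulation has 15 predictors

fit <- train(Class ~ ., data = df,
             method = "rf",
             tuneGrid = data.frame(mtry = 12),  # variables sampled at each split
             trControl = trainControl(method = "cv", number = 5))  # five folds, no repeats
```

Supplying a one-row `tuneGrid` skips tuning entirely; the CV folds then serve only to estimate out-of-sample performance at the fixed setting.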
Article
Full-text available
Animal conservation requires understanding animal-habitat relationships. The integration of novel remote sensing platforms such as Light Detection and Ranging (LiDAR) technology has dramatically improved the resolution of insight when evaluating animal-habitat relationships by characterizing forest structure. However, conventional LiDAR collection (e.g., airborne or terrestrial laser scanning) may be limited by small spatial extents and logistical constraints (e.g., budget) associated with sampling. NASA’s Global Ecosystem Dynamics Investigation (GEDI) mission provides an alternative and complement to conventional LiDAR sampling with globally available waveform LiDAR, which is being collected to characterize vertical and horizontal structure of Earth’s forests. Forest carnivores are wide-ranging species occupying forested ecosystems, and are generally associated with vertical and horizontal forest structure for their survival and reproduction. We evaluated patterns in occurrence and habitat use of forest carnivores, which included Pacific martens (Martes caurina), Rocky Mountain red foxes (Vulpes vulpes macroura), and coyotes (Canis latrans) and patterns in occurrence of their prey; American red squirrels (Tamiasciurus hudsonicus) and snowshoe hares (Lepus americanus). Camera trap data were collected during the 2014–2017 winters in the Greater Yellowstone Ecosystem in Wyoming, USA. Our objectives were to (1) combine GEDI samples with multispectral satellite imagery from Landsat 8 to upscale vertical forest structure metrics; (2) assess the relative importance of environmental characteristics influencing occurrence and habitat use of forest-associated predators and prey; and (3) determine if GEDI-derived variables aided our efforts in characterizing animal-environment relationships. 
We used Random Forest regression models to upscale GEDI samples across our study area and implemented a multi-tiered approach using generalized linear mixed effect models to simultaneously evaluate animal-environment relationships and how GEDI-derived metrics improved the animal-habitat models. GEDI-derived metrics of relative height and foliage height diversity improved our animal-environment models and were among the strongest covariates (effect sizes were 1.3–1.8 times larger than the next closest) in the coyote, red squirrel, and snowshoe hare models. All five species were influenced to some degree by the frequency of rebaiting a camera trap and varying conditions of snow depth. Collectively, our work indicates forest canopy height and complexity variables significantly improved our ability to assess the importance of forest characteristics on forest carnivores and their prey. Indeed, there is an untapped opportunity to enhance animal ecology and conservation planning with continued integration of GEDI information with freely available satellite data to characterize attributes of forest structure across expansive areas.
Article
Full-text available
Onychomycosis (OM) is a common fungal nail infection. Based on the rich mycobial diversity in healthy toenails, we speculated that this is lost in OM due to the predominance of a single pathogen. We used next generation sequencing to obtain insights into the biodiversity of fungal communities in both healthy individuals and OM patients. By sequencing, a total of 338 operational-taxonomic units were found in OM patients and healthy controls. Interestingly, a classifier distinguished three distinct subsets: healthy controls and two groups within OM patients with either a low or high abundance of Trichophyton. Diversity per sample was decreased in controls compared to cases with low Trichophyton abundance (LTA), while cases with a high Trichophyton abundance (HTA) showed a lower diversity. Variation of mycobial communities between the samples showed shifts in the community structure between cases and controls—mainly driven by HTA cases. Indeed, LTA cases had a fungal β-diversity undistinguishable from that of healthy controls. Collectively, our data provides an in-depth characterization of fungal diversity in health and OM. Our findings also suggest that onychomycosis develops either through pathogen-driven mechanisms, i.e., in HTA cases, or through host and/or environmental factors, i.e., in cases with a low Trichophyton abundance.
Article
Full-text available
With the growth in complexity of real-time embedded systems, there is an increasing need for tools and techniques to understand and compare the observed runtime behavior of a system with the expected one. Since many real-time applications require periodic interactions with the environment, one of the fundamental problems in guaranteeing their temporal correctness is to be able to infer the periodicity of certain events in the system. The practicability of a period inference tool, however, depends on both its accuracy and robustness (also its resilience) against noise in the output trace of the system, e.g., when the system trace is impacted by the presence of aperiodic tasks, release jitters, and runtime variations in the execution time of the tasks. This work (i) presents the first period inference framework that uses regression-based machine-learning (RBML) methods, and (ii) thoroughly investigates the accuracy and robustness of different families of RBML methods in the presence of uncertainties in the system parameters. We show, on both synthetically generated traces and traces from actual systems, that our solutions can reduce the error of period estimation by two to three orders of magnitudes w.r.t. the state of the art.
Article
Full-text available
In the literature of modern psychometric modeling, mostly related to item response theory (IRT), the fit of a model is evaluated through known indices, such as χ², M2, and root mean square error of approximation (RMSEA) for absolute assessments, as well as Akaike information criterion (AIC), consistent AIC (CAIC), and Bayesian information criterion (BIC) for relative comparisons. Recent developments show a trend toward merging psychometrics and machine learning, yet there remains a gap in model fit evaluation, specifically the use of the area under the curve (AUC). This study focuses on the behavior of AUC in fitting IRT models. Rounds of simulations were conducted to investigate AUC’s appropriateness (e.g., power and Type I error rate) under various conditions. The results show that AUC possessed certain advantages under certain conditions, such as a high-dimensional structure with two-parameter logistic (2PL) and some three-parameter logistic (3PL) models, while disadvantages were also obvious when the true model is unidimensional. It cautions researchers about the dangers of relying solely on AUC when evaluating psychometric models.
Article
Remote sensing indices have been proposed to characterize soil salinity. However, the sensitivity of these indicators is unstable owing to differences in geographic environment and vegetation type. This study investigated the performance of several existing indices to estimate the salinity of topsoil with residues in southern Xinjiang, China. The results showed that these indices were not satisfactory. In order to construct an index that can be used to directly indicate soil salinity in a specific area, novel salinity indices were calculated using optical bands (blue, green, red, vegetation red edge, and shortwave infrared bands) derived from Sentinel-2 multispectral data and Sentinel-1 radar data (backscattering coefficients VV and VH). To enhance the sensitivity of the optical bands, five transformation methods (logarithmic, reciprocal, first-, second-, and third-derivative) were applied to the original spectra. Based on previous studies, statistical methods were used to construct two-, three-, and four-band indices. One constructed three-band index with the second-derivative transformation, called the Enhanced Residues Soil Salinity Index (ERSSI), showed the highest correlation with topsoil salinity (r = 0.65 and 0.68 in training and testing). ERSSI establishes a linear relationship in soil salinity estimation with an R² of 0.53 and an LCCC of 0.65 in the training dataset, and an R² of 0.51 and an LCCC of 0.73 in the testing dataset. It also contributes to random forest regression, with an R² of 0.80 and an LCCC of 0.86 in the training dataset, and an R² of 0.77 and an LCCC of 0.81 in the testing dataset. The ERSSI consisted of the B, G, and SWIR1 bands, and was sensitive to salinity variations in the residues remaining in farmland soils. This study provides a novel index and method for the accurate and robust assessment and mapping of salinity in farmland covered by crop residues.
Article
Blue cheese flavour development derives from complex biochemical reactions that depend on numerous factors including milk source, culture/strain selection, processing, and ripening conditions. Understanding volatile compound development during blue cheese ripening will help reduce production costs and facilitate quality improvements. Volatile compounds contribute to the characteristic flavours of the cheeses but ripening time predictions based on chemical data have proven difficult. The present study employed untargeted fingerprinting combined with linear and non-linear chemometric approaches to identify key volatiles for the modelling of Shenley Station blue cheese ripening times. Self-organizing maps and entropy-based feature selection along with partial least squares regression and variable identification coefficients were used to parse the linear and non-linear development behaviours of volatiles. The blue cheese ripening times were accurately modelled by twenty-three discriminant volatiles. The present study demonstrated that volatile fingerprints can be used to effectively model blue cheese ripening times using a non-linear chemometric approach.
Article
Full-text available
Cation exchange capacity (CEC) is a major indicator of soil quality and nutrient retention capacity. Despite the considerable progress in CEC prediction using various models, studies that develop CEC pedotransfer functions (PTFs) using machine learning algorithms such as support vector regression (SVR) and random forest (RF) have not yet been performed across various land uses globally. This study aims to develop, evaluate, and compare the effectiveness of RF and SVR algorithms in determining CEC in different land uses that included agriculture, plantations, grasslands, forests, fallow land and deserts in five countries (Sudan, India, Italy, Iran, and Senegal). A total of 2,418 soil samples were fully analyzed and clay, silt, sand, pH, and soil organic carbon (SOC) were the selected covariates for modelling. Both RF and SVR were calibrated with a training dataset (70%, 1,693 samples) and validated with the remaining data (30%, 725 samples). The performance and accuracy of both models were evaluated using Lin’s concordance correlation coefficient (LCCC), root mean square error (RMSE), and normalized root mean square error (NRMSE). The accuracy of the modeling predictions was further analyzed via a Taylor diagram. The findings revealed that clay content showed a significant positive correlation with CEC in all land uses, with the highest correlation in desert land use (r = 0.94; p<0.05). Conversely, CEC was significantly and negatively correlated with sand in all land uses, with the highest negative correlation obtained in desert land use (r = −0.84; p<0.05). The RF algorithm was able to predict CEC better than SVR in nearly 67% of the validated land use datasets, specifically in desert (RMSE = 2.68 cmolc kg⁻¹, NRMSE = 29.9%, and LCCC = 0.94), fallow land (RMSE = 5.12 cmolc kg⁻¹, NRMSE = 55.6%, and LCCC = 0.82), forest (RMSE = 4.78 cmolc kg⁻¹, NRMSE = 78.2%, and LCCC = 0.59), and grassland (RMSE = 8.39 cmolc kg⁻¹, NRMSE = 50.5%, and LCCC = 0.84). 
Conversely, SVR better predicted CEC in agriculture (RMSE = 5.82 cmolc kg⁻¹, NRMSE = 57.9%, and LCCC = 0.78) and plantation (RMSE = 4.64 cmolc kg⁻¹, NRMSE = 57.9%, and LCCC = 0.74). Therefore, RF represents a promising technique to estimate soil CEC and can be used to derive effective CEC-PTFs when data availability is limited by time and financial resources and only a few basic soil properties are available. The findings reported in this study can be used to verify the suggested CEC-PTFs and/or to improve them. We recommend that further similar studies based on RF and SVR algorithms consider including land use type in the whole dataset and clay minerals in the modelling, and then compare the performance of both algorithms considering the climatic regions of the different studied countries.
Article
Full-text available
Anthropogenic activities are degrading forest health globally. Detection of changes in forest growth rates is possible with field measures of total net primary productivity (tNpp); however, assessing forest adaptive capacity was historically challenging due to a scarcity of tNpp data, limited knowledge of carbon allocation to above- and belowground biomass, and the inability to calculate forest productive potential. This study used a global data set of 307 published research studies to identify tNpp thresholds and site-level factors constraining growth for forests in boreal, temperate, and tropical biomes. These data make it possible to: (i) calculate the ratio of measured tNpp to a theoretical maximum tNpp from variables external to the site, its "ecosystem fit," (ii) identify environmental thresholds by scale and biogeography, and (iii) determine stand-level conditions that limit growth. At the global scale, climatic variables explain most of the variance in tNpp, whereas at the biome scale different combinations of climatic/edaphic variables interact with phenological traits to explain productivity. For example, deciduous boreal forests were less resilient if precipitation increased, but deciduous tropical forests were less resilient if minimum annual temperature increased. At the biome scale, boreal forest mean tNpp (7.3 Mg ha⁻¹ yr⁻¹) was significantly lower than that of tropical forests (19.0 Mg ha⁻¹ yr⁻¹), but the mean tNpp of temperate forests (14.2 Mg ha⁻¹ yr⁻¹) did not significantly differ from boreal or tropical forests, representing a first-order latitudinal forcing of productivity. Comparison of ranked clusters of tNpp (low, medium, high) indicated that the tropics had a larger proportion of highly productive forests (20%) than temperate (12%) or boreal forests (14%). 
Tropical forests are highly adapted to small-scale environmental heterogeneity, but their unique evolutionary trajectories make them more sensitive to land-use change, habitat fragmentation, and climate disruption. There was minimal overlap among the high tNpp groups between boreal forests (12.0–18.2 Mg ha⁻¹ yr⁻¹) and temperate forests (17.6–37.7 Mg ha⁻¹ yr⁻¹), and moderate overlap between temperate and tropical forests (26.8–45.9 Mg ha⁻¹ yr⁻¹), indicating increased adaptive capacity to environmental variability by climatic zone. All tropical forests had similar ecosystem fit, but Paleotropic forests growing on Inceptisol and Entisol soils were significantly more productive than Neotropical forests, revealing similar adaptive capacity despite less favorable growing conditions in the Neotropics. This study supports including site-level variations in edaphic and climatic factors to understand changes in primary productivity due to disturbance and proposes using ecosystem fit to identify forest adaptive capacity in response to climate destabilization. Article link: https://authors.elsevier.com/sd/article/S1470-160X(22)00444-7
Article
Soil cation exchange capacity (CEC) and pH affect the condition of soil. To improve soil capability in sugarcane growing areas, Sugar Research Australia introduced the Six-Easy-Steps Nutrient Guidelines based on CEC and pH of topsoil (0–0.3 m). A three-dimensional digital soil mapping (DSM) framework has been used to predict CEC and pH by fitting equal-area splines to four depth intervals (i.e., topsoil, subsurface [0.3–0.6 m], shallow [0.6–0.9 m], and deep subsoil [0.9–1.2 m]) to resample soil data at 0.01 m increments. A single quantile regression forest (QRF) was calibrated to model the relationship between spline-fitted soil data and individual digital data. These included proximal soil sensing (PSS) data such as electromagnetic (EM) induction and gamma-ray (γ-ray) spectrometry, remote sensing (RS) Sentinel-2 imagery, a light detection and ranging (LiDAR) based digital elevation model (DEM), and soil depth. Various data fusion methods and minimum calibration size have been evaluated, including concatenation and model averaging approaches, namely, simple averaging (SA), Bates-Granger averaging (BGA), Granger-Ramanathan averaging (GRA), and bias-corrected eigenvector averaging (BC-EA). In all cases, an independent validation was used to assess prediction agreement (Lin's concordance correlation coefficient—LCCC) and accuracy (ratio of performance to deviation—RPD). For CEC, γ-ray (LCCC = 0.82) was the best, with EM (0.78) and Sentinel-2 (0.77) producing similar agreement, whereas DEM (0.64) had worst performance. For pH, EM, γ-ray, and Sentinel-2 were similar (0.69, 0.73, and 0.77, respectively), and DEM poor (0.48). Optimum results were achieved when PSS, Sentinel-2, and DEM were fused using GRA; CEC agreement (0.88) and accuracy (RPD = 2.14) were strong, while for pH, concatenation had good agreement (0.79) and accuracy (1.59). 
Neither agreement nor accuracy varied with sample size, with a minimum of 30 (CEC) and 80 (pH) sites necessary (0.4 and 1.1 sampling sites ha⁻¹, respectively). The final DSMs for topsoil CEC and pH were useful for lime application; the northern fields required 2.5 t ha⁻¹ of lime, whereas the southern fields required variable rates (4 and 5 t ha⁻¹).
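Of the model-averaging approaches compared in this abstract, Granger-Ramanathan averaging (GRA) amounts to an unconstrained least-squares regression of the observed values on the individual models' predictions, estimated on a calibration set. A minimal illustrative sketch in Python with NumPy — the two synthetic "models" and all numbers are made up for illustration, not the study's data:

```python
import numpy as np

def gra_weights(preds, y):
    """Granger-Ramanathan averaging: unconstrained least-squares
    weights for combining several models' predictions.
    preds is an (n_samples, n_models) matrix from a calibration set."""
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    return w

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

# Toy illustration: two synthetic "models" of differing accuracy.
rng = np.random.default_rng(0)
truth = rng.normal(10.0, 2.0, 200)
m1 = truth + rng.normal(0.0, 0.5, 200)  # e.g. a gamma-ray based model
m2 = truth + rng.normal(0.0, 1.0, 200)  # e.g. a Sentinel-2 based model
P = np.column_stack([m1, m2])
w = gra_weights(P, truth)
fused = P @ w  # fused prediction weights the better model more heavily
```

Unlike simple averaging, the least-squares weights automatically favour the more accurate input model, which is consistent with GRA outperforming SA in the study.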
Article
Full-text available
Classification of beaches into morphodynamic states is a common approach in sandy beach studies, owing to the influence of natural variables on ecological patterns and processes. Remote sensing for identifying beach type and monitoring changes has commonly been applied through multiple methods, which often involve expensive equipment and software processing of images. A previous study on the South African coast developed a method to classify beaches using conditional inference trees, based on beach morphological features estimated from publicly available satellite images, without the need for remote sensing processing, which allowed a large-scale characterization. However, since this method had not been validated in other regions, its potential use as a trans-scalar tool, or its dependence on local calibration, remained unevaluated. Here, we tested the validity of the method on a 200-km stretch of the Brazilian coast encompassing a wide gradient of morphodynamic conditions. We also compared this locally derived model with the results that would be generated using the cut-off values established in the previous study. To this end, 87 beach sites were remotely assessed using accessible software (i.e., Google Earth) and sampled for an in-situ environmental characterization and beach type classification. These sites were used to derive the predictive model of beach morphodynamics from the remotely assessed metrics, using conditional inference trees. An additional 77 beach sites with a previously known morphodynamic type were also remotely evaluated to test the model's accuracy. Intertidal width and exposure degree were the only variables selected by the model to classify beach type, with an accuracy higher than 90% across different metrics of model validation. The only limitation was an inability to separate beach types at the reflective end of the morphodynamic continuum. 
Our results corroborated the usefulness of this method and highlighted the importance of a locally developed model, which substantially increased the accuracy. Although more sophisticated remote sensing approaches should be preferred for assessing coastal dynamics or detailed morphodynamic features (e.g., nearshore bars), the method used here provides an accessible and accurate approach to classifying beaches into major states at large spatial scales. As beach type can be used as a surrogate for biodiversity, environmental sensitivity, and touristic preferences, the method may aid management in the identification of priority areas for conservation.
Article
Tissue testing used to assess the chemical contents of potato plants is laborious, time-consuming, destructive, and expensive. Ground-based sensors have been assessed to provide efficient information on nitrogen using leaf canopy reflectance. In potatoes, however, the main organ used for tissue testing is the petiole, from which the concentrations of all nutrients are estimated. This research aims to assess whether the chemical contents of potato petioles correlate with the leaf spectrum, and to examine whether the spectra of dried or fresh leaves yield higher correlations. Petiole chemical contents of all elements were measured as the reference. Leaves were split equally into dried and fresh groups for spectral analysis (400–2500 nm). Lasso regression models were built to estimate concentrations, which were compared with the measured values. Model performance was tested using the Ratio of (standard error of) Prediction to (standard) Deviation (RPD). All elements showed reasonable to excellent RPD values except sodium. All elements correlated more strongly in the dried testing mode except nitrogen and potassium. The models showed that the most significant wavebands were in the visible and very-near-infrared range (400–1100 nm) for all macronutrients except magnesium and sulfur, while all micronutrients had their most significant wavebands across the full range (400–2500 nm), with a common significant waveband at 1932 nm. The results show the high potential of a new approach to estimating potato plant elements from foliar spectral reflectance.
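The RPD reported in this abstract is simply the standard deviation of the reference measurements divided by the standard error of prediction; a minimal sketch in Python (the concentrations below are toy numbers, not the study's data, and the interpretation thresholds vary between authors):

```python
import numpy as np

def rpd(observed, predicted):
    """Ratio of Performance to Deviation: SD of the reference values
    divided by the root-mean-square error of prediction. Values above
    ~2 are often read as excellent, ~1.4-2 as reasonable, and below
    ~1.4 as poor, though cut-offs differ between authors."""
    sep = np.sqrt(np.mean((observed - predicted) ** 2))
    return np.std(observed, ddof=1) / sep

# Toy reference vs. predicted concentrations for one element.
obs = np.array([1.2, 2.8, 3.1, 4.0, 5.5, 6.1])
pred = np.array([1.0, 3.0, 3.0, 4.2, 5.2, 6.4])
score = rpd(obs, pred)
```

An RPD near 1 means the model predicts no better than quoting the mean of the reference data, which is why a low RPD (as for sodium here) signals a weak calibration.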
Article
Full-text available
Forests will be critical to mitigating the effects of climate and global changes; therefore, knowledge of the drivers of forest area change is important. Although the drivers of deforestation are well known, the drivers of afforestation are almost unexplored. Moreover, protected areas (PAs) effectively decrease deforestation, but other types of area-based conservation measures exist. Among these, sacred natural sites (SNS) deliver positive conservation outcomes while making up an extensive “shadow network” of conservation. However, little is known about the capacity of SNS to regulate land-use changes. Here, we explored the role of SNS and PAs as drivers of forest loss and forest gain in Italy between 1936 and 2018. We performed a descriptive analysis and modeled forest gain and forest loss by means of spatial binomial generalized linear models with residual autocovariates. The main drivers of forest area change were geographical position and elevation; nonetheless, SNS and PAs significantly decreased forest loss and increased forest gain. Although the negative relationship between SNS and forest loss is a desirable outcome, the positive relationship with forest gain is concerning because it could point to abandonment of cultural landscapes with consequent loss of open habitats. We suggest legal recognition of SNS and active ecological monitoring and planning to help maintain their positive role in biodiversity conservation. As a novel conservation planning approach, SNS can be used as stepping stones between PAs, increasing connectivity, and to conserve small habitat patches threatened by human activities.
Article
Full-text available
Historical samples collected from 1985 to 2020 in coastal and open ocean regions of the Northeast Pacific were utilised to explore salp assemblage composition, morphometrics, and ontogeny of dominant species, as well as spatial, seasonal, and interannual distribution patterns at subarctic latitudes. Species richness was low; however, three of the seven observed species (Cyclosalpa bakeri, Salpa aspera, and S. fusiformis) were dominant and widely distributed in oceanic waters. Salpa maxima and Ihlea punctata were sporadically encountered, while Thalia democratica and Thetys vagina were observed exclusively during the 2014–2016 heatwave. Of the regions sampled, salps occurred at the shelf break and offshore, only occasionally on the shelf, and never in inner waterways. Although salps were encountered year-round, they were most prevalent between late spring and autumn. Over the 35 years, no clear upward or downward trend in the percentage of presence was detected, owing to high and perhaps cyclic variability. For some species, abundance was significantly affected by season, year, bottom depth, and the Multivariate El Niño/Southern Oscillation Index. There is an urgent need for detailed research to realistically incorporate pelagic thaliaceans in North Pacific ecosystem models. The present study lays the foundation for future research on salps at high northern latitudes, utilising a unique and rich sample collection.
Article
Full-text available
Fruit bats are important pollinators and seed dispersers whose distribution may be affected by climate change and extreme-temperature events. We assessed the potential impacts of those changes and events on the future distribution of fruit bats in Australia. Correlative species distribution modelling was used to predict the distribution of seven (based on data availability) tropical and temperate fruit bat species. We used bioclimatic variables, the number of days where temperature ≥ 42 °C (known to induce extreme heat stress and mortality in fruit bats), and land cover (a proxy for habitat) as predictors. An ensemble of machine-learning algorithms was used to make predictions for the current-day distribution and future (2050 and 2070) scenarios, using multiple emission scenarios (RCP 4.5 and 8.5) and global circulation models (Australian Community Climate and Earth System Simulator, Hadley Centre Global Environment Model Carbon Cycle, and the Model for Interdisciplinary Research on Climate). Under current conditions, our results predict that, on average, 9.1% and 90.8% of the area are suitable and unsuitable, respectively. Under future scenarios, on average, 6.7% and 89.7% remained suitable and unsuitable, respectively, with a 1.1% gain and 2.4% loss in suitable areas. Under current conditions, we predict that, on average, 5.6% and 3.4% of the area are suitable inside and outside species’ IUCN-defined ranges, respectively; under future scenarios, 4.8% (4.4% stable and 0.4% gain) and 2.9% (2.2% stable and 0.6% gain) are suitable inside and outside the range, respectively. On average, the gain in areas inside the range covers 2703.5 5-km grid cells, while outside the range it covers 4070.3 cells. Under future scenarios, the loss in areas is predicted to average 1.2% and 1.1% inside and outside species ranges, respectively. 
Fruit bats are likely to respond to climate change and extreme temperatures by migrating to more suitable areas, including regions not historically inhabited by those species. Our results can be used for identifying areas at risk of new fruit-bat colonisation, such as human settlements and orchards, as well as areas that might be important for habitat conservation.
Article
Heterogeneity in the course of posttraumatic stress symptoms (PTSS) following a major life trauma such as childhood sexual abuse (CSA) can be attributed to numerous contextual factors, psychosocial risk, and family/peer support. The present study investigates a comprehensive set of baseline psychosocial risk and protective factors, including online behaviors, predicting empirically derived PTSS trajectories over time. Females aged 12–16 years (N = 440; 156 with substantiated CSA and 284 matched comparisons with various self-reported potentially traumatic events [PTEs]) were assessed at baseline and then annually for 2 subsequent years. Latent growth mixture modeling (LGMM) was used to derive PTSS trajectories, and least absolute shrinkage and selection operator (LASSO) logistic regression was used to investigate psychosocial predictors, including online behaviors, of the trajectories. LGMM revealed four PTSS trajectories: resilient (52.1%), emerging (9.3%), recovering (19.3%), and chronic (19.4%). Of the 23 predictors considered, nine were retained in the LASSO model discriminating resilient versus chronic trajectories, including the absence of CSA and other PTEs, low incidence of exposure to sexual content online, minority ethnicity status, and the presence of additional psychosocial protective factors. Results provide insights into possible intervention targets to promote resilience in adolescence following PTEs.
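LASSO logistic regression, used above to screen 23 predictors down to nine, retains a predictor only if it improves discrimination and shrinks the rest exactly to zero. A minimal sketch of the idea using proximal gradient descent (ISTA) in Python — the synthetic data and all parameter values are illustrative, not the study's clinical data:

```python
import numpy as np

def l1_logistic(X, y, lam=0.05, lr=0.1, iters=2000):
    """L1-penalised logistic regression fitted by proximal gradient
    descent: a gradient step on the log-loss followed by
    soft-thresholding, which zeroes out uninformative coefficients."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(y=1)
        grad = X.T @ (prob - y) / n                # log-loss gradient
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

# Synthetic example: the outcome depends only on the first two of
# five candidate predictors; LASSO should discard the other three.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(0.0, 0.5, 300) > 0).astype(float)
w = l1_logistic(X, y)
```

The surviving non-zero coefficients play the same role as the nine retained predictors in the study: a sparse, interpretable discriminator between the two classes.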
Article
Full-text available
Pipe failure prediction models are essential for informing proactive management decisions. This study aims to establish a reliable prediction model returning the probability of pipe failure using a gradient boosted tree model, with a specific segmentation and grouping of pipes on a 1 km grid that associates localised characteristics. The model is applied to an extensive UK network with approximately 40,000 km of pipeline and a 14-year failure history. The model was evaluated using the receiver operating characteristic curve and area under the curve (AUC = 0.89), Brier score (0.007), and Matthews correlation coefficient (0.27), indicating acceptable predictions. A weighted risk analysis is used to identify the consequence of a pipe failure and provide a graphical representation of high-risk pipes for decision makers. The weighted risk analysis provided an important step towards understanding the consequences of the predicted failures. The model can be used directly in strategic planning, which sets long-term key decisions regarding maintenance and potential replacement of pipes.
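The three scores reported above can each be computed directly from predicted probabilities and observed labels. A small illustrative sketch (toy data; the rank-based AUC shortcut assumes no tied probabilities):

```python
import numpy as np

def auc(y, p):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    identity; assumes no tied probabilities."""
    ranks = np.empty(len(p))
    ranks[np.argsort(p)] = np.arange(1, len(p) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def brier(y, p):
    """Mean squared difference between predicted probability and
    outcome; lower is better, and 0.007 is very low."""
    return np.mean((p - y) ** 2)

def mcc(y, yhat):
    """Matthews correlation coefficient from confusion-matrix counts;
    informative even under the heavy class imbalance typical of
    failure records."""
    tp = np.sum((y == 1) & (yhat == 1))
    tn = np.sum((y == 0) & (yhat == 0))
    fp = np.sum((y == 0) & (yhat == 1))
    fn = np.sum((y == 1) & (yhat == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y = np.array([0, 0, 0, 1, 1, 0, 1, 0])          # 1 = pipe failed
p = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.6, 0.9])
auc_val = auc(y, p)
brier_val = brier(y, p)
mcc_val = mcc(y, (p >= 0.5).astype(int))
```

AUC measures ranking quality, the Brier score measures probability calibration, and MCC measures classification quality at a chosen threshold, which is why reporting all three gives a fuller picture than any one alone.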
Article
Objectives: Our goal was to evaluate the diagnostic value of DNA methylation analysis in combination with machine learning to differentiate pleural mesothelioma (PM) from important histopathological mimics. Material and methods: DNA methylation data from PM, lung adenocarcinomas, lung squamous cell carcinomas, and chronic pleuritis were used to train a random forest as well as a support vector machine. These classifiers were validated using an independent validation cohort including pleural carcinosis and pleomorphic variants of lung adeno- and squamous cell carcinomas. Furthermore, we performed differential methylation analysis and used a deconvolution method to estimate the composition of the tumor microenvironment. Results: T-distributed stochastic neighbor embedding clearly separated PM from lung adenocarcinomas and squamous cell carcinomas, but there was considerable overlap between chronic pleuritis specimens and PM with low tumor cell content. In a nested cross-validation on the training cohort, both machine learning algorithms achieved the same accuracy (94.8%). On the validation cohort, we observed high accuracy for the support vector machine (97.8%), while the random forest performed considerably worse (89.5%), especially in distinguishing PM from chronic pleuritis. Differential methylation analysis revealed promoter hypermethylation in PM specimens, including in the tumor suppressor genes BCL11B, EBF1, FOXA1, and WNK2. Deconvolution of the stromal and immune cell composition revealed higher rates of regulatory T-cells and endothelial cells in tumor specimens and heterogeneous inflammation, including macrophages, B-cells, and natural killer cells, in chronic pleuritis. Conclusion: DNA methylation in combination with machine learning classifiers is a promising tool to reliably differentiate PM from chronic pleuritis and lung cancer, including pleomorphic carcinomas. 
Furthermore, our study highlights new candidate genes for PM carcinogenesis and shows that deconvolution of DNA methylation data can provide reasonable insights into the composition of the tumor microenvironment.
Article
Full-text available
Warming trends are altering fire regimes globally, potentially impacting the long-term persistence of some ecosystems. However, we still lack a clear understanding of how climatic stressors will alter fire regimes along productivity gradients. We trained a Random Forests model of fire probabilities across a 5° latitude × 2° longitude trans-Andean rainfall gradient in northern Patagonia using a 23-year fire record and biophysical, vegetation, human activity, and seasonal fire weather predictors. The final model was projected onto mid- and late-21st-century fire weather conditions predicted by an ensemble of GCMs under four emission scenarios. We then assessed the vulnerability of different forest ecosystems by matching predicted fire return intervals against critical fire-return thresholds for forest persistence developed with landscape simulations. Modern fire activity showed the typical hump-shaped relationship with productivity and a negative relationship with distance to human settlements. However, fire probabilities were far more sensitive to current-season fire weather than to any other predictor. The sharp responsiveness of fire to the accelerating drier/warmer fire seasons predicted for the remainder of the 21st century in the region led to 2- to 3-fold (RCPs 4.5 and 8.5) and 3- to 8-fold increases in fire probabilities for the mid- and late 21st century, respectively. Contrary to current generalizations of larger impacts of warming on fire activity in fuel-rich ecosystems, our modeling results showed first an increase in predicted fire activity in less productive ecosystems (shrublands and steppes) and later an evenly amplified fire activity-productivity relationship, its shape resembling (at higher fire probabilities) the modern hump-shaped relationship. 
Despite this apparently homogeneous effect of warming on fire activity, vulnerability to the shorter fire intervals predicted for the late 21st century was higher in the most productive ecosystems (subalpine deciduous and evergreen Nothofagus-dominated rainforests), owing to a general lack of fire-adapted traits in the dominant trees that compose these forests.
Article
In this work, the surface-layer states of turned AISI 4140 QT were investigated by means of surface roughness and microhardness measurements. Different machining conditions are considered, namely cutting velocity, feed rate, tool wear, and tool corner radius, as well as the tempering state of the workpiece. The resulting data are analyzed with multiple algorithms to create analytical models for real-time process control. The modeling approaches applied are linear regression, stepwise regression, LASSO, and Elastic Net. Finally, the models are evaluated in terms of quality, complexity, and physical plausibility.
Article
Bark beetles (Coleoptera: Curculionidae) alter forest ecosystem functioning through tree mortality, causing billions of dollars in economic and ecological damages worldwide. Dendroctonus frontalis Zimmermann is considered one of the most significant insect pest species of pine (Pinus spp.) trees in the United States (U.S.), Central America, and Mexico. To manage this threat, research has sought to predict and forecast outbreaks and identify high-risk areas on which to focus preventative management and reduce economic losses. Prior work has focused on using environmental predictors, but limitations have arisen regarding data availability and structure. Our research objective was to improve on current D. frontalis outbreak prediction models using contemporary modeling techniques. Beetle outbreak data were obtained from the United States Department of Agriculture Forest Service (USDA-FS) and were paired with three spatio-temporal beetle outbreak dynamics variables from the prior year; fifteen climate variables (DAYMET and WORLDCLIM) covering temperature, radiation, wind, and water resources; two terrain attributes (NASA), elevation and compound topographic index; and four vegetation indices from the Moderate Resolution Imaging Spectroradiometer (MODIS), including maximum and mean normalized difference vegetation index (NDVI), as predictive features. Extreme gradient boosting was used to create two separate models that predicted the probability and magnitude of beetle outbreaks, which were used to create interpolated prediction maps for the southeastern U.S. The interpolated maps were combined to estimate outbreak risk (i.e., risk = probability of outbreak × magnitude of outbreak). Overall model accuracy was 87.7% when tested on an independent dataset. 
Distance to prior year outbreak and mean NDVI were the most important features when predicting the probability of outbreak, while summer maximum temperature, distance to prior year outbreak, and winter minimum temperature were the highest weighted features when predicting the magnitude. Results indicated that most of the southeastern U.S. was at low risk (<0.0001% damage per hectare) for the years 2008-2020. Risk was highest for 2012, 2016, and 2017. A few areas in Alabama, Georgia, northern Florida, and South Carolina contained stands at higher risk for damage (>0.01% per hectare) and some locations were at risk for >90% damage per hectare. Extreme gradient boosting paired with outbreak probability and magnitude performed well and is proposed as a solution for future bark beetle prediction and forecasting for more timely management strategies. The inclusion of climatic variables in outbreak models allows for forecasting the effects of future climate change on pine pest populations globally.
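The risk surface described above is an elementwise product of the two models' gridded outputs: expected damage per cell equals the probability of an outbreak times its predicted magnitude. A trivial sketch (the cell values and the 75th-percentile cut-off for flagging high-risk cells are made up for illustration):

```python
import numpy as np

# Hypothetical per-cell outputs of the two separate models:
# p_outbreak - predicted probability of an outbreak in each grid cell
# magnitude  - predicted damage magnitude if an outbreak occurs
p_outbreak = np.array([0.05, 0.40, 0.90, 0.10])
magnitude = np.array([0.2, 1.5, 3.0, 0.1])

# Risk surface: expected damage = probability x magnitude, per cell.
risk = p_outbreak * magnitude

# Flag the top quartile of cells for preventative management.
high_risk = risk > np.percentile(risk, 75)
```

Multiplying the two outputs means a cell is only flagged when an outbreak is both likely and damaging, which mirrors how the study's interpolated maps were combined.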
Article
Full-text available
Abstract: Coronary heart disease (CHD) is a major cause of death in Middle Eastern (ME) populations, and current studies of the metabolic fingerprints of CHD lack diversity. Identification of specific biomarkers to uncover potential mechanisms for developing predictive models and targeted therapies for CHD is urgently needed for the least-studied ME populations. A case-control study was carried out in a cohort of 1001 CHD patients and 2999 controls. Untargeted metabolomics was used, generating 1159 metabolites. Univariate and pathway enrichment analyses were performed to understand functional changes in CHD. A metabolite risk score (MRS) was developed to assess the predictive performance for CHD using multivariate analysis and machine learning. A total of 511 metabolites differed significantly between the CHD patients and the controls (FDR p < 0.05). The enriched pathways (FDR p < 10⁻³⁰⁰) included D-arginine and D-ornithine metabolism, glycolysis, oxidation and degradation of branched-chain fatty acids, and sphingolipid metabolism. The MRS showed good discriminative power between the CHD cases and the controls (AUC = 0.99). In this first study in the Middle East, known and novel circulating metabolites and metabolic pathways associated with CHD were identified. A small panel of metabolites can efficiently discriminate between CHD cases and controls and can therefore be used as a diagnostic/predictive tool. Keywords: metabolomics; coronary heart disease; arginine metabolism; metabolite risk score; Middle East
Article
The objective of this retrospective longitudinal study was to evaluate the relationship between dry period length and the production of milk, fat, protein, lactose and total milk solids in the subsequent lactation of Holstein dairy cows under tropical climate. After handling and cleaning of the data provided by the Holstein Cattle Breeders Association of Minas Gerais, data from 32 867 complete lactations of 19 535 Holstein animals that calved between 1993 and 2017 in 122 dairy herds located in Minas Gerais state (Brazil) were analysed. In addition to dry period length, calving age, lactation length, milking frequency, parity, calf status at birth, herd, year, and season of calving were included in the analysis as covariables to account for additional sources of variation. The machine learning algorithms gradient boosting machine, extreme gradient boosting machine, random forest and artificial neural network were used to train models using cross-validation. The best model was selected based on four error metrics and used to evaluate the variable importance, the interaction strength between dry period length and the other variables, and to generate partial dependency plots. Random forest was the best model for all production outcomes evaluated. Dry period length was the third most important variable in predicting milk production and its components. No strong interactions were observed between the dry period and the other evaluated variables. The highest milk and lactose production was observed with a 50-d dry period, while fat, protein, and total milk solids were highest with dry period lengths of 38, 38, and 44 d, respectively. Overall, dry period length is associated with the production of milk and its components in the subsequent lactation of Holstein cows under tropical climatic conditions, but the optimum length depends on the production outcome.
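The partial dependency plots mentioned above average a fitted model's predictions while one feature is held fixed at each value of a grid (Friedman's definition). A minimal sketch with a toy stand-in model — the quadratic dry-period response and all names are illustrative, not the study's fitted random forest:

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """One-dimensional partial dependence: for each grid value, fix
    the chosen feature at that value for every row and average the
    model's predictions over the data."""
    pd_values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd_values.append(model(Xv).mean())
    return np.array(pd_values)

# Toy stand-in for a fitted model: yield rises then falls with
# dry-period length (feature 0), plus a second nuisance feature.
def toy_model(X):
    return -0.01 * (X[:, 0] - 50.0) ** 2 + 0.5 * X[:, 1]

rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(20, 90, 100), rng.normal(0, 1, 100)])
grid = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
curve = partial_dependence(toy_model, X, 0, grid)
```

For this toy model the curve peaks at a dry period of 50 d, the same kind of optimum the study reads off its partial dependency plots for milk and lactose.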