Book

# An Introduction to Statistical Learning: With Applications in R

## Abstract

An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.
... In some research situations, choosing the most appropriate technique for the analysis can be confusing, because several techniques seem to be applicable. To overcome such problems, the researcher should be aware of the major differences between the statistical modeling approaches that could be applied [1]. In addition, the researcher should have a clear idea of the variables that will be used in the research work: whether they are categorical (nominal), ordinal (rank-ordered), interval, or ratio-level. ...
... Non-parametric techniques must be used for categorical and ordinal data, but for interval and ratio data they are generally less powerful and less flexible and should only be used where the standard parametric test is not appropriate, e.g., when the sample size is small [2]. Sample size calculation or power analysis is directly related to the statistical technique that is chosen, because the sample size calculation is based on the power (typically 0.80 is desired) and the effect size (typically a medium or large effect is selected; the larger the effect, the smaller the sample needed) [1] [2] [3]. ...
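The relationship between power, effect size, and sample size described in this snippet can be illustrated with a minimal sketch. It assumes a two-sided, two-sample comparison of means, a standardized effect size (Cohen's d), a significance level of 0.05, and the normal approximation; none of these choices come from the cited paper, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size: float, power: float = 0.80,
                          alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-sided two-sample test of means.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    The test type, alpha, and the use of Cohen's d are illustrative assumptions.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Larger effects need smaller samples at the same power.
    for d in (0.2, 0.5, 0.8):
        print(f"effect size {d}: about {sample_size_per_group(d)} per group")
```

An exact calculation based on the t distribution would give slightly larger values; the approximation is only meant to show the direction and rough size of the effect.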
... A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response. Supervised learning uses classification and regression techniques to develop predictive models [1] [2] [3]. ...
Article
Statistical techniques are important tools in modeling research work. However, outcomes can be misleading if insufficient care is taken in choosing the right approach. Employing the correct analysis in any research work requires deep knowledge of the differences between these tools. Incorrect selection of the modeling technique would create serious problems during the interpretation of the findings and could affect the conclusion of the study. Each technique has its own assumptions and procedures about the data. This paper compares common statistical approaches, including regression vs classification, discriminant analysis vs logistic regression, ridge regression vs LASSO, and decision tree vs random forest. Results show that each approach has its unique statistical characteristics that should be well understood before deciding upon its utilization in the research.
... To solve the classification problem, linear discriminant analysis (LDA) was applied [8]. ...
... When computing the perturbation parameters, each recording was assigned only one parameter. In the second stage, the features of the resulting training set were ranked using the LASSO method [8]. In the third stage, the classifier was trained and tested by cross-validation with K = 4 folds [8]. ...
... In the second stage, the features of the resulting training set were ranked using the LASSO method [8]. In the third stage, the classifier was trained and tested by cross-validation with K = 4 folds [8]. The split into folds was performed at the speaker level, so the test and training sets contained vectors belonging to the voices of different speakers. ...
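A speaker-level split of the kind described above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `group_k_fold`, the round-robin assignment of speakers to folds, and the toy speaker labels are all assumptions.

```python
def group_k_fold(groups, k=4):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides.

    `groups[i]` is the group label (here: a speaker ID) of sample i.
    Groups are assigned to folds round-robin; this is an illustrative choice.
    """
    unique_groups = sorted(set(groups))
    if k > len(unique_groups):
        raise ValueError("need at least as many groups as folds")
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    for fold in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical speaker labels, one per feature vector.
    speakers = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s5"]
    for train_idx, test_idx in group_k_fold(speakers, k=4):
        print("train:", train_idx, "test:", test_idx)
```

Splitting by speaker rather than by vector keeps the same voice out of both the training and test sets, which would otherwise inflate the measured accuracy.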
Article
The paper describes an approach to designing a system for analyzing and classifying a voice signal based on perturbation parameters and cepstral representation. Two variants of the cepstral representation of the voice signal are considered: one based on mel-frequency cepstral coefficients (MFCC) and one based on bark-frequency cepstral coefficients (BFCC). The work used a generally accepted approach to calculating the MFCC based on time-frequency analysis by the discrete Fourier transform (DFT) with summation of energy in subbands. This method approximates the frequency resolution of human hearing, but has a fixed temporal resolution. As an alternative, a variant of the cepstral representation based on the BFCC has been proposed. When calculating the BFCC, a warped DFT-modulated filter bank was used, which approximates the frequency and temporal resolution of hearing. The aim of the work was to compare the effectiveness of features based on the MFCC and BFCC for designing systems for the analysis and classification of the voice signal. The results of the experiment showed that with acoustic features based on the MFCC it is possible to obtain a voice classification system with an average recall of 80.6 %, while with features based on the BFCC this metric is 83.7 %. When the MFCC feature set was supplemented with perturbation parameters of the voice, the average recall of the classification increased to 94.1 %; with a similar addition to the BFCC feature set, it increased to 96.7 %.
... Threshold and SNR threshold parameters were calibrated to determine which provided the greatest proportion of calls extracted with the smallest rates of noise/error introduced. All the numerical call measurement values were subsequently centred and scaled to normalise the data (James et al., 2013). ...
... To determine the optimum number of call parameters to be included in each random forest, we tested for overfitting (the process by which too many parameters included in a model reduces its performance) using 10-fold cross-validations for models containing between 1 and 26 call parameters (James et al., 2013). We also calculated the error rate for the models using between 1 and 500 decision trees to determine which provided the least error for the lowest computational power. ...
... We assessed the relative importance of call parameters using variable importance scores (James et al., 2013) and the system runtime required to train the models. This was measured on an Intel i5 2.50 GHz core processor with 8 GB RAM. ...
Article
Bats comprise a quarter of all mammal species, provide key ecosystem services and serve as effective bioindicators. Automated methods for classifying echolocation calls of free-flying bats are useful for monitoring but are not widely used in the tropics. This is particularly problematic in Southeast Asia, which supports more than 388 bat species. Here, sparse reference call databases and significant overlap among species call characteristics make the development of automated processing methods complex. To address this, we outline a semi-automated framework for classifying bat calls in Southeast Asia and demonstrate how this can reliably speed up manual data processing. We implemented the framework to develop a classifier for the bats of Borneo and tested this at a landscape in Sabah. Borneo has a relatively well-described bat fauna, including reference calls for 52% of all 81 known echolocating species on the island. We applied machine learning to classify calls into one of four call types that serve as indicators of dominant ecological ensembles: frequency-modulated (FM; forest-specialists), constant frequency (CF; forest-specialists and edge/gap foragers), quasi-constant frequency (QCF; edge/gap foragers), and frequency-modulated quasi constant frequency (FMqCF; edge/gap and open-space foragers) calls. Where possible, we further identified calls to species/sonotype. Each classification is provided with a confidence value and a recommended threshold for manual verification. Of the 245,991 calls recorded in our test landscape, 85% were correctly identified to call type and only 10% needed manual verification for three of the call types. The classifier was most successful at classifying CF calls, reducing the volume of calls to be manually verified by over 95% for three common species. The most difficult bats to classify were those with FMqCF calls, with only a 52% reduction in files.
Our framework allows users to rapidly filter acoustic files for common species and isolate files of interest, cutting the total volume of data to be processed by 86%. This provides an alternative method where species-specific classifiers are not yet feasible and enables researchers to expand non-invasive monitoring of bat species. Notably, this approach incorporates aerial insectivorous ensembles that are regularly absent from field datasets despite being important components of the bat community, thus improving our capacity to monitor bats remotely in tropical landscapes.
... This is done with a regularization parameter λ. The optimization problem is according to [11]: ...
... The optimization of the parameters is still a convex optimization problem and is performed, as in linear regression, with iterative methods such as gradient descent [11]. ...
... The data set is split in two at the points where a division has the highest information content. This can be described using the Gini coefficient G, where $p_{mk}$ stands for the proportion of the k-th class in the m-th region [11]: ...
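The formula itself is elided in the snippet above. As a hedged illustration, the sketch below uses the standard form of the Gini index from the cited textbook, $G = \sum_k p_{mk}(1 - p_{mk})$, and scores a binary split by the size-weighted impurity of its two regions. The function names and the weighting scheme are illustrative assumptions, not the cited paper's implementation.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of one region: sum over classes of p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in proportions)


def split_impurity(left_labels, right_labels):
    """Size-weighted Gini index of a binary split (lower is purer)."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (len(left_labels) / n) * gini_index(left_labels) \
        + (len(right_labels) / n) * gini_index(right_labels)


if __name__ == "__main__":
    print(gini_index(["a", "a", "b", "b"]))        # evenly mixed region
    print(split_impurity(["a", "a"], ["b", "b"]))  # perfectly separating split
```

A pure region has an index of zero, so a tree-growing procedure would prefer the candidate split with the lowest weighted value.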
Article
The present paper describes a measurement setup and a related prediction of the electrical impedance of rolling bearings using machine learning algorithms. The impedance of the rolling bearing is expected to be key in determining the state of health of the bearing, which is an essential component in almost all machines. In previous publications, the determination of the impedance of rolling bearings has already been advanced using analytical methods. Despite the improvements in accuracy achieved within the calculations, there are still discrepancies between the calculated and the measured impedance, leading to an approximately constant off-set value. This discrepancy motivates the machine learning approach introduced in this paper. It is shown that with the help of the data-driven methods the difference between analytical prediction and measurement is reduced to the order of up to 2% across the operational range analyzed so far. To introduce the context of the research shown, first the underlying physics of bearing impedance is presented. Subsequently different machine learning approaches are highlighted and compared with each other in terms of their prediction quality in the results part of this paper. As a further aspect, in addition to the prediction of the bearing impedance, it is investigated whether the rotational speed present at the bearing can be predicted from the frequency spectrum of the impedance using order analysis methods which is independent from the force prediction accuracy. The background to this is that, if the prediction quality is sufficiently high, the additional use of speed sensors could be omitted in future investigations.
... SVM is an effective method in different situations. In low dimensions, the flexibility of the separating function can help to find a perfect separation; with high-dimensional data, however, overfitting problems can emerge and, as mentioned in [12], the additional flexibility these models provide is not needed, so a linear function is a good option. ...
... The aim here is to predict which customers will default on their credit card debt, the minority class. This data set is in the ISLR package [12]. ...
Article
The search for separation hyperplanes is an efficient way to find rules for classification purposes. This paper presents an alternative mathematical programming formulation to existing methods to find a discriminant hyperplane. The hyperplane H is found by minimizing the sum of all the distances to the area assigned to the group each individual belongs to. It results in a convex optimization problem for which we find an equivalent linear programming problem. We demonstrate that H exists when the centroids of the two groups are not equal. The method is effective in dealing with low- and high-dimensional data, where dimension reduction is proposed to avoid overfitting problems. We show the performance of this approach with different data sets and comparisons with other classification methods. The method is called LPDA and it is implemented in an R package available at https://github.com/mjnueda/lpda .
... The bootstrap resamples the data with replacement and the estimate is calculated for this new resampled data set. This is repeated many times to form an approximate sampling distribution for the estimate, from which standard errors and confidence intervals can be calculated (James et al., 2021). ...
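The bootstrap procedure described in this snippet can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the default of 2000 resamples, the percentile interval, and the toy data are assumptions rather than details from the cited study.

```python
import random
from statistics import mean, stdev


def bootstrap_estimates(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


def bootstrap_se_and_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Return the bootstrap standard error and a percentile confidence interval."""
    estimates = sorted(bootstrap_estimates(data, statistic, n_resamples, seed))
    standard_error = stdev(estimates)
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return standard_error, (lower, upper)


if __name__ == "__main__":
    # Hypothetical measurements, used only to exercise the functions.
    sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]
    se, (lo, hi) = bootstrap_se_and_ci(sample, mean)
    print(f"mean = {mean(sample):.2f}, SE = {se:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The percentile interval is the simplest of several bootstrap interval constructions and is used here only because it follows directly from the approximate sampling distribution.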
... Simplified versions of the generic COPERT emission factor function and logistic regression functions not used in COPERT were also included. The Akaike Information Criterion (AIC) was used to select the best algorithm (James et al., 2021). AIC estimates the quality of each model fit, relative to each of the other candidate models. ...
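Model selection by AIC can be illustrated with a short sketch. It uses the general definition AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximized likelihood; the candidate names and log-likelihood values below are invented for illustration and do not come from the cited study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Pick the candidate with the lowest AIC.

    `candidates` maps a model name to (maximized log-likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


if __name__ == "__main__":
    # Hypothetical fits: the more complex model gains little likelihood.
    fits = {"simple": (-120.0, 2), "complex": (-119.5, 6)}
    for name, (loglik, k) in fits.items():
        print(name, aic(loglik, k))
    print("selected:", best_by_aic(fits))
```

Because only differences in AIC between candidates matter, the criterion ranks models fitted to the same data and says nothing about absolute fit quality.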
Article
A Portable Emissions Measurement System (PEMS) was used to measure emissions of five sports utility vehicles (SUVs) in a wide range of real-world driving conditions. The program included testing of fuel quality, coast-down and emissions in start, hot running and extended idling conditions. Geo-computation methods were used to add critical information (road gradient) to the PEMS data. Results from this study are generally in good agreement with international PEMS data. Hot running NOx emission factors are on average seven times higher than the type-approval limit for diesel SUVs, and they reach about 2100 and 400 mg/km in urban conditions for NOx and NO2, respectively. They are 7 (NOx) and 4 (NO2) times higher than current emission factors in COPERT Australia. COPERT Australia emission algorithms for CO2 are well behaved and the PEMS data suggest an update is not required. COPERT Australia emission algorithms should be revised for diesel SUVs (NOx, NO2) and petrol SUVs (CO, THC, NO2) to ensure accurate estimation of vehicle emissions at fleet level. Inclusion of logistic regression is proposed for future COPERT updates.
... able (James et al., 2013). This method fits the linear relationship between input features and the target (observed data) using the least-squares approach. ...
... This method fits the linear relationship between input features and the target (observed data) using the least-squares approach. In the least-squares approach, the best model is obtained by minimizing the sum of the squared distances between the calculated values (the model outputs) and the target values (James et al., 2013). This algorithm is the most straightforward approach in ML models and is generally used as the baseline method. ...
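The least-squares idea can be shown for the simplest case of one input feature, where the minimizer of the residual sum of squares has a closed form. This is a sketch under that single-feature assumption (the cited study uses multiple linear regression); the function names and data are illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single input feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared distances between predictions and targets."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    y = [1.1, 2.9, 5.2, 6.8]  # hypothetical observations
    b0, b1 = fit_line(x, y)
    print(b0, b1, residual_sum_of_squares(x, y, b0, b1))
```

Any other intercept and slope would give a residual sum of squares at least as large as the fitted pair, which is what makes the fit a natural baseline.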
Article
Flood forecasting based on hydrodynamic modeling is an essential non-structural measure against compound flooding across the globe. With the risk increasing under climate change, all coastal areas are now in need of flood risk management strategies. Unfortunately, for local water management agencies in developing countries, building such a model is challenging due to the limited computational resources and the scarcity of observational data. We attempt to solve this issue by proposing an integrated hydrodynamic and machine learning (ML) approach to predict water level dynamics as a proxy for the risk of compound flooding in a data-scarce delta. As a case study, this integrated approach is implemented in Pontianak, the densest coastal urban area over the Kapuas River delta, Indonesia. Firstly, we build a hydrodynamic model to simulate several compound flooding scenarios. The outputs are then used to train the ML model. To obtain a robust ML model, we consider three ML algorithms, i.e., random forest (RF), multiple linear regression (MLR), and support vector machine (SVM). Our results show that the integrated scheme works well. The RF is the most accurate algorithm to model water level dynamics in the study area. Meanwhile, the ML model using the RF algorithm can predict 11 out of 17 compound flooding events during the implementation phase. It could be concluded that RF is the most appropriate algorithm to build a reliable ML model capable of estimating the river's water level dynamics within Pontianak, whose output can be used as a proxy for predicting compound flooding events in the city.
... (k, l) = {(4,8),(6,8),(8,4)}), balancing the trade-off between bias and variance [46]. ... report top-5 models for each configuration setting. ...
... For example, the best model was found with the 8-qubit simulation including 4 layers, yielding a minimal test error in volume measurements. The dynamics shown in the pink curve indicate the balance of the bias-variance trade-off [46]. ...
Preprint
Quantifying the dynamics of tumor burden reveals useful information about cancer evolution concerning treatment effects and drug resistance, which play a crucial role in advancing model-informed drug developments (MIDD) towards personalized medicine and precision oncology. The emergence of Quantum Machine Intelligence offers unparalleled insights into tumor dynamics via a quantum mechanics perspective. This paper introduces a novel hybrid quantum-classical neural architecture named $\eta$-Net that enables quantifying quantum dynamics of tumor burden concerning treatment effects. We evaluate our proposed neural solution on two major use cases, including cohort-specific and patient-specific modeling. In silico numerical results show a high capacity and expressivity of $\eta$-Net for the quantified biological problem. Moreover, the close connection to representation learning, the foundation for the successes of modern AI, enables efficient transferability of empirical knowledge from relevant cohorts to targeted patients. Finally, we leverage Bayesian optimization to quantify the epistemic uncertainty of model predictions, paving the way for $\eta$-Net towards reliable AI in decision-making for clinical use.
... The reason for shifting from FCTs to FRF is similar to that in the non-functional context, that is, to lower the variability of estimates due to the presence of correlated FCTs (Breiman 2004; Hastie et al. 2009; James et al. 2013). Effectively, FRF creates many FCTs on B bootstrap replicates of the original dataset, decorrelating the FCT. ...
... Concerning the assessment of the functional classifiers' accuracy, different strategies are available, as in the non-functional framework. Indeed, for both FKNN and FRF, it is possible to exploit cross-validation, bootstrap, or a validation test set (Hastie et al. 2009; James et al. 2013). In the FDA context, the use of a functional test set is of particular appeal because the test functions must be described according to the same basis system used to define the functional training set. ...
Article
This paper offers a supervised classification strategy that combines functional data analysis with unsupervised and supervised classification methods. Specifically, a two-steps classification technique for high-dimensional time series treated as functional data is suggested. The first stage is based on extracting additional knowledge from the data using unsupervised classification employing suitable metrics. The second phase applies functional supervised classification of the new patterns learned via appropriate basis representations. The experiments on ECG data and comparison with the classical approaches show the effectiveness of the proposed technique and exciting refinement in terms of accuracy. A simulation study with six scenarios is also offered to demonstrate the efficacy of the suggested strategy. The results reveal that this line of investigation is compelling and worthy of further development.
... Motivated by the necessity of such a classification scheme in supercooled liquids, a computational approach via ML algorithms such as PCA and K-means clustering [6][7][8] is developed to demonstrate that the structures of supercooled liquids can be classified into a few structurally distinct nano-domains that tile up the whole configurational space with long lifetimes and dynamically differ from each other by calculating diffusion constant distributions from the mean square displacements (MSD). ...
Preprint
A computational approach that applies Principal Component Analysis (PCA) and Gaussian Mixture (GM) clustering, two Machine Learning (ML) methods, to identify domain structures of supercooled liquids is developed. Raw feature data are collected from the coordination numbers of particles, smoothed using the radial distribution function, and are used as an order parameter of disordered structures for GM clustering after dimensionality reduction by PCA. To transfer the knowledge from feature (structural) space to configurational space, another GM clustering is performed using the Cartesian coordinates as an order parameter, with the particles' identities taken from the GM clustering in feature space. Both GM clusterings are performed iteratively until convergence. Results show the appearance of aggregated clusters of nano-domains over sufficiently long timescales in both structural and configurational spaces, with heterogeneous dynamics. More importantly, consistent nano-domains tiling the whole space regardless of system size are observed, and our approach can be applied to any disordered system.
... To maximize classification performance and cope with the bias-variance trade-off, this approach must determine the optimal value of k, the number of neighbors. Optimal choices of k keep the bias-variance balance in check and, ideally, reduce both [34]. d) Logistic Regression (LR): LR is a classification algorithm generally used in binary classification problems [35], as is the case here with negative (0) and positive (1) response values. ...
Article
The volume of data in cancerology is continuously increasing, yet the vast majority of this data is not being used to uncover useful and hidden insights. As a result, one of the key goals of physicians for therapeutic decision-making during multidisciplinary consultation meetings (MCM) is to combine prediction tools based on data and best practices. The current study looked into using CRISP-DM machine learning algorithms to predict metastatic recurrence in patients with early-stage (non-metastatic) breast cancer so that treatment-appropriate medicine may be given to lower the likelihood of metastatic relapse. From 2014 to 2021, data from patients with localized breast cancer were collected at the Regional Oncology Center in Meknes, Morocco. There were 449 records in the dataset, with 13 predictor variables and one outcome variable. To create predictive models, we used machine learning techniques such as Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbors (KNN) and Logistic Regression (LR). The main objective of this article is to compare the performance of these four algorithms on our data in terms of sensitivity, specificity and precision. According to our results, the accuracies of SVM, KNN, LR and NB are 0.906, 0.861, 0.806 and 0.517 respectively. With the fewest errors and maximum accuracy, the SVM classification model predicts metastatic breast cancer relapse. The unbiased prediction accuracy of each model is assessed using a 10-fold cross-validation method.
... Our generic ISE algorithm has been applied to many problems related to drug discovery and has been presented in reviews, with details of the mathematical and statistical criteria to distinguish between two activities based on physicochemical properties (descriptors) of known active vs. inactive compounds (Stern and Goldblum, 2014;El-Atawneh and Goldblum, 2017). For each model, five cross-validations were performed (James et al., 2013), with 4 out of the five-folds producing the model, and the fifth fold was used as a test set. We include some of the main details of model construction and screening in Supplementary Data section 1.1. ...
Article
In recent years, the cannabinoid type 2 receptor (CB2R) has become a major target for treating many disease conditions. The old therapeutic paradigm of “one disease-one target-one drug” is being transformed to “complex disease-many targets-one drug.” Multitargeting, therefore, attracts much attention as a promising approach. We thus focus on designing single multitargeting agents (MTAs), which have many advantages over combined therapies. Using our ligand-based approach, the “Iterative Stochastic Elimination” (ISE) algorithm, we produce activity models of agonists and antagonists for desired therapeutic targets and anti-targets. These models are used for sequential virtual screening and scoring large libraries of molecules in order to pick top-scored candidates for testing in vitro and in vivo. In this study, we built activity models for CB2R and other targets for combinations that could be used for several indications. Those additional targets are the cannabinoid 1 receptor (CB1R), peroxisome proliferator-activated receptor gamma (PPARγ), and 5-Hydroxytryptamine receptor 4 (5-HT4R). All these models have high statistical parameters and are reliable. Many more CB2R/CB1R agonists were found than combined CB2R agonists with CB1R antagonist activity (by 200 fold). CB2R agonism combined with PPARγ or 5-HT4R agonist activity may be used for treating Inflammatory Bowel Disease (IBD). Combining CB2R agonism with 5-HT4R generates more candidates (14,008) than combining CB2R agonism with agonists for the nuclear receptor PPARγ (374 candidates) from an initial set of ∼2.1 million molecules. Improved enrichment of true vs. false positives may be achieved by requiring a better ISE score cutoff or by performing docking. Those candidates can be purchased and tested experimentally to validate their activity. Further, we performed docking to CB2R structures and found lower statistical performance of the docking (“structure-based”) compared to ISE modeling (“ligand-based”).
Therefore, ISE modeling may be a better starting point for molecular discovery than docking.
... The SVM performs well in a variety of settings due to its use of a maximal margin classifier. The maximal margin classifier uses a hyperplane to classify and separate observations by computing the maximum distance of an observation to the hyperplane and then determining the class of the observation based on which side of the hyperplane it falls on (Gareth et al., 2013). Additionally, SVMs can enlarge the feature space of the data using kernels to accommodate nonlinear boundaries between classes and simplify the inner product, which overcomes the dimensionality of the data. ...
Article
Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust (caused by Puccinia striiformis f. sp. tritici) are commonly recorded in ordinal scales and percentages. Disease severity (SEV) and infection type (IT) data in germplasm screening nurseries generally do not follow these assumptions. In this regard, researchers may ignore the lack of normality, transform the phenotypes, use generalized linear models, or use supervised learning algorithms and classification models with no restriction on the distribution of response variables, which are less sensitive when modeling ordinal scores. The goal of this research was to compare classification and regression genomic selection models for skewed phenotypes using stripe rust SEV and IT in winter wheat. We extensively compared both regression and classification prediction models using two training populations composed of breeding lines phenotyped in 4 years (2016-2018 and 2020) and a diversity panel phenotyped in 4 years (2013-2016). The prediction models used 19,861 genotyping-by-sequencing single-nucleotide polymorphism markers. Overall, square root transformed phenotypes using ridge regression best linear unbiased prediction and support vector machine regression models displayed the highest combination of accuracy and relative efficiency across the regression and classification models. Furthermore, a classification system based on support vector machine and ordinal Bayesian models with a 2-Class scale for SEV reached the highest class accuracy of 0.99. This study showed that breeders can use linear and non-parametric regression models within their own breeding lines over combined years to accurately predict skewed phenotypes.
... The second-'test'-sample was used to test the accuracy of the algorithm in ranking the offender among the top suspects. 10 Separating these analyses prevented the risk of 'overfitting' the likelihood values to the data, which would reduce the algorithm's generalisability to new data and produce inflated estimates of its accuracy (James et al., 2013). Figure 1 depicts a high-level summary of the GP-SMART process that maps and ranks suspects, for a given input offence and list of possible suspects. ...
Article
This study developed and tested a new geographic profiling method for automating suspect prioritisation in crime investigations. The Geographic Profiling Suspect Mapping And Ranking Technique (GP‐SMART) maps suspects' activity locations available in police records—such as home addresses, family members' home addresses, prior offence locations, locations of non‐crime incidents, and other contacts with police—and ranks suspects based on both the proximity and nature of these locations, relative to an input crime. In accuracy tests using solved burglary, robbery and extra‐familial sex offence cases in New Zealand (n = 4511), GP‐SMART ranked the offender at or near the top of the suspect list at rates greatly exceeding chance. Highlighting the benefit of its novel inclusion and differentiation of many different types of activity location, GP‐SMART also outperformed baseline methods—approximating existing algorithms—that ranked suspects using only the proximity of their activity locations, or home addresses, to the input crime.
... This approach divides the set of 16062 segments into 10 groups (folds) of approximately equal size. The first fold is treated as a test set and the method is fit on the remaining nine folds; this is repeated with each fold in turn as the test set, and finally the accuracy is averaged over all test groups [32]. ...
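The 10-fold procedure described in this snippet can be sketched as follows. The helper names, the contiguous (unshuffled) fold assignment, and the `fit_and_score` callback are illustrative assumptions; in practice the indices would usually be shuffled first, and the callback would train a model on the training indices and return its accuracy on the test indices.

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n, k, fit_and_score):
    """Average the score over k folds, each used once as the test set.

    `fit_and_score(train_idx, test_idx)` is a user-supplied (hypothetical)
    callback that fits on train_idx and returns the accuracy on test_idx.
    """
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k


if __name__ == "__main__":
    sizes = [len(fold) for fold in k_fold_indices(16062, 10)]
    print(sizes)  # fold sizes for the 16062 segments mentioned above
```

With 16062 items and 10 folds, the fold sizes differ by at most one, which is what "approximately equal size" amounts to here.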
Article
Differentiating between shockable and non-shockable Electrocardiogram (ECG) signals would increase the success of resuscitation by Automated External Defibrillators (AEDs). In this study, a Deep Neural Network (DNN) algorithm is used to promptly distinguish 1.4-second segments of shockable signals from non-shockable signals. The proposed technique is frequency-independent and is trained with signals from diverse patients extracted from MIT-BIH, the MIT-BIH Malignant Ventricular Ectopy Database (VFDB), and a database for ventricular tachyarrhythmia signals from Creighton University (CUDB), resulting in an accuracy of 99.1%. Finally, the optimized version of the model is loaded onto a Raspberry Pi minicomputer. Testing the implemented model on the processor with unseen ECG signals resulted in an average latency of 0.845 seconds, meeting the IEC 60601-2-4 requirements. According to the evaluated results, the proposed technique could be used by AEDs.
... The process was repeated five times, each time holding out a different fold for testing and training on the remaining k-1 folds, which yielded a different accuracy on each run. This strategy allowed for an objective, less biased and less optimistic estimation of the model's performance than other methods (James et al., 2013). ...
Article
Full-text available
An accumulated body of choice research has demonstrated that choice behavior can be understood within the context of its history of reinforcement by measuring response patterns. Traditionally, work on predicting choice behaviors has been based on the relationship between the history of reinforcement (the reinforcer arrangement used in training conditions) and choice behavior. We suggest an alternative method that treats the reinforcement history as unknown and focuses only on operant choices to accurately predict (more precisely, retrodict) reinforcement histories. We trained machine learning models known as artificial spiking neural networks (SNNs) on previously published pigeon datasets to detect patterns in choices with specific reinforcement histories: seven arranged concurrent variable-interval schedules in effect for nine reinforcers. Notably, the SNNs extracted information from a small 'window' of observational data to predict reinforcer arrangements. The models' generalization ability was then tested with new choices of the same pigeons to predict the type of schedule used in training. We examined whether the amount of data provided affected the prediction accuracy, and our results demonstrated that choices made by the pigeons immediately after the delivery of reinforcers provided sufficient information for the model to determine the reinforcement history. These results support the idea that SNNs can process small sets of behavioral data for pattern detection when the reinforcement history is unknown. This novel approach can influence our decisions to determine appropriate interventions; it can be a valuable addition to our toolbox, for both therapy design and research.
... In this case, the analyzed model should be corrected. Less liberal assumptions indicate that a VIF value > 5 means moderate multicollinearity (34), which is a cause for concern. The VIF values for the EBF and MF mothers' regression models were not greater than 5 (from 1.014 to 1.020 for the EBF group and from 1.014 to 1.246 for the MF group). ...
Article
Full-text available
Background: Although breastfeeding is recommended by the WHO and professionals as the most beneficial for newborn babies, many women find it challenging. Previous research yielded ambiguous results concerning the role of breastfeeding in the development of postpartum depression. The study aimed to identify the best predictors of depressive symptoms for each of these feeding methods. Methods: The participants were 151 women (mean age 29.4 yrs; SD = 4.5) who gave birth within the last 6 months and included 82 women classified as breastfeeding, 38 classified as mixed-feeding (breast and bottle), and 31 as formula-feeding. The study had a cross-sectional design using a web-based survey for data collection. The following measures were administered: The Edinburgh Postnatal Depression Scale; Sense of Stress Questionnaire; The Postpartum Bonding Questionnaire; Parenting Sense of Competence Scale; Infant Feeding Questionnaire. Results: Women in the study groups differed in stress, bonding difficulties, and beliefs related to feeding practices and infancy. There were no significant differences in the severity of depressive symptoms, but all mean EPDS scores were above 12. Maternal satisfaction, intrapsychic stress, and concerns about feeding on a schedule were the best predictors of EPDS scores for breastfeeding women. For the mixed-feeding group, predictors were emotional tension, concern about infant's hunger, overeating, and awareness of infant's hunger and satiety cues; for the formula-feeding group, predictors included emotional tension, bonding difficulties, and such maternal feeding practices and beliefs as concern about undereating, awareness of infant's hunger and satiety cues, concerns about feeding on a schedule and social interaction with the infant during feeding. Conclusion: Differences in predictors of postpartum depression for the study groups suggest that breastfeeding itself may not be a risk for postpartum depression. 
However, the specificity of maternal experiences with the various types of feeding is related to difficulties promoting postpartum depression. Providing emotional and educational support appropriate for different types of feeding may be an essential protective factor for postnatal depression.
... We perform Leave-one-out Cross-validation (LOOCV) to evaluate the HB prediction model. James, Witten, Hastie, and Tibshirani (2013) indicate that LOOCV is K-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set. This means that for N separate times, the function approximator is trained on all data except for one point, and a prediction is made for that point. ...
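The LOOCV loop described in the excerpt above (train on all data except one point, predict that point, repeat N times) can be sketched as below. A simple least-squares line stands in for the citing authors' Hierarchical Bayesian model; the function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def loocv_mse(xs, ys):
    """Leave-one-out CV: K-fold cross-validation with K equal to N."""
    squared_errors = []
    for i in range(len(xs)):
        # Train on every point except the i-th one.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        # Predict the single held-out point and record its squared error.
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Toy data lying exactly on y = 2x + 1 (an assumption for illustration),
# so every held-out prediction should be essentially exact.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(loocv_mse(xs, ys))
```

The model is refit N times, so the cost grows with the size of the data set unless the model admits a shortcut.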
Article
Full-text available
Few past studies have tackled the relationship between marketing strategies and revenue forecasts of live streamers, not to mention the influence of streamer heterogeneity. This study applies the Hierarchical Bayesian (HB) model to examine the predictive effects of viewers' comments and streamers' behaviors on viewers' gift-sending behavior in live streaming while considering the effect of streamer heterogeneity. In particular, we empirically analyze 38,183 samples of time data from 10 food live-stream samples. We find that the effects of viewers' comment features and streamers' marketing strategies on viewers' gift-sending behavior are mainly influenced by the cross-level effect of streamers' heterogeneities. These results reveal that existing live-streaming studies might have overlooked the impact of streamers' heterogeneities, offering only biased conclusions. Finally, the model proposed in this study has good predictive accuracy for live streamer revenue. Keywords: word-of-mouth, discrete emotion theory, live streamer's behavior and characteristics, gift-sending, Hierarchical Bayesian model
... Considering the multicollinearity assumptions, no correlation above .80 was observed between any variables (see Table 1). Since VIF values (between 1.32 and 2.21) are less than 10 (James et al., 2013) and tolerance values (between .45 and .76) are greater than .10 ...
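The VIF and tolerance figures quoted in the excerpt above are tied together by two standard identities: VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others, and tolerance = 1 / VIF. A small sketch follows; the function names are hypothetical, and the cut-offs of 5 and 10 are the rules of thumb mentioned in the citing excerpts on this page.

```python
def vif_from_r_squared(r_squared):
    """VIF of a predictor, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)

def tolerance(vif):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 / vif

def collinearity_flag(vif, moderate=5.0, severe=10.0):
    """Label a VIF using rule-of-thumb cut-offs (assumed here: 5 and 10)."""
    if vif > severe:
        return "severe"
    if vif > moderate:
        return "moderate"
    return "acceptable"

# The VIF range reported in the excerpt reproduces its tolerance range.
for vif in (1.32, 2.21):
    print(vif, round(tolerance(vif), 2), collinearity_flag(vif))
```

The reciprocals of 1.32 and 2.21 round to .76 and .45, so the two ranges in the excerpt are mutually consistent.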
Article
Full-text available
Although existential loneliness seems to be a natural consequence of being human, some people may experience it more intensely. This study investigated whether frustration intolerance, one of the basic concepts of Rational Emotive Behavior Therapy, and psychological need frustration, the basic concept of Self-Determination Theory, predict existential loneliness. A total of 294 adults were included in the study. The results showed that existential loneliness was directly predicted by frustration intolerance. In the mediation test, all dimensions of psychological need frustration (autonomy frustration, relatedness frustration, and competence frustration) fully mediated the relationship between frustration intolerance and existential loneliness. The place of these findings in the literature is discussed and some recommendations are made.
... For detailed considerations of the methods used here, excellent reference materials are available. 22,23 [Figure 2: Scaling factor (β) as a function of mean (μ) and standard deviation (σ) of wind speed.] [Figure 3: Scaling factor (β) joint variations with mean (μ) and standard deviation (σ) of wind speed for (left) linear de-trending and (right) filtered de-trending.] A wide range of machine learning models were assessed within the study to quantify the ability of each to identify the unknown functional relationship of Equation 3. These models have varying complexities and characteristics, as follows: ...
Article
Full-text available
This paper considers the removal of low‐frequency trend contributions from turbulence intensity values at sites for which only 10‐min statistics in wind speed are available. It is proposed the problem be reformulated as a direct regression task, solvable using machine learning techniques in conjunction with training data formed from measurements at sites for which underlying (non‐averaged) wind data are available. Once trained, the machine learning models can de‐trend sites for which only 10‐min statistics have been retained. A range of machine learning techniques are tested, for cases of linear and filtered approaches to de‐trending, using data from 14 sites. Results indicate this approach allows for excellent approximation of de‐trended turbulence intensity distributions at unobserved sites, providing significant improvements over the existing recommended method. The best results were obtained using Neural Network, Random Forest and Boosted Tree models.
... However, the marginal effect of reducing the sum of variances tends to diminish. A heuristic for selecting a suitable number of clusters is therefore to look for the inflection point in the curve of the within-cluster sum of variances, that is, the "elbow" in the curve (Han, Kamber and Pei, 2012; James et al., 2013). ...
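The elbow heuristic in the excerpt above can be illustrated by computing the within-cluster sum of squares for several values of k and looking for the point where the curve flattens. The sketch below uses a one-dimensional k-means with a deterministic start; the data, the initialization rule, and the function name are assumptions made only for this example.

```python
def kmeans_1d(points, k, max_iter=100):
    """Lloyd's algorithm in one dimension; returns the within-cluster sum of squares."""
    data = sorted(points)
    n = len(data)
    # Deterministic start: centers at evenly spaced positions in the sorted data.
    centers = [data[int((j + 0.5) * n / k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in data]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = sum(members) / len(members)
    return sum((x - centers[lab]) ** 2 for x, lab in zip(data, labels))

# Toy data with three tight, well-separated groups.
points = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 9.9, 10.0, 10.1]
wcss_curve = [kmeans_1d(points, k) for k in range(1, 6)]
for k, wcss in enumerate(wcss_curve, start=1):
    print(k, round(wcss, 3))
```

On this data the curve falls steeply up to k = 3 and barely moves afterwards, which is the "elbow" the heuristic looks for. On real data the bend is often less clear-cut and the choice remains a judgment call.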
Chapter
Full-text available
The objective of this chapter is to analyze the data on rural agro-industry (AGR) from the 2017 agricultural census in order to build a profile of experiences in Brazil, comparing the results across the five macro-regions and the two types of agriculture: family farming (AF) and non-family farming (ANF). To a lesser extent, it also discusses the lack of public policies, which weakens agro-industries, and highlights their importance for regional development processes in the places where they operate. The data used are quantitative and come from the 2017 agricultural census of the IBGE. The various indicators on AGRs were taken from the online database called Sistema IBGE de Recuperação Automática (Sidra).
... However, the marginal effect of reducing the sum of variances tends to diminish. A heuristic for selecting a suitable number of clusters is therefore to look for the inflection point in the curve of the within-cluster sum of variances, that is, the "elbow" in the curve (Han, Kamber and Pei, 2012; James et al., 2013). ...
Book
Full-text available
Diversity, multifunctionality, heterogeneity, and rural and agricultural public policies in Brazil. Analyses of data from the Agricultural Census and of the federal public budget for agriculture, with a regional breakdown. Considerations on productive and developmental advances and difficulties in agro-industry and in technical assistance and rural extension (Ater). A brief comparative analysis of European Union and United States policies, with a focus on family farming and small-scale production.
... More information on how hierarchical clustering works can be found elsewhere. 46 ...
Article
A method for predicting the effect of solvent on the morphology of organic crystals is presented, providing an efficient screening tool for identifying ideal crystallization solvents. The solvent effect is estimated by the computation of chemical potentials and activity coefficients of crystal surfaces using a first principles-based statistical thermodynamics approach. Density functional theory and COSMO-RS are utilized to determine the activity coefficients of the crystal growth faces of a selection of active pharmaceutical ingredients (APIs) in solvents across a broad range of polarities. The ability of COSMO-RS to predict and quantify the effects of solvent on crystal growth and morphology is assessed using hierarchical clustering to classify the solvents according to their overall interaction strength with the crystal faces. The COSMO-RS approach allows for a physical interpretation of the predictions in terms of surface polarity and is confirmed by comparison to published experimental data. Herein a methodology is reported for automated computation of the activity coefficients of all solvent-surface pairs directly from the drug crystal structure. The procedure goes beyond the traditional trial-and-error solvent selection process and has the potential to be used as a rapid computational screening tool in pharmaceutical drug development.
... The procedure is iterated over the same dataset for all folds. This method was preferred to the leave-one-out cross-validation (LOOCV) usually applied to small datasets, because it provides a more accurate estimate of the test error rate (Gareth et al., 2014) and a lower variance than LOOCV (Efron, 1983). The higher the k value, the higher the accuracy in cross-validation (Yadav and Shukla, 2016), but this can lead to overfitting. ...
Article
Full-text available
Pinot blanc is a leading grapevine variety in South Tyrol (Italy) for wine production. The high quality of its wines derives from a typical aroma of elegant apple notes and lively acidity. The typicity of the final wine depends on the origin of the vine, the soil, the oenological practices and time of harvest. The South Tyrolean mountainous areas meet the cold climatic requirements of Pinot blanc, which guarantee its sweet-acidic harmony obtained when organic acids are in balance with the other components of the wine. However, increasing temperatures in valley sites during the berry development period boost the activity of malic acid (MA) enzymes, which negatively affect the final sugar/acid ratio. Researchers are currently focused on understanding acid dynamics in wines, and there are no references for the best sugar/acid ratio for Pinot blanc. Moreover, the contribution of individual acids to the sensory profile of this wine has not yet been studied. In this study we address the effect of different climate conditions and site elevations on the sugar/acid ratio in developing grapevine berries, and we evaluate the effect on wine bouquet. Although different models and indices have been proposed for predicting sugar content, no predictive models exist for MA in white grapes. In a three-year study (2017, 2018 and 2019) that involved eight vineyards in four different locations in South Tyrol at various elevations ranging from 223 to 730 m a.s.l., the relationships between bioclimatic indices, such as growing-degree day (GDD) and grapevine sugar ripeness (GSR), and grapevine berry content were investigated. The analysis reveals that GDD may potentially predict MA dynamics in Pinot blanc; hence, a GDD-based model was used to determine the GDD to reach target MA concentrations (3.5, 3.0, 2.5, 2.0 g/L). 
This simple model was improved with additional temperature-based parameters by feature selection, and the best three advanced models were selected and evaluated by 5-fold cross-validation. These models could be used to support location and harvest date choice to produce high-quality Pinot blanc wines. Keywords: Pinot blanc, South Tyrol, sugar, GDD, GSR, organic acids, MA
... In this model, the lasso penalty term should be added to the model's loss function and minimized as well as the squared errors term. In this case, the new model with the lasso penalty is called the sparse model, in which it includes only the essential variables (James, Witten, Hastie, & Tibshirani, 2013). The following equation is the loss function in the VARX-L model after adding the Lasso penalty term: ...
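For orientation, the generic lasso objective for an ordinary linear regression is shown below. This is not the VARX-L loss, whose equation is elided in the excerpt above; it is the standard textbook form, and the symbols (response y_i, predictors x_ij, coefficients β_j, tuning parameter λ ≥ 0) are assumptions introduced here.

```latex
\min_{\beta_0,\,\beta}\;
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the squared-error term and the second is the lasso penalty. For large enough λ, some coefficients are set exactly to zero, which is what makes the fitted model sparse.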
... Within the field of fraud detection, machine learning can pose a potential method of identifying fraudulent activity or perpetrators of fraud. This can drastically minimize the number of individuals negatively impacted by fraud [1], [2]. ...
Article
Full-text available
The use of technology has benefited society in more ways than one ever thought possible. Unfortunately, as society’s knowledge of technology has advanced, so has its knowledge of ways to use technology to manipulate others. This has led to a simultaneous advancement in the world of fraud. Machine learning techniques can offer a possible solution to help decrease these advancements. This research explores how the use of various machine learning techniques can aid in detecting fraudulent activity across two different types of fraudulent datasets, and the accuracy, precision, recall, and F1 were recorded for each method. Each machine learning model was also tested across five different training and testing splits in order to discover which split and technique would lead to the most optimal results.
... With the variables significantly correlated, predictive models were generated using multiple hierarchical linear regression techniques (Supo 2016). The efficiency of the models was determined on the basis of the highest adjusted R² value, which allows the selection of the best models; the closer this value is to 1, the larger the proportion of the variance of the response variable the model is able to explain (James et al. 2013). ...
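The adjusted R² criterion used in the excerpt above can be computed from the residual and total sums of squares with the standard formula 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)], where n is the number of observations and d the number of predictors. A minimal sketch follows; the function names and the example numbers are hypothetical.

```python
def r_squared(rss, tss):
    """Proportion of variance explained: 1 - RSS / TSS."""
    return 1.0 - rss / tss

def adjusted_r_squared(rss, tss, n, d):
    """Adjusted R^2 for a model with d predictors fitted to n observations."""
    return 1.0 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical comparison: the second model adds two predictors but barely
# lowers the RSS, so its adjusted R^2 falls even though its plain R^2 rises.
small = adjusted_r_squared(rss=5.0, tss=10.0, n=11, d=2)
large = adjusted_r_squared(rss=4.9, tss=10.0, n=11, d=4)
print(small, large)
```

Unlike plain R², the adjusted version penalizes extra predictors, which is why it is suitable for choosing among models of different sizes.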
Article
Full-text available
Introduction A substantial proportion of individuals infected with severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), report persisting symptoms weeks and months following acute infection. Estimates on prevalence vary due to differences in study designs, populations, heterogeneity of symptoms and the way symptoms are measured. Common symptoms include fatigue, cognitive impairment and dyspnoea. However, knowledge regarding the nature and risk factors for developing persisting symptoms is still limited. Hence, in this study, we aim to determine the prevalence, severity, risk factors and impact on quality of life of persisting symptoms in the first year following acute SARS-CoV-2 infection. Methods and analysis The LongCOVID-study is both a prospective and retrospective cohort study being conducted in the Netherlands, with a one year follow-up. Participants aged 5 years and above, with self-reported positive or negative tests for SARS-CoV-2 will be included in the study. The primary outcome is the prevalence and severity of persistent symptoms in participants that tested positive for SARS-CoV-2 compared with controls. Symptom severity will be assessed for fatigue (Checklist Individual Strength (CIS subscale fatigue severity)), pain (Rand-36/SF-36 subscale bodily pain), dyspnoea (Medical Research Council (mMRC)) and cognitive impairment (Cognitive Failure Questionnaire (CFQ)). Secondary outcomes include effect of vaccination prior to infection on persistent symptoms, loss of health-related quality of life (HRQoL) and risk factors for persisting symptoms following infection with SARS-CoV-2. Ethics and dissemination The Utrecht Medical Ethics Committee (METC) declared in February 2021 that the Medical Research Involving Human Subjects Act (WMO) does not apply to this study (METC protocol number 21-124/C). Informed consent is required prior to participation in the study. Results of this study will be submitted for publication in a peer-reviewed journal.
Article
Full-text available
Being able to classify experienced emotions by identifying distinct neural responses has tremendous value in both fundamental research (e.g. positive psychology, emotion regulation theory) and in applied settings (clinical, healthcare, commercial). We aimed to decode the neural representation of the experience of two discrete emotions: sadness and disgust, devoid of differences in valence and arousal. In a passive viewing paradigm, we showed emotion evoking images from the International Affective Picture System to participants while recording their EEG. We then selected a subset of those images that were distinct in evoking either sadness or disgust (20 for each), yet were indistinguishable on normative valence and arousal. Event-related potential analysis of 69 participants showed differential responses in the N1 and EPN components and a support-vector machine classifier was able to accurately classify (58%) whole-brain EEG patterns of sadness and disgust experiences. These results support and expand on earlier findings that discrete emotions do have differential neural responses that are not caused by differences in valence or arousal.
Article
Full-text available
Since the publication of the Millennium Ecosystem Assessment, research on ecosystem services valuation has seen exponential growth, with a consequent development, improvement, and spread of different qualitative and quantitative methods. The interest is due to the benefits that ecosystem services provide for human wellbeing. A large part of ecosystem services is provided by the so-called forest ecosystem services (FES) in both protected and non-protected areas. The aim of the present study is to investigate key variables driving the FES value at the global level. These include, other than socio-economic information, the ecosystem services' quality condition and the location of the study. The research uses a meta-regression of 478 observations from 57 studies in the time span 1992–2021 retrieved from the online Ecosystem Service Valuation Database (ESVD). The main results show that both the ES quality condition and the spatial aspect are relevant factors in determining the estimated value of FES, suggesting the existence of a difference in the forest value from a North-South perspective. The investigation of an economic assessment of FES is advised as a key research trend in the immediate future. This would help close the gap between the global North and South and favor the implementation of adequate socio-economic and environmental governance for efficient forest management.
Article
Full-text available
This paper reviews dilemmas and implications of erroneous data for clinical implementation of AI. It is well-known that if erroneous and biased data are used to train AI, there is a risk of systematic error. However, even perfectly trained AI applications can produce faulty outputs if fed with erroneous inputs. To counter such problems, we suggest 3 steps: (1) AI should focus on data of the highest quality, in essence paraclinical data and digital images, (2) patients should be granted simple access to the input data that feed the AI, and granted a right to request changes to erroneous data, and (3) automated high-throughput methods for error-correction should be implemented in domains with faulty data when possible. Also, we conclude that erroneous data is a reality even for highly reputable Danish data sources, and thus, legal framework for the correction of errors is universally needed.
Article
Background: Researchers need visualization methods (using statistical and interactive techniques) to efficiently perform quality assessments and glean insights from their data. Data on networks can particularly benefit from more advanced techniques since typical visualization methods, such as node-link diagrams, can be difficult to interpret. We use heatmaps and consensus clustering on network data and show they can be combined to easily and efficiently explore nonparametric relationships among the variables and networks that comprise an ego network data set. Methods: We used ego network data from the Québec Adipose and Lifestyle Investigation in Youth (QUALITY) cohort to evaluate this method. The data consist of 35 networks centered on individuals (egos), each containing a maximum of 10 nodes (alters). These networks are described through 41 variables: 11 describing the ego (e.g. fat mass percentage), 18 describing the alters (e.g. frequency of physical activity) and 12 describing the network structure (e.g. degree). Results: Four stable clusters were detected. Cluster one consisted of variables relating to the interconnectivity of the ego networks and the locations of interaction, cluster two consisted of the ego's age, cluster three contained lifestyle variables and obesity outcomes, and cluster four comprised variables measuring alter importance and diet. Conclusions: This exploratory method using heatmaps and consensus clustering on network data identified several important associations among variables describing the alters' lifestyle habits and the egos' obesity outcomes. Their relevance has been identified by studies on the effect of social networks on childhood obesity.
Article
Concrete is a versatile construction material, but the water content can greatly influence its quality. However, using the trial-and-error method to determine the optimum water for the concrete mix results in poor-quality concrete structures, which often end up in landfills as construction waste, thus threatening environmental safety. This paper develops deep neural networks to predict the required water for a normal concrete mix. Standard data samples obtained from certified/leading laboratories were fed into a deep learning model (multilayer feedforward neural network) to automate the calibration of mixing power of the concrete water content for improved water control accuracy. We randomly split the data into 70%, 15% and 15%, respectively, to train, validate and test the model. The developed DNN model was subjected to relevant statistical metrics and benchmarked against random forest, gradient boosting machines, and support vector machines. The performance indices obtained by the DNN model have the highest reliability compared to other models for concrete water prediction.
Article
Objective Worrying is a pervasive transdiagnostic symptom in schizophrenia. It is most often associated in the literature with verbal modality due to many studies of its presence in generalised anxiety disorder. The current study aimed to elucidate worry in different sensory modalities, visual and verbal, in individuals with schizophrenia. Method We tested persons with schizophrenia (n = 92) and healthy controls (n = 138) in a cross-sectional design. We used questionnaires of visual and verbal worry (original Worry Modality Questionnaire), trait worry (Penn State Worry Questionnaire) and general psychopathology symptoms (General Functioning Questionnaire-58 and Brief Psychiatric Rating Scale). Results Both visual and verbal worry were associated with psychotic, anxiety and general symptoms of psychopathology in both groups with medium to large effect sizes. Regression analyses indicated that visual worry was a single significant predictor of positive psychotic symptoms in a model with verbal and trait worry, both in clinical and control groups (β′s of 0.49 and 0.38, respectively). Visual worry was also a superior predictor of anxiety and general psychopathology severity (β′s of 0.34 and 0.37, respectively) than verbal worry (β′s of 0.03 and −0.02, respectively), under control of trait worry, in the schizophrenia group. We also proposed two indices of worry modality dominance and analysed profiles of dominating worry modality in both groups. Conclusions Our study is the first to demonstrate that visual worry might be of specific importance for understanding psychotic and general psychopathology symptoms in persons with schizophrenia.
Poster
Full-text available
The healthcare industry has witnessed major advancements and innovation over the years. However, there still exist diseases that are difficult to diagnose and require specialized care that can often destroy one's finances. Treatments such as major organ transplants and surgery cost huge amounts of money, requiring multiple and prolonged hospitalizations. For such situations, one should increase the cover through a combination of base cover and health insurance cover for added financial protection available at an affordable cost. With the rising cost of healthcare in Thailand, a medical emergency could quickly deplete your savings. The primary purpose of health insurance is to provide financial coverage in case you suffer from a medical condition so that you can keep your savings protected. Income inequality is a problem in every region of Thailand. Based on this fact, we tried to reclassify by taking the sum insured data from every province in Thailand. We assume Bangkok to be a privileged area because the income of people in the capital city is relatively high. We use SVM because it performs well in overall statistical data processing and is a classifier with strong generalization ability. We present SVM as an innovation that is more accurate for data classification in the industrial sector in Thailand.
Chapter
Target detection in foliage environments has been an extremely difficult problem to solve. In this paper, we propose a machine learning approach for sense-through target detection. Target detection can be achieved with an accuracy of 93.7% with our XGBoost-based technology on a single received Ultra-Wide Band (UWB) radar waveform. This excellent result is achieved with very few computational resources, making it a lucrative application in the target field.
Article
The size, type and abundance of planktonic organisms influence the efficiency with which carbon is transferred through the lower trophic levels, ultimately affecting dynamics at the higher trophic levels of the marine food web. In temperate shelf seas, such as the waters south-west of the UK, the plankton growing season spans from early spring to autumn. While the plankton spring bloom has been extensively studied, the end of the growing season in September-October has received less attention, despite its potential importance for autumn-spawning fish and their larval stage survival. In this study we investigated the variability of the structure and carbon content of the planktonic communities in the waters south-west of the UK in October 2013 and 2014, discussing potential implications of these changes for small pelagic fish and higher trophic levels. Microphytoplankton (20-200 µm) dominated the plankton community in terms of carbon in 2013, while nanophytoplankton (<20 µm) dominated in 2014. Ciliates, Copepoda, Decapoda and Cnidaria represented the highest proportion of carbon in the zooplankton component in both years, although ciliate and Copepoda biomass was higher in 2014. Environmental conditions were linked to these changes and were significant in describing the carbon content of the plankton groups. In particular, silicate concentration appeared to be a key variable, affecting diatom/plankton dynamics at the end of the growing season. Other important environmental variables associated with the structure of plankton groups were salinity, sea surface temperature, chlorophyll-a, difference between sea surface temperature and bottom temperature, and phosphate and nitrogen concentrations. 
Although the composition and carbon distribution of the plankton community were different in the two years, cluster and Random Forest analyses showed similarities in the clusters of stations identified, defining an area of higher plankton carbon along the south coast of Cornwall, and an area of lower carbon in the Bristol Channel in both years. Presence of suitable prey for planktivorous small pelagic fish (e.g. Paracalanus/Pseudocalanus), particularly in 2014, provided supporting evidence of the importance of this sea area as a foraging and nursery ground for sardines and other small pelagic fish, as well as for their predators.
Article
Developing a wireless indoor positioning system with high accuracy, reliability, and reasonable cost has been the focus of many researchers. Recent studies have shown that visible-light-based positioning (VLP) systems have better positioning accuracy than radio-frequency-based systems. A notable highlight of those research articles is their combination of VLP and machine learning (ML) to improve the positioning performance in both two-dimensional and three-dimensional spaces. In this paper, in addition to describing VLP systems and well-known positioning algorithms, we analyze, evaluate, and summarize the ML techniques that have been applied recently. We break these into four categories: supervised learning, unsupervised learning, reinforcement, and deep learning. We also provide deep discussion of articles published during the past five years in terms of their proposed algorithm, space (2D/3D), experimental method (simulation/experiment), positioning accuracy, type of collected data, type of optical receiver, and number of transmitters.
Article
The genetic diversity of the Coronaviruses gives them different biological abilities, such as the ability to infect different cells and/or organisms, a wide spectrum of clinical manifestations, different routes of dispersion, and viral transmission in a specific host. In recent decades, different Coronaviruses have emerged that are highly adapted to humans and cause serious diseases, while their host of origin remains unknown. Viral genome information is particularly important for recognizing patterns linked to their biological characteristics, such as specificity in the host-parasite relationship. Here, based on a previously developed computational tool, Seq2Hosts, we developed a novel approach that uses new variables obtained from the codon frequencies of Coronavirus spike sequences, the Relative Synonymous Codon Usage (RSCU), to shed new light on the molecular mechanisms involved in the host specificity of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Using the RSCU obtained from nucleotide sequences available before the SARS-CoV-2 pandemic, we assessed the possibility of identifying the hosts capable of being infected by this newly emerging species, which was first identified infecting humans during 2019 in Wuhan, China. According to the model trained and validated using sequences available before the pandemic, bats are the most likely natural host of SARS-CoV-2, as previously suggested in other studies that searched for the host of viral origin.
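The RSCU variable mentioned in the abstract above has a standard textbook definition: the observed count of a codon divided by the mean count of all codons that encode the same amino acid. The following is a minimal sketch of that definition only, not the Seq2Hosts implementation. It assumes the standard genetic code; the function name `rscu` and the example sequence are illustrative.

```python
from collections import Counter, defaultdict

# Standard genetic code (assumption), built in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def rscu(sequence):
    """Return {codon: RSCU} for an in-frame coding sequence (hypothetical helper).

    RSCU = observed codon count / mean count over its synonymous codons.
    Amino acids that never occur in the sequence are omitted, as are stop codons.
    """
    seq = sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group codons into synonymous families by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values


# Illustrative sequence: four alanine codons (GCT x2, GCC, GCA).
example = rscu("GCTGCTGCCGCA")
print(example["GCT"], example["GCC"], example["GCG"])
```

An RSCU of 1 means a codon is used as often as expected under uniform synonymous usage; values above or below 1 indicate over- or under-use.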
Article
Determinants of housing prices are particularly significant for monitoring and understanding housing prices. Traditional variables are measured through official statistics or questionnaire surveys, which are labour intensive and time-consuming. New forms of data, such as point of interest or street view imagery, have been used to extract housing location and neighbourhood features, but they cannot capture how different individuals recognise and evaluate the properties nearby, which may also be relevant to house price estimation. Therefore, this study investigates whether user-generated images may be used to monitor and understand housing prices and how they influence real estate values. Within this context, perceived scene features are extracted and quantified to blend with commonly used determinants of housing prices. Two machine learning algorithms, random forest and gradient boosting machines, are deployed and integrated with a typical housing price model, the hedonic price model. By comparing the performance and interpretability of different models, the relative importance of features and how they influence the estimation power of the models is visualised and analysed. The findings suggest that random forest predictions perform the best and are interpretable, with geotagged Flickr images adding 4.6 percentage points to the model’s accuracy (R²), from 61.9% to 66.5%. Although user-generated images add only minor value to house price estimation, they may be used as a supplementary data source to capture perception features for house price estimation. This could help the restructuring and optimisation of residential areas in future regional construction, planning and development.
Article
The advent of large-scale bibliographic databases and powerful prediction algorithms led to calls for data-driven approaches for targeting scarce funds at researchers with high predicted future scientific impact. The potential side-effects and fairness implications of such approaches are unknown, however. Using a large-scale bibliographic data set of N = 111,156 Computer Science researchers active from 1993 to 2016, I build and evaluate a realistic scientific impact prediction model. Given the persistent under-representation of women in Computer Science, the model is audited for disparate impact based on gender. Random forests and Gradient Boosting Machines are used to predict researchers’ h-index in 2010 from their bibliographic profiles in 2005. Based on model predictions, it is determined whether the researcher will become a high-performer with an h-index in the top 25% of the discipline-specific h-index distribution. The models predict the future h-index with an accuracy of R² = 0.875 and correctly classify 91.0% of researchers as high-performers and low-performers. Overall accuracy does not vary strongly across researcher gender. Nevertheless, there is indication of disparate impact against women. The models under-estimate the true h-index of female researchers more strongly than the h-index of male researchers. Further, women are 8.6% less likely to be predicted to become high-performers than men. In practice, hiring, tenure, and funding decisions that are based on model predictions risk perpetuating the under-representation of women in Computer Science.
Article
PURPOSE. Using data from smart electricity meters, the electrical load profiles of commercial organizations located in apartment buildings were analyzed. The results obtained are compared with the current standard values. New values of specific electrical loads are considered for public premises: pharmacies, grocery and manufactured goods stores, catering establishments, and office premises. METHODS. Half-hour load profiles were obtained from smart electricity metering devices installed directly at the objects under study; data were transmitted by an automated electricity metering system. The observation intervals were several tens of days. Statistical methods for the analysis of electrical loads were used to process the experimentally obtained data. RESULTS. The article describes the relevance of the topic and presents the electrical load profiles of public premises, highlighting the characteristic features of each group of electricity consumers separately. New specific design electrical loads are considered and compared with existing standards. CONCLUSION. The calculated values of electrical power used for the technological connection of public premises, including social and cultural facilities, must be updated, since there is currently a significant difference between the actual electrical loads and those calculated according to regulatory documents. Updating the specific design electrical loads of public premises will reduce the locked capacity of these facilities and the cost of technological connection, thereby improving the rating of the investment climate in the region.
Article
The human microbiome has been linked to several diseases. Gastrointestinal diseases are still one of the most prominent areas of study in host-microbiome interactions; however, the underlying microbial mechanisms in these disorders are not fully established. Irritable bowel syndrome (IBS) remains one of the prominent disorders with significant changes in gut microbiome composition and without definitive treatment. IBS has a severe socioeconomic impact and affects patients’ lifestyle. Association studies between IBS and the microbiome have shed light on the relevance of microbial composition, and hence microbiome-based trials were designed. However, there is no clear evidence of a potential treatment for IBS. This review summarizes the epidemiology and socioeconomic impact of IBS and then focuses on microbiome observational and clinical trials. Finally, we propose a new perspective on using a data-driven approach and applying computational modelling and machine learning to design microbiome-aware personalized treatment for IBS.
Article
Significance: Choosing a statistical model and accounting for uncertainty about this choice are important parts of the scientific process and are required for common statistical tasks such as parameter estimation, interval estimation, statistical inference, point prediction, and interval prediction. A canonical example is the choice of variables in a linear regression model. Many ways of doing this have been proposed, including Bayesian and penalized regression methods, and it is not clear which are best. We compare 21 popular methods via an extensive simulation study based on a wide range of real datasets. We found that three adaptive Bayesian model averaging methods performed best across all the statistical tasks and that two of these were also among the most computationally efficient.
Article
Several socio-economic sectors are sensitive to the occurrence of extreme climate events. The ability to predict these extremes will allow precautionary measures to reduce their impacts. This work aims to disseminate a seasonal statistical forecast of daily temperature extremes in Argentina to the international scientific community. At the local level, this forecast is shared at monthly meetings organized by the Argentine National Meteorological Service and attended by different users. To model the temperature extremes, several predictors and statistical techniques were applied. We estimated the probability of each tercile category (above-normal, near-normal, and below-normal) by quantifying the percentage of models that predict each of them. The forecasts were verified by calculating different metrics. In general, we observed that the forecast system has less skill in discriminating the near-normal category in all seasons, while skill in the other categories varies widely with season, region, and extreme index. The verification process revealed that predictability increases for all extreme indices following a La Niña phase. This product represents an advance towards an operational seasonal forecast of extreme temperatures in Argentina because it offers predictions based on a detailed study of predictors in the region, incorporates multiple statistical methodologies, and predicts variables that are not the most typical ones offered by forecasting centers. Finally, it is highlighted that the accuracy rate obtained with this product exceeds that of a forecast based on climatology, i.e., despite the uncertainties, our forecasts provide additional information to users for decision making.
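The tercile probability described in the abstract above (the percentage of models that predict each category) can be illustrated with a short sketch. This is only an assumed reading of that one sentence, not the authors' forecast system; the function name `tercile_probabilities` and the sample predictions are hypothetical.

```python
from collections import Counter

# The three tercile categories named in the abstract.
CATEGORIES = ("above-normal", "near-normal", "below-normal")


def tercile_probabilities(model_predictions):
    """Return the share of models predicting each tercile category.

    `model_predictions` holds one category label per model
    (a hypothetical input format chosen for this sketch).
    """
    if not model_predictions:
        raise ValueError("at least one model prediction is required")
    unknown = set(model_predictions) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(model_predictions)
    n_models = len(model_predictions)
    return {category: counts[category] / n_models for category in CATEGORIES}


# Made-up example: four models, two of which predict above-normal.
predictions = ["above-normal", "above-normal", "near-normal", "below-normal"]
print(tercile_probabilities(predictions))
```

Under this reading the three probabilities always sum to 1, and a climatological reference forecast would assign one third to each category.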
Article
This paper reviews literature on data-driven approaches for characterizing rock mass and ground conditions in tunnels. There have been significant advances in the use of both unsupervised and supervised machine learning (ML) methods to predict the ground condition or rock mass class ahead of tunnel boring machines (TBMs). This study evaluates the likelihood of a single ML model being able to predict ground conditions or rock mass ahead of TBMs regardless of the TBM type, rock mass condition, or the rock mass classification system used in classifying the rock mass conditions. To do this, an extensive literature review was conducted to develop a list of ML models for the evaluation. Ground conditions/rock mass data and TBM operational data collected from the Pahang-Selangor Raw Water Transfer Tunnel (PSRWT) project were used to evaluate the selected models. The selected models were trained and evaluated on the PSRWT dataset. The performance metrics obtained from these models using the PSRWT data were then compared to the performance metrics reported by the respective authors. The second part of this paper focused on determining the best model among all the models reviewed using nine input variables from the PSRWT dataset. Variable importance evaluation was conducted to determine the relevant input variables for this analysis. The results revealed that the ML models performed well in correctly predicting the rock mass conditions on the PSRWT dataset, but the performances were relatively lower compared to the performances reported by the various authors. However, when all the nine selected variables were used to train and test the models, better performances were achieved. This indicates that it is highly unlikely that a single ML model can predict every rock mass behavior with the same degree of accuracy using the same input variables.
The model type, number and input parameters required for a given model will depend on, among other factors, the soil and rock types and their conditions. It is worth noting that where rock mass classes were similar to the PSRWT data, the models’ performances were similar. It is therefore highly recommended to conduct site-specific modeling to understand which parameters are relevant and determine the kind of model that works well for the different cases. If a model is being adopted due to similarities in rock mass, it is recommended to proceed with caution and ascertain that the model works in a similar manner.
Article
Background: Coronavirus (CoV) is a novel respiratory virus that can cause severe acute respiratory syndrome (SARS). It affects millions of people in the world and thousands of people in Ethiopia. In response, digital health technologies help to reduce COVID-19 outbreaks by sharing accurate and timely COVID-19 related information. Additionally, digital solutions are used for remote consulting during the pandemic, in creating COVID-19 related awareness, for distribution of the vaccine, and so on. Therefore, this study aimed to assess digital health literacy to share COVID-19 related information and associated factors among healthcare providers who worked at COVID-19 treatment centers in the Amhara region, Northwest Ethiopia. Method: An institutional-based cross-sectional survey was conducted from April 4 to May 4, 2021. The study included 476 healthcare providers who worked at COVID-19 treatment centers in the Amhara region. A pretested, structured self-administered questionnaire was used to collect data. EpiData 4.6 and SPSS version 26 were used for data entry and analysis respectively. Bi-variable and multivariable logistic regression analysis was used to identify factors associated with the dependent variable. A P-value of less than 0.05 was used to declare statistical significance. Result: A total of 456 respondents participated in the study, a 95.8% response rate. Digital health literacy to share COVID-19 related information was found to be 50.4% (95% CI: 46–55). Educational status [AOR = 4.37, 95% CI (2.08–9.17)], training [AOR = 3.00, 95% CI (1.80–5.00)], attitude [AOR = 1.99, 95% CI (1.18–3.36)], perceived usefulness [AOR = 2.01, 95% CI (1.22–3.32)], perceived ease of use [AOR = 2.00, 95% CI (1.25–3.21)] and smartphone access [AOR = 5.21, 95% CI (2.34–9.62)] were significantly associated with digital health literacy to share COVID-19 related information at a P-value of less than 0.05.
Conclusion: This finding indicated that approximately half of the respondents had digital health literacy to share COVID-19 related information, which was inadequate. Improving respondents’ educational status, computer training, smartphone access, perceived usefulness, perceived ease of use, and attitude was necessary to measure digital health literacy to share COVID-19 related information.