Article

Bootstrapping regression models with many parameters

... Generalizing from [18], a parametric resampling algorithm that is valid to recover the distribution of test statistics under the global null is to first fix the covariates $(X_1, \ldots, X_C)$ for all observations while setting the resampled "outcomes" equal to the fitted values plus a vector of residuals resampled with replacement. That is, letting $n'$ denote an observation sampled with replacement, the resampled variables for observation $n$ are: ...
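As a concrete illustration of the resampling step described in this excerpt, the following minimal sketch (Python with NumPy; function and variable names are illustrative and not taken from the cited work) keeps the design matrix fixed and sets each resampled outcome to the fitted value plus a residual drawn with replacement. The full procedure in the cited article additionally recomputes the test statistics on each resample; only the resampling step is sketched here.

```python
import numpy as np

def resample_under_global_null(X, Y, rng):
    """One resample: keep the covariates X fixed and set the resampled
    outcomes to the fitted values plus residuals drawn with replacement."""
    n = X.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # OLS fit, column by column
    fitted = X @ beta_hat
    resid = Y - fitted
    idx = rng.integers(0, n, size=n)                   # observations n' drawn with replacement
    return fitted + resid[idx]                         # resampled outcomes; X stays fixed

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
Y = rng.normal(size=(100, 5))                          # several (possibly correlated) outcomes
Y_star = resample_under_global_null(X, Y, rng)
```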
... A final incorrect alternative would be a generic bootstrap hypothesis test performed by resampling with replacement entire rows of data and then centering the test statistics as in Equation (2). However, this algorithm incorrectly treats the design matrix as random rather than fixed, which would be appropriate for correlation models but not the intended regression models [18]. Additionally, this algorithm can produce data violating the assumptions of standard OLS inference, even when the original data fulfill the assumptions. ...
... Specifically, we show that under a certain class of resampling algorithms defined below, the empirical sampling distribution of the number of rejections in the resamples converges to the true distribution of the number of rejections in samples generated under the global null. We chose to characterize the sampling distribution empirically rather than theoretically because it does not appear to have a tractable closed form without imposing assumptions. Because simulation error associated with using a finite number of resamples to approximate the CDF of the resampled data can be made arbitrarily small by taking B → ∞, we follow convention (e.g., [18]) in ignoring this source of error and considering only asymptotics on N. ...
Article
Full-text available
When investigators test multiple outcomes or fit different model specifications to the same dataset, as in multiverse analyses, the resulting test statistics may be correlated. We propose new multiple-testing metrics that compare the observed number of hypothesis test rejections ($\hat{\theta}$) at an unpenalized α-level to the distribution of rejections that would be expected if all tested null hypotheses held (the “global null”). Specifically, we propose reporting a “null interval” for the number of α-level rejections expected to occur in 95% of samples under the global null, the difference between $\hat{\theta}$ and the upper limit of the null interval (the “excess hits”), and a one-sided joint test based on $\hat{\theta}$ of the global null. For estimation, we describe resampling algorithms that asymptotically recover the sampling distribution under the global null. These methods accommodate arbitrarily correlated test statistics and do not require high-dimensional analyses, though they also accommodate such analyses. In a simulation study, we assess properties of the proposed metrics under varying correlation structures as well as their power for outcome-wide inference relative to existing methods for controlling familywise error rate. We recommend reporting our proposed metrics along with appropriate measures of effect size for all tests. We provide an R package, NRejections. Ultimately, existing procedures for multiple hypothesis testing typically penalize inference in each test, which is useful to temper interpretation of individual findings; yet on their own, these procedures do not fully characterize global evidence strength across the multiple tests. Our new metrics help remedy this limitation.
... Note that the analogous procedure based on ordinary residuals $y_i - M_{n,p}(T_n, x_i)$ instead of leave-one-out residuals would, in general, not be valid in such a large-$p$ scenario (cf. Bickel and Freedman (1983)). Extending these results to the general k-fold cross-validation case relies on a generalization of a result by Bousquet and Elisseeff (2002) on the estimation of the test error of a learning algorithm by its empirical leave-one-out error, which might be of independent interest (see Lemma C.1 in the Supplementary Material). ...
... But the methods from these references, like most resampling procedures in the literature, are investigated only in the classical large sample asymptotic regime where the number of available explanatory variables is fixed. Prominent exceptions are Bickel and Freedman (1983), Mammen (1996) and, more recently, El Karoui and Purdom (2018). These articles draw mainly negative conclusions about resampling methods in high dimensions, arguing, for instance, that the famous residual bootstrap in linear regression, which relies on the consistent estimation of the true unknown error distribution, is unreliable when the number of variables in the model is not small compared to sample size. ...
... Remark B.2 in the Supplementary Material). In such cases, the unconditional distribution function of the prediction error $P(y_0 - \hat{\mu}_n(x_0) \le s) = E[F_n(s)]$, the empirical distribution function of the ordinary residuals $s \mapsto \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{(-\infty,s]}(y_i - \hat{\mu}_n(x_i))$ and the true error distribution function $P(y_0 - \mu_P(x_0) \le s)$ need not be close to one another, because $\hat{\mu}_n$ may not contain enough information about the true regression function $\mu_P$ (see, for instance, Bean et al. (2013), Bickel and Freedman (1983), for a linear regression example where $\mu_P(x) = x^\top \beta_P$). Nevertheless, we will see that even in such a challenging scenario, it is often possible to consistently estimate the conditional distribution $F_n$ of $y_0 - \hat{\mu}_n(x_0)$, given the training sample $T_n$, by the (weighted) empirical distribution $\hat{F}_n$ of the k-CV residuals. ...
... Our simulation results indicate that the use of asymptotic critical values produces substantial size distortions for the four backtests. In order to correct the finite sample size distortions of our backtests, we propose a pairs bootstrap algorithm (Freedman, 1981). The resulting bootstrap critical values provide satisfactory size performance regardless of the sample size and should accordingly be used when asymptotic theory does not apply conveniently. ...
... Then, we test the resulting parameter restrictions using Wald-type inference. We apply QML estimation for the quantile regression parameters and we implement a pairs bootstrap algorithm (Freedman, 1981) to correct the finite sample size distortions of our backtests. Finally, we introduce a procedure deduced from our regression framework to adjust the invalid risk forecasts. ...
... In the following, we propose a pairs bootstrap algorithm (Freedman, 1981) in order to correct the finite sample size distortions of our backtests. This is a fully non-parametric procedure that can be applied to a very wide range of models, including the quantile regression model (Koenker et al., 2018). ...
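The pairs bootstrap referenced in these excerpts resamples entire (x_i, y_i) rows and recomputes the test statistic on each resample to obtain finite-sample critical values. A schematic sketch follows (Python/NumPy); `statistic` is a placeholder for whichever backtest statistic is being studied, and the centering convention used here is one common choice rather than necessarily the authors'.

```python
import numpy as np

def pairs_bootstrap_critical_value(X, y, statistic, B=999, alpha=0.05, seed=0):
    """Pairs bootstrap: resample whole (x_i, y_i) rows with replacement and
    recompute the statistic to approximate its finite-sample distribution."""
    rng = np.random.default_rng(seed)
    n = len(y)
    t_obs = statistic(X, y)
    t_boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)               # rows resampled jointly
        t_boot[b] = statistic(X[idx], y[idx])
    # centering at the observed value mimics the null distribution of the statistic
    crit = np.quantile(np.abs(t_boot - t_obs), 1 - alpha)
    return t_obs, crit
```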
Thesis
This dissertation contributes to the academic research in econometrics and financial risk management. Our research's goal is twofold: (i) to quantify the financial risks incurred by financial institutions and (ii) to assess the validity of the risk measures commonly used in the financial industry or by regulators. We focus on three kinds of financial risks, (i) credit risk, (ii) market risk, and (iii) systemic risk. In Chapters 2 and 3, we develop new methods for modeling and backtesting the volatility and the Expected Shortfall (ES), two measures typically used to quantify the risk of incurred losses in investment portfolios. In Chapter 4, we provide new estimation methods for the systemic risk measures that are used to identify the financial institutions contributing the most to the overall risk in the financial system. In Chapter 2, we develop a volatility structure that groups the whole sequence of intraday returns as functional covariates. Contrary to the well-known GARCH model with exogenous variables (GARCH-X), our approach makes it possible to account for all the information contained in the intraday price movements via functional data analysis. Chapter 3 introduces an econometric methodology to test for the validity of ES forecasts in market portfolios. This measure is now used to calculate the market risk capital requirements following the adoption of the Basel III accords by the Basel Committee on Banking Supervision (BCBS). Our method exploits the existing relationship between ES and Value-at-Risk (VaR) and complies - as a special case - with the BCBS recommendation of verifying the VaR at two specific risk levels. In Chapter 4, we focus on the elicitability property for the market-based systemic risk measures. A risk measure is said to be elicitable if there exists a loss function such that the risk measure itself is the solution that minimizes the expected loss. We identify a strictly consistent scoring function for the Marginal Expected Shortfall (MES) and the VaR of the market return jointly and we exploit the scoring function to develop a semi-parametric M-estimator for the pair (VaR, MES).
... In cases where fitting procedures were used, data were binned by logarithmic-spaced intervals, and fits were performed with a total least-squares method, also known as orthogonal regression, by the Total Least Squares Approach to Modeling Toolbox for MATLAB (Petráš and Bednárová, 2010). The fitting procedure was bootstrapped by resampling from the residuals of individual cells (Freedman, 1981;Efron and Tibshirani, 1986). The data were resampled, binned, and fit for 10,000 repetitions generating sampling distributions of model parameters. ...
... Statistical comparisons between BAPTA and control conditions for R max experiments were made by first assessing a one-way repeated-measures ANOVA by a custom bootstrap approach for unbalanced design in MATLAB. This custom algorithm is equivalent to the standard linear mixed-effects model, except that bootstrap replicates are calculated from the residuals of the fixed-effects estimator (Freedman, 1981). Post hoc analysis proceeded if the results of ANOVA indicated a significant effect, that is, p < 0.05. ...
Article
Full-text available
The sensitivity of retinal cells is altered in background light to optimize the detection of contrast. For scotopic (rod) vision, substantial adaptation occurs in the first two cells, the rods and rod bipolar cells (RBCs), through sensitivity adjustments in rods and postsynaptic modulation of the transduction cascade in RBCs. To study the mechanisms mediating these components of adaptation, we made whole-cell, voltage-clamp recordings from retinal slices of mice from both sexes. Adaptation was assessed by fitting the Hill equation to response-intensity relationships with the parameters of half-maximal response ( I 1/2 ), Hill coefficient ( n ), and maximum response amplitude ( R max ). We show that rod sensitivity decreases in backgrounds according to the Weber-Fechner relation with an I 1/2 of about 50 R* s ⁻¹ . The sensitivity of RBCs follows a near-identical function, indicating that changes in RBC sensitivity in backgrounds bright enough to adapt the rods are mostly derived from the rods themselves. Backgrounds too dim to adapt the rods can however alter n , relieving a synaptic nonlinearity likely through entry of Ca ²⁺ into the RBCs. There is also a surprising decrease of R max , indicating that a step in RBC synaptic transduction is desensitized or that the transduction channels became reluctant to open. This effect is greatly reduced after dialysis of BAPTA at a membrane potential of +50 mV to impede Ca ²⁺ entry. Thus the effects of background illumination in RBCs are in part the result of processes intrinsic to the photoreceptors and in part derive from additional Ca ²⁺ -dependent processes at the first synapse of vision. SIGNIFICANCE STATEMENT: Light adaptation adjusts the sensitivity of vision as ambient illumination changes. Adaptation for scotopic (rod) vision is known to occur partly in the rods and partly in the rest of the retina from presynaptic and postsynaptic mechanisms. We recorded light responses of rods and rod bipolar cells to identify different components of adaptation and study their mechanisms. We show that bipolar-cell sensitivity largely follows adaptation of the rods, but that light too dim to adapt the rods produces a linearization of the bipolar-cell response and a surprising decrease in maximum response amplitude, both mediated by a change in intracellular Ca ²⁺ . These findings provide a new understanding of how the retina responds to changing illumination.
... We compare the proposed repro samples approach with the residual bootstrap approach in the current literature [27,13], where we used three different tuning criteria, AIC, BIC and cross-validation (CV), to select models. The numbers of bootstrap samples are 1,000, 10,000, and 100,000 for (M1), (M2) and (M3) respectively, the same as the numbers of repro samples for searching for the candidate models in our method. ...
... Then Theorem 5 follows from (27) and Theorem 2. ...
Preprint
Full-text available
This paper presents a new and effective simulation-based approach to conducting both finite- and large-sample inference for high-dimensional linear regression models. We develop this approach under the so-called repro samples framework, in which we conduct statistical inference by creating and studying the behavior of artificial samples that are obtained by mimicking the sampling mechanism of the data. We obtain confidence sets for either the true model, a single regression coefficient, or any collection of regression coefficients. The proposed approach addresses two major gaps in the high-dimensional regression literature: (1) lack of inference approaches that guarantee finite-sample performance; (2) lack of effective approaches to address model selection uncertainty and provide inference for the underlying true model. We provide both finite-sample and asymptotic results to theoretically guarantee the performance of the proposed methods. Besides enjoying theoretical advantages, our numerical results demonstrate that the proposed methods achieve better coverage with smaller confidence sets than the existing state-of-the-art approaches, such as debiasing and bootstrap approaches. We also extend our approaches to drawing inferences on functions of the regression coefficients.
... The second condition is typically satisfied for appropriately chosen estimates $\hat{F}$ whenever the data dimension p is fixed. In addition to general theory, statisticians have carried out detailed studies for specific statistics including the sample mean [9,11], regression coefficients [12,13,9,14], continuous functions of the empirical measure [15], and so forth. ...
... Specifically, this article concerns the accuracy of bootstrap methods when p and n are both very large and perhaps grow with a fixed ratio. In linear regression for example, while the residual bootstrap is weakly consistent if p is fixed and n → ∞, it is inconsistent when n, p → ∞ in such a way that p/n → κ > 0; to be sure, [13] displays a data-dependent contrast, i.e., a linear combination of coefficients, for which the estimated contrast distribution is asymptotically incorrect. Motivated by results from high-dimensional maximum likelihood theory [16,17,18], [19] proposed to use corrected residuals to achieve correct inference. ...
Preprint
Accurate statistical inference in logistic regression models remains a critical challenge when the ratio between the number of parameters and sample size is not negligible. This is because approximations based on either classical asymptotic theory or bootstrap calculations are grossly off the mark. This paper introduces a resized bootstrap method to infer model parameters in arbitrary dimensions. As in the parametric bootstrap, we resample observations from a distribution, which depends on an estimated regression coefficient sequence. The novelty is that this estimate is actually far from the maximum likelihood estimate (MLE). This estimate is informed by recent theory studying properties of the MLE in high dimensions, and is obtained by appropriately shrinking the MLE towards the origin. We demonstrate that the resized bootstrap method yields valid confidence intervals in both simulated and real data examples. Our methods extend to other high-dimensional generalized linear models.
... Upon visual inspection of histograms and scatterplot residuals, the dependent and independent variables appeared significantly skewed and heteroscedastic. Given the small sample size and the skewed, heteroskedastic data, wild bootstrap analysis was performed so that the regression model could be repeatedly fit to resampled data without altering the raw data (Freedman, 1981; Godfrey & Tremayne, 2005; Pek et al., 2018). To adhere to statistical requirements for appropriate linear fit, bootstrapping with 2000 sample replications was used to calculate nonparametric 95% confidence intervals (CI) for the prediction of the dependent variable; bias-corrected and accelerated (BCa) correction was used on each confidence interval and reviewed (Astivia & Zumbo, 2019; Freedman, 1981; Pek et al., 2018). Independence of the bootstrapped residuals was assessed by a Durbin-Watson statistic of 2.059. ...
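For orientation, the sketch below shows a generic wild bootstrap for OLS coefficients with plain percentile intervals (Python/NumPy). The cited study reports BCa intervals, which require an additional bias and acceleration correction not reproduced here; the code is illustrative only.

```python
import numpy as np

def wild_bootstrap_ci(X, y, B=2000, alpha=0.05, seed=0):
    """Wild bootstrap for OLS: keep X and the fitted values, flip each residual
    with an independent Rademacher weight (heteroskedasticity-robust), refit,
    and form percentile confidence intervals for the coefficients."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    boot = np.empty((B, p))
    for b in range(B):
        v = rng.choice([-1.0, 1.0], size=n)            # Rademacher multipliers
        y_star = X @ beta_hat + resid * v
        boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    return beta_hat, lower, upper
```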
Article
Full-text available
Little research has explored factors affecting the time it takes caregivers to initiate the diagnostic process (e.g., first phone call to a provider) after their infant or toddler received a positive screen for Autism Spectrum Disorder (ASD). The current study investigated the time it takes caregivers of an at-risk sample of toddlers who are enrolled in a statewide early intervention program and screen positive for ASD to initiate the evaluation process. A hierarchical multiple regression was used to identify factors that may affect this timeline. ASD symptom severity and child age were found to predict the length of time between receiving the positive screen and the initiation of the evaluation process; no additional factors (i.e., gender, only child status, younger sibling status, race/ethnicity) were significant. Results indicate that severity of symptoms and child age may be the driving factors for caregivers to initiate the formal evaluation process.
... In a linear regression setting, Studentization includes an additional covariate-dependent leverage factor that is not readily available in complex, nonlinear models such as ours. There are alternative residual adjustments in the bootstrap literature (Davidson and Hinkley, 1997;Bickel and Freedman, 1983;Weber, 1984). Residuals should also be centered before resampling if their mean differs significantly from zero; here the means of the line and sample residuals were negligible. ...
Preprint
Several planetary satellites apparently have subsurface seas that are of great interest for, among other reasons, their possible habitability. The geologically diverse Saturnian satellite Enceladus vigorously vents liquid water and vapor from fractures within a south polar depression and thus must have a liquid reservoir or active melting. However, the extent and location of any subsurface liquid region is not directly observable. We use measurements of control points across the surface of Enceladus accumulated over seven years of spacecraft observations to determine the satellite's precise rotation state, finding a forced physical libration of 0.120 ± 0.014° (2σ). This value is too large to be consistent with Enceladus's core being rigidly connected to its surface, and thus implies the presence of a global ocean rather than a localized polar sea. The maintenance of a global ocean within Enceladus is problematic according to many thermal models and so may constrain satellite properties or require a surprisingly dissipative Saturn.
... We use the residual bootstrap method (Freedman 1981) to test the significance of the coral-association g parameter. The null hypothesis of this test is g = 0 against the alternative g ≠ 0. First, the fitted values and the residuals are obtained from the null model. ...
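The general pattern described here, resampling residuals from the fitted null model so that the bootstrap data obey the null hypothesis, can be sketched as follows (Python/NumPy). A linear model with the tested coefficient in the last column of `X_full` stands in for the coral-association model of the cited work, and the two-sided p-value convention is an assumption.

```python
import numpy as np

def residual_bootstrap_pvalue(X_null, X_full, y, B=999, seed=0):
    """Bootstrap test of the extra coefficient in the last column of X_full:
    residuals are resampled from the *null* fit, so bootstrap data obey H0."""
    rng = np.random.default_rng(seed)
    n = len(y)

    def last_coef(design, yy):
        coef, *_ = np.linalg.lstsq(design, yy, rcond=None)
        return coef[-1]

    beta_null, *_ = np.linalg.lstsq(X_null, y, rcond=None)
    fitted_null = X_null @ beta_null
    resid_null = y - fitted_null
    g_obs = last_coef(X_full, y)

    g_boot = np.empty(B)
    for b in range(B):
        y_star = fitted_null + resid_null[rng.integers(0, n, size=n)]
        g_boot[b] = last_coef(X_full, y_star)
    return (1 + np.sum(np.abs(g_boot) >= abs(g_obs))) / (B + 1)   # two-sided p-value
```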
... One of the most important and frequent types of statistical analysis is regression analysis, in which we study the effects of explanatory variables on a response variable. The use of the jackknife and bootstrap to estimate the sampling distribution of the parameter estimates in the linear regression model was first proposed by Efron (1979) and further developed by Freedman (1981) and Wu (1986). There has been considerable interest in recent years in the use of the jackknife and bootstrap in the regression context. In this study, we focus on the accuracy of the jackknife and bootstrap resampling methods in estimating the distribution of the regression parameters through different sample sizes and different bootstrap replications. ...
... If $\lambda_{\min}(V) \ge c_{\min} > 0$ for an absolute constant $c_{\min}$, the bound in (2.1) converges to 0 as long as $d/n^{1/2} \to 0$, which is required by Assumption 1.1 and matches the condition invoked by Bickel and Freedman (1983) to prove the asymptotic normality of the least squares coefficient with fixed regressors. Even if the errors are strongly correlated with $\lambda_{\min}(V) \to 0$, the bound in (2.1) is still useful for establishing the central limit theorem of $T_j$ as long as the bound converges to 0. ...
Preprint
Full-text available
Linear regression is arguably the most widely used statistical method. With fixed regressors and correlated errors, the conventional wisdom is to modify the variance-covariance estimator to accommodate the known correlation structure of the errors. We depart from the literature by showing that with random regressors, linear regression inference is robust to correlated errors with unknown correlation structure. The existing theoretical analyses for linear regression are no longer valid because even the asymptotic normality of the least-squares coefficients breaks down in this regime. We first prove the asymptotic normality of the t statistics by establishing their Berry-Esseen bounds based on a novel probabilistic analysis of self-normalized statistics. We then study the local power of the corresponding t tests and show that, perhaps surprisingly, error correlation can even enhance power in the regime of weak signals. Overall, our results show that linear regression is applicable more broadly than the conventional theory suggests, and further demonstrate the value of randomization to ensure robustness of inference.
... Bootstrapping pairs: bootstrapping pairs is a rather simple but powerful approach proposed first by Freedman (1981). Under this approach, we resample independent and dependent variables from the original sample which results in a bootstrap sample. ...
Article
Full-text available
OLS models rely on several assumptions for their interval estimates to be unbiased and efficient. Non-constant variance of the residuals can cause serious issues in making inferences on coefficients as well as interval estimates. In this paper, we discuss the presence of heteroscedasticity in a linear model and suggest a paired bootstrap approach as an assumption-free way of constructing confidence intervals. We carry out a simulation study to compare bootstrap confidence intervals to traditional intervals. We conclude that bootstrap intervals, though not perfect, can give better interval estimates when heteroscedasticity is observed and no remedy is applied.
... Because individuals are nested within groups, we cluster the standard errors by group in all analyses. Additionally, because the variance is not normally distributed, we use bootstrapping (Freedman 1981). Finally, because individuals are randomly assigned to condition, we do not use control variables in models in which experimental condition is the independent variable. ...
... Bootstrapping pairs: bootstrapping pairs is a rather simple but powerful approach proposed first by Freedman (1981). Under this approach, we resample independent and dependent variables from the original sample which results in a bootstrap sample. ...
Article
Full-text available
OLS regressions rely on a set of assumptions for their point and interval estimates to be unbiased and efficient. Data missing not at random (MNAR) can pose serious estimation issues in linear regression. In this study we evaluate the performance of OLS confidence interval estimates with MNAR data. We also suggest bootstrapping as a remedy for such data cases and compare the traditional confidence intervals against bootstrap ones. As we need to know the true parameters, we carry out a simulation study. Research results indicate that both approaches show similar results, with similar interval sizes. Given that the bootstrap requires many computations, the traditional method is still recommended even in the case of MNAR.
... Under the linear model (top-left plot), the parametric and flipscores tests show perfect control of the Type I error, as expected from the theory. Even the bootstrap method shows optimal behavior, although the control is ensured only asymptotically (Freedman, 1981). In the Binomial (top-right) and Poisson (bottom-left) scenarios, the parametric and flipscores tests are formally proven to have asymptotic control of the Type I error; the simulation confirms the good control in practice. ...
Article
When analyzing data, researchers make some choices that are either arbitrary, based on subjective beliefs about the data-generating process, or for which equally justifiable alternative choices could have been made. This wide range of data-analytic choices can be abused and has been one of the underlying causes of the replication crisis in several fields. Recently, the introduction of multiverse analysis has provided researchers with a method to evaluate the stability of the results across reasonable choices that could be made when analyzing data. Multiverse analysis is confined to a descriptive role, lacking a proper and comprehensive inferential procedure. More recently, specification curve analysis has added an inferential procedure to multiverse analysis, but this approach is limited to simple cases related to the linear model, and only allows researchers to infer whether at least one specification rejects the null hypothesis, but not which specifications should be selected. In this paper, we present a Post-selection Inference approach to Multiverse Analysis (PIMA), which is a flexible and general inferential approach that considers all possible models, i.e., the multiverse of reasonable analyses. The approach allows for a wide range of data specifications (i.e., preprocessing) and any generalized linear model; it allows testing the null hypothesis that a given predictor is not associated with the outcome, by combining information from all reasonable models of multiverse analysis, and provides strong control of the family-wise error rate, allowing researchers to claim that the null hypothesis can be rejected for any specification that shows a significant effect. The inferential proposal is based on a conditional resampling procedure. We formally prove that the Type I error rate is controlled, and compute the statistical power of the test through a simulation study. Finally, we apply the PIMA procedure to the analysis of a real dataset on the self-reported hesitancy for the COronaVIrus Disease 2019 (COVID-19) vaccine before and after the 2020 lockdown in Italy. We conclude with practical recommendations to be considered when implementing the proposed procedure.
... The outputs of such forecasting models for a given test instance are collected and used to create a sampling distribution of the statistic of interest (e.g., the traffic forecast), which represents the variability or uncertainty associated with the estimate. From this distribution we can compute prediction intervals, i.e. a range within which the true traffic measure is likely to fall [71]. In addition to different works relying on bootstrapping for estimating the uncertainty of different traffic data models [40], [72], [73], modifications of the naïve bootstrap procedure have been proposed to account for all sources of uncertainty in prediction intervals. ...
Article
Full-text available
The estimation of the amount of uncertainty featured by predictive machine learning models has acquired great momentum in recent years. Uncertainty estimation provides the user with augmented information about the model’s confidence in its predicted outcome. Despite the inherent utility of this information for the trustworthiness of the user, there is thin consensus around the different types of uncertainty that one can gauge in machine learning models and the suitability of different techniques that can be used to quantify the uncertainty of a specific model. This subject is mostly nonexistent within the traffic modeling domain, even though the measurement of the confidence associated with traffic forecasts can significantly improve their actionability in practical traffic management systems. This work aims to cover this lack of research by reviewing different techniques and metrics of uncertainty available in the literature, and by critically discussing how confidence levels computed for traffic forecasting models can be helpful for researchers and practitioners working in this research area. To shed light with empirical evidence, this critical discussion is further informed by experimental results produced by different uncertainty estimation techniques over real traffic data collected in Madrid (Spain), rendering a general overview of the benefits and caveats of every technique, how they can be compared to each other, and how the measured uncertainty decreases depending on the amount, quality and diversity of data used to produce the forecasts.
... Within the algorithm, D is the dataset used for the classification fit, χ is the fitted classifier and S is the final reported solution to the optimization problem. The data generation process of the proposed machine learning framework, which is based on solving a randomly generated reduced-space version of the problem multiple times, has a close relationship with statistical bootstrapping [25]. Therefore, the resulting dataset enjoys the simplicity, effectiveness, and statistical properties of bootstrap samples. ...
Article
Full-text available
Biofuels derived from feedstock offer a sustainable source for meeting energy needs. The design of supply chains that deliver these fuels needs to consider quality variability with special attention to shipping costs, because biofuel feedstocks are voluminous. Stochastic programming models that consider all these considerations incur a heavy computational burden. The present work proposes a hybrid strategy that leverages machine learning to reduce the computational complexity of stochastic programming models via problem space reduction. First, numerous randomly generated reduced-space versions of the problem are solved multiple times to generate a set of solution data based on the concept of bootstrapping. Next, a supervised machine learning algorithm is implemented to predict a potentially beneficial mixed integer linear program problem space from which a near-optimal solution can be obtained. Finally, the mixed integer linear program selects the optimal solution from the reduced space generated by the machine learning algorithm. Through extensive numerical experimentation, we determine how much the problem space can be reduced, how many times the reduced space problem needs to be solved and the best performing machine learning techniques for this application. Several supervised learning algorithms, including logistic regression, decision tree, random forest, support vector machine, and k-nearest neighbors, are evaluated. The numerical experiments demonstrate that our proposed solution procedure yields near-optimal outcomes with a considerably reduced computational burden.
... For comparison, we also conduct experiments using two variants of the bootstrap. First, we use the pairs bootstrap (Freedman, 1981), where each observation of the bootstrap sample $[X^*, Y^*]$ is sampled randomly with replacement from the rows of $[X, Y]$; see Figure 3. We also include the residual bootstrap, which samples with replacement the residuals $Y - X\beta$ and adds them to $X\beta$ to get the new $Y^*$. ...
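The two bootstrap variants compared in this excerpt differ only in what is resampled. A minimal side-by-side sketch (Python/NumPy; the data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Pairs bootstrap: resample whole rows of [X, Y] jointly (design treated as random).
idx = rng.integers(0, n, size=n)
X_star, Y_star_pairs = X[idx], Y[idx]

# Residual bootstrap: keep X fixed, resample residuals and add them back to X @ beta_hat.
resid = Y - X @ beta_hat
Y_star_resid = X @ beta_hat + resid[rng.integers(0, n, size=n)]
```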
Preprint
Full-text available
Drawing statistical inferences from large datasets in a model-robust way is an important problem in statistics and data science. In this paper, we propose methods that are robust to large and unequal noise in different observational units (i.e., heteroskedasticity) for statistical inference in linear regression. We leverage the Hadamard estimator, which is unbiased for the variances of ordinary least-squares regression. This is in contrast to the popular White's sandwich estimator, which can be substantially biased in high dimensions. We propose to estimate the signal strength, noise level, signal-to-noise ratio, and mean squared error via the Hadamard estimator. We develop a new degrees of freedom adjustment that gives more accurate confidence intervals than variants of White's sandwich estimator. Moreover, we provide conditions ensuring the estimator is well-defined, by studying a new random matrix ensemble in which the entries of a random orthogonal projection matrix are squared. We also show approximate normality, using the second-order Poincaré inequality. Our work provides improved statistical theory and methods for linear regression in high dimensions.
... One of the applications of the bootstrap estimator lies in constructing confidence intervals for regression models (Freedman, 1981). Although the application of bootstrap to provide local error estimates to PCE model predictions within a single-fidelity context has been previously studied (see Marelli and Sudret (2018)), its usage in the context of MFSM has not yet been explored. ...
Preprint
Full-text available
Computer simulations (a.k.a. white-box models) are more indispensable than ever to model intricate engineering systems. However, computational models alone often fail to fully capture the complexities of reality. When physical experiments are accessible though, it is of interest to enhance the incomplete information offered by computational models. Gray-box modeling is concerned with the problem of merging information from data-driven (a.k.a. black-box) models and white-box (i.e., physics-based) models. In this paper, we propose to perform this task by using multi-fidelity surrogate models (MFSMs). An MFSM integrates information from models with varying computational fidelity into a new surrogate model. The multi-fidelity surrogate modeling framework we propose handles noise-contaminated data and is able to estimate the underlying noise-free high-fidelity function. Our methodology emphasizes delivering precise estimates of the uncertainty in its predictions in the form of confidence and prediction intervals, by quantitatively incorporating the different types of uncertainty that affect the problem, arising from measurement noise and from lack of knowledge due to the limited experimental design budget on both the high- and low-fidelity models. Applied to gray-box modeling, our MFSM framework treats noisy experimental data as the high-fidelity and the white-box computational models as their low-fidelity counterparts. The effectiveness of our methodology is showcased through synthetic examples and a wind turbine application.
... Moreover, instead of using the raw anomalies for each sample, we used a technique from the R package analog to calculate an estimate using bootstrapping (n = 1000) (Simpson, 2007). Bootstrapping is a method by which the training set is resampled with replacement to form new samples of the same size, where some observations are used multiple times and some are not used at all (Birks, 2014; Birks et al., 1990; Birks and Simpson, 2013; Efron, 1982; Freedman, 1981; Herbert and Harrison, 2016; Simpson, 2007). These bootstrapped anomalies are very similar to those produced in the traditional manner, as shown in Supplemental Figure S4 (see also Supplemental Table S1); the linear correlation between the two series as measured by r² is 0.96 (p < 0.001). ...
Article
Temperature variability likely played an important role in determining the spread and productive potential of North America’s key prehispanic agricultural staple, maize. The United States Southwest (SWUS) also served as the gateway for maize to reach portions of North America to the north and east. Existing temperature reconstructions for the SWUS are typically low in spatial or temporal resolution, shallow in time depth, or subject to unknown degrees of insensitivity to low-frequency variability, hindering accurate determination of temperature’s role in agricultural productivity and variability in distribution and success of prehispanic farmers. Here, we develop a model-based modern analog technique (MAT) approach applied to 29 SWUS fossil pollen sites to reconstruct July temperatures from 3000 BC to AD 2000. Temperatures were generally warmer than or similar to those of the modern (1961–1990) period until the first century AD. Our reconstruction also notes rapid warming beginning in the AD 1800s; modern conditions are unprecedented in at least the last five millennia in the SWUS. Temperature minima were reached around 1800 BC, 1000 BC, AD 400 (the global minimum in this series), the mid-to-late AD 900s, and the AD 1500s. Summer temperatures were generally depressed relative to northern hemisphere norms by a dominance of El Niño-like conditions during much of the second millennium BC and the first millennium AD, but somewhat elevated relative to those same norms in other periods, including from about AD 1300 to the present, due to the dominance of La Niña-like conditions.
... Negative definite matrices (and, more generally, definite matrices) are widespread in the literature, with applications in functional analysis (Lin, 2019), machine learning (Freedman, 1981;Tsuda et al., 2005) or control, to name a few. In particular, considering the context of control, definite matrices often play a fundamental role since they are, for instance, pivotal for Lyapunov stability analysis of linear systems (e.g., see Khalil, 2009, pp. ...
Article
In this paper we focus our attention on an interesting property of linear and time-invariant systems, namely negativizability: a pair (A,C) is negativizable if a gain matrix K exists such that A-KC is negative definite. Notably, in this paper we show that negativizability can be a useful feature for solving distributed estimation and control problems in Cyber–Physical Systems (CPS), since such a property allows a network of agents to bring the estimation error to zero or to control the overall system by designing gains that only require information locally available to each agent. In detail, we first characterize the negativizability problem, developing a necessary and sufficient condition for the problem to admit a solution. Then, we show how distributed estimation and control schemes for linear and time-invariant CPSs can greatly benefit from this property. A simulation campaign aiming at showing the potential of negativizability in the context of distributed state estimation and control of CPSs concludes the paper.
... The bootstrapping algorithm is a statistical inference method that resamples the original data to approximate a sampling distribution (Freedman, 1981; Efron, 1992). A basic principle of this algorithm is to treat the observed sample as if it were the population, from which new samples are repeatedly drawn. ...
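A minimal illustration of this basic principle, treating the observed sample as the population and resampling it to approximate the sampling distribution of a statistic (Python/NumPy; the exponential data and the mean statistic are arbitrary choices for the example):

```python
import numpy as np

def bootstrap_distribution(data, statistic, B=2000, seed=0):
    """Approximate the sampling distribution of `statistic` by recomputing it
    on B samples drawn with replacement from the observed data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    return np.array([statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)])

sample = np.random.default_rng(1).exponential(size=50)   # stand-in for observed data
dist = bootstrap_distribution(sample, np.mean)
interval = np.quantile(dist, [0.025, 0.975])              # e.g., a 95% percentile interval
```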
Article
Full-text available
Identifying the individual and combined hydrological response of land use landscape pattern and climate changes is key to effectively managing the ecohydrological balance of regions. However, their nonlinearity, effect size, and multiple causalities limit causal investigations. Therefore, this study aimed to establish a comprehensive methodological framework to quantify changes in the landscape pattern and climate, evaluate trends in streamflow response, and analyze the attribution of streamflow events in five basins in Beijing from the past to the future. Future climate projections were based on three general circulation models (GCMs) under two shared socioeconomic pathways (SSPs). Additionally, the landscape pattern in 2035 under a natural development scenario was simulated by the patch-generating land use simulation (PLUS). The Soil and Water Assessment Tool (SWAT) was applied to evaluate the streamflow spatial and temporal dynamics over the period 2005-2035 with multiple scenarios. A bootstrapping nonlinear regression analysis and boosted regression tree (BRT) model were used to analyze the individual and combined attribution of landscape pattern and climate changes on streamflow, respectively. The results indicated that in the future, the overall streamflow in the Beijing basin would decrease, with a slightly reduced peak streamflow in most basins in the summer and a significant increase in the autumn and winter. The nonlinear quadratic regression more effectively explained the impact of landscape pattern and climate changes on streamflow. The trends in the streamflow change depended on where the relationship curve was in relation to the threshold. In addition, the impacts of landscape pattern and climate changes on streamflow were not isolated but were joint. They presented a nonlinear, non-uniform, and coupled relationship. Except for the YongDing River Basin, the annual streamflow change was influenced more by the landscape pattern. The dominant factors and the critical pair interactions varied from basin to basin. Our findings have implications for city planners and managers for optimizing ecohydrological functions and promoting sustainable development.
... The application of bootstrapping in the case of models with independent and identically distributed errors has been studied by Freedman (1981), Stoffer and Wall (1991), Tibshirani (1994), and Lahiri (2003). ...
Article
Full-text available
The bootstrap is a resampling method for estimating parameters or sampling distributions based on observed data. In order to apply the bootstrap approach when evaluating the parameters of time-series models, we need to consider the lack of independence between the observations. This study addresses the sensitivity of the performance of the bootstrap method in uncovering the true sampling distribution of parameter estimates of autoregressive models to the white-noise distribution. In order to study the performance, we use three white-noise (normal, exponential, and uniform) distributions for three (first- and higher-order) models.
... Various studies consider the asymptotic validity of the bootstrap-based estimator in (1.1). More specifically, the validity for explosive autoregressive processes is studied by Basawa et al. (1989, 1991), while in general the asymptotics for the bootstrap in stationary models is predominately concerned with independent observations (see Freedman et al. (1984, 1981)). Although it was initially documented in the literature that the bootstrap cannot work for dependent processes (see Bose (1988)), it was anticipated that it would work if the dependence is taken care of while resampling (see Politis and Romano (1994) and Politis (2003, 2005)). ...
Preprint
Full-text available
We establish the asymptotic validity of the bootstrap-based IVX estimator proposed by Phillips and Magdalinos (2009) for the predictive regression model parameter based on a local-to-unity specification of the autoregressive coefficient, which covers both nearly nonstationary and nearly stationary processes. A mixed Gaussian limit distribution is obtained for the bootstrap-based IVX estimator. The statistical validity of the theoretical results is illustrated by Monte Carlo experiments for various statistical inference problems.
... Given an observed sample, we first apply the two-stage estimator to select an initial model and obtain the refitted coefficient estimate. We further generate bootstrap samples using the refitted estimate via the celebrated residual bootstrap (Freedman, 1981; Efron, 1992). For each bootstrap sample, we apply the same two-stage estimator to identify the bootstrap model $\hat{S}^{(b)}$ and obtain the refitted bootstrap coefficient estimate $\hat{\beta}^{(b)}$. ...
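Schematically, the residual bootstrap with a repeated selection-plus-refitting step can be organized as below (Python/NumPy). The `select_and_refit` function is a deliberately crude placeholder for the two-stage estimator of the cited work, which is not reproduced here.

```python
import numpy as np

def select_and_refit(X, y, tau=0.1):
    """Crude placeholder for a two-stage estimator: marginal screening followed
    by an OLS refit on the selected columns (not the procedure of the paper)."""
    score = np.abs(X.T @ y) / len(y)
    S = np.flatnonzero(score > tau)
    beta = np.zeros(X.shape[1])
    if S.size:
        beta[S], *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    return S, beta

def residual_bootstrap_with_selection(X, y, B=200, seed=0):
    """Generate residual-bootstrap samples from the refitted estimate and rerun
    the same selection-plus-refitting on each, recording S^(b) and beta^(b)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    S_hat, beta_hat = select_and_refit(X, y)
    resid = y - X @ beta_hat
    draws = []
    for _ in range(B):
        y_star = X @ beta_hat + resid[rng.integers(0, n, size=n)]
        draws.append(select_and_refit(X, y_star))
    return S_hat, beta_hat, draws
```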
Preprint
Full-text available
Statistical inference of the high-dimensional regression coefficients is challenging because the uncertainty introduced by the model selection procedure is hard to account for. A critical question remains unsettled; that is, is it possible, and if so how, to embed the inference of the model into the simultaneous inference of the coefficients? To this end, we propose a notion of simultaneous confidence intervals called the sparsified simultaneous confidence intervals. Our intervals are sparse in the sense that some of the intervals' upper and lower bounds are shrunken to zero (i.e., [0,0]), indicating the unimportance of the corresponding covariates. These covariates should be excluded from the final model. The rest of the intervals, either containing zero (e.g., [-1,1] or [0,1]) or not containing zero (e.g., [2,3]), indicate the plausible and significant covariates, respectively. The proposed method can be coupled with various selection procedures, making it ideal for comparing their uncertainty. For the proposed method, we establish desirable asymptotic properties, develop intuitive graphical tools for visualization, and justify its superior performance through simulation and real data analysis.
... Our main contribution in this paper is to propose bootstrap methods to adjust the p-value defined in (1.3) under both the random design and the fixed design (see, e.g., Freedman (1981) for a discussion), and establish the asymptotic size validity as well as the power consistency. We emphasize that the critical values obtained by the proposed bootstrap methods lead to proper Type I error control under both the identifiable ($\gamma_0 \neq 0$) and the nonidentifiable ($\gamma_0 = 0$) case. ...
... where $A^{\epsilon}$ is the set of all points within $\epsilon$ of the set $A$. Let $A_t = (t, \infty)$ so that the calculated p-values are $\mu(A_t)$ and $\hat{\mu}(A_t)$ when using $F$ and $F_n$, respectively. As shown in (1.4) of Bickel and Freedman (1983), ...
Preprint
Full-text available
Causal discovery procedures aim to deduce causal relationships among variables in a multivariate dataset. While various methods have been proposed for estimating a single causal model or a single equivalence class of models, less attention has been given to quantifying uncertainty in causal discovery in terms of confidence statements. The primary challenge in causal discovery is determining a causal ordering among the variables. Our research offers a framework for constructing confidence sets of causal orderings that the data do not rule out. Our methodology applies to structural equation models and is based on a residual bootstrap procedure to test the goodness-of-fit of causal orderings. We demonstrate the asymptotic validity of the confidence set constructed using this goodness-of-fit test and explain how the confidence set may be used to form sub/supersets of ancestral relationships as well as confidence intervals for causal effects that incorporate model uncertainty.
... Following the two-stage analytical procedure (Anderson & Gerbing, 1988), we first tested the measurement model and then analyzed the hypotheses. The bootstrapping technique was applied to assess the path coefficients (Freedman, 1981; Hair et al., 2014). Contemporary scholars (Abbas et al., 2014; Abbas & Raja, 2019), particularly project management scholars (Byra et al., 2021; Chen & Lin, 2018; Ul Musawir et al., 2017), prefer the contemporary approach (Preacher & Hayes, 2004) over the traditional method (Baron & Kenny, 1986). ...
Article
Full-text available
The study aims to examine the association between team emotional intelligence (EI) and team performance, particularly in construction projects, with a deliberate focus on identifying the factors that may foster or hinder team performance. To examine the hypothesized nexus of the model, dyadic data of 302 project employees and their site supervisors was collected, representing 53 teams in total. Organizations working on construction projects and relevant participants were selected through purposive sampling method. Findings of the study pronounce a positive association between team EI and team performance of engineers. Further, this study supports the mediating role of team trust between focal variables. Notably, task interdependence buffers the association between team EI and team performance. Coupled with the theoretical contribution, the study also offers valuable insights for managerial consideration, which may help them to maintain the workflow in construction projects and enhance team performance. Although EI constitutes essential resources for higher performance, prior research has not investigated whether and when team EI facilitates team performance, particularly in construction projects. We contributed to filling this gap by establishing the direct and indirect association between team EI and team performance via team trust. Moreover, the study also established task interdependence as a relevant moderator. In sum, the study at hand brings to the fore the dispositional and contextual antecedents that could potentially impact team performance, particularly in the construction project context.
... After further study of the logarithmic LD data, we realized -owing to the application of a Kolmogorov-Smirnov test -that the data failed to conform to the Gaussian distribution. To that end -and also because of the common underreporting of LD cases and other weaknesses associated with LD surveillance [21] -all inferential statistical techniques were conducted using bootstrapping with 1,000 resamples [72]. ...
Article
Full-text available
Background: Lyme disease (LD), which is a highly preventable communicable illness, is the most commonly reported vector-borne disease in the USA. The Social Vulnerability Index (SoVI) is a county-level measure of SES and vulnerability to environmental hazards or disease outbreaks, but has not yet been used in the study of LD. The purpose of this study was to determine if a relationship existed between the SoVI and LD incidence at the national level and regional division level in the United States between 2000 and 2014. Methods: County-level LD data were downloaded from the CDC. County-level SoVI were downloaded from the HVRI at the University of South Carolina and the CDC. Data were sorted into regional divisions as per the US Census Bureau and condensed into three time intervals, 2000-2004, 2005-2009, and 2010-2014. QGIS was utilized to visually represent the data. Logarithmic OLS regression models were computed to determine the predictive power of the SoVI in LD incidence rates. Results: LD incidence was greatest in the Northeastern and upper Midwestern regions of the USA. The results of the regression analyses showed that SoVI exhibited a significant quadratic relationship with LD incidence rates at the national level. Conclusion: Our results showed that counties with the highest and lowest social vulnerability were at greatest risk for LD. The SoVI may be a useful risk assessment tool for public health practitioners within the context of LD control.
... Next, we will describe how we can perform inference for the model parameters without the assumption of normality of the errors. Two bootstrap approaches for a regression model are the following (Freedman, 1981): ...
Book
Full-text available
The main aim of this book is to present both traditional Nonparametric Methodologies (such as tests based on ranks, runs, goodness-of-fit tests) and subsequent developments (estimation of the probability density function, bootstrap, nonparametric regression). Specifically, in Chapter 1 a review of basic definitions and concepts from Probability Theory and Statistics is given. Moreover, an introduction to the area of Nonparametric Statistics is provided, in which the necessity of nonparametric methodologies, their differences from their parametric counterparts and relevant areas of application are presented. Chapter 2 consists of methods and techniques for the nonparametric estimation of the cumulative distribution function and its functionals, while, in Chapter 3, we provide the main methods for the nonparametric estimation of the probability density function. Chapter 4 is devoted to goodness-of-fit tests, while Chapter 5 presents the simplest hypothesis testing techniques, those that are based on the Binomial distribution. In Chapter 6 we provide a wide variety of nonparametric techniques for statistical hypothesis testing that are based on ranks, while randomness tests are presented in Chapter 7. In Chapter 8 we provide the main nonparametric statistical measures for the correlation of two variables. The corresponding nonparametric statistical tests for correlation are also provided. Nonparametric regression techniques are discussed in Chapter 9, while resampling methods, such as jackknife and bootstrap, are presented in Chapters 10 and 11, respectively. Chapter 12 deals with basic nonparametric techniques in statistical process control, while Chapter 13 presents the application of various nonparametric methodologies using SPSS and R. In addition, we provide statistical tables in the Appendix which enable the application of the statistical techniques presented in the main part of the book. Finally, the webpage https://github.com/abatsidis/NPDataSets allows access to the datasets and R codes used in the chapters of this book.
... Bootstrapped regression, based on sampling smaller subsamples and replicating the model estimations, has been present for 30 years (since Freedman, 1981 and Wu, 1986, and more recently Hesterberg, 2015; Harris et al., 2017). It enables operating on a much narrower scale while obtaining consistent, efficient and unbiased estimates (e.g., Davison et al., 2003; Efron & Tibshirani, 1997). ...
Article
Full-text available
Spatial econometric models estimated on big geo‐located point data have at least two problems: limited computational capabilities and inefficient forecasting for new out‐of‐sample geo‐points. This is because the spatial weights matrix W is defined for in‐sample observations only, and because of the computational complexity. Machine learning models suffer the same when using kriging for predictions; thus this problem still remains unsolved. The paper presents a novel methodology for estimating spatial models on big data and predicting in new locations. The approach uses bootstrap and tessellation to calibrate both model and space. The best bootstrapped model is selected with the PAM (Partitioning Around Medoids) algorithm by classifying the regression coefficients jointly in a non‐independent manner. Voronoi polygons for the geo‐points used in the best model allow for a representative space division. New out‐of‐sample points are assigned to tessellation tiles and linked to the spatial weights matrix as a replacement for an original point, which makes it feasible to use calibrated spatial models as a forecasting tool for new locations. There is no trade‐off between forecast quality and computational efficiency in this approach. An empirical example illustrates a model for business locations and firms' profitability.
... Somewhat surprisingly, this natural approach was rarely used in statistics. Using the Mallows metric to measure the distance between variables from the original and the bootstrap process, it was implicitly employed in the context of independent random variables by Bickel and Freedman (1981) and Freedman (1981). A more explicit use of coupling was made, in the context of U- and V-statistics, but again in the independent case, by Dehling and Mikosch (1994) and Leucht and Neumann (2009). ...
Article
Full-text available
The Markov property is shared by several popular models for time series such as autoregressive or integer-valued autoregressive processes as well as integer-valued ARCH processes. A natural assumption which is fulfilled by corresponding parametric versions of these models is that the random variable at time t gets stochastically greater conditioned on the past, as the value of the random variable at time t-1 increases. Then the associated family of conditional distribution functions has a certain monotonicity property which allows us to employ a nonparametric antitonic estimator. This estimator does not involve any tuning parameter which controls the degree of smoothing and is therefore easy to apply. Nevertheless, it is shown that it attains a rate of convergence which is known to be optimal in similar cases. This estimator forms the basis for a new method of bootstrapping Markov chains which inherits the properties of simplicity and consistency from the underlying estimator of the conditional distribution function.
... • Closed-form Solutions: Classical results on the distribution of linear functions of closed-form ordinary least-squares regression solutions (or ridge regression solutions), and the study of related bootstrap estimators, are available in [BF83, Por86, Mam89, Mam93, EKP18]. While such results allow d to grow with the number of observations, they still focus on the low-dimensional setting. ...
Preprint
Stochastic gradient descent (SGD) has emerged as the quintessential method in a data scientist's toolbox. Much progress has been made in the last two decades toward understanding the iteration complexity of SGD (in expectation and high-probability) in the learning theory and optimization literature. However, using SGD for high-stakes applications requires careful quantification of the associated uncertainty. Toward that end, in this work, we establish high-dimensional Central Limit Theorems (CLTs) for linear functionals of online least-squares SGD iterates under a Gaussian design assumption. Our main result shows that a CLT holds even when the dimensionality is of order exponential in the number of iterations of the online SGD, thereby enabling high-dimensional inference with online SGD. Our proof technique involves leveraging Berry-Esseen bounds developed for martingale difference sequences and carefully evaluating the required moment and quadratic variation terms through recent advances in concentration inequalities for product random matrices. We also provide an online approach for estimating the variance appearing in the CLT (required for constructing confidence intervals in practice) and establish consistency results in the high-dimensional setting.
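For readers unfamiliar with the estimator being analysed, here is a minimal R sketch of online least-squares SGD of the kind studied in the preprint: one observation per iteration under a Gaussian design. The dimensions, step-size schedule and data are illustrative assumptions, not the preprint's exact setup.

```r
# Online least-squares SGD:
# theta_t = theta_{t-1} - eta_t * (x_t' theta_{t-1} - y_t) * x_t
set.seed(3)
d <- 50; n_iter <- 5000
theta_star <- rnorm(d) / sqrt(d)     # hypothetical true parameter
theta <- numeric(d)

for (t in 1:n_iter) {
  x <- rnorm(d)                      # Gaussian design, one new point per step
  y <- sum(x * theta_star) + rnorm(1)
  eta <- 0.5 / (t + 10)              # decaying step size (illustrative)
  theta <- theta - eta * (sum(x * theta) - y) * x
}

# A linear functional of the final iterate, e.g. its first coordinate:
c(estimate = theta[1], truth = theta_star[1])
```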
... To produce accurate SDMs, not only the effects of different algorithms should be investigated, but also the effects of resampling techniques on a model's training data (Efron, 1982; Freedman, 1981). Specifically, bootstrapping (i.e., random subsampling with replacement) the training data has been shown to increase models' precision by providing a combination of models, which reduces stochastic errors in estimation (e.g., Vaughan and Ormerod (2005); Hefley et al. (2014); Xu and Goodacre (2018)). ...
... Bootstrapping in high dimensions has a rich history and recent offerings. Consider Bickel and Freedman (1983), Mammen (1993), and Kato (2013, 2017) and their references. ...
Preprint
Full-text available
We propose a test of many zero parameter restrictions in a high dimensional linear iid regression model. The test statistic is formed by estimating key parameters one at a time based on many low dimension regression models with nuisance terms. The parsimoniously parametrized models identify whether the original parameter of interest is or is not zero. Estimating fixed low dimension sub-parameters ensures greater estimator accuracy, does not require a sparsity assumption, and using only the largest in a sequence of weighted estimators reduces test statistic complexity and therefore estimation error. We provide a parametric wild bootstrap for p-value computation, and prove the test is consistent and has non-trivial root(n/ln(n))-local-to-null power.
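Since the abstract refers to a (parametric) wild bootstrap for p-value computation, a generic wild-bootstrap sketch for a single zero restriction may help fix ideas; it is not the authors' procedure, and the data and variable names are hypothetical. Resampled outcomes keep the design fixed and perturb null-model residuals with Rademacher multipliers.

```r
# Wild bootstrap p-value for H0: beta_x2 = 0 in y ~ x1 + x2.
set.seed(4)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.4 * dat$x1 + rnorm(n) * (1 + abs(dat$x1))   # heteroskedastic errors

t_obs <- summary(lm(y ~ x1 + x2, data = dat))$coefficients["x2", "t value"]

fit0 <- lm(y ~ x1, data = dat)                             # null model
B <- 999
t_boot <- replicate(B, {
  v  <- sample(c(-1, 1), n, replace = TRUE)                # Rademacher weights
  yb <- fitted(fit0) + residuals(fit0) * v                 # design kept fixed
  summary(lm(yb ~ x1 + x2, data = dat))$coefficients["x2", "t value"]
})
mean(abs(t_boot) >= abs(t_obs))                            # two-sided bootstrap p-value
```

Because the multipliers preserve each residual's magnitude, this scheme remains valid under heteroskedasticity, which is why wild variants are often preferred to plain residual resampling.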
... For all prediction techniques, including non-parametric and machine learning techniques, the sampling variability component of Eq. (4), Var_sam(μ̂), can be estimated using a standard 5-step pairs bootstrap approach (Efron and Tibshirani, 1994, p. 113; Freedman, 1981): ...
Article
Statistically rigorous inferences in the form of confidence intervals for map-based estimates require model-based inferential methods. Model-based mean square errors (MSEs) incorporate estimates of both residual variability and sampling variability, the latter of which includes population unit variance estimates and pairwise population unit covariance estimates. Bootstrapping, which can be used with any prediction technique, provides a means of estimating the required variances and covariances. The objectives of the study were to demonstrate a method for estimating the sampling variability, Var_sam(μ̂), that can be used with all prediction techniques; to develop an efficient method that map makers can use to disseminate metadata that facilitates calculation of Var_sam(μ̂) for arbitrary subregions of maps; and to estimate the individual contributions of sampling variability and residual variability to the overall standard error of the prediction of the population mean. The primary results were fourfold: (i) map makers must provide metadata that facilitate estimation of population unit variances and covariances for arbitrary map subregions, (ii) bootstrapping was demonstrated to be an effective means of estimating the variances and covariances, (iii) the very large matrix of pairwise population unit covariances can be aggregated into a much smaller matrix that can be readily communicated by map makers to map users, and (iv) MSEs that include only estimates of residual variability and/or estimates of population unit variances, but not estimates of the pairwise population unit covariances, grossly under-estimate the actual MSEs.
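A hedged sketch of the central computation described above, not the paper's exact procedure: the pairs bootstrap is applied to the training sample, predictions are made for every population unit in each replicate, and the covariance matrix of the unit-level predictions is estimated across replicates. All data and names are hypothetical.

```r
# Bootstrap the fitting step, predict for all population units each time,
# and estimate the covariance matrix of the unit-level predictions.
set.seed(5)
n_sample <- 150; n_pop <- 400
pop        <- data.frame(x = runif(n_pop))              # population units (e.g. map cells)
sample_dat <- data.frame(x = runif(n_sample))
sample_dat$y <- 3 + 2 * sample_dat$x + rnorm(n_sample)  # hypothetical training sample

B <- 500
pred_mat <- replicate(B, {
  idx <- sample(n_sample, replace = TRUE)               # pairs bootstrap of the sample
  fit <- lm(y ~ x, data = sample_dat[idx, ])
  predict(fit, newdata = pop)
})                                                      # n_pop x B matrix of predictions

cov_units <- cov(t(pred_mat))                           # population-unit covariance matrix
var_mean  <- mean(cov_units)                            # sampling variance of the map mean
```

The last line uses the identity that the variance of the mean of the unit predictions equals the average of all entries of their covariance matrix, which is why the pairwise covariances cannot be ignored.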
... bootstrapping method [58] is applied to generate 1,000 realizations of the BN model based on resampling with replacement. The bootstrapping variances are used to calculate the error bars shown in Fig. 4 and Fig. S3 in the SI. ...
... Moreover, we report, for each model specification, the likelihood ratio test statistic to formally test for departure from the "null" model (which only includes the intercept), accompanied by its associated p-value. Finally, 90% and 95% bootstrap-based (percentile) confidence intervals, obtained through resampling the data with replacement, or "pair" bootstrap (Freedman, 1981), are reported for the model coefficients to take into account the uncertainty of the regressions (Carpenter and Bithell, 2000; Davison and Hinkley, 1997). To summarise, after resampling the data, bootstrapped estimators β̂* are computed, and this procedure is repeated B = 9999 times. ...
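A minimal R sketch of the pair-bootstrap percentile intervals described in that excerpt, with illustrative data and a reduced number of replicates:

```r
# Percentile confidence intervals for a regression coefficient via the
# pair ("case") bootstrap: resample rows, refit, take empirical quantiles.
set.seed(6)
n <- 250
dat <- data.frame(x = rnorm(n)); dat$y <- 0.7 * dat$x + rt(n, df = 4)

B <- 1999
beta_star <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
})
quantile(beta_star, c(0.025, 0.975))   # 95% percentile interval
quantile(beta_star, c(0.050, 0.950))   # 90% percentile interval
```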
Article
Full-text available
Social start-ups constitute a subclass of the innovative entrepreneurship market and they are a relatively new topic in the literature from both the scientific and normative perspectives. In Italy, a start-up can be registered as social-oriented innovative start-up (SIAVS), after demonstrating its ability to make social impacts and satisfy a set of normative requirements. This paper describes and models (at the territorial level) the presence of these innovative companies focused on social, cultural, and/or environmental needs. We combine data on the number of SIAVS and other start-ups in Italy with other socio-economic quantities at the territorial level (NUTS 3 regions). Variables such as population density, metropolitan cities, number of certified incubators and percentage of non-profit organisations are identified as determinants of new social enterprises through the application of generalised linear models, while also considering the presence of zero inflation
... To assess the performance of the estimators, we use the relative efficiency of the mean squared error (RMSE) with respect to the true parameters θ, which will be estimated by θ̂*, where θ̂* can be any of the estimators. The approach is based on the bootstrapping method, which is similar to that introduced by Freedman (1981). ...
Article
Full-text available
The Autoregressive Conditionally Heteroscedastic (ARCH) model is useful for handling volatilities in economic time series phenomena that ARIMA models are unable to handle. The ARCH model has been adopted in many applications involving time series data, such as financial market prices, options, commodity prices and the oil industry. In this paper, we propose an improved post-selection estimation strategy. We investigated and developed some asymptotic properties of the suggested strategies and compared them with a benchmark estimator. Furthermore, we conducted a Monte Carlo simulation study to reappraise the relative characteristics of the listed estimators. Our numerical results corroborate the analytical work of the study. We applied the proposed methods to the S&P500 stock market daily closing price index to illustrate the usefulness of the developed methodologies.
... One of the most important and frequent types of statistical analysis is regression analysis, in which we study the effects of explanatory variables on a response variable. The use of the jackknife and bootstrap to estimate the sampling distribution of the parameter estimates in the linear regression model was first proposed by Efron (1979) and further developed by Freedman (1981) and Wu (1986). There has been considerable interest in recent years in the use of the jackknife and bootstrap in the regression context. In this study, we focus on the accuracy of the jackknife and bootstrap resampling methods in estimating the distribution of the regression parameters across different sample sizes and different numbers of bootstrap replications. ...
Article
Statistical inference is generally based on estimates that are functions of the data. Resampling methods offer strategies to estimate or approximate the sampling distribution of a statistic. In this article, two resampling methods are studied, the jackknife and the bootstrap; the main objective is to examine the accuracy of these methods in estimating the distribution of the regression parameters across different sample sizes and different numbers of bootstrap replications. Keywords: Jackknife, Bootstrap, Multiple regression, Bias, Variance.
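A short sketch of the comparison these two entries describe: delete-one jackknife versus pairs-bootstrap standard errors for a regression slope, on data simulated purely for illustration.

```r
# Delete-one jackknife vs. pairs bootstrap standard errors of a slope.
set.seed(7)
n <- 100
dat <- data.frame(x = rnorm(n)); dat$y <- 1 + 0.5 * dat$x + rnorm(n)

slope <- function(d) coef(lm(y ~ x, data = d))["x"]

# Jackknife: leave each observation out once.
jack    <- sapply(1:n, function(i) slope(dat[-i, ]))
se_jack <- sqrt((n - 1) / n * sum((jack - mean(jack))^2))

# Bootstrap: resample rows with replacement.
boot    <- replicate(2000, slope(dat[sample(n, replace = TRUE), ]))
se_boot <- sd(boot)

c(jackknife = se_jack, bootstrap = se_boot,
  lm = summary(lm(y ~ x, dat))$coefficients["x", "Std. Error"])
```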
Article
Full-text available
Linear models have been a powerful econometric tool used to show the relationship between two or more variables. Many studies also use a linear approximation for nonlinear cases, as it may still yield valid results. The OLS method requires the relationship between the dependent and independent variables to be linear, although many studies employ the OLS approximation even for nonlinear cases. In this study, we introduce an alternative method of interval estimation, the bootstrap, for linear regressions when the relationship is nonlinear. We compare the traditional and bootstrap confidence intervals when the data have a nonlinear relationship. Because we need to know the true parameters, we carry out a simulation study. Our findings indicate that when the error term is non-normal, the bootstrap interval outperforms the traditional method, since it imposes no distributional assumption and produces wider intervals.
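A minimal simulation sketch of the kind of comparison this abstract describes, under assumptions of my own choosing (one scenario, skewed mean-zero errors, a known true slope); coverage of the classical t-interval and the pairs-bootstrap percentile interval is compared.

```r
# Coverage of the classical t-interval vs. the percentile bootstrap interval
# for the slope when errors are skewed (single illustrative scenario).
set.seed(8)
true_beta <- 1; n <- 60; n_sims <- 300; B <- 499
cover <- matrix(NA, n_sims, 2, dimnames = list(NULL, c("classical", "bootstrap")))

for (s in 1:n_sims) {
  x <- runif(n); y <- true_beta * x + (rexp(n) - 1)   # skewed, mean-zero errors
  fit  <- lm(y ~ x)
  ci_t <- confint(fit)["x", ]
  beta_star <- replicate(B, {
    idx <- sample(n, replace = TRUE)
    coef(lm(y[idx] ~ x[idx]))[2]
  })
  ci_b <- quantile(beta_star, c(0.025, 0.975))
  cover[s, ] <- c(ci_t[1] <= true_beta & true_beta <= ci_t[2],
                  ci_b[1] <= true_beta & true_beta <= ci_b[2])
}
colMeans(cover)    # empirical coverage of the two interval types
```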
Article
Full-text available
The purpose of this paper was to investigate the performance of parametric bootstrap data generating process (DGP) methods and to compare parametric and nonparametric bootstrap DGP methods for estimating the standard error of simple linear regression (SLR) under various assessment conditions. When the performance of the parametric bootstrap method was investigated, simple linear models were employed to fit the data. With consideration of different bootstrap levels and sample sizes, a total of twelve parametric bootstrap models were examined. Three hypothetical datasets and one real dataset were used as the basis to define the population distributions and the "true" SEEs. A bootstrap study was conducted on different parametric and nonparametric bootstrap DGP methods reflecting three levels of group proficiency differences, three sample sizes, three test lengths and three bootstrap levels. Bias, standard errors and root mean square errors of the SLR estimates were calculated and used to evaluate and compare the bootstrap results. The main findings were as follows: (i) the parametric bootstrap DGP models with larger bootstrap levels generally produced smaller bias, as did larger sample sizes; (ii) the parametric bootstrap models with a higher bootstrap level generally yielded more accurate estimates of the standard error than the corresponding models with a lower bootstrap level; (iii) the nonparametric bootstrap method generally produced less accurate estimates of the standard error than the parametric bootstrap method, although the differences between the two methods became smaller as the sample size increased, and when the sample size was equal to or larger than 3,000 (say 10,000), the differences between the nonparametric bootstrap DGP method and the parametric bootstrap DGP model that produced the smallest RMSE were very small; (iv) of all the models considered in this paper, parametric bootstrap DGP models with higher bootstrap levels performed better under most conditions; (v) aside from method effects, sample size and test length had the most impact on the estimation of the simple linear regression standard errors.
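To make the parametric/nonparametric contrast concrete, here is a hedged sketch (not the paper's design) of two residual-based bootstrap DGPs for the slope standard error of a simple linear regression: one draws new errors from a fitted normal distribution, the other resamples the observed residuals.

```r
# Parametric vs. nonparametric residual bootstrap for the SLR slope SE.
set.seed(9)
n <- 80
x <- runif(n); y <- 2 + 1.5 * x + rnorm(n, sd = 0.8)
fit <- lm(y ~ x); res <- residuals(fit); sigma_hat <- summary(fit)$sigma

B <- 2000
boot_slope <- function(new_resid) {
  yb <- fitted(fit) + new_resid            # keep the design fixed
  coef(lm(yb ~ x))[2]
}
se_param    <- sd(replicate(B, boot_slope(rnorm(n, sd = sigma_hat))))    # parametric DGP
se_nonparam <- sd(replicate(B, boot_slope(sample(res, replace = TRUE)))) # nonparametric DGP
c(parametric = se_param, nonparametric = se_nonparam)
```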
Article
Full-text available
Background Accurate segmentation of lung nodules is crucial for the early diagnosis and treatment of lung cancer in clinical practice. However, the similarity between lung nodules and surrounding tissues has made their segmentation a longstanding challenge. Purpose Existing deep learning and active contour models each have their limitations. This paper aims to integrate the strengths of both approaches while mitigating their respective shortcomings. Methods In this paper, we propose a few‐shot segmentation framework that combines a deep neural network with an active contour model. We introduce heat kernel convolutions and high‐order total variation into the active contour model and solve the challenging nonsmooth optimization problem using the alternating direction method of multipliers. Additionally, we use the presegmentation results obtained from training a deep neural network on a small sample set as the initial contours for our optimized active contour model, addressing the difficulty of manually setting the initial contours. Results We compared our proposed method with state‐of‐the‐art methods for segmentation effectiveness using clinical computed tomography (CT) images acquired from two different hospitals and the publicly available LIDC dataset. The results demonstrate that our proposed method achieved outstanding segmentation performance according to both visual and quantitative indicators. Conclusion Our approach utilizes the output of few‐shot network training as prior information, avoiding the need to select the initial contour in the active contour model. Additionally, it provides mathematical interpretability to the deep learning, reducing its dependency on the quantity of training samples.
Conference Paper
Full-text available
In the last few years, control engineers have started to use artificial neural networks (NNs) embedded in advanced feedback control algorithms. Their natural integration into existing controllers, such as programmable logic controllers (PLCs), or close to them, represents a challenge. Moreover, the application of these algorithms in critical applications still raises concerns among control engineers due to the lack of safety guarantees. Building trustworthy NNs is still a challenge and their verification is attracting more attention nowadays. This paper discusses the peculiarities of the formal verification of NN controllers running on PLCs. It outlines a set of properties that should be satisfied by a NN that is intended to be deployed in a critical high-availability installation at CERN. It compares different methods to verify this NN and sketches our future research directions to find a safe NN. Keywords: Verification of neural networks; PLCs; Control system
Preprint
Full-text available
Statistical data simulation is essential in the development of statistical models and methods as well as in the evaluation of their performance. To capture complex data structures, in particular for high-dimensional data, a variety of simulation approaches have been introduced, including parametric and the so-called plasmode simulations. While there are concerns about the realism of parametrically simulated data, it is widely claimed that plasmodes come very close to reality, with some aspects of the "truth" known. However, there are no explicit guidelines or established state of the art on how to perform plasmode data simulations. In the present paper, we first review the existing literature and introduce the concept of statistical plasmode simulation. We then discuss advantages and challenges of statistical plasmodes and provide a step-wise procedure for their generation, including key steps for their implementation and reporting. Finally, we illustrate the concept of statistical plasmodes as well as the proposed plasmode generation procedure by means of a public real RNA dataset on breast carcinoma patients.
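A minimal illustration of the plasmode idea under one common set of assumptions (not the paper's step-wise procedure): the observed covariates are kept fixed and only the outcome is regenerated from a model fitted to the real data. The data frame here is a simulated stand-in for a real dataset.

```r
# Plasmode-style simulation sketch: keep observed covariates, fit an outcome
# model to the real data, then repeatedly regenerate outcomes from that fit.
set.seed(10)
real <- data.frame(x1 = rnorm(300), x2 = rbinom(300, 1, 0.4))   # stand-in for real covariates
real$y <- 1 + 0.6 * real$x1 - 0.4 * real$x2 + rnorm(300)        # stand-in for real outcome

fit <- lm(y ~ x1 + x2, data = real)        # the fitted model now plays the role of the "truth"
sigma_hat <- summary(fit)$sigma

simulate_plasmode <- function() {
  sim <- real                                              # covariates kept as observed
  sim$y <- fitted(fit) + rnorm(nrow(real), sd = sigma_hat) # outcome regenerated
  sim
}
plasmodes <- replicate(100, simulate_plasmode(), simplify = FALSE)  # list of plasmode datasets
```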
Conference Paper
Full-text available
Uncertainty is unavoidable in flood modeling practices and should be properly communicated. There is a variety of methods and techniques for uncertainty analysis, but they normally require a large number of hydrological/hydrodynamic model realizations. Among these, the bootstrap is a popular technique for making statistical inferences from a limited (or small) number of realizations without imposing strong structural assumptions. This study critically assesses the applicability of the bootstrap method for assessing the uncertainty in flood mapping and compares the results with those obtained from the Monte Carlo method. The results challenge the applicability of the bootstrap method as an alternative to more computationally intensive methods such as Monte Carlo. Furthermore, the results suggest that the variation of the mean parameter, which is typically used as a convergence criterion in uncertainty analysis, can lead to early stopping of the process and consequently to wrong statistical inferences.
Article
Objectives Capturing frailty using a quick tool has proven to be challenging. We hypothesise that this is due to the complex interactions between frailty domains. We aimed to identify these interactions and to assess whether adding interactions between domains improves the prediction of mortality. Methods In this retrospective cohort study, we selected all patients aged 70 or older who were admitted to one Dutch hospital between April 2015 and April 2016. Patient characteristics, frailty screening (using VMS (Safety Management System), a screening tool used in Dutch hospital care), length of stay, and mortality within three months were retrospectively collected from electronic medical records. To identify predictive interactions between the frailty domains, we constructed a classification tree with mortality as the outcome using five variables: the four VMS domains (delirium risk, fall risk, malnutrition, physical impairment) and their sum. To determine whether any domain interactions were predictive of three-month mortality, we performed a multivariable logistic regression analysis. Results We included 4,478 patients (median age: 79 years; maximum age: 101 years; 44.8% male). The highest risk of three-month mortality was found for patients who were physically impaired and malnourished (23% (95%-CI 19.0-27.4%)). Subgroups had comparable three-month mortality risks based on different domains: malnutrition without physical impairment (15.2% (95%-CI 12.4-18.6%)) and physical impairment and delirium risk without malnutrition (16.3% (95%-CI 13.7-19.2%)). Discussion We showed that taking interactions between domains into account improves the prediction of three-month mortality risk. Therefore, when screening for frailty, simply adding up domains with a cut-off score results in a loss of valuable information.
Article
The most common and well-known meta-regression models work under the assumption that there is only one effect size estimate per study and that the estimates are independent. However, meta-analytic reviews of social science research often include multiple effect size estimates per primary study, leading to dependence in the estimates. Some meta-analyses also include multiple studies conducted by the same lab or investigator, creating another potential source of dependence. An increasingly popular method to handle dependence is robust variance estimation (RVE), but this method can result in inflated Type I error rates when the number of studies is small. Small-sample correction methods for RVE have been shown to control Type I error rates adequately but may be overly conservative, especially for tests of multiple-contrast hypotheses. We evaluated an alternative method for handling dependence, cluster wild bootstrapping, which has been examined in the econometrics literature but not in the context of meta-analysis. Results from two simulation studies indicate that cluster wild bootstrapping maintains adequate Type I error rates and provides more power than extant small-sample correction methods, particularly for multiple-contrast hypothesis tests. We recommend using cluster wild bootstrapping to conduct hypothesis tests for meta-analyses with a small number of studies. We have also created an R package that implements such tests.
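A generic sketch of the cluster wild bootstrap idea, not the authors' package or simulation design: all residuals belonging to the same cluster (here, the same study) are multiplied by a single Rademacher draw before the test statistic is recomputed. For brevity it uses an ordinary t statistic rather than a robust variance estimator, and the toy effect-size data are invented.

```r
# Cluster wild bootstrap for a study-level moderator in a toy meta-regression:
# one Rademacher weight per study multiplies all of that study's null residuals.
set.seed(11)
n_studies <- 15; k_per <- 4
study <- rep(1:n_studies, each = k_per)
mod   <- rep(rnorm(n_studies), each = k_per)                 # study-level moderator
es    <- 0.2 + rnorm(n_studies)[study] * 0.3 +               # correlated effect sizes
         rnorm(n_studies * k_per, sd = 0.2)

t_obs <- summary(lm(es ~ mod))$coefficients["mod", "t value"]
fit0  <- lm(es ~ 1)                                          # null model (no moderator)

B <- 999
t_boot <- replicate(B, {
  w   <- sample(c(-1, 1), n_studies, replace = TRUE)[study]  # one weight per cluster
  esb <- fitted(fit0) + residuals(fit0) * w
  summary(lm(esb ~ mod))$coefficients["mod", "t value"]
})
mean(abs(t_boot) >= abs(t_obs))                              # cluster wild bootstrap p-value
```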