Article

Abstract

In this paper, we develop an efficient moments-based permutation test approach to improve the test's computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency.
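As a minimal sketch of what "moments without any permutation" can look like, the Python below handles only the simplest case, which is an assumption of this example and not the paper's recursive method: a linear statistic T = sum_i c_i * x_pi(i) and its first two moments. The closed forms used are the standard ones for a uniformly random permutation, and the function names are hypothetical. The third and fourth moments needed for a Pearson fit are not covered here.

```python
from itertools import permutations


def linear_perm_moments(c, x):
    """Exact mean and variance of T = sum_i c[i] * x[pi(i)] over all permutations pi."""
    n = len(c)
    c_bar = sum(c) / n
    x_bar = sum(x) / n
    mean = n * c_bar * x_bar
    ss_c = sum((ci - c_bar) ** 2 for ci in c)
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    var = ss_c * ss_x / (n - 1)
    return mean, var


def brute_force_moments(c, x):
    """Same two moments by exhaustive enumeration (only feasible for tiny n)."""
    values = [sum(ci * xi for ci, xi in zip(c, p)) for p in permutations(x)]
    mean = sum(values) / len(values)
    var = sum((t - mean) ** 2 for t in values) / len(values)
    return mean, var


# Two-group mean difference written as a linear statistic (illustrative data).
coeffs = [1 / 3, 1 / 3, 1 / 3, -1 / 3, -1 / 3, -1 / 3]
data = [2.0, 3.5, 1.0, 4.2, 0.7, 5.1]
analytic = linear_perm_moments(coeffs, data)
enumerated = brute_force_moments(coeffs, data)
```

The analytic route costs O(n) operations, whereas the enumeration visits all n! arrangements; the two should agree up to floating-point rounding.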
... Hypothesis testing has been widely used in neuroimaging data analysis, such as morphometry analysis [1][2][3][4][5], brain activation detection and inference [6][7][8][9][10], and functional integration and connectivity [11]. Traditionally, brain imaging researchers perform statistical analysis by using parametric hypothesis testing, including the commonly used F test, t test, Z test and Hotelling's $T^2$ test [6][10][11][12]. ...
... However, this method requires the derivation of the permutation for each specific test statistic, which is not easily accessible in real-world scenarios. Recently, we have been developing a new, computationally efficient and more general recursive algorithm to calculate the moments of the permutation distribution by a simple sum-product of data partition sums and index partition sums [5]. The data partition sums and index partition sums are computed recursively, from the simplest to the most complex sum. ...
... Given the first four moments, the permutation distribution can be well fitted by the Pearson distribution series [22]. Extensive validation of accuracy or error rate when the Pearson distribution is used to approximate the permutation distribution has been performed in our previous work [5,13]. ...
Article
In this paper, we present a new blockwise permutation test approach based on the moments of the test statistic. The method is of importance to neuroimaging studies. In order to preserve the exchangeability condition required in permutation tests, we divide the entire set of data into certain exchangeability blocks. In addition, computationally efficient moments-based permutation tests are performed by approximating the permutation distribution of the test statistic with the Pearson distribution series. This involves the calculation of the first four moments of the permutation distribution within each block and then over the entire set of data. The accuracy and efficiency of the proposed method are demonstrated through simulated experiment on the magnetic resonance imaging (MRI) brain data, specifically the multi-site voxel-based morphometry analysis from structural MRI (sMRI).
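The exchangeability-block idea can be illustrated with a short Python sketch. It shows only the resampling constraint (values are shuffled within a block and never across blocks), not the moments-based computation the abstract describes; the function name and the site labels are hypothetical.

```python
import random


def permute_within_blocks(values, block_ids, rng):
    """Return a copy of `values` shuffled only among positions sharing a block id."""
    positions_by_block = {}
    for pos, block in enumerate(block_ids):
        positions_by_block.setdefault(block, []).append(pos)

    permuted = list(values)
    for positions in positions_by_block.values():
        shuffled = [values[p] for p in positions]
        rng.shuffle(shuffled)
        # Write the shuffled values back into the same block's positions.
        for p, v in zip(positions, shuffled):
            permuted[p] = v
    return permuted


# Illustrative data: two acquisition sites treated as exchangeability blocks.
values = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
block_ids = ["site_a", "site_a", "site_a", "site_b", "site_b", "site_b"]
permuted = permute_within_blocks(values, block_ids, random.Random(0))
```

Every permuted data set keeps each block's values inside that block, which is the property the blockwise test relies on.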
... The higher k, the more complicated interactions among observations can be modeled in the weighted v-statistic. Machine learning researchers have already used weighted v-statistics in hypothesis testing, density estimation, dependence measurement, data pre-processing, and classification [6, 14, 19, 21]. Traditionally, estimation of resampling statistics is solved by random sampling since exhaustive examination of the resampling space is usually ill-advised [5,16]. ...
... There is a tradeoff between accuracy and computational cost with random sampling. To date, there is no systematic and efficient solution to the issue of exact calculation of resampling statistics. Recently, Zhou et al. [21] proposed a recursive method to derive moments of permutation distributions (i.e., the empirical distribution generated by resampling without replacement). The key strategy is to divide the whole index set (i.e., indices of all possible k observations) into several permutation equivalent index subsets such that the summation of the data/in ...
... Therefore, moments are obtained by summing up several subtotals. However, methods for listing all permutation equivalent index subsets and calculating the respective cardinalities were not emphasized in the previous publication [21]. There is also no systematic way to obtain coefficients in the recursive relationship. ...
Conference Paper
In this paper, a novel and computationally fast algorithm for computing weighted v-statistics in resampling both univariate and multivariate data is proposed. To avoid any real resampling, we have linked this problem with finite group action and converted it into a problem of orbit enumeration. For further computational cost reduction, an efficient method is developed to list all orbits by their symmetry orders and calculate all index function orbit sums and data function orbit sums recursively. The computational complexity analysis shows reduction in the computational cost from the $n!$ or $n^n$ level to a low-order polynomial level.
... Permutation tests obtain p-values from permutation distributions of a test statistic, rather than from parametric distributions. They belong to the nonparametric "distribution-free" category of hypothesis testing and are thus flexible, and have been used successfully in biomedical image analysis (Nichols & Holmes, 2001; Pantazis et al., 2004; Zhou et al., 2009). One way to construct the permutation distribution is through exact permutation which enumerates all possible arrangements. ...
... Here, the term "linear test statistic" refers to a linear function of test statistic coefficients, instead of that of data. An extension of the method to the general weighted v-statistics has also been developed recently in (Zhou et al., 2009). The key idea is to separate the moments of permutation distribution into two parts, permutation of test statistic coefficients and function of the data. ...
... However, the authors noted, "In order to construct a hypothesis test for the GPC statistics, any use of the means and variances would require the assumption of asymptotic normality," an assumption which is not needed for our proposed methodology. Relatedly, a wide variety of works accelerate testing by fitting an approximation to the permutation distribution [see, e.g., 53,26,27,43,21]. However, these works do not analyze the power of their approximate tests, and, in each case, the approximation sacrifices finite-sample validity. ...
Preprint
Permutation tests are a popular choice for distinguishing distributions and testing independence, due to their exact, finite-sample control of false positives and their minimax optimality when paired with U-statistics. However, standard permutation tests are also expensive, requiring a test statistic to be computed hundreds or thousands of times to detect a separation between distributions. In this work, we offer a simple approach to accelerate testing: group your datapoints into bins and permute only those bins. For U and V-statistics, we prove that these cheap permutation tests have two remarkable properties. First, by storing appropriate sufficient statistics, a cheap test can be run in time comparable to evaluating a single test statistic. Second, cheap permutation power closely approximates standard permutation power. As a result, cheap tests inherit the exact false positive control and minimax optimality of standard permutation tests while running in a fraction of the time. We complement these findings with improved power guarantees for standard permutation testing and experiments demonstrating the benefits of cheap permutations over standard maximum mean discrepancy (MMD), Hilbert-Schmidt independence criterion (HSIC), random Fourier feature, Wilcoxon-Mann-Whitney, cross-MMD, and cross-HSIC tests.
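A rough Python sketch of the "permute only the bins" idea follows. It is not the preprint's implementation: it assumes a plain difference-in-means statistic, equal sample sizes, and equal-size consecutive bins, and all function names are hypothetical. The bin sums play the role of the stored sufficient statistics, so each permuted statistic costs time proportional to the number of bins and not the number of datapoints.

```python
import random


def bin_sums(sample, num_bins):
    """Sum consecutive equal-size bins of `sample`."""
    if len(sample) % num_bins != 0:
        raise ValueError("sample size must be a multiple of num_bins")
    size = len(sample) // num_bins
    return [sum(sample[i * size:(i + 1) * size]) for i in range(num_bins)]


def binned_perm_pvalue(sample_a, sample_b, num_bins, num_perms, rng):
    """One-sided p-value for mean(a) - mean(b), permuting bins instead of points."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(sample_a)
    sums = bin_sums(sample_a, num_bins) + bin_sums(sample_b, num_bins)

    def statistic(s):
        # The first num_bins bins act as group A, the remaining bins as group B.
        return (sum(s[:num_bins]) - sum(s[num_bins:])) / n

    observed = statistic(sums)  # equals the full-data mean difference
    hits = 0
    for _ in range(num_perms):
        rng.shuffle(sums)
        if statistic(sums) >= observed:
            hits += 1
    # Counting the observed arrangement keeps the p-value strictly positive.
    return (hits + 1) / (num_perms + 1)


sample_a = [2.1, 2.9, 3.4, 2.2, 3.1, 2.8, 3.3, 2.6]
sample_b = [1.9, 2.0, 2.4, 1.7, 2.2, 2.5, 1.8, 2.1]
p_binned = binned_perm_pvalue(sample_a, sample_b, num_bins=4,
                              num_perms=199, rng=random.Random(0))
```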
... Many other approximation methods have been proposed for permutation tests. For instance, Zhou et al. (2009) fit approximations by moments in the Pearson family. Larson and Owen (2015) fit Gaussian and beta approximations to linear statistics and gamma approximations to quadratic statistics for gene set testing problems. ...
Preprint
It is common for genomic data analysis to use p-values from a large number of permutation tests. The multiplicity of tests may require very tiny p-values in order to reject any null hypotheses and the common practice of using randomly sampled permutations then becomes very expensive. We propose an inexpensive approximation to p-values for two sample linear test statistics, derived from Stolarsky's invariance principle. The method creates a geometrically derived set of approximate p-values for each hypothesis. The average of that set is used as a point estimate $\hat p$ and our generalization of the invariance principle allows us to compute the variance of the p-values in that set. We find that in cases where the point estimate is small the variance is a modest multiple of the square of the point estimate, yielding a relative error property similar to that of saddlepoint approximations. On a Parkinson's disease data set, the new approximation is faster and more accurate than the saddlepoint approximation. We also obtain a simple probabilistic explanation of Stolarsky's invariance principle.
... RMSEs for $\hat p_1$ and $\hat p_2$ under Models 1 and 2. The x-axis shows the estimate $\hat p$ as $\rho$ varies from 1 to 0. Here $m_0 = m_1$. Plots with $m_0 = m_1$ are similar. 8. Comparison to saddlepoint approximation. Many approximation methods have been proposed for permutation tests. Zhou et al. (2009) fit approximations by moments in the Pearson family. Larson and Owen (2015) fit Gaussian and beta approximations to linear statistics and gamma approximations to quadratic statistics for gene set testing problems. Knijnenburg et al. (2009) fit generalized extreme value distributions to the tails of sampled permutation values. ...
Article
When it is necessary to approximate a small permutation p-value $p$, then simulation is very costly. For linear statistics, a Gaussian approximation $\hat p_1$ reduces to the volume of a spherical cap. Using Stolarsky's (1973) invariance principle from discrepancy theory, we get a formula for the mean of $(\hat p_1 - p)^2$ over all spherical caps. From a theorem of Brauchart and Dick (2013) we get such a formula averaging only over spherical caps of volume exactly $\hat p_1$. We also derive an improved estimator $\hat p_2$ equal to the average true p-value over spherical caps of size $\hat p_1$ containing the original data point $\boldsymbol{x}_0$ on their boundary. This prevents $\hat p_2$ from going below $1/N$ when there are $N$ unique permutations. We get a formula for the mean of $(\hat p_2 - p)^2$ and find numerically that the root mean squared error of $\hat p_2$ is roughly proportional to $\hat p_2$ and much smaller than that of $\hat p_1$.
... Like us, Zhou et al. (2009) have used a beta distribution to approximate a permutation. They used the first 4 moments of a Pearson curve for their approach. ...
Article
Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests. We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing $|G|$ permutations, where $|G|$ is the number of genes in set G. For the quadratic statistics, the cost is on the order of $|G|^2$ permutations, which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared. We have developed a moment-based approximation to the permutation distribution of linear and quadratic gene set test statistics. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations. We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org).
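The 1/(M+1) floor mentioned above follows from counting the observed arrangement alongside the M sampled ones. The Python sketch below illustrates this with a two-sample mean difference, which is an assumption made for brevity and not one of the gene set statistics studied in the abstract; the function names and data are hypothetical.

```python
import random


def mean_difference(sample_a, sample_b):
    return sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)


def monte_carlo_perm_pvalue(sample_a, sample_b, num_perms, rng):
    """One-sided Monte Carlo permutation p-value using the (b + 1) / (M + 1) rule."""
    observed = mean_difference(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(num_perms):
        rng.shuffle(pooled)
        if mean_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Even with zero hits the estimate cannot drop below 1 / (num_perms + 1).
    return (at_least_as_extreme + 1) / (num_perms + 1)


p_value = monte_carlo_perm_pvalue(
    [5.1, 6.0, 5.8, 6.3], [1.2, 0.9, 1.5, 1.1],
    num_perms=199, rng=random.Random(1),
)
```

Reaching a threshold such as 1e-6 this way needs on the order of a million permutations per test, which is the cost that moment-based approximations aim to avoid.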
Article
A number of biomedical problems require performing many hypothesis tests, with an attendant need to apply stringent thresholds. Often the data take the form of a series of predictor vectors, each of which must be compared with a single response vector, perhaps with nuisance covariates. Parametric tests of association are often used, but can result in inaccurate type I error at the extreme thresholds, even for large sample sizes. Furthermore, standard two-sided testing can reduce power compared with the doubled p-value, due to asymmetry in the null distribution. Exact (permutation) testing is attractive, but can be computationally intensive and cumbersome. We present an approximation to exact association tests of trend that is accurate and fast enough for standard use in high-throughput settings, and can easily provide standard two-sided or doubled p-values. The approach is shown to be equivalent under permutation to likelihood ratio tests for the most commonly used generalized linear models (GLMs). For linear regression, covariates are handled by working with covariate-residualized responses and predictors. For GLMs, stratified covariates can be handled in a manner similar to exact conditional testing. Simulations and examples illustrate the wide applicability of the approach. The accompanying mcc package is available on CRAN (http://cran.r-project.org/web/packages/mcc/index.html). © The Author 2015. Published by Oxford University Press.
Article
In this paper, we present a new blockwise permutation test approach based on the moments of the test statistic. The method is of importance to functional neuroimaging studies. In order to preserve the exchangeability condition required in permutation, we divide the time series into certain exchangeability blocks. In addition, efficient moments-based permutation tests are performed by approximating the permutation distribution of the test statistic with the Pearson distribution series. This involves the calculation of the first four moments of the permutation distribution within each block and then over the whole time series. The accuracy and efficiency of the proposed method are demonstrated using both simulated time series and fMRI data.
Article
This paper presents a new statistical surface analysis framework that aims to accurately and efficiently localize regionally specific shape changes between groups of 3D surfaces. With unknown distribution and small sample size of the data, existing shape morphometry analysis involves testing thousands of hypotheses for statistically significant effects through permutation. In this work, we develop a novel hybrid permutation test approach to improve the system's efficiency by approximating the permutation distribution of the test statistic with a Pearson distribution series that involves the calculation of the first four moments of the permutation distribution. We propose to derive these moments theoretically and analytically without any permutation. Detailed derivations and experimental results using two different test statistics are demonstrated using simulated data and brain data for shape morphometry analysis. Furthermore, an adaptive procedure is utilized to control the False Discovery Rate (FDR) for increased power of finding significance.
Article
Preliminary Notation and Definitions; Modes of Convergence of a Sequence of Random Variables; Relationships Among the Modes of Convergence; Convergence of Moments; Uniform Integrability; Further Discussion of Convergence in Distribution; Operations on Sequences to Produce Specified Convergence Properties; Convergence Properties of Transformed Sequences; Basic Probability Limit Theorems: The WLLN and SLLN; Basic Probability Limit Theorems: The CLT; Basic Probability Limit Theorems: The LIL; Stochastic Process Formulation of the CLT; Taylor's Theorem; Differentials; Conditions for Determination of a Distribution by Its Moments; Conditions for Existence of Moments of a Distribution; Asymptotic Aspects of Statistical Inference Procedures; Problems
Article
An abstract is not available.