1. Introduction and Summary. When estimating the coefficients in a linear regression it is usually assumed that the covariances of the observations on the dependent variable are known up to multiplication by some common positive number, say c, which is unknown. If this number c is known to be less than some number k, and if the set of possible distributions of the dependent variable includes "enough" normal distributions (in a sense to be specified later), then the minimum variance linear unbiased estimators of the regression coefficients (see [1]) are minimax among the set of all estimators; furthermore these minimax estimators are independent of the value of k. (The risk for any estimator is here taken to be the expected square of the error.) This fact is closely related to a theorem of Hodges and Lehmann ([3], Theorem 6.5), stating that if the observations on the dependent variable are assumed to be independent, with variances not greater than k, then the minimum variance linear estimators corresponding to the assumption of equal variances are minimax. For example, if a number of observations are assumed to be independent, with common (unknown) mean, and common (unknown) variance that is less than k; and if, for every possible value of the mean, the set of possible distributions of the observations includes the normal distribution with that mean and with variance equal to k; then the sample mean is the minimax estimator of the mean of the distribution. The assumption of independence with common unknown variance is, of course, essentially no less general than the assumption that the covariances are known up to multiplication by some common positive number, since the latter situation can be reduced to the former by a suitable rotation of the coordinate axes (provided that the original matrix of covariances is non-singular). This note considers the problem of minimax estimation, in the general "linear regression" framework, when less is known about the covariances of the observations on the "dependent variable" than in the traditional situation just described. For example, one might not be sure that these observations are independent, nor feel justified in assuming any other specific covariance structure. It is immediately clear that, from a minimax point of view, one cannot get along without any prior information at all about the covariances, for in that case the risk of every estimator is unbounded. In practice, however, one is typically willing to grant that the covariances are bounded somehow, but one may not
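As a brief worked check of the sample-mean example above (a sketch under the stated assumptions, not part of the original argument): for $n$ independent observations with common mean $\mu$ and common variance $\sigma^2 \leq k$, the risk of the sample mean $\bar{x}$ is \begin{equation*} E\lbrack(\bar{x} - \mu)^2\rbrack = \operatorname{Var}(\bar{x}) = \sigma^2/n \leq k/n, \end{equation*} with equality when the observations are normal with variance exactly $k$; since such normal distributions are assumed to lie in the model, the Hodges-Lehmann theorem cited above identifies $k/n$ as the minimax risk, and the sample mean attains it.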
An investigation of conditions which are sufficient for a sequence of independent integral-valued lattice random variables to satisfy a local theorem in strengthened form. Several theorems giving conditions under which such a sequence satisfies a local theorem in strengthened form are proved with the aid of lemmas derived by Kruglov (1968).
For a random walk with negative mean and heavy-tailed increment distribution $F$, it is well known that under suitable subexponential assumptions, the distribution $\pi$ of the maximum has a tail $\pi(x, \infty)$ which is asymptotically proportional to $\int_x^\infty F(y, \infty)\, dy$. We supplement this here by a local result showing that $\pi(x, x + z]$ is asymptotically proportional to $zF(x, \infty)$.
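Heuristically (a sketch added for clarity, not part of the abstract), the local result is the increment form of the global one: if $\pi(x, \infty) \sim C\int_x^\infty F(y, \infty)\, dy$, then for fixed $z > 0$, \begin{equation*} \pi(x, x + z] = \pi(x, \infty) - \pi(x + z, \infty) \sim C\int_x^{x+z} F(y, \infty)\, dy \sim CzF(x, \infty), \end{equation*} the last step using the long-tail property $F(x + z, \infty)/F(x, \infty) \to 1$; since asymptotic equivalence of two tails does not by itself control their difference, making this heuristic rigorous is precisely what requires the local subexponential assumptions.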
It may fairly be said that modern time series analysis is a subject which embraces three fields which, while closely related, have tended to develop somewhat independently. These fields are (i) statistical communication and control theory, (ii) the probabilistic (and Hilbert space) theory of stochastic processes possessing finite second moments, and (iii) the statistical theory of regression analysis, correlation analysis, and spectral (or harmonic) analysis of time series. In this paper it is my aim to show the close relation between these fields and to summarize some recent developments. The topics discussed are (i) stationary time series and their statistical analysis, (ii) prediction theory and the Hilbert space spanned by a time series, and (iii) regression analysis of time series with known covariance function. In particular, I describe a new approach to prediction and regression problems using reproducing kernel Hilbert spaces.
Let $\pi_1, \pi_2$ be normal populations with means $m_1, m_2$ respectively and a common variance $\sigma^2$, the parameter point $\omega = (m_1, m_2:\sigma)$ which characterizes the two populations being unknown, and let $\Omega$ be an arbitrary given set of possible points $\omega$. Random samples of fixed sizes $n_1, n_2$ are drawn from $\pi_1, \pi_2$ respectively, giving the combined sample point $\nu = (x_{11}, x_{12}, \cdots, x_{1n_1}; x_{21}, x_{22}, \cdots, x_{2n_2})$. For reasons which will be made clear later in connection with practical examples, any function $f(\nu)$ such that $0 \leq f(\nu) \leq 1$ is called a decision function, and for any such $f(\nu)$ the risk function is defined to be \begin{equation*}\tag{1} r(f \mid \omega) = \max\lbrack m_1, m_2\rbrack - m_1 E\lbrack f \mid \omega\rbrack - m_2 E\lbrack 1 - f \mid \omega\rbrack \geq 0,\end{equation*} where $E$ denotes the expectation operator. A decision function $\bar{f}(\nu)$ is said to be (a) uniformly better than $f(\nu)$ if $r(\bar{f} \mid \omega) \leq r(f \mid \omega)$ for all $\omega$ in $\Omega$, the strict inequality holding for at least one $\omega$, (b) admissible if no decision function is uniformly better than $\bar{f}(\nu)$, and (c) minimax if $\sup_{\omega \in \Omega} \lbrack r(\bar{f} \mid \omega)\rbrack = \inf_f \sup_{\omega \in \Omega} \lbrack r(f \mid \omega)\rbrack$. The "problem of the greater mean" is, for any given $\Omega$, to determine the minimax decision functions, particularly those which are also admissible. Special interest attaches to the case in which there exists a unique minimax decision function $\bar{f}(\nu)$ (in the sense that if $f(\nu)$ is any minimax decision function then $f(\nu) = \bar{f}(\nu)$ for almost every $\nu$ in the sample space); such an $\bar{f}(\nu)$ is automatically admissible. The problem of the greater mean is, of course, a special problem in Wald's general theory of statistical decision functions [1]. Our results will, however, be derived by very simple direct methods which make no use of Wald's general theorems. We cite without proofs a few examples in order to show how strongly the solution of the problem of the greater mean depends on the structure of $\Omega$. In each case the minimax decision function is a function only of the two sample means $\bar{x}_1, \bar{x}_2$. (i) Let $\Omega'$ consist of the two points $(a, b:\sigma)$ and $(b, a:\sigma)$, with $a < b$. Then \begin{equation*}\tag{2} f^\ast(\nu) = \begin{cases}1 & \text{if } n_1\bar{x}_1 - n_2\bar{x}_2 > (n_1 - n_2)(a + b)/2,\\ 0 & \text{otherwise}\end{cases}\end{equation*} is the unique minimax decision function. (ii) Let $\Omega''$ consist of the two points $(c + h, c:\sigma)$ and $(c - h, c:\sigma)$, with $h > 0$. Then \begin{equation*}\tag{3} f^0_c(\nu) = \begin{cases}1 & \text{if } \bar{x}_1 > c,\\ 0 & \text{otherwise}\end{cases}\end{equation*} is the unique minimax decision function. (iii) Let $\Omega'''$ consist of the three points $(\frac{1}{2}, -\frac{1}{2}:1), (\frac{1}{2}, \frac{3}{2}:1), (-\frac{3}{2}, -\frac{1}{2}:1)$, and let $n_1 = n_2 = n$. Then \begin{equation*}\tag{4} f^{\ast\ast}(\nu) = \begin{cases}1 & \text{if } e^{-2n\bar{x}_1} + e^{2n\bar{x}_2} < \lambda,\\ 0 & \text{otherwise},\end{cases}\end{equation*} where $\lambda$ is a certain definite constant, is the unique minimax decision function.
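As a worked illustration of example (ii) (a sketch, not part of the original text): writing $\Phi$ for the standard normal distribution function, (3) gives $E\lbrack f^0_c \mid \omega\rbrack = P(\bar{x}_1 > c \mid \omega)$, and a direct computation from (1) yields \begin{equation*} r(f^0_c \mid (c + h, c:\sigma)) = h\lbrack 1 - \Phi(h\sqrt{n_1}/\sigma)\rbrack = h\Phi(-h\sqrt{n_1}/\sigma) = r(f^0_c \mid (c - h, c:\sigma)), \end{equation*} so that $f^0_c$ has constant risk on $\Omega''$, as one expects of a minimax rule on a two-point parameter set.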
The parameter spaces of two or three points specified in these examples are rather trivial, but in fact the corresponding decision functions (2), (3), (4) remain the unique minimax solutions of the decision problem with respect to much more general parameter spaces. Thus, for example, it is clear that $f^\ast(\nu)$ will remain the unique minimax decision function with respect to any $\Omega$ which contains $\Omega'$ and is such that $\sup_{\omega \in \Omega} \lbrack r(f^\ast \mid \omega)\rbrack = \sup_{\omega \in \Omega'} \lbrack r(f^\ast \mid \omega)\rbrack$. Corresponding remarks apply to $f^0_c(\nu)$ and $f^{\ast\ast}(\nu)$. When $n_1 = n_2$, (2) reduces to \begin{equation*}\tag{5} f^0(\nu) = \begin{cases}1 & \text{if } \bar{x}_1 > \bar{x}_2,\\ 0 & \text{otherwise}.\end{cases}\end{equation*} This decision function is of particular interest when both the means $m_1, m_2$ are unknown. It will be shown that, whether or not $n_1 = n_2$, $f^0(\nu)$ is the unique minimax decision function under certain conditions on $\Omega$ which are likely to hold in practice, at least when both $n_1$ and $n_2$ are sufficiently large (Theorem 3). Likewise, $f^0_c(\nu)$, which is the analogue of $f^0(\nu)$ when one of the means $(m_2)$ is known exactly, is apt to be the unique minimax decision function in such cases, at least when $n_1$ is sufficiently large (Theorem 4). These results on $f^0(\nu)$ and $f^0_c(\nu)$ form the main results of the present paper. So much by way of a general summary. We shall now give a practical illustration (another is given in Section 3) to show how the problem of the greater mean arises in applications. Suppose that a consumer requires a certain number of manufactured articles which can be supplied at the same cost by each of two sources $\pi_1$ and $\pi_2$. The quality of an article is measured by a numerical characteristic $x$, and it is known that in the product of $\pi_i$, $x$ is normally distributed with mean $m_i$ and variance $\sigma^2$, but the values of these parameters are unknown. The consumer has obtained random samples of $n_1$ and $n_2$ articles from $\pi_1$ and $\pi_2$ respectively, and has found the values of $x$ to be $(x_{11}, x_{12}, \cdots, x_{1n_1}; x_{21}, x_{22}, \cdots, x_{2n_2}) = \nu$. What is the best way of ordering a total of $N$ articles from the two sources? The usual statistical theory, which confines itself to estimating the unknown parameters and to testing hypotheses of the form $H_0(m_1 = m_2)$, has at best an indirect bearing on the problem at hand. We therefore adopt Wald's point of view and investigate the consequences of any given course of action. If the consumer orders $fN$ articles from $\pi_1$ and $(1 - f)N$ from $\pi_2$, where $0 \leq f \leq 1$, then the expectation of the sum of the $x$-values of the articles he obtains will be $N(m_1 f + m_2(1 - f))$. The maximum possible value of this quantity is $N \max\lbrack m_1, m_2\rbrack$, and the "loss" per article which he sustains may therefore be taken as $W(\omega, f) = \max\lbrack m_1, m_2\rbrack - m_1 f - m_2(1 - f) \geq 0$, where $\omega = (m_1, m_2:\sigma)$ is the true parameter point. The consumer wants to choose $f$ so as to make $W$ as small as possible. If he knew $m_1$ to be greater, or to be less, than $m_2$, then by choosing $f = 1$ or 0 respectively he could make $W = 0$. But since he does not know which of $m_1$ and $m_2$ is the greater, he will presumably choose $f$ as some function of the sample point $\nu$.
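A numerical illustration with hypothetical figures (not from the original): if $m_1 = 10$, $m_2 = 8$, and the consumer takes $f = 0.25$, then $W(\omega, f) = 10 - (10)(0.25) - (8)(0.75) = 1.5$ per article, whereas $f = 1$ would give $W = 0$. Indeed, when $m_1 > m_2$ the loss reduces to $W(\omega, f) = (m_1 - m_2)(1 - f)$, the per-article shortfall from ordering everything at the better source.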
Suppose, therefore, that a "decision function" $f(\nu)$, such that $0 \leq f(\nu) \leq 1$ but not necessarily taking on only the values 0 and 1, is defined for all points $\nu$ in the sample space and that the consumer sets $f = f(\nu)$. In repeated applications of this procedure, the "risk" or expected loss (a double expectation is involved: the expected loss for a given $f$ and the expected value of $f$ in using the decision function $f(\nu)$) per article is given by (1), and the consumer will try to find an $f(\nu)$ which minimizes this risk. Since the value of the risk depends on $\omega$ it is necessary to specify which values of $\omega$ are to be regarded as possible in the given problem; let the set of all such $\omega$ be denoted by $\Omega$. If the consumer agrees to adopt the "conservative" criterion of minimizing the maximum possible risk, then the statistician's problem is to find the minimax decision functions in the sense defined above. We have given the solutions of this problem for certain types of parameter spaces. The reader will observe that each of the minimax decision functions (2), (3), (4) was of the "all or nothing" type, with values 0 and 1 only. (Whether this remains true for every $\Omega$ we do not know.) By using one of these decision functions in a given instance one arrives at either the best possible decision or the worst. The attitudes of doubt sometimes associated with the non-rejection of the hypothesis $H_0(m_1 = m_2)$ are therefore irrelevant to the problem of the greater mean in the examples cited. (Cf. footnote 2; also Example 1 in Section 3.) The risk function (1) is but one of a general class $R$ of risk functions, to be defined in Section 2, which are associated with the problem of the greater mean. The most important members of $R$ are (1) and \begin{equation*}\tag{6} \bar{r}(f \mid \omega) = P(\text{incorrect decision using } f(\nu) \mid \omega),\end{equation*} where "$m_1 \leq m_2$" and "$m_1 \geq m_2$" are the two possible decisions. The risk function (6) is relevant to applications of a purely "scientific" nature in which the statistician is asked merely to give his opinion as to which population has the greater mean. Although the problem of constructing a suitable decision function for (6) is akin in spirit to the problems considered in the now classical Neyman-Pearson theory of statistical tests, no satisfactory solutions seem to be available. It is easy to see, however, that (1) and (6) are quite similar. Of course, in the case of (1) a decision function $f(\nu)$ may take on any value between 0 and 1 inclusive, while for (6) we allow only functions which take on only the values 0 and 1, corresponding respectively to the decisions "$m_1 \leq m_2$" and "$m_1 \geq m_2$". We then have for any such $f(\nu)$, \begin{equation*}\tag{6'} \bar{r}(f \mid \omega) = \begin{cases}P(f(\nu) = 1 \mid \omega) = E\lbrack f \mid \omega\rbrack & \text{if } m_1 < m_2,\\ P(f(\nu) = 0 \mid \omega) = E\lbrack 1 - f \mid \omega\rbrack & \text{if } m_1 > m_2,\\ 0 & \text{if } m_1 = m_2,\end{cases}\end{equation*} and by comparison with (1) we see that $r(f \mid \omega) = |m_1 - m_2| \bar{r}(f \mid \omega)$ for all $\omega$. Now, in the three examples (i), (ii), (iii) cited above the unique minimax decision functions happen to take on only the values 0 and 1, and $|m_1 - m_2|$ is constant on each of the respective parameter sets. It follows that (2), (3), (4) are also the unique minimax decision functions relative to (6) and to $\Omega', \Omega'', \Omega'''$ respectively.
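The identity $r(f \mid \omega) = |m_1 - m_2|\bar{r}(f \mid \omega)$ is verified in one step (a check supplied here for completeness): if $m_1 < m_2$, then (1) gives \begin{equation*} r(f \mid \omega) = m_2 - m_1 E\lbrack f \mid \omega\rbrack - m_2 E\lbrack 1 - f \mid \omega\rbrack = (m_2 - m_1)E\lbrack f \mid \omega\rbrack = |m_1 - m_2|\bar{r}(f \mid \omega) \end{equation*} by (6'); the case $m_1 > m_2$ is symmetric, and both sides vanish when $m_1 = m_2$.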
The remarks above following Example (iii) also remain valid for the risk function (6). We conclude this section with a remark on the methods of this paper. Any decision function relevant to (6) is equivalent to a test of the hypothesis $H_0(m_1 < m_2)$ against the alternative $H_1(m_1 > m_2)$, the region $\{\nu: f(\nu) = 1\}$ being the "critical region." Hence the Neyman-Pearson probability ratio method can be used to obtain the unique minimax decision function with respect to (6) and an $\Omega$ consisting of two (or more) points, and the result carries over to more general types of $\Omega$ in the manner already indicated. It turns out, however, that the dominant properties of the probability ratio tests are not confined to the class of tests alone, but extend to the class of all functions $f(\nu)$ such that $0 \leq f(\nu) \leq 1$. This result (Theorem 1) enables us to solve the problem of the greater mean for the risk function (1) as well as for (6). The reader who is interested in applications may turn to Section 3.
Stein [4] has exhibited a double sampling procedure to test hypotheses concerning the mean of normal variables with power independent of the unknown variances. This procedure is here adapted to test hypotheses concerning the ratio of means of two normal populations, also with power independent of the unknown variances. The use of a two-sample procedure in a regression problem is also considered. Let $\{X_{ij}\}\ (i = 1, 2;\ j = 1, 2, 3, \cdots)$ be independent random variables distributed according to $N(m_i, \sigma_i)$; all parameters are assumed to be unknown. Defining $k$ by the equation \begin{equation*}\tag{1} m_1 = km_2,\end{equation*} we wish to test the hypothesis $H$ that $k$ has a specified value $k_0$. If $k_0 = 1$ the hypothesis $H$ reduces to a classical problem, often referred to in the literature as the Behrens-Fisher problem (cf. Scheffé [3] for a bibliography). At the present time it is still an open question whether it is possible (or desirable) to find a non-trivial single sample test for $H$ with the size of the critical region independent of $\sigma_1$ and $\sigma_2$. In any case it is a simple extension of the result of Dantzig [1] (cf. also Stein [4]) to show that no non-trivial single sample test exists whose power is independent of $\sigma_1$ and $\sigma_2$. On the other hand, the case $k_0 \neq 1$ may be expected to occur frequently in fields of application where a choice must be made between different products, methods of experimentation, etc., which involve different costs. The statistician must make a choice on the basis of results relative to the ratio of costs involved. Nevertheless this problem appears to have received little attention in the literature. In general, tests based on a two-sample procedure may not be as "efficient" in the sense of Wald [5] as a strict sequential procedure. On the other hand, the two-sample procedure reduces the number of decisions to be made by the experimenter and it will, in certain fields, simplify the experimental procedure.
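One natural route for the adaptation (a sketch of the reduction; the details here are our gloss, not a quotation from the paper): under $H: m_1 = k_0 m_2$, the differences $D_j = X_{1j} - k_0 X_{2j}$ are independent normal variables with mean $m_1 - k_0 m_2 = 0$ and variance $\sigma_1^2 + k_0^2\sigma_2^2$ (writing $\sigma_i$ for the standard deviations), so that testing $H$ reduces to testing that a normal mean is zero when the variance is unknown, which is exactly the situation covered by Stein's double sampling procedure [4].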
For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their “dealing with fluctuations” aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data, where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.
The efficiency of a family of tests is defined. Methods for evaluating the efficiency are discussed. The asymptotic efficiency is obtained for certain families of tests under assumptions which imply that the sample size is large.