British Journal of Mathematical and Statistical Psychology

Published by Wiley
Online ISSN: 2044-8317
Print ISSN: 0007-1102
Example items similar to the items in the SON-R 5½-17: (a) Mosaics; (b) Patterns.
Estimated effects of the covariates and the mean discrimination parameter (and standard 
Deviations of the L1PNO-C item parameters from their mean values and 68 and 95% confidence ellipsoids based on the estimated covariance matrix for (a) the Mosaics subtest and (b) the Patterns subtest.
Observed (grey) and replicated (black) distributions of (a) the item point-biserial correlations and (b) the proportions of correct responses for the Mosaics subtest.
Fischer's (1973) linear logistic test model can be used to test hypotheses regarding the effect of covariates on item difficulty and to predict the difficulty of newly constructed test items. However, its assumptions of equal discriminatory power across items and a perfect prediction of item difficulty are never absolutely met. The amount of misfit in an application of a Bayesian version of the model to two subtests of the SON-R 5½-17 is investigated by means of item fit statistics in the framework of posterior predictive checks and by means of a comparison with a model that allows for residual (co)variance in the item parameters. The effect of the degree of residual (co)variance on the robustness of inferences is investigated in a simulation study.
I cannot see any advantages for SigmaStat. SigmaPlot does indeed have many excellent features and any psychologist could feel proud of many of the graphs it produces (not the box plots). As to the competition, JMP-IN produces a similar range of graphs (and a rotating 3D plot for factor analysis), but has much less flexibility about appearance in terms of colours, fills and other features of graph elements. It can also be difficult to make different graphs of the same size appear neatly on the page. STATVIEW produces fewer graph forms, and does not perform the non-linear regression, but has similar excellent control of the colour, form and sizing of different graph elements. Both STATVIEW and JMP-IN have well implemented 'by variable' facilities and produce graphs well linked to their associated statistical analyses. SigmaPlot wins on the flexibility of its error bars. However, EXCEL and other spreadsheets are also well worth considering, as they produce the same range of graphics and are equally flexible over error bars. An experimenter would have to be very sure that the slight advantages in flexibility of presentation from SigmaPlot outweighed the hassle of having to totally re-organize their data.
When a model is fitted to data in a 2^p contingency table many cells are likely to have very small expected frequencies. This sparseness invalidates the usual approximation to the distribution of the chi-squared or log-likelihood tests of goodness of fit. We present a solution to this problem by proposing a test based on a comparison of the observed and expected frequencies of the second-order margins of the table. A chi-squared approximation to the sampling distribution is provided using asymptotic moments. This can be straightforwardly calculated from the expected cell frequencies. The new test is applied to several previously published examples relating to the fitting of latent variable models, but its application is quite general.
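The second-order margins mentioned in this abstract can be illustrated with a minimal sketch. It assumes the 2^p table is stored as a dictionary from response pattern to frequency; the function name `second_order_margins` and the example frequencies are hypothetical, and the test statistic itself is not reproduced here because the abstract does not give it.

```python
from itertools import combinations


def second_order_margins(cell_freqs):
    """Collapse a 2^p table onto its pairwise (second-order) margins.

    cell_freqs maps a response pattern (tuple of 0/1 of length p) to a
    frequency. The same function works for observed counts and for
    expected frequencies under a fitted model.
    Returns {(i, j): frequency of patterns with a 1 on both item i and item j}.
    """
    p = len(next(iter(cell_freqs)))
    margins = {}
    for i, j in combinations(range(p), 2):
        margins[(i, j)] = sum(
            freq for pattern, freq in cell_freqs.items()
            if pattern[i] == 1 and pattern[j] == 1
        )
    return margins


# Hypothetical observed counts for p = 3 binary items (unlisted cells are empty).
observed = {
    (1, 1, 0): 5,
    (1, 0, 1): 3,
    (0, 1, 1): 2,
    (1, 1, 1): 4,
    (0, 0, 0): 6,
}
observed_margins = second_order_margins(observed)
```

Each margin pools several cells of the full table, which is why margins stay reasonably well populated even when individual cells are sparse.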
Parameters in structural equation models are typically estimated using the maximum likelihood (ML) approach. Bollen (1996) proposed an alternative non-iterative, equation-by-equation estimator that uses instrumental variables. Although this two-stage least squares/instrumental variables (2SLS/IV) estimator has good statistical properties, one problem with its application is that parameter equality constraints cannot be imposed. This paper presents a mathematical solution to this problem that is based on an extension of the 2SLS/IV approach to a system of equations. We present an example in which our approach was used to examine strong longitudinal measurement invariance. We also investigated the new approach in a simulation study that compared it with ML in the examination of the equality of two latent regression coefficients and strong measurement invariance. Overall, the results show that the suggested approach is a useful extension of the original 2SLS/IV estimator and allows for the effective handling of equality constraints in structural equation models.
This paper develops a computationally efficient procedure for analysis of structural equation models with continuous and polytomous variables. A partition maximum likelihood approach is used to obtain the first stage estimates of the thresholds and the polyserial and polychoric correlations in the underlying correlation matrix. Then, based on the joint asymptotic distribution of the first stage estimator and an appropriate weight matrix, a generalized least squares approach is employed to estimate the structural parameters in the correlation structure. Asymptotic properties of the estimators are derived. Some simulation studies are conducted to study the empirical behaviours and robustness of the procedure, and compare it with some existing methods.
Three normative models based on the Change Detector analogies of Shallice (1964) are investigated to see how well they account for the integration of input over time that occurs in absolute brightness threshold experiments. Of the three, the geometric moving average receives no support. The perceptual moment and moving average models give about equally satisfactory but not perfect fits for short flash results, but the perceptual moment model is much more satisfactory for long flash results. The analysis takes into account variations in information transmission times, noise in the visual system and the quantal nature of the light input. The locus of temporal integration is discussed and a central locus preferred, following Treisman (1966).
In the likely event that some clients refuse to participate in a psychosocial field experiment, the estimates of the effects of the experimental treatment on client outcomes may suffer from sample selection bias, regardless of whether the statistical analyses include control variables. This paper explores ways of correcting for this bias with advanced correction strategies, focusing on experiments in which clients refuse assignment into treatment conditions. The sample selection modelling strategy, which is highly recommended but seldom applied to random sample psychosocial experiments, and some alternatives are discussed. Data from an experiment on homelessness and substance abuse are used to compare sample selection, conventional control variable, instrumental variable, and propensity score matching correction strategies. The empirical findings suggest that the sample selection modelling strategy provides reliable estimates of the effects of treatment, that it and some other correction strategies are awkward to apply when there is post-assignment rejection, and that the varying correction strategies provide widely divergent estimates. In light of these findings, researchers might wish regularly to compare estimates across multiple correction strategies.
An approach to item analysis is described by means of which the difficulty of an item and the ability of an individual may sometimes be assessed without reference to the norms provided by some population. The model employed in the analysis could also be used to study any situation in which a number of subjects perform a series of tasks having the same two alternative responses. Some particular uses of the model are discussed briefly.
A covariance structure analysis method for testing time-invariance in reliability in multiwave, multiple-indicator models is outlined. The approach accounts for observed variable specificity and permits, in addition, estimation of reliability in terms of 'pure' measurement error variance. The proposed procedure is developed within a confirmatory factor analysis framework and illustrated with data from a cognitive intervention study.
The speed-accuracy trade-off (SAT) paradigm forces participants to trade response speed for information accuracy by presenting them with a response signal at variable times after the onset of processing to which they must give an immediate response (within 300 ms). The processes that underlie the paradigm, especially those affecting response times, are not completely understood. Also, the extent to which the paradigm might affect the evidence accumulation process is still unclear. By testing several different sets of assumptions, we present a random walk model for the SAT paradigm that qualitatively explains both accuracy and response time data. The model uses a tandem random walk, with two possible continuations in a second phase which begins after the response signal. If a boundary is not reached during phase one, the walk transfers the current sum (relative to the size of the boundaries) from phase one to phase two in the form of bias, with drift rate equal to zero. If, however, a boundary is reached in phase one, the second phase starts from zero (no bias) with a strong drift rate towards the previously reached boundary. The model also incorporates a psychological refractory period: a delay in the onset of a second task when two tasks are presented in close succession. The model is consistent with the idea that information about the evidence accumulation rate is not contaminated by the paradigm.
Latent trait models for responses and response times in tests often lack a substantial interpretation in terms of a cognitive process model. This is a drawback because process models are helpful in clarifying the meaning of the latent traits. In the present paper, a new model for responses and response times in tests is presented. The model is based on the proportional hazards model for competing risks. Two processes are assumed, one reflecting the increase in knowledge and the second the tendency to discontinue. The processes can be characterized by two proportional hazards models whose baseline hazard functions correspond to the temporary increase in knowledge and discouragement. The model can be calibrated with marginal maximum likelihood estimation and an application of the ECM algorithm. Two tests of model fit are proposed. The amenability of the proposed approaches to model calibration and model evaluation is demonstrated in a simulation study. Finally, the model is used for the analysis of two empirical data sets.
Contrasts of means are often of interest because they describe the effect size among multiple treatments. High-quality inference of population effect sizes can be achieved through narrow confidence intervals (CIs). Given the close relation between CI width and sample size, we propose two methods to plan the sample size for an ANCOVA or ANOVA study, so that a sufficiently narrow CI for the population (standardized or unstandardized) contrast of interest will be obtained. The standard method plans the sample size so that the expected CI width is sufficiently small. Since CI width is a random variable, the expected width being sufficiently small does not guarantee that the width obtained in a particular study will be sufficiently small. An extended procedure ensures with some specified, high degree of assurance (e.g., 90% of the time) that the CI observed in a particular study will be sufficiently narrow. We also discuss the rationale and usefulness of two different ways to standardize an ANCOVA contrast, and compare three types of standardized contrast in the ANCOVA/ANOVA context. All of the methods we propose have been implemented in the freely available MBESS package in R so that they can be easily applied by researchers.
In real testing, examinees may manifest different types of test-taking behaviours. In this paper we focus on two types that appear to be among the more frequently occurring behaviours - solution behaviour and rapid guessing behaviour. Rapid guessing usually happens in high-stakes tests when there is insufficient time, and in low-stakes tests when there is lack of effort. These two qualitatively different test-taking behaviours, if ignored, will lead to violation of the local independence assumption and, as a result, yield biased item/person parameter estimation. We propose a mixture hierarchical model to account for differences among item responses and response time patterns arising from these two behaviours. The model is also able to identify the specific behaviour an examinee engages in when answering an item. A Monte Carlo expectation maximization algorithm is proposed for model calibration. A simulation study shows that the new model yields more accurate item and person parameter estimates than a non-mixture model when the data indeed come from two types of behaviour. The model also fits real, high-stakes test data better than a non-mixture model, and therefore the new model can better identify the underlying test-taking behaviour an examinee engages in on a certain item. © 2015 The British Psychological Society.
Composite measures play an important role in psychology and related disciplines. Composite measures almost always have error. Correspondingly, it is important to understand the reliability of the scores from any particular composite measure. However, the point estimates of the reliability of composite measures are fallible and thus all such point estimates should be accompanied by a confidence interval. When confidence intervals are wide, there is much uncertainty in the population value of the reliability coefficient. Given the importance of reporting confidence intervals for estimates of reliability, coupled with the undesirability of wide confidence intervals, we develop methods that allow researchers to plan sample size in order to obtain narrow confidence intervals for population reliability coefficients. We first discuss composite reliability coefficients and then provide a discussion on confidence interval formation for the corresponding population value. Using the accuracy in parameter estimation approach, we develop two methods to obtain accurate estimates of reliability by planning sample size. The first method provides a way to plan sample size so that the expected confidence interval width for the population reliability coefficient is sufficiently narrow. The second method ensures that the confidence interval width will be sufficiently narrow with some desired degree of assurance (e.g., 99% assurance that the 95% confidence interval for the population reliability coefficient will be less than W units wide). The effectiveness of our methods was verified with Monte Carlo simulation studies. We demonstrate how to easily implement the methods with easy-to-use and freely available software.
In an effort to find accurate alternatives to the usual confidence intervals based on normal approximations, this paper compares four methods of generating second-order accurate confidence intervals for non-standardized and standardized communalities in exploratory factor analysis under the normality assumption. The methods to generate the intervals employ, respectively, the Cornish-Fisher expansion and the approximate bootstrap confidence (ABC), and the bootstrap-t and the bias-corrected and accelerated bootstrap (BC(a)). The former two are analytical and the latter two are numerical. Explicit expressions of the asymptotic bias and skewness of the communality estimators, used in the analytical methods, are derived. A Monte Carlo experiment reveals that the performance of central intervals based on normal approximations is a consequence of imbalance of miscoverage on the left- and right-hand sides. The second-order accurate intervals do not require symmetry around the point estimates of the usual intervals and achieve better balance, even when the sample size is not large. The behaviours of the second-order accurate intervals were similar to each other, particularly for large sample sizes, and no method performed consistently better than the others.
In sequences of human sensory assessments, the response to a stimulus may be influenced by previous stimuli. When investigating this phenomenon experimentally with several types or levels of stimulus, it is useful to have treatment sequences which are balanced for first-order carry-over effects. The requirement of balance for each experimental participant leads us to consider sequences of n symbols comprising an initial symbol followed by n 'blocks' each containing a permutation of the symbols. These sequences are designed to include all n^2 ordered pairs of symbols once each, and to have treatment and sequence position effects which are approximately orthogonal. Such sequences were suggested by Finney and Outhwaite (1956), who were able to find examples for particular values of n. We describe and illustrate a computer algorithm for systematically enumerating the sequences for those values of n for which they exist. Criteria are proposed for choosing between the sequences according to the nearness to orthogonality of their treatment and position effects.
In personality and attitude measurement, the presence of acquiescent responding can have an impact on the whole process of item calibration and test scoring, and this can occur even when sensible procedures for controlling acquiescence are used. This paper considers a bidimensional (content and acquiescence) factor-analytic model to be the correct model, and assesses the effects of fitting unidimensional models to theoretically unidimensional scales when acquiescence is in fact operating. The analysis considers two types of scales: non-balanced and fully balanced. The effects are analysed at both the calibration and the scoring stages, and are of two types: bias in the item/respondent parameter estimates and model/person misfit. The results obtained theoretically are checked and assessed by means of simulation. The results and predictions are then assessed in an empirical study based on two personality scales. The implications of the results for applied personality research are discussed.
We propose a method for controlling acquiescent response in which acquiescence response variance is isolated in an independent factor. This kind of procedure is available for perfectly balanced scales (i.e. half of the items are worded in the opposite direction to the other half with respect to a general trait). However, few questionnaires are designed so that exactly half of the items are worded in this way. If this is not the case, the available methods are useless. We propose to adapt the rotation method of Lorenzo-Seva and Rodríguez-Fornells to handle partially balanced scales (i.e. only a few items in the scale are worded in the opposite direction). The most important characteristic of our method is that it removes the variance due to acquiescent response from all the items in the questionnaire (i.e. the balanced subset of items, but also the unbalanced subsets of items). The usefulness of the method is illustrated in a numerical example.
A number of models for the analysis of moment structures, such as LISREL, have recently been shown to be capable of being given a particularly simple and economical representation, in terms of the Reticular Action Model (RAM). In contrast to previous treatments, a formal algebraic treatment is provided which shows that RAM directly incorporates many common structural models, including models describing the structure of means. It is also shown here that RAM treats coefficient matrices with patterned inverses simply and generally.
The mental processes involved in performing some tasks can be represented as directed arcs in an acyclic network. A path directed from the head of one arc to the tail of another indicates that the process represented by the first arc must be executed prior to the process represented by the second arc. If there is no directed path from one arc to another, the corresponding processes can be executed concurrently. Information about the arrangement of processes in an acyclic network can be found from the effects on response times of factors selectively influencing the processes. The methodology was developed earlier for critical path networks, in which a process begins execution when all its immediate predecessors have finished. This paper considers shortest path networks, in which a process begins execution as soon as any immediate predecessor is finished. Results analogous to those for critical path networks are reported. New results are presented enabling investigators to distinguish sequential and concurrent processes in both critical path and shortest path networks. This information is sufficient to construct an acyclic network representing the processes. Further, by examining the effects of selectively influencing processes, one can determine whether a task network is a critical path network or a shortest path network.
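The difference between the two network types described in this abstract can be sketched in a few lines. For brevity the sketch stores each process with a list of its immediate predecessors rather than as an arc, and the process names and durations are hypothetical; it is an illustration of the start rules only, not the authors' analysis.

```python
def finish_times(durations, predecessors, combine):
    """Return the finishing time of every process.

    durations:    dict mapping process name -> duration (e.g. in ms)
    predecessors: dict mapping process name -> list of immediate predecessors
    combine:      max for a critical path network (wait for all predecessors),
                  min for a shortest path network (start after any predecessor)
    """
    finish = {}

    def finish_of(process):
        if process not in finish:
            preds = predecessors.get(process, [])
            start = combine(finish_of(q) for q in preds) if preds else 0.0
            finish[process] = start + durations[process]
        return finish[process]

    for process in durations:
        finish_of(process)
    return finish


# Hypothetical task: a and b run concurrently, c follows both and ends the task.
predecessors = {"c": ["a", "b"]}
baseline = {"a": 100.0, "b": 200.0, "c": 50.0}
prolonged = {"a": 150.0, "b": 200.0, "c": 50.0}  # a factor prolongs a by 50

critical_effect = (finish_times(prolonged, predecessors, max)["c"]
                   - finish_times(baseline, predecessors, max)["c"])
shortest_effect = (finish_times(prolonged, predecessors, min)["c"]
                   - finish_times(baseline, predecessors, min)["c"])
```

In this example prolonging process a leaves the critical path response time unchanged, because a has slack relative to b, but lengthens the shortest path response time by the full 50. Contrasts of this kind are what make the effects of selectively influencing processes informative about the network type.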
The family of (non-parametric, fixed-step-size) adaptive methods, also known as 'up-down' or 'staircase' methods, has been used extensively in psychophysical studies for threshold estimation. Extensions of adaptive methods to non-binary responses have also been proposed. An example is the three-category weighted up-down (WUD) method (Kaernbach, 2001) and its four-category extension (Klein, 2001). Such an extension, however, is somewhat restricted, and in this paper we discuss its limitations. To facilitate the discussion, we characterize the extension of WUD by an algorithm that incorporates response confidence into a family of adaptive methods. This algorithm can also be applied to two other adaptive methods, namely Derman's up-down method and the biased-coin design, which are suitable for estimating any threshold quantiles. We then discuss via simulations of the above three methods the limitations of the algorithm. To illustrate, we conduct a small-scale experiment using the extended WUD under different response confidence formats to evaluate the consistency of threshold estimation.
This paper proposes an on-line version of the Sympson and Hetter procedure with test overlap control (SHT) that can provide item exposure control at both the item and test levels on the fly without iterative simulations. The on-line procedure is similar to the SHT procedure in that exposure parameters are used for simultaneous control of item exposure rates and test overlap rate. The exposure parameters for the on-line procedure, however, are updated sequentially on the fly, rather than through iterative simulations conducted prior to operational computerized adaptive tests (CATs). Unlike the SHT procedure, the on-line version can control item exposure rate and test overlap rate without time-consuming iterative simulations even when item pools or examinee populations have been changed. Moreover, the on-line procedure was found to perform better than the SHT procedure in controlling item exposure and test overlap for examinees who take tests earlier. Compared with two other on-line alternatives, this proposed on-line method provided the best all-around test security control. Thus, it would be an efficient procedure for controlling item exposure and test overlap in CATs.
Bayesian adaptive methods have been extensively used in psychophysics to estimate the point at which performance on a task attains arbitrary percentage levels, although the statistical properties of these estimators have never been assessed. We used simulation techniques to determine the small-sample properties of Bayesian estimators of arbitrary performance points, specifically addressing the issues of bias and precision as a function of the target percentage level. The study covered three major types of psychophysical task (yes-no detection, 2AFC discrimination and 2AFC detection) and explored the entire range of target performance levels allowed for by each task. Other factors included in the study were the form and parameters of the actual psychometric function Psi, the form and parameters of the model function M assumed in the Bayesian method, and the location of Psi within the parameter space. Our results indicate that Bayesian adaptive methods render unbiased estimators of any arbitrary point on Psi only when M=Psi, and otherwise they yield bias whose magnitude can be considerable as the target level moves away from the midpoint of the range of Psi. The standard error of the estimator also increases as the target level approaches extreme values whether or not M=Psi. Contrary to widespread belief, neither the performance level at which bias is null nor that at which standard error is minimal can be predicted by the sweat factor. A closed-form expression nevertheless gives a reasonable fit to data describing the dependence of standard error on number of trials and target level, which allows determination of the number of trials that must be administered to obtain estimates with prescribed precision.
In computerized adaptive testing (CAT), traditionally the most discriminating items are selected to provide the maximum information so as to attain the highest efficiency in trait (theta) estimation. The maximum information (MI) approach typically results in unbalanced item exposure and hence high item-overlap rates across examinees. Recently, Yi and Chang (2003) proposed the multiple stratification (MS) method to remedy the shortcomings of MI. In MS, items are first sorted according to content, then difficulty and finally discrimination parameters. As discriminating items are used strategically, MS offers a better utilization of the entire item pool. However, for testing with imposed non-statistical constraints, this new stratification approach may not maintain its high efficiency. Through a series of simulation studies, this research explored the possible benefits of a mixture item selection approach (MS-MI), integrating the MS and MI approaches, in testing with non-statistical constraints. In all simulation conditions, MS consistently outperformed the other two competing approaches in item pool utilization, while the MS-MI and the MI approaches yielded higher measurement efficiency and offered better conformity to the constraints. Furthermore, the MS-MI approach was shown to perform better than MI on all evaluation criteria when control of item exposure was imposed.
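The maximum information rule discussed in this abstract can be sketched as follows. The abstract does not name the item response model, so a two-parameter logistic (2PL) model is assumed here, and the item pool and function names are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item (assumed model) at trait level theta.

    a is the discrimination parameter and b the difficulty parameter.
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_max_information(theta, items, administered):
    """Return the index of the not-yet-administered item that is most
    informative at the current trait estimate theta."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical pool of (a, b) pairs.
pool = [(1.0, 0.0), (2.0, 0.0), (2.0, 3.0)]
first_pick = select_max_information(0.0, pool, administered=set())
second_pick = select_max_information(0.0, pool, administered={first_pick})
```

Because information grows with the square of the discrimination parameter, a highly discriminating item whose difficulty is near the current estimate is picked first for every examinee at that trait level, which is the source of the unbalanced exposure that stratification methods try to correct.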
We examined nine adaptive methods of trimming, that is, methods that empirically determine when data should be trimmed and the amount to be trimmed from the tails of the empirical distribution. Over the 240 empirical values collected for each method investigated, in which we varied the total percentage of data trimmed, sample size, degree of variance heterogeneity, pairing of variances and group sizes, and population shape, one method resulted in exceptionally good control of Type I errors. However, under less extreme cases of non-normality and variance heterogeneity a number of methods exhibited reasonably good Type I error control. With regard to the power to detect non-null treatment effects, we found that the choice among the methods depended on the degree of non-normality and variance heterogeneity. Recommendations are offered.
The purpose of this study is to find a formula that describes the relationship between item exposure parameters and item parameters in computerized adaptive tests by using genetic programming (GP) - a biologically inspired artificial intelligence technique. Based on the formula, item exposure parameters for new parallel item pools can be predicted without conducting additional iterative simulations. Results show that an interesting formula between item exposure parameters and item parameters in a pool can be found by using GP. The item exposure parameters predicted based on the found formula were close to those observed from the Sympson and Hetter (1985) procedure and performed well in controlling item exposure rates. Similar results were observed for the Stocking and Lewis (1998) multinomial model for item selection and the Sympson and Hetter procedure with content balancing. The proposed GP approach has provided a knowledge-based solution for finding item exposure parameters.
This paper introduces a new heuristic approach, the maximum priority index (MPI) method, for severely constrained item selection in computerized adaptive testing. Our simulation study shows that it is able to accommodate various non-statistical constraints simultaneously, such as content balancing, exposure control, answer key balancing, and so on. Compared with the weighted deviation modelling method, it leads to fewer constraint violations and better exposure control while maintaining the same level of measurement precision.
An ANCOVA model is formulated for two non-equivalent groups, experimentals and controls. The response variable of interest is a continuous latent variable construct, measured by four dichotomous variables. Estimation and testing draw on the methodology of Muthén (1984). The modelling is carried out on data from the California Civil Addict Programme, studying treatment effects related to drug abuse, employment, crime and incarceration.
The existence of a close relation between personality and drug consumption is recognized, but the corresponding causal connection is not well known. Neither is it well known whether personality exercises an influence predominantly at the beginning and development of addiction, nor whether drug consumption produces changes in personality. This paper presents a dynamic mathematical model of personality and addiction based on the unique personality trait theory (UPTT) and the general modelling methodology. This model attempts to integrate personality, the acute effect of drugs, and addiction. The UPTT states the existence of a unique trait of personality called extraversion, understood as a dimension that ranges from impulsive behaviour and sensation-seeking (extravert pole) to fearful and anxious behaviour (introvert pole). As a consequence of drug consumption, the model provides the main patterns of extraversion dynamics through a system of five coupled differential equations. It combines genetic extraversion, as a steady state, and dynamic extraversion in a unique variable measured on the hedonic scale. The dynamics of this variable describes the effects of stimulant drugs on a short-term time scale (typical of the acute effect); while its mean time value describes the effects of stimulant drugs on a long-term time scale (typical of the addiction effect). This understanding may help to develop programmes of prevention and intervention in drug misuse.
Diffusion model data analysis permits the disentangling of different processes underlying the effects of experimental manipulations. Estimates can be provided for the speed of information accumulation, for the amount of information used to draw conclusions, and for a decision bias. One parameter describes the duration of non-decisional processes including the duration of motor-response execution. In the default diffusion model, it is implicitly assumed that both responses are executed with the same speed. In some applications of the diffusion model, this assumption will be violated. This will lead to biased parameter estimates. Consequently, we suggest accounting explicitly for differences in the speed of response execution for both responses. Results from a simulation study illustrate that parameter estimates from the default model are biased if the speed of response execution differs between responses. A second simulation study shows that large trial numbers (N>1,000) are needed to detect whether differences in response-execution times are based on different execution times.
Methods for the hierarchical clustering of an object set produce a sequence of nested partitions such that object classes within each successive partition are constructed from the union of object classes present at the previous level. Any such sequence of nested partitions can in turn be characterized by an ultrametric. An approach to generalizing an (ultrametric) representation is proposed in which the nested character of the partition sequence is relaxed and replaced by the weaker requirement that the classes within each partition contain objects consecutive with respect to a fixed ordering of the objects. A method for fitting such a structure to a given proximity matrix is discussed, along with several alternative strategies for graphical representation. Using this same ultrametric extension, additive tree representations can also be generalized by replacing the ultrametric component in the decomposition of an additive tree (into an ultrametric and a centroid metric). A common numerical illustration is developed and maintained throughout the paper.
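The link between nested partitions and ultrametrics stated at the start of this abstract rests on the ultrametric inequality, d(i, k) <= max(d(i, j), d(j, k)) for all triples of objects. A minimal sketch of a checker, with hypothetical example matrices, is given below; it does not implement the generalized representation proposed in the paper.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """True if the symmetric dissimilarity matrix d (list of lists) satisfies
    d[i][k] <= max(d[i][j], d[j][k]) for every triple of distinct objects."""
    n = len(d)
    return all(
        d[i][k] <= max(d[i][j], d[j][k]) + tol
        for i, j, k in permutations(range(n), 3)
    )


# Hypothetical nested partitions of {a, b, c}: a and b merge at level 1,
# then {a, b} merges with c at level 3.
nested = [
    [0, 1, 3],
    [1, 0, 3],
    [3, 3, 0],
]

# Changing d(b, c) to 2 breaks the inequality: d(a, c) = 3 > max(1, 2).
not_nested = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
```

In an ultrametric the two largest dissimilarities in every triple are equal, which is what allows the matrix to be read off as a hierarchy of merge levels.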
The performances of three additive tree algorithms which seek to minimize a least-squares loss criterion were compared. The algorithms included the penalty-function approach of De Soete (1983), the iterative projection strategy of Hubert & Arabie (1995) and the two-stage ADDTREE algorithm (Corter, 1982; Sattath & Tversky, 1977). Model fit, comparability of structure, processing time and metric recovery were assessed. Results indicated that the iterative projection strategy consistently located the best-fitting tree, but also displayed a wider range and larger number of local optima.
A definition of binaural additivity is given in terms of the theory of simultaneous conjoint measurement. Additivity is then tested and verified by a conjoint measurement procedure. Methods for deriving psychophysical scales from such procedures are discussed, and the experimental scales are compared with the usual ratio scales for loudness, derived from extensive measurement such as magnitude estimation. The functions are in good agreement and it is concluded that binaural additivity of loudness holds for non-zero stimulation of the ears.
Multi-group latent growth modelling in the structural equation modelling framework has been widely utilized for examining differences in growth trajectories across multiple manifest groups. Despite its usefulness, the traditional maximum likelihood estimation for multi-group latent growth modelling is not feasible when one of the groups has no response at any given data collection point, or when all participants within a group have the same response at one of the time points. In other words, multi-group latent growth modelling requires a complete covariance structure for each observed group. The primary purpose of the present study is to show how to circumvent these data problems by developing a simple but creative approach using an existing estimation procedure for growth mixture modelling. A Monte Carlo simulation study was carried out to see whether the modified estimation approach provided tangible results and to see how these results were comparable to the standard multi-group results. The proposed approach produced results that were valid and reliable under the mentioned problematic data conditions. We also present a real data example and demonstrate that the proposed estimation approach can be used for the chi-square difference test to check various types of measurement invariance as conducted in a standard multi-group analysis.
In this study we demonstrate how the asymptotically distribution-free (ADF) fit function is affected by (excessive) kurtosis in the observed data. More specifically, we address how different levels of univariate kurtosis affect fit values (and therefore fit indices) for misspecified factor models. By using numerical calculation, we show (for 13 factor models) that the probability limit $F_0$ of $\hat{F}$ for the ADF fit function decreases considerably as the kurtosis increases. We also give a formal proof that the value of $F_0$ decreases monotonically with the kurtosis for a whole class of structural equation models.
The asymptotically distribution-free (ADF) test statistic for covariance structure analysis (CSA) has been reported to perform very poorly in simulation studies, i.e. it leads to inaccurate decisions regarding the adequacy of models of psychological processes. It is shown in the present study that the poor performance of the ADF test statistic is due to inadequate estimation of the weight matrix ($W = \Gamma^{-1}$), which is a critical quantity in the ADF theory. Bootstrap procedures based on Hall's bias reduction perspective are proposed to correct the ADF test statistic. It is shown that the bootstrap correction of additive bias on the ADF test statistic yields the desired tail behaviour as the sample size reaches 500 for a 15-variable-3-factor confirmatory factor-analytic model, even if the distribution of the observed variables is not multivariate normal and the latent factors are dependent. These results help to revive the ADF theory in CSA.
A path model of genetic and environmental transmission developed for application to data collected in the ongoing Colorado Adoption Project was evaluated. In addition to providing tests of hereditary and environmental influences, the model includes parameters for passive genotype-environment correlation, parental influences on the child's environment, assortative mating and selective placement. A maximum likelihood method was used to obtain parameter estimates from mental ability data on one- and two-year-old children, their biological and adoptive parents, and members of control families. There was a satisfactory fit of the model to the data. Highly significant estimates of genetic influence and a significant measure of home environment were obtained. By comparing the fit of the full model to that of a reduced model in which selective placement parameters were dropped, the absence of selective placement was confirmed. Although some evidence for possible genotype-environment correlation was found, it accounted for a relatively small proportion of the total variance.
A basic property of various rank-based hypothesis testing methods is that they are invariant under a linear transformation of the data. For multivariate data, a generalization of this property is sometimes sought (called affine invariance), but typically techniques for assigning ranks do not achieve this goal, or it is assumed that sampling is from a symmetric distribution. A rank-based method is suggested for comparing dependent groups that is based on halfspace depth, is affine invariant in terms of difference scores, and allows sampling from asymmetric distributions.
The agreement between two competing tests which purport to measure the same trait is a common concern in test development. In this paper three alternative parameterizations of the measurement model useful in this context are presented. Both one-factor and two-factor approaches are applied. Lord's classic example, where the main problem is to investigate whether time limits represent an extra speed component in a vocabulary test, is used to illustrate the ideas.
The paper is concerned with the measurement of internal consistency of rating scales and interviewing schedules, with the assessment of bias between different raters and with coefficients for measuring the degree of agreement between them. Analysis of variance models are first employed, but reference is also made to earlier psychometric techniques and to recent work by Armitage et al. and by Fleiss.
Raters are an important potential source of measurement error when assigning targets to categories. Therefore, psychologists have devoted considerable attention to quantifying the extent to which ratings agree with each other. Two main approaches to analysing rater agreement data can be distinguished. While the first approach focuses on the development of summary statistics that index rater agreement, the second models the association pattern among the observers' ratings. Within the modelling approach, three groups of models can be distinguished: latent class models, simple quasi-symmetric agreement models, and mixture models. This paper discusses a class of mixture models that is defined by its characteristic of having a quasi-symmetric log-linear representation. This class of models has two interesting properties. First, the simple quasi-symmetric agreement models can be shown to be members of this class. Therefore, the results of a rater agreement analysis based on a simple quasi-symmetric agreement model may be interpreted in the mixture model framework. Second, since the mixture models readily provide a familiar measure of rater reliability, it is possible to obtain a model-based estimate of rater reliability from the simple quasi-symmetric agreement models. The suggested class of mixture models will be illustrated using data from a persuasive communication study in which three raters classified respondents on the basis of their elicited cognitive responses.
Diagnoses by the raters of 100 subjects as psychotic, neurotic or organic (Fleiss, 1981)
The most common measure of agreement for categorical data is the coefficient kappa. However, kappa performs poorly when the marginal distributions are very asymmetric, it is not easy to interpret, and its definition is based on the hypothesis of independence of the responses (which is more restrictive than the hypothesis that kappa has a value of zero). This paper defines a new measure of agreement, delta, 'the proportion of agreements that are not due to chance', which comes from a model of multiple-choice tests and does not have the previous limitations. The paper shows that kappa and delta generally take very similar values, except when the marginal distributions are strongly unbalanced. The case of the 2 x 2 tables (which admits very simple solutions) is considered in detail.
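The sensitivity of kappa to unbalanced marginals can be seen in a short sketch of the standard two-rater kappa computation. The function name `cohen_kappa` and both 2 x 2 tables are made-up illustrations, not data from the paper, and the delta coefficient itself is not implemented here.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater A, columns: rater B)."""
    q = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion of agreement: the diagonal of the table.
    p_obs = sum(table[k][k] for k in range(q)) / n
    # Marginal proportions for each rater.
    row = [sum(table[k]) / n for k in range(q)]
    col = [sum(table[i][k] for i in range(q)) / n for k in range(q)]
    # Agreement expected under independence of the two raters.
    p_exp = sum(row[k] * col[k] for k in range(q))
    return (p_obs - p_exp) / (1 - p_exp)

# Both hypothetical tables have 90% observed agreement.
balanced = [[45, 5], [5, 45]]
unbalanced = [[90, 5], [5, 0]]

print(round(cohen_kappa(balanced), 3), round(cohen_kappa(unbalanced), 3))
```

With identical observed agreement, the balanced table gives a high kappa while the unbalanced table gives a slightly negative one, which is the behaviour the abstract criticizes.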
Three techniques are proposed for oblique rotation of two or more loading matrices to a mixture of simple structure and optimal agreement. The three techniques, consensus direct oblimin, consensus promin and consensus simple target rotation, are compared to existing techniques and to each other, using artificial and real data. When agreement between solutions can be expected, the new techniques appear to be more useful than oblique rotation of each of the loading matrices separately. Consensus direct oblimin achieved the best results in terms of 'agreement between simple structures'; consensus promin was the technique that best recovered the underlying simple loading matrix; and consensus simple target rotation seemed to give an interesting compromise between agreement and simple structure, focusing a little more on simplicity than consensus direct oblimin.
Distribution of n participants by rater and response category 
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple-rater generalized pi and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters. This is an improvement over existing alternative variances, which depend on the independence assumption. A Monte-Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter-rater reliability statistics.
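A minimal two-rater sketch can contrast Scott's pi with the AC1 coefficient. It assumes the commonly cited two-rater forms, in which pi uses chance agreement equal to the sum of squared average marginal proportions and AC1 uses the sum of p(1 - p) over categories divided by (q - 1). The function name and the table are illustrative; the multiple-rater generalizations and the variance estimators proposed in the paper are not covered.

```python
def pi_and_ac1(table):
    """Return (Scott's pi, AC1) for a square two-rater table of counts."""
    q = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion of agreement.
    p_a = sum(table[k][k] for k in range(q)) / n
    # Average of the two raters' marginal proportions for each category.
    avg = [
        (sum(table[k]) + sum(table[i][k] for i in range(q))) / (2 * n)
        for k in range(q)
    ]
    pe_pi = sum(p * p for p in avg)                   # chance agreement for pi
    pe_ac1 = sum(p * (1 - p) for p in avg) / (q - 1)  # chance agreement for AC1
    scott_pi = (p_a - pe_pi) / (1 - pe_pi)
    ac1 = (p_a - pe_ac1) / (1 - pe_ac1)
    return scott_pi, ac1

# Hypothetical table with 90% observed agreement and very unbalanced marginals.
skewed = [[90, 5], [5, 0]]
scott_pi, ac1 = pi_and_ac1(skewed)
print(round(scott_pi, 3), round(ac1, 3))
```

On this table pi is slightly negative despite 90% observed agreement, whereas AC1 stays high, which is one form of the paradox the abstract refers to.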
A mathematical model is proposed, based on catastrophe theory, to describe the qualitative effect of stress upon the neural mechanisms used for making judgements, such as estimating speed. The model is used quantitatively to fit data, and to explain the cusp-shaped results of Drew et al. (1959), showing that introverts under alcohol tend to drive either too fast or too slow in a driving simulator. Experiments are suggested in which discontinuous jumps in perception of continuous variables like speed might well appear.
The aim of this study is the empirical verification of the Bayesian approach applied to the description of the decision-making process with regard to prosodic stimuli in different psychopathological states. Using the Bayesian formalism, an interpretation of a disturbance in the internal representation of contextual information in schizophrenia is given. The results obtained satisfied the formula derived from Bayes' theorem in all groups tested except the schizophrenic group. Results were interpreted as reflecting cognitive flexibility, and discussed in the context of social adaptation. Although the investigation was based on psychopathological grounds, the results may be applied to the functioning of working memory in general.
Research problems that require a non-parametric analysis of multifactor designs with repeated measures arise in the behavioural sciences. There is, however, a lack of available procedures in commonly used statistical packages. In the present study, a generalization of the aligned rank test for the two-way interaction is proposed for the analysis of the typical sources of variation in a three-way analysis of variance (ANOVA) with repeated measures. It can be implemented in the usual statistical packages. Its statistical properties are tested by using simulation methods with two sample sizes (n = 30 and n = 10) and three distributions (normal, exponential and double exponential). Results indicate substantial increases in power for non-normal distributions in comparison with the usual parametric tests. Similar levels of Type I error for both parametric and aligned rank ANOVA were obtained with non-normal distributions and large sample sizes. Degrees-of-freedom adjustments for Type I error control in small samples are proposed. The procedure is applied to a case study with 30 participants per group where it detects gender differences in linguistic abilities in blind children not shown previously by other methods.
When planning a study, sample size determination is one of the most important tasks facing the researcher. The size will depend on the purpose of the study, the cost limitations, and the nature of the data. By specifying the standard deviation ratio and/or the sample size ratio, the present study considers the problem of heterogeneous variances and non-normality for Yuen's two-group test and develops sample size formulas to minimize the total cost or maximize the power of the test. For a given power, the sample size allocation ratio can be manipulated so that the proposed formulas can minimize the total cost, the total sample size, or the sum of total sample size and total cost. On the other hand, for a given total cost, the optimum sample size allocation ratio can maximize the statistical power of the test. After the sample size is determined, the present simulation applies Yuen's test to the sample generated, and then the procedure is validated in terms of Type I errors and power. Simulation results show that the proposed formulas can control Type I errors and achieve the desired power under the various conditions specified. Finally, the implications for determining sample sizes in experimental studies and future research are discussed.
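For orientation, the test statistic to which these sample size formulas apply can be sketched as follows. This is a plain implementation of the usual form of Yuen's two-group statistic (trimmed means with Winsorized variances and Welch-type degrees of freedom), assuming 20% trimming by default; the function names are illustrative, and the cost-based and power-based sample size formulas of the paper are not reproduced.

```python
import math

def trimmed_stats(x, trim=0.2):
    """Return (trimmed mean, squared standard error term d, effective size h)."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(trim * n)   # observations trimmed from each tail
    h = n - 2 * g              # observations left after trimming
    tmean = sum(xs[g:n - g]) / h
    # Winsorize: replace each trimmed tail by the nearest retained value.
    w = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g
    wmean = sum(w) / n
    wvar = sum((v - wmean) ** 2 for v in w) / (n - 1)
    d = (n - 1) * wvar / (h * (h - 1))
    return tmean, d, h

def yuen(x, y, trim=0.2):
    """Yuen's statistic and approximate degrees of freedom for two independent groups."""
    m1, d1, h1 = trimmed_stats(x, trim)
    m2, d2, h2 = trimmed_stats(y, trim)
    t = (m1 - m2) / math.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return t, df

group1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
group2 = [v + 2 for v in group1]
t_stat, df = yuen(group1, group2)
print(round(t_stat, 3), round(df, 3))
```

The statistic is referred to a Student t distribution with `df` degrees of freedom; obtaining a p-value would need a t-distribution function from a statistics library, which is left out to keep the sketch dependency-free.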
Top-cited authors
Rand R Wilcox
  • University of Southern California
Harvey Jay Keselman
  • University of Manitoba
Lawrence Hubert
  • University of Illinois, Urbana-Champaign
Karl G. Jöreskog
  • Uppsala University
James T Townsend
  • Indiana University Bloomington