Accountability for public education often requires estimating and ranking the quality of individual teachers or schools on the basis of student test scores. Although the properties of estimators of teacher-or-school effects are well established, less is known about the properties of rank estimators. We investigate performance of rank (percentile) estimators in a basic, two-stage hierarchical model capturing the essential features of the more complicated models that are commonly used to estimate effects. We use simulation to study mean squared error (MSE) performance of percentile estimates and to find the operating characteristics of decision rules based on estimated percentiles. Each depends on the signal-to-noise ratio (the ratio of the teacher or school variance component to the variance of the direct, teacher- or school-specific estimator) and only moderately on the number of teachers or schools. Results show that even when using optimal procedures, MSE is large for the commonly encountered variance ratios, with an unrealistically large ratio required for ideal performance. Percentile-specific MSE results reveal interesting interactions between variance ratios and estimators, especially for extreme percentiles, which are of considerable practical import. These interactions are apparent in the performance of decision rules for the identification of extreme percentiles, underscoring the statistical and practical complexity of the multiple-goal inferences faced in value-added modeling. Our results highlight the need to assess whether even optimal percentile estimators perform sufficiently well to be used in evaluating teachers or schools.
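The dependence of percentile MSE on the variance ratio can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes a Gaussian two-stage model with equal sampling variances (fixed at 1), ranks units by their direct estimates (which, under equal variances, gives the same ordering as ranking posterior means), and uses rank/(K+1) as the percentile. The function names and default settings are hypothetical.

```python
import random

def percentiles(values):
    """Map each value to its percentile rank/(K+1); rank 1 is the smallest."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    pct = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        pct[idx] = rank / (len(values) + 1)
    return pct

def percentile_mse(variance_ratio, n_units=100, n_reps=200, seed=0):
    """Average squared error of estimated percentiles in a two-stage model.

    Assumed model: theta_k ~ N(0, tau^2), y_k | theta_k ~ N(theta_k, 1),
    so variance_ratio = tau^2 / 1.
    """
    rng = random.Random(seed)
    tau = variance_ratio ** 0.5
    total = 0.0
    for _ in range(n_reps):
        theta = [rng.gauss(0.0, tau) for _ in range(n_units)]
        y = [t + rng.gauss(0.0, 1.0) for t in theta]
        true_pct = percentiles(theta)
        est_pct = percentiles(y)
        total += sum((e - t) ** 2 for e, t in zip(est_pct, true_pct)) / n_units
    return total / n_reps

if __name__ == "__main__":
    for ratio in (0.5, 1.0, 10.0):
        print(ratio, round(percentile_mse(ratio), 4))
```

Under these assumptions the simulated MSE shrinks as the variance ratio grows, which is the qualitative pattern the abstract describes.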
The use of complex value-added models that attempt to isolate the contributions of teachers or schools to student development is increasing. Several variations on these models are being applied in the research literature, and policy makers have expressed interest in using these models for evaluating teachers and schools. In this article, we present a general multivariate, longitudinal mixed-model that incorporates the complex grouping structures inherent to longitudinal student data linked to teachers. We summarize the principal existing modeling approaches, show how these approaches are special cases of the proposed model, and discuss possible extensions to model more complex data structures. We present simulation and analytical results that clarify the interplay between estimated teacher effects and repeated outcomes on students over time. We also explore the potential impact of model misspecifications, including missing student covariates and assumptions about the accumulation of teacher effects over time, on key inferences made from the models. We conclude that mixed models that account for student correlation over time are reasonably robust to such misspecifications when all the schools in the sample serve similar student populations. However, student characteristics are likely to confound estimated teacher effects when schools serve distinctly different populations.
When identification of causal effects relies on untestable assumptions regarding nonidentified parameters, sensitivity of causal effect estimates is often questioned. For proper interpretation of causal effect estimates in this situation, deriving bounds on causal parameters or exploring the sensitivity of estimates to scientifically plausible alternative assumptions can be critical. In this paper, we propose a practical way of bounding and sensitivity analysis, where multiple identifying assumptions are combined to construct tighter common bounds. In particular, we focus on the use of competing identifying assumptions that impose different restrictions on the same non-identified parameter. Since these assumptions are connected through the same parameter, direct translation across them is possible. Based on this cross-translatability, various information in the data, carried by alternative assumptions, can be effectively combined to construct tighter bounds on causal effects. Flexibility of the suggested approach is demonstrated focusing on the estimation of the complier average causal effect (CACE) in a randomized job search intervention trial that suffers from noncompliance and subsequent missing outcomes.
An analytical approach was employed to compare sensitivity of causal effect estimates with different assumptions on treatment noncompliance and non-response behaviors. The core of this approach is to fully clarify bias mechanisms of considered models and to connect these models based on common parameters. Focusing on intention-to-treat analysis, systematic model comparisons are performed on the basis of explicit bias mechanisms and connectivity between models. The method is applied to the Johns Hopkins school intervention trial, where assessment of the intention-to-treat effect on school children's mental health is likely to be affected by assumptions about intervention noncompliance and nonresponse at follow-up assessments. The example calls attention to the importance of focusing on each case in investigating relative sensitivity of causal effect estimates with different identifying assumptions, instead of pursuing a general conclusion that applies to every occasion.
Researchers often compare the relationship between an outcome and covariate for two or more groups by evaluating whether the fitted regression curves differ significantly. When they do, researchers need to determine the "significance region," or the values of the covariate where the curves significantly differ. In analysis of covariance (ANCOVA), the Johnson-Neyman procedure can be used to determine the significance region; for the hierarchical linear model (HLM), the Miyazaki and Maier (M-M) procedure has been suggested. However, neither procedure can accommodate nonnormally distributed data. Furthermore, the M-M procedure produces biased (downward) results because it uses the Wald test, does not control the inflated Type I error rate due to multiple testing, and requires implementing multiple software packages to determine the significance region. In this article, we address these limitations by proposing solutions for determining the significance region suitable for generalized linear (mixed) models (GLM or GLMM). These proposed solutions incorporate test statistics that resolve the biased results, control the Type I error rate using Scheffé's method, and use a single statistical software package to determine the significance region.
Like any empirical method used for causal analysis, social experiments are prone to attrition, which may undermine the validity of the results. This article considers the problem of partially missing outcomes in experiments. First, it systematically reveals under which forms of attrition (in terms of its relation to observable and/or unobservable factors) experiments do (not) yield causal parameters. Second, it shows how the various forms of attrition can be controlled for by different methods of inverse probability weighting (IPW) that are tailored to the specific missing data problem at hand. In particular, it discusses IPW methods that incorporate instrumental variables (IVs) when attrition is related to unobservables, a case that has been widely ignored in the experimental literature.
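The simplest case, attrition related only to observables, can be sketched as follows. This is an illustration under stated assumptions rather than the article's estimator: response probabilities are estimated by cell frequencies within each (treatment, covariate) cell, the covariate is discrete, and every cell contains at least one respondent. The record layout and the function name `ipw_effect` are hypothetical, and the IV-based weighting for attrition on unobservables is not shown.

```python
from collections import defaultdict

def ipw_effect(records):
    """Weighted difference in mean outcomes between treatment arms.

    records: list of dicts with keys
      d: treatment indicator (0 or 1)
      x: discrete covariate observed for everyone
      r: response indicator (1 if the outcome is observed)
      y: outcome (ignored when r == 0)
    Each respondent is weighted by 1 / P(r = 1 | d, x), estimated per cell.
    """
    n_cell = defaultdict(int)
    n_resp = defaultdict(int)
    for rec in records:
        cell = (rec["d"], rec["x"])
        n_cell[cell] += 1
        n_resp[cell] += rec["r"]

    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for rec in records:
        if rec["r"] == 1:
            cell = (rec["d"], rec["x"])
            weight = n_cell[cell] / n_resp[cell]  # inverse response rate
            num[rec["d"]] += weight * rec["y"]
            den[rec["d"]] += weight
    return num[1] / den[1] - num[0] / den[0]
```

When response rates differ across covariate cells, the unweighted respondent means are distorted toward the cells that respond more often; the weights restore each cell to its share in the full sample.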
An extension of the latent Markov Rasch model is described for the analysis
of binary longitudinal data with covariates when subjects are collected in
clusters, e.g. students clustered in classes. For each subject, the latent
process is used to represent the characteristic of interest (e.g. ability)
conditional on the effect of the cluster to which he/she belongs. The latter
effect is modeled by a discrete latent variable associated with each cluster.
For the maximum likelihood estimation of the model parameters we outline an EM
algorithm. We show how the proposed model may be used for assessing the
development of cognitive Math achievement. This approach is applied to the
analysis of a dataset collected in the Lombardy Region (Italy) and based on
test scores over three years of middle-school students attending public and
The problem of identifying outliers has two important aspects: the choice of outlier measures, and the method to assess the degree of outlyingness (norming) of those measures. We introduce several classes of measures for identifying outliers in Computerized Adaptive Tests (CATs). Some of these measures are new and are constructed to take advantage of CAT's sequential choice of items; other measures are taken directly from paper and pencil (P&P) tests and are used for baseline comparisons. Assessing the degree of outlyingness of CAT responses, however, cannot be applied directly from P&P tests because stopping rules associated with CATs yield examinee responses of varying lengths. Standard outlier measures are highly correlated with the varying lengths, which makes comparison across examinees impossible. Therefore, we present and compare four methods which map outlier statistics to a familiar probability scale (a p-value). The application of these methods to CAT data is new....
A substantial literature on switched linear regression functions considers situations in which the regression function is discontinuous at an unknown value of the regressor, X_k, the so-called unknown "change point". The regression model is thus a two-phase composite, with one regime for observations i = 1, 2, ..., k and another for observations i = k+1, k+2, ..., n. Solutions to this single series problem are considerably more complex when we consider a wrinkle frequently encountered in evaluation studies of system interventions, in that a system typically comprises multiple members (j = 1, 2, ..., m) and that members of the system cannot all be expected to change synchronously. For example, schools differ not only in whether a program, implemented system wide, improves their students' test scores but, depending on the resources already in place, schools may also differ in when they start to show effects of the program. If ignored, heterogeneity among schools in when the program takes initial effect undermines any program evaluation that assumes that change points are known and that they are the same for all schools. To better describe individual behavior within a system, and using a sample of longitudinal test scores from a large urban school system, we consider hierarchical Bayes estimation of a multilevel linear regression model in which each individual regression slope of test score on time switches at some unknown point in time, k_j. Preliminary evidence suggests that change points in test score trends indeed differ from school to school in a sample of urban elementary schools. Furthermore, the estimated posterior distribution of the change points suggests that, while the estimated timings of change in performance do not contradict the claim that a ... inte...
We consider maximum likelihood (ML) estimation of mean and covariance structure models when data are missing. Expectation maximization (EM), generalized expectation maximization (GEM), Fletcher-Powell, and Fisher scoring algorithms are described for parameter estimation. It is shown how the machinery within software that handles the complete data problem can be utilized to implement each algorithm. A numerical differentiation method for obtaining the observed information matrix and the standard errors is given. This method too uses the complete data program machinery. The likelihood ratio test is discussed for testing hypotheses. Three examples are used to compare the cost of the four algorithms mentioned above, as well as to illustrate the standard error estimation and the test of hypothesis considered. The sensitivity of the ML estimates as well as the mean imputed and listwise deletion estimates to missing data mechanisms is investigated using three artificial data sets that are mis...
In this paper I (1) examine three levels of inferential strength supported by typical social science data-gathering methods, and call for a greater degree of explicitness, when HMs and other models are applied, in identifying which level is appropriate; (2) reconsider the use of HMs in school effectiveness studies and meta-analysis from the perspective of causal inference; and (3) recommend the increased use of Gibbs sampling and other Markov-chain Monte Carlo (MCMC) methods in the application of HMs in the social sciences, so that comparisons between MCMC and better-established fitting methods---including full or restricted maximum likelihood estimation based on the EM algorithm, Fisher scoring or iterative generalized least squares---may be more fully informed by empirical practice.
In this paper we discuss some of the practical problems in using multilevel techniques, by looking into the choices users of these techniques have to make. It is difficult, of course, to define "user". Different users have different degrees of statistical background, computer literacy, experience, and so on. We adopt a particular operational definition of a "user" in this paper, which certainly does not apply to all users. Our "user" is defined by the set of questions asked by the Statistical Standards and Methodology Division of the National Center for Education Statistics (as formulated in a letter of 9-16-93 of Bob Burton of NCES to Jerry Sacks of the National Institute of Statistical Sciences). This is in the context of a grant which has taken the path NCES → NSF → NISS → UCLA, and which has the specific purpose to evaluate the practical usefulness of multilevel modeling for educational statistics. We cannot discuss, let alone answer, all the questions from NCES in this paper. Many of them will require additional statistical and computational research, but they illustrate nicely some of the practical methodological problems in using hierarchical linear models.
Milkman (1978) accuses Arthur Jensen of misapplying heritability data in speculating on the causes of racial differences in intelligence test scores. He then offers a method for illuminating Jensen’s alleged “error”. It is contended in this article that Milkman (1978) has misconstrued Jensen’s (1973) argument and that, as a consequence, his method is without point.
An index measuring the degree to which a binary response pattern conforms to some baseline pattern was defined and named the Pattern Conformity Index (PCI). One way of conceptualizing what the PCI measures is the extent to which each individual's particular response pattern contributes to, or detracts from, the overall consistency found in the group's mode of responding. One use of the PCI consists of spotting anomalous response patterns that result from a student's problems. From this it is a short step to utilizing the PCI for identifying a subgroup of students for whom the given set of items approximately constitutes a unidimensionally scalable set. The duality between students and items then permits selection of a subset of items for further improving the unidimensionality. A measure of how constant an individual's response pattern remains for parallel subsets of items occurring earlier and later in a test was developed and called the Individualized Consistency Index (ICI). (Author/BW)
This article addresses the relationship between academic achievement and the student characteristics of absence and mobility. Mobility is a measure of how often a student changes schools. Absence is how often a student misses class. Standardized test scores are used as proxies for academic achievement. A model for the full joint distribution of the parameters and the data, both missing and observed, is postulated. After priors are elicited, a Metropolis-Hastings algorithm within a Gibbs sampler is used to evaluate the posterior distributions of the model parameters for the Pittsburgh Public Schools. Results are given in two stages. First, mobility and absence are shown to have, with high probability, negative relationships with academic achievement. Second, the posterior for mobility is viewed in terms of the equivalent harm done by absence: changing schools at least once in the three year period, 1998–2000, has an impact on standardized tests administered in the spring of 2000 equivalent to being absent about 14 days in 1999–2000 or 32 days in 1998–1999.
The preference scaling of a group of subjects may not be homogeneous, but different
groups of subjects with certain characteristics may show different preference scalings,
each of which can be derived from paired comparisons by means of the Bradley-Terry model.
Usually, either different models are fit in predefined subsets of the
sample, or the effects of subject covariates are explicitly specified in a parametric
model. In both cases, categorical covariates can be employed directly to distinguish
between the different groups, while numeric covariates are typically discretized
prior to modeling.
Here, a semi-parametric approach for recursive partitioning of Bradley-Terry models is
introduced as a means for identifying groups of subjects with homogeneous preference scalings
in a data-driven way. In this approach, the covariates that -- in main effects or
interactions -- distinguish between groups of subjects with different preference
orderings, are detected automatically from the set of candidate covariates. One main
advantage of this approach is that sensible partitions in numeric covariates are
also detected automatically.
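For reference, the Bradley-Terry model that is fitted within each group assigns object i a positive worth and sets the probability that i is preferred to j to worth_i / (worth_i + worth_j). The sketch below fits a single such model with the standard minorization-maximization update; it is not the recursive partitioning method itself. It assumes every object wins and loses at least one comparison, and the function name and input layout are hypothetical.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry worth parameters from paired comparisons.

    wins[i][j] is the number of times object i was preferred to object j.
    Returns worths normalized to sum to 1.
    """
    k = len(wins)
    worth = [1.0 / k] * k
    for _ in range(n_iter):
        updated = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            # Comparisons involving i, each scaled by the current pair total.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (worth[i] + worth[j])
                for j in range(k)
                if j != i
            )
            updated.append(total_wins / denom)
        scale = sum(updated)
        worth = [w / scale for w in updated]
    return worth

def preference_probability(worth, i, j):
    """Model probability that object i is preferred to object j."""
    return worth[i] / (worth[i] + worth[j])
```

A partitioning approach would, roughly speaking, fit such a model in candidate subgroups and split where the worth parameters differ; that step is outside this sketch.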
The item response times (RTs) collected from computerized testing represent an underutilized type of information about items and examinees. In addition to knowing the examinees’ responses to each item, we can investigate the amount of time examinees spend on each item. Current models for RTs mainly focus on parametric models, which have the advantage of conciseness, but may suffer from reduced flexibility to fit real data. We propose a semiparametric approach, specifically, the Cox proportional hazards model with a latent speed covariate to model the RTs, embedded within the hierarchical framework proposed by van der Linden to model the RTs and response accuracy simultaneously. This semiparametric approach combines the flexibility of nonparametric modeling and the brevity and interpretability of the parametric modeling. A Markov chain Monte Carlo method for parameter estimation is given and may be used with sparse data obtained by computerized adaptive testing. Both simulation studies and real data analysis are carried out to demonstrate the applicability of the new model.
In longitudinal education studies, assuming that dropout and missing data occur completely at random is often unrealistic. When the probability of dropout depends on covariates and observed responses (called missing at random [MAR]), or on values of responses that are missing (called informative or not missing at random [NMAR]), inappropriate analysis can cause biased estimates. NMAR requires explicit modeling of the missingness process together with the response variable. In this article, we review assumptions needed for consistent estimation of hierarchical linear growth models using common missing-data approaches. We also suggest a joint model for the longitudinal data and missingness process to handle the situation where data are NMAR. The different approaches are applied to the NELS:88 study, as well as simulated data. Results from the NELS:88 analyses were similar between the MAR and NMAR models. However, use of listwise deletion and mean imputation resulted in significant bias, both for the NELS:88 study and simulated data. Simulation results showed that incorrectly assuming MAR leads to greater bias for the growth-factor variance–covariance matrix than for the growth factor means, the former being severe with as little as 10% missing data and the latter with 40% missing data when departure from MAR is strong.
Test scores are commonly reported in a small number of ordered categories. Examples of such reporting include state accountability testing, Advanced Placement tests, and English proficiency tests. This article introduces and evaluates methods for estimating achievement gaps on a familiar standard-deviation-unit metric using data from these ordered categories alone. These methods hold two practical advantages over alternative achievement gap metrics. First, they require only categorical proficiency data, which are often available where means and standard deviations are not. Second, they result in gap estimates that are invariant to score scale transformations, providing a stronger basis for achievement gap comparisons over time and across jurisdictions. The authors find three candidate estimation methods that recover full-distribution gap estimates well when only censored data are available.
Two simple constraints on the item parameters in a response–time model are proposed to control the speededness of an adaptive test. As the constraints are additive, they can easily be included in the constraint set for a shadow-test approach (STA) to adaptive testing. Alternatively, a simple heuristic is presented to control speededness in plain adaptive testing without any constraints. Both types of control are easy to implement and do not require any other real-time parameter estimation during the test than the regular update of the test taker’s ability estimate. Evaluation of the two approaches using simulated adaptive testing showed that the STA was especially effective. It guaranteed testing times that differed less than 10 seconds from a reference test across a variety of conditions.
In this study some alternative item selection criteria for adaptive testing are proposed. These criteria take into account the uncertainty of the ability estimates. A general weighted information criterion of which the usual maximum information criterion and the proposed alternative criteria are special cases is suggested. A small simulation study was conducted to compare the different criteria. The results showed that the likelihood weighted information criterion is a good alternative to the maximum information criterion. Another good alternative is a maximum information criterion with the maximum likelihood estimator of ability replaced by the Bayesian expected a posteriori estimator.
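A likelihood weighted information criterion can be sketched as follows: instead of evaluating each candidate item's information at a single ability estimate, the information function is averaged over the ability scale with the current likelihood as weight. The sketch assumes a two-parameter logistic model and a simple grid approximation to the integral; both choices, and all function names, are illustrative assumptions rather than details taken from the study.

```python
import math

def prob_correct(theta, a, b):
    """2PL probability of a correct response (assumed response model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def likelihood(theta, administered):
    """Likelihood of the responses so far; administered holds (a, b, u) triples."""
    value = 1.0
    for a, b, u in administered:
        p = prob_correct(theta, a, b)
        value *= p if u == 1 else 1.0 - p
    return value

def select_item(pool, administered, grid):
    """Index of the pool item (a, b) with the largest likelihood-weighted information."""
    weights = [likelihood(t, administered) for t in grid]

    def weighted_info(item):
        a, b = item
        return sum(w * item_information(t, a, b) for w, t in zip(weights, grid))

    return max(range(len(pool)), key=lambda i: weighted_info(pool[i]))
```

Replacing the likelihood weights with a point mass at the current ability estimate recovers the usual maximum information criterion, which is the sense in which the weighted criterion generalizes it.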
This article addresses likely error rates for measuring teacher and school performance in the upper elementary grades using value-added models applied to student test score gain data. Using a realistic performance measurement system scheme based on hypothesis testing, the authors develop error rate formulas based on ordinary least squares and Empirical Bayes estimators. Empirical results suggest that value-added estimates are likely to be noisy using the amount of data that are typically used in practice. Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25% with 3 years of data and 35% with 1 year of data. Corresponding error rates for overall false positive and negative errors are 10% and 20%, respectively. Lower error rates can be achieved if schools are the performance unit. The results suggest that policymakers must carefully consider likely system error rates when using value-added estimates to make high-stakes decisions regarding educators.
Measuring teacher effectiveness is challenging since no direct estimate exists; teacher effectiveness can be measured only indirectly through student responses. Traditional value-added assessment (VAA) models generally attempt to estimate the value that an individual teacher adds to students' knowledge as measured by scores on successive administrations of a standardized test. Such responses, however, do not reflect the long-term contribution of a teacher to real-world student outcomes such as graduation, and cannot be used in most university settings where standardized tests are not given. In this paper, the authors develop a multiresponse approach to VAA models that allows responses to be either continuous or categorical. This approach leads to multidimensional estimates of value added by teachers and allows the correlations among those dimensions to be explored. The authors derive sufficient conditions for maximum likelihood estimators to be consistent and asymptotically normally distributed. The authors then demonstrate how to use SAS software to calculate estimates. The models are applied to university data from 2001 to 2008 on calculus instruction and graduation in a science or engineering field.
Adjusted Wald intervals for binomial proportions in one-sample and two-sample designs have been shown to perform about as well as the best available methods. The adjusted Wald intervals are easy to compute and have been incorporated into introductory statistics courses. An adjusted Wald interval for paired binomial proportions is proposed here and is shown to perform as well as the best available methods. A sample size planning formula is presented that should be useful in an introductory statistics course.
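For orientation, the familiar one-sample adjusted Wald interval (often called the Agresti-Coull interval) can be computed as below. This is the one-sample case only; the paired-proportion interval and the sample size formula proposed in the abstract are not reproduced here. The function name is hypothetical, and clipping the limits to [0, 1] is a choice made for this sketch.

```python
from statistics import NormalDist

def adjusted_wald_interval(successes, n, confidence=0.95):
    """One-sample adjusted Wald (Agresti-Coull) interval for a proportion.

    Adds z^2/2 pseudo-successes and z^2/2 pseudo-failures (about 2 and 2
    at 95% confidence) before applying the ordinary Wald formula.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n_adj = n + z * z
    p_adj = (successes + z * z / 2.0) / n_adj
    half_width = z * (p_adj * (1.0 - p_adj) / n_adj) ** 0.5
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

if __name__ == "__main__":
    print(adjusted_wald_interval(10, 20))
```

The adjustment pulls the center of the interval toward 0.5, which is what improves coverage relative to the unadjusted Wald interval when the sample is small or the proportion is near 0 or 1.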
Defining causal effects as comparisons between marginal population means, this article introduces marginal mean weighting through stratification (MMW-S) to adjust for selection bias in multilevel educational data. The article formally shows the inherent connections among the MMW-S method, propensity score stratification, and inverse-probability-of-treatment weighting (IPTW). Both MMW-S and IPTW are suitable for evaluating multiple concurrent treatments, and hence have broader applications than matching, stratification, or covariance adjustment for the propensity score. Furthermore, mathematical consideration and a series of simulations reveal that the MMW-S method has incorporated some important strengths of the propensity score stratification method, which generally enhance the robustness of MMW-S estimates in comparison with IPTW estimates. To illustrate, the author applies the MMW-S method to evaluations of within-class homogeneous grouping in early elementary reading instruction.
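One common way to write the marginal mean weight for a unit in stratum s that received treatment z is n_s * Pr(Z = z) / n_{z,s}, where n_s is the stratum size, n_{z,s} is the number of units in the stratum receiving z, and Pr(Z = z) is the overall proportion receiving z. The sketch below computes these weights under the assumption that strata have already been formed (for example from estimated propensity scores) and that every treatment appears in every stratum; the function name and inputs are hypothetical.

```python
from collections import Counter

def mmws_weights(treatments, strata):
    """Marginal mean weights through stratification, one per unit.

    treatments: treatment label for each unit
    strata: stratum label for each unit (assumed to be given)
    weight = n_s * Pr(Z = z) / n_{z,s}
    """
    n = len(treatments)
    n_z = Counter(treatments)
    n_s = Counter(strata)
    n_zs = Counter(zip(treatments, strata))
    return [
        n_s[s] * (n_z[z] / n) / n_zs[(z, s)]
        for z, s in zip(treatments, strata)
    ]
```

After weighting, each treatment group has the same distribution across strata as the full sample, so weighted group means can be compared as estimates of marginal population means, subject to the usual ignorability assumptions.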
In a randomized controlled trial, a decision needs to be made about the total number of subjects for adequate statistical power. One way to increase the power of a trial is by including a predictive covariate in the model. In this article, the effects of various covariate adjustment strategies on power are studied for discrete-time survival endpoints; the circumstances under which covariate adjustment results in a sufficient increase in power are examined. Using a predictive covariate may increase the costs for each subject, so it is useful to quantify when using a covariate is a cost-efficient strategy. The results reveal that using a covariate is highly recommended if the costs for measuring the covariate are relatively small and the correlation with the outcome is sufficiently high.
Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure modeling. Extending the logic developed by Yuan and Bentler, Cai, and Cai and Lee, we propose an alternative method for conducting multiple imputation–based inference for mean and covariance structure modeling. In addition to computational simplicity, our method naturally leads to an asymptotically chi-square model fit test statistic. Using simulations, we show that our new method is well calibrated, and we illustrate it with analyses of three real data sets. A SAS macro implementing this method is also provided.
Since heterogeneity between reliability coefficients is usually found in reliability generalization studies, moderator analyses constitute a crucial step for that meta-analytic approach. In this study, different procedures for conducting mixed-effects meta-regression analyses were compared. Specifically, four transformation methods for the reliability coefficients, two estimators of the residual between-studies variance, and two methods for testing the significance of regression coefficients were combined in a Monte Carlo simulation study. The different methods were compared in terms of bias and mean square error (MSE) of the slope estimates, and Type I error and statistical power rates for the slope statistical tests. The results of the simulation study did not vary as a function of the residual variance estimator. All transformation methods provided negatively biased estimates, but both bias and MSE were reasonably small in all cases. In contrast, important differences were found regarding statistical tests, with the method proposed by Knapp and Hartung showing a better adjustment to the nominal significance level and higher power rates than the standard method.
A binomial model is proposed for testing the significance of differences in binary response probabilities in two independent treatment groups. Without correction for continuity, the binomial statistic is essentially equivalent to Fisher’s exact probability. With correction for continuity, the binomial statistic approaches Pearson’s chi-square. Due to mutual dependence of the binomial and F distributions on the beta distribution, a simple F statistic can be used for computation instead of the binomial.
It is common practice to log-transform response times before analyzing them with standard factor analytical methods. However, sometimes the log-transformation is not capable of linearizing the relation between the response times and the latent traits. Therefore, a more general approach to response time analysis is proposed in the current manuscript. The approach is based on the assumption that the response times can be decomposed into a linear function of latent traits and a normally distributed residual term after the response times have been transformed by a monotone, but otherwise unknown transformation function. The proposed model can be fitted by a limited information approach, using the matrix of Kendall’s τ coefficients and unweighted least squares estimation. The transformation function can be determined by resorting to discrete time. The proposed approach offers a framework for testing model fit by comparing expected and observed correlations and for investigating the hypothesis about the form of the transformation function. The adequacy of the proposed approaches to model calibration and model validation is investigated in a simulation study. Two real data sets are analyzed as a demonstration of the model’s applicability.
In a traditional regression-discontinuity design (RDD), units are assigned to treatment on the basis of a cutoff score and a continuous assignment variable. The treatment effect is measured at a single cutoff location along the assignment variable. This article introduces the multivariate regression-discontinuity design (MRDD), where multiple assignment variables and cutoffs may be used for treatment assignment. For an MRDD with two assignment variables, we show that the frontier average treatment effect can be decomposed into a weighted average of two univariate RDD effects. The article discusses four methods for estimating MRDD treatment effects and compares their relative performance in a Monte Carlo simulation study under different scenarios.
Parametric analysis of covariance was compared to analysis of covariance with data transformed using ranks. Using a computer simulation approach, the two strategies were compared in terms of the proportion of Type I errors made and statistical power when the conditional distribution of errors was normal and homoscedastic, normal and heteroscedastic, non-normal and homoscedastic, and non-normal and heteroscedastic. The results indicated that parametric ANCOVA was robust to violations of either normality or homoscedasticity. However, when both assumptions were violated, the observed α levels underestimated the nominal α level when sample sizes were small and α = .05. Rank ANCOVA led to a slightly liberal test of the hypothesis when the covariate was non-normal, the sample size was small, and the errors were heteroscedastic. Practically significant power differences favoring the rank ANCOVA procedures were observed with moderate sample sizes and a variety of conditional distributions.
Empirical type I error rates and the power of the parametric and rank transform ANCOVA were compared for situations involving conditional distributions that differed between groups in skew and/or scale. For the conditions investigated in the study, the parametric ANCOVA was typically the procedure of choice both as a test of equality of conditional means and as a test of equality of conditional distributions. For those conditions in which rank ANCOVA was the procedure of choice, the power advantages were usually quite small.
This study examined the Type I error and power properties of the rank transform test when employed in the context of a balanced 2x2x2 fixed effects ANOVA. The results showed the rank transform procedure to be erratic with respect to both Type I error and power. Under some circumstances the test was both robust and powerful, whereas in other circumstances it was decidedly nonrobust and manifested power considerably below that of the usual ANOVA F test. It is recommended that researchers avoid this test except in those specific circumstances where its properties are well understood.
Regression methods can locate student test scores in a conditional distribution, given past scores. This article contrasts and clarifies two approaches to describing these locations in terms of readily interpretable percentile ranks or “conditional status percentile ranks.” The first is Betebenner’s quantile regression approach that results in “Student Growth Percentiles.” The second is an ordinary least squares (OLS) regression approach that involves expressing OLS regression residuals as percentile ranks. The study describes the empirical and conceptual similarity of the two metrics in simulated and real-data scenarios. The metrics contrast in their scale-transformation invariance and sample size requirements but are comparable in their dependence on the number of prior years used as conditioning variables. These results support guidelines for selecting the model that best fits the data and have implications for the interpretations of these percentile ranks as “growth” measures.
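The OLS approach described above can be sketched in a few lines. This is a minimal illustration with a single prior score, made-up data, and a midrank convention for percentile ranks; all three choices are assumptions of the sketch rather than details given in the article.

```python
def ols_fit(x, y):
    """Simple least squares fit of y on one predictor x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def percentile_ranks(values):
    """Percent of values below each value, counting ties as one half (midrank)."""
    n = len(values)
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(100.0 * (below + 0.5 * ties) / n)
    return ranks

# Hypothetical prior-year and current-year scale scores for eight students.
prior = [410, 455, 480, 500, 520, 560, 600, 640]
current = [430, 450, 510, 495, 560, 555, 640, 630]

intercept, slope = ols_fit(prior, current)
# Residual: how far each student scored above or below the prediction.
residuals = [c - (intercept + slope * p) for p, c in zip(prior, current)]
# Conditional status percentile rank: the residual's rank among all residuals.
conditional_pr = percentile_ranks(residuals)
```

A student with a high value in `conditional_pr` scored well relative to students with a similar prior score, regardless of where that prior score sat in the overall distribution.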
This paper derives a formula for the square of the correlation coefficient, r², in terms of its maximum possible value under rotation of the axes. The value of r² is minimized at 0 when the principal components of the scatterplot coincide with the axes, and is maximized when the principal components are at ±45° to the axes. An application of the formula to quick visual estimation of r is indicated.
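One way to write such a relationship is sketched below. It is derived here under stated assumptions and may differ in form from the paper's formula: θ is assumed to be the angle between the coordinate axes and the principal axes, and λ_1 ≥ λ_2 > 0 are assumed to be the variances along the two principal axes.

```latex
\begin{aligned}
\sigma_x^2 &= \lambda_1\cos^2\theta + \lambda_2\sin^2\theta, \\
\sigma_y^2 &= \lambda_1\sin^2\theta + \lambda_2\cos^2\theta, \\
\sigma_{xy} &= (\lambda_1 - \lambda_2)\sin\theta\cos\theta, \\
r^2 &= \frac{\sigma_{xy}^2}{\sigma_x^2\,\sigma_y^2}
     = \frac{r_{\max}^2\,\sin^2 2\theta}{1 - r_{\max}^2\,\cos^2 2\theta},
\qquad
r_{\max}^2 = \left(\frac{\lambda_1 - \lambda_2}{\lambda_1 + \lambda_2}\right)^{2}.
\end{aligned}
```

Setting θ = 0 gives r² = 0, and θ = ±45° gives r² = r²_max, in agreement with the two extremes stated in the abstract.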
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation for categorical data based on Dirichlet process mixtures of multinomial distributions. The approach automatically models complex dependencies while being computationally expedient. The Dirichlet process prior distributions enable analysts to avoid fixing the number of mixture components at an arbitrary number. We illustrate repeated sampling properties of the approach using simulated data. We apply the methodology to impute missing background data in the 2007 Trends in International Mathematics and Science Study.
A wide literature uses date of birth as an instrument to study the causal effects of educational attainment. This paper shows how parents delaying their children’s initial enrollment in kindergarten, a practice known as redshirting, can make estimates obtained through this identification framework all but impossible to interpret. A latent index model is used to illustrate how the monotonicity assumption in this framework is violated if redshirting decisions are made in a setting of essential heterogeneity. Empirical evidence is presented from the ECLS-K data set that favors this scenario; redshirting is common and heterogeneity in the treatment effect of educational attainment is likely a factor in parents’ redshirting decisions.
This study demonstrates how the stability of Mantel–Haenszel (MH) DIF (differential item functioning) methods can be improved by integrating information across multiple test administrations using Bayesian updating (BU). The authors conducted a simulation that showed that this approach, which is based on earlier work by Zwick, Thayer, and Lewis, can yield more accurate DIF estimation and improve the detection of DIF items, even when compared to other approaches that aggregate data across administrations. The authors also applied the method to data from several college-level tests. The BU approach provides a natural way to accumulate all known DIF information about each test item while mitigating the undesirable bias toward zero that affected the performance of two previous Bayesian DIF methods.
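The idea of accumulating DIF information across administrations can be illustrated with a conjugate normal update. This is a sketch under assumptions not stated in the abstract: the DIF estimate from each administration is treated as approximately normal with known sampling variance, the prior is normal, and the numbers and the function name `bayes_update` are hypothetical.

```python
def bayes_update(prior_mean, prior_var, estimate, sampling_var):
    """Normal-normal update: the posterior is a precision-weighted average."""
    post_precision = 1.0 / prior_var + 1.0 / sampling_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + estimate / sampling_var)
    return post_mean, post_var

# Hypothetical (DIF estimate, squared standard error) for one item
# over three successive test administrations.
administrations = [(-0.80, 0.25), (-0.50, 0.20), (-0.65, 0.30)]

mean, var = 0.0, 1.0  # assumed starting prior, centered on "no DIF"
history = []
for estimate, se2 in administrations:
    # The posterior from one administration becomes the prior for the next.
    mean, var = bayes_update(mean, var, estimate, se2)
    history.append((mean, var))
```

Each administration shrinks the posterior variance, which is the sense in which pooling information can stabilize the DIF estimate for an item.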
The posterior distribution of the bivariate correlation (ρxy) is analytically derived given a data set consisting of N1 cases measured on both x and y, N2 cases measured only on x, and N3 cases measured only on y. The posterior distribution is shown to be a function of the subsample sizes, the sample correlation (rxy) computed from the N1 complete cases, a set of four statistics which measure the extent to which the missing data are not missing completely at random, and the specified prior distribution for ρxy. A sampling study suggests that in small (N = 20) and moderate (N = 50) sized samples, posterior Bayesian interval estimates will dominate maximum likelihood based estimates in terms of coverage probability and expected interval widths when the prior distribution for ρxy is simply uniform on (0, 1). The advantage of the Bayesian method when more informative priors based on beta densities are employed is not as consistent.
The purpose of this paper is to formulate optimal sequential rules for mastery tests. The framework for the approach is derived from Bayesian sequential decision theory. Both a threshold and linear loss structure are considered. The binomial probability distribution is adopted as the psychometric model involved. Conditions sufficient for sequentially setting optimal cutting scores are presented. Optimal sequential rules are derived for the case of a subjective beta distribution representing prior true level of functioning. An empirical example of sequential mastery testing for concept-learning in medicine concludes the paper.
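The building block of such rules, conjugate updating of a beta prior under a binomial model, can be sketched as follows. The prior parameters and response pattern are hypothetical, and the sketch shows only the posterior update after each item; it does not implement the optimal stopping rules derived in the paper.

```python
def update_beta(alpha, beta, correct):
    """Beta-binomial update: one more success or one more failure."""
    return (alpha + 1, beta) if correct else (alpha, beta + 1)

alpha, beta = 2.0, 2.0  # assumed subjective beta prior on the true level
responses = [True, True, False, True, True]  # hypothetical item outcomes

posterior_means = []
for correct in responses:
    alpha, beta = update_beta(alpha, beta, correct)
    # Posterior mean of the true level of functioning after this item.
    posterior_means.append(alpha / (alpha + beta))
```

A sequential rule would inspect the posterior after each item and decide whether to classify the examinee or administer another item.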
The Libby-Novick class of three-parameter generalized beta distributions is shown to provide a rich class of prior distributions for the binomial model that removes some of the restrictions of the standard beta class. The posterior distribution defines a new class of four-parameter generalized beta distributions for which numerical posterior analysis is easily done. A numerical example indicates the desirability of using these wider classes of densities for binomial models, particularly in an interactive computing environment.
Bayesian network models offer a large degree of flexibility for modeling dependence among observables (item outcome variables) from the same task. This article explores four design patterns for modeling locally dependent observations: (a) no context, which ignores dependence among observables; (b) compensatory context, which introduces a latent variable, context, to model task-specific knowledge and uses a compensatory model to combine this with the relevant proficiencies; (c) inhibitor context, which introduces a latent variable, context, to model task-specific knowledge and uses an inhibitor (threshold) model to combine this with the relevant proficiencies; (d) compensatory cascading, which models each observable as dependent on the previous one in sequence. The four design patterns are explored through experiments with simulated and real data. When the proficiency variable is categorical, a simple Mantel-Haenszel procedure can test for local dependence. Although local dependence can cause problems in the calibration, if the models based on these design patterns are successfully calibrated to data, all the design patterns appear to provide very similar inferences about the students. Based on these experiments, the simpler no context design pattern appears more stable than the compensatory context model, while not significantly affecting the classification accuracy of the assessment. The cascading design pattern seems to pick up on dependencies missed by other models and should be explored with further research.
Though ubiquitous, Likert scaling’s traditional mode of analysis is often unable to uncover all of the valid information in a data set. Here, the authors discuss a solution to this problem based on methodology developed by quantum physicists: the state multipole method. The authors demonstrate the relative ease and value of this method by examining college students’ endorsement of one possible cause of prejudice: segregation. Though the mean level of students’ endorsement did not differ among ethnic groups, an examination of state multipoles showed that African Americans had a level of polarization in their endorsement that was not reflected by Hispanics or European Americans. This result could not have been obtained with the traditional approach and demonstrates the new method’s utility for social science research.
Does reduced class size cause higher academic achievement for both Black and other students in reading, mathematics, listening, and word recognition skills? Do Black students benefit more than other students from reduced class size? Does the magnitude of the minority advantages vary significantly across schools? This article addresses the causal questions via analysis of experimental data from Tennessee’s Student/Teacher Achievement Ratio study where students and teachers are randomly assigned to small or regular class type. Causal inference is based on a three-level multivariate simultaneous equation model (SM) where the class type as an instrumental variable (IV) and class size as an endogenous regressor interact with a Black student indicator. The randomized IV causes class size to vary which, by hypothesis, influences academic achievement overall and moderates a disparity in academic achievement between Black and other students. Within each subpopulation characterized by ethnicity, the effect of reduced class size on academic achievement is the average causal effect. The difference in the average causal effects between the race-ethnic groups yields the causal disparity in academic achievement. The SM efficiently handles ignorable missing data with a general missing pattern and is estimated by maximum likelihood. This approach extends Rubin’s causal model to a three-level SM with cross-level causal interaction effects, requiring intact schools and no interference between classrooms as a modified Stable Unit Treatment Value Assumption. The results show that, for Black students, reduced class size causes higher academic achievement in the four domains each year from kindergarten to third grade, while for other students, it improves the four outcomes in kindergarten and first grade only, with the exception of first-grade listening. Evidence shows that Black students benefit more than others from reduced class size in first-, second-, and third-grade academic achievement. This article does not find evidence that the causal minority disparities are heterogeneous across schools in any given year.
Doubly bounded continuous data are common in the social and behavioral sciences. Examples include judged probabilities, confidence ratings, derived proportions such as percent time on task, and bounded scale scores. Dependent variables of this kind are often difficult to analyze using normal theory models because their distributions may be quite poorly modeled by the normal distribution. The authors extend the beta-distributed generalized linear model (GLM) proposed in Smithson and Verkuilen (2006) to discrete and continuous mixtures of beta distributions, which enables modeling dependent data structures commonly found in real settings. The authors discuss estimation using both deterministic marginal maximum likelihood and stochastic Markov chain Monte Carlo (MCMC) methods. The results are illustrated using three data sets from cognitive psychology experiments.
Both the binomial and beta-binomial models are applied to various problems occurring in mental test theory. The paper reviews and critiques these models. The emphasis is on the extensions of the models that have been proposed in recent years, and that might not be familiar to many educators.
We explore the use of instrumental variables (IV) analysis with a multisite randomized trial to estimate the effect of a mediating variable on an outcome. We use a random-coefficient IV model that allows both the impact of program assignment on the mediator (compliance with assignment) and the impact of the mediator on the outcome (the treatment effect) to vary across sites and to covary with one another. This extension of conventional fixed-coefficient IV analysis illuminates a potential bias in IV analysis, which Reardon and Raudenbush (forthcoming) refer to as "compliance-effect covariance bias." We first derive an expression for this bias and then use simulations to investigate the sampling variance of the conventional fixed-coefficient two-stage least squares (2SLS) estimator in the presence of varying (and covarying) compliance and treatment effects. We next develop an alternate IV estimator that is less susceptible to compliance-effect covariance bias. We compare the bias, sampling variance, and root mean squared error of this "bias-corrected IV estimator" with those of 2SLS and ordinary least squares (OLS). We find that when the first stage F-statistic exceeds 10 (a commonly used threshold for instrument strength), the bias-corrected estimator typically performs better than 2SLS or OLS. In the last part of the paper, we use both the new estimator and 2SLS to reanalyze data from two large multisite studies.
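The source of compliance-effect covariance bias can be illustrated numerically. The sketch below rests on simplifying assumptions that are not spelled out in the abstract: sites are equally sized, sampling error is ignored, and the pooled fixed-coefficient IV estimand is taken to be the ratio of the average effect of assignment on the outcome to the average effect of assignment on the mediator. The site-level values are hypothetical.

```python
from statistics import mean

# Hypothetical site-level parameters for four equally sized sites.
compliance = [0.2, 0.4, 0.6, 0.8]  # effect of assignment on the mediator
effect = [1.0, 2.0, 3.0, 4.0]      # effect of the mediator on the outcome

# Site-level effect of assignment on the outcome (compliance times effect).
itt_outcome = [g * d for g, d in zip(compliance, effect)]

# Pooled IV estimand under the assumptions above: a ratio of averages.
pooled_iv = mean(itt_outcome) / mean(compliance)

# Target parameter: the average effect of the mediator across sites.
avg_effect = mean(effect)

# Population covariance between compliance and effect across sites.
cov_ce = mean(itt_outcome) - mean(compliance) * mean(effect)

# The gap between the two equals the covariance over average compliance.
bias = cov_ce / mean(compliance)
```

In this example sites with higher compliance also have larger effects, so the pooled ratio (3.0) overstates the average effect (2.5) by exactly `bias` (0.5). When compliance and effect are uncorrelated across sites, the gap vanishes.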
This article considers the impact of missing data arising from balanced incomplete block (BIB) spiraled designs on the chi-square goodness-of-fit test in factor analysis. Specifically, data arising from BIB designs possess a unique pattern of missing data that can be characterized as missing completely at random (MCAR). Standard approaches to factor analyzing such data rest on forming pairwise available case (PAC) covariance matrices. Developments in statistical theory for missing data show that PAC covariance matrices may not satisfy Wishart distribution assumptions underlying factor analysis, thus impacting tests of model fit. One approach, advocated by Muthén, Kaplan, and Hollis (1987) for handling missing data in structural equation modeling, is proposed as a possible solution to these problems. This study compares the new approach to the standard PAC approach in a Monte Carlo framework. Results show that tests of goodness-of-fit are very sensitive to PAC approaches even when data are MCAR, as is the case for BIB designs. The new approach is shown to outperform the PAC approach for continuous variables and is comparatively better for dichotomous variables.
The authors present a generalization of the multiple-group bifactor model that extends the classical bifactor model for categorical outcomes by relaxing the typical assumption of independence of the specific dimensions. In addition to the means and variances of all dimensions, the correlations among the specific dimensions are allowed to differ between groups. By including group-specific difficulty parameters, the model can be used to assess differential item functioning (DIF) for testlet-based tests. The model encompasses various item response models for polytomous data by allowing for different link functions, and it includes testlet and second-order models as special cases. Importantly, by assuming that the testlet dimensions are conditionally independent given the general dimension, the authors show, using a graphical model framework, that the integration over all latent variables can be carried out through a sequence of computations in two-dimensional subspaces, making full-information maximum likelihood estimation feasible for high-dimensional problems and large datasets. The importance of relaxing the orthogonality assumption and allowing for a different covariance structure of the dimensions for each group is demonstrated in the context of the assessment of DIF. Through a simulation study, it is shown that ignoring between-group differences in the structure of the multivariate latent space can result in substantially biased estimates of DIF.