# Lee J. Cronbach's research while affiliated with University of Illinois at Chicago and other places

## Publications (28)

A measuring operation is a sample from a universe of admissible observations. ... Generalizability studies estimate the magnitude of the discrepancies likely to arise under a given measuring procedure, and provide formulas for establishing interval and point estimates of the universe score. A multifacet generalizability analysis departs in several wa...

The properties of various internal-consistency formulas have been examined with hypothetic stratified-parallel tests constructed by sampling items from universes with specified characteristics. "When a test is constructed by stratifying on content and difficulty, one may properly estimate its coefficient of generalizability by αCD or αC… . Stratify...

Generalizability theory concerns the adequacy with which a universe score can be inferred from a set of observations. In this paper the theory is applied to a universe in which observations are classifiable according to two independent variable aspects of the measuring procedure. Several types of universe scores are developed and the variance compo...

The theory of generalizability (Cronbach, et al., 1963) regards a measure as sampled from a universe of comparable but not necessarily equivalent measures. The most suitable index of agreement would be the average of the squared correlations of the measures with the average of all measures. Since this cannot be directly determined, the intraclass c...

Comments on a letter by Raymond B. Cattell (1964), in which he defended the use of the 16PF. Cattell thinks that L. J. Cronbach's criticism of the 16PF is outdated because Cronbach's views on reliability have since developed. Cronbach still stands by his view that 16-PF scales should have form-to-form reliabilities well above .55 if the scores are...

"'Reliability theory' is reinterpreted as a theory regarding the adequacy with which one can generalize from one observation to a universe of observations. If the observation is randomly sampled from the universe—whether or not the universe consists of equivalent observations—the intraclass correlation provides an approximate lower bound to the exp...
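As a rough illustration of the intraclass correlation mentioned above, the sketch below computes it from a persons-by-observations table. The abstract does not say which form of the coefficient is meant, so the one-way random-effects form for a single observation is an assumption here, and the function name `icc_oneway` is hypothetical.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects intraclass correlation for a single observation.

    ratings: list of rows, one per person; each row holds k observations.
    Assumption: this one-way ANOVA form is used only for illustration;
    the source abstract does not name a specific variant.
    """
    n = len(ratings)        # number of persons
    k = len(ratings[0])     # observations per person
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    # Mean square between persons and mean square within persons.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

When every person receives identical observations, the within-person mean square is zero and the coefficient equals 1; disagreement among observations pulls it below 1.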

Loewe's recommendations regarding treatment of stimulus-response relations are criticized. Conditions are described where quantal analysis is justified. Loewe's interpretation of his graded analysis must be modified in the light of the fact that response curves for individuals frequently cross. Superior lines of attack on the problem are suggested.

Lord's method of analysis in his paper on the usefulness of unreliable difference scores is general and applies to all reliability and validity coefficients. Three risks which may be taken into account in fixing strategies for test interpretation and for evaluating the usefulness of a test are identified and discussed: fixed risk, average risk, and fix...

"Construct validation was introduced in order to specify types of research required in developing tests for which the conventional views on validation are inappropriate. Personality tests, and some tests of ability, are interpreted in terms of attributes for which there is no adequate criterion. This paper indicates what sorts of evidence can subst...

General methodological difficulties are discussed, particularly: the need to discuss similarity only with respect to specified dimensions, loss of information involved when configurations are reduced to indices, the need to interpret a similarity index as a relative rather than as an absolute measure, and the general non-comparability of scale unit...

The validity of a univocal multiple-choice test is determined for varying distributions of item difficulty and varying degrees of item precision. Validity is a function of $d^2 + v^2$, where $d$ measures item unreliability and $v$ measures the spread of item difficulties. When this variance is very small, validity is high for one optimum cutting score...

A general formula (α) of which a special case is the Kuder-Richardson coefficient of equivalence is shown to be the mean of all split-half coefficients resulting from different splittings of a test. α is therefore an estimate of the correlation between two random samples of items from a universe of items like those in the test. α is found to be an...
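The claim that α equals the mean of all split-half coefficients can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's own procedure: it assumes an even number of items, equal-sized halves, and the variance-based split-half formula 2(1 − (V_a + V_b)/V_t) rather than a Spearman-Brown corrected correlation. The function names and the example scores are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items table (list of rows)."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def split_half_coefficient(scores, half_a):
    """Variance-based split-half coefficient for one split.

    half_a: indices of the items placed in the first half.
    Assumption: the formula 2 * (1 - (var_a + var_b) / var_total) is used.
    """
    k = len(scores[0])
    half_b = [j for j in range(k) if j not in half_a]
    a = [sum(row[j] for j in half_a) for row in scores]
    b = [sum(row[j] for j in half_b) for row in scores]
    total = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pvariance(a) + pvariance(b)) / pvariance(total))


def mean_split_half(scores):
    """Average the split-half coefficient over every equal-sized split."""
    k = len(scores[0])
    # Keep item 0 in the first half so each split is counted exactly once.
    splits = [(0,) + rest for rest in combinations(range(1, k), k // 2 - 1)]
    return mean(split_half_coefficient(scores, s) for s in splits)


# Hypothetical data: five persons answering four items.
example_scores = [
    [2, 3, 3, 4],
    [4, 4, 5, 5],
    [1, 2, 2, 1],
    [3, 3, 4, 4],
    [5, 4, 5, 5],
]
```

Calling `cronbach_alpha(example_scores)` and `mean_split_half(example_scores)` should give the same value up to floating-point error, since each pair of items lands in opposite halves in the same fraction of splits.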

Non-spurious methods are needed for estimating the coefficient of equivalence for speeded tests from single-trial data. Spuriousness in a split-half estimate depends on three conditions; the split-half method may be used if any of these is demonstrated to be absent. A lower-bounds formula, $r_c$, is developed. An empirical trial of this coefficient an...

... The effect of a changing research environment on replicability may be inferred by performing the same experiment in several randomly selected environments and applying a mixed model analysis to assess treatment effects. Such mixed-model analysis has been a staple of fields including agriculture, biology, and psychology (Littell et al. 1996, Milliken & Johnson 2009, Kafkafi et al. 2005, Cronbach et al. 1963, Cronbach 1972, Shavelson et al. 1989). However, this approach is limited because it is often impractical to do a study more than once due to costs, time, or lack of incentives for carrying out replication studies (National Academies 2019, pp. 137-138, Koole & Lakens 2012, Lundwall 2019). ...

... Profile analysis: To examine whether in situ elevated air temperatures could evoke phenotypic responses of functional leaf traits dissimilar to those induced by ambient temperatures, we implemented a multivariate statistical technique known as "profile analysis". Profile analysis is a repeated measures extension of multivariate analysis of variance (MANOVA), which is primarily focused on profiles (vectors) of multivariate data gathered by repeated measurements of a variable from the same individual at several different points in time [117][118][119][120][121]. In this approach, the different measurements conducted on each individual should be considered as multiple dependent variables [117,119]. ...

... The ICC is the most used agreement index in research because it is the oldest and most flexible. To the best of our knowledge, the first application of ICCs as agreement indices occurred in 1964 (Cronbach, Ikeda, & Avner, 1964). Currently, ICCs are used to estimate IRA in different contexts, such as developmental psychology (e.g., Silva, Crespo, Carona, Bullinger, & Canavarro, 2015) and neuropsychology (e.g., Semrau et al., 2015) as well as in clinical (e.g., Ratelle, Kelm, Halvorsen, West, & Oxentenko, 2015; Unsworth, Harries, & Davies, 2015), educational (e.g., Dickman, 2014), and organizational contexts (e.g., Pearsall & Venkataramani, 2015). ...

... Prior to computing observed scores, it is important to obtain validity evidence in support of observed score use (Kane, 2009; Messick, 1989). From a qualitative perspective, observed scores exhibit validity when they reflect the intended meaning and magnitude of the construct of interest (e.g., see Cronbach & Meehl, 1955). outlined a three-phase sequential framework for generating validity evidence for observed scores. ...

... The Cronbach's alpha (α), item-total correlation, factor loadings, composite reliability (CR), and average variance extracted (AVE) obtained in this study are shown in Table 4. The α and CR values were calculated to evaluate the internal reliability at 0.70 (Cronbach et al., 1972). The values have acceptable internal consistency with a range of 0.874-0.927 ...

... Internal consistency reliability was assessed using Cronbach's alpha [42] despite shortcomings on inflation [43], given that those coefficients that originate in the Confirmatory Factor Analysis model such as maximal and omega [44] internal consistency reliability cannot be estimated with domains having only two items [44], as models are not identified (non-positive degrees of freedom). Estimates ranged between 0.60 and 0.80; specifically, verbal bullying had an alpha of 0.803, relational bullying 0.634, physical bullying 0.683, and online bullying 0.727. ...

... The robustness of the model against violations of this assumption should be examined. Cronbach and Merwin (1955) have suggested a model which assumes that the alternatives have different probabilities of being chosen. Although their model is mathematically unwieldy, it could be employed in Monte Carlo studies of the model used here. ...

... Adaptive test designs have in common that these are usually more efficient in terms of shorter test lengths while providing equal or even higher measurement precision. Furthermore, this type of design is associated with higher predictive validity compared to linear fixed-length tests (Betz & Weiss, 1974; Chang, 2015; Cronbach & Gleser, 1957; Hendrickson, 2007; Jodoin et al., 2006; Kim & Plake, 1993; Linn et al., 1969; Lord, 1980; Schnipke & Reese, 1997; Wainer et al., 2000; Weiss, 1982; Weiss & Kingsbury, 1984). In particular, the advantages of adaptive test designs will occur for the more extreme abilities at the lower and upper end of the measurement scale (Hendrickson, 2007; Lord, 1974, 1980). ...

... Consideration of these recommendations along with the findings of relevant studies (e.g., Bolotin, 1960; Chambers & Hamlin, 1957; Dana, 1962; Dawes, 1962; Dennis, 1960; Mogar, 1962; Hosteller & Bush, 1954; Secord, 1952; Wallon, 1959), criticism (e.g., Brown, 1952; Hamlin, 1954; Hammer, 1959; Schneider, 1950; Shneidman, 1959), unpublished manuscripts, and personal communications by Karen Machover and Solomon Machover, provided enriching sources for conceptualizing the method elected here. In sum, the empirical literature suggested that a concurrent validity study, applying the method of correct matchings of global judgments (Cronbach, 1950) to clinically homogeneous groups, would be appropriate. Researchers have cautioned that judges be experts, that samples be relevant, that demographic and sampling variables be controlled, and that the task be clearly defined: not too simple, yet not unmanageably complex. ...

... One potential source of noise in Likert-scale tests is response styles. Response styles can be thought of as a tendency to systematically select responses as a function of item format rather than item content, which can decrease the validity of a test (Cronbach 1946, 1950). Likert-type items are particularly prone to response style effects (response style effects, if unaccounted for, can negatively impact model fit and estimate accuracy through the false attribution of responses to content-related traits). ...