About
51
Publications
44,196
Reads
5,644
Citations
Introduction
Additional affiliations
May 2015 - present
Education
September 1977 - June 1981
September 1977 - June 1981
October 1971 - June 1977
Publications (51)
We set out to test the notion that Automatic Essay Scoring (AES) can be used as a criterion for judging the quality of human rating of essays. A simulation study (Navon & Cohen, 2006) showed that by replacing human ratings which disagree with computer-generated ratings, we can improve the validity of the rating. To test this hypothesis, we applied...
Higher Education Admissions Practices, edited by María Elena Oliveri (January 2020)
In the current study, two pools of 250 essays, all written as a response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to the essay’s true score. An AES system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two or th...
The intra-rater reliability in rating essays is usually indexed by the inter-rater correlation. We suggest an alternative method for estimating intra-rater reliability, in the framework of classical test theory, by using the dis-attenuation formula for inter-test correlations. The validity of the method is demonstrated by extensive simulations, and...
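Under classical test theory, the disattenuation formula estimates the correlation between true scores as $r_{xy}/\sqrt{\rho_{xx}\rho_{yy}}$. The sketch below shows one way that formula can yield single-rater reliabilities; it is an illustration, not necessarily the method of the paper. It assumes three raters who rate the same essays and measure the same true score (a true-score correlation of 1), so that $r_{ij} = \sqrt{\rho_i \rho_j}$ and each reliability can be solved from the three observed inter-rater correlations. The function names are hypothetical.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Classical-test-theory estimate of the correlation between true scores."""
    return r_xy / math.sqrt(rel_x * rel_y)


def rater_reliabilities(r12: float, r13: float, r23: float) -> tuple[float, float, float]:
    """Solve r_ij = sqrt(rel_i * rel_j) for the reliabilities of three raters.

    Assumption (for illustration only): all three raters measure the same
    true score, i.e. the disattenuated correlation of every pair equals 1.
    """
    rel1 = r12 * r13 / r23
    rel2 = r12 * r23 / r13
    rel3 = r13 * r23 / r12
    return rel1, rel2, rel3


if __name__ == "__main__":
    # Observed correlations implied by reliabilities 0.8, 0.6 and 0.5.
    r12 = math.sqrt(0.8 * 0.6)
    r13 = math.sqrt(0.8 * 0.5)
    r23 = math.sqrt(0.6 * 0.5)
    rels = rater_reliabilities(r12, r13, r23)
    print([round(r, 3) for r in rels])
    # Disattenuating with the recovered reliabilities gives a true-score correlation near 1.
    print(round(disattenuate(r12, rels[0], rels[1]), 3))
```

The three-rater design is the smallest one in which the system of equations has a unique solution; with more raters, the same relation could be fitted by least squares instead.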
The rating of essays written as a response to a given prompt is a complex cognitive task that encompasses many subtasks. Reading is of course the main task, but also: understanding and interpreting the written essay; relating to its cultural context; constructing a theory-of-mind of the writer; conducting comparison processes – with other essays an...
It is suggested that some shortcomings of Null Hypothesis Significance Testing (NHST), viewed from the perspective of Bayesian statistics, turn benign once the traditional threshold p value of .05 is substituted by a sufficiently smaller value. To illustrate, the posterior probability of H0 stating P=.5, given data that just render it rejected by N...
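To make the comparison concrete, the sketch below computes the posterior probability of H0: P = .5 for binomial data that are just barely significant under an exact two-sided test. The binomial model, the equal prior odds, and the uniform prior on P under H1 are assumptions chosen for illustration, not taken from the paper, and the function names are hypothetical. Under a uniform prior, the marginal likelihood of k successes in n trials is 1/(n + 1).

```python
from math import comb
from typing import Optional


def two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for H0: P = .5 (a symmetric null)."""
    tail_start = max(k, n - k)
    tail = sum(comb(n, j) for j in range(tail_start, n + 1)) / 2**n
    return min(1.0, 2 * tail)


def just_rejecting_k(n: int, alpha: float) -> Optional[int]:
    """Smallest k >= n/2 whose two-sided p-value does not exceed alpha."""
    for k in range((n + 1) // 2, n + 1):
        if two_sided_p(k, n) <= alpha:
            return k
    return None


def posterior_h0(k: int, n: int, prior_h0: float = 0.5) -> float:
    """P(H0 | k successes in n trials), assuming P ~ Uniform(0, 1) under H1."""
    like_h0 = comb(n, k) / 2**n
    like_h1 = 1 / (n + 1)  # integral of C(n,k) p^k (1-p)^(n-k) dp over [0, 1]
    return prior_h0 * like_h0 / (prior_h0 * like_h0 + (1 - prior_h0) * like_h1)


if __name__ == "__main__":
    for n in (100, 1000):
        for alpha in (0.05, 0.005):
            k = just_rejecting_k(n, alpha)
            print(n, alpha, k, round(posterior_h0(k, n), 3))
```

Under these assumptions, for n = 100 the posterior probability of H0 at a just-significant result (k = 61 at α = .05) is far larger than .05, and it shrinks when the threshold is lowered, because a larger k is then needed to reject.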
This document briefly summarizes the method of Multiple Linear Calibration (MLC) for calibrating a set of ratings given by different raters to a set of work samples. Each work sample is rated by at least two raters, and the researcher or practitioner is interested in the best (in a sense discussed below) parameters for calibrating all the ratin...
Content is one of the main writing dimensions on which essays are judged and rated. Since no automated essay scoring (AES) system is capable (yet) of truly understanding the content of an essay and assessing its breadth, depth and relevance, AES systems use indirect methods and proxy indices for judging its quality. Most such indices are based on m...
The vast diversity of operational definitions of learning disabilities (LD) and of the practices used for their diagnosis threatens standardization, objectivity and fairness in the diagnosis of LD and in the provision of test accommodations.
The current paper describes an endeavor to overcome this problem by regulating and standardizing the diagnosis of learni...
The current paper describes an endeavor to develop policy and procedure for standardizing and regulating the diagnosis of learning disability (LD), both in applicants to higher education institutions and in currently enrolled students, and for regulating the provision of test accommodations. This endeavor, conducted by The National Institute for Tes...
One of the criteria by which essays are judged and rated is lexical diversity, or richness of vocabulary: the quality, depth and sophistication of the lexicon that the writer employs.
The purpose of this study is to examine the efficiency (validity) of various measures of lexical diversity which can be used in the context of automated scoring of...
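Two familiar lexical diversity measures are the type-token ratio (TTR), the number of distinct words divided by the total number of words, and variants that reduce its dependence on text length, such as Guiraud's root TTR and the moving-average TTR (MATTR). The sketch below computes these three. They are common examples from the literature, not necessarily the measures evaluated in this study, and the crude regular-expression tokenizer is a simplifying assumption.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Crude tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct words / total words (tokens must be non-empty)."""
    return len(set(tokens)) / len(tokens)


def root_ttr(tokens: list[str]) -> float:
    """Guiraud's index: types / sqrt(tokens), less sensitive to text length."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def mattr(tokens: list[str], window: int) -> float:
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens."""
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    tokens = tokenize("The cat sat on the mat.")
    print(ttr(tokens), root_ttr(tokens), mattr(tokens, window=5))
```

Plain TTR falls as essays get longer, which is why length-corrected variants such as MATTR are usually preferred when essays of different lengths are compared.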
Among the large variety of attentional tasks that have been used to study sustained attention, the Continuous Performance Task (CPT) is perhaps the most widely used. Despite substantial differences in task characteristics and demands, all CPT paradigms have been referred to as measures of sustained attention. In the present study we introduce a new...
In 2000, NITE launched the Hebrew Language Project (HLP). The goal of the project is to develop computational tools for the analysis and evaluation of Hebrew texts. The current paper reports the results of two studies. The first study examined the differential contribution of quantified text features to the automated scoring of essays elicited...
Gamliel & Cahan (2004) have argued that the fairness of admission procedures requires that the relative representation of socio-economic status (SES) groups within the whole group of students (namely, those candidates who were admitted and actually enrolled) equals the relative representation SES groups would have, had the admission been based on a perfect...
The use of CAT in higher education admissions testing in Israel is described. This includes: (1) AMIRAM—a CAT of English as a foreign language that has been used by various institutions of higher education for placement purposes for the past 22 years, and (2) MIFAM—a CAT version of the Psychometric Entrance Test that has been in use for nine years...
In the last two decades there has been an increase in the number of university applicants who are diagnosed as learning disabled (LD) and for whom test accommodations on university entrance exams are provided. The most frequent recommendation in the diagnostic reports of LD applicants is to extend the time limits of their tests. In the context of h...
Test transadaptation (translation and adaptation) is the process whereby a test constructed in one language and culture is prepared for use in a second language and culture. Test transadaptation involves both the translation and adaptation of items written originally in the source language and the replacement of items unsuitable for translation...
Two methods proposed for determining the lengths of the subtests of a test with a fixed total testing time, so as to maximize the predictive validity of the test, were compared. In the search method (Kennet-Cohen, Bronner, & Cohen, 2003) a search for the optimal allocation of the total testing time among the subtests is conducted by a repetitive pr...
Psychometric measurement based on subjective judgments of performance quality (e.g., essay ratings) is typically not very reliable. The subjective judgments are often integrated into a single score by means of the following scoring model: initially, two independent judgments are conducted; then, if the absolute difference between them is not...
The maintenance of scoring standards is crucial to any testing program, and to high stakes testing in particular. Not only is it important in tests that are composed of multiple choice items, but it warrants special attention when scoring is based on the judgment of human readers or raters, for example in essay tests. Several means are employed to...
Automated essay scoring (AES) can be a reliable and efficient assessment procedure. AES is currently performed using three types of methods: those based on analysis of surface features of the text, those based on analysis of semantic space, and those based on natural language processing (NLP).
Each type of method is sensitive, to a certain extent,...
Tests used for college or university admissions normally contain several types of items. After the desired set of item types has been specified, a decision regarding the proportions of the various item types has to be made. This work offers an approach for determining these proportions. The proposed approach is based on maximization of the predicti...
Modified parallel analysis (MPA) is a heuristic method for assessing "approximate unidimensionality" of item pools. It compares the second eigenvalue of the observed correlation matrix with the corresponding eigenvalue extracted from a "parallel" matrix generated by a unidimensional and locally independent model. Revised MPA (RMPA) generalizes...
American Sign Language (ASL) is a gestural language used by the hearing impaired. This paper describes experimental tests with deaf subjects that compared the most effective known methods of creating extremely compressed ASL images. The minimum requirements for intelligibility were determined for three basically different kinds of transformations:...
A method for the efficient coding of line drawings is discussed. The intent is to provide an extremely low bandwidth representation for images, preserving “intelligible” image content but not necessarily image quality. This has led to an image transformation involving edge enhancement, detection, line thinning, and polygonal splining which is terme...
Quadtrees are a compact hierarchical method of representation of images. In this paper, we explore a number of hierarchical image representations as applied to binary images, of which quadtrees are a single exemplar. We discuss quadtrees, binary trees, and an adaptive hierarchical method. Extending these methods into the third dimension of time res...
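A minimal sketch of a region quadtree for a square binary image whose side is a power of two: a block that is all 0s or all 1s becomes a leaf, and any mixed block is split into four quadrants. The tuple-based node representation and the function names are illustrative choices, not taken from the paper, and the sketch does not cover the binary-tree, adaptive, or time-extended variants the paper discusses.

```python
def build(img, r0=0, c0=0, size=None):
    """Return a leaf (0 or 1) for a uniform block, else a 4-tuple (NW, NE, SW, SE)."""
    if size is None:
        size = len(img)  # assumes a square image with a power-of-two side
    first = img[r0][c0]
    if all(img[r][c] == first
           for r in range(r0, r0 + size)
           for c in range(c0, c0 + size)):
        return first
    half = size // 2
    return (
        build(img, r0, c0, half),                # NW quadrant
        build(img, r0, c0 + half, half),         # NE quadrant
        build(img, r0 + half, c0, half),         # SW quadrant
        build(img, r0 + half, c0 + half, half),  # SE quadrant
    )


def count_leaves(node):
    """Number of uniform blocks stored in the tree."""
    if isinstance(node, tuple):
        return sum(count_leaves(child) for child in node)
    return 1


def decode(tree, size):
    """Rebuild the full binary image from its quadtree."""
    img = [[0] * size for _ in range(size)]

    def fill(node, r0, c0, s):
        if isinstance(node, tuple):
            h = s // 2
            fill(node[0], r0, c0, h)
            fill(node[1], r0, c0 + h, h)
            fill(node[2], r0 + h, c0, h)
            fill(node[3], r0 + h, c0 + h, h)
        else:
            for r in range(r0, r0 + s):
                for c in range(c0, c0 + s):
                    img[r][c] = node

    fill(tree, 0, 0, size)
    return img


if __name__ == "__main__":
    image = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 1],
             [0, 0, 1, 1]]
    tree = build(image)
    print(tree, count_leaves(tree))
```

For this 4 × 4 example, 16 pixels are stored as 7 leaves; the saving grows with the size of uniform regions, which is what makes the representation compact for typical binary images.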
A software system for image processing, HIPS, was developed for use in a UNIX environment. It includes a small set of subroutines which primarily deals with a standardized descriptive image sequence header, and an ever-growing library of image transformation tools in the form of UNIX “filters.” Programs have been developed for simple image transfor...
HIPS (Human Information Processing Laboratory’s Image processing System) is a software system for image processing that runs under the UNIX operating system. HIPS is modular and flexible: it provides automatic documentation of its actions, and is relatively independent of special equipment. It has proved its usefulness in the study of the perceptio...
In this paper we briefly survey theories and ideas about image processing, with some illustrative examples taken mostly from the Human Information Processing Laboratory at N.Y.U. First we develop the concepts of multiple stable states and path dependence in a basic visual-motor task (vergence of the eyes) and show how these can be encompassed in p...
A peripheral visual cue in an empty field (1) often summons head or eyes, or both, (2) improves efficiency at the cued position while attention is directed to it, even without overt movements, and (3) reduces processing efficiency at the cued position once attention is withdrawn. We have studied the time course and the effects of mid-brain and cort...
During voluntary movements hand and eye are coordinated under the control of central attentional mechanisms. Several techniques are reviewed that serve to dissociate the usually coordinate activity of attention, eye and hand movements. Discrepancies between central expectancies and visual input produce errors that differentially affect hand and eye...
To counter the prevailing unsystematic approach to the use of polygraph data, a generalized decision theory approach applicable to a variety of polygraph uses is discussed. Examples of applications of decision theoretic tools to the polygraph interrogation problem are then presented, and typical misuses of the polygraph as a basis for decisions are...