The purpose of the present study was to examine the agreement of diagnostic classifications from two parallel subtests assessing a procedural skill in mathematics using three levels of scoring: (a) observed item scores (correct/incorrect), (b) underlying rules of operation, and (c) underlying task attributes. A bug analysis and a rule space analysis were employed to assess levels (b) and (c),
respectively. The results indicated that even when the parallel form reliability coefficient of a given test is relatively high, less agreement is evidenced when performance is evaluated at the micro level. This suggests that incorrect responses to equivalent items may result from application of different underlying mal-rules ("bugs"), which in turn may result from nonmastery of the same task attribute(s). The results are discussed in light of their implications for diagnostic assessment and remediation.