Dichotomous or polytomous model? Equating of testlet-based tests in light of conditional item pair correlations
This study compared the performance of dichotomous and polytomous IRT models in equating testlet-based tests. To clarify the conditions under which dichotomous and polytomous item response models produce differing results, the DIMTEST program was used to test essential unidimensionality, and a bias-corrected index (Final Condcorr) was adapted to measure local item dependence (LID). True score and observed score equating using either the three-parameter logistic (3PL) or the generalized partial credit (GPC) model was conducted for three subtests of the Iowa Tests of Educational Development (ITED) and seven simulated tests. Two factors were manipulated in generating the simulated tests: (a) the number of items nested under each testlet (5 or 8), and (b) the level of LID within testlets (low, medium, or high). A data set with no LID was also simulated and served as a baseline for the others. The results from traditional equipercentile equating and first- and second-order equity were used as the criteria for evaluating the 3PL and GPC equating results. In general, GPC equating tended to outperform 3PL equating for tests with slight or moderate multidimensionality, whereas for highly multidimensional tests the 3PL equating results were closer to those of equipercentile equating. However, the signs and relative magnitudes of the within- and between-testlet LID should be considered. When the testlet structure failed to coincide with the way in which items actually clustered in the multidimensional test space (reflected by abnormal LID patterns), GPC equating showed larger differences from the equipercentile equating results and larger deviations from first- and second-order equity, even for tests with moderate multidimensionality.
Meanwhile, for tests with similar degrees of multidimensionality, the GPC model produced better equating results for tests with relatively higher absolute within-testlet LID and lower absolute between-testlet LID than for those with relatively lower absolute within-testlet LID and higher absolute between-testlet LID. Moreover, differences in the test form frequency distributions seemed to negatively affect the performance of GPC equating more than that of 3PL equating.