Some (old) random notes found by chance on my iPhone.
van der Maas, H. L. J. & Wagenmakers, E.J. (2005). The Amsterdam Chess Test: a psychometric analysis of chess expertise. American Journal of Psychology, 118: 29-60. With accompanying website, Testing chess ability.
In reference to the multi-trait multi-method framework: Nussbeck, F.W., Eid, M., Geiser, C., Courvoisier, D.S., and Lischetzke, T. (2009). A CTC(M-1) Model for Different Types of Raters. Methodology, 5(3): 88-98. Another link: [Unfolding the constituents of psychological scores: Development and application of mixture and multitrait-multimethod LST models](http://www.affective-sciences.org/node/1058) (PhD thesis, Courvoisier, D. S.).
Labeled eggs vs. latent variable (R. Steyer): “Latent variable might be thought of as tools constructed to answer specific questions rather than true constructs under a generative model.”
Wolfe, E.W. (2009). Item and rater analysis of constructed response items via the multi-faceted Rasch model. Journal of Applied Measurement, 10(3): 335-47. Here is another paper on the MFRM, and a short illustration on rasch.org.
Generalizability theory approach vs. MFRM: Kim, S.C. and Wilson, M. (2009). A comparative analysis of the ratings in performance assessment using generalizability theory and the many-facet Rasch model. Journal of Applied Measurement, 10(4): 408-423. There’s also a PhD thesis on Detecting Rater Centrality Effect Using Simulation Methods and Rasch Measurement Analysis. For an application in medical assessment, see Developing a Measure of Therapist Adherence to Contingency Management: An Application of the Many-Facet Rasch Model, by Chapman et al. (J Child Adolesc Subst Abuse, 2008, 17(3): 47). I’ve not seen many applications of Generalizability Theory in medical studies. However, it should be noted that Streiner and Norman devoted a complete chapter to its interpretation in their textbook, Health Measurement Scales (2008, 4th ed.).
Computer adaptive testing (CAT) is often encountered in educational psychometrics, more rarely in biomedical settings. In the context of clinical decision making (not measurement!) there are some targeted instruments whose properties were screened through a CAT design. For example, Walter and colleagues described the development of a CAT system for assessing anxiety: Walter OB, Becker J, Bjorner JB, et al. Development and evaluation of a computer adaptive test for ‘Anxiety’ (Anxiety-CAT). Qual Life Res. 2007;16 Suppl 1:143–155. Rebollo and colleagues recently discussed CAT for HRQL studies. For other discussions, see
The use of IRT modeling has also been discussed in the context of gene-environment interaction, see Waller, N.G. and Reise, S.P. (1989). Genetic and environmental influences on item response pattern scalability, Behavior Genetics, 22(2): 135-152.
Mastery testing refers to assigning ordered grades to examinees, see the Connecticut Mastery Test (CMT). “In a sequential mastery test, the decision is to classify a subject as a master, a nonmaster, or to continue sampling and administering another random item.” (Vos, Applying the Minimax Principle to Sequential Mastery Testing).
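To make the three-way decision concrete, here is a minimal sketch of a sequential mastery rule. Note that it uses Wald's sequential probability ratio test, not the minimax rule studied by Vos; the function name `sprt_mastery` and the values of `p0`, `p1`, `alpha` and `beta` are hypothetical choices made for illustration only.

```python
import math

def sprt_mastery(responses, p0=0.6, p1=0.8, alpha=0.05, beta=0.05):
    """Classify an examinee from a stream of 0/1 item scores.

    p0 / p1: assumed probability of a correct answer for a nonmaster / master.
    alpha / beta: nominal error rates for false "master" / false "nonmaster".
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross it: declare mastery
    lower = math.log(beta / (1 - alpha))   # cross it: declare nonmastery
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, x in enumerate(responses, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    # No boundary crossed: another item should be administered.
    return "continue", n

if __name__ == "__main__":
    print(sprt_mastery([1, 0, 1]))
    print(sprt_mastery([1] * 20))
    print(sprt_mastery([0] * 20))
```

With these illustrative settings, a wrong answer moves the log-likelihood ratio more than a right one, so a run of errors leads to a nonmastery decision faster than a run of correct answers leads to a mastery decision.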
Niels Smits discussed the “measurement-prediction” paradox in a nice paper: The Measurement Versus Prediction Paradox in the Application of Planned Missingness to Psychological and Educational Tests. The key point is that there is a trade-off between selecting items that have high inter-item correlations and selecting items that correlate highly with the criterion but weakly with each other (items with high inter-item correlations usually explain the same part of the criterion’s variance, and hence add little to predictive ability). Quoting the authors, p. 3:
So, it seems to be impossible to maximize both measurement precision and predictive validity at the same time. If measurement is maximized, reliability is high, inter-item correlations are high, and, as a consequence predictive validity tends to be lower. If predictive validity is maximized, inter-item correlations are low, and as a consequence, measurement precision tends to be lower. This paradox was considered in three classical handbooks for psychological testing (Cronbach & Gleser, 1965, pp. 136-137; Gulliksen, 1950, pp. 380-381; Lord & Novick, 1968, p. 332). In practice, researchers have to make a choice between either measuring precisely, or predicting accurately. Maximize either predictive validity at the expense of measurement precision, or maximize measurement precision, at the expense of predictive validity.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: University of Illinois Press; Gulliksen, H. (1950). Theory of mental tests. New York: Wiley; Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
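The trade-off can be checked with a little arithmetic. The sketch below is not taken from the paper: it assumes k standardized items that all share the same inter-item correlation r and the same correlation v with the criterion, and it compares the standardized alpha of the sum score with its correlation with the criterion. The function names and the numeric values are illustrative assumptions.

```python
import math

def standardized_alpha(k, r):
    """Standardized Cronbach's alpha for k items with common inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)

def sum_score_validity(k, r, v):
    """Correlation between the unit-weighted sum of k standardized items and a
    standardized criterion, when each item correlates v with the criterion
    and r with every other item."""
    # cov(sum, criterion) = k * v ; var(sum) = k + k * (k - 1) * r
    return k * v / math.sqrt(k + k * (k - 1) * r)

K, V = 10, 0.3  # hypothetical test length and item-criterion correlation

for label, r in [("homogeneous items (r = 0.6)", 0.6),
                 ("heterogeneous items (r = 0.1)", 0.1)]:
    alpha = standardized_alpha(K, r)
    validity = sum_score_validity(K, r, V)
    print(f"{label}: alpha = {alpha:.3f}, validity = {validity:.3f}")
```

Under these assumptions the homogeneous set gives an alpha of about 0.94 but a validity of about 0.38, whereas the heterogeneous set gives an alpha of about 0.53 and a validity of about 0.69, which is the pattern described in the quote.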
(See here for what abstracts used to look like in plain typewriter font.)
But see the LISREL MG-CFA procedure. More papers:
- Hernández, A. and González-Romá, V. (2003). [Evaluating the multiple-group mean and covariance structure analysis model for the detection of differential item functioning in polytomous ordered items](http://www.psicothema.com/psicothema.asp?id=1064). Psicothema, 15(2): 315-321.
- Wu, A.D., Li, Z., and Zumbo, B.D. (2007). [Decoding the Meaning of Factorial Invariance and Updating the Practice of Multi-group Confirmatory Factor Analysis: A Demonstration With TIMSS Data](http://pareonline.net/pdf/v12n3.pdf). PARE Online, 12(3).