Some (old) random notes found by chance on my iPhone.
van der Maas, H. L. J. & Wagenmakers, E.J. (2005). The Amsterdam Chess Test: a psychometric analysis of chess expertise. American Journal of Psychology, 118: 29-60. With accompanying website, Testing chess ability.
In reference to the multi-trait multi-method framework: Nussbeck, F.W., Eid, M., Geiser, C., Courvoisier, D.S., and Lischetzke, T. (2009). A CTC(M-1) Model for Different Types of Raters. Methodology, 5(3): 88-98. See also: [Unfolding the constituents of psychological scores: Development and application of mixture and multitrait-multimethod LST models](http://www.affective-sciences.org/node/1058) (PhD thesis, Courvoisier, D. S.).
Labeled eggs vs. latent variable (R. Steyer): “Latent variable might be thought of as tools constructed to answer specific questions rather than true constructs under a generative model.”
Wolfe, E.W. (2009). Item and rater analysis of constructed response items via the multi-faceted Rasch model. Journal of Applied Measurement, 10(3): 335-47. Here is another paper on the MFRM, and a short illustration on rasch.org.
Generalizability theory approach vs. MFRM: Kim, S.C. and Wilson, M. (2009). A comparative analysis of the ratings in performance assessment using generalizability theory and the many-facet Rasch model. Journal of Applied Measurement, 10(4): 408-423. There's also a PhD thesis on Detecting Rater Centrality Effect Using Simulation Methods and Rasch Measurement Analysis. For an application in medical assessment, see Developing a Measure of Therapist Adherence to Contingency Management: An Application of the Many-Facet Rasch Model, by Chapman et al. (J Child Adolesc Subst Abuse, 2008, 17(3): 47). I've not seen many applications of Generalizability Theory in medical studies. However, it should be noted that Streiner and Norman devoted a complete chapter to its interpretation in their textbook, Health Measurement Scales (2008, 4th ed.).
Computer adaptive testing (CAT) is often encountered in educational psychometrics, more rarely in biomedical settings. In the context of clinical decision making (not measurement!) there are some targeted instruments whose properties were screened through a CAT design. For example, Walter et al. described the development of a CAT system for assessing depression and anxiety: Walter OB, Becker J, Bjorner JB, et al. Development and evaluation of a computer adaptive test for ‘Anxiety’ (Anxiety-CAT). Qual Life Res. 2007;16 Suppl 1:143–155. Rebollo et al. recently discussed CAT for HRQL studies. For other discussions, see
The use of IRT modeling has also been discussed in the context of gene-environment interaction, see Waller, N.G. and Reise, S.P. (1989). Genetic and environmental influences on item response pattern scalability, Behavior Genetics, 22(2): 135-152.
Mastery testing refers to delivering ordered grades to examinees, see the Connecticut Mastery Test (CMT). “In a sequential mastery test, the decision is to classify a subject as a master, a nonmaster, or to continue sampling and administering another random item.” (Vos, Applying the Minimax Principle to Sequential Mastery Testing).
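To make the three-way decision (master, nonmaster, continue) concrete, here is a minimal sketch based on Wald's sequential probability ratio test. Note that this is not the minimax approach developed by Vos, just a simpler and well-known sequential rule. The function name `sprt_mastery`, the success probabilities (0.6 for a nonmaster, 0.8 for a master) and the error rates are assumptions chosen for illustration only.

```python
import math

def sprt_mastery(responses, p_nonmaster=0.6, p_master=0.8, alpha=0.05, beta=0.05):
    """Classify an examinee from a sequence of 0/1 item responses.

    Wald's SPRT with two simple hypotheses on the probability of a correct
    answer (all default values are illustrative assumptions).
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross it: decide "master"
    lower = math.log(beta / (1 - alpha))   # cross it: decide "nonmaster"
    llr = 0.0                              # running log-likelihood ratio
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1 - p_master) / (1 - p_nonmaster))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    # No boundary crossed yet: another item should be administered.
    return "continue", len(responses)

print(sprt_mastery([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
print(sprt_mastery([0, 0, 1, 0, 0, 0]))
print(sprt_mastery([1, 0, 1]))
```

With these settings a wrong answer moves the log-likelihood ratio more than a correct one, so nonmastery decisions tend to be reached with fewer items than mastery decisions.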
Niels Smits discussed the “measurement-prediction” paradox in a nice paper: The Measurement Versus Prediction Paradox in the Application of Planned Missingness to Psychological and Educational Tests. The key point is that there is a trade-off between selecting items that have high inter-item correlations and selecting items that have high correlations with the criterion but low inter-item correlations (because items with high inter-item correlations usually explain the same part of the criterion’s variance, hence add little to predictive ability). Quoting the authors, p. 3:
So, it seems to be impossible to maximize both measurement precision and predictive validity at the same time. If measurement is maximized, reliability is high, inter-item correlations are high, and, as a consequence predictive validity tends to be lower. If predictive validity is maximized, inter-item correlations are low, and as a consequence, measurement precision tends to be lower. This paradox was considered in three classical handbooks for psychological testing (Cronbach & Gleser, 1965, pp. 136-137; Gulliksen, 1950, pp. 380-381; Lord & Novick, 1968, p. 332). In practice, researchers have to make a choice between either measuring precisely, or predicting accurately. Maximize either predictive validity at the expense of measurement precision, or maximize measurement precision, at the expense of predictive validity.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: University of Illinois Press; Gulliksen, H. (1950). Theory of mental tests. New York: Wiley; Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
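The trade-off described in the quoted passage can be illustrated with two textbook formulas for a unit-weighted sum of k standardized items. This is only a sketch under strong simplifying assumptions that are mine, not the authors': all inter-item correlations are equal (`r_ii`) and all item-criterion correlations are equal (`r_ic`). The function names are hypothetical.

```python
import math

def standardized_alpha(k, r_ii):
    """Standardized Cronbach's alpha for k items with common
    inter-item correlation r_ii."""
    return k * r_ii / (1 + (k - 1) * r_ii)

def composite_validity(k, r_ii, r_ic):
    """Correlation between the unit-weighted sum of k standardized items
    and the criterion, when every item correlates r_ic with the criterion."""
    # Covariance of the sum with the criterion is k * r_ic;
    # variance of the sum is k + k * (k - 1) * r_ii.
    return k * r_ic / math.sqrt(k + k * (k - 1) * r_ii)

k, r_ic = 10, 0.30  # illustrative values
for r_ii in (0.1, 0.3, 0.5):
    alpha = standardized_alpha(k, r_ii)
    validity = composite_validity(k, r_ii, r_ic)
    print(f"r_ii={r_ii:.1f}  alpha={alpha:.3f}  validity={validity:.3f}")
```

Holding the item-criterion correlation fixed, raising the inter-item correlation increases alpha while the validity of the sum score decreases, which is the pattern the quote describes.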
I have often heard about Reckase's criterion for assessing the unidimensionality of a scale (the first PCA component should account for 20% or more of the variance). Here is the complete reference: Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4: 207-230. (See here for what abstracts looked like in plain typewriter font.)
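As a rough illustration of how the 20% rule can be checked, here is a small sketch that computes the share of variance carried by the first principal component of the item correlation matrix. It uses only the standard library and a plain power iteration, so it is meant for small examples; the simulated data (one common factor with loading 0.7, continuous items) and all function names are assumptions of mine.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(rows):
    """Item correlation matrix from a list of respondents (one row each)."""
    cols = list(zip(*rows))
    k = len(cols)
    return [[1.0 if i == j else pearson(cols[i], cols[j]) for j in range(k)]
            for i in range(k)]

def largest_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue by power iteration (assumes a positive
    semi-definite matrix whose leading eigenvector is not orthogonal
    to the all-ones starting vector)."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        value = math.sqrt(sum(x * x for x in w))
        v = [x / value for x in w]
    return value

def first_component_share(rows):
    """Share of total variance on the first principal component.
    The trace of a correlation matrix equals the number of items."""
    corr = correlation_matrix(rows)
    return largest_eigenvalue(corr) / len(corr)

# Simulated example: 500 respondents, 8 items driven by one common factor.
random.seed(1)
rows = []
for _ in range(500):
    theta = random.gauss(0, 1)
    rows.append([0.7 * theta + random.gauss(0, 1) for _ in range(8)])

share = first_component_share(rows)
print(f"first component share: {share:.1%}, meets the 20% rule: {share >= 0.20}")
```

For dichotomous items one would usually want tetrachoric rather than Pearson correlations, which this sketch does not attempt.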
Stark et al. (2006) argued that assuming continuity is clearly inadequate for testing differential item functioning (DIF) with dichotomous items: Stark, S., Chernyshenko, O. S., and Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6): 1292-1306.
But see the LISREL MG-CFA procedure. More papers: