Here are four papers dealing with the reporting of subgroup analyses and the use of baseline data (and their pitfalls). There may be plenty of other papers on this topic available through Google, but these focus on RCTs and biomedical research. Below is a brief recap of the critical points raised in these papers.
Wang et al.^{(1)} provide general guidelines for reporting subgroup analysis. Based on their insert p. 2193, here are those recommendations for researchers intending to publish results from subgroup analysis (strict copy/paste):
[Abstract] Present subgroup results in the Abstract only if the subgroup analyses were based on a primary study outcome, if they were pre-specified, and if they were interpreted in light of the totality of pre-specified subgroup analyses undertaken.
[Methods]
Indicate the number of pre-specified subgroup analyses that were performed and the number of pre-specified subgroup analyses that are reported. Distinguish a specific subgroup analysis of special interest, such as that in the article by Sacks et al., from the multiple subgroup analyses typically done to assess the consistency of a treatment effect among various patient characteristics, such as those in the article by Jackson et al. For each reported analysis, indicate the end point that was assessed and the statistical method that was used to assess the heterogeneity of treatment differences.
Indicate the number of post hoc subgroup analyses that were performed and the number of post hoc subgroup analyses that are reported. For each reported analysis, indicate the end point that was assessed and the statistical method used to assess the heterogeneity of treatment differences. Detailed descriptions may require a supplementary appendix.
Indicate the potential effect on type I errors (false positives) due to multiple subgroup analyses and how this effect is addressed. If formal adjustments for multiplicity were used, describe them; if no formal adjustment was made, indicate the magnitude of the problem informally, as done by Jackson et al.
[Results] When possible, base analyses of the heterogeneity of treatment effects on tests for interaction, and present them along with effect estimates (including confidence intervals) within each level of each baseline covariate analyzed. A forest plot is an effective method for presenting this information.
[Discussion] Avoid over-interpretation of subgroup differences. Be properly cautious in appraising their credibility, acknowledge the limitations, and provide supporting or contradictory data from other studies, if any.
The papers that are referenced above are:
Sacks FM, Pfeffer MA, Moye LA, et al. The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. N Engl J Med 1996;335:1001-9.
Jackson RD, LaCroix AZ, Gass M, et al. Calcium plus vitamin D supplementation and the risk of fractures. N Engl J Med 2006; 354:669-83. [Erratum, N Engl J Med 2006;354:1102.]
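To make the [Results] recommendation about interaction tests more concrete, here is a minimal sketch of such a test for two subgroups. It assumes that each subgroup's treatment effect is summarized by an estimate (say, a log hazard ratio) with its standard error, and that the two subgroups are independent. The function name and the numbers are made up for illustration; they are not taken from Wang et al. or from the trials cited above.

```python
import math
from statistics import NormalDist


def interaction_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two independent subgroup effects."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # variances add for independent estimates
    z = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se_diff, z, p


# Hypothetical log hazard ratios (treatment vs. control) in two subgroups.
diff, se_diff, z, p = interaction_test(-0.40, 0.15, -0.10, 0.18)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```

With these invented numbers, the effect looks nominally significant in the first subgroup (z about -2.67) but not in the second (z about -0.56), yet the interaction test itself yields a p-value of about 0.20. "Significant in one subgroup, not in the other" is therefore not, by itself, evidence that the treatment effect differs between subgroups.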
Lagakos's article^{(2)} is a short note providing good reminders about the problem of data dredging (carrying out further analyses after having seen the data) and of multiple unprotected comparisons in subgroup analyses, especially when testing interaction effects. The latter problem might be circumvented using a simple (but overly conservative) Bonferroni correction. His conclusions are (p. 1668):
Ignorance of the total number of subgroup analyses, which ones were prespecified and which were post hoc, and whether any were suggested by the data makes it very difficult to interpret the reported results. When an interaction test for a baseline variable fails to reach the appropriate threshold for significance, conclusions about a differential treatment benefit related to this variable should be avoided or presented with caution.
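The multiplicity problem can be illustrated with a few lines of arithmetic. The sketch below assumes that the k subgroup tests are independent and that there is no true effect in any of them, which is a simplification (subgroup tests within a trial are usually correlated); the function names are mine.

```python
def familywise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / k


for k in (1, 5, 10, 20):
    fwer = familywise_error_rate(k)
    print(f"k = {k:2d}: FWER = {fwer:.3f}, Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

Under these assumptions, ten unprotected tests at the 5% level already give roughly a 40% chance of at least one spurious "significant" subgroup finding, which is why the number of analyses performed needs to be reported.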
Assmann's paper^{(3)} covers much the same ground as the next paper, so I'll skip it. Readers looking for a shorter version of the next paper might, however, be interested in reading it.
Pocock et al.^{(4)} focused on the use of baseline data obtained at the time of random allocation to treatment group, and carried out a kind of meta-analytic review of 50 trial reports published in four major journals in 1997: British Medical Journal (5 RCTs), Journal of the American Medical Association (6), Lancet (15), and New England Journal of Medicine (24). Overall, 70% reported results from subgroup analyses, with up to four baseline factors in 56% of the cases.
According to these authors, baseline data might be used for subgroup analyses, covariate adjustment, or baseline comparisons. For subgroup analyses, statistical tests for interaction are probably the most useful approach, as they allow one to “directly examine the strength of evidence for the treatment difference varying between subgroups” (p. 2919): they “recognize the limited extent of data available for subgroup analysis, and are the most effective statistical tool in inhibiting false or premature claims of subgroup findings” (p. 2920). They are nevertheless not used as much as they should be, because authors are reluctant to rely on tests that lack statistical power.
Let us concentrate on the last two issues.
Covariate adjustment aims (1) to achieve the most appropriate p-value for the treatment difference, (2) to achieve an unbiased estimate and confidence interval for the magnitude of treatment difference in outcome, or (3) to improve the precision of the estimated treatment difference, thus increasing the statistical power of the trial. What really matters is adjusting for the appropriate covariates (p. 2927).
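Aim (3) is easy to check by simulation. The sketch below is not from Pocock et al.: it assumes a two-arm trial with a single baseline covariate linearly related to the outcome, and compares the unadjusted difference in means with the estimate from a common-slope ANCOVA over many simulated trials. All parameter values (sample size, effect, slope) are arbitrary choices of mine.

```python
import random
from statistics import mean, stdev


def simulate_trial(rng, n_per_arm=50, effect=0.5, slope=0.8):
    """Return (unadjusted, adjusted) treatment effect estimates for one simulated trial."""
    arms = {}
    for arm in (0, 1):
        x = [rng.gauss(0, 1) for _ in range(n_per_arm)]  # baseline covariate
        y = [effect * arm + slope * xi + rng.gauss(0, 1) for xi in x]  # outcome
        arms[arm] = (x, y)
    (x0, y0), (x1, y1) = arms[0], arms[1]
    unadjusted = mean(y1) - mean(y0)
    # Pooled within-arm slope of outcome on baseline (common-slope ANCOVA).
    sxy = sxx = 0.0
    for x, y in (arms[0], arms[1]):
        mx, my = mean(x), mean(y)
        sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx += sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    # Correct the raw difference for the chance baseline imbalance between arms.
    adjusted = unadjusted - b * (mean(x1) - mean(x0))
    return unadjusted, adjusted


rng = random.Random(2024)
results = [simulate_trial(rng) for _ in range(2000)]
unadj = [r[0] for r in results]
adj = [r[1] for r in results]
print(f"unadjusted: mean = {mean(unadj):.3f}, sd = {stdev(unadj):.3f}")
print(f"adjusted:   mean = {mean(adj):.3f}, sd = {stdev(adj):.3f}")
```

In this setup both estimators are centered on the true effect, since randomization guarantees that, but the adjusted estimates vary less from trial to trial. The gain depends on how strongly the covariate predicts the outcome, which is one way to read the remark about choosing appropriate covariates.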
On the other hand, baseline comparability aims (1) to describe the baseline characteristics of the sample of patients included, (2) to demonstrate that the randomization has worked well by achieving well balanced treatment groups at baseline, (3) to add credibility to the trial results, specifically encouraging confidence in unadjusted outcome analyses as being without any serious bias, or (4) to identify any unlucky imbalances between treatment groups that may have arisen by chance.
The paper also considers some case studies to illustrate the above points, which makes it more worthwhile reading than a standard textbook. The authors' final conclusions are (p. 2928):
Subgroup analyses are often given too great a prominence and fail to use appropriate methods of statistical inference such as interaction tests. There also appears a lack of consistency regarding the use of covariate-adjusted analyses, perhaps largely because their rationale and statistical properties are poorly understood. Though less serious, there are also improvements to be made in the reporting of baseline comparisons.
On a related note, the use of pre-post data is discussed in Statistical Issues in Drug Development, by Senn, and I gave some further links in response to a question on stats.stackexchange.com: Best practice when analysing pre-post treatment-control designs. I recently came across a nice discussion on the MedStats mailing list (credit to Martin Holt) referencing Streiner and Norman's excellent textbook on Health Measurement Scales (Oxford University Press 2008, 4th ed.). The original question was about assessing the effect of an intervention in hypogonadal male patients, as measured by a change observed on a histology index score (0 to 4). The use of residualized gain scores was not new to me, but as always it seems I totally forgot about that approach when replying on stats.stackexchange.com. The idea, which originated from Cronbach and Furby^{(5)} and is an extension of Lord's^{(6)} and McNemar's^{(7)} proposals, consists in
The take-home message from Cronbach and Furby's paper is essentially that the use of change scores should be driven by a precise research question (this was originally framed in terms of problems like learning or growth, which are in essence multidimensional aspects of individual development). Gain or difference scores were primarily used (pp. 77-78):
Cronbach and Furby sum up the above as follows (p. 78):
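As a rough illustration, here is a minimal sketch of residualized gain scores as they are commonly defined, namely the residuals of the posttest after a linear regression on the pretest. This is my own toy example, not the analysis discussed on MedStats: the data are invented and the function name is hypothetical.

```python
from statistics import mean


def residualized_gains(pre, post):
    """Residuals of post regressed on pre by ordinary least squares."""
    mx, my = mean(pre), mean(post)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    sxx = sum((x - mx) ** 2 for x in pre)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


# Invented pre/post scores on a 0 to 4 scale for six patients.
pre = [1, 2, 2, 3, 3, 4]
post = [2, 2, 3, 3, 4, 4]
raw_gains = [y - x for x, y in zip(pre, post)]
res_gains = residualized_gains(pre, post)
print("raw gains:         ", raw_gains)
print("residualized gains:", [round(r, 2) for r in res_gains])
```

By construction, residualized gains sum to zero and are uncorrelated with the pretest in the sample, whereas raw gains typically remain correlated with the starting level.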
There appears to be no need to use measures of change as dependent variables and no virtue in using them. If one is testing the null hypothesis that two treatments have the same effect, the essential question is whether posttest Y∞ scores^{1} vary from group to group. Assuming that errors of measurement of Y are random, Y is an entirely suitable dependent variable.
As a last, unrelated note, Kim-Kang and Weiss^{(8)} provide a nice overview of the measurement of individual change in light of IRT models. Another discussion can be found in the rasch.org Transactions: Raw Score Nonlinearity Obscures Growth.
^{1} Y∞ scores represent the person's “true” score (and its estimate might be obtained from the aforementioned regression model).