R version 2.7.1 (2008-06-23) Copyright (C) 2008 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R est un logiciel libre livré sans AUCUNE GARANTIE. Vous pouvez le redistribuer sous certaines conditions. Tapez 'license()' ou 'licence()' pour plus de détails. R est un projet collaboratif avec de nombreux contributeurs. Tapez 'contributors()' pour plus d'information et 'citation()' pour la façon de le citer dans les publications. Tapez 'demo()' pour des démonstrations, 'help()' pour l'aide en ligne ou 'help.start()' pour obtenir l'aide au format HTML. Tapez 'q()' pour quitter R. > # Regression Methods in Biostatistics, Vittinghoff, Glidden, Shiboski > # and McCulloch (Springer, 2005) > # > # Chapter 6. Logistic Regression > # > # Time-stamp: <2008-08-26 20:46:21 chl> > > postscript(file="RMB_c6.ps") > > ############################################################################## > # WCGS Sudy: Relationship between CHD and Age > > # loading the data > WCGS <- read.table("wcgs.txt",header=TRUE,sep="\t") > WCGS$chd69 <- factor(WCGS$chd69,labels=c("Noncases","Cases")) > > # plot the data > stripchart(age~as.numeric(chd69),data=WCGS,method="jitter",pch="+", + group.names=levels(WCGS$chd69),las=1,cex.axis=.8,cex=.8) > title("WCGS Study: Coronary disease and Age") > > > # set up the logistic model with default logit link > WCGS.glm <- glm(chd69~age,data=WCGS,family=binomial()) > > summary(WCGS.glm) Call: glm(formula = chd69 ~ age, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.6208 -0.4545 -0.3668 -0.3292 2.4835 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.93952 0.54932 -10.813 < 2e-16 *** age 0.07442 0.01130 6.585 4.56e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1781.2 on 3153 degrees of freedom Residual deviance: 1738.4 on 3152 degrees of freedom AIC: 1742.4 Number of Fisher Scoring iterations: 5 > > logLik(WCGS.glm) # the log-likelihood 'log Lik.' -869.178 (df=2) > exp(coef(WCGS.glm)[2]*10) # OR associated with a ten-year increase in age age 2.104811 > library(MASS) > confint(WCGS.glm) # 95% CI using profiled likelihood 2.5 % 97.5 % (Intercept) -7.02449436 -4.86960978 age 0.05227349 0.09661199 > confint.default(WCGS.glm) # 95% CI based assuming asymptotic normality 2.5 % 97.5 % (Intercept) -7.01616019 -4.86287169 age 0.05227038 0.09657473 > > WCGS.glm.fit <- fitted(WCGS.glm) # fitted values > > # estimate the probability of having CHD for age=55 > exp(-5.940+0.074*55)/(1+exp(-5.940+0.074*55)) [1] 0.1335417 > # or, alternatively > pred.df <- data.frame(age=c(55)) > predict(WCGS.glm,pred.df,type="response",se=TRUE) $fit 1 0.1363108 $se.fit 1 0.01183825 $residual.scale [1] 1 > > # we could also do the computation the "hard way", but see > # J Faraway, Extending the Linear Model with R. Generalized Linear, > # Mixed Effects and Nonparametric Regression Models, Chapman & Hall/CRC, > # 2006 > # > # to predict the reponse at age 55 > x0 <- c(1,55) > eta0 <- sum(x0*coef(WCGS.glm)) > library(boot) > inv.logit(eta0) [1] 0.1363108 > > # to get the 95% CI, we first need to extract the variance-covariance matrix > (vcm <- summary(WCGS.glm)$cov.unscaled) (Intercept) age (Intercept) 0.301750690 -0.0061641966 age -0.006164197 0.0001277428 > se <- sqrt(t(x0) %*% vcm %*% x0) # SE on the logit scale > inv.logit(eta0+c(-1,1)*qnorm(0.975)*se) # CI on the probability scale [1] 0.1147253 0.1612181 > > # R does not provide LR test in the summary output, but it could easily be > # obtained from the Deviance Table, e.g. > anova(WCGS.glm,test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: chd69 Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 3153 1781.24 age 1 42.89 3152 1738.36 5.798e-11 > > # See also lrm() in the Design package > library(Design) Design library by Frank E Harrell Jr Type library(help='Design'), ?Overview, or ?Design.Overview') to see overall documentation. > lrm(chd69~age,data=WCGS) Logistic Regression Model lrm(formula = chd69 ~ age, data = WCGS) Frequencies of Responses Noncases Cases 2897 257 Obs Max Deriv Model L.R. d.f. P C Dxy 3154 9e-08 42.89 1 0 0.62 0.24 Gamma Tau-a R2 Brier 0.252 0.036 0.031 0.074 Coef S.E. Wald Z P Intercept -5.93952 0.54932 -10.81 0 age 0.07442 0.01130 6.58 0 > > # for model adequacy checking, see residuals.lrm > # e.g. > resid(lrm(chd69~age,data=WCGS,x=T,y=T), "partial", pl="loess") > > # > # Note: > # The definitive reference for the Design package is > # F. Harrell, Regression Modeling Strategies, Springer, 2001. > # See Chapter 10 to 12 on binary logistic regression. > # > > # assessing linearity in the relationship between CHD Risk and Age (p. 191) > tab.chd69.age <- with(WCGS, table(chd69,age)) > obs.logit <- logit(tab.chd69.age[2,]/apply(tab.chd69.age,2,sum)) > > plot(obs.logit~as.numeric(names(obs.logit)),pch=19,cex=.8,xlab="Age (years)", + ylab="Logit CHD Risk",ylim=c(-4,-1)) > abline(lm(obs.logit~as.numeric(names(obs.logit))),lty=2) > lines(as.numeric(names(obs.logit)), + predict(loess(obs.logit~as.numeric(names(obs.logit)))),lty=3) > legend("bottom",c("Linear fit","LOWESS smooth"),lty=2:3,horiz=TRUE) > > # this suggests to test a model including a quadratic effect for age, > # after centering it to reduce colinearity with the linear term > summary(glm(chd69~age+I(scale(age,scale=F)^2),data=WCGS,family=binomial)) Call: glm(formula = chd69 ~ age + I(scale(age, scale = F)^2), family = binomial, data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.6093 -0.4582 -0.3681 -0.3270 2.4960 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.0430098 0.6787098 -8.904 < 2e-16 *** age 0.0769963 0.0150010 5.133 2.86e-07 *** I(scale(age, scale = F)^2) -0.0005543 0.0021065 -0.263 0.792 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1781.2 on 3153 degrees of freedom Residual deviance: 1738.3 on 3151 degrees of freedom AIC: 1744.3 Number of Fisher Scoring iterations: 5 > > > ############################################################################## > # WCGS Study: CHD and Arcus Senilis (categorical predictor) > > WCGS$arcus <- factor(WCGS$arcus,labels=c("Unexposed","Exposed")) > > # we graphically display the Morbidity/Exposition table > mosaicplot(t(table(WCGS$chd69,WCGS$arcus)),col=c(4,2),main="WCGS Study") > > WCGS.glm2 <- glm(chd69~arcus,data=WCGS,family=binomial()) > > summary(WCGS.glm2) Call: glm(formula = chd69 ~ arcus, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.4790 -0.4790 -0.3787 -0.3787 2.3112 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.5991 0.0838 -31.016 < 2e-16 *** arcusExposed 0.4918 0.1342 3.664 0.000248 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1771.2 on 3151 degrees of freedom Residual deviance: 1758.2 on 3150 degrees of freedom (2 observations deleted due to missingness) AIC: 1762.2 Number of Fisher Scoring iterations: 5 > > # estimate OR and associated 95% CI > library(epitools) > oddsratio(table(WCGS$chd69,WCGS$arcus),method="wald") # Wald CI $data Unexposed Exposed Total Noncases 2058 839 2897 Cases 153 102 255 Total 2211 941 3152 $measure odds ratio with 95% C.I. estimate lower upper Noncases 1.00000 NA NA Cases 1.63528 1.257000 2.127399 $p.value two-sided midp.exact fisher.exact chi.square Noncases NA NA NA Cases 0.0003159313 0.0003407814 0.0002216321 $correction [1] FALSE attr(,"method") [1] "Unconditional MLE & normal approximation (Wald) CI" > > # Note that we could also use the epitab() function > > # estimate a model where predictor has multiple categories > table(WCGS$chd69,WCGS$agec) 0 1 2 3 4 Noncases 512 1036 680 463 206 Cases 31 55 70 65 36 > > WCGS.glm3 <- glm(chd69~as.factor(agec),data=WCGS,family=binomial()) > summary(WCGS.glm3) Call: glm(formula = chd69 ~ as.factor(agec), family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.5676 -0.4427 -0.3429 -0.3216 2.4444 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.8043 0.1850 -15.162 < 2e-16 *** as.factor(agec)1 -0.1315 0.2310 -0.569 0.569298 as.factor(agec)2 0.5307 0.2235 2.374 0.017580 * as.factor(agec)3 0.8410 0.2275 3.697 0.000218 *** as.factor(agec)4 1.0600 0.2585 4.100 4.13e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1781.2 on 3153 degrees of freedom Residual deviance: 1736.3 on 3149 degrees of freedom AIC: 1746.3 Number of Fisher Scoring iterations: 5 > > # estimate OR, together with their SD and associated 95% CI > WCGS.glm3.OR <- as.numeric(exp(coef(WCGS.glm3))[2:5]) # omit the intercept > > res <- matrix(NA,nr=4,nc=3) > for (i in 2:5) { + # we take the first level as the baseline + test <- oddsratio(table(WCGS$chd69,WCGS$agec)[,c(1,i)],method="wald") + res[i-1,] <- test$measure[2,] + } > colnames(res) <- c("OR","LB-CI","UB-CI") > rownames(res) <- paste("agec",1:4,sep="_") > print(res,digits=4) OR LB-CI UB-CI agec_1 0.8768 0.5576 1.379 agec_2 1.7002 1.0970 2.635 agec_3 2.3187 1.4845 3.621 agec_4 2.8863 1.7389 4.791 > > # suppose, we want to change reference level for age > agec2 <- relevel(as.factor(WCGS$agec),ref="3") > table(agec2) agec2 3 0 1 2 4 528 543 1091 750 242 > oddsratio(table(WCGS$chd69,agec2)[,c(1,5)],method="wald") # OR agec_4/agec_3 $data agec2 3 4 Total Noncases 463 206 669 Cases 65 36 101 Total 528 242 770 $measure odds ratio with 95% C.I. estimate lower upper Noncases 1.000000 NA NA Cases 1.244810 0.8024771 1.930960 $p.value two-sided midp.exact fisher.exact chi.square Noncases NA NA NA Cases 0.3304466 0.3578102 0.3276098 $correction [1] FALSE attr(,"method") [1] "Unconditional MLE & normal approximation (Wald) CI" > > ############################################################################## > # Multipredictor Model > > WCGS.glm4 <- glm(chd69~age+chol+sbp+bmi+smoke,data=WCGS,family=binomial()) > summary(WCGS.glm4) Call: glm(formula = chd69 ~ age + chol + sbp + bmi + smoke, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.1512 -0.4408 -0.3276 -0.2398 2.8824 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -12.336300 0.975011 -12.652 < 2e-16 *** age 0.064356 0.011905 5.406 6.45e-08 *** chol 0.010839 0.001491 7.267 3.67e-13 *** sbp 0.019305 0.004091 4.719 2.37e-06 *** bmi 0.057670 0.026350 2.189 0.0286 * smoke 0.632693 0.140079 4.517 6.28e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1779.2 on 3141 degrees of freedom Residual deviance: 1614.6 on 3136 degrees of freedom (12 observations deleted due to missingness) AIC: 1626.6 Number of Fisher Scoring iterations: 6 > > # we plot the 95% CI for each coefficient > # rather than using the above method with Wald CI, we use confint > # (it works because each variable is either continuous or binary) > WCGS.glm4.coeff <- coef(WCGS.glm4) > WCGS.glm4.CI <- confint(WCGS.glm4) > > opar <- par(mar=c(5,6,4,2),las=1) > plot(exp(WCGS.glm4.coeff)[2:6],1:5,xlim=c(0.5,2.5),axes=FALSE,pch=19, + cex=1.2,ylab="",xlab="OR",main="WCGS Study") > axis(1) > axis(2,at=1:5,labels=names(WCGS.glm4.coeff)[2:6],tick=FALSE) > segments(exp(WCGS.glm4.CI[2:6,1]),1:5,exp(WCGS.glm4.CI[2:6,2]),1:5,lwd=2) > abline(v=1,lty=2,col="lightgray") > par(opar) > > # now we scale the predictors > # Note that age_c is a centered predictor scaled by a factor of 10 > plot(WCGS$age_10,scale(WCGS$age,scale=F)/10) > > WCGS.glm5 <- glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke,data=WCGS, + family=binomial()) > summary(WCGS.glm5) Call: glm(formula = chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.1512 -0.4408 -0.3276 -0.2398 2.8824 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.00711 0.11637 -25.841 < 2e-16 *** age_10 0.64356 0.11905 5.406 6.45e-08 *** chol_50 0.54194 0.07457 7.267 3.67e-13 *** sbp_50 0.96524 0.20455 4.719 2.37e-06 *** bmi_10 0.57671 0.26350 2.189 0.0286 * smoke 0.63269 0.14008 4.517 6.28e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1779.2 on 3141 degrees of freedom Residual deviance: 1614.6 on 3136 degrees of freedom (12 observations deleted due to missingness) AIC: 1626.6 Number of Fisher Scoring iterations: 6 > > # next, we add a 4-levels predictor > table(WCGS$behpat) 1 2 3 4 264 1325 1216 349 > > WCGS.glm6 <- update(WCGS.glm5,.~.+as.factor(behpat)) > summary(WCGS.glm6) Call: glm(formula = chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + as.factor(behpat), family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.2402 -0.4412 -0.3195 -0.2235 2.8717 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.75362 0.22591 -12.189 < 2e-16 *** age_10 0.60566 0.11989 5.052 4.37e-07 *** chol_50 0.53698 0.07529 7.132 9.89e-13 *** sbp_50 0.90199 0.20649 4.368 1.25e-05 *** bmi_10 0.55535 0.26560 2.091 0.03654 * smoke 0.60328 0.14102 4.278 1.89e-05 *** as.factor(behpat)2 0.06707 0.22124 0.303 0.76176 as.factor(behpat)3 -0.66499 0.24230 -2.744 0.00606 ** as.factor(behpat)4 -0.55809 0.31926 -1.748 0.08045 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1779.2 on 3141 degrees of freedom Residual deviance: 1589.8 on 3133 degrees of freedom (12 observations deleted due to missingness) AIC: 1607.8 Number of Fisher Scoring iterations: 6 > exp(confint.default(WCGS.glm6)) # approximate normal-based 95% CI 2.5 % 97.5 % (Intercept) 0.04090949 0.09917637 age_10 1.44872630 2.31783686 chol_50 1.47611502 1.98288223 sbp_50 1.64424708 3.69393198 bmi_10 1.03539362 2.93265565 smoke 1.38664453 2.41012070 as.factor(behpat)2 0.69312262 1.64986312 as.factor(behpat)3 0.31985313 0.82688255 as.factor(behpat)4 0.30610413 1.06998350 > > # the reduced model without behpat is compared to the above one > anova(WCGS.glm6,WCGS.glm5,test="Chisq") Analysis of Deviance Table Model 1: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + as.factor(behpat) Model 2: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3133 1589.80 2 3136 1614.61 -3 -24.81 1.689e-05 > > # or, by hand > 2*(logLik(WCGS.glm6)-logLik(WCGS.glm5)) [1] 24.81353 attr(,"df") [1] 9 attr(,"class") [1] "logLik" > > # or, using lrtest from the Design package > lrtest(WCGS.glm6,WCGS.glm5) Model 1: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + as.factor(behpat) Model 2: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke L.R. Chisq d.f. P 2.481353e+01 3.000000e+00 1.689053e-05 > > # The lmtest package also contains a function called lrtest > detach(package:Design) > library(lmtest) > lrtest(WCGS.glm6,WCGS.glm5) Likelihood ratio test Model 1: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + as.factor(behpat) Model 2: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke #Df LogLik Df Chisq Pr(>Chisq) 1 9 -794.90 2 6 -807.31 -3 24.814 1.689e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > # or, equivalently > # lrtest(WCGS.glm6,"behpat") > > # after replacing the 4-levels behpat by a binary version (dipbat), we get the > # following results > WCGS.glm7 <- glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke+dibpat,data=WCGS, + family=binomial()) > summary(WCGS.glm7) Call: glm(formula = chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + dibpat, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.2354 -0.4416 -0.3190 -0.2243 2.8626 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.39404 0.14827 -22.891 < 2e-16 *** age_10 0.60370 0.11967 5.045 4.54e-07 *** chol_50 0.53605 0.07525 7.124 1.05e-12 *** sbp_50 0.90383 0.20603 4.387 1.15e-05 *** bmi_10 0.55126 0.26529 2.078 0.0377 * smoke 0.60243 0.14100 4.272 1.93e-05 *** dibpat 0.69718 0.14437 4.829 1.37e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1779.2 on 3141 degrees of freedom Residual deviance: 1590.0 on 3135 degrees of freedom (12 observations deleted due to missingness) AIC: 1604.0 Number of Fisher Scoring iterations: 6 > exp(confint.default(WCGS.glm7)) 2.5 % 97.5 % (Intercept) 0.02510615 0.04489466 age_10 1.44650296 2.31233631 chol_50 1.47485750 1.98086338 sbp_50 1.64875017 3.69742075 bmi_10 1.03180589 2.91890756 smoke 1.38550386 2.40797939 dibpat 1.51319150 2.66482586 > > # again, wa compare the two nested models > anova(WCGS.glm7,WCGS.glm6,test="Chisq") Analysis of Deviance Table Model 1: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + dibpat Model 2: chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + as.factor(behpat) Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3135 1590.04 2 3133 1589.80 2 0.24 0.89 > > ############################################################################## > # WCGS Study: Confounding effect (Type A Behavior pattern) > > WCGS.glm8 <- glm(dibpat~age_10+chol_50+sbp_50+bmi_10+smoke,data=WCGS, + family=binomial()) > summary(WCGS.glm8) Call: glm(formula = dibpat ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.562 -1.162 0.878 1.159 1.445 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.09516 0.05011 -1.899 0.05754 . age_10 0.27954 0.06656 4.200 2.67e-05 *** chol_50 0.08562 0.04225 2.026 0.04274 * sbp_50 0.37994 0.12842 2.958 0.00309 ** bmi_10 0.11868 0.14880 0.798 0.42512 smoke 0.23707 0.07330 3.234 0.00122 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 4355.5 on 3141 degrees of freedom Residual deviance: 4301.1 on 3136 degrees of freedom (12 observations deleted due to missingness) AIC: 4313.1 Number of Fisher Scoring iterations: 4 > > # adjusted ORs, omitting the intercept > exp(coef(WCGS.glm8)[-1]) age_10 chol_50 sbp_50 bmi_10 smoke 1.322516 1.089388 1.462190 1.126008 1.267531 > > # unadjusted OR for dipbat/chd69 > oddsratio(table(WCGS$chd69,WCGS$dibpat)) $data 0 1 Total Noncases 1486 1411 2897 Cases 79 178 257 Total 1565 1589 3154 $measure odds ratio with 95% C.I. estimate lower upper Noncases 1.000000 NA NA Cases 2.369726 1.806749 3.134607 $p.value two-sided midp.exact fisher.exact chi.square Noncases NA NA NA Cases 1.780016e-10 1.845304e-10 2.676407e-10 $correction [1] FALSE attr(,"method") [1] "median-unbiased estimate & mid-p exact CI" > > # while in the 7th model, it was > exp(coef(WCGS.glm7)[["dibpat"]]) [1] 2.008082 > > ## # we drop one at a time each predictor in the 7th model and estimate the > ## # change in OR for dipbat/chd69 > ## mod.pred <- "age_10+chol_50+sbp_50+bmi_10+smoke+dibpat" > ## res <- numeric(lenth(pred.var)) > ## for (i in 1:5) { > ## reduced.model <- update(WCGS.glm7,.~.-unlist(strsplit(mod.pred,"+",fixed=T))[i]) > ## res[i] <- exp(coef(reduced.model)[["dibpat"]]) > ## } > > ############################################################################## > # WCGS Study: Interaction effect (age under/over 50 vs. arcus senilis) > > with(WCGS, table(bage_50,arcus)) arcus bage_50 Unexposed Exposed 0 1679 569 1 532 372 > xtabs(~as.numeric(chd69)+as.numeric(bage_50)+as.numeric(arcus),data=WCGS) , , as.numeric(arcus) = 1 as.numeric(bage_50) as.numeric(chd69) 0 1 1 1590 468 2 89 64 , , as.numeric(arcus) = 2 as.numeric(bage_50) as.numeric(chd69) 0 1 1 514 325 2 55 47 > summary(xtabs(~as.numeric(chd69)+as.numeric(bage_50)+as.numeric(arcus),data=WCGS)) Call: xtabs(formula = ~as.numeric(chd69) + as.numeric(bage_50) + as.numeric(arcus), data = WCGS) Number of cases in table: 3152 Number of factors: 3 Test for independence of all factors: Chisq = 121.11, df = 4, p-value = 3.088e-25 > > WCGS.glm9 <- glm(chd69~bage_50*arcus,data=WCGS,family=binomial()) > summary(WCGS.glm9) Call: glm(formula = chd69 ~ bage_50 * arcus, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.5197 -0.5063 -0.3300 -0.3300 2.4238 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.8829 0.1089 -26.467 < 2e-16 *** bage_50 0.8933 0.1721 5.190 2.11e-07 *** arcusExposed 0.6480 0.1789 3.623 0.000292 *** bage_50:arcusExposed -0.5921 0.2722 -2.175 0.029640 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1771.2 on 3151 degrees of freedom Residual deviance: 1730.9 on 3148 degrees of freedom (2 observations deleted due to missingness) AIC: 1738.9 Number of Fisher Scoring iterations: 5 > > exp(coef(WCGS.glm9)) # OR for coefficient bage_50 is the OR for age 50-59 vs. (Intercept) bage_50 arcusExposed 0.05597484 2.44309997 1.91164254 bage_50:arcusExposed 0.55318920 > # age 39-49, for people without arcus > > # estimate OR for Arcus-Age interaction > # (we use the contrast() function in the Design package) > library(Design) Design library by Frank E Harrell Jr Type library(help='Design'), ?Overview, or ?Design.Overview') to see overall documentation. > WCGS$bage_50 <- as.factor(WCGS$bage_50) > WCGS.glm9bis <- lrm(chd69~bage_50*arcus,data=WCGS) > cc <- contrast(WCGS.glm9bis,list(bage_50='1',arcus='Exposed'), + list(bage_50='0',arcus='Exposed')) > > # finally, we exponentiate the contrast and its 95% CI to work on the > # probability scale > exp(unlist(cc[2:5])) Contrast.1 SE Lower.1 Upper.1 1.351497 1.234795 0.893907 2.043325 > > # we may also compared the full model with a model without interaction > WCGS.glm10 <- update(WCGS.glm9,.~.-bage_50:arcus) > anova(WCGS.glm10,WCGS.glm9,test="Chi") Analysis of Deviance Table Model 1: chd69 ~ bage_50 + arcus Model 2: chd69 ~ bage_50 * arcus Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3149 1735.58 2 3148 1730.87 1 4.72 0.03 > > # if we now consider age as a continuous variable > WCGS.glm11 <- glm(chd69~age*arcus,data=WCGS,family=binomial()) > summary(WCGS.glm11) Call: glm(formula = chd69 ~ age * arcus, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.6350 -0.4579 -0.3832 -0.2950 2.5801 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.78809 0.71797 -9.455 < 2e-16 *** age 0.08965 0.01489 6.021 1.74e-09 *** arcusExposed 2.75418 1.14010 2.416 0.0157 * age:arcusExposed -0.04983 0.02334 -2.135 0.0328 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1771.2 on 3151 degrees of freedom Residual deviance: 1717.9 on 3148 degrees of freedom (2 observations deleted due to missingness) AIC: 1725.9 Number of Fisher Scoring iterations: 5 > > # estimate the log odds associated with having arcus for age=55 > pred.df2 <- expand.grid(age=seq(min(WCGS$age),max(WCGS$age)), + arcus=factor(levels(WCGS$arcus))) > WCGS.glm11.pred <- predict(WCGS.glm11,pred.df2,se=TRUE) > # as we expressed the predicted value on the link scale, we don't have to > # take the log of the predicted values before substracting them > OR.11 <- WCGS.glm11.pred$fit[pred.df2$age==55 & pred.df2$arcus=="Exposed"] - + WCGS.glm11.pred$fit[pred.df2$age==55 & pred.df2$arcus=="Unexposed"] > # thus, OR equals > exp(OR.11) 38 1.013637 > > # we can also do this using the above contrast() function > WCGS.glm11bis <- lrm(chd69~age*arcus,data=WCGS) > cc2 <- contrast(WCGS.glm11bis,list(age=55,arcus='Exposed'), + list(age=55,arcus='Unexposed')) > exp(unlist(cc2[2:5])) Contrast.1 SE Lower.1 Upper.1 1.0136365 1.2256351 0.6802954 1.5103131 > > # plot log odds as a function of age, separately for individuals with > # and without arcus > interaction.plot(pred.df2$age,pred.df2$arcus,WCGS.glm11.pred$fit,legend=FALSE, + xlab="Age (years)",ylab="Log CHD Risk") > legend("bottom",c("Unexposed","Exposed"),lty=1:2,horiz=TRUE) > > # compare the p-value for Age x Arcus with that obtained from an additive > # model (i.e. where we model log(P) against a linear combination of predictors) > WCGS.glm12 <- glm((as.numeric(chd69)-1)~bage_50*arcus,data=WCGS,family=poisson) > summary(WCGS.glm12) Call: glm(formula = (as.numeric(chd69) - 1) ~ bage_50 * arcus, family = poisson, data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -0.5027 -0.4905 -0.3256 -0.3256 1.9952 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.9373 0.1060 -27.711 < 2e-16 *** bage_501 0.8196 0.1639 5.001 5.72e-07 *** arcusExposed 0.6008 0.1715 3.503 0.000461 *** bage_501:arcusExposed -0.5518 0.2575 -2.143 0.032151 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 1282.4 on 3151 degrees of freedom Residual deviance: 1245.4 on 3148 degrees of freedom (2 observations deleted due to missingness) AIC: 1763.4 Number of Fisher Scoring iterations: 6 > > # however, I cannot get the Wald test p-value of 0.15 as Vittinghoff et al. > # here, it is estimated as 0.032 > > ############################################################################## > # Prediction using an extended logistic model for CHD Events > WCGS.glm13 <- glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke+dibpat + +bmi_10:chol_50+bmi_10:sbp_50, + data=WCGS,family=binomial()) > summary(WCGS.glm13) Call: glm(formula = chd69 ~ age_10 + chol_50 + sbp_50 + bmi_10 + smoke + dibpat + bmi_10:chol_50 + bmi_10:sbp_50, family = binomial(), data = WCGS) Deviance Residuals: Min 1Q Median 3Q Max -1.1970 -0.4477 -0.3148 -0.2138 2.9126 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.41143 0.15019 -22.714 < 2e-16 *** age_10 0.59150 0.12002 4.929 8.29e-07 *** chol_50 0.59489 0.07665 7.761 8.40e-15 *** sbp_50 1.00917 0.20651 4.887 1.02e-06 *** bmi_10 1.01021 0.30093 3.357 0.000788 *** smoke 0.59479 0.14068 4.228 2.36e-05 *** dibpat 0.72215 0.14487 4.985 6.20e-07 *** chol_50:bmi_10 -0.69208 0.24392 -2.837 0.004550 ** sbp_50:bmi_10 -1.39568 0.62806 -2.222 0.026269 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1779.2 on 3141 degrees of freedom Residual deviance: 1578.6 on 3133 degrees of freedom (12 observations deleted due to missingness) AIC: 1596.6 Number of Fisher Scoring iterations: 6 > > # Note that we get slighlty different estimates > > # predictions for the observed sample (5 individuals taken at random among the > # wole dataset) > set.seed(143) > five.indiv <- sample(1:nrow(WCGS),5,replace=F) > WCGS.glm13.pred <- predict(WCGS.glm13,type="response")[five.indiv] > WCGS.five.indiv <- WCGS[five.indiv,c("chd69","age","chol","sbp","bmi","smoke","dibpat")] > cbind(WCGS.five.indiv,WCGS.glm13.pred) chd69 age chol sbp bmi smoke dibpat WCGS.glm13.pred 2975 Noncases 43 210 114 18.7467 0 0 0.02674995 25 Noncases 49 193 130 21.1151 1 1 0.05781143 2336 Cases 46 280 130 26.6058 1 0 0.02396027 1313 Cases 49 164 130 28.1261 0 1 0.04121997 240 Noncases 54 219 120 23.9865 0 1 0.07271686 > > # compute AUC and plot the AUC > library(Epi) > WCGS.glm13.predAll <- predict(WCGS.glm13,type="response") > tmp <- cbind(predicted=WCGS.glm13.predAll,observed=WCGS$chd69[as.numeric(names(WCGS.glm13.predAll))]) > ROC(tmp[,"predicted"],tmp[,"observed"]-1,plot="ROC") > > # diagnostic plots (as plotted pp. 189-191) > > # plot standardized residuals against obs. number > plot(residuals(WCGS.glm13,type="pearson"),ylab="Standardized Pearson Residuals", + xlab="Observation Number",pch=19,cex=.8) > > # dfbetas > ifl.glm13 <- influence(WCGS.glm13,do.coef=FALSE) > dfbetas.glm13 <- dfbetas(WCGS.glm13) > > # plot of the residuals vs. fitted values > plot(residuals(WCGS.glm13,type="response")~WCGS.glm13.predAll,pch=19,cex=.8, + xlab="Fitted values",ylab="Response Residuals") > > # Leverage effect > hi <- influence(WCGS.glm13)$hat > plot(hi~WCGS.glm13.predAll,xlab="Fitted values", + ylab=expression(paste("Leverage (",h[i],")")),pch=19,cex=.8) > hh <- which(hi>0.06) > text(WCGS.glm13.predAll[hh],hi[hh],hh,cex=.8,pos=4) > > ############################################################################## > # Case-Control Studies: The Ille-et-Villaine study of esopahgeal cancer > # see e.g. http://www.ncbi.nlm.nih.gov/pubmed/861389 > > # > # Note: > # Data are not available on companion website. Fortunately, the are included > # in the esoph R dataset. > data(esoph) > > # recode tobacco consumption as binary variable > esoph$ditob <- ifelse(as.numeric(esoph$tobgp)==1,"O-9g/day","10+ g/day") > esoph$ditob <- factor(esoph$ditob,levels=c("O-9g/day","10+ g/day"),ordered=TRUE) > > with(esoph, tapply(ncases,ditob,sum)) O-9g/day 10+ g/day 78 122 > > # compute odds ratio for smoking and esophageal cancer > > # we first need to recode the case/control classification in the 'long' format > # before using epitab() from epitools package > > # TODO... > > # runnig an GLM on this dataset (here age is categorical) > esoph.glm <- glm(cbind(ncases,ncontrols)~alcgp+ditob+agegp, + data=esoph,family=binomial) > summary(esoph.glm) Call: glm(formula = cbind(ncases, ncontrols) ~ alcgp + ditob + agegp, family = binomial, data = esoph) Deviance Residuals: Min 1Q Median 3Q Max -1.7204 -0.5808 -0.2544 0.3601 2.3548 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.9340 0.1946 -9.937 < 2e-16 *** alcgp.L 1.5169 0.1982 7.655 1.94e-14 *** alcgp.Q -0.1966 0.1782 -1.103 0.26993 alcgp.C 0.2667 0.1576 1.692 0.09060 . ditob.L 0.3238 0.1219 2.657 0.00788 ** agegp.L 2.9559 0.6509 4.541 5.60e-06 *** agegp.Q -1.3043 0.5903 -2.210 0.02713 * agegp.C 0.1357 0.4483 0.303 0.76211 agegp^4 0.1108 0.3073 0.360 0.71853 agegp^5 -0.1988 0.1950 -1.020 0.30785 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 227.24 on 87 degrees of freedom Residual deviance: 57.42 on 78 degrees of freedom AIC: 224.9 Number of Fisher Scoring iterations: 6 > > # > # Note: > # Vittinghoff et al. take age as a continuous variable (in years), but we > # only have a categorical variable > # To approximate their results, we convert agegp to a numerical variable, > # after taking the mid-group value > esoph$age.num <- strsplit(as.character(esoph$agegp),"-") > for (i in 1:77) + esoph$age.num[[i]] <- (as.numeric(esoph$age.num[[i]][2]) + +as.numeric(esoph$age.num[[i]][1]))/2 > esoph$age.num <- unlist(esoph$age.num) > esoph$age.num[78:88] <- "75" > esoph$age.num <- as.numeric(esoph$age.num) > > esoph.glm2 <- glm(cbind(ncases,ncontrols)~alcgp+ditob+age.num, + data=esoph,family=binomial) > summary(esoph.glm2) Call: glm(formula = cbind(ncases, ncontrols) ~ alcgp + ditob + age.num, family = binomial, data = esoph) Deviance Residuals: Min 1Q Median 3Q Max -2.1072 -0.6982 -0.3604 0.3122 2.5691 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.641278 0.461151 -10.065 < 2e-16 *** alcgp.L 1.596520 0.196541 8.123 4.54e-16 *** alcgp.Q -0.234654 0.176765 -1.327 0.18435 alcgp.C 0.242663 0.157822 1.538 0.12415 ditob.L 0.351851 0.121384 2.899 0.00375 ** age.num 0.056736 0.007666 7.401 1.35e-13 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 227.241 on 87 degrees of freedom Residual deviance: 69.606 on 82 degrees of freedom AIC: 229.09 Number of Fisher Scoring iterations: 4 > > # Using this coding, we clearly don't the same results > > > dev.off() # release the postscript device null device 1 >