lwb.R

lwb.R
¶ The low birth weight study (Hosmer and Lemeshow, 1989) will be used to illustrate basic descriptive and inferential techniques in R. This is a complete study, although analysis will somehow depart from the one presented in Hosmer and Lemeshow’s textbook. In an attempt to alleviate the need to learn a lot of external commands, we restricted the use of external packages to a few ones that will, however, save your life in daily R activities. In particular, we rely on `lattice` plotting facilities whenever possible.
¶ Preparing the dataset
¶ Importing data
¶ The low birth weight dataset is already available in the `MASS` package, as `birthwt`. To load a built-in dataset in R, we use `data`. If we want to use the Stata `lbw` dataset instead, we could use the `read.dta()` function in the `foreign` package. To know how data are stored, we then use `str()` which displays the first values of each variables, and their storage mode.	library(MASS) data(birthwt) str(birthwt)
¶ To get a gentle numerical overview of the data, we can use the `summary()` function.	summary(birthwt)
¶ The very first thing to do with unknown datasets is to check for the presence of missing values (MV). This is done with `is.na()`, but see also `complete.cases()`. Some statistical models doesn’t accomodate well with MV, others delete them listwise. In R, MV are stored as `NA`.	apply(birthwt, 2, function(x) sum(is.na(x))) dim(birthwt)
¶ Recoding and checking variables
¶ Some of the factors of interest will not be understood by R as we would like it to do because they are just treated as numerical variables. So, the next step is to convert them to (unordered) `factor`. The `within()` command provides a convenient to update several columns in a `data.frame` within a single call.	birthwt <- within(birthwt, { low <- factor(low, labels=c("No","Yes")) race <- factor(race, labels=c("White","Black","Other")) smoke <- factor(smoke, labels=c("No","Yes")) ui <- factor(ui, labels=c("No","Yes")) ht <- factor(ht, labels=c("No","Yes")) })
¶ Again, with unknown data, it is recommended to take a close look at the distribution of numerical variables, which in this particular case can be isolated using a repeated call to `is.numeric()`.	idx <- sapply(birthwt, is.numeric)
¶ We then use box-and-whiskers charts to summarize each distribution. Note that we have rescaled them to avoid the problem of varying y-axis.	boxplot(apply(birthwt[,idx], 2, scale))
¶ Summarizing data
¶ Most base functions in R can be used to summarize our variables, either numerically or graphically. We have already seen the useful `summary()` command. For numerical variables, it will output a five-number summary; for categorical data, a frequency of counts. In both cases, missing cases will be reported separately. However, there is a full set of dedicated tools included in the Hmisc package, especially the `summary.formula()` command.	library(Hmisc) summary(low ~ ., data=birthwt, method="reverse", overall=TRUE) summary(bwt ~ ., data=birthwt) summary(low ~ smoke + ht + ui, data=birthwt, fun=table)
¶ The above commands might be combined with the `latex()` function to produce pretty-print output, which follow standards for publications in biomedical journal.
¶ Multivariate displays can be used, essentially to study relationships between numerical variables or assess any systematic patterns of covariations. At this stage, it might be interesting to highlight individuals according to a certain characteristic (e.g., low birth weight).	library(lattice) parallel(birthwt[,idx], groups=birthwt$low, horizontal.axis=FALSE) parallel(~ birthwt[,idx] \| smoke, data=birthwt, groups=low, lty=1:2, col=c("gray80","gray20")) splom(~ birthwt[,idx], data=birthwt, jitter.x=TRUE, cex=.6, groups=low, panel=panel.superpose, grid=TRUE, axis.text.cex=0.6, xlab="", pch=c(2,6))
¶ Visual exploration of the dataset can also rely on brushing and linking techniques available in the ggobi software, or directly with the `Acinonyx` package. Below is a screenshot after having selected low weight infants from the histogram. (Click to enlarge)	library(Acinonyx) attach(birthwt) iplot(lwt, bwt) iplot(smoke, bwt) ihist(bwt) ibar(factor(ftv)) detach(birthwt)
¶ Dealing with continuous outcomes
¶ Assessing linear two-way relationships
¶ The covariation between mother’s weight in pounds at last menstrual period and birth weight can be assessed using a correlation test. Of course, we need to display the raw data in a scatterplot, first. It is possible to add a local regression line to the cloud of points to visually assess the linearity of the relationship.	xyplot(bwt ~ lwt, data=birthwt, type=c("p", "smooth")) with(birthwt, cor.test(bwt, lwt))
¶ Compare the results to the ones obtained using Spearman’s method, which relies on ranks.	with(birthwt, cor.test(bwt, lwt, method="spearman"))
¶ Comparing two means
¶ To test the hypothesis that there is no difference in baby weight depending on the smoking status (of the mother), we can use a t-test for independent samples where the standard error is computed from the pooled standard deviation (weighted average of sample SDs).	t.test(bwt ~ smoke, data=birthwt, var.equal=TRUE)
¶ Omitting the `var.equal=TRUE` yields a Welch’s t-test which corrects for possible heteroskedasticity by adjusting the degrees of freedom of the reference t-distribution. It is the default under R.	t.test(bwt ~ smoke, data=birthwt)
¶ It is always a good idea to check whether the assumptions of the test are met, by computing the variances and displaying a quantile-quantile plot. This allows to judge the homogeneity of variances (but, don’t forget this is supposed to hold in the parent distributions) and the normality of the distributions.	with(birthwt, tapply(bwt, smoke, var)) opar <- par(mfrow=c(1,2)) with(birthwt, qqplot(bwt[smoke=="No"], main="")) with(birthwt, qqplot(bwt[smoke=="Yes"], main="")) par(opar)
¶ In addition, we can produce a graphical summary of the two distributions using boxplots or histograms. To superimpose a kernel density estimate on the latter, see `panel.mathdensity()`.	bwplot(bwt ~ smoke, data=birthwt) histogram(~ bwt \| smoke, data=birthwt, type="count")
¶ Should we want to use a non-parametric (distribution free) statistic, we could apply a Wilcoxon test with `wilcox.test()`.
¶ Using re-randomization instead of parametric testing
¶ We could use a permutation technique instead of the t-test, and consider an approximate or exact distribution of the test statistic to decide whether there is a significant difference between central locations of the two groups.	library(coin) independence_test(bwt ~ smoke, data=birthwt, distribution="exact")
¶ Comparing more than two means
¶ In case we are interested in comparing more than two groups, we can use an ANOVA (instead of carrying out multiple unprotected t-tests).	fm <- bwt ~ race aov.fit <- aov(fm, data=birthwt) summary(aov.fit)
¶ Again, it is good to check the distribution of the residuals.	plot(aov.fit)
¶ The `summary()` command gives the classical ANOVA table. Sometimes, we are interested in judging whether the between-group means exhibit large differences or not. We can use a `tapply()` command, or simply the `model.table()` command.	model.tables(aov.fit) plot.design(fm, data=birthwt)
¶ Likewise, a two-way ANOVA can be used to test whether birth weight depends on both smoking status and history of hypertension. If we consider a saturated model, we need to include the interaction `smoke : ht` in addition to the main effects, `smoke + ht`. Following Wilkinson and Rogers' notation, this reduces to `smoke * ht`.	summary(aov.fit2 <- aov(bwt ~ smoke * ht, data=birthwt)) with(birthwt, interaction.plot(smoke, ht, bwt))
¶ As the interaction appears to be non significant, we can remove this term and fit a model including main effects only.	aov.fit3 <- aov(bwt ~ smoke + ht, data=birthwt) anova(aov.fit2, aov.fit3)
¶ Adding a third term, `race`, with its two 2nd order interactions, indicates that ethnicity also impacts baby weight, with no dependence on smoking status or history of hypertension.	summary(aov.fit4 <- update(aov.fit3, . ~ . * race))
¶ We can use multiple comparisons, controlling for FWER, on the final model using Tukey’s HSD contrasts. In fact, only the `race` effect is of interest here, because the factor has more than two levels.	aov.fit5 <- update(aov.fit3, . ~ . + race) model.tables(aov.fit5) se.contrast(aov.fit5, list(birthwt$race=="White", birthwt$race=="Black")) TukeyHSD(aov.fit5, which="race") plot(TukeyHSD(aov.fit5, which="race"))
¶ Results can be compared with what would be obtained using t-tests and a step-down method to adjust p-values.	with(birthwt, pairwise.t.test(bwt, race))
¶ Linear regression
¶ It should be clear that the above models can be tested using classical linear regression. In this case, we use the `lm()` command. Note that baseline categories will be the first lexicographic entry for each factor levels. Model diagnostics can be assessed using the `plot()` method.	lm.fit5 <- lm(bwt ~ smoke + ht + race, data=birthwt) summary(lm.fit5) anova(lm.fit5) plot(lm.fit5, which=4)
¶ The above command just shows influential observations as assessed by Cook’s distances.
¶ It is interesting to take a closer look at the so-called “design matrix” for this particular model since it is composed of a mix of numerical and categorical predictors.	head(model.matrix(lm.fit5)) head(birthwt[,c("smoke","ht","race")])
¶ Conditional means can be obtained with the `aggregate()` command, and plotted using a trellis display. Here, we have omitted SDs, but they should be added to get an idea of the birth weight variability in each partition of the dataset.	all.means <- aggregate(bwt ~ smoke + ht + race, data=birthwt, FUN=mean) xyplot(bwt ~ smoke \| race, data=all.means, groups=ht, layout=c(3,1), type=c("p","l"), auto.key=TRUE)
¶ Using a regression approach also allows to include continuous predictor, like age.	fm <- bwt ~ (smoke + ht) * scale(age, scale=F) + race lm.fit6 <- lm(fm, data=birthwt) summary(lm.fit6) confint(lm.fit6)
¶ Bootstrapping can be used to derive empirical confidence interval for model coefficients, instead of relying on asymptotic 95% CIs.	library(boot) bs <- function(formula, data, k) coef(lm(formula, data[k,])) results <- boot(data=birthwt, statistic=bs, R=1000, formula=fm) boot.ci(results, type="bca", index=2) # smoke
¶ It is possible to get predicted values, together with confidence bands for means or individuals using the `predict` command. Consider the following model, `bwt ~ lwt + smoke`. First, we need to estimate predicted values for given values of the predictors.	fm <- bwt ~ lwt + smoke new.df <- expand.grid(lwt=seq(80, 250, by=10), smoke=levels(birthwt$smoke)) new.df <- data.frame(new.df, predict(lm(fm, data=birthwt), newdata=new.df, interval="confidence"))
¶ Next, we build a rather complex plot with \docpkg{lattice} to display fitted regression line with associated 95\% confidence bands (for predicting the mean).	my.panel.bands <- function(x, y, upper, lower, col, subscripts, ...) { upper <- upper[subscripts] lower <- lower[subscripts] panel.lines(x, lower, ...) panel.lines(x, upper, ...) } my.col <- c("black", "gray60") my.key <- simpleKey(text=c("Not smoking mother","Smoking mother"), points=FALSE, col=my.col, columns=2) xyplot(fit ~ lwt, data=new.df, groups=smoke, upper = new.df$upr, lower = new.df$lwr, xlab="Mother's weight at last menstrual period (pounds)", ylab="Expected birth weight (grams)", col=my.col, lty=1:2, key=my.key, panel = function(x, y, ...){ panel.superpose(x, y, panel.groups = my.panel.bands, type='l', ...) panel.xyplot(x, y, type=c("b","g"), cex=0.6, pch=19, lwd=2, ...)})
¶ Although not recommended because of the lack of control on p-values or biased R² values, we could use `step()` or `stepAIC()` to perform automatic variable selection on the full model. We will, however, consider that a sensible base model will include at least smoking status, hypertension and age, as well as ethnicity.	birthwt$age.s <- scale(birthwt$age, scale=F) lm.fit7 <- lm(bwt ~ . + ht:age.s + ptl:ftv + ui:ftv, data=birthwt[,-1]) lm.fit7.step <- step(lm.fit7, scope=list(lower= ~ smoke + ht*age.s + race, upper= ~ .), trace=FALSE) summary(lm.fit7.step)
¶ A better way to perform variable selection is through penalization, i.e. shrinkage of regression coefficients to zero or use of penalized likelihood.
¶ Dealing with categorical outcomes
¶ Comparing two proportions
¶ Considering the cross-classification of the existence of previous premature labours (yes/no) with number of physician visits during the first trimester (1 or more than one), we can ask whether there is any association between the two variables using a chi-square test.	ptd <- factor(birthwt$ptl > 0, labels=c("No","Yes")) ftv2 <- factor(ifelse(birthwt$ftv < 2, "1", "2+")) tab.ptd.ftv <- table(ptd, ftv2) prop.table(tab.ptd.ftv, 1) chisq.test(tab.ptd.ftv)
¶ More association measures are available through the `assocstats()` function in the `vcd` package. In particular, we could test for an association between smoking status (considered here as an exposure factor) and a low-weight infant (`low`), optionally considering ethnicity as a stratification factor. A Cochran-Mantel-Haenszel test can be used to derive a common odds-ratio estimate for low baby weight by smoking status, after stratification on ethnicity.
¶ Here is a simple estimate of the odds-ratio, ignoring the stratification factor:	library(vcd) oddsratio(xtabs(~ low + smoke, data=birthwt), log=FALSE) fourfold(xtabs(~ low + smoke, data=birthwt), color=rep(c("gray30","gray80"), 3)) tab.low.smoke.by.race <- xtabs(~ low + smoke + race, data=birthwt) plot(oddsratio(tab.low.smoke.by.race))
¶ Compared to stratum-specific \textsc{or}s (see below), the common estimate equals 3.09.	with(birthwt, mantelhaen.test(low, smoke, race)) for (i in 1:3) print(oddsratio(xtabs(~ low + smoke, data=subset(birthwt, as.numeric(race)==i)), log=FALSE))
¶ Testing association in a 2 x k Table
¶ Instead of using a binary indicator, we could also use a trend test which has higher power than the chi-square test when there is indeed a linear or monotonic trend. This test is equivalent to a score test for testing H₀: β=0 in a logistic regression model, but it can be computed from the M² = (n−1)r² statistic (Agresti, 2002).
¶ Consider, for example, three ordered classes for number of physician visits (`ftv`) and a binary indicator for baby weight above or below 2.5 kg. A Cochrane and Armitage trend test can be carried out as follows:	birthwt$ftv.c <- cut2(birthwt$ftv, c(0, 1, 2, 6)) xtabs(~ low + ftv.c, data=birthwt) independence_test(low ~ ftv.c, data=birthwt, teststat="quad", scores=list(ftv.c=c(0,1,4)))
¶ Testing association in a three-way Table
¶ When there are more than two variables, we need dedicated methods to display and test associations. Everything is readily available in the `vcd`, `vcdExtra` and `gnm` packages. The `co_table` command, with or without `mar_table`, can be used to switch from 3+ to 2-way tables.	age.q <- cut2(birthwt$age, g=4) structable(low ~ age.q + ftv2 + ht, data=birthwt)
¶ A more friendly visual summary using a mosaic display.	mosaic(~ low + age.q + ftv2, data=birthwt, shade=TRUE, labeling_args=list(set_varnames=c(low="Weight < 2.5 kg", age.q="Age range", ftv2= "No. physician visits")))
¶ Modeling a binary outcome
¶ Logistic regression
¶ One of the models discussed by Hosmer and Lemeshow was about how low birth weight (<2.5 kg, `low`) relates to mother’s age (`age`), weight at last menstrual period (`lwt`), ethnicity (`race`), and number of first trimester physician visits (`ftv`). As the outcome is binary, we need to use a logistic regression.
¶ Such a model can be fitted using the `glm` and a formula which looks like `low ~ age + lwt + race + ftv`, or with the `rms::lrm` command.	library(rms) ddist <- datadist(birthwt) options(datadist="ddist") fit.glm1 <- lrm(low ~ age + lwt + race + ftv, data=birthwt) print(fit.glm1)
¶ A summary of the effects of each factor (OR with 95\% CI) can be obtained through the `summary.rms` command.	summary(fit.glm1)
¶ A likelihood ratio test indicates that at least one coefficients is non zero. It appears that `age` and `ftv` are not significant, according to their Wald statistic. We can compare the original model with a model where those terms are removed using an LRT.	fit.glm2 <- update(fit.glm1, . ~ . - age - ftv) lrtest(fit.glm2, fit.glm1) anova(fit.glm2)
¶ Finally, we could predict the expected outcome (with confidence intervals for means) for a particular range of weight at last menstrual period, depending on mother’s ethnicity.	pred.glm2 <- Predict(fit.glm2, lwt=seq(80, 250, by=10), race) xYplot(Cbind(yhat,lower,upper) ~ lwt \| race, data=pred.glm2, method="filled bands", type="l", col.fill=gray(.95))
¶ Still on the log odds scale, we can predict the expected weight category for a white mother’s of last menstrual weight `lwt=150`.	Predict(fit.glm2, lwt=150, race="White")
¶ Or we can use the regression equation directly since we have access parameters estimates. The OR is computed as:	exp(sum(coef(fit.glm2)*c(1, 150, 0, 0)))
¶ What’s next?
¶ Model diagnostic and model selection are important steps when building a predictive model, though it was not discussed here. A very thorough review of relevant methods is available in Steyerberg (2009).
¶ Bibliography
¶ [1] Hosmer D, Lemeshow S. Applied Logistic Regression. New York: Wiley; 1989. [2] Agresti, AA. Categorical Data Analysis. New York: Wiley; 2002. [3] Steyerberg, EW. Clinical Predition Models. Springer; 2009.
¶ Last updated: Thu Oct 27 13:52:00 CEST 2011. This document has been prepared using rocco and R version 2.13.2.

lwb.R

Preparing the dataset

Importing data

Recoding and checking variables

Summarizing data

Dealing with continuous outcomes

Assessing linear two-way relationships

Comparing two means

Using re-randomization instead of parametric testing

Comparing more than two means

Linear regression

Dealing with categorical outcomes

Comparing two proportions

Testing association in a 2 x k Table

Testing association in a three-way Table

Modeling a binary outcome

Logistic regression

What’s next?

Bibliography