lwb.R

The low birth weight study (Hosmer and Lemeshow, 1989) will be used to illustrate basic descriptive and inferential techniques in R. This is a complete study, although analysis will somehow depart from the one presented in Hosmer and Lemeshow’s textbook.
In an attempt to alleviate the need to learn a lot of external commands, we restricted the use of external packages to a few ones that will, however, save your life in daily R activities. In particular, we rely on lattice plotting facilities whenever possible.

Preparing the dataset

Importing data

The low birth weight dataset is already available in the MASS package, as birthwt. To load a built-in dataset in R, we use data. If we want to use the Stata lbw dataset instead, we could use the read.dta() function in the foreign package. To know how data are stored, we then use str() which displays the first values of each variables, and their storage mode.

library(MASS)
data(birthwt) 
str(birthwt)

To get a gentle numerical overview of the data, we can use the summary() function.

summary(birthwt)

The very first thing to do with unknown datasets is to check for the presence of missing values (MV). This is done with is.na(), but see also complete.cases(). Some statistical models doesn’t accomodate well with MV, others delete them listwise. In R, MV are stored as NA.

apply(birthwt, 2, function(x) sum(is.na(x)))
dim(birthwt)

Recoding and checking variables

Some of the factors of interest will not be understood by R as we would like it to do because they are just treated as numerical variables. So, the next step is to convert them to (unordered) factor. The within() command provides a convenient to update several columns in a data.frame within a single call.

birthwt <- within(birthwt, {
  low <- factor(low, labels=c("No","Yes"))
  race <- factor(race, labels=c("White","Black","Other"))
  smoke <- factor(smoke, labels=c("No","Yes"))
  ui <- factor(ui, labels=c("No","Yes"))
  ht <- factor(ht, labels=c("No","Yes"))
})

Again, with unknown data, it is recommended to take a close look at the distribution of numerical variables, which in this particular case can be isolated using a repeated call to is.numeric().

idx <- sapply(birthwt, is.numeric)

We then use box-and-whiskers charts to summarize each distribution. Note that we have rescaled them to avoid the problem of varying y-axis.

boxplot(apply(birthwt[,idx], 2, scale))

Summarizing data

Most base functions in R can be used to summarize our variables, either numerically or graphically. We have already seen the useful summary() command. For numerical variables, it will output a five-number summary; for categorical data, a frequency of counts. In both cases, missing cases will be reported separately. However, there is a full set of dedicated tools included in the Hmisc package, especially the summary.formula() command.

library(Hmisc)
summary(low ~ ., data=birthwt, method="reverse", overall=TRUE)
summary(bwt ~ ., data=birthwt)
summary(low ~ smoke + ht + ui, data=birthwt, fun=table)

The above commands might be combined with the latex() function to produce pretty-print output, which follow standards for publications in biomedical journal.

Multivariate displays can be used, essentially to study relationships between numerical variables or assess any systematic patterns of covariations. At this stage, it might be interesting to highlight individuals according to a certain characteristic (e.g., low birth weight).

library(lattice)
parallel(birthwt[,idx], groups=birthwt$low, horizontal.axis=FALSE)
parallel(~ birthwt[,idx] | smoke, data=birthwt, groups=low,
         lty=1:2, col=c("gray80","gray20"))
splom(~ birthwt[,idx], data=birthwt, jitter.x=TRUE, cex=.6,
      groups=low, panel=panel.superpose, grid=TRUE,
      axis.text.cex=0.6, xlab="", pch=c(2,6))

Visual exploration of the dataset can also rely on brushing and linking techniques available in the ggobi software, or directly with the Acinonyx package. Below is a screenshot after having selected low weight infants from the histogram. (Click to enlarge)

brushing

library(Acinonyx)
attach(birthwt)
iplot(lwt, bwt)
iplot(smoke, bwt)
ihist(bwt)
ibar(factor(ftv))
detach(birthwt)

Dealing with continuous outcomes

Assessing linear two-way relationships

The covariation between mother’s weight in pounds at last menstrual period and birth weight can be assessed using a correlation test. Of course, we need to display the raw data in a scatterplot, first. It is possible to add a local regression line to the cloud of points to visually assess the linearity of the relationship.

xyplot(bwt ~ lwt, data=birthwt, type=c("p", "smooth"))
with(birthwt, cor.test(bwt, lwt))

Compare the results to the ones obtained using Spearman’s method, which relies on ranks.

with(birthwt, cor.test(bwt, lwt, method="spearman"))

Comparing two means

To test the hypothesis that there is no difference in baby weight depending on the smoking status (of the mother), we can use a t-test for independent samples where the standard error is computed from the pooled standard deviation (weighted average of sample SDs).

t.test(bwt ~ smoke, data=birthwt, var.equal=TRUE)

Omitting the var.equal=TRUE yields a Welch’s t-test which corrects for possible heteroskedasticity by adjusting the degrees of freedom of the reference t-distribution. It is the default under R.

t.test(bwt ~ smoke, data=birthwt)

It is always a good idea to check whether the assumptions of the test are met, by computing the variances and displaying a quantile-quantile plot. This allows to judge the homogeneity of variances (but, don’t forget this is supposed to hold in the parent distributions) and the normality of the distributions.

with(birthwt, tapply(bwt, smoke, var))
opar <- par(mfrow=c(1,2))
with(birthwt, qqplot(bwt[smoke=="No"], main=""))
with(birthwt, qqplot(bwt[smoke=="Yes"], main=""))
par(opar)

In addition, we can produce a graphical summary of the two distributions using boxplots or histograms. To superimpose a kernel density estimate on the latter, see panel.mathdensity().

bwplot(bwt ~ smoke, data=birthwt)
histogram(~ bwt | smoke, data=birthwt, type="count")

Should we want to use a non-parametric (distribution free) statistic, we could apply a Wilcoxon test with wilcox.test().

Using re-randomization instead of parametric testing

We could use a permutation technique instead of the t-test, and consider an approximate or exact distribution of the test statistic to decide whether there is a significant difference between central locations of the two groups.

library(coin)
independence_test(bwt ~ smoke, data=birthwt, distribution="exact")

Comparing more than two means

In case we are interested in comparing more than two groups, we can use an ANOVA (instead of carrying out multiple unprotected t-tests).

fm <- bwt ~ race
aov.fit <- aov(fm, data=birthwt)
summary(aov.fit)

Again, it is good to check the distribution of the residuals.

plot(aov.fit)

The summary() command gives the classical ANOVA table. Sometimes, we are interested in judging whether the between-group means exhibit large differences or not. We can use a tapply() command, or simply the model.table() command.

model.tables(aov.fit)
plot.design(fm, data=birthwt)

Likewise, a two-way ANOVA can be used to test whether birth weight depends on both smoking status and history of hypertension. If we consider a saturated model, we need to include the interaction smoke : ht in addition to the main effects, smoke + ht. Following Wilkinson and Rogers' notation, this reduces to smoke * ht.

summary(aov.fit2 <- aov(bwt ~ smoke * ht, data=birthwt))
with(birthwt, interaction.plot(smoke, ht, bwt))

As the interaction appears to be non significant, we can remove this term and fit a model including main effects only.

aov.fit3 <- aov(bwt ~ smoke + ht, data=birthwt)
anova(aov.fit2, aov.fit3)

Adding a third term, race, with its two 2nd order interactions, indicates that ethnicity also impacts baby weight, with no dependence on smoking status or history of hypertension.

summary(aov.fit4 <- update(aov.fit3, . ~ . * race))

We can use multiple comparisons, controlling for FWER, on the final model using Tukey’s HSD contrasts. In fact, only the race effect is of interest here, because the factor has more than two levels.

aov.fit5 <- update(aov.fit3, . ~ . + race)
model.tables(aov.fit5)
se.contrast(aov.fit5, list(birthwt$race=="White",
                           birthwt$race=="Black"))
TukeyHSD(aov.fit5, which="race")
plot(TukeyHSD(aov.fit5, which="race"))

Results can be compared with what would be obtained using t-tests and a step-down method to adjust p-values.

with(birthwt, pairwise.t.test(bwt, race))

Linear regression

It should be clear that the above models can be tested using classical linear regression. In this case, we use the lm() command. Note that baseline categories will be the first lexicographic entry for each factor levels. Model diagnostics can be assessed using the plot() method.

lm.fit5 <- lm(bwt ~ smoke + ht + race, data=birthwt)
summary(lm.fit5)
anova(lm.fit5)
plot(lm.fit5, which=4)

The above command just shows influential observations as assessed by Cook’s distances.

It is interesting to take a closer look at the so-called “design matrix” for this particular model since it is composed of a mix of numerical and categorical predictors.

head(model.matrix(lm.fit5))
head(birthwt[,c("smoke","ht","race")])

Conditional means can be obtained with the aggregate() command, and plotted using a trellis display. Here, we have omitted SDs, but they should be added to get an idea of the birth weight variability in each partition of the dataset.

all.means <- aggregate(bwt ~ smoke + ht + race,
                       data=birthwt, FUN=mean)
xyplot(bwt ~ smoke | race, data=all.means, groups=ht,
       layout=c(3,1), type=c("p","l"), auto.key=TRUE)

Using a regression approach also allows to include continuous predictor, like age.

fm <- bwt ~ (smoke + ht) * scale(age, scale=F) + race
lm.fit6 <- lm(fm, data=birthwt)
summary(lm.fit6)
confint(lm.fit6)

Bootstrapping can be used to derive empirical confidence interval for model coefficients, instead of relying on asymptotic 95% CIs.

library(boot)
bs <- function(formula, data, k) coef(lm(formula, data[k,]))
results <- boot(data=birthwt, statistic=bs, 
                R=1000, formula=fm)
boot.ci(results, type="bca", index=2) # smoke

It is possible to get predicted values, together with confidence bands for means or individuals using the predict command. Consider the following model, bwt ~ lwt + smoke. First, we need to estimate predicted values for given values of the predictors.

fm <- bwt ~ lwt + smoke
new.df <- expand.grid(lwt=seq(80, 250, by=10),
                      smoke=levels(birthwt$smoke))
new.df <- data.frame(new.df, predict(lm(fm, data=birthwt), 
                                     newdata=new.df,
                                     interval="confidence"))

Next, we build a rather complex plot with \docpkg{lattice} to display fitted regression line with associated 95\% confidence bands (for predicting the mean).

lattice

my.panel.bands <- function(x, y, upper, lower, col, subscripts, ...) {
  upper <- upper[subscripts]
  lower <- lower[subscripts]
  panel.lines(x, lower, ...)
  panel.lines(x, upper, ...)
}
my.col <- c("black", "gray60")
my.key <- simpleKey(text=c("Not smoking mother","Smoking mother"),
                    points=FALSE, col=my.col, columns=2)
xyplot(fit ~ lwt, data=new.df, groups=smoke, 
       upper = new.df$upr, lower = new.df$lwr,
       xlab="Mother's weight at last menstrual period (pounds)",
       ylab="Expected birth weight (grams)",
       col=my.col, lty=1:2, key=my.key,
       panel = function(x, y, ...){
         panel.superpose(x, y, panel.groups = my.panel.bands,
                         type='l', ...)
         panel.xyplot(x, y, type=c("b","g"), cex=0.6, pch=19,
                      lwd=2, ...)})
                                                              

Although not recommended because of the lack of control on p-values or biased R2 values, we could use step() or stepAIC() to perform automatic variable selection on the full model. We will, however, consider that a sensible base model will include at least smoking status, hypertension and age, as well as ethnicity.

birthwt$age.s <- scale(birthwt$age, scale=F)
lm.fit7 <- lm(bwt ~ . + ht:age.s + ptl:ftv + ui:ftv,
              data=birthwt[,-1])
lm.fit7.step <- step(lm.fit7,
                     scope=list(lower= ~ smoke + ht*age.s + race,
                                upper= ~ .), trace=FALSE)
summary(lm.fit7.step)

A better way to perform variable selection is through penalization, i.e. shrinkage of regression coefficients to zero or use of penalized likelihood.

Dealing with categorical outcomes

Comparing two proportions

Considering the cross-classification of the existence of previous premature labours (yes/no) with number of physician visits during the first trimester (1 or more than one), we can ask whether there is any association between the two variables using a chi-square test.

ptd <- factor(birthwt$ptl > 0, labels=c("No","Yes"))
ftv2 <- factor(ifelse(birthwt$ftv < 2, "1", "2+"))
tab.ptd.ftv <- table(ptd, ftv2)
prop.table(tab.ptd.ftv, 1)
chisq.test(tab.ptd.ftv)

More association measures are available through the assocstats() function in the vcd package. In particular, we could test for an association between smoking status (considered here as an exposure factor) and a low-weight infant (low), optionally considering ethnicity as a stratification factor. A Cochran-Mantel-Haenszel test can be used to derive a common odds-ratio estimate for low baby weight by smoking status, after stratification on ethnicity.

Here is a simple estimate of the odds-ratio, ignoring the stratification factor:

library(vcd)
oddsratio(xtabs(~ low + smoke, data=birthwt), log=FALSE)
fourfold(xtabs(~ low + smoke, data=birthwt),
         color=rep(c("gray30","gray80"), 3))
tab.low.smoke.by.race <- xtabs(~ low + smoke + race, data=birthwt)
plot(oddsratio(tab.low.smoke.by.race))

Compared to stratum-specific \textsc{or}s (see below), the common estimate equals 3.09.

with(birthwt, mantelhaen.test(low, smoke, race))
for (i in 1:3) 
  print(oddsratio(xtabs(~ low + smoke, 
                        data=subset(birthwt, as.numeric(race)==i)), 
                  log=FALSE))

Testing association in a 2 x k Table

Instead of using a binary indicator, we could also use a trend test which has higher power than the chi-square test when there is indeed a linear or monotonic trend. This test is equivalent to a score test for testing H0: β=0 in a logistic regression model, but it can be computed from the M2 = (n−1)r2 statistic (Agresti, 2002).

Consider, for example, three ordered classes for number of physician visits (ftv) and a binary indicator for baby weight above or below 2.5 kg. A Cochrane and Armitage trend test can be carried out as follows:

birthwt$ftv.c <- cut2(birthwt$ftv, c(0, 1, 2, 6))
xtabs(~ low + ftv.c, data=birthwt)
independence_test(low ~ ftv.c, data=birthwt, teststat="quad",
                  scores=list(ftv.c=c(0,1,4)))

Testing association in a three-way Table

When there are more than two variables, we need dedicated methods to display and test associations. Everything is readily available in the vcd, vcdExtra and gnm packages. The co_table command, with or without mar_table, can be used to switch from 3+ to 2-way tables.

age.q <- cut2(birthwt$age, g=4)
structable(low ~ age.q + ftv2 + ht, data=birthwt)

A more friendly visual summary using a mosaic display.

mosaic(~ low + age.q + ftv2, data=birthwt, shade=TRUE, 
       labeling_args=list(set_varnames=c(low="Weight < 2.5 kg", 
                            age.q="Age range", 
                            ftv2= "No. physician visits")))
  

Modeling a binary outcome

Logistic regression

One of the models discussed by Hosmer and Lemeshow was about how low birth weight (<2.5 kg, low) relates to mother’s age (age), weight at last menstrual period (lwt), ethnicity (race), and number of first trimester physician visits (ftv). As the outcome is binary, we need to use a logistic regression.

Such a model can be fitted using the glm and a formula which looks like low ~ age + lwt + race + ftv, or with the rms::lrm command.

library(rms)
ddist <- datadist(birthwt)
options(datadist="ddist")             
fit.glm1 <- lrm(low ~ age + lwt + race + ftv, data=birthwt)
print(fit.glm1)

A summary of the effects of each factor (OR with 95\% CI) can be obtained through the summary.rms command.

summary(fit.glm1)

A likelihood ratio test indicates that at least one coefficients is non zero. It appears that age and ftv are not significant, according to their Wald statistic. We can compare the original model with a model where those terms are removed using an LRT.

fit.glm2 <- update(fit.glm1, . ~ . - age - ftv)
lrtest(fit.glm2, fit.glm1)
anova(fit.glm2)

Finally, we could predict the expected outcome (with confidence intervals for means) for a particular range of weight at last menstrual period, depending on mother’s ethnicity.

pred.glm2 <- Predict(fit.glm2, lwt=seq(80, 250, by=10), race)
xYplot(Cbind(yhat,lower,upper) ~ lwt | race, data=pred.glm2,
       method="filled bands", type="l", col.fill=gray(.95))

Still on the log odds scale, we can predict the expected weight category for a white mother’s of last menstrual weight lwt=150.

Predict(fit.glm2, lwt=150, race="White")

Or we can use the regression equation directly since we have access parameters estimates. The OR is computed as:

exp(sum(coef(fit.glm2)*c(1, 150, 0, 0)))

What’s next?

Model diagnostic and model selection are important steps when building a predictive model, though it was not discussed here. A very thorough review of relevant methods is available in Steyerberg (2009).

Bibliography

[1] Hosmer D, Lemeshow S. Applied Logistic Regression. New York: Wiley; 1989.
[2] Agresti, AA. Categorical Data Analysis. New York: Wiley; 2002. [3] Steyerberg, EW. Clinical Predition Models. Springer; 2009.


Last updated: Thu Oct 27 13:52:00 CEST 2011.
This document has been prepared using rocco and R version 2.13.2.