The Hmisc and rms packages provide a wide range of tools for data transformation, aggregated visual and numerical summaries, and enhanced R’s output for most common biostatistical models (linear regression, logistic or Cox regression).
Text-based data file (comma- or tab-delimited files) can be imported using
read.csv()
or the more generic command read.table()
. The foreign
package can be used to process binary data files from other statistical
packages. See also R Data Import/Export. Hmisc
offers extended
support for foreign data files, including CSV (csv.get()
), SAS
(sas.get()
), SPSS (spss.get()
), or Stata (stata.get()
). Variables
names are automatically converted to lowercase, dates are generally better
handled. Documentation and additional information on the
Hmisc website. Various dataset can be download from a
public repository via the getHdata()
command.
As always, before using any package in R, it must be loaded first:
library(Hmisc)
## help(package = Hmisc)
In what follows, we will be using birthwt
data set from the MASS
package. The low birth weight study is one of the datasets used throughout
Hosmer and Lemeshow’s textbook on Applied Logistic Regression (2000, Wiley,
2nd ed.). The goal of this prospective study was to identify risk factors
associated with giving birth to a low birth weight baby (weighing less than
2,500 grams). Data were collected on 189 women, 59 of which had low birth
weight babies and 130 of which had normal birth weight babies. Four
variables which were thought to be of importance were age, weight of the
subject at her last menstrual period, race, and the number of physician
visits during the first trimester of pregnancy. It can be loaded as shown
below:
data(birthwt, package = "MASS")
## help(birthwt)
In this data set there is no missing observations, but let introduce
some NA
values. Note that variable names are relatively short and poorly
informative. Shorter names are, however, easy to manipulate with R. Hmisc
provides specific command for labeling (label()
) and adding units of
measurement (units()
) as additional attributes to a given variable (or
data frame). We will also convert some of the variables as factor with
proper label (rather than 0/1 values) to facilitate reading of summary
tables or subsequent graphics.
birthwt$age[5] <- NA
birthwt$ftv[sample(1:nrow(birthwt), 5)] <- NA
yesno <- c("No", "Yes")
birthwt <- within(birthwt, {
smoke <- factor(smoke, labels = yesno)
low <- factor(low, labels = yesno)
ht <- factor(ht, labels = yesno)
ui <- factor(ui, labels = yesno)
race <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
lwt <- lwt/2.2 ## weight in kg
})
label(birthwt$age) <- "Mother age"
units(birthwt$age) <- "years"
label(birthwt$bwt) <- "Baby weight"
units(birthwt$bwt) <- "grams"
label(birthwt, self = TRUE) <- "Hosmer & Lemeshow's low birth weight study."
list.tree(birthwt) ## equivalent to str(birthwt)
## birthwt = list 10 (24584 bytes)( data.frame )
## . low = integer 189= category (2 levels)( factor )= No No No No No No ...
## . age = integer 189( labelled )= 19 33 20 21 NA 21 ...
## . A label = character 1= Mother age
## . A units = character 1= years
## . lwt = double 189= 82.727 70.455 47.727 ...
## . race = integer 189= category (3 levels)( factor )= Black Other White ...
## . smoke = integer 189= category (2 levels)( factor )= No No Yes Yes Yes ...
## . ptl = integer 189= 0 0 0 0 0 0 0 0 ...
## . ht = integer 189= category (2 levels)( factor )= No No No No No No ...
## . ui = integer 189= category (2 levels)( factor )= Yes No No Yes Yes ...
## . ftv = integer 189= 0 3 1 2 0 0 1 1 ...
## . bwt = integer 189( labelled )= 2523 2551 2557 2594 ...
## . A label = character 1= Baby weight
## . A units = character 1= grams
## A row.names = character 189= 85 86 87 88 89 91 92 ...
## A label = character 1= Hosmer & Lemeshow's lo
The last command, list.tree()
, offers a convenient replacement for R’s
str()
, and in addition to variable type and a list of the first
observation for each variable it will display Hmisc
labels associated to
them.
The contents()
command offers a quick summary of data format and missing
values, and it provides a list of labels associated to variables treated as
factor by R.
contents(birthwt)
##
## Data frame:birthwt 189 observations and 10 variables Maximum # NAs:5
##
## Labels Units Levels Storage NAs
## low 2 integer 0
## age Mother age years integer 1
## lwt double 0
## race 3 integer 0
## smoke 2 integer 0
## ptl integer 0
## ht 2 integer 0
## ui 2 integer 0
## ftv integer 5
## bwt Baby weight grams integer 0
##
## +--------+-----------------+
## |Variable|Levels |
## +--------+-----------------+
## | low |No,Yes |
## | smoke | |
## | ht | |
## | ui | |
## +--------+-----------------+
## | race |White,Black,Other|
## +--------+-----------------+
Another useful command is describe()
, which gives detailed summary
statistics for each variable in a given data frame. It can be printed as
HTML, or as PDF (by using the latex()
backend), and in the latter case
small graphics are added that depict distribution of continuous variables.
describe(birthwt, digits = 3)
## birthwt
##
## 10 Variables 189 Observations
## ---------------------------------------------------------------------------
## low
## n missing unique
## 189 0 2
##
## No (130, 69%), Yes (59, 31%)
## ---------------------------------------------------------------------------
## age : Mother age [years]
## n missing unique Mean .05 .10 .25 .50 .75
## 188 1 24 23.3 16 17 19 23 26
## .90 .95
## 31 32
##
## lowest : 14 15 16 17 18, highest: 33 34 35 36 45
## ---------------------------------------------------------------------------
## lwt
## n missing unique Mean .05 .10 .25 .50 .75
## 189 0 75 59 42.9 45.3 50.0 55.0 63.6
## .90 .95
## 77.3 85.5
##
## lowest : 36.4 38.6 40.5 40.9 41.4
## highest: 97.7 104.1 106.8 109.5 113.6
## ---------------------------------------------------------------------------
## race
## n missing unique
## 189 0 3
##
## White (96, 51%), Black (26, 14%), Other (67, 35%)
## ---------------------------------------------------------------------------
## smoke
## n missing unique
## 189 0 2
##
## No (115, 61%), Yes (74, 39%)
## ---------------------------------------------------------------------------
## ptl
## n missing unique Mean
## 189 0 4 0.196
##
## 0 (159, 84%), 1 (24, 13%), 2 (5, 3%), 3 (1, 1%)
## ---------------------------------------------------------------------------
## ht
## n missing unique
## 189 0 2
##
## No (177, 94%), Yes (12, 6%)
## ---------------------------------------------------------------------------
## ui
## n missing unique
## 189 0 2
##
## No (161, 85%), Yes (28, 15%)
## ---------------------------------------------------------------------------
## ftv
## n missing unique Mean
## 184 5 6 0.799
##
## 0 1 2 3 4 6
## Frequency 97 46 29 7 4 1
## % 53 25 16 4 2 1
## ---------------------------------------------------------------------------
## bwt : Baby weight [grams]
## n missing unique Mean .05 .10 .25 .50 .75
## 189 0 131 2945 1801 2038 2414 2977 3487
## .90 .95
## 3865 3997
##
## lowest : 709 1021 1135 1330 1474, highest: 4167 4174 4238 4593 4990
## ---------------------------------------------------------------------------
Of course, it is also possible to describe only a subset of the data or specific data.
describe(subset(birthwt, select = c(age, race, bwt, low)))
Hmisc
has several helper functions to work with categorical variables,
like dropUnusedLevels()
to remove missing levels or Cs()
to convert
unquoted list of variables names to characters. It also provides a
replacement for R’s cut()
function with better default options (especially
the infamous include.lowest=FALSE
) to discretize a continuous
variable. Here are some examples of use:
table(cut2(birthwt$lwt, g = 4))
##
## [36.4, 50.9) [50.9, 55.5) [55.5, 64.1) [64.1,113.6]
## 53 43 46 47
table(cut2(birthwt$age, g = 3, levels.mean = TRUE))
##
## 18.074 23.091 30.019
## 68 66 54
Using levels.mean=TRUE
will return class center, instead of class
intervals.
There are also a bunch of command dedicated to variables clustering,
analysis of missing patterns, or simple (impute()
) or multiple
(aregImpute()
, transcan()
) imputation methods. Here is how we would
impute missing values with the median in the case of a continuous variable:
lwt <- birthwt$lwt
lwt[sample(length(lwt), 10)] <- NA
lwt.i <- impute(lwt)
summary(lwt.i)
##
## 10 values imputed to 55
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.4 50.0 55.0 59.0 63.6 114.0
Missing observations will be marked with an asterisk when we print the whole
object in R. To use the mean instead of the median, we just have to add the
fun=mean
option.
There are three useful commands that provide summary statistics for a list of variables. They implement the split-apply-combine strategy in the spirit of R’s built-in functions (unlike plyr).
The first one, summarize()
, can be seen as an equivalent to R’s
aggregate()
command. Given a response variable and one or more
classification factors, it applies a specific function to all data chunk,
where each chunk is defined based on factor levels. The results are stored
in a matrix, which can easily be coerced to a data frame (as.data.frame()
or Hmisc::matrix2dataFrame()
).
Remark: Some of the results are shown via the prn()
command.
f <- function(x, na.opts = TRUE) c(mean = mean(x, na.rm = na.opts), sd = sd(x,
na.rm = na.opts))
out <- with(birthwt, summarize(bwt, race, f))
##
## Average baby weight by ethnicity out
##
## race bwt sd
## 3 White 3103 727.9
## 1 Black 2720 638.7
## 2 Other 2805 722.2
Contrary to aggregate()
, this command provides multiway data structure in
case we ask to compute more than one quantity, as the following command will
confirm:
dim(out) ## should have 3 columns
## [1] 3 3
dim(aggregate(bwt ~ race, data = birthwt, f))
## [1] 3 2
Summarizing multivariate responses or predictors is also possible, with
either summarize()
or mApply()
. Of course, any built-in functions, such
as colMeans()
could be used in place of our custom summary command.
with(birthwt, summarize(bwt, llist(race, smoke), f))
## race smoke bwt sd
## 5 White No 3429 710.1
## 6 White Yes 2827 626.5
## 1 Black No 2854 621.3
## 2 Black Yes 2504 637.1
## 3 Other No 2816 709.3
## 4 Other Yes 2757 810.0
The second command, bystats()
, (or bystats2()
for two-way tabular
output) allows to describe with any custom or built-in function one or
multiple outcome by two explanatory variables, or even more. Sample size and
the number of missing values are also printed.
with(birthwt, bystats(cbind(bwt, lwt), smoke, race))
##
## Mean of cbind(bwt, lwt) by smoke
##
## N bwt lwt
## No White 44 3429 63.11
## Yes White 52 2827 57.41
## No Black 16 2854 67.93
## Yes Black 10 2504 64.82
## No Other 55 2816 54.16
## Yes Other 12 2757 56.36
## ALL 189 2945 59.01
with(birthwt, bystats2(lwt, smoke, race))
##
## Mean of lwt by smoke
##
## +----+
## |N |
## |Mean|
## +----+
## +---+-----+-----+-----+-----+
## |No | 44 | 16 | 55 |115 |
## | |63.11|67.93|54.16|59.50|
## +---+-----+-----+-----+-----+
## |Yes| 52 | 10 | 12 | 74 |
## | |57.41|64.82|56.36|58.24|
## +---+-----+-----+-----+-----+
## |ALL| 96 | 26 | 67 |189 |
## | |60.02|66.73|54.55|59.01|
## +---+-----+-----+-----+-----+
The third and last command is summary.formula()
, which can be abbreviated
as summary()
as long as formula is used to describe variables
relationships. There are three main configurations (method=
):
"response"
, where a numerical variable is summarized for each level of
one or more variables (numerical variables will be discretized in 4
classes), as summarize()
does; "cross"
, to compute conditional and
marginal means of several response variables described by at most 3
explanatory variables (again, continuous predictors are represented as
quartiles); "reverse"
, to summarize univariate distribution of a set of
variables for each level of a classification variable (which appears on the
left-hand side of the formula). Variables are viewed as continuous as long
as they have more than 10 distinct values, but this can be changed by
setting, e.g., continuous=5
. With method="reverse"
, it is possible to
add overall=TRUE, test=TRUE
to add overall statistics and corresponding
statistical tests of null effect between the groups.
Here are some examples of use.
summary(bwt ~ race + ht + lwt, data = birthwt)
## Baby weight N=189
##
## +-------+------------+---+----+
## | | |N |bwt |
## +-------+------------+---+----+
## |race |White | 96|3103|
## | |Black | 26|2720|
## | |Other | 67|2805|
## +-------+------------+---+----+
## |ht |No |177|2972|
## | |Yes | 12|2537|
## +-------+------------+---+----+
## |lwt |[36.4, 50.9)| 53|2656|
## | |[50.9, 55.5)| 43|3059|
## | |[55.5, 64.1)| 46|3075|
## | |[64.1,113.6]| 47|3038|
## +-------+------------+---+----+
## |Overall| |189|2945|
## +-------+------------+---+----+
summary(cbind(lwt, age) ~ race + bwt, data = birthwt, method = "cross")
##
## mean by race, bwt
##
## +-------+
## |N |
## |Missing|
## |lwt |
## |age |
## +-------+
## +-----+-----------+-----------+-----------+-----------+-----+
## | race|[ 709,2424)|[2424,3005)|[3005,3544)|[3544,4990]| ALL |
## +-----+-----------+-----------+-----------+-----------+-----+
## |White| 19 | 23 | 20 | 33 | 95 |
## | | 0 | 1 | 0 | 0 |1 |
## | | 55.55 | 57.67 | 62.23 | 63.25 |60.14|
## | | 22.74 | 24.78 | 24.50 | 24.91 |24.36|
## +-----+-----------+-----------+-----------+-----------+-----+
## |Black| 9 | 9 | 6 | 2 | 26 |
## | | 0 | 0 | 0 | 0 |0 |
## | | 65.10 | 59.70 | 70.83 | 93.41 |66.73|
## | | 23.44 | 20.89 | 20.00 | 20.50 |21.54|
## +-----+-----------+-----------+-----------+-----------+-----+
## |Other| 20 | 16 | 19 | 12 | 67 |
## | | 0 | 0 | 0 | 0 |0 |
## | | 51.23 | 52.95 | 58.90 | 55.34 |54.55|
## | | 22.20 | 22.69 | 22.26 | 22.50 |22.39|
## +-----+-----------+-----------+-----------+-----------+-----+
## |ALL | 48 | 48 | 45 | 47 |188 |
## | | 0 | 1 | 0 | 0 |1 |
## | | 55.54 | 56.48 | 61.97 | 62.51 |59.06|
## | | 22.65 | 23.35 | 22.96 | 24.11 |23.27|
## +-----+-----------+-----------+-----------+-----------+-----+
summary(low ~ race + ht, data = birthwt, fun = table)
## low N=189
##
## +-------+-----+---+---+---+
## | | |N |No |Yes|
## +-------+-----+---+---+---+
## |race |White| 96| 73|23 |
## | |Black| 26| 15|11 |
## | |Other| 67| 42|25 |
## +-------+-----+---+---+---+
## |ht |No |177|125|52 |
## | |Yes | 12| 5| 7 |
## +-------+-----+---+---+---+
## |Overall| |189|130|59 |
## +-------+-----+---+---+---+
out <- summary(low ~ race + age + ui, data = birthwt, method = "reverse", overall = TRUE,
test = TRUE)
print(out, prmsd = TRUE, digits = 2)
##
##
## Descriptive Statistics by low
##
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## | |N |No |Yes |Combined | Test |
## | | |(N=130) |(N=59) |(N=189) |Statistic |
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## |race : White |189| 56% (73) | 39% (23) | 51% (96) | Chi-square=5 d.f.=2 P=0.082|
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## | Black | | 12% (15) | 19% (11) | 14% (26) | |
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## | Other | | 32% (42) | 42% (25) | 35% (67) | |
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## |Mother age [years]|188| 19.0/23.0/28.0 23.7+/- 5.6| 19.5/22.0/25.0 22.3+/- 4.5| 19.0/23.0/26.0 23.3+/- 5.3| F=1.5 d.f.=1,186 P=0.22 |
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
## |ui : Yes |189| 11% ( 14) | 24% ( 14) | 15% ( 28) |Chi-square=5.4 d.f.=1 P=0.02|
## +------------------+---+----------------------------+----------------------------+----------------------------+----------------------------+
Note also that tabular output can be converted to graphical displays by
using plot()
like in, e.g.,
plot(out, which = "categorical")
Hmisc
provides replacement for some lattice commands, in particular
xYplot()
and dotchart2()
, or Dotplot()
. In fact, it is also its
strength because we do not need to learn ggplot2 to overcome base
graphics limitations, and using Hmisc
keep in line with lattice charts
(and their multiple options).
Let say we would like to display average birth weight plus or minus one standard error for each class of mother ethnicity. Assuming there is no missing variable we could define a simple function that returns means and associated lower/upper bounds.
se <- function(x) sd(x)/sqrt(length(x))
f <- function(x) c(mean = mean(x), lwr = mean(x) - se(x), upr = mean(x) + se(x))
d <- with(birthwt, summarize(bwt, race, f))
##
## Summary statistics (Mean +/- SE) by group d
##
## race bwt lwr upr
## 3 White 3103 3028 3177
## 1 Black 2720 2594 2845
## 2 Other 2805 2717 2894
xYplot(Cbind(bwt, lwr, upr) ~ numericScale(race, label = "Ethnicity"), data = d,
type = "b", keys = "lines", ylim = range(apply(d[, 3:4], 2, range)) + c(-1,
1) * 100, scales = list(x = list(at = 1:3, labels = levels(d$race))))
An easier (also shorter) solution is to rely on lattice
extra commands, like
library(latticeExtra)
segplot(race ~ lwr + upr, data = d, centers = bwt, horizontal = FALSE, draw.bands = FALSE,
ylab = "Baby weight (g)")
although xYplot()
is very handy when processing model predictions
generated by ols()
or lrm()
, as we will discuss below.
Hmisc
provides automatic labelling of curves or levels of grouping factor,
which are used as in standard lattice graphics (groups=
), without the need
to rely on the directlabels package.
d <- with(birthwt, summarize(bwt, llist(race, smoke), f))
xYplot(Cbind(bwt, lwr, upr) ~ numericScale(race), groups = smoke, data = d,
type = "l", keys = "lines", method = "alt bars", ylim = c(2200, 3600), scales = list(x = list(at = 1:3,
labels = levels(d$race))))
The rms package is used in combination with Hmisc
, which takes care
of data pre-processing and statistical summary. It is devoted to model
fitting, including validation (validate()
) and calibration (calibrate()
)
using bootstrap. It further includes utilities to refine general modeling
strategies and to handle higher-order terms (polymonial or restricted cubic
splines) or ordered catgeorical predictors, see online
help(rms.trans)
. The definitive guide to regression modeling using rms
is
Harrell, F.E., Jr (2001). Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.
The companion website is BIOS 330: Regression Modeling Strategies.
Instead of lm()
, we will use ols()
to perform linear regression, but the
general formulation of the parametric model remains the same: a formula is
used to describe variable relationships (the response variable is on the
left-hand side, while predictors are on the right-hand side). A basic usage
of this command is shown below. To reuse the model for predictions purpose,
the linear predictor must be stored with model results (x=TRUE
).
library(rms)
m <- ols(bwt ~ age + race + ftv, data = birthwt, x = TRUE)
m
##
## Linear Regression Model
##
## ols(formula = bwt ~ age + race + ftv, data = birthwt, x = TRUE)
##
## Frequencies of Missing Values Due to Each Variable
## bwt age race ftv
## 0 1 0 5
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 183 LR chi2 10.37 R2 0.055
## sigma 725.0771 d.f. 4 R2 adj 0.034
## d.f. 178 Pr(> chi2) 0.0347 g 192.508
##
## Residuals
##
## Baby weight [grams]
## Min 1Q Median 3Q Max
## -2132.2374 -499.8035 0.9503 520.4314 1759.2352
##
## Coef S.E. t Pr(>|t|)
## Intercept 2957.5528 261.1681 11.32 <0.0001
## age 5.7498 10.5102 0.55 0.5850
## race=Black -373.3901 163.1956 -2.29 0.0233
## race=Other -295.6800 120.0166 -2.46 0.0147
## ftv 14.4698 51.8045 0.28 0.7803
Note that, contrary to lm()
, the summary()
method (or more precisely,
summary.rms()
) does something else. With ols()
it will print a summary
of the effect of each factor. It requires, however, that the user create a
datadist
object to store values for the predictors entering the
model, and that object must be available in the current namespace. So, the
preceding example becomes:
d <- datadist(birthwt)
options(datadist = "d")
m <- ols(bwt ~ age + race + ftv, data = birthwt, x = TRUE)
summary(m)
## Effects Response : bwt
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 19 26 7 40.25 73.57 -103.95 184.45
## ftv 0 1 1 14.47 51.80 -87.07 116.00
## race - Black:White 1 2 NA -373.39 163.20 -693.25 -53.53
## race - Other:White 1 3 NA -295.68 120.02 -530.91 -60.45
Effect size measures can also be displayed graphically using the
corresponding plot
method:
plot(summary(m))
Note also that in the case of multiple regression it is possible to select baseline category and adjust the effect for a particular value of a continuous predictor, as in the example below.
summary(m, race = "Other", age = median(birthwt$age))
## Effects Response : bwt
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 19 26 7 40.25 73.57 -103.95 184.4
## ftv 0 1 1 14.47 51.80 -87.07 116.0
## race - White:Other 3 1 NA 295.68 120.02 60.45 530.9
## race - Black:Other 3 2 NA -77.71 169.55 -410.02 254.6
A more conventional ANOVA table for the regression can be obtained using
anova()
.
anova(m)
## Analysis of Variance Response: bwt
##
## Factor d.f. Partial SS MS F P
## age 1 157346 157346 0.30 0.5850
## race 2 4529519 2264760 4.31 0.0149
## ftv 1 41017 41017 0.08 0.7803
## REGRESSION 4 5454889 1363722 2.59 0.0381
## ERROR 178 93581138 525737
Measures of influence are available with the which.influence()
command,
and it returns observations that are above a certain threshold with respect
to their DFBETA (default, 0.2). The vif()
command displays variance
inflation factor, which can be used to gauge multicolinearity issue.
which.influence(m)
## $Intercept
## [1] 117 130 131 133
##
## $age
## [1] 110 127 130 131 133 141
##
## $race
## [1] 106 110 131 133 134 138
##
## $ftv
## [1] 68 110 133
vif(m)
## age race=Black race=Other ftv
## 1.092 1.130 1.132 1.056
Model predictions are carried out the R’s way, using fitted()
, or
rms::Predict
. The latter offers additional control over adjustment factor
(like the effects package does), and does not require to create a data
frame as in predict()
. It also handles 95% confidence intervals smoothly.
p <- Predict(m, age = seq(20, 35, by = 5), race, ftv = 1)
xYplot(Cbind(yhat, lower, upper) ~ age | race, data = p, layout = c(3, 1), method = "filled bands",
type = "l", col.fill = gray(0.95))
Logistic regression is handled by the lrm()
function, and it works almost
in the same way, except that it provides more convenient output than R’s
glm()
, especially in terms of adjusted odds-ratio, partial effects,
confidence intervals, or likelihhod ratio test.