Stata for health researchers


Here is a quick review of the fourth edition of An Introduction to Stata for Health Researchers, by Svend Juul and Morten Frydenberg.

I have several books from Stata Press (and I'd love to get more of them). I have always found them very well written, and they generally offer a good balance between theory and application. This one has a companion website (in addition to the material provided by Stata Press), and there are also some comments on the exercises for the present book.

So, here is a brief sketch of this book: As stated in the comments from the Stata technical group, this book does more than expose the reader to the Stata syntax or commands needed to perform a specific task. First of all, a complete chapter (70 pages) is dedicated to data management, including a specific section on data auditing and archiving. Chapter 3 covers all sorts of statistical analyses, but it mostly focuses on regression modeling, followed by time-to-event analysis and diagnostic measurement. Various add-on commands are presented. A specific chapter is dedicated to graphical displays, using some neat custom schemes (lean schemes). The last chapter is about Stata programming, but I highly recommend Baum's book for that purpose: An Introduction to Stata Programming (Stata Press, 2009).

Incidentally, this book uses the Low birth weight data set from Hosmer and Lemeshow. This is the same data set that I have used for my R and Stata lectures during the past 3 or 4 years. I explain why on Cross Validated. For a MOOC on introductory statistics with R, we decided to use another data set because of possible copyright issues. But see, e.g., for an example of such a copyright notice. However, it is not clear to me whether these data are really copyrighted, and I didn't find any acknowledgment of Wiley in Juul and Frydenberg's book. I remember a discussion on Twitter one or two years ago saying that tables and/or figures from a textbook were copyrighted but not the raw data. Such legal issues are likely to vary from one country to another. Personally, I agree with the following quote from the University of Michigan:

Fair use allows limited use of copyrighted material without permission from the copyright holder for purposes such as criticism, parody, news reporting, research and scholarship, and teaching.

In any case, the Low birth weight data set is available in R, and most of R is under the GPL. The only information given by help(birthwt, package = "MASS") is:


     Hosmer, D.W. and Lemeshow, S. (1989) _Applied Logistic
     Regression._ New York: Wiley


     Venables, W. N. and Ripley, B. D. (2002) _Modern Applied
     Statistics with S._ Fourth edition.  Springer.

Back to the book. The authors offer sound advice on data management, which is not limited to Stata. This reminded me of J. Scott Long's The Workflow of Data Analysis Using Stata, which I discussed briefly in Workflow for statistical data analysis. Regarding R, Roger D. Peng released R Programming for Data Science a few months ago,(a) with several tips on reading and processing data from within R. I have yet to find a good textbook on data management with R, though. I haven't read Data Management Using Stata: A Practical Handbook, by Michael N. Mitchell, so I don't know how it compares to Juul and Frydenberg's book. Surely it features much more material.

Further resources of interest, besides the [D] reference manual (670 pages long as of Stata 13), are listed below:

Data Analysis Using Stata, by Ulrich Kohler and Frauke Kreuter (3rd ed., 2012), is another great textbook for learning Stata. It does not target a medical audience specifically, but most concepts can be applied in biostatistics or data management anyway. Of course, biostatisticians will miss case-control, survival, or time-to-event analysis, which are nicely covered in Juul and Frydenberg's book. Finally, here are two other great textbooks for a Stata-oriented approach to biostatistical modeling:

  1. Dupont, W.D. (2009). Statistical Modeling for Biomedical Researchers. Cambridge University Press (2nd ed.).
    The companion website has a lot of additional resources, including data sets and lecture notes based on the textbook.
  2. Vittinghoff, E., Glidden, D.V., Shiboski, S.C., and McCulloch, C.E. (2005). Regression Methods in Biostatistics. Springer.
    I once started to write R solutions to selected chapters, but I never really finished them because I was missing some data sets.


(a) There is also The Elements of Data Analytic Style, by Jeff Leek. The first chapters are very relevant to users interested in tidying data.

