I’m afraid I haven’t made much progress on my Stata Starter Kit since I started the original draft. Partly because I’m busy with other interests, partly because I don’t use Stata that much these days (and my paid licence is for Stata 13 MP), and partly because I give up on a lot of projects along the way, almost surely, as they say in mathematical statistics. My bad! So I thought I would write down some notes I took when studying the history of Stata a few years ago. I warmly recommend reading the proper retrospectives published in the Stata Journal, see below (there are other interesting special volumes, IIRC).
Interestingly, I’ve done most of my professional statistical analyses using statistical packages that provide scripting facilities (rather than menu-driven interaction), mostly R (90%) and Stata (10%). Although I came late to using Stata for real work (I mean, from data cleansing/management to statistical modelling), I always kept an eye on data analysis using Stata, and I learned a lot from the great collection of books published by Stata Press.
Stata is almost 40 years old. The latest stable release (v18) came out on April 25, 2023. Stata is written in C, but users can read the source code of commands/programs written as ado-files using viewsource. Pre-compiled (built-in) code cannot be accessed, though. The article Statistical software certification describes the certification process adopted by StataCorp. Thirty Years with Stata: A Retrospective is a good read, as is volume 5(1) of the Stata Journal, which celebrated 20 years of Stata. Needless to say, Stata comes with extensive documentation. There are 23 manuals bundled with Stata v13, totalling 11365 pages.[1]
Of course, there are the 41 commands that every Stata user should know (as of Stata v13), but there are also these 42 commands, originally found in Stata v1.0:
append     dir       infile   plot     spool
beep       do        input    query    summarize
by         drop      label    regress  tabulate
capture    erase     list     rename   test
confirm    exit      macro    replace  type
convert    expand    merge    run      use
correlate  format    modify   save
count      generate  more     set
describe   help      outfile  sort
As can be inferred, the above commands are mostly concerned with data management for rectangular datasets and regression.
The first version of Stata was a regression package and really nothing more than that. It did a little bit in the way of calculations, and it did some summary statistics, but it was all built around a regression engine. It was written over a one-year period by me initially and by Sean Becketti, who helped me later. I wrote the C code; Sean Becketti helped me a lot with the design. I would say that half of the design is mine and half the design is Sean’s in terms of what the user actually saw. A number of things became available just at that time when we started this project, and it was those things that actually caused the project to start. The first C compiler was available for the PC. — William Gould (A conversation with William Gould, in Thirty Years with Stata: A Retrospective, Enrique Pinzon (ed.), Stata Press 2015)
The plotting system has been entirely reworked starting with Stata v2.0, but you can still get a good-ol’-looking inline plot, e.g.:
. sysuse auto
(1978 Automobile Data)
. plot mpg weight
[ASCII scatterplot: Mileage (mpg) on the vertical axis, from 12 to 41, against Weight (lbs.) on the horizontal axis, from 1760 to 4840]
Most commands are still available, except modify, which has been superseded by replace, while spool, beep and convert have simply been deleted.
I started rewriting part of the above Stata system in Scheme,[2] and my hope is to provide a Racket #lang stata one day. One day…
♪ Nirvana • Breed
[1] On Linux you can run, e.g., for i in /usr/local/stata/docs/*.pdf; do pdfinfo "$i" | grep "^Pages:"; done | awk '{s+=$2} END {print s}' ↩︎
[2] I have always considered Stata primarily column-based, even though it’s written in C and not in Fortran (at least, it’s way easier and way more common to use the if qualifier than the in qualifier), which makes it a perfect fit for list/vector processing. Sean Becketti talks about how useful such a “data rectangle” was back then in Thirty Years with Stata: A Retrospective. ↩︎
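To make the if/in distinction a bit more concrete, here is a minimal Python sketch of a column-based "data rectangle". It is only an illustration of the idea, not Stata code and not a description of how Stata works internally. The helper names select_if and select_in and the toy values are made up for this example.

```python
# A toy "data rectangle": each variable is a column (a list), and all
# columns have the same length. The values are invented for illustration.
data = {
    "mpg":     [22, 17, 22, 20, 15, 41],
    "foreign": [0, 0, 0, 1, 1, 1],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def select_if(data, var, predicate):
    """Mimic an `if` qualifier: keep the values of `var` in every row
    where `predicate(row)` is true, whatever the row's position."""
    n = len(data[var])
    rows = ({name: col[i] for name, col in data.items()} for i in range(n))
    return [row[var] for row in rows if predicate(row)]


def select_in(data, var, first, last):
    """Mimic an `in first/last` qualifier: keep the values of `var` for a
    1-based, inclusive range of observation numbers."""
    return data[var][first - 1:last]


# Selection by condition (roughly: summarize mpg if foreign == 1)
mean_foreign = mean(select_if(data, "mpg", lambda row: row["foreign"] == 1))

# Selection by position (roughly: summarize mpg in 1/3)
mean_first3 = mean(select_in(data, "mpg", 1, 3))
```

The condition-based selection does not depend on how the rows are sorted, whereas the position-based one does, which may be one reason the if qualifier feels more natural for column-wise work.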