It’s interesting to see how little R code I write these days, compared to my early professional career as a medical statistician. I could have some opportunities to do biostatistical stuff, but I am rarely asked to, and my job revolves around genomics these days. That’s fine by me, I still have some opportunities to learn new things, especially in NGS and bioinformatics. At work I mostly use Python and Bash, because they do the job for string processing and task orchestration. When I’m not programming, I’m using more or less esoteric bioinformatics software, whose lifetime I roughly estimate at about 2 years after being described in a peer-reviewed paper. Anyway, I still do some statistical stuff, mostly for fun at home, and most of the time R comes to the rescue, even if I also use Stata and Mathematica.
Here are some R packages I discovered or reinstalled in 2023:
First, I learned that there’s a dedicated website for ggplot extensions. There you will find ggstance (horizontal boxplots, for instance), ggalt (useful for spline fitting) or ggradar (radar or spider charts, which I implemented in base R 14 years ago!). A lot of cool stuff has been developed around ggplot, which is great.
gtsummary as a quick replacement for gt when one wants to summarize a data frame. It produces an HTML page by design. However, it supports several backends, including gt, to output PNG (requires the webshot2 package and a Chrome-based browser), DOCX or TEX (as a longtable) files. Quick one-liner:
library(gt)
library(gtsummary)
tbl_summary(iris) |> as_gt() |> gtsave(filename = "/tmp/out.tex")
janitor to manage your dirty dataset. There’s no magic incantation, though, so I believe it does not exempt the user from closely inspecting, and not merely eyeballing, the data.
parameters for managing and displaying model estimates; not sure I would really need this, but I’ll keep it in mind.
emmeans and margins, although I’m used to the effects and rms packages.
dbplyr and multidplyr: I’m still not well versed in the tidyverse, but I must admit that the dbplyr package offers the best of both worlds, a robust query language on top of an efficient database system. As such, instead of using the data.table package, we can imagine using Arrow to read and/or write large data files, and duckdb to query the collected data using its efficient streaming algorithms; see some examples of use in this tutorial. I’d like to benchmark this approach against a full-blown H2O solution. I was quite surprised by the performance and ease of use of the Python datatable package from the H2O team.
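To give an idea of what the dbplyr and duckdb combination looks like in practice, here is a minimal sketch. It assumes the DBI, dplyr, dbplyr and duckdb packages are installed; the in-memory database and the built-in mtcars data frame are stand-ins chosen for illustration, not a real workload, and no Arrow file is involved here.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: an in-memory DuckDB database is enough for this illustration
con <- dbConnect(duckdb::duckdb())

# Copy a small built-in data frame into the database (stand-in for real data)
dbWriteTable(con, "mtcars", mtcars)

# Reference the remote table; nothing is pulled into R at this point
cars_db <- tbl(con, "mtcars")

# dplyr verbs are translated to SQL and evaluated lazily by DuckDB
query <- cars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) |>
  arrange(cyl)

# Inspect the generated SQL, then fetch the result as a local tibble
show_query(query)
res <- collect(query)

dbDisconnect(con, shutdown = TRUE)
```

The point is that the aggregation runs inside the database engine, and only the small summary table is brought back into R by collect().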
There are many other packages that I once installed but removed soon after using them. There are also a lot of packages that I am not aware of. In fact, I stopped keeping up with new technology years ago, and never really came back, so I’m mostly comfortable with what I learned a long time ago. By the way, it was already cool. Curiously, while I was using the Bioconductor project and its packages 13 years ago for GWAS analysis, I no longer use them, except for RNA-Seq or metagenomics-related stuff.
♪ Benoît Delbecq • Anamorphoses