You can also view the full archives of micro-posts. Longer blog posts are available in the Articles section.
Machine learning in Clojure with XGBoost. Note that there are bindings for the
awesome xgboost in various other languages (Python, Julia, R), not just the JVM.
#clojure
Python didn’t become the leader in the field because it’s inherently better or more performant, but because of scikit-learn, pandas and so on. While as Clojurists we don’t really need pandas (dataframes) or similar stuff (everything is just a map, or, if you care more about memory and performance, a record), we don’t have something like scikit-learn that makes it really easy to train many kinds of machine learning models and somewhat easier to deploy them.
merlin, a unified framework for data analysis, and many other interesting
packages by the same author or coworkers. #stata
Again, I’m slowly updating stata-sk. It took me a while to reset the publishing
system to use Stata 13 MP instead of Stata 15 since I no longer get a free
license for it. This will probably be my last textbook on Stata. #stata
Look. Even Racket has some support for statistical data structures like data
frames. In addition, here is an essential read if you want to get started with
common data structures: An Overview of Common Racket Data Structures. #scheme
An analysis of lossless data compression programs: Large Text Compression Benchmark. (via SO; it looks like it is the very first question on the beta site)
The amount of genomic sequence data being generated and made available through public databases continues to increase at an ever-expanding rate. Downloading, copying, sharing and manipulating these large datasets are becoming difficult and time consuming for researchers. We need to consider using advanced compression techniques as part of a standard data format for genomic data. The inherent structure of genome data allows for more efficient lossless compression than can be obtained through the use of generic compression programs. We apply a series of techniques to James Watson’s genome that in combination reduce it to a mere 4MB, small enough to be sent as an email attachment. – Human genomes as email attachments
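To get a feel for why sequence data compresses better when its structure is taken into account, here is a minimal Python sketch that packs a DNA string into 2 bits per base. This is not the method used in the paper quoted above, just the simplest way to exploit the four-letter alphabet; the names `pack` and `unpack` are made up for illustration, and the code assumes the input contains only A, C, G and T.

```python
# Hypothetical helpers for illustration only (not from the quoted paper).
# Assumes the sequence contains only the characters A, C, G and T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Invert pack(); `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # Drop the padding bases introduced by a short final chunk.
    return "".join(bases[:length])
```

Stored as plain text, each base takes one byte, so this packing alone gives a fourfold reduction before any general-purpose compressor is applied. The original length has to be kept alongside the packed bytes so that the padding can be dropped when unpacking.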
I haven’t yet embraced the full power of Julia for data munging, but this
article is surely a gem for understanding the language at a deeper level. #julia
Useful tips to build and manage R packages: rOpenSci Packages: Development,
Maintenance, and Peer Review. #rstats
Probability and Statistics: a simulation-based introduction, by Bob Carpenter. I
like it when there are instructions for those like me who do not want to install
RStudio to build the book. #rstats
Causal Inference Book, Python code hosted on GitHub (by the author of the Stata kernel). (via @kaz_yos)