Archiving my responses on StackExchange

2011-01-20

I decided to back up my posts on stats.stackexchange.com into a single PDF file. This is easy for me because all my responses are already saved on my hard drive in Markdown format. In what follows, I describe how I created my own archive.

All my posts are available as plain text files, named 0001.txt to 0245.txt. Each one includes the title of the question (with a link to stats.stackexchange.com) followed by my response, in Markdown format.

worddle

The hard (and ugly) way

The first thing that comes to mind is to convert each file to an individual HTML file, and then assemble the whole lot into a single file.

$ ./convert_md.sh --html
$ htmldoc --continuous --charset utf-8 --color -f \
  chl_stats_stackechange_2011-01-20.pdf *.html

where convert_md.sh features a simple loop that calls pandoc on the Markdown files. Feeding htmldoc the individual HTML files rather than a single compiled HTML file quickly solves the problem of Markdown links encoded as numbered references, which are not renumbered from one piece of HTML to the next.
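
The script convert_md.sh itself is not shown here. As a rough sketch only, and assuming the input files match [0-9]*.txt, such a loop might look like the following. The function name convert_md, the PANDOC variable, and the mapping of --tex to pandoc's context writer are assumptions made for illustration, not the actual script.

```shell
# Hypothetical sketch of the loop in convert_md.sh (the real script is not shown).
# PANDOC can be overridden, e.g. PANDOC=echo prints the commands instead of running them.
PANDOC="${PANDOC:-pandoc}"

convert_md() {
  # Map the command-line flag to a pandoc output format and a file extension.
  case "$1" in
    --html) fmt=html;    ext=html ;;
    --tex)  fmt=context; ext=tex ;;
    *) echo "usage: convert_md --html|--tex" >&2; return 1 ;;
  esac
  for f in [0-9]*.txt; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    "$PANDOC" -f markdown -t "$fmt" -o "${f%.txt}.$ext" "$f"
  done
}
```

Calling convert_md --html in the folder holding the posts would then write 0001.html next to 0001.txt, and so on for each file.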

The other advantage is that htmldoc fetches all linked images (which are stored on http://i.imgur.com), although this is rather slow. But it turns out not to be such a good idea, because the images are then neither cropped nor resized. As a consequence, it produces an ugly PDF file (not to mention that long lines do not seem to break, and TeX expressions are rendered incorrectly). Generating a single HTML file and then converting it to PDF with Apercu.app does not solve the problem.

The easy (but pretty) way

This time, I decided to assemble all files into a single ConTeXt file. For this to work, I need to convert each file to an independent TeX chunk, decide on a common figure width, and assemble everything into a single document.

$ ./convert_md.sh --tex
$ cat ~/header.tex > all.tex
$ cat [0-9]*.tex >> all.tex
$ echo '\stoptext' >> all.tex

The file header.tex includes the preamble for the master file. This is a slightly modified version of the default header produced when calling markdown2pdf. I'm still working on it to improve the layout and color scheme. (Here is the current copy I am using.)
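
The common figure width could also be imposed after conversion rather than in each chunk. Here is a minimal sketch, assuming pandoc writes each figure as \externalfigure[...] with a single bracketed argument. The function name set_figure_width and the 0.8\textwidth value are arbitrary choices for illustration.

```shell
# Append a width setting to every \externalfigure[...] read from standard input.
# Assumes no figure already carries a second [...] argument; the 0.8\textwidth
# value is an arbitrary choice for illustration.
set_figure_width() {
  sed 's|\\externalfigure\[\([^]]*\)]|\\externalfigure[\1][width=0.8\\textwidth]|g'
}
```

It works as a filter, e.g. set_figure_width < all.tex > all_width.tex, where all_width.tex is a hypothetical output name.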

In principle, ConTeXt handles inline images given as URIs without any problem (the file is just downloaded into the luatex cache, and then reused from that location). Here, though, it doesn't work, and I get a libpng error: Not a PNG file when running context on such a file. It is very likely that there is a problem with URL handling in this particular case.(a) So, for the time being, I just grab the images on the fly and put them in the same folder, and I make a tar.gz right after the ConTeXt files have been processed.

$ grep -o --color=never 'http://i\.imgur\.com/[^]"]*\.png' all.tex > imglist.txt
$ cat imglist.txt | awk '{print $1 " -O";}' | xargs curl -O --silent
$ sed 's|http://i.imgur.com/||g' < all.tex > all2.tex
$ texexec all2.tex

(Note that sed does not require a slash as the delimiter: using | here avoids having to escape the slashes in the URL.)
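
Since the libpng error mentioned earlier complains about a file that is "Not a PNG file", a quick sanity check on the downloaded images is to compare their first eight bytes with the PNG signature. This is only a sketch; the function names is_png and list_bad_pngs are made up for illustration.

```shell
# Return success if the file starts with the 8-byte PNG signature
# (89 50 4E 47 0D 0A 1A 0A), failure otherwise.
is_png() {
  [ -f "$1" ] || return 1
  sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "89504e470d0a1a0a" ]
}

# Print the name of every .png file in the current directory that fails the check.
list_bad_pngs() {
  for f in *.png; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    is_png "$f" || echo "$f"
  done
}
```

Running list_bad_pngs in the folder where the images were saved lists the files that are not actual PNG data, for example an HTML error page saved under a .png name.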

I used curl because wget -q -i imglist.txt doesn't work: it complains that the URL is malformed. I don't know why, but I suspect it is because of the leading i. in the host name. The curl command is a little bit tricky because we need to add a -O option for each line in imglist.txt. Another solution is to use curl -K with a list of URLs in a text file, but then we need to convert this file, replacing the raw URLs with url = "http://...".
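
The conversion needed for curl -K can be sketched with sed. The function name urls_to_curl_config and the file name imglist.cfg are hypothetical.

```shell
# Turn a plain list of URLs (one per line, read from standard input) into
# curl config-file syntax: url = "http://...". Empty lines are dropped.
urls_to_curl_config() {
  sed -n 's|..*|url = "&"|p'
}
```

With the result saved via urls_to_curl_config < imglist.txt > imglist.cfg, a call such as curl --silent --remote-name-all -K imglist.cfg should fetch every file under its remote name, assuming a curl version recent enough to support --remote-name-all. A simpler alternative that avoids the config file altogether is xargs -n 1 curl -O --silent < imglist.txt, which runs curl once per URL.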

I can also produce individual PDF files, and then concatenate them into a single file, e.g.

$ ./convert_md.sh --pdf -s
$ pdfjoin *.pdf -o all.pdf

Here are two screenshots that show how it looks once compiled. A sample PDF file is also available.

sc1

sc2

Notes

(a) However, it works perfectly well with other PNG files, e.g.

\starttext
\externalfigure[http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/ConTeXt_Unofficial_Logo.svg/200px-ConTeXt_Unofficial_Logo.svg.png]
\stoptext

compiles fine with context, whereas replacing the above link with a bit.ly shortcut (http://j.mp/gOcjTS) makes it fail.
