I decided to back up my posts on stats.stackexchange.com into a single PDF file. This is easy for me because all my responses are already saved on my hard drive in Markdown format. In what follows, I describe how I managed to create my own archive.
All my posts are available as plain text files, named 0001.txt to 0245.txt. Each file contains the title of the question (with a link to stats.stackexchange.com) followed by my response, in Markdown format.
The first thing that came to mind was simply to convert each file to an individual HTML file, and then assemble everything into a single file:
$ ./convert_md.sh --html
$ htmldoc --continuous --charset utf-8 --color -f \
chl_stats_stackechange_2011-01-20.pdf *.html
In the above instructions, convert_md.sh features a simple loop that calls pandoc on each Markdown file (a sketch is given below). Using htmldoc rather than a compiled HTML file quickly works around the problem of Markdown links being encoded as numbered references, which won't update between the individual pieces of HTML.
The other advantage is that htmldoc will fetch all the linked images (which are stored on http://i.imgur.com), although this is rather slow. It turns out not to be such a good idea, however, because the images are neither cropped nor resized. As a consequence, it produces an ugly PDF file, not to mention that long lines do not break and TeX expressions are rendered incorrectly. Generating a single HTML file, and then converting it to PDF with Apercu.app, does not solve the problem either.
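A minimal version of convert_md.sh could be as simple as the following sketch (this is only a sketch, not the actual script, which also handles the --pdf mode used later):
#!/bin/sh
# Sketch: convert every numbered Markdown file with pandoc.
case "$1" in
  --html) fmt=html;    ext=html ;;
  --tex)  fmt=context; ext=tex ;;
  *)      echo "usage: $0 --html|--tex|--pdf" >&2; exit 1 ;;
esac
for f in [0-9]*.txt; do
  pandoc -f markdown -t "$fmt" -o "${f%.txt}.$ext" "$f"
done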
This time, I decided to assemble all the files into a ConTeXt document. For this to work, I needed each file as an independent $\TeX$ chunk, a common figure width, and a way to glue everything together into a single document:
$ ./convert_md.sh --tex
$ cat ~/header.tex > all.tex
$ cat [0-9]*.tex >> all.tex
$ echo '\stoptext' >> all.tex
The file header.tex includes the preamble for the master file. This is a slightly modified version of the default header produced when calling markdown2pdf. I'm still working on it to enhance the layout and color scheme. Here is the current copy I am using.
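To give an idea, a minimal header along these lines would do the job (a simplified sketch, not the actual file; the settings shown are only illustrative):
% Simplified preamble; the real header.tex tweaks layout and colors further.
\setupcolors[state=start]
\setupinteraction[state=start,color=blue]
\setuppapersize[A4][A4]
\setupbodyfont[11pt]
\setupwhitespace[medium]
\definecolor[headingcolor][r=.2,g=.2,b=.6]
\setuphead[section][style=\tfb,color=headingcolor]
% one way to enforce a common maximum width for all external figures
\setupexternalfigures[maxwidth=\textwidth]
% \stoptext is echoed at the end of the assembled file (see above)
\starttext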
The advantage is that ConTeXt handles inline images referenced by a URI without any problem: the file is simply downloaded into the luatex cache, and then reused from that location. Here, though, it doesn't work, and I get a libpng error: Not a PNG file when running ConTeXt on such a file. It is very likely that there is a problem with URL handling in this particular case. However, it works perfectly well with other PNG files, e.g. the following snippet:
\starttext
\externalfigure[http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/ConTeXt_Unofficial_Logo.svg/200px-ConTeXt_Unofficial_Logo.svg.png]
\stoptext
will compile using context, but replacing the above link with a bit.ly shortcut, e.g. http://j.mp/gOcjTS, doesn't.
So, for the time being, I ended up grabbing the images on the fly and putting them in the same folder, and I make a tar.gz archive just after the ConTeXt files have been processed.
$ grep -o --color=never 'http://i.imgur.com[^"]*\.png' all.tex > imglist.txt
$ cat imglist.txt | awk '{print $1 " -O";}' | xargs curl -O --silent
$ sed 's|http://i.imgur.com/||g' < all.tex > all2.tex
$ texexec all2.tex
(Note that we don't need to use a slash as the delimiter with sed, which saves escaping all the slashes in the URL!)
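As for the tar.gz archive mentioned above, a single command run after texexec is enough (the archive name below is arbitrary):
$ tar czf stats_posts_archive.tar.gz *.tex *.png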
I used curl because wget -q -i imglist.txt doesn't work; it complains that the URL is malformed, and although I don't know why, I suspect it is because of the i. in the hostname. The curl command is a little bit tricky because we need to add an -O option for each URL in imglist.txt, so that the command line built by xargs reads curl -O --silent url1 -O url2 -O and so on. Another solution is to use curl -K with a list of URLs in a text file, but then we need to rewrite each raw URL as url = "http://...".
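For the record, such a list can be generated in one pass; the following sketch should do it (imglist.curlrc is just a name I picked), with each url line paired with a remote-name line, the config-file equivalent of -O:
$ awk '{ print "url = \"" $0 "\""; print "remote-name" }' imglist.txt > imglist.curlrc
$ curl --silent -K imglist.curlrc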
I am now also able to produce individual PDF files (one per post), and then concatenate them into a single file, e.g.
$ ./convert_md.sh --pdf -s
$ pdfjoin *.pdf -o all.pdf
Here are two screenshots that show how it looks once compiled:
A sample PDF file is also available.