aliquote.org

Murmurhash

March 12, 2020

The MurmurHash algorithm belongs to non-cryptographic hash function and it is used for hash-based lookup, e.g. to convert a word $w$ to an integer $s \in [0, H)$, where $H$ is the table size. I came across this hash function by reading a recent publication on a new read mapper.1 It has been shown to have good properties in terms of speed and number of collisions compared to other hashing algoritms; see also Hash functions: An empirical comparison.2

Although different implementations of Murmurhash3 are cited on the Wikipedia page (see also the mmh3/digest packages for Python/R), I couldn’t find one in Scheme. So here is a quick and dirty one in Chicken scheme, using its solid FFI facilities and some C code grabbed from the Wikipedia page. There’s also the digest R package (here’s the C code in question for the interested reader) and some C++ code available on smhasher on Github, but I haven’t tried it yet.

Please note that things have a little bit changed with Chicken 5, compared to this excellent blog post: A (mostly) comprehensive guide to calling C from Scheme and vice versa. In particular, to import the foreign module, you now have to write (import (chichen foreign)) instead of (import foreign). Say you have this little snippet:

(import (chicken foreign))

(foreign-declare "#include <math.h>")

(define sin-pi
  (foreign-lambda* double ((double x))
    "double val = x * M_PI;"
    "C_return(sin(val));"))

(display (sin-pi 0.5))

Then you can generate a binary for your platform using the Chicken compiler, csc. The interactive Chicken REPL (csi) won’t work with the foreign import. First, I wanted to test the C code so I added a little header and print statement (murmur32.c). Let us check that it works:

$ cc murmu32.c -o murmur32
$ ./murmur32
1498610893

Here is what I got using R, which also implements MurmurHash3 for 32/128 bit (x86/x64):

> digest::digest("murmur", "murmur32", seed = 101, serialize = FALSE)
[1] "5952fccd"

This also happens to be 1,498,610,893 in decimal notation.

Now, let’s write some Scheme wrapper to use this little C procedure, while trying to handle C types correctly: an uint_8 is an unsigned char, or a foreign blob in case it is just a pointer, while an uint32_t corresponds to an unsigned int (4 bytes). And so, we get the following piece of code:

(import (chicken foreign)
        (chicken format))

#>
  extern uint32_t murmur3_32(const uint8_t* key, size_t len, uint32_t seed);
<#

(define m32 (foreign-lambda unsigned-int32 "murmur3_32" blob size_t unsigned-int32))

(printf "~A " (m32 "murmur" 6 101))

But first, we need to compile the core C functions (no need for a shared library, though):

$ cc -c lib-m32.c
$ csc -o m32 lib-m32.o m32.scm

And finally:

$ ./m32
1498610893

  1. Edgar, R. C. (2020). URMAP, an ultra-fast read mapper. bioRxiv ↩︎

  2. See this related post for how we define and quantify hash collisions. ↩︎

See Also

» Josephus Problem » Palindrome number » Insertion sort and Python FFI » Racket FFI and C » Prime numbers in Scheme