De Bruijn graph are common data structures in bioinformatics, especially when it comes to perform genome assembly. John D. Cook discussed application of De Buijn sequences in two of his latest posts. Briefly, a De Bruijn sequence represents the shortest sequence of symbols from an alphabet of size k that contains every possible subsequence of length n. Things get confusing if we think of k-mer, since in this case k refers to the length of the substring, and not the 4-character alphabet for DNA or the total length of the DNA sequence. Anyway, assuming k = 4 and n = 3, John Cook shows that AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTT
is a De Bruijn sequence for all 43 triples of DNA base pairs, provided you allow wrapping around (e.g., TAA would start in position 64).
De Bruijn Graph assembly (PDF), by Ben Langmead, provides a very detailed account of how De Bruijn graphs work in practice in the case of genome assembly. Assuming perfect sequencing where each length-k substring is sequenced exactly once with no errors, let’s pick a substring length, say k = 30 to 50 (i.e., ≤ shortest read length), take each k-mer and split them into left and right k-1 mers. Then, add those k-1 mers as nodes to De Bruijn graph (if not already there), with edge from left k-1 mer to right k-1 mer.
Here is a short Python snippet to construct a De Bruijn graph, directly inspired from one of Rosalind problems, assuming seq
is a list of reads:
def dbg(seq):
"""De Bruijn graph"""
g = {}
for s in seq:
l, r = s[:-1], s[1:]
if l not in g:
g[l] = set()
if r not in g:
g[r] = set()
g[l].add(r)
return g
As you can see, the idea is simply to store the graph as a dict
, which is a pretty standard way to represent a graph in Python, and to use set
s to extract left (all but the last) and right (all but the first) k mer. In Scheme, we would write (cdr s)
and (cdr (reverse s))
.
Consider the following example data, and recall that we need both the original reads (S) and their reverse complements, since we are interested in the De Bruijn graph of S ⋃ Sr:
def rev_c(s):
"""Reverse complement"""
sub = str.maketrans("ATGC", "TACG")
return s[::-1].translate(sub)
reads = ["TGAT", "CATG", "TCAT", "ATGC", "CATC", "CATC"]
grph = dbg(reads + list(map(rev_c, reads)))
Let’s see it in action:
In [49]: [print("%s => %s" % (k, " ".join(list(v)))) for k, v in grph.items() if len(v) > 0]
TGA => GAT
GAT => ATG
CAT => ATC ATG
ATG => TGA TGC
TCA => CAT
ATC => TCA
GCA => CAT
In the case of genome assembly, if sequencing was performed without any error this procedure yields an Eulerian graph, and an Eulerian walk can be found in $O(|E|)$ time, where $|E|$ is the number of edges (which equals the number of k-mers). The number of nodes is at most $2 \cdot |E|$, but it will typically be much smaller due to repeated k-1-mers. However, there can be more than one Eulerian walk, and uneven coverage, sequencing errors, or even sequence repeats may lead to disconnected graphs, which implies that only connected components will be Eulerian, not the overall graph.1 Ben Langmead has some additional benchmark data, especially regarding how the size of the De Bruijn graph grows sublinearly when average coverage is high (i.e., genome length ≪ total number of reads).
Kingsford, Carl, Michael C. Schatz, and Mihai Pop. “Assembly complexity of prokaryotic genomes using short reads.” BMC bioinformatics 11.1 (2010): 21. ↩︎