I regularly use Biopython for everything related to bioinformatics and, in particular, for processing sequence data. It is a nice package, with submodules like Phylo. The Seq and SeqRecord classes are really useful when it comes to storing and manipulating DNA sequences. Sequences are available as immutable (default) or mutable objects, and they are as easy to work with as plain strings. In fact, you can even use their string representation if you like:
from Bio.Seq import Seq, MutableSeq

s1 = Seq("ACGT")         # immutable sequence
s2 = MutableSeq("ACGT")  # mutable sequence
s2 = "A"
s = s1 + s2
str(s)[:3]               # slice the plain string representation
There’s nothing special in Biopython on how “Seq” data are handled, except perhaps the use of bytes in place of raw strings, but this led me to wonder whether the + operator is really the best method to concatenate strings, or DNA sequences.
I know at least three ways to concatenate strings in Python: + (or +=), .join() for specific cases, and the io.StringIO module. (Note that we discarded f-strings on purpose.) Appending with the last option is supposed to be O(1), while building a string by repeated concatenation is generally an O(n²) operation overall, which makes sense since each concatenation needs to create a copy of the original string and then run through all the letters of the second string to append them. The second string is not necessarily of the same length n, but it doesn’t really matter. There are many other relevant threads on SO, e.g., 1, 2, 3, 4, or 5. The
On the Python side, various benchmarks regarding which method is best (in terms of time or space complexity) are discussed here and there (including the links provided above). Apparently, f-strings are not so bad in terms of performance, especially since Python 3.6, but it looks to me like using “formatting” stuff (much like println! in Rust or format in Lisp) amounts to diverting this type of function from its original purpose, since such functions mostly exist for their side effects. I may be wrong, of course.
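For readers who want to check the f-string claim on their own machine, here is a minimal sketch using timeit from the standard library. The two fragments, the function names, and the repetition count are arbitrary assumptions made for illustration, and timings depend on the machine and the Python version, so no figures are given.

```python
import timeit

# Two arbitrary example fragments (assumed for illustration only).
a = "ACGT" * 25
b = "TTGA" * 25


def with_plus():
    """Concatenate with the + operator."""
    return a + b


def with_fstring():
    """Concatenate by interpolating both strings into an f-string."""
    return f"{a}{b}"


def with_join():
    """Concatenate with str.join on a two-element tuple."""
    return "".join((a, b))


# All three build the same string; only the mechanism differs.
assert with_plus() == with_fstring() == with_join()

# Time each variant; timeit.timeit accepts a zero-argument callable.
for fn in (with_plus, with_fstring, with_join):
    elapsed = timeit.timeit(fn, number=10_000)
    print(f"{fn.__name__:>12}: {elapsed:.4f} s")
```

Joining only two strings is the least favorable case for .join(), so this says little about accumulating many pieces in a loop.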
For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
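The three approaches can be put side by side as follows. This is only a sketch: the function names and the list of 1000 "ACGT" fragments are hypothetical, and chars_copied_naive is a simple model of the cost when every += builds a brand-new string (that is, without CPython's in-place optimization), not a measurement.

```python
import io


def concat_plus(parts):
    """Accumulate with +=, the pattern the passage above warns about."""
    out = ""
    for part in parts:
        out += part
    return out


def concat_join(parts):
    """Build the result in one pass with str.join."""
    return "".join(parts)


def concat_stringio(parts):
    """Write each piece to an in-memory text buffer, then read it back."""
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part)
    return buffer.getvalue()


def chars_copied_naive(parts):
    """Count characters copied if each += creates a new string from scratch."""
    copied = 0
    length = 0
    for part in parts:
        length += len(part)
        copied += length  # each step rewrites the whole accumulated string
    return copied


parts = ["ACGT"] * 1000  # hypothetical input: 1000 short fragments
expected = "ACGT" * 1000

# The three methods are interchangeable as far as the result goes.
assert concat_plus(parts) == concat_join(parts) == concat_stringio(parts) == expected

# Under the naive model the copy count grows quadratically with the number
# of pieces, whereas the final string only has len(expected) characters.
print(chars_copied_naive(parts), len(expected))
```

Under that model, doubling the number of fragments roughly quadruples the characters copied by +=, while the work done by .join() grows in proportion to the final length.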