Python and string concatenation

April 15, 2021

I regularly use Biopython for everything related to bioinformatics and, in particular, processing sequence data. It is a nice package, with submodules like SeqIO or Phylo. The Seq and SeqRecord classes are really useful when it comes to storing and manipulating DNA sequences. Sequences are available as immutable (default) or mutable objects, and they are as easy to work with as plain strings. In fact, you can even use their string representation if you like:

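from Bio.Seq import Seq, MutableSeq

s1 = Seq("ACGT")         # immutable sequence (the default)
s2 = MutableSeq("ACGT")  # mutable sequence
s2[2] = "A"              # in-place edit, only allowed on MutableSeq
s = s1 + s2              # concatenate the two sequences with +
print(str(s))            # plain-string representation of the result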

There’s nothing special about how Biopython handles “Seq” data, except perhaps the use of bytes in place of plain strings, but this led me to wonder whether the + operator is really the best way to concatenate strings, or DNA sequences.

I know at least three ways to concatenate strings in Python: + (or __add__), .join() for specific cases, and the io.StringIO class.¹ Note that I discarded f-strings on purpose. The last option is supposed to be O(1) per append (amortized), while building a string by repeated concatenation is generally an O(n²) operation, which makes sense since each + needs to create a copy of the original string and run through all the letters of the second string to append them. The second string is not necessarily of the same length n, but it doesn’t really matter. There are many other relevant threads on SO, e.g., 1, 2, 3, 4, or 5. The + operator is handy since it is used in other languages as well, e.g. JavaScript or Rust. Those languages also offer alternative ways to concatenate strings (join, append, etc.).²
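To make the three options concrete, here is a minimal sketch using only the standard library. The function names and the DNA fragments are made up for illustration; they are not part of Biopython or of any of the threads mentioned above.

```python
import io

# Made-up DNA fragments, for illustration only.
fragments = ["ACGT", "TTGA", "CCAT"]


def concat_plus(parts):
    """Repeated + (str.__add__): each step may copy the string built so far."""
    out = ""
    for part in parts:
        out = out + part
    return out


def concat_join(parts):
    """str.join: builds the final string from all parts in one call."""
    return "".join(parts)


def concat_stringio(parts):
    """io.StringIO: write each part to an in-memory buffer, read it back once."""
    buffer = io.StringIO()
    for part in parts:
        buffer.write(part)
    return buffer.getvalue()


# All three approaches should produce the same string.
assert concat_plus(fragments) == concat_join(fragments) == concat_stringio(fragments)
```

The three functions differ only in how the intermediate result is stored: a new string at each step, a single final allocation, or a growing buffer.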

On the Python side, various benchmarks regarding which method is best (in terms of time or space complexity) are discussed here and there (including the links provided above). Apparently, f-strings (introduced in Python 3.6) are not so bad in terms of performance, but it looks to me like using “formatting” stuff (much like println! in Rust or format in Lisp) amounts to diverting this type of function from its original purpose, since such functions mostly exist for their side effects. I may be wrong of course.

In any case, the recommendation appears to be: use .join(), especially if you want to conform to the PEP 8 recommendation:

For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.
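If you want to check this on your own machine, a rough timing sketch with timeit could look like the following. The helper names, the fragment, and the default sizes are my own assumptions, and the numbers will depend on the interpreter: on CPython the += optimization mentioned in the quote may well hide most of the difference.

```python
import timeit


def build_with_plus(parts):
    """Accumulate with +=, the form PEP 8 warns against relying on."""
    out = ""
    for part in parts:
        out += part
    return out


def build_with_join(parts):
    """Build the string in one go with ''.join()."""
    return "".join(parts)


def compare(n_parts=10_000, repeats=20):
    """Return (seconds for +=, seconds for join) over `repeats` runs."""
    parts = ["ACGT"] * n_parts  # arbitrary fragment, repeated n_parts times
    t_plus = timeit.timeit(lambda: build_with_plus(parts), number=repeats)
    t_join = timeit.timeit(lambda: build_with_join(parts), number=repeats)
    return t_plus, t_join


if __name__ == "__main__":
    t_plus, t_join = compare()
    print(f"+=   : {t_plus:.4f} s")
    print(f"join : {t_join:.4f} s")
```

This is only a sanity check, not a proper benchmark: it ignores memory usage and uses a single input size.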

  1. For lists, it is even possible to use .extend()↩︎

  2. In Haskell, we even have a join-like feature, thanks to Control.Monad.join↩︎
