> The key point is that matrix transpose serves as a prime example of the
> problem which appears trivial but nevertheless requires tremendous effort to
> be solved efficiently for CPU. Root causes of this unpleasant surprise can be
> traced to high memory latency and limited parallelization capabilities of
> CPUs. Neither of these issues is going to be alleviated in the future.
>
> Source: What it takes to transpose a matrix
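
To make the memory-latency point concrete, here is a minimal sketch in C that contrasts a naive transpose with a cache-blocked (tiled) one. It is not the method of the quoted article, only a common mitigation shown for illustration. The function names `transpose_naive` and `transpose_blocked` are hypothetical, and the tile size `TILE = 32` is an assumed value that would need tuning per machine. Both functions assume row-major storage.

```c
#include <stddef.h>

/* Assumed tile edge (hypothetical value): pick it so that two
   TILE x TILE tiles of doubles fit comfortably in the L1 data cache. */
#define TILE 32

/* Naive transpose of a rows x cols row-major matrix.
   Reads of src are sequential, but consecutive writes to dst are
   `rows` elements apart, so each one tends to touch a new cache line. */
void transpose_naive(const double *src, double *dst,
                     size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}

/* Blocked transpose: process the matrix one TILE x TILE tile at a time
   so that the cache lines touched in src and dst are reused before
   they are evicted. Edge tiles are clipped to the matrix bounds. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    for (size_t ii = 0; ii < rows; ii += TILE) {
        size_t i_end = (ii + TILE < rows) ? ii + TILE : rows;
        for (size_t jj = 0; jj < cols; jj += TILE) {
            size_t j_end = (jj + TILE < cols) ? jj + TILE : cols;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Both functions produce the same result; only the order of memory accesses differs. Any speedup from tiling depends on matrix size, cache hierarchy, and compiler, so it should be measured rather than assumed.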