The key point is that matrix transpose is a prime example of a problem that appears trivial yet requires tremendous effort to solve efficiently on a CPU. The root causes of this unpleasant surprise are high memory latency and the limited parallelization capabilities of CPUs. Neither of these issues is going to be alleviated in the future. (From "What it takes to transpose a matrix")
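To make the memory-latency point concrete, here is a minimal sketch in C. It is not taken from the article: the function names, the row-major layout, and the `block` parameter are all assumptions chosen for illustration. The naive version reads `src` sequentially but writes `dst` with a stride of `rows` elements, so for large matrices nearly every store can land on a different cache line. The blocked version is one common mitigation: it works on small tiles so that both the source and destination lines of a tile have a chance to stay in cache.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose of a row-major rows x cols matrix into a cols x rows one.
 * Reads are sequential, but writes jump by `rows` elements each step. */
void transpose_naive(const double *src, double *dst, size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}

/* Cache-blocked transpose (illustrative sketch): same result as above,
 * but processed tile by tile. `block` is a hypothetical tuning parameter;
 * a suitable value depends on the cache sizes of the target CPU. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < rows; ii += block) {
        /* Clamp the tile to the matrix edge. */
        size_t i_end = (rows - ii < block) ? rows : ii + block;
        for (size_t jj = 0; jj < cols; jj += block) {
            size_t j_end = (cols - jj < block) ? cols : jj + block;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Both functions produce identical output; they differ only in the order in which memory is touched. Tiling is only a first step, and the sketch makes no claim about how much it helps on any particular machine.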