To implement this, I need to pre-transpose A and C, and then transpose CT back to C afterwards. Will the overhead of these extra operations outweigh the benefit of the improvement?


With better optimization based on the algebra, we probably do not need to transpose anything until the final result.
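
A rough sketch of the algebra this may be pointing at: since (AB)^T = B^T A^T, a chain of products can stay in the transposed domain, so only one transpose back is needed at the very end. The matrices `B` and `D` and the helpers `transpose` and `matmul` are hypothetical, made up for illustration; they are not from the original discussion, and plain Python lists stand in for whatever matrix type is actually used.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(x, y):
    """Naive matrix product of x (n x k) and y (k x m)."""
    y_cols = transpose(y)  # columns of y, so each dot product walks two rows
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols]
            for row in x]


# Hypothetical inputs (assumed shapes: A is 3x2, B is 2x3, D is 3x2).
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
D = [[1, 0], [0, 1], [2, 3]]

# Transpose the inputs once up front.
AT, BT, DT = transpose(A), transpose(B), transpose(D)

# Work entirely on transposed operands: (A B D)^T = D^T (B^T A^T).
CT = matmul(DT, matmul(BT, AT))

# Single transpose back, only for the final result.
C = transpose(CT)

# Same answer as multiplying in the original orientation.
assert C == matmul(matmul(A, B), D)
```

Whether this pays off depends on how many products share the transposed operands: the one-time transposes cost O(n^2) each, while each product costs O(n^3) with the naive algorithm, so for large matrices the transposes are likely to be a small fraction of the total. That is an expectation to measure, not a guarantee.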