Matrix storage for vector instructions - row major vs column major

FoxMcCloud

Newcomer
Are there any articles discussing row major vs column major matrix storage from a performance perspective on modern game architectures? I'm familiar with the memory layout but none of the articles I found discuss performance considerations on SSE, AVX, PPU / SPU etc.
 
Arrange it so that the dot products are fast. A horizontal dot() (summing across the lanes of a single register) is slower, at least on CPU, and maybe on GPU as well because of bank conflicts.
The fastest should be something like this:

result.xyzw =
vector.xxxx * matrix0.xyzw +
vector.yyyy * matrix1.xyzw +
vector.zzzz * matrix2.xyzw +
vector.wwww * matrix3.xyzw;

That's 1 mul + 3 mads, with no horizontal adds.
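For anyone who wants to try this on SSE, here is a rough sketch of the same broadcast pattern in C. The function name `transform_broadcast` and the flat 16-float layout are my own choices for illustration, not something from a particular engine. It assumes `matrix0`..`matrix3` are the four contiguous groups of four floats in memory, and it falls back to plain scalar code when SSE is not available.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h> /* SSE1 intrinsics */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: out = v.x*g0 + v.y*g1 + v.z*g2 + v.w*g3,
 * where g_i = m[4*i .. 4*i+3] is the i-th contiguous group of four floats. */
static void transform_broadcast(const float m[16], const float v[4], float out[4])
{
#if HAVE_SSE
    __m128 vec = _mm_loadu_ps(v);
    /* _mm_shuffle_ps with _MM_SHUFFLE(i,i,i,i) splats lane i (v.xxxx, v.yyyy, ...). */
    __m128 acc = _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(0, 0, 0, 0)),
                            _mm_loadu_ps(m + 0));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(1, 1, 1, 1)),
                                     _mm_loadu_ps(m + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(2, 2, 2, 2)),
                                     _mm_loadu_ps(m + 8)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(vec, vec, _MM_SHUFFLE(3, 3, 3, 3)),
                                     _mm_loadu_ps(m + 12)));
    _mm_storeu_ps(out, acc);
#else
    /* Scalar fallback with the same arithmetic, for non-x86 targets. */
    for (int c = 0; c < 4; ++c)
        out[c] = v[0] * m[c] + v[1] * m[4 + c] + v[2] * m[8 + c] + v[3] * m[12 + c];
#endif
}
```

Plain SSE has no fused multiply-add, so each mad here is a separate multiply and add. The point is only that every operation works on whole registers and nothing has to be summed across lanes.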
 
Yeah, I know that's what you ultimately want for linear transformations - I'm curious about which one tends to be used to make dot products fast on various current and near-future architectures.
 
That example is row-major, which is not the default in shading languages. Its performance benefit can be negated if you need a lot of transposes. In general, count the number of M and M^T multiplications you have to do, and use row-major if M is in the majority and column-major if M^T is.
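To make the M versus M^T point concrete, here is a small scalar sketch in C. The names `mul_M` and `mul_MT` are made up for illustration, and it assumes a row-vector convention (v * M) with the matrix stored as four contiguous groups of four floats. With the storage fixed, one product walks the groups whole and the other turns into four separate dot products.

```c
/* m is stored as four contiguous groups of four floats: g_i = m[4*i .. 4*i+3]. */

/* v * M: every group is scaled by one component of v and accumulated.
 * This is the access pattern of the broadcast form (1 mul + 3 mad). */
static void mul_M(const float m[16], const float v[4], float out[4])
{
    for (int c = 0; c < 4; ++c)
        out[c] = v[0] * m[c] + v[1] * m[4 + c] + v[2] * m[8 + c] + v[3] * m[12 + c];
}

/* v * M^T on the same storage: each output is dot(v, g_c). Vectorized, that
 * means a horizontal sum per output, or transposing the matrix first. */
static void mul_MT(const float m[16], const float v[4], float out[4])
{
    for (int c = 0; c < 4; ++c)
        out[c] = v[0] * m[4 * c] + v[1] * m[4 * c + 1]
               + v[2] * m[4 * c + 2] + v[3] * m[4 * c + 3];
}
```

If most of your multiplications look like `mul_MT`, storing the transpose instead turns them into the cheap form, which is the counting argument above.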
 