Would you be able to explain/show us how it works?
It would be pretty cool to see
You mean the code in the spreadsheet? It normalizes 3D vectors, which is important when doing certain lighting calculations, so there's not much to see. It's software pipelined, which makes it faster than a straightforward implementation.
What it does is this:
Code:
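#include <math.h>    // sqrtf
#include <stddef.h>  // size_t
struct vec3f { float x, y, z; };
// Normalizes count vectors. count must be at least 1, and stride is unused in this simplified version.
void Normalize(vec3f *src, vec3f *dst, size_t stride, size_t count) {
    do {
        // Reciprocal of the vector's length
        float reciplen = 1.0f / sqrtf(src->x*src->x + src->y*src->y + src->z*src->z);
        *dst = vec3f{src->x * reciplen, src->y * reciplen, src->z * reciplen};
        src++; dst++;
    } while (--count);
}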
The reciplen value can be calculated using the SH4 dot product (FIPR) and reciprocal square root (FSRRA) instructions.
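To make that concrete, here is a rough portable C++ model of the two steps. The names fipr_model, fsrra_model and ReciprocalLength are made up for illustration; they are not SH4 intrinsics or part of the spreadsheet code. FIPR works on 4-element vectors, so the sketch assumes the 3D vector is padded with a zero, and the real FSRRA returns an approximation rather than the exact value computed here.

```cpp
#include <cmath>

struct vec3f { float x, y, z; };

// Hypothetical stand-in for FIPR: inner product of two 4-element vectors.
inline float fipr_model(const float a[4], const float b[4]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

// Hypothetical stand-in for FSRRA: reciprocal square root.
// The hardware instruction is approximate; this model is not.
inline float fsrra_model(float x) {
    return 1.0f / std::sqrt(x);
}

// reciplen for one vector: dot the vector with itself, then take 1/sqrt.
inline float ReciprocalLength(const vec3f &v) {
    const float r[4] = { v.x, v.y, v.z, 0.0f };  // assumed zero padding
    return fsrra_model(fipr_model(r, r));
}
```

For example, the vector (3, 0, 4) has length 5, so ReciprocalLength gives about 0.2.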
If you normalize one vector at a time on the SH4, it will spend a lot of time doing nothing while waiting for each instruction to complete. Doing one vector at a time with FIPR and FSRRA, like in the C code, each loop iteration is 14 instructions long and takes 22 cycles to complete. The SH4 can execute two instructions each cycle, so in 22 cycles it could theoretically execute 44 instructions. Just doing 14 means the SH4 could be doing more, if you could find more to do.
And there is more to do, if you work on more than one vector at a time and interleave the work for each vector. While the loads for one vector are completing, it might also be performing multiplies for another vector at the same time, instead of doing nothing like in the one-vector-at-a-time version.
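A rough C++ picture of the idea, not the actual spreadsheet code: the sketch below handles four vectors per iteration and groups the work by stage, so each stage has four independent operations that can overlap. Normalize4 is a made-up name, and it assumes count is a nonzero multiple of four. A compiler is free to reorder all of this, so the hand-scheduled assembly is where the ordering really matters.

```cpp
#include <cmath>
#include <cstddef>

struct vec3f { float x, y, z; };

// Hypothetical sketch: normalize four vectors per loop iteration.
// Assumes count is a nonzero multiple of 4.
void Normalize4(const vec3f *src, vec3f *dst, std::size_t count) {
    for (std::size_t i = 0; i < count; i += 4) {
        // Stage 1: load all four vectors.
        const vec3f a = src[i], b = src[i + 1], c = src[i + 2], d = src[i + 3];

        // Stage 2: four independent dot products (FIPR on the SH4).
        const float la = a.x*a.x + a.y*a.y + a.z*a.z;
        const float lb = b.x*b.x + b.y*b.y + b.z*b.z;
        const float lc = c.x*c.x + c.y*c.y + c.z*c.z;
        const float ld = d.x*d.x + d.y*d.y + d.z*d.z;

        // Stage 3: four independent reciprocal square roots (FSRRA on the SH4).
        const float ra = 1.0f / std::sqrt(la);
        const float rb = 1.0f / std::sqrt(lb);
        const float rc = 1.0f / std::sqrt(lc);
        const float rd = 1.0f / std::sqrt(ld);

        // Stage 4: scale and store.
        dst[i]     = vec3f{a.x * ra, a.y * ra, a.z * ra};
        dst[i + 1] = vec3f{b.x * rb, b.y * rb, b.z * rb};
        dst[i + 2] = vec3f{c.x * rc, c.y * rc, c.z * rc};
        dst[i + 3] = vec3f{d.x * rd, d.y * rd, d.z * rd};
    }
}
```

None of the four vectors depends on another, which is what leaves room to fill the waiting time of one with work from the others.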
In the spreadsheet code, four vectors are normalized in each loop iteration. Each color in the spreadsheet represents the instructions for a different vector. I used the numbers in the leftmost column to help ensure that I kept all instructions in the right order as I moved them around to try to find the best order. Each loop iteration takes 32 cycles, but since it works on 4 vectors each time, the throughput is one vector every 8 cycles. Compared to 22 cycles per vector, that's 2.75 times as fast, so it's almost 3 times faster than doing one at a time.
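If you want to check the arithmetic, it can be written out as a few constants. The numbers are the ones quoted above; the constant and function names are just made up for this sketch.

```cpp
// Figures quoted in this post.
constexpr int kIssueWidth      = 2;   // instructions the SH4 can issue per cycle
constexpr int kSimpleInstr     = 14;  // instructions per iteration, one vector at a time
constexpr int kSimpleCycles    = 22;  // cycles per iteration, one vector at a time
constexpr int kPipelinedCycles = 32;  // cycles per iteration, four vectors at a time
constexpr int kVectorsPerLoop  = 4;

// Theoretical number of instructions that fit in a given number of cycles.
constexpr int IssueSlots(int cycles) { return cycles * kIssueWidth; }

// Average cycles spent on each vector.
constexpr double CyclesPerVector(int cycles, int vectors) {
    return static_cast<double>(cycles) / vectors;
}

// Speedup of the four-at-a-time loop over the one-at-a-time loop.
constexpr double Speedup() {
    return kSimpleCycles / CyclesPerVector(kPipelinedCycles, kVectorsPerLoop);
}
```

IssueSlots(kSimpleCycles) is 44, so the simple loop uses 14 of 44 possible slots, and Speedup() works out to 22 / 8 = 2.75.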
The spreadsheet only contains the code for the loop, but there's more code required to get it to work. It has to do extra work to set stuff up for the loop and finish any incomplete vectors when the loop exits. If the number of vectors is not a multiple of four (or is less than four), you have to work around that as well.
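One common way to handle the leftover vectors, shown here as a C++ sketch and not necessarily how the real routine does it: run the four-at-a-time loop for as many full groups as there are, then finish the remaining zero to three vectors one at a time. NormalizeOne and NormalizeAll are made-up names, and the inner loop over k just stands in for the pipelined loop body.

```cpp
#include <cmath>
#include <cstddef>

struct vec3f { float x, y, z; };

// Hypothetical helper: normalize a single vector.
static void NormalizeOne(const vec3f &s, vec3f &d) {
    const vec3f v = s;  // copy first so src and dst may be the same array
    const float reciplen = 1.0f / std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    d = vec3f{v.x * reciplen, v.y * reciplen, v.z * reciplen};
}

// Hypothetical wrapper: works for any count, including 0 and counts below 4.
void NormalizeAll(const vec3f *src, vec3f *dst, std::size_t count) {
    std::size_t i = 0;

    // Main loop: full groups of four (stand-in for the pipelined loop).
    for (; i + 4 <= count; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            NormalizeOne(src[i + k], dst[i + k]);
        }
    }

    // Tail: the 0 to 3 vectors that did not fill a group.
    for (; i < count; ++i) {
        NormalizeOne(src[i], dst[i]);
    }
}
```

With 6 vectors, the main loop handles the first four and the tail loop handles the last two.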