Want a good example? I was so pleased with myself the other day while trying to optimize the physics engine for a certain port of a PS2 game (which shall remain nameless). I had realized that a quaternion multiplication can be fairly trivially represented as a series of 4 dot products. SWEET. Let's just call into our FIPR intrinsic in KOS! ...not so fast, sunshine.
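For reference, here's the identity I was working from, as plain C (dot4 and quat_mul are my names for illustration; on the SH4, each of those four dot4 calls maps to one FIPR if you can keep the operands in the right registers):

```c
/* The 4-wide dot product that a single FIPR instruction computes. */
static float dot4(const float a[4], const float b[4]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* Hamilton product q = a*b, components stored as {w, x, y, z}.
 * Each output component is one dot product of a sign-flipped,
 * permuted copy of a against b -- i.e. four FIPR candidates. */
static void quat_mul(const float a[4], const float b[4], float q[4]) {
    const float rw[4] = { a[0], -a[1], -a[2], -a[3] };
    const float rx[4] = { a[1],  a[0], -a[3],  a[2] };
    const float ry[4] = { a[2],  a[3],  a[0], -a[1] };
    const float rz[4] = { a[3], -a[2],  a[1],  a[0] };
    q[0] = dot4(rw, b);
    q[1] = dot4(rx, b);
    q[2] = dot4(ry, b);
    q[3] = dot4(rz, b);
}
```

Four dots, no branches -- exactly the shape FIPR is supposedly built for.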
Look at how god-awful the codegen for that actually turns out to be here:
on Compiler Explorer (yeah, you can totally cross-compile for and target DC on the CE site, lololol). For each call to the FIPR intrinsic--which is a single damn instruction--you get EIGHT(!!!!) potentially unnecessary/redundant FMOV instructions emitted to populate the registers with the two 4D vectors BEFORE the FIPR instruction can be used. Enable the preprocessor condition on line 72 to witness naive C kicking FIPR's ass here... Turns out the compiler is not smart enough to understand the register allocation and occupancy of the inline assembly block, so it must naively prepopulate each source register before each call to the inline ASM block, even if the operands are already sitting in the right registers! Basically, calling tiny little inline assembly intrinsics from C is borderline worthless for this reason. There goes our nice, easy HW acceleration for T&L!
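To make the problem concrete, here's roughly what such an intrinsic looks like (fipr4 is a hypothetical name, not the actual KOS macro; the #else branch is a plain-C fallback so the sketch compiles anywhere):

```c
/* Sketch of a FIPR-style intrinsic (hypothetical, for illustration).
 * On SH4, FIPR FV0,FV4 dots fr0-fr3 against fr4-fr7 and leaves the
 * result in fr7 -- so the compiler has to copy BOTH input vectors into
 * those exact registers first, which is where the eight FMOVs come from. */
static inline float fipr4(const float a[4], const float b[4]) {
#if defined(__sh__)
    register float a0 __asm__("fr0") = a[0];
    register float a1 __asm__("fr1") = a[1];
    register float a2 __asm__("fr2") = a[2];
    register float a3 __asm__("fr3") = a[3];
    register float b0 __asm__("fr4") = b[0];
    register float b1 __asm__("fr5") = b[1];
    register float b2 __asm__("fr6") = b[2];
    register float b3 __asm__("fr7") = b[3];
    __asm__ ("fipr fv0, fv4"
             : "+f"(b3)
             : "f"(a0), "f"(a1), "f"(a2), "f"(a3),
               "f"(b0), "f"(b1), "f"(b2));
    return b3;  /* result lands in fr7 */
#else
    /* Plain-C fallback -- coincidentally, what "naive C" compiles to
     * without any of the forced register shuffling. */
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
#endif
}
```

The register-pinning in the __sh__ branch is exactly the constraint the compiler can't see through: it has no idea whether the values are already where they need to be.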
The vector FPU instructions in the SH4 have some oddities compared to the regular FPU instructions. If you perform a register-to-register move, generally on the SH4, the result is ready instantly, on the same cycle the move is issued. You can copy one register to another and perform, say, an addition on it in the same cycle. It's exactly like how, on the Pentium, FXCH could dual-issue with an FP instruction that relied on the result of the FXCH. The SH4 also has zero-cycle FP absolute value and FP negation instructions (but you can't do a move and an abs/negation on the same cycle).
The vector unit does not support any of that. Those zero-cycle instructions behave like they take three cycles when their results feed the vector unit. The two-cycle-latency load instructions also take an extra cycle. (I think I know why the hardware works like this, but it's not really relevant here.)
The FIPR instruction (dot product) becomes a lot less useful because of this. Trying to do stuff like that quaternion multiply with the FIPR instruction means that the FPU is completely idle while loading the source vectors. If you load a single 4D vector using single moves, it takes 6 cycles from the start of the loads (4 cycles to issue the loads, then 2 more for the final load to complete) until the vector unit can begin its operation. If you load two vectors, it's 10 cycles. If you want to do an operation with the regular FPU, it only takes 3 cycles until you can begin the first operation (2 cycles to issue the two loads, then 1 more for the second load to complete).
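Spelled out as a pseudocode schedule (my annotation, just applying the latencies above):

```
! Feeding one FIPR operand vector from memory, one FMOV issued per cycle:
fmov.s @r4+, fr0     ! cycle 0
fmov.s @r4+, fr1     ! cycle 1
fmov.s @r4+, fr2     ! cycle 2
fmov.s @r4+, fr3     ! cycle 3; ready at cycle 6 (2-cycle load latency,
                     ! plus 1 extra because the vector unit consumes it)
fipr   fv4, fv0      ! cycle 6 at the earliest; 10 if fv4 also came from RAM

! The scalar FPU, by contrast:
fmov.s @r4+, fr0     ! cycle 0
fmov.s @r4+, fr1     ! cycle 1; ready at cycle 3
fmul   fr0, fr1      ! cycle 3 at the earliest
```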
Also, you are locked into keeping the entire vector(s) loaded until the FIPR instruction begins, whereas when working with scalars, once you're done with a value you're free to reuse its register for something else.
So the FIPR instruction is situational, and often the flexibility of scalar FPU instructions gives them a speed advantage, because with FIPR the FPU basically stops doing useful work and has to spend its time just loading operands.
FTRV, the matrix-vector multiply instruction, is much more useful since it stores the matrix in a separate register bank, so you save having to spend instructions loading the matrix into registers like you would when using scalar instructions. It also saves registers, since you don't have to hold the intermediate values while doing the matrix multiply.
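In C terms, FTRV computes something like this (the function name is mine; the matrix lives in the back bank, XMTRX, on real hardware, and the column-major layout here is my assumption about how you'd mirror it in memory):

```c
/* What FTRV computes: v' = XMTRX * v, with the 4x4 matrix kept in the
 * back register bank.  Stored column-major here: element (row i, col j)
 * is m[i + 4*j].  No temporaries spill into the front-bank registers. */
static void ftrv(const float m[16], const float v[4], float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = m[i +  0] * v[0] + m[i +  4] * v[1]
               + m[i +  8] * v[2] + m[i + 12] * v[3];
}
```

Doing the same thing with scalar instructions means holding 16 matrix elements plus the accumulators in the front bank the whole time.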
As an example of the weakness of FIPR, suppose you want to do lighting, and dot a bunch of normals with a constant light vector and save the result. If the source and destination are in cache, and you write assembly using 64-bit loads, you can do a FIPR about every 3.5 cycles: load the source, dot it with the constant vector, then save the result (plus the Z value, because switching to 32-bit stores would take another 2 or 3 cycles). The SH4 can do a dot product every cycle, but its load/store instructions don't have enough bandwidth to keep it fed, so the lower bound is the time it takes to load a vector (2 cycles) and write the result (1 cycle), plus some loop overhead. So you get one dot per 3.5 cycles.
Or you could load two light vectors into the matrix register bank, swap the FIPR instruction for FTRV, and get two dots in the exact same amount of time, achieving one dot per 1.75 cycles.
If you load another two light vectors into the matrix, and spend an extra cycle per loop (4.5 cycles) to write out another 64-bit value, you get one dot per 1.125 cycles, more than three times faster than using FIPRs to do the same work.
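The trick, modeled in C (my names; on hardware the four light vectors sit in XMTRX as the rows of the matrix, so one FTRV yields four dot products at once):

```c
/* Four light vectors packed as the ROWS of the transform matrix: a
 * single FTRV then produces dot(L0,n), dot(L1,n), dot(L2,n), dot(L3,n)
 * for one normal in one shot.  (Hypothetical layout for illustration.) */
static void dot4_lights(const float lights[4][4], const float n[4],
                        float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = lights[i][0]*n[0] + lights[i][1]*n[1]
               + lights[i][2]*n[2] + lights[i][3]*n[3];
}
```

Same loads and stores per normal as the FIPR version, but four results instead of one -- which is where the 1.125-cycles-per-dot figure comes from.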
The FIPR instruction rarely seems actually useful. It does help if you want the squared length of a vector, since the enforced sequential multiply-adds of scalar math would outweigh the load delay (and you only need to prepare 4 registers instead of 8), or if you are calculating the inputs to the FIPR instead of loading them from memory, since you're doing useful work while getting the FIPR inputs ready.
One place I did find FIPR useful was some collision detection related code. Calculating the plane equation of a triangle requires dot products on previously calculated results. By manually doing register allocation for GCC, you can still get something useful out of FIPR.
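For reference, the computation in question looks like this in C (helper names are mine): the normal comes from a cross product of two edges, and d from dotting that freshly computed normal with a vertex -- the dot's inputs are already sitting in registers, which is exactly the case where FIPR wins.

```c
/* Plane equation n.x*x + n.y*y + n.z*z + d = 0 for triangle (a, b, c).
 * The final dot's inputs are values we just computed, not memory loads --
 * the situation where FIPR pays off. */
static void tri_plane(const float a[3], const float b[3], const float c[3],
                      float n[3], float *d) {
    float e1[3] = { b[0]-a[0], b[1]-a[1], b[2]-a[2] };
    float e2[3] = { c[0]-a[0], c[1]-a[1], c[2]-a[2] };
    n[0] = e1[1]*e2[2] - e1[2]*e2[1];   /* cross(e1, e2) */
    n[1] = e1[2]*e2[0] - e1[0]*e2[2];
    n[2] = e1[0]*e2[1] - e1[1]*e2[0];
    *d = -(n[0]*a[0] + n[1]*a[1] + n[2]*a[2]);  /* the FIPR candidate */
}
```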
Good, complex T&L is not easy on the DC. For simple T&L, where everything fits in registers, you can easily get 5+ mpoly/s just by taking the source vertices, transforming them, then sending them off to the hardware.
For complex T&L, you need a temporary work buffer while you calculate the vertices, since doing stuff like reloading a matrix multiple times per vertex for skinning is stupid. You'd want to load the matrix once, do all the work you need the matrix for, then load the next one. The naive approach would be to have some large buffer and do all the transforms/skinning, then lighting, etc., each pass in one shot over the whole buffer.
A better way would be to work on only cache-sized chunks, so that if lighting needs the position, it hasn't been pushed out of the cache yet. But even then, you still have to drop parts of the work buffer out of cache when you move on to another block (losing time writing to RAM), and when it's time to submit the T&L results to the hardware, you cache-miss loading the vertices back in from the work buffer.
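A sketch of the chunked approach (all names and the per-pass math are placeholders; CHUNK would be tuned against the SH4's 16 KB data cache, and the real passes would be matrix transforms, skinning, and lighting):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-vertex work record and chunked T&L driver.  CHUNK is
 * picked so the work buffer stays well under the 16 KB data cache; the
 * exact size depends on what else the passes touch. */
#define CHUNK 64

typedef struct { float pos[4]; float color[4]; } WorkVert;

static void transform(WorkVert *w, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)       /* stand-in for the matrix work */
        for (int j = 0; j < 4; j++) w[i].pos[j] = src[i*4 + j] * 2.0f;
}

static void light(WorkVert *w, size_t n) {
    for (size_t i = 0; i < n; i++)       /* position is still in cache */
        for (int j = 0; j < 4; j++) w[i].color[j] = w[i].pos[j] + 1.0f;
}

/* Run every pass over one cache-sized chunk, then submit it, before
 * touching the next chunk -- instead of whole-buffer passes. */
static void tnl(const float *src, float *out, size_t nverts) {
    WorkVert work[CHUNK];
    for (size_t base = 0; base < nverts; base += CHUNK) {
        size_t n = nverts - base < CHUNK ? nverts - base : CHUNK;
        transform(work, src + base * 4, n);
        light(work, n);
        for (size_t i = 0; i < n; i++)   /* "submit": copy results out */
            memcpy(out + (base + i) * 4, work[i].color, 4 * sizeof(float));
    }
}
```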
I've been looking into a meshlet format for DC models. With meshlets, the model is divided up so that the work buffer never leaves cache, meaning both T&L and submission are faster, at the cost of some minor duplicated work for vertices shared between meshlets. For meshlets of 120 vertices, I was only seeing something like an extra ~6% in vertices to T&L, which was more than offset by the speed boost to T&L and submit.
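For concreteness, a meshlet here means something like the following (every field name and size is my guess at a plausible layout, not a finished format):

```c
#include <stdint.h>

/* Hypothetical meshlet: up to 120 unique vertices plus the strip indices
 * that reference them.  120 verts * 8 floats = 3840 bytes, so a meshlet's
 * work buffer sits comfortably inside the SH4's 16 KB data cache with
 * room left for source data.  Vertices shared with a neighboring meshlet
 * get duplicated -- the ~6% overhead mentioned above. */
#define MESHLET_MAX_VERTS 120

typedef struct {
    uint8_t  vert_count;                  /* <= MESHLET_MAX_VERTS */
    uint16_t index_count;
    float    verts[MESHLET_MAX_VERTS][8]; /* say: xyz + uv + normal */
    uint16_t indices[256];                /* strip indices into verts[] */
} Meshlet;
```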