Could Dreamcast et al handle this/that game/effect? *DC tech retrospective *spawn

Ninja 2 - same as above, except that more of the hardware was exposed, such as bump mapping, the accumulation buffer and frame buffer options. Performance was greatly improved and the model export settings were greatly expanded to give artists more graphical control. No major game used it, since it came out in late 2000; the only title that did was a 2D dating sim.

The CPU-side speed-up would have gone very well with exposing bump mapping. Sega started experimenting with bump mapping in the likes of Panzer Dragoon Orta, which started life as a Dreamcast game before being rebuilt and completed for Xbox (a very impressive on-rails shooter too, I might add).
 
A more in-depth post on the importance of good transform and lighting code written in SH4 assembly:


It's INCREDIBLY easy on Dreamcast to write some suboptimal T&L code that works perfectly fine in pure C that just cannot achieve 6th gen level polygon counts... In my experience, nearly every indie game and probably most commercial games are just bottlenecking the graphics pipeline in stage 1, with the SH4 CPU unable to feed vertices to the PVR GPU fast enough to fully utilize it.

Indies have been at a big disadvantage so far, and that's assuming the developer was skilled at writing assembly to begin with.

Really sucks that Sega's highly optimised Ninja 2 SDK never got the chance to show what it could do. Pity that the leaked build of Half-Life was using an old version of WinCE (or WinCE at all).
 
Would you be able to explain/show us how it works?

It would be pretty cool to see :)
You mean the code in the spreadsheet? It normalizes 3D vectors, which is important when doing certain lighting calculations, so there's not much to see. It's software pipelined, which makes it faster than a straightforward implementation.

What it does is this:
Code:
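#include <cmath>
#include <cstddef>

struct vec3f { float x, y, z; };

// Normalizes count vectors (count must be at least 1); stride is unused in this simplified version.
void Normalize(const vec3f *src, vec3f *dst, size_t stride, size_t count) {
    do {
        float reciplen = 1.0f / std::sqrt(src->x*src->x + src->y*src->y + src->z*src->z);
        *dst = vec3f{src->x * reciplen, src->y * reciplen, src->z * reciplen};
        src++; dst++;
    } while(--count);
}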

The reciplen value can be calculated using the SH4 dot product (FIPR) and reciprocal square root (FSRRA) instructions.
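
For anyone who wants to see what those two instructions compute, here is a rough C++ sketch. `fipr`, `fsrra` and `ReciprocalLength` are hypothetical stand-in functions written in plain C++, not real compiler intrinsics and not the spreadsheet code. On the hardware, FIPR takes the dot product of two 4-component vectors and FSRRA returns an approximate 1/sqrt(x), so the real result is a little less precise than this emulation.

```cpp
#include <cmath>

struct vec3f { float x, y, z; };

// Hypothetical stand-in for FIPR: dot product of two 4-component vectors.
static float fipr(const float a[4], const float b[4]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

// Hypothetical stand-in for FSRRA: 1/sqrt(x). The hardware result is approximate.
static float fsrra(float x) {
    return 1.0f / std::sqrt(x);
}

// reciplen for one vector: dot the vector with itself (fourth component
// set to 0 so it does not contribute), then take the reciprocal square root.
static float ReciprocalLength(const vec3f &v) {
    const float r[4] = { v.x, v.y, v.z, 0.0f };
    return fsrra(fipr(r, r));
}
```

Multiplying each component of the vector by `ReciprocalLength(v)` gives the normalized vector, which is the same thing the C code above does in one line.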

If you normalize one vector at a time on the SH4, it will spend a lot of time doing nothing while waiting for each instruction to complete. Doing one vector at a time, like in the C code, using FIPR and FSRRA, each loop iteration is 14 instructions long and would take 22 cycles to complete. The SH4 can execute two instructions each cycle, so in 22 cycles it could theoretically execute 44 instructions. Just doing 14 means the SH4 could be doing more, if you could find more to do.

And there is more to do, if you work on more than one vector at a time and interleave the work for each vector. While the loads for one vector are completing, it might also be performing multiplies for another vector at the same time, instead of doing nothing else like in the one-vector-at-a-time version.

In the spreadsheet code, four vectors are normalized in each loop iteration. Each color in the spreadsheet represents the instructions for a different vector. I used the numbers in the leftmost column to help ensure that I kept all instructions in the right order as I moved them around to try to find the best order. Each loop is 32 cycles, but since it works on 4 vectors each loop, the throughput is one vector every 8 cycles, so it's almost 3 times faster than doing one at a time.
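
The cycle arithmetic above can be checked with a few lines of C++. The five constants are the figures quoted in the post; the constant and function names are made up for this sketch, and everything else is derived from those figures.

```cpp
// Figures quoted in the post.
constexpr int kIssueWidth         = 2;   // instructions the SH4 can issue per cycle
constexpr int kSingleInstructions = 14;  // one-vector loop: instructions per iteration
constexpr int kSingleCycles       = 22;  // one-vector loop: cycles per iteration
constexpr int kPipelinedCycles    = 32;  // four-vector loop: cycles per iteration
constexpr int kVectorsPerLoop     = 4;   // vectors handled per pipelined iteration

// Issue slots available during one iteration of the one-vector loop: 22 * 2 = 44.
constexpr int IssueSlots() { return kSingleCycles * kIssueWidth; }

// Fraction of those slots actually used: 14 / 44, roughly a third.
constexpr double SlotUtilization() {
    return static_cast<double>(kSingleInstructions) / IssueSlots();
}

// Pipelined throughput: 32 cycles / 4 vectors = 8 cycles per vector.
constexpr double CyclesPerVectorPipelined() {
    return static_cast<double>(kPipelinedCycles) / kVectorsPerLoop;
}

// Speed-up over the one-vector loop: 22 / 8 = 2.75.
constexpr double Speedup() { return kSingleCycles / CyclesPerVectorPipelined(); }
```

So the "almost 3 times faster" figure works out to 2.75x, and the one-vector loop uses only about a third of the available issue slots.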

The spreadsheet only contains the code for the loop, but there's more code required to get it to work. It has to do extra work to set things up for the loop and to finish any incomplete vectors when the loop exits. If the number of vectors is not a multiple of four (or is less than four), you have to work around that as well.
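
A minimal C++ sketch of that structure, assuming a tightly packed array of vectors. `NormalizeBatched` and `NormalizeOne` are hypothetical names, and this is not the spreadsheet code. It does not reproduce real software pipelining, where stages from different iterations overlap; it only groups the independent work for four vectors into phases and shows how the leftover vectors can be handled.

```cpp
#include <cmath>
#include <cstddef>

struct vec3f { float x, y, z; };

// Plain one-at-a-time version, used for the leftover vectors.
static void NormalizeOne(const vec3f &s, vec3f &d) {
    float r = 1.0f / std::sqrt(s.x*s.x + s.y*s.y + s.z*s.z);
    d = vec3f{s.x * r, s.y * r, s.z * r};
}

// Normalize count vectors: groups of four first, then the remainder.
void NormalizeBatched(const vec3f *src, vec3f *dst, std::size_t count) {
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        float lensq[4], recip[4];
        // Phase 1: squared lengths for all four vectors (FIPR's job on the SH4).
        for (int k = 0; k < 4; ++k) {
            const vec3f &v = src[i + k];
            lensq[k] = v.x*v.x + v.y*v.y + v.z*v.z;
        }
        // Phase 2: reciprocal square roots (FSRRA's job on the SH4).
        for (int k = 0; k < 4; ++k) {
            recip[k] = 1.0f / std::sqrt(lensq[k]);
        }
        // Phase 3: scale each vector by its reciprocal length and store it.
        for (int k = 0; k < 4; ++k) {
            const vec3f &v = src[i + k];
            dst[i + k] = vec3f{v.x * recip[k], v.y * recip[k], v.z * recip[k]};
        }
    }
    // Tail: fewer than four vectors left, including the count < 4 case.
    for (; i < count; ++i) {
        NormalizeOne(src[i], dst[i]);
    }
}
```

With six input vectors, for example, the main loop runs once for the first four and the tail loop handles the last two. A count of zero falls straight through both loops.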
 
Interesting. Did the original PC release of GTA3 include a CPU-based T&L render path, or does it require a true hardware T&L GPU? Also, on PS2, did the game make any use of VU0, or was the entire non-graphics runtime processed on the MIPS core and its internal FPU?
 