Could Dreamcast et al handle this/that game/effect? *DC tech retrospective *spawn

Why? What reasoning leads you to this?

The main guy was bashed by PS fanboys pretty hard on X.

And he gave as good as he got (and rightly so) but in doing so he's put a target on his back, and if that port isn't 1:1 perfect the PS fanboys are going to dive on him.

And I don't think he'll want the hassle or crap from them should it not be a 1:1 port.

Just a hunch/feeling I have.
 
Nothing to do with fanboys; if anything the guy loves to give it to them. It's simply about getting things done quietly because of legal issues. The port itself is progressing nicely, so in that regard there's nothing to worry about.
 
You really needed to add that context to your original post.
 
Given that it is a port of an unfinished PC decompilation to a less performant platform, it sure won't be 1:1 with the PC version, nor with the PS2 version, which is different from the PC version by the way.
 
What happened to this? It's been more than a month.

It's probably been mentioned before, but it's gonna be months before it's ready. It's still going, just in private. They optimize the game for DC hardware, they find issues with the open tools, they fix them and optimize further in the process. It's quite hefty work. All this while these guys have regular working lives as professional programmers in different fields (and that's not even counting the time they need for their private lives; I guess if these guys didn't work or have families it'd be going at a faster rate). Then some of them privately test it on real hardware. And repeat. You could always join the Discord to check on it.

One thing I'd say was learned from all this is that, with micro-optimization on the CPU, the DC had a lot of headroom left (this might be one of the most intensive games on the hardware, especially since it might have things not found in the console version). But it's just as hard as its contemporaries (PS2) to get tuned performance out of, making its "EASY" reputation a lie. Basically they learned you have to count cycles down to a T, notice which instructions can execute in parallel, and even watch how the compiler sometimes fails to use registers/instructions effectively. Then on top of that you need quality code for intensive things like T&L and physics, pretty much all of which is being hand-tuned specifically for the SH4 at this point. It's gonna be a while for sure.
 
But it's just as hard as its contemporaries (PS2) to get tuned performance out of, making its "EASY" reputation a lie.
I guess 'easier'. ;) But also, as with modern gamedev, getting the basics working could be easy while still leaving performance on the table. Maxing the hardware is hard, which was true of all the to-the-metal consoles. Perhaps one of the issues with DC's results is that devs went with the easy route because it was available, and so didn't make the most of its potential? Whereas on PS2 you had to think about difficult hardware, which probably meant thinking about optimisations and different ways of doing things as you went, and naturally refining your approach.
 
100% this really nails it.

There is a lot of truth to the Dreamcast being fairly easy to develop for, with a very sane, user-friendly architecture, but the more I've worked with the console, the more I'm convinced people tend to conflate this with "it's easy to push the console," which is absolutely not really true.

One of the biggest examples that comes to mind, that has been at the forefront of stuff I've been involved with lately is just freaking T&L. T&L is EVERYTHING on the Dreamcast. That one damn loop for vertex submission, and how well you write it, is what dictates whether you're barely pushing some N64+ level polygon counts or whether you're able to break into the million+ polygons/second range that the console is clearly capable of on the high-end. PLUS, since it's happening on the main CPU, it also dictates how many cycles you have remaining to dedicate to other general game logic.

I'm no expert on the other 6th gen platforms, but I just cannot see how the PS2 with a vector coprocessor, or the GC with T&L on the GPU, or Xbox with shaders can be as sensitive or as hard to push as the Dreamcast is. It's so so freaking easy to just bottleneck the entire graphics pipeline at the first T&L stage before anything even hits the PVR GPU...

Why? Because the Dreamcast does all the T&L on the SH4 CPU... Which is an approach I totally understand Sega taking on paper: the Saturn was a multicore heterogeneous shitshow that was hard as hell to program for... What could be easier than just handling the T&L on the CPU with just some special instructions? Is that not actually a pretty decent, intuitive design, compared to where Sega came from? ...not really in practice.

Trust me when I say that one of the very first things every big commercial DC developer veteran who was pushing polygons tends to mention about coding for it is "hand-writing SH4 assembly using the special vector/matrix instructions," because first of all, the GCC SuperH toolchain doesn't even know shit about FIPR or FTRV on the SH4. It will never emit those instructions. It will never vectorize naive C code to do anything with them. If you don't drop to assembly and call them directly, tough! You can't access a significant portion of the DC's floating-point potential... and for the record, no, this isn't some ancient antiquated GCC back-end. We're using the bleeding-edge latest versions of GCC (14.2.0), and it's still being actively maintained to this day... We have almost the entirety of C23 and C++23, including crazy stuff like C++ async, coroutines, regular expressions, ranges, etc on the Dreamcast. The whole freaking standard libraries... Things a developer would've never dreamed of having on DC back in the day... sh-elf-gcc can even emit the fast sin/cos and inverse sqrt LUT-based instructions with -ffast-math and a few other flags enabling them.

But still, it can't do shit about the vector instructions, and it's not due to a lack of trying. Turns out that they just do not map well to the way that most architectures' SIMD instructions and semantics work, meaning SH is just not able to "automatically inherit" arch-independent vectorization/optimization passes that can do anything useful with them... SIMD on most arches is a whole lot of smaller instructions that you combine to build up what the SH4 does with its two big-ass instructions: FTRV and FIPR, which can do a whole 4x4 matrix * 4D vector transform and a whole 4D dot product all-at-once, respectively. It's very atypical.
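To make the "two big-ass instructions" concrete, here is a rough plain-C model of the math they compute. This is only a sketch of the arithmetic, not SH4 code and not cycle- or precision-accurate; the function names `fipr_model` and `ftrv_model` are made up for illustration, and the column-major layout is an assumption chosen to mirror how the matrix sits in the XF0-XF15 bank.

```c
#include <assert.h>

/* Hypothetical reference model: what a FIPR-style op computes,
 * i.e. a single 4D dot product. */
float fipr_model(const float a[4], const float b[4])
{
    assert(a && b);
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hypothetical reference model: what an FTRV-style op computes,
 * i.e. v = M * v for a 4x4 matrix.
 * Assumption: m is column-major, element (row, col) lives at m[col * 4 + row]. */
void ftrv_model(const float m[16], float v[4])
{
    float out[4];
    assert(m && v);
    for (int row = 0; row < 4; ++row) {
        out[row] = m[0 * 4 + row] * v[0]
                 + m[1 * 4 + row] * v[1]
                 + m[2 * 4 + row] * v[2]
                 + m[3 * 4 + row] * v[3];
    }
    for (int i = 0; i < 4; ++i)
        v[i] = out[i];   /* result overwrites the input vector */
}
```

A generic auto-vectorizer thinks in terms of lane-wise adds and multiplies, so there is no obvious place for it to slot in one instruction that does all sixteen multiplies and twelve adds of `ftrv_model` at once.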

Now for my next point... I know what you're thinking. "K, big deal. Write some inline assembly." Yeah, that'd be the obvious solution, and we absolutely offer our own set of inline ASM-based pseudo "intrinsics" in KallistiOS which expose these SIMD instructions that are so critical for T&L... but too bad the code-gen for calling into them from C is absolute shit.

Want a good example? I was so pleased with myself the other day when going through trying to optimize the physics engine for a certain port of a PS2 game (which shall remain nameless). I had realized that a quaternion multiplication could be fairly trivially represented as a series of 4 dot products. SWEET. Let's just call into our FIPR intrinsic in KOS! ...not so fast, sunshine.

Look at how god-awful the codegen for that actually turns out to be on Compiler Explorer (yeah, you can totally cross-compile for and target DC on the CE site. lololol). For each call to the FIPR intrinsic--which is a single damn instruction--you get EIGHT(!!!!) potentially unnecessary/redundant FMOV instructions emitted to populate the registers with the two 4D vectors BEFORE the FIPR instruction can be used. Enable the preprocessor condition on line 72 to witness naive C kicking FIPR's ass here... Turns out the compiler is not smart enough to understand the register allocation and occupation of the inline assembly block, so it must naively prepopulate each source register before each call to the inline ASM block, even if the operands might already be sitting in its source registers! Basically, calling tiny little inline assembly intrinsics from C is borderline worthless for this reason. There goes our nice, easy HW acceleration for T&L!
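For anyone wondering what "a quaternion multiplication as 4 dot products" looks like, here is a minimal portable sketch. It is not the KOS intrinsic and not the code from the Compiler Explorer link: `dot4` is a plain-C stand-in for where a FIPR would go, and the `Quat` type, the (x, y, z, w) layout with w as the scalar part, and the function names are all assumptions for illustration.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Quat;   /* hypothetical layout, w = scalar part */

/* Plain-C stand-in for a single 4D dot product (the job FIPR does in hardware). */
float dot4(const float a[4], const float b[4])
{
    assert(a && b);
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Hamilton product r = a * b, written as four dot products of `a`
 * against permuted, sign-flipped copies of `b`. */
Quat quat_mul(Quat a, Quat b)
{
    const float av[4] = { a.x, a.y, a.z, a.w };
    const float bx[4] = {  b.w,  b.z, -b.y, b.x };   /* yields r.x */
    const float by[4] = { -b.z,  b.w,  b.x, b.y };   /* yields r.y */
    const float bz[4] = {  b.y, -b.x,  b.w, b.z };   /* yields r.z */
    const float bw[4] = { -b.x, -b.y, -b.z, b.w };   /* yields r.w */
    Quat r;
    r.x = dot4(av, bx);
    r.y = dot4(av, by);
    r.z = dot4(av, bz);
    r.w = dot4(av, bw);
    return r;
}
```

Note that every one of the four dots needs a differently shuffled and negated second operand, so there is a fresh set of four registers to fill before each one, on top of whatever the compiler redundantly reloads.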

This is why, when people talk about pushing polys on DC and working on renderers back during its heyday, they talk about writing a LOT of SH4. You basically have to widen the assembly code coverage to include the entire damn T&L loop, handling register allocation and occupation manually to ensure that you can optimally use instructions like FIPR and FTRV... Hell, at least with something like the PS2's VU, you knew going in that you'd have to write little ASM routines to light and transform the incoming vertex stream, and even then, I'm sure writing such code suboptimally on a dedicated coprocessor is going to be nowhere near as detrimental as writing suboptimal code on the main processor which has to balance resources between T&L and the rest of the logic required for a game.

Imho the DC's approach to T&L here on the SH4 is quite insidious, because it SEEMS like it should be relatively straightforward and simple, yet winds up being far far less simple in practice, and is simply so easy to screw up... and I'm convinced this aspect of targeting the DC alone is one of the reasons why the DC is absolutely not so easy to push or be optimal with as has been commonly cited historically.

I guess 'easier'. ;) But also, as with modern gamedev, getting the basics working could be easy while still leaving performance on the table. Maxing the hardware is hard, which was true of all the to-the-metal consoles. Perhaps one of the issues with DC's results is that devs went with the easy route because it was available, and so didn't make the most of its potential? Whereas on PS2 you had to think about difficult hardware, which probably meant thinking about optimisations and different ways of doing things as you went, and naturally refining your approach.
I really think you nailed it. I really think that's a lot of what tended to happen on the Dreamcast. It's just too easy to write shitty, naive code in plain C or with inline assembly, because yes, the DC has a very approachable, friendly, easy-to-grasp architecture, and hey, T&L was MEANT to happen on the CPU anyway...
 
People apparently have not been interested in getting those instructions wired into GCC. I guess you either have to make an assembler-based vector library or start making patches for GCC. The first seems much faster and more realistic though.
 
The vector FPU instructions in the SH4 have some oddities compared to the regular FPU instructions. If you perform a register-to-register move, generally on the SH4, the result is ready instantly, on the same cycle the move is issued. You can copy one register to another and perform, say, an addition on it in the same cycle. It's exactly like how on the Pentium FXCH could dual issue with an FP instruction that relied on the result of the FXCH. The SH4 also has zero-cycle FP absolute and FP negation instructions (but you can't do a move and abs/negation on the same cycle).

The vector unit does not support any of that. Those zero-cycle instructions behave like they take three cycles if the vector unit tries to use them. The two-cycle latency load instructions also take an extra cycle. (I think I know why the hardware works like this, but it's not really relevant to this.)

The FIPR instruction (dot product) becomes a lot less useful because of this. Trying to do stuff like that quaternion multiply with the FIPR instruction means that the FPU is completely idle while loading the source vectors. If you load a single 4D vector using single moves, it takes 6 cycles from the start of the loads until the vector unit can begin its operation (4 cycles to load a vector, then 2 cycles for the final load to complete). If you load two vectors, it's 10 cycles. If you want to do an operation with the regular FPU, it only takes 3 cycles until you can begin the first operation (2 cycles to load the elements, then 1 for the second load to complete).
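The 6 / 10 / 3 cycle figures above all fall out of the same simple pattern, which can be sketched as a one-line model. This is an assumption-laden simplification read off the numbers in the post (one load issued per cycle, consumer starts once the last load's latency has elapsed); it is not a pipeline simulator, and `cycles_until_ready` is a made-up name.

```c
#include <assert.h>

/* Simplified model: `loads` single-element loads issue back to back, one
 * per cycle, and the consuming instruction can start once the LAST load
 * has completed. `load_latency` is the latency seen by the consumer:
 * 2 cycles for the scalar FPU, 3 for the vector unit per the post. */
int cycles_until_ready(int loads, int load_latency)
{
    assert(loads > 0 && load_latency > 0);
    return loads + (load_latency - 1);
}
```

With these assumptions, `cycles_until_ready(4, 3)` gives the 6 cycles for one vector, `cycles_until_ready(8, 3)` the 10 cycles for two, and `cycles_until_ready(2, 2)` the 3 cycles before the first scalar operation can start.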

Also, you are locked into having the entire vector(s) loaded until the FIPR instruction begins, while when working with scalars, once you're done with a value you're free to use its register for something else.

So the FIPR instruction is situational, and often the flexibility of scalar FPU instructions has a speed advantage, because with FIPR the FPU basically stops doing useful work and has to spend time waiting for the source vectors to load.

FTRV, the matrix-vector multiply instruction, is much more useful since it stores the matrix in a separate register bank, so you save having to spend instructions to load the matrix into registers like you would when using scalar instructions. It also saves registers since you don't have to hold the intermediate values while doing the matrix multiply.

As an example of the weakness of FIPR, suppose you want to do lighting, and dot a bunch of normals with a constant light vector and save the result. If the source and destination are in cache, and you write assembly using 64-bit loads, you can do a FIPR about every 3.5 cycles (this loads in the source, dots it with the constant vector, then saves the result (and the Z value, because switching to 32-bit stores would take another 2 or 3 cycles)). The SH4 can do a dot product every cycle, but its load/store instructions don't have enough bandwidth to keep it fed, so the lower bound is the time it takes to load a vector (2 cycles) and write the result (1 cycle), plus some loop overhead. So you get one dot per 3.5 cycles.

Or you could load two light vectors into the matrix register, swap the FIPR instruction for FTRV, and get two dots for the exact same amount of time, achieving one dot per 1.75 cycles.

If you load another two light vectors to the matrix, and spend an extra cycle per loop (4.5 cycles) to write out another 64-bit value, you can get one dot per 1.125 cycles, more than three times faster than using FIPRs to do the same work.
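The per-dot numbers are just loop cost divided by dots per iteration: 3.5 / 1 = 3.5, 3.5 / 2 = 1.75, and 4.5 / 4 = 1.125, which is about 3.1x better than the single FIPR. The trick that makes it work is that a matrix-vector multiply is four dot products against the matrix rows, so light vectors can ride in the matrix. A minimal portable sketch of that idea follows; `pack_lights` and `mat_mul_vec` are hypothetical names, `mat_mul_vec` is a plain-C stand-in for FTRV, and the column-major layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Build a column-major 4x4 matrix whose ROWS are the light vectors, so
 * that (M * n)[i] == dot(light[i], n). Unused rows are zeroed.
 * `lights` holds count * 4 floats; count may be 1..4. */
void pack_lights(const float *lights, size_t count, float m[16])
{
    assert(lights && count >= 1 && count <= 4);
    for (size_t row = 0; row < 4; ++row)
        for (size_t col = 0; col < 4; ++col)
            m[col * 4 + row] = (row < count) ? lights[row * 4 + col] : 0.0f;
}

/* Plain-C stand-in for the matrix * vector transform (FTRV's job). */
void mat_mul_vec(const float m[16], const float v[4], float out[4])
{
    assert(m && v && out);
    for (size_t row = 0; row < 4; ++row)
        out[row] = m[0 * 4 + row] * v[0] + m[1 * 4 + row] * v[1]
                 + m[2 * 4 + row] * v[2] + m[3 * 4 + row] * v[3];
}
```

Calling `pack_lights` once per batch and `mat_mul_vec` once per normal gives up to four N·L terms per transform, which is where the two-dots and four-dots variants described above come from.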

The FIPR instruction rarely seems actually useful. One place where it is useful is when you want the squared length of a vector, since the enforced sequential multiply-adds from scalar math would outweigh the load delay (and you just need to prepare 4 registers instead of 8). Another is when you are calculating the inputs to the FIPR instead of loading them from memory, since you are doing useful work while getting the FIPR inputs ready.

One place I did find FIPR useful was some collision detection related code. Calculating the plane equation of a triangle requires dot products on previously calculated results. By manually doing register allocation for GCC, you can still get something useful out of FIPR.

Good, complex T&L is not easy on the DC. For simple T&L, where everything fits in registers, you can easily get 5+ mpoly/s just by taking the source vertices, transforming them, then sending them off to the hardware.

For complex T&L, you need a temporary work buffer while you calculate the vertices, since doing stuff like reloading a matrix multiple times per vertex for skinning is stupid. You'd want to load the matrix once, do all the work you need the matrix for, then load the next one. The naive approach would be to have some large buffer, do all the transforms/skinning, then lighting, etc, each in one shot over the whole buffer.

A better way would be to only work on cache-sized chunks, so if lighting needs the position, it hasn't been pushed out of the cache yet. But with that, you will still have to drop parts of the work buffer out of cache when you move on to another block (losing time writing to RAM), and when it's time to submit the T&L results to the hardware, you cache-miss loading in the vertices from work buffer.

I've been looking into a meshlet format for DC models. With meshlets, the model is divided up so that the work buffer never leaves cache, meaning both T&L and submission are faster, at the cost of some minor duplicated work for vertices shared between meshlets. For meshlets of 120 vertices, I was only getting something like an extra ~6% in vertices to T&L, which was more than offset by the speed boost to T&L and submit.
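As a rough illustration of the meshlet split (not the actual format or tool being described), here is a greedy splitter that walks an indexed triangle list in order and starts a new meshlet whenever the next triangle would push the unique vertex count past a cap. The `Meshlet` struct, `build_meshlets`, and the assumption that the index order is already reasonably coherent are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t first_tri;    /* first triangle of this meshlet */
    size_t tri_count;    /* triangles in this meshlet */
    size_t vert_count;   /* unique vertices it references */
} Meshlet;

/* Count the vertices of one triangle not yet in the current meshlet. */
static size_t count_fresh(const uint16_t *tri, const uint32_t *stamp, uint32_t id)
{
    size_t fresh = 0;
    for (int k = 0; k < 3; ++k) {
        int seen = (stamp[tri[k]] == id);
        for (int j = 0; j < k; ++j)          /* repeated index inside the triangle */
            if (tri[j] == tri[k]) seen = 1;
        if (!seen) ++fresh;
    }
    return fresh;
}

/* Greedy split into contiguous triangle ranges. `stamp` is caller-provided
 * scratch with vertex_count entries. Returns the number of meshlets written,
 * or 0 if the mesh is empty or `out` is too small. */
size_t build_meshlets(const uint16_t *indices, size_t tri_count,
                      size_t vertex_count, size_t max_verts,
                      uint32_t *stamp, Meshlet *out, size_t out_cap)
{
    assert(max_verts >= 3);
    memset(stamp, 0, vertex_count * sizeof *stamp);

    size_t n = 0;
    uint32_t id = 1;                          /* stamp value of the current meshlet */
    Meshlet cur = { 0, 0, 0 };

    for (size_t t = 0; t < tri_count; ++t) {
        const uint16_t *tri = indices + t * 3;
        assert(tri[0] < vertex_count && tri[1] < vertex_count && tri[2] < vertex_count);

        size_t fresh = count_fresh(tri, stamp, id);
        if (cur.vert_count + fresh > max_verts) {
            if (n == out_cap) return 0;
            out[n++] = cur;                   /* close the full meshlet */
            ++id;
            cur.first_tri = t;
            cur.tri_count = 0;
            cur.vert_count = 0;
            fresh = count_fresh(tri, stamp, id);
        }
        for (int k = 0; k < 3; ++k) stamp[tri[k]] = id;
        cur.tri_count += 1;
        cur.vert_count += fresh;
    }
    if (cur.tri_count > 0) {
        if (n == out_cap) return 0;
        out[n++] = cur;
    }
    return n;
}
```

Summing `vert_count` over all meshlets and subtracting the mesh's unique vertex count gives the duplicated T&L work; that is the overhead the ~6% figure above is talking about. A real tool would presumably also reorder triangles to keep that number down, which this sketch does not attempt.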
 
That meshlet idea sounds good. Would it also be better for culling since it's all divided into small chunks? How's performance with animated meshes? Curiously I always see static geometry benchmarks for DC but not really skinned meshes.
 