One thing I'd say was learned from all this is that, with micro-optimization on the CPU, the DC had a lot of headroom left (this might be one of the most intensive games on the hardware, especially since it might have things not found in the console version). But it's just as hard as its contemporaries (PS2) to get tuned performance out of it, making its "EASY" reputation a lie. Basically, they learned you have to count cycles down to a T, notice which instructions can execute in parallel, and even watch how the compiler sometimes fails to use registers/instructions effectively. Then on top of that, quality code for intensive things like T&L/physics pretty much all has to be hand-tuned specifically for the SH4 at this point. It's gonna be a while, for sure.
100% this really nails it.
There is a lot of truth to the Dreamcast being fairly easy to develop for, with a very sane, user-friendly architecture, but the more I've worked with the console, the more I'm convinced people tend to conflate this with "it's easy to push the console," which is absolutely not true.
One of the biggest examples that comes to mind, and one that has been at the forefront of stuff I've been involved with lately, is just freaking T&L. T&L is EVERYTHING on the Dreamcast. That one damn loop for vertex submission, and how well you write it, is what dictates whether you're barely pushing some N64+ level polygon counts or whether you're able to break into the million+ polygons/second range that the console is clearly capable of on the high end. PLUS, since it's happening on the main CPU, it also dictates how many cycles you have remaining to dedicate to other general game logic.
I'm no expert on the other 6th gen platforms, but I just cannot see how the PS2 with a vector coprocessor, or the GC with T&L on the GPU, or Xbox with shaders can be as sensitive or as hard to push as the Dreamcast is. It's so so freaking easy to just bottleneck the entire graphics pipeline at the first T&L stage before anything even hits the PVR GPU...
Why? Because the Dreamcast does all the T&L on the SH4 CPU... Which is an approach I totally understand Sega taking on paper: the Saturn was a multicore heterogeneous shitshow that was hard as hell to program for... What could be easier than just handling the T&L on the CPU with just some special instructions? Is that not actually a pretty decent, intuitive design, compared to where Sega came from? ...not really in practice.
Trust me when I say that one of the very first things every big commercial DC developer veteran who was pushing polygons tends to mention about coding for it is "hand-writing SH4 assembly using the special vector/matrix instructions," because, first of all, the GCC SuperH toolchain doesn't even know shit about FIPR or FTRV on the SH4. It will never emit those instructions. It will never vectorize naive C code to do anything with them. If you don't drop to assembly and call them directly, tough! You can't access a significant portion of the DC's floating-point potential... and for the record, no, this isn't some ancient, antiquated GCC back-end. We're using the bleeding-edge latest version of GCC (14.2.0), and it's still being actively maintained to this day... We have almost the entirety of C23 and C++23, including crazy stuff like C++ async, coroutines, regular expressions, ranges, etc. on the Dreamcast. The whole freaking standard libraries... things a developer would've never dreamed of having on DC back in the day... sh-elf-gcc can even emit the fast LUT-based sin/cos and inverse square root instructions with -ffast-math and a few other flags enabling them.
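(For the curious, here's roughly what I mean. This is just an illustrative sketch: compile with something like `sh-elf-gcc -O2 -ffast-math -mfsrra -mfsca`, and GCC's SH backend can lower these patterns to the SH4's single-instruction approximations; exact flag defaults and codegen vary by GCC version.)

```c
#include <math.h>

/* With -ffast-math (which implies -funsafe-math-optimizations) plus
   -mfsrra/-mfsca, these become candidates for single SH4 instructions: */

float inv_sqrt(float x) {
    return 1.0f / sqrtf(x);   /* candidate for one FSRRA (approx. 1/sqrt(x)) */
}

float sine(float rad) {
    return sinf(rad);         /* candidate for FSCA, the LUT-based unit that
                                 actually produces sin AND cos in one shot */
}
```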
But still, it can't do shit about the vector instructions, and it's not due to a lack of trying. Turns out they just do not map well to the way most architectures' SIMD instructions and semantics work, meaning SH is just not able to "automatically inherit" the arch-independent vectorization/optimization passes and have them do anything useful. SIMD on most arches is a whole lot of little instructions that together add up to what the SH4 does with its two big-ass instructions, FTRV and FIPR, which perform a whole 4x4 matrix * 4D vector transform and a whole 4D dot product all at once, respectively. It's very atypical.
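To put that in concrete terms, here's the plain-C equivalent of what each of those two instructions computes in a single shot (illustrative only; as said above, GCC will never turn this C into FIPR/FTRV, and I'm glossing over XMTRX's storage convention):

```c
/* FIPR FVm,FVn: an entire 4D dot product in ONE instruction. */
float dot4(const float a[4], const float b[4]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* FTRV XMTRX,FVn: an entire 4x4 matrix * 4D vector transform in ONE
   instruction, against the 16 back-bank "XMTRX" matrix registers. */
void mat4_xform(const float m[4][4], const float v[4], float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = m[i][0]*v[0] + m[i][1]*v[1] + m[i][2]*v[2] + m[i][3]*v[3];
}
```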
Now for my next point... I know what you're thinking. "K, big deal. Write some inline assembly." Yeah, that'd be the obvious solution, and we absolutely offer our own set of inline ASM-based pseudo "intrinsics" in KallistiOS which expose these SIMD instructions that are so critical for T&L... but too bad the code-gen for calling into them from C is absolute shit.
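To give you an idea of the shape of such an intrinsic, here's a simplified sketch (NOT KOS's literal implementation). FIPR only operates on fixed 4-register banks, so every operand has to be pinned to a specific FR register before the instruction can issue (single-precision FPU mode assumed):

```c
/* Sketch of an inline-asm FIPR "intrinsic": dot product of (ax,ay,az,aw)
   and (bx,by,bz,bw). The two vectors must land in banks FV8 and FV12. */
static inline float dc_fipr(float ax, float ay, float az, float aw,
                            float bx, float by, float bz, float bw)
{
    register float a0 __asm__("fr8")  = ax;   /* vector A -> bank FV8  */
    register float a1 __asm__("fr9")  = ay;
    register float a2 __asm__("fr10") = az;
    register float a3 __asm__("fr11") = aw;
    register float b0 __asm__("fr12") = bx;   /* vector B -> bank FV12 */
    register float b1 __asm__("fr13") = by;
    register float b2 __asm__("fr14") = bz;
    register float b3 __asm__("fr15") = bw;

    /* fipr fv8,fv12: 4D dot product; the result lands in fr15. */
    __asm__("fipr fv8, fv12"
            : "+f" (b3)
            : "f" (a0), "f" (a1), "f" (a2), "f" (a3),
              "f" (b0), "f" (b1), "f" (b2));
    return b3;
}
```

Those pinned registers are exactly where the trouble starts, as you're about to see.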
Want a good example? I was so pleased with myself the other day when going through and trying to optimize the physics engine for a certain port of a PS2 game (which shall remain nameless). I had realized that a quaternion multiplication could be fairly trivially represented as a series of 4 dot products. SWEET. Let's just call into our FIPR intrinsic in KOS! ...not so fast, sunshine.
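(For concreteness: if you write out the Hamilton product q = a*b with quaternions stored in (w, x, y, z) order, each output component is literally a 4D dot product of `a` against a shuffled, sign-flipped copy of `b` -- four perfect FIPR candidates. Sketch below, using the hypothetical `dc_fipr` wrapper from earlier:)

```c
/* Quaternion multiply out = a*b, (w,x,y,z) order: four 4D dot products. */
void quat_mul(const float a[4], const float b[4], float out[4]) {
    out[0] = dc_fipr(a[0], a[1], a[2], a[3],  b[0], -b[1], -b[2], -b[3]); /* w */
    out[1] = dc_fipr(a[0], a[1], a[2], a[3],  b[1],  b[0],  b[3], -b[2]); /* x */
    out[2] = dc_fipr(a[0], a[1], a[2], a[3],  b[2], -b[3],  b[0],  b[1]); /* y */
    out[3] = dc_fipr(a[0], a[1], a[2], a[3],  b[3],  b[2], -b[1],  b[0]); /* z */
}
```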
Look at how god-awful the codegen for that actually turns out to be on Compiler Explorer (yeah, you can totally cross-compile for and target DC on the CE site, lololol). For each call to the FIPR intrinsic--which is a single damn instruction--you get EIGHT(!!!!) potentially unnecessary/redundant FMOV instructions emitted to populate the registers with the two 4D vectors BEFORE the FIPR instruction can even be used. Enable the preprocessor condition on line 72 to watch naive C kick FIPR's ass here... Turns out the compiler is not smart enough to reason about the register allocation and occupancy of the inline assembly block, so it must naively prepopulate each source register before each call into the inline ASM, even if the operands might already be sitting in the right registers! Basically, calling tiny little inline assembly intrinsics from C is borderline worthless for this reason. There goes our nice, easy HW acceleration for T&L!
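For reference, this is the kind of "naive C" version that wins here: with no opaque asm block in the way, GCC is free to allocate and schedule registers across the entire function instead of marshalling operands into pinned registers for every single dot product.

```c
/* Same quaternion multiply in plain C -- no intrinsics, no pinned
   registers, and (ironically) better codegen than the FIPR version. */
void quat_mul_naive(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3]; /* w */
    out[1] = a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2]; /* x */
    out[2] = a[0]*b[2] - a[1]*b[3] + a[2]*b[0] + a[3]*b[1]; /* y */
    out[3] = a[0]*b[3] + a[1]*b[2] - a[2]*b[1] + a[3]*b[0]; /* z */
}
```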
This is why, when people talk about pushing polys on DC and working on renderers back during its heyday, they talk about writing a LOT of SH4. You basically have to widen the assembly coverage to include the entire damn T&L loop, handling register allocation and occupancy manually, to ensure you can optimally use instructions like FIPR and FTRV... Hell, at least with something like the PS2's VU, you knew going in that you'd have to write little ASM routines to light and transform the incoming vertex stream. And even then, I'm sure writing such code suboptimally on a dedicated coprocessor is nowhere near as detrimental as writing suboptimal code on the main processor, which has to balance resources between T&L and the rest of the logic required for a game.
Imho, the DC's approach of doing T&L on the SH4 is quite insidious, because it SEEMS like it should be relatively straightforward and simple, yet it winds up being far, far less so in practice and is just so easy to screw up... and I'm convinced this aspect of targeting the DC alone is one of the reasons why the console is nowhere near as easy to push or to get optimal results out of as has been commonly claimed historically.
I guess 'easier'.
But also, as with modern gamedev, getting the basics working can be easy while still leaving performance on the table. Maxing the hardware is hard, which was true of all the to-the-metal consoles. Perhaps one of the issues with the DC's results is that devs went with the easy route because it was available, and so didn't make the most of the hardware's potential? Whereas on PS2 you had to think about the difficult hardware, which probably meant thinking about optimisations and different ways of doing things as you went, naturally refining your approach.
I really think you nailed it. I really think that's a lot of what tended to happen on the Dreamcast. It's just too easy to write shitty, naive code in plain C or with inline assembly, because yes, the DC has a very approachable, friendly, easy-to-grasp architecture, and hey, T&L was MEANT to happen on the CPU anyway...