I agree that moving to multi-core is going to be a new challenge for Nintendo, although they have some level of experience with DS and 3DS (although DS is hardly an SMP setup and would seem pretty auxiliary the libraries do have a fair bit of concurrency control). But I don't agree that this means no reduced challenge vs other hardware.
--------------------------------------
Compare with Xbox 360 and PS3. There you'd be targeting 6-8 threads instead of 3. Hand-coded assembly needs to be much more aggressively scheduled, and a lot of critical code needs to be hand-vectorized for a new ISA (they'd have paired singles libraries going back forever, and that stuff's easier to write to begin with). There are gotchas a mile long for things to avoid, like branches and load-hit-stores, and you have to prefetch a lot more conscientiously.
I do agree that three OoO cores (it could have been four, though) are more developer-friendly, for Nintendo or anyone else. Xenon was quite an unforgiving bitch that needed to be nursed to get any form of sane performance.
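To make the load-hit-store gotcha from the quote a bit more concrete, here is a minimal C sketch (the function names are mine, purely for illustration). On an in-order core like Xenon, accumulating through a pointer can force a store followed by a reload of the same address on every iteration, because the compiler has to assume `total` may alias `values`. Keeping the accumulator in a local lets it stay in a register. Whether a given compiler really emits the store/reload depends on flags and its aliasing analysis, so take this as an illustration, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates through a pointer. Since `total` and `values` have the same
 * type, the compiler must assume they can alias, so it may store *total
 * and load it back on each iteration (a potential load-hit-store). */
void sum_through_pointer(const float *values, size_t count, float *total)
{
    assert(values != NULL && total != NULL);
    *total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

/* Same computation, but the running sum is a local variable that can
 * live in a register for the whole loop; memory is only read. */
float sum_in_register(const float *values, size_t count)
{
    assert(values != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        acc += values[i];
    }
    return acc;
}
```

Both functions return the same sum; the only difference is where the accumulator lives. An OoO core with store forwarding tends to hide this kind of pattern much better, which is part of why I call it more developer-friendly.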
I don't know about the cache hierarchy being complex, but I'm not aware of anything that this would do to contribute to software difficulty. If there's a generous amount of cache for this level of compute that just makes things easier.
I was speaking of the whole memory hierarchy, not just the CPU caches. It is still unclear whether the CPU can access the eDRAM located on the GPU die. I think it is possible that it can, for the sake of emulating the embedded RAM in their previous design. But that's pretty iffy, so let's wait to learn more (the most likely scenario being that the eDRAM, for all intents and purposes, acts like a small pool of VRAM).
PowerPC476FP is relatively new, possibly too new to have been a viable candidate for Nintendo who probably likes sourcing out parts far, far in advance. Backwards compatibility would require software emulation or support built into the core (not just for the functionality but also decoding, the ISAs are not totally compatible). AFAIK this core doesn't have any FP SIMD at all so it'd be even worse in that regard. Or were you thinking about a different processor?
------------------------------
Not with PPC47x, unless I'm not doing a good job understanding its specs.
-----------------------------
I don't know about clock speed, what I'm reading says it's up to 2GHz. A 9-stage pipeline is going to be a big limiter there.
I was indeed thinking of that CPU. I can't tell when IBM finalized the product or when it became available to customers, but the product brief is dated August 2010.
The ISA is indeed different. I went back to those papers, and thanks to your comment I realized I was making a wrong assumption: since the CPU handles FP in double precision, I assumed it used a paired FPU like previous designs. Reading through the white paper, that was a faulty assumption of mine.
While it seems the CPU can achieve roughly 2 DP FLOPS per cycle, that doesn't automatically imply it would achieve ~4 in SP. I guess I'm not the only one who had that misconception; many other posters here may have made the same assumption.
So it is always nice to have people like you clearing up misconceptions.
If that CPU doesn't have a paired FPU (like Broadway and the PPC 750) but instead a single FPU that "natively" handles DP calculations (a bit like how the DP version of the Cell doesn't provide twice the throughput of the original Cell in SP mode), it indeed makes it a bad target for Nintendo.
I think this alone could explain why Nintendo discarded it, more than timelines.
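To put rough numbers on the paired vs. scalar FPU point, here is a small C sketch of the peak arithmetic. The assumptions are mine: one FP instruction started per cycle, and a fused multiply-add counted as two FLOPs. The function names are hypothetical, and this is a back-of-the-envelope estimate, not a model of either core.

```c
#include <assert.h>

/* Peak FLOPs per cycle for one FPU, ASSUMING it starts one instruction
 * per cycle. `lanes` is the number of values per instruction (2 for
 * paired singles, 1 for a scalar FPU); `has_fma` is nonzero if a fused
 * multiply-add is available and counted as two FLOPs. */
int peak_flops_per_cycle(int lanes, int has_fma)
{
    assert(lanes > 0);
    return lanes * (has_fma ? 2 : 1);
}

/* Peak GFLOPS at a given clock (in GHz) under the same assumptions. */
double peak_gflops(int lanes, int has_fma, double clock_ghz)
{
    return peak_flops_per_cycle(lanes, has_fma) * clock_ghz;
}
```

With these assumptions, a paired-singles FPU gives 4 SP FLOPs per cycle, while a scalar FPU gives 2 per cycle whether the operands are SP or DP. For example, `peak_gflops(2, 1, 0.729)` (Broadway's 729 MHz) comes out to about 2.9 SP GFLOPS, and a scalar FPU would need roughly twice that clock just to match it.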
But that gets me back to your question about what IBM did for the billion they were given (either way the figure is wrong; I had never read it until recently here).
It is a lot of money. One could wonder whether IBM could have tweaked such a CPU (the 476FP) and replaced the DP FPU with a paired FPU à la Broadway. The basis of the CPU is pretty good: the pipeline is a bit longer than on Broadway/PPC 750, so it clocks a bit higher; it is a 4-issue design, and I think you told us that Broadway is a 2-issue design. It has more advanced OoO execution (up to 32 instructions in flight). It can dispatch up to 6 instructions at a time to the functional units, most likely has a more modern branch unit, and so on. As the CPU is meant to handle DP calculations efficiently, I would assume that the data paths are already 64 bits wide and that feeding a paired FPU, SIMD style, would not have necessitated a rework of those data paths (vs. pushing to 4-wide SIMD).
I don't know why I'm going through this; you read the paper, and better than I did.
What IBM did for its money is indeed the billion-dollar question.
The clock speed of Espresso indeed points to a pretty short pipeline. One could assume, on clock speed alone, that it is mostly the same 5-stage pipeline as in Broadway, pushed to its limits on a newer process.
It's possible it wouldn't be a huge effort for emulation with a little bit of hardware glue but I think Nintendo is very paranoid about emulation.
But it wouldn't have come from rearchitecting something like Dolphin, that's too far off what they'd want.
Well, actually I thought (and I am not alone) that Nintendo had the option of an off-the-shelf CPU that was a modernized version of the PPC 750 that Broadway is built upon. The premise turned out to be wrong, based on a misunderstanding by me and others.
Maybe Nintendo would have been willing to work some "software magic" to make BC happen with a CPU like the one I described above (which, as it turns out, the PPC 476FP is not), but they never had the chance. They would not have gone as far as switching to another ISA, that's for sure.
The CPU they may have been looking for might simply not exist in IBM's portfolio. It's a bit like if MSFT or Sony had shipped this year: the pretty sexy, well-rounded CPU that Jaguar seems to be would not have been an option. Either way they would have had to develop their own CPU, and that is costly. There was also the Power A2 CPU but, back to your first point, it might not be the most developer-friendly option around, even if Nintendo were to set a limit of 2 threads per core. Dealing with 8 threads, really sucky single-thread performance, etc. would have affected Nintendo's teams.
From now on, I will discard the hypothesis that the PPC 476FP without modification was an option for Nintendo.
I would also discard the hypothesis that Espresso could be a customized PPC 476, as its (really low) clock speed hints at a really short pipeline.
We are still left with the 1 billion dollar question.