Even with state-of-the-art efficiency per transistor per cycle for their superscalar cores, this won't stop being true. This is more an argument for spending more area on SPUs/Larrabee than anything else.
True, but it is the starting point; they can only build around the HW anyway. As the tessellator shows, there is also the question of whether there is a real point in having too much specialized HW.
PC999 said:
That is the beauty of Fusion-like designs, although there may be a problem of such units going underused (like the tessellator), depending on how much they bring and how easy they are to use.
Anyway, it seems to me that a Fusion design is perfect for a console...
I will answer both of your posts at once (even if one of your posts, PC999, is not aimed at me).
Well, SPUs or Larrabee cores are tinier than a Xenon core but remain significantly bigger than dedicated parts. Decompression, for example, is the kind of task that will always come into play. The choice between specialized/fixed-function hardware and more generic resources is likely to be made on needs and performance and how they balance each other (I'm a master of the obvious...).
GPUs have moved away from dedicated units for vertex and pixel shading because it made sense for flexibility, hardware utilization, etc., but they kept texture filtering tied to dedicated hardware. Basically, for filtering, the trade-off for flexibility and hardware utilization is too expensive => it's slow.
I feel like it's the same for compression/decompression. GPUs are more than 5 times faster at decoding/decompressing than CPUs (Badaboom and the like), OK. But do you need that many transistors to beat the speed of a CPU? Clearly not: the SPURS Engine is roughly a tenth the size of an actual GPU and it is a match for top-of-the-line GPUs (I don't remember the exact figure, I guess it beats them, not to mention power efficiency). And actually the SPURS Engine can do much more; matching its performance on decoding/decompression alone would need way fewer transistors.
As I said, decompression will always be part of the equation, as you will have to stream compressed data to RAM or the GPU even when using an SSD. I think it would make sense to dedicate a tiny portion of silicon to units that get the job done for cheap.
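To make that concrete, here is a minimal sketch of the kind of streaming decompression loop that burns CPU time today when assets come off the disc/SSD. zlib is just a stand-in codec I picked for illustration (not anything from a real console SDK); the point is that this whole loop is exactly the sort of work a small dedicated block could take off the general-purpose cores:

[code]
/* Rough sketch, my own code: stream-decompress from 'src' into 'dst'
 * chunk by chunk, the way assets get inflated into RAM today.
 * zlib is only a stand-in for whatever codec a dedicated unit handles. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK (64 * 1024)

int stream_decompress(FILE *src, FILE *dst)
{
    unsigned char in[CHUNK], out[CHUNK];
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    if (inflateInit(&zs) != Z_OK)
        return -1;

    int ret = Z_OK;
    do {
        zs.avail_in = (uInt)fread(in, 1, CHUNK, src);
        if (zs.avail_in == 0)
            break;
        zs.next_in = in;

        /* Drain all output produced from this input chunk. */
        do {
            zs.avail_out = CHUNK;
            zs.next_out  = out;
            ret = inflate(&zs, Z_NO_FLUSH);
            if (ret == Z_STREAM_ERROR || ret == Z_NEED_DICT ||
                ret == Z_DATA_ERROR   || ret == Z_MEM_ERROR) {
                inflateEnd(&zs);
                return -1;
            }
            fwrite(out, 1, CHUNK - zs.avail_out, dst);
        } while (zs.avail_out == 0);
    } while (ret != Z_STREAM_END);

    inflateEnd(&zs);
    return ret == Z_STREAM_END ? 0 : -1;
}
[/code]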
Why spend more time and money on a "flawed" CPU? In 2010 any dual-core solution (or a cut-down one) would beat it in every single aspect, many of them offering OoOE, branch prediction, good latency, power efficiency... Only for very easy BC.
Anyway, don't be too hard on Xenon; in some things it is probably way better than PC CPUs (raw FP or dot products).
I don't think it's "flawed"; I would rather say it needs to be perfected.
I think that BC will have its weight on the design. As you pointed out, Xenon is not bad at everything; I guess that for some workloads its clock speed may have some merits (along with the AltiVec units, as you pointed out). I think MS will need a CPU running at the same speed to ensure problem-free BC. The point is that a CPU granted a potent OoO execution engine, great branch prediction, etc., running at least @3.2 GHz will be too big and too hot. My point was that MS can't/won't afford a top-of-the-line x86 CPU (nor do AMD/Intel have reasons to sell them cheap).
That's why I think MS should work on "reasonable improvements" instead of starting from scratch again (keeping the POWER ISA). That's not to say the number of pipeline stages couldn't be modified at some point, for example, but they should stick to the Xenon philosophy.
POWER6 is an in-order processor, but I remember reading that it does use some kind of simplistic form of OoO execution; sadly I can't remember where I read it. If my memory is right, MS could look in that direction to improve performance at a smaller energy cost than a potent OoO engine would have.
Obviously MS should look at better prefetching, lower latencies, branch prediction, etc., but expecting x86 levels of performance is delusional.
I quote a part of your post here, Alstrong:
Just a few questions to throw out there...
a) Would there be much sense in taking VMX128 any further? And if so, in what way?
While I was searching for info on POWER6 (see above) I found some interesting stuff on the wiki:
http://en.wikipedia.org/wiki/POWER6
http://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03
There is an AltiVec unit to POWER6, and the processor is fully compliant with the new Power ISA v.2.03 specification. POWER6 also takes advantage of ViVA-2, Virtual Vector Architecture, which enables the combination of several POWER6 nodes to act as a single Vector processor.
Actually I don't know if that would be useful, as the VMX128 register file is already pretty large; maybe doubling the vector width would be a better move. Basically I've no clue, insights welcome here too.
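For what it's worth, here is what the current 128-bit case looks like in code (my own sketch with AltiVec intrinsics, nothing official); a hypothetical 256-bit VMX would simply retire twice as many floats per instruction in the same loop:

[code]
/* My own sketch: a 4-wide AltiVec (VMX) multiply-add loop.
 * With a hypothetical 256-bit register file the same loop body would
 * process 8 floats per iteration instead of 4.
 * Assumes x and y are 16-byte aligned and n is a multiple of 4. */
#include <altivec.h>
#include <stddef.h>

void saxpy_vmx(float *y, const float *x, float a, size_t n)
{
    vector float va = (vector float){a, a, a, a};   /* splat the scalar */
    for (size_t i = 0; i < n; i += 4) {
        vector float vx = vec_ld(0, &x[i]);         /* 128-bit load     */
        vector float vy = vec_ld(0, &y[i]);
        vec_st(vec_madd(vx, va, vy), 0, &y[i]);     /* y = a*x + y      */
    }
}
[/code]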
Power ISA v.2.03
The specification for Power ISA v.2.03[7] is based on the former PowerPC ISA v.2.02[3] in POWER5+ and the Book E[1] extension of the PowerPC specification. The Book I included five new chapters regarding auxiliary processing units like DSPs and the AltiVec extension.
Book I – User Instruction Set Architecture covers the base instruction set available to the application programmer. Memory reference, flow control, Integer, floating point, numeric acceleration, application-level programming. It includes chapters regarding auxiliary processing units like DSPs and the AltiVec extension.
This is interesting if MS finds that decompression units, and why not a network accelerator, would fit their goals.
I wonder if implementing scatter/gather would help for the kind of workloads likely to run on the CPU?
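As an example of what I mean (toy code of mine, names made up): indexed lookups like the one below are all over game CPU code, and today every lane has to be filled with a separate scalar load, which is exactly what a hardware gather would collapse into one operation:

[code]
/* Toy example of an indirect access pattern (skinning, particle lookups,
 * voice mixing...). Without scatter/gather each element needs its own
 * scalar load, so the 128-bit vector pipes sit idle behind the load unit;
 * a gather instruction could fetch pos[idx[i]]..pos[idx[i+3]] in one go. */
#include <stddef.h>

void gather_positions(float *dst, const float *pos,
                      const unsigned *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = pos[idx[i]];   /* one dependent load per element */
}
[/code]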
How about just doubling the cores from 3 to 6 and having 1 hardware thread per core? If the process node were 32nm they could double the number of cores/cache and maybe even increase the clock. They might even be able to use beefier cores that are 100% backwards compatible, similar to what IBM did from the GC's Gekko to the Wii's Broadway.
Well, multithreading is a way to improve IPC; such cores would have to use other tricks to keep up performance-wise, tricks that are likely more power hungry, but it's not impossible.
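Rough illustration of the trade-off (a toy of mine, not a real benchmark): with two hardware threads per core, one thread's cache-miss stalls can be filled by the other thread's work, whereas a single-thread core has to get that overlap from OoO/prefetching instead:

[code]
/* Toy sketch: two latency-bound pointer chases running side by side.
 * On an SMT core the two threads' cache misses overlap "for free";
 * a single-threaded in-order core needs OoO or prefetch tricks to hide
 * the same stalls. All parameters and names here are made up. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES (1 << 20)            /* ~8 MB of nodes: larger than L2 */
#define STEPS (10 * 1000 * 1000)

struct node { struct node *next; };

/* Build a randomly shuffled circular chain so nearly every hop misses. */
static struct node *build_chain(void)
{
    struct node *nodes = malloc(NODES * sizeof *nodes);
    size_t *order = malloc(NODES * sizeof *order);
    for (size_t i = 0; i < NODES; i++) order[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        nodes[order[i]].next = &nodes[order[(i + 1) % NODES]];
    free(order);
    return &nodes[0];
}

static void *worker(void *arg)
{
    struct node *p = arg;
    uintptr_t sum = 0;
    for (long i = 0; i < STEPS; i++) {
        p = p->next;               /* dependent load: the core stalls here */
        sum += (uintptr_t)p;
    }
    return (void *)sum;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, build_chain());
    pthread_create(&t2, NULL, worker, build_chain());
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}
[/code]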