Are the appalling SSE utilisation rates of most x86 code, compared with theoretical peak, indicative of TDP headroom in x86 for HPC?
IEEE-compliant DP erases a number of sins, as does a much more robust ecosystem of tools for debugging and in-silicon instrumentation, although the aged core Larrabee works from may limit this in comparison to more modern x86.
Larrabee, as disclosed, will derive much of its FP performance from an enhanced vector set and will not have the massive OOE scheduler overhead or a pipeline designed to cater to high clock speeds.
There's the one example.
For CPUs, we know a more diverse software base will reveal other examples of code that hits utilization high enough to exceed TDP.
In AMD's case, a significant portion of the time the GPU is CPU-limited, or so I interpret from the number of times I've seen CAL having the finger pointed at it for non-ideal work units.
We have one anecdote saying it won't happen for GPGPUs.
I suppose both AMD and Nvidia can just point out the fundamental weakness of their slave cards always being at the mercy of the host processor, the expansion bus, and their software layer.
The initial instantiation of Larrabee should have the same problem, unless Intel shifts its position with regard to Larrabee in HPC. Lucky for the GPGPU crew.
A GPU is more than just programmable ALUs, and in case you haven't noticed, FurMark is driving more than just the ALUs very hard.
Nvidia's Tesla TDP is 160 W, versus 236 W for the related card not running CUDA.
So we can attribute close to 1/3 of the total heat output to the ROP, raster, and texturing special hardware.
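Back-of-envelope, in Python, treating the whole TDP delta as the idle fixed-function hardware (a simplification, since board and memory power don't split that cleanly):

# Share of board TDP attributable to ROP/raster/texture hardware,
# assuming the CUDA-only Tesla idles exactly those blocks.
tesla_tdp_w = 160     # Tesla TDP quoted above
consumer_tdp_w = 236  # related consumer card's TDP
share = (consumer_tdp_w - tesla_tdp_w) / consumer_tdp_w
print(f"{share:.1%}")  # -> 32.2%, i.e. close to 1/3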
40nm would allow for a blind doubling of everything, but the improvement in power terms is modest and most definitely not a halving of power consumption. If rumors turn out to be true, the power improvement may be even smaller.
A CUDA-only load would be awfully close to 300 W, and would go over it if power savings are close to zero in some worst-case scenario. That assumes no change in the ratio of ALU to special-purpose hardware, though all the speculation seems to be upping the ALU share, not reducing it.
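A rough projection under those assumptions (the savings percentages here are illustrative guesses, not TSMC figures):

# Double the 55nm Tesla's 160 W and apply an assumed power saving.
tesla_tdp_w = 160
for saving in (0.0, 0.10, 0.20):  # hypothetical per-transistor savings
    print(f"{saving:.0%} saving -> {2 * tesla_tdp_w * (1 - saving):.0f} W")
# 0% saving  -> 320 W (over the 300 W PCIe power-delivery ceiling)
# 10% saving -> 288 W
# 20% saving -> 256 W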
AMD, with an architecture that has an even higher ALU:TEX ratio and a smaller number of ROPs, has much less slack.
The change from SM3 GPUs to SM4 was way more of an overhaul than any recent change in x86. There's a reason it's taking Intel so long to come up with a competitive GPU - and Larrabee 1, if it's competitive (i.e. on anything that isn't DP), will only be so for 6 months.
G80's purported design cycle took 4 years, if we believe Anandtech.
What number do we give Larrabee, and how much more would that be?
CPU design cycles are in the range of 3-5 years. GPUs seem to be roughly 3 years; going by recent history, we then add two or three quarters of delay on top of that.
The IHVs, courtesy of TSMC, are building consumer chips that are far bigger and more complex than Intel's. The memory controllers alone are way ahead.
Nvidia's GPUs are not as big as the biggest CPUs Intel has produced.
Intel's advances in power regulation appear to exceed anything GPUs manage to do when they aren't forced by the driver to throttle.
Intel's version of hardware scheduling is much more complex than what either Nvidia or AMD does, and that simpler scheduling is part of the reason the GPU vendors can afford the ALU density they have.
Also, I forgot to note that exceptions on the CPUs can be precise.
Really? Run the process's implied density numbers (~400M transistors per 100mm²) against RV770's 956M transistors in 256mm².
Intel's Penryn is 410M in 110mm².
Yes, it does lag 55nm: 3.7 versus 4.
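Running those density figures (the numbers are straight from the posts above):

# Mtransistors per mm^2 for the chips quoted above.
for name, mtrans, mm2 in [("RV770, TSMC 55nm", 956, 256),
                          ("Penryn, Intel 45nm", 410, 110),
                          ("claimed 55nm density", 400, 100)]:
    print(f"{name}: {mtrans / mm2:.2f} Mtrans/mm^2")
# RV770 3.73, Penryn 3.73, claimed 4.00 -- hence "3.7 versus 4"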
The density comparison is somewhat disingenuous, since so much of the CPU is cache, and the logic itself is significantly harder to shrink for a high-speed design with a lot of complexity.
We don't rightfully know what Intel could manage if it tuned the design to target density with relaxed timing and power requirements.
It also depends on whether TSMC's 40nm quality ends up closer to 55nm's success or to what R600 encountered on 80nm...
Larrabee's cores are not as complex, and there are serious near-term concerns about transistor variation across large chips and about leakage without a metal-gate stack.
Do you think Larrabee will be getting the best of Intel's process ahead of x86?
Depends on whether Intel does to 45nm Larrabee what it did to the 45nm Nehalem dual-cores.
TSMC's customers will be at 28nm by the end of 2010, in theory (32nm appears to be a "limited capability shrink of 40nm", which gives the impression that it'll be of limited use to AMD/NVidia)...
TSMC's customers should have been at 40nm last year in theory, but maybe we'll see such products Q2 this year instead.
Intel has demoed Windows running on Westmere.
Where's the RV870 running Crysis?
There are 8 stages in ATI GPUs' ALU pipelines; NVidia's ALUs seem to be in the region of 12 stages. What do you think Larrabee's vector pipeline length will be? Is <=2GHz smoke'n'mirrors on Intel's part?
The FP pipeline has more stages than the INT pipeline, but it's the INT pipeline that has tracked clock speeds more closely on CPUs.