Jawed
As far as OpenCL and CUDA are concerned, toolsets are definitely immature, though CUDA's environment is much better than it was with debugging and metering/profiling.
IEEE compliant DP erases a number of sins. As does a much more robust ecosystem of tools when it comes to debugging and in-silicon instrumentation, although the aged basis Larrabee works from may limit this in comparison to more modern x86.
Larrabee, as disclosed, will derive much of its FP performance from an enhanced vector set and will not have the massive OOE scheduler overhead or a pipeline designed to cater to high clock speeds.
I think HPC programmers/users want to write their code in FORTRAN/C-ish rather than assembly, or they want to use libraries. Obviously libraries should already be robust on x86 and be making excellent use of SSE. OpenCL/CUDA are going to take a long time to catch up in the breadth and depth of library support.
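For what it's worth, here's a toy sketch of that "just lean on the libraries" path in Python - NumPy standing in for whatever tuned BLAS an HPC shop already links against (an illustration of the point, not a benchmark):
[code]
# The matrix multiply below is dispatched by NumPy to whatever optimised BLAS it was
# built against, which on x86 will typically already be using SSE - the programmer
# writes FORTRAN/C-ish array code and never touches intrinsics or assembly.
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)
c = a @ b   # dgemm under the hood, roughly 2*n**3 floating-point operations
print(c.shape)
[/code]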
And yes, there's a minefield between you and maximum GPU performance, but I don't think it's GPU functionality, per se, that's getting in the way now.
In HPC systems TDP will be observed more strictly, I expect. Lower clocks are an easy fix for specific configurations - e.g. 1U versus 4U configurations. It's not rocket science.
There's the one.
We know a more diverse software base will reveal other examples of code that hits utilization high enough to exceed TDP in the case of CPUs.
Let's find masses of GPUs being killed by GPGPU workloads before accusing them of being unable to sustain HPC workloads...
In AMD's case, a significant portion of the time the GPU is CPU-limited, or so I interpret from the number of times I've seen CAL having the finger pointed at it for non-ideal work units.
I think it's just a matter of the application, i.e. arithmetic intensity - there's no reason for the end-game to consist of being bogged down by external factors. You're no longer forced into lock-waiting the CPU while the GPU works, for example - that was just an aberration of early implementation.
We have one anecdote saying it won't happen for GPGPUs.
I suppose both AMD and Nvidia can just point out the fundamental weakness of their slave cards always being at the mercy of the host processor, the expansion bus, and their software layer.
The initial instantiation of Larrabee should have the same problem, unless Intel shifts in its position with regards to Larrabee in HPC. Lucky for the GPGPU crew.
F@H on ATI is clearly hobbled by the Brook environment's history and by the pure streaming approach it originally enforced - OK, so Brook enabled it in the first place, but it's now a drag. Brook+ just seems doomed to me - it's perhaps a nice teaching language for throughput/stream computing but I just can't see people using it in anger for any longer than they have to.
GPGPU is past its version 1.0-itis, at least NVidia's.
I think you're being too literal - NVidia doesn't have to specify Tesla to be within an inch of the relevant TDP. Against CPUs it's home and dry. Against Larrabee DP, well, a 10% clock drop isn't going to be the issue there.
Nvidia's Tesla TDP is 160 Watts, versus 236 for the related card not running CUDA.
So we can attribute close to 1/3 of the total heat output to the ROP, raster, and texturing special hardware.
40nm would allow for a blind doubling of everything. The improvement in power terms was modest and most definitely not a halving of power consumption. If rumors turn out to be true, the power improvement may be smaller.
A CUDA-only load would be awfully close to 300 W, and would be over if power savings are close to 0 in some worst-case scenario. That is assuming no changes in the ratio of ALU to special-purpose hardware, though all the speculation seems to be upping the ALU load, not reducing it.
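Putting rough numbers on that chain of reasoning - the wattages above taken at face value, the "power saving" factor is purely my assumption:
[code]
# Back-of-envelope using the figures quoted above (approximations, not measurements).
tesla_tdp = 160.0    # W - Tesla board running CUDA only
gtx280_tdp = 236.0   # W - related consumer card with ROP/raster/texture active

share = (gtx280_tdp - tesla_tdp) / gtx280_tdp
print(f"ROP/raster/texture share of board power: {share:.0%}")   # ~32%, i.e. close to 1/3

# A "blind doubling" at 40nm of the ALU-dominated load, for a few assumed power savings:
for saving in (0.0, 0.1, 0.2):
    print(f"{saving:.0%} saving -> {2 * tesla_tdp * (1 - saving):.0f} W")   # 320, 288, 256 W
[/code]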
Yes - but their FLOPs/mm2 are so high (2x NVidia's SP, way more DP) they can afford to downclock for margin.
AMD's slack with an architecture with an even higher ALU:TEX ratio and smaller number of ROPs is much less.
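If that ~2x per-mm2 advantage is roughly right, a toy calculation shows how much clock can be traded away for margin before the lead evaporates (the 2x figure is simply lifted from the claim above):
[code]
# FLOPS/mm^2 scales linearly with clock for a fixed design, so a downclock for
# power/yield margin erodes an assumed 2x area-efficiency advantage like this:
advantage = 2.0
for downclock in (0.0, 0.1, 0.2, 0.3):
    print(f"{downclock:.0%} downclock -> {advantage * (1 - downclock):.2f}x per mm^2")
[/code]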
Will Larrabee have taken 4 years by the time we can buy one? I don't know, close I think - and it's functionally a lot simpler than a GPU, massively so. My main point is that Intel's going to be in an environment where real new products up the ante every 6 months.
G80's purported design cycle took 4 years, if we believe Anandtech.
What number do we give Larrabee, and how much more would that be?
I also think that Intel only has a couple of years, maximum, of pain with Larrabee before they're in the lead for good.
Fingers-crossed the D3D11 generation is here on time - I want to build a PC based on one...
CPU design cycles are in the range of 3-5 years. GPUs seem to be roughly 3 years, then going by recent history we add two or three quarters of delay on top of that.
That's why I was quite specific about consumer products.
Nvidia's GPUs are not as big as the biggest CPUs Intel has produced.
Both RV770 and GT200 have on-board automation/optimisation of power usage that's BIOS/driver controllable. Have you seen the idle consumption of GTX285 and HD4670? 9W for the latter, FFS.
Intel's advances in power regulation appear to exceed anything GPUs manage to do when they aren't forced by the driver to throttle.
I think that's moot once you take into account the monster register files, the hierarchy of automatic scheduling (cluster->batch->instruction/clause->ALU/TMU) and the tight integration of memory controller and cache threading. Intel had to find a way to spend transistors - once they built a fast "single-core" ALU their only options were cache and varieties of OoOE and superscalar issue.
Intel's version of hardware scheduling is much more complex than what either Nvidia or AMD does, and that is part of the reason they can afford the ALU density they have.
God I wish Transputer had taken hold back in the day.
I'm outta my depth on what NVidia's doing there.
Also, I forgot to note that exceptions on the CPUs can be precise.
Yes, there's a hell of a lot of cache on there:
Intel's Penryn is 410M per 110 mm2.
It does lag 55nm, as it is ~3.7M transistors per mm2 versus ~4M, yes.
The density is somewhat disingenuous, since so much is cache and the logic itself is significantly harder to shrink for a high-speed design with a lot of complexity.
http://www.intel.com/pressroom/archive/releases/20070328fact.htm
I don't know what degree of "custom" AMD has achieved in RV770. I'm also a bit suspicious of the 956M transistor count for RV770, because supposedly they don't actually count the transistors they implement. For all we know that count is nothing more than the area of the chip * the standard cell density.
There's also a big fudge factor at the edges of the GPU with all that physical I/O stuff.
Taken at face value, TSMC's "45nm" is >700M per 100mm2 - that's ~ double Penryn. TSMC's 55nm referenced to 65nm is actually 0.9 scaling (as opposed to "theoretical" 0.72), so 55nm at TSMC isn't much of a clue to the gap between Intel and TSMC. Obviously staggered timescales play their part too...
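Working those densities through (die areas as quoted, except RV770's which is my assumption):
[code]
# Millions of transistors per mm^2, from the figures in this thread.
penryn = 410 / 110          # Intel 45nm Penryn: ~3.7
rv770 = 956 / 256           # TSMC 55nm RV770, assuming a ~256 mm^2 die: ~3.7 (the "~4"
                            # earlier presumably uses a smaller area or a nominal TSMC figure)
tsmc_45_claim = 700 / 100   # TSMC's ">700M per 100mm2" claim: ~7, roughly double Penryn

print(f"Penryn {penryn:.1f}, RV770 {rv770:.1f}, TSMC 45nm claim {tsmc_45_claim:.1f} M/mm^2")

# Half-node scaling: a "theoretical" 65nm -> 55nm linear shrink gives (55/65)^2 ~= 0.72
# of the area, versus the ~0.9 TSMC reportedly delivered in practice.
print(f"Theoretical 65nm -> 55nm area scaling: {(55/65)**2:.2f}")
[/code]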
In terms of density/clocks/yields I dare say ATI GPUs are "fully adapted" to TSMC, as Intel's CPUs are fully adapted to their own fabs. NVidia's current GPUs don't seem to be that well adapted - it looks like their desire to run the ALUs at 2x clocks is not a good match for what TSMC offers - for the kind of architectural balance they've chosen with batch size and instruction-issue they appear to have no choice about the clocking ratio.
(NVidia's considerably smaller batch size is still a radically untested comparison point in these architectures - per mm2, nested branching may be a total disaster zone on ATI but relatively painless on NVidia.)
Yeah it's hard to disentangle what has been alluded to as strategic delays for 40/45nm by TSMC versus technical issues (which also seem to have led to the curtailment of 32nm options).
We don't rightfully know what Intel could manage if it tuned the design to target density with relaxed timing and power requirements.
It is also dependent on whether TSMC's 40nm quality is closer in success to 55nm or what R600 encountered on 80nm...
Yeah, Intel is constrained by not being able to sell Larrabee in large quantities for much more than $100 a go. Sure sales to the HPC crowd will rake in profit per unit, but...
Larrabee's cores are not as complex and there are serious near-term concerns about transistor variation over large chips and leakage without a metal gate stack.
Depends if Intel does to 45nm Larrabee what it did to the 45nm Nehalem dual cores.
AMD has nothing to gain by showing something it isn't ready to launch - unless it's R600 all over again.
TSMC's customers should have been at 40nm last year in theory, but maybe we'll see such products Q2 this year instead.
Intel's demoed running Windows on Westmere.
Where's the RV870 running Crysis?
I'm not sure if any RV8xx GPUs are actually up and running - apparently sampling in Q2 if CJ's to be believed. Seems likely there'll be something running if that's true.
Does Intel vary the pipeline organisation for the slower clocked mobile parts?
FP has more stages than the INT pipeline, but the latter has tracked clock speeds more closely for CPUs.
Anyway, if Larrabee is ~2B transistors, can they fit 32 cores, 8MB of L2 and 32 TMUs?
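Crude budget for that, with the 6T cell count and the neglect of tags/ECC/uncore entirely my assumptions:
[code]
# Rough transistor budget for a ~2B-transistor Larrabee (assumptions, not disclosures).
total = 2_000_000_000
l2_bytes = 8 * 1024 * 1024               # 8MB of L2
l2_transistors = l2_bytes * 8 * 6        # 6T SRAM cells, ignoring tags/ECC/redundancy

remaining = total - l2_transistors
cores = 32
per_slice = remaining // cores           # what's left per core plus its TMU, ring stop,
                                         # and share of memory controllers and I/O
print(f"L2 arrays ~{l2_transistors/1e6:.0f}M, leaving ~{per_slice/1e6:.0f}M per core+TMU slice")
[/code]
~400M transistors just for the L2 arrays and around 50M per core+TMU slice doesn't look obviously impossible, but it's tight.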
Jawed