Nvidia GT300 core: Speculation

That's the performance problem; size has nothing to do with it.

If you're measuring by a perf/mm2 ratio, size matters (no pun intended LOL).

CarstenS,

480SPs in 10 clusters? Hopefully that scenario doesn't suggest 6*8/cluster. How about thinking of a new MC as a start?
 
I think the key point is that NVidia's implementation of DP is far richer than the basic approach AMD's taken. All that richness costs, but presumably makes use of the existing register-file etc. infrastructure, so is relatively efficient, and being slow (because it's not used much) makes it even more efficient.

Jawed

Remember, NV's goal is to be best of class by 2009 in DPFP. So it is virtually guaranteed that they will be coming out with stuff that improves upon DP. Here their approach becomes visibly better: since DP is practically a Tesla exclusive (usage-wise), keeping the transistors that aren't used in games separate makes it easier to make mid-range chips. AMD will have to improve all their ALUs across the market range, or maintain chips using two different ALU designs for different markets.
 
Or is there anything absolutely requiring 8 TMUs per TPC which I have missed?
Are there any G8x GPUs with only 4 TMUs per cluster?

Then we get into a question of overall efficiency as the number of clusters varies, i.e. which is more efficient:
  • 10 clusters each of 4 SIMDs (32 lanes total) and 8 TMUs
  • 20 clusters each of 2 SIMDs (16 lanes total) and 4 TMUs
Overall, I'd say cluster count should be minimal.
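Just to make the trade-off explicit, here's a quick sanity check using only the hypothetical figures above (none of this is a known GT300 spec): both layouts end up with identical aggregate resources, so the efficiency question is entirely about per-cluster overhead and scheduling granularity.

```python
# Back-of-the-envelope comparison of the two hypothetical GT300 cluster layouts
# above. All figures are the speculative numbers from the post, not known specs.
configs = {
    "10 fat clusters":  {"clusters": 10, "lanes": 4 * 8, "tmus": 8},  # 4 SIMDs of 8 lanes, 8 TMUs
    "20 thin clusters": {"clusters": 20, "lanes": 2 * 8, "tmus": 4},  # 2 SIMDs of 8 lanes, 4 TMUs
}

for name, c in configs.items():
    total_lanes = c["clusters"] * c["lanes"]
    total_tmus = c["clusters"] * c["tmus"]
    print(f"{name}: {total_lanes} ALU lanes, {total_tmus} TMUs, "
          f"ALU:TEX = {c['lanes'] // c['tmus']}:1 per cluster")

# Both configurations total 320 lanes and 80 TMUs at the same 4:1 ratio, so the
# efficiency question is about per-cluster overhead (schedulers, shared memory,
# redundancy granularity), not peak throughput.
```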

Jawed
 
Remember, NV's goal is to be best of class by 2009 in DPFP.
"Best of class" basically means what with regard to performance? There's a functionality/performance mix here, so they can bullshit analysts at conference calls with any definition of "best" they like.

So it is virtually guaranteed that they will be coming out with stuff that improves upon DP.
You mean performance? Of course. But they're way way behind AMD on performance. But that doesn't matter because AMD's GPGPU profile is essentially non-existent - AMD's tools are basically broken. NVidia's competing with Larrabee, not ATI, on GPGPU. Against Larrabee's DP they're toast (unless Intel decides to delete DP capability entirely!).

Which is why NVidia is proselytising its first mover status and courting everyone with CUDA. NVidia undoubtedly has a massive headstart, but anyone who wants to build a DP-specific supercomputer from "GPGPU" will have noticed that CUDA currently offers no useful performance gains there.

DP is a niche of a niche, so really the question is: will NVidia afford it the "correct" priority? What is the correct priority in GT300?

GT200 is, according to NVidia, too much CUDA, not enough graphics. Irrespective of the rather harsh lesson AMD gave them last summer, I expect they've re-balanced GT300. So I wouldn't put much emphasis on DP.

Here their approach becomes visibly better: since DP is practically a Tesla exclusive (usage-wise), keeping the transistors that aren't used in games separate makes it easier to make mid-range chips. AMD will have to improve all their ALUs across the market range, or maintain chips using two different ALU designs for different markets.
Only RV770 has double-precision capability, RV730 and RV710 don't have it.

Jawed
 
"Best of class" basically means what with regard to performance? There's a functionality/performance mix here, so they can bullshit analysts at conference calls with any definition of "best" they like.

Fair point. :???:

You mean performance? Of course. But they're way way behind AMD on performance. But that doesn't matter because AMD's GPGPU profile is essentially non-existent - AMD's tools are basically broken.

Absolutely. :devilish: AMD is nowhere to be found in the GPGPU space. I hope their OpenCL implementation can change that.
NVidia's competing with Larrabee, not ATI, on GPGPU. Against Larrabee's DP they're toast (unless Intel decides to delete DP capability entirely!).
Hang on. Are you referring to their useless perf in dp or the functionality? Former, yes. Latter, no.
Which is why NVidia is proselytising its first mover status and courting everyone with CUDA. NVidia undoubtedly has a massive headstart, but anyone who wants to build a DP-specific supercomputer from "GPGPU" will have noticed that CUDA currently offers no useful performance gains there.

Yup. Look here. (somewhere in the 2nd half, where there's some substance) Laughable numbers, I'd say.

DP is a niche of a niche, so really the question is: will NVidia afford it the "correct" priority? What is the correct priority in GT300? GT200 is, according to NVidia, too much CUDA, not enough graphics. Irrespective of the rather harsh lesson AMD gave them last summer, I expect they've re-balanced GT300. So I wouldn't put much emphasis on DP.

Yes, I too thought so when the overall numbers came out for RV770 and GT200. :smile: I expect them to go full blast on increasing 3D perf, which means lots of flexibility and perf (SP only). DP is likely to be sidelined this gen, though I expect functionality (precision etc.) to improve.
 
CarstenS,

480SPs in 10 clusters? Hopefully that scenario doesn't suggest 6*8/cluster. How about thinking of a new MC as a start?
I was thinking about twice the number of clusters, for example. So each TPC stays at 24 ALUs + 1 DP but gets cut to 4 TMUs.

You mean performance? Of course. But they're way way behind AMD on performance.
Apart from the tools - how much are the theoretical numbers worth wrt dp?
 
DPFP numbers: ~75G for NV and ~250G for AMD

I'm aware of those (78G, 240G actually). :) But my question was whether anyone can assess how much you can actually do with either, since obviously FMA is not all one's ever going to use.
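For reference, those peak figures fall out of simple units x rate x clock arithmetic. A rough sketch, assuming the commonly quoted GTX 280 and HD 4870 clocks and per-clock DP rates:

```python
# Rough derivation of the peak DP figures quoted above. Clocks and per-clock
# rates are the commonly cited ones, not vendor-confirmed for every SKU.
def peak_dp_gflops(units, flops_per_unit_per_clock, clock_ghz):
    return units * flops_per_unit_per_clock * clock_ghz

# GT200 (GTX 280): 30 SMs, one DP FMA unit each (2 flops/clock), ~1.296 GHz shader clock
gt200 = peak_dp_gflops(units=30, flops_per_unit_per_clock=2, clock_ghz=1.296)

# RV770 (HD 4870): 160 five-wide units, each doing one DP FMA per clock (2 flops), 0.75 GHz
rv770 = peak_dp_gflops(units=160, flops_per_unit_per_clock=2, clock_ghz=0.75)

print(f"GT200 peak DP: ~{gt200:.0f} GFLOPS")  # ~78
print(f"RV770 peak DP: ~{rv770:.0f} GFLOPS")  # ~240
# These are FMA-only peaks; as noted above, real kernels rarely sustain them.
```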
 
DPFP numbers: ~75G for NV and ~250G for AMD

For some or most customers, given the lack of full IEEE compliance, no interrupts, lacking toolsets, and Larrabee on the horizon, there should be a multiplier of 0.0 for Nvidia and something lower for AMD (-0.0?) on how much their DP matters.

On the plus side, the expansive transistor budget that 40nm allows at a given die size gives them the opportunity to change things with extra hardware.
The power output of the current chips at high utilization is already hitting limits.
40nm, even on a good day (which TSMC is rumored not to be having), would not allow for a doubling of highly utilized vector units.

So perhaps there is now more room for either specialized hardware or the extra bells and whistles whose absence has relegated GPGPU to the low-rent side of HPC.

We can grouse about bad utilization numbers, but that doesn't look to be the limiting factor in the coming generation.

Given TSMC's inferiority to Intel in process tech, there is significant pressure against becoming too Larrabee-like, since Intel has more leeway with its better process to take some generalized inefficiency.
 
Yes, but that doesn't mean that they can't wow us double time not only in density but in density+size =)
They have to produce something that's approaching 2x the speed of RV870 (i.e. X2 configuration). Otherwise the only way NVidia can claim the single-card performance crown is with a GPU that'll work in a GX2 configuration - which appears to preclude a chip that's monstrous like GT200.

40nm appears to be a disappointment in terms of power - were the IHVs expecting better? Rumours centred on RV740 seem to indicate so, but the ATI GPUs seem to be more profligate with power (though HD4670 is a curious contrary indicator on that point).

If 40nm is disappointing on power, that would increase the chances that GT300 is not amenable to GX2.

As to being wowed by size, well no, GT200 is pretty much as big as TSMC can do...

It's interesting that the TSMC brochure says that 45nm is good for >500M transistors in about 70mm2. Assuming 45nm has 2x the density of GT200's 65nm, GT200's 1.4B transistors should occupy roughly ~390mm2, yet the chip actually occupies 583mm2. So that ~50% margin hides a variety of factors (transistors not counted, analogue circuitry, and other stuff better-qualified people can describe).
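Spelling that arithmetic out, with an assumed straight 2x density step from 65nm to 45nm (itself a simplification, since logic, SRAM and analogue scale differently):

```python
# Naive area estimate for GT200 from TSMC's quoted 45nm density, assuming a
# clean 2x density advantage of 45nm over 65nm (a simplification).
density_45nm = 500e6 / 70        # transistors per mm^2 (>500M in ~70mm2, per the brochure)
density_65nm = density_45nm / 2  # assumed half the density

gt200_transistors = 1.4e9
naive_area = gt200_transistors / density_65nm
actual_area = 583                # mm2, GT200 on 65nm

print(f"Implied 65nm density: {density_65nm / 1e6:.1f}M transistors/mm2")
print(f"Naive GT200 area: {naive_area:.0f} mm2 vs actual {actual_area} mm2")
print(f"'Over-sizing factor': {actual_area / naive_area:.2f}x")
# ~390 mm2 naive vs 583 mm2 actual -> roughly a 1.5x factor, as described above.
```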

What "over-sizing factor" will apply for GT300? Is the 40nm process likely to make this better or worse? How dependent on chip size is this factor?

If the memory bus is 512-bit GDDR5 then I'll be wowed, too, so long as the performance actually warrants that kind of bandwidth.

Jawed
 
Apart from the tools - how much are the theoretical numbers worth wrt dp?
Depends on how bandwidth-limited the computation is and how that limitation can be mitigated by the memory hierarchy - RV770 has less memory bandwidth than GT200.

There's a general expectation that Larrabee will have significantly less memory bandwidth than competing GPUs, simply because for graphics it won't need it. That could also spoil DP numbers on certain benchmarks (e.g. matrix multiply).
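To make "bandwidth-limited" a bit more concrete, here's an illustrative roofline-style sketch for blocked DP matrix multiply. The block size and the board bandwidth figures are assumptions for illustration, and a real kernel with good use of local memory can do better:

```python
# Illustrative roofline-style estimate for blocked DP matrix multiply (DGEMM).
# Assumes a simple BxB blocking scheme held on-chip and ignores further reuse
# via caches/local memory; board bandwidth figures are assumptions here.
def required_bandwidth_gbs(peak_dp_gflops, block_size):
    # Each block pass streams 2*B*B doubles (16*B*B bytes) for 2*B*B*B flops,
    # i.e. an arithmetic intensity of roughly B/8 flops per byte in DP.
    intensity = block_size / 8.0             # flops per byte
    return peak_dp_gflops / intensity        # GB/s needed to stay compute-bound

for gpu, peak_dp, mem_bw in [("RV770 (HD 4870)", 240, 115.2),
                             ("GT200 (GTX 280)", 78, 141.7)]:
    need = required_bandwidth_gbs(peak_dp, block_size=16)
    print(f"{gpu}: wants ~{need:.0f} GB/s at B=16, board has {mem_bw} GB/s")
# With modest blocking, RV770's 240 GFLOPS asks for more bandwidth than the board
# provides, while GT200's far lower DP peak is easily fed - one way the big
# theoretical gap can shrink in practice.
```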

Overall it seems to be a problem to find anyone who is serious about using a GPU for DP. I've only seen odd forum postings from people doing stuff.

Jawed
 
So perhaps there is now more room for either specialized hardware or the extra bells and whistles that the lack thereof has relegated GPGPU to the low-rent side of HPC.
Graphics cards under heavy GPGPU loads are typically far far short of the worst case power demonstrated in Furmark ;)

Given TSMC's inferiority to Intel in process tech, there is significant pressure against becoming too Larrabee-like, since Intel has more leeway with its better process to take some generalized inefficiency.
Intel is used to 2-year cycles and at best the 1 year tick-tock phasing. The GPU guys and TSMC move faster than that.

Also, TSMC's 40/45nm is considerably denser than Intel's.

It's notable that Intel appears to be targeting 1-2GHz for Larrabee, not 3GHz+.

Jawed
 
Graphics cards under heavy GPGPU loads are typically far far short of the worst case power demonstrated in Furmark ;)
Really, which ones? Are they representative of the whole gamut of HPC apps that simply do not run on GPUs?
And who says they are typically far short of worst-case?
The same GPU makers whose TDP and wattage numbers exceed PCIe spec on Furmark?
There was of course AMD's pre-release bragging about how R600's MADD loop threatened to cook off the cards in their teraflop box, but that was a high-utilization load of nonsense MADDs at a different process node.

However, there is no indication this is going to get better at 40nm, and the delays and rumor mill (even if predominantly the Inq at this point) says it is the opposite.

Intel is used to 2-year cycles and at best the 1 year tick-tock phasing. The GPU guys and TSMC move faster than that.

Just how significantly?
Architectural shifts from 2006 to today:
Intel: Conroe, Penryn, Nehalem
Nvidia: G80, G92, GT200
AMD: R600, RV670, RV770

Also, TSMC's 40/45nm is considerably denser than Intel's.

It's notable that Intel appears to be targeting 1-2GHz for Larrabee, not 3GHz+.

Jawed

TSMC's process numbers in the PR for 40nm are optimistic at best, and their best numbers in many regards are massively inferior.
Intel's stated SRAM cell sizes are less dense than many other processes, in part because of the reduced definition of double-patterning, but also because you actually see real Intel products using them.

TSMC's minimum cell size is not something many clients that care one whit about leakage or reliability are ever going to use.

TSMC's lack of metal gates is going to hurt when it comes to leakage and variability, and given Larrabee's delayed status and the acceleration of Intel's 32nm transition, any density advantage real or imagined between 45nm and 40nm is going to be a temporary respite before it's 32nm vs 40nm.

As for clock speeds, any highly utilized vector units in a manycore at 3 GHz would be very hot, and the pipeline for Larrabee's cores is pretty short.

For reference, we should look at the clock speeds of CPUs with longer pipelines built on foundry processes when trying to determine the relative quality of those processes.
 
Really, which ones? Are they representative of the whole gamut of HPC apps that simply do not run on GPUs?
Are the appalling SSE utilisation rates of most x86 code in comparison with theoreticals indicative of TDP headroom in x86 for HPC?

And who says they are typically far short of worst-case?
Folding@home

However, there is no indication this is going to get better at 40nm, and the delays and rumor mill (even if predominantly the Inq at this point) says it is the opposite.
A GPU is more than just programmable ALUs and in case you haven't noticed Furmark is driving more than just the ALUs very hard.

Just how significantly?
Architectural shifts from 2006 to today:
Intel: Conroe, Penryn, Nehalem
Nvidia: G80, G92, GT200
AMD: R600, RV670, RV770
The change from SM3 GPUs to SM4 was way more of an overhaul than any recent change in x86. There's a reason it's taking Intel so long to come up with a competitive GPU - and Larrabee 1, if it's competitive (i.e. on anything that isn't DP), will only be so for 6 months.

Though after SM5 the IHVs are on a losing wicket - everything they do architecturally is on Intel's turf. Luckily for AMD it owns some of that turf too :LOL:

TSMC's process numbers in the PR for 40nm are optimistic at best, and their best numbers in many regards are massively inferior.
Intel's stated SRAM cell sizes are less dense than many other processes, in part because of the reduced definition of double-patterning, but also because you actually see real Intel products using them.
The IHVs, courtesy of TSMC, are building consumer chips that are far bigger and more complex than Intel's. The memory controllers, alone, are way ahead.

TSMC's minimum cell size is not something many clients that care one whit about leakage or reliability are ever going to use.
Really? Run the process's implied density numbers (~400M transistors per 100mm2) for RV770's 956M transistors in 256mm2.
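Running those numbers as suggested, treating ~400M per 100mm2 as the implied headline density:

```python
# RV770's achieved density versus TSMC 55nm's implied headline density.
implied_density = 400e6 / 100   # ~4M transistors per mm^2
rv770_density = 956e6 / 256     # ~3.73M transistors per mm^2 (956M in 256mm2)

print(f"Implied 55nm density: {implied_density / 1e6:.1f}M/mm2")
print(f"RV770 achieved:       {rv770_density / 1e6:.2f}M/mm2 "
      f"({rv770_density / implied_density:.0%} of the implied figure)")
# A large consumer GPU lands within ~7% of the headline number.
```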

TSMC's lack of metal gates is going to hurt when it comes to leakage and variability, and given Larrabee's delayed status and the acceleration of Intel's 32nm transition, any density advantage real or imagined between 45nm and 40nm is going to be a temporary respite before it's 32nm vs 40nm.
Do you think Larrabee will be getting the best of Intel's process ahead of x86?

TSMC's customers will be at 28nm by the end of 2010 in theory (32nm appears to be a "limited capability shrink of 40nm" which gives the impression that it'll be of limited use to AMD/NVidia)...

For reference, we should look at the clock speeds of CPUs with longer pipelines built on foundry processes when trying to determine the relative quality of those processes.
There are 8 stages in ATI GPUs. NVidia's ALUs seem to be in the region of 12 stages. What do you think Larrabee's vector pipeline length will be? Is <=2GHz smoke'n'mirrors on Intel's part?

Jawed
 
They have to produce something that's approaching 2x the speed of RV870 (i.e. X2 configuration).
Actually, they need to be faster. Because otherwise they'd better use the same AFR-top-end approach as AMD's using. Being slower than 2-chip AFR makes the idea of a big 1-chip top-end pretty pointless for the mass market (for CUDA/OCL/DXCS it may be more interesting).

If 40nm is disappointing on power, that would increase the chances that GT300 is not amenable to GX2.
Why?

As to being wowed by size, well no, GT200 is pretty much as big as TSMC can do...
Even GT200-sized GT300 will wow some people because they kinda think that GT200 size is a mistake that should be avoided in the future =)
But I myself expect a smaller than GT200 die from GT300. I think they'll push for more performance/mm^2 this time and that'll probably mean they won't need such a die size in GT300.
 
Are the appalling SSE utilisation rates of most x86 code in comparison with theoreticals indicative of TDP headroom in x86 for HPC?
IEEE compliant DP erases a number of sins. As does a much more robust ecosystem of tools when it comes to debugging and in-silicon instrumentation, although the aged basis Larrabee works from may limit this in comparison to more modern x86.
Larrabee, as disclosed, will derive much of its FP performance from an enhanced vector set and will not have the massive OOE scheduler overhead or a pipeline designed to cater to high clock speeds.

Folding@home
There's the one.
We know a more diverse software base will reveal other examples of code that hits utilization high enough to exceed TDP in the case of CPUs.

In AMD's case, a significant portion of the time the GPU is CPU-limited, or so I interpret from the number of times I've seen CAL having the finger pointed at it for non-ideal work units.
We have one anecdote saying it won't happen for GPGPUs.

I suppose both AMD and Nvidia can just point out the fundamental weakness of their slave cards always being at the mercy of the host processor, the expansion bus, and their software layer.

The initial instantiation of Larrabee should have the same problem, unless Intel shifts in its position with regards to Larrabee in HPC. Lucky for the GPGPU crew.

A GPU is more than just programmable ALUs and in case you haven't noticed Furmark is driving more than just the ALUs very hard.

Nvidia's Tesla TDP is 160 Watts, versus 236 for the related card not running CUDA.
So we can attribute close to 1/3 of the total heat output to the ROP, raster, and texturing special hardware.

40nm would allow for a blind doubling of everything. The improvement in power terms was modest and most definitely not a halving of power consumption. If rumors turn out to be true, the power improvement may be smaller.

A CUDA-only load would be awfully close to 300 W, and would be over if power savings are close to 0 in some worst-case scenario. That is assuming no changes in the ratio of ALU to special-purpose hardware, though all the speculation seems to be upping the ALU load, not reducing it.
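Putting rough numbers on that, using the TDPs quoted above and a purely hypothetical power-scaling factor for the 40nm shrink:

```python
# Rough power-budget sketch using the TDPs quoted above. The 40nm scaling
# factor is a guess for illustration, not a TSMC figure.
tesla_tdp = 160     # W, CUDA-only load (no ROP/raster/texture work)
gaming_tdp = 236    # W, the related GeForce board under a graphics load

fixed_function_share = (gaming_tdp - tesla_tdp) / gaming_tdp
print(f"ROP/raster/texture share of the graphics TDP: ~{fixed_function_share:.0%}")  # ~32%

# Hypothetical: double the ALU complement at 40nm with only a modest
# per-transistor power improvement (10% assumed - pure guesswork).
power_scaling_40nm = 0.9
doubled_cuda_load = 2 * tesla_tdp * power_scaling_40nm
print(f"Doubled CUDA-only load at 40nm: ~{doubled_cuda_load:.0f} W "
      f"(the PCIe board limit is 300 W)")  # ~288 W - uncomfortably close
```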

AMD's slack with an architecture with an even higher ALU:TEX ratio and smaller number of ROPs is much less.


The change from SM3 GPUs to SM4 was way more of an overhaul than any recent change in x86. There's a reason it's taking Intel so long to come up with a competitive GPU - and Larrabee 1, if it's competitive (i.e. on anything that isn't DP), will only be so for 6 months.
G80's purported design cycle took 4 years, if we believe Anandtech.
What number do we give Larrabee, and how much more would that be?

CPU design cycles are in the range of 3-5 years. GPUs seem to be roughly 3 years, then going by recent history we add two or three quarters of delay on top of that.

The IHVs, courtesy of TSMC, are building consumer chips that are far bigger and more complex than Intel's. The memory controllers, alone, are way ahead.
Nvidia's GPUs are not as big as the biggest CPUs Intel has produced.

Intel's advances in power regulation appear to exceed anything GPUs manage to do when they aren't forced by the driver to throttle.

Intel's version of hardware scheduling is much more complex than what either Nvidia or AMD does, and that is part of the reason they can afford the ALU density they have.
Also, I forgot to note that exceptions on the CPUs can be precise.

Really? Run the process's implied density numbers (~400M transistors per 100mm2) for RV770's 956M transistors in 256mm2.
Intel's Penryn is 410M transistors in 110 mm2.
It does lag the implied 55nm figure, at 3.7M/mm2 versus 4M/mm2, yes.
That density figure is somewhat disingenuous, since so much of Penryn is cache, and the logic itself is significantly harder to shrink for a high-speed design with a lot of complexity.

We don't rightfully know what Intel could manage if it tuned the design to target density with relaxed timing and power requirements.
It is also dependent on whether TSMC's 40nm quality is closer in success to 55nm or what R600 encountered on 80nm...

Larrabee's cores are not as complex and there are serious near-term concerns about transistor variation over large chips and leakage without a metal gate stack.

Do you think Larrabee will be getting the best of Intel's process ahead of x86?
Depends if Intel does to 45nm Larrabee what it did to the 45nm Nehalem dual cores.

TSMC's customers will be at 28nm by the end of 2010 in theory (32nm appears to be a "limited capability shrink of 40nm" which gives the impression that it'll be of limited use to AMD/NVidia)...
TSMC's customers should have been at 40nm last year in theory, but maybe we'll see such products Q2 this year instead.

Intel's demoed running Windows on Westmere.
Where's the RV870 running Crysis?

There are 8 stages in ATI GPUs. NVidia's ALUs seem to be in the region of 12 stages. What do you think Larrabee's vector pipeline length will be? Is <=2GHz smoke'n'mirrors on Intel's part?
FP has more stages than the INT pipeline, but the latter has tracked clock speeds more closely for CPUs.
 
Actually, they need to be faster. Because otherwise they'd better use the same AFR-top-end approach as AMD's using. Being slower than 2-chip AFR makes the idea of a big 1-chip top-end pretty pointless for the mass market (for CUDA/OCL/DXCS it may be more interesting).
Ha, well I imagine NVidia's strategy will be a long time adjusting to AMD's - some time after GT300. If ever?

Why did NVidia not launch GTX295 based upon GT200?

Even GT200-sized GT300 will wow some people because they kinda think that GT200 size is a mistake that should be avoided in the future =)
It looks like a mistake if it won't sell for $650.

But I myself expect a smaller than GT200 die from GT300. I think they'll push for more performance/mm^2 this time and that'll probably mean they won't need such a die size in GT300.
I think it has to be >256-bit memory bus - 384 or 512. With 256-bit they can only get to around 192GB/s (6Gbps chips) - which isn't much of a gain on GTX285's 166.4GB/s.
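The arithmetic behind those bandwidth figures, for a few hypothetical bus widths at the 6Gbps data rate mentioned above:

```python
# Peak memory bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps.
# 6 Gbps GDDR5 is the speculative data rate mentioned in the post above.
def bandwidth_gb_s(bus_bits, data_rate_gbps):
    return bus_bits / 8 * data_rate_gbps    # GB/s

for bus in (256, 384, 512):
    print(f"{bus}-bit @ 6 Gbps GDDR5: {bandwidth_gb_s(bus, 6):.0f} GB/s")
# 256-bit -> 192 GB/s (barely ahead of GTX 285's 166.4 GB/s on its 512-bit bus),
# 384-bit -> 288 GB/s, 512-bit -> 384 GB/s.
```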

If performance is going to double then it seems more than 384-bits will be required. I suppose around the size of GT200b - what's the minimum size for a roughly square die with 512-bit bus + PCI Express + NVIO bus?...

It would be fun if NVidia launched a GX2 at the same time, that's for sure.

Anyway, I think with a 512-bit monster chip they'll be fine against anything AMD has. The best AMD can do is double the performance of RV770, whereas for NVidia double GT200b should be the minimum.

Jawed
 