Don't forget that while you may save some bucks on the smaller GPUs in an mGPU card, you're still burning much more on doubled memory size and a more complex PCB/power/cooling solution.
So what? X2 cards of any sort sell for huge amounts of money, with ludicrous margins to spread all around. Unless your single-chip performance just sucks, in which case you have bigger problems. Nobody needs to optimize high-end cards for cost. Why do you think NV fully specifies the high-end, including the cooling? Because if folks in Taiwan try to cut costs, they will create problems further down the road.
If you look at the strategy for high-end cards, it doesn't involve optimizing for cost. Cooling >130W is quite expensive, and routing that much power involves a PCB with many layers, tons of caps, VRMs, etc.
Generally "one big GPU" cards are simply more effective than any mGPU solution we've seen to date. And that's true not only in the cost of producing cards, but in performance and features too.
RV770x2 vs GT200 is essentially the opposite of RV670x2 vs G80 -- the second scenario was a bit unfair to mGPU cards because RV670 was kinda bad, and now we have a scenario which is, in my opinion, unfair to "big single GPU" cards because GT200 is kinda bad.
You totally missed the big picture. The super high-end of the market that buys a GTX 280 or RV770x2 is minuscule by volume and has almost zero impact on overall profits; it's the halo effect that's useful. GPU vendors make most of their money on pro GPUs and GPUs in the $100-250 range.
So that high-end card doesn't actually ship in volume until it's been shrunk to the next process and becomes mid-range. And in semiconductors, volume is king, because volume determines how far your NRE gets amortized.
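The amortization arithmetic is simple enough to sketch. Everything below is hypothetical for illustration -- the NRE, unit cost, and volume figures are made up, not real NV/AMD numbers:

```python
# Hypothetical sketch: why volume is king in semiconductors.
# NRE (design, validation, mask sets) is a fixed cost, so the
# effective cost per die only collapses at mid-range volumes.

def cost_per_die(nre_dollars, unit_cost, volume):
    """Total cost per die once fixed NRE is spread over volume."""
    return nre_dollars / volume + unit_cost

NRE = 50e6   # hypothetical design + mask-set cost
UNIT = 80.0  # hypothetical marginal cost of one big die

for volume in (100_000, 5_000_000):  # halo volume vs mid-range volume
    total = cost_per_die(NRE, UNIT, volume)
    print(f"{volume:>9,} units -> ${total:,.0f} per die")
# At halo volumes the NRE dominates; at mid-range volumes it vanishes.
```

Same chip, same NRE -- but the halo part carries hundreds of dollars of fixed cost per die while the mid-range shrink carries almost none.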
Yes a single GPU may be more efficient, but only in a narrow and uninteresting sense...and I'm not even convinced that single GPUs are necessarily more efficient (more on that later).
I'm quite sure that nothing will ever bring mGPU solutions to the level of single-GPU solutions in terms of flexibility and efficiency.
Flexibility is tricky, since SLI/XF are software-visible hacks that require changing your app.
How do you define efficiency?
Frankly, if you look at good CPU architectures, it's quite easy to see that DP servers are pretty much exactly as efficient as a single-socket server for many workloads, and hence are the sweet spot for efficiency (e.g. ~95% scaling).
GPU workloads are by definition trivially parallel, so it's quite easy to see how a dual chip approach would be just as efficient. Both from a performance and power/cost standpoint.
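That ~95% scaling claim is just Amdahl's law with a small serial fraction. A minimal sketch -- the 5% unscalable fraction below is an illustrative assumption, not a measured figure:

```python
# Hedged sketch: parallel efficiency of a dual-chip solution on an
# (almost) embarrassingly parallel workload, via Amdahl's law.

def speedup(serial_fraction, n_chips):
    """Amdahl's law: speedup from n chips given an unscalable fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_chips)

def efficiency(serial_fraction, n_chips):
    """Speedup per chip, i.e. how close to perfect scaling we get."""
    return speedup(serial_fraction, n_chips) / n_chips

# A workload with ~5% unscalable work, split across 2 chips:
print(f"{efficiency(0.05, 2):.0%}")
```

With a 5% serial fraction the two-chip efficiency lands at ~95%, matching the DP-server analogy above; drive the serial fraction toward zero (as graphics workloads allow) and the dual-chip card approaches perfect scaling.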
An interesting area for mGPU cards lies a bit higher than where AMD is putting its RV670/770 GPUs -- say you have a GPU with performance between the mid-range and high-end class. There may be a window where you can make an mGPU card with two such GPUs which cannot be challenged by any single big GPU, simply because such a GPU cannot be built (due to technical limitations). Anything higher will be too complex to use in mGPU cards; anything lower will be beaten by single big-GPU boards.
This window is where AMD is with RV770x2 essentially, but only WRT performance levels, not die complexity.
Yes, that's interesting. But what else is interesting is having a much more highly optimized card to serve the $100-250 market, where you can kick ass AND make mad money because your die size is way smaller.
The biggest single advantage of a monolithic GPU is that using multiple GPUs for general-purpose workloads is hopeless, because the programming model (i.e. no coherency) is awful. NV has to produce large monolithic GPUs to make GPGPU interesting and get sufficient performance gains over a standard dual-socket server.
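A rough sketch of why no coherency hurts: any intermediate result produced on one GPU but consumed by the other must be explicitly copied over the interconnect and stored twice, a cost a coherent shared-memory system never pays. The bus rate and transfer size below are hypothetical placeholders, not real hardware figures:

```python
# Hypothetical sketch: the tax of a non-coherent dual-GPU model.
# Each GPU has private memory; sharing data means an explicit copy
# over the interconnect, unlike a coherent dual-socket CPU server.

BUS_GBPS = 4.0  # hypothetical interconnect bandwidth, GB/s

def transfer_seconds(bytes_moved, gbps=BUS_GBPS):
    """Time spent just moving data between the two GPUs' memories."""
    return bytes_moved / (gbps * 1e9)

# A 256 MB intermediate result produced on GPU 0 but needed on GPU 1
# must be copied (costing time) and held in both memories (costing space):
copy_ms = transfer_seconds(256 * 2**20) * 1e3
print(f"explicit copy cost: {copy_ms:.1f} ms")
```

The GPGPU programmer has to schedule and hide every one of these copies by hand; on one big GPU the result is simply already there.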
Well, people tend to expect too much. But why wouldn't a GT30x mid-range part with GT200+ performance and DX10.1/11 support at a $250 price point be a performance bomb anyway? After all, that's exactly what's expected from AMD's RV870.
What's the problem with doing everything that AMD does PLUS doing a single big GPU AFTER you've done what AMD did? I don't see how any kind of power usage would be a problem here.
It's a huge waste of money and engineers time. Next question?
DK