NVIDIA Kepler speculation thread

Hmmm, really? I didn't know that. Sorry, my mistake. I guess that's why people are calling it an APU then.

However, I still think it is really just semantics. What is the difference between a powerful discrete GPU and a powerful integrated APU? Their place relative to the CPU? Well, that's not really an earth-shattering difference.

The fact of the matter is, so long as consoles use decent or powerful GPUs (integrated or not), PCs will need decent discrete GPUs to match and exceed them. We all know PCs lag behind on the optimization and utilization front, so they will need to compensate for that with more powerful hardware.

The difference is bandwidth, mostly. Wide GDDR5 buses give discrete chips a very substantial advantage. Presumably, very wide I/O using an interposer could negate this, at least partially, but that's speculative at this point.
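To put rough numbers on the bandwidth gap, here is a back-of-the-envelope sketch; the bus widths and data rates are typical illustrative values of the period, not any specific product. Peak DRAM bandwidth is roughly bus width times effective per-pin data rate:
[code]
// Rough peak-bandwidth comparison (illustrative figures, not measurements).
#include <cstdio>

int main() {
    // Discrete card: 256-bit GDDR5 at an effective 6.0 Gbps per pin.
    double discrete_gbs = (256.0 / 8.0) * 6.0;    // 32 bytes per transfer * 6 GT/s = GB/s
    // APU on dual-channel DDR3-1866: a 128-bit bus at ~1.87 Gbps per pin.
    double apu_gbs = (128.0 / 8.0) * 1.866;
    printf("discrete ~%.0f GB/s vs. APU ~%.0f GB/s\n", discrete_gbs, apu_gbs);
    return 0;
}
[/code]
That works out to roughly 192 GB/s versus 30 GB/s, which is the gap an interposer with very wide I/O would have to close.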
 
Hmmm, really? I didn't know that. Sorry, my mistake. I guess that's why people are calling it an APU then.

However, I still think it is really just semantics. What is the difference between a powerful discrete GPU and a powerful integrated APU? Their place relative to the CPU? Well, that's not really an earth-shattering difference.

The fact of the matter is, so long as consoles use decent or powerful GPUs (integrated or not), PCs will need decent discrete GPUs to match and exceed them. We all know PCs lag behind on the optimization and utilization front, so they will need to compensate for that with more powerful hardware.

Sony and Microsoft will be using SoCs; people are using AMD's own description for it. Of course there is a difference between a discrete CPU/GPU and a SoC, especially in a closed system like a console.

As for the low-level API advantage you might be hinting at in your second paragraph, it's my understanding that it applies mostly to console CPUs.
 
No one will ever use the term "powerful integrated APU". First of all, because the nature of the APU itself already means that it would suck very badly in comparison to a normal discrete CPU + GPU. There are too many bottlenecks that appear when you put them on a single die.
You probably mean "powerful integrated GPU in an APU".

Upcoming console SoCs obviously aren't as powerful as a solution built from discrete units could be, but you still can't find a SoC as powerful in the PC/notebook markets yet either. They'll be the strongest SoCs commercially available at launch. While there might be bottlenecks for a SoC, are you completely sure that the disadvantages outweigh the advantages?

That is true. The industry is evil enough to unleash games which run artificially very badly on existing hardware in order to push sales... :rolleyes:
The majority of the industry is moving in the mobile/SoC direction anyway. By the way, I thought this was a Kepler thread *shrugs*
 
Upcoming console SoCs obviously aren't as powerful as a solution built from discrete units could be, but you still can't find a SoC as powerful in the PC/notebook markets yet either. They'll be the strongest SoCs commercially available at launch. While there might be bottlenecks for a SoC, are you completely sure that the disadvantages outweigh the advantages?

The majority of the industry is moving in the mobile/SoC direction anyway. By the way, I thought this was a Kepler thread *shrugs*
Somewhat of an aside, but I've thought for a long time that the limitation in interconnect bandwidth compared to processing power may, in time, make SoCs higher-performing than discrete units. I will be very interested to see if that happens down the road...
 
I suspect that as the GPU programming model evolves, the importance of the interconnect between GPU and CPU will diminish, as the GPU becomes able to run more or less the whole program rather than ping-ponging small portions back and forth. Another factor is that with an existing code base, you're likely only going to be moving things over one piece at a time, rather than rewriting the whole thing at once, meaning that current programs will be doing more transfers than they have to.
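To make the ping-pong pattern concrete, here is a minimal CUDA sketch; the kernel and loop structure are hypothetical stand-ins, not anyone's actual code. The first loop pays two PCIe transfers for every small piece of GPU work, while the second keeps the data resident on the GPU and touches the interconnect only twice:
[code]
// Ping-pong transfers vs. GPU-resident data (illustrative only).
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void step_kernel(float *data, int n) {   // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    float *d;
    cudaMalloc(&d, bytes);

    // Ping-pong style: copy in and out around every small kernel launch.
    for (int iter = 0; iter < 10; ++iter) {
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        step_kernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        // ... host-side work on h between kernels ...
    }

    // GPU-resident style: one copy in, run the whole loop on the GPU, one copy out.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    for (int iter = 0; iter < 10; ++iter)
        step_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    free(h);
    return 0;
}
[/code]
The more of the program that can stay in the second shape, the less the CPU-GPU interconnect matters.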
 
I suspect that as the GPU programming model evolves, the importance of the interconnect between GPU and CPU will diminish, as the GPU becomes able to run more or less the whole program rather than ping-ponging small portions back and forth. Another factor is that with an existing code base, you're likely only going to be moving things over one piece at a time, rather than rewriting the whole thing at once, meaning that current programs will be doing more transfers than they have to.

I suspect Chalnoth might have had Dally's Exascale paradigm in mind: http://www.nvidia.com/content/GTC/documents/SC09_Dally.pdf
 
I suspect Chalnoth might have had Dally's Exascale paradigm in mind: http://www.nvidia.com/content/GTC/documents/SC09_Dally.pdf
I actually wasn't. I was more going by the idea that it's becoming increasingly easy to create more and more processing power within a chip, while bandwidth between chips seems to be increasing more slowly.

The statements on power limitation in that presentation are very intriguing, however, as that would impose an even harder limit.

Anyway, it is very true that programming paradigms can help to relieve bandwidth usage, but that isn't going to work for all applications.
 
Somewhat of an aside, but I've thought for a long time that the limitation in interconnect bandwidth compared to processing power may, in time, make SoCs higher-performing than discrete units. I will be very interested to see if that happens down the road...

For GPGPU, that's quite possible. For graphics, I expect external memory bandwidth to remain the determining factor. I think Llano and Trinity prove this: they tend to be slower than similar discrete GPUs with higher bandwidth.
 
For GPGPU, that's quite possible. For graphics, I expect external memory bandwidth to remain the determining factor. I think Llano and Trinity prove this: they tend to be slower than similar discrete GPUs with higher bandwidth.
I think that's generally part of the exact same thing I'm talking about: in time, SoCs may have their primary video memory integrated on-chip.

The main problem with this is that it concentrates all of the power generation in one area, whereas a design with multiple chips spreads the power out a bit. So it's entirely possible that high-end designs will always have multiple chips, simply because they can consume more power. But it still would be very interesting to see SoCs become much more competitive relative to more traditional designs, which I do expect to happen at any rate.
 
I think that's generally part of the exact same thing I'm talking about: in time, SoCs may have their primary video memory integrated on-chip.

The main problem with this is that it concentrates all of the power generation in one area, whereas a design with multiple chips spreads the power out a bit. So it's entirely possible that high-end designs will always have multiple chips, simply because they can consume more power. But it still would be very interesting to see SoCs become much more competitive relative to more traditional designs, which I do expect to happen at any rate.

I don't know if multiple sources of heat are really that much of an advantage. Sure, they distribute the heat around, but they also mean you need multiple cooling systems. With only one, perhaps you can actually dissipate more heat with the same budget. But I'm purely speculating here; I don't have anything solid to back this idea up.

However there is one nice power advantage to having everything on a single die: dynamic, low-latency, multi-directional power management. All other things being approximately equal, this can provide substantial performance benefits, perhaps more so than low-latency communication, at least for some workloads.
 
What is the difference between a powerful discrete GPU and a powerful integrated APU?

The difference is bandwidth, mostly.

I have always taken this topic with regard to absolute maximum performance.
Here, the real difference is the number of transistors. With APUs you are limited: you can't fit a 400-500 mm² iGPU part and a 300-400 mm² iCPU part on the same die.

So, that's why:

So it's entirely possible that high-end designs will always have multiple chips, simply because they can consume more power.

multiple sources of heat are really that much of an advantage

Multiple sources of performance...
 
This is the Kepler speculation thread

Please post APU and AMD questions/answers in their appropriate forum thread.

It is hard to find posts about Kepler here, and I just wasted my time thinking new posts were about Kepler, only to find that none were.
 
The difference is bandwidth, mostly. Wide GDDR5 buses give discrete chips a very substantial advantage. Presumably, very wide I/O using an interposer could negate this, at least partially, but that's speculative at this point.
Correct me if I am wrong, but I don't think consoles ever enjoyed large bandwidth to begin with.
As for the low-level API advantage you might be hinting at in your second paragraph, it's my understanding that it applies mostly to console CPUs.
Both CPUs and GPUs. Maybe one more than the other, but both are affected.
 
Given that there is a direct relationship between distance traveled and energy consumed (and heat generated), your "additional performance" may end up being significantly less than the sum of its parts. IOW, less energy spent moving data around means more energy available to do actual work.
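As a rough illustration of how much of the power budget data movement can eat (the pJ/bit figure is an order-of-magnitude value often quoted for GDDR5-class interfaces, not a measurement):
[code]
// Energy cost of off-chip traffic, order-of-magnitude only.
#include <cstdio>

int main() {
    double traffic_bytes_per_s = 200e9;   // 200 GB/s of off-chip bandwidth
    double pj_per_bit = 20.0;             // ~15-20 pJ/bit is often cited for GDDR5 I/O
    double watts = traffic_bytes_per_s * 8.0 * pj_per_bit * 1e-12;
    printf("~%.0f W spent just moving 200 GB/s across the pins\n", watts);
    // The same traffic over on-die wires costs a small fraction of that,
    // which is the distance-vs-energy point above.
    return 0;
}
[/code]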


UniversalTruth said:
With APUs you are limited: you can't fit a 400-500 mm² iGPU part and a 300-400 mm² iCPU part on the same die
Sure you can. You just need a process shrink to do it. If your [strike]CPU[/strike] LOC (latency-optimized core) transistor budget is relatively fixed, or at least growing at a significantly lower rate than your [strike]GPU[/strike] TOC (throughput-optimized core) transistor budget, then the LOCs effectively become "cheaper" over time.
 
It's not about size, it's about TDP and how much is likely to be the maximum spend on graphics. There is almost no good reason why AMD wouldn't end up with a 50W CPU / 50W IGP part, so long as the bandwidth problem is fixed. They could even go as far as 25W CPU / 75W IGP.

The only problem is, the market for such a part appears to be shrinking quite rapidly...and that's as good a reason as any why we might not see this part. Still, when you look at the Temash demo from CES it appears that "good enough" will be hitting the sub-5W parts before long. This is relevant to both Nvidia and AMD, and it should be interesting to see them go head to head here.
 
I don't know if multiple sources of heat are really that much of an advantage. Sure, they distribute the heat around, but they also mean you need multiple cooling systems. With only one, perhaps you can actually dissipate more heat with the same budget. But I'm purely speculating here; I don't have anything solid to back this idea up.

However there is one nice power advantage to having everything on a single die: dynamic, low-latency, multi-directional power management. All other things being approximately equal, this can provide substantial performance benefits, perhaps more so than low-latency communication, at least for some workloads.
It is physically more difficult to dissipate heat from a smaller heat source than a larger one, for the simple reason that there is less available contact area. This effect will operate at the connection between the chip and the heat sink: the smaller the chip is, the less efficiently it will transfer heat to the heat sink. So if you can distribute the power generation across a number of different chips, with each one having a lower power density than the one big one, but greater total power generation, you can dissipate that heat with a less efficient connection between the heat sink and the chip (e.g. you might be able to get away with not using copper for the heat sink's base).
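A simple conduction model makes the contact-area point concrete: the thermal resistance of the die-to-heatsink interface scales roughly as R ≈ t / (k·A), so halving the contact area doubles the temperature rise across it for the same power. The thickness and conductivity below are assumed, typical-TIM values:
[code]
// Die-to-heatsink interface resistance vs. contact area (assumed values).
#include <cstdio>

int main() {
    double t = 50e-6;          // 50 um of thermal interface material (assumed)
    double k = 5.0;            // W/(m*K), a decent TIM (assumed)
    double a_large = 500e-6;   // 500 mm^2 of contact area, in m^2
    double a_small = 250e-6;   // 250 mm^2 of contact area, in m^2
    printf("R(500 mm^2) = %.3f K/W\n", t / (k * a_large));  // ~0.020 K/W
    printf("R(250 mm^2) = %.3f K/W\n", t / (k * a_small));  // ~0.040 K/W
    return 0;
}
[/code]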
 
It is physically more difficult to dissipate heat from a smaller heat source than a larger one, for the simple reason that there is less available contact area. This effect will operate at the connection between the chip and the heat sink: the smaller the chip is, the less efficiently it will transfer heat to the heat sink. So if you can distribute the power generation across a number of different chips, with each one having a lower power density than the one big one, but greater total power generation, you can dissipate that heat with a less efficient connection between the heat sink and the chip (e.g. you might be able to get away with not using copper for the heat sink's base).

Well, yes, but if you have 4 cores and a GPU of, say 20 Compute Units or SMXs or whatever, why should the total die area be smaller just because you put everything on the same die?

Granted, you should be able to remove some redundant hardware, such as in the memory controller(s), but that would reduce power too.

So is it easier to dissipate heat from a 250mm² CPU and a 250mm² GPU drawing 50W each, or from a 500mm² APU drawing 100W? Is it better to have two heatsinks or a single, large one? Is it cheaper to have two fans or a single, large one? Etc.
 
Well, yes, but if you have 4 cores and a GPU of, say 20 Compute Units or SMXs or whatever, why should the total die area be smaller just because you put everything on the same die?
The main thing I was thinking is that the total size of a single die is largely limited by engineering concerns: beyond a certain size, yields start to fall significantly. So what you tend to see is die sizes hitting that limit, and designs going from one die of a certain size to two or more of roughly the same size. Thus the distributed design tends to have a much greater total die area available to it.
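To put the yield concern in rough numbers, here is a toy Poisson yield model (yield ≈ exp(-defect density × area)); the defect density is an assumed, illustrative figure:
[code]
// Toy yield model: why two medium dies beat one huge die (assumed defect density).
#include <cstdio>
#include <cmath>

int main() {
    double d0 = 0.4;                  // defects per cm^2 (assumed)
    double y_500 = exp(-d0 * 5.0);    // one 500 mm^2 die
    double y_250 = exp(-d0 * 2.5);    // one 250 mm^2 die
    printf("500 mm^2 die: %.0f%% yield\n", 100.0 * y_500);   // ~14%
    printf("250 mm^2 die: %.0f%% yield\n", 100.0 * y_250);   // ~37%
    // Two 250 mm^2 dies deliver the same total silicon at far better yield.
    return 0;
}
[/code]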

So if each individual chip is limited in its ability to dissipate heat efficiently, then you should be able to build a more powerful system by sharing the processing among a larger number of chips: the larger number of chips will be less efficient and consume more total power, but the dramatic increase in die area should allow for higher total performance.

Unless, that is, the inefficiencies from having the processing distributed across different chips outweigh the benefit of the additional silicon available. This might happen if total power consumption is the most pressing issue, or it might happen if the chips become so much faster than the communications buses that the chips are continually starved for data to work on.

As chips become denser and denser, and power constraints become more and more of an issue, I do wonder if the whole industry will move to SoC designs.
 
The main thing I was thinking is that the total size of a single die is largely limited by engineering concerns: beyond a certain size, yields start to fall significantly. So what you tend to see is die sizes hitting that limit, and designs going from one die of a certain size to two or more of roughly the same size. Thus the distributed design tends to have a much greater total die area available to it.

So if each individual chip is limited in its ability to dissipate heat efficiently, then you should be able to build a more powerful system by sharing the processing among a larger number of chips: the larger number of chips will be less efficient and consume more total power, but the dramatic increase in die area should allow for higher total performance.

Unless, that is, the inefficiencies from having the processing distributed across different chips outweigh the benefit of the additional silicon available. This might happen if total power consumption is the most pressing issue, or it might happen if the chips become so much faster than the communications buses that the chips are continually starved for data to work on.

As chips become denser and denser, and power constraints become more and more of an issue, I do wonder if the whole industry will move to SoC designs.

I would say that yes, it will. In hindsight, this has been the trend for decades.

FPUs and successive levels of cache were integrated into CPUs, networking chips were integrated into motherboard chipsets, along with sound chips, USB controllers, etc. Then northbridges (i.e. memory controllers) were integrated into CPUs, followed by PCI-Express controllers and then GPUs.

Now AMD's Kabini is about to be launched, and it's a full SoC. Haswell will not include anything on die that Ivy doesn't (I think) but it will have on-package VRMs. And apparently, there may be some SKUs with fast DRAM on an interposer.

So mainstream CPUs are looking more and more like SoCs. Meanwhile, ARM SoCs are getting more and more powerful, to the point that they can adequately power not just tablets but also light notebooks. Even large notebooks typically have no use for chips bigger than ~200mm², whether they're pure CPUs or pure GPUs, while only APUs exceed this size, though they remain below 300mm². The only reason they still feature discrete graphics is memory bandwidth. When that hurdle is overcome, discrete graphics in notebooks will be a thing of the past.

As long as die sizes remain a constraint on desktops, discrete graphics will make sense. Yet how long will that be? Processes continue to scale well in density, but not in power. What use are more billions of transistors if you can't power them? Bill Dally recently stated that current GPUs are power-constrained more than transistor-constrained. When he said that, GF100 was NVIDIA's 550mm² flagship, which made the statement rather strange, but I imagine he was saying this from his own perspective, which is that of someone looking at designs 2~6 years down the pipeline. So perhaps the days of the discrete GPU are numbered, even on desktops.
 