Practicality of a CPU/GPU hybrid in consoles

Capeta · Feb 13, 2007

Shifty Geezer said:
Even if that's possible, why choose that solution over separate dies? I can only see it saving space, and not money or complexity which is the main concern in these consoles.

Bandwidth...

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2925

At the Spring 2005 Intel Developer Forum, Justin Rattner outlined a very serious problem for multi-core chips of the future: memory bandwidth. We're already seeing these problems today, as x86 single, dual and quad core CPUs currently all have the same amount of memory bandwidth. The problem becomes even more severe when you have 8, 16, 32 and more cores on a single chip.

The obvious solution to this problem is to use wider front side and memory buses that run at higher frequencies, but that solution is only temporary. Intel's slide above shows that a 6-channel memory controller would require approximately 1800 pins, and at that point you get into serious routing and packaging constraints. Simply widening the memory bus and relying on faster memory to keep up with the scaling of cores on CPUs isn't sufficient for the future of microprocessors.

So what do you do when a CPU's access to data gets slower and more constrained? You introduce another level in the memory hierarchy of course. Each level of the memory hierarchy (register file, L1/L2/L3 cache, main memory, hard disk) is designed to help mask the latency of accessing data at the level immediately below it. The clear solution to keeping massively multi-core systems fed with data then is to simply put more memory on die, maybe an L4 cache perhaps?

The issue you run into here is that CPU die space is expensive, and the amount of memory we'd need to keep tens of cores fed is more than a few megabytes of cache can provide. Instead of making the CPU die wider, Intel proposed to stack multiple die on top of each other. A CPU die, composed of many cores, would simply be one layer in a chip that has integrated DRAM or Flash or both. Since the per-die area doesn't increase, the number of defects don't go up per die.

Memory bandwidth improves tremendously, as your DRAM die can have an extremely wide bus to connect directly to your CPU cores. Latency is also much improved as the CPU doesn't have to leave the package to get data stored in any of the memory layers.

Obviously there will still be a need for main memory, as Intel is currently estimating that a single layer could house 256MB of memory. With a handful of layers, and a reasonably wide external memory bus, keeping a CPU with tens of cores fed with data now enters the realm of possibility.

It's simply the next logical step - die stacking and SiP

AFAICS a PCB with huge buses is more expensive and complex than chip stacking using large chip level bus interconnects. We are already at a point where we need big heatsinks and heatpipes which are already expensive. Moving to liquid cooling isn't going to be that much more expensive or complex.

Shifty Geezer · Feb 13, 2007

Different problem. Intel are talking about keeping multiple cores fed with data by putting memory on top of the die. Stacking a GPU on a CPU will give you lots of BW between the two, but that's no use. And at considerably more cost (not a concern for high-end CPUs). There remains no reason for combining CPU and GPU into one chip on a launch platform, unless you're launching with old tech.

Gubbi · Feb 13, 2007

Capeta said:
We've had cooling devices that could cool a 200 W chip for years; another 100 W isn't going to magically transform into some quantum effect wall. But to answer your question: a nice heat spreader plus water or liquid alloy cooling technology, coupled to a nice radiator or large heatsink.

Now qualify with cost.

Cheers

3dilettante · Feb 13, 2007

Capeta said:
We've had cooling devices that could cool a 200 W chip for years; another 100 W isn't going to magically transform into some quantum effect wall. But to answer your question: a nice heat spreader plus water or liquid alloy cooling technology, coupled to a nice radiator or large heatsink.

Those 200 W chips have their dies directly pressed against a heatspreader or heatsink.

If heat can't travel through the intervening and actively heating cores, then it doesn't matter what the outer core is attached to.

It's the same reason why the floor doesn't catch on fire when you use the stove.

Capeta · Feb 13, 2007

Shifty Geezer said:
Different problem.

No it's not a different problem.

Stacking a GPU on a CPU will give you lots of BW between the two, but that's no use.

Bandwidth is bandwidth whether it's memory or inter/intra-chip communication. Take a dual core LSI (CPU+GPU) for example. You have 3 options for packaging:

1. 2 cores on a single die
2. 2 separate dies on a single package (SiP/MCM)
3. 2 dies stacked

Option 1 is only viable if the two cores are small otherwise it would be a huge die (expensive).
Option 2 uses up lots of space especially if the chips are big.
Option 3 only uses up as much space as the larger of the two cores.

In the future you will see multiple chips each having multiple cores operating as a single chip. You need high bandwidth between these chips otherwise efficiency goes down the toilet making the whole idea worthless. Why do you think supercomputers use special high bandwidth interconnects? Why not use 10Mbps ethernet between nodes?

And at considerably more cost (not a concern for high-end CPUs).

The costs with this packaging technology will come down very quickly once Intel and others migrate to the new style packaging which WILL happen in the near future.

There remains no reason for combining CPU and GPU into one chip on a launch platform, unless you're launching with old tech.

The reason is bandwidth. You think the bus size between chips will stay the same forever? With stacking you don't need to sacrifice die size like you do with a single die CPU/GPU so performance won't need to be compromised.

Gubbi said:
Now qualify with cost.

Cheers

Cost of what, the cooling system?

3dilettante said:
Those 200 W chips have their dies directly pressed against a heatspreader or heatsink. If heat can't travel through the intervening and actively heating cores, then it doesn't matter what the outer core is attached to.

And what makes you think the intervening layers will be made of thermal insulation? That wouldn't be too smart would it? You don't think memory also gets hot? Why do you think Intel and NEC are moving toward stacking with memory cache? High speed cache doesn't get hot?

3dilettante · Feb 13, 2007

Capeta said:
And what makes you think the intervening layers will be made of thermal insulation? That wouldn't be too smart would it? You don't think memory also gets hot? Why do you think Intel and NEC are moving toward stacking with memory cache? High speed cache doesn't get hot?

If they're made of silicon, and you stack several active processor cores on top of each other, then yes you have layers of insulation. It's actually worse because they are actively heating each other.

Logic and IO is the primary source of power draw. Cache is nowhere in the same league.

Capeta · Feb 13, 2007

The silicon die pad itself has the least thermal resistance. It's not an insulator otherwise you wouldn't be attaching a heat sink directly to the die to get the best cooling performance. IOW you don't connect a heatsink to an insulator.

Shifty Geezer · Feb 13, 2007

Capeta said:
No it's not a different problem.
Bandwidth is bandwidth whether it's memory or inter/intra-chip communication.

Uh? There's no point adding BW between components if it's not needed. Memory BW is always a bottleneck, and that's why Intel are targeting that specifically. Other component<>component BW has never been an issue AFAICT.

You need high bandwidth between these chips otherwise efficiency goes down the toilet making the whole idea worthless.

No, you don't. You need the data to be accessible at speed for the GPU, but that doesn't necessitate a fast CPU write direct to GPU. GPUs do a lot of random data accessing which necessitates data being completely available. If the CPU is to generate a 1024x1024 texture, the whole texture needs to be available. It needs to reside in memory (VRAM or texture cache) and be accessed at speed by the GPU. Direct communication between CPU and GPU is limited. What's far more important is fast memory access to components, with CPU and GPU both having high BW to their external large-volume stores. A GPU on a CPU isn't going to help that at all. The GPU and CPU having their memories sat on top of the logic would help that...and that's exactly what Intel are talking about. Why does the report say 'Intel are putting memory on top of their CPUs' and not 'Intel are putting GPUs on top of their CPUs'? Because they're not, because high BW between CPU and GPU isn't that important. 100GB/s BW between CPU and GPU with 10 GB/s VRAM to GPU is useless compared to 10 GB/s CPU<>GPU and 100 GB/s VRAM.
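A rough way to put numbers on the comparison in the last sentence: the sketch below assumes a hypothetical workload in which the CPU produces some data once, sends it over the CPU<>GPU link, and the GPU then reads it several times from VRAM. The function name, the 1 GB payload and the read count are illustrative assumptions, not measurements of any real console.

```python
def frame_time_s(data_gb, gpu_reads, link_gb_per_s, vram_gb_per_s):
    """Seconds to ship data over the CPU<>GPU link once and then
    have the GPU read it `gpu_reads` times from VRAM (no overlap assumed)."""
    upload = data_gb / link_gb_per_s
    vram_traffic = data_gb * gpu_reads / vram_gb_per_s
    return upload + vram_traffic

# Assumed workload: 1 GB produced by the CPU, read 12 times by the GPU.
DATA_GB, GPU_READS = 1.0, 12

fast_link = frame_time_s(DATA_GB, GPU_READS, link_gb_per_s=100, vram_gb_per_s=10)
fast_vram = frame_time_s(DATA_GB, GPU_READS, link_gb_per_s=10, vram_gb_per_s=100)

print(f"100 GB/s link, 10 GB/s VRAM: {fast_link:.2f} s")
print(f"10 GB/s link, 100 GB/s VRAM: {fast_vram:.2f} s")
```

Under these assumptions the fast-VRAM configuration finishes several times sooner. In this model the two configurations only tie when the GPU reads the data exactly once.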

Why do you think supercomputers use special high bandwidth interconnects? Why not use 10Mbps ethernet between nodes?

That's a silly analogy. We're not talking about separate dies being stuck on 10 Mbps busses. There's 35 GB/s between Cell and RSX on PS3 over a standard bus between separate dies. What reasons do you see for more BW than that between CPU and GPU directly, instead of both working on data from RAM?

3dilettante · Feb 13, 2007

Capeta said:
The silicon die pad itself has the least thermal resistance.

Pure silicon has less than half the thermal conductivity of copper.
The polysilicon and circuit components of a chip have lower conductivity.

If the silicon had ten times the conductivity of copper, it still wouldn't matter if it was spitting out 100W of its own.

Capeta · Feb 13, 2007

Uh? There's no point adding BW between components if it's not needed. Memory BW is always a bottleneck, and that's why Intel are targeting that specifically. Other component<>component BW has never been an issue AFAICT.

Key phrase is "has never been", we're talking about the future not the past. The future is multicore, multichip designs that act as ONE chip.

Why does the report say 'Intel are putting memory on top of their CPUs' and not 'Intel are putting GPUs on top of their CPUs'?

Because Intel is not in the high-performance GPU market? Why do you think AMD/ATI are moving to CPU/GPU hybrids AKA Fusion?

100GB/s BW between CPU and GPU with 10 GB/s VRAM to GPU is useless compared to 10 GB/s CPU<>GPU and 100 GB/s VRAM.

Sounds like you are trying to distort reality to make an iffy point. Going by your logic inner bandwidth between cores in a multicore chip is not important. If that is the case then CELL must be a crap design.

That's a silly analogy. We're not talking about separate dies being stuck on 10 Mbps busses. There's 35 GB/s between Cell and RSX on PS3 over a standard bus between separate dies. What reasons do you see for more BW than that between CPU and GPU directly, instead of both working on data from RAM?

We already know memory bandwidth is a problem. That's not the point since that could be relieved with more levels of cache. Again you're not looking at the FUTURE. Do you know the limits of copper signaling technology? Do you know the difference between ONchip and OFFchip communication? You think the computation abilities of these super multicore chips will dramatically increase yet the buses will stay the same? Why isn't the bandwidth between CELL and RSX 10GB/s or 20GB/s?

3dilettante said:
Pure silicon has less than half the thermal conductivity of copper.
The polysilicon and circuit components of a chip have lower conductivity.

If the silicon had ten times the conductivity of copper, it still wouldn't matter if it was spitting out 100W of its own.

Um..and what relevance does that have? Current chips are made of silicon right? Or are they made of copper? Is there an invisible thermal wall that I'm not aware of?

Shifty Geezer · Feb 13, 2007

Because Intel is not in the high-performance GPU market? Why do you think AMD/ATI are moving to CPU/GPU hybrids AKA Fusion?

As I understand it, for cheap low-end devices, and not cutting edge performance.

Sounds like you are trying to distort reality to make an iffy point. Going by your logic inner bandwidth between cores in a multicore chip is not important. If that is the case then CELL must be a crap design.

We already know memory bandwidth is a problem. That's not the point since that could be relieved with more levels of cache. Again you're not looking at the FUTURE.

The future is not going to be any different from the present unless there's a radical shift in the way GPUs work. You're likening CPU<>GPU BW to multicore BW, but it's not. CPU clusters that share masses of data, like Cell or supercomputers, need masses of data BW for those cores. The CPU<>GPU setup is a producer/consumer relationship where a small amount of the source data is used to create a lot of the product data. A GPU uses lots of BW on the same data over and over again. For procedurally created textures, a CPU would need to create the texture once for the GPU to read that data a dozen times. Thus the GPU BW demands are very high, but the data access from the CPU isn't. This is why developments have seen the GPU paired up with super fast VRAM with limited CPU connections. That's not going to change for a good while because of the nature of the GPU workload. It needs the scene data to be ready for it and stored in RAM for random access. Alternatively you have it requesting data to be created on the fly from the CPU...at which point you're likely better off creating the data on the fly in the GPU's own shaders.
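To make the write-once, read-many point concrete, here is a small sketch of the traffic for the 1024x1024 procedural texture mentioned earlier in the thread. The 4 bytes per texel and the helper name are assumptions for illustration; the dozen reads come from the post above.

```python
def texture_traffic_mib(width, height, bytes_per_texel, gpu_reads):
    """Return (link_mib, vram_read_mib) for one texture that the CPU
    writes once and the GPU reads `gpu_reads` times."""
    size_mib = width * height * bytes_per_texel / 2**20
    return size_mib, size_mib * gpu_reads

# Assumed format: 4 bytes per texel (e.g. 8-bit RGBA), read 12 times.
link_mib, vram_mib = texture_traffic_mib(1024, 1024, bytes_per_texel=4, gpu_reads=12)
print(f"CPU -> GPU: {link_mib:.0f} MiB, GPU reads from VRAM: {vram_mib:.0f} MiB")
```

With these inputs the VRAM side carries twelve times the traffic of the CPU<>GPU link, which is the asymmetry the post describes.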

Capeta · Feb 13, 2007

As I understand it, for cheap low-end devices, and not cutting edge performance.

They are not cutting edge because they are looking at using a single die and need to keep die size down. They are not using die stacking/MCM.

The future is not going to be any different from the present unless there's a radical shift in the way GPUs work. You're likening CPU<>GPU BW to multicore BW, but it's not. CPU clusters that share masses of data, like Cell or supercomputers, need masses of data BW for those cores. The CPU<>GPU setup is a producer/consumer relationship where a small amount of the source data is used to create a lot of the product data. A GPU uses lots of BW on the same data over and over again. For procedurally created textures, a CPU would need to create the texture once for the GPU to read that data a dozen times. Thus the GPU BW demands are very high, but the data access from the CPU isn't. This is why developments have seen the GPU paired up with super fast VRAM with limited CPU connections. That's not going to change for a good while because of the nature of the GPU workload. It needs the scene data to be ready for it and stored in RAM for random access. Alternatively you have it requesting data to be created on the fly from the CPU...at which point you're likely better off creating the data on the fly in the GPU's own shaders.

Actually from what I've seen so far it seems the trend will be more of the CPU functions being added to GPUs->GPGPU (Hybrid CPU/GPU).

3dilettante · Feb 13, 2007

Capeta said:
Um..and what relevance does that have? Current chips are made of silicon right? Or are they made of copper? Is there an invisible thermal wall that I'm not aware of?

Two walls: one physical, the other practical.
Physics is the hard physical wall.
Silicon circuitry's thermal tolerance is the practical wall.

The thermal conductivity of a substance (already lower with silicon than what is used to carry heat in a heatsink) is irrelevant if that substance is also being heated. You only get conduction if there is a difference in temperature between the spot you want cooled and the spot you want the heat to move to.

Unless you think heat flows uphill, a stack of CPU cores will not allow heat to pass from a core hidden behind layers of additional heat sources to the cooler, not until that sandwiched core reaches a higher temp than the cores surrounding it.

This leads to wall number two, where the only way you get decent heat transfer from a core that must go through another ~60 C core is if the core farther from the heatsink is way hotter. Silicon chips can be expected to die almost immediately once you cross 120 C.
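The argument above can be sketched with a lumped, one-dimensional thermal model in which all heat from the die farther from the heatsink has to pass through the nearer die. The resistances, sink temperature and power figures below are illustrative assumptions chosen so that a single die sits at about 60 C; they are not data for any real chip.

```python
def stacked_die_temps(t_sink_c, p_near_w, p_far_w, r_near_k_per_w, r_between_k_per_w):
    """Lumped 1-D model: all heat from the far die must pass through
    the near die before reaching the heatsink."""
    t_near = t_sink_c + (p_near_w + p_far_w) * r_near_k_per_w
    t_far = t_near + p_far_w * r_between_k_per_w
    return t_near, t_far

# Illustrative values only (assumptions, not measurements).
T_SINK, R_NEAR, R_BETWEEN = 40.0, 0.2, 0.1

single, _ = stacked_die_temps(T_SINK, 100.0, 0.0, R_NEAR, R_BETWEEN)
near, far = stacked_die_temps(T_SINK, 100.0, 100.0, R_NEAR, R_BETWEEN)
print(f"single die: {single:.0f} C, stacked near: {near:.0f} C, stacked far: {far:.0f} C")
```

In this model a better bonding layer lowers `r_between_k_per_w`, but the near die still has to carry the far die's heat on top of its own, so both stacked dies run hotter than the single die. How much hotter depends entirely on the assumed resistances.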

Shifty Geezer · Feb 13, 2007

Capeta said:
They are not cutting edge because they are looking at using a single die and need to keep die size down. They are not using die stacking/MCM.

Yes. But that's neither here nor there. You can't take Fusion as an example of combining CPU and GPU to get higher bandwidth and performance. No one at the moment is doing that or has a roadmap for that. No-one, anywhere, is saying 'we want GPUs and CPUs to have massive interconnect bandwidth way beyond what you can get from separate dies.'

Again, if you think it's an issue, explain how.

Actually from what I've seen so far it seems the trend will be more of the CPU functions being added to GPUs->GPGPU (Hybrid CPU/GPU).

Which still doesn't warrant a CPU+GPU hybrid chip. These GPGPU designs are about extending the logic of the GPU, and not strapping a CPU part to a GPU part.

What you suggest would work for creating a larger chip. That is, for a mammoth GPU, you could get two GPU dies that share data and function and stack them so they have fast die<>die communication. You could do the same for CPUs and have a large multicore CPU that's stacked rather than fabbed from a single large die. You could put RAM local to the die to get fast memory access. I'd go with any of those ideas. CPU and GPU stacked makes no sense to me though. You're making a hotter, pricier chip than using two separate dies (at least until prices drop below the ever increasing costs of large and fast mobo busses) for no advantage whatsoever. Unless you can convince me that CPU<>GPU speed is going to prove paramount.

Capeta · Feb 13, 2007

3dilettante said:
Two walls: one physical, the other practical.
Physics is the hard physical wall.
Silicon circuitry's thermal tolerance is the practical wall.

The thermal conductivity of a substance (already lower with silicon than what is used to carry heat in a heatsink) is irrelevant if that substance is also being heated. You only get conduction if there is a difference in temperature between the spot you want cooled and the spot you want the heat to move to.

Huh? Two pieces of silicon slapped together offer nearly the same thermal resistance as a single piece of silicon. This all depends on the bonding material and the total thickness of the dies. If the bonding material is highly conductive, e.g. at least as conductive as silicon, then the stacked dies will have the same resistance as the non-stacked die, assuming both have the same total thickness.

For example, two 4-layer dies stacked together would be equal in thermal resistance to an 8-layer die of the same total thickness, assuming the bonding material between the stacked dies is less than or equal in thermal resistance to the silicon.

Unless you think heat flows uphill, a stack of CPU cores will not allow heat to pass from a core hidden behind layers of additional heat sources to the cooler, not until that sandwiched core reaches a higher temp than the cores surrounding it.

In a solid heat flows in all directions from areas of high heat concentration to areas of low heat concentration. If you have two stacked dies the heat will flow from the hotter chip to the cooler one, however with the heatspreader/heatsink attached to the hotter die, the two dies will be at near thermal equilibrium.

This leads to wall number two, where the only way you get decent heat transfer from a core that must go through another ~60 C core is if the core farther from the heatsink is way hotter. Silicon chips can be expected to die almost immediately once you cross 120 C.

Why would the chip be running at 120C if you're cooling it with a capable cooling system?

3dilettante · Feb 14, 2007

Capeta said:
Huh? Two pieces of silicon slapped together offer nearly the same thermal resistance as a single piece of silicon. This all depends on the bonding material and the total thickness of the dies. If the bonding material is highly conductive, e.g. at least as conductive as silicon, then the stacked dies will have the same resistance as the non-stacked die, assuming both have the same total thickness.

For example, two 4-layer dies stacked together would be equal in thermal resistance to an 8-layer die of the same total thickness, assuming the bonding material between the stacked dies is less than or equal in thermal resistance to the silicon.

Have you ever wondered why a bowl of soup if left in a bowl stays hot for a long time, but it gets cold faster if on a wide plate, or even faster if you throw it on the floor?

Heat transfer is not instantaneous, cooler efficiency is not perfect, and temperature distribution is not uniform. Current top of the line chips need to have the active transistor layer right next to the cooler.
An entire hot core in the way does not help.

In a solid heat flows in all directions from areas of high heat concentration to areas of low heat concentration. If you have two stacked dies the heat will flow from the hotter chip to the cooler one, however with the heatspreader/heatsink attached to the hotter die, the two dies will be at near thermal equilibrium.

The rate of flow is determined in part by the difference in temperature. The difference between a single core and a heatsink is a significant factor.
For the stacked cores, it's the difference between the second chip and the actively dissipating first chip, and then the heatsink.

Due to non-zero resistance and the active power output of the other core, the thermal equilibrium point for areas most distant from the cooler will be higher. You can't add the equivalent of an electric blanket on top of one core and expect it to be cooled as efficiently.

In addition, the two cores will have the effective heat transfer area to the heatsink of a single core.
High power chips have been known to overheat when the layer of thermal compound is a mere fraction of a chip thickness too thick, and thermal compound does not actively draw power.

Which do you think is easier to cool, two 100 W, 100 mm² chips with a combined transfer area of 200 mm², or a single stacked unit at 100 mm² and 200 W?
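The arithmetic behind that question is just power divided by the area in contact with the cooler. A minimal sketch, using the figures from the question and a hypothetical helper name:

```python
def heat_flux_w_per_mm2(power_w, contact_area_mm2):
    """Average heat flux through the surface that touches the cooler."""
    return power_w / contact_area_mm2

# Two separate 100 W dies, each with its own 100 mm^2 of contact area.
side_by_side = heat_flux_w_per_mm2(2 * 100, 2 * 100)
# The same 200 W stacked behind a single 100 mm^2 footprint.
stacked = heat_flux_w_per_mm2(200, 100)

print(f"side by side: {side_by_side} W/mm^2, stacked: {stacked} W/mm^2")
```

The stacked unit has to push twice the heat through each square millimetre of contact area, before counting any extra resistance between the layers.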

Why would the chip be running at 120C if you're cooling it with a capable cooling system?

I suppose if you paste a sub-ambient cooler on the other chip and run it very cold, the more distant chip might operate within its specified temperature range. The problems associated with wild temperature variations on electrical and physical characteristics would be much worse.
