Llano IGP vs SNB IGP vs IVB IGP

Karoshi · Nov 12, 2010

Why are there no APU GPUs running at 2+ GHz?

DavidC · Nov 15, 2010

Karoshi said:
Why are there no APU GPUs running at 2+ GHz?

It's all about balance. Remember, we aren't talking about the 1980s, when components drew 3W and got by with passive cooling. We are already limited by cooling and power consumption.

It's probably better to get 400SPs at 650MHz than 200SPs at 1300MHz. GPU code has extremely high parallelism, so adding more SPs is easier than clocking them higher.
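A rough way to see the trade-off, as a minimal sketch: it assumes dynamic power scales with units x V^2 x f and that supply voltage has to rise roughly linearly with clock speed. Both are simplifications rather than vendor figures, and the function names are made up for illustration.

```python
# Toy model only: dynamic power ~ units * V^2 * f, with V assumed proportional to f.
# All values are relative; nothing here is measured data.

def relative_power(num_sps, clock_mhz, base_clock_mhz=650.0):
    """Power relative to a single SP running at base_clock_mhz (hypothetical model)."""
    scale = clock_mhz / base_clock_mhz   # frequency ratio versus the base clock
    voltage = scale                      # assumption: voltage scales linearly with frequency
    return num_sps * voltage ** 2 * scale

def relative_throughput(num_sps, clock_mhz):
    """Peak throughput in SP-MHz, assuming perfectly parallel work."""
    return num_sps * clock_mhz

wide = relative_power(400, 650)    # 400 SPs at 650MHz
fast = relative_power(200, 1300)   # 200 SPs at 1300MHz

print(relative_throughput(400, 650) == relative_throughput(200, 1300))
print(fast / wide)
```

Under these assumptions both configurations have the same peak throughput, but the narrow, fast one burns several times the power. Real silicon will not follow the model exactly, since voltage does not scale perfectly linearly and leakage is ignored.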

Nvidia does have high clock speeds for its SPs, but again, it's just for the SPs. All other blocks clock much lower. ATI's design calls for everything to run at the base clock. I guess they can change that, but it's not something that'll happen overnight.

Even if the process technology, thermal and power limits, and costs of development allow clocking the GPU at 2GHz, does the design allow it?

hkultala · Nov 15, 2010

It seems Intel is finally at least developing an OpenCL implementation for their integrated GPUs:

They just sent an email to the llvm-developers list, recruiting people to work on their LLVM-based OpenCL implementation:

Intel recruitment email said:
LLVM Software engineer at Intel,CA(Santa Clara or Folsom)

In this position, you will be responsible for designing and developing highly competitive OpenCL (Open Compute Language, a new industry standard for heterogeneous data and task parallel computing across GPU's and CPU's). You will be supporting on integrated graphics processors. This includes a JIT compiler, a library of built-in functions and OpenCL runtime driver support. Responsibilities (depending on your skill set) will include applying state of the art compilation/JIT technology, knowledge of high performance math algorithms and system architecture skills to allow applications to tap into the computation power of GPUs previously only available to graphics applications ....

rpg.314 · Dec 30, 2010

http://www.semiaccurate.com/2010/12/29/intel-puts-gpu-memory-ivy-bridge/

Interesting.

GZ007 · Dec 30, 2010

rpg.314 said:
http://www.semiaccurate.com/2010/12/29/intel-puts-gpu-memory-ivy-bridge/

Interesting.

The size and the bandwidth don't sound too realistic to me.
Wouldn't they already be using it in server CPUs if they could get 1 GB of memory at 5770 speeds into the Ivy Bridge design?

rpg.314 · Dec 30, 2010

Not quite. Graphics can live with memory under 1 gig. Server workloads can't. Also, it is not immediately clear that it will have lower latency than normal DRAM, though it is likely. Besides, you would want to start with a lower-risk product.

Arun · Dec 30, 2010

512-bit LPDDR2 stacks? That's unlikely although not strictly impossible. Also describing LPDDR2 as "old" when there's barely any smartphone using it today proves only that Charlie doesn't know enough about that part of the market to speculate intelligently about it. It's interesting that nobody is thinking of doing that kind of bus width before the JEDEC Wide I/O standard with TSV (Through Silicon Vias i.e. 3D packaging) but Intel is hardly using a traditional approach here so standards are not very relevant.

LPDDR2 chips are always 32-bit and the only official JEDEC packages are for Package-on-Package configurations. The maximum is a 64-bit PoP package with 2 or more chips. But Intel could certainly buy the raw chips and stack them themselves (something they perhaps couldn't do with GDDR5) - they'd need one chip for every 32 bits of bus width. That would mean 16 chips for 512-bit (each 512Mbit, for 1GB total). That's a HUGE stack - this isn't going to be a thin package if true! Obviously that would be the top SKU and not aimed at ultraportables or netbooks - but the problem is that if you've got only 256MB then you can't have more than a 128-bit memory bus (there are no 256Mbit chips, and 512Mbit already isn't the most cost-efficient density). They'd also be wasting a fairly huge 384 bits' worth of their memory controller! It doesn't matter that the memory chips are closer and that the CPU's pitch is smaller, the memory controller (and probably the PHYs) is still going to cost the same - that is to say... quite a lot!
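The chip-count arithmetic above can be sketched as follows. It uses only the figures from the post (32-bit dies, 512Mbit each, one die per 32-bit channel); the function names are illustrative.

```python
# Arithmetic from the post: LPDDR2 dies are 32 bits wide and 512Mbit each,
# and every die is assumed to get its own 32-bit slice of the bus.
CHIP_WIDTH_BITS = 32
CHIP_DENSITY_MBIT = 512

def stack_for_bus(bus_width_bits):
    """Return (chip count, total capacity in MB) needed to fill a bus of this width."""
    chips = bus_width_bits // CHIP_WIDTH_BITS
    capacity_mb = chips * CHIP_DENSITY_MBIT // 8   # Mbit -> MB
    return chips, capacity_mb

def max_bus_for_capacity(capacity_mb):
    """Widest bus reachable for a given capacity when each chip adds 32 bits."""
    chips = capacity_mb * 8 // CHIP_DENSITY_MBIT   # MB -> Mbit -> chip count
    return chips * CHIP_WIDTH_BITS

print(stack_for_bus(512))         # (16, 1024): 16 chips, 1GB
print(max_bus_for_capacity(256))  # 128: a 256MB part tops out at 128-bit
```

The second result is where the "wasted 384 bits" comes from: a 512-bit controller paired with a 256MB stack would only have 128 bits populated.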

Could Ivy Bridge be doing something fancy with in-package memory resulting in a substantial performance boost? Yes. But I'm not convinced it's what Charlie is describing if so. Either way I'd like one of these - if he's wrong, deeply stacked enough to be smoked.

3dilettante · Dec 30, 2010

Silicon interposers have been part of an FPGA vendor's product plans already, so that is doable.
I think Altera was the one. (edit: Xilinx)
I hope the drawn diagram isn't too accurate, since that would require one massive glob of thermal grease to reach from the CPU to the bottom of the heatspreader.

It's a slight step backwards from the progressively more unified GPU/CPU memory hierarchy introduced by Sandy Bridge.
The on-die memory hierarchy on the CPU could still be unified, but there would be a secondary memory controller that would be primarily useful to the GPU.
Perhaps at some point it would just be a DRAM L4 cache? It seems like a waste to have it idling if someone opts out of the on-board graphics.

rpg.314 · Dec 30, 2010

3dilettante said:
Silicon interposers have been part of an FPGA vendor's product plans already, so that is doable.
I think Altera was the one.
I hope the drawn diagram isn't too accurate, since that would require one massive glob of thermal grease to reach from the CPU to the bottom of the heatspreader.

Sounds like AMD *could* do it reasonably cheaply.

It's a slight step backwards from the progressively more unified GPU/CPU memory hierarchy introduced by Sandy Bridge.
The on-die memory hierarchy on the CPU could still be unified, but there would be a secondary memory controller that would be primarily useful to the GPU.
Perhaps at some point it would just be a DRAM L4 cache? It seems like a waste to have it idling if someone opts out of the on-board graphics.

I assume that the driver will preferentially allocate memory for graphics objects from this pool.

Teaching libc malloc to not touch this would be a different matter though. Or can drivers lock down a segment of physical memory to themselves?

rpg.314 · Dec 30, 2010

Arun said:
Also describing LPDDR2 as "old" when there's barely any smartphone using it today proves only that Charlie doesn't know enough about that part of the market to speculate intelligently about it.

I remember reading on these forums (not sure who said it) that LPDDR2 was expensive and hence wasn't being used. It COULD be that Charlie meant that the standard had been finalized a while ago.

fehu · Dec 30, 2010

He even added the watermark in the easiest spot to delete it

Arun · Dec 30, 2010

rpg.314 said:
I remember reading on these forums (not sure who said it) that LPDDR2 was expensive and hence wasn't being used. It COULD be that Charlie meant that the standard had been finalized a while ago.

Pretty sure I'm the one who said that

Specifically that the Apple A4 used 64-bit LPDDR1 instead of 32-bit LPDDR2 because 512MB of the latter would be a lot more expensive in that timeframe (and might not even have been available in the volumes Apple needs). Hopefully in early 2012 there wouldn't be a huge price difference versus LPDDR1 anymore, but there would still be a big price difference versus DDR3. No idea how it would compare per megabyte versus GDDR5. Expect plenty of LPDDR2 devices in 1H11 (starting with the LG Optimus 2X using Tegra2).

3dilettante · Dec 30, 2010

rpg.314 said:
Sounds like AMD *could* do it reasonably cheaply.

Could, if that is the direction they choose. GlobalFoundries may have some input on this.
The additional question is "when".

Process technology isn't the only area where Intel has historically beaten AMD by a wide margin.
Packaging technology has also been a strong suit for Intel, with AMD usually lagging by a fair amount.

Also, I checked and it was Xilinx that had the silicon interposer tech.

rpg.314 · Dec 30, 2010

3dilettante said:
Packaging technology has also been a strong suit for Intel, with AMD usually lagging by a fair amount.

What is there in packaging tech to beat your competitor with? PPro's L2 cache comes to mind but doesn't seem like that big a deal.

3dilettante · Dec 30, 2010

Intel transitioned faster to organic substrates when that first came into use, and faster to use LGA packages.
It was faster to eliminate lead from its packaging, and one of the first to get a handle on the reliability issues that arose because of it.

Intel was also able to mass-produce dual-die packages much earlier than AMD. This was perhaps due to necessity, but this predates AMD's MCM by years.
As a result, it beat AMD's single-chip multicores to market, both for the dual and quad-core transitions.

TKK · Dec 30, 2010

3dilettante said:
Intel transitioned faster to organic substrates when that first came into use, and faster to use LGA packages.
It was faster to eliminate lead from its packaging, and one of the first to get a handle on the reliability issues that arose because of it.

Intel was also able to mass-produce dual-die packages much earlier than AMD. This was perhaps due to necessity, but this predates AMD's MCM by years.
As a result, it beat AMD's single-chip multicores to market, both for the dual and quad-core transitions.

These are simply situations where having tremendous resources to throw at certain issues pays off.
Their resources sure allow them to react very quickly.

Anyway, while you can never know with Intel, I wouldn't be surprised if the IB incarnation comes with something moderate like 128-bit/256MB, basically Intel's own answer to 'sideport memory'. Too much of this would drive up cost and make cooling rather challenging, I think.

I also wouldn't be surprised if development of this started around the time AMD showcased sideport memory.

DavidC · Jan 2, 2011

TKK said:
Anyway, while you can never know with Intel, I wouldn't be surprised if the IB incarnation comes with something moderate like 128-bit/256MB, basically Intel's own answer to 'sideport memory'. Too much of this would drive up cost and make cooling rather challenging, I think.

I also wouldn't be surprised if development of this started around the time AMD showcased sideport memory.

Doesn't ANYONE remember the leaked slide with Gesher and Larrabee stating similar things?

0-512MB 64GB/s bandwidth
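For reference, peak bandwidth is just bus width times transfer rate. The sketch below shows one combination that would land on 64GB/s; the 512-bit width is borrowed from the rumour discussed earlier and the 1000MT/s rate is purely a hypothetical input, since the slide itself gives neither.

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_sec):
    """Peak bandwidth in GB/s (10^9 bytes/s) for a bus of the given width and rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec / 1000

# Hypothetical pairing: 512-bit bus at 1000MT/s (assumed, not from the slide).
print(peak_bandwidth_gbs(512, 1000))
# The same assumed rate on a 128-bit bus, for comparison.
print(peak_bandwidth_gbs(128, 1000))
```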

DarthShader · Jan 2, 2011

http://www.xtremesystems.org/forums/showpost.php?p=4686529&postcount=45

Some info from Dr.Who? - ca. 60% hit rate of the L3 cache for the IGP. Concludes SB graphics will be faster than Llano - because the latter will be bandwidth starved and have shaders idling.
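One way to read the 60% figure, as a minimal sketch: it assumes every hit is served entirely from L3 and every miss goes to DRAM, and it uses a made-up DRAM bandwidth of 20GB/s purely as an example input. Neither assumption comes from the linked post.

```python
def dram_traffic_gbs(gpu_demand_gbs, hit_rate):
    """DRAM traffic left over after the cache absorbs the hits (simplified model)."""
    return gpu_demand_gbs * (1.0 - hit_rate)

def sustainable_demand_gbs(dram_bandwidth_gbs, hit_rate):
    """GPU request rate that a given DRAM bandwidth can sustain at this hit rate."""
    return dram_bandwidth_gbs / (1.0 - hit_rate)

HIT_RATE = 0.60             # figure quoted in the post
DRAM_BANDWIDTH_GBS = 20.0   # hypothetical value, for illustration only

print(round(dram_traffic_gbs(50.0, HIT_RATE), 2))
print(round(sustainable_demand_gbs(DRAM_BANDWIDTH_GBS, HIT_RATE), 2))
```

In this model a 60% hit rate cuts DRAM traffic to 40% of the request stream, which is the same as stretching the available DRAM bandwidth by 2.5x. Whether real workloads see that depends on how much of the GPU's traffic is actually reusable.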

itsmydamnation · Jan 2, 2011

Some info from Dr.Who? - ca. 60% hit rate of the L3 cache for the IGP. Concludes SB graphics will be faster than Llano - because the latter will be bandwidth starved and have shaders idling.

Current high-end AMD GPUs have something like 7-8MB of RAM on them; do they actually need to hit cache at all? To me, and I'm a layman, thinking about this logically, it doesn't make much sense.

GPUs are:
high memory latency
high memory throughput

I would have thought that GPUs don't get that much duplicate data pulled over the memory bus, which is where a cache would help reduce memory bandwidth. Also, if it's getting a 60% hit rate, what about cache thrashing in CPU-intensive games?

If that tiny amount of cache actually helped a GPU, you'd think we would have seen it by now.

The other thing a cache does is reduce latency, but who cares about that for a GPU? So to me that conclusion makes little sense, and until we know how Llano's memory controller works, how can anything be gauged?

Edit: also, is SB's cache structure still inclusive? Can SB prefetch from memory straight into L3, or does it have to go straight to L1 like AMD?

ShaidarHaran · Jan 2, 2011

DarthShader said:
http://www.xtremesystems.org/forums/showpost.php?p=4686529&postcount=45

Some info from Dr.Who? - ca. 60% hit rate of the L3 cache for the IGP. Concludes SB graphics will be faster than Llano - because the latter will be bandwidth starved and have shaders idling.

It confirms no such thing. It means exactly what it says. SNB graphics will potentially have access to a greater subset of data at a lower latency than Llano. Latency is practically irrelevant with such a parallel workload as graphics processing. Throughput is key.
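The latency-versus-throughput point can be made concrete with Little's law: the amount of data that must be in flight equals bandwidth times latency. A minimal sketch, with the bandwidth, latency and 64-byte request size all chosen as hypothetical inputs rather than figures for any of the chips discussed:

```python
def requests_in_flight(bandwidth_gbs, latency_ns, request_bytes=64):
    """Outstanding requests needed to keep a memory system busy (Little's law).

    GB/s times ns gives bytes, because 10^9 bytes/s * 10^-9 s = 1 byte.
    """
    bytes_in_flight = bandwidth_gbs * latency_ns
    return bytes_in_flight // request_bytes

# Hypothetical numbers: 64GB/s of bandwidth at two different latencies.
print(requests_in_flight(64, 100))
print(requests_in_flight(64, 500))
```

Higher latency does not lower throughput in this model; it only raises the number of requests that have to be outstanding. A workload with enough independent threads can supply that parallelism, which is why latency matters far less for graphics than for a CPU.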
