Discussion in 'Architecture and Products' started by AnarchX, Oct 29, 2010.
Why are there no APU GPUs running at 2+ GHz?
It's all about balance. Remember, we aren't talking about the 1980s, when components had passive cooling and drew 3 W. We are already limited by cooling and power consumption.
It's probably better to have 400 SPs at 650 MHz than 200 SPs at 1300 MHz. GPU code has extremely high parallelism, so adding more SPs is easier than clocking them high.
Nvidia does have high clock speeds for its SPs, but again, it's just the SPs. All the other blocks clock much lower. ATI's design calls for clocking everything at the same base clock. I guess they could change that, but it's not something that'll happen overnight.
Even if the process technology, thermal and power limits, and costs of development allow clocking the GPU at 2GHz, does the design allow it?
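The wide-and-slow versus narrow-and-fast tradeoff above can be sketched numerically. This is a toy model, assuming dynamic power scales as capacitance × frequency × voltage², and that voltage must rise roughly linearly with frequency past the sweet spot; the constants are made up for illustration, not taken from any real GPU.

```python
# Toy model of the "more SPs at lower clocks" argument.
# Assumptions (not real silicon data): switched capacitance scales with
# SP count, and voltage rises linearly with clock (v_slope is made up).

def relative_power(num_sps, clock_ghz, v_at_1ghz=1.0, v_slope=0.3):
    """Relative dynamic power ~ C * f * V^2."""
    voltage = v_at_1ghz + v_slope * (clock_ghz - 1.0)
    return num_sps * clock_ghz * voltage ** 2

# Same theoretical throughput: 400 SPs * 0.65 GHz == 200 SPs * 1.30 GHz
wide_slow = relative_power(400, 0.65)    # 400 SPs at 650 MHz
narrow_fast = relative_power(200, 1.30)  # 200 SPs at 1300 MHz

print(f"wide/slow:   {wide_slow:.0f}")
print(f"narrow/fast: {narrow_fast:.0f}")
# Under these assumptions the wide/slow design burns noticeably less
# power for the same peak throughput.
```

Under these (hand-picked) assumptions the 400-SP design comes out well ahead, which is the usual justification for scaling GPUs out rather than up.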
It seems Intel is finally at least developing an OpenCL implementation for their integrated GPUs:
They just sent an email to the llvm-developers list, recruiting people to develop their LLVM-based OpenCL implementation:
The size and the bandwidth don't sound too realistic to me.
Wouldn't they already use it in server CPUs if they could get 1 GB of memory at 5770 speeds into the Ivy Bridge design?
Not quite. Graphics can live with memory under 1 gig. Server workloads can't. Besides, it is not immediately clear that it will have lower latency than normal DRAM, though it is likely. Also, you would want to start with a lower-risk product.
512-bit LPDDR2 stacks? That's unlikely although not strictly impossible. Also describing LPDDR2 as "old" when there's barely any smartphone using it today proves only that Charlie doesn't know enough about that part of the market to speculate intelligently about it. It's interesting that nobody is thinking of doing that kind of bus width before the JEDEC Wide I/O standard with TSV (Through Silicon Vias i.e. 3D packaging) but Intel is hardly using a traditional approach here so standards are not very relevant.
LPDDR2 chips are always 32-bit, and the only official JEDEC packages are for Package-on-Package configurations. The maximum is a 64-bit PoP package with 2 or more chips. But Intel could certainly buy the raw chips and stack them itself (something it perhaps couldn't do with GDDR5) - it would need one chip for every 32 bits. That means 16 chips for 512-bit (each 512Mbit for 1GB). That's a HUGE stack - this isn't going to be a thin package if true! Obviously that would be the top SKU and not aimed at ultraportables or netbooks - but the problem is that if you've got only 256MB then you can't have more than a 128-bit memory bus (there are no 256Mbit chips, and 512Mbit already isn't the most cost-efficient density). They'd also be wasting a fairly huge 384 bits' worth of their memory controller! It doesn't matter that the memory chips are closer and that the CPU's pitch is smaller, the memory controller (and probably the PHYs) will still cost the same - that is to say... quite a lot!
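The stack arithmetic above is easy to check. A minimal sketch, using the post's numbers (32-bit LPDDR2 chips, 512Mbit density) rather than any confirmed Intel configuration:

```python
# Back-of-the-envelope check of the LPDDR2 stack arithmetic.
# Chip width and density follow the post's numbers, not any
# confirmed Intel configuration.

CHIP_WIDTH_BITS = 32      # LPDDR2 chips are always 32-bit wide
CHIP_DENSITY_MBIT = 512   # assumed density per chip

def stack_config(bus_width_bits):
    """Chips needed for a given bus width, and the resulting capacity."""
    chips = bus_width_bits // CHIP_WIDTH_BITS
    capacity_mbyte = chips * CHIP_DENSITY_MBIT // 8
    return chips, capacity_mbyte

print(stack_config(512))  # 16 chips, 1024 MB (1 GB) for a 512-bit bus
print(stack_config(128))  # 4 chips, 256 MB for a 128-bit bus
```

This reproduces both of the post's data points: 16 chips for 512-bit/1GB, and the 256MB configuration being stuck at 128-bit.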
Could Ivy Bridge be doing something fancy with in-package memory, resulting in a substantial performance boost? Yes. But if so, I'm not convinced it's what Charlie is describing. Either way I'd like one of these - and if he's wrong, at least it's deeply stacked enough to be smoked.
Silicon interposers have been part of an FPGA vendor's product plans already, so that is doable.
I think Altera was the one. (edit: Xilinx)
I hope the drawn diagram isn't too accurate, since that would require one massive glob of thermal grease to reach from the CPU to the bottom of the heatspreader.
It's a slight step backwards from the progressively more unified GPU/CPU memory hierarchy introduced by Sandy Bridge.
The on-die memory hierarchy on the CPU could still be unified, but there would be a secondary memory controller that would be primarily useful to the GPU.
Perhaps at some point it would just be a DRAM L4 cache? It seems like a waste to have it idling if someone opts out of the on-board graphics.
Sounds like AMD *could* do it reasonably cheaply.
I assume that the driver will preferentially allocate memory for graphics objects from this pool.
Teaching libc malloc to not touch this would be a different matter though. Or can drivers lock down a segment of physical memory to themselves?
I remember reading on these forums (not sure who said it) that LPDDR2 was expensive and hence wasn't being used. It COULD be that Charlie meant that the standard had been finalized a while ago.
He even added the watermark in the easiest spot to delete it
Pretty sure I'm the one who said that. Specifically, that the Apple A4 used 64-bit LPDDR1 instead of 32-bit LPDDR2 because 512MB of the latter would have been a lot more expensive in that timeframe (and might not even have been available in the volumes Apple needs). Hopefully by early 2012 there won't be a huge price difference versus LPDDR1 anymore, but there will still be a big price difference versus DDR3. No idea how it would compare per megabyte versus GDDR5. Expect plenty of LPDDR2 devices in 1H11 (starting with the LG Optimus 2X using Tegra 2).
Could, if that is the direction they choose. GlobalFoundries may have some input on this.
The additional question is "when".
It's not only process technology that Intel has historically beaten AMD on by a wide margin.
Packaging technology has also been a strong suit for Intel, with AMD usually lagging by a fair amount.
Also, I checked and it was Xilinx that had the silicon interposer tech.
What is there in packaging tech to beat your competitor with? PPro's L2 cache comes to mind but doesn't seem like that big a deal.
Intel transitioned faster to organic substrates when that first came into use, and faster to use LGA packages.
It was faster to eliminate lead from its packaging, and one of the first to get a handle on the reliability issues that arose because of it.
Intel was also able to mass-produce dual-die packages much earlier than AMD. This was perhaps born of necessity, but it predates AMD's MCM by years.
As a result, it beat AMD's single-chip multicores to market, both for the dual and quad-core transitions.
These are simply situations where having tremendous resources to throw at certain issues pays off.
Their resources sure allow them to react very quickly.
Anyway, while you can never know with Intel, I wouldn't be surprised if the IB incarnation comes with something moderate like 128-bit/256MB, basically Intel's own answer to 'sideport memory'. Too much of this would drive up cost and make cooling rather challenging, I think.
I also wouldn't be surprised if development of this started around the time AMD showcased sideport memory.
Doesn't ANYONE remember the leaked slide with Gesher and Larrabee stating similar things?
0-512MB 64GB/s bandwidth
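For scale, that 64GB/s figure lines up with what a 512-bit interface would deliver at LPDDR2-class transfer rates. A quick sanity check, where the 1066 MT/s rate is my assumption, not anything from the slide:

```python
# Sanity check on the leaked slide's 64 GB/s figure.
# The 1066 MT/s LPDDR2-style transfer rate is an assumption.

def bandwidth_gbps(bus_width_bits, mt_per_s):
    """Peak bandwidth in GB/s: bytes per transfer * transfers per second."""
    return bus_width_bits / 8 * mt_per_s * 1e6 / 1e9

print(f"{bandwidth_gbps(512, 1066):.1f} GB/s")  # ~68 GB/s, close to the slide
```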
Some info from Dr.Who? - ca. 60% hit rate in the L3 cache for the IGP. He concludes SB graphics will be faster than Llano - because the latter will be bandwidth-starved and have shaders idling.
Current high-end AMD GPUs have something like 7-8 MB of RAM on them - do they actually need to hit cache at all? To me (and I'm a layman), thinking about this logically, it doesn't make much sense.
high memory latency
high memory throughput
I would have thought that GPUs don't get that much duplicate data pulled over the memory bus, which is where a cache would help reduce memory bandwidth. Also, if it's getting a 60% hit rate, what about cache thrashing in CPU-intensive games?
If that tiny amount of cache actually helped a GPU, you'd think we would have seen it by now.
The other thing a cache does is reduce latency - who cares about that for a GPU? So to me that conclusion makes little sense, and until we know how Llano's memory controller works, how can anything be gauged?
edit: also, is SB's cache structure still inclusive? Can SB prefetch from memory straight into L3, or does it have to go straight to L1 like AMD?
It confirms no such thing. It means exactly what it says: SNB graphics will potentially have access to a greater subset of data at lower latency than Llano. Latency is practically irrelevant for a workload as parallel as graphics processing. Throughput is key.
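The bandwidth side of the hit-rate argument can be sketched numerically: a cache doesn't just cut latency, it filters traffic off the external bus. The demand figure below is made up for illustration; only the 60% hit rate comes from the thread.

```python
# Sketch of how a cache hit rate cuts external bandwidth demand.
# The 30 GB/s demand figure is made up; the 0.6 hit rate matches
# the figure quoted earlier in the thread.

def external_traffic(requested_gbps, hit_rate):
    """Traffic that actually reaches DRAM after the cache filters it."""
    return requested_gbps * (1.0 - hit_rate)

demand = 30.0  # GB/s the IGP would like to pull (illustrative)
print(external_traffic(demand, 0.6))  # roughly 12 GB/s reaches DRAM
print(external_traffic(demand, 0.0))  # full 30 GB/s with no cache at all
```

So even if the cache barely helps latency-wise, a 60% hit rate means the IGP only competes with the CPU cores for 40% of its nominal bandwidth demand - which is the "bandwidth-starved Llano" argument in a nutshell.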