Haswell vs Kaveri

Discussion in 'Architecture and Products' started by AnarchX, Feb 8, 2012.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    (Slightly OT, continued from my above post)

    I did some extra testing with the GCN ROP caches. Assuming the mobile versions have ROP caches the same size as my Radeon 7970's (128 KB), it looks like rendering in 128x128 tiles could help these new mobile APUs a lot, since APUs are very much BW limited.
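    (To put a number on that, with the 4x16f HDR target discussed just below; a quick check, nothing more:)

    ```cpp
    // 4x16f (RGBA16F) = 8 bytes per pixel, so one 128x128 tile is exactly 128 KB,
    // i.e. one tile's worth of blending fits the assumed ROP cache size.
    constexpr int kTileBytes = 128 * 128 * 8;
    static_assert(kTileBytes == 128 * 1024, "128x128 RGBA16F tile == 128 KB");
    ```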

    Particle rendering has a huge backbuffer BW cost, especially when rendering HDR particles to a 4x16f backbuffer. Our system renders all particles using a single draw call (the particle index buffer is depth sorted, and we use premultiplied alpha to achieve both additive and alpha-channel blended particles simultaneously). It is actually over 2x faster to brute-force render a draw call containing 10k particles 60 times to 128x128 tiles (moving the scissor rectangle across a 1280x720 backbuffer) than to render it once (single draw call, full screen). And you can achieve these kinds of gains in 15 minutes (it's just a brute-force hack). With a little bit of extra code, you can skip particle quads (using a geometry shader) that do not land on the active 128x128 scissor area (and save most of the extra geometry cost). This is a good way to reduce the particle overdraw BW cost to zero: a 128x128 tile is rendered (alpha blended) completely inside the GPU ROP cache. It is an especially good technique for low-BW APUs, but it helps even the Radeon 7970 GE (with a massive 288 GB/s of BW).
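    Roughly what the 15-minute brute-force version boils down to, as a sketch (not our actual code; it assumes D3D11, a bound rasterizer state with ScissorEnable = TRUE, and a hypothetical drawAllParticles() helper that issues the single particle draw call with everything else already set up):

    ```cpp
    #include <algorithm>
    #include <d3d11.h>

    // Hypothetical engine hook: issues the single depth-sorted, premultiplied-alpha
    // particle draw call with all shaders, blend state and buffers already bound.
    void drawAllParticles(ID3D11DeviceContext* ctx);

    // Brute-force tiling: sweep a 128x128 scissor rectangle across the backbuffer
    // and re-issue the same draw call once per tile, so each tile's blending stays
    // resident in the ROP cache instead of going out to memory.
    void drawParticlesTiled(ID3D11DeviceContext* ctx, int width, int height)
    {
        const int kTile = 128;
        for (int y = 0; y < height; y += kTile)
        {
            for (int x = 0; x < width; x += kTile)
            {
                D3D11_RECT scissor;
                scissor.left   = x;
                scissor.top    = y;
                scissor.right  = std::min(x + kTile, width);
                scissor.bottom = std::min(y + kTile, height);
                ctx->RSSetScissorRects(1, &scissor); // rasterizer state needs ScissorEnable = TRUE

                drawAllParticles(ctx);               // e.g. 10k particles, 60 iterations at 1280x720
            }
        }
    }
    ```

    The geometry-shader refinement mentioned above would then reject quads whose screen-space bounds miss the currently active scissor rectangle, so the repeated passes don't pay the full geometry cost each time.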

    With this technique, soft particles gain even more, since the depth texture reads (a 128x128 area per tile) fit in the GCN 512/768 KB L2 cache (and become BW-free as well). Of course Kepler-based chips should see similar gains (but I don't have one for testing).
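    (Same back-of-the-envelope check for the depth reads, assuming a 32-bit depth format, which is an assumption rather than something stated above:)

    ```cpp
    // Depth samples touched while shading one 128x128 tile, assuming 4 bytes each.
    constexpr int kDepthTileBytes = 128 * 128 * 4;              // 64 KB
    static_assert(kDepthTileBytes < 512 * 1024, "well within a 512 KB GCN L2");
    ```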

    If techniques like this become popular in the future, and developers start to spend a lot of time optimizing for modern GPU L2/ROP caches, it might make larger GPU memory pools (such as the Haswell 128 MB L4 cache) less important. It's going to be interesting to see how things pan out.
     
    #541 sebbbi, Jun 18, 2013
    Last edited by a moderator: Jun 18, 2013
  2. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
  3. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,489
    Likes Received:
    907
    I was hoping for some change in the cache hierarchy, but that doesn't appear to be the case either. I guess the L2 could still be faster.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    You could count the hUMA changes for CPU/GPU cache snooping as a change there, I guess.
    But if you're waiting for a shared (all cores + GPU) L3, then I don't know if Excavator is going to have that.
     
  5. Zaphod

    Zaphod Remember
    Veteran

    Joined:
    Aug 26, 2003
    Messages:
    2,202
    Likes Received:
    101
    The Hotchips presentation last year said that Steamroller would have "Shared L2 Cache" with "Dynamic resizing of L2 cache, Adaptive mode based on workload", which might sound like a unified L2 for all CPU cores at least (similar to Kabini?). But maybe that didn't make it for Kaveri, or they meant something less ambitious.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,118
    Likes Received:
    2,860
    Location:
    Well within 3d
    The L2 is already shared between two cores. The description didn't mention sharing across modules.
     
  7. Zaphod

    Zaphod Remember
    Veteran

    Joined:
    Aug 26, 2003
    Messages:
    2,202
    Likes Received:
    101
    You're right. I see they only showed a single module in the diagram too. So they were basically talking about better load balancing between the execution units for efficiency gains and/or powering down parts of it for a perf/watt improvement. Maybe for Excavator then.
     
  8. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Great post to read :)

    I never thought about it, but that amount of cache is spread across 8 render back-ends. It got me wondering whether all the units have equal access (in bandwidth and latency) to the "non-local" slices of the Z and color caches.
    Your results are pretty impressive, though I wonder (for the record I'm neither a coder nor a hardware guy, sorry if my comment is dumb) whether you could try treating that amount of cache as 8 separate pieces of cache.
    If those caches behave like the L3 in Intel or IBM architectures (each slice bound to a specific resource, in that case a CPU core), you could get even better results by fitting your tiles to the size of a "local subset" of those caches.
    So have you tried submitting 60x8 (240, that sounds high 8O ) 16x16 tiles that would fit in the local share of the cache of an RBE? EDIT <= horrendous math mistake...
    Maybe by trying 16x16, 32x32 and 64x64 (vs. 128x128) tiles you could figure out how much bandwidth the render units have to the 'non-local' shares of the aforementioned caches.
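    For reference, the number of scissor tiles (hence repeated draw calls) needed to cover 1280x720 at each candidate size, counting only tiles that actually touch the backbuffer (plain arithmetic, nothing GPU-specific):

    ```cpp
    #include <cstdio>

    int main()
    {
        const int width = 1280, height = 720;
        const int sizes[] = { 128, 64, 32, 16 };
        for (int tile : sizes)
        {
            const int tilesX = (width  + tile - 1) / tile;  // round up
            const int tilesY = (height + tile - 1) / tile;
            std::printf("%4dx%-4d tiles -> %d draw calls\n", tile, tile, tilesX * tilesY);
        }
        // Prints 60, 240, 920 and 3600 draw calls respectively.
        return 0;
    }
    ```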

    Though maybe they have equal access (which for some reason I don't expect), and maybe the impact of the rising number of draw calls could "taint" the results?

    / I'll stop here, this is over my head; I hope it was not dumb, incorrect or, worse, nonsensical.
     
    #548 liolio, Jun 19, 2013
    Last edited by a moderator: Jun 20, 2013
  9. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    I can't help it... my mind has to wander...

    I don't know how ROPs work, but I could imagine situations where the units are kept busy enough that the extra latency and bandwidth of accessing a "non-local" share of those caches would not actually cost anything /?

    -------------
    About Intel, D. Kanter has an in-depth analysis of the Intel Gen 7 GPU; to me they are the "same", though it seems there is only one RBE, and so the amount of cache is only 8 KB for color and 32 KB for Z.

    Anyway, it would still seem weird to me if manufacturers came up with solutions that don't allow software to "scale". In your case, if I get it right, you optimized for 8 RBEs / 32 ROPs, but it would not benefit a lesser part with 4 RBEs / 16 ROPs. I would understand that optimization based on the size of the L2 would be platform specific, but I find it odd for the ROPs/RBEs. They are scalable structures, so I would expect software performance to scale with the number of ROPs, both up and down (though bandwidth should scale linearly too).

    So the "building block" is the RBE to me, and I wonder whether software should be optimized around that building block. A bit like a multi-core CPU: on Xenon, say, you have 3x32 KB of L1 data cache, but I would think you would optimize your data structures for 32 KB, not 96 KB, no?

    My idea is that RBEs are designed so that if your data fits within their cache (4 KB color cache, 16 KB Z cache?) they should achieve their maximum throughput without needing external bandwidth. Actually the color cache is tiny; I'm not sure what tile format would fit in there.
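    Putting rough numbers on that idea, using only the figures already in the thread (128 KB total, 8 RBEs, 8 bytes per 4x16f pixel); the even per-RBE split is purely an assumption here, not a known fact about the hardware:

    ```cpp
    // Hypothetical even split of the 128 KB figure across 8 RBEs.
    constexpr int kPerRbeBytes = (128 * 1024) / 8;   // 16 KB per RBE
    constexpr int kTile32Bytes = 32 * 32 * 8;        // 8 KB  (4x16f, 32x32 tile)
    constexpr int kTile64Bytes = 64 * 64 * 8;        // 32 KB (4x16f, 64x64 tile)
    // Under that assumption a per-RBE tile would land between 32x32 and 64x64.
    static_assert(kTile32Bytes < kPerRbeBytes && kPerRbeBytes < kTile64Bytes, "sanity");
    ```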

    /Ignore if it doesn't make sense.

    Edit: Oops, maths are not my friend... going from 128x128 to 16x16 I said 8 times the number of draw calls... it is more like 64 times, so a lot of draw calls (3840...).
     
    #549 liolio, Jun 19, 2013
    Last edited by a moderator: Jun 20, 2013
  10. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Each slice gets its own pixel backend + color and Z caches (and setup, rasterizer, etc..).
    For instance HSW GT3 has two of everything and you can scale it up (see slide 19: http://www.highperformancegraphics.org/previous/www_2012/media/Hot3D/HPG2012_Hot3D_Intel.pdf)

    Also the color cache is backed by the whole CPU+GPU cache hierarchy.
     
  11. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Nice paper, thanks.

    I find Gen X pretty interesting, as it looks a lot like a "proper" multi-core GPU to me.
    AMD and Nvidia scale the number of CUs or SMXes, but the ROPs and the (GPU) last level of cache are not part of the "party".
    Now Intel seems to go with "self-contained" blocks, slices to use their wording.
    I'm not a technical guy, but my gut tells me it is the right move, as I can see how it allows everything to be "fine-tuned": matching the number of execution units with the amount of cache you have, the bandwidth to those caches, and the number of threads in flight needed to cover the latency (to put it more sensibly: with a proper cache hierarchy, a very high thread count is no longer your main tool for hiding latency).

    I could also see some benefits for the "wiring". If I look at an AMD GPU, I imagine a pretty massive bus running around "a bunch" of CUs, linking the CUs, the fixed-function units, the L2, the ROPs, etc.
    Now with those self-contained blocks, slices, I could see that amount of wiring being smaller, and whatever needs to be shared could be obtained just by checking the local share of the last level of cache in each slice. I could also guess that the latency to "reach" anything within a self-contained block, a slice, is lower than checking the "whole chip".

    It may be an incorrect way to state it, but I see something "beautiful" in the way that cache hierarchy is put together, not too surprising as it is Intel's forte. I'm not technical enough (by quite a stretch) to really grasp the benefit clearly, but it looks more like a "proper" cache hierarchy to me, and by that I mean something as lean and evolved as what we find in CPUs that have gone through decades of evolution to end up with the nice set-up one finds in Intel or IBM CPUs.

    It seems the architecture is already ahead of the competition: a more balanced architecture that can make more of data locality than competing architectures, that could possibly require a lot fewer threads to function, and that can already work successfully on significantly narrower vectors. No offence to Nvidia or AMD engineers, but it looks like something really intended to leverage parallelism, and not just the most extreme cases of it, with respect to vector width but also data structures.

    Anyway, I'll stop with the pseudo-technical blathering akin to "pub-level philosophy" :lol: I can't properly analyze the competing architectures; sometimes the high-level representations differ in more than one way from the actual silicon, etc. Still, looking at it I see some "beauty" to it; you guys have to be proud of it and the competition concerned.
     
    #551 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  12. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    They can scale the ROPs and the L2 cache independently from the core section. It's a different approach, but it gives them more control over performance scaling.
    Intel just uses fewer independent blocks, and this is definitely not the right way, because they spend many more transistors to achieve the same performance level.

    Sure, it has benefits, but AMD and NVIDIA design their architectures for high scalability: ultramobiles to supercomputers, as they say. Intel doesn't scale Gen X across that kind of performance range. They could, but it wouldn't be efficient.
     
  13. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    I don't see how it is less scalable; they are just moving toward more "self-contained" building blocks.

    As for spending more transistors, well, that is clearly disputable. If you read the AnandTech reviews, you see the Core i7-4950HQ competing with a quad-core CPU + GT 650M.
    The GT 650M alone is 1.2 billion transistors.
    And the perf per watt does not compare favorably for Nvidia.

    As for ultramobile parts, there are no GPUs from Nvidia or AMD there that compare to their desktop GPUs, whereas Intel is to deploy its GPU in a couple of months (Silvermont) in really low-power platforms.
    Let's see how Kepler fares in the mobile realm; that would be a point of comparison, but as far as timelines are concerned it seems Intel is set to beat Nvidia and AMD significantly.

    For supercomputers, Intel has other products; their GPU doesn't support DP calculations, but I don't see that as a sign of lesser scalability. They can scale the architecture up to close to 2 TFLOPS (according to the paper nAo linked). As you say, they can, but it doesn't fit their goals / power consumption. For the kind of power budget they aim at, they fry the competition (though they sell those Haswell + CW parts at a crazy high price; that is not set in stone, and the market will tell them whether that price is worth it, the crazy part being that it could be ;) ).

    In the supercomputer realm I think IBM rules, but it seems Intel may attempt to come after them, as they seem to be moving to a pretty aggressive roadmap for their Xeon Phi line. A part that includes Crystalwell should launch on their 14nm process at some point in 2014.

    Overall, I think that in a world where the marketing/PR departments of GPU manufacturers claim to sell "many-core" GPUs (with core counts in the thousands), Intel's move toward a more comprehensive cache hierarchy for what could genuinely qualify as "real" GPU cores / general-purpose vector machines 2.0 looks like the right thing to do to me.

    --------------
    There is also the fact that Intel designs iGPUs: it will be interesting to see how well competing AMD and Nvidia platforms handle communication between the CPU and GPU, how coherency traffic is handled, at what power cost, etc., and things like meeting bandwidth requirements without constraining the amount of RAM the system can use (as GDDR5 does).
     
    #553 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  14. zorg

    Newcomer

    Joined:
    Aug 1, 2012
    Messages:
    32
    Likes Received:
    0
    Location:
    Sweden
    ... and they can't scale the design beyond 4 blocks.
    The eDRAM and the GDDR5 are "messing up the results". Without the eDRAM, the GT3 design is not faster than the fastest Richland IGP, which uses far fewer transistors.

    AMD Temash is an ultramobile SoC with a 3.x watt TDP, and it uses a GCN-based iGPU. It's the fastest solution on the market at this power level, more than ten times faster than the Atom Z2760.
    NVIDIA's Logan SoC will use Kepler next year.
     
  15. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14

    This looks quite different to a Haswell slice. I see four samplers on each slice. Haswell only has one sampler per slice. Is that a real upcoming GPU?
     
  16. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    Well, it is not messing anything up; both come at a cost, though from a power perspective CW wins.

    As for Trinity/Richland, if I look here or here the perf is really close, and so is the transistor count.
    Though the difference in power consumption is pretty significant.
    Other than that, CPU perf is not in the same ballpark; taking into account the fact that a 4770K carries a lot of L3, I'm not sure who out of Intel or AMD is spending its silicon in the best manner.
    Anyway, I smell an attempt at triggering a bullshit war; let me be clear, I don't care. Neither of those CPUs offers enough (GPU) performance for me. GT3e is OK-ish but way too expensive vs a discrete solution. Ultimately I'm not "on the market" researching a new set-up to buy.
    I've no intention to further discuss benches, driver optimizations, etc. If you can elaborate on the cache hierarchy and provide more information, you are welcome to do so.
    Wow, more than ten times faster. Sorry, I'm wary of AMD power figures; those 100-watt parts burn close to 130 watts. I will wait for reviews, and yes, the next Atom aims at higher perf per watt, enough to fit in a phone (reviews will tell whether Intel succeeded).
     
    #556 liolio, Jun 20, 2013
    Last edited by a moderator: Jun 20, 2013
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,485
    Likes Received:
    396
    Location:
    Varna, Bulgaria
    Probably a Broadwell SKU. Intel promised a major boost in IGP performance in the generation after Haswell.
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    This doesn't make sense. A well-thought machine is made of parts that have been specifically designed to work together with certain trade-offs. If you take a piece off in order to compare it to another one you often reach the wrong conclusion. You can't simply take the eDRAM and pretend the rest was not architected to use it.
     
  20. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14

    Indeed, I missed the second one. It's quite different to Haswell nevertheless. Also +4 EUs per slice, obviously. If this is real and not just a showcase slide, it possibly belongs to Gen8 Broadwell.
     