Haswell vs Kaveri

Andrew Lauritzen · Jun 7, 2013

If you're adding CPU+GPU then the 4470R is ~512+832Gflops = ~1.35 Tflops, right? But of course that assumes max CPU and GPU turbo, which is probably not sustainable within TDP limits. Not clear what numbers AMD is counting on that front though... and of course counting FLOPs isn't necessarily that useful.

On mobile they are almost certainly going to be TDP constrained, same as Haswell GT3e, no? Thus it will come down to power efficiency of the whole stack (CPU, GPU, drivers), not raw peak numbers.

I do worry about a 1Tflop part with standard DDR3 though... seems like it would at least have to be quad channel, if not higher clocked memory as well.

Should be an interesting comparison in any case, especially in mobile

Paran · Jun 7, 2013

AMD always use Gflops numbers from the whole APU. Here is from Kaveri: http://pics.computerbase.de/3/9/3/8/6/3.png
http://img850.imageshack.us/img850/5059/kaverislidetualp.png

You can see in the second slide that 1050 Gflops is valid for a 100W APU. We shouldn't expect this in a notebook.

http://www.extremetech.com/wp-content/uploads/2012/06/AFDS_Kaveri_APU.jpg

Here a newer slide, not sure which one is correct.

Andrew Lauritzen · Jun 7, 2013

Paran said:
AMD always use Gflops numbers from the whole APU. Here is from Kaveri:

Ah I see, although they play a bit fast and loose with their definition of "frequency" in their computation of peak computational power. It may not actually be unreasonable to compare their marketing number to the 1.3Tflops number for 4770R I guess, although neither is practically that useful.

borden · Jun 7, 2013

Paran said:
Probably slightly higher. 250-260 mm² based on my calculation.

14.5 x 15.5mm, according to 4gamer

http://i44.tinypic.com/b3va1l.jpg

http://www.4gamer.net/games/147/G014731/20130606053/

Paran · Jun 7, 2013

borden said:
14.5 x 15.5mm, according to 4gamer

http://i44.tinypic.com/b3va1l.jpg

http://www.4gamer.net/games/147/G014731/20130606053/

This is better. Yes with this photo I get 225 mm² too. This one looks accurate. Slighty smaller than Trinity.

Homeles · Jun 8, 2013

Andrew Lauritzen said:
If you're adding CPU+GPU then the 4470R is ~512+832Gflops = ~1.35 Tflops, right? But of course that assumes max CPU and GPU turbo, which is probably not sustainable within TDP limits. Not clear what numbers AMD is counting on that front though... and of course counting FLOPs isn't necessarily that useful.

On mobile they are almost certainly going to be TDP constrained, same as Haswell GT3e, no? Thus it will come down to power efficiency of the whole stack (CPU, GPU, drivers), not raw peak numbers.

I do worry about a 1Tflop part with standard DDR3 though... seems like it would at least have to be quad channel, if not higher clocked memory as well.

Should be an interesting comparison in any case, especially in mobile

Quad channel doesn't make sense for mobile, and simply doesn't make sense from a cost perspective to me.

Kaveri's rumored to use GDDR5m, the merits of which were discussed earlier this this thread.

I think AMD's going to be power constrained at every TDP target... that's how practically everything goes these days.

Andrew Lauritzen · Jun 8, 2013

Homeles said:
Kaveri's rumored to use GDDR5m, the merits of which were discussed earlier this this thread.

Right but I thought the new rumor was no GDDR, hence the other speculation in this thread.

Homeles · Jun 8, 2013

Andrew Lauritzen said:
Right but I thought the new rumor was no GDDR, hence the other speculation in this thread.

Ah, I had not seen BSN's sudden change in heart.

I totally agree with your sentiment about DDR3 -- it's just not enough. Trinity is already held back considerably by a lack of bandwidth. Suddenly we're going to move to a wider, higher clocked GPU with the same constraints? From what I can gather, GCN won't make notably more efficient use of that bandwidth either.

AMD's already indicated that GDDR5m is a viable interim solution during the wait for stacked memory. To me, it sounds like it fits the ticket quite well. The lower voltage helps to alleviate power consumption and heat issues that standard GDDR5 has, it shouldn't cost as much as DDR4, and it provides higher bandwidth than DDR3. It allegedly gets around the 4GB capacity limit of current GDDR5 ICs as well.

itsmydamnation · Jun 9, 2013

Homeles said:
From what I can gather, GCN won't make notably more efficient use of that bandwidth either.
.

i thought one of GCN biggest improvements was is cache system.

lowest comparable bandwidth between GCN and VLIW4 i could fine. Way less resources for 8670 yet generally way out performs VLIW4.

http://clbenchmark.com/compare.jsp?config_0=13292273&config_1=12011396

hard to find game benchmarks of these cards.

lanek · Jun 9, 2013

itsmydamnation said:
i thought one of GCN biggest improvements was is cache system.

lowest comparable bandwidth between GCN and VLIW4 i could fine. Way less resources for 8670 yet generally way out performs VLIW4.

http://clbenchmark.com/compare.jsp?config_0=13292273&config_1=12011396

hard to find game benchmarks of these cards.

From VliW4 mArch, yes, but for the next one, i believe they need increase the L2 cache size to 2Mo at least and not just optimize it .

sebbbi · Jun 9, 2013

Homeles said:
From what I can gather, GCN won't make notably more efficient use of that bandwidth either.

itsmydamnation said:
i thought one of GCN biggest improvements was is cache system.

Yes, GCN (Tahiti) has 768 KB of general purpose read/write L2 cache. All memory traffic goes though it. It will certainly reduce the external memory BW usage.

Some time ago, I did some testing with my Tahiti to determine some of it's L2 properties. One of the things I noticed immediately, was that Tahiti is already BW bound when you just alpha blend to a 4x16 bit float HDR render target, even if you are not sampling any textures at all (just write solid color triangles). Fast GDDR5, wide 384 bit bus and a total of 264 GB/s bandwidth, and a simple untextured blend is able to spend all of that BW. I roughly calculated that it would require more than 500 GB/s BW to reach full Tahiti fill rate in this scenario. So I split the viewport to 128x128 tiles, and rendered each tile separately. This doubled the performance (reached maximum theoretical fill rate). 128x128 tile of 4x16 bit float color + 32 bit depth buffer size is 196 KB in memory. It easily fits the 768 KB L2 of Tahiti, and thus all the blending passes and depth read/writes occur completely inside the L2 (no memory BW used at all). It seems that cache optimizations are very important for these new breed of GPUs. Even simple things such as sorting objects by screen space XY location (in addition to depth) could bring nice reductions in backbuffer BW usage (= big performance gains for the BW starved APUs).

It's hard to say how the large L2 caches of GCN (and Fermi/Kepler) affect performance of current generation games. Current generation consoles do not have large general purpose GPU caches, so most developers have surely not analyzed their L2 cache behavior or changed their rendering methods to exploit the L2 caches of the recent hardware. Cache optimization was (and still is) one of the major performance improvements for CPU code. If you don't design around caches, you still get some gains from the caches, but the biggest gains require careful design. So I would expect to see biggest gains in new games that are designed from the ground up for the new GPUs.

Gipsel · Jun 9, 2013

sebbbi said:
Yes, GCN (Tahiti) has 768 KB of general purpose read/write L2 cache. All memory traffic goes though it.

Actually, ROP exports don't. ROPs have their separate Z and color tile caches (I think 4 and 16kB in size per ROP partition/RBE, edit: at least that was the size up to Cayman [supposedly, it's hard to get firm numbers on these details], that would be 128kB ROP color cache in total, not quite enough to fit the 192kB you just mentioned for your test, maybe they doubled it in the SI generation; edit2: actually you had 128kB color data and 64kB Z data, so only the Z data wouldn't fit; RBEs [with the integrated caches] are afaik assigned based on screen space/render target tiles [8x8 or 16x16 pixels?] in an interleaved pattern for load balancing, btw.).

lanek · Jun 9, 2013

Its interessant, what you say ( if i understand well ), it is more the cache are underused by softwares / games... ( actually it look ( and again i could be completely wrong ) this is more the speed of the cache who is used instead of the large size of it. ( on computing side, games today are so ported from actual consoles arch. this is hard to determine. ) ( On GPU, not on CPU as it is completely the invert )

( outside what is pointing Gipsel about ROPs partition ... )

Homeles · Jun 9, 2013

My efficiency claim was largely based on the ROPs. Specifically, I took a look at the the 7870 vs the 6970. The 6970 outperforms the 7870 in some fillrate metrics, and blending is much faster as well. This is in spite of the 7870 having higher clocked ROPs. In 3DMark Vantage Pixel Fill, the 6970 has a 10% higher fillrate than the 7870, despite the 7870 having a ~14% higher clock speed. This is undoubtedly due to the 6970's ~15% higher GDDR5 bandwidth.

Blending simply kills the 7870. From my understanding, when a GPU is said to be bandwidth constrained, it is because the blend rate is being held back. Texturing performance is not nearly as reliant on bandwidth. Now I know next to nothing about programming, especially when it comes to GPUs, but it seems to me that the considerable FPS gains we see from jumping from 1333 to 2133 MHz DDR3 with Trinity are largely due to more bandwidth being available for blending.

lanek · Jun 9, 2013

But thoses metrics ( synthetics ) are really significant with todays ask ? For me, it seems all of this have moving to another level who dont lookat thoses values the same way it was before ..

sebbbi · Jun 9, 2013

Gipsel said:
Actually, ROP exports don't. ROPs have their separate Z and color tile caches (I think 4 and 16kB in size per ROP partition/RBE, edit: at least that was the size up to Cayman [supposedly, it's hard to get firm numbers on these details], that would be 128kB ROP color cache in total, not quite enough to fit the 192kB you just mentioned for your test, maybe they doubled it in the SI generation; edit2: actually you had 128kB color data and 64kB Z data, so only the Z data wouldn't fit; RBEs [with the integrated caches] are afaik assigned based on screen space/render target tiles [8x8 or 16x16 pixels?] in an interleaved pattern for load balancing, btw.).

Thanks for the clarification. I was rendering big planar triangles, so the depth compression should work perfectly (reducing depth BW requirement a lot). When I increased the tile size to 196x196 the performance dropped slightly bit, and for 256x256 or bigger tiles, the cache didn't help much at all. 128 KB ROP cache would be a good bet for these results (assuming it contains the depth data in compressed format). I can do more experiments. It shouldn't be that hard to create test data sets to detect all the caches and their sizes. Btw. are there any public documents about the exact details of Kepler cache hierarchy?

Gipsel · Jun 9, 2013

sebbbi said:
128 KB ROP cache would be a good bet for these results (assuming it contains the depth data in compressed format).

The caches are split in color and Z data (there are two separate arrays, it's even depicted that way in all block diagrams, even when I don't trust them with such details). It should be 128kB color cache and 32kB Z cache in total for a chip with 8 RBEs (32 ROPs). In the VLIW architectures at least the color caches (don't know about Z) stored uncompressed render target tiles. They get uncompressed when loading a tile from memory and compressed when storing them back (only complete tiles [likely 8x8 pixels in size] are transferred). As the ROPs havn't seen much changes in the GCN GPUs (opposed to the L1 caches which store now compressed data with GCN, the VLIW architectures uncompressed it when fetching from L2), I wouldn't expect that this was changed.

Homeles · Jun 9, 2013

lanek said:
But thoses metrics ( synthetics ) are really significant with todays ask ? For me, it seems all of this have moving to another level who dont lookat thoses values the same way it was before ..

I'm not quite following you. Could you clarify what you're saying?

sebbbi · Jun 9, 2013

Gipsel said:
It should be 128kB color cache and 32kB Z cache in total for a chip with 8 RBEs (32 ROPs).

Separate depth and color ROP caches explain my test results. The color data cannot be compressed (no MSAA used), and the depth data can be compressed with the best ratio (one triangle in a depth block). The color data fits to the color ROP cache, and thus no bandwidth will be used for the repeated 128x128 tile blend operations. Depth cache will of course trash (64 KB uncompressed data set size doesn't fit in the 32 KB cache), but since the depth compression reduces the external memory writes to minimum, the bandwidth usage is very low (and thus the GPU can reach it's maximum fill rate before hitting the BW limit). I need to repeat this test with some other GPU with lower BW (and other cache properties). Tahiti isn't the best test platform to check BW bound scenarios

kalelovil · Jun 18, 2013

An AMD press slide in a recent Anandtech article has some details on the 'Berlin' APU arriving 1H14, presumably the Opteron equivalent of Kaveri using the same die.

The memory architecture sadly appears to have the same specifications as Trinity/Richland, 128bit DDR3 at up to 1866Mhz.

http://www.anandtech.com/show/7079/amd-evolving-fast-to-survive-in-the-server-market-jungle-/4

Haswell vs Kaveri

Andrew Lauritzen

Moderator

Paran

Andrew Lauritzen

Moderator

borden

Paran

Homeles

Andrew Lauritzen

Moderator

Homeles

itsmydamnation

lanek

sebbbi

Gipsel

lanek

Homeles

lanek

sebbbi

Gipsel

Homeles

sebbbi

kalelovil