AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

The bandwidth test with a texture that fails to compress well under the frame-buffer compression methods is handling a lot of back-and-forth traffic, which can hurt utilization significantly.
Fury X has somewhat worse utilization than every card besides the 780 Ti, which could be the result of diminishing returns: raw bandwidth growing faster than what is trying to use it.
What I find notable is that the compressible texture's bandwidth goes above theoretical on the later Nvidia GPUs, which seems to point to a lot of bus accesses being saved by some combination of compression and keeping the data on-die.

There are new shaders for that test suite, but no numbers as of yet.
 
You can't expect sustained performance at 99% of the theoretical figures from any kind of DRAM at this time.

HBM is unlike GDDR5 and its predecessors in that, thanks to its separate row and column command buses, it can actually reach and sustain its rated peak bandwidth, if only in a linear synthetic workload. Looking at the bus spec alone, HBM should allow the highest utilization of any DRAM bus out there.
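
For reference, the headline figure is easy to reproduce from the bus spec alone. A quick sanity check in Python, assuming Fury X's published configuration (four stacks, a 1024-bit interface per stack, 500 MHz clock, double data rate):

```python
# Sanity check of the HBM1 headline figure from the bus spec alone,
# using Fury X's published configuration: four stacks, a 1024-bit
# interface per stack, 500 MHz clock, double data rate.
stacks = 4
bits_per_stack = 1024
clock_hz = 500e6
transfers_per_clock = 2  # DDR

bytes_per_second = stacks * bits_per_stack * clock_hz * transfers_per_clock / 8
print(f"Peak bandwidth: {bytes_per_second / 1e9:.0f} GB/s")  # -> 512 GB/s
```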
 
What I find notable is that the compressible texture's bandwidth goes above theoretical on the later Nvidia GPUs, which seems to point to a lot of bus accesses being saved by some combination of compression and keeping the data on-die.


Nvidia reworked the memory subsystem quite a bit, enabling much higher memory clock speeds compared to previous-generation GeForce GPUs. The result: memory speeds up to 7 Gbps combined with a faster 384-bit wide bus. Combined with some clever advancements in color compression, Nvidia can claim even more bandwidth, as Maxwell cards now use third-generation delta color compression (e.g. 7 Gbps × 1/0.75 ≈ 9.3 Gbps effective bandwidth) thanks to enhanced color compression and enhanced caching techniques.

To reduce DRAM bandwidth demands, NVIDIA GPUs make use of lossless compression techniques as data is written out to memory. The bandwidth savings from this compression are realized a second time when clients such as the Texture Unit later read the data. As illustrated in the preceding figure, the compression engine has multiple layers of compression algorithms.
...
Therefore, starting with Fermi, Nvidia also implemented support for a “delta color compression” mode. In this mode, they calculate the difference between each pixel in the block and its neighbor, and then try to pack these difference values together using the minimum number of bits.
...
The effectiveness of delta color compression depends on the specifics of which pixel ordering is chosen for the delta color calculation. Maxwell contains the third generation of delta color compression, which improves effectiveness by offering more choices of delta calculation to the compressor. Thanks to the improvements in caching and compression in Maxwell, the GPU is able to significantly reduce the number of bytes that have to be fetched from memory per frame. Maxwell uses roughly 25% fewer bytes per frame compared to Kepler.
http://www.guru3d.com/articles_pages/asus_geforce_gtx_980_ti_strix_review,6.html
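
The whitepaper doesn't spell out the packing, but the delta idea itself is easy to sketch. Here's a toy single-channel version, with a hypothetical 8-value block and fixed-width signed deltas; the real hardware's block sizes, pixel orderings, and modes are not public:

```python
# Toy illustration of delta color compression: store one anchor value
# plus per-pixel deltas, packed at the width of the largest delta.
# This shows the general concept only; Nvidia's actual block sizes,
# pixel orderings, and modes are not documented.

def delta_compressed_bits(block, bits_per_value=8):
    """Bits needed to store a block as anchor + fixed-width signed deltas."""
    deltas = [b - a for a, b in zip(block, block[1:])]
    largest = max((abs(d) for d in deltas), default=0)
    delta_bits = largest.bit_length() + 1 if largest else 0  # +1 sign bit
    return bits_per_value + delta_bits * len(deltas)

smooth = [100, 101, 101, 102, 103, 103, 104, 105]  # gradient-like data
noisy  = [100, 7, 250, 33, 180, 91, 240, 12]       # incompressible data

for name, block in (("smooth", smooth), ("noisy", noisy)):
    raw = 8 * len(block)
    packed = delta_compressed_bits(block)
    # A real compressor falls back to raw storage when deltas expand.
    print(f"{name}: {raw} -> {packed} bits ({packed / raw:.0%})")
```

The smooth block packs into roughly a third of its raw size, while the noisy block would expand, which is why an incompressible texture sees no benefit on the bus.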
 

Regarding the 25% fewer bytes per frame claim, relative to the TechReport numbers:

Some base assumptions, which could readily be wrong; if they are, the rest can be disregarded:

- The bandwidth test has a texture-read input that is not being compressed; only the output via the ROPs is.
- Whatever the shader reads in is written out, so the benefit of compression would be that there appears to be more bandwidth, letting the read and write paths reach equilibrium again.
- The incompressible test, coupled with the test's broad access pattern, gives an indication of what the GPU can get in terms of actual external bus transfers: whatever the software perceives, this is what the bus is actually doing.
- There are no intermediate effects like export, TEX, DRAM, cache, or ALU quirks (at least some of these are really unlikely to hold across all the tests, as indicated later).

I've gone over the math, but I've stacked up so many assumptions that I could easily have gotten a lot wrong, and it wouldn't take much to invalidate the rest of this. If I've messed it up somewhere, my apologies.

780 Ti, with the older Nvidia compression:
197 incompressible, 223 compressible

Incompressible test:
98.5 in the read path + 98.5 in the output path = 197 GB/s.
Compressible, with a perceived bandwidth increase of Y (split evenly, since it's assumed to be 1 in per 1 out):
98.5 + Y/2 + 98.5 + Y/2 = 223. (This may be where I'm misunderstanding what the test shader's numbers mean for what should be a software-transparent compression method.)
Y = 26
98.5 + 13 + (98.5 + 13)*X = 197. (This is an attempt to get at the apparent load as it appears in external traffic, where the inbound reads are uncompressed and the outbound writes correspond to what the GPU manages to save.)
111.5 + 111.5*X = 197
X = .77
Since this is a comparison of the perceived contributions to bandwidth from the two tests, the incompressible bandwidth number could be divided out and the ratio X would be the same.
To note, this isn't the compression ratio so much as some kind of ratio of the outbound bandwidth's contribution to the total bandwidth numbers, relative to what the test measured as the apparent pre-compressor throughput.
The 780 Ti's pixel throughput is the lowest, especially relative to its memory bus, so if there is some other limit, it could be creating an unanticipated throughput ceiling.

Fury:
333 incompressible, 387 compressible
166.5 + Y/2 + 166.5 + Y/2 = 387
Y = 54
166.5 + Y/2 + (166.5 + Y/2)*X = 333
193.5 + 193.5*X = 333
193.5*X = 139.5
X = .72

980 Ti:
234 incompressible, 364 compressible
117 + Y/2 + 117 + Y/2 = 364
Y = 130
117 + Y/2 + (117 + Y/2)*X = 234
182 + 182*X = 234
182*X = 52
X = .29

980:
172 incompressible, 286 compressible
86 + Y/2 + 86 + Y/2 = 286
Y = 114
86 + Y/2 + (86 + Y/2)*X = 172
143 + 143*X = 172
143*X = 29
X = .20
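
Since the same three steps repeat for every card, here is the whole model in a few lines of Python. It just restates the arithmetic above; the even split between read and write (and of the gain Y) is my assumption, not something the test reports:

```python
# The back-of-envelope model from the posts above, in one place so the
# assumptions are easy to poke at. The even read/write split and the
# even split of the gain Y are assumptions, not measured bus behavior.

cards = {                   # (incompressible, compressible) in GB/s
    "780 Ti": (197, 223),
    "Fury":   (333, 387),
    "980 Ti": (234, 364),
    "980":    (172, 286),
}

for name, (incomp, comp) in cards.items():
    half = incomp / 2         # per-direction share of the incompressible test
    y = comp - incomp         # perceived gain in the compressible test
    pre = half + y / 2        # apparent pre-compressor throughput per direction
    x = (incomp - pre) / pre  # outbound contribution ratio, as defined above
    print(f"{name}: Y = {y:.0f}, X = {x:.2f}")
```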

So, with regards to 25% less data needed from Kepler to Maxwell, I think there is something else that is affecting the test.
Maxwell is getting a lot of mileage out of the compressed test, with the Kepler example and Fury showing a very wide gulf in the synthetic.
At least some factors really can't be handwaved away if the ratio between input and output swings that far, so I think something assumed initially is broken or I missed something.
 
Doesn't SiSoftware Sandra have a test for GPU bandwidth? I'll try downloading it later when I'm back home.

(Come to think of it, that benchmark suite seems to have completely disappeared from reviews.)
 
Bandwidth can be measured in several ways. In AIDA's GPGPU test applet (OpenCL), the mem copy gives me ~280 GB/s with my 780 Ti out of 364 GB/s theoretical. The old GPC Benchmark suite is still good for a more detailed breakdown with block-size comparison.
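
For anyone who wants to reproduce this kind of number without AIDA, a minimal device-to-device copy in Python/OpenCL is sketched below. It assumes pyopencl is installed, the buffer size is an arbitrary choice, and whether a copy counts as one or two buffers' worth of traffic varies by tool:

```python
# Minimal device-to-device copy bandwidth check, in the spirit of
# AIDA's mem copy test. A real benchmark would warm up, average
# several runs, and sweep block sizes.
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

size = 256 * 1024 * 1024  # 256 MiB per buffer; contents don't matter here
src = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=size)
dst = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=size)

evt = cl.enqueue_copy(queue, dst, src)  # buffer-to-buffer copy on the device
evt.wait()
seconds = (evt.profile.end - evt.profile.start) * 1e-9  # profiling is in ns

# A copy reads `size` bytes and writes `size` bytes; some tools report
# size/t instead, so compare like with like.
print(f"Copy bandwidth: {2 * size / seconds / 1e9:.1f} GB/s")
```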
 
In AIDA, Fury X is at around 364 GB/s, so the GPC test (which is what I used for the test above) is slightly "better" at utilizing Fiji's raw transfer potential.
 
2x HD 7970 3GB. Normally it shouldn't use both GPUs, since I haven't connected a monitor to the second one, but I'll need to retest it. (A single 7970 should get 288 GB/s.)
 
2x HD 7970 3GB. Normally it shouldn't use both GPUs, since I haven't connected a monitor to the second one, but I'll need to retest it. (A single 7970 should get 288 GB/s.)


Actually, if you have two monitors then you should connect one monitor to each card.
AMD GPUs have had this ridiculous problem for ~3 years where the memory clock goes up into "3D gaming mode" if you have two or more monitors connected to the same card.
If you have only one monitor per card, you'll probably save a lot of power.
 
Actually, if you have two monitors then you should connect one monitor to each card.
AMD GPUs have had this ridiculous problem for ~3 years where the memory clock goes up into "3D gaming mode" if you have two or more monitors connected to the same card.
If you have only one monitor per card, you'll probably save a lot of power.

Those are two different things. The multi-monitor intermediate memory clock (where the GPU is not in the idle state but in the video state) is a separate issue.

In Sandra, the limitation exists for Nvidia too; sadly, that's how their benchmark (or the driver together with their benchmark) works. You need to disable CrossFire/SLI, and if you want to test both GPUs at the same time (to get the overall compute, bandwidth, etc.), you effectively need to wire a second monitor to the second GPU.

In reality this is only the case for some of the benchmarks; for the OpenCL, DirectCompute GPU, and bandwidth ones it is not. I set only one GPU active for the benchmark and, surprise, I'm at ~190 GB/s, a bit more than half of the first result. I can imagine some of their benchmarks need the video output active, but not the bandwidth one, anyway. Still, I haven't checked whether it goes straight to full memory speed. (For a benchmark like that, to avoid usage problems and clock-speed variation, I would disable the power-saving features anyway and lock in 100% clock speed before the benchmark begins. As with the SuperPI OCL benchmark, you don't get full VRAM usage; it's really low. Sandra uses 1 GB of VRAM, for example.)
 
Bandwidth can be measured in several ways. In AIDA's GPGPU test applet (OpenCL), the mem copy gives me ~280 GB/s with my 780 Ti out of 364 GB/s theoretical. The old GPC Benchmark suite is still good for a more detailed breakdown with block-size comparison.
With 364 GB/s, I assume you're overclocking the memory to 7.5 Gbps?
 
Very interesting. ;) I'll believe it when I see it, but AMD has done dumb stuff in the past.
What is dumb about it?

It should make the low-end boards very, very small with great performance, which should make Dell/HP/Apple and the like very happy to offer the cards. Just look how tiny the Fury X is. Now imagine something like the 290X on 14/16nm with HBM. That would be a killer product.
 
A curiosity of HBM-equipped GPUs is that the GPU maker is effectively selling the memory to the add-in-board partner. AIBs can't independently source the memory, as they previously did. Notionally, the GPU maker should be deriving profit from the memory on top of the profit derived from the GPU.
 
What is dumb about it?

It should make the low-end boards very, very small with great performance, which should make Dell/HP/Apple and the like very happy to offer the cards. Just look how tiny the Fury X is. Now imagine something like the 290X on 14/16nm with HBM. That would be a killer product.
The "top to bottom" phrasing is not particularly well chosen, although I can see several ways that a new product range could have HBM throughout its covered range even if that range does not include every part of AMD's overall market.
 