AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I was considering the possibility that the block of what appears to be SRAM in the upper third of the green block was an L2 section.
The L2 cache is distributed in fine grained sections among the memory controllers (six of them). Those are probably the 3x3 blocks, flanking the rows of CUs along the horizontal center axis. As for the SRAM arrays in the setup pipes -- these are most likely the parameter caches or something related to the fragment scan-out buffering.
 
Something seems to be a huge bottleneck. The card just doesn't scale at all with clock increases. I hope drivers fix it.
 
Where exactly are you thinking it would end up if there were no ceiling?
With the introduction of compression, the 980 cards have results that can get to or over the interface bandwidth.
Perhaps that just means that the ROP count Fiji comes with shouldn't be expected to burn as much bandwidth as an HBM interface provides, but there are formats that should in theory be happy with all 512 GB/s. I'm not sure how the test derives its result.
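A quick back-of-the-envelope check of that, in C. The ROP counts, clocks and bandwidth figures below are just the commonly quoted board specs, not anything measured by the test itself:

```c
#include <stdio.h>

int main(void)
{
    /* Commonly quoted board specs, used purely for illustration:
     * ROP count, approximate core clock (GHz), DRAM bandwidth (GB/s). */
    struct gpu { const char *name; int rops; double clock_ghz; double mem_gbps; };
    const struct gpu gpus[] = {
        { "GTX 980 (ref. boost)", 64, 1.216, 224.0 },
        { "Fury X",               64, 1.050, 512.0 },
    };

    for (int i = 0; i < 2; i++) {
        /* Peak color-write rate for a 4-byte-per-pixel format. */
        double fill_gbps = gpus[i].rops * gpus[i].clock_ghz * 4.0;
        printf("%-20s fill %.0f GB/s vs. DRAM %.0f GB/s (ratio %.2f)\n",
               gpus[i].name, fill_gbps, gpus[i].mem_gbps,
               fill_gbps / gpus[i].mem_gbps);
    }
    return 0;
}
```

If those numbers are in the right ballpark, the 980's raw color-write rate already exceeds its DRAM bandwidth with 4-byte pixels, so anything measured above it would have to come from compression, while Fiji's 64 ROPs alone can't saturate 512 GB/s with that format.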

The L2 cache is distributed in fine grained sections among the memory controllers (six of them). Those are probably the 3x3 blocks, flanking the rows of CUs along the horizontal center axis. As for the SRAM arrays in the setup pipes -- these are most likely the parameter caches or something related to the fragment scan-out buffering.
That makes sense. The storage in the central blocks appears to be more prominent than was visible in other die shots for older GPUs. Perhaps it's related to Tonga's changed performance profile in tessellation?

Something seems to be a huge bottleneck. The card just doesn't scale at all with clock increases. I hope drivers fix it.
There is that unknown when it comes to drivers as a source of overhead that can dilute clock speed increases.
Fury X is in the neighborhood of the 980 Ti while still paying the AMD driver performance tax.
 
I wonder if it would have been a better decision for AMD to re-balance Fiji's architecture, fit two more setup pipes, and bump the ROP count to 96, at the expense of slightly fewer CUs?
Looking at Tonga's die, a single setup pipe takes roughly the same area as a CU. So, four fewer CUs (64 -> 60) would trade quite well for two more setup pipes and eight additional ROP clusters, which would definitely utilize the HBM throughput more rationally and would be more "visible" for the purpose of high-resolution gaming/benchmarking.

From what I gather, the mismatch between the number of ROPs in Tahiti (32) and the number of memory controllers (6) required a crossbar. Hawaii, with 64 ROPs and 8 memory controllers, managed to do away with it.

So I guess AMD didn't want to reintroduce the crossbar, which would have been necessary with 96 ROPs (along with its mm² and watts) and didn't want to go all the way to 128 ROPs either.

Edit: come to think of it, if that's correct, then Tonga probably has 48 ROPs. I hope AMD will actually release a full Tonga before I have grandchildren so we can figure out whether it is better balanced than Fiji.
 
The one anomaly in that fillrate test is the 780 Ti doing far worse than its 48 ROPs would suggest. Perhaps lower clock speed, or NVIDIA gimping it.
 
Sorry to be late to the party; please treat this as a conclusion :)

The approach is always the same for this stuff (depth buffer, MSAA, etc.): when you overflow, you just drop to baseline uncompressed.
If the delta values fit into 2 bits you go, say, 1:8; otherwise, if they fit into 3 bits, 1:4; if 4 bits, 1:2; otherwise dump to 1:1.
Sounds reasonable, but this would only work with the back buffer using an integer 8-bit per component RGB/BGR format, and not with HDR formats R10G10B10A2 and 16-bit floating point R16G16B16A16.
Same for multiple render target formats, including one- and two-component 8-bit integer formats.
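For what it's worth, here is a minimal sketch of the per-block ratio selection described in that quote, assuming an 8x8 block of 8-bit-per-component pixels with deltas taken against an anchor pixel. The real hardware scheme isn't public, so the thresholds and the anchor choice are placeholders:

```c
#include <stdint.h>
#include <stdlib.h>

typedef enum { RATIO_8_1, RATIO_4_1, RATIO_2_1, RATIO_1_1 } block_ratio;

/* Bits needed to hold v as a signed delta (sign bit + magnitude bits). */
static int bits_needed(int v)
{
    int bits = 1;
    for (int m = abs(v); m > 0; m >>= 1)
        bits++;
    return bits;
}

/* Pick a compression ratio for one 8x8 block of 8-bit RGBA pixels by the
 * widest per-component delta against the anchor pixel px[0], following the
 * "2 bits -> 1:8, 3 bits -> 1:4, 4 bits -> 1:2, else 1:1" mapping quoted above. */
block_ratio pick_ratio(const uint8_t px[64][4])
{
    int max_bits = 0;
    for (int i = 1; i < 64; i++)
        for (int c = 0; c < 4; c++) {
            int b = bits_needed((int)px[i][c] - (int)px[0][c]);
            if (b > max_bits)
                max_bits = b;
        }
    if (max_bits <= 2) return RATIO_8_1;
    if (max_bits <= 3) return RATIO_4_1;
    if (max_bits <= 4) return RATIO_2_1;
    return RATIO_1_1;   /* delta overflow: store the block uncompressed */
}
```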

How do you know how many bytes a compressed tile occupies when reading it back?
There is an array describing the size of each block with a bit mask (probably 2 bits as explained below).

An 8x8 block of 4-byte (32-bit) pixels takes a minimum of 32 bytes assuming 8:1 compression, then 64 bytes for 4:1 and 128 bytes for 2:1. Finally, it takes 256 bytes if uncompressed, so each block is aligned on a 256-byte boundary.

Assuming a 128-bit (16-byte) or 256-bit (32-byte) memory bus, this takes 4 to 8 reads, and a peek into the above array (either 4 bytes, or more likely a full line readout to improve cache hits for subsequent reads) is negligible.
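Putting numbers on that, a small sketch assuming one 256-bit (32-byte) bus transaction per read:

```c
#include <stdio.h>

int main(void)
{
    const int tile_bytes = 8 * 8 * 4;   /* 8x8 pixels, 4 bytes each = 256 bytes */
    const int txn_bytes  = 32;          /* one 256-bit bus transaction */
    const int ratios[]   = { 8, 4, 2, 1 };

    for (int i = 0; i < 4; i++) {
        int bytes = tile_bytes / ratios[i];
        printf("%d:1 -> %3d bytes -> %d transactions of %d bytes\n",
               ratios[i], bytes, bytes / txn_bytes, txn_bytes);
    }
    return 0;
}
```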

Not exactly on-chip... You need at least 2 bits per tile (cleared, 1:1, 1:2, 1:4; one bit more if you want to support higher ratios), so with 8x8 tiles it can add up to quite something (for a 16k x 16k buffer and 2 bits per tile, that would give 1 MB per RT).
Who needs a 16K framebuffer? More realistically, with 2 bits per tile and 8x8 pixel blocks, 1080p requires 8100 bytes, 2560x1600 requires 16000 bytes, and 4K (2160p) requires 32400 bytes.
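The arithmetic behind those figures, for anyone who wants to check: 2 metadata bits per 8x8 tile works out to width*height/256 bytes per render target.

```c
#include <stdio.h>

int main(void)
{
    /* 2 metadata bits per 8x8 tile: (w/8)*(h/8)*2/8 = w*h/256 bytes per RT. */
    const struct { const char *name; long w, h; } res[] = {
        { "1080p",     1920,  1080 },
        { "2560x1600", 2560,  1600 },
        { "4K",        3840,  2160 },
        { "16k x 16k", 16384, 16384 },  /* the worst case mentioned above */
    };

    for (int i = 0; i < 4; i++) {
        long bytes = res[i].w * res[i].h / 256;
        printf("%-10s %8ld bytes\n", res[i].name, bytes);
    }
    return 0;
}
```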

I agree that any of the above is too large to have a dedicated cache though.

One cache line, no matter where, can store information about quite a few tiles, of course... unless you have a really terrible case not achieving any compression, it's not a big deal.
The memory controller probably doesn't support reading individual bytes, so it's 32 bits (4 bytes) at minimum - though most likely this is not very efficient and would require as many cycles as just reading a full line (128 or 256 bits).

Could it be done in Linux with a modified open source driver, with an added function call to DMA out raw video RAM data?
This would involve a thorough analysis of the shader compiler in the OpenGL driver.

AFAIK AMD's open source driver is currently limited to kernel-mode-setting, i.e. basic configuration, framebuffer, display mode and refresh rate stuff. The heavyweight stuff like OpenGL rendering is maintained by Mesa/Gallium3D, and I'm not really sure it knows anything about the compression hardware in GCN3... it most likely doesn't, when you consider how its performance lags behind the proprietary AMD driver.

Wow, there is a blast from the past...
Textured models were first researched at LucasFilm/Pixar some 14 years before they appeared in mainstream PC graphics cards.
 
If you bind the compressed resources with a graphics operation, you might be able to hack up a compute shader to grab any piece of memory mapped to the GPU, including metadata.
Shaders are not executed directly as binary code - when the intermediate shader code is compiled into machine code, it should take full account of the declared resource formats and resource descriptors. So the IL compiler will produce machine code that uses the compression/decompression block when an appropriate resource format is bound, and so decompression will happen transparently to the shader code.

I also don't think the driver would (or should) allow addressing any resource in memory that wasn't explicitly bound to the shader. It would be a bad implementation that is prone to errors.

Most video memory is not mapped to the CPU at all due to the existence of 32-bit OSes. Resizable BAR has been supported by many video cards for a while but OS support has been spotty at best.
Hmmm. I thought it was only partially mapped in 32-bit operating systems because they do not support a 64-bit virtual address space, but indeed MSINFO32 lists only 264 MBytes of mapped memory for my Radeon R9 290X in 64-bit Windows 10...

I don't know exactly how it works such that both the TMUs and ROPs have access to the uncompressed data in the end.
I don't think non-integer render target formats can be compressed with this simple color delta scheme. In fact I don't even think any render target format gets compressed at all, for various reasons...

The compression happens pixel block by pixel block. Blocks are compressed either 2:1, 4:1 or 8:1, or not at all... That makes the average of 25% or 40% seem low compared to the per-block compression factors.
BTW, for the statistics, here are a few quick examples of block mixes that give you an average of exactly 0.6, i.e. 40% savings.

Example 1: 25 blocks, where 2 blocks have 1/8 compression, 5 blocks have 1/4, 9 have 1/2, and the last 9 have 1 (no compression);
Example 2: 5 blocks, where 2 blocks have 1/4 compression, 1 block has 1/2, and 2 have 1 (no compression);
Example 3: 10 blocks, where 2 blocks have 1/8 compression, 1 block has 1/4, 3 have 1/2, and 4 have 1 (no compression).

So it does seem to mostly have 1:1 (no compression) and 2:1 compression.
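A quick check of the three mixes above, just redoing the weighted average:

```c
#include <stdio.h>

int main(void)
{
    /* Block counts at 1/8, 1/4, 1/2 and 1 (uncompressed) for the three
     * example mixes above; each should average out to 0.6, i.e. 40% savings. */
    const int mix[3][4] = {
        { 2, 5, 9, 9 },   /* Example 1: 25 blocks */
        { 0, 2, 1, 2 },   /* Example 2: 5 blocks  */
        { 2, 1, 3, 4 },   /* Example 3: 10 blocks */
    };
    const double size[4] = { 1.0 / 8, 1.0 / 4, 1.0 / 2, 1.0 };

    for (int e = 0; e < 3; e++) {
        double sum = 0.0;
        int blocks = 0;
        for (int i = 0; i < 4; i++) {
            sum += mix[e][i] * size[i];
            blocks += mix[e][i];
        }
        printf("Example %d: average %.2f -> %.0f%% savings\n",
               e + 1, sum / blocks, (1.0 - sum / blocks) * 100.0);
    }
    return 0;
}
```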

Between AMD and NVIDIA we have claims of 25-40%. They're far from 2:1, let alone 4:1 or 8:1.
Either it's the final average compression rate, the caches keep uncompressed data, and it's pure fill-rate savings; or the average compression rate in the end is much worse, but compressed tiles are cached somewhere, they allow cache-bandwidth savings, and that is what we see.
But what exactly do they claim, compression rate or bandwidth savings?

I'd vote for uncompressed data in the cache. A decompression block and another level of cache in between L2 and L1 sound much too complex.
 
Who needs a 16K framebuffer? More realistically, with 2 bits per tile and 8x8 pixel blocks, 1080p requires 8100 bytes, 2560x1600 requires 16000 bytes, and 4K (2160p) requires 32400 bytes.
Not saying you really need it, but this is what the D3D11 specification requires to be supported. Maybe some app is doing something crazy (like 4x supersampling at 4K...) and you don't want to fall off a performance cliff once you've reached some fixed size where you'd need the savings the most (and yes, chips like the RV350 had such cliffs). And in the case of MRTs you need that structure for each RT, which further increases the size.
AFAIK AMD's open source driver is currently limited to kernel-mode-setting, i.e. basic configuration, framebuffer, display mode and refresh rate stuff. The heavyweight stuff like OpenGL rendering is maintained by Mesa/Gallium3D, and I'm not really sure it knows anything about the compression hardware in GCN3... it most likely doesn't, when you consider how its performance lags behind the proprietary AMD driver.
Well, it doesn't support that feature for GCN 1.2 yet, but depth buffer compression is not really different. And the driver knows pretty much everything about this, including doing the decompress "blits" to uncompressed when necessary (FWIW it actually does this in-place: you essentially draw a quad and set up some RBE bits correctly, so depth values get read/written even though the depth test always passes and the values don't change, with compression enabled for reads but disabled for writes, or something like that - it is quite possible the hw may even be able to skip tiles which are already uncompressed, though I'm not sure). I'm waiting for the driver to support this stuff on GCN 1.2 so I get a better understanding of how it works (especially the access in the TMUs) ;-).
 
Hmmm. I thought it was only partially mapped in 32-bit operating systems because they do not support a 64-bit virtual address space, but indeed MSINFO32 lists only 264 MBytes of mapped memory for my Radeon R9 290X in 64-bit Windows 10...

...really?? But it makes no sense not to have everything mapped through MMIO. A BIOS/UEFI issue, maybe?
 
I don't think non-integer render target formats can be compressed with this simple color delta scheme. In fact I don't even think any render target format gets compressed at all, for various reasons...
I'm not sure if you're saying no target but the swap chain would be compressed, or simply that no formats other than 8bpc formats would be compressed.
I don't see a reason to believe either.
 