AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

You're right, I forgot about Morton ordering... so that would be twice the number of tiles per cache line of the same byte size. But I was told in another thread https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/ that, at least on GCN, the CB and DB caches don't go through the L2... did that change for Tonga/GCN + framebuffer compression?
Probably not.
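(For reference, Morton/Z-ordering just interleaves the x and y bits of the tile coordinate, so a cache line covers a square group of neighbouring tiles rather than a 1D strip; a toy sketch of the indexing, with nothing hardware-specific about it:)

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) into a single Z-order (Morton) index."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return idx

# A 2x2 group of tiles maps to 4 consecutive indices, so one cache line
# covers a square tile neighbourhood instead of a row of tiles.
for y in range(2):
    for x in range(2):
        print((x, y), "->", morton_index(x, y))
```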

Why would TMUs need access to this? If you mean render-to-texture, doesn't the driver resolve the compressed RT first manually? At least that's what I assumed.
Because decompressing color RTs and depth/stencil buffers (which have had similar compression for a long time, not to mention that color already had compression for MSAA) is a bandwidth eater, and after all, sampling them isn't uncommon (of course it eats bandwidth there too if you sample from uncompressed data). And yes, up to GCN 1.1 this had to be done with a decompress blit, but Tonga should not have to, and NVidia hasn't had to since Fermi unless I'm badly mistaken. As said earlier, I don't know exactly how it works, only that both the TMUs and the ROPs end up with access to the uncompressed data.
 
I meant memory bandwidth-wise, as writing a few bytes spread out here and there is quite wasteful.
Any write will affect the entire 8x8 block because it involves re-encoding the differences (unless the differences are additionally encoded with a global codebook, which is unlikely). The first paper assumes that keeping the entire compressed block in a dedicated cache still gives significant memory bandwidth savings at typical compression ratios of 2.5-4 to 1. Of course it all depends on the details of the compression format, which is AFAIK not disclosed by AMD.
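To make the re-encoding point concrete, here is a minimal sketch of a per-block delta scheme; the 8x8 block size and the single anchor pixel are assumptions for illustration only, not AMD's (undisclosed) format:

```python
import numpy as np

def encode_block(block):
    """Delta-encode an 8x8 single-channel block against its top-left anchor pixel."""
    anchor = int(block[0, 0])
    deltas = block.astype(np.int32) - anchor
    # The widest (signed) delta decides the bit budget for every delta in the block.
    max_abs = int(np.abs(deltas).max())
    bits_per_delta = max(1, max_abs.bit_length() + 1)  # +1 for the sign bit
    return anchor, deltas, bits_per_delta

def write_pixel(block, x, y, value):
    """Even a single-pixel write forces re-encoding the whole block,
    because the shared delta width may change."""
    block[y, x] = value
    return encode_block(block)

block = np.full((8, 8), 128, dtype=np.uint8)
print(encode_block(block)[2])            # 1 bit per delta: the block is flat
print(write_pixel(block, 3, 3, 200)[2])  # 8 bits: one outlier widens every delta slot
```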
 
Of course it all depends on the details of the compression format, which is AFAIK not disclosed by AMD.
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...? :D

Of course, if Tonga already has this tech (in perhaps the exact same implementation), then we could just use one of those cards instead.
 
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...? :D

Of course, if Tonga already has this tech (in perhaps the exact same implementation), then we could just use one of those cards instead.
Fiji is supposedly the same GCN level, so should work just the same.
Should be pretty trivial to get that data out with some hacks in the open-source Linux driver (it does not support Tonga yet, but you could start by analyzing depth buffer compression or color MSAA compression, though the latter is probably using something different...). Not sure though if it's really all that interesting :).
 
In the Heaven bench it's barely faster than my 780Ti here.

At least it's faster than Hawaii @1050/1325!
My score was 852 at the same settings with the 15.15 beta driver. That's 29% faster; not very impressive, but I suspect it's down to tessellation.
 
Could we, once these cards are out in the wild, perhaps DMA out the compressed framebuffer as-is, without going through decompression, and then pick apart the encoding scheme...?
Not with DirectX; it hasn't allowed direct framebuffer access since DirectDraw was retired in DirectX 8. You can only set a standard front/back buffer "texture" format, and the physical representation is completely abstracted by the driver.

Video adapter memory is indeed directly mapped into the CPU virtual address space and can be accessed using on-chip DMA engines; however, it's in kernel space, the upper half of the 64-bit virtual address space, and is only accessible by the kernel-mode video driver which owns that memory range.

Should be pretty trivial to get that data out with some hacks in the open-source Linux driver... Not sure though if it's really all that interesting :).
Hacking Linux drivers is far from trivial, but in the end I'd guess it would be a variation of a color-difference scheme.
 
Not with DirectX; it hasn't allowed direct framebuffer access since DirectDraw was retired in DirectX 8. You can only set a standard front/back buffer "texture" format, and the physical representation is completely abstracted by the driver
If you bind the compressed resources to a graphics operation, you might be able to hack up a compute shader to grab any piece of memory mapped to the GPU, including the metadata.
DmitryKo said:
Video adapter memory is indeed directly mapped into the CPU virtual address space and can be accessed using on-chip DMA engines; however, it's in kernel space, the upper half of the 64-bit virtual address space, and is only accessible by the kernel-mode video driver which owns that memory range.
Most video memory is not mapped to the CPU at all due to the existence of 32-bit OSes. Resizable BAR has been supported by many video cards for a while but OS support has been spotty at best.

-FUDie
 
however, it's in kernel space, the upper half of the 64-bit virtual address space, and is only accessible by the kernel-mode video driver which owns that memory range.
Could it be done in Linux with a modified open-source driver, with an added function call to DMA out the raw video RAM data? Or is AMD's driver not open source, perhaps? *shrug* This is getting rather academic, heh...
 
Well, you bought it with a hell of a lot of routing. :)
If, say, you implement it as a pipeline along the lines of what you were drawing, and you can manage an outstanding queue, you might actually get away with a sustained throughput of 2 cycles per tile, which would be very nice. But I have my doubts that such an implementation matches the latency requirements of the render feedback loop (I mean tiles circulating between the raster output and memory). I suspect you can't give away a single cycle of added latency, otherwise it's: "do it again".

Sure, the approach is always the same for this stuff (depth buffer, MSAA, etc.): when you overflow you just drop back to the uncompressed baseline.
If the delta values fit into 2 bits you go, say, 1:8; when they need 3 bits, 1:4; when 4 bits, 1:2; otherwise you dump the tile uncompressed at 1:1. It's never really cheap to figure out the right permutation for the encoding, but if it fits within 4-8 cycles per tile, it might be in a better place than the pipelined implementation above.
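Roughly, the mode selection could look like this (a toy sketch; the bit-width thresholds and ratios are just the example numbers above, not a known hardware spec):

```python
def pick_compression_ratio(deltas):
    """Map the widest per-pixel delta in a tile to one of the example modes:
    2-bit deltas -> 1:8, 3-bit -> 1:4, 4-bit -> 1:2, anything wider -> 1:1."""
    width = max((abs(d).bit_length() + 1 for d in deltas), default=1)  # +1 sign bit
    if width <= 2:
        return 8   # 1:8
    if width <= 3:
        return 4   # 1:4
    if width <= 4:
        return 2   # 1:2
    return 1       # dump the tile uncompressed

print(pick_compression_ratio([0, 1, -1, 1]))    # -> 8
print(pick_compression_ratio([3, -2, 1, 0]))    # -> 4
print(pick_compression_ratio([7, -5, 2, 1]))    # -> 2
print(pick_compression_ratio([100, -3, 0, 2]))  # -> 1
```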

The acceptability very much depends on the exact statistical profile of the ROP-cache/MC transfers. An implementation might just not give a damn about latency because tiles are only ever re-touched every 1k cycles. Or, if going to memory already takes 400 cycles, what does it matter if we add 50-100 more? There might be a window where you can pick a dog-slow implementation, like the XB1's zlib compression: if that thing were serial, or you had to use it serially, it would be a killer; if it's asynchronous you might not care that much.
My preference, though, is to try to make it really tight (better safe than sorry) without forgetting about compression efficiency.

Anyway, a little bit off-topic, sorry for the distraction. :)
I don't consider it off-topic.

Between AMD and NVidia we have claims of 25-40%. They're far from 8:1 or 4:1, let alone 2:1. It appears that the returns diminish rapidly. The transistor budgets are hardly lacking.

NVidia apparently chose to use delta colour compression in order to make more fillrate viable.

I can't work out what AMD's view of the balance is, with the meagre fillrates of Tonga/Antigua and Fiji in direct contrast to NVidia. There appears to be an argument that as resolutions rise, delta colour compression becomes less useful. If true, that would hint that latency is getting the better of NVidia.
 
I can't work out what AMD's view of the balance is, with the meagre fillrates of Tonga/Antigua and Fiji in direct contrast to NVidia. There appears to be an argument that as resolutions rise, delta colour compression becomes less useful. If true, that would hint that latency is getting the better of NVidia.

Wouldn't you expect just the opposite? At higher resolutions, there should be larger regions made up of relatively similar pixels. By "larger" I mean "containing more pixels".
 
The compression happens pixel block by pixel block. Each block either compresses 2:1, 4:1 or 8:1, or not at all; I don't think there is anything intermediate between 2:1 and 1:1. If you can't achieve 2:1 compression, you immediately fall back to 1:1. That's what makes the average of 25% or 40% so low compared to the per-block compression factors.
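A quick back-of-the-envelope example of why that mix ends up in the 25-40% range; the per-mode block fractions below are invented purely for illustration:

```python
# Hypothetical distribution of blocks over compression modes:
# (fraction of blocks, ratio), where compressed size = 1/ratio of the original.
modes = [(0.50, 1),   # 50% of blocks don't compress at all
         (0.30, 2),   # 30% hit 2:1
         (0.15, 4),   # 15% hit 4:1
         (0.05, 8)]   #  5% hit 8:1

avg_size = sum(frac / ratio for frac, ratio in modes)
print(f"average bandwidth saving: {(1 - avg_size) * 100:.0f}%")
# -> ~31%, even though individual blocks compress at 2:1 to 8:1
```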
 
I don't consider it off-topic.

Between AMD and NVidia we have claims of 25-40%. They're far from 8:1 or 4:1, let alone 2:1. It appears that the returns diminish rapidly. The transistor budgets are hardly lacking.

It probably doesn't mean that much, because if the average ratio were a little above 1:2 it would already be as good as top-notch lossless compressors. Although, yes, we're talking about artificial images, not photos, so there could/should be more redundancy.
I guess it can mean two things: either it's the final average compression ratio, the caches keep uncompressed data, and it's pure fill-rate savings; or the average compression ratio is in fact much worse, but compressed tiles are cached somewhere and allow cache-bandwidth savings, and that's what we see. I think we should measure whether consuming the render target through an SRV is actually faster than it should be, because that would give the real compression ratio, minus a bit of arithmetic overhead.
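If someone wants to try that measurement, the inference itself is trivial; all the numbers below are hypothetical placeholders:

```python
# Infer the effective compression ratio from a bandwidth-bound SRV sampling pass.
mem_bandwidth_gbs  = 320.0   # theoretical DRAM bandwidth of the card
bytes_per_texel    = 4       # e.g. an RGBA8 render target
measured_gtexels_s = 120.0   # throughput actually achieved by the sampling pass

uncompressed_gbs = measured_gtexels_s * bytes_per_texel   # data it *appears* to read
effective_ratio  = uncompressed_gbs / mem_bandwidth_gbs
print(f"effective compression ratio ~ {effective_ratio:.2f}:1")  # 1.50:1 here
```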

I'm starting to feel a bit uncomfortable now, because we cannot estimate its behaviour; black boxes are not nice to deal with. I hope it doesn't have weird, erratic behaviour in some situations. Not globally, but, say, under certain overdraw characteristics.

NVidia apparently chose to use delta colour compression in order to make more fillrate viable.

I can't work out what AMD's view of the balance is, with the meagre fillrates of Tonga/Antigua and Fiji in direct contrast to NVidia. There appears to be an argument that as resolutions rise, delta colour compression becomes less useful. If true, that would hint that latency is getting the better of NVidia.

I would suspect that under current game profiles the performance should rise. In the end you don't have enough texture resolution for all-distinct pixels, and texture mapping converges to bilinear up-sampling plus some lighting, which is pretty smooth; at least compared to the tendency of low-res rasterization to undersample, or the somewhat noisy anisotropic down-sampling. And there are fewer pixels at the edge boundaries (and thus texture separations) of triangles when you raise the resolution. Depends on the assets, of course.
 
That guy's SoM benches make no sense. A 390X is like 10% slower than a 980Ti in that game, so Fury should be decently faster, but it's slightly behind a 980Ti.

And the Vantage test had the 285 performing at twice the rate of a 280 in the bandwidth test, unless something else changed too.
 