AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I figured it was backed by VRAM, but my guess was there was an on-chip cache dedicated to it. It just makes sense from a performance perspective... adding an L2 cache access for every few framebuffer accesses doesn't sound very nice.
 
I figured it was backed by VRAM, but my guess was there was an on-chip cache dedicated to it. It just makes sense from a performance perspective... adding an L2 cache access for every few framebuffer accesses doesn't sound very nice.
It is certainly possible (and indeed pretty much a must) that there's some cache dedicated to it in the ROPs, but I was just pointing out it wouldn't make sense to store all of it on chip at once (as older chips did), since the maximum cache size you'd need is simply too big. But we don't know much about how the ROP caches work in general (tile data + metadata), and I have no idea how this works for newer chips (which need to access this from the TMUs, so probably indeed through L2). But it could certainly be stuffed into ordinary cache lines; one cache line, no matter where, can store information about quite a few tiles. With the mentioned 2 bits per tile, for an RGBA8 format the metadata is only 1/1024 of the data you'd need for the uncompressed pixels, so unless you hit a really terrible case that achieves no compression at all, it's not a big deal.
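To make that 1/1024 figure concrete, here's a quick back-of-the-envelope check. The 2 bits per tile is from the discussion above; the 8x8 tile size and 64-byte cache line are my own assumptions for illustration:

```c
/* Back-of-the-envelope check of the metadata overhead mentioned above.
   Assumptions (not confirmed for any specific GPU): 8x8-pixel tiles,
   RGBA8 render target, 2 bits of compression state per tile, and a
   64-byte cache line. */
#include <stdio.h>

int main(void) {
    const int tile_w = 8, tile_h = 8;       /* assumed tile size */
    const int bytes_per_pixel = 4;          /* RGBA8             */
    const int meta_bits_per_tile = 2;       /* per the discussion */
    const int cache_line_bytes = 64;

    int tile_bits = tile_w * tile_h * bytes_per_pixel * 8;   /* 2048 */
    printf("metadata/pixel-data ratio: 1/%d\n",
           tile_bits / meta_bits_per_tile);                  /* 1/1024 */

    /* How many tiles' metadata fit in one cache line, and how much
       framebuffer that covers. */
    int tiles_per_line = cache_line_bytes * 8 / meta_bits_per_tile; /* 256 */
    printf("one cache line tracks %d tiles = %d KiB of RGBA8 pixels\n",
           tiles_per_line,
           tiles_per_line * tile_w * tile_h * bytes_per_pixel / 1024);
    return 0;
}
```

Under those assumptions a single 64-byte cache line tracks 256 tiles, i.e. 64 KiB worth of framebuffer, which is why stuffing the metadata into ordinary cache lines looks cheap.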
 
But it could certainly be stuffed into ordinary cache lines; one cache line, no matter where, can store information about quite a few tiles. With the mentioned 2 bits per tile, for an RGBA8 format the metadata is only 1/1024 of the data you'd need for the uncompressed pixels, so unless you hit a really terrible case that achieves no compression at all, it's not a big deal.
You're right, I forgot about Morton ordering... so that would be twice the number of tiles per cache line of the same byte size. But I was told in another thread https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/ that, at least on GCN, the CB and DB caches don't go through the L2... did that change for Tonga / GCN with framebuffer compression?
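Since Morton ordering came up: for anyone unfamiliar, this is the textbook bit-interleave that maps 2D tile coordinates to a 1D index so that neighbouring tiles stay close in memory. It's the generic technique only; I'm not claiming this matches how GCN actually addresses its tiles or metadata:

```c
/* Minimal sketch of Morton (Z-order) indexing. The mapping of tiles to
   cache lines here is purely illustrative. */
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Interleave x and y bits: ... y1 x1 y0 x0 */
static uint32_t morton2d(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}

int main(void) {
    /* A 2x2 block of tiles maps to 4 consecutive indices. */
    for (uint32_t y = 0; y < 2; ++y)
        for (uint32_t x = 0; x < 2; ++x)
            printf("tile (%u,%u) -> index %u\n", x, y, morton2d(x, y));
    return 0;
}
```

The point for the cache discussion is that a run of consecutive Morton indices always covers a square-ish 2D region, so one cache line of metadata covers a compact block of the screen rather than a long thin strip.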
and I have no idea how this works for newer chips (which need to access this from the TMUs, so probably indeed through L2)
Why would TMUs need access to this? If you mean render-target-to-texture, doesn't the driver resolve the compressed RT first manually? At least that's what I assumed.
 
Further thinking about buffer compression... How do you store compressed pixels efficiently in memory? Are they packed together into bundles to fit the DRAM burst length, or how does it work? I can only assume it would be very inefficient to store a few bytes at most where a full 8- or 16-byte deep-color pixel used to reside, and then repeat for all other pixels.

I'm just thinking that if you re-pack the pixels, you would need some kind of mechanism to quickly isolate an individual pixel inside the bundle when you need to read it back again... *shrug* In instances like this, I really wish I'd had the brains and willpower to have studied graphics engineering at university... :p
 
How do you store compressed pixels efficiently in memory? Are they packed together into bundles to fit the DRAM burst length
Yes, fixed-length encoding, probably pixel differences in an 8x8 block (a rough sketch of the idea follows the links below).

http://graphics.stanford.edu/~mhous...all/HoKo_compression_in_graphics_pipeline.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.60.8187&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.412&rep=rep1&type=pdf
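To illustrate what a fixed-length delta scheme could look like, here is a sketch for one 8x8 tile of a single 8-bit channel. The specific numbers (one anchor byte plus signed 4-bit deltas, made-up 2-bit mode codes) are purely hypothetical; real hardware schemes, per the papers above, are more elaborate:

```c
/* Hypothetical fixed-length delta compression for one 8x8 tile of
   8-bit pixels: if every pixel is within a signed 4-bit delta of the
   tile's first pixel, store anchor + packed deltas at a fixed size;
   otherwise store the tile raw. Mode codes are invented for the sketch. */
#include <stdint.h>
#include <stdio.h>

#define TILE_PIXELS 64  /* 8x8 tile, one 8-bit channel */

/* 2-bit per-tile metadata, as discussed above (codes are made up). */
enum tile_mode { TILE_RAW = 0, TILE_DELTA = 1, TILE_CLEAR = 2 /* unused here */ };

static enum tile_mode encode_tile(const uint8_t px[TILE_PIXELS],
                                  uint8_t out[TILE_PIXELS]) {
    int anchor = px[0];
    /* First pass: does every delta fit in a signed 4-bit value? */
    for (int i = 1; i < TILE_PIXELS; ++i) {
        int d = px[i] - anchor;
        if (d < -8 || d > 7) {                      /* no: store raw */
            for (int j = 0; j < TILE_PIXELS; ++j) out[j] = px[j];
            return TILE_RAW;
        }
    }
    /* Fixed compressed layout: anchor byte + 63 deltas, two per byte.
       A decoder would sign-extend each nibble. */
    out[0] = (uint8_t)anchor;
    for (int i = 1; i < TILE_PIXELS; ++i) {
        uint8_t nib = (uint8_t)((px[i] - anchor) & 0xF);
        int byte = 1 + (i - 1) / 2;
        if ((i - 1) & 1) out[byte] |= (uint8_t)(nib << 4);
        else             out[byte]  = nib;
    }
    return TILE_DELTA;                              /* 33 bytes, not 64 */
}

int main(void) {
    uint8_t tile[TILE_PIXELS], packed[TILE_PIXELS];
    for (int i = 0; i < TILE_PIXELS; ++i)           /* gentle gradient */
        tile[i] = (uint8_t)(100 + (i % 8));
    enum tile_mode m = encode_tile(tile, packed);
    printf("mode=%d (1=delta), stored size: %d bytes\n",
           m, m == TILE_DELTA ? 33 : TILE_PIXELS);
    return 0;
}
```

The nice property for read-back, which also answers the earlier question about isolating a pixel inside a bundle, is that with fixed-length modes the bit position of any pixel is a pure function of its index, so one pixel can be fetched without parsing the whole tile.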

I can only assume it would be very inefficient to store a few bytes at most where a full 8- or 16-byte deep-color pixel used to reside, and then repeat for all other pixels
Inefficient for what exactly - perceived quality, compression ratio, or decoding complexity?
 
[attached image: photo of a Fury X retail box]


450GB/s, non-overclockable HBM...lolx
 
Are you laughing because there is an error on the box or because you think Fury X will have 450GB/s memory bandwidth?

Another vendor, for instance, has put the Fury X up for pre-order, but don't buy it as it has GDDR5! (sarcasm)
 
Sapphire is an AMD premium partner and... I think they also assemble some cards for them... it would be a silly printing mistake, no?
 
Sapphire is an AMD premium partner and... I think they also assemble some cards for them... it would be a silly printing mistake, no?

On the other hand, AMD designs the chip and puts out the specification. All Fury X cards will be AMD-made for now (manufactured by a partner, most likely Sapphire). AMD states a 500MHz HBM clock in the official datasheet, so please do the math and draw your own conclusion. Besides, HBM is overclockable like any other memory, but AMD will not allow overclocking from Overdrive, at least initially. There is no saying third-party tools won't break the lock, and TBH there were leaks from a couple of months ago claiming HBM1 on Fiji can go as high as 700MHz.

Anyway, this is a non-issue for now, as there is no point having massive bandwidth if your engine can't utilize it. We will have to wait and see if Fiji can be memory-bandwidth limited at all.
 
Why would TMUs need access to this? If you mean render-target-to-texture, doesn't the driver resolve the compressed RT first manually? At least that's what I assumed.
I don't know what AMD implemented, but a resolve pass required on colour render targets might often actually increase total memory bandwidth requirements (you'd read the compressed surface back and write it out decompressed, on top of the traffic from rendering it in the first place).

Note that bandwidth savings aren't everything and it is sometimes possible to achieve higher performance by increasing the total number of memory accesses, if that leads to a more even spread of memory accesses over time. I.e. if you could do the resolve (which is a streaming operation thus doesn't need to pollute any cache) while the memory subsystem is mostly idle. But that seems unlikely.
 
Maybe the HBM is clocked at 440MHz? I think AMD is promising up to 512GB/s, and to achieve that they must use 500MHz. Since it's "up to", I wouldn't be surprised if 450GB/s on early cards is correct. It will still be the highest-bandwidth GPU available.
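For what it's worth, here is the arithmetic for both clocks, under the commonly reported assumption of a 4096-bit bus (four 1024-bit stacks) at double data rate:

```c
/* Bandwidth = clock x 2 (DDR) x bus width / 8, assuming the widely
   reported 4096-bit HBM1 bus (four 1024-bit stacks) on Fiji. */
#include <stdio.h>

static double hbm_gb_per_s(double clock_mhz, int bus_bits) {
    return clock_mhz * 1e6 * 2.0 * bus_bits / 8.0 / 1e9;
}

int main(void) {
    printf("500 MHz -> %.2f GB/s\n", hbm_gb_per_s(500.0, 4096)); /* 512.00 */
    printf("440 MHz -> %.2f GB/s\n", hbm_gb_per_s(440.0, 4096)); /* 450.56 */
    return 0;
}
```

So 440MHz gives 450.56 GB/s, which rounds to the 450GB/s printed on the box; the numbers at least line up.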
 
Yes, fixed-length encoding, probably pixel differences in an 8x8 block.
Thanks for your reply! :)

Inefficient for what exactly - perceived quality, compression ratio, or decoding complexity?
Pardon my imprecise language - I meant memory bandwidth-wise, as writing a few bytes spread out here and there is quite wasteful.
 