AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Seems to be a war of incompetence on the driver front recently (TechReport's review of the Fury X has it hiccuping all over the place). Searching for Nvidia's gimping just brought up this recent thread:

https://forums.geforce.com/default/...ad-released-2015-06-22-/post/4580806/#4580806

I saw another thread on their forums where it was rather clear-cut; too bad only AMD fanboys seem concerned enough about the whole issue to keep those links handy.

Enough off-topic though.
 
Seems to be a war of incompetence on the driver front recently (TechReport's review of the Fury X has it hiccuping all over the place). Searching for Nvidia's gimping just brought up this recent thread:

Well, I'd just go by what the NV Rep posted this morning:
Correct. Thanks to everyone for their help. We will be releasing a hotfix driver soon which we hope addresses the desktop TDR issues that users reported with drivers newer than 350.12.
You are correct, enough off-topic!
 
[... 4/3/2/etc. bits per delta...]
Sounds reasonable, but this would only work with the back buffer using an integer 8-bit per component RGB/BGR format, and not with HDR formats R10G10B10A2 and 16-bit floating point R16G16B16A16.
Same for multiple render target formats, including one- and two-component 8-bit integer formats.

It's not that it doesn't work, it would just change the ratios. If the deltas in an RGBA16 block are still at most 2 bits, the same "2-bit delta block" can be used; it would simply be a 1:8 ratio instead of 1:4. It is less likely, though, that there's only a 2-bit difference. When decompressing, the 2-bit values are up-cast to chars, shorts or ints; even halfs and floats are quite valid. Lossy BC6 treats halfs as if they were unsigned shorts, and it works fine, which seems counter-intuitive but comes out quite clearly when you calculate it through. Casting small datatypes to larger ones is highly effective: a little more logic, and no speed loss. As long as the deltas (for any number of channels, bits per channel, and datatype) are small enough to be stuffed into the block type, you can compress, and hopefully gain speed.
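To make the idea concrete, here's a rough sketch of my own (not any actual hardware format): a "2-bit delta block" whose layout is independent of the component width, so decompression just widens the small deltas to the destination type.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical "2-bit delta block": one anchor value stored verbatim plus a
// 2-bit (non-negative) delta for each remaining value. The same layout works
// for 8-bit, 16-bit, ... components; only the compression ratio changes
// (roughly 1:4 for 8-bit components, 1:8 for 16-bit).
template <typename T, std::size_t N>
struct DeltaBlock2 {
    T            anchor;                         // first value of the block
    std::uint8_t deltas[((N - 1) * 2 + 7) / 8];  // 2 bits per remaining value, packed
};

// Decompression: read each 2-bit delta, widen it to T, add it to the anchor.
template <typename T, std::size_t N>
void decompress(const DeltaBlock2<T, N>& b, T out[N]) {
    out[0] = b.anchor;
    for (std::size_t i = 1; i < N; ++i) {
        std::size_t bit = (i - 1) * 2;
        T d = static_cast<T>((b.deltas[bit / 8] >> (bit % 8)) & 0x3u);  // widening cast
        out[i] = static_cast<T>(b.anchor + d);
    }
}
```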

Because of this impressive versatility, I'm very sad that we don't get this as a programmable component. Imagine you want to store a compressed delta between frames N-1 and N. That could speed up any rendering that keeps previous frames.
 
AMD has, however, given a clarifying statement (via VRWorld) claiming the R9 390 series (X and non-X) are not rebadges, for the reasons given below.

AMD is pleased to bring you the new R9 390 series which has been in development for a little over a year now. To clarify, the new R9 390 comes standard with 8GB of GDDR5 memory and outpaces the 290X. Some of the areas AMD focused on are as follows:

1) Manufacturing process optimizations allowing AMD to increase the engine clock by 50MHz on both 390 and 390X while maintaining the same power envelope.

2) New high density memory devices allow the memory interface to be re-tuned for faster performance and more bandwidth
· Memory clock increased from 1250MHz to 1500MHz on both 390 and 390X
· Memory bandwidth increased from 320GB/s to 384GB/s
· 8GB frame buffer is standard on ALL cards, not just the OC versions

3) Complete re-write of the GPUs power management micro-architecture
· Under “worse case” power virus applications, the 390 and 390X have a similar power envelope to 290X
· Under “typical” gaming loads, power is expected to be lower than 290X while performance is increased.


http://wccftech.com/amd-radeon-r9-390-390x-not-rebadges-power-optimization/
 
It's not a new stepping or revision. It's more experience (my guess: more statistical data usable for binning) and a more "well-oiled machine" at TSMC.
 
If there's not even a stepping change, then at least in terms of silicon this is even less than a rebadge. A rebadge requires making a new box.
Of the two remaining reasons, the potential rewrite of the firmware seems to be more of a change than the memory upgrade. Capacity-wise, the latter does not stand out from the normal variation in configurations between partner boards.
The firmware change could be something that couldn't be updated via client software, so that could require a visibly different offering, although the last time they did it AMD called it the 7970 GHz Edition, and that was accompanied by a larger clock increase.

In less lean years, a new SKU would just about cover it.
 
Essentially, a bunch of BIOS tweaks and binned chips with better perf/watt metrics.
Not quite that simple. The microcontrollers on these things are pretty sophisticated in what you can do with them, and the microcode they execute is intrinsically tied to the binning mechanism in place for a particular product. There are specific changes between the R9 290's and the R9 390's binning test programs and operational microcode, such that, despite the R9 390X running a superset of the 290X's clocks, it is not guaranteed that taking an R9 390X and using a 290X BIOS and microcode would actually operate at those speeds; the intersection is likely large, but not necessarily 100%.
 
The firmware change could be something that couldn't be updated via client software, so that could require a visibly different offering, although the last time they did it AMD called it the 7970 GHz Edition, and that was accompanied by a larger clock increase
Of note, Tahiti isn't able to achieve the type of changes that can be achieved on Bonaire and beyond.
 
Sounds reasonable, but this would only work with the back buffer using an integer 8-bit per component RGB/BGR format, and not with HDR formats R10G10B10A2 and 16-bit floating point R16G16B16A16. Same for multiple render target formats, including one- and two-component 8-bit integer formats.
I don't see any problems with float formats. Delta compression with floats works exactly the same way it works with integers. You have a certain (hard-coded) rule for the estimate (guess) and store the distance from it (the rule can be different for different formats, as the ROP knows the active format for each RT). Floats are binary numbers, so you can treat them exactly like integers. If you flip the sign bit (and ignore NaN and INF), the IEEE float bit representation produces a monotonically increasing numeric representation (numbers close to each other are also close to each other in the bit representation). You store the distance by binary value, not by float value. Same in decompression: the decompressor doesn't need to know that the value is a float, it just decompresses it with the same rules.
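For illustration, here's a sketch of the standard monotonic bit trick (a slightly fuller variant of the sign-bit flip described above, and my own illustration rather than what the ROPs necessarily do): mapping IEEE-754 floats to unsigned integers so that nearby floats produce small integer deltas.

```cpp
#include <cstdint>
#include <cstring>

// Map a float to a uint32_t whose unsigned ordering matches the float
// ordering (NaN/INF ignored): positive values get the sign bit set,
// negative values get all bits flipped. Deltas between nearby floats then
// become small integer deltas that a delta compressor can store.
std::uint32_t float_to_ordered(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);                    // bit-exact reinterpretation
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

float ordered_to_float(std::uint32_t u) {
    u = (u & 0x80000000u) ? (u & 0x7fffffffu) : ~u;   // undo the mapping
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```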
Who needs a 16K framebuffer? More realistically, with 2 bits per 8x8-pixel tile, 1080p requires 8100 bytes, 2560x1600 requires 16000 bytes, and 4K (2160p) requires 32400 bytes.

I agree that any of the above is too large to have a dedicated cache though.
Older GPUs had dedicated caches for Hi-Z, fast clear and other acceleration structures. If you used a render target that was too large, the GPU disabled the optimizations for the lower part of it. GCN, on the other hand, is fully memory based. The delta color block data is most likely cached by the L2 in a similar way to all the other data (including HTILE and the other acceleration structures).

2 bits (per 8x8 tile) at 16k * 16k is only 1 MB. If you compare this to the actual data size of a 16k * 16k 32 bpp render target (1 GB), you notice that the size of the acceleration structure is meaningless (it is 1024x smaller).
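A quick back-of-the-envelope check of those numbers (my own sketch, assuming 2 bits of control data per 8x8 tile and a 32 bpp payload):

```cpp
#include <cstdio>

// Control-data size vs. 32 bpp colour payload, assuming 2 bits per 8x8 tile.
void controlDataSize(unsigned width, unsigned height) {
    unsigned long long tiles   = (unsigned long long)((width + 7) / 8) * ((height + 7) / 8);
    unsigned long long control = tiles * 2 / 8;          // bytes of control data
    unsigned long long payload = (unsigned long long)width * height * 4;
    std::printf("%ux%u: %llu B control, %llu B payload (%.0fx smaller)\n",
                width, height, control, payload, (double)payload / control);
}

int main() {
    controlDataSize(1920, 1080);    // 8100 B
    controlDataSize(2560, 1600);    // 16000 B
    controlDataSize(3840, 2160);    // 32400 B
    controlDataSize(16384, 16384);  // 1 MiB control vs. 1 GiB payload -> 1024x
}
```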
This would involve a thorough analysis of the shader compiler in the OpenGL driver.
Shader code just issues RT reads and writes. Shader execution (and the shader compiler) does not need to know about the RT/texture formats (or compression, tiling mode, etc). Resource descriptors hide these details and are passed directly from scalar registers to the samplers. Shaders should not require any microcode changes to support delta color compression (depth compression doesn't need shader code support either).
Not saying you really need it, but this is what the D3D11 specification requires to be supported. Maybe some app is doing something crazy (like 4x supersampling at 4K...) and you don't want to fall off a performance cliff once you've reached some fixed size, right where you'd need the savings the most (and yes, chips like RV350 had such cliffs). And in the case of MRTs you need that structure for each RT, which further increases the size.
Not a problem on GCN. 2 bits per 8x8 block would be 1024x less data than the payload. GCN is fully memory based. The L2 would cache the delta compression acceleration structure (like it does for all the other existing acceleration structures).
Well, it doesn't support that feature for GCN 1.2 yet, but depth buffer compression is not really different. And the driver knows pretty much everything about this, including doing the decompress "blits" to uncompressed when necessary (FWIW it actually does this in-place: you essentially draw a quad and set up some RBE bits correctly, so depth values get read and written even though the depth test always passes and the values don't change, with compression enabled for reads but disabled for writes, or something like that; it is quite possible the hw may even be able to skip tiles which are already uncompressed, though I'm not sure). I'm waiting for the driver to support this stuff on GCN 1.2 so I get a better understanding of how it's working (especially the access in the TMUs) ;-)
Yes, the driver needs to know when to queue a decompress command for a compressed resource (this is usually done before an RT needs to be sampled as a texture - the driver knows this by checking the texture bindings). In DirectX 12 you do this manually (a resource barrier transition from RT -> SRV). The shader reading a texture or writing to a render target doesn't need to know anything (resource descriptors hide all the format details).
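A minimal D3D12 sketch of that explicit transition (the resource and command-list names are placeholders of my own):

```cpp
#include <d3d12.h>

// Transition a render target so it can be sampled as a texture. After this
// barrier the driver knows the RT may be read, so it can schedule whatever
// decompression the hardware needs.
void TransitionRtToSrv(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* renderTarget)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
    barrier.Transition.pResource   = renderTarget;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);
}
```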
 
Older GPUs had dedicated caches for Hi-Z, fast clear and other acceleration structures. If you used a render target that was too large, the GPU disabled the optimizations for the lower part of it. GCN, on the other hand, is fully memory based. The delta color block data is most likely cached by the L2 in a similar way to all the other data (including HTILE and the other acceleration structures).
The encoded information for the compressed color data would change based on what is happening in the ROP caches or when data is being exported from them to the memory controller. At least historically, GCN has not involved the L2 in this path.
Wouldn't that pose a problem where only a round-trip to memory could resolve whether or not the ROP accesses can skip a round trip to memory?

The control data could be embedded within the compressed color stream, which would mean the control data maintains residency with the data it references.
 
The control data could be embedded within the compressed color stream, which would mean the control data maintains residency with the data it references.
The control data should be fully separated for best locality. With separate control data, a single 64-byte cache line contains info for a 128x128 pixel area (assuming 2 bits of control data per 8x8 block). This would mean that the control data uses a minimal number of cache lines and little bandwidth, and cache misses would be minimized. There would only be a single control data cache line load for 128x128 = 16384 rendered pixels.
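Worked out under the same assumption (2 bits of control data per 8x8 block), as a sketch of my own:

```cpp
// One 64-byte cache line of control data covers:
constexpr unsigned bitsPerLine   = 64 * 8;                 // 512 bits
constexpr unsigned tilesPerLine  = bitsPerLine / 2;        // 256 tiles of 8x8 px
constexpr unsigned pixelsPerLine = tilesPerLine * 8 * 8;   // 16384 pixels
static_assert(pixelsPerLine == 128 * 128, "one line covers a 128x128 pixel area");
```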
Wouldn't that pose a problem where only a round-trip to memory could resolve whether or not the ROP accesses can skip a round trip to memory?
What kind of problem? The control data is accessed first. It almost never misses the cache (see above). If the control data load misses the cache the GPU will just load that cache line, and you get doubled latency for this particular ROP block load, but lots of further loads will (with high likelihood) access the same control data cache line.
 
What kind of problem? The control data is accessed first. It almost never misses the cache (see above).
It comes down to my question of whether the ROP data path has any way of maintaining synchronization with the L2. Otherwise, there could be compression control data in the L2, or a mismatch somewhere in the vector and ROP memory pipelines, that is stale relative to the data it references.
L2 slices can service one requester per cycle, which might be why the ROPs have historically not touched it.

If the control data load misses the cache the GPU will just load that cache line, and you get doubled latency for this particular ROP block load, but lots of further loads will (with high likelihood) access the same control data cache line.
What is the policy for a miss when attempting to write changed compression status to the control line? For that matter, is the assumption that nothing else but the ROP may have cached a copy of the now-stale control line?

If more than one color cache can contend for the same control line, there can be problems with each one trying to read/write portions of the same L2 line, so it might be better to statically allocate 64-byte regions to specific color caches. That would save on having separate hardware units stalling each other or worrying about inconsistent versions of the same control line floating around.
 
NVidia uses L2 for render target output during pixel shading but AMD doesn't, as far as I know. The colour buffer cache talks to MCs directly.
 
If more than one color cache can contend for the same control line, there can be problems with each one trying to read/write portions of the same L2 line, so it might be better to statically allocate 64-byte regions to specific color caches.
I'm pretty sure the allocations are static.
FWIW, I was slightly wrong that the stencil/depth buffer works exactly the same as color. The reason is that the hw presumably embeds the control information directly in the HTILE data for depth/stencil (the HTILE buffer is 4 bytes per 8x8 tile). For color buffers, there's CMASK and FMASK data; FMASK is only used for MSAA, though CMASK is used for non-multisampled surfaces as well. It is actually 4 bits per 8x8 tile, though I don't think all of them are actually used in the non-MSAA case (this is all from pre-GCN 1.2 hw), but it's used for fast clearing at least (which would only require one bit). Oh, and btw, the alignment requirements of this control data are quite heavy and depend on the number of "channels" (for Hawaii, 128x64 tiles for HTILE data, 64x64 tiles for CMASK data). Not quite sure how fmask works...
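For a rough feel for those per-tile figures (ignoring the heavy Hawaii alignment/padding just mentioned), my own unpadded estimate at 1080p:

```cpp
// 8x8 tiles at 1920x1080, using the pre-GCN-1.2 figures above:
// HTILE at 4 bytes/tile, CMASK at 4 bits/tile (no alignment padding).
constexpr unsigned long long tiles      = (1920 / 8) * (1080 / 8);  // 32400 tiles
constexpr unsigned long long htileBytes = tiles * 4;                // 129600 B (~127 KiB)
constexpr unsigned long long cmaskBytes = tiles * 4 / 8;            // 16200 B  (~16 KiB)
static_assert(htileBytes == 129600 && cmaskBytes == 16200, "unpadded sizes");
```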
 
Seeing less-than-ideal scaling in Fiji memory-wise... did we explore the possibility that 8×512-bit wide memory controllers might not be sufficiently fine-grained for current workloads? Even if you apply 128 KiB of L2 to each of them?
 
What scaling?

In the review thread, I linked the Hardware.fr review showing Crysis 3 and The Witcher 3 scaling almost perfectly over the HD 7970 in the same test.

The fillrate test at TechReport shows 64 GP/s instead of 67, whereas the 290X test shows 66 instead of 67. Is that what you're referring to?

Also, the HBM configuration results in 32 channels of 128 bits each, since each HBM chip has 8 channels.
 