AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

After looking at the 390X reviews, it seems the Fury X will have to be truly outstanding to compensate for the 390X's performance, perf/W and perf/$.
 
I am curious where the compression hardware sits in the pipeline. The least disruptive option would be to have it on the path between the ROPs and the memory controllers (possibly in the controllers themselves?), although what that does to a possible memory crossbar is unclear.
The compression and decompression process is mostly intended to save bus accesses, although that may not help much for individual tile exports or imports from the ROP caches, since those have to be processed anyway and an extra DRAM burst or two costs a handful of cycles at most.
The downside to that is that the ROP caches would have uncompressed data, so their hit rates would not improve. HBM's latency could be better than before, but the dominant factor is the DRAM arrays, which have not changed much.
My understanding is that the compression/decompression hardware is indeed between the MCs and ROPs. The blocks are too big to be decompressed on the fly for individual accesses (unlike S3TC textures), not to mention it can't really work that way for writes.
That said, up to and including GCN 1.1, compressed framebuffers (for MSAA color or depth) could not be handled outside the ROPs at all, so a decompress blit was necessary when they were accessed as a texture. Obviously this is quite suboptimal and, as far as I can tell, a big disadvantage against Nvidia GPUs (since Fermi).
GCN 1.2 reportedly can access those in the TMUs without a decompress blit, though I'm not sure how exactly (for Nvidia, it seems they'd just use the unified L2 cache to make it happen).

Nvidia uses ROP (raster operations), AMD uses RBE (render backend), Intel uses CC (color calculator, though their docs also use the term "output merger", which is the language used by D3D10).
 
As for the 390X, at least one reviewer who is not AMD themselves (so much for transparency) shows it performing near the 980 Ti in a few games.

GTAV - 31.3 to 29

Evolve - 39.6 to 37.7

FC4 - 37.9 to 36.4

And over 980 for most.

http://nl.hardware.info/reviews/613...et-bestaande-chips-benchmarks-alien-isolation.

That's quite an optimistic interpretation of those results. In actuality the 980 beats the 390X in every one of the 1080p/max-settings tests aside from two: Evolve (which is extremely AMD friendly) and GTAV (which looks to be down to memory limitations, although still a good win for AMD). The two GPUs trade blows at 4K, with the 390X taking 7 and the 980 taking 4 with 2 draws, although often these framerates are too low to be playable anyway.

In comparison to the 980 Ti, the 390X is always well behind at any resolution, and arguably the 3 instances where it comes close are all unplayable anyway (although Evolve is probably okay).
 
http://worldwide.espacenet.com/publ...20813&DB=worldwide.espacenet.com&locale=en_EP

Dual fragment-cache pixel processing circuit and method therefore

Multiple graphics primitives may be processed in quick succession where each of the multiple graphics primitives produces fragments that correspond to the same pixel location. As such, rather than forcing the render backend block to handle multiple fragments corresponding to the same pixel location, a cache structure can be used to buffer the received fragments prior to providing them to the render backend block. Including a cache structure in the data path for the pixel fragments enables multiple fragments that apply to the same pixel location to be combined prior to presentation to the render backend block. Offloading some of the blending operations from the render backend block can improve overall system performance.

This is an ancient patent (1999). The fragment cache is simply operating as a fragment selector (according to Z) or MSAA fragment selector (according to mask, with some blending).

(The patent is about using two caches in a ping-pong arrangement, to enable full speed processing across the boundary in time caused by a rendering state change. That isn't the reason I'm linking it.)

The render backend pulls from the fragment cache to perform read-modify-write (blend) against the render target (or just write). To do so the RBE will want to work on small blocks of the render target, which entails a colour buffer cache.

So RBE wants to pull from the fragment cache but only when CBC is ready. That's determined by whether the CBC has been populated from memory (if modifying) or whether the CBC lines are available (write only). The whole process needs to be pipelined, though I'm unclear on how RBE prioritises fragment cache lines to pull from. (Perhaps it's driven by the rasteriser's interaction with hierarchical-Z cache. As the rasteriser touches the hierarchical-Z cache it can feed forward to the RBEs as a predictor of which render target coordinates are "live" and which are "dead".)

What we don't know is whether CBC is held in delta-colour-compressed format or native.

Looking at the PNG lossless compression techniques:

http://optipng.sourceforge.net/pngtech/optipng.html
http://www.w3.org/TR/PNG-Filters.html

specifically delta compression, indicates that decompression would be pretty simple, enabling CBC to be held in compressed format with low-latency, low-area cost to read.

I'm assuming that delta colour compression operates on large blocks of pixels so that they align with the memory channel's native burst length. Some care is needed to ensure that compression is still possible with 16-bit per channel and 32-bit per channel pixels.

Translating this to PNG style delta-colour filters, a scanline might be 8 or 16 pixels long with a fixed count of 4 or 8 scanlines.

The problem then becomes the time spent compressing the final pixels produced by the ROPs. If suitably pipelined (using a fork-and-join to evaluate the scanline filter choices and pick the best-compressed scanline), this will run at the native ROP rate.
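As a rough illustration of that fork-and-join idea (the real DCC hardware is undocumented, so this is purely a software sketch; the filter set and the sum-of-signed-magnitudes cost heuristic are borrowed from PNG/optipng):

```python
# Hypothetical sketch: evaluate a few PNG-style filters per scanline in
# parallel and keep whichever produces the smallest deltas. Treat this
# as illustration only, not a description of AMD's actual hardware.

def filter_none(cur, prev):
    return list(cur)

def filter_sub(cur, prev):
    # delta against the byte to the left (first byte passes through)
    return [cur[0]] + [(cur[i] - cur[i - 1]) & 0xFF for i in range(1, len(cur))]

def filter_up(cur, prev):
    # delta against the same position in the previous scanline
    return [(c - p) & 0xFF for c, p in zip(cur, prev)]

def pick_best_filter(cur, prev):
    candidates = {
        'none': filter_none(cur, prev),
        'sub': filter_sub(cur, prev),
        'up': filter_up(cur, prev),
    }
    # signed magnitude of each delta byte, as in the optipng heuristic
    def cost(bs):
        return sum(b if b < 128 else 256 - b for b in bs)
    name = min(candidates, key=lambda k: cost(candidates[k]))
    return name, candidates[name]

prev = [10, 11, 12, 13, 14, 15, 16, 17]   # smooth horizontal gradient
cur = [11, 12, 13, 14, 15, 16, 17, 18]
name, deltas = pick_best_filter(cur, prev)
```

In hardware each filter would be evaluated concurrently on the same scanline, with a comparator tree picking the winner, so the selection adds latency but not throughput cost.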

The next problem to solve is dealing with blocks that flip back and forth between compressed and uncompressed over the time spent rendering the frame.

Ultimately, you want to deliver coherent blocks (compressed or uncompressed) to the MCs, so that they match precisely with the memory system's granularity.
 
That's quite an optimistic interpretation of those results. In actuality the 980 beats the 390X in every one of the 1080p/max-settings tests aside from two: Evolve (which is extremely AMD friendly) and GTAV (which looks to be down to memory limitations, although still a good win for AMD). The two GPUs trade blows at 4K, with the 390X taking 7 and the 980 taking 4 with 2 draws, although often these framerates are too low to be playable anyway.

In comparison to the 980 Ti, the 390X is always well behind at any resolution, and arguably the 3 instances where it comes close are all unplayable anyway (although Evolve is probably okay).

Lowered expectations and the rebrand-rebrand chanting do that. Too bad they don't have more 1440p results.

Being AMD friendly is not a problem; the problem is that there are fewer such games,

http://www.techspot.com/articles-info/977/bench/Hitman.png

or they don't launch as AMD friendly. TR's 980 Ti review had the 295X2 smoking the GM200 cards by 70% while being a mess in the Titan X review, and now the 390X is nibbling at the heels of the 980 Ti, at above 30fps, in 4K ultra.

And GTAV doesn't seem to be memory limited there, or you'd see the 780 Ti trailing far below.
 
Wccftech has used "the ancient art of math" to calculate the possible FP32 TFLOPS compute performance of R9 Nano [not gaming performance].
http://wccftech.com/fast-amd-radeon-r9-nano-find/

They claim 7.84 TFLOPs for the Nano, if they use AMD's own numbers for the R9 290X and the performance+power ratios.
60 CUs at 1020MHz would achieve 7.834 TFLOPs.
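The arithmetic behind that number (GCN executes 64 ALUs per CU at 2 FLOPs per clock via fused multiply-add; the CU count and clock here are the rumoured figures, not confirmed):

```python
# Back-of-envelope FP32 throughput check for the rumoured Nano config.
cus = 60                       # rumoured enabled CU count
alus_per_cu = 64               # GCN: 64 ALUs (stream processors) per CU
flops_per_alu_per_clock = 2    # one FMA = 2 FLOPs
clock_hz = 1020e6              # rumoured clock

tflops = cus * alus_per_cu * flops_per_alu_per_clock * clock_hz / 1e12
# tflops comes out at about 7.834
```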

Having the Nano at 3840 ALUs / 60 CUs and 1020MHz seems feasible. If the TMUs are cut accordingly (1/16th disabled), then there'd be 240 of them.
That said, Nano could be a part with 3840 ALU : 240 TMU : 64 ROP at 1020MHz.

We don't know if Fiji boosts and this could be a boost number, so maybe the chip stays at e.g. 900MHz but can boost up to 1020MHz in short periods of time.
To be honest, this was the kind of performance I was expecting for the aircooled Fury non-X. I thought the Nano would be much closer to the 290X in absolute performance, but from those performance/power ratios and the 175W TDP it's obvious it would become much more powerful.
 
Translating this to PNG style delta-colour filters, a scanline might be 8 or 16 pixels long with a fixed count of 4 or 8 scanlines.

8x8 it is.

Because decompression is a serial chain of delta values (each relative to the previous one, which isn't available until that one has been decompressed), it's desirable to keep data uncompressed for as long as possible; it'd be really bad to have to decode a lot. Compression is somewhat parallelizable, decompression isn't, but you could get to O(log2 n) if you store deltas in a binary hierarchy (same mechanism as bitonic sort). You can also go recursive/hierarchical quincunx, although then compression speed suffers instead of decompression speed.
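A toy sketch of that binary-hierarchy idea (the Fenwick-style anchor choice is mine, purely for illustration): each value stores its delta against an anchor at most log2(n) hops from the base, so hardware could resolve all elements in O(log n) dependent steps instead of n.

```python
# Illustration only: delta coding with a logarithmic dependency chain.
# parent(i) = i with its lowest set bit cleared, so the chain from any
# index back to 0 has popcount(i) <= log2(n) hops.

def encode_tree(values):
    deltas = [values[0]]
    for i in range(1, len(values)):
        parent = i - (i & -i)            # clear lowest set bit
        deltas.append(values[i] - values[parent])
    return deltas

def decode_tree(deltas):
    values = [deltas[0]] + [0] * (len(deltas) - 1)
    for i in range(1, len(deltas)):
        parent = i - (i & -i)            # parent < i, already decoded
        values[i] = values[parent] + deltas[i]
    return values

vals = [5, 7, 6, 9, 8, 10, 12, 11]
roundtrip = decode_tree(encode_tree(vals))
```

The decode loop is written serially here, but since every element depends only on its anchor, all elements at the same tree depth could be resolved in parallel. The trade-off is that deltas against distant anchors tend to be larger, so compression ratio suffers versus a plain serial chain.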
 
8x8 block size is also commonly used by depth compression. With 32-bit color data this results in 256-byte blocks (4x larger than 64-byte cache lines). With 64-bit color data an 8x8 block is 512 bytes. 256 bytes is not much, but with hardcoded assumptions (noisy images do not look good), we can definitely find bits to remove.
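The block-size arithmetic, spelled out:

```python
# 8x8 tile sizes at the two common pixel widths mentioned above.
pixels_per_block = 8 * 8
bytes_32bit = pixels_per_block * 4   # 32-bit color: 256 bytes
bytes_64bit = pixels_per_block * 8   # 64-bit color: 512 bytes
cache_lines_32bit = bytes_32bit // 64
```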

As Ethatron said, delta calculation during compression is a trivially parallelizable task, but decompression is fully serial. We have used delta encoding along with lossless compression (LZMA mostly, but arithmetic encoding also works) when storing assets to HDD, and it gives big gains for smooth data (such as terrain heightmaps). The 2D case is actually better than 1D, since you can construct a triangle from two of the previous row's values and the previous value in the same row. Then you assume that the current pixel (completing the quad) lies on the same plane, and store the delta from that estimate. This delta is close to zero if the planar assumption holds (smooth color gradients, smooth heightmaps, smooth normals).
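A quick sketch of that planar predictor (my own illustrative code, not sebbbi's actual tool): predict the pixel completing each 2x2 quad from the plane through its left, up, and up-left neighbours, and store only the delta.

```python
# Planar prediction: pred = left + up - upleft (the plane through the
# three known corners of the quad). Deltas are zero wherever the data
# really is planar. Out-of-image neighbours are treated as 0 here.

def planar_deltas(img):
    h, w = len(img), len(img[0])
    deltas = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = img[y][x - 1] if x > 0 else 0
            up = img[y - 1][x] if y > 0 else 0
            upleft = img[y - 1][x - 1] if (x > 0 and y > 0) else 0
            pred = left + up - upleft
            deltas[y][x] = img[y][x] - pred
    return deltas

# perfectly planar data (value = 2*x + 3*y): all interior deltas are zero
img = [[2 * x + 3 * y for x in range(4)] for y in range(4)]
d = planar_deltas(img)
```

On smooth gradients the interior deltas collapse to near zero, which is exactly what makes the subsequent entropy coding stage (LZMA, arithmetic coding) so effective.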
 
Yesterday Gibbo from Overclockers.co.uk posted this:

Hi there

What I will say is the Fury is one of the best looking cards ever, in my PC at home it looks pretty awesome with the red Radeon logo and the rev counter LEDs which can be set to be red or blue. Pretty cool!

Also happy to report absolutely zero coil whine and pretty much zero noise; never seen it go above 40°C under load.


Performance I can't hint at, simply as that would land me in trouble, but it's faster than my 290X which was clocked at 1100/6000.

If it really has no coil whine at all then I'm mega happy! I don't remember the last time I had a GPU without coil whine, and it's still annoying after all these years of listening to it!

Another 3x4K showcase from AMD Fury X:

http://wccftech.com/amd-r9-fury-x-p...i-at-12k-resolution-and-60-fps/#ixzz3debRMFB2

Somehow they again come up with "12K" when the resolution isn't even 8K, but the result is impressive nonetheless!
 
I can't say my R9 290 has any noticeable coil whine.

This thing on the other hand, was absolutely atrocious.
[attached image: 14-122-175-07.JPG]
 
Taking PNG literally, and referring to section 9.4 Paeth, which has the most complex algorithm (the same algorithm is used to compress and decompress):

http://www.w3.org/TR/PNG/#9Filter-type-4-Paeth

Code:
    p = a + b - c
    pa = abs(p - a)
    pb = abs(p - b)
    pc = abs(p - c)
    if pa <= pb and pa <= pc then Pr = a
    else if pb <= pc then Pr = b
    else Pr = c
    return Pr

That looks like it will pipeline very well down the diagonal of a block.
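In Python, that pseudocode plus the corresponding unfilter step for one scanline looks like this (the predictor is a direct transcription of the spec; the `unfilter_paeth` helper is my own illustration, with out-of-line neighbours treated as 0):

```python
# Paeth predictor from the PNG spec: a = left, b = above, c = upper-left.

def paeth(a, b, c):
    p = a + b - c                      # initial estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:          # return neighbour closest to p,
        return a                       # breaking ties a, then b, then c
    elif pb <= pc:
        return b
    return c

def unfilter_paeth(filtered, prev):
    # reconstruct one scanline: raw byte = filtered + paeth(left, up, upleft)
    out = []
    for i, f in enumerate(filtered):
        a = out[i - 1] if i > 0 else 0
        b = prev[i]
        c = prev[i - 1] if i > 0 else 0
        out.append((f + paeth(a, b, c)) & 0xFF)
    return out
```

Note the serial dependency: each reconstructed byte needs its left neighbour, which is what makes the diagonal-wavefront pipelining discussed here the natural hardware mapping.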

This is a 2 cycle implementation of a single predictor:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.391&rep=rep1&type=pdf

For a block of 8x8 bytes, the latencies for an 8-way implementation of this hardware are:

Code:
0    2    4    6    8    10    12    14
2    4    6    8    10   12    14    16
4    6    8    10   12   14    16    18
6    8    10   12   14   16    18    20
8    10   12   14   16   18    20    22
10   12   14   16   18   20    22    24
12   14   16   18   20   22    24    26
14   16   18   20   22   24    26    28
So, that's 28 cycles to decode 64 bytes.

Or 60 cycles to process 192 bytes = 8 scanlines of 24 bytes (8 pixels x 3 bytes). With 8-way Paeth per colour channel that would be three times faster.
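A quick check of those figures: with a 2-cycle predictor stage, element (row, col) of the wavefront completes at 2*(row + col) cycles, so the total latency is set by the bottom-right corner.

```python
# Wavefront latency for a diagonal pipeline: the last element sits at
# (rows-1, cols-1), each hop costing stage_cycles.
def wavefront_latency(rows, cols, stage_cycles=2):
    return stage_cycles * ((rows - 1) + (cols - 1))

eight_by_eight = wavefront_latency(8, 8)     # the 8x8 byte block above
scanlines_24b = wavefront_latency(8, 24)     # 8 scanlines of 24 bytes
```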

Obviously the utilisation of this 8-way unit is poor if scanlines are only 8 bytes long... But the damn thing is microscopically cheap.

Also, if you look carefully, you can see how the latencies are compatible with a Z-order curve:

https://en.wikipedia.org/wiki/Z-order_curve

The other cost is Huffman or similar decoding to derive the byte-block of predictors from the CBC block.
 
They claim 7.84 TFLOPs for the Nano, if they use AMD's own numbers for the R9 290X and the performance+power ratios.
60 CUs at 1020MHz would achieve 7.834 TFLOPs.

Having the Nano at 3840 ALUs / 60 CUs and 1020MHz seems feasible. If the TMUs are cut accordingly (1/16th disabled), then there'd be 240 of them.
That said, Nano could be a part with 3840 ALU : 240 TMU : 64 ROP at 1020MHz.

We don't know if Fiji boosts and this could be a boost number, so maybe the chip stays at e.g. 900MHz but can boost up to 1020MHz in short periods of time.
To be honest, this was the kind of performance I was expecting for the aircooled Fury non-X. I thought the Nano would be much closer to the 290X in absolute performance, but from those performance/power ratios and the 175W TDP it's obvious it would become much more powerful.

I really don't think Nano would work as a cut-down part. Just look at R9 290 power draw compared to the 290X. Nano will be an undervolted, downclocked, fully enabled Fiji, much like laptop chips being binned for their lower power draw. As such Nano will likely cost the same as Fury X while performing worse. It will be akin to the low-TDP CPUs currently on the market: you will be paying for the better performance/watt and the smaller form factor; it will not be the best perf/$ part.
 
... But the damn thing is microscopically cheap.

BC is 28 times faster.
If you get a min/max range from the 8x8 tile and delta against the gradient, you'd be as fast as BC. You can also split the tile into 2 sets (à la ETC). I would use variable-length Golomb codes for the delta values and just store the distribution parameter, but I don't think they used any variable coding scheme; fixed 1:1, 1:2, 1:4 and 1:8 ratios are more likely, and much easier.
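A toy sketch of that min/max idea (illustrative only; real BC/ETC encoders are considerably more involved): store the tile's two endpoints and quantise every value to a few bits within that range, which makes decode a single multiply-add per value with no serial chain.

```python
# Endpoint quantisation over one tile: store (lo, hi) plus an n-bit
# index per value. Decode is embarrassingly parallel, unlike delta chains.

def encode_tile(tile, bits=3):
    lo, hi = min(tile), max(tile)
    span = max(hi - lo, 1)             # avoid divide-by-zero on flat tiles
    levels = (1 << bits) - 1
    idx = [round((v - lo) * levels / span) for v in tile]
    return lo, hi, idx

def decode_tile(lo, hi, idx, bits=3):
    span = max(hi - lo, 1)
    levels = (1 << bits) - 1
    return [lo + i * span / levels for i in idx]

lo, hi, idx = encode_tile(list(range(8)))   # a smooth ramp fits exactly
decoded = decode_tile(lo, hi, idx)
```

Decode speed is the appeal: every value is reconstructed independently from the endpoints, so an 8x8 tile decodes in one step rather than a 28-cycle wavefront.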
 