AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

To make it a bit clearer, since this has been discussed a few posts further up: Tessmark is really not the best way to test tessellation. It uses OpenGL 4 (which at the time was clearly shaped by the green team), so it doesn't necessarily reflect real tessellation performance under DX. That said, we all know the difference in tessellation performance between Kepler, Maxwell and their AMD counterparts.
Why do you think OpenGL 4 tessellation is different from DX? It was specified to match the capabilities of the hardware which existed because of DX. If there's a difference in performance between OpenGL tessellation and DX it's because of the drivers.

I agree that Tessmark shouldn't be the only data point when judging tessellation performance. Of course having a single data point is no good for judging the performance of any feature.
 
Why do you think OpenGL 4 tessellation is different from DX? It was specified to match the capabilities of the hardware which existed because of DX. If there's a difference in performance between OpenGL tessellation and DX it's because of the drivers.

I agree that Tessmark shouldn't be the only data point when judging tessellation performance. Of course having a single data point is no good for judging the performance of any feature.

What? I'm not sure you know OpenGL that well, especially OpenGL 4.0. But that's another story, so maybe we won't go that far.
 
I also never understood why people expect all the bandwidth-saving technologies in Fiji. For example, if the compression had anything to do with the ballooning size of Tonga, then you'd better drop it.
Because saving BW on external pins saves tons of power.
 
I still expect that full Tonga has a 384-bit memory bus.
If there exists a dieshot out there you should be able to visually identify all of the memory controllers on it.

Anyway, the way I see it, if Tonga really had a 384-bit memory bus, you would have seen that configuration in the Retina iMacs. A 5 Mpixel* screen resolution needs all the memory bandwidth it can get, and there's no genuine reason Apple would hold back either. Shit, why would AMD themselves hold back the chip? It doesn't make sense. It's not as if a memory controller can be used for redundancy anyway, considering that which controller is disabled affects board routing considerably.

*5? Isn't it like...13M? "5K" is a silly designator, confusing old geezers like me. :p
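Quick sanity check on the pixel count, assuming the 5120x2880 Retina iMac panel; the snippet below is just that arithmetic, nothing more:

```c
/* Pixel count behind "5K", assuming the 5120x2880 iMac panel. */
#include <stdio.h>

int main(void) {
    long w = 5120, h = 2880;
    printf("%ld x %ld = %ld pixels (~%.1f Mpixels)\n", w, h, w * h, (double)(w * h) / 1e6);
    /* ~14.7 Mpixels -- the "5K" refers to the horizontal pixel count, not megapixels. */
    return 0;
}
```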
 
Modern GPUs with scalar architectures (not to be confused with the scalar unit) need to run lots of threads (as each scalar thread does less work compared to a VLIW thread). More simultaneous threads means of course more register pressure.

Isn't the number of threads determined by pipeline/memory latency hiding requirements?
 
Isn't the number of threads determined by pipeline/memory latency hiding requirements?
The maximum number is determined by the size of your register file. The minimum required to cover internal latency is determined by pipeline depth. For CUDA at least, you try to get as many threads as possible except in some special cases.
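Just to put rough numbers on the register-file cap, here's a minimal sketch. The figures are assumptions in the spirit of GCN (256 VGPRs per SIMD lane, a 10-wave hardware limit); substitute whatever your target architecture actually provides:

```c
/* Resident waves per SIMD as limited by the register file (assumed figures). */
#include <stdio.h>

int main(void) {
    const int vgpr_budget   = 256;  /* assumed VGPRs available per SIMD lane */
    const int hw_wave_limit = 10;   /* assumed hardware cap on resident waves */

    for (int vgprs_per_wave = 24; vgprs_per_wave <= 128; vgprs_per_wave *= 2) {
        int waves = vgpr_budget / vgprs_per_wave;          /* limited by the register file */
        if (waves > hw_wave_limit) waves = hw_wave_limit;  /* clamped by the hardware cap */
        printf("%3d VGPRs per wave -> %2d resident waves\n", vgprs_per_wave, waves);
    }
    return 0;
}
```

The fewer registers each wave needs, the more waves fit, which is why register pressure and latency hiding end up being two sides of the same coin.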
 
The maximum number is determined by the size of your register file. The minimum required to cover internal latency is determined by pipeline depth. For CUDA at least, you try to get as many threads as possible except in some special cases.

Right. I'm trying to understand sebbi's assertion that VLIW architectures inherently have less register pressure.
 
What? I'm not sure you know OpenGL that well, especially OpenGL 4.0. But that's another story, so maybe we won't go that far.
I don't quite follow what you're saying but I know what I'm talking about in this case. OpenGL tessellation wasn't created by Nvidia and works the same as DX tessellation.

Right. I'm trying to understand sebbi's assertion that VLIW architectures inherently have less register pressure.
I don't know if this is what sebbbi means, but if you have more ILP each thread can cover more of the latency itself, so you need fewer threads to hide it. Whether that translates to less register pressure might depend on the shader.
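Back-of-the-envelope on that point; the numbers are purely illustrative assumptions, not tied to any particular GPU:

```c
/* Waves needed to cover a given latency, as a function of ILP (assumed numbers). */
#include <stdio.h>

int main(void) {
    const double latency_cycles = 80.0;  /* assumed latency to cover */
    const double issue_cycles   = 4.0;   /* assumed cycles one wave occupies the SIMD per issue */

    for (int independent_ops = 1; independent_ops <= 4; independent_ops *= 2) {
        /* Each resident wave contributes independent_ops * issue_cycles of useful work
         * before it stalls, so more ILP means fewer waves for the same latency. */
        double waves_needed = latency_cycles / (issue_cycles * independent_ops);
        printf("ILP %d -> ~%.0f waves needed\n", independent_ops, waves_needed);
    }
    return 0;
}
```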
 
Right. I'm trying to understand sebbi's assertion that VLIW architectures inherently have less register pressure.
I think he means that GCN has a scalar, non-VLIW, unit bolted on next to the VLIW unit. This can be used for loop counters etc (I think.) So when you have a loop counter, you only need to allocate 1 register in the register file instead of the width of the VLIW unit.

Edit: after rereading his comment, he was obviously not talking about that scalar unit. So now I don't know either. ;)
 
Does anyone know how this color compression scheme works, anyhow? It has to be lossless, and you surely don't want it to be too cumbersome to extract the value of an individual pixel either, so you don't end up having to constantly fetch more data than you would without compression... how would you go about implementing something like this?

I hear it uses delta compression, which from what I understand records changes from a baseline level, but is there also something like RLE on top of that? Because, most of the time, how much can you really save just with delta compression? In a Windows desktop environment it's probably quite a lot, but in 3D games there are usually contrasts and gradients all over the place, and since you can't throw any data away it's really no fecking good if you can only save a few bits' worth of information per pixel; you'll end up having to write the whole shebang to memory uncompressed anyhow...

So overall savings are higher than I suspect? Does the delta compression work separately on an individual color channel basis perhaps...? How would it work with floating point framebuffers? G-buffers? Does it work with a fixed block size, like DXTC for example, and if so, what size blocks? There's so much of this stuff I have no clue about... *shrug* :LOL:

An in-depth analysis of this stuff would be a fascinating read! :)
 
I like GCN compute units. The design is elegant. I definitely do not want OoO. I like that GCN is a memory-based architecture, with all the resources stored in memory and cached by general-purpose cache hardware.
The cache hierarchy could stand an improvement.
The 2011 implementation that GCN still carries is a step above the incoherent read-only pipeline that came before it, but its behavior is still too primitive to mesh with the CPU (Onion+ skips it), and its method of operation moves a lot of bits over significant distances.
The increase in channel count, and the number of requestors for the L2 means more data moving over longer distances, which costs power. The way GCN enforces coherence by forcing misses to the L2 or memory costs power as well.
Changes like more tightly linking parts of the GPU to the more compact HBM channels and going writeback between the last-level cache and the CUs could reduce this, but not without redesigning the cache.

While increasing storage locality, why not copy Nvidia and provide a small set of registers/register cache for hot register accesses, rather than going over greater distances to the register file?
If not ripping off Nvidia, AMD could revive a form of clause temporary register and explicit slot forwarding it had in its VLIW GPUs to get the effect.

Surprisingly many instructions could be offloaded to the scalar unit if the compiler were better and the scalar unit supported the full instruction set. This would be a good way for AMD to improve performance, reduce register pressure and save power. But this strategy also needs a very good compiler to work.
There was a paper on promoting the scalar unit to support scalar variants of VALU instructions, which had some benefits.
Tonga's GCN variant did promote the scalar memory pipeline to support writes, at least.
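To make the register-pressure angle concrete, here's a minimal C model of the idea, assuming a 64-lane wavefront; it has nothing to do with AMD's actual compiler, it just shows why a wave-uniform value is cheap:

```c
/* A wave-uniform value (the loop bound) needs one "scalar" copy, not 64 per-lane
 * copies, and the loop branch is evaluated once per wave rather than once per lane. */
#include <stdio.h>

#define WAVE_LANES 64

int main(void) {
    float vgpr[WAVE_LANES];      /* per-lane data: costs a full vector register */
    const int sgpr_bound = 16;   /* wave-uniform loop bound: one scalar register */

    for (int lane = 0; lane < WAVE_LANES; ++lane)
        vgpr[lane] = (float)lane;

    for (int i = 0; i < sgpr_bound; ++i)          /* "scalar" side: one counter, one compare per wave */
        for (int lane = 0; lane < WAVE_LANES; ++lane)
            vgpr[lane] *= 0.5f;                   /* "vector" side: the actual per-lane work */

    printf("lane 1 after %d iterations: %g\n", sgpr_bound, vgpr[1]);
    return 0;
}
```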

The CUs are rather loose domains, as exposed in GCN. The number of manual wait states for operations that cross between the different types means software is made aware that there are semi-independent pipelines whose behavior requires hand-holding to provide correct behavior.
This has gotten worse with the introduction of flat addressing, which readily admits a race condition between LDS and the vector memory pipe, and in each of these cases the program is forced to set waitcnts of 0 at fracture points in the architecture. The scalar memory pipeline cannot even guarantee that it will return values in the order accesses were issued, which seems like a nice thing to nail down when the transistor budget doubles at 16nm.

Other questions are whether GCN can evolve to express dependency information to the scheduling hardware. It currently has 10 hardware threads per SIMD waking up and being evaluated per cycle. Perhaps some of that work could be skipped if the hardware knew that certain wavefronts could steam ahead without waking up all the arbitration hardware.
This is another thing it could borrow from Nvidia, or again the VLIW architectures GCN replaced.

The most important thing with 16 bit registers is the reduced register file (GPR) usage. If we also get double speed execution for 16 bit types, then even better (but that is less relevant than saving GPRs).
The most recent GCN variation also includes pulling 8-bit fields out of registers, and some use cases in signal analysis can use them.
16-bit fields are sufficient for machine learning.
That's three datum lengths for differing workloads, not including 64-bit, which a number of HPC targets like. The way these data paths are split across separate domains on a GPU is not conducive to their flexible use, but these chips do have regions of hardware that handle and manipulate data at differing fractional precisions.
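On the GPR-saving point: two 16-bit values can share one 32-bit register slot. A minimal sketch of just the packing arithmetic (the FP16 encoding itself and any hardware specifics are left out; the constants are standard IEEE half-precision bit patterns):

```c
/* Two 16-bit operands packed into one 32-bit register word. */
#include <stdint.h>
#include <stdio.h>

static uint32_t pack_half2(uint16_t lo, uint16_t hi) {
    return (uint32_t)lo | ((uint32_t)hi << 16);   /* two FP16 bit patterns per 32-bit GPR */
}

static uint16_t unpack_lo(uint32_t r) { return (uint16_t)(r & 0xFFFFu); }
static uint16_t unpack_hi(uint32_t r) { return (uint16_t)(r >> 16); }

int main(void) {
    uint16_t a = 0x3C00;                 /* 1.0 in IEEE half precision */
    uint16_t b = 0x4000;                 /* 2.0 in IEEE half precision */
    uint32_t gpr = pack_half2(a, b);     /* one 32-bit GPR now holds both operands */
    printf("gpr=0x%08X lo=0x%04X hi=0x%04X\n",
           (unsigned)gpr, (unsigned)unpack_lo(gpr), (unsigned)unpack_hi(gpr));
    return 0;
}
```

Halve the bits per value and, all else equal, you halve the GPR footprint of those values, which is the saving being pointed at above.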

A few possible future questions for APUs are whether a granularity finer than 64 work items, and a page granularity below 64K, could make interoperability work better.

It's a scalar instruction set that operates on a VLIW unit.
The instruction word is one operation long in GCN. That seems like a SIMD implementation.
 
Does anyone know how this color compression scheme works, anyhow? It has to be lossless, and you surely don't want it to be too cumbersome to extract the value of an individual pixel either, so you don't end up having to constantly fetch more data than you would without compression... how would you go about implementing something like this?

I hear it uses delta compression, which from what I understand records changes from a baseline level, but is there also something like RLE on top of that? Because, most of the time, how much can you really save just with delta compression? In a Windows desktop environment it's probably quite a lot, but in 3D games there are usually contrasts and gradients all over the place, and since you can't throw any data away it's really no fecking good if you can only save a few bits' worth of information per pixel; you'll end up having to write the whole shebang to memory uncompressed anyhow...
I think they take a pixel in a fixed rectangle as an anchor and then, as they fan out, subtract each pixel from the previous one. If you have a constant color or a gradient, that results in a lot of very similar, small differences. In the next step you can then compress those with something like RLE or Huffman encoding, similar to how JPEG compresses rectangles of reordered DCT coefficients.

I once implemented a similar compression scheme for height fields of elevation data. The delta step makes the compression ratio of the following step go up like crazy.

So overall savings are higher than I suspect? Does the delta compression work separately on an individual color channel basis perhaps...? How would it work with floating point framebuffers? G-buffers? Does it work with a fixed block size, like DXTC for example, and if so, what size blocks?
There's a good chance that they separate the color channels to increase the correlation of neighboring differences. With floating point numbers, it should still work: they could separate the exponents and the mantissas and find a lot of correlation there as well. G buffers: no clue. What format does that typically use?
In the Maxwell color compression slides, they show blocks of 8x8 pixels, so it's likely that they use a fixed block size.
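Something like this, maybe: a minimal sketch of anchor + delta + run-length coding on a single 8-bit channel of an 8x8 block. The real hardware scheme isn't public, so this is only meant to show why flat regions and gentle gradients compress so well:

```c
/* Anchor + delta + RLE on one 8x8 block of a single 8-bit channel (illustrative only). */
#include <stdint.h>
#include <stdio.h>

#define BLOCK 64   /* 8x8 pixels, one channel */

/* The first delta is the anchor pixel itself; every other entry is the byte-wise
 * difference from the previous pixel. The delta stream is then run-length coded
 * as (value, run) pairs. */
static size_t encode(const uint8_t *px, uint8_t *out) {
    uint8_t deltas[BLOCK];
    deltas[0] = px[0];                              /* anchor pixel stored as-is */
    for (int i = 1; i < BLOCK; ++i)
        deltas[i] = (uint8_t)(px[i] - px[i - 1]);   /* difference from previous pixel */

    size_t n = 0;
    for (int i = 0; i < BLOCK; ) {
        int run = 1;
        while (i + run < BLOCK && deltas[i + run] == deltas[i] && run < 255)
            ++run;
        out[n++] = deltas[i];                       /* (value, run-length) pair */
        out[n++] = (uint8_t)run;
        i += run;
    }
    return n;                                       /* compressed size in bytes */
}

int main(void) {
    uint8_t block[BLOCK], out[2 * BLOCK];
    for (int i = 0; i < BLOCK; ++i)
        block[i] = (uint8_t)(100 + i / 8);          /* a gentle vertical gradient */
    printf("64 bytes -> %zu bytes\n", encode(block, out));
    return 0;
}
```

On this gradient the 64-byte block halves; on noisy content the runs collapse and the encoder would presumably fall back to storing the tile uncompressed, which fits the point above about writing the whole thing out uncompressed when compression doesn't pay off.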

(BTW that whole Anandtech page is a very good read.)
 
Even with HBM? So far it doesn't seem like it would be a significant saving for Fiji cards.
When you're on a mission to save power, once you've done the basics (various forms of clock gating and power gating, which I'm sure AMD has already done), there's no low-hanging fruit left, and you have to fight for each %. Moving data around is costly, and doing so off-chip is one of the most expensive places to do it. Even after cutting power by 40% for HBM, it's still a major power sink, so it seems like an obvious place to optimize. I don't think the surprisingly large area of Tonga has anything to do with color compression: Nvidia claims color compression as well, and it doesn't seem to have hurt Maxwell.

Edit: one interesting thing I just noticed in the Maxwell slides I mentioned earlier is that they say "Enhanced cache effectiveness". This suggests that the compression/decompression happens not between the L2 and memory, but between the L2 and the requesting client. So your cache hit rate will go up as well. You're not only saving power on external transactions, but on internal transactions between the L2 and the MC as well. And you're increasing the cache hit rate. Lots of benefits.
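Rough numbers on why compressing on the client side of the L2 helps twice; the 2:1 ratio, tile size and cache line size below are assumptions for illustration, not published figures:

```c
/* Effect of an assumed 2:1 compression ratio on an 8x8 RGBA8 tile. */
#include <stdio.h>

int main(void) {
    const int tile_bytes = 8 * 8 * 4;   /* 8x8 RGBA8 tile: 256 bytes uncompressed */
    const int ratio      = 2;           /* assumed average compression ratio */
    const int line_bytes = 64;          /* assumed cache line size */

    printf("Bytes moved per tile:  %d -> %d\n", tile_bytes, tile_bytes / ratio);
    printf("Cache lines per tile:  %d -> %d\n", tile_bytes / line_bytes,
           (tile_bytes / ratio) / line_bytes);
    /* Fewer lines per tile means more tiles resident in the L2, hence the higher
     * hit rate on top of the bandwidth/power saving. */
    return 0;
}
```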
 
Tonga didn't see much of a drop in power consumption compared to Tahiti, and I didn't see any numbers for its cache size either. It'd be interesting to see how much the power consumption changes in that B3D test TechReport used, the one that showed how different textures resulted in different effective bandwidth.
 
Any thoughts on how the (supposed) reduction of the access latency to the HBM will affect occupancy and register pressure metrics?
 
I don't run these low-level tests every time (they never change... until now), so I hadn't picked up on it earlier.
I don't regularly run Tessmark either; I prefer the D3D sample code. First, it's openly accessible; second, it was donated by AMD themselves; and third, it gives you more control over what you run.
 