AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Yes, the CPU<->GPU coherence is not yet perfect. However, we have not seen a problem with that, since we sidestepped it completely by moving the whole graphics engine to the GPU. We don't need tight communication between the rendering and the game logic (physics, AI, etc). Obviously for general purpose computing, it would be a great improvement to get a shared CPU+GPU L3 cache and good cache coherence between the units (without needing to flush caches frequently or fall back to a slower, lower-bandwidth bus).
For APUs it is a multi-pronged problem. From an engine level it may be fine, but the implementation appears to be responsible for a sizeable portion of the significantly longer memory latency AMD's recent chips have versus pre-APU and Intel chips, which would affect the baseline performance of the system.
There are some other slides discussing the coherent buses, where the utilization figures seem measurably lower.

A more consistent architecture would lend itself to a more consistent implementation, which has knock-on effects on everything, because we've moved past the era when there were many singular improvements or problems that could be fixed with one tweak.
The poor integration is also a problem for AMD's desire to make more modular SOCs or to split them up over an interposer.
These concerns, other than the low-level power and performance changes, are not directly relevant to games so much as to one of AMD's dwindling set of options for maintaining relevance and some kind of volume for its products, which bears on whether said future GCN will exist and what resources it will be given for a competent implementation.


On the GPU side, atomics go directly to the L2.
The other portion of my discussion indirectly addresses this. That atomics and other traffic have to go back to the L2 is a power cost.
For a coherent cache line, every write winds up sending 64 bytes to the L2 or to memory, depending on how the operation is flagged.
That is 512+ bits moving possibly tens of mm, then possibly off-die, each cycle. For a full coherent wavefront with optimal striding, that is done four times. For chips like Tahiti, where the PHYs for the bus are spread across three sides of the perimeter, that is additional length to travel.
HBM's compactness, plus the rising channel and CU counts adding to the complexity of that long-distance crossbar, might create design pressure for something like an L1 atomic, or for a new cache layer local to a portion of the die that reduces the number of full-length trips.
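As a rough sketch of why those trips matter: assuming on the order of 0.1 pJ per bit per mm of on-die wire and a ~10 mm average CU-to-L2 distance (both illustrative assumptions, not measured figures for any AMD chip):

[CODE]
# Back-of-envelope wire-energy cost of pushing coherent writes/atomics to the L2.
# WIRE_PJ_PER_BIT_MM and AVG_DISTANCE_MM are assumed, illustrative values.
CACHE_LINE_BITS = 64 * 8          # one 64-byte line, as above
WIRE_PJ_PER_BIT_MM = 0.1          # assumption
AVG_DISTANCE_MM = 10.0            # assumption for a large die like Tahiti

pj_per_line = CACHE_LINE_BITS * WIRE_PJ_PER_BIT_MM * AVG_DISTANCE_MM
print(f"per 64B line moved:     {pj_per_line:.0f} pJ")       # ~512 pJ
print(f"per coherent wavefront: {pj_per_line * 4:.0f} pJ")   # 4 lines with optimal striding

# Sustaining one such line per cycle at 1 GHz is already measurable power:
print(f"one line/cycle @ 1 GHz: {pj_per_line * 1e9 * 1e-12:.2f} W of wire energy alone")
[/CODE]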

Heh, I was going to say that I want full float instruction set and full LDS read/write instruction set... but that I don't need memory write support :)
But I see some important uses cases for scalar unit memory writes. Especially if the scalar unit cache is not going to be coherent with the SIMD caches.
The paper at least provided for full FP support, and did rely on LDS for parts of its functionality.
I think the scalar cache may be coherent, if the write is flagged as such. The actual implementation in Tonga below the ISA's abstraction is unclear.

I just try to get maximum performance out of the hardware. This is why I suggest things that allow the programmer to write new algorithms that perform faster. All kinds of hardware optimizations that allow shutting down unneeded transistors are obviously very much welcome.
I've seen the desire for VALU ops to be able to source more than one scalar register. Expanding the register access methods and more fully integrating the domains might make that more of a possibility.

16 bit fields are also sufficient for unit vector processing (a big part of the lighting formulas) and for color calculations (almost all post processing). No need to waste double the number of bits if you don't need them.
I did not mean to discount those use cases, which are major. I was trying to make the case from the GPGPU or HPC standpoint, since GCN was willing to sacrifice graphics-domain capabilities for the sake of them.
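A quick numerical check of the FP16 point above, using nothing more than numpy rounding (illustrative only):

[CODE]
# FP16 precision is comfortably enough for unit vectors and [0,1] colors.
import numpy as np

v = np.array([0.3, -0.7, 0.648], dtype=np.float32)
v /= np.linalg.norm(v)                              # reference unit vector in fp32

v16 = v.astype(np.float16).astype(np.float32)       # round-trip through fp16
print("unit-vector error:", np.abs(v16 - v).max())  # a few 1e-4, far below shading needs

c = np.linspace(0.0, 1.0, 256, dtype=np.float32)    # 8-bit color steps
c16 = c.astype(np.float16).astype(np.float32)
print("max color error:  ", np.abs(c16 - c).max())  # well under one 8-bit step (1/255)
[/CODE]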

3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvement (the hull, domain and vertex shaders have load balancing issues because of the data passing between these shader stages).
I think 3dcgi and a few others had more in-depth knowledge on it.
The distribution of primitives seems to be an issue.
One possibility with 32 channels and scads of bandwidth might be to default Fiji to stream its tessellation output off-chip or via the broader cache.
The 980 Ti benchmarks still show notable gaps in even non-tessellated geometry handling.
Also, GCN does not like triangle strips.

Call it "ALU" or "simple core" - as long as each of them still has its own register memory, it's a matter of classification.
At least for GPUs, the ALUs do not have their own register memory; they source from a wide register file. Other notable pieces of thread context, like the instruction pointer and wavefront state, are either global state for the wavefront or housed in physically common hardware.

Sorry, I don't think it's a good idea from a hardware designer point of view.

While registers and cache memory serve the same purpose, i.e. provide fast transistor-based memory to offset slow capacitor-based RAM, they belong to very different parts of the microarchitecture and use very different design techniques to implement.
The SRAM register files in a CU have capacities on the order of some CPU L1 caches, and in aggregate they are larger than some caches. The hit logic and address translation of a cache are not there, or not there to the same extent (register IDs only need to be mapped to their placement in the shared file). Their physical accesses behave more like accessing a low-port cache array than like a CPU's more heavily ported, lower-density register file.
If the register file grows in future GPUs, the files will be bigger than some CPU caches that themselves had smaller caches sitting between them and the CPU.
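As a toy illustration of why that lookup is cheap (the base+offset scheme below is a sketch, not AMD's actual circuit): locating a wavefront's register is an addition fixed at wavefront launch, not a tag compare.

[CODE]
# Illustrative only: mapping a logical VGPR index to a row in a shared
# per-SIMD register file is base + offset, with the base assigned at launch.
ROW_BYTES = 64 * 4                # one VGPR row = 64 lanes x 32 bits

def vgpr_row_offset(wave_vgpr_base: int, vgpr_index: int) -> int:
    """Byte offset of logical register v[vgpr_index] for one wavefront."""
    return (wave_vgpr_base + vgpr_index) * ROW_BYTES

# Hypothetical wavefront allocated VGPRs starting at row 96:
print(hex(vgpr_row_offset(96, 5)))   # 0x6500
[/CODE]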

Registers are part of the instruction set architecture; they are much closer to the ALU and are directly accessible from the microcode, so they are "hard-coded" as transistor logic. Caches are part of the external memory controller, accessible only through the memory address bus, designed in blocks of "cells", and they are transparent to the programmer and not really a part of the ISA.
That's the best-known use of a cache, but the register cache idea was given that name precisely because what it targets is different. There are other caches out there, like the page walker caches used by the TLB fill mechanism, that are not even accessible to software.
A cache is a storage pool that tries to take advantage of locality and maps itself over locations in another storage pool.
The exact nature of the target pool or its addressing can change.
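A minimal sketch of that idea: a tiny, fast pool keyed by register IDs instead of memory addresses, sitting in front of the big register file. The sizes and the direct-mapped policy are invented purely for illustration.

[CODE]
# Toy "register cache": caches over register IDs rather than memory addresses.
class RegisterCache:
    def __init__(self, entries: int = 8):
        self.tags = [None] * entries       # which architectural register sits in each slot
        self.data = [0] * entries
        self.hits = 0

    def read(self, reg_id: int, backing_file: list):
        slot = reg_id % len(self.tags)     # direct-mapped on the register ID
        if self.tags[slot] == reg_id:
            self.hits += 1                 # operand reused: no trip to the big file
        else:
            self.tags[slot] = reg_id       # miss: fetch from the main register file
            self.data[slot] = backing_file[reg_id]
        return self.data[slot]

vgprs = list(range(256))                   # stand-in for the main register file
rc = RegisterCache()
for reg in (4, 5, 4, 4, 6, 5):             # v4/v5 reused back-to-back
    rc.read(reg, vgprs)
print(rc.hits, "of 6 reads served from the small pool")   # 3 of 6
[/CODE]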

Whether that proves workable, or whether the hierarchy needs to be software-managed with a small amount of local storage, is not settled.
The explicit method has already been done, as found by the reverse engineering of Nvidia's latest GPU ISA.

https://forum.beyond3d.com/posts/1619310/

The switching behavior for Maxwell is a little unclear to me, so I don't know if Nvidia has simplified things at a low level by not having a reuse cache per warp, instead keeping warps active for consecutive cycles to take advantage of a single one. The switch-happy GCN model, with its wider wavefronts, might need 64-wide locations ten times over per SIMD, which may not have the same hardware inflection point.
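Rough sizing of what that difference could mean, assuming four reusable source-operand slots and 4-byte operands (both assumptions for illustration, not documented figures):

[CODE]
# How much reuse storage each model implies, under the assumptions above.
OPERAND_BYTES = 4
REUSE_SLOTS = 4                                              # assumed

maxwell_active_warp = 32 * OPERAND_BYTES * REUSE_SLOTS       # only the warp kept active
gcn_all_wavefronts = 64 * OPERAND_BYTES * REUSE_SLOTS * 10   # 10 resident 64-wide waves per SIMD

print(maxwell_active_warp, "bytes if only the active warp needs slots")   # 512 B
print(gcn_all_wavefronts, "bytes if every resident wavefront needs them") # 10240 B per SIMD
[/CODE]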

Did AMD actually go on record saying this?
I haven't seen an actual transcript of the call, but multiple articles had the authors specifically stating they asked about this. It would seem odd that so many managed to misinterpret the answers the same way, or that AMD managed to accomplish this accidentally.
Frankly, AMD's history on the matter of disingenuously letting the press run with straightforward interpretations of plain English would point to a willful act of manipulation if this happened again.
A higher capacity should be possible, at some point. What time frame or product that may show up could be where the bigger question lies.

late edit: Spelled 3dcgi incorrectly
 
AMD's dwindling set of options for maintaining relevance and some kind of volume for its products, which bears on whether said future GCN will exist and what resources it will be given for a competent implementation.
I think they are waiting for the 14nm node next year for drastic changes.
 
Hopefully they are being conservative, since their quoted energy efficiency improvement is consistent with the process node transition only.

They mentioned a factor of 2, which is quite a bit more than what you actually get from process alone in practice. Of course, if the baseline is Tahiti, it's terrible. If it's Fiji, depending on how good Fiji is, it could be quite good.
 
The transition from 28nm to 14/16nm is a hybrid case in and of itself.
20nm had some mild power-efficiency benefits, but the structural change with the addition of FinFETs is the big power-efficiency gain for the hybrid nodes.
Their strong point tends to be below the highest speeds, where they are massively better at leakage reduction and at lower voltages when they aren't being pushed. That's a clock range GPUs can operate in just fine, and their massive transistor counts like the lower leakage.
The foundries are pushing to eke out a little more improvement over 20nm, despite the shared interconnect, and they freely market half the power or less at equivalent speed (vs 28nm).
 
The transition from 28nm to 14/16nm is a hybrid case in and of itself.
20nm had some mild power-efficiency benefits, but the structural change with the addition of FinFETs is the big power-efficiency gain for the hybrid nodes.
Their strong point tends to be below the highest speeds, where they are massively better at leakage reduction and at lower voltages when they aren't being pushed. That's a clock range GPUs can operate in just fine, and their massive transistor counts like the lower leakage.
The foundries are pushing to eke out a little more improvement over 20nm, despite the shared interconnect, and they freely market half the power or less at equivalent speed (vs 28nm).

I didn't realize the foundries were so bullish about their FinFET processes. I had in mind figures closer to -30% power at isospeed, but in hindsight, that might have been relative to 20nm.
 
It is foundry marketing, but it does look like a decent-ish gain from FinFET on top of a kind of disappointing one from 20nm, which basically means it took two "nodes" to deliver the power improvement one node ideally would.
The foundries are still playing a few games here and there, since TSMC needs the + revision of its process to get the best transistor performance.

edit: One example is TSMC. GF is similar (in comparing 28 to 14), IIRC.
http://www.tsmc.com/english/dedicatedFoundry/technology/16nm.htm
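Putting the two quoted figures together (half the power versus 28nm at equivalent speed, and roughly -30% at iso-speed if that was indeed relative to 20nm), the implied 20nm contribution works out like this; the split is an inference, not foundry data:

[CODE]
# How the marketing numbers could compose. Both inputs are the figures quoted
# in this exchange; the 20nm share is inferred, not published.
finfet_vs_20nm = 0.70        # ~30% power reduction at iso-speed, taken as vs 20nm
overall_vs_28nm = 0.50       # "half the power or less at equivalent speed" vs 28nm

print(f"implied 20nm power vs 28nm: {overall_vs_28nm / finfet_vs_20nm:.2f}")  # ~0.71, i.e. ~-29%
[/CODE]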
 
Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, since GPUs don't have register renaming or register spilling to the stack. A register cache should allow a considerably larger register file (a little bit farther away) while keeping performance and power consumption similar. This would definitely help GCN.
A single CU has 256KB of register file. Fiji, with 64 CUs of the same design as prior GCN GPUs would have 16MB of registers.

Latency hiding works because 1/4 or more of the per-CU register allocation is available on the next cycle.

GCN (like the VLIW chips) has scratch registers: registers that live in global memory. It's unclear whether they're cached. But the performance with them sucks horribly.

The sheer quantity of bandwidth that could be consumed by scratch registers should be heeded. A single CU at 1GHz has 64 lanes, each of which reads 3 DWords and writes 1 DWord per cycle. That's 64 lanes x 4 bytes x 4 DWords x 1GHz = 1TB/s of register file bandwidth.

64 CUs in Fiji would have 64TB/s of aggregate register file bandwidth.
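The same numbers, worked through:

[CODE]
# Register-file capacity and bandwidth figures for a GCN CU and a 64-CU chip.
LANES = 64                    # per CU (4 SIMD-16s executing a 64-wide wavefront over 4 cycles)
DWORD = 4                     # bytes
CLOCK_HZ = 1e9                # 1 GHz, as in the example above

cu_regfile = 256 * 1024                               # 256 KB of VGPRs per CU
chip_regfile_mb = 64 * cu_regfile / 2**20             # 16 MB across 64 CUs

cu_bw = LANES * DWORD * (3 + 1) * CLOCK_HZ            # 3 reads + 1 write per lane per cycle
chip_bw = 64 * cu_bw

print(f"registers per chip:    {chip_regfile_mb:.0f} MB")       # 16 MB
print(f"register BW per CU:    {cu_bw / 1e12:.2f} TB/s")        # ~1.02 TB/s (the ~1 TB/s above)
print(f"aggregate register BW: {chip_bw / 1e12:.1f} TB/s")      # ~65.5 TB/s (the ~64 TB/s above)
[/CODE]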
 
I didn't realize the foundries were so bullish about their FinFET processes. I had in mind figures closer to -30% power at isospeed, but in hindsight, that might have been relative to 20nm.

Depends on what kind of chip. FinFETs have a full order of magnitude less leakage for similarly sized transistors, and you can shrink the size and energy use of SRAM a lot with them. However, neither TSMC nor Samsung/GF is improving interconnect energy use much with this process. This means that the energy savings you get can be somewhere between pathetic and awesome, depending on just what you are building.

One thing that's a sure bet for the next foundry cycle is that cache and buffer sizes are trending up. SRAM density, and the time it takes to access a large pool, are both improving much more than general logic.
 
3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvement (the hull, domain and vertex shaders have load balancing issues because of the data passing between these shader stages). These guys could pop in and give their in-depth knowledge. I have not programmed that much for DX10 Radeons (as consoles skipped DX10). GCN seems to share a lot of geometry processing design with the 5000/6000 series Radeons.
You could start here and work backwards:

[URL="https://forum.beyond3d.com/posts/1818498/"]Xbox One November SDK Leaked[/URL]
 