AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

The cache hierarchy could stand some improvement.
The 2011-era implementation that GCN still carries is a step above the incoherent read-only pipeline it had before, but its behavior is still too primitive to mesh with the CPU (Onion+ skips it), and its method of operation moves a lot of bits over significant distances.
The increase in channel count and the number of requestors for the L2 mean more data moving over longer distances, which costs power. The way GCN enforces coherence, by forcing misses to the L2 or memory, costs power as well.
Changes like more tightly linking parts of the GPU to the more compact HBM channels and going writeback between the last-level cache and the CUs could reduce this, but not without redesigning the cache.
Yes, the CPU<->GPU coherence is not yet perfect. However, we have not seen a problem with that, since we sidestepped the problem completely by moving the whole graphics engine to the GPU. We don't need tight communication between the rendering and the game logic (physics, AI, etc). Obviously, for general-purpose computing it would be a great improvement to get a shared CPU+GPU L3 cache and good cache coherence between the units (without needing to flush caches frequently or use a lower-bandwidth bus). By "units" I mean both the CPU and the various parts of the GPU (such as the front end, vector units, scalar unit, etc). More coherence = fewer flushes needed = fewer stalls.

On the GPU side, the atomics go directly to the L2. That could be improved to make some cases faster. However, in practice we use LDS atomics locally, and then perform one global atomic to synchronize. This greatly reduces the number of global atomics you need. GCN LDS is highly optimized for atomics. GCN also has super fast cross-lane operations, allowing the developer to sidestep the need for atomics (and LDS accesses) in many common cases. Unfortunately only OpenCL 2.0 exposes this feature on PC (subgroup operations, see here: http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/). I am still sad that this feature didn't get included in DirectX 12 :(
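To make that pattern concrete, here is a minimal HLSL compute shader sketch (my own illustration, not the poster's actual code; the resource names are made up): each thread group accumulates into a groupshared (LDS) counter with InterlockedAdd, and only one thread per group issues the global atomic.

Buffer<uint> SrcData : register(t0);            // made-up input buffer
RWByteAddressBuffer GlobalCount : register(u0); // made-up global counter

groupshared uint g_localCount;                  // lives in LDS

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gtIndex : SV_GroupIndex)
{
    if (gtIndex == 0)
        g_localCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Per-thread work uses the fast LDS atomic.
    if (SrcData[dtid.x] != 0)
        InterlockedAdd(g_localCount, 1);

    GroupMemoryBarrierWithGroupSync();

    // One global (L2) atomic per group instead of one per thread.
    if (gtIndex == 0)
        GlobalCount.InterlockedAdd(0, g_localCount);
}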

I would also love to see ordered atomics in DirectX some day :)
There was a paper on promoting the scalar unit to support scalar variants of VALU instructions, which had some benefits.
Tonga's GCN variant did promote the scalar memory pipeline to support writes, at least.
Heh, I was going to say that I want a full float instruction set and a full LDS read/write instruction set... but that I don't need memory write support :)

But I see some important use cases for scalar unit memory writes. Especially if the scalar unit cache is not going to be coherent with the SIMD caches.

Other questions are whether GCN can evolve to express dependency information to the scheduling hardware. It currently has 10 hardware threads per SIMD waking up and being evaluated per cycle. Perhaps some of that work could be skipped if the hardware knew that certain wavefronts could steam ahead without waking up all the arbitration hardware.
This is another thing it could borrow from Nvidia, or again the VLIW architectures GCN replaced.
I am not a hardware engineer, so I don't know that much about the hardware-level power saving mechanisms. Those are often fully transparent to the programmer. I just try to get maximum performance out of the hardware. This is why I suggest things that allow the programmer to write new algorithms that perform faster. All kinds of hardware optimizations that allow shutting down unneeded transistors are obviously very much welcome.

Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, as GPUs don't have register renaming or register spilling to the stack. A register cache should allow a considerably larger register file (a little bit farther away), while keeping the performance and power consumption similar. This would definitely help GCN.
16-bit fields are sufficient for machine learning.
16-bit fields are also sufficient for unit vector processing (big parts of the lighting formulas) and for color calculations (almost all post-processing). No need to waste double the number of bits if you don't need them.
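As a rough sketch of what this looks like at the HLSL level (illustrative only, made-up resource names; min16float is the minimum-precision type D3D exposes, and the hardware/driver decides whether it actually maps to 16-bit registers):

Texture2D<float4> SceneColor : register(t0);    // made-up post-processing input
SamplerState LinearClamp : register(s0);

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    // Color math carried in minimum-precision types; a GPU with real 16-bit
    // register support could keep these values in half-width VGPRs.
    min16float3 color  = (min16float3)SceneColor.Sample(LinearClamp, uv).rgb;
    min16float3 mapped = color / (color + (min16float3)1.0);  // simple Reinhard-style tonemap
    return float4((float3)mapped, 1.0);
}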
 
What do you mean by "shader design"? Is the AMD compiler/optimizer inefficient for the domain/hull/geometry HLSL profiles of the IML bytecode, or is the GCN instruction set lacking instructions or registers to execute these shaders efficiently?
3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvements (hull, domain and vertex shader have load balancing issues because of the data passing between these shader stages). These guys could pop in and give their in-depth knowledge. I have not programmed that much for DX10 Radeons (as consoles skipped DX10). GCN seems to share a lot of geometry processing design with the 5000/6000 series Radeons.

I have programmed a Radeon 5870 at home, and that GPU is super slow at many things that GCN handles without any problems. For example, dynamic indexing of a buffer (SRV) in a shader makes my test code 6x slower compared to constant buffer indexing (even for very small, cache-friendly data sets). Needless to say, I really like some of the improvements in GCN.
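For reference, a minimal sketch of the two access patterns being compared (not the actual test shader; resource names are made up). The index comes from per-pixel data, so the compiler cannot constant-fold it away:

cbuffer LookupCB : register(b0)
{
    float4 CBTable[256];        // constant buffer indexed dynamically: fast on the 5870
};

Buffer<float4> SRVTable : register(t0);   // buffer SRV indexed dynamically: ~6x slower there, fine on GCN
Texture2D<uint> IndexTex : register(t1);  // made-up source of per-pixel indices

float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    uint index = IndexTex.Load(int3(int2(pos.xy), 0)) & 255;
    float4 a = CBTable[index];   // constant buffer path
    float4 b = SRVTable[index];  // SRV path
    return a + b;
}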
 
No. Repeat after me: an ALU (a lane in a SIMD) is not a core... If Fiji actually has 4096 ALUs, it's going to be a 64-core machine, each core being a (logically) 64-wide SIMD. The Phi (for example) has a comparable core count, but narrower SIMD. Intel is honest about counting their cores.
Call it "ALU" or "simple core" - as long as each of them still has its own register memory, it's a matter of classification.

Intel's approach may be more "pure", but it has to be in order to give developers the ability to program against the familiar x64/AVX-512 instruction set. On the other hand, a Xeon Phi 7100 costs $4000 and gives you ~1.2 TFLOPS while consuming 300 W, whereas Hawaii (R9 290X) is $500 and ~6 TFLOPS for the same 300 W, at least theoretically.
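For reference, the theoretical Hawaii number comes straight from the usual peak-FLOPS arithmetic (the ALU count and clock are public specs, not from this thread):

peak FLOPS = ALUs x 2 (one FMA = 2 flops) x clock
Hawaii (R9 290X): 2816 x 2 x ~1.0 GHz ≈ 5.6 TFLOPS single precision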

I find this presentation by Andy Glew to be an excellent exploration, albeit probably a tad optimistic (IMHO): http://parlab.eecs.berkeley.edu/sites/all/parlab/files/20090827-glew-vector.pdf.
It's only been 25 years since I first saw similar theoretical papers explaining how RISC/VLIW is so much better than CISC for improving ILP :)
 
The biggest change I want to see in the GCN core is more RF per work item. Ideally 4x more. Yes, we're then talking about RF area coming close to the ALU area.
A better compiler would reduce the VGPR usage. No hardware modifications required. If the scalar unit were more robust and the compiler was even better, scalar offloading would additionally reduce the VGPR usage a bit. I would also want to see full support for 16-bit integer and 16-bit float types (to halve the RF usage of these types of data). When these cheap improvements (HW-wise) are exhausted, I would love to see a bigger RF size per work item. A register cache would be a good solution to allow a bigger register file with no performance issues and no power issues. But I don't think we need a 4x bigger RF if we get all these other improvements first. In that case, I would be perfectly happy with a 2x larger RF per work item.
Umm, aren't GPUs still under the SIMT paradigm? And according to you and the doc Razor provided, SIMT on GCN is 16 wide on the hardware side of things, and 4 of those 16-wide vector units make up part of a CU?
Logically, GCN SIMT is 64 wide. Waves are 64 wide. Branching and memory waiting happen at wave granularity. The SIMD execution units, however, are 16 wide. One instruction is executed over 4 cycles (in a pipelined round-robin manner). AMD has good papers and presentations about the GCN execution model available.
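Spelled out with the numbers mentioned in this thread (64-wide waves, 16-wide SIMDs, 4 SIMDs per CU, 10 waves per SIMD):

64 lanes per wave / 16 lanes per SIMD = 4 cycles to issue one VALU instruction
4 SIMDs per CU x 10 waves per SIMD    = 40 waves resident per CU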
GCN has removed the automatic handling of divergence and it expects the program to handle its jumps and instruction skipping.
Older AMD GPUs did this as well. A jump instruction could check the execution mask; if the mask was filled with zero bits, it jumped over the code completely. Xbox 360 HLSL even had some high level constructs to control whether you wanted only a jump or a jump + predication. See here: https://msdn.microsoft.com/en-us/library/bb313975(v=xnagamestudio.31).aspx
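The plain PC HLSL analogue of those constructs is the [branch]/[flatten] attribute pair; a hedged sketch (made-up shadow-mapping example, not from the linked article):

Texture2D ShadowMap : register(t0);
SamplerComparisonState ShadowSampler : register(s0);

float4 PSMain(float4 pos : SV_Position,
              float3 shadowUVZ : TEXCOORD0,
              float3 albedo : COLOR0) : SV_Target
{
    float lit;

    // [branch] asks for a real jump: if the execution mask of the wave is
    // all zeroes, the shadow sampling code is skipped entirely.
    [branch]
    if (shadowUVZ.z < 1.0)
        lit = ShadowMap.SampleCmpLevelZero(ShadowSampler, shadowUVZ.xy, shadowUVZ.z);
    else
        lit = 1.0;

    // [flatten] on the same 'if' would instead evaluate both sides and
    // select the result (predication), avoiding the jump.
    return float4(albedo * lit, 1.0);
}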
 
I think a prominent graphics architect from AMD (I can't remember who it was exactly) once commented that if not for compute, AMD wouldn't even have moved from VLIW5 to VLIW4.
Multiple, in fact. That was their mantra from mid-2011 on.
 
A better compiler would reduce the VGPR usage. No hardware modifications required.
the load balancing needs improvements (hull, domain and vertex shader have load balancing issues because of the data passing between these shader stages).
So this again boils down to AMD's conscious underengineering approach to driver performance and native code generator optimisations...


Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, as GPUs don't have register renaming or register spilling to the stack.
Sorry, I don't think it's a good idea from a hardware designer point of view.

While registers and cache memory serve the same purpose, i.e. provide fast transistor-based memory to offset slow capacitor-based RAM, they belong to very different parts of the microarchitecture and use very different design techniques to implement.

Registers are part of the instruction set architecture; they are much closer to the ALU and are directly accessible from the microcode, so they are "hard-coded" as transistor logic. Caches are part of the external memory controller, accessible through the memory address bus only, designed in blocks of "cells", and they are transparent to the programmer and not really a part of the ISA.

While it may initially seem attractive for some reasons to implement a "register file" as a continuous block of fast SRAM cache and allocate registers on demand, the utter complexity of such an implementation will kill any benefits, more so than just adding more registers to the ISA or implementing a common pool/file with register renaming. Not to mention that registers give you very fast 1-clock access, while even the fastest L1 caches require a few clocks.
 
Call it "ALU" or "simple core" - as long as each of them still has its own register memory, it's a matter of classification.

Intel's approach may be more "pure", but it has to be in order to give developers the ability to program against the familiar x64/AVX-512 instruction set. On the other hand, a Xeon Phi 7100 costs $4000 and gives you ~1.2 TFLOPS while consuming 300 W, whereas Hawaii (R9 290X) is $500 and ~6 TFLOPS for the same 300 W, at least theoretically.

It's only been 25 years since I first saw similar theoretical papers explaining how RISC/VLIW is so much better than CISC for improving ILP :)

The "call it whatever you want to, it's OK, naming is academic anyway" attitude is an issue (and part of why more mature fields have issues taking the whole GPGPU story seriously). The terms are not substitutable, and the difference goes back more than 25 years - really, it's a solved issue. I fail to see what is relevant in the pricing of respective cards, and counting FLOPs is also misleading. They are in different markets - FirePro is comparatively priced. The discriminant is not whether an ALU has or does not have some SRAM attached (sidenote, even the way in which the RF is partitioned, which is not per-lane independent, should be an indication). Amusingly, if the most solid illusion of "an ALU is a core" is provided by Intel's Gen, but that's a pretty...eccentric piece of hardware to begin with. Finally, if you have time parsing the linked presentation might be useful. Especially since in some regards Glew is more supportive of your views than mine. Since we're moving the discussion in a different direction, I wonder whether or not I should fork this (B3D denizens, opinions?).
 
The "call it whatever you want to, it's OK, naming is academic anyway" attitude is an issue (and part of why more mature fields have issues taking the whole GPGPU story seriously). The terms are not substitutable, and the difference goes back more than 25 years - really, it's a solved issue. I fail to see what is relevant in the pricing of respective cards, and counting FLOPs is also misleading.[...]
Cores is the new MEGAhurtz (since ca. 2006).
 
OC result? At least you cannot buy those from SK-Hynix right now.

Hynix's public part catalog has been outdated or just plain wrong before. In the short term, there is one company that is going to buy HBM from them, so they have no need to have very accurate catalogs.

However, isn't the reported clock of the first-gen HBM going to be the non-DDR clock, or 500ish MHz?
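For what it's worth, the first-gen HBM arithmetic works out the same either way you quote it (assuming the standard 1024-bit interface per stack):

500 MHz x 2 (DDR) = 1 Gbps per pin
1024 bits x 1 Gbps / 8 = 128 GB/s per stack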
 
As for AMD feeding BS to news sites to mislead competition, not completely unlikely.
Outright lying to media would be very much frowned upon pretty much universally I would think.

Better to just say "no comment" if you don't want to confirm the amount of RAM your new unannounced graphics card has.
 
Hynix's public part catalog has been outdated or just plain wrong before. In the short term, there is one company that is going to buy HBM from them, so they have no need to have very accurate catalogs.

Not very convincing, IMO. Maybe their catalogues have been wrong, but there's HBM listed in there right now, albeit with 1 Gbps (500 MHz). The slower 0.8 Gbps HBM got removed lately, but they forgot to add the more interesting higher speed grade? I am not convinced.
 
Outright lying to media would be very much frowned upon pretty much universally I would think.

Better to just say "no comment" if you don't want to confirm the amount of RAM your new unannounced graphics card has.

I'm not sure if they ever explicitly said it's gonna be 4 GB, period? AFAIK they kinda bullshitted around the question, implying it's gonna be 4 GB.
 