AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Since it's highly unlikely that a forum troll decided to write a fake 354-page document full of technical specifications, it's probably the real thing.

Yes, I said "supposed" because some other people who were discussing it were trying to pass it off as a PI document.
I just skimmed through it and saw that it was from this month, but also saw the 2013/7 reference in the URL.

I also wouldn't put it past those aforementioned trolls to take an old document they don't understand, update a few things (dates, images, and a few code names) and try to pass it off as documentation on the next-gen.
 
It'd be cool if it were fake, though. Just the thought. :)


Page XV lists the differences. Seems very much evolutionary. Mostly compute related?

“DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes

This is related to "ballot", and similar to ddx/ddy for compute shaders, but without the d(erivative). No more going over the LDS when you can peek directly into your siblings' VGPRs. I'm sure it isn't going to be exposed in HLSL, but the compiler could detect the round-trip and emit these instructions instead. I can imagine DPP and ddx/ddy share silicon.
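DPP itself is GCN-specific and not something you can write directly today, but the basic idea of reading a neighbouring lane's register without a shared-memory round trip also exists in CUDA as warp shuffles. A minimal sketch of that analogy (CUDA intrinsics, not the GCN encoding; purely illustrative):

// Each lane computes a forward difference against the lane to its right,
// entirely in registers, with no LDS/shared-memory round trip.
__device__ float lane_delta(float v)
{
    // fetch v from lane (laneid + 1); the result is undefined in the last lane
    float right = __shfl_down_sync(0xffffffffu, v, 1);
    return right - v;
}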
 
The link to the supposed PI document is available right on http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
 
The GCN diagram is certainly wrong. Only 32 KB of LDS, among lots of other issues (it looks like their old VLIW GPU design). This seems to have already been discussed on the previous page.
GCN implements ddx/ddy using cross lane operations (4 lane crossbar). AMD has a public presentation about the lane swizzles (4 lane crossbar, reverse, broadcast, swap). By combining these you can do everything. See slides 42-43 here: http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
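For illustration, the same quad trick can be written with generic cross-lane shuffles. This is a CUDA analogy of the 4-lane crossbar, not the actual GCN lane-swizzle encoding, and it assumes the usual 2x2 quad layout (lane 0 = (x,y), 1 = (x+1,y), 2 = (x,y+1), 3 = (x+1,y+1)):

// Coarse ddx within a 2x2 quad: XOR the lane id by 1 to reach the horizontal
// partner (XOR by 2 would give the vertical partner, i.e. ddy).
__device__ float quad_ddx_coarse(float v)
{
    unsigned lane  = threadIdx.x & 3;                     // position inside the quad
    float    other = __shfl_xor_sync(0xffffffffu, v, 1);  // horizontal partner
    return (lane & 1) ? (v - other) : (other - v);        // value at x+1 minus value at x
}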

OpenCL 2.0 already exposes the cross lane operations (called work group functions). Intel and AMD both support them (NVIDIA doesn't yet have OpenCL 2.0 drivers). OpenCL supports (work_group_) all, any, broadcast, reduce, inclusive_scan and exclusive_scan. These compile to GPU specific cross lane operations.
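To make the "these compile to cross lane operations" point concrete, here is roughly what an inclusive add-scan boils down to at wavefront/warp granularity. A CUDA sketch assuming a warp width of 32; note that OpenCL's work_group_scan_inclusive_add covers the whole work group, so a full implementation needs one more step across warps:

// Hillis-Steele inclusive scan within one warp, built only from cross-lane reads.
__device__ int warp_inclusive_scan_add(int v)
{
    unsigned lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int n = __shfl_up_sync(0xffffffffu, v, offset);   // value from lane (laneid - offset)
        if (lane >= offset)
            v += n;
    }
    return v;
}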

More info:
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/

HLSL (DirectX 11.2) doesn't yet support these operations. This is unfortunate, since these operations both simplify algorithms and make them faster.
 
Nvidia has supported a single shuffle instruction that does arbitrary communication between lanes in a warp (without using any shared memory) since Kepler. Certain functions are supported across all elements in a work group since Fermi (any, all, reduce). Fermi and up also support the ballot instruction, which is useful for these types of algorithms.

I think doing scans and broadcasts across all work items in a work group on Nvidia hardware requires allocating a very small amount of shared memory: one element for broadcast, and w elements for scans, where w is the number of warps in the work group. The OpenCL compiler could do this automatically.
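A hedged sketch of that pattern in CUDA (kernel and buffer names made up for illustration): a shuffle-based reduction inside each warp, then one shared-memory slot per warp, i.e. the "w elements" mentioned above:

// Block-wide sum: registers + cross-lane shuffles inside each warp,
// then a tiny shared array with one element per warp.
// Assumes blockDim.x is a multiple of 32 and the launch covers the whole input.
__global__ void block_reduce_sum(const float* in, float* out)
{
    __shared__ float warp_sums[32];                       // up to 32 warps per block
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];

    for (int offset = 16; offset > 0; offset >>= 1)       // warp-local reduction
        v += __shfl_down_sync(0xffffffffu, v, offset);

    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = v;                   // one shared element per warp
    __syncthreads();

    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int w = 0; w < blockDim.x / 32; ++w)
            total += warp_sums[w];
        out[blockIdx.x] = total;
    }
}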
 
Is there an indication that the DPP instructions do something different than the no-allocation LDS instructions introduced with GCN, or is this more of an encoding change that still leverages the same hardware?
This may leverage a similar method of execution as VALU instructions that source from the LDS, where there is no need for a WAITCNT for the LDS because a wavefront cannot issue past an in-progress ALU operation.
Using explicit LDS to swizzle would require software tracking, whereas an encoding that rolls it into an ALU instruction would not.

If it's reusing the LDS data paths, it could have implications for the time it takes to execute the ops if the LDS is being more heavily used. Contention would go down if a separate network were present per SIMD, which would have a hardware cost.
 
I imagine you have some sort of barrel register file. Because all threads execute a symmetric cross-lane operation, you just need to rewire the ALU<->register connection, which I assume exists anyway because you can switch active threads in a CU without a big penalty on GCN.
A la: register-file base address = rotate((base + threadid) % groupsize, simd rotation width, amount). Something like that.
 
Why would switching active threads require moving data between lanes? Absent cross-lane activity, register access can start with the base register ID for the wavefront plus whatever register ID the code thinks it is using.
The originally introduced swizzling methods were categorized as LDS instructions that didn't write to the LDS storage banks.

There are some operations that also require more than a simple rotation, including mirroring and broadcasting of specific lanes to later rows.
Having an LDS-like network at the SIMD level could make the LDS network redundant.
 
Hm. To me it looks like the GCN register file is like, say, a 23-bit address space (8MB), of which you can only address a window of 7 bits (128 VGPRs). So every SIMD in the CUs has a base address, and all register access is relative to the base address, much like "ebp", just for registers instead of memory. Let's call it "tctx" (thread context). Now, when you want to deactivate stalling SIMD threads in a CU, you remember the program counter, store it somewhere and reset the program with a different base address. All the state is in the register file; there are no flags or other processor state which could get lost, like with OoO and so on.
No data is moved, but the lanes between SIMDs/CUs are rewired to give access to different windows of the register file. The real register file wiring is much larger than just 7 bits, which means you can create an instruction which temporarily alters the access network's base address such that a "mov v0, tctx[23].v56" would make the address of v0 fetch that part of the register file which is the 56th VGPR of the 23rd thread. That would be the generic idea.
If you don't need to address every other register explicitly, but you want registers that are actual swizzles of the sibling threads, then you only need to do things like "(threadid + 1) % threadgroup", which gives you the result of your next circular neighbour; subtract your own value and you get derivatives. Multiply by 4 and you get the next SIMD, multiply by 64 and you get the next CU, and so on. These operations are all arithmetic modifications of a base address in a global VGPR address space. The cross-SIMD "address" modifications that are possible are written down in the GCN documentation. The command is executed in the same cycle in parallel on all threads with the same modifier, which is designed such that it's impossible to have a read/write conflict; it's also impossible to have read/read conflicts.
But I'm not a low-level silicon guy; it might sound more like how a GCN emulator could implement the instruction efficiently.
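A toy model of that idea (editorial and purely illustrative, no claim that this matches the real hardware): treat the whole register file as one flat array, give each wavefront a base window, and a "read my circular neighbour's register" becomes nothing more than index arithmetic, so no data ever moves:

// Toy addressing model: LANES lanes per wavefront, flat register-file addresses.
#define LANES 64

__host__ __device__ inline int vgpr_addr(int wave_base, int lane, int reg)
{
    // flat address of VGPR 'reg' belonging to 'lane' in the wavefront at 'wave_base'
    return wave_base + reg * LANES + lane;
}

__host__ __device__ inline int neighbour_vgpr_addr(int wave_base, int lane, int reg)
{
    // same arithmetic with (lane + 1) % LANES: the cross-lane read is just a
    // different address; the data itself stays where it is
    return vgpr_addr(wave_base, (lane + 1) % LANES, reg);
}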

Not really, I think: an LDS-like network at the SIMD level wouldn't make the LDS network redundant. You have just a handful of registers (per thread); the LDS is huge in comparison.
 
GCN register file is 256 addresses x 2048-bit (64 x 32-bit). It's why it can store 1 and fetch 3 distinct addresses every four cycles. For ever.

It's trivially simple to divide the addresses equally among the set of wavefronts running on the SIMD, nothing fancy.
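For a sense of scale, using the figures above plus the commonly quoted limit of 10 wavefronts per SIMD (an assumption on my part here): 256 addresses x 2048 bit = 64 KB of VGPR storage per SIMD, so splitting it equally over 10 wavefronts leaves roughly 25 VGPR addresses (i.e. ~25 VGPRs per lane) each, and over 4 wavefronts, 64 each.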

I suspect you're using imprecise language, when you say:

"mov v0, tctx[23].v56" would make the address of v0 fetch that part of the register file which is the 56th VGPR of the 23rd thread.

I suspect you mean SIMD lane 23 (wavefront lane 23 of 64), VGPR 56. GCN doesn't have that concept. All register addressing is uniform across all lanes. It's fat and dumb. Or, it's brutishly elegant, to be less pejorative.

There is state beyond PC, e.g. to handle exceptions, countdowns for memory accesses etc.
 
Yes, that's what I meant. lane == thread, wavefront == threadgroup.

It's certainly interesting. I had discussions lately where we tried to conjecture what exactly makes GCN so efficient at thread-switching, and I think it's the register-file connection/addressability.
It's too bad these instructions don't make it in any shape or form into HLSL pixel and compute shaders. There are funny "x and x + 3" offsets possible in the DPP instructions; I wonder if they are related to getting the next triangle's vertex triple and whether they're used in domain and geometry shaders. Imagine you'd make a pixel shader which decides it returns the same value for all pixels, distributes the calculation over the quad's threads and makes a SIMD-coherent branch out. I think I could play with this all day. :)
 
These instructions got into OpenCL 2.0 very recently. AMD, Intel and NVIDIA all have cross lane operations in their GPUs. I would expect to see cross lane operations in Vulkan, since it shares the same SPIR-V back end with OpenCL 2.1.

IMHO, it is only a matter of time before we get these in DirectX, since the cross lane operations provide nice performance gains for many algorithms (reduced GPR usage, reduced LDS usage, fewer instructions, etc.) and at the same time make writing these algorithms much simpler. Just look at the OpenCL 2.0 examples to see how nice the code looks written with these operations.

Intel example about sorting with cross lane operations: https://software.intel.com/en-us/ar...ted-parallelism-and-work-group-scan-functions
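As one more concrete illustration of the "fewer instructions, less LDS" point, classic stream compaction becomes almost trivial with ballot + popcount. A hedged CUDA sketch at warp granularity (an OpenCL 2.0 work-group version would use work_group_scan_exclusive_add instead):

// Where does my element go in the compacted output? Ballot gives one bit per
// lane that keeps its element; popcount of the lower bits gives my slot.
__device__ int warp_compact_slot(bool keep)
{
    unsigned lane  = threadIdx.x & 31;
    unsigned votes = __ballot_sync(0xffffffffu, keep);
    unsigned below = votes & ((1u << lane) - 1u);
    return keep ? (int)__popc(below) : -1;                // -1: this lane writes nothing
}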
 
One of the main reasons is that the card is expected to perform so well in 4K gaming that the 4GB frame buffer could impose a serious limitation.
Or the consoles are expected to use >6GB so having "just" 4GB is likely to hinder performance sooner rather than later.
Or because the GM200 Geforce will probably bring 6GB so they'd get the shorter end of the stick if performance isn't that different between the two.

Regardless, I'm glad that AMD continues to push the envelope for more memory in their graphics cards "for the common mortal".
(Except for the R9 285.. 2GB? Bad AMD!)
 
It says 4GB initially, 8GB later. So Fud probably discovered that HBM2 allows for larger capacity? The only question is when HBM2 will be available for mass production.

He actually said:
"The 4GB card needs four 8Gb chips […] so you can end up with a total bandwidth of 512GB/s. In order to put 8GB on the card you need eight 8Gbit chips […] with 1024GB/s bandwidth.

The decision to go for an 8GB Fiji rather than the planned 4GB version was in part attributed by Nvidia’s Titan X 12GB card announcement."

This makes no sense. There's no way that AMD designed Fiji, then saw the Titan X and decided to redesign it with a memory bus twice as wide just a few weeks away from mass production. Yes, higher-density chips would be plausible, but I'm more inclined to think he has no idea what he's talking about.
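Rough numbers support that read, assuming first-generation HBM at about 1 Gbps per pin with a 1024-bit interface per stack (treat those figures as an assumption): one stack gives 1024 bit x 1 Gbps / 8 = 128 GB/s, so four stacks give 512 GB/s on a 4096-bit bus and eight stacks give 1024 GB/s on an 8192-bit bus. Going from the quoted 512 GB/s to 1024 GB/s therefore means doubling the number of stacks and the interposer bus width, not just swapping in denser chips.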
 