AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I believe GCN won't accept a work group size larger than 256. I suspect anything larger is illusory.
Demonstrably false. DirectCompute requires support for work group sizes of at least 1024 threads, for example.
Jawed said:
A few months back I wrote about apocryphal VGPR allocations in code I'm working on. Since then I've cleaved my kernel in two, with the rationale that I can't avoid storing data off-die for possible run times of 1+ seconds, since there's no way I can construct a pure on-die pipeline (which would also use multiple kernels).

Running two kernels compartmentalises VGPR and shared memory allocation. This enables me to re-code the two halves without fighting their joint VGPR+LDS allocation, which ultimately leads to more performance.

The two halves are strongly asymmetric in their use of VGPRs and LDS. The first half has a giant cache in LDS and moderate VGPR allocation, the second uses a small amount of LDS for 8-way work item sharing with a huge VGPR allocation including a large array in VGPRs. Luckily it has very little latency to hide (and it has huge arithmetic intensity), so 3 hardware threads isn't really a problem. The first kernel is LDS limited to 5 hardware threads. An improvement from 3 hardware threads for the uber kernel, which is where the performance gain came from as far as I can tell.

And now I have the freedom to re-work the first kernel now that it isn't bound 1:1 by the logic and in-register capacity constraints of the second kernel, e.g. I can run half the instances of kernel 1 for each instance of kernel 2, reaping more performance from the huge cache.

It helps that global memory access latency incurred by writing and reading data across the split is hidden, with only about 14GB/s usage. I now have the opportunity to re-code each half, which means I'll get substantially more performance than the original "uber kernel" (rather than the 3% gain I got doing the split).

Overall, it seems to me there's a lot of mileage in atomising kernels.
This has been the recommendation on GPUs for ages. See the huge progress made with LuxRender for another example.
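For anyone who hasn't tried it, here is a minimal, hypothetical OpenCL sketch of the general idea (not Jawed's actual kernels; the buffer names, sizes and placeholder math are made up). The point is that each half of the split gets its own VGPR and LDS budget, at the cost of staging the intermediate data through global memory:

Code:
// Pass 1: the LDS-heavy half. Stages data through local memory (LDS) and
// writes intermediate results to a global scratch buffer.
// Assumes a work group size of 256 or less.
__kernel void pass1(__global const float *input,
                    __global float *intermediate)
{
    __local float tile[256];                 // the big LDS cache lives here
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    tile[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... LDS-limited work with a moderate register budget ...
    intermediate[gid] = tile[lid] * 2.0f;    // placeholder math
}

// Pass 2: the VGPR-heavy half. Reads the intermediate buffer back and does
// the arithmetic-intensive work with its own, separate register budget.
__kernel void pass2(__global const float *intermediate,
                    __global float *output)
{
    const size_t gid = get_global_id(0);
    float acc[16];                           // large per-thread array -> VGPRs

    for (int i = 0; i < 16; ++i)
        acc[i] = intermediate[gid] + (float)i;

    float sum = 0.0f;
    for (int i = 0; i < 16; ++i)
        sum += acc[i];
    output[gid] = sum;                       // placeholder math
}

The price of the split is the extra write/read of the intermediate buffer through global memory, which is exactly the ~14 GB/s Jawed mentions being easy to hide.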
 
Demonstrably false. DirectCompute requires support for work group sizes of at least 1024 threads, for example.
Yes, a GCN CU can run a workgroup of size 40 * 64 threads = 2560 threads. And one workgroup can use 64 kB of LDS.

If you use the maximum workgroup size allowed by DirectX (1024 threads with 32 kB shared memory), a single GCN CU can run two of these simultaneously. But that is true in practice only if your shader uses no more than 32 VGPRs. This is an overly tight limit for complex shaders such as MSAA tiled lighting. If your shader needs even a single register more, the GPU can only fit one workgroup per CU. This means that your occupancy is only 16/40 = 40% of the maximum. This is why a 32x32 lighting tile size is a bad choice for GCN.

Edit: made a typo in my occupancy calculation. I don't have a printed occupancy poster at home :)
256 kB registers per CU, 32*32 = 1024 threads. One register is 4 bytes. We need to fit two groups. So: (256kB registers / 4 bytes) / (1024 threads/block * 2 blocks) = 32 registers/thread.
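Written out as a general constraint (same numbers as above): register-limited workgroups per CU = floor((256 kB / 4 bytes) / (VGPRs per thread * 1024 threads)). At 32 VGPRs that gives 2 groups = 32/40 waves; at 33 VGPRs only 1 group fits, i.e. 16/40 = 40% occupancy.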
 
Look at all the reviews of Fury X. Everybody is quite enthusiastic about AMD finally being somewhat competitive against Nvidia. And about power being much better. And about the size of the card. And then... it all deflates when they say it should have been a bit cheaper compared to the 980 Ti.
If one uses HBM and the other doesn't, and perf is the same, the other will have significant pricing leverage.

Without HBM, Fury wouldn't be competitive at all. If one uses HBM, the other doesn't, and performance is the same, then sure, the other has pricing leverage, but less than it would have if the first vendor hadn't used HBM, because in that case the first vendor wouldn't be competitive at all.

It's the same for 2016. Whatever the case, HBM gives you a perf and/or power advantage that can improve your competitive position and therefore your ASPs, even if manufacturing costs go up as well. It's probably a win for margins.


But I will concede that this is quite likely a prisoner's dilemma: if neither AMD nor NVIDIA use HBM, then they both maintain low costs and the same ASPs as always, and that's a good situation for both vendors.

But whatever the decision of one vendor, it's in the other's interest to use HBM. If one vendor doesn't use it, the other gets a competitive edge from using it. If one vendor does use it, the other loses if it doesn't. This should encourage both vendors to use it, which isn't ideal for them because they'll have higher costs than usual, but the same ASPs as usual due to competitive pressure.

Experience suggests that with prisoner's dilemmas, the participants will choose the option that is to their individual advantage, even if it leads to a sub-optimal situation from a collective point of view, because they have no control over the actions of the other agents. And in this case, coming to an agreement with the other agent might well be considered collusion (I'm not sure about that).

Naturally, this is all predicated on the condition that Samsung/Hynix can manufacture enough HBM for the entire discrete graphics market. If they can't, it's all moot.
 
Do we have any indications at all that AMD ever had plans for a 20nm lineup?

They went on record saying that they would "move to 20nm", or something to that effect. But I don't think they ever specified whether they were talking about APUs or GPUs. They might have been talking about project Skybridge or some other low-power APU. Whatever they were referring to, it's been canceled.
 
Yes. We have also noticed that simpler kernels are almost always a win, as long as the split doesn't cause a big extra BW cost. With DX12 you can precisely control which compute shaders run simultaneously by placing (or not placing) resource barriers between them in the command queue. You don't even need async compute for this. So you have a pretty straightforward way to instruct the GPU to run two shaders simultaneously, even if they access the same resources (this obviously only works as long as you know what you are doing). In DX11 the API was highly conservative and didn't allow this to happen (wasting GPU performance).

I also hope that we soon get kernel-side enqueue (spawning lambdas as kernels) in DirectX. This allows much finer-grained shader execution. With the current GPU execution model of static (worst case) register allocation, small sub-kernels launched by the GPU are a good way to reduce register pressure. This way you only pay for the uncommon branch when you actually hit it, and you can also split the shader efficiently into smaller parts.
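DirectX doesn't expose this yet, but OpenCL 2.0's device-side enqueue gives an idea of the shape of the feature. A minimal hedged sketch (the kernel names, the predicate and the work sizes are made up; it assumes the host has created an on-device default queue):

Code:
// The parent kernel spawns a tiny child launch only for the elements that hit
// the uncommon, register-hungry path, so the common path keeps a small
// register footprint.
void expensive_path(__global float *data, size_t i)
{
    data[i] = native_sqrt(data[i]) * 3.0f;   // placeholder heavy math
}

__kernel void parent(__global float *data, __global const int *needs_work)
{
    const size_t gid = get_global_id(0);

    if (needs_work[gid]) {
        // OpenCL 2.0 device-side enqueue: the block becomes a child kernel.
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(1),
                       ^{ expensive_path(data, gid); });
    }
}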

We obviously need a better (on-chip) way to communicate (pass data) between these shaders. Right now we have to trust the caches. As both NVIDIA and AMD have raised their caches to 2 MB, this might actually not be that big a problem (Intel has had big GPU caches for as long as I can remember).

Yes, those are the GCN 1.0 lane swizzles. NVIDIA has similar instructions (see the CUDA documentation), and Intel has released some OpenCL 2.0 examples and benchmark results showing their cross-lane swizzle gains. Andrew could most likely give you the full details (or a link to an ISA document, if one is publicly available for the Broadwell GPU).
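For completeness: the portable way to get at cross-lane operations from OpenCL 2.x is the cl_khr_subgroups extension. A small hedged sketch of a cross-lane reduction (buffer names are illustrative; on GCN a sub-group maps to a 64-wide wavefront):

Code:
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

// Each sub-group reduces its values with cross-lane operations instead of
// going through LDS; lane 0 of each sub-group writes the result.
__kernel void subgroup_sum(__global const float *input,
                           __global float *per_subgroup_sums)
{
    const size_t gid = get_global_id(0);
    const float sum = sub_group_reduce_add(input[gid]);

    if (get_sub_group_local_id() == 0) {
        const size_t out = get_group_id(0) * get_num_sub_groups()
                         + get_sub_group_id();
        per_subgroup_sums[out] = sum;   // buffer sized to one float per sub-group
    }
}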

This will be a dream for OpenCL path-tracing GPUs (I'm thinking of 3D render engines: LuxCore, V-Ray, Cycles, Pixar's engine, etc.). As for kernel size, on the OpenCL front at least, when done right we see that it works exceptionally well and is blazing fast; again I'm thinking of Lux, which made that transition a while ago, and I hope Cycles will take some ideas from it. AMD has patched Blender Cycles (2.75) to use OpenCL, but it is not really optimal yet: dividing the large kernel into multiple small kernels adds substantial compile time the first time you use real-time rendering or start a render (even worse with an animation).
 
BTW: I don't see this as an AMD vs Nvidia thing. IMO for the next generation, both will continue to use GDDR5 for everything but the highest SKUs.

What I can see is a situation à la 760 Ti / 770 / 780: basically Fury X, Fury and Nano use the highest SKU with HBM across three different mid-to-high-end cards (deriving more GPUs from the top SKU); then, as the SKUs differ, keeping a standard 256-384-bit bus for mid-range and low-end GPUs is not a problem. In any case, the cost of a 4096-bit bus on a small GPU, with low compute power that couldn't use that bandwidth anyway, would make no sense.

The one exception is APUs, and maybe that's where the situation for AMD's SKUs will differ with HBM, since it is a perfect fit for those chips; they could design the low- and mid-range GPUs used in APUs with HBM in mind. In addition, this could be a really good investment for mobile chips (notebooks) and mini PCs, where the APU concept à la AMD is sadly not working as well as it should. (Maybe because, in the end, an Intel dual core with HD 5xxx/6xxx graphics also does the job well enough; for a notebook, a small, good, low-power CPU+IGP is now largely enough in most cases. Look at the latest MacBook.)

Edit.. huum sorry for the double post..
 
Power scales superlinearly with frequency, not linearly, because voltage has to rise along with the clock. Assuming your normal target frequency sits just beyond the "takeoff" portion of that curve, you can save a lot of power by clocking down by, say, 30% or so and get the 40% or so power saving you'd want. The GPU makers of course don't do this most of the time, as it's usually better to take that 5-30% better performance even if it costs 6-40% more power.
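Back-of-the-envelope version of that (assumed numbers, not measurements): dynamic power ≈ C * V^2 * f, and near the top of the curve the voltage has to rise with the clock. A 30% downclock that also allows roughly 10% lower voltage gives P_new / P_old ≈ 0.7 * 0.9^2 ≈ 0.57, i.e. on the order of 40% of the dynamic power saved for 30% less clock.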


True, but we still have to look at the temperature of the chip and at air cooling vs. water cooling; temperature causes a roughly linear increase in leakage.

If it were just frequency I would be more apt to believe it, but since Fury X keeps its chip cooler with the water cooling, it doesn't seem likely anymore.
 
It's the same for 2016. Whatever the case, HBM gives you a perf and/or power advantage that can improve your competitive position and therefore your ASPs, even if manufacturing costs go up as well. It's probably a win for margins.
This is where the price difference between HBM and GDDR5 determines the final outcome. If HBM is 20% more expensive than GDDR5, you're probably right. If it's 100% or 200%, you're probably wrong. In addition, the smaller the GPU, the bigger the cost share of the external memory in the overall cost of the card, especially if the trend towards 6 GB or 8 GB continues. This strengthens the case for staying with GDDR5 for anything but the largest (or the two largest) SKUs.
 
This is where the price difference between HBM and GDDR5 determines the final outcome. If HBM is 20% more expensive than GDDR5, you're probably right. If it's 100% or 200%, you're probably wrong. In addition, the smaller the GPU, the bigger the cost share of the external memory in the overall cost of the card, especially if the trend towards 6 GB or 8 GB continues. This strengthens the case for staying with GDDR5 for anything but the largest (or the two largest) SKUs.

I'm not sure the cost share of the memory would increase that much with smaller GPUs, since you'd expect fewer stacks and a much smaller interposer. Actually, I'd expect things to scale down fairly well. If you were designing a GPU with HBM in mind from the start, and your internal organization were flexible enough, you might even lay it out as a rectangle of the same width as an HBM stack, to get something like this, and minimize the wasted space on the interposer:

[Image: HBM.png (a GPU die laid out to match the width of an HBM stack on the interposer)]
 
Yes, a GCN CU can run a workgroup of size 40 * 64 threads = 2560 threads. And one workgroup can use 64 kB of LDS.

If you use the maximum workgroup size allowed by DirectX (1024 threads with 32 kB shared memory), a single GCN CU can run two of these simultaneously. But that is true in practice only if your shader uses no more than 32 VGPRs. This is an overly tight limit for complex shaders such as MSAA tiled lighting. If your shader needs even a single register more, the GPU can only fit one workgroup per CU. This means that your occupancy is only 16/40 = 40% of the maximum. This is why a 32x32 lighting tile size is a bad choice for GCN.

Edit: made a typo in my occupancy calculation. I don't have a printed occupancy poster at home :)
256 kB registers per CU, 32*32 = 1024 threads. One register is 4 bytes. We need to fit two groups. So: (256kB registers / 4 bytes) / (1024 threads/block * 2 blocks) = 32 registers/thread.
A kernel is free to use as many registers as it needs; it's the compiler that has to work within the limits of the hardware. If the kernel uses more registers than are available, then spilling will occur. With a work group of 1024 threads, you will get up to 64 registers per work-item on GCN. Going over 32 would limit your occupancy, but that might be a better trade-off than spilling.

The developer has to weigh the trade-offs of register file usage vs. data sharing. If you want to share a lot of data between a lot of threads, then the larger work-group size is helpful, but it does limit how many registers you can use without spilling.
 
In any case, the cost of a 4096-bit bus on a small GPU, with low compute power that couldn't use that bandwidth anyway, would make no sense.

It wouldn't be a 4096-bit bus on small GPUs though; it would only be a single stack with a 1024-bit bus. With HBM2 that still provides 256 GB/s of raw bandwidth (1024 bits * 2 Gb/s per pin / 8), which is more than the 980 (non-Ti) has. More than enough for a mid-range GPU next gen.
 
A kernel is free to use as many registers as it needs; it's the compiler that has to work within the limits of the hardware.
If the kernel uses more registers than are available, then spilling will occur.
I was not talking about high-level language features. I am talking about the GCN GPU's physical register files. From that perspective, the (compiled) microcode program is the kernel. A microcode kernel cannot use more than a certain number of physical registers (or it won't fit in the CU). This is also true if you hand-write the shader in GCN microcode.

Some compilers emulate register spilling by various means (to LDS or to memory). The compiler generates additional memory loads and stores to reduce the real physical register count. This (pure software) feature is there just for compatibility (on PC); a programmer should never rely on it intentionally. Memory spilling always makes shaders way too slow to be useful for anything. LDS-based spilling has some niche uses when your shader otherwise doesn't use much LDS. You can store up to six extra registers that way (if you are willing to eat the whole 64 kB of LDS just for spilling). In some corner cases this is enough to push the occupancy one step higher and gain some extra performance, but in most cases LDS spilling is a performance hit.

A good shader compiler should avoid spilling at all costs, and should always give the user a warning if spilling occurs. I would personally prefer to get a compiler error instead of spilling if my shader doesn't fit in a CU. The shader performance (occupancy) would be absolutely awful long before spilling is even considered.
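A hedged sketch of what LDS-based spilling amounts to, written by hand in OpenCL (the compiler normally does this for you, and a real compiler may well keep the value in a register anyway; the 256-thread work group size is illustrative):

Code:
#define WG_SIZE 256

// A value that would otherwise occupy a VGPR for the whole kernel is parked
// in local memory (LDS) and reloaded later. 256 threads * 4 bytes = 1 kB of
// LDS per spilled register per work group.
__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void lds_spill_example(__global const float *input, __global float *output)
{
    __local float spill[WG_SIZE];
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    const float precomputed = input[gid] * 0.5f;   // value we want to keep around

    spill[lid] = precomputed;    // park it in LDS; no barrier needed, each
                                 // thread only touches its own slot
    // ... lots of register-hungry work that no longer has to keep
    //     'precomputed' alive in a VGPR ...
    output[gid] = spill[lid];    // reload it when it is needed again
}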
 
I was not talking about high-level language features. I am talking about the GCN GPU's physical register files. From that perspective, the (compiled) microcode program is the kernel. A microcode kernel cannot use more than a certain number of physical registers (or it won't fit in the CU). This is also true if you hand-write the shader in GCN microcode.
Right, but, as I stated, you are free to spill.
sebbbi said:
Some compilers emulate register spilling by various means (to LDS or to memory). The compiler generates additional memory loads and stores to reduce the real physical register count. This (pure software) feature is there just for compatibility (on PC); a programmer should never rely on it intentionally. Memory spilling always makes shaders way too slow to be useful for anything. LDS-based spilling has some niche uses when your shader otherwise doesn't use much LDS. You can store up to six extra registers that way (if you are willing to eat the whole 64 kB of LDS just for spilling). In some corner cases this is enough to push the occupancy one step higher and gain some extra performance, but in most cases LDS spilling is a performance hit.

A good shader compiler should avoid spilling at all costs, and should always give the user a warning if spilling occurs. I would personally prefer to get a compiler error instead of spilling if my shader doesn't fit in a CU. The shader performance (occupancy) would be absolutely awful long before spilling is even considered.
This is not always the case and I will leave it at that. Also, how would the compiler report warnings about spilling? Sure, this is possible in OpenCL where there is a log for the compilation, but what about other APIs?

Spilling is not always bad; it's completely dependent on whether you have enough work in flight to hide the extra latency and on how often you access the spilled data. If the alternative is spilling or not running at all, which would you choose?
 
Also, how would the compiler report warnings about spilling? Sure, this is possible in OpenCL where there is a log for the compilation, but what about other APIs?
Maybe I am too console centric. I prefer to optimize my shader register counts (occupancy) and ALU count using console tools. PC compilers and analysis tools don't give enough information to optimize shaders efficiently. If the compiler doesn't tell you the occupancy, it is almost impossible to optimize for modern GPUs that are more often than not register pressure bound.
Spilling is not always bad; it's completely dependent on whether you have enough work in flight to hide the extra latency and on how often you access the spilled data. If the alternative is spilling or not running at all, which would you choose?
Spilling a single register to memory in a full screen compute shader pass costs as much as writing plus reading a 32 bit pixel render target (a full screen pass). Spilling four registers costs as much BW as writing and reading your entire G-buffer. Assuming of course that these reads and writes hit memory. If the spilled registers are kept in L1 cache the performance is good.

GCN has 16 kB of L1 data cache per CU. If the CU occupancy is full, 40 * 64 = 2560 threads are being processed simultaneously (though not every single one every clock). One register is 4 bytes, so spilling a single register needs 10 kB of L1. If you are lucky you might be able to spill one register and load it back from L1 before it gets evicted. Loading from L2 costs much more, and if you spill to L2 you have to count all the CUs (as we assume the GPU is running a shader that fills all the CUs for a significant period of time). The L2 runs out very quickly, even on Fiji (2 MB of L2).
 
Maybe I am too console centric. I prefer to optimize my shader register counts (occupancy) and ALU count using console tools. PC compilers and analysis tools don't give enough information to optimize shaders efficiently. If the compiler doesn't tell you the occupancy, it is almost impossible to optimize for modern GPUs that are more often than not register pressure bound.
I agree that PC tools are lacking in this regard.
sebbbi said:
Spilling a single register to memory in a full screen compute shader pass costs as much as writing plus reading a 32 bit pixel render target (a full screen pass). Spilling four registers costs as much BW as writing and reading your entire G-buffer. Assuming of course that these reads and writes hit memory. If the spilled registers are kept in L1 cache the performance is good.

GCN has 16 kB of L1 data cache per CU. If the CU occupancy is full, 40 * 64 = 2560 threads are being processed simultaneously (though not every single one every clock). One register is 4 bytes, so spilling a single register needs 10 kB of L1. If you are lucky you might be able to spill one register and load it back from L1 before it gets evicted. Loading from L2 costs much more, and if you spill to L2 you have to count all the CUs (as we assume the GPU is running a shader that fills all the CUs for a significant period of time). The L2 runs out very quickly, even on Fiji (2 MB of L2).
You only need to worry about the bandwidth of spilling if your kernel is largely memory-bound. If you are compute-bound, then you might have enough work to hide the latency and you likely have bandwidth to spare. This is why it's crucial to leverage LDS as much as possible: It reduces memory and cache pressure and reduces latency. This is one of the most under-exploited resources in compute kernels, at least from my experience with OpenCL.
 
This is why it's crucial to leverage LDS as much as possible: It reduces memory and cache pressure and reduces latency. This is one of the most under-exploited resources in compute kernels, at least from my experience with OpenCL.
Agreed. LDS should be used more. One of the underused scenarios is "once per thread block" variables. It might feel a little odd to store a value further away from the registers only to grab it back later, but on these register-starved machines that is actually a great optimization. It is a waste of register space to keep data in registers when it does not differ between threads. If the kernel has any sub-tiles inside the thread groups that need per-tile data, LDS is the perfect place to store it (instead of keeping it in registers).

GCN has a very good LDS implementation. It is large (64 kB), fast, and doesn't suffer from the same minefield of bank conflicts as previous GPU LDS designs. LDS instructions also have a separate execution port, so they don't waste any VALU cycles. LDS has very fast local atomics that are super handy for many things (and the atomic add/or/and instructions also don't cost VALU cycles, since the LDS performs atomics completely on its own). Global (memory) atomics go to L2 (they are not L1 cached) on GCN, meaning that an LDS atomic reduction before the global atomic is a big help.
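A minimal sketch of that last point (LDS atomic reduction before touching global memory), in OpenCL; the predicate and buffer names are illustrative. Each work group issues one global atomic instead of one per thread:

Code:
// Threads accumulate into a single LDS counter with fast local atomics, and
// only one lane per work group touches the (L2-only) global atomic.
__kernel void count_passing(__global const float *values,
                            __global uint *global_count)
{
    __local uint local_count;
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    if (lid == 0)
        local_count = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (values[gid] > 0.0f)                    // illustrative predicate
        atomic_inc(&local_count);              // fast LDS atomic

    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0 && local_count > 0)
        atomic_add(global_count, local_count); // one global atomic per group
}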
 
I'm not sure the cost share of the memory would increase that much with smaller GPUs, since you'd expect fewer stacks and a much smaller interposer.
People here have been pretty persistent in claiming that the interposer cost is really quite low, but, yes, it's another variable in the set of unknowns.
 