AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

:(

At least this gives me a clear sign that I can safely upgrade my old GPU to a GTX 970 right now. I was willing to wait until Q1 2015 for AMD to unveil their cards, but now that is most likely out of the picture.

You got "most likely" based on some assertion by a couple of random analysts?
 
I hope they have that card based on the 500mm² chip coming out soon. Perhaps my last post was too harsh on it. It should be faster than the 980, and if priced reasonably it could find a decent enough place under big Maxwell.
 
:(

At least this gives me a clear sign that I can safely upgrade my old GPU to a GTX 970 right now. I was willing to wait until Q1 2015 for AMD to unveil their cards, but now that is most likely out of the picture.


Because you believe the information coming from analysts who are doing an Nvidia stock market report? We have no idea where they got their sources (if they had any), or where this information comes from (it could be right or not... I don't know).

That said, the 970 is a good GPU: cheap, low TDP, and it overclocks really well. If you want to buy one, I don't see why you should wait for anything (it's like saying you want to wait for big Maxwell).
 
2H 2015 is a pretty unexciting date.
I'm really looking forward to seeing what can be done with exciting tech like stacked dies, so the wait is painful for a tech geek.

Good for the bank balance in the meantime, though.
 
Maybe they expected some power-savings from the 28HPM process, but it didn't work that well (Tonga), so the >500mm² 28nm GPU is being redesigned for 20nm(?)

Just curious but apart from that Synapse slide, do we have any official confirmation that Tonga is indeed on 28HPM? I haven't seen it mentioned in any of the reviews. AFAIK Maxwell stayed with 28HP.
Fair enough. Thanks for all that. :)

You're welcome :smile: In fact there was even a "leak" that the 20nm shrink was in development, ref this post by mosen from the console forum - http://forum.beyond3d.com/showpost.php?p=1883795&postcount=10560
Saving it for Spring '15? I was wondering if they were spacing products out because they weren't confident about the timeline for the next node.
I highly doubt this. As silent guy said, if you have working silicon you would ship it. If it is ~550 mm² as rumoured, it could still have the performance to challenge NV (but at the cost of higher power). Besides, that large a chip surely would have some features targeted at GPGPU/compute, so it could do well in the professional segment. Maybe they've had a delay or problem of some sort and a respin was required.
 
Summer/Autumn 2015?

So, erm, are we looking at a re-run of R600?: Ambitious memory architecture (with theoreticals to drown in), new process, delays, delays, and awful performance due to terrible balance.

OK, R600 was a totally new architecture, and it seems unlikely this will be a huge departure from GCN.

But I do wonder if the ALU architecture is working too hard for the results it generates. I'm wondering if it was built deliberately to spend lots of power to achieve its very low internal latencies (insanely short ALU pipeline, read-after-write, LDS reads/writes, branching). Also I wonder if the architectural balance in terms of ALUs per byte of register file, is wasting power. I'm wondering if the CUs are spinning like demons even though there's nothing for the ALUs to do, because register over-allocation has cut the number of threads per SIMD.

(I haven't run the numbers: does NVidia now have more bytes of register file per theoretical FLOP? Seems likely...)

Anecdotally I've heard of very recent drivers causing vast increases in power consumption at slight gains in performance on intensive compute kernels.

So even if I'd like to think GCN is in many ways a sane compute architecture, it might have a rotten core. Which might mean AMD tries to tackle this problem with the new chip. Which just increases the chances of fumbling, especially if it's "big chip first".
 
But I do wonder if the ALU architecture is working too hard for the results it generates. I'm wondering if it was built deliberately to spend lots of power to achieve its very low internal latencies (insanely short ALU pipeline, read-after-write, LDS reads/writes, branching). Also I wonder if the architectural balance in terms of ALUs per byte of register file, is wasting power. I'm wondering if the CUs are spinning like demons even though there's nothing for the ALUs to do, because register over-allocation has cut the number of threads per SIMD.

(I haven't run the numbers: does NVidia now have more bytes of register file per theoretical FLOP? Seems likely...)
Both Maxwell and GCN are 1 FMA/thread/clock, so it's possible to calculate both registers/thread and registers/fma quite easily nowadays.

Per FMA
Maxwell: each scheduler has 16384 registers for 32 FMAs -> 512 registers per scalar FMA.
GCN: each SIMD has 16384 registers for 16 FMAs -> 1024 registers per scalar FMA.

GCN actually has twice as many registers per FMA! But...

Per Thread
Maxwell: each scheduler has 16384 registers for 16 warps (32-wide) -> 32 registers per thread for full occupancy.
GCN: each SIMD has 16384 registers for 10 wavefronts (64-wide) -> 25.6 registers per thread for full occupancy.

GCN has slightly fewer registers available per thread for maximum occupancy.
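
Spelling the same arithmetic out, in case anyone wants to sanity-check it (the inputs are just the figures above, nothing measured):

Code:
#include <stdio.h>

/* Register-budget figures quoted above: 16384 32-bit registers per
   Maxwell scheduler partition / per GCN SIMD, 32 vs 16 FMA lanes,
   16 warps (32-wide) vs 10 wavefronts (64-wide). */
int main(void) {
    const double regs = 16384.0;

    /* Registers per scalar FMA lane */
    printf("Maxwell: %.0f regs per FMA\n", regs / 32);            /* 512  */
    printf("GCN:     %.0f regs per FMA\n", regs / 16);            /* 1024 */

    /* Registers per thread at full occupancy */
    printf("Maxwell: %.1f regs per thread\n", regs / (16 * 32));  /* 32.0 */
    printf("GCN:     %.1f regs per thread\n", regs / (10 * 64));  /* 25.6 */
    return 0;
}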

So the question is: does AMD simply support (too?) many threads, and their register file itself is actually more than big enough? Yes and no...

The critical difference between NVIDIA and AMD at this point is Instruction Level Parallelism. NVIDIA's datapath is 6 cycles but a single warp can achieve close to peak if it has an ILP >= 6. On the other hand, GCN's datapath is only 4 cycles, which means the worst-case is better than NVIDIA's, but AFAIK it cannot extract any ILP - so if you only have 1 wavefront ready, you will never achieve more than 25% performance.

This is not a design omission but a fundamental architectural trade-off. AMD's register file seems to be single-banked with one thread's 3 read operands being read 1-by-1 over 3 cycles (in parallel to the previous computation happening for the same thread). This avoids all possible bank conflicts and removes the need for NVIDIA's operand collectors among other things.

Fundamentally it seems very likely that AMD's register file is significantly cheaper than NVIDIA's *per byte* and this is how they can afford 2x as many registers per FMA. But it's also significantly less effective at hiding memory latency *per byte*. NVIDIA is able to get away with much lower occupancy as more complex workloads are also more likely to have more ILP to extract (see: SGEMM on Maxwell), while AMD's compiler has to work harder to reduce register pressure. So it's a fundamental architectural trade-off that will play out differently depending on the workload, I think :)

I don't see what AMD can do to change that trade-off without fundamentally changing the GCN CU/SIMD architecture which I don't expect will happen for a while. I have no personal knowledge of their compiler, but I assume they are already executing memory instructions as early as possible and getting the results back as late as possible. I don't know how good AMD's memory subsystem is, if there's room for improvement then anything that improves internal or external memory latency would be helpful.
 
Thanks for assembling the numbers.

Both Maxwell and GCN are 1 FMA/thread/clock, so it's possible to calculate both registers/thread and registers/fma quite easily nowadays.

Per FMA
Maxwell: each scheduler has 16384 registers for 32 FMAs -> 512 registers per scalar FMA.
GCN: each SIMD has 16384 registers for 16 FMAs -> 1024 registers per scalar FMA.

GCN actually has twice as many registers per FMA! But...

Per Thread
Maxwell: each scheduler has 16384 registers for 16 warps (32-wide) -> 32 registers per thread for full occupancy.
GCN: each SIMD has 16384 registers for 10 wavefronts (64-wide) -> 25.6 registers per thread for full occupancy.

GCN has slightly fewer registers available per thread for maximum occupancy.

So the question is: does AMD simply support (too?) many threads, and their register file itself is actually more than big enough? Yes and no...

The critical difference between NVIDIA and AMD at this point is Instruction Level Parallelism. NVIDIA's datapath is 6 cycles but a single warp can achieve close to peak if it has an ILP >= 6. On the other hand, GCN's datapath is only 4 cycles, which means the worst-case is better than NVIDIA's, but AFAIK it cannot extract any ILP - so if you only have 1 wavefront ready, you will never achieve more than 25% performance.
I don't understand what you're saying here. ILP is quite common. When you say "extract ILP" are you saying NVidia is doing something different to AMD, given the same code?

Are you saying that given register constraints such that a SIMD can only support a single hardware thread and with the same amount of ILP, that GCN will run at 25% of the throughput?

I have a GCN kernel with an allocation of 179 VGPRs, which results in one hardware thread per SIMD. According to CodeXL, this kernel has about 54% VALU usage (as profiled during execution).

NVIDIA is able to get away with much lower occupancy as more complex workloads are also more likely to have more ILP to extract (see: SGEMM on Maxwell), while AMD's compiler has to work harder to reduce register pressure. So it's a fundamental architectural trade-off that will play out differently depending on the workload, I think :)
I was under the impression that the SGEMM on Maxwell is hand-assembled; there is no compiler.

Also, AMD's compiler doesn't, as far as I know, make any effort to reduce register pressure. Sure, it's good at re-using registers to avoid them being idle over portions of a work-item's lifetime. But it also tends to use as many as required until all possible ILP has been extracted, without any sense that sufficient ILP has been extracted and there is no way for the programmer to parameterise what's "sufficient" (without engaging in actual warfare with the compiler).

Which is why the kernel above uses 179 VGPRs (one hardware thread), when it could use less than 80 (3 hardware threads).
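
For reference, a rough sketch of the occupancy arithmetic behind those two numbers, assuming the usual 256 VGPRs per lane per SIMD, a 10-wavefront cap, and an allocation granularity of 4 VGPRs (the granularity is an assumption):

Code:
#include <stdio.h>

/* Rough model: wavefronts per GCN SIMD as a function of per-thread VGPR
   allocation. Assumes a 256-VGPR budget per lane, a cap of 10 wavefronts,
   and allocation rounded up to a granularity of 4 (assumed). */
static int waves_per_simd(int vgprs) {
    const int budget = 256, granularity = 4, max_waves = 10;
    int alloc = ((vgprs + granularity - 1) / granularity) * granularity;
    int waves = budget / alloc;
    return waves < max_waves ? waves : max_waves;
}

int main(void) {
    printf("179 VGPRs -> %d wavefront(s) per SIMD\n", waves_per_simd(179)); /* 1 */
    printf(" 80 VGPRs -> %d wavefront(s) per SIMD\n", waves_per_simd(80));  /* 3 */
    return 0;
}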

I don't see what AMD can do to change that trade-off without fundamentally changing the GCN CU/SIMD architecture which I don't expect will happen for a while. I have no personal knowledge of their compiler, but I assume they are already executing memory instructions as early as possible and getting the results back as late as possible.
The hardware, with direction from the compiler, will take results as soon as they are ready, e.g. a bunch of 4 reads all issued early, together, might result in 4 distinct barriers sprinkled throughout the dependent code.

I don't know how good AMD's memory subsystem is, if there's room for improvement then anything that improves internal or external memory latency would be helpful.
My theory is that GCN is engineered very precisely for very low intra-CU latencies, but does so without regard to power consumption. Or, at the very least, some slackening would lead to a substantial power saving. But slackening something this tightly bound ends-up being a fundamental design change.
 
Gah, you are right, sorry - 1 wavefront *per SIMD* is enough for full VALU occupancy because the wavefront is 64-wide and executed over 4 cycles on a 16-wide VALU with 4 cycles latency.

I was confused because I had a long time ago tried to figure out how expensive it would be to do a GCN-like design but with 16-wide branch granularity where you basically had 4 separate "thread pools" and whether that would result in memory latency benefits. My tentative conclusion was it was only worth it if you really needed the 16-wide granularity. Anyway...

But it also tends to use as many as required until all possible ILP has been extracted, without any sense that sufficient ILP has been extracted and there is no way for the programmer to parameterise what's "sufficient" (without engaging in actual warfare with the compiler).

Which is why the kernel above uses 179 VGPRs (one hardware thread), when it could use less than 80 (3 hardware threads).
That makes little sense. Either I'm missing something or AMD's compiler is flawed :( Again the point is that GCN *does not benefit from VALU ILP*. Unlike what I said previously it will achieve full VALU occupancy with 1 wavefront per SIMD (4 wavefronts per CU) but it will not achieve any higher efficiency if the successive VALU instructions are independent. This is unlike NVIDIA which needs "[Available Warps]*[Available ALU ILP] >= 6" for peak efficiency.
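
A toy model of that contrast, using only the latency figures quoted in this exchange (4-cycle GCN cadence vs ~6-cycle dependent-issue latency on Maxwell); it's a simplification for illustration, not a description of either pipeline:

Code:
#include <stdio.h>

/* Toy model: bubble cycles between two dependent ALU instructions from a
   single warp/wavefront = max(0, result latency - cycles spent issuing one
   instruction), where issue cycles = wavefront width / SIMD width. */
static int bubble_cycles(int wave_width, int simd_width, int result_latency) {
    int issue = wave_width / simd_width;
    int gap = result_latency - issue;
    return gap > 0 ? gap : 0;
}

int main(void) {
    /* GCN: 64-wide wavefront on a 16-wide VALU with 4-cycle latency ->
       no bubble, so one wavefront can keep its SIMD busy on dependent code. */
    printf("GCN-like:     %d bubble cycles\n", bubble_cycles(64, 16, 4));
    /* Maxwell-like: 32-wide warp issued in one cycle, ~6-cycle latency ->
       needs warps * ILP >= 6 to fill the gap. */
    printf("Maxwell-like: %d bubble cycles\n", bubble_cycles(32, 32, 6));
    return 0;
}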

By ILP, do you mean the amount of work that can be done in-between memory accesses maybe? I can see how the behaviour you described ("bunch of 4 reads all issued early, together, might result in 4 distinct barriers sprinkled throughout the dependent code.") is highly desirable and will lead to higher register pressure but better memory latency tolerance. But that is separate from VALU ILP. And I'd be surprised if you really needed to reduce the number of available wavefronts by 3x to reach that!
 
That makes little sense. Either I'm missing something or AMD's compiler is flawed :( Again the point is that GCN *does not benefit from VALU ILP*.
The ALUs, per se, don't benefit. But the compiler likes to re-order memory accesses so that they are in the tightest sequences, making as many accesses as early as possible, preferably contiguously. Additionally, it likes to unroll loops without being asked to do so.

Both of those things generate ILP. You could say it's a hangover from the R600/VLIW ISA.

Unlike what I said previously it will achieve full VALU occupancy with 1 wavefront per SIMD (4 wavefronts per CU) but it will not achieve any higher efficiency if the successive VALU instructions are independent. This is unlike NVIDIA which needs "[Available Warps]*[Available ALU ILP] >= 6" for peak efficiency.
Within a GCN CU, VALU, SALU, LDS and memory instructions are all candidates for issue. Anything that's not VALU can cause the VALU to stall if there's too much of it. So, for example, a kernel with lots of scalar instructions on the SALU will hurt VALU throughput.

Which is why hardware threads are needed to hide non-VALU instruction latency, whether it's LDS, memory or SALU.

By ILP, do you mean the amount of work that can be done in-between memory accesses maybe?
No, I was referring solely to intra-VALU instruction scheduling, which as far as I can tell is where you started from, when talking about requiring 6-way ILP in Maxwell.

I can see how the behaviour you described ("bunch of 4 reads all issued early, together, might result in 4 distinct barriers sprinkled throughout the dependent code.") is highly desirable and will lead to higher register pressure but better memory latency tolerance. But that is separate from VALU ILP. And I'd be surprised if you really needed to reduce the number of available wavefronts by 3x to reach that!
Indeed not, the version of the kernel with 3 hardware threads per SIMD is 50% faster (both do precisely the same count of computations per work item) without the largesse of the maximally re-ordered memory instructions :D
 
Summer/Autumn 2015?
So, erm, are we looking at a re-run of R600?: Ambitious memory architecture (with theoreticals to drown in), new process, delays, delays, and awful performance due to terrible balance.
The memory architecture is a sort-of parallel. The possibility is there that there will be an HBM device or two on-package, but the memory controllers should help insulate the internals of the chip from that change. I haven't seen news about changes to that portion, whereas R600 re-engineered things on-die as well.
The new process is a maybe.
Terrible balance doesn't seem to be in the cards, at least not to the same degree R600 emphasized its ALU capabilities over TEX and ROP capability.

But I do wonder if the ALU architecture is working too hard for the results it generates. I'm wondering if it was built deliberately to spend lots of power to achieve its very low internal latencies (insanely short ALU pipeline, read-after-write, LDS reads/writes, branching).
I doubt it deliberately spent lots of power, but I suspect it prioritized certain other features.
GCN attempts to provide a straightforward machine for compute, one more consistent with the CPU platform it is meant to integrate with.
This means not exposing a number of low-level details AMD once left out there, like exposing forwarding considerations within a single domain (straightforward, some kind of HSA finalizer aid) and continuing to strip out the amount of hidden state that is maintained for a CU context (all the better to help virtualize resources and get CPU-type QoS measures implemented).
However, this comes with the constraint of leveraging a large amount of the existing execution paradigm, and a general emphasis on using simplistic hardware.

I'm wondering if the CUs are spinning like demons even though there's nothing for the ALUs to do, because register over-allocation has cut the number of threads per SIMD.
This seems unlikely, or should be unlikely. AMD's clock gating should be better than that, and if there's no work for the units to do they shouldn't be doing much.

Anecdotally I've heard of very recent drivers causing vast increases in power consumption at slight gains in performance on intensive compute kernels.
Does this include changes to the compiler, and for what hardware? I'd be curious what would have changed. Change in cache policy might be forcing more broadcasts than there once were, and the unevolved cache hierarchy could be a pain point.

So even if I'd like to think GCN is in many ways a sane compute architecture, it might have a rotten core.
I think it was a decent start at introduction, and then apparently it was decided that GCN and its warts were sufficiently comfortable laurels to rest on.

Which might mean AMD tries to tackle this problem with the new chip.
That's hardly guaranteed. Let's note that the original comparison is to R600, after all: dodgy drivers, fighting with the compiler, failing to reach peak...

It is possible AMD will do something, but some of the most notable promised changes have little emphasis on retuning the VALU execution balance, and AMD has not promised better software.
They tend to focus more on QoS and integration, which do not directly help the CUs. Preemption and prioritization schemes tend to make things at least somewhat worse at the CU level.
Some amount of compute context switching is present, and it has been indicated that graphics preemption is coming up next.

One set of outside possibilities includes finding some way to handle divergence, which includes measures like promoting the SALU to be capable of running a scalar variant of VALU code, and another being sporadic discussion of a wavefront repack scheme. There was a paper on promoting the scalar unit, and some research/patents (there was some throwaway commentary from watchimpress about this when Tonga launched) that might cover the repacking. A common refrain is the leveraging of the LDS and its cross-lane capability, and for what it's worth, AMD barely mentioned the introduction of some cross-lane capability with the latest hardware.

My theory is that GCN is engineered very precisely for very low intra-CU latencies, but does so without regard to power consumption. Or, at the very least, some slackening would lead to a substantial power saving. But slackening something this tightly bound ends-up being a fundamental design change.
There has been for a very long time a 4-cycle cadence to part of the execution loop, and AMD kept it.
I think there is a desire for very low apparent latency, but it is the simplistic hardware and throwback execution loop that makes it a problem for the implementation.

The ALUs, per se, don't benefit. But the compiler likes to re-order memory accesses so that they are in the tightest sequences, making as many accesses as early as possible, preferably contiguously. Additionally, it likes to unroll loops without being asked to do so.

Both of those things generate ILP. You could say it's a hangover from the R600/VLIW ISA.
This may play into a simplified heuristic that greedily generates memory-level parallelism and avoids control overhead.
It does leverage an aggressively coalescing memory pipeline and reduces the burden on GCN's less than ideal branching capability.
It does make sense from the perspective of an optimizer for a single wavefront, but then it seems attention went elsewhere once that first step was taken.

I've been musing over whether this has ramifications for AMD's preemption and context switching, as well as other possible design enhancements. Vector memory operations are perhaps the most painful ones to cut off or discard, so the architecture might try to break just before or after long runs of vector memory traffic.

Within a GCN CU, VALU, SALU, LDS and memory instructions are all candidates for issue. Anything that's not VALU can cause the VALU to stall if there's too much of it. So, for example, a kernel with lots of scalar instructions on the SALU will hurt VALU throughput.
The mention of the other types also brings to mind that a CU is not so much a unified processor as it is a bundle of independent pipelines, which are either architecturally constrained from going too far afield (4-cycle loop, single instruction issue per wavefront), or dependent on explicit waitcnt instructions. The compiler heuristic might be tuned to generate work for all these pipelines at the expense of occupancy.
There's still a vector wait counter without a matching wait instruction, at least as of Sea Islands.
 
Does this include changes to the compiler, and for what hardware? I'd be curious what would have changed. Change in cache policy might be forcing more broadcasts than there once were, and the unevolved cache hierarchy could be a pain point.
I've not actually heard of this myself, but given that a Hawaii or later product has its performance intrinsically tied to its input power, it would suggest that if such a change did happen, the application was running at low power/performance in the first place (i.e. it was artificially throttled by some bottleneck somewhere). A change that uses "more power" per unit of throughput would result in a performance loss for apps that are already fully subscribing their power budget.
 
The memory architecture is a sort-of parallel. The possibility is there that there will be an HBM device or two on-package, but the memory controllers should help insulate the internals of the chip from that change. I haven't seen news about changes to that portion, whereas R600 re-engineered things on-die as well.
Won't HBM be a much wider bus at the physical level? e.g. 1024 or 2048 bits?

If so, I'd say "there's your risk right there, it reaches back into the chip's overall bus architecture".
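
For a sense of scale, a back-of-envelope comparison assuming HBM1-class figures (1024-bit per stack at ~1 Gbps/pin) against a Hawaii-style 512-bit GDDR5 bus at 5 Gbps/pin; the pin rates here are assumptions, not leaked specs:

Code:
#include <stdio.h>

/* Back-of-envelope peak bandwidth = bus width (bits) * pin rate (Gbps) / 8. */
static double gbytes_per_s(int bus_bits, double gbps_per_pin) {
    return bus_bits * gbps_per_pin / 8.0;
}

int main(void) {
    printf("512-bit GDDR5 @ 5 Gbps/pin:     %.0f GB/s\n", gbytes_per_s(512, 5.0));  /* 320 */
    printf("1 HBM stack, 1024-bit @ 1 Gbps: %.0f GB/s\n", gbytes_per_s(1024, 1.0)); /* 128 */
    printf("4 HBM stacks, 4096-bit @ 1 Gbps:%.0f GB/s\n", gbytes_per_s(4096, 1.0)); /* 512 */
    return 0;
}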

The new process is a maybe.
Terrible balance doesn't seem to be in the cards, at least not to the same degree R600 emphasized its ALU capabilities over TEX and ROP capability.
Tonga seems like a great example of bad balance. Not enough fillrate for the bandwidth available. ROPs were R600's crushing problem, especially Z-rate.

I doubt it deliberately spent lots of power, but I suspect it prioritized certain other features.
GCN attempts to provide a straightforward machine for compute, one more consistent with the CPU platform it is meant to integrate with.
This means not exposing a number of low-level details AMD once left out there, like exposing forwarding considerations within a single domain(straightforward, some kind of HSA finalizer aid) and continuing to strip out the amount of hidden state that is maintained for a CU context(all the better to help virtualize resources and get CPU-type QoS measures implemented).
However, this comes with the constraint of leveraging a large amount of the existing execution paradigm, and a general emphasis on using simplistic hardware.
Seems to me you've fleshed out my suspicions...

This seems unlikely, or should be unlikely. AMD's clock gating should be better than that, and if there's no work for the units to do they shouldn't be doing much.
As an example, if the VALUs in 1 or more SIMDs are idle due to un-hidden latencies, but other SIMDs within the same CU are working at full speed, are those idle SIMDs burning loads of power?

Does this include changes to the compiler, and for what hardware? I'd be curious what would have changed. Change in cache policy might be forcing more broadcasts than there once were, and the unevolved cache hierarchy could be a pain point.
Rather than leave this to anecdote, I'm going to see if I can evaluate this myself.

One thing I discovered recently with my HD7970 (which is a 1GHz stock card that was launched a while before the 1GHz Editions were official) is that I have to turn up the power limit to stop it throttling. That's despite the fact that some kernels are only getting ~55-75% VALU utilisation and the fact that the card is radically under-volted (1.006V). I don't know if the BIOS is super-conservative, or if it's detecting current, or what. I've just run a test and got 8% more performance on a kernel with 84% VALU utilisation with the board power set to "+20" (is that percent?). I suspect it's still throttling (the test is about 2 minutes long).

I think it was a decent start at introduction, and then apparently it was decided that GCN and its warts were sufficiently comfortable laurels to rest on.
I can't help wondering if HSA has explicitly distracted them from efficiency.

That's hardly guaranteed. Let's note that the original comparison is to R600, after all: dodgy drivers, fighting with the compiler, failing to reach peak...
Simple example: it's possible to write a loop with either a for or a while, both using an int for the loop index/while condition. One will compile to VALU instructions that increment and compare the index; the other uses the SALU:

Code:
  s_add_u32     s2, s2, 4
  s_cmp_le_i32  s10, s2
  s_cbranch_scc0  label_0042

Is that a feature or a bug? What if the latter is faster (not to mention also uses less VGPRs, you know that precious commodity, whereas most scalar registers normally sit around doing nothing)?...

There's still plenty of fail in the compiler.
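
Purely as illustration, here's the kind of source-level difference being described, written as plain C standing in for the kernel code; the thread doesn't say which form maps to the SALU sequence above and which to VALU increments, so no mapping is implied:

Code:
#include <stdio.h>

/* Two semantically identical loops over an int index. The complaint is
   that the compiler treats them differently: one gets scalar index
   arithmetic (s_add_u32/s_cmp_le_i32/s_cbranch_scc0, as shown above),
   the other a per-lane VALU increment and compare. */
static float sum_for(const float *data, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

static float sum_while(const float *data, int n) {
    float acc = 0.0f;
    int i = 0;
    while (i < n) {
        acc += data[i];
        ++i;
    }
    return acc;
}

int main(void) {
    float d[4] = {1, 2, 3, 4};
    printf("%f %f\n", sum_for(d, 4), sum_while(d, 4));
    return 0;
}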

It is possible AMD will do something, but some of the most notable promised changes have little emphasis on retuning the VALU execution balance, and AMD has not promised better software.
They tend to focus more on QoS and integration, which do not directly help the CUs. Preemption and prioritization schemes tend to make things at least somewhat worse at the CU level.
Some amount of compute context switching is present, and it has been indicated that graphics preemption is coming up next.
I dare say all this stuff is desperately needed. The architectural/power cost does need to be paid.

One set of outside possibilities includes finding some way to handle divergence, which includes measures like promoting the SALU to be capable of running a scalar variant of VALU code, and another being sporadic discussion of a wavefront repack scheme. There was a paper on promoting the scalar unit, and some research/patents (there was some throwaway commentary from watchimpress about this when Tonga launched) that might cover the repacking. A common refrain is the leveraging of the LDS and its cross-lane capability, and for what it's worth, AMD barely mentioned the introduction of some cross-lane capability with the latest hardware.
I hadn't noticed that paper. I dare say I've given up on thinking divergence will be tackled.

There has been for a very long time a 4-cycle cadence to part of the execution loop, and AMD kept it.
I think there is a desire for very low apparent latency, but it is the simplistic hardware and throwback execution loop that makes it a problem for the implementation.
Why do you say "apparent"? And what's the problem you're alluding to?

This may play into a simplified heuristic that greedily generates memory-level parallelism and avoids control overhead.
It does leverage an aggressively coalescing memory pipeline and reduces the burden on GCN's less than ideal branching capability.
It does make sense from the perspective of an optimizer for a single wavefront, but then it seems attention went elsewhere once that first step was taken.

I've been musing over whether this has ramifications for AMD's preemption and context switching, as well as other possible design enhancements. Vector memory operations are perhaps the most painful ones to cut off or discard, so the architecture might try to break just before or after long runs of vector memory traffic.
Well, more fundamentally, when a work group is fully occupying a CU (not difficult with >128 VGPR allocation and a work group size of 256), there's a very serious cost if a decision is taken to switch that CU to a different kernel - LDS and all registers are either discarded and the work group is restarted later or all that state is stashed somewhere... With that much pain, maybe the nature of context switching is rather coarse.
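
To put a rough number on "all that state", assuming the commonly cited GCN per-CU figures (64 lanes per wavefront, 4 bytes per VGPR, up to 64 KB of LDS), here's the 256 work-item / 179-VGPR case from earlier:

Code:
#include <stdio.h>

/* Rough estimate of per-CU state that would need stashing on a context
   switch, using commonly cited GCN figures (assumed, not from this thread). */
int main(void) {
    const int waves = 4;              /* 256 work-items / 64 = 4 wavefronts */
    const int vgprs_per_thread = 179;
    const int lanes = 64, bytes_per_reg = 4;
    const int lds_bytes = 64 * 1024;

    int vgpr_bytes = waves * lanes * vgprs_per_thread * bytes_per_reg;
    printf("VGPR state: %d KB\n", vgpr_bytes / 1024);               /* ~179 KB */
    printf("VGPR + LDS: %d KB\n", (vgpr_bytes + lds_bytes) / 1024); /* ~243 KB */
    return 0;
}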

The mention of the other types also brings to mind that a CU is not so much a unified processor as it is a bundle of independent pipelines, which are either architecturally constrained from going too far afield (4-cycle loop, single instruction issue per wavefront), or dependent on explicit waitcnt instructions. The compiler heuristic might be tuned to generate work for all these pipelines at the expense of occupancy.
I think programmers should be able to steer this. The only official steering I'm aware of is specification of unroll (including no unroll). Having to engage in warfare, either with dummy control flow or dummy barriers, is beyond the pale.

There's still a vector wait counter without a matching wait instruction, at least as of Sea Islands.
Would that be a scalar wait instruction (i.e. scalar pipe waiting for VALU to finish)?
 