AMD: Speculation, Rumors, and Discussion (Archive)

The repack element is probably one or more of those scalar ALUs that are really good at INT and conditionals with a high clock. It's also possible they are NOT repacking the register file, just changing references to threads for alignment.
The stack contains pointers.
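If it really is just reference-changing, here's a minimal sketch (purely hypothetical names and structures) of what pointing a narrow SIMD at existing lanes, instead of physically moving register contents, might look like:

```python
# Hypothetical sketch: repacking by indirection instead of physically moving
# register-file contents. Each execution slot of the narrow SIMD is pointed at
# the (wave, lane) whose registers it should read, and the stack keeps the
# pointers needed to route results back later. All names are illustrative.

class RepackEntry:
    def __init__(self, src_wave, src_lane):
        self.src_wave = src_wave   # original wavefront id
        self.src_lane = src_lane   # original lane within that wavefront

def build_remap(waves, pc):
    """Collect lanes from all resident waves whose next instruction is `pc`,
    without touching their register storage. `wave.lane_pc` and `wave.active`
    are assumed bookkeeping, not real hardware state."""
    remap = []
    for wave_id, wave in enumerate(waves):
        for lane, lane_pc in enumerate(wave.lane_pc):
            if lane_pc == pc and wave.active[lane]:
                remap.append(RepackEntry(wave_id, lane))
    return remap   # the SIMD then indexes registers via (src_wave, src_lane)
```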

It's likely possible, but I'd think a lot of threads would need to die for that to be worthwhile.
The document makes that point. It also claims non-earth-shaking performance deltas.

It should also be possible to start with 4 threads/wave and a consolidated register file, which would then get scheduled on a high-performance SIMD.
Which in theory saves power and runs faster at the same time.

It all sounds too good to be true...

The barrier could force realignment/consolidation of all threads in a warp or possibly workgroup. With the async work it's likely not all waves are the same and they'd target concurrent workloads.
The barriers don't need to respect compute workgroups. E.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in the currently active set can have the same set of barriers erected.
 
The barriers don't need to respect compute workgroups. E.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in the currently active set can have the same set of barriers erected.
No, they don't need to respect the workgroups.

But repacking in general sounds bad, as it always requires synchronization on an unknown (because dynamic) set of threads, and that in turn requires a decent overcommitment in order to work around the resulting stalls.

It actually looks even worse than that: if you don't know ahead of time whether there WILL be sufficient threads with the same IP to form a full wavefront again, you would have to resort to timeouts.
For limited nesting, forward signaling on barriers ("I WILL arrive in a finite time", so when you arrive at a barrier you can tell whether to take what you can right away, or to wait instead because e.g. there are 50 threads already there but another 20 are scheduled to arrive soon) could provide a timer-free wait mechanism. But with heavy use of barriers (e.g. branches, not even nested ones, every few dozen instructions) you would end up not signaling far enough ahead, and end up with smaller-than-necessary wavefronts unless the thread pool is sufficiently large.
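A rough sketch of that timer-free decision, just to make the arrived-versus-promised trade-off concrete; the wave size, thresholds and return values are made up for illustration:

```python
# Rough sketch of the timer-free barrier decision described above.
# `arrived` counts threads already waiting at the barrier; `promised` counts
# threads that have forward-signaled "I WILL arrive in finite time".
# Numbers and names are illustrative only.

WAVE_SIZE = 64

def repack_now_or_wait(arrived, promised):
    if arrived >= WAVE_SIZE:
        return "repack"                 # enough for a full wavefront already
    if arrived + promised >= WAVE_SIZE:
        return "wait"                   # a full wavefront is still achievable soon
    return "repack"                     # take what you can; waiting won't help

# Example from the post: 50 threads are there, 20 more are scheduled to arrive.
print(repack_now_or_wait(50, 20))       # -> "wait"
```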

Forking actually does seem promising in terms of what other optimizations it allows, but the repack operations result in a nasty scheduling problem. Even with proper hardware support, you need more aggressive overcommitment to actually profit from it.

What's also not covered at all yet is repacking threads back into the original(!) wavefront layout, which I can only imagine brings a greater penalty if the threads were intentionally grouped for contiguous memory and LDS access per wavefront.

So in summary, it does look promising in terms of keeping utilization up in code with large branches, but only at the cost of undoing possible optimizations on memory access patterns. And for smaller branches, it doesn't look like it could work properly at all. The energy savings are also doubtful considering the additional bookkeeping; at most it helps with sustainable ALU utilization. If the goal is just energy efficiency, not utilization, it's probably better not to attempt forking and joining and to aim for aggressive power gating instead.
 
I'm curious, have you ever read any of the stuff on "dynamic warp formation"? Without wanting to sound more pompous than normal, the problems with re-packing are well known.

But repacking in general sounds bad, as it always requires synchronization on an unknown (because dynamic) set of threads, and that in turn requires a decent overcommitment in order to work around the resulting stalls.
I've said this already.

If a compute unit has free access to all wavefronts, then instead of the maximum population of 10 per SIMD that we see in GCN as candidates for repacking, the limit would be higher. How much higher, who knows. And, of course, register allocation will spoil the party at every opportunity.

Honestly, I'm sceptical about this, but as it's been years since we've talked about repacking (I and others brought up this concept in 2006, maybe earlier?) it's a nice diversion...

It actually looks even worse than that: if you don't know ahead of time whether there WILL be sufficient threads with the same IP to form a full wavefront again, you would have to resort to timeouts.
For limited nesting, forward signaling on barriers ("I WILL arrive in a finite time", so when you arrive at a barrier you can tell whether to take what you can right away, or to wait instead because e.g. there are 50 threads already there but another 20 are scheduled to arrive soon) could provide a timer-free wait mechanism. But with heavy use of barriers (e.g. branches, not even nested ones, every few dozen instructions) you would end up not signaling far enough ahead, and end up with smaller-than-necessary wavefronts unless the thread pool is sufficiently large.
Yes, fragmentation is the killer.

Forking actually does seem promising in terms of what other optimizations it allows, but the repack operations result in a nasty scheduling problem. Even with proper hardware support, you need more aggressive overcommitment to actually profit from it.

What's also not covered at all yet is repacking threads back into the original(!) wavefront layout, which I can only imagine brings a greater penalty if the threads were intentionally grouped for contiguous memory and LDS access per wavefront.
You only unpack results. That's what the stack is there for: it tells you where the results go back to.

Also, remember that it's not difficult, while doing actual work, to beat doing no work. Lanes that are predicated off, or ALUs that are idled to save power while narrower ALUs are fully occupied, are both scenarios in which the CU's actual throughput is substantially reduced, especially over a long loop.

Also as soon as you do LDS/memory access inside incoherent control flow you've already lost contiguousness and friendly cache access patterns. Without repacking all that will happen is that Wavefront 2 iteration 5 follows directly after Wavefront 7 iteration 11, say. And your access patterns are about as random as they could ever be.
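To make the "unpack results via the stack" point concrete, here's a toy sketch; the structures and field names are my own assumptions, not anything from an actual patent:

```python
# Illustrative sketch of "you only unpack results": each packed lane carries a
# pointer back to its original (wave, lane, destination register); when the
# divergent region finishes, results are scattered back, so the original
# wavefront layout (and its memory-friendly grouping) stays untouched.
# `waves[...].vgpr` is assumed storage, purely for illustration.

from collections import namedtuple

StackEntry = namedtuple("StackEntry", ["src_wave", "src_lane", "dst_reg"])

def unpack_results(stack_entries, results, waves):
    """Scatter one result per packed lane back to its home wavefront."""
    for entry, value in zip(stack_entries, results):
        waves[entry.src_wave].vgpr[entry.dst_reg][entry.src_lane] = value
```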

So in summary, it does look promising in terms of keeping utilization up in code with large branches, but only at the cost of undoing possible optimizations on memory access patterns. And for smaller branches, it doesn't look like it could work properly at all. The energy savings are also doubtful considering the additional bookkeeping; at most it helps with sustainable ALU utilization. If the goal is just energy efficiency, not utilization, it's probably better not to attempt forking and joining and to aim for aggressive power gating instead.
Don't use the terms fork/join, as that's what normally occurs in today's GPUs.
 
You only unpack results. That's what the stack is there for: it tells you where the results go back to.
[...]
Also as soon as you do LDS/memory access inside incoherent control flow you've already lost contiguousness and friendly cache access patterns. Without repacking all that will happen is that Wavefront 2 iteration 5 follows directly after Wavefront 7 iteration 11, say. And your access patterns are about as random as they could ever be.
I didn't realize you would keep the original wavefronts on the stack, so I assumed the incoherent regions would also poison the following coherent regions as the newly assembled wavefronts carried over.

Thanks, makes sense to me now.
 
The barriers don't need to respect compute workgroups. E.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in the currently active set can have the same set of barriers erected.
Repacking per workgroup after an early exit (primitive discard accelerator?) or costly diverging paths makes sense. Possibly the scheduler could identify when a wave has diverged significantly and regroup/reduce it, in addition to compiler instructions for heavy branches.
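Something like this toy heuristic is what I'm imagining for the scheduler side (the threshold and the active-mask bookkeeping are pure assumptions on my part):

```python
# Toy heuristic for flagging a significantly diverged wave for regrouping.
# The 50% threshold and the notion of an exposed active mask are assumptions.

def should_regroup(active_mask, wave_size=64, threshold=0.5):
    active = bin(active_mask).count("1")
    return active / wave_size < threshold   # mostly-idle wave -> candidate

print(should_regroup(0x0000_0000_0000_00FF))  # 8 of 64 lanes active -> True
```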

If a compute unit has free access to all wavefronts, then instead of the maximum population of 10 per SIMD that we see in GCN as candidates for repacking, the limit would be higher. How much higher, who knows. And, of course, register allocation will spoil the party at every opportunity.
Definitely makes sense to increase the limit if you're adding small, high priority workloads.

Yes, fragmentation is the killer.
Keep in mind they don't need a full wavefront; they need close to a power-of-two quantity of threads, likely originating from the same wave. So fragmentation shouldn't spread across the register file in that case. Less efficient, but it should be quick and still increase utilization. It might also be conceivable to split a wave, which would have been pointless with fixed, large SIMD sizes. That should help caching, as the threads all decided to diverge in the same direction and could hit different SIMDs concurrently. SIMDs might not all be 16 wide either.

Possibly the register file needs to be rearranged when consolidating more than one wave spanning multiple banks, but a lot of these patents seem to suggest optimizing a single wave. If you were repacking workgroups to get 64-thread waves, you wouldn't need smaller SIMDs, save for the remainder threads that could otherwise run on the scalar processor.
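As a rough illustration of the "close to a pow2 quantity of threads" idea above (the SIMD widths and the splitting policy are just assumptions):

```python
# Sketch: take the lanes of one wave that diverged the same way and carve them
# into power-of-two groups that can be issued to (possibly different-width)
# SIMDs. Widths and policy are illustrative, not from any patent.

def split_into_pow2_groups(taken_lanes, simd_widths=(16, 8, 4)):
    groups, remaining = [], list(taken_lanes)
    while remaining:
        for width in simd_widths:
            if len(remaining) >= width:
                groups.append(remaining[:width])
                remaining = remaining[width:]
                break
        else:
            groups.append(remaining)    # leftover threads; could go to the scalar unit
            remaining = []
    return groups

# e.g. 29 lanes took the branch -> groups of 16, 8, 4, and 1 leftover
print([len(g) for g in split_into_pow2_groups(range(29))])  # [16, 8, 4, 1]
```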

Also, remember that it's not difficult, while doing actual work, to beat doing no work. Lanes that are predicated off, or ALUs that are idled to save power while narrower ALUs are fully occupied, are both scenarios in which the CU's actual throughput is substantially reduced, especially over a long loop.
Reduced throughput for faster evaluation likely still makes sense for small, time-sensitive workloads, situations that might arise with physics simulations and VR. It's only reasonable if a SIMD ramps up clock speed as it gets smaller. As per the patent, some SIMDs might start smaller (it mentioned 8 wide), so utilization wouldn't be affected. There could conceivably be 4(#?) wide specialty SIMDs. So it's possible not all SIMDs are power gated either.
 
One big question about that patent: how would the chip synchronize all of these different thread lengths and warp sizes? Seems like a headache. The patent doesn't really go into that at all; the proposed theory sounds great on paper...
 
One big question about that patent: how would the chip synchronize all of these different thread lengths and warp sizes? Seems like a headache. The patent doesn't really go into that at all; the proposed theory sounds great on paper...

Well, patents never disclose the final implementation, for obvious reasons. I'm just a layman, but it doesn't sound like it's feasible for the upcoming generation.
 
One big question about that patent: how would the chip synchronize all of these different thread lengths and warp sizes? Seems like a headache. The patent doesn't really go into that at all; the proposed theory sounds great on paper...
Same way they do now: stop dispatching/scheduling waves at a sync point until a condition is met, and break down and page out data if you have a huge problem. Branching can already change the thread length, caching (etc.) can affect the execution time, and partial waves already happen. Reorganizing waves wouldn't affect the process so long as they all check in upon completion.
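A minimal sketch of that "check in upon completion" bookkeeping, assuming the barrier only counts work-items and doesn't care how they're grouped into waves:

```python
# Sketch: the barrier tracks how many work-items have reached the sync point,
# not how they are grouped into waves, so reorganized/partial waves don't
# change the bookkeeping. Counts and names are illustrative.

class WorkgroupBarrier:
    def __init__(self, total_work_items):
        self.total = total_work_items
        self.checked_in = 0

    def wave_arrives(self, wave_active_lanes):
        self.checked_in += wave_active_lanes
        return self.checked_in >= self.total   # True -> release the workgroup

barrier = WorkgroupBarrier(256)
for lanes in (64, 64, 40, 24, 64):     # repacked waves of varying sizes
    released = barrier.wave_arrives(lanes)
print(released)                        # True once all 256 items checked in
```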
 
It seems to me, for any implementation not migrating contexts by spilling, that the problem would actually be the port/bank conflicts in the datapath in the case of an RF spanning across multiple varied-width SIMDs.

A wider-than-ever multiport register file could be the fanciest rescue, though. Say a 16-lane RF with at least 12R3W ports and 256+ entries per lane, so as to serve a 16-wide, a 4-wide and a scalar unit concurrently (or 3R1W fewer if the scalar unit stays as it is). Then, with a full 16-to-4 (and 16-to-1 for the scalar) crossbar, one could perhaps freely split or repack wavefronts by knowing only the relative RF indexes of the realigned lanes.
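A toy sketch of what "knowing only the relative RF indexes of the realigned lanes" could mean in practice; the widths follow the post above, everything else is illustrative:

```python
# With a full 16-to-4 crossbar, a 4-wide unit can read any 4 of the 16 RF
# lanes in one access, so repacking reduces to choosing lane indexes rather
# than copying registers. Purely a sketch, not a datapath proposal.

RF_LANES = 16

def crossbar_read(rf_row, lane_select):
    """rf_row: one register entry across all 16 RF lanes.
    lane_select: the 4 relative lane indexes the repacked quad needs."""
    assert len(lane_select) == 4 and all(0 <= i < RF_LANES for i in lane_select)
    return [rf_row[i] for i in lane_select]

# e.g. the repacked quad is built from original lanes 1, 6, 7 and 12
print(crossbar_read(list(range(100, 116)), [1, 6, 7, 12]))  # [101, 106, 107, 112]
```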
 
Well, the whole premise of a SIMD is throughput, right?

So with different thread lengths (which happens right now), and adding different warp sizes on top of that, you introduce even more divergence; there really is no way around this. And unrolling each thread beforehand to predict everything will take up quite a bit of on-chip memory; will there be enough space on chip to do this?

The variance from having different warp lengths creates an increased need for sync dependence (we already do some of this for different thread lengths), even more than what we have now. It's not as simple as downclocking or using more cycles; it will actually stall the pipeline in most cases.
 
AMD Radeon R7 470 and R9 480

The new Polaris GPUs are built on the new and energy-efficient 14 nm FinFET process.
Let's break it down:

  • Radeon R7 470 - This SKU series will be based on 14 nm "Baffin" (aka Polaris 11), rumored to be a 50 Watt TDP card.
  • Radeon R9 480 - This SKU is based on 14 nm "Ellesmere" (aka Polaris 10), rumored to be a ~130 Watt TDP card.
Obviously for the avid PC gamer the R9 480 will be an interesting SKU, as rumors right now point to a GPU that holds 2560 shader processors (GCN iteration 4). The R9 480 would get 2304 active shader processors (leaving room in the GPU for a 480X model).

That means up to 40 shader processor clusters (each holding 64 SPs):

  • 36 x 64 = 2304 (Radeon R9 480 / Polaris 10)
  • 40 x 64 = 2560 (Radeon R9 480X / Polaris 10)
We are looking at a 256-bit wide memory bus, yet it is unclear whether that'll hold GDDR5 or doubled bandwidth with GDDR5X. Polaris 10 is expected to be clocked around the 1 GHz mark on the GPU core frequency.
http://www.guru3d.com/news-story/amd-radeon-r7-470-and-r9-480-at-computex-or-e3.html
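For context, a back-of-the-envelope bandwidth comparison for a 256-bit bus; the per-pin data rates below are assumptions based on typical GDDR5 and GDDR5X figures, not numbers from the article:

```python
# Memory bandwidth = bus width (bytes) * effective per-pin data rate.
# The 8 and 10 Gbps figures are assumed typical values, not from the article.

def bandwidth_gbs(bus_width_bits, data_rate_gbps):
    return bus_width_bits / 8 * data_rate_gbps   # GB/s

print(bandwidth_gbs(256, 8))    # 256.0 GB/s with 8 Gbps GDDR5
print(bandwidth_gbs(256, 10))   # 320.0 GB/s with 10 Gbps GDDR5X
```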
 
~1 GHz doesn't sound like much has changed with the process transition. This might be interesting to watch.
As interesting as watching slow motion crash test videos!

I saw those numbers elsewhere, but I think they're for mobile versions. They can't possibly have decided to value power efficiency that much over absolute performance.
 
As interesting as watching slow motion crash test videos!

I saw those numbers elsewhere, but I think they're for mobile versions. They can't possibly have decided to value power efficiency that much over absolute performance.

There's this forum where they believe 1GHz is all AMD really needs.

But yes, I agree: no way, no how are 1GHz the final clocks. For desktop SKUs, that is.
 
~1 GHz doesn't sound like much has changed with the process transition. This might be interesting to watch.

On the other hand, AMD has claimed over and over again that Polaris' focus is power efficiency. Vega may be optimized for higher clocks, whereas Polaris was made to keep power consumption as low as possible (and to make the transition to laptops easier).
 
480 to be 2.2+ times more powerful than the 470?!
And, looking at everything, the 480(X) should be better than the Furys overall, so AMD is really counting on these two GPUs and the Pro Duo (lel)? Meanwhile GP104 will be beasting and GP106 is already here as well, hmm...
 
By the time the Vegas come out late this year, there's plenty of time to phase out the Fury (X).

No doubt. I was talking about the gap between 480(X) and Fury series. Unless of course AMD slots 480 at the same price level as 390. But that's very unlikely.
 