Jawed
> Oops, LDS was local data storage, not the load store unit. So my previous post didn't make much sense, or rather described something entirely different.

Yep, I gave up.
> The repack element is probably one or more of those scalar ALUs that are really good at INT and conditionals with a high clock. It's also possible they are NOT repacking the register file, just changing references to threads for alignment.

The stack contains pointers.
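The "changing references to threads instead of physically repacking the register file" idea can be sketched as a toy model. This is purely illustrative Python of my own (none of these names come from the patent): the scheduler keeps a mapping from lanes to register-file rows and, on divergence, gathers only the row indices of lanes that took one side of the branch.

```python
# Toy model: repacking divergent threads by remapping lane -> register-row
# references instead of physically moving register data.
# All names here are illustrative, not from the patent.

def repack_by_reference(lane_to_row, active_mask):
    """Build a compacted wavefront: leave the register rows in place and
    gather only the row indices of lanes that took this branch side."""
    return [row for row, active in zip(lane_to_row, active_mask) if active]

# Two 8-wide wavefronts whose lanes diverged on the same branch:
wave0_rows = [0, 1, 2, 3, 4, 5, 6, 7]        # register rows backing wave 0
wave1_rows = [8, 9, 10, 11, 12, 13, 14, 15]  # register rows backing wave 1
wave0_taken = [1, 0, 1, 1, 0, 0, 1, 0]       # which lanes took the branch
wave1_taken = [0, 1, 1, 0, 1, 0, 0, 1]

# Compact the "taken" lanes of both waves into one fuller wavefront.
taken = repack_by_reference(wave0_rows, wave0_taken) + \
        repack_by_reference(wave1_rows, wave1_taken)
print(taken)  # [0, 2, 3, 6, 9, 10, 12, 15] -- no register data was copied
```

The point of the sketch: only the small indirection table is rebuilt, which is the kind of INT-and-conditionals work a scalar ALU could handle.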
> It's likely possible, but I'd think a lot of threads would need to die for that to be worthwhile.

The document makes that point. It also claims non-earth-shaking performance deltas.
> It should also be possible to start with 4 threads/wave, and a consolidated register file, that will get scheduled on a high performance SIMD.

Which in theory saves power and runs faster at the same time.
> The barrier could force realignment/consolidation of all threads in a warp or possibly workgroup. With the async work it's likely not all waves are the same and they'd target concurrent workloads.

The barriers don't need to respect compute workgroups. e.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in that set that's currently active can have the same set of barriers erected.
> The barriers don't need to respect compute workgroups. e.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in that set that's currently active can have the same set of barriers erected.

No, they don't need to respect the workgroups.
> But repacking in general sounds bad as it always requires synchronization on an unknown (because dynamic) set of threads, and that in return requires a decent over commitment in order to work around the resulting stalls.

I've said this already.
> Or rather, it looks even worse than that: if you don't know ahead of time whether there WILL be sufficient threads with the same IP to form a full wavefront again, you would have to utilize timeouts.

Yes, fragmentation is the killer.
For limited nesting, forward signaling on barriers ("I WILL arrive in a finite time") could provide a timer-free wait mechanism: when you arrive at a barrier, you can tell whether to take what you can right away, or to wait instead because e.g. there are 50 threads already there but another 20 are scheduled to arrive soon. But with heavy use of barriers (e.g. branches, not even nested ones, every few dozen instructions) you would end up not signaling far enough ahead, and in return end up with smaller-than-necessary wavefronts if the thread pool isn't sufficiently large.
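The forward-signaling idea can be sketched as a small model. This is a hypothetical Python sketch of my own (class and method names invented): the barrier counts both arrived threads and threads that have announced they will arrive, and only dispatches a partial wavefront once no further arrivals are promised.

```python
# Sketch of "forward signaling" on a repack barrier: threads announce
# they WILL reach the barrier, so the barrier can choose between waiting
# and dispatching a partial wavefront without any timeout. Hypothetical.

class RepackBarrier:
    def __init__(self, wave_size):
        self.wave_size = wave_size
        self.arrived = 0
        self.promised = 0   # announced ("I WILL arrive") but not yet here

    def announce(self, n=1):
        """A thread signals ahead: 'I will arrive in a finite time'."""
        self.promised += n

    def arrive(self):
        self.promised -= 1
        self.arrived += 1

    def try_dispatch(self):
        """Dispatch a full wave if possible; dispatch a partial wave only
        when nobody else has promised to arrive; otherwise keep waiting."""
        if self.arrived >= self.wave_size:
            self.arrived -= self.wave_size
            return self.wave_size
        if self.arrived > 0 and self.promised == 0:
            n, self.arrived = self.arrived, 0
            return n
        return 0  # worth waiting: more threads are on their way

b = RepackBarrier(wave_size=64)
for _ in range(70):
    b.announce()
for _ in range(50):
    b.arrive()
print(b.try_dispatch())  # 0: 50 arrived, but 20 more are promised
for _ in range(20):
    b.arrive()
print(b.try_dispatch())  # 64: a full wave can go
print(b.try_dispatch())  # 6: remainder goes out as a partial wave
```

The failure mode described above shows up here too: if announcements come only a few instructions ahead, `promised` is often zero when threads arrive, and the barrier degenerates into dispatching small partial waves.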
> Forking actually does seem promising in terms of what other optimizations it allows, but the repack operations result in a nasty scheduling problem, even with proper hardware support, and you need more aggressive over commitment to actually profit from it.

You only unpack results. That's what the stack is there for: it tells you where the results go back to.
What's also not covered yet at all is repacking threads into the original(!) wavefront layout, which I can only imagine brings a greater penalty if the threads were intentionally grouped for contiguous memory and LDS access per wavefront.
> So in summary, it does look promising in terms of keeping the utilization up in code with large branches, but only at the cost of undoing possible optimizations on memory access patterns. And for smaller branches, it doesn't look like it could work properly at all. The energy savings are also doubtful considering the additional bookkeeping; it only helps with sustainable ALU utilization at most. If it's just energy efficiency, not utilization, it's probably better not to attempt forking and joining and to aim for aggressive power gating instead.

Don't use the terms fork/join, as that's what normally occurs in today's GPUs.
> You only unpack results. That's what the stack is there for: it tells you where the results go back to.

I didn't realize you would keep the original wavefronts on the stack, so I assumed the incoherent regions would also poison the following coherent regions as the newly assembled wavefronts carried over.
[...]
Also as soon as you do LDS/memory access inside incoherent control flow you've already lost contiguousness and friendly cache access patterns. Without repacking all that will happen is that Wavefront 2 iteration 5 follows directly after Wavefront 7 iteration 11, say. And your access patterns are about as random as they could ever be.
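That randomization is easy to put numbers on. A toy Python model of my own (invented wave width and survivor list): lanes of one intact wavefront touch a contiguous address block, while a request stream drawn from arbitrary (wave, iteration/lane) pairs, whether interleaved in time or compacted into one wave, has essentially random strides.

```python
# Toy model of the access-pattern point: an intact wavefront reads a
# contiguous block, but requests drawn from arbitrary (wave, lane) pairs
# of many diverged wavefronts do not. Numbers are invented for illustration.

WAVE = 8  # toy wavefront width

def lane_address(wave_id, lane):
    """Contiguous per-wave layout: wave w, lane l -> w*WAVE + l."""
    return wave_id * WAVE + lane

# Original wave 2 touches a contiguous block:
print([lane_address(2, l) for l in range(WAVE)])
# -> [16, 17, 18, 19, 20, 21, 22, 23]

# A stream built from surviving (wave, lane) pairs of many diverged waves:
survivors = [(2, 5), (7, 3), (0, 1), (5, 6), (3, 0), (7, 0), (1, 7), (4, 2)]
print([lane_address(w, l) for w, l in survivors])
# -> [21, 59, 1, 46, 24, 56, 15, 34], effectively random strides
```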
> The barriers don't need to respect compute workgroups. e.g. it's possible to run several 256-work-item workgroups on a CU, and when repacking, all wavefronts in that set that's currently active can have the same set of barriers erected.

Per workgroup after an early exit (primitive discard accelerator?) or costly diverging paths makes sense. Possibly the scheduler could identify when a wave diverged significantly and regroup/reduce it, in addition to compiler instructions for heavy branches.
> If a compute unit has free access to all wavefronts, then instead of a population of 10 maximum on any SIMD as candidates for repacking that we see in GCN, the limit would be higher. How much higher, who knows. And, of course, register allocation will spoil the party at every opportunity.

Definitely makes sense to increase the limit if you're adding small, high priority workloads.
> Yes, fragmentation is the killer.

Keep in mind they don't need a full wavefront; they need close to a pow2 quantity of threads, likely originating from the same wave. So fragmentation shouldn't spread across the register file in that case. Less efficient, but should be quick and still increase utilization. It might also be conceivable to split a wave, which would have been pointless with fixed, large SIMD sizes. That should help caching, as they all decided to diverge the same direction and could hit different SIMDs concurrently. SIMDs might not all be 16 wide either.
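The "close to a pow2 quantity" idea can be made concrete. A hypothetical Python sketch of my own: decompose the lane count of each branch direction into power-of-two chunks, which could then be issued to SIMDs of different widths.

```python
# Sketch of splitting a diverged wave into power-of-two subgroups per
# branch direction, so they could be issued to narrower SIMDs.
# Illustrative only; the function name is my own.

def pow2_subgroups(n):
    """Decompose a lane count into power-of-two chunks, largest first."""
    chunks, bit = [], 1 << n.bit_length()
    while n:
        bit >>= 1
        if n & bit:
            chunks.append(bit)
            n -= bit
    return chunks

# A 64-wide wave where 23 lanes took the branch and 41 did not:
print(pow2_subgroups(23))  # [16, 4, 2, 1]
print(pow2_subgroups(41))  # [32, 8, 1]
```

With fixed 16-wide SIMDs this split buys nothing, but with e.g. 16-, 8- and 4-wide SIMDs available, the larger chunks could run concurrently on different units.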
> Also, remember that it's not difficult, while doing actual work, to beat doing no work. Lanes that are predicated off, or ALUs that are idled to save power while narrower ALUs are fully occupied, are both scenarios in which the CU's actual throughput is substantially reduced, especially over a long loop.

Reduced throughput for faster evaluation likely still makes sense for small, time-sensitive workloads, situations that might arise with physics simulations and VR. It's only reasonable if a SIMD ramps up clockspeed as it gets smaller. As per the patent, some SIMDs might start smaller (8 wide is mentioned), so utilization wouldn't be affected. There could conceivably be 4(#?) wide specialty SIMDs, so it's possible not all SIMDs are power gated either.
One big question about that patent: how would the chip synchronize all of these different thread lengths and warp sizes? Seems like a headache. The patent doesn't really go into that at all; the proposed theory sounds great on paper...
> One big question about that patent: how would the chip synchronize all of these different thread lengths and warp sizes? Seems like a headache. The patent doesn't really go into that at all; the proposed theory sounds great on paper...

Same way they do now. Stop dispatching/scheduling waves at a sync point until a condition is met. Break down and page out data if you have a huge problem. Branching can already change the thread length, caching (etc.) can affect the execution time, and partial waves already happen. Reorganizing waves wouldn't affect the process so long as they all check in upon completion.
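The "they all check in upon completion" point can be sketched in a few lines. Hypothetical Python model of my own: a sync point only needs to count finished work-items, so it is indifferent to how wide the waves are that report in.

```python
# Sketch of the "check in upon completion" point: a sync point counts
# work-items, not waves, so waves of any width (repacked, partial, or
# full) can participate. Hypothetical model, names are my own.

class SyncPoint:
    def __init__(self, total_items):
        self.remaining = total_items

    def check_in(self, wave_width):
        """A wave of any width reports its finished work-items."""
        self.remaining -= wave_width
        return self.remaining == 0  # True once dispatch may resume

sync = SyncPoint(total_items=256)   # e.g. one 256-work-item workgroup
done = False
for width in (64, 64, 32, 16, 48, 32):  # mixed widths after repacking
    done = sync.check_in(width)
print(done)  # True: 64+64+32+16+48+32 == 256
```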
http://www.guru3d.com/news-story/amd-radeon-r7-470-and-r9-480-at-computex-or-e3.html

The new Polaris based GPUs are based on the new and energy efficient 14nm FinFET process. Let's break it down:

- Radeon R7 470 - This SKU series will be based on 14 nm "Baffin" (aka Polaris 11), rumored to be a 50 Watt TDP card.
- Radeon R9 480 - This SKU is based on 14 nm "Ellesmere" (aka Polaris 10), rumored to be a ~130 Watt TDP card.

Obviously, for the avid PC gamer the R9 480 will be an interesting SKU, as rumors right now point to a GPU that holds 2560 shader processors (GCN iteration 4). The R9 480 would get an active 2304 shader processors (leaving room in the GPU for a 480X model). That means 40 shader processor clusters (each holding 64 SPs):

- 36 x 64 = 2304 (Radeon R9 480 / Polaris 10)
- 40 x 64 = 2560 (Radeon R9 480X / Polaris 10)

We are looking at a 256-bit wide memory bus, yet it is unclear whether that will hold GDDR5 or doubled bandwidth with GDDR5X. Polaris 10 is expected to be clocked around the 1 GHz marker on the GPU core frequency.
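The cluster math checks out against GCN's 64-SPs-per-CU grouping:

```python
# GCN groups 64 shader processors per compute unit.
SP_PER_CU = 64
print(36 * SP_PER_CU)  # 2304 (rumored Radeon R9 480)
print(40 * SP_PER_CU)  # 2560 (rumored R9 480X / full Polaris 10)
```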
> ~1 GHz doesn't sound like much has changed with the process transition. This might be interesting to watch.

As interesting as watching slow motion crash test videos!
I saw those numbers elsewhere, but I think they're for mobile versions. They can't possibly have decided to value power efficiency that much over absolute performance.
> R9 490(X) 300mm2 Vega 11?

By the time the Vegas get out late this year, there's plenty of time to phase out Fury (X).
There's gonna be quite a price gap if they don't lower Fury(X) price to 390(X) level.