The paper makes reference to the scalar being able to execute its own instruction stream. What if, however, that second stream weren't a scalar, but in fact a hybrid scalar/vector, akin to ALU and FPU instructions in the same stream?
That would reduce the second thread to what GCN already does with scalar and vector instructions in the same stream.
Perhaps more distinct from GCN is the helper-thread model, where there is a semi-independent thread that is still kept in sync with the main one.
Nothing spelled out at the GCN ISA level provides for getting a specific thread co-resident with another, though it might be something that could be hacked in software or handled at a higher level via the various graphics pipeline stages.
Some of the descriptions for Polaris' instruction prefetch indicate that it happens when wavefronts are co-resident in a given CU or CU group, so there might be hints of that kind of context information filtering down.
The scalars already break the cadence with scalar code.
In the proposed scheme, or in GCN? With GCN currently, they do not.
AMD's current scalar unit unfortunately only has an integer instruction set, so ALU offloading is basically limited to address calculation (of coherent memory reads) and branch/jump-related code (jumps are always wave-coherent, as there's only one program counter per wave). AMD added scalar memory stores in GCN3. I was hoping for a full float instruction set. The scalar unit is tiny compared to the SIMDs, so adding float support wouldn't cost much, but it would allow much better scalar offloading. This would mean increased performance and reduced power usage. I don't know why they haven't done this already. If they do any scalar-unit-related improvements, I would expect full float support to be high on the list.
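The offloading opportunity can be sketched with a toy model (Python stand-in, not GCN ISA; the names `vector_calc` and `scalar_calc` are illustrative): a wave-coherent address is identical in every lane of the wavefront, so one scalar op can stand in for 64 redundant vector ops.

```python
# Toy model of scalar offloading, not actual hardware or ISA.
WAVE_SIZE = 64  # GCN wavefront width

def vector_calc(descriptor_base, offset):
    # Without offloading: all 64 lanes compute the same address,
    # burning a full vector-ALU issue on wave-uniform work.
    return [descriptor_base + offset for _ in range(WAVE_SIZE)]

def scalar_calc(descriptor_base, offset):
    # With offloading: the address of a coherent read is identical in
    # every lane, so the scalar unit computes it once. Integer-only
    # today; float support would widen what qualifies for this path.
    return descriptor_base + offset

# One scalar result substitutes for 64 identical vector results.
assert vector_calc(0x1000, 16) == [scalar_calc(0x1000, 16)] * WAVE_SIZE
```

The same argument applies to branch code: with one program counter per wave, the branch decision is wave-uniform by construction, which is why it already lives on the scalar side.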
One possibility I have been pondering lately is whether handling FP in the scalar domain would have added too much complexity at an acceptable level of support, and not enough upside if kept simple enough to implement.
This goes to my earlier musing about how different the command stream appears at the instruction buffer, or in the sequencing after it. The CU seems to be structured into multiple independently sequenced pipelines with varying behaviors, with internal command words that either carry over multiple cycles or are possibly issued as separate internal ops per cycle.
The instruction buffers have some amount of sequencing logic, while the decode path after them would have domain-specific logic and sequencing.
The way the CU distributes its resources, the FP path has some unknown amount of control hardware and control logic per SIMD, which means the designers saw fit to give it the resources for 4 paths/buffers/sequencers while allowing individual instructions to spread that overhead over 4 cycles.
The scalar unit is shared between all SIMDs, so its sequencing and control pipeline has 4x the demand and needs to support different threads/commands every cycle. This could mean the cost of expanding what the scalar unit can do is higher than it seems, given the larger representation for each cycle's command, the fact that it is shared and active every cycle, and the scalar domain's interaction with centrally important contexts while providing operands to distant parts of the CU.
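The asymmetry here can be made concrete with a toy issue model (my own sketch, not AMD documentation): the four SIMDs take turns issuing, so each SIMD's own sequencing logic only has to produce a new command every fourth cycle, while the single shared scalar unit can be handed work from a different wave on every cycle.

```python
# Toy round-robin issue model of a GCN-style CU. Assumptions (mine):
# one vector issue slot rotates among 4 SIMDs, and the shared scalar
# unit may receive a command on every cycle.
NUM_SIMDS = 4

def simulate(cycles):
    simd_issues = [0] * NUM_SIMDS   # vector issues per SIMD
    scalar_issues = 0               # commands seen by the shared scalar unit
    for cycle in range(cycles):
        owner = cycle % NUM_SIMDS   # round-robin issue slot
        simd_issues[owner] += 1     # that SIMD starts one vector op
        scalar_issues += 1          # scalar unit can get work every cycle
    return simd_issues, scalar_issues

simds, scalar = simulate(16)
# Each SIMD issued 4 times over 16 cycles; the scalar control path ran
# 16 times, i.e. 4x the per-cycle demand of any single SIMD sequencer.
```

That 4x duty cycle is one way to read why beefing up the scalar unit costs more than its small area suggests.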
Perhaps the way the complexity is distributed between the various sections is due for a rebalancing, although it's not clear if the patent represents how AMD will actually do it.
GCN vector<->scalar register moves always need waitcnt afterwards, as the scalar and vector units are not running in lockstep. Waitcnt blocks execution based on a counter value.
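The counter mechanism can be sketched as a toy model (my simplification: a single lgkmcnt-style counter with in-order completion; real GCN tracks several counters with their own completion rules): issuing an operation increments an outstanding-op counter, completions decrement it, and an `s_waitcnt N` stalls the wave until the counter is at or below N.

```python
# Simplified model of GCN's s_waitcnt mechanism; not cycle-accurate.
from collections import deque

class WaitCntModel:
    def __init__(self):
        self.lgkmcnt = 0        # outstanding-op counter
        self.pending = deque()  # in-flight ops, completed in order

    def issue(self, op):
        # Each issued op (e.g. the transfer behind a v<->s move)
        # counts as outstanding until it completes.
        self.pending.append(op)
        self.lgkmcnt += 1

    def complete_one(self):
        self.lgkmcnt -= 1
        return self.pending.popleft()

    def s_waitcnt(self, n):
        # Stalls (here: drains) until at most n ops remain outstanding.
        while self.lgkmcnt > n:
            self.complete_one()

m = WaitCntModel()
m.issue("v->s transfer")
m.issue("another transfer")
m.s_waitcnt(0)  # wait for all outstanding ops before consuming the data
assert m.lgkmcnt == 0
```

Since the scalar and vector pipes aren't in lockstep, the counter is what lets the consumer know the producer's result has actually landed.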
Am I correct in assuming this is using a wait count for a memory write at one pipe and then another wait count on the memory pipe reading it back in? I'm not sure what other wait counts would help with that scenario.
You could, for example, have one scalar unit per SIMD instead of one per CU. I would be positively surprised if Vega had changes like these. The scalar unit is a unique strength of AMD's architecture, and they haven't exploited it fully yet.
The patent puts forward the possibility of high-performance scalar units that can take on wavefronts that are mostly predicated off, although it doesn't quite state that those are the same as the scalar unit. One possible implementation cited had 4 such units, which might keep the full complexity of supporting vector ops safely on the SIMD portions rather than in the scalar pipe.
There are a number of assumptions built into the model of GCN as we know it that might not hold with the patent.
There are various interpretations that could upset expectations on the instructions issued per cycle, operations per instruction, SIMD length, wavefront size, the relationship between SIMD count and issue units, forwarding latency, issue latency, the number of cycles in the cadence, whether there is necessarily a fixed cadence, and the number of SIMDs per CU.
Some of the complexities brought up by this might be a reason to be uncertain how much of it is going to happen with Vega, or whether it will happen at all.
It'd be much more interesting if it did, particularly if some of the other patents and rendering schemes were synthesized with it.
Something like a visibility buffer running on an architecture with a hybrid tiled/deferred front end with enhanced culling, wavefront compaction, and variably sized wavefronts.
However, why couldn't it be Polaris+tweaks+bugfixes on TSMC 16nm FF+ with HBM?