AMD: Navi Speculation, Rumours and Discussion [2019-2020]

3dilettante · Jun 30, 2019

OlegSH said:
GCN was even more constrained in this regard with its single scheduler unit with 1 wave per clk throughput for 4 SIMDs and 1 scalar unit

For clarification, GCN's instruction issue hardware could issue up to 5 instructions of different types by selecting one from up to five wavefronts per cycle on the SIMD whose scheduling cycle has come up. This is not including special instructions that could be automatically consumed by the per-wavefront instruction buffers.

I'm assuming there's parallel issue from multiple wavefronts with RDNA, although it's possible the specifics about the width or types might have changed.
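As a rough illustration of the issue rule described above (at most one instruction per category, each taken from a different wavefront, up to five per cycle), here is a minimal Python sketch. The category labels, the oldest-first scan order, and the name `pick_issue` are assumptions made for illustration; this is not AMD's actual arbitration policy.

```python
from typing import Dict, List

MAX_ISSUE_PER_CYCLE = 5  # "up to 5 instructions of different types"


def pick_issue(next_instr_type: List[str]) -> Dict[str, int]:
    """Map instruction category -> index of the wavefront chosen to issue it.

    next_instr_type[i] is the category of wavefront i's next instruction.
    Each wavefront offers only one instruction, so none can issue twice.
    Assumption: wavefronts are scanned in list order (e.g. oldest first).
    """
    chosen: Dict[str, int] = {}
    for wave, kind in enumerate(next_instr_type):
        if len(chosen) == MAX_ISSUE_PER_CYCLE:
            break
        if kind not in chosen:  # one instruction per category per cycle
            chosen[kind] = wave
    return chosen


# Vector FP and vector INT share the hypothetical "valu" category here,
# so two of them can never be picked in the same cycle.
waves = ["valu", "valu", "salu", "vmem", "salu", "lds", "branch", "export"]
issued = pick_issue(waves)
print(issued)
```

In this toy model the second "valu" and second "salu" wavefronts wait for a later cycle, and the "export" wavefront is left out only because five instructions have already been picked.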

OlegSH · Jun 30, 2019

3dilettante said:
GCN's instruction issue hardware could issue up to 5 instructions of different types by selecting one from up to five wavefronts per cycle on the SIMD whose scheduling cycle has come up

Yes, I remember this claim from GCN docs when it was revealed for the first time. I find it highly misleading considering wave scheduler throughput is 1 wave per cycle and most of these instructions of different types obviously share data paths with FP32 units, hence INT execution will block FP32 from execution on the same data path, the same with SFU and so on. Scalar units have separate data paths, but with 1 wave per cycle, issuing a scalar op will block a SIMD op (fp, int, sfu) from execution. Depending on the scalar core's throughput (I guess it's an instruction per cycle), there will be 0 interleaving with SIMD ops. If throughput is lower, let's say 1/2 ops per cycle, then there will be overlap for 1 cycle with SIMD units.

Don't get me wrong, GCN obviously can execute up to 4 waves concurrently with just a single scheduler per CU because of the 4-cycle cadence of wave execution on SIMD units, but unlike Turing, it can't execute different types of instructions (different mixes of fp + int + sfu) at full rate concurrently because they all share the same data paths and the scheduling rate is not fast enough to meet the requirements of separate data paths.

Betonmischer · Jun 30, 2019

I've found this in an old GCN slide deck. It shows an independent instruction issue and decode datapath for the scalar unit. Therefore, GCN is perfectly able to execute a scalar instruction every cycle without sacrificing any of the SIMD throughput. From the information I was able to gather, it works like this: each cycle that the vector scheduler is issuing an instruction to one of the SIMDs, the scalar scheduler has an opportunity to issue an instruction to the scalar ALU. The caveat is both kinds of instruction must belong to different wavefronts chosen from the pool of up to 10 wavefronts that are associated with each SIMD, as GCN lacks the ILP capability.

If I am correct in my assumptions on GCN, it wouldn't make sense for RDNA to be inferior to GCN in this regard. The RDNA slides clearly show a certain degree of interleaving between SIMDs and SFUs (each cycle an SFU instruction is issued by the scheduler, the corresponding SIMD stays idle), whereas both the scalar unit and the SIMD can be issued an instruction every cycle.

Ethatron · Jun 30, 2019

3dilettante said:
This serial process can lead to sub-optimal results if blocks evaluated earlier are compiled to use a certain number of registers, and then a later block needs a large allocation.
It may have been possible that if the earlier evaluations had known of this, they could have been compiled with more generous register constraints for better performance.
Alternately, it may be the case that a block that needs a lot of registers could lead to occupancy problems. If it's just one small part of a shader that has otherwise modest occupancy, then it might be better if that big block were compiled less-optimally in terms of performance if it lets the overall shader experience better occupancy.

I want to share my experience regarding register allocation. The compiler currently doesn't minimize the number of allocated VGPRs and leave it at that (for the pre-Navi hardware I could analyze); instead it rounds the VGPR count up to the next power of two above the minimum allocation. As an example, I tried to optimize a compute shader for occupancy; it was basically a single wavefront that couldn't be swapped out. I statically calculated the number of needed registers myself and it ended up needing 3 registers more than 32; instead of leaving it at 35 the compiler skyrocketed to 65 (which for some odd reason is considered a 64 VGPR wavefront, probably one was some sort of pseudo VGPR eliminated below the ISA in the real machine code). The analysis can be conducted with the tools Rys' team is publishing, e.g. some options allow you to see register allocation and reuse over the dump of the ISA of the shader on every instruction.
I think the register allocation allowance has changed a lot for Navi. AFAIU from remarks here, power-of-two isn't necessary anymore, so this kind of thing you describe only became necessary from Navi on. Whether the older architectures could potentially be called in a similar way is an interesting question. Technically it's a scheduling optimization problem, which is easily approachable in software (unlimited complexity), but not so much in hardware. It could mean the schedulers became more powerful with regard to solving these types of "searches", or the problem is directed towards the driver and more complex software. Ah, the eternal issue of static vs. dynamic optimization (compile-time vs. run-time).

OlegSH · Jun 30, 2019

Betonmischer said:
It shows an independent instruction issue and decode datapath for the scalar unit

There is independent decode logic because the scalar unit has its own set of instructions (as well as caches and registers). However, scalar instructions are just instructions for a wave with uniform values. The way it's done right now is that waves with just a single work item or with uniform values are determined at compile time and converted to scalar unit instructions (memory address calculations, for example), but these still require launching a wave, and the next instruction in the wave can be dependent on the previous one, so you can't issue 2 instructions to 2 independent data paths for the same wave unless you are sure both instructions are independent (this in turn would require additional compiler instruction scheduling and dual-issue related hw logic). There is no evidence of superscalar instruction execution in GCN; in fact, GCN has always been known as a scalar architecture.

keldor · Jun 30, 2019

Ethatron said:
I want to share my experience regarding register allocation. The compiler currently doesn't minimize the number of allocated VGPRs and leave it at that (for the pre-Navi hardware I could analyze); instead it rounds the VGPR count up to the next power of two above the minimum allocation. As an example, I tried to optimize a compute shader for occupancy; it was basically a single wavefront that couldn't be swapped out. I statically calculated the number of needed registers myself and it ended up needing 3 registers more than 32; instead of leaving it at 35 the compiler skyrocketed to 65 (which for some odd reason is considered a 64 VGPR wavefront, probably one was some sort of pseudo VGPR eliminated below the ISA in the real machine code). The analysis can be conducted with the tools Rys' team is publishing, e.g. some options allow you to see register allocation and reuse over the dump of the ISA of the shader on every instruction.
I think the register allocation allowance has changed a lot for Navi. AFAIU from remarks here, power-of-two isn't necessary anymore, so this kind of thing you describe only became necessary from Navi on. Whether the older architectures could potentially be called in a similar way is an interesting question. Technically it's a scheduling optimization problem, which is easily approachable in software (unlimited complexity), but not so much in hardware. It could mean the schedulers became more powerful with regard to solving these types of "searches", or the problem is directed towards the driver and more complex software. Ah, the eternal issue of static vs. dynamic optimization (compile-time vs. run-time).

Remember, it's splitting a big register bank between some number of wavefronts/threads. This means the number of registers to a wavefront needs to divide the number of registers in the register bank or else there will be left over ones that are just wasted.
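To make the "left over ones" point concrete, here is a small Python sketch. It assumes a budget of 256 VGPRs per lane on a GCN SIMD; that total, the function name `leftover_vgprs`, and the sample allocation sizes are illustrative assumptions, and the sketch ignores any separate cap on resident wavefronts.

```python
from typing import Tuple

ASSUMED_TOTAL_VGPRS = 256  # assumption: per-lane VGPR budget of one GCN SIMD


def leftover_vgprs(per_wave: int, total: int = ASSUMED_TOTAL_VGPRS) -> Tuple[int, int]:
    """Return (wavefronts that fit, registers left unused) for one allocation size."""
    waves = total // per_wave
    return waves, total - waves * per_wave


for alloc in (32, 36, 64, 68):
    waves, unused = leftover_vgprs(alloc)
    print(f"{alloc} VGPRs per wave: {waves} waves fit, {unused} registers unused")
```

Allocation sizes that divide the budget evenly (32, 64) leave nothing unused, while a size such as 68 fits only three wavefronts and strands 52 registers.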

As for asynchronous compute, this is just another complication - the optimal number of registers for a kernel can be different depending on whether other kernels with differing register counts are likely to be running on the same compute unit (I think is what AMD calls them). Actually, something interesting to do from the driver side would be live profiling. Identify register and cache pressure of kernels at runtime and for later invocations, assign them to compute units accordingly, as well as perhaps switching to a different kernel binary tailored to specific register counts. Asynchronous compute is a performance loss if it results in thrashing caches.

Per Lindstrom · Jun 30, 2019

Some nice new features in Navi.
The new blower cooler may be real good.
Radeon Image Sharpening.
And then FidelityFX

https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt

Betonmischer · Jun 30, 2019

OlegSH said:
There is independent decode logic because the scalar unit has its own set of instructions (as well as caches and registers). However, scalar instructions are just instructions for a wave with uniform values. The way it's done right now is that waves with just a single work item or with uniform values are determined at compile time and converted to scalar unit instructions (memory address calculations, for example), but these still require launching a wave, and the next instruction in the wave can be dependent on the previous one, so you can't issue 2 instructions to 2 independent data paths for the same wave unless you are sure both instructions are independent (this in turn would require additional compiler instruction scheduling and dual-issue related hw logic). There is no evidence of superscalar instruction execution in GCN; in fact, GCN has always been known as a scalar architecture.

The idea is that both the SIMD and the scalar unit can each take an instruction from two separate wavefronts, as opposed to a single wavefront, which indeed would make GCN a superscalar architecture. Whether that's the case or not boils down to the question of whether GCN has fully independent scheduling logic for the scalar unit. I'm trying to reach out to AMD for clarification.

Betonmischer · Jun 30, 2019

Per Lindstrom said:
Some nice new features in Navi.
The new blower cooler may be real good.
Radeon Image Sharpening.
And then FidelityFX

https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt

Personally, I'm in favor of blower coolers. They're more robust mechanically compared to a lot of axial designs, if not as good in terms of acoustics.

Alexko · Jun 30, 2019

Per Lindstrom said:
Some nice new features in Navi.
The new blower cooler may be real good.
Radeon Image Sharpening.
And then FidelityFX

https://www.amd.com/en/products/graphics/amd-radeon-rx-5700-xt

Any idea what those bolded features actually are? I don't remember reading anything about them before, curiously enough.

Kaotik · Jun 30, 2019

Alexko said:
Any idea what those bolded features actually are? I don't remember reading anything about them before, curiously enough.

Radeon Image Sharpening

Performance loss 0.x - 2 % or so

FidelityFX is / will be a collection of "image improving" and possibly other features / effects

3dilettante · Jun 30, 2019

OlegSH said:
Yes, I remember this claim from GCN docs when it was revealed for the first time. I find it highly misleading considering wave scheduler throughput is 1 wave per cycle and most of these instructions of different types obviously share data paths with FP32 units, hence INT execution will block FP32 from execution on the same data path, the same with SFU and so on.

AMD's description has been pretty open about the 1-instruction per wavefront per cycle behavior. It is true that vector FP and vector INT do not issue concurrently, since the architecture does not consider them a separate type.
The scalar unit's domain is separate, and the other types tend to have different domains or units handling them. Branch, GDS/export, LDS, vector memory, and special instructions round out the other general types as far as the original GCN architecture description states.
While the data paths are not sized for an arbitrary combination of 5 instructions, there are allowances made for some concurrent issue. AMD's general architectural thrust has been to give the SIMD the best chance at good utilization, but there are signs that this includes making sure at least some of the other operation types can issue at a reasonable rate.
For Navi, the LLVM changes note that the scalar register file is more heavily banked than the vector file. This makes sense since there are multiple types of instructions including vector ones that can source from that file.
Prior GCN designs with the 4-cycle cadence also left the vector register file available for something other than a vector instruction at least 1/4 of the time. An FMA with three read operands in a 4-cycle cadence allows for another unit to tap into the register file. While AMD's super-SIMD patent may not apply to Navi's implementation, it notes that on average vector instructions would use the equivalent of 2 operands, meaning in many cases the register file was available for two operands to any other instruction type that might need a vector operand (export, VMEM, LDS, etc.).
If a vector instruction sourced a scalar register (unclear if so, though the bus that allows for this is mentioned as being its own entity), then that might have been another opportunity for the register file to find concurrent use.

A major limiting factor is operand bandwidth from a register file that is physically challenging to implement. For Turing, this is also a major consideration.
Both architectures have 4-banked vector register files. With Navi, this banking was exposed due to the breaking of the 4-cycle cadence, and the same rule of one operand per bank per cycle is now a hazard both make software deal with.
A significant feature Nvidia's GPUs have relied upon since Maxwell is the operand reuse cache, since it helps get around bank conflicts and provides bandwidth the register file lacks. There's a brief mention of something that might be similar for Navi in the LLVM flags, but it's unclear if it plays a similar role.

Scalar units have separate data paths, but with 1 wave per cycle, issuing a scalar op will block a SIMD op (fp, int, sfu) from execution.

If this is a prior GCN GPU, the restriction is that it cannot be from the same wavefront. If any of the up to 9 other wavefronts on the SIMD have a scalar operation pending it will probably be issued.
How many candidate wavefronts Navi has to choose from on a given SIMD, and how many per cycle it can select isn't something I've seen in the presentations.
Fully dedicating a scalar unit and scheduler to a SIMD can provide more opportunities for finding instructions without resource conflicts between the scalar and vector paths, and combining the hardware budget of two SIMD16 units should give it the ability to have more wavefront slots than before. However, the supposed streamlining of the architecture for clock speed and efficiency may put some downward pressure on the totals for a SIMD32 versus 2 SIMD16 blocks.
Also unclear from the Navi presentation is why the branch/message block in the GCN diagram went away for Navi. Most of those instructions still exist, although the loss of some instructions may allow for a shift in what categories or relative issue rates Navi's scheduler has.

Don't get me wrong, GCN obviously can execute up to 4 waves concurrently with just a single scheduler per CU because of the 4-cycle cadence of wave execution on SIMD units, but unlike Turing, it can't execute different types of instructions (different mixes of fp + int + sfu) at full rate concurrently because they all share the same data paths and the scheduling rate is not fast enough to meet the requirements of separate data paths.

It can make forward progress on more than just four wavefronts in a CU, although it seems the types you are focusing on are architecturally in the same bucket for GCN.
The Navi presentation did mention that the SFUs use one issue cycle and then continue in parallel, which has some similarity with how Turing handles FP and INT instruction issue.

Ethatron said:
I want to share my experience regarding register allocation. The compiler currently doesn't minimize the number of allocated VGPRs and leave it at that (for the pre-Navi hardware I could analyze); instead it rounds the VGPR count up to the next power of two above the minimum allocation. As an example, I tried to optimize a compute shader for occupancy; it was basically a single wavefront that couldn't be swapped out. I statically calculated the number of needed registers myself and it ended up needing 3 registers more than 32; instead of leaving it at 35 the compiler skyrocketed to 65 (which for some odd reason is considered a 64 VGPR wavefront, probably one was some sort of pseudo VGPR eliminated below the ISA in the real machine code). The analysis can be conducted with the tools Rys' team is publishing, e.g. some options allow you to see register allocation and reuse over the dump of the ISA of the shader on every instruction.
I think the register allocation allowance has changed a lot for Navi. AFAIU from remarks here, power-of-two isn't necessary anymore, so this kind of thing you describe only became necessary from Navi on. Whether the older architectures could potentially be called in a similar way is an interesting question. Technically it's a scheduling optimization problem, which is easily approachable in software (unlimited complexity), but not so much in hardware. It could mean the schedulers became more powerful with regard to solving these types of "searches", or the problem is directed towards the driver and more complex software. Ah, the eternal issue of static vs. dynamic optimization (compile-time vs. run-time).

It's difficult to say why the compiler bumped the allocation that high, though there are multiple programmer posts on this board and even references in AMD's GPUOpen site from developers that reference poor compiler allocation behavior for GCN.

The hardware itself doesn't have that extreme of a granularity.
https://gpuopen.com/amdgcn-assembly/
https://llvm.org/docs/AMDGPUUsage.html
The above give a VGPR granularity of 4, and 16 for SGPR for GCN3 (albeit a given shader will see its allocation doubled). It's 4 and 8 for older GCN architectures.
From the second link, Navi's VGPR granularity goes to 8. However, there isn't really a scalar granularity; there are simply 128 registers at all times.
It has been the case that GCN's scalar register file has accumulated more and more context as more complex trap handling, complex addressing, and debugging have been added.
The large base allocation may also have some influence as to why the scalar path was duplicated and made per-SIMD.
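For comparison, a short Python sketch of what the granularities quoted above would do to the 35-VGPR shader from Ethatron's post, next to the power-of-two rounding he observed. The helper names `round_up` and `next_pow2` are made up for this illustration, and the real allocation logic lives in the compiler and driver rather than in anything this simple.

```python
def round_up(n: int, granule: int) -> int:
    """Round n up to the next multiple of granule."""
    return -(-n // granule) * granule


def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


needed = 35  # VGPRs the shader actually needs, per the post above
print("granularity 4 (GCN):  ", round_up(needed, 4))
print("granularity 8 (Navi): ", round_up(needed, 8))
print("next power of two:    ", next_pow2(needed))
```

With these inputs the hardware granularities give 36 and 40 registers, while power-of-two rounding gives 64, which is the kind of jump described in the quoted post.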

3dilettante · Jun 30, 2019

Betonmischer said:
The idea is that both the SIMD and the scalar unit can each take an instruction from two separate wavefronts, as opposed to a single wavefront, which indeed would make GCN a superscalar architecture. Whether that's the case or not boils down to the question of whether GCN has fully independent scheduling logic for the scalar unit. I'm trying to reach out to AMD for clarification.

I would disagree about it being called superscalar if the instruction is selected from a different wavefront. Superscalar is a descriptor for architectures that can improve single-threaded execution, mostly if the hardware can analyse the instruction stream for dependences on its own.
Picking from two separate threads removes the need for the analysis that superscalar processors perform.

JoeJ · Jun 30, 2019

Ethatron said:
instead of leaving it at 35 the compiler skyrocketed to 65

(lID is the local thread index)
It took me some time to find a 'fix'; writing a loop did not help, for example.
However, this was the OpenCL compiler. I did the same in Vulkan and had no issues there. (VK was in general about 15% faster, but at the time there were no profiling tools to check register usage.)
It seemed the OpenCL compiler did not care about the number of registers with regard to occupancy. More often than not I missed a better tier by just one damn register. Still, overall the performance on GCN was always better than anywhere else for me.

Betonmischer · Jul 1, 2019

3dilettante said:
I would disagree about it being called superscalar if the instruction is selected from a different wavefront. Superscalar is a descriptor for architectures that can improve single-threaded execution, mostly if the hardware can analyse the instruction stream for dependences on its own.
Picking from two separate threads removes the need for the analysis that superscalar processors perform.

Perhaps my wording should have been clearer on this, but I was referring to GCN being able to execute two successive instructions from a single wavefront as superscalarity. Which is obviously not the case, as two wavefronts (if the whole assumption of the vector and scalar instruction co-issue is correct) are required for concurrent execution on the SIMD and the scalar unit.

3dilettante · Jul 1, 2019

Betonmischer said:
Perhaps my wording should have been clearer on this, but I was referring to GCN being able to execute two successive instructions from a single wavefront as superscalarity. Which is obviously not the case, as two wavefronts (if the whole assumption of the vector and scalar instruction co-issue is correct) are required for concurrent execution on the SIMD and the scalar unit.

In that case I misread which item the superscalar description was being applied to.

Ext3h · Jul 1, 2019

3dilettante said:
It can make forward progress on more than just four wavefronts in a CU, although it seems the types you are focusing on are architecturally in the same bucket for GCN.
The Navi presentation did mention that the SFUs use one issue cycle and then continue in parallel, which has some similarity with how Turing handles FP and INT instruction issue.

Slide 16 shows that not only SFU, but also regular vector (and possibly scalar) instructions are pipelined from the same wave if there is no data dependency.

What's unclear though, is how many instructions per wave can be in flight concurrently, or in other terms how many dependencies can be tracked before a pipeline stall is enforced due to exhausted scheduling resources.

3dilettante said:
A major limiting factor is operand bandwidth from a register file that is physically challenging to implement. For Turing, this is also a major consideration.
Both architectures have 4-banked vector register files. With Navi, this banking was exposed due to the breaking of the 4-cycle cadence, and the same rule of one operand per bank per cycle is now a hazard both make software deal with.

Are you suggesting the vector register file banks in Navi have really just single read port?
That's quite difficult to believe, as it would result in 2 or 3 cycles per instruction just for operand gathering (at least when not utilizing a different workaround like the operand cache you mentioned), even in the best case. And doesn't fit the advertised instruction throughput of 1 vector instruction per cycle either, which results in an effective 2 or 3 reads and another write required per cycle.

If there actually is some sort of bank conflict, it's more likely that it only costs extra if you happen to accidentally gather from more than a single bank in a single instruction. And multiple reads from same VGPR bank are handled in same cycle by use of multiple ports.
Mind linking to the LLVM commit which you think referred to bank conflicts?

3dilettante · Jul 1, 2019

Ext3h said:
What's unclear though, is how many instructions per wave can be in flight concurrently, or in other terms how many dependencies can be tracked before a pipeline stall is enforced due to exhausted scheduling resources.

The LLVM changes have a speed model section that details some of the latencies. For vector operations, the latency is 5 cycles (one worse than prior generations). If tracking vector writes, there's one destination register per instruction, so up to five registers may be in flight as far as the software-visible pipeline latency is concerned.
This is supported by AMD's slide showing instruction issue for a code sample, with Navi showing an extra cycle needed to handle dependent instructions. For the scalar to vector dependence in that slide, there was a two cycle latency.
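A toy in-order model shows what a 5-cycle result latency means for a single wave. It assumes one instruction may issue per cycle and that a dependent instruction waits until its source is ready; the function name `issue_cycles`, the register names, and the model itself are illustrative assumptions, not a description of Navi's actual scoreboard.

```python
from typing import Dict, List, Sequence, Tuple

Instr = Tuple[str, Sequence[str]]  # (destination register, source registers)


def issue_cycles(program: List[Instr], latency: int = 5) -> List[int]:
    """Return the cycle at which each instruction of one wave issues.

    Assumptions: strictly in-order issue, at most one instruction per cycle,
    and a result becomes readable `latency` cycles after its producer issues.
    """
    ready: Dict[str, int] = {}  # register -> first cycle its new value can be read
    next_slot = 0               # earliest cycle the next instruction may issue
    cycles: List[int] = []
    for dest, srcs in program:
        cycle = max([next_slot] + [ready.get(s, 0) for s in srcs])
        cycles.append(cycle)
        ready[dest] = cycle + latency
        next_slot = cycle + 1
    return cycles


dependent = [("v1", ["v0"]), ("v2", ["v1"]), ("v3", ["v2"])]
independent = [("v1", ["v0"]), ("v2", ["v0"]), ("v3", ["v0"])]
print(issue_cycles(dependent))
print(issue_cycles(independent))
```

In this model a chain of dependent instructions issues once every five cycles, while independent instructions from the same wave issue back to back, so up to five results can be in flight at once.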

Are you suggesting the vector register file banks in Navi have really just single read port?

Yes, and I think this is stated in the LLVM changes in various places. GFX10 has an architectural feature flag stating it has a banked register file.
There's a new set of routines used to handle register bank selection with an eye for avoiding bank conflicts.
https://github.com/llvm-mirror/llvm...b3561b2#diff-1fe939c9865241da3fd17c066e6e0d94
The code explicitly points out the restriction in GCNRegBankReassign.cpp:

Code:

/// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in
/// a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to
/// bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1,
/// s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
///
/// The shader can read one dword from each of these banks once per cycle.
/// If an instruction has to read more register operands from the same bank
/// an additional cycle is needed. HW attempts to pre-load registers through
/// input operand gathering, but a stall cycle may occur if that fails. For
/// example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands,
/// potentially incuring 2 stall cycles.
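
The rule in the quoted comment can be modelled in a few lines of Python. This is only a sketch of the stated rule (VGPR bank = register index modulo 4, one dword per bank per cycle); it assumes that reading the same register more than once costs a single read, and it ignores the operand pre-loading the comment mentions. The function names are made up for illustration.

```python
from collections import Counter
from typing import Iterable

VGPR_BANKS = 4  # v0, v4, v8... in bank 0; v1, v5, v9... in bank 1; etc.


def vgpr_bank(reg: int) -> int:
    """Bank of VGPR number `reg` under round-robin assignment."""
    return reg % VGPR_BANKS


def operand_read_cycles(src_regs: Iterable[int]) -> int:
    """Cycles needed to read all source VGPRs of one instruction.

    Assumption: duplicate uses of the same register need only one read.
    """
    distinct = set(src_regs)
    per_bank = Counter(vgpr_bank(r) for r in distinct)
    return max(per_bank.values(), default=1)


print(operand_read_cycles([0, 4, 8]))  # all three operands in bank 0
print(operand_read_cycles([0, 1, 2]))  # operands spread over three banks
print(operand_read_cycles([0, 0, 0]))  # same register for every operand
```

Under this model the v0/v4/v8 example from the comment needs three read cycles (two potential stalls), whereas spreading the operands across banks brings it back to one.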

I've referenced patents for super-SIMD and a register destination cache where AMD cites a banked register file. Also notable is that this is how AMD described future and existing GPUs. This actually makes sense with the 4-cycle cadence, since a bank can be read over three cycles without worrying about conflicts. Navi breaks the issue cycle restriction of the old cadence, but does not improve the delivery rate (edit: latency) of the SRAM banks (actually a bit worse now).
Nvidia's generally the same. I think it stopped hiding this after G80, and actually had more banks until more recent GPUs settled on the same modulo-4 assignment Navi adopted.
For AMD, I remember descriptions of its use of banked register files going back to R600. Going that far back, AMD actually released a set of test loops for throughput that would ironically now seriously hurt throughput because they used the same register IDs over and over.
(edit: https://forum.beyond3d.com/posts/994348/ --although it's been a long time and I may have misremembered how simple a lot of the loops were. They use few IDs, but some operations are so simple that they use the same register for all operands, and those I'd imagine Navi's hardware has no problem with.)
One caveat about GFX10 is that while there is a mention of a register cache, its effect isn't noted for register bank selection unless it's in the nebulous "operand gathering" claim.

That's quite difficult to believe, as it would result in 2 or 3 cycles per instruction just for operand gathering (at least when not utilizing a different workaround like the operand cache you mentioned), even in the best case. And doesn't fit the advertised instruction throughput of 1 vector instruction per cycle either, which results in an effective 2 or 3 reads and another write required per cycle.

That's what's stated above. For prior generations, I've speculated before that this was likely the case, and that it's actually handled in a longer pipeline ahead of actual ALU execution. There are instances cited in the ISA doc about how long it takes to effect certain instruction fetch behaviors where there's a delay before it propagates down to the wavefront's execution, basically where a mode change cannot stop instructions already in those preliminary pipeline stages.
With the 4-cycle cadence, a long enough pipeline, and forwarding there's generally little software-visible difference. This sort of allocating a bank to one quarter of a wavefront is also hinted at in the super-SIMD patent as a pre-existing embodiment.

If there actually is some sort of bank conflict, it's more likely that it only costs extra if you happen to accidentally gather from more than a single bank in a single instruction. And multiple reads from same VGPR bank are handled in same cycle by use of multiple ports.
Mind linking to the LLVM commit which you think referred to bank conflicts?

In this case, it's the other way around. A SIMD can now freely select between 4 banks in the same cycle.
Some possible reasons why include that AMD's super-SIMD patent blames 1/3 of a CU's power draw on the register file, and each additional port adds a transistor per cell to what is likely an array using 6T cells. These are already very large storage arrays, and also having 3-4 ports per bank is unbalancing given that it's multiplying the peak bandwidth of the register file versus the one SIMD that can use it.

AMD neglected to mention it, and Nvidia doesn't really talk about it in the marketing either. AMD's instruction issue slide does quietly spread its operands across different banks, however.

Per Lindstrom · Jul 1, 2019

Alexko said:
Any idea what those bolded features actually are? I don't remember reading anything about them before, curiously enough.

No, sorry, they are new to me as well.

Kaotik · Jul 1, 2019

Rage 2 becomes the first to support FidelityFX before Navis are even out

https://www.dsogaming.com/news/rage...ally-supports-amds-new-fidelityfx-technology/

The feature doesn't work on NVIDIA cards, at least in its current state
