AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Seems very similar to RTX, as far as I can imagine how the latter works.
The high-level description of RTX implies a more autonomous RT core versus the AMD patent. A BVH instruction passes pointers and ray data across a bus to the texture unit, and the memory and filtering path loads and filters data that can be passed back to the SIMD hardware and/or the intersection engine. That engine can perform intersection tests on bounding boxes or triangles, depending on what type of node is being evaluated.
What seems different here is that the hybrid texture/RT block only accelerates the evaluation of one node at a time, whereas Nvidia's described RT core functionality keeps on traversing and testing until it can return a hit/miss result.

The AMD method uses a state machine that takes the set of intersection tests, additional child nodes, and data indicating how the traversal stack should be updated, and passes it back to the SIMD hardware. The SIMD hardware then evaluates or implements what was passed back. Successive nodes would involve executing another BVH instruction with arguments based on the most recently updated context and stack data.
The SIMD hardware and the register and memory resources it has available host much of the context, although I am unclear on how exposed the back-and-forth between the texture block + state machine and the SIMD would be to the programmer. The programmable SIMD could make more flexible decisions about what it does for the next traversal step, although what the patent describes could also be implemented on programmable hardware running internal programs or microcode loops that won't release the wavefront back to programmer control until they are done.
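To make the contrast concrete, here's a rough sketch of how I read the patent's traversal loop. The names (bvh_intersect, NodeResult, etc.) are my own placeholders rather than anything from the patent, and the real thing would be wavefront-wide rather than per-ray:

Code:
#include <cstdint>
#include <cmath>

// Result of one BVH instruction: the texture/RT block tests one node's
// boxes or triangles and hands the partial results back to the SIMD.
struct NodeResult {
    uint32_t children[4];     // child nodes that passed the box tests
    uint32_t num_children;
    bool     triangle_hit;
    float    t_hit;
};

struct Ray { float origin[3], dir[3], t_max; };

// Placeholder for the patent's BVH instruction issued to the texture unit.
NodeResult bvh_intersect(const Ray& ray, uint32_t node);

// The part that stays on the SIMD: stack maintenance and picking the next
// node. In Nvidia's description the RT core keeps this loop internal.
float trace(const Ray& ray, uint32_t root, uint32_t* stack, int stack_cap) {
    float closest = INFINITY;
    int sp = 0;
    uint32_t node = root;
    for (;;) {
        NodeResult r = bvh_intersect(ray, node);       // one node per instruction
        if (r.triangle_hit && r.t_hit < closest)
            closest = r.t_hit;
        for (uint32_t i = 0; i < r.num_children; ++i)  // push children for later traversal
            if (sp < stack_cap)
                stack[sp++] = r.children[i];
        if (sp == 0)
            break;                                     // traversal done
        node = stack[--sp];                            // argument for the next BVH instruction
    }
    return closest;
}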

The impact of the AMD method on the overall CU appears to be more significant, versus Nvidia's claim that its RT core can leave the SM mostly free to do other things.
Other Nvidia claims, like the RT core's built-in execution loop saving a lot of instruction fetch bandwidth for the SM, would put AMD's method at a potential half-way point. There's a subset of operations that have an internal loop, but whatever additional steps go back to the SIMD hardware may incur significant instruction traffic--albeit not as much as a fully software solution.

I haven't found a description to this level of detail for the RTX elements of Nvidia's architecture for comparison. One possible point for future review is how the two methods compare if RTX hardware is using a custom intersection shader, which Nvidia generally recommends against for performance reasons. That might inject some of the back-and-forth communication between the RT core and SIMD hardware that AMD's method defaults to.

I really wonder how they manage a stack per ray. Can one assume an upper bound on how large this stack has to be? And even if so, that's a lot of memory and bandwidth.
Personally I've always used a stackless approach on GPU. I must be missing something here...
The vector register file seems to be the first choice for hosting the stack, but since this is back in the programmable domain there could be fallback to LDS and memory.
It seems hardware vendors like how stack-based methods tend to yield more compact BVH structures in memory, don't have as many repeat traversals of nodes as many stackless methods, and the accesses that exist may play better with cache hierarchies than some stackless methods. Being able to play in the same conceptual space as many CPU methods may also be a bonus.
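To put rough numbers on the stack-size worry above -- these parameters are my own guesses, not anything from the patent:

Code:
// Back-of-the-envelope stack cost per wave, with made-up but plausible numbers.
constexpr int kMaxDepth      = 32;                         // assumed worst-case traversal depth
constexpr int kBytesPerEntry = 4;                          // one 32-bit node index
constexpr int kWaveSize      = 64;
constexpr int kBytesPerRay   = kMaxDepth * kBytesPerEntry; // 128 bytes per ray
constexpr int kBytesPerWave  = kBytesPerRay * kWaveSize;   // 8 KiB per wave
// 128 bytes per lane is 32 VGPRs' worth, and 8 KiB per wave is a sizeable
// slice of a 64 KiB LDS, so either home for the stack eats into occupancy --
// presumably one reason short-stack/restart-style schemes remain attractive.

So the stack is bounded, but not cheap, which fits the "that's a lot of memory and bandwidth" concern.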

"Fixed function ray intersection engine in a texture processor" - question, this texture processor in amd nomenclature is tmu or some sort of CU ?
It's generally in the area where the vector cache, L/S units, and texture filtering units are, but the engine might be a hardware block sitting next to them. The texture path has a lot of buffers, ALUs, and sequencing capability already, so what gets reused versus re-implemented isn't clear.

http://www.freepatentsonline.com/y2019/0164328.html
PRIMITIVE LEVEL PREEMPTION USING DISCRETE NON-REAL-TIME AND REAL TIME PIPELINES
This may have some relation to the existence of more than one graphics ring in the recent Navi driver commits. This allows for preemption at the granularity of a primitive by creating a duplicate pipeline and register and data storage for the main graphics pipeline and a real-time pipeline. This duplication goes from the command processor through the geometry processor. Whether there's an explicitly separate command processor or processor block per pipeline or some form of multi-threading isn't clear.
The big change is that a context switch and drain of the fixed-function pipeline doesn't happen in this form of preemption, because the command processor and front end duplicate storage, and various blocks like input assembly and tessellation are not shared with the real-time pipeline. The non-realtime path would presumably be the high-performance standard graphics path, while the real-time path avoids stepping on its toes by emulating various stages in software rather than risk flushing them.
This would apply to workloads that are very latency sensitive, but aren't counting on using some of those emulated resources much.
The shader back-end is generally agnostic of the front end, so its changes appear minimal.
(edit: Mentions a scheduling processor that may align with the MES controller added with Navi, which matches the priority tunneling in AMD's slides. Might relate to having a central geometry processor.)

This may not be strictly related to Navi or hardware. A skim of it makes me think it is a change in how the compiler can handle static instruction scheduling in terms of deciding on how it can compile or optimize sections of shader programs.
The supposed original way of doing this was to have the compiler walk through every block of a shader, record the number of registers it needs, and then indicate that the shader as a whole will need an allocation matching the consumption of the block that needs the most registers.
This serial process can lead to sub-optimal results if blocks evaluated earlier are compiled to use a certain number of registers, and then a later block needs a large allocation.
It may have been possible that if the earlier evaluations had known of this, they could have been compiled with more generous register constraints for better performance.
Alternately, it may be the case that a block that needs a lot of registers could lead to occupancy problems. If it's just one small part of a shader that has otherwise modest occupancy, then it might be better if that big block were compiled less-optimally in terms of performance if it lets the overall shader experience better occupancy.
The patent describes evaluating blocks with multiple scheduling algorithms in parallel, taking the accumulated results, and selecting the versions of each block that it thinks lead to a better overall result.
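A toy version of how that selection step might look -- the strategy set, the cost model, and the greedy pick are all my own illustration of the idea, not the patent's actual algorithm:

Code:
#include <algorithm>
#include <vector>

struct Variant { int vgprs; int est_cycles; };   // one schedule of one block
using BlockVariants = std::vector<Variant>;      // same block compiled by several schedulers

// Pick one variant per block so the shader-wide VGPR peak stays under a cap
// (for occupancy), and among those, take the fastest variant of each block.
std::vector<int> pick(const std::vector<BlockVariants>& blocks, int vgpr_cap) {
    std::vector<int> choice;
    for (const auto& variants : blocks) {        // assumes each block has >= 1 variant
        int best = -1;
        for (int i = 0; i < (int)variants.size(); ++i) {
            if (variants[i].vgprs > vgpr_cap) continue;   // would hurt occupancy
            if (best < 0 || variants[i].est_cycles < variants[best].est_cycles)
                best = i;
        }
        if (best < 0) {
            // No variant fits the cap: fall back to the lowest-pressure schedule.
            best = (int)(std::min_element(variants.begin(), variants.end(),
                         [](const Variant& a, const Variant& b) { return a.vgprs < b.vgprs; })
                         - variants.begin());
        }
        choice.push_back(best);
    }
    return choice;
}

The interesting part of the patent is that the per-block candidates are generated in parallel rather than walked serially, so the final pick can trade one block's speed for the whole shader's occupancy.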

This sets up a vertically-stacked system where SIMD hardware can access a section of the DRAM above it as if it were "local", which can be supplied at lower latency and apparently at data bus widths closer to the internal data paths of the DRAM, widths that the standard interfaces whittle down. This local connection also drills directly down to the SIMD hardware and its register file. The cache hierarchy that exists in this case is for accessing other parts of the HBM that are non-local (above another, distant SIMD).
The rules for this type of access appear to be different than what would be for more traditional memory accesses that may also need to be consistent with CPU or other clients. Accesses are even aware of lane predication, so this seems like it's treated like an extension of a data share or local buffer. There's also consideration for load-balancing between local and remote access, and power consumption from much higher DRAM array activity.

edit: Of note, this talks about SIMD16 hardware, and it's another one of those DOE patents, which, like the variable-width SIMD patents and the raft of near-threshold, per-ALU voltage regulation, and asynchronous processing patents from similar programs, seem to have little correlation with any AMD products.

Seems like a method to stack an APU or GPU with HBM using TSVs.
I keep reminding myself of that patent that mentioned a method to dissipate the heat of a chip across the PCB with copper tubes.
Stacking HBM with an APU would be great to reduce costs, but the problem of dissipating the heat between the stacks should prevent it from happening.

Maybe this way they could do it.

[Attached image: GsqFUW7.png - quick sketch of the proposed APU/HBM stacking arrangement]
It's possible that an APU like that wouldn't need the big heatsink because there's no way it can draw enough current, since the base is also a heatsink and the HBM dies are in the way.
 
What seems different here is that the hybrid texture/RT block only accelerates the evaluation of one node at a time, whereas Nvidia's described RT core functionality keeps on traversing and testing until it can return a hit/miss result.
Just came back here because I started to doubt my earlier conclusion, and wanted to ask about exactly what you've answered here. :)

The impact of the AMD method on the overall CU appears to be more significant, versus Nvidia's claim that its RT core can leave the SM mostly free to do other things.
On the other hand, if this results in less area spent on RT it could be a net win. It also depends on how much of general purpose compute remains available. I assume they use LDS for the stack, so likely it all becomes quite limited, but who knows - maybe some sort of reordering or quantizing multiple rays to use the same direction for better coherence becomes possible.
Likely there could be better ideas than that, but better to stop the programmer's wishful thinking until we know more. :D
 
It's possible that an APU like that wouldn't need the big heatsink because there's no way it can draw enough current, since the base is also a heatsink and the HBM dies are in the way.
...

I didn't make a 2min drawing in mspaint to be taken literally regarding dimensions or proportions..
 
On the other hand, if this results in less area spent on RT it could be a net win. It also depends on how much of general purpose compute remains available. I assume they use LDS for the stack, so likely it all becomes quite limited, but who knows - maybe some sort of reordering or quantizing multiple rays to use the same direction for better coherence becomes possible.
It might have reduced area requirements, although we wouldn't know how much in absolute terms this matters without a clear view of the hardware. Nvidia's RT core area consumption is not well-understood, although attempts at comparing Turing GPUs with and without RTX pointed at modest penalties for the RT cores.

...

I didn't make a 2min drawing in mspaint to be taken literally regarding dimensions or proportions..
It seemed reasonable to me. The high-performance dies that can generate enough activity to use stacks of HBM have most of their pinout covered by power/ground.
A substantial amount of metal would need to be under the stack to make it a worthwhile medium for heat transport, and it would need to make good physical contact across the warm base area.
Power delivery also suffers along with dissipation with 3D stacking, and there may be compromises related to where the memory is versus where a large number of vias dedicated to power/ground need to be placed to get to the APU.
 
That's how articles about the Navi reveal at AMD's tech day put it, yes.

I didn't find the slides mentioning it. So wccftech used the wrong code likely due to AdoredTV's 'leaks'.

Forgot to mention that the whole point of the last post was to say that Sapphire has already registered a whole lot of 59xx and 58xx names, which was quite expected given AMD's naming of the current cards as 57xx.

The newly registered trademarks include the Radeon RX 5950XT, RX 5950, RX 5900XT, RX 5900, RX 5850XT, RX 5850, RX 5800XT, RX 5800, RX 5750XT, RX 5750, RX 5700XT, RX 5700, RX 5650XT, RX 5650, RX 5600XT, RX 5600, RX 5550XT, RX 5550, RX 5500XT, RX 5500, RX 590XT and RX 590.

https://wccftech.com/amd-radeon-rx-5950-5900-5850-5800-graphics-cards-leaked/
 
I wonder, how will they do denoising after the ray tracing calculation?
 
And also four Navi variants from macOS:


https://wccftech.com/four-amd-navi-gpu-variants-leaked-rumored-for-july-2019-launch/

Then we had a rumor four months back that the Navi release had been pushed back to October. Maybe they're confusing it with the next Navi's release?
Those numbers are actually related to C++ names. From a comment on Videocardz:
C++ decorates the names of the methods upon compilation to encode the class name and parameter types. The decoration adds a lot of extra digits and characters. There is no Navi16. 16 is the length of the method name getMatchProperty.
You can try and run this long name through the c++filt command and see the real name of the method.
I have no idea if the above comment is accurate, but the character-length part checks out (for example, in "__ZN38AMDRadeonX6000_AMDRadeonHWServicesNavi9MetaClassC1Ev", "MetaClass" is 9 characters).
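For the curious, those length prefixes come from the Itanium C++ mangling scheme (macOS symbols just add one extra leading underscore). A quick sketch that walks the prefixes of the symbol quoted above -- c++filt is the proper tool, this is only meant to show where the digits come from:

Code:
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

int main() {
    // Layout: _ZN <len><name> <len><name> ... E <signature>
    std::string sym = "__ZN38AMDRadeonX6000_AMDRadeonHWServicesNavi9MetaClassC1Ev";
    std::string::size_type pos = sym.find("ZN") + 2;   // skip the "__ZN" prefix
    while (pos < sym.size() && std::isdigit(static_cast<unsigned char>(sym[pos]))) {
        std::size_t used = 0;
        int len = std::stoi(sym.substr(pos), &used);   // read the length prefix
        std::cout << sym.substr(pos + used, len) << "\n";
        pos += used + len;
    }
    // Prints:
    //   AMDRadeonX6000_AMDRadeonHWServicesNavi   (38 characters, ending in "Navi")
    //   MetaClass                                (9 characters, hence the "9")
    // The trailing "C1Ev" encodes a constructor taking no arguments, and a
    // method mangled as "16getMatchProperty" would likewise just be the
    // 16-character name with its length stuck in front of it.
    return 0;
}

So "Navi16" is the tail of one name running straight into the length prefix of the next one.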
 
I didn't find the slides mentioning it. So wccftech used the wrong code likely due to AdoredTV's 'leaks'.

Forgot to mention that the whole point of the last post was to say that Sapphire has already registered a whole lot of 59xx and 58xx names, which was quite expected given AMD's naming of the current cards as 57xx.



https://wccftech.com/amd-radeon-rx-5950-5900-5850-5800-graphics-cards-leaked/
The EEC has previously had wrong/fake "registered products" too - even to the point where some 3rd party managed to get some "future cards" from a real company in there.
 
No inside info. But by adding 2021, if AMD takes until 2021 to deliver the hardware, they will say that they delivered it on time as promised. With vague roadmaps you should always take the most conservative interpretation, because that's what the vendor will take (otherwise it wouldn't be vague). Then you can be pleasantly surprised if they beat it.

Also, GPU roadmaps are non-linear.
"Similarly, with the GPU group, we’re talking about the first generation of RDNA today, and we have several generations in parallel.

So I think the one thing that I would like to say is that our roadmap has not changed. Our roadmap, when Mark and I and the rest of the team started the roadmap, was the idea that we needed multiple generations of continuous improvement. So with Zen, Zen+, Zen 2, Zen 3, that was all part of the plan. As we go forward, we’re going to continue to be very very aggressive about the CPU."

-Dr Lisa Su



I think AMD is just throwing everyone off with their slides.

There are hints all over the place that suggest that RDNA2 is done and working and in the consoles, and RDNA3 is being worked on. Nothing.. and I mean nothing technically is preventing Dr. Su from announcing on July 7th... that Big Navi is an early 2020 release.... and RDNA3 is late next year.



 
They talked about it, but in current games, they're using shaders.
Tensor cores are only used for DLSS, AFAIK.
Exactly. Tensor cores have yet to be used for denoising in shipped games, and frankly it doesn't look like that's going to change, given that current in-house implementations are good & fast enough.
 
Dunno who you mean by everyone, but Nvidia using Tensor Cores ...
It still seems to be a common misconception, but the assumption that tensor cores are not used at all may be wrong as well.
Maybe there is a game that utilizes NNs on tensor cores for denoising by using GameWorks? But I doubt it, and I don't know if GameWorks denoisers are available yet or whether they use tensor cores at all.

Having wondered about this for a long time, my current assumptions are:
You need at least fp16 for denoising, and if you write fp16 compute shader code on NV it will utilize the tensor cores, because fp16 is executed on them? (This assumption is based on the GTX 1660, which added fp16 units as replacements for the full tensor cores.)
NNs might not be the best way to handle denoising - if you can, manually written and highly optimized code should beat them. Denoising is worth the manual work - you're more likely to use AI in cases where a human can no longer weigh thousands of fuzzy decisions to get some fuzzy thing working.
 
"Similarly, with the GPU group, we’re talking about the first generation of RDNA today, and we have several generations in parallel.

So I think the one thing that I would like to say is that our roadmap has not changed. Our roadmap, when Mark and I and the rest of the team started the roadmap, was the idea that we needed multiple generations of continuous improvement. So with Zen, Zen+, Zen 2, Zen 3, that was all part of the plan. As we go forward, we’re going to continue to be very very aggressive about the CPU."

-Dr Lisa Su



I think AMD is just throwing everyone off with their slides.

There are hints all over the place that suggest that RDNA2 is done and working and in the consoles, and RDNA3 is being worked on. Nothing.. and I mean nothing technically is preventing Dr. Su from announcing on July 7th... that Big Navi is an early 2020 release.... and RDNA3 is late next year.



Don’t necessarily agree with everything here, but AMD on The Full Nerd podcast did suggest they’d have more to share soon.
 
Is there any info on RDNA's concurrent instruction execution capabilities?
Can it execute scalar instructions in parallel with vector ones? I guess RDNA's CU can interleave scalar instructions with vector instructions in Wave64 mode, because each wave will execute for 2 clocks on the SIMD32 units. Hence, on the second clock, the wave scheduler can dispatch a scalar instruction from that wave to the scalar unit, so every 2 cycles there can be overlap between SIMD and scalar execution. Putting it all together, the scalar units can be utilized at 50% throughput concurrently with the vector units at 100% throughput in Wave64 mode. Is there any other case where the scalar units in RDNA can be utilized concurrently with the vector SIMDs?
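My mental picture of that argument, for what it's worth (the instruction names are placeholders, and whether the scheduler really co-issues this way from a single wave is exactly the open question):

Code:
// Hypothetical issue timeline for one wave64 on a SIMD32 plus a scalar unit:
// clk 0: v_fma  lanes  0-31
// clk 1: v_fma  lanes 32-63      s_cmp     -> scalar unit
// clk 2: v_add  lanes  0-31
// clk 3: v_add  lanes 32-63      s_cselect -> scalar unit
// Vector pipe busy every clock (100%); scalar unit busy every other clock
// (50%), which is where the 100%/50% figures in the post above come from.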
 
I wonder, how will they do denoising after the ray tracing calculation?
Navi does have packed math instructions, and I think it expands the set. If going in a machine-learning direction, later versions of GFX10 are flagged as having the Vega 20 instructions plus a set of new instructions along those lines.

"Similarly, with the GPU group, we’re talking about the first generation of RDNA today, and we have several generations in parallel.

This is normal for product development. The rule of thumb for a new CPU in AMD's class from design to market is ~5 years, and the old rule for new GPUs was on the order of ~4-5. Evolutions of a core might reduce the timeline, but when products can have a cadence two or three times shorter, there needs to be overlap.
Projects go through distinct stages from concept, features, design, synthesis, various forms of verification, sampling, bug fixing, etc. As such, there are milestones where groups that specialize in one stage hand things off to the next, and they aren't going to sit idle waiting for the N, N+1, or N+2 design to make its way down the later phases in the pipeline.

There are hints all over the place that suggest that RDNA2 is done and working and in the consoles, and RDNA3 is being worked on. Nothing.. and I mean nothing technically is preventing Dr. Su from announcing on July 7th... that Big Navi is an early 2020 release.... and RDNA3 is late next year.
There may be technical constraints in that we don't know whether the next design has been fully (or at least satisfactorily, given some of the bugs we see with GFX10) bug-fixed.
Beyond that, if the consoles are RDNA2 or based on something similar, AMD may be barred from fully announcing its own product if the semicustom clients are using some of those new features in their products. It happened with Bonaire and the current-gen consoles.

Is there any info on RDNA's concurrent instruction execution capabilities?
Can it execute scalar instructions in parallel with vector ones?
I didn't see a specific mention of this, although it looks like there's still the possibility of instructions of different types to be issued in parallel from different wavefronts like in prior GCN.
AMD's slide with a sample set of instructions is not iron-clad evidence that there isn't dual-issue within a wavefront, but it seems consistent with the one-instruction per clock per wavefront scheme.
 
I didn't see a specific mention of this, although it looks like there's still the possibility of instructions of different types to be issued in parallel from different wavefronts like in prior GCN.
GCN was even more constrained in this regard, with its single scheduler unit with 1-wave-per-clock issue throughput feeding 4 SIMDs and 1 scalar unit.

AMD's slide with a sample set of instructions is not iron-clad evidence that there isn't dual-issue within a wavefront, but it seems consistent with the one-instruction per clock per wavefront scheme.
I am pretty sure they would mention super-scalar execution if it was there
 
I am asking because Turing, with 4 schedulers per SM, can theoretically execute up to 3 types of instructions for 3 warps in parallel on different units, let's say FP32 + 2xFP16 or INT32 in an interleaved fashion with constant overlap (hence concurrent execution) + SFU, which processes a warp in 8 cycles (if it's a decoupled unit).
 