AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Why can't we? (Genuine question, I don't see why from the diagram.)
(image: architecture feature slide)

Not a diagram ;)

I also missed the 2nd last bullet point.
 
(image: GPU block diagram)


Looking at this, it looks pretty much the same as Navi 10, except there is more cache and there are more CUs. The shader engines still have the same structure. We can maybe assume RDNA2 cards will be very similar.
What else did you expect from a high-level block diagram?
 
What else did you expect from a high-level block diagram?
Maybe more shader engines or more WGPs per SE. Or some bigger change, like moving some of the functions into or out of the WGP. There were rumors that RDNA2 was a big departure from previous architectures, but this pretty much confirms that there's nothing big that changed at the high level.

It makes it interesting to guess what AMD even changed to get better IPC from RDNA1 to RDNA2.
 
(image: architecture feature slide)

I've been comparing this slide against the RDNA 1 whitepaper, and if you're loose with terminology I think everything is exactly the same.

But if you want to be nitpicky, maybe the following could be different.

Launch 7 instructions/clk per CU vs 4 instructions/clk per CU

The RDNA front-end can issue four instructions every cycle to every SIMD, which include a combination of vector, scalar, and memory pipeline. The scalar pipelines are typically used for control flow and some address calculation, while the vector pipelines provide the computational throughput for the shaders and are fed by the memory pipelines.

And the biggest difference is that in the slide above they seem to have separated out 2 scalar and 2 control, as well as the vector data instruction. It's not really clear if the 2 control and the 1 vector data are also part of RDNA 1 and just not indicated. So I'm not sure whether RDNA 1 really issued only 4 instructions per CU, or whether the slide is just breaking the same thing down into 7.

Second difference I might be seeing here is the:
32 Scalar FP32 FMAD per SIMD, 128 per Dual CU with Data sharing.
It's not really clear from the RDNA 1 whitepaper whether the scalar units can perform an FMAD.
The scalar ALU accesses the scalar register file and performs basic 64-bit arithmetic operations.

Lastly, the machine learning inference bit. Possibly also available for RDNA 1.
Some variants of the dual compute unit expose additional mixed-precision dot-product modes in the ALUs, primarily for accelerating machine learning inference. A mixed-precision FMA dot2 will compute two half-precision multiplications and then add the results to a single-precision accumulator. For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations, all of which use 32-bit accumulators to avoid any overflows.
And I think MS took the option to do this. This is slightly different from just packing Int8 and Int4 into the vector registers, as listed here:
More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers
So there is support to hold them in the registers (RDNA 1), but it would appear you need a variant of the CUs that can perform some of these other 8-bit and 4-bit operations listed above. I'm going to assume it can just do rapid packed math normally.
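
To make the distinction concrete, here's a quick CPU-side C++ mock-up of what a dot4 with a 32-bit accumulator actually computes, versus merely holding packed int8 values in a 32-bit register. This is not real ISA or shader code; the packing order and function names are my own, purely for illustration.

Code:
#include <cstdint>
#include <cstdio>

// Pack four 8-bit ints into one 32-bit value, mirroring the packed vector
// register layout mentioned in the whitepaper quote above (byte order here
// is my own choice, purely for illustration).
uint32_t pack_i8x4(int8_t a, int8_t b, int8_t c, int8_t d) {
    return  uint32_t(uint8_t(a))        |
           (uint32_t(uint8_t(b)) <<  8) |
           (uint32_t(uint8_t(c)) << 16) |
           (uint32_t(uint8_t(d)) << 24);
}

// Emulate a dot4: multiply four int8 pairs and add the results into a 32-bit
// accumulator, which is wide enough that the sum cannot overflow.
int32_t dot4_i8_i32(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        int8_t ai = int8_t((a >> (8 * i)) & 0xFF);
        int8_t bi = int8_t((b >> (8 * i)) & 0xFF);
        acc += int32_t(ai) * int32_t(bi);
    }
    return acc;
}

int main() {
    uint32_t a = pack_i8x4(1, -2, 3, -4);
    uint32_t b = pack_i8x4(5, 6, -7, 8);
    // 1*5 + (-2)*6 + 3*(-7) + (-4)*8 = -60
    printf("%d\n", dot4_i8_i32(a, b, 0));
    return 0;
}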

Some of the bullet points here:
  • Unified Geometry Engine
  • Distributed Primitives and Rasterization is in RDNA 1
The primitive units assemble triangles from vertices and are also responsible for fixed-function tessellation. Each primitive unit has been enhanced and supports culling up to two primitives per clock, twice as fast as the prior generation. One primitive per clock is output to the rasterizer. The work distribution algorithm in the command processor has also been tuned to distribute vertices and tessellated polygons more evenly between the different shader arrays, boosting throughput for geometry.
  • Mesh Shading Geometry Engine is RDNA 2
  • Multi-Core Command Processor > No indication on RDNA 1 about multi-core
 
Maybe Microsoft will add ML resolution scaling to DirectX. It would be good to have something for it that is hardware agnostic at the API level. Since they are likely just using the shaders to do the inference, without tensor cores, it could probably work on any GPU; it just depends on the performance they can get out of the hardware for inference while still running the game on the GPU.

Is Microsoft even in the algorithm business? Modern DirectX defines a high level workflow, interfaces and data structures but most logic is left up to the game programmer. I would be surprised if DirectX ships with any sort of upscaling implementation. That would be like MS providing their own depth of field shader.
 
Multi-Core Command Processor > No indication on RDNA 1 about multi-core
AMD's explanation of ACE/HWS in the open for OSS Linux drivers partly answered this question.

Ever since the first eight-"ACE" GPU, all GPUs have come with two "MEC" microcontroller cores, each of which has four "pipes" (i.e., quad-threaded, probably temporal). The initial iterations had all 8 pipes configured as "ACE"s, while later GPUs reappropriated some pipes for GPU multi-process scheduling with support for user-mode queue oversubscription (aka "HWS").

Not much information that I know of on the graphics CP, though, which has its own core(s?).
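
For intuition on the oversubscription part, here's a purely conceptual C++ toy: many user-mode queues time-sliced onto a fixed set of pipes. The round-robin policy and the counts are stand-ins of mine, not the actual firmware/driver scheduling logic.

Code:
#include <cstdio>
#include <vector>

int main() {
    // Conceptual toy only: more user-mode queues than hardware pipes, so the
    // pipes have to be time-sliced between them ("oversubscription").
    // 2 microengines x 4 pipes = 8, per the description above; the queue
    // count and the plain round-robin policy are made-up stand-ins.
    const int kPipes = 8;
    const int kUserQueues = 20;

    std::vector<int> owner(kPipes, -1);
    for (int window = 0; window < 3; ++window) {
        for (int p = 0; p < kPipes; ++p)
            owner[p] = (window * kPipes + p) % kUserQueues;  // rotate ownership each window

        printf("window %d:", window);
        for (int p = 0; p < kPipes; ++p)
            printf(" pipe%d<-q%d", p, owner[p]);
        printf("\n");
    }
    return 0;
}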
 
AMD's explanation of ACE/HWS in the open for OSS Linux drivers partly answered this question.

Ever since the first eight-"ACE" GPU, all GPUs have come with two "MEC" microcontroller cores, each of which has four "pipes" (i.e., quad-threaded, probably temporal). The initial iterations had all 8 pipes configured as "ACE"s, while later GPUs reappropriated some pipes for GPU multi-process scheduling with support for user-mode queue oversubscription (aka "HWS").

Not much information that I know of on the graphics CP, though, which has its own core(s?).
Yea that wasn't clear to me either.
RDNA continues to separate the ACEs for the compute pipeline and the GCP for the 3D graphics pipeline; that's about all I can gather from the RDNA whitepaper, unfortunately.
I'm not sure where the Mesh Shading Engine would fit in there; I suppose on the GCP side of things.

I also don't know if the multi-core GCP thing is an MS thing. I suspect that if it were, they would have mentioned it. They have been really focused on customizing their GCP for the last 2 iterations of Xbox, so this might be an evolution of what they learned.
 
I'm not sure where the Mesh Shading Engine would fit in there; I suppose on the GCP side of things.
It is an optional cog in the graphics pipeline machine after all, like the tessellation DLC, so it can hardly escape the Graphics CP.

I also don't know if the multi-core GCP thing is a MS thing.
Doesn't seem like they are explicitly claiming the "Graphics Command Processor" is multi-core, unless I have missed something.
 
It is an optional cog in the graphics pipeline machine after all, like the tessellation DLC, so it can hardly escape the Graphics CP.


Doesn't seem like they are explicitly claiming the "Graphics Command Processor" is multi-core, unless I have missed something.
Is there a generic command processor that is separate from the GCP?
I was just assuming they were the same thing.
 
Is there a generic command processor that is separate from the GCP?
I was just assuming they were the same thing.
Given the subtle divergence in terminology, and the absence of ACE/HWS as freestanding colorful blocks, I wouldn't be surprised that "Multi-core Command Processor" is meant to refer to all blocks that eat PM4/AQL packets.
 
Given the subtle divergence in terminology, and the absence of ACE/HWS as freestanding colorful blocks, I wouldn't be surprised that "Multi-core Command Processor" is meant to refer to all blocks that eat PM4/AQL packets.
@Rys can you shed any light on this aspect here? I read some of your article on context rolls here, but I don't get it all, and I'm not sure if CP and GCP are being used interchangeably.
https://gpuopen.com/learn/understanding-gpu-context-rolls/
 
I also missed the 2nd last bullet point.
There are now two distinctions with RDNA 2's RT acceleration:

1- It can't accelerate BVH traversal, only ray intersections; traversal is performed by the shader cores.
2- Ray intersection is shared with the texture units.

In comparison, Turing RT cores:
1- Accelerate BVH traversal on their own
2- Ray intersection is independent and is not shared with anything else

So in a sense RDNA2's solution is hybrid, as it is shared between both textures and shaders, compared to Turing's solution.
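
A rough C++-flavoured sketch of the split, just to show where the work lands in each design. None of these are real APIs or instructions; every name below is invented for illustration, and the BVH handling is stubbed out.

Code:
#include <cstdio>

// None of these are real APIs or instructions; every name here is invented
// purely to illustrate where the work lands in each design.

struct Ray  { float origin[3]; float dir[3]; };
struct Node { int first_child; int prim_count; };
struct Hit  { float t = 1e30f; bool valid = false; };

// Stand-in for the per-node box/triangle intersection test that RDNA 2 runs
// in the shared texture/RT block (stubbed out here).
Hit intersect_node(const Node&, const Ray&) { return Hit{}; }

// Stand-in for a Turing-style RT core that walks the whole BVH by itself.
Hit traverse_whole_bvh(const Ray&) { return Hit{}; }

Hit trace_rdna2_style(const Ray& ray) {
    // Traversal (stack management, picking which child to visit next) runs on
    // the shader core; only the per-node intersection test is offloaded.
    Node stack[64]; int sp = 0;
    stack[sp++] = Node{0, 0};                    // push the root
    Hit best;
    while (sp > 0) {
        Node node = stack[--sp];
        Hit h = intersect_node(node, ray);       // "hardware" step
        if (h.valid && h.t < best.t) best = h;   // shader-side bookkeeping
        // (pushing child nodes is omitted in this stub)
    }
    return best;
}

Hit trace_turing_style(const Ray& ray) {
    // The RT core does both traversal and intersection on its own and hands
    // the final hit back to the shader.
    return traverse_whole_bvh(ray);
}

int main() {
    Ray r{{0.f, 0.f, 0.f}, {0.f, 0.f, 1.f}};
    printf("%d %d\n", trace_rdna2_style(r).valid, trace_turing_style(r).valid);
    return 0;
}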
 
There are now two distinctions with RDNA 2's RT acceleration:

1- It can't accelerate BVH traversal, only ray intersections; traversal is performed by the shader cores.
2- Ray intersection is shared with the texture units.

In comparison, Turing RT cores:
1- Accelerate BVH traversal on their own
2- Ray intersection is independent and is not shared with anything else

So in a sense RDNA2's solution is hybrid, as it is shared between both textures and shaders, compared to Turing's solution.

And as a result, can I infer that RT performance might scale very differently on RDNA2 vs Turing?
RT perf on RDNA2 should be closely tied to the number of texture units and shader cores, whereas on Turing RT perf is more closely determined by the number/amount of dedicated RT resources?

So if you're not gonna use RT, all that RT hardware on Turing is a waste; if you choose not to do RT on RDNA2, you get more texture units to use..
(assuming an RDNA2 core has more texture units than a Turing core, to compensate for the loss due to RT usage.)
 
(image: GPU block diagram)


Looking at this, it looks pretty much the same as Navi 10, except there is more cache and there are more CUs. The shader engines still have the same structure. We can maybe assume RDNA2 cards will be very similar.

One extra box that is shown is the Shader Input block that is per-SE. This could be the SPI block, and while it seems likely that Navi has a similar arrangement, it is something not diagrammed for RDNA. This could point to one reason why the Shader Engine exists as an entity even though it seemed like almost all the hardware in it had been made per-shader-array: wavefront launch would be a resource shared between the shader arrays in an SE.
Another element is the capacity and number of arrows into the L2. That could point to 20 L2 slices, although another diagram only had 10 fabric links going to the L2.
I'd ask whether there are more than 16 slices, and if the L1s can request more than 4 accesses per cycle. More than 16 could give more bandwidth internally, but not if the L1s cannot make more requests than they already do.
What stands out to me is that if there are 20 slices, the so-called "Big Navi" leak would indicate an L2 with fewer slices, despite having a wider GDDR6 bus.
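
For what it's worth, the counting exercise behind that question looks like this; all figures below are assumptions taken from the Navi 10 / RDNA 1 configuration (four graphics L1s, 4 requests per clock each), not anything confirmed for this chip.

Code:
#include <cstdio>

int main() {
    // All of these numbers are assumptions, not specs for this chip.
    const int l1_caches        = 4;   // one graphics L1 per shader array, as in Navi 10
    const int requests_per_l1  = 4;   // assumed L1 -> L2 requests per clock (the question above)
    const int slices_navi10    = 16;
    const int slices_if_twenty = 20;  // if the 20 arrows into the L2 really are slices

    int peak_requests = l1_caches * requests_per_l1;  // 16 requests per clock
    printf("peak L1->L2 requests/clk: %d\n", peak_requests);
    printf("slices that can be hit per clk: %d of %d, or %d of %d\n",
           peak_requests < slices_navi10 ? peak_requests : slices_navi10, slices_navi10,
           peak_requests < slices_if_twenty ? peak_requests : slices_if_twenty, slices_if_twenty);
    // So extra slices only add internal bandwidth if the L1s can actually
    // issue more than 16 requests per clock between them.
    return 0;
}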


(image: architecture feature slide)

I've been comparing this slide against the RDNA 1 whitepaper, and if you're loose with terminology I think everything is exactly the same.

But if you want to be nitpicky, maybe the following could be different.

Launch 7 instructions/clk per CU vs 4 instructions/clk per CU
A number of the RDNA instruction throughput claims are per-SIMD, and there is a diagram with 4 instruction types being considered for issue per SIMD. With two SIMDs per CU that would be 8, although one of the types is vector memory, which contends for the same MEM/TEX block, so a figure of 7 per CU isn't necessarily out of line.
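
Spelling that arithmetic out (my reading of the diagram, not an official breakdown):

Code:
#include <cstdio>

int main() {
    // My reading of the numbers above, not an official breakdown.
    const int issue_types_per_simd = 4;  // instruction types considered for issue per SIMD (RDNA diagram)
    const int simds_per_cu         = 2;

    int raw_slots_per_cu = issue_types_per_simd * simds_per_cu;  // 4 x 2 = 8
    int shared_mem_tex   = 1;  // both SIMDs contend for one MEM/TEX block

    printf("raw issue slots per CU: %d\n", raw_slots_per_cu);
    printf("after sharing MEM/TEX: ~%d\n", raw_slots_per_cu - shared_mem_tex);
    return 0;
}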


Second difference I might be seeing here is the:
32 Scalar FP32 FMAD per SIMD, 128 per Dual CU with Data sharing.
It's not really clear from the RDNA 1 whitepaper whether the scalar units can perform an FMAD.
I think scalar in this case is the regular math op in the SIMD, rather than packed instructions or some kind of matrix/tensor operation.

And as a result, can I infer that RT performance might scale very differently on RDNA2 vs Turing?
RT perf on RDNA2 should be closely tied to the number of texture units and shader cores, whereas on Turing RT perf is more closely determined by the number/amount of dedicated RT resources?

So if you're not gonna use RT, all that RT hardware on Turing is a waste; if you choose not to do RT on RDNA2, you get more texture units to use..
(assuming an RDNA2 core has more texture units than a Turing core, to compensate for the loss due to RT usage.)
A Turing SM has 4 texture units. A CU has a texture block with 4 texture filtering units.
I don't interpret the slides as indicating the ray-tracing hardware can do additional texturing, and so it seems both architectures have an RT block and 4 texture units per SM/CU. What each block can do or how their functions interact with other work would need to be evaluated.
 
And as a result, can I infer that RT performance might scale very differently on RDNA2 vs Turing?
RT perf on RDNA2 should be closely tied to the number of texture units and shader cores, whereas on Turing RT perf is more closely determined by the number/amount of dedicated RT resources?

So if you're not gonna use RT, all that RT hardware on Turing is a waste; if you choose not to do RT on RDNA2, you get more texture units to use..
(assuming an RDNA2 core has more texture units than a Turing core, to compensate for the loss due to RT usage.)

Both implementations scale the same: more compute units, more performance. The difference is that nVidia's RT cores do more work and are fully independent from the other units.
 