AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Status
Not open for further replies.
Hmmm I’m surprised that the 256-bit bus is real. Maybe there is something to this magical cache after all.

Does sure seem like it. I'm a bit surprised by the 16GB on the Navi 21 XL though. My math says it should have approximately 72% of the TFLOPS of the XT. Surely a 192-bit 12 GB configuration (75% of the XT) would have been better suited (and cheaper). Another rumour suggested 80, 72, 72 CUs as the stack instead of 80, 72, 64.
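For what it's worth, the ~72% figure falls out of back-of-envelope math. A minimal Python sketch: the CU counts come from the rumoured 80/72/64 stack, but both clocks are pure assumptions on my part:

```python
# Back-of-envelope TFLOPS: 64 FP32 lanes per CU, 2 ops/clock (FMA).
# CU counts follow the rumoured stack; both clocks are assumptions.
def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000

xt = tflops(80, 2.2)  # assumed Navi 21 XT game clock
xl = tflops(64, 2.0)  # assumed Navi 21 XL game clock
print(f"XT ~ {xt:.1f} TF, XL ~ {xl:.1f} TF, XL/XT ~ {xl / xt:.0%}")
```

With those assumed clocks the ratio lands near the ~72% mentioned; different clock assumptions shift it a few points either way.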
 
Does sure seem like it. I'm a bit surprised by the 16GB on the Navi 21 XL though. My math says it should have approximately 72% of the TFLOPS of the XT. Surely a 192-bit 12 GB configuration (75% of the XT) would have been better suited (and cheaper). Another rumour suggested 80, 72, 72 CUs as the stack instead of 80, 72, 64.
The prospect of AMD using a cut-down ~536mm² Navi 21 to fight 3070 (~394mm²) is sad, isn't it?
 
The prospect of AMD using a cut-down ~536mm² Navi 21 to fight 3070 (~394mm²) is sad, isn't it?

Well if they are doing so with salvaged dies because of yield reasons, it's just par for the course. I'd expect the XL to still beat the 3070 though, and with slightly lower power consumption so it should still come out ahead in the fight.
 
I've changed my mind: the memory is crucial to be able to compete. Navi 21: 16GB, Navi 22: 12GB (10 too?) and Navi 23: 8/4GB. Navi 22 and 23 should be expected to be more tightly targeted in their die size and memory configuration, since they'll be $400 down to $200 in $50 steps.

The 5700 XT's replacement should be Navi 22: 50% more memory than the XT and 30%+ more performance.
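Those GB-per-bus-width pairings follow directly from GDDR6 chip densities. A small sketch (single-rank configurations only; a 10GB option would need a 160-bit bus or mixed densities, which this ignores):

```python
# GDDR6: one chip per 32-bit channel, chip densities of 8 Gb (1 GB)
# or 16 Gb (2 GB). Single-rank configurations only.
def capacities_gb(bus_bits: int) -> tuple[int, int]:
    chips = bus_bits // 32
    return (chips * 1, chips * 2)

for bus in (256, 192, 128, 64):
    lo, hi = capacities_gb(bus)
    print(f"{bus}-bit -> {lo} or {hi} GB")
```

Which lines up with the rumoured stack: 256-bit/16GB Navi 21, 192-bit/12GB Navi 22, 128-bit/8 or 4GB Navi 23.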
 
I think the cut-down bus is something especially targeted at the mobile market: a smaller bus width will definitely be a plus factor for OEMs.
 
I guess the 256b is the crucial factor. It's a make it or break it situation.

The new HUGEass cache & narrow bus approach can either:
A) bring positive innovation, like notable savings in power/area or frequency gains
B) stratify workloads, or even games/engines, into "runs well from cache" and "behaves like a mainstream 256b card using 300+W"

Dunno where to place my bet. Given the Vega experience it would surely be B). But this is 2nd gen Navi, so A), perhaps?
 
Regarding the ongoing thread about the power usage though, I don't see anything contentious... As the patent describes, the intersection engine is basically an alternative path to texture filtering, operating on packed BVH node data. Ray-box and ray-tri testing seem like quite straightforward logic, so likely no "power drainage" is to be expected... At worst the CU can issue a bunch of intersections, issue a vmcnt wait and eventually clock-gate the ALU datapaths, if no other kinds of kernels are running in parallel.
A vmcnt would apply to that specific wavefront, but it would have no influence on any other wavefronts on the CU. Those other wavefronts could be from other kernels, or part of a multi-wavefront workgroup. The wavefront itself seems like it could be generating additional memory traffic as a result of processing the data returned by BVH instructions, if the process is meant to be programmable between the node/box evaluation stages of AMD's algorithm. Restricting all those possible non-RT use cases would be a very large restriction on the concurrency and latency-hiding of the CU. There's also another SIMD in the CU, whose interference cannot be isolated unless it were left vacant.
Implementing such a monopoly would require the driver/GPU to be aware that RT was occurring and to vacate a CU in order to give it exactly one wavefront, or to set up a clause where a given wavefront blocks other wavefronts from using the vector memory path. The variable number of misses to memory for a complex operation like RT could leave the CU with swathes of idle time if that happened, and it may not be a good fit for that mode anyway, since the RT method assumes the SIMD is performing various non-memory operations to calculate the payload for the next BVH instruction, which would break a clause immediately.
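The latency-hiding argument can be illustrated with a toy round-robin issue model. Nothing here reflects real RDNA timings or scheduling; it just shows why vacating a CU down to one wavefront wastes issue slots:

```python
# Toy model of SIMD latency hiding: while one wavefront waits on an
# outstanding memory/BVH result (think "vmcnt"), other resident
# wavefronts keep issuing. Numbers are illustrative, not RDNA timings.
MEM_LATENCY = 8  # cycles an outstanding memory op blocks its wavefront

def simulate(num_wavefronts: int, instructions_each: int) -> int:
    ready_at = [0] * num_wavefronts  # cycle at which each wavefront unblocks
    done = [0] * num_wavefronts      # instructions retired per wavefront
    cycle = 0
    while min(done) < instructions_each:
        for w in range(num_wavefronts):
            if ready_at[w] <= cycle and done[w] < instructions_each:
                done[w] += 1
                ready_at[w] = cycle + MEM_LATENCY  # every op hits memory here
                break                              # one issue per cycle
        cycle += 1
    return cycle

print(simulate(1, 10))  # one wavefront: stalls dominate
print(simulate(8, 10))  # enough wavefronts: an instruction issues every cycle
```

With one wavefront the SIMD sits idle most of the time; with eight, every cycle issues something, which is the concurrency the "monopoly" scheme would throw away.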

Nvidia presumably does the whole traversal process in fixed-function hardware, so one might argue that they could have an edge in power usage by potentially keeping the CU/SM off. But it is uncertain whether that matters given the prevalent use of async compute to fill gaps, and whether the actual saving makes a dent in overall power consumption.
The emphasis would likely be on maintaining parallel execution, and the time horizon for an RT instruction is short in terms of power gating. Perhaps clock gating could occur, but the likelihood is that something else is happening somewhere in the SM to keep things active.


It matters in case each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection HW/texture unit simply tells whether a ray intersected one or more nodes of the BVH. If this is the way it works, then it requires a constant back and forth between CUs and texture/intersection units.
The patent AMD seems to be following indicates the RT block should pass back intersection results or the pointer to the next node for each ray submitted to the unit by the SIMD. There's still significant back and forth, but at least the SIMD isn't tasked with a redundant lookup of pointer values after the RT hardware looked at the same node data and skipped over the pointer information in it.


In general CU scheduling and cache-friendly operations (such as L0s being able to snoop each other passively) appear to be part of RDNA. How much of that is new for RDNA 2, I can't tell.
At least with RDNA, snooping at the L0 is known not to happen. They are write-through, so the L2 still serves as an eventual place for visibility. LLVM changes specifically point out that the L0s in a WGP are not coherent when running in WGP mode, so kernels running across dual-CUs would need to explicitly flush the L0 at various times to keep some level of consistency if there's a chance of discrepancies between them.

The prospect of AMD using a cut-down ~536mm² Navi 21 to fight 3070 (~394mm²) is sad, isn't it?
It seems less than ideal to put a chip of that size against a smaller one, since even with perfect yields there's a reduction in the number of candidate dies available from each wafer. The maturity and cost of each node isn't clear for the comparison, however.
If we assume a large cache of some kind, the yields could be better than raw area would suggest.
Another possibility that occurred to me that could take up some area besides a cache would be some kind of on-die voltage regulation, which might be desirable for a mobile or possibly HPC solution. However, that's not something AMD's talked about that much, as a product direction.
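On the yield point, a toy Poisson defect model shows the direction of the effect. The defect density below is a made-up placeholder, not a published figure for either foundry:

```python
import math

# Toy Poisson yield model: fraction of defect-free dies for a given
# die area. d0 (defects per cm^2) is an assumed placeholder value.
def poisson_yield(area_mm2: float, d0_per_cm2: float = 0.1) -> float:
    return math.exp(-d0_per_cm2 * area_mm2 / 100)

print(f"~536 mm^2: {poisson_yield(536):.0%} defect-free")
print(f"~394 mm^2: {poisson_yield(394):.0%} defect-free")
```

The bigger die loses on both counts: fewer candidates per wafer and a lower defect-free fraction. Partially defective dies are exactly what salvage SKUs like a cut-down XL recover.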
 
Wouldn't a 40 CU Navi 22 actually be biting at the heels of the RTX3070 / 2080 Ti if it averages at 2.3GHz or more?

The rumored clocks for those cards are pretty crazy.
 
Wouldn't a 40 CU Navi 22 actually be biting at the heels of the RTX3070 / 2080 Ti if it averages at 2.3GHz or more?

The rumored clocks for those cards are pretty crazy.



A 40 CU Navi 22 at 2.3 GHz would be 11.8 TFLOPS, which is +31% over the 9 TFLOPS of the 5700 XT.

Even assuming linear scaling it won't reach the 2080 Ti, even at stock.
 
The patent AMD seems to be following indicates the RT block should pass back intersection results or the pointer to the next node for each ray submitted to the unit by the SIMD. There's still significant back and forth, but at least the SIMD isn't tasked with a redundant lookup of pointer values after the RT hardware looked at the same node data and skipped over the pointer information in it.

Are you referring to BVH or triangle intersection? Not sure I understand how you would avoid fetching the same BVH node in both the SIMD and RT unit.
 
Are you referring to BVH or triangle intersection? Not sure I understand how you would avoid fetching the same BVH node in both the SIMD and RT unit.
(disclaimer: I have no idea how it works, this is just a hypothesis)
CU sends ray data + pointer to BVH node(s) to the RT unit, which fetches the BVH node(s), performs intersections and returns to the CU intersection results + pointers to leaf nodes, so that the CU never needs to load the BVH data in the first place.
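That hypothesis sketches out naturally as a loop. A hedged Python mock-up, where `rt_unit_intersect` is a stand-in for whatever the BVH instruction actually is (not a real API), and the shader only ever touches pointers and hit records:

```python
# Mock-up of the hypothesised CU/RT-unit split: the shader keeps a
# traversal stack of node pointers; `rt_unit_intersect` stands in for
# the fixed-function step that fetches the node, runs the ray-box /
# ray-tri tests, and returns either child pointers or a hit record.
def trace(ray, root_ptr, rt_unit_intersect):
    stack = [root_ptr]
    closest = None
    while stack:
        result = rt_unit_intersect(ray, stack.pop())
        if result.is_leaf:
            if result.hit and (closest is None or result.t < closest.t):
                closest = result
        else:
            # The RT unit hands back pointers to the intersected children,
            # so the shader never loads the node data itself.
            stack.extend(result.child_ptrs)
    return closest
```

The key property is that node data only ever crosses the RT unit's path; the shader consumes results and pointers, never raw BVH bytes.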
 
(disclaimer: I have no idea how it works, this is just a hypothesis)
CU sends ray data + pointer to BVH node(s) to the RT unit, which fetches the BVH node(s), performs intersections and returns to the CU intersection results + pointers to leaf nodes, so that the CU never needs to load the BVH data in the first place.

The RT unit returns intermediate nodes too, not just leaves.
 


A 40 CU Navi 22 at 2.3 GHz would be 11.8 TFLOPS, which is +31% over the 9 TFLOPS of the 5700 XT.

Even assuming linear scaling it won't reach the 2080 Ti, even at stock.

Why only linear scaling? AMD said RDNA 2 has a performance-per-clock increase, and this chip has only 40 CUs / 20 DCUs, so it's not like it's pushing the limits.

According to TPU, the 2080 Ti is 34% ahead of the 5700 XT (index: 5700 XT = 100, 2080 Ti = 134).
Assume a game clock of 2300 MHz:

2300 / 1755 = 1.31

0% IPC gain (pure linear scaling): 100 * 1.31 = 131, ~3 points behind
5% IPC gain: 100 * 1.05 * 1.31 = 138, ~4 points ahead
10% IPC gain: 100 * 1.10 * 1.31 = 144, ~10 points ahead

So the most likely answer is @ToTTenTranz is correct and a 40 CU N22 will be somewhere around 2080 Ti levels.
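The same arithmetic, spelled out in Python with unrounded intermediates (which is why the points can drift slightly from hand-rounded figures):

```python
# Baseline index: 5700 XT = 100, 2080 Ti = 134 (per TPU's summary).
# 2300 MHz is the rumoured N22 game clock; 1755 MHz is the 5700 XT's.
ti_index = 134
clock_scale = 2300 / 1755  # ~1.31

for ipc_gain in (0.00, 0.05, 0.10):
    n22_index = 100 * (1 + ipc_gain) * clock_scale
    print(f"IPC +{ipc_gain:.0%}: {n22_index:.0f} vs {ti_index}, "
          f"{n22_index - ti_index:+.1f} points")
```

So the whole exercise swings roughly -3 to +10 points around the 2080 Ti depending on the IPC assumption, which is "somewhere around 2080 Ti levels" either way.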
 
Why only linear scaling? AMD said RDNA 2 has a performance-per-clock increase, and this chip has only 40 CUs / 20 DCUs, so it's not like it's pushing the limits.

According to TPU, the 2080 Ti is 34% ahead of the 5700 XT (index: 5700 XT = 100, 2080 Ti = 134).
Assume a game clock of 2300 MHz:

2300 / 1755 = 1.31

0% IPC gain (pure linear scaling): 100 * 1.31 = 131, ~3 points behind
5% IPC gain: 100 * 1.05 * 1.31 = 138, ~4 points ahead
10% IPC gain: 100 * 1.10 * 1.31 = 144, ~10 points ahead

So the most likely answer is @ToTTenTranz is correct and a 40 CU N22 will be somewhere around 2080 Ti levels.

We do have to keep in mind that memory bandwidth is going to play a part at 4K (even with the cache), and the 5700 XT also drops off in performance at 4K. N22 should be competitive with the 3070/2080 Ti at 1440p and is likely going to be positioned more as a 1440p card than a 4K one. They have N21 for that.
 
Are you referring to BVH or triangle intersection? Not sure I understand how you would avoid fetching the same BVH node in both the SIMD and RT unit.
The patent indicated that the RT hardware would pass back what the algorithm considered relevant for further BVH instructions. That would at least mean intersection results if the rays in a given wavefront's group reached leaf nodes, or metadata indicating how traversal needed to continue and the pointers to be fed into the next BVH instruction.

Whether the actual instructions fully match that will hopefully be detailed once the architecture is fully published, but at least in theory it seemed like the ray tracing hardware would make the evaluation of the next step in the traversal process and give that recommendation to the SIMD.
In theory, involving the SIMD could mean there's the possibility for the programmable portion to not follow those recommendations, or use additional data to control the evaluation. Such a change wouldn't be necessary in the default case. At least some of Nvidia's RT method can be substituted with custom shaders for intersections, though at least with Turing the recommendation for performance was to keep to the built-in methods.
 