Regarding the ongoing thread about the power usage though, I don't see anything contentious... As the patent describes, the intersection engine is basically an alternative path to texture filtering, operating on packed BVH node data. Ray-box and ray-tri testing seem quite straightforward logic, so likely no "power drainage" to be expected... At worst the CU can issue a bunch of intersections, issue a vmcnt wait, and eventually clock-gate the ALU datapaths if no other kernels are running in parallel.
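To give a sense of how lightweight the ray-box test is, here's a minimal slab-test sketch in C++. The struct layouts and the helper are purely illustrative, not AMD's packed node format; the real hardware evaluates several boxes per node in one go.

```cpp
#include <algorithm>
#include <cstdio>

struct Ray { float ox, oy, oz; float idx, idy, idz; };  // origin + reciprocal direction
struct Box { float minx, miny, minz, maxx, maxy, maxz; };

// Classic slab test: per-axis interval overlap, just a handful of
// multiplies and min/max ops -- cheap, fixed-latency logic.
bool rayBoxHit(const Ray& r, const Box& b, float tMax) {
    float t0x = (b.minx - r.ox) * r.idx, t1x = (b.maxx - r.ox) * r.idx;
    float t0y = (b.miny - r.oy) * r.idy, t1y = (b.maxy - r.oy) * r.idy;
    float t0z = (b.minz - r.oz) * r.idz, t1z = (b.maxz - r.oz) * r.idz;
    float tNear = std::max({std::min(t0x, t1x), std::min(t0y, t1y), std::min(t0z, t1z), 0.0f});
    float tFar  = std::min({std::max(t0x, t1x), std::max(t0y, t1y), std::max(t0z, t1z), tMax});
    return tNear <= tFar;
}

int main() {
    Ray r{0, 0, 0, 1, 1e30f, 1e30f};               // ray along +x (reciprocal dir)
    Box b{1, -1, -1, 2, 1, 1};
    printf("hit: %d\n", rayBoxHit(r, b, 100.0f));  // expect 1
}
```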
A vmcnt wait would apply to that specific wavefront, but it would have no influence on any other wavefronts on the CU. Those other wavefronts could be from other kernels, or part of a multi-wavefront workgroup. The wavefront itself could also be generating additional memory traffic as a result of processing the data returned by BVH instructions, if the process is meant to be more programmable between the node/box evaluation stages of AMD's algorithm. Blocking all those possible non-RT uses would be a very large restriction on the concurrency and latency hiding of the CU. There's also another SIMD in the CU, whose interference cannot be isolated unless it were left vacant.
Implementing such a monopoly would require the driver/GPU to be aware that RT was occurring and to vacate a CU in order to give it exactly one wavefront, or to set up a clause in which a given wavefront blocks other wavefronts from using the vector memory path. The variable number of misses to memory for a complex operation like RT could leave the CU with swathes of idle time if that happened, and RT may not be a good fit for clauses anyway, since the method assumes the SIMD is performing various non-memory operations to calculate the payload for the next BVH instruction, which would break a clause immediately.
Nvidia presumably does the whole traversal process in fixed-function hardware, so one might argue that they could have an edge in power usage by potentially keeping the CU/SM off. But it is uncertain whether that matters with the prevalent use of async compute to fill gaps, and whether the actual saving makes a dent in overall power consumption.
The emphasis would likely be on maintaining parallel execution, and the time horizon for an RT instruction is too short for power gating to pay off. Perhaps clock gating could occur, but the likelihood is that something else is happening somewhere in the SM to keep things active anyway.
It matters if each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection HW/texture unit simply tells you whether a ray intersected one or more nodes of the BVH. If this is the way it works, then it requires a constant back and forth between the CUs and the texture/intersection units.
The patent AMD seems to be following indicates the RT block should pass back intersection results or the pointer to the next node for each ray submitted to the unit by the SIMD. There's still significant back and forth, but at least the SIMD isn't tasked with a redundant lookup of pointer values after the RT hardware looked at the same node data and skipped over the pointer information in it.
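To make that division of labor concrete, here's a rough C++ sketch of the loop as I read the patent. `intersectNode` is a hypothetical stand-in for the hardware instruction, and the node/result layout is invented for illustration; the patent only says the unit returns hit results or next-node pointers, not this exact shape.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical result record handed back by the RT unit per step.
struct NodeResult {
    uint32_t children[4];  // pointers to child nodes the ray intersected
    int      numChildren;
    bool     isLeafHit;    // triangle hit found at a leaf
    float    tHit;
};

// Stand-in for the fixed-function box/triangle test. In hardware this
// would be a memory instruction issued down the texture/intersection path.
NodeResult intersectNode(uint32_t nodePtr) {
    return {{0, 0, 0, 0}, 0, true, 42.0f};  // stub so the sketch compiles
}

// Shader-side traversal: the SIMD owns the stack and the pointer chasing,
// but never has to re-parse node data the RT unit already looked at.
float traverse(uint32_t rootPtr) {
    std::vector<uint32_t> stack{rootPtr};
    float closest = 1e30f;
    while (!stack.empty()) {
        uint32_t node = stack.back();
        stack.pop_back();
        NodeResult r = intersectNode(node);       // hardware does this step
        if (r.isLeafHit) {
            closest = std::min(closest, r.tHit);  // shader keeps the payload
            continue;
        }
        for (int i = 0; i < r.numChildren; ++i)   // shader chases the pointers
            stack.push_back(r.children[i]);
    }
    return closest;
}

int main() {
    printf("closest t = %f\n", traverse(0));  // stubbed: prints 42.0
}
```

Every `intersectNode` call is one round trip between the SIMD and the intersection unit, which is exactly the back and forth being described above.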
In general CU scheduling and cache-friendly operations (such as L0s being able to snoop each other passively) appear to be part of RDNA. How much of that is new for RDNA 2, I can't tell.
At least with RDNA, snooping at the L0 is known not to happen. They are write-through, so the L2 still serves as an eventual place for visibility. LLVM changes specifically point out that the L0s in a WGP are not coherent when running in WGP mode, so kernels running across dual-CUs would need to explicitly flush the L0 at various times to keep some level of consistency if there's a chance of discrepancies between them.
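As a concrete illustration of why that matters, here's a minimal HIP (C++) sketch of a producer/consumer handoff between two wave32 wavefronts of one workgroup, which in WGP mode can land on different CUs with separate L0s. The fence placement is just the standard release/acquire pattern; the comment about where the L0 invalidate gets emitted reflects my reading of the LLVM notes, not verified disassembly.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// One workgroup, two wave32 wavefronts. In WGP mode these can execute on
// different CUs of the WGP, each with its own non-coherent L0.
__global__ void handoff(int* data, int* flag, int* out) {
    if (threadIdx.x == 0) {                 // wave 0: producer
        data[0] = 42;                       // write-through store (L0 -> L2)
        __threadfence();                    // release: order the store before the flag
        atomicExch(flag, 1);                // publish
    } else if (threadIdx.x == 32) {         // wave 1: consumer
        while (atomicAdd(flag, 0) == 0) {}  // spin until published
        __threadfence();                    // acquire: in WGP mode the compiler has
                                            // to invalidate L0 here, since the other
                                            // wave's L0 is never snooped
        *out = data[0];
    }
}

int main() {
    int *d, *f, *o;
    hipMalloc(&d, sizeof(int));
    hipMalloc(&f, sizeof(int));
    hipMalloc(&o, sizeof(int));
    hipMemset(f, 0, sizeof(int));
    handoff<<<1, 64>>>(d, f, o);
    int result = 0;
    hipMemcpy(&result, o, sizeof(int), hipMemcpyDeviceToHost);
    printf("out = %d\n", result);  // expect 42
    hipFree(d); hipFree(f); hipFree(o);
}
```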
The prospect of AMD using a cut-down ~536mm² Navi 21 to fight 3070 (~394mm²) is sad isn't it?
It seems less than ideal to pit a chip of that size against a smaller one, since even with perfect yields there's a reduction in candidate dies available from each wafer. The maturity and cost of each node aren't clear for the comparison, however.
If we assume a large cache of some kind, the yields could be better than raw area would suggest, since SRAM arrays are typically built with redundant rows/columns that allow many defects in the cache portion to be repaired rather than killing the die.
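Rough numbers on what die size alone costs, using the usual first-order dies-per-wafer approximation and a Poisson yield model. The defect density is a placeholder I made up, since real N7 numbers aren't public, and the model overstates the penalty if a big chunk of the die is repairable cache.

```cpp
#include <cmath>
#include <cstdio>

const double kPi = 3.14159265358979;

// First-order gross-dies-per-wafer approximation for a 300mm wafer:
// wafer area over die area, minus an edge-loss term.
int grossDies(double dieAreaMM2, double waferDiaMM = 300.0) {
    double r = waferDiaMM / 2.0;
    return (int)(kPi * r * r / dieAreaMM2 -
                 kPi * waferDiaMM / std::sqrt(2.0 * dieAreaMM2));
}

// Poisson yield model: probability a die has zero defects.
double poissonYield(double dieAreaMM2, double defectsPerMM2) {
    return std::exp(-dieAreaMM2 * defectsPerMM2);
}

int main() {
    const double d0 = 0.001;  // placeholder: 0.1 defects/cm^2, purely illustrative
    const double areas[] = {394.0, 536.0};
    for (double area : areas) {
        int gross = grossDies(area);
        double y = poissonYield(area, d0);
        printf("%4.0f mm^2: %3d gross dies, %.0f%% yield -> ~%d good dies\n",
               area, gross, 100.0 * y, (int)(gross * y));
    }
}
```

With these made-up inputs the bigger die ends up with roughly a third fewer good dies per wafer before any binning, which is the gap a salvage SKU would be absorbing.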
Another possibility that occurred to me, which could take up some area besides a cache, would be some kind of on-die voltage regulation; that might be desirable for a mobile or possibly HPC solution. However, that's not something AMD has talked about much as a product direction.