AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Maybe more shader engines or more workgroups per SE, or they could change something big like moving some of the functions into or out of the WG. There were rumors that RDNA2 was a big departure from previous architectures, but this pretty much confirms that there's nothing big that changed at the high level.

It makes it interesting to guess what AMD even changed to get better IPC from RDNA1 to RDNA2.

Maybe it's a big departure in terms of how each ALU is built, re-arranging the gate configurations for better instruction throughput. The instructions per cycle issued to each compute unit have gone from up to 4 to up to 7. So one supposes either there was some limitation with instruction issue, or instructions now retire faster so the limitation can be lifted usefully. Either way, that suggests an efficiency gain already.
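Just to put rough numbers on that issue-rate change (a back-of-the-envelope sketch using the figures above and the 1.825GHz XSX clock mentioned further down, so peak rates only, nothing sustained):

```python
# Peak instruction issue per CU, using the up-to-4 vs up-to-7 per-cycle
# figures above and the 1.825 GHz Series X GPU clock. Purely theoretical.
clock_ghz = 1.825
for issue_width in (4, 7):
    print(f"{issue_width} instr/clk -> ~{issue_width * clock_ghz:.1f} Ginstr/s per CU (peak)")
```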

As for the difference in CUs per engine, well, I don't know what to make of it because it could be a Series X-only decision. The memory bus is unified, obviously, so one assumes it's more about scheduling and whatnot; it may not have any effect on other RDNA2 dies. Though heck, maybe AMD will use GDDR6X as well and just throw 7 DCUs per engine; the bandwidth should be there if the clockspeed is kept within reason. Sidenote: thank you to the MS employee who agreed with me on that terminology, fucking shoutout for DCU/Double Compute Unit.

Anyway, the only other interesting bit is the cost per mm^2 of the die. The die sizes for both consoles are within the ballpark of the previous gen, and the PS4 die was estimated at $100. With inflation that's $111, but I wonder how much more it costs? This estimates the cost per mm^2 at a bit more than 150% relative to the previous node. Do the dies cost $170 or so, then? Whatever, not directly relevant to the RDNA2 arch, but it shows that with the dies not getting super small or anything, RDNA2 isn't going to be any cheaper than RDNA1, though that wasn't hard to guess.
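Spelling that guess out (all of these are the estimates above, nothing official):

```python
# Back-of-the-envelope die cost, using the numbers quoted above: a ~$100
# PS4 die estimate, ~11% inflation since then, and "a bit more than 150%"
# relative cost per mm^2 for the new node at a roughly equal die size.
ps4_die_cost = 100           # USD, estimated
inflation = 1.11             # ~2013 -> 2020
relative_cost_per_mm2 = 1.5  # the linked estimate's relative cost figure

console_soc_cost = ps4_die_cost * inflation * relative_cost_per_mm2
print(f"~${console_soc_cost:.1f} per console SoC die")  # ~$166.5, i.e. "$170 or so"
```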

Most of the rest of the stuff seems to be already known, or too hard to parse. "Rays per second max" versus Nvidia's arbitrary "about this many rays" is, like, whatever. Even looking at this and trying to work it out made me shake my head; too many unknowns. As for some of the other bulletpoints, rather boring. I guess the new HDR format is neat for obsessively chasing 120Hz, but otherwise it's rather useless. Please don't force-correlate my color channels with a shared exponent, that's just weird. Besides, there's already usage of extended-precision tricks on the 10-bit mantissa of 16-bit floats; losing another bit isn't helping.
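For anyone wondering why a shared exponent "correlates" the channels, here's a rough sketch of the classic RGB9E5-style layout. This is not necessarily the exact format MS showed, and the packing below is simplified with no edge-case handling; it's just to show that the brightest channel picks the exponent and the dimmer channels get quantised with its scale:

```python
import math

MANT_BITS, BIAS = 9, 15  # RGB9E5-style: 9-bit mantissas, 5-bit shared exponent

def pack_rgb9e5(r, g, b):
    # The shared exponent is chosen from the brightest channel...
    max_c = max(r, g, b, 2.0 ** -BIAS)
    exp = max(0, int(math.floor(math.log2(max_c))) + 1 + BIAS)
    scale = 2.0 ** (exp - BIAS - MANT_BITS)
    # ...so the dimmer channels are quantised with the bright channel's scale.
    mant = [min(int(c / scale + 0.5), (1 << MANT_BITS) - 1) for c in (r, g, b)]
    return exp, mant

def unpack_rgb9e5(exp, mant):
    scale = 2.0 ** (exp - BIAS - MANT_BITS)
    return [m * scale for m in mant]

# A bright red next to a dim blue: blue's precision is dictated by red's exponent.
exp, mant = pack_rgb9e5(1000.0, 0.0, 0.01)
print(exp, mant, unpack_rgb9e5(exp, mant))  # the dim blue collapses to 0.0
```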
 
There was a slide stating that the power consumption of the SoC is equal to 2x XBox One X, resulting in 200W TDP. Given the presence of 8c Zen 2 eating ~50W, and other XBox-specific stuff at maybe 5W, the GPU gets what's left.

This gives us 52 RDNA2 CUs at 1.825GHz using less than 150W. That's pretty good, given that the process used is only marginally better than that of the 40CU Navi 10, which carries a TDP of over 200W.
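Roughly, using the thread's own round numbers (all of which are estimates):

```python
# GPU power budget implied by the slide: ~200W for the whole SoC, minus the
# CPU and misc. allocations guessed above. All figures are rough estimates.
soc_tdp     = 200   # W, the "2x" reading from the slide
cpu_budget  = 50    # W, 8c Zen 2 (a later post puts the worst case at 57-60W)
misc_budget = 5     # W, XBox-specific blocks

gpu_budget = soc_tdp - cpu_budget - misc_budget
print(f"GPU portion: ~{gpu_budget}W for 52 CUs @ 1.825 GHz")   # ~145W
print("vs. Navi 10: 40 CUs, >200W, on a marginally worse process")
```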

What is even more weird is that the high-level architecture is like RDNA1 + RT. No major changes, apparently. So is this all the low-level circuitry magic?
 
resulting in 200W TDP
It's a bit lower than that.
Given the presence of 8c Zen 2 eating ~50W
They're statically allocated to top out at 57-60W worst case.
What is even more weird is that the high-level architecture is like RDNA1 + RT
Not precisely.
No major changes, apparently
Not talking about them yet.
So is this all the low-level circuitry magic?
A lot of it, yes.
 
...

What is even more weird is that the high-level architecture is like RDNA1 + RT. No major changes, apparently. So is this all the low-level circuitry magic?

I believe we will have to wait for the PC part; it's not out of the question that the console SoCs and the PC parts are different, even if not by much. But to be fair, I always thought magic was a big part of GPUs. I was laughed at for thinking that.
 
I believe we will have to wait for the PC part; it's not out of the question that the console SoCs and the PC parts are different, even if not by much. But to be fair, I always thought magic was a big part of GPUs. I was laughed at for thinking that.
Of course they are different. RDNA2 in consoles is a crippled version. On desktop, the cache is an order of magnitude bigger, in fact the same as the CPU's. That's why we can't extrapolate desktop RDNA2 from console RDNA2...
 
You're wrong. Cache plays a big role in TDP.
Bigger caches reduce TDP if anything.
No you can't. Stop your FUD.
You provide no technical reason as to why you can't do that. It's the same silicon process, the same design flows, the same designs, with only a minor microarchitectural unit-balance change. Extensive re-use is why AMD can even put out four different designs within the same six-month timespan in the first place.
 
There are now two distinctions with RDNA 2's RT acceleration:

1- It can't accelerate BVH traversal, only ray intersections; traversal is performed by the shader cores.
2- Ray intersection hardware is shared with the texture units.
....
  1. According to those Microsoft slides, BVH traversal is running in parallel to other shader operations - so, no utilization conflicts or "slowdown" there
  2. Yes, but where in the current DXR graphics pipeline do you need to do RT operations at the same time as texturing operations?
 
My bad, the original XBox One had 95W TDP.

175W x2 seems a bit excessive, but 250-300W looks more reasonable for the XSX. That's the total, though, so the SoC should be some 50-75W lower, I guess, which gives us what, 175-250W? So, a similar ballpark.

According to those Microsoft slides, BVH traversal is running in parallel to other shader operations
Ray intersection testing is performed in parallel; BVH traversal is handled by a shader. This is different to Turing, where BVH traversal is handled by RT h/w too.
 
Ray intersection testing is performed in parallel; BVH traversal is handled by a shader. This is different to Turing, where BVH traversal is handled by RT h/w too.
Yep, my bad

From the AMD RT patent:
A fixed function BVH intersection testing and traversal (a common and expensive operation in ray tracers) logic is implemented on texture processors. This enables the performance and power efficiency of the ray tracing to be substantially improved without expanding high area and effort costs. High bandwidth paths within the texture processor and shader units that are used for texture processing are reused for BVH intersection testing and traversal. In general, a texture processor receives an instruction from the shader unit that includes ray data and BVH node pointer information. The texture processor fetches the BVH node data from memory using, for example, 16 double word (DW) block loads. The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes. The intersection results are returned to the shader unit.
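To make the split concrete, here is a very loose sketch (ordinary Python standing in for shader pseudocode, with made-up node and method names) of the division of labour the patent describes: the shader owns the traversal loop and the stack, and hands each node to the fixed-function tester living in the texture processor:

```python
def fixed_function_intersect(ray, node):
    """Stand-in for the TMU-side unit: for a box node it tests the ray against
    up to 4 child boxes and returns the hits sorted near-to-far; for a triangle
    node it returns a single ray/triangle result ((tri_id, t) or None)."""
    if node.is_box:
        hits = [(child, t) for child, t in node.test_children(ray) if t is not None]
        return sorted(hits, key=lambda h: h[1])
    return node.test_triangle(ray)

def trace_ray(ray, bvh_root):
    """Shader-side traversal: a plain stack-based loop, issuing one
    intersection request per node visited."""
    stack, closest = [bvh_root], None
    while stack:
        node = stack.pop()
        result = fixed_function_intersect(ray, node)  # the offloaded part
        if node.is_box:
            # push far children first so the nearest child is popped next
            stack.extend(child for child, _ in reversed(result))
        elif result is not None and (closest is None or result[1] < closest[1]):
            closest = result
    return closest
```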
 
Another element is the capacity and number of arrows into the L2. That could point to 20 L2 slices, although another diagram only had 10 fabric links going to the L2.
I'd ask whether there are more than 16 slices, and if the L1s can request more than 4 accesses per cycle. More than 16 could give more bandwidth internally, but not if the L1s cannot make more requests than they already do.
What stands out to me is that if there are 20 slices, the so-called "Big Navi" leak would indicate an L2 with fewer slices, despite having a wider GDDR6 bus.

Apologies if I've misunderstood what you're saying here, but I think it works as follows:

XSX has a 320-bit bus, with 5 x 64-bit controllers. Each controller has its own L2 with 4 slices, so 5 x 4 makes the 20 slices. Which also fits the 20 memory channels MS described.

5MB total L2 fits with 1MB L2 for each of the five controllers.

Even if the L1s can't request more than 4 accesses per cycle, you'll still need the 5 controllers, each with their four L2 slices, to manage the 320-bit bus. Couldn't compute bypass the L1 and make full use of the L2 bandwidth, though... (genuine question)?

I haven't seen anything about Big Navi L2 cache, but if it's a 384-bit bus shouldn't it be 6 controllers and therefore 24 L2 slices?
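That controller/slice math, spelled out (the four-slices-per-64-bit-controller grouping is my reading of the diagram, not anything AMD has confirmed):

```python
# L2 slice count implied by the bus width, assuming one 64-bit GDDR6
# controller per 64 bits of bus, four slices (and four 16-bit channels)
# per controller, and 256KB per slice (5MB / 20 slices on XSX).
def l2_layout(bus_width_bits, slices_per_controller=4, slice_kb=256):
    controllers = bus_width_bits // 64
    slices = controllers * slices_per_controller
    return controllers, slices, slices * slice_kb / 1024  # last value in MB

print(l2_layout(320))  # XSX: 5 controllers, 20 slices, 5.0 MB
print(l2_layout(384))  # 384-bit "Big Navi" guess: 6 controllers, 24 slices, 6.0 MB
```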
 
A couple of other interesting points from the Hot Chips presentation:

* "CUs have 25% better perf/clock compared to last gen" - that's compared to GCN so doesn't look like there will be anything more than a single digit perf/clock gain between RDNA1 and 2.
Is it though? "Last gen" for the XSX would be something along the lines of Polaris; Vega IIRC had better IPC than Polaris, and RDNA1 offered ~25% more IPC over Vega, didn't it? So it could very well be compared to RDNA1 instead of GCN~4.
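The baseline matters a lot here. A quick sketch, taking AMD's own ~25% RDNA1-over-GCN IPC claim at face value:

```python
# How the two readings of "+25% perf/clock vs. last gen" diverge, assuming
# AMD's quoted ~1.25x IPC uplift of RDNA1 over GCN (Vega) holds.
rdna2_claim = 1.25      # the Hot Chips bullet
rdna1_vs_gcn = 1.25     # AMD's RDNA1 launch claim

# Reading 1: the +25% is measured against GCN -> basically nothing over RDNA1.
print(f"vs RDNA1 if GCN baseline:  {rdna2_claim / rdna1_vs_gcn:.2f}x")  # 1.00x
# Reading 2: the +25% is measured against RDNA1 -> ~56% over GCN.
print(f"vs GCN if RDNA1 baseline: {rdna2_claim * rdna1_vs_gcn:.2f}x")  # 1.56x
```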
 
Does that mean the RT units in RDNA2 also share a common data path with the TMUs, i.e. blocking each other?

Turing's RT core apparently sits on its own bypass network.

This was stated in the presentation - you can either do a texture fetch or ask the RT h/w to trace a ray through a box/scene. The TMU/RT h/w itself is likely independent, but it's clearly using the same data path and is probably using the same caches. It's likely a good trade-off between h/w complexity and s/w flexibility, for consoles in particular.

Is it though? "Last gen" for the XSX would be something along the lines of Polaris; Vega IIRC had better IPC than Polaris, and RDNA1 offered ~25% more IPC over Vega, didn't it? So it could very well be compared to RDNA1 instead of GCN~4.
Vega had worse IPC than Polaris IIRC.
 