AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Status
Not open for further replies.
My confusion: aren't there twice as many VGPRs now, so that register pressure problems are, in a way, magically gone?
An RDNA SIMD32 register file has twice the register file capacity (in terms of KB) of a GCN SIMD16 register file.
Since RDNA's register IDs correspond to 32 work-items, a register is individually half the length of the 64-wide register of GCN.

The register file has 4x as many individually addressable registers, although they are half-size. If a single CU in RDNA were asked to support the same number of work-items as GCN (2x Wave32 wavefronts or 1 Wave64 per GCN wavefront without certain optimizations), it would have the same register capacity per work-item.
I'm trying to find the slide or reference that characterized RDNA as slightly improving register pressure, since I didn't see it being considered "gone".
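As a back-of-the-envelope check on the capacity claims above (just a sketch; the 64 KB-per-SIMD16 and 128 KB-per-SIMD32 figures are AMD's published register file sizes):

```python
# Rough register-file arithmetic comparing one GCN CU (4x SIMD16)
# with one RDNA CU (2x SIMD32).

BYTES_PER_LANE_REG = 4  # one 32-bit VGPR element per lane

# GCN: 64 KB VGPR file per SIMD16, registers are 64 lanes wide
gcn_file_bytes = 64 * 1024
gcn_reg_bytes = 64 * BYTES_PER_LANE_REG                 # 256 B per register
gcn_regs_per_simd = gcn_file_bytes // gcn_reg_bytes     # 256 registers

# RDNA: 128 KB VGPR file per SIMD32, registers are 32 lanes wide
rdna_file_bytes = 128 * 1024
rdna_reg_bytes = 32 * BYTES_PER_LANE_REG                # 128 B per register
rdna_regs_per_simd = rdna_file_bytes // rdna_reg_bytes  # 1024 registers

print(rdna_file_bytes // gcn_file_bytes)        # 2: twice the KB per SIMD
print(rdna_regs_per_simd // gcn_regs_per_simd)  # 4: 4x addressable registers
# Per-CU capacity is unchanged: 4 * 64 KB == 2 * 128 KB == 256 KB
print(4 * gcn_file_bytes == 2 * rdna_file_bytes)  # True
```

So per work-item nothing changes; only the granularity of allocation does.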

The finer granularity of Wave32 might spare shaders that are particularly poor at utilizing 64 threads from having to allocate a full 64-wide wavefront context.
Wave64 has a sub-vector execution mode that takes advantage of how Wave64 is implemented as 2x Wave32 instructions: it splits execution into two Wave32 halves and treats each half as a single iteration of an internal loop.
Registers used for results internal to that loop can be assigned to the same Wave32 register ID, as the software knows the intermediate results of the halves are separated in time--saving some capacity if that mode is used.
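A toy model of that saving (illustrative only; 32-bit registers and the lane counts are assumed):

```python
# Sub-vector mode register reuse, modeled crudely.
# Normally a Wave64 temp needs storage for all 64 lanes at once.
# In sub-vector execution the two 32-lane halves run as serial loop
# iterations, so a loop-internal temp can reuse one 32-wide register.

LANES_WAVE64, LANES_HALF = 64, 32
BYTES_PER_LANE = 4

def temp_bytes(sub_vector_mode: bool) -> int:
    """Bytes a loop-internal temporary occupies at any one time."""
    lanes_live = LANES_HALF if sub_vector_mode else LANES_WAVE64
    return lanes_live * BYTES_PER_LANE

print(temp_bytes(False))  # 256 bytes: both halves live together
print(temp_bytes(True))   # 128 bytes: halves separated in time
```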

A sort-of killer feature would be the option to double the accessible LDS for certain workgroups while running others that don't use any LDS on the other half of a WGP. If technically possible, it would be worth some work to make this happen!
Perhaps someone has parsed the ISA doc better than I have, but I didn't see reference to the LDS allocation values in the wavefront context being extended to give them the ability to allocate more LDS.

It is not magic, it is an effect of the shorter execution latency. Each instruction result has a reservation in the register file. Since latency in RDNA is one quarter of GCN, register file entries used for temporary results effectively have one quarter the footprint (measured in bytes times cycles).
The issue latency is 1/4 of GCN. For scalar forwarding, there appears to be a 2-cycle latency, which is better though not relevant as far as scalar register file footprint goes. It's scalar, so no savings in register width. It's also RDNA, which has shifted to a static 128 registers per wavefront, which in capacity terms is worse than GCN though it's rendered moot by the architecture having enough register file space to hard-wire the allocation.

The vector result forwarding latency is worse than GCN, with RDNA needing 5 clock cycles before a dependent instruction can issue versus GCN's 4.
Depending on what limits the time between a temp register being written and its being consumed, there would be cases where a temp generated ahead of a serial chain lives longer on RDNA than on GCN.
Wave64 can provide register savings, within the limits of sub-vector mode. If running in Wave32 on a workload that can readily use up 64 work-items, needing 2x the wavefronts to get the same number of work-items leaves overall occupancy similar.
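The quarter-footprint claim above can be sketched numerically (a toy model assuming GCN issues one Wave64 instruction per SIMD every 4 clocks, RDNA issues one Wave32 per clock, and two Wave32 waves stand in for one Wave64):

```python
# Byte*cycle footprint of a temporary that stays live across N
# dependent vector instructions, for the same 64 work-items.

N = 8            # instructions the temp stays live across (arbitrary)
BYTES = 4        # 32-bit value per lane

# GCN: one 256 B (64-lane) register, live for 4 clocks per instruction
gcn_footprint = (64 * BYTES) * (N * 4)

# RDNA: two Wave32 waves, each a 128 B register live 1 clock per instruction
rdna_footprint = 2 * (32 * BYTES) * (N * 1)

print(rdna_footprint / gcn_footprint)  # 0.25: one quarter, as claimed
```

The per-instruction register bytes are the same (2 x 128 B vs. 1 x 256 B); the factor of four comes entirely from the shorter issue latency.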
 

Section 10.3 is the closest I remember seeing:

In WGP mode, the waves are distributed over all 4 SIMD32’s and LDS space may be allocated
anywhere within the LDS memory. Waves may access data on the "near" or "far" side of LDS
equally, but performance may be lower in some cases.
 

I saw where it references the ability for wavefronts on either side of the dual-CU to access data in the other half--subject to unspecified performance limits.
I interpreted the statement about the option to double the accessible LDS for certain workgroups to mean allowing an allocation twice the size of what a GCN wavefront could allocate. However, the sections I saw that reference allocation, like M0 or LDS_ALLOC, didn't appear to be different. Some of the other items, like offset values for addressing, also didn't change in stride or length, but I may have missed some change that would allow a larger allocation or allow a wavefront to access the portions of one.
 
If a single CU in RDNA were asked to support the same number of work-items as GCN (2x Wave32 wavefronts or 1 Wave64 per GCN wavefront without certain optimizations), it would have the same register capacity per work-item.
Yeah, that's what I thought, but it's hard to be sure. Not being a hardware guy, I tend to ignore details that won't affect programming. For example, I never cared that GCN has SIMD16s and just assumed 64 threads in lockstep. The scalar/vector instruction cycle ratio is also not important to know, because it never affects any decisions -- there are just no options.
I remember a lengthy discussion with another dev who thought RDNA would be twice as fast in general because of SIMD32 vs. SIMD16. I ended up saying that each SIMD32 simply does the work of two SIMD16s, and that we would see twice the TF numbers if the claim were true.
Trying to understand hardware can be quite difficult nowadays :)
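The per-CU math behind that argument can be sketched as follows (assuming a GCN CU with 4 SIMD16s vs. an RDNA CU with 2 SIMD32s; the 1.755 GHz figure is just an illustrative game clock, not tied to any particular SKU):

```python
# Why SIMD32 isn't "twice as fast": FMA lanes retired per CU per clock.
# GCN CU: 4 SIMD16s, each retiring 16 lanes/clk (a Wave64 over 4 clks).
# RDNA CU: 2 SIMD32s, each retiring 32 lanes/clk (a Wave32 in 1 clk).

gcn_lanes_per_clk = 4 * 16
rdna_lanes_per_clk = 2 * 32
print(gcn_lanes_per_clk, rdna_lanes_per_clk)  # 64 64 -- identical

# So at equal CU count and clock, the headline FLOPs match too:
def tflops(cus, clk_ghz, lanes_per_cu=64, flops_per_fma=2):
    return cus * lanes_per_cu * flops_per_fma * clk_ghz / 1000

print(tflops(40, 1.755))  # a 40-CU part at 1.755 GHz: roughly 9 TFLOPS
```

The wider SIMD changes latency and scheduling, not peak lane throughput per CU.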

Perhaps someone has parsed the ISA doc better than I have, but I didn't see reference to the LDS allocation values in the wavefront context being extended to give them the ability to allocate more LDS.
I also don't think it's possible yet, but it might become an option requiring only small changes. (It being just a matter of driver work sounds too good to be true, I guess.)
The overall amount of LDS and registers seems just right most of the time, but many shaders still don't use LDS at all while others would benefit a lot from having more, so pairing CUs seems like an opportunity.

In WGP mode, the waves are distributed over all 4 SIMD32’s and LDS space may be allocated
anywhere within the LDS memory. Waves may access data on the "near" or "far" side of LDS
equally, but performance may be lower in some cases.
I think it would make sense to let developers choose WGP mode manually (also on desktop, via extensions). The driver likely can't be right all the time, and there might be some wins for free.
 
Navi 12 and Navi 14 could be released very shortly:

https://www.pcgamesn.com/amd/navi-12-navi-14-rx-5600-rx-5500-october-15-launch

It seems AMD made an "urgent" commit to the Mesa 3D library for Navi 12's PCI ID. The next Mesa release is October 15, so there's speculation that if they didn't want to wait for the next driver release, it's because the cards are coming out before then.

I'm not familiar with Linux driver stuff, and I also don't know of any event between now and October 15 where AMD would announce the new graphics cards.
OTOH, mid/low-range cards don't always get released with great formal announcements.
 
I don't suppose they'd also be mobile chips, i.e. for the Surface event on Oct 2nd?
 
Hmmm, if a Navi card releases in the 100-150 USD range, I'd be interested. It's about time I finally replaced and retired my REALLY old Radeon 5450. :p That said, I'm not sure any of these upcoming cards are planned to go that low at launch.

Regards,
SB
 
On a related note, the rumored 16"–16.5" MacBook Pro is supposedly going to be launched later this year. Could one of the two upcoming Navi chips also be used for this product?

The 15" MacBook Pro already has Vega Mobile as an option, so that would be a lower bound for the performance of the GPU in the 16"–16.5" MBP.

Certainly would be nice if Apple were to switch out Polaris for Navi. Vega Mobile is a "high end" upgrade option that doesn't quite feel like it's worth it unless you require the compute improvements. Not to mention that I feel Apple should have just replaced the Polaris chips with Vega wholesale instead of offering this upgrade to begin with. But I assume HBM costs and profit margins had something to do with that...
 
A wild Radeon RX 5300 4GB GDDR5 appears in a HP desktop shipping October 9:
https://www.alternate.de/HP/Desktop-M01-F0017ng-Komplett-PC/html/product/1576782?

It could be Navi 12/14, or it could be the N-th time AMD rebrands the Polaris family (especially for OEMs).
Considering they already launched the 600-series for OEMs with Polaris this year, I doubt they'd do that. Also, there's the RX 5500: https://gfxbench.com/device.jsp?ben...ws&api=gl&D=AMD+Radeon+RX+5500&testgroup=info
 
If we assume Navi 14 is a midranger and Navi 12 bigger than Navi 10, what's 5300 made of?
1 Extreme dual-compute unit ?

:p

-----
Maybe just cut the Navi 14 in half and call it a day? Assuming there are 2 NSE's & 4x3 DCUs on there. i.e. 5300 = 1NSE & 2x3 DCUs

^if 5700 is defined as 2NSEs & 4x5 DCUs (40CUs)

a miserable little pile of acronyms
 
https://www.3dcenter.org/news/amds-...erface-navi-12-hingegen-mit-256-bit-interface

3DCenter speculates that Navi 14 is a midranger with 128bit GDDR6, whereas Navi 12 is a higher-end and larger chip than Navi 10.

A higher-end chip already, without nearly as much fanfare as they put into the midrange one? I can see half a Navi 10 getting released -- an RX 580 equivalent for cheap before Christmas -- and it wouldn't need a big press buildup. But a higher-end one right now seems odd.

As for the low end, isn't that the leaked 12/24 CU chip? Assuming the arch isn't cache/bandwidth bound relative to its CUs, that could be good value. Even if it is bandwidth bound, there's already higher-clocked GDDR6 to buy.
 