AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Nvidia does 32-wide, and nearly all privately written GPGPU code has been optimized for that. AMD needs to match it to minimize the expense of translating code between the architectures if it is to have any hope of ever properly competing in that space.
I think there's merit in this idea but it also appears to be somewhat limited, at least in terms of how much performance might be left on the table.

The 64-wide threads are described as hiding inter-instruction latency (which I interpret as referring to complex register file banking (e.g. indexed reads), lane interchange and LDS operations) better than 32-wide. These operations are more typical of compute than of graphics. Additionally, scalar instructions will often be shared by all 64 work items in this case, halving the number of scalar instruction issues required, though the new architecture appears to make scalar instructions less of a bottleneck than previously. Scalars are heavily used in compute to generate addresses and to control branching and related queries (any, all).
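To make that concrete, here's a rough CUDA-style sketch (purely illustrative; nothing here is AMD's ISA, and the kernel and names are made up) of where wave-uniform scalar work and any/all-style votes come from in compute code:

```cuda
// Hypothetical kernel: shows where a compiler can peel work onto the scalar
// unit (wave-uniform address maths, uniform branch decisions) versus the
// per-lane vector ALUs.
__global__ void scaled_copy(const float* src, float* dst, int n, float threshold)
{
    // blockIdx/blockDim are uniform across the wave, so this part of the
    // address calculation is a scalar-unit candidate.
    int base = blockIdx.x * blockDim.x;
    int i = base + threadIdx.x;              // per-lane (vector) part
    if (i >= n) return;

    float v = src[i];

    // "any"/"all" votes let the whole wave commit to one branch with a single
    // scalar decision instead of per-lane predication.
    unsigned mask = __activemask();
    if (__all_sync(mask, v < threshold)) {
        dst[i] = 0.0f;                       // uniform fast path for the wave
    } else {
        dst[i] = v * 2.0f;                   // general path
    }
}
```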

There's a slide in the deck which shows how 64-wide threads issue at a rate of 2 instructions from the same thread to the same ALU, consecutively:

[Slide image: b3da035.png]


Interestingly, you can see from this example that Navi cannot issue two instructions consecutively when the second depends on the first's result: v0 is written in cycle 2 but cannot be consumed until cycle 7, four cycles of latency. GCN doesn't have to switch to another hardware thread in this case, while Navi does.

In this case 64-wide is preferable, except that two 32-wide threads could have run in the same time. So then it's a question of coherency of work-item execution versus cache thrashing and other side effects of switching amongst lots of threads.
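A trivial sketch of that dependency pattern (hypothetical CUDA-style device code, not the slide's actual instructions):

```cuda
// The dependent chain serialises on ALU latency, which is what forces Navi to
// switch waves in the slide; independent chains give the scheduler something
// to issue into the gap instead.
__device__ float dependent_chain(float x, float y, float z)
{
    float a = x * y;       // "v0" produced here
    return a + z;          // consumes it immediately: back-to-back issue stalls
}

__device__ float independent_chains(float x0, float y0, float x1, float y1)
{
    float c0 = x0 * y0;
    float c1 = x1 * y1;    // independent of c0, can issue the very next cycle
    return c0 + c1;
}
```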

I don't understand why there are "dual compute units". They appear to share instruction and scalar/constant caches, which seems like a weak gain. I don't fully understand the slide that refers to a "Workgroup Processor"; it seems to be saying that because LDS and cache are "shared", huge gains in performance from issuing large workgroups are possible. So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, then you get twice as much LDS capacity and 4x the cache bandwidth as on just a single compute unit.
 
I don't understand why there are "dual compute units". They appear to share instruction and scalar/constant caches, which seems like a weak gain. I don't fully understand the slide that refers to a "Workgroup Processor"; it seems to be saying that because LDS and cache are "shared", huge gains in performance from issuing large workgroups are possible. So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, then you get twice as much LDS capacity and 4x the cache bandwidth as on just a single compute unit.
Isn't that precisely what was described in the super-SIMD patent?
 
So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, then you get twice as much LDS capacity
Twice the LDS capacity without the usual occupancy penalty?

I guess it will take some time until we understand such changes. Where do those slides come from?
 
How is it decided whether to use wave32 or wave64 mode?
I mean, the compiler could know about the additional stalls in wave32 mode, but it seems like there would be a lot more factors deciding what is optimal (and for things like potentially divergent branches, those might be factors the compiler cannot know).
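As a contrived example of the divergence side of it (hypothetical CUDA-style kernel, just to illustrate the shape of the problem):

```cuda
// The branch flips every 32 work items. Issued as wave32, each wave takes
// exactly one side; issued as wave64, every wave straddles the boundary and
// has to run both sides with half its lanes masked off.
__global__ void divergent_by_32(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((i / 32) % 2 == 0)
        out[i] = in[i] * 2.0f;    // "even" 32-item groups
    else
        out[i] = sqrtf(in[i]);    // "odd" 32-item groups
}
```

A compiler can see a static pattern like that one, but the same divergence driven by runtime data wouldn't be visible to it at all, which is the worry above.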
 
I paid a little under $700. Planning to keep it for ~5 years. As of today I would guess the 1080 Ti is only bested by the 2080 Ti and trades blows with the 2080 and Radeon VII. Pretty good for an old piece of junk.

I'm much more in the camp of buying high end and using it for a long time to get value, rather than getting new mid-tier crap every 1.5 years. There was a time when updating often made sense, but that isn't the case anymore.

Unfortunately that's probably not going to work, in large part because of RAM limitations. Another reason to be disappointed in this launch is the 8 GB of RAM. AMD knows they'll be doing 16 GB cards next year to keep up with the PS5/Xbox launch, but they limit the RAM to match the competition anyway, instead of to what would be more future-proof.

Of course Nvidia knows this nearly as well as AMD, and still does it too. So no points there, but it does mean anyone buying an RTX 2080, or even a 5700... blah blah blah (what's the highest-end one with maximum wordage again?), anyway, anyone buying those this year and expecting them to last at high settings is going to have a bad time.

Though what bothers me most is that "Next Gen" on the roadmap, for all of next year.
With the lack of graphics-related instructions/features in this launch, it's not too hard to guess that Next Gen is what the PS5/Scarlett are actually being built on.
Variable rate shading, programmable primitive shaders (à la Nvidia's mesh shaders), ray tracing support in some fashion; does this launch even have conservative rasterization, etc.?

All these things are either announced, or being expected/heavily pushed for by devs, including devs that work directly for both MS and Sony studios. So I'd easily expect most if not all of them and more to show up in these consoles. Which doesn't bode well for anyone buying these cards this year at the very least in terms of features, if not efficiency as well.

I tried doing a per-mm² comparison with Turing, but after tearing my hair out over the inane charts of node comparisons and node variants I gave up. So here's a rougher estimate: a 2070 die has roughly the same transistor count as a 5700-series die. And while a 5700 might be a bit faster, it's also clocked higher. A 2070 meanwhile has ray tracing support (even if it is narrow), tensor cores (even if they aren't relevant to gaming), and poor performance per mm² as it is. So right now Navi's efficiency per mm², while technically better than Vega's, still isn't that impressive.
 
It also seems like the block diagram is for the most basic type of Navi chip, which might be a 20 "Compute Unit" part? So a 5700 could be two of these put together.
No, that block diagram is exactly Navi 10: 20 double-CUs packed into 4 macro-blocks, which themselves are packed into 2 SEs.
 
No, that block diagram is exactly Navi 10: 20 double-CUs packed into 4 macro-blocks, which themselves are packed into 2 SEs.

I've been trying to figure that out, but haven't gotten a definitive answer as to what a "double compute unit" is from someone who knows for sure. I'd first assumed it was 20 CUs per SE, but saying "double compute unit" is weird, especially with their Super SIMD-like 32/64 wavefront, which could itself be a "double compute unit".
 
Is the 10 CUs per shader engine design mandatory for RDNA? Is it possible to use, for instance, 8 CUs (well, 4 dual compute units) per SE?
Yes, every AMD architecture has been able to vary these numbers. That's why the 5700 works, not just the 5700 XT.
 
I've been trying to figure that out, but haven't gotten a definitive answer as to what a "double compute unit" is from someone who knows for sure. I'd first assumed it was 20 CUs per SE, but saying "double compute unit" is weird, especially with their Super SIMD-like 32/64 wavefront, which could itself be a "double compute unit".
Each Shader Engine contains 10 Workgroup Processors, which in turn each contain 2 CUs. The CUs inside of a WGP can be grouped up to cooperate on workloads, if the compiler deems it beneficial.

[Slide image: Mike_Mantor-Next_Horizon_Gaming-Graphics_Architecture_06092019_20.jpg]
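For what that means in practice, here's a minimal CUDA-style sketch of a 256-wide workgroup (shared memory standing in for LDS, and assuming the kernel is launched with 256 threads per block). On RDNA such a group would be spread across both CUs of one WGP, so the LDS allocation and the barrier cover the pair rather than a single CU:

```cuda
// Hypothetical 256-item workgroup reduction: one LDS (shared memory)
// allocation and one barrier serve the whole group, which on a WGP would be
// spread across both of its CUs.
__global__ void block_reduce_256(const float* in, float* out)
{
    __shared__ float lds[256];
    int tid = threadIdx.x;
    lds[tid] = in[blockIdx.x * 256 + tid];
    __syncthreads();                          // barrier across the whole workgroup

    for (int stride = 128; stride > 0; stride >>= 1) {
        if (tid < stride)
            lds[tid] += lds[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = lds[0];
}
```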
 