AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Guys, before you argue endlessly about whether GCN is the problem, you should start by defining what you mean by "GCN". The debate makes no sense whatsoever otherwise.

I'd call GCN what AMD calls GCN. In practice, it's all GPUs so far that use Compute Units with 64 ALUs each, using RISC SIMD.
 
Maybe not, but the question still remains ... why, ever since the R9 series, has AMD failed to match Nvidia's efficiency while still needing a lot more bandwidth? (I know Vega 56/64 uses binning, which looked deferred like Nvidia's ...)
Doesn't sound like a GCN problem to me.

https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/8
 
So I guess everyone has decided that the rumor/leak from KOMACHI that Navi has 8 shader engines is false then? And everyone must have also decided therefore that Navi is using more than 40 compute units to compete with the 2070? Because I'm not sure how a GCN based GPU is supposed to beat a 2070 with 40 CUs at a vaguely sensible TDP without some sort of architectural advancement to GCN?

 
So I guess everyone has decided that the rumor/leak from KOMACHI that Navi has 8 shader engines is false then? And everyone must have also decided therefore that Navi is using more than 40 compute units to compete with the 2070? Because I'm not sure how a GCN based GPU is supposed to beat a 2070 with 40 CUs at a vaguely sensible TDP without some sort of architectural advancement to GCN?

Komachi has generally been a trustworthy leaker for as long as I can remember, but didn't he quickly delete the tweet where he said 8x5 CU? Which could indicate it wasn't solid.
 
I'm sorry to see my old thread vandalised by being closed for further replies.

Maybe not, but the question still remains ... why, ever since the R9 series, has AMD failed to match Nvidia's efficiency while still needing a lot more bandwidth? (I know Vega 56/64 uses binning, which looked deferred like Nvidia's ...)
Did Beyond3D ever provide the source code for these tests? I'm reasonably sure they were at least partly debunked.

Also, NVidia's "efficiency" had a big problem with games that use HDR, didn't it? AMD cards, meanwhile, suffered no such problem. NVidia did eventually solve this though, as I understand it.

So I guess everyone has decided that the rumor/leak from KOMACHI that Navi has 8 shader engines is false then? And everyone must have also decided therefore that Navi is using more than 40 compute units to compete with the 2070? Because I'm not sure how a GCN based GPU is supposed to beat a 2070 with 40 CUs at a vaguely sensible TDP without some sort of architectural advancement to GCN?

I think this rumour could have merit. Though it might be for the wrong GPU?

Apart from the increased fixed-function throughput this would offer, it could also change the ratio of scalar:vector instruction throughput. If a CU consists of 2 VALUs that are 32-wide (while retaining a 64-wide hardware thread group size) there would be twice as many SALU instruction issues available per VALU instruction issue.
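
Just to make that issue-rate arithmetic concrete, here's a toy tally. The assumptions are mine, not anything AMD has published: a wave64 vector instruction occupies its SIMD for wave_size / simd_width cycles, and each hypothetical 32-wide VALU gets its own scalar unit issuing one instruction per cycle (classic GCN shares one scalar unit across its four SIMD16s).

```python
# Toy SALU:VALU issue-slot tally (illustrative assumptions only).
def salu_per_valu(num_simds, simd_width, num_salu, wave_size=64):
    """SALU issue slots available per VALU instruction issue, per CU."""
    valu_issues_per_cycle = num_simds / (wave_size / simd_width)
    salu_issues_per_cycle = num_salu  # assume one issue per scalar unit per cycle
    return salu_issues_per_cycle / valu_issues_per_cycle

# Classic GCN CU: 4x SIMD16 sharing a single scalar unit -> 1 SALU per VALU issue.
print(salu_per_valu(num_simds=4, simd_width=16, num_salu=1))  # 1.0

# Speculative CU: 2x SIMD32, assuming one scalar unit per VALU -> 2 SALU per VALU issue.
print(salu_per_valu(num_simds=2, simd_width=32, num_salu=2))  # 2.0
```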

My overall feeling with GCN has been that the fixed function hardware and work distribution (that has to deal with a mixture of fixed-function and compute work) has failed to scale because it is globally constrained in some way. The mysteries of the use of GDS have made me wonder if GDS itself has been a relevant bottleneck, but regardless I feel there has long been some kind of global bottleneck.

More CUs on their own won't help with this bottleneck. The only real solution is to pull apart the way that work distribution functions, minimising the effort required of the global controller. Part of this requires better queue handling, both globally and for each distributed component. This requires more internal bandwidth (since the definition of work can be quite complex) and interacts with how the on-chip cache hierarchy is designed.
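
As a crude way of showing why I think extra CUs don't buy anything here, consider a single global distributor that can only launch work at a fixed rate; the numbers below are invented purely for illustration.

```python
# Crude model of a single global work distributor feeding N CUs (invented rates).
DISPATCH_RATE = 4.0    # workgroups the global controller can hand out per unit time
CU_THROUGHPUT = 0.25   # workgroups a single CU can retire per unit time

def sustained_throughput(num_cus):
    # The GPU retires work no faster than the slower of: what the CUs can chew
    # through, and what the global distributor can hand out.
    return min(num_cus * CU_THROUGHPUT, DISPATCH_RATE)

for cus in (16, 32, 40, 64):
    print(cus, sustained_throughput(cus))
# Scaling stops at 16 CUs in this model; beyond that, only changing the
# distribution scheme (distributed queues, less global coordination) helps.
```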

I've held this theory about AMD's failure to scale for pretty much the entire time we've had GCN, because the 4 CU limit has been around forever (though it took a while to discover that it was there).

It might be worthwhile to consider why AMD ever thought it necessary to share resources between CUs. This has always smelt like a false economy to me. Some would argue that this is a side-effect of AMD considering GCN to be a compute architecture first, since AMD has spent about 10 years arguing that graphics is compute and fixed function is just a side-show for compute. (Which, I believe, is why consoles are so amazing these days as console devs have embraced this perspective.)
 
Oh boy. When you take the scope of changes made between BD and Zen and project it onto GCN, you wouldn't call the result GCN anymore. That's the whole point of this "ditch GCN" move - to implement major improvements.
The Zen vs BD comparison becomes a question of architecture versus microarchitecture. The x86 architecture defines instructions and a range of behaviors to software, and the Zen, BD, or Skylake microarchitectures are implementations of those behaviors. The particulars of what they use to carry out their architectural requirements, and how well or poorly they handle them, are things the architecture tries to be somewhat agnostic about. That said, implementation quirks throughout history can be discerned if one knows the context of things like x87, FMA3, the often winding encoding and prefix handling, etc.
x86 also generally doesn't commit to things like cache sizes, instruction latencies, or how many ALUs, load queues, or other lower-level resources there are. In part, there are too many x86 implementations that provide contrary examples for saying a given resource allocation or pipeline choice is architectural.

Guys, before you argue endlessly about whether GCN is the problem, you should start by defining what you mean by "GCN". The debate makes no sense whatsoever otherwise.

I think AMD might share some of the blame. If we go by the GCN architecture whitepaper, GCN is effectively the 7970 with some handwaving about the CU count. If we go by the ISA docs, we lose some of the cruft, but there's still a lot of specific microarchitectural details that get rolled into it.
Whichever elements AMD considers essential to GCN are embedded alongside whatever else happens to be in the CU and GPU hardware stack.

I'd call GCN what AMD calls GCN. In practice, it's all GPUs so far that use Compute Units with 64 ALUs each, using RISC SIMD.
There are other elements that at least so far would hold:
4-cycle cadence--there are examples of instructions whose semantics recognize this, such as lane-crossing ops (see the sketch after this list)
16-wide SIMD--various operations like the lane-crossing ones have row sizes linked to the physical width of the SIMD
incomplete or limited interlocking within a pipeline--courtesy of the cadence and an explicitly stated single-issue per wavefront requirement
multiple loosely joined pipelines--explicitly recognized in the ISA with the waitcnt instructions
very weakly ordered memory model, with an incoherent L1 and eventual consistency at the L2
multiple memory spaces and multiple modes of addressing
integer scalar path and SIMD path with predication
separate scalar memory path and vector memory path
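
For the cadence and width items above, here's a minimal sketch of how the publicly documented numbers (wave64, SIMD16, four SIMDs per CU) fit together; the rest is just arithmetic:

```python
# How the classic GCN execution numbers hang together (wave64 on 16-wide SIMDs).
WAVE_SIZE = 64      # work-items per wavefront
SIMD_WIDTH = 16     # physical lanes per SIMD
SIMDS_PER_CU = 4

cadence = WAVE_SIZE // SIMD_WIDTH                 # 4 cycles per VALU op per wave
valu_lanes_per_clock = SIMDS_PER_CU * SIMD_WIDTH  # 64 results per clock per CU
min_waves_to_fill_cu = SIMDS_PER_CU               # one wave per SIMD to keep the VALUs busy

print(cadence, valu_lanes_per_clock, min_waves_to_fill_cu)  # 4 64 4
```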

The ISA docs also tend to commit to rather specific cache sizes and organization, data share capacity and banking, and the sizes of various buses.

These would be elements more present in the ISA docs than in the overarching GCN doc. One thing the ISA docs do not help with is establishing a consistent mapping of encodings to instructions; in a few GFX9 FMA cases, instructions were given an architectural name that had belonged to pre-existing instructions, which changed the behavior in a way that threw things like code disassemblers out of whack.

As for the ISA being RISC: other than the instructions being either 32-bit or 64-bit, I think GCN is very complex. Multiple memory spaces, multiple ways to address them, multiple ways to address the same memory inconsistently. Many special registers that are implicit arguments to many operations, addressing rules, complex access calculations (such as the bugged flat addressing mode in GFX1010). Vector instructions can source operands from LDS, scalar registers, special registers, or a number of special-purpose values, rather than just a straightforward register ID.
To a limited extent, there are some output modifiers for results.

Apart from the increased fixed-function throughput this would offer, it could also change the ratio of scalar:vector instruction throughput. If a CU consists of 2 VALUs that are 32-wide (while retaining a 64-wide hardware thread group size) there would be twice as many SALU instruction issues available per VALU instruction issue.
I think another interpretation was 2x16, so perhaps allowing for multiple issue of instructions whose behavior would be consistent with prior generations. The lane-crossing operations would have the same behavior then, as it might be difficult to broadcast from an ISA-defined row of 16 to the next row if they're executing simultaneously.
It might also help explain why register banking is now a concern, while a physically 32-wide SIMD and register file would still be statically free of bank conflicts. The latency figures seem to be in-line with a cadence similar to past GPUs, which might not make sense with a 32-wide SIMD.

My overall feeling with GCN has been that the fixed function hardware and work distribution (that has to deal with a mixture of fixed-function and compute work) has failed to scale because it is globally constrained in some way. The mysteries of the use of GDS have made me wonder if GDS itself has been a relevant bottleneck, but regardless I feel there has long been some kind of global bottleneck.
Some of the patents for the new geometry pipeline, and those extending it, cited the need for an increasingly complex crossbar to distribute primitives between shader engines as a bottleneck to scaling. One alternative was to use what is in effect a primitive shader per geometry front end, streaming data out to the memory hierarchy to distribute it. If that were the case for the ASCII diagram, though, needing a fair amount of redundant culling work at each front end might leave the relatively paltry number of CUs per SE with fewer CUs for other work like asynchronous compute. And relying on the memory crossbar that's already being used by everything else, to save on a geometry crossbar, may just shift the problem from one specialized bottleneck to another global one.
 
Some of the patents for the new geometry pipeline, and those extending it, cited the need for an increasingly complex crossbar to distribute primitives between shader engines as a bottleneck to scaling. One alternative was to use what is in effect a primitive shader per geometry front end, streaming data out to the memory hierarchy to distribute it. If that were the case for the ASCII diagram, though, needing a fair amount of redundant culling work at each front end might leave the relatively paltry number of CUs per SE with fewer CUs for other work like asynchronous compute. And relying on the memory crossbar that's already being used by everything else, to save on a geometry crossbar, may just shift the problem from one specialized bottleneck to another global one.
What do you mean by redundant? You're referring to a single primitive being shaded (culled) by each instance (each tile, effectively) in which it appears? Well, that's the trouble with hardware implementing an API: brute force is always going to result in wasted effort.
 
What do you mean by redundant? You're referring to a single primitive being shaded (culled) by each instance (each tile, effectively) in which it appears? Well, that's the trouble with hardware implementing an API: brute force is always going to result in wasted effort.
Yes, the primitive stream is broadcast to all front ends, and the same occupancy and throughput loss would be incurred across all shader engines. It's proportionally less of an impact in a GPU with 16 CUs per shader engine versus that ASCII diagram that has less than a third of the compute resources available.
Also unclear would be how salvage SKUs would be handled. A balanced salvage scheme would be cutting off resources in 20% increments.
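
To put rough numbers on "proportionally less of an impact": if the primitive stream is broadcast and every shader engine repeats roughly the same culling effort, that overhead is fixed per SE, so the fraction of each SE's CU time it eats grows as the CU count per SE shrinks. The cost figure below is invented; only the ratio matters.

```python
# Invented cost figure purely to show the proportional effect of broadcast culling.
CULL_COST_PER_SE = 2.0  # CUs-worth of work each shader engine spends re-culling the stream

def culling_overhead_fraction(cus_per_se):
    return CULL_COST_PER_SE / cus_per_se

print(culling_overhead_fraction(16))  # 0.125 -> 12.5% of a 16-CU shader engine
print(culling_overhead_fraction(5))   # 0.4   -> 40% of a 5-CU shader engine (the 8x5 rumour)
```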

As far as attributing blame to the API, what specifically is inherent to the API that requires this? If there are many physical locations that may be responsible for handling all or part of a primitive, the question as to which ones are relevant needs to be answered by something, and then somehow the whole system needs to be updated with the answer.
 
The Sapphire rep was right on the money: top Navi competes with the RTX 2070 (barely winning, by 10%, in AMD's favorite Strange Brigade). No mention of hardware ray tracing. High-end Vega will continue to be present at the high end for the foreseeable future .. all that remains is confirmation of the $500 price.
 
RDNA is supposedly not GCN.. but.. well... the reality is that it's probably still an evolution of GCN (which isn't a bad thing, contrary to what some folks are crying about..). Anyway, it's clear that this seems to be a streamlined evolution of GCN aimed at gaming, while Vega (and its successor) will be the "compute" version of the GCN arch.. BTW the die doesn't have HBM. Navi is the new Polaris, as expected.
 
RDNA is supposedly not GCN.. but.. well... the reality is that it's probably still an evolution of GCN (which isn't a bad thing, contrary to what some folks are crying about..). Anyway, it's clear that this seems to be a streamlined evolution of GCN aimed at gaming, while Vega (and its successor) will be the "compute" version of the GCN arch.. BTW the die doesn't have HBM. Navi is the new Polaris, as expected.
By the looks of it, GCN will indeed be split into Compute-GCN and Gaming-RDNA.
What do you mean, "new Polaris as expected" because it's not using HBM? The memory solution has little to nothing to do with the architecture; GCN(/RDNA) isn't tied to a specific memory type, they can fit any memory controller they choose. Heck, even the Polaris architecture you specifically mentioned has products using both GDDR (desktop GPUs) and HBM (Intel "Vega", which is really Polaris).
 
By the looks of it, GCN will indeed be split into Compute-GCN and Gaming-RDNA.
What do you mean, "new Polaris as expected" because it's not using HBM? The memory solution has little to nothing to do with the architecture; GCN(/RDNA) isn't tied to a specific memory type, they can fit any memory controller they choose. Heck, even the Polaris architecture you specifically mentioned has products using both GDDR (desktop GPUs) and HBM (Intel "Vega", which is really Polaris).
"New Polaris" as in "new mid-range GPU arch" (wasn't related to my HMB remark sorry)

https://www.anandtech.com/show/1441...ducts-rx-5700-series-in-july-25-improved-perf
 
RDNA is supposedly not GCN.. but.. well... the reality is that it's probably still an evolution of GCN (which isn't a bad thing, contrary to what some folks are crying about..). Anyway, it's clear that this seems to be a streamlined evolution of GCN aimed at gaming, while Vega (and its successor) will be the "compute" version of the GCN arch.. BTW the die doesn't have HBM. Navi is the new Polaris, as expected.

At least looking at the code commits thus far, there are certain hints at what could be considered significant departures, possibly.
There was an announced new cache hierarchy. How different or new it is isn't clear, but there are some code comments with new naming conventions, such as indicating there is a per-CU L0 cache rather than an L1.

There are some indications of better, though not complete, interlocks in the pipeline--although I recall discussing in past architecture threads how I thought a good improvement to GCN proper would be to have those interlocks.
Some things, like how differently the SIMD path is handled, and why certain instructions related to branching, memory counts, or skipping instructions were changed or dropped, could be other areas of notable change.

Whether that's enough to be called "new" is, I suppose, up to AMD. The introduction of scalar memory writes and a new set of instructions for them in Volcanic Islands would be on the same level as some of these changes, and that didn't prompt AMD to declare Fiji or Tonga as not being GCN.
Maybe GFX10 is different enough for AMD, but that's counterbalanced by how AMD has muddied the waters as to what GCN is as an architectural model versus a collection of product minutiae.

I also don't see why a number of the Navi changes wouldn't be desired for the compute line. There are new caches, HSA-focused forward progress guarantees, memory ordering features, and pipeline improvements that would help a Vega successor as well, so how different a Vega successor would be--or why it would be similar enough to old products to still be called GCN--isn't clear.
 
https://www.amd.com/en/press-releas...ion-leadership-products-computex-2019-keynote
With a new compute unit(10) design, RDNA is expected to deliver incredible performance, power and memory efficiency in a smaller package compared to the previous generation Graphics Core Next (GCN) architecture. It is projected to provide up to 1.25X higher performance-per-clock(11) and up to 1.5X higher performance-per-watt over GCN(12), enabling better gaming performance at lower power and reduced latency.

Footnotes:
10. AMD APUs and GPUs based on the Graphics Core Next and RDNA architectures contain GPU Cores comprised of compute units, which are defined as 64 shaders (or stream processors) working together. GD-142
11. Testing done by AMD performance labs 5/23/19, showing a geomean of 1.25x per/clock across 30 different games @ 4K Ultra, 4xAA settings. Performance may vary based on use of latest drivers. RX-327
12. Testing done by AMD performance labs 5/23/19, using the Division 2 @ 25x14 Ultra settings. Performance may vary based on use of latest drivers. RX-325
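
Back-of-the-envelope only, and note from footnotes 11 and 12 that the two figures come from different workloads, so combining them is strictly illustrative: if performance per clock is 1.25x and performance per watt is 1.5x, the implied power at matched clocks for the same work is about 1.25 / 1.5 ≈ 0.83x.

```python
# Combining AMD's two "up to" figures (different test conditions, so a rough estimate only).
perf_per_clock = 1.25  # RX-327: geomean across 30 games
perf_per_watt = 1.50   # RX-325: The Division 2

relative_power_at_matched_clock = perf_per_clock / perf_per_watt
print(round(relative_power_at_matched_clock, 2))  # ~0.83, i.e. roughly 17% less power
```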
 
Getting rid of the GCN branding just to calm the haters down. Well played :)

Let's hope those "up to" improvements will be achievable in regular products.
 
In what the company is calling their Radeon DNA (RDNA) architecture – I should note that it's not clear if this is a branding exercise to downplay the GCN family name or if it's a more heavily overhauled architecture – AMD has revealed that Navi's compute units have been redesigned to improve their efficiency. AMD's press materials also note that, regardless of the above changes, the size hasn't changed: a single CU is still 64 stream processors.

Feeding the beast is a new multi-level cache hierarchy. AMD is touting that Navi’s cache subsystem offers both higher performance and lower latency than Vega’s, all for less power consumption. AMD has always been hamstrung a bit by memory/cache bottlenecks, so this would be a promising development for AMD’s GPU architecture. Meanwhile for a bit of reference, Vega already implemented a more modern cache hierarchy, so it would seem unlikely that AMD is changing their cache levels or what blocks are clients of which caches.
https://www.anandtech.com/show/1441...ducts-rx-5700-series-in-july-25-improved-perf
 