There's been a modest stream of LLVM changes coming out, and a few curious benchmark database entries.
I think the acronym is RDNA, at least officially. In the code changes themselves, I haven't run into any mention of RDNA, despite many GCN references and flags shared with GCN GPUs, including many that line GFX10 up with older GCN architectures.
Perhaps the omission is for secrecy, or the RDNA label simply isn't used by some of the staff responsible for supporting the architecture, whether because it wasn't communicated to them or for other reasons.
I'm not 100% certain I'm reading the autotranslated text totally right, and I'm not ruling out a possible change like dual-issue or a difference in issue latency.
However, based on my (non-authoritative) interpretation of what's being said, I think the changes are being misconstrued.
There is a potential difference in how workgroups allocate their wavefronts: in one mode, a workgroup can have wavefronts on more than one CU. That has implications for barriers, which only had to be supported within a CU when workgroups were limited to one CU each. The memory comments seem concerned with visibility/ordering of workgroup memory accesses in the event that wavefronts are no longer reading/writing through one CU's local cache. This seems of higher importance given that the code those changes were made to deals with synchronization and writes to possibly shared global memory.
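As a loose CPU-side sketch of why that matters (an analogy only, none of these names come from the LLVM changes), two threads can stand in for wavefronts of one workgroup that land on different CUs: the barrier orders execution, and the extra memory work the comments describe corresponds to making the pre-barrier write visible through a level both sides can see.

```python
import threading

# Hypothetical analogy, not actual GPU code: two threads stand in for
# wavefronts of one workgroup that may end up on different CUs. The
# barrier stands in for s_barrier; in hardware, the visibility concern
# in the comments is that the write below must also be observable
# through a cache level shared by both CUs, not just the writer's own.

shared = {"data": None}          # stands in for a global-memory location
barrier = threading.Barrier(2)   # workgroup-wide barrier, two "wavefronts"

def producer():
    shared["data"] = 42          # write to "global" memory
    barrier.wait()               # workgroup barrier after the write

def consumer(out):
    barrier.wait()               # wait for the producing wavefront
    out.append(shared["data"])   # safe only if the write is visible here

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t1.start(); t2.start()
t1.join(); t2.join()
print(result[0])   # 42: the barrier made the write visible
```

On a CPU the barrier alone gives this guarantee; the point of the LLVM memory comments, as I read them, is that split-CU workgroups would need equivalent cache maintenance to get it.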
There may be something new about this L0: there is a new bit for coherence, more active discussion of invalidations versus the write-through GCN L1, and a new memory counter. What specifically the L0 is versus the L1 in prior generations isn't clear, since the vector memory path in current GCN already orders accesses within a wavefront, at least.
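To illustrate that distinction in a toy model (entirely hypothetical, generic cache behavior rather than the actual hardware): a write-through cache pushes every write to the shared level immediately, so readers only need to miss to see new data, while a cache that holds lines until explicitly invalidated can keep serving stale data until that invalidate happens.

```python
# Toy model, not the real hardware: contrast a write-through L1 with a
# cache whose lines persist until an explicit invalidate, which is the
# kind of operation the new invalidation discussion would matter for.

class SharedL2:
    """Shared level visible to all 'CUs'."""
    def __init__(self):
        self.mem = {}

class WriteThroughL1:
    """GCN-style behavior: every write passes straight to the shared level."""
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}
    def write(self, addr, val):
        self.lines[addr] = val
        self.l2.mem[addr] = val          # visible at the shared level right away

class InvalidatingCache:
    """Hypothetical behavior: filled lines stay until explicitly invalidated."""
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fill from the shared level
            self.lines[addr] = self.l2.mem.get(addr)
        return self.lines[addr]
    def invalidate(self):
        self.lines.clear()               # what an explicit invalidate op would do

l2 = SharedL2()
writer = WriteThroughL1(l2)              # one "CU" writing
reader = InvalidatingCache(l2)           # another "CU" reading

reader.read(0x10)            # caches the (empty) line
writer.write(0x10, 7)        # write-through: L2 is updated immediately
print(reader.read(0x10))    # stale hit: None, the old cached line
reader.invalidate()
print(reader.read(0x10))    # refill from L2: 7
```

The stale read in the middle is the failure mode that explicit invalidates (and, presumably, the new memory counter for tracking outstanding operations) would exist to prevent.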
I don't think it's the same as the patents' register cache, which is local to a SIMD/cluster and on the wrong end of the memory pipeline to be of any concern for other CUs or wavefronts.
There is a single reference to a register destination cache in https://github.com/llvm-mirror/llvm...3380939#diff-ad4812397731e1d4ff6992207b4d38fa, which is a different file with a different purpose.
There's some discussion of code comments for the buggy prefetch instruction, and some discussion of the size of either the vector cache or L0, both of which may be red herrings. For one thing, the prefetch and I$ comments deal with instruction fetch, which is not subject to synchronization operations or atomic writes; ordering concerns between CUs for static code seem unnecessary. Claims that the size of the destination cache somehow matches a workgroup don't seem supported by what I have read, and I don't think they would be consistent with the patents--I may be misreading the translation, though.