AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Entropy · May 27, 2019

3dilettante said:
I didn't see which GCN product this was compared to.

Couldn’t find it either. One would hope Radeon VII, or the efficiency numbers will be dominated by lithography and not architecture. (And efficiency is subject to large change merely by changing clocks and voltages around, so God knows what that will mean for real products.)
IPC is difficult when actual loads are strongly influenced by the memory solution. But they did mention their methods for achieving that, so there is probably something to it.

Frenetic Pony · May 27, 2019

A few quick mental estimates put a 40 CU card at similar to a 2070 on titles like FFXV and Total War/hammer 2, at least at 4K, and those are Nvidia-favored titles. If it's at the same clock speed as a Radeon VII then a bit below; if it's at, say, 2 GHz, which seems doable, it could be a bit above.

Of course AMD cards have done much better at getting not bottlenecked by sheer pixel throughput, vs other overhead like geometry where Nvidia has clearly won at lower resolutions. Whether this has been improved any will have to wait, same for any word on raytracing capability.

Any estimates of the die size from that shot? The tdp efficiency dropping to half is a giant improvement even if it is just from 14nm Vega. Assuming it's not as inefficient as a Vega VII then AMD could easily fit a "big Navi" 80cu card next year.

itsmydamnation · May 27, 2019

It is projected to provide up to 1.25X higher performance-per-clock [11] and up to 1.5X higher performance-per-watt over GCN [12]

11. Testing done by AMD performance labs 5/23/19, showing a geomean of 1.25x per/clock across 30 different games @ 4K Ultra, 4xAA settings. Performance may vary based on use of latest drivers. RX-327
12. Testing done by AMD performance labs 5/23/19, using the Division 2 @ 25x14 Ultra settings. Performance may vary based on use of latest drivers. RX-325

https://www.amd.com/en/press-releas...ion-leadership-products-computex-2019-keynote

CarstenS · May 27, 2019

Jawed said:
Did Beyond3D ever provide the source code for these tests? I'm reasonably sure they were at least partly debunked.

I think with Rys working where he does, he'd have notified B3D-Suite users if those tests were blatantly misleading wrt any architecture in general and GCN specifically.

That out of the way, the increase from random to black is not the all-telling value. For Radeon VII with its massive raw bandwidth, there's virtually no increase in throughput compared to using the texture caches over multiple runs. Whereas the V64 LCE, which can clock almost as high over short-term loads but has less than half the bandwidth available, can achieve a factor of 1.4 for 1 texture layer and 1.3 for 2.

For GeForce cards, the lowly 1030 (G5 version) can achieve the highest jumps from random to all black with about 3.8, where a 2060 FE hovers at 2.1 (max). With a higher number of texture layers (up to 32), the increase goes massively down for the 1030, notably down for Turing in general and only very little for bigger Pascal chips (GP104, GP102). And no, I don't have a link handy, since I did not publish those results anywhere yet.

Jawed · May 27, 2019

3dilettante said:
Yes, the primitive stream is broadcast to all front ends, and the same occupancy and throughput loss would be incurred across all shader engines. It's proportionally less of an impact in a GPU with 16 CUs per shader engine versus that ASCII diagram that has less than a third of the compute resources available.
Also unclear would be how salvage SKUs would be handled. A balanced salvage scheme would be cutting off resources in 20% increments.

Typically the reduction in ALU throughput is more severe than in other aspects (fillrate, bandwidth) for the salvage GPU...

Does it need to be "balanced"...

As far as attributing blame to the API, what specifically is inherent to the API that requires this? If there are many physical locations that may be responsible for handling all or part of a primitive, the question as to which ones are relevant needs to be answered by something, and then somehow the whole system needs to be updated with the answer.

Agreed.

Now imagine how a software renderer would do this. e.g. threads spawned across arbitrary processors to match workload. With data locality being a parameter of the spawn algorithm. And cost functions tuned to the algorithms.

The problem with hardware is partly that the data pathways have to be defined in advance for all use cases and have to be sized for a specified peak workload. So the data structures are fixed, the buffer sizes are fixed and the ratios of maximum throughput are fixed.

This would be similar to how unified shader architectures took over. To do that, substantial transistor budget was spent, but the rewards in performance were unignorable. Despite the fact that the hardware was no longer optimised specifically for the API of that time. Remember, for example, how vertex ALUs and pixel ALUs were different in their capability (ISA) before unified GPUs took over? (Though the API itself was changing to harmonise VS and PS instructions.)

In the end, primitive culling has been identified as a serious bottleneck in historical AMD GPUs. Loosening that bottleneck costs in complexity and potentially requires that the "smallest" GPUs (say $75) are comparatively "over-sized" compared with their predecessors. But that's been the story of the GPU for a long time. It's a bit like the smallest GPUs being able to do 8x MSAA, when they would crumble if actually run that way.

Jawed · May 27, 2019

3dilettante said:
There are some indications of better, though not complete, interlocks in the pipeline--although I recall discussing in past architecture threads how I thought a good improvement to GCN proper would be to have those interlocks.
Some things, like how differently the SIMD path is handled, and why certain instructions related to branching, memory counts, or skipping instructions changed/dropped could be other areas of notable change.

Wild arsed guess: AMD will re-introduce odd and even hardware threads...

I also don't see why a number of the Navi changes wouldn't be desired for the compute line.

Agreed.

Deleted member 13524 · May 27, 2019

Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.

3dilettante said:
I didn't see which GCN product this was compared to.

Entropy said:
Couldn’t find it either. One would hope Radeon VII, or the efficiency numbers will be dominated by lithography and not architecture.

I'm actually betting on Vega 14nm parts as comparison.

If they were comparing to Radeon VII, then the +50% efficiency comparison would be much closer to the +25% IPC.

My guess is the architectural improvements are coming at very low power cost, so they're getting:
- x1.25 more performance out of new cache hierarchy and CU adjustments
- x1.2 higher clocks out of 14nm -> 7nm transition (which is the clock difference between a 1.45GHz Vega 64 and a 1.75GHz Radeon VII)

1.25 x 1.2 = 1.5, hence the 50% higher efficiency, or higher performance at ISO power.

There's also an update at Anandtech quoting Lisa Su who said the new efficiency comes partly from new process technologies.

I think it's a 40 CU / 2560sp part running at 1.75GHz, with a power budget close to 200W.
It should get Vega 64 performance at 300W / 1.5 = 200W.
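
A minimal sketch of the arithmetic behind this estimate, using only the figures quoted in the post. The function names are hypothetical, and multiplying the two factors assumes the per-clock gain and the clock gain are independent and both arrive at roughly unchanged power, which is the post's own guess and not something AMD has stated.

```python
# Back-of-envelope check of the perf/watt guess above.
# Assumption (from the post, not from AMD): the per-clock gain and the
# clock gain multiply, and both arrive at roughly unchanged power.


def combined_gain(per_clock_gain: float, clock_gain: float) -> float:
    """Multiply two speedup factors that are assumed to be independent."""
    return per_clock_gain * clock_gain


def power_at_same_performance(baseline_watts: float, perf_per_watt_gain: float) -> float:
    """Power needed to match the baseline's performance after an efficiency gain."""
    return baseline_watts / perf_per_watt_gain


# Clock figures quoted in the post: 1.45GHz Vega 64 vs 1.75GHz Radeon VII.
clock_gain = 1.75 / 1.45
total_gain = combined_gain(1.25, clock_gain)

# 300W is the Vega 64 power figure used in the post.
navi_watts = power_at_same_performance(300.0, 1.5)

print(f"clock gain: x{clock_gain:.2f}")
print(f"combined gain: x{total_gain:.2f}")
print(f"power for Vega 64 performance: {navi_watts:.0f} W")
```

The exact clock ratio is about 1.21 rather than 1.2, so the product lands just above 1.5, which is consistent with the rounded figures in the post.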

Ryan Smith · May 27, 2019

ToTTenTranz said:
Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.

Best guess from Andrei Frumusanu based on some other photos we have: 275mm2, +/- 5mm2.

3dilettante said:
I didn't see which GCN product this was compared to.

For power efficiency, it would seem to be against a 14nm Vega 10 product. Lisa's specific words on the subject:

"And then, when you put that together, both the architecture – the design capability – as well as the process technology, we're seeing 1.5x or higher performance per watt capability on the new Navi products" (emphasis mine)

anexanhume · May 27, 2019

Ryan Smith said:
Best guess from Andrei Frumusanu based on some other photos we have: 275mm2, +/- 5mm2.
For power efficiency, it would seem to be against a 14nm Vega 10 product. Lisa's specific words on the subject:

"And then, when you put that together, both the architecture – the design capability – as well as the process technology, we're seeing 1.5x or higher performance per watt capability on the new Navi products" (emphasis mine)

I see a lot of problems with the latter statement. We know Radeon 7 improves on Vega 14nm performance 1.25X iso power due to AMD slides and benchmarks. If Navi is 1.25X better than Vega architecture wise, it should be 1.25X times 1.25X (1.56X) faster than Vega 14nm on a per watt basis. This implies a performance per watt regression on node compared to Radeon 7’s gains. What’s the reason for this? More power dedicated to memory, proportionally?
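
The discrepancy can be made explicit with a short sketch. It uses only the numbers in this post and the quoted "1.5x or higher" statement; the variable names are hypothetical, and treating the node gain and the architecture gain as simple multipliers is an assumption of the argument, not a measured result.

```python
# Compare the expected perf/watt (node gain x architecture gain) with the
# quoted claim, and back out the node gain that the claim would imply.
# Assumption: the two gains compose multiplicatively.

radeon_vii_node_gain = 1.25   # Radeon VII vs 14nm Vega at iso power, per the post
navi_arch_gain = 1.25         # Navi vs GCN per clock, per AMD's footnote
claimed_total = 1.5           # "1.5x or higher", so this is a lower bound

expected_total = radeon_vii_node_gain * navi_arch_gain
implied_node_gain = claimed_total / navi_arch_gain

print(f"expected perf/watt vs 14nm Vega: x{expected_total:.4f}")
print(f"node gain implied by the claim:  x{implied_node_gain:.2f}")
```

Because the claim is phrased as "1.5x or higher", the implied node gain of about 1.2 is a floor rather than a point estimate.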

Globalisateur · May 27, 2019

ToTTenTranz said:
Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.

I'm actually betting on Vega 14nm parts as comparison.

If they were comparing to Radeon VII, then the +50% efficiency comparison would be much closer to the +25% IPC.

My guess is the architectural improvements are coming at very low power cost, so they're getting:
- x1.25 more performance out of new cache hierarchy and CU adjustments
- x1.2 higher clocks out of 14nm -> 7nm transition (which is the clock difference between a 1.45GHz Vega 64 and a 1.75GHz Radeon VII)

1.25 x 1.2 = 1.5, hence the 50% higher efficiency, or higher performance at ISO power.

There's also an update at Anandtech quoting Lisa Su who said the new efficiency comes partly from new process technologies.

I think it's a 40 CU / 2560sp part running at 1.75GHz, with a power budget close to 200W.
It should get Vega 64 performance at 300W / 1.5 = 200W.

It must be bigger than 40 CU if it has a size of 275mm2. No ?

anexanhume · May 27, 2019

Globalisateur said:
It must be bigger than 40 CU if it has a size of 275mm2. No ?

Probably. Polaris fits 36 CU in 232 mm^2 on 14nm.

That would be some big CU growth if it wasn’t 48 CU or more, IMO. Vega 7nm fits 64CU in 330 mm^2. I’m assuming the 4096-bit HBM interface is at least as big as a 256-bit GDDR6 interface here.
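
A rough sketch of the area comparison, assuming the 275 mm^2 estimate quoted earlier in the thread. The helper name is hypothetical, and dividing whole-die area by CU count is a crude proxy, since it folds memory interfaces, caches and front-end logic into the per-CU figure.

```python
# Whole-die area per CU for the chips mentioned above, plus a few
# candidate CU counts for a 275 mm^2 Navi die.
# Assumption: 275 mm^2 is the thread's estimate, not a confirmed figure.


def mm2_per_cu(die_area_mm2: float, cu_count: int) -> float:
    """Whole-die area divided by CU count (includes non-CU logic)."""
    return die_area_mm2 / cu_count


polaris_10 = mm2_per_cu(232.0, 36)   # 14nm
vega_7nm = mm2_per_cu(330.0, 64)     # 7nm

NAVI_AREA_MM2 = 275.0
navi_candidates = {cus: mm2_per_cu(NAVI_AREA_MM2, cus) for cus in (40, 48, 56)}

print(f"Polaris 10 (14nm): {polaris_10:.2f} mm^2 per CU")
print(f"Vega (7nm):        {vega_7nm:.2f} mm^2 per CU")
for cus, area in navi_candidates.items():
    print(f"Navi at {cus} CU:     {area:.2f} mm^2 per CU")
```

At 40 CUs the whole-die area per CU would exceed Polaris on 14nm, which is the point being made here; 48 and 56 CUs bracket the Vega 7nm figure.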

del42sa · May 27, 2019

anexanhume said:
Probably. Polaris fits 36 CU in 232 mm^2 on 14nm.

That would be some big CU growth if it wasn’t 48 CU or more, IMO. Vega 7nm fits 64CU in 330 mm^2. I’m assuming the 4096-bit HBM interface is at least as big as a 256-bit GDDR6 interface here.

it could be like 56CU or something like that. And I do agree, from that point of view it doesn't look that great ....

yuri · May 27, 2019

Globalisateur said:
It must be bigger than 40 CU if it has a size of 275mm2. No ?

There are probably major changes in cache hierarchy and also the CU organization. On top of that there is AMD's first-gen GDDR6 interface. This might imply changes in perf/mm2, right?

McHuj · May 27, 2019

Globalisateur said:
It must be bigger than 40 CU if it has a size of 275mm2. No ?

Why? We don’t know what area impact the new enhancements had. I bet they increased the on-chip cache significantly.

anexanhume · May 27, 2019

del42sa said:
it could be like 56CU or something like that. And I do agree, from that point of view it doesn't look that great ....

I’m actually hoping for some CU footprint growth. Nvidia has made it clear you can be bigger and more efficient. Apple has shown this too with mobile CPUs. Die cost hurts, yes, but when a better arch comes out the other end, I’ll take it.

Malo · May 27, 2019

3dilettante said:
I didn't see which GCN product this was compared to.

From the Anandtech article it seems to be compared to Vega 14nm.

del42sa · May 27, 2019

yuri said:
There are probably major changes in cache hierarchy and also the CU organization. On top of that there is AMD's first-gen GDDR6 interface. This might imply changes in perf/mm2, right?

Vega already presented a big change in cache organization with an enormous 45MB of SRAM on die. Adding even more with Navi seems highly unlikely....

Deleted member 13524 · May 27, 2019

Globalisateur said:
It must be bigger than 40 CU if it has a size of 275mm2. No ?

Not if they're significantly increasing cache sizes, front-end width and using a larger proportion of higher-frequency / lower-density transistors.

Bondrewd · May 27, 2019

anexanhume said:
I see a lot of problems with the latter statement. We know Radeon 7 improves on Vega 14nm performance 1.25X iso power due to AMD slides and benchmarks. If Navi is 1.25X better than Vega architecture wise, it should be 1.25X times 1.25X (1.56X) faster than Vega 14nm on a per watt basis. This implies a performance per watt regression on node compared to Radeon 7’s gains. What’s the reason for this? More power dedicated to memory, proportionally?

They're being purposefully vague, for whatever reason.

AlNom · May 27, 2019

Ryan Smith said:
Best guess from Andrei Frumusanu based on some other photos we have: 275mm2, +/- 5mm2.

For all the times she's gone on stage, you'd think we'd just have the physical measurements for Lisa's hand for easier size analyses.
