AMD: Navi Speculation, Rumours and Discussion [2019-2020]

I didn't see which GCN product this was compared to.
Couldn't find it either. One would hope Radeon VII, or else the efficiency numbers will be dominated by lithography rather than architecture. (And efficiency is subject to large change merely by shifting clocks and voltages around, so God knows what that will mean for real products.)
IPC is difficult to judge when actual loads are strongly influenced by the memory solution. But they did mention their methods for achieving that, so there is probably something to it.
 
A few quick mental estimates put a 40 CU card at similar to a 2070 in titles like FFXV and Total War: Warhammer 2, at least at 4K, and those are Nvidia-favored titles. If it's at the same clock speed as a Radeon VII, then a bit below; if it's, say, 2GHz, which seems doable, it could be a bit above.

Of course, AMD cards have done much better at not being bottlenecked by sheer pixel throughput, as opposed to other overhead like geometry, where Nvidia has clearly won at lower resolutions. Whether this has been improved at all will have to wait, as will any word on raytracing capability.

Any estimates of the die size from that shot? TDP dropping to half is a giant improvement even if the baseline is just 14nm Vega. Assuming it's not as inefficient as Radeon VII, AMD could easily fit a "big Navi" 80 CU card next year.
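To make the mental arithmetic explicit, here is a quick sketch assuming performance scales roughly linearly with clock at a fixed CU count and taking AMD's quoted 1.25x IPC gain at face value (all numbers illustrative, not measured):

```python
# Rough scaling estimate for a hypothetical 40 CU Navi part.
# Assumes performance scales ~linearly with clock at a fixed CU count;
# figures are illustrative guesses, not measurements.
RADEON_VII_CLOCK = 1.75  # GHz, roughly its typical game clock

def relative_perf(clock_ghz, ipc_gain=1.25, baseline_clock=RADEON_VII_CLOCK):
    """Performance relative to a GCN part at baseline_clock with the same CU count."""
    return ipc_gain * clock_ghz / baseline_clock

print(relative_perf(1.75))  # 1.25x at Radeon VII clocks ("a bit below" a 2070?)
print(relative_perf(2.0))   # ~1.43x if 2GHz is doable ("a bit above"?)
```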
 
It is projected to provide up to 1.25X higher performance-per-clock [11] and up to 1.5X higher performance-per-watt over GCN [12]

11. Testing done by AMD performance labs 5/23/19, showing a geomean of 1.25x performance-per-clock across 30 different games @ 4K Ultra, 4xAA settings. Performance may vary based on use of latest drivers. RX-327
12. Testing done by AMD performance labs 5/23/19, using The Division 2 @ 25x14 (2560x1440) Ultra settings. Performance may vary based on use of latest drivers. RX-325

https://www.amd.com/en/press-releas...ion-leadership-products-computex-2019-keynote
 
Did Beyond3D ever provide the source code for these tests? I'm reasonably sure they were at least partly debunked.

I think, with Rys working where he does, he'd have notified B3D-Suite users if those tests were blatantly misleading wrt any architecture in general or GCN specifically.

That out of the way, the increase from random to black is not the all-telling value. For Radeon VII with its massive raw bandwidth, there's virtually no increase in throughput when the texture caches come into play over multiple runs. Whereas the V64 LCE, which can clock almost as high over short-term loads but has less than half the bandwidth available, can achieve a factor of 1.4 for 1 texture layer and 1.3 for 2.

For GeForce cards, the lowly 1030 (G5 version) achieves the highest jump from random to all-black at about 3.8, whereas a 2060 FE hovers at 2.1 (max). With higher numbers of texture layers (up to 32), the increase drops massively for the 1030, notably for Turing in general, and only very little for the bigger Pascal chips (GP104, GP102). And no, I don't have a link handy, since I have not published those results anywhere yet.
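For anyone curious how such a factor is derived, a minimal sketch of the reduction (hypothetical timings, not the actual B3D-Suite code):

```python
# Hypothetical reduction of a texture bandwidth test: sample the same
# number of texels from a random texture (defeats caches/compression)
# and from an all-black one (cache- and compression-friendly).

def throughput(texels_sampled: int, seconds: float) -> float:
    """Texels sampled per second over a timed run."""
    return texels_sampled / seconds

# Illustrative timings only, not real measurements:
random_tp = throughput(2**30, 0.52)
black_tp = throughput(2**30, 0.14)

# A factor near 1 means raw DRAM bandwidth alone keeps up (Radeon VII);
# a large factor means the caches are doing the heavy lifting (V64 LCE, 1030).
print(f"random -> black factor: {black_tp / random_tp:.2f}")
```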
 
Yes, the primitive stream is broadcast to all front ends, and the same occupancy and throughput loss would be incurred across all shader engines. It's proportionally less of an impact in a GPU with 16 CUs per shader engine versus that ASCII diagram that has less than a third of the compute resources available.
Also unclear would be how salvage SKUs would be handled. A balanced salvage scheme would be cutting off resources in 20% increments.
Typically the reduction in ALU throughput is more severe than in other aspects (fillrate, bandwidth) for the salvage GPU...

Does it need to be "balanced"...

As far as attributing blame to the API, what specifically is inherent to the API that requires this? If there are many physical locations that may be responsible for handling all or part of a primitive, the question as to which ones are relevant needs to be answered by something, and then somehow the whole system needs to be updated with the answer.
Agreed.

Now imagine how a software renderer would do this. e.g. threads spawned across arbitrary processors to match workload. With data locality being a parameter of the spawn algorithm. And cost functions tuned to the algorithms.

The problem with hardware is partly that the data pathways have to be defined in advance for all use cases and have to be sized for a specified peak workload. So the data structures are fixed, the buffer sizes are fixed and the ratios of maximum throughput are fixed.
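As a toy illustration of that spawn idea, here is a hypothetical cost function that weighs queue depth against data locality; every name and weight below is invented for the sketch:

```python
# Toy work-spawning heuristic for a hypothetical software renderer:
# pick the worker whose cost (queue depth plus a data-transfer penalty)
# is lowest for a given tile of work. Entirely illustrative.
from dataclasses import dataclass, field

@dataclass
class Worker:
    id: int
    queue_depth: int = 0
    resident_tiles: set = field(default_factory=set)

def spawn_cost(worker: Worker, tile: int, locality_weight: float = 4.0) -> float:
    # The tunable cost function: a miss on locally resident data costs
    # as much as `locality_weight` already-queued items.
    miss = 0.0 if tile in worker.resident_tiles else 1.0
    return worker.queue_depth + locality_weight * miss

def spawn(workers: list, tile: int) -> int:
    best = min(workers, key=lambda w: spawn_cost(w, tile))
    best.queue_depth += 1
    best.resident_tiles.add(tile)
    return best.id

workers = [Worker(0, resident_tiles={3}), Worker(1), Worker(2)]
print(spawn(workers, tile=3))  # picks worker 0: the tile is already resident
```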

This would be similar to how unified shader architectures took over. To do that, substantial transistor budget was spent, but the rewards in performance were too great to ignore, despite the fact that the hardware was no longer optimised specifically for the API of that time. Remember, for example, how vertex ALUs and pixel ALUs differed in their capability (ISA) before unified GPUs took over? (Though the API itself was changing to harmonise VS and PS instructions.)

In the end, primitive culling has been identified as a serious bottleneck in historical AMD GPUs. Loosening that bottleneck costs complexity and potentially requires that the "smallest" GPUs (say $75) are comparatively "over-sized" compared with their predecessors. But that's been the story of the GPU for a long time. It's a bit like the smallest GPUs being able to do 8x MSAA, when they would crumble if actually run that way.
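For reference, the per-primitive test such a culling stage performs is cheap; here is a sketch of back-face and zero-area rejection, with Python standing in for what would really be fixed-function hardware or shader code:

```python
# Minimal primitive-culling sketch: reject back-facing and degenerate
# triangles before they reach the rasterizer. Illustrative only.

def signed_area_2x(v0, v1, v2):
    """Twice the signed screen-space area of a triangle (CCW positive)."""
    return (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])

def should_cull(v0, v1, v2, eps=1e-9):
    area2 = signed_area_2x(v0, v1, v2)
    return area2 <= eps  # back-facing (negative) or degenerate (near zero)

print(should_cull((0, 0), (1, 0), (0, 1)))  # False: front-facing, keep it
print(should_cull((0, 0), (0, 1), (1, 0)))  # True: wound backwards, cull it
```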
 
There are some indications of better, though not complete, interlocks in the pipeline--although I recall discussing in past architecture threads how I thought a good improvement to GCN proper would be to have those interlocks.
Some things, like how differently the SIMD path is handled, and why certain instructions related to branching, memory counts, or skipping instructions were changed or dropped, could be other areas of notable change.
Wild arsed guess: AMD will re-introduce odd and even hardware threads...

I also don't see why a number of the Navi changes wouldn't be desired for the compute line.
Agreed.
 
Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.


I didn't see which GCN product this was compared to.
Couldn’t find it either. One would hope Radeon VII, or the efficiency numbers will be dominated by lithography and not architecture.

I'm actually betting on Vega 14nm parts as comparison.

If they were comparing to Radeon VII, then the +50% efficiency comparison would be much closer to the +25% IPC.

My guess is the architectural improvements are coming at very low power cost, so they're getting:
- x1.25 more performance out of new cache hierarchy and CU adjustments
- x1.2 higher clocks out of 14nm -> 7nm transition (which is the clock difference between a 1.45GHz Vega 64 and a 1.75GHz Radeon VII)

1.25 x 1.2 = 1.5, hence the 50% higher efficiency, or higher performance at ISO power.

There's also an update at Anandtech quoting Lisa Su who said the new efficiency comes partly from new process technologies.

I think it's a 40 CU / 2560sp part running at 1.75GHz, with a power budget close to 200W.
It should get Vega 64 performance at 300W / 1.5 = 200W.
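The compounding is easy to sanity-check, taking the figures above at face value:

```python
# Compounding the guessed gains (figures from the post above, not measured):
ipc_gain = 1.25            # architecture: cache hierarchy + CU adjustments
clock_gain = 1.75 / 1.45   # ~1.21: Radeon VII clocks vs Vega 64 clocks
print(ipc_gain * clock_gain)  # ~1.51, i.e. the quoted ~1.5x perf/W
print(300 / 1.5)              # 200W for Vega 64 performance at ISO perf
```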
 
Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.
Best guess from Andrei Frumusanu based on some other photos we have: 275mm2, +/- 5mm2.
I didn't see which GCN product this was compared to.
For power efficiency, it would seem to be against a 14nm Vega 10 product. Lisa's specific words on the subject:

"And then, when you put that together, both the architecture – the design capability – as well as the process technology, we're seeing 1.5x or higher performance per watt capability on the new Navi products" (emphasis mine)
 
Best guess from Andrei Frumusanu based on some other photos we have: 275mm2, +/- 5mm2.
For power efficiency, it would seem to be against a 14nm Vega 10 product. Lisa's specific words on the subject:

"And then, when you put that together, both the architecture – the design capability – as well as the process technology, we're seeing 1.5x or higher performance per watt capability on the new Navi products" (emphasis mine)

I see a lot of problems with the latter statement. We know from AMD's slides and benchmarks that Radeon VII improves on 14nm Vega performance by 1.25x at iso power. If Navi is 1.25x better than Vega architecture-wise, it should be 1.25x times 1.25x (1.56x) faster than 14nm Vega on a per-watt basis. A claim of only 1.5x therefore implies a performance-per-watt regression on the node compared to Radeon VII's gains. What's the reason for this? More power dedicated to memory, proportionally?
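Spelled out, the expected compounding would be:

```python
# If node and architecture gains multiplied cleanly (assumption, not fact):
node_gain = 1.25  # Radeon VII vs 14nm Vega at iso power, per AMD slides
arch_gain = 1.25  # claimed Navi gain over GCN
print(node_gain * arch_gain)  # 1.5625 expected vs the quoted "1.5x or higher"
```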
 
Is anyone trying to measure that Navi's size?
Seems to be a direct replacement for Polaris 10, something between 200mm^2 and 250mm^2.

I think it's a 40 CU / 2560sp part running at 1.75GHz, with a power budget close to 200W.
It should get Vega 64 performance at 300W / 1.5 = 200W.
It must be bigger than 40 CU if it has a size of 275mm2, no?
 
It must be bigger than 40 CU if it has a size of 275mm2, no?
Probably. Polaris fits 36 CU in 232 mm^2 on 14nm.

That would be some big CU growth if it wasn't 48 CU or more, IMO. Vega 7nm fits 64 CU in 330 mm^2. I'm assuming the 4096-bit HBM interface is at least as big as a 256-bit GDDR6 interface here.
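Back-of-the-envelope CU density from those numbers (this ignores uncore, I/O and node differences, so treat it as a loose bound):

```python
# Naive CU-density extrapolation from the die sizes quoted above:
polaris_mm2_per_cu = 232 / 36  # ~6.4 mm^2 per CU on 14nm (Polaris 10)
vega20_mm2_per_cu = 330 / 64   # ~5.2 mm^2 per CU on 7nm (Vega 20)
navi_die_mm2 = 275
print(navi_die_mm2 / vega20_mm2_per_cu)  # ~53 CUs at Vega 20 density
```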
 
Probably. Polaris fits 36 CU in 232 mm^2 on 14nm.

That would be some big CU growth if it wasn't 48 CU or more, IMO. Vega 7nm fits 64 CU in 330 mm^2. I'm assuming the 4096-bit HBM interface is at least as big as a 256-bit GDDR6 interface here.

It could be like 56 CU or something like that. And I do agree, from that point of view it doesn't look that great...
 
It could be like 56 CU or something like that. And I do agree, from that point of view it doesn't look that great...
I’m actually hoping for some CU footprint growth. Nvidia has made it clear you can be bigger and more efficient. Apple has shown this too with mobile CPUs. Die cost hurts, yes, but when a better arch comes out the other end, I’ll take it.
 
There are probably major changes in the cache hierarchy and also the CU organization. On top of that, there is AMD's first-gen GDDR6 interface. This might imply changes in perf/mm2, right?

Vega already presented a big change in cache organization, with an enormous 45MB of SRAM on die. Adding even more with Navi seems highly unlikely...
 
It must be bigger than 40 CU if it has a size of 275mm2. No ?
Not if they're significantly increasing cache sizes, front-end width and using a larger proportion of higher-frequency / lower-density transistors.
 
I see a lot of problems with the latter statement. We know from AMD's slides and benchmarks that Radeon VII improves on 14nm Vega performance by 1.25x at iso power. If Navi is 1.25x better than Vega architecture-wise, it should be 1.25x times 1.25x (1.56x) faster than 14nm Vega on a per-watt basis. A claim of only 1.5x therefore implies a performance-per-watt regression on the node compared to Radeon VII's gains. What's the reason for this? More power dedicated to memory, proportionally?
They're being purposefully vague, for whatever reason.
 