My theory on why AMD packed the RT cores into the TMUs.
My theory, prior to this being confirmed, was that there were two areas of the GCN architecture that could be adapted for RT rather than creating a new domain: the shared scalar pipeline, which has its own memory subsystem, or the texturing path, which can already calculate addresses and operate on data before passing a filtered result back.
RDNA2 reduced the independence of the scalar path, but that still left the portion of the pipeline tied to the texturing units. RT involves a lot of memory accesses and scheduling, so the texturing path's existing handling of the vector memory subsystem, and its independence from the SIMD hardware, would make it a natural place to add a similar function.
128 MB isn't very much, all things considered. Intel's old Crystal Well IGP had 128 MB of L4 for something like 1/20th the performance. It will be interesting to see how it works out.
At 32x the size of Navi 10's L2, the old rule of thumb that miss rate scales with the inverse square root of capacity would give a miss rate around 18% that of the 4 MB cache. A more streaming workload may not see as much benefit, but it's also possible that if the cache grew large enough it would cross a threshold where data could stay resident long enough to see significant reuse.
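To put numbers on that rule of thumb, here's a quick back-of-the-envelope; the 4 MB and 128 MB figures are just Navi 10's L2 and the rumored cache size, nothing official:

```python
# Sqrt rule of thumb: miss_rate scales with 1 / sqrt(capacity),
# so 32x the capacity -> roughly 1/sqrt(32) of the misses.
import math

navi10_l2_mb = 4          # Navi 10's L2 size
big_cache_mb = 128        # rumored cache size

ratio = math.sqrt(navi10_l2_mb / big_cache_mb)
print(f"relative miss rate: {ratio:.3f}")  # ~0.177, i.e. ~18% of the 4 MB cache's misses
```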
The area of a GPU L2 and its associated hardware isn't that compact, and even a denser implementation like Zen 2's L3 would put it at over 100 mm² just for the cache.
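As a rough sanity check, assuming something in the neighborhood of Zen 2's L3 density, around 1 mm² per MB on 7nm including tags and control (my assumption, not a measured figure):

```python
# Rough area estimate for a 128 MB on-die cache at an assumed Zen 2-like L3 density.
cache_mb = 128
mm2_per_mb = 1.05          # assumed density, arrays plus tags/control

print(f"~{cache_mb * mm2_per_mb:.0f} mm^2 just for the cache")  # ~134 mm^2
```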
Bandwidth-wise, it would seem this cache should provide enough transactions to be at least equivalent to whatever memory bus it is compensating for, though that would cost density.
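For a sense of scale, here's what "equivalent to the memory bus" works out to in bytes per clock, using a hypothetical 256-bit 16 Gbps GDDR6 bus and a ~2 GHz cache clock; these are placeholder numbers, not confirmed specs:

```python
# How many bytes per clock the cache must deliver to match a given DRAM bus.
bus_width_bits = 256       # hypothetical bus being "compensated for"
data_rate_gbps = 16        # per-pin data rate
cache_clock_ghz = 2.0      # assumed cache clock

dram_bw_gbs = bus_width_bits / 8 * data_rate_gbps      # 512 GB/s
bytes_per_clock = dram_bw_gbs / cache_clock_ghz        # 256 B/clk across all slices
print(f"DRAM: {dram_bw_gbs:.0f} GB/s -> cache needs ~{bytes_per_clock:.0f} B/clock to match")
```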
RedGamingTech also claims that RDNA2 beyond 2.23 GHz produces logic problems, apparently supported by someone from Sony in relation to the PS5.
What logic problems could arise due to high frequencies?
Every chip driven by a clock signal allocates a certain amount of time for each pipeline stage. Signals need to propagate through layers of logic between stages, and there is some minimum time before all signals can safely be assumed to have reached the end of the stage, ahead of the next clock edge. Some parts of the chip take longer than others, and if the clock period shrinks enough, some portions of logic can no longer be relied upon to function as expected across the chip's operating range. Whatever behavior the chip is supposed to have becomes prone to failure, which can show up as data corruption or instability.
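As a toy illustration of how that timing budget tightens with frequency; the delay numbers below are invented for illustration, not measured from any RDNA2 part:

```python
# Each pipeline stage must fit its logic delay plus setup time and clock skew
# into one clock period. Slack is whatever margin is left over.
def slack_ps(freq_ghz, logic_delay_ps, setup_ps=40, skew_ps=25):
    period_ps = 1000.0 / freq_ghz            # clock period in picoseconds
    return period_ps - (logic_delay_ps + setup_ps + skew_ps)

for f in (2.0, 2.23, 2.5):
    print(f"{f:.2f} GHz: slack = {slack_ps(f, logic_delay_ps=380):.0f} ps")
# Once slack goes negative on the slowest paths, those stages can no longer be
# trusted to settle before the next clock edge across the whole operating range.
```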
Wouldn't cache be helpful in reducing the amount of memory traffic for BVH traversal and thus speed up ray tracing?
It could reduce the rate of misses going off-chip, although depending on how the cache hierarchy is implemented it may not be significantly faster in terms of latency, since traversing a GPU cache hierarchy tends to take as long as, or longer than, the actual DRAM access.
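For context on why traversal is so memory-bound, here's a minimal sketch of a stack-based BVH walk; every iteration is a dependent node fetch, which is exactly the kind of traffic a large cache could absorb. The node layout and hit tests here are simplified stand-ins, not how the hardware actually stores or tests the BVH:

```python
# Minimal BVH traversal sketch: each step loads a node whose address depends on
# the previous node, so the walk is gated by memory latency more than by math.
def traverse(nodes, ray, root=0):
    stack, hits = [root], []
    while stack:
        node = nodes[stack.pop()]          # dependent memory access per node
        if not node["bbox_hit"](ray):      # box test; the RT unit does this in hardware
            continue
        if "tri" in node:                  # leaf: record the candidate triangle
            hits.append(node["tri"])
        else:                              # inner node: push children -> more loads to come
            stack.extend(node["children"])
    return hits
```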