AMD Radeon RDNA2 Navi (RX 6700 XT, RX 6800, 6800 XT, 6900 XT) [2020-10-28, 2021-03-03]

Discussion in 'Architecture and Products' started by BRiT, Oct 28, 2020.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,819
    Likes Received:
    3,976
    Location:
    Finland
    No, but a product by one company isn't a standard by any stretch of the imagination before some accepted organization makes it such.
    JEDEC is the most prominent one when it comes to memory standards, and even if there are others, none of them has made GDDR6X a standard.
     
  2. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    17,821
    Likes Received:
    7,883
    Like the L1/L2/L3 caches on CPUs? Like register space on all computing devices? I mean, once upon a time CPUs didn't have any cache other than register space, but memory speeds couldn't keep up.

    Obviously if main memory were fast enough then CPUs wouldn't need L1/L2/L3 cache or even register space. Everything is a workaround in computing, because rarely is anything as fast as one wants it to be ... or, more importantly, rarely is anything as cheap (either in transistor cost or energy cost) as one wants it to be.

    There may (and probably eventually will) come a time when NV can't pay for custom memory with the bandwidth they require - then what? I'd imagine something similar to Infinity Cache.

    I mean, thinking about it another way: isn't DLSS just a workaround with IQ pitfalls because NV GPUs can't render fast enough at higher resolutions with RT enabled? Something that AMD desperately needs, IMO.

    Just because it's a workaround for a technology limitation doesn't mean it's bad. After all, everything is limited by technology. Just look at games: EVERYTHING in games is a workaround to try to get something to render fast enough, even though it comes with plenty of "pitfalls".

    Regards,
    SB
     
    Leovinus, Lightman, Scott_Arm and 3 others like this.
  3. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    I wasn't talking about whether it's good or bad. It all comes down to engineering trade-offs.

    As of now, at 7 nm, that cache eats up more die area than two additional memory controllers would have taken otherwise.

    That looks like a net perf/mm² loss to me, and the smaller-die 5700 XT having the same IPC as the 6700 XT at the same frequency only confirms that.

    AMD claims better perf/watt, so it was probably worth it in the end, though I have a hard time believing that the same config as the 5700 XT with faster GDDR6 would have scaled any worse.

    It would be interesting to compare a PS5 at similar frequencies against something like a 6700 in different games to figure out whether the Infinity Cache was worth the additional area.

    And that additional area translates into higher-cost products - 12 GB of memory looks like overkill for the mid-range segment, and 33% more die area in comparison with Navi 10 isn't free either.
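    As a quick sanity check on that 33% figure (assuming the commonly reported die sizes of ~251 mm² for Navi 10 and ~335 mm² for Navi 22): 335 / 251 ≈ 1.33, i.e. roughly a third more area for the same 40-CU configuration.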
     
    DavidGraham, DegustatoR and sonen like this.
  4. troyan

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    344
    Likes Received:
    666
    You know that GPUs scale with units, too? Like, for example, doubled FP32 throughput?
     
  5. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    Ah yes, "double" FP32.
    Let's do the count regfile operand sources dance.
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,214
    Likes Received:
    1,794
    Location:
    New York
    It’s a little annoying that AMD doesn’t know how to show off its own products. I’m sure IC is really good for certain things - maybe spatially and temporally coherent data access patterns, like all the post-processing that’s done today.
     
    Lightman likes this.
  7. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    It shines whenever there are lots of read-modify-write operations, such as particle blending - it simply destroys everything in 3DMark Fire Strike, for example.
    The funny part is that most games don't use blending that much, for obvious reasons - they have to look great on consoles with way lower bandwidth shared between both CPU and GPU.
    Post-processing today is mostly done in compute shaders and highly optimized by hand via the on-chip shared memory available in CS (and GA102 has more shared memory per chip than Navi 21), so I doubt the large L3 would benefit it a lot (even the way smaller L2 is almost never a bottleneck here).
    Neither do I see benefits in games where PP is done in pixel shaders (all vanilla UE4 games); in fact, I see the opposite - the 3080 is faster in such games, since it has way more L1 cache and pixel shaders are sensitive to it.
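    For illustration, a minimal CUDA sketch of that shared-memory pattern (blur1d is a made-up kernel, assuming a 256-thread block; HLSL's groupshared plays the same role as __shared__ here):

    Code:
        // 1D blur that stages a tile in on-chip shared memory, the same trick
        // compute-shader post-processing uses: the neighbour taps hit SRAM
        // instead of going back out to L1/L2/DRAM.
        __global__ void blur1d(const float* in, float* out, int n)
        {
            __shared__ float tile[256 + 2];                 // block tile plus 1-texel halo
            int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
            int l = threadIdx.x + 1;                        // local index, shifted past halo

            // Stage the tile, clamping loads at the image borders.
            tile[l] = in[min(g, n - 1)];
            if (threadIdx.x == 0)              tile[0]     = in[max(g - 1, 0)];
            if (threadIdx.x == blockDim.x - 1) tile[l + 1] = in[min(g + 1, n - 1)];
            __syncthreads();

            // The three taps below read shared memory, not the cache hierarchy.
            if (g < n)
                out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
        }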
     
  8. xEx

    xEx
    Veteran Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    542
    The price is set by the buyer; companies will always try to sell their product at the highest price most of the buyers will pay for it.

    Tbh I'm surprised AMD and Nvidia haven't raised their MSRPs these days - they have to sell a card for $480 while watching everyone in the middle resell that card for $1200.
     
  9. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    "Fast" has two attributes - latency and bandwidth. CPUs are extremely latency-sensitive and in fact use caches more for latency reduction than for bandwidth amplification (although maybe AVX has been rebalancing that equation somewhat). GPUs on the other hand are bandwidth machines and can shrug off latency to a large degree (except for some specific latency-sensitive operations as @OlegSH pointed out). And so they use caches primarily for bandwidth enhancement. You cannot buy latency, but you can always buy more bandwidth, which makes the tradeoff for cache on GPUs a different ballgame than for CPUs. Navi 2x certainly made some bold choices in that tradeoff space, but the benefits aren't as slam-dunk as, for example, the hilariously awesome Zen L3$.
     
  10. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    Yep, there is all kinds of prefetching and speculation hardware in Zen that helps get the necessary data into the large L3 in advance, decreasing latency rather than making it worse.
    Those smarts are what make this cache relevant and useful, not just the sheer size.
    In Imagination's graphics, there is a whole pipeline built around tiling optimizations and on-chip SRAM; that architecture can get by with way smaller caches and capture the same benefits.
    Besides caches and tiled pipelines, consoles have exploited EDRAM and SRAM to pin certain resources in fast memory.
    There are also both a large L2 and cache-control knobs in the Ampere A100 GPU, so one can control data residency and pin resources in these caches manually.
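    For what it's worth, those A100 knobs are exposed in CUDA 11 as an L2 access-policy window. A minimal host-side sketch (buf, bytes and stream are assumed to already exist):

    Code:
        // Carve out a persisting region of L2, then hint that accesses to
        // [buf, buf + bytes) through this stream should stay resident in it.
        cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = buf;       // address range to pin
        attr.accessPolicyWindow.num_bytes = bytes;
        attr.accessPolicyWindow.hitRatio  = 1.0f;      // treat the whole window as persisting
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);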
     
    BRiT, PSman1700, DavidGraham and 2 others like this.
  11. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    Which leads into the following question -- does the RDNA2 L3 include any hooks for programmer-controlled data pinning, or is it purely transparent? FWIW I find cache data pinning a little awkward because it breaks abstraction. Scratchpads are a different beast because they are meant to be explicitly managed, but caches are supposed to be programmer-transparent.
     
  12. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    Well, data pinning is likely a requirement for good perf with producer-consumer pipelines in compute or machine learning. The graphics pipeline is itself a producer-consumer pipeline, and all the data goes through L2 at least, but I don't know how data residency is controlled there; it's certainly not exposed in the APIs.
     
  13. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    Ugh, should we tell him?
     
  14. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    If you have something useful to contribute maybe it’s better to do it directly instead of trying to create a grandiose mystique about yourself first.
     
    trinibwoy, nutball, Qesa and 4 others like this.
  15. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    That's the best part.
    Either way, Zen LLCs are all exclusive chunks sized per 4- or 8-core CCX, aka you get GORE in each and every actual LLC-bound workload.
    Everything has its tradeoffs and all.
     
  16. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    382
    Likes Received:
    345
    The Linux driver indicated that each entry in the GPU page table can be individually marked as "LLC No Alloc". The RDNA 2 ISA documentation indicates that image-resource (texture) descriptors use 2 bits to specify the LLC allocation policy, which apparently supports either a resource-wide override value or using the LLC No Alloc setting in the page table entry.
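    For illustration only, a hypothetical decode of that scheme; the names and encodings below are guesses, not from AMD's documentation:

    Code:
        enum llc_policy {
            LLC_FOLLOW_PTE    = 0,  // defer to the page table entry's "LLC No Alloc" bit
            LLC_FORCE_ALLOC   = 1,  // resource-wide override: allocate in the LLC
            LLC_FORCE_NOALLOC = 2,  // resource-wide override: bypass the LLC
        };

        static bool llc_allocates(enum llc_policy pol, bool pte_noalloc)
        {
            if (pol == LLC_FOLLOW_PTE)
                return !pte_noalloc;        // per-page setting wins
            return pol == LLC_FORCE_ALLOC;  // descriptor-wide override wins
        }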
     
    ToTTenTranz and Lightman like this.
  17. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,144
    Likes Received:
    7,108
    In the PCGamer interview, Scott Herkelman said that devs are looking into further optimizations to be made around Infinity Cache, and that they're getting (I think he said) "very interesting results".

    From there I assumed it could be directly accessed by game developers (I guess through proprietary extensions). He specifically said game developers, not driver developers.

    Over time, it'll be interesting to see whether IC implementations gain more performance like AMD's Radeon boss suggests, or whether it'll follow the AMD doom-and-glooming (growing datasets) from the usual and predictable suspects.
     
    Lightman likes this.
  18. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,242
    Likes Received:
    1,675
    Location:
    msk.ru/spb.ru
    You can optimize for cache size by managing your dataset - making it fit into the cache. If that wasn't clear to the usual and predictable suspects.
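    In miniature that's just classic blocking; a sketch under the assumption that BLOCK stands in for whatever footprint fits the LLC (up to 128 MB on Navi 21):

    Code:
        #include <cstddef>

        #define BLOCK (1u << 20)   // illustrative: 1M floats (4 MB) per block

        // Process a large array in cache-sized blocks: after the first pass
        // over a block, the remaining passes hit the cache instead of
        // streaming the whole dataset from DRAM every time.
        void process(float* data, size_t n, int passes)
        {
            for (size_t b = 0; b < n; b += BLOCK) {
                size_t end = (b + BLOCK < n) ? b + BLOCK : n;
                for (int p = 0; p < passes; ++p)
                    for (size_t i = b; i < end; ++i)
                        data[i] = data[i] * 0.5f + 1.0f;
            }
        }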
     
    DavidGraham and PSman1700 like this.
  19. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,533
    Likes Received:
    492
    Location:
    Varna, Bulgaria
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,556
    Likes Received:
    4,727
    Location:
    Well within 3d
    I'm curious what is being measured for some of the graphs.
    The low end at ~16 KB would be where you'd expect the CU-level caches to hit.
    The L1 texturing latency for GCN was stated in an optimization presentation to be ~114 cycles, which for a 7950 wouldn't be ~40 ns.
    The ~27 ns for RDNA2 seems like it would be too low for standard clocks, although there was discussion of potential latency savings in RDNA when fetching non-filtered data.
    The L1 scalar cache could potentially have a lower cycle count, although in this instance it would have fewer ns.

    Also, the relative flatness of the latency curve in the 16 KB-32 KB range suggests there may be some other confounding factor, like some kind of prefetching or access-pattern optimization, since I don't think the CU-level caches go beyond 16 KB before a more substantial hit occurs.
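    For context, curves like these typically come from a pointer-chase sweep; a rough sketch of the device side (probe is a hypothetical kernel, launched with a single thread; clock64() granularity and any prefetch cleverness can flatten or shift the steps, which is the confound suggested above):

    Code:
        // Chase a random-permutation ring with footprint S and report average
        // cycles per hop. Sweeping S past each cache level's capacity steps
        // the measured latency up.
        __global__ void probe(const int* next, int steps, long long* cycles, int* sink)
        {
            int i = 0;
            long long t0 = clock64();
            for (int s = 0; s < steps; ++s)
                i = next[i];                     // dependent chain: one hop at a time
            *cycles = (clock64() - t0) / steps;
            *sink = i;                           // defeat dead-code elimination
        }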
     