AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

Anandtech swears the end years in these slides are always inclusive (or so AMD told them, at least), though the same slides that had Zen 3 and RDNA2 at the end also showed 2021 at the right of the x-axis.
Those roadmaps are a surprisingly clever marketing product - they allow for a launch window two years long.

Zen 3 launched in 2020, although it was only a very small launch compared to the 2021 schedule. In 2021 AMD is launching the following Zen 3 products: Milan server processors, the non-X and remaining X desktop processors, Cezanne APUs, Threadripper HEDTs, and *maybe* the Warhol DDR5 refresh. A similar pattern holds for RDNA2. This year AMD should launch only the 6800/6900 series; the 6700 and lower dGPUs, the PRO line, and APUs are clearly 2021.
 
Probably a second team is working on RDNA3. They could have started development as soon as possible with Samsung's money while RDNA2 was being developed with Sony's money.
Going by Korean rumors, Samsung should be producing a SoC with RDNA3 by the end of next year, so there is probably a second team.
The collaboration with the CPU teams should also help with physical design and process-related work.
Samsung is using RDNA2, not RDNA3 (of course this could change by release, but that's the official story so far).
RDNA2 wasn't developed with "Sony's money" any more than it was developed with "Microsoft's money"; it would have been developed regardless of either console manufacturer, and most of what Sony and MS pay is only now starting to roll in as they actually buy the chips from AMD.
Considering that each architecture is really a multi-year project, of course they have overlapping development by different teams.
 
Both could be true, though. Hawaii's wide and slow memory controller may indeed be more power efficient than Tahiti's. Truth be told, AMD isn't using super-fast GDDR6 this time around either.
Not to mention that the goalposts might have shifted in the meantime: it's G6 now, not G5. But regarding the 6800: AMD is using the fastest GDDR6 available.
 
Are there results for Navi 21 with memory overclocking while the core is not overclocked?

I can run a few tests, just let me know the settings and benchmark you want to see scaling in.

My memory starts artifacting at 2150MHz with fast timings; I haven't spent much time finding the real limits, as it runs fine at 2100MHz and I was spending my free time on tweaking the core first. Undervolting boosted my Port Royal score from 9760 to 9970, give or take a few points. The core was set to 2727MHz in both cases and all other parameters were kept equal.
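For a rough sense of what that 2000 -> 2100MHz step means in raw bandwidth terms, here is a minimal sketch; it assumes Navi 21's 256-bit bus and that the memory clock reported by the tools (2000MHz stock) maps to GDDR6's 16 Gbps effective data rate - both of those are my assumptions, not something stated in the post.

```python
# Rough GDDR6 bandwidth estimate for Navi 21 memory overclocking.
# Assumptions (not from the post): 256-bit bus, and the reported memory clock
# (e.g. 2000 MHz) corresponds to an 8x effective data rate (16 Gbps).

BUS_WIDTH_BITS = 256          # Navi 21 (RX 6800/6900 series)
EFFECTIVE_RATE_MULT = 8       # reported MHz -> effective Gbps (per 1000 MHz)

def bandwidth_gbs(reported_mhz: float) -> float:
    """Peak theoretical bandwidth in GB/s for a given reported memory clock."""
    effective_gbps = reported_mhz * EFFECTIVE_RATE_MULT / 1000
    return BUS_WIDTH_BITS / 8 * effective_gbps

for mhz in (2000, 2100, 2150):
    print(f"{mhz} MHz -> {bandwidth_gbs(mhz):.1f} GB/s "
          f"({bandwidth_gbs(mhz) / bandwidth_gbs(2000) - 1:+.1%} vs stock)")
# 2000 MHz -> 512.0 GB/s (+0.0% vs stock)
# 2100 MHz -> 537.6 GB/s (+5.0% vs stock)
# 2150 MHz -> 550.4 GB/s (+7.5% vs stock)
```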
 
Looks like SAM is up and running on Intel + RDNA 2.

https://wccftech.com/amd-smart-acce...ble-performance-gains-with-radeon-rx-6800-xt/

Also, there is something about Zen 2 and earlier possibly not supporting SAM due to a hardware limitation.

https://www.techpowerup.com/275565/...mitation-intel-chips-since-haswell-support-it

Apparently the PCI-Express root complex of Ryzen 5000 "Vermeer" processors introduces a PCIe physical-layer feature called full-rate _pdep_u32/64, which is required for resizable BAR to work.

It gets more interesting: Intel processors have supported this feature since the company's 4th Gen Core "Haswell," which introduced it with its 20-lane PCI-Express Gen 3.0 root complex.

Quite embarrassing if true.
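As a quick sanity check on one's own system, something like the sketch below can grep lspci output for the Resizable BAR capability. This assumes a Linux box with a reasonably recent pciutils that decodes the capability as "Physical Resizable BAR" (older versions label it differently), so treat it as a rough illustration rather than an authoritative SAM check.

```python
# Rough check for PCIe Resizable BAR support as exposed by lspci on Linux.
# Assumes pciutils decodes extended capability 0x15 as "Physical Resizable BAR";
# run as root, otherwise capability listings may be hidden for some devices.
import subprocess

def devices_with_resizable_bar() -> list[str]:
    out = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout
    devices = []
    for block in out.split("\n\n"):                 # lspci separates devices with blank lines
        if "Resizable BAR" in block:
            devices.append(block.splitlines()[0])   # first line: slot + device name
    return devices

if __name__ == "__main__":
    for dev in devices_with_resizable_bar():
        print(dev)
```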
 
Isn't it the case that the GPU connects straight to the CPU via PCIe, and the IO die is mainly for the CPU-chipset connection? Or does the IO die act as a pass-through without much logic?
 
I can run a few tests, just let me know the settings and benchmark you want to see scaling in.
It seems like it would be a waste of time doing these tests since the potential gain is so low and may not even exist.

I was curious about what might be seen with a refresh in June, say, if it had GDDR6X.
 
It seems like it would be a waste of time doing these tests since the potential gain is so low and may not even exist.

I was curious about what might be seen with a refresh in June, say, if it had GDDR6X.

My memory overclock is only 5%, and with the IC in place it's really difficult to notice its impact.
Just out of curiosity, I ran Unigine Heaven in QHD, stopped the camera, and manipulated the memory clock with the same scene on screen. There was absolutely no difference in FPS in the 3 scenes I tested, which ranged from 160 to 223FPS. I think either this engine and level of complexity is very well suited to the IC, or the lack of movement doesn't put any extra load on the MC, hence no difference.

For my 2nd test I switched to Unigine Superposition and went straight to the 4K and 8K tests. Both sets ran with my standard overclock of 2727MHz on the GPU, a custom fan profile to avoid any thermal throttling, and an undervolt to 1050mV (which in reality just shifts the voltage curve, as the chip still hits 1150mV at higher clocks; the threshold before it has to go there is simply later in the scale).

Here are results:
4-K-Superposition-Benchmark-v1-0-14857-1607180808-GPU2727-2000-FT-Air.png

4-K-Superposition-Benchmark-v1-0-14986-1607181157-GPU2727-2100-FT-Air.png

4K - the only difference is RAM: 2000MHz in the top picture, 2100MHz in the bottom

8-K-Superposition-Benchmark-v1-0-5745-1607181807-GPU2727-2000-FT-Air.png

8-K-Superposition-Benchmark-v1-0-5839-1607181439-GPU2727-2100-FT-Air.png

8K - top pic 2000MHz and bottom 2100MHz

In this test, at 4K Optimized my card ranks around 220th position in the world, among RTX 3090s that are also overclocked. On the other hand, at the 8K Optimized settings I still rank around 220th position, but there the performance only reaches RTX 2080 Ti level.

Clearly, nVidia's brute-force approach to memory bandwidth has big advantages at very high resolutions. At the same time, in this test at least, a 5% memory overclock gains less than 1% performance at 4K, whereas at 8K the gain jumps to over 1.5%.
What clouds an accurate measurement of the memory clock's influence on overall performance in my case is its influence on the GPU's dynamic clock. Simply put, the memory overclock eats into the same 300W power limit the GPU has, lowering the average GPU frequency by about 20MHz to 30MHz. I can imagine situations where having less power and frequency eaten by the memory subsystem will result in higher FPS because the GPU will clock higher. This should be most likely at lower resolutions, as the IC will have a higher hit rate, saving even more energy from the memory subsystem and using it to boost GPU clocks.
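For what it's worth, the scaling can be read straight off the scores in the screenshot filenames above; a small sketch of that arithmetic follows, with the only assumption being that the numbers in the filenames are the Superposition scores.

```python
# Memory-overclock scaling from the Superposition runs above
# (scores taken from the screenshot filenames: 2000 MHz vs 2100 MHz memory).
runs = {
    "4K Optimized": (14857, 14986),   # (2000 MHz, 2100 MHz)
    "8K Optimized": (5745, 5839),
}
mem_gain = 2100 / 2000 - 1            # +5% memory clock

for preset, (base, oc) in runs.items():
    score_gain = oc / base - 1
    print(f"{preset}: {score_gain:+.2%} score for {mem_gain:+.0%} memory clock "
          f"(scaling efficiency {score_gain / mem_gain:.0%})")
# 4K Optimized: +0.87% score for +5% memory clock (scaling efficiency 17%)
# 8K Optimized: +1.64% score for +5% memory clock (scaling efficiency 33%)
```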
 
I can imagine situations where having less power and frequency eaten by the memory subsystem will result in higher FPS because the GPU will clock higher.
Or, alternatively, higher GPU utilization is turned into higher power consumption / lower clocks. It's the same with Vega - in poorly optimized games like Assassin's Creed Odyssey or Final Fantasy XV (the one with the inside-out TWIMTBP fluffy cows) it can boost upwards of 1700MHz, while in Witcher 3 or Doom Eternal it goes down to 1640-1650MHz with the same OC settings. If you increase the PL (or if it can be raised beyond the standard 15% so that neither the package power nor TBC limitations really matter anymore), you'd probably see higher GPU clocks and higher scores with heavily OC'd VRAM compared to stock / lesser OC settings.
 
Things really must have changed. I remember vividly how AMD told everyone that Hawaii's 512-bit bus was more power efficient and more space saving than Tahiti's 384-bit one, because they could trade the relatively high clocks against a wider interface (power) and have much smaller drivers because they didn't need the clocks to go that high (area). I still like the idea of a large memory bus. It gives you more fine-grained accesses and it scales more easily to larger memory capacity.

I remember it being more about the size of the memory controllers.
They were able to fit a 512-bit bus in 440mm². Hawaii's interface took 20% less space than Tahiti's 384-bit one, earning them ~50% more bandwidth per mm².

Scaling more easily also means needing extra memory ICs. Increasing from 256-bit to 384-bit is a minimum 50% increase in your DRAM costs. GDDR6 isn't exactly cheap, and I don't even want to think what Nvidia will be paying for 2GB GDDR6X.

AMD-Hawaii-GPU-Diagram-Leaked-Shows-Four-Shader-Engines-390754-6.jpg


Edit- They also wanted to scale out their ROPs which were tied to L2/MC at the time.
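As a rough back-of-the-envelope check of the two claims above (the ~50% bandwidth-per-area figure and the DRAM-count cost argument), here is a small sketch. The 5.0 Gbps Hawaii launch speed and the 20% area saving are taken from this thread, while Tahiti's 5.5 Gbps launch speed and 32-bit-wide GDDR chips are my assumptions.

```python
# Back-of-the-envelope check of the Hawaii-vs-Tahiti interface claims above.
# Assumptions: Hawaii 512-bit @ 5.0 Gbps, Tahiti 384-bit @ 5.5 Gbps (launch
# speed, my assumption), Hawaii PHY area = 0.8x Tahiti's (the "20% less space"
# claim), and 32-bit-wide GDDR chips for the DRAM-count comparison.

def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

hawaii = bandwidth_gbs(512, 5.0)      # 320 GB/s
tahiti = bandwidth_gbs(384, 5.5)      # 264 GB/s
area_ratio = 0.8                      # Hawaii PHY area relative to Tahiti's

per_area_gain = (hawaii / tahiti) / area_ratio - 1
print(f"Bandwidth per unit area: {per_area_gain:+.0%} for Hawaii")   # ~+52%

# DRAM chip count scales directly with bus width (32 bits per chip):
for bus in (256, 384, 512):
    print(f"{bus}-bit bus -> {bus // 32} GDDR chips")
# 256-bit -> 8 chips, 384-bit -> 12 chips (+50% DRAM), 512-bit -> 16 chips
```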
 
Things really must have changed. I remember vividly how AMD told everyone that Hawaii's 512-bit bus was more power efficient and more space saving than Tahiti's 384-bit one, because they could trade the relatively high clocks against a wider interface (power) and have much smaller drivers because they didn't need the clocks to go that high (area). I still like the idea of a large memory bus. It gives you more fine-grained accesses and it scales more easily to larger memory capacity.
Not everything shrinks the same with each new process node, so that could have something to do with it. For example, wires don't shrink the same way as combinational (also called standard cell) logic.
 
Things really must have changed. I remember vividly how AMD told everyone that Hawaii's 512-bit bus was more power efficient and more space saving than Tahiti's 384-bit one, because they could trade the relatively high clocks against a wider interface (power) and have much smaller drivers because they didn't need the clocks to go that high (area). I still like the idea of a large memory bus. It gives you more fine-grained accesses and it scales more easily to larger memory capacity.

Going that wide again may require pushing memory clocks near where Hawaii's bus topped out. Initial speeds were 5.0 Gbps, with the 390X hitting 6.0.
I've seen commentary about the routing and spacing rules for the higher-speed GDDR6/GDDR6X chips being a possible limiter for bus width on the high-end. Perhaps the gap is too wide to justify reducing clocks enough to get to 512 bits?
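A simple bandwidth calculation gives a feel for that wide-and-slow versus narrow-and-fast trade-off; the 6900 XT and RTX 3090 configurations below are shipping parts, while the 512-bit entry is purely hypothetical.

```python
# Peak bandwidth for a few bus-width / data-rate combinations, to illustrate
# the trade-off discussed above. The first two configurations are shipping
# parts; the 512-bit one is a hypothetical at a reduced data rate.
configs = {
    "RX 6900 XT (256-bit, 16 Gbps GDDR6)":    (256, 16.0),
    "RTX 3090 (384-bit, 19.5 Gbps GDDR6X)":   (384, 19.5),
    "hypothetical 512-bit @ 14 Gbps GDDR6":   (512, 14.0),
}
for name, (bus_bits, gbps) in configs.items():
    print(f"{name}: {bus_bits / 8 * gbps:.0f} GB/s")
# 512 GB/s, 936 GB/s and 896 GB/s respectively
```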
 
Going that wide again may require pushing memory clocks near where Hawaii's bus topped out. Initial speeds were 5.0 Gbps, with the 390X hitting 6.0.
I've seen commentary about the routing and spacing rules for the higher-speed GDDR6/GDDR6X chips being a possible limiter for bus width on the high-end. Perhaps the gap is too wide to justify reducing clocks enough to get to 512 bits?

Could be why Nvidia stuck with 384bit and pushed speeds?

Then if a giant cache loses efficiency at 4K resolution, either from cache overflow despite being 128MB or from some pass needing a lot of access to main memory (has it been figured out which one it is?), then what's the solution to bandwidth scaling? Hoping HBM becomes cheaper doesn't seem the most likely path. And how cheap is Intel's EMIB-style interconnect supposed to make it? I've seen claims that it's supposed to be cheaper, but no numbers.
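On the overflow side of that question, a crude working-set estimate gives a feel for why 128MB gets tight at 4K; the render-target mix in the sketch below is an illustrative assumption, not any particular engine's layout or anything from AMD.

```python
# Crude working-set estimate for a deferred renderer vs the 128 MB Infinity
# Cache. The G-buffer layout (four 32-bit colour targets + 32-bit depth) is an
# illustrative assumption, not taken from any specific game or from AMD.
CACHE_MB = 128
BYTES_PER_PIXEL = 4 * 4 + 4            # four RGBA8 targets + D32 depth

for name, (w, h) in {"1440p": (2560, 1440), "4K": (3840, 2160)}.items():
    footprint_mb = w * h * BYTES_PER_PIXEL / 2**20
    print(f"{name}: ~{footprint_mb:.0f} MB of render targets "
          f"({footprint_mb / CACHE_MB:.0%} of the cache)")
# 1440p: ~70 MB (55% of the cache); 4K: ~158 MB (124% of the cache)
```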
 
It is hard to compare Intel foundry/packaging costs to others. I suppose we'll have to watch how TSMC's version of EMIB compares to their current interposer. I don't think it has shipped in anything yet?
 
Not sure about the modern GPUs (as I can't get one for weeks, sad sigh), but overclocking the memory on an R9 290 was basically a completely pointless endeavour - going from the standard 1250MHz to something like 1500MHz improved bandwidth a lot (judging from oclmembench), but the scores in benchmarks and games went up by a measly 3-5% at best. Of course, it might have been higher in 4K and modern games, but at the time it was just a matter of academic interest / hardware-benching kind of stuff.
 
Not sure about the modern GPUs (as I can't get one for weeks, sad sigh), but overclocking the memory on an R9 290 was basically a completely pointless endeavour - going from the standard 1250MHz to something like 1500MHz improved bandwidth a lot (judging from oclmembench), but the scores in benchmarks and games went up by a measly 3-5% at best. Of course, it might have been higher in 4K and modern games, but at the time it was just a matter of academic interest / hardware-benching kind of stuff.
The Hawaii GPU doesn't support framebuffer compression; that could be one of the reasons for the poor performance scaling in this case.
 