AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

So how do you improve rasterization with 2 kinds of rasterizer when both do the same thing?

I only see an advantage for big polygons over 4x4 pixels in size.

Edit: Also, it is strange that each array has its own rasterizer (scan converter) if you look into the Linux driver:
num_se: 4 - you have 4 shader engines
num_sh_per_se: 2 - you have 2 shader arrays per shader engine
num_sc_per_sh: 1 - and each shader array has 1 scan converter of its own?
So for scan converters you get: num_se x num_sh_per_se x num_sc_per_sh = 4x2x1 = 8

But if I follow the claim that we have 2 rasterizers per shader engine, it should look like:
num_se: 4
num_sc_per_se: 2
I asked a few times earlier in the thread, but I didn't get clarification. Recall that there are 4 Packers per Scan Converter, so Navi21 has 32 Packers, and 8 Packers per Raster Unit (2 Scan Converters). Each Packer (up to 4 Packers) dispatches optimised fragments to each Shader Array, arranged as 1x2, 2x1 or 2x2 fragment groups, as discussed below for VRS (my speculation). The efficiency gains come from these packed fragments.
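Spelling out the counts, a minimal sketch using the gc_info parameters quoted above plus my packer assumptions (the packer figures are my speculation, not anything AMD has confirmed):

```python
# Derive front-end unit counts from the amdgpu-style parameters quoted above.
# The packer figures follow my own reading (4 packers per scan converter,
# 2 scan converters per raster unit), not anything confirmed by AMD.

num_se = 4           # shader engines
num_sh_per_se = 2    # shader arrays per shader engine
num_sc_per_sh = 1    # scan converters per shader array

scan_converters = num_se * num_sh_per_se * num_sc_per_sh       # 4 * 2 * 1 = 8

packers_per_sc = 4                                             # assumption
packers = scan_converters * packers_per_sc                     # 8 * 4 = 32

sc_per_raster_unit = 2                                         # assumption
packers_per_raster_unit = packers_per_sc * sc_per_raster_unit  # 4 * 2 = 8

print(scan_converters, packers, packers_per_raster_unit)       # 8 32 8
```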

 
Perhaps you were joking but this is far from a stupid idea. With good post-processing AA, and perhaps clever texturing tricks, it might produce very interesting results. For instance, you might render and especially sample textures at 4K, but path-trace at 320p, and just interpolate the path-tracing results for pixels where you don't path-trace.
I like the idea of doing RT at a lower res than raster, so only the RT result needs upscaling.
On the other hand, if I had the choice of 1 ray per pixel at 1080p or 4 rpp at 540p, I would choose the former, because I get more spatial information for the same number of rays.
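For reference, the ray budgets in rough numbers (just ordinary 16:9 pixel counts, nothing vendor-specific):

```python
# Back-of-the-envelope ray budgets for the trade-off discussed above.
def rays(width, height, rays_per_pixel):
    return width * height * rays_per_pixel

print(rays(1920, 1080, 1))   # 1080p @ 1 rpp -> 2,073,600 rays
print(rays(960, 540, 4))     # 540p  @ 4 rpp -> 2,073,600 rays (same budget)
print(rays(480, 270, 4))     # 270p  @ 4 rpp ->   518,400 rays (1/4 the budget)
```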

I think RT will be very selective for the next 5 years.
So now we know what AMD meant with 'Select Lighting Effects' on that early RDNA2 slide :D
 
I have no doubt that post PS5 and Xbox Series the situation will change, with RT becoming more important, but I think RT will be very selective for the next 5 years.

https://gfxspeak.com/2020/10/09/unreal-engine-almost/

About the future of graphics, I find this approach more interesting. It's about how, instead of calculating geometry and colors, we could simply "imagine" them with neural networks. Maybe it's a topic worth having its own thread.

It's in Spanish but it has subs.

 
Did not find a post on AMD's tech demo released on YouTube on Nov. 19th, so here it is.
The demo focuses mainly on ray tracing, especially soft shadows and reflections. It also showcases technology from the Microsoft DirectX 12 Ultimate API, such as mesh shading and variable rate shading.

DirectX Ray Tracing (DXR) is now hardware-accelerated on AMD RDNA2 GPUs. In the demo, we can see ray tracing supported by FidelityFX compute effects, accompanied by Stochastic Screen Space Reflections (SSSR). The FidelityFX Denoiser is also used to improve visual quality while reducing the computing power required to generate the scene.

For more realistic shading of geometry, AMD uses FidelityFX Ambient Occlusion, and to improve performance FidelityFX Variable Shading is enabled. Unfortunately, the demo is not available for download, so we cannot test it ourselves.
...
The demo was prerecorded on an AMD Radeon RX 6900 XT GPU (the as-yet-unreleased one) paired with a Ryzen 9 3900 12-core CPU.

AMD releases RDNA2 technology demo as a 1080p video - VideoCardz.com
 
Did not find a post on AMD's tech demo released on YouTube on Nov. 19th, so here it is.


AMD releases RDNA2 technology demo as a 1080p video - VideoCardz.com

Yea, I saw it. It didn't really blow me away. The tech is fine and produces respectable results, although the RT is particularly noisy in many shots (which they also try to cover up with DOF), but IMO it's ultimately let down by lackluster art and presentation.

These new cards would have been the perfect time to reintroduce Ruby, with crazy good shadows and reflections. I always liked the Ruby demos.

But the GOAT Radeon tech demo for me personally is.... Pipe Dream


God I love it!
 
Agreed, but the theme was pretty good! :LOL:
Yea I have no problem with the theme or anything.. just kinda the scenario they decided on. A robot Ninja runs around a hangar while a robot drone searches for him... except it's not even that exciting and nothing happens.. lol.

They should have scaled it back. Make that single character far more detailed, have a denser, slightly smaller environment, really zoom in on the detail on him at times, have some parts of him reflecting the environment, have him do some cool animations and then fight an enemy, at which point the drone comes out, casts beautiful shadows of the two ninja robots fighting and reflects the environment... and then close with him killing the other robot and the drone chasing him off.

Also.. 1080p... no sir. 1440p AT LEAST on YouTube, regardless of whether the native resolution of the demo is 1080p.

I dunno.. lol :D
 
Yea, I saw it. It didn't really blow me away. The tech is fine and produces respectable results, although the RT is particularly noisy in many shots (which they also try to cover up with DOF), but IMO it's ultimately let down by lackluster art and presentation.

These new cards would have been the perfect time to reintroduce Ruby, with crazy good shadows and reflections. I always liked the Ruby demos.

But the GOAT Radeon tech demo for me personally is.... Pipe Dream


God I love it!

I still have Pipe Dream stored at multiple locations with redundancy so that I'll never risk losing it. It's still the most memorable demo I've ever experienced.

Regards,
SB
 
If you happened to watch Scott's interview with HotHardware, he said the goal of IC is not just performance; it was a tradeoff between die area, performance and power.
He specifically said they would have needed a wider bus to get the same bandwidth for more performance, and the power needed by a wider bus and more memory chips means a higher TBP. He also added that the memory controllers + PHYs would occupy a significant footprint on the chip.
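One crude way to think about that trade-off, with purely illustrative numbers (the hit rate and cache bandwidth below are my assumptions, not AMD figures):

```python
# Rough effective-bandwidth sketch for the cache-vs-wider-bus trade-off.
# All figures below are illustrative assumptions, not AMD specifications.

bw_per_16bit_channel = 32              # GB/s per 16-bit GDDR6 channel at 16 Gbps
bus_256 = 16 * bw_per_16bit_channel    # 256-bit bus -> 512 GB/s
bus_384 = 24 * bw_per_16bit_channel    # 384-bit bus -> 768 GB/s (more PHYs, more power)

cache_bw = 1900                        # GB/s, assumed on-die cache bandwidth
hit_rate = 0.6                         # assumed Infinity Cache hit rate at 4K

effective_bw = hit_rate * cache_bw + (1 - hit_rate) * bus_256
print(bus_256, bus_384, effective_bw)  # 512 768 ~1345 GB/s
```

Very hand-wavy, but it shows why a big cache plus a narrow bus can beat simply adding PHYs on the bandwidth side while saving board power.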

The other point is that Infinity Cache also appears to be a forward-looking feature, which could also be used on CPUs and APUs. And since SRAM traditionally scales better than analog, it's cheaper in the future than adding more PHYs. SRAM scaling seems to be slowing down, however, with TSMC only promising 1.35x scaling for 5nm and 1.2x for 3nm. For the Apple A14, SRAM scaling was actually found to be only 1.19x, significantly lower than TSMC's claimed 1.35x. Whether this is due to the process not delivering the advertised gains, or to design decisions for power/performance, we won't fully know until we can analyse more 5nm chips. But it would still be better than analog.
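As a crude illustration of what those factors mean for the area of a fixed-capacity cache array (taking the quoted density claims at face value):

```python
# Crude projection of the area of a fixed-capacity SRAM array across nodes,
# using only the density-scaling figures quoted above.

area_n7 = 1.0                        # normalised array area on N7
area_n5_claimed = area_n7 / 1.35     # ~0.74 with TSMC's claimed 1.35x
area_n5_a14 = area_n7 / 1.19         # ~0.84 with the 1.19x measured on the A14
area_n3 = area_n5_claimed / 1.2      # ~0.62 with the promised 1.2x N5 -> N3 step

print(area_n5_claimed, area_n5_a14, area_n3)
```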
A downclocked N22/N23 in mobile form would be very efficient, looking at the chart below.
(attachment 4965: efficiency chart)
And according to a banned member, Navi 2x is getting a lot of interest from laptop OEMs for its efficiency, which is what Scott also mentioned.
They asked some questions, which he dodged, regarding a cut-down IC for low-power form factors, but it seems obvious.


Yes, I had posted this chart a few pages back and commented on the likely position of AMD's mobile gaming platforms next year. Cezanne will be able to make use of the updated 7nm process and the increased power efficiency of Zen 3 to further extend AMD's CPU lead over Comet Lake, though Tiger Lake H could bring parity. I still predict Cezanne + RDNA2 will be the best-selling mobile gaming platform in 2021. This is a significant market btw, and in Nvidia's recent earnings call they specifically mentioned that they've had 11 successive quarters of double-digit growth in mobile.
One additional point from that call was that Pro variants will come, and from Linux commits they will carry a 2048-bit HBM interface.

But the die shot of N21 at least certainly does not seem to have any HBM PHYs, or have I missed something?
That chart is almost as scummy as the Nvidia equivalent, both making 2x performance-per-watt claims by cherry-picking places on the curves that don't relate to the best-performing cards being compared.

How is it scummy? NV was comparing different power and performance levels on different processes, but AMD's comparison is at the SAME clockspeeds and power, iso-process. So an RDNA2 CU at the same clock will consume ~50% of the power of an RDNA1 CU. Unlike desktops, where you can push power and thermals, for mobile GPUs this is very relevant, as you are power limited.
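Spelled out, the arithmetic is just this (a generic iso-performance conversion, not AMD's endnote methodology):

```python
# If perf/W improves by a factor k at the same performance point (same clocks,
# same work per clock), the power needed for that performance drops to 1/k.
def relative_power(perf_per_watt_gain):
    return 1.0 / perf_per_watt_gain

print(relative_power(2.0))   # 2x perf/W at iso-performance -> 0.50x power
print(relative_power(1.5))   # 1.5x perf/W                  -> ~0.67x power
```

That is why the iso-process, iso-clock framing matters so much for power-limited mobile parts.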
 
The other point is that Infinity Cache also appears to be a forward-looking feature, which could also be used on CPUs and APUs. And since SRAM traditionally scales better than analog, it's cheaper in the future than adding more PHYs. SRAM scaling seems to be slowing down, however, with TSMC only promising 1.35x scaling for 5nm and 1.2x for 3nm.
TSMC should probably consider researching the various EDRAM technologies to resolve the memory scaling issues, particularly for large cache arrays.
IBM and Intel already employ different integration methods, though these are very tightly related to their particular manufacturing process.
 
I have seen something relatively recent that was similar to Pipe Dream, maybe it was RT... I don't think the newer version was from AMD though.
Maybe it was an open-source/fan-made remake?

Edit: 4K Pipe Dream on YouTube

Just thinking about it: instead of that kind of lackluster robot thing they recently released, it would have been cool if they'd done a 4K (or even 1440p) RT remake of Pipe Dream running in real time. There are lots of opportunities there to showcase RT effects: lighting, shadows, reflections (of moving objects), etc. Maybe a dynamic light or two moving around while the scene is playing out. I just feel like it would have been more impressive than that robot demo.

Regards,
SB
 
Techpowerup did have the slides posted here, they cover some of the memory latency aspects - https://www.techpowerup.com/review/amd-radeon-rx-6800-xt/2.html
I'm interested in seeing the endnotes for some of the slides, like the memory latency one. They might give some of the base values that go into the percentages. I'm not sure whether the Infinity Cache's latency improvement is a percentage of the total memory latency (L0 + L1 + L2 + memory) or relative to the latency of the DRAM access alone.
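A toy example of why the baseline matters (every number here is made up, not AMD's):

```python
# Toy numbers only (not AMD's): why the baseline of a "-X%" latency claim matters.
l0, l1, l2, dram = 30, 60, 100, 300        # hypothetical cycles added per level
total_miss_path = l0 + l1 + l2 + dram      # 490 cycles if every level misses

improvement = 0.34                         # a hypothetical "34% lower latency" claim

# Interpretation 1: percentage of the whole miss path
new_total = total_miss_path * (1 - improvement)          # ~323 cycles

# Interpretation 2: percentage of the DRAM access only
new_total_alt = l0 + l1 + l2 + dram * (1 - improvement)  # ~388 cycles

print(new_total, new_total_alt)            # noticeably different outcomes
```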

About this point, I'm thinking that retention or discarding of the BVH structure by the cache may be a driver and application matter rather than being hardwired. This would also explain part of the need for specific optimization for AMD's ray tracing implementation.
Driver commits indicate it can happen at page granularity, and there are also flags for specific functionality types. It's not clear BVH fits in that, unless it hides under the umbrella of some of the metadata related to DCC or HiZ.
Some of those would seem to be better kept in-cache, since DCC in particular can suffer from thrashing of its metadata cache, injecting a level of latency sensitivity that normal accesses wouldn't have.

The vanilla 6800 has one less SE, so one less rasterizer. But it has higher clocks, and it is not known if the pre-cull numbers are the same.
Is there a source for this, or tests that can tell the difference between an SE being inactivated versus an equivalent number of shader arrays disabled across the chip?

I thought the general consensus was that AMD disabled one entire SE.
AMD's Sienna Cichlid code introduced a function to track the disabling of formerly per-SE resources like ROPs at shader-array level. This might lead to similar outcomes.

When you bring evidence to back up your assumptions, I guess you'll have an argument.
We do have some comparison in terms of AMD's patent for BVH acceleration versus Nvidia's. There are some potential points of interest, such as the round trip node traversal must make between the RT block and the SIMD, and the implicit granularity of execution being SIMD-width.
There are some code commits that give instruction formats for BVH operations that look to be in line with the patent.
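A rough sketch of that division of labour as I read the patent (not AMD's ISA; the node layout and function names below are mine):

```python
# Rough sketch of the patent's split, as I read it: the traversal loop and
# stack run as ordinary shader code on the SIMDs, and each visited node is
# handed to a fixed-function intersection test. hw_box_test() is a plain
# software slab test standing in for that hardware instruction.

def hw_box_test(box_min, box_max, origin, inv_dir):
    """Stand-in for the hardware ray/box test (inv_dir = 1/direction,
    assumed to have no zero components)."""
    t1 = [(box_min[i] - origin[i]) * inv_dir[i] for i in range(3)]
    t2 = [(box_max[i] - origin[i]) * inv_dir[i] for i in range(3)]
    t_near = max(min(a, b) for a, b in zip(t1, t2))
    t_far = min(max(a, b) for a, b in zip(t1, t2))
    return t_near <= t_far and t_far >= 0.0

def trace(bvh, root, origin, inv_dir):
    """Traversal as the shader would run it: stack kept by the shader, one
    round trip to the 'RT block' per node visited."""
    stack, hits = [root], []
    while stack:
        node = bvh[stack.pop()]
        if not hw_box_test(node["min"], node["max"], origin, inv_dir):
            continue
        if "prim" in node:                  # leaf: in hardware this would be the
            hits.append(node["prim"])       # triangle test; here we just record it
        else:
            stack.extend(node["children"])  # push the children and keep looping
    return hits
```

The point being: each box/triangle test is offloaded, but the loop, the stack and the scheduling all stay on the SIMDs, which is where the round-trip concern comes from.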

Or there is a problem in an RBE, or in the scheduling HW of that SE.
RBEs are something that can be disabled at a different granularity than SEs, though.

I asked a few times earlier in the thread, but I didn't get clarification. Recall that there are 4 Packers per Scan Converter, so Navi21 has 32 Packers, and 8 Packers per Raster Unit (2 Scan Converters). Each Packer (up to 4 Packers) dispatches optimised fragments to each Shader Array, arranged as 1x2, 2x1 or 2x2 fragment groups, as discussed below for VRS (my speculation). The efficiency gains come from these packed fragments.
The packers I am thinking of are related to primitive order processing, which is related to rasterizer ordered views rather than how primitives are translated to wavefronts.

TSMC should probably consider researching the various EDRAM technologies to resolve the memory scaling issues, particularly for large cache arrays.
IBM and Intel already employ different integration methods, though these are very tightly related to their particular manufacturing process.
Perhaps as scaling falters, the pressure will resume to go back to EDRAM despite the cost and complexity penalties.
Neither IBM nor Intel has that technique available at smaller nodes. IBM's next POWER chip dropped the capability since IBM sold off that fab to GlobalFoundries, which then gave up scaling to smaller nodes, and POWER was the standout for having EDRAM.
 
Just thinking about it: instead of that kind of lackluster robot thing they recently released, it would have been cool if they'd done a 4K (or even 1440p) RT remake of Pipe Dream running in real time. There are lots of opportunities there to showcase RT effects: lighting, shadows, reflections (of moving objects), etc. Maybe a dynamic light or two moving around while the scene is playing out. I just feel like it would have been more impressive than that robot demo.

Regards,
SB

It may have invited unwelcome comparisons to Nvidia's Marbles demo.
 
The Lumen system is very different from RTXGI, and that is not what they have in mind, at least probably not for this console generation. Lumen is probably easier to use for lighting artists because there are no probes at all.

https://www.eurogamer.net/articles/...eal-engine-5-playstation-5-tech-demo-analysis

Unreal Engine 4 uses RT, but UE5 does not; maybe in the future they will use RT for specular reflections. The engine was designed around the PS5 and Xbox Series X|S.

Demon's Souls uses a froxel-based GI system built on probes.


From the article, part of the implementation is tracing rays against voxels. Could the ray tracing box testers on these new GPUs possibly be used for that? Voxels are boxes too...

"Lumen uses ray tracing to solve indirect lighting, but not triangle ray tracing," explains Daniel Wright, technical director of graphics at Epic. "Lumen traces rays against a scene representation consisting of signed distance fields, voxels and height fields. As a result, it requires no special ray tracing hardware."
To achieve fully dynamic real-time GI, Lumen has a specific hierarchy. "Lumen uses a combination of different techniques to efficiently trace rays," continues Wright. "Screen-space traces handle tiny details, mesh signed distance field traces handle medium-scale light transfer and voxel traces handle large scale light transfer."
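Reading that quote literally, the hierarchy sounds like a chain of fallbacks, roughly like this (a sketch of my reading, not Epic's code; the tracers and distance thresholds are placeholders):

```python
# Sketch of the fallback chain described in the quote (my reading, not Epic's
# code). Each tracer is a placeholder returning a hit record or None; the
# point is the ordering: screen space -> mesh distance fields -> global voxels.

def trace_screen_space(ray):  return None   # tiny details, depth-buffer march
def trace_mesh_sdf(ray):      return None   # medium-scale, per-mesh distance fields
def trace_voxels(ray):        return None   # large-scale light transfer

def lumen_style_trace(ray, screen_max_dist=0.05, sdf_max_dist=2.0):
    # The distance thresholds are made-up knobs, just to show where each
    # technique would hand over to the next.
    hit = trace_screen_space(ray)
    if hit is not None and hit.distance < screen_max_dist:
        return hit
    hit = trace_mesh_sdf(ray)
    if hit is not None and hit.distance < sdf_max_dist:
        return hit
    return trace_voxels(ray)
```

If the box testers can be pointed at voxel nodes at all, it would presumably only help the last stage of that chain.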

 
The packers I am thinking of are related to primitive order processing, which is related to rasterizer ordered views rather than how primitives are translated to wavefronts.
The driver leak has changes to SIMD waves, which, combined with the slide about RB+ and the Packers connected to Scan Converters in the driver leak as well, suggests some optimisations after scan conversion and dispatch to the Shader Arrays. The number of Packers per Scan Converter doubled from RDNA1, but triangle-per-clock rasterisation remains the same at 4 per clock.
 
I'm interested in seeing the endnotes for some of the slides, like the memory latency one. They might give some of the base values that go into the percentages. I'm not sure whether the Infinity Cache's latency improvement is a percentage of the total memory latency (L0 + L1 + L2 + memory) or relative to the latency of the DRAM access alone.
There you go
(attachment: slide endnotes image)
 