AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

Doesn't HWS supplant ACE? Or it's another module that provides more finely-controllable use of async compute?
The internal particulars may have changed over time. There are still ACEs, as HWS is more concerned with having a processor virtualize the fixed number of queues the ACEs are processing so that an arbitrary number of queues can be swapped in and out.
Whether HWS is distinct hardware isn't clear to me.
ACEs, at least since Sea Islands, come in groups of 4 custom processors that share some resources, like the microcode store they use.
When HWS was introduced, it seemed to come at the expense of ACE resources, with some being re-assigned to monitoring and controlling the queues handled by the other ACEs. This may have been why Fury's marketing went from 8 ACEs to 4 ACEs + HWS.
There are other nuances, like which pipes have dispatch capability, that might distinguish ACEs.
Whether more modern GPUs would give HWS hardware ACE capabilities once it became standard isn't clear. I think the HWS has since been described as a dual-threaded processor/block, and so this may have become more distinct from whatever it is the ACEs are.

I believe there were code changes related to the graphics command processor (a cluster of at least 3 cores) that hint at a possible extra processor with some functions similar to those of HWS, but for graphics, in that it can swap and direct what the hardware graphics queues are linked to.
 

Yep - each MEC block has 4 processor threads. If I remember correctly each thread can either run "pipe" microcode (multiplexing between 8 queues on the pipe and managing/submitting work from those queues) or "HWS" microcode (a layer above the pipes/queues which dynamically maps queues from a larger set onto the available HW queues). HWS also multiplexes an unlimited number of processes onto a finite number of VMIDs.

Each HW queue has a Hardware Queue Descriptor associated with it, while each application queue has a Memory Queue Descriptor. The HWS microcode is passed a runlist with a set of MQDs for each process plus a set of resources (HW queues + VMIDs) plus a few other parameters. At that point HWS takes over and maps (copies) sets of MQDs into HQDs and lets the queues run for a programmable time quantum. At the end of the time quantum it rolls the waves off the hardware, HQD contents are written back into MQDs, and the next set of MQDs is selected.

If you have trouble sleeping you can pick through amdkfd->kfd_packet_manager.c and explore out from there. "Oversubscription" refers to having more MQDs than available HQDs or more processes than VMIDs, requiring HWS to round-robin multiplex sets of MQDs onto the HW queues.
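
The runlist and time quantum behaviour described above can be sketched as a toy model. This is only an illustration, not the amdkfd or microcode logic: the names `MQD`, `run_hws` and `is_oversubscribed` are made up here, VMIDs and per-process grouping are ignored, and "rolling waves off the hardware" is reduced to requeueing the descriptor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MQD:
    """Toy stand-in for a Memory Queue Descriptor (one application queue)."""
    process: str
    queue_id: int
    quanta_run: int = 0  # how many time quanta this queue has been mapped for


def is_oversubscribed(runlist, num_hqds):
    """More application queues than hardware queue slots."""
    return len(runlist) > num_hqds


def run_hws(runlist, num_hqds, num_quanta):
    """Round-robin MQDs onto a fixed number of hardware queue slots (HQDs).

    Returns, per time quantum, the list of (process, queue_id) pairs that
    were mapped onto the hardware during that quantum.
    """
    pending = deque(runlist)
    history = []
    for _ in range(num_quanta):
        # Map (copy) as many MQDs as there are free HQDs.
        active = [pending.popleft() for _ in range(min(num_hqds, len(pending)))]
        for mqd in active:
            mqd.quanta_run += 1
        history.append([(m.process, m.queue_id) for m in active])
        # Quantum expired: state goes back to the MQDs, which rejoin the tail.
        pending.extend(active)
    return history


if __name__ == "__main__":
    queues = [MQD("A", 0), MQD("A", 1), MQD("B", 0)]
    print(is_oversubscribed(queues, num_hqds=2))
    for step, mapped in enumerate(run_hws(queues, num_hqds=2, num_quanta=3)):
        print(step, mapped)
```

With three queues and two hardware slots, each queue ends up mapped for two of the three quanta, which is the round-robin multiplexing that "oversubscription" forces.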

If HWS is not being used then driver code maps MQDs to HQDs directly, but we normally run with HWS enabled all the time.
 
Lol ... Image Sharpening filters were in use by developers long before AMD created RIS. Can't say the same about DLSS.
1. May I ask, which game developers (or when) used image sharpening filters for real-time resizing of rendered image?
2. In case you are not talking specifically about real-time image resizing: GPU-accelerated AI-based image resizing was in use long before DLSS, e.g. for photo resizing for large-print purposes, but also by some game developers for creating high-resolution textures.
 
To me, any game that offered bicubic or Lanczos up- or downscaling counts as a form of sharpening to a degree. And that goes way back.
Other than that, any game on PC that offered a sub-native resolution option and also had controls for sharpening. That goes back to pre-UE4 games AFAIK.
The GeDoSaTo tool on PC has offered down- and upscaling with sharpening ever since its inception as well.
 
ACEs are front end processors without a fixed relationship with shader engines. The number of ACEs per shader engine has varied over GCN's lifetime. The PS4 had 8 for 2, Hawaii had 8 for 4, Fury had 4 for 4, APUs have had 1/2 for 1, etc.
I haven't seen restrictions in the number of CUs that could be addressed by an ACE in the past. Given how long it took for decent adoption of asynchronous compute, getting one queue to be used would be a likely minimum and it's not like those scenarios couldn't access all CUs. Other marketing for things like rapid response queues seemed to show allocations that spanned shader engines as well.

You know what, ignore me. I was getting ACEs mixed up with Shader Arrays. I blame Techspot, which does the same in their Navi vs Turing architecture article!

In terms of shader arrays, what could be going on here though? If a shader array still contains one primitive unit, then RDNA2 can't have 2 of them per Shader Engine, can it? Even though the Series X definitely does? Or have I misunderstood something else?
 

https://videocardz.com/newz/amd-radeon-rx-6800-launch-press-deck-transcript

it says

Geometry Processor
  • 8 Pre-Cull Prims/Cycle
  • 4 Post-Cull Prims/Cycle
So 2 Pre-cull primitives per SE and 1 Post-cull primitive per SE
 

Indeed but doesn't each primitive unit accept 2 un-culled primitives and output 1 culled primitive per clock? So in RDNA we have:

  • 2 Shader Engines
  • 2 Shader Arrays per Shader Engine
  • 1 Primitive Unit per Shader Array
  • Hence 4 primitives output per clock

According to Hot Chips, the XSX is set up the same way, albeit with 14 CUs per Shader Array rather than 10 in RDNA.

I thought it was pretty much confirmed that there were 4 Shader Engines in Navi21, which means, based on the above, it should have 8 primitive units.

Also bear in mind we know Navi21 has 128 ROPs, which would also suggest 4 Shader Engines and 8 Shader Arrays (16 ROPs per SA), given that the Series X is configured this way with 2 Shader Engines and 64 ROPs.

So the only explanation I can think of for Navi21 only outputting 4 primitives per clock is if the overall architecture is drastically changed, i.e. still 4 Shader Arrays with doubled up resources in each, or the Primitive Units in Navi21 only output 1 Primitive every other clock vs one every clock in Navi10. Which sounds strange - especially as the Series X still outputs 1 per clock.
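
To make the arithmetic behind this explicit, here is a small sketch of the counting argument. The per-unit rates (2 unculled in, 1 culled out per clock) and the 16 ROPs per Shader Array are the assumptions stated in this post, and the 4 SE x 2 SA layout for Navi21 is speculation in this thread, not a confirmed configuration. The function names are made up for illustration.

```python
def prims_per_clock(shader_engines, arrays_per_se,
                    unculled_in_per_unit=2, culled_out_per_unit=1):
    """Return (pre-cull, post-cull) primitives per clock for the whole chip,
    assuming one primitive unit per shader array."""
    units = shader_engines * arrays_per_se
    return units * unculled_in_per_unit, units * culled_out_per_unit


def rops(shader_engines, arrays_per_se, rops_per_array=16):
    """Total ROPs, assuming 16 ROPs per shader array."""
    return shader_engines * arrays_per_se * rops_per_array


if __name__ == "__main__":
    # Navi10 / Series X style layout: 2 SEs, 2 SAs each.
    print(prims_per_clock(2, 2), rops(2, 2))
    # Speculated Navi21 layout: 4 SEs, 2 SAs each.
    print(prims_per_clock(4, 2), rops(4, 2))
```

Under these assumptions the 2 SE layout gives 8 pre-cull and 4 post-cull primitives per clock with 64 ROPs, while a 4 SE layout would give 16 and 8 with 128 ROPs. The ROP count matches what is known for Navi21, but the press deck's 8/4 figures do not, which is the discrepancy in question.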

Hopefully we'll find out in about 4 hours!
 

The figures should be for the whole chip, as all other figures in that section were calculated for Navi21 as a whole. Details are unclear, so it is unknown whether it's one unit per SE with doubled unculled primitive generation or there are two units with halved culled primitive generation per clock. One consideration is that, by pushing clocks so high and relying on improved culling, they may have less need to improve their geometry throughput.
 
Anyone else here completely shocked by the fact that it's 11 am GMT on launch day and not one review has leaked so far?

I saw some charts on reddit supposedly from a youtuber but those seemed fake.
 

Yes, could be. I wonder then whether the same would apply to the PS5 and XSX if this is true of Navi21? I don't think it's actually been explicitly stated what the primitive throughput is on either of those consoles, has it? We know how many primitive units the XSX has, but if they are half as effective as those in RDNA then the throughput would be half what we currently think it is.
 
These popped up. Take with salt

[Attached image: 9AhesZh.png]
 
More prices ... UK etailer is listing prices for Asus Radeon RX 6800 (XT) cards
ASUS Radeon RX 6800 XT ROG STRIX LC: £764; £917 with tax; 851/1,022 EUR
ASUS Radeon RX 6800 XT TUF GAMING OC: £676; £811 with tax; 753/903 EUR
ASUS Radeon RX 6800 TUF GAMING OC: £588; £705 with tax; 655/785 EUR
ASUS Radeon RX 6800 ROG STRIX OC: £605; £726 with tax; 674/809 EUR

The standard UK Value Added Tax rate is 20 percent. The base prices of the cards are 579 euros for the Radeon RX 6800 and 649 euros for the Radeon RX 6800 XT.
https://www.guru3d.com/news-story/uk-etailer-is-listing-prices-for-asus-radeon-rx-6800-(xt)-cards.html
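
As a quick sanity check on the listing, the with-tax figures can be recomputed from the ex-VAT prices at the 20 percent rate. This is just a sketch; `with_vat` is a made-up helper, and rounding to the nearest pound is an assumption about how the listing was produced.

```python
UK_VAT_PERCENT = 20  # standard UK VAT rate quoted in the article


def with_vat(ex_vat_gbp, vat_percent=UK_VAT_PERCENT):
    """Price including VAT, rounded to the nearest pound (assumed rounding)."""
    return round(ex_vat_gbp * (100 + vat_percent) / 100)


if __name__ == "__main__":
    for ex_vat in (764, 676, 588, 605):
        print(ex_vat, with_vat(ex_vat))
```

Three of the four rows reproduce the listed totals exactly. The £588 card comes out at £706 rather than the listed £705, which may simply mean the ex-VAT prices shown were themselves rounded.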
 
2h30m before the embargo lifts? Is that like a record-setting level of embargo respect for a graphics card (or anything non-Apple) in the last 10 years?

But wow, those tables really turn from Gears 5 onwards.


Makes you wonder, doesn't it? :)
Wonder what? What should I wonder about?? I need to know, this wait is taking forever!!!!
 