AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Kaotik · Jan 30, 2020

trinibwoy said:
I doubt the lead times for pc AIB cards are anywhere close to the lead times for custom console hardware. Especially when those cards aren’t bringing any new features to the table. Point is that next gen console hardware could have been finalized before rdna 2 was ready given the long lead times.

Yes, they could have, but that doesn't mean they were either.
PS4 Pro, which Sony has confirmed to have features from Polaris and beyond was released within 6 months of Polaris GPUs.
PS4 and XB1 were both GCN1.1 and released within couple months of Bonaire.
So using RDNA2 should be well within reason for the known timeframes.

Michellstar · Jan 30, 2020

Globalisateur said:
You are all setting yourself up for disappointment. No one ever said this officially in any way. Almost all evidence (except the dubious "hardware accelerated" definition historically used by Microsoft for any rendering being DirectX accelerated) point to XSX being regular RDNA1.

Almost all evidence? like VRS that´s in RDNA1 right?

We take for granted that RDNA2 will bring VRS an RT, and it´ll be aimed first to big GPUs, just like last time with vega and polaris.

I see a parallel with base consoles, which were based in the Bonaire core (GNC2 sea islands??) that came to market before the Durango and Orbis, but they were already outed at that time.

ups beaten

DegustatoR · Jan 30, 2020

Globalisateur said:
You are all setting yourself up for disappointment. No one ever said this officially in any way. Almost all evidence (except the dubious "hardware accelerated" definition historically used by Microsoft for any rendering being DirectX accelerated) point to XSX being regular RDNA1.

On quite the contrary, almost everything we know now about next gen console APUs and RDNA2 points to the former being based on the latter, not on RDNA1. But it is of course possible that consoles will use some sort of mix of features from RDNA1 and RDNA2. No way of knowing this for sure for now unless you're under NDA.

fehu · Jan 30, 2020

For a refresh speculation keep in mind the n7p node, that in theory is an easier port compared to n7+, and offer 7% more performance at the same power.

Kaotik · Jan 30, 2020

fehu said:
For a refresh speculation keep in mind the n7p node, that in theory is an easier port compared to n7+, and offer 7% more performance at the same power.

AFAIK N7P is a drop-in replacement for N7 and everything N7 will migrate to N7P as soon as TSMC gets their lines updated. It offers 7% more performance at the same power compared to N7, not N7+. N7+ supposedly offers slightly higher 10% performance uplift at same power compared to N7 as well as area benefits with 20% higher density. (Also AMD has said 7nm+ which is generally understood same as N7+)

sniffy · Jan 30, 2020

Ryan Smith said:
I fully expect that RDNA2 will have a new graphics architecture, which is something AMD couldn't do for RDNA(1). Navi's graphics core is essentially unchanged from Vega, which was a bit behind the curve even when it launched. So RDNA2 chips are going to be quite a big change from RDNA(1), even if the compute architecture itself isn't changing.

Just to clarify - obviously the topology has changed (reorganisation into WGPs and resource allocation) but the individual units (all 2560 of them in Navi 10) are unchanged from GCN 5/Vega?

Will we see an AT deep dive come release of RDNA2?

Ryan Smith · Jan 31, 2020

sniffy said:
Just to clarify - obviously the topology has changed (reorganisation into WGPs and resource allocation) but the individual units (all 2560 of them in Navi 10) are unchanged from GCN 5/Vega?

Will we see an AT deep dive come release of RDNA2?

It's probably better to say that the throughput and features of various graphics-related units were not changed. Graphics these days is about half a layer above compute, so when you change the underlying architecture, the stuff above is not quite identical anymore.

Nebuchadnezzar · Jan 31, 2020

A GPU in the end is nothing more than its own SoC within an SoC, there's a dozen functional blocks that interact with each other. RDNA1 revamped the compute blocks completely as well as some of the memory subsystem. The understanding is that RNDA2 will revamp the rasterisation / classical fixed function blocks - which seemingly had very little upgrades or changes in Navi over Vega (and why AMD essentially didn't talk about them at all this past generation).

Kaotik · Jan 31, 2020

Nebuchadnezzar said:
A GPU in the end is nothing more than its own SoC within an SoC, there's a dozen functional blocks that interact with each other. RDNA1 revamped the compute blocks completely as well as some of the memory subsystem. The understanding is that RNDA2 will revamp the rasterisation / classical fixed function blocks - which seemingly had very little upgrades or changes in Navi over Vega (and why AMD essentially didn't talk about them at all this past generation).

Don't forget new TMUs responsible for part of the RT acceleration, the compute block probably needs other changes too for it.

DegustatoR · Jan 31, 2020

Kaotik said:
Don't forget new TMUs responsible for part of the RT acceleration, the compute block probably needs other changes too for it.

The fact that they can use texturing datapaths for RT h/w access doesn't mean that TMUs themselves will be used for RT in any capacity.

Kaotik · Jan 31, 2020

DegustatoR said:
The fact that they can use texturing datapaths for RT h/w access doesn't mean that TMUs themselves will be used for RT in any capacity.

It's been a while since I last glanced the patent, but didn't it imply the hardware would be integrated as part of TMU, not just use its datapahts?

DegustatoR · Jan 31, 2020

Kaotik said:
It's been a while since I last glanced the patent, but didn't it imply the hardware would be integrated as part of TMU, not just use its datapahts?

Depends on what you mean by "integrated" and "TMU".

Texture address unit is used for data fetching but this can be considered as "using the same datapaths", the rest is independent from texturing h/w.

Kaotik · Jan 31, 2020

DegustatoR said:
Depends on what you mean by "integrated" and "TMU".

Texture address unit is used for data fetching but this can be considered as "using the same datapaths", the rest is independent from texturing h/w.

That pretty clearly states that AMD considers ray intersection engine being part of Texture Processor. Based on what else is listed for it, the patents "texture processor" seems to be what I've understood to be meant with "TMU"

edit: the yellow part here, even if AMD separates them as Texture Filtering Units and Texture Mapping Units.

DegustatoR · Jan 31, 2020

Kaotik said:
That pretty clearly states that AMD considers ray intersection engine being part of Texture Processor.

Which does no apparent texture work so why exactly is it a part of a texture processor? Patents can be weird when they are trying to get around stuff which was previously patented by some other party. Again, it all depends on what you consider to be a "TMU" - if a "TMU" has a shading unit inside it for example does this make it into something more than a TMU or not? Does a chip with such "TMUs" and no other shading capabilities don't have "shader cores" as they are a part of "TMUs" now? This can go far.

CarstenS · Jan 31, 2020

Since „shader“ is mentioned separately in the patent excerpt you (DegustatoR) posted, it seems pretty clear in this case, that the patent is basically adhering to AMDs usual nomenclature of things. IMHO, as a non-native speaker.

edit: Which would make the two approaches interesting. Nvidias seemingly can work concurrently with shader cores and texture units, possibly blocking L1 Cache and/or shared memory to some extent. AMDs implementation seems to use some resources of the TMU and thus prefers to run concurrently with more compute heavy workloads.

iroboto · Jan 31, 2020

DegustatoR said:
Which does no apparent texture work so why exactly is it a part of a texture processor?

I'm going to give answer this a shot in the dark - break me down. First I'll separate the patent of how it works as listed earlier and then I'll follow up with a snippet that was missed in the snippet but critical:

(1) A shader sends a texture instruction containing ray data and a pointer to the BVH volume node to the texture address unit.
(2) The texture cache processor uses an address provided by (1) to fetch BVH node data from the cache
(3) The ray intersection engine performs ray-BVH node type intersection testing using the ray data [from (1)] and the BVH node data [from (2)]
(4) The intersection testing results and indications for BVH traversal are returned to the shader [original caller from (1)] via a texture data return path.
(5) The shader reviews the intersection results and the indications to decide how to traverse to the next BVH node

Breakdown
(1) the hybrid approach and using a shader unit to schedule the processing addresses the issues with solely hardware based and/or solely software based solutions.
- So what are the known problems of purely hardware based solutions? We know the issue with purely software based solutions (too slow).
- IIRC As noted by @JoeJ and other developers @sebbbi their biggest issue is over control of casting the rays for performance.
- With nvidia's current solution the rays are cast for them, it is an entire black box
- We have an entire thread about this issue here
(2) Flexibility is preserved since the shader unit can still control the overall calculation and can bypass the fixed function hardware where needed and still get the performance advantage of fixed function hardware.
- Engines and renderers today have been using ray casting for some time; the ability to have a custom intersection shader is something that would allow them to port directly without needing to rework things excessively https://forum.beyond3d.com/posts/2042744/
- @JoeJ makes a heavy case for the need to remove restrictions around triangle intersection and many of our senior members and mods (@Shifty Geezer) have been on the debating side to see a (flexible) based ray tracing solution. https://forum.beyond3d.com/posts/2088217/

The problem: First RTX GPUs have no support for traversal shader because traversal is fixed function. How to deal with it? Leave first gen GPUs behind? Develop multiple codepaths? (The latter won't work. If we do this, we can not have a game that fully utilizes the new GPUs! The compromise has to be towards the old gen. Period.)

(3) In addition, by utilizing the texture processor infrastructure, large buffer for ray storage and BVH caching are eliminated that are typically required in a hardware ray tracing solution as the existing VGPRS and texture cache can be used in its place, which substantially saves area and complexity of the hardware solution
- And if understood correctly we may see some silicon savings as a result of going with this method.

Lets take these concepts and look at 2 important aspects [f____ck me I need to do actual work instead of this]
a) VRS Tier 2 (or MS version of it)
b) DXR Tier 1.1

VRS Tier 2 - Works well with having better control over ray casting: (https://forum.beyond3d.com/posts/2093848/)

VRS Tier 2

Shading rate can be specified on a per-draw-basis, as in Tier 1. It can also be specified by a combination of per-draw-basis, and of:

Semantic from the per-provoking-vertex, and

a screenspace image

Shading rates from the three sources are combined using a set of combiners

Screen space image tile size is 16x16 or smaller

Shading rate requested by the app is guaranteed to be delivered exactly (for precision of temporal and other reconstruction filters)

SV_ShadingRate PS input is supported

The per-provoking vertex rate, also referred to here as a per-primitive rate, is valid when one viewport is used and SV_ViewportIndex is not written to.

The per-provoking vertex rate, also referred to as a per-primitive rate, can be used with more than one viewport if the SupportsPerVertexShadingRateWithMultipleViewports cap is marked true. Additionally, in that case, it can be used when SV_ViewportIndex is written to.

Screen Space Image (image-based):
On Tier 2 and higher, pixel shading rate can be specified by a screen-space image.

The screen-space image allows the app to create an “LOD mask” image indicating regions of varying quality, such as areas which will be covered by motion blur, depth-of-field blur, transparent objects, or HUD UI elements. The resolution of the image is in macroblocks, not the resolution of the render target. In other words, the subsampling data is specified at a granularity of 8x8 or 16x16 pixel tiles as indicated by the VRS tile size.

This is important because developers may be very specifically fine tuning their ray casts for a variety of things to be able to maximize performance in according to using VRS. ie, it makes it a hell of a lot easier to control performance as per above. They can fine tune where on the image they want more or less rays or no rays at all. ie: why bother with ray casting/rendering where the UI is going to block it. When you consider VRS, look at (1). The shader is the one holding the ray data you want to submit for intersection testing.

This combined with new flexibility found in DXR 1.1

Tier 1.1 implementations also support a variant of raytracing that can be invoked from any shader stage (including compute and graphics shaders), but does not involve any other shaders - instead processing happens logically inline with the calling shader. See Inline raytracing.

Tier 1.1 implementations also support GPU initiated DispatchRays() via ExecuteIndirect().

So we see a scenario where in your standard rendering pathways for compute shader, you can freely inline or call ray tracing through executeIndirect, get the results you need and continue forward without having to go back to the CPU.

A bit of overview of Ray Tracing Algorithm in question for those that need a refresher

A ray gets sent out for each pixel in question. The algorithm works out which object gets hit first by the ray and the exact point at which the ray hits the object. This point is called the first point of intersection and the algorithm does two things here: 1) it estimates the incoming light at the point of intersection and 2) combines this information about the incoming light with information about the object that was hit.

1) To estimate what the incoming light looked like at the first point of intersection, the algorithm needs to consider where this light was reflected or refracted from.

2) Specific information about each object is important because objects don’t all have the same properties: they absorb, reflect and refract light in different ways:

...
Savvy readers with some programming knowledge might notice some edge cases here.

[*]Sometimes light rays that get sent out never hit anything. Don’t worry, this is an edge case we can cover easily by measuring for how far a ray has travelled so that we can do additional work on rays that have travelled for too far.

[*]The second edge case covers the opposite situation: light might bounce around so much that it’ll slow down the algorithm, or an infinite number of times, causing an infinite loop. The algorithm keeps track of how many times a ray gets traced after every step and gets terminated after a certain number of reflections. We can justify doing this because every object in the real world absorbs some light, even mirrors. This means that a light ray loses energy (becomes fainter) every time it’s reflected, until it becomes too faint to notice. So even if we could, tracing a ray an arbitrary number of times doesn’t make sense.

I asterisked these points because when you consider (5) on the patent, it's up to the shader after each cast to determine how to traverse the node next. That should mean they can control when they want to stop bouncing, how many rays they want bounced, possibly controlling what distance the ray should travel before stopping. It appears to me that may be easier to handle with determining your zones with VRS.

iroboto · Jan 31, 2020

the TLDR of above: Developers are looking for flexibility to do what they need
As per @sebbbi's commentary

https://twitter.com/x/status/1032954049044328449

https://twitter.com/x/status/1032954752852799488

He can use his current setup and do his primary rays using this cone based method. And then inline/execute indirect dispatch rays directly to access the results from the BVH structure, and then continue there on with other features.
He is no longer bound doing one thing or another. Developers will be able to freely mix and match where they like. There was some discussion about DXR needing the ability to directly access the BVH structure. But perhaps that no longer needed. Let developers do what they want, how they want, and if they want results from BVH tree, inline/executeIndirect a call from your results (of your primary rays/cones) and retrieve the new intersection values from BVH.

upnorthsox · Jan 31, 2020

iroboto said:
the TLDR of above: Developers are looking for flexibility to do what they need
As per @sebbbi's commentary

https://twitter.com/x/status/1032954049044328449

https://twitter.com/x/status/1032954752852799488

He can use his current setup and do his primary rays using this cone based method. And then inline/execute indirect dispatch rays directly to access the results from the BVH structure, and then continue there on with other features.
He is no longer bound doing one thing or another. Developers will be able to freely mix and match where they like. There was some discussion about DXR needing the ability to directly access the BVH structure. But perhaps that no longer needed. Let developers do what they want, how they want, and if they want results from BVH tree, inline/executeIndirect a call from your results (of your primary rays/cones) and retrieve the new intersection values from BVH.

That's not what he's saying. What he's saying is he, like everyone else, really needs a HW solution for BVH but sans that, then don't put partial solutions in the way that block him from implementing his own workaround.

AlNom · Jan 31, 2020

CarstenS said:
edit: Which would make the two approaches interesting. Nvidias seemingly can work concurrently with shader cores and texture units, possibly blocking L1 Cache and/or shared memory to some extent. AMDs implementation seems to use some resources of the TMU and thus prefers to run concurrently with more compute heavy workloads.

hum... something something Compute Tunneling (to help mitigate such penalties) :?:

Whitepaper

However, at times one task can become far more latency sensitive than other work. In prior generations, the command processor could prioritize compute shaders and reduce the resources available for graphics shaders. As Figure 5 illustrates, the RDNA architecture can completely suspend execution of shaders, freeing up all compute units for a high-priority task. This scheduling capability is crucial to ensure seamless experiences with the most latency sensitive applications such as realistic audio and virtual reality.

DegustatoR · Feb 1, 2020

iroboto said:
the TLDR of above: Developers are looking for flexibility to do what they need

I'm not entirely sure what you're trying to say as a reply to what I've posted.

The fact that BVH traversal acceleration unit may be located inside a "texture processor" in RDNA2 doesn't make it any more or less flexible than what Turing has - and please do note that we know nothing about the placement of RT h/w in Turing as NV hasn't actually disclosed this information.

The fact that hit evaluations are happening and data paths are used differently than they *seem* to be happening and used in Turing doesn't mean that Turing can't do this in a similar fashion - in fact, NV has already officially confirmed that Turing h/w will support DXR 1.1.

Thus far from what can be discerned from the patent (and we don't really know how this will actually transfer into the h/w; AMD files a lot of GPU patents and not all of them are getting an obvious physical implementation) AMD's approach trades general RT performance for die area savings.

I also don't see what VRS has to do with this as VRS is a modification of MSAA/depth testing h/w which is typically residing in ROPs/RBEs (thus is also not a part of texturing h/w). The question of what happens currently on Turing when you trace a ray per a "VRSed" pixel is an interesting one and I hope someone more knowledgeable can answer it here - however Turing's h/w approach doesn't limit ray casting to full screen or any specific number of rays per each screen pixel which was already shown in BFV's RT optimizations which varies the RPP number depending on content inside a scene.

iroboto said:
With nvidia's current solution the rays are cast for them, it is an entire black box

AFAIK this is inaccurate. The only part which is a "black box" is a BVH structure traversal by a ray which is cast by the application in a way it see fit. This is most likely will be similar in AMD's approach from the patent.

AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Kaotik

Drunk Member

Michellstar

DegustatoR

fehu

Kaotik

Drunk Member

sniffy

Ryan Smith

Nebuchadnezzar

Kaotik

Drunk Member

DegustatoR

Kaotik

Drunk Member

DegustatoR

Kaotik

Drunk Member

DegustatoR

CarstenS

Moderator

iroboto

Daft Funk

iroboto

Daft Funk

upnorthsox

AlNom

Moderator

DegustatoR