AMD: Navi Speculation, Rumours and Discussion [2019-2020]

T2098 · Oct 20, 2020

Erinyes said:
Of course they probably would adjust the clocks based on what they've seen from NV, but by how much? You can't magically change silicon, PCB and cooler design overnight to dissipate 50-100W more. Testing, validation, production and distribution take a lot longer than the approx 4-6 weeks since Ampere has been out.

AMD's done exactly that a few times before and caused some significant heartache for their board partners - most recently:

https://www.anandtech.com/show/15422/the-amd-radeon-rx-5600-xt-review/2
https://www.pcgamer.com/amds-last-minute-5600-xt-bios-update-feels-like-a-bait-and-switch/

You're right in that they can't change the silicon or board layout but there's an awful lot of latitude to make changes to V/F curves and clocks in the vBIOS.

And not all cards were stable with the 'revised' vBIOS and higher clocks, which also led to some unpleasantness for consumers who bought a 5600XT and expected it to perform exactly the way all the review samples did:

https://www.igorslab.de/en/radeon-r...mits-and-benchmark-morepowertool-tutorial/12/

andermans · Oct 21, 2020

FWIW for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that of course, but that sounds fairly costly) so raytracing and texture sampling compete there. Furthermore due to bad cache access patterns the raytracing might be thrashing the cache.

The other thing I'd not be so sure about is texture sampling and raytracing in the texture units at the same time. How it has been worded to use only minimal die area sounds to me like they might be reusing some of the texture sampler parts to be dual purpose, in which case you can't do them in parallel. Though I think the signal here is very weak.

Deleted member 2197 · Oct 21, 2020

Any date for the RX 6000 reviews?

Erinyes · Oct 21, 2020

T2098 said:
AMD's done exactly that a few times before and caused some significant heartache for their board partners - most recently:

https://www.anandtech.com/show/15422/the-amd-radeon-rx-5600-xt-review/2
https://www.pcgamer.com/amds-last-minute-5600-xt-bios-update-feels-like-a-bait-and-switch/

You're right in that they can't change the silicon or board layout but there's an awful lot of latitude to make changes to V/F curves and clocks in the vBIOS.

And not all cards were stable with the 'revised' vBIOS and higher clocks, which also led to some unpleasantness for consumers who bought a 5600XT and expected it to perform exactly the way all the review samples did:

https://www.igorslab.de/en/radeon-r...mits-and-benchmark-morepowertool-tutorial/12/

While true, that was a different case: it was a derivative product from cut-down silicon, using the same 5700 PCB if I am not mistaken. Changing a vBIOS and power target is a lot easier in that scenario than for a completely new part. It was a reaction to the 2060 price cut after the 5600XT was announced, and AMD really did leave it a bit too last-minute. Here, NV's cards are all laid out in advance.

Either way, the point is that AMD was very clear it was aiming for the high end of the market and would have designed the product accordingly. Claiming that they changed it only as a reaction to Ampere is grasping at straws, for no reason.

pharma said:
Any date for the RX 6000 reviews?

If Zen 3 is anything to go by, reviews would only be up by the date it is available for sale. With a rumored mid-November release for RX6000, I'd expect reviews around or just before launch day.

pTmdfx · Oct 21, 2020

andermans said:
FWIW for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that of course, but that sounds fairly costly) so raytracing and texture sampling compete there. Furthermore due to bad cache access patterns the raytracing might be thrashing the cache.

Sampling does compete for write back data path with intersection, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also I would say traversal being cache thrashing is no stranger to GPUs; say use of higher definition textures can thrash L0 caches hard too.

nAo · Oct 21, 2020

pTmdfx said:
Sampling does compete for write back data path with intersection, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

It matters if each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection HW/texture unit simply tells whether a ray intersected one or more nodes of the BVH. If this is the way it works then it requires a constant back and forth between CUs and texture/intersection units.
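To make the back and forth concrete, here is a toy sketch of that split. It is only an illustration under stated assumptions: the `Node` class, the 1D intervals standing in for bounding boxes, and `intersect_node` are all hypothetical, and nothing here reflects how the actual hardware or ISA works. The loop and the stack live in "shader code", and every node visited costs one trip to the stand-in intersection unit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: float                                      # 1D interval stands in for an AABB
    hi: float
    children: list = field(default_factory=list)   # empty for leaf nodes
    prim: int = -1                                 # primitive id, leaves only


def intersect_node(node, x):
    """Hypothetical stand-in for the fixed-function test: hit or miss only."""
    return node.lo <= x <= node.hi


def traverse(root, x):
    """Shader-side loop: owns the stack and does all the pointer chasing."""
    stack, hits, round_trips = [root], [], 0
    while stack:
        node = stack.pop()
        round_trips += 1              # one CU -> intersection unit -> CU trip
        if not intersect_node(node, x):
            continue
        if node.children:
            stack.extend(node.children)
        else:
            hits.append(node.prim)
    return hits, round_trips


leaves = [Node(0, 1, prim=0), Node(1, 2, prim=1), Node(2, 3, prim=2), Node(3, 4, prim=3)]
left = Node(0, 2, children=leaves[:2])
right = Node(2, 4, children=leaves[2:])
root = Node(0, 4, children=[left, right])

hits, trips = traverse(root, 2.5)
print(hits, trips)
```

Even in this tiny tree a single query makes five trips to find one leaf, which is the traffic that would share the texture path with ordinary sampling.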

Also I would say traversal being cache thrashing is no stranger to GPUs; say use of higher definition textures can thrash L0 caches hard too.

(minified) high res textures don't thrash the texture caches unless mip maps are absent or mip mapping is disabled.

andermans · Oct 21, 2020

pTmdfx said:
Sampling does compete for write back data path with intersection, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also I would say traversal being cache thrashing is no stranger to GPUs; say use of higher definition textures can thrash L0 caches hard too.

I'm not talking about the write back path, I'm talking about the L0 cache <-> texture unit path (i.e. already part of the memory hierarchy), or even further up if there is a cache miss. The write back to registers should indeed not be much of a problem.

Also I don't think higher definition textures generally thrash the cache badly, or at least not worse than low resolution textures, as long as the textures have the appropriate mipmaps (which pretty much everyone does these days).

There is plenty of other stuff that can thrash L0 caches though, especially post processing steps that try to access close-by but not directly neighboring pixels.
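A rough way to see both points is to count the cache lines touched by one row of pixels for a given texel stride. This is a deliberately simplified sketch: the 128-byte line, the 4 bytes per texel, and the linear 1D layout are assumptions made for illustration (real textures are tiled), and `lines_touched` is a made-up helper.

```python
CACHE_LINE_BYTES = 128   # assumed line size
BYTES_PER_TEXEL = 4      # e.g. RGBA8, linear layout for simplicity


def lines_touched(num_pixels, texel_stride):
    """Distinct cache lines hit when each pixel samples every texel_stride-th texel."""
    lines = {
        (p * texel_stride * BYTES_PER_TEXEL) // CACHE_LINE_BYTES
        for p in range(num_pixels)
    }
    return len(lines)


mipmapped = lines_touched(32, 1)   # mip level chosen so neighbours are ~1 texel apart
no_mips = lines_touched(32, 8)     # 8x minified texture sampled at mip 0
strided = lines_touched(32, 3)     # post-process tap pattern skipping texels
print(mipmapped, no_mips, strided)
```

With a stride of about one texel the whole row fits in a single line regardless of the base texture resolution, while the unmipped and strided cases spread the same 32 samples over several lines.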

DegustatoR · Oct 21, 2020

https://videocardz.com/newz/amd-radeon-rx-6900xt-to-feature-navi-21-xtx-gpu-with-80-cus

https://twitter.com/x/status/1318830142337277952

SimBy · Oct 21, 2020

64CUs? That's a big cut. But kind of makes sense I guess. The gap between 40 and 72 was huge.

DegustatoR · Oct 21, 2020

SimBy said:
64CUs? That's a big cut. But kind of makes sense I guess. The gap between 40 and 72 was huge.

It can't be smaller I think if there are 4 SEs - they can only disable the same number of WGPs per SE.
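The arithmetic, as a quick sketch. It assumes 4 SEs as above, 80 CUs for the full chip as in the linked rumour, and two CUs per WGP as on RDNA; `cu_configs` is just an illustrative helper.

```python
CUS_PER_WGP = 2        # RDNA pairs two CUs into one WGP
FULL_CUS = 80          # full Navi 21 per the linked rumour
SHADER_ENGINES = 4     # assumption from this post


def cu_configs(max_disabled_per_se):
    """CU totals when the same number of WGPs is disabled in every SE."""
    return [
        FULL_CUS - SHADER_ENGINES * k * CUS_PER_WGP
        for k in range(max_disabled_per_se + 1)
    ]


configs = cu_configs(3)
print(configs)  # [80, 72, 64, 56]
```

Under those assumptions each step removes 8 CUs, so nothing sits between 72 and 64.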

Jawed · Oct 21, 2020

pTmdfx said:
Even normal load-stores are quite liberal: operations can be freely reordered, and only RF writeback is in program order. The texture load-store path has been supporting varying latency and a huge swarm of capabilities since GCN anyway. For example, address coalescing or the lack thereof can cause a load instruction to take a varying number of cycles to complete, even though multiple load instructions can be issued back-to-back. RDNA enhanced it further by adding a low-latency path bypassing the samplers, and RDNA 2 BVH intersection seems to be merely a (new) cherry on the "filtering/pre-processing" pie.

TMU is pipelined, too. So the texture unit will process texels or ray trace queries interleaved if necessary.

There's a patent document that talks about intra-CU producer-consumer scheduling of wavefronts. I would hope that this is applied to ray-traversal and ray-query-result shaders...

Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).

andermans said:
FWIW for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that of course, but that sounds fairly costly) so raytracing and texture sampling compete there. Furthermore due to bad cache access patterns the raytracing might be thrashing the cache.

There's a swarm of patent documents on the subject of cache friendly CU scheduling...

pTmdfx said:
Sampling does compete for write back data path with intersection, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also I would say traversal being cache thrashing is no stranger to GPUs; say use of higher definition textures can thrash L0 caches hard too.

Yes, another subject of AMD's patent documents.

In general CU scheduling and cache-friendly operations (such as L0s being able to snoop each other passively) appear to be part of RDNA. How much of that is new for RDNA 2, I can't tell.

trinibwoy · Oct 21, 2020

Jawed said:
Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).

Compute shaders read textures too and they can certainly overlap with RT.

Jawed · Oct 21, 2020

trinibwoy said:
Compute shaders read textures too and they can certainly overlap with RT.

Compute shaders don't have the data per work item that pixel shaders have. So there's no way to map a texel to a triangle (there's no triangle) and there's no way to mip-map filter (because there's no triangle).

Sure, a compute shader reads from memory, but that doesn't use the texture-processing/ray-intersection pipelines. I would expect ray-hit/miss (etc.) compute shaders to run on the same CU or same WGP as ray-intersection shaders. That's the producer-consumer model I was talking about earlier.

andermans · Oct 21, 2020

DegustatoR said:
It can't be smaller I think if there are 4 SEs - they can disable only the same amount of WGPs per SE.

Actually in the Linux driver they recently added some code to deal with disabled SEs. Not sure how complete it is and won't guarantee there is a SKU that really does this but ...

Jawed said:
Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).

Doesn't DXR 1.1 allow raytracing in all shader stages though?

Jawed said:
Compute shaders don't have the data per work item that pixel shaders have. So there's no way to map a texel to a triangle (there's no triangle) and there's no way to mip-map filter (because there's no triangle).

Sure, a compute shader reads from memory, but that doesn't use the texture-processing/ray-intersection pipelines. I would expect ray-hit/miss (etc.) compute shaders to run on the same CU or same WGP as ray-intersection shaders. That's the producer-consumer model I was talking about earlier.

A compute shader can definitely use the texture-processing pipelines. Either by just always selecting lod 0, explicitly passing a LOD, using explicit derivatives, or even just using implicit derivatives (all it needs is shader invocations to be arranged in a quad pattern. No geometry needed).

Furthermore even without filtering, images always need format conversion which will use the texture-processing part of the texture unit. (tell-tale is that loads with a format always only have a throughput of 4 texels/cycle instead of up to 32 items/cycle for plain buffer loads on RDNA1)
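As a sketch of the implicit-derivative case: given the UVs of a 2x2 quad of invocations, the derivatives are just finite differences between neighbours, and the LOD follows from the textbook log2 formula. This is an illustration only; hardware uses its own approximations, and `quad_lod` plus the square `tex_size` parameter are assumptions of mine.

```python
import math


def quad_lod(uv_quad, tex_size):
    """LOD for a 2x2 quad of UVs: [[top-left, top-right], [bottom-left, bottom-right]]."""
    (u00, v00), (u10, v10) = uv_quad[0]
    (u01, v01), _ = uv_quad[1]
    # Finite differences across the quad, scaled to texel units
    ddx = ((u10 - u00) * tex_size, (v10 - v00) * tex_size)
    ddy = ((u01 - u00) * tex_size, (v01 - v00) * tex_size)
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0


# Invocation (x, y) samples uv = (x / 256, y / 256) from a 1024x1024 texture,
# so neighbouring invocations are 4 texels apart.
step = 1 / 256
quad = [[(0.0, 0.0), (step, 0.0)], [(0.0, step), (step, step)]]
lod = quad_lod(quad, 1024)
print(lod)  # 2.0
```

No triangle is involved anywhere: the quad arrangement of invocations alone is enough to pick mip level 2 here.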

btw I'm interested in the talk about the intra-CU scheduling. Was that for tessellation or raytracing? Do you have some links to read up on it?

DegustatoR · Oct 21, 2020

andermans said:
Actually in the Linux driver they recently added some code to deal with disabled SEs. Not sure how complete it is and won't guarantee there is a SKU that really does this but ...

Disabling a whole SE is not the same as disabling a different number of WGPs in each enabled SE though.

trinibwoy · Oct 21, 2020

DegustatoR said:
https://videocardz.com/newz/amd-radeon-rx-6900xt-to-feature-navi-21-xtx-gpu-with-80-cus

Hmmm I’m surprised that the 256-bit bus is real. Maybe there is something to this magical cache after all.

del42sa · Oct 21, 2020

https://videocardz.com/newz/amd-rad...ard-allegedly-features-a-2577-mhz-boost-clock

hmm 2577MHz ?

DegustatoR · Oct 21, 2020

del42sa said:
https://videocardz.com/newz/amd-rad...ard-allegedly-features-a-2577-mhz-boost-clock

hmm 2577MHz ?

Why stop there?

It's hard to say how relevant these max clocks are to what the card will actually run at under load.
If the reference specs above are any indication, the actual sustained clock should be some 300 MHz lower. So for 2577 it would be around 2250.

SimBy · Oct 21, 2020

trinibwoy said:
Hmmm I’m surprised that the 256-bit bus is real. Maybe there is something to this magical cache after all.

Memory bandwidth tests are gonna be interesting.

del42sa said:
https://videocardz.com/newz/amd-rad...ard-allegedly-features-a-2577-mhz-boost-clock

hmm 2577MHz ?

Not sure what is up with videocardz lately, that's not what Igor is saying. Actual clocks are lower.

Jawed · Oct 21, 2020

andermans said:
Doesn't DXR 1.1 allow raytracing in all shader stages though?

Interesting idea!

A compute shader can definitely use the texture-processing pipelines. Either by just always selecting lod 0, explicitly passing a LOD, using explicit derivatives, or even just using implicit derivatives (all it needs is shader invocations to be arranged in a quad pattern. No geometry needed).

Furthermore even without filtering, images always need format conversion which will use the texture-processing part of the texture unit. (tell-tale is that loads with a format always only have a throughput of 4 texels/cycle instead of up to 32 items/cycle for plain buffer loads on RDNA1)

Excellent stuff. I'm clearly out of the loop on recent shader models!

btw I'm interested in the talk about the intra-CU scheduling. Was that for tessellation or raytracing? Do you have some links to read up on it?

COOPERATIVE WORKGROUP SCHEDULING AND CONTEXT PREFETCHING

A first workgroup is preempted in response to threads in the first workgroup executing a first wait instruction including a first value of a signal and a first hint indicating a type of modification for the signal. The first workgroup is scheduled for execution on a processor core based on a first context after preemption in response to the signal having the first value. A second workgroup is scheduled for execution on the processor core based on a second context in response to preempting the first workgroup and in response to the signal having a second value. A third context is prefetched into registers of the processor core based on the first hint and the second value. The first context is stored in a first portion of the registers and the second context is prefetched into a second portion of the registers prior to preempting the first workgroup.

I went through a year's worth of patent documents yesterday and decided there are too many interesting ones to bother linking/summarising. It seems, while I've been "lazy" these last few years, very little heed has been paid to patent stuff.
