AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Of course they'd probably adjust the clocks based on what they've seen from NV, but by how much? You can't magically change silicon, PCB and cooler design overnight to dissipate 50-100W more. Testing, validation, production and distribution take a lot longer than the roughly 4-6 weeks that Ampere has been out.
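For a rough sense of why an extra 50-100W isn't just a knob you turn: dynamic power scales roughly with frequency times voltage squared, and higher clocks usually need more voltage, so a late clock bump compounds quickly. A back-of-the-envelope sketch with illustrative numbers (nothing AMD has published):

```latex
P_{dyn} \propto C \, V^2 f
\qquad\Rightarrow\qquad
\frac{P_2}{P_1} \approx \frac{f_2}{f_1}\left(\frac{V_2}{V_1}\right)^2
% e.g. an +8% clock bump needing +8% voltage:
% 1.08 \times 1.08^2 \approx 1.26, i.e. ~26% more dynamic power
% that the existing PCB and cooler now have to feed and dissipate.
```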

AMD's done exactly that a few times before and caused some significant heartache for their board partners - most recently:

https://www.anandtech.com/show/15422/the-amd-radeon-rx-5600-xt-review/2
https://www.pcgamer.com/amds-last-minute-5600-xt-bios-update-feels-like-a-bait-and-switch/

You're right in that they can't change the silicon or board layout but there's an awful lot of latitude to make changes to V/F curves and clocks in the vBIOS.

And not all cards were stable with the 'revised' vBIOS and higher clocks, which also led to some unpleasantness for consumers who bought a 5600 XT and expected it to perform exactly the way all the review samples did:

https://www.igorslab.de/en/radeon-r...mits-and-benchmark-morepowertool-tutorial/12/
 
FWIW, for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that, of course, but that sounds fairly costly), so raytracing and texture sampling compete there. Furthermore, due to bad cache access patterns, the raytracing might be thrashing the cache.

The other thing I'd not be so sure about is texture sampling and raytracing running in the texture units at the same time. The way it's been worded, as using only minimal die area, sounds to me like they might be reusing some of the texture sampler parts as dual-purpose hardware, in which case you can't do both in parallel. Though I think the signal here is very weak.
 
AMD's done exactly that a few times before and caused some significant heartache for their board partners - most recently:

https://www.anandtech.com/show/15422/the-amd-radeon-rx-5600-xt-review/2
https://www.pcgamer.com/amds-last-minute-5600-xt-bios-update-feels-like-a-bait-and-switch/

You're right in that they can't change the silicon or board layout but there's an awful lot of latitude to make changes to V/F curves and clocks in the vBIOS.

And not all cards were stable with the 'revised' vBIOS and higher clocks, which also led to some unpleasantness for consumers who bought a 5600 XT and expected it to perform exactly the way all the review samples did:

https://www.igorslab.de/en/radeon-r...mits-and-benchmark-morepowertool-tutorial/12/

While true, that was a different case: it was a derivative product from cut-down silicon, using the same 5700 PCB if I'm not mistaken. Changing a vBIOS and power target is comparatively a lot easier in that scenario than for a completely new part. It was a reaction to the 2060 price cut after the 5600 XT was announced, and AMD really did do it a bit too last-minute. Here, NV's cards have all been laid out in advance.

Either way, the point is that AMD was very clear it was aiming for the high end of the market and would have designed the product accordingly. Claiming they changed it only as a reaction to Ampere is grasping at straws, for no reason.
Any date for the RX 6000 reviews?

If Zen 3 is anything to go by, reviews will only go up on the date it's available for sale. With a rumored mid-November release for the RX 6000, I'd expect reviews around or just before launch day.
 
FWIW, for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that, of course, but that sounds fairly costly), so raytracing and texture sampling compete there. Furthermore, due to bad cache access patterns, the raytracing might be thrashing the cache.
Sampling does compete with intersection for the write-back data path, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also, I would say traversal thrashing the cache is nothing new for GPUs; use of higher-resolution textures can thrash L0 caches hard too, for example.
 
Sampling does compete with intersection for the write-back data path, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?
It matters if each traversal step needs to be taken on the CUs, i.e. the pointer chasing happens in shader code while the intersection HW/texture unit simply tells you whether a ray intersected one or more nodes of the BVH. If it works that way, it requires a constant back-and-forth between the CUs and the texture/intersection units.
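To make the back-and-forth concrete, here's a minimal sketch of what such shader-driven traversal could look like, written as CUDA-flavored pseudocode. hw_box_intersect() is a hypothetical stand-in for the texture/intersection unit's node test, not a real intrinsic; everything else (the stack, the pointer chasing) is ordinary shader code on the CU:

```cuda
// Hypothetical: the intersection unit tests a ray against a 4-wide box
// node and returns a hit mask for the children.
struct Ray  { float ox, oy, oz, dx, dy, dz, tmax; };
struct Node { unsigned child[4]; float bounds[4][6]; };

__device__ unsigned hw_box_intersect(const Node* n, const Ray& r); // hypothetical

__device__ void traverse(const Node* bvh, const Ray& ray, unsigned root)
{
    unsigned stack[32];   // per-thread traversal stack
    int sp = 0;
    stack[sp++] = root;

    while (sp > 0) {
        const Node* n = &bvh[stack[--sp]];
        // One CU <-> intersection-unit round trip per traversal step.
        unsigned hitMask = hw_box_intersect(n, ray);
        while (hitMask) {                    // push every hit child
            int slot = __ffs(hitMask) - 1;   // lowest set bit, 0-based
            hitMask &= hitMask - 1;          // clear it
            stack[sp++] = n->child[slot];
        }
    }
}
```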

Also, I would say traversal thrashing the cache is nothing new for GPUs; use of higher-resolution textures can thrash L0 caches hard too, for example.
(Minified) high-res textures don't thrash the texture caches unless mipmaps are absent or mipmapping is disabled.
 
Sampling does compete with intersection for the write-back data path, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also, I would say traversal thrashing the cache is nothing new for GPUs; use of higher-resolution textures can thrash L0 caches hard too, for example.

I'm not talking about the write-back path; I'm talking about the L0 cache <-> texture unit path (i.e. already part of the memory hierarchy), or even further up if there is a cache miss. The write-back to registers should indeed not be much of a problem.

Also, I don't think higher-resolution textures generally thrash the cache badly, or at least no worse than low-resolution textures, as long as the textures have the appropriate mipmaps (which pretty much everything does these days).
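For reference, this is why minified high-res textures stay cache-friendly: the sampler picks the mip level from the screen-space UV footprint, so at the selected level neighboring pixels land on neighboring texels no matter how large the base texture is. A sketch of the standard isotropic LOD computation (CUDA-style; my own formulation of the textbook formula, not any vendor's exact hardware math):

```cuda
// Standard isotropic mip LOD selection from UV derivatives: rho is the
// pixel footprint measured in texels, and log2(rho) picks the level
// where that footprint shrinks to ~1 texel.
__device__ float mip_lod(float dudx, float dvdx, float dudy, float dvdy,
                         int texW, int texH)
{
    float rho_x = hypotf(dudx * texW, dvdx * texH);  // footprint along x
    float rho_y = hypotf(dudy * texW, dvdy * texH);  // footprint along y
    float rho   = fmaxf(rho_x, rho_y);
    return fmaxf(0.0f, log2f(rho));  // clamp: no negative LOD on minification
}
```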

There's plenty of other stuff that can thrash L0 caches though, especially post-processing steps that try to access close-by but not directly neighboring pixels.
 

https://videocardz.com/newz/amd-radeon-rx-6900xt-to-feature-navi-21-xtx-gpu-with-80-cus

 
Even normal load-stores are quite liberal: operations can be freely reordered, and only RF writeback is in program order. The texture load-store path has supported varying latency and a huge swarm of capabilities since GCN anyway; for example, address coalescing (or the lack thereof) can cause a load instruction to take a varying number of cycles to complete, even though multiple load instructions can be issued back-to-back. RDNA enhanced it further by adding a low-latency path bypassing the samplers, and RDNA 2's BVH intersection seems to be merely a (new) cherry on the "filtering/pre-processing" pie.
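The coalescing point has a direct CUDA analogue, for anyone who wants to see the variable-latency behavior in practice. The two kernels below differ only in access pattern (illustrative, not AMD-specific):

```cuda
// Same instruction count, very different memory-side cost per load.
__global__ void coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];           // adjacent lanes hit adjacent addresses:
                              // the loads coalesce into few transactions
}

__global__ void scattered(const float* in, float* out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];  // lanes spread across cachelines: the same
                              // load instruction costs many transactions
}
```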
The TMU is pipelined, too, so the texture unit will process texel and ray-trace queries interleaved if necessary.

There's a patent document that talks about intra-CU producer-consumer scheduling of wavefronts. I would hope that this is applied to ray-traversal and ray-query-result shaders...

Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).
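As a sketch of that pass structure in CUDA terms (names and buffers are illustrative, not an actual engine): the ray-tracing pass fills a buffer playing the role of the UAV, and the shading pass consumes it afterwards, so the two only overlap at the boundary between dispatches.

```cuda
// Pass 1: "ray tracing" writes per-pixel results (e.g. a shadow mask).
__global__ void rt_shadow_pass(float* shadow_mask, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) shadow_mask[i] = 1.0f;  // placeholder for traced visibility
}

// Pass 2: "pixel shading" consumes the buffer produced above.
__global__ void shading_pass(const float* shadow_mask, float* color, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) color[i] = 0.8f * shadow_mask[i];  // placeholder shading
}

// Host side: back-to-back launches on one stream; the spin-down of the
// first pass can overlap the spin-up of the second, and nothing more.
// rt_shadow_pass<<<blocks, threads>>>(shadow_mask, n);
// shading_pass<<<blocks, threads>>>(shadow_mask, color, n);
```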

FWIW, for the raytracing hardware I think you're still limited by the L0 cache only being able to deliver 1 cacheline per CU per cycle (unless they double that, of course, but that sounds fairly costly), so raytracing and texture sampling compete there. Furthermore, due to bad cache access patterns, the raytracing might be thrashing the cache.
There's a swarm of patent documents on the subject of cache friendly CU scheduling...

Sampling does compete with intersection for the write-back data path, but if BVH traversal alone can keep the memory hierarchy busy (pointer chasing, being bandwidth bound), does this specific limitation matter?

Also, I would say traversal thrashing the cache is nothing new for GPUs; use of higher-resolution textures can thrash L0 caches hard too, for example.
Yes, another subject of AMD's patent documents.

In general CU scheduling and cache-friendly operations (such as L0s being able to snoop each other passively) appear to be part of RDNA. How much of that is new for RDNA 2, I can't tell.
 
Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).

Compute shaders read textures too and they can certainly overlap with RT.
 
Compute shaders read textures too and they can certainly overlap with RT.
Compute shaders don't have the data per work item that pixel shaders have. So there's no way to map a texel to a triangle (there's no triangle) and there's no way to mip-map filter (because there's no triangle).

Sure, a compute shader reads from memory, but that doesn't use the texture-processing/ray-intersection pipelines. I would expect ray-hit/miss (etc.) compute shaders to run on the same CU or same WGP as ray-intersection shaders. That's the producer-consumer model I was talking about earlier.
 
It can't be smaller, I think, if there are 4 SEs: they can only disable the same number of WGPs per SE.

Actually in the Linux driver they recently added some code to deal with disabled SEs. Not sure how complete it is and won't guarantee there is a SKU that really does this but ...

Though I still believe that you will not see ray tracing and pixel shading concurrently running on the GPU. These will be separate passes, so only an overlap phase will be seen as one spins down and the other spins up.

So the fact that texturing hardware is "dual-function" is immaterial from the point of view of pixel shading. Pixel shading will consume one or more buffers in VRAM (UAVs) that were produced by ray tracing passes (shadowing, global illumination, reflections, caustics, etc.).

Doesn't DXR 1.1 allow raytracing in all shader stages though?

Compute shaders don't have the data per work item that pixel shaders have. So there's no way to map a texel to a triangle (there's no triangle) and there's no way to mip-map filter (because there's no triangle).

Sure, a compute shader reads from memory, but that doesn't use the texture-processing/ray-intersection pipelines. I would expect ray-hit/miss (etc.) compute shaders to run on the same CU or same WGP as ray-intersection shaders. That's the producer-consumer model I was talking about earlier.

A compute shader can definitely use the texture-processing pipelines: by always selecting LOD 0, explicitly passing a LOD, using explicit derivatives, or even just using implicit derivatives (all that needs is for shader invocations to be arranged in a quad pattern; no geometry required).

Furthermore, even without filtering, images always need format conversion, which uses the texture-processing part of the texture unit (the tell-tale sign is that loads with a format only have a throughput of 4 texels/cycle, versus up to 32 items/cycle for plain buffer loads on RDNA 1).
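The same thing is easy to demonstrate outside graphics APIs, too: CUDA device code has no geometry stage at all and can still drive the filtering pipeline by supplying the LOD or the gradients explicitly. tex2DLod/tex2DGrad are real CUDA texture fetches; the kernel itself is just an illustration:

```cuda
__global__ void sample_without_geometry(cudaTextureObject_t tex,
                                        float4* out, float u, float v)
{
    // Explicit LOD: no triangle, no quad derivatives needed.
    out[0] = tex2DLod<float4>(tex, u, v, 0.0f);

    // Explicit gradients: the texture unit derives the LOD from them.
    out[1] = tex2DGrad<float4>(tex, u, v,
                               make_float2(0.01f, 0.0f),
                               make_float2(0.0f, 0.01f));
}
```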

btw I'm interested in the talk about the intra-CU scheduling. Was that for tessellation or raytracing? Do you have some links to read up on it?
 
Actually in the Linux driver they recently added some code to deal with disabled SEs. Not sure how complete it is and won't guarantee there is a SKU that really does this but ...
Disabling a whole SE is not the same as disabling a different number of WGPs in each enabled SE, though.
 
Doesn't DXR 1.1 allow raytracing in all shader stages though?
Interesting idea!

A compute shader can definitely use the texture-processing pipelines: by always selecting LOD 0, explicitly passing a LOD, using explicit derivatives, or even just using implicit derivatives (all that needs is for shader invocations to be arranged in a quad pattern; no geometry required).

Furthermore, even without filtering, images always need format conversion, which uses the texture-processing part of the texture unit (the tell-tale sign is that loads with a format only have a throughput of 4 texels/cycle, versus up to 32 items/cycle for plain buffer loads on RDNA 1).
Excellent stuff. I'm clearly out of the loop on recent shader models!

btw I'm interested in the talk about the intra-CU scheduling. Was that for tessellation or raytracing? Do you have some links to read up on it?
COOPERATIVE WORKGROUP SCHEDULING AND CONTEXT PREFETCHING

A first workgroup is preempted in response to threads in the first workgroup executing a first wait instruction including a first value of a signal and a first hint indicating a type of modification for the signal. The first workgroup is scheduled for execution on a processor core based on a first context after preemption in response to the signal having the first value. A second workgroup is scheduled for execution on the processor core based on a second context in response to preempting the first workgroup and in response to the signal having a second value. A third context is prefetched into registers of the processor core based on the first hint and the second value. The first context is stored in a first portion of the registers and the second context is prefetched into a second portion of the registers prior to preempting the first workgroup.
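A rough software analogue of the wait/signal part, just to pin down the semantics (CUDA, purely illustrative; the patent's interesting bit, prefetching the next workgroup's context based on the hint, has no software equivalent, so this only models the ordering):

```cuda
__device__ unsigned g_signal = 0;  // the "signal" the patent's wait refers to

// Consumer: wait(signal >= 1). In the patent the workgroup would be
// preempted here and its successor's context prefetched; a software
// sketch can only spin.
__global__ void consumer(const float* buf, float* out)
{
    if (threadIdx.x == 0)
        while (atomicAdd(&g_signal, 0u) < 1u) { /* spin */ }
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x] * 2.0f;
}

// Producer: publish data, then modify the signal (the "type of
// modification" is what the patent's hint describes).
__global__ void producer(float* buf)
{
    buf[threadIdx.x] = (float)threadIdx.x;
    __threadfence();                              // make writes visible
    if (threadIdx.x == 0) atomicExch(&g_signal, 1u);
}
```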

I went through a year's worth of patent documents yesterday and decided there are too many interesting ones to bother linking/summarising. It seems, while I've been "lazy" these last few years, very little heed has been paid to patent stuff.
 