AMD: Navi Speculation, Rumours and Discussion [2019-2020]

milk · Aug 2, 2020

trinibwoy said:
Maybe for glasses of water, but my brain certainly doesn't discard all caustics. Take a swim in the Caribbean. Guarantee you will notice and appreciate them!

But can any brain discern between accurate caustics for that particular ocean wave pattern, sun position, depth, etc., and a cheap approximation like in GTAV, for example?

trinibwoy · Aug 2, 2020

milk said:
But can any brain discern between accurate caustics for that particular ocean wave pattern, sun position, depth, etc., and a cheap approximation like in GTAV, for example?

I haven’t played GTA V but any reasonable approximation should be passable.

eastmen · Aug 2, 2020

Radolov said:
An AMD engineer said last year that 8k gaming "is not as far away as we might think" and that he "thinks that with the next-generation of cards, multiple cards, you'll be able to do 8k".

I mean, you could do 8K with any card really. It just depends on the frame rates you're okay with and the quality per pixel.

Pressure · Aug 3, 2020

eastmen said:
I mean, you could do 8K with any card really. It just depends on the frame rates you're okay with and the quality per pixel.

*If it has the required DP 1.4 or HDMI 2.1 connector.
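For scale: 8K UHD (7680x4320) is 4x the pixels of 4K and 16x those of 1080p, which is why both the per-pixel cost and the display link matter. Below is a small sketch of the pixel counts and the raw active-pixel data rate. It deliberately ignores blanking intervals, link encoding overhead and Display Stream Compression, so real link requirements are higher than the numbers it prints; the helper names are made up for illustration.

```python
# Hypothetical helpers: pixel counts and uncompressed active-pixel data rates.
# Blanking, link encoding overhead and DSC are ignored on purpose.

RESOLUTIONS = {
    "1080p": (1920, 1080),
    "4K UHD": (3840, 2160),
    "8K UHD": (7680, 4320),
}


def pixel_count(width: int, height: int) -> int:
    return width * height


def active_data_rate_gbps(width: int, height: int, refresh_hz: float,
                          bits_per_pixel: int = 24) -> float:
    """Uncompressed active-pixel data rate in Gbit/s (no blanking, no encoding overhead)."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


if __name__ == "__main__":
    base = pixel_count(*RESOLUTIONS["1080p"])
    for name, (w, h) in RESOLUTIONS.items():
        px = pixel_count(w, h)
        print(f"{name}: {px:,} px ({px / base:.0f}x 1080p), "
              f"{active_data_rate_gbps(w, h, 60):.1f} Gbit/s at 60 Hz, 24 bpp")
```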

entity279 · Aug 3, 2020

Rootax said:
So, my brain needs more teraflops? I believe that too.

No, I think the brain discards as much information as possible so that one can make important decisions (e.g. dodging stuff thrown at you) as fast as possible.

Kaotik · Aug 3, 2020

Big load of AMD patents, some newer, some slightly older

https://twitter.com/x/status/1290380141693198345

Includes:

Bandwidth saving architecture for scalable video coding - AMD
Real time on-chip texture decompression using shader processors
Matrix Multiplier With Submatrix Sequencing
Shared loads at compute units of a processor
Automatic configuration of knobs to optimize performance of a graphics pipeline
Pixel Wait Synchronization
Hint-based fine-grained dynamic voltage and frequency scaling in GPUs
Pipelined matrix multiplication at a graphics processing unit
Optimizing Primitive Shaders
Water tight ray triangle intersection without resorting to double precision
Graphics texture footprint discovery
Use of Workgroups in Pixel Shader
Efficient data path for ray triangle intersection
Robust Ray-triangle Intersection
Variable rate rendering based on motion estimation
Apparatus and method for providing workload distribution of threads among multiple compute units
Mechanism for supporting discard functionality in a ray tracing context
Merged data path for triangle and box intersection test in Ray Tracing
Variable Rate Shading
Raster Order View
Integration of variable rate shading and super-sample shading
Centroid selection for variable rate shading

pTmdfx · Aug 3, 2020

Hmm, a ROV patent finally. It looks like a software implementation though, and to me it does not resemble the hardware(?) solution in Vega and Navi 10.

Lurkmass · Aug 5, 2020

pTmdfx said:
Hmm, a ROV patent finally. It looks like a software implementation though, and to me it does not resemble the hardware(?) solution in Vega and Navi 10.

I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate-mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to store the entire framebuffer/render target. The other option is designing a tile-based GPU, which automatically comes with a small amount of tile memory, but I don't think the architects will find that an acceptable solution either, since it would mean executing duplicated vertex shader invocations or potentially starving the GPU's shader execution units of work. Tile-based GPUs died out a decade ago in the desktop space for very good reasons ...
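To make the "duplicated work" point concrete, here is a toy binning pass: each triangle is assigned to every screen-space tile its bounding box touches, so a triangle landing in N tiles is processed N times at the per-tile stage (re-running its vertex work in a design without a stored parameter buffer, or at least re-reading its binned data). The 32-pixel tile size and the bounding-box binning rule are assumptions for illustration, not how any particular GPU bins.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vertex = Tuple[float, float]
Triangle = Tuple[Vertex, Vertex, Vertex]
Tile = Tuple[int, int]


def bin_triangles(tris: List[Triangle], tile: int,
                  width: int, height: int) -> Dict[Tile, List[int]]:
    """Assign each triangle index to every tile its screen-space bounding box touches."""
    bins: Dict[Tile, List[int]] = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for i, tri in enumerate(tris):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Clamp the bounding box to the render target, then convert to tile coordinates.
        x0 = max(0, int(min(xs)) // tile)
        x1 = min(max_tx, int(max(xs)) // tile)
        y0 = max(0, int(min(ys)) // tile)
        y1 = min(max_ty, int(max(ys)) // tile)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(i)
    return bins


def duplication_factor(bins: Dict[Tile, List[int]], num_tris: int) -> float:
    """Average number of tiles each triangle is processed in (1.0 means no duplication)."""
    return sum(len(v) for v in bins.values()) / num_tris
```

With 32-pixel tiles, a small triangle inside one tile is binned once, while a 100-pixel right triangle's bounding box covers a 4x4 block of tiles, so the pair averages 8.5 tile visits per triangle. Real binners use tighter tests than a bounding box, but the growth with triangle size is the same idea.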

Just to give you an idea, two 1080p render targets, one for colour+alpha (32 bits/4 bytes) and one for depth (32 bits/4 bytes), would total 16.588MB worth of memory, which is roughly 4x bigger than Navi 10's 4MB L2 cache. That's not even counting the stencil bits, the MSAA case, higher resolutions, or needing multiple render targets/more bits per pixel. You'd have to spend enormous amounts of die space on a robust solution for ROVs, die space which could be better used elsewhere ...
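The arithmetic behind that figure is width x height x samples x bytes per pixel, summed over the render targets. A minimal sketch, assuming uncompressed storage (no colour/depth compression) and taking Navi 10's L2 as 4 MiB; the helper name is made up for illustration:

```python
from typing import Sequence

NAVI10_L2_BYTES = 4 * 1024 * 1024  # Navi 10's 4 MB L2 cache


def render_target_bytes(width: int, height: int,
                        bytes_per_pixel: Sequence[int], samples: int = 1) -> int:
    """Bytes needed to hold every listed render target uncompressed.

    `bytes_per_pixel` has one entry per render target; `samples` scales for MSAA.
    """
    return width * height * samples * sum(bytes_per_pixel)


if __name__ == "__main__":
    rgba8_plus_d32 = render_target_bytes(1920, 1080, [4, 4])
    print(rgba8_plus_d32)                      # 16588800 bytes (~16.59 MB)
    print(rgba8_plus_d32 / NAVI10_L2_BYTES)    # 3.955078125, i.e. roughly 4x the L2
    msaa4x = render_target_bytes(1920, 1080, [4, 4], samples=4)
    print(msaa4x / NAVI10_L2_BYTES)            # 15.8203125 with 4x MSAA
```

Measured against a binary 4 MiB the ratio comes out just under 4x, and 4x MSAA alone pushes it to nearly 16x.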

nAo · Aug 5, 2020

Lurkmass said:
I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate-mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to store the entire framebuffer/render target.

That’s absolutely not the case. I can’t get into details but on some IMRs it can be done efficiently (barring pathological cases) with little to no extra HW.

Lurkmass · Aug 5, 2020

nAo said:
That’s absolutely not the case. I can’t get into details but on some IMRs it can be done efficiently (barring pathological cases) with little to no extra HW.

Would that be Intel HW, since it's never used outside of their demos?

pTmdfx · Aug 5, 2020

Lurkmass said:
I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate-mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to store the entire framebuffer/render target. The other option is designing a tile-based GPU, which automatically comes with a small amount of tile memory, but I don't think the architects will find that an acceptable solution either, since it would mean executing duplicated vertex shader invocations or potentially starving the GPU's shader execution units of work. Tile-based GPUs died out a decade ago in the desktop space for very good reasons ...

Just to give you an idea, two 1080p render targets, one for colour+alpha (32 bits/4 bytes) and one for depth (32 bits/4 bytes), would total 16.588MB worth of memory, which is roughly 4x bigger than Navi 10's 4MB L2 cache. That's not even counting the stencil bits, the MSAA case, higher resolutions, or needing multiple render targets/more bits per pixel. You'd have to spend enormous amounts of die space on a robust solution for ROVs, die space which could be better used elsewhere ...

IIRC latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROV guarantees only the serialization at the same screen space pixel in API submission order, shouldn't the cost be capped by the rasterizer screen space tile size and the max prim concurrency of the executor (e.g. max 10 CUs in a shader array)? :???:
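The ordering rule in that post, that ROV only has to serialize fragments hitting the same pixel and only in API submission order, can be modelled in a few lines. This is a toy software model, not how any hardware implements it, and it assumes one fragment per primitive per pixel plus a per-pixel primitive order known up front from rasterization. Fragments may finish shading in any order; each one is held until all earlier primitives at its pixel have committed. `peak_pending` is the state the ordering forces you to buffer, which is what a tile-size/concurrency bound would cap.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Pixel = Tuple[int, int]
Fragment = Tuple[int, Pixel, float]  # (primitive id in API order, pixel, shaded value)


def rov_resolve(fragments: List[Fragment],
                blend: Callable[[Optional[float], float], float]
                ) -> Tuple[Dict[Pixel, float], int]:
    """Commit fragments per pixel in primitive (API submission) order.

    `fragments` is the order in which shading finishes, which may be arbitrary.
    Returns the final framebuffer and the peak number of fragments buffered at once
    (counting a fragment from arrival until it commits).
    """
    # Per pixel, the primitive order the ordering rule must respect.
    expected: Dict[Pixel, List[int]] = defaultdict(list)
    for prim, pixel, _ in sorted(fragments, key=lambda f: f[0]):
        expected[pixel].append(prim)

    next_idx: Dict[Pixel, int] = defaultdict(int)
    pending: Dict[Tuple[Pixel, int], float] = {}
    framebuffer: Dict[Pixel, float] = {}
    peak_pending = 0

    for prim, pixel, value in fragments:
        pending[(pixel, prim)] = value
        peak_pending = max(peak_pending, len(pending))
        order = expected[pixel]
        # Drain this pixel's queue while the next fragment in API order has arrived.
        while next_idx[pixel] < len(order) and (pixel, order[next_idx[pixel]]) in pending:
            p = order[next_idx[pixel]]
            framebuffer[pixel] = blend(framebuffer.get(pixel), pending.pop((pixel, p)))
            next_idx[pixel] += 1
    return framebuffer, peak_pending


def average_blend(dst: Optional[float], src: float) -> float:
    """Order-dependent toy blend: the result changes if fragments are reordered."""
    return src if dst is None else 0.5 * dst + 0.5 * src
```

Feeding the same fragments in a shuffled completion order gives the same framebuffer as in-order completion; only `peak_pending` grows, which is the cost being argued about.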

Lurkmass · Aug 6, 2020

pTmdfx said:
IIRC latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROV guarantees only the serialization at the same screen space pixel in API submission order, shouldn't the cost be capped by the rasterizer screen space tile size and the max prim concurrency of the executor (e.g. max 10 CUs in a shader array)?

Are you sure you aren't describing a tile-based GPU? I'm pretty sure mobile GPUs shade screen-space tiles and desktop GPUs don't do that at all. All of the best solutions for ROVs/programmable blending involve storing framebuffer state in the hardware. Mobile GPUs have extremely low-latency tile memory where they access/store a small portion of the framebuffer, which makes it trivial to implement ROVs on their HW. No modern desktop GPU does tile shading or has tile memory. A comparable solution for non-tiling architectures would be to have built-in memory storing all of the framebuffer rather than just a small portion of it, but this has a huge implementation cost in HW ...

I'm not even sure Nvidia is all that happy about ROVs from a performance perspective either. Hence we should follow AMD's recommendation and give up on ROVs altogether, because there seems to be little chance of making an acceptable implementation on discrete GPUs. Ultimately, the problem behind ROV performance is how well the HW is able to track the framebuffer state. AMD HW tracks little to no such state, so there's a huge performance cost for enabling ROVs regardless. Mobile GPUs can track some of this state for a given tile with lower memory access latency, but as a consequence this model is not compatible with immediate mode rendering. Then there's my proposal at the other extreme end of the spectrum, where we give generous amounts of on-chip memory to be able to store multiple entire framebuffers' worth of state, so this can potentially work with IMRs. That model comes with its own set of restrictions, like adhering to the fixed budget of finite on-chip memory, which will prove tricky when dealing with corner cases like MSAA, and switching framebuffers will also have a significant performance impact. Even this hypothetical, restrictive IMR model still shares a couple of limitations with tilers ...

I'm not sure where Intel HW or Nvidia HW falls in all of this, but I heard from an Nvidia engineer that if you need more memory than a vec4 packing (128 bits/pixel), performance is expected to fall off a cliff while using ROVs ...
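To put the 128 bits/pixel figure in context, here is a small budget check for a hypothetical order-independent transparency k-buffer (a common ROV use case) where each entry stores RGBA8 colour plus 32-bit depth. The layout, the helper names, and treating 128 bits as a hard threshold are all assumptions for illustration; the threshold is the hearsay figure from the post, not a documented limit.

```python
from typing import Dict

# Per-pixel payload size beyond which, per the Nvidia remark in the post,
# ROV performance is expected to drop off. Hearsay, not a documented limit.
ROV_FAST_PATH_BITS = 128


def payload_bits(fields: Dict[str, int]) -> int:
    """Total bits of per-pixel state a shader reads/writes inside the ordered section."""
    return sum(fields.values())


def fits_fast_path(fields: Dict[str, int], budget_bits: int = ROV_FAST_PATH_BITS) -> bool:
    return payload_bits(fields) <= budget_bits


def k_buffer_layout(k: int) -> Dict[str, int]:
    """Hypothetical OIT k-buffer: k entries of RGBA8 colour (32 bits) + 32-bit depth."""
    layout: Dict[str, int] = {}
    for i in range(k):
        layout[f"color{i}"] = 32
        layout[f"depth{i}"] = 32
    return layout


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        layout = k_buffer_layout(k)
        print(k, payload_bits(layout), fits_fast_path(layout))
```

Under these assumptions only two k-buffer entries fit in a vec4's worth of state; a third entry already exceeds it.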

trinibwoy · Aug 6, 2020

pTmdfx said:
IIRC latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROV guarantees only the serialization at the same screen space pixel in API submission order, shouldn't the cost be capped by the rasterizer screen space tile size and the max prim concurrency of the executor (e.g. max 10 CUs in a shader array)?

The tiling in modern desktop GPUs is just to facilitate work distribution for a single draw call. You would still need to allocate memory to hold state for all screen-space tiles concurrently across draw calls; otherwise you're fetching lots of off-chip data.

Love_In_Rio · Aug 6, 2020

pTmdfx said:
Hmm, a ROV patent finally. It looks like a software implementation though, and to me it does not resemble the hardware(?) solution in Vega and Navi 10.

Mmm, the rumored Geometry Engine from PS5 debuting in RDNA3?

AlNom · Aug 6, 2020

Bondrewd said:
No.

Well, I wonder how bandwidth-limited it'll be if the bus is only +50% over Navi 10 (384 vs 256-bit). It seems a bit odd. If they've strapped 16Gbps memory to the bus, that would be about +71% bandwidth, and that means even more power consumption on top of a mostly double-sized Navi 10.
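The +71% comes from peak bandwidth = bus width x per-pin data rate / 8, compared against Navi 10's 256-bit bus with 14Gbps GDDR6. A minimal sketch; the 384-bit/16Gbps configuration is the rumour being discussed, not a confirmed spec:

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


navi10 = bandwidth_gb_s(256, 14)        # 448.0 GB/s, the 5700 series figure
wider_bus_only = bandwidth_gb_s(384, 14)  # 672.0 GB/s, the +50% from bus width alone
rumoured = bandwidth_gb_s(384, 16)      # 768.0 GB/s for the rumoured 384-bit, 16Gbps setup
print(rumoured / navi10 - 1)            # ~0.714, i.e. about +71%
```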

Then again, the 5700 series ranges anywhere from 6.7TF (180W TDP) to 10.1TF (235W TDP) with 448GB/s. A modest base clock in the 1600s would still give something in the 16TF area as a starting point.

I'm more skeptical that they could sustain higher frequencies (for 20TF, double Navi 10) while keeping things under 400W (perhaps something more reasonable, like an 18TF boost?).
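Those TF figures follow from peak FP32 = CUs x 64 lanes x 2 FLOPs per FMA x clock. A sketch of the arithmetic, where the 80-CU part is the "double Navi 10" assumption from the post, not a confirmed spec. The 6.7TF figure lines up with the RX 5700's 1465MHz base clock and the 10.1TF figure with the 5700 XT 50th Anniversary's 1980MHz boost.

```python
def fp32_tflops(compute_units: int, clock_mhz: float, lanes_per_cu: int = 64) -> float:
    """Peak FP32 TFLOPS: CUs x FP32 lanes per CU x 2 FLOPs per FMA x clock."""
    return compute_units * lanes_per_cu * 2 * clock_mhz * 1e6 / 1e12


def clock_for_tflops(target_tflops: float, compute_units: int, lanes_per_cu: int = 64) -> float:
    """Clock in MHz needed to reach a target peak FP32 TFLOPS."""
    return target_tflops * 1e12 / (compute_units * lanes_per_cu * 2) / 1e6


HYPOTHETICAL_CUS = 80  # "double Navi 10" assumption, not a confirmed spec

print(fp32_tflops(36, 1465))                   # ~6.75, RX 5700 at base clock
print(fp32_tflops(40, 1980))                   # ~10.14, 5700 XT 50th Anniversary at boost
print(fp32_tflops(HYPOTHETICAL_CUS, 1600))     # ~16.4 at a 1600MHz clock
print(clock_for_tflops(20, HYPOTHETICAL_CUS))  # ~1953MHz needed for 20TF
print(clock_for_tflops(18, HYPOTHETICAL_CUS))  # ~1758MHz needed for 18TF
```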

Bondrewd · Aug 6, 2020

TheAlSpark said:
If they've strapped 16Gbps to the bus that would be about +71% bandwidth

They did.

TheAlSpark said:
and that's even more power consumption with a mostly double sized Navi 10.

Uh, it has nothing to do with N10.

TheAlSpark said:
A modest base clock in the 1600s

Lul.

TheAlSpark said:
I'm more skeptical that they could sustain higher frequencies (for 20TF, double Navi 10) while keeping things under 400W (perhaps something more reasonable, like an 18TF boost?).

What do you mean 400W?
It's 275W so far.

AlNom · Aug 6, 2020

Bondrewd said:
Uh, it has nothing to do with N10.

I'm not understanding this. A 384-bit bus is 50% wider, and thus adds to the power consumption beyond just doubling the size of Navi 10.

Bondrewd said:
Lul.

??? Discuss. This is getting needlessly silly otherwise.

Bondrewd said:
What do you mean 400W?
It's 275W so far.

I'm talking about TDP. Navi 10 has a range of TDPs, which I mentioned (180-235W). How do you propose doubling the size and power of Navi 10 with just another 20%?
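The objection can be put in numbers. If power scaled linearly with a doubled design at the top 5700-series TDP, it would need about 470W. Fitting that into the claimed 275W therefore implies roughly a 1.71x perf/W improvement. The sketch below assumes that naive linear scaling, ignores the wider bus entirely, and treats 275W as the unconfirmed figure from the thread.

```python
def required_perf_per_watt_gain(base_power_w: float, perf_scale: float,
                                power_budget_w: float) -> float:
    """Perf/W improvement needed to reach `perf_scale` x base performance within a budget.

    Assumes power would otherwise scale linearly with performance (naive scaling).
    """
    naive_power_w = base_power_w * perf_scale
    return naive_power_w / power_budget_w


NAVI10_TOP_TDP_W = 235   # top of the 5700-series range quoted in the thread
CLAIMED_BUDGET_W = 275   # unconfirmed figure from the thread

print(NAVI10_TOP_TDP_W * 2)                      # 470 W for a naively doubled Navi 10
print(CLAIMED_BUDGET_W / NAVI10_TOP_TDP_W - 1)   # ~0.17, the "another 20%"
print(required_perf_per_watt_gain(NAVI10_TOP_TDP_W, 2, CLAIMED_BUDGET_W))  # ~1.71x
```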

Kaotik · Aug 6, 2020

Bondrewd said:
They did.

Uh, it has nothing to do with N10.

Lul.

What do you mean 400W?
It's 275W so far.

I'm still waiting for you to post even a speculation tweet or something to back your words. So far I've seen you post nothing to back them up, despite always wording your posts like they're facts.

SimBy · Aug 6, 2020

I mean, if AMD put effort and silicon into a 'multiple GHz' clock speed architecture (they specifically mention this), why on earth would they run it at a modest 1600MHz? That's just silly. Expecting anything less than what PS5 clocks at doesn't make much sense.

Bondrewd · Aug 6, 2020

TheAlSpark said:
384-bit bus is 50% larger, and thus adds to the power consumption beyond just doubling the size of Navi 10.

Yeah, kinda.

TheAlSpark said:
This is getting needlessly silly otherwise.

There's not much left to discuss; the product's out soon(tm), with N22 after.

TheAlSpark said:
How do you propose doubling the size and power of Navi 10 with just another 20%?

That thing called engineering.

SimBy said:
I mean, if AMD put effort and silicon into a 'multiple GHz' clock speed architecture (they specifically mention this), why on earth would they run it at a modest 1600MHz? That's just silly. Expecting anything less than what PS5 clocks at doesn't make much sense.

This one is smart.

I know you guys have your reasons to be wary, but soon(tm).
