AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Maybe for glasses of water, but my brain certainly doesn't discard all caustics. Take a swim in the Caribbean. I guarantee you will notice and appreciate them!

But can any brain discern accurate caustics for that particular ocean wave pattern, sun position, depth, etc. from a cheap approximation like in GTA V, for example?
 
But can any brain discern accurate caustics for that particular ocean wave pattern, sun position, depth, etc. from a cheap approximation like in GTA V, for example?

I haven’t played GTA V but any reasonable approximation should be passable.
 
Big load of AMD patents, some newer, some slightly older

Includes:
  • Bandwidth saving architecture for scalable video coding - AMD
  • Real time on-chip texture decompression using shader processors
  • Matrix Multiplier With Submatrix Sequencing
  • Shared loads at compute units of a processor
  • Automatic configuration of knobs to optimize performance of a graphics pipeline
  • Pixel Wait Synchronization
  • Hint-based fine-grained dynamic voltage and frequency scaling in GPUs
  • Pipelined matrix multiplication at a graphics processing unit
  • Optimizing Primitive Shaders
  • Water tight ray triangle intersection without resorting to double precision
  • Graphics texture footprint discovery
  • Use of Workgroups in Pixel Shader
  • Efficient data path for ray triangle intersection
  • Robust Ray-triangle Intersection
  • Variable rate rendering based on motion estimation
  • Apparatus and method for providing workload distribution of threads among multiple compute units
  • Mechanism for supporting discard functionality in a ray tracing context
  • Merged data path for triangle and box intersection test in Ray Tracing
  • Variable Rate Shading
  • Raster Order View
  • Integration of variable rate shading and super-sample shading
  • Centroid selection for variable rate shading
 
Hmm, a ROV patent finally. It looks like a software implementation though, and it doesn't look to me like the hardware(?) solution in Vega and Navi 10.
 
Hmm, a ROV patent finally. It looks like a software implementation though, and it doesn't look to me like the hardware(?) solution in Vega and Navi 10.

I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to hold the whole framebuffer/render target. The other option is designing a tile-based GPU, which automatically comes with a small amount of tile memory, but I don't think the architects will find that to be an acceptable solution either, since it would mean executing duplicated vertex shader invocations or potentially starving the GPU's shader execution units of work. Tile-based GPUs died out in the desktop space a decade ago for very good reasons ...

Just to give you an idea, two 1080p render targets consisting of colour+alpha (32 bits/4 bytes per pixel) and depth (32 bits/4 bytes per pixel) total out to 16.588 MB of memory, which is over 4x the size of Navi 10's L2 cache. That's not even counting stencil bits, the MSAA case, higher resolutions, or needing multiple render targets/more bits per pixel either. You'd have to spend enormous amounts of die space to make a robust solution for ROVs which could be better spent elsewhere ...
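
For what it's worth, here's the arithmetic behind that figure as a quick sketch (the resolution and per-pixel formats are the ones assumed in the post above; Navi 10's L2 is 4 MB):

```python
# Back-of-the-envelope footprint for the two render targets described above:
# a 1080p colour+alpha target (RGBA8, 4 bytes/pixel) plus a 32-bit depth target.
width, height = 1920, 1080
bytes_colour = 4          # RGBA8
bytes_depth = 4           # 32-bit depth, stencil ignored

total_bytes = width * height * (bytes_colour + bytes_depth)
print(total_bytes, f"bytes = {total_bytes / 1e6:.3f} MB")    # 16588800 bytes = 16.589 MB

navi10_l2_bytes = 4 * 2**20                                  # Navi 10 has a 4 MB L2
print(f"{total_bytes / navi10_l2_bytes:.2f}x Navi 10's L2")  # roughly 4x
```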
 
I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to hold the whole framebuffer/render target.

That’s absolutely not the case. I can’t get into details but on some IMRs it can be done efficiently (barring pathological cases) with little to no extra HW.
 
That’s absolutely not the case. I can’t get into details but on some IMRs it can be done efficiently (barring pathological cases) with little to no extra HW.

Would that be Intel HW since it's never used outside of their demos?
 
I think we should give up on the idea of ROVs altogether and that's AMD's opinion on the matter as well ...

To get even a remotely acceptable level of performance on an immediate mode GPU architecture, it would involve storing/tracking the entire framebuffer/render target state in hardware, which would mean implementing a lot of dedicated on-chip memory to hold the whole framebuffer/render target. The other option is designing a tile-based GPU, which automatically comes with a small amount of tile memory, but I don't think the architects will find that to be an acceptable solution either, since it would mean executing duplicated vertex shader invocations or potentially starving the GPU's shader execution units of work. Tile-based GPUs died out in the desktop space a decade ago for very good reasons ...

Just to give you an idea, two 1080p render targets consisting of colour+alpha (32 bits/4 bytes per pixel) and depth (32 bits/4 bytes per pixel) total out to 16.588 MB of memory, which is over 4x the size of Navi 10's L2 cache. That's not even counting stencil bits, the MSAA case, higher resolutions, or needing multiple render targets/more bits per pixel either. You'd have to spend enormous amounts of die space to make a robust solution for ROVs which could be better spent elsewhere ...
IIRC the latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROVs only guarantee serialization at the same screen-space pixel in API submission order, shouldn't the cost be capped by the rasterizer's screen-space tile size and the max primitive concurrency of the executor (e.g. max 10 CUs in a shader array)? :???:
 
IIRC the latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROVs only guarantee serialization at the same screen-space pixel in API submission order, shouldn't the cost be capped by the rasterizer's screen-space tile size and the max primitive concurrency of the executor (e.g. max 10 CUs in a shader array)? :???:

Are you sure you aren't describing a tile-based GPU? I'm pretty sure mobile GPUs shade screen-space tiles and desktop GPUs don't do that at all. All of the best solutions for ROVs/programmable blending involve storing framebuffer state in the hardware. Mobile GPUs have extremely low latency tile memory where they access/store a small portion of the framebuffer, which makes it trivial to implement ROVs on their HW. No modern desktop GPU does tile shading or has tile memory. A comparable solution for non-tiling architectures would be to have built-in memory storing all of the framebuffer rather than just a small portion of it, but this has a huge implementation cost in HW ...

I'm not even sure Nvidia is all that happy about ROVs from a performance perspective either, hence why we should follow AMD's recommendation and give up on ROVs altogether, because there seems to be little chance of making an acceptable implementation on discrete GPUs. Ultimately, the problem behind ROV performance is how well the HW can track the framebuffer state. AMD HW tracks little to no state in hardware, so there's a huge performance cost for enabling ROVs regardless. Mobile GPUs can track some of this state for a given tile with reduced memory access latency, but as a consequence that model is not compatible with immediate mode rendering. Then there's my proposal at the other extreme end of the spectrum, where we give generous amounts of on-chip memory to store multiple entire framebuffers worth of state. That can potentially work with IMRs, but it comes with its own set of restrictions, like adhering to the fixed budget of a finite amount of on-chip memory, which gets tricky with corner cases like MSAA, and switching framebuffers would also have a significant performance impact. Even this hypothetical restrictive IMR model still shares a couple of limitations with what we see on tilers ...

I'm not sure where Intel HW or Nvidia HW falls in all of this but I heard from an Nvidia engineer that if you need more memory than a vec4 packing (128 bits/pixel), performance is expected to cliff while using ROVs ...
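
Just to pin down what is actually being serialized, here's a toy CPU-side model of the ROV guarantee (my own sketch, nothing to do with how any of the HW above implements it): fragments can be shaded in any order, but the read-modify-write on a given pixel retires in primitive submission order.

```python
# Toy model of the ROV guarantee (illustrative only, not any vendor's HW scheme):
# fragments may be shaded in any order, but the read-modify-write on a given
# pixel must retire in primitive (API submission) order.
import random
from collections import defaultdict

WIDTH, HEIGHT = 4, 4
framebuffer = [[(0.0, 0.0, 0.0, 0.0) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Each fragment: (prim_id, x, y, RGBA colour); prim_id is the submission order.
fragments = [(p, random.randrange(WIDTH), random.randrange(HEIGHT),
              (random.random(), random.random(), random.random(), 0.5))
             for p in range(16)]
random.shuffle(fragments)                  # the GPU shades fragments out of order...

by_pixel = defaultdict(list)
for frag in fragments:
    by_pixel[(frag[1], frag[2])].append(frag)

for (x, y), frags in by_pixel.items():
    # ...but the per-pixel critical section replays in submission order.
    for prim_id, _, _, (r, g, b, a) in sorted(frags):
        dr, dg, db, da = framebuffer[y][x]
        framebuffer[y][x] = (r * a + dr * (1.0 - a),
                             g * a + dg * (1.0 - a),
                             b * a + db * (1.0 - a),
                             a + da * (1.0 - a))      # simple "over" blend
```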
 
IIRC the latest GPU architectures now all have tiled rasterizers, if not TBDR. Since ROVs only guarantee serialization at the same screen-space pixel in API submission order, shouldn't the cost be capped by the rasterizer's screen-space tile size and the max primitive concurrency of the executor (e.g. max 10 CUs in a shader array)? :???:

The tiling in modern desktop GPUs is just to facilitate work distribution for a single draw call. You would still need to allocate memory to hold state for all screen-space tiles concurrently across draw calls; otherwise you're fetching lots of off-chip data.
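
A rough illustration of that point (the 32x32 tile size and per-pixel formats are just assumptions for the example): even with screen-space binning, keeping every tile's colour+depth resident across draw calls adds up to the whole framebuffer again.

```python
# Rough illustration: tiles at 1080p, and the on-chip state you'd need if every
# tile's colour+depth had to stay resident across draw calls.
# Tile size and per-pixel formats here are assumptions for the example.
width, height = 1920, 1080
tile_w, tile_h = 32, 32
bytes_per_pixel = 8                        # RGBA8 colour + 32-bit depth

tiles_x = -(-width // tile_w)              # ceiling division: 60
tiles_y = -(-height // tile_h)             # 34
num_tiles = tiles_x * tiles_y              # 2040 tiles

per_tile = tile_w * tile_h * bytes_per_pixel                            # 8 KiB per tile
print(num_tiles, f"tiles, {num_tiles * per_tile / 1e6:.2f} MB total")   # ~16.71 MB
```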
 
Well, I wonder how bandwidth limited it'll be if it's only +50% over Navi 10 (384-bit vs 256-bit). It seems a bit odd. If they've strapped 16 Gbps memory to that bus, that would be about +71% bandwidth, and that's even more power consumption with a mostly doubled Navi 10.

Then again, the 5700 series ranges anywhere from 6.7 TF (180W TDP) to 10.1 TF (235W TDP) with 448 GB/s. A modest base clock in the 1600s would still give something in the 16 TF area as a starting point.

I'm more skeptical that higher frequencies (for 20 TF, double Navi 10) could be sustained while keeping things under 400W (perhaps something more reasonable like an 18 TF boost?).
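
For reference, the back-of-the-envelope math behind those figures (the 80 CU configuration is just the rumoured "double Navi 10", not a confirmed spec):

```python
# Bandwidth and FP32 throughput math behind the speculation above.
# The 80 CU configuration is the rumoured "double Navi 10", not a confirmed spec.
def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits * gbps_per_pin / 8            # GB/s

print(bandwidth_gbs(256, 14))                     # Navi 10: 448 GB/s
print(bandwidth_gbs(384, 16))                     # 384-bit @ 16 Gbps: 768 GB/s (+71%)

def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000        # 64 FP32 lanes/CU, 2 FLOPs per FMA

print(tflops(80, 1.60))                           # ~16.4 TF at a 1600 MHz base clock
print(tflops(80, 1.95))                           # ~20 TF needs roughly 1.95 GHz
```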
 
If they've strapped 16 Gbps memory to that bus, that would be about +71% bandwidth
They did.
and that's even more power consumption with a mostly doubled Navi 10.
Uh, it has nothing to do with N10.
A modest base clock in the 1600s
Lul.
I'm more skeptical that higher frequencies (for 20 TF, double Navi 10) could be sustained while keeping things under 400W (perhaps something more reasonable like an 18 TF boost?).
What do you mean 400W?
It's 275W so far.
 
Uh, it has nothing to do with N10.
I'm not understanding this. 384-bit bus is 50% larger, and thus adds to the power consumption beyond just doubling the size of Navi 10.

??? Discuss. This is getting needlessly silly otherwise.

What do you mean 400W?
It's 275W so far.

I'm talking about TDP. Navi 10 has a range of TDPs, which I mentioned (180-235W). How do you propose doubling the size and power of Navi 10 with just another 20%?
 
I mean, if AMD put effort and silicon into a 'multiple GHz' clock speed architecture (they specifically mention this), why on earth would they run it at a modest 1600 MHz? That's just silly. Expecting anything less than what the PS5 clocks at doesn't make much sense.
 
384-bit bus is 50% larger, and thus adds to the power consumption beyond just doubling the size of Navi 10.
Yeah, kinda.
This is getting needlessly silly otherwise.
There's not much left to discuss, the product's soon(tm), along with N22 after.
How do you propose doubling the size and power of Navi 10 with just another 20%?
That thing called engineering.
I mean, if AMD put effort and silicon into a 'multiple GHz' clock speed architecture (they specifically mention this), why on earth would they run it at a modest 1600 MHz? That's just silly. Expecting anything less than what the PS5 clocks at doesn't make much sense.
This one is smart.

I know you guys have your reasons to be wary, but soon(tm).
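
For what it's worth, the "engineering" answer usually comes down to dynamic power scaling roughly with capacitance x V^2 x f, so twice the CUs at slightly lower clocks and voltage doesn't cost twice the power. Purely illustrative numbers below; none of them are real Navi figures.

```python
# Illustrative only: dynamic power ~ C * V^2 * f, so doubling CU count while
# trimming clocks and voltage need not double board power.
# Every number here is made up for the example, not a real Navi figure.
def relative_power(unit_scale: float, clock_scale: float, voltage_scale: float) -> float:
    return unit_scale * clock_scale * voltage_scale ** 2

baseline = relative_power(1.0, 1.00, 1.00)        # a Navi 10-class part
big_chip = relative_power(2.0, 0.95, 0.85)        # 2x CUs, -5% clock, -15% voltage
print(big_chip / baseline)                        # ~1.37x power for ~1.9x throughput
```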
 