ID buffer and DR FP16

Performance of a particular shader is often limited by a single bottleneck (or a combination of two). The most common bottlenecks are ALU, texture filtering, memory latency, memory bandwidth, fillrate and the geometry front end. Double-rate FP16 only helps if the shader's main bottleneck is ALU. FP16 registers also help a bit with memory latency, since 16-bit registers use 50% less register file storage than 32-bit registers -> the GPU has better occupancy -> more threads can be kept ready to run -> better latency hiding capability.
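
To make the register-footprint point concrete, here's a minimal sketch in CUDA rather than console shader code (the kernels and constants are made up purely for illustration): the same scale-and-bias math written with FP32 and with packed FP16. The packed version keeps two values per 32-bit register and, on hardware with double-rate FP16, issues two 16-bit FMAs per instruction.

```cuda
// Hypothetical kernels for illustration only (not console shader code).
// Packed FP16 arithmetic needs sm_53+ in CUDA terms.
#include <cuda_fp16.h>

__global__ void scale_bias_fp32(const float* __restrict__ in,
                                float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 0.5f + 0.25f;          // one FP32 FMA per element
}

__global__ void scale_bias_fp16x2(const __half2* __restrict__ in,
                                  __half2* __restrict__ out, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 c     = in[i];                  // two elements per 32-bit register
        __half2 scale = __float2half2_rn(0.5f);
        __half2 bias  = __float2half2_rn(0.25f);
        out[i] = __hfma2(c, scale, bias);       // two FP16 FMAs in one instruction
    }
}
```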

People look too much at the GPU's peak FLOP rate. FP16 doubles this theoretical number, but it's important to realize that FP16 doesn't double the count of TMUs, ROPs or memory bandwidth. When GPU manufacturers scale up a GPU, they scale all of these up together. GPUs with more FLOPs also have more TMUs, more ROPs, more bandwidth and fatter geometry front ends. Marketing departments like to use the FLOP count as a simple number to describe a GPU's performance level, but this creates the illusion that FLOP count is the only thing that matters. If the other parts didn't scale up equally, the performance advantage would be very limited.

Thus doubling the peak FLOP rate with FP16 doesn't suddenly make a GPU equivalent to another GPU with double the FLOP rate, unless all other parts of the GPU are also scaled up.

FP16 is a very useful feature for developers, but mixing it up with FLOP-based marketing just confuses consumers.
 
ROPs are the same in PS4 Pro and X1X, while X1X has 16 more TMUs (4 per CU):

Pro: [attached image pro19uv6.png]

X1X: [attached image x1xa1upm.png]


Of course, higher clock speed helps with both pixel and texture fillrate.
 
ROPs are the same in PS4 Pro and X1X, while X1X has 16 more TMUs (4 per CU): [...]

Higher clock rate also helps the front end.

Where did you get the details from? Don't remember ever hearing a ROP count.
 
Techpowerup GPU database; both X1X and PS4 Pro are based on Polaris and all consumer GPUs released so far (470/480/570/580) have 32 ROPs.
 
ROPs are the same in PS4 Pro and X1X, while X1X has 16 more TMUs (4 per CU): [...]

I think one developer on GAF once implied those specs weren't exactly right for Pro. Also, the number of ROPs has never been confirmed. It would be super nice to have a confirmation on those matters from one of our members...

Even indirect confirmation... :yep2:
 
Performance of a particular shader is often limited by a single bottleneck (or a combination of two). [...] Double-rate FP16 only helps if the shader's main bottleneck is ALU. [...]

You also always need FP16 operations in pairs to get anything out of it. It doesn't help to have a 32-bit and a 16-bit operation at the same time. So it is not possible to come close to the theoretical limit (unless you do 16-bit-only operations ... well, better not).
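
A minimal sketch of that pairing requirement, again using CUDA's half intrinsics purely for illustration (these device helpers are hypothetical; the console GPUs pack FP16 through their own ISA):

```cuda
// Illustration of the "FP16 ops come in pairs" point.
#include <cuda_fp16.h>

__device__ __half2 packed_mul(__half a0, __half a1, __half b0, __half b1)
{
    __half2 a = __halves2half2(a0, a1);   // pack two independent FP16 values
    __half2 b = __halves2half2(b0, b1);
    return __hmul2(a, b);                 // one instruction, two FP16 multiplies
}

__device__ float mixed_mul(__half a, float b)
{
    // One FP16 value mixed with FP32 math gains nothing: the half is
    // converted up and the multiply runs as an ordinary FP32 operation.
    return __half2float(a) * b;
}
```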
 
[...] Don't remember ever hearing a ROP count.
"As you can see, we doubled the amount of shader engines. That has the effect of improvement of boosting our triangle and vertex rate by 2.7x when you include the clock boost as well. We doubled the number of render back-ends, which has the effect of increasing our fill-rate by 2.7x. We quadrupled the GPU L2 cache size, again for targeting the 4K performance."
http://www.eurogamer.net/articles/digitalfoundry-2017-project-scorpio-tech-revealed

The funny thing about the Xbox One X is the combination of 32 ROPs, a 384-bit interface and 2 MB of L2$.
You can't distribute 2 MB of L2$ evenly across a 384-bit bus (12 channels: 2048 KiB / 12 ≈ 170.7 KiB per channel, which isn't a clean slice size).
 
Yeah that makes sense.

853 MHz (X1 core clock) x 16 (ROPs) = 13.65 GPixel/s
13.65 x 2.7 = 36.85 GPixel/s
1172 MHz (X1X core clock) x 32 (ROPs) = 37.50 GPixel/s

So it's actually a bit over 2.7 (more like 2.74-2.75).
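
For reference, the same arithmetic as a tiny host-only snippet (clock and ROP figures exactly as quoted above):

```cuda
// Host-only sanity check of the fill-rate numbers above.
#include <cstdio>

int main()
{
    const double x1  = 853e6  * 16 / 1e9;   // Xbox One:   ~13.6 GPixel/s
    const double x1x = 1172e6 * 32 / 1e9;   // Xbox One X: ~37.5 GPixel/s
    printf("X1   : %.2f GPixel/s\n", x1);
    printf("X1X  : %.2f GPixel/s\n", x1x);
    printf("ratio: %.2fx\n", x1x / x1);     // ~2.75x, a bit over the quoted 2.7x
    return 0;
}
```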
 
As people might guess, I really like id-buffers. But I am not going to participate in the discussion of how much better a hardware implementation is versus an optimized software implementation (with MSAA/EQAA/DCC tricks). I don't believe there are enough public documents about these hardware features available to discuss them in public forums. And the devil is obviously in the details (as it tends to be when discussing high performance graphics tricks).

But I recommend reading my thread about id-buffering and other future rendering techniques (pure tech instead of console wars):
https://forum.beyond3d.com/threads/modern-textureless-deferred-rendering-techniques.57611/
 
But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. The Frostbite games run at 1800p checkerboard rendering on PS4 Pro (BF1 and Mass Effect: Andromeda) and Anthem is 2160p checkerboard rendering on Xbox One X...
 
But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. [...]
Combined with dynamic scaling, according to DF. And um, I just checked the 4K images available of this game... But I think I'll wait for proper PNGs from the final version of the game before giving any opinion... :yep2:

http://www.eurogamer.net/articles/digitalfoundry-2017-anthem-the-real-deal-for-xbox-one
 
Combined with dynamic scaling, according to DF. [...]
It's nice that it's also part of their engine, so I suspect that BF1 and ME:A will also be dynamic as well as checkerboarded.

In fact I'm pretty sure that's the case.
 
But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. [...]

Indeed, 2160C on X should mean 1800C on Pro (1800p -> 2160p is a 1.44x increase in pixel count, roughly in line with the gap in GPU compute). Aside from resolution, I expect better graphics settings on X.
 
Tahiti was the last time AMD used a crossbar IIRC, but the bandwidth gains were still apparent.
Tahiti (and presumably Tonga) had an additional crossbar that allowed 32 ROPs to map to 12 channels.
However, the L2 was 768 KiB and divided evenly.

If the L2 number is accurate, it is interesting for two reasons. The first is that it is not a straightforward distribution among 12 channels, and the other is that the highest per-slice capacity listed for GCN with Tahiti is too small to get to 2MiB at that channel count.

It would seemingly require a break in the direct link between cache and channel, like yet another crossbar, which seems like extra work just to avoid adding some helpful extra cache. I'm not sure if GCN's memory scheme works if the L2 is not there to interface with a channel.

Another tweak could be adding to the slices, which might raise the size of the slice and potentially affect its 16-way associativity.
A 16-way cache could tack on some extra ways, with 20 or 24 ways arriving at a capacity that might round to 2MiB with 12 channels.
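
To put rough numbers on that, here's a small back-of-the-envelope sketch; the 8 KiB-per-way slice granularity is an assumption for illustration, not a documented GCN parameter:

```cuda
// Host-only back-of-the-envelope check of the slice math above.
// Assumptions: 12 L2 slices (one per 32-bit channel of the 384-bit bus)
// and 8 KiB per cache way, so a 16-way slice would be 128 KiB.
#include <cstdio>

int main()
{
    const int channels = 12;
    printf("even split of 2 MiB: %.1f KiB per slice (not a power of two)\n",
           2048.0 / channels);

    const int way_kib = 8;
    const int way_counts[] = {16, 20, 24};
    for (int ways : way_counts) {
        int slice_kib = ways * way_kib;
        int total_kib = slice_kib * channels;
        printf("%2d ways -> %3d KiB/slice -> %4d KiB total (%.2f MiB)\n",
               ways, slice_kib, total_kib, total_kib / 1024.0);
    }
    return 0;
}
```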
 
Ubisoft says the new Assassin's Creed runs the same on both X and Pro... I guess they'd need to use this FP16, but only Pro has it at double rate per clock cycle... Big mistake MS made.
Now, it is quite possible that Origins has better assets on X (more RAM).
It's important to mention that Ubisoft is the company that decided to keep parity between the PS4 and Xbone versions of Assassin's Creed Unity.
I will be surprised if Assassin's Creed: Origins has any difference in assets between the Pro and XboneX. If the game uses dynamic resolution it'll probably hit higher resolutions on the XboneX, but that's about it.


Techpowerup GPU database; both X1X and PS4 Pro are based on Polaris and all consumer GPUs released so far (470/480/570/580) have 32 ROPs.
AFAIK, there is no official info on the ROP count for either the XboneX or the PS4 Pro. If those are TechPowerUp's numbers, you should be aware that they don't have any more access to info on console APUs than the general public, so a portion of those numbers are just guesses.
In fact, we have Mark Cerny stating the following:
"First, we doubled the GPU size by essentially placing it next to a mirrored version of itself, sort of like the wings of a butterfly. That gives us an extremely clean way to support the existing 700 titles," Cerny explains, detailing how the Pro switches into its 'base' compatibility mode. "We just turn off half the GPU and run it at something quite close to the original GPU."

This implies that the Pro's iGPU has 2x everything there is in the PS4's GPU, not only the CU and TMU counts. That would mean there are 64 ROPs in it.
Otherwise, turning off half the ROPs in the Pro would put it running with only 16 ROPs, which is half of what the PS4 has and would mean hitting an obvious bottleneck in base compatibility mode.


You can't distribute 2 MB of L2$ evenly across a 384-bit bus. [...]
Tahiti was the last time AMD used a crossbar IIRC, but the bandwidth gains were still apparent.

Honest question: aren't all APUs/SoCs using a memory crossbar / ringbus anyway?
How can the CPU cores access the system RAM if each memory channel is directly connected to a ROP like in modern discrete GPUs?
 