ID buffer and DR FP16

sebbbi · Jun 16, 2017

Performance of a particular shader is often limited by a single bottleneck (or a combination of two). The most common bottlenecks are ALU, texture filtering, memory latency, memory bandwidth, fillrate and the geometry front end. Double-rate FP16 only helps if the shader's main bottleneck is ALU. FP16 registers also help a bit with memory latency, since 16-bit registers use 50% less register file storage than 32-bit registers -> the GPU has better occupancy -> more threads can be kept ready to run -> better latency hiding capability.
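The register-to-occupancy chain above can be put into numbers with a small sketch. The hardware figures used here (256 vector registers per lane, at most 10 waves per SIMD, allocation in multiples of 4) are assumptions modelled on a GCN-like design and are not taken from the post; the shader register counts are made up for illustration.

```python
# Sketch: how registers per thread limit the number of waves in flight.
# All three constants are assumptions for a GCN-like SIMD, not from the post.
VGPR_BUDGET = 256        # 32-bit vector registers available per lane
MAX_WAVES = 10           # hardware cap on waves per SIMD
ALLOC_GRANULARITY = 4    # registers are handed out in multiples of this

def waves_per_simd(vgprs_per_thread: int) -> int:
    """Waves that fit on one SIMD for a shader using this many 32-bit registers."""
    # Round the request up to the allocation granularity (at least one block).
    blocks = -(-vgprs_per_thread // ALLOC_GRANULARITY)
    allocated = max(1, blocks) * ALLOC_GRANULARITY
    return min(MAX_WAVES, VGPR_BUDGET // allocated)

def packed_vgprs(fp32_values: int, fp16_values: int) -> int:
    """32-bit registers needed when two FP16 values share one register."""
    return fp32_values + (fp16_values + 1) // 2

# Hypothetical shader: 64 live values, all FP32.
all_fp32 = packed_vgprs(fp32_values=64, fp16_values=0)
# Same shader with 48 of those values demoted to FP16.
mixed = packed_vgprs(fp32_values=16, fp16_values=48)

print(all_fp32, waves_per_simd(all_fp32))  # register count, waves in flight
print(mixed, waves_per_simd(mixed))
```

More waves in flight means more independent work available to switch to while a memory request is outstanding, which is the latency-hiding benefit described above. The gain only appears when the register count actually crosses an occupancy threshold.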

People look too much at the GPU peak FLOP rate number. FP16 doubles this theoretical number, but it's important to realize that FP16 doesn't double the count of TMUs, ROPs or memory bandwidth. When GPU manufacturers scale up the GPU, they scale all of these up together. GPUs with more FLOPs also have more TMUs, more ROPs, more bandwidth and fatter geometry front ends. Marketing departments like to use FLOP count as a simple number to describe the GPU performance level, but this creates the illusion that FLOP count is the only thing that matters. If the other parts didn't scale up equally, the performance advantage would be very limited.

Thus doubling the peak FLOP rate by FP16 doesn't suddenly make a GPU equivalent to another GPU with double FLOP rate, unless all other parts of the GPU are also scaled up.
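A toy bottleneck model makes the point concrete. It assumes that all units overlap perfectly, so a pass takes as long as its slowest unit; the workload numbers are invented for illustration and do not describe any real shader or GPU.

```python
# Sketch: a render pass is modelled as finishing when its slowest unit finishes.
# Workload and rate numbers are hypothetical, chosen only to show the effect.

def pass_time_ms(work_ms_at_base: dict, rate_scale: dict) -> float:
    """Time of the pass: the maximum over units of work divided by rate."""
    return max(work_ms_at_base[unit] / rate_scale[unit] for unit in work_ms_at_base)

# Time each unit would need on the baseline GPU if it were the only limit.
work = {"alu": 8.0, "bandwidth": 6.0, "rop": 3.0}

base = {"alu": 1.0, "bandwidth": 1.0, "rop": 1.0}
fp16_only = {"alu": 2.0, "bandwidth": 1.0, "rop": 1.0}     # only ALU rate doubles
bigger_gpu = {"alu": 2.0, "bandwidth": 2.0, "rop": 2.0}    # everything doubles

t_base = pass_time_ms(work, base)
t_fp16 = pass_time_ms(work, fp16_only)
t_big = pass_time_ms(work, bigger_gpu)

print(t_base, t_fp16, t_big)
print("FP16 speedup:", t_base / t_fp16)
print("2x GPU speedup:", t_base / t_big)
```

In this made-up case doubling only the ALU rate moves the bottleneck to bandwidth and yields about 1.33x, while scaling every unit yields the full 2x.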

FP16 is a very useful feature for the developers, but mixing it up with FLOP based marketing is simply confusing the consumers.

Clukos · Jun 16, 2017

ROPs are the same in Ps4 Pro and X1X, while X1X has 16 more TMUs (4 per CU):

Pro

X1X

Of course, higher clock speed helps both with pixel and texture fillrate.

Jay · Jun 16, 2017

Clukos said:
ROPs are the same in Ps4 Pro and X1X, while X1X has 16 more TMUs (4 per CU):

Pro

X1X

Of course, higher clock speed helps both with pixel and texture fillrate.

Higher clock rate also helps the front end.

Where did you get the details from? I don't remember ever hearing the ROP count.

Clukos · Jun 16, 2017

The TechPowerUp GPU database. Both X1X and PS4 Pro are based on Polaris, and all consumer Polaris GPUs released so far (470/480/570/580) have 32 ROPs.

Globalisateur · Jun 16, 2017

Clukos said:
ROPs are the same in Ps4 Pro and X1X, while X1X has 16 more TMUs (4 per CU):

Pro

Of course, higher clock speed helps both with pixel and texture fillrate.

I think one developer on GAF once implied those specs weren't exactly right for Pro. Also, the number of ROPs has never been confirmed. It would be super nice to have a confirmation on those matters from one of our members...

Even indirect confirmation...

Allandor · Jun 16, 2017

sebbbi said:
Performance of a particular shader is often limited by a single bottleneck (or a combination of two). The most common bottlenecks are ALU, texture filtering, memory latency, memory bandwidth, fillrate and the geometry front end. Double-rate FP16 only helps if the shader's main bottleneck is ALU. FP16 registers also help a bit with memory latency, since 16-bit registers use 50% less register file storage than 32-bit registers -> the GPU has better occupancy -> more threads can be kept ready to run -> better latency hiding capability.

People look too much at the GPU peak FLOP rate number. FP16 doubles this theoretical number, but it's important to realize that FP16 doesn't double the count of TMUs, ROPs or memory bandwidth. When GPU manufacturers scale up the GPU, they scale all of these up together. GPUs with more FLOPs also have more TMUs, more ROPs, more bandwidth and fatter geometry front ends. Marketing departments like to use FLOP count as a simple number to describe the GPU performance level, but this creates the illusion that FLOP count is the only thing that matters. If the other parts didn't scale up equally, the performance advantage would be very limited.

Thus doubling the peak FLOP rate by FP16 doesn't suddenly make a GPU equivalent to another GPU with double FLOP rate, unless all other parts of the GPU are also scaled up.

FP16 is a very useful feature for the developers, but mixing it up with FLOP based marketing is simply confusing the consumers.

You also always need a matching pair of FP16 operations issued together to get anything from it. Having one 32-bit and one 16-bit operation at the same time doesn't help. So it is not possible to come close to the theoretical limit (unless you do 16-bit operations only... well, better not).
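The pairing constraint can be sketched as a simple issue-slot count. The model below is an assumption for illustration: every FP32 operation takes one slot, two FP16 operations take one slot only when they can be packed together, and `pairable_fraction` is a hypothetical knob for how much of the FP16 work the compiler manages to pair.

```python
# Sketch: effective ALU speedup from packed FP16 under a simple slot-count model.
# The model and the pairable_fraction parameter are illustrative assumptions.

def issue_slots(fp32_ops: int, fp16_ops: int, pairable_fraction: float) -> int:
    """Issue slots needed when only paired FP16 ops share a slot."""
    packed_instructions = int(fp16_ops * pairable_fraction) // 2
    unpaired_fp16 = fp16_ops - 2 * packed_instructions
    return fp32_ops + unpaired_fp16 + packed_instructions

def alu_speedup(fp32_ops: int, fp16_ops: int, pairable_fraction: float) -> float:
    """Speedup relative to issuing every operation in its own slot."""
    total_ops = fp32_ops + fp16_ops
    return total_ops / issue_slots(fp32_ops, fp16_ops, pairable_fraction)

print(alu_speedup(0, 100, 1.0))    # everything FP16 and perfectly paired
print(alu_speedup(50, 50, 1.0))    # half the work stays FP32
print(alu_speedup(50, 50, 0.5))    # and only half of the FP16 ops pair up
```

Under these assumptions the theoretical 2x is reached only in the all-FP16, fully paired case, and it falls off quickly once FP32 work or unpaired FP16 operations are mixed in.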

Rodéric · Jun 16, 2017

Polaris?
Where did you get that information from?

Clukos · Jun 16, 2017

Rodéric said:
Polaris?
Where did you get that information from?

Official spec sheet?

Locuza · Jun 16, 2017

Jay said:
[...] Don't remember ever hearing rop count

"As you can see, we doubled the amount of shader engines. That has the effect of improvement of boosting our triangle and vertex rate by 2.7x when you include the clock boost as well. We doubled the number of render back-ends, which has the effect of increasing our fill-rate by 2.7x. We quadrupled the GPU L2 cache size, again for targeting the 4K performance."

http://www.eurogamer.net/articles/digitalfoundry-2017-project-scorpio-tech-revealed

The funny thing about the Xbox One X is the combination of 32 ROPs and a 384-bit interface coupled with 2MB L2$.
You can't distribute 2MB L2$ evenly across a 384-Bit bus.

Clukos · Jun 16, 2017

Yeah that makes sense.

853 MHz (X1 core clock) x 16 (ROPs) = 13.648 GPixel/s
13.648 x 2.7 = 36.850 GPixel/s
1172 MHz (X1X core clock) x 32 (ROPs) = 37.504 GPixel/s

So it's actually a bit over 2.7 (more like 2.74-2.75).
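The same arithmetic as a short script, for anyone who wants to plug in other clock or ROP figures. It assumes one pixel per ROP per clock for the peak rate, and uses the clock speeds and ROP counts given in this post (the X1X ROP count is itself inferred, not officially confirmed).

```python
# Sketch: peak pixel fillrate, assuming each ROP writes one pixel per clock.

def pixel_fillrate_gpixels(core_clock_mhz: float, rops: int) -> float:
    """Peak fillrate in GPixel/s from clock (MHz) and ROP count."""
    return core_clock_mhz * rops / 1000.0

x1 = pixel_fillrate_gpixels(853, 16)      # Xbox One
x1x = pixel_fillrate_gpixels(1172, 32)    # Xbox One X, 32 ROPs assumed

print(f"X1:  {x1:.3f} GPixel/s")
print(f"X1X: {x1x:.3f} GPixel/s")
print(f"ratio: {x1x / x1:.3f}")
```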

sebbbi · Jun 16, 2017

As people might guess, I really like id-buffers. But I am not going to participate in the discussion of how much better a hardware implementation is versus an optimized software implementation (with MSAA/EQAA/DCC tricks). I don't believe there are enough public documents about these hardware features available to discuss them in public forums. And the devil is obviously in the details (as it tends to be when discussing high performance graphics tricks).

But I recommend reading my thread about id-buffering and other future rendering techniques (pure tech instead of console wars):
https://forum.beyond3d.com/threads/modern-textureless-deferred-rendering-techniques.57611/

chris1515 · Jun 16, 2017

But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. The Frostbite games run at 1800p checkerboard rendering on PS4 Pro (BF1 and Mass Effect: Andromeda), and Anthem is 2160p checkerboard rendering on Xbox One X...

AlNom · Jun 16, 2017

Clukos said:

Not quite right.

Locuza said:
http://www.eurogamer.net/articles/digitalfoundry-2017-project-scorpio-tech-revealed

The funny thing about the Xbox One X is the combination of 32 ROPs and a 384-bit interface coupled with 2MB L2$.
You can't distribute 2MB L2$ evenly across a 384-Bit bus.

Tahiti was the last time AMD used a crossbar IIRC, but the bandwidth gains were still apparent.

Globalisateur · Jun 16, 2017

chris1515 said:
But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. The Frostbite games run at 1800p checkerboard rendering on PS4 Pro (BF1 and Mass Effect: Andromeda), and Anthem is 2160p checkerboard rendering on Xbox One X...

Combined with dynamic scaling according to DF. And um. I just checked the 4K images available of this game... But I think I'll wait for proper PNGs from the final version of the game before giving any opinion...

http://www.eurogamer.net/articles/digitalfoundry-2017-anthem-the-real-deal-for-xbox-one

AlNom · Jun 16, 2017

Globalisateur said:
Combined with dynamic scaling according to DF. And um. I just checked the 4K images available of this game...But I think I'll wait for proper PNGs from the final version of the game before giving any opinion...

http://www.eurogamer.net/articles/digitalfoundry-2017-anthem-the-real-deal-for-xbox-one

There's a thread for pre-release analysis.

Jay · Jun 16, 2017

Globalisateur said:
Combined with dynamic scaling according to DF. And um. I just checked the 4K images available of this game...But I think I'll wait for proper PNGs from the final version of the game before giving any opinion...

http://www.eurogamer.net/articles/digitalfoundry-2017-anthem-the-real-deal-for-xbox-one

It's nice that it's also part of their engine, so I suspect that BF1 and MA:A will also be dynamic as well as checkerboarded.

In fact I'm pretty sure that's the case.

Scott_Arm · Jun 16, 2017

Jay said:
is that true of all 4k tv technology or not?

LCD and OLED both have reduced motion resolution, because they are sample and hold displays. There are some features that can help improve motion resolution, but they tend to increase input lag and not work in game mode.

Recop · Jun 16, 2017

chris1515 said:
But maybe we have a first difference between games running on PS4 Pro and Xbox One X on Frostbite. The Frostbite games run at 1800p checkerboard rendering on PS4 Pro (BF1 and Mass Effect: Andromeda), and Anthem is 2160p checkerboard rendering on Xbox One X...

Indeed, 2160C on X should mean 1800C on Pro. Aside from resolution, I expect better graphics settings on X.
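A rough check of that pairing, as a sketch. It assumes checkerboard rendering shades about half of the output pixels each frame, and it uses the publicly quoted peak figures of 4.2 TFLOPs for PS4 Pro and 6.0 TFLOPs for Xbox One X, which are not stated in this thread. As discussed earlier in the thread, peak FLOPs are only one of several possible bottlenecks, so this is a sanity check rather than a prediction.

```python
# Sketch: shaded pixels per frame for checkerboard modes vs. peak FLOP ratio.
# Assumes checkerboarding shades half the output pixels per frame.

def checkerboard_shaded_pixels(width: int, height: int) -> int:
    """Approximate pixels shaded per frame with checkerboard rendering."""
    return width * height // 2

pro_1800c = checkerboard_shaded_pixels(3200, 1800)
x_2160c = checkerboard_shaded_pixels(3840, 2160)

pixel_ratio = x_2160c / pro_1800c
flop_ratio = 6.0 / 4.2   # assumed peak TFLOP figures, X1X over PS4 Pro

print(pro_1800c, x_2160c)
print(f"pixel ratio: {pixel_ratio:.2f}")
print(f"FLOP ratio:  {flop_ratio:.2f}")
```

The two ratios land close together (about 1.44 against about 1.43), which is consistent with the 2160C versus 1800C split.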

3dilettante · Jun 16, 2017

AlNets said:
Tahiti was the last time AMD used a crossbar IIRC, but the bandwidth gains were still apparent.

Tahiti (and presumably Tonga) had an additional crossbar that allowed 32 ROPs to map to 12 channels.
However, the L2 was 768 KiB and divided evenly.

If the L2 number is accurate, it is interesting for two reasons. The first is that it is not a straightforward distribution among 12 channels, and the other is that the highest per-slice capacity listed for GCN with Tahiti is too small to get to 2MiB at that channel count.

It would seemingly require a break in the direct link between cache and channel, like yet another crossbar, which seems like extra work to avoid some helpful extra cache. I'm not sure if GCN's memory scheme works if the L2 is not there to interface with a channel.

Another tweak could be adding to the slices, which might raise the size of the slice and potentially affect its 16-way associativity.
A 16-way cache could tack on some extra ways, with 20 or 24 ways arriving at a capacity that might round to 2MiB with 12 channels.
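The capacity arithmetic behind that last idea can be sketched as follows. It assumes one L2 slice per memory channel and that adding ways keeps the number of sets fixed, so slice capacity grows in proportion to the way count. The 128 KiB figure for the largest 16-way slice is an assumption; the 768 KiB Tahiti total and the 12 channels are from the post.

```python
# Sketch: total L2 capacity when ways are added to each per-channel slice.
# Assumes one slice per channel and a fixed number of sets per slice.

def l2_capacity_kib(channels: int, slice_kib_at_16_ways: int, ways: int) -> int:
    """Total L2 in KiB if each slice scales linearly with its way count."""
    return channels * slice_kib_at_16_ways * ways // 16

tahiti = l2_capacity_kib(channels=12, slice_kib_at_16_ways=64, ways=16)

# Hypothetical 128 KiB slices (assumed upper bound) on 12 channels.
for ways in (16, 20, 24):
    total = l2_capacity_kib(channels=12, slice_kib_at_16_ways=128, ways=ways)
    print(f"{ways} ways: {total} KiB = {total / 1024:.3f} MiB")
```

Under these assumptions, 16 ways gives 1.5 MiB, 20 ways gives 1.875 MiB and 24 ways gives 2.25 MiB, so neither option lands exactly on 2 MiB, but either could plausibly be rounded to it.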

Deleted member 13524 · Jun 16, 2017

HBRU said:
Ubisoft says the new Assassin's Creed runs equally on both X and Pro... I guess they need to use this FP16, but only the Pro has it at double rate per clock cycle... Big mistake by MS.

Recop said:
Now, it is quite possible that Origins has better assets on X (more ram).

It's important to mention that Ubisoft is the company that decided to keep parity between the PS4 and Xbone versions of Assassin's Creed Unity.
I will be surprised if Assassin's Creed: Origins has any difference in assets between the Pro and XboneX. If the game uses dynamic resolution it'll probably hit higher resolutions with the XboneX, but that's about it.

Clukos said:
The TechPowerUp GPU database. Both X1X and PS4 Pro are based on Polaris, and all consumer Polaris GPUs released so far (470/480/570/580) have 32 ROPs.

AFAIK, there is no official info on ROP amount for either the XboneX or the PS4 Pro. If those are TechPowerUp's numbers you should be aware that they don't have any more access to info on console APUs than the general public, so a portion of those numbers are just guesses.
In fact, we have Mark Cerny stating the following:

"First, we doubled the GPU size by essentially placing it next to a mirrored version of itself, sort of like the wings of a butterfly. That gives us an extremely clean way to support the existing 700 titles," Cerny explains, detailing how the Pro switches into its 'base' compatibility mode. "We just turn off half the GPU and run it at something quite close to the original GPU."

This implies that the Pro's iGPU has 2x everything there is on the PS4's GPU, not only the CU and TMU count. That would mean there are 64 ROPs in it.
Otherwise, turning off half the ROPs in the Pro would leave it running with only 16 ROPs, which is half of what the PS4 has and would mean hitting an obvious bottleneck in base compatibility mode.

Locuza said:
The funny thing about the Xbox One X is the combination of 32 ROPs and a 384-bit interface coupled with 2MB L2$.
You can't distribute 2MB L2$ evenly across a 384-Bit bus.

AlNets said:
Tahiti was the last time AMD used a crossbar IIRC, but the bandwidth gains were still apparent.

Honest question: aren't all APUs/SoCs using a memory crossbar / ringbus anyway?
How can the CPU cores access the system RAM if each memory channel is directly connected to a ROP like in modern discrete GPUs?
