No, you can't bench the ROPs using napkin-math formulas. I know the MS side is known to use theoretical numbers under this or that ideal condition, but it really depends on what kind of alpha blending is used. In some cases fillrate will be higher on PS5, as is already the case on the Pro. On the Pro, a typical benchmark Sony uses shows a measured fillrate higher than the maximum theoretical fillrate possible on XBX, and that bench was using about 160GB/s of bandwidth.
That's fair; I can't account for everything, like delta colour compression or the different ways memory is accessed.
Unfortunately, the counterpoint here falls in favour of the X1X in terms of overall performance, with resolution differences in the range of 40-100% that far exceed the difference in compute capacity. And the 4Pro has double the ROPs, so that only leaves bandwidth as the largest limiting factor.
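To put rough numbers on why napkin math misleads here (purely back-of-envelope, ignoring delta colour compression and the ROP caches, and using the public clock/ROP figures): theoretical fillrate is just ROPs x clock, but sustaining it on an alpha-blended RGBA8 target needs a read plus a write per pixel, which is more bandwidth than either console has.
XSX: 64 ROPs x 1.825 GHz ≈ 116.8 Gpixels/s theoretical
PS5: 64 ROPs x 2.23 GHz ≈ 142.7 Gpixels/s theoretical
RGBA8 blend = 4 bytes read + 4 bytes write = 8 bytes/pixel
116.8 Gpixels/s x 8 bytes ≈ 934 GB/s needed, vs ~560 GB/s (XSX fast pool) or ~448 GB/s (PS5) available
So with blending both end up bandwidth-bound long before the ROPs are, which is why measured fillrate can land almost anywhere relative to the theoretical numbers.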
Would you say a 9TF 2070S (PS5) vs an 11TF 2080S (XSX) is the most comparable matchup here? The 2080S has more CUDA cores, more memory bandwidth, and faster memory and boost clocks. Despite having a ~2TF compute lead, in practice it's only 4-8fps faster at 4K; the extra fps wouldn't even warrant a decent increase in resolution.
https://www.digitaltrends.com/computing/rtx-2080-super-vs-rtx-2080-vs-rtx-2070-super/
RDNA 2 might scale a bit differently tho, who knows.
No, I won't compare the two RDNA 2 GPUs to the Turing architecture. There are nuances to simplicity: simplicity is supposed to provide a general view of things, but it's not supposed to look at exception cases, and in the case of games, if the exception is too good to pass up, the exception becomes the norm -- see Cell. Comparing the same architecture, however, you can sort of start crossing things out.
Sebbbi did build a performance analyzer here for types of shader workloads vs bandwidth:
https://github.com/sebbbi/perftest
These are all done on DX11 btw. So not a complete view of things.
But before we go further, I'm just going to reiterate sebbbi's caveat before fanboys use this as fodder:
The purpose of this application is not to benchmark different brand GPUs against each other. Its purpose is to help rendering programmers to choose right types of resources when optimizing their compute shader performance.
He states:
All results are compared to Buffer<RGBA8>.Load random result (=1.0x) on the same GPU.
Random loads: I add a random start offset of 0-15 elements for each thread (still aligned).
This prevents GPU coalescing, and provides more realistic view of performance for common case (non-linear) memory accessing. This benchmark is as cache efficient as the previous. All data still comes from the L1 cache.
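To make those three access patterns concrete, here's a minimal HLSL compute sketch of what uniform/linear/random addressing looks like (my own illustration, not sebbbi's actual shader; the buffer names and the offset pattern are made up):
Buffer<float4> srcBuffer;      // bound with an RGBA8 (R8G8B8A8_UNORM) view, like Buffer<RGBA8> above
RWBuffer<float4> dstBuffer;

[numthreads(256, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Uniform: every thread reads the same element (the address doesn't vary per thread).
    float4 u = srcBuffer.Load(0);

    // Linear: thread i reads element i, so the loads coalesce perfectly.
    float4 l = srcBuffer.Load(tid.x);

    // Random: a per-thread 0-15 element start offset defeats coalescing
    // while the data still comes from the L1 cache.
    uint offset = (tid.x * 7u) & 15u;   // stand-in for the benchmark's random offset
    float4 r = srcBuffer.Load(tid.x + offset);

    dstBuffer[tid.x] = u + l + r;       // keep the loads live so they aren't optimized away
}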
Benchmarks are unfortunately all over the place when you compare across architectures, but they are quite similar when you compare within an architecture (see GCN2-5). Both Intel and Nvidia show massive improvements in certain areas of these benchmarks.
See here for example: Navi 5700XT
Buffer<RGBA8>.Load uniform: 12.519ms 1.008x
Buffer<RGBA8>.Load linear: 12.985ms 0.972x
Buffer<RGBA8>.Load random: 12.617ms 1.000x
Compared to Maxwell 980TI
Buffer<RGBA8>.Load uniform: 2.452ms 14.680x
Buffer<RGBA8>.Load linear: 35.773ms 1.006x
Buffer<RGBA8>.Load random: 35.996ms 1.000x
Comparing to Kepler 600/700 series
Buffer<RGBA8>.Load uniform: 3.598ms 53.329x
Buffer<RGBA8>.Load linear: 193.676ms 0.991x
Buffer<RGBA8>.Load random: 191.866ms 1.000x
So those driver improvements on uniform memory address loads make a massive boost in performance whenever those types of workloads come up. I don't know what the driver/API performance situation is like on console (for obvious reasons sebbbi cannot show it), but whenever I think about developers trying to optimize for Nvidia, I mean, yeah, I'd try to take advantage of uniform loads if the opportunity arises.
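As a hedged sketch of what "taking advantage of uniform loads" means in practice (the names here are hypothetical): if the load address genuinely doesn't vary across the wave, e.g. an index coming from a constant buffer rather than from per-thread data, the driver can apply that uniform-address optimization (or, on GCN, route the load through the scalar unit):
Buffer<float4> materialTable;
RWBuffer<float4> output;

cbuffer DrawConstants
{
    uint materialIndex;   // same value for every thread in the dispatch
};

[numthreads(64, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Uniform address: the driver can see materialIndex doesn't vary per thread,
    // so it can issue one broadcast/scalar load instead of one load per thread.
    float4 material = materialTable.Load(materialIndex);

    // If the index instead came from per-thread data (e.g. something read with tid.x),
    // the address would be divergent and the fast path above is lost.
    output[tid.x] = material;
}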
But here is Volta:
Buffer<RGBA8>.Load uniform: 5.155ms 3.538x
Buffer<RGBA8>.Load linear: 16.726ms 1.090x
Buffer<RGBA8>.Load random: 18.236ms 1.000x
So from architecture to architecture things can change, and throughput benchmarks don't tell the whole story either.
sebbbi says:
NVIDIA Volta results (ratios) of most common load/sample operations are identical to Pascal. However there are some huge changes in raw load performance. Raw loads: 1d ~2x faster, 2d-4d ~4x faster (slightly more on 3d and 4d). Nvidia definitely seems to now use a faster direct memory path for raw loads.
Raw loads are now the best choice on Nvidia hardware (which is a direct opposite of their last gen hardware). Independent studies of Volta architecture show that their raw load L1$ latency also dropped from 85 cycles (Pascal) down to 28 cycles (Volta). This should make raw loads even more viable in real applications.
My benchmark measures only throughput, so latency improvement isn't visible.
Uniform address optimization: Uniform address optimization no longer affects StructuredBuffers. My educated guess is that StructuredBuffers (like raw buffers) now use the same lower latency direct memory path. Nvidia most likely hasn't yet implemented uniform address optimization for these new memory operations. Another curiosity is that Volta also has much lower performance advantage in the uniform address optimized cases (versus any other Nvidia GPU, including Turing).
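In HLSL terms, acting on that Volta finding would roughly mean swapping typed buffer loads for raw ByteAddressBuffer loads where the data layout allows it (a sketch, assuming a 16-byte float4 payload; the names are mine):
Buffer<float4> typedBuf;        // typed load path
ByteAddressBuffer rawBuf;       // raw load path
RWBuffer<float4> result;

[numthreads(64, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Typed 4d load (the Buffer<RGBA8>-style rows above).
    float4 typed = typedBuf.Load(tid.x);

    // Raw 4d load: Load4 fetches 16 bytes from a byte offset.
    // Per sebbbi's Volta numbers this path is ~4x faster there, and the
    // lower L1$ latency (85 -> 28 cycles) isn't even visible in a throughput test.
    float4 raw = asfloat(rawBuf.Load4(tid.x * 16));

    result[tid.x] = typed + raw;
}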
And here is Turing 2080ti:
Buffer<RGBA8>.Load uniform: 1.336ms 12.247x
Buffer<RGBA8>.Load linear: 16.825ms 0.973x
Buffer<RGBA8>.Load random: 16.364ms 1.000x
NVIDIA Turing results (ratios) of most common load/sample operations are identical to Volta, except wide raw buffer load performance is closer to Maxwell/Pascal. In Volta, Nvidia used one large 128KB shared L1$ (freely configurable between groupshared mem and L1$), while in Turing they have 96KB shared L1$ which can be configured only as 64/32 or 32/64. This benchmark seems to point out that this halves their L1$ bandwidth for raw loads.
Uniform address optimization: Like Volta, the new uniform address optimization no longer affects StructuredBuffers. My educated guess is that StructuredBuffers (like raw buffers) now use the same lower latency direct memory path. Nvidia most likely hasn't yet implemented uniform address optimization for these new memory operations. Turing uniform address optimization performance however (in other cases) returns to similar 20x+ figures as Maxwell/Pascal.
TL;DR: As you can see, coding for the hardware makes a dramatic difference in performance. In this case, having a single profile for console makes things straightforward. I think it's easier to make general statements about two pieces of hardware of the exact same architecture, but comparing different architectures is not going to work.
I did purposefully leave out the benchmarks between the 5700 XT and the 2080 Ti (Turing), lol, just to ensure we aren't going off topic from what I was trying to point out. But yeah, I mean, the 5700 XT is a good piece of hardware under a variety of workloads, as is GCN in general. I think you'll find some developers here that really praise its compute ability, and I think the benchmarks showcase how effective it is at different workloads.