Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Do you have a link to that particular benchmark? I'd be interested to see what they were doing that was using so little bandwidth.

In actual games the X1X can be putting out anywhere from 40% to 100% more pixels [edit: in a given period of time], so I'm inclined to think that in most cases, in the real world, bandwidth is the real limiter.

It may not be specific to the PS4, but some time ago there was discussion about getting improved performance for GPU particles by sizing the tiles to match the footprint of the ROP caches. The general workflow assumes ROP caches are continuously servicing misses to memory, but staying within their caches in workloads that permit it leverages their broader internal data paths while significantly reducing their DRAM bandwidth consumption.
Doubling the ROPs for that subset of the work would scale performance without needing as much memory bandwidth.
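
For what it's worth, here's a minimal CUDA sketch of the same idea expressed in compute terms (the Particle struct, the per-tile lists, and the kernel are all hypothetical): each block accumulates blending for one screen tile entirely in on-chip shared memory and writes the finished tile to DRAM only once, analogous to keeping the blend traffic inside the ROP caches.

Code:
#include <cuda_runtime.h>

struct Particle { float x, y, radius, intensity; };   // hypothetical layout

constexpr int TILE = 32;   // 32x32 pixel tile, small enough to live on-chip

// One thread block per screen tile; each thread owns one pixel of the tile.
__global__ void blendParticlesTiled(const Particle* particles,
                                    const int* tileOffsets,   // start index per tile
                                    const int* tileCounts,    // particle count per tile
                                    float* framebuffer, int fbWidth)
{
    __shared__ float tile[TILE][TILE];                 // on-chip accumulation buffer

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int px = blockIdx.x * TILE + tx;             // this thread's pixel
    const int py = blockIdx.y * TILE + ty;
    tile[ty][tx] = 0.0f;

    const int tileId = blockIdx.y * gridDim.x + blockIdx.x;
    const int begin  = tileOffsets[tileId];
    const int end    = begin + tileCounts[tileId];

    // Blend every particle overlapping this tile while the tile stays on-chip:
    // no per-particle read-modify-write traffic to DRAM, unlike naive blending.
    for (int i = begin; i < end; ++i) {
        const Particle p = particles[i];
        const float dx = px - p.x, dy = py - p.y;
        if (dx * dx + dy * dy < p.radius * p.radius)
            tile[ty][tx] += p.intensity;               // additive blend, on-chip
    }

    // A single write of the finished pixel to memory at the end.
    framebuffer[py * fbWidth + px] += tile[ty][tx];
}
// Launch shape (host side): dim3 block(TILE, TILE); dim3 grid(fbWidth/TILE, fbHeight/TILE);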

Guys, your recent discussion on memory contention led me to do some googling to try and better understand what it was. While doing so I came across the following: http://pages.cs.wisc.edu/~basu/isca_iommu_tutorial/IOMMU_TUTORIAL_ASPLOS_2016.pdf

I think the bits that pertain to the next-gen consoles begin on page 122 and end at 172. It's mainly graphs and large text, so it shouldn't make for a heavy read. Could some of the more technical members have a quick look through and see if this could have been designed as part of the two systems set to launch later this year? Thanks in advance.

AMD's had an IOMMU of some form going back at least as far as Trinity. There's an IOMMU in the current consoles, and Kaveri fell just short of a full HSA device. HUMA was the marketing point that PS4 fans latched onto, for example.
It's been present for years in standard hardware, so the next gen should be expected to continue to have it.
 
So what do you need an ultra-fast SSD for? Just for faster load times and the lulz? Does it not help you stream more detail faster in tandem with RAM?

It does, and that's why a 125x improvement is available as a baseline on all NextGen consoles.
 
It does, and that's why a 125x improvement is available as a baseline on all NextGen consoles.
So back to where we were before: a faster SSD obviously streams things faster, affording more detail or more assets on screen than one that's half as fast, assuming of course that both are also using all the available RAM. Do you not agree that's PS5's power advantage?
 
If we look at his example of 36 CUs vs. 48 CUs at the same teraflops, he mentioned that overall performance is raised due to the higher frequency, and that it is easier to fully utilize the 36 CUs of the narrower GPU.
The point is that faster GPU pipelines can increase performance significantly, in a different way than adding more teraflops.

And it's very easy to deduce that the PS5 GPU can beat a 48-CU GPU with 10.3 TF operating at 1.673 GHz. I assume the PS5 probably matches the performance of 48 CUs at 1.825 GHz. In other words, the real-world in-game difference between XSX and PS5 is roughly 52/48, which is 8.3%, plus 2~3% for PS5 downclocking. The overall difference is 10~11%.
hmm...
I would probably disagree with that. Things like fillrate and tessellation would improve with clock rate; this much I agree with. But GPU problems are massive, and the hardware is designed to attack them in parallel. I have found that blocks (cores) get through tons of work faster than threads within cores. So you can have many threads in a thread block, or you can run more blocks with fewer threads each. And the number of blocks you have running is going to annihilate thread count in terms of performance, by an order of magnitude. At least this is what CUDA has shown me for simpler problems. When you need threads to work together, then you need to issue more threads per block, as blocks don't really share resources with each other. So there are pros and cons to having more blocks or fewer blocks.
(I'm not an AMD guy, sorry; the whole 64-threads-per-wave unit of work is lost on me at the moment. I guess it means that for good saturation you need all 64 lanes filled or something.)

*edit: I recognize that things have changed with RDNA. It might be best to have a technical discussion here on how best to optimize performance on that architecture.

This is just the way GPUs are designed: to be embarrassingly parallel. I'm not a 3D engine person, so I don't know for sure, but I suspect that if you've got loads and loads of work spread over fewer cores, the only way you can saturate each core is to stuff in more threads. The issue with threads comes down to the overhead penalty of a thread switch. So maybe instead you do it properly and issue the right number of threads per thread block, and you'll normally have more thread blocks than physical cores available to work on them. So you've got this massive 4K@60 render going and you have 36 CUs: you're going to have fewer blocks running in parallel. Textures are broken up into little 4x4 blocks of information, compressed specifically for random access, exactly so that individual cores can process 4x4 or 8x8 blocks of pixels.
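
Here's a minimal CUDA sketch of the blocks-vs-threads trade-off I'm describing (the kernel name, buffer, and launch sizes are purely illustrative, not from any real engine): the same total work can be split into a few huge blocks or into many smaller ones, and the many-block shape gives the scheduler more independent pieces to keep the cores fed.

Code:
#include <cuda_runtime.h>

__global__ void scalePixels(float* pixels, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    if (i < n)
        pixels[i] *= k;
}

int main()
{
    const int n = 3840 * 2160;                       // one 4K plane of floats
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Same total work, two launch shapes:
    // (a) few huge blocks -- 1024 threads each, fewer blocks to spread over the SMs
    scalePixels<<<(n + 1023) / 1024, 1024>>>(d, n, 0.5f);

    // (b) many smaller blocks -- 128 threads each, 8x as many blocks,
    //     giving the scheduler more independent pieces to hide latency with
    scalePixels<<<(n + 127) / 128, 128>>>(d, n, 0.5f);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}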

So my answer is: it's certainly NOT clear that a PS5 can beat the same GPU with more cores operating at a slower frequency. As we move further towards compute shaders, the requirements for ROPs will continue to drop. The nice thing with compute is that its bandwidth needs don't have to be excessively high; it can do all sorts of operations without needing a large amount of bandwidth.
 
So back to where we were before: a faster SSD obviously streams things faster, affording more detail or more assets on screen than one that's half as fast, assuming of course that both are also using all the available RAM. Do you not agree that's PS5's power advantage?
It's not so simple, unfortunately.
Pulling textures is just _one_ aspect of what needs to be done. There are a great number of render targets that will be written to draw a frame.

If you want nice 4K HDR textures, you're looking at something like BC6H, where the uncompressed HDR source is around 64 bits per pixel (and even compressed it still costs 8 bits per pixel in memory). Yeah, that's being sent compressed. But you've got to generate mips, you've got to sample, you have aniso for oblique angles; all of these things consume memory bandwidth. Just because you removed the bottleneck on I/O doesn't mean you can freely dump 16K textures into your game. They need to be processed before being put on screen. You still have memory limitations, you have bandwidth limitations, and depending on what you're trying to accomplish you may have ALU limitations. All sorts of compression techniques come with their own drawbacks and advantages.
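
To put some rough numbers on the memory side (a quick sketch, assuming BC6H at 16 bytes per 4x4 block and a full mip chain, purely as an illustration):

Code:
#include <cstdio>
#include <algorithm>

int main()
{
    // Sum the compressed size of a 4096x4096 BC6H texture plus its mip chain.
    size_t total = 0;
    for (int w = 4096, h = 4096; ; w = std::max(w / 2, 1), h = std::max(h / 2, 1)) {
        int blocksX = (w + 3) / 4, blocksY = (h + 3) / 4;
        total += size_t(blocksX) * blocksY * 16;       // 16 bytes per BC6H block
        if (w == 1 && h == 1) break;
    }
    // ~16 MiB for the base level, ~21.3 MiB with mips; a few hundred such
    // textures already eat several GB of the shared memory pool.
    std::printf("one 4K BC6H texture + mips: %.1f MiB\n", total / (1024.0 * 1024.0));
    return 0;
}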

The main thing to note is that, while it's great that I/O is improved, as someone posted earlier, memory bandwidth is falling dramatically behind the rate at which compute is increasing. I/O isn't bandwidth, though; you're still going to hit a bandwidth wall.
 
Well, let's hope Cerny sorted out the bandwidth factor when he put this SSD in the console. We'll find out in due time whether the PS5 is bandwidth limited or the SSD is overdesigned.
 
Let's say it's an 18% difference on average; that would make it almost 2000p vs 2160p if the power were focused on resolution.

Higher settings, more stable fps, etc. Resolution is less of a focus with today's reconstruction tech. They can target higher res and higher settings; it's not only a more powerful GPU with more advanced features like VRS, but also a faster CPU and higher bandwidth (which can bottleneck a whole system). And no boost or variable clocks/performance either.
In general, cross-platform games will be best served on the XSX, and XSX exclusives that leverage the 12+ TF of GPU power will do things perhaps not possible on 9/10 TF hardware. It's 2070 vs 2080 Ti, after all.
 
Well, let's hope Cerny sorted out the bandwidth factor when he put this SSD in the console. We'll find out in due time whether the PS5 is bandwidth limited or the SSD is overdesigned.
The PS5 is still going to walk away with faster loading times even if the faster speed doesn't translate to in-game differences. There are also ALU requirements, right?

You can't stack a 40 GB/s SSD onto an XBO and expect it to process 4K textures and higher settings just because its I/O is solved.
 
Do we know anything other than the clock speed and flops of the next-gen console GPUs? For example, could it be that both consoles have the same number of ROPs, in which case there could be use cases where the PS5 has an advantage, i.e. Cerny's narrow-and-higher-clock argument could be a real thing in some limited cases?
 
Do we know anything other than the clock speed and flops of the next-gen console GPUs? For example, could it be that both consoles have the same number of ROPs, in which case there could be use cases where the PS5 has an advantage, i.e. Cerny's narrow-and-higher-clock argument could be a real thing in some limited cases?
In any situation where the PS5 is not bandwidth bound, then yeah, it's going to have an advantage when using the ROPs. I think tessellation is dead as a feature; people are likely to use mesh shaders instead.
Smaller workloads, etc.
 
Do we know anything other than the clock speed and flops of the next-gen console GPUs? For example, could it be that both consoles have the same number of ROPs, in which case there could be use cases where the PS5 has an advantage, i.e. Cerny's narrow-and-higher-clock argument could be a real thing in some limited cases?
The funny thing is that no one has said there aren't cases where clock speed would have an advantage, just not to the extent that some believe. We'll also have to see how RT etc. impacts both designs.

What if MS said the following:
We went narrow and fast because we knew how fast the front end needs to be and we wanted better occupancy, including changes made to facilitate ExecuteIndirect.
So they'd be basing it on 12 TF from 52 CUs at 1.8 GHz, compared to 12 TF from 64 CUs.
That would make the XSX fast and narrow. It's relative.

It just depends on who's talking, from what perspective, etc. What Cerny said was true for the PS5; it doesn't mean he wouldn't have elected for 12 TF if their design had been different from the start.
Cerny was talking about 10 TF and the best way they felt to get to it, not that it's necessarily better in any way than 12 TF.
 
Do we know anything other than the clock speed and flops of the next-gen console GPUs? For example, could it be that both consoles have the same number of ROPs, in which case there could be use cases where the PS5 has an advantage, i.e. Cerny's narrow-and-higher-clock argument could be a real thing in some limited cases?

Both consoles most probably do have the same number of ROPs (Navi 10 with 36/40 CUs has 64 ROPs, and Microsoft has traditionally been conservative on ROP count), meaning the PS5 will have >22% higher fillrate.
Whether or not the PS5 can take advantage of that fillrate throughput given its bandwidth (considering it supposedly has to share its 448 GB/s with a large CPU, whereas Navi 10 doesn't) is a different story.
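
For reference, the ">22%" is just ROPs times clock, assuming both parts really do have 64 ROPs (a quick sketch with the announced peak clocks):

Code:
#include <cstdio>

int main()
{
    const double rops    = 64.0;     // assumed identical on both parts
    const double ps5_clk = 2.23;     // GHz, PS5 peak GPU clock
    const double xsx_clk = 1.825;    // GHz, XSX GPU clock

    const double ps5_fill = rops * ps5_clk;   // ~142.7 Gpixels/s
    const double xsx_fill = rops * xsx_clk;   // ~116.8 Gpixels/s

    std::printf("PS5 %.1f vs XSX %.1f Gpix/s, ratio %.2f\n",
                ps5_fill, xsx_fill, ps5_fill / xsx_fill);   // ratio ~1.22
    return 0;
}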
 
We went narrow and fast because we knew how fast the front end needs to be and we wanted better occupancy
When Cerny said that it was very challenging to have high occupancy, he may have been referring to GCN, as it's very difficult to keep GCN saturated given the nature of how it schedules work.
But AMD specifically set out to address this with RDNA.
The whole introduction of the white paper talks about improving the efficiency of sending work out to more CUs.

***
The new RDNA architecture is optimized for efficiency and programmability while offering backwards compatibility with the GCN architecture. It still uses the same seven basic instruction types: scalar compute, scalar memory, vector compute, vector memory, branches, export, and messages. However, the new architecture fundamentally reorganizes the data flow within the processor, boosting performance and improving efficiency.

In all AMD graphics architectures, a kernel is a single stream of instructions that operate on a large number of data parallel work-items. The work-items are organized into architecturally visible work-groups that can communicate through an explicit local data share (LDS). The shader compiler further subdivides work-groups into microarchitectural wavefronts that are scheduled and executed in parallel on a given hardware implementation. For the GCN architecture, the shader compiler creates wavefronts that contain 64 work-items.

When every work-item in a wavefront is executing the same instruction, this organization is highly efficient. Each GCN compute unit (CU) includes four SIMD units that consist of 16 ALUs; each SIMD executes a full wavefront instruction over four clock cycles. The main challenge then becomes maintaining enough active wavefronts to saturate the four SIMD units in a CU. The RDNA architecture is natively designed for a new narrower wavefront with 32 work-items, intuitively called wave32, that is optimized for efficient compute. Wave32 offers several critical advantages for compute and complements the existing graphics-focused wave64 mode.

One of the defining features of modern compute workloads is complex control flow: loops, function calls, and other branches are essential for more sophisticated algorithms. However, when a branch forces portions of a wavefront to diverge and execute different instructions, the overall efficiency suffers since each instruction will execute a partial wavefront and disable the other portions.

The new narrower wave32 mode improves efficiency for more complex compute workloads by reducing the cost of control flow and divergence. Second, a narrower wavefront completes faster and uses fewer resources for accessing data. Each wavefront requires control logic, registers, and cache while active. As one example, the new wave32 mode uses half the number of registers. Since the wavefront will complete quicker, the registers free up faster, enabling more active wavefronts. Ultimately, wave32 enables delivering throughput and hiding latency much more efficiently. Third, splitting a workload into smaller wave32 dataflows increases the total number of wavefronts. This subdivision of work items boosts parallelism and allows the GPU to use more cores to execute a given workload, improving both performance and efficiency.
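
As a hedged illustration of the divergence cost the excerpt describes, here's a tiny CUDA kernel using CUDA's 32-wide warps as a stand-in for wave32 (the kernel and the branch are made up purely for the example):

Code:
#include <cuda_runtime.h>

__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // If odd and even lanes of the same warp take different paths, the
    // hardware runs BOTH paths with the inactive lanes masked off each time:
    // the warp pays for path A plus path B.
    if ((i & 1) == 0)
        out[i] = sinf(in[i]) * 2.0f;      // path A
    else
        out[i] = cosf(in[i]) * 0.5f;      // path B

    // If the branch instead split on warp-sized boundaries, e.g.
    //   if ((i / 32) % 2 == 0) ...
    // each warp would execute only one path and no lanes would sit idle.
}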
 
When Cerny said that it was very challenging to have high occupancy, he may have been referring to GCN, as it's very difficult to keep GCN saturated given the nature of how it schedules work.
But AMD specifically set out to address this with RDNA.
The whole introduction of the white paper talks about improving the efficiency of sending work out to more CUs.

***
The new RDNA architecture is optimized for efficiency and programmability while offering backwards compatibility with the GCN architecture.


ALU occupancy is reportedly low on GCN in games (I read figures somewhere that shocked me, typically 40 to 60%, which is why Vega 56 and Vega 64 are indistinguishable at iso clocks), but do we have numbers on RDNA occupancy?
The Pascal GTX 1070 vs. Maxwell Titan X comparison tells us that narrower and faster-clocked gets somewhat better results on a similar architecture, despite the large bandwidth advantage of the latter (256 GB/s vs. 336 GB/s).
Cerny is right even regarding available bandwidth, though I doubt narrower + higher-clocked will fully make up for the 18% difference in compute and 25% in memory bandwidth.




BTW, does the Series X offer a low-level API or does everything still need to go through a virtual machine?
 
ALU occupancy is reportedly low on GCN in games (I read figures somewhere that shocked me, typically 40 to 60%, which is why Vega 56 and Vega 64 are indistinguishable at iso clocks), but do we have numbers on RDNA occupancy?
The Pascal GTX 1070 vs. Maxwell Titan X comparison tells us that narrower and faster-clocked gets somewhat better results on a similar architecture, despite the large bandwidth advantage of the latter (256 GB/s vs. 336 GB/s).
Cerny is right even regarding available bandwidth, though I doubt narrower + higher-clocked will fully make up for the 18% difference in compute and 25% in memory bandwidth.




BTW, does the Series X offer a low-level API or does everything still need to go through a virtual machine?
I can only suspect that occupancy is better, just looking at some examples of how shader code runs versus GCN. But I think it will be some time before that information is released; perhaps sometime next year, once RDNA goes mainstream with the release of the consoles, we will see something at GDC, I suspect.

Series X will still use a console-specific DX12 variant to offer slightly more access and features specific to the console hardware, but everything is still run through a virtual machine.
 
So I just heard about the Tempest engine being considered a "hardware accelerated" sound system... which may help free up CPU cycles for other, more important tasks.

Any technically minded folks in here wanna chime in for us dumb-dumbs? Benefits or weaknesses to this? I heard that it can sap bandwidth pretty badly?
 
Series X will still use a console-specific DX12 variant to offer slightly more access and features specific to the console hardware, but everything is still run through a virtual machine.
I believe on the XO, DX12 was low level and DX11.X was considered high level. I also think they pushed that idea on PC.

I wonder if that's still the case on the XSX, or if they're dropping DX11 totally now. I don't believe they've added RT support to DX11 on PC.
 
I can only suspect that occupancy is better, just looking at some examples of how shader code runs versus GCN. But I think it will be some time before that information is released; perhaps sometime next year, once RDNA goes mainstream with the release of the consoles, we will see something at GDC, I suspect.

Series X will still use a console-specific DX12 variant to offer slightly more access and features specific to the console hardware, but everything is still run through a virtual machine.

The closest comparison I can think of is maybe putting an RX 590 (36-CU Polaris) against an RX 5700 (36-CU Navi), although the latter has double the ROPs, and the clocks would have to be normalized for throughput (core and memory).

You'd have to have a really in-depth analysis for something more specific, though...
 