Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Status
Not open for further replies.
Why would you think that?

May I quote from the RDNA white paper?

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

"To accommodate the narrower wavefronts, the vector register file has been reorganized. Each vector general purpose register (vGPR) contains 32 lanes that are 32-bits wide, and a SIMD contains a total of 1,024 vGPRs – 4X the number of registers as in GCN. The registers typically hold single-precision (32-bit) floating-point (FP) data, but are also designed for efficiently handling mixed precision. For larger 64-bit (or double precision) FP data, adjacent registers are combined to hold a full wavefront of data. More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers."
Ok, thanks for the info. So maybe there's some other ML-oriented improvement in RDNA 2 that PS5 is missing.
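To make the packing described in the white-paper quote above concrete, here's a quick Python sketch (my own illustration, not anything from AMD): two half-precision values sharing one 32-bit register lane.

```python
# Illustration only: two FP16 values packed into one 32-bit word,
# mirroring how a 32-bit vGPR lane can hold two half-precision values.
import struct

def pack_2xfp16(a: float, b: float) -> int:
    """Pack two half-precision floats into a single 32-bit word."""
    raw = struct.pack('<ee', a, b)       # two IEEE 754 binary16 values
    return struct.unpack('<I', raw)[0]   # reinterpret as one uint32

def unpack_2xfp16(word: int) -> tuple:
    """Recover the two half-precision values from the packed word."""
    raw = struct.pack('<I', word)
    return struct.unpack('<ee', raw)

packed = pack_2xfp16(1.5, -2.0)
assert unpack_2xfp16(packed) == (1.5, -2.0)
```

The point is just that the register format holding packed data says nothing by itself about whether the ALUs can actually run packed math at an accelerated rate.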
 

That seems to be talking about the vector registers, which won't necessarily translate into the ability to do accelerated-rate operations on that data unless the ALUs support it. For example:

"Some variants of the dual compute unit expose additional mixed-precision dot-product modes in the ALUs, primarily for accelerating machine learning inference. A mixed-precision FMA dot2 will compute two half-precision multiplications and then add the results to a single-precision accumulator. For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations, all of which use 32-bit accumulators to avoid any overflows."

MS talked about their int8 and int4 stuff as being a customisation that they specifically requested. There's been no mention so far that Sony have requested the same, although they've been rather tight-lipped about certain aspects of the GPU.
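For what it's worth, the dot2 operation described in that passage is easy to sketch (plain Python floats standing in for the FP16 inputs and the FP32 accumulator; an illustration only, not AMD code):

```python
def fma_dot2(a, b, acc):
    """One mixed-precision dot2 FMA: two half-precision multiplies whose
    results are added into a single-precision accumulator.
    a and b are (lo, hi) pairs standing in for packed 32-bit register lanes;
    Python floats stand in for the FP16 inputs and FP32 accumulator."""
    return acc + a[0] * b[0] + a[1] * b[1]

acc = 0.0
acc = fma_dot2((1.5, 2.0), (4.0, 0.5), acc)   # 1.5*4.0 + 2.0*0.5 = 7.0
```

The win is that one instruction retires two multiply-accumulates, which is why these modes matter for inference throughput.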
 


"Some variants of the dual compute unit expose additional mixed-precision dot-product modes in the ALUs, primarily for accelerating machine learning inference. A mixed-precision FMA dot2 will compute two half-precision multiplications and then add the results to a single-precision accumulator. For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations, all of which use 32-bit accumulators to avoid any overflows."

Maybe Microsoft requested that ALL ALUs do this... but int8 and int4 are supported!
 

I thought the same, but I was pointed to the response from Andrew Goossen, who specifically says they added it.

"We knew that many inference algorithms need only 8-bit and 4-bit integer positions for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning."

RDNA variations that do or do not include this?
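The TOPS figures in the quote do line up with the public Series X specs, assuming 52 CUs with 64 FP32 lanes at 1.825 GHz and 4x/8x rate for int8/int4 dot products (my back-of-envelope arithmetic, not an official breakdown):

```python
# Back-of-envelope check of Goossen's 49/97 TOPS figures.
cus, lanes_per_cu, clock_ghz = 52, 64, 1.825   # public Series X GPU specs
alus = cus * lanes_per_cu                      # 3328 stream processors

tflops_fp32 = alus * 2 * clock_ghz / 1000      # FMA = 2 ops per cycle
tops_int8 = tflops_fp32 * 4                    # dot4: 4x the FP32 FMA rate
tops_int4 = tflops_fp32 * 8                    # dot8: 8x the FP32 FMA rate

print(round(tflops_fp32, 2), round(tops_int8), round(tops_int4))
# 12.15 49 97 -- matching the quoted 12 TFLOPS / 49 TOPS / 97 TOPS
```

So the quoted numbers are consistent with every ALU being able to issue these dot products.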
 


Apparently what they did was to add it to every single ALU... while RDNA only did that on some of the ALUs.
 

Normally you would have a feature on all of the ALUs on the GPU or on none of them. They tend to be uniform across the entire GPU, for the purpose of simplifying scheduling and load balancing (with the possible exception of the PS4 Pro, IIRC, which had some probably BC-related asymmetry across the two sides of the GPU).

So when the RDNA whitepaper talks about "Some variants of the dual compute unit" it's almost certainly talking about variants of the ALUs across different chip designs rather than different ALUs on the same GPU. Different customers and different segments of the market can have different requirements.

MS will have been talking to AMD for years about DirectML and their vision for inference acceleration. So whatever customisations MS have requested are likely to be absent in PS5. What the difference in capabilities will ultimately be I don't know, but it's likely to be across all ALUs on the respective chips.
 
This aspect is pretty critical. If you don't support mixed operations you're bound to run into overflow issues, so either you spend additional cycles resolving that or your network will fail.
It's possible that RDNA supports RPM for int8 and int4 by default, but how useful that is in practice is rather unknown.

FP16 is more than adequate for ML and will likely represent a majority of weights in a network. Mixed precision is even better, but that doesn't make FP16 useless just because another GPU can do mixed.
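The overflow point is easy to demonstrate. A toy model (mine, not hardware-accurate) of an int8 dot4 with a two's-complement accumulator of configurable width shows why the 32-bit accumulators the white paper mentions matter:

```python
def dot4_int8(a, b, acc, acc_bits=32):
    """Four int8 multiplies summed into a signed accumulator of acc_bits width.
    Narrow accumulators wrap around; a 32-bit one comfortably holds the sum."""
    total = acc + sum(x * y for x, y in zip(a, b))
    total &= (1 << acc_bits) - 1
    if total >= 1 << (acc_bits - 1):        # reinterpret as two's complement
        total -= 1 << acc_bits
    return total

a = b = [100, 100, 100, 100]                # int8 values (range -128..127)
print(dot4_int8(a, b, 0, acc_bits=32))      # 40000: fine in 32 bits
print(dot4_int8(a, b, 0, acc_bits=16))      # -25536: wrapped around, garbage
```

A single dot4 of large int8 values already exceeds a 16-bit accumulator, so without wide accumulation you'd be spending extra instructions guarding against wraparound.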
 

Well... I was going by the white paper...

Page 14:

"For even greater throughput, some ALUs will support 8-bit integer dot4 operations and 4-bit dot8 operations..."
 
PS5 may support reduced-precision formats; it doesn't follow that it supports RPM operations on them, though.

Going back to the reading-straight-into-the-GPU-from-the-SSD discussion from a couple of weeks ago.

I couldn't think of a why, a how, or a reason to do it.
But I was just thinking that one reason could be ML-compressed textures.
You could read them straight into the GPU and decompress into memory for use in the next frame, etc.
Could SFS work with such compressed textures?
 
Wasn't sure where to put this, but it went against what I thought the results would be.
In nearly all examples shown, XSX was faster; in some benchmarks, notably faster (like 2x).

I can't come up with any reasons. Going to need to wait to hear some thoughts from others.
edit: going to chalk this one up to BC restrictions on the CPU or something for PS5, with XSX and XSS running unlocked all the way for their BC. It's the only thing I can think of for now. The alternative would be much worse.

The real battleground for load times will be when those 3P games come out this week, but this was an interesting look at things.

article:
https://www.gamespot.com/articles/h...ompatibility-load-times-compare/1100-6484057/

BC load time comparisons between PS5 and XSX

Even compared against itself this is interesting, so I think there is a BC bottleneck somewhere, but I'm not sure where.

I'm a little perplexed.

 
PS5 will have to be faster in loading, for sure. Even if MS has a better software stack on top of it, better compression, etc., the SSD in PS5 should load faster. Now the question is whether it will be noticeable. That is, is a 2x difference going to be 2s vs 4s (which is pretty meaningless IMO), or will it actually be even smaller, since games are not 100% bound by data loading? In either case, I think the SSD will be the least of the differences next gen.

It will be GPU power and features, and the DualSense controller. If XSX, in the best case, outperforms PS5 by more than 20% and has a few additional bells and whistles, I'd say they've got a winner. If it's closer than that, and there is feature parity, I'd say Sony did the better job, given the DualSense.
 

Maybe inefficiencies in the I/O libraries used by BC titles for PS4/4Pro are showing up here, whereas the newer I/O libraries on PS5 are substantially better? Like you, I can't think of a solid reason for the discrepancies, nor why Series X would ever be faster.


From what I saw the timings seem to be as follows:

RDR 2
  • XSX: 1m4s
  • PS5: 1m5s
FFXV
  • XSX: 48s
  • PS5: 1m10s
Destiny 2
  • XSX: 42s
  • PS5: 57s
MH World
  • XSX: 35s
  • PS5: 51s
Arkham Knight
  • XSX: 58s
  • PS5: 1m7s
 
Raw boot times (going to need more tests, of course):
Once again, perplexing.
I'm wondering whether, for PS5 standby, it's the controller sync that's waiting to happen first.

Push Square times:
  • Cold Boot to User Login: 18.19 seconds
  • Rest Mode to User Login: 4.52 seconds
  • User Login to Main Menu and Suspended Game: 2.87 seconds

They're going to need to align on standby states to get a better line up if they want to do this.
 
This loading stuff is what I would consider a surprise more so than PS5 BC.

It may not last long, as in next gen titles may behave totally differently.
It really is a turn up for the books to be fair.

I wonder if BC imposes timings that have a direct effect on how the SSD is used.

But general boot times aren't faster either. I wonder if the I/O stack is limiting it, and whether we'll only really see its full potential when next-gen games access it via a different I/O API stack?

Edit: I removed a section that sounded a bit too console-warring...
 
I thought about this as being an issue, but it doesn't explain how the CPU runs so much faster in game, pushing framerates up to a locked 60fps, while at the same time barely loading faster.
Unless the CPU is locked and the GPU is the reason unlocked modes couldn't go higher. That's just perplexing, but we saw that in CPU-limited areas PS5 was fine.

It's really confusing. I think it may have to do with how many threads can access I/O at the same time: to get the full 5.5GB/s, you may need to send a lot of requests all at once. That's my idea right now; perhaps single-threaded throughput is not very fast.
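That guess is at least plausible from queueing arithmetic. Using Little's law with assumed (not measured) numbers, sustained bandwidth = in-flight requests x request size / latency, so hitting a 5.5 GB/s target needs several requests outstanding at once:

```python
def queue_depth_needed(bandwidth_gbs, request_kib, latency_us):
    """In-flight requests needed to sustain a target bandwidth, assuming
    each request of request_kib KiB completes in latency_us microseconds."""
    bytes_per_second = bandwidth_gbs * 1e9
    per_request_rate = (request_kib * 1024) / (latency_us * 1e-6)
    return bytes_per_second / per_request_rate

# Assumed numbers: 64 KiB reads, 100 us NVMe latency, 5.5 GB/s target.
print(queue_depth_needed(5.5, 64, 100))   # ~8.4 requests in flight
```

Under those same assumptions, a single thread issuing one blocking read at a time tops out around 0.65 GB/s, which would be consistent with single-threaded I/O being the bottleneck in BC titles.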
 