Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Strange, since AVX is highly calculation-intensive.
Where can I read that?
On "the road to PS5" Cerny mentioned that AVX 2.0 required to drop clocks. And it´s also a Zen 2.

On a normal CPU there's a base clock and a boost clock. The CPU will do what it can to reach the boost clock, but if it can't do that due to heat or power limits, it'll start dropping until it reaches the base clock.

For example, the Ryzen 3700x has a base clock of 3.6 GHz and a boost clock of up to 4.4 GHz.

In normal use, including games, it'll likely be in the 4.2-4.4 GHz range across all cores. Closer to 4.2 GHz if all cores are being highly stressed, closer to 4.4 GHz if only a few cores are being highly stressed.

It'll rarely drop to the base clock while in use. However, if AVX 2.0 is heavily leveraged, it's possible it might drop down to the 3.6 GHz base clock.
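
To put a face on "heavily leveraged", a rough sketch (nothing official, just illustrative C++): the worst-case sustained load is a tight loop of 256-bit FMAs, e.g.:

#include <immintrin.h>
#include <cstddef>

// A sketch, not anything from AMD or MS: back-to-back 256-bit FMAs are
// the sort of sustained load that pushes a Zen 2 core toward its base
// clock. Two independent accumulators keep the FMA pipes fed despite
// the latency of each fmadd. Build with e.g.: g++ -O2 -mavx2 -mfma
float avx2_fma_load(const float* a, const float* b, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);  // 8 mul-adds
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
    }
    // Horizontal sum of the 8 lanes
    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(acc0, acc1));
    return out[0] + out[1] + out[2] + out[3]
         + out[4] + out[5] + out[6] + out[7];
}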

The XBSX's locked frequency is likely to be analogous to the base clock of a Zen 2 CPU (like the 3700x). If the XBSX had a boost clock, then AVX 2.0 would certainly drop it down.

Since MS want consistency in performance when all aspects of the SOC are being used, the CPU and GPU are clocked conservatively so that developers can still rely on the performance even if they are approaching full occupancy of the CPU and GPU.

Basically assume that the XBSX is using the base clocks of a Zen 2 CPU rather than the boosted clocks.

The same goes for the GPU. MS is taking the approach of a locked, deterministic clock on the GPU, similar to desktop GPUs before NV (and later AMD) introduced boost clocks.

There are benefits beyond consistent performance, however. Going conservative with clocks means that MS are likely to have better yields on the SOC and potentially a lower power consumption due to being below the "knee of the curve" WRT power and frequency scaling. Both of those will help reduce the overall cost of the console to MS.

The downside is that, depending on how the CPU is used, there is untapped performance left unused compared to clocking the SOC closer to the limits of the silicon, as Sony is doing with the PS5.

Regards,
SB
 
Not on AMD according to AMD.

Found something about it...

"Ryzen doesn't throttle on AVX code because it doesn't have the hardware to do a whole 256 bit instruction on a single clock. Ryzen has 4 FP pipes, each 128 bit in width. A single 256 bit AVX instruction is processed in halves while not incurring in any penalty, power or clock speed wise. Obviously the disadvantage of this approach is half the AVX/2 throughput vs Intel's cores. AMD made their design decision for Ryzen not to go full 256 bit on AVX execution capability because of power, thermals, die size, and cost."

Read more here


I wonder why Cerny mentioned a downclock then?
 
That's for Zen 1. ;)

https://www.hotchips.org/hc31/HC31_1.1_AMD_ZEN2.pdf

Widths were doubled for the FP paths (page 9).

Highlights of the Zen 2 design include a different L2 branch predictor known as a TAGE predictor, a doubling of the micro-op cache, a doubling of the L3 cache, an increase in integer resources, an increase in load/store resources, and support for single-operation AVX-256 (or AVX2). AMD has stated that there is no frequency penalty for AVX2, based on its energy aware frequency platform.
https://www.anandtech.com/show/14525/amd-zen-2-microarchitecture-analysis-ryzen-3000-and-epyc-rome/6
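
Putting rough numbers on that doubling (back-of-envelope, using the commonly cited pipe widths):

Zen 1: 256-bit FMA split into 2 x 128-bit µops -> peak 2 x 4 FP32 lanes x 2 ops (mul+add) = 16 FLOPs/cycle per core
Zen 2: native 256-bit FMA pipes -> peak 2 x 8 FP32 lanes x 2 ops = 32 FLOPs/cycle per core

Same instruction stream, double the per-clock throughput, and per AMD no AVX offset needed to get there.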
 
Thank you very much for the info. I had not noticed it was Zen 1.

But if there is no frequency penalty, why did Cerny mention a frequency drop for AVX 2 use?
Seriously, it's not that hard.

Zen and Zen 2 have no AVX offset. Everything is done on real-time package information. The retail SKUs are designed so that the base clock is the "power virus" clock. Everything over that is opportunistic. Intel, on the other hand, sets their base clock at all-core 128-bit SIMD; when you get enough 256-bit load (Intel has a complex 128-bit to 256-bit data path migration based on load; yes, Intel "cracks" 256-bit ops into 128-bit ops when it suits them), the AVX offset kicks in and lowers clocks. Even more so for AVX-512, which in many cases leads AVX-512 to have worse performance than 256-bit AVX/2, because all the scalar code is now running at a much lower clock and getting none of the throughput benefits.

Go look at the EPYC 64-core reviews: the base clock is a low 2.0 GHz, yet for 128-bit code all 64 cores run at 3.2 GHz!
 
Not sure I agree on this.
It looks to me that Oodle Texture size = BCn size, since it just reorganizes the data in the texture.

Oodle Texture + Kraken = win for SSD space and IO from SSD, but Kraken decompresses to memory (right?). So in memory it's back to the original 127 MB BCn size. And that 127 MB BCn texture will use bandwidth between GPU and memory.
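
To make the memory-side point concrete, a hypothetical sketch of the load path (every function name below is made up; the real Oodle API, e.g. OodleLZ_Decompress, and the engine plumbing look different):

#include <cstddef>
#include <vector>

// Placeholder declarations: stand-ins for the real Oodle API and the
// engine's file/GPU plumbing, purely for illustration.
std::vector<unsigned char> read_file(const char* path);
std::size_t kraken_decompress(const unsigned char* src, std::size_t srcLen,
                              unsigned char* dst, std::size_t dstCap);
void upload_to_gpu(const unsigned char* data, std::size_t len);

void load_texture() {
    // On disk / over the SSD: Kraken-compressed, RDO-massaged BC7,
    // noticeably smaller than the raw 127 MB.
    std::vector<unsigned char> packed = read_file("texture.krak");

    // In RAM: decompression restores the full-size BCn data, so the GPU
    // still reads the whole 127 MB over the memory bus at sample time.
    std::vector<unsigned char> bc7(127u * 1024u * 1024u);
    kraken_decompress(packed.data(), packed.size(), bc7.data(), bc7.size());
    upload_to_gpu(bc7.data(), bc7.size());
}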
Yes, bandwidth seems the weak point of PS5... anyway, I'm impressed by this tech!!!
 
Going conservative with clocks means that MS are likely to have better yields on the SOC
Microsoft / AMD still have to guarantee AVX-256 at 3.6 GHz / 3.8 GHz, which is not conservative and doesn't guarantee better yields.
 
Yes, bandwidth seems the weak point of PS5... anyway, I'm impressed by this tech!!!

This helps with I/O bandwidth, not memory bandwidth, apparently. I wonder how AMD has improved memory bandwidth utilization since GCN; I remember NVIDIA introduced some improvements for Maxwell (they increased the cache, if I remember correctly).

Following the AVX-256 topic: didn't the 3700x do pretty well in power consumption in the stress tests with AVX enabled? I know it's not the same CPU, but console CPU frequencies are already lower compared to their retail counterparts.

 
https://www.jonolick.com/home/oodle-texture-bc6h-expose

Another blog post about Oodle Texture by another RAD Game Tools employee, Jon Olick.

Some explanation about BC7Prep:
https://cbloomrants.blogspot.com/2020/06/oodle-texture-bc7prep-data-flow.html
So BC7Prep is just stuff being reordered by the GPU before it's written to memory for use, for a 5-15% gain over BC7+RDO+Kraken. That could be an in-line process from the IO processor, depending on how flexible it is. Or it could be a good secondary use of the Tempest block, if the IO processor has a data path directly to the Tempest processor (like the Cell and RSX).

I can't really see the difference between the lossless images and the BC7 lambda 40 images. Can it go above 40 or is it somehow a hard mathematical limit of the algorithm?
 
I agree, but the textures they used are not very high frequency. They are already quite blurred. We need comparisons with other kinds of textures before drawing any conclusions.
 
So BC7Prep is just stuff being reordered by the GPU before it's written to memory for use, for a 5-15% gain over BC7+RDO+Kraken.
Also this reordering can be done with compute so PCs will have some leverage if you used this combo with a regular SSD.
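
A rough host-side sketch of that two-stage flow on PC (all identifiers hypothetical; this just marks where the compute pass sits, not any real Oodle or graphics API):

#include <cstddef>

// Placeholder types and functions; none of these are real Oodle or
// D3D/Vulkan APIs, they just mark the two stages of the decode.
struct GpuBuffer;
struct GpuTexture;
std::size_t kraken_decompress_to(const unsigned char* src, std::size_t srcLen,
                                 GpuBuffer& dst);
void dispatch_bc7prep_to_bc7(GpuBuffer& src, std::size_t len, GpuTexture& dst);

void stream_bc7prep_chunk(const unsigned char* packed, std::size_t packedLen,
                          GpuBuffer& scratch, GpuTexture& dest) {
    // Stage 1: Kraken undoes the entropy coding (hardware unit on PS5,
    // CPU software or GPU on PC), yielding the BC7Prep-reordered payload.
    std::size_t prepLen = kraken_decompress_to(packed, packedLen, scratch);

    // Stage 2: a small compute dispatch re-swizzles BC7Prep back into
    // GPU-native BC7 blocks where the texture lives.
    dispatch_bc7prep_to_bc7(scratch, prepLen, dest);
}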
 
It will be interesting to see what the difference in performance will be for:

Hw decompress + gpu compute transform

Vs

Software decompress + cpu transform

Or rather, how the two approaches impact available resources for other tasks.
 
BC7Prep + GPU decode on PS5 benchmark = 60-120 GB/s bandwidth (best-case scenario with 100% of the GPU used, which is not the normal way, but still fun):
It's not so much fun as meaningless. A random peak number for an unexplained activity isn't valuable technical discussion. What's the workload and what's the game application? Does this improve storage transfer or RAM use (is it a run-time decompressor or on-load decompressor)?

Please post more than unexplained, contextless tweets.
 

I think, reading the full conversation, the two interesting facts are: confirmation that BC7Prep is usable on top of RDO and Kraken, and that the GPU compute cost to transform it to BC7 is low, because the PS5 SSD with compressed data is nowhere near 60-120 GB/s.
 
That's what I think I understood too. The 60-120 metric allows a quick glance at the time slice required from the GPU based on the amount of BC7Prep data going in.

3 GB/s to 6 GB/s of BC7Prep textures would take something like a 5% time slice from the GPU. The big variation between 60 and 120 is hinted to be about block size; he says it's better to go with at least 256 KB chunks. Does it mean small chunks would be closer to 60, and large chunks closer to 120? Or is the data itself causing variations?
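
The 5% falls out of a simple ratio (back-of-envelope):

GPU time slice ≈ BC7Prep input rate ÷ GPU decode throughput
3 GB/s ÷ 60 GB/s = 5%
6 GB/s ÷ 120 GB/s = 5%

(and anywhere from 2.5% to 10% if input rate and throughput pair up the other way around)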

We're still at a point where there's a lack of consensus about what the next-gen real-world data type distribution will look like. Hard to tell how much of a game's streaming data would be worth using BC7Prep with, versus just BC7+RDO+Kraken, or whether devs will gravitate more towards BC6H.
 