Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,277
    Likes Received:
    2,613
    Location:
    Wrong thread
    It's hopefully not all bad!

    Assuming next gen consoles are based on something like the 4xxx series APUs (Renoir), there's some good news in terms of latencies.

    https://www.anandtech.com/show/1570...k-business-with-the-ryzen-9-4900hs-a-review/2

    Inter-CCX cache accesses are faster for the monolithic APUs than for the chiplet designs. For the chiplet-based desktop processors, inter-CCX access goes off-chip even if the other CCX is on the same physical chiplet, as it's done via IF routed through the big "hub" I/O die containing the memory controller.

    So Renoir takes about 1/4 off the inter-CCX latency. Perhaps, in terms of games, this could claw back some IPC to offset the smaller L3. I suspect the huge L3 on the Ryzen 3xxx desktop parts is down to sharing a common chiplet with server-targeted products, and that for purely desktop and gaming purposes it's possibly not the optimal use of die area (not all workloads benefit equally from cache scaling).

    Anandtech (Ian Cutress writing) also had this to say about the smaller L3 in Renoir:

    It would also be interesting to know how main memory latency in the consoles compares to Renoir and Matisse, particularly under heavy load.

    If Infinity Fabric speed is tied to the memory clock as in Matisse, then 14 Gbps might be quite close to the 3733 MHz "sweet spot" that AMD talked about for that setup...?
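    For reference, the raw bandwidth arithmetic behind those figures can be sketched as below. The bus widths are assumptions for illustration: a 256-bit GDDR6 bus (the rumored PS5 setup) and a dual-channel (128-bit) DDR4 bus for Matisse.

```python
# Sketch of peak-bandwidth arithmetic: per-pin data rate times bus width,
# divided by 8 bits per byte. Bus widths here are assumed, not confirmed.

def bandwidth_gbs(per_pin_gbps, bus_width_bits):
    """Peak bandwidth in GB/s from per-pin rate (Gbps) and bus width (bits)."""
    return per_pin_gbps * bus_width_bits / 8

print(bandwidth_gbs(14, 256))     # GDDR6 @ 14 Gbps, 256-bit -> 448.0 GB/s
print(bandwidth_gbs(3.733, 128))  # DDR4-3733, dual channel  -> ~59.7 GB/s
```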
     
    Pete, chris1515 and Inuhanyou like this.
  2. Barrabas

    Regular Newcomer

    Joined:
    Jul 29, 2005
    Messages:
    316
    Likes Received:
    272
    Location:
    Norway
    It seems that one change we may see more of in games because of SSDs is greater use of unique animations for NPC characters. This is good for making, say, a city feel more real and lifelike, with people doing many different things, in contrast to "robots" marching around and bumping into each other. Maybe we will see more advanced animation sequences; say, a character you pass doing some paintwork on a wall suddenly falls off his ladder, then gets up again to brush off his clothes :wink:. I guess it all comes down to budget and the amount of work put into the games, but at least it seems SSDs open up more possibilities for this.

    "the SSD storage speed means we can offer many unique motion-captured animations"
    http://thisgengaming.com/2020/04/23...realistic-environments-unique-npc-animations/
     
  3. RobertR1

    RobertR1 Pro
    Legend

    Joined:
    Nov 2, 2005
    Messages:
    5,735
    Likes Received:
    926
    Maybe we can use this gen to get rid of the plastic shine on everything made famous by Unreal Engine.
     
  4. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,520
    Likes Received:
    782
  5. Barrabas

    Regular Newcomer

    Joined:
    Jul 29, 2005
    Messages:
    316
    Likes Received:
    272
    Location:
    Norway
    Was it not obvious in the article?
    "Whilst this was a discussion focused on the new Series X console, it’s safe to assume this applies to the PS5 as well"
     
    PSman1700 likes this.
  6. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,520
    Likes Received:
    782
    Yes, true. I was referring to the tweet below it, nvm :p
     
  7. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,578
    Likes Received:
    7,130
    Location:
    ಠ_ಠ
    That's almost entirely up to the artists & dev schedule for tweaking the shader inputs/outputs.

    There are examples of devs using UE3 or UE4 to provide a unique look, but it comes down to budget and direction.
     
  8. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,922
    Likes Received:
    2,636
    Being able to target a specific CPU architecture will mitigate the theoretical performance difference with PC parts, though. PC CPUs are designed to make even non-optimal code run fast. These consoles aren't going to benefit as much from a high single or few core turbo frequency, for example, since you'd expect console games to be trying to use all the available threads whenever possible.
     
  9. Mitchings

    Newcomer

    Joined:
    Mar 13, 2013
    Messages:
    113
    Likes Received:
    172
    I recall a lot of rumours regarding a reduction of the CPU's L2 cache to 1/2 or 1/4 of its desktop counterpart.

    With regard to the PS5 at least, would that really be a good idea, given the 448 GB/s that has to feed the CPU, GPU and Tempest while dealing with contention?

    I'd assume a fat cache would help mitigate the CPU's bandwidth requirements. It was my understanding that the larger caches on Zen 2 played a significant part in its performance gains over previous iterations.
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,525
    Likes Received:
    15,981
    Location:
    Under my bridge
    I don't think so. You still need to read the data into the CPU and write it out. Caches are there to reduce latency, not to help with bandwidth. You populate the cache with a chunk of working data to save direct reads from RAM, and larger caches mean fewer cache misses and fewer stalls, resulting in better performance. If you want to avoid accessing RAM, you need scratchpad memory like EDRAM, which the CPU works from directly, only writing the results of the workload out to RAM.

    Now if modern caches can do that and provide a transparent scratchpad, it would be beneficial, but that'd be news to me.
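    A toy model of that point: a cache saves repeat reads of a re-used working set, but the data still has to cross the bus at least once. The numbers below are hypothetical, and the model ignores writes, associativity and eviction; it's purely illustrative.

```python
# Toy model: bytes read from DRAM when a working set is scanned repeatedly.
# If the set fits in cache, only the first pass (compulsory misses) hits RAM;
# if not, every pass re-reads it. Ignores writes, associativity, eviction.

MB = 2 ** 20

def dram_read_traffic(working_set, passes, cache_size):
    """Bytes read from RAM when scanning `working_set` bytes `passes` times."""
    if working_set <= cache_size:
        return working_set          # later passes hit in cache
    return working_set * passes     # set doesn't fit; re-read every pass

print(dram_read_traffic(4 * MB, 10, 8 * MB) // MB, "MB")  # fits: 4 MB
print(dram_read_traffic(4 * MB, 10, 2 * MB) // MB, "MB")  # thrashes: 40 MB
```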
     
    Mitchings likes this.
  11. Vhatt

    Joined:
    Mar 19, 2020
    Messages:
    8
    Likes Received:
    5
    The discussion on CPU core caching for the XSX made me go back to a little counting I did on my own after the spec reveal. MS indicated that the total on-chip SRAM for the XSX APU was 76 MB. I was curious how that would break down between the CPU & GPU, so I did some counting based on available information and came up with the following:

    CPU: per core we have 64 KB L1 and 512 KB L2 for both desktop and mobile versions of Zen 2. The difference, as has been noted by others, is in the L3, where we have 32 MB for the desktop and 8 MB for the mobile version.
    GPU: potentially per CU we have 32 KB L0 and 128 KB L1. If I understood correctly, a 4 MB L2 is then shared across all the CUs (RDNA 2 hasn't launched yet, so the quoted cache sizes are from RDNA 1). (If I missed any other caches in the GPU, please let me know.)

    Based on the above, we could end up with the following:
    Door #1: desktop CPU (512 KB L1, 4.096 MB L2 and 32 MB L3) + 52 CU GPU (1.66 MB L0, 6.65 MB L1 & maybe 6 MB L2) for a total of 50.918 MB of cache.
    Door #2: mobile CPU (512 KB L1, 4.096 MB L2 and 8 MB L3) + 52 CU GPU (1.66 MB L0, 6.65 MB L1 & maybe 6 MB L2) for a total of 26.918 MB of cache.

    Variables:
    CPU: As was stated both in the DF architecture piece and by members here, the large L3 of the desktop CPU would be reduced. But would they reduce the L3 to the size of the mobile counterpart, or to something in between (16 MB)?
    GPU: For me this is the more interesting area, as RDNA 2 hasn't launched and its cache sizes are as yet unknown. Would doubling the cache sizes per CU increase performance and better manage the addition of RT? What cache type and size might the newly added RT parts of RDNA 2 require? Would MS add more cache for color information (let's say more L0 or L1, to help mitigate the slower I/O performance of their SSD)?

    Also, while this post is XSX-minded in nature, any caching changes for RDNA 2 would also apply to the PS5, as it is likewise RDNA 2 based; so whatever is speculated should apply in most part to each console, less specific choices by each company.
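    The tallies above can be reproduced within rounding as a few lines, using the post's decimal convention (1 MB = 1000 KB). The per-core and per-CU sizes are the Zen 2 / RDNA 1 figures quoted in the post; the 6 MB shared GPU cache for 52 CUs is the post's guess, not a confirmed RDNA 2 number.

```python
# Reproduces the "Door #1" / "Door #2" cache tallies. All sizes in MB,
# using 1 MB = 1000 KB as in the post. Cache sizes per core/CU are the
# quoted Zen 2 / RDNA 1 figures; the 6 MB GPU L2 is the post's guess.

def cpu_cache_mb(cores, l3_mb):
    return cores * 64 / 1000 + cores * 512 / 1000 + l3_mb   # L1 + L2 + L3

def gpu_cache_mb(cus, l2_mb):
    return cus * 32 / 1000 + cus * 128 / 1000 + l2_mb       # L0 + L1 + L2

door1 = cpu_cache_mb(8, 32) + gpu_cache_mb(52, 6)  # desktop-sized L3
door2 = cpu_cache_mb(8, 8) + gpu_cache_mb(52, 6)   # mobile-sized L3
print(f"Door #1: {door1:.3f} MB, Door #2: {door2:.3f} MB")  # ~50.93 / ~26.93
```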
     
  12. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,573
    Likes Received:
    14,165
    Location:
    Cleveland
    @TheAlSpark did you ever get more refinement on what/where the Cache actually is on SeriesX?
     
  13. Barrabas

    Regular Newcomer

    Joined:
    Jul 29, 2005
    Messages:
    316
    Likes Received:
    272
    Location:
    Norway
    I am under the impression that during the customization process they remove what's unnecessary and keep what's needed for the CPU and GPU. Why is there a reduction in cache in the console APUs? Cost? Not needed?
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    The L3 has been quartered for Renoir, but the L2 is the same. There's not much to remove from the L2.

    The L3 is a big consumer of die space for Zen 2. Much of the more general Zen to Zen 2 IPC improvement (not related to specialized changes like vector width) could be attributed to cache capacity, although the question faced by a constrained platform is: how much is a small percentage of performance worth, in terms of cost or of potentially lost area for other features?
    The large L3 matters more for server loads, while the workloads consoles experience may not have turned out to depend as significantly on capacity.

    The bandwidth savings vs area cost need to be weighed against what the console vendors expect CPU bandwidth needs to generally be. If a Zen 2 CCD consumes 10 GB/s in a given game, is 10 GB/s (edit: additional) out of 448 GB/s worth the die space?

    Old rule of thumb is that misses tend to fall with the square root of capacity. This affects a subset of all miss types, and there are loads that do not rely on cache much, so there would be diminishing returns.
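    That rule of thumb is easy to sketch: misses scale roughly with one over the square root of capacity (illustrative only; real workloads deviate, and as noted it only covers a subset of miss types).

```python
# Square-root rule of thumb: miss rate ~ 1/sqrt(cache capacity).
# Purely illustrative; the effective exponent varies by workload.
import math

def relative_miss_rate(old_mb, new_mb):
    """Factor by which misses change when capacity goes old -> new."""
    return math.sqrt(old_mb / new_mb)

print(relative_miss_rate(32, 8))   # quartering the L3 roughly doubles misses: 2.0
print(relative_miss_rate(32, 64))  # doubling it only cuts misses by ~29%
```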

    SRAM is a broadly used circuit type, not just for caches. The register files for the GPU are a large contributor, and there are many small buffers, internal caches, internal controllers, and registers throughout the chip. AMD has given SRAM totals for Vega GPUs well in excess of the commonly known register, cache, and LDS figures.
     
    Pete, Inuhanyou, function and 7 others like this.
  15. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,578
    Likes Received:
    7,130
    Location:
    ಠ_ಠ
    Not really, no.

    There's probably a bunch associated with infinity fabric / interconnect, the GPU front-end, display controllers/encoders/decoders that doesn't get the spotlight. Maybe a bunch of it is redundancy as well (apart from the disabled CUs).
     
  16. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,162
    Likes Received:
    5,463
    Unreal Engine 4.25 is the release version with support for next-gen consoles. There's a 4.25 Plus stream that will be kept up to date with features for next-gen releases this year. Ray tracing is out of beta.
     
    DSoup, PSman1700 and BRiT like this.
  17. disco_

    Newcomer

    Joined:
    Jan 4, 2020
    Messages:
    215
    Likes Received:
    170
    They use cache in desktop parts to help with the latency caused by the chiplet design. With consoles not using chiplets, I'd assume the latency issues aren't as pronounced and less cache is needed.
     
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,088
    Likes Received:
    2,955
    Location:
    Finland
    AMD's desktop cache size is dictated by Epyc, not because latencies would be suboptimal on desktop. The savings you'd get from cutting the cache in half, or even to a quarter, aren't worth the cost of developing a new chiplet for it.
     
    Dictator likes this.
  19. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,162
    Likes Received:
    5,463
    Game consoles shouldn't need large caches like desktops because they're not really multi-tasking like a PC and data accesses should be predictable. As long as devs are thinking about cache alignment of data, and making good use of cache line reads with linear data, a smaller cache should not be a big issue.
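    A quick back-of-envelope on why linear, packed data plays nicely with cache-line reads: count how many 64-byte lines a loop over one hot 4-byte field touches when the fields are packed back to back (structure-of-arrays) versus embedded one per 64-byte entity struct (array-of-structures). The sizes are hypothetical.

```python
# Counting 64-byte cache lines touched by a loop over one hot 4-byte field.
# Entity count and field size are hypothetical, for illustration.
import math

LINE = 64  # bytes per cache line (typical for x86/Zen)

def lines_for_soa(n, field_bytes, line=LINE):
    """Fields packed contiguously: many fields share each line."""
    return math.ceil(n * field_bytes / line)

def lines_for_aos(n):
    """One 64-byte struct per entity: each field drags in a whole line."""
    return n

print(lines_for_soa(4096, 4))  # 256 lines read
print(lines_for_aos(4096))     # 4096 lines read
```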
     
    vjPiedPiper, pharma, DSoup and 7 others like this.
  20. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,277
    Likes Received:
    2,613
    Location:
    Wrong thread
    Yep!

    The sheer range of cache sizes that best suit (bang for buck, proportion of die area) different workloads is really quite crazy.

    Just with different Zen 2 products, on the lower end you have something like Renoir, with what equates to 1MB of L3 per core (4 cores, 4MB per CCX). Reviews show Renoir to be leading edge for performance within its market segments.

    ... but on the other hand you have an absolute L3 belly-buster like the "Large cache" EPYC 7532:

    https://www.anandtech.com/show/15528/amd-expands-epyc-lineup-with-epyc-7662-epyc-7532-cpus

    32 cores and 256 MB of L3. That's 8 MB of L3 per core!!
     
    BRiT likes this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.