Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

I assume the improvement was meant in comparison to the current consoles. Compared to desktop parts, from what I've seen these are pretty standard Zen 2s with no boost, maybe reduced cache, and possibly less optimal memory access.

In terms of clocks it's a little slower than a Ryzen 3700X without boost.

Of course, they strip the things that have little if any impact on gaming performance, but they also needed to reduce clock speeds and cache sizes. They did the best they could, and it's probably the right fit for these consoles. They won't be as fast as full-fat desktop CPUs: first, you can't expect that, and second, it probably wasn't needed either. If heat, die size and cost weren't a problem, they could have slapped in a 3950X or something for a full 16c/32t at 4 GHz or higher.

It's hopefully not all bad!

Assuming next gen consoles are based on something like the 4xxx series APUs (Renoir), there's some good news in terms of latencies.

https://www.anandtech.com/show/1570...k-business-with-the-ryzen-9-4900hs-a-review/2

Inter CCX cache accesses are faster for the monolithic APUs than the chiplet designs. For the chiplet based desktop processors, inter-CCX access goes off-chip even if the other CCX is on the same physical chiplet, as it's done via IF routed through the big "hub" IO chip containing the memory controller.

So Renoir takes about 1/4 off the inter-CCX latency. Perhaps in games this could make up somewhat, in terms of IPC, for having less L3. I suspect the huge L3 on the Ryzen 3xxx desktop parts is due to sharing a common chiplet with server-targeted products, and that for purely desktop and gaming purposes it's possibly not the optimal use of die area (not all workloads benefit equally from cache scaling).
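If anyone wants to measure this on their own Zen 2 machine, below is a minimal Linux-only sketch of the usual core-to-core ping-pong probe. The core IDs are placeholders; which pairs sit in the same CCX depends on the specific part's topology.

```cpp
// Core-to-core round-trip probe: two pinned threads bounce a value
// through a shared atomic. Build: g++ -O2 -pthread -D_GNU_SOURCE
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <pthread.h>
#include <sched.h>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    constexpr int kIters = 1'000'000;
    std::atomic<int> flag{0};

    std::thread responder([&] {
        pin_to_core(4);  // placeholder: pick a core in the other CCX
        for (int i = 0; i < kIters; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}
            flag.store(0, std::memory_order_release);
        }
    });

    pin_to_core(0);      // placeholder: a core in the first CCX
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        flag.store(1, std::memory_order_release);
        while (flag.load(std::memory_order_acquire) != 0) {}
    }
    auto t1 = std::chrono::steady_clock::now();
    responder.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg round trip: %.1f ns\n", ns / kIters);
}
```

Comparing a same-CCX pair against a cross-CCX pair should show the gap the Anandtech numbers describe.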

Anandtech (Ian Cutress writing) also had this to say about the smaller L3 in Renoir:

For Renoir, AMD decided to minimize the amount of L3 cache to 1 MB per core, compared to 4 MB per core on the desktop Ryzen variants and 4 MB per core for Threadripper and EPYC. The reduction in the size of the cache does three things: (a) makes the die smaller and easier to manufacture, (b) makes the die use less power when turned on, but (c) causes more cache misses and accesses to main memory, causing a slight performance per clock decrease.

With (c), normally doubling (2x) the size of the cache gives a square root of 2 decrease in cache misses. Therefore going down from 4 MB on the other designs to 1 MB on these designs should imply that there will be twice as many cache misses from L3, and thus twice as many memory accesses. However, because AMD uses a non-inclusive cache policy on the L3 that accepts L2 cache evictions only, there’s actually less scope here for performance loss.
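To put rough numbers on that quoted rule of thumb (assuming misses scale with 1/sqrt(capacity), as stated):

```cpp
// Back-of-envelope for the sqrt rule of thumb: misses ~ 1/sqrt(capacity),
// so cutting L3 from 4 MB to 1 MB per core roughly doubles L3 misses.
#include <cmath>
#include <cstdio>

int main() {
    const double desktop_l3_mb = 4.0;  // per core, Matisse
    const double renoir_l3_mb  = 1.0;  // per core, Renoir
    double miss_multiplier = std::sqrt(desktop_l3_mb / renoir_l3_mb);
    std::printf("expected L3 miss multiplier: %.1fx\n", miss_multiplier);  // 2.0x
}
```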

It would also be interesting to know how main memory latency in the consoles compares to Renoir and Matisse, particularly under heavy load.

If Infinity Fabric speed is tied to the memory clock as in Matisse, then 14 Gbps GDDR6 might be quite close to the DDR4-3733 "sweet spot" that AMD talked about for that setup...?
 
It seems that one change we may see more of in games because of SSDs is greater use of unique animations for NPC characters. This is good for making, say, a city feel more real and lifelike, with people doing many different things, in contrast to "robots" marching around and bumping into each other. Maybe we will see more advanced animation sequences: say, a character doing some paintwork on a wall suddenly falls off his ladder, gets up again and brushes off his clothes ;). I guess it all comes down to budget and the amount of work put into the games, but at least it seems SSDs open up more possibilities for this.

"the SSD storage speed means we can offer many unique motion-captured animations"
http://thisgengaming.com/2020/04/23...realistic-environments-unique-npc-animations/
 
I know... that's what I originally suspected, that performance would be somewhat inferior to desktop chips, but some disagreed, so...

Being able to target a specific CPU architecture will mitigate the theoretical performance difference with PC parts, though. PC CPUs are designed to make even non-optimal code run fast. These consoles aren't going to benefit as much from a high single or few core turbo frequency, for example, since you'd expect console games to be trying to use all the available threads whenever possible.
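As a trivial sketch of that point (the job function here is hypothetical):

```cpp
// On a fixed console target the worker count is known up front; a PC
// build has to discover it and cope with whatever it finds.
#include <cstdio>
#include <thread>
#include <vector>

void do_slice_of_frame_work(unsigned worker) {
    std::printf("worker %u busy\n", worker);  // placeholder workload
}

int main() {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 8;  // fallback when the count is unknown

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back(do_slice_of_frame_work, i);
    for (auto& t : pool) t.join();
}
```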
 
I recall a lot of rumours regarding a reduction in L2 Cache on the CPU to 1/2 or 1/4 of its desktop counterparts.

In regards to PS5 at least, would that really be a good idea given the 448 GB/s that has to feed the CPU, GPU and Tempest while dealing with contention?

I'd assume a fat cache would help mitigate the CPU bandwidth requirements. It was my understanding that the larger caches on Zen 2 played a significant part in its performance gains over previous iterations.
 
I'd assume a fat cache would help mitigate the CPU bandwidth requirements.
I don't think so. You still need to read the data into the CPU and write it out. Caches are there to reduce latency, not to help with bandwidth. You populate the cache with a chunk of working data to save direct reads from RAM, and large caches mean fewer cache misses and fewer stalls, resulting in better performance. If you want to avoid accessing RAM, you need scratchpad memory like EDRAM, where the CPU will work from and only write the results of the workload out to RAM.

Now if modern caches can do that and provide a transparent scratchpad, it would be beneficial, but that'd be news to me.
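That said, you can get some scratchpad-like behaviour out of an ordinary cache just by blocking the work into cache-sized chunks, so later passes over a chunk hit cache instead of RAM. A rough sketch; the tile size is a guess at a safe L2-resident footprint:

```cpp
// Cache blocking: instead of making several full passes over a large
// array (each streaming from RAM), do all passes over one cache-sized
// tile before moving on.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kTileFloats = 64 * 1024;  // 256 KB, assumed L2-resident

void process_blocked(std::vector<float>& data) {
    for (std::size_t base = 0; base < data.size(); base += kTileFloats) {
        std::size_t end = std::min(base + kTileFloats, data.size());
        // Pass 1: scale. The tile is pulled into cache here...
        for (std::size_t i = base; i < end; ++i) data[i] *= 0.5f;
        // Pass 2: clamp. ...and is likely still resident here, so this
        // pass costs little extra RAM bandwidth.
        for (std::size_t i = base; i < end; ++i)
            data[i] = std::min(data[i], 1.0f);
    }
}
```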
 
The discussion on the CPU core caching for the XSX made me go back to a little counting I did on my own from the spec reveal. MS indicated that the total onboard cache for the XSX APU was 76 MB of SRAM. I was curious as to how that would be broken down for the CPU & GPU, so I did some counting based on available information and came up with the following:

CPU: per core we have 64 KB L1 and 512 KB L2 for both desktop and mobile versions of Zen 2. The difference, as has been noted by others, is in the L3, where we have 32 MB for the desktop and 8 MB for the mobile version.
GPU: potentially per CU we have 32 KB L0 and 128 KB L1. If I understood correctly, they then allocate 4 MB of L2 which is shared across all the CUs (RDNA 2 hasn't launched yet, so the quoted cache sizes are from RDNA 1). (If I missed any other caches in the GPU please let me know)

Based on the above, we could end up with the following (a quick sketch reproducing this arithmetic follows below):
Door #1: Desktop CPU (512 KB L1, 4.096 MB L2 and 32 MB L3) + 52 CU GPU (1.66 MB L0, 6.65 MB L1 & maybe 6 MB L2) for a total of 50.918 MB of cache.
Door #2: Mobile CPU (512 KB L1, 4.096 MB L2 and 8 MB L3) + 52 CU GPU (1.66 MB L0, 6.65 MB L1 & maybe 6 MB L2) for a total of 26.918 MB of cache.
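Here's the counting above as a quick sketch, so the assumptions are easy to swap around. The per-CU figures are the RDNA 1 guesses stated above; the totals come out a touch lower than mine because KB is converted to MB consistently at 1024:

```cpp
// Recomputes the Door #1 / Door #2 cache totals. All sizes in KB.
#include <cstdio>

int main() {
    const int cores = 8, cus = 52;
    const double l1_core = 64, l2_core = 512;           // Zen 2, per core
    const double l0_cu = 32, l1_cu = 128;               // RDNA 1 assumption, per CU
    const double gpu_l2 = 6 * 1024;                     // speculative 6 MB shared L2

    double cpu_common = cores * (l1_core + l2_core);    // L1 + L2, both doors
    double gpu_total  = cus * (l0_cu + l1_cu) + gpu_l2;

    double door1 = cpu_common + 32 * 1024 + gpu_total;  // desktop 32 MB L3
    double door2 = cpu_common +  8 * 1024 + gpu_total;  // mobile 8 MB L3
    std::printf("Door #1: %.3f MB\n", door1 / 1024);    // ~50.6 MB
    std::printf("Door #2: %.3f MB\n", door2 / 1024);    // ~26.6 MB
}
```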

Variables:
CPU: As was stated both in the DF arch piece and by members here, the large L3 of the desktop CPU would be reduced. But would they reduce the L3 to the size of its mobile counterpart, or something more (16 MB)?
GPU: For me this is the more interesting area, as RDNA 2 hasn't launched and the cache sizes are as yet unknown. Would a doubling of the cache sizes for each CU increase performance and better manage the addition of RT? What cache type and size could the newly added RT parts of RDNA 2 require? Would MS add more cache for color information (let's say more L0 or L1, to help mitigate the slower I/O performance of their SSD)?

Also, while this post is XSX-minded in nature, any caching changes for RDNA 2 would also apply to the PS5, as it is RDNA 2 based too, so whatever is speculated should apply in most part to each console, apart from specific choices by each company.
 
I am under the impression that during the customization process they remove what's unnecessary and keep what's needed for the CPU and GPU. Why is there a reduction in cache in the console APUs? Cost? Not needed?
 
I recall a lot of rumours regarding a reduction in L2 Cache on the CPU to 1/2 or 1/4 of its desktop counterparts.
The L3 has been quartered for Renoir, but the L2 is the same. There's not much to remove from the L2.

In regards to PS5 at least, would that really be a good idea given the 448 GB/s that has to feed the CPU, GPU and Tempest while dealing with contention?

I'd assume a fat cache would help mitigate the CPU bandwidth requirements. It was my understanding that the larger caches on Zen 2 played a significant part in its performance gains over previous iterations.
The L3 is a big consumer of die space for Zen 2. Much of the more general Zen to Zen 2 IPC improvement (not related to specialized changes like vector width) could be attributed to cache capacity, although the question faced by a constrained platform is how much a small percentage of performance is worth in terms of cost, or of potentially lost area for other features.
The large L3 matters more for server loads, while the impact for the workloads consoles experience may not have turned up as significant a dependence on capacity.

The bandwidth savings versus the area cost need to be weighed against what the console vendors expect CPU bandwidth needs to generally be. If a Zen 2 CCD consumes 10 GB/s in a given game, is 10 GB/s (edit: additional) out of 448 GB/s worth the die space?
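As a rough framing of that question (the 10 GB/s figure is purely illustrative):

```cpp
// What share of the unified bus does the CPU's DRAM traffic occupy, and
// how much would halving its miss traffic (via a larger L3) buy back?
#include <cstdio>

int main() {
    const double bus_gbs = 448.0;  // PS5 GDDR6 bandwidth, GB/s
    const double cpu_gbs = 10.0;   // assumed CPU DRAM traffic in a game
    std::printf("CPU share of bus: %.1f%%\n", 100.0 * cpu_gbs / bus_gbs);
    // Even halving CPU misses frees only ~1% of the bus for the GPU;
    // the die area for the extra L3 may be better spent elsewhere.
    std::printf("freed by halving it: %.1f%%\n",
                100.0 * (cpu_gbs / 2.0) / bus_gbs);
}
```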

I don't think so. You still need to read the data into the CPU and write it out. Caches are there to reduce latency, not to help with bandwidth. You populate the cache with a chunk of working data to save direct reads from RAM, and large caches mean fewer cache misses and fewer stalls, resulting in better performance. If you want to avoid accessing RAM, you need scratchpad memory like EDRAM, where the CPU will work from and only write the results of the workload out to RAM.

Now if modern caches can do that and provide a transparent scratchpad, it would be beneficial, but that'd be news to me.
An old rule of thumb is that misses tend to fall with the square root of capacity. This only affects a subset of all miss types, and there are loads that don't rely on cache much, so there would be diminishing returns.

The discussion on the CPU core caching for the XSX made me go back to a little counting I did on my own from the spec reveal. MS indicated that the total onboard cache for the XSX APU was 76 MB of SRAM. I was curious as to how that would be broken down for the CPU & GPU, so I did some counting based on available information and came up with the following:

CPU: per core we have 64 KB L1 and 512 KB L2 for both desktop and mobile versions of Zen 2. The difference, as has been noted by others, is in the L3, where we have 32 MB for the desktop and 8 MB for the mobile version.
GPU: potentially per CU we have 32 KB L0 and 128 KB L1. If I understood correctly, they then allocate 4 MB of L2 which is shared across all the CUs (RDNA 2 hasn't launched yet, so the quoted cache sizes are from RDNA 1). (If I missed any other caches in the GPU please let me know)
SRAM is a broadly used circuit type, not just for caches. The register files for the GPU are a large contributor, and there are many small buffers, internal caches, internal controllers, and registers throughout the chip. AMD has given SRAM totals for Vega GPUs well in excess of the universally recognized register, cache, and LDS totals.
 
@TheAlSpark did you ever get more refinement on what/where the Cache actually is on SeriesX?
Not really, no.

There's probably a bunch associated with infinity fabric / interconnect, the GPU front-end, display controllers/encoders/decoders that doesn't get the spotlight. Maybe a bunch of it is redundancy as well (apart from the disabled CUs).
 
Why is there a reduction in cache in the console APUs? Cost? Not needed?
They use cache in desktop parts to help with the latency caused by the chiplet design. With consoles not using chiplets, I'd assume the latency issues aren't as pronounced and less cache is needed.
 
They use cache in desktop parts to help with the latency caused by the chiplet design. With consoles not using chiplets, I'd assume the latency issues aren't as pronounced and less cache is needed.
AMD's desktop cache size is dictated by EPYC, not because latencies would be suboptimal on desktop. The savings you'd get from cutting the cache in half or even to a quarter aren't worth the cost of developing a new chiplet for it.
 
Game consoles shouldn't need large caches like desktops because they're not really multi-tasking like a PC, and data accesses should be predictable. As long as devs are thinking about cache alignment of data and making good use of cache-line reads with linear data, a smaller cache should not be a big issue.

Yep!

The sheer range of cache sizes that best suit (bang for buck, proportion of die area) different workloads is really quite crazy.

Just within the Zen 2 product range: on the lower end you have something like Renoir, with what equates to 1 MB of L3 per core (4 cores and 4 MB of L3 per CCX). Reviews show Renoir to be leading edge for performance within its market segments.

... but on the other hand you have an absolute L3 belly-buster like the "Large cache" EPYC 7532:

https://www.anandtech.com/show/15528/amd-expands-epyc-lineup-with-epyc-7662-epyc-7532-cpus

32 cores and 256 MB of L3. That's 8 MB of L3 per core!!
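On the cache-alignment point quoted above, a rough illustration: splitting hot fields out of a fat struct so a tight loop only streams the bytes it actually uses (the names are made up).

```cpp
// AoS vs SoA: with SoA, a position update reads pure linear streams, so
// every 64-byte cache line fetched is fully used and prefetch is trivial.
#include <cstddef>
#include <vector>

// AoS for contrast: updating positions drags the cold payload through
// the cache alongside the 24 bytes the loop actually needs.
struct NpcAoS {
    float pos[3];
    float vel[3];
    char  cold[104];  // AI state, cosmetics... untouched by this loop
};

// SoA: hot fields packed together, cold data kept elsewhere.
struct NpcsSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

void update_positions(NpcsSoA& n, float dt) {
    const std::size_t count = n.px.size();
    for (std::size_t i = 0; i < count; ++i) {
        n.px[i] += n.vx[i] * dt;
        n.py[i] += n.vy[i] * dt;
        n.pz[i] += n.vz[i] * dt;
    }
}
```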
 