Do you have any insight into the idleness of the eight cores of the PS4/XBox Jaguar processors as they wait for main memory?
I don't have any benchmarks for those. For either console, it should be closer to 6 cores, since the other two appear to be reserved. There aren't exact numbers, but Durango was listed as having ~190 cycles of main memory latency and Orbis 220+.
That's well beyond Jaguar's ability to reorder around a stall. At 2 instructions per cycle, its 64 op window can only last 32 cycles.
Splitting the difference at ~200 cycles, that's ~168 cycles of nothing.
If the scenario is one miss to main memory every 200 cycles, it yields ~0.32 IPC, or 16% of peak IPC, assuming everything else is perfect (it isn't).
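Just to put that arithmetic in one place, here's a quick sketch (pure toy model; the 2-wide issue, 64-op window, and ~200-cycle latency are the numbers above, and everything else is idealized):

```python
# Toy model: one miss to main memory every LATENCY cycles, everything else ideal.
PEAK_IPC = 2       # Jaguar is 2-wide
WINDOW_OPS = 64    # reorder window depth, in ops
LATENCY = 200      # assumed round trip to main memory, in cycles

covered = WINDOW_OPS / PEAK_IPC          # 32 cycles the window can hide
stalled = LATENCY - covered              # ~168 cycles of nothing
effective_ipc = WINDOW_OPS / LATENCY     # 64 ops retired per 200 cycles = 0.32
print(covered, stalled, effective_ipc, effective_ipc / PEAK_IPC)
# 32.0 168.0 0.32 0.16  -> 16% of peak
```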
The point of caches is to keep that worst-case scenario from happening so much.
As an iffy proxy for Jaguar's L1, a 32KB Intel cache was profiled at a 5-10% L1 miss rate in SPEC 2k.
The per-core share of Jaguar's L2 is 16x bigger than the L1, which, assuming a square-root relationship between miss rate and capacity, means the fraction of accesses that miss all the way to main memory should be 1.25-2.5% (L2 misses come out of L1 misses).
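For the sake of showing my work, here's that scaling estimate as a sketch (the square-root rule of thumb is the big assumption; the cache sizes and L1 miss rates are the figures above):

```python
import math

L1_KB = 32
L2_PER_CORE_KB = 512                  # Jaguar's 2MB L2 shared by a 4-core module -> 512KB/core
l1_miss_rates = (0.05, 0.10)          # 32KB Intel L1 profiled on SPEC 2k, used as a proxy

scale = math.sqrt(L2_PER_CORE_KB / L1_KB)            # 16x the capacity -> ~4x fewer misses
memory_miss_rates = tuple(m / scale for m in l1_miss_rates)
print(memory_miss_rates)              # (0.0125, 0.025) -> 1.25-2.5% of accesses go to memory
```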
I'm unfortunately hopping between SPEC versions here, but benchmarks in a more recent version had 14-40% of their instruction mix composed of memory loads.
I'm mangling the math by combining the lowest of both ranges and the highest of both, which I haven't really verified is correct, but I'll go with it just for the theory.
0.0125*0.14 = 0.00175, call it ~0.17% of instructions hitting main memory on the low end, and 0.025*0.4 = 0.01, or 1%, on the high end.
In a contrived scenario of pure stall or perfect work where I hopefully don't screw up massively:
0.17% of 1000 instructions is 1.7 misses to memory.
That is 1000 instructions / 2 instructions per clock = 500 cycles of work, plus 1.7 misses * ~170 stall cycles each (the ~168 from above, rounded) = ~289 cycles of stall. This is 500 cycles of work out of a total of 789 clocks elapsed, or roughly 63% of peak.
1% of 1000 instructions is 10 misses. That's 1000/2 = 500 cycles of work and 10 * 170 = 1700 cycles of stall, for 500/2200 = ~23% of peak.
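The same toy model in one place, in case anyone wants to poke at the numbers (assumes pure work-or-stall with ~170 stall cycles per serialized miss, as above):

```python
def fraction_of_peak(misses_per_instr, instrs=1000, peak_ipc=2, stall_per_miss=170):
    """Pure work-or-stall toy model: every instruction retires at peak IPC,
    and every miss to memory pays its full ~170-cycle stall on its own."""
    work_cycles = instrs / peak_ipc
    stall_cycles = instrs * misses_per_instr * stall_per_miss
    return work_cycles / (work_cycles + stall_cycles)

print(fraction_of_peak(0.0017))   # ~0.63 -> ~63% of peak at 0.17% misses
print(fraction_of_peak(0.01))     # ~0.23 -> ~23% of peak at 1% misses
```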
Unfortunately the reality is that things are way more complex than this. We're not approaching that level of pure work or pure stall without a very specific instruction mix and a lot of luck, if it really can be done that way. On top of that, the CPUs will frequently overlap misses, so two stalls to main memory don't lead to 2x the stall cycles if they are launched close together. Jaguar can have up to 8 misses in flight. I haven't really figured in writes or instruction cache traffic, and haven't covered hazards or core contention, TLB fills, fused ops, varying cache latencies, branch mispredicts, and so on and on.
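To give a rough feel for why the overlap matters so much, here's the same toy model with an average overlap factor bolted on. The overlap factor is entirely made up; the 8-miss cap and the per-miss stall are the figures above:

```python
def fraction_of_peak_with_overlap(misses_per_instr, avg_overlap, instrs=1000,
                                  peak_ipc=2, stall_per_miss=170):
    """Same toy model, but avg_overlap misses share a single latency window.
    avg_overlap is hypothetical; 8 is Jaguar's in-flight miss limit."""
    avg_overlap = min(avg_overlap, 8)
    work_cycles = instrs / peak_ipc
    stall_cycles = instrs * misses_per_instr * stall_per_miss / avg_overlap
    return work_cycles / (work_cycles + stall_cycles)

print(fraction_of_peak_with_overlap(0.01, 1))   # ~0.23, misses fully serialized
print(fraction_of_peak_with_overlap(0.01, 4))   # ~0.54, if 4 misses overlap on average
```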
While eight cores sounds impressive, I just can't imagine keeping them all fed while the GPU is accessing memory, especially on the XBone. It seems to me the cores would spend much of their time running up against the "memory wall".
It's not particularly clear that this is any better for Orbis, since the northbridge link for the CPUs is a third slower and its memory latency may be measurably worse.
The memory wall is a problem for everyone.
I'm inclined to think it would have been better to toss out four cores and replace the area they took up with an L3 cache or maybe a larger L2.
The missing peak performance would be noticeable.
With something like the system reservation and OS services, 1-2 of those cores would be at least partially taken away from developers, leaving two weak cores.
Would it be possible under such a setup to still utilise the power of a big CPU for certain CPU-intensive aspects of a game that don't necessarily require close integration with the GPU? Or is that basically what you're talking about here?:
That, and there are workloads that have a lot of intermediate work that can be swamped by bus transfers, but then have only a final result writeback, like various image processing routines. In that case, even if the cores are weaker, it strips out all the intermediate copies that just eat up time. Even if the final copy is still a cost, it's a more limited and occasional expense that can be amortized more readily.
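As a rough illustration of that amortization, with completely invented stage times rather than anything measured:

```python
def pipeline_ms(stages, compute_ms, copy_ms, final_copy_ms, shared_memory):
    """Hypothetical cost model: with a shared memory space the intermediate
    copies disappear and only the final result writeback is paid."""
    transfers = final_copy_ms if shared_memory else stages * copy_ms + final_copy_ms
    return stages * compute_ms + transfers

# Invented numbers: 5 stages of 1ms compute, 2ms per bus copy, 2ms final writeback.
print(pipeline_ms(5, 1.0, 2.0, 2.0, shared_memory=False))  # 17.0 -> copies dominate
print(pipeline_ms(5, 1.0, 2.0, 2.0, shared_memory=True))   # 7.0  -> one occasional copy
```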
The fact that we tend to see very little (if any) speed-up from upping PCI-E bandwidth was, I'd assumed, evidence that the CPU and GPU don't actually need a huge amount of bandwidth between them to operate at full capacity (assuming you have enough memory local to the GPU). I'd be interested to better understand why that's not the case?
The transfers can be such an obstacle that various algorithms are simply not used.
One possibly extreme, but not entirely unrepresentative, example of how transfers make many GPGPU workloads a nonstarter even for high-end GPUs:
http://www.extremetech.com/wp-content/uploads/2012/06/memcached_useful-calculations_AFDS.jpg
For these workloads, it's worse than a waste of time to try it, so it isn't done.
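That chart boils down to a break-even question: the GPU's compute win has to be larger than the time spent shuttling data over the bus. A minimal sketch of that comparison, with invented numbers rather than anything from the linked slide:

```python
def offload_wins(cpu_ms, gpu_ms, bytes_moved, bus_gb_per_s):
    """Hypothetical check: GPU compute plus bus transfer vs. just doing it on the CPU."""
    transfer_ms = bytes_moved / (bus_gb_per_s * 1e9) * 1e3
    return gpu_ms + transfer_ms < cpu_ms

# Invented numbers: a 10x kernel speedup, but 256MB each way over ~12 GB/s effective PCI-E.
print(offload_wins(cpu_ms=20.0, gpu_ms=2.0, bytes_moved=2 * 256e6, bus_gb_per_s=12))
# False -> the transfers swamp the 10x kernel win, so the work stays on the CPU
```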
It's not certain at this point how much the consoles will leverage compute with the shared die, memory space, and higher bandwidth, but it's one of the first times that the idea wasn't shot down outright.