Predict: Next gen console tech (9th iteration and 10th iteration edition) [2014 - 2017]

Status
Not open for further replies.
It can scale past 64 and AMD has already demonstrated the capability with Tonga.

I'm not sure I understand. Tonga is 32CUs ? (4 Shader Engine limit, 16 CUs & 16 ROPs per SE)

getting more ROPs and geometry capabilities with async now present isn't really worthwhile.

High-res shadow maps & multiple shadow casters?
 
I'm not sure I understand. Tonga is 32CUs ? (4 Shader Engine limit, 16 CUs & 16 ROPs per SE)
The ROPs are tied to memory channels so a crossbar had to be added for the variable memory bus width. This is sort of backwards from exceeding 64 CUs, but in theory they would want to map more ROPs than they have memory channels without a strange multiple. 48 ROPs into a 256 bit bus for example. Even multiples may not be ideal performance targets for their segments. Apple had a 384bit Tonga, but I'm not sure large busses are ideal for future designs. Especially if they start using HBM with all that bandwidth. The crossbar would be something to avoid for efficiency if possible.

High-res shadow maps & multiple shadow casters
Helps, but those are relatively easy tasks for ROPs and unlikely to be the biggest performance hit in games using compute and async. For DX11 sure, but it might not be a very forward looking design. It might help minimum frame time, but not necessarily throughput.
 
If the next consoles are still AMD x86, I would expect them to be multi-chip on a substrate using GMI for the interconnect. That would allow them to use off-the-shelf chips (cheap) to create some pretty powerful APUs while being something like 300mm sq on 7nm. It's the direction AMD seems to have flagged they are going in the future.
 
Jaguars are quad cores too.

Is it a problem for developers to use 2 quad cores on current consoles?
Tbf it was similar with jaguar but it still happened, if they're willing to pay AMD'll do it.
Jaguar was a small core. If they had gone with a Bulldozer core, I am sure it would have been 4 cores instead of 8, the same as the mainstream APUs. This would be more comparable to the Zen APU.
 
Were the jaguar cores chosen for their performance per die area, or their performance per watt?
How did they compare to bulldozer?
 
Vgleaks had a rumored Steamroller-based initial version of Orbis.
Area-wise, a Jaguar and Steamroller module are pretty close: less than 30mm2.
In terms of throughput, a Steamroller module manages about half the operations per clock of a Jaguar module, but could probably clock twice as high.

Single-threaded would favor Steamroller, but throughput and power consumption might have favored Jaguar--although Jaguar was somewhat imperfect at start as evidenced by Puma.

Other factors could have been cost and risk. Jaguar is the inexpensive core, and it was designed to be easily synthesizable.
While Bulldozer's successors were less reliant on custom blocks, there was still something of an order of magnitude difference in the number of blocks that had to be redesigned per process/foundry.
Jaguar has a history of hopping fabs, while Bulldozer's line does not.
Being cheaper, having lower barriers to changing foundries (and Bulldozer's line being more tightly associated with GF), and possibly time to market if there were internal delays could have pushed Jaguar over the top.
 
Area-wise, a Jaguar and Steamroller module are pretty close: less than 30mm2.

Are you comparing a Jaguar 1*INT+1FP core vs. a Steamroller 2*INT + 1FP module or a quad-Jaguar module vs. Steamroller module?
 
Whole module, so 2 Steamroller cores+L2 versus 4 Jaguar cores+L2 (29.5mm2 vs 26.2 mm2). Two modules of both would have Steamroller ~13% larger, or 6-7mm2 out of the whole SOC, which seems like a mild difference given the size of the overall chips.
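
A quick way to check the figures in that comparison, as a minimal sketch: the two module areas are the ones quoted in the post, while the function name and the default of two modules per SOC are just illustrative assumptions.

```python
# Module areas in mm^2 as quoted in the post (cores plus their L2).
STEAMROLLER_MODULE_MM2 = 29.5  # 2 Steamroller cores + L2
JAGUAR_MODULE_MM2 = 26.2       # 4 Jaguar cores + L2


def area_delta(modules=2):
    """Return (extra mm^2, relative increase) of Steamroller over Jaguar.

    `modules` is the number of CPU modules on the SOC (assumed to be 2 here,
    matching the two-module comparison in the post).
    """
    steamroller = modules * STEAMROLLER_MODULE_MM2
    jaguar = modules * JAGUAR_MODULE_MM2
    return steamroller - jaguar, steamroller / jaguar - 1.0


extra_mm2, relative = area_delta()
print(f"{extra_mm2:.1f} mm^2 extra, {relative:.1%} larger")
```

With two modules this gives roughly 6.6 mm2 and a bit under 13%, in line with the "~13% larger, or 6-7mm2" estimate.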
 
PS4 already had 256MB for OS functions, which has been expanded to 1GB on Pro...

I wouldn't be surprised to see something similar on PS5. For instance 64GB of fast and expensive ram for CPU/GPU (and finally the majority being given to devs, maybe all of it) and 4 or 8GB of slow and cheap ram dedicated to the OS.
 
The expanded DDR3 pool is being used to store an inactive application context, and is managed/used by the secondary processor.

The OS itself still has GDDR5 and CPU resources reserved (not clear if DDR3 is visible to Jaguar), and there are OS functions that are called with high frequency by applications and games (secure APIs, services, IO functions, various synchronization functions, etc.).

Shunting those functions to a slower DRAM pool over another bus would serve to expand OS overhead.
 
Only 4 cores on that might mean no 8 core console CPU either.

Surely you know about that upcoming 8 core Zen CPU (which itself is duct taped to make a 32 core server CPU, or a 24 core version with two disabled packs I suppose)
It's made of two packs of four cores, but you bet they talk together.

A few years ago I was betting they couldn't do an 8 core Jaguar. They did anyway. If the two quads are like "islands" that are relatively shut off from each other, well, too bad: the OS and developers deal with it. No free migration of threads from one Jaguar quad core to the other, etc. With Zen, you might still set affinity if you want, but it might be about gaining a few % rather than avoiding something that doesn't work or is really slow.
 
PS4 already had 256MB for OS functions, which has been expanded to 1GB on Pro...

I wouldn't be surprised to see something similar on PS5. For instance 64GB of fast and expensive ram for CPU/GPU (and finally the majority being given to devs, maybe all of it) and 4 or 8GB of slow and cheap ram dedicated to the OS.
Yes, that would be nice.

Sony is focused on gaming first, so their OS partition does not need to have insane needs. I even speculated way before the 2014 PSMeeting unveil that they should take the Vita processing package, which already has stacked RAM and VRAM on top of the SoC, plop that on the PS4 motherboard and use it for the OS. :D
http://1.bp.blogspot.com/-7xbyNVqMx...ARQ/sBhZLuG208c/s1600/New+Picture+%282%29.png
http://4.bp.blogspot.com/-CmbjrV0v3...ARI/l5ydYST9v4U/s1600/New+Picture+%281%29.png
 
Whole module, so 2 Steamroller cores+L2 versus 4 Jaguar cores+L2 (29.5mm2 vs 26.2 mm2). Two modules of both would have Steamroller ~13% larger, or 6-7mm2 out of the whole SOC, which seems like a mild difference given the size of the overall chips.
This clearly explains why they didn't go with Steamroller. Area would have been slightly larger, multicore perf would have been slightly lower and power consumption slightly higher (assuming a 50% higher clocked Steamroller). Steamroller's shared floating point / vector unit is a liability in code that stresses all cores (esp in games, as float & vector code is common). Also, Steamroller's turbo doesn't help at all if all cores are taxed all the time. On PC desktops & laptops Steamroller would obviously be a better choice than Jaguar as it has higher single core IPC + turbo.
 
Part of the speculation is sourced from a claimed Steamroller-based prior version of the PS4 APU, the area would have been something known in advance and seemingly acceptable at the time of the leaked design.
Whether Steamroller would have dialed back its clocks to only 50% above Jaguar is unclear. The lowest Kaveri SKUs started at a ~100% higher clock for their base. (edit: Correction, that was for the desktop SKUs. Mobile SKUs constrained to 35W and below had a lower base)
The FPU was shared, but was capable of broader issue with a 2xFMA and 1xMMX versus the FADD and FMUL pipe for Jaguar, which probably made the tradeoff less clear-cut. Perhaps specific profiles show where contention might hurt the FPU, although in Intel vs AMD comparisons it was usually decently utilized within its significantly lower theoretical peak.

There was also a downgrade in the throughput for the uncore's memory interconnect with the switch to Jaguar, so the switch seems to have come with some downsides if the concern is overall system throughput and GPU performance.
Jaguar was also supposed to have turbo, and that wouldn't be fully in place until Puma+ fixed the bug. The comparison would be between a big core that had turbo but chose not to use it, since it clocked 2x as high at base, and a core that lost a significant architectural feature.

Perhaps after some time it became clear that Steamroller would not meet its power numbers, the schedule slipped, or there was some form of technical issue or architectural bugs.
Jaguar made it to market ahead of the consoles, whereas Kaveri's launch was later (and Kaveri's APU compute capabilities are a much closer mirror to the PS4's than any standard Jaguar).
GF's unimpressive manufacturing performance and issues with odd regressions in the Bulldozer line are also known issues.
 
The FPU was shared, but was capable of broader issue with a 2xFMA and 1xMMX versus the FADD and FMUL pipe for Jaguar, which probably made the tradeoff less clear-cut. Perhaps specific profiles show where contention might hurt the FPU, although in Intel vs AMD comparisons it was usually decently utilized within its significantly lower theoretical peak.
A Steamroller module (2 cores with a shared FPU) has the same peak flops as 2 Jaguar cores (half a module) at the same clocks. 2x Jag cores do a total of 2xFADD and 2xFMUL per cycle (xSIMD4) = 4*4 = 16 flop. A Steamroller module does a total of 2xFMA (xSIMD4) = 2*2*4 = 16 flop. So in theory these are tied. However, Steamroller needs FMA to reach max throughput. If the code is not specifically optimized for FMA, you lose half of the theoretical flops (and even in optimized code, FMA pairing efficiency is never even close to 100%). Also, Steamroller FPU instructions tend to have much higher latency than their Jaguar equivalents, so it is harder to get full utilization out of it. I also remember reading somewhere that the shared FPU can cause additional stalls if both cores are utilizing it heavily.

Steamroller beats Jaguar handily in FP/AVX code if no more than half of the cores are used (one core per module). It is definitely a better fit for PC application workloads (except for Cinebench/Povray/encoding style tasks). Jaguar on the other hand has significantly higher throughput when all cores are used (assuming similar clocks and similar die space = possibly similar power consumption).
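
The per-cycle flop counting above can be written out as a small sketch. The pipe counts and SIMD width are the ones stated in the post; the function names and the `fma_fraction` parameter (the share of issued FPU ops that are actually FMAs) are hypothetical, added only to illustrate the "lose half without FMA" point. Clock speed, latency and shared-FPU stalls are not modeled.

```python
SIMD_WIDTH = 4  # lanes per vector op, the "xSIMD4" in the post


def jaguar_peak_flops(cores):
    """Peak flops per cycle for Jaguar: one FADD and one FMUL pipe per core,
    each producing 1 flop per lane."""
    pipes_per_core = 2
    return cores * pipes_per_core * 1 * SIMD_WIDTH


def steamroller_peak_flops(modules, fma_fraction=1.0):
    """Flops per cycle for Steamroller: two FMA pipes per module (shared by
    both cores). An FMA counts as 2 flops per lane, any other op as 1.

    fma_fraction is a hypothetical knob: the share of issued ops that are FMAs.
    """
    pipes_per_module = 2
    flops_per_lane = 2 * fma_fraction + 1 * (1.0 - fma_fraction)
    return modules * pipes_per_module * flops_per_lane * SIMD_WIDTH


print(jaguar_peak_flops(cores=2))                           # 2 Jaguar cores
print(steamroller_peak_flops(modules=1))                    # all ops are FMAs
print(steamroller_peak_flops(modules=1, fma_fraction=0.0))  # no FMAs at all
```

Under these assumptions two Jaguar cores and one fully FMA-fed Steamroller module both come out at 16 flops per cycle, and the Steamroller figure drops to 8 when no FMAs are issued.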
 