Predict: Next gen console tech (9th iteration and 10th iteration edition) [2014 - 2017]

3dilettante · Nov 26, 2016

sebbbi said:
A Steamroller module (2 cores with a shared FPU) has the same peak flops as 2 Jaguar cores (half a module) at the same clocks.

Most of my thinking on this as to why Sony was allegedly considering Steamroller is that the clocks were not intended to be the same, by a very significant margin.
The base clocks for the desktop Kaveri SKUs seem to indicate that at least doubling Jaguar's clocks could have been possible, and the PS4 Pro's uptick in power consumption is an example of where Sony was fine with a marginal increase in overall power consumption over the release version of the PS4.

2x Jag cores do a total of 2xFADD and 2xFMUL per cycle (xSIMD4) = 4*4 = 16 flops. A Steamroller module does a total of 2xFMA (xSIMD4) = 2*2*4 = 16 flops. So in theory these are tied. However, Steamroller needs FMA to reach max throughput. If the code is not specifically optimized for FMA, you lose half of the theoretical flops (and even in optimized code, FMA pairing efficiency is never even close to 100%). Also, Steamroller FPU instructions tend to have much higher latency than their Jaguar equivalents, so it is harder to get full utilization out of it. I also remember reading somewhere that the shared FPU can cause additional stalls if both cores are utilizing it heavily.
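
The per-cycle arithmetic in that quote can be written out as a small sketch. The function names are made up for illustration, and it assumes that a plain add or mul issued to an FMA pipe retires one flop per lane; neither comes from AMD documentation.

```python
SIMD_LANES = 4  # "xSIMD4": four single-precision lanes per vector op


def jaguar_peak_flops(cores):
    # Per core: one FADD pipe plus one FMUL pipe, one flop per lane each.
    return cores * 2 * SIMD_LANES


def steamroller_peak_flops(modules, use_fma=True):
    # Per module: two shared FMA pipes. An FMA counts as two flops per lane;
    # a plain add or mul on the same pipe is assumed to count as one.
    flops_per_lane = 2 if use_fma else 1
    return modules * 2 * flops_per_lane * SIMD_LANES


print(jaguar_peak_flops(2))                       # 16
print(steamroller_peak_flops(1))                  # 16
print(steamroller_peak_flops(1, use_fma=False))   # 8
```

With FMA the two configurations tie at 16 flops per cycle; without it the module drops to 8, which is the "lose half" case described above.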

There's loss of peak against matched FADD and FMUL operations that cannot be turned into an FMA, which Steamroller cannot avoid. Loss of peak due to a mix that is slanted towards FADD or FMUL is something Jaguar also experiences, which I think makes the decision less straightforward at an architectural level. Jaguar has more issue port contention on a per-clock basis due to its having 1 fewer FPU pipeline, and it does have a less feature-rich ISA. The extra instructions AMD had for the Bulldozer line saw limited use in the desktop market, but it would have been presented differently as part of a proprietary platform.

As far as stalls in a multi-threaded context, I can think of some limitations like the store buffer out of the shared FPU not being able to supply the bandwidth of two cores, but this is a case of faltering into a tie rather than losing to 8 Jaguar cores due to Jaguar's narrower architecture. Some stalls with Piledriver and some possible errata were fixed with Steamroller.
There are non-FPU related performance problems, such as odd regressions in multithreaded performance, cache bandwidth, and decode. Decode was generally doubled to make choking there unlikely, with some additional writeback and errata fixes that might have done a little for the rest.

Steamroller beats Jaguar handily in FP/AVX code if no more than half of the cores are used (one core per module). It is definitely a better fit for PC application workloads (except for Cinebench/Povray/encoding style tasks).

Are there examples in the PC space where having two cores sharing the FPU leads to a loss in performance versus one core per FPU? The benchmarks I can remember didn't show performance going down.

Jaguar on the other hand has significantly higher throughput when all cores are used (assuming similar clocks and similar die space = possibly similar power consumption).

There would be little reason to use a core optimized to ~4GHz peak if it were to be bumped down to one that struggled to hit 2.
If the leaked Steamroller variant of the PS4 is a legitimate early version, it would indicate that concerns such as module area and FPU sharing were considered acceptable since those would have been known in advance.

The actual health of the manufacturing process and the validation of the architecture and overall SOC might have been where the momentum changed.
Perhaps Steamroller couldn't clock sufficiently faster for Sony's requirements with GF's process, or AMD could not make its deliverables for that architecture despite Kaveri seemingly inheriting some high-level similarities with Orbis.
I have seen some discussion that the firmware situation for Kaveri was problematic, and it was notably later to be released than the consoles.
Jaguar's hop to TSMC cost AMD money, and seemingly hindered its bug-fixing and physical characterization--hence lack of turbo in general until the return to GF.

One scenario I've been mulling over is that AMD determined it could not get a validated Steamroller APU out of GF in time, within its clock/power parameters, and in volume.
A Jaguar-based core, and an SOC that used it, could make a jump to TSMC, whereas a Steamroller APU with some apparent teething pains could not.

HTupolev · Nov 26, 2016

Minimalist-deep-pipeline high-clock cores?

Intel:

2000: "Theoretical peak performance can be decent for the die space!"

"Architecturally-aware compilers will solve these problems!"

2006: "We've decided that the best strategy in the CPU market is to sell Pentium 3s forever."

IBM:

2005: "Theoretical peak performance can be decent for the die space!"

"These problems won't exist when people code to the specific console!"

2006: "..."

Toshiba: "You know the best thing about the Cell? We can build them without including a PPU anywhere on the die!"

AMD:

2011: "Theoretical peak performance can be decent for the die space!"

2013: "Hey, remember when our CPU market share started to stabilize around 2007? Those were good years."

B3D:

2016: "Wouldn't theoretical peak performance be decent for the die space?"

sebbbi · Nov 27, 2016

3dilettante said:
The base clocks for the desktop Kaveri SKUs seem to indicate that at least doubling Jaguar's clocks could have been possible, and the PS4 Pro's uptick in power consumption is an example of where Sony was fine with a marginal increase in overall power consumption over the release version of the PS4.

I am not an HW engineer, but a double clocked part (with identical die size) isn't likely going to be just a "marginal increase" in power consumption. Wouldn't the power scale quadratically -> 4x higher power consumption (ballpark)? That's not good when you only get equal peak flops (assuming full FMA). Of course performance in generic less well threaded code would have been significantly better.

FMA would of course be used a lot in console code. However even in well optimized AVX inner loops (or well optimized GPU microcode), the FMA percentage (of ALU ops) is usually closer to 50% than 100% (there's always some adds and muls and other ALU instructions mixed in). C/C++ compilers also don't automatically translate vector multiply+add intrinsics to FMA. You have to do it manually. Console devs also use libraries designed for PC. Most (open source) libraries aren't even vectorized, and if they are, you likely see SSE3 at most. So no FMA. Fortunately compilers can in some cases automatically translate float (non vector) mul+add to FMA: http://stackoverflow.com/questions/34265982/automatically-generate-fma-instructions-in-msvc.
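
As a rough illustration of what a ~50% FMA share means for throughput, here is a minimal sketch. It assumes every ALU op occupies one issue slot, an FMA counts as two flops and any other op as one; `fraction_of_peak` is a hypothetical helper, not a measured model.

```python
def fraction_of_peak(fma_share):
    """Achieved flops relative to an all-FMA peak, for a given share of FMA ops."""
    if not 0.0 <= fma_share <= 1.0:
        raise ValueError("fma_share must be between 0 and 1")
    # Average flops retired per issued op: 2 for an FMA, 1 for anything else.
    flops_per_op = 2.0 * fma_share + 1.0 * (1.0 - fma_share)
    return flops_per_op / 2.0


print(fraction_of_peak(1.0))  # 1.0
print(fraction_of_peak(0.5))  # 0.75
print(fraction_of_peak(0.0))  # 0.5
```

Under those assumptions, a loop where half of the ALU ops are FMAs tops out at about three quarters of the paper peak before latency or sharing effects are counted.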

Jaguar lacks TBM and XOP instruction sets (found in Steamroller). But Zen isn't going to support them either. Otherwise both Jaguar and Steamroller have the same stuff: SSE4.2, AVX1, BMI1, F16C. No BMI2 on either (so no PDEP for fast morton code). Zen has AVX2 and BMI2 (Haswell+ on Intel).
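
For context on the PDEP remark: without BMI2, Morton codes are usually built with a shift-and-mask bit spread. The sketch below shows that fallback next to a bit-by-bit software model of what PDEP computes; the function names are illustrative and this is not code from any particular engine.

```python
def part1by1(x):
    # Spread the low 16 bits of x so a zero bit sits between each original bit.
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def morton2d(x, y):
    # Interleave: x occupies the even bit positions, y the odd ones.
    return part1by1(x) | (part1by1(y) << 1)


def pdep_model(src, mask):
    # Software model of PDEP: deposit the low bits of src, in order,
    # into the positions of the set bits of mask.
    result = 0
    bit = 0
    while mask:
        lowest = mask & -mask
        if (src >> bit) & 1:
            result |= lowest
        mask ^= lowest
        bit += 1
    return result


print(morton2d(3, 5))  # 39
```

With PDEP the five shift-and-mask steps in `part1by1` collapse into a single instruction using the mask `0x55555555`, which is why its absence matters for this kind of code.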

sebbbi · Nov 27, 2016

HTupolev said:
Minimalist-deep-pipeline high-clock cores?
...

B3D:
2016: "Wouldn't theoretical peak performance be decent for the die space?"

AMD could have avoided the "speed demon" trap. Both IBM and Intel had already hit the limit and were struggling. For some reason AMD decided to continue Bulldozer development. They should have just continued to iterate on the Phenom II design. They already had a native 6-core design back then. 3.2 GHz wasn't that bad a limit (if you look at Intel's forthcoming Core 2 and Nehalem). Imagine where AMD would be now.

itsmydamnation · Nov 27, 2016

sebbbi said:
AMD could have avoided the "speed demon" trap. Both IBM and Intel had already hit the limit and were struggling. For some reason AMD decided to continue Bulldozer development. They should have just continued to iterate on the Phenom II design. They already had a native 6-core design back then. 3.2 GHz wasn't that bad a limit (if you look at Intel's forthcoming Core 2 and Nehalem). Imagine where AMD would be now.

Is BD really a speed demon? To me it seems more like they just made bad choices: bad L1/L2, "bad" L1i, very long mispredict penalties (far longer than the pipeline length), too much pressure on the ALUs, and FMA pipes resulting in long FP latencies for non-FMA ops. From what I know, Fmax of BD is limited by the L2, so a Bulldozer with one extra ALU, a write-back rather than write-through cache, a larger L1 and a uop cache, while keeping everything else the same, probably would have ended up clocking the same at around the same power for a whole lot more performance.

But it's only a guess and we will never know. Remember, Bulldozer's int pipeline is only around 15-16 stages, not the 20-22 normally quoted (that figure is the mispredict penalty).

3dilettante · Nov 27, 2016

sebbbi said:
I am not an HW engineer, but a double clocked part (with identical die size) isn't likely going to be just a "marginal increase" in power consumption. Wouldn't the power scale quadratically -> 4x higher power consumption (ballpark)? That's not good when you only get equal peak flops (assuming full FMA). Of course performance in generic less well threaded code would have been significantly better.

Within the same implementation, there's a mostly linear relationship with clock and a quadratic one with voltage. Keeping it within one implementation factors out items like the design's frequency target, process, and logic complexity.
Comparing two different architectures would need to control for items like that.
Voltage is the larger driver, and there's an additional relationship within a design's clock/voltage curve where voltage starts to rise as clocks rise in order to overcome device and wire delay.
Where that voltage increase starts to kick in is determined by the design and how it handles things like wire delay.

A low-clocked design's voltage starts to ramp at lower clocks, where the relationship starts to look more cubic than quadratic.
A high-performance design can take an initial hit in power consumption due to its linearly larger factors (clock, transistor count), but it won't be forcing its voltages higher (the big cost) until it reaches the upper part of its target range.
For example: at launch, a 1.5 GHz Jaguar quad-core with a 2 CU GPU had a 15W TDP and the 2.0 GHz had a 25W TDP. Later mobile Kaveris at ~2GHz base had a 20W TDP.
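
The clock/voltage relationship described above can be put into a small sketch. It uses the textbook dynamic-power proportionality P ∝ f·V² and ignores static power; the helper names are hypothetical, and the TDP figures are the ones quoted in this post, which are whole-SoC bins (GPU included) rather than measured CPU power.

```python
import math


def relative_dynamic_power(clock_ratio, voltage_ratio=1.0):
    # Dynamic power scales roughly linearly with clock and quadratically with voltage.
    return clock_ratio * voltage_ratio ** 2


def implied_exponent(f_lo, p_lo, f_hi, p_hi):
    # Exponent n such that p_hi / p_lo == (f_hi / f_lo) ** n.
    return math.log(p_hi / p_lo) / math.log(f_hi / f_lo)


# Doubling clock at constant voltage: 2x power.
print(relative_dynamic_power(2.0))
# Doubling clock when voltage has to rise in step with it: 8x power (cubic).
print(relative_dynamic_power(2.0, 2.0))
# Jaguar TDP bins quoted above: 15 W at 1.5 GHz, 25 W at 2.0 GHz.
print(implied_exponent(1.5, 15.0, 2.0, 25.0))
```

For the two Jaguar bins the implied exponent lands between linear and quadratic (roughly 1.8), which fits the picture of a low-clocked design whose voltage is already ramping by 2 GHz. Treat it as a crude indicator only.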

Also for various reasons, Jaguar's power efficiency was rather middling, which meant the difference between it and the larger x86 core was smaller than it should have been.
The fact that large cores can throttle down to Jaguar's range (FinFET helps with a good portion of the static cost), while truly optimized architectures could hit it from below, partly explains why its particular compromise point has been discarded by AMD (among other reasons, like not having much of the responsible staff any longer).

I have not tried to really narrow down the error bounds, but doubling a launch Jaguar's core count could take it to the 20-30W range, when there were 45-65W Kaveri chips with larger GPUs as well.
It seems possible Kaveri's number could be dropped if it settled on only having base clocks, and the larger GPU were accounted for.
Something like 20W (ed: possibly worst-case) on top of the PS4's power budget (150W+) seems like it might not be an immediate disqualifier since the PS4 Pro is somewhere above the PS4.

However, one possible disqualification is that Kaveri came out in 2014, and the mobile Kaveris that wound up beating the TDP of the hottest Jaguar didn't launch until June of that year.

MrFox · Mar 17, 2017

Finally Hynix is going to put some pressure on Samsung and Toshiba. They have a 1TB single chip in their Q1 databook. It will have 72 layers on a 10nm-class process.

H27Q8T8LEA
8192Gb
FBGA(152ball)
Q4'17

I want two of those on a user-replaceable M.2... with a fancy screw of course.

They dropped their idea of DC-SF and are doing nitride charge trap like everybody else.

RDGoodla · Apr 6, 2017

It seems that Intel CPU & Nvidia GPU are still the best choices for a next-gen high performance console?

How would its bill of materials compare with that of an APU-based console?

ProspectorPete · Apr 6, 2017

PS5:
Launch Q4 2019
CPU (apu): ULP 16 core Zen
GPU (apu): Navi 12TF with features from the next hardware
memory 24GB HBM3 (up to 6 reserved for OS)
storage: 1468GB 3D SSD with 5000MB/sec both read and write
HDMI 3.0 with adaptable refresh
Thunderbolt 4 ports with USB-C connector
8x BDXL drive with option for UHD BD
8K streaming through future update
Wireless display link for PSVR2
802.11AX wifi
Bluetooth 5.1 for controllers as well as a latency-free headset

Shifty Geezer · Apr 6, 2017

RDGoodla said:
It seems that Intel CPU & Nvidia GPU are still the best choices for a next-gen high performance console?

How would its bill of materials compare with that of an APU-based console?

How can you say that when we haven't seen a next-gen AMD APU with stacked RAM in action? Or even how can you say that when Vega and Ryzen make a competent combination at a far lower price? You could get more power from $400 of AMD based console than $400 of Intel+nVidia, no?

N00b · Apr 6, 2017

ProspectorPete said:
PS5:
Launch Q4 2019
CPU (apu): ULP 16 core Zen
GPU (apu): Navi 12TF with features from the next hardware
memory 24GB HBM3 (up to 6 reserved for OS)
storage: 1468GB 3D SSD with 5000MB/sec both read and write
HDMI 3.0 with adaptable refresh
Thunderbolt 4 ports with USB-C connector
8x BDXL drive with option for UHD BD
8K streaming through future update
Wireless display link for PSVR2
802.11AX wifi
Bluetooth 5.1 for controllers as well as a latency-free headset

You forgot:
* Fairy dust
* Built-in world peace

Shifty Geezer · Apr 6, 2017

N00b said:
* Built-in world peace

That's going to make COD and GTA pretty boring...

Blazkowicz · Apr 6, 2017

There's a new GTA coming, but you play as Noah and you roam the city to collect lost sheep, bring bread to orphans, and when you enter a car the driver asks you to have a seat please.

HTupolev · Apr 6, 2017

Blazkowicz said:
bring bread to orphans

You mean slingshot bread at orphans, resulting in them taking a nap.

zupallinere · Apr 6, 2017

ProspectorPete said:
PS5:
Launch Q4 2019
CPU (apu): ULP 16 core Zen
GPU (apu): Navi 12TF with features from the next hardware
memory 24GB HBM3 (up to 6 reserved for OS)
storage: 1468GB 3D SSD with 5000MB/sec both read and write
HDMI 3.0 with adaptable refresh
Thunderbolt 4 ports with USB-C connector
8x BDXL drive with option for UHD BD
8K streaming through future update
Wireless display link for PSVR2
802.11AX wifi
Bluetooth 5.1 for controllers as well as a latency-free headset

Wait a year and reduce some of those specs and add some more slower memory or even replace the SSD with x-point memory using HBCC. Resolution isn't going to be a big deal in the future since I think 4k is all you need so why not see what you can do with a crazy amount of memory.

Rikimaru · Apr 7, 2017

It's likely the HBM3 price will be too high for a console.
Only a special low-cost HBM variant could be feasible.

Entropy · Apr 7, 2017

You are leaving out the main prediction (or is it implied?), that it will be fabbed at TSMC's 7nm node.
(And why go so high in CPU parallelism, won't average utilization be abysmal?)

Rikimaru · Apr 7, 2017

ProspectorPete said:
HBM3 IS the low-cost HBM variant, which is why I went with it.

It's cheaper but not cheap.

In addition to HBM3, Samsung unveiled two other memory technologies at Hot Chips: GDDR6 and "Low Cost" HBM. GDDR6 is the successor to GDDR5X (as used in Nvidia's GTX 1080) and is scheduled for release in 2018. GDDR6 promises per-pin bandwidth of over 14Gbps, up from 10Gbps, as well as greater power efficiency. Further details are promised at a later date.
Meanwhile, Samsung has been working on making HBM cheaper by removing the buffer die, and reducing the number of TSVs and interposers. While these changes will have an impact on the overall bandwidth, Samsung has increased the individual pin speed from 2Gbps to 3Gbps, offsetting the reductions somewhat. HBM2 offers around 256GB/s bandwidth, while low cost HBM will feature approximately 200GB/s of bandwidth. Pricing is expected to be far less than that of HBM2, with Samsung targeting mass market products.

https://arstechnica.com/gadgets/2016/08/hbm3-details-price-bandwidth/
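
The per-stack figures in that quote follow from interface width times per-pin rate. A minimal sketch, assuming HBM2's 1024-bit interface per stack; the 512-bit width tried for the low-cost variant is only a guess that happens to land near the quoted ~200GB/s, since the article does not state the width.

```python
def bandwidth_gb_per_s(interface_bits, gbps_per_pin):
    # Bits per second across all pins, divided by 8 to get bytes.
    return interface_bits * gbps_per_pin / 8


def implied_interface_bits(target_gb_per_s, gbps_per_pin):
    # Interface width needed to reach a target bandwidth at a given pin rate.
    return target_gb_per_s * 8 / gbps_per_pin


print(bandwidth_gb_per_s(1024, 2))     # HBM2 stack at 2Gbps: 256.0
print(implied_interface_bits(200, 3))  # width implied by ~200GB/s at 3Gbps
print(bandwidth_gb_per_s(512, 3))      # assumed 512-bit low-cost stack: 192.0
```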

DieH@rd · Apr 7, 2017

RDGoodla said:
It seems that Intel CPU & Nvidia GPU are still the best choices for a next-gen high performance console?

Quite the opposite. They are the worst choices.

As for the HBM talk... the more time goes by, the more I think that the next consoles will stick with GDDR, especially if by 2019 someone manages to create 2GB chips. 16 of those chips will be enough for a nice 32GB of unified RAM, with speeds hopefully in the 350-500 GB/s range. That would be enough for 1080p-4K gaming.
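
To put numbers on that GDDR scenario, here is a rough sketch of the capacity and of the per-pin data rate needed to hit 350-500 GB/s. The 32-bit interface per chip (and 16-bit in a clamshell layout) is an assumption about how the chips would be wired, not something stated in the post.

```python
def total_capacity_gb(chips, gb_per_chip):
    return chips * gb_per_chip


def bus_width_bits(chips, bits_per_chip=32):
    # bits_per_chip=16 models a clamshell layout where two chips share a channel.
    return chips * bits_per_chip


def required_pin_rate_gbps(target_gb_per_s, bus_bits):
    # Per-pin data rate needed to reach the target aggregate bandwidth.
    return target_gb_per_s * 8 / bus_bits


print(total_capacity_gb(16, 2))                           # 32
wide = bus_width_bits(16)                                 # 512-bit bus
narrow = bus_width_bits(16, bits_per_chip=16)             # 256-bit bus
for target in (350, 500):
    print(target, required_pin_rate_gbps(target, wide),
          required_pin_rate_gbps(target, narrow))
```

On a 512-bit bus the target range needs roughly 5.5 to 7.8 Gbps per pin; on a 256-bit clamshell bus it needs about twice that, which is in the territory of the 14Gbps GDDR6 figure quoted earlier in the thread.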

As for storage... I don't think that anyone will go above the speeds that consumer external storage devices can achieve. Most likely they will stick with ~500MB/s and call it a day.

We all want fancy fast storage, but I would not be shocked if they stick with spinning laptop hard drives for another round [2TB as a starting size]. Cost trumps everything.

Rodéric · Apr 7, 2017

8 core zen based APU + Vega, if we are lucky HBM 16GB, even luckier standard PCIe based SSD, with an optical drive.
