Middle Generation Console Upgrade Discussion [Scorpio, 4Pro]

Again, for a 56 CU part (which means a 64 CU part with 8 CUs disabled) the die area dedicated to CUs would have to be significantly larger (at least > 180mm²). And let's say that it's generously using 40% of the whole die for CUs only (31% for the PS4 die, 20% for the X1 die). That would mean the die size is > 450mm², which first doesn't align with the render they provided and second would increase cost significantly. Of course there's a chance that Polaris and FF+ bring significant architecture changes that enable such a design, but we'll have to wait and see.
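As a quick sanity check of that estimate, here is the arithmetic as a minimal sketch, using only the figures assumed in the post above (the 40% CU share and >180mm² CU area are the post's assumptions, not known Scorpio numbers):

```python
# Rough die-size check using the assumptions from the post above;
# nothing here is a confirmed Scorpio figure.
cu_area_mm2 = 180.0       # assumed minimum area for 64 physical CUs (56 enabled)
cu_share_of_die = 0.40    # generous share of the die devoted to CUs
                          # (post cites ~31% for PS4, ~20% for X1)

die_area_mm2 = cu_area_mm2 / cu_share_of_die
print(f"Implied die size: > {die_area_mm2:.0f} mm^2")   # > 450 mm^2
```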

Obviously. Should be a lot of fun when we get the real details and the deep dives get started.
 
I'm not sure it has anything to do with size; it has more to do with costs and added value. As CPUs have gotten smaller, manufacturers have started to integrate iGPUs into their CPUs. The performance of those GPUs has never been much more than a marketing bullet point (especially for AMD), but they helped clean up the motherboard and made sure any computer has some level of graphics acceleration. Recently they have proven competent enough to drive some non-demanding games, but no gamer (with reasonable financial means) is fine with just that. On mobile SoCs it is pretty much the same: 3D performance is an afterthought compared to optimizing price (and footprint). CPUs and GPUs are different in many ways. Until the recent huge steps in resolution, the difference between the size of the RAM and the VRAM was pretty massive; it is smaller now. A GPU works pretty consistently: its power usage varies, but I would think less than a CPU's, and its turbo range is narrower, so the GPU simply runs slower. A CPU is more of the "bursty" type; depending on the architecture, the jump in turbo frequency can be a lot more significant. I see this as a tough marriage if you put them on the same chip: the GPU is likely to eat the thermal/power headroom that the CPU might want to use during its bursts.

By 'size', I was really meaning 'yields'. As die area goes up, yields go down. You can mitigate somewhat with redundancy, but if you lose a non-redundant part, i.e. a CPU core, then you're scrapping a whole chip. The bigger the chip, the more you throw away for this single defect. You also lose more usable wafer area at the edges of the wafer.

Obviously you can still make big chips as the PC GPU market proves, but there may be a point where a separate CPU and GPU make sense. More likely that MS will cap performance at a level below this though.

Ultimately what we saw with both Sony and MSFT is that trying to reconcile an approach rooted in cost optimization with adding value to the CPU (through extra convenience) left them facing complicated choices that cost money: Sony could have been stuck with 4GB and had to use a 256-bit bus (whether it fitted their performance requirements or not), and MSFT had to spend a significant area (pretty much the same size as the GPU's) on eSRAM to make up for the bandwidth the main memory could not deliver to the GPU, and it did not even save them from a 256-bit bus. I'll pass over the case where the CPU and the GPU compete for access to memory and effective memory bandwidth collapses. In the PC world, if I look at the raw compute and pixel throughput of Bonaire derivatives, what I see is that a 128-bit bus with GDDR5 providing 90 to 100 GB/s doesn't seem to be a bottleneck, and those parts are competitive with this generation of consoles.

The 256-bit bus on X1 takes up a lot less space than the GDDR5 bus on PS4, which I think you've got to balance against the area taken up by esram. MS also achieved higher aggregate BW for likely a lot less power. I think power was a big consideration for MS due to wanting an "always on, always silent" machine that could stream to another room unnoticeably.

I'm not sure about that; the cost of the whole memory set-up is also quite substantial and might spice up the bill significantly. Jaguar seems to be sipping bandwidth through a straw; cheap DDR3 would have done the job.

If going with a separate (on-package) CPU I'd be expecting some kind of FSB going to the GPU, like the OG XB or 360. Zen will have the ability to read from other chips' memory using some kind of fast bus - I would expect this could be extended to allow the CPU to read from a GPU's memory pool. HUMA at-a-distance?


Something just struck me as I was sitting here thinking about the potential for the XBO-T to have a 384 bit memory bus.

If we compare it to Polaris 11 (RX 460) there are some interesting things that happen. Polaris 11 (RX 460) in base configuration is over 2 TFLOPs (under 1 GHz for 2.0 TFLOPs and over 1.2 GHz for 2.5 TFLOPs), with a 128-bit memory interface and ~112 GB/s bandwidth. It has 16 CUs compared to Polaris 10's 36 CUs (for the RX 480).

If you triple Polaris 11's configuration and target ≤ 1 GHz you end up with a 6 TFLOP configuration with a 384-bit memory interface and bandwidth that's probably right at 320 GB/s (basically everything clocked lower than the RX 460).

So, if we assume Vega is just a slight evolution of Polaris, I can totally see a 48 CU Polaris/Vega configuration with a 384-bit memory interface clocked at ≤ 1 GHz.

Hmmm, another thought, would it be possible to have a 192 bit interface with faster memory to hit 320 GB/s? For instance, imagine if the lower end Vega card is going to be using GDDR5x with a 256 bit memory bus. Instead of XBO-T using something akin to 3x Polaris 11, perhaps it's 3/4 of the lower Vega variant?
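For anyone who wants to check that arithmetic, here is a minimal sketch; the CU counts, clocks and memory speeds are the speculative figures from the post above, not announced specs:

```python
# Speculative "3x Polaris 11" sanity check; CU counts and clocks come from
# the post above, not from any official Scorpio specification.
def gpu_tflops(cus: int, clock_ghz: float) -> float:
    # GCN: 64 shaders per CU, 2 FLOPs (FMA) per shader per clock
    return cus * 64 * 2 * clock_ghz / 1000.0

print(gpu_tflops(16, 1.22))   # ~2.5 TFLOPs -> roughly an RX 460
print(gpu_tflops(48, 0.98))   # ~6.0 TFLOPs -> the tripled, lower-clocked part
print(3 * 112)                # 336 GB/s if bandwidth tripled too; a slightly
                              # slower memory clock lands near 320 GB/s
```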

I was speculating a little further back that if the 12 (presumably identical) memory chips on the render are an accurate representation of Scorpio, then we may be looking at a 192 or 384 bit bus.

At 320 GB/s that means you're looking at either 384-bit GDDR5 (not even GDDR5X), or 192-bit GDDR5X at about 13.3 GHz effective. Neither of these makes a lot of sense...

320 GB/s could also be a mixture of 256 GB/s from HBM2 and 192-bit DDR4 at 2.67 GHz. But I'm not sure that makes sense either.
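A minimal sketch of the bus-width arithmetic behind those options; the data rates are illustrative figures chosen to land near 320 GB/s, not leaked specs:

```python
# Ways to land near the rumored 320 GB/s with the bus widths discussed above.
# Data rates are illustrative, not confirmed Scorpio numbers.
def bw_gbs(bus_bits: int, data_rate_gbps: float) -> float:
    return bus_bits / 8 * data_rate_gbps

print(bw_gbs(384, 6.7))             # ~321 GB/s: 384-bit plain GDDR5
print(bw_gbs(192, 13.3))            # ~319 GB/s: 192-bit GDDR5X at ~13.3 Gbps
print(256 + bw_gbs(192, 2.67))      # ~320 GB/s: one HBM2 stack (256 GB/s)
                                    # plus 192-bit DDR4-2667 (~64 GB/s)
```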
 
By 'size', I was really meaning 'yields'. As die area goes up, yields go down. You can mitigate somewhat with redundancy, but if you lose a non-redundant part, i.e. a CPU core, then you're scrapping a whole chip. The bigger the chip, the more you throw away for this single defect. You also lose more usable wafer area at the edges of the wafer.
Indeed the gap is significant. Here is a die shot of Kabini: the GPU and its associated components take almost as much space as the cores and L2. The system could have consisted of a ~100 mm² CPU and a 160 mm² GPU (already in production), significantly smaller than either MSFT's or Sony's SoC, and both chips would have higher yields.

Obviously you can still make big chips as the PC GPU market proves, but there may be a point where a separate CPU and GPU make sense. More likely that MS will cap performance at a level below this though.
It comes down to performance: the CPU has to access more memory than the GPU, and it is troublesome (and costly) to find memory in that quantity that can also deliver the bandwidth the GPU wants.
AMD APUs are a poster child for that: AMD wastes silicon on a GPU that is not fed properly. I think SoCs have turned into too fashionable an idea; their benefits are numerous, but they are in no way a universal panacea. High-performance GPUs are beasts of their own, with their own needs; there is no point in integrating them into the brain of our computer.

The 256-bit bus on X1 takes up a lot less space than the GDDR5 bus on PS4, which I think you've got to balance against the area taken up by esram.
I'm not sure where you get that from; the difference looks marginal to me one way or another. By the way, my previous comments apply to both systems.
MS also achieved higher aggregate BW for likely a lot less power.
That is a little like adding tomatoes and oranges; you can't really compare them.
I think power was a big consideration for MS due to wanting an "always on, always silent" machine that could stream to another room unnoticeably.
That is a bit of a non sequitur; a GDDR5 memory controller burns more power. Though that is not the thing I'm discussing: it is not MSFT vs Sony, it is SoC vs discrete parts, or UMA vs NUMA, and how that could affect Scorpio's design ;)

My POV is that UMA is the idea least compatible with high performance and the necessary cost optimization. As an example, people are discussing HBM2 without noticing that for a GPU that won't end up in a GPGPU compute farm but in a console, HBM1 meets the requirements of a GPU rendering at 4K.
It goes on to show that in that context an SoC makes things worse; this example adds to the other ones. In the context of HBM, HBM1 is up to the task (and most likely costly enough already) of feeding a GPU dealing with 4K, but if you have an SoC you suddenly run into severe complications, as there is no way 4GB is good enough for a whole system.
Intel, Nvidia and AMD all want to break those limitations one way or another, but that is for parts that sell at a premium.
It is a funny world: consoles are closer to PCs than ever, but it is as if they are really reluctant to make the jump and acknowledge (whereas it might not have been true until not that long ago) that the PC model, born out of compatibility requirements and practical concerns through countless iterations, is as of now the best take on the problem consoles are facing (nb: the same as PCs ;) ).

I feel a little like a guy fighting the idea of 10 GHz CPUs when everybody was going for it, or a bubble, for example: it is nigh impossible to convince people that something may be up for discussion, even more so when they have accepted something as a fact or rule but completely forgotten the circumstances upon which that fact or rule was built and the matching limitations of the concept. To make matters worse, if the guy you address is cleverer than you are... he could bet his life on it (or others', as often in history...).
If going with a separate (on-package) CPU I'd be expecting some kind of FSB going to the GPU, like OG XB or 360. Zen will have the ability to read from other chips memory using some kind of fast bus - I would expect this could be extended to allow the CPU to read from a GPUs memory pool. HUMA at-a-distance?
I don't really care for HSAIL and other AMD plans; ARM and Intel don't care much either. Though I really hope Zen is a decent competitor, as Intel is holding on a little too hard to its great CPU cores.
 
I'm not sure where you get that from; the difference looks marginal to me one way or another. By the way, my previous comments apply to both systems.

Not sure those pictures are scaled correctly. The memory interface on the PS4 is ~50% larger than on X1, a difference equal to very nearly 1/4 of the esram. Not a marginal difference, IMO. At 14nm that size difference would be practically half the esram (though it remains to be seen whether the XBOne S is 14nm or a refined and lower-power 28nm chip!).

That is a little like adding tomatoes and oranges; you can't really compare them.

I think you have to compare, as they're running the same software. They aren't directly comparable, but neither are they incomparable as developers are pitting the same challenges against them and getting different results.

That is a bit of a non sequitur; a GDDR5 memory controller burns more power. Though that is not the thing I'm discussing: it is not MSFT vs Sony, it is SoC vs discrete parts, or UMA vs NUMA, and how that could affect Scorpio's design ;)

The GDDR5 interface will certainly require more power. Perhaps not in terms of power per bit, but total power will be higher. Power will also be a factor in SoC vs discrete, as an off-chip bus will consume more power, which will eat away at the power available for execution units (how much I don't know, but I've seen people throwing around figures of up to several Watts).

My POV is that UMA is the idea least compatible with high performance and the necessary cost optimization. As an example, people are discussing HBM2 without noticing that for a GPU that won't end up in a GPGPU compute farm but in a console, HBM1 meets the requirements of a GPU rendering at 4K.

The main arguments in favour of HBM2 over HBM1 seem to be stack size and volume (HBM2 looks like it'll be far more widely used). But if HBM1 were cheaper then that'd do just fine. Two stacks, 256GB/s, 2GB total, would be like turbocharged esram. Being very wide and with lots of channels should be good for graphics.

Interestingly, HBM2 will also offer 2Hi (2GB) stacks that can run at a HBM1-like 128GB/s or 256 GB/s. I'm wondering if this is designed to take over from HBM1 where capacity does not need to be high and where power is critical. Using fewer stacks, perhaps it could end up cheaper than HBM1?
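To put rough numbers on the stack options being weighed here, a minimal sketch; the per-stack capacity and bandwidth figures are the commonly quoted headline numbers for each generation, and the pool configurations are purely speculative:

```python
# Commonly quoted per-stack headline figures: (capacity GB, bandwidth GB/s).
# The pool configurations below are speculative comparisons, not console specs.
HBM_STACKS = {
    "HBM1 4-Hi": (1, 128),
    "HBM2 2-Hi": (2, 256),   # can also run at an HBM1-like ~128 GB/s
    "HBM2 8-Hi": (8, 256),
}

def pool(kind: str, stacks: int) -> tuple:
    capacity_gb, bw_gbs = HBM_STACKS[kind]
    return stacks * capacity_gb, stacks * bw_gbs

print(pool("HBM1 4-Hi", 2))   # (2 GB, 256 GB/s) -> the "turbocharged esram" idea
print(pool("HBM2 2-Hi", 1))   # (2 GB, 256 GB/s) -> same numbers from one stack
```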
 
Not sure those pictures are scaled correctly. The memory interface on the PS4 is ~50% larger than on X1, a difference equal to very nearly 1/4 of the esram. Not a marginal difference, IMO. At 14nm that size difference would be practically half the esram (though it remains to be seen whether the XBOne S is 14nm or a refined and lower-power 28nm chip!).
Clearly not what I'm seeing, and eyeballing the mm² and the size of the picture, it can't be scaled off that badly. And it still does not change the point I was making: for the graphical part, a 128-bit bus could have done for both.
I think you have to compare, as they're running the same software. They aren't directly comparable, but neither are they incomparable as developers are pitting the same challenges against them and getting different results.
You can compare performance, power used and cost; that aggregate bandwidth figure is not useful. But again, why are you so bent on turning this into Sony vs MSFT? My posts are pretty clear that this is not what I'm pointing to.
The GDDR5 interface will certainly require more power. Perhaps not in terms of power per bit, but total power will be higher. Power will also be a factor in SoC vs discrete, as an off-chip bus will consume more power, which will eat away at the power available for execution units (how much I don't know, but I've seen people throwing around figures of up to several Watts).
Yes, things have to be considered carefully, there is no disputing that. Now, I bring up the XB1 vs the PS4 as an example: if we look at performance in GPixels per second out, per Watt and per € (in estimated BOM), I suspect things are pretty even. Going with discrete chips has overhead, but compared to the overhead of an SoC I think it is a win: ("quad-channel GDDR5 memory controller + 8GB of fast and expensive GDDR5" or "quad-channel DDR3 memory controller linked to fast and expensive DDR3") vs (dual-channel memory controller linked to 2GB of fast and expensive GDDR5, plus a dual-channel memory controller running slower linked to 6GB of slower, cheaper DDR3). I put parentheses so it is clear that the PS4 and the XB1 are on the same side of my argument.
The main arguments in favour of HBM2 over HBM1 seem to be stack size and volume (HBM2 looks like it'll be far more widely used). But if HBM1 were cheaper then that'd do just fine. Two stacks, 256GB/s, 2GB total, would be like turbocharged esram. Being very wide and with lots of channels should be good for graphics.
I could agree with that, as whether they use an SoC or not they would have to move to a NUMA memory model. They would have to connect the interposer to DDR3/4 (main RAM); it could happen with either HBM1 or HBM2. It is a completely different topic, but I think both are too expensive.
 
Clearly not what I'm seeing, and eyeballing the mm² and the size of the picture, it can't be scaled off that badly. And it still does not change the point I was making: for the graphical part, a 128-bit bus could have done for both.

I've used higher res pictures than the one you posted, measured the number of pixels covered, created percentages of the die area covered, and applied that to die measurements. Your eyeballs are, unfortunately, a little off!

You can compare performance, power used and cost; that aggregate bandwidth figure is not useful. But again, why are you so bent on turning this into Sony vs MSFT? My posts are pretty clear that this is not what I'm pointing to.

I'm not turning it into a MS vs Sony thing. They had different design goals, and you have to factor design goals in when looking at the hardware. You want to look at hypothetical hardware configurations without looking at the key realities they were designed to address. MS needed high BW, low power and lots of memory. Sony just needed lots of BW - and ended up getting lots of memory too!

Yes, things have to be considered carefully, there is no disputing that. Now, I bring up the XB1 vs the PS4 as an example: if we look at performance in GPixels per second out, per Watt and per € (in estimated BOM), I suspect things are pretty even. Going with discrete chips has overhead, but compared to the overhead of an SoC I think it is a win: ("quad-channel GDDR5 memory controller + 8GB of fast and expensive GDDR5" or "quad-channel DDR3 memory controller linked to fast and expensive DDR3") vs (dual-channel memory controller linked to 2GB of fast and expensive GDDR5, plus a dual-channel memory controller running slower linked to 6GB of slower, cheaper DDR3). I put parentheses so it is clear that the PS4 and the XB1 are on the same side of my argument.

With the performance and power levels MS and Sony were targeting I think SoCs were the right choice, as were their memory choices. Both were keen to avoid split memory pools, and MS had additional power concerns and quite possibly the intention to shrink (we'll see when the S is out). Two different types of bus on one chip would be added complexity and force developers to use split memory - you'd need to get a pretty big pay-off in return for that. It doesn't look like MS and Sony thought it was there.

It may be for Scorpio that it doesn't play out that way (I guess we'll see), but I think Neo is going to be a SoC just like PS4Bone.

I could agree with that, as whether they use an SoC or not they would have to move to a NUMA memory model. They would have to connect the interposer to DDR3/4 (main RAM); it could happen with either HBM1 or HBM2. It is a completely different topic, but I think both are too expensive.

MS may yet decide that HBM is worth the cost based on memory performance (not just about on-paper peak BW) and power (they're going to be generating more heat than Sony, and may need to claw something back somewhere).
 
Edit: if you want to take both the 12 memory chips and the 320 GB/s BW seriously, you might be looking at:

- 256 GB/s from a single stack of HBM2 (giving 2 to 8 GB)
- 64 GB/s from 2.66 GHz DDR4 on a 192-bit bus (giving 6 or 12 GB)

This seems rather less likely than simply having a single pool of GDDR5X on a 256-bit bus (8 or 16 chips).

With 12 chips it seemed like a no-brainer that this was 12 GB of GDDR5 on a 384-bit bus. Am I missing something?
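A minimal sketch of what 12 identical packages would imply, assuming standard 32-bit-wide, 8 Gb GDDR5 chips (only the chip count comes from the render; the density and speed are assumptions):

```python
# What 12 identical GDDR5 packages would imply, assuming standard 32-bit-wide
# chips at 8 Gb density; only the chip count comes from the render.
chips = 12
bus_bits = chips * 32                    # 384-bit aggregate bus
capacity_gb = chips * 1                  # 12 GB with 8 Gb (1 GB) chips
bandwidth_gbs = bus_bits / 8 * 7.0       # ~336 GB/s at a common 7 Gbps speed

print(bus_bits, capacity_gb, bandwidth_gbs)   # 384, 12, 336.0
```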
 
In the reveal of Scorpio at MS's E3 conference they specifically call it an SoC - timestamp 1:25:15. "...we gave the SoC 6TF of computing capability..."

Good catch!

With 12 chips it seemed like a no-brainer that this was 12 GB of GDDR5 on a 384-bit bus. Am I missing something?

The interface would be huge (probably around 75 mm^2?), and power per bit would likely be worse than GDDR5X.

AMD's 5.5 TF 480 is only running a 256-bit bus to GDDR5, and above the 480 performance level Nvidia are already choosing GDDR5X on a 256-bit bus over GDDR5 on a 384-bit bus. 18 months from now, I think it's likely that everyone would be choosing 256-bit GDDR5X over 384-bit GDDR5.
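A minimal sketch of the peak-bandwidth comparison behind that argument; the data rates are typical ranges for each memory type, not console specs:

```python
# Peak-bandwidth comparison for the bus options discussed above; data rates
# are typical for each memory type, not confirmed console figures.
def bw_gbs(bus_bits: int, data_rate_gbps: float) -> float:
    return bus_bits / 8 * data_rate_gbps

print(bw_gbs(256, 8.0))    # 256 GB/s: RX 480-style 256-bit GDDR5
print(bw_gbs(384, 7.0))    # 336 GB/s: hypothetical 384-bit GDDR5
print(bw_gbs(256, 10.0))   # 320 GB/s: 256-bit GDDR5X at 10 Gbps
print(bw_gbs(256, 12.0))   # 384 GB/s: 256-bit GDDR5X at 12 Gbps
```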
 
With 12 chips it seemed like a no-brainer that this was 12 GB of GDDR5 on a 384-bit bus. Am I missing something?
That sounds about right, if the thing is more than just a render. I do hope MSFT wises up, takes a good look at the PC world from desktop to laptop, and comes up with a practical system.
By the time they launch they should have all the IP they need to deliver their 6 TFLOPS without having to sweat much on the R&D side.
They promised 8 cores; I wonder if it will be 8 cores, 8 logical cores, or somewhere in between. Whatever they decide, Polaris and Zen are all they need.
The Zen CPU comes with 8 cores, 16 logical ones; that is for a fully enabled chip. It is not necessarily what MSFT will do, but I see many ways to buy 8 "cores" from AMD.
As for the GPU, Polaris seems like a good match for the specs they announced. They announced 6000 GFLOPS worth of processing power; 5% rounding is about 300 GFLOPS. I would not read too much into the figure, as on such a big number you can fit a couple of Wii Us within the rounding.
On top of that, the CPU's own throughput may not be insignificant (though within the rounding error). The same goes for the bandwidth: a dual-channel DDR4 set-up could be enough to explain discrepancies between the rumored numbers out there and the figures around for the RX 480.
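To illustrate how loosely "6 TFLOPS, give or take 5%" pins down the GPU configuration, here is a small, purely hypothetical enumeration of CU counts and clocks that all fall inside that margin:

```python
# Purely hypothetical CU/clock combinations that land within the ~300 GFLOPS
# (5%) margin around the announced 6000 GFLOPS figure.
TARGET_GFLOPS, MARGIN = 6000.0, 300.0

for cus in (40, 44, 48, 52, 56):
    for clock_mhz in range(800, 1301, 25):
        gflops = cus * 64 * 2 * clock_mhz / 1000.0   # GCN FLOPs per clock
        if abs(gflops - TARGET_GFLOPS) <= MARGIN:
            print(f"{cus} CUs @ {clock_mhz} MHz -> {gflops:.0f} GFLOPS")
            break   # first qualifying clock per CU count is enough here
```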
As I see things, the GPU even at 14nm will consume its fair share of power and generate an equally fair amount of heat, and you'd be adding the CPU on top of it... To reach higher FPS for VR they also need higher serial performance, which comes with bigger cores running at higher frequencies. 14nm seems like a great node, yet I would avoid having the GPU and CPU generating heat at the very same spot.

To me this sounds like a reasonable set-up looking at the timeline for launch:
Zen CPU, 6 cores enabled out of 8, SMT enabled on only two of those cores. I don't know what AMD will do with those cores as far as segmentation is concerned; they could also disable AVX2. Dual-channel memory controller.
Polaris with all or most CUs enabled and a pretty high clock should do.
8GB of DDR4, 4GB of GDDR5.
 
Just asking out of curiosity:
Someone said that HBM2 bandwidth is limited by the substrate due to thermal reasons or some such; anyway, the bandwidth is huge.
If MS/Sony/Nintendo/Atari decides to use this configuration, could they split the CPU and GPU to increase yields and reduce cost, linking them on the substrate with the same bus?
A bigger CPU, maybe a bigger cache, maybe offload some components to the smaller CPU module, like the H.265 decoder or a secret-sauce module. The GPU is just a sea of CUs and ROPs with the memory controller. If you really wanted to (though I suppose you don't), you could link 3 HBM2 stacks to the GPU and one to the CPU to reduce memory contention while maintaining a form of UMA.
I know that it is technically possible, but does it make sense too?
 
Again, for a 56 CU part (which means a 64 CU part with 8 CUs disabled) the die area dedicated to CUs would have to be significantly larger (at least > 180mm²). And let's say that it's generously using 40% of the whole die for CUs only (31% for the PS4 die, 20% for the X1 die). That would mean the die size is > 450mm², which first doesn't align with the render they provided and second would increase cost significantly. Of course there's a chance that Polaris and FF+ bring significant architecture changes that enable such a design, but we'll have to wait and see.

We will have to wait for a Summit Ridge APU image to measure CPU size.
 
The Zen CPUs are supposed to be 95W TDP for 8 cores. How are they possible here?

Maybe with a serious underclocking for power efficiency, but they wouldn't say it's 8 cores if it's just 8 threads on 4 cores. The internet would implode, people would have PTSD for years to come.
 
The Zen CPUs are supposed to be 95W TDP for 8 cores. How are they possible here?

Maybe with a serious underclocking for power efficiency, but they wouldn't say it's 8 cores if it's just 8 threads on 4 cores. The internet would implode, people would have PTSD for years to come.
They've said over and over it's 8 cores, 16 threads.
95W is however just speculation at this point, but it's certainly not out of the realm of possibility.
 
The Zen CPUs are supposed to be 95W TDP for 8 cores. How are they possible here?

Maybe with a serious underclocking for power efficiency, but they wouldn't say it's 8 cores if it's just 8 threads on 4 cores. The internet would implode, people would have PTSD for years to come.
Console CPUs are usually heavily underclocked compared to the desktop versions, so that wouldn't surprise me in the least.
 
The Zen CPUs are supposed to be 95W TDP for 8 cores. How are they possible here?

Maybe with a serious underclocking for power efficiency, but they wouldn't say it's 8 cores if it's just 8 threads on 4 cores. The internet would implode, people would have PTSD for years to come.

The same way a 4-core Skylake CPU (i7-6700K and i5-6600K) has a 91 W TDP but a 4-core Skylake CPU (Xeon E3 1235Lv5 or 1240Lv5) has a 25 W TDP. Now imagine them clocked at less than 1 GHz like a console CPU core (the Xeon cores are at >= 2 GHz).

Just because the first desktop variant (at an unknown clock speed) is rumored to be 95W TDP doesn't mean a mobile or console variant will consume nearly as much power when run at mobile or console speeds.
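As a first-order illustration of why the same core design can span a 91 W desktop part and a 25 W low-power part: dynamic power scales roughly with frequency times voltage squared. The operating points below are illustrative guesses, not measured Zen or Skylake figures:

```python
# First-order dynamic-power scaling: P ~ f * V^2 (relative to a desktop-like
# operating point). The frequencies and voltages are illustrative only.
def relative_power(freq_ghz: float, volts: float,
                   base_freq: float = 4.0, base_volts: float = 1.25) -> float:
    return (freq_ghz / base_freq) * (volts / base_volts) ** 2

print(relative_power(4.0, 1.25))   # 1.00 -> desktop-class clocks and voltage
print(relative_power(2.0, 1.00))   # 0.32 -> low-power Xeon-style operating point
print(relative_power(1.6, 0.90))   # 0.21 -> console-style clocks
```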

Not saying XBO-T will use Zen or not, but it's certainly possible.

Regards,
SB
 
The same could be said for a low-clocked Neo and Scorpio. Compared to the Jaguar we have now, any Zen will still be a huge improvement.
 
With 12 chips it seemed like a no-brainer that this was 12 GB of GDDR5 on a 384-bit bus. Am I missing something?

I know, I'm looking at the same rendering you are. It would seem self-explanatory, so what's the broader discussion even about? And I really don't have a problem with whatever they go with as long as it's efficient. Yet it was likely no accident that Phil Spencer chose not to just come out and say "12 GB of GDDR5(X)".

It may mean absolutely nothing, but it has some (myself included) thinking that MS may still want a next-gen solution for esram. If so, placing a small stack of HBM on that SoC would be most advantageous. The timing is certainly fortuitous if MS is going in that direction: you've got both Samsung and SK Hynix ramping up for mass manufacture of HBM later this year.

The cost of HBM would most definitely put the Scorpio well north of $500. But I never thought the Scorpio was ever going to be priced below $549 anyway. I'm preparing for $599, mostly because I don't believe that the Scorpio is just an X1 revision. But it could even be higher if MS goes all out and doesn't treat this like Durango. I suppose from their perspective they could see more costly solutions (Vega, Zen, HBM, GDDR5X) as the better investment in the long run. Sony took a risk with GDDR5 and it paid off huge. I remember people were predicting that it was going to kill Sony; just the opposite happened.

Who knows? MS may be willing to take a similar chance with the Scorpio.
 
I'm one of the few who believes both the PS4 and XO are good designs.
Where the general consensus is that esram was bad and caused so many problems, I believe it got almost everything right: bandwidth where it's required. But the memory where it was required was too small; the size is what caused the majority of the problems, including for game development, not an inherently bad design.

So I would be all for an HBM1/2 set-up with 1-4 GB of memory plus L/DDR4 main memory. Would 1GB be more than enough at 4K? Let's say that 32MB was enough for 720p. Obviously it's only worthwhile if it's cheaper to produce and helps with TDP, etc.
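A back-of-the-envelope pixel scaling for that "32MB at 720p" argument, as a minimal sketch; it ignores buffer formats and compression, so treat it as a very rough bound only:

```python
# Rough pixel scaling from 720p to 4K; ignores buffer formats and compression.
px_720p = 1280 * 720
px_4k   = 3840 * 2160

scale = px_4k / px_720p                  # 9x the pixels
print(scale, 32 * scale)                 # 9.0 -> ~288 MB of fast "scratch" memory,
                                         # comfortably under the 1 GB floated above
```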

Once it's cheaper, they could make it all HBM, perhaps with a Scorpio slim or next gen. I don't think having the split would cause a problem going forward at all; you'd be moving to the faster memory, not trying to work around it.
 