Predict: The Next Generation Console Tech

I'll also note that once you get to enough cores you have to stop using naive global schedulers with single queues and instead move to distributed queues and work stealing. Again, all this has been known for many, many years though, but it is a transition that doesn't happen overnight.
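Roughly, the per-core-queue-plus-stealing idea looks something like the toy sketch below (mutex-protected deques instead of the lock-free ones a real scheduler would use, and nothing from any shipping engine, just an illustration of the shape of it):

Code:
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <thread>
#include <vector>

using Task = std::function<void()>;

// One queue per worker; the owner pushes/pops at the front, thieves steal
// from the back. A real scheduler would use lock-free deques (e.g. Chase-Lev);
// a mutex keeps the sketch short.
struct WorkerQueue {
    std::mutex m;
    std::deque<Task> tasks;

    void push(Task t) {
        std::lock_guard<std::mutex> lk(m);
        tasks.push_front(std::move(t));
    }
    std::optional<Task> pop() {            // owner: LIFO, keeps caches warm
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.front());
        tasks.pop_front();
        return t;
    }
    std::optional<Task> steal() {          // thief: FIFO, takes the oldest work
        std::lock_guard<std::mutex> lk(m);
        if (tasks.empty()) return std::nullopt;
        Task t = std::move(tasks.back());
        tasks.pop_back();
        return t;
    }
};

int main() {
    const unsigned n = std::max(2u, std::thread::hardware_concurrency());
    std::vector<WorkerQueue> queues(n);
    std::atomic<int> remaining{1000};

    // Seed all the work onto queue 0; stealing spreads it across the cores.
    for (int i = 0; i < 1000; ++i)
        queues[0].push([&remaining] { remaining.fetch_sub(1); });

    std::vector<std::thread> workers;
    for (unsigned id = 0; id < n; ++id) {
        workers.emplace_back([&queues, &remaining, id, n] {
            std::mt19937 rng(id);
            while (remaining.load() > 0) {
                if (auto t = queues[id].pop()) { (*t)(); continue; }
                unsigned victim = rng() % n;   // no local work: try a random victim
                if (victim != id)
                    if (auto t = queues[victim].steal()) (*t)();
            }
        });
    }
    for (auto& w : workers) w.join();
    std::printf("all tasks done\n");
}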

If games were actually written with this level of scalability already then there's really no excuse for the PC versions not even scaling to 6 cores, let alone 32. I remain skeptical until proven wrong :)
PC hasn't needed sophisticated scheduling as it's mostly been limited to 2 cores, and 4+ cores is still pretty niche. As a matter of business sense I'd expect most devs to just throw a bit of physics onto any extra cores.

I don't know what state the Cell schedulers are at. We don't really get details on any of that. You may well be right that we don't have any fully scalable engines yet. That'd be the place to look though as it's the first gaming architecture that has required devs to work with 6 cores, and necessity is the mother of all invention. ;)
 
That'd be the place to look though as it's the first gaming architecture that has required devs to work with 6 cores, and necessity is the mother of all invention. ;)
It wasn't the first one. If you want any kind of decent performance, you are required to split your work across 6 cores (hardware threads) on Xbox 360 as well. But Cell had separate memories, so it pretty much required a job-based system (tasks needed to be split into smaller chunks, as there was only limited memory on each core).
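To illustrate what that chunked job splitting looks like, here's a rough CPU-side sketch (purely illustrative: plain memcpy stands in for the SPU DMA, and the chunk size is a made-up number that would fit comfortably in a 256 KB local store):

Code:
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical chunk size: 8K floats (~32 KB), small enough to fit in a
// 256 KB local store alongside code and double buffering in real SPU code.
constexpr std::size_t kChunkFloats = 8 * 1024;

struct Job {
    const float* src;     // input in "main memory"
    float*       dst;     // output in "main memory"
    std::size_t  count;
};

// The shape of an SPU-style job: stream the data through a small local
// buffer (DMA get -> compute -> DMA put). memcpy stands in for the DMA.
void run_job(const Job& job) {
    static float local[kChunkFloats];     // stand-in for the SPU local store
    for (std::size_t base = 0; base < job.count; base += kChunkFloats) {
        const std::size_t n = std::min(kChunkFloats, job.count - base);
        std::memcpy(local, job.src + base, n * sizeof(float));      // "DMA get"
        for (std::size_t i = 0; i < n; ++i)
            local[i] = local[i] * 2.0f + 1.0f;                      // work on local data only
        std::memcpy(job.dst + base, local, n * sizeof(float));      // "DMA put"
    }
}

int main() {
    std::vector<float> in(100000, 1.5f), out(100000, 0.0f);
    run_job({in.data(), out.data(), in.size()});
    std::printf("out[0] = %.1f, out[99999] = %.1f\n", out[0], out[99999]);  // 4.0, 4.0
}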
 
A few reasons why Cell in a next generation console wouldn't be that good an idea:

1. GPUs are significantly faster than Cell at (polygon) graphics rendering (even a 2005 GPU was). A next generation gaming console needs to have a GPU. It's going to be a modern one, and all modern GPUs are capable of GPGPU.
2. Current generation GPUs are much more programmable than those in 2005. Current generation GPUs are faster than Cell SPUs in many highly parallel computational algorithms.
3. Current generation CPUs have also improved in areas that were strong points for Cell SPUs. New general purpose CPUs execute 4x more concurrent threads than those in 2005 (1-2 -> 8-12). The vector throughput of each core has also improved significantly: AVX processes 256 bit vectors in a single cycle (eight floats/integers; see the small AVX example at the end of this post). Tasks that require strong single thread performance, but still require heavy vector maths, can be executed faster on modern CPUs.
4. Assume the new Cell would have a modern main CPU core and a modern GPU. As the CPU and GPU offer better performance than Cell SPEs for many algorithms, developers are going to use the best processor for each task. Three separate memory pools would mean a lot of data movement (GPU, CPU and SPEs each have their own memories).

Basically, if the Cell architecture is updated with modern CPU cores and a modern GPU, the advantages of having SPEs fade away. Modern general purpose CPU cores can do single threaded vector calculation (with branches and unpredictable memory access) much better than Cell SPEs, and modern GPUs can do highly parallel (branchless) vector calculation much better than Cell SPEs. There's not much left that the Cell SPEs perform better than the two other processing units (CPU & GPU) that must be included in the same system. The downsides of the Cell architecture however are huge: 3 separate memory spaces = more transfers / more difficult to program. I would personally prefer that those SPE transistors are spent making the (main) CPU and GPU faster (instead of SPE local memories we can have bigger CPU caches, and instead of SPE execution units we can have more CPU & GPU execution units). Do we really need a heterogeneous system with 3 different processing units? Most algorithms can be executed quickly on the CPU or GPU (when optimized properly).
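As a concrete illustration of the 8-wide AVX point in (3), something along these lines (a minimal sketch with made-up function names; real code would deal with remainders, alignment and FMA where available):

Code:
// Minimal AVX sketch: eight single-precision floats per 256-bit operation.
// Compile with e.g. -mavx. Assumes n is a multiple of 8 to keep it short.
#include <immintrin.h>
#include <cstdio>

void scale_add(const float* a, const float* b, float* out, int n, float s) {
    __m256 scale = _mm256_set1_ps(s);
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 r  = _mm256_add_ps(_mm256_mul_ps(va, scale), vb);  // out = a*s + b, 8 lanes at once
        _mm256_storeu_ps(out + i, r);
    }
}

int main() {
    alignas(32) float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {1,1,1,1,1,1,1,1}, out[8];
    scale_add(a, b, out, 8, 2.0f);
    std::printf("%f %f\n", out[0], out[7]);   // 3.0 17.0
}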
 
Of course who am I to disagree with an actual high-end developer like you sebbbi, but that sounds like the wrong approach to me.

From a pure performance perspective, shouldn't all design be data-driven first? The real question is how you can have as many cores as possible work together on the same data pipelines. In that respect, for a next-gen console, the question of CPU vs GPU seems almost irrelevant: you need to know how much of what kind of processing power you need, how flexible you need to be with that processing power (fixed function versus programmable), and how you can stream the data along those processing units most efficiently.

Of course, a completely opposite end of the design requirements is - how can we make a console as cheap as possible that is as easy as possible to design for, which in modern times almost seems to imply it has to be as much like a fixed hardware PC as possible.

Depending on which direction you go, performance can almost become a trivial consideration.
 
Current generation GPUs are faster than Cell SPUs in many highly parallel computational algorithms.
Are you comparing current PC GPUs with Cell from 2006? Isn't that kind of unfair considering the latter has almost 10x less transistors?
 
Are you comparing current PC GPUs with Cell from 2006? Isn't that kind of unfair considering the latter has almost 10x less transistors?
This is the next-gen prediction thread. Sebbbi is comparing architectures with what is possible. Cell doesn't stand strongly any more.
 
A few reasons why Cell in a next generation console wouldn't be that good an idea:
I think that's generally how most devs would see it. The only reason I can see to preserve Cell is BC and ease of transition for Sony's 1st parties, plus when the GPU is doing all the graphics work, Cell would be a powerful CPU in a tiny package if not updated.

What do you think of an ARM+SGX heterogeneous-core scalable processor architecture, though? Something akin to the Cell Rasteriser patent, only with ARM and SGX (Rogue), would be usable across devices like Vita, Vita2, TVs and tablets. Would this offer reasonable performance by your estimation, or would it be eclipsed by discrete CPU and GPU combos, such that multiplatform support would cost the discrete games console dearly?
 
BC has to be a big deal for Sony, though, doesn't it? Microsoft has vast expertise in operating systems and they can easily target whatever hardware they want with their full OS feature set. Sony spent a lot on PS3's platform software that they could re-use more easily with a Cell derivative, no?
 
Are you comparing current PC GPUs with Cell from 2006? Isn't that kind of unfair considering the latter has almost 10x less transistors?
I highly doubt a modernized Cell would offer better performance or performance/watt in highly parallel workloads than a modern GPU. I agree that Cell would be better at some algorithms, but Cell still wouldn't be fast enough to run graphics alone, so we need to add a GPU. Many of the algorithms where Cell (SPUs) would be better than a modern GPU (single threaded workloads) are those that a modern CPU would run even better.

I would prefer a fast GPU that handles graphics well and handles parallel computation well. Paired with a fast multicore CPU (and a fast communication path between them), the system wouldn't have any weak points. It would also be running nearer its peak performance in a wider variety of games (GPU cycles could be divided between parallel computation and graphics rendering depending on the game design needs). A slow GPU would mean that we need to move some graphics related parallel tasks to processing units that don't run those tasks as efficiently (for example Cell SPEs). I'd rather keep my graphics tasks on my GPU. I wouldn't mind if the GPU had extra power left over for general purpose parallel computing as well (more = better).
 
Is there any rational way for Sony to stick with a Cell derivative on PS4? They've spent a lot of money on Cell software this generation, both at the platform level and at the game level. If they don't have Cell going forward, it seems this whole generation's worth of games would be lost, without any prospect of ever running under emulation on a non-Cell design.

Would keeping a small 8 SPU farm on board a chip with more modern general CPU cores even be enough to meaningfully support existing software, in the absence of the current split-memory design and other hardware oddities of the PS3?

Is there any conceivable way that the SPU behavior could be performed by more full-functioned cores, such that a chip with 8 general purpose cores could actually emulate the PPU and 7 SPUs in backwards compatibility mode?

Yuck. That sounds more complicated than Larrabee.

I guess maybe we'll start hearing about Sony's 20 year plan for PS3. ;-/
 
Is there any rational way for Sony to stick with a Cell derivative on PS4? They've spent a lot of money on Cell software this generation, both at the platform level and at the game level. If they don't have Cell going forward, it seems this whole generation's worth of games would be lost, without any prospect of ever running under emulation on a non-Cell design.

I think they will either go with a shrunk Cell on die or on the board alongside whatever CPU/GPU they use, a la PS2 hardware in early PS3s, or just drop BC altogether.
 
Is there a big PS3 library for which BC would be a competitive advantage in the PS4?

Is it possible that the biggest-selling PS3 game was a third-party game rather than first-party, meaning that the X360 version probably sold in higher volume?

The other thing to consider is that they may choose to come out with PS4-enhanced versions of PS3 games, the way they did with the GOW collection and the Ico/SoTC remaster, rather than support direct BC.

That way they double-dip and make you pay for the same content. So imagine you get to buy all the Uncharted games again, just with enhanced graphics for the next gen.
 
With the enhanced versions, though, the leap isn't nearly as big: SD to HD plus better filtering versus 720p to 1080p plus AA. Sure it will look nicer, but is it really worth it to the masses?
 
The other thing to consider is that they may choose to come out with PS4-enhanced versions of PS3 games, the way they did with the GOW collection and the Ico/SoTC remaster, rather than support direct BC.

That way they double-dip and make you pay for the same content. So imagine you get to buy all the Uncharted games again, just with enhanced graphics for the next gen.

They already do this for PS2 games, and without really enhanced graphics, so there's no problem doing the same kind of remasters again…
So the first PS4s will probably get "Cell hardware" for backwards compatibility and then drop it after two years; it's not a problem to reuse the same trick. Only the first buyers care about playing older games.
 
A little controversial perhaps but my hope for next-gen is that backward compatibility is killed off completely if it means having to use silicon die space for legacy components. If it can be done intelligently in software - fine.

That extra silicon real estate can be better spent on making the console a better experience for new content (be it extra RAM, more storage, or faster processing of some kind).

All this talk of backward compatibility makes me sad - as if that is a good reason to use Cell and PPE! A good reason to use Cell and PPE would be performance, heat, ease of development and cost compared to what else is out there.
 
I mean consider a "core" with one set of execution units (ALU, etc). If you are running one unstalled thread you can utilize all of those units at full rate. Four threads would merely have to take turns so you've gained nothing and cut your register file/caches in four.

But if the maximum ILP per thread doesn't utilize all the resources of a single core, the hardware goes to waste. With die shrinks maybe they have enough resources in a single core to run multiple threads efficiently. Maybe that's why a 4-core i7 with HT seems to be competitive with an 8-core Bulldozer.
 
I would like to see next generation boxes with audio processors capable of 96 kHz / 24-bit audio in real time as standard.
I know it might not be a huge difference compared to what we have today, at least to some people, since a graphics jump is normally noticed more than sound.

Even so it would be quite nice.
 
But if the maximum ILP per thread doesn't utilize all the resources of a single core, the hardware goes to waste. With die shrinks maybe they have enough resources in a single core to run multiple threads efficiently. Maybe that's why a 4-core i7 with HT seems to be competitive with an 8-core Bulldozer.
Intel seems to be highly focused on the best possible single threaded performance. I believe highly optimized single threaded code (without any cache stalls) will fill the pipeline nicely in Sandy Bridge. However you cannot combat 500+ cycle memory stalls just with ILP, even with sophisticated out of order execution and aggressive cache prefetching. When running generic (less cache-optimal) code you will get long periods of time when the pipeline is completely idle (waiting for a cache miss). Hyperthreading will utilize these stalls (free cycles not used by the other thread). As the i7 usually gets only a 20%-30% speed boost from HT when running generic code, and the gain can be near zero when running highly cache-optimized math heavy code, it doesn't seem likely that they have extra execution units. It seems that HT is just using the left over execution cycles (to improve the pipeline / execution unit efficiency). However a 30%+ performance boost from that small amount of transistors is a very good thing.

I don't know if it's that good an idea to have a considerable amount of extra execution units if you are only running two threads per core. Wide SIMD units (Intel plans to extend AVX up to 1024 bits) however bring good gains, since the usage of these special units is bursty (periodic). A wider vector unit helps both single thread performance and two thread performance (with a wider execution unit, the thread spends less time in vectorized code, and the vector pipeline is free sooner for the other thread). But often vector calculation needs the main pipeline as well (vector instructions are pretty limited; for example you need general instructions/registers to calculate fetch addresses and for branching). This is one thing that the AMD pipeline (double ALU + one fpu/vector unit) handles slightly better.
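To make the stall argument concrete, here is a rough illustration of the two extremes: a latency-bound pointer chase whose dead cycles a second hardware thread can soak up, and dense streaming math that already keeps the pipeline fed. Purely illustrative (made-up function names, not a benchmark):

Code:
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Latency-bound: every load depends on the previous one and (with a big,
// shuffled array) usually misses cache, so the core mostly sits in stalls --
// exactly the dead cycles a sibling HT thread can soak up.
std::size_t pointer_chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i];
    return i;
}

// Throughput-bound: simple streaming arithmetic that already keeps the
// pipeline busy, so a sibling HT thread has much less slack to exploit.
float dense_math(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v)
        acc = acc * 1.0001f + x;
    return acc;
}

int main() {
    const std::size_t n = 1u << 22;            // large enough to spill out of the caches
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937{42});   // random order = cache misses
    std::vector<float> v(n, 0.001f);
    std::printf("%zu %f\n", pointer_chase(next, n), dense_math(v));
}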
 
A few reasons why Cell in a next generation console wouldn't be that good an idea
Agreed on all points. Most algorithms that run efficiently on SPUs also run efficiently on modern GPUs and everything else runs better on a conventional CPU. It's definitely a design point in the middle of CPUs and GPUs but it just doesn't seem like a particularly compelling one.
 
Ok, time to review / recap / recast predictions. The following will be more GPU specific. I decided to post this incomplete and all from memory instead of never posting it. A lot of number errors I am sure... so respond and correct them if they bother you! I know there are huge gaps, wrong numbers, etc. I don't have the time to correct all of this, so follow the general flow and branch off as you see fit. Besides the WiiHD there is nothing set in stone per the public. At some point I may post what I would LIKE to see and what I EXPECT to see.

PROJECTION #1: Process Node.

Code:
Process    Density*    Date (TSMC)    Date (Intel)    Date (GF)
90         -           2005
65         2           2007
45         4           2009
40         5           2010
32         8           -
28         10.4        2011
22         16          -              2011
20         17.6        2013
16         32
14         42.6        2015
12         64

* Compared to 90nm. Typically a comparison of the “smallest structure” and not an average. Each design & process are unique.

PS3 Footprint: 532mm^2
RSX: ~ 300mm^2 on 90nm
Cell: ~232mm^2 on 90nm

Xbox 360 Footprint: 480mm^2
Xenos: ~220mm^2 + ~110mm^2 on 90nm
Xenon: ~150mm^2 on 90nm


eDRAM: 10MB in ~95mm^2 (the ~110mm^2 daughter die minus logic) on 90nm, which would project out to about 176MB in the same area on 20nm.

PROJECTION #1 Conclusion: Process Node. 17.6 times the density. Excluding structural unit improvements/inefficiencies due to increased robustness, and multiplied by a frequency increase on the order of 50% (range ~40%-70% for the GPU, let's assume 750MHz; essentially none for the CPU), the "raw" performance jump is a 26.4x increase for the same footprint.
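For anyone who wants to check that arithmetic, here it is restated using only the numbers above (same caveats apply):

Code:
#include <cstdio>

int main() {
    const double density = 17.6;   // 20nm vs 90nm density factor from the table
    const double freq    = 1.5;    // ~50% GPU clock bump assumed above
    const double edram90 = 10.0;   // MB of eDRAM on the 360's 90nm daughter die

    std::printf("raw performance scaling: %.1fx\n", density * freq);      // ~26.4x
    std::printf("eDRAM in the same area:  %.0f MB\n", edram90 * density); // ~176 MB
}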

PROJECTION #2: Relative to the Cayman GPU. AMD architecture chosen because they seem the likely target, although there will be a shift to VLIW4 from VLIW5. AMD appears headed for a paper launch of 28nm GPUs in late Q4 2011. A Q4 2013 GPU on TSMC's 20nm process (a taped-out chip on this process was announced in Oct 2011, so allow 12-18 months for first mass production).

Cayman aka 6970: 389mm^2 on 40nm TSMC (250W TDP; 2.64B transistors, 2.7TFLOPs, 880MHz, 96 TMU, 32 ROP; was originally aimed at the cancelled 32nm node). For comparison: the 6950 offers 2.25TFLOPs @ 800MHz; 88 TMU, 32 ROP; 200W TDP; idle ~20W, avg. peak load ~157W iirc.

20nm is about 4.0 times denser than 40nm. Cayman would naively project to 10.8TFLOPs at 20nm (no architectural or frequency changes). Obviously, with I/O and various parts shrinking at different rates, and some parts not needing the same increase (e.g. ROPs), this is a very rough projection.

Assuming a ~300mm^2 footprint (0.77x the size of Cayman), a Cayman-like design would hit 8.3TFLOPs, 294 TMU, 98 ROP, etc. Not all flops are equal (see how the lower-flop Xenos typically bests RSX; ditto the contrast of NV's scalar design and AMD's VLIW; also note the change in VLIW), so this is all general ballpark. It would be aggressive to assume the retention of the 800MHz of the 6950, let alone the 880MHz of the 6970, especially after the RROD. Redundancy and binning would also need to be a consideration. Without more information on the TDP of TSMC's 20nm node, 750MHz seems a safer projection.

PROJECTION #2 Conclusion: Relative to the Cayman GPU. A 300mm^2 GPU on 20nm @ 750MHz would be in the 7 TFLOP range, potentially higher (e.g. shader cores will scale faster than other features like ROPs). Compared to the 215-236 GFLOPs range (iirc) of current console GPUs, that's about a 30x increase in performance (and there is no doubt the Cayman flops are more effective than the RSX flops). There will be other console specific features on chip (e.g. Xenos is also the memory controller; also redundancy) but some PC-specific parts can be dropped.
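And the same sanity check for the Cayman scaling, again just restating this post's own inputs:

Code:
#include <cstdio>

int main() {
    const double cayman_tflops = 2.7;           // 6970 @ 880MHz
    const double density       = 4.0;           // 20nm vs 40nm
    const double area          = 300.0 / 389.0; // ~0.77 of Cayman's die
    const double clock         = 750.0 / 880.0; // more conservative console clock

    std::printf("full die @ 880MHz: %.1f TFLOPs\n", cayman_tflops * density);                // ~10.8
    std::printf("300mm^2  @ 880MHz: %.1f TFLOPs\n", cayman_tflops * density * area);         // ~8.3
    std::printf("300mm^2  @ 750MHz: %.1f TFLOPs\n", cayman_tflops * density * area * clock); // ~7.1
}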

The projections based off of process reduction and the path GPUs have taken specifically lead to about the same range of 25-30x as fast in 2013 on 20nm. The money conscious skeptic would lean toward a low end 20x.

I would bet the development teams are seriously debating how long a console should last. Should it be more disposable (4 years?), anticipating that the platform transitions into a cloud device? 10 years? There are also eyes on realistic process reductions. And how soon can compelling software be deployed? Other major issues continue to be storage, storage speed, and distribution. Will a company reduce CPU/GPU budgets to invest in the consumer experience with hybrid storage for faster loads? How much will input devices (Kinect 2, Move 2, new concepts) cut into budgets? Will an HDD be standard? Will MS license BDR; offer activation codes? Will there be a big change in budgets: will MS decide to shift area from the CPU to the GPU? What memory densities, technology, and speed will be available? Will we see a CPU with similar tech (e.g. PPC) but a heterogeneous design (e.g. 1x OoOE PPE with a large fast cache + 6x Xenon PPEs)? Where does eDRAM and other similar tech fall in? Will we see a monstrous scratchpad?

And what if… Xbox 720 in 2012? 28nm is now shipping at TSMC. AMD is going to roll out their new GPUs in Q4. MS could aim for a 2012 launch (with Halo 4) with a 28nm GPU and an immediate cost reduction in 2013 to 20nm. The GPU would be essentially a 7950-class design. Of course Kinect would not have even reached its 3rd birthday before a new Kinect launched, and while the Xbox is selling very well the problem becomes: does MS ride the 360 until it dies, or do they move while the moving is still good? They could catch Sony in a hard spot (Vita is a Q1 2012 launch, so it is unlikely Sony would be aiming for 2012) and Sony would have to bet on 20nm being in mass production in 2013, and at best would get a ~2x performance jump (but also a much more expensive console all generation for that jump). The question is: are developers ready, is there compelling software, and have the tools matured enough that the content shift (most stuff is multi-million poly source now) will be less dramatic, allowing quicker uptake? Or are publishers saying, "Look, we bled for years. We need to continue capitalizing on the current platforms"? Or maybe they see the market ripe for a new platform, maybe more consolidation?

If things do stretch out to 2013/2014 the good news is it is highly probable one company will keep similar footprint budgets / TDP compared to this gen, almost guaranteeing a significant leap in processing power.

My hope would be a release maybe in 2012 with a large processing footprint, 4GB of fast memory, and standard storage with a Kinect like device plus a multifunction motion/traditional pad with the goal of a big cost reduction ramp up in 2013. I don't mind a long lived console as long as it doesn't age too quickly.
 