Predict: The Next Generation Console Tech

If we're assuming a transistor budget comparable to a Bulldozer, I'd just as soon have 8 Cell processors stacked together. 8 PPEs (16 threads...) and 64 SPEs.



The SPEs seem to be mainly used to help the GPU in the PS3 anyway. So just put those SPE resources into a better GPU in the first place and you'll come out ahead. Plus it would probably be a programming nightmare...

Then again, with a similar transistor budget you could get at least 10-12 cores' worth of i7 with HT on top, for a total of 20-24 parallel threads.

Let's ignore transistor count because it seems unreliable; different vendors count differently, etc. BD's die size is "only" ~50% greater than SB's, so a 6-core SB would be roughly BD die size.
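As a rough sanity check on that (the ~315 mm^2 Bulldozer/Orochi and ~216 mm^2 quad-core Sandy Bridge figures below are my own assumptions for illustration, and naively scaling 4 cores to 6 ignores the uncore):

```cpp
// Back-of-envelope check of the BD vs. SB die-size claim.
// The ~315 mm^2 (Bulldozer/Orochi) and ~216 mm^2 (quad-core Sandy Bridge)
// figures are assumptions for illustration, not official numbers.
#include <cstdio>

int main() {
    const double bd_mm2      = 315.0;            // assumed Bulldozer (Orochi) die area
    const double sb_4c_mm2   = 216.0;            // assumed 4-core Sandy Bridge die area
    const double ratio       = bd_mm2 / sb_4c_mm2;  // ~1.46, i.e. ~50% bigger
    const double sb_6c_guess = sb_4c_mm2 * 1.5;     // naive 6-core scaling (ignores uncore)
    std::printf("BD/SB area ratio: %.2f\n", ratio);
    std::printf("Naive 6-core SB estimate: %.0f mm^2 (vs BD %.0f mm^2)\n", sb_6c_guess, bd_mm2);
    return 0;
}
```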
 
The SPEs seem to be mainly used to help the GPU in the PS3 anyway. So just put those SPE resources into a better GPU in the first place and you'll come out ahead. Plus it would probably be a programming nightmare...
It shouldn't be. Jobs are scheduled across SPEs and schedulers can accommodate scaling effortlessly. As long as the groundwork has been done, developing for an uber-Cell should be no different to developing for the current one.
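For what it's worth, here's a minimal sketch of that kind of job-queue model, with the worker count as a parameter so the same submission code conceptually scales from 8 SPEs to 64. Plain std::thread stands in for SPEs; it's not actual Cell/libspe code, and all names are invented for illustration.

```cpp
// Minimal job-queue sketch: jobs go into a shared queue and are pulled by
// however many workers the hardware provides. Scaling from 8 "SPEs" to 64
// only changes the worker count; the code that submits jobs stays the same.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobScheduler {
public:
    explicit JobScheduler(unsigned numWorkers) {
        for (unsigned i = 0; i < numWorkers; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }
    ~JobScheduler() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void Submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // e.g. an animation, culling, particle or audio batch
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    JobScheduler scheduler(8);  // 8 workers today, 64 on an "uber-Cell": same client code
    for (int i = 0; i < 100; ++i)
        scheduler.Submit([i] { /* do one batch of work */ (void)i; });
}
```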

I have to wonder about the worth of such a mammoth CPU though. As you say, Cell is lifting some GPU work. I for one like programmability, but in a standard game there's only so much you'd need the CPU to do if the GPU is up to the graphics workload required. Would we have seen more non-graphics use out of Cell if it didn't have to support RSX? Are devs going to need lots of CPU next-gen, or would it only be used for graphics, at which point you're better off spending those transistors on the GPU?

I feel there's a definite argument there for using the current 1:8 Cell. It'd be BC, well known, and would likely offer ample performance for what a dev wants from the CPU next-gen alongside a monster GPU. The reason not to consider that is developer effort, which will be key. But I don't think a high-performance CPU is a priority. Anyone got any estimates of what the CPU will need to be doing next-gen?
 
I agree. The CPU just needs to be "good enough"; the GPU is where it's at. Don't want a 2 billion transistor Bulldozer hogging my budget... well, Bulldozer sucks anyway, haha.

I guess you could make some sort of "but teh physics!!!" argument, but the XCPU seems to keep up OK in the destruction-based games (Red Faction, Crysis, BF3) that I know of, anyway. I'm not a huge believer in "teh physics" anyway.

I think it is just like a gaming PC: the balance you want is a mid-grade CPU + high-end GPU, much more so than a high-end CPU + mid-grade GPU.
 
You want a mid-grade CPU on the PC because nobody is doing anything that pushes the CPU, in order to cater to low-end users. Meanwhile it's pretty easy to up resolution, AA, filtering quality, etc. with a high-end GPU.

I still think an enormously powerful Cell-based design coupled with a unified shader GPU would produce some amazing results. If you cheap out on the CPU and put everything on the GPU you're just going to end up stealing time from the actual graphics in order to do a bunch of inefficient brute-force GPGPU work. Even if you end up doing lots of graphics work on the CellX8, that just means the GPU resources will go further.
 
You want a mid-grade CPU on the PC because nobody is doing anything that pushes the CPU, in order to cater to low-end users. Meanwhile it's pretty easy to up resolution, AA, filtering quality, etc. with a high-end GPU.
But then what would they do? I guess behavioural physics would be a big thing, if animations were calculated on the fly instead of predesigned. Could Uncharted 4 not have canned animation blending but a Euphoria system instead? It'd be good in theory, but there are always extra issues to worry about (like gameplay lag), such that what seems a simple upgrade in theory doesn't work well in games. For me to think a good CPU is warranted, I'd need to see a compelling argument that it'd be used to differentiate and make better games.
 
I would guess the biggest reason why games aren't using much CPU power on PCs is that it's really hard to make anything useful with it that would scale to the low end without affecting gameplay. Effects physics would be one thing, but there is a limit on how many particles you can have flying around.
 
But then what would they do?
There are portions of graphics work that run better on the CPU, at least assuming there's a low-latency connection between the two. Data structure creation and hierarchical stuff generally work better on CPUs than GPUs, as do things like reductions that have unavoidable serial segments (i.e. the last few levels of the tree). Modern CPUs even do OK at culling and some shading in a pinch, as long as you don't need texture filtering.
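As a toy illustration of that serial-tail point (my own sketch, not anyone's engine code): the wide early passes of a tree reduction have plenty of parallelism and suit the GPU, while the last few levels are so narrow that finishing them serially on the CPU is cheaper than issuing another dispatch.

```cpp
// Toy min-reduction split the way described above: wide passes (lots of
// parallelism) would run on the GPU, and once the working set drops below
// some threshold the last few levels are finished serially on the CPU.
#include <algorithm>
#include <cstdio>
#include <vector>

// Stand-in for one "GPU" pass: a single level of a pairwise min reduction.
static std::vector<float> ReduceOneLevel(const std::vector<float>& in) {
    std::vector<float> out((in.size() + 1) / 2);
    for (size_t i = 0; i < out.size(); ++i) {
        float a = in[2 * i];
        float b = (2 * i + 1 < in.size()) ? in[2 * i + 1] : a;
        out[i] = std::min(a, b);
    }
    return out;
}

float MinReduce(std::vector<float> data) {
    const size_t kCpuThreshold = 1024;   // below this, a GPU dispatch isn't worth it (assumed)
    while (data.size() > kCpuThreshold)  // "GPU" portion: plenty of parallelism
        data = ReduceOneLevel(data);
    float result = data[0];              // serial tail on the CPU
    for (size_t i = 1; i < data.size(); ++i)
        result = std::min(result, data[i]);
    return result;
}

int main() {
    std::vector<float> depths(1 << 20, 1.0f);
    depths[12345] = 0.25f;
    std::printf("min depth: %f\n", MinReduce(depths));
}
```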

For instance, consider BF3's renderer. If there were a low-latency CPU/GPU connection it would make sense to do the first few steps of the Z reduction on the GPU, potentially switching to the CPU at some point depending on the required parallelism. Then hierarchical culling on the CPU, followed by the leaves of the culling step launching GPU tile shading jobs. Right now, because of issues with the interface to GPUs and the generally long latency of PCI-E, you're stuck doing it all in one place or the other, so you either have inefficient culling or inefficient shading.
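A rough sketch of that split, purely illustrative (the tile sizes, structures and names below are invented, not Frostbite's): the CPU walks a screen-space hierarchy, culls lights against per-tile depth bounds, and each surviving leaf becomes a GPU tile-shading job.

```cpp
// Illustrative CPU-side hierarchical light culling over screen tiles; each
// surviving leaf would be submitted as a GPU tile-shading job. Structures and
// numbers are invented for the sketch, not taken from any real engine.
#include <cstdio>
#include <vector>

struct Tile  { int x0, y0, x1, y1; float minZ, maxZ; };   // screen-space rect + depth bounds
struct Light { float x, y, zMin, zMax, radius; };         // light footprint in screen/depth space

static bool LightTouchesTile(const Light& l, const Tile& t) {
    // Depth-range rejection first (this is what the Z reduction feeds)...
    if (l.zMax < t.minZ || l.zMin > t.maxZ) return false;
    // ...then a conservative 2D test of the light footprint against the tile rect.
    return l.x + l.radius >= t.x0 && l.x - l.radius <= t.x1 &&
           l.y + l.radius >= t.y0 && l.y - l.radius <= t.y1;
}

// Recursively split the screen; at leaf size, emit a tile-shading "job"
// (here just a printout) carrying the lights that survived culling.
static void CullNode(const Tile& node, const std::vector<Light>& lights, int leafSize) {
    std::vector<Light> survivors;
    for (const Light& l : lights)
        if (LightTouchesTile(l, node)) survivors.push_back(l);
    if (survivors.empty()) return;  // whole subtree culled

    int w = node.x1 - node.x0, h = node.y1 - node.y0;
    if (w <= leafSize && h <= leafSize) {
        std::printf("GPU job: tile (%d,%d) with %zu lights\n",
                    node.x0, node.y0, survivors.size());
        return;
    }
    // Split the larger axis; real code would also refine minZ/maxZ per child.
    Tile a = node, b = node;
    if (w >= h) { a.x1 = b.x0 = node.x0 + w / 2; }
    else        { a.y1 = b.y0 = node.y0 + h / 2; }
    CullNode(a, survivors, leafSize);
    CullNode(b, survivors, leafSize);
}

int main() {
    Tile screen{0, 0, 1280, 720, 0.0f, 1.0f};
    std::vector<Light> lights{{640, 360, 0.2f, 0.6f, 100}, {100, 100, 0.1f, 0.3f, 50}};
    CullNode(screen, lights, 16);
}
```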
 
I agree. The CPU just needs to be "good enough"; the GPU is where it's at. Don't want a 2 billion transistor Bulldozer hogging my budget... well, Bulldozer sucks anyway, haha.

I guess you could make some sort of "but teh physics!!!" argument, but the XCPU seems to keep up OK in the destruction-based games (Red Faction, Crysis, BF3) that I know of, anyway. I'm not a huge believer in "teh physics" anyway.

I think it is just like a gaming PC: the balance you want is a mid-grade CPU + high-end GPU, much more so than a high-end CPU + mid-grade GPU.

I would guess that in 2014, with chips designed even more around GPGPU than today's, physics will be moved over to the GPU anyway, a la PhysX/Bullet, taking even more work away from what you need to do on the CPU. My guess is you will see a smaller, simpler, easier-to-program CPU and more GPU; a 20/80 or even 10/90 split.
 
There is, IMHO, a good question: do prettier graphics make a better game? For example, is NHL better with prettier graphics, or if the physics actually work better (tackles, stick handling, puck physics)? I for one wish and hope next-gen hardware is not about prettier pixels but about actually enabling new kinds of interactions. In this sense, either very general-purpose GPUs (physics) or a very nice CPU would fit my world better. If next-gen HW is all about prettier pixels I think I will just skip it, or only buy the very few games that venture beyond the mainstream.

My ideal next-gen HW would be all about making things more alive rather than prettier pixels. In this sense I like what the PS3 did with Cell, but unfortunately it's used up compensating for RSX's shortcomings instead of enabling next-gen gameplay.

There are some happy exceptions in games, but they are too few and far between (e.g. LittleBigPlanet was awesome).
 
There's a lot of overlap there anyway.

If you need to use fewer cycles for making it prettier, then you have more cycles for other things (i.e. GPU physics).
 
There's a lot of overlap there anyway.

If you need to use fewer cycles for making it prettier, then you have more cycles for other things (i.e. GPU physics).

To some extent yes, but practically no, until we have a GPU that can realistically act as a CPU or vice versa.
 
I wonder, are the next consoles so late that we can have an optical link between CPU and GPU, or is that a tech that's forever "a few years from now"?
Or would one of the two chips feature stacked memory? Presumably the CPU, as you would get a custom CPU and a slightly tweaked off-the-shelf GPU a la RSX.

This discussion about moving work back and forth between GPU and CPU is very interesting; it's done on PS3 out of necessity and because the hardware allows it, but it should at least be possible on a next-gen console.
 
On CPU design, I still think the real way to go with highly parallel architectures is lots of fine-grained hardware threads with shared computational resources, making it possible to mask memory latencies. Even ignoring main memory hits, L2 cache hits are enormously expensive today, and the latencies aren't likely to go down. Exploiting an architecture like this in a game, though, is a long way off.
That's what IBM and Sun (Oracle) are already doing. Power7 has eight cores, each running four threads (32 threads total), and UltraSPARC T2 has eight cores, each running eight threads (64 threads total). These are the most powerful high-performance CPUs currently. A Power7 / Power8 derivative for the next consoles would be perfect, and since both console manufacturers are already using IBM designs, this isn't so far-fetched either.
 
That's what IBM and Sun (Oracle) are already doing. Power7 has eight cores, each running four threads (32 threads total), and UltraSPARC T2 has eight cores, each running eight threads (64 threads total). These are the most powerful high-performance CPUs currently. A Power7 / Power8 derivative for the next consoles would be perfect, and since both console manufacturers are already using IBM designs, this isn't so far-fetched either.

Power7 is a big chip; at least the 32MB eDRAM version is 567mm^2 on 45nm. I assume this is the 8-core (4 threads each) variant? And then there is TDP. I wonder how big/power hungry that would be on 22nm? Anyway, that is a huge chip. Cell was 230mm^2 on 90nm and was reduced down to 120mm^2 on 65nm. Cell was a huge chip (Xenon was what, 160mm^2?). It sounds like Power8 will be on 22nm in roughly 2013.
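As a crude back-of-envelope, assuming die area scales with the square of the process node name (which real shrinks only approximate), the numbers quoted above work out like this:

```cpp
// Crude die-area scaling estimates, assuming ideal scaling with the square of
// the node name. Real shrinks usually fall short of this, so treat the 22nm
// figures as optimistic lower bounds rather than predictions.
#include <cstdio>

static double Scale(double area_mm2, double from_nm, double to_nm) {
    double s = to_nm / from_nm;
    return area_mm2 * s * s;
}

int main() {
    std::printf("Power7 567 mm^2 @ 45nm -> ~%.0f mm^2 @ 22nm (ideal)\n", Scale(567, 45, 22));
    std::printf("Cell    230 mm^2 @ 90nm -> ~%.0f mm^2 @ 65nm (ideal; the actual ~120 quoted above)\n",
                Scale(230, 90, 65));
    std::printf("Cell    115 mm^2 @ 45nm -> ~%.0f mm^2 @ 22nm (ideal)\n", Scale(115, 45, 22));
}
```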
 
Yes, Power7 is big and hungry, but the IBM PPC A2 at 2.3 GHz with 16 cores / 4 threads per core = 64 threads consumes only 65 watts. That's very reasonable (only half of an FX-8150).
 
To some extent yes, but practically no, until we have a GPU that can realistically act as a CPU or vice versa.

It doesn't need to fully achieve the flexibility of a CPU to be kept busy with non-graphics tasks. Obviously there are limits, but there are also limits to what you're going to do in a gaming environment.
 
Again though, note that more HW threads per core are actually undesirable unless they enable something else, like overall faster frequencies or wider SIMD or something. More HW threads just means you divide up the register file/cache/whatever. Thus you either end up spending all your area on more of that (and not cores/ALUs), or only toy kernels fit in cache and it's impossible to get good utilization.

Now obviously there are good tradeoffs that involve adding more threads/core, but I'm just noting again that it obviously does not increase throughput. It's to hide latencies that are typically higher than one would want, hence again you require more parallelism to fill the machine. There are two sides to every coin.
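To put a toy number on that trade-off (the cycle counts here are made-up assumptions, not measurements of any CPU): if a thread stalls for S cycles after every W cycles of independent work, you need roughly 1 + S/W threads per core to keep the units busy, and each of those threads gets a proportionally smaller slice of the register file and cache.

```cpp
// Toy latency-hiding arithmetic: threads needed to cover a stall of S cycles
// when each thread has W cycles of independent work between stalls.
// The cycle counts and cache size are illustrative assumptions only.
#include <cmath>
#include <cstdio>

int main() {
    const double stall_cycles = 40.0;  // e.g. an L2 hit -- assumed value
    const double work_cycles  = 10.0;  // independent work per thread between stalls -- assumed
    const double threads = std::ceil(1.0 + stall_cycles / work_cycles);
    std::printf("~%.0f threads/core to hide a %.0f-cycle stall", threads, stall_cycles);
    std::printf(" -- and a 32 KB L1 shared %.0f ways leaves ~%.1f KB per thread\n",
                threads, 32.0 / threads);
}
```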
 
Yes, Power 7 is big and hungry, but IBM PPC A2 at 2.3 GHz with 16 cores / 4 threads per core = 64 threads consumes only 65 watts. That's very reasonable (only half of the FX-8150).

Interesting:

5.5 A Wire-Speed Power™ Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads

A 64-thread simultaneous multi-threaded processor uses architecture and implementation techniques to achieve high throughput at low power. Included are static VDD scaling, multi-voltage design, clock gating, multiple VT devices, dynamic thermal control, eDRAM and low-voltage circuit design. Power is reduced by >50% in a 428mm^2 chip. Worst-case power is 65W at 2.0GHz, 0.85V.

I wonder how much eDRAM? With IBM skipping 28nm, a design like this, while desirable in many ways, would seem destined for 22nm, which would align with 2013 at the earliest. Which doesn't sound unreasonable.

The flip question is: how does this stack up against Cell? It seems even with poor shrinks, Cell at 45nm is 115mm^2 and 50W at load at 3.2GHz. I guess it has room to spare due to the various chip IO. Obviously power consumption is a HUGE issue, but it would seem (?) a modestly clocked Cell design at 45nm with a similar footprint would offer roughly twice as many "real" cores, plus some poor PPEs.

So answer me this question, Sebbbi: would you want:

(a) PPC A2 at 2.3-2.8GHz with 16 cores / 4 threads (64 total threads) with a large eDRAM L3 on 22nm; or

(b) A 4 PPE Cell with 32 SPEs (256K LS) at 2.8-3.2GHz on 22nm.

I am not sure the Cell could fit into the same thermals and I have no clue how the caches would compare (I am guessing the PPC A2 is going to have much larger L2 plus L3).

Hmmm I wonder if large publishers/developers are being presented with this very question.
 
So answer me this question, Sebbbi: would you want:

(a) PPC A2 at 2.3-2.8GHz with 16 cores / 4 threads (64 total threads) with a large eDRAM L3 on 22nm; or

(b) A 4 PPE Cell with 32 SPEs (256K LS) at 2.8-3.2GHz on 22nm.

I am not sure the Cell could fit into the same thermals and I have no clue how the caches would compare (I am guessing the PPC A2 is going to have much larger L2 plus L3).

Hmmm I wonder if large publishers/developers are being presented with this very question.

There's an easy answer to that question: a 32-SPU Cell doesn't exist; it was cancelled because IBM gave up on Cell in general and it was a terrible product. The PPC A2 exists and has a bunch of products out. One is an option, one isn't.
 
Now obviously there are good tradeoffs that involve adding more threads/core, but I'm just noting again that it obviously does not increase throughput. It's to hide latencies that are typically higher than one would want, hence again you require more parallelism to fill the machine. There are two sides to every coin.
If you run more threads per core, you can make beefier cores with more execution units, since you have more TLP to exploit. For example, a single Power7 core has a whopping 12 execution units inside and is capable of issuing 8 instructions per clock. That's plenty of backbone to run four threads in SMT. It seems like a good trade-off to me, since you get both good throughput and good latency hiding (and the chip even runs at 4.25 GHz to boot). Four threads per core should keep the execution pipelines filled much better than just one or two. And that's what you really want: to keep the execution units rolling at all times.

Of course two threads per core is also much better than one, and hyperthreading from Intel was a really smart move, but Intel didn't increase the number of execution units in their cores when they added hyperthreading. They aimed for the best possible single-thread performance, and hyperthreading was basically used to salvage some of the stall cycles (memory system latency mostly). Adding some extra execution units to each core would make the second threads act much more like real cores (performance-wise). But I can't really criticize the current approach Intel has chosen, since it seems to be a very good fit for consumers (and workstations). But let's see if Sandy Bridge E changes this, since the server market cares more about heavily multithreaded loads. It wouldn't be that hard to slightly increase execution resources per core to get a bigger performance boost out of hyperthreading.
 