Predict: The Next Generation Console Tech

Cell isn't a terrible product. Just because a better solution may exist doesn't mean the other one is terrible. Also, I'm sure IBM would build a super Cell if somebody asked and was willing to fund at least the majority of any extra R&D.
 
Would it be possible to drop the original PPE in favor of a Power7-derived core? Then we'd get the best of both worlds. (page 314... pi!!)
 
http://forum.beyond3d.com/showthread.php?t=20563

+ The reason Saito had doubts about IBM is that in 1999, after visiting IBM HQ in the U.S., he reported to SCE that a partnership with IBM would be difficult: IBM as a developer of server CPUs and Toshiba as a developer of consumer CPUs seemed too different in their styles.

+ Saito, Kutaragi and an IBM representative met at a hotel in Roppongi near IBM Japan. It was Kutaragi's wish to invite IBM after all, so Saito accepted it.

+ Later the three companies held meetings to discuss the architecture of CELL. The target performance of the project was 1 TFLOPS. Toshiba proposed the Force System, which has many simple RISC cores and a main core as the controller. Jim Kahle, the POWER4 architect from IBM, proposed an architecture consisting simply of multiple identical POWER4 cores. When a Toshiba engineer said that maybe the Force System didn't need a main core, Kahle was greatly pissed off (thus the title of this chapter), as without a main core POWER would have no role in the new architecture.

+ The meetings continued for several months, and Yamazaki of SCE was inclined toward the IBM plan and voted for it. But Kutaragi turned it down. Eventually Yamazaki and Kahle talked about the new architecture and agreed to merge the Toshiba plan and the IBM plan. Finally IBM proposed the new plan in which a Power core is surrounded by multiple APUs. The backer of the APU at IBM Austin was Peter Hofstee, one of the architects of the 1GHz Power processor. It was adopted as the CELL architecture.
Kutaragi is out; Yamazaki continues working for Sony. He never liked Cell, but he worked on it because it was the architecture chosen by the management.
 
So answer me this question, Sebbbi. Would you want:

(a) A PPC A2 at 2.3-2.8GHz with 16 cores / 4 threads per core (64 total threads) and a large eDRAM L3 on 22nm; or

(b) A 4-PPE Cell with 32 SPEs (256K LS each) at 2.8-3.2GHz on 22nm.
I don't know enough about PPC A2 to answer that question. However, A2 is an "aggressively out of order" architecture, executes four threads per core to hide latency and has caches instead of separate memory pools. It seems to be a really good architecture from a developer's point of view: out-of-order execution solves lots of stall cases (pipeline stalls, load/store stalls, branching stalls, minor cache stalls, etc.), 4 threads per core solve memory latency stalls (and help with pipeline stalls as well), and finally we have automatic caches instead of separate manual memory pools (and with a good set of manual cache-control instructions, caches can be used as temporary storage in performance-critical code sections as well, allowing pretty much the same performance as separate memory pools would provide).

But in the end it all depends on how much extra raw performance the Cell would provide. You never have enough time to optimize. If A2 ran all general code well enough, you could focus your optimizations on a much narrower set of performance bottlenecks. With Cell, you have to spend a lot of effort just to make basic code function properly. All that complexity cuts down the time left for optimizing the code.
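
To make the cache-control point above concrete, here's a minimal sketch, assuming a GCC/Clang-style compiler: a streaming loop that uses the __builtin_prefetch hint to pull data into cache ahead of use, which is roughly the hand-rolled equivalent of double-buffering into a separate memory pool. The prefetch distance is a made-up tuning value, not anything A2-specific.
Code:
#include <cstddef>

// Sketch: stream over 'src', prefetching PREFETCH_DIST elements ahead so the
// cache fills for iteration i + PREFETCH_DIST overlap the arithmetic of
// iteration i, much like an explicit double-buffer into a local store would.
void scale_stream(const float* src, float* dst, std::size_t n, float k)
{
    constexpr std::size_t PREFETCH_DIST = 64;   // hypothetical tuning value
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&src[i + PREFETCH_DIST], /*rw=*/0, /*locality=*/1);
        dst[i] = src[i] * k;                    // the "real" work
    }
}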
 
I'm not happy to tell you this, sebbbi, but the Power A2 is an in-order architecture, closer to Xenon/PPU/Atom than to POWER7 or top x86 CPUs. It's a 2-issue CPU with few execution units. They aim at "high" utilization of these few resources thanks to 4-way SMT; going for a simple front end, in-order execution and few execution units allows them to pack a significant number of cores per chip (both because it's cheap and because it has benefits in power consumption).
I believe you're confusing it with POWER7. Anyway, keep the discussion going, it's really interesting :)
 
I believe you're confusing it with POWER7. Anyway, keep the discussion going, it's really interesting :)
Bummer, it was too good to be true :) . I just quickly browsed through the new IBM models and must have partially mixed up PPC A2 with Power 7 (as both execute four threads per core). However, in-order execution with four-way SMT wouldn't be that bad. Four threads per core hide a lot of the stalls caused by in-order execution.
 
Cell isn't a terrible product. Just because a better solution may exist doesn't mean the other one is terrible. Also, I'm sure IBM would build a super Cell if somebody asked and was willing to fund at least the majority of any extra R&D.

It must be a terrible product from IBM's point of view, because only one buyer outside of the original alliance bought it in any real volume, and even then they only bought it for one supercomputer. People would rather buy four or eight times the x86 cores than rewrite and redesign all their software and tools for Cell. Now that you can put 16 cores on a chip, or use GPU accelerators rather than using SPUs as accelerators, there's no point to Cell. A 32-SPU chip would just be over 5 times the PITA to program that Cell on PS3 was, while delivering less performance than if you just moved more die area to a GPGPU chip.
 
However, in-order execution with four-way SMT wouldn't be that bad. Four threads per core hide a lot of the stalls caused by in-order execution.
Having worked with an in-order 4 threads/core architecture (with explicit cache instructions) it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times. That said, I think it's a pretty good mix of architectures for game development at the moment although in the long run you might still want to have a couple really fast out-of-order cores to handle the ugly bits that would otherwise dominate runtimes.
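
As a toy illustration of that reduction-tail problem (not anyone's production code, and spawning a thread per add is deliberately silly): each pass of a tree reduction halves the number of workers that have anything to do, so the last passes run on 2, then 1 core no matter how wide the machine is.
Code:
#include <cstddef>
#include <thread>
#include <vector>

// Toy tree reduction: log2(N) passes with a barrier between them. The first
// pass keeps N/2 workers busy, the final passes keep only 2, then 1 -- that
// shrinking tail is what ends up dominating on very wide machines.
float tree_reduce(std::vector<float> v)
{
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        std::vector<std::thread> pass;
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
            pass.emplace_back([&v, i, stride] { v[i] += v[i + stride]; });
        for (auto& t : pass) t.join();          // level barrier
    }
    return v.empty() ? 0.0f : v[0];
}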
 
Some areas in graphics that could and should be a lot better on next-gen hardware, and wouldn't even require too many extra artist hours:

- Generally better character deformations. It is unbelievable to me just how much self-intersecting limbs and bodies are accepted by gamers when we had to somehow get rid of them even 10 years ago. Granted, some of this is caused by character customization features or lack of mocap cleanup, but it's still a very disturbing element and I can't understand why no one bothers with correcting it. Also, most games are still stuck with simple skinning using 3-4 bones per vertex (a sketch of that standard approach follows this list); the most that game character riggers can do is add a few helper bones here and there.
Yet there's ongoing research even in offline CG, like Disney's elastic skinning, that suggests the possibility of somewhat better approaches - I would have expected game devs to jump on such ideas a long time ago.

- Better cloth and other secondary-dynamics simulation. This of course would have to go with more detailed character models, as you can't modify normal-mapped elements with simulation. And of course it's very problematic: even after 10 years we're still fixing cloth-sim problems frame by frame most of the time, and there are still plenty of them because of the short schedules... I can't imagine how a truly robust realtime approach could suddenly appear when there's still such a need for it in offline VFX.

- Proper hair rendering and dynamics. Guess there's not much to say here, complex as hell, rendering intensive, but it's getting really necessary to replace those alpha textured polygon strips...
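
For reference, the "simple skinning using 3-4 bones per vertex" mentioned in the first point is plain linear blend skinning; a minimal CPU-side sketch (hypothetical Vec3/Mat4 types, bone matrices assumed to already include the inverse bind pose) looks like this. Its well-known artefacts at joints (candy-wrapper twists, volume loss) are exactly what helper bones and research like elastic skinning try to work around.
Code:
#include <array>
#include <vector>

// Hypothetical minimal math types for the sketch.
struct Vec3 { float x, y, z; };
struct Mat4 { float m[16]; };                        // column-major

Vec3 transform_point(const Mat4& M, const Vec3& p) {
    return { M.m[0]*p.x + M.m[4]*p.y + M.m[8]*p.z  + M.m[12],
             M.m[1]*p.x + M.m[5]*p.y + M.m[9]*p.z  + M.m[13],
             M.m[2]*p.x + M.m[6]*p.y + M.m[10]*p.z + M.m[14] };
}

struct SkinnedVertex {
    Vec3 bind_pos;                   // position in the bind pose
    std::array<int,   4> bone;       // up to 4 influencing bones
    std::array<float, 4> weight;     // weights, expected to sum to 1
};

// Classic linear blend skinning: the deformed position is just the
// weight-blended sum of each bone's (current * inverse-bind) transform.
Vec3 skin_vertex(const SkinnedVertex& v, const std::vector<Mat4>& bone_matrices) {
    Vec3 out{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < 4; ++i) {
        if (v.weight[i] <= 0.0f) continue;
        const Vec3 p = transform_point(bone_matrices[v.bone[i]], v.bind_pos);
        out.x += v.weight[i] * p.x;
        out.y += v.weight[i] * p.y;
        out.z += v.weight[i] * p.z;
    }
    return out;
}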
 
Utter utter garbage...

-4PPC/32SPU Cell: ~a billion transistors, ~800 GFLOPS, nightmare programming. The current Cell has 6 SPUs, and 32 is 5.33 times as many; of course saying it's that much harder is arbitrary, but one must admit it would be considerably harder.

-RV790: ~just under a billion transistors, ~1.36 TFLOPS, probably easier than 32 SPUs, maybe easier than the 6 on the current Cell.

Where am I wrong?
 
Having worked with an in-order 4 threads/core architecture (with explicit cache instructions) it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times. That said, I think it's a pretty good mix of architectures for game development at the moment although in the long run you might still want to have a couple really fast out-of-order cores to handle the ugly bits that would otherwise dominate runtimes.
It really depends what kind of mixed workload you have on your core. For example, if two threads are doing something that has just a few arithmetic instructions per cache miss (mostly waiting), and the other two threads are doing heavy arithmetic (with few cache misses), the execution-unit utilization should still be pretty good. Of course, if all four threads are doing low-arithmetic / high-cache-miss execution at the same time, the execution-unit utilization would be really bad.
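
Purely as an illustration of the two workload archetypes being mixed here (nothing below pins threads onto a core's SMT siblings, which you'd need platform-specific calls for to actually test the claim): one thread is almost all cache-miss latency, the other almost all arithmetic.
Code:
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Memory-latency bound: dependent loads over a random cycle, so nearly every
// iteration is a cache miss the thread just waits on.
std::uint32_t pointer_chase(const std::vector<std::uint32_t>& next, std::size_t steps)
{
    std::uint32_t i = 0;
    for (std::size_t s = 0; s < steps; ++s)
        i = next[i];
    return i;
}

// ALU bound: long arithmetic loop with essentially no memory traffic.
float dense_math(std::size_t iters)
{
    float a = 1.0001f, b = 0.9997f, acc = 0.0f;
    for (std::size_t i = 0; i < iters; ++i) {
        acc += a * b;
        a *= 1.000001f;
        b *= 0.999999f;
    }
    return acc;
}

int main()
{
    const std::size_t N = std::size_t{1} << 22;        // ~4M entries, well past the caches
    std::vector<std::uint32_t> order(N);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    std::vector<std::uint32_t> next(N);
    for (std::size_t k = 0; k < N; ++k)
        next[order[k]] = order[(k + 1) % N];            // one big random cycle

    // One of each archetype; on a 4-way SMT core you'd pair e.g. two of each.
    std::thread latency_bound(pointer_chase, std::cref(next), N);
    std::thread alu_bound(dense_math, std::size_t{200000000});
    latency_bound.join();
    alu_bound.join();
    return 0;
}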
 
Now obviously there are good tradeoffs that involve adding more threads/core, but I'm just noting again that it obviously does not increase throughput.
What a strange thing to say ... what exactly do you mean when you say throughput? I'd define it as the average IPC across all the running threads, which can obviously be increased by having more HW threads (or decreased).
 
A 32-SPU chip would just be over 5 times the PITA to program that Cell on PS3 was...
It'd be the same level of PITA, as (good) Cell code has matured into scheduled jobs across available cores.

-4PPC/32SPU Cell: ~a billion transistors, ~800 GFLOPS, nightmare programming. The current Cell has 6 SPUs, and 32 is 5.33 times as many; of course saying it's that much harder is arbitrary, but one must admit it would be considerably harder.
As I say, it shouldn't be.

-RV790: ~just under a billion transistors, ~1.36 TFLOPS, probably easier than 32 SPUs, maybe easier than the 6 on the current Cell.

Where am I wrong?
You can't use GPGPU FLOPS the same as CPU FLOPS. You can't count useable work in terms of FLOPS anyhow. GPGPU is not fully capable as a CPU, hence the fact every GPU is still coupled with a CPU and there's a market for bigger, more powerful CPUs still being made.

Having worked with an in-order 4 threads/core architecture (with explicit cache instructions) it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times.
This is one area where Cell's heterogeneous structure seems to make sense. Have two different core types for two different workloads, rather than either one core that's better at streamed throughput but struggles with more fragmented workloads, or one better at fragmented workloads but with fewer operations in flight at once. Of course, then the problem is writing code to use both halves of a heterogeneous design simultaneously. It's not worth the effort, I'm sure, so a more balanced design is probably best.
 
Forget next gen. What Sony should be doing at the moment is allowing us to link PS3s together for maximum power!

GT5 at full 1080p at 120fps by linking 2 PS3s. C'mon Sony, Polyphony already demonstrated such a concept before. Where's the firmware update for that?
 
It'd be the same level of PITA, as (good) Cell code has matured into scheduled jobs across available cores.

This goes back to what Andrew was saying about scaling beyond 4 cores, i.e. we haven't seen it yet. Can you find SPU-centric work for 32 SPUs when developers over the last 5 years have worked tirelessly to figure out how to find things for 5-6 to do? I'm dubious.

As I say, it shouldn't be.

You don't think it would prove any harder? Even the slightest bit? Because as it stands today, the opinion that Cell is the hardest to program and tune for is pretty well accepted, and being worse than the worst out there today seems like a step backwards, to me.

You can't use GPGPU FLOPS the same as CPU FLOPS. You can't count useable work in terms of FLOPS anyhow.

One could say you can't count Cell Flops as CPU flops. I think the entire argument for Cell is the high flops = useable work argument.

GPGPU is not fully capable as a CPU, hence the fact every GPU is still coupled with a CPU and there's a market for bigger, more powerful CPUs still being made.

The same logic applies to Cell though, it's just on one chip. SPUs are not fully capable as CPUs; one could probably say current GPUs are more capable than SPUs as a CPU. My point wasn't that Cell sucks in general, it was that if you had 500 million transistors or X mm2 of die area for a console, it's better spent in most cases on GPU than on SPUs, because SPUs are simply a way to attach more FLOPS per watt/die area/transistor and GPGPU is better at that metric.
 
This goes back to what Andrew was saying about scaling beyond 4 cores, i.e. we haven't seen it yet. Can you find SPU-centric work for 32 SPUs when developers over the last 5 years have worked tirelessly to figure out how to find things for 5-6 to do? I'm dubious.
That's untrue. Firstly, for the past 5 years developers have had loads for Cell to do to help out RSX. Secondly, a well-developed engine uses a job scheduler rather than hard-coding each core to a task, e.g.:
Q3: Do you find a global queue of jobs to be doled out upon request to be a good approach to partitioning work to the SPEs, or is it better to assign each SPE its own task in the game?
A3: Naughty Dog uses a Job Manager developed jointly by Naughty Dog’s ICE team and SCEE’s ATG group. This means that we can send any type of job to any SPE, and all of the scheduling of jobs is done through a priority system. This works well, since the overhead is minimal and we achieve good load-balancing between SPEs, something that would be hard to do by allocating a whole SPU to a single task.
You don't think it would prove any harder? Even the slightest bit?
Not when you have a job scheduler. The same framework would identify available cores and give them work to do, however many cores there are. The detail of the game becomes breaking things down into jobs, which is the same problem you have now with 6 cores as you would with 36, or 8 cores on an i7, or 32 cores on a Larrabee. It's the future of multicore. It's the only way GPUs can use the squillions of execution units they have in a US architecture: by a scheduler distributing workload.
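
None of the proprietary job managers mentioned here are public, but the shape of a "global priority queue, any job on any core" scheduler is easy enough to sketch in portable C++ (worker threads standing in for SPEs; everything below is illustrative, not ND/ATG code):
Code:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal global-queue job manager in the spirit described above: any job can
// run on any worker, and ordering is by priority alone.
class JobManager {
public:
    explicit JobManager(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            pool_.emplace_back([this] { worker_loop(); });
    }
    ~JobManager() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : pool_) t.join();
    }
    void submit(int priority, std::function<void()> fn) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push({priority, std::move(fn)}); }
        cv_.notify_one();
    }

private:
    struct Job {
        int priority;
        std::function<void()> fn;
        bool operator<(const Job& o) const { return priority < o.priority; } // max-heap
    };

    void worker_loop() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;      // only when shutting down
                job = jobs_.top();              // copy out the highest-priority job
                jobs_.pop();
            }
            job.fn();                           // run it on whichever worker grabbed it
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::priority_queue<Job> jobs_;
    std::vector<std::thread> pool_;
    bool done_ = false;
};
A real version adds job dependencies and avoids the single lock, and on Cell it also has to stage each job's code and data into the 256K local store, but the dispatch model is the one described above.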

One could say you can't count Cell Flops as CPU flops.
But you can. They are completely flexible, although SIMD.

SPUs are not fully capable as CPUs...
Yes they are, other than needing a PPE to kickstart them. Once up and running they can do everything a PPC can.
one could probably say current GPUs are more capable than SPUs as a CPU.
Well you could say that, but you'd be painfully wrong!

My point wasn't that Cell sucks in general, it was that if you had 500 million transistors or X mm2 of die area for a console, it's better spent in most cases on GPU than on SPUs, because SPUs are simply a way to attach more FLOPS per watt/die area/transistor and GPGPU is better at that metric.
It wasn't just about Flops though. It was about programmable, versatile Flops. There was no GPU at the time that could turn its floating point units to what Cell can do (anything and everything you'd want). To this day GPUs still require a CPU to aid them and set up a lot of the work before the GPU can crunch the graphics. GPGPU is not as versatile as a CPU and cannot do everything a CPU can. Hence heterogeneous architectures, and the architectures that have attempted to bridge the gap have so far failed. The likes of Larrabee and Cell haven't become a workable, single core-architecture system, and we have CPU and GPU cores combined either on discrete packages or in one chip. But it remains a heterogeneous architecture, and that's not changing in the next couple of years.
 
What a strange thing to say ... what exactly do you mean when you say throughput? I'd define it as the average IPC across all the running threads, which can obviously be increased by having more HW threads (or decreased).
I mean consider a "core" with one set of execution units (ALU, etc). If you are running one unstalled thread you can utilize all of those units at full rate. Four threads would merely have to take turns so you've gained nothing and cut your register file/caches in four.

Now of course multiple HW threads/core can hide memory latencies and via that can increase throughput if you have threads that are stalled (on memory access or similar). Moreover they can sometimes hide pipeline/instruction latencies. That said, software pipelining/coroutines can accomplish a lot of this benefit too.

Anyways my point was just that it's not magic - again you're mortgaging parallelism for some different advantages (memory latency hiding). But for anyone who is happy with the Cell model of explicit DMAs, software prefetch would do the same thing without the need for multiple HW threads either. Now I personally don't think this is sufficient and hence I do want some number of HW threads/core to assist in this, but that's not the point I was making.
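
A toy version of the software-pipelining point, assuming a core with non-blocking, stall-on-use loads: instead of relying on a second HW thread to fill the stall cycles, interleave two independent work items by hand so one item's cache miss overlaps the other's.
Code:
#include <cstddef>
#include <cstdint>

// Hand-rolled "two threads in one": walk two independent index chains in the
// same loop. With stall-on-use loads, the miss for chain A is still outstanding
// while the load/add for chain B issues, so the two latencies overlap without
// any SMT -- the same trick software pipelining and coroutine schedulers use.
std::uint64_t chase_two(const std::uint32_t* next,
                        std::uint32_t a, std::uint32_t b, std::size_t steps)
{
    std::uint64_t sum = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        const std::uint32_t na = next[a];   // likely miss...
        const std::uint32_t nb = next[b];   // ...overlaps the first miss
        sum += na;
        sum += nb;
        a = na;
        b = nb;
    }
    return sum;
}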

The detail of the game becomes breaking things down into jobs, which is the same problem you have now with 6 cores as you would with 36, or 8 cores on an i7, or 32 cores on a Larrabee. It's the future of multicore.
No argument on the concept, but I contend that the work to break game engines down into *hundreds* of jobs as required to efficiently fill dozens of cores has probably not been done. It's semi-easy to break games into dozens of jobs to fill ~2-10 cores, but once you get into hundreds you often have to shift to parallel algorithms, not just ad hoc separation of independent work.

I'll also note that once you get to enough cores you have to stop using naive global schedulers with single queues and instead move to distributed queues and work stealing. Again, all this has been known for many, many years though, but it is a transition that doesn't happen overnight.

If games were actually written with this level of scalability already then there's really no excuse for the PC versions not even scaling to 6 cores, let alone 32. I remain skeptical until proven wrong :)
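
And for the distributed-queues-plus-work-stealing point a couple of paragraphs up, a deliberately naive sketch of the data structure (a mutex per deque instead of a proper lock-free Chase-Lev deque, just to show the shape): owners push and pop at the back of their own deque, idle workers steal the oldest task from the front of a random victim.
Code:
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

// One deque per worker: the owner works LIFO on its own queue (cache-friendly),
// thieves take FIFO from someone else's (oldest, usually largest, work first).
class StealingQueues {
public:
    explicit StealingQueues(std::size_t workers)
        : queues_(workers), locks_(workers) {}

    void push(std::size_t worker, Task t) {
        std::lock_guard<std::mutex> lk(locks_[worker]);
        queues_[worker].push_back(std::move(t));
    }

    // Owner path: take the most recently pushed task from our own deque.
    std::optional<Task> pop_local(std::size_t worker) {
        std::lock_guard<std::mutex> lk(locks_[worker]);
        if (queues_[worker].empty()) return std::nullopt;
        Task t = std::move(queues_[worker].back());
        queues_[worker].pop_back();
        return t;
    }

    // Thief path: try a few random victims and take their oldest task.
    std::optional<Task> steal(std::size_t thief, std::mt19937& rng) {
        std::uniform_int_distribution<std::size_t> pick(0, queues_.size() - 1);
        for (int attempt = 0; attempt < 4; ++attempt) {
            const std::size_t victim = pick(rng);
            if (victim == thief) continue;
            std::lock_guard<std::mutex> lk(locks_[victim]);
            if (queues_[victim].empty()) continue;
            Task t = std::move(queues_[victim].front());
            queues_[victim].pop_front();
            return t;
        }
        return std::nullopt;
    }

private:
    std::vector<std::deque<Task>> queues_;
    std::vector<std::mutex> locks_;
};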
 
Ninja, it's not going backwards. It's going forwards. It's a change that will be forced sooner or later, because single cores with all their out-of-order and scheduling fanciness have still hit the same wall they hit in the P4 days. There is only so much you can do with single-threaded performance before diminishing returns make it no longer worth it. Massively multi-core is the future, and not only that, as Shifty said, it might end up being better to have multiple groups of specialized cores than many homogeneous cores that are the jack of all trades and master of none.
 