Kutaragi is out; Yamazaki continues working for Sony. He never liked Cell, but he worked on it because it was the architecture chosen by management.
+ The reason Saito had doubts about IBM is that in 1999, after visiting IBM's headquarters in the U.S., he reported to SCE that a partnership with IBM would be difficult: IBM as a developer of server CPUs and Toshiba as a developer of consumer CPUs seemed too different in their styles.
+ Saito, Kutaragi and an IBM representative met at a hotel in Roppongi near IBM Japan. Inviting IBM was Kutaragi's wish after all, so Saito accepted it.
+ Later the three companies held meetings to discuss the architecture of CELL. The project's performance target was 1 TFLOPS. Toshiba proposed its Force System, which has many simple RISC cores and a main core acting as the controller. Jim Kahle of IBM, the POWER4 architect, proposed an architecture consisting simply of multiple identical POWER4 cores. When a Toshiba engineer suggested that the Force System might not need a main core at all, Kahle was greatly pissed off (thus the title of this chapter), since without a main core POWER would have no role in the new architecture.
+ The meetings continued for several months. Yamazaki of SCE was inclined toward the IBM plan and voted for it, but Kutaragi turned it down. Eventually Yamazaki and Kahle discussed the new architecture and agreed to merge the Toshiba plan and the IBM plan. Finally IBM proposed a new plan in which a Power core is surrounded by multiple APUs. The backer of the APU at IBM Austin was Peter Hofstee, one of the architects of the 1 GHz Power processor. This was adopted as the CELL architecture.
I don't know enough about PPC A2 to answer that question. However, A2 is an "aggressively out-of-order architecture", executes four threads per core to hide latency, and has caches instead of separate memory pools. It seems to be a really good architecture from a developer's point of view: out-of-order execution solves lots of stall cases (pipeline stalls, load/store stalls, branching stalls, minor cache stalls, etc.), 4 threads per core solve memory latency stalls (and help with pipeline stalls as well), and finally we have automatic caches instead of separate manual memory pools (and with a good set of manual cache control instructions, caches can be used as temporary storage as well in performance-critical code sections, allowing pretty much the same performance as separate memory pools would provide). But in the end it all depends on how much extra raw performance the Cell would provide. You never have enough time to optimize. If A2 ran all general code well enough, you could focus your optimizations on a much narrower set of performance bottlenecks. With Cell, you have to spend a lot of effort just to make basic code function properly. All that complexity cuts down the time left for optimizing the code.

So answer me this question, Sebbbi: would you want:
(a) a PPC A2 at 2.3-2.8 GHz with 16 cores / 4 threads (64 threads total) and a large eDRAM L3 on 22 nm; or
(b) a 4-PPE Cell with 32 SPEs (256 KB LS each) at 2.8-3.2 GHz on 22 nm?
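(Aside on sebbbi's cache-control point above: the practical difference between the two options is roughly the sketch below. It's illustrative C only, with made-up function names; the cached version assumes GCC's __builtin_prefetch, the SPU version assumes the Cell SDK's mfc_get / tag-status calls from spu_mfcio.h, a chunk size chosen to fit the 16 KB DMA limit, and n being a multiple of CHUNK. The two halves would really live in separate compilation units for different targets.)

[code]
#include <spu_mfcio.h>   /* Cell SDK, SPU side only */

/* (a) Cached core (A2/Power-style): hardware fetches lines on demand;
 *     __builtin_prefetch is just a hint to warm the cache a little early. */
float sum_cached(const float *data, long n)
{
    float acc = 0.0f;
    for (long i = 0; i < n; ++i) {
        __builtin_prefetch(&data[i + 64]);   /* hint only, harmless past the end */
        acc += data[i];
    }
    return acc;
}

/* (b) SPU-style local store: the programmer moves data in explicitly with DMA
 *     and double-buffers by hand to hide the transfer latency.
 *     Assumes n is a multiple of CHUNK and ea is 128-byte aligned. */
#define CHUNK 4096                              /* 4096 floats = 16 KB, the DMA max */
static float ls_buf[2][CHUNK] __attribute__((aligned(128)));

float sum_local_store(unsigned long long ea, unsigned long n)
{
    float acc = 0.0f;
    int cur = 0;
    mfc_get(ls_buf[cur], ea, CHUNK * sizeof(float), cur, 0, 0);
    for (unsigned long done = 0; done < n; done += CHUNK) {
        int nxt = cur ^ 1;
        if (done + CHUNK < n)                   /* kick off the next transfer early */
            mfc_get(ls_buf[nxt], ea + (done + CHUNK) * sizeof(float),
                    CHUNK * sizeof(float), nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();              /* wait for the current chunk */
        for (int i = 0; i < CHUNK; ++i)
            acc += ls_buf[cur][i];
        cur = nxt;
    }
    return acc;
}
[/code]

The point being: with (a) the second function never has to be written, while with (b) you end up writing some version of it for every data-heavy loop.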
Bummer, it was too good to be true. I just quickly browsed through the new IBM models, and must have partially mixed up PPC A2 with POWER7 (as both execute four threads per core). However, in-order execution with four-way SMT wouldn't be that bad. Four threads per core hide a lot of stalls caused by in-order execution.

I believe that you confused it with POWER7. Anyway, keep the discussion going, it's really interesting.
Cell isn't a terrible product. Just because a better solution may exist doesn't mean another is terrible. Also, I'm sure IBM would build a super Cell if somebody asked and was willing to foot at least the majority of any extra R&D.
Having worked with an in-order 4 threads/core architecture (with explicit cache instructions), it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times. That said, I think it's a pretty good mix of architectures for game development at the moment, although in the long run you might still want to have a couple of really fast out-of-order cores to handle the ugly bits that would otherwise dominate runtimes.

However, in-order execution with four-way SMT wouldn't be that bad. Four threads per core hide a lot of stalls caused by in-order execution.
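To make the "tail end of reductions" point concrete, here's a toy tree reduction in plain C with OpenMP (compile with -fopenmp), rather than anything console-specific; the function name and the power-of-two assumption are just for illustration. Each level halves the number of independent adds, so once the level width drops below the number of hardware threads or SPUs, most of the chip idles while the tail finishes, and that tail can end up dominating the reduction's wall-clock time.

[code]
#include <stddef.h>

/* Tree-reduce n floats (n assumed a power of two) into a[0]. */
float tree_reduce(float *a, size_t n)
{
    for (size_t stride = n / 2; stride >= 1; stride /= 2) {
        /* This level has exactly 'stride' independent adds:
         * n/2, n/4, ..., 2, 1. Once stride drops below the number of
         * workers, cores sit idle while the tail finishes. */
        #pragma omp parallel for
        for (long i = 0; i < (long)stride; ++i)
            a[i] += a[i + stride];
    }
    return a[0];
}
[/code]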
A 32-SPU chip would just be over 5 times the PITA to program that Cell on PS3 was, while delivering less performance than if you just moved more die area to a GPGPU chip.
Utter utter garbage...
It really depends on what kind of mixed workload you have on your core. For example, if two threads are doing something that has just a few arithmetic instructions per cache miss (mostly waiting), and the other two threads are doing heavy arithmetic work (with a low number of cache misses), the execution unit utilization should still be pretty good. Of course, if all four threads are doing low-arithmetic / high-cache-miss execution at the same time, the execution unit utilization would be really bad.

Having worked with an in-order 4 threads/core architecture (with explicit cache instructions), it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times. That said, I think it's a pretty good mix of architectures for game development at the moment, although in the long run you might still want to have a couple of really fast out-of-order cores to handle the ugly bits that would otherwise dominate runtimes.
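One hedged way to picture that mix in practice (not tied to any console OS): pin a cache-missing worker and an arithmetic-heavy worker onto two hardware threads of the same core, so the ALU-bound thread can use the issue slots the other leaves empty while it waits on memory. The sketch below uses Linux's pthread affinity API; the worker bodies are empty stubs, and the idea that logical CPUs 0 and 1 are two hardware threads of the same physical core is an assumption for illustration only.

[code]
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Stubs standing in for real workloads. */
static void *pointer_chasing_worker(void *arg) { return arg; } /* many cache misses */
static void *math_heavy_worker(void *arg)      { return arg; } /* mostly arithmetic */

static void pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void)
{
    pthread_t miss_thread, alu_thread;
    pthread_create(&miss_thread, NULL, pointer_chasing_worker, NULL);
    pthread_create(&alu_thread,  NULL, math_heavy_worker,      NULL);
    pin_to_cpu(miss_thread, 0);   /* assumed: hardware thread 0 of core 0 */
    pin_to_cpu(alu_thread,  1);   /* assumed: hardware thread 1 of the same core */
    pthread_join(miss_thread, NULL);
    pthread_join(alu_thread, NULL);
    return 0;
}
[/code]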
What a strange thing to say... what exactly do you mean when you say throughput? I'd define it as the average IPC across all the running threads, which can obviously be increased by having more HW threads (or decreased).

Now obviously there are good tradeoffs that involve adding more threads/core, but I'm just noting again that it obviously does not increase throughput.
It'd be the same level of PITA, as (good) Cell code has matured into scheduled jobs across available cores.

A 32-SPU chip would just be over 5 times the PITA to program that Cell on PS3 was...
As I say, it shouldn't be.

- 4PPC/32SPU Cell: ~a billion transistors, ~800 Gflops, nightmare programming

The current Cell has 6 SPUs; 32 is 5.33 times as many. Of course saying it's that much harder is arbitrary, but one must admit it would be considerably harder.
You can't use GPGPU FLOPS the same as CPU FLOPS. You can't count usable work in terms of FLOPS anyhow. GPGPU is not fully capable as a CPU, hence the fact that every GPU is still coupled with a CPU and there's a market for bigger, more powerful CPUs still being made.

- RV790: ~just under a billion transistors, ~1.36 Tflops, probably easier than 32 SPUs, maybe easier than the 6 on current Cell
Where am I wrong?
This is one area where Cell's heterogeneous structure seems to make sense. Have two different core types for two different workloads, rather than either one core that's better at streamed throughput but struggles with more fragmented workloads, or one better at fragmented workloads but with fewer operations in flight at once. Of course, then the problem is writing code to use both halves of a heterogeneous design simultaneously. It's not worth the effort, I'm sure, so a more balanced design is probably best.

Having worked with an in-order 4 threads/core architecture (with explicit cache instructions), it's pretty good, but again you get a bit of the GPU problem where the tail end of reductions/tree algorithms tends to dominate workload times.
This goes back to what andrew was saying about scaling beyond 4 cores, i.e. we haven't seen it yet. Can you find SPU-centric work for 32 SPUs when developers over the last 5 years have worked tirelessly to figure out how to find things for 5-6 to do? I'm dubious.

That's untrue. Firstly, for the past 5 years developers have had loads for Cell to do to help out RSX. Secondly, a well-developed engine uses a job scheduler rather than hard-coding each core to a task. E.g.:
Q3: Do you find a global queue of jobs to be doled out upon request to be a good approach to partitioning work to the SPEs, or is it better to assign each SPE its own task in the game?
A3: Naughty Dog uses a Job Manager developed jointly by Naughty Dog’s ICE team and SCEE’s ATG group. This means that we can send any type of job to any SPE, and all of the scheduling of jobs is done through a priority system. This works well, since the overhead is minimal and we achieve good load-balancing between SPEs, something that would be hard to do by allocating a whole SPU to a single task.
Not when you have a job scheduler. The same framework would identify available cores and give them work to do, however many cores were available. The detail of the game becomes breaking things down into jobs, which is the same problem you have now with 6 cores as you would with 36, or 8 cores on an i7, or 32 cores on a Larrabee. It's the future of multicore. It's the only way GPUs can use the squillions of execution units they have in a unified-shader architecture: by a scheduler distributing workload.

You don't think it would prove any harder? Even the slightest bit?
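For readers who haven't used one, the "global queue of jobs" idea boils down to something like the sketch below. This is not the ND/ICE/ATG Job Manager quoted above (which adds priorities, SPU-side scheduling and proper load balancing), just the bare concept in plain C with pthreads; push_job, start_workers and the LIFO list are all made up for illustration.

[code]
#include <pthread.h>
#include <stdlib.h>

typedef struct job {
    void (*run)(void *data);
    void *data;
    struct job *next;
} job_t;

static job_t          *queue_head = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

void push_job(void (*run)(void *), void *data)
{
    job_t *j = malloc(sizeof *j);
    j->run = run;
    j->data = data;
    pthread_mutex_lock(&queue_lock);
    j->next = queue_head;              /* LIFO for brevity; real managers use priorities */
    queue_head = j;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_lock);
        job_t *j = queue_head;
        queue_head = j->next;
        pthread_mutex_unlock(&queue_lock);
        j->run(j->data);               /* any job can run on any worker */
        free(j);
    }
    return NULL;
}

void start_workers(int count)          /* 6 today, 32 tomorrow: same code */
{
    for (int i = 0; i < count; ++i) {
        pthread_t t;                   /* detaching/joining omitted for brevity */
        pthread_create(&t, NULL, worker, NULL);
    }
}
[/code]

The key property is the last function: whether you start 6 workers or 32, the game code that pushes jobs doesn't change.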
But you can. They are completely flexible, although SIMD.

One could say you can't count Cell Flops as CPU Flops.
SPUs are not fully capable as CPUs...

Yes they are, other than needing a PPE to kickstart them. Once up and running, they can do everything a PPC can.
Well, you could say that, but you'd be painfully wrong!

One could probably say current GPUs are probably more capable than SPUs as a CPU.
It wasn't just about Flops, though. It was about programmable, versatile Flops. There was no GPU at the time that could turn its floating-point units to what Cell can do (anything and everything you'd want). To this day GPUs still require a CPU to aid them and set up a lot of the work before the GPU can crunch the graphics. GPGPU is not as versatile as a CPU and cannot do everything a CPU can. Hence heterogeneous architectures; the architectures that have attempted to bridge the gap have so far failed. The likes of Larrabee and Cell haven't become a workable, single core-architecture system, and we have CPU and GPU cores combined either as discrete packages or in one chip. But it remains a heterogeneous architecture, and that's not changing in the next couple of years.

My point wasn't that Cell sucks in general, it was that if you had 500 million transistors or X mm² of die area for a console, it's better spent in most cases on GPU than SPUs, because SPUs are simply a way to attach more Flops per watt/die area/transistor, and GPGPU is better at that metric.
For the rendering part, adaptive transparency goes a hell of a long way towards this!

- Proper hair rendering and dynamics. Guess there's not much to say here: complex as hell, rendering intensive, but it's getting really necessary to replace those alpha-textured polygon strips...
I mean, consider a "core" with one set of execution units (ALU, etc.). If you are running one unstalled thread, you can utilize all of those units at full rate. Four threads would merely have to take turns, so you've gained nothing and cut your register file/caches into four.

What a strange thing to say... what exactly do you mean when you say throughput? I'd define it as the average IPC across all the running threads, which can obviously be increased by having more HW threads (or decreased).
No argument on the concept, but I contend that the work to break game engines down into *hundreds* of jobs, as required to efficiently fill dozens of cores, has probably not been done. It's semi-easy to break games into dozens of jobs to fill ~2-10 cores, but once you get into the hundreds you often have to shift to parallel algorithms, not just ad hoc separation of independent work.

The detail of the game becomes breaking things down into jobs, which is the same problem you have now with 6 cores as you would with 36, or 8 cores on an i7, or 32 cores on a Larrabee. It's the future of multicore.
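As a rough illustration of that shift: instead of one ad hoc job per subsystem ("the animation job", "the physics job", ...), a data-parallel split turns a single update over N entities into N/CHUNK small jobs, so the job count scales with the data rather than with how many subsystems you happened to write. The sketch assumes the hypothetical push_job interface from the queue sketch earlier in the thread; the chunk size, the caller-provided scratch array and the "work" inside update_range are placeholders.

[code]
#include <stddef.h>

void push_job(void (*run)(void *), void *data);   /* from the queue sketch above */

typedef struct { float *positions; size_t first, count; } range_job_t;

static void update_range(void *arg)
{
    range_job_t *r = arg;
    for (size_t i = r->first; i < r->first + r->count; ++i)
        r->positions[i] += 0.016f;                /* stand-in for real per-entity work */
}

#define CHUNK 256

void update_entities(float *positions, size_t n, range_job_t *scratch)
{
    size_t jobs = 0;
    for (size_t first = 0; first < n; first += CHUNK, ++jobs) {
        scratch[jobs] = (range_job_t){ positions, first,
                                       first + CHUNK < n ? CHUNK : n - first };
        push_job(update_range, &scratch[jobs]);   /* one small job per chunk of data */
    }
    /* a real engine would also wait on a counter/fence before using the results */
}
[/code]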