Predict: The Next Generation Console Tech

Cell LS and being a dead end

I think, with the SPU LS they were primarily trying to solve 2 problems
1) Energy efficiency
2) Scalability
There are other side effects compared to a cache-based architecture, like space saving, predictable timing (a less complex pipeline), and higher maintainable IPC because of the lack of memory stalls (unless you have random access data structures ;P), but I think these two were the most important.

I don't think they missed their mark at all for that, but I think they underestimated the overhead involved in writing software for the chip. In my experience, there are several reasons that developing for Cell (on the PS3) is cumbersome compared to competitor platforms (Xbox 360), but I have one that really stands out for me.

No instruction cache (the SPU already has a really small I$, but it reads off the LS).
Code and data have to share the LS. This is a load-balancing issue and a potential debugging headache. With a globally accessible I$ we wouldn't have to worry about whether the code fits, whether a code update will suddenly screw up your finely tuned data layout, having to strip debug symbols to avoid bloating the code ELF, not being able to run debug versions because they're simply too big, accidentally overwriting your code with data, jumping through hoops to get persistent breakpoints (because the code is fresh with every upload), etc. I think this is the number one problem for working with the SPUs and something I would really like to see addressed in Cell v2.
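To make the squeeze concrete, here is a minimal sketch of the budget being juggled; it assumes a typical toolchain where the linker defines an _end symbol at the top of code + static data, with the stack growing down from the top of the 256 KB LS (the names and the stack-probing trick are illustrative, not from any SDK):

```c
#include <stdint.h>
#include <stdio.h>

extern char _end[];              /* linker symbol: end of text + data + bss  */
#define LS_SIZE (256 * 1024)     /* everything below must fit in this budget */

/* Approximate free space between the end of the program image and the
   current stack top, i.e. the slack that code growth and data buffers
   fight over. */
static size_t ls_free_bytes(void)
{
    volatile char probe;                      /* a local's address ~ stack top */
    uintptr_t stack     = (uintptr_t)&probe;
    uintptr_t image_end = (uintptr_t)_end;
    return (stack > image_end) ? (size_t)(stack - image_end) : 0;
}

int main(void)
{
    printf("LS size: %d bytes, approx. free: %zu bytes\n",
           LS_SIZE, ls_free_bytes());
    return 0;
}
```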

Managing data is comparatively much easier, and I think it's unrealistic for them to completely change their memory system design. It's a very intricate part of the chip (or any chip) and one of their fundamental design points. It will be interesting to see how performance will scale on something like Cell iv32 vs Larrabee though, and how their different approaches play out in real applications.

Whether it's a dead end, I don't know. It's one of those platforms that is theoretically very good if you A) spend substantial time optimizing for it, or B) have good tools to support the more complicated hardware. Right now, the tools just aren't there, and I'm not convinced they will be any time soon.

(Sorry if this is coming up multiple times, I seem to be having some trouble posting)
 
c2.0: your posts are not displayed because you have not been granted posting rights by a moderator (first post, right?). Just wait a bit..
 
scificube: Perhaps I don't understand your solution, but I really can't see how it's even remotely possible to make the SPUs' local stores disappear as logical entities if you don't also heavily modify the SPU ISA and break backward compatibility, and once you do that you don't need explicit DMA transfers anymore.

If you are willing to do that, and there are zillions of ways to do it, you end up with something that is not really CELL, which to me speaks volumes about its not-so-forward-looking design.

The LS is still there physically. It can be treated as such for BC either directly (with some HW compat mode) or indirectly with cache line locking as part of a software emulator. Cell2 at the SPU level will have very similar if not identical performance characteristics to the SPUs in Cell1, given that a lot of the design has not changed.

I don't see any reason the SPU ISA needs to be broken. It would be a part of a superset that Cell2 understands in the implementation I am proposing. Cell2 developers can choose to ignore the existence of the physical LS when programming Cell2, or take advantage of it via cache line locking.

I don't see a reason for the LS to be part of the process context any longer. DMAs are deterministic and can be targeted by a runtime layer or by the compiler for translation and/or removal. The latter being better if it can be accomplished in a reasonable amount of time.
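As a sketch of what that translation could amount to, assume a hypothetical flat-memory Cell2 where the addresses that used to name a local store are just ordinary cached memory; a legacy "get" then collapses into a plain copy (cell2_get and flat_addr are invented names, not anything that exists):

```c
#include <stdint.h>
#include <string.h>

/* Invented: on a flat-memory Cell2, a 64-bit effective address is directly
   dereferenceable, so no MFC queue, tag, or completion wait is needed.     */
typedef uint64_t flat_addr;

/* What a runtime layer (or the compiler) could lower a legacy DMA get into:
   source and destination are both known statically, so the transfer is just
   a copy and the cache hierarchy handles the rest.                         */
static inline void cell2_get(void *dst, flat_addr src, size_t size)
{
    memcpy(dst, (const void *)(uintptr_t)src, size);
}
```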

I'm unconcerned with whether or not Cell1 was a dead end. It is what it is. I am concerned with making the design more amenable to developers without losing 1) the massive bandwidth of Cell1, 2) the high number of exec units of Cell1, and 3) backwards compatibility.

I'm offering no argument as to whether Cell needs change. I agree with that sentiment entirely.

I see no reason why the ring bus cannot be reused. I see no reason why multiple single ported memory pools can't be leveraged to increase overall internal bandwidth as is done now in Cell1. Multiple pools will offer low level optimization opportunities as well should anyone have need to go there.

Cell2 would be different for sure from Cell1 but still retain the attributes that make sense.

If you're looking for some argument as to whether Cell1 is, was, or will be forward thinking, you won't find one here. I'm uninterested in that debate.

I wouldn't disagree that there are other ways to accomplish the goals I'm outlining. I would love to hear them.

Supporting a flat memory model on top of local stores, entirely bypassing DMA, is certainly possible and reasonably feasible and I expect CELL2 to have something like that.

That is precisely the aim of the proposals I'm working through. Are we misunderstanding each other?
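For reference, the usual way a flat view gets faked on the current hardware is a software-managed cache sitting in the LS (IBM's Cell SDK ships something along these lines, if memory serves). Below is a deliberately tiny read-only, direct-mapped sketch using the standard MFC intrinsics from spu_mfcio.h; real implementations add write-back, associativity and asynchrony:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_BYTES 128                     /* one DMA granule / cache line     */
#define NUM_LINES  64                      /* 8 KB of LS spent on the cache    */
#define DMA_TAG    31                      /* tag group reserved for the cache */

static uint8_t  lines[NUM_LINES][LINE_BYTES] __attribute__((aligned(128)));
static uint64_t line_ea_of[NUM_LINES];     /* which main-memory line is cached */
static int      line_valid[NUM_LINES];

/* Return an LS pointer for an arbitrary effective address, pulling the
   enclosing 128-byte line in over DMA on a miss. Read-only: no eviction
   write-back is attempted, and conflicting lines simply replace each other. */
static void *sw_cache_read(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_BYTES - 1);
    unsigned slot    = (unsigned)((line_ea / LINE_BYTES) % NUM_LINES);

    if (!line_valid[slot] || line_ea_of[slot] != line_ea) {
        mfc_get(lines[slot], line_ea, LINE_BYTES, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1u << DMA_TAG);
        mfc_read_tag_status_all();         /* block until the line has arrived */
        line_ea_of[slot] = line_ea;
        line_valid[slot] = 1;
    }
    return lines[slot] + (ea - line_ea);
}
```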
 
Cell is very good at what it does; as long as what you are doing fits its batch processing model and the local store, it will be hard to beat ... the efficiency good programmers can get out of it for some basic stuff like FFT and dense matrix multiplication is unrivalled except for supercomputer architectures, and I very much doubt Larrabee will get anywhere near. It's just not terribly convenient for developers.

Well, that makes an important difference in the end. Just look at the current generation of consoles: you have a chip that is theoretically (well, even in practice) much more powerful than the other one, but in the end it's not making a dramatic difference in games, and we all know why.
 
$399 is doable, but I think whoever launches at $299 and makes devs' lives easy will ultimately win, even if they have the technically lesser box. Microsoft has the advantage here since they are expected to launch a year earlier, and the Xbox brand is no longer a joke, so $299 along with their great developer support is a winning launch combination.


If different SKUs are again used for the Next-Gen Xbox and PS4, then I think $399.99 would be a good upper limit for both boxes, with cheaper SKUs coming in at $299.99 ~ $349.99.

The Nintendo Wii was Nintendo's smallest system and least technically advanced for its time, yet it had the highest retail price of any Nintendo console. Nintendo always launched at $199; the Wii was and still is $249.
Nintendo could go as high as $299 with its next console but is more likely to stay within $199~$249.

We certainly won't be seeing $499 and $599 boxes from anyone, next-gen.

The only console that had a high price and actually succeeded, believe it or not, was the NEO-GEO at $399 and $649, though I don't remember the price reduction history of that console over its 12+ year lifespan. SNK probably didn't sell even a million units of NEO-GEO worldwide, but they managed to make a profit somehow. I guess it helped that the arcade unit was so hugely successful. NEO-GEO outlasted two entire generations of game consoles that came after it (three generations in total). An amazing feat for a console that some people would place in the same failure column as the 3DO & Jaguar.

On another note: not counting Nintendo, speaking only of Sony, Microsoft and any other high-end console provider that might emerge, I personally don't want to see a next generation of consoles if they're only going to be 2-3 times more powerful than current-gen consoles. That would mean less power than current top-end PCs that have GT200 / GTX 280 SLI or R700 / 4870X2 CrossFire, or even one of those cards.


Unfortunately, the leading tech is now (and has been for the past 5-6 years) on the PC side of the industry. Not arcades, and not consoles. So what I want and what we will end up with, I do understand, may very well be completely different things.
 
I don't see any reason the SPU ISA needs to be broken. It would be a part of a superset that Cell2 understands in the implementation I am proposing. Cell2 developers can choose to ignore the existence of the physical LS when programming Cell2, or take advantage of it via cache line locking.
That's where the 'dead end' comment comes from ;)

I don't see a reason for the LS to be part of the process context any longer.
I do, if you think about having support for hw multithreading and being able to run legacy code alongside new code (on the same SPU).

If you think about it you are basically sandboxing the old model and replacing it. At this point I don't even know if it makes sense to keep the old model at all.

DMAs are deterministic and can be targeted by a runtime layer or by the compiler for translation and/or removal. The latter being better if it can be accomplished in a reasonable amount of time.
DMAs are not deterministic; badly written code that assumes they are could stop working, but that would just be a software issue, of course.

I'm unconcerned with whether or not Cell1 was a dead end. It is what it is. I am concerned with making the design more amenable to developers without losing 1) the massive bandwidth of Cell1, 2) the high number of exec units of Cell1, and 3) backwards compatibility.
I wouldn't worry about points 1 and 2 (I'm quite sure that LRB will deliver from this standpoint); regarding point 3... is it so important? Games aside, I don't see this insane amount of applications written for CELL; we are not talking about x86 here.
 
I do, if you think about having support for hw multithreading and being able to run legacy code alongside new code (on the same SPU).

If you think about it you are basically sandboxing the old model and replacing it. At this point I don't even know if it makes sense to keep the old model at all.

Sony doesn't have to allow that. Secondly, running legacy code on an SPU will forbid any Cell2 code from running there at the same time unless the LS size increases. If the LS grows to, say, 512K, then Cell2 apps can play with the excess 256K to their heart's content when they are given SPE time. It would logically just be two or more threads on the same SPU where one of them locks 256K off for its own use. Legacy code and new Cell2 code do not communicate or share any state, and I see no valid reason why they should.
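Purely for illustration, the split described above might look like the sketch below on a hypothetical 512K-LS part; spu2_pin_range() is an invented placeholder (nothing like it exists in any SDK) standing in for whatever cache-line-locking mechanism would actually back this:

```c
#include <stdint.h>
#include <stdio.h>

#define LS_TOTAL      (512u * 1024u)   /* hypothetical Cell2 local store size  */
#define LEGACY_BYTES  (256u * 1024u)   /* window a Cell1 binary expects to own */

/* Invented placeholder: pretend the range [ls_offset, ls_offset + bytes) is
   pinned for the legacy context and invisible to the flat-memory machinery. */
static int spu2_pin_range(uint32_t ls_offset, uint32_t bytes)
{
    (void)ls_offset;
    (void)bytes;
    return 0;
}

int main(void)
{
    /* Legacy thread keeps [0, 256K); Cell2 threads get whatever is left. */
    if (spu2_pin_range(0, LEGACY_BYTES) == 0)
        printf("legacy window pinned, %u bytes left for Cell2 code\n",
               LS_TOTAL - LEGACY_BYTES);
    return 0;
}
```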

BC is why this model makes more sense but since we disagree on the necessity of BC we may be forced to agree to disagree on the whole.

DMAs are not deterministic; badly written code that assumes they are could stop working, but that would just be a software issue, of course.

I only meant deterministic in the sense that you know which LS you are writing to and from (or the software knows as much). If you know that, you can lock a cache line and simulate the behavior you want. SPU code doesn't randomly write to local stores on chip... that's deterministic.

I wouldn't worry about points 1 and 2 (I'm quite sure that LRB will deliver from this standpoint); regarding point 3... is it so important? Games aside, I don't see this insane amount of applications written for CELL; we are not talking about x86 here.

Am I arguing against Larrabee with you? I hope not, because I didn't know I was. I've already said my piece on how likely it is Sony will adopt Larrabee, though. My skepticism extends beyond the availability of Cell to other considerations, such as discrete GPU parts, etc. I don't see Larrabee adoption as very likely, so I'm addressing what should happen to Cell in that event.

I would say BC is important. There is not an insane amount of apps written for Cell but there are a good number of apps out there and more coming. BC needs to be there for a number of reasons but we will be wandering into a divergent discussion if we continue along this path.
 
What is the cause of the CELL architecture ending up as a dead end? How can this be possible with the millions, the hours, and the scientific braincells (pardon the pun) of STI? They started from a clean sheet, no? Where did they go wrong? How could they not anticipate the problem with the Local Store? What held STI back from resolving this problem at the design stage?

Likely they came from a perspective of specific programming models and designed the hardware/architecture around that. The semi industry is littered with architectures designed in this way and none have really ever succeeded.

Cell actually has several problems beyond just the issues with the local store: lack of any real control (the SPU is ALWAYS a slave, which limits the things you can do), and lack of any type of SIMD support, which is also kinda shocking when you consider it's targeted expressly at media-type applications (SIMD is important as it allows significant speedups on a wide variety of workloads with minimal to no control overhead).
 
Sony doesn't have to allow that. Secondly, running BC apps on an SPU will forbid it unless the LS size increases. If the LS grows to, say, 512K, then Cell2 apps can play with the excess 256K to their heart's content when they are given SPE time. It would logically just be two or more threads on the same SPU where one of them locks 256K off.
It's funny how a variable-size local store is the only forward-looking aspect of the SPU ISA, while being utterly useless.
Am I arguing against Larrabee with you? I hope not, because I didn't know I was.
You misunderstood my comment; I wasn't implying you were arguing against LRB. What I meant is that the more we think about CELL2, the more we get something that looks like LRB ;)
Basically, what should stop us from dumping CELL are legacy applications and tools. This is to say that CELL is not even remotely a failure, but not an amazing success either.

I would say BC is important.
If you look at what MS and Sony did about BC, I'd say it's not that important. The lack of BC can piss off ppl, but it basically doesn't stop anyone from buying a new system, especially if the lack of BC allows you to sell it at a considerably lower price.
 
Cell is very good at what it does; as long as what you are doing fits its batch processing model and the local store, it will be hard to beat ... the efficiency good programmers can get out of it for some basic stuff like FFT and dense matrix multiplication is unrivalled except for supercomputer architectures, and I very much doubt Larrabee will get anywhere near. It's just not terribly convenient for developers.

As is the case with Nvidia and ATI, efficiency has many different facets, and while % of peak performance is important, even more important are perf/area and perf/watt. It doesn't matter how much you deliver in % of peak if you lose the other two, which Cell WILL against the currently competing designs.
 
It's funny how a variable-size local store is the only forward-looking aspect of the SPU ISA, while being utterly useless.

Who posits that? I don't ;)

I have no arguments about the SPU ISA itself other than...it needs some extending...

You misunderstood my comment; I wasn't implying you were arguing against LRB. What I meant is that the more we think about CELL2, the more we get something that looks like LRB ;)
Basically, what should stop us from dumping CELL are legacy applications and tools. This is to say that CELL is not even remotely a failure, but not an amazing success either.

I definitely agree Cell has been something of a mixed bag to date. I wouldn't disagree that something like Larrabee is where everyone is going to end up eventually. I am just not convinced we'll be at that point within the next 5 years or so.

If you look at what MS and Sony did about BC, I'd say it's not that important. The lack of BC can piss off ppl, but it basically doesn't stop anyone from buying a new system, especially if the lack of BC allows you to sell it at a considerably lower price.

I'm fairly certain Sony only relented on HW BC because there wasn't enough blood at the banks to recoup all the bleeding the PS3 caused them before they pulled it. My view is that software emulation is a much better target in the next round so this misstep won't be repeated. BC is going to be easier to pull off for all parties given the GPUs are basically standard fare and the CPUs are likely to be known quantities.

Being able to have BC at a lower price seems pretty much a given to me in the next round.
 
What do you mean by that?

It's basically just a dual-issue pipeline. They should have designed it around mid-to-wide SIMD, which does pretty much as well as the SPU on anything it doesn't suck at, but with much higher overall performance, higher perf/area and perf/watt.
 
It's basically just a dual-issue pipeline. They should have designed it around mid-to-wide SIMD, which does pretty much as well as the SPU on anything it doesn't suck at, but with much higher overall performance, higher perf/area and perf/watt.

How wide do you mean by "mid to wide"? The SPE is already 4-wide SIMD (8- and 16-wide for 16-bit and 8-bit integer data elements). In fact, there are no scalar instructions, and you can't load unaligned data from the LS. The dual issue is mainly for pairing with load/store.
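For anyone who hasn't touched it, this is what that 4-wide-ness looks like with the SPU C intrinsics from spu_intrinsics.h (the saxpy-style loop and names are mine); note that every pointer has to be 16-byte aligned because, as said, there are no unaligned loads:

```c
#include <spu_intrinsics.h>

/* y[i] = a * x[i] + y[i], four floats per operation. n_quads counts 128-bit
   quadwords, not scalars, and all data must be 16-byte aligned.             */
void saxpy_spu(vec_float4 *y, const vec_float4 *x, float a, int n_quads)
{
    vec_float4 va = spu_splats(a);        /* broadcast the scalar to all 4 lanes */
    for (int i = 0; i < n_quads; i++)
        y[i] = spu_madd(va, x[i], y[i]);  /* fused multiply-add, 4-wide          */
}
```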
 
I think, with the SPU LS they were primarily trying to solve 2 problems
1) Energy efficiency
2) Scalability
There are other side effects compared to a cache-based architecture, like space saving, predictable timing (a less complex pipeline), and higher maintainable IPC because of the lack of memory stalls (unless you have random access data structures ;P), but I think these two were the most important.

Managing data is comparatively much easier, and I think it's unrealistic for them to completely change their memory system design. It's a very intricate part of the chip (or any chip) and one of their fundamental design points. It will be interesting to see how performance will scale on something like Cell iv32 vs Larrabee though, and how their different approaches play out in real applications.

I think so too. People manage their data themselves in CUDA and they do just fine. OTOH, I don't think that Cell is a dead end. With the feedback STI has received over ~3-4 years, they will (hopefully ;)) simplify the learning curve. Remember that Cell was aimed at bypassing the power and memory walls. And for a first step, it is a reasonably good design.

No instruction cache (the SPU already has a really small I$, but it reads off the LS). Code and data have to share the LS. This is a load-balancing issue and a potential debugging headache. With a globally accessible I$ we wouldn't have to worry about whether the code fits, whether a code update will suddenly screw up your finely tuned data layout, having to strip debug symbols to avoid bloating the code ELF, not being able to run debug versions because they're simply too big, accidentally overwriting your code with data, jumping through hoops to get persistent breakpoints (because the code is fresh with every upload), etc. I think this is the number one problem for working with the SPUs and something I would really like to see addressed in Cell v2.

I am with c2.0 on this one. The "code + data + stack <= 256K" constraint must go in Cell2. One way out could be guaranteeing the availability of some amount of LS (say 512K in Cell2) and fetching instructions (those which can't fit in the rest of the LS, that is) from RAM.

Personally, I find Cell to be a very elegant design for the stuff it's meant to do. SPEs pull vertices from disc, generate new geometry if needed, and DMA it back to RSX. However, all transistors and no code makes any chip a costly piece of metal :oops:. As far as tools are concerned, I have no idea about the official PS3 devkits, but judging from the CELL SDK (released by IBM), they definitely need some polishing and baking. :cry:

Sticking to a restricted subset of the SPE's capability can also simplify (at least some) issues and make it easier to reason about program logic. A small example: SPEs can launch DMAs to and from RAM and other SPEs. Instead, if you centralize DMA start and wait-till-finish calls in the PPE, some amount of pain can be alleviated.
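For context, this is roughly the SPE-side boilerplate such a convention would corral into one place; mfc_get and the tag calls are the standard intrinsics from the Cell SDK's spu_mfcio.h, while the wrapper itself is just a sketch:

```c
#include <spu_mfcio.h>
#include <stdint.h>

/* Pull 'size' bytes from effective address 'ea' into local store at 'ls' and
   block until that tag group completes. ls/ea/size must obey the usual MFC
   alignment and size rules (e.g. 16-byte alignment, at most 16 KB per get). */
static void dma_get_blocking(void *ls, uint64_t ea, uint32_t size, uint32_t tag)
{
    mfc_get(ls, ea, size, tag, 0, 0);   /* queue the transfer under 'tag'      */
    mfc_write_tag_mask(1u << tag);      /* select which tag group to wait on   */
    mfc_read_tag_status_all();          /* stall until the transfer completes  */
}
```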
 
How much of Larrabee's appeal is actual architecture instead of software development? Seems to me a lot of its attraction is in offering x86 BC and an easier path for developers, but you're still going to have development issues to get maximum throughput, right? At the end of the day, whether a console has a GPGPU array of processors, a Cell array of processors or a Larrabee array of processors, if the developers had access to some incredible tools that managed to automate parallel processing extremely easily and effectively, would developers actually care what the chip itself was doing?

As someone who isn't following the hardware closely enough to be informed on the matter, it looks to me like the console choice is one of (Cost + Thermals + Development Ease), with performance in a way being a wash between choices because the balance lies across the whole equation, and not just with the peak possibilities. The most maths power isn't necessarily going to be the fastest option in practice, and the slower chip can outperform if more of its power is tapped. If this is the case, each option can be considered with regard to those parameters, although all of it will just be speculation at this point. AFAICS the only obvious downside to Cell2 is development, because the other options have bigger and better software development support. GPGPU will see PC-space developments, and Larrabee will see incredible backing by Intel, whereas Cell has been somewhat milling along without any major advances yet.
 
How much of Larrabee's appeal is actual architecture instead of software development?

It's the first fully programmable (to my knowledge) architecture to be used for graphics.

The other reason it is taken seriously is because it is Intel pushing it: loads of money, leading-edge process technology and more fabs than you can shake a stick at.

Cheers
 
In Sony's case though, aren't their tools going to be considered less productive to developers regardless of the silicon they end up using?

MS offered up their tools to Sony before launching the Xbox. Any signs that Sony regrets not taking up their offer?
 
How wide do you mean by "mid to wide"? The SPE is already 4-wide SIMD (8- and 16-wide for 16-bit and 8-bit integer data elements). In fact, there are no scalar instructions, and you can't load unaligned data from the LS. The dual issue is mainly for pairing with load/store.

8-16 SP elements. Less than that and you end up with high control overhead per flop.
 
In Sony's case though, aren't their tools going to be considered less productive to developers regardless of the silicon they end up using?

MS offered up their tools to Sony before launching the Xbox. Any signs that Sony regrets not taking up their offer?

Developers and Sony survived the first two or three PlayStations, which back then had SDKs and tools that could be used to torture programmers. The only reason you see this in the spotlight so much this time around is because Microsoft is leaps and bounds beyond that "antique" stage of software and documentation. Also look at what Itagaki has recently said: it was a lot harder to make games for the old Nintendo systems than it is for the PS3, for example.

Things were always this FUBAR; it's just that Microsoft made a difference. In any case, I doubt you will see Sony come out and admit a bad decision, at least officially in a business sort of way.
 