Predict: The Next Generation Console Tech

I don't understand why people still want to believe in it (except for the suggestion of six SPEs for backwards compatibility), as a "Cell 2" was canceled quite a while ago. IBM has capitulated. Do you want to fight the war alone?

That's not quite it ... IBM may well have given up because of poor acceptance in the market segments it operates in, but in the case of the Cell processor we're talking about the possibility of customising a design for a specific, games-oriented market that usually doesn't have "off the shelf" solutions. And we have to remember there are developers (maybe few) who like to work with SPUs; we've seen reports of that on this forum. But, like Shifty said, business is business, and Sony may very likely follow a more PS Vita-like path, one that goes in line with what third-party developers want... which is very good for Sony and gamers.
 
I have formerly floated the idea of cahooting with AMD and SGX, to create a Frankenstein Cell that has an ARM PPE, SPE cores, and SGX cores, as a scalable, flexible architecture. The standard configuration might be 1:4:4 ARM:SPU:SGX. I suppose it's basically a Vita on a chip with 4 added SPEs (only we'd go next gen, A15 and Rogue)! This'd provide Cell compatibility on the tricky bits to emulate, while the SPEs could handle recompiling the PPE code for emulation on ARM. The architecture would be completely scalable, so stick one in a handheld, two in a tablet, and 4 or more in a console. It probably couldn't hold a candle to a specific console design, but the versatility of the software, and the fact that every game and app you buy could run on all your devices, would give commercial advantages. SPEs are 21 million transistors in their current guise though, which makes them pretty big compared to the other components. SGX543 is only 8mm^2 at 65nm, whereas SPURSEngine with 4 SPEs is 103 mm^2 at 65nm.

I liked your idea from a few pages ago; in my opinion it's excellent, especially with respect to recompiling the PPU code via the SPUs, "translated" to ARM for BC (maybe more than 6 SPUs would be needed for this). And unless there is some specific part of the RSX GPU architecture that NVIDIA won't release (see the NV2A problems with Xenos) standing in the way of software emulation, I don't think it would be a problem for PowerVR Series 6, with 144 flops per cycle per core... and everything (CPU and GPU) in much less than 500mm^2 @ 28nm... I'm praying Sony is reading your post... ;)
 
You're kinda preaching to the converted, as I've always been impressed with Cell and think it a great design with loads of forwards potential. However, business is business. You don't always go with the most exciting option, or the most affordable, or the most expensive. There are loads of factors to weigh up. The huge downside to Cell is its isolated development paradigm and the ongoing complexities it presents to devs. Developer complexity is going to be an Achilles' heel going forwards, and hardware is becoming powerful enough to allow for good abstraction without crippling the hardware. Sony have already identified the need to be developer friendly in their latest outing, Vita. They aren't going to backtrack for PS4 and create some monster machine that developers won't like to work with. So either they have to provide amazing tools that make using Cell as easy as using x86 or ARM or whatever other options are out there, or they need to put in a set of components that developers will be happy working with. And if those other components also come with cost advantages, or performance advantages per dollar (due to Cell's lack of progress), or manufacturing advantages, etc., then all the more reason to switch.

That one thing appears to me to be the biggest downside of Cell development. It's not that Cell is a bad solution to the problems it was designed to address; it's that it's a unique solution. The rest of the market, while also pursuing heterogeneous computing architectures, seems to have settled on a different solution. Even more so with AMD's GCN architecture looking very Fermi-like.
 
I think you have to define which aspects of Cell you're deciding are good or bad.
Lots of companies have played with the lots-of-simple-high-numeric-performance-cores paradigm, going back to Inmos; Intel did its bit with Larrabee, and there was the Nuon architecture, which was very similar in a lot of ways (minus the inclusion of a PPE).
There are certain types of operation that run very well in that paradigm; it's difficult, however, to take general code and run it efficiently on those types of architecture, and for most of the code types that do run well on them, modern GPUs do as well or better.
Had Cell gained widespread adoption outside the PS3, it would have gotten continued development, but the fact is it was neither particularly cheap, nor did it offer more than competing parts. When it was released it seemed like a good fit for TVs/Blu-ray players among others, but in the end it couldn't compete on cost/performance (not talking flops here) with alternate solutions.
Esoteric architectures will always have a hard time competing with the mainstream unless they provide a substantial advantage to drive adoption, and as far as the market was concerned, Cell just didn't.

This is getting off topic a bit, but if we're looking at the future of massively parallel computing, it's interesting to look at what Cray did with the XMT: it's an architecture not about maximizing computation density, but about making data access as transparent as possible between the cores. Every piece of data in the system looks like it's local to all of the processors, and the architecture goes to great lengths to hide the latency.
 
Cell still seems interesting to me because of the scalability of its design, because of the backwards compatibility it would provide in a PS4 (including the PS3 games library, the PS3 OS, and the PS3 dev libraries that could see continued use in a PS4), all of which might provide a cost savings to Sony in PS4 development.

I've heard more complaints about the PS3's split memory model and the limitations of RSX than I have about the inadequacies or difficulties of Cell as a CPU. I'm sure developers would love higher single-threaded performance, but the GHz ceiling is still pretty static for all CPU manufacturers as far as I can tell.

Would developers be upset next generation if they were given 4 (possibly OoO) PPEs to do standard shared memory multiprocessing with, even if there are 8 SPEs that remain hanging on the EIB? Given the Cell's design, I would imagine that IBM could just scale up the current design using all the current functional blocks from Cell, which I would think would not be that expensive.

I can see an argument that the SPEs are perhaps extraneous to requirements if everyone is going to use the GPU for large-scale computation, but there still seem to be things that the SPEs can do that could be of advantage?

(spoken as someone who has not done game programming since the days of the Amiga 1000, of course)
 
I'm going to expand a bit on my Cray point above: everything to do with performance today is about memory latency.

There is a certain class of problem where memory accesses are predictable enough that latency can be completely hidden (using caches, or overlapped DMA to a local store, or whatever); in these cases GPU-like architectures, or SPE-like architectures, make a lot of sense. Computational density suddenly becomes important.
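
To make the overlapped-DMA case concrete, here's a minimal double-buffering sketch in the style of the Cell SDK's MFC intrinsics (spu_mfcio.h). CHUNK and process() are illustrative placeholders rather than anything from a real codebase, and it assumes the total size is a non-zero multiple of CHUNK:

```c
/* Double-buffered streaming loop for an SPE-style core: while the SPU
 * computes on one local-store buffer, the MFC streams the next chunk
 * into the other, so the DMA latency hides behind the computation. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384  /* bytes per DMA transfer (placeholder size) */

static uint8_t ls_buf[2][CHUNK] __attribute__((aligned(128)));

void process(uint8_t *data, unsigned size);  /* placeholder compute kernel */

void stream(uint64_t ea, uint64_t total)
{
    unsigned cur = 0, next = 1;

    /* Prime the pipeline: start fetching the first chunk on tag 0. */
    mfc_get(ls_buf[cur], ea, CHUNK, cur, 0, 0);

    for (uint64_t off = CHUNK; off < total; off += CHUNK) {
        /* Kick off the next transfer before touching the current buffer. */
        mfc_get(ls_buf[next], ea + off, CHUNK, next, 0, 0);

        /* Block only on the current buffer's tag, then compute on it. */
        mfc_write_tag_mask(1u << cur);
        mfc_read_tag_status_all();
        process(ls_buf[cur], CHUNK);

        cur ^= 1;
        next ^= 1;
    }

    /* Drain and process the last buffer still in flight. */
    mfc_write_tag_mask(1u << cur);
    mfc_read_tag_status_all();
    process(ls_buf[cur], CHUNK);
}
```

As long as process() on one buffer takes longer than the transfer of the other, the memory latency effectively disappears; that's the whole appeal of this class of problem.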

There are other classes of problem (the vast majority of them) where it is difficult or impossible to predict memory access patterns; this is where all of the R&D in x86 has gone in the past 15 years: better caches, better ways to hide latency with OOO execution.

Whether you think Cell is a successful architecture largely revolves around how much of your application you believe can fit into the first paradigm and at what cost. I would argue it's increasingly less, and the parts that do make sense are as often as not as well or better suited to the GPGPU paradigm.
I think Cell is an interesting architecture, but it's overly mono-focussed.

My point in looking at Cray's XMT above is that a company which until recently made its name on the number of sustained GFLOPS it could provide is now moving towards an architecture clearly designed primarily to hide the difficulties of memory access in large-scale parallel systems, rather than to increase compute density. They're not doing that just to be different; they're trying to figure out how to run their clients' software faster.

Now, games aren't large-scale numeric simulations. And Cray's solution relies on having thousands of running threads to hide the latency, which is a problem in itself.

But there are many ways to increase performance.
 
In what way is Cell's scalability superior to other multiprocessor designs? I'm not picking, I'm genuinely curious.

I had in mind the fact that they've got the EIB ring bus and that the PPE was already designed for multiprocessing. Given that, it would seem fairly simple to just build a new variant with a greater number of those pre-existing elements connected to that ring. That would seem relatively cheap to do.

I know Intel has also gone to a ring bus with Sandy Bridge, in a quest for that kind of multi-core scalability.
 
Performance 480 GFLOPS?

You could be right; maybe it's peak performance, in which case AMD would be at about 75% efficiency* / sustainable processing, and that would be about 50% more performance than the PS360. Sorry if my comment sounds like apples and oranges, but it's based on impressions from developer rumours on the Internet (yes, there's not much that's valid, but...), and I don't know if they compared the peak performance of the current consoles (with 5/6 years of customised code) against the sustained performance of the Wii U GPU.

400 GFLOPS but R7xx shaders will attain higher utilisation than the Xenos shaders. Add the more advanced feature set and you won't be far off twice the effective power.
 
I'm going to expand a bit on my Cray point above: everything to do with performance today is about memory latency.

There is a certain class of problem where memory accesses are predictable enough that latency can be completely hidden (using caches, or overlapped DMA to a local store, or whatever); in these cases GPU-like architectures, or SPE-like architectures, make a lot of sense. Computational density suddenly becomes important.
The GPU approach does not depend on predictable latency. The calculations for occupancy and latency hiding tend to look at hiding worst-case texture latency, which is a read to VRAM hundreds of cycles away.
Existing GPUs need many threads to reach baseline performance, and can scale to the same number of threads per chip as the XMT.
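
As a rough illustration of that occupancy arithmetic (every number below is a made-up round figure for illustration, not vendor data):

```c
/* Back-of-envelope latency-hiding estimate for one GPU SIMD unit. */
#include <stdio.h>

int main(void)
{
    const int miss_latency   = 400; /* assumed worst-case cycles for a VRAM/texture read */
    const int alu_per_load   = 10;  /* assumed ALU instructions between memory reads */
    const int cycles_per_alu = 4;   /* assumed issue cycles per ALU instruction, per wavefront */

    /* Each resident wavefront contributes alu_per_load * cycles_per_alu cycles
     * of independent work to cover someone else's stall, so to keep the ALUs fed: */
    int work_per_wavefront = alu_per_load * cycles_per_alu;
    int wavefronts_needed  = (miss_latency + work_per_wavefront - 1) / work_per_wavefront;

    printf("~%d wavefronts in flight per SIMD to cover a %d-cycle miss\n",
           wavefronts_needed, miss_latency);
    return 0;
}
```

The exact figures vary by architecture, but the shape of the estimate is the same: the less arithmetic per memory access, the more threads you need resident to stay busy.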

My point in looking at Cray's XMT above is that a company which until recently made its name on the number of sustained GFLOPS it could provide is now moving towards an architecture clearly designed primarily to hide the difficulties of memory access in large-scale parallel systems, rather than to increase compute density. They're not doing that just to be different; they're trying to figure out how to run their clients' software faster.
Which Cray are you talking about? The Cray that ran in the high peak FLOP HPC realm, or Tera, which bought the name?

Now, games aren't large-scale numeric simulations. And Cray's solution relies on having thousands of running threads to hide the latency, which is a problem in itself.
I think XMT is targeted at a very specific kind of workload, and has built in some extremely pessimistic assumptions that would make it inappropriate for client needs.
It assumes essentially no locality, and it depends on a very high performance interconnect we won't see outside of a supercomputer installation.
It assumes that going off-chip for accesses is the rule, and that a lot of them would go to other nodes. From a power perspective, it is very far from what would be a good idea for all but the workloads it targets.
 
I'm going to expand a bit on my Cray point above: everything to do with performance today is about memory latency.

There is a certain class of problem where memory accesses are predictable enough that latency can be completely hidden (using caches, or overlapped DMA to a local store, or whatever); in these cases GPU-like architectures, or SPE-like architectures, make a lot of sense. Computational density suddenly becomes important.

There are other classes of problem (the vast majority of them) where it is difficult or impossible to predict memory access patterns; this is where all of the R&D in x86 has gone in the past 15 years: better caches, better ways to hide latency with OOO execution.

Whether you think Cell is a successful architecture largely revolves around how much of your application you believe can fit into the first paradigm and at what cost. I would argue it's increasingly less, and the parts that do make sense are as often as not as well or better suited to the GPGPU paradigm.
I think Cell is an interesting architecture, but it's overly mono-focussed.

My point in looking at Cray's XMT above is that a company which until recently made its name on the number of sustained GFLOPS it could provide is now moving towards an architecture clearly designed primarily to hide the difficulties of memory access in large-scale parallel systems, rather than to increase compute density. They're not doing that just to be different; they're trying to figure out how to run their clients' software faster.

Now, games aren't large-scale numeric simulations. And Cray's solution relies on having thousands of running threads to hide the latency, which is a problem in itself.

But there are many ways to increase performance.
I'd like to add a bit to this.
GFLOPS have always been more about marketing than anything else, at least since the days just after the CDC Cybers that Seymour Cray designed, when I got into computing. Marketing, however, is important.

There are at least three groups involved in HPC. One is into national/institutional status. These apply for, and receive, funding for supercomputing projects. Then there's the folks that are interested in computational science and computer architecture as a discipline in its own right. And then there is the group who use high performance computers to work on problems that need to be worked on wherever they may be found - meteorology, chemistry, et cetera. I belong to the last group, and this is the group that is typically referred to when the supercomputer projects are to be justified, whether to the public, or to politicians.

I came back to B3D because I was curious about what eventually turned out to be the BBE (I don't like to call it Cell; it goes against the original paper). At the time we were wrestling with clusters, and I was very curious what the result would be if IBM were to produce a new CPU architecture aimed at media processing rather than being an extension of an ancient design originally aimed at clerical byte manipulation. By and large I liked it, but could never use it, due to a number of factors that made it impractical. If IBM had evangelized it more and given it a believable roadmap that they demonstrated they would follow, maybe we would have looked at it more seriously. I don't know enough about game code to say if it does a good job at what it was eventually tasked with.

(Incidentally, take a look at this building block for building big iron IBM is showing at Hot Chips.)

Latency and bandwidth both, or put another way, interconnects and data paths, have been at the core of most high performance computing for a long time, but they don't make for good/easy marketing. GFLOPS does. Still. Note its presence in the IBM slide above. :) My eyes however are drawn to the memory interface, and, more tantalizing, the chip-to-chip networking, since those concern areas that have proven critical to most code we've wrestled with: bandwidth and communication.

This long ramble wants to come to a few conclusions. To judge the merits of an architecture, you have to look at its design goals and how well it fulfills them. To say if the BBE is "better" than its Prescott contemporary, or the Xenos, you'd have to decide what yardstick to measure with. Second, GFLOPS are cheap and easy, and have been for a very long time. The challenge lies in making the ALU power (easily) applicable to as wide a range of algorithms as possible. There will, however, be specific problems that are suited to just about any given configuration. These will be used in marketing. Third, HPC is a complex world, and filling a cabinet with GPUs really only addresses very particular aspects/niches. Fourth, if you want to evaluate an architecture, look at the data paths. What limitations do they impose, and what does that imply for the tasks you are interested in? GFLOPS isn't a very useful metric, generally. Maybe for games it is. I wouldn't know. There are people here with far more familiarity with that application area. But I suspect that even there, that particular figure of merit is just too simplistic.
 
There is a certain class of problem where memory accesses are predictable enough that latency can be completely hidden (using caches, or overlapped DMA to a local store, or whatever); in these cases GPU-like architectures, or SPE-like architectures, make a lot of sense. Computational density suddenly becomes important.

There are other classes of problem (the vast majority of them) where it is difficult or impossible to predict memory access patterns; this is where all of the R&D in x86 has gone in the past 15 years: better caches, better ways to hide latency with OOO execution.
This was the ongoing debate for Cell and GPGPU, with those on the other side pointing to these problems and asking how they could be reworked. And to their credit, a number of examples deemed a poor fit for computational powerhouses ended up being reworked quite successfully.

For a console part where the processing requirements are lots of small tasks instead of one huge task with massive shared data, higher compute density (supported with an adequate data supply) has to be the right choice. Until there is a monolithic engine that uses the same dataset for all tasks at all times, instead of being broken down into physics, AI, setup and draw, etc., I don't see a data-centric design philosophy being economical in the console and PC space. I think Cell got the balance right, much to the annoyance of the coders who have to rethink their methods! But that developer cost has resulted in excellent performance density (attained, not just paper theories), sadly wasted by having to prop up the GPU. Of course going forwards, developer cost is going to be a big issue. Perhaps the ideal architecture then is one designed to run metalanguages and high-level structures. I don't know if there's any processor in development designed to run something like Java especially quickly.
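
As a really minimal sketch of what I mean by lots of small tasks, in plain C with pthreads; the job names just mirror the categories above, the bodies are stubs, and dependencies between the categories are ignored for brevity:

```c
/* A frame treated as a bag of small independent jobs pulled by worker threads. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct { const char *name; void (*run)(void); } Job;

static void physics(void) { /* stub: integrate a handful of bodies */ }
static void ai(void)      { /* stub: tick a few agents            */ }
static void setup(void)   { /* stub: build command lists          */ }
static void draw(void)    { /* stub: submit draw calls            */ }

static Job jobs[] = {
    {"physics", physics}, {"ai", ai}, {"setup", setup}, {"draw", draw},
};
enum { NJOBS = sizeof jobs / sizeof jobs[0], NWORKERS = 4 };

static atomic_int next_job;  /* shared index of the next unclaimed job */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_job, 1);
        if (i >= NJOBS) break;              /* nothing left this frame */
        jobs[i].run();
        printf("ran %s\n", jobs[i].name);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++) pthread_join(t[i], NULL);
    return 0;
}
```

The point is just that the frame decomposes into many small, mostly independent pieces of work rather than one giant shared-data computation.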
 
In what way is Cell's scalability superior to other multiprocessor designs? I'm not picking, I'm genuinely curious.
Sorry for the uneducated answer versus the other pretty highly educated answers, but I believe that Cell is deemed by some to be scalable, or more scalable, because there is no coherency traffic involved, because of the low-power nature of the SPUs, and lastly because of their size.
Does that make it more scalable? I'm not sure, but I would say it's cheaper to scale. I might be wrong though. :cry:
 
This was the ongoing debate for Cell and GPGPU, with those on the other side pointing to these problems and asking how they could be reworked. And to their credit, a number of examples deemed a poor fit for computational powerhouses ended up being reworked quite successfully.

For a console part where the processing requirements are lots of small tasks instead of one huge task with massive shared data, higher compute density (supported with an adequate data supply) has to be the right choice. Until there is a monolithic engine that uses the same dataset for all tasks at all times, instead of being broken down into physics, AI, setup and draw, etc., I don't see a data-centric design philosophy being economical in the console and PC space. I think Cell got the balance right, much to the annoyance of the coders who have to rethink their methods! But that developer cost has resulted in excellent performance density (attained, not just paper theories), sadly wasted by having to prop up the GPU. Of course going forwards, developer cost is going to be a big issue. Perhaps the ideal architecture then is one designed to run metalanguages and high-level structures. I don't know if there's any processor in development designed to run something like Java especially quickly.

Not sure I entirely agree with this. This is going to go way off topic so bear with me.
My experiences are that games and game code are becoming progressively more complex. To my mind the game code I've seen is shifting from what used to be akin to soft real time systems design to something closer to a database. Most PC devs have been closer to the database side for a long time. There is no right or wrong here (although I know a lot of devs who'll argue one side or the other as if it's a religious debate) and excellent products come out of both camps.

The teams that have done well with Cell are usually the ones attached to what I would term "old school practices"; if you're treating a PS3 the same way you did a SNES, then it's not a huge leap to use it efficiently, because you're already acutely aware of memory usage and access patterns. I have certainly historically fallen into this category; however, having worked on large existing codebases over the last 5 or 6 years, I'm somewhat more pragmatic. In a large codebase with a deadline and a complex design you have to pick your battles.

FWIW I think the reason you see Cell "propping up" the graphics chip has more to do with it being easier to use it that way, with more visible results, than with a fundamental weakness of RSX. Graphics are a fundamentally good fit for the design, and the number of lines of code in the average graphics part of a game is so small that these types of optimizations are "easy".

I'm an engineer and it pains me to say this, but games should be about creativity. As a point of reference, I've written probably a million lines of assembler and several million lines of C/C++; most people who've worked with me would describe my attitude as "hard core", and I would estimate that I can write C# code ~4x faster than I can write C code. What does that mean for the amount of iteration I could do on game code?
Now, I'm not suggesting we all start developing games in GC languages tomorrow; it'd be a pretty hard sell for me if I were running a team, but I wonder how much of that is my prejudice against it?

I guess philosophically I think the hardware should be about enabling developers to create better games. If Cell had demonstrably opened new opportunities for gameplay, then it was worth all of the extra engineering effort to get there.

In the end if I'm looking at the future of processors I think we'll see GPGPU take over the high computational density niche and general processors trying to minimize the cost of memory accesses and contention as the number of cores increases.

Although I know it's no fun, my viewpoint on console hardware has always been the same: it doesn't matter, you play the hand you're dealt, and if you're smart you play to the platform's strengths. But with widespread cross-platform development, it's going to be difficult for esoteric architectures to be exploited in interesting ways.
 
I had in mind the fact that they've got the EIB ring bus and that the PPE was already designed for multiprocessing. Given that, it would seem fairly simple to just build a new variant with a greater number of those pre-existing elements connected to that ring. That would seem relatively cheap to do.

I know Intel has also gone to a ring bus with Sandy Bridge, in a quest for that kind of multi-core scalability.

Sorry for the uneducated answer versus the other pretty highly educated answers, but I believe that Cell is deemed by some to be scalable, or more scalable, because there is no coherency traffic involved, because of the low-power nature of the SPUs, and lastly because of their size.
Does that make it more scalable? I'm not sure, but I would say it's cheaper to scale. I might be wrong though. :cry:

Thanks. If I understand correctly, it is the ring bus and the performance-per-watt of the SPUs that provide this advantage?

WRT the ring bus, I had thought that adding more processing elements to the bus would increase the latency of the bus. Am I wrong about this or is this mitigated by the same process shrinks that allow for more processing elements?
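
Purely as a back-of-envelope of what I mean (the one-cycle-per-hop figure is just an assumption for illustration; a real bus like the EIB adds arbitration, multiple rings and width effects on top):

```c
/* Rough feel for how worst-case distance grows as elements are added
 * to a bidirectional ring (data takes the shorter way around). */
#include <stdio.h>

int main(void)
{
    const int cycles_per_hop = 1;  /* assumed, for illustration only */
    for (int elements = 8; elements <= 32; elements *= 2) {
        int worst_hops  = elements / 2;    /* farthest stop on the ring */
        double avg_hops = elements / 4.0;  /* ~average distance to a random stop */
        printf("%2d elements: worst ~%2d hops (%d cycles), average ~%.1f hops\n",
               elements, worst_hops, worst_hops * cycles_per_hop, avg_hops);
    }
    return 0;
}
```

If that's roughly right, the hop count itself grows with the number of stops regardless of process; a shrink mostly buys clock speed and width per hop, which is why I'm wondering how well it scales.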
 
I guess philosophically I think the hardware should be about enabling developers to create better games. If Cell had demonstrably opened new opportunities for gameplay, then it was worth all of the extra engineering effort to get there.
I agree entirely, hence my latter remark:
Of course going forwards, developer cost is going to be a big issue. Perhaps the ideal architecture then is one designed to run metalanguages and high-level structures. I don't know if there's any processor in development designed to run something like Java especially quickly.
Tool chains are built around processors. Maybe it's time to build processors around tool chains? Start with the high-level languages that describe the creation (such as a sophisticated contextual database for supporting things like procedural content creation), and create accelerators to turn those creations into reality. Current chip designs have at best taken an interim starting point. In the case of Cell, data throughput was seen as key and so that underpinned the design philosophy. As you say, if games become very different in structure, more like database work, then we'd need a very different approach in the processors. Of course, no-one really knows what the future tools will be like. In that case, just having as much processing power and memory access as possible, and leaving it to the devs to work out how the heck they're going to use it, seems a sensible approach to hardware design! :mrgreen:
 
This has been attempted at various times, and it can very easily end badly.
It's not a different approach if we recall what RISC vs CISC was about, or things like the attempts to market an architecture that could run Java bytecode. Some attempts more modestly targeted accelerating parts of it.
Attempts at closing the semantic gap between machine language and higher-level code have been made for decades.

The problem with baking language primitives into the ISA is that the more generally capable the instruction, the less likely that it is a good fit for any specific instance of its use.
Silicon is not very flexible, and the hardware and instruction set have to include a significant number of provisions for any variation or behavior that may arise.

If the basis of the hardware were more configurable, it might allow an FPGA-type solution to elide unnecessary parts of the implementation.
Straightline performance and power for many typical loads do not favor FPGAs as they exist right now, since they use scads of silicon and internal interconnect to be configurable.
 
I'm an engineer and it pains me to say this, but games should be about creativity. As a point of reference, I've written probably a million lines of assembler and several million lines of C/C++; most people who've worked with me would describe my attitude as "hard core", and I would estimate that I can write C# code ~4x faster than I can write C code. What does that mean for the amount of iteration I could do on game code?
Now, I'm not suggesting we all start developing games in GC languages tomorrow; it'd be a pretty hard sell for me if I were running a team, but I wonder how much of that is my prejudice against it?

The problem is that games that perform better play better. Low latencies give the designer the option to control how fast the gameplay should be. High frame rates make it possible to have fast-moving gameplay without blurriness or loss of control. High resolution makes it possible to see more clearly.

Regarding the last, I tried Burnout Paradise with my borrowed G25 and it felt much blurrier than GT5. Regarding frame rate and latency, the fighting in Enslaved felt so uninvolved compared to GoW collection.

Games have to feel good. If they do not, I will not play them.
 
The problem is that games that perform better play better. Low latencies give the designer the option to control how fast the gameplay should be. High frame rates make it possible to have fast-moving gameplay without blurriness or loss of control. High resolution makes it possible to see more clearly.

Regarding the last, I tried Burnout Paradise with my borrowed G25 and it felt much blurrier than GT5. Regarding frame rate and latency, the fighting in Enslaved felt so uninvolved compared to GoW collection.

Games have to feel good. If they do not, I will not play them.

I think you may have missed the point slightly.

Great discussion though, some really interesting points being raised.
 