Cell/CPU architectures as a GPU (Frostbite Spinoff)

Now since MLAA is being done on the SPUs instead of MSAA on the GPU (from my understanding at least), can the silicon that is "reserved" for MSAA be used for other stuff? Or Quincunx, or whatever it's called, that the RSX does instead...
 
Not really IMHO, it's mostly about having more resources to sample the depth and some framebuffer compression to avoid moving the same pixel 4 times.

Makes you wonder if upcoming GPU designs will change to better accommodate post process AA solutions and drop support for MSAA in turn...
 
Not really IMHO, it's mostly about having more resources to sample the depth and some framebuffer compression to avoid moving the same pixel 4 times.

Makes you wonder if upcoming GPU designs will change to better accommodate post process AA solutions and drop support for MSAA in turn...
How come? It still takes a couple of ms to make 2xMSAA happen on RSX. Dropping it will give you a couple of ms back on RSX in return. Granted, it will take you ~4ms of SPU time to do it...
 
It's still more about bandwidth, even with the compression you move more data compared to no MSAA.
My memory isn't as clear as it should be, but the pipeline works on 4 fragments (a quad), and the circuits that sample the Z value are multiplied so it can take 2 or 4 Z samples per fragment at no further cost. This is very good for Z pre-pass rendering too (pioneered by the Doom 3 engine), but you still have to move the data, and even 2x MSAA means twice the fragments. However, most of the color samples are the same, so you can compress them very well; about 90% efficiency is possible. This is why MSAA is a lot faster than SSAA.
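
To make the compression point concrete, here's a tiny sketch (purely illustrative, nothing like the real RSX/Xenos encoding): for interior pixels all the samples match, so only one color value has to move; only edge pixels pay the full 4x cost.

Code:
#include <cstdint>
#include <cstddef>

// Purely illustrative sketch of MSAA color compression (not the real hardware
// scheme): when all 4 samples of a pixel are identical, which is true for most
// interior pixels, only one color value needs to move across the bus.
std::size_t bytesToMove4x(const uint32_t samples[4]) {
    const bool allSame = samples[0] == samples[1] &&
                         samples[1] == samples[2] &&
                         samples[2] == samples[3];
    return allSame ? sizeof(uint32_t)        // interior pixel: 1 color, 4 bytes
                   : 4 * sizeof(uint32_t);   // edge pixel: all 4 samples, 16 bytes
}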

RSX is bandwidth limited compared to the XGPU, which is why it takes a measurable performance hit (and why no AA penalty was once claimed for X360); ~26GB/sec is far from unlimited.
 
RSX is bandwidth limited compared to the XGPU, which is why it takes a measurable performance hit (and why no AA penalty was once claimed for X360); ~26GB/sec is far from unlimited.

I thought the 360 can still do AA without a performance penalty? Isn't the only penalty associated with the tiling costs?
 
Yes, because the EDRAM is only enough for 2xAA with forward rendering AFAIK. Or for some seriously sub-HD resolution with 4x AA ;)
 
Well, the heart of the PlayStation family is actually the GPU anyway. I mean, with PS2, Sony actually made the GS first, before the EE. The GS was pretty much done before they contacted Toshiba to work on the EE. That's probably why the GS was so outdated by the time the PS2 was released. So I wouldn't be surprised if Sony had their own GPU ready for PS3 too, before they went to IBM and Toshiba for Cell. Sony invested in eDRAM, and it wasn't even used for PS3 in the end. Their 32MB GS was able to output 1080; they probably had its successor in mind for PS3 before they went with Cell. They probably just wanted to put the power of the GSCube into PS3. But I guess the world had moved on by that point.

The GPU? I don't know. There is a Super Companion Chip made by Toshiba for Cell. That one handles AV interfaces.

As I recall, Kutaragi said eDRAM was out because they couldn't include enough for HD resolutions. Cell and RSX can work on multiple frames/tiles at the same time. DMA to the Local Stores works well regardless of which source frame/tile it is.

EDIT:
Might as well throw in B3D's thread on SCC:
http://forum.beyond3d.com/showthread.php?t=45996
 
Not really IMHO, it's mostly about having more resources to sample the depth and some framebuffer compression to avoid moving the same pixel 4 times.

Makes you wonder if upcoming GPU designs will change to better accommodate post process AA solutions and drop support for MSAA in turn...

As usual I was not very clear, but I was thinking more in line with what Zed did in Trials HD, where he used some silicon in the XGPU (or maybe in the eDRAM) that was supposed to do MSAA for something else.

But yes, if you save 2ms by not doing MSAA on RSX, it's logical that that's 2ms that can be used on other things, even if it costs 4-5ms to do MLAA on the SPUs. It's a net gain if the SPUs were under-utilized and you can get the scheduling to work out.
 
Okay, that's a fair argument, but are you suggesting the only reason we got the 7800 at that time instead of the 8800 is that nVidia/ATi didn't have enough money to invest, and with more money could have designed a whole generation ahead of current thinking? I don't believe all research can be accelerated just by investing money. I don't believe that if someone had handed nVidia a trillion dollars in 2000 they'd have produced the GeForce 400 series architecture. There are limits in understanding that only come with experience, and I'm not sure how much better a GPU could have been designed and manufactured for want of more investment.

A GT400 back then would have been nice :) But no, not that far. However, they could have implemented features back then that would still be standouts today. Look at Xenos as an example: its customization to support eDRAM and AA paid off big time, even if it may have caused some other headaches along the way.

The general way to go about this is to slap the biggest, baddest, most customized GPU you can possibly think of in there at the time, mate it with whatever CPU, and ride it out. That gives you the benefit of a simple pipeline and tools, and gives you great hardware support from day one. Yeah, near the end of the console lifecycle it's inevitable that some clever devs will have thought of other stuff they want to do that isn't as good a fit for said GPU, but that's what the next console is for. Plus it's not like devs can't experiment with using CPUs for graphics work on such a setup; it's still possible to experiment that way and have good GPU backing to come out strong from the gate on day one.

Take what's learned this gen and keep it all in mind on the gpu for next gen. Look at the stuff Dice is implementing. On old gpu hardware they have to turn to cpu for help, which is less efficient but they have no choice. However newer hardware supports what they have in mind and hence Frostbite 2.0 will be nicely gpu assisted on current gpu hardware. Custom forms of aa will be next in line to get hardware assist on gpu, it's inevitable. That's the best way to go about it because letting hundreds of stream processors on gpu handle all the loads automatically will always be better than having coders manually juggle it all in a cpu + gpu setup. Look at how it's handled on ps3. You've all seen 'jts' shown when it comes to coding on the ps3, that's just "jump to self" which basically tells rsx to wait until an spu tells it to keep going. It's the fundamental way to sync spu+gpu on ps3, and it's also basically a gpu stall. If you hit a jts, which you inevitably do, then you are wasting gpu processing time. I don't doubt that the different systems made to sync spu and gpu are clever but they are not the optimal way to go, nor are they a good use of developer time. It's much better to let the devs just think up the clever tasks, hurl them at a proper gpu with its bank of processors and let the machine handle all scheduling. At least that's my thoughts on it.
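
For anyone who hasn't seen one, a JTS boils down to something like this (a conceptual sketch with made-up command encodings, not real libgcm code): the CPU plants a command in RSX's push buffer that jumps to its own address, and an SPU later patches that slot to a no-op so the GPU can continue.

Code:
#include <cstdint>

// Conceptual sketch of the PS3 "jump to self" (JTS) sync trick. The command
// encodings below are made up; real code would go through the platform's
// command buffer API. The idea: RSX spins on a jump whose target is its own
// address until an SPU overwrites that slot with a no-op.
volatile uint32_t* insertJumpToSelf(volatile uint32_t* put) {
    const uint32_t selfOffset = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(put));
    *put = 0x20000000u | (selfOffset & 0x1FFFFFFFu); // hypothetical "jump" opcode
    return put;                                      // keep the slot so an SPU can patch it
}

void spuReleaseGpu(volatile uint32_t* jtsSlot) {
    // Called from the SPU side once its work on the frame is done: patch the
    // slot to a no-op so RSX falls through and keeps consuming commands.
    *jtsSlot = 0x00000000u;                          // hypothetical no-op encoding
}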
 
Take what's learned this gen and keep it all in mind on the gpu for next gen. Look at the stuff Dice is implementing. On old gpu hardware they have to turn to cpu for help, which is less efficient but they have no choice. However newer hardware supports what they have in mind and hence Frostbite 2.0 will be nicely gpu assisted on current gpu hardware.

Frostbite 2 is also nicely gpu assisted on RSX. ^_^
... except that the SPUs jump in to resolve the GPU's bottleneck, and also throw in their own advantages. There are things that are possible/better done on the CPU compared to the GPU.

Throwing compute resources at a problem is one logical way to advance, but in computer graphics, new techniques/algorithms can be game changers too.

Custom forms of aa will be next in line to get hardware assist on gpu, it's inevitable. That's the best way to go about it because letting hundreds of stream processors on gpu handle all the loads automatically will always be better than having coders manually juggle it all in a cpu + gpu setup. Look at how it's handled on ps3. You've all seen 'jts' shown when it comes to coding on the ps3, that's just "jump to self" which basically tells rsx to wait until an spu tells it to keep going. It's the fundamental way to sync spu+gpu on ps3, and it's also basically a gpu stall. If you hit a jts, which you inevitably do, then you are wasting gpu processing time. I don't doubt that the different systems made to sync spu and gpu are clever but they are not the optimal way to go, nor are they a good use of developer time.

Then you find another way to avoid or minimize stalling the GPU. You'll get both compute resources to work together and get "double" the result.

It's much better to let the devs just think up the clever tasks, hurl them at a proper gpu with its bank of processors and let the machine handle all scheduling. At least that's my thoughts on it.

It's easier... which is great, but not necessarily better. It depends on the whole package (Memory architecture, tools, GPU performance, CPU characteristics, etc.).

DICE mentioned a GPU + CPU combined approach with full programmability for a closed box architecture. The trick is how the CPU and GPU work together.
 
It's easier... which is great, but not necessarily better. It depends on the whole package (Memory architecture, tools, GPU performance, CPU characteristics, etc.).

Conversely, there's the law of diminishing returns in terms of trying to squeeze more out of limited resources. I specifically recall spending an inordinate amount of time once researching methods to reduce the number of multiply operations in a specific transform. Ultimately, as interesting as it was, my success in doing so was dampened by the realization that the gains were insignificant compared to the time I spent looking into and implementing it. I realized that for my relatively small dataset, this solution was actually negligible. I needed a much larger dataset in that specific case to justify my more efficient algorithm.
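
To give a flavour of the kind of saving involved (a reconstruction from memory, not my actual code): if the transform is known to be affine, the whole w row can be dropped, which saves a few multiplies and adds per point, and on a small dataset that's simply noise.

Code:
// Reconstructed-from-memory flavour of that kind of micro-optimisation (not my
// actual code): a general 4x4 transform needs 16 multiplies per point, but if
// the matrix is known to be affine (implicit bottom row 0 0 0 1) the w row can
// be skipped, saving 4 multiplies and 3 adds per point. On a small dataset the
// saving is negligible next to the time spent finding it.
struct Vec3 { float x, y, z; };

Vec3 transformAffine(const float m[3][4], const Vec3& p) {
    return {
        m[0][0] * p.x + m[0][1] * p.y + m[0][2] * p.z + m[0][3],
        m[1][0] * p.x + m[1][1] * p.y + m[1][2] * p.z + m[1][3],
        m[2][0] * p.x + m[2][1] * p.y + m[2][2] * p.z + m[2][3],
    };
}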

I could have much better spent my time working on other areas. Often, I find in engineering, KISS is the best solution. Easier is often best.
 
That is why you invent new ways to do things, in addition to just optimizing existing approaches.

I don't think it is wise to generalize an approach without looking at the specific designs and implementations.

EDIT: In general, if the gain is small, it may also mean that you picked the wrong thing to optimize.
 
Now since MLAA is being done on the SPUs instead of MSAA on the GPU (from my understanding at least), can the silicon that is "reserved" for MSAA be used for other stuff? Or Quincunx, or whatever it's called, that the RSX does instead...
No, which is one argument for programmability. The moment any fixed hardware becomes redundant thanks to new algorithms, it's a waste of silicon, whereas truly programmable hardware can always be used to do whatever is asked.

A GT400 back then would have been nice :) But no, not that far. However, they could have implemented features back then that would still be standouts today. Look at Xenos as an example: its customization to support eDRAM and AA paid off big time, even if it may have caused some other headaches along the way.
Xenos is a compromise with big plus points and some serious minus points. It's no more a magic bullet or better design than Cell, save that its dev tools make it easier. And we can't really compare the whole systems as better or worse, much as fanboys want to, because we never have ideal implementations of the same game/engine to compare. Take any game that runs better on 360 than PS3 (of which there are many!): we can't say for sure that the developers are using PS3's system design optimally. How many devs or engines are using DICE's Cell-based shading? If that ends up being a big win, it could be that overall Cell+RSX gets more done than Xenon+Xenos.

Now, I am not saying that's so, before any chump comes here saying "PS3 is better than XB360!!" Theoretically though, when evaluating total system performance in terms of what gets done, looking at the games achieved isn't a fair comparison, because games are built around a business that requires efficiency and has a lot of legacy thinking. Very few devs can afford the luxury of exploring all the weird and wacky ways to re-engineer graphics pipelines and do novel stuff. I'm hoping Frostbite 2 will really play to system strengths and we'll get to see what can and can't be done on these architectures without business compromises. We'll be able to compare IQ, framerate, resolution, etc. and see how shifting workload to Cell compares, giving a new sample set to consider in the overall programmability versus fixed-function argument.

The general way to go about this is to slap the biggest, baddest, most customized GPU you can possibly think of in there at the time, mate it with whatever CPU, and ride it out. That gives you the benefit of a simple pipeline and tools, and gives you great hardware support from day one.
That's quite possibly true for a console business proposition, but this thread is intended more to discuss what can be achieved on the hardware, irrespective of developer requirements. If by the end of the lifecycle the programmable system overtakes the fixed-function system, that shows that programmability is an enabler and gets more from your silicon budget, even if that's a bad choice for a fixed-hardware console that needs to satisfy good business.

The question here isn't which is better to put in a console, but which provides the best graphical returns per measure of silicon: less powerful, more flexible designs, or more powerful, less flexible designs?

Take what's learned this gen and keep it all in mind on the gpu for next gen. Look at the stuff Dice is implementing. On old gpu hardware they have to turn to cpu for help, which is less efficient but they have no choice.
However, because of those designs, using the CPU is an option. If Sony had gone with x86 and a customised GPU, which presumably wouldn't be any more advanced than Xenos, would it be able to achieve the same level of results? We'll be able to look at 360's version of Frostbite 2 and see how it compares. :D

It's much better to let the devs just think up the clever tasks, hurl them at a proper gpu with its bank of processors and let the machine handle all scheduling. At least that's my thoughts on it.
Only because they've become more programmable. ;)
 
One of the biggest problems with using the exotic choice of Cell and RSX is that it has taken so long to really see the benefits. In a competitive business, it is just as important to have impressive results at launch as it is to have continuing progress 6 years later.

What happens if the consumer doesn't buy into your product at launch because you are betting that in the long run the product will yield impressive results? I think that the only thing that saved the PS3 this generation was the millions of loyal fans from the PS1/PS2 era; they provided the goodwill and time to invest in further development.

I remember distinctly the "wait for (game X) to be released".
 
However, because of those designs, using the CPU is an option. If Sony had gone with x86 and a customised GPU, which presumably wouldn't be any more advanced than Xenos, would it be able to achieve the same level of results? We'll be able to look at 360's version of Frostbite 2 and see how it compares. :D

Well, they would have had an extra year's development time and around 30-40M additional transistors to play with (assuming the CPUs had equal transistor counts). I'm guessing they would have had more money to pour into GPU development too. So you would have to take all that into account when looking at performance on the 360.
 
One of the biggest problems with using the exotic choice of Cell and RSX is that it has taken so long to really see the benefits. In a competitive business, it is just as important to have impressive results at launch as it is to have continuing progress 6 years later.

What happens if the consumer doesn't buy into your product at launch because you are betting that in the long run the product will yield impressive results? I think that the only thing that saved the PS3 this generation was the millions of loyal fans from the PS1/PS2 era; they provided the goodwill and time to invest in further development.

I remember distinctly the "wait for (game X) to be released".
I guess Intel faced the same problem with Larrabee; it takes a lot of time and effort to get the thing to perform correctly.


I know many people are going to think "when will he stop pushing out fucked-up examples" or "how many times do we have to explain to him that you can consider things this way", etc., but there is no helping it: I need to rely on concrete examples (even fucked-up ones) to get a grasp on things, as I lack the proper education on the matter (who knows, I'm 35 but I may move to the US at some point; my wife is considering going back to school and I may consider going back to school too :) ).

OK, that was the disclaimer, and I still thank the people that spend time making me a bit less clueless for their patience.

We're speaking about Cell/CPU rendering, and I can't help but think about what another implementation of that might have been.

I think of two Cells, each with fewer SPUs and a tiny cut-down GPU on the chip. There would be no VRAM; the chips would have been connected through FlexIO, each to its own 256MB pool of XDR DRAM, the whole offering a unified/coherent 512MB memory space.
In a perfect world (by 2005) the "mini cut-down GPU" would have been an "RBE-less" mini Xenos; the only way to get the "results" out would be through memexport (accessing the LS on top of RAM would be a clear plus).
I quickly eyed a Cell die shot as well as a Xenos die shot, and there was clearly the possibility to achieve that within the PS3 silicon budget.
It could have been something like this (fucked-up concrete example again), times two:
1 PPU
6 SPUs
a mini GPU (one 8-wide SIMD, 6 texture filtering units, 6 texture addressing units, plus fixed-function hardware, most likely clocked consistently higher than 500MHz).
All of this connected through the EIB.

I've mixed feelings about it. On one side I think it would have been potentially great; on the other side the system would have completely bombed, as it would have had nothing to offer back in 2006. Without the proper software the system would have been helpless against the competition, and it would have remained in that state for a long while. Actually, I think it would still be helpless, multiplatform would still be difficult on the system, and I believe that even top-notch third- and first-party studios would only just be starting to make good use of the system's resources.

Basically, most of the vertex load would be handled by the SPUs; spreading the work between them should not be that tough (I say that because I'm neither doing it myself nor funding the project :LOL: ), but you have to feed two GPUs. From the little I managed to read, I believe you would have to do the same thing as Intel did: use binning and send the bins (once vertex processing has been done) to the closest GPU. Both GPUs would fill a single G-buffer, which would get tiled for the SPUs to handle shading. Then comes transparent geometry... maybe the flexibility of the system would allow handling it without having to support another completely separate graphics pipeline, I dunno. Results would be pushed out to RAM in various render targets, which would get moved again to the SPUs for blending and post-processing.
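
Written out as a toy flow, just to make the hand-offs explicit (every name and type here is made up, it's only the shape of the pipeline):

Code:
#include <vector>

// Toy outline of the pipeline described above. Every function and type is
// hypothetical; the point is only the order of hand-offs between the SPUs and
// the two mini-GPUs sharing one G-buffer in XDR memory.
struct Bin  { int region; };   // screen-space bin of post-transform triangles
struct Tile { int x, y; };     // G-buffer tile small enough for an SPU local store

static std::vector<Bin>  spuTransformAndBin()              { return {{0}, {1}, {2}, {3}}; }
static void              rasterizeOnNearestGpu(const Bin&) { /* fills the shared G-buffer via memexport */ }
static std::vector<Tile> gbufferTiles()                    { return {{0, 0}, {1, 0}, {0, 1}, {1, 1}}; }
static void              spuShadeTile(const Tile&)         { /* DMA tile in, light it, DMA results out */ }
static void              spuBlendAndPost()                 { /* transparents, blending, post-processing */ }

void renderFrame() {
    for (const Bin& bin : spuTransformAndBin())   // 1. SPUs do vertex work and binning
        rasterizeOnNearestGpu(bin);               // 2. each bin goes to whichever mini-GPU is closer
    for (const Tile& tile : gbufferTiles())       // 3. SPUs shade the shared G-buffer tile by tile
        spuShadeTile(tile);
    spuBlendAndPost();                            // 4. SPUs blend and post-process the render targets
}
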
That's for graphics; then you have to blend in physics, AI and all the other tasks that run on the SPUs, make sure they are properly used, that they are not bandwidth constrained at some stages, etc. As the thing gets more complex, I feel like it has to become more and more of an unpredictable mess.

Honestly, that sounds like a hell of a job, and even if it were to provide slightly better results than more straightforward hardware, I'm not sure the investment in human effort would be worth the returns.
Going forward, I wonder when the software could be ready, as for this kind of set-up, software and human work seem to be the limiting factor. All of which makes my post a bit useless, so :LOL:
 
No, which is one argument for programmability. The moment any fixed hardware becomes redundant thanks to new algorithms, it's a waste of silicon, whereas truly programmable hardware can always be used to do whatever is asked.
As long as you define redundant to mean that the new algorithm in software is always better (where better can mean, e.g., faster, better IQ, or lower power use), which is extremely unlikely.
 
In one way, the whole talk of GPGPU being the way forward kind of validates the idea behind Cell. The ideas behind them are very similar, just attacking the same thing from different angles. At end-game a perfect Cell would most likely be the same as the perfect GPGPU :LOL: Of course a specific implementation may be an issue, but the idea behind it may still be sound.

Given that off-the-shelf GPGPU wasn't really a viable option for Sony during PS3 development, I like Cell as a step towards that and think it has been a relatively successful first attempt. It's just a shame they ended up with the GPU they did alongside it. Cell + Xenos would have been awesome, and I doubt there would be as much negative feeling towards Cell if that was the case.

I'm not convinced we will see a single GPGPU solution being the most powerful implementation next gen, though, simply due to max chip size. I can still see a Cell-like CPU being of use alongside a given discrete GPU, just as a way of spending excess transistor budget, or even better, maybe we will get a dual GPGPU solution! Then again, it may be wiser just to forget about getting the most power for a given budget and instead reduce the budget, making the console cheaper.
 
As long as you define redundant to mean that the new algorithm in software is always better (where better can mean, e.g., faster, better IQ, or lower power use), which is extremely unlikely.

Of course. But say we take the (admittedly near-ideal) case of AA solutions. The only 'fixed hardware' AA solution that the RSX provides is 2x QAA, which has proven to be very unpopular. Now, I doubt that this fixed hardware wasted a lot of space, but OK. The next option, MSAA, is already very expensive on the RSX, quoted as taking performance down by something like 25% per AA level (e.g. 4x MSAA halves performance or worse). Now we get MLAA, which achieves pretty good results (most, though not all, certainly prefer it to QAA at the very least), and is relatively cheap, also on current GPU hardware (e.g. my relatively limited Radeon 5570 can implement MLAA with very little additional cost: it has no noticeable effect on my framerates, and games that support it properly rather than having it as a post-fx would look even better).
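
To give an idea of why it's so cheap: the first MLAA step is just an edge detect over the final image, something like this generic sketch (not the actual SPU or Radeon implementation), with the later passes classifying the edge shapes and blending along them.

Code:
#include <cmath>
#include <cstdint>

// Generic sketch of MLAA's first pass, edge detection (not the actual SPU or
// Radeon implementation). It only reads the final colour buffer, which is why
// the technique is comparatively cheap and easy to bolt on.
static float luma(uint32_t rgba) {
    const float r = static_cast<float>( rgba        & 0xFF) / 255.0f;
    const float g = static_cast<float>((rgba >> 8)  & 0xFF) / 255.0f;
    const float b = static_cast<float>((rgba >> 16) & 0xFF) / 255.0f;
    return 0.2126f * r + 0.7152f * g + 0.0722f * b;
}

void detectEdges(const uint32_t* image, uint8_t* edges, int w, int h, float threshold) {
    for (int y = 0; y < h - 1; ++y)
        for (int x = 0; x < w - 1; ++x) {
            const float c = luma(image[y * w + x]);
            const bool right  = std::fabs(c - luma(image[y * w + x + 1]))   > threshold;
            const bool bottom = std::fabs(c - luma(image[(y + 1) * w + x])) > threshold;
            edges[y * w + x] = static_cast<uint8_t>((right ? 1 : 0) | (bottom ? 2 : 0));
        }
}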

I think though that Nvidia's reaction to Larrabee was rather telling, and fairly wise too: we're going to keep building hardware that is part fixed, part programmable, and we'll keep looking at market developments to determine which parts should be fixed and which parts should be programmable, and in what ratio.

Looking at the future, I'd say that programmable will end up winning, simply because there has to come a time when hardware performance outstrips demand. That is quite a ways off for titles that strive for complete realism, but of course not all games have that goal, so even now an ever-increasing percentage of titles don't really care about the hardware and use only a fraction of its power.

So it will always be about finding the right balance between fixed units and programmable units, and even among the programmable units there will be different kinds that can do certain types of things better than others. In the Cell Handbook, IBM gives a nice overview of the types of workloads that Cell is and isn't good at, and how good it is at them, but perhaps Folding@Home is an even nicer example of workload distribution over different hardware, letting GPGPUs handle certain types of jobs, Cell another, and CPUs yet another.

The challenge for console developers is more clear than ever in that respect:

1. make an estimate of the type of workload that will be required of the console hardware at the start of its lifecycle
2. make an estimate of how future requirements may change

Studies of DirectX 11-style PC development, as well as the relative merits and downsides of the current console designs, would be the best help for 1. Predicting the future after that is always harder, but may not be as important either.

What is really, really important, and something that Microsoft demonstrated very well, is that making the system easy to develop for may end up yielding far better results than any hardware optimisation can ever achieve (just as providing good services on your platform may be more essential than anything else).
 
Well, they would have had an extra year's development time and around 30-40M additional transistors to play with (assuming the CPUs had equal transistor counts). I'm guessing they would have had more money to pour into GPU development too. So you would have to take all that into account when looking at performance on the 360.

I think Sony had a multi-year R&D lead time for the CPU. The advantages should be more than just transistor count. It takes more effort to create a "new" architecture, and the risk is also significantly higher. Besides proving performance, they also need to worry about yield for the aggressive design, and testing. I remember Kutaragi mentioning that Cell was tested like a server CPU instead of a consumer-device CPU; it ran applications 24/7 for months/years, like the Warhawk servers for example.


One of the biggest problems with using the exotic choice of Cell and RSX is that it has taken so long to really see the benefits. In a competitive business, it is just as important to have impressive results at launch as it is to have continuing progress 6 years later.

What happens if the consumer doesn't buy into your product at launch because you are betting that in the long run the product will yield impressive results? I think that the only thing that saved the PS3 this generation was the millions of loyal fans from the PS1/PS2 era; they provided the goodwill and time to invest in further development.

I remember distinctly the "wait for (game X) to be released".

Yes, that's one of the many risks. When Cell was launched, the software library and tools were immature. But if they are prepared and can pull through the hard times, it means that the system has room for growth to adjust to new techniques since the console is supposed to "last 10 years".

When both compute resources work on more than one frame together, it should mean that the system has more time to work on each frame (e.g., RSX on this frame while the SPUs start work on the next). So there are some tricks to play with. Alternatively, by increasing the number of cores or the bandwidth in the GPU alone, the vendor can also achieve raw performance as high as or even higher than a CPU + GPU setup. It's a matter of finding the right balance.
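
In pseudo form it's just this kind of overlap (an illustrative, sequential sketch only, with made-up names; on the real machine the two halves run concurrently):

Code:
#include <cstdio>

// Illustrative-only sketch of the frame overlap: while RSX renders frame N,
// the SPUs already start preparing frame N+1, so each chip gets close to a
// full frame's worth of time. Shown sequentially here for simplicity.
static void spuPrepareFrame(int n) { std::printf("SPUs prepare frame %d\n", n); }
static void gpuRenderFrame(int n)  { std::printf("GPU  renders frame %d\n", n); }

int main() {
    const int frames = 4;
    spuPrepareFrame(0);                              // prime the pipeline
    for (int n = 0; n < frames; ++n) {
        gpuRenderFrame(n);                           // RSX consumes the commands for frame N
        if (n + 1 < frames) spuPrepareFrame(n + 1);  // SPUs overlap work on frame N+1
        // ...sync both, then present frame N...
    }
    return 0;
}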

This gen, I think PS3 is more hurt by the memory size and bandwidth.
 