Why not let PowerVR build RSX?

Since some manufacturers passed on PowerVR back when its advantages were so pronounced, their understanding of its architecture is obviously suspect, and that lack of understanding is potentially the chief reason they're not using it now, regardless of whether or not it still has the edge technologically.
Not when you take T&L, drivers, and performance across all games into account. The Kyros even lacked cube mapping. Blaming their market failure on ignorance is really quite silly.
 
the Sony Ninjas will kill you both.
Well there's no need to get into specifics about documentation. Just describing an idea for a rendering technique is good enough, and then I can see if it's something new and exciting or doable without anything special.
 
Well there's no need to get into specifics about documentation. Just describing an idea for a rendering technique is good enough, and then I can see if it's something new and exciting or doable without anything special.
Yes. No-one needs to go into specifics. They need only discuss the possibilities of close working between a theoretical Cell-based system with a fast connection to a possibly customized GPU. What sort of things can be done, especially things that'd benefit from very high BW between processors? Aren't GPUs still hampered by needing mesh and texture data contiguous in RAM, or can these be streamed on demand from a CPU?
 
However, nothing at all comes to mind here beyond what I mentioned before. AGP and PCI-E have generally had bandwidth close to system bandwidth, so it's expected for Flex I/O to offer transfer speeds comparable to XDR's bandwidth.

Transfer speeds are public: 15 GBytes/s from CELL to RSX (RSX pulling) and 20 GBytes/s from RSX to CELL (RSX pushing)

Procedural geometry isn't particularly exciting to me, especially since you can do it (at some XDR BW cost) without any XPS-type interaction.
It gets more 'exciting' if you think about multiple processors feeding a single GPU through a single command buffer at the same time.. how would you do that on a PC without basically a frame's worth of geometry? Just think about the synchronization issues..

I would very much love for nAo or some other dev to show me otherwise
Just use your imagination :) Since you think that doing procedural geometry and/or culling/optimizing geometry is trivial (then why don't PC games do that!?) let me think about something different.
We already know that SPUs are not very good at texturing, but only because they lack a mechanism to hide texturing latencies and custom hw filtering.
In a scenario where texture access is completely regular and streamlined, SPUs should rock...what about shading a scene within a deferred renderer?
Texture access would be trivial: load an NxN tile from all the g-buffers into local store, shade the tile, and dump the tile out. You don't need huge tiles as long as you can keep the SPUs busy all the time (as DMA transfers can run concurrently).
Shading time would be more or less predictable in such a rendering scheme, so it would be possible to split the work between RSX and a bunch of SPUs with a decent approximation, making sure both parties are working most of the time.
The problem is..only a crazy dev would risk adopting an approach like that on a first generation game.
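Concretely, the per-tile SPU job could look something like this rough C sketch. The 16-byte texel layout, shade_pixel and the linear tile addressing are all made up for illustration; mfc_get/mfc_put are the standard Cell SDK DMA intrinsics, and a real version would gather 2D tiles from the separate render targets with DMA lists instead of pretending the g-buffer is one big array of tiles:

```c
/* Rough sketch: per-tile deferred shading job on one SPU.
 * Assumes 32x32 pixel tiles and ~16 bytes of g-buffer data per pixel.
 * Everything except the mfc_* intrinsics is hypothetical. */
#include <spu_mfcio.h>
#include <stdint.h>

#define TILE_W 32
#define TILE_H 32

typedef struct {            /* hypothetical packed g-buffer texel */
    float    depth;
    uint32_t normal_xy;     /* two 16-bit components, z reconstructed */
    uint32_t albedo_rgba;
    uint32_t spec_misc;
} GBufferTexel;             /* 16 bytes per pixel */

static GBufferTexel tile_in [2][TILE_W * TILE_H] __attribute__((aligned(128)));
static uint32_t     tile_out[2][TILE_W * TILE_H] __attribute__((aligned(128)));

extern uint32_t shade_pixel(const GBufferTexel *t);   /* placeholder for the actual lighting math */

void shade_tiles(uint64_t gbuf_ea, uint64_t out_ea, int num_tiles)
{
    const uint32_t in_bytes  = sizeof(tile_in[0]);
    const uint32_t out_bytes = sizeof(tile_out[0]);

    /* prefetch tile 0 on tag 0 */
    mfc_get(tile_in[0], gbuf_ea, in_bytes, 0, 0, 0);

    for (int i = 0; i < num_tiles; ++i) {
        int cur = i & 1, nxt = cur ^ 1;

        /* start fetching the next tile while we shade the current one */
        if (i + 1 < num_tiles)
            mfc_get(tile_in[nxt], gbuf_ea + (uint64_t)(i + 1) * in_bytes,
                    in_bytes, nxt, 0, 0);

        /* wait for the current tile's input DMA (and any older output on this tag) */
        mfc_write_tag_mask(1u << cur);
        mfc_read_tag_status_all();

        for (int p = 0; p < TILE_W * TILE_H; ++p)
            tile_out[cur][p] = shade_pixel(&tile_in[cur][p]);

        /* push the shaded tile back out while the next one is still in flight */
        mfc_put(tile_out[cur], out_ea + (uint64_t)i * out_bytes,
                out_bytes, cur, 0, 0);
    }

    mfc_write_tag_mask(3u);
    mfc_read_tag_status_all();   /* drain the remaining output DMAs */
}
```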
[BTW is there any name for RSX's ability to control SPUs? XPS - XBox Procedural Synthesis - is the analogue on the 360, AFAIK. Again, if anyone knows more, please share.]
No fancy names, sorry, maybe we can invent one..
 
It gets more 'exciting' if you think about multiple processors feeding a single GPU through a single command buffer at the same time.. how would you do that on a PC without basically a frame's worth of geometry? Just think about the synchronization issues..
Yeah but we're not talking about a PC. You could have a few small buffers, poll RSX's command buffer pointer every time the SPU job manager is ready to dispatch another task, and start filling an empty buffer when appropriate.
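Something like this very rough C sketch (all the rsx_* glue is hypothetical, since the real command submission interface isn't something we can discuss): keep a small ring of buffers, check which one RSX is currently fetching from, and never overwrite anything it hasn't consumed yet.

```c
/* Sketch of the "few small buffers" idea, ignoring the real command format.
 * Assumptions (hypothetical): RSX walks the buffers in ring order, and
 * rsx_current_buffer() reports which one it is currently reading from. */
#include <stdint.h>
#include <string.h>

#define NUM_BUFS  4
#define BUF_BYTES (16 * 1024)

static uint8_t cmd_bufs[NUM_BUFS][BUF_BYTES];
static int     write_idx;                       /* next buffer the SPU side will fill      */

extern int  rsx_current_buffer(void);           /* hypothetical: index RSX is reading from */
extern void rsx_kick(int idx, uint32_t bytes);  /* hypothetical: tell RSX this one is ready */

/* Called each time the SPU job manager has another batch of commands/geometry ready. */
int submit_batch(const void *cmds, uint32_t bytes)
{
    int next = (write_idx + 1) % NUM_BUFS;
    if (next == rsx_current_buffer())   /* ring full: RSX hasn't consumed it yet */
        return 0;                       /* caller can do other work and retry    */

    memcpy(cmd_bufs[write_idx], cmds, bytes);
    rsx_kick(write_idx, bytes);
    write_idx = next;
    return 1;
}
```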

Just use your imagination :) Since you think that doing procedural geometry and/or culling/optimizing geometry is trivial (then why don't PC games do that!?) let me think about something different.
I never said it's trivial, I just said it doesn't excite me. As for PC devs, I think it's more that using artists is a lot easier than using procedural models. ATI and NVidia already buffer frames of data in games using dynamic VBs AFAIK because they have gobs of system RAM to use. Most of the time procedural synthesis is better suited to demos than real games.
We already know that SPUs are not very good at texturing, but only because they lack a mechanism to hide texturing latencies and custom hw filtering.
In a scenario where texture access is completely regular and streamlined, SPUs should rock...what about shading a scene within a deferred renderer?
Texture access would be trivial: load an NxN tile from all the g-buffers into local store, shade the tile, and dump the tile out. You don't need huge tiles as long as you can keep the SPUs busy all the time (as DMA transfers can run concurrently).
Shading time would be more or less predictable in such a rendering scheme, so it would be possible to split the work between RSX and a bunch of SPUs with a decent approximation, making sure both parties are working most of the time.
The problem is..only a crazy dev would risk adopting an approach like that on a first generation game.
Okay, that's a pretty novel idea, but I'm unconvinced that you gain much out of it. It seems like the SPUs couldn't offer too much help, especially since 6 SPUs running at 100% efficiency still couldn't match RSX's speed. IMO the cool lighting shaders need texture access too (weren't you preaching the advantages of forward shadow mapping?), so they wouldn't run well on an SPU. Finally, I'm not a fan of deferred shading as you could probably tell from the deferred shading thread, because there's too many extra operations and too much BW-eating data flow.

Also, even if your g-buffers are 30 bytes per pixel and the SPUs do all the shading, 1080p @ 60fps is only 3.6 GB/s. We're back to plain old data transfer again which is only needed because of the split memory pool. So unfortunately your example falls under my "mitigating disadvantages" category.
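(Working it out: 1920 x 1080 pixels x 30 bytes x 60 fps is roughly 3.5-3.7 GB/s depending on how you count a GB, so nowhere near the Flex I/O figures quoted above.)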

I appreciate the effort, though! :smile:

(So am I correct in thinking RSX controlling SPUs is roughly similar to XPS?)
 
Mintmaster said:
IMO the cool lighting shaders need texture access too (weren't you preaching the advantages of forward shadow mapping?), so they wouldn't run well on an SPU.
Textures are just extra attributes, and with small tiles there's not much of a problem adding more of them. You'd want to avoid dependent lookups - but most of them can be replaced by math ops.

Finally, I'm not a fan of deferred shading as you could probably tell from the deferred shading thread, because there's too many extra operations and too much BW-eating data flow.
Deferred shading is the one remaining fortress for eDram (as well as TBDR) proponents IMO - unfortunately the right GPU to prove this didn't make it into the final product in the end.

Mind you - you keep saying that all the "cool" ideas should already be known in advance - for instance the PS2 deferred shading concept wasn't even thought of until many years later - and we're talking shader model 3+ on PS2, how could anyone say that's "not" cool? - and that's really just one of many examples.
 
Textures are just extra attributes, and with small tiles there's not much of a problem adding more of them. You'd want to avoid dependent lookups - but most of them can be replaced by math ops.
With deferred shading we're talking about extra fullscreen buffers, aren't we? For shadow mapping I see your point, but IMO if you're going to go through each light this way you might as well shade on RSX also since you already calculated the position.

On the whole, though, you're right. The per-light component of each shader doesn't really need textures aside from shadows, so my argument is a lot narrower in scope than I thought.

Deferred shading is the one remaining fortress for eDram (as well as TBDR) proponents IMO - unfortunately the right GPU to prove this didn't make it into the final product in the end.
I'm surprised you say this. Isn't it basically designed to give an IMR zero overdraw during lighting calcs? I can see how TBDR could reduce the BW of deferred shading, but I don't see a big reason to use it in the first place.

I'm not sure why you're so quick to dismiss eDRAM. If MS wanted a single 128-bit bus for cost reasons, is there any way ATI could have designed a faster chip without it, even with 300 mm2 of silicon?

Mind you - you keep saying that all the "cool" ideas should already be known in advance - for instance the PS2 deferred shading concept wasn't even thought of until many years later - and we're talking shader model 3+ on PS2, how could anyone say that's "not" cool? - and that's really just one of many examples.
I never said the cool ideas must be known in advance. I implied that it usually takes a few years for an idea to find its way into games, that's all. Are there any games using this technique effectively on the PS2 today? I haven't seen anything remotely close to SM3+ level stuff.

It's just that those other features I mentioned - EMBM, VTF, and DB - allow a whole new class of things that can be done, and they were immediately apparent. Nonetheless, it took several years to see anything get used. With this Cell-RSX interaction thing, I don't even see the initial possibilities. I'm hoping you guys can help me out.
 
Check the benchmarks of Kyro 2 again. It rated higher than any comparable offering in almost every game.
First of all, it lost in some key benchmarks like this and this. Secondly, even if you were right, that alone is very weak evidence that ignorance was the reason for nobody using the chip.

GeForce was a very powerful brand back then. The GeForce took the market by storm, GeForce2 MX was out for a long time and was being built by many brands, and the GeForce3 was just released at the same time as the Kyro. The brand recognition was a huge part of what drove the GeForce MX sales. T&L was also a big buzzword, and the numbers on the box were all bigger for the Geforce. NVidia was also strong-arming the board manufacturers not to make Kyro cards.

Take all these reasons into account, along with no high end card in sight from PowerVR, and manufacturers had no reason to adopt the Kyro. They figured it wouldn't sell, and indeed sales paled in comparison to the MX. It has nothing to do with ignorance.
 
Yeah but we're not talking about a PC. You could have a few small buffers, poll RSX's command buffer pointer every time the SPU job manager is ready to dispatch another task, and start filling an empty buffer when appropriate.
Doable but inefficient, you don't really want to have a granularity like that because you might have many jobs running and many of them could be working on completely different problems (physics, visibility, etc..), and also you don't want to poll GPU registers..it's old fashioned and slow :)

Okay, that's a pretty novel idea, but I'm unconvinced that you gain much out of it. It seems like the SPUs couldn't offer too much help, especially since 6 SPUs running at 100% efficiency still couldn't match RSX's speed.
SPUs don't need to match RSX speed, as long as they can shift a relevant workload from RSX to them.

IMO the cool lighting shaders need texture access too (weren't you preaching the advantages of forward shadow mapping?)
Maybe you're confusing me with someone else, HS uses a deferred shadowing approach.. :)

so they wouldn't run well on an SPU.
So it would run particularly well on a SPU, thanks :)
Finally, I'm not a fan of deferred shading as you could probably tell from the deferred shading thread, because there's too many extra operations and too much BW-eating data flow.
I'm not advocating deferred rendering, I just wrote the first thing I could think of.
BTW, the BW price you pay with deferred rendering could be greatly reduced with smart packing of data, probably one can do wonders just by packing and storing 12-16 bytes worth of data per pixel (or more if AA is needed..)
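For illustration only, a hypothetical 16-bytes-per-pixel layout (not taken from any real engine) could look like this:

```c
/* Purely illustrative g-buffer packing, 16 bytes per pixel. */
#include <stdint.h>

typedef struct {
    uint32_t depth24_stencil8;   /* 24-bit depth + 8 bits of material ID / flags        */
    uint32_t normal_xy16;        /* normal as two 16-bit components, z reconstructed
                                    from sqrt(1 - x*x - y*y) for view-facing normals     */
    uint32_t albedo_rgb_spec8;   /* 8:8:8 albedo + 8-bit specular intensity             */
    uint32_t emissive_rough_etc; /* whatever else the lighting model needs              */
} PackedGBufferTexel;            /* 16 bytes per pixel total                            */
```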

Also, even if your g-buffers are 30 bytes per pixel and the SPUs do all the shading, 1080p @ 60fps is only 3.6 GB/s.
You're missing the point here, it's not about bandwidth, since you were complaining about the fact that CELL - RSX working together is really nothing new and is overblown..I gave you an example which needs extensive and advanced synchronization between these 2 processors to be efficient, and all the bw in the world can't give you that.


We're back to plain old data transfer again which is only needed because of the split memory pool. So unfortunately your example falls under my "mitigating disadvantages" category.
Umh? You do realize that RSX can read from/write to either memory pool, right? Think twice about what you wrote :)

I appreciate the effort, though! :smile:
I would appreciate it if you would not dismiss an idea just because you don't like it, I gave you a non-trivial way to make CELL and RSX work together in a tightly coupled fashion..but hey..you don't like deferred rendering. I suppose I'll have to think of something else, something even more juicy to satisfy your insatiable appetite..maybe I should add EMBM to the equation! :devilish:

(So am I correct in thinking RSX controlling SPUs is roughly similar to XPS?)
Dunno, I don't remember how XPS works anymore, and I could not confirm or deny anything anyway.
 
Chill out, I didn't mean that in a derogatory way (I would have put three asterisks after the word fan if I did).
No I wasn't offended. I just thought it was funny in that it was strange of you to apply that expression to me.

As for not countering your arguments, I've done it in the past and life is too short.
 
Doable but inefficient, you don't really want to have a granularity like that because you might have many jobs running and many of them could be working on completely different problems (physics, visibility, etc..), and also you don't want to poll GPU registers..it's old fashioned and slow :)
Well the driver has to somehow know where the command pointer is, right? It can't keep adding to the command buffer endlessly. As for having many jobs and interrupting an SPU, I don't see how it's any different from RSX letting an SPU know it's ready to take in more vertices. The only real difference is that the software solution requires a bit of RAM.

SPUs don't need to match RSX speed, as long as they can shift a relevant workload from RSX to them.
Well that was sort of my point. You'd have to work really hard to make it relevant (isn't it tough to make an SPU run a shader even close to the kind of efficiency RSX runs at?), and this is only the deferred lighting part of the rendering workload.

I know you're a perfectionist that loves to extract every ounce of performance, so even a 20% overall speedup would prove the method useful. ;)

Maybe you're confusing me with someone else, HS uses a deferred shadowing approach.. :)
Isn't forward shadow mapping the deferred one? I remember discussing this with you regarding UE3 when I was baffled that it works out faster even though you're doing more work.

Anyway, accessing the shadow map is quick on the SPU? Isn't it just like a normal texture fetch? You need a bunch of samples and have to filter them, and it's random access dependent on the pixel's 3D position.

You're missing the point here, it's not about bandwidth, since you were complaining about the fact that CELL - RSX working together is really nothing new and is overblown..I gave you an example which needs extensive and advanced synchronization between these 2 processors to be efficient, and all the bw in the world can't give you that.
Okay, I'm missing something. I thought it's basically just an application of GPU readback - tell RSX to locally write a block of the g-buffer whenever an SPU is ready to process more stuff. Alternatively you could render directly into XDR (is that fast?).

Umh? You do realize that RSX can read from/write to either memory pool, right? Think twice about what you wrote :)
Assume PS3 had one memory pool. When RSX is done going through all the geometry and finished writing the g-buffer, Cell can go ahead and do the deferred shading, right? No fancy scheduling needed. Just the basic "I'm done" signal that all GPUs have.

That's what I mean by mitigating the disadvantages of a split memory pool.

I would appreciate it if you would not dismiss an idea just because you don't like it
Come on, that has nothing to do with it. Your tightly coupled interaction is only there because the g-buffer is in a place Cell can't access efficiently on its own. I'm not looking for a solution to a PS3 quirk, I'm looking for something new it enables (other than direct-fed procedural geometry or SPU culled geometry).
 
Isn't forward shadow mapping the deferred one? I remember discussing this with you regarding UE3 when I was baffled that it works out faster even though you're doing more work.

Yes, and it tends to be incredibly efficient, especially in scenes with high overdraw without a zpass or in scenes with lots of alpha tested geometry. Also, in such a scheme it's very easy to use the hardware (hi-z or hi-stencil) to apply a large filter only where you need it (penumbra areas).

Predicated tiling, again, tends to play a bit against this idea, but it can be solved with some pain and work. I don't know how this approach could be offloaded to a SPU on PS3, but it sounds like a cool idea.
 
Well the driver has to somehow know where the command pointer is
Driver? Drivers are a luxury we can't afford.. :)
right? It can't keep adding to the command buffer endlessly
Oh..it can :) (unless you completely fill your command buffer..)
As for having many jobs and interrupting an SPU, I don't see how it's any different from RSX letting an SPU know it's ready to take in more vertices. The only real difference is that the software solution requires a bit of RAM.
Polling a hw register is one thing, reading a pointer in memory is another; you read the same amount of data..but you're doing quite different things.

Well that was sort of my point. You'd have to work really hard to make it relevant (isn't it tough to make an SPU run a shader even close to the kind of efficiency RSX runs at?), and this is only the deferred lighting part of the rendering workload.
Can I ask why you think I will have to work so hard to make it relevant?
Once the g-buffer data for a given tile (let's say 32x32 pixels) is stored in local store it's a piece of cake. You can work as efficiently as a G80 (in a scalar fashion) on 4 pixels at a time or even more, since your texturing latencies would be completely hidden.
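To make that concrete, the inner loop could look something like this in plain C (purely illustrative, names made up; on an SPU each float[4] maps onto one 128-bit register and the body becomes branch-free vector code):

```c
/* SoA layout for a group of 4 pixels already sitting in local store. */
#define QUAD 4

typedef struct {
    float nx[QUAD], ny[QUAD], nz[QUAD];
    float albedo_r[QUAD], albedo_g[QUAD], albedo_b[QUAD];
    float occlusion[QUAD];       /* e.g. the deferred shadow/occlusion term */
} PixelQuad;

/* N.L diffuse for one directional light over 4 pixels: no texture fetches
 * and no unpredictable branches, so there is nothing for the SPU to stall on. */
static void shade_quad(const PixelQuad *q,
                       float lx, float ly, float lz,
                       float out_r[QUAD], float out_g[QUAD], float out_b[QUAD])
{
    for (int i = 0; i < QUAD; ++i) {
        float ndotl = q->nx[i] * lx + q->ny[i] * ly + q->nz[i] * lz;
        if (ndotl < 0.0f) ndotl = 0.0f;   /* becomes a select, not a branch, in vector code */
        float lit = ndotl * q->occlusion[i];
        out_r[i] = q->albedo_r[i] * lit;
        out_g[i] = q->albedo_g[i] * lit;
        out_b[i] = q->albedo_b[i] * lit;
    }
}
```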

Isn't forward shadow mapping the deferred one? I remember discussing this with you regarding UE3 when I was baffled that it works out faster even though you're doing more work.
Yep, my bad..messed up terminology.
Anyway..the first implementation was doing more work..now it's doing the same amount of work..and I can even decide how many samples I want to take on a per shadow map basis with no predication and no dynamic branching while forcing the hw to maximize the use of texture caches..so it's even faster :)

Anyway, accessing the shadow map is quick on the SPU? Isn't it just like a normal texture fetch? You need a bunch of samples and have to filter them, and it's random access dependent on the pixel's 3D position.
I don't need to access any shadow map with the SPUs; RSX would handle rendering an occlusion term into a screen-sized texture.

Okay, I'm missing something. I thought it's basically just an application of GPU readback - tell RSX to locally write a block of the g-buffer whenever an SPU is ready to process more stuff. Alternatively you could render directly into XDR (is that fast?).
Can't answer that, but if you want to make them work together you need efficient ways to synchronize them, that's the point. SPUs could be allocated to do any work..at any time..not just that.

Assume PS3 had one memory pool. When RSX is done going through all the geometry and finished writing the g-buffer, Cell can go ahead and do the deferred shading, right? No fancy scheduling needed. Just the basic "I'm done" signal that all GPUs have.
No..this is when the fun begins! :)

Come on, that has nothing to do with it. Your tightly coupled interaction is only there because the g-buffer is in a place Cell can't access efficiently on its own.
You're assuming things, but you could be wrong..

I'm not looking for a solution to a PS3 quirk, I'm looking for something new it enables (other than direct-fed procedural geometry or SPU culled geometry).
Well..if I have better ideas I will let you know :)

Marco
 
Understanding of PowerVR's real-world benefits isn't relevant for device makers who pick their GPUs based upon brand. Console makers like Sony (the topic of this thread) are selling mostly on the strength of their own brand and would be more interested in the performance of the GPU.

Kyro 2's performance was favorable enough that, of the two benchmarks being used to argue against it, the first one (MDK2) actually shows a victory for PowerVR over its most direct competition, and a near tie with higher-cost competition at the highest detail setting that still gives relevant frame rates. The other one was 3DMark, which is not a relevant application in itself and, in this case, not even indicative of the games that budget cards like the Kyro 2 actually ran.
 
Understanding of PowerVR's real-world benefits isn't relevant for device makers who pick their GPUs based upon brand. Console makers like Sony (the topic of this thread) are selling mostly on the strength of their own brand and would be more interested in the performance of the GPU.
So when Sega dropped PowerVR in favour of nVidia for Lindbergh, this was a marketing strategy to leverage nVidia's brand name in the arcade space to attract customers? And Sega thought this better than offering the more powerful PowerVR hardware that would give gamers a better experience?
 