RSX pixel shaders vs Xenos pixel shaders

ERP said:
Xenos can do a texture fetch in parallel with ALU ops, G71 uses an ALU during texture ops.
If you play with nvshaderperf (specifying g70 as the profile architecture) you can see that half the ALU capacity is used during texture ops.
Xenos has MUCH better dynamic branch support.
On pixels; on vertices it's not that good.
 
Mintmaster said:
How much is this flexibility worth?
Certainly as much as the number of transistors budgeted for it. It is still a trade-off, but I don't think we'll ever agree on how much of a trade-off it is. I'm also not in a position to assert my opinion, though it would be interesting to hear developers' first-hand accounts.

Thirdly, there's more stuff you need on the GPU in order to manage without eDRAM (e.g. compression logic, more complex memory controller, etc), so this offsets a bit of the transistor costs
I may be mistaken, but instead of being on the GPU, the stuff you need just sits on the eDRAM die instead. Perhaps being on a separate die means some cost savings, as you suggest.

Fourthly, the additional memory contention from the CPU pales in comparison to the demands from the colour and z clients in GPUs without eDRAM. The bulk of the eDRAM probably yields well also, since a tiny amount of redundancy can cover for an error anywhere.
Having more bandwidth for the GPU by providing the CPU with its own, I imagine, was to achieve the same goal, but perhaps not to the same degree. This is the first time I have heard of redundancy in the eDRAM daughter die.

For dynamic branching, of course there's no use in current games, because it's only been usable in hardware for six months or so. It's a new tool that enables new effects. Demos have shown huge practical benefits, and demos aren't benchmarking tools.
Perhaps there are more ways to skin a cat. What kind of visual effects does dynamic branching allow exclusively?

The polygons a game sends to a chip are not uniform in size. Vertex to pixel ratios span many orders of magnitude. Very, very few polygons lie within the range where both pixel and vertex shaders are mostly occupied simultaneously. For any given polygon, the ratio changes as the player moves around. I've done sophisticated workload analysis before, and it's like this all the time.
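To put a number on how wildly that ratio swings, here's a back-of-envelope sketch (my own toy numbers, not from any real workload): a triangle's projected area falls off with the square of its distance, so the pixels-per-triangle ratio of a fixed mesh spans orders of magnitude as the camera moves.

```python
# A toy sketch (my own numbers, no real workload behind them): a triangle's
# projected area falls off with the square of its distance, so the
# pixels-per-triangle ratio of a fixed mesh spans orders of magnitude as
# the camera moves.
def pixels_per_triangle(world_area, distance, screen_height=720):
    # Projected linear size scales as 1/distance, so area scales as 1/distance^2.
    return world_area * (screen_height / distance) ** 2

for d in (1, 4, 16, 64, 256):
    print(f"distance {d:>3}: ~{pixels_per_triangle(0.01, d):8.2f} px/triangle")
```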

There are also lots of vertices without any pixels. For any character or object you draw (with the exception of some things like terrain), about half the triangles are backfaces and get culled. Of the remaining polygons, some are off-screen, as you can't spend CPU time getting rid of every polygon outside the viewing frustum.

You think texturing is going to disappear? For any texture lookup, one of RSX's ALUs is occupied. Consider a shader with 8 texture instructions and 20 vector math instructions. Xenos is limited by its texture units, so it outputs 2 pixels per clock. RSX will output 24*2/28 = 1.7 per clock, assuming perfect dual issue. Note that some of the math units in Xenos are idle, so it's not an ideal case for it. If I really wanted to cherry-pick, I could give an example where Xenos is over 10 times as fast. I don't know of any situation where the converse is true.
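Worked through explicitly (my reading of the arithmetic above, taking 16 texture units for Xenos and two ALU issue slots per RSX pipe as the assumptions):

```python
# The arithmetic above, worked through (my reading: Xenos has 16 texture
# units decoupled from its ALUs; each of RSX's 24 pipes dual-issues across
# 2 ALU slots, one of which is consumed per texture fetch).
tex_ops, math_ops = 8, 20

xenos_tmus = 16
xenos_px_per_clk = xenos_tmus / tex_ops              # texture-limited: 2.0

rsx_pipes, rsx_issue = 24, 2
rsx_slots_per_px = tex_ops + math_ops                # tex eats an ALU slot
rsx_px_per_clk = rsx_pipes * rsx_issue / rsx_slots_per_px  # 48/28 ~= 1.71

print(f"Xenos: {xenos_px_per_clk:.2f} px/clk, RSX: {rsx_px_per_clk:.2f} px/clk")
```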

As for vertex shading consuming resources, consider a short 10-instruction pixel shader operating on a small 10 pixel by 10 pixel rectangle, with a simple transformation shader on the vertices. That's 1000 PS instructions versus 8 VS instructions for two vertices (in fact you could get away with one vertex per quadrilateral). 99% of the time is spent in pixel shading. Vertex shading will barely make a dent in the shading power available for pixel shading.
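The same calculation, spelled out (the post's own numbers; the 4-instructions-per-vertex transform is my assumption to arrive at the quoted 8):

```python
# The quad example, spelled out (the post's numbers; the 4-instruction
# transform per vertex is my assumption to arrive at the quoted 8).
pixels = 10 * 10
ps_cost = pixels * 10      # 10-instruction pixel shader -> 1000 slots
vs_cost = 2 * 4            # two vertices through a simple transform -> 8 slots
print(f"pixel shading share: {ps_cost / (ps_cost + vs_cost):.1%}")  # ~99.2%
```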

Xenos doesn't have handicapped pixel shading like a bunch of you guys are pretending. Yes, it's a bigger die size in total, but there are lots of new rendering possibilities with it.
To make the argument for a conventional architecture: it is anticipated that the bulk of the work will be in pixel shading, hence the ratio they have. Certainly, I imagine they have allowed for (what they feel is) enough vertex shaders to avoid the pixel shaders stalling in most situations they have envisioned. That should have a more predictable effect than anticipating how a unified architecture is going to organise itself at every instant. The power is there, and in a closed box it can be exploited predictably as the developer sees fit. If more vertex work is momentarily required than anticipated, would it constitute a significant proportion of the total work for an entire scene?

I am unconvinced that the RSX ALUs spend all their time texturing. That is certainly not what the engineers would have thought, hence their desire to kill two birds with one stone. It is a cost-saving measure, but my own view is that if they didn't use the texture lookup unit as a fragment ALU, there would be fewer ALUs for pixel work. While it is not texturing, we have a full complement of pixel ALUs (I may be wrong).

If one were to consider the total number of ALUs on both designs, they are comparable. Yet the transistor count is roughly 250 million (?) for Xenos' primary die, and 300 million for RSX. What is missing in the Xenos pixel shaders?
 
onanie said:
To make the argument for a conventional architecture: it is anticipated that the bulk of the work will be in pixel shading, hence the ratio they have. Certainly, I imagine they have allowed for (what they feel is) enough vertex shaders to avoid the pixel shaders stalling in most situations they have envisioned. That should have a more predictable effect than anticipating how a unified architecture is going to organise itself at every instant. The power is there, and in a closed box it can be exploited predictably as the developer sees fit. If more vertex work is momentarily required than anticipated, would it constitute a significant proportion of the total work for an entire scene?

...
If one were to consider the total number of ALUs on both designs, they are comparable. Yet the transistor count is roughly 250 million (?) for Xenos' primary die, and 300 million for RSX. What is missing in the Xenos pixel shaders?

predicate said:
It seems to me that you're looking at this from a very PC-centric point of view, but actually, console developers have to be a much more crafty bunch if they want their games to look good.

The highlighted bit and the second quote are basically where I think this thread stems from. Just how much sense does a USA make in a closed box where everything is defined and power can be exploited? What tradeoffs have been made, according to the numbers (if they haven't specified, try using your imagination like you have done with predicting the possibilities ;))?
 
Onanie, to be fair, if Xenos didn't have the eDRAM, its transistor count would 'automatically' go up due to a logic redistribution, as some of the functions that would otherwise be performed on the main die are presently handled by the daughter die. And beyond that, assuming some transistors on RSX are devoted to 'funky' logic to allow it to communicate within the system, we're probably dealing with differences small enough that one can't point to transistor counts alone as a sign of which is better. It is a new architectural paradigm after all (the USA), whether one agrees with the decision or not, and it probably can't be compared transistor for transistor with a discrete part. Hell, even 'normal' GPUs from ATI and NVidia have some pretty wild swings in transistor/performance comparisons.

That said, Mintmaster, I hear you on the vertex throughput of Xenos; no doubt it's true. But at the same time, sustained vertex operations on Xenos subtract from total sustained pixel throughput. I know that it can essentially load balance itself to get whatever job it needs to get done in any given second, but I don't think it's in devs' interests to go *that* vertex crazy with Xenos as you're suggesting they have the ability to do, as one - or heaven forbid two - shader arrays more or less 'permanently' tied up in vertex operations would have some very serious consequences elsewhere. Granted, that would represent an insane amount of vertices, but I'm just saying that even though Xenos *can* crank out the vertices, I'm not sure if or how far past RSX's theoretical limits devs would be willing to take Xenos on a regular basis. I may be being overly conservative/cautious with this estimate, though.
 
nAo said:
If you play with nvshaderperf (specifying g70 as the profile architecture) you can see that half the ALU capacity is used during texture ops.

Interesting. Can you elaborate on that any? How useful can that ALU be while texturing? I was always under the assumption it was a total write-off.
 
Titanio said:
Interesting. Can you elaborate on that any? How useful can that ALU be while texturing? I was always under the assumption it was a total write-off.
I really can't, because I don't have other info, but what I got (or what I think I got; I might be wrong) is that in some cases you can schedule a float2 op on the same ALU that is being used to perspective-correct an interpolator.
 
nAo said:
I really can't, because I don't have other info, but what I got (or what I think I got; I might be wrong) is that in some cases you can schedule a float2 op on the same ALU that is being used to perspective-correct an interpolator.
:cool:

Jawed
 
xbdestroya said:
But at the same time sustained vertex operations on Xenos subtract from total sustained pixel throughput.
Not necessarily, and it also works in reverse, with very low vertex loads occurring during intensive pixel shading:

[Graph: vertex vs pixel shader workload varying over a single frame, from Victor Moya's shader performance analysis]


That's a single frame being rendered there - this variation in workload isn't happening gradually over a period of a second or so; it's happening hundreds of times per second, tens of times per frame.

Sadly, ATI "stole" this graph from Victor Moya:

http://personals.ac.upc.edu/vmoya/log.html

there's a nice overview of his work here:

http://personals.ac.upc.edu/vmoya/docs/micro-ShaderPerformance.ppt

and that graph comes from this:

http://personals.ac.upc.edu/vmoya/docs/vmoya-ShaderPerformance.pdf

I know that it can essentially load balance itself to get whatever job it needs to get done in any given second, but I don't think it's in devs' interests to go *that* vertex crazy with Xenos as you're suggesting they have the ability to do, as one - or heaven forbid two - shader arrays more or less 'permanently' tied up in vertex operations would have some very serious consequences elsewhere. Granted, that would represent an insane amount of vertices, but I'm just saying that even though Xenos *can* crank out the vertices, I'm not sure if or how far past RSX's theoretical limits devs would be willing to take Xenos on a regular basis. I may be being overly conservative/cautious with this estimate, though.
There are plenty of next-gen graphics rendering techniques that are devoid of pixel shading. This is where the GPU takes over tasks that were traditionally computed on the CPU.

Jawed
 
Jawed I hear what you're saying, believe me - that's what I meant essentially by the dynamic load balancing in the first place. But what's the aggregate vertex count up there in those spikes/graph? We don't know.

I'm really just speaking to this: "If Xenos lets you run a 48 instruction vertex shader at 500Mverts per second..."

Now, it does let you, but all I'm saying is you probably are going to use many, many fewer vertices than that in any normal situation. And in fact if you didn't, your pixel shading throughput would essentially be nil.

You say it works in the reverse - well of course; I'd go a step further and say 'in the reverse' is exactly how it will be working most of the time.
 
Would anyone care to explain why pixel shading and vertex shading can't (or won't) be parallelized on a closed platform? I see the sample above, but I also have to question whether that sample data takes into account a game closely targeted at a GPU with a fixed number of pixel and vertex shaders.
 
xbdestroya said:
Now, it does let you, but all I'm saying is you probably are going to use many, many fewer vertices than that in any normal situation. And in fact if you didn't, your pixel shading throughput would essentially be nil.
I think you're confusing two aspects of unified shading:
  1. it optimises the distribution of workload between vertex and pixel shader code
  2. it (in concert with streamout) allows algorithms that are traditionally employed on CPUs, algorithms that process polys/triangles, to run much, much faster on the GPU. While these "render passes" are operational, pixel shading simply isn't occurring. Do you want 100MVerts/s or 500MVerts/s performance while you do that? Bearing in mind that the faster you do these render passes, the more time you'll have left for other things?
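As a toy illustration of point 1 (entirely my own model, with made-up per-batch demands): give the same 48 ALUs to a unified scheduler and to a fixed 16/32 vertex/pixel split, then feed both the same fluctuating workload.

```python
# Entirely my own toy model, with made-up per-batch demands: give the same
# 48 ALUs to a unified scheduler and to a fixed 16/32 vertex/pixel split,
# then feed both the same wildly fluctuating workload.
import random

random.seed(1)
UNIFIED = 48
FIXED_VS, FIXED_PS = 16, 32          # the same 48 ALUs, split statically
unified_done = fixed_done = 0
for _ in range(1000):                # 1000 scheduling windows
    vs_work = random.choice([1, 2, 4, 40])    # vertex demand swings wildly
    ps_work = random.choice([40, 30, 20, 2])  # so does pixel demand
    unified_done += min(vs_work + ps_work, UNIFIED)  # pool absorbs any mix
    fixed_done += min(vs_work, FIXED_VS) + min(ps_work, FIXED_PS)

print(f"unified/fixed throughput: {unified_done / fixed_done:.2f}x")
```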
Jawed
 
Jawed said:
I think you're confusing two aspects of unified shading:
  1. it optimises the distribution of workload between vertex and pixel shader code
  2. it (in concert with streamout) allows algorithms that are traditionally employed on CPUs, algorithms that process polys/triangles, to run much, much faster on the GPU. While these "render passes" are operational, pixel shading simply isn't occurring. Do you want 100MVerts/s or 500MVerts/s performance while you do that? Bearing in mind that the faster you do these render passes, the more time you'll have left for other things?
Jawed

Well, in my original post I mentioned 'sustained' throughput rather than just spikes or per-frame distributions of modest workloads. Believe me, I'm not knocking USA; I'm just pointing out that after a certain line is crossed in vertex load, pixel potential will begin to drop. And I'm only bringing it up in the first place because Mintmaster mentioned the insane vertex potential of Xenos... but yeah, I see where he and you are coming from: the speed at which the vert operations could be performed would leave more aggregate power per second for pixel operations than otherwise available, and I was remiss to have missed that very obvious point when I first replied. It wasn't what I was speaking to originally, but it seems I may not have been speaking to what Mintmaster was saying either, and I just misinterpreted where he was going with Xenos' vertex potential.
 
onanie said:
I may be mistaken, but instead of being on the GPU, the stuff you need just sits on the eDRAM die instead. Perhaps being on a separate die means some cost savings, as you suggest.

With eDRAM and the very high bandwidth bus there, there's no need to compress anything, so everything is done uncompressed between the daughter logic and eDRAM. Without the eDRAM, the memory controller would almost certainly be significantly more complex in order to handle clients like colour and z.

Besides, there should be significant advantages to entirely removing the read-modify-write of colour and z + blending from main RAM. Those turnaround times aren't supposed to be the friendliest, nor (in all likelihood) the most efficient, things in the world.
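For a sense of scale, a rough back-of-envelope (all the numbers here are my assumptions: 500MHz GPU, 8 ROPs, 4 bytes each of colour and z per sample, 4xAA, and blending/z-test forcing a read plus a write of every sample):

```python
# Rough numbers, all of them my assumptions: 500 MHz GPU, 8 ROPs, 4 bytes
# of colour + 4 bytes of z per sample, 4xAA, and blending/z-test forcing a
# read plus a write of every sample.
rops, clock_hz, aa_samples = 8, 500e6, 4
bytes_per_sample = 4 + 4             # colour + z
rmw = 2                              # read-modify-write: each sample moves twice
peak = rops * clock_hz * aa_samples * bytes_per_sample * rmw
print(f"peak colour/z traffic: {peak / 1e9:.0f} GB/s")   # 256 GB/s
```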

Having more bandwidth for the GPU by providing the CPU with its own, I imagine, was to achieve the same goal, but perhaps not to the same degree. This is the first time I have heard of redundancy in the eDRAM daughter die.

Redundancy in memory is extremely easy to do. For the most part, where the spare capacity sits doesn't matter so much as having a sufficient quantity of it. And without so much specialization, it's easy to add in some redundancy for yields. Logic wouldn't benefit nearly as much without higher prices being paid.


Perhaps there are more ways to skin a cat. What kind of visual effects does dynamic branching allow exclusively?

What good can shaders do? What could developers possibly use these multiple render targets for? Why do we need higher instruction limits? We're good enough right here.

To make the argument for a conventional architecture: it is anticipated that the bulk of the work will be in pixel shading, hence the ratio they have. Certainly, I imagine they have allowed for (what they feel is) enough vertex shaders to avoid the pixel shaders stalling in most situations they have envisioned. That should have a more predictable effect than anticipating how a unified architecture is going to organise itself at every instant. The power is there, and in a closed box it can be exploited predictably as the developer sees fit. If more vertex work is momentarily required than anticipated, would it constitute a significant proportion of the total work for an entire scene?

Sure, they might put in enough vertex shaders to "do well enough" but still not cover the worst situation. For example, I'd imagine shadow mapping could see some significant speed-ups, if you can retrieve and write data fast enough outside the vertex shaders to feed them. And there's no pixel shading being executed there. I could be talking out my ass though, since I've never had my hands on anything to test something like that. In any case, looking at stuff such as what Jawed posted implies that "matching to the hardware," relative to the ratio of vertex and pixel shaders, is, in practice, impossible. Not only do you have changes from frame to frame, but from part of a frame to part of a frame.

That doesn't mean USA suddenly gets a 40% boost or anything. But there are going to be gains.

However, this doesn't take into account all the numerous situations where the VS load is very high and the PS load low. The argument presented that "we haven't seen it before" isn't very good. That's like saying "well, the hardware is absolutely terrible at it, yet we haven't seen it in games, thus there must be no use for it." It's because it hasn't been an option that we don't see it, not that there aren't uses. That's applicable to the other things that haven't been mentioned. And I think it's things like this that far outweigh the "efficiency" gains of Xenos. It certainly seems that many of the things the chip does have are "overlooked" because too many people get caught up in the efficiency hype and ignore the feature set it brings to the table. Why work harder when you can work smarter? And sometimes you might not need tons of vertices, but maybe you need "smarter" ones whilst at the same time not needing very complex pixel shaders.

I am unconvinced that the RSX ALUs spend all their time texturing. That is certainly not what the engineers would have thought, hence their desire to kill two birds with one stone. It is a cost-saving measure, but my own view is that if they didn't use the texture lookup unit as a fragment ALU, there would be fewer ALUs for pixel work. While it is not texturing, we have a full complement of pixel ALUs (I may be wrong).

Indeed. A generation or a couple of generations down the line, I wouldn't expect there to be many separate units covering TMUs, ALUs, and ROPs. NV has a patent, IIRC, where even more of those are consolidated into single units.

If one were to consider the total number of ALUs on both designs, they are comparable. Yet the transistor count is roughly 250 million (?) for Xenos' primary die, and 300 million for RSX. What is missing in the Xenos pixel shaders?

Well, for one, you have to consider how they count transistors. There are going to be uncertainties due to the inherently flawed nature of how they're counted (sample a few blocks to find an average number of transistors per gate, then multiply by the total gate count, was it?). And then pray that one company doesn't totally ignore certain types of transistors (i.e., look at the die sizes between ATI and NV; it makes it seem credible that ATI in fact ignores memory transistors altogether, which could be tens of millions that aren't mentioned for their chips).

predicate said:
So I think to say 'this type of code will run faster on Xenos' is a misleading statement, because developers would never run the same code on RSX.

If the code that runs faster on Xenos performs better than the option available on the other card, why not use it? If Xenos can do the same for less, even if another card uses different techniques and resources, there's the chance you're gaining speed-ups relative to that one. Should developers ignore feature advantages, or the lack of certain bottlenecks in Xenos, because it doesn't run well on the other hardware? Program to each card's strengths. And don't ignore Xenos' strengths for some random reason.
 
rounin said:
The highlighted bit and the second quote are basically where I think this thread stems from. Just how much sense does a USA make in a closed box where everything is defined and power can be exploited? What tradeoffs have been made, according to the numbers (if they haven't specified, try using your imagination like you have done with predicting the possibilities ;))?
There had to be a reason for such a different design in the Xbox 360 over what they had last gen. They could have easily stuck with the design they had last time, stuck with something tried and true. It would have been simple and probably cheaper. Just like many here have said that the inclusion of eDRAM this generation was to address the complaints devs had last gen, the USA could have been the same. If it ain't broke, don't fix it.
 
Jawed said:


Cool links. The closing remark was interesting:

The unified shader architecture shows a little performance benefit, at least for the tested trace, over the non-unified architecture but the largest gain comes from improved efficiency per area, up to 30% better.

Kind of reaffirms Dave's comments about the overhead of USA being a non-issue WRT its use in mobile platforms.

BTW did you notice reference # 5?
:smile:
 
nAo said:
On pixels; on vertices it's not that good.

It's identical on both. But I agree G7x has an advantage on verts, although Xenos also has static branching hardware with zero overhead as well.
 
TurnDragoZeroV2G said:
If the code that runs faster on Xenos performs better than the option available on the other card, why not use it? If Xenos can do the same for less, even if another card uses different techniques and resources, there's the chance you're gaining speed-ups relative to that one. Should developers ignore feature advantages, or the lack of certain bottlenecks in Xenos, because it doesn't run well on the other hardware? Program to each card's strengths. And don't ignore Xenos' strengths for some random reason.
My meaning was, the developers will find another way to do a similar effect that works better on the other hardware. There might be an image quality loss or it might not run as fast, but you're not going to write code one way if another way with near-identical results is 10x faster on that hardware.
 
Mintmaster said:
ROG27, you make all those points, but you fail to acknowledge that it could be completely the other way around.

PC games have pretty low polygon counts. It's not that devs really want this, but they need the game to run well on lower-end and previous-gen cards. Simply reducing the resolution will assist weaker pixel shaders, but it doesn't reduce the polygon count. If Xenos lets you run a 48-instruction vertex shader at 500Mverts per second, then they'll be free to increase the polygon counts.
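For reference, that figure falls straight out of the clock and the ALU count (assuming, and this is my simplification, that all 48 unified ALUs issue one vertex-shader instruction per clock at 500MHz):

```python
# Assuming (my simplification) all 48 unified ALUs issue one vertex-shader
# instruction per clock at 500 MHz:
alus, clock_hz, vs_length = 48, 500e6, 48
print(f"{alus * clock_hz / vs_length / 1e6:.0f} Mverts/s")  # 500
```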

Games don't use displacement mapping because even though many cards support vertex texture fetch, they do it veerrrry slowly, and so it isn't worth the space you can save with displacement mapping. There are plenty of other uses for VTF also.

The same goes for dynamic branching. I personally think ATI made a mistake in spending so much die space on this feature for the PC space, because it'll probably be 2007 before any game uses it, simply due to the install base. But on a closed platform you can use these things extensively. And it's not 15-20%; it's 2x to 10x, depending on the effect.
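A crude model of where a figure in that range can come from (illustrative costs and branch rates of my own choosing): if a branch lets most pixels skip an expensive path, hardware without usable branching pays that path everywhere.

```python
# Illustrative costs and branch rates of my own choosing: a branch lets 90%
# of pixels skip an expensive path. Without usable dynamic branching you pay
# the expensive path everywhere; with it you pay only where taken (batch
# divergence ignored for simplicity).
cheap, expensive = 4, 100    # instruction costs per pixel (made up)
taken = 0.10                 # fraction of pixels needing the expensive path

no_branch = cheap + expensive                 # every pixel runs both paths
with_branch = cheap + taken * expensive
print(f"speedup: {no_branch / with_branch:.1f}x")   # ~7.4x
```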

Right now we have stencil shadows and PCF shadow maps, for which NVidia devoted plenty of hardware acceleration, but in the future none of these will be used. Instead, Variance Shadow Maps (co-developed by AndyTX of these forums) will give us fast, pretty, and mostly artifact-free shadows. And Xenos has a feature that increases their usability.

Xenos is not about making a dev relearn how to use a GPU; it's about opening new doors. For the most part you can use it exactly the same as a PC GPU. It's the same way the super-powerful CPUs in these consoles, especially CELL, can let you do new things (except there are plenty more headaches there than with the GPUs).

I believe that while Xenos is based on an architecture which will be useful in future, more refined iterations, this instance of said architecture will have traded away too much performance to substantiate its rich flexibility in features. The reliability of this new architecture is also not paying off, in my opinion. There is a high manufacturing defect rate for Xenos in the X360, some think higher than 10%.

I'm asking the question: "Has Xenos been built with tech in mind, first and foremost... or rather, with the developer's needs in mind, then business, and finally technical limitations?" From a smart, goal-directed design viewpoint, the latter question should have been asked. Why? Because of the context of what a console is. It is a closed-box environment which will stay constant for the next 5 years, and it should balance the needs of users, then business, and finally technical architecture. Function first (from several perspectives), then form. Why use a console as a guinea pig for graphics technology? The technology in a console needs to work reliably.

Xenos's architecture hasn't proved anything to me yet. As far as I'm concerned, it's another SM 3.0 part with a new architecture that has a lot of potential in the PC space in future iterations. It may have some gimmicky features, but it's just that... outputting another pretty picture that moves and interacts the same way games did last generation. If this new generation isn't going to look significantly different from the last because of the technical limitations of HD, then it sure as heck better play differently (as far as the consumer is concerned). I foresee most gameplay innovations being limited by the CPU. So why not experiment technologically in that area?

I think Xenos and RSX are moot points... more of the focus should be on what CELL vs Xenon brings to the table for developers and consumers. GPU-centric development ideology is the road to uncanny-valley-ville. More balance is required at the system level.
 
TurnDragoZeroV2G said:
With eDRAM and the very high bandwidth bus there, there's no need to compress anything. So everything is done uncompressed between daughter logic and eDRAM. Memory controller almost certainly would be significantly more complex to handle such a task as color and z.
I don't think I need to remind you that the bandwidth between the parent and daughter die is 32GB/s. Compressed data doesn't magically uncompress after crossing that bridge.
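For what it's worth, 32GB/s is about what uncompressed per-pixel colour and z traffic at peak fill rate works out to, if (and this is my assumption about the design's intent) the daughter die is the one expanding pixels into their 4xAA samples:

```python
# My assumptions: 8 pixels per clock at 500 MHz, 4 bytes colour + 4 bytes z
# per pixel, with the daughter die expanding each pixel into its 4xAA
# samples on the far side of the link.
pixels_per_clk, clock_hz = 8, 500e6
bytes_per_pixel = 4 + 4
link_traffic = pixels_per_clk * clock_hz * bytes_per_pixel
print(f"link traffic at peak fill: {link_traffic / 1e9:.0f} GB/s")  # 32
```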

Redundancy in memory is extremely easy to do. For the most part, where it is doesn't matter so much as a sufficient quantity is there. And without so much specialization, it's easy to add in some redundancy for yields. Logic wouldn't benefit nearly as much without higher prices to be paid.
After all that, I must still ask - does Xenos' eDRAM have redundancy?

What good can shaders do? What could developers possibly use these multiple render targets for? Why do we need higher instruction limits? We're good enough right here.
Getting rather philosophical, no? My question originally was "What kind of visual effects does dynamic branching allow exclusively?"

That doesn't mean USA suddenly gets a 40% boost or anything. But there are going to be gains.

However, this doesn't take into account all the numerous situations where the VS load is very high and the PS load low. The argument presented that "we haven't seen it before" isn't very good. That's like saying "well, the hardware is absolutely terrible at it, yet we haven't seen it in games, thus there must be no use for it." It's because it hasn't been an option that we don't see it, not that there aren't uses. That's applicable to the other things that haven't been mentioned. And I think it's things like this that far outweigh the "efficiency" gains of Xenos. It certainly seems that many of the things the chip does have are "overlooked" because too many people get caught up in the efficiency hype and ignore the feature set it brings to the table. Why work harder when you can work smarter? And sometimes you might not need tons of vertices, but maybe you need "smarter" ones whilst at the same time not needing very complex pixel shaders.
You would need to demonstrate the "numerous" situations where the VS load is "very high". While you might accuse one of saying (and I did not) that "we haven't seen it in games, thus there must be no use for it", is it not technically right to respond that in the months the Xenos hardware has been available to developers, "it" has not been used where it is available? While you would say that "it's because it hasn't been an option that we don't see it, not that there aren't uses", I might just ask again what I have asked before: "show me".

If the code that runs faster on Xenos performs better than the option available on the other card, why not use it? If Xenos can do the same for less, even if another card uses different techniques and resources, there's the chance you're gaining speed-ups relative to that one. Should developers ignore feature advantages, or the lack of certain bottlenecks in Xenos, because it doesn't run well on the other hardware? Program to each card's strengths. And don't ignore Xenos' strengths for some random reason.
With any particular method, its advantage might be real if other methods for achieving the same result prove less effective. To produce the same effect, a different kind of code might run just as well on different hardware.
 
Can we get back to the more relevant issue and go back to discussing flop counts between architectures? ;) All kidding aside, I do have a question. Back when everyone was counting flops, most RSX counts included the free fp16 normalize to substantiate Sony's marketing numbers (right, wrong, or indifferent).

Can someone tell me what the following means? It was taken from Dave's Xenos article but is seen in all the block diagrams, etc. It's not clear to me what the PS input interpolators do. Are they DX7-style register combiners? How are they typically used? Should they contribute to the flop count? Does G71 have the same ability? Anything?

Additional to the 48 ALU's is specific logic that performs all the pixel shader interpolation calculations, which ATI suggests equates to about an extra 33% of pixel shader computational capability.
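For anyone wondering what those interpolators actually compute, here's a minimal sketch of perspective-correct attribute interpolation in plain Python (illustrative only, obviously not Xenos's actual circuit); one evaluation like this is needed per interpolated input, per pixel, so doing it on the main ALUs would eat shader throughput:

```python
# Minimal sketch of perspective-correct interpolation (illustrative only,
# obviously not Xenos's actual circuit). One evaluation like this is needed
# per interpolated input, per pixel; doing it on the main ALUs would eat
# shader throughput, which is presumably where the "extra 33%" framing
# comes from.
def interpolate(attr, inv_w, bary):
    # attr: per-vertex attribute values; inv_w: per-vertex 1/w;
    # bary: barycentric weights of the pixel inside the triangle.
    num = sum(b * a * iw for b, a, iw in zip(bary, attr, inv_w))
    den = sum(b * iw for b, iw in zip(bary, inv_w))
    return num / den

print(interpolate(attr=(0.0, 1.0, 0.5),
                  inv_w=(1.0, 0.5, 0.25),
                  bary=(0.2, 0.3, 0.5)))   # ~0.447
```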
 