A primer on the X360 shader ALUs as I understand it

liolio said:
It might be a dumb question, but can someone explain to me what SOA and AOS are?

It's a way of organizing your data in memory before you send it to the SIMD unit. Look at these examples to better understand:

Code:
//AOS style

struct coordinate
{
     float x;
     float y;
     float z;
     float w;
};
coordinate vertices[NUMBER_OF_VERTICES];

Code:
//SOA style

struct coordinate
{
    float x[NUMBER_OF_VERTICES];
    float y[NUMBER_OF_VERTICES];
    float z[NUMBER_OF_VERTICES];
    float w[NUMBER_OF_VERTICES];
};

coordinate vertices;

As you can see, AOS is more intuitive, but SOA is typically more efficient because it involves only vertical ops and no horizontal ops.
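
For illustration, a minimal sketch of what those vertical ops look like with the SOA layout above. SSE intrinsics are used purely as a stand-in (VMX/AltiVec has equivalent vector loads and arithmetic), and scale_offset_x is a made-up example transform:

Code:
// Sketch only: applies x' = x * scale + offset to every vertex's x component,
// four at a time. Assumes count is a multiple of 4 and x is 16-byte aligned.
#include <xmmintrin.h>

void scale_offset_x(float *x, int count, float scale, float offset)
{
    __m128 vs = _mm_set1_ps(scale);
    __m128 vo = _mm_set1_ps(offset);
    for (int i = 0; i < count; i += 4)
    {
        __m128 vx = _mm_load_ps(&x[i]);          // 4 contiguous x components, one load
        vx = _mm_add_ps(_mm_mul_ps(vx, vs), vo); // purely vertical math across 4 vertices
        _mm_store_ps(&x[i], vx);
    }
}

With the AOS layout, the x components of four vertices would first have to be gathered out of four xyzw records (shuffles/permutes, i.e. horizontal work) before any of that math could start.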

Can memexport help in some way with some of the functionalities nAo and Fafalada speak about?

More likely the tessellation engine inside Xenos, but memexport can also be helpful for other things.
 
Debating an off-topic subject is not accepted; disproving it in two lines is fine, but it's best to STAY ON TOPIC (or very close to it).
This thread is good, and I intend it to stay at this level of quality.

[edit]
Looks like I skipped two nasty posts in my first clean-up, sorry for that, people ^^
(It might not have gone bad again otherwise <wishful thinking>)
 
Originally Posted by Fafalada

Well nAo listed a lot of it - basically, being able to access the entire topology of your primitives opens up a whole new world that the VS can't reach, or can only reach with excessive performance and memory overhead.
Deletion of primitives is an interesting area for optimizations as well - from avoiding entire batches being sent over to the GPU, to culling stuff at the primitive level that only wastes GPU vertex setup (backfacing, degenerate, etc. primitives)

Can someone make an approximation of what order of magnitude (performance-wise) we can expect between something like 4 VMX units (2 cores on this job, one busy with other kinds of code) and the current number of SPEs used for this job?
 
liolio said:
Can someone make an approximation of what order of magnitude (performance-wise) we can expect between something like 4 VMX units (2 cores on this job, one busy with other kinds of code) and the current number of SPEs used for this job?
There are 3 VMX units on Xbox 360, not 6. BTW, what's the current number of SPEs used for this (?) job? You can use all the SPEs you want, as you desire; whether it makes sense is another matter.
 
xbdestroya said:
Mr Wibble is a PS3 dev also, and I don't believe Mintmaster is a game dev at all (correct me if I'm wrong Mint!).
Don't worry, you're absolutely right.

I've never given the impression that I am, have I? My reliance on calculations and comparisons should tell you I'm making educated guesses.

I worked at ATI a while ago, and later with a games startup (PC only at the time), but that was tragically cut short due to various factors. Very smart people doing very good work, and when the CEO was forced to go work for an already established dev, he saw their advanced work mirror our 3D engine. Sigh...

Now I'm in a very different field, but still love 3D graphics.
 
nAo said:
I won't go into every specific number you quoted, but let's just say that in any discipline you can't analyze a complex system by using one or two quantities.
A GPU can be bottlenecked in so many ways that you really have no idea about.. trust me :)
Indeed. You need a pretty thorough background in 3D graphics, both in software and hardware, to be able to do any meaningful comparison of GPUs, especially when they're so radically different.
 
Sorry for my poor English.
OK, I'll try to do better.
For that part of your answer:

I was speaking of the quite math-heavy calculations Fafalada and you were speaking of earlier.
nAo
Interesting; I'm not very familiar with DX9 shaders. What kind of things can the SPE do better in terms of vertex work compared to a VS?
What about destroying and creating vertices, doing any kind of operation that involves topology information, culling chunks of vertices in a row, decompressing geometry encoded with exotic bit formats, and bla bla bla..
fafalada
Deletion of primitives is an interesting area for optimizations as well - from avoiding entire batches being sent over to the GPU, to culling stuff at the primitive level that only wastes GPU vertex setup (backfacing, degenerate, etc. primitives)
My question is more like: can this be done using Xenon's VMX units nearly as effectively as it's performed by the SPEs? (I don't know if this is better, lol.)

For this:
the issue of GPU understanding SoA, or free conversion between AoS and SoA (there's one console CPU that does this but it's not in a desktop machine).
I've read the old Ars Technica article on Xenon. It seems optimized for AoS but can do SoA too, though AoS seems to be a better use of the large number of registers. Which CPU is Fafalada speaking about?
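
For illustration only, here's a rough sketch of what an AoS-to-SoA conversion looks like, again written with SSE intrinsics rather than VMX128 permutes (which CPU does it "for free" I'll leave to Fafalada):

Code:
// Sketch only: converts a block of 4 AoS vertices (x0 y0 z0 w0 ... x3 y3 z3 w3)
// into SoA form using the _MM_TRANSPOSE4_PS helper; on VMX/AltiVec the same
// thing would be done with permute/merge instructions.
#include <xmmintrin.h>

void aos_to_soa_block(const float aos[16],
                      float x[4], float y[4], float z[4], float w[4])
{
    __m128 r0 = _mm_loadu_ps(&aos[0]);   // x0 y0 z0 w0
    __m128 r1 = _mm_loadu_ps(&aos[4]);   // x1 y1 z1 w1
    __m128 r2 = _mm_loadu_ps(&aos[8]);   // x2 y2 z2 w2
    __m128 r3 = _mm_loadu_ps(&aos[12]);  // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // rows become x0..x3, y0..y3, z0..z3, w0..w3
    _mm_storeu_ps(x, r0);
    _mm_storeu_ps(y, r1);
    _mm_storeu_ps(z, r2);
    _mm_storeu_ps(w, r3);
}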

For this part.
nAo
You can use all the SPEs you want to as you desire, whether it makes sense is another matter
I understand that Cell can use all its SPEs for these kinds of calculations because the PPE is there, but I guess there is no need and that the SPEs do quite well.
Anyway, I don't want you to waste your time; my knowledge of matrix math is old and poor. A trivial response will be good enough ;)
Anyway, thanks for your first response ;)
 
Fafalada said:
Well nAo listed a lot of it - basically, being able to access the entire topology of your primitives opens up a whole new world that the VS can't reach, or can only reach with excessive performance and memory overhead.
Deletion of primitives is an interesting area for optimizations as well - from avoiding entire batches being sent over to the GPU, to culling stuff at the primitive level that only wastes GPU vertex setup (backfacing, degenerate, etc. primitives)
Deleting batches is something the CPU should do anyway, so that doesn't seem to be related to the VS to me. Except for rigid objects, all culling must be done post transform, so I don't see how you can save GPU cycles.

Well, on second thought, I guess rigid transforms are pretty common for non-character batches, so you have a point. It could be as simple as a subtract and a dot product per triangle. Still, once you take into account all the index list manipulation, compares, etc., it would really surprise me if you could exceed RSX's setup rate without hogging most of the SPUs.
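
For illustration, a minimal sketch of that rigid-object cull: roughly a subtract and a dot product per triangle, plus the index-list rebuild. The precomputed face normals, the camera position already transformed into object space, and 16-bit indices are all assumptions, not something stated above:

Code:
// Sketch only: drops backfacing triangles from an index list on the CPU.
// Assumes per-triangle object-space face normals are precomputed and the
// camera position has been transformed into object space (valid for rigid,
// non-skinned objects).
struct Vec3 { float x, y, z; };

static float dot3(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  sub3(Vec3 a, Vec3 b) { Vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }

// Writes only front-facing triangles' indices to outIndices; returns the new index count.
int cull_backfaces(const Vec3 *positions, const unsigned short *indices, int triCount,
                   const Vec3 *faceNormals, Vec3 camObjSpace, unsigned short *outIndices)
{
    int out = 0;
    for (int t = 0; t < triCount; ++t)
    {
        Vec3 toCam = sub3(camObjSpace, positions[indices[3*t]]); // the subtract
        if (dot3(faceNormals[t], toCam) > 0.0f)                  // the dot product
        {
            outIndices[out++] = indices[3*t + 0];
            outIndices[out++] = indices[3*t + 1];
            outIndices[out++] = indices[3*t + 2];
        }
        // backfacing triangles are simply left out of the rebuilt index list
    }
    return out;
}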

I think nAo is right on this one. Not many people are going to use Cell for vertex shading or anything that RSX's geometry-related silicon can do. For stuff it can't do, like silhouette detection, custom tessellation routines, etc., it makes sense. I fully expect all vertex data to be held in XDR, and Cell would be great at data amplification if the decompression algorithm can't be mapped to the vertex shader's abilities.
 
Mintmaster said:
Deleting batches is something the CPU should do anyway, so that doesn't seem to be related to the VS to me.
I wasn't talking about batches that big, and I was also thinking more in terms of generated geometry (procedural, HOS or some other form of tessellation), not statically stored stuff. Before this generation started I was already a believer that we should abandon the latter concept altogether; it served its purpose and it's time for it to go away (granted, the PC market will not be ready for such a move for quite a bit longer).

Except for rigid objects, all culling must be done post transform, so I don't see how you can save GPU cycles.
Why? Unless I desperately need to do skinning/morphing on the GPU, there's basically no reason NOT to cull on the SPE side if I find it a performance benefit.

Still, once you take into account all the index list manipulation, compares, etc., it would really surprise me if you could exceed RSX's setup rate without hogging most of the SPUs.
Meh, the new console GPUs have a relatively low setup rate (compared to the last 10 years of consoles), so exceeding it is far less of a challenge than you might think.

I think nAo is right on this one.
He is quite right - but his reasons for saying that are different from what is discussed in this thread. But that aside - yeah, it's obvious that doing just the basic VS processing on them is a waste of SPEs when they are capable of so much more.
 
Fafalada said:
Meh, the new console GPUs have a relatively low setup rate (compared to the last 10 years of consoles), so exceeding it is far less of a challenge than you might think.
Never a truer word spoken... CPU work is best spent on methods to reduce geometry before the vertex shader in the pipeline: things like packet-level triangle culling, HOS, etc.
 
So can we safely assume that at least 1 of the SPEs will be dedicated to vertex processing to alleviate a potentially vertex-bound GPU?

Does anyone think that the pipeline setup for the RSX has been altered for improved pixel shading power, while sacrificing vertex shading power, because of the assistance that the Cell can provide?
 
ROG27 said:
Factor SPEs in (sharing in vertex work) and Xenos may actually be at a disadvantage in this area. I mean, this would technically only be for vertex-burst situations, which are much rarer than the demand for pixel shaders. For graphics, I think you need to look at the system collectively (CPU/GPU) for PS3...at least much more so than for X360. In this way PS3 could sustain 48 pixel shader ALUs and 8 vertex shader ALUs plus SPE vertex assistance at all times, while Xenos constantly needs to dynamically load-balance between vertex and pixel work in a pool of 48 ALUs total.

The external interactions between CELL and RSX will be just as important as what's going on inside the RSX, IMO. Maybe even more so to the system as a whole.

I don't think you can reasonably make that statement, because the things you just listed are already happening on the Xbox 360. I remember reading that devs are using the XCPU for vertex work and leaving Xenos to solely handle pixels. Not sure if this was due to rushed launch software or what, but similar means to an end can be used on the 360 as they can on the PS3.
 
Hardknock said:
I don't think you can reasonably make that statement, because the things you just listed are already happening on the Xbox 360. I remember reading that devs are using the XCPU for vertex work and leaving Xenos to solely handle pixels. Not sure if this was due to rushed launch software or what, but similar means to an end can be used on the 360 as they can on the PS3.

I was under the impression that this was never confirmed; just nVidia PR talk playing down the unified shader design in Xenos as not all that hot due to load-balancing issues, or something along those lines. Anyway, I think it was just PR talk.
 
maybe the lines between classifying a shader as pixel or vertex are blurry :?:

or maybe there just aren't that many vertex shader programs with these launch games and this is exactly why ATI are so eager to move to a USA.

...brain...deteriorating after 7 hours staring at a computer screen
 
Mintmaster said:
On the other hand, render to vertex buffer might negate some of that advantage, assuming it's implemented in PS3.

Given that it's been available in OpenGL on NVidia hardware for a while now (EXT_pixel_buffer_object) and accelerated by hardware (not a software driver copy), I don't see why not. NVidia claims that the extension allows render to vertex array to run in hardware (as opposed to past hacks which resulted in copies).
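
For reference, a rough sketch of the usual render-to-vertex-array pattern with a pixel buffer object. Core GL 1.5 / PBO token names are used here (the EXT_pixel_buffer_object tokens are the _EXT-suffixed equivalents), and an extension loader such as GLEW is assumed:

Code:
// Sketch only: read rendered positions into a buffer object, then draw from it.
#include <GL/glew.h>

GLuint readback_positions_to_pbo(int width, int height)
{
    GLuint pbo;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER,
                 width * height * 4 * sizeof(GLfloat), NULL, GL_STREAM_COPY);

    // With a pack PBO bound, glReadPixels writes into the buffer object
    // (offset 0) instead of client memory, so there is no CPU round trip.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, (void *)0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return pbo;
}

void draw_from_pbo(GLuint pbo, int vertexCount)
{
    // Rebind the same buffer as a vertex array and source geometry from it.
    glBindBuffer(GL_ARRAY_BUFFER, pbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(4, GL_FLOAT, 0, (void *)0);
    glDrawArrays(GL_POINTS, 0, vertexCount);
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}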
 