X360 vs PS3 GPU power/ALU's etc

Bill

I have several questions, numbered below, that have been bothering me.

1: ALUs. I have heard Xenos's ALUs described in a few places as more powerful. The basic idea is that they work on "vec4+scalar" while RSX works on "vec3+scalar". To me this seems to mean Xenos's ALUs are 20% more powerful. However, different articles say different things, and it seems very hard to pin down. Is this true?

2: Mini-ALUs. This is a key difference. While 7800 GTX/RSX has 56 ALUs (8 vertex, 48 pixel shader) to 48 for Xenos, not a big difference, the RSX/7800 also has 48 mini-ALUs. It seems these help, but it's hard to say how much or even what they do. Anybody care to put, say, a percentage in broad terms on it? Is a mini-ALU "worth" roughly, say, ten percent of a major one?


3: Tessellation/interpolation. Xenos has these in hardware. Are they good? I thought tessellation was something necessary for GPU operation, but apparently it's more of an optional "effect"? Also, does the 7800 have dedicated interpolation hardware?

4: (particularly for Jawed) Does Xenos use an advanced scheduler like R520's, to further drive efficiency apart from the unified shaders? How does this work? Will it be a huge advantage?
 
Bill said:
3: Tessellation/interpolation. Xenos has these in hardware. Are they good? I thought tessellation was something necessary for GPU operation, but apparently it's more of an optional "effect"? Also, does the 7800 have dedicated interpolation hardware?

Tessellators certainly aren't necessary for GPU operation; the vast majority of GPUs don't have them. Tessellation is usually done on the CPU. Xenos has a fixed-function tessellator.

Bill said:
4: (particularly for Jawed) Does Xenos use an advanced scheduler like R520's, to further drive efficiency apart from the unified shaders? How does this work? Will it be a huge advantage?

I believe it is "threaded" if that's what you mean, 64 threads. It'd work in roughly the same way as the R520 setup, I imagine - just switch between threads depending on resource utilisation and whether a thread can execute or not.
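To make the thread-switching idea above a bit more concrete, here is a toy sketch in Python. It is not how Xenos or R520 actually schedules work: the `Thread` class, the `run` function and the 4-cycle `TEX_LATENCY` are all made up for illustration. The only point is that with more threads in flight, the ALU has something to do while other threads wait on texture fetches.

```python
from dataclasses import dataclass

TEX_LATENCY = 4  # assumed texture-fetch latency in cycles (illustrative only)


@dataclass
class Thread:
    name: str
    ops: list          # sequence of "alu" or "tex" instructions
    pc: int = 0        # index of the next instruction to issue
    ready_at: int = 0  # first cycle at which this thread may issue again


def run(threads):
    """Each cycle, issue one instruction from the first thread that can execute.

    Returns (total_cycles, cycles_in_which_the_ALU_did_work).
    """
    cycle, alu_busy = 0, 0
    while any(t.pc < len(t.ops) for t in threads):
        for t in threads:
            if t.pc < len(t.ops) and t.ready_at <= cycle:
                op = t.ops[t.pc]
                t.pc += 1
                if op == "tex":
                    # Thread sleeps until its texture data comes back.
                    t.ready_at = cycle + TEX_LATENCY
                else:
                    alu_busy += 1
                    t.ready_at = cycle + 1
                break  # only one issue slot per cycle in this toy model
        cycle += 1
    return cycle, alu_busy


if __name__ == "__main__":
    shader = ["tex", "alu", "alu"]
    for n in (1, 4):
        cycles, busy = run([Thread(f"t{i}", list(shader)) for i in range(n)])
        print(f"{n} thread(s): ALU busy {busy} of {cycles} cycles")
```

With one thread the ALU is busy 2 cycles out of 6; with four threads running the same little program it is busy 8 out of 12, because the scheduler fills the texture waits with work from other threads.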
 
Bill said:
I have several questions, numbered below, that have been bothering me.
Hi Bill. Since Xenos is a console GPU and we don't have a platform on which to test it comparatively, some of these questions are hard to answer.

1: ALUs. I have heard Xenos's ALUs described in a few places as more powerful. The basic idea is that they work on "vec4+scalar" while RSX works on "vec3+scalar". To me this seems to mean Xenos's ALUs are 20% more powerful. However, different articles say different things, and it seems very hard to pin down. Is this true?
First, I believe RSX does 4D+Scalar. Check this CHART.

More importantly: Architecture.

Xenos and RSX are really different in how they approach things. Xenos has decoupled the TMUs and has an advanced scheduler similar to R520's. Dave's R520 article should shed some light on this, as should his Xenos article. Of course, Xenos goes one step further with a Unified Shader Architecture. So, based on the little we know about RSX, the two GPUs really go about their tasks differently.

It is not as easy as comparing ALU performance alone, especially with Xenos, because its goal is utilization/efficiency above anything else.

2: Mini-ALUs. This is a key difference. While 7800 GTX/RSX has 56 ALUs (8 vertex, 48 pixel shader) to 48 for Xenos, not a big difference, the RSX/7800 also has 48 mini-ALUs. It seems these help, but it's hard to say how much or even what they do. Anybody care to put, say, a percentage in broad terms on it? Is a mini-ALU "worth" roughly, say, ten percent of a major one?
Depends on your code and situation. I am not sure anyone can give you a %.

In general the RSX pipes are pretty robust; the trick is utilization. While the mini-ALU is a boost when the GPU needs it, from what I can gather from the discussion on G70, it is near impossible to guarantee its use 100% of the time.

Xenos ditched the pipes and went with a shader array with dynamic scheduling and load balancing. It really has no need for the mini-ALUs because its goal is to saturate the ALUs every cycle. Looking at R520, we can see it went with a similar approach: less complicated Shader Pipes but better scheduling.

They are just different approaches to the same problem. NV has a robust pipe; ATI seems to be gearing toward utilization. Basically they are spending their transistor budgets in different places.

3: Tessellation/interpolation. Xenos has these in hardware. Are they good? I thought tessellation was something necessary for GPU operation, but apparently it's more of an optional "effect"? Also, does the 7800 have dedicated interpolation hardware?
I am not sure "how good" Xenos is at tessellation. This we know: it is a two-cycle process and therefore can churn out 250M poly/s (compared to 500M triangle setup). So far the King Kong game is using this feature with nary a performance hit, so it must not be too bad.

Only time will tell how good this feature is.
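For anyone wondering where the 250M figure above comes from, here is the arithmetic as a quick Python sketch. It assumes a 500 MHz core clock for Xenos, which is what the 500M triangle-setup number implies at one triangle per clock; the function name is just for illustration.

```python
XENOS_CLOCK_HZ = 500_000_000  # assumed core clock: 500 MHz


def items_per_second(clock_hz, cycles_per_item):
    """Peak throughput of a unit that needs `cycles_per_item` clocks per item."""
    return clock_hz / cycles_per_item


# Tessellator: two cycles per polygon, as stated in the post above.
tessellated = items_per_second(XENOS_CLOCK_HZ, 2)
# Triangle setup: one triangle per cycle.
setup = items_per_second(XENOS_CLOCK_HZ, 1)

print(f"tessellation:   {tessellated / 1e6:.0f}M poly/s")
print(f"triangle setup: {setup / 1e6:.0f}M tri/s")
```

These are peak rates only; they say nothing about what a real game will sustain.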

4: (particularly for Jawed) Does Xenos use an advanced scheduler like R520's, to further drive efficiency apart from the unified shaders? How does this work? Will it be a huge advantage?
Yes, it uses an advanced scheduler; it is probably better because it has shared resources AND does load balancing, so it should provide more bang for the buck. Dave's article discusses this to a degree.

Is it a huge advantage? Compared to what? R520 seems to be a good half step toward Xenos in many ways. Obviously consolidating the scheduler could be a silicon win, and being able to delegate tasks between PS and VS depending on need is a big plus as well.

I am sure Jawed and others can provide some more info.
 
OK, I'll have a quick go. There's an awful lot of threads that already cover this kind of stuff.

Bill said:
1: ALUs. I have heard Xenos's ALUs described in a few places as more powerful. The basic idea is that they work on "vec4+scalar" while RSX works on "vec3+scalar". To me this seems to mean Xenos's ALUs are 20% more powerful. However, different articles say different things, and it seems very hard to pin down. Is this true?
Xenos's primary shader ALUs have to be Vec4+scalar, because they serve for both vertex programs and pixel programs. Vertex programs normally work with a Vec4+scalar architecture.

It's arguable that the Vec4+scalar architecture will be sorely wasted when running pixel shaders, particularly as the traditional GPU typically spends more time/resources running pixel shaders than vertex shaders. That's simply because it'll be relatively rare when the Vec3+scalar+scalar capability will be used. Though you could read this as allowing RGBA (colour channels+alpha channel) + scalar. So, maybe not.

Technically RSX's pipeline is more powerful. It has a double-primary ALU organisation capable of dual-issuing a Vec3+scalar MAD (whereas Xenos can't). RSX also offers special purpose units that do other, fairly rare instructions. But it runs those rare instructions very fast.

RSX pipeline's major problem is register bandwidth. It simply isn't always possible to perform a dual-issued MAD because the register count for the instructions exceeds the number of registers the pipeline can actually fetch.

There are other more complicated limitations due to instruction dependency and texture address calculation ALUs (and dependent texturing) that add further blows to the peak capability of the RSX pipeline.

In short, the RSX pipeline can't sustain very high utilisation of all its constituent units (ALUs etc.).

Xenos's design is the diametric opposite - to provide the minimum number of units per pipeline that get the job done, so that there's little room for all the "exceptions".

That's where the increase in pipeline count and the use of a complicated scheduler and fully decoupled texture pipes comes in. It's a transistor trade between per-pipe functionality and pipeline-ALU-utilisation. (It's also a compiler-trade - a shader program, when compiled by the driver, has to fit the pipeline like a glove - it's actually a seriously difficult computing problem to make that fit. So a simpler pipeline makes the compiler's job much easier and more likely to get the perfect fit.)

2: Mini-ALUs. This is a key difference. While 7800 GTX/RSX has 56 ALUs (8 vertex, 48 pixel shader) to 48 for Xenos, not a big difference, the RSX/7800 also has 48 mini-ALUs. It seems these help, but it's hard to say how much or even what they do. Anybody care to put, say, a percentage in broad terms on it? Is a mini-ALU "worth" roughly, say, ten percent of a major one?
An NVidia mini-ALU provides some bias/scaling/clamping functionality. That basically means multiply or divide in powers of 2; add; or limit results to a range (e.g. the range 0...1).
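As a rough illustration of the three kinds of operation just described, here is a Python sketch. The function names are hypothetical and this models only the arithmetic, not how the hardware is wired or when the mini-ALU can actually be used.

```python
def scale_pow2(x, exponent):
    """Multiply (positive exponent) or divide (negative exponent) by a power of 2."""
    return x * (2.0 ** exponent)


def bias(x, offset):
    """Add a constant offset to a value."""
    return x + offset


def clamp(x, lo=0.0, hi=1.0):
    """Limit a result to a range; the default is the 0...1 range."""
    return max(lo, min(hi, x))


if __name__ == "__main__":
    value = 0.75
    print(scale_pow2(value, 1))         # times 2
    print(scale_pow2(value, -2))        # divided by 4
    print(bias(value, 0.5))             # plus 0.5
    print(clamp(scale_pow2(value, 1)))  # times 2, then limited to 0...1
```

All of these are cheap next to a full Vec4 MAD, which is part of why it is hard to say what a mini-ALU is "worth" as a fraction of a major ALU.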

ATI's mini-ALU provides add. It may provide other things, but I dunno...

I've seen (and been involved in) plenty of attempts to quantify processing power in terms of ALUs (which then leads into shader ops and GFLOPs). It ain't worth it.

3: Tessellation/interpolation. Xenos has these in hardware. Are they good? I thought tessellation was something necessary for GPU operation, but apparently it's more of an optional "effect"? Also, does the 7800 have dedicated interpolation hardware?
I'm biased here - I think Xenos's tessellation is what DX10 is going to provide within the next year.

If you read:

http://www.beyond3d.com/forum/showpost.php?p=572515&postcount=166

http://www.beyond3d.com/forum/showpost.php?p=572526&postcount=167

and combine those tessellation concepts with the concept of the Xbox Procedural Synthesis you'll start to realise this is an order of magnitude beyond where RSX is.

4: (particularly for Jawed) Does Xenos use an advanced scheduler like R520's, to further drive efficiency apart from the unified shaders? How does this work? Will it be a huge advantage?
Conceptually they're the same as far as I can tell (apart from Xenos's ability to schedule vertices and fragments). I do have my suspicions that there are a few cut corners in R520...

The major demerit in Xenos is that it appears to use 64-pixel (or vertex) batches. Compared with R520's 16-pixel batches, Xenos looks to be at a disadvantage.

Batch size is a property, jointly, of the scheduler and the memory system (for registers) that supports it. So in that sense Xenos is not so advanced.

But R580, which is more Xenos like (due to pipeline count!), appears to be forced into making the same, larger-batch, compromise. So R520 may prove to be a bit of an exception with its small batches. Hard to tell.

But in basic conceptual terms, the fact that Xenos and R520 are capable of issuing out-of-order batches to either the shader or texture pipelines is a fairly big deal and makes them pretty much equal.

I'm still waiting to get a definitive comparison of the effect on efficiency of out-of-order scheduling. We're seeing numbers anywhere from 10-40%.

Jawed
 
Jawed said:
and combine those tessellation concepts with the concept of the Xbox Procedural Synthesis you'll start to realise this is an order of magnitude beyond where RSX is.
Assuming RSX hasn't got hardware tessellation ;)
 
romiced said:
I have a question: are HDR precision and FSAA limited by the small amount of eDRAM? http://www.beyond3d.com/forum/showpost.php?p=581315&postcount=64

Purely from the perspective of eDRAM size, higher precision (FP16) will require a larger framebuffer, and perhaps more tiling (and thus more penalty) depending on your other requirements (AA, other buffers etc.). There are other costs too of course.
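A back-of-the-envelope sketch of that size argument, in Python. The numbers are assumptions for illustration: 10 MB of eDRAM, a 1280x720 target, 4 bytes per sample for 8-bit RGBA versus 8 bytes for FP16 RGBA, and 4 bytes of depth/stencil per sample. Real tiling has alignment rules and other buffers to fit, so treat the tile counts as a lower bound.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # assumed eDRAM size: 10 MB


def framebuffer_bytes(width, height, colour_bytes, depth_bytes, samples):
    """Size of colour + depth storage with `samples` AA samples per pixel."""
    return width * height * (colour_bytes + depth_bytes) * samples


def tiles_needed(fb_bytes, edram_bytes=EDRAM_BYTES):
    """How many passes (tiles) it takes if the buffer doesn't fit at once."""
    return math.ceil(fb_bytes / edram_bytes)


if __name__ == "__main__":
    for label, colour_bytes in (("8-bit RGBA", 4), ("FP16 RGBA", 8)):
        for samples in (1, 4):
            size = framebuffer_bytes(1280, 720, colour_bytes, 4, samples)
            print(f"{label}, {samples}x AA: {size / 2**20:.1f} MB, "
                  f"{tiles_needed(size)} tile(s)")
```

Under these assumptions, going from 8-bit to FP16 colour pushes a plain 720p target from one tile to two, and adding 4x AA on top pushes it to five.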

romiced said:
I forgot... is Xenos able to kill and create vertices, yes or no?

I don't think it can create vertices out of nothing, but the tessellator takes input vertices/polygons and subdivides them etc., outputting more vertices.

The B3D article on the front page covers parts of these questions.
 
But...

Thank you for the answer,
but I admit I still have questions about Xenos's capabilities. Why include eDRAM at all if it ends up limiting certain features?
 
Simple answer. More eDRAM would have been difficult to manufacture and thus very, very, very expensive. Much as 256 MB of eDRAM would be nice, there have to be limits based on cost and viability. You can also ask 'why only 512 MB of RAM?' and 'why only 48 ALUs, why not 256?' Xenos's configuration seems an ideal compromise between performance and economy.
 
Also, the eDRAM frees the main memory of XB360 from all the render target/back buffer tasks. In a typical GPU these tasks consume 10s of GB/s.

That's an incredibly big deal.

Also these tasks often can't run at full speed because conventional GDDR3 simply can't keep up (a 512 bit bus would come in very handy).

Jawed
 
Question: Does the Xenos share memory bandwidth to the main 512 MB memory pool with the Xenon? In other words, will one device's use of the main memory slow the other device's access?
 
Gholbine said:
Question: Does the Xenos share memory bandwidth to the main 512 MB memory pool with the Xenon? In other words, will one device's use of the main memory slow the other device's access?

I'm pretty sure the answer is yes. It was brought up as an advantage of NUMA vs. UMA.
 
Gholbine said:
Question: Does the Xenos share memory bandwidth to the main 512 MB memory pool with the Xenon? In other words, will one device's use of the main memory slow the other device's access?
Yep.

Texturing can soak up 16GB/s, leaving about 6GB/s for the CPU.

Obviously there are more fiddly details than that. But that's the basic overview.

There was a massive heated discussion about this somewhere, recently. Xbox Procedural Synthesis saves the day, basically - sending graphics data directly to Xenos, instead of sending it to memory. That saves a healthy 10.8GB/s of memory bandwidth, in the best case.
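The "about 6GB/s" above is just a subtraction, sketched here in Python. It assumes the commonly quoted 22.4 GB/s peak for the shared main memory pool, a number not given in this thread, and it ignores all the fiddly details (contention, read/write turnarounds and so on). The helper name is made up.

```python
MAIN_MEMORY_GBPS = 22.4  # assumed peak bandwidth of the shared 512 MB pool
TEXTURING_GBPS = 16.0    # what texturing can soak up, per the post above


def remaining_bandwidth(total, *consumers):
    """Peak bandwidth left over once the listed consumers take their share."""
    return total - sum(consumers)


cpu_share = remaining_bandwidth(MAIN_MEMORY_GBPS, TEXTURING_GBPS)
print(f"left for the CPU: about {cpu_share:.1f} GB/s")
```

Anything the CPU can hand to Xenos without going through main memory comes out of neither side's share, which is the point of the 10.8GB/s best-case saving mentioned above.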

Jawed
 
 
Acert93 said:
I am not sure "how good" Xenos is at tessellation. This we know: it is a two-cycle process and therefore can churn out 250M poly/s (compared to 500M triangle setup). So far the King Kong game is using this feature with nary a performance hit, so it must not be too bad.
Any idea where you read about this? I'm interested in seeing how King Kong is using tessellation.
 
Direct from where?

Jawed said:
Yep.

Texturing can soak up 16GB/s, leaving about 6GB/s for the CPU.

Obviously there are more fiddly details than that. But that's the basic overview.

There was a massive heated discussion about this somewhere, recently. Xbox Procedural Synthesis saves the day, basically - sending graphics data directly to Xenos, instead of sending it to memory. That saves a healthy 10.8GB/s of memory bandwidth, in the best case.

Jawed


I need help to understand this point about sending directly to memory. Is there a thread which details this process, such as where the geometry originates if not from the CPU or memory? I thought compressed geometry data would be CPU dependent, but am I wrong?
 
I think Jawed is talking about the case where the CPU generates vertices and stores them in the L2 cache, where Xenos can read them directly. This bypasses main memory, saving the write bandwidth on the FSB.
 