Xbox 360 graphics processor, codenamed XENOS: some details

wireframe said:
tEd said:
pc999 said:
I had posted this in the other thread, but since you are talking about it...

Anyone want to comment on this?

So yes, if programmed for correctly, the Xbox 360 GPU is capable of 96 billion shader operations per second.

http://www.hardocp.com/article.html?art=NzcxLDM=

...yes, it's the same shit as Nvidia's 136 shader ops/cycle

If he had at least explained how they count the shader ops, but no, just a stinky number

I think HardOCP mixed this one up and the number of shader ops the C1/R500/Xenos/<insert interesting name here> can do is 48 billion per second. Let me explain why.

48 billion ops per second is what was first reported. However, anyone who took the time to multiply the number of ALUs by the clock rate would have arrived at 48*500MHz = 24 billion ALU ops per second. I think what ATI tried to clarify is that each of the 48 parallel processing units is capable of two ops per cycle. This gives us 48*2*500MHz = 48 billion ops per second.
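A quick back-of-the-envelope check of that arithmetic in Python (nothing here beyond the figures already quoted):

    alus = 48
    clock_hz = 500e6                     # 500 MHz
    single_issue = alus * clock_hz       # 24e9: one op per unit per clock
    dual_issue = alus * 2 * clock_hz     # 48e9: two ops per unit per clock, matching the reported 48 billion
    print(single_issue / 1e9, dual_issue / 1e9)   # 24.0 48.0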

Of course I could be wrong and the total number of ops just doubled from what was known before. This would be significant because it would make the C1/R500 a theoretically more capable shader processor than Nvidia's/Sony's RSX.

But I doubt they would give out the wrong number (48 billion) by mistake like that. Would be a major cock-up.

Just want to sneak in a comment that these theoretical numbers may be very bad for comparing the two competitors. The same goes for their CPUs. These are theoretical maximum performance numbers, and what really matters with something like a console is how close to that optimum you can actually operate. Look at the Pentium 4 as an extreme example. It has very high theoretical figures and can even back them up in specialized benchmarking tests, but when multiple types of code/data need to be processed and the CPU cannot dedicate itself to one task, look what happens: performance plummets, while a design like the Athlon 64 keeps on trucking. It will be very interesting to see how close these machines are and how close the software will be (noting that software will probably look and play very similarly unless one machine offers something substantial above the others).


Cool, thanks for writing this out, wireframe, to help make things clearer, and for making it sort of an open-ended answer - things could have changed by now, but this is how you understand it according to what was reported/announced. Nice post.
 
pc999 said:
Thanks wireframe.

Heh. I just followed the link near the bottom and saw that Tim had written exactly what I wrote. I wasn't aware that this had already been answered, and to you, in another thread.

This topic is obviously hot, but I still caution everyone to be careful in comparing these metrics. The ATI unit has more flexibility than the RSX in terms of pixel/vertex ops, as these are a shared resource (a la SM 4.0, although I am not sure why an SM spec would specify how these ops are internally scheduled and executed), whereas the RSX is, so far, assumed to be of a more classically separated pixel/vertex design. Given Nvidia's 136 ops per cycle, we can imagine that there are cycles where not all pixel shader units are needed and too few vertex units are available. For R500 this would present a situation where it can "switch" its operation and find some optimal balance. However (and this is probably a huge however), it can't be easy to decide what that optimal balance is and schedule it. If you wasted a whole cycle determining what the optimal distribution is, you would lose all 96 of your op opportunities (this is just an example in the extreme).
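To put a rough number on that extreme case, a toy Python sketch (the window sizes are purely hypothetical):

    ops_per_clock = 96          # peak op opportunities per cycle, as discussed above
    scheduling_cycles = 1       # imagine one whole cycle burned deciding the pixel/vertex split
    shading_cycles = 99         # cycles actually spent shading in this hypothetical window
    wasted_ops = ops_per_clock * scheduling_cycles                                      # 96 opportunities lost
    effective = ops_per_clock * shading_cycles / (shading_cycles + scheduling_cycles)   # ~95 ops/clock on average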

I am probably not thinking correctly about this, but I really don't get why MS/ATI would want this type of GPU in the Xbox. I could better understand the benefit of this type of pooled resource when the target is not known, allowing the hardware to adapt itself to new software needs. But with a console you have everything known, and you should be programming it to the metal to get peak performance, so even if the pixel and vertex logic units are separated, you could optimize and work around any limitations this may impose.

The same goes for the CPUs in these boxes. The Cell is something like NetBurst (Pentium 4) taken to an extreme. This should not pose a problem, however, because the architecture is known and developers can optimize for it, as the whole system is built around it.

If anything, I think perhaps the best thing to look at for IQ/performance between these machines is the FSAA ability. It looks like the Xbox 360 has an advantage here, getting 4x FSAA "for free". I say this is an advantage because it may be unlikely that developers will be willing to put that much more into making a "better" PS3 version of a game, with whatever asset changes that may imply, unless the result is very noticeable. So, I suspect that many games that come out on both platforms will look the same, but the Xbox 360 may then have an advantage with FSAA.
 
Megadrive1988 said:
Cool, thanks for writing this out, wireframe, to help make things clearer, and for making it sort of an open-ended answer - things could have changed by now, but this is how you understand it according to what was reported/announced. Nice post.

No problem. I could probably better prove the case simply by saying "ATI claims two ops per clock"...so explain to me how you can get anything more than 96 ops per clock out of 48 logic units? You can't have it both ways. 96 billion ops per second (given a 500MHz clock) would imply 4 ops per logic unit per clock, or 96 logic units doing two ops per clock (whereas two ops per clock on 48 units is the ATI statement, and I think we can assume this one is correct as it jibes with the original numbers).

The 96 was obviously misplaced when it got a "billion ops per second" behind it; it should have just been "per clock".
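A quick sanity check of that in Python, using the 48 units and 500MHz clock discussed above:

    units = 48
    clock_hz = 500e6
    # If "96" really meant 96 billion ops per second:
    implied_ops_per_clock = 96e9 / clock_hz               # 192 ops per clock
    implied_ops_per_unit = implied_ops_per_clock / units  # 4 ops per unit per clock - not what ATI claims
    # Whereas ATI's "two ops per clock" per unit gives:
    ops_per_clock = units * 2                             # 96 ops per clock
    ops_per_second = ops_per_clock * clock_hz             # 48e9, i.e. 48 billion per second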
 
wireframe said:
pc999 said:
Thanks wireframe.

Heh. I just followed the link near the bottom and saw that Tim had written exactly what I wrote. I wasn't aware that this had already been answered, and to you, in another thread.

Actually I read your answer first.
Anyway, great posts.

BTW, two answers are better than one.
 
I'm sure I'm showing my ignorance here, but why would the operations have to be scheduled internally on the hardware? Couldn't you just have a pool of shaders that go to work on a first-come, first-served basis?
 
There are two times when the execution time of a shader instruction is not known:

1. texturing (depends on filtering and cache status)

2. branching (will branch prediction fail?)

When a thread requires texturing, the thread is put into the texturing queue. The GPU knows that a texturing operation is coming along, so it can fill the ALU pipeline with a new thread, which will start to execute as soon as the texturing operation causes the old thread to switch out.

When a thread's branch prediction fails, the thread is put back into the execution queue (and the execution pipeline needs to flush out the superfluous instructions). The GPU will have already scheduled another thread to follow behind the superfluous instructions. The shorter the pipeline, the fewer cycles are wasted.

So, in general, the GPU knows precisely what the balance of TMU and ALU operations is, and it can easily allocate threads as these resources become available. Because of pipelining, these resources never go idle (except for a branch prediction failure).

Additionally, the GPU can split threads into clauses, e.g. of a maximum of 4 instructions. This increases the granularity of execution, increasing the rate at which short threads are completed, which in turn means that the queueing system doesn't get blocked up with threads dependent on texturing.

The upshot of all this should be that R500 runs every resource (ALU and TMU) at close to 100% utilisation, i.e. the peak-rate is real world.
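As a purely illustrative toy model of that kind of arrangement (Python; the queue names and the fixed one-result-per-cycle texture latency are my own simplifications, not the actual hardware algorithm):

    import random
    from collections import deque

    random.seed(0)
    ready = deque(["t0", "t1", "t2", "t3"])   # hypothetical shader threads ready for ALU work
    texturing = deque()                        # threads parked while a texture fetch is outstanding

    for cycle in range(8):
        if ready:                              # the ALU pipeline is fed every cycle that any thread is ready
            t = ready.popleft()
            if random.random() < 0.5:          # stand-in for "this clause ends in a texture fetch"
                texturing.append(t)            # park the thread; the ALU picks a different one next cycle
            else:
                ready.append(t)                # clause done, thread re-queues for more ALU work
        if texturing:                          # pretend one texture result returns per cycle
            ready.append(texturing.popleft())
        print(cycle, list(ready), list(texturing))

The only point of the sketch is that the ready queue rarely runs dry as long as enough threads are in flight, which is what would keep both units busy.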

The next interesting question is whether R500 arranges "pipelines" as quads, or separates texture caches or command thread queues into quads...

Jawed
 
gurgi said:
I'm sure I'm showing my ignorance here, but why would the operations have to be scheduled internally on the hardware? Couldn't you just have a pool of shaders that go to work on a first-come, first-served basis?

Yep.

The relevant patent covers a number of different bases with respect to scheduling - but overall it's pretty vague.

Jawed
 
wireframe said:
tEd said:
pc999 said:
I had posted this in the other thread, but since you are talking about it...

Anyone want to comment on this?

So yes, if programmed for correctly, the Xbox 360 GPU is capable of 96 billion shader operations per second.

http://www.hardocp.com/article.html?art=NzcxLDM=

...yes, it's the same shit as Nvidia's 136 shader ops/cycle

If he had at least explained how they count the shader ops, but no, just a stinky number

I think HardOCP mixed this one up and the number of shader ops the C1/R500/Xenos/<insert interesting name here> can do is 48 billion per second. Let me explain why.

48 billion ops per second is what was first reported. However, anyone who took the time to multiply the number of ALUs by the clock rate would have arrived at 48*500MHz = 24 billion ALU ops per second. I think what ATI tried to clarify is that each of the 48 parallel processing units is capable of two ops per cycle. This gives us 48*2*500MHz = 48 billion ops per second.

Of course I could be wrong and the total number of ops just doubled from what was known before. This would be significant because it would make the C1/R500 a theoretically more capable shader processor than Nvidia's/Sony's RSX.

But I doubt they would give out the wrong number (48 billion) by mistake like that. Would be a major cock-up.

Just want to sneak in a comment that these theoretical numbers may be very bad for comparing the two competitors. The same goes for their CPUs. These are theoretical maximum performance numbers, and what really matters with something like a console is how close to that optimum you can actually operate. Look at the Pentium 4 as an extreme example. It has very high theoretical figures and can even back them up in specialized benchmarking tests, but when multiple types of code/data need to be processed and the CPU cannot dedicate itself to one task, look what happens: performance plummets, while a design like the Athlon 64 keeps on trucking. It will be very interesting to see how close these machines are and how close the software will be (noting that software will probably look and play very similarly unless one machine offers something substantial above the others).

'On chip, the shaders are organized in three SIMD engines with 16 processors per unit, for a total of 48 shaders. Each of these shaders is comprised of four [my bold] ALUs that can execute a single operation per cycle, so that each shader unit can execute four floating-point ops per cycle.'

http://techreport.com/etc/2005q2/xbox360-gpu/index.x?pg=1
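Taking that description at face value, the numbers tally like this (a quick Python sketch using only the figures from the quote plus the 500MHz clock discussed earlier; purely arithmetic, not a claim about how the ops should be counted):

    simd_engines = 3
    shaders_per_engine = 16
    alus_per_shader = 4                                   # per the techreport description
    clock_hz = 500e6
    shader_units = simd_engines * shaders_per_engine      # 48
    ops_per_clock = shader_units * alus_per_shader        # 192 under that reading
    ops_per_second = ops_per_clock * clock_hz             # 96e9 - which may be where the "96 billion" figure comes from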
 
It looks like each ALU is 4D, which, if true, would imply that vertex shaders can't co-issue a 4D and scalar, which is less capability than R420 has.

Jawed
 
Tim said:
2Tbit/s = 256GB/s, and that number has been repeated like a million times; now they're just using bits instead of bytes to make the number sound more impressive.
What's even worse, they're not only using bits instead of bytes, but they're counting in compression as well. The actual transfer rate is 32GB/s, not 256.

Sad to see the people at HardOCP just swallowing these numbers like a bunch of nooblars, but hey, this is what happens when technical stuff is 'explained in plain English'. Meaning, it's some PR fuck lying his head off to make things sound more impressive than they actually are.
 
Guden Oden said:
Tim said:
2Tbit/s = 256GB/s, and that number has been repeated like a million times; now they're just using bits instead of bytes to make the number sound more impressive.
What's even worse, they're not only using bits instead of bytes, but they're counting in compression as well. The actual transfer rate is 32GB/s, not 256.

Sad to see the people at HardOCP just swallowing these numbers like a bunch of nooblars, but hey, this is what happens when technical stuff is 'explained in plain English'. Meaning, it's some PR fuck lying his head off to make things sound more impressive than they actually are.

I'm not really sure this includes compression. This is the bandwidth between the logic and the eDRAM, not between the GPU and the logic+eDRAM.
 
Guden Oden said:
What's even worse, they're not only using bits instead of bytes, but they're counting in compression as well. The actual transfer rate is 32GB/s, not 256.

No, they don't count compression - they take into account that alpha and z-ops are done on the eDRAM chip, not the main GPU, which means that AA samples etc. do not have to be transferred between the chips.
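Just to keep the bits and bytes straight, a quick Python conversion of the figures in this exchange (which link each number belongs to is taken from the two posts above):

    edram_internal_bytes = 256e9                      # the 256 GB/s eDRAM-to-logic figure on the daughter die
    edram_internal_bits = edram_internal_bytes * 8    # 2.048e12, i.e. the rounded "2 Tbit/s" marketing number
    gpu_to_daughter_gb = 32                           # the 32 GB/s GPU-to-daughter-die link mentioned above
    print(edram_internal_bits / 1e12, gpu_to_daughter_gb)   # 2.048 32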
 
Jawed said:
The upshot of all this should be that R500 runs every resource (ALU and TMU) at close to 100% utilisation, i.e. the peak-rate is real world.
But rarely for both of them at the same time.
 
Xmas said:
Jawed said:
The upshot of all this should be that R500 runs every resource (ALU and TMU) at close to 100% utilisation, i.e. the peak-rate is real world.
But rarely for both of them at the same time.
Eh? The multi-threading patent is all about sending one command thread to the texture unit while, at the same time, sending another command thread to the ALU unit.

Not only that but a primary objective of the design is to perform unlimited dependent texturing.

Jawed
 
Could it be that ATI claims 48 billion ops per second because each of R500's 48 units can co-issue a vec3 and scalar?

This would seem to be limited to cases where all units are dedicated to pixel processing, because vertex ops require vec4, which would use up the unit that would otherwise be dedicated to scalar ops.
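If that speculation is right, the per-clock op count would depend on the pixel/vertex mix. A toy Python tally, following the counting convention used earlier in the thread (speculative, not a confirmed figure):

    units = 48
    clock_hz = 500e6
    pixel_issue_ops = 2      # vec3 + scalar co-issued per unit per clock
    vertex_issue_ops = 1     # a vec4 consumes the whole unit, no scalar co-issue
    all_pixel_peak = units * pixel_issue_ops * clock_hz    # 48e9 ops/s
    all_vertex_peak = units * vertex_issue_ops * clock_hz  # 24e9 ops/s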
 
I think we're gonna have to wait until Dave does the honours with an in-depth description.

Jawed
 
Well, for what it's worth, the mystery over 3x16 versus 8x6 is more perplexing and therefore of more immediate interest!

Jawed
 