Is this accurate?

ERP said:
It's incredibly hard to get useful figures, you can't just benchmark it. Xbox had enough performance counters that you could estimate the loss of pixel shader output, but how often vertex shaders were stalled would have had to be a guess.

Perhaps it could have been done with this.
 
I don't see how he can take the penalties for texturing into consideration for RSX and not for Xenos. I know the texture units are decoupled, but texturing is still a limiting factor. I think ATI's statements on this issue are the most telling: they said it would outperform R520 in some cases and lose to it in others.

I'm comfortable with thinking Xenos is a little faster than R520; it has major efficiency gains, but we still don't know the differences between Xenos ALUs and traditional ALUs. If I had to guess, I think the differences lie in the fixed-function part; my understanding is that when you include this stuff in your flop counts, GPUs cross into the teraflops category. Perhaps, in order to facilitate unified shaders, some concessions needed to be made in fixed function, because each ALU is no longer as specialized and it would be too expensive to include all the fixed-function logic for both PS and VS in every ALU. The unknown here is how much die space you can save this way as opposed to others.

That's the real magic of Xenos: however they did it, they packed a lot of ALUs into little die space.
 
The Sequencer in Xenos not only manages scheduling due to load-balancing (vertex versus fragments) and fetch latency (vertex or texture), but it also directly supports render state switching.
There isn't "a" sequencer, there are multiple sequencers and arbiters (two for each of the ALU arrays - one for each thread - and a further two for each of the texture arrays). The load balancing is handled upstream of the sequencers.

When a thread "halts" it's because of a vertex or texture fetch. That's a direct signal to the Sequencer to shift that thread into a different mode and swap contexts.
A thread "halts" from one instruction to the next, which is why there are always two threads active (to hide the instruction latency of the first). If, after the first instruction on the first thread, there is just another instruction that has no other dependencies, it will get passed back to the ALU array's local sequencer/arbiter to run after the first/next instruction of the second thread has run; if it has other dependencies, it gets passed back to the reservation station and another thread will be tasked to run on that ALU array.
 
OK, so Xenos's ALU pipelines are actually working in an alternating pattern:
  • thread 1, instruction 2, phase 1
  • thread 1, instruction 2, phase 2
  • thread 1, instruction 2, phase 3
  • thread 1, instruction 2, phase 4
  • thread 2, instruction 5, phase 1
  • thread 2, instruction 5, phase 2
  • thread 2, instruction 5, phase 3
  • thread 2, instruction 5, phase 4
Where each phase is 16 fragments (or vertices) to make 64 total. The next instruction for thread 1 might require the result of a texture operation that isn't ready, so execution would continue as follows:
  • thread 3, instruction 1, phase 1
  • thread 3, instruction 1, phase 2
  • thread 3, instruction 1, phase 3
  • thread 3, instruction 1, phase 4
  • thread 2, instruction 6, phase 1
  • thread 2, instruction 6, phase 2
  • thread 2, instruction 6, phase 3
  • thread 2, instruction 6, phase 4
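
To check my own reading of that pattern, here is a rough Python sketch of the interleaving. It is a toy model, not how the hardware is actually built: the names `simulate`, `active`, `waiting` and `blocked` are all made up for illustration, and I'm assuming a simple first-in-first-out reservation station and exactly two active threads per ALU array.

```python
from collections import deque

PHASES = 4            # 4 phases per instruction
ITEMS_PER_PHASE = 16  # 16 fragments/vertices per phase, 64 per thread

def simulate(active, waiting, blocked, slots):
    """Toy model of one ALU array (hypothetical, for illustration only).

    active  -- two [thread_id, next_instruction] entries that alternate
    waiting -- deque of [thread_id, next_instruction] parked threads
    blocked -- set of (thread_id, instruction) pairs waiting on a fetch
    slots   -- number of instruction issue slots to simulate
    Returns a list of (thread_id, instruction, phase) tuples.
    """
    trace = []
    for slot in range(slots):
        i = slot % 2  # alternate between the two active threads
        # If the next instruction depends on a fetch that isn't ready,
        # park the thread and pull in another one (assumed FIFO order).
        for _ in range(len(waiting) + 1):
            if tuple(active[i]) not in blocked:
                break
            waiting.append(active[i])
            active[i] = waiting.popleft()
        else:
            # Every candidate thread is blocked: the slot is wasted.
            trace.append((None, None, None))
            continue
        thread_id, instruction = active[i]
        for phase in range(1, PHASES + 1):
            trace.append((thread_id, instruction, phase))
        active[i][1] = instruction + 1  # advance to the next instruction
    return trace

if __name__ == "__main__":
    # Thread 1 at instruction 2, thread 2 at instruction 5, thread 3 parked.
    # Thread 1's instruction 3 needs a texture result that isn't ready.
    trace = simulate(
        active=[[1, 2], [2, 5]],
        waiting=deque([[3, 1]]),
        blocked={(1, 3)},
        slots=4,
    )
    for thread_id, instruction, phase in trace:
        print(f"thread {thread_id}, instruction {instruction}, phase {phase}")
```

Run as a script, this prints the same sixteen lines as the two lists above: thread 1 gets swapped out for thread 3 once its next instruction is blocked, while thread 2 carries on in the other slot.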
Jawed
 