Hype or real-world advantage... PS2 related.

Sure, you can just send it garbage, but it still takes up bandwidth. On the PS2 I can just send a single GENERATE_10000_FLAKES byte and it'll do it.
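A minimal sketch of that idea in C, with hypothetical names; the real transfer would be a proper DMA/VIF packet, but the point is that only the command token crosses the bus while the VU microprogram expands it into geometry on-chip:

    #define NUM_FLAKES 10000

    struct vec4 { float x, y, z, w; };

    enum vu_command { GENERATE_10000_FLAKES = 0x01 };

    /* CPU side: the entire per-frame upload is one command word. */
    static void queue_flake_command(unsigned int *packet)
    {
        packet[0] = GENERATE_10000_FLAKES;
    }

    /* VU side, sketched in C: expand the token into verts locally.
       A cheap hash of the flake index stands in for the real
       animation math; nothing but the token crossed the bus. */
    static void vu_generate_flakes(struct vec4 *out)
    {
        for (unsigned int i = 0; i < NUM_FLAKES; ++i) {
            unsigned int h = i * 2654435761u;   /* Knuth hash */
            out[i].x = (float)(h & 0xffff) * (1.0f / 65535.0f);
            out[i].y = (float)(h >> 16)    * (1.0f / 65535.0f);
            out[i].z = 0.0f;
            out[i].w = 1.0f;
        }
    }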

I just wanted to point out that although vertex shaders can't create vertices, as long as you know the final number of verts you can usually do the same thing (sketched below).
You can also set the DMA length to 0 on Xbox so it repeatedly sends the same word over and over, though I'm unsure whether that would be cached, saving you the bandwidth. Sending single words over is not really significant anyway.
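A minimal sketch of the shader-side trick in C, with hypothetical names: since the final vert count is known up front, the stream can carry nothing but a per-vertex id, built once at load time; the vertex shader then derives each position from that id plus shader constants, so no fresh vertex data crosses the bus per frame.

    #include <stddef.h>

    #define NUM_FLAKES 10000

    /* Only payload per vertex is its index; the vertex shader expands
       id + shader constants (time, wind, etc.) into a full position. */
    struct flake_vert { float id; };

    /* Built once at load time; per-frame "generation" then happens
       entirely in the shader. */
    static void build_flake_stream(struct flake_vert *verts, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            verts[i].id = (float)i;
    }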

Hey, it's the only thing the PS2 has going for it, so I'm gonna defend it till the day I die.

I agree, in general the flexibility of the VUs is nice. Although I have to say the instruction latencies are such that scheduling by hand can be pretty painful. Of course, compared to writing efficient uCode for an N64 it's relatively trivial.

Life is easy for an Xbox programmer.

Hmm, not sure I agree with that. The challenges are somewhat different and don't revolve around managing DMA lists, but getting that last 10% out of any piece of hardware is far from easy.
 
One in, one out

It's interesting that the true latencies are actually very similar on both the vertex shaders and the vector units (not that many different ways to implement an FMAC, really).
Running multiple interleaved streams is the cool thing (along with the post-transform cache), and that enforces the 1-to-1 in/out mapping.
Even on the vector unit it's quite easy to hide the latencies in the same way, by interleaving vertex calcs.
The true improvement comes from constructive creation of vertices (subdivision etc.) and conditional ops... as well as deindexing and instancing.
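A minimal sketch of the interleaving idea in C rather than VU code: two independent transform chains per iteration, so while one chain waits on a multiply-add result the other's instructions can fill the pipeline slots.

    /* Transform two vertices against the same matrix per iteration;
       chains A and B share no results, so their FMAC-style ops can be
       overlapped to hide instruction latency. */
    static void xform_pair(const float m[4][4],
                           const float a[4], const float b[4],
                           float out_a[4], float out_b[4])
    {
        for (int r = 0; r < 4; ++r) {
            float sa = 0.0f, sb = 0.0f;     /* independent accumulators */
            for (int c = 0; c < 4; ++c) {
                sa += m[r][c] * a[c];       /* chain A */
                sb += m[r][c] * b[c];       /* chain B */
            }
            out_a[r] = sa;
            out_b[r] = sb;
        }
    }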
 
Although I have to say the instruction latencies are such that scheduling by hand can be pretty painful.
It's an annoyance, yes, but nowadays you have VCL to do the work for you.

Hmm, not sure I agree with that. The challenges are somewhat different and don't revolve around managing DMA lists, but getting that last 10% out of any piece of hardware is far from easy.
Oh hush, managing DMA lists is one of the easiest things you get to do. The fun part is agonizing over the size of the R5900 dCache :devilish:
 
It's an annoyance, yes, but nowadays you have VCL to do the work for you.
I agree VCL is a great tool if you understand the scheduling limitations.

Oh hush, managing DMA lists is one of the easiest things you get to do. The fun part is agonizing over the size of the R5900 dCache

Just don't read any memory :p.....
Besides, compared to an N64 that side is easy.
I've seen applications on the N64 slow down by over 20% by changing nothing but the link order.
 
I agree VCL is a great tool if you understand the scheduling limitations.
Well, that's all the more required for hand scheduling, isn't it? Anyway, yeah, it's a very handy tool, particularly if you come from the early PS2 days. ;)

Just don't read any memory :p.....
That's what I keep telling people too, we should just generate everything procedurally :p

Besides, compared to an N64 that side is easy. I've seen applications on the N64 slow down by over 20% by changing nothing but the link order.
Actually that's not unlike what I've just experienced recently, where altering one data structure (extending it by 4 bytes) resulted in 20% more frame time used by the outer rendering loop as long as bgr DMA was running (that's 1 million cycles per frame wasted on memory contention).
I did learn one valuable thing from fixing that though: GCC loves bit field structures :\
 
Actually that's not unlike what I've just experienced recently, where altering one data structure (extending it by 4 bytes) resulted in 20% more frame time used by the outer rendering loop as long as bgr DMA was running (that's 1 million cycles per frame wasted on memory contention).
I did learn one valuable thing from fixing that though: GCC loves bit field structures :\

Like I probably mentioned to you before, the speed of 90% of the code in your average game is gated by external memory references.

I doubled the speed of the code we used to process particles by swapping two members of a data structure so that the position and radius could be read from the same cache line, and by adding a single prefetch instruction to the main loop.
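A minimal sketch of that kind of fix in C, with hypothetical field names: radius sits next to pos so one cache-line fetch serves both reads, and GCC's __builtin_prefetch starts pulling in the next particle while the current one is processed.

    struct particle {
        float pos[3];
        float radius;       /* moved adjacent to pos: same cache line */
        float vel[3];
        float age;
    };

    static void update_particles(struct particle *p, int n, float dt)
    {
        for (int i = 0; i < n; ++i) {
            __builtin_prefetch(&p[i + 1]);  /* start the next line's fetch */
            p[i].pos[0] += p[i].vel[0] * dt;
            p[i].pos[1] += p[i].vel[1] * dt;
            p[i].pos[2] += p[i].vel[2] * dt;
            if (p[i].radius > 0.0f)         /* hits the already-fetched line */
                p[i].age += dt;
        }
    }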
 
You can also set the DMA length to 0 on Xbox so it repeatedly sends the same word over and over, though I'm unsure whether that would be cached, saving you the bandwidth. Sending single words over is not really significant anyway.

Well, if it doesn't get cached then you're looking at a lot of bus chatter, which isn't good either... It can really cut your effective throughput.

Anyway, yeah, it's a very handy tool, particularly if you come from the early PS2 days.

Did you have your cool little editor back then as well? Beats using Excel as an editor/debugger... :devilish:

The fun part is agonizing over the size of the R5900 dCache

Indeed :devilish: It'd be nice if it had multiple data-stream touches à la AltiVec... 8)

I did learn one valuable thing from fixing that though, GCC loves bit field structures :\

Heh... You should see what MrC does on a PowerPC with bit field manipulation... :eek:
 
Erp,
Like I probably mentioned to you before, the speed of 90% of the code in your average game is gated by external memory references.
Yeah, but there's memory references, and then there's depressing stuff like having 8 KB of dCache in a UMA-oriented machine that was built to work with tons of bus noise all the time.
Thanks to the R59k I'm now practically worried every time I write a pointer access in the code :p

Archie,
Did you have your cool little editor back then as well? Beats using Excel as an editor/debugger...
Yep, and it was definitely handy, but having an auto-optimizer is a godsend. (If you noticed, there's a non-working function in the editor that was supposed to do that too; we started working on it but never got very far :p)

Indeed. It'd be nice if it had multiple data-stream touches à la AltiVec...
I'd already settle for an L1 cache with the size and associativity of a Celeron's right now :)

Heh... You should see what MrC does on a PowerPC with bit field manipulation...
Heh, I wouldn't know, I'm not too familiar with the PPC instruction set. At any rate, I was just a bit surprised that stuff like directly ANDing/ORing flags tends to be slower than letting the compiler do it with bitfields.
But I think it has something to do with the rather moronic way GCC handles data types of less than 32 bits on MIPS - so by using bitfields in 32-64 bit containers you're actually helping the compiler.
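A minimal sketch of the two flag styles being compared, with hypothetical flag names: the hand-written and/or on a full word versus a 32-bit bitfield container where the compiler picks its own extract/insert sequence.

    /* Manual style: flags packed by hand into a full word. */
    #define F_VISIBLE 0x1u
    #define F_DIRTY   0x2u

    struct flags_manual { unsigned int bits; };

    static int is_dirty_manual(const struct flags_manual *f)
    {
        return (f->bits & F_DIRTY) != 0;    /* hand-written and/test */
    }

    /* Bitfield style: 32-bit container, compiler does the masking. */
    struct flags_bitfield {
        unsigned int visible : 1;
        unsigned int dirty   : 1;
        unsigned int unused  : 30;
    };

    static int is_dirty_bitfield(const struct flags_bitfield *f)
    {
        return f->dirty;    /* compiler picks the extract sequence */
    }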
 