I don't seem to be good at short posts.
Going by ELF sizes, around 20% of my game is SPU code.
Thanks for sharing that! Do you know how much that is in LOC, and how much of that is code you wrote vs. stuff from the SDK? 20% is a lot, but I really don't know how ELF sizes compare.
I'm going to disagree with both your take on it and sebbi's.
That's cool. I may have failed my reading comprehension check, but I'm not sure which parts you disagree with and what your opinion on them is. Care to elaborate?
Add dynamic branching? Add integer units?
Both already in there. It doesn't have a branch predictor or an integer divide instruction, but that's about it. Instead of the predictor, there are branch hints, so you don't usually need to take a hit for a branch. This can be tricky to get right, but mostly in cases where a predictor would be completely lost.
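To give a flavour of what that looks like from C, here's a minimal sketch, assuming an spu-gcc style toolchain where __builtin_expect feeds the static prediction that decides where the compiler places its branch hints; the function and cutoff are made up for illustration.

```c
/* Sketch only: LIKELY/UNLIKELY feed the compiler's static prediction,
   which on SPU targets is what drives branch-hint placement. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

float sum_in_range(const float *dist, int count, float cutoff)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        /* Common case: the value is in range, so hint the branch that way
           and the usual path doesn't pay for the missing predictor. */
        if (LIKELY(dist[i] < cutoff))
            sum += dist[i];
        /* The rare out-of-range case simply falls through. */
    }
    return sum;
}
```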
Agreed. SPUs should perform better when the data set required for each pixel is larger than L1d (32 KB), but smaller than local store size (or half of it minus code = ~128 KB, since you need double buffering to load next data while processing old one). But once the data set is larger than local store size, the VMX128 processing would simply start using the 1 MB L2 cache instead of the faster L1d, while the SPU code would need to frantically swap data to/from main memory (slowing it down to a crawl).
I'm still scratching my head a bit at the 128 KB LS overhead, but I'm willing to accept that for some people this is how it works out. As we're sort of talking about cool hardware capabilities, let me say a word or two about double buffering and SPUs.
Memory transfers down to main memory and VRAM are handled by the MFC. The SPU controls its MFC through a channel interface that is part of the ODD pipe (roughly the same as VXU Type 2 for you), meaning that you can issue commands to the MFC as ODD instructions.
If you are so inclined, you can put these channel commands into your regular processing loop. The MFC can queue up 16 DMAs, so this effectively gives you an extremely controlled prefetch system at the cost of some ODD pressure. Most people have ODD issue slots to spare, so it's mostly a case of working out the latencies and carefully adding the commands, much as you described for VMX. It's really very much the same thing, save for the added control and things like DMA lists.
What that means is that there's no need to limit yourself to simple double buffering; you can do pretty sophisticated loops if you need to. I've personally never had a situation where I needed to roll the channel commands in with the assembly, as that's really only something you need when you have sparse (i.e. random) memory accesses. People tell me it works rather well, however.
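To make that a bit more concrete, here is a minimal double-buffering sketch using the spu_mfcio.h intrinsics, which are just thin wrappers around those channel commands. The chunk size, the tag assignment and the process() function are assumptions for illustration, not anything from a real codebase, and the effective address is assumed to be suitably aligned.

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096u  /* bytes per DMA; an arbitrary size for the example */

/* Two local-store buffers so one can be filled while the other is processed. */
static uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

/* Hypothetical per-chunk work. */
extern void process(uint8_t *data, unsigned size);

void stream(uint64_t ea, unsigned chunks)
{
    unsigned cur = 0;

    /* Kick off the first transfer on tag 0. */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

    for (unsigned i = 0; i < chunks; ++i) {
        unsigned next = cur ^ 1;

        /* Queue the next chunk on the other tag before blocking;
           the MFC holds up to 16 of these in its queue. */
        if (i + 1 < chunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Wait only for the tag owning the chunk we are about to touch. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);
        cur = next;
    }
}
```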
I don't actually know how MFC DMAs compare to L2 fetches from memory in terms of latency, so it might come out even.
Programmers who are not used to doing these things often leave quite a bit of performance on the table. You're probably used to seeing the same thing with people who don't use the prefetch instruction on 360.
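For comparison, here's a tiny sketch of that kind of prefetching. The exact spelling differs per toolchain (the 360 compiler exposes its own dcbt-style intrinsic); with a GCC-flavoured compiler it looks something like this, and the prefetch distance is an invented tuning value.

```c
#include <stddef.h>

/* Sketch: software-prefetch a fixed distance ahead of the loop.
   __builtin_prefetch is the GCC spelling; on PowerPC-style targets it
   turns into a dcbt. The distance (8 cache lines of 128 bytes, i.e.
   256 floats) is a made-up tuning value. */
float sum_with_prefetch(const float *data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        /* Ask for the line we'll need a few hundred cycles from now;
           overshooting the end of the array is harmless, since
           prefetches never fault. */
        __builtin_prefetch(&data[i + 256], 0, 0);
        sum += data[i];
    }
    return sum;
}
```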
It's interesting to compare different vector architectures.
Absolutely. They are usually so much cleaner designs than the super-scalar OoO cores, where you can't really predict what's going to happen.
So he'll be working at a lower level on SPUs than anyone being in the luxurious position of just having to develop technologies without product deadlines to worry about [...]
It's not lower level than what other developers do. Every PS3 developer can get as low level as they want. Also, what is this luxurious position without product deadlines that you're talking about? I'm intrigued.
Oh my... he is a god to me
I did notice a recent lack of offerings at my shrine...
Pretty curious to know whether FXAA would be possible on the PS3 through the SPUs, and how it would work.
I'm pretty sure it could be done, although it may not be worth it. Last time I checked, it was algorithmically a lot simpler than MLAA but very tailored to GPU ISAs. That's what makes such a nice and simple integration possible and what makes it run well on GPUs. Full-blown MLAA is extremely hard to do on GPUs, which is why nobody is doing it.
This actually harkens back to what we talked about earlier, as MLAA is one of those pesky algorithms that needs the entire scanline more than just once.