Yeah... the first parties have been doing stuff like this since U2 and KZ2, maybe earlier. The SPU culling, for example, has been around since the beginning (it's one of the earliest Edge tools). But they refined it and matched it up with other tech here. Then they throw in the artists to make levels suitable for the engine's characteristics and design direction.
It bears repeating that 3rd parties have been doing it for *ages* as well. In fact some of them did it (gasp) before ND or any first party on PS3 did. I wonder if people find that surprising. I still have no clue why 3rd parties continue to be thrown under the bus; they are just as competent and capable as anyone else. Maybe this is a lesson in marketing, because the 1st parties constantly trumpet their code publicly whereas 3rd parties who do much of the same stuff don't, and as a result you have the "don't use the SPUs" type of nonsense recurring over and over again.
At the presentation they said they did it this way because rendering directly into system memory was so slow that it justified spending the time on the copy from local memory. I'm guessing they didn't bother transferring the baked lighting output because they're already summing the results of the SPU deferred pass with the results of lights calculated by the GPU, so there was no need to copy it across.
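To make the "no need to copy it across" point concrete: since light accumulation is additive, any extra baked contribution can be folded into one of the buffers already being summed. A minimal sketch, assuming linear-space RGB float accumulation buffers (the buffer names, resolution, and values here are illustrative, not anything from the presentation):

```python
import numpy as np

# Hypothetical 720p light-accumulation buffers (RGB, linear space).
# spu_lights: result of the SPU deferred lighting pass
# gpu_lights: result of lights evaluated on the GPU
h, w = 720, 1280
spu_lights = np.zeros((h, w, 3), dtype=np.float32)
gpu_lights = np.zeros((h, w, 3), dtype=np.float32)
spu_lights[..., 0] = 0.3   # pretend the SPU pass lit the red channel
gpu_lights[..., 1] = 0.5   # pretend the GPU pass lit the green channel

# The combine step is just an additive merge, which is why a separately
# copied baked-lighting buffer would be redundant: its contribution can
# be pre-summed into either input before the merge.
combined = spu_lights + gpu_lights
```

The same logic holds on hardware via additive blending; the point is only that an extra "sum in one more buffer" costs nothing structurally.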
Later on, Matt Swoboda said during his PhyreEngine talk that they had measured local->system memory transfers for a 1280x720 buffer at 0.7ms, and he wasn't sure how they were doing it in only 1.3ms. Either way they must have hit a serious performance cliff with 4xMRT in local memory. I've never tried more than two simultaneous RTs on PS3, personally.
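A quick back-of-the-envelope on what those numbers imply, assuming a 32-bit (4 bytes/pixel) buffer; the actual formats weren't stated here, so this is illustrative only:

```python
# What does 0.7 ms per 1280x720 transfer imply as a bandwidth figure?
width, height, bytes_per_pixel = 1280, 720, 4
buffer_bytes = width * height * bytes_per_pixel   # 3,686,400 bytes
seconds = 0.7e-3
bandwidth_gb_s = buffer_bytes / seconds / 1e9     # ~5.3 GB/s

# If the copy covers several MRT buffers at that measured rate, e.g.
# three buffers would naively take ~2.1 ms -- which would make a total
# of 1.3 ms read as surprisingly fast rather than slow.
naive_three_buffers_ms = 3 * 0.7
```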
Interesting, yeah, 1.2ms does sound way too long, hence why I presumed they were packing data from that discarded buffer into the other ones (i.e., doing more than just a mem copy), and also because resolves on 360 are around 0.4ms or so, so that 1.2ms figure seemed way out there.

I'm curious to see what tile size they go with on PC and 360 as well, or if they need to tile at all. There's no choice but to tile it on PS3 to make it usable on SPU, but it seems like there would be some redundancy that way. There must be an optimal processing size for a given platform, and 64x64 seems in the ballpark for PS3; I don't think they can go much bigger than that.

On 360 I guess it depends on whether they're doing this on VMX or GPU, or a bit of both. They're keeping a fully functioning RSX implementation around (as 3rd parties have already been doing for years), which means they already have a fully working version on the 360 GPU. So what will they do? I wonder if they'll move the simplest part of the process to VMX with a small 32x32 tile size (or whatever works best with its register set for speed), and in parallel do the rest on the GPU with a far larger tile size, or perhaps with no tiling at all. Really curious to see the different implementations.
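Some arithmetic on why 64x64 is about the ceiling on PS3: each SPU has only 256 KiB of local store, shared by code, stack, and data. Buffer count and pixel format below are assumptions for illustration:

```python
# Per-tile working set for an SPU deferred pass, assuming 4 bytes/pixel.
tile = 64
bytes_per_pixel = 4
tile_bytes = tile * tile * bytes_per_pixel         # 16 KiB per buffer
num_buffers = 4                                    # e.g. a 4xMRT G-buffer
working_set_kib = num_buffers * tile_bytes / 1024  # 64 KiB per tile

# Double-buffering the DMA (fetch tile N+1 while shading tile N) doubles
# that to 128 KiB -- half the 256 KiB local store gone before code and
# intermediates -- so going much past 64x64 clearly doesn't fit.

# Tile count for a 1280x720 frame (720 isn't a multiple of 64, so the
# last row of tiles is partial).
tiles_x = (1280 + tile - 1) // tile                # 20
tiles_y = (720 + tile - 1) // tile                 # 12
total_tiles = tiles_x * tiles_y                    # 240 tiles per frame
```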
Regarding MRT, yeah, I had never done 4 either, just two, with one in XDR and the other in GDDR. I was told once by another coder that if you do 4xMRT on PS3 you should put two buffers in XDR, one at the top of XDR memory and one at the bottom, and the other two in GDDR in the same way, one at the bottom and one at the top, to maximize speed. I never looked into whether that was just an old wives' tale or if there was any truth to it regarding speed benefits.
And I feel happily vindicated, as I always believed programmability and clever developers would push the envelope.
Don't feel too vindicated
You'll always be able to do more with fully programmable hardware than with fixed function; I don't think anyone ever disputed that. But fully programmable also means less performance/watt than fixed function, so it's not like it's a silver bullet. The stuff you see taking many milliseconds on SPU could be done far faster with dedicated hardware, and use less power in the process.