How so? What's so intensive about going through the scene graph multiple times? Even if it were, why can't you just keep a buffer of pointers to each object to be drawn, with a flag for each tile, so that you only have to traverse the graph once? Even with hundreds of thousands of objects on screen, it'd be under 1MB.
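Something like this is what I have in mind -- a minimal sketch, with all the type names made up for illustration:

    // Rough sketch of "traverse the graph once, flag per tile".
    #include <cstdint>
    #include <vector>

    struct Renderable;                    // whatever the engine's drawable is
    void drawObject(const Renderable*);   // hypothetical draw call

    struct DrawRecord {
        const Renderable* obj;   // pointer back to the object to draw
        uint32_t tileMask;       // bit N set => object overlaps tile N
    };

    std::vector<DrawRecord> drawList;     // filled in one scene-graph pass

    void issueTile(int tile) {
        for (const DrawRecord& r : drawList)
            if (r.tileMask & (1u << tile))
                drawObject(r.obj);
    }

On a console with 32-bit pointers that's 8 bytes per record, so about a hundred thousand objects fit in under 1MB.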
The way the original quote was worded made it sound like you were thinking of going through the entire graph multiple times. The data size isn't the problem; the pointer chasing is. So what if the math is only a few tens of cycles? That doesn't mean much if your data organization is such a mess to begin with that just getting to an object stalls you for several thousand (I've actually seen a few cases where people reported hitting 750,000-cycle stalls -- how they got there, I have no idea, but apparently it can happen).
Deferring job issue into per-tile queues that await actual submission isn't quite as bad, but in most cases, you can't "just add" that.
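For what it's worth, here's roughly the shape I mean -- a sketch only, with invented names (the 3 comes from 720p with 4xAA needing three tiles on Xenos):

    #include <cstdint>
    #include <vector>

    struct DrawJob { /* shader, constants, mesh handle, ... */ };

    const int kNumTiles = 3;
    std::vector<DrawJob> tileQueue[kNumTiles];

    // Called during the single scene traversal; actual issue happens
    // later, when each tile's queue is flushed to the GPU.
    void enqueue(const DrawJob& job, uint32_t tileMask) {
        for (int t = 0; t < kNumTiles; ++t)
            if (tileMask & (1u << t))
                tileQueue[t].push_back(job);
    }

Note that a job gets copied into every tile queue it touches, and everything downstream has to understand that issue happens later -- which is a big part of why you can't "just add" it.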
Probably the canonical example that comes to mind is (offline) pre-built command buffers that make drawing a static object as easy as set-your-constants-and-issue-the-buffer. It's a common leftover for a lot of people transitioning from previous gen, though it certainly isn't something you can use anymore (tiling and SPU culling both fly in the face of it). It particularly hurts with tiling because the idea back then was that too fine a granularity to your scene chunking was a bad thing, but with tiling, coarse chunks can mean a lot of wasted work: a chunk that crosses tile boundaries gets resubmitted for every tile it touches. If that chunk is simply a handful of big polygons, it's no big deal, but if it's a big chunk of lots of little polygons, all of which get re-transformed per tile, it really can bite back.
This is one of those things that happens at a lower level, and a lot of studios don't like the idea of revamping something so fundamental to their pipeline, no matter what the payoff might be. The smaller the studio, the more likely that is, I think. I know at my previous job we were kind of running into this, and all my assertions about sharing so much of our PS2/PSP design philosophy with our 360 renderpath seemed to be for naught.
The funny thing is that a lack of tiling seems to be even more prevalent among XB360 exclusives.
Hmmm... well, I don't know what to tell you there. But it may well be that devs on exclusives simply feel free to play more cheats specific to the 360.
That seems like a lot more trouble than it's worth. You can't even use the same Z-buffer, can you? Or can you resolve it and load it back in somehow? If not, it would have to be simple compositing.
It's actually not that big a deal; you can at least share a Z-prepass, since it is possible to resolve the multisample Z out to main memory and share that. We composite as well, but in our case, it's entirely separate scenes (and between those scenes, we clear the Z-buffer anyway).
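If you did want to share it, the flow would look something like this -- hypothetical pseudocode, none of these calls are the real API:

    struct Texture;

    void renderZPrepass();                     // lay down Z in EDRAM as usual
    Texture* resolveDepthToRAM();              // resolve/copy Z out to main memory
    void restoreDepthFromTexture(Texture*);    // e.g. a full-screen pass that writes depth
    void renderColorPass();

    void shareResolvedZ() {
        renderZPrepass();
        Texture* zTex = resolveDepthToRAM();
        // Later, when another scene (or another tile pass) wants the same Z:
        restoreDepthFromTexture(zTex);
        renderColorPass();
    }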
I guess the best solution is to have a fallback of traditional rendering straight to RAM. Since RAM bandwidth seems to increase at a much slower pace than transistor density, there's a good chance that by next gen the ROP cost of making that possible will be almost negligible. I think this also means that EDRAM will become a necessity to get all you can out of the billion-transistor chips we'll see next gen.
I've kind of liked the idea of having caches of much smaller working tiles. You basically keep your bandwidth savings but hide the tiling nature of the work. As long as the cache is large enough that you're not flushing tiles a million times, it can be relatively effective. Hardly perfect, but nothing ever could be.
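As a toy software model of the idea (sizes, names, and the eviction policy are all invented -- real hardware would do this in silicon):

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Tile { std::vector<uint32_t> pixels; };   // e.g. a 32x32 working tile

    class TileCache {
        std::size_t capacity_;                       // max resident tiles
        std::list<uint64_t> lru_;                    // most recently used at front
        std::unordered_map<uint64_t,
            std::pair<Tile, std::list<uint64_t>::iterator>> map_;
    public:
        explicit TileCache(std::size_t capacity) : capacity_(capacity) {}

        Tile& get(uint32_t tx, uint32_t ty) {
            uint64_t key = (uint64_t(ty) << 32) | tx;
            auto it = map_.find(key);
            if (it != map_.end()) {                  // hit: bump to front
                lru_.splice(lru_.begin(), lru_, it->second.second);
                return it->second.first;
            }
            if (map_.size() >= capacity_ && !lru_.empty()) {
                uint64_t victim = lru_.back();       // miss: flush the LRU tile
                flushToRAM(victim);
                map_.erase(victim);
                lru_.pop_back();
            }
            lru_.push_front(key);
            auto& entry = map_[key];
            entry.first.pixels.assign(32 * 32, 0);
            entry.second = lru_.begin();
            return entry.first;
        }
    private:
        void flushToRAM(uint64_t) { /* write the tile back to main memory */ }
    };

The "large enough cache" part is exactly the capacity_ knob: too small and the same tiles get flushed and refetched over and over, and the bandwidth advantage evaporates.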