Well, this is only one of the possible "workarounds", but I fail to see the huge drawbacks here. You wouldn't have to transform shadow/simple polygons twice since they don't need much space anyway. 10 bytes should be enough for screen space XYZ, and shadow maps are a separate render even. Shadow polygons already make quite a significant portion of the overall polygon count.
And if you consider culling you could end up doing even less vertex work than performing the full transform on all vertices. Plus there are other benefits to be had from having the whole scene data, like the possibility of order-independent transparency.
Sure, huge amounts of embedded memory are nice, but certainly not free either.
You're right, I shouldn't have said "twice for every vertex". However, remember that I'm describing this method as the way I think TBDR could be made feasible; I'm not saying TBDR is impossible.
If you're going to store screen space position in addition to the renderstate index, you also need either indices or lots of duplicates. Even with indices, triangle lists are sometimes used, and triangle strips will get heavily chopped at the tile boundaries. 25 bytes per vertex seems like a reasonable average to me, and that sits right in the iffy area when you consider future scaling.
Oh yeah, when I said "smaller price to pay than the alternative", I was comparing it to alternative ways of doing TBDR, not comparing to Xenos in that particular paragraph.
* How do you come up with a figure of 100 bytes per vertex?
100 bytes are almost 50% more than 68 bytes (and often you don't even need vertex color).
As nAo and SMM have attested to, it's really not that hard. Remember that the reason for using ps3.0 rather than ps2.0 for that Far Cry patch basically boiled down to the two extra iterators ps3.0 provides. 100 bytes average for iterators plus position, polygon indices, and renderstate index is very reasonable.
Having to render between 0.5M and 1M 'fat' vertices per frame would correspond to writing to and reading back from memory 50MB-100MB of data per frame, i.e. a few gigabytes per second.
It does not seem undoable to me, from a bandwidth standpoint. Unfortunately it consumes a lot of memory, but if you design a new console from scratch based on TBDR you can easily address the problem imho.
Marco
100MB * 60fps * 2 (read/write) = 12 GB/s! For a lower cost console like XB360 (which is still fairly pricey), that's a huge chunk of BW, the very thing TBDR is claimed to save. How exactly do you "easily address the problem" without increasing cost, especially the memory usage? Binning only indices and/or position is the only thing that makes sense.
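To put rough numbers on that, here is a quick sketch of the arithmetic. It assumes decimal units (1 GB = 1e9 bytes), 60 fps, and exactly one write plus one read of the binned data per frame. The function name is made up for illustration, and the 25-byte rows are only there to compare against the position-plus-index figure mentioned earlier in the thread.

```python
def binning_traffic_gb_per_s(vertices_per_frame, bytes_per_vertex, fps=60, passes=2):
    """Memory traffic in GB/s for writing binned vertex data and reading it back.

    passes=2 assumes one write and one read per frame (an assumption, not a
    measured figure for any real hardware).
    """
    bytes_per_frame = vertices_per_frame * bytes_per_vertex
    return bytes_per_frame * fps * passes / 1e9


# Vertex counts from the quoted post; 100 B is the 'fat' vertex estimate,
# 25 B is the position + indices + renderstate estimate.
for verts in (500_000, 1_000_000):
    for size in (25, 100):
        rate = binning_traffic_gb_per_s(verts, size)
        print(f"{verts:>9} verts x {size:>3} B/vertex: {rate:.1f} GB/s")
```

With 1M vertices at 100 bytes each this gives the 12 GB/s above; binning at roughly 25 bytes per vertex cuts the same scene to about a quarter of that.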
I think for this generation TBDR might have worked this way (with appropriate programming model changes), but under an equal cost scenario, it is far from certain you'd get an improvement in overall performance. Xenos is about as TBDR-esque as you could go given today's reality (porting to IMRs, IMR based engines, shader programming model, etc).