A pre-z pass doesn't double the bin data. The z-pass only needs a single vertex component (position), whereas the other passes need many channels (at a minimum position, UV, normal, and tangent, and sometimes additional UV and color channels). Eliminating those extra channels can also result in fewer total vertices, because it removes the vertex splits caused by seams in the normal and UV channels (on the order of 25% fewer vertices for models in our game engine).
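To make the seam point concrete, here's a rough sketch of the kind of position-welding pass I mean: vertices that were split only because their UVs/normals/tangents differ collapse back into a single position-only vertex for the z-pass. All the names here are invented for illustration, not lifted from our actual engine:

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct Vec3 { float x, y, z; };
struct FullVertex { Vec3 pos; /* normal, tangent, UVs, ... */ };

// Weld vertices that share a position but were duplicated by UV/normal
// seams. Seam-split vertices have bit-identical positions, so exact float
// comparison is enough here. The z-pass then gets smaller vertex *and*
// index data than the shading passes.
void BuildZPassBuffers(const std::vector<FullVertex>& fullVerts,
                       const std::vector<uint32_t>& fullIndices,
                       std::vector<Vec3>& zVerts,
                       std::vector<uint32_t>& zIndices)
{
    auto less = [](const Vec3& a, const Vec3& b) {
        if (a.x != b.x) return a.x < b.x;
        if (a.y != b.y) return a.y < b.y;
        return a.z < b.z;
    };
    std::map<Vec3, uint32_t, decltype(less)> remap(less);

    zIndices.reserve(fullIndices.size());
    for (uint32_t idx : fullIndices) {
        const Vec3& p = fullVerts[idx].pos;
        auto it = remap.find(p);
        if (it == remap.end()) {
            it = remap.emplace(p, (uint32_t)zVerts.size()).first;
            zVerts.push_back(p);
        }
        zIndices.push_back(it->second);
    }
}
```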
Well, as has been noted earlier in this thread, you don't necessarily need to store anything other than positions in the bins anyway (even for normal rendering). In any case, the gain from doing an in-core/tile z-pass before shading would be much greater than any cleverness in reducing your pre-z data set before sending it to the API.
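For illustration, here's a toy version of the in-core tile loop I have in mind (the edge-function rasterizer and flat Shade stand-in are purely illustrative and have nothing to do with Larrabee's actual pipeline). Depth for the whole bin gets resolved first, while the tile's z-buffer sits in cache, and only the fragments that survive ever get shaded:

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kTileSize = 64;

struct Vec2 { float x, y; };
struct Triangle { std::array<Vec2, 3> p; std::array<float, 3> z; };

struct Tile {
    int x0, y0;                           // tile origin in screen space
    float depth[kTileSize][kTileSize];    // assumed cleared to FLT_MAX
    uint32_t color[kTileSize][kTileSize];
};

static float Edge(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Visit every pixel of 'tile' covered by 'tri'; f(x, y, z) gets tile-local
// coordinates and interpolated depth. A toy edge-function rasterizer.
template <class F>
void Rasterize(const Triangle& tri, const Tile& tile, F&& f) {
    float area = Edge(tri.p[0], tri.p[1], tri.p[2]);
    if (area <= 0.0f) return;             // back-facing or degenerate
    for (int y = 0; y < kTileSize; ++y)
        for (int x = 0; x < kTileSize; ++x) {
            Vec2 c{tile.x0 + x + 0.5f, tile.y0 + y + 0.5f};
            float w0 = Edge(tri.p[1], tri.p[2], c);
            float w1 = Edge(tri.p[2], tri.p[0], c);
            float w2 = Edge(tri.p[0], tri.p[1], c);
            if (w0 < 0 || w1 < 0 || w2 < 0) continue;
            f(x, y, (w0 * tri.z[0] + w1 * tri.z[1] + w2 * tri.z[2]) / area);
        }
}

static uint32_t Shade(const Triangle&) { return 0xFFFFFFFFu; } // stand-in

// Pass 1 resolves depth for the entire bin; pass 2 shades only the pixels
// that actually won the depth test, so occluded fragments cost no shading.
void ProcessTile(const std::vector<Triangle>& bin, Tile& tile) {
    for (const Triangle& tri : bin)
        Rasterize(tri, tile, [&](int x, int y, float z) {
            if (z < tile.depth[y][x]) tile.depth[y][x] = z;
        });
    for (const Triangle& tri : bin)
        Rasterize(tri, tile, [&](int x, int y, float z) {
            if (z == tile.depth[y][x]) tile.color[y][x] = Shade(tri);
        });
}
```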
That way, you would completely eliminate the later passes, saving the bandwidth of copying and transforming the extra vertex channel data for primitives that will ultimately be completely occluded.
Any sort of occlusion query that alters your draw calls is by necessity going to be at least a frame behind GPU command buffer consumption. Thus you're not gaining much from a pre-z pass, since you might as well just use the "final" z buffer from the previous frame. Certainly hierarchical occlusion culling is very useful, but it's just as good without a pre-z pass as far as API-level draw calls are concerned.
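For reference, here's roughly the pattern I mean, sketched against the standard GL occlusion-query API (DrawBoundingBox/DrawObject are app-side stand-ins, and query creation plus the first-frame case are elided for brevity). Notice that the visibility decision is always based on the query issued a frame earlier:

```cpp
#include <GL/gl.h>
#include <GL/glext.h>

// Stand-ins for the app's own drawing code, defined elsewhere.
struct Object;
void DrawBoundingBox(const Object&);   // depth-test only, no color/depth writes
void DrawObject(const Object&);

// Double-buffered query state per object; 'query' is assumed created up
// front with glGenQueries.
struct ObjectQueries {
    GLuint query[2];
    bool   visible = true;
};

// The result we act on is always the one issued a frame earlier -- that's
// the latency I'm talking about above.
void CullAndDraw(Object& obj, ObjectQueries& q, unsigned frame)
{
    unsigned cur  = frame & 1;
    unsigned prev = cur ^ 1;

    // Read back last frame's result (ready by now, or we stall the CPU).
    GLuint samples = 0;
    glGetQueryObjectuiv(q.query[prev], GL_QUERY_RESULT, &samples);
    q.visible = samples > 0;

    // Issue this frame's query against the current z buffer.
    glBeginQuery(GL_SAMPLES_PASSED, q.query[cur]);
    DrawBoundingBox(obj);              // cheap proxy tested against depth
    glEndQuery(GL_SAMPLES_PASSED);

    if (q.visible)                     // decision uses last frame's data
        DrawObject(obj);
}
```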
I'm not sure how practical it would be to implement that on Larrabee, since the idea requires separate binning for the pre-Z and shaded passes. An application could add hints as to where the Z pass starts and ends to help out.
I'm not entirely sure what you're getting at here, but certainly if z information were known "up front", a binning renderer could avoid even dumping things into bins that were guaranteed to be entirely occluded. That said, you're only going to get a (likely small) constant factor over doing some simple occlusion culling at the scene graph level yourself. Even if you want to do something fancier, there's nothing preventing you from rendering a low-resolution image to get conservative visibility data and using that to drive rendering... if you rendered at a resolution approximately equal to the number of tiles on the screen, you'd be doing about the same thing.
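For concreteness, here's a hypothetical version of that coarse test (all names invented): one conservative, i.e. farthest, depth value per tile-sized cell, however you choose to produce it, and an object is rejected only if it sits behind that bound in every cell its screen rectangle touches:

```cpp
#include <algorithm>
#include <vector>

// One conservative (farthest) depth value per screen-tile-sized cell,
// e.g. from rendering big occluders at tile resolution or from
// max-downsampling a depth buffer.
struct CoarseDepth {
    int w, h;                    // grid dimensions ~= tiles on screen
    std::vector<float> z;        // farthest depth per cell
    float At(int x, int y) const { return z[y * w + x]; }
};

// Conservative reject: the object can be skipped only if its nearest depth
// is behind the stored far depth in *every* cell its screen rect touches.
bool IsOccluded(const CoarseDepth& grid,
                int x0, int y0, int x1, int y1,   // rect in grid cells
                float objNearZ)                   // object's closest depth
{
    x0 = std::max(x0, 0);          y0 = std::max(y0, 0);
    x1 = std::min(x1, grid.w - 1); y1 = std::min(y1, grid.h - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (objNearZ < grid.At(x, y))
                return false;      // potentially visible in this cell
    return true;
}
```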
One possibility would be to do a pre-Z pass solely for the purpose of generating a hierarchical occlusion map. In that case, you could store the resulting Z-buffer at a lower resolution than the actual frame buffer (no MSAA, plus half or quarter resolution). It would be good enough for early rejection and would reduce the bandwidth requirements dramatically.
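A quick sketch of how such a map might be built: a max-reduction mip chain over the low-resolution z-pass result, so a single coarse texel conservatively bounds the farthest depth of a whole screen region (power-of-two dimensions assumed, names invented):

```cpp
#include <algorithm>
#include <vector>

// Build a hierarchical occlusion map from a (low-resolution) z-pass result:
// each mip level stores the *farthest* depth of its 2x2 children, so one
// lookup at a coarse level gives a conservative occlusion bound for a whole
// screen region.
std::vector<std::vector<float>> BuildHiZ(std::vector<float> level0,
                                         int w, int h)
{
    std::vector<std::vector<float>> mips{std::move(level0)};
    while (w > 1 && h > 1) {
        const std::vector<float>& src = mips.back();
        int nw = w / 2, nh = h / 2;
        std::vector<float> dst(nw * nh);
        for (int y = 0; y < nh; ++y)
            for (int x = 0; x < nw; ++x) {
                float a = src[(2 * y)     * w + 2 * x];
                float b = src[(2 * y)     * w + 2 * x + 1];
                float c = src[(2 * y + 1) * w + 2 * x];
                float d = src[(2 * y + 1) * w + 2 * x + 1];
                dst[y * nw + x] = std::max(std::max(a, b), std::max(c, d));
            }
        mips.push_back(std::move(dst));
        w = nw; h = nh;
    }
    return mips;
}
```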
Indeed I was suggesting something similar above, except that I suspect you can get the majority of the benefit by just doing it in "software" and culling more coarsely (objects vs triangles for instance).
Also note that the Larrabee paper shows that vertex processing is practically free and rasterization isn't really that expensive either. Certainly these trade-offs change from game to game, but part of the reason pre-z is such a win on current GPUs is that pixel shading costs so much more than re-transforming and re-rasterizing the entire scene... to that end, just looping over your bin queue - or even better IMHO, just using deferred shading! - is going to be more than enough.
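Since I brought up deferred shading: the appeal in this context is simply that shading cost becomes a function of resolution rather than overdraw, which buys you most of what a pre-z pass buys. A minimal sketch of the idea, with an invented G-buffer layout:

```cpp
#include <cstdint>
#include <vector>

// Toy G-buffer layout: the geometry pass writes only compact, depth-tested
// surface attributes; lighting then runs exactly once per visible pixel,
// so overdraw never touches the expensive shading.
struct GBufferPixel {
    float    depth;
    uint32_t packedNormal;   // e.g. octahedral-encoded normal
    uint32_t materialId;     // index into a material table
};

struct Framebuffer {
    int w, h;
    std::vector<GBufferPixel> gbuf;
    std::vector<uint32_t>     color;
};

// Stand-in lighting function; a real one would fetch lights and materials.
static uint32_t ShadePixel(const GBufferPixel& p) { return p.materialId; }

// After the geometry pass has filled 'gbuf', shading cost depends only on
// resolution, independent of how much the scene overdraws itself.
void LightingPass(Framebuffer& fb)
{
    for (int i = 0; i < fb.w * fb.h; ++i)
        fb.color[i] = ShadePixel(fb.gbuf[i]);
}
```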
I'm just throwing some ideas out there. I think the possibilities of software rendering are awesome!
Yeah, all good ideas and I'm looking forward to seeing what people come up with when they get their hands dirty!
PS: I would be remiss if I didn't note that if you're into this sort of stuff, our group at Intel (Advanced Rendering Team) is hiring! Please feel free to fire me a PM if you or any of your smart friends are interested. Also, if anyone is going to SIGGRAPH next week and wants to chat, please fire me a PM as well.