trinibwoy said:
How much of R580's girth is really attributable to those things we're sure to see in G80? I would think a large part of that is the memory controller and the small batch logic - two things that may not receive as much emphasis in G80's design.
I agree - NVidia is likely to take a functionality-first approach. The memory controller isn't mandated by D3D10 (rather by GDDR4), which is why I left it out. A GDDR4-specific memory controller may not be much of a design change either, since the R580 memory controller is motivated by more than just GDDR4.
The small batch support may only arise as a secondary effect, as I described above.
I also think it's worth remembering that NVidia likes to design its GPUs for OpenGL - and I have a strong suspicion that NVidia will put a lot of work into the ROPs in G80 (that "more-sophisticated methods for solving transparency" paean) and into the concept, ever intriguing to me, of unifying ROPs with texture samplers (and making the whole shebang programmable). I wouldn't be surprised to see NVidia using G80 as an opportunity to steer OpenGL in that direction. That costs transistors (though, done right, the resulting architecture could be quite space-efficient)...
Given G80's 500M+ count, is it possible that ATi would require less than 116M transistors to move from the DX9 discrete R580 to a DX10 unified R600? I would think a fair chunk of logic still remains to implement a unified architecture, even using R580 as a base.
How much of that diagram isn't already in R580 (excluding things like southbridge functionality) in some form or other? In reality, those blocks are merely more sophisticated versions of the existing blocks you'll find in R580. The Sequencer gains complexity compared with its equivalent in R580 (the Ultra Threaded Despatch Processor), but all the other blocks in that diagram have a pretty direct counterpart in R580. So then we're left asking: what does that diagram miss out? The major point is GS, which I think requires an extra buffer in a unified design (between VS and GS, though probably hidden within Shader Export).
The other big thing that's new is constant buffers, which add complexity to the register file. I also think that a GS program will make fairly complex register file accesses (up to 6 vertices per primitive, with adjacency), so the register file (which doesn't appear in the diagram) will see a big rise in complexity.
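As a back-of-the-envelope illustration of that last point - the per-vertex attribute count below is an assumption I've picked purely for the sketch, not a D3D10 limit - the input data a single GS invocation can address dwarfs what a VS invocation sees:

```python
# Rough operand-fetch model: how much input one shader invocation can address.
# ATTRS_PER_VERTEX is an assumed figure for illustration only.

ATTRS_PER_VERTEX = 8                      # assumed float4 attributes per vertex

vs_inputs = 1 * ATTRS_PER_VERTEX          # a VS invocation reads one vertex
gs_inputs = 6 * ATTRS_PER_VERTEX          # a GS invocation on a triangle with
                                          # adjacency reads six vertices

print(f"VS: {vs_inputs} float4 inputs per invocation")
print(f"GS: {gs_inputs} float4 inputs per invocation")

# On top of that, the constant buffers (D3D10 allows 16 buffers of 4096
# float4 constants per stage) have to be reachable from the same operand-fetch paths.
```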
The big difference between Xenos and R600 that I predict is that ATI will retain screen-space tiling - i.e. that R600 will be at least four mini-Xenoses, each "owning" its own batches of vertices, primitives and fragments. Only the fragments matter as far as screen-space tiling is concerned - I think the output from GS will have a shared queue, both for stream-out ordering (strict ordering is required) and for issuing to the Scan Converter.
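To make that arrangement concrete, here's a toy sketch of it - the tile dimensions, the 2x2 tile grid and the round-robin tile assignment are all assumptions of mine, for illustration only:

```python
from collections import deque

TILE_W, TILE_H = 16, 16              # assumed screen-space tile size
TILES_X, TILES_Y = 2, 2              # "at least four mini-Xenoses"

gs_output = deque()                  # single shared queue on GS output:
                                     # primitives leave in strict submission order
tile_queues = [[deque() for _ in range(TILES_X)] for _ in range(TILES_Y)]

def owning_tiles(bbox):
    """Return the (ty, tx) of every mini-Xenos whose tiles the primitive touches."""
    x0, y0, x1, y1 = bbox
    tiles = set()
    for ty in range(y0 // TILE_H, y1 // TILE_H + 1):
        for tx in range(x0 // TILE_W, x1 // TILE_W + 1):
            tiles.add((ty % TILES_Y, tx % TILES_X))   # assumed round-robin interleave
    return tiles

def drain(stream_out_buffer):
    """Consume GS output in order: stream-out first, then route to scan converters."""
    while gs_output:
        prim = gs_output.popleft()
        stream_out_buffer.append(prim)               # strict ordering preserved
        for ty, tx in owning_tiles(prim["bbox"]):    # only the fragment work is
            tile_queues[ty][tx].append(prim)         # screen-space tiled

# Example: two GS-emitted primitives, routed to whichever tiles they cover.
gs_output.append({"id": 0, "bbox": (0, 0, 20, 10)})   # spans two tiles
gs_output.append({"id": 1, "bbox": (40, 40, 50, 50)})
so_buffer = []
drain(so_buffer)
print([p["id"] for p in so_buffer])                   # [0, 1] - order preserved
```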
Jawed, if I'm reading you right, you're saying that a unified DX9 part from ATi would've been smaller than R580?
Well, Xenos practically is a unified DX9 part. It's deficient in ROPs (hence my wild estimate of 70M extra transistors to match R580's capability - so ~300M), it's missing DVI/analog circuitry (and other gubbins a desktop GPU needs, like PCI Express), and its ALU pipelines are simpler than R580's. But I'm convinced it's in the same ballpark (particularly as R580 arguably has too much ALU capacity). If you knocked 30M transistors off R580 to make it a 2:1 ALU:TEX design instead of 3:1, R580 would be ~350M against Xenos's ~320M?
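Spelling out the arithmetic behind those ballparks (the 70M ROP estimate and the 30M ALU saving are my guesses from above; 232M and 384M are the commonly quoted counts for the Xenos parent die and R580):

```python
# Back-of-the-envelope tally, all figures in millions of transistors and approximate.

xenos_parent = 232        # Xenos parent die: no desktop gubbins, light ROPs
rop_parity   = 70         # wild estimate to bring ROPs up to R580's capability
xenos_equiv  = xenos_parent + rop_parity     # ~300M "desktop-class" Xenos

r580         = 384        # R580 as shipped, 3:1 ALU:TEX
alu_saving   = 30         # dropping to 2:1 sheds roughly a third of the ALU pipes
r580_2to1    = r580 - alu_saving             # ~350M

print(f"Xenos + ROP parity : ~{xenos_equiv}M")
print(f"R580 at 2:1        : ~{r580_2to1}M")
# Same ballpark - which is the comparison being made above.
```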
My bet is that ATI will go with "light" pipelines like Xenos's, with extra complexity due to having to implement integer formats/operations. The integer stuff may, in effect, take the place of the "mini-ALU" in R580's pipeline...
Jawed