trinibwoy said:
How much of R580's girth is really attributable to those things we're sure to see in G80? I would think a large part of that is the memory controller and the small batch logic - two things that may not receive as much emphasis in G80's design.
I agree - NVidia is likely to take a functionality-first approach. The memory controller isn't mandated by D3D10 (rather by GDDR4), which is why I left it out. A GDDR4-specific memory controller may not be much of a design change either, since the R580 memory controller is motivated by more than just GDDR4.
The small batch support may only arise as a secondary effect, as I described above.
I also think it's worth remembering that NVidia likes to design its GPUs for OpenGL - and I have a strong suspicion that NVidia will put a lot of work into the ROPs in G80 (that "more-sophisticated methods for solving transparency" paean) and into the concept, ever intriguing to me, of unifying ROPs with texture samplers (and making the whole shebang programmable). I wouldn't be surprised to see NVidia using G80 as an opportunity to steer OpenGL in that direction. That costs transistors (though, done right, the resulting architecture could be quite space-efficient)...
Given G80's 500M+ count, is it possible that ATi would require less than 116M transistors to move from the DX9 discrete R580 to a DX10 unified R600? I would think a fair chunk of logic still remains to implement a unified architecture, even using R580 as a base.
How much of that diagram isn't already in R580 (excluding things like southbridge functionality) in some form or other? In reality, those blocks are merely more sophisticated versions of the existing blocks you'll find in R580. The Sequencer gains complexity compared with its equivalent in R580 (the Ultra Threaded Despatch Processor), but all the other blocks in that diagram have a pretty direct counterpart in R580. So then we're left asking: what does that diagram miss out? The major point is GS, which I think requires an extra buffer in a unified design (between VS and GS, though probably hidden within Shader Export).
The other big thing that's new is constant buffers, which add complexity to the register file. I also think that a GS program will make fairly complex register file accesses (up to 6 vertices per primitive, with adjacency), so the register file (which doesn't appear in the diagram) will see a big rise in complexity.
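As a back-of-the-envelope illustration of that last point - the per-vertex attribute count below is an assumption I've picked purely for the sketch, not a D3D10 limit - the input data a single GS invocation can address dwarfs what a VS invocation sees:

```python
# Rough operand-fetch model: how much input one shader invocation can address.
# ATTRS_PER_VERTEX is an assumed figure for illustration only.

ATTRS_PER_VERTEX = 8                      # assumed float4 attributes per vertex

vs_inputs = 1 * ATTRS_PER_VERTEX          # a VS invocation reads one vertex
gs_inputs = 6 * ATTRS_PER_VERTEX          # a GS invocation on a triangle with
                                          # adjacency reads six vertices

print(f"VS: {vs_inputs} float4 inputs per invocation")
print(f"GS: {gs_inputs} float4 inputs per invocation")

# On top of that, the constant buffers (D3D10 allows 16 buffers of 4096
# float4 constants per stage) have to be reachable from the same operand-fetch paths.
```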
The big difference between Xenos and R600 that I predict is that ATI will retain screen-space tiling - i.e. that R600 will be at least four mini-Xenoses, each "owning" its own batches of vertices, primitives and fragments. Only the fragments matter as far as screen-space tiling is concerned - I think the output from GS will have a shared queue, both for stream-out ordering (strict ordering is required) and for issuing to the Scan Converter.
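To make that arrangement concrete, here's a toy sketch of it - the tile dimensions, the 2x2 tile grid and the round-robin tile assignment are all assumptions of mine, for illustration only:

```python
from collections import deque

TILE_W, TILE_H = 16, 16              # assumed screen-space tile size
TILES_X, TILES_Y = 2, 2              # "at least four mini-Xenoses"

gs_output = deque()                  # single shared queue on GS output:
                                     # primitives leave in strict submission order
tile_queues = [[deque() for _ in range(TILES_X)] for _ in range(TILES_Y)]

def owning_tiles(bbox):
    """Return the (ty, tx) of every mini-Xenos whose tiles the primitive touches."""
    x0, y0, x1, y1 = bbox
    tiles = set()
    for ty in range(y0 // TILE_H, y1 // TILE_H + 1):
        for tx in range(x0 // TILE_W, x1 // TILE_W + 1):
            tiles.add((ty % TILES_Y, tx % TILES_X))   # assumed round-robin interleave
    return tiles

def drain(stream_out_buffer):
    """Consume GS output in order: stream-out first, then route to scan converters."""
    while gs_output:
        prim = gs_output.popleft()
        stream_out_buffer.append(prim)               # strict ordering preserved
        for ty, tx in owning_tiles(prim["bbox"]):    # only the fragment work is
            tile_queues[ty][tx].append(prim)         # screen-space tiled

# Example: two GS-emitted primitives, routed to whichever tiles they cover.
gs_output.append({"id": 0, "bbox": (0, 0, 20, 10)})   # spans two tiles
gs_output.append({"id": 1, "bbox": (40, 40, 50, 50)})
so_buffer = []
drain(so_buffer)
print([p["id"] for p in so_buffer])                   # [0, 1] - order preserved
```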
Jawed, if I'm reading you right, you're saying that a unified DX9 part from ATi would've been smaller than R580?
Well, Xenos practically is a unified DX9 part. It's deficient in ROPs (hence my wild estimate of 70M extra transistors to match R580's capability - so ~300M), it's missing DVI/analog circuitry (and other gubbins a desktop GPU needs, like PCI Express), and its ALU pipelines are simpler than R580's. But I'm convinced it's in the same ballpark (particularly as R580 arguably has too much ALU capacity). If you knocked 30M transistors off R580 to make it a 2:1 ALU:TEX design instead of 3:1, R580 would be ~350M against Xenos's ~320M?
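Spelling out the arithmetic behind those ballparks (the 70M ROP estimate and the 30M ALU saving are my guesses from above; 232M and 384M are the commonly quoted counts for the Xenos parent die and R580):

```python
# Back-of-the-envelope tally, all figures in millions of transistors and approximate.

xenos_parent = 232        # Xenos parent die: no desktop gubbins, light ROPs
rop_parity   = 70         # wild estimate to bring ROPs up to R580's capability
xenos_equiv  = xenos_parent + rop_parity     # ~300M "desktop-class" Xenos

r580         = 384        # R580 as shipped, 3:1 ALU:TEX
alu_saving   = 30         # dropping to 2:1 sheds roughly a third of the ALU pipes
r580_2to1    = r580 - alu_saving             # ~350M

print(f"Xenos + ROP parity : ~{xenos_equiv}M")
print(f"R580 at 2:1        : ~{r580_2to1}M")
# Same ballpark - which is the comparison being made above.
```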
My bet is that ATI will go with "light" pipelines like Xenos's, with extra complexity due to having to implement integer formats/operations. The integer stuff may, in effect, take the place of the "mini-ALU" in R580's pipeline...
Jawed