The G92 Architecture Rumours & Speculation Thread


Arun

Rumoured Data Points
- Evolutionary step of the G8x architecture.
- Supports PCI Express 2.0 (and DisplayPort?).
- "Close to 1TFlop in 4Q07" according to NVIDIA Exec.
- FP64 support confirmed (for Tesla only, not GeForce).
- First iteration will be on 65nm; next ones might be 55nm.
- Few details are actually known (or rumoured) about the arch.

Noteworthy Internet Rumours
"NVIDIA confirms Next-Gen close to 1TFlop in 4Q07" [Beyond3D]
"Nvidia's next generation is G92" [Fudzilla]

Thread Discussion Starting Points
- Does all of this rumoured information seem reliable to you? What, if anything, sounds fishy?
- Do you believe the ratios of the different units have been changed? How so?
- What memory bus width and memory speeds are we expecting? And at what price?

- G8x's ALU ratio is a fair bit lower than R6xx's - do you think this will be changed?
- What modifications are you expecting in the ALUs? G8x ones were semi-custom, could this be fully custom?
- The G80 derivatives got rid of 'free trilinear' - do you think this will also be the case for all G9x?
- Do you think there will be some reuse between this and NVIDIA's future handheld architectures?

- Outside the 3D architecture per se, do you think the video core will be refreshed? Do you expect there to be a separate NVIO again for desktops?

Last Updated on August 9th
 
I read somewhere that the co-issue design is to be altered (MAD+ADD). :rolleyes:

Anyway, about the ALU:TEX ratio, I think NV should take the easiest route, e.g. tuning the PLL of the SP domain to get a higher clock for the shaders, if the 65/55nm process is good enough.

DP processing is the curious one: will they introduce a new monolithic double-wide ALU design, or a fused one, so that SP throughput [for games] could be doubled somehow?
 
The shader clock domain isn't on a separate PLL which can run freely; rather, it's a multiplier of the scheduler clock. That's likely to continue (and reminds me that I need to shoot Tony Tamasi for something he said to me on editor's day for the G80 launch!).

SP speed being doubled assumes that DP runs at full speed for the clock, which I'm almost certain isn't the case. My guess is that DP runs at a reduced rate because of datapath limits for operand fetch or the write path.

As for the SFU and SP ALU structure changing (for the better, per clock), maybe on G92 but possibly not on other G9x implementations. I ran through some basic area/clock bits and pieces with someone earlier, so maybe this time around I'll publish a 'proper' guess at some point, to see how close the final architecture is.

The memory bus width has been a bone of contention internally (Arun notably disagrees with what I think), but that pre-launch process has started again (we started a lot further out last time, if memory serves).

The rest of the proposed discussion points I'll leave alone :devilish:
 
- Do you believe the ratios of the different units have been changed? How so?
Yes. I think we will see a slightly higher ALU:TEX ratio... maybe 25%+


- What modifications are you expecting in the ALUs? G8x ones were semi-custom, could this be fully custom?
Not sure I'm expecting any to be honest... more of the same?

- The G80 derivatives got rid of 'free trilinear' - do you think this will also be the case for all G9x?
Probably.

- Do you think there will be some reuse between this and NVIDIA's future handheld architectures?
Not significantly.

- What memory bus width and memory speeds are we expecting? And at what price?
I'll put it in percentages of how likely I think each is:
90% - 384-bit @ $500 flagship
10% - 512-bit @ $550 flagship
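For what it's worth, the bandwidth arithmetic for those two options is easy enough to sanity-check. A quick sketch (the 1GHz memory clock is just a placeholder assumption for illustration, not a rumoured figure):
[code]
// Back-of-envelope bandwidth for the two bus-width scenarios above.
// The 1.0GHz GDDR clock is an assumption for illustration, not a rumour.
#include <cstdio>

int main() {
    const double mem_clock_ghz = 1.0;                  // assumed GDDR3/4 clock
    const double gbits_per_pin = mem_clock_ghz * 2.0;  // DDR: 2 bits/pin/clock
    const int widths[] = {384, 512};                   // bus widths from above

    for (int w : widths) {
        double gb_per_s = (w * gbits_per_pin) / 8.0;   // total Gbit/s -> GB/s
        printf("%d-bit @ %.1fGHz: %.0f GB/s\n", w, mem_clock_ghz, gb_per_s);
    }
    return 0;
}
[/code]
So roughly 96GB/s vs 128GB/s at that clock - the question is whether the extra pins and PCB cost are worth about a third more bandwidth at the flagship price point.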
 
6 TPCs @ 1.6GHz, about 9xxx 3DMarks.

Dual chip (die?) for the "close to 1TFlop" card.
 
I think G92 will be 6 clusters and 4 ROP partitions, with FP64 support and maybe some other improvements (TAUs = TMUs, like G84/G86).

This should result in a 200-250mm² die, which could be sold with the cheaper 256-bit PCB for a very good price (~$200).

Performance should also be very good with the clock gains from 65nm: I expect 600-700MHz (ROPs/TMUs) and 1.5-2GHz (SPs), and memory could be 1-1.2GHz GDDR3/4. So it should outdo the 8800GTS and come close to the GTX.

In the enthusiast segment I think we'll see a dual-GPU SKU, which has already been announced for Tesla, so I see no problem with it also coming to GeForce:
These 1U units will support either 4 GPUs or 8 GPUs; the 8-core version will be using dual-GPU boards that are not yet available but will be soon
http://www.pcper.com/article.php?aid=424&type=expert
This should be the 1 TFLOP solution.
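Running the numbers on that: a rough sketch of the peak-FLOPS arithmetic, assuming 16 SPs per cluster and the usual MADD+MUL counting of 3 flops/SP/clock - both carried over from G80, since nothing is confirmed for G92 (the 6 clusters and 1.5-2GHz shader clock are from my guesses above):
[code]
// Back-of-envelope peak FLOPS for a dual-GPU board built from the G92 guess above.
// 16 SPs/cluster and 3 flops/SP/clock (MADD+MUL) are assumptions carried over
// from G80, not confirmed G92 specs.
#include <cstdio>

int main() {
    const int gpus = 2;
    const int clusters = 6;
    const int sps_per_cluster = 16;     // assumption (G80-like cluster)
    const int flops_per_sp_clk = 3;     // MADD + MUL, as usually counted on G80
    const double shader_clocks_ghz[] = {1.5, 2.0};

    for (double clk : shader_clocks_ghz) {
        double gflops = gpus * clusters * sps_per_cluster * flops_per_sp_clk * clk;
        printf("%.1fGHz shader clock -> ~%.0f GFLOPS dual-GPU\n", clk, gflops);
    }
    return 0;
}
[/code]
That lands at roughly 860-1150 GFLOPS for the dual-GPU board, so a "close to 1TFlop" claim does fit this kind of configuration.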
 
Unsurprisingly, it's incredibly tempting to focus just on G92 - I'd like to try to get the discussion to be a bit more about G9x in general though, but heh! :) I guess starting that G92 SKU thread might be a good idea now...

Anyway, something I like to ponder upon is whether NVIDIA wants to offload more and more to the shader core. Of course, it is hard to predict when that will happen and for what parts of the pipeline, but the obvious short/mid-term candidates are:
- Triangle setup (~100 flops/triangle).
- Downsampling (~2 flops/sample).
- Blending (~20 flops/pixel + RMW).

There are tons of other candidates, but the above ones are probably the only ones that make potential sense for the G9x/R7xx generation. And heck, maybe none of that will happen.

However, I find it very tempting to presume that triangle setup will be done in the shader core, as this could make the Z/Shadow-passes just ridiculously fast. Blending is a bit harder because of the RMW, but it's nothing astonishing either, it just requires a bit of locking at the ROP or memory controller level. And downsampling, well, it shouldn't be too hard either if optimized for properly but it's also much less important.
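To put a rough number on the downsampling case: here's a toy sketch of what a shader-based 4x box-filter resolve amounts to per output value, written as a CUDA-style kernel purely for illustration (the sample layout and names are made up; this says nothing about how G8x actually stores its samples):
[code]
// Toy shader-based MSAA resolve: 4x box filter, one multiply-add per sample,
// i.e. roughly the ~2 flops/sample mentioned above (per colour channel).
// Sample layout and names are illustrative only.
__global__ void resolve4x(const float* samples,   // 4 samples per pixel, interleaved
                          float* resolved,        // 1 value per pixel
                          int num_pixels)
{
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel >= num_pixels)
        return;

    const float* s = samples + pixel * 4;
    float acc = 0.0f;
    for (int i = 0; i < 4; ++i)
        acc += s[i] * 0.25f;    // one multiply-add per sample
    resolved[pixel] = acc;
}
[/code]
Nothing scary there, which is rather the point - the resolve itself is trivial ALU work, and it's the bandwidth of reading all the samples that matters.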

I could definitely see G9x being a small iterative improvement over G80, but with a higher ALU-TEX ratio and more work being offloaded to the ALUs to improve overall scalability. The ROPs would need to be redesigned with that taken into consideration too, obviously.
 
Offloading some fixed-function tasks to the shaders and investing the saved transistors in SPs would also make sense in connection with GPGPU (Tesla).

But will NV take this step already with G92? At the moment I believe it is more of a low-risk interim solution; the real changes will perhaps come with the next high-end core in H1 2008 (G90/G100?).
 
- Triangle setup (~100 flops/triangle).
Well attribute interpolation is already in the shader core, so why not, eh? I notice from the latest D3D10 presentations from SIGGRAPH a suggestion that setup be performed in GS - see page 59 of "2 - DX10 Pipeline.pdf". Talks about higher-order interpolation and other fancy stuff to produce higher quality results rather than relying upon the default behaviour of the D3D10 pipeline.

- Downsampling (~2 flops/sample).
Required by D3D10 already. I presume you mean AA resolve downsampling, otherwise not sure what you mean.

- Blending (~20 flops/pixel + RMW).
I think that's the one NVidia will leave as late as possible. Though it has to be said, getting RMW in early would be a boon for CUDA programming, where it's otherwise quite a logjam as it currently stands (once PDC space runs out, Stores/Loads are uncached as far as I can tell).

To me this is a "latency-hiding" problem - StreamOut simplifies the issue by being write-only, but appears to still incur a latency penalty for VS/GS code that uses SO, during SO writing. And SO has to complete before the buffer being written can be consumed by VS on the next pass. So, in all, SO is a similar problem, introducing latency during the write and requiring strict ordering.

So, ahem, you could argue that blending will always be stuck with the latency/ordering problem. Maybe this is where we'll start seeing huge caches on die. Don't G80's L2 caches, one per ROP/MC partition, provide this functionality already?

There are tons of other candidates, but the above ones are probably the only ones that make potential sense for the G9x/R7xx generation. And heck, maybe none of that will happen.

However, I find it very tempting to presume that triangle setup will be done in the shader core, as this could make the Z/Shadow-passes just ridiculously fast.
ATI hardware has twice the setup rate per clock of NVidia hardware, doesn't it? So, NVidia's motivation might be different from ATI's...

Using a GS to perform triangle setup, it should be possible for you to test your assertion about z-/shadow-pass rendering on G80.

If the Control Point Shader block is coming soon (D3D11?) - feeding into a fixed-function tessellator - might that take priority over Setup becoming programmable?

Jawed
 
I think that's the one NVidia will leave as late as possible. Though it has to be said, getting RMW in early would be a boon for CUDA programming, where it's otherwise quite a logjam as it currently stands (once PDC space runs out, Stores/Loads are uncached as far as I can tell).
G84, which NVIDIA has termed Compute 1.1-compatible, has a whole array of atomic functions. I imagine G92 will be at least 1.1 compatible. Take that as you will...
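For those who haven't poked at it: what compute 1.1 adds over 1.0 is a set of atomic read-modify-write intrinsics on 32-bit words in global memory (atomicAdd, atomicMin, atomicExch and friends). A trivial illustration of the kind of RMW that enables - my own toy example, not anything from NVIDIA's docs:
[code]
// Trivial use of the compute 1.1 global-memory atomics: many threads bump
// shared histogram bins without clobbering each other. Toy example only.
__global__ void histogram(const unsigned int* values, int n,
                          unsigned int* bins, int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    unsigned int bin = values[i] % num_bins;
    atomicAdd(&bins[bin], 1u);   // atomic read-modify-write in global memory
}
[/code]
It gives you correctness for RMW, but each atomic is presumably still a round trip towards the memory controller, so it doesn't by itself solve the latency side that Jawed is describing.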
 
G84, which NVIDIA has termed Compute 1.1-compatible, has a whole array of atomic functions. I imagine G92 will be at least 1.1 compatible. Take that as you will...
Aha, so that might be all they do for a while then...

Jawed
 
Required by D3D10 already. I presume you mean AA resolve downsampling, otherwise not sure what you mean.
Yeah, he means the AA downfilter, but I think he's wondering if it'll be exclusively performed in the shader core and nowhere else on the chip.
 
At the moment I believe it is more of a low-risk interim solution; the real changes will perhaps come with the next high-end core in H1 2008 (G90/G100?).
There is no next high-end core in H1 2008. G92 is all you'll get for at least 9 months, and most likely more than that. If G9x is a low-risk incremental update, then there probably won't be a larger refresh until 4Q08.

Jawed said:
Well attribute interpolation is already in the shader core, so why not, eh?
Well, that has been the case basically forever! :) It's hardly new to DX10 hardware (although I'm not 100% sure when on-demand interpolation was introduced).

Triangle setup in the shader core is already done in Intel's latest IGPs, and has been described in an ATTILA paper. The primary advantage is the performance improvements for Z-only passes, including shadowmap generation... Of course, reducing the amount of fixed-function logic is also nice.
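For anyone who hasn't stared at a rasterizer lately, a little sketch of what the per-triangle work actually is - classic edge-equation (half-space) setup, which is textbook maths rather than anything specific to Intel's IGPs or to whatever NVIDIA might build. Counting the multiplies and adds over three edges, plus the Z/1/w plane equations and a bounding box, is how you end up in the region of the ~100 flops/triangle figure from earlier:
[code]
// Illustrative per-triangle setup: edge equations E(x,y) = A*x + B*y + C for
// half-space rasterization. Textbook formulation only - not a description of
// any real G9x unit.
struct EdgeEq { float A, B, C; };

__device__ EdgeEq edgeEquation(float x0, float y0, float x1, float y1)
{
    EdgeEq e;
    e.A = y0 - y1;
    e.B = x1 - x0;
    e.C = x0 * y1 - x1 * y0;    // 2 muls + 3 adds/subs per edge
    return e;
}

__global__ void setupTriangles(const float3* verts,   // 3 vertices per triangle
                               EdgeEq* edges,          // 3 edge equations per triangle
                               int num_tris)
{
    int tri = blockIdx.x * blockDim.x + threadIdx.x;
    if (tri >= num_tris)
        return;

    float3 v0 = verts[tri * 3 + 0];
    float3 v1 = verts[tri * 3 + 1];
    float3 v2 = verts[tri * 3 + 2];

    edges[tri * 3 + 0] = edgeEquation(v0.x, v0.y, v1.x, v1.y);
    edges[tri * 3 + 1] = edgeEquation(v1.x, v1.y, v2.x, v2.y);
    edges[tri * 3 + 2] = edgeEquation(v2.x, v2.y, v0.x, v0.y);
    // A full setup step would also produce the Z (and 1/w) interpolation planes
    // and a screen-space bounding box, which is where the flop count climbs.
}
[/code]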

Jawed said:
Required by D3D10 already. I presume you mean AA resolve downsampling, otherwise not sure what you mean.
Well, what's required by D3D10 is that the application can downsample the AAed buffer itself if it sees fit. By default, it is fixed-function hardware doing that anyway on G80, AFAIK.
Jawed said:
So, ahem, you could argue that blending will always be stuck with the latency/ordering problem. Maybe this is where we'll start seeing huge caches on die. Don't G80's L2 caches, one per ROP/MC partition, provide this functionality already?
Well, the L2 caches are hardly 'huge' - it's just 128KiB, or 1/4th of the size of the register file! However, you make a good point: having the data on-die would clearly simplify RMW, as the latency would be less of an issue.

However, eDRAM-like approaches cannot *guarantee* that the data is always on-die, unless the amount of eDRAM is just ridiculously big or you use tiling. So it most likely would just make the average latency lower - which isn't bad by itself, I'll admit.

ATI hardware has twice the setup rate per clock of NVidia hardware, doesn't it? So, NVidia's motivation might be different from ATI's...
Well, yeah - improving the triangle setup unit *somehow* is more urgent for NVIDIA than for ATI, since it risks being more of a bottleneck. Just offloading it to the shader core seems like the logical path for them to take, but they might just improve its throughput via more traditional means.

Using a GS to perform triangle setup, it should be possible for you to test your assertion about z-/shadow-pass rendering on G80.
You can't really bypass triangle setup if you want rasterization hardware to work in D3D10 - so in terms of performance measurements, this shouldn't tell us what we want to know... :(

Given the Z performance of the G80, however, and the fact that previous-gen hardware (with the same triangle setup characteristics afaik!) was sometimes already triangle setup limited in Z/Shadow passes... Well, it should be kind of obvious that this is a very real bottleneck on G80, and less so on G84.
 
Yeah, he means the AA downfilter, but I think he's wondering if it'll be exclusively performed in the shader core and nowhere else on the chip.
:oops: That makes sense.

Reading about D3D10.1 changes:
  • user-programmable sample masks
  • MSAA depth-reading
  • per-render target blend mode
implies there's a meaty chunk of new output merger functionality. Perhaps it's enough to justify cutting-over to shader-based OM?

Presumably reading depth is the big one? But if that's only possible after a state change, then, erm, maybe not? I'm lost...

Jawed
 
Well, that has been the case basically forever! :) It's hardly new to DX10 hardware (although I'm not 100% sure when on-demand interpolation was introduced).
Well it's interesting that ATI has stuck with its fixed function Shader Processor Interpolators. As far as I can tell these are "driver programmable" ALU blocks, running variable-length interpolation programs, dumping results to a block of attribute memory for the pixel shaders to consume. They aren't radically different from shader ALUs, I suppose. ATI GPUs appear to have multiple SPI blocks running in parallel (if the ALU redundancy patent application is to be believed), so I guess they scale with the desired triangle throughput, rather than merely scaling with clock rate.

I suppose it's a bit like the question, "do you have fixed function TAs? or do you run this as shader instructions?" Why did NVidia make the TAs in G80 fixed function?

Triangle setup in the shader core is already done in Intel's latest IGPs, and has been described in an ATTILA paper. The primary advantage is the performance improvements for Z-only passes, including shadowmap generation... Of course, reducing the amount of fixed-function logic is also nice.
Interestingly, you could argue that a single batch of primitives being set up would mesh quite nicely with a post-transform vertex cache - they'd both be "about the same size".

I need to read the ATTILA stuff closely - I keep forgetting it :oops:

Well, the L2 caches are hardly 'huge' - it's just 128KiB, or 1/4th of the size of the register file!
I didn't put that very well :oops: I wasn't trying to imply that G80's L2 is huge - but if a huge L2 for RMW is coming, then that's where it would be. I guess.

However, you make a good point: having the data on-die would clearly simplify RMW, as the latency would be less of an issue.

However, eDRAM-like approaches cannot *guarantee* that the data is always on-die, unless the amount of eDRAM is just ridiculously big or you use tiling. So it most likely would just make the average latency lower - which isn't bad by itself, I'll admit.
Yeah, in graphics "caches" tend to find the "right size" extremely readily. So an entire set of render targets will always be out of scope for L2. Anyway, some would argue that the RMW penalty of graphics is what keeps it honest, what enables it to be embarrassingly parallel.

Well, yeah - improving the triangle setup unit *somehow* is more urgent for NVIDIA than for ATI, since it risks being more of a bottleneck. Just offloading it to the shader core seems like the logical path for them to take, but they might just improve its throughput via more traditional means.
How much more performance do you think NVidia needs here? 10s of % will come with a clock boost. Orders of magnitude (to match the zixel rate of G80?) may be a step too far?

I don't know how bad the mismatch is with G80. Arguably G92's mismatch would be lower anyway (if it has fewer ROPs), and if 2x G92 is the new enthusiast part, then AFR takes care of this.

You can't really bypass triangle setup if you want rasterization hardware to work in D3D10 - so in terms of performance measurements, this shouldn't tell us what we want to know... :(
Whoops, yeah, that's going the wrong way on triangle count. Not a good idea.

Jawed
 
Presumably reading depth is the big one? But if that's only possible after a state change, then, erm, maybe not? I'm lost...
Being able to bind depth in a view in the shader with MSAA on, plus the per-RT blend mode, should both be render-state changes/considerations at the application level. The latter actually seems more expensive to me at the hardware level, especially with the increase in the number of possible RTs that comes with D3D10. I might be wrong, though, since accessing depth while compression is on is hardly trivial.
 
However, eDRAM-like approaches cannot *guarantee* that the data is always on-die, unless the amount of eDRAM is just ridiculously big or you use tiling.
Why not? Modern hardware is effectively tiling anyway as a pixel batch makes its way out of the ROP. Cache increases to improve blending performance shouldn't have to be that extravagant? I don't see the OM being any more programmable than it has to be for this architecture anyway, and thus still fixed hardware.

The front end of the chip seems to benefit more from the general idea than helping to get data out at the end.
 