The G92 Architecture Rumours & Speculation Thread

ATI hardware has twice the setup rate per clock of NVidia hardware, doesn't it? So, NVidia's motivation might be different from ATI's...

AFAIK, the setup rate per clock of G80 is the same as R600's; G80 can reach 0.75 triangles per clock in one old benchmark (OpenGL Geometry Benchmark 1.0).
 
Being able to bind depth in a view in the shader with MSAA on, plus the per-RT blend modes, should both be render-state changes/considerations at the application level.
I presume that all 8 (or fewer) RTs are accessible within the same render state - if they were separated by state then they wouldn't be accessible in parallel.

The latter actually seems more expensive to me at the hardware level, especially with the increase in the number of possible RTs that comes with D3D10. I might be wrong, since accessing depth while compression is on is hardly trivial.
Presumably colour and depth/stencil are simply "closed" at the end of the write state and bound as readable surfaces for consumption. So, much like D3D10 provides the colour buffer with AA samples intact (which requires "decoding" AA compression), providing depth requires the compression to be decoded first.

I suppose what makes it more interesting is that each RT can have its own depth surface, can't it?

Jawed
 
AFAIK, the setup rate per clock of G80 is the same as R600's; G80 can reach 0.75 triangles per clock in one old benchmark (OpenGL Geometry Benchmark 1.0).
Hmm, I'm not sure why I thought G80 was slower per clock than it actually is. G7x is half-rate, if I recollect correctly. So, NVidia has already come a long way. Presumably it could double again... Perhaps G80's setup rate (its shortfall with respect to its zixel rate) is a victim of die-crunch?

Jawed
 
I suppose what makes it more interesting is that each RT can have its own depth surface, can't it?
Each RT can bind the depth buffer when you create its view, if that's what you mean.
Hmm, I'm not sure why I thought G80 was slower per clock than it actually is.
I remember writing that I could only measure 0.5 tri/clk; maybe you're remembering the article here. It's quoted as 1/clk, though, and you can get close (we have done since). You're right, though, that it's one of the obvious candidates for yet more improvement in G9x.
 
Is there any definite proof as to why people are saying that the chip coming in Nov isn't the top-of-the-line part?

I mean, look at the top-of-the-line parts' release dates:

GeForce 7800 GTX launched on June 22, 2005
~8 months
GeForce 7900 GTX launched on March 9, 2006.
~8 months
GeForce 8800 GTX launched on November 8, 2006
~6 months
GeForce 8800 Ultra launched on May 2, 2007
? ~6 months ?
? 9800 GTX launch early Nov ?

Without the large-scale refresh that the 7900 was to the 7800 (i.e. a process change), couldn't their resources have been put towards bringing out the new top-end card before the end of this year? I know my reasoning above is a leap of logic, but it stands to reason that they could bring out a whole new card sooner if they didn't have to move process within the same family.
 
Is there any definite proof as to why people are saying that the chip coming in Nov isn't the top-of-the-line part?

I mean, look at the top-of-the-line parts' release dates:

GeForce 7800 GTX launched on June 22, 2005
~8 months
GeForce 7900 GTX launched on March 9, 2006.
~3 months
GeForce 7950 GX2 launched on June 5, 2006.
~5 months
GeForce 8800 GTX launched on November 8, 2006
~6 months
GeForce 8800 Ultra launched on May 2, 2007
? ~6 months ?
? 9800 GTX launch early Nov ?
;)

To come back to your question: the name already indicates that G92 is not the highest chip of its line; G90 would be that (Gx0 > Gx2 > Gx4 > Gx6 > Gx8).
But maybe the real G90 was canceled after NV saw what R600 is and what ATi can make out of that design, or it will come alongside G94 and G96 in Spring 2008 on 55nm. (NV42 -> G7x ;))

I still believe the single-GPU G92 SKU will be sold as a midrange solution and enthusiasts will get a dual-GPU SKU (see the Tesla announcement).

Arun said:
There is no next high-end core in H1 2008. G92 is all you'll get for at least 9 months, and most likely more than that. If G9x is a low-risk incremental update, then there probably won't be a larger refresh until 4Q08.

So G92 is already DX10.1? :???:
 
But maybe the real G90 was canceled after NV saw what R600 is and what ATi can make out of that design
Or maybe Jawed is right: G80 was slightly redesigned in early 2006 and became too close in performance to the former G90 design (?).

Some old rumours said that G80 was originally designed as a 256-bit solution. I don't know if it's true, but in that case G92 could have a pretty similar configuration to the "old rumoured" G80: a 256-bit bus and single-TF TMUs.
 
So, is it possible that G92 is a cut-down version of G80 (6 TPCs) with higher clock speeds and improved utilisation of the 128 1D ALUs?
 
;)

To come back to your question: the name already indicates that G92 is not the highest chip of its line; G90 would be that (Gx0 > Gx2 > Gx4 > Gx6 > Gx8).
But maybe the real G90 was canceled after NV saw what R600 is and what ATi can make out of that design, or it will come alongside G94 and G96 in Spring 2008 on 55nm. (NV42 -> G7x ;))

I still believe the single-GPU G92 SKU will be sold as a midrange solution and enthusiasts will get a dual-GPU SKU (see the Tesla announcement).



So G92 is already DX10.1? :???:


Don't be so sure. Nvidia could be playing a mind game like they did before, e.g. replacing "NV50" with G80, claiming they weren't going unified, etc.
They could try to pull some of the same type of confusion with G92.

Perhaps it is a performance part. This would go with the rumours of G100 going up against R700.

Kind of a big gap there between the 8600 and 8800.
 
I'm going to guess it will be a 256-bit bus part with support for GDDR5 (even though I'm not convinced it will ship with that). 6 clusters with 4 ROP partitions sounds about right, but I'm leaning towards the 6 clusters either having 4 sets of 8 ALUs, or possibly 2 sets of 16-wide ALUs.

Also, while I initially liked the sound of moving blending into the shader core... isn't there a pretty big advantage to having ALUs for RMW right next to the memory controller/caches? For instance, it seems like it might be a bit of a latency problem to have the M part of that shipped off to the shader array and contending for resources with VS/GS/PS. Also, if you do RMW from cluster-local caches, then it seems to me you are going to have to deal with cache coherency issues between clusters, or constrain how threads are assigned to clusters (e.g. statically partition framebuffer tiles across clusters, as in the sketch below).
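
To illustrate the static-partitioning option, a toy sketch in CUDA-flavoured C++ - tile size, cluster count and the hash are all made up by me:
Code:
// Toy sketch only: if every screen tile is owned by exactly one cluster, a
// blend RMW can be serviced entirely from that cluster's local cache and no
// cross-cluster coherency traffic is ever needed.
int owningCluster(int x, int y, int tileSize, int numClusters)
{
    int tileX = x / tileSize;
    int tileY = y / tileSize;
    // Interleave tiles across clusters; real hardware would presumably pick a
    // hash that also balances load across the memory channels.
    return (tileX + tileY * 37) % numClusters;   // 37: arbitrary odd stride
}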

[And, how expensive is this blending HW really? I mean once you strip out all the specialized Z/stencil/color compression stuff, which you aren't likely to be able to toss over the side, is it really such a huge transistor hog?]

I dunno, we've been hearing forever about how shaders that can do things like read the current framebuffer value and then write back are hard to implement with good performance... I guess I'd like to understand a bit more in detail why this is suddenly a good idea/workable from a HW perspective :).

BTW - did that whole conference call where the 1TF thing was mentioned explicitly say 1TF on a chip?
 
However, I find it very tempting to presume that triangle setup will be done in the shader core, as this could make the Z/Shadow-passes just ridiculously fast. Blending is a bit harder because of the RMW, but it's nothing astonishing either, it just requires a bit of locking at the ROP or memory controller level. And downsampling, well, it shouldn't be too hard either if optimized for properly but it's also much less important.
Why do you assume moving triangle setup to the shader will result in a performance improvement? It's likely that there are other data paths outside of the setup engine that limit performance to its current rate.
 
Double precision support?
At SIGGRAPH, three different tech talks/tutorials I went to on CUDA each mentioned "be careful, future hardware will support doubles, so make sure to use literal floats in code that you want to stay floats." There was no mention of specific hardware or timeframe, but the warnings were independent and repeated, so it's quite likely to be a change pretty soon, not in years.
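
A minimal CUDA sketch of the pitfall they were warning about (the kernel itself is just an invented example):
Code:
// On G80 the compiler silently demotes "double" to float, so both lines below
// behave identically today. On future double-capable hardware the unsuffixed
// 0.5 would quietly turn the expression into (slower) FP64 arithmetic.
__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 0.5f;    // stays FP32 on any hardware
        // data[i] = data[i] * 0.5;  // double literal: FP32 today, FP64 later
    }
}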
 
At SIGGRAPH, three different tech talks/tutorials I went to on CUDA each mentioned "be careful, future hardware will support doubles, so make sure to use literal floats in code that you want to stay floats." There was no mention of specific hardware or timeframe, but the warnings were independent and repeated, so it's quite likely to be a change pretty soon, not in years.
G92 will support double precision on the Tesla models (and maybe high-end Quadros). We've known this since the Tesla launch ("What's the timeframe for double precision?" "Before the end of the year on Tesla." "Hooray!").
 
G92 will support double precision on the Tesla models (and maybe high-end Quadros). We've known this since the Tesla launch ("What's the timeframe for double precision?" "Before the end of the year on Tesla." "Hooray!").

That has already been public since February '07 -> CUDA F.A.Q.
>Q: Does CUDA support Double Precision Floating Point arithmetic?

>A: CUDA supports the C "double" data type. However on G80
> (e.g. GeForce 8800) GPUs, these types will get demoted to 32-bit
> floats. NVIDIA GPUs supporting double precision in hardware will
> become available in late 2007.
http://developer.download.nvidia.co..._CUDA_SDK_releasenotes_readme_win32_linux.zip
;)

But I do not expect more than a checklist feature.
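
At least it will be detectable at runtime; a rough host-side sketch (which compute-capability revision will actually signal hardware doubles is purely my guess):
Code:
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // G80/G84/G86 report compute capability 1.0/1.1 today; treating anything
    // newer as "double-capable" is a placeholder guess, not a documented rule.
    bool hwDouble = (prop.major > 1) || (prop.major == 1 && prop.minor > 1);
    printf("%s: hardware doubles %s\n", prop.name,
           hwDouble ? "likely supported" : "absent (demoted to float)");
    return 0;
}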
 
Jawed said:
I suppose it's a bit like the question, "do you have fixed function TAs? or do you run this as shader instructions?" Why did NVidia make the TAs in G80 fixed function?
I'd presume because TA is not very well suited to SIMD ALUs - TF, on the other hand, will presumably be more programmable eventually...

Interestingly, you could argue that a single batch of primitives being setup would mesh quite nicely with a post transform vertex cache - they'd both be "about the same size".
I didn't get that - care to elaborate? :)

I didn't put that very well. I wasn't trying to imply that G80's L2 is huge - but if a huge L2 for RMW is coming, then that's where it would be, I guess.
Ah, okay. Well I agree with that completely (although I'd argue that if you wanted huge amounts of embedded memory, you'd want to manage it mostly manually as I think it could give you better returns than a cache). If you need it to be high-bandwidth, you probably want to have it multibanked and allocate it statically to a group of memory chips. And thus it makes sense to put it at the ROP/memory controller level.

Anyway, some would argue that the RMW penalty of graphics is what keeps it honest, what enables it to be embarrassingly parallel.
Well, yes and no... In my opinion, RMW isn't such a big problem for graphics if each thread can only do it for one (or a few) memory locations which no other thread will ever need to access, except because of memory burst restrictions (and those can be solved by tiling). Blending fits in that category so it doesn't seem too bad to me.

Generic RMW, which implies that *any* thread should be able to RMW *any* memory location at any given time (as required by CUDA!) is much harder to get good efficiency out of, AFAICT...
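
To put the distinction in CUDA terms (both kernels are made-up illustrations):
Code:
// Blend-style RMW: thread i is the only reader/writer of dst[i], so a plain
// load-modify-store is safe and trivially parallel.
__global__ void blendLike(float* dst, const float* src, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = alpha * src[i] + (1.0f - alpha) * dst[i];
}

// Generic RMW: any thread may hit any bin, so correctness needs an atomic and
// the hardware has to serialise colliding updates.
__global__ void histogram(int* bins, const int* keys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[keys[i]], 1);
}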

How much more performance do you think NVidia needs here? 10s of % will come with a clock boost. Orders of magnitude (to match the zixel rate of G80?) may be a step too far?
Well, I don't think anyone is going to deny that 192Z for 0.5 to 1 triangle/clock is a bit extreme. It shouldn't be hard to conclude that at some points in a frame, this is a major bottleneck. How much of a problem is it overall? I don't know, but I can imagine quadrupling triangle setup throughput to be a substantial gain in some games (25%+? Guestimates ftw!)...
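
As a quick sanity check on that guesstimate (the frame-time split is invented, the rest is just Amdahl's law):
Code:
#include <stdio.h>

int main()
{
    // Pure assumption: say setup bounds a third of the frame and we get 4x.
    float setupFraction = 1.0f / 3.0f;
    float setupSpeedup  = 4.0f;
    float frameSpeedup  = 1.0f / ((1.0f - setupFraction)
                                  + setupFraction / setupSpeedup);
    printf("Overall speedup: %.2fx\n", frameSpeedup);   // ~1.33x, i.e. ~33%
    return 0;
}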
 
AnarchX said:
So G92 is already DX10.1? :???:
All this means is that it will be DX10.X where X is currently unknown or unannounced, and not DX11.

AnarchX said:
My sources and other rumours on the net indicate that both chips will be aimed at midrange and at dual-GPU boards as interim solutions in the high end, so that both companies can concentrate more on the DX10.1 solutions in H1 2008.
I don't disagree with most of what you said there, with one (big) exception: What makes you think those are interim solutions? Because I'm pretty sure they are not.

psurge said:
I dunno, we've been hearing forever about how shaders that can do things like read the current framebuffer value and then write back are hard to implement with good performance... I guess I'd like to understand a bit more in detail why this is suddenly a good idea/workable from a HW perspective.
Well, I'm not completely sure myself! :) NVIDIA had a patent on this for a long time, it was filed around 2003, fwiw... The basic idea is to have a tiling architecture with a fixed number of tiles being worked on at the same time, and you can reserve/unreserve tiles with coverage masks etc. - it's not perfect, but I'm not sure you can do much better than that.

It does get more and more expensive the longer the M in RMW takes, though, since that means more simultaneous tiles and more potential data being blocked by tiles that have already been reserved. But I doubt this is a massive problem in practice.
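
My reading of that scheme, as a toy host-side sketch (slot count, masks and the exact behaviour are my guesses at the patent's intent, not real hardware):
Code:
#include <cstdint>
#include <cstring>

// A fixed pool of in-flight tiles, each with a mask of pixels that still have
// an outstanding RMW. A new quad may proceed only if it doesn't touch pixels
// already reserved within its tile; otherwise it stalls and retries.
struct TileScoreboard {
    static const int SLOTS = 16;        // fixed number of in-flight tiles
    uint32_t tileId[SLOTS];
    uint32_t busyMask[SLOTS];           // per-pixel coverage within the tile
    bool     valid[SLOTS];

    TileScoreboard() { std::memset(valid, 0, sizeof(valid)); }

    bool reserve(uint32_t tile, uint32_t coverage) {
        int freeSlot = -1;
        for (int s = 0; s < SLOTS; ++s) {
            if (valid[s] && tileId[s] == tile) {
                if (busyMask[s] & coverage) return false;  // conflict: stall
                busyMask[s] |= coverage;                   // extend reservation
                return true;
            }
            if (!valid[s]) freeSlot = s;
        }
        if (freeSlot < 0) return false;                    // pool full: stall
        valid[freeSlot]    = true;
        tileId[freeSlot]   = tile;
        busyMask[freeSlot] = coverage;
        return true;
    }

    // Called once the RMW for those pixels has been written back.
    void release(uint32_t tile, uint32_t coverage) {
        for (int s = 0; s < SLOTS; ++s)
            if (valid[s] && tileId[s] == tile) {
                busyMask[s] &= ~coverage;
                if (busyMask[s] == 0) valid[s] = false;    // recycle the slot
            }
    }
};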

3dcgi said:
Why do you assume moving triangle setup to the shader will result in a performance improvement? It's likely that there are other data paths outside of the setup engine that limit performance to its current rate.
Indeed, but I think those parts are potentially much less expensive to design for higher throughput than triangle setup - so there would still be hard limits, but they should be significantly higher.

AnarchX said:
But I do not expect more than a checklist feature.
It will not be a 'checklist' feature, as it won't be on GeForce checklists and Tesla users couldn't care less about checklist features.

Throughput is obviously going to be a few to several times lower than FP32, but that is to be expected. Consider the double-precision Cell: it achieves 'only' ~100 GFLOPS in DP mode. If FP64 on G92 is 3-5x slower than FP32, it can still easily beat that.
 
I didn't get that - care to elaborate? :)
Just a minor observation that the PTVC, sized in the tens of vertices, will be consumed tens of vertices at a time by setup. A nice bit of symmetry.

Ah, okay. Well I agree with that completely (although I'd argue that if you wanted huge amounts of embedded memory, you'd want to manage it mostly manually as I think it could give you better returns than a cache).
Are you thinking of something like Cell SPEs' DMA in/out of LS? I suppose that would work quite nicely. The DMA would be hidden from the graphics programmer by the driver, making OM operate as though it was operating on an ordinary Colour Buffer Cache.

Why not unify this with PDC? Why have two separate memory pools?

If you need it to be high-bandwidth, you probably want to have it multibanked and allocate it statically to a group of memory chips. And thus it makes sense to put it at the ROP/memory controller level.
Couldn't you argue the same for PDC? Once you remove fixed-function ROP functionality, then it seems to me you're looking at PDC taking on this role, therefore being significantly larger than it currently is.

Jawed
 