NVIDIA GT200 Rumours & Speculation Thread

Status
Not open for further replies.
Ah, I interpreted what Marco said to mean he wouldn't be using hardware filtering for extended precision, but do so in ALUs.
Could be, but then I see no reason not just to do the log filtering stuff that he has already been doing... however that wouldn't work with SAT. Not sure how to make any of this work with SAT nicely (without DP), but maybe that's where his cleverness comes in... :D

Anyways this is indeed off-topic, so I'll stop... Marco feel free to start a new thread and let us in on your new plans :)
 
Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware GS output merging support or perhaps just GS scheduling which is now thread friendly?

As for the "low" DP rate in comparison to AMD, is the typical DP workflow on the GPU bandwidth bound anyway? Or are those using DP on AMD's GPUs actually very often reaching the peak DP rate other than in the obvious and trivial cases?
 
Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware queue support (GS output) or perhaps just GS scheduling which is now thread friendly?
Err, just higher triangle setup rate? ;)
 
As for the "low" DP rate in comparison to AMD, is the typical DP workflow on the GPU bandwidth bound anyway? Or are those using DP on AMD's GPUs actually very often reaching the peak DP rate other than in the obvious and trivial cases?
MfA reported almost 1/3 throughput for double-precision dense matrix multiply when compared with single-precision. I think that's as bandwidth bound as you can get.

http://forum.beyond3d.com/showthread.php?t=48154

220 GFLOPs SP v nearly 70 DP.
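As a quick sanity check on the "almost 1/3" figure, here is a minimal Python sketch using only the two numbers quoted above. Treating "nearly 70" as exactly 70 is an assumption made for the arithmetic, and the function name is purely illustrative.

```python
# Figures quoted in the post; "nearly 70" is taken as 70.0 here (assumption).
SP_GFLOPS = 220.0
DP_GFLOPS = 70.0


def dp_to_sp_ratio(sp_gflops: float, dp_gflops: float) -> float:
    """Return double-precision throughput as a fraction of single-precision."""
    return dp_gflops / sp_gflops


ratio = dp_to_sp_ratio(SP_GFLOPS, DP_GFLOPS)
print(f"DP/SP throughput ratio: {ratio:.2f}")
```

So the measured DP rate is a little under a third of the SP rate, which is what "almost 1/3" refers to.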

As to people doing "non-obvious, trivial" double-precision stuff - too early...

Jawed
 
I think I would place dense matrix multiply into the trivial and obvious case (easy data locality and reuse). ;)
It is, but it does tell you a couple of things. For instance matrix multiplication hits the caches very hard and it can deal with that competently. It would have been nice if they could get the near 100% efficiency of Cell, but 50% is pretty good regardless.
 
Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware GS output merging support or perhaps just GS scheduling which is now thread friendly?
They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.
 
They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.
Was there some discussion why it's slow on G8x/G9x? All I remember was that vertex throughput on G80/G92 was pretty low (slower than HD3850 and slower than G94 even).
 
Was there some discussion why it's slow on G8x/G9x? All I remember was that vertex throughput on G80/G92 was pretty low (slower than HD3850 and slower than G94 even).

What is the vertex throughput on G80/G92? I thought it could set up 1 triangle per cycle? Is G94 different in this respect?
 
What interests me with GT200 / GTX 280 are the inevitable shrinks/respins/redesigns:

* GT200b (55nm?)
* a potential binned, clocked-up 'Ultra'
* a potential redesign that can use GDDR5
 
Was there some discussion why it's slow on G8x/G9x? All I remember was that vertex throughput on G80/G92 was pretty low (slower than HD3850 and slower than G94 even).

From previous tests I ran on the G84 where I tried to use the GS to output from points to quads to the faces of a cubemap, it was much faster to simply stream out VS only points to quads then draw all the geometry to all the faces of the cubemap (VS only) than to use the GS pipe.

Couldn't see it simply being a triangle setup issue: even if the card couldn't compact on GS output, i.e. it emitted degenerate tris to fill out the max output primitives, simply drawing all those primitives from VS only was faster (at least in my case). Unless somehow in my example the card had to internally have 6 output streams (one for each cubemap face, each with 6 quad slots per GS invocation). Then tri setup would be an obvious problem...
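To put rough numbers on that hypothetical, here is a small Python sketch comparing the triangles actually needed per input point with the padded worst case described above. It assumes each quad is two triangles and that every unused slot is filled with degenerates; the 6 streams by 6 slots layout is only the guess from this post, not a known hardware detail.

```python
CUBE_FACES = 6
TRIS_PER_QUAD = 2  # assumption: each quad is drawn as two triangles


def triangles_per_point(streams: int, quad_slots_per_stream: int) -> int:
    """Triangles reaching setup per input point if every slot is emitted."""
    return streams * quad_slots_per_stream * TRIS_PER_QUAD


# Useful work: one quad per cubemap face.
useful = triangles_per_point(streams=1, quad_slots_per_stream=CUBE_FACES)

# Hypothetical padded case: 6 streams, each filled out to 6 quad slots.
padded = triangles_per_point(streams=CUBE_FACES, quad_slots_per_stream=CUBE_FACES)

print(useful, padded, padded // useful)
```

Under those assumptions setup would see six times as many triangles as the VS-only path draws, most of them degenerate.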

Guess I missed the B3D post on why GS was slow...
 
Well since the execution resources are the same for both GS and VS, and we have a throughput issue, I'd suspect either a cache or setup bottleneck.
 
Are there any sources on how Radeons or S3 cards handle the D3D10 SDK samples with and without GS?
On my 8800GTS the GS generally is very slow, eg the cubemap example, the displacement mapping and the stencil shadows.
Same happens on my Intel X3100. Enabling the GS makes things run really slowly, making me wonder why I'd ever want to use GS in the first place.
So I'd like to see some framerates on other D3D10 hardware.
 
I haven't played around with GSs in a while (so I may very well be totally wrong here), but from what I've been hearing, massive strides have been made recently in the performance of the GS paths by ATI and/or NVIDIA.
 