Jawed
> If the math is non-linear though you can't use hardware filtering (or SATs for that matter)

Ah, I interpreted what Marco said to mean he wouldn't be using hardware filtering for extended precision, but would do so in the ALUs.
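"Filtering in the ALUs" just means fetching the raw texels and blending them with ordinary shader arithmetic instead of relying on the texture unit's fixed-function (and fixed-precision) bilinear filter, which also lets you wrap non-linear math around each sample. A minimal CUDA-flavoured sketch, assuming a flat single-channel array; the function name and layout are illustrative only:

```cuda
// Manual bilinear filter done in the ALUs: fetch the four nearest
// texels yourself and lerp them, instead of using fixed-function
// texture filtering. The weights and blends run at full FP32, and a
// non-linear transform could be applied to each texel before blending.
__device__ float bilinear_alu(const float* tex, int w, int h, float u, float v)
{
    float x = u * w - 0.5f;                 // normalized -> texel space
    float y = v * h - 0.5f;
    int x0 = max(0, min(w - 2, (int)floorf(x)));
    int y0 = max(0, min(h - 2, (int)floorf(y)));
    float fx = x - x0, fy = y - y0;         // fractional blend weights

    float t00 = tex[y0 * w + x0],       t10 = tex[y0 * w + x0 + 1];
    float t01 = tex[(y0 + 1) * w + x0], t11 = tex[(y0 + 1) * w + x0 + 1];

    float top = t00 + fx * (t10 - t00);     // lerp along x
    float bot = t01 + fx * (t11 - t01);
    return top + fy * (bot - top);          // lerp along y
}
```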
> Ah, I interpreted what Marco said to mean he wouldn't be using hardware filtering for extended precision, but would do so in the ALUs.

Could be, but then I see no reason not just to do the log filtering stuff he has already been doing... however, that wouldn't work with a SAT. Not sure how to make any of this work with a SAT nicely (without DP), but maybe that's where his cleverness comes in...
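The "log filtering stuff" here presumably refers to filtering in log space, where the texture stores d but the quantity you actually want averaged is exp(c·d). A hedged sketch of the standard numerically stable formulation (the log-sum-exp trick); the names and the scale factor c are illustrative, not Marco's actual code:

```cuda
// Filter in log space: compute log( sum_i w[i] * exp(c*d[i]) ) / c.
// Factoring out the largest exponent keeps every term <= 1, so the sum
// never overflows FP32 -- which is how single-precision ALU filtering
// can stand in for extended precision here.
__device__ float filter_log_space(const float* d, const float* w, int n, float c)
{
    float m = -1e30f;                       // running max of the exponents
    for (int i = 0; i < n; ++i)
        m = fmaxf(m, c * d[i]);
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += w[i] * expf(c * d[i] - m);     // rescaled terms, all in [0,1]
    return (m + logf(s)) / c;               // back to d's units
}
```

A SAT, by contrast, needs running sums of the exponentiated values themselves, and those grow without bound across the table, which is exactly where FP32 runs out of mantissa without DP.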
> Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware queue support (GS output) or perhaps just GS scheduling which is now thread friendly?

Err, just a higher triangle setup rate?
> Err, just a higher triangle setup rate?

Heh, so you think they won't be benching it on a "balanced platform"...
> As for the "low" DP rate in comparison to AMD, is the typical DP workflow on the GPU bandwidth bound anyway? Or are those using DP on AMD's GPUs actually very often reaching the peak DP rate, other than in the obvious and trivial cases?

MfA reported almost 1/3 throughput for double-precision dense matrix multiply when compared with single-precision. I think that's as bandwidth bound as you can get.
> I think I would place dense matrix multiply into the trivial and obvious case (easy data locality and reuse).

Oh, I do too. That's all the information relating to DP I have, though.
> I think I would place dense matrix multiply into the trivial and obvious case (easy data locality and reuse).

It is, but it does tell you a couple of things. For instance, matrix multiplication hits the caches very hard, and it can deal with that competently. It would have been nice if they could get the near-100% efficiency of Cell, but 50% is pretty good regardless.
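To make "easy data locality and reuse" concrete: in a tiled matrix multiply, each operand brought on-chip is reused TILE times, so off-chip loads per output element drop from 2n to 2n/TILE, and a big enough tile moves the kernel away from the bandwidth limit toward the ALU limit. A minimal CUDA sketch, assuming n is a multiple of TILE (illustrative only, not MfA's actual benchmark):

```cuda
#define TILE 16

// Tiled double-precision matrix multiply: C = A * B, all n x n,
// row-major, with n assumed to be a multiple of TILE.
__global__ void dgemm_tiled(const double* A, const double* B, double* C, int n)
{
    __shared__ double As[TILE][TILE];       // staging tiles: every element
    __shared__ double Bs[TILE][TILE];       // loaded here is reused TILE times
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double acc = 0.0;

    for (int t = 0; t < n; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();                    // tile fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    // done with this tile
    }
    C[row * n + col] = acc;
}
```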
> Anyone have any hints as to what they have changed to get "Much faster geometry shading" (other than the other bullet points)? Are we looking at some kind of hardware GS output merging support or perhaps just GS scheduling which is now thread friendly?

They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.
> They schedule GS threads differently now, yeah, but because of a hardware change. It should be pretty obvious what that change is if you know why it goes slow on G8x/G9x. Measured throughput increases are nice and impressive.

Was there some discussion of why it's slow on G8x/G9x? All I remember was that vertex throughput on G80/G92 was pretty low (slower than HD3850, and slower than G94 even).
> Guess I missed the B3D post on why GS was slow...

Rys seems to be the only one in the know on this issue. (Well, we all know why the GS is slow in general, but not why it's so ridiculously slow on G80.) Maybe a quick post or article summarizing it for us mortals?
> Well since the execution resources are the same for both GS and VS, and we have a throughput issue, I'd suspect either a cache

I doubt the G80 can cache the GS output at all.
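The usual explanation is that GS output has to be buffered on-chip, in primitive order, before it can go further down the pipe, so the number of GS threads in flight is capped by the output buffer rather than by the ALUs. As a back-of-envelope model (all numbers hypothetical, not G80's real figures):

$$T \approx \left\lfloor \frac{B}{M \cdot S} \right\rfloor$$

where $B$ is the output buffer size, $M$ the declared maximum output vertex count, and $S$ the bytes per output vertex. With, say, $B = 16\,\text{KiB}$ and $M \cdot S = 64 \times 64\,\text{B} = 4\,\text{KiB}$, only four GS threads fit in flight, far too few to hide memory latency, while a pass-through GS with a tiny $M \cdot S$ runs at full speed. A substantially larger output buffer in GT200 would fit the "much faster geometry shading" bullet point.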