First few GFX benches

Well, that's only textures. You need multiples of that because for each frame you're going to have tests and blends on many of the alphas.
 
MDolenc said:
Hold on a minute here!
Why the hell would GeForceFX need to store transformed vertices back into video memory and then read them again at the triangle setup stage?? The ONLY thing this could be good for is their render to vertex array functionality, but you don't need to do this GENERALLY in every case!? Where did you get this info from (and how sure are you about it)?

This is not only the GFFX; it's also that way for every other nVidia GPU on the market, and probably for ATI/Matrox/... ones too.
Note, however, that this isn't anywhere near as major as Z reads or Color writes.

Let's consider 5 million vertices/frame (thus 300M vertices/s at 60 FPS),
and let's take an FVF of 32b with a VS output of 32, too.
Now, consider an optimal case where every vertex is used once in each frame, and let's say half the drawn vertices are dynamic.

Thus, that's 64*2.5 million, or 160 million, for dynamic vertices,
and 32*2.5 million, or 80 million, for static vertices.
That gives us 240 million bits/frame.

But then, you've also got to READ the transformed vertices. Let's consider a scenario where they're all read.
That adds 32*5 million, or 160 million, resulting in a grand total of... 400 million.

Divide that by 1024, again by 1024, and then by 8 to get MB:
that's 47.68 MB/frame.
At 60 FPS, that's 2.86 GB/s.
And the GFFX is able to do 16 GB/s, so it would take nearly 18% of the available bandwidth to push 300M vertices/s.
And keep in mind that with a bigger FVF or with more parameters required, that number would be even bigger.

Hardware T&L certainly costs memory bandwidth. It's not free at all. But compared to Z reads / Color writes / texture reads, it's currently not very significant.
However, if you used the GFFX's limit of 350M vertices/s, it would be more significant: about 20%!
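If you want to play with the assumptions, here's a quick sketch that just re-runs the arithmetic above (the per-vertex sizes and the final divide-by-8 step are copied as-is from the figures in this post), so you can plug in a bigger FVF or a different vertex count:

Code:
# Re-runs the back-of-the-envelope estimate above. Per-vertex sizes and the
# divide-by-8 are taken straight from the post's figures.
VERTICES_PER_FRAME = 5_000_000    # 300M vertices/s at 60 FPS
FPS                = 60
FVF_SIZE           = 32           # untransformed vertex size (same units as above)
VS_OUTPUT_SIZE     = 32           # transformed vertex size
GFFX_BANDWIDTH_GBS = 16.0

dynamic = VERTICES_PER_FRAME // 2               # half the vertices are dynamic
static  = VERTICES_PER_FRAME - dynamic

writes = dynamic * (FVF_SIZE + VS_OUTPUT_SIZE) + static * VS_OUTPUT_SIZE
reads  = VERTICES_PER_FRAME * VS_OUTPUT_SIZE    # worst case: every transformed vertex read back
total  = writes + reads

mb_per_frame = total / 1024 / 1024 / 8
gb_per_sec   = mb_per_frame * FPS / 1024
print(f"{mb_per_frame:.2f} MB/frame, {gb_per_sec:.2f} GB/s, "
      f"{100 * gb_per_sec / GFFX_BANDWIDTH_GBS:.1f}% of {GFFX_BANDWIDTH_GBS} GB/s")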

Now, maybe I made an error somewhere in my calculations. If so, please correct me :)

And yes, I've got a source for this. I'm *not* inventing it.
It comes directly from a nVidia patent:
http://patft.uspto.gov/netacgi/nph-...t00&s1=nVidia.ASNM.&OS=AN/nVidia&RS=AN/nVidia

Yes, the method for Transform/Lighting is old. It's pretty much what was used for the NV10.
But AFAIK, besides that, it's all still pretty much correct (I only read a big part of it; I might have missed some important stuff).

(everything following this assumes DX9, but it's practically identical for DX8)
The following is speculation. However, it seems very logical to me, and it's based on reliable information. It could be wrong, but I'd be surprised if there were more than a few mistakes. And some of it is not speculation, but known facts.

DrawIndexedPrimitive transforms every vertex in the VB from "BaseVertexIndex + MinIndex" to "BaseVertexIndex + MinIndex + NumVertices".
It puts all of those transformed vertices in memory.
Then, once they're all there, it begins reading the indices, which are in memory in an index buffer (yes, that also takes memory bandwidth - probably at least 1 GB/s in extreme cases, but I didn't do any serious calculation here).
It probably reads 128 bits of indices at a time (it's a 128-bit bus) and puts them in a very small cache.
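As a sanity check on that 1 GB/s figure, the arithmetic is trivial (this assumes 300M indices/s, i.e. one index per vertex, and the usual 16-bit or 32-bit index formats):

Code:
# Index buffer traffic at 300M indices/s for the two common D3D index sizes.
INDICES_PER_SECOND = 300_000_000
for bits in (16, 32):
    gb_per_sec = INDICES_PER_SECOND * (bits // 8) / (1024 ** 3)
    print(f"{bits}-bit indices: {gb_per_sec:.2f} GB/s")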

The following happens once for every triangle (which is every 1 or 3 new vertices, depending on whether it's a triangle strip, list or fan):
---
Then, it checks whether any of the requested vertices are already in the vertex cache. Any vertex which isn't in it is retrieved from memory (probably 128 bits at a time here too, so FVFs whose size isn't divisible by 128 create waste), and the three vertices are sent to triangle setup AND to the vertex cache (if one is already in the vertex cache, I don't know what happens - maybe nothing, or maybe its priority is increased; not sure).
If the vertex cache is full, the oldest (least important) vertices are evicted. For efficient vertex cache use, it is thus essential that vertices are reused as quickly as possible (an NV2x vertex cache can hold 16 vertices; no idea about NV3x).
---
Once all indices have been processed, the memory where the *transformed* vertices lie is freed (which simply means that if something tries to write to it, the current contents are treated as garbage).
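To see why quick reuse matters so much, here's a toy model of such a post-T&L cache. The 16-entry size comes from the NV2x figure above, and the FIFO eviction and the grid mesh are just my assumptions for illustration:

Code:
# Toy FIFO post-T&L vertex cache: counts misses (vertex fetched/transformed)
# for an index stream. 16 entries and plain FIFO eviction are assumptions.
from collections import deque

def cache_misses(indices, cache_size=16):
    cache = deque(maxlen=cache_size)     # oldest entry falls out automatically
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1                  # not in the cache: fetch + transform again
            cache.append(idx)
    return misses

# Indexed triangle list for a 10x10 grid of quads (11x11 = 121 shared vertices)
tris = [(y * 11 + x,       y * 11 + x + 1, (y + 1) * 11 + x,
         (y + 1) * 11 + x, y * 11 + x + 1, (y + 1) * 11 + x + 1)
        for y in range(10) for x in range(10)]
indices = [i for tri in tris for i in tri]

print(len(indices), "indices,", len(set(indices)), "unique vertices,",
      cache_misses(indices), "misses")

With a cache that small, the shared row of vertices has already been evicted by the time the next row of quads needs it again, which is exactly why index ordering and quick reuse matter.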


Yes, this was long. But you seemed so surprised that I really wanted to prove this and explain it. More vertices/indices cost more memory bandwidth, but more complex vertices don't. That's why, in the future, the GFFX might not be as bandwidth limited as it is in today's applications, if developers try to develop for it (or at least do a profile for it).

Anyway, I could have made mistakes/errors here. If anyone finds one, please say so, and explain it to me if possible. I'd love to increase my understanding of this :)


Uttar
 
High resolution and a minimum of 4X AA plus High Quality AF (min 8X) is what we want to see. Who buys an $800 video card to run with 2X AA and no anisotropic filtering?
Looking forward to a review that will use these settings.
 
http://www.3dcenter.org/artikel/2002/11-19_b.php

1280x1024, 4X AA, 8X AF: 40.6

Compared to Maximum PC score:

1600x1200, 2X AA, no AF: 41

I'd guess 1600x1200 2X AA takes as much bandwidth as 1280x1024 4X AA, because color compression is more efficient with 4X AA, since more subpixels are identical.

So, that means... 8X AF, when enabling Antialiasing, is... 95% free.
Nice! All we want to know, now, is its quality.
Unless nVidia's homemade score at 3DCenter is increased by putting it in god-like conditions or something... But that would be surprising.


Uttar
 
DaveBaumann said:
Well, that's only textures. You need multiples of that because for each frame you're going to have tests and blends on many of the alphas.

I understand, but then alpha textures don't demand as much raw memory bandwidth as normal textures. The testing and blending in 3DMark's Nature scene with AA could put a lot of pressure on the efficiency of the memory controller when we're talking about fetching texels from main VPU memory, however.

Anyway Dave, granted: you know more about this than I do; I'm just having trouble understanding why Nature should be memory bandwidth limited. Remember how some new Det. drivers made the GF4's Nature FPS do a massive jump? It still makes me believe that we're talking about a vertex shader power limitation here.
 
So, that means... 8X AF, when enabling Antialiasing, is... 95% free.
Nice! All we want to know, now, is its quality.

Aniso on the GeForce FX should be very good, even with no extra tweaks on nVidia's part, since the performance problems of aniso on the GeForce4 appear primarily related to multitexturing performance.

Still, I doubt that you can really say from those benchmarks that aniso is "95% free," in any situation.

As for anisotropic quality, it is undoubtedly identical to the GeForce4's anisotropic. I don't really think it can get better (without increasing the degree of anisotropic), at least not noticeably.
 
LeStoffer said:
DaveBaumann said:
Remember how some new Det. drivers made the GF4's Nature FPS do a massive jump? It still makes me believe that we're talking about a vertex shader power limitation here.

Or it could be that FutureMark is so darn stupid that they forgot to turn off Alpha Testing when it's no longer required? Because that kills Early Z.
Which means that it would give a major fillrate & memory bandwidth boost if a driver optimization could intelligently determine if Alpha Testing is still required.
It's unlikely, but maybe, maybe...

Uttar
 
Uttar said:
http://www.3dcenter.org/artikel/2002/11-19_b.php

1280x1024, 4X AA, 8X AF: 40.6

Compared to Maximum PC score:

1600x1200, 2X AA, no AF: 41

I'd guess 1600x1200 2X AA takes as much bandwidth as 1280x1024 4X AA, because color compression is more efficient with 4X AA, since more subpixels are identical.

So, that means... 8X AF, when enabling Antialiasing, is... 95% free.
Nice! All we want to know, now, is its quality.
Unless nVidia's homemade score at 3DCenter is increased by putting it in god-like conditions or something... But that would be surprising.


Uttar




You link to a Nvidia PR graph... :LOL:
 
LeStoffer said:
Remember how some new Det. drivers made the GF4's Nature FPS do a massive jump? It still makes me believe that we're talking about a vertex shader power limitation here.

Guessing why one set of drivers changes the performance is a difficult thing. How do you know the improvements weren’t from something they did at the texturing end, or the Pixel Shading end? Performance improvements can come from many places.

However, remember Rev’s GF4 Ti Shootout

Note that, in that testing, increased memory bandwidth creates a greater increase in Nature performance.

Uttar said:
...I'd guess 1600x1200 2X AA takes as much bandwidth as 1280x1024 4X AA, because color compression is more efficient with 4X AA, since more subpixels are identical.

So, that means... 8X AF, when enabling Antialiasing, is... 95% free....

Oh god. :rolleyes:

Let's stay away from wild guessing and claims about how 'performance free' something is based on three benchmarks - at least wait for Beyond3D's analysis before doing things like this :!:

And trying to assess the efficiency compression is having is a relatively futile exercise, since it will vary greatly from frame to frame, let alone from game to game. About all you can say for any of them is that it is more efficient than none and less efficient than any manufacturer will claim :!: ;)
 
Uttar said:
Or it could be that FutureMark is so darn stupid that they forgot to turn off Alpha Testing when it's no longer required? Because that kills Early Z.

...

It's unlikely, but maybe, maybe...

Uttar

Not unlikely, just follow the link in my reply. ;)

Anyway, my main annoyance with 3DMark 2001 is that it takes some major study before anyone really understands what the heck each test really tests. Take the car chase - high detail: it's very interesting if you want to measure your computer's memory subsystem performance (FSB, RAM bandwidth, etc.), and even now I'm not sure that the shader benchmark (Nature) really measures shader performance and not memory bandwidth.

It's okay that the final mark is a sum, but please at least give people a benchmark that measures a single aspect in a gaming situation next time. Okay, Futuremark :?: ;)
 
LeStoffer said:
Anyway, my main annoyance with 3DMark 2001 is that it takes some major study before anyone really understands what the heck each test really tests. Take the car chase - high detail: it's very interesting if you want to measure your computer's memory subsystem performance (FSB, RAM bandwidth, etc.), and even now I'm not sure that the shader benchmark (Nature) really measures shader performance and not memory bandwidth.

It's okay that the final mark is a sum, but please at least give people a benchmark that measures a single aspect in a gaming situation next time. Okay, Futuremark :?: ;)
You have to realize that 3DMark is not an open source test, which is the main reason why it's so damn hard to understand what the heck is going on in each test...

That's one of the reasons why I'm eagerly awaiting iXBT's own RightMark 3D, which is a suite of tests taking full advantage of all of the latest features of the latest crop of video cards, mainly focusing on DX9.

Besides fully exploiting the power of DX9-compliant cards, the main advantage of this benchmark over all the others is the fact that it will be open source!
 
Uttar said:
MDolenc said:
Hold on a minute here!
Why the hell would GeForceFX need to store transformed vertices back into video memory and then read them again at the triangle setup stage?? The ONLY thing this could be good for is their render to vertex array functionality, but you don't need to do this GENERALLY in every case!? Where did you get this info from (and how sure are you about it)?

This is not only the GFFX; it's also that way for every other nVidia GPU on the market, and probably for ATI/Matrox/... ones too.
Note, however, that this isn't anywhere near as major as Z reads or Color writes.
Uttar

MDolenc is right. There is no need to transform all vertices first and then read them. You need to do this for tile based deferred rendering, but not for an IMR like all other video cards. I think this is one of the main reasons for not doing TBR.

You read the vertices either from video card memory or AGP (I think the latter is more common) in the order according to the primitive type, and then you transform them. The vertex cache is there to try and avoid retransforming recently transformed vertices again, thus achieving a max of 2 triangles per vertex. Once you form a primitive, you draw it, and you have no need to store it anywhere because you are done with it. Of course there are fifos to buffer out culled/clipped/tiny triangles, but these are on the chip and don't consume any bandwidth.

Your so-called "proof" doesn't say anything about storing all transformed vertices beforehand, either.
 
Errr, I don't see what's surprising about that, Wavey (or what should I call you? I'm getting so confused with all the names people call you :) )

AA increases the memory bandwidth workload. Aniso mostly increases the fillrate workload, and only slightly the memory bandwidth workload. So the best-case scenario would be AF not making fillrate the bottleneck.
Thus, an increased memory bandwidth workload due to AA decreases the performance hit of aniso.

Maybe saying it's 95% free is being too optimistic. I just realized something:
1280x1024, 4X AA, takes less bandwidth than 1600x1200 2X AA

Which means that there's bandwidth left to give to Aniso if switching from 1600x1200 2X AA to 1280x1024 4X AA - so it won't be free at all.

However, it would indicate that it's likely Memory Bandwidth remains the bottleneck at 4X AA, 8X AF

Now, why do I say 1280x1024 4X AA takes less bandwidth than 1600x1200 2X AA?
1280x1024 is 68% of 1600x1200.
2X AA only writes 50% of the samples that 4X AA does, if there were no Color/Z Compression.
So my rough estimate is that 1280x1024 4X already uses only about 84% of what 1600x1200 2X AA uses - if we didn't take Color/Z Compression into account.

But we have to. Z compression is more efficient because the samples are more similar. And Color Compression is also more efficient, because 4 subpixels are identical most of the time, not only 2.

It's thus a safe bet to say the bandwidth consumption for 1600x1200 2X AA is equal, or maybe slightly higher. Of course, that could be wrong, but how likely is it?
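For reference, here are the raw pixel and sample counts behind those percentages; compression, which is the whole point of the argument, obviously isn't modelled here:

Code:
# Raw pixel and sample counts for the two modes being compared. Compression,
# which is the whole argument above, is deliberately not modelled.
modes = {
    "1600x1200 2X AA": (1600, 1200, 2),
    "1280x1024 4X AA": (1280, 1024, 4),
}
for name, (w, h, aa) in modes.items():
    pixels = w * h
    print(f"{name}: {pixels / 1e6:.2f}M pixels, {pixels * aa / 1e6:.2f}M samples")

print(f"pixel ratio (1280x1024 / 1600x1200): {1280 * 1024 / (1600 * 1200):.0%}")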
A small proof: http://www.anandtech.com/showdoc.html?i=1683&p=17
ATI uses Z/Color compression too.
1024x768 4X AA is probably already memory limited, with a performance drop of 44%.
So, that means 4X AA performance should be 50% higher than 6X AA performance if we didn't take Color/Z Compression into account.
However, it's only 10% higher - so compression is certainly more efficient at higher AA levels.

Does this make sense? Or am I getting dumber every single second?


Uttar
 
Mintmaster said:
MDolenc is right. There is no need to transform all vertices first and then read them. You need to do this for tile based deferred rendering, but not for an IMR like all other video cards. I think this is one of the main reasons for not doing TBR.

Err, who said that?
I said you transform every vertex in the DIP call!
nVidia's/ATI's recommended size for DIP calls is about 500 vertices.
Assuming 2000 vertices/DIP call (which probably isn't the best idea in most cases) and 4 vertex shading units working at the same time, you'd get about 30KB of memory reserved for it. So, at least, it makes sense on that front :) And it also makes sense not to use a cache for this, which would cost too many transistors.
I don't think I said clearly in my first post that this was related to DIP calls, and not to the entire frame. Sorry, I assumed that was obvious, but I guess it isn't at all :oops:

You read the vertices either from video card memory or AGP (I think the latter is more common) in the order according to the primitive type, and then you transform them. The vertex cache is there to try and avoid retransforming recently transformed vertices again, thus achieving a max of 2 triangles per vertex. Once you form a primitive, you draw it, and you have no need to store it anywhere because you are done with it. Of course there are fifos to buffer out culled/clipped/tiny triangles, but these are on the chip and don't consume any bandwidth.

Reading vertices from AGP? Oh, sure :rolleyes:
Even if the VB is dynamic (which means data is sent to it over AGP every frame), it's stored in video memory. The only difference is that it's written to video memory every frame.
And the very proof it's not read from AGP: Dynamic data can be read multiple times, so it's kept in memory.

I also supposed for a long time that Vertex Caches are used to save T&L work. But today, I think they're simply used to save bandwidth.
I might be wrong on this, however... As I said, I'm not 100% sure of all this. But I find it quite logical and likely.


Uttar
 
Uttar said:
MDolenc said:
Hold on a minute here!
Why the hell would GeForceFX need to store transformed vertices back into video memory and then read them again at the triangle setup stage?? The ONLY thing this could be good for is their render to vertex array functionality, but you don't need to do this GENERALLY in every case!? Where did you get this info from (and how sure are you about it)?

This is not only the GFFX; it's also that way for every other nVidia GPU on the market, and probably for ATI/Matrox/... ones too.

You are wrong...
If you'd read documents about nVidia hardware (and especially how to optimize for it), you'd know that.

And yes, I've got a source for this. I'm *not* inventing it.
It comes directly from a nVidia patent:
http://patft.uspto.gov/netacgi/nph-...Vidia.ASNM.&OS=AN/nVidia&RS=AN/nVidia

nVidia has a lot of patents they acquired with 3dfx.
This proves nothing.

The following is speculation. However, it seems very logical to me, and it's based on reliable information. It could be wrong, but I'd be surprised if there were more than a few mistakes. And some of it is not speculation, but known facts.

DrawIndexedPrimitive transforms every vertex in the VB from "BaseVertexIndex + MinIndex" to "BaseVertexIndex + MinIndex + NumVertices".
It puts all of those transformed vertices in memory.

This is not true.
It only takes a few simple benchmarks to figure this out!

GeForce (and Radeon) transforms vertices on-demand.

This means if you increase "NumVertices" but you don't really use the extra vertices, there is no performance impact.
In your scheme there would be.
Write a simple program and test it if you don't believe me!

This also means that a vertex can be transformed multiple times if its occurrences are spaced further apart than the size of the post-transformed vertex FIFO.
You can check this by writing a very long shader program.
Go ahead!

This also means that a vertex can be read from the vertex buffer multiple times if it occurs multiple times and the input memory cache (a few kilobytes large) misses it.
This has quite a large impact if the buffer is in AGP memory.
Set your AGP to 1x to exaggerate the effect while testing!

Then, once they're all there, it begins reading the indices, which are in memory in an index buffer (yes, that also takes memory bandwidth - probably at least 1 GB/s in extreme cases, but I didn't do any serious calculation here).
It probably reads 128 bits of indices at a time (it's a 128-bit bus) and puts them in a very small cache.

Indices are not reused, so there's really no good reason to "cache" them.
Btw, the memory system reads 32 bytes at once (256 bits),
which is not surprising since it's a 128-bit DDR bus.
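As a rough illustration of what 32-byte accesses mean for odd vertex strides, here's a worst-case sketch (it assumes every vertex is fetched with its own burst, ignoring whatever batching the chip really does; the strides are just examples):

Code:
# Worst-case fetch waste when vertices are pulled in 32-byte memory accesses.
BURST = 32  # bytes per access (256 bits)

for stride in (24, 32, 40, 48, 64):          # illustrative vertex sizes in bytes
    fetched = -(-stride // BURST) * BURST    # round up to whole accesses
    waste = fetched - stride
    print(f"stride {stride:2d} B -> fetch {fetched} B, waste {waste} B "
          f"({100 * waste / fetched:.0f}%)")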

Yes, this was long. But you seemed so surprised that I really wanted to prove this and explain it.

No wonder he is surprised :)
All the evidence is against what you say - including nVidia's documentation.

Anyway, I could have made mistakes/errors here. If anyone finds one, please say so, and explain it to me if possible. I'd love to increase my understanding of this :)

Uttar

No big mistakes, apart from getting it completely wrong. :LOL:

Read nVidia documents and write test programs.

Or else believe what those two did say. :!:
 
O.k. guys, a couple of things:
The silicon the guys over at Maximum PC tested was running on stepping A01 - yes, the same first stepping Digit-Life received quite a while ago - so it's full of bugs and the aniso is NOT working; thus, no aniso results were published.

If I'm not mistaken, the A02 stepping is already back from the fab, so there isn't much left until we witness the real results...
 
Uttar said:
Reading vertices from AGP? Oh, sure :rolleyes:
Even if the VB is dynamic (which means data is sent to it over AGP every frame), it's stored in video memory. The only difference is that it's written to video memory every frame.

I have to disappoint you, but dynamic VBs are stored in AGP memory...
Dynamic buffers have to be accessed by the CPU a lot, and that is faster in AGP memory (which is system memory after all).
Also, why transfer it to vmem if - in the best case - it will be read only once?

And the very proof it's not read from AGP: Dynamic data can be read multiple times, so it's kept in memory.

Yes it can be read multiple times.
And guess what.
It's slow.
Actually, it's proof against what you say.

Again it's not too hard to write test programs to verify what I say...
 
Mintmaster said:
MDolenc is right. There is no need to transform all vertices first and then read them. You need to do this for tile based deferred rendering, but not for an IMR like all other video cards.

Which reminds me that nVidia has TBR related patents... (from Gigapixel)
 
alexsok said:
If I'm not mistaken, A02 stepping is already back from the fab ...

I damn better hope so for nVidia! 8)

(If not, it would be a disaster, alexsok! You might wanna go back to that source of yours to make sure we're not talking about A03 - after all, the GF3 did go there and slipped into the Spring, didn't it? :devilish: )
 
Uttar said:
Or it could be that FutureMark is so darn stupid that they forgot to turn off Alpha Testing when it's no longer required? Because that kills Early Z.
Which means that it would give a major fillrate & memory bandwidth boost if a driver optimization could intelligently determine if Alpha Testing is still required.
It's unlikely, but maybe, maybe...

Uttar

They do not sort the leaves!

So if they turned off AT, you'd see sky-coloured rectangles around many of the leaves!

And why would alpha test kill early Z?
The early Z test can still be performed; only the Z writes have to be delayed until the pixel shader has completed.

Actually, it would be faster on the card if they sorted the leaves, since in that case they'd be able to disable Z writes.
The problem is that it might get CPU limited.
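To make that concrete, here's a toy software model of the ordering: the early Z test still rejects occluded fragments, and the Z write simply waits until the shader/alpha test has run. The scene and numbers are made up purely for illustration:

Code:
# Toy pipeline: early Z *test* with the Z *write* deferred until after the
# shader/alpha test, as described above. Counts shaded vs early-rejected fragments.
import random
random.seed(1)

def draw(fragments, alpha_test, zbuf, stats):
    for x, z, alpha in fragments:
        if z >= zbuf[x]:                 # early Z test against what's already written
            stats["early rejected"] += 1
            continue
        stats["shaded"] += 1             # pixel shader + alpha test run here
        if alpha_test and alpha < 0.5:
            continue                     # killed by alpha test: no Z or colour write
        zbuf[x] = z                      # Z write, deferred until after the test

W = 1000
zbuf = [1.0] * W
stats = {"shaded": 0, "early rejected": 0}
layers = [                                           # drawn front to back
    [(x, 0.3, 1.0) for x in range(0, W, 2)],         # opaque geometry
    [(x, 0.5, random.random()) for x in range(W)],   # alpha-tested "leaves"
    [(x, 0.9, 1.0) for x in range(W)],               # sky behind everything
]
for frags in layers:
    draw(frags, alpha_test=True, zbuf=zbuf, stats=stats)
print(stats)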
 