First few GFX benches

Hyp-X: Good job on showing how stupid I am. :oops:

Thanks for correcting me so well. I didn't write the test programs yet, but I'll be sure to write them in a few days when I get the time.

However, I still think that transformed vertices are put in memory. It just doesn't put every single vertex of a DIP call in there at the same time.
Basically, it would use memory as some kind of cache there. Not a huge cache, but probably larger than the on-chip ones ( else, why use memory? )

I've found a much more recent nVidia patent saying this. It's dated June 28, 2002:
http://appft1.uspto.gov/netacgi/nph...=PG01&s1=nVidia.AS.&OS=AN/nVidia&RS=AN/nVidia

Here's the quote:
[0215] FIG. 25 is a diagram illustrating the method by which the sequencers of the transform and lighting modules 52 and 54 are capable of controlling the input and output of the associated buffers in accordance with the method of FIG. 24. As shown, the first set of buffers, or input buffers 400, feed transform module 52 which in turn feed the second set of buffers, or intermediate buffers 404, 406. The second set of buffers 404, 406 feed lighting module 54 that drains to memory 2550.

However, it doesn't make it 100% certain yet. It just describes "one embodiment of the present invention" - so in their actual GPUs they could do it differently, I think.

But if we assume it's right, one question remains: since it would thus be a very small cache, why put it in memory at all instead of keeping it on the chip? I don't know...
Maybe because of texture changes, which would cause a stall? Being able to transform more vertices in advance could then be essential. But that sounds like a ridiculous and illogical explanation to me...

Any ideas? Or maybe a way to prove I'm wrong again?
Aww, I've already been wrong about so much in so few posts... I must really look like a moron. :(


Uttar

EDIT: Small addition about Alpha Testing/Early Z:
Hmm, yeah, maybe it could delay Z writes. But then, wouldn't that Early Z test potentially have been wasted bandwidth? So maybe it wouldn't be very useful either. I'm not sure - I've never seen an nVidia/ATI document on this, so maybe it doesn't kill it.
But no matter what, it wastes something. If the pixel is rejected by the alpha test after you've done Early Z, you've done a useless Z read. If you skipped Early Z and the pixel would have been rejected by it, you've lost fillrate.
So it's all about which is most important: bandwidth or fillrate. Most of the time it would be fillrate, but there could be rare exceptions.
And anyway, if it had to potentially delay Z Writes, it would have to keep a flag to say if it does or not. And a lot of other things. Maybe that isn't supported by today's hardware? ( or maybe it is. Or maybe it's only supported by the R300/NV30. Or maybe only R300. Or maybe it'll only be supported in R400. Or... you see the idea )
It shouldn't be too hard to add it, but you got to think about adding it first ( and seeing if it's worth the cost )
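Just to make that tradeoff concrete, here's a rough back-of-the-envelope model of it. Every number in it (the per-pixel cost, the fraction of hidden pixels, the fraction killed by alpha test) is a made-up illustrative assumption, not measured hardware data:

```cpp
// Rough model of the tradeoff above: useless Z reads vs. useless shading.
// Every figure here is a made-up illustrative assumption, not hardware data.
#include <cstdio>

int main() {
    const double z_read_bytes        = 4.0;  // one 32-bit Z read per early-Z-tested pixel
    const double alpha_kill_fraction = 0.3;  // assumed share of pixels later killed by alpha test
    const double hidden_fraction     = 0.4;  // assumed share of pixels that early Z could reject

    // Case 1: keep early Z on. Pixels the alpha test kills anyway still paid for a Z read.
    double wasted_bytes_per_pixel = alpha_kill_fraction * z_read_bytes;

    // Case 2: turn early Z off. Pixels that would have been Z-rejected get shaded for nothing.
    double wasted_shading_per_pixel = hidden_fraction;

    printf("early Z on : ~%.2f wasted Z-read bytes per incoming pixel\n", wasted_bytes_per_pixel);
    printf("early Z off: ~%.2f needlessly shaded pixels per incoming pixel\n", wasted_shading_per_pixel);
    printf("Which one hurts depends on whether the pass is bandwidth or fillrate limited.\n");
    return 0;
}
```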


Uttar
 
Uttar said:
Maybe saying it's 95% free is being too optimistic. I just realized something:
1280x1024, 4X AA, takes less bandwidth than 1600x1200 2X AA

Now, why do I say 1280x1024 4X AA takes less bandwidth than 1600x1200 2X AA?
1280x1024 is 68% of 1600x1200
4X AA is 50% of 2X AA if there wasn't Color/Z Compression.
So that means 1280x1024 4X already only uses 84% of what 1600x1200 2X AA uses - if we didn't take Color/Z Compression into account.

When talking about bandwidth limitation, you have to look at bandwidth needs per pixel, not per frame. Then you look at how much bandwidth is available per clock cycle, and you can see whether it will be clock limited or bandwidth limited.

This, of course, assumes you are not CPU limited or transform limited. If you are, then both bandwidth and fillrate are irrelevant.
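A small sketch of that way of thinking, with made-up per-pixel byte counts and roughly GeForce3-class card figures plugged in (all of the numbers are assumptions for illustration, not measurements):

```cpp
// "Bandwidth per pixel vs. bandwidth per clock" reasoning, with assumed numbers.
#include <cstdio>

int main() {
    // Assumed per-pixel traffic for a simple pass (no FSAA, no compression):
    const double bytes_per_pixel = 4.0   // color write
                                 + 4.0   // Z read
                                 + 4.0   // Z write
                                 + 8.0;  // texture traffic left over after the caches

    // Assumed card figures, roughly GeForce3-class:
    const double pixels_per_clock  = 4.0;      // 4 pixel pipes
    const double core_clock_hz     = 200e6;
    const double mem_bytes_per_sec = 7.36e9;   // ~7.36 GB/s

    double need = bytes_per_pixel * pixels_per_clock;   // bytes needed per core clock
    double have = mem_bytes_per_sec / core_clock_hz;    // bytes available per core clock

    printf("need %.1f bytes/clock, have %.1f bytes/clock -> %s limited\n",
           need, have, need > have ? "bandwidth" : "fillrate (clock)");
    return 0;
}
```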

1600x1200 will wind up having a bit lower bandwidth requirements per pixel than 1280x1024, as Z/Color-compression and texture caches work better at higher resolution. A lot of people make the assumption that higher resolution requires higher bandwidth. If you want the same framerate, this is true, but you also need higher fillrate. More bandwidth will help out lower resolutions just as much as (or a bit more than) higher resolutions, again provided that the above assumption (not being CPU/transform limited) holds true.

You can see this in fillrate tests, such as in 3DMark2001. Higher resolutions usually give you a slightly higher fillrate, even though the low resolutions are nowhere near transform limited (usually two triangles fill the screen).

Then you put in FSAA. 4xFSAA has double the bandwidth needs of 2xFSAA, but compression can alleviate that. However you look at it, 1280x1024 @ 4xFSAA has a much greater dependency on bandwidth than 1600x1200 @ 2xFSAA.
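Just counting raw samples per frame (ignoring compression, which is exactly the caveat above) makes the same point:

```cpp
// Raw framebuffer samples per frame: 1280x1024 @ 4x FSAA vs. 1600x1200 @ 2x FSAA.
#include <cstdio>

int main() {
    const double s1280_4x = 1280.0 * 1024.0 * 4.0;
    const double s1600_2x = 1600.0 * 1200.0 * 2.0;

    printf("1280x1024 4x: %.2f Msamples/frame\n", s1280_4x / 1e6);
    printf("1600x1200 2x: %.2f Msamples/frame\n", s1600_2x / 1e6);
    printf("the 4x case touches about %.0f%% more samples per frame\n",
           100.0 * (s1280_4x / s1600_2x - 1.0));
    return 0;
}
```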

The main flaw with your calculations, Uttar, is that you are assuming a framerate and determining the bandwidth needed, and then comparing this number with the available bandwidth. You need to go backwards, determining the fillrate from the bandwidth and the application. The latter is the tricky part, and is tough unless you have real data.
 
OK, I guess the only thing not yet answered is the vertex figures:
Uttar said:
Thus, that's 64*2.5Million, or 160Million, for dynamic vertices
And it's 32*2.5Million, or 80Million, for static vertices
That gives us 240Million bits/frame

But then, you've also got to READ the transformed vertices. Let's consider a scenario where they're all read.
That adds 32*5Million, or 160 Million. Thus resulting a grand total of... 400Million

Divide that by 1024, and again by 1024, and then by 8 to get MBs
That's 47.68MB/frame
At 60FPS, that's 2.86GB/s
And the GFFX is able to do 16GB/s, so it would take nearly 18% of the available bandwidth to have 300M vertices.
And keep in mind that with a bigger FVF or with more parameters required, that number would be even bigger.
Correct, but you made one grave mistake: what are you going to do with a 32-bit vertex (that's 1 (one) float)? A typical vertex is 32 BYTES (position, normal, 2D texture coordinates - 8 floats), so you don't need to divide by 8. (And this does make things a bit more interesting, doesn't it? :devilish: )
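To make the bytes-vs-bits point concrete, here's what such a typical vertex looks like as a C++ struct (the field names are just illustrative; it's the position + normal + 2D texcoord layout described above):

```cpp
// A typical untransformed vertex as described above: 8 floats = 32 bytes (not 32 bits).
#include <cstdio>

struct Vertex {
    float x, y, z;     // position
    float nx, ny, nz;  // normal
    float u, v;        // 2D texture coordinates
};

int main() {
    printf("sizeof(Vertex) = %zu bytes\n", sizeof(Vertex));  // prints 32
    return 0;
}
```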
 
MDolenc: Oh, damn, stupid me :LOL:
However, Hyp-X showed that a lot of things I said are false.
Let's assume it still does the vertex writes after T&L, but nothing else ( wouldn't want to look too dumb again )
So, 5Million*32 = 152.58MB/frame
152.58*60 = 9155.2734375MB/s

Err... WTF?!
9GB/s only in vertex writes?
Okay, well, I guess this is getting ridiculous. There GOTTA be a mistake somewhere. I simply can't trust such a thing even exists.
Hyp-X... Anyone... *PLEASE*... Prove me wrong. I'd be terrified if this were right.

Uttar
 
I retried the Nature test, this time varying the core speed on my 8500 (1024x768x32, no AA/AF); memory was fixed at 325.

core 325: 52.3fps
core 250: 49.8fps

With core at 300:

mem 325: 51.2fps
mem 250: 42.5fps
 
Uttar said:
So, 5Million*32 = 152.58MB/frame
152.58*60 = 9155.2734375MB/s

Err... WTF?!
9GB/s only in vertex writes?
Okay, well, I guess this is getting ridiculous. There GOTTA be a mistake somewhere. I simply can't trust such a thing even exists.
It is correct... If you want to transform 5 MILLION vertices per frame at 60 frames per second :?.
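For reference, the arithmetic being confirmed here, spelled out:

```cpp
// 5 million transformed vertices written per frame, 32 bytes each, at 60 fps.
#include <cstdio>

int main() {
    double bytes_per_frame = 5e6 * 32.0;                          // 160,000,000 bytes
    double mb_per_frame    = bytes_per_frame / (1024.0 * 1024.0);
    double mb_per_second   = mb_per_frame * 60.0;

    printf("%.2f MB/frame -> %.1f MB/s (~%.1f GB/s)\n",
           mb_per_frame, mb_per_second, mb_per_second / 1024.0);
    return 0;
}
```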
 
Uttar said:
But no matter what, it wastes something.
If the pixel is rejected by the alpha test after you've done Early Z, you've done a useless Z read.

Which in this case is not vital as the Z-read and the texture read without the framebuffer access won't saturate the memory bus.
So you'll be fillrate limited in both cases.

If you skipped Early Z and the pixel would have been rejected by it, you've lost fillrate.

Which is more serious, as early Z can operate at about 4x speed even on a GF3.

So it's all about which is most important: bandwidth or fillrate. Most of the time it would be fillrate, but there could be rare exceptions.

No, it's about which is more important: fillrate or Z-occlusion rate.

The only place where it's going to be bandwidth limited is the inside of the visible leaves, but that's the only part where early Z doesn't make any difference since both Z-test and alpha-test succeed.

And anyway, if it had to potentially delay Z Writes, it would have to keep a flag to say if it does or not. And a lot of other things. Maybe that isn't supported by today's hardware? ( or maybe it is. Or maybe it's only supported by the R300/NV30. Or maybe only R300. Or maybe it'll only be supported in R400. Or... you see the idea )
It shouldn't be too hard to add it, but you got to think about adding it first ( and seeing if it's worth the cost )

Now, it's a good question whether it's actually supported.

It wouldn't be too hard to test, but I haven't, and I haven't read any definitive answer on it.
 
Bambers said:
I retried the Nature test, this time varying the core speed on my 8500 (1024x768x32, no AA/AF); memory was fixed at 325.

core 325: 52.3fps
core 250: 49.8fps

With core at 300:

mem 325: 51.2fps
mem 250: 42.5fps

Thanks.

In line with what Dave pointed to earlier, I would now officially state that FutureMark f*cked this so-called 'shader' benchmark up. :?

I'm not saying that I thought it was wonderful in the first place, but at least a company that makes a living out of doing a 'future' benchmark should be conscious of how they construct their engine: making a shader benchmark memory bandwidth limited (AA or not) is simply a major bummer in my book (especially when they have six other benchmarks to examine that aspect).
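For what it's worth, this is how I read Bambers's numbers (the fps figures are his, the percentages are just arithmetic on them):

```cpp
// Relative scaling of the Nature score with core clock vs. memory clock (Bambers's figures).
#include <cstdio>

int main() {
    auto drop = [](double hi, double lo) { return 100.0 * (hi - lo) / hi; };

    printf("core 325 -> 250: clock -%.1f%%, fps -%.1f%% (52.3 -> 49.8)\n",
           drop(325.0, 250.0), drop(52.3, 49.8));
    printf("mem  325 -> 250: clock -%.1f%%, fps -%.1f%% (51.2 -> 42.5)\n",
           drop(325.0, 250.0), drop(51.2, 42.5));

    // The score follows the memory clock far more closely than the core clock,
    // which is what points at memory bandwidth as the main limit here.
    return 0;
}
```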
 
Uttar said:
I might be wrong on this, however... As I said, I'm not 100% sure of all this.
I believe this statement says it all.
But I find it quite logical and likely.
Your logic is flawed, as people have shown.

There's no reason for the transform engine to write out the transformed vertices back to video/AGP/system memory. The reason is that once the vertices are used, they are discarded. In the case of indexed primitives, you want to get vertex reuse, so you have to make sure that you reuse vertices soon enough so that they are still in the vertex cache on the chip. If the vertex gets evicted from the cache, then it'll get reprocessed later if it's reused.

ATi and nvidia have already given developers information about vertex reuse and how soon you need to reuse the vertex to get optimal performance.
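A minimal sketch of why reuse distance matters, assuming a simple FIFO post-transform cache (real chips differ in size and replacement details; the tiny 4-entry cache and the index streams here are purely illustrative):

```cpp
// Minimal FIFO post-transform vertex cache simulator (illustrative assumptions only).
#include <cstdio>
#include <deque>
#include <vector>
#include <algorithm>

// Count how many vertices must be (re)transformed for a given index stream.
int transforms_needed(const std::vector<int>& indices, size_t cache_size) {
    std::deque<int> fifo;
    int transformed = 0;
    for (int idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) == fifo.end()) {
            ++transformed;                      // cache miss: transform again
            fifo.push_back(idx);
            if (fifo.size() > cache_size) fifo.pop_front();
        }
    }
    return transformed;
}

int main() {
    // A strip-like ordering reuses each vertex almost immediately...
    std::vector<int> good = {0,1,2, 1,3,2, 2,3,4, 3,5,4, 4,5,6, 5,7,6};
    // ...while the same triangles in a scattered order reuse vertices too late.
    std::vector<int> bad  = {0,1,2, 4,5,6, 1,3,2, 5,7,6, 2,3,4, 3,5,4};

    printf("good order: %d transforms for 6 triangles\n", transforms_needed(good, 4)); // 8
    printf("bad order : %d transforms for 6 triangles\n", transforms_needed(bad, 4));  // 15
    return 0;
}
```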
 
Forgetting about 3D architecture for a moment and returning to the Nature benchmark, I think this is not a very reliable benchmark.

First of all, the scores in Nature have changed very drastically. At launch, Geforce3 had a score of 20 fps. Det XP's made it 34 fps. Det 40.xx made it around 50 fps. The Radeon 8500 had a score of 25 fps at launch. It then went to 45 fps with later drivers. The GF4 had a score of around 50 fps at launch, and it shot up to near 80 fps with the 40.xx drivers, similar to the GF3 boost.

I think this benchmark lends itself to a lot of software optimization. Maybe they (ATI and NVidia) figured out things that they don't need to draw, or they sort things in a particular way. In any case, it doesn't seem representative of real world performance.

I think a good way of looking at this is how much faster GFFX is than GF4, but comparing it to the 9700 is not very useful, at least for this benchmark.


As for Quake3, NVidia has crazy optimizations here. Heck, the GF4MX does amazingly well here, outdoing the 9000 by quite a bit, even though the 9000 usually wins by quite a margin. I don't think benchmark results in this game are very representative of other games.

Basically I'm saying that this data isn't very useful in predicting how much faster GFFX will be over the 9700. I think we'll just have to wait.

EDIT: I meant Detonator XP's, not Detonator 3's.
 
Let's see: a GF3 can do 40 MVerts/s.

40 MVerts/s * 32 bytes/Vert = 1280 MBytes/s
That's 17.4% of its bandwidth.

Note that 32 comes from the untransformed vertex (X,Y,Z,Nx,Ny,Nz,Tx,Ty).
The transformed vertex is X, Y, Z, 1/W, color1, color2, Tx, Ty.
It might be 32 bytes too, if the colors are stored as dwords and not as 4 floats (in which case it's 56 bytes).

So you'd write the transformed vertices, then read them again. That's wasting 34.8% of the card's bandwidth.
The GF4 has 2 vertex pipes, so that figure can double.

Not to speak about cards like the 9500Pro :)

In real life these calculations don't matter much, though.
Yet, it's fun to do them. :)
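Spelling the arithmetic out (the ~7.36 GB/s GF3 memory bandwidth figure is my assumption of what "its bandwidth" refers to):

```cpp
// Reproducing the percentages above: hypothetical vertex write-back traffic on a GF3.
#include <cstdio>

int main() {
    const double verts_per_sec = 40e6;     // GF3 peak, as stated above
    const double vertex_bytes  = 32.0;     // untransformed: X,Y,Z,Nx,Ny,Nz,Tx,Ty
    const double xformed_bytes = 32.0;     // transformed, with colors packed as dwords
    const double gf3_bandwidth = 7.36e9;   // assumed ~7.36 GB/s total memory bandwidth

    double fetch_untransformed = verts_per_sec * vertex_bytes;        // 1280 MB/s
    double write_plus_reread   = 2.0 * verts_per_sec * xformed_bytes; // hypothetical write-back + read

    printf("fetching untransformed vertices:     %.1f%% of bandwidth\n",
           100.0 * fetch_untransformed / gf3_bandwidth);
    printf("hypothetical write + re-read on top: %.1f%% of bandwidth\n",
           100.0 * write_plus_reread / gf3_bandwidth);
    return 0;
}
```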
 
Uttar said:
So, 5Million*32 = 152.58MB/frame
152.58*60 = 9155.2734375MB/s

Err... WTF?!
9GB/s only in vertex writes?
Okay, well, I guess this is getting ridiculous. There GOTTA be a mistake somewhere. I simply can't trust such a thing even exists.
Hyp-X... Anyone... *PLEASE*... Prove me wrong. I'd be terrified if this were right.

Uttar

You are right. However, if the vertex buffer isn't a dynamic one, the vertices are stored in video RAM and not transferred every frame. For bandwidth reasons, you usually optimize your triangle data, so you end up needing far fewer (<3) vertices per triangle.
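As an illustration of the "far fewer than 3 vertices per triangle" point (a regular grid mesh, with arbitrary example dimensions):

```cpp
// Vertices per triangle for an indexed regular grid mesh.
#include <cstdio>

int main() {
    const int n = 32;                        // grid of n x n quads, arbitrary example size
    const int vertices  = (n + 1) * (n + 1); // corners are shared between neighbouring quads
    const int triangles = 2 * n * n;         // two triangles per quad

    printf("%d vertices, %d triangles -> %.2f vertices per triangle\n",
           vertices, triangles, (double)vertices / triangles);
    // Without indexing, the same mesh would need 3 vertices per triangle (6144 here).
    return 0;
}
```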

Thomas
 
Considering that the game tests are supposed to be "gamey" tests (and let's not get into an argument about that, please ;)), why should those tests focus on just one aspect of the PC? Hell, if you're going to make the game tests equivalent to modern games, just make them all totally CPU dependent :?

Anyway, losing the point I was thinking of originally here... oh yes, the Nature test. It was never designed to be a "shader test" - just an example of what a full DX8 game could be. Ignoring the vertex shaders for the moment, the lake surface is the only place pixel shaders are used (which is where they're used in every bloody game around at the moment that does pixel shading...), so you've got plenty of memory reads for the normal and cube maps there. The vertex shaders, AFAIK, are just used for transforms and a spot of tweening; no lighting. Overall, the shader usage, as other people have pointed out, is quite light - the slowest parts of the test are the leaves and grass shots, where you have a shedload of alpha textures all over the place (the lake section, IIRC, is the fastest part).

Umm... dang - I've forgotten why I was even mentioning this now...
 
Neeyik said:
Considering that the game tests are supposed to be "gamey" tests (and let's not get into an argument about that please ;))

So, exactly what type of game does Nature represent? Some kind of blissed out fishing game? ;)

IMO, the advanced test looks better and is probably more representative of shader performance (judging by the benchmarks, I don't know what peculiarities it has under the skin).
 
LeStoffer said:
Thanks.

In line with what Dave pointed to earlier, I would now officially state that FutureMark f*cked this so-called 'shader' benchmark up. :?

umm... let me be the stupid one this time...
AFAIK, Game 4 isn't a shader benchmark. It never has been. It has been called a shader bench because everyone assumes so (it needs DX8 hardware, maybe that's why). Game 4 is part of the gamer's benchmark, which is meant to test whole-system gaming performance. (And as far as I know, it was added to the total score to give the whole software some extra lifetime; just DX7 benches would sound pretty bad now, don't you think? Though when the bench was released, it was blamed for being nVidia biased due to DX8 support.) For pros, they have special tests for testing shaders and other special stuff.

Sorry everyone who loves to hate MadOnion / FutureMark, but I just can't understand you sometimes. Hopefully MadOnion makes their new benchmark so that the free version only has whole-system benches and the purchasable commercial version has all the special tests that the free version has now. Then it would be clear to everyone how to interpret the results. The free version would just output the score and maybe show the details; all the special tests would be available only in the commercial version. (Possible pro tests could be AA quality, fillrate, vertex processing, shader speeds (1.0 - 3.0), anisotropic filtering speeds, etc., and last but not least a randomizable performance test, which could protect a bit better against driver-level 3DMark optimizations.)

Heck, maybe I should try to get a job at FutureMark... I need one, and soon.
 
So, exactly what type of game does Nature represent? Some kind of blissed out fishing game? ;)
Dang! The secret is out! Well, you see it was all Microsoft's fault...they claimed that they were going to release DX8 with an exclusive Microsoft Shark Fishing Simulator 2001 game but it all went monkey's up at the last minute...dodgy, late night email....crossed lines...Finnish-American misunderstanding....and the final result was Old Man Fishing Simulator 2001.

Still - they were right about games using pixel shaders for nothing more than bloody water materials... :rolleyes:
 
Uttar, don't take this as a hostile attack, but I've seen 3 separate threads now where you've posted essentially totally incorrect conclusions about hardware (e.g. the supersampling vs. multisampling thread, etc.).

GPUs don't need to write post-transformed vertices back to video RAM. There is an on-chip FIFO that stores the last few post-transformed vertices. It's between 10 and 16 vertices depending on chip architecture. DirectX even includes an Optimize Mesh routine that essentially reorders your vertices to make optimal use of the vertex FIFO. It has a tremendous effect on performance.

The NVidia patent you refer to is talking about a small on-chip buffer, not writing post-transformed vertices back to video memory (and reading them back! That would be a stupendously dumb design).
 
Nappe1 said:
Sorry everyone who loves to hate MadOnion / FutureMark, but I just can't understand you sometimes.

Likewise, I can't understand why anyone would defend the program. It doesn't relate to games today; if it did, a Kyro should not outperform a GeForce MX... yet it does.

it was blamed for being nVidia biased due to DX8 support

Wrong again. Biased because Nvidia was the first to come out with a performance analyzer made by MadOnion and pimped by MadOnion on their main page (it told people their Radeon, Kyro, etc. was slow and needed a GeForce 3)... then of course ATI had to jump on the bandwagon... remember, this is supposed to NOT favor any IHV.. :LOL:
Biased because DX 8.1 hardware was never given a chance at scoring, a la the SE version (PS 1.4 is a feature, remember, while PS 1.1 was OK to be considered for scoring) :LOL:
Let's not be silly here; read any of the fan forums. 3DMark is the de facto standard for the general public, and the average gamer is not 3D-versed, so if it scores higher... IT'S BETTER :p

For pros, they have special tests for testing shaders and other special stuff.

As proven here, it's not really testing shaders :?


I will stick to the age-old standard: use REAL GAMES for reviews and testing. Isn't that what we all want to see?
3DMark is a very misleading name; it should have been SystemMark from the beginning.
Too many politics involved with this benchmark, has been since the '3Dmark' tests appeared.

P.S. Of course, allowing LOD adjustments when counting the score is just plain stupid.
 
Question to Alexsok et al

How sure is your source that MaximumPC had A01 silicon to test? Is it from unstated facts or deductive reasoning?

I could easily believe it is A01 silicon with limited drivers. The longer it takes for review boards to come out, the more suspect everything starts to look. Can no one engineer this board at a reasonable cost? Why is it so quiet on the Western front?

I'd love to see the final card benchmarked on the Serious Sam II and Aquanox game engines.

Any idea fellas how long until review samples and drivers start appearing?
 