View Full Version : GeForce FX: 8x1 or 4x2?
Dave Baumann
10-Feb-2003, 17:20
I'm getting a number of reports from people saying that they have not managed to get more than 4 pixels per clock out of GeForce FX. Normally, if running in 32bit, the 3DMark fillrate tests will not show more than four pixels per clock on the GFFX because of bandwidth limitations - however, even if the colour is reduced to 16bit with 16bit textures, the multitexturing performance is still twice the single texturing performance and the single texturing is still less than half the theoretical performance of an 8x1 card (1.4Gp/s single tex, 3.4Gt/s multitex).
When Radeon 9500 PRO is run with these settings it does achieve a rate that is greater than 4 pixels per clock (as I pointed out to XBit labs in their pulled review).
Obviously I'd like to verify these claims myself, but I don't have a board at the moment.
Brent - fancy doing a little more testing? If you can, can you clock down the core but keep the memory high then run a fillrate test in 16bit.
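For reference, the per-clock figures behind those numbers can be worked out roughly as below. This is only a sketch: it assumes the 500MHz Ultra core clock that is quoted later in the thread, and the helper name `per_clock` is made up for illustration.

```python
# Rough sketch: convert measured fillrates into per-clock throughput.
# Assumption: 500 MHz core clock (GeForce FX Ultra figure quoted in the thread).
CORE_MHZ = 500


def per_clock(rate_millions_per_sec, core_mhz=CORE_MHZ):
    """Pixels (or texels) delivered per core clock for a measured rate."""
    return rate_millions_per_sec / core_mhz


single_tex_mpix = 1400  # 1.4 Gp/s single texturing, 16bit
multi_tex_mtex = 3400   # 3.4 Gt/s multitexturing, 16bit

pixels_per_clock = per_clock(single_tex_mpix)
texels_per_clock = per_clock(multi_tex_mtex)
theoretical_8x1_mpix = 8 * CORE_MHZ  # peak pixel rate of an 8x1 part

print(f"single texturing: {pixels_per_clock:.1f} pixels/clock")
print(f"multitexturing:   {texels_per_clock:.1f} texels/clock")
print(f"share of 8x1 peak: {single_tex_mpix / theoretical_8x1_mpix:.0%}")
```

Under those assumptions the single texturing result sits below 4 pixels per clock, which is the point being raised.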
Joe DeFuria
10-Feb-2003, 17:26
Oh man......if 4x2 turns out to be the actual case, someone (*cough* nVidia, *cough*) is going to have a lot of explaining to do...
Brent - fancy doing a little more testing? If you can, can you clock down the core but keep the memory high then run a fillrate test in 16bit.
Yes, would be great if we could see 16 bit fill rate scores with the core at 300, and the memory at 500....
Livecoma
10-Feb-2003, 17:31
Oh man......if 4x2 turns out to be the actual case, someone (*cough* nVidia, *cough*) is going to have a lot of explaining to do...
Like how they compete against R300 with less memory bandwidth and now half the pixel pipelines?
I am not trying to question you or your sources Dave, but I would be surprised if this turned out to be the case...
Joe DeFuria
10-Feb-2003, 17:41
Like how they compete against R300 with less memory bandwidth and now half the pixel pipelines?
Actually, it would explain how they "merely compete" with the R300 in High resolution, no AA, and particularly with older "designed optimally for dual texturing" games like Quake3.
As you know, based on theoretics (?) we were all expecting NV30 to soundly beat R300 at high resolution, non AA benchmarks....assuming that the pixel rate advantage was a factor of 1.5
If the FX actually has a slight pixel rate disadvantage factor of 1.3, but still holds a TEXEL rate advantage of 1.5, I believe that would actually explain pretty nicely the performance numbers we've seen...
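The 1.5 and 1.3 factors fall out of simple pipes x clock arithmetic. A minimal sketch, assuming the 500MHz (NV30 Ultra) and 325MHz (R300) core clocks quoted elsewhere in this thread; `peak_rates` is an illustrative helper, not anything from a real tool.

```python
def peak_rates(pipes, tmus_per_pipe, core_mhz):
    """Return (Mpixels/s, Mtexels/s) peak rates for a pipes x TMUs layout."""
    pixel_rate = pipes * core_mhz
    texel_rate = pipes * tmus_per_pipe * core_mhz
    return pixel_rate, texel_rate


r300 = peak_rates(8, 1, 325)      # 8x1 at 325 MHz
nv30_8x1 = peak_rates(8, 1, 500)  # what was expected
nv30_4x2 = peak_rates(4, 2, 500)  # the speculated layout

expected_pixel_advantage = nv30_8x1[0] / r300[0]  # NV30 ahead if 8x1
pixel_disadvantage_4x2 = r300[0] / nv30_4x2[0]    # R300 ahead if 4x2
texel_advantage_4x2 = nv30_4x2[1] / r300[1]       # NV30 still ahead on texels

print(f"8x1 pixel advantage:    {expected_pixel_advantage:.2f}")
print(f"4x2 pixel disadvantage: {pixel_disadvantage_4x2:.2f}")
print(f"4x2 texel advantage:    {texel_advantage_4x2:.2f}")
```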
Livecoma
10-Feb-2003, 17:44
Wow you just said the same thing I said with an ATI bias.
Dave Baumann
10-Feb-2003, 17:44
It would certainly explain some of the shader numbers we've seen so far - beyond that of simple driver inefficiencies.
However, at this point I want to see some testing before reaching any conclusions. Even if it were the case that testing corroborates these results then there may be other explanations as well...
Nebuchadnezzar
10-Feb-2003, 17:47
If this is true, then say goodbye to nvidia! :shock:
demalion
10-Feb-2003, 17:50
Alternate theory 1: What if all backbuffer handling is done at 32-bit? Seems like a driver possibility depending on how the color compression/AA works.
I'm assuming AA is off for these tests, but maybe the drivers aren't changing behavior for that.
One way to maybe verify this (but not necessarily disprove it) would be to check for how fog or maybe translucency effects look in 16-bit.
Livecoma
10-Feb-2003, 17:51
It would certainly explain some of the shader numbers we've seen so far - beyond that of simple driver inefficiencies.
However, at this point I want to see some testing before reaching any conclusions. Even if it were the case that testing corroborates these results then there may be other explanations as well...
If this is the case NVIDIA must have fooled John Carmack because he seems confident driver optimizations will remedy that performance disparity. Considering his experience working with NV30, he should have noticed it. I am sure he would have pointed it out in his plan file.
However I completely agree. More testing needs to be performed.
Please, no conspiracy theories...
Althornin
10-Feb-2003, 17:58
Wow you just said the same thing I said with an ATI bias.
You didn't say anything....
And what's your attitude problem?
antlers
10-Feb-2003, 18:00
To me, Carmack didn't seem confident. He said NVidia seemed confident and was trying to interpret facts in that light.
I think if Carmack really were confident that drivers would fix things (i.e., he had figured out what was wrong) his tone would have been different.
Dave Baumann
10-Feb-2003, 18:02
If this is the case NVIDIA must have fooled John Carmack because he seems confident driver optimizations will remedy that performance disparity. Considering his experience working with NV30, he should have noticed it. I am sure he would have pointed it out in his plan file.
We've been sitting around scratching our heads wondering if it can be doing 1 or 2 FP16 instructions per pipe, and based on the evidence so far thinking that it must be 1 FP16 instruction.
Perhaps the reality is that it's 1 FP32 instruction and 2 FP16 instructions, but over 4 pipes. This would corroborate JC's findings, being that the ARB path (using 128bit / FP32) is half as fast as ATI's ARB path, but with the NV30 path (which is using FP16) it's on par/faster. With a little more compiler optimisation, at 2 FP16 instructions per clock the Ultra could end up faster than R300 - i.e. 2 FP16 instructions x 4 pipes x 500MHz vs 1 FP24 instruction x 8 pipes x 325MHz. It could also explain the 'twitchy'ness in the compiler that John talks of - it could be tough to get two FP16 instructions per clock to actually run efficiently.
Again, I'm not drawing conclusions but I'm suggesting that it can still fit based on JC's findings so far.
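The instruction-rate comparison above can be sketched as follows. This only encodes the speculated configuration (instructions per clock x pipes x clock); the function name `instr_rate` is hypothetical and none of it is a confirmed hardware figure.

```python
def instr_rate(instr_per_clock, pipes, core_mhz):
    """Peak shader instructions per second, in millions."""
    return instr_per_clock * pipes * core_mhz


# Speculated NV30 Ultra layout: 4 pipes at 500 MHz.
nv30_fp16 = instr_rate(2, 4, 500)  # two FP16 instructions per pipe per clock
nv30_fp32 = instr_rate(1, 4, 500)  # one FP32 instruction per pipe per clock
# R300: 8 pipes at 325 MHz, one FP24 instruction per pipe per clock.
r300_fp24 = instr_rate(1, 8, 325)

print(f"NV30 FP16: {nv30_fp16} M instr/s")
print(f"NV30 FP32: {nv30_fp32} M instr/s")
print(f"R300 FP24: {r300_fp24} M instr/s")
print(f"FP16 vs R300: {nv30_fp16 / r300_fp24:.2f}x")
print(f"FP32 vs R300: {nv30_fp32 / r300_fp24:.2f}x")
```

On these peak numbers alone FP32 would trail R300 and FP16 would lead it, which is the direction of the results being discussed, though the raw ratio does not by itself account for a full 2x gap on the ARB path.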
oh godi godi godi...
another flame war coming up...
ahh I can already sense it. ;)
Joe DeFuria
10-Feb-2003, 18:04
Wow you just said the same thing I said with an ATI bias.
As opposed to your nVidia bias? :roll:
"What you said" made it seem like it was some amazing feat that NV30 could even "compete" with R300 if the NV30 was 4x2. I merely explained how it's not amazing at all. A 4x2 clocked at 500 Mhz SHOULD "compete" with an 8x1 architecture at 325 Mhz, in certain situations.
In any case, to be perfectly clear, the EXPLAINING that nVidia would have to do doesn't really have much to do with performance at all. Performance is performance. It's just to do with what they said their architecture is. Which I believe they claim as 8 pixel pipelines...
From the B3D "interview":
B3D: We know that GeForce FX has a total of 8 pixel pipelines running at 500MHz, however how many texture mapping units does it feature per pipeline?
NVIDIA: Well, as we move into programmable shading the old conventions of fixed pipelines are becoming less important and less accurate, so with that caveat let me go back and answer your question.
We have 8 pipelines and they can each apply one texture per clock so we can apply 8 textures per clock.
Heh...now that I re-read that quote, nVidia never actually said 8 "PIXEL" pipelines. A 4x2 architecture would satisfy their response to the B3D question: "8 pipelines....8 textures per clock."
Though I'm betting every "review" of the FX states 8 pixel pipelines "just like the R300".
And nVidia's web-site proclaims "8 Pixels per clock Rendering Pipeline"
The bottom line is, we expect certain performance characteristics from an 8x1 architecture, and they are different from that of a 4x2 architecture. And I'd have to say, the performance characteristics of a 4x2 architecture seem to more readily explain the current performance profile of the NV30.
So as far as I'm concerned, this is a matter of whether or not nVidia is lying to us about their architecture. A few possibilities:
1) They are lying. It's 4x2.
2) It acts as 8x1 only in some very specific situations?
3) The hardware is 8x1, but current drivers limit the actual operation to be 4x2. Future drivers may "enable" 8x1 operation.
Please be clear that I'm certainly with Dave on this....this is all speculation, and I'm not assuming that FX is actually 4x2. More data is needed. I'm just responding to "what would need to be explained" if it is determined to be a 4x2 pipeline.
Livecoma
10-Feb-2003, 18:06
So the miscommunication wars begin...
So that's what they meant by "New forms of marketing"
Chalnoth
10-Feb-2003, 18:29
If this turns out to be the case, then it would be very nice if we could get a benchmark of stencil fillrate...
martrox
10-Feb-2003, 18:31
If this is true, can you imagine how many pairs of undies got messed up at nVidia when the R300 was introduced? :shock:
antlers
10-Feb-2003, 18:42
There was that NDA'd .pdf that the review sites had well before the R300 was released that had the planned NV30 specs.
I know people here have seen it (I don't know if it is still confidential)--did it refer to 8x1 or 4x2? From what Dave had said in the past, it sounds like it referred to 8x1. So their claiming 8x1 is unlikely to be a reaction to the R300 release, right?
LeStoffer
10-Feb-2003, 19:03
Perhaps the reality is that it's 1 FP32 instruction and 2 FP16 instructions, but over 4 pipes. This would corroborate JC's findings, being that the ARB path (using 128bit / FP32) is half as fast as ATI's ARB path, but with the NV30 path (which is using FP16) it's on par/faster. With a little more compiler optimisation, at 2 FP16 instructions per clock the Ultra could end up faster than R300 - i.e. 2 FP16 instructions x 4 pipes x 500MHz vs 1 FP24 instruction x 8 pipes x 325MHz. It could also explain the 'twitchy'ness in the compiler that John talks of - it could be tough to get two FP16 instructions per clock to actually run efficiently.
I have a hard time believing the 4x2 pipeline since nVidia promised us 8 pixels per clock, but I can follow what you're suggesting here. It would in theory give us 8 pixels per clock with FP16 - but where does that leave us with the integer path, Dave?
Damn! I had written a HUGE thing, and now it's destroyed because I wasn't logged in...
To say it a LOT shorter:
Could it be possible there'd be units converting from and to 32BPP? So, if there'd be 8 of them, there'd have to be 4 for the TMUs and 4 for the shader calculators. So you'd have an effective 4 pipelines...
Would be a ridiculous bug and it would surprise me if it passed Q&A, but who knows...
Uttar
Wasn't it Hellbinder that was betting the FX was a 4x2 setup? If he was the one and it turns out correct (again two big ifs) it would imply that hell has indeed frozen over as he was correct :shock: :) :D :lol:
Just Kidding HB :)
Whatever the cause, it's sure interesting....
martrox
10-Feb-2003, 19:48
Hellbinder...... heh, heh, heh..... Hell frozen over......heh, heh, heh........ I get it!
One of the reqs of DX9 is the ability to render 8 textures per pass isn't it (or something like that)? Would a 4x2 design allow for that?
If this is the case NVIDIA must have fooled John Carmack because he seems confident driver optimizations will remedy that performance disparity. Considering his experience working with NV30, he should have noticed it. I am sure he would have pointed it out in his plan file.
We've been sitting around scratching our heads wondering if it can be doing 1 or 2 FP16 instructions per pipe, and based on the evidence so far thinking that it must be 1 FP16 instruction.
Perhaps the reality is that it's 1 FP32 instruction and 2 FP16 instructions, but over 4 pipes. This would corroborate JC's findings, being that the ARB path (using 128bit / FP32) is half as fast as ATI's ARB path, but with the NV30 path (which is using FP16) it's on par/faster. With a little more compiler optimisation, at 2 FP16 instructions per clock the Ultra could end up faster than R300 - i.e. 2 FP16 instructions x 4 pipes x 500MHz vs 1 FP24 instruction x 8 pipes x 325MHz. It could also explain the 'twitchy'ness in the compiler that John talks of - it could be tough to get two FP16 instructions per clock to actually run efficiently.
Again, I'm not drawing conclusions but I'm suggesting that it can still fit based on JC's findings so far.
Fascinating, Dave.... Tell you the truth I've also been very muddled on precisely why it is the GF FX Ultra seemed so much slower per clock than the R300--it's obviously architecture, but your speculation here would certainly pin the difference down in concrete terms, wouldn't it?
Ratchet: If the design has amazing flexibility, yes. And the NV30 most certainly does: it's actually able to do *16* textures per pass, but only 8 per clock. And that's supposing 8 pipelines.
So such a thing is certainly possible, it simply requires some more complex systems.
BTW, in order to prove my theory, the following should be done:
1. Render *without* textures, but with one native instruction in FP16 ( such as one ADD ) to make sure drivers can't optimize it one way or another -> Result: 4GP/s ( or less than that since that's the theoretical figure )
2. Render with 1 texture -> Result: 2GP/s ( or less than that since that's the theoretical figure ) and
In case that's true, why would there be more than 2GT/s when using multiple textures?
Well, perhaps some units *are* able to input/output a lot of info. But the only problem would be that they can't do both at once! So, when using multiple textures, there's *still* only one input.
That would make a lot of sense, IMO... Of course, it's a lot more complex than the 4x2 explanation, but it seems just as possible. Unless I'm making a mistake somewhere...
Could we PLEASE get a GP/s number without textures and just one PS instruction which can't be optimized by the drivers?
Uttar
Wasn't it Hellbinder that was betting the FX was a 4x2 setup? If he was the one and it turns out correct (again two big ifs) it would imply that hell has indeed frozen over as he was correct :shock: :) :D :lol:
Just Kidding HB :)
Whatever the cause, it's sure interesting....
Nope, he suggested a 4x4 setup based on a vague diagram that gave a truncated example of how the NV30's pixel pipeline functions.
THe_KELRaTH
10-Feb-2003, 20:39
It's getting to seem more n more like a GF4 with bolt on extras with higher memory / core speed.
alexsok
10-Feb-2003, 20:39
This discussion could last for centuries... but let me tell u this: I originally implied that NV30 would have independent pipelines, meaning they could be combined in any way, so there was no longer a setup 4x2, 8x1, etc...
Well, apparently, it seems something like that is already implemented in NV30, but it seems as either it's not working or is working incorrectly...
If it's indeed the case, it might be a possibility that the problem lies in the drivers, since the card itself is not of any specific configuration.
Thowllly
10-Feb-2003, 20:48
One of the reqs of DX9 is the ability to render 8 textures per pass isn't it (or something like that)? Would a 4x2 design allow for that?
Yes. And it would do it the same way as the 8x1 design, with loopback. Even a 1x1 design could do 8 textures or more per pass.
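A toy model of the loopback point, for illustration only: it assumes each pipe simply loops until all textures for a pixel have been applied, with nothing else limiting throughput. The function name `pixels_per_clock` is invented for this sketch.

```python
import math


def pixels_per_clock(pipes, tmus_per_pipe, textures):
    """Pixels finished per clock when each pixel needs `textures` lookups.

    Toy assumption: a pipe applies up to `tmus_per_pipe` textures per
    clock and loops back until every texture has been applied.
    """
    loops = math.ceil(textures / tmus_per_pipe)
    return pipes / loops


for name, pipes, tmus in [("8x1", 8, 1), ("4x2", 4, 2), ("1x1", 1, 1)]:
    rates = [pixels_per_clock(pipes, tmus, n) for n in (1, 2, 8)]
    print(name, rates)
```

In this model every layout can handle 8 textures per pass; 8x1 and 4x2 only differ in the single-texture case, where 8x1 finishes twice as many pixels per clock.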
This discussion could last for centuries... but let me tell u this: I originally implied that NV30 would have independent pipelines, meaning they could be combined in any way, so there was no longer a setup 4x2, 8x1, etc...
Err, what?
I can't figure out the use of such a system... 8x1 is obviously as efficient as 4x2 in all situations, but costs more transistors.
What's the use of transforming 8x1 into 4x2, since it's as efficient in many situations and less efficient in some others?
Uttar
I don't see why they would want to the pixel pipelines dynamically configurable in such a way. An 8x1 setup is simply more versatile than a 4x2 since it is able to achieve the same fillrate in multitexturing applications, while having a much greater single-texturing fillrate.
There is none. An 8x1 can do anything a 4x2 can do, except that it can run at 8 pixels per clock.
Ostsol,
oh well. Sorry for misquoting you Hellbinder.
If the pixel pipes are locked to a fixed pattern (4pix*2pix*1tex vs 2pix*2pix*2tex), then less but more advanced pipes can be more efficient for small triangles. Except for simple shaders of course.
Other reasons for flexible config could be resource sharing. 8x1 is only important over 4x2 for very short shaders. Those shaders can only use a few of the resources (like registers), combining the pipes for longer shaders would give you double the number of registers "for free".
There are of course costs for making a superscalar^H^H^H^H^H^Hvectorized processor, and I won't say anything about whether the benefits are worth those costs.
And I won't bother to get an opinion on what's the most likely way it actually operates. Time will tell.
afaik. the fillrate test of 3D-Mark uses transparent textures. Maybe the NV30 can not handle 8 pixel-fb-reads and 8 pixel fb-writes per clock.
Maybe a limitation of only 8 framebuffer pixel-operations every clock?
Ichneumon
10-Feb-2003, 21:54
This discussion could last for centuries... but let me tell u this: I originally implied that NV30 would have independent pipelines, meaning they could be combined in any way, so there was no longer a setup 4x2, 8x1, etc...
Well, apparently, it seems something like that is already implemented in NV30, but it seems as either it's not working or is working incorrectly...
If it's indeed the case, it might be a possibility that the problem lies in the drivers, since the card itself is not of any specific configuration.
The above was basically what my thought on this was...
There had been discussion early on (heh, months ago now) about how Nvidia was going for flexibilty in their pipelines, but that it had in the end been removed from the NV30, or at least we wouldn't be seeing the full capabilities of that functionality.
If the NV30 once had, or still does have, the capability to run either as an 8x1 architecture, or under certain beneficial circumstances it can combine pipelines into a functionally 4x2 setup, perhaps it simply is something in the drivers where it is utilizing a 4x2 setup in non-optimal circumstances, causing all this debate over benches we've seen so far.
The glass is half empty view of this is that perhaps there's something arsed in the chip which is forcing it to end up in 4x2 mode when it shouldn't, which to me seems like one of those things that would be a significant contributor to deciding to kill production of the part since fixing it would (I expect) require a respin, and there just isn't time for that anymore and still have it be a product.
They could work around a hardware problem like that in drivers, but it would never be optimal... so a limited run of it and shipping their currently produced cards at least makes them some revenue from what would otherwise be a complete loss.
Hellbinder
10-Feb-2003, 21:56
Well if it turns out to be true, then It looks like my 4x4 claim based off something I was told a long time ago is not so far fetched after all.. eh?
Not that it is a 4x4, just that the information i was given could actually be closer to the truth.
EDIT:
Nope, he suggested a 4x4 setup based on a vague diagram that gave a truncated example of how the NV30's pixel pipeline functions.
Actually I was told straight up by a developer that it was a 4x4, and later I saw the diagram and said *holy crap, it really is a 4x4*.. but who knows what's really going on.
Hellbinder
10-Feb-2003, 22:04
Btw, it simply can't be a 4x2, can it? Nvidia stated it is an 8x1, did they not? Would a company flat out lie like that??? It does not seem very likely.
arjan de lumens
10-Feb-2003, 22:06
Hmm .. one way to distinguish between 8x1 and 4x2 could be to run a shader program with a long string of dependent texture lookups - 8x1 should then perform close to its theoretical texel fill rate, whereas 4x2 will generally be unable to use its second texturing unit for anything useful. (The texture maps should probably be small to keep operation contained in the texture cache, and filtering should be at most bilinear to avoid the texture unit combining that sometimes follows with trilinear.)
Well, it might be '4 pixels flowing through a pipeline capable of 8 texture reads per clock', which is logically equivalent to the old term '4x2' (and would have much the same performance characteristics) but it's more of a pool-of-resources model.
So it would only tell you if it has a pool of units, or fixed mapping units. I don't think it would tell you it's '8x1'.
arjan de lumens
10-Feb-2003, 23:12
I made the implicit assumption that in a 4x2 setup, you need to access 2 textures in parallel for each pixel to sustain the full texel fillrate. With a chain of dependent texture reads/EMBM, such parallelism cannot be exploited. In a '4 pixels flowing through a pipeline capable of 8 texture reads per clock' you would, under this assumption, still be restricted by the dependent texture reads to 1 texel per pixel per clock (unless the texture unit is able to texture 2 distinct groups of 4 pixels each both at the same time, but that would make it logically closer to 8x1 than 4x2).
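The proposed test can be illustrated with a deliberately simple model. It encodes only the assumption stated above (a dependent chain leaves each pixel able to use one texture unit per clock) and ignores caches, latency hiding and scheduling; `texels_per_clock` is a hypothetical helper.

```python
def texels_per_clock(pipes, tmus_per_pipe, dependent_chain):
    """Texels fetched per clock under the stated assumption.

    With a chain of dependent lookups, each pixel can only issue one
    lookup at a time, so extra TMUs on the same pipe sit idle.
    """
    usable_tmus = 1 if dependent_chain else tmus_per_pipe
    return pipes * usable_tmus


for name, pipes, tmus in [("8x1", 8, 1), ("4x2", 4, 2)]:
    free = texels_per_clock(pipes, tmus, dependent_chain=False)
    chained = texels_per_clock(pipes, tmus, dependent_chain=True)
    print(f"{name}: independent={free} dependent={chained}")
```

If the hardware behaved like this model, a dependent-lookup benchmark would show 8x1 holding its texel rate while 4x2 drops to half.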
kid_crisis
11-Feb-2003, 05:17
from Nvidia's site:
http://www.nvidia.com/view.asp?page=quadrofx
"The NVIDIA Quadro FX architecture takes application performance to new levels by featuring three parallel vertex engines, a radically new line engine, the industry’s first on-chip vertex cache, and eight fully programmable pixel pipelines coupled to a high-speed DDR2 graphics DRAM bus."
Actually, according to David Kirk, the GFFX works on *32* pixels at once.
But it can only output 8 at once.
That's why there's "dynamic allocation", and what alexsok says is basically that. The shader calculators dynamic allocation.
It certainly got little to do with 8x1 and 4x2, but it can be understood how such a thing was supposed from that.
BTW, interesting idea Gery has: transparent textures.
As he says, maybe the limitation is 8 framebuffer pixel-operations/clock, and not 8 pixel-fb-writes/clock?
That would make a lot of sense...
Uttar
JF_Aidan_Pryde
11-Feb-2003, 06:35
"NV30 pipeline is a collection of processing elements."
Maybe their pipeline count is like their bandwidth counting. "Effective".
overclocked
11-Feb-2003, 07:46
"The NVIDIA Quadro FX architecture takes application performance to new levels by featuring three parallel vertex engines, a radically new line engine, the industry’s first on-chip vertex cache, and eight fully programmable pixel pipelines coupled to a high-speed DDR2 graphics DRAM bus."
It seems the Quadro FX doesn't share the "same" core, as was the case before with Quadros built on their previous GeForce lines.
What does "a radically new line engine" exactly mean?
Could someone post the following arb fragment shader stats for the gffx (this could give us hints on the architecture)?
MAX_TEXTURE_COORDS_ARB (number of texture coordinate sets)
(r9700: 8)
MAX_TEXTURE_IMAGE_UNITS_ARB (number of texture image units)
(r9700: 16)
background info:
http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt
(16) Should aux texture units be additional units on top of existing
full-featured texture units, or should this spec fully deprecate
"legacy" texture units and only expose texture coordinate sets and
texture image units?
Background: Some implementations are able to expose more
"texture image units" (texture maps and associated parameters)
than "texture coordinate sets" (current texcoords, texgen, and
texture matrices). A conventional GL "texture unit" encompasses
both a texture image unit and a texture coordinate set as well as
texture environment state.
RESOLVED: Yes, deprecate "legacy" texture units. This is a more
flexible model.
Thomas
I made the implicit assumption that in a 4x2 setup, you need to access 2 textures in parallel for each pixel to sustain the full texel fillrate.
The test would indeed tell you if it was configured that way. We're back to the 'legacy terminology' again - 4x2, 8x1 etc. is all a bit meaningless nowadays except as an 'at a glance' report of its peak pixel and texel fill rates.
Luminescent
11-Feb-2003, 12:33
Do you guys believe the R300 is also comprised by a pool of calculating units (VS, texture address, texture interpolation, and color units), rather than inflexible blocks of pipeline?
megadrive0088
11-Feb-2003, 12:46
Lesson learned! don't buy a GFFX. nobody knows what the hell it is. :shock:
buy a card that has a GPU that is well understood, with no production difficulties or severe delays. :P
LeStoffer
11-Feb-2003, 12:46
Do you guys believe the R300 is also comprised by a pool of calculating units (VS, texture address, texture interpolation, and color units), rather than inflexible blocks of pipeline?
Everything points to what you call 'inflexible blocks of pipeline' and what I call a well balanced performance/feature design. :wink:
Could someone post the following arb fragment shader stats for the gffx (this could give us hints on the architecture)?
Those numbers don't tell anything about the architecture. We already know that 8 texture coordinates and 16 textures are required for PS2.0. But this is perfectly possible with one physical pipeline and one physical texture unit.
Mintmaster
11-Feb-2003, 15:17
Hmm .. one way to distinguish between 8x1 and 4x2 could be to run a shader program with a long string of dependent texture lookups
This is a good idea, but I think dependent texture lookups are the problem with GFFX. That's my explanation for why it can do PS 1.1 so fast and does PS 1.4 so slowly. PS 1.1 can be done in the register combiners because there are fixed dependent texture modes. PS 1.4, on the other hand, can have any math before the phase marker, so you need a very flexible buffering system to make the texture lookup latencies disappear. This is also consistent with the "scheduling" theory of why the GFFX pixel shaders are slow.
If this is true, then your method wouldn't accomplish anything, since even the 8x1 would be held up. Scheduling wouldn't make a difference either, because nothing can be re-ordered.
What does "a radically new line engine" exactly mean?
It's marketing speak for it can render wireframe lines faster. This feature is exclusive to the Quadro cards and is useful for 3d modeling.
Could someone post the following arb fragment shader stats for the gffx (this could give us hints on the architecture)?
Those numbers don't tell anything about the architecture. We already know that 8 texture coordinates and 16 textures are required for PS2.0. But this is perfectly possible with one physical pipeline and one physical texture unit.
That's true, but it could give a direction or a hint because no one would do the 1 pipeline 1 tmu design...
Thomas
megadrive0088
12-Feb-2003, 03:54
one can only hope (for Nvidia's good) that the NV35 is to the NV30 what the TNT2 was to the TNT. that is, a kickass chip.
Mulciber
12-Feb-2003, 07:01
Ok, so now we have another set of benchmarks to look at.
Anyone else notice the staggering disparity between fillrates on the 9700 and GFFX in 3DMark03?
3dmark03 1600x1200 (from hardocp)
GFFX Single Texture - 1364
GFFX Multi Texture - 3354
Theoretical - 4000
9700 Single Texture - 1599
9700 Multi Texture - 2475
Theoretical - 2600
Neither seem to be living up to their theoretical potential. nVidia much less than ATI
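To put numbers on "living up to their theoretical potential", the scores above can be turned into efficiencies. A small sketch using only the figures posted here; the dictionary layout and the `efficiency` helper are just for illustration.

```python
# Scores as posted above (3DMark03, 1600x1200, from hardocp).
results = {
    "GFFX": {"single": 1364, "multi": 3354, "theoretical": 4000},
    "9700": {"single": 1599, "multi": 2475, "theoretical": 2600},
}


def efficiency(card, test):
    """Measured score as a fraction of the card's theoretical figure."""
    return results[card][test] / results[card]["theoretical"]


for card in results:
    for test in ("single", "multi"):
        print(f"{card} {test:6s}: {efficiency(card, test):.1%}")
```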
TheMightyPuck
12-Feb-2003, 07:12
4x2
Althornin
12-Feb-2003, 07:30
GFFX bandwidth - 15.6Gbytes/sec
R9700 Bandwidth - 19.8Gbytes/sec
15.6/19.8 = .79
.79 * 1599 = 1262.
This is very close to actual GFFX score.
Maybe the real problem is, single texturing is bandwidth limited on both cards.
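The scaling argument can be reproduced as below. It assumes, as the post does, that single texturing scales linearly with memory bandwidth; carrying the ratio at full precision gives a slightly different figure than rounding it to .79 first.

```python
gffx_bandwidth = 15.6    # GB/s, as quoted above
r9700_bandwidth = 19.8   # GB/s, as quoted above
r9700_single_tex = 1599  # measured 9700 single texture score
gffx_single_tex = 1364   # measured GFFX single texture score

# Predict the GFFX score by scaling the 9700 score with the bandwidth ratio.
ratio = gffx_bandwidth / r9700_bandwidth
predicted_gffx = ratio * r9700_single_tex

print(f"bandwidth ratio: {ratio:.3f}")
print(f"predicted GFFX:  {predicted_gffx:.0f}")
print(f"measured / predicted: {gffx_single_tex / predicted_gffx:.2f}")
```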
Maybe the real problem is, single texturing is bandwidth limited on both cards.
Bingo. That's why Dave started this thread off by asking someone with a GFfx to test in 16-bit color.
The fillrate tests in 3dMark03 are the same as they were in 3dMark01. Except the colors are a bit more pastel. But amazingly that doesn't seem to affect performance too much...
Althornin
12-Feb-2003, 07:57
Maybe the real problem is, single texturing is bandwidth limited on both cards.
Bingo. That's why Dave started this thread off by asking someone with a GFfx to test in 16-bit color.
The fillrate tests in 3dMark03 are the same as they were in 3dMark01. Except the colors are a bit more pastel. But amazingly that doesn't seem to affect performance too much...
I know, it was said with sarcasm.
Simon F
12-Feb-2003, 09:46
GFFX Single Texture - 1364
GFFX Multi Texture - 3354
Theoretical - 4000
9700 Single Texture - 1599
9700 Multi Texture - 2475
Theoretical - 2600
Neither seem to be living up to their theoretical potential.
Does that really surprise you? That's nearly always been the case with the exception of Kyro.
CosmoKramer
12-Feb-2003, 09:56
GFFX Single Texture - 1364
GFFX Multi Texture - 3354
Theoretical - 4000
9700 Single Texture - 1599
9700 Multi Texture - 2475
Theoretical - 2600
Neither seem to be living up to their theoretical potential.
Does that really surprise you? That's nearly always been the case with the exception of Kyro.
..or the GeForce DDR.
2475/2600 == 95%. I'd call that reasonable efficiency :)
Dave Baumann
21-Feb-2003, 15:16
Single texturing fillrate, 1024x768x16, 16bit Z, compressed textures:
1917 Mpps
Hmm... do you have a multitextured throughput figure for the same config?
MuFu.
Joe DeFuria
21-Feb-2003, 15:45
1917 Mpps
Wow...that's very efficient....for a card that has four pipes for pixel filling. ;)
JF_Aidan_Pryde
21-Feb-2003, 15:58
So what is our current conclusion? Any data that doesn't support the 4x2 view? If Dave's last score is NV30, then it doesn't fit the bill. (?)
antlers
21-Feb-2003, 17:00
So what is our current conclusion? Any data that doesn't support the 4x2 view? If Dave's last score is NV30, then it doesn't fit the bill. (?)
If it was 4x2, max single texturing fill rate would be 1600. 1534 is below that, so it is consistent with 4x2.
Dave Baumann
21-Feb-2003, 17:10
Well, it's an Ultra, so max single would be 2000, but, yes, it's still under that. When I get a chance I'll see if I can underclock the core but keep the memory up and see what happens then.
JF_Aidan_Pryde
21-Feb-2003, 17:10
If it was 4x2, max single texturing fill rate would be 1600. 1534 is below that, so it is consistent with 4x2.
1600?
500MHz x 4 Pipes = 2000MP/s
And where did you get 1534? Dave posted a score just under 2000, at 19xx, it's TOO efficient for single texturing. Hugely different from Brent's numbers.
Dave,
Have you tried at 1600x1200? Fillrate goes up a tad at max res. And any clue why your score is so much higher than Mulciber's? (I'm guessing you are using 3DMark01 and he's using 03, so you have full 16bit options)
In the 3dmark03 postmortem Dave linked to an OpenGL.org thread.
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/008757-3.html
Here is the relevant post.
Originally posted by pixelpipes:
Normally I have Z test disabled, but Z write enabled, and of course also color write.
Enabling Z test will invoke the 'early out' tests, which are done per tile, thus screwing the measurement.
I tried it with Z write DISabled, and the result is the same. (equivalent to NV25 with appropriate GPU clock ratio boost)
If you are hinting at memory bandwidth limitation, I don't see the logic here.
With 1GHz memory and 128 bit bus, you have 4 Gpix/sec if you are writing either only RGBA
(32 bit) or only stencil/z (24+8). But disabling Z write didn't increase performance.
But here is the strange thing:
With color write DISabled, Z write ENabled, and stencil test that does both read and write, the performance doubles. (glStencilFunc(GL_NOTEQUAL,0,-1);glStencilOp(GL_INCR_WRAP_EXT,GL_KEEP,GL_INCR_WRAP_EXT))
I have no explanation for this. Do you?
Is it some special optimization intended for the stencil shadow path?
The following "conclusions" are speculation. This data makes it seem like Nvidia has a hybrid 8 pipeline architecture intended to accelerate non-color write scenarios like shadow volumes. So technically they can call it an 8 pipeline architecture without lying, but it's not a true 8 pipeline design.
This might just be reusing the extra z test units inserted for anti-aliasing for the case of improving shadow volume performance. If true, this means that anytime 4x AA is enabled the GFFX operates as a 4 pipeline design even when rendering shadow volumes. This also assumes that the GFFX has 4 z test units per pipe as the GF4 did.
JF_Aidan_Pryde
21-Feb-2003, 17:46
Another observation:
3DMark2003 gives lower scores to fillrate tests than 3DMark2001.
On my Radeon 8500LE:
3DMark2003 - 1600x1200@32bit Single texturing = 654.4MP/s
3DMark2001 - 1600x1200@32bit Single texturing = 750.2MP/s
[Max Theoretical for 250x4 = 1000MP/s]
At best I've managed 976.7/1000MP in 3DMark01 using 16bit and compressed textures. That's about 97% efficient. By underclocking the core, I've managed to get 99.26% efficiency in single texturing.
EDIT: My model is LE and running at 250, corrected
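The efficiency figures here are just measured fill rate over theoretical peak. A small sketch, assuming the 1000MP/s peak (250 MHz x 4 pipes, per the EDIT) applies to all three runs; the helper name is hypothetical:

```python
def efficiency_pct(measured_mpix, theoretical_mpix):
    """Measured fill rate as a percentage of the theoretical peak."""
    return 100.0 * measured_mpix / theoretical_mpix


THEORETICAL_MPIX = 250 * 4  # 250 MHz core x 4 pipes, per the EDIT above

runs = {
    "3DMark2003 1600x1200@32bit": 654.4,
    "3DMark2001 1600x1200@32bit": 750.2,
    "3DMark01 16bit, compressed textures": 976.7,
}

for name, mpix in runs.items():
    print(f"{name}: {efficiency_pct(mpix, THEORETICAL_MPIX):.2f}% of peak")
```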
Chalnoth
21-Feb-2003, 18:06
This might just be reusing the extra z test units inserted for anti-aliasing for the case of improving shadow volume performance. If true, this means that anytime 4x AA is enabled the GFFX operates as a 4 pipeline design even when rendering shadow volumes. This also assumes that the GFFX has 4 z test units per pipe as the GF4 did.
Of course that would also allow the FX to potentially look like a 16-pipe design when stencil ops were enabled. I doubt this is the case.
Regardless, it does beg the question, but I think that it's obvious there are gross inefficiencies in the FX with current drivers. These issues may well just be due to the drivers not fully-exposing the performance of the architecture. nVidia has officially stated that the FX has 8 fully-functional pixel pipelines.
BoddoZerg
21-Feb-2003, 18:09
nVidia has officially stated that the FX has 8 fully-functional pixel pipelines.
They also officially stated that the FX could do antialiasing for free, at "virtually all resolutions".
martrox
21-Feb-2003, 18:29
And would be on store shelves by Christmas..... 2002!
JF_Aidan_Pryde
21-Feb-2003, 18:37
Okay.
Here's what I did. I underclocked my 8500 to 250/250 which gives exactly 1GP/s of single texture fillrate and 8GB/s of bandwidth.
In case you wonder why those numbers, they are exactly half of the NV30 in bandwidth and fillrate. Taking into account that the R200 and NV30 both use a 128-bit memory interface, efficiency should be on par.
Results:
3Dmark03: 1600x1200x32 = 611.8MP/s
3DMark01: 1600x1200x16 = 944.9MP/s
If you double my results, (something like an NV30)
3DMark03 32bit should = 1223.6MP/s
3Dmark01 16bit should = 1889.8MP/s
The results are on par with both Dave and the previous post on NV30. The bandwidth is not limiting it. As you underclock the core, you will find the fillrate asymptotically approaches 2000MP/s. This behaviour was tested on my R200 until I hit 99%+ efficiency. I am quite convinced for current rendering methods it is effectively a 4x2 arrangement.
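The projection in this post can be written out as a short sketch. It assumes, as the post does, that fill rate scales linearly when both core clock and bandwidth are doubled; the function names and the 4-pipe versus 8-pipe comparison are only illustrative.

```python
NV30_CORE_MHZ = 500  # 5800 Ultra core clock
SCALE = 2.0          # the underclocked 8500 has half the NV30's fill rate and bandwidth


def project(measured_mpix, scale=SCALE):
    """Scale a measured R200 fill rate up to NV30 clocks (assumes linear scaling)."""
    return measured_mpix * scale


def utilisation_pct(projected_mpix, pipes, core_mhz=NV30_CORE_MHZ):
    """Projected fill rate as a percentage of a pipes-wide theoretical peak."""
    return 100.0 * projected_mpix / (pipes * core_mhz)


measured = {"3DMark03 1600x1200x32": 611.8, "3DMark01 1600x1200x16": 944.9}

for test, mpix in measured.items():
    projected = project(mpix)
    print(f"{test}: projected {projected:.1f} MP/s, "
          f"{utilisation_pct(projected, 4):.1f}% of a 4-pipe peak, "
          f"{utilisation_pct(projected, 8):.1f}% of an 8-pipe peak")
```

If a real 8-pipe part with the same bandwidth landed near the 4-pipe peak in 16-bit mode, that would point at the pipeline count rather than memory as the limit, which is the argument being made.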
Single texturing fillrate, 1024x768x16, 16bit Z, compressed textures:
1917 Mpps
From Digit-life, using their RightMark3D test
( http://www.digit-life.com/articles2/gffx/gffx-ref-p3.html )
Pixel fillrate test for GFX 5800 Ultra ;
1600x1200x32 bpp, no texture : 1957,4 Mpix/sec
1280x1024x32 bpp, no texture : 1937,5 Mpix/sec
1600x1200x32 bpp, one bilinear texture : 1848 Mpix/sec
1280x1024x32 bpp, one bilinear texture : 1841,2 Mpix/sec
And then using 2 and 4 256x256 textures;
1280x1024x32 bpp, two textures : 3147 Mtex/sec
1280x1024x32 bpp, four textures : 3178 Mtex/sec
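Dividing the Digit-life figures by the core clock gives results per clock, which is the number that matters for the 4x2 versus 8x1 question. A quick sketch, assuming the 5800 Ultra was running at its 500 MHz core clock for these tests; the helper names are made up:

```python
CORE_MHZ = 500  # assumed 5800 Ultra core clock for these runs


def parse_decimal_comma(text):
    """Convert a decimal-comma figure such as '1957,4' to a float."""
    return float(text.replace(",", "."))


def per_clock(rate_millions_per_sec, core_mhz=CORE_MHZ):
    """Pixels (or texels) produced per core clock."""
    return rate_millions_per_sec / core_mhz


results = {
    "1600x1200x32, no texture": "1957,4",
    "1280x1024x32, no texture": "1937,5",
    "1600x1200x32, one bilinear texture": "1848",
    "1280x1024x32, one bilinear texture": "1841,2",
    "1280x1024x32, two textures (texels)": "3147",
    "1280x1024x32, four textures (texels)": "3178",
}

for name, raw in results.items():
    print(f"{name}: {per_clock(parse_decimal_comma(raw)):.2f} per clock")
```

The pixel tests stay under 4 per clock while the texel tests go well above 4, which is what a 4x2 layout would produce.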
Oh well, here we go :
http://www.theinquirer.net/?article=7920
After all, 600 euros for a 400 MHz card that's actually 4x2 isn't that bad :twisted:
Nebuchadnezzar
21-Feb-2003, 20:52
On PC Games Hardware (German magazine) they have a technology review of the FX. At one point they show the die of the FX compared to the GF4; on the GF4 they say '4 Pipelines, 2 TMUs', and on the FX: '32 Floating Point Pixel Shader ALUs (8 virtual pixel pipelines with a total of 8 virtual TMUs)'.
martrox
21-Feb-2003, 21:17
While I didn't see the 4X2 thing coming, I have to admit I'm even more sure of my position on nVidia:
http://www.beyond3d.com/forum/viewtopic.php?t=4080
I'm still floored by the complete disaster the GFFX is.....
So technically they can call it an 8 pipeline architecture without lying, but it's not a true 8 pipeline design.
This might just be reusing the extra z test units inserted for anti-aliasing for the case of improving shadow volume performance. If true, this means that anytime 4x AA is enabled the GFFX operates as a 4 pipeline design even when rendering shadow volumes. This also assumes that the GFFX has 4 z test units per pipe as the GF4 did.
I disagree that technically they can call it an 8 pipeline architecture, because colour computation (maybe even based on texture data) is an essential part of the 3D rendering pipeline.
One could advertise the 8 z-/stencil calculations per clock as a special feature, like single cycle multitexturing is a special feature of a pipeline or like Z32 is a special feature of an architecture.
Nebuchadnezzar
21-Feb-2003, 22:02
Bwhahahah, ram your sig owns! :lol:
Psikotiko
21-Feb-2003, 22:20
I smell big trouble for nVidia with this investigation.
Speaking about nvidia, digit-life has an article about the DAWN to DUSK Developer Conference & Quadro FX Presentation. It's a good read http://www.digit-life.com/articles2/gffx/nvidia-feb2k3.html
Also they had the future FX products with the Chaintech product list. Minutes after, the post disappeared.
Temporary Internet Files are a great Windows feature :D
First FX60 --> 64/128MB | DDR1 | 128bit | 300 gpu | 550 memory
Second FX60 --> 256MB | DDR1 | 128bit | 350 gpu | 500 memory
FX40 --> 64/128MB | DDR1 | 128bit | 250 gpu | 400 memory
If anyone wants that image....
First FX60 --> 64/128MB | DDR1 | 128bit | 300 gpu | 550 memory
Second FX60 --> 256MB | DDR1 | 128bit | 350 gpu | 500 memory
FX40 --> 64/128MB | DDR1 | 128bit | 250 gpu | 400 memory
If anyone wants that image....
http://www.muropaketti.com/uutiskuvat/2003/0220fx.gif :P
Psikotiko
21-Feb-2003, 22:33
First FX60 --> 64/128MB | DDR1 | 128bit | 300 gpu | 550 memory
Second FX60 --> 256MB | DDR1 | 128bit | 350 gpu | 500 memory
FX40 --> 64/128MB | DDR1 | 128bit | 250 gpu | 400 memory
If anyone wants that image....
http://www.muropaketti.com/uutiskuvat/2003/0220fx.gif :P
...And i thought i was the only one :D
With this roadmap and the constant lies, i don't see a bright future for nVidia in 2003. Maybe Nv35 will solve the problems (I hope)......
Looks like the Tech-Report got the real scoop from Nvidia themselves on this issue, it's sometimes 8x1 and other times not:
http://tech-report.com/news_reply.x/4782/
GeForce FX is 4 pipes by 2 texture units?
by Scott "Damage" Wasson - 05:42 pm, February 21, 2003
After seeing this report at The Inquirer, we asked NVIDIA for clarification: Does the GeForce FX have four rendering pipes with two texture units each or eight with one each, as the world was led to believe? Here's the answer we received from NVIDIA:
GeForce FX 5800 and 5800 Ultra run at 8 pixels per clock for all of the following:
a) z-rendering
b) stencil operations
c) texture operations
d) shader operations
For advanced applications (such as Doom3) *most* of the time is spent in these modes because of the advanced shadowing techniques that use shadow buffers, stencil testing and next-generation shaders that are longer and therefore make the apps "shading-bound" rather than "color fill-rate" bound.
Only color+Z rendering is done at 4 pixels per clock, all other modes (z, stencil, texture, shading) run at 8 pixels per clock.
The more advanced the application, the less percentage of total rendering is color, because more time is spent texturing, shading and doing advanced shadowing/lighting.
We will unpack this statement for you shortly, but it appears Fraud at The Inq got this one right.
Well the "clarification" would at least seem to imply that 8 different pixels can each execute a texture op in the same clock in the shader pipeline, which was an open question.
Still, I think it's incorrect to say that GFfx is "sometimes" 8x1. I would describe it this way:
fixed-function pipeline
4x2 when rendering to the framebuffer
8x0 when rendering to z/stencil buffer (not 8x1, no TMUs are used)
shader pipeline
capable of executing either 1 math op or 1 texture lookup for each of 8 pixels, in FP16 mode (presumably)
martrox
22-Feb-2003, 03:03
Can anyone answer this:
Just what the heck was nVidia thinking?
JF_Aidan_Pryde
22-Feb-2003, 03:31
Can anyone answer this:
Just what the heck was nVidia thinking?
That ATI will ship a dud at 0.15. I think that sums up their (old) thinking.
From Digit-life, using their RightMark3D test
( http://www.digit-life.com/articles2/gffx/gffx-ref-p3.html )
Pixel fillrate test for GFX 5800 Ultra [...] using 2 and 4 256x256 textures;
1280x1024x32 bpp, two textures : 3147 Mtex/sec
1280x1024x32 bpp, four textures : 3178 Mtex/sec
So the idea is lowering it to 10x7x16 might bring multi-texturing closer to the theoretical max of 500x4x2 = 4000?
Wavey/Rev, let's see if either of you can finagle an interview out of nVidia after this. ;)
That ATI will ship a dud at 0.15. I think that sums up their (old) thinking.
Prior to R300 shipping last year, I don't think nVidia gave ATI a whole lot of thought. Since September that has of course all changed dramatically. nv25 was obviously nVidia's only reference with respect to nv30. Now ATI's leapfrogged 'em by about a year, nVidia will no longer be measuring itself by its own chips (which has got to be a good thing for them--give them a goal to shoot for since they were obviously so befuddled in thinking about "what comes next" after nv25.)
What I still want to see is a clock-for-clock comparison with a Ti4600 product. Bet that'd be interesting, to compare them both running at 300MHz.
Nagorak
22-Feb-2003, 04:16
What I still want to see is a clock-for-clock comparison with a Ti4600 product. Bet that'd be interesting, to compare them both running at 300MHz.
Also if they adjusted the memory speeds so that they were equal (ex: downclocking R9700 memory to 250, and setting GF FX to 500, or 200/400).
Also if they adjusted the memory speeds so that they were equal (ex: downclocking R9700 memory to 250, and setting GF FX to 500, or 200/400).
You kind of lost me...I was talking about comparing nVidia's nv25 to nVidia's nv30, at the same clockspeeds (of course you'd have to make some ram speed adjustments to even it out as well.) What does the 9700 have to do with it?
Ichneumon
22-Feb-2003, 04:45
Prior to R300 shipping last year, I don't think nVidia gave ATI a whole lot of thought. Since September that has of course all changed dramatically. nv25 was obviously nVidia's only reference with respect to nv30. Now ATI's leapfrogged 'em by about a year, nVidia will no longer be measuring itself by its own chips (which has got to be a good thing for them--give them a goal to shoot for since they were obviously so befuddled in thinking about "what comes next" after nv25.)
So why was Nvidia so befuddled about what comes next and ATI seems to have had a very clear idea about it? Sounds like there were simply some very forward thinking ideas in the FX that were just not feasible in the time-frame required for Nvidia to get a card to market. (and what they brought to market makes those forward thinking ideas look like arse because they were incomplete or failed implementations).
ATI seems to have been very good over the last couple generations about gauging what they need to do to take the individual steps between product/technology generations, and it doesn't sound like that will change with the R400.
Both 3dfx and Nvidia milked essentially a single core (simply adding relatively minor technology evolutions each generation) for way, way long... and it appears they have both paid for it in their own way.
So why was Nvidia so befuddled about what comes next and ATI seems to have had a very clear idea about it? Sounds like there were simply some very forward thinking ideas in the FX that were just not feasible in the time-frame required for Nvidia to get a card to market. (and what they brought to market makes those forward thinking ideas look like arse because they were incomplete or failed implementations).
Well, that's the thing about "forward-looking" ideas, though, isn't it? If you can't turn them into practical realities, there's a good chance that not only were they not "forward-looking" in the first place, they may have just been plain wrong-headed to begin with. I certainly wouldn't say that ATI's ideas were any less "forward-looking" simply because ATI was able to turn them into silicon and ship them, would you?
The point here is that the R300 seems to me a very solid architecture with several advanced ideas not equalled by nv30, that's all. And yes if I was to interpret your statement as meaning nVidia's reach exceeded its grasp with respect to nv30, you'd have no argument from me. Possibly we might disagree about the value of that kind of a situation, though....as it seems to me the point of designing and building a 3D chip is for both reach and grasp to result in a solid product, which is what I think ATI accomplished with R300 but nVidia missed with nv30.
ATI seems to have been very good over the last couple generations about gauging what they need to do to take the individual steps between product/technology generations, and it doesn't sound like that will change with the R400.
I hope ATI will follow that course, myself. If you develop a good architecture for building, then build on it, right? The point here is that neither company is following a precedent, which is why the odds are always evened out when new architectures are involved, regardless of either company's past successes or failures with previous architectures. There's no "rule book" for designing 3D chips which incorporate brand new features and rendering concepts, and as such these companies are writing the rule book as they go, from chip to chip and generation to generation. That's one of the things I think is so cool about it all--today's "leader" can easily become tomorrow's "follower."
Both 3dfx and Nvidia milked essentially a single core (simply adding relatively minor technology evolutions each generation) for way, way long... and it appears they have both paid for it in their own way.
All chip manufacturers milk, unfortunately. Indeed, if not for AMD we'd all still be guzzling *slowly* from Intel's teats...;) I think that nVidia's mistake here was that it prematurely assumed it had some sort of "position" in the market which it never really had on a permanent basis. But now that nVidia knows it is facing some real and determined competition in this market we'll see what develops. We already have a pretty clear picture of not only ATI's goals as a company but also of its capabilities and competence--we know what they can do when they want to. Now that nVidia knows what it must do we will see whether it can, which ought to make for pretty interesting times to come in the next year....;)
JF_Aidan_Pryde
22-Feb-2003, 05:29
Both 3dfx and Nvidia milked essentially a single core (simply adding relatively minor technology evolutions each generation) for way, way long... and it appears they have both paid for it in their own way.
Milking has been very useful for both companies. Every time they shoot for something ground up, they end up shooting themselves in the foot. This has been the case for pretty much everyone - except for ATI. Which leads me to question if the R100->200->300 is related in any way.. (I know ATI's official line is all different chips from different teams hmm)
Ichneumon
22-Feb-2003, 05:49
Both 3dfx and Nvidia milked essentially a single core (simply adding relatively minor technology evolutions each generation) for way, way long... and it appears they have both paid for it in their own way.
Milking has been very useful for both companies. Every time they shoot for something ground up, they end up shooting themselves in the foot. This has been the case for pretty much everyone - except for ATI. Which leads me to question if the R100->200->300 is related in any way.. (I know ATI's official line is all different chips from different teams hmm)
I would say that "except for ATI" applies only since the introduction of the Radeon DDR.
That bit about ATI was what I was suggesting in my previous post. Of course ATI has milked... perhaps more than any other company back with the RagePro... Hell, that may Still have been the single most sold add-in board graphics chip of all time. ATI paid a dear price for that milking and survived though.
Since the release of the Radeon DDR (now 7xxx generation), ATI has made significant leaps in speed and technologies each generation (from 7xxx to 8xxx to 9xxx). Some were evolutionary developments, some were quite a bit more significant. The 9xxx generation is definitely the culmination of a lot of ideas that were introduced with the Radeon DDR... but is pretty much revolutionary in terms of tech/speed because of how everything came together. Since the Radeon DDR, ATI has successfully executed time and time again. The 8500 was "late to the party" but they did it right, and that appears to have paid off in the long run with the base it gave them to jump to the 9700.
Can anyone answer this:
Just what the heck was nVidia thinking?
Probably they were thinking that 256 bits to/from DRAM every clock is almost never going to be enough to render more than 4 fixed-function pixels under any remotely reasonable circumstances. That the only useful reasons for more "pixel power" than a 4x2 provides given that bandwidth budget are for z/stencil passes and shaders.
And guess what: they were right. Think for a second about the fact that the only way we could even think to test this question is by running an already very unrealistic synthetic fillrate test in 16-bit mode. So what should they have made it 8x1 for? Better scores in 3DMark01's synthetic tests at 16-bit??
Think what it would take for an NV30 to ever write 5 pixels in a single clock. Once you concede that no one wants to use anything less than a 32-bit framebuffer, you've already spent over half your (theoretical) bandwidth budget just to write the results. So now you've got to render 5 pixels using only 19 bits apiece. Presumably you want to read from and write to the z-buffer? Well, do either and there goes your entire bandwidth budget, and you haven't even read a color map!
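The "19 bits apiece" figure can be checked with a few lines. This is only a sketch of the budget argument, assuming 256 bits cross the memory bus per clock (128-bit DDR with core and memory at the same clock) and that every pixel costs one 32-bit colour write; the function name is hypothetical.

```python
BITS_PER_CLOCK = 256  # 128-bit DDR bus: two 128-bit transfers per clock


def leftover_bits_per_pixel(pixels_per_clock, colour_bits=32,
                            bits_per_clock=BITS_PER_CLOCK):
    """Bits left per pixel for z, stencil and texture traffic after colour writes."""
    remaining = bits_per_clock - pixels_per_clock * colour_bits
    return remaining / pixels_per_clock


for n in (4, 5, 8):
    spent = 100.0 * n * 32 / BITS_PER_CLOCK
    print(f"{n} pixels/clock: {spent:.1f}% of the budget on colour writes, "
          f"{leftover_bits_per_pixel(n):.1f} bits left per pixel")
```

At 8 pixels per clock the colour writes alone consume the whole budget, so a single 32-bit z read or write per pixel is already impossible at 5.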
But what about all those "bandwidth saving" technologies? Well someone correct me if I'm wrong, but those all occur before the pipeline, so when Hier Z. or early-z stops a pixel from being rendered, another can take its place. (Unsure about early-z, but think so.) Hence they actually make it harder for the chip to fire on >4 cylinders, considering they consume some bandwidth themselves.
Bottom line: this is a smart design decision.
The only problems are that:
1) It still has fewer FP16 shader resources per clock than R300 has FP24 shader resources per clock, albeit more resources than from a "traditional 4x2 design extended to pixel shading" (whatever the hell that means).
2) It's an absolute, outright, baldfaced lie to claim such a design has "8 pixel pipelines". If it doesn't have the hardware to write n pixels every clock, then it doesn't have n pixel pipelines, and no, z/stencil values are not "pixels".
THe_KELRaTH
22-Feb-2003, 07:25
Any idea when we might see some FP16 tests?
I'm assuming that mode isn't working on the NV30 just now as the FP16 shader tests results are identical to FP32.
Nagorak
22-Feb-2003, 07:37
Also if they adjusted the memory speeds so that they were equal (ex: downclocking R9700 memory to 250, and setting GF FX to 500, or 200/400).
You kind of lost me...I was talking about comparing nVidia's nv25 to nVidia's nv30, at the same clockspeeds (of course you'd have to make some ram speed adjustments to even it out as well.) What does the 9700 have to do with it?
Sorry, I guess I left out a sentence there somehow. It would be nice to compare it clock for clock to either an R9700 with adjusted memory clocks, or an R9500 Pro (also with memory being equal). Unless that's already been done somewhere, in which case I'd appreciate a link.
Bottom line: this is a smart design decision.
The chip runs slower than the R300 even though it is clocked 200 MHz faster. Obviously just using a stock 8x1 setup is better, so they gained nothing with this nonsensical decision. I guess I understand what you're saying: it makes sense given the chip's other limitations (lack of memory bandwidth), but given how long it took to come out with this chip you'd think they could have altered it so it wasn't so bandwidth starved.
The design is smart alright: just in the same way as military intelligence (and who knows where those faulty gas masks are now?). ;)
Chalnoth
22-Feb-2003, 07:37
1) It still has fewer FP16 shader resources per clock than R300 has FP24 shader resources per clock, albeit more resources than from a "traditional 4x2 design extended to pixel shading" (whatever the hell that means).
Where do you get this from? I've been very interested to see some synthetic benchmarks examining the pure processing power of the NV30...
Where do you get this from? I've been very interested to see some synthetic benchmarks examining the pure processing power of the NV30...
Well, there have been "pure shader" benchmarks posted, in the form of some Shadermark results Brent posted in the forums here (one of the threads just after NDA release), and some Rightmark results posted in the Digit Life review. Of course I don't know exactly what those shaders do (Shadermark is open source, though, IIRC), and they are all probably dubious considering the Rightmark results showed absolutely zero difference between FP16 and FP32 (!) and Brent's tests had NV30 doing even worse compared to R300 (losing by ~3:1 instead of ~2:1 margin) and were with older drivers.
So, some data exists, but IMO it's way too early to know if it represents hardware performance or huge driver problems or both.
But when I made my statement I was referring to Kirk's "it does some things at 16-per-clock" and assuming it meant the shader pipeline could do 1 math and 1 texture op per 8 pipes per clock, and comparing that to the R300's known resources (1 texture, 1 scalar and 1 vector per 8 pipes). That's probably a premature assumption on my part, although I would frankly find it difficult to square a more-capable-per-clock-than-R300 shader architecture with the admittedly oblique hints given by Nvidia thus far.
demalion
22-Feb-2003, 08:07
Dave H,
Hmm...comparing the 9500 Pro to the 9500 non Pro performance figures, I'm not so sure such a 16-bit synthetic test is the only opportunity for exceeding the capabilities of "4x?".
Does make sense for shader bound pixel processing, but AFAICS, what we've seen so far seems to indicate that either the nv30's shader processing or its capability to process pixels simultaneously is underperforming (as per your problem "1").
I'd say 4x? can be a smart decision for a 128-bit bus, but I don't think it is in the nv30...then again, the subtext for these evaluations is always "in comparison to the r300"...(seems to beat the heck out of the 8500 in performance, for example).
It's true that there isn't that much bandwidth to burn, because the FX has a 128 bit bus. It's just that when doing multipass algos, you could do all kinds of filters etc. that just write 8 pixels of 32 bits each out to the framebuffer. And there would be enough bandwidth to do this. So the GeForce FX is clearly inferior... And yes, all those shader tests really seem to be indicating that the FX is also just about half the speed of the Radeon 9700 when executing shaders. And that's rather sad.
I'd say 4x? can be a smart decision for a 128-bit bus, but I don't think it is in the nv30...then again, the subtext for these evaluations is always "in comparison to the r300"...(seems to beat the heck out of the 8500 in performance, for example).
I think 4x2 (or 4x?) provides one of the best explanations to what Kirk said long before R300 and NV30 were released about a 256 bit bus being overkill and a 128 bit being enough. A 128 bit bus makes perfect sense with a true 4x architecture but it can reasonably be considered low for an 8x. That's why they thought a 256 bit bus would make no sense at that point. They were talking about a 4x architecture from the start.
Chalnoth
22-Feb-2003, 09:03
Well, there have been "pure shader" benchmarks posted, in the form of some Shadermark results Brent posted in the forums here (one of the threads just after NDA release), and some Rightmark results posted in the Digit Life review. Of course I don't know exactly what those shaders do (Shadermark is open source, though, IIRC), and they are all probably dubious considering the Rightmark results showed absolutely zero difference between FP16 and FP32 (!) and Brent's tests had NV30 doing even worse compared to R300 (losing by ~3:1 instead of ~2:1 margin) and were with older drivers.
Yes, I've seen those, but I'd like to see some purely-synthetic tests, of, for example, calculations that you would never use in a real shader, but instead are designed to look at peak processing power. Those should nix any driver issues currently, and allow one to look at the peak possible performance of the NV30, in hopes that the drivers will catch up soon (I'll have to add my disclaimer yet again: when purchasing, don't ever purchase with hope of future improvements, but just looking at the card as it is at purchase time).
But when I made my statement I was referring to Kirk's "it does some things at 16-per-clock" and assuming it meant the shader pipeline could do 1 math and 1 texture op per 8 pipes per clock, and comparing that to the R300's known resources (1 texture, 1 scalar and 1 vector per 8 pipes).
I do wonder what that means, but he definitely implied that he was referring to outputting 16 "somethings" per clock. What rendering situation would allow this? I don't know. Perhaps basic bitmap copying to the framebuffer?
Evildeus
22-Feb-2003, 09:08
What I still want to see is a clock-for-clock comparison with a Ti4600 product. Bet that'd be interesting, to compare them both running at 300MHz.
At 275/270
http://www.hardware.fr/medias/photos_news/00/05/IMG0005783.gif
demalion
22-Feb-2003, 09:39
I'd say 4x? can be a smart decision for a 128-bit bus, but I don't think it is in the nv30...then again, the subtext for these evaluations is always "in comparison to the r300"...(seems to beat the heck out of the 8500 in performance, for example).
I think 4x2 (or 4x?) provides one of the best explanations to what Kirk said long before R300 and NV30 were released about a 256 bit bus being overkill and a 128 bit being enough.
Perhaps my statement wasn't clear...restated:
"I'd say 4x? can be a smart decision for a 128-bit bus, but I don't think it is a smart decision in the nv30..."
You seem to have taken me to be arguing against the idea that the nv30 is "4x?"...?
I do wonder what that means, but he definitely implied that he was referring to outputting 16 "somethings" per clock. What rendering situation would allow this? I don't know. Perhaps basic bitmap copying to the framebuffer?
FSAA
demalion
22-Feb-2003, 09:43
Geeze, we have benchmark results and we still have the same questions as months ago (your shader instruction benchmark suggestion is giving me a sense of Deja Vu, Chalnoth).
What I still want to see is a clock-for-clock comparison with a Ti4600 product. Bet that'd be interesting, to compare them both running at 300MHz.
At 275/270
http://www.hardware.fr/medias/photos_news/00/05/IMG0005783.gif
Tsk tsk, you didn't even put the source!
http://www.hardware.fr
Anyway, yeah, but I wouldn't look at the 4x AA / 8x AF results too much.
That's because nVidia's balanced AF, well, is amazingly inefficient... That's what was used in that benchmark. Sure, it can't be fixed via drivers, but it's one of the GFFX worst problems IMO.
What I'd like to see, however, is such a benchmark when using 2x AA & 4x AA without AF. Sure, the GFFX quality is slightly lower than the R300's quality, but it's perfectly comparable with the GF4.
Uttar
You seem to have taken me to be arguing against the idea that the nv30 is "4x?"...?
No, I understood you correctly, so the bogus English is mine :). I was also stating that 4x could explain the old Kirk comments. In fact I was agreeing with you.
Chalnoth
22-Feb-2003, 11:30
What I still want to see is a clock-for-clock comparison with a Ti4600 product. Bet that'd be interesting, to compare them both running at 300MHz.
At 275/270
(graph that I've removed and you can see a few posts above)
Note that all of these benchmarks were done at 1600x1200. The GeForce FX has very significant problems with this resolution with current drivers, for some reason. I really don't have a clue as to what could cause this (I thought I had an answer, but that no longer appears to be the case). Thus, a more interesting benchmark would be one with the FX at the same clocks as competing products, but at pretty much any other resolution (a full spectrum would be really nice).
Dave Baumann
22-Feb-2003, 11:32
Note that all of these benchmarks were done at 1600x1200. The GeForce FX has very significant problems with this resolution with current drivers, for some reason. I really don't have a clue as to what could cause this (I thought I had an answer, but that no longer appears to be the case).
Fill-rate graphs for the benchmarks I've run so far demonstrate a dip at 16x12 with 4x FSAA - this is usually characterised by textures needing to be addressed over the AGP bus...
Chalnoth
22-Feb-2003, 11:47
One other little note on the efficiency of the GeForce FX's anisotropic filtering.
It's taking more texture samples than the Radeon 9700 Pro for the majority of scenes. Unless both algorithms are providing the exact same output, you cannot say that one is more efficient than the other.
As a side note, I have some benchmarks where the Radeon 9700 Pro takes a greater performance hit than the GeForce4 Ti when anisotropic filtering is enabled. I basically set up a situation with no off-angle situations, and recorded the framerate. Very synthetic, I know, and I didn't bother to equalize the LOD, but now that I've actually gone back and looked at it (I actually did this some time ago, but it took a while for the results to sink in...they were very contrary to my previous assumptions), it really looks like the GeForce4 has quite an efficient implementation. Anyway, here are the numbers (normalized to 1):
(going down is increasing anisotropy, with 100% always 0x)
NV25:
Direct3D/OpenGL
100.00%/100.00%
67.95%/71.04%
40.36%/42.47%
27.00%/28.19%
R300:
Direct3D/OpenGL
100.00%/100.00%
57.94%/42.00%
35.12%/36.73%
21.03%/21.09%
12.70%/14.00%
A couple of things to note:
All surfaces were completely horizontal or vertical, most surfaces used high degrees of anisotropy (if any of you remember it, I just used my AnisoTest level for UT). This means that any low-degree inefficiencies (that apparently exist in Direct3D) in the GF4 aren't going to have much effect here.
Chalnoth
22-Feb-2003, 11:50
Fill-rate graphs for the benchmarks I've run so far demonstrate a dip at 16x12 with 4x FSAA - this is usually characterised by textures needing to be addressed over the AGP bus...
Right, that's what I was thinking. But I recently saw some 1600x1200x32 benches without FSAA that had the same dip. The problem with this is that all video cards with 128MB of RAM should be using the same amount of memory at this resolution, so I'm really not sure any more what's going on (I believe the test was 3DMark03, but I can't seem to find it quickly enough right now...).
Dave Baumann
22-Feb-2003, 11:55
No. All the benchmarks I've run so far show it just to be standard fill/bandwidth limited at 16x12, in that it still has a higher fill-rate at 16x12 than 12x10/12x9, but just not scaling as high proportionally. 16x12 w 4X has a lower fill-rate than 12x10/12x9, indicating AGP limitations.
I don't think there are any 'problems' here, just that it's running out of puff at 16x12 and is AGP limited at 16x12x4X.
DemoCoder
22-Feb-2003, 11:58
Perhaps the NV30 was designed to meet a different set of requirements, requirements that don't include running current or legacy games as fast as possible, but running future stencil-fill-limited scenes and long pixel shaders "fast". I would like to see this scenario tested.
This may have been a "dumb" move marketing-wise for gaming, but it may make them popular in the DCC/workstation/offline market. Of course, it's speculation.
However, if the chip is indeed 4x2, and if it indeed has less pixel FP power per clock, then what the hell are all those transistors for? I have a feeling that Nvidia spent an enormous amount of resources making the pixel pipeline general purpose (e.g., not a "fixed" combiner-like pipeline, but a true fetch from memory/decode/schedule/execute multistage FPU pipeline), and I heard that their dependent texturing supposedly uses some tricks to lower latency greatly.
This seems to have resulted in an overly complex pipeline that won't deliver performance advantages except in scenarios which aren't likely to be used in games (very complex pixel shaders). It may accelerate 3dStudio MAX preview, and it runs Dawn nice, but no game is going to run Dawn-like effects on the skin of all of its characters.
Of course, it's all complete speculation until we get better documentation of just what is going on. Perhaps the NV30 pipelines were designed in a non-traditional way, with too much tricky resource-sharing logic, and they may end up introducing even more bottlenecks from synchronization.
So let me get this straight: assuming the NV30 is indeed 4*2 rather than 8*1 then there is a definite performance deficit in single textured games. And what else?
Well, according to "pixelpipes", the OGL developer who raised the whole "4x2" issue in opengl.org's dev forums, it doesn't matter whether you have z-buffer or textures on or off. It seems that the FX ALWAYS writes RGB pixels to the framebuffer 4 pixels in parallel. Naturally this would mean that all the multipass algorithms will take a hit (compared to the 9700). Basically everything that touches RGB values in the frame buffer isn't done with "8 pipelines efficiency".
Can the GF FX write 8 16bit pixels per clock to the frame buffer?
Dave Baumann
22-Feb-2003, 14:54
Can the GF FX write 8 16bit pixels per clock to the frame buffer?
Check a couple of pages back!!
antlers
22-Feb-2003, 15:05
Humus' Mandelbrot demo is certainly a shader-compute rather than bandwidth limited benchmark. It would be obvious from the appearance if it is running at 16 bit or 32 bit precision. It should be easy enough to add half-precision hints to the shader assembly code file. Why not just try that on the FX and answer a lot of questions?
Well Dave I know your test indicates that it didn't but the question still stands, can it? Obviously outputting 8 32bit pixels per clock to the frame buffer is out of the question due to the 128bit memory bus but the 128bit memory should allow 8 16bit pixels per clock if the core can do it.
Dave Baumann
22-Feb-2003, 15:13
Noko, that was 16bit.
If you run the same test on a 9500 PRO then it gets more than 4 pixels per clock.
Dave, can you re-run the test but use 16bit textures instead of compressed? Maybe the drivers automatically limit output to four texels per clock to the frame buffer when texture reads are greater than 16bit. I know, it is a long shot, but something to rule out. In any case the 128bit memory bus will always limit the framebuffer writes to no more than 4 32bit pixels per clock. I just don't see a way around that. Now if we can show that the NV30 core can do more than 4 pixels/clock to memory, then a 256bit memory bus could improve its performance tremendously.
Dave Baumann
22-Feb-2003, 15:37
That was compressed textures - are you asking for it without compressed textures?
In any case the 128bit memory bus will always limit the framebuffer writes to no more than 4 32bit pixels per clock.
Remember, it's a DDR bus, so effectively it's twice as wide. At 16bit you should be able to get 8 16bit pixels and another 8 16bit ops per core cycle.
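The bus arithmetic behind that reply can be sketched roughly as below. This is only an upper-bound estimate: the function name is made up for illustration, the 500 MHz core / 500 MHz DDR memory clocks are assumed to be the 5800 Ultra's stock settings, and Z traffic, texture reads and real-world bus efficiency are all ignored.

```python
def pixel_writes_per_core_clock(bus_bits, bits_per_pixel, core_mhz, mem_mhz, ddr=True):
    """Upper bound on colour writes per core clock from raw bus width alone.

    Hypothetical helper: ignores Z, texture reads and bus efficiency.
    """
    transfers_per_mem_clock = 2 if ddr else 1
    # Bits the memory bus can move during one core clock.
    bits_per_core_clock = bus_bits * transfers_per_mem_clock * mem_mhz / core_mhz
    return bits_per_core_clock / bits_per_pixel


# Assumed clocks: 500 MHz core, 500 MHz DDR memory ("1000 MHz" effective).
for bpp in (16, 32):
    n = pixel_writes_per_core_clock(128, bpp, core_mhz=500, mem_mhz=500)
    print(f"{bpp}-bit colour: up to {n:.0f} writes per core clock")
```

With DDR counted, the 16bit case has room for 16 16bit accesses per core clock (the "8 pixels and another 8 ops" above), and even 32bit colour has room for 8; the limit of 4 32bit pixels only appears if the bus is treated as single data rate.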
Well, according to "pixelpipes", the OGL developer who raised the whole "4x2" issue in opengl.org's dev forums, it doesn't matter whether you have z-buffer or textures on or off. It seems that the FX ALWAYS writes RGB pixels to the framebuffer 4 pixels in parallel. Naturally this would mean that all the multipass algorithms will take a hit (compared to the 9700). Basically everything that touches RGB values in the frame buffer isn't done with "8 pipelines efficiency".
No, not all multipass algorithms will take a hit.
If it is a 4x2 architecture, all shaders using an even number of textures/arithmetic operations will run just as fast as on an 8x1 architecture.
If it is an 8x1 architecture with an output limit of 4 RGB values per clock, all shaders using more than one texture/arithmetic operation will run just as fast as on a full 8x1 architecture.
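A toy throughput model makes the two cases above easier to compare. It is a sketch under simplifying assumptions (one texture op per TMU per clock, no bandwidth limits, no loopback penalties), and the model names are labels invented here, not anything from NVIDIA or ATI documentation.

```python
from fractions import Fraction


def pixels_per_clock(model, textures):
    """Idealised pixels completed per clock for a given texture count."""
    if textures < 1:
        raise ValueError("need at least one texture/op per pixel")
    if model == "8x1":
        # 8 pipes, 1 TMU each: each pixel occupies its pipe for `textures` clocks.
        return Fraction(8, textures)
    if model == "4x2":
        # 4 pipes, 2 TMUs each: two textures per clock, so ceil(textures / 2) clocks.
        clocks = -(-textures // 2)
        return Fraction(4, clocks)
    if model == "8x1_4out":
        # 8x1 shader core, but at most 4 colour writes leave the chip per clock.
        return min(Fraction(8, textures), Fraction(4))
    raise ValueError(f"unknown model: {model}")


for n in range(1, 5):
    row = ", ".join(
        f"{m}: {float(pixels_per_clock(m, n)):.2f}" for m in ("8x1", "4x2", "8x1_4out")
    )
    print(f"{n} texture(s) -> {row}")
```

In this model the two hypotheses only differ at odd texture counts above one: with three textures, a 4x2 part drops to 2 pixels per clock while an output-limited 8x1 still manages 8/3. That is why a triple-texture fillrate test is a useful discriminator.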
Even the NV20/25 class hw can output 8 pixels/cycle with colorwrites disabled and multisampling 4x enabled.
ciao,
Marco
Well the "clarification" would at least seem to imply that 8 different pixels can each execute a texture op in the same clock in the shader pipeline, which was an open question.
According to Geoff Ballew (http://www.beyond3d.com/previews/nvidia/nv30launch/index.php?p=2) it can even execute 16 texture address ops per clock, so the 8 texture fetch units can be used very efficiently.
MDolenc
22-Feb-2003, 17:29
Even the NV20/25 class hw can output 8 pixels/cycle with colorwrites disabled and multisampling 4x enabled.
No it can't. Just tested... :wink:
Have you tested that on a XBOX? ;)
...This seems to have resulted in an overly complex pipeline that won't deliver performance advantages except in scenarios which aren't likely to be used in games (very complex pixel shaders). It may accelerate 3dStudio MAX preview, and it runs Dawn nice, but no game is going to run Dawn-like effects on the skin of all of its characters.
Of course, it's all complete speculation until we get better documentation of just what is going on. Perhaps the NV30 pipelines were designed in a non-traditional way, with too much tricky sharing logic of resources, they may end up introducing even more bottlenecks from synchronization.
I think you've put your finger on the pulse here--at least, I've had similar questions myself. It's all well and good to tout numbers of instructions, and 128-bit fp precision, except when you want to talk about 3D gaming performance--in which case these can actually become negative factors rather than enhancing features. I think it may be clear that nVidia felt the "traditional" performance of this design would be sufficient to hold its own against all comers and the increased fp precision and instruction ceilings would serve to enhance the product on the "professional" side (Quadro), and that this is really where nVidia's entire "film cannister" concept comes from. Dawn is clearly aimed at the Quadro base, not the Doom III base, IMO. However, I believe nVidia was shooting for both bases and felt pretty certain of the efficacy of the chip for both markets prior to R300 shipping (at which time the grotesque Dustbuster ideas emerged.)
If it is true that we've all been led a merry chase on the 8x1 physical pipelines, then much of my own earlier hypotheses will have been wrong. That is, I felt that nVidia would have felt that 8x1 at ~300-400MHz would have been plenty of power. If we are dealing with an actual 4x2 situation, instead, then it becomes obvious that nVidia was even more dependent on factors in the manufacturing process than I'd imagined, and much more dependent on factors out of its control such as low-k and a good .13 process which would have allowed them a ramp to ~500MHz, or even higher. If the architecture is much closer to 4x2 than 8x1, I would then tend to think the 500MHz target would have been something nVidia was shooting for from the start as opposed to a later development brought about by the need to compete with R300, and that problems with the process and low-k were the major culprits here (although the actual role low-k might've played here is still unclear to me, though doubtlessly it would have helped to some, unknown degree.) Thus it would appear that the Dustbuster was in fact not manufactured in a knee-jerk response to R300, but was simply an ill-conceived and belated attempt by nVidia to get as much out of the chip as it had originally planned as might be possible, even though a great portion of that performance was not architecturally based, but a product of clockspeed brought about by a better .13 process which included a successful low-k application (both of which were to a large extent beyond nVidia's control--um, which in no way excuses them for making a design this dependent on 3rd-party manufacturing to begin with.)
But at least for me, proper interpretation of the data is dependent on understanding the true nature of the physical "pipeline" used in nv30. I could see them using 4x2 (or some derivative) only in the instance of a comparatively very high clockspeed.
But this approach is still not as neat as I'd like and also introduces problems relating to other efficiencies or inefficiencies in the overall logic of the chip. At this point I'm not sure how well even nVidia understands these issues, as it seems they took a back seat to a theoretical clockspeed made possible by the manufacturing process they chose.
rlathan
22-Feb-2003, 18:22
I was just looking through the product specs of the QuadroFX and found this little tidbit stating "Eight Fully programmable pixel pipelines."
http://www.nvidia.com/view.asp?IO=IO_20030117_6779
The NVIDIA Quadro FX architecture takes application performance to new levels by featuring three parallel vertex engines, a radically new line engine, the industry’s first on-chip vertex cache, and eight fully programmable pixel pipelines coupled to a high-speed DDR2 graphics DRAM bus. Graphics pipeline efficiency is magnified by NVIDIA’s next-generation crossbar memory architecture, enabling occlusion-culling, lossless depth Z-buffer and color compression
So, wouldn't this also apply to the Geforce FX??
Single texturing fillrate, 1024x768x16, 16bit Z, compressed textures:
1917 Mpps
Radeon 9700 Pro at 324/620, 3dMark2001se 330 fill rate test results:
1600 x 1200 x16 16Z
Single = 1234.4 texels/sec, which is also equal to the pixels/sec average written to the frame buffer.
Multi = 2451.7
640 x 480 x16 16Z
Single 1088.7
Multi 2451.7
The Radeon 9700 theoretical fillrate for single is 324MHz x 8 x 1 = 2592 pixels/sec, in which case the Radeon 9700 Pro is doing about 48% of theoretical fillrate when writing to the frame buffer. This is with a 256bit memory bus. At lower resolutions, where you have more page flipping and memory reads/writes, you get even less of theoretical.
Yet the GF FX at 500MHz, with the numbers you gave (1917 texels/sec = 1917 pixels/sec to the frame buffer in the single texture test), would be doing 96% efficiency if it were a typical 4x2 pipe setup, or 1917/4000 theoretical = 48%, a very similar efficiency to the Radeon 9700 Pro, which is a known 8x1 pipe design.
This indicates to me that the NV30 is not a 4x2 design, and I don't believe it is a straight 8x1 design either; more of a hybrid.
Dave, can you rerun the tests (I know, I know, a lot of test requests) at 1600 x 1200 x 16 16Z? The reason being that if you can get the number above 2000 pixels/sec in the frame buffer, you have just proved it is not a 4 x 2 pixel pipeline.
Oh yeah, writing to memory is always slower than reading from memory, so even being DDR doesn't mean that the full bandwidth is available, especially when writing to memory, which is what is being shown here; in fact what is being shown is that it is about 50%.
Dave Baumann
22-Feb-2003, 19:10
1600 x 1200 x16 16Z
Single = 1234.4 texels/sec, which is also equal to the pixels/sec average written to the frame buffer.
Multi = 2451.7
There's something wrong there, I'm getting more than that on a 9500 PRO!
Edit: Also get more than that in 32 bit
http://www.beyond3d.com/reviews/powercolor/radeon9700/index.php?p=3#bench
Ilfirin
22-Feb-2003, 19:13
Aye, with those settings (noko's) on my 9700Pro I am getting 2231.1 for single texturing, 2461.5 for multi.
vs the theoretical 2600.
Dave Baumann
22-Feb-2003, 19:15
Radeon 9500 PRO, Theoretical Fillrate = 2200Mpps / 2200Mtps
1600x1200x32, 24bit Z, Compressed textures:
Single Texturing: 936.1 (-57%)
MultiTexturing: 2184.0 (-1%)
1600x1200x16, 16bit Z, Compressed textures:
Single Texturing: 1617.2 (-26%)
MultiTexturing: 2196.3 (0%)
GeForce FX 5800 Ultra, Theoretical Fillrate = 4000Mpps / 4000Mtps
1600x1200x32, 24bit Z, Compressed textures:
Single Texturing: 1592.5 (-60%)
MultiTexturing: 3557.8 (-11%)
1600x1200x16, 16bit Z, Compressed textures:
Single Texturing: 1978.4 (-51%)
MultiTexturing: 3747.8 (-6%)
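For anyone wanting to reproduce the percentages in brackets, the sketch below recomputes them. It assumes the theoretical figures are simply core clock times eight (275 MHz for the 9500 PRO, 500 MHz for the 5800 Ultra), which matches the 2200 and 4000 quoted above; the function names are illustrative only.

```python
def theoretical_fillrate(core_mhz, units):
    """Theoretical Mpixels/s (or Mtexels/s): one result per unit per clock."""
    return core_mhz * units


def deviation_pct(measured, theoretical):
    """Measured rate relative to theoretical, as a rounded percentage."""
    return round((measured / theoretical - 1) * 100)


# (card, assumed core MHz, [32-bit single, 32-bit multi, 16-bit single, 16-bit multi])
results = [
    ("Radeon 9500 PRO", 275, [936.1, 2184.0, 1617.2, 2196.3]),
    ("GeForce FX 5800 Ultra", 500, [1592.5, 3557.8, 1978.4, 3747.8]),
]

for card, core_mhz, scores in results:
    peak = theoretical_fillrate(core_mhz, 8)
    deltas = [deviation_pct(s, peak) for s in scores]
    print(f"{card}: peak {peak}, deviations {deltas}")
```

The interesting row is 16bit single texturing: the 9500 PRO reaches about three quarters of its 8-pipe peak, while the FX stays at roughly half of the figure it would have as an 8x1 part.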
I'm currently trying to downclock the core, but the drivers are crashing on me now :?
I am glad I don't run a lab :oops:
I guess I am good at contributing to confusion as well :?
I guess it would help if I didn't have 4x AA and 8x AF Quality turned on in the drivers! oh boy. Somewhat more accurate results I hope: sorry
1600x1200x16 16z 16bit textures
2324.5 single
2552.1 multi
640x480x16 16z 16bit textures
2061.4 single
2167.7 multi
Now I have to re-evaluate everything I thought before :shock:. Thanks for the feedback.
Dave can you run the tests with 16bit textures as well?
Dave Baumann
22-Feb-2003, 19:37
I guess it would help if I didn't have 4x AA and 8x AF Quality turned on in the drivers! oh boy.
Heh. Oh, don't worry, I've been there a few times myself!!
Dave can you run the tests with 16bit textures as well?
Reinstalling windows at the moment to try and get the overclocking options to work again!
Dave Baumann
22-Feb-2003, 20:06
more:
GeForce FX 5800 Ultra @ 275/540 (~ Radeon 9500 PRO Speeds)
1600x1200x32, 24bit Z, Compressed textures:
Single Texturing: 909.3 (-59%)
MultiTexturing: 1996.6 (-9%)
1600x1200x16, 16bit Z, Compressed textures:
Single Texturing: 1121.7 (-49%)
MultiTexturing: 2082.5 (-5%)
GeForce FX 5800 Ultra @ 275/1000
1600x1200x32, 24bit Z, Compressed textures:
Single Texturing: 1126.9 (-49%)
MultiTexturing: 2079.2 (-5%)
1600x1200x16, 16bit Z, Compressed textures:
Single Texturing: 1133.8 (-48%)
MultiTexturing: 2091.6 (-4%)
Dave Baumann
22-Feb-2003, 20:12
GeForce FX 5800 Ultra @ 275/1000
1600x1200x16, 16bit Z, 16 bit textures:
Single Texturing: 1133.8 (-48%)
MultiTexturing: 2094.8 (-4%)
Interesting results; you have definitely ruled out memory bandwidth. At 1600x1200x16, 16bit Z, with 16bit textures, the 275/1000 setting puts you slightly over 100% of the theoretical 1100 Mpixels/sec fillrate of a 4x2 architecture, but sadly way below that of an 8x1 architecture. That test does suggest rendering more than 4 pixels per clock, but not by much, and it could be due to the test itself not being accurate enough. Now what is going on here? A very efficient 4x2 design, a very inefficient overall design, or drivers holding it back?
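The reasoning in that paragraph boils down to two small calculations, sketched below with the downclocked numbers posted above. The helper names are invented for illustration, and the scores are assumed to be in Mpixels/s and Mtexels/s.

```python
def per_clock(rate_m_per_sec, core_mhz):
    """Average pixels (or texels) produced per core clock."""
    return rate_m_per_sec / core_mhz


def pct_gain(before, after):
    """Percentage change when going from `before` to `after`."""
    return (after / before - 1) * 100


CORE_MHZ = 275

# 1600x1200x16, 16bit Z, 16bit textures, 275/1000
print(f"single: {per_clock(1133.8, CORE_MHZ):.2f} pixels/clock")
print(f"multi:  {per_clock(2094.8, CORE_MHZ):.2f} texels/clock")

# Single texturing when memory goes from 540 to 1000 at the same core clock
print(f"32-bit gain from faster memory: {pct_gain(909.3, 1126.9):.1f}%")
print(f"16-bit gain from faster memory: {pct_gain(1121.7, 1133.8):.1f}%")
```

At 16bit the extra memory bandwidth buys about 1%, so the card is core-limited there, and the single-texture figure sits only a few percent above 4 pixels per clock, which is within the sort of margin a benchmark timing error could produce.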
I thought my last numbers from my Radeon 9700 Pro were too high, but they pan out. If I run at the default resolution, 1024x768x32, 24Z and compressed textures, in 3dMark2001 I get:
1776 single
2538.2 multi
Redid the previous test and got the same numbers again for 1600x1200x16, 16Z and 16bit textures.
So I think running at higher resolutions, such as 1600x1200x16 with a 16bit Z, is a more accurate way of measuring the potential fillrate of a card.
Dave Baumann
22-Feb-2003, 20:42
That test does suggest rendering more than 4 pixels per clock, but not by much, and it could be due to the test itself not being accurate enough. Now what is going on here?
I have seen a case where it is over the theoretical maximum before. Neeyik or Worm commented on it at the time.
Then there is nothing really proving that it can render more than 4 pixels per clock cycle, which also suggests that internally it doesn't either, as far as I can see :( .
MDolenc
22-Feb-2003, 20:58
Dave: Have you tried THAT thing yet? :wink: If not check your e-mail...
Dave Baumann
22-Feb-2003, 21:01
I'm just setting up the network setting to transfer it across actually
Dave Baumann
22-Feb-2003, 21:06
Fillrate Tester
--------------------------
Display adapter: NVIDIA GeForce FX 5800 Ultra
Driver version: 6.14.1.4268
Display mode: 1024x768x32bpp
--------------------------
FFP - Pure fillrate - 1877.465820M pixels/sec
FFP - Single texture - 1511.438965M pixels/sec
FFP - Dual texture - 1278.825439M pixels/sec
FFP - Triple texture - 731.203369M pixels/sec
FFP - Quad texture - 700.277161M pixels/sec
PS_2_0 - Per pixel lighting - 79.678642M pixels/sec
PS_2_0 PP - Per pixel lighting - 79.677414M pixels/sec
MDolenc: How many instructions does the shader test issue and do they do any texture lookups?
MDolenc
22-Feb-2003, 21:42
Pixel shader has 14 instructions and 2 texture lookups.
The PS_2_0 code seems to be running at 101fps here...
Is PP the precision hint?
Dave Baumann
22-Feb-2003, 21:48
Given the texturing performance, that appears to suggest a 4x2 design to me, seeing the respective drops between each of the layers. Would you also say that the drivers are running in FP16 mode regardless of the requested precision?
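One rough way to put numbers on "the respective drops between each of the layers" is to compare the step ratios in the Fillrate Tester output with what idealised 4x2 and 8x1 layouts would give. The sketch assumes a 500 MHz core, one texture per TMU per clock and no bandwidth limits; the mismatch score is an ad hoc measure chosen here, not an established metric.

```python
import math

# FFP single, dual, triple, quad texture results from the run above (M pixels/sec)
measured = [1511.438965, 1278.825439, 731.203369, 700.277161]


def model_rates(pipes, tmus_per_pipe, core_mhz=500, max_layers=4):
    """Idealised M pixels/sec for 1..max_layers texture layers."""
    return [
        pipes * core_mhz / math.ceil(layers / tmus_per_pipe)
        for layers in range(1, max_layers + 1)
    ]


def step_ratios(rates):
    """Ratio of each result to the next one down (single/dual, dual/triple, ...)."""
    return [a / b for a, b in zip(rates, rates[1:])]


def mismatch(model, observed):
    """Sum of absolute log differences between model and observed step ratios."""
    return sum(
        abs(math.log(m / o)) for m, o in zip(step_ratios(model), step_ratios(observed))
    )


print("measured steps:", [round(r, 2) for r in step_ratios(measured)])
for name, pipes, tmus in (("4x2", 4, 2), ("8x1", 8, 1)):
    rates = model_rates(pipes, tmus)
    print(name, "steps:", [round(r, 2) for r in step_ratios(rates)],
          "mismatch:", round(mismatch(rates, measured), 2))
```

The shape of the measured curve (small drop, big drop, small drop) follows the 4x2 pattern much more closely than the 8x1 one, even though the absolute numbers are well below either ideal at 1024x768x32.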
pocketmoon_
22-Feb-2003, 21:50
So it appears that in DX9, at least, using half precision has no performance advantage for the FX. This must be due to the FX drivers choosing half by default?? In my OpenGL test I get a marked speed-up moving from float to half.
Interesting. What we'd need, thus, is an OpenGL program doing pretty much the same thing as a D3D program.
Then compare OpenGL float & half to D3D performance. We'd get a definitive answer that way :)
Anyone up to the challenge? I fear I know way too little OpenGL to do such a thing...
Uttar
Is Hellbinder off his rocker again?
Personally I think I am going to turn out to have been pretty close to right the entire time and it is more like a 4x4. Just like I said months and months ago. Of course I immediately got laughed to scorn, and have been made fun of, or accused of posting *false rumors* because of it ever since. Well, who is laughing now????
I am not saying that it is a proper 4x4 in the Parhelia sense, but that it is more like a 4x2x2... Meaning it can do up to 4 textures on a single pixel in a single clock cycle. Thus they can claim that ambiguous *16* operations per clock. But only under certain conditions.
See what I mean??
I KNEW that the OpenGL block diagram showed the logical layout of 4 pipes with 4 texturing units. Yet I was told that I was not *reading it correctly*, that it's just a representation of the way the command structure flows.. blah blah blah...
All I have to say is *Who's Yo Daddy Now*...
Interesting, how does the Radeon 9500 pro compare?
Hellbinder
22-Feb-2003, 22:07
Is Hellbinder off his rocker again?
No, I don't think so... but I'm biased.
No, I don't think so... but I'm biased.
Gee. . . I never would have thought. . . :roll:
Gee, that was fast :P rocker destroyed!
pocketmoon_
22-Feb-2003, 22:13
I could knock one up on Cg :)
Using the ARB fragment profile *should* produce a shader which will run on NV and ATI hardware. The ARB profile has no concept of 'half' precision and any halfs get promoted automatically. BUT with NV hardware you can load and run the same Cg shader using the NV30 profile, which does understand halfs.
I'll have a stiff coffee and see how far I get tonight :) (The little one has just moved from a cot to a bed so she's now able to get up at 3.00am, wake me up and ask for a story)
That would be cool! 8) I need to learn how to program is the bottom line.
Hellbinder
22-Feb-2003, 22:18
Note I think that it can likely be closer to a 4x4 under certain conditions. This time I am going to stick to my guns for a while. I think that drivers could be holding it back from what you should really be seeing. Or perhaps some other hardware issue. Notice how the 3DMark score doubled, but only for those specific game tests. It seems to be able to throw the odd score every now and then, which leads me to believe that more time is needed before we see the actual results.
I'm going to stick it out with this position for at least 4 weeks ;)
Give people time to get retail cards and do some serious testing with release drivers.
Reverend
22-Feb-2003, 22:20
(The little one has just moved from a cot to a bed so she's now able to get up at 3.00am, wake me up and ask for a story)
No, no, no... please do not ever let me re-live my nightmares!
Note I think that it can likely be closer to a 4x4 under certain conditions. This time I am going to stick to my guns for a while. I think that drivers could be holding it back from what you should really be seeing. Or perhaps some other hardware issue. Notice how the 3DMark score doubled, but only for those specific game tests. It seems to be able to throw the odd score every now and then, which leads me to believe that more time is needed before we see the actual results.
I'm going to stick it out with this position for at least 4 weeks ;)
Give people time to get retail cards and do some serious testing with release drivers.
*sighs*
Fillrate Tester
--------------------------
Display adapter: NVIDIA GeForce FX 5800 Ultra
Driver version: 6.14.1.4268
Display mode: 1024x768x32bpp
--------------------------
FFP - Pure fillrate - 1877.465820M pixels/sec
FFP - Single texture - 1511.438965M pixels/sec
FFP - Dual texture - 1278.825439M pixels/sec
FFP - Triple texture - 731.203369M pixels/sec
FFP - Quad texture - 700.277161M pixels/sec
PS_2_0 - Per pixel lighting - 79.678642M pixels/sec
PS_2_0 PP - Per pixel lighting - 79.677414M pixels/sec
Another disappointment. :( It would be nice to clear up these numbers a little bit more (i.e. high res, 16bit, maybe try the underclock-the-core trick again, and maybe some tests at 5 and 6 textures for kicks), but it appears NV30 really is a plain-old 4x2 (with 8 z/stencil per clock--that'd be nice to have tested as well, if possible) and not an 8x1 limited to 4 RGB writes/clock.
Sigh.
On the subject of FP performance/precision, I am still, against all odds, willing to give Nvidia the benefit of the doubt. I think it's all just a big driver issue, and one that should be resolved soon (before it has the chance to significantly affect any GFfx/QuadroFX customers). Reason being...look at the data points we have so far:
1) Various DX pixel shader benchmark suites showing GFfx with ~1/2 the performance of 9700, and with absolutely no difference between "FP16" and "FP32" (or more accurately, between with/without the _PP hint).
2) Carmack says the ARB_fragment_program extensions show GFfx with ~1/2 the performance of 9700.
3) Carmack says the proprietary Nvidia OpenGL extensions show GFfx with roughly equal the performance of 9700.
To me, the most likely conclusion to take from this set of data is that the current GFfx drivers are very inefficient at mapping standards-based shader code--i.e. DX or ARB_fragment_program--to NV30's hardware functionality, but that the hardware executes just fine when running Nvidia's proprietary extensions, which presumably are modeled much more closely on the underlying hardware. There are other conclusions that could be drawn, but right now this one seems the most likely. IMO it points to some unforeseen issue/bug somewhere along the process between standard shader code and NV30 hardware, with the current drivers executing in a severely unoptimized mode as a workaround.
But there's just no good reason to assume future performance of ARB/DX shader code won't look a lot more like current performance of NV extension code than what it looks like now.
IMO.
Hellbinder
22-Feb-2003, 23:29
*sighs*,
That's funny... It's the same treatment I got several months ago. But who was closer to being right in the end?
So here are a few intellectual elitist sighs and rolling eyes back at you.
:roll: :roll: :roll:
Hellbinder
22-Feb-2003, 23:39
DX or ARB_fragment_program--to NV30's hardware functionality, but that the hardware executes just fine when running Nvidia's proprietary extensions, which presumably are modeled much more closely on the underlying hardware. There are other conclusions that could be drawn, but right now this one seems the most likely
Well, Carmack pretty much spelled it out that the only benefit from the Nvidia path is that it's running in FP16.
Which leads me down the road wondering if their entire claim of FP32 is based on some kind of FP16x2 setup. Which seems to be the growing pattern with all their claims of *8 pixel* processing. Or in this case 32bit color.
Is that possible as far as that FP 16/32 issue?
Well, Carmack pretty much spelled it out that the only benefit from the Nvidia path is that it's running in FP16.
True, but the point is with current drivers there is apparently absolutely no performance difference between FP16 and FP32 running DX pixel shaders. (Or rather, between what should be FP16 and what should be FP32; it's unclear which is actually being output.) So it would appear there is some dramatic difference between what's happening with the proprietary NV extensions and what's happening with either of the two standard low-level shader languages. It is my best guess that this difference will be solved with future drivers, because I can't think of any hardware-related reason for the difference.
Which leads me down the road wondering if their entire claim of FP32 is based on some kind of FP16x2 setup. Which seems to be the growing pattern with all their claims of *8 pixel* processing. Or in this case 32bit color.
Is that possible as far as that FP 16/32 issue?
Oh, it's extremely likely. But there's more to it than that: you can't just stick two FP16's together and get an FP32. Some portions of the functional units can be reused, but other portions will be new.
Put it this way: can you use R300's pipeline to output FP48 at half throughput? No, of course not. (Another point: you can't use R300 (or NV30's) 8-bit/component integer pipeline to output 16-bit/component integer values at half throughput either. But making this change would require a lot fewer extra transistors than getting FP32 from an FP16 pipeline.)
So, to the extent FP32 has a visual advantage in certain situations over FP24, then this is a worthwhile feature. The question is to what extent that is true...
Ilfirin
23-Feb-2003, 00:35
A little off-topic post, but rather this than a whole new thread :) :
I've spent most of the day writing a purely synthetic pixel shader benchmark for anyone who wants it (though the original intent was just to satisfy my own curiosity). Here's what it [will*] does:
It renders a quad (2-triangles) that exactly covers the viewport, with a 96 instruction ps.2.0 shader (and a vertex shader that does the absolute minimum amount of work). It does this for about a thousand frames and measures the amount of time it took to draw all the frames, along with exactly how many instructions were calculated. It then outputs how many instructions the card can do per second.
*It is almost complete, but at the moment I am having a bitch of a time getting the shader to compile with more than 21 texture instructions. I hit the 64 arithmetic instruction limit in no time (which annoys me deeply), but coming up with a way to use the texture instructions wasn't easy. Then when I did come up with a way to use all 32 it won't compile more than 21.. sigh. I'm still working on it, should be done by the end of the night EST.
Any input/suggestions would be appreciated.
A little off-topic post, but rather this than a whole new thread :) :
It renders a quad (2-triangles) that exactly covers the viewport, with a 96 instruction ps.2.0 shader (and a vertex shader that does the absolute minimum amount of work). It does this for about a thousand frames and measures the amount of time it took to draw all the frames, along with exactly how many instructions were calculated. It then outputs how many arithmetic and texture instructions the card can do per second.
Any input/suggestions would be appreciated.
Well, I'll give it a try:
Would it make sense to implement a feature to choose from several shader lengths (or even to choose the number of arithmetic and texture ops separately, or at least to be able to choose from several different presets)?
Ilfirin
23-Feb-2003, 05:54
Ok, it's done.
Get it here (http://home.ec.rr.com/immortalg/PixelShaderBench.zip)
Far from an interesting benchmark to watch, you're lucky the display wasn't just a progress bar though ;) (those are my favorite for some reason) But that wasn't the point anyway.
For those wondering what is actually being done:
First the diffuse texture is blurred with a 17 pixel box filter (it's only a dual textured app)
Then the normal map is blurred with a 5 pixel box filter
Then phong illumination is evaluated.
In order to eliminate as many outside factors as possible there isn't a single render state change in the main-loop, there are only 4 vertices with a 5 instruction long vertex shader, and all it's doing is clearing the screen & depth, drawing the quad over the viewport, presenting, repeat 1000 times.
The final instruction count for the pixel shader was only 84 - 22 texture, 62 arithmetic.
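The benchmark's source isn't posted in the thread, but the bookkeeping it describes can be sketched as follows. This assumes every pixel of the full-screen quad is shaded exactly once per frame and that "about a thousand" means exactly 1000 frames; the function name and the 45.6 second elapsed time are invented purely for illustration and are not a measurement.

```python
def shader_mips(frames, width, height, instructions_per_pixel, elapsed_seconds):
    """Millions of pixel shader instructions executed per second."""
    total_instructions = frames * width * height * instructions_per_pixel
    return total_instructions / elapsed_seconds / 1e6


# Hypothetical run: 1000 frames at 1600x1200 with the 84-instruction shader,
# taking a made-up 45.6 seconds of wall-clock time.
score = shader_mips(1000, 1600, 1200, 84, elapsed_seconds=45.6)
print(f"{score:.0f} MIPS")
```

Because the instruction count per frame is fixed by resolution and shader length, the score depends only on the elapsed time, which is why removing per-frame overhead such as clears nudges the result upward.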
My Radeon 9700Pro got about 3375 MIPS (3538 at 1600x1200). That sounded wrong to me, so I double checked everything, tested it a few times, and kept getting the same results. Here's my current reasoning behind it: the theoretical maximum MIPS for a graphics processor is 'numPipelines*clockrate*averageNumInstructionsPerClock'. Using that [possibly flawed] logic (I expect the # of TMUs to affect the result, but it's way too late/early to determine exactly how :) ), the average number of instructions the Radeon 9700Pro executes per clock is about 1.3.
[Edit:]
Assuming that logic isn't really wrong and 0 influence from bandwidth, the other ATI DX9 cards should have results similar to the following at 1600x1200:
9700 = 2860 MIPS
9500 Pro = 2860 MIPS (Found someone with one of these (thanks nuvem!) and got within 3% error from this)
9500 = 1430 MIPS
[EDIT:] Updated the zip with a new executable that doesn't even clear, to reduce the error further. Now getting 3606 MIPS at 1600x1200.
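The estimate in this post can be reproduced with a few lines. The pipeline counts and clocks below are assumptions consistent with the predictions quoted (9700 Pro: 8 pipes at 325 MHz; 9700 and 9500 Pro: 8 pipes at 275 MHz; 9500: 4 pipes at 275 MHz), and the model inherits the post's own caveat that TMU count and bandwidth are ignored.

```python
def instructions_per_clock(mips, pipelines, core_mhz):
    """Average instructions per pipeline per clock implied by a MIPS score."""
    return mips / (pipelines * core_mhz)


def predicted_mips(pipelines, core_mhz, ipc):
    """numPipelines * clockrate * averageNumInstructionsPerClock."""
    return pipelines * core_mhz * ipc


ipc = round(instructions_per_clock(3375, pipelines=8, core_mhz=325), 1)
print(f"9700 Pro: about {ipc} instructions per pipe per clock")

# (card, assumed pipelines, assumed core MHz)
for card, pipes, mhz in (("9700", 8, 275), ("9500 Pro", 8, 275), ("9500", 4, 275)):
    print(f"{card}: {predicted_mips(pipes, mhz, ipc):.0f} MIPS")
```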
Chalnoth
23-Feb-2003, 08:14
Fillrate Tester
--------------------------
Display adapter: NVIDIA GeForce FX 5800 Ultra
Driver version: 6.14.1.4268
Display mode: 1024x768x32bpp
--------------------------
FFP - Pure fillrate - 1877.465820M pixels/sec
FFP - Single texture - 1511.438965M pixels/sec
FFP - Dual texture - 1278.825439M pixels/sec
FFP - Triple texture - 731.203369M pixels/sec
FFP - Quad texture - 700.277161M pixels/sec
PS_2_0 - Per pixel lighting - 79.678642M pixels/sec
PS_2_0 PP - Per pixel lighting - 79.677414M pixels/sec
MDolenc: How many instructions does the shader test issue and do they do any texture lookups?
Well, yeah, definitely looks like it's acting just like a 4x2 architecture. I think that for the NV30 to act differently, it would need to be able to do processing on more than 4 pixels at once. If this is not possible, drivers will never improve this deficiency. If the NV30 can work on more than 4 pixels at once, then scheduling optimizations could fix this. But nobody should expect it...it's probably not going to happen. Just thought I'd throw it out there.
Heathen
23-Feb-2003, 09:30
So does 3157 Mips @ 1280*1024 sound ok?
Well, yeah, definitely looks like it's acting just like a 4x2 architecture. I think that for the NV30 to act differently, it would need to be able to do processing on more than 4 pixels at once. If this is not possible, drivers will never improve this deficiency. If the NV30 can work on more than 4 pixels at once, then scheduling optimizations could fix this. But nobody should expect it...it's probably not going to happen. Just thought I'd throw it out there.
Question now is whether it can be working on 8 pixels at once in the shaders, or whether the shader pipeline is just 4 pipes each capable of 2 ops and 2 texture lookups every clock.
Dave Baumann
23-Feb-2003, 09:56
Ok, it's done.
Get it here (http://home.ec.rr.com/immortalg/PixelShaderBench.zip)
Mmmmm, GFFX seems not to like this one. Turned up with a score of 272MIPS, but the fan had turned off so I wonder if it had somehow dropped into 2D speeds!
Some kinda progress indicator would be nice! ;)
Dave Baumann
23-Feb-2003, 10:00
OK, it appears that it did drop to 2D speeds, but even at 3D speeds it's only getting 453.55 MIPS @ 1600x1200.
Is this running in FP32?
Also, would it be possible to make one that runs 1024 instructions?
overclocked
23-Feb-2003, 11:35
This must be a driver issue or what the f#ck!
Someone needs to send an e-mail to Nvidia and ask if they'll have drivers that are GOOD when it hits the stores.
Could it be due to the "whole new core-arc"? It must be crappy drivers, or does Nvidia have something up...
1024x768
4186 MIPS @ 400/760
4190 MIPS @ 400/620
4692 MIPS @ 450/760
4692 MIPS @ 450/620
3445 MIPS @ 325/620
OK, it appears that it did drop to 2D speeds, but even at 3D speeds it's only getting 453.55 MIPS @ 1600x1200.
What the?! Damn, I did expect something like 1500 considering the GFFX low scores using all the PS benchies... But that low... It's ridiculous!
Even god couldn't do a miracle sufficiently powerful to save the NV30 now...
Either the NV30 is the world's most buggy architecture EVER and it'll be ( mostly ) fixed in the NV35, or the drivers have been created by ex-ATI engineers... Or, well, most likely, nVidia just plain messed up the whole architecture big time and they're in big trouble until the NV40 ( could this explain why, suddenly, nVidia seems to have decided to release the NV40 at Comdex 2003 instead of Mid 2004? )
I expected many bad things for the NV30 in all this time. Sub-par AA performance. Bad Dynamic Branching performance. And a lot more.
But this is simply ridiculous. Even my worst expectations didn't think such a thing was possible.
Now, we know what David Kirk meant when he said:
One of the reasons that NV30 took us so long is that everything top to bottom is 128-bit floating-point
Guess what? That's also the reason everything is taking so long in the GFFX: they didn't focus on making them work *fast* :(
I really love a quote of the author of that extremetech article, too -really nice, considering the numbers Wavey just gave us:
Because GeForceFX is more of a processor and less of a pixel blaster, when we discuss GPUs we'll soon become more concerned with CPU metrics like IPC (instructions per clock).
How true...
But there's still something I don't understand. David Kirk said the following in that article:
In graphics there's a huge amount of explicit parallelism. When you're drawing a polygon, each pixel is executing the same program as the neighboring pixels. We don't yet have the ability within any API to specify different [shader] programs for each pixel within a polygon. So what happens is that we have multiple parallel pipeline processors that are each running multi-threaded, processing different pixels. So you can have 32 pixels in flight through the shader, and the hardware just handles that.
So, David Kirk is clearly stating the NV30 is able to work on 32 pixels at the same time. Strange...
BTW, David Kirk pretty much said most instructions are done as a 4 pipe design in that article:
Some things happen at sixteen pixels per clock. Some things happen at eight. Some things happen at four, and a lot of things happen in a bunch of clock cycles four pixels at a time. For instance, if you're doing sixteen textures, it's four pixels per clock, but it takes more than one clock.
Sounds like we were just too blind and hopeful, and we didn't notice it :) Amusing, ain't it?
Uttar
P.S. : This may sound strange, but... Wavey, could you try it with both an underclocked core and an underclocked memory ( not the two at the same time ) ? I'd like to see what makes the most effect. That's because of the whole "PS program stored in memory" thing, and who knows, it might have a slight effect on this...
could this explain why, suddenly, nVidia seems to have decided to release the NV40 at Comdex 2003 instead of Mid 2004?
Umh..where did you read this?
Sharkfood
23-Feb-2003, 14:13
Just for giggles, and to possibly tease out any procedural or driver bugs...
Has anyone run these tests with antialiasing and/or anisotropic filtering?
It occurs to me that figures in either case might yield a bit more insight or otherwise unveil some possible driver/methodology bugs.. (maybe)..
demalion
23-Feb-2003, 14:28
Have we investigated the idea of the nv30 using its vertex processing units for some of the 32-bit component fragment processing, and the ramifications of that usage for performance?
Could we be seeing nvidia, via drivers, only allowing some units to be utilized for fragment processing...with the decision seeming to depend on precision? In which case, all the comments about how many in flight pixels, etc, for fp32 color processing that have been made would be presuming complete allocation of these resources for pixel processing (which would technically not make them a lie.. :-? ).
Certainly fits with nvidia's complaints about 3dmark, for example. Seems to also fit how vertex processing performance might excel when doing flat shading (i.e, however many integer color processing units there are would be the bottleneck), the push for per-variable precision specification, and some other aspects of nvidia's handling of the nv30.
One thing that would be interesting (assuming my description isn't too whacko) is to try and test this...like maybe vertex processing limited benchmarks with various precisions explicitly defined. Looking at 3dmark03 results can't be too informative since AFAIK the performance increases are 3dmark specific and may make precision assumptions for allocating this theoretical architecture optimally (so we can't make assumptions about what precision is being used, necessarily).
Maybe I'll go look at that digit-life/ixbt article, though I had the impression there might be some issues with some of their tests.
overclocked
23-Feb-2003, 14:57
There is something going on here...
could this explain why, suddenly, nVidia seems to have decided to release the NV40 at Comdex 2003 instead of Mid 2004?
Umh..where did you read this?
It comes from MuFu.
http://www.nvnews.net/vbulletin/showthread.php?s=&threadid=7550
Yah. NV40 is 0.13u and 4th quarter '03, AFAIK. :-\
MuFu.
As I said, it "seems" it's going to be launched at Comdex 2003.
Plans can change, or someone could have given false info to MuFu.
Uttar
Just to clarify - sampling Q4, not launch.
I personally think it'll be a Spring 2004 product.
MuFu.
Just to clarify - sampling Q4, not launch.
I personally think it'll be a Spring 2004 product.
MuFu.
Woah, you're fast at fixing my mistakes! :)
Still faster than I had expected: I was betting on August 2004 for the NV40.
But then again, I was also expecting 0.09... Ah well!
It'll also be the first time nVidia delivers a next-generation core with the same process as the old core.
Uttar
demalion
23-Feb-2003, 15:18
Hmm, well it seems their included GeometryProcessingSpeed benchmark has the options to try and test this out, but the way they used it in the article doesn't seem to be informative. This is as close as I will go (http://www.digit-life.com/articles2/gffx/gffx-ref-p2.html#p5) to providing a link, so as not to abuse their bandwidth in a rude fashion ;) (just scroll down to the bottom for the download link for the zip file for the test), as their "Contents" links don't seem to be working at this time.
First, I'm wondering how tests like the FFP (Fixed T&L), VS11 (Vertex Shaders 1.1 and Fixed Function Blend Stages), VS20 ( Vertex Shaders 2.0 and Fixed Function Blend Stages) from that program compare, as that could give us a good idea about the vertex processor pipeline behavior...could be educational to compare between Quadro FX and "Gaming" FX performance for them as well (I'd think the professional drivers would be more likely to expose the absolute limit in vertex processing allocation). All with 1 texture stage, I guess.
Second, regarding my speculation in the above post, I'm wondering how tests like the PS11 (Vertex Shaders 1.1 and Pixel Shaders 1.1), PS14 (Vertex Shaders 1.1 and Pixel Shaders 1.4), and PS20 (Vertex Shaders 2.0 and Pixel Shaders 2.0) compare to VS11 and VS20 (perhaps with information on the precision reported for PS 1.4 for whichever drivers are used). All with 1 texture stage specified.
Finally, this test program seems rather handy for investigating whether the drivers are adapting behavior based on texturing demands, since you can vary the texture stages and observe directly how performance changes with each change. It is pretty light on fillrate demands, so it would only tell us (I think) if these changes are impacting the vertex processor allocation.
...
The authors seem to want people to use the programs for testing, so, unless there is some problem with them, perhaps we should.
Ilfirin's pixel shader example seems OK, but other than showing that something weird is happening with the NV30 pixel shader in DX9, it doesn't tell us much.
Those results call for more synthetic pixel shader tests: for example, a pixel shader with only simple ALU instructions, and another with only (or mostly) texture access instructions. And maybe some others mixing both kinds of instructions in different patterns (drivers can surely do a lot of things when translating from DX9 PS2 to whatever the real NV30 ISA is, but if they aren't really working as they should, different PS2.0 instruction schedulings could produce different results).
Another alternative would be to use NV_fragment_program, which is theoretically nearer to the real internal representation, and compare with equivalent pixel shaders using PS2.0.
I just checked the NV30 OpenGL extension specification again, and there is something I'm not sure has been talked about here. It relates to the number and size of pixel shader instructions and to parameters/constants. NV_fragment_program doesn't seem to imply that there is an internal constant register bank (or memory) as in the case of the vertex shaders or R300 pixel shaders. That could mean that all constants have to be read from video memory (or are read from memory but go through an L1-style cache too). In fact it talks about two kinds of constants: embedded and named/numbered constants. The first would be like an immediate value in a CPU ISA; the latter could be like an absolute memory access in a CPU ISA. Pixel shader instructions also seem to be fetched from memory, so there could be a lot of pressure on video memory just from fetching instructions and reading constants. A single pixel shader instruction can be as large as 100 bits and a full embedded constant is 128 bits, which could mean 256-bit instructions when using embedded constants. And if NVidia lets you use embedded rather than named local constants, that should mean there is some kind of advantage to using the first over the second.
An obvious solution to that memory pressure would be CPU-like caches (L1 data, L1 code), but NVidia doesn't seem to talk about any cache size.
antlers
23-Feb-2003, 15:41
Those results call for more synthetic pixel shader tests.
I hope someone, Dave or anyone with an FX tries this.
Humus' Mandelbrot demo (http://esprit.campus.luth.se/~humus/3D/MandelbrotSet.zip) is a shader-compute-bound (a lot of vector multiply/adds, only one texture lookup) demonstration program, written for standard DX9.0. It's pretty easy to tell what precision it is actually running at by looking (i.e., will show more detail at FP32 than FP24, and more at FP24 than FP16).
It would be great to run it on an FX with current drivers to compare performance with a 9700 and to see if current DX9 drivers force shaders to 16 bits (if it has less detail than 9700 it is forcing FP16).
If it is running with FP32, then you can try running it with FP16 by using this version (www.antlersoft.com/mandel.fshpp) of the shader instead. Just copy this modified file over the mandel.fsh in Humus' demo. This version uses the DX9 partial precision hint to force all the calcs to FP16. If it is really running at FP16, there should be visibly less detail than the FP32 version. Then you can compare framerates and get an idea how much faster FP16 runs over FP32 on the FX.
If there is interest in this idea I'll work on porting the demo to OpenGL with the NV30-specific shader extensions.
ps.2.0
dcl t0
dcl_2d s0
def c0, 0.0, 1.0, 8.0, 0.0
mad_pp r2.xy, t0.x, t0, t0
mad_pp r1.x, -t0.y, t0.y, r2.x
mad_pp r1.y, t0.x, t0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mad_pp r2.xy, r1.x, r1, t0
mad_pp r0.x, -r1.y, r1.y, r2.x
mad_pp r0.y, r1.x, r1.y, r2.y
mad_pp r2.xy, r0.x, r0, t0
mad_pp r1.x, -r0.y, r0.y, r2.x
mad_pp r1.y, r0.x, r0.y, r2.y
mov_pp r1.z, c0.x
dp3_sat r0, r1, r1
mul_pp r0.x, r0.x, c0.z
exp_pp r0.x, -r0.x
sub_pp r0, c0.y, r0
texld r0, r0, s0
mov oC0, r0
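The 57 mad_pp lines in the listing above are 19 unrolled steps of the Mandelbrot iteration z = z*z + c, with both z and c starting from the texture coordinate t0. Below is a rough CPU-side sketch of that iteration with each result rounded to half or single precision, to give a feel for why FP16 should show visibly less detail. It is not the demo's code: the function names are invented, rounding after every mad is an assumption about how the hardware behaves, FP24 is not emulated, and the colour lookup at the end of the shader is left out.

```python
import struct


def _round_via(fmt, x):
    """Round x through a packed float format; finite overflow becomes +/-inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_half(x):
    """Nearest IEEE half-precision (FP16) value."""
    return _round_via("e", x)


def to_single(x):
    """Nearest IEEE single-precision (FP32) value."""
    return _round_via("f", x)


def iterate(cx, cy, rnd, steps=19):
    """Run the unrolled mad sequence: three mads per step, rounded by rnd."""
    x, y = cx, cy
    for _ in range(steps):
        r2x = rnd(x * x + cx)   # mad r2.x: x*x + c.x
        r2y = rnd(x * y + cy)   # mad r2.y: x*y + c.y
        nx = rnd(-y * y + r2x)  # real part: x*x - y*y + c.x
        ny = rnd(x * y + r2y)   # imaginary part: 2*x*y + c.y
        x, y = nx, ny
    return x, y


if __name__ == "__main__":
    c = (-0.75, 0.1)  # arbitrary sample point near the edge of the set
    print("FP16:", iterate(*c, to_half))
    print("FP32:", iterate(*c, to_single))
```

Points near the boundary of the set are where the two precisions drift apart first, which is where the lost detail would show up on screen.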
Silly idea: what would happen if the pixel shader instruction + constant cache was smaller than the DX9 96 pixel instruction limit, like the shader used in Ilfirin's test?
But I don't really think that it has anything to do with the NV30 problems.
THe_KELRaTH
23-Feb-2003, 16:21
In Nvidia's powerpoint presentation it refers to:
128-bit framebuffer “color” output (use as 4 x FP32, 8 x FP16, etc…)
When referring to setting the precision level:
In C, it’s easy to accidentally use high precision
half x, y;
x = y * 2.0; // Double-precision multiply!
Not in Cg
x = y * 2.0; // Half-precision multiply
Unless you want to
x = y * 2.0f; // Float-precision multiply
Is it possible that whenever high precision is requested that it's defaulting to FP32 unless Cg is used?
Also, is it possible that there is a driver problem in that FP16 is not properly implemented? From the tests I've viewed so far there's no performance change at all between FP16 and FP32.
antlers
23-Feb-2003, 16:33
[quote="THe_KELRaTH"]
Is it possible that whenever high precision is requested that it's defaulting to FP32 unless Cg is used?
Also, is it possible that there is a driver problem in that FP16 is not properly implemented? From the tests I've viewed so far there's no performance change at all between FP16 and FP32.[/quote]
These questions would be answered with the test described above, where it is visibly obvious which precision is being used...
Ilfirin
23-Feb-2003, 16:50
OK, it appears that it did drop to 2D speeds, but even at 3D speeds it's only getting 453.55 MIPS @ 1600x1200.
That is extremely odd. That is about the results I would expect with 1 pixel pipeline, doing 1 instruction per clock.
Is this running in FP32?
It may be. When I compiled the shader I did everything in 'double's to see if it would have any effect, but when checked against the compiled output with everything as standard 'float's there wasn't any difference. So either it is defaulting to FP32, or something else is wrong.
Also, would it be possible to make one that runs 1024 instructions?
Sure, but I wouldn't have any way to test it other than the reference rasterizer.
About the progress indicator: Adding a frame counter would likely affect the results pretty heavily.. I'll try only rendering 100 frames instead.
I'll get to work on putting more tests in it and such. Be back in a few hours.
Mintmaster
23-Feb-2003, 17:26
Ilfirin, are there any dependent texture reads? For the box filters, are you calculating the texture lookup offsets by doing math in the pixel shader, or do the vertices have a bunch of texture coordinates? It sounds like the former, and I think this is the main problem with NV30's shader performance (hence the vast difference between PS 1.1 and PS 1.4 performance).
Also, are you doing many scalar ops? Doesn't sound like it, but that could put the IPC over 1. Also, are you counting the texture lookups as instructions when doing your calculations? If so, then 84/64 = 1.31, which is pretty close to 3445/(325*8 ) = 1.325.
Looks like R300's shader engine is performing exactly as it should! It also seems to handle arbitrary texture lookups well, which is a difficult thing to do. This is where R200 came in handy as opposed to GF4 (from a hardware architect's point of view).
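The comparison in the post above can be written out as a small check. This is only a sketch: it assumes the 84 and 64 are the total and arithmetic instruction counts of the shader, that texture instructions issue alongside arithmetic ones (so the shader takes 64 clocks per pixel), and that the 3445 MIPS score was taken at 325 MHz on 8 pipelines. The function names are invented.

```python
def expected_ipc(total_instructions, arithmetic_instructions):
    """IPC if texture ops are counted but cost no extra clocks."""
    return total_instructions / arithmetic_instructions


def measured_ipc(mips, core_mhz, pipelines):
    """IPC per pipeline implied by a benchmark score."""
    return mips / (core_mhz * pipelines)


if __name__ == "__main__":
    exp = expected_ipc(84, 64)
    got = measured_ipc(3445, 325, 8)
    print(f"expected {exp:.4f}, measured {got:.4f}, "
          f"difference {100 * (got - exp) / exp:.1f}%")
```

The two figures agree to within about one percent, which is the basis for saying the R300 is performing as it should.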
Ilfirin
23-Feb-2003, 17:45
The box filters' texture coordinates are calculated via math, yes. No there are not too many scalar ops (though there are some.. you can open up pshade.pso to see the pixel shader. Be warned though - the original shader was written in HLSL and thus the output is not very clear).
And yes I am counting texture lookups as instructions, texture instructions to be specific.
I'm adding a pure arithmetic instruction shader, a pure (or as close as possible) texture instruction shader, and (maybe) a mixed 1024 instruction shader. Recommend any other tests?
Bambers
23-Feb-2003, 21:11
I assume that bench is for PS2 cards only? The screen just flashes briefly on my 8500 and gives me a result of about 130000MIPS :D
Ilfirin
23-Feb-2003, 21:46
Sorry for taking so long.. still not quite done.
It takes the reference rasterizer several minutes (>5) to render 2 frames at 320x240 with 505 instructions.. god knows how long it's going to take for 1024 :)
Dave Baumann
23-Feb-2003, 21:55
505 will be good enough. Just so long as it's reasonably long, but longer than R300's limits.
Ilfirin
23-Feb-2003, 22:55
Ok, new one is up. Sorry, no pure texture shader. :/
Get it Here (http://home.ec.rr.com/immortalg/PixelShaderBench.zip)
Changes:
The original shader, along with the new 66 (64 arith, 2 tex) instruction shader only render 500 times rather than 1000.
The 505 (503 arith, 2 texture) instruction shader only loops 50 times.
Results:
Radeon9700Pro (factory settings):
Test1: 3596.5 MIPS
Test2: 2778.38 MIPS
Test3: Obviously didn't run..
God I'm tired.
[Edit] Just found that the 'Force Clear' check box wasn't working. Fixed it and re-uploaded the zip (done at the time of this edit, listed at the bottom of the post).
Chalnoth
23-Feb-2003, 23:37
See if you can write a program to generate a variable-instruction shader, to see if there's a certain number of instructions where the FX drops in performance significantly.
Joe DeFuria
24-Feb-2003, 01:06
Just thought I'd interrupt this thread to thank Ilfirin.
It's people like you (and Basic, and Colourless, and others that have written similar programs to flesh out hardware characteristics) that really make the B3D community a step above the rest. :D
Now back to your regularly scheduled thread...
Dave Baumann
24-Feb-2003, 01:23
FYI, if you haven't seen it yet, this is NV's official response on how many pixels it renders:
http://www.tech-report.com/
NVIDIA: It renders:
8 z pixels per clock
8 stencil ops per clock
8 textures per clock
8 shader ops per clock
4 color + z pixels per clock with 4x multisampling enabled
It is architected to perform those functions.
Basically, its 8 pipes with the exception of color blenders for traditional ROP operations, for which it has hardware to do 4 pixels per clock for color & Z. It has 8 "full" pipes that can blend 4 pixels per clock with color.
Sorry, but is it just me or is the all-important one for determining fillrate, i.e. the number of pixels that are written to the framebuffer, half of what you'd expect? I thought we'd got over fill-rates after the pixels/texels arguments from V2 to TNT2, but now we further want to muddy the waters by including all kinds of non-displayable sample numbers there?
Joe DeFuria
24-Feb-2003, 01:46
Sorry, but is it just me or is the all-important one for determining fillrate, i.e. the number of pixels that are written to the framebuffer, half of what you'd expect?
Yes, exactly. That is the problem here. Consider this from the GFFX Overview:
http://www.nvidia.com/docs/lo/2416/SUPP/Overview.pdf
8-pixels per clock cycle.
Previous Generation: 4 pixels per clock
GeForce FX GPU: 8 pixels per clock
Benefits from GeForce FX: Doubles the fill rate to power through new applications as well as the classic games
I would say that's woefully inaccurate...specifically the part about double the pipes doubling performance in "classic games." That would be true only if the FX did 8 color+z writes per clock.
In every PR and spec sheet they tout "8 pixels per clock" performance.
However, looking back, I'm having a hard time finding any official reference to an actual specification for fill rate. All I keep finding is the core clock rate specification of 400 or 500 MHz, and a fill rate "specification" of "8 pixel/clock rendering engine."
:roll:
I thought we'd got over fill-rates after the pixels/texels arguments from V2 to TNT2, but now we further want to muddy the waters by including all kinds of non-displayable sample numbers there?
Yeah, we had "pixel" fill rate, and "texel" fillrate. I suppose now that we should coin a term for the FX, and call it...um...."zixel" fill rate. (For the "z only pixel"):
GeForceFX @ 400 Mhz:
3.2 GTexel/sec
3.2 GZixel/sec
1.6 GPixel/sec
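Those three figures are just per-clock rates multiplied by the core clock. A minimal sketch, taking the per-clock numbers from NVIDIA's statement quoted earlier (8 textures, 8 z, 4 color + z) and the 400 MHz clock used in this list; the function and table names are invented:

```python
def rate_g_per_sec(units_per_clock, core_mhz):
    """Units per second in billions (G/sec) at a given core clock."""
    return units_per_clock * core_mhz / 1000


# per-clock figures as quoted from NVIDIA's response
PER_CLOCK = {"GTexel": 8, "GZixel": 8, "GPixel": 4}

if __name__ == "__main__":
    for unit, per_clock in PER_CLOCK.items():
        print(f"{rate_g_per_sec(per_clock, 400):.1f} {unit}/sec")
```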
"Yay" :(
antlers
24-Feb-2003, 02:19
And when they say 8 shader ops per cycle, I believe they are counting texture lookup and arithmetic ops separately. By that count the R9700 does 16 shader ops per cycle.
Furthermore, I think the FX architecture is further tied to 4 pipes in that if shader ops can't be paired together in the pipe, the pipe only does 1 per cycle, so the chip does only 4 ops per cycle. I'm pretty sure that texture lookups can be paired together (so each pipe can do 2 per cycle). Z and stencil writes can clearly be paired together. I'm also pretty sure that FP32 arithmetic ops can't be paired with any other op, and it can do a max of four per cycle. What's not clear is to what extent FP16 ops can be paired with each other or with other kind of operations.
It also may be the case that the FX loses the ability to access two textures in a single cycle when dealing with FP textures.
What were all those transistors used on (you should be able to implement 4 32 bit pipes in fewer transistors than 8 24-bit)? Do the longer shaders/more sophisticated shader branching use that many extra transistors? Or was it the extent to which they preserved the integer, fixed-function core?
Mulciber
24-Feb-2003, 02:23
Sorry, but is it just me or is the all-important one for determining fillrate, i.e. the number of pixels that are written to the framebuffer, half of what you'd expect?
Yes, exactly. That is the problem here. Consider this from the GFFX Overview:
http://www.nvidia.com/docs/lo/2416/SUPP/Overview.pdf
8-pixels per clock cycle.
Previous Generation: 4 pixels per clock
GeForce FX GPU: 8 pixels per clock
Benefits from GeForce FX: Doubles the fill rate to power through new applications as well as the classic games
I would say that's woefully inaccurate...specifically the part about double the pipes doubling performance in "classic games." That would be true only if the FX did 8 color+z writes per clock.
In every PR and spec sheet they tout "8 pixels per clock" performance.
However, looking back, I'm having a hard time finding any official reference to an actual specification for fill rate. All I keep finding is the core clock rate specification of 400 or 500 MHz, and a fill rate "specification" of "8 pixel/clock rendering engine."
:roll:
I thought we'd got over fill-rates after the pixels/texels arguments from V2 to TNT2, but now we further want to muddy the waters by including all kinds of non-displayable sample numbers there?
Yeah, we had "pixel" fill rate, and "texel" fillrate. I suppose now that we should coin a term for the FX, and call it...um...."zixel" fill rate. (For the "z only pixel"):
GeForceFX @ 400 Mhz:
3.2 GTexel/sec
3.2 GZixel/sec
1.6 GPixel/sec
"Yay" :(
It's going to crack me up if Zixel starts becoming standard terminology here in the near future :lol:
binmaze
24-Feb-2003, 03:17
Zixel? My god!
It's going to crack me up if Zixel starts becoming standard terminology here in the near future :lol:
"Trixel" might be nice (from the Latin Trickpixellum)
Feeling capricious we might try "Zits per clock"...or "zitopiczel" (that probably needs no further explanation.)
Getting technical we might pose a series of articles examining questions like "How many Trixel-Ops per cycle are these Graphics Processing Units really capable of? Let's find out..."
Then there's also "Chamzel", coming from the root Chameleonus Optrapixellus, which means "changing pixel op" or "camouflaged pixel of many roots."
Then there's "Pidiopzel", which means, sparing the Latin, "dynamic pixel ops", or "pixels and ops on demand via our patented auto-sensing, multi-branching, self-configuring pipeology (TM.)" Or, you might see this phrase intelligently written as "Pixels are the past; for tomorrow, accept nothing less than Pidiopzels!"
Ah, it's Brave New PR world coming, isn't it? Bre-e-e-e-e-eathe it in!
demalion
24-Feb-2003, 05:02
Is my first suggestion (http://www.beyond3d.com/forum/viewtopic.php?p=86667&#86667) about testing vertex processing under different circumstances just "too whacko"?
If so, could someone tell me why?... I'm a bit under the weather and I won't be likely to be able to figure it out on my own anytime soon. :-?
If not, could you give my suggestion for testing (http://www.beyond3d.com/forum/viewtopic.php?p=86682&#86682) a try, Wavey?
I noticed that VS20+PS20 gives higher fps (the same as for Fixed Function lighting) than VS11+PS11 for their testing at the simplest "shader level" (only ambient lighting), which could be an indicator that some processing unit used for PS 1.1 (but not PS 2.0 or FF) can be allocated for some simple lighting processing...this might also fit the FF results for the GF FX in general. This idea can perhaps be checked with the VS11 with FF color processing test results. What this causes me to wonder is if outputting to a floating point color buffer might be necessary to get around some of the driver shortcuts that might be implemented otherwise...but in any case, my test suggestion seems like it would have to use shader level 2 or higher.
Also, the nv30 VS20 + PS20 are near the GF 4 Ti 4600 VS11 + PS11 results of the same complexity for shader levels 2 and higher. The results seem to perhaps correlate to what would happen for 5/3 (clockspeed) * 2/3 (losing a vertex processing unit for fp32 processing) for GF FX versus GF 4 Ti 4600, with perhaps some efficiency loss for fewer vertex processing units, and it seems one conceivable explanation for the other results (besides the Shader Level 1 test).
Reverend
24-Feb-2003, 05:20
An aside to the interesting "discoveries".
These are all very interesting discussions and (attempts at) discoveries.
The bottom line, however, is that these are basically inconsequential, other than letting us know whether it's an 8x1 or 4x2 or whatever NVIDIA's drivers want us to believe.
A video card (and, in a broader sense, a company) is judged, in real terms or in sales-term, by what applications are used to "define" it. Not what a website/websites appear to discover.
"Man, this is what the NV30 does... and this is what the R300 does" is a common developer comment.
Which option he goes with is more important than what the truth is.
In short, this (thread) may be about what a hardware is truly capable of, based on the confusion introduced by initial official "information" but, IMO, the main question is (beyond the "NVIDIA seems to mislead us" opinion)..... what will developers (those that sell, not those by the non-profit contributors here) come out with in terms of software?
I hope folks get what I'm trying to say.
...or whatever NVIDIA's drivers want us to believe.
This is the salient point...
Mulciber
24-Feb-2003, 06:01
An aside to the interesting "discoveries".
These are all very interesting discussions and (attempts at) discoveries.
The bottom line, however, is that these are basically inconsequential, other than letting us know whether it's an 8x1 or 4x2 or whatever NVIDIA's drivers want us to believe.
A video card (and, in a broader sense, a company) is judged, in real terms or in sales-term, by what applications are used to "define" it. Not what a website/websites appear to discover.
"Man, this is what the NV30 does... and this is what the R300 does" is a common developer comment.
Which option he goes with is more important than what the truth is.
In short, this (thread) may be about what a hardware is truly capable of, based on the confusion introduced by initial official "information" but, IMO, the main question is (beyond the "NVIDIA seems to mislead us" opinion)..... what will developers (those that sell, not those by the non-profit contributors here) come out with in terms of software?
I hope folks get what I'm trying to say.
Yes, but this is just too much fun ;)
Chalnoth
24-Feb-2003, 06:10
Just a little note on clock-for-clock performance of the NV30:
First of all, single-texturing wouldn't be any better on an 8x1 architecture anyway, due to memory bandwidth limitations. So, here's what I see as the basic performance rundown (judging same clock speed, same memory bandwidth):
z/stencil-only: 100%
1: 100%
2: 100%
3: 75%
4: 100%
5: 83.3%
6: 100%
7: 87.5%
8: 100%
In other words, if you take memory bandwidth limitations into account, the FX is, at worst, 75% as good as it would have been with 8 pipelines. The fact that it can only output 4 pixels per clock isn't as much of a limitation as many here seem to be implying.
For example if we, just for fun, average over all of these performances, we'll get a very, very rough 94% of the performance of a hypothetical FX with 8 full pipelines.
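The table above follows from a simple model, sketched here under its stated assumptions: same clock, same memory bandwidth, a 4x2 part that needs ceil(n/2) clocks for 4 pixels with n textures, an 8x1 part that needs n clocks for 8 pixels, and single texturing treated as bandwidth-bound on both (so it is pinned to 100% instead of the raw 50%). The function names are made up for illustration.

```python
from math import ceil


def ratio_4x2_vs_8x1(textures):
    """Fill-limited throughput of 4x2 relative to 8x1 at the same clock."""
    # (4 / ceil(n/2)) / (8 / n) simplifies to n / (2 * ceil(n/2)).
    return textures / (2 * ceil(textures / 2))


def relative_performance_table(max_textures=8):
    """Rebuild the list above, including the bandwidth assumption for n == 1."""
    table = {"z/stencil-only": 1.0}
    for n in range(1, max_textures + 1):
        table[n] = 1.0 if n == 1 else ratio_4x2_vs_8x1(n)
    return table


if __name__ == "__main__":
    table = relative_performance_table()
    for key, value in table.items():
        print(f"{key}: {100 * value:.1f}%")
    average = sum(table.values()) / len(table)
    print(f"average: {100 * average:.0f}%")
```

Without the bandwidth assumption the single-texture case drops to 50%, which is the case the fillrate tests earlier in the thread were trying to isolate.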
I do wish to say that I cannot claim not to be disappointed here. nVidia did state quite clearly that the FX had 8 fully-functional pixel pipelines. It definitely is a disappointment, but it is not a disaster (I think the FSAA problems are much more serious).
Edit:
Oh! One last thing. Has anybody tested the trilinear performance of the FX?
Developers will still support the hardware, because it performs okay.
What most people are pointing out here are the cases where they expected the card to perform better and it didn't. There also appear to be real issues with performance, and this will put off people who were thinking about buying the card.
Maybe the audience here is limited, but many people browse these forums and base their decisions on which card to buy next from reviewers, and comments made here.
I certainly will not be buying a GFFX, given the many uncertainties and the comments made here, and I will tell my friends who aren't tech savvy to skip this one.
True, if the software supports it (very likely), then it doesn't impact the company that much. However, folks not purchasing that product will be affected later on, as developers hone their applications towards other hardware.
Speng.
Chalnoth
24-Feb-2003, 06:13
I still want one, for the programmability (though I will probably be going for the NV31 or NV34, depending), and for the fact that the Radeon 9700 Pro still does not work under Linux for me.
Sharkfood
24-Feb-2003, 06:27
IMO, the main question is (beyond the "NVIDIA seems to mislead us" opinion)..... what will developers (those that sell, not those by the non-profit contributors here) come out with in terms of software?
Developers usually tend to cater towards the largest yields- i.e. target the widest possible market in order to reclaim the largest amount of sales.
From a consumer's standpoint, these kinds of discussions can help arm hardware buyers with knowledge concerning their hardware in ways that help them understand and better accommodate product shortcomings.
Accordingly, what NVIDIA has been publishing as specs/performance figures is unreliable at best.. with such excruciatingly specific conditions to reach advertised performance levels that consumers should have an understanding of why their shiny new GF FX may not deliver what is expected for an 8 pipeline videocard. This helps level expectations with pre-release hype and can only lead to higher satisfaction if consumers know exactly what they are getting, rather than assuming performance from unreliable or misleading specs.
At the end of the day, this can also help form tomorrow's user base for developers. If a particular product's shortcomings or methods are specialized as such to only deliver expected performance if specific guidelines, even texture layers and specific shader levels are used- you wind up with a more constrictive/binding user base which can only reduce the quality of tomorrow's games. It's overall better for all involved that discussions like these exist to better understand the way competing IHV's products perform, behave and under what conditions. This makes for better knowledged consumers and more even expectations with delivered performance in today's and tomorrow's games.
I hope folks get what I'm trying to say.
I'm not entirely sure I do, so apologies in advance if what I'm arguing with isn't what you're saying...
"Man, this is what the NV30 does... and this is what the R300 does" is a common developer comment.
Except that in this case, no one knows (outside of Nvidia) what the NV30 does. Usually--and in the pre-shader past in particular--the performance of a GPU can be pretty accurately estimated from the published information, but in NV30 we have a situation where the official information (thus far) is misleading and incomplete. It's a mystery! And a darn interesting one too, as it's now focusing on the inner workings of the part we have the least in-depth experience with, the PS 2.0+ pipeline.
As for the notion that developers will just target one architecture at the possible expense of another with their games, I don't think this is likely or even necessarily possible when it comes to shaders (at least DX shaders). Shader languages are very general and structured, with to-the-metal optimizations all occurring at runtime as the driver translates the DX "assembly" into machine code. To the degree that different shader architectures perform differently, it should be due to general characteristics like program length, instruction mix, use of branching, etc.
With the exception of using PS/VS 2.0+, I doubt it will really be possible for one developer to decide to write "more Nvidia-friendly shaders" and another "more ATI-friendly shaders"; and, furthermore, it should in principle be possible to get highly revealing information about shader performance in a huge variety of situations from just a few well-designed and well-understood shader benchmarks. This is not to say that Ilfirin's shader benchmark fits that description yet (at least I'd hope not, given the results!), although in the absence of discovering some possible conceptual error or perhaps a shortcut the R300 drivers can legally optimize away, it looks like it's the NV30 drivers to blame for the lopsided result as of now.
So maybe NV30's drivers aren't in good enough shape to tell us much about its shader pipeline just yet (perhaps someone should code up some shaders using NV_fragment_shader and see if that performs better?), but I think the answers, when we get them, will be extremely relevant both to what developers do with NV30 and actual future game performance. And Nvidia's certainly not providing any (reliable) answers.
And the card's in the hands of the guy most likely to get to the bottom (or at least a few levels deeper) of all this. What's not to like? :)
An aside to the interesting "discoveries".
These are all very interesting discussions and (attempts at) discoveries.
The bottom line, however, is that these are basically inconsequential, other than the fact that we know whether it's an 8x1 or 4x2 or whatever NVIDIA's drivers want us to believe.
A video card (and, in a broader sense, a company) is judged, in real terms or in sales-term, by what applications are used to "define" it. Not what a website/websites appear to discover.
"Man, this is what the NV30 does... and this is what the R300 does" is a common developer comment.
Which option he goes with is more important than what the truth is.
In short, this (thread) may be about what a hardware is truly capable of, based on the confusion introduced by initial official "information" but, IMO, the main question is (beyond the "NVIDIA seems to mislead us" opinion)..... what will developers (those that sell, not those by the non-profit contributors here) come out with in terms of software?
I hope folks get what I'm trying to say.
Hey, Rev! You know...I have to tell you that this is one of the most baffling posts I've ever seen out of you....yes, I guess I have to say I really don't get your point here. Probably because it's late right now, though. No doubt I'm misunderstanding something material you're saying...
But...it sounds like you're saying that it really doesn't matter what the core of a graphics processor is---that whether we're talking about a GF4MX or an nv30---the physical differences in the processors aren't important--that we need to wait on developers to write applications for them--to "define" them for us, and so everything we're trying to figure out here is a great big waste of time. Is that it?...;) *chuckle* (Please tell me I've got you all wrong on this...;))
Anyway--I've never been much for letting developers tell me what to buy--I haven't agreed with Carmack's recommendations for years, actually *chuckle* Basically because he and I look at things from entirely different perspectives...But talking about Carmack and especially what Carmack's written so far in Doom III with respect to the R300 and nv30 and all of his melodious comments on those two in his .plan updates...I have to say it sure looks to me as if Carmack is burning the midnight oil figuring out how he can accommodate the nv30 and the R300 and whatever else rolls around out there. So I have to say that often I think developers react to and build on the hardware that companies make, and so in a real sense I'd say the hardware defines the developer and not the other way around. But like I say it's late and I've probably misunderstood something.
I think though that most everybody involved in this topic--although I can surely only speak for myself--is having a bit of fun trying to figure out whether or not nVidia's been shooting straight with us (the planet) on just what the nv30 actually is. That's not such a terrible thing to do, is it?...;)
As such, I don't think the discussion is "inconsequential" at all. Let me ask you this: do you think that discussion and feedback around the world on the topic of nVidia's cooling idea for the nv30 Ultra products it envisioned turned out to be inconsequential? I don't. It wasn't necessary for developers to define the Dustbuster before people defined it for themselves, was it? So, I think it's perfectly OK to define a chip architecture prior to developers doing it for us, don't you? I can assure you that "developers" won't be defining this product line for me...;) (Among all of the other good reasons I might give as to why--I simply could never wait that long...*chuckle*)
Also, I think you might have slightly mischaracterized some of the more subtle expressions people have expressed about nVidia's forthrightness in enlightening us fully on the nv30 architecture. Again, I can't speak for anyone else, but for me it was never a question of nVidia "seeming" to mislead on specific points--but it's been a very sober experience for me to realize that all of the assumptions I've made myself on the product stemmed from nVidia's flat assertions that it was an 8x1 architecture--I won't demean myself by engaging in legalese pretensions like "Ah, but did they ever actually *say* it was 8x1?" because I'm satisfied that they have said precisely that sufficiently to convince me of it. I will not say I am convinced 100% that it is not an 8x1, yet, but I will say that the evidence I've seen certainly is leaning heavily in that direction.
From my point of view, personally, if this is true it explains some very muddled questions I've had about the nv30 architecture, and brings some blessed clarity to the situation. For one, it helps me understand why the nv30 has seemed so much slower per clock in many cases than R300. Rather than grasping at poorly defined ideas like "it's an inefficient architecture" the concept of the 4x2 organization provides some real solidity to the situation and helps me feel a bit more in control of my understanding of it.
Because I think whether the organization is 8x1 or 4x2 will materially affect the success of nv30 in the marketplace. Far from inconsequential, I think it will make--or at least has the potential to make--a very big difference in the success of the processor. And that is why nVidia is being evasive about it--because they know that to be the truth. At least that's my take on it--right or wrong. If the organization pans out as nVidia has represented it--as 8x1 physically--then I shall humbly silence myself on the topic from that point forward. It is no disgrace to follow an hypothesis which explains the facts--it is only a disgrace if one continues to do so after the hypothesis has been disproved, or one rejects it after it has been proven (IMO.)
Also, I have to say that what I think developers will "come out with in terms of software" will be slanted toward the APIs. We had a mild foretaste of that with 3D Mark 03 *chuckle* (Which is truly inconsequential to me--but a lot of people seem to like it.) Anyway, if what you are saying is that you think developers will dance merrily after nVidia no matter what--well, I cannot agree at all. Most developers are pro-API (which takes care of the hardware), and if they are not particularly as pro-API then they are pro-hardware on top of it--few of them are pro-corporation. Nor do I think most developers are blind, and just as they did with nVidia in years past so too will they do with other companies who make what they consider to be better hardware.
Last, if it turns out that a great many developers discover that the reason for the nv30's lack of per-clock performance comes from the 4x2 organization as opposed to the 8x1 that they too *might've been led to believe it was*, I can't imagine developers being very impressed by that, can you? *chuckle* I doubt they would be any more impressed by it than anyone responding in this thread so far.
Anyway, that's my take on it--and if you've been mischaracterized here the fault is entirely mine. Now, off to some needed pillow time!...;)
MDolenc
24-Feb-2003, 09:28
Ok for anyone interested: you can find a new version of my Tester here (http://www2.arnes.si/~mdolen/Tester.zip). I haven't put any real long shaders in it, since it's obvious that ps_2_x path is not yet optimised.
However I do test all the cases with color and z writes enabled/disabled.
Dave, can you give it another run?
Dave Baumann
24-Feb-2003, 09:31
It'll have to wait until this evening now. :)
demalion
24-Feb-2003, 09:59
Just a little note on clock-for-clock performance of the NV30:
First of all, single-texturing wouldn't be any better on an 8x1 architecture anyway, due to memory bandwidth limitations.
In situations where producing the output doesn't require several clock cycles.
So, here's what I see as the basic performance rundown (judging same clock speed, same memory bandwidth):
...
Those are the problems for applying textures. What about the problems associated with either less simultaneous outputs being calculated, or the severe stalls if it is indeed maintaining registers for 8 fragment output calculations? I thought that's what we were investigating now.
Reverend
24-Feb-2003, 11:02
I posted less than what I meant in my head, hence the misunderstandings.
"Man, this is what the NV30 does... and this is what the R300 does" is a common developer comment.
Which option he goes with is more important than what the truth is.
As for the notion that developers will just target one architecture at the possible expense of another with their games, I don't think this is likely or even necessarily possible when it comes to shaders (at least DX shaders).
You didn't quote me wrt your comments but I'll assume that's what you're referring to. No, that's not what I meant. By "option", I meant coding with and around one possibly-important limitation on one hardware that isn't a limitation on another. I do not see whether a hardware is 4x2 or 8x1 as terribly important at the moment, that's all.
But...it sounds like you're saying that it really doesn't matter what the core of a graphics processor is---that whether we're talking about a GF4MX or an nv30---the physical differences in the processors aren't important--that we need to wait on developers to write applications for them--to "define" them for us, and so everything we're trying to figure out here is a great big waste of time. Is that it?... *chuckle* (Please tell me I've got you all wrong on this...)
No, again, not what I meant. For sure I am interested to know if the NV30 is 4x2 or 8x1... but that is as far as my interest goes. There is nothing important, to me, to be gained from knowing which is which, other than the fact that an 8x1 may result in higher performance. What's important is whether I will change my code or my engine to take into account whether my content is rendered by a 4x2 or an 8x1... and my opinion is that I won't do such a thing. See below regarding probably the most important factor (performance).
So I have to say that often I think developers react to and build on the hardware that companies make, and so in a real sense I'd say the hardware defines the developer and not the other way around.
It depends on which particular developer you're talking about. Like Kristof once said "If Carmack tells IHVs to jump, they ask him How High?"... I'm sure IHVs have their own preferences when it comes to which developer they listen to. Some developers are "forward thinkers" (Carmack as a prime example) who define not only future hardware but also APIs; other developers are of the "this is the current hardware we have, we use this" kind.
As such, I don't think the discussion is "inconsequential" at all.
It is inconsequential to me in terms of whether a 4x2 or 8x1 will affect the way I code a program. My personal opinion, with my personal preferences, of course. It is not inconsequential when it comes to analyzing possible performance differences between the two (something which I'm sure Dave will attempt to explore in his articles and of which I support as far as the focus of this website goes).
Last, if it turns out that a great many developers discover that the reason for the nv30's lack of per-clock performance comes from the 4x2 organization as opposed to the 8x1 that they too *might've been led to believe it was*, I can't imagine developers being very impressed by that, can you? *chuckle* I doubt they would be any more impressed by it than anyone responding in this thread so far.
This is the crux of the matter - performance is the bottomline. For me, and I would imagine, for developers too. Again, I fail to see any real significance between a 4x2 or 8x1 architecture. I know there are scenarios where the difference may show up... but my point is that it doesn't really matter much. Will I waste my time trying to optimize for both architectures or compensate one for the other? No, not me.
Unless someone can give me scenarios where a 4x2 architecture is inferior to 8x1 in an important and relevant way in terms of a programmer having to "manipulate" things in order to give us the same content, I hope that this isn't blown out of proportion. Sure, the difference will most likely be in terms of performance... that's a big "Duh!" itself. What else is new?
I probably should end with asking you guys to ask various developers whether a 4x2 or 8x1 will affect the way they make their engines or the way they code their programs (no, I don't mean guys coming out with synthetic benchmarks that clearly illustrate the differences between a 4x2 and 8x1, which is easy to do). I'm fairly sure they'll say they don't care... my point is simply this, that it is not worth their time to care.
Lastly,
Because I think whether the organization is 8x1 or 4x2 will materially affect the success of nv30 in the marketplace.
I disagree... unless this is blown out of proportion in public forums and, on learning of it from such forums, people go "Hey, NVIDIA said my video card is a 8x1-whatever and it's not... lying bastards... I'm not buying this!". Such things have never affected/determined the success or failure of various chips before. I am personally uncomfortable with the fact that a few media sites have chosen to specifically publicize this "4x2 or 8x1" confusion without any kind of analysis about what this means.... but that is due to my own reasoning.
Last post on this topic.
Dave Baumann
24-Feb-2003, 11:18
What they are telling does make a difference to end users though.
However, there are other questions that need answering. It does appear that they are using their multiple Z units per pipe for optimised stencil rendering; when stencils are in use, two of the Z units are checking the stencil Z values while the other two are writing the stencil values. Now, there is a question of what will happen under MSAA if this is occurring.
mboeller
24-Feb-2003, 11:34
This is the crux of the matter - performance is the bottomline. For me, and I would imagine, for developers too. Again, I fail to see any real significance between a 4x2 or 8x1 architecture. I know there are scenarios where the difference may show up... but my point is that it doesn't really matter much. Will I waste my time trying to optimize for both architectures or compensate one for the other? No, not me.
IMHO, performance really is the bottom line. Not the difference between 4x2 and 8x1 or the fillrate, but the speed of the pixel shaders. At the moment PS2.0 is far slower on an NV30 than on an R300 chip. So in the end the slow performance of the NV30 and, most importantly, the (very likely) even slower performance of the NV31 and NV34 mainstream parts will hinder developers from using PS2.0 fully, despite the possibility of longer shaders.
Ilfirin
24-Feb-2003, 11:39
Well, the weekend is officially over for me (with a whopping 4 hours of sleep total), so there won't be any more modifications or anything to the benchmark.
If you need any other benchmarks for very specific features like this in the future (ex: stencil performance), feel free to ask. As long as it's something interesting and can be done in a weekend I will probably do it.
pocketmoon_
24-Feb-2003, 12:10
Fillrate Tester
--------------------------
Display adapter: NVIDIA Quadro FX 2000
Driver version: 6.14.1.4290
Display mode: 1024x768x32bpp
--------------------------
Color writes enabled, z-writes enabled:
FFP - Pure fillrate - 1528.705200M pixels/sec
FFP - Single texture - 1236.259644M pixels/sec
FFP - Dual texture - 991.728027M pixels/sec
FFP - Triple texture - 586.592285M pixels/sec
FFP - Quad texture - 557.524841M pixels/sec
PS_2_0 - Per pixel lighting - 64.196724M pixels/sec
PS_2_0 PP - Per pixel lighting - 64.197914M pixels/sec
PS_1_1 - Simple - 780.622070M pixels/sec
PS_2_0 - Simple - 498.688599M pixels/sec
Color writes enabled, z-writes disabled:
FFP - Pure fillrate - 1523.353516M pixels/sec
FFP - Single texture - 1224.495850M pixels/sec
FFP - Dual texture - 1015.775146M pixels/sec
FFP - Triple texture - 586.473450M pixels/sec
FFP - Quad texture - 557.536377M pixels/sec
PS_2_0 - Per pixel lighting - 64.197647M pixels/sec
PS_2_0 PP - Per pixel lighting - 64.205772M pixels/sec
PS_1_1 - Simple - 780.526428M pixels/sec
PS_2_0 - Simple - 498.690155M pixels/sec
Color writes disabled, z-writes enabled:
FFP - Pure fillrate - 2909.496338M pixels/sec
FFP - Single texture - 2907.544434M pixels/sec
FFP - Dual texture - 2906.712646M pixels/sec
FFP - Triple texture - 2907.811523M pixels/sec
FFP - Quad texture - 2813.688232M pixels/sec
PS_2_0 - Per pixel lighting - 1350.793213M pixels/sec
PS_2_0 PP - Per pixel lighting - 1795.684937M pixels/sec
PS_1_1 - Simple - 2907.509277M pixels/sec
PS_2_0 - Simple - 2907.678955M pixels/sec
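To relate these fillrates to the 8x1 versus 4x2 question, they can be divided by the core clock to get pixels (or z-only "zixels") per clock. The sketch below is illustrative only: the 400 MHz figure is an assumption (the nominal Quadro FX 2000 core clock, which the post does not state), and `per_clock` is a made-up helper name.

```python
# Assumed core clock in MHz; the post does not state the clock the card ran at.
CORE_CLOCK_MHZ = 400.0

# Selected results from the run above, in Mpixels/sec.
results = {
    "colour+z, pure fillrate": 1528.705200,
    "colour+z, single texture": 1236.259644,
    "colour+z, dual texture": 991.728027,
    "z only, pure fillrate": 2909.496338,
    "z only, quad texture": 2813.688232,
}

def per_clock(mpixels_per_sec: float, clock_mhz: float = CORE_CLOCK_MHZ) -> float:
    """Convert a fillrate in Mpixels/sec to pixels per core clock."""
    return mpixels_per_sec / clock_mhz

for name, rate in results.items():
    print(f"{name}: {per_clock(rate):.2f} per clock")
```

If the clock assumption holds, colour writes stay just under 4 per clock while z-only writes land between 7 and 8 per clock, which is consistent with a part that outputs 4 colour pixels but up to 8 z/stencil values per clock.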
binmaze
24-Feb-2003, 12:19
What a monstrous raw power of zixel! :lol:
Seriously, is there any good of zixel?
I can't think of any good instance to use that...any?
What a monstrous raw power of zixel! :lol:
Seriously, is there any good of zixel?
I can't think of any good instance to use that...any?
First pass setup (ALA Doom3), render all geometry to just the Z then render all geometry again with Z equal testing and pixel shaders running.
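Why a Z-only first pass pays off can be shown with a toy software model. This is a sketch under stated assumptions, not how Doom 3 or any driver implements it: each pixel is just a list of fragment depths in draw order, early Z rejection is assumed to be free, and `shade_count` and the scene parameters are hypothetical.

```python
import random

def shade_count(fragments, z_prepass: bool) -> int:
    """Count pixel shader runs for one pixel given its fragment depths in draw order."""
    if z_prepass:
        # Pass 1: colour writes off, only the nearest depth is kept.
        nearest = min(fragments)
        # Pass 2: Z-equal test, so only fragments at the stored depth are shaded.
        return sum(1 for z in fragments if z == nearest)
    # Single pass with a less-than Z test: every fragment nearer than what is
    # already stored gets shaded, and may be overwritten by a later one.
    shaded, stored = 0, float("inf")
    for z in fragments:
        if z < stored:
            shaded += 1
            stored = z
    return shaded

random.seed(0)
# Hypothetical scene: 1000 pixels, each covered by 5 distinct surfaces in random order.
scene = [random.sample(range(100), 5) for _ in range(1000)]

without = sum(shade_count(px, z_prepass=False) for px in scene)
with_prepass = sum(shade_count(px, z_prepass=True) for px in scene)
print("shader runs without pre-pass:", without)
print("shader runs with pre-pass:", with_prepass)
```

With the pre-pass, each pixel is shaded exactly once, at the cost of rasterising the geometry twice; that extra pass is where a high z-only fillrate helps.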
Regarding what Reverend was saying about devs, he's absolutely right: how it gets its performance is largely irrelevant, just the bottom line matters (i.e. I wouldn't care if it was 3x6 as long as it runs pixel shaders well).
What a monstrous raw power of zixel! :lol:
Seriously, is there any good of zixel?
I can't think of any good instance to use that...any?
Doom III.
Edit:
I should reload more frequently ...
In any case, the results for the pixel shader per-pixel lighting test with color writes disabled seem a bit odd. Why doesn't it have the same fillrate as all the other tests?
Dave Baumann
24-Feb-2003, 12:56
Question is: Can ATI run 3 Z's & 3 Stencils per pipe per clock? Their Z and Stencil fillrate would be 7.8G if they could... :shock:
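One reading that reproduces the 7.8G figure is sketched below. The pipeline count and clock are assumptions not stated in the thread (8 pixel pipelines and a 325 MHz core, the nominal Radeon 9700 Pro figures), and the variable names are illustrative.

```python
# Assumptions (not stated in the thread): 8 pixel pipelines and a 325 MHz core
# clock, the nominal figures for a Radeon 9700 Pro.
PIPES = 8
CORE_CLOCK_MHZ = 325
OPS_PER_PIPE_PER_CLOCK = 3  # the "3 Z's & 3 Stencils per pipe" from the post

rate_mops = PIPES * OPS_PER_PIPE_PER_CLOCK * CORE_CLOCK_MHZ
print(f"{rate_mops / 1000:.1f} G/sec")  # 8 * 3 * 325 = 7800 M = 7.8 G
```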
binmaze
24-Feb-2003, 12:58
Then it'd improve the final fps of Doom3? Color writes never get used at all after whole process?
My test results (on a 297/270 Radeon 9700):
Stencil disabled:
Render without texture (equivalent fillrate):
Back to front: 1,383Mpix/s
Front to back: 12,904Mpix/s
With texture:
Back to front: 649Mpix/s
Front to back: 10,472Mpix/s
Stencil enabled:
Back to front w/o texture: 1,884Mpix/s
Back to front w/ texture: 1,826Mpix/s
So it looks like only one stencil test unit for each pixel pipeline.
Dave Baumann
24-Feb-2003, 13:23
pcchen - what test is that? I'm losing track here!
Oh, that's my own Z test benchmark program, used for testing early Z and fast Z. It renders 100 quads from back to front and front to back, with and without texture, for 100 times.
With stencil enabled, the quads are rendered from back to front (so early Z/fast Z has no effect) but stencil test rejects most pixels, so it can be used to test fast stencil.
Dave Baumann
24-Feb-2003, 13:29
Have you got a link to that, please?
I want to see if MSAA affects Z/Stencil rendering or not (although, bandwidth may come into play here).
Chalnoth
24-Feb-2003, 13:35
In situations where producing the output doesn't require several clock cycles.
Right. There would be a problem for a shader that doesn't bother with the z-buffer and uses no more than one texture or PS op. I don't really see why this matters, as I doubt this is done very often.
Those are the problems for applying textures. What about the problems associated with either less simultaneous outputs being calculated, or the severe stalls if it is indeed maintaining registers for 8 fragment output calculations? I thought that's what we were investigating now.
From what I've seen quoted, it looks more and more like only 4 pixels are being run through the pipeline at the same time. According to nVidia, the chip is capable of a total of 8 "128-bit" PS ops per clock. I find it most likely that this means that 32-bit FP instructions are sent through in groups of two instructions, and 16-bit FP instructions are sent through in groups of four.
Regardless, there are huge problems with current drivers. Unless the FX has an unbelievably-bad design flaw in its pixel shader unit (which is unlikely, given that it seems to do fairly well using the NV30 OpenGL extensions, from what little info we have), it's almost pointless to look at the FX's PS 2.0 shader performance right now.
Colourless
24-Feb-2003, 13:37
I don't think pcchen's program is what you want. You want to just be doing Z/Stencil rendering, but with colour writes disabled?
Chalnoth
24-Feb-2003, 13:38
Stencil enabled:
Back to front w/o texture: 1,884Mpix/s
Back to front w/ texture: 1,826Mpix/s
So it looks like only one stencil test unit for each pixel pipeline.
Quick question:
Did you clear the z-buffer and stencil buffer at the same time?
(Update: I'm currently wondering whether it's actually possible to do stencil shadow volume rendering with more than one light source while clearing the z-buffer and stencil-buffer at the same time)
Quick question:
Did you clear the z-buffer and stencil buffer at the same time?
Yes. Buffer clearing does not affect very much, though.
(Update: I'm currently wondering whether it's actually possible to do stencil shadow volume rendering with more than one light source while clearing the z-buffer and stencil-buffer at the same time)
I don't think it is possible.
EDIT: The test program is here (http://www.csie.ntu.edu.tw/~r89004/ztest.zip).
Chalnoth
24-Feb-2003, 13:55
Just in case you didn't get the reference, I was asking the question in regards to ATI's own docs that stated parts of Hyper-Z (Hierarchical-Z, I believe it was) would be disabled if the stencil test was enabled, but the buffers weren't cleared at the same time.
Yes, hierarchical Z is disabled when stencil test is enabled. It is possible to do "hierarchical stencil" but is considerably more complex.
I think I found something interesting: when I enabled 4X FSAA, the stencil test rate only dropped to about 960Mpix/s. With 6X FSAA the stencil test rate is 563Mpix/s. And with 2X FSAA the stencil test rate is 1,633Mpix/s! If my test results are correct, it may suggest that Radeon 9700 has two stencil test units per pixel pipeline when MSAA is enabled?
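The reasoning behind the "two stencil test units" guess can be made explicit by converting pixel rates into stencil-tested samples per pipeline per clock. This is a rough sketch: it assumes the FSAA runs used the same 297 MHz Radeon 9700 as the earlier results, that the part has 8 pixel pipelines, and that every MSAA sample is stencil tested; the helper name is made up.

```python
CORE_CLOCK_MHZ = 297.0  # core clock of the 297/270 Radeon 9700 quoted earlier
PIPES = 8               # assumption: the R300's eight pixel pipelines

# Stencil test rates in Mpix/s, keyed by MSAA samples per pixel (1 = no FSAA).
measured = {1: 1884.0, 2: 1633.0, 4: 960.0, 6: 563.0}

def samples_per_pipe_per_clock(samples: int, mpix_per_sec: float) -> float:
    """Stencil-tested samples per pipeline per clock."""
    return samples * mpix_per_sec / CORE_CLOCK_MHZ / PIPES

for samples, rate in measured.items():
    print(f"{samples}x: {samples_per_pipe_per_clock(samples, rate):.2f}")
```

Under those assumptions the no-FSAA case stays below one sample per pipe per clock, while all three FSAA modes land between one and two, which is what a second stencil unit per pipeline would allow.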
antlers
24-Feb-2003, 14:14
If you are doing stencil testing for shadow volumes, why would you want it to be per-sample for MSAA instead of per-pixel? In other words, what's the advantage of multiple stencil-test units per pixel?
If you are doing stencil testing for shadow volumes, why would you want it to be per-sample for MSAA instead of per-pixel? In other words, what's the advantage of multiple stencil-test units per pixel?
Anti-aliasing? :wink:
Joe DeFuria
24-Feb-2003, 14:53
I think I found something interesting: when I enabled 4X FSAA, the stencil test rate only dropped to about 960Mpix/s. With 6X FSAA the stencil test rate is 563Mpix/s. And with 2X FSAA the stencil test rate is 1,633Mpix/s! If my test results are correct, it may suggest that Radeon 9700 has two stencil test units per pixel pipeline when MSAA is enabled?
This is rather interesting.
It might have effectively 2 stencil test units per pipeline, full stop.
Thus, the 9700 may be acting as a 16x0 pipeline, when being limited to stencil ops only, much like NV30 acts as an 8x0 pipeline. (Is that stencil fill-rate test publicly available to test the 9700 on?)
The link to AA might be the following:
We know there are a total of "6" MSAA units per pipe. (Or so ATI tells us.) Perhaps 4 of these MSAA units can combine to yield (effectively) 2 Stencil units? (I think Dave is postulating similar behavior with the FX wrt z/MSAA units "combining" to perform stencil ops?)
I don't know enough technically about what the "required" capabilities are between a "Stencil" and a "z/MSAA" unit. So I could be completely off base here. But if what I'm saying is physically feasible, that would explain pcchen's findings with the 9700?
* 6X MSAA can be done per clock, per pipe, in absence of any stencil ops.
* Max of 2 stencil ops per pipe, regardless of MSAA level.
* When stenciling, a max of 2X MSAA per pipe (per clock). Additional MSAA levels will require an additional clock/pass. (6X MSAA requires 3 passes, 4X requires 2.)
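The three points above amount to a simple model that can be checked against pcchen's numbers. The sketch below assumes the no-FSAA stencil rate of 1,884 Mpix/s as the one-pass baseline and divides it by the number of passes the hypothesis implies; `predicted_rate` is an illustrative name, not anything from the test program.

```python
from math import ceil

BASE_RATE = 1884.0           # Mpix/s, stencil test without FSAA (pcchen's result)
SAMPLES_PER_PIPE_CLOCK = 2   # the hypothesised limit while stenciling

def predicted_rate(msaa_samples: int) -> float:
    """Stencil rate if each pipe handles at most 2 samples per clock."""
    passes = ceil(msaa_samples / SAMPLES_PER_PIPE_CLOCK)
    return BASE_RATE / passes

# Measured stencil test rates with FSAA, in Mpix/s.
measured = {2: 1633.0, 4: 960.0, 6: 563.0}
for samples, actual in measured.items():
    print(f"{samples}X: predicted {predicted_rate(samples):.0f}, measured {actual:.0f}")
```

The model predicts 1884, 942 and 628 Mpix/s for 2X, 4X and 6X, against measured values of 1633, 960 and 563, so the hypothesis is at least roughly in line with the data; bandwidth effects are not modelled.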
alexsok
24-Feb-2003, 15:10
Well, this is rather interesting...
1) NV30 works as a 4x2 arrangement 90% of the time, and there is confirmation from NVIDIA about it.
2) In D3D, the precision is automatic, and is represented by F16 for most operations, except certain operations with texture coordinates, where it's F32.
This doesn't contradict the D3D specs, where the minimum precision is F16 and on operations with texture coordinates, F24 is needed, or like NVIDIA offers, F32.
In OpenGL, you can already force either F16 or F32, but it's still uncertain whether they'll do the same with D3D.
By the way, although the Radeon 9700 is not particularly fast at stencil testing, it still has some optimizations. For example, if there is no stencil op for stencil test fail and Z test fail, the Z test can be done before the stencil test and hierarchical Z will still be enabled. This should be helpful for volumetric shadows.
And about NV30, IIRC GF3/GF4 has four Z test units for each pixel pipeline. When FSAA is not enabled, stencil test is not fast (only one stencil test per pixel pipeline, just like R300). I didn't test with FSAA enabled and I don't have a GF4 right now to test. I don't know about NV30. It would be interesting to see its result.
Joe DeFuria
24-Feb-2003, 15:21
Could somebody (pcchen?) run MDolenc's Fill-rate tester on the Radeon 9700, and post results?
http://www2.arnes.si/~mdolen/Tester.zip