X800 vs GF6800 fp texture/buffer performance

PeterT
The following results are from a simple benchmark that alternately renders to/from an INT8 RGBA pBuffer/texture with a short GLSL fragment program.
The cards used are a 6800 Ultra and an X800 Pro on otherwise nearly identical systems. The general validity of these results has been confirmed by tests on various other platforms.
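For illustration, here is a minimal sketch of the kind of shader involved (not the exact benchmark code; the names and constants are made up): one independent and four dependent point-sampled reads from a single RGBA texture, plus some simple math per pixel.

uniform sampler2D src;   // the texture/pBuffer being ping-ponged
uniform vec2 delta;      // illustrative offset constant

void main()
{
    vec4 base = texture2D(src, gl_TexCoord[0].st);   // 1 independent read
    vec4 a = texture2D(src, base.xy);                // 4 dependent reads
    vec4 b = texture2D(src, base.zw);
    vec4 c = texture2D(src, a.xy + delta);
    vec4 d = texture2D(src, b.xy + delta);
    gl_FragColor = (a + b) * 0.25 + c * d;           // some simple math
}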

6800_x800_int.png

6800_x800_fp16.png

6800_x800_fp32.png


Some points are noteworthy:
  • The NV drivers are CPU limited at a much higher it/s rate, suggesting far lower CPU/memory/non-graphics related overhead. Good job!
  • Using 8 bit int buffers, the GF68 is about 1.8 times as fast on non-cpu limited runs.
  • When switching to fp16, the x800 performance remains nearly unchanged, while the 68 drops dramatically and consistently to less than half the it/s.
  • At fp32, the x800 sees a small dip to ~85%. The 6800 again drops to less than 50%.
I admit to being a bit surprised by these results. Some possible explanations for the behaviour:
  • ATI drivers suck at int8, and do not perform nearly as well as possible. (Highly unlikely, as INT8 is the "game" case)
  • The GF68 is totally memory bandwidth limited while the x800 is mostly gpu limited. (Very unlikely as the workload and specs are similar, and still would not explain the >2x drop)
  • NV architecture/drivers suck at sampling from FP textures. (Perhaps, but why the extreme performance drop?)
  • NV architecture/drivers slow down immensely when rendering to an FP buffer. (Perhaps, and a bit disappointing considering all the cinematographic rendering talk)
  • My benchmark (unintentionally, I assure you) hits an isolated performance "low spot" in the NV drivers. (Possible, but then why are the INT8 tests so fast?)
  • A combination of all the above (to varying degrees).
Does anyone else have any explanations to offer? I hope that switching to FBOs will solve the problem if it is specific to my implementation. It should certainly improve performance in CPU-bound situations.

Here are two more graphs showing the vast difference in behaviour:
6800_int-fp16-fp32.png

x800_int-fp16-fp32.png
 
Although I would guess it would affect the results in a fairly small way, other test results have suggested that the Pros are not as effective with their memory bus as the 16-pipe variants.

Is this from the app you had the other day?
 
DaveBaumann said:
Although I would guess it would affect the results in a fairly small way, other test results have suggested that the Pros are not as effective with their memory bus as the 16-pipe variants.
That can't be it, as an XT PE run produced the expected results. In fact, it looks like the memory limitations at fp32 affect the XT PE more strongly, as would be expected from looking at the basic specs:
x800_xtpe_intfp.png

DaveBaumann said:
Is this from the app you had the other day?
Yes. I updated ORC to support PNG export.

Btw, thanks for the chip and board charts; I referenced those in my first report.
 
When reviewing the X800s a year ago, I noticed that the X800 Pro couldn't reach its maximum throughput in some fillrate tests (with a single-texturing shader or a more complex one), even with very small textures. The X800 XT PE didn't have that problem. I asked ATI about the issue and here is the explanation they gave me:

The XT version, having 16 pipes as opposed to the Pro version's 12 pipes, will altogether have access to more caches within the HyperZ and Pixel Shader structure than the Pro version does. This means that a 16 pipe XT version can hide latencies more efficiently than the 12 pipe Pro version.
Texture fetches, for example, introduce a great deal of latency, as it can take a relatively long time to fetch a texture from graphics memory. If the pipes don't have anything else to do, i.e. if we can't keep them busy doing other calculations, then they will run idle and efficiency will drop.
The same applies to Pixel Shader operations, of course: a 16 pipe version will be better at hiding these latencies than a 12 pipe version.

Also, as you may remember from our press briefings, we are able to program the memory controller and its access patterns depending on the board configuration. This means that a 12 pipe version has a different memory access pattern than a 16 pipe version.
Determining the best access pattern for a given board configuration is a learning process for which we need empirical values. Obviously, our engineers are in the process of determining these values, and right now there might be cases where there is a much larger discrepancy between these boards than you would expect. As we are mostly looking at real-world scenarios, i.e. games, these large discrepancies will most likely show up in synthetic benchmarks such as Pixel Shader or fillrate tests, where only certain aspects of the chip are stressed.
You can expect that over the course of time, memory access patterns for both versions will improve with newer driver versions and the differences will shrink.
However, there will always be cases which don't suit one or the other access pattern, especially when using synthetic benchmarks or other unusual cases (such as the very small textures seen in your fillrate test).
 
Very interesting stuff (and straight from the source, too!). I always wondered whether those non-power-of-2 pipeline-count architectures could be 100% efficient. However, I can't really reproduce this problem in any of my "real world" tests, and even the very simple theoretical ones only show Pro scores slightly lower than what the XT numbers would extrapolate to.

Anyway, all that still doesn't explain the horrible performance drop of the NV40 cards with higher accuracy buffers.
 
I wonder what you're doing exactly in this test. How many cycles does the shader take? NV40 can output 16 INT8 pixels, 8 (11) FP16 and 4 (5.5) FP32 per clock (the ROPs run at the 550MHz memory clock). It might be interesting to see how a vanilla 6800 fares, as it has fewer pipelines but the same number of ROPs.
So NV40 results are quite in line with what's to be expected.
IIRC, due to some bandwidth and caching issues, NVidia recommends using two two-channel FP render targets instead of one four-channel target. Maybe you should give that a try.
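A rough, untested sketch of that split with ARB_draw_buffers (two two-channel float targets bound as draw buffers; the sampling here is just a stand-in for the real shader work):

uniform sampler2D src;

void main()
{
    vec4 result = texture2D(src, gl_TexCoord[0].st); // stand-in for the actual math
    gl_FragData[0] = vec4(result.rg, 0.0, 0.0);      // first RG float target
    gl_FragData[1] = vec4(result.ba, 0.0, 0.0);      // second RG float target
}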
 
Xmas said:
I wonder what you're doing exactly in this test. How many cycles does the shader take? NV40 can output 16 INT8 pixels, 8 (11) FP16 and 4 (5.5) FP32 per clock (the ROPs run at the 550MHz memory clock). It might be interesting to see how a vanilla 6800 fares, as it has fewer pipelines but the same number of ROPs.
I didn't think this shader would be so completely output-limited, as it does 4 dependent and 1 independent texture reads (from one RGBA texture, point-sampled of course) plus some simple math, but the connection seems obvious from your explanation. Somehow I forgot that the output pixel rate drops with increasing destination precision.
IIRC, due to some bandwidth and caching issues, NVidia recommends using two two-channel FP render targets instead of one four-channel target. Maybe you should give that a try.
I will, but only after implementing FBOs.

Thanks for your informative post. I guess the lesson for everyone doing non-rendering, non-branching, FP-targets-required stuff (is there anyone besides me? ;)) is to expect better performance from ATI for now.

[edit]
Just noticed your location - you could look at this to get a better idea of what I'm doing. The really interesting sections are "Anwendung" and "Bonus". The slides are from a presentation I gave yesterday, so they're brand new.
 
Alstrong said:
Why do the ATi cards appear consistent for a lot of...things?
In this case, the "consistency" is actually a bit strange in itself, as some slowdown at higher precision levels is to be expected. (Though of course not to the extent seen on the NV40-based cards.)
 
PeterT said:
Xmas said:
I wonder what you're doing exactly in this test. How many cycles does the shader take? NV40 can output 16 INT8 pixels, 8 (11) FP16 and 4 (5.5) FP32 per clock (the ROPs run at the 550MHz memory clock). It might be interesting to see how a vanilla 6800 fares, as it has fewer pipelines but the same number of ROPs.
I didn't think this shader would be so completely output-limited, as it does 4 dependent and 1 independent texture reads (from one RGBA texture, point-sampled of course) plus some simple math, but the connection seems obvious from your explanation. Somehow I forgot that the output pixel rate drops with increasing destination precision.
IIRC, due to some bandwidth and caching issues, NVidia recommends using two two-channel FP render targets instead of one four-channel target. Maybe you should give that a try.
I will, but only after implementing FBOs.

Thanks for your informative post. I guess the lesson for everyone doing non-rendering, non-branching, FP-targets-required stuff (is there anyone besides me? ;)) is to expect better performance from ATI for now.

ATI has the same output limitations.

It also applies to texture reads.
INT8: 1 cycle
FP16: 2 cycles
FP32: 4 cycles

If you're doing 4 FP32 dependent texture reads, the maximum number of iterations with a 512x512 buffer is 1525 with the GeForce 6800U and 1358 with the X800 Pro. The X800 Pro is close to that number. So it seems that output is not the bottleneck, and it should be the same with the 6800U. There is a problem.
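(The arithmetic, assuming stock clocks of 400MHz for the 6800U and 475MHz for the X800 Pro:
512 x 512 = 262,144 pixels per iteration
4 FP32 reads x 4 cycles = 16 sampling cycles per pixel
6800U: 16 pipes x 400MHz / 16 cycles / 262,144 pixels ≈ 1525 it/s
X800 Pro: 12 pipes x 475MHz / 16 cycles / 262,144 pixels ≈ 1358 it/s)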

Maybe you could try doing the same test but sampling from another buffer/texture (so you'd be able to sample from INT and output to FP32). That way you'd know whether the problem comes from the texture sampling or from the output.
 
Tridam said:
If you're doing 4 FP32 dependent texture reads, the maximum number of iterations with a 512x512 buffer is 1525 with the GeForce 6800U and 1358 with the X800 Pro. The X800 Pro is close to that number.
In fact it manages 1392 iterations according to the benchmark. Even allowing for some error, that would be amazing efficiency, so perhaps your numbers are a bit too low. (Or was the card overclocked?)

Tridam said:
So it seems that output is not the bottleneck, and it should be the same with the 6800U. There is a problem. Maybe you could try doing the same test but sampling from another buffer/texture (so you'd be able to sample from INT and output to FP32). That way you'd know whether the problem comes from the texture sampling or from the output.
Yes, I also thought about that, and it may indeed be the best way to finally shed some light on these results. I think I'll add such a test to the benchmark, and another one that samples from FP and renders to INT. Then the culprit should become clear. I should be able to release a new version of the benchmark some time tomorrow.
 
Is an iteration a full update of the buffer?

I looked at the shader and there are actually 5 texture reads, so the results should be even lower.


[edit]
Is the alpha channel sampled?

If not, FP32 sampling needs 3 cycles instead of 4, so the 5 texture reads need 15 cycles instead of the 16 cycles for 4 four-channel reads. The max number of iterations then becomes 1449 for the X800 Pro and 1627 for the 6800U.

I just tried the benchmark with an X800XT, a 6600GT and a 6800GT. I ran the standard benchmark and an edited version with a reduced number of texture reads. On the 6800GT/6600GT, the number of iterations increased in proportion to the decrease in texture reads. With the X800XT it also increased, but less than proportionally. So sampling is the bottleneck.
 
Does anyone know if NV supports swizzling of FP textures? If they don't, that would explain the extreme performance differential.

NV's texture cache is heavily optimised for swizzled textures. Also, last time I checked it was smaller than the ATI cache, and this could have a significant impact as source texture sizes increase.
 
PeterT said:
I didn't think this shader would be so completely output-limited, as it does 4 dependent and 1 independent texture reads (from one RGBA texture, point-sampled of course) plus some simple math, but the connection seems obvious from your explanation. Somehow I forgot that the output pixel rate drops with increasing destination precision.
NV4x seems to have some serious issues using more than 32 bits per pixel/texel.
streamthroughput.png

http://graphics.stanford.edu/~yoel/gpubench/6800Ultra-7580/

[edit]
Just noticed your location - you could look at this to get a better idea of what I'm doing. The really interesting sections are "Anwendung" and "Bonus". The slides are from a presentation I gave yesterday, so they're brand new.
Interesting read, thanks. What kind of research work is this?


Tridam said:
I just tried the benchmark with an X800XT, a 6600GT and a 6800GT. I ran the standard benchmark and an edited version with a reduced number of texture reads. On the 6800GT/6600GT, the number of iterations increased in proportion to the decrease in texture reads. With the X800XT it also increased, but less than proportionally. So sampling is the bottleneck.
Might this be related to NV4x not being able to hide texture fetch costs with arithmetic instructions, while R3xx/4xx can?
 
No offense,

I have a question: why compare two cards that aren't in the same class, using results from the NV Ultra against the X800 Pro? I'm nowhere near understanding most of the technology behind your testing or methods, but in the end, why not use the 16-pipe version of the ATI card for the comparison? I would think the results you're looking for could be narrowed down by using products in the same category.

I think Xmas has the right idea when he suggests using a vanilla 6800 in your testbed, as it might show a more definitive part of what you are looking for. I only have a limited understanding of the technology, but it seems like it would be worth a look.

My guess (and I admit it is a very "uneducated" one, but I'm learning :oops: ) would point towards caching issues; the ATI quote Tridam posted could help narrow down what you are looking for.


 
Xmas said:
Tridam said:
I just tried the benchmark with an X800XT, a 6600GT and a 6800GT. I ran the standard benchmark and an edited version with a reduced number of texture reads. On the 6800GT/6600GT, the number of iterations increased in proportion to the decrease in texture reads. With the X800XT it also increased, but less than proportionally. So sampling is the bottleneck.
Might this be related to NV4x not being able to hide texture fetch costs with arithmetic instructions, while R3xx/4xx can?

No.
The shader should compile to something like
TEX
ADD
TEX
ADD
some math (2-3 instructions)

TEX
ADD
TEX
ADD
some math (2-3 instructions)

TEX
MAD

So basically all the math instructions are free.
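(In numbers, taking the 3-cycle FP32 reads from above: the 5 TEX instructions alone account for ~15 cycles per pixel, against only eight or so math instructions, so an architecture that can overlap math with sampling gets all of them for free.)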
 
Tridam said:
So basically all the math instructions are free.
That's why I said "hiding texture fetch costs with math" ;)

Due to their fundamentally different architectures, NV4x cannot hide multi-cycle texture fetches with math instructions, while R3xx/R4xx can.

6800 Ultra / X800XT

Notice how more instructions hide the texture fetch cost completely on the X800XT, while each (multi-cycle) fetch adds a constant cost on NV4x.
 
It doesn't make sense :?
You mean that if an FP32 texture read needs 3-4 cycles, you can't do math during these cycles???

FP32 texture reads could be cut into 3-4 single-component texture reads.
 
Tridam said:
It doesn't make sense :?
You mean that if an FP32 texture read needs 3-4 cycles, you can't do math during these cycles???
Basically, yes, that's what I'm saying. Well, you can do math, but only one cycle's worth of it.

Just think about it as one pass through the SU0-TMU-SU1 pipeline. If the TMU takes more than one cycle (throughput), the SUs have to wait.
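Schematically, for one pass with a 4-cycle FP32 fetch (my reading of the throughput situation):

cycle 1: SU0 math  | TMU fetch (1/4) | SU1 math
cycle 2: SU0 waits | TMU fetch (2/4) | SU1 waits
cycle 3: SU0 waits | TMU fetch (3/4) | SU1 waits
cycle 4: SU0 waits | TMU fetch (4/4) | SU1 waits

Each multi-cycle fetch leaves room for only one SU0 and one SU1 instruction; anything beyond that adds cycles on top.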

FP32 texture reads could be cut into 3-4 single-component texture reads.
Maybe, probably not. But it's not the same. And it requires a shader recompile.
 