R3/4XX appear to have far better caching than NV chips

Simon F said:
GraphixViolence said:
My guess would be that no one wants to get caught looking like they have smaller caches than the competition, even though the optimal size is highly design-dependent.
Ah! The usual story of "it's not the size that matters but what you do with it".

Uhhhhmmm..... cliché excuse (and yes, for all occasions)... :LOL:
 
some comments from the paper authors

Both vendors' current generations of chips can provide one fp32 (fp24 on ATI) float per clock out of the closest texture cache. Just write a test that reads texel (0,0) over and over, and this number will be clear. A float4 read out of the cache therefore takes 4 clocks. But a MAD can consume 12 floats as inputs (in our case, 8 floats from memory plus 4 stored in a local float4 register), so for each MAD we do, it's going to take 8 clocks before we get the data, even if the texture cache were INFINITELY large. Using SSE2, you can grab data on a P4 at 128 bits/clock.
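The arithmetic above can be sketched in a few lines. This is purely illustrative, using only the per-clock figures stated in the post (1 float/clock from the GPU's texture cache, 128 bits/clock via SSE2 on a P4):

```python
# Back-of-the-envelope fetch cost per MAD, using the numbers from the post.
floats_per_clock_gpu = 1      # closest texture cache delivers 1 fp32 (fp24 on ATI)/clock
floats_per_mad_from_mem = 8   # two float4 operands fetched; the third is a local register

clocks_per_mad_gpu = floats_per_mad_from_mem / floats_per_clock_gpu
print(clocks_per_mad_gpu)     # -> 8.0 clocks of fetch per MAD, no matter how big the cache

# SSE2 on a P4: 128 bits/clock = 4 fp32 floats/clock
floats_per_clock_sse2 = 128 // 32
clocks_per_mad_cpu = floats_per_mad_from_mem / floats_per_clock_sse2
print(clocks_per_mad_cpu)     # -> 2.0 clocks to fetch the same 8 floats
```

So even with perfect caching, the GPU's fetch path is the bottleneck for this ratio of math to data.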

The point is that these simple algorithms for mat-mat multiply are already reading data at near this peak rate, thus NO algorithm could do all that much better. We tried a bunch of different variants of the algorithms discussed in the paper.

With MRT, you could do a 4x4 submatrix multiplication in a shader, yielding the best ratio of math to total texture fetches. However, in practice with current drivers, this doesn't seem to work as well as would be expected.
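Why a 4x4 submatrix gives the best math-to-fetch ratio can be seen with a simple count. A hedged sketch (scalar counts; the float4 packing on real hardware changes the constants but not the trend, and `ratio` is just an illustrative helper):

```python
# Math-to-fetch ratio for a k x k output tile of C = A @ B, inner dimension n.
def ratio(k, n=1024):
    fetches = 2 * k * n   # k rows of A plus k columns of B
    mads = k * k * n      # one MAD per (output element, inner step) pair
    return mads / fetches # simplifies to k / 2

print(ratio(1))  # -> 0.5 MADs per fetched float (one output per shader)
print(ratio(4))  # -> 2.0 MADs per fetched float (4x4 submatrix via MRT)
```

Each fetched value gets reused across the tile, so larger tiles amortize the fetch cost, up to the limits imposed by outputs and registers.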

GPU caches and memory systems are designed to stream data into the processing units. Matrix multiplication (and other numerical/scientific) algorithms don't exhibit this memory access pattern; they reuse data many times. On traditional systems, caches can be used to amplify bandwidth in this case, but GPU caches are designed to facilitate texture filtering, not bandwidth amplification. As a result, GPUs will be bandwidth starved in such cases when CPUs are not, and the efficiency of running the algorithm on a GPU will be very low.
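The "bandwidth amplification" a CPU cache provides can be sketched with a toy hit-rate model (the numbers below are made up for illustration, not measurements of any chip):

```python
# Toy model: with hit rate h on a cache that serves hits at full speed,
# the processing units see  mem_bw / (1 - h)  of effective bandwidth.
def effective_bw(mem_bw_gbs, hit_rate):
    return mem_bw_gbs / (1.0 - hit_rate)

print(effective_bw(30.0, 0.0))   # pure streaming: no amplification
print(effective_bw(30.0, 0.9))   # 90% of reads hit: ~10x effective bandwidth
```

A reuse-friendly access pattern turns a modest memory bus into much higher effective bandwidth; a streaming-oriented design never gets that multiplier.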

The point of our paper was to highlight this architectural issue.
 
kayvonf, welcome to B3D!

I'm surprised about the MRT problem. I would assume the cards could optimize this just as well. It may be that the memory access pattern of writing to 4 different buffers is not efficient, and neither IHV has figured out memory controller settings to optimize this rare case.

ATI, however, has put out a demo that uses MRT, so I'm a bit surprised. Did you try using only 2 output targets, i.e. 8 floats?
 
MRT on ATI

Bandwidth currently drops by 1/3 when using 2 MRT targets. It improves slightly with 3 and 4. ATI knows about this and hopefully will have the problem corrected soon. NV68xx MRT output bandwidth is just fine for 1-4 outputs. However, doing a 4x4 submatrix multiplication on NV hardware means you need a bunch of registers. That register pressure is going to cause performance problems.
 
Nice of you to pop in, Kayvon. So the X800 doesn't experience (as much) register pressure as the 6800 with a 4x4 submatrix mul?
 
Reverend said:
Apologies if this is going OT but talking about on-die caches -- why is this such a tightly kept secret/detail by the IHVs?
Well, last I heard, NVIDIA has much bigger caches than ATI. On the other hand, ATI doesn't count cache as transistors. So:
1) NVIDIA doesn't want the public to realize they don't really have that many more transistors than ATI (AFAIK the difference, if you exclude cache, is smaller, though of course NVIDIA also has more cache, so it's not all that easy to calculate).
2) As to why ATI is so shy about it, a few things have to be considered. Since the R300, they've enjoyed the public perception that they can do more with less. Another explanation is that perhaps some notebook manufacturers look at transistor count to judge how good a chip is performance-wise (lower = less noise/heat, even if that's entirely untrue).
Or perhaps ATI just doesn't really care... Or there might be yet another, more complicated reason: perhaps they like investors thinking their production costs are, relative to NVIDIA's, lower than they really are.

Now, don't get me wrong, I'm not saying there's gigabytes of cache on modern GPUs :) But it could make up quite a bit of the differences between recent ATI and NVIDIA chips. Heck, if this was already done a few years ago, it means the R200 had more transistors than the NV25, and not just by a million or so!

Uttar
 
I wouldn't go around treating the "marketing" transistor counts, and whatever they may or may not include, as any indication of actual cache quantities. However, R420 appears to show a different texture performance drop-off than NV40, which would indicate that its caches are larger.
 
Re: some comments from the paper authors

kayvonf said:
Both vendors' current generations of chips can provide one fp32 (fp24 on ATI) float per clock out of the closest texture cache.

I was wondering if you were taking texture sampling performance into consideration, as an FP32 float will require 4 cycles to sample (which will presumably be needed if there is a cache miss).
 
The connection from texture cache to TMU should be 128 bits wide (per pixel; a quad TMU like in NV40 might require less than 4 times that). However, since the result of the filtering operation only needs 32 bits/clock with both 8-bit and FP16 filtering, the connection from the TMU to the data format converter might be the bottleneck.
 
I guess what this really means is that if you're doing operations on FP textures, you really need to do many more operations than reads/writes. So, if you could generate one of the textures you're going to multiply on the fly within the pixel shader, the potential is there to dramatically increase performance.
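The saving from generating one operand in the shader can be put in numbers. A toy count (the function name and sizes are hypothetical, just to make the point concrete):

```python
# Texture fetches needed for one output element of C = A @ B, inner length n.
def fetches_per_output(n, generate_b_in_shader):
    a_fetches = n                                   # one texel of A per inner step
    b_fetches = 0 if generate_b_in_shader else n    # B computed from its coordinates?
    return a_fetches + b_fetches

print(fetches_per_output(1024, False))  # -> 2048: both operands sampled
print(fetches_per_output(1024, True))   # -> 1024: B generated on the fly
```

Halving the fetches doubles the math-per-fetch ratio, which is exactly where these kernels are starved.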

Another option may be to do it the "old" way, the way matrix multiplication was done on the GPU on the GeForce4. That is, one could attempt to do the matrix multiplication in the vertex shader instead.
 