R3/4XX appear they may have far better caching than NV chips

Re: R3/4XX appear they may have far better caching than NV chips

bloodbob said:
At least in some general-purpose stream chip applications.

http://graphics.stanford.edu/papers/gpumatrixmult/

Looking at Figure 2 in the PDF, you can see that both ATI cards achieve 95+% efficiency in bandwidth utilisation.

Damn you 'Bob, I was just going to start a thread about the above paper... But hopefully people with the "famous clue" could comment on the paper. No flame war, please :( I'm more interested in the GPGPU side of the paper, and it really doesn't paint that nice a picture of the usefulness of the current architectures for general computation.

I guess it is half official that the "R500" series is going to the "pool of ALU's" approach; it would also be interesting if someone could comment on how well this approach will handle the above case of general matrix work... It seems that the problem of bandwidth and data traffic between units will be even worse? And of course the drivers :?
 
Well, ain't this a bit of a turn-about! Wasn't it the other way around before the R300, when it was being speculated that nVidia was using an on-die cache to help improve performance?

Mebbe I'm remembering it wrong, but the post made me think of it.
 
digitalwanderer said:
Wasn't it the other way around before the R300, when it was being speculated that nVidia was using an on-die cache to help improve performance?

They all use on-die cache, Dig! The question is only how big are they - and does size matter? ;)
 
Re: R3/4XX appear they may have far better caching than NV chips

eSa said:
I guess it is half official that the "R500" series is going to the "pool of ALU's" approach; it would also be interesting if someone could comment on how well this approach will handle the above case of general matrix work... It seems that the problem of bandwidth and data traffic between units will be even worse? And of course the drivers :?

To make a unified shader model pay any dividends you have to ensure that those ALU's are fed - which basically means a very effective instruction scheduler and very effective caching. IMO it's probably these areas that have been worked on most since the R400 was held over.
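
To put a toy number on that, here's a minimal sketch (Python, entirely made-up parameters, no relation to any actual R500 design) of how utilisation of a pooled set of ALUs depends on having enough runnable threads to hide fetch latency:

```python
import random

# Toy model: ALUs in a unified pool stay busy only if the scheduler can
# always find threads whose outstanding fetches have completed.
# All parameters below are made up for illustration.
ALUS, THREADS, CYCLES, FETCH_LATENCY = 8, 32, 10000, 20

ready_at = [0] * THREADS       # cycle at which each thread's data arrives
busy = 0
for cycle in range(CYCLES):
    runnable = [t for t in range(THREADS) if ready_at[t] <= cycle]
    for t in runnable[:ALUS]:  # issue at most one instruction per ALU
        busy += 1
        if random.random() < 0.25:         # 1-in-4 instructions is a fetch
            ready_at[t] = cycle + FETCH_LATENCY
print(f"ALU utilisation: {busy / (ALUS * CYCLES):.0%}")
```

Drop the thread count and the utilisation figure collapses, which is presumably exactly where the scheduler and the caches come in.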
 
40 GFLOPS with a 5900?
Can the NV35 do 3 FP MADs per clock?
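
For what it's worth, a quick back-of-envelope check (Python; the 450 MHz clock, 4 pipelines, and 8 FLOPs per vec4 MAD are my assumptions, not the paper's):

```python
# What does the paper's 40 GFLOPS figure imply for NV35?
GFLOPS = 40.0
CORE_MHZ = 450.0               # assumed NV35 core clock
PIPELINES = 4                  # assumed pixel pipeline count
FLOPS_PER_VEC4_MAD = 8         # 4 multiplies + 4 adds

flops_per_clock = GFLOPS * 1e9 / (CORE_MHZ * 1e6)        # ~88.9
mads_per_clock = flops_per_clock / FLOPS_PER_VEC4_MAD    # ~11.1
print(f"{mads_per_clock / PIPELINES:.1f} vec4 MADs per pipe per clock")
```

That works out to roughly 2.8 vec4 MADs per pipe per clock, so under those assumptions the 40 GFLOPS figure does indeed imply close to 3.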

I wonder how much drivers optimized for this memory access pattern could improve performance.
 
digitalwanderer said:
Well, ain't this a bit of a turn-about! Wasn't it the other way around before the R300, when it was being speculated that nVidia was using an on-die cache to help improve performance?
Are you thinking of the GF4 Ti, which was supposed to have half of its transistors dedicated to cache? Because the FX series' (primary) Achilles' heel was insufficient register space (a form of cache), no? That seemed to explain its far faster FP16 performance.
 
LeStoffer said:
They all use on-die cache, Dig! The question is only how big are they - and does size matter? ;)
Doh! Thanks.

I re-earn me title every day, but I learn a little too. ;)

Pete said:
Are you thinking of the GF4 Ti, which was supposed to have half of its transistors dedicated to cache? Because the FX series' (primary) Achilles' heel was insufficient register space (a form of cache), no? That seemed to explain its far faster FP16 performance.
I think I might be; the GF4 Ti sounds familiar, and I vaguely remember it coming up when comparing the 8500 to the GF4. (No comparison for me, the GF4 stomps an 8500 any day IMHO. ;) )

Hmmm, then these results surprise me... I'd have thunk the NV40 with its horking transistor count would be the winner here.
 
Interesting. It seems that the major result of this paper is that to use the GPU to effectively do general-purpose computing, you really need to do in excess of 8 operations per float fetched. I wonder if there would be a way to cache data for matrix multiplication in order to make better use of the FP units than was described in this paper?
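
A rough way to see where a figure like that comes from is to compare peak arithmetic rate against how many floats memory can deliver per second (a sketch; both numbers below are illustrative, not taken from the paper):

```python
# Balance point: FLOPs each fetched float must amortize before the
# computation stops being memory-bound. Illustrative numbers only.
def ops_per_float(peak_gflops: float, bandwidth_gb_s: float) -> float:
    floats_per_sec = bandwidth_gb_s * 1e9 / 4     # 4 bytes per fp32
    return peak_gflops * 1e9 / floats_per_sec

# e.g. a hypothetical card with ~64 GFLOPS peak and ~36 GB/s of bandwidth
print(f"{ops_per_float(64.0, 36.0):.1f} ops per float fetched")  # ~7.1
```

So anything in the neighbourhood of 7-8 ops per float is simply where compute and bandwidth balance out for parts in this class.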
 
I haven't read the paper, but I assume that the matrices must be stored in textures.

If so, performance is probably heavily dependent on how you lay out the texture, and most likely a simple (x, y) mapping is far from the best.

I'd also speculate that the best layout for one piece of hardware would not be the best for another.

Without detailed knowledge of how the texture caches work it would probably be very difficult to come up with the best mapping.
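
As one concrete example of a non-(x, y) mapping, a Morton (Z-curve) layout is the kind of tiling texture caches are commonly assumed to be built around; a minimal sketch (purely illustrative, since the real cache organisation isn't public):

```python
def morton_index(x: int, y: int, bits: int = 10) -> int:
    """Interleave the bits of x and y to get a Z-curve address,
    so that texels close in 2D stay close in memory."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return idx

# The four texels of a 2x2 block land at consecutive addresses:
print([morton_index(x, y) for y in range(2) for x in range(2)])  # [0, 1, 2, 3]
```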
 
§3 said:
We benchmarked our GPU algorithms and the CPU based
matrix-matrix multiplication routine (sgemm) provided by
ATLAS. Experiments were performed on a 3 GHz Pentium
4 CPU (512 KB L2 cache) featuring peak performance of
12 GFLOPS and an L1 cache bandwidth of 44.7GB/sec.
We tested our GPU algorithms on the ATI Radeon 9800XT,
a prerelease Radeon X800XT (500mhz core clock/500mhz
mem), the NVIDIA GeForce FX 5900 Ultra, and a prerelease
GeForce 6800 Ultra (350Mhz core/500Mhz mem), capable
of achieving 26.1, 63.7, 40.0, and 43.9 GFLOPS respectively.
We only measured the NV Single algorithm on
the NVIDIA hardware because for large matrices its kernel
requires more instructions than the ATI card supports.
I don't mean to derail this thread, but do the "prerelease" clocks of 500/500 for the X800XT and 350/500 for the 6800U hint at any last-minute clock adjustments by either IHV, or just incorrect labelling by the paper's authors?

Also, didn't they flip the scores for the 9800XT and 5900U in the above paragraph (I'm comparing them to Figure 1, though the numbers in the text match those in the tables)?
 
Also, I seem to remember the B3D preview showing that NV's texturing rate falls off for large textures (compared to ATI), but that for smaller textures (<= 256x256) it's faster...

If this is true, then maybe it would be beneficial (for NV) to split matrices across multiple textures?
 
ERP said:
Without detailed knowledge of how the texture caches work it would probably be very difficult to come up with the best mapping.
I'm not so sure. The best mapping would probably be one in which the texture is read in a similar fashion to how it would be read during normal 3D operation. So, you'd probably want to go ahead and pack "nearby" data into 2x2 blocks.
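
A minimal CPU-side sketch of that kind of 2x2 packing (illustrative only; the paper's exact layout may differ, and numpy is used just for brevity):

```python
import numpy as np

def pack_2x2(a: np.ndarray) -> np.ndarray:
    """Pack each 2x2 submatrix of a into one RGBA texel, so a single
    float4 fetch pulls in a whole block of 'nearby' data."""
    n, m = a.shape                      # both assumed even
    texels = np.empty((n // 2, m // 2, 4), dtype=np.float32)
    texels[..., 0] = a[0::2, 0::2]      # R = top-left
    texels[..., 1] = a[0::2, 1::2]      # G = top-right
    texels[..., 2] = a[1::2, 0::2]      # B = bottom-left
    texels[..., 3] = a[1::2, 1::2]      # A = bottom-right
    return texels

a = np.arange(16, dtype=np.float32).reshape(4, 4)
print(pack_2x2(a)[0, 0])                # [0. 1. 4. 5.]
```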
 
Chalnoth said:
Interesting. It seems that the major result of this paper is that to use the GPU to effectively do general-purpose computing, you really need to do in excess of 8 operations per float fetched. I wonder if there would be a way to cache data for matrix multiplication in order to make better use of the FP units than was described in this paper?

I think neither increasing the caches nor changing the access pattern would change this. This particular algorithm for matrix multiplication (which is one of the few that are rather GPU-friendly) is simply bandwidth starved (from video memory --- not the caches).
Increasing the caches would only postpone the point at which the caches are empty due to main memory not keeping up (as we're talking at least 1024x1024x128 bits * 2 (source + destination)).
Changing the access pattern would improve the hit-rate of the cache, but that is not the problem. Also, the authors seem to have chosen a layout that is similar to how regular textures are accessed, so that the caches should be able to operate with good efficiency.
 
I wonder if they took into account the number of cycles it takes to sample float data. They made some concessions at the end, saying "we realise that these are primarily tasked with handling 8-bit per component operations", but do they realise that it takes multiple cycles to sample float data on these architectures?
 
maven said:
Increasing the caches would only postpone the point at which the caches are empty due to main memory not keeping up (as we're talking at least 1024x1024x128 bits * 2 (source + destination)).
First of all, these are 1024x1024 matrices (not textures), with 2x2 submatrices packed into the four components of each texel, so we're talking about 1024x1024x32 bits * 4 (two sources, one destination, and the matrix result from the previous pass). But more important than the bandwidth per matrix is the bandwidth/math operation ratio.
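
Putting rough numbers on that (a quick sanity check; the 16 KiB cache size below is a made-up figure, since actual cache sizes aren't disclosed):

```python
# Per-pass streaming traffic for 1024x1024 fp32 matrices: two sources,
# one destination, plus the previous pass's partial result.
N = 1024
matrix_mib = N * N * 4 / 2**20          # 4 MiB per fp32 matrix
traffic_mib = 4 * matrix_mib            # ~16 MiB per pass

CACHE_KIB = 16                          # hypothetical on-die texture cache
print(f"{traffic_mib:.0f} MiB streamed vs a {CACHE_KIB} KiB cache "
      f"({traffic_mib * 1024 / CACHE_KIB:.0f}x larger)")
```

Whatever the real cache size is, the working set dwarfs it by orders of magnitude, which is why a bigger cache only postpones the problem.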

They do claim that they attempted many other methods to improve performance, but I still wonder if they tried everything, or if they were hampered by drivers that have not yet been optimized for some of the more exotic uses to which these pieces of hardware can be applied (such as MRT's).
 
Apologies if this is going OT, but talking about on-die caches -- why is this such a tightly kept secret by the IHVs?
 
Reverend said:
Apologies if this is going OT, but talking about on-die caches -- why is this such a tightly kept secret by the IHVs?

I've wondered the same in the past, but have no guesses to offer up.
 
My guess would be that no one wants to get caught looking like they have smaller caches than the competition, even though the optimal size is highly design-dependent. It's also possible, for example, that the optimal cache size for a chip with a fast memory interface might be smaller than that of a chip with a slower interface, because of the reduced latencies. In the CPU world, cache size is a differentiating feature because bigger pretty much always equals better.
 
GraphixViolence said:
My guess would be that no one wants to get caught looking like they have smaller caches than the competition, even though the optimal size is highly design-dependent.
Ah! The usual story of "it's not the size that matters but what you do with it".
 