R3/4XX appear they may have far better caching than NV chips

Discussion in 'Architecture and Products' started by bloodbob, Jul 19, 2004.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Uhhhhmmm.....cliche excuse (and yes for all occasions)... :lol:
     
  2. Quitch

    Veteran

    Joined:
    Jun 11, 2003
    Messages:
    1,521
    Likes Received:
    4
    Location:
    UK
    Studies have shown that size *does* matter... for cache... I'm talking about cache *cough*
     
  3. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ

    of course, what else could you be talking about? :wink:
     
  4. kayvonf

    Newcomer

    Joined:
    Jul 23, 2004
    Messages:
    4
    Likes Received:
    0
    some comments from the paper authors

    Both vendors' current generation of chips can provide one fp32 (fp24 on ATI) float per clock out of the closest texture cache. Just write a test that reads texel (0,0) over and over, and this number will be clear. Therefore a float4 read out of the cache takes 4 clocks. But a MAD can consume 12 floats as inputs (in our case it is 8 floats from memory, 4 stored in a local float4 register), thus for each MAD we do, it's going to take 8 clocks before we get the data, even if the texture cache was INFINITELY large. By comparison, using SSE2, you can grab data on a P4 at 128 bits/clock.
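
    The arithmetic above can be checked with a short back-of-envelope model. This is only a sketch: the function name is made up for illustration, and the idea that a pipeline could otherwise issue one float4 MAD per clock is an assumption, not a figure from the post.

```python
def fetch_clocks_per_mad(floats_from_memory: int, floats_per_clock: float) -> float:
    """Clocks spent fetching the inputs of one float4 MAD from the nearest cache."""
    return floats_from_memory / floats_per_clock


# Figures quoted in the post: 8 of the 12 input floats come from texture memory.
GPU_FLOATS_PER_CLOCK = 1.0       # one fp32 (fp24 on ATI) float per clock
P4_FLOATS_PER_CLOCK = 128 / 32   # SSE2: 128 bits per clock = 4 fp32 floats

gpu_clocks = fetch_clocks_per_mad(8, GPU_FLOATS_PER_CLOCK)
p4_clocks = fetch_clocks_per_mad(8, P4_FLOATS_PER_CLOCK)

# Assumption (not from the post): the ALU could issue one float4 MAD per clock
# if its inputs were always ready.
gpu_alu_utilization = 1.0 / gpu_clocks

print(gpu_clocks, p4_clocks, gpu_alu_utilization)
```

    Under those assumptions the GPU spends 8 clocks fetching per MAD against 2 on the P4, which is the gap the post describes.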

    The point is that these simple algorithms for mat-mat multiply are already reading data at near this peak rate, thus NO algorithm could do all that much better. We tried a bunch of different variants of the algorithms discussed in the paper.

    With MRT, you could do a 4x4 submatrix multiplication in a shader, yielding the best ratio of math to total texture fetches. However, in practice with current drivers, this doesn't seem to work as well as would be expected.
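
    To see why a 4x4 block should help on paper, here is a rough count of math per float fetched. It assumes each inner-loop step reads one float4 from a column of A and one float4 from a row of B and applies a rank-1 update to the 4x4 output block; that layout is an illustrative guess, not necessarily the one used in the paper.

```python
def scalar_mads_per_float_fetched(block_rows: int, block_cols: int) -> float:
    """Math-to-fetch ratio for one inner-loop step of a blocked multiply.

    Assumed model: the step reads block_rows values of A and block_cols values
    of B, then does block_rows * block_cols scalar multiply-adds.
    """
    floats_fetched = block_rows + block_cols
    scalar_mads = block_rows * block_cols
    return scalar_mads / floats_fetched


# Baseline quoted in the post: one float4 MAD (4 scalar MADs) per 8 floats fetched.
baseline_ratio = 4 / 8

# 4x4 output block spread over four float4 render targets (assumed layout).
blocked_ratio = scalar_mads_per_float_fetched(4, 4)

improvement = blocked_ratio / baseline_ratio
print(baseline_ratio, blocked_ratio, improvement)
```

    In this model the 4x4 block does four times as much math per float fetched as the baseline, which is the theoretical gain that current drivers do not seem to deliver.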

    GPU caches and memory systems are designed to stream data into the processing units. Matrix multiplication (and other numerical/scientific) algorithms don't exhibit this memory access pattern. They reuse data many times. On traditional systems, caches can be used to amplify bandwidth in this case, but GPU caches are designed to facilitate texture filtering, and are not designed for bandwidth amplification. As a result, in such cases GPUs will be bandwidth starved where CPUs are not, and the efficiency of running the algorithm on a GPU will be very low.
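
    A small counting sketch shows what "bandwidth amplification" means for matrix multiplication. It assumes a tile of A and a tile of B fit in fast storage and are each loaded once per tile-level multiply; the matrix size and tile size are hypothetical values picked for illustration.

```python
def operand_loads(n: int, tile: int) -> int:
    """Count floats loaded from memory by a tiled n x n matrix multiply.

    Assumed model: each tile-level multiply loads one tile x tile block of A
    and one of B, and reuses them from fast storage for all the math inside.
    """
    if n % tile != 0:
        raise ValueError("tile must divide n")
    tiles = n // tile
    loads = 0
    for _i in range(tiles):
        for _j in range(tiles):
            for _k in range(tiles):
                loads += 2 * tile * tile  # one tile of A plus one tile of B
    return loads


n = 64                              # hypothetical matrix dimension
streaming = operand_loads(n, 1)     # no reuse: every operand comes from memory
tiled = operand_loads(n, 16)        # 16x16 tiles held in a cache
amplification = streaming // tiled

print(streaming, tiled, amplification)
```

    With these numbers a cache that holds 16x16 tiles cuts memory traffic by a factor of 16; a cache built only for texture filtering provides little of this reuse.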

    The point of our paper was to highlight this architectural issue.
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    kayvonf, welcome to B3D!

    I'm surprised about the MRT problem. I would assume that the cards would be able to optimize this just as well. It may be that the memory access pattern writing to 4 different buffers is not efficient, and neither IHV has figured out memory controller settings to optimize this rare case.

    ATI, however, has put out a demo that uses MRT, so I'm a bit surprised. Did you try using only 2 output targets, i.e. 8 floats?
     
  6. kayvonf

    Newcomer

    Joined:
    Jul 23, 2004
    Messages:
    4
    Likes Received:
    0
    MRT on ATI

    Bandwidth currently drops by 1/3 when using 2 MRT targets. It improves slightly with 3 and 4. ATI knows about this and hopefully will have the problem corrected soon. NV68xx MRT output bandwidth is just fine for 1-4 outputs. However, doing a 4x4 submatrix multiplication on NV hardware means you need a bunch of registers. That register pressure is going to cause performance problems.
     
  7. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    18,992
    Likes Received:
    3,533
    Location:
    Winfield, IN USA
    Wow! :shock:

    Thanks and welcome to Beyond3D Kayvonf, the information is very much appreciated. :)
     
  8. Pete

    Pete Moderate Nuisance
    Moderator Legend

    Joined:
    Feb 7, 2002
    Messages:
    5,777
    Likes Received:
    1,814
    Nice of you to pop in, Kayvon. So the X800 doesn't experience (as much) register pressure as the 6800 with a 4x4 submatrix mul?
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Well last I heard NVIDIA has much bigger caches than ATI. On the other hand, ATI doesn't count cache as transistors. So:
    1) NVIDIA doesn't want the public to realize they don't really have that many more transistors than ATI (afaik, the difference, IF you exclude cache, is lower, but of course NVIDIA also has more cache so it's not all that easy to calculate).
    2) As to why ATI is so shy about it, a few things have to be considered. Since the R300, they've been enjoying the public idea that they can do more with less. Another explanation is that perhaps some notebook manufacturers look at transistor count to determine how good a chip is considering performance (lower=less noise/heat, even if that's entirely untrue).
    Or perhaps ATI just doesn't really care... Or there might be yet another more complicated reason. Perhaps they like investors thinking their production costs are, relative to NVIDIA, lower than they really are.

    Now, don't get me wrong, I'm not saying there's gigabytes of cache on modern GPUs :) But it could make up quite a bit of the differences between recent ATI and NVIDIA chips. Heck, if this was already done a few years ago, it means the R200 had more transistors than the NV25, and not just by a million or so!

    Uttar
     
  10. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    I wouldn't treat the "marketing" transistor counts, or how each vendor may or may not count them, as having any relevance to actual quantities of cache. However, R420 appears to have different texture drop-off performance than NV40, which would indicate that the cache sizes are larger.
     
  11. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Re: some comments from the paper authors

    I was wondering if you were taking the texture sampling performance into consideration, as an FP32 float will require 4 cycles to sample (which it will presumably need to if there is a cache miss).
     
  12. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    The connection from texture cache to TMU should be 128 bits wide (per pixel, a quad TMU like in NV40 might require less than 4 times that). However, since the result of the filtering operation only needs 32bit/clock with both 8-bit and FP16 filtering, the connection from TMU to the data format converter might be the bottleneck.
     
  13. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I guess what this really means is that if you're doing operations on FP textures, you really need to do many more operations than reads/writes. So, if you could generate one of the textures you're going to multiply on the fly within the pixel shader, the potential is there to dramatically increase performance.

    Another option may be to do it the "old" way, the way matrix multiplication was done on the GPU on the GeForce4. That is, one could attempt to do the matrix multiplication in the vertex shader instead.
     