NVIDIA Fermi GPU and Architecture Analysis

Discussion in 'Beyond3D Articles' started by AlexV, Oct 23, 2010.

  1. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,987
    Likes Received:
    100
    Yes, that's what I meant...

    Ah ok. I could understand easily why it wouldn't be able to do 2 dot2 per test (serial dependent ones would be enough to prevent this), I just failed to see why it didn't even manage the same rate as dot4 under similar circumstances.

    I also find it quite funny, given all the emphasis NVIDIA put on GPGPU, that the GTX470 takes a very serious beating in DP issue rate (scalar or not...), even though it's artificially limited. The numbers there also look a bit strange, though. More driver wonkiness for the HD5870? The DP MAD rate is only half that of DP MUL. I thought it should be 2/3, though that could also be due to dependencies, I guess. In fact, in one case (cs vec4) it's only 1/3, which doesn't make a whole lot of sense to me.
     
    #21 mczak, Oct 26, 2010
    Last edited by a moderator: Oct 26, 2010
  2. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    Another problem with DOT2 is that their compiler doesn't generate the dedicated instruction for it, which further impacts performance.

    As for the Vec4 MAD, I mention the cause of the anomaly in the text: fxc doesn't generate MADs when using DP operands, but rather MUL+ADD, and neither compiler collapses this into a MAD/FMA as far as I can see (the assumption probably being that if you're using DP you know what you're doing and actually want MUL+ADD). So you get half rate for MADs by virtue of there being twice as many instructions versus MUL, in spite of both Cypress and Fermi supporting DP FMA.
     
  3. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,987
    Likes Received:
    100
    Oh, not at all? That indeed sounds like the compiler could use some optimization.
    Yes, but it should still be faster on HD5870: A Vec4 DP MUL requires 4 clocks, but the Vec4 DP ADD part should only require 2, which would give the 2/3 of MUL rate I mentioned - instead it's 1/2 and in one case only 1/3 for some reason. Guess that shows it's not exactly easy to get max throughput out of that part...
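
The arithmetic behind the 2/3 figure can be written out as a small sketch. The clock counts (4 for a Vec4 DP MUL, 2 for a Vec4 DP ADD) are the ones quoted in the post and are not independently measured here; the names `MUL_CLOCKS`, `ADD_CLOCKS` and `relative_rate` are made up for illustration.

```python
from fractions import Fraction

# Clock counts quoted in the post for a Vec4 double-precision op on HD5870.
# These are assumptions taken from the discussion, not measurements.
MUL_CLOCKS = 4
ADD_CLOCKS = 2


def relative_rate(sequence_clocks, baseline_clocks=MUL_CLOCKS):
    """Throughput of a sequence relative to a lone MUL (higher is faster)."""
    return Fraction(baseline_clocks, sequence_clocks)


# Expectation in the post: a MUL (4 clocks) followed by an ADD (2 clocks).
expected = relative_rate(MUL_CLOCKS + ADD_CLOCKS)

# Instruction-count view: two instructions, each costing as much as a MUL.
instruction_count_view = relative_rate(2 * MUL_CLOCKS)

print(expected, instruction_count_view)
```

Under these assumptions the MUL+ADD pair should run at 2/3 of the MUL rate, whereas counting each instruction as a full MUL gives the 1/2 that the test reports.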
    You should always be able to fuse MUL/ADD into MAD if your hw can do it, but as far as I can tell Cypress only offers DP FMA (not sure about Fermi), and as you said, it makes sense that this isn't fused automatically, since an FMA rounds only once and the results would be different.
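
A minimal sketch of why fusing changes the result: a separate MUL then ADD rounds twice, while an FMA rounds once. The FMA here is emulated in software with exact rational arithmetic, as a stand-in for the hardware instruction rather than anything either compiler emits; the function names and input values are chosen for illustration only.

```python
from fractions import Fraction


def mul_then_add(a, b, c):
    """Separate MUL and ADD: the product is rounded to double before the add."""
    product = a * b      # first rounding
    return product + c   # second rounding


def fused_multiply_add(a, b, c):
    """Software stand-in for FMA: exact a*b + c, rounded once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs picked so that the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = mul_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(separate, fused)
```

With these inputs the separate path returns exactly 0.0, while the fused path returns -2**-60, so silently replacing one with the other would change what the shader computes.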
     
  4. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    They're both the same for DP (only FMA), as far as I know.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    401
    Location:
    New York
    Magic, of course. Unfortunately there aren't any tests for that just yet :)
     
  6. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    498
    Likes Received:
    177
    I'm leaning towards the idea that this particular test suite just isn't very useful - it seems to be probing architectural parameters that aren't the real bottlenecks.
     
  7. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,382
    Likes Received:
    797
    Maybe the bottlenecks in Cypress are not hardware, but software.


    Anyway, that was a very good article. I can't say that I understood everything, but I certainly enjoyed it.
     
  8. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    A partial answer to that particular question (Crysis FPS tells me a different story, what's up?) is in the works (it'll be out this year, really!). But we need to remember that for a single rendered frame in a game a lot of things happen, concurrently, and drivers play a huge part, especially as you start dealing with things like awful API call sequences.
     
  9. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    1,909
    Likes Received:
    232
    Not to call you out on this, but the thread below this one is Q4 2006 GPU Market Analysis, and the second below that is R580: ATI Radeon X1950 XTX Review. Excuse us if we're skeptical. :razz:
     
  10. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    The funny part is that the X1950 XTX review is still missing after all this time, despite at least two promises to fix it!
     
  11. Bob

    Bob
    Regular Subscriber

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    Slimer? Really?
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,405
    Likes Received:
    401
    Location:
    New York
    Somebody really needs to let us in on the joke :)
     
  13. Bludd

    Bludd Experiencing A Significant Gravitas Shortfall
    Veteran

    Joined:
    Oct 26, 2003
    Messages:
    2,900
    Likes Received:
    470
    Location:
    Funny, It Worked Last Time...
    I have just started reading but I notice that the table for GTX 470 on page 3 does not contain a number for the bandwidth.
     
  14. Bludd

    Bludd Experiencing A Significant Gravitas Shortfall
    Veteran

    Joined:
    Oct 26, 2003
    Messages:
    2,900
    Likes Received:
    470
    Location:
    Funny, It Worked Last Time...
    Well, the first sentence on page 5 refers to Fermi as a "fat green blob" so I think that is the joke.
     
  15. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    No, it's not that; the fat green blob came much, much later. The Mem BW thing will be fixed; it's a CMS bug.

    Bob: it's said lovingly :razz:
     
  16. Pantagruel's Friend

    Newcomer

    Joined:
    Jun 17, 2007
    Messages:
    59
    Likes Received:
    0
    Location:
    Budapest, Hungary
    Ummm, looks like I'm a bit late to the party. To be honest, somewhere in September I lost faith that this article would ever manifest itself. :oops:

    Seriously, though, the end product was well worth the wait. Many thanks for this enlightening romp through Fermi's intricacies, some of which were rather unexpected. I'm quite seriously baffled by the triangle setup rate; I was under the impression that it was a key contributor to the GTX480's advantage over the 5870.
     
