NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    And I know a few physicists looking for jobs here so they can come back to Germany from the US. Two years ago I could have gone to California too, but I didn't want to. I won't claim my sample size is representative. :roll:
    :lol:
     
  2. tekyfo

    Newcomer

    Joined:
    Apr 12, 2012
    Messages:
    4
    Maybe it's humility?
     
  3. trinibwoy

    trinibwoy Meh
    Legend Alpha

    Joined:
    Mar 17, 2004
    Messages:
    10,317
    Location:
    New York
    This is way OT but....

    I'm from neither the US nor Europe, but in my travels people from Europe talk far more about wanting to go to the US than the other way around.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I agree it's way OT, but I have to mention that we (of course :wink:) have official statistics on such things in Germany from the "Statistisches Bundesamt" (Federal Statistics Office). As a net effect, we had more immigration than emigration in the last few years, also from the US (though that is almost even). Only if you count German citizens alone do slightly more leave than come back (and as said, counting all people coming from the USA irrespective of citizenship, we arrive at almost a net zero).
    By the way, the most attractive destination for German emigrants is Switzerland! :smile:

    And now back to topic!

    Has anybody got some more insight into the question I asked here?
    I mean, if nV takes the data locality issue seriously, they should pin warps to a certain vALU (or actually to a set of vALUs, SFUs and L/S units) in roughly the same way GCN does by pinning its wavefronts to a certain vALU.
     
  5. Lightman

    Veteran

    Joined:
    Jun 9, 2008
    Messages:
    1,573
    Location:
    Torquay, UK

    Continuing way OT ...

    I call it Hollywood effect :wink:


    Back on topic:
    I was wondering why GK104 is slower at Bitcoin mining than GF110. I know this workload is purely integer, yet it still seems odd that the new GPU is 20-30% slower in both OpenCL and CUDA miners (including a CUDA miner compiled with the 4.2 toolkit).

    average numbers:
    110MH/s (GTX680) vs 140MH/s (GTX580)
     
  6. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Tahiti can support a much larger number of concurrent threads. I don't think the RF size in GK104 is particularly lacking in that respect. The number of warps per SMX is more troubling, as are the consequences for memory access latency hiding -- which takes us back to the question of how the new SW scheduling will deal with data locality and dependencies.
     
  7. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    730
    It's slower in the GPC OpenCL benchmark too:

    Code:
                  GTX 580   GTX 680
    SHA-1 Hash     571.0     471.9 
     
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Bitcoin is basically all shifts. Perhaps the shift hardware is not as good in Kepler? (It certainly has no reason to be as good for gaming loads.)
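    For context (a sketch of my own, not any miner's actual code): Bitcoin's inner loop is double SHA-256, and SHA-256's round functions are built almost entirely from 32-bit rotates and XORs, e.g. the two "big sigma" functions from FIPS 180-4:

    ```c
    #include <stdint.h>

    /* 32-bit right rotate (valid for r in 1..31); on hardware without
       a native rotate this costs two shifts and an OR per call. */
    static uint32_t rotr32(uint32_t x, unsigned r) {
        return (x >> r) | (x << (32 - r));
    }

    /* The SHA-256 "big sigma" functions: three rotates and two XORs
       each, evaluated in every one of the 64 rounds per block. */
    static uint32_t Sigma0(uint32_t x) {
        return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
    }

    static uint32_t Sigma1(uint32_t x) {
        return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
    }
    ```

    So a GPU's 32-bit shift/rotate throughput maps almost directly onto hash rate.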
     
  9. Lightman

    Veteran

    Joined:
    Jun 9, 2008
    Messages:
    1,573
    Location:
    Torquay, UK
    Granted, but the results are close to the level of a GTX 560 Ti, which GK104 doubles in almost every other respect.
    It will be interesting to see whether big Kepler brings any improvements in these tasks or not.
     
    #4089 Lightman, Apr 12, 2012
    Last edited by a moderator: Apr 12, 2012
  10. Bludd

    Bludd Experiencing A Significant Gravitas Shortfall
    Veteran

    Joined:
    Oct 26, 2003
    Messages:
    2,466
    Location:
    Funny, It Worked Last Time...
    Why can't a new driver help with this?
     
  11. ahu

    ahu
    Newcomer

    Joined:
    Jul 19, 2008
    Messages:
    51
    A new driver won't help much here, as the integer performance is severely handicapped compared to GTX 580 and even to GTX 560. According to CUDA C Programming Guide version 4.2, 32-bit integer shifts and compares have only 1/24 of the throughput of the 32-bit FMA. That would put the GTX 680 at around 1/6 of the GTX 580 throughput in those operations. The other integer operations aren't quite that slow though.
     
  12. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,251
    Location:
    Cleveland, OH
    Seems like the compiler would then want to examine favoring integer MADDs over left shifts, since they have 4x the throughput (and can pair with adds, serving as a useful left-shift-and-insert operation). It's not intuitive, but I've actually done that sort of thing in NEON code.
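    A minimal sketch of that substitution (my own illustration, not compiler output): a left shift by a constant is a multiply by a power of two, and the shift-plus-add pairing becomes a single multiply-add:

    ```c
    #include <stdint.h>

    /* x << k rewritten as a multiply; with a compile-time constant k
       the (1u << k) folds to a constant, so only the multiply remains. */
    static uint32_t shl_as_mul(uint32_t x, unsigned k) {
        return x * (1u << k);
    }

    /* (a << k) + b expressed as one multiply-add: a * 2^k + b. */
    static uint32_t shl_add_as_mad(uint32_t a, unsigned k, uint32_t b) {
        return a * (1u << k) + b;
    }
    ```

    On hardware where 32-bit integer multiply is faster than 32-bit shift, that trade is a win despite looking backwards.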
     
  13. Sxotty

    Veteran

    Joined:
    Dec 11, 2002
    Messages:
    4,508
    Location:
    Under a Crushing Burden
    I always assumed you were, guess it's true what they say.
     
  14. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    But the best high-throughput int MADD you can probably get is 24-bit, and the problem BTC deals with really wants 32-bit int shifts.

    The fastest implementation you could build probably splits the 32-bit words into 16-bit ones, but then you at least quadruple the number of ops and double the amount of state per thread. The state is probably the bigger hit.
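    As a sketch of what that split looks like (my own illustration): a 32-bit left shift built from 16-bit-wide operations already needs three shifts and an OR where the native form needs one, plus two registers per value:

    ```c
    #include <stdint.h>

    /* A 32-bit value stored as two 16-bit halves. */
    typedef struct { uint16_t lo, hi; } u32_halves;

    /* Left shift by k (1..15) using only 16-bit-wide data: the bits
       leaving the low half are shifted down and ORed into the high
       half before both halves are shifted up. */
    static u32_halves shl_halves(u32_halves x, unsigned k) {
        u32_halves r;
        r.hi = (uint16_t)((x.hi << k) | (x.lo >> (16 - k)));
        r.lo = (uint16_t)(x.lo << k);
        return r;
    }
    ```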

    I have actually always been a bit puzzled as to exactly why AMD gpus are as good at 32-bit shifts as they are. There really isn't any use that justifies the expenditure outside crypto. Is AMD the main supplier to NSA or something?
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    With the bitalign instruction you can basically do bitshifts of 64bit data (though it delivers only 32 bits of the result) at full rate on AMD GPUs (since Cypress; the R700 generation had only the normal shifts at full rate but was already a huge jump over R600/RV670, where bitshifts executed at 1/5 rate [only in the t slot]). You can use this also for full-speed rotates. [strike]AFAIK nVidia added this instruction with Fermi, too. But maybe it's slower (executed on SFUs? Implemented only in a part of the vALUs? Only DP/iMUL32 throughput?) or they added it only as a macro consisting of multiple native instructions, no idea.[/strike] [edit]nV GPUs have only the BFE instruction, not bitalign and also not BFI, as Man from Atlantis pointed out[/edit] [edit2]According to the documentation, starting with Fermi they have a BFI instruction; just the bitalign is missing compared to AMD.[/edit2] And don't forget HD 5870 and HD 6970 had a higher peak arithmetic performance than GF100/110 anyway.
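    A software model of the semantics described above (the hardware operand order is my assumption; treat this as illustrative, not as the ISA definition): bitalign concatenates two 32-bit words, shifts the 64-bit value right, and keeps the low 32 bits of the result. A full rotate then falls out by feeding the same word in twice:

    ```c
    #include <stdint.h>

    /* Model of bitalign: 64-bit right shift of hi:lo, low 32 bits
       kept. Shift amount reduced mod 32, since 0..31 covers all
       distinct results for this use. */
    static uint32_t bitalign(uint32_t hi, uint32_t lo, unsigned shift) {
        uint64_t v = ((uint64_t)hi << 32) | lo;
        return (uint32_t)(v >> (shift & 31u));
    }

    /* Full-speed 32-bit rotate via bitalign: same word as both halves. */
    static uint32_t rotr32(uint32_t x, unsigned r) {
        return bitalign(x, x, r);
    }
    ```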

    As to the reason, I always thought bit-manipulation instructions are quite cheap, except maybe the shifts. But AMD obviously thought it was little enough effort to put them in at full speed. Maybe someone can enlighten us: how much does a 32bit shift unit cost compared to an FMA?
     
    #4095 Gipsel, Apr 17, 2012
    Last edited by a moderator: Apr 17, 2012
  16. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    730
    There is some talk about nvidia's lack of BFI_INT and int rotate instructions making them significantly slower than AMD's.
     
  17. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    I guess GK104 will absolutely stink at bitcoin or cryptographic stuff if no hidden magic comes to the rescue. I've just seen in nV's documentation that 32bit integer shifts are supposed to run at only 1/24 rate (8 operations per clock cycle per SMX, same as double precision). For some extremely strange reason, that is slower than 32bit integer multiplication (1/6 rate), so one could try to replace shifts with multiplications where possible.
    Comparing GF100/110 : GF104/114 : GK104 per clock cycle (for the whole chip, taking the hotclock for Fermi), the instruction issue rates relate as 4:2:1 for 32bit integer shifts and 2:1:2 for 32bit integer multiplies, making this stuff really slow on GK104 (and you still have to factor in Kepler's lower clock speed): shifts run at only about 1/3 the speed of a GF114! Only integer adds are significantly faster.
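    Folding in the clocks makes the 1/3 figure concrete. This is a back-of-the-envelope sketch: the per-SM shift throughputs (16 and 8 per clock) follow from the 1/24-of-FMA rate quoted above and the Fermi tables, the SM counts are the known chip configurations, and the clocks are approximate retail values:

    ```c
    /* Whole-chip 32-bit shift throughput in Gshifts/s (assumptions:
       per-SM rates and retail clocks as noted in the comments).
       GF114 / GTX 560 Ti: 16 shifts/clk/SM  x 8 SMs  x ~1.64 GHz hotclock
       GK104 / GTX 680:     8 shifts/clk/SMX x 8 SMXs x ~1.0  GHz         */
    static double gf114_gshifts(void) { return 16 * 8 * 1.64;  } /* ~210 */
    static double gk104_gshifts(void) { return  8 * 8 * 1.006; } /*  ~64 */
    ```

    The ratio comes out to roughly 0.31, i.e. about 1/3 of GF114 as stated above.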
     
  18. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Bit shift units, especially ones that can operate at 4-cycle latency, are really, really cheap compared to FMA. Basically, having a name for the instruction and the paths to send operands to it are going to be more expensive than the actual shift hardware.

    The thing is, throughput loads that use shifts don't really exist outside crypto. Which makes the existence of the instruction strange. I find it entirely believable that AMD added it for a single client.
     
  19. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,510
    Location:
    Hamburg, Germany
    As said above, AMD already made the shifts fast in the R700 generation (RV770 did shifts ~12 times as fast as RV670); Cypress only added to that by enabling full-speed rotates and simplifying shifts of wider data with the bitalign instruction. If there had been a specific customer, I guess they would have done it only for RV770, not for the entire line (DP was also RV770-only). I think they did it mainly because it was cheap and even simplifies some things, because everything (the ALUs) gets more symmetric.
     
  20. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Location:
    Land of Mu
    No need to guess, Kaotik posted some GPGPU benchmarks over in the 7970 thread, that includes results from a bitcoin miner:

    http://muropaketti.com/artikkelit/naytonohjaimet/gpgpu-suorituskyky-amd-vs-nvidia,2 (2nd benchmark)

    It isn't at "absolutely stink" level, but it's still disappointing. If GK110 adds the missing instructions at full speed, it should beat a 7970.
     
