NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    GK104's performance did not scale very well with its FLOPs advantage over GF110.

    So if they can put 1536 FP64 cores in their >500mm² monster and reach the same scaling as on GF110, we could see some nice gaming performance.
     
  2. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Xbit's review, overclocking performance against Tahiti: http://www.xbitlabs.com/articles/graphics/display/nvidia-geforce-gtx-680_14.html#sect0

    Overall 4% faster at 19x12 and 0.2% faster at 25x16. From those numbers, AMD need not be concerned about the GK104 threat in the form of the 680. But given that they are behind in key metrics and Nvidia has yet to launch its flagship, they might be worried, with plan B and plan C cases being thought of. Perhaps a rejiggle in strategy for Sea/Canary Islands?
     
  3. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    Back to DP:

    GF104 SM: one of the three SIMD16 computes DP in 4 clocks => 4 DP-FMA / 48 SP-FMA per SM
    GK104 SMX: one of the six SIMD32 computes DP in 4 clocks => 8 DP-FMA / 192 SP-FMA per SMX

    Sounds more reasonable?
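
    The arithmetic behind those two lines can be checked with a short sketch. It assumes, as the post does, that exactly one SIMD per SM/SMX handles DP and needs 4 clocks per DP instruction; the function name `sm_rates` is made up for illustration and none of this is confirmed hardware behaviour.

```python
from fractions import Fraction

def sm_rates(simd_count, simd_width, dp_simds=1, dp_clocks=4):
    """Per-clock FMA rates for one SM/SMX under the post's assumptions.

    simd_count, simd_width: number and width of the SIMD units
    dp_simds: how many of those SIMDs can do DP (assumed: 1)
    dp_clocks: clocks a DP-capable SIMD needs per DP instruction (assumed: 4)
    Returns (SP FMAs per clock, DP FMAs per clock, DP:SP ratio).
    """
    sp_per_clock = simd_count * simd_width
    dp_per_clock = dp_simds * Fraction(simd_width, dp_clocks)
    return sp_per_clock, dp_per_clock, dp_per_clock / sp_per_clock

gf104 = sm_rates(simd_count=3, simd_width=16)  # GF104 SM
gk104 = sm_rates(simd_count=6, simd_width=32)  # GK104 SMX

for name, (sp, dp, ratio) in (("GF104 SM", gf104), ("GK104 SMX", gk104)):
    print(f"{name}: {dp} DP-FMA / {sp} SP-FMA per clock, DP:SP = {ratio}")
```

    Under these assumptions the DP:SP ratio comes out at 1/12 for the GF104 SM and 1/24 for the GK104 SMX.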
     
  4. Mize

    Mize 3dfx Fan
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,079
    Likes Received:
    1,149
    Location:
    Cincinnati, Ohio USA
    I don't know, Arty... quieter, cooler, lower power, lower price. Seems AMD needs to drop about a C-note off the 7970 to "not worry."
     
  5. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    I did say other metrics ;) and the "not worrying" part was strictly about GK104's performance.
     
  6. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    Whoa, really surprised Kepler can hold up that well at higher resolutions, didn't expect it to win some multimonitor tests. :shock:

    But it is now clear where the efficiency gains come from. They took a step back and went for static scheduling.... so what is going to be next in the quest for efficiency - Maxwell is going to be VLIW? :S

    Not sure I buy the asymmetric SIMDs (4 vec32 and 4 vec16). Since the scheduling in the compiler now depends on working with known, deterministic latencies of the instructions it issues, wouldn't the compiler have to be fully aware that it is scheduling onto a "shorter" execution unit, since the latency of the instruction would be increased by 1 clock? So kinda knowing there are x and y exec units, where y has higher latency? What does that gain you? Is that easier to keep track of than the 4 schedulers having to issue instructions to up to 6 vec32 SIMDs?

    btw: this test for Compute Mark, QJulia Ray tracer, caught my eye:

    http://www.computerbase.de/artikel/...ia-geforce-gtx-680/20/#abschnitt_gpucomputing

    The near doubling of performance over a 580 looks like it is in line with the GFLOP doubling. So if it is compute bound, then why does a 7970 do worse there? Warp size? Drivers? Something Kepler has that makes it more efficient in this code than Fermi?
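
    For reference, the paper numbers behind that question can be sketched as below. The core counts and clocks are assumptions taken from the publicly listed reference specifications (shader clock for the 580, base clock rather than boost for the 680); they are not from this thread, and `peak_sp_gflops` is a hypothetical helper name.

```python
def peak_sp_gflops(cores, clock_mhz):
    """Theoretical single-precision peak: one FMA (2 FLOPs) per core per clock."""
    return 2 * cores * clock_mhz / 1000.0

# (cores, clock in MHz): assumed reference specs, not measured values
cards = {
    "GTX 580": (512, 1544),
    "GTX 680": (1536, 1006),
    "HD 7970": (2048, 925),
}

gflops = {name: peak_sp_gflops(*spec) for name, spec in cards.items()}
ratio_680_over_580 = gflops["GTX 680"] / gflops["GTX 580"]

for name, value in gflops.items():
    print(f"{name}: {value:.0f} GFLOPS peak")
print(f"GTX 680 / GTX 580: {ratio_680_over_580:.2f}x")
```

    On these paper figures the 680 has just under twice the peak of the 580, while the 7970 has a higher peak than either, which is why a purely compute-bound explanation leaves the 7970 result open.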
     
  7. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    By the way, Kepler now supports a third mode for partitioning the LDS/L1 memory -- 32/32KB setting.
     
  8. itaru

    Newcomer

    Joined:
    May 27, 2007
    Messages:
    156
    Likes Received:
    15
  9. itaru

    Newcomer

    Joined:
    May 27, 2007
    Messages:
    156
    Likes Received:
    15
    1 warp: 32 threads

    1 SMX:
    16 CUDA cores * 12
    16 LS * 2
    16 SFU * 2
    4 schedulers
    8 dispatchers

    SP: 1 warp / 2 cycles

    dispatcher: 1 inst / 1 cycle
     
  10. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
  11. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
  12. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    I think this is incorrect. Based on the AnandTech writeup, the correct layout is:
    32 CUDA cores * 6

    Twice the lanes per unit, but running at half the clock, so a 32-thread warp is computed in a single cycle.
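
    A small sketch comparing the two proposed layouts (16-wide x 12 versus 32-wide x 6). It assumes a warp simply occupies a SIMD for warp_size / width cycles and ignores everything else in the pipeline; the names are illustrative only.

```python
WARP_SIZE = 32  # threads per warp

def cycles_per_warp(simd_width):
    """Cycles a SIMD of the given width needs to push one warp through (ceiling division)."""
    return -(-WARP_SIZE // simd_width)

# layout name -> (SIMD width, number of SIMDs per SMX)
layouts = {
    "16-wide x 12": (16, 12),
    "32-wide x 6": (32, 6),
}

summary = {}
for name, (width, count) in layouts.items():
    lanes = width * count
    cycles = cycles_per_warp(width)
    warps_per_clock = count / cycles  # peak warp instructions retired per clock
    summary[name] = (lanes, cycles, warps_per_clock)
    print(f"{name}: {lanes} lanes, {cycles} cycle(s) per warp, {warps_per_clock:.0f} warps/clock")
```

    Both layouts give 192 lanes and the same peak of 6 warp instructions per clock; they differ only in whether a warp takes one or two cycles on a unit.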
     
  13. Arnold Beckenbauer

    Veteran Subscriber

    Joined:
    Oct 11, 2006
    Messages:
    1,756
    Likes Received:
    722
    Location:
    Germany
  14. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Yes please!
    I'd like to be able to get a >4-core video box, but if I lose HD3000, what I gain in editing I lose in encoding, which kind of stinks. I'd be really happy if nvidia has managed to create a fast, quality encoder.

    [Although, then I'd have to think about buying a $500 card, and I'm not sure what I think about that :>]

    -Dave
     
  15. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
  16. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    FP64 and GK110

    From http://techreport.com/articles.x/22653
    "In the SMX, there are four 16-ALU-wide vector execution units and four 32-wide units. Each of the four schedulers in the diagram above is associated with one vec16 unit and one vec32 unit."

    Rather than some secret block of 8 FP64 CUDA cores that is not shown on any diagrams, isn't it more likely that one of the vec16 units per SMX can do FP64 at half rate? I.e. one out of the 12 vertical columns of CUDA cores does 1/2 rate FP64.

    For the GK110, my guess is that each scheduler has two vec16 units (which improves the ratio of registers to cores) and all cores are capable of 1/2 rate FP64. That would come out at roughly the same size as the GK104 SMXs.
    Then, to make up for the missing cores, have 6 SMXs instead of 4.
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Is that GF1x0 or GF1x4?
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Maybe, but that sounds like a scheduling nightmare. Only one of four schedulers can handle DP? Does that mean that DP-related code is pinned to a scheduler, or??? Definitely agree that 8 dedicated DP cores (haven't we heard rumors like this before?) are unlikely/surprising. Either way, I think DP code is probably either pinned to a scheduler (as you suggest) or to an SMX, and that this might just cause low performance for any chunk of code that looks slantwise at DP....
     
  19. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I don't think the compiler necessarily has to know about it. If I understand correctly, it just means there is no more dynamic scheduling of instructions within a warp. Once you've picked an instruction from a warp, it tells you the minimum number of clocks you have to wait until it's safe to issue the next one for that warp. The warp scheduler then forgets about that warp until the given number of clocks is up. I may be missing something, but if you decided to issue a particular instruction to a half-width SIMD, you could just increase the number of clocks to wait by 1.
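
    A toy model of the scheme described above, purely as a sketch: each instruction carries a compiler-provided wait count, the scheduler ignores a warp until that count has elapsed, and anything routed to a half-width SIMD gets one extra clock added at issue time. The one-issue-per-clock rule, the lowest-id-first warp pick and all names are assumptions for illustration, not Kepler's actual behaviour.

```python
def simulate(warps, half_width_ops=frozenset()):
    """Issue at most one instruction per clock from the first ready warp.

    warps: dict of warp id -> list of (op name, wait clocks before that
           warp's next instruction may issue)
    half_width_ops: op names assumed to run on a half-width SIMD (+1 clock)
    Returns the issue trace as a list of (cycle, warp id, op name).
    """
    next_index = {w: 0 for w in warps}  # next instruction to issue per warp
    ready_at = {w: 0 for w in warps}    # cycle at which each warp wakes up
    trace = []
    cycle = 0
    while any(next_index[w] < len(warps[w]) for w in warps):
        for w in sorted(warps):
            if next_index[w] < len(warps[w]) and ready_at[w] <= cycle:
                op, wait = warps[w][next_index[w]]
                if op in half_width_ops:
                    wait += 1  # the scheduler, not the compiler, adds the extra clock
                ready_at[w] = cycle + wait  # forget this warp until then
                next_index[w] += 1
                trace.append((cycle, w, op))
                break  # one issue per clock in this toy model
        cycle += 1
    return trace

program = {0: [("fmul", 2), ("fadd", 2)], 1: [("fmul", 2)]}
full_width = simulate(program)
with_half_width = simulate(program, half_width_ops={"fmul"})
print(full_width)
print(with_half_width)
```

    In this model the compiler-visible wait counts stay the same in both runs; only the scheduler's bookkeeping changes when an op lands on the narrower unit, which is the point being argued.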
     
  20. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    GF100 is the only Fermi GPU with a die shot available in the wild. :wink:
     