NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    The 7870 is far from having half the 1D shaders of the 7970 (1280 vs 2048), and it's clocked 75MHz higher. The fact that the 7870 is so close to the 7950 also comes down to clock speed (and maybe better balance).

    I don't think Nvidia has been as conservative with the 680 as AMD has been with the 7970. Just looking at the 7870, and what we know of the 7970, the 7970 could have been set at 1000-1025MHz from the start, and the same goes for the 7950: the 7950 vs 7870 comparison would look different with the 7950 at 900-925MHz instead of 800MHz (which is clearly low already next to the 925MHz of the 7970).
     
    #3821 lanek, Mar 28, 2012
    Last edited by a moderator: Mar 28, 2012
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    And who says that GK106 will be exactly half of a GK104 in terms of unit counts anyway? If you're looking just at ALUs, that's obviously not the entire story. With a hypothetical 192-bit bus and 24 ROPs, those are two areas where, clock for clock, it's only a quarter behind GK104.

    Anyway, it's a moot point at this stage. When GK106 launches we'll see where it ends up.
     
  3. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland

    I was commenting on the 3DCenter "speculation" and explaining why I think 768 CCs is too low. I'd expect something more like 960 CCs, exactly as Nvidia did for the GTX 560 and GTX 460: half the SMs, more CCs per SM.
     
  4. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    True, but on Fermi/Kepler, much of the front-end is shared on a per-GPC basis, so if GK106 is essentially a GK104 with only 2 GPCs, then it is half a GK104 in just about every way, except for ROPs and memory.
     
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
    For the record's sake, that speculative list at 3DC doesn't mention ONLY 768 SPs, but the other unit amounts I mentioned before too. All the more so since I'm a member of the 3DC crew.
     
  6. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    If we are talking about absolute perf/mm², then topping Pitcairn seems like an uphill task, especially if the expected performance range is 6950 +/- in the same die area as Pitcairn. Perf/watt might be a relatively easier win against Pitcairn if the numbers are in the 130W range.

    From a perceived-product standpoint, it would be like a proverbial 650 Ti going up against the x870 series, which just shows how much Kepler has changed things for Nvidia.
     
  7. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,224
    Likes Received:
    1,895
    Location:
    Finland
    The 7870 uses 103W on average in gaming (Crysis 2 @ 1920x1200, Extreme settings, averaged over 12 seconds).
    More like "it just shows how much focusing on graphics only, versus compute too, affects things".
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,432
    Likes Received:
    438
    Location:
    New York
    Well, even if we assume nVidia was able to push GK104 until it beat the conservatively clocked Tahiti, there's no guarantee that GK106 would fare as well against Pitcairn. GK107 is doing pretty well though, so it will be interesting to watch.
     
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Because a GTX560 was such a stellar perf/mm2 and perf/W performer too?
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    And what about the increased clocks?
     
  11. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    I was responding to the assertion that GK104 increased compute by 4x over GF104, which I understood as a per clock number.

    If we take clocks into account, GK104 increased compute by 2x per clock, and increased clocks by ~30%. So if you want to compare total throughput, GK104 increased by ~2.6x over GF104.
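
    The factors in that estimate just multiply; a quick sketch (the 2x per-clock and ~30% clock figures are the post's numbers, not measured values):

```python
# Reproducing the arithmetic above: GK104's per-clock compute gain
# times its clock-speed gain gives the total throughput increase.
per_clock_factor = 2.0   # 2x compute per clock vs GF104 (from the post)
clock_factor = 1.3       # ~30% higher clocks (from the post)

total = per_clock_factor * clock_factor
print(f"total throughput increase: ~{total:.1f}x")  # ~2.6x
```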

    So, now I'm a little confused. What did you mean by 4x?
     
  12. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    From the realworldtech chart, the doubling of compute was not accompanied by a doubling of warps per SMX. So it looks like there are actually more registers available per warp than on GF104, but fewer warps per unit of compute and less L1 capacity per warp.

    Ignoring dual issue for a second...

    On a GK104 SMX, 4 warps can issue each cycle, so it'll take ~16 cycles to execute an ALU instruction over all 64 warps (less if dual issuing). If RAM latency is say 256 cycles, then I guess loads need to be separated by about 16 ALU instructions from use-sites in order to fully hide memory latency.

    On GF110, I think it's 2 warps each base clock cycle, so ~24 base clocks to walk through 48 warps (<-- not sure if that number is right). That's 50% more latency hiding in terms of base clock cycles, and since those are also lower than the GK104 clock, it's even better than that in terms of wall clock latency hiding ability.
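
    The round-trip arithmetic in the two paragraphs above can be sketched like this (the 256-cycle latency and the warp/issue counts are the post's assumptions, and dual issue is ignored as stated):

```python
# Cycles to issue one ALU instruction across every resident warp, and
# how many independent ALU instructions are needed to cover a given
# memory latency under that round-robin issue model.
def hiding(latency_cycles, resident_warps, issue_per_cycle):
    cycles_per_round = resident_warps / issue_per_cycle
    insts_to_hide = latency_cycles / cycles_per_round
    return cycles_per_round, insts_to_hide

print(hiding(256, 64, 4))  # GK104 SMX: (16.0, 16.0)
print(hiding(256, 48, 2))  # GF110 SM:  (24.0, ~10.7)
```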

    On the other hand, shouldn't GK104's higher memory/base clock mean that a cache miss takes less wall-clock time to service than on GF104? Not sure how to account for that...

    Is that anywhere near right?
     
  13. EduardoS

    Newcomer

    Joined:
    Nov 8, 2008
    Messages:
    131
    Likes Received:
    0
    I'm not sure, but I expect memory latency to be near 100ns, which at 1GHz is about 100 base clocks, not 256.

    And then, there are the caches and prefetching...
     
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    If it were a matter of order of magnitude, I'd say 1000 rather than 100.
     
  15. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    EduardoS - I'm focusing on a situation where the caches are simply too small to hold the working set. But I readily admit I pulled the 256-cycle number mostly out of my backside; I'm not qualified to read RAM spec sheets and estimate memory controller latencies and so forth :)

    Does your 100ns number include on-chip latencies? And, are you sure GPUs do any prefetching? Maybe it's naive on my part, but unless you want to minimize the overall latency of a computation, isn't it better from a throughput/power perspective to spend bandwidth on memory accesses you know will be useful rather than speculating?
     
  16. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I think the numbers in that chart about register file size per work-item may be a bit misleading in a lot of cases. After all, the numbers of work-groups or work-items per core are maximum values the hardware can only sustain for very light threads. Often this isn't that important, hence Kepler made compromises there by supporting fewer work-groups (relative to its compute capabilities), which skews the numbers.
    To look at it from the other side, one can ask how many work-items the register file is able to support for "heavy" threads, i.e. a case where each thread needs 64 registers for instance (iirc that is the maximum for nV; AMD GPUs support up to 128 regs per thread).

    For GF100 and GF104 this works out to 512 work-items or 16 work-groups; for GK104 it is 1024 work-items or 32 work-groups (I assume GK110 will be different); and for GCN/Tahiti it is 1024 work-items or 16 work-groups.
    But now one has to take the issue rate into account. For GF100 it is a single instruction from each of two work-groups per (base clock) cycle, which means instructions from up to 8 work-groups per scheduler can overlap in flight (it would take 8 cycles to issue one instruction from each of them), and those are what's available for latency hiding. That is in fact not enough to hide even the arithmetic latencies (10 base clock cycles), let alone any memory latency. GF104 is slightly worse, as the issue rate can be higher, so one runs more often into the situation where one is waiting for a memory access and no arithmetic instructions are left to issue to do something useful.
    The same is true for GK104, where up to 8 instructions from 4 warps can be issued per cycle. That means there are again only 8 work-groups each scheduler can choose from, and it is quite likely to run out of ready warps. I have no idea how the arithmetic latencies compare to Fermi; they probably changed.

    GCN issue rates are slightly harder to compare. Generally a CU can schedule only 1 arithmetic (vALU) instruction per cycle (plus 1 scalar + 0.5 local memory access + 0.25 vector memory access + 0.25 export + 1 branch + 1 internal instruction); in fact, each of the 4 schedulers can issue one vALU instruction every four cycles. That means there are 16 work-groups available for latency hiding (which need at least 16 cycles to schedule through), significantly more than Fermi or Kepler has at its disposal.
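
    Those register-file occupancy figures can be checked with a quick sketch; the register-file sizes (32K 32-bit registers per Fermi SM, 64K per GK104 SMX, 64K vector registers per GCN CU) are inferred from the figures in the post, not taken from spec sheets:

```python
# Work-items and warps/wavefronts supportable by the register file when
# each "heavy" thread uses 64 registers (warp = 32 threads on nV,
# wavefront = 64 work-items on GCN).
def occupancy(regfile_regs, regs_per_thread, group_size):
    work_items = regfile_regs // regs_per_thread
    return work_items, work_items // group_size

print(occupancy(32768, 64, 32))  # GF100/GF104: (512, 16)
print(occupancy(65536, 64, 32))  # GK104:       (1024, 32)
print(occupancy(65536, 64, 64))  # GCN CU:      (1024, 16)
```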
     
  17. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
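
    In numbers (64-wide wavefronts on 16-lane SIMDs, 4 SIMDs per CU, as described above):

```python
# A 64-wide wavefront executes over 4 clocks on a 16-lane SIMD, so one
# vALU issue per SIMD every 4 clocks keeps the lanes fully occupied;
# one wavefront per SIMD suffices for peak ALU throughput.
wave_size, simd_lanes, simds_per_cu = 64, 16, 4
clocks_per_valu_inst = wave_size // simd_lanes  # 4 clocks per instruction
waves_for_peak = simds_per_cu * 1               # one wavefront per SIMD
print(clocks_per_valu_inst, waves_for_peak)     # 4 4
```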
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Is this what you're looking for?
    http://forums.nvidia.com/index.php?showtopic=225312&view=findpost&p=1387312

     
  19. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Gipsel - good point on max register usage cutting down on the number of warps per SMX, I had forgotten to account for that.

    I don't quite understand your GCN numbers though. Doesn't a CU track a maximum of 40 wavefronts? You can process an ALU op for 4 entire waves every 4 cycles. So by executing 1 ALU instruction over all 40 waves, you can hide 40 cycles of memory access latency (much more than on Fermi/Kepler), again assuming you have enough registers for 40 wavefronts.

    BTW, also interesting from the thread dnavas linked:
    http://forums.nvidia.com/index.php?showtopic=225312&view=findpost&p=1388098

    Looks like shared memory and register file allocation granularity have doubled.
     
  20. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,494
    Likes Received:
    405
    Location:
    Varna, Bulgaria
    Those 40 WF are not arbitrarily available to all the SIMD units in the multiprocessor.
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.