NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Pending "Double Confirmation". (The scale is wrong)
     
  2. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    Caches:

    768kb L2 cache
    64kb Shared Memory/L1 cache per SMX
    Texture Cache
    Uniform Cache
    65536 x 32bit registers per SMX

    4 Schedulers with 8 dispatch units per SMX
    8 SMX inside a GK104 chip.
     
  3. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    i was foolishly hoping for 1MB L2 but 768KB is good enough, 50% more L2 and off-chip bandwidth should be enough for ROPs.. 4 schedulers can be seen clearly unlike GF100 dieshot btw..

    if 7970 were faster, the graph wouldn't stop at 80 FPS.. earlier GDC rumor indicates +10% for GTX680 so 65FPS is likely for 7970


    EDIT: BTW it seems that nvidia thought it was brute force to go higher fillrate for small triangles and they try to balance the tessellator system.. they are now similar to AMD but they should still be better at smaller triangles.. i wonder how it fares against GF110 in pure tess benchmarks(tessmark) at same clock..

    EDIT2: it looks like it's nearly 2x faster than my overclocked GTX460, i just wish it was cheaper and i could get a msi lightning soon :D
     
    #2703 Man from Atlantis, Mar 16, 2012
    Last edited by a moderator: Mar 16, 2012
  4. A1xLLcqAgt0qc2RyMz0y

    Veteran

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    Tin-foil hat time :grin:
     
  5. A1xLLcqAgt0qc2RyMz0y

    Veteran

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    These slides do not specify any cache sizes except the L1.

    http://imgur.com/a/aQmuA#EFjJN

    Do you have a link for where the "768kb L2 cache" size specification is stated?

    The Instruction Cache size is also unknown.
     
  6. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Can anyone estimate die size from this picture?
     
  7. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    compare it to GF108 (116 mm²)

    [IMG]
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical, replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

    The only way (I can think of right now) to distribute the units would be that each dual-issue scheduler delivers its instructions to a set of 3 vec16 ALUs, 8 L/S units and 8 SFUs. That basically means one SMX would be a package of four GF104-style SMs (somewhat reminiscent of G80/GT200) where the hotclock and one scheduler got lost (and the local memory, TMUs and some other stuff are shared). The scheduler can issue two instructions from one thread each cycle and alternates each cycle between "even" and "odd" threads (the same would then be true for the register access; maybe that's why one can identify 8 vector register files in each SMX: even and odd threads have separate register files). Or maybe a better picture: a scheduler issues up to 4 instructions from two threads every two clock cycles. Or the scheduler issues a single instruction from each of two threads every cycle (and the vecALU the instruction got issued to is blocked in the next cycle, because one can issue an instruction for a 32-element warp only every second cycle to a vALU with 16 lanes). The last version would basically work like the two single-issue schedulers in a GF100/110 SM, just that the schedulers run at the same clock as the ALUs and can therefore supply more of them.

    Does anyone have a clever idea how this really works?
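A quick way to sanity-check the partition proposed above (a sketch, not a description of the real hardware): assume each of the 4 schedulers owns 3 vec16 ALUs, 8 L/S units and 8 SFUs, and that every lane or unit handles one warp element per cycle. The names `PARTITION` and `warp_instructions_per_cycle` are made up for illustration.

```python
from fractions import Fraction

WARP_SIZE = 32            # threads per warp
SCHEDULERS_PER_SMX = 4    # from the slides quoted earlier in the thread

# Hypothetical per-scheduler partition proposed in the post (an assumption,
# not a confirmed layout): 3 vec16 ALUs, 8 load/store units, 8 SFUs.
PARTITION = {"alu_lanes": 3 * 16, "ls_units": 8, "sfu_units": 8}


def smx_totals(partition, schedulers):
    """Scale the per-scheduler unit counts up to one whole SMX."""
    return {name: count * schedulers for name, count in partition.items()}


def warp_instructions_per_cycle(lanes):
    """Warp instructions a group of lanes can retire per cycle,
    assuming one warp element per lane per cycle."""
    return Fraction(lanes, WARP_SIZE)


totals = smx_totals(PARTITION, SCHEDULERS_PER_SMX)

# Sustained demand per scheduler if every unit is kept busy.
per_scheduler_rate = sum(
    warp_instructions_per_cycle(n) for n in PARTITION.values()
)

# A 32-element warp on a 16-lane ALU occupies it for two cycles.
cycles_per_warp_on_vec16 = WARP_SIZE // 16

print(totals)
print(per_scheduler_rate)
print(cycles_per_warp_on_vec16)
```

Under those assumptions the totals come out to the official 192 ALUs, 32 L/S and 32 SFUs per SMX, and each scheduler's units can absorb 2 warp instructions per cycle, which lines up with a dual-issue scheduler. It says nothing about which of the issue patterns described above is the real one.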

    PS:
    If they didn't make a mistake in those slides similar to the one in the Fermi presentation, the total register space is the same as with GF100/GF110 (2 MB), so really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering how similar the ALU counts of GK104 and Tahiti are. I would expect double the value given in that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
    But also the local memory/L1 is quite small (still 64kB) considering how many threads/workgroups on one SMX have to share it.
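The register arithmetic in the PS above can be reproduced with a few lines. This is only a sketch using the numbers quoted in this thread (65536 x 32-bit registers per SMX, 8 SMX per GK104, 4 schedulers per SMX); the doubled figure is the poster's guess, not a confirmed spec, and the variable names are made up for illustration.

```python
REGISTERS_PER_SMX = 65536   # 32-bit registers per SMX, as listed from the slides
BYTES_PER_REGISTER = 4      # 32 bits
SMX_PER_GK104 = 8
SCHEDULERS_PER_SMX = 4
KIB = 1024


def register_file_kib(registers, bytes_per_register=BYTES_PER_REGISTER):
    """Size of a register file in KiB."""
    return registers * bytes_per_register // KIB


# Figures implied by the slides.
slide_per_smx_kib = register_file_kib(REGISTERS_PER_SMX)
slide_total_kib = slide_per_smx_kib * SMX_PER_GK104

# The poster's guess: double the slide value (an assumption).
guess_per_smx_kib = 2 * slide_per_smx_kib
guess_per_scheduler_kib = guess_per_smx_kib // SCHEDULERS_PER_SMX
guess_total_kib = guess_per_smx_kib * SMX_PER_GK104

print(slide_per_smx_kib, slide_total_kib)
print(guess_per_smx_kib, guess_per_scheduler_kib, guess_total_kib)
```

The slide numbers give 256 kB per SMX and 2 MB for the chip; the doubled guess gives 512 kB per SMX, 128 kB per scheduler and 4 MB in total, matching the values in the post.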
     
  9. DarthShader

    Regular

    Joined:
    Jul 18, 2010
    Messages:
    350
    Likes Received:
    0
    Location:
    Land of Mu
    Page 3 of the Chinese preview, what else could I have?
     
  10. psolord

    Regular

    Joined:
    Jun 22, 2008
    Messages:
    444
    Likes Received:
    55

    +1000^1000

    If these 256bit Kepler cards are priced as ridiculously as the Tahiti cards, I will file a complaint for price fixing.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Probably because something else goes wrong on Fermi. The whole export limitation of the SMs is quite a fuckup in my opinion. Practically, GF100/110 does not have 50% more ROPs; if you factor that in, it effectively has the same number of ROPs (but lower clocked) as a Cayman/Tahiti, or even fewer.
     
  12. SimBy

    Regular

    Joined:
    Jun 21, 2008
    Messages:
    700
    Likes Received:
    391
    What I find ridiculous are claims that Tahiti is priced ridiculously.
     
  13. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That looks strange indeed. I've been staring at this image for a while also and did not come up with something conclusive yet.
     
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Right, it (Fermi GF100/b) has ROP excess, so to speak. But since the ROPs only do 4x MSAA in a single cycle and loop for 8x, that should make for an even smaller performance hit when switching to 8x. But it doesn't.
     
  15. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Maybe the RF/Scheduler block takes more area than we think? Especially if the schedulers are fully associative to the SIMD lanes and not bound to a subset like in GCN, i.e. every scheduler can issue an instruction to any SIMD. That would really be a huge overhead, if true. :???:
     
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    If Tahiti and GK104 perform the same and cost the same, then either both are priced ridiculously or neither is. The rest is personal opinion. I find $550 a bit much for a GPU if you can get a new iPad (it's gorgeous!) for the same price, but that's just me.
     
  17. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    The PolyMorph Engine, which is primarily responsible for tessellation, has also been upgraded to version 2.0 in the "Kepler" architecture. The integrated tessellator has been updated, with 2 times the computational efficiency of "Fermi" and a 4 times advantage over the Radeon HD7970.
     
  18. A1xLLcqAgt0qc2RyMz0y

    Veteran

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    Actually it was page 2 that had 768 mentioned. And these pages take forever to load.

    http://www.hkepc.com/7672/page/2#view

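    Here is a direct quote (translated from Chinese):

    "'GK-104' builds one GPC out of 2 SMX units; the core integrates 4 GPCs and 4 Raster Engines in total, which share a 768KB L2 cache, the same cache specification as the existing 'Fermi' series. However, 'Kepler' adds support for the PCI-E 3.0 specification, raising the transfer bandwidth between the GPU and the motherboard. NVIDIA has also changed the memory controller specification of the 'GK-104' core: it integrates only four 64-bit memory controllers, supporting 256-bit memory in total, which is lower than the 384-bit of the previous-generation GF110 and of its main rival, AMD's 'Tahiti' core."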

    How do we know that the 768 refers to Kepler and not Fermi?
     
  19. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,791
    Likes Received:
    1,596
    Yeah only took them what 5 years? :razz: About time they took a turn.

    Their absolute performance leadership is going to take a major hit though, if it exists at all.
     
  20. Mize

    Mize 3dfx Fan
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,079
    Likes Received:
    1,149
    Location:
    Cincinnati, Ohio USA
    Yeah, but can an iPad play Crysis on max at 50 fps? :)
     
