NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Kepler at the moment is:

    GK10x = 192 FP32 SPs + 8 FP64 SPs (24:1)
    GK110 = 192 FP32 SPs + 64 FP64 SPs (3:1)

    Off the top of my head, for synthesis alone you need about 0.025mm2 at 28nm for each FP64 unit at 1GHz. With 960 FP64 SPs in total on a GK110 you're at ~24mm2. That's not the final die area, but it gives a rough idea of how "big" those FP64 units really are in the end, or better yet, that the FP64 share of even a GK110 ALU is quite small.
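A quick sanity check of the arithmetic above, as a rough sketch rather than anything official. The per-SMX counts and the 960-unit total come from the post. The per-unit area constant is an assumption: it is the value that makes 960 units come out at the quoted ~24mm2.

```python
# Back-of-the-envelope check of the Kepler FP32:FP64 ratios and FP64 area.
# AREA_PER_FP64_MM2 is an assumption: the per-unit value implied by the
# quoted ~24 mm2 total for 960 units (synthesis only, 28nm, 1GHz).

FP32_PER_SMX = 192
FP64_PER_SMX = {"GK10x": 8, "GK110": 64}
GK110_TOTAL_FP64 = 960
AREA_PER_FP64_MM2 = 0.025


def fp32_to_fp64_ratio(chip: str) -> float:
    """FP32 units per FP64 unit inside one SMX."""
    return FP32_PER_SMX / FP64_PER_SMX[chip]


def gk110_smx_count() -> int:
    """Number of SMX blocks implied by 960 FP64 units at 64 per SMX."""
    return GK110_TOTAL_FP64 // FP64_PER_SMX["GK110"]


def gk110_fp64_area_mm2() -> float:
    """Total synthesis-only area of all FP64 units on GK110."""
    return GK110_TOTAL_FP64 * AREA_PER_FP64_MM2


if __name__ == "__main__":
    for chip in FP64_PER_SMX:
        print(f"{chip}: {fp32_to_fp64_ratio(chip):.0f}:1")
    print(f"GK110 SMX count: {gk110_smx_count()}")
    print(f"GK110 FP64 area: ~{gk110_fp64_area_mm2():.0f} mm2")
```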

    Why would they have a competitive disadvantage to AMD regarding DP?
     
  2. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Yeah, so, remove 64 sp alus and add (64 - 8) dp alus, issue sp through dp, size of SMX grows by a small amount. Mind you, I never understood why they didn't do that in the first place, maybe sp issue through dp wasn't finished/optimized in time for Kepler?
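One way to read the proposal above, worked through as a small sketch. It assumes the starting point is a GK10x SMX (192 SP + 8 DP units) and that "issue sp through dp" means every DP unit can also execute one SP op per clock; both are interpretations, not something the post spells out.

```python
from fractions import Fraction

# Hypothetical model of the proposed SMX; the function name and defaults
# are illustrative. Starting counts are taken from the GK10x figures.


def proposed_smx(sp=192, dp=8, sp_removed=64, dp_added=64 - 8):
    sp_only = sp - sp_removed        # dedicated SP units left
    dp_units = dp + dp_added         # DP units after the swap
    # Assumption: each DP unit can also issue one SP op per clock.
    sp_lanes = sp_only + dp_units
    return sp_only, dp_units, sp_lanes, Fraction(dp_units, sp_lanes)


if __name__ == "__main__":
    sp_only, dp_units, sp_lanes, dp_rate = proposed_smx()
    print(f"{sp_only} SP-only + {dp_units} DP units")
    print(f"SP lanes per clock: {sp_lanes}, DP:SP rate: {dp_rate}")
```

Under these assumptions the SMX keeps 192 SP lanes per clock while the DP rate moves from 1/24 to 1/3.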

    1/24 vs 1/4 issue rate?
     
  3. tviceman

    Newcomer

    Joined:
    Mar 6, 2012
    Messages:
    191
    Likes Received:
    0
    Who gives a rat's snout what the marketing name of the final product is. GM107 should be compared to GK107 because that is where the chip will fall in the hierarchy when the rest of the Maxwell family comes. If the leaked benches are true, then it will end up 65-75% faster than GK107 on the same node size. Accounting for TSMC's projections of 30% performance improvement at the same power consumption when moving to 20nm, that puts GM107 at ~100+% faster than GK107 when it's all said and done.
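The "~100+%" figure can be checked with a few lines. This is only a sketch: it assumes the architectural gain and the node gain compound multiplicatively, which the post implies but does not state, and the function name is made up for illustration.

```python
# Compound two fractional speedups (0.65 means "65% faster").
# Assumption: the gains multiply rather than add.


def combined_speedup(arch_gain: float, node_gain: float) -> float:
    return (1.0 + arch_gain) * (1.0 + node_gain) - 1.0


if __name__ == "__main__":
    node_gain = 0.30  # TSMC 20nm projection quoted in the post
    for arch_gain in (0.65, 0.75):
        total = combined_speedup(arch_gain, node_gain)
        print(f"{arch_gain:.0%} arch gain -> {total:.1%} faster overall")
```

With the quoted numbers this lands between roughly 115% and 128% faster, consistent with the "~100+%" estimate.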
     
    #843 tviceman, Feb 11, 2014
    Last edited by a moderator: Feb 12, 2014
  4. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    87
    Likes Received:
    48
    I just wonder if NVIDIA is interested in continuing their Titan product line (a.k.a. the desktop Tesla) in the era of Maxwell.

    It may hurt the sales of the more expensive Tesla lines, but it will put Intel's MIC in a very unhappy position as well. Tough choice, I guess.
     
  5. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    Making a hybrid ALU that can compute both 32-bit and 64-bit IEEE FP math is quite possible.
    Such shared designs save significant transistors compared to two independent dedicated units, but at the expense of extra power use to handle the switching between modes. GPUs are power constrained already, so hybrid ALUs are not an attractive design.
     
  6. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    To save quite a bit of power for double precision.

    I honestly doubt any IHV gains or loses an amount of sales worth mentioning over a few pissant GFLOPs of double precision on mainstream desktop GPUs.

    What do you mean, possible? NV used to have ALUs capable of both FP32 and FP64, and AMD still does. Scroll up and re-read roughly what each FP64 unit costs in die area; yes, it's many times better to dedicate a few dozen mm2 more in order to save power. If you don't understand why dedicated units save power, you might want to have a look at the exact same reasoning in ULP SoCs.
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Why the aggressiveness? The way I read it, you completely agree with his statement...
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I asked a very simple question and there's nothing in that post that "suggests" aggressiveness and no I don't agree with him.
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    It's still wild season and there's still not really any reliable information that I'd personally trust anywhere, which means it's the perfect time for me to do my traditional "make random guesses that turn out horribly wrong" post!

    - 128 ALU/SMX, 8 TMU/SMX, Hierarchical RF & Scheduling
    -- 4xDispatch 3xIssue (vs Kepler 8xDispatch 2xIssue) in NVIDIA Speak.
    -- 64KB L1/Shared Memory (higher effective bandwidth / fewer dispatchers).
    -- Advantages: Better locality for power efficiency, better GPGPU performance.
    -- Disadvantages: 3xIssue efficiency but fundamentally synergistic with hierarchical RF
    --> Overall only needs 2 MADDs to be co-issued with other port for everything else (potentially allows decoder savings rather than full duplication as well). Absolutely not a problem *IF* you have the register file throughput for it (which Hierarchical RF should allow in typical use-cases).

    - Multiple parts on 28nm but full family will wait for 16nm FinFET.
    -- Most chips except low-end will include 1+ Denver core to push developer adoption.
    -- 20nm is not sufficiently cost efficient for some time and not a big power improvement.
    -- 16nm obviously won't be either, but it will have a significant power advantage they can't miss.
    --> Obviously the big question is whether Big Maxwell will be on 28nm, 20nm, or 16nm. Given the new Titan SKU I'm betting it'll be on 16nm but a bit earlier in the lifecycle of the node than GK110.
     
  10. itaru

    Newcomer

    Joined:
    May 27, 2007
    Messages:
    156
    Likes Received:
    15
  11. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    797
    Likes Received:
    223
    Can I do one too?

    GM108 / GM107
    Maxwell architecture
    28 nm HPM
    128 CCs per SMX
    3 / 5 SMX
    64-bit / 128-bit memory interface
    CUDA compute capability 3.7
    No ARM cores

    GTX 750 / GTX 750 Ti
    GM107 with only 1 SMX disabled / GM107 with nothing disabled
    Core clock in the range [950, 1050) / [1000, 1050)
    Memory speed 5.0 Gbps / 5.4 Gbps
    50 W / 60 W TDP
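For reference, the guessed specs above translate into peak numbers roughly as follows. This is a sketch built on the speculation in this post, not on confirmed figures. It assumes 2 FLOPs per CUDA core per clock (one FMA), takes the lower bound of each guessed clock range, and assumes both SKUs use the full 128-bit bus; the function names are illustrative.

```python
# Peak throughput implied by the guessed GTX 750 / GTX 750 Ti specs.

CORES_PER_SMX = 128
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def peak_fp32_gflops(smx: int, clock_mhz: float) -> float:
    cores = smx * CORES_PER_SMX
    return cores * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000.0


def memory_bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits * gbps_per_pin / 8.0


if __name__ == "__main__":
    # (name, active SMX, lower-bound clock in MHz, bus width, Gbps per pin)
    skus = [
        ("GTX 750", 4, 950.0, 128, 5.0),
        ("GTX 750 Ti", 5, 1000.0, 128, 5.4),
    ]
    for name, smx, clock, bus, rate in skus:
        print(
            f"{name}: {smx * CORES_PER_SMX} CCs, "
            f"{peak_fp32_gflops(smx, clock):.1f} GFLOPS, "
            f"{memory_bandwidth_gbs(bus, rate):.1f} GB/s"
        )
```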
     
  12. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Umm no. Codename of a GPU is irrelevant for a consumer. What matters is market positioning relative to current or near future products. Like it or not, GM107 in 750 Ti will be compared and should be compared to GK106 in 650 Ti, because the implication here is that the former will replace the latter.
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Where's the Uttergram from hell? :runaway:
     
  14. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
  15. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
  16. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    More than happy to read any paper you (or spworley [thanks for your kind explanation!] or anyone else for that matter) want to point me at, but Dally's presentation had numbers for DP on 28nm for an unnamed nvidia product -- 20pJ for DP operation, 50pJ for register reads and 26pJ for local bus costs. I have no doubt that DP op costs far exceed SP op costs (int op was quoted earlier at .5pJ to a theoretical 50pJ dpmad), but I'm also assuming that SP memory read and transportation costs scale linearly, which means that the energy costs remain higher for reading and transporting the args than for running the wider unit. Which isn't to say that the costs are insignificant, but the whole argument of the presentation seems to indicate that op-cost isn't the cost that nvidia are focused on. It certainly did not leave me with the impression that the power cost of those units is as onerous as you are suggesting.
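To put the quoted figures side by side, here is a minimal sketch of the per-operation energy split. The three picojoule values are the ones cited from the presentation; treating them as the complete budget for one DP operation is an assumption made only for illustration.

```python
# Share of energy spent on the DP operation itself versus moving operands.
# Assumption: these three components are the whole per-op budget.

ENERGY_PJ = {
    "op": 20.0,             # DP operation
    "register_reads": 50.0,
    "local_bus": 26.0,
}


def energy_shares(costs: dict) -> dict:
    total = sum(costs.values())
    return {name: value / total for name, value in costs.items()}


if __name__ == "__main__":
    shares = energy_shares(ENERGY_PJ)
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
    movement = shares["register_reads"] + shares["local_bus"]
    print(f"data movement total: {movement:.1%}")
```

On these numbers the arithmetic is about a fifth of the total, with the rest going to reading and transporting the arguments.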

    The point is further underlined by noting that in scaling from 40nm to 10nm, dp costs are forecast to improve by a factor of 8, while transport is only expected to improve by a factor of 2. Maxwell was supposed to be a 20nm design, scaling benefits should be tipping in favor of optimizing for local access rather than alu sizing.
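The scaling argument can be sketched the same way. The 8x and 2x improvement factors are the forecasts quoted above. Applying them to the 20pJ op cost and the 76pJ (50 + 26) of register and bus cost is purely illustrative: those figures are for 28nm, not 40nm, and lumping register reads in with transport is an assumption.

```python
# How the balance between compute and data movement shifts with scaling.


def movement_to_op_ratio_growth(op_gain: float = 8.0,
                                transport_gain: float = 2.0) -> float:
    """Factor by which (movement cost / op cost) grows, for any baseline."""
    return op_gain / transport_gain


def scaled_op_share(op_pj: float = 20.0, movement_pj: float = 76.0,
                    op_gain: float = 8.0, transport_gain: float = 2.0) -> float:
    """Op share of total energy after scaling (hypothetical baseline)."""
    op = op_pj / op_gain
    movement = movement_pj / transport_gain
    return op / (op + movement)


if __name__ == "__main__":
    print(f"movement/op ratio grows {movement_to_op_ratio_growth():.0f}x")
    print(f"op share after scaling: {scaled_op_share():.1%}")
```

Whatever the baseline, data movement becomes four times more expensive relative to the arithmetic, which is the point being made about optimizing for local access.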

    ...today.
    Today, you can still sell desktop GPUs. I'd argue we're moving into a world where the number of desktops that aren't workstations is minimal. Tablets are eating laptops and desktops ( http://www.computerworld.com/s/arti...ments_will_surpass_desktops_and_laptops_in_Q4 ). I agree with you, though, that the issue here is a business decision. I would argue that it is nearly 100% a business decision. Nvidia needs to maintain margin by delivering a tiered product set. The question I wonder about is whether dp op-rate is the right feature to focus on. The question nvidia should be asking is, how do they best preserve their workstation market. My argument is that they are vulnerable to competition at the workstation level that charges less, not that they are likely to lose desktop gpu sales based on dp rate (which, I agree, would be silly).

    [Edit: and note, it appears that we are looking at 128-wide/640 sp alus, so I'm happy to take any links you want to offer and shut up in the Maxwell thread and wait for the next iteration :>]
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Unless you can power gate these dedicated units, a hybrid design that uses fewer transistors is likely to be better.
     
  18. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    805
    Likes Received:
    1,634
    They can be clock gated as well.
     
  19. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    You can probably use fine-grained clock gating in hybrid designs, so coarse clock gating above that isn't going to buy a lot unless there's a lot of added area.

    If there are companies using hybrid and separate units we can be sure hybrid designs can be attractive and there's not a clear winner.
     
  20. DSC

    DSC
    Banned

    Joined:
    Jul 12, 2003
    Messages:
    689
    Likes Received:
    3
    http://videocardz.com/49557/exclusive-nvidia-maxwell-gm107-architecture-unveiled

     
    #860 DSC, Feb 12, 2014
    Last edited by a moderator: Feb 12, 2014