NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

Tags:
  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    What's wrong with which of their diagrams? They look fine to me and are certainly more accurate than most of the other fabricated or copied stuff out there.
     
  3. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,491
    Likes Received:
    909
    The only thing that might curb our enthusiasm in this article is the idea that NVIDIA saved power by removing all the inter-GPC interconnect logic (just like in Tegra K1) which won't be possible in bigger Maxwell chips.

    But I think most of the savings come from the redesigned SMs and extra L2 cache, so I'm not too worried about scaling.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    I especially like the "new" illustration of the old SMX. It seems to clarify a lot of things - the old ones looked like someone just threw all the units in there. It also shows that the split into subunits isn't really anything new with Maxwell; the principle is the same, only what is and isn't shared between 2 subunits differs (and that already changed with gk20x for the TMUs). Does that come from some new nvidia marketing material?
     
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,416
    Likes Received:
    178
    Location:
    Chania
    There's definitely a point behind it, exactly because those interconnects are quite complex beasts; I'd love to stand corrected, but if true, "all" should be valid for K1 only.
     
  6. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,491
    Likes Received:
    909
    K1 can do without any inter-SM logic (and wires) whereas GM107 could only get rid of inter-GPC stuff. But from what I understand, it's the distributed geometry that's tricky, i.e. the inter-GPC part.

    I sure would love for NVIDIA to give us more information.
     
  7. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    609
    Likes Received:
    1,036
    Location:
    PCIe x16_1
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Not really. The marketing material was this big block of 256 (sic!) green squares that was Kepler and the 4x 32 with individual control that is Maxwell. I guess Damien's very good diagrams come from clever analysis rather than the Press_Preso.pdf copy-and-paste that many sites resorted to.
     
  9. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The same argument would hold for a GK107. I don't remember that one being praised for extraordinary perf/W, and that perf/W bar chart shows a GTX650 sitting at 62% of the 750Ti. One of the better results among the other Keplers (which sit around the 56% mark), but not earth-shattering.

    Yeah, I'm not oblivious to the fact that interconnect has a certain cost, but I don't think it's significant compared to something as intensive as an SM.
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Ah, good to know. I guess you spoke to a different person than Carsten did :). All the more impressive then. Though I believe gk208 was actually the most power-efficient Kepler chip, it is never included in comparisons since there are almost zero reviews of the useful variant (gt640 with 64-bit gddr5).
     
  11. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,908
    Likes Received:
    1,607
    I think EVGA offers HDMI and DisplayPort with their 750 product offerings ... some even have a "bonus 6-pin power input connector [that] provides an additional 25 watts, giving you an increase in power of 35%!"

    http://www.guru3d.com/news_story/evga_geforce_gtx_750_and_gtx_750_ti.html
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,166
    Likes Received:
    1,836
    Location:
    Finland
    I suppose it depends on which chip you're comparing to which chip. ExtremeTech tested khash/watt against the R9 270, and the R9 270 won (though just barely):
    http://www.extremetech.com/gaming/1...per-efficient-quiet-a-serious-threat-to-amd/3
    [charts: ExtremeTech khash/watt comparison]

    Tom's numbers for the Radeons seem a tad low, based on the R9 270 at least, but one has to remember there are ~10% differences from one card to the next within the same model, and slight adjustments to clocks or voltages can cause huge variations (well beyond 10%) too, for good or bad.
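    A quick sketch of what these khash/watt comparisons reduce to, with made-up illustrative numbers (not the actual ExtremeTech or Tom's measurements), showing how a ±10% card-to-card spread can flip a close ranking:

    ```python
    # Hash-rate-per-watt comparison with hypothetical figures.
    def khash_per_watt(khash: float, watts: float) -> float:
        return khash / watts

    # Assumed (illustrative) hash rates and board power:
    gtx750ti = khash_per_watt(265.0, 60.0)   # ~4.4 khash/W
    r9_270   = khash_per_watt(430.0, 95.0)   # ~4.5 khash/W, barely ahead

    # The same R9 270 model drawing 10% more power (normal card-to-card
    # variation) drops below the 750 Ti:
    r9_270_hot = khash_per_watt(430.0, 95.0 * 1.10)

    print(gtx750ti, r9_270, r9_270_hot)
    ```

    The point being that with results this close, per-sample variation alone can decide the "winner".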
     
  13. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    There are 2 things I find odd with their SMM diagram:

    1. L1 capacity per SMM is now less than 24 KB (I say less, since it seems to now be shared with texture data that previously had its own read-only cache). Previously it could be configured as 16, 32 or 48 KB per SMX (IIRC). The reduction seems like it might introduce performance portability problems. I do understand that this chip is not primarily targeted at compute workloads. Maybe L1 is less important for graphics, and this just doesn't matter?

    2. They show 2 DP units shared between 2 blocks of 32 SP "cores". This would seem to require more cooperation between warp schedulers than strictly necessary, which the white-paper linked earlier in this thread talks about avoiding. It also turns DP ops into a variable latency thing. Why do that when each scheduler could just be given a single DP unit instead?

    Anyway, I'm not saying their diagrams are wrong. Just that I won't believe them until I've seen the tests they used that led them to draw things the way they have. I'm especially interested in how they are determining L1 size and which units are shared.
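    Point 1 above can be put in back-of-the-envelope terms as per-thread L1 capacity. All sizes here are assumptions taken from this thread and public Kepler documentation, not confirmed Maxwell figures:

    ```python
    # Per-thread L1 capacity under maximum occupancy (illustrative).
    KB = 1024

    def bytes_per_thread(cache_bytes: int, resident_threads: int) -> float:
        return cache_bytes / resident_threads

    # Kepler SMX: the L1/shared split could be configured to give up to
    # 48 KB of L1, with up to 2048 resident threads per SMX.
    kepler_best = bytes_per_thread(48 * KB, 2048)   # 24 bytes/thread

    # Maxwell SMM per the diagram discussed: ~24 KB of L1, now shared with
    # texture data, again assuming 2048 resident threads.
    maxwell = bytes_per_thread(24 * KB, 2048)       # 12 bytes/thread

    print(kepler_best, maxwell)
    ```

    At full occupancy either figure is tiny, which may be why the reduction matters less for graphics than it sounds.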
     
  14. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    The primary task of the L1 on Kepler is caching register spills, whilst in CUDA the texture cache serves the purpose of the L1 data cache commonly seen in other compute architectures.

    So combining the L1 and texture cache on Maxwell seems like a reasonable step to me. Also, don't forget Maxwell significantly increases the L2 cache size and reduces its latency as well.

    I am quite excited about the possibility of significant improvements to bit-wise and integer ops on Maxwell, though we still need more evidence to prove that.
     
  15. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,486
    Likes Received:
    397
    Location:
    Varna, Bulgaria
    The texture cache is fixed at 12K per TMU quad; that's what NV has been using for many generations, and apparently it's the presumed size for Maxwell, since there's no other new information on the subject.
     
  16. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,844
    Likes Received:
    4,456
    After looking at some reviews, I think this is a case of a very interesting chip that was put in a mediocre product.


    The GM107 does seem to be quite a big step in terms of performance/watt. It seems to be nVidia's path into the notebook market, where the GK107/GK208 might become irrelevant as soon as Broadwell comes out.
    Moreover, it also seems to be a great step into the computing/mining market, where AMD scores all the points for the moment... though looking at how the ASICs are taking over, that train may already be lost.

    If they can ever translate these efficiency gains to the next Tegra, even better. A Tegra M1 with the same performance as the K1 but half the power consumption might even fit into a smartphone. Plus, the reduced memory bandwidth requirements due to the increased on-chip cache could make for an even better performance upgrade on mobile, where GDDR5 speeds are still prohibitive.




    Now, the Geforce 750 and 750 Ti are... IMHO mediocre products.
    Their performance/€ isn't any better than the competition's, or even nVidia's former product line's.
    Sure, the new cards are more power efficient but even for a 50W difference, it's not like the people who buy these cards will be using them 24/7, or even 8 hours/day.
    Even for people who want this for an always-on media center, the card will be idling most of the time, and during the idle period most modern cards already use next to nothing.
    As for PSUs, how many people will be able to use a 75W graphics card but not a 120W one?

    And if nVidia really wanted this to be a media center card, they should've built the reference card as a low-profile model.
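    The power-savings argument above is easy to sanity-check with some arithmetic. The electricity price is an assumption (0.25 EUR/kWh, purely hypothetical):

    ```python
    # Annual running-cost delta for a given extra power draw.
    def annual_cost_eur(watts: float, hours_per_day: float,
                        eur_per_kwh: float = 0.25) -> float:
        kwh_per_year = watts / 1000 * hours_per_day * 365
        return kwh_per_year * eur_per_kwh

    # A 50 W delta at 2 h/day of gaming vs. an always-on 8 h/day box:
    print(annual_cost_eur(50, 2))   # ~9 EUR/year
    print(annual_cost_eur(50, 8))   # ~36.5 EUR/year
    ```

    So even under generous usage assumptions, the efficiency gain takes years to pay back a price premium, which supports the point about typical desktop buyers.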
     
  17. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    That's a good point indeed; it would seem to make more sense if they aren't shared. And while it might not matter much for this chip (as there's not much data to move), it seems like you wouldn't want shared DP units for the HPC chip (but presumably you'd want to retain the same general structure). FWIW, it looks to me like the HPC chip would need either a 1:2 or 1:4 DP/SP ratio in any case.
    As for sharing, though, technically you wouldn't need to share TMUs either. While it is true that pixel quad processing really requires quad tmus, nothing requires them to deliver 4 filtered outputs per clock - you could easily have 4 quad tmus requiring 2 clocks to deliver the results. I don't think I've seen such designs, though, in anything but the ultra-slow category (that is, chips not capable of delivering 4 filtered texels per clock in total), and there's probably a reason for this...
     
  18. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    You think it's a mediocre product because you're an elitist and don't care about cards at this level of performance (sorry for phrasing it like that).
    It's fine, e.g. you can put it in a micro ATX tower without clogging some of the drive bays, and use a 300W PSU.
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Earlier, L2 per core was almost the same as L1, so L1 misses almost always went to DRAM. Now there is a large L2, so they can cut back on L1 without increasing overall miss rate.
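    This trade-off is just multiplied miss rates. A minimal sketch, with illustrative (not measured) hit rates:

    ```python
    # Fraction of accesses that miss both L1 and L2 and go to DRAM.
    def dram_access_rate(l1_hit: float, l2_hit: float) -> float:
        return (1 - l1_hit) * (1 - l2_hit)

    # Old style: decent L1, but an L2 barely bigger than the L1s it backs,
    # so L1 misses rarely hit in L2.
    old = dram_access_rate(l1_hit=0.60, l2_hit=0.10)  # 0.36

    # New style: a smaller L1 hits less often, but a much larger L2
    # catches most of the spillover.
    new = dram_access_rate(l1_hit=0.45, l2_hit=0.70)  # 0.165

    print(old, new)
    ```

    With a large enough L2 behind it, shrinking the L1 can still lower overall DRAM traffic, which is the argument above.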
     
  20. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    I won't claim obviously that all details in those diagrams are correct. There are still some details Nvidia is not willing to share that are difficult to extract from testing :/ However, compared to the official material, I think they get much closer to the reality :)

    1. Nvidia confirmed to me it is 12 KiB per 4-TMU block. I actually asked them if that wouldn't be a perf issue for some compute tasks, but they feel it shouldn't be an issue for THIS particular GPU. Plus, don't forget that register pressure will be lowered a little thanks to the shorter arithmetic pipeline.

    2. At first I actually placed one single DP unit in each partition, thinking it would be the obvious path to expand DP rate for big Maxwell. However Nvidia corrected me and said those DP units were outside the partition.
     