Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    Probably, just like they replaced tensor cores with FP16 units in the GTX versions of the Turing chips, they just can. This time around they'll likely replace them with simpler tensor cores.

    IMO they'll most likely shrink the L1 to 128KB, just as Turing got reduced to 96KB from Volta's 128KB. Still a nice improvement IMO, tho not keeping those 192KB will likely make game/shader programmers cry, lol.

    Another thing that is likely to go is the massive 40MB (48MB? on full die?) L2 cache. 12MB I can see happening.

    Any other thoughts?
     
    ShaidarHaran likes this.
  2. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    That approach makes sense, although I would point out that if NV keeps the 4 tensor core/SM layout for consumer Ampere parts, that will result in a tensor core deficit compared to Turing. Perhaps they will carry forward the Turing tensor cores to consumer Ampere?

    I suspect you are correct about reducing the L1 (and certainly the L2) caches for consumer Ampere. 40MB L2 (48MB with 8MB disabled due to disabled SMs) has surely got to occupy a large die area. 12MB L2 for GA102 would be double that of TU102, but still a massive saving over GA100, I think that's a reasonable estimate.

    Some quick math shows that a 48MB L2 cache should occupy somewhere around 44mm^2 of die area on TSMC's 7nm HP node, given a cell density of 64.98 MTr/mm^2 and an approximate transistor count of 60M per 1MB of cache. Reducing this to 12MB gives roughly 11mm^2, shaving about 33mm^2 off GA100's die for GA102. Not quite down to the 600-650mm^2 range yet, but I suspect the additional changes I mentioned previously ought to get there.
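    The quick math above can be sketched as follows; the density figure and the transistors-per-MB estimate are the post's own assumptions, not measured values:

    ```python
    # Back-of-envelope L2 die-area estimate. Both constants are the
    # assumptions quoted in the post (64.98 MTr/mm^2 for TSMC 7nm,
    # ~60M transistors per 1MB of SRAM cache), not official figures.
    DENSITY_MTR_PER_MM2 = 64.98   # assumed logic density, MTr/mm^2
    MTR_PER_MB_CACHE = 60.0       # assumed transistors per 1MB of cache, in millions

    def l2_area_mm2(cache_mb: float) -> float:
        """Estimated die area (mm^2) of an L2 cache of the given size."""
        return cache_mb * MTR_PER_MB_CACHE / DENSITY_MTR_PER_MM2

    full = l2_area_mm2(48)        # full GA100 L2
    cut = l2_area_mm2(12)         # hypothetical GA102 L2
    print(f"48MB: {full:.1f} mm^2, 12MB: {cut:.1f} mm^2, "
          f"saved: {full - cut:.1f} mm^2")
    ```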
     
    #42 ShaidarHaran, May 14, 2020
    Last edited: May 14, 2020
  3. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    It has almost the same die size (826 vs 815mm2) as GV100 and a lower clock speed (1410 vs 1530MHz), yet power consumption is 100W higher (400 vs 300W). Is it not on TSMC N7 EUV, or is it on the same process as AMD's RDNA 1 GPUs?
     
  4. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    208
    Likes Received:
    137
    He says "optimized for Nvidia"
     
  5. Oh well...

    At least they released a non-completely-castrated lower cost version of the Xavier NX devkit, complete with GPIOs, I2C, MIPI-CSI, etc.
    It's a very interesting solution for IIoT development.
     
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,462
    Location:
    Finland
    That's most likely more marketing than anything else, just like "12FFN" was
     
  7. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Yeah because nothing works without lotsa DTCO these days.
     
  8. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Where did you get that figure from?
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Not seeing where that assessment is coming from. I listened closely again to a whole minute before he mentions transistor budget, and nothing indicates that he's talking about transistors.
    Given the large amount of tightly packable SRAM, I am inclined to give Nvidia a bit more leeway here, since they have usually had less dense chips than AMD on the same process. But at +50%, it looks more like one of the figures, AMD's or Nvidia's, is not telling the whole truth.
     
  10. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    I haven't seen that figure officially mentioned anywhere, but if we assume the SM block diagram I posted is even remotely analogous to the real thing, 1/3 looks about right to me.
     
  11. Benetanegia

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
    He's talking about AI performance. I'm 99% sure there's not a single mention of general compute performance in the entire video (he clearly states training and inference performance in the same sentence as he mentions transistor budget, for example), and maybe not in the entire presentation. It's pretty much all about AI and the new tensor core capability and performance. It makes much more sense that he's talking about the size of the relevant silicon rather than everything else. Especially when reading it the other way creates the problem of not aligning with the provided transistor count (by 17 billion, no less!), so you have to come up with some sort of "someone is lying" scenario to explain the discrepancy.
     
  12. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    [IMG]

     
    #52 Man from Atlantis, May 14, 2020
    Last edited: May 14, 2020
    sonen, nnunn, DavidGraham and 6 others like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    It’s not.
     
  14. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,393
    You don't really need tensor cores specifically to run workloads which are targeted at tensor cores. They can easily cut a lot out of GA100's tensor core capabilities and just run that code on the general SIMDs. They can probably scale down the matrix size as well, making them slower.
    That said, it's worth remembering that "gaming" GPUs are used by NV in Tesla parts targeting AI inferencing. So it's quite possible that they won't cut anything but the FP64 math here.
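    The fallback described above, running matrix-multiply-accumulate code on the general SIMDs instead of dedicated tensor cores, can be sketched as plain FMA loops over a tile. This is a toy illustration only; the tile shape and function name are made up here, not NVIDIA's actual WMMA path:

    ```python
    # Toy sketch: the D = A*B + C tile operation that tensor cores
    # accelerate in hardware, expressed as the scalar fused-multiply-add
    # loops a general SIMD / CUDA-core path would execute instead --
    # functionally equivalent, just far slower per tile.
    def mma_tile(A, B, C, m=4, n=4, k=4):
        """D = A @ B + C on an m x k, k x n, m x n tile, as scalar FMAs."""
        D = [[C[i][j] for j in range(n)] for i in range(m)]  # accumulator starts at C
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    D[i][j] += A[i][p] * B[p][j]  # one FMA per step
        return D

    # Identity @ identity + zeros should reproduce the identity tile.
    I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    Z = [[0.0] * 4 for _ in range(4)]
    print(mma_tile(I, I, Z))
    ```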
     
    pharma likes this.
    Did anyone else notice how the dGPUs in the Drive L5 "Robotaxi" are a smaller Ampere with 4 HBM2 stacks?

    [IMG]

    Nvidia is saying each of these GPUs can go as high as 400W.
     
    Lightman and Man from Atlantis like this.
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,462
    Location:
    Finland
    Not quite, each Orin can go at least up to 45W (the L2+ spec), but @Ryan Smith suggests they could go as high as 65 - 70W each. Then there's whatever that daughterboard along the upper edge is, too. The whole platform is supposed to be 800W.
     
  17. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
  18. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    2 pops? :wink2:
     
    pharma and nnunn like this.
  19. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    780
    Location:
    EU-China
    [IMG]
    To put into perspective...
     
    Newguy, pharma, DavidGraham and 4 others like this.
  20. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    The sparse INT support just looks like it's for poorly pruned deployment neural nets to begin with; how else would you zero half the weights with no effect on the outcome?

    Well, I suppose an easy way to optimize is highly tempting for a lot of devs, and locking them into an Nvidia-only supported mode is good for Nvidia.
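    For context, the sparsity scheme in question is 2:4 structured pruning: in every group of four weights, two are zeroed. A minimal sketch of the common magnitude-based variant (whether the "no outcome effect" claim holds depends on the network and on any fine-tuning after pruning; this only shows the mechanical 50% zeroing):

    ```python
    # Sketch of 2:4 structured magnitude pruning: in each group of 4
    # weights, keep the 2 largest by absolute value and zero the other 2.
    # This is the generic technique, not NVIDIA's exact tooling.
    def prune_2_of_4(weights):
        """Zero the 2 smallest-magnitude weights in every group of 4."""
        assert len(weights) % 4 == 0
        out = []
        for i in range(0, len(weights), 4):
            group = list(weights[i:i + 4])
            # Indices of the 2 largest-magnitude entries in this group.
            keep = sorted(range(4), key=lambda j: abs(group[j]))[2:]
            out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
        return out

    print(prune_2_of_4([0.9, -0.1, 0.05, -0.8, 0.3, 0.2, -0.25, 0.01]))
    ```

    Exactly half of the output weights end up zero, which is what lets the hardware skip those multiplies.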
     