Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Benetanegia

    Benetanegia Regular

    Probably just like they replaced tensor cores with dedicated FP16 units in the GTX vs RTX Turing chips — they just can. This time around they'll just replace them with simpler tensor cores.

    IMO they'll most likely shrink it to 128KB, just as Turing's L1 got reduced to 96KB from Volta's 128KB. Still a nice improvement IMO, though not keeping those 192KB will likely make game/shader programmers cry, lol.

    Another thing that is likely to go is the massive 40MB (48MB on the full die?) L2 cache. 12MB I can see happening.

    Any other thoughts?
     
    ShaidarHaran likes this.
  2. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    That approach makes sense, although I would point out that if NV keeps the 4 tensor core/SM layout for consumer Ampere parts, that will result in a tensor core deficit compared to Turing. Perhaps they will carry forward the Turing tensor cores to consumer Ampere?

    I suspect you are correct about reducing the L1 (and certainly the L2) caches for consumer Ampere. 40MB of L2 (48MB on the full die, with 8MB disabled alongside the disabled SMs) has surely got to occupy a large die area. 12MB of L2 for GA102 would be double that of TU102 but still a massive saving over GA100; I think that's a reasonable estimate.

    Some quick math shows that a 48MB L2 cache should occupy somewhere around 44mm^2 of die area on TSMC's 7nm HP node, given a cell density of 64.98 MTr/mm^2 and an approximate transistor count of 60M per 1MB of cache. Reducing this to 12MB gives roughly 11mm^2, shaving ~33mm^2 off GA100's die for GA102. Not quite down to the 600-650mm^2 range yet, but I suspect the additional changes I mentioned previously ought to get us there.
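    The arithmetic is easy to check; using only the assumptions quoted in the post (60 MTr per 1MB of cache, 64.98 MTr/mm^2 density — both rough figures, not official numbers), it lands at roughly 44mm^2 for 48MB and 11mm^2 for 12MB:

```python
# Back-of-the-envelope L2 die-area estimate, using the figures from the post.
# Both constants are assumptions, not official Nvidia/TSMC data.

MTR_PER_MB = 60.0   # million transistors per 1MB of cache (incl. tags/overhead)
DENSITY = 64.98     # MTr/mm^2 on TSMC 7nm

def l2_area_mm2(megabytes: float) -> float:
    """Estimated die area of an L2 cache of the given size."""
    return megabytes * MTR_PER_MB / DENSITY

full = l2_area_mm2(48)  # GA100 full-die L2
cut = l2_area_mm2(12)   # hypothetical GA102 L2
print(f"48MB: {full:.1f} mm^2, 12MB: {cut:.1f} mm^2, saved: {full - cut:.1f} mm^2")
# → 48MB: 44.3 mm^2, 12MB: 11.1 mm^2, saved: 33.2 mm^2
```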
     
    Last edited: May 14, 2020
  3. It has almost the same die size (826 vs 815mm2) as GV100 and a lower clock speed (1410 vs 1530MHz), yet power consumption is 100W higher (400 vs 300W). Is it not on TSMC N7 EUV, or is it on the same process as AMD's RDNA 1 GPUs?
     
  4. del42sa

    del42sa Newcomer

    he says "optimized for Nvidia"
     
  5. Oh well...

    At least they released a non-completely-castrated lower cost version of the Xavier NX devkit, complete with GPIOs, I2C, MIPI-CSI, etc.
    It's a very interesting solution for IIoT development.
     
  6. Kaotik

    Kaotik Drunk Member Legend

    That's most likely more marketing than anything else, just like "12FFN" was
     
  7. Bondrewd

    Bondrewd Veteran

    Yeah because nothing works without lotsa DTCO these days.
     
  8. CarstenS

    CarstenS Legend Subscriber

    Where did you get that figure from?
     
  9. CarstenS

    CarstenS Legend Subscriber

    Not seeing where that assessment is coming from. I listened again closely to a whole minute before he mentions transistor budget and nothing indicates that he's talking about transistors.
    Given the large amount of tightly packable SRAM, I am inclined to give Nvidia a bit more leeway here, since they have usually had less dense chips than AMD on the same process. But +50% looks more like a sign that one figure, AMD's or Nvidia's, is not telling the whole truth.
     
  10. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    I haven't seen that figure officially mentioned anywhere, but if we assume the SM block diagram I posted is even remotely analogous to the real thing, 1/3 looks about right to me.
     
  11. Benetanegia

    Benetanegia Regular

    He's talking about AI performance. I'm 99% sure there's not a single mention of general compute performance in the entire video (he explicitly names training and inference performance in the same sentence as the transistor budget, for example), and maybe not in the entire presentation. It's pretty much all about AI and the new tensor core capability and performance. It makes much more sense that he's talking about the size of the relevant silicon rather than the whole chip. Especially since taking it the other way creates the problem of not aligning with the published transistor count (off by 17 billion, no less!) and forces you to come up with some sort of "someone is lying" scenario to explain the discrepancy.
     
    [IMG]

     
    Last edited: May 14, 2020
    sonen, nnunn, DavidGraham and 6 others like this.
  13. trinibwoy

    trinibwoy Meh Legend

    It’s not.
     
  14. DegustatoR

    DegustatoR Veteran

    You don't really need tensor cores specifically to run workloads that are targeted at tensor cores. They can easily cut a lot out of GA100's tensor core capabilities and just run that code on the general SIMDs. They can probably scale down the matrix size as well, making them slower.
    But with that being said, it's worth remembering that "gaming" GPUs are used by NV in Tesla parts targeting AI inferencing. So it's kinda possible that they won't cut anything but FP64 math here.
     
    pharma likes this.
  15. Did anyone else notice how the dGPUs in the Drive L5 "Robotaxi" are a smaller Ampere with 4 HBM2 stacks?

    [IMG]

    Nvidia is saying each of these GPUs can go as high as 400W.
     
    Lightman and Man from Atlantis like this.
  16. Kaotik

    Kaotik Drunk Member Legend

    Not quite. Each Orin can go at least up to 45W (the L2+ spec), but @Ryan Smith suggests they could go as high as 65-70W each. Then there's whatever that daughterboard on the upper edge is, too. The whole platform is supposed to be 800W.
     
  17. pharma

    pharma Veteran

  18. TheAlSpark

    TheAlSpark Moderator Moderator Legend

    2 pops? :wink2:
     
    pharma and nnunn like this.
  19. xpea

    xpea Regular

    [IMG]
    To put into perspective...
     
    Newguy, pharma, DavidGraham and 4 others like this.
  20. Frenetic Pony

    Frenetic Pony Regular

    The sparse INT support just looks like it's for poorly pruned deployment neural nets to begin with; how else could you zero half the weights with no effect on the outcome?

    Well, I suppose an easy way to optimize is highly tempting for a lot of devs, and locking them into an Nvidia-only supported mode is good for Nvidia.
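    For context, Ampere's sparse mode expects 2:4 structured sparsity — exactly two of every four consecutive weights zeroed — which is why a net has to be pruned to that specific pattern to benefit. A minimal magnitude-based pruning sketch in NumPy (purely illustrative, not Nvidia's actual tooling):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 (2:4 sparsity)."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest |w| in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, 0.7, -0.3, 0.2, 0.8, -0.01])
print(prune_2_4(w))
# → [ 0.9  0.   0.   0.7 -0.3  0.   0.8  0. ]
```

    The hardware then stores only the nonzero half plus 2-bit indices, which is where the 2x throughput claim comes from.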
     