NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

  1. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Are you talking about Gaming-Ampere here? Because:
[attached image: upload_2021-8-5_12-53-57.png]
That seems to indicate that they can utilize both the FP32 and FP64 cores at full rate simultaneously with FP16 precision.
     
    #241 CarstenS, Aug 5, 2021
    Last edited: Aug 5, 2021
    Lightman likes this.
  2. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
Yes, because GA100 doesn't have a combined INT32/FP32 SIMD, and in its case the FP16 rate is double the FP32 rate.
Edit: or is it even 4x? GA100 has twice the number of TCs per SM; I always forget this one.
     
    CarstenS likes this.
  3. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
Yes, but the problem is that the science community is moving forward, exploring the use of AI/DL for HPC workloads: https://developer.nvidia.com/blog/ai-detects-gravitational-waves-faster-than-real-time/
    And some information about mixed precision from Oak Ridge: https://www.olcf.ornl.gov/2021/06/28/benchmarking-mixed-precision-performance/
     
    PSman1700 likes this.
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    "Tensors", no. FP16 most definitely yes. Or should do, because FP16 is relevant in gaming.

And what we appear to have discovered is that two distinct hardware threads must be issued across the two FP32-capable datapaths in order to get at those FLOPS. What's interesting about that is that you need twice as many hardware threads in flight on the SM partition as would be required were there just a single FP32 (combined with integer) SIMD, e.g. as seen in RDNA.

    That makes occupancy more brittle - or if you prefer it makes performance variation with count of hardware threads in the partition more brittle.

RDNA appears to spend relatively more transistors on scheduling: there are fewer SIMDs per scheduler. AMD's focus in RDNA appears to have been on reduced brittleness: reducing the ratio between best and worst cases, which also requires more, more-local, cache and bigger register files. So expect more of the same in RDNA 3.

    The change from VLIW to "scalar" was very much driven by the desire to reduce brittleness. A lot of problems were seen with instructions that were scalar or vec2, "wasting" VLIW throughput.

    It'll be interesting to see if 2022 brings us conditional routing to help with the problems caused by divergent control flow. As ray tracing becomes dominant for AAA graphics, it seems that branching is getting harder to avoid in shaders. Brittleness there truly is disastrous.
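The occupancy effect described above can be sketched with a toy throughput model (purely illustrative assumptions, not measured numbers): if each FP32 datapath in a partition must issue from a distinct hardware thread, achieved throughput only reaches peak once the resident-warp count matches the datapath count.

```python
# Toy model of per-partition FP32 issue (illustrative assumption, not
# measured data): a dual-datapath design needs two resident warps per
# partition to reach peak, a single-datapath design needs only one.

def achieved_fp32_fraction(resident_warps: int, datapaths: int) -> float:
    """Fraction of peak FP32 issue rate, assuming each datapath must
    issue from a distinct hardware thread (warp)."""
    if resident_warps < 0 or datapaths < 1:
        raise ValueError("invalid arguments")
    return min(resident_warps, datapaths) / datapaths

for warps in range(4):
    dual = achieved_fp32_fraction(warps, 2)    # Ampere-like dual-datapath partition
    single = achieved_fp32_fraction(warps, 1)  # single-SIMD (RDNA-like) baseline
    print(f"{warps} warps: dual-path {dual:.0%}, single-path {single:.0%}")
```

In this model the dual-datapath design sits at 50% of peak with one warp per partition where the single-SIMD design is already at 100%, which is the "brittleness" being described.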
     
    T2098, Lightman and DavidGraham like this.
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    The individual peak rates for each data type don’t mean that the chip can sustain all of those rates concurrently. Clearly gaming Ampere can’t run peak FP32 at the same time as peak INT32 because they share the same execution units. What we don’t know is which pipes can run concurrently with tensor FP16.
     
    DegustatoR likes this.
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    How relevant though? Are FP16 ops a significant percentage of total gaming flops?

    We don't actually know how flexible the scheduling is. For all we know the scheduler can pick an op from another warp or an independent op from a warp that's already running on the other pipe. If it can take advantage of either ILP or TLP it's a lot less worrisome. Maybe someone will bother to run micro benchmarks on Ampere and figure it out.
     
  7. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
I did not mean to imply that this was the case. Rather, that to achieve 78 FP16 TFLOPS on GA100, you would need to utilize the FP32 and the FP64 units simultaneously with packed math.
But since DegustatoR confirmed that he was referring not to Ampere in general but to Gaming-Ampere, my post is not relevant here anyway.
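For what it's worth, that decomposition is arithmetically consistent with A100's published peak rates (19.5 TF FP32, 9.7 TF non-tensor FP64): 2-wide packed FP16 on the FP32 cores plus an assumed 4x FP16 rate on the FP64 cores lands on ~78 TFLOPS. The 4x factor for the FP64 cores is an inference here, not a documented breakdown.

```python
# Check whether A100's 78 TFLOPS non-tensor FP16 figure is consistent with
# running packed FP16 on both the FP32 and FP64 cores at once. The per-core
# FP16 factors are an inference, not a documented breakdown.

fp32_tflops = 19.5   # A100 peak FP32 (whitepaper)
fp64_tflops = 9.7    # A100 peak FP64, non-tensor (whitepaper)

fp16_via_fp32 = 2 * fp32_tflops   # 2-wide packed FP16 on the FP32 cores
fp16_via_fp64 = 4 * fp64_tflops   # assumed 4x FP16 rate on the FP64 cores

total = fp16_via_fp32 + fp16_via_fp64
print(f"{fp16_via_fp32} + {fp16_via_fp64} = {total} TFLOPS")  # ~78 TFLOPS
```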
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Apparently so, because it's been in GPUs for a while now...

    I know that's glib. I don't know how we can "measure" that. Percentage of game optimisation slide decks that mention FP16?

The irony is that Shader Model 2 had "half" as an intrinsic format, and it was spurned once floats moved to 24- and 32-bit. There were rendering problems with widespread use of half back in those days, so selective use appears to be the norm.

    First Steps When Implementing FP16 - GPUOpen
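The kind of precision problem that made developers spurn half is easy to reproduce: Python's struct module can round-trip a value through IEEE 754 binary16 (the 'e' format), showing how little precision survives. This is a minimal illustration, not tied to any particular GPU.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 binary16 (half) and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Only ~3 decimal digits of precision survive:
print(to_half(0.1))      # 0.0999755859375
# Above 2048 the spacing between representable halfs is 2.0,
# so small offsets vanish entirely:
print(to_half(2049.0))   # 2048.0
```

Spacing of 1.0 or 2.0 between representable values in the low thousands is exactly the sort of thing that produced visible banding and positioning errors when half was used indiscriminately.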

     
    egoless likes this.
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Got it.
     
    CarstenS likes this.
  10. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
Agree with all of that as well.
What's new now is this: previous gen, 2 TF vs. 10 TF PC high end; current gen, 5 TF vs. 75 TF PC high end.
We can (and probably will) ignore this and just crank up all settings as usual until the GPU is at its limit. Easy and done. But will it be enough of a visual improvement to justify the premium price? Surely not. The true potential won't be utilized.
We'll never be at the point where better gfx won't show improvements. But IMO we are already at the point where that high-end hardware is just too expensive.
So the question remains: "Will people spend increasing amounts of money on HW, or not?" If you say there is no such question, then you can just answer me that.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    10TF for the visuals we're currently getting is fucking miserable.

    Ray tracing is probably going to look like a saviour in a few years' time, because it forced devs to actually pay attention to the hardware.
     
  12. JoeJ

    Veteran

    Joined:
    Apr 1, 2018
    Messages:
    1,523
    Likes Received:
    1,772
So, all games of the current gen will look shitty? Maybe. But it's what we have to deal with, and it's what we sell to people, promising it's top-notch gfx for the next years.
What? How does RT force or even help devs to understand / optimize HW? It's all black-boxed. It's slow by definition. Devs' contribution is zero. They only use it; it is developed by the HW vendors.
Sure, it's a way to bring any HW to its knees with little effort. But I'd rather see RT being used at a reasonable cost / benefit ratio, and some other things besides, if possible.
     
    Krteq likes this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
I've been working through my Steam backlog: mostly PS3-era games, and I've just started getting into the PS4 generation. There are clear improvements, but the visual upgrade is nowhere near the increase in shading horsepower over the same timeframe. I doubt it will be any different in the PS5 generation. Games will not take full advantage of PC hardware.
     
    Rootax likes this.
  14. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
I'd say the visuals we are getting for 1.8 TF are very impressive. That's where the target has been since 2013.
     
  15. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,451
    Likes Received:
    471
Yes, but MI200's FP64 performance isn't 20 or 30% higher than A100's; it's almost 5 times higher. Mixed precision is fine, but it's worthless once the task requires a specific precision.

That's wishful thinking. GA100 is AI-focused; its compute power is barely better than MI60's from 2018. Not only will MI200 be several times faster in general compute, but A100 will lose the crown even in FP16 tensor, BF16 tensor and FP32 tensor (MI200 is almost 5 times faster). According to some leaks (not sure how reliable), MI200 will also be ~2.4 times faster in FP64 tensor. A100 will keep its position in INT4, INT8 and TF32 tensor.

So why are there companies waiting for MI200 rather than buying A100? Why does Oak Ridge prefer MI200 over A100 for the world's fastest and first exascale supercomputer? Why did Pawsey order MI200 and not A100, if MI200 is less competitive and less cost-effective? Maybe they are completely stupid and should visit Beyond3D more often for helpful advice :smile2:
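As a sanity check on which A100 baseline no-X's multipliers imply (using only A100's published FP64 rates plus the claimed ratios, not any confirmed MI200 spec): 5x the 9.7 TF vector rate and ~2.4x the 19.5 TF tensor rate both land near the same ~47-48 TF figure, suggesting the two claims describe a single MI200 FP64 number measured against two different baselines.

```python
# Sanity-check the claimed multipliers against A100's two published FP64
# rates (9.7 TF vector, 19.5 TF via tensor cores). The implied MI200
# figures follow from the claimed ratios, not from confirmed specs.

a100_fp64_vector = 9.7
a100_fp64_tensor = 19.5

implied_from_vector = 5.0 * a100_fp64_vector   # "almost 5x" claim
implied_from_tensor = 2.4 * a100_fp64_tensor   # "~2.4x FP64 tensor" claim

# Both land in the same ~47-48 TF ballpark:
print(implied_from_vector, implied_from_tensor)
```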
     
    Lightman, Bondrewd and trinibwoy like this.
  16. Granath

    Newcomer

    Joined:
    Jul 26, 2021
    Messages:
    80
    Likes Received:
    82
    Idiots. This company missed so many lucrative opportunities.
     
  17. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
Good we're not stuck on 10 TF hardware.

That's your opinion, of course.
     
  18. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
Sure. 5x better with FP64...

Sure. Reality again. GA100 delivers 320 TFLOPS of FP16 performance. So no, AMD won't deliver more performance. I can't even believe that you think AMD would even be able to do it.

Those aren't companies. Companies don't wait; they buy real products.
     
    PSman1700 likes this.
  19. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
Government grants?
I always thought the HPC TOP10 was all about dick metering on a global scale, no?
The one who promises the most FP64 flops at the smallest energy footprint wins.
Which is something nice to brag about in the press, but kind of useless even for the scientists who will use the supercomputer. DGEMM has little to do with real HPC tasks, which are mostly bandwidth- and scaling-bound.
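The DGEMM-versus-bandwidth-bound point is the classic roofline argument, which can be sketched in a few lines (the peak and bandwidth numbers below are illustrative, not any specific machine):

```python
# Minimal roofline model: attainable FLOP/s is capped either by the compute
# peak or by memory bandwidth times arithmetic intensity (FLOPs per byte).
# Numbers are illustrative, not any specific machine.

def attainable_tflops(peak_tflops: float, bw_tb_s: float, intensity: float) -> float:
    """Roofline: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bw_tb_s * intensity)

peak, bw = 20.0, 2.0   # 20 TF FP64 peak, 2 TB/s memory bandwidth

# DGEMM has high arithmetic intensity and hits the compute roof...
print(attainable_tflops(peak, bw, intensity=100.0))  # 20.0
# ...while a stream/stencil-like kernel at ~0.25 FLOP/byte is bandwidth-bound:
print(attainable_tflops(peak, bw, intensity=0.25))   # 0.5
```

A machine that quintuples its FP64 peak but not its bandwidth only speeds up the first kind of kernel, which is exactly why LINPACK-style bragging rights can be a poor proxy for real workloads.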
     
    DavidGraham likes this.
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
Is the theoretical FP64 peak rate ~5x higher compared against A100 using the tensor cores (~20 TF/s), or without them (~10 TF/s)?

Also, what's MI200's TDP?
     
    DavidGraham and pharma like this.