Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. DegustatoR

    DegustatoR Veteran

    "Scalar datapath" though?
     
  2. Samwell

    Samwell Newcomer

    I meant that gaming Ampere might use FP32 from the Tensor Cores to double the FP32 rate per SM, as current speculation suggests. A100 can't do that.

    There was an interview with an employee, or a blog post, which mentioned that the normal FP16 rate gets 2x from the regular FP32 units, and that they additionally enabled the TCs for normal FP16 operations to add another 2x throughput, for 4x the FP32 rate in total. The TC FP16 rate proper is much higher. Turing already uses the TCs for 2x FP16 throughput, as far as I understand it.

    As troyan mentioned in his link, if they can get 2xFP16 from the TC datapath, why not change it to get an additional FP32 from the TC data path?
    I totally expect utilization issues. It seems gaming Ampere will change from a "high IPC" architecture to a low-IPC one where FP32 is concerned.
     
    DegustatoR likes this.
  3. trinibwoy

    trinibwoy Meh Legend

    Doubling SIMD width would require a second dispatch unit, otherwise you end up with the same utilization problem as Kepler.

    My guess is that the 2xFP32 is not general purpose. Maybe a fast path through the tensors for RT calcs or something. It’s mind-boggling to think Ampere will push 30+ TFLOPS of general compute.
     
    Bondrewd likes this.
  4. Bondrewd

    Bondrewd Veteran

    Hey, I think that's the sane idea.
    Arcturus does 2xFP32 on its GEMM engines.
     
  5. DegustatoR

    DegustatoR Veteran

    On the contrary, doubling the width won't require any changes to dispatch. Adding a second pipe of the same width will, though, but only if GA100 can't already schedule FP32+INT+FP64, which seems an unlikely scenario.
     
  6. Samwell

    Samwell Newcomer

    Isn't that the same utilization problem A100 should have with 4xFP16? But somehow they implemented it at least for some corner cases.
     
  7. DegustatoR

    DegustatoR Veteran

    Why though? Maxwell and Pascal were 32 wide and it always seemed excessive for gaming Turing to be 16 wide. If they go back to 32 wide for general math, then you'd get something around 40 TFLOPS from GA102, which could be a good thing for GA102's non-gaming applications too.
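
As a rough check of the "around 40" figure, peak FP32 throughput is usually computed as SMs × FP32 lanes per SM × 2 FLOPs (one FMA) × clock. The sketch below assumes a hypothetical GA102 with 84 SMs at 1.9 GHz; neither number comes from this thread, they are placeholders chosen only to show how doubling the lane width moves the result.

```python
def peak_fp32_tflops(sm_count: int, fp32_lanes_per_sm: int, clock_ghz: float) -> float:
    """Peak FP32 rate, counting one FMA per lane per clock as 2 FLOPs."""
    flops_per_clock = sm_count * fp32_lanes_per_sm * 2
    return flops_per_clock * clock_ghz / 1000.0  # GFLOPS -> TFLOPS


# Hypothetical GA102 configuration (assumed, not taken from the thread).
SM_COUNT = 84
CLOCK_GHZ = 1.9

# Four partitions per SM, each with one FP32 pipe of the given width.
narrow = peak_fp32_tflops(SM_COUNT, 4 * 16, CLOCK_GHZ)  # Turing-style 16-wide pipes
wide = peak_fp32_tflops(SM_COUNT, 4 * 32, CLOCK_GHZ)    # speculated 32-wide pipes

print(f"16-wide: {narrow:.1f} TFLOPS, 32-wide: {wide:.1f} TFLOPS")
```

Under these assumed inputs the 32-wide case lands close to 40 TFLOPS; a different SM count or clock scales the result linearly.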
     
  8. trinibwoy

    trinibwoy Meh Legend

    The dispatcher issues 32 threads per clock. That’s enough to feed one 32-wide pipe.

    GA100 can schedule FP32+INT32 concurrently because each pipe is only 16-wide. Issuing to any other execution unit (FP64, SFU, Load/Store) will cause bubbles in the main FP and INT pipelines.
     
    DegustatoR likes this.
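
The issue-rate argument in the post above can be sketched with a toy model. It assumes one dispatcher per partition that issues one 32-thread warp instruction per clock, in order, and that a pipe of width w is occupied for 32/w clocks per instruction. The pipe names, the 16-wide "other" unit and the instruction streams are illustrative assumptions, not a description of real hardware.

```python
WARP_SIZE = 32


def simulate(stream, pipe_widths):
    """In-order issue, at most one warp instruction per clock.

    Returns (total_cycles, utilization per pipe).
    """
    free_at = {pipe: 0 for pipe in pipe_widths}  # clock at which each pipe can accept work
    busy = {pipe: 0 for pipe in pipe_widths}     # clocks each pipe spends executing
    clock = 0
    for pipe in stream:
        occupancy = WARP_SIZE // pipe_widths[pipe]
        start = max(clock, free_at[pipe])  # stall while the target pipe is occupied
        free_at[pipe] = start + occupancy
        busy[pipe] += occupancy
        clock = start + 1                  # the dispatcher's next issue slot
    total = max(free_at.values())
    return total, {pipe: busy[pipe] / total for pipe in pipe_widths}


# Two 16-wide pipes, alternating FP32/INT32: both pipes stay almost fully busy.
print(simulate(["fp32", "int32"] * 8, {"fp32": 16, "int32": 16}))

# Two 32-wide pipes behind the same single dispatcher: each pipe is busy only half the time.
print(simulate(["fp32", "int32"] * 8, {"fp32": 32, "int32": 32}))

# A third unit takes issue slots, so bubbles appear in the FP32 and INT32 pipes.
widths = {"fp32": 16, "int32": 16, "other": 16}
print(simulate(["fp32", "int32", "other"] * 8, widths))
```

In this model a single 32-thread-per-clock dispatcher can saturate two 16-wide pipes or one 32-wide pipe, but not two 32-wide pipes, which is the utilization problem being discussed.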
  9. trinibwoy

    trinibwoy Meh Legend

    Oh it’ll be great but certainly not free. I will be very impressed if it’s true and the die size is under 800mm2.
     
  10. DegustatoR

    DegustatoR Veteran

    Yeah, that's true, I hadn't thought of this. Well, I guess we'll see soon.

    Does it issue 32 threads though or does it issue 16+16 from two warps?
     
  11. trinibwoy

    trinibwoy Meh Legend

    Well we don’t know that there isn’t a utilization problem with 4xFP16. E.g. can you still issue to the INT pipe while doing that?
     
  12. Jawed

    Jawed Legend

    I still fondly remember how @aaronspink was dubious that GDDR would go beyond 6Gbps:

    Nvidia GT300 core: Speculation

    Anyway, Aaron taught us much about memory back in the good old days.
     
    Lightman likes this.
  13. CarstenS

    CarstenS Legend Subscriber

    Wow, they "lost" the comment section? That's... interesting.

    edit 200905: I managed to find the comment in Krashinsky's Disqus profile. A screenshot is attached for your reference, and in case it gets "lost" over there too. Ampere_GA100_4xFP16-rate_Krashinsky.png

    BTW: Chip on the back? Zotac (and someone before them) had those on the back of the PCB, opposite the GPU - it was not another GPU though, but a super-cap:
    https://www.zotac.com/download/file...ery/graphics_cards/zt-t20820b-10p_image04.jpg
     
    Last edited: Sep 5, 2020
    BRiT likes this.
  14. Kaotik

    Kaotik Drunk Member Legend

  15. CarstenS

    CarstenS Legend Subscriber

    Those were on super-expensive editions of cards earlier, though. Maybe that's what's inflating the BOM, among other things such as those rumored PCBs that enable high speeds.
     
  16. Kaotik

    Kaotik Drunk Member Legend

    It's still supposedly Colorful's Vulcan, not reference, in the leaks.
     
  17. Bondrewd

    Bondrewd Veteran

    It is, but the overall BOM is still crazy.
     
  18. Makes sense that the FP32/INT32 ratio is 2:1, IMO.
    In Nvidia's own marketing material, the share of INT32 operations in games is only up to 40%, IIRC, so Turing had an unbalanced number of INT32 units (at least for game rendering).
     
    Man from Atlantis likes this.
  19. Jawed

    Jawed Legend

    Tensor FP32 will presumably be only MAD/FMA and ADD, along with latency that prolly means at least 2 dependent ops are required to be worth using at all. Should be good for geometry I suppose.
     