Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    If the code is pure FP32, Ampere can reach its rated TFLOPS. For example, in Geekbench 5 and AIDA64 GPGPU:
    4-5-5.png

    and in some rendering benchmarks like V-Ray and Blender, Ampere is also very close to its rated FLOPS:
    1-4-14.png
     
    pharma, PSman1700 and Lightman like this.
  2. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    And how exactly do you think INT32 is run on GPUs which don't have dedicated INT32 h/w? And how is this different from how it is run on Ampere?
     
    PSman1700 likes this.
  3. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    This is most probably using the tensor cores, since it's based on a recent TensorFlow framework (unless they disabled them by choice). And there, the 3090 is not much faster on paper than the unconstrained Titan RTX.
    Image from the GA104 whitepaper, condensed down to the TFLOPS section.
     

    Digidi likes this.
  4. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    There is a lot of confusion about how FP and INT can be scheduled. Nvidia has done a poor job of clarifying it. Nvidia's material gives the impression that it's either 128 FP or 64+64, so even a single INT instruction would cost you half of an SM's FP capability. On AMD GPUs, as I understand it, INT instructions can be issued arbitrarily at the stream processor level.
     
  5. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    Edit - Never mind, brain fart. Too early in the morning.
     
  6. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    And how much FP capability would you lose in a CU containing 2 SIMD units in this case?
     
    PSman1700 likes this.
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    You can reach max TFLOPS in corner cases. AIDA64 computes fractals, for which you don't need to read any data from caches/memory.
    That is the ideal situation: the ALUs never have to wait for data.
    The Blender Barcelona Pavilion scene has super simple geometry, so in such cases data fetching is not a bottleneck either.
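
    The point about ALUs never waiting for data can be sketched with a roofline-style bound, which is a standard simplification and not something the benchmarks above report. The peak and bandwidth figures below are hypothetical round numbers, and the function names are made up for illustration.

```python
def attainable_tflops(flops_per_byte, peak_tflops, bandwidth_gb_s):
    """Roofline bound: throughput is capped by ALU peak or by memory traffic."""
    # FLOP/byte * GB/s gives GFLOP/s; divide by 1000 to get TFLOP/s.
    memory_bound_tflops = flops_per_byte * bandwidth_gb_s / 1000.0
    return min(peak_tflops, memory_bound_tflops)


def ridge_point(peak_tflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which the ALUs are the limit."""
    return peak_tflops * 1000.0 / bandwidth_gb_s


# Hypothetical round numbers, not the specification of any particular card.
PEAK_TFLOPS = 30.0
BANDWIDTH_GB_S = 750.0

if __name__ == "__main__":
    for intensity in (1.0, 10.0, 40.0, 400.0):
        bound = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_GB_S)
        print(f"{intensity:6.1f} FLOP/byte -> at most {bound:.2f} TFLOPS")
```

    Under these assumed numbers, a kernel needs about 40 FLOP per byte fetched before the rated peak is even reachable, which is why a fractal kernel that generates its own inputs is such a favourable case.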
     
    DegustatoR likes this.
  8. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    You would lose however many stream processors have INT instructions scheduled. AMD can issue any arbitrary mix of FP and INT instructions, as far as I understand it.
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    There are 4 independent partitions in an SM, each with their own warps, instruction cache, scheduler and dispatcher. They don’t run in lock step in any way. You can do 16 FP + 16 INT in one partition while doing 32 FP in another.

    The white paper states “All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.” This can be confusing if you take it to mean that this is the only combination possible but clearly that isn’t the case.
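
    A minimal sketch of the combinations, assuming each of the four partitions has one FP32-only 16-lane datapath and one 16-lane datapath that runs either FP32 or INT32 in a given clock. The function names are made up for illustration, and the model ignores scheduling and operand availability.

```python
from itertools import product

LANES_PER_DATAPATH = 16
PARTITIONS_PER_SM = 4


def partition_ops(shared_path_mode):
    """Return (fp32_ops, int32_ops) per clock for one SM partition.

    shared_path_mode is "fp" or "int": what the shared FP32/INT32
    datapath executes this clock. The other datapath always runs FP32.
    """
    if shared_path_mode == "fp":
        return (2 * LANES_PER_DATAPATH, 0)
    if shared_path_mode == "int":
        return (LANES_PER_DATAPATH, LANES_PER_DATAPATH)
    raise ValueError(f"unknown mode: {shared_path_mode!r}")


def sm_ops(modes):
    """Sum (fp32_ops, int32_ops) over the four independent partitions."""
    if len(modes) != PARTITIONS_PER_SM:
        raise ValueError("expected one mode per partition")
    per_partition = [partition_ops(mode) for mode in modes]
    return (sum(fp for fp, _ in per_partition),
            sum(i for _, i in per_partition))


def all_sm_mixes():
    """Every distinct per-clock (fp32, int32) total the SM can produce."""
    modes = product(("fp", "int"), repeat=PARTITIONS_PER_SM)
    return sorted({sm_ops(list(m)) for m in modes})


if __name__ == "__main__":
    for fp, integer in all_sm_mixes():
        print(f"{fp:3d} FP32 + {integer:2d} INT32 per clock")
```

    In this model the white paper's 128 FP32 and 64 FP32 + 64 INT32 figures are just the two endpoints; the partitions can also land on 112+16, 96+32 and 80+48.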
     
  10. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France
    Are there any rumors about 3080 20GB availability from AIBs?
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    What do you mean by stream processor? By definition all lanes of each 32-wide SIMD must execute the same instruction. You can’t mix FP and INT within a SIMD in the same clock cycle.
     
    3dcgi, fellix and PSman1700 like this.
  12. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    Yes, I learned the correct granularity back around launch from this forum. I believe the poster DegustatoR was replying to was under the impression it was only the two combinations you listed. Nvidia's wording gives that impression.

    I thought I remembered it being stated here around the Turing launch that AMD GPUs could issue any arbitrary mix of INT and FP in a cycle. Must have been at a SIMD level.
     
  13. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    I'm not sure you do.
    You can issue two instructions per clock on two SIMD units at best. It's about as "arbitrary" as it can be.
    Ampere can issue 2 FP32 instructions or 1 FP32 + 1 INT per clock in each partition of an SM (there are 4 partitions per Ampere SM).
    RDNA can issue 2 FP32, 1 FP32 + 1 INT, or 2 INT per clock in each CU of a WGP (there are 2 CUs per RDNA WGP).
    From this point the only difference is that you can have the same peak INT throughput as that of FP32 on RDNA but only half that on Ampere. Otherwise they are the same.
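
    The comparison in this post can be written down as a small table of allowed instruction pairs. This is only a sketch of the claim as stated here, not a description of the actual issue logic, and the names are made up for illustration.

```python
# Instruction pairs each unit can run per clock, as listed in the post above.
ALLOWED_PAIRS = {
    "Ampere SM partition": [("FP32", "FP32"), ("FP32", "INT32")],
    "RDNA CU": [("FP32", "FP32"), ("FP32", "INT32"), ("INT32", "INT32")],
}


def peak_issue(pairs, kind):
    """Largest number of `kind` instructions in any allowed pair."""
    return max(pair.count(kind) for pair in pairs)


def int_to_fp_peak_ratio(pairs):
    """Peak INT32 rate relative to peak FP32 rate for one unit."""
    return peak_issue(pairs, "INT32") / peak_issue(pairs, "FP32")


if __name__ == "__main__":
    for unit, pairs in ALLOWED_PAIRS.items():
        print(f"{unit}: peak INT32 is {int_to_fp_peak_ratio(pairs):.0%} of peak FP32")
```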
     
    PSman1700 likes this.
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Got it.

    Not within a SIMD. Each Navi CU has 2 independent 32-wide SIMDs which can either execute 32 INT or 32 FP each clock.
     
    PSman1700 likes this.
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Per-work-item INT instructions on AMD entirely block FP instructions. There's a single SIMD that handles INT and FP for per-work-item calculations.

    So the argument about "NVidia losing FP32 because of INT sharing" has a simple answer: "well, doh". Nothing new here.

    Ampere has a continuously available FP SIMD. The real trick is keeping data ready for it to use. Running INT on another SIMD helps keep data ready. And when INT is not required, there's a chance to get a burst of extra FP goodness.

    SMs and WGPs or CUs don't really compare cleanly.

    It's best to forget about CU (or WGP) level in RDNA. Each SIMD has its own instruction issue and all the SIMDs are INT/FP, "dual-action". The instruction issue to RDNA SIMDs is not controlled by the CU.

    It's clearer to consider instruction-issue and register file. Ampere has dual instruction issue to two SIMDs. RDNA has single instruction issue to only one SIMD.

    (The instruction issue of special functions, (TEX) data-loads, per-hardware-thread and branching evaluation all add lots of complexity - they do affect the progress of work on the SIMDs, but they aren't directly relevant to an INT versus FP throughput discussion).

    Ampere's theoretical FP throughput is far far higher than RDNA2 will be. Ensuring that there's work for the SIMDs to do is looking more and more to be the central problem. Complex, math-intensive, shaders are only getting more common - but render-pass count in games is increasing and that hurts SIMD utilisation.
     
  16. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Well, they kinda do actually. SM and WGP are base level building blocks of NV and AMD GPUs. They have different architectures of course and different mix and types of h/w inside them.

    Sure, but those are scheduling differences which arise from the fact that Turing/Ampere's SIMDs are 16 wide while RDNA's are 32 wide.

    Ampere doesn't really have dual issue of instructions, it issues them in consecutive cycles to either of the two SIMDs which are 16 wide in h/w and thus take 2 clocks to run through a warp.

    The differences between Ampere and RDNA architectures are obviously pretty huge. But from the point of INT execution Ampere isn't that much different to RDNA now - both will utilize FP32 SIMDs to run INT32 instructions, both have the same wave/warp widths (RDNA has an option of 64 wide too but I don't know when it is being used over a 32 wide one) so both will "lose" the same amount of FP32 throughput due to the need to run INTs sometimes.

    Basically, it's not that Ampere's flops are a "marketing trick", it's that Turing's flops were "overrated" since they didn't need to run INTs and were dealing with FP32 only.
    To get a "proper" comparison in gaming math between Turing and Ampere you need to add these 25-30% of INT instructions to Turing's FP32 throughput.
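
    A rough sketch of that adjustment, assuming the 25-30% figure is the INT share of all issued instructions, the same SM count and clock on both chips, and nothing else (memory, occupancy, scheduling) getting in the way. The function names and the 64-lane default are illustrative.

```python
def clocks_turing(n_fp, n_int, lanes=64):
    """Turing-style SM: separate FP32 and INT32 pipes run side by side."""
    return max(n_fp, n_int) / lanes


def clocks_ampere(n_fp, n_int, lanes=64):
    """Ampere-style SM: one FP32-only pipe plus one shared FP32/INT32 pipe.

    INT work must go to the shared pipe; FP work is balanced across both.
    """
    balanced = (n_fp + n_int) / (2 * lanes)
    shared_pipe_floor = n_int / lanes
    return max(balanced, shared_pipe_floor)


def ampere_speedup(int_fraction):
    """Per-SM, per-clock speedup of the Ampere model over the Turing model."""
    n_int = int_fraction
    n_fp = 1.0 - int_fraction
    return clocks_turing(n_fp, n_int) / clocks_ampere(n_fp, n_int)


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.30, 0.5):
        print(f"INT share {f:.0%}: {ampere_speedup(f):.2f}x per SM per clock")
```

    With no INT work the model gives the headline 2x, but at a 25-30% INT share it drops to roughly 1.4-1.5x, which is the sense in which Turing's FP32-only rating undercounted what its SMs were doing.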
     
    szatkus and PSman1700 like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    You're right, it's best to think of Ampere as dual-threaded issue to the two SIMDs - the cadence reduces instruction cache and issuer bandwidth.

    Ability to issue to the two SIMDs depends upon operand availability. The heart of maximum SIMD throughput depends on at least two instructions, and their operands, being independently available every cycle. Obviously the operand collector helps here, though it adds latency.
     
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    We speculated earlier in this thread that the 2x FP capability was simply Nvidia choosing the cheapest path to increasing performance over Turing. Maybe Nvidia isn't too bothered about the excess compute capacity. The difference is quite stark though. 150% more flops for 50% more bandwidth.
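
    To put a number on how stark that is, here is the arithmetic using only the two percentages quoted above. The function name is made up for illustration.

```python
def flops_per_byte_growth(flops_gain_pct, bandwidth_gain_pct):
    """Factor by which peak FLOPS per byte of bandwidth grows."""
    flops_factor = 1.0 + flops_gain_pct / 100.0
    bandwidth_factor = 1.0 + bandwidth_gain_pct / 100.0
    return flops_factor / bandwidth_factor


if __name__ == "__main__":
    growth = flops_per_byte_growth(150.0, 50.0)
    print(f"Peak FLOPS per byte of bandwidth grows by {growth:.2f}x")
```

    That is 2.5 / 1.5, so each byte of bandwidth has to feed about two thirds more peak math than before.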
     
    #1958 trinibwoy, Oct 1, 2020
    Last edited: Oct 1, 2020
  19. Digidi

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    428
    Likes Received:
    239
    Lightman and BRiT like this.
  20. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,679
    @Digidi Does Witcher 3 support asynchronous compute?

    Edit: Do Turing and Ampere actually support asynchronous compute in the same way as GCN and RDNA do? Would be a good way to utilize more of that ALU, assuming you're not already bottlenecked by something else like bandwidth.
     
    #1960 Scott_Arm, Oct 1, 2020
    Last edited: Oct 1, 2020
    Lightman likes this.