NVidia Ada Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Jul 10, 2021.

Tags:
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    If two distinct warps are always required to enable co-issue to both SIMD16s (FP32 and FP32/Int) then I guess that's the utilisation problem right there.

    I can imagine transcendentals going through the SFU at what appears to be 4 per clock (per partition), which adds to dependency-chain-length problems, reducing the count of warps available for dual-issue.
     
  2. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    I didn't talk about gaming rendering; I specifically mentioned compute performance. A chiplet design will scale worse than a monolithic chip in games... But people here expect AMD to increase compute performance by 2.5x over RDNA2.

    nVidia did it with Ampere, specifically boosting compute performance without increasing other units like rasterizers, geometry units, etc.:
    https://techgage.com/article/mid-2021-gpu-rendering-performance/

    You can only invest so many transistors.
     
    PSman1700 likes this.
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Let's say GPUs tend to be designed with the idea that more (well, many more) than one warp is running on a processor at any given time :)
     
    DegustatoR likes this.
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Hall Bench score per "theoretical" TFLOPS
    • 6900XT - 753
    • 3090 - 640
     
    Lightman likes this.
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    4-way issue to math ALUs for full-utilisation looks like a fail for graphics :)
     
  6. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    RDNA needs 4 distinct "warps" each cycle to fully load its WGP. Unlikely to be any sort of real issue, judging from real-world results.
     
    PSman1700 likes this.
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    An RDNA WGP needs 4 distinct warps with lots of ILP. If there's no ILP it needs a lot more warps (5-cycle dependent-math latency, IIRC).

    Turing was the same: it needed 4 distinct warps with lots of ILP for maximum FMA throughput. Not sure if that changed for Ampere. However, that says nothing about the extra INT/FP32 pipe; presumably you need 4 additional warps with lots of ILP to keep that second pipeline going.

    Have we seen better?
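
    The warp counts above follow from a Little's-law style estimate: warps needed per SIMD ≈ dependent latency ÷ (issue interval × ILP). A minimal sketch of that arithmetic (the 5-cycle latency is the recollection above, not a confirmed figure, and the model ignores non-math stalls):

    ```python
    import math

    def warps_needed(dep_latency_cycles, issue_interval_cycles=1, ilp=1):
        """Rough occupancy estimate: independent warps needed per SIMD to keep
        issuing every cycle when dependent instructions stall for
        dep_latency_cycles. ILP within a warp counts as extra in-flight work."""
        return math.ceil(dep_latency_cycles / (issue_interval_cycles * ilp))

    # 5-cycle dependent-math latency, no ILP: 5 warps per SIMD,
    # x4 SIMDs per RDNA WGP = 20 warps to saturate the WGP.
    print(warps_needed(5), warps_needed(5) * 4)
    # With 2 independent instructions per warp (ILP=2), 3 warps per SIMD suffice.
    print(warps_needed(5, ilp=2))
    ```
    
    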
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    RTX 3060 (score 9512 with 12.74 TFLOPS) is interesting, because its score per theoretical TFLOPS is substantially better, at 747, close to the 6900XT. The 6700XT is at 795. So in both cases these architectures get "better" with lower-tier GPUs, which is arguably not surprising.

    So what looks like it could be a compute test is looking to be more subtle than that.

    Bandwidth per TFLOPS doesn't seem to make much difference to RDNA 2 (29GB per TFLOP in the 6700XT versus 22 in the 6900XT), while with Ampere the gap (28 versus 26 for the 3060 and 3090) looks relatively innocuous.

    3070Ti has a score per theoretical TFLOPS of 683, with 28GB per TFLOP and perhaps can be considered the best comparison for 6700XT.

    A compute efficiency test probably shouldn't vary so substantially according to tier within an architecture, so I suspect this particular benchmark is not a good choice.
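
    The normalised figures in this post are simple ratios; a quick sketch of the arithmetic (the 3060's 360 GB/s bandwidth figure is taken from public spec sheets, an assumption not stated in the post):

    ```python
    def score_per_tflops(score, tflops):
        """Benchmark score normalised by theoretical peak TFLOPS."""
        return round(score / tflops)

    def gb_per_tflop(bandwidth_gbps, tflops):
        """Memory bandwidth (GB/s) normalised by theoretical peak TFLOPS."""
        return round(bandwidth_gbps / tflops)

    # Figures quoted above: RTX 3060 scores 9512 with 12.74 theoretical TFLOPS.
    print(score_per_tflops(9512, 12.74))   # 747, close to the 6900XT's 753

    # 360 GB/s is the 3060's spec-sheet bandwidth (assumption, see lead-in).
    print(gb_per_tflop(360, 12.74))        # 28
    ```
    
    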
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    RDNA has two SIMDs (general SIMD-32 and transcendental SIMD-8) per instruction scheduler. It requires 2-way issue for full utilisation, but can only ever issue 1 per cycle. Transcendentals take 4 cycles, so if one is started then full utilisation occurs during the following 3 cycles.

    An Ampere SM partition has 3 SIMDs (FP32 SIMD-16, FP32/Integer SIMD-16 and SF SIMD-4) and a tensor core (which looks like a SIMD-32 for FP16 operations). It appears to take 4 cycles to get work issued to all these units but I don't know the details of the issue cadences. A reasonable guess is that SFU can be issued every 8 cycles, but I don't know if it takes a cadence from one or other of the SIMD-16s.

    I don't know how issues are scheduled for the tensor core and whether "tensor" operations are "slower" than FP16. I'd expect FP16 (which is likely to be used in games) to be able to issue on the tensor core as frequently as once every cycle (since Ampere is double-rate FP16). I'm guessing that a tensor core "shares" a datapath with one of the two SIMD-16s. I don't know how per-cycle issue to the tensor core meshes with the slower issue rate to the two SIMD-16s.
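
    A toy model of the cadence question, under loud assumptions: one wave-32 instruction dispatched per partition per cycle (as stated later in this thread), and each SIMD-16 taking 2 cycles to drain a wave-32. None of this is vendor-confirmed; it just shows why alternating issue between the two SIMD-16s can keep both busy:

    ```python
    def simd16_utilisation(cycles, schedule):
        """Toy Ampere-partition model: one dispatch per cycle.
        schedule: function cycle -> pipe index (0 or 1) or None (no issue).
        Each accepted wave-32 occupies its SIMD-16 for 2 cycles."""
        busy_until = [0, 0]     # cycle at which each SIMD-16 frees up
        busy_cycles = [0, 0]
        for c in range(cycles):
            pipe = schedule(c)
            if pipe is not None and busy_until[pipe] <= c:
                busy_until[pipe] = c + 2    # wave-32 over SIMD-16: 2 cycles
                busy_cycles[pipe] += 2
        return [b / cycles for b in busy_cycles]

    # Alternate issue between the FP32 and FP32/INT pipes: both fully busy.
    print(simd16_utilisation(1000, lambda c: c % 2))   # [1.0, 1.0]
    # Issue only to pipe 0: it saturates while pipe 1 sits idle.
    print(simd16_utilisation(1000, lambda c: 0))       # [1.0, 0.0]
    ```
    
    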
     
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    If bandwidth played an important role, RTX 3070 should stand out more, since it's got a notable uptick in GFlops/GBps and should do worse, which it doesn't.

    Maybe it's got something to do with what this test does - compared to other tests in Luxmark 4 alpha:
    "The second LuxMark benchmark is a path tracer with global illumination cache. This rendering mode slightly simpler than pure brute force and may work better on some GPU."
    https://wiki.luxcorerender.org/LuxMark_v4
    Especially the GI cache sounds like something RDNA2 might profit from. Maybe we'll see numbers for the RX 6600 XT showing a trend here.

    Fun fact: the most efficient GeForce is the 2080 Ti, when you compare points per GFLOPS.

    Maybe power budget plays a role? I seem to remember RTX 30 cards were boosting their hearts out in Luxmark, running quite a bit higher than their advertised boosts, at around 1900 MHz, IIRC.

    Maybe also a larger cache would help. Something Lovelace will fix?
     
    Lightman and Jawed like this.
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Yeah it's reasonable to assume that issuing to the tensors would preclude issuing to one of the other SIMD pipes in the same cycle. I'm not sure that's important though as we don't double count tensors when talking about peak Ampere flops available to graphics applications. We know that FP16 throughput is 2xFP32 and the assumption is that one data path is sufficient to provide the necessary operands. So whether FP16 is running on the main SIMDs like RDNA or on tensors like Turing/Ampere doesn't really matter.

    Nvidia has usually been explicit about the number of instructions dispatched per cycle and in Ampere it's one instruction per partition. So presumably Ampere can only issue to one execution unit each cycle including the load/store units.

    I don't quite get your comment about 4-way issue though. Graphics applications don't need to issue to the tensors to achieve "peak utilization" in the way it's currently defined which is 128 FMAs per clock. When we're talking about scaling of graphics or compute applications that's the number we're referring to.
     
  12. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    FWIW, Ampere as well as Turing uses the Tensor Cores for standard FP16 math.
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Yep, the Ampere issue options are:

    16xFP32 + 16xFP32
    16xFP32 + 16xINT32
    16xFP32 + 32xFP16 (tensor) ---> need to confirm if tensors can co-issue with one of the SIMDs
     
  14. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    Who said async was worse in RDNA?
    It's obviously not, but AMD did nothing to improve async compute in RDNA. Instead, they refactored the graphics pipeline to minimise stalls and improve latencies with the new 32-wide wavefronts, which was the whole reason behind RDNA's better efficiency.
    When you compare the Radeon VII to the 5700 XT, NAVI10 achieves far higher efficiency at the same average performance mostly because, instead of filling pipeline bubbles with async compute (impossible to fix automatically in HW), they decreased the number of pipeline stalls in the first place.

    Because there is zero demand for displays with 7x the resolution (most PC displays are still 1080p, 1440p is the second most popular resolution, and 4K still captures a minor fraction of the PC market), geometry processing takes pretty much constant time at all resolutions, etc, etc.
    If you look carefully, you'll probably notice that most games don't scale linearly with resolution for tons of reasons (not just the CPU); only the heaviest (thanks to RT and compute) games like CP2077 scale linearly with pixel count, but that's exactly the type of game where the RTX 3090 is currently up to 2x faster than the RX 6900 XT.
     
    #214 OlegSH, Aug 4, 2021
    Last edited: Aug 4, 2021
    pharma, PSman1700 and DavidGraham like this.
  15. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    Volta can dispatch instructions for 2+ data paths.
    I guess the + means that SFU instructions running for 8 clocks can be overlapped with INT and FP instructions.
     
  16. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    I'd say most don't have anything against AMD, and especially NV, increasing GPU capabilities. About three times the power of the consoles in just normal rendering isn't a bad start, and that's before ray tracing and reconstruction are counted in. The enormous compute power enables things like UE5/Lumen/Nanite as well.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    For all we know the "tensor core" is fake: tensor math is similar to the dot-product operations of yore, which happily occupied multiple lanes with very high throughput and low latency on VLIW machines.

    Otherwise, the tensor core looks like transistors sat twiddling their thumbs, which is where the "typical games versus theoretical FLOPS" questions enter the picture.

    FLOPS per transistor (mm²) and FLOPS per watt are what really matter, so games-actual versus theoretical isn't such an enlivening topic in the end.

    I was referring to peak utilisation in terms of transistors that do math: the various SIMDs and what proportion of the time they can be used.
     
  18. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    FP16 is the same speed as FP32 on Ampere, which likely means that you can't do the "+" there.
     
    pharma, CarstenS and PSman1700 like this.
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    It doesn't though, since nobody counts tensors when talking about theoretical flops for gaming.

    Maybe. But the topic was how application performance scaled with Ampere’s doubled FP32. We have lots of evidence that game performance did not scale anywhere close to the flops increase. However there are other workloads where it came reasonably close.

    Ok.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    You achieve peak FP16 flops on Ampere by issuing a wave-32 of packed FP16 FMA operands (64 sets total) to the tensor pipe. Presumably the tensors will take 2 cycles to process the wave. What’s preventing the SM from issuing a wave-32 of FP32 instructions to one of the other SIMDs in the next cycle while the tensors are still chewing on the FP16 data?

    Or do you mean that tensors process the full wave in one cycle so there’s no opportunity to run tensors in parallel with other work?
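
    The arithmetic behind the 2-cycle guess, as a sketch (the SIMD-32-wide FP16 tensor path is this thread's assumption, not a confirmed spec):

    ```python
    # A wave-32 of packed FP16 carries 2 FP16 ops per lane = 64 FMA sets.
    # A tensor path processing 32 FP16 ops per cycle (assumed width) would
    # drain that in 64 / 32 = 2 cycles, leaving one issue slot free for an
    # FP32 wave in the second cycle.
    lanes = 32
    fp16_per_lane = 2        # packed FP16
    tensor_width = 32        # FP16 ops per cycle (assumption)
    cycles_to_drain = lanes * fp16_per_lane // tensor_width
    print(cycles_to_drain)   # 2
    ```
    
    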
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.