NVidia Ada Speculation, Rumours and Discussion

I am not 100% sure, but if the 2 instructions per clock have to come from different warps, what you're asking for probably doesn't exist.
If two distinct warps are always required to enable co-issue to both SIMD16s (FP32 and FP32/Int) then I guess that's the utilisation problem right there.

I can imagine transcendentals going through the SFU at what appears to be 4 per clock (per partition); that adds to dependency-chain-length problems, reducing the count of warps available for dual-issue.
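To make the utilisation cliff concrete, here's a toy model in Python (my assumption, not a confirmed scheduling detail): if each of the two SIMD16 pipes can only be fed from a different warp in a given cycle, achieved FP32 rate collapses whenever fewer than two warps are eligible to issue.

```python
# Toy model (assumption, not a confirmed NVIDIA detail): two SIMD16 pipes,
# each fed from a *different* warp per cycle, so FP32 utilisation drops
# whenever fewer than 2 warps are eligible.

def fp32_utilisation(eligible_warps: int) -> float:
    """Fraction of peak FP32 rate under the two-pipes/two-warps assumption."""
    return min(eligible_warps, 2) / 2

for w in range(0, 5):
    print(f"{w} eligible warps -> {fp32_utilisation(w):.0%} of peak FP32")
```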
 
As far as I remember, the common consensus from reviews was that this doubling of throughput didn't work and didn't scale well. The lower-throughput Navi21 was on par with the 3090 (excluding ray tracing, of course). Ampere probably shines in special scenarios, but in regular games it's on par.

I didn't talk about gaming rendering; I specifically mentioned compute performance. A chiplet design will scale worse than a monolithic chip in games... But people here expect that AMD will increase compute performance by 2.5x over RDNA2.

nVidia did it with Ampere, specifically increasing compute performance without growing other units like the rasterizers, geometry units, etc.:
[chart: LuxMark Hall Bench OpenCL scores, June 2021]

https://techgage.com/article/mid-2021-gpu-rendering-performance/

You can only invest so many transistors.
 
If two distinct warps are always required to enable co-issue to both SIMD16s (FP32 and FP32/Int) then I guess that's the utilisation problem right there.

I can imagine transcendentals going through the SFU at what appears to be 4 per clock (per partition); that adds to dependency-chain-length problems, reducing the count of warps available for dual-issue.
Let's say GPUs tend to be designed with the idea that more (well, many more) than 1 warp is running on a processor at any given time :)
 
Let's say GPUs tend to be designed with the idea that more (well, many more) than 1 warp is running on a processor at any given time :)
4-way issue to math ALUs for full-utilisation looks like a fail for graphics :)
 
RDNA needs 4 distinct "warps" each cycle to fully load its WGP. Unlikely to be any sort of real issue, going by real-world results.

An RDNA WGP needs 4 distinct warps with lots of ILP. If there's no ILP it needs a lot more warps (5-cycle dependent math latency IIRC).

Turing was the same, needed 4 distinct warps with lots of ILP for maximum FMA throughput. Not sure if that changed for Ampere. However, it doesn't say anything about the extra INT/FP32 pipe. Presumably you need 4 additional warps with lots of ILP to keep that second pipeline going.
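Back-of-envelope on those warp counts, assuming 1 wave issued per SIMD per cycle, the 5-cycle dependent-math latency mentioned above, and 4 SIMDs per RDNA WGP:

```python
import math

# Waves needed to hide dependent-math latency (assumptions as stated above:
# 5-cycle latency, 1 wave issued per SIMD per cycle, 4 SIMDs per WGP).

def waves_needed_per_simd(latency_cycles: int, ilp: int) -> int:
    """Waves required to keep one SIMD issuing every cycle."""
    return math.ceil(latency_cycles / ilp)

for ilp in (1, 2, 5):
    per_simd = waves_needed_per_simd(5, ilp)
    print(f"ILP={ilp}: {per_simd} waves/SIMD, {4 * per_simd} waves per WGP")
```

With lots of ILP (5 independent instructions in flight) that lands on the "4 distinct warps per WGP" minimum; with no ILP it balloons to 20.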

4-way issue to math ALUs for full-utilisation looks like a fail for graphics :)

Have we seen better?
 
Hall Bench score per "theoretical" TFLOPS
  • 6900XT - 753
  • 3090 - 640
RTX 3060 (score 9512 with 12.74 TFLOPS) is interesting, because its score per theoretical TFLOPS is substantially better, at 747, close to the 6900XT. The 6700XT is at 795. So in both cases these architectures get "better" with lower-tier GPUs, which is arguably not surprising.

So what looks like it could be a compute test is turning out to be more subtle than that.

Bandwidth per TFLOPS is not making that much difference to RDNA 2 (29 GB/s per TFLOP in the 6700XT versus 22 in the 6900XT), while with Ampere, 28 versus 26 (3060 and 3090) looks relatively innocuous.

3070Ti has a score per theoretical TFLOPS of 683, with 28 GB/s per TFLOP, and can perhaps be considered the best comparison for the 6700XT.

A compute efficiency test probably shouldn't vary so substantially according to tier within an architecture, so I suspect this particular benchmark is not a good choice.
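For anyone who wants to poke at the ratios above, here's the arithmetic. The TFLOPS and bandwidth figures are the usual spec-sheet boost numbers (my additions, not from the chart); only the 3060's raw score (9512) is quoted directly.

```python
# Reproducing the score-per-TFLOPS and bandwidth-per-TFLOP ratios quoted
# above. Spec-sheet figures are my assumption, not from the post itself.

cards = {  # name: (theoretical TFLOPS, memory bandwidth GB/s)
    "6900 XT":     (23.04, 512),
    "6700 XT":     (13.21, 384),
    "RTX 3090":    (35.58, 936),
    "RTX 3060":    (12.74, 360),
    "RTX 3070 Ti": (21.75, 608),
}

print(f"RTX 3060 score/TFLOPS: {9512 / cards['RTX 3060'][0]:.0f}")  # ~747

for name, (tflops, gbps) in cards.items():
    print(f"{name}: {gbps / tflops:.0f} GB/s per TFLOP")
```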
 
Have we seen better?
RDNA has two SIMDs (general SIMD-32 and transcendental SIMD-8) per instruction scheduler. It requires 2-way issue for full utilisation, but can only ever issue 1 per cycle. Transcendentals take 4 cycles, so if one is started then full utilisation occurs during the following 3 cycles.

An Ampere SM partition has 3 SIMDs (FP32 SIMD-16, FP32/Integer SIMD-16 and SF SIMD-4) and a tensor core (which looks like a SIMD-32 for FP16 operations). It appears to take 4 cycles to get work issued to all these units but I don't know the details of the issue cadences. A reasonable guess is that SFU can be issued every 8 cycles, but I don't know if it takes a cadence from one or other of the SIMD-16s.

I don't know how issues are scheduled for the tensor core and whether "tensor" operations are "slower" than FP16. I'd expect FP16 (which is likely to be used in games) to be able to issue on the tensor core as frequently as once every cycle (since Ampere is double-rate FP16). I'm guessing that a tensor core "shares" a datapath with one of the two SIMD-16s. I don't know how per-cycle issue to the tensor core meshes with the slower issue rate to the two SIMD-16s.
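Here's a toy cadence simulator built purely on those guesses (one instruction issued per partition per cycle, a warp-32 occupying a SIMD16 for 2 cycles and the SIMD4 SFU for 8). Since the two SIMD16s alone can consume every issue slot, a single SFU issue necessarily costs a math-issue slot:

```python
# Toy cadence model (all assumptions, matching the guesses above): one issue
# per partition per cycle; a warp-32 occupies a SIMD16 for 2 cycles and the
# SIMD4 SFU for 8 cycles. One SFU op arrives at cycle 4 in this trace.

OCCUPANCY = {"fp32": 2, "fp32/int": 2, "sfu": 8}
busy_until = {p: 0 for p in OCCUPANCY}

for cycle in range(12):
    wanted = (["sfu"] if cycle == 4 else []) + ["fp32", "fp32/int"]
    for pipe in wanted:
        if busy_until[pipe] <= cycle:
            busy_until[pipe] = cycle + OCCUPANCY[pipe]
            print(f"cycle {cycle:2d}: issue to {pipe}")
            break
    else:
        print(f"cycle {cycle:2d}: stall (all pipes busy)")
```

In the trace, the SFU issue at cycle 4 bumps the math pipes by a beat, so one SIMD16 issue slot is lost over the window.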
 
RTX 3060 (score 9512 with 12.74 TFLOPS) is interesting, because its score per theoretical TFLOPS is substantially better, at 747, close to the 6900XT. The 6700XT is at 795. So in both cases these architectures get "better" with lower-tier GPUs, which is arguably not surprising.

So what looks like it could be a compute test is turning out to be more subtle than that.
Bandwidth per TFLOPS is not making that much difference to RDNA 2 (29 GB/s per TFLOP in the 6700XT versus 22 in the 6900XT), while with Ampere, 28 versus 26 (3060 and 3090) looks relatively innocuous.
3070Ti has a score per theoretical TFLOPS of 683, with 28 GB/s per TFLOP, and can perhaps be considered the best comparison for the 6700XT.

If bandwidth played an important role, the RTX 3070 should stand out more, since it's got a notable uptick in GFLOPS per GB/s and should do worse, which it doesn't.

A compute efficiency test probably shouldn't vary so substantially according to tier within an architecture, so I suspect this particular benchmark is not a good choice.
Maybe it's got something to do with what this test does - compared to other tests in Luxmark 4 alpha:
"The second LuxMark benchmark is a path tracer with global illumination cache. This rendering mode slightly simpler than pure brute force and may work better on some GPU."
https://wiki.luxcorerender.org/LuxMark_v4
Especially the GI cache sounds like something RDNA2 might profit from. Maybe we'll see numbers with the RX 6600 XT showing a trend here.

Fun fact: the most efficient GeForce is the 2080 Ti when you compare points per GFLOPS.
[chart: LuxMark points per GFLOPS across GeForce cards]

Maybe power budget plays a role? I seem to remember RTX 30 cards were boosting their hearts out in Luxmark, running quite a bit higher than their advertised boosts, at around 1900 MHz IIRC.

Maybe also a larger cache would help. Something Lovelace will fix?
 
RDNA has two SIMDs (general SIMD-32 and transcendental SIMD-8) per instruction scheduler. It requires 2-way issue for full utilisation, but can only ever issue 1 per cycle. Transcendentals take 4 cycles, so if one is started then full utilisation occurs during the following 3 cycles.

An Ampere SM partition has 3 SIMDs (FP32 SIMD-16, FP32/Integer SIMD-16 and SF SIMD-4) and a tensor core (which looks like a SIMD-32 for FP16 operations). It appears to take 4 cycles to get work issued to all these units but I don't know the details of the issue cadences. A reasonable guess is that SFU can be issued every 8 cycles, but I don't know if it takes a cadence from one or other of the SIMD-16s.

I don't know how issues are scheduled for the tensor core and whether "tensor" operations are "slower" than FP16. I'd expect FP16 (which is likely to be used in games) to be able to issue on the tensor core as frequently as once every cycle (since Ampere is double-rate FP16). I'm guessing that a tensor core "shares" a datapath with one of the two SIMD-16s. I don't know how per-cycle issue to the tensor core meshes with the slower issue rate to the two SIMD-16s.

Yeah it's reasonable to assume that issuing to the tensors would preclude issuing to one of the other SIMD pipes in the same cycle. I'm not sure that's important though as we don't double count tensors when talking about peak Ampere flops available to graphics applications. We know that FP16 throughput is 2xFP32 and the assumption is that one data path is sufficient to provide the necessary operands. So whether FP16 is running on the main SIMDs like RDNA or on tensors like Turing/Ampere doesn't really matter.

Nvidia has usually been explicit about the number of instructions dispatched per cycle and in Ampere it's one instruction per partition. So presumably Ampere can only issue to one execution unit each cycle including the load/store units.

I don't quite get your comment about 4-way issue though. Graphics applications don't need to issue to the tensors to achieve "peak utilization" in the way it's currently defined which is 128 FMAs per clock. When we're talking about scaling of graphics or compute applications that's the number we're referring to.
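As a sanity check on that definition, 128 FMAs per clock at 2 FLOPs per FMA reproduces the headline TFLOPS numbers, e.g. for a 3090 (82 SMs, ~1.7 GHz boost; spec-sheet figures, my addition, not from this thread):

```python
# Peak FP32 from "128 FMAs per clock per SM"; 2 FLOPs per FMA.
# The 3090's SM count and boost clock are spec-sheet figures (assumption).

def peak_tflops(sms: int, clock_ghz: float, fmas_per_clock: int = 128) -> float:
    return sms * fmas_per_clock * 2 * clock_ghz / 1000

print(f"RTX 3090: {peak_tflops(82, 1.695):.1f} TFLOPS")  # ~35.6
```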
 
FWIW, Ampere as well as Turing uses the Tensor Cores for standard FP16 math.

Yep, the Ampere issue options are:

16xFP32 + 16xFP32
16xFP32 + 16xINT32
16xFP32 + 32xFP16 (tensor) ---> need to confirm if tensors can co-issue with one of the SIMDs
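Putting rough per-cycle FLOP numbers on those pairings (2 FLOPs per FMA; the tensor row is the unconfirmed case flagged above):

```python
# FLOPs per cycle per SM partition for each co-issue pairing listed above.
# The FP32+FP16(tensor) row assumes the unconfirmed co-issue actually works.

options = {
    "16xFP32 + 16xFP32":  (32, 0),   # (fp32 FMA lanes, fp16 FMA lanes)
    "16xFP32 + 16xINT32": (16, 0),
    "16xFP32 + 32xFP16":  (16, 32),  # tensor co-issue: unconfirmed
}

for name, (fp32, fp16) in options.items():
    print(f"{name}: {fp32 * 2} FP32 FLOPs + {fp16 * 2} FP16 FLOPs per cycle")
```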
 
I don't think AMD has moved in the opposite direction, in terms of async compute support working less well than on GCN?
Who's said that async was worse in RDNA?
It's obviously not, but AMD did nothing to improve async compute in RDNA; what they did do is refactor the graphics pipeline to minimize stalls and improve latencies with the new 32-wide wavefronts, and that was the whole reason behind the better RDNA efficiency.
When you compare the Radeon VII to the 5700 XT, NAVI10 achieves way higher efficiency at the same average performance mostly because, instead of filling in the pipeline bubbles with async compute (impossible to fix automatically in HW), they decreased the number of pipeline stalls in the first place.

So if we get a 75 TF GPU now, 7 times more powerful than the consoles, then I don't see why we are worried it could not just scale those console games 7 times faster, or with 7 times more pixels, both of which would be rather pointless.
Because there is zero demand for 7-times-higher-resolution displays (most PC displays are still 1080p, 1440p is the second most popular resolution, and 4K still captures a minor fraction of the PC market), geometry processing takes pretty much constant time at all resolutions, etc, etc.
If you look carefully, you'll probably notice that most games don't scale linearly with resolution for tons of reasons (not just the CPU); only the heaviest games (thanks to RT and compute), like CP2077, scale linearly with pixels, but that's exactly the type of game where the RTX 3090 is currently up to 2x faster than the RX 6900 XT.
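The pixel arithmetic behind that point: 4K is already 4x 1080p, so a 7x-the-console GPU spent purely on resolution overshoots anything people actually own.

```python
# Pixel counts for the common PC resolutions named above.

resolutions = {
    "1080p": 1920 * 1080,
    "1440p": 2560 * 1440,
    "4K":    3840 * 2160,
}

base = resolutions["1080p"]
for name, px in resolutions.items():
    print(f"{name}: {px / 1e6:.1f} MPix ({px / base:.1f}x 1080p)")
```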
 
Because there is zero demand for 7-times-higher-resolution displays (most PC displays are still 1080p, 1440p is the second most popular resolution, and 4K still captures a minor fraction of the PC market), geometry processing takes pretty much constant time at all resolutions, etc, etc.
If you look carefully, you'll probably notice that most games don't scale linearly with resolution for tons of reasons (not just the CPU); only the heaviest games (thanks to RT and compute), like CP2077, scale linearly with pixels, but that's exactly the type of game where the RTX 3090 is currently up to 2x faster than the RX 6900 XT.

I'd say most don't have anything against AMD, and especially NV, increasing GPU capabilities. About three times the power of the consoles in just normal rendering isn't a bad start, and that's before ray tracing and reconstruction are counted in. The enormous compute power enables things like UE5/Lumen/Nanite as well.
 
Yeah it's reasonable to assume that issuing to the tensors would preclude issuing to one of the other SIMD pipes in the same cycle. I'm not sure that's important though as we don't double count tensors when talking about peak Ampere flops available to graphics applications. We know that FP16 throughput is 2xFP32 and the assumption is that one data path is sufficient to provide the necessary operands. So whether FP16 is running on the main SIMDs like RDNA or on tensors like Turing/Ampere doesn't really matter.
For all we know the "tensor core" is fake: tensor math is similar to the dot-product operations of yore, which happily occupied multiple lanes with very high throughput and low latency on VLIW machines.

Otherwise, the tensor core looks like transistors sat twiddling their thumbs, which is where the "typical games versus theoretical FLOPS" questions enter the picture.

FLOPS per transistor (mm²) and FLOPS per watt are what really matter, so games-actual versus theoretical isn't such an enlivening topic in the end.

Nvidia has usually been explicit about the number of instructions dispatched per cycle and in Ampere it's one instruction per partition. So presumably Ampere can only issue to one execution unit each cycle including the load/store units.

I don't quite get your comment about 4-way issue though. Graphics applications don't need to issue to the tensors to achieve "peak utilization" in the way it's currently defined which is 128 FMAs per clock. When we're talking about scaling of graphics or compute applications that's the number we're referring to.
I was referring to peak utilisation in terms of transistors that do math: the various SIMDs and what proportion of the time they can be used.
 
Otherwise, the tensor core looks like transistors sat twiddling their thumbs, which is where the "typical games versus theoretical FLOPS" questions enter the picture.

It doesn’t though since nobody counts tensors when talking about theoretical flops for gaming.

FLOPS per transistor (mm²) and FLOPS per watt are what really matter, so games-actual versus theoretical isn't such an enlivening topic in the end.

Maybe. But the topic was how application performance scaled with Ampere’s doubled FP32. We have lots of evidence that game performance did not scale anywhere close to the flops increase. However there are other workloads where it came reasonably close.

I was referring to peak utilisation in terms of transistors that do math: the various SIMDs and what proportion of the time they can be used.

Ok.
 
FP16 is the same speed as FP32 on Ampere. Which likely means that you can't do the + there.

You achieve peak FP16 flops on Ampere by issuing a wave-32 of packed FP16 FMA operands (64 sets total) to the tensor pipe. Presumably the tensors will take 2 cycles to process the wave. What’s preventing the SM from issuing a wave-32 of FP32 instructions to one of the other SIMDs in the next cycle while the tensors are still chewing on the FP16 data?

Or do you mean that tensors process the full wave in one cycle so there’s no opportunity to run tensors in parallel with other work?
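For reference, the arithmetic behind the question (under the assumptions stated): a wave-32 of packed FP16 FMAs is 64 FMA pairs, and a SIMD-32 FP16 tensor path would need 2 cycles to retire it, leaving the intervening issue slot free for FP32.

```python
# Cycles for a packed-FP16 wave on an assumed SIMD-32 FP16 tensor path.

WAVE = 32
fp16_fmas_per_wave = WAVE * 2      # packed: 2 FP16 FMAs per lane
tensor_width = 32                  # assumed FP16 lanes on the tensor path
cycles = fp16_fmas_per_wave // tensor_width
print(f"{fp16_fmas_per_wave} FP16 FMAs / {tensor_width} lanes = {cycles} cycles")
```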
 