Nvidia Ampere Discussion [2020-05-14]

Isn't FP32 from A100 TCs a non-IEEE one?

I meant that maybe gaming Amperes can use FP32 from the Tensor Cores to double the FP32 rate per SM, as the current speculation suggests. A100 can't do it.

This rate is maintained on all TC precision modes though which means that it's not coming from FP32 SIMDs, no?

There was an interview with an employee, or a blog post, which mentioned that the normal FP16 rate comes at 2x from the normal FP32 units, and that additionally they enabled the TCs for normal FP16 operations to add another 2x throughput and reach 4x the FP32 speed. The TC matrix FP16 rate is much higher still. Turing already uses the TCs for its 2xFP16 throughput, as far as I understand it.
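As a rough sketch of the rates being discussed (the per-SM lane count and the 2x/4x factors below are illustrative assumptions taken from the post above, not confirmed figures for gaming Ampere):

```python
# Illustrative per-SM, per-clock FP16 rates (assumed Turing-like lane
# counts; not confirmed figures for gaming Ampere).
fp32_lanes = 64
fp32_rate = fp32_lanes            # FP32 ops issued per clock
fp16_via_fp32 = fp32_rate * 2     # packed FP16x2 through the FP32 units
fp16_with_tc_path = fp32_rate * 4 # + another 2x via the tensor-core path
print(fp32_rate, fp16_via_fp32, fp16_with_tc_path)  # → 64 128 256
```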

I see two possibilities for gaming Ampere here:

1. Double width FP32 SIMDs which will likely lead to a double width of INT32 SIMD as well. They've done this previously between GP100 and GP10x.

2. A second 16-wide FP32 SIMD in place of the FP64 one of GA100. But for that to work well they'll need to be able to schedule FP32+FP32+INT32 or it will be either FP32+FP32 or FP32+INT32 per clock which will result in utilization issues.

As troyan mentioned in his link, if they can get 2xFP16 from the TC datapath, why not change it to get an additional FP32 issue from the TC datapath?
I totally expect utilization issues. It seems gaming Ampere will change from a high-IPC architecture to a low-IPC one in the case of FP32.
 
This rate is maintained on all TC precision modes though which means that it's not coming from FP32 SIMDs, no?

I see two possibilities for gaming Ampere here:

1. Double width FP32 SIMDs which will likely lead to a double width of INT32 SIMD as well. They've done this previously between GP100 and GP10x.

2. A second 16-wide FP32 SIMD in place of the FP64 one of GA100. But for that to work well they'll need to be able to schedule FP32+FP32+INT32 or it will be either FP32+FP32 or FP32+INT32 per clock which will result in utilization issues.

Doubling SIMD width would require a second dispatch unit otherwise you end up with the same utilization problem as Kepler.

I’m hedging that the 2xFP32 is not general purpose. Maybe a fast path through the tensors for RT calcs or something. It’s mind boggling to think Ampere will push 30+ tflops of general compute.
 
Doubling SIMD width would require a second dispatch unit otherwise you end up with the same utilization problem as Kepler.
On the contrary, doubling the width won't require any changes to dispatch. Adding a second one of the same width will though - if GA100 can't schedule FP32+INT+FP64, which seems an unlikely scenario.
 
Doubling SIMD width would require a second dispatch unit otherwise you end up with the same utilization problem as Kepler.

I’m hedging that the 2xFP32 is not general purpose. Maybe a fast path through the tensors for RT calcs or something. It’s mind boggling to think Ampere will push 30+ tflops of general compute.

Isn't that the same utilization problem A100 should have with 4xFP16? But somehow they implemented it at least for some corner cases.
 
It’s mind boggling to think Ampere will push 30+ tflops of general compute.
Why though? Maxwell and Pascal were 32 wide and it always seemed excessive for gaming Turing to be 16 wide. If they'll go back to 32 wide for general math then you'll get something around 40 tflops from GA102 - which could be a good thing for non-gaming applications for the latter too.
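For scale, a back-of-envelope check of that figure (the SM count and clock below are guesses for a hypothetical GA102 with doubled FP32 per SM, not known specs):

```python
# Hypothetical GA102: doubled FP32 per SM vs Turing (assumed numbers).
sms = 84            # assumed SM count
fp32_per_sm = 128   # 2 x Turing's 64 FP32 lanes
clock_ghz = 1.8     # assumed boost clock
tflops = sms * fp32_per_sm * 2 * clock_ghz / 1000  # FMA counts as 2 FLOPs
print(round(tflops, 1))  # → 38.7
```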
 
On the contrary, doubling the width won't require any changes to dispatch. Adding a second one of the same width will though - if GA100 can't schedule FP32+INT+FP64, which seems an unlikely scenario.

The dispatcher issues 32 threads per clock. That’s enough to feed one 32-wide pipe.

GA100 can schedule FP32+INT32 concurrently because each pipe is only 16-wide. Issuing to any other execution unit (FP64, SFU, Load/Store) will cause bubbles in the main FP and INT pipelines.
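A quick sketch of the arithmetic behind that argument (this is just the reasoning spelled out, not documented scheduler behavior):

```python
# One dispatcher issues one 32-thread warp per clock; execution pipes
# are 16 lanes wide, so each warp occupies its pipe for two clocks.
warp_size = 32
pipe_width = 16
clocks_per_warp = warp_size // pipe_width  # 2 clocks to drain one warp
# On the second clock the dispatcher is free, so it can issue another
# warp to the INT32 pipe: two 16-wide pipes stay fed, but any issue to
# FP64/SFU/LSU steals one of those slots and leaves a bubble.
pipes_sustained = clocks_per_warp
print(clocks_per_warp, pipes_sustained)  # → 2 2
```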
 
Why though? Maxwell and Pascal were 32 wide and it always seemed excessive for gaming Turing to be 16 wide. If they'll go back to 32 wide for general math then you'll get something around 40 tflops from GA102 - which could be a good thing for non-gaming applications for the latter too.

Oh it’ll be great but certainly not free. I will be very impressed if it’s true and the die size is under 800mm2.
 
The dispatcher issues 32 threads per clock. That’s enough to feed one 32-wide pipe.

GA100 can schedule FP32+INT32 concurrently because each pipe is only 16-wide. Issuing to any other execution unit (FP64, SFU, Load/Store) will cause bubbles in the main FP and INT pipelines.
Yeah, that's true, hadn't thought of this. Well, guess we'll see soon.

Does it issue 32 threads though or does it issue 16+16 from two warps?
 
Isn't that the same utilization problem A100 should have with 4xFP16? But somehow they implemented it at least for some corner cases.

Well we don’t know that there isn’t a utilization problem with 4xFP16. E.g. can you still issue to the INT pipe while doing that?
 
Well, I will be completely floored and made a fool of. A completely non-standard memory standard with a "surprise motherf*cker!" announcement. What a damned weird thing to do.

Especially as a 384-bit bus with 18 Gbps GDDR6 could get the high-end 3090 enough bandwidth by itself, at least going by the leaked performance. But hell, maybe this means they're doing the weird cut-down bus thing again. RTX Titan 2 with 24 GB RAM, 3090 with 10/20 GB? Or 11/22?
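For reference, the bandwidth arithmetic behind that (a quick check, assuming a full 384-bit bus with the 18 Gbps parts mentioned above):

```python
# Peak bandwidth = (bus width in bytes) x (data rate per pin in Gb/s)
bus_bits = 384
data_rate_gbps = 18
bandwidth_gb_s = bus_bits // 8 * data_rate_gbps  # 48 bytes x 18
print(bandwidth_gb_s)  # → 864
```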

Will be interesting to see the announcement now. And I wonder if GDDR6X yield is low enough, or it's expensive enough versus normal, that the surprise announcement somehow makes sense.
I still fondly remember how @aaronspink was dubious that GDDR would go beyond 6Gbps:

Nvidia GT300 core: Speculation

I honestly don't see GDDR5 getting much beyond 6 GT/s without branching into differential data variants.

Anyway, Aaron taught us much about memory back in the good old days.
 
nVidia responded on their dev blog but the comments are not visible anymore. CarstenS has copied the response: https://forum.beyond3d.com/posts/2128606/
Wow, they "lost" the comment section? That's... interesting.

edit 200905: I managed to find the comment in Krashinsky's Disqus profile. A screenshot is attached for your reference, in case it might get "lost" over there too. (Ampere_GA100_4xFP16-rate_Krashinsky.png)

BTW: chip on the back? Zotac (and someone before them) had those on the back of the PCB, opposite the GPU - but it was not another GPU, it was a super-cap:
https://www.zotac.com/download/file...ery/graphics_cards/zt-t20820b-10p_image04.jpg
 
That was on super expensive editions of cards earlier, though. Maybe that's what's inflating the BOM, among other things such as those rumored high-speed-enabling PCBs.
 
That was on super expensive editions of cards earlier, though. Maybe that's what's inflating the BOM, among other things such as those rumored high-speed-enabling PCBs.
It's still supposedly Colorful's Vulcan, not reference, in the leaks.
 
Makes sense that the FP32/INT32 ratio is 2:1, IMO.
In Nvidia's own marketing material, the ratio of INT32 operations in games is only up to about 40% IIRC, so Turing had an unbalanced amount of INT32 units (at least for game rendering).
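A quick utilization check using that ~40% figure (illustrative only; the real instruction mix varies per game):

```python
# If games issue ~40 INT32 ops per 100 FP32 ops (the marketing figure
# cited above), compare INT32-pipe utilization for two unit ratios.
int_per_fp = 0.40
util_1to1 = int_per_fp / 1.0   # Turing-like 1:1 -> INT pipe ~40% busy
util_2to1 = int_per_fp / 0.5   # rumored 2:1   -> INT pipe ~80% busy
print(util_1to1, util_2to1)
```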
 
Tensor FP32 will presumably be only MAD/FMA and ADD, along with a latency that probably means at least 2 dependent ops are required for it to be worth using at all. Should be good for geometry, I suppose.
 