Nvidia Ampere Discussion [2020-05-14]

Any comment on this, guys?
While traditional FP32 (or FP64 for that matter) certainly wasn't the main target for A100, that 2.5x transistor budget turned into 24% higher theoretical performance at 33% higher power consumption (partly disabled chip, but enabling the rest doesn't come for free either).

So from a gaming perspective it looks more far-fetched than ever based on A100, but A100 doesn't necessarily reflect the gaming models at all, and these are just theoretical numbers we're looking at.
 
Regarding the power efficiency of Ampere: A100 has 54 billion transistors, Titan RTX has 18 billion, V100 has 21 billion, and both of the latter came in at a TDP of 280-300W. So roughly speaking, A100 has 2.5X to 3X the transistor count while simultaneously increasing power to 400W (a ~40% increase).

I know this math is extremely rough around the edges, but it can give us some sort of indication of how much progress NVIDIA has achieved on 7nm. The claim that Ampere is 50% faster than Turing at half the power is not that far-fetched, at least?

It's hard to really guess much from this data. I would look at it as a 33% increase in power from V100 NVLink (300W) to A100 NVLink (400W). But NVLink is already a big confounder; the power draw of 600 GB/s NVLink should be massive. A PCIe A100 would make it easier to compare.
Ignoring this and calculating TDP per transistor, we could imagine that 7nm DUV with Ampere can fit ~1.9x the transistors at the same TDP with slightly lower clocks (A100 vs. V100). But for consumer cards they won't reduce clock speeds and might even increase them.
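For what it's worth, here is the back-of-the-envelope arithmetic behind that ~1.9x figure, using only the rough numbers already quoted in this thread (a sketch, not a proper perf/W analysis, and it ignores the NVLink overhead mentioned above):

```python
# Rough transistors-per-watt comparison from the figures quoted in this thread.
v100_transistors, v100_tdp = 21e9, 300   # V100 NVLink, ~300 W
a100_transistors, a100_tdp = 54e9, 400   # A100 NVLink, 400 W

transistor_ratio = a100_transistors / v100_transistors   # ~2.6x
power_ratio = a100_tdp / v100_tdp                         # ~1.33x

# Transistors you could fit at the same TDP, all else being equal:
print(round(transistor_ratio / power_ratio, 2))           # ~1.9
```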
 
So from a gaming perspective it looks more far-fetched than ever based on A100
That's the wrong angle; you don't look at traditional FP32 in an AI-optimized chip. So looking at the transistor budget is enough for now, IMO. What we know so far, based on the shown numbers, is that NVIDIA crammed in 2.5X the number of transistors while only using 33% more power (a considerable portion of which is coming from the beefed-up NVLink). That's a ton of power efficiency right there.

If we care about comparing similar performance metrics, then FP16 and INT8 should be enough; those also got an increase of 2.5X (compared to V100 and Titan RTX respectively) despite the reduction in Tensor Core count in A100 compared to both of them.
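To put that "fewer Tensor Cores, 2.5X the throughput" point in perspective, here is a rough per-core calculation. The core counts and dense FP16 figures are my recollection of the public spec sheets (V100: 640 Tensor Cores, ~125 TFLOPS; A100: 432 Tensor Cores, ~312 TFLOPS dense), so treat them as approximate:

```python
# Rough per-Tensor-Core FP16 throughput, dense (no sparsity), spec-sheet figures.
v100_tensor_cores, v100_fp16_tflops = 640, 125   # V100
a100_tensor_cores, a100_fp16_tflops = 432, 312   # A100, dense

print(round(a100_fp16_tflops / v100_fp16_tflops, 2))     # ~2.5x overall
per_core_gain = (a100_fp16_tflops / a100_tensor_cores) / (v100_fp16_tflops / v100_tensor_cores)
print(round(per_core_gain, 2))                            # ~3.7x per Tensor Core
```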

But for consumer cards they won't reduce clock speeds and might even increase them.
They will definitely increase them, just like the Volta/Turing situation. In fact, if you think about it, we are in the exact same situation here: V100 was devoid of RT cores, and Turing was the gaming/workstation version of V100 with added RT cores and higher clocks. In the same way, A100 is devoid of RT cores, and we are just waiting for the gaming version of Ampere.
 
Nvidia has been using the same block diagrams since Kepler. They’ve never been to scale. E.g. the schedulers are likely larger than depicted and the fixed function geometry hardware isn’t represented at all.

Nah, it's just a schematic. The sizes are up to the artist to decide, based on how it will look best.

Fair enough. Unfortunate, but I'm not surprised NV would not disclose this information.
 
If the app doesn't need to change, how does the app then tell the difference between training with FP32 and FP19 (aka TF32)?
All major DL frameworks, which call into NVIDIA APIs, will automatically use TF32 by default.

It also begs the question of whether there is much value to FP19, as dropping 3 more bits from the mantissa gives you BFloat16, which makes training 2x faster.
BTW, note that you cannot use the sparsity feature for training; it is for inference only.
The value is not having to change your application. BF16 doesn't tend to work well across a wide range of networks without manual intervention. Most DL developers are not performance experts. They just want their code to work well and fast out of the box.
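For anyone who wants to see the format difference concretely, here is a small sketch of my own (not NVIDIA code) that mimics the mantissa truncation being discussed: TF32 keeps FP32's 8-bit exponent with a 10-bit mantissa, BF16 keeps the same exponent with only 7 mantissa bits.

```python
import numpy as np

def truncate_mantissa(x, mantissa_bits):
    # View the FP32 bit pattern as uint32, zero out the low mantissa bits,
    # then view it back as FP32 (real hardware rounds; this simply truncates).
    bits = np.array([x], dtype=np.float32).view(np.uint32)
    drop = 23 - mantissa_bits                         # FP32 carries 23 mantissa bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)[0]

x = 3.14159265
print("fp32:", np.float32(x))               # 8-bit exponent, 23-bit mantissa
print("tf32:", truncate_mantissa(x, 10))    # 8-bit exponent, 10-bit mantissa
print("bf16:", truncate_mantissa(x, 7))     # 8-bit exponent,  7-bit mantissa
```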
 
BF16 doesn't tend to work well across a wide range of networks without manual intervention. Most DL developers are not performance experts. They just want their code to work well and fast out of the box.

Any references to articles that show BF16 doesn't work well for a wide range of networks?
Google TPU2 and TPU3 are exclusively BF16.
Intel adopted BF16 with Habana and AVX-512, and so has ARM.
It is quite remarkable that the Ampere introduction video did not even mention BF16; it's as if it's not even there.
 
Any references to articles that show BF16 doesn't work well for a wide range of networks?
Google TPU2 and TPU3 are exclusively BF16.
Intel adopted BF16 with Habana and AVX-512, and so has ARM.
It is quite remarkable that the Ampere introduction video did not even mention BF16; it's as if it's not even there.

AFAIK TPUs go faster if you use BF16, but they also support FP32.
Google say BF16 is close to a drop-in replacement for FP32, so it's not positioned the same way TF32 is.

Evidently with Ampere the focus is to first enable the vast majority of DL practitioners that just want to get great performance out of the box, without having to change a line of code.
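As an illustration of what that "out of the box" model looks like: PyTorch later added framework-level switches along these lines (they did not exist at the time of this thread, and names may differ in other frameworks), so the model code itself never changes; you only get a global opt-out if you need full FP32.

```python
# Sketch: the "no code change" story. On Ampere, matmuls/convolutions can run
# in TF32 by a framework-level default; flags like these let you opt in or out
# (defaults have varied across PyTorch releases).
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b   # runs on TF32 Tensor Cores; the model code is unchanged
```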
 
That's the wrong angle; you don't look at traditional FP32 in an AI-optimized chip. So looking at the transistor budget is enough for now, IMO. What we know so far, based on the shown numbers, is that NVIDIA crammed in 2.5X the number of transistors while only using 33% more power (a considerable portion of which is coming from the beefed-up NVLink). That's a ton of power efficiency right there.

If we care about comparing similar performance metrics, then FP16 and INT8 should be enough; those also got an increase of 2.5X (compared to V100 and Titan RTX respectively) despite the reduction in Tensor Core count in A100 compared to both of them.


They will definitely increase them, just like the Volta/Turing situation. In fact, if you think about it, we are in the exact same situation here: V100 was devoid of RT cores, and Turing was the gaming/workstation version of V100 with added RT cores and higher clocks. In the same way, A100 is devoid of RT cores, and we are just waiting for the gaming version of Ampere.

Eh, they already pushed the Super cards pretty hard; the 2080 Super already draws 250 watts. While I think RDNA1 made them learn their lesson about power draw versus performance, in that 90% or more of consumers don't give a shit and just want to "go fast", the difference in the fmax curve between 7nm and 12nm isn't actually that great. While I'd expect them to push the chips pretty high, much like the Super versions, I doubt we'll see frequencies much beyond those cards.

Part of this is that apparently the consumer cards will be the same architecture. And while the A100 has just over double the transistor count, clock speed actually went down from V100 while power draw increased by 33%. Which suggests there's no particular focus on getting more frequency versus power draw out of the silicon.
 
That's the wrong angle; you don't look at traditional FP32 in an AI-optimized chip. So looking at the transistor budget is enough for now, IMO. What we know so far, based on the shown numbers, is that NVIDIA crammed in 2.5X the number of transistors while only using 33% more power (a considerable portion of which is coming from the beefed-up NVLink). That's a ton of power efficiency right there.
Lower precision ALUs take much less power AFAIK, so with more chip area spent on those we can not compare to previous generations directly.

Predictions about the consumer parts are quite hard to make yet, I think? I guess they trade tensor area vs. RT, but who knows. Maybe consumer even ends up with higher FP32 perf.
The way Mr. Jensen presented DLSS as the solution to make RT practical at least makes me pretty certain tensor cores will not be removed or shrunk in comparison to Turing.

These are pretty interesting times: RT and ML on one side, totally unexpected success and demand for traditional compute on the other (UE5). Surely hard to pick the sweet spot of compromise here.
 
Lower precision ALUs take much less power AFAIK, so with more chip area spent on those we can not compare to previous generations directly.

They also take far fewer transistors and occupy a much smaller die area, so it basically evens out.
 
They also take far fewer transistors and occupy a much smaller die area, so it basically evens out.
Hmm, yeah - not sure if I interpreted the quote properly.
As implied by DF, both. On PC at least.
Sure, but for entry and midrange the ratio matters (assuming all models get the features this time), and I could only guess what the right ratio should be. It's only harder this gen.
 
Hardware RT probably has a future; it would be a bit meh if it sits practically unused for the next 7 years :)

Though what they did in the UE demo was just as amazing, but I guess hardware will be faster; maybe use it for just reflections etc.
 
Hardware RT probably has a future; it would be a bit meh if it sits practically unused for the next 7 years :)

Though what they did in the UE demo was just as amazing, but I guess hardware will be faster; maybe use it for just reflections etc.

UE5 really shakes things up, maybe even harder than RTX did. It's a game changer, and it changes a lot of things that are not obvious at first, for example NV's lead over AMD.
The primary reason for this lead is better rasterization performance (and other fixed-function stuff like tessellation). But I assume Nanite draws only a small number of triangles, and most is rendered from compute.
If this is true, and I have no reason to assume Ampere compute performance could beat RDNA2 by a large factor or at all, the picture could change.
NV could no longer afford experiments like tensor cores so easily. Also, a lead in RT perf would count for less, because GI has more effect on the overall image, and compute can do diffuse GI better than RT. (Personal opinion, but Lumen confirms it a bit.)

With AMD offering RT too, devs will come up with their own upscaling solutions. E.g. there are more options, like RT at half res but visibility at full res, which have not been explored but make sense.
DLSS is vendor-locked, so it is no longer an option. NV can no longer rely on it to be the system seller for tensor cores that sit unused during most of the frame.
So I guess they'll come up with ML denoising, but it's not guaranteed it will become the norm. If denoising really benefited from ML, we should have seen it already during Turing.

In the worst case for NV, we soon end up with 20TF flagships from both NV and AMD, and AMD might achieve this with smaller and cheaper chips if NV increases chip area spent on tensor cores.

So that's why I think the competition between NV and AMD will become more interesting this time.


UE5 and RT together is also interesting. To make this detail traceable, we need some options for LOD and to stream the BVH per level, not only per object.
DX12U is not enough.
 
Which suggests there's no particular focus on getting more frequency versus power draw out of the silicon.
3 years ago, if you looked at V100's frequencies alone, you wouldn't have thought Turing's clocks would reach as high as they did. Gaming chips will have a completely different configuration.

These V100 and A100 chips are also passively cooled; you can't expect anyone to go all out on frequency with such cooling solutions.

clock speed actually went down from V100 while power draw increased by 33%.
As explained above, a big part of this increase is the upgraded NVLink with 600 GB/s. For example, V100 NVLink increased power consumption by 50W over the V100S PCI-E.
 
So that's why I think the competition between NV and AMD will become more interesting this time.
I agree, and AMD is already jumping the gun in that regard. The recent debacle with Radeon Rays 4.0 losing its open-source status, and AMD's rethink due to the community backlash, indicates AMD may be trying to secure a strategic software presence for future architectural endeavors.
 