We don't have any real performance comparisons, so it's unanswerable. Any comment on this, guys?
While traditional FP32 (or FP64, for that matter) certainly wasn't the main target for A100, that 2.5x transistor budget turned into 24% higher theoretical performance at 33% higher power consumption (partly disabled chip, but enabling the rest doesn't come for free either).
Regarding the power efficiency of Ampere: A100 has 54 billion transistors, while Titan RTX has 18 billion and V100 has 21 billion, and both of those came in at a TDP of 280~300 W. So, roughly speaking, A100 introduced 2.5X to 3X the transistor count while simultaneously increasing power to 400 W (a 40% increase).
I know this math is extremely rough around the edges, but it can give us some sort of indication of how much progress NVIDIA has achieved on 7nm. The claim that Ampere is 50% faster than Turing at half the power is not that far fetched, at least?
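To make that back-of-the-envelope math concrete, here is a rough Python sketch using the figures quoted in this thread plus the commonly cited launch specs (transistor counts, TDPs, peak non-tensor FP32); treat them as approximate spec-sheet numbers, not measurements.

```python
# Rough perf/power arithmetic from the numbers discussed above
# (approximate launch-material figures, not measurements).
specs = {
    #            transistors (B), TDP (W), peak non-tensor FP32 (TFLOPS)
    "V100":      (21.1, 300, 15.7),
    "Titan RTX": (18.6, 280, 16.3),
    "A100":      (54.2, 400, 19.5),
}

t_v, w_v, f_v = specs["V100"]
t_t, w_t, f_t = specs["Titan RTX"]
t_a, w_a, f_a = specs["A100"]

print(f"transistors vs V100:      {t_a / t_v:.2f}x")   # ~2.6x
print(f"transistors vs Titan RTX: {t_a / t_t:.2f}x")   # ~2.9x
print(f"power vs V100:            {w_a / w_v:.2f}x (+{w_a / w_v - 1:.0%})")  # +33%
print(f"peak FP32 vs V100:        {f_a / f_v:.2f}x (+{f_a / f_v - 1:.0%})")  # +24%
print(f"FP32 per watt vs V100:    {(f_a / w_a) / (f_v / w_v):.2f}x")         # ~0.93x
```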
That's the wrong angle; you don't look at traditional FP32 in an AI-optimized chip. So looking at the transistor budget is enough for now, IMO. What we know so far, based on the shown numbers, is that NVIDIA crammed in 2.5X the number of transistors while only using 33% more power (a considerable portion of which is coming from the beefed-up NVLink). That's a ton of power efficiency right there.

So from a gaming perspective it looks more far fetched than ever based on A100.
They will definitely increase them, just like with Volta/Turing. In fact, if you think about it, we are in the exact same situation here: V100 was devoid of RT cores, and Turing was the gaming/workstation version of V100 with added RT cores and higher clocks; in the same way, A100 is devoid of RT cores, and we are just waiting for the gaming version of Ampere.

But for consumer cards they won't reduce clock speeds and might even increase them.
Nvidia has been using the same block diagrams since Kepler. They’ve never been to scale. E.g. the schedulers are likely larger than depicted and the fixed function geometry hardware isn’t represented at all.
Nah, it's just a schematic. The sizes are up to the artist to decide based on what looks better.
All major DL frameworks, which call into NVIDIA APIs, will automatically be using TF32 by default.

If the app doesn't need to change, how does the app then tell the difference between training with FP32 and FP19 (aka TF32)?
The value is not having to change your application. BF16 doesn't tend to work well across a wide range of networks without manual intervention. Most DL developers are not performance experts; they just want their code to work well and fast out of the box.

It also begs the question of whether there is much value to FP19, as dropping 3 more bits from the mantissa gives you BFloat16, which makes training 2x faster.
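For reference on the FP19/BF16 comparison, the formats differ only in how many exponent and mantissa bits they keep. A tiny sketch with the publicly described bit counts (the table and printout are just for illustration):

```python
# sign / exponent / mantissa bits for the formats discussed above
formats = {
    "FP32": (1, 8, 23),
    "TF32": (1, 8, 10),  # "FP19": FP32's exponent range with an FP16-sized mantissa
    "BF16": (1, 8, 7),   # TF32 minus 3 mantissa bits
    "FP16": (1, 5, 10),
}

for name, (s, e, m) in formats.items():
    print(f"{name}: {s + e + m:2d} bits total, "
          f"{e} exponent bits (range), "
          f"~{2.0 ** -m:.1e} relative precision")
```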
BTW, remark: you cannot use the sparsity feature for training; it is for inferencing.
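For context, the sparsity feature is the 2:4 structured kind: at most 2 non-zero weights in every group of 4, which the Tensor Cores can then skip at inference time. A minimal numpy sketch of pruning a trained weight matrix to that pattern (purely illustrative, not NVIDIA's actual pruning tooling):

```python
import numpy as np

def prune_2_of_4(w):
    """Zero the 2 smallest-magnitude weights in every group of 4 consecutive values."""
    w = np.asarray(w, dtype=np.float32)
    groups = w.reshape(-1, 4)                   # assumes total size divisible by 4
    order = np.argsort(np.abs(groups), axis=1)  # smallest magnitudes first
    keep = np.ones_like(groups, dtype=bool)
    np.put_along_axis(keep, order[:, :2], False, axis=1)
    return (groups * keep).reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)
sparse_w = prune_2_of_4(w)
# every group of 4 now has at most 2 non-zero entries
print((sparse_w.reshape(-1, 4) != 0).sum(axis=1).max())  # -> 2
```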
Any references to articles that show BF16 doesn't work well for a wide range of networks?
Google TPU2 and TPU3 are exclusively BF16.
Intel adopted BF16 with Habana and AVX-512, as has ARM.
It is quite remarkable that the Ampere introduction video did not even mention BF16, as it's not even there.
If we care about comparing similar performance metrics, then FP16 and INT8 should be enough; those also got a 2.5X increase (compared to V100 and Titan RTX respectively), despite the reduction in Tensor Core count in A100 compared to both of them.
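That 2.5X lines up with the peak dense Tensor Core numbers from the launch material, as far as I recall them (approximate figures, double-check against the datasheets):

```python
# Approximate peak dense Tensor Core throughput (launch-material numbers, not measured).
fp16_tflops = {"V100": 125, "A100": 312}
int8_tops   = {"Titan RTX": 261, "A100": 624}  # A100 doubles again with 2:4 sparsity

print(f"FP16: A100 / V100      = {fp16_tflops['A100'] / fp16_tflops['V100']:.2f}x")   # ~2.5x
print(f"INT8: A100 / Titan RTX = {int8_tops['A100'] / int8_tops['Titan RTX']:.2f}x")  # ~2.4x
```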
Lower precision ALUs take much less power AFAIK, so with more chip area spent on those we cannot compare to previous generations directly.
Surely hard to pick the sweet spot of compromise here.
Hmm, yeah - not sure if I interpreted the quote properly. They also take far fewer transistors and occupy a much smaller die area, so it basically evens out.
Sure, but for entry-level and midrange the ratio matters (assuming all models get the features this time), and I could only guess what the right ratio should be. It's only harder this gen.

As implied by DF, both. On PC at least.
Hardware RT probably has a future; it would be a bit meh if it sits practically unused for the next 7 years.
Though what they did in the UE demo was just as amazing; I guess hardware will be faster, maybe used for just reflections, etc.
So that's why I think the NV vs. AMD competition will become more interesting this time.
3 years ago, if you looked at V100's frequencies alone, you wouldn't have thought Turing's clocks would reach as high as they did. Gaming chips will have a completely different configuration.

Which suggests there's no particular focus on getting more frequency versus power draw out of the silicon.
As explained above, a big part of the power increase is the upgraded NVLink with 600 GB/s. For example, NVLink increased V100's power consumption by 50 W over the V100S PCIe.
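Quick arithmetic on that point, using the 50 W NVLink estimate from the V100 vs. V100S comparison (an estimate on top of an estimate, so take it loosely):

```python
# How much of A100's extra power budget might be NVLink rather than compute.
v100_tdp, a100_tdp = 300, 400  # W
nvlink_estimate = 50           # W, from the V100 NVLink vs V100S PCIe gap quoted above

total_increase = a100_tdp - v100_tdp
compute_increase = total_increase - nvlink_estimate

print(f"total increase:        {total_increase} W (+{total_increase / v100_tdp:.0%})")     # +33%
print(f"minus NVLink estimate: {compute_increase} W (+{compute_increase / v100_tdp:.0%})")  # +17%
```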
I agree, and AMD is already jumping the gun in that regard. The recent debacle with Radeon Rays 4.0 losing its open-source status, and AMD's re-think due to the community backlash, indicates AMD may be trying to secure a strategic software presence for future architectural endeavors.