Would this mean the ability to program the GPU directly in C++ instead of using CUDA?
edit. Digging a bit more
"Not quite, each Orin can go at least up to 45W (L2+ spec), but @Ryan Smith suggests they could go as high as 65 - 70W each. Then there's whatever that daughterboard in the upper edge is too. The whole platform is supposed to be 800W."
@Nebuchadnezzar wrote that the A100 can go up to 400W, though he also assumed the GPUs in Robotaxi are the A100, which they don't seem to be if you compare their pictures.
"FP32 and DP compute performance is not relevant for A100."
Of course it is. DP performance is the first number they're showing in the chip's specs list.
"If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere"
An obvious one, and a huge one at that.
"without knowing how many transistors are assigned to which h/w unit"
Tensor core count has been reduced in the A100: it now stands at 432 vs. 640 in V100, and they even run at lower clocks. From that we can infer that a significant part of the transistor budget has gone into the new tensor units to improve their IPC. They also support significantly more formats now, which costs a significant number of transistors as well.
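To put a rough number on the per-core gain, here is a back-of-the-envelope sketch. The boost clocks and peak dense FP16 tensor TFLOPS are assumptions taken from NVIDIA's published spec sheets, not from this thread, and peak figures say nothing about transistor counts per unit.

```python
# Assumed published peak figures (NVIDIA spec sheets), not measured values.
SPECS = {
    "V100": {"tensor_cores": 640, "boost_ghz": 1.53, "fp16_tensor_tflops": 125.0},
    "A100": {"tensor_cores": 432, "boost_ghz": 1.41, "fp16_tensor_tflops": 312.0},
}

def flops_per_core_per_clock(spec):
    """Peak FP16 tensor FLOPs delivered by one tensor core in one clock."""
    total_flops_per_s = spec["fp16_tensor_tflops"] * 1e12
    clocks_per_s = spec["boost_ghz"] * 1e9
    return total_flops_per_s / (spec["tensor_cores"] * clocks_per_s)

per_core = {name: flops_per_core_per_clock(s) for name, s in SPECS.items()}
ratio = per_core["A100"] / per_core["V100"]

for name, value in per_core.items():
    print(f"{name}: {value:.0f} FP16 FLOPs per tensor core per clock")
print(f"A100 / V100 per-core ratio: {ratio:.2f}x")
```

With those inputs each A100 tensor core comes out at roughly four times the per-clock throughput of a V100 one, which fits the idea that fewer, fatter units ate a good chunk of the budget.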
"Doesn't the required precision for the task at hand determine how fast and what you can use in A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?"
Ampere GA100 TCs support FP64 now.
Doesn't the required precision for the task at hand determine how fast and what you can use in A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
"Ampere GA100 TCs support FP64 now."
OK, I thought I did see that in the spec sheet.
"Doesn't the required precision for the task at hand determine how fast and what you can use in A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?"
A100's tensor cores support two significant new data formats: TF32 and FP64. TF32 doesn't require code changes, but FP64 does.
"With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.
Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.
A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.
We refer to this new capability as Double-Precision Tensor Cores."
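One plausible reading of the "eight traditional FP64 instructions" figure, sketched in plain Python. It assumes the FP64 tensor-core tile is 8x8x4 (the m8n8k4 shape CUDA exposes for double-precision MMA) and that the work is spread over a 32-thread warp; `dmma_tile` is a made-up name for illustration, not a CUDA or CUDA-X call.

```python
M, N, K = 8, 8, 4   # assumed FP64 MMA tile shape (m8n8k4)
WARP_SIZE = 32      # threads cooperating on one warp-wide MMA

def dmma_tile(a, b, c):
    """Compute d = a @ b + c for one tile and count the scalar FMAs it takes."""
    d = [row[:] for row in c]
    fma_count = 0
    for i in range(M):
        for j in range(N):
            for k in range(K):
                d[i][j] += a[i][k] * b[k][j]   # one scalar FP64 fused multiply-add
                fma_count += 1
    return d, fma_count

a = [[float(i + k) for k in range(K)] for i in range(M)]
b = [[float(k * j) for j in range(N)] for k in range(K)]
c = [[1.0] * N for _ in range(M)]

d, fma_count = dmma_tile(a, b, c)
fmas_per_thread = fma_count // WARP_SIZE
print(f"{fma_count} scalar FMAs per tile, {fmas_per_thread} per thread in a warp")
```

Without tensor cores each thread would have to issue those eight FMAs itself. With DMMA the warp issues a single instruction for the whole tile.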
In an hour nVidia will talk about the Ampere architecture: https://www.nvidia.com/en-us/gtc/#webinar-schedule
"In an hour nVidia will talk about the Ampere architecture: https://www.nvidia.com/en-us/gtc/#webinar-schedule"
There's also free registration.
nVidia published the Ampere whitepaper: https://www.nvidia.com/content/dam/...ter/nvidia-ampere-architecture-whitepaper.pdf
Asynchronous Barrier
The A100 GPU provides hardware-accelerated barriers in shared memory. These barriers are
available using CUDA 11 in the form of ISO C++-conforming barrier objects. Asynchronous
barriers split apart the barrier arrive and wait operations, and can be used to overlap
asynchronous copies from global memory into shared memory with computations in the SM.
They can be used to implement producer-consumer models using CUDA threads. Barriers also
provide mechanisms to synchronize CUDA threads at different granularities, not just warp or
block level.
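To make the split arrive/wait idea concrete, here is a CPU-side analogy using Python threads. This is not the CUDA 11 barrier API: `SplitBarrier` and `run` are hypothetical names invented for this sketch, and host threads stand in for CUDA threads.

```python
import threading

class SplitBarrier:
    """Toy barrier whose arrive and wait steps are separate calls (illustrative only)."""

    def __init__(self, parties):
        self._parties = parties
        self._count = 0
        self._phase = 0
        self._cond = threading.Condition()

    def arrive(self):
        """Signal arrival without blocking; returns a token for the current phase."""
        with self._cond:
            token = self._phase
            self._count += 1
            if self._count == self._parties:
                self._count = 0
                self._phase += 1
                self._cond.notify_all()
            return token

    def wait(self, token):
        """Block until the phase identified by token has completed."""
        with self._cond:
            self._cond.wait_for(lambda: self._phase != token)

def run(num_workers=4):
    barrier = SplitBarrier(num_workers)
    shared = [None] * num_workers    # stands in for shared memory
    totals = [None] * num_workers

    def worker(rank):
        shared[rank] = rank * 10     # "copy" this worker's data into shared storage
        token = barrier.arrive()     # announce that the data is in place
        local = rank ** 2            # independent work overlapping the other copies
        barrier.wait(token)          # only now block until everyone has arrived
        totals[rank] = sum(shared) + local

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

print(run())
```

The point is the gap between `arrive()` and `wait()`: a classic barrier would stall there, while the split form lets each thread do useful work until it actually needs the others' data.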
It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to which h/w unit and how they contribute to the rated 400W power draw figure.
If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere, but again it's hard to say how uniform that gain is or how it will translate into graphics-oriented GPUs.
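For reference, the direct comparison works out roughly as below. The transistor counts and rated board power are assumptions taken from NVIDIA's published specs (GA100 in the 400W SXM4 A100, GV100 in the 300W SXM2 V100), and V100 as the baseline is also an assumption since the post doesn't name one.

```python
# Assumed published figures: transistor count in billions, rated board power in watts.
CHIPS = {
    "V100 (GV100)": {"transistors_bn": 21.1, "tdp_w": 300},
    "A100 (GA100)": {"transistors_bn": 54.2, "tdp_w": 400},
}

def transistors_per_watt(chip):
    """Millions of transistors per rated watt."""
    return chip["transistors_bn"] * 1000 / chip["tdp_w"]

density = {name: transistors_per_watt(chip) for name, chip in CHIPS.items()}
gain = density["A100 (GA100)"] / density["V100 (GV100)"]

for name, value in density.items():
    print(f"{name}: {value:.1f}M transistors per watt")
print(f"Ampere vs. Volta: {gain:.2f}x")
```

That is close to 2x per rated watt, but it is a whole-chip average and says nothing about which units the extra transistors went to.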