“We’re using a lot of Nvidia hardware,” Musk said during the earnings call. “We’ll actually take Nvidia hardware as fast as Nvidia will deliver it to us. Tremendous respect for [CEO] Jensen [Huang] and Nvidia. They’ve done an incredible job.”
“If [Nvidia] could deliver us enough GPUs, we might not need Dojo, but they can’t,” Musk said.
Nvidia is now renting out its homegrown AI supercomputers with its newest GPUs in the cloud for those keen to access its hardware and software packages.
The DGX Cloud service will include its high-performance AI hardware, including the H100 and A100 GPUs, which are currently in short supply. Users will be able to rent the systems through Nvidia’s own cloud infrastructure or Oracle’s cloud service.
...
Tesla CEO Elon Musk last week spoke about shortages of Nvidia GPUs for Tesla’s existing AI hardware and said the company was waiting for supplies. Users can lock down access to Nvidia’s hardware and software on DGX Cloud, but at a hefty premium.
The DGX Cloud rentals include access to Nvidia’s cloud computers, each with H100 or A100 GPUs and 640GB of GPU memory, on which companies can run AI applications. Nvidia’s goal is to run its AI infrastructure like a factory — feed in data as raw material, and the output is usable information that companies can put to work. Customers do not have to worry about the software and hardware in the middle.
Pricing for DGX Cloud starts at $36,999 per instance per month.
That is about double the price of Microsoft Azure’s ND96asr instance with eight Nvidia A100 GPUs, 96 CPU cores, and 900GB of RAM, which costs $19,854 per month. Nvidia’s base price includes AI Enterprise software, which provides access to large language models and tools to develop AI applications.
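The "about double" comparison can be checked with a quick calculation. The sketch below uses only the two monthly prices quoted above; the variable and function names are illustrative, and the comparison ignores differences in what each price bundles (for example, Nvidia's AI Enterprise software).

```python
# Monthly list prices quoted in the article (USD).
DGX_CLOUD_MONTHLY_USD = 36_999
AZURE_ND96ASR_MONTHLY_USD = 19_854


def price_ratio(price_a: float, price_b: float) -> float:
    """Return how many times more expensive price_a is than price_b."""
    return price_a / price_b


ratio = price_ratio(DGX_CLOUD_MONTHLY_USD, AZURE_ND96ASR_MONTHLY_USD)
premium = DGX_CLOUD_MONTHLY_USD - AZURE_ND96ASR_MONTHLY_USD

print(f"DGX Cloud costs {ratio:.2f}x the Azure instance")
print(f"Monthly premium: ${premium:,}")
```

On these figures the ratio comes out slightly under 2x, so "about double" is a fair rounding.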
The rentals include a software interface called the Base Command Platform so companies can manage and monitor DGX Cloud training workloads. The Oracle Cloud has clusters of up to 512 Nvidia GPUs with a 200 gigabits-per-second RDMA network, and includes support for multiple file systems such as Lustre.
All major cloud providers have their own deployments of Nvidia’s H100 and A100 GPUs, which are different from DGX Cloud.
Google earlier this year announced the A3 supercomputer with 26,000 Nvidia H100 Hopper GPUs. Its setup resembles Nvidia’s DGX SuperPOD, which spans 127 DGX nodes, each equipped with eight H100 GPUs. Amazon’s AWS EC2 UltraClusters with P5 instances will be based on the H100.
With lock-down also comes lock-in: Nvidia is trying to get customers to use its proprietary AI hardware and software technologies based on its CUDA programming model. That could prove costly for companies in the long run, as they would pay for both software licenses and GPU time. Nvidia said investments in AI will benefit companies in the form of long-term operational savings.
The AI community is pushing open-source models and railing against proprietary models and tools, but Nvidia has a stranglehold on the AI hardware market. Nvidia is one of the few companies that can provide hardware and software stacks and services that make practical implementations of machine learning possible.
With respect to the 512-bit memory interface on the Ada-next flagship, I suppose that means one or more of the following: (1) GDDR7 won't be ready, and the next card would be bandwidth starved by a 384-bit bus using GDDR6x; (2) Nvidia now thinks it'll be cheaper to use a 512-bit bus instead of a massive hunk of SRAM as cache, greatly expanding die size, to get the effective bandwidth the card needs; (3) Nvidia is concerned that AMD will challenge at the high-end next generation and wants to push its own flagship as far as possible; (4) Ada-next (or at least the flagship) uses a chiplet approach that scales to 512-bit (not sure this last one makes technical sense). I'm sure I'm missing other obvious reasons.
(5) The rumour is nonsense.
As mentioned above, cache doesn't scale at all with N3 vs N5, so increasing cache is no longer a performance boost (i.e., because you sacrifice ALUs for cache on the die). Higher VRAM bandwidth is the way to go moving forward (along with the usual software tricks that Nvidia introduces in every generation, such as DLSS 3). I also want to point out that HBM is not out of the equation for the next-gen top SKU, with high-density HBM modules becoming available next year. Think about it: only 2 HBM modules would be needed to reach 2TB/s and 32GB, at a price somewhat "competitive" with 16 GDDR7 dies and a bigger, more complex PCB...

If you have no competition at the ultra high-end in gaming, and you're probably selling all the better yielding AD102 dies as RTX 6000 cards for $6,800 (at least according to this comparison between the 4090 and RTX 6000 at Tom's), I suppose it makes no sense to release a 4090 Ti.
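For anyone who wants to sanity-check the bandwidth numbers being thrown around in this thread, here is a minimal sketch. Peak memory bandwidth is just bus width in bytes times the per-pin data rate. The 21 Gbps GDDR6X rate is what the 4090 ships with; the 32 Gbps GDDR7 rate and the 8 Gbps HBM pin rate are assumptions on my part for illustration, not confirmed specs for any upcoming card. The sketch also assumes the usual 32-bit interface per GDDR die (so 16 dies make a 512-bit bus) and 1024 bits per HBM stack.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s = (bus width in bits / 8) * per-pin rate in Gbit/s."""
    return bus_width_bits / 8 * data_rate_gbps


# (bus width in bits, per-pin data rate in Gbit/s)
# Rates marked "assumed" are illustrative guesses, not announced specs.
configs = {
    "384-bit GDDR6X @ 21 Gbps (RTX 4090)": (384, 21.0),
    "512-bit GDDR6X @ 21 Gbps": (512, 21.0),
    "384-bit GDDR7 @ 32 Gbps (assumed)": (384, 32.0),
    "512-bit GDDR7 @ 32 Gbps (assumed, 16 dies x 32-bit)": (16 * 32, 32.0),
    "2 HBM stacks @ 8 Gbps (assumed, 1024-bit each)": (2 * 1024, 8.0),
}

results = {name: peak_bandwidth_gbs(*cfg) for name, cfg in configs.items()}

for name, gbs in results.items():
    print(f"{name}: {gbs:.0f} GB/s")
```

Under those assumptions, a 512-bit GDDR7 bus and two HBM stacks both land around 2 TB/s, which is the figure quoted above, while a 384-bit GDDR6X bus stays near 1 TB/s.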
Has there been so little progress with HBM that a 512-bit bus is preferable? It has been almost 10 years since Fury X and 8 since Vega 64.

Decent enough progress, but GDDR development has been going just as well, if not better. So it's probably been something of a stalemate in terms of cost and performance considerations for consumer parts. HBM continues to offer more bandwidth potential compared to a GDDR alternative, but always at a notably higher price. It's just an inherently more expensive setup.
Stacking opportunities exist now. CDNA3 is using L3/Infinity Cache chips underneath the core chips. This grants a lot of scope to further increase cache sizes without bloating overall die size. It's still a lot of total silicon, but we know better yields with smaller dies can beat one monstrous die in terms of cost. Plus, because SRAM isn't scaling with process anymore, you can continue to use an older process for such cache chips.