> I don't believe a 4090 costs 3-4x as much to manufacture as a 1080ti.
Speculation is fine, but saying something is so (margins are better) based on feelings doesn't make it true.
> I don't believe a 4090 costs 3-4x as much to manufacture as a 1080ti.
Believe all you want, doesn't mean anything until you actually provide some data to back that up.
> Believe all you want, doesn't mean anything until you actually provide some data to back that up.
Let's say the 1080 Ti cost Nvidia $350 per GPU. A 4090 would then have to cost $1,250 to have an equivalent profit margin. There is no way a 4090 costs Nvidia $1,250.
Also 1080Ti launched at $700+ and 4090 is $1600+. How do you get "3-4x" from that?
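For what it's worth, the two readings of "equivalent margin" give quite different numbers. Below is a quick sketch of the arithmetic using the figures from the posts above (the ~$350 build cost is the poster's assumption; $699 and $1,599 are the public launch MSRPs): the $1,250 figure corresponds to equal absolute profit per card, while an equal percentage margin would put the break-even cost closer to $800.

```python
# Arithmetic behind the numbers in the posts above. The ~$350 build cost is the
# poster's assumption; $699 / $1599 are the 1080 Ti and 4090 launch MSRPs.
price_1080ti, cost_1080ti = 699, 350
price_4090 = 1599

margin_1080ti = (price_1080ti - cost_1080ti) / price_1080ti   # ~50% gross margin
cost_same_margin = price_4090 * (1 - margin_1080ti)           # ~$800: same % margin
cost_same_profit = price_4090 - (price_1080ti - cost_1080ti)  # ~$1250: same $ profit per card

print(f"1080 Ti gross margin: {margin_1080ti:.0%}")
print(f"4090 cost for equal % margin: ${cost_same_margin:.0f}")
print(f"4090 cost for equal $ profit: ${cost_same_profit:.0f}")
```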
> is this a bubble? Or is this incredible revenue by nVidia for real and is it here to stay?
It is very obviously not all "a bubble" (because there are many products and services using the h/w already), but whether this level of demand for DC AI h/w will remain is a big unknown, and nobody can answer that with any degree of certainty.
> as expected ... AI comes first
Nvidia's HPC chips have always been the first to be announced in a generation.
Thanks for all the detail Arun, I really appreciate it!

> Compared to Ampere, you're still in the unfortunate situation where the data is in shared memory, and doing element-wise operations on it before the GEMM requires at a minimum that you do:
> - Asynchronous copy from global memory to shared memory (aka local memory in OpenCL, per-workgroup scratch) --> 1 RAM write to shared memory (also needed without prologue)
> - Read from shared memory and write to registers (1 RAM read from shared memory + 1 RAM write to registers).
> - Read from registers, do the element-wise operations, write to registers (1 RAM read from registers + 1 RAM write to registers).
> - Read from registers and write to shared memory (1 RAM read from registers + 1 RAM write to shared memory).
> - Read from shared memory, sending data to the tensor cores (1 RAM read from shared memory, also needed without prologue).
> So you've added a strict minimum of 6 RAM operations (2 on shared memory, 4 on registers), which is a lot less elegant/efficient than just streaming data straight from global memory to shared memory to the tensor cores (bypassing the register file completely). Given that the tensor core peak performance has doubled but shared memory hasn't, I suspect this will start hurting performance, although maybe it's OK if you only need to do this for 1 of the 2 input tensors, I'm not sure.

> Anyway, back to the original point - I think Warp Specialisation with a producer/consumer model is the really horrible thing to do for the general case in something like Triton, and implementing a prologue otherwise is probably OK-ish, it just prevents you from getting a lot of the benefit of Hopper tensor cores reading straight from shared memory etc... It's not clear how beneficial TMA is compared to just async loads if you're not using a producer-consumer model, as I haven't seen any code that actually even attempts to do that in practice, but I don't see why it wouldn't work in theory, it might just not provide as much benefit.

I guess I'm still not sure what makes adding (optional) prologue code at the front of the consumer codepath somehow particularly bad (compared to the alternative of running the prologue as a separate kernel with global memory reads/writes) or difficult.
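To make the trade-off in the quoted post concrete, here is a minimal Triton-style sketch (Triton is the framework named in the quote) of a block GEMM with an element-wise prologue fused onto the A tile. The kernel name, block sizes, `scale` parameter and `run` wrapper are illustrative assumptions, not code from the thread. The point is the marked element-wise step: because it runs on a tile held in registers, that tile can no longer be consumed straight out of shared memory by the tensor cores, which is the extra shared-memory/register traffic the quoted post is counting.

```python
# Minimal sketch: GEMM with a fused element-wise prologue (scale + ReLU) on the A tile.
# Assumes M, N, K are multiples of the block sizes; inputs fp32 (tf32 on tensor cores).
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_with_prologue(
    a_ptr, b_ptr, c_ptr, M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    scale,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # global -> (compiler-managed shared memory) -> registers
        b = tl.load(b_ptrs)
        # Fused element-wise prologue on the A tile: the tile now has to pass
        # through the register file before the dot, instead of being streamed
        # straight from shared memory into the tensor cores.
        a = tl.maximum(a * scale, 0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    tl.store(c_ptrs, acc)


def run(a: torch.Tensor, b: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (M // 64, N // 64)  # must match BLOCK_M / BLOCK_N below
    matmul_with_prologue[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
        scale, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

Whether the compiler can keep the other operand flowing straight from shared memory into the tensor cores while only the prologue'd tile takes the register detour is essentially the "maybe it's OK if you only need to do this for 1 of the 2 input tensors" question raised in the quote.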
Evidence mounts that lead times for Nvidia's H100 GPUs, commonly used in artificial intelligence (AI) and high-performance computing (HPC) applications, have shrunk significantly, from 8-11 months to just 3-4 months. As a result, some companies that had bought ample numbers of H100 80GB processors are now trying to offload them. It is also now much easier to rent H100 capacity from big cloud providers like Amazon Web Services, Google Cloud, and Microsoft Azure. Meanwhile, companies developing their own large language models still face supply challenges.
The easing of the AI processor shortage is partly due to cloud service providers (CSPs) like AWS making it easier to rent Nvidia's H100 GPUs. For example, AWS has introduced a new service allowing customers to schedule GPU rentals for shorter periods, addressing previous issues with availability and location of chips. This has led to a reduction in demand and wait times for AI chips, the report claims.
...
The increased availability of Nvidia's AI processors has also led to a shift in buyer behavior. Companies are becoming more price-conscious and selective in their purchases or rentals, looking for smaller GPU clusters and focusing on the economic viability of their businesses.
The AI sector's growth is no longer as hampered by chip supply constraints as it was last year. Alternatives to Nvidia's processors, such as those from AMD or AWS, are gaining performance and software support. This, combined with more cautious spending on AI processors, could lead to a more balanced situation in the market.
Meanwhile, demand for AI chips remains strong, and as LLMs get larger, more compute performance is needed, which is why OpenAI's Sam Altman is reportedly trying to raise substantial capital to build additional fabs to produce AI processors.
> Yeah, competition is doing a smear campaign and a guy, fired from another competitor, shares his opinion even after his last company paid developers and publishers to not implement DLSS.
It is not out of the question that some companies might attempt to extract price/delivery concessions by pursuing multiple opportunities at the same time.
BTW: I find it strange that any company is doing business with nVidia without having some kind of shipment date...
https://forum.beyond3d.com/threads/nvidia-discussion-2024.63466/post-2330473

Former AMD GPU head accuses Nvidia of being a 'GPU cartel' in response to reports of retaliatory shipment delays
Follows accusations from Groq that customers tiptoe around Nvidia to talk GPUs with others.
www.tomshardware.com