I don't know the exact situation with large models, but my experience in other fields tells me that this alone is probably already very hard, especially when a field is still in the experimental stage.
If a field is mature enough, most people will be using a few well-developed software packages, so problems with the underlying toolkits matter less: the vendor can just work with the developers of the best and most commonly used packages and solve most of the problems there. But when a field is still growing, the software landscape is more fragmented, and some people may even have to develop their own software. In that case, people are more likely to step on new minefields every day, so the stability of the underlying toolkits becomes very important. It's not just about bugs; the toolkit also needs to behave predictably, with as few surprises as possible.
I think CUDA benefits from the fact that it was developed a long time ago, with a lot of users, and so had time to become very mature and stable. That let it handle the AI mania relatively easily when it arrived. Many other AI hardware vendors, on the other hand, only entered the market after the mania started, so their toolkits haven't had time to mature.
From what I've heard, Google's TPUs also have a relatively good software stack, which has likewise been developed for quite a long time, although they don't have a lot of users. In AMD's case, it's unfortunate that they didn't invest enough resources in this when GPGPU first became a thing. Their main tools were all developed relatively recently, so the rough edges are to be expected. One day, when the AI software market matures, I think AMD will be able to catch up, but of course by then the profit margins won't be so pretty anymore.