NPU (Neural Processing Unit). AI Hardware Acceleration discussion.

Well, I decided to start this thread to share news on NPUs, plus some information about what an NPU is, the best NPUs on the market, and so on. It's still a new concept here in mid-2024, but it looks like an unstoppable shift that hints at what's coming.

Will they replace CPUs and GPUs? They won't, but they are meant to complement them.

There are some articles explaining what an NPU is and what its benefits are.





[Image: What Is An NPU And Why Do You Need One, Everything Explained]


Benefits of an NPU:

Energy efficiency: The specialized design of NPUs makes them more energy-efficient than using CPUs or GPUs for AI tasks, which is important for battery-powered devices and data centers.

Cost-effectiveness: While individual NPUs may be expensive, their high performance and efficiency can translate to overall cost savings when compared to using large numbers of CPUs or GPUs for the same workload.

Performance boost: NPUs can significantly accelerate AI workloads, leading to faster processing and real-time results. This is crucial for applications like autonomous vehicles, image recognition, and natural language processing.

AI for games: NPUs could improve game AI and make NPCs more believable. For example, after you finish an important conversation with an NPC and later return to the area, instead of the NPC saying nothing while you walk or run around in front of them, they could react with lines like "Hi again!" or "What are you doing back here after what we talked about?". It would also be great for games that try to learn how you play (Drivatars, ranking systems, and so on).

Use in apps: Some apps would benefit from using an NPU; see the sketch just below for what that can look like in practice.

 
What I find amusing about these NPUs so far is that there's only one corporation out there that even realizes their potential beyond just monolithic AI applications. Everyone else is seemingly only interested in providing non-programmable interfaces ...
 
To be fair, it's probably still too early to provide a programmable interface to the public. Once you do that, it'll be frozen. It'd be much harder to change and will cost much more to fix mistakes.
Even if one looks at CUDA, it has changed a lot over the years. NVIDIA didn't really commit that much to backward compatibility in its early years, but now they can't afford not to. On the other hand, CUDA is mature enough that most early mistakes have already been ironed out. The same probably can't be said for most NPUs out there.

If what you want is at the API level then maybe something like CoreML is good enough?
 
To be fair, it's probably still too early to provide a programmable interface to the public. Once you do that, it'll be frozen. It'd be much harder to change and will cost much more to fix mistakes.
Even if one looks at CUDA, it has changed a lot over the years. NVIDIA didn't really commit that much to backward compatibility in its early years, but now they can't afford not to. On the other hand, CUDA is mature enough that most early mistakes have already been ironed out. The same probably can't be said for most NPUs out there.
If they want to keep their ISAs closed to the public, that's fine, but that doesn't mean they can't provide some intermediate representation to keep backwards compatibility and offer a consistent programming interface. The sad part about a lot of these NPUs is that their designs aren't even necessarily unique in terms of architecture; they too are based on prior art, with some of them featuring a spatial dataflow architecture that resembles the Cell processor's NUMA system, and a good number of their ISAs use VLIW instruction encodings as well ...

Mind you, these aren't processors made by amateur architects either. We're talking about companies with decades of experience releasing all sorts of compute architectures (CPU/GPU/DSP/FPGA/etc.) who absolutely know how to write compilers for them! We should start the NPU programming revolution sooner rather than later, because we're all reaching the end of hardware advancements, and holding out because they might somehow converge on a standardized ISA is absurd when we look carefully at what's happened in other hardware domains ...
If what you want is at the API level then maybe something like CoreML is good enough?
CoreML is not good enough to do high-level arbitrary code compilation and execution ...
 
Personally I think NPUs are probably not designed for "high-level arbitrary code" yet. If that's your requirement, I guess GPUs are better.
The thing is that we are still exploring what hardware is best suited for running LLMs. Maybe a slightly better way to put it is "for running neural networks". Before this, most NPUs were basically tensor processors, and if that's all you need, then I agree they are mature enough to have a public programming interface. However, after the success of LLMs it's not that clear anymore. There are new research advances every day. Even as late as last year some people still regarded FP8 as insane, and now FP4 seems to be completely acceptable for LLMs. It's quite possible that we'll have something even better (at performance per watt) next year.
So the problem here is that, say, if the NPU exposes FP8 and maybe next year no one is using FP8 anymore, it will be much harder to get rid of FP8 if there's a public interface, just like it's almost impossible for NVIDIA to get rid of FP8 in CUDA. (Here I'm just using FP8 as an example; I'm not actually suggesting that FP8 will become obsolete.)

Of course, NPUs are not just for LLMs. However, it seems that is going to be the major application, and NPU designers are optimizing for LLMs. After all, most other AI tasks (image classification, generation, improvement, etc.) don't really need that much processing power.
 
Personally I think NPUs are probably not designed for "high-level arbitrary code" yet. If that's your requirement, I guess GPUs are better.
The thing is that we are still exploring what hardware is best suited for running LLMs. Maybe a slightly better way to put it is "for running neural networks". Before this, most NPUs were basically tensor processors, and if that's all you need, then I agree they are mature enough to have a public programming interface. However, after the success of LLMs it's not that clear anymore. There are new research advances every day. Even as late as last year some people still regarded FP8 as insane, and now FP4 seems to be completely acceptable for LLMs. It's quite possible that we'll have something even better (at performance per watt) next year.
So the problem here is that, say, if the NPU exposes FP8 and maybe next year no one is using FP8 anymore, it will be much harder to get rid of FP8 if there's a public interface, just like it's almost impossible for NVIDIA to get rid of FP8 in CUDA. (Here I'm just using FP8 as an example; I'm not actually suggesting that FP8 will become obsolete.)

Of course, NPUs are not just for LLMs. However, it seems that is going to be the major application, and NPU designers are optimizing for LLMs. After all, most other AI tasks (image classification, generation, improvement, etc.) don't really need that much processing power.
Well, if hardware designers want to expose these extensions implicitly, then we have examples such as HLSL's minimum-precision float intrinsics, where capable hardware can opt in to lower-precision math ...

Even if the main application for NPUs is LLMs, that still isn't really an excuse to restrict capabilities that consumers paid for in a general computing device. If NPU designs decide to go in "radically different" directions, such as a completely fixed-function system, then it's perfectly fine to deprecate any programming interface from that point onwards; or, if the architects want to integrate a richer set of programming features, then extending their complementary interfaces is fine. If hardware vendors want to diverge either way, well, that's too bad for the multi-vendor standards proponents out there ...
 
Well, if hardware designers want to expose these extensions implicitly, then we have examples such as HLSL's minimum-precision float intrinsics, where capable hardware can opt in to lower-precision math ...

Even if the main application for NPUs is LLMs, that still isn't really an excuse to restrict capabilities that consumers paid for in a general computing device. If NPU designs decide to go in "radically different" directions, such as a completely fixed-function system, then it's perfectly fine to deprecate any programming interface from that point onwards; or, if the architects want to integrate a richer set of programming features, then extending their complementary interfaces is fine. If hardware vendors want to diverge either way, well, that's too bad for the multi-vendor standards proponents out there ...

It's easier said than done, though. Ironically, it's sometimes easier for less popular hardware to be more open, because people won't be as upset if backward compatibility is broken; there simply aren't enough users. However, if you are aiming for something in the range of millions of users, it can be quite problematic. You can't just say "my next NPU will no longer support this function", because there will be people complaining.

There is also the issue of the programming model. The model CUDA uses (and that OpenCL and others adopted) is one, but not everyone is using the same model. I'm not familiar enough with these NPUs to know much, but some might be similar to Intel's Xeon Phi, for example, or use an even more exotic model, something closer to an FPGA.

In an ideal world where no one complains about anything, I agree it's better for companies to make everything open (they can label it "experimental" or something) so people can freely do interesting things with it. Some people might even find unexpected applications; it's not unlike how the first programmable GPUs led to CUDA. Unfortunately, we are not in an ideal world, and I understand why most companies don't want to do that. But given the examples already mentioned, maybe there are going to be some changes.
 
It's easier said than done, though. Ironically, it's sometimes easier for less popular hardware to be more open, because people won't be as upset if backward compatibility is broken; there simply aren't enough users. However, if you are aiming for something in the range of millions of users, it can be quite problematic. You can't just say "my next NPU will no longer support this function", because there will be people complaining.
Just guard your hardware-design flexibility philosophy behind an implicit programming interface if compatibility is such a concern to you. We invented high-level languages and APIs to abstract over a zoo of different hardware designs over the years. If we want more explicit programming interfaces, we can wait until hardware vendors feel more comfortable exposing them later on. Some programmability is better than no programmability at all!
There is also the issue of the programming model. The model CUDA uses (and that OpenCL and others adopted) is one, but not everyone is using the same model. I'm not familiar enough with these NPUs to know much, but some might be similar to Intel's Xeon Phi, for example, or use an even more exotic model, something closer to an FPGA.
This problem isn't limited to just NPUs; it exists everywhere across our entire landscape of computing! I've long since given up on the idea that differing hardware designs will ever converge on a standard, even within their own domains. Nobody sane is seriously considering abstracting a common programming interface between Qualcomm's Hexagon (a CPU/DSP hybrid that can run operating systems) and AMD/Intel's own versions of the Cell processor, or even fixed-function hardware like NVDLA ...

We already accept having different CPU ISAs (x86/ARM/RISC-V), GPU programming languages (HLSL/GLSL/MSL/CUDA/etc.), and whatever associated proprietary, cross-incompatible FPGA tools (Vivado/Quartus Prime) there are, so what specifically is the problem with another hardware ecosystem having its own exotic programming models?
 
We already accept having different CPU ISAs (x86/ARM/RISC-V), GPU programming languages (HLSL/GLSL/MSL/CUDA/etc.), and whatever associated proprietary, cross-incompatible FPGA tools (Vivado/Quartus Prime) there are, so what specifically is the problem with another hardware ecosystem having its own exotic programming models?

The problem is not that it's not doable, it's that most companies don't find it worth doing. I'm not saying it's the ideal situation, but that's the reality.
Even if you look at something like CUDA, you can see it's not easy. NVIDIA was determined to do GPU computing, but that was not an obvious bet, nor was it guaranteed to succeed. CUDA was ridiculed. A lot of people said it wouldn't work, or would at best find some niche in the HPC market, etc. Without determination (and the fact that NVIDIA was able to survive long enough on the gaming market), CUDA could easily have died and become just a footnote in history. So I completely understand those companies' hesitation to publish their tools.
 
WOAH! Y'all are going as fast as technology!

So is the NPU like a GPU where you just download drivers or how will it integrate into the OS?

Please ELI5 as I'm confused about NPUs and AI in general but am trying hard to learn about it. Thanks for the thread btw, it was helpful up until the point where my brain got all confuzled.

I get that the NPU will power Copilot in Windows 11, but what else can it do, and can it do more than Windows 11?
 
WOAH! Y'all are going as fast as technology!

So is the NPU like a GPU where you just download drivers or how will it integrate into the OS?

Please ELI5 as I'm confused about NPUs and AI in general but am trying hard to learn about it. Thanks for the thread btw, it was helpful up until the point where my brain got all confuzled.

I get that the NPU will power Copilot in Windows 11, but what else can it do, and can it do more than Windows 11?
They'll show up as NPUs in Windows, separate processing units just like the CPU and GPU (in Task Manager etc.), and probably something similar in Linux, macOS and so on.
Programs need to be specifically written to take advantage of them; it's not some automatic "the OS sees code fit for the NPU and sends it to the NPU".
 
Mostly hype, but since it's there it can run some image filters and maybe some background indexing (file search and eventually Recall, or Apple Memory or whatever Apple ends up calling it, once everyone forgets they're supposed to hate it and realizes it's a pretty good idea). Not that the CPU and GPU couldn't do it, but it's there.
 
WOAH! Y'all are going as fast as technology!

So is the NPU like a GPU where you just download drivers or how will it integrate into the OS?

Please ELI5 as I'm confused about NPUs and AI in general but am trying hard to learn about it. Thanks for the thread btw, it was helpful up until the point where my brain got all confuzled.

I get that the NPU will power Copilot in Windows 11, but what else can it do, and can it do more than Windows 11?
What @Kaotik said. I am not an NPU expert, but I'd add that there are other uses. For real-time video or videoconferencing, for example, an NPU can do things like keeping your eyes looking at the screen, as seen from the other participants' perspective, while you're actually reading notes written on a piece of paper on your desk, and so on and so forth.

NPUs have the advantage of specialized hardware, and they add a new component to PCs that is going to free up the other processors.
 
PowerColor claims they can improve GPU performance by leveraging an external NPU


During this year's Computex, PowerColor showed a new technology prototype that could theoretically bring NPUs and GPUs together for better performance and power efficiency. Edge AI, as the technology is known, uses the specialized silicon inside an NPU to optimize how a discrete GPU uses power to render complex 3D graphics on screen. The results, the company claims, can be significant.

For its Computex demonstration, PowerColor wired an external NPU to an AMD graphics card. PowerColor's Edge AI was programmed to manage the GPU's power consumption through the NPU, a sort of "Eco mode" that the company claims can bring wattage down and push frame rates up. According to data provided by the graphics card maker, this NPU-based AI engine was able to reduce power consumption by 22 percent while gaming.

Edge AI was employed to manage GPU utilization in Cyberpunk 2077 and Final Fantasy XV, which drew 263 W and 338 W by default, respectively. The external NPU brought those numbers down to 205 W and 262 W, achieving better results than AMD's own power-saving mode.
 
Please ELI5 as I'm confused about NPUs and AI in general but am trying hard to learn about it. Thanks for the thread btw, it was helpful up until the point where my brain got all confuzled.

I get that the NPU will power Copilot in Windows 11, but what else can it do, and can it do more than Windows 11?

Unfortunately we don't know much about many of the NPUs out there because most of the data isn't public (Apple's Neural Engine, for example). However, judging from the public APIs (such as CoreML), it seems that most of them are tensor processors (something doing matrix multiplications), not unlike Google's TPU or NVIDIA's tensor cores, although they tend to work at lower precision (FP16 or even lower).

So one may ask why we should have an NPU if it basically does what the GPU does. The answer is that some SoCs don't have a very fast GPU (because they don't really need one), or they have an NPU that's faster than the GPU (as with most "A" series Apple processors). An NPU also tends to be more energy efficient than a GPU.

So what can an NPU accelerate? Basically, they are designed to run neural networks efficiently (hence the name). Early CoreML actually only let you load neural network models and run them. So, basically, many things that can be done with a neural network model: handwriting recognition, noise removal (both audio and video), or more recently generative AI, such as LLMs like ChatGPT and image generation. You can use a GPU to run these, but using an NPU can be more efficient and also lets you have something always running in the background. For example, if you use AI for something like "OK Google" or "Hey Siri", you probably want it to be always available even if the GPU is busy running a 3D game.
 
The problem with real-time language models or audio is that those models aren't really compute dense. A small percentage of GPU time is plenty, or it would max out memory bandwidth anyway. Only with images and offline processing does the NPU add anything useful.
 