NPU (Neural Processing Unit). AI Hardware Acceleration discussion.

How would NPUs improve LLMs?

Would they improve the performance of LLM services like ChatGPT or CoPilot, which do all their processing in the cloud?

Or would they be for local LLMs? In which case, what's the point, unless maybe you're trying to develop specialized LLMs?
 
When you are running something like ChatGPT inference at scale in a data center, power consumption is critical. And as good as the B100 is as a general-purpose GPU that has to handle both training and inference, it won't be able to compete in tokens/watt with a dedicated inference NPU. There's a reason companies like MS, Meta, and Amazon are developing their own inference chips for the cloud.
 
NVIDIA is expensive and overengineered for inference. Efficiency-wise it probably doesn't matter much: HBM is pretty good in energy per bit, and the tensor cores don't differ much from the systolic array in an NPU.

Models which allow extremely low-power inference aren't really popular yet, so NPUs still need to be able to handle single-precision floating point.

It's all about privacy.
 
But you're referring to NPUs in the data center.

What about the NPUs that companies are touting on these ARM SoC laptops and tablets? How do they enhance LLM performance on the client machines?
 
They're a better fit for SLMs (small language models). At least some of Windows' new AI features use SLMs rather than LLMs, and they run locally. Pretty sure anything running locally on any low-power platform is using SLMs, tbh.
 
In a mobile device, power and chip area are still king. It's not feasible to run LLM inference unless the weights are fully stored in RAM, so you need them quantized as small as possible, and maybe even dedicated compression on top of that to minimize bandwidth from DRAM to the internal memory. If you have to fetch weights from flash, the user experience will be terrible.

And you also want to compute in the native format (int8/int4/int2 for weights). Most NPUs mainly have int8 inference capability (with some sort of slow FP16 path for backwards compatibility).

Old presentation, but still somewhat relevant; slide 9: https://www.qualcomm.com/content/da...g_power-efficient_ai_through_quantization.pdf

We did a similar study recently, and the numbers for int8 vs fp8/fp16 were still on the order of 2-4x in favor of int.
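
To see what that int8 path actually buys you, here's a minimal sketch of symmetric per-tensor weight quantization in NumPy. It's illustrative only (real NPU toolchains use per-channel scales, calibration data, outlier handling, etc.); the memory factor is 4x vs fp32, 2x vs fp16.

```python
# Minimal sketch: symmetric per-tensor int8 weight quantization (illustrative only).
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map fp32 weights to int8 plus a single fp32 scale factor."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes / q.nbytes)                      # 4.0: int8 is 4x smaller than fp32 (2x vs fp16)
print(np.abs(w - dequantize(q, scale)).max())   # worst-case rounding error for this tensor
```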
 
Int8 is on the order of 2-4x faster on the H100 as well (2x to be precise), so that's not a real difference.

Obviously NPUs don't have the memory to handle large models, but architecture-wise and efficiency-wise they're similar.
 
System memory is not fast enough to keep an NPU busy with an LLM.

Actually, no memory is fast enough to keep an NPU/GPU busy with a language model on a single query. In the server farms, each chip handles big batches of queries and can share the parameters retrieved from memory between them. This shifts the workload from memory-bound to compute-bound.

Embedding data as neural memories (for instance by tokenization, positional encoding and key/value transformation) could conceivably be done with large batches too, so there the NPU could actually flex some muscle: effectively building an index for an LM to be able to answer questions about your personal data.
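
To put rough numbers on the bandwidth/batching point, here's a back-of-the-envelope sketch. The bandwidth figures are assumed, illustrative values, not measurements.

```python
# Rough model: at batch size 1, every generated token has to stream (roughly)
# all of the weights from memory, so the token rate is capped by bandwidth.
# Batching shares one weight read across many queries, shifting the load
# toward compute. Bandwidth numbers below are assumptions for illustration.

def max_tokens_per_sec(model_bytes: float, bandwidth: float, batch_size: int = 1) -> float:
    """Upper bound on tokens/s if each step needs one full pass over the weights."""
    passes_per_sec = bandwidth / model_bytes
    return passes_per_sec * batch_size

model_7b_int8 = 7e9          # 7B parameters at 1 byte each
laptop_dram   = 120e9        # ~120 GB/s shared system memory (assumed)
hbm           = 3e12         # ~3 TB/s HBM on a data-center GPU (assumed)

print(max_tokens_per_sec(model_7b_int8, laptop_dram))          # ~17 tokens/s at batch 1
print(max_tokens_per_sec(model_7b_int8, hbm))                  # ~430 tokens/s at batch 1
print(max_tokens_per_sec(model_7b_int8, hbm, batch_size=64))   # batching amortizes the weight reads
```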
 
I understand the idea of SLMs for mobile devices.

But would they have the same appeal or hype as external services like ChatGPT or CoPilot?

Would people be content to use these local AI features, or would they mostly use the commercial services which have all the visibility and attention?

I know one of the arguments for SLMs is privacy, but if LLMs offer certain features that you can't do locally, you wonder whether NPUs on the device will get orphaned.
 
Would you upload your entire digital life to the cloud for search?

Anyway, they can still run image filters :)
 
This depends on how well "SLMs" perform compared to ChatGPT or other very large models.
The big question is: is it possible to make a relatively small model which might not be as knowledgeable but still retains reasonable ability for, say, reasoning? Microsoft has been trying to do that with Phi-3-mini, which only has 3.8 billion parameters.
One can argue that a local LLM (or "SLM") does not really need to be able to pass the bar exam. It's probably more useful if it knows more about the OS you're using and can help you change some settings or fix some problems than if it knows some specific detail of some obscure law.
The problem is we don't know how badly an LLM's reasoning ability is affected by its size. However, there is research on improving it, for example running multiple rounds with automatic feedback, which may give a smaller LLM better reasoning capability.
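
For reference, a minimal sketch of what "running an SLM locally" can look like, assuming the Hugging Face transformers library and the microsoft/Phi-3-mini-4k-instruct checkpoint mentioned above; a recent transformers release is assumed, and the prompt and settings are just illustrative.

```python
# Sketch only: load Phi-3-mini locally and generate a short answer. On a laptop
# NPU, a quantized runtime (llama.cpp, ONNX Runtime, vendor SDKs) would be the
# more realistic path; this is the plain CPU/GPU route for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"   # ~3.8B parameters
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "How do I change the default browser in Windows 11?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```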
 
Is it really reasoning, or is it recognizing word patterns and statistically determining what the most likely following words are?

I mean, there's hype about LLMs achieving AGI, but there are a lot of skeptics who doubt LLMs will ever be more than a believably human-like chatbot.

In any event, the big tech stocks right now are going crazy, so at least the markets must believe LLMs will eventually perform all kinds of wonders. They're pouring billions into chasing ... something.
 
If you look at a multi-layer neural network, it's possible it has some kind of reasoning ability. That is, if you consider something like an expert system to have some reasoning ability.
For example, there's a game called 20Q: it asks questions and in most cases figures out what's on your mind within 20 questions. It's basically a statistical network, but it looks like it's doing reasoning, and it can be argued that there is some kind of reasoning inside (e.g. if you said what's on your mind is an animal, then it's probably not a rock). 20Q is small enough to fit inside a children's toy.
A multi-layer neural network actually works similarly to these systems. The non-linear part can be seen as a decision switch, and many decisions vote for the next weight (the next layer). I think it's possible to manually "program" something with a multi-layer neural network; it's just that there's no point doing it this way, because what people love about neural networks is that they can "learn" automatically via backpropagation.
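
To make the "decision switch + vote" picture concrete, here's a tiny hand-wired two-layer ReLU network that computes XOR, with no training involved; it's purely illustrative of manually "programming" a network.

```python
# Hand-"programmed" two-layer ReLU network computing XOR (illustrative only).
# The ReLU units act as on/off decision switches; the output layer is a
# linear "vote" over those decisions.
import numpy as np

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])    # hidden 1 ~ "x1 OR x2", hidden 2 ~ "x1 AND x2"
W2 = np.array([1.0, -2.0])    # output = OR - 2*AND  ->  XOR
b2 = 0.0

def forward(x: np.ndarray) -> float:
    h = np.maximum(0.0, x @ W1 + b1)   # non-linearity = decision switch
    return float(h @ W2 + b2)          # linear vote over the decisions

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(np.array(x, dtype=float)))   # prints 0, 1, 1, 0
```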
Today's LLMs give you answers immediately, so it's akin to people always answering with the first thing in their minds. The research on multiple rounds is trying to remedy this, mimicking people's thought process (an "internal monologue"). Some believe it'll give LLMs much better reasoning power.
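
A sketch of what "multiple rounds with automatic feedback" could look like as a simple loop; generate() here is a hypothetical placeholder for whatever model call you use, not a real API, and the prompts are illustrative.

```python
# Sketch of a draft -> critique -> revise loop (an "internal monologue").
# `generate()` is a hypothetical stand-in for a call into a local or remote model.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug your model call in here")

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        critique = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            "List any mistakes or missing steps in the draft:"
        )
        answer = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Critique: {critique}\nWrite an improved answer:"
        )
    return answer
```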
 