LLMs may force Apple to integrate more RAM into A-series SOCs

wco81

Read an article this morning about Apple possibly looking for workarounds to run LLMs locally without embedding a lot of RAM in their next SOCs.

iPhone 15 Pro Max has 8 GB of RAM. Non Pro models have 6 GB.

In comparison, Pixel 8 Pro may have 12 GB and the new Pixel 9 phones may have some SKUs with 16 GB of RAM in their next Tensor SOC.

Apple SOCs for iPhones have remained at the top of mobile device performance despite lower RAM. But LLMs may demand more RAM.

So the article referenced some research paper about ways to cache parts of the LLMs in flash memory, so that Apple devices with less RAM can get around the high RAM requirements of LLMs.

Turns out this research paper isn't new; it was published back in December.

Apple’s AI researchers say they’ve made a huge breakthrough in their quest to deploy large language models (LLMs) on Apple devices like the iPhone without using a ton of memory. Researchers are instead bringing LLMs to the device using a new flash memory technique, MacRumors reports.

In a research paper called "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory," researchers note that in the world of mobile phones, flash storage is more prevalent than the RAM that is traditionally used to run LLMs. Their method works by employing windowing, which is reusing some of the data it has already processed to reduce the need for constant memory fetching, and by row-column bundling, which involves grouping data so it can be read faster from the flash memory.

According to the paper, this method allows AI models up to twice the size of the iPhone's available memory to run, with inference that the researchers claim is 4-5x faster on standard processors (CPUs) and 20-25x faster on graphics processors (GPUs) than naive loading approaches.

The researchers note: “This breakthrough is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility. We believe as LLMs continue to grow in size and complexity, approaches like this work will be essential for harnessing their full potential in a wide range of devices and applications.”
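To make the two techniques in the quoted summary a bit more concrete, here is a minimal Python sketch. It is not the paper's code: `bundle`, `WindowCache`, and the toy neuron IDs are hypothetical names made up for illustration, and real implementations operate on weight matrices in flash rather than Python lists.

```python
# Hypothetical sketch: names and sizes are illustrative, not from the paper's code.

def bundle(up_cols, down_rows):
    """Row-column bundling: store column i of the up projection next to
    row i of the down projection, so one contiguous read fetches both."""
    return [list(col) + list(row) for col, row in zip(up_cols, down_rows)]


class WindowCache:
    """Windowing: keep in RAM the neurons used by the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.history = []      # active-neuron sets for the most recent tokens
        self.resident = set()  # neurons currently held in RAM

    def step(self, active):
        """Return the neurons that must be read from flash for this token."""
        to_load = set(active) - self.resident
        self.history.append(set(active))
        if len(self.history) > self.window:
            self.history.pop(0)  # drop the oldest token from the window
        self.resident = set().union(*self.history)
        return to_load


cache = WindowCache(window=2)
for active in [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]:
    print(sorted(cache.step(active)))
```

Because consecutive tokens tend to reuse neurons, only the difference from the sliding window has to be fetched each step, and bundling means each fetched neuron is a single contiguous read.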


Might be a workaround for older devices, but Apple, which has gradually been increasing RAM in the A-series SOCs, may have to accelerate their schedule.

Of course this might imply higher costs for devices with more RAM in the SOCs, translating to increased prices.
 
The title is a bit misleading (should refer to LLMs) as Apple has been doing machine learning for many years on device.

The biggest component is their computational photography workflow.
 
Changed the title.

You're right, but of course it's LLMs powering all the hype right now, though image and video generation may become a bigger part of "AI."

Apple has been able to skimp on RAM and still do well with CPU-intensive functions like Portrait mode photos, Night photos, etc.

They helped move devices but now the market is demanding some kind of AI strategy. So it will be interesting to see if different types of AI applications lead to design changes or changes in how the silicon budget is deployed on Apple SOCs for mobile devices.

For the M chips for Macs and iPad Pros, looks like they have more room to integrate more RAM, not just for AI but other types of applications.

But iPhone is the big moneymaker and expectations are that AI will drive phone sales, which have been slowing down in the last year or two.
 
They also do FaceID, autocorrect, facial recognition, the integration with the ISP (as mentioned above) and a bunch of smaller tasks.

Apple has been doing on-device machine learning with dedicated hardware pretty much all the way back to the iPhone X.

They (Apple) have just released their own machine learning models, 8 in total, called OpenELM (Open-source Efficient Language Models) for running on device. You can take a gander at their transformer models on Hugging Face here.
 
Regarding LLMs, Microsoft recently released Phi-3 (still in beta), which aims to achieve the same quality with fewer parameters by using better training materials.

There is also research on the numerical precision of parameters. Llama-2 was released in FP16, but people found that quantizing it to 8 bits or even 4 bits still produces pretty good results. Now there is even 2-bit and 1-bit quantization, which lets people run models like the recent Llama-3 70B on a consumer GPU with 24GB of VRAM. The general idea seems to be that a model with higher precision but fewer parameters tends to perform worse than a model with lower precision but more parameters (e.g. a 13B model quantized to 4 bits tends to perform better than a 7B model quantized to 8 bits). Phi-3 quantized to 4 bits takes only ~2.2GB of memory, which should be fine even by current phone standards.

Furthermore, something running on a phone for general use doesn't need to be all-knowing. Llama-3 is pretty good: it speaks multiple languages and can produce reasonable answers to something like "Write a Go program solving n queen problem", but normally people don't need that. Maybe we'll see more "mini" models in the future like Phi-3 and Google's Gemma.
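For a rough feel for those numbers, here is a back-of-the-envelope calculation of weight storage only. It ignores the KV cache, activations, and per-block quantization metadata, so real footprints come out somewhat higher. The 3.8B parameter count for Phi-3 mini is taken from Microsoft's release rather than from this thread, and `weight_gb` is just a hypothetical helper name.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 10**9 bytes).

    Counts weights only: no KV cache, activations, or quantization metadata.
    """
    return params_billions * bits_per_weight / 8


models = [
    ("Llama-3 70B @ FP16", 70, 16),
    ("Llama-3 70B @ 2-bit", 70, 2),
    ("13B @ 4-bit", 13, 4),
    ("7B @ 8-bit", 7, 8),
    ("Phi-3 mini (3.8B, assumed) @ 4-bit", 3.8, 4),
]

for name, params, bits in models:
    print(f"{name}: {weight_gb(params, bits):.1f} GB")
```

On these assumptions the 13B model at 4 bits (6.5 GB) actually needs slightly less memory than the 7B model at 8 bits (7 GB), and the 70B model at 2 bits (17.5 GB) squeezes under 24 GB. Phi-3 at 4 bits comes to about 1.9 GB of raw weights, which is in line with the ~2.2GB figure once overhead is added.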
 
Some speculation on how Apple may deploy different AI apps on their devices.


There are ELMs or Efficient Language Models which would take up smaller footprints on devices.

There are also image-editing models which might run on mobile devices as well.

However, Apple may still consider trying to deploy LLMs from the big names, which might mean larger footprints.

However, even with all the model releases from Apple, the company reportedly reached out to Google and OpenAI to bring their models to Apple products.
 
Existing models are mostly designed for huge batch processing, which creates design choices that are extremely hostile to local LLMs.

Let's say you could use only a couple of percent of the parameters for a given query (computational sparsity). If you batch a couple of hundred queries, that stops mattering for memory bandwidth, because the batch as a whole needs all the parameters anyway. Meanwhile, if you try to leverage the sparsity, it becomes hard to use the dense batched matmuls lazy programmers like to use (they are all lazy, and compute is cheap compared to bandwidth).

That's why none of the publicly known large models have significant computational sparsity ... lazy programmers. Same with not using quantization-aware training (QAT): just lazy. Post-training quantization (PTQ) is a hack.

Now imagine a model designed expressly for local use, where not only the FFN has a ReLU bottleneck, but also the K/V/Q projections (SADMoE did this for instance, though for slightly different reasons). Suddenly the LLM in a Flash method from Apple can be used for all significant parameters. Add int4 or ternary quantization of parameters with quantization-aware training, plus int4 quantization of the KV cache. Suddenly you can just stream in parameters for a huge model.

This is likely what Apple will do (if they aren't, someone there tell your programmers to stop being lazy).
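The batching argument can be put into numbers with a simplified model. Assume, purely for illustration, that each query independently activates any given parameter with probability p. Real activations are correlated, so treat this as a rough estimate only; `fraction_touched` is a hypothetical helper name.

```python
def fraction_touched(p_active, batch_size):
    """Expected fraction of parameters needed by at least one query in a batch.

    Assumes each query activates a given parameter independently with
    probability p_active (an idealisation).
    """
    return 1.0 - (1.0 - p_active) ** batch_size


for batch in (1, 8, 64, 200):
    print(f"batch={batch:>3}: {fraction_touched(0.02, batch):.1%} of parameters touched")
```

With 2% sparsity per query, a batch of 200 ends up touching roughly 98% of the parameters under this model, so the bandwidth saving disappears, while a single local query still only needs about 2%.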
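The reason a ReLU bottleneck helps is that any neuron whose activation is zero contributes nothing to the output, so its weights never need to leave flash. A toy sketch with hand-picked weights follows. Here the active set is computed exactly, as an oracle; in practice (as in LLM in a Flash) a cheap predictor would guess it before the weights are loaded. All names and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn(x, up_cols, down_rows, neurons):
    """ReLU FFN restricted to the given neuron indices.

    up_cols[i] holds neuron i's input weights, down_rows[i] its output weights.
    """
    out = [0.0] * len(down_rows[0])
    for i in neurons:
        h = max(0.0, dot(up_cols[i], x))  # ReLU activation of neuron i
        out = [o + h * w for o, w in zip(out, down_rows[i])]
    return out


x = [1.0, -2.0]
up_cols = [[1.0, 1.0], [2.0, 0.5], [-1.0, 0.0], [0.0, -1.0]]
down_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Oracle active set: neurons with a positive pre-activation.
active = [i for i, col in enumerate(up_cols) if dot(col, x) > 0]

dense = ffn(x, up_cols, down_rows, range(len(up_cols)))
sparse = ffn(x, up_cols, down_rows, active)
print(active, dense, sparse)
```

Only half of the neurons fire in this toy case and the sparse result matches the dense one, which is the property that lets the inactive rows and columns stay on flash.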

The looming copyright deathtrap if the Supreme Court doesn't say copying for training is fair use is a far greater deterrent for Apple to commercially use their own models.
 
But even with some kind of sparsity, it's still quite possible that each token will take a different route, and it might be impossible to know which route it will take before actually computing it. At best this will need very fast random access from the flash memory. Considering that flash memory is a block device, it may not be able to save a lot of bandwidth.
 
But even with some kind of sparsity it's still quite possible that each token will take different route
With enough sparsity it stops mattering. If you look at "Exponentially Faster Language Modelling", the achievable sparsity without sacrificing performance might be very sparse indeed.

No one has really explored the optimal design for local, everything is bogged down by lazy batch design.
and it might be impossible to know which route it will take before actual computing.
A bad prediction just gives bad results, but it's always possible. It's only routing in the MoE sense if you consider every column/row of the up/down matrices an expert ... a valid but somewhat extreme interpretation (I'll name it "micro-experts"). Also, for a network trained from scratch, you'd train the predictor together with everything else, not afterwards. In that sense the original LLM in a Flash is a hack to work around people too lazy to do it right in the first place, same as PTQ vs QAT.
At best this will need very fast random access speed from the flash memory. Considering that flash memory are block devices it may not be able to save a lot of bandwidth.
With int4 and a feature dimension of 4096, you can fit one column/row in 4 kB, which works nicely with a 4K-native sector size.
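The arithmetic can be checked quickly. This sketch reads "one column/row" as one up-projection column bundled with its matching down-projection row, as in the row-column bundling from LLM in a Flash. That reading is an assumption, and `bundle_bytes` is a hypothetical helper.

```python
SECTOR_BYTES = 4096  # 4K-native sector size


def bundle_bytes(d_model, bits_per_weight, vectors_per_bundle=2):
    """Bytes needed to store `vectors_per_bundle` weight vectors of length d_model."""
    return d_model * bits_per_weight * vectors_per_bundle // 8


print(bundle_bytes(4096, 4))                        # int4 column + row bundle
print(bundle_bytes(4096, 4, vectors_per_bundle=1))  # a single int4 vector
print(bundle_bytes(4096, 16))                       # the same bundle in FP16
```

Under that reading an int4 bundle is exactly one 4096-byte sector, so each "micro-expert" costs one aligned read with no wasted bytes. A single int4 vector would only fill half a sector, and an FP16 bundle would span four.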
 
I suspect someone somewhere is working on a local-first LLM by now. Both the research and the SOTA commercially deployed models are closing in on what's needed for it to work. On the SOTA side, Deepseek and Aria are proving the viability of more extreme MoE. Deepseek also kicked off massive reductions in KV cache sizes through MLA, which the Chinese are steadily improving (see CLLA from Tencent).

For research, "Mixture of A Million Experts" tried out the single neuron FFN experts (I'd like to see them experiment with cross layer parameter/expert sharing on top of that). "Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference" explores expert cache replacement strategies for SSD offload (i.e. you don't always fetch all the top-k experts, but try to work with what's already cached as much as is reasonable). "Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference" runs the router one layer early, so compute and loading can be overlapped.

"Just throw more memory at it" is the wrong answer; put the effort into optimized local models with SSD offload. 4 bit native, millions of experts, expert FFNs for QKV projections, expert caching, CLLA attention, cross layer parameter sharing, pre-gating ... somebody get to it (it likely won't be NVIDIA, MoEs massively reduce both training and inference cost and there is some overlap between local and cloud optimal MoE).
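To illustrate the "work with what's already cached" idea, here is a toy cache-aware top-k selection. This is a hypothetical heuristic written for this post, not the policy from the cache-conditional experts paper: among experts whose router score is within `tolerance` of the k-th best score, it prefers the ones already in RAM.

```python
def pick_experts(scores, cached, k, tolerance):
    """Pick k experts, preferring cached ones that score nearly as well.

    scores:    dict mapping expert id to router score
    cached:    set of expert ids already resident in RAM
    tolerance: how far below the k-th best score a cached expert may fall
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    threshold = scores[ranked[k - 1]] - tolerance
    eligible = [e for e in ranked if scores[e] >= threshold]
    # Stable sort: cached experts first, original score order otherwise.
    preferred = sorted(eligible, key=lambda e: e not in cached)
    return preferred[:k]


scores = {"a": 0.9, "b": 0.8, "c": 0.75, "d": 0.1}
print(pick_experts(scores, cached={"c"}, k=2, tolerance=0.1))  # swaps b for cached c
print(pick_experts(scores, cached={"c"}, k=2, tolerance=0.0))  # plain top-k
```

A larger tolerance trades a little routing accuracy for fewer flash reads; with a tolerance of zero it falls back to ordinary top-k.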

PS. prefill needs a larger subset of parameters, but it can be done layer by layer so the total required memory can still be low.
 