Read an article this morning about Apple reportedly looking at workarounds to run LLMs locally without embedding a lot of RAM in their next SoCs.
The iPhone 15 Pro Max has 8 GB of RAM; non-Pro models have 6 GB.
In comparison, the Pixel 8 Pro has 12 GB, and some SKUs of the upcoming Pixel 9 phones may pair 16 GB of RAM with the next Tensor SoC.
Apple's iPhone SoCs have stayed at the top of mobile performance despite having less RAM, but LLMs may demand more of it.
The article referenced a research paper on caching parts of an LLM in flash memory, so that Apple devices with less RAM can get around LLMs' high memory requirements.
Turns out the paper isn't new; it was published back in December. From the article:
Apple’s AI researchers say they’ve made a huge breakthrough in their quest to deploy large language models (LLMs) on Apple devices like the iPhone without using a ton of memory. Researchers are instead bringing LLMs to the device using a new flash memory technique, MacRumors reports.
In a research paper called "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory," researchers note that in the world of mobile phones, flash storage is more prevalent than the RAM that is traditionally used to run LLMs. Their method works by employing windowing, which is reusing some of the data it has already processed to reduce the need for constant memory fetching, and by row-column bundling, which involves grouping data so it can be read faster from the flash memory.
According to the paper, this method allows AI models up to twice the size of the available memory to run on the iPhone, something the researchers claim is 4-5x faster than on standard processors (CPUs) and 20-25x faster on graphics processors (GPUs).
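To put "twice the size of the available memory" in perspective, here's a rough back-of-the-envelope calculation. The 7B model and fp16 storage are my own assumptions, not numbers from the paper:

```python
# Back-of-the-envelope only; model size and precision are my assumptions.
params = 7_000_000_000            # a mid-sized open LLM
bytes_per_weight = 2              # fp16
weights_gb = params * bytes_per_weight / 1e9
iphone_dram_gb = 8                # iPhone 15 Pro Max

print(f"weights: ~{weights_gb:.0f} GB")                      # ~14 GB
print(f"ratio to RAM: ~{weights_gb / iphone_dram_gb:.1f}x")  # ~1.8x
```

So even a mid-sized model's weights alone already outgrow the RAM in the current top-end iPhone, before counting the OS and the KV cache.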
The researchers note: “This breakthrough is particularly crucial for deploying advanced LLMs in resource-limited environments, thereby expanding their applicability and accessibility. We believe as LLMs continue to grow in size and complexity, approaches like this work will be essential for harnessing their full potential in a wide range of devices and applications.”
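For my own understanding, here is a toy sketch of how I picture windowing and row-column bundling fitting together. It's Python/NumPy under my own assumptions (a memory-mapped file standing in for flash, ReLU sparsity deciding which FFN neurons are "active", a naive cache for the window), not the paper's actual implementation:

```python
import numpy as np

# Toy illustration only -- not the paper's code. Sizes, the file name, and the
# eviction policy are my own placeholder choices.
D_MODEL, D_FF = 1024, 4096

# Row-column bundling: store row i of the FFN up-projection next to column i
# of the down-projection, so one contiguous "flash" read fetches both halves
# of neuron i. A memory-mapped file stands in for flash storage here.
bundles = np.memmap("ffn_bundles.bin", dtype=np.float16,
                    mode="w+", shape=(D_FF, 2 * D_MODEL))

dram_cache = {}   # neuron id -> bundle currently resident in RAM
WINDOW = 512      # keep roughly this many recently used neurons cached

def load_neurons(active_ids):
    """Windowing: read only neurons not already cached; drop the oldest."""
    for nid in active_ids:
        if nid not in dram_cache:
            dram_cache[nid] = np.array(bundles[nid])  # one read from "flash"
    while len(dram_cache) > WINDOW:                   # naive FIFO eviction
        dram_cache.pop(next(iter(dram_cache)))

def sparse_ffn(x, active_ids):
    """Run the FFN using only the active neurons' bundled weights."""
    load_neurons(active_ids)
    y = np.zeros(D_MODEL, dtype=np.float32)
    for nid in active_ids:
        w_up = dram_cache[nid][:D_MODEL]
        w_down = dram_cache[nid][D_MODEL:]
        h = max(0.0, float(x @ w_up))   # ReLU: inactive neurons contribute 0
        y += h * w_down
    return y

# Usage: pretend a predictor said neurons 0..63 will fire for this token.
x = np.random.randn(D_MODEL).astype(np.float32)
out = sparse_ffn(x, active_ids=range(64))
```

The point of both tricks is the same: touch flash as little as possible, and when you do, read big contiguous chunks instead of scattered small ones.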
This might be a workaround for older devices, but Apple, which has gradually been increasing the RAM paired with its A-series SoCs, may have to accelerate that schedule.
Of course, more RAM packaged with the SoCs would mean higher component costs, which could translate into increased prices.