Fine-Tuning Series: On-Device LLMs – How Google Leads and Why Apple Should Follow

By Axel Sjöberg
February 25, 2025

Introduction

As AI evolves at breakneck speed, a new trend is reshaping the mobile landscape: running Large Language Models (LLMs) and other foundational AI models locally on your smartphone. While cloud-based AI solutions still dominate, Google’s Gemini Nano models on Android illustrate just how transformative on-device LLMs can be for performance, privacy, and developer flexibility. Renowned for its privacy-first approach, Apple has nonetheless struggled to excel in AI, a fact clearly demonstrated by last year’s disappointing performance of Apple Intelligence. Now, with rumors swirling about Chinese iPhones potentially being powered by Alibaba’s Qwen, it seems the era of “AI in the OS” is finally on the horizon.

Google’s Gemini Nano on Android: A Proven Blueprint

Google has taken a big leap by integrating slimmed-down LLMs at the system level in Android through the AI Core framework. This approach doesn’t just enable local language inference on the latest Android devices; it also gives developers a standardized way to fine-tune these models for specialized tasks using LoRA (Low-Rank Adaptation).

Gemini Nano: Key Highlights

• Two Generations:

- Nano 1.0 debuted on the Pixel 8/8 Pro with a 1.5B-parameter model focused on text-based tasks.

- Nano 2.0 launched alongside the Pixel 9/9 Pro with 3.1B parameters, adding multimodal support for text, image, and audio.

• Growing Device Support:

- Although the Pixel lineup was the first to showcase Gemini Nano, as of early 2025, 14 premium Android devices officially meet the strict hardware requirements: Android 10+ with 20+ TOPS of NPU performance. These include flagship models from Samsung (e.g., the Galaxy S24), Xiaomi (the 14T Pro), and others that have rolled out AI Core integration.

AI Core: System-Level Model Management

The AI Core service lies at the heart of Android’s on-device AI strategy. Introduced in Android 14 and available on newer devices, it orchestrates every aspect of local inference. By constantly monitoring power budgets and thermal conditions, AI Core decides whether to route computations to the NPU, GPU, or CPU. It also manages multiple Gemini Nano variants, such as multilingual or code-generation models, ensuring that the appropriate one is always on hand for a given task.

From a developer’s perspective, the real power of AI Core is its dynamic approach to LoRA loading. Instead of shipping a full model for every specialized use case, you simply include an adapter file and load it at runtime, instantly combining it with the base Gemini Nano model for tasks such as medical diagnostics or advanced translation.

The LoRA adapters in Android come in two flavors:

• Static Adapters: Ship with the system and handle tasks like on-device translation or speech-to-text.

• Dynamic Adapters: Developer-supplied. You can fine-tune them in the cloud (using Google Vertex AI or a PyTorch pipeline, as sketched below) and deploy them as small, compressed files via the Play Store.

This flexible adapter loading not only keeps apps lightweight by avoiding multiple large model files, but also makes it far easier to roll out new features or domain-specific capabilities on the fly.
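
To make the dynamic-adapter path more concrete, below is a minimal cloud-side fine-tuning sketch using the Hugging Face transformers and peft libraries. The base model id, dataset file, target modules, and hyperparameters are all placeholder assumptions, and the exported adapter would still need to be converted into whatever format AI Core expects before it can be loaded on-device; the point is simply that the artifact you ship is a few megabytes of adapter weights rather than a multi-gigabyte model.

```python
# Minimal cloud-side LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The base model id, dataset file, target modules, and hyperparameters are all
# placeholder assumptions; the exported adapter would still need conversion into
# whatever format AI Core expects before it can be loaded on-device.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_id = "Qwen/Qwen2.5-0.5B"                      # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Attach low-rank adapters to the attention projections only.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                 # typically well under 1% of all weights

# Tiny domain dataset (placeholder): a JSONL file with a single "text" column.
dataset = load_dataset("json", data_files="medical_notes.jsonl")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Save ONLY the adapter weights: a few megabytes, versus gigabytes for the full model.
model.save_pretrained("medical-adapter")
```

From there, the compressed adapter can be shipped through the Play Store like any other asset and attached to the resident base model at runtime, exactly as the dynamic-adapter flow above describes.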

Why Apple Should Follow This Path

Although Apple hasn’t announced an official competitor to Gemini Nano, rumors are swirling about Apple’s collaboration with Alibaba to integrate Qwen models for devices sold in China. Combined with Apple’s MLX framework, this could give iPhones and iPads a system-level AI model that rivals or even surpasses Google’s approach on Android. 

Apple’s MLX: A Strong Foundation

• M-Series Macs: Apple silicon from the M1 through the M4, with its unified memory architecture, has proven it can run and even fine-tune larger models efficiently, delivering exceptional performance on tasks ranging from image recognition to language processing.

• A-Series iPhones: While less powerful than M-series, modern A-chips are still formidable. With MLX, Apple could offer developers a streamlined framework for running smaller LLMs and seamlessly integrating LoRA adapters. The biggest gap: Apple hasn’t yet made it trivially easy to do so.
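
As a rough illustration of what that developer experience could look like today, the open-source mlx-lm package already lets you load a quantized model together with a LoRA adapter on Apple silicon in just a few lines. This is a Mac-side sketch, not an official iOS API, and the model repo and adapter path below are placeholders.

```python
# Sketch: loading a quantized community model plus a LoRA adapter with the
# open-source mlx-lm package on Apple silicon. The model repo and adapter path
# are placeholders, and this is a Mac-side illustration, not an official iOS API.
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # placeholder 4-bit model repo
    adapter_path="adapters/legal-drafting",          # placeholder LoRA adapter directory
)

prompt = "Summarize the key obligations in the following clause:"
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```

The same package also exposes a LoRA training entry point (mlx_lm.lora), so in principle the whole adapter lifecycle, training on a Mac, exporting a small file, and loading it at runtime, can already stay inside Apple’s ecosystem.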

The Qwen Integration Rumors

If the rumors prove accurate, it’s easy to imagine Apple generalizing this strategy globally with an “Apple Intelligence” layer:

1. Qwen as a Base Model: Alibaba’s Qwen could serve as a foundational LLM for Apple, initially on devices tailored for the Chinese market. A successful trial there might lay the groundwork for a global rollout.

2. Envisioning an Apple Intelligence 2.0: Imagine an iOS environment with a pre-loaded LLM accessible via MLX or another API, coupled with a Universal Adapter Registry. This would let developers add domain-specific LoRA adapters for tasks like legal drafting, creative content generation, or enhanced voice assistance without having to ship memory-heavy models (a toy sketch of the idea appears below).

3. Privacy-First and On-Device: True to Apple’s core values, any such system would prioritize user privacy by ensuring that all model inference occurs locally, with data never leaving the device.

This renewed focus on integrated, on-device AI could finally address the shortcomings of earlier Apple Intelligence efforts, delivering both a robust developer platform and an enhanced user experience that truly leverages the power of modern hardware.
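
To be clear, nothing like a Universal Adapter Registry exists as a public Apple API today. The toy sketch below is purely hypothetical, every class and name in it is invented, but it shows how small the developer-facing surface of such a registry could be: the OS keeps one resident base model, and apps register only their tiny domain adapters.

```python
# Purely hypothetical sketch of a "Universal Adapter Registry". Every class and
# name below is invented for illustration; no such Apple API exists today.
from dataclasses import dataclass, field


@dataclass
class AdapterRegistry:
    """Maps a registered domain (e.g. 'legal.drafting') to a small LoRA adapter file."""
    adapters: dict[str, str] = field(default_factory=dict)

    def register(self, domain: str, adapter_path: str) -> None:
        self.adapters[domain] = adapter_path

    def resolve(self, domain: str) -> str | None:
        return self.adapters.get(domain)


# The OS would keep one resident base model loaded...
registry = AdapterRegistry()

# ...while apps register only their few-megabyte adapters.
registry.register("legal.drafting", "/adapters/legal_drafting.safetensors")
registry.register("voice.assistant", "/adapters/voice_plus.safetensors")

# At inference time the system resolves the adapter for the requested task and
# composes it with the base model (the composition step itself is omitted here).
print(registry.resolve("legal.drafting"))
```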

Benefits for Developers (and Users) on Both Platforms

1. Privacy & Security: On-device inference means sensitive data like health records and financial info stays off the cloud. Both Android (via AI Core sandboxing) and Apple (hopefully via MLX) can enforce strong app-level isolation.

2. Reduced Latency: Responses no longer require a round trip to the cloud, which is game-changing for AR apps, real-time translation, and advanced voice assistants.

3. Offline Functionality: Whether you’re on a plane or in a remote area, the AI doesn’t stop.

4. Cost Savings: Running inference on the device eliminates expensive cloud GPU usage. Smaller teams can deploy advanced AI apps without massive back-end bills.

5. Unified Model + LoRA Adapters: Instead of shipping multiple giant models for different tasks, developers maintain only small adapter files. Updating them is easier, and user storage remains manageable (see the back-of-the-envelope numbers below).
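
To put rough numbers on point 5, the calculation below uses illustrative values (not the actual specs of Gemini Nano or any shipping model) to compare a rank-8 LoRA adapter on the attention projections against the full base model it augments.

```python
# Back-of-the-envelope size comparison: full base model vs. a rank-8 LoRA adapter.
# All numbers are illustrative assumptions, not the specs of Gemini Nano or any real model.
hidden_dim = 2048        # model width (assumed)
num_layers = 24          # transformer layers (assumed)
rank = 8                 # LoRA rank
adapted_per_layer = 2    # adapting q_proj and v_proj only

# Each adapted d x d projection gains two low-rank matrices: d x r and r x d.
lora_params = num_layers * adapted_per_layer * (2 * hidden_dim * rank)
full_model_params = 2_000_000_000   # ~2B-parameter base model (assumed)

bytes_per_param = 2                 # fp16 / bf16 weights
print(f"LoRA adapter: {lora_params * bytes_per_param / 1e6:.1f} MB ({lora_params:,} params)")
print(f"Full model:   {full_model_params * bytes_per_param / 1e9:.1f} GB")
# -> roughly 3 MB for the adapter vs. about 4 GB for the full fp16 model
```

Even with generous assumptions the adapter lands in the low single-digit megabytes, while the fp16 base model weighs in at several gigabytes, which is exactly why shipping adapters instead of models keeps downloads and user storage manageable.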

Beyond Phones: Embedded and IoT Scenarios

Although the spotlight shines on Android phones and iPhones, this strategy extends to:

• Wearables: Compact, adaptable AI models, functioning as either lean language engines or multimodal platforms, could power advanced health diagnostics or offline voice control on smartwatches.

• Smart Home Devices: Local voice and vision processing for privacy-preserving home assistants.

• Automotive and Industrial: Real-time data analysis in vehicles or industrial equipment can be achieved without constant reliance on cloud servers, making these solutions ideal for remote environments like mines, maritime settings, or other isolated areas.

Challenges and the Road Ahead

As of 2025, only a select group of Android devices has the NPU capabilities required to support Gemini Nano, and Apple Intelligence faces a similar challenge: older iPhones without a sufficiently powerful chip and enough memory won’t receive system-level LLM features. However, as these older devices are gradually phased out and replaced with models that support advanced AI, this challenge will naturally diminish for both iOS and Android developers.

Conclusion

Google has laid out a compelling path forward with Gemini Nano and AI Core, proving that true on-device LLM inference is not only possible but also highly effective. Apple, with its strong commitment to privacy and ecosystem control, has every reason to adopt a similar approach, especially if it partners with foundation model providers like Alibaba and leverages its intuitive MLX framework for seamless LoRA adapter integration in iPhone applications.

For businesses and developers, the shift toward on-device AI opens doors to faster, more secure, and cost-effective apps. Whether you’re eyeing Android’s AI Core or preparing for a future Apple Intelligence ecosystem, now is the time to explore:

1. LoRA Fine-Tuning Pipelines: Adapt the base model to your niche domain.

2. Quantization/Compression: Optimize performance on mobile NPUs (a minimal sketch follows this list).

3. Enhanced User Experience: Offline capabilities and near-instant responses can transform user interactions with your app.

4. Unlocking New Use Cases: Enable innovative applications and functionalities that were previously unattainable with cloud-based AI alone.
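
As a starting point for item 2, here is a minimal post-training quantization sketch built on PyTorch’s dynamic quantization API, with facebook/opt-125m standing in for a small on-device model. A production mobile deployment would go further (for example, 4-bit weight-only quantization through the NPU vendor’s toolchain), but the core idea, shrinking weights with little or no retraining, is the same.

```python
# Minimal post-training quantization sketch using PyTorch dynamic quantization.
# facebook/opt-125m is only a small stand-in model; real mobile deployments would
# use the target runtime's or NPU vendor's own quantization tooling.
import os
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Replace fp32 Linear weights with int8 weights that are dequantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def checkpoint_size_mb(m: torch.nn.Module, path: str) -> float:
    """Serialize a model's weights and report the file size in megabytes."""
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32 checkpoint: {checkpoint_size_mb(model, 'fp32.pt'):.0f} MB")
print(f"int8 checkpoint: {checkpoint_size_mb(quantized, 'int8.pt'):.0f} MB")
# The quantized Linear weights shrink to roughly a quarter of their original size;
# embeddings and other non-Linear parameters stay in fp32.
```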

If you’re interested in bringing AI to the edge, across smartphones, tablets, embedded devices, or other platforms, reach out to us. We’re here to guide you through best practices in fine-tuning so you can confidently navigate the next wave of AI breakthroughs.
