
Apple Reportedly Distilling Google's Gemini Model to Run Siri On-Device

Apple is working to shrink Google's multi-trillion-parameter Gemini model to run on the iPhone, signalling a major push toward capable on-device AI — with implications for the local AI community.

Robson Pereira · May 30, 2026 · 5 min read



Apple is making a major bet on **on-device artificial intelligence**, working with Google to compress the multi-trillion-parameter Gemini model so it can run directly on an iPhone, according to a report from The Information published on May 28, 2026.

What is happening

Apple has long promised an AI-enhanced Siri. After multiple delays since the original 2024 announcement, the company is now reportedly finalising a deal with Google to power Siri with Gemini. But the headline is not the deal itself — it is the engineering challenge Apple has taken on: **running a state-of-the-art large language model entirely on a smartphone**.

The report, covered extensively by Ars Technica, describes Apple's efforts to use **model distillation**, a technique in which a smaller "student" model is trained to replicate the behaviour of a much larger "teacher" model. By distilling Gemini down to a size that fits within the constraints of the iPhone's Neural Engine and memory, Apple hopes to deliver real-time AI responses without sending every query to the cloud.
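To make the teacher/student idea concrete, here is a minimal sketch of the classic soft-target distillation loss, in which the student is penalised for diverging from the teacher's temperature-softened output distribution. The report does not say which distillation recipe Apple uses, so the temperature, the example logits, and the function names below are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplied by temperature**2, a common convention that keeps the
    loss scale comparable across different temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over a four-token vocabulary (not real model output).
teacher = [4.0, 1.0, 0.5, -2.0]
close_student = [3.5, 1.2, 0.4, -1.8]
far_student = [-2.0, 0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

A student whose predictions track the teacher's gets a loss near zero, while one that ranks tokens very differently is penalised heavily. In a real training loop this term is usually computed by a deep learning framework and often combined with an ordinary cross-entropy loss on ground-truth labels.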

Why this matters for the self-hosted AI community

The local AI community has been running smaller models on consumer hardware for years. Compact variants of Llama 3, Phi-3, Mistral, and Qwen all run comfortably on a MacBook or a modest PC. Apple's reported approach validates several key ideas:

1. **Distillation is the path to local AI**: The most capable models are too large to run on edge devices, but distillation techniques are advancing fast enough to narrow the gap. Runtimes such as llama.cpp and Ollama attack the same problem from the other side, making compact and quantised models practical to run on consumer hardware.

2. **On-device AI is the endgame**: If Apple, one of the world's most valuable companies, is betting its assistant on locally run AI, the direction is clear. The future of AI is not purely cloud-based, because privacy, latency, and offline capability matter.

3. **Hardware specialisation matters**: Apple's Neural Engine and unified memory architecture give it an advantage in running compressed models. The same principle applies to building a self-hosted AI rig: the right hardware makes a real difference.
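The memory side of the hardware argument comes down to simple arithmetic: the weights alone need roughly parameter count times bits per weight, divided by eight, in bytes. The sketch below runs that calculation for a few hypothetical model sizes and precisions. None of these figures come from the report, which does not state the size of Apple's distilled model.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical student sizes and precisions, chosen only for illustration.
for params in (1e9, 3e9, 8e9):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params / 1e9:.0f}B params @ {bits}-bit: {gb:.1f} GB")
```

This estimate covers weights only. The KV cache, activations, and the rest of the operating system all need memory too, so the practical ceiling on a phone or a modest PC is lower than the raw numbers suggest.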

Cloud fallback is inevitable

The report also notes that despite Apple's best efforts, a cloud component is likely unavoidable for the most demanding tasks. Even the most aggressively distilled model cannot match the full Gemini for complex reasoning, multimodal understanding, or up-to-date knowledge retrieval.

This hybrid approach, **run what you can locally; fall back to the cloud for the rest**, is exactly the pattern that many self-hosted AI practitioners recommend. It is the same philosophy behind "Design a Two-Tier AI Stack for Speed and Privacy", where routine queries stay local and complex tasks are routed to more capable infrastructure.
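One way to picture the local-first pattern is a small wrapper that always asks the local model first and escalates only when the local answer looks unreliable. This is a sketch under stated assumptions: the `Answer` type, the confidence threshold, and both stand-in backends are hypothetical, and nothing here reflects how Apple or any particular product implements its fallback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the backend (assumed)
    source: str


def answer_with_fallback(
    query: str,
    local_model: Callable[[str], Answer],
    cloud_model: Callable[[str], Answer],
    min_confidence: float = 0.7,
) -> Answer:
    """Try the local model first; escalate only when it is unsure."""
    local = local_model(query)
    if local.confidence >= min_confidence:
        return local
    return cloud_model(query)


# Stand-in backends for illustration only; real ones would call a model.
def fake_local(query: str) -> Answer:
    # Pretend the small model is only confident on short requests.
    confidence = 0.9 if len(query.split()) <= 8 else 0.4
    return Answer(f"local answer to: {query}", confidence, "local")


def fake_cloud(query: str) -> Answer:
    return Answer(f"cloud answer to: {query}", 0.95, "cloud")


short = answer_with_fallback("Set a timer for ten minutes", fake_local, fake_cloud)
long = answer_with_fallback(
    "Compare the tax implications of these three retirement accounts in detail",
    fake_local,
    fake_cloud,
)
print(short.source, long.source)
```

The design choice worth noting is that the query only leaves the device when the local path has already failed the confidence check, which keeps the privacy-preserving route as the default rather than the exception.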

What self-hosted AI builders can learn

**Distil your own models**: Fine-tuning tools like Unsloth and Axolotl can train a smaller model on outputs generated by a larger one, and llama.cpp can quantise and run the result. If Apple can distil Gemini, you can distil Llama 3.1 or Qwen 2.5 for your specific use case.

**Measure before you optimise**: Log a representative sample of queries and identify which genuinely need a large model and which a smaller distilled one handles just as well. A routing layer such as LiteLLM, or a local proxy, can then switch between models dynamically.
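As a rough illustration of the measurement step, the sketch below takes a log of queries that were answered by both a small and a large model and reports, per category, how often the two agreed. The categories, the sample data, and the 0.9 threshold are made up, and exact string matching is a deliberately crude stand-in for whatever quality metric suits your workload.

```python
from collections import defaultdict


def agreement_by_category(samples):
    """Return {category: fraction of samples where small matched large}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, small_output, large_output in samples:
        totals[category] += 1
        if small_output.strip().lower() == large_output.strip().lower():
            hits[category] += 1
    return {c: hits[c] / totals[c] for c in totals}


def categories_safe_for_local(samples, min_agreement=0.9):
    """List categories where the small model agrees often enough."""
    rates = agreement_by_category(samples)
    return sorted(c for c, rate in rates.items() if rate >= min_agreement)


# Made-up evaluation log: (category, small model output, large model output).
samples = [
    ("classification", "spam", "spam"),
    ("classification", "not spam", "Not spam"),
    ("extraction", "2026-05-28", "2026-05-28"),
    ("extraction", "2026-05-30", "2026-05-30"),
    ("reasoning", "42", "17"),
    ("reasoning", "yes", "yes"),
]

rates = agreement_by_category(samples)
local_ok = categories_safe_for_local(samples)
print(rates)
print("route locally:", local_ok)
```

Categories that clear the threshold are candidates for the small model, and everything else stays on the larger one until the distilled model improves.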

**Privacy as a feature**: Apple's emphasis on on-device processing mirrors the core value proposition of self-hosted AI, which is that your data stays on your hardware. This is a competitive advantage that cloud-only services cannot match.