1. Choose Your Inference Engine (GGUF vs. MLX)
On Device AI supports the two major local inference frameworks on Apple platforms, so you can pick the engine that best suits your hardware:
- GGUF (via llama.cpp): Offers broad model compatibility and runs on modern iOS, iPadOS, macOS, and visionOS devices. A good default for general open-weight models.
- MLX (Apple Silicon native): Apple's machine learning framework, built specifically for Apple Silicon. MLX takes full advantage of unified memory, offering efficient memory management and fast inference on Apple Silicon Macs.
2. Choose a Model Based on Device Memory
Because local inference is bounded by physical RAM (Unified Memory on Apple Silicon), On Device AI helps you choose models that fit your hardware configuration:
- For iPhones/iPads (6GB - 8GB RAM): Select compact, optimized models such as DeepSeek-R1 Distill 1.5B, Qwen 2.5 1.5B/3B, or Gemma 2 2B. These fit within mobile memory budgets without running into OS memory pressure limits.
- For iPads/Macs (8GB - 16GB RAM): Run more capable reasoning models such as Llama 3 8B or Qwen 2.5 7B. Phi-4 14B is better suited to the 16GB end of this range.
- For Pro Macs (24GB - 128GB Unified Memory): Run large reasoning models of up to 32B parameters, or 70B on higher-memory configurations, locally at high tokens-per-second and entirely offline.
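As a rough illustration of why the tiers above fall where they do, the sketch below estimates a quantized model's weight size and checks it against a device's memory. This is a rule of thumb, not the app's actual selection logic: the 4.5 bits-per-weight figure (a typical 4-bit quantization), the 1 GB runtime overhead, and the 70% usable-memory fraction are all assumptions chosen for illustration.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 weights * bits_per_weight / 8 bytes, divided by 1e9.
    """
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    params_billions: float,
    device_ram_gb: float,
    bits_per_weight: float = 4.5,   # assumption: roughly 4-bit quantization
    overhead_gb: float = 1.0,       # assumption: KV cache and runtime buffers
    usable_fraction: float = 0.7,   # assumption: share of RAM an app can use
) -> bool:
    """Return True if the model plausibly fits within the usable memory budget."""
    needed_gb = estimate_weights_gb(params_billions, bits_per_weight) + overhead_gb
    return needed_gb <= device_ram_gb * usable_fraction


if __name__ == "__main__":
    for name, size_b, ram_gb in [
        ("1.5B on a 6GB iPhone", 1.5, 6),
        ("14B on a 16GB Mac", 14, 16),
        ("70B on a 16GB Mac", 70, 16),
        ("70B on a 64GB Mac", 70, 64),
    ]:
        verdict = "fits" if fits_in_memory(size_b, ram_gb) else "does not fit"
        print(f"{name}: ~{estimate_weights_gb(size_b):.1f} GB of weights, {verdict}")
```

Actual requirements depend on the quantization format, context length, and how much memory the operating system grants the app, so treat the result as a first-pass filter only.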
3. Custom Hugging Face GGUF Imports
Want a model that isn't in the built-in library? On Device AI includes a custom downloader: copy a direct GGUF download link from a repository such as Hugging Face, paste it into the app's Import section, and download the model from within the app. Your custom model becomes immediately available in chat and subagent workflows.
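To show what a "direct" GGUF link means in practice, here is a minimal sketch of two sanity checks you could run yourself. It is not the app's own validation. It relies on two facts: Hugging Face serves raw files from `/resolve/` paths (while `/blob/` paths are web pages), and every GGUF file begins with the four ASCII bytes `GGUF`. The function names and the example URL are hypothetical.

```python
from urllib.parse import urlparse

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def looks_like_direct_gguf_link(url: str) -> bool:
    """Heuristic check that a URL points at a raw .gguf file, not a web page."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.path.lower().endswith(".gguf"):
        return False
    # On Hugging Face, /blob/ is the HTML viewer; /resolve/ is the file itself.
    if parsed.netloc == "huggingface.co" and "/resolve/" not in parsed.path:
        return False
    return True


def has_gguf_magic(first_bytes: bytes) -> bool:
    """Check the leading bytes of a downloaded file for the GGUF magic number."""
    return first_bytes[:4] == GGUF_MAGIC


if __name__ == "__main__":
    # Hypothetical repository and file name, for illustration only.
    base = "https://huggingface.co/example-org/example-model-GGUF"
    print(looks_like_direct_gguf_link(f"{base}/resolve/main/example-q4.gguf"))
    print(looks_like_direct_gguf_link(f"{base}/blob/main/example-q4.gguf"))
```

If a pasted link fails the first check, replacing `/blob/` with `/resolve/` in a Hugging Face URL usually yields the direct download address.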
4. Private, Performant, and Pure Native
Written purely in SwiftUI, On Device AI runs as a fully native app and avoids the overhead of Electron wrappers. Because models run directly on your device's Neural Engine and GPU cores, no text, conversations, or files ever leave your device.