Vision Models
Vision Language Models (VLMs) let you analyze images, screenshots, diagrams, and documents directly in your conversations. The AI can "see" and understand visual content.
Supported Models
On Device AI supports vision models across all of its inference backends:
- GGUF (llama.cpp): Vision-capable GGUF models with multimodal projection
- MLX: MLX-optimized vision models (e.g., Qwen3 VL)
- Cloud APIs: Vision-capable cloud models from OpenAI, Anthropic, Google, etc.
Vision models are identified by a [VLM] tag in the model picker.
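How an app might decide which models get the [VLM] tag can be sketched as follows. This is an illustrative assumption, not On Device AI's actual code: the names `ModelInfo`, `has_mmproj`, and `vision_flag` are hypothetical. For GGUF models, vision support typically depends on a multimodal projector (mmproj) being available; for MLX and cloud models, the capability is usually advertised in model metadata or the provider's API.

```python
# Hypothetical sketch: deciding whether to show the [VLM] tag.
# ModelInfo, has_mmproj, and vision_flag are illustrative names,
# not On Device AI's real API.
from dataclasses import dataclass

@dataclass
class ModelInfo:
    name: str
    engine: str                # "gguf", "mlx", or "cloud"
    has_mmproj: bool = False   # GGUF: multimodal projector present?
    vision_flag: bool = False  # MLX/cloud: capability from metadata/API

def is_vlm(model: ModelInfo) -> bool:
    """Return True if the model can accept images directly."""
    if model.engine == "gguf":
        return model.has_mmproj
    return model.vision_flag

print(is_vlm(ModelInfo("qwen3-vl", "mlx", vision_flag=True)))  # True
```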
Using Vision Models
1. Select a VLM model
Choose a vision-capable model from the model picker (look for the [VLM] tag).
2. Attach an image
Use the camera button, photo library, or paste an image into the chat.
3. Ask about the image
Type your question about the image. Examples: "What's in this image?", "Read the text in this screenshot", "Describe this diagram".
VLM vs OCR Processing
The app handles images differently depending on whether you're using a vision model:
- VLM models: The raw image is passed directly to the model for visual understanding. No OCR step needed — the model processes pixels directly.
- Non-VLM models: Images are processed with OCR (Optical Character Recognition) to extract text, which is then passed to the text-only model. The system uses a capacity-aware approach, safely truncating OCR text if it exceeds the model's context budget.
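The capacity-aware truncation for the non-VLM path can be sketched roughly as below. This is a minimal sketch under stated assumptions: the function name `fit_ocr_text`, the characters-per-token estimate, and the truncation marker are all illustrative, not the app's real implementation.

```python
# Illustrative sketch of capacity-aware OCR truncation for text-only models.
# fit_ocr_text, chars_per_token, and the marker text are assumptions.

def fit_ocr_text(ocr_text: str, context_budget_tokens: int,
                 chars_per_token: float = 4.0) -> str:
    """Truncate OCR output so it fits within the model's context budget.

    Uses a rough chars-per-token heuristic rather than a real tokenizer.
    """
    max_chars = int(context_budget_tokens * chars_per_token)
    if len(ocr_text) <= max_chars:
        return ocr_text
    # Keep the beginning, where document text usually starts, and mark the cut.
    return ocr_text[:max_chars].rstrip() + "\n[...OCR text truncated...]"
```

A real implementation would count tokens with the model's own tokenizer and reserve room for the system prompt and the user's question before truncating.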
Image Sharing and Ingestion
You can bring images into the app from outside the conversation:
- iOS: Use the system Share Sheet to send single or multiple images directly from Photos or other apps into a conversation.
- macOS/visionOS: Drag-and-drop images or paste them from your clipboard directly into the chat interface.
Shared images follow the same routing rules as in-chat attachments: if the active model supports vision, the image is passed to vision inference; otherwise it falls back to OCR.
For analyzing charts, diagrams, or complex visual layouts, use a VLM model. For simple text extraction from screenshots, either approach works well.
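The routing described above can be sketched as a single decision. This is a hedged illustration, not the app's source: `route_image` and the returned attachment shape are hypothetical, and `run_ocr` stands in for whatever OCR engine the platform provides.

```python
# Hypothetical sketch of per-image routing based on the active model.
# route_image, run_ocr, and the attachment dict shape are assumptions.
from typing import Callable

def route_image(model_is_vlm: bool, image_bytes: bytes,
                run_ocr: Callable[[bytes], str]) -> dict:
    """Route an incoming image: raw pixels for VLMs, OCR text otherwise."""
    if model_is_vlm:
        # Vision path: the model consumes the image directly.
        return {"kind": "image", "payload": image_bytes}
    # Fallback path: extract text and send it to the text-only model.
    return {"kind": "text", "payload": run_ocr(image_bytes)}
```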
Camera Integration
On iOS, you can use the camera directly within the app to capture images for analysis. This is great for:
- Scanning documents and receipts
- Analyzing whiteboards and handwritten notes
- Reading signs, labels, or physical text
- Identifying objects or scenes
Tips for Best Results
- Use well-lit, clear images for better analysis
- Crop images to focus on the relevant content
- Be specific in your questions about what you want analyzed
- For documents, ensure text is readable and not blurry
- Larger VLM models generally produce more accurate analysis