Vision Models
Vision Language Models (VLMs) let you analyze images, screenshots, diagrams, and documents directly in your conversations. The AI can "see" and understand visual content.
Supported Models
On Device AI supports vision models across all of its inference backends:
- GGUF (llama.cpp): Vision-capable GGUF models with multimodal projection
- MLX: MLX-optimized vision models (e.g., Qwen3 VL)
- Cloud APIs: Vision-capable cloud models from OpenAI, Anthropic, Google, etc.
Vision models are identified by a [VLM] tag in the model picker.
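How an app might decide which models get the [VLM] tag can be sketched as follows. This is an illustrative assumption, not On Device AI's actual code: the names `ModelInfo`, `has_mmproj`, and `vision_flag` are hypothetical. For GGUF models, vision support typically depends on a multimodal projector (mmproj) being available; for MLX and cloud models, the capability is usually advertised in model metadata or the provider's API.

```python
# Hypothetical sketch: deciding whether to show the [VLM] tag.
# ModelInfo, has_mmproj, and vision_flag are illustrative names,
# not On Device AI's real API.
from dataclasses import dataclass

@dataclass
class ModelInfo:
    name: str
    engine: str                # "gguf", "mlx", or "cloud"
    has_mmproj: bool = False   # GGUF: multimodal projector present?
    vision_flag: bool = False  # MLX/cloud: capability from metadata/API

def is_vlm(model: ModelInfo) -> bool:
    """Return True if the model can accept images directly."""
    if model.engine == "gguf":
        return model.has_mmproj
    return model.vision_flag

print(is_vlm(ModelInfo("qwen3-vl", "mlx", vision_flag=True)))  # True
```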
Using Vision Models
1. Select a VLM model
Choose a vision-capable model from the model picker (look for the [VLM] tag).
2. Attach an image
Use the camera button, photo library, or paste an image into the chat.
3. Ask about the image
Type your question about the image. Examples: "What's in this image?", "Read the text in this screenshot", "Describe this diagram".
VLM vs OCR Processing
The app handles images differently depending on whether you're using a vision model:
- VLM models: The raw image is passed directly to the model for visual understanding. No OCR step needed — the model processes pixels directly.
- Non-VLM models: Images are processed with OCR (Optical Character Recognition) to extract text, which is then passed to the text-only model. The system uses a capacity-aware approach, safely truncating OCR text if it exceeds the model's context budget.
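The capacity-aware truncation for the non-VLM path can be sketched roughly as below. This is a minimal sketch under stated assumptions: the function name `fit_ocr_text`, the characters-per-token estimate, and the truncation marker are all illustrative, not the app's real implementation.

```python
# Illustrative sketch of capacity-aware OCR truncation for text-only models.
# fit_ocr_text, chars_per_token, and the marker text are assumptions.

def fit_ocr_text(ocr_text: str, context_budget_tokens: int,
                 chars_per_token: float = 4.0) -> str:
    """Truncate OCR output so it fits within the model's context budget.

    Uses a rough chars-per-token heuristic rather than a real tokenizer.
    """
    max_chars = int(context_budget_tokens * chars_per_token)
    if len(ocr_text) <= max_chars:
        return ocr_text
    # Keep the beginning, where document text usually starts, and mark the cut.
    return ocr_text[:max_chars].rstrip() + "\n[...OCR text truncated...]"
```

A real implementation would count tokens with the model's own tokenizer and reserve room for the system prompt and the user's question before truncating.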
Image Sharing and Ingestion
You can bring images into the app from outside the conversation:
- iOS: Use the system Share Sheet to send single or multiple images directly from Photos or other apps into a conversation.
- macOS/visionOS: Drag-and-drop images or paste them from your clipboard directly into the chat interface.
Shared images follow the same routing rules as in-chat attachments: if the active model supports vision, the image is passed to vision inference; otherwise it falls back to OCR.
For analyzing charts, diagrams, or complex visual layouts, use a VLM model. For simple text extraction from screenshots, either approach works well.
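The routing described above can be sketched as a single decision. This is a hedged illustration, not the app's source: `route_image` and the returned attachment shape are hypothetical, and `run_ocr` stands in for whatever OCR engine the platform provides.

```python
# Hypothetical sketch of per-image routing based on the active model.
# route_image, run_ocr, and the attachment dict shape are assumptions.
from typing import Callable

def route_image(model_is_vlm: bool, image_bytes: bytes,
                run_ocr: Callable[[bytes], str]) -> dict:
    """Route an incoming image: raw pixels for VLMs, OCR text otherwise."""
    if model_is_vlm:
        # Vision path: the model consumes the image directly.
        return {"kind": "image", "payload": image_bytes}
    # Fallback path: extract text and send it to the text-only model.
    return {"kind": "text", "payload": run_ocr(image_bytes)}
```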
Camera Integration
On iOS, you can use the camera directly within the app to capture images for analysis. This is great for:
- Scanning documents and receipts
- Analyzing whiteboards and handwritten notes
- Reading signs, labels, or physical text
- Identifying objects or scenes
Tips for Best Results
- Use well-lit, clear images for better analysis
- Crop images to focus on the relevant content
- Be specific in your questions about what you want analyzed
- For documents, ensure text is readable and not blurry
- Larger VLM models generally produce more accurate analysis