qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. Deploy these models using dedicated deployments for production workloads.
Available models
| Model | Input support | Notes |
|---|---|---|
| Qwen3 Omni 30B A3B Instruct | Video, audio, text | Dedicated deployment required |
| Molmo2-4B | Video, text | Dedicated deployment required |
| Molmo2-8B | Video, text | Dedicated deployment required |
Qwen3 Omni supports native video and audio inputs. Molmo2 models are video-only, so use the same request structure as below, but omit
audio_url. Molmo2 models cannot understand audio from videos.Create a deployment
Video and audio models require dedicated deployments. Create one using firectl:Make sure to use the predefined
qwen3-omni-30b-a3b-instruct-minimal deployment shape for your deployment to work correctly.Chat Completions API
Provide video and audio as base64-encoded data URLs. The model acceptsvideo_url, audio_url, and text content types.
- Python
- curl
Working with videos
Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.Preprocessing video
Extract frames at 1 FPS and downscale to 360p for efficient processing:| Parameter | Description |
|---|---|
-t 60 | Limit to first 60 seconds |
fps=1 | Extract 1 frame per second |
scale=-1:360 | Downscale to 360p height, maintain aspect ratio |
-an | Remove audio track (extracted separately) |
Preprocessing audio
Extract audio as Opus in an Ogg container for optimal compression:| Parameter | Description |
|---|---|
-t 60 | Limit to first 60 seconds |
-vn | Remove video track |
-c:a libopus | Use Opus codec |
-b:a 24k | 24 kbps bitrate |
-ar 16000 | 16 kHz sample rate |
-ac 1 | Mono audio |
Complete preprocessing example
Performance considerations
Tips for optimal throughput:- Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
- Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
- Limit video duration – Cap at 60 seconds for consistent performance
- Use dedicated deployments – Scale replicas based on your throughput needs
Known limitations
- Video duration: Maximum 60 seconds recommended for optimal performance
- Supported formats:
.mp4for video,.ogg(Opus) for audio - Base64 size: Total encoded payload should be under 10MB
- Deployment required: Video models are not available on serverless; dedicated deployment required
Related resources
Chat with Video using Qwen3 Omni
Interactive notebook for video and audio analysis
Vision models
Query models with image inputs
Dedicated deployments
Deploy models on dedicated GPUs