Video & Audio Inputs - Fireworks AI Docs

Some omni/multimodal models can process audio and/or video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. These models are served through dedicated deployments.

Available models

Model	Input support	Notes
Qwen3 Omni 30B A3B Instruct	Video, audio, text	Dedicated deployment required
Molmo2-4B	Video, text	Dedicated deployment required
Molmo2-8B	Video, text	Dedicated deployment required

Qwen3 Omni supports native video and audio inputs. Molmo2 models accept video and text only: use the same request structure shown below, but omit the audio_url item. Molmo2 also cannot interpret the audio track embedded in a video file.

Create a deployment

Video and audio models require dedicated deployments. Create one using firectl:

firectl deployment create qwen3-omni-30b-a3b-instruct \
  --account-id <YOUR_ACCOUNT_ID> \
  --min-replica-count 1 \
  --max-replica-count 1 \
  --deployment-shape qwen3-omni-30b-a3b-instruct-minimal

Make sure to use the predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape for your deployment to work correctly.

Chat Completions API

Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.

Python

import os
import base64
import requests

# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

with open("audio.ogg", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}

# Request payload
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
}

# Send request and fail loudly on HTTP errors before parsing the body
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

curl

# Encode your files as single-line base64 (tr strips the line wraps that GNU base64 adds)
VIDEO_B64=$(base64 < processed_video.mp4 | tr -d '\n')
AUDIO_B64=$(base64 < audio.ogg | tr -d '\n')

# Send the JSON body on stdin (-d @-) so large payloads do not hit argument-length limits
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d @- <<EOF
{
  "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
  "max_tokens": 1000,
  "temperature": 0.3,
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,$VIDEO_B64"}},
        {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,$AUDIO_B64"}},
        {"type": "text", "text": "Describe what happens in this video."}
      ]
    }
  ]
}
EOF
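Because video-only models such as Molmo2 take the same request minus the audio item, it can help to build the payload in one place and make audio optional. The following is a minimal sketch, not part of any Fireworks SDK: `build_payload` and its parameter names are hypothetical, and the MIME types assume the .mp4 and .ogg files used on this page.

```python
from typing import Optional


def build_payload(
    account_id: str,
    deployment_id: str,
    video_b64: str,
    prompt: str,
    audio_b64: Optional[str] = None,
    model_id: str = "qwen3-omni-30b-a3b-instruct",
) -> dict:
    """Build a chat completions payload (hypothetical helper for illustration)."""
    # Model string format copied from the request examples on this page.
    model = (
        f"accounts/{account_id}/models/{model_id}"
        f"#accounts/{account_id}/deployments/{deployment_id}"
    )
    content = [
        {"type": "video_url",
         "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
    ]
    # Attach audio only when it is provided (leave it out for video-only models).
    if audio_b64 is not None:
        content.append(
            {"type": "audio_url",
             "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}}
        )
    content.append({"type": "text", "text": prompt})
    return {
        "model": model,
        "max_tokens": 1000,
        "temperature": 0.3,
        "messages": [{"role": "user", "content": content}],
    }
```

The returned dictionary can be passed as `json=` to `requests.post`. For a video-only model, set `model_id` to that model's ID and leave `audio_b64` unset.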

Working with videos

Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.

Preprocessing video

Extract frames at 1 FPS and downscale to 360p for efficient processing:

ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vf "fps=1,scale=-2:360" \
  -c:v libx264 -preset fast \
  -an \
  processed_video.mp4

Parameter	Description
`-t 60`	Limit to first 60 seconds
`fps=1`	Resample to 1 frame per second
`scale=-2:360`	Scale to 360p height, preserve aspect ratio; `-2` rounds the width to an even number, which libx264 needs for yuv420p video
`-c:v libx264 -preset fast`	Encode as H.264 with the fast preset
`-an`	Remove audio track (extracted separately)

Preprocessing audio

Extract audio as Opus in an Ogg container for optimal compression:

ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vn \
  -c:a libopus \
  -b:a 24k \
  -ar 16000 \
  -ac 1 \
  audio.ogg

Parameter	Description
`-t 60`	Limit to first 60 seconds
`-vn`	Remove video track
`-c:a libopus`	Use Opus codec
`-b:a 24k`	24 kbps bitrate
`-ar 16000`	16 kHz sample rate
`-ac 1`	Mono audio
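As a rough sanity check on these settings, a steady 24 kbps stream over 60 seconds comes to about 180 KB before base64 encoding. The sketch below does that arithmetic. `estimated_audio_bytes` is a hypothetical helper; it assumes 1 kbps means 1000 bits per second, and it ignores Ogg container overhead and the fact that the Opus bitrate is a target rather than an exact rate.

```python
def estimated_audio_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Approximate stream size in bytes: bits per second * seconds / 8."""
    return int(bitrate_kbps * 1000 * seconds / 8)


print(estimated_audio_bytes(24, 60))  # 180000
```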

Complete preprocessing example

import subprocess
import tempfile
import base64
import os

def preprocess_video(video_path: str) -> tuple[str, str]:
    """
    Preprocess video for optimal model input.
    
    Returns:
        Tuple of (video_base64, audio_base64)
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
        processed_video_path = tmp_video.name
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
        audio_path = tmp_audio.name
    
    try:
        # Process video: 1 FPS, 360p, max 60 seconds
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vf", "fps=1,scale=-2:360",  # -2 keeps the width even, as libx264 needs for yuv420p
            "-c:v", "libx264", "-preset", "fast",
            "-an",
            processed_video_path
        ], check=True, capture_output=True)
        
        # Extract audio: Opus/Ogg, mono, 16kHz, 24kbps
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vn",
            "-c:a", "libopus",
            "-b:a", "24k",
            "-ar", "16000",
            "-ac", "1",
            audio_path
        ], check=True, capture_output=True)
        
        with open(processed_video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")
        
        with open(audio_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")
        
        return video_b64, audio_b64
    
    finally:
        os.unlink(processed_video_path)
        os.unlink(audio_path)

Preprocessing is highly recommended to reduce latency and ensure consistent performance.

Performance considerations

Tips for optimal throughput:

Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
Limit video duration – Cap at 60 seconds for consistent performance
Use dedicated deployments – Scale replicas based on your throughput needs

Known limitations

Video duration: Maximum 60 seconds recommended for optimal performance
Supported formats: .mp4 for video, .ogg (Opus) for audio
Base64 size: Total encoded payload should be under 10 MB
Deployment required: Video and audio models are not available on serverless and must run on a dedicated deployment
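Base64 turns every 3 input bytes into 4 output characters, so the encoded size can be estimated from file sizes before anything is encoded or uploaded. A minimal sketch follows. The helper names are hypothetical, and reading the size limit as 10 * 1024 * 1024 bytes is an assumption, since this page does not say whether the figure is decimal or binary.

```python
import base64
import math

MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # assumption: the limit read as 10 MiB


def base64_length(raw_bytes: int) -> int:
    """Length of standard padded base64 output for raw_bytes of input."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_limit(*raw_sizes: int, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """True if the combined base64 size of all inputs is within the limit."""
    return sum(base64_length(n) for n in raw_sizes) <= limit


# Cross-check the formula against the standard library encoder.
assert base64_length(10) == len(base64.b64encode(b"x" * 10))
```

Call `fits_limit` with `os.path.getsize(...)` for each preprocessed file. The JSON envelope adds a little on top of the encoded media, so leaving some headroom is sensible.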

Related resources

Chat with Video using Qwen3 Omni

Interactive notebook for video and audio analysis

Vision models

Query models with image inputs

Dedicated deployments

Deploy models on dedicated GPUs
