VLM support in the Training API requires a VLM-compatible training shape. See Training Shapes for available shapes.
What changes for vision
Compared to text-only training, VLM fine-tuning differs in three ways:| Aspect | Text-only | Vision |
|---|---|---|
| Training shape | Text model shape (e.g. qwen3-8b-128k) | VLM shape (e.g. qwen3-vl-8b-65k) |
| Tokenizer | Text tokenizer (e.g. Qwen/Qwen3-8B) | VLM processor (e.g. Qwen/Qwen3-VL-8B-Instruct) |
| Message format | content is a string | content is an array of text and image_url objects |
Dataset format
Vision datasets use the standard OpenAI-compatible chat format. The key difference is thatcontent fields can contain an array of content parts mixing text and images:
Single image
Multiple images
Multi-turn with images
Image encoding requirements
Images must be base64-encoded with a MIME type prefix. Raw HTTP URLs are not supported in training data.- Correct
- Incorrect
Cookbook: VLM SFT
The cookbook’ssft_loop recipe works with vision datasets out of the box. Use a VLM training shape and a VLM tokenizer:
0.0 (prompt) and text response tokens are assigned weight 1.0 (train).
API-level: VLM training loop
For full control over the training loop, use the API directly with a VLM training shape. The workflow is the same as text-only training, but the tokenizer and shape are VLM-specific:1. Create the managed VLM service
2. Connect and train
3. Save and promote
Checkpointing and weight sync work identically to text-only training:VLM DPO and RL
Vision inputs also work with DPO and RL training. The dataset format is the same — use multimodalcontent arrays in your messages:
DPO with vision
RL with vision prompts
dpo_loop, rl_loop) with a VLM training shape and tokenizer — the multimodal message handling is automatic.
Available VLM training shapes
| Model | Shape ID | Context | GPUs |
|---|---|---|---|
| Qwen3 VL 8B | accounts/fireworks/trainingShapes/qwen3-vl-8b-65k | 65k | 4 |
Related guides
- Training Shapes — available VLM and text training shapes
- Supervised Fine Tuning - Vision (Managed) — managed VLM fine-tuning without writing training loops
- Querying Vision Language Models — inference with VLMs
- Cookbook SFT — SFT recipe details
- Loss Functions — custom loss function patterns