TL;DR
If you launch training through a cookbook recipe (rl_loop, sft_loop, dpo_loop, orpo_loop, igpo_loop), you don’t have to call any checkpoint APIs yourself. Set two config fields and the recipe handles save, resume, and promote:
- dcp_save_interval=N (top-level Config field on every recipe): save resumable checkpoints every N steps
- output_model_id="my-model": promote the final checkpoint to a deployable Fireworks model

Rerunning with the same log_path resumes from the last saved checkpoint automatically.
Config fields
| Field | Applies to | Type | Default | Description |
|---|---|---|---|---|
| log_path | All recipes | str | (required) | Directory for the recipe’s local bookkeeping (dataloader.json) and logs |
| dcp_save_interval | All recipes | int | 0 | Save a resumable (DCP) checkpoint every N steps. 0 = off. Top-level Config field on every recipe (RL, IGPO, async RL, SFT, DPO, ORPO). |
| output_model_id | All recipes | str \| None | None | If set, promote the final checkpoint to this Fireworks model ID at the end of training |
| init_from_checkpoint | All recipes | str \| None | None | Load weights from another job ("job-id:checkpoint-name"). Step counter resets to 0. |
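
To make the table concrete, here is a minimal sketch of the four fields and their defaults. The Config dataclass below is a hypothetical stand-in written for illustration, not the cookbook's actual class; only the field names, types, and defaults come from the table above, and the path and model ID are placeholder values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    """Hypothetical stand-in for a recipe's top-level Config (illustration only)."""

    log_path: str                               # required: local bookkeeping and logs
    dcp_save_interval: int = 0                  # 0 = periodic resumable saves are off
    output_model_id: Optional[str] = None       # promote the final checkpoint if set
    init_from_checkpoint: Optional[str] = None  # "job-id:checkpoint-name"


# Save a resumable checkpoint every 50 steps and promote the final one.
cfg = Config(
    log_path="./runs/my-run",      # placeholder path
    dcp_save_interval=50,
    output_model_id="my-model",    # placeholder model ID
)
print(cfg)
```

Leaving init_from_checkpoint unset keeps the default behavior, where rerunning with the same log_path resumes the same job.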
Resume
Automatic (same log_path)
Rerun with the same log_path and the recipe resumes. It queries the control plane for the newest resumable checkpoint on the trainer job and reloads weights and optimizer state. The step counter and the cookbook’s data_consumed counter are restored from dataloader.json in log_path.
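
The local half of that resume step can be sketched as follows. This is an illustration only: the on-disk layout of dataloader.json is not documented here, so the sketch assumes a JSON object keyed by checkpoint name with a data_consumed entry, and load_data_consumed is a hypothetical helper, not a cookbook function.

```python
import json
import tempfile
from pathlib import Path


def load_data_consumed(log_path: str, checkpoint_name: str) -> int:
    """Return the data_consumed value recorded for checkpoint_name, or 0."""
    state_file = Path(log_path) / "dataloader.json"
    if not state_file.exists():
        return 0  # fresh run: nothing to restore
    # ASSUMPTION: {"<checkpoint name>": {"data_consumed": <int>}, ...}
    state = json.loads(state_file.read_text())
    return int(state.get(checkpoint_name, {}).get("data_consumed", 0))


# Demo against a throwaway directory standing in for log_path.
with tempfile.TemporaryDirectory() as log_path:
    Path(log_path, "dataloader.json").write_text(
        json.dumps({"step-100": {"data_consumed": 6400}})
    )
    restored = load_data_consumed(log_path, "step-100")
    missing = load_data_consumed(log_path, "step-200")

print(restored, missing)
```

The weights and optimizer state come from the control plane, so deleting log_path loses only this local counter, not the checkpoints themselves.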
From another job
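
To start from a checkpoint saved by a different job, set init_from_checkpoint to a "job-id:checkpoint-name" string; as the config table notes, the step counter resets to 0. The helper below is a hypothetical sketch of how such a reference splits into its two parts. It assumes the first colon is the separator, which is a guess, and the job ID and checkpoint name in the example are placeholders.

```python
from typing import NamedTuple


class CheckpointRef(NamedTuple):
    job_id: str
    checkpoint_name: str


def parse_checkpoint_ref(ref: str) -> CheckpointRef:
    """Split "job-id:checkpoint-name" into its parts (hypothetical helper)."""
    # ASSUMPTION: the first ":" separates the job ID from the checkpoint name.
    job_id, sep, name = ref.partition(":")
    if not sep or not job_id or not name:
        raise ValueError(f"expected 'job-id:checkpoint-name', got {ref!r}")
    return CheckpointRef(job_id, name)


ref = parse_checkpoint_ref("my-job-id:step-100")  # placeholder values
print(ref.job_id, ref.checkpoint_name)
```

Validating the string before launch surfaces a malformed reference immediately instead of after the trainer job has started.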
Promoting a checkpoint manually
If you want to promote an arbitrary checkpoint after training (not just the final one), use the cookbook’s promote script; pass --checkpoint-name <name> to promote a specific one.
You can also call the API directly — see Saving and Loading — Promoting.
Advanced internals
Most users can stop reading here. The sections below cover what the recipe does internally — useful only if you’re forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn’t promote. The full SDK-level reference lives in Saving and Loading.
What gets saved, where
The recipe interacts with two surfaces:

| Surface | Owns | Source of truth for |
|---|---|---|
| Control plane (FireworksClient.list_checkpoints(job_id)) | All remote checkpoint blobs (DCP and sampler) | What checkpoints exist, their type, and whether each is promotable |
| {log_path}/dataloader.json | Local file | The cookbook’s data_consumed counter per checkpoint name (no server-side representation) |

There is no checkpoints.jsonl registry: the control plane is queried at resume / promote time.
Two axes: resumable and promotable
When the recipe saves a checkpoint, it picks two independent capabilities:

| Axis | What it writes | Resumes? | Promotes to a model? |
|---|---|---|---|
| resumable=True | DCP (weights + optimizer) | Yes | No |
| promotable=True | Sampler weights (HF format) | No | Yes |
| Both | DCP + sampler | Yes | Yes |

Periodic saves use resumable=True only. The final save uses both. RL weight sync saves sampler checkpoints and syncs their snapshot identities separately from DCP resume saves.
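
The choice of flags per step can be sketched as a small policy function. This is an illustration, not the recipe's code: it assumes the resumable-only saves are the periodic ones driven by dcp_save_interval, and save_flags is a hypothetical name.

```python
from typing import Dict, Optional


def save_flags(step: int, total_steps: int, dcp_save_interval: int) -> Optional[Dict[str, bool]]:
    """Return the flags to save with at this step, or None to skip saving."""
    if step == total_steps:
        # Final save: DCP for resume plus sampler weights for promotion.
        return {"resumable": True, "promotable": True}
    if dcp_save_interval > 0 and step % dcp_save_interval == 0:
        # Periodic save: DCP only, so it can resume but cannot be promoted.
        return {"resumable": True, "promotable": False}
    return None  # no checkpoint at this step


for step in (7, 50, 100):
    print(step, save_flags(step, total_steps=100, dcp_save_interval=50))
```

Under this policy a periodic checkpoint is never promotable on its own, which is why promoting a mid-run checkpoint needs a save that also writes sampler weights.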
Forking a recipe
If you fork rl_loop.py (or another ported recipe) and need to drive checkpointing yourself, instantiate TrainingCheckpoints. It delegates save_state / save_weights_for_sampler_ext / promote_checkpoint to the SDK and uses the control plane as the source of truth for resume and promotion. Recipes pass the SDK-managed service client as the control-plane checkpoint client. The full API surface those calls expose is documented in Saving and Loading.
Checkpoint kinds
This subsection is the canonical reference for checkpoint kinds and promotability across the stack; other pages link here. Three separate layers of the stack each have their own “type”, and confusing them is the usual reason a promotion fails. They are not synonyms:

| Layer | Where | Values | What it controls |
|---|---|---|---|
| Cookbook | TrainingCheckpoints.save(resumable=, promotable=) | two booleans | Which of DCP / sampler blob (or both) gets saved |
| SDK | save_weights_for_sampler_ext(checkpoint_type=...) | "base", "delta" | Whether the sampler blob is full weights or an arc_v2 delta over the previous base (LoRA ignores this — full adapter is always saved) |
| Server | checkpointType on each control-plane row | TRAINING, TRAINING_LORA, INFERENCE_BASE, INFERENCE_LORA, INFERENCE_ARC_V2 | Detected from blob contents. The first two are resumable; INFERENCE_BASE and INFERENCE_LORA are promotable; INFERENCE_ARC_V2 (delta on full-param) is not. |

When the cookbook saves with promotable=True, it always calls the SDK with checkpoint_type="base", which the server detects as INFERENCE_BASE (full-param) or INFERENCE_LORA (LoRA). Both are promotable. The non-promotable INFERENCE_ARC_V2 only happens if you bypass the cookbook and call save_weights_for_sampler_ext("delta") on a full-parameter run.
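
The relationship between the SDK argument and the server-side type can be written down as a lookup. The function below is a hypothetical sketch that only encodes the rules stated in this section; the server detects the type from blob contents, so this predicts the result and does not call any Fireworks API.

```python
# Server-side types that the section above describes as promotable.
PROMOTABLE_TYPES = {"INFERENCE_BASE", "INFERENCE_LORA"}


def expected_server_type(checkpoint_type: str, lora: bool) -> str:
    """Predict the server checkpointType for a sampler save (illustration only)."""
    if checkpoint_type not in ("base", "delta"):
        raise ValueError(f"unknown checkpoint_type: {checkpoint_type!r}")
    if lora:
        # LoRA ignores checkpoint_type: the full adapter is always saved.
        return "INFERENCE_LORA"
    # Full-parameter run: "base" is full weights, "delta" is an arc_v2 delta.
    return "INFERENCE_BASE" if checkpoint_type == "base" else "INFERENCE_ARC_V2"


for ckpt_type in ("base", "delta"):
    for lora in (True, False):
        server_type = expected_server_type(ckpt_type, lora)
        print(ckpt_type, lora, server_type, server_type in PROMOTABLE_TYPES)
```

The only combination that comes out non-promotable is "delta" on a full-parameter run.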
Promotability cheat sheet
“Promotable” means the server will accept the blob for promotion, i.e. the checkpoint shows promotable=True in list_checkpoints. To actually promote, you need the checkpoint name plus source_job_id and base_model.
| How it was saved | LoRA promotable | Full-param promotable |
|---|---|---|
| TrainingCheckpoints.save(resumable=True, promotable=False) | No (DCP only) | No (DCP only) |
| TrainingCheckpoints.save(promotable=True) | Yes | Yes |
| save_weights_for_sampler_ext(checkpoint_type="base") | Yes | Yes |
| save_weights_for_sampler_ext(checkpoint_type="delta") | Yes (server always stores full adapter) | No |
| Recipe weight sync — first save | Yes | Yes |
| Recipe weight sync — later saves | Yes | No |
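
When debugging a failed promotion, it can help to filter the checkpoint rows down to the promotable ones before choosing a name. The sketch below works on plain dictionaries. The row shape (name, checkpointType, promotable keys) is an assumption made for illustration and may not match what list_checkpoints actually returns, and the job ID and base model are placeholders.

```python
from typing import Dict, List, Optional


def pick_promotable(rows: List[Dict], name: Optional[str] = None) -> Optional[Dict]:
    """Return the named promotable row, or the last promotable row if name is None."""
    candidates = [r for r in rows if r.get("promotable")]
    if name is not None:
        candidates = [r for r in candidates if r.get("name") == name]
    return candidates[-1] if candidates else None


# ASSUMPTION: hypothetical rows, ordered oldest to newest.
rows = [
    {"name": "step-50", "checkpointType": "TRAINING", "promotable": False},
    {"name": "step-100-sampler", "checkpointType": "INFERENCE_BASE", "promotable": True},
    {"name": "step-150-sampler", "checkpointType": "INFERENCE_ARC_V2", "promotable": False},
]

chosen = pick_promotable(rows)
# The three inputs the text above says a promotion needs.
promotion_inputs = {
    "checkpoint_name": chosen["name"],
    "source_job_id": "my-job-id",    # placeholder
    "base_model": "my-base-model",   # placeholder
}
print(promotion_inputs)
```

If pick_promotable returns None for a name you expected to work, the cheat sheet above shows which save path produced a non-promotable blob.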
Related guides
- Saving and Loading — SDK-level reference for save / load / promote
- Training and Sampling — SDK-managed sampler refresh lifecycle
- Cookbook RL — full GRPO walkthrough