Quickstart

Set dcp_save_interval and log_path, then rerun with the same log_path to resume:
```python
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig

config = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    dcp_save_interval=10,  # save every 10 steps
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(config)

# If interrupted, just run again with the same config.
# It finds the last checkpoint in log_path and resumes automatically.
main(config)
```

Checkpoint kinds

Every checkpoint uses a CheckpointKind:
| Kind | What is saved | Resumable | Promotable |
|------|---------------|-----------|------------|
| STATE | DCP (optimizer + weights) | Yes | No |
| SAMPLER | HF weights for inference | No | Yes |
| BOTH | DCP + HF weights | Yes | Yes |
- Mid-training saves (dcp_save_interval) use STATE: resumable but not promotable.
- The final checkpoint at the end of training always uses BOTH: resumable and promotable.
- To promote a mid-training checkpoint, call save_checkpoint explicitly with kind=SAMPLER or kind=BOTH.

dcp_save_interval defaults to 0 (off). Without setting it, training cannot be resumed from intermediate steps.
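The kind semantics above can be modeled as a small enum. This is a minimal sketch for illustration only, assuming nothing beyond the three kind names and the resumable/promotable rules in the table; the real CheckpointKind may be defined differently:

```python
from enum import Enum, auto

class CheckpointKind(Enum):
    """Hypothetical model of the three checkpoint kinds."""
    STATE = auto()    # DCP state: optimizer + weights
    SAMPLER = auto()  # HF weights for inference
    BOTH = auto()     # DCP state + HF weights

    @property
    def resumable(self) -> bool:
        # Resume requires DCP state, so STATE and BOTH qualify.
        return self in (CheckpointKind.STATE, CheckpointKind.BOTH)

    @property
    def promotable(self) -> bool:
        # Promotion requires HF weights, so SAMPLER and BOTH qualify.
        return self in (CheckpointKind.SAMPLER, CheckpointKind.BOTH)

# Mid-training saves use STATE: resumable, not promotable.
assert CheckpointKind.STATE.resumable and not CheckpointKind.STATE.promotable
# The final checkpoint uses BOTH: resumable and promotable.
assert CheckpointKind.BOTH.resumable and CheckpointKind.BOTH.promotable
```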

checkpoints.jsonl

Checkpoint metadata is written to {log_path}/checkpoints.jsonl — one JSON line per save. The fields present depend on the kind:
{"name": "step-10", "step": 10, "data_consumed": 40, "state_path": "cross_job://job-abc/step-10", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
{"name": "step-50", "step": 50, "data_consumed": 200, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4", "source_job_id": "job-abc", "base_model": "accounts/fireworks/models/qwen3-8b"}
| Field | Present in | Description |
|-------|-----------|-------------|
| state_path | STATE, BOTH | Remote DCP reference for resume |
| sampler_path | SAMPLER, BOTH | Snapshot name for promotion |
| source_job_id | All | Trainer job that created this checkpoint |
| base_model | All | Base model (auto-detected by promote script) |
WeightSyncer.save_and_hotload() saves HF weights to GCS for hotloading but does not write to checkpoints.jsonl. Those checkpoints exist remotely but are not tracked here.
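As an illustration of the schema, here is a sketch that parses checkpoints.jsonl lines and selects the latest promotable entry, i.e. the last one carrying a sampler_path. The helper name is ours, not part of the recipe:

```python
import json

def latest_promotable(jsonl_lines):
    """Return the last checkpoint entry that carries a sampler_path, or None."""
    promotable = None
    for line in jsonl_lines:
        entry = json.loads(line)
        if "sampler_path" in entry:  # only SAMPLER or BOTH kinds have this
            promotable = entry
    return promotable

lines = [
    '{"name": "step-10", "step": 10, "state_path": "cross_job://job-abc/step-10"}',
    '{"name": "step-50", "step": 50, "state_path": "cross_job://job-abc/step-50", "sampler_path": "step-50-a1b2c3d4"}',
]
entry = latest_promotable(lines)
print(entry["name"], entry["sampler_path"])  # step-50 step-50-a1b2c3d4
```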

Resume

Automatic (same log_path)

Just rerun with the same log_path. The recipe reads checkpoints.jsonl, finds the last entry with a state_path, loads DCP state, and continues from the saved step.
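The resume lookup can be sketched as follows. This is a simplified stand-in for what the recipe does internally, assuming only the jsonl schema shown above; the function name is illustrative:

```python
import json

def find_resume_point(jsonl_path):
    """Return (state_path, step) of the last resumable entry, or None."""
    last = None
    with open(jsonl_path) as f:
        for line in f:
            entry = json.loads(line)
            if "state_path" in entry:  # STATE or BOTH checkpoints only
                last = (entry["state_path"], entry["step"])
    return last
```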

From another job (init_from_checkpoint)

```python
config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
```
This loads weights from the specified job and resets the step counter to 0. It is mutually exclusive with automatic resume.

Promoting a checkpoint

Only entries with sampler_path can be promoted (kind=SAMPLER or kind=BOTH). The final checkpoint is always promotable. Mid-training DCP saves are not.
```bash
export FIREWORKS_API_KEY=...

# Promote the latest promotable checkpoint:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl

# Promote a specific step:
python promote_checkpoint.py \
    --checkpoints-jsonl ./my_training/checkpoints.jsonl \
    --step 50
```
Without checkpoints.jsonl, use the SDK directly with the source_job_id and sampler_path:
```python
from fireworks.training.sdk import FireworksClient

client = FireworksClient(api_key=api_key)
client.promote_checkpoint("job-abc", "step-50-a1b2c3d4", "my-fine-tuned-model")
```
See Saving and Loading — Promoting for full API details.

Config fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| log_path | str | (required) | Directory for checkpoints.jsonl and logs |
| dcp_save_interval | int | 0 | Save a DCP checkpoint every N steps; 0 = off |
| init_from_checkpoint | str \| None | None | Load DCP state from another job ("job-id:checkpoint-name"); step resets to 0 |