Overview
TrainerJobManager manages the lifecycle of service-mode trainer jobs — GPU-backed trainer endpoints that your Python loop connects to with a training client.
TrainerJobManager extends FireworksClient, so all trainer-free operations (checkpoint promotion, training shape resolution) are also available here.
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | — | Fireworks API key |
| base_url | str | "https://api.fireworks.ai" | Control-plane URL |
| additional_headers | dict \| None | None | Extra HTTP headers |
| verify_ssl | bool \| None | None | SSL verification override |
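As a minimal sketch of how the constructor arguments above might be assembled, the helper below reads them from environment variables. The helper name `manager_kwargs` and the variable names `FIREWORKS_API_KEY` and `FIREWORKS_BASE_URL` are assumptions made for illustration; they are not part of the documented API.

```python
import os


def manager_kwargs(env=None):
    """Build keyword arguments matching the constructor table above."""
    env = os.environ if env is None else env
    return {
        # No default is documented for api_key, so fail early (KeyError)
        # if the assumed variable is missing.
        "api_key": env["FIREWORKS_API_KEY"],
        # Fall back to the documented default control-plane URL.
        "base_url": env.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai"),
    }


# With the SDK installed, the result would be passed straight through, e.g.
#   mgr = TrainerJobManager(**manager_kwargs())
# (the import location of TrainerJobManager is not shown in this section).
```

Keeping the key out of source code and injecting it at construction time is a design choice of this sketch, not a requirement stated by the SDK.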
Methods
create(config)
Create a service-mode trainer job and return immediately, without waiting for it to become ready. Returns a CreatedTrainerJob.
wait_for_ready(job_id, job_name=None, poll_interval_s=5.0, timeout_s=900)
Poll until a trainer job reaches the RUNNING state and is healthy. Returns a TrainerServiceEndpoint.
create_and_wait(config, poll_interval_s=5.0, timeout_s=900)
Create a service-mode trainer and poll until the endpoint is healthy. Combines create() + wait_for_ready(). Returns a TrainerServiceEndpoint.
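When the job ID is needed before the trainer is ready (for example, to record it for later cleanup), create() and wait_for_ready() can be called separately instead of create_and_wait(). The sketch below illustrates that flow against a hypothetical `StubManager` that only mimics the documented method names and return fields; the job ID, account name, and endpoint URL in it are made up.

```python
from types import SimpleNamespace


class StubManager:
    """Hypothetical stand-in that mimics the documented method names only."""

    def create(self, config):
        job_id = "job-123"  # made-up ID for the demo
        return SimpleNamespace(
            job_id=job_id,
            job_name=f"accounts/acct/rlorTrainerJobs/{job_id}",
        )

    def wait_for_ready(self, job_id, job_name=None, poll_interval_s=5.0, timeout_s=900):
        return SimpleNamespace(
            base_url="https://trainer.example.invalid",
            job_id=job_id,
            job_name=job_name,
        )


def launch_in_two_steps(mgr, config, on_created=None):
    """Call create(), expose the job ID, then call wait_for_ready()."""
    created = mgr.create(config)
    if on_created is not None:
        # E.g. persist the ID so the job can still be resumed or deleted
        # if the wait below times out.
        on_created(created.job_id)
    return mgr.wait_for_ready(
        created.job_id,
        job_name=created.job_name,
        poll_interval_s=5.0,
        timeout_s=900,
    )


seen = []
endpoint = launch_in_two_steps(StubManager(), config=None, on_created=seen.append)
```

With a real manager, `config` would be a TrainerJobConfig and the returned object a TrainerServiceEndpoint.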
wait_for_existing(job_id, poll_interval_s=5.0, timeout_s=900)
Wait for an existing trainer job to reach the RUNNING state. Returns a TrainerServiceEndpoint.
resume_and_wait(job_id, poll_interval_s=5.0, timeout_s=900)
Resume a failed, cancelled, or paused job and wait until it is healthy. Returns a TrainerServiceEndpoint.
reconnect_and_wait(job_id, ...)
Handle pod preemption and transient failures. Waits for the job to reach a resumable state, resumes it, then polls until healthy. Returns a TrainerServiceEndpoint.
Unlike resume_and_wait(), this method retries when the job is in a transitional state (e.g. the control plane is still processing a pod death).
| Parameter | Type | Default | Description |
|---|---|---|---|
| job_id | str | — | The RLOR job ID to reconnect |
| poll_interval_s | float | 5.0 | Seconds between health checks after resume |
| timeout_s | float | 600 | Overall timeout for the job to become RUNNING |
| max_wait_for_resumable_s | float | 120 | Max seconds to wait for a resumable state |
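The three phases described for reconnect_and_wait (wait for a resumable state, resume, poll until healthy) can be sketched as a plain control loop. This is an illustration of the described behavior under stated assumptions, not the SDK's implementation: `get_state`, `resume`, and `is_healthy` are hypothetical callables supplied by the caller, and the set of resumable states is passed in explicitly because this section does not list it.

```python
import time


def reconnect_sketch(get_state, resume, is_healthy, *, resumable_states,
                     poll_interval_s=5.0, timeout_s=600,
                     max_wait_for_resumable_s=120,
                     sleep=time.sleep, clock=time.monotonic):
    """Wait for a resumable state, resume, then poll until RUNNING and healthy."""
    # Phase 1: the job may still be in a transitional state, so keep checking
    # until it settles into a state that can be resumed.
    start = clock()
    while get_state() not in resumable_states:
        if clock() - start > max_wait_for_resumable_s:
            raise TimeoutError("job did not reach a resumable state")
        sleep(poll_interval_s)

    # Phase 2: ask for the job to be resumed.
    resume()

    # Phase 3: poll until the trainer is RUNNING and passes its health check.
    start = clock()
    polls = 0
    while not (get_state() == "JOB_STATE_RUNNING" and is_healthy()):
        polls += 1
        if clock() - start > timeout_s:
            raise TimeoutError("job did not become healthy in time")
        sleep(poll_interval_s)
    return polls


# Scripted demo: no network calls and no real sleeping.
script = ["JOB_STATE_CREATING", "JOB_STATE_FAILED",
          "JOB_STATE_PENDING", "JOB_STATE_RUNNING"]
events = []


def next_state():
    # Advance through the script, then keep returning the final state.
    return script.pop(0) if len(script) > 1 else script[0]


polls = reconnect_sketch(
    next_state,
    resume=lambda: events.append("resume"),
    is_healthy=lambda: True,
    # Assumed from resume_and_wait's description (failed/cancelled jobs).
    resumable_states={"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"},
    sleep=lambda seconds: None,
    clock=lambda: 0.0,
)
```

Separating the two timeouts mirrors the parameter table: max_wait_for_resumable_s bounds the first phase, while timeout_s bounds the wait for RUNNING.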
get(job_id)
Inspect the status of a trainer job.
delete(job_id)
Delete a trainer job and release its GPU resources.
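Because a running trainer holds GPU resources until it is deleted, one common pattern is to pair the launch with delete() in a try/finally block. The sketch below shows that pattern against a hypothetical `StubManager` that records calls instead of contacting Fireworks; the helper name `run_with_cleanup`, the job ID, and the URL are made up for the example.

```python
from types import SimpleNamespace


class StubManager:
    """Hypothetical stand-in; records calls instead of contacting Fireworks."""

    def __init__(self):
        self.deleted = []

    def create_and_wait(self, config, poll_interval_s=5.0, timeout_s=900):
        return SimpleNamespace(job_id="job-123",
                               base_url="https://trainer.example.invalid")

    def delete(self, job_id):
        self.deleted.append(job_id)


def run_with_cleanup(mgr, config, train):
    """Launch a trainer, run train(endpoint), and always delete the job."""
    endpoint = mgr.create_and_wait(config)
    try:
        return train(endpoint)
    finally:
        # Runs on success and on failure, so the job is deleted either way.
        mgr.delete(endpoint.job_id)


# Successful run: the job is deleted after training returns.
mgr = StubManager()
result = run_with_cleanup(mgr, config=None,
                          train=lambda ep: f"trained via {ep.base_url}")

# Failing run: the job is still deleted before the error propagates.
failing = StubManager()
try:
    run_with_cleanup(failing, config=None, train=lambda ep: 1 / 0)
except ZeroDivisionError:
    pass
```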
promote_checkpoint(*, name, output_model_id, base_model)
Inherited from FireworksClient. Promote a sampler checkpoint to a deployable Fireworks model. The trainer job does not need to be running — the checkpoint resource name resolves the storage location.
See FireworksClient.promote_checkpoint for full parameter docs.
resolve_training_profile(shape_id)
Inherited from FireworksClient. Resolve a training shape ID into a full configuration profile.
See FireworksClient.resolve_training_profile for full parameter docs.
TrainerJobConfig
TrainerJobManager.create(...) and TrainerJobManager.create_and_wait(...) accept a TrainerJobConfig dataclass.
Launching through a training shape is the recommended path. In normal user code, you should not hand-author training_shape_ref; pass a training shape ID to resolve_training_profile(...) and use the returned versioned ref. Advanced manual launches can omit training_shape_ref and provide infra fields directly.
When training_shape_ref is set (the recommended shape path), the training shape owns the trainer’s hardware and image configuration. The fields below are what you set as a user:
| Field | Type | Default | Description |
|---|---|---|---|
| base_model | str | — | Base model name (e.g. "accounts/fireworks/models/qwen3-8b") |
| training_shape_ref | str \| None | None | Full training-shape resource name (e.g. accounts/fireworks/trainingShapes/<shape> or .../versions/<ver>). Use mgr.resolve_training_profile(...) to get the pinned versioned ref. See Training Shapes. |
| lora_rank | int | 0 | LoRA rank. 0 for full-parameter tuning, or a positive integer (e.g. 16, 64) for LoRA |
| max_context_length | int \| None | None | Maximum sequence length. Usually inherited from the training shape on the shape path. |
| learning_rate | float | 1e-5 | Learning rate for the optimizer |
| display_name | str \| None | None | Human-readable trainer name |
| region | str \| None | None | Region for the job |
| extra_args | list[str] \| None | None | Extra trainer arguments |
| forward_only | bool | False | Create a forward-only trainer (reference model pattern) |
| inactivity_timeout | datetime.timedelta \| str \| None | None | Trainer inactivity timeout. The trainer reports tracked activity, including trainer API operations and active-session heartbeats. If no tracked activity is observed for this duration, the trainer is automatically stopped. When unset or 0, Fireworks uses the 60-minute default. String values must use protobuf JSON duration format, such as "1800s". |
| disable_inactivity_cleanup | bool | False | Disable trainer inactivity cleanup. GPU usage continues to accrue while the trainer is running. |
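To make the fields above concrete, here is a small sketch that mirrors them in a stand-in dataclass and converts a timedelta into the protobuf JSON duration string that inactivity_timeout accepts. `TrainerJobConfigSketch` and `to_duration_string` are hypothetical names introduced for illustration; the real TrainerJobConfig should be imported from the SDK, and the shape ref shown is the placeholder from the table rather than a real resource name.

```python
import datetime
from dataclasses import dataclass
from typing import List, Optional, Union


def to_duration_string(value: datetime.timedelta) -> str:
    """Format a timedelta as whole seconds with an 's' suffix, e.g. '1800s'."""
    return f"{int(value.total_seconds())}s"


@dataclass
class TrainerJobConfigSketch:
    """Stand-in mirroring the user-facing fields and defaults in the table above."""

    base_model: str
    training_shape_ref: Optional[str] = None
    lora_rank: int = 0
    max_context_length: Optional[int] = None
    learning_rate: float = 1e-5
    display_name: Optional[str] = None
    region: Optional[str] = None
    extra_args: Optional[List[str]] = None
    forward_only: bool = False
    inactivity_timeout: Union[datetime.timedelta, str, None] = None
    disable_inactivity_cleanup: bool = False


config = TrainerJobConfigSketch(
    base_model="accounts/fireworks/models/qwen3-8b",
    # In real code this would be the versioned ref obtained through
    # resolve_training_profile(...); the placeholder below is not a real ref.
    training_shape_ref="accounts/fireworks/trainingShapes/<shape>/versions/<ver>",
    lora_rank=16,  # LoRA tuning; 0 would mean full-parameter tuning
    inactivity_timeout=to_duration_string(datetime.timedelta(minutes=30)),
)
```

Passing a datetime.timedelta directly is also allowed by the documented type; the string form is shown only to illustrate the "1800s" format.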
On the recommended shape path, accelerator_type, accelerator_count, node_count, and custom_image_tag are automatically configured by the training shape and cannot be overridden. Advanced manual launches can omit training_shape_ref and set those fields directly.
CreatedTrainerJob
Returned by create():
| Field | Type | Description |
|---|---|---|
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
| job_id | str | RLOR trainer job ID |
TrainerServiceEndpoint
Returned by create_and_wait, wait_for_ready, wait_for_existing, resume_and_wait, and reconnect_and_wait:
| Field | Type | Description |
|---|---|---|
| base_url | str | Trainer endpoint URL for connecting a training client |
| job_id | str | RLOR trainer job ID |
| job_name | str | Full resource name (accounts/<id>/rlorTrainerJobs/<id>) |
TrainingShapeProfile
See FireworksClient > TrainingShapeProfile for the full field reference.
Job states
| State | Meaning |
|---|---|
| JOB_STATE_CREATING | Resources being provisioned |
| JOB_STATE_PENDING | Queued, waiting for GPU availability |
| JOB_STATE_RUNNING | Trainer is ready; you can connect a training client |
| JOB_STATE_IDLE | Service-mode job is idle |
| JOB_STATE_COMPLETED | Job finished successfully |
| JOB_STATE_FAILED | Job failed |
| JOB_STATE_CANCELLED | Job was cancelled |
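One way to act on these states in client code is a small lookup that maps each state to a next step. The grouping below is an assumption made for illustration: the "resume" group follows the resume_and_wait description (failed or cancelled jobs), and any state not covered, including JOB_STATE_IDLE, is left for manual inspection. The function name and action labels are hypothetical.

```python
WAITING = {"JOB_STATE_CREATING", "JOB_STATE_PENDING"}
READY = {"JOB_STATE_RUNNING"}
RESUMABLE = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}  # assumed grouping
FINISHED = {"JOB_STATE_COMPLETED"}


def next_action(state: str) -> str:
    """Map a job state string to a suggested next step for the caller."""
    if state in READY:
        return "connect"   # trainer is ready for a training client
    if state in WAITING:
        return "wait"      # still provisioning or queued for GPUs
    if state in RESUMABLE:
        return "resume"    # candidate for resume_and_wait
    if state in FINISHED:
        return "done"
    return "inspect"       # e.g. JOB_STATE_IDLE or an unrecognized state
```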
Related guides
- FiretitanServiceClient: create a FiretitanTrainingClient for a live trainer endpoint
- Training Shapes: available shapes and deployment linkage
- Cleanup: resource cleanup