Skip to main content
Fireworks hosts embedding and reranking models, which are useful for tasks like RAG and semantic search.

Generating embeddings

Embeddings models take text as input and output a vector of floating point numbers to use for tasks like similarity comparisons and search. Our embedding service is OpenAI compatible. Refer to OpenAI’s embeddings guide and OpenAI’s embeddings documentation for more information on using these models.
Python
import requests

url = "https://api.fireworks.ai/inference/v1/embeddings"

payload = {
    "input": "The quick brown fox jumped over the lazy dog",
    "model": "fireworks/qwen3-embedding-8b",
}

headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())
To generate variable-length embeddings, you can add the dimensions parameter to the request, for example, dimensions: 128. The API usage for embedding models is identical for BERT-based and LLM-based embeddings. Simply use the /v1/embeddings endpoint with your chosen model.

Model Availability

Fireworks hosts several purpose-built embeddings models, which are optimized specifically for tasks like semantic search and document similarity comparison. We host the SOTA Qwen3 Embeddings family of models:
  • fireworks/qwen3-embedding-8b (*available on serverless)
  • fireworks/qwen3-embedding-4b
  • fireworks/qwen3-embedding-0p6b
You can retrieve embeddings from any LLM in our model library. Here are some examples of LLMs that work with the embeddings API:
  • fireworks/glm-4p5
  • fireworks/gpt-oss-20b
  • fireworks/kimi-k2-instruct-0905
  • fireworks/deepseek-r1-0528
You can also retrieve embeddings from any models you bring yourself through custom model upload.
These BERT-based models are available on serverless only:
  • nomic-ai/nomic-embed-text-v1.5
  • nomic-ai/nomic-embed-text-v1
  • WhereIsAI/UAE-Large-V1
  • thenlper/gte-large
  • thenlper/gte-base
  • BAAI/bge-base-en-v1.5
  • BAAI/bge-small-en-v1.5
  • mixedbread-ai/mxbai-embed-large-v1
  • sentence-transformers/all-MiniLM-L6-v2
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Reranking documents

Reranking models are used to rerank a list of documents based on a query. We support reranking with the Qwen3 Reranker family of models:
  • fireworks/qwen3-reranker-8b (*available on serverless)
  • fireworks/qwen3-reranker-4b
  • fireworks/qwen3-reranker-0p6b

Using the /rerank endpoint

The /rerank endpoint provides a simple interface for reranking documents.
The /rerank endpoint does not yet support all models and parallelism options. For more flexibility, use the /embeddings endpoint with return_logits as shown in the next section.
Python
import requests

url = "https://api.fireworks.ai/inference/v1/rerank"

payload = {
    "model": "fireworks/qwen3-reranker-8b",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital and largest city of France, home to the Eiffel Tower and the Louvre Museum.",
        "France is a country in Western Europe known for its wine, cuisine, and rich history.",
        "The weather in Europe varies significantly between northern and southern regions.",
        "Python is a popular programming language used for web development and data science."
    ],
    "top_n": 3,
    "return_documents": True
}

headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())

Using the /embeddings endpoint

You can also use the /embeddings endpoint with the return_logits parameter to rerank documents. This approach supports more models and parallelism options. The embedding model takes in token IDs for “yes” and “no” and outputs associated logits indicating how likely the document is relevant or not relevant to the query. You can obtain these token IDs using tokenizer.convert_tokens_to_ids() with the transformers library and the Qwen3 tokenizer.
  • Simple
  • Parallel (asyncio)
Python
import requests

url = "https://api.fireworks.ai/inference/v1/embeddings"

query = "What is the capital of France?"
documents = [
    "Paris is the capital and largest city of France, home to the Eiffel Tower and the Louvre Museum.",
    "France is a country in Western Europe known for its wine, cuisine, and rich history.",
    "The weather in Europe varies significantly between northern and southern regions.",
    "Python is a popular programming language used for web development and data science."
]

# Format prompts as query-document pairs using the Qwen3 Reranker format
instruction = "Given a web search query, retrieve relevant passages that answer the query"
prompts = [
    f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
    for doc in documents
]

# Token IDs for "no" and "yes" in Qwen3 reranker models
token_false_id = 2753   # "no"
token_true_id = 9454    # "yes"

payload = {
    "model": "fireworks/qwen3-reranker-8b",
    "input": prompts,
    "return_logits": [token_false_id, token_true_id],
    "normalize": True  # Applies softmax to the selected logits
}

headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers).json()

# Extract relevance scores (probability of "yes" token)
results = []
for i, item in enumerate(response["data"]):
    probs = item["embedding"]  # [no_prob, yes_prob]
    relevance_score = probs[1]  # "yes" probability is the relevance score
    results.append({
        "index": i,
        "relevance_score": relevance_score,
        "document": documents[i]
    })

# Sort by relevance score (highest first)
results.sort(key=lambda x: x["relevance_score"], reverse=True)

for result in results:
    print(f"Score: {result['relevance_score']:.4f} - {result['document'][:80]}...")

With normalize=True, the endpoint applies softmax to the selected logits, returning probabilities that sum to 1. The “yes” probability directly represents the relevance score.

Deploying embeddings and reranking models

While Qwen3 Embedding 8b and Qwen3 Reranker 8b are available on serverless, you also have the option to deploy them via on-demand deployments.