Fireworks hosts embedding and reranking models, which are useful for tasks like RAG and semantic search.

Model Availability

Fireworks hosts several purpose-built embedding models, optimized specifically for tasks like semantic search and document similarity comparison. We host the state-of-the-art Qwen3 Embedding family of models:
  • fireworks/qwen3-embedding-8b (available on serverless)
  • fireworks/qwen3-embedding-4b
  • fireworks/qwen3-embedding-0p6b
along with these BERT-based models (serverless only):
  • nomic-ai/nomic-embed-text-v1.5
  • nomic-ai/nomic-embed-text-v1
  • WhereIsAI/UAE-Large-V1
  • thenlper/gte-large
  • thenlper/gte-base
  • BAAI/bge-base-en-v1.5
  • BAAI/bge-small-en-v1.5
  • mixedbread-ai/mxbai-embed-large-v1
  • sentence-transformers/all-MiniLM-L6-v2
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
In addition, you can retrieve embeddings from any LLM in our model library, as well as from any models you bring yourself through custom model upload. Here are some examples of LLMs that work with the embeddings API:
  • fireworks/glm-4p5
  • fireworks/gpt-oss-20b
  • fireworks/kimi-k2-instruct-0905
  • fireworks/deepseek-r1-0528

Generating embeddings

Embedding models take text as input and output a vector of floating-point numbers to use for tasks like similarity comparisons and search. Our embedding service is OpenAI compatible; refer to OpenAI's embeddings guide for more information on using these models.
Python
import requests

url = "https://api.fireworks.ai/inference/v1/embeddings"

payload = {
    "input": "The quick brown fox jumped over the lazy dog",
    "model": "fireworks/qwen3-embedding-8b",
}

headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())
To generate variable-length embeddings, add the dimensions parameter to the request, for example "dimensions": 128. API usage is identical for BERT-based and LLM-based embeddings: simply use the /v1/embeddings endpoint with your chosen model.
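Once you have embeddings, tasks like semantic search typically compare vectors with cosine similarity. Below is a minimal sketch in pure Python; the toy vectors stand in for the values a real call would return in response.json()["data"][0]["embedding"] (the OpenAI-compatible response shape):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for API-returned embeddings
doc_a = [0.1, 0.2, 0.3]
doc_b = [0.1, 0.2, 0.3]   # identical to doc_a
doc_c = [-0.3, 0.1, -0.2]  # dissimilar

print(cosine_similarity(doc_a, doc_b))  # identical vectors -> 1.0
print(cosine_similarity(doc_a, doc_c))  # lower score for dissimilar vectors
```

Scores close to 1.0 indicate high similarity; in a search application you would rank candidate documents by their similarity to the query embedding.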

Reranking documents

Reranking models score and reorder a list of documents by relevance to a query. Reranking is supported only with the Qwen3 Reranker family of models:
  • fireworks/qwen3-reranker-8b (available on serverless)
  • fireworks/qwen3-reranker-4b
  • fireworks/qwen3-reranker-0p6b
The reranking model takes a query and a list of documents as input and outputs the list of documents scored by relevance to the query.
Python
import requests

url = "https://api.fireworks.ai/inference/v1/rerank"

payload = {
    "model": "fireworks/qwen3-reranker-8b",
    "query": "What was the primary objective of the Apollo 10 mission?",
    "documents": [
        "The Apollo 10 mission was launched in May 1969 and served as a 'dress rehearsal' for the Apollo 11 lunar landing.",
        "The crew of Apollo 10 consisted of astronauts Thomas Stafford, John Young, and Eugene Cernan.",
        "The command module for Apollo 10 was nicknamed 'Charlie Brown' and the lunar module was called 'Snoopy', after characters from the Peanuts comics.",
        "The Apollo program was a series of NASA missions that successfully landed the first humans on the Moon and returned them safely to Earth."
    ],
    "top_n": 3,
    "return_documents": True
}

headers = {
    "Authorization": "Bearer <FIREWORKS_API_KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())
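The response contains the scored documents. As an illustration only, the sketch below assumes a common rerank response shape (a results array of index, relevance_score, and, when return_documents is true, document entries); the literal payload here is hypothetical sample data, so check the API reference for the exact schema:

```python
# Hypothetical sample of an assumed rerank response shape; a real call
# would produce something like this from response.json().
result = {
    "results": [
        {"index": 0, "relevance_score": 0.97,
         "document": {"text": "The Apollo 10 mission ... served as a 'dress rehearsal' ..."}},
        {"index": 3, "relevance_score": 0.41,
         "document": {"text": "The Apollo program was a series of NASA missions ..."}},
    ]
}

# Results come back ordered by relevance; print score and text for each.
for entry in result["results"]:
    print(f"{entry['relevance_score']:.2f}  {entry['document']['text']}")
```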

Deploying embeddings and reranking models

While Qwen3 Embedding 8B and Qwen3 Reranker 8B are available on serverless, you also have the option to deploy them via on-demand deployments.
We recommend passing --load-targets default=0.4 to ensure responsive autoscaling for these deployments.
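As a sketch of how that flag might be passed when creating an on-demand deployment with the firectl CLI (the exact subcommand and model identifier here are assumptions; verify them against the current firectl documentation):

```shell
# Create an on-demand deployment of the 8B embedding model, passing the
# recommended load target for responsive autoscaling.
firectl create deployment fireworks/qwen3-embedding-8b \
  --load-targets default=0.4
```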