One prompt. Multiple models. One API call. POST /api/v1/compare runs your prompt against up to eight models in parallel and ranks the results by quality, speed, cost, or a balanced blend of the three. No competitor offers this.

Why compare?

Before committing to a model for your production app, you want to know: which one actually answers best? How do their latencies differ? What’s the cheapest model that still meets your quality bar? The compare endpoint answers all of these in one call.

Request

POST https://ninjachat.ai/api/v1/compare
Authorization: Bearer nj_sk_YOUR_API_KEY
Content-Type: application/json
{
  "messages": [
    {"role": "user", "content": "Explain quantum entanglement in one paragraph"}
  ],
  "models": ["gpt-5", "claude-sonnet-4.6", "gemini-3.1-pro", "deepseek-v3"],
  "rank_by": "balanced"
}

Response

{
  "winner": {
    "model": "claude-sonnet-4.6",
    "name": "Claude Sonnet 4.6",
    "reason": "Best balance of quality (0.97), speed (1180ms), and cost ($0.015)"
  },
  "results": [
    {
      "rank": 1,
      "model": "claude-sonnet-4.6",
      "content": "Quantum entanglement is a phenomenon where two particles become correlated...",
      "quality": { "confidence": 0.97, "flags": [], "suggested_retry": false },
      "latency_ms": 1180,
      "cost_cents": 1.5,
      "tokens": { "prompt": 18, "completion": 87, "total": 105 },
      "success": true
    },
    {
      "rank": 2,
      "model": "gpt-5",
      "content": "When two particles become quantum entangled...",
      "quality": { "confidence": 0.95, "flags": [], "suggested_retry": false },
      "latency_ms": 920,
      "cost_cents": 0.6,
      "tokens": { "prompt": 18, "completion": 74, "total": 92 },
      "success": true
    },
    {
      "rank": 3,
      "model": "gemini-3.1-pro",
      "content": "Quantum entanglement describes a special connection between particles...",
      "quality": { "confidence": 0.93, "flags": [], "suggested_retry": false },
      "latency_ms": 1050,
      "cost_cents": 0.6,
      "tokens": { "prompt": 18, "completion": 82, "total": 100 },
      "success": true
    },
    {
      "rank": 4,
      "model": "deepseek-v3",
      "content": "Quantum entanglement is a quantum mechanical phenomenon...",
      "quality": { "confidence": 0.89, "flags": [], "suggested_retry": false },
      "latency_ms": 780,
      "cost_cents": 0.3,
      "tokens": { "prompt": 18, "completion": 69, "total": 87 },
      "success": true
    }
  ],
  "failed": [],
  "summary": {
    "fastest": { "model": "deepseek-v3", "latency_ms": 780 },
    "highest_quality": { "model": "claude-sonnet-4.6", "confidence": 0.97 },
    "cheapest": { "model": "deepseek-v3", "cost_cents": 0.3 },
    "best_value": { "model": "gpt-5" }
  },
  "ranked_by": "balanced",
  "models_compared": 4,
  "succeeded": 4,
  "total_cost_cents": 3.0,
  "total_cost": "$0.030",
  "balance": "$4.790",
  "compared_at": "2026-03-10T12:00:00.000Z"
}
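
The summary block gives you the single-dimension picks without re-ranking anything yourself. A minimal sketch, assuming data holds the parsed response above:

# `data` is the parsed compare response shown above.
s = data["summary"]
print(f"Fastest:         {s['fastest']['model']} ({s['fastest']['latency_ms']}ms)")
print(f"Highest quality: {s['highest_quality']['model']} (confidence {s['highest_quality']['confidence']})")
print(f"Cheapest:        {s['cheapest']['model']} ({s['cheapest']['cost_cents']} cents)")
print(f"Best value:      {s['best_value']['model']}")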

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| messages | array | Yes | – | Same format as /chat. 1–20 messages. |
| models | array | No | Top 5 across tiers | Which models to run. 2–8 models. Cannot include auto* or ensemble*. |
| rank_by | string | No | "balanced" | How to rank results: quality, speed, cost, or balanced. |
| max_tokens | integer | No | 1024 | Max response length per model (1–8,192). |
| temperature | number | No | 0.7 | Sampling temperature. |
| include_full_responses | boolean | No | true | Include full response text in results. Set to false to get 200-character previews. |
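
If you only need the rankings rather than full answers, say for a cheap periodic sweep, set include_full_responses to false so each result carries only a 200-character preview. A minimal sketch using only the parameters documented above (the prompt is arbitrary):

import requests, os

r = requests.post(
    "https://ninjachat.ai/api/v1/compare",
    headers={"Authorization": f"Bearer {os.environ['NINJACHAT_API_KEY']}"},
    json={
        "messages": [{"role": "user", "content": "Summarize the plot of Hamlet"}],
        "rank_by": "cost",                # cheapest model wins
        "max_tokens": 256,                # cap spend per model
        "include_full_responses": False,  # results carry 200-char previews only
    },                                    # models omitted: defaults to top 5 across tiers
)
print(r.json()["winner"]["model"])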

rank_by modes

| Mode | Weights |
| --- | --- |
| balanced | 50% quality + 30% speed + 20% cost |
| quality | 100% quality score |
| speed | 100% speed (lowest latency wins) |
| cost | 100% cost (cheapest wins) |
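
These weights are applied server-side, and the exact score normalization is not documented here. If you want to re-rank the returned results locally with your own weights, a hypothetical sketch (the min-max normalization is an assumption, not NinjaChat's formula):

def rerank(results, w_quality=0.5, w_speed=0.3, w_cost=0.2):
    """Re-rank compare results locally with custom weights."""
    lats  = [r["latency_ms"] for r in results]
    costs = [r["cost_cents"] for r in results]

    def norm(value, lo, hi):
        # Map to 0..1 where 1 is best (lowest); guard against identical values.
        return 1.0 if hi == lo else (hi - value) / (hi - lo)

    def score(r):
        return (w_quality * r["quality"]["confidence"]
                + w_speed * norm(r["latency_ms"], min(lats), max(lats))
                + w_cost  * norm(r["cost_cents"], min(costs), max(costs)))

    return sorted(results, key=score, reverse=True)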

Default models (when models not specified)

gpt-5, claude-sonnet-4.6, gemini-3.1-pro, deepseek-v3, gemini-3-flash

Billing

You are charged for each successful model call, and not for models that fail. A 4-model compare therefore costs the sum of each model's per-request rate, and the response itemizes this as a per-model cost_cents field plus a total_cost.
Pre-flight balance check: if your balance is less than the estimated total cost, the request fails before any models run.
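
Because the response itemizes cost per model, you can reconcile what a compare actually cost. A minimal sketch against the response fields shown above (data is the parsed JSON):

# `data` is the parsed compare response; failed models never appear in `results`.
charged = sum(r["cost_cents"] for r in data["results"])
assert abs(charged - data["total_cost_cents"]) < 0.001  # only successes are billed
print(f"Charged {data['total_cost']} across {data['succeeded']}/{data['models_compared']} "
      f"models; balance now {data['balance']}")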

Code examples

import requests, os

r = requests.post(
    "https://ninjachat.ai/api/v1/compare",
    headers={"Authorization": f"Bearer {os.environ['NINJACHAT_API_KEY']}"},
    json={
        "messages": [{"role": "user", "content": "Write a haiku about machine learning"}],
        "models": ["gpt-5", "claude-sonnet-4.6", "gemini-3.1-pro"],
        "rank_by": "quality",
    },
)
r.raise_for_status()  # surface HTTP errors (bad key, insufficient balance)
data = r.json()

winner = data["winner"]
print(f"Winner: {winner['model']} ({winner['reason']})")
for result in data["results"]:
    print(f"  #{result['rank']} {result['model']}: "
          f"quality={result['quality']['confidence']:.2f}, "
          f"{result['latency_ms']}ms, ${result['cost_cents']/100:.4f}")
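
Models that error out are listed in failed rather than results and, per Billing above, are not charged, so the winner is always drawn from the successes. If succeeded is lower than models_compared, inspect failed before trusting the ranking.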

Common use cases

Choose a model for production: Compare 4–5 models on a representative sample of your actual prompts before committing to one.
Verify quality across models: Run the same benchmark prompt monthly to see whether model updates have changed behavior.
Find the best value: summary.best_value surfaces the model with the best quality-to-cost trade-off.
Regression testing: Run your golden test prompts against a new model before switching, as in the sketch below.
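
For the regression-testing case, a minimal sketch; the golden prompts, the 0.05 confidence-gap threshold, and the model choices are illustrative, not part of the API:

import requests, os

GOLDEN_PROMPTS = ["Explain quantum entanglement in one paragraph"]  # your own test set
BASELINE  = "claude-sonnet-4.6"  # model currently in production
CANDIDATE = "gemini-3.1-pro"     # model you are considering switching to

for prompt in GOLDEN_PROMPTS:
    r = requests.post(
        "https://ninjachat.ai/api/v1/compare",
        headers={"Authorization": f"Bearer {os.environ['NINJACHAT_API_KEY']}"},
        json={"messages": [{"role": "user", "content": prompt}],
              "models": [BASELINE, CANDIDATE],
              "rank_by": "quality"},
    )
    by_model = {res["model"]: res for res in r.json()["results"]}
    gap = (by_model[BASELINE]["quality"]["confidence"]
           - by_model[CANDIDATE]["quality"]["confidence"])
    status = "OK" if gap <= 0.05 else "REGRESSION"  # threshold is arbitrary
    print(f"{status}  {prompt[:40]!r}  gap={gap:+.2f}")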