kRouter

Architecture

How kRouter routes a single chat completion through seven internal stages, from your IDE to the upstream provider and back.


This page explains what happens between the moment your IDE fires a request and the moment a streaming response comes back.

The seven stages

IDE
 │  POST /v1/chat/completions

1.  API gateway          authn, rate limit, request log

2.  Format detection     OpenAI / Claude / Gemini shape

3.  RTK compression      tool_result inline compression

4.  Routing decision     combo / direct, account picker

5.  Format translation   to upstream provider's native shape

6.  Upstream call        OAuth-refreshed, retried on fallback

7.  Response stream      back-translated, observed, returned

1. API gateway

Validates the API key (if REQUIRE_API_KEY=true), enforces per-key rate limits, and records request metadata (no payload) for the Usage dashboard. Returns 401 immediately on an auth failure and 429 on a rate-limit failure.
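The gateway checks can be sketched roughly as follows. This is a minimal illustration, not kRouter's actual code: createGateway, validKeys, limitPerWindow, and windowMs are hypothetical names, and the fixed-window counter is an assumption, since this page does not say which rate-limiting algorithm is used.

```javascript
// Hypothetical sketch of stage 1: key validation plus a per-key rate limit.
// Assumption: a fixed-window counter; the real limiter may work differently.
function createGateway({ requireApiKey, validKeys, limitPerWindow, windowMs }) {
  const windows = new Map(); // apiKey -> { start, count }

  return function check(apiKey, now = Date.now()) {
    // Auth failure is reported first, mirroring the 401 described above.
    if (requireApiKey && !validKeys.has(apiKey)) return { status: 401 };

    const id = apiKey || "anonymous";
    const current = windows.get(id);

    // Start a fresh window if none exists or the previous one has expired.
    if (!current || now - current.start >= windowMs) {
      windows.set(id, { start: now, count: 1 });
      return { status: 200 };
    }

    // Window still open: reject once the per-key budget is used up.
    if (current.count >= limitPerWindow) return { status: 429 };
    current.count += 1;
    return { status: 200 };
  };
}
```

A caller would run check(apiKey) before any other stage and return early on a non-200 status.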

2. Format detection

The request body shape is sniffed:

  • messages[] + model → OpenAI Chat Completions
  • messages[] + model + max_tokens required → Claude Messages API
  • contents[] + generationConfig → Gemini

Detection drives the translator pipeline at stage 5.
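A minimal sketch of this kind of shape sniffing is shown below. The function name detectFormat and the exact checks are assumptions for illustration. OpenAI requests can also carry max_tokens, so a real detector likely uses further signals (such as the request path or headers) that this page does not describe.

```javascript
// Hypothetical sketch of stage 2: classify a request body by its shape.
function detectFormat(body) {
  if (body === null || typeof body !== "object") return "unknown";

  // Gemini requests carry contents[] instead of messages[].
  if (Array.isArray(body.contents)) return "gemini";

  if (Array.isArray(body.messages) && typeof body.model === "string") {
    // Assumption: a numeric max_tokens marks the Claude Messages shape,
    // because that field is required there. This is only a heuristic.
    return typeof body.max_tokens === "number" ? "claude" : "openai";
  }
  return "unknown";
}
```

The returned label would then select which translator runs at stage 5.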

3. RTK compression

Tool result payloads (tool_result blocks from Claude, function_call outputs from OpenAI, functionResponse parts from Gemini) get inspected. If RTK detects a known format (git diff, grep, ls, tree, log dump, find), it applies lossless compression before the request leaves kRouter. Typical savings: 20–40% of input tokens.

RTK is a no-op for non-tool content and silently falls through for unrecognized formats.
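The detect-then-fall-through behavior can be illustrated with the sketch below. The detection patterns, the function names, and the compressors map are all hypothetical; RTK's real rules and its compression algorithm are not shown on this page, so the compressor is passed in by the caller rather than invented here.

```javascript
// Hypothetical sketch of stage 3: recognize a tool output format, else pass through.
function detectToolOutputFormat(text) {
  if (typeof text !== "string" || text.length === 0) return null;

  // A git diff contains at least one "diff --git a/... b/..." header line.
  if (/^diff --git a\/\S+ b\/\S+/m.test(text)) return "git-diff";

  // grep -n style output: every non-empty line looks like "path:lineNumber:match".
  const lines = text.split("\n").filter((line) => line.length > 0);
  if (lines.length > 0 && lines.every((line) => /^[^:]+:\d+:/.test(line))) {
    return "grep";
  }
  return null; // unrecognized: the caller leaves the text untouched
}

// compressors maps a format name to a function (string -> string).
function maybeCompress(text, compressors) {
  const format = detectToolOutputFormat(text);
  const compress = format ? compressors[format] : undefined;
  return compress ? compress(text) : text;
}
```

Returning the input unchanged for unknown formats keeps the stage safe to run on every request.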

4. Routing decision

Given model: "<alias>/<id>":

  • If <alias> is a combo, the combo's ordered list of providers is the fallback chain
  • Else, the single provider is used
  • For each provider, the Zenith Score Engine scores and ranks all available accounts using live health data (TTFB latency, success rate) and quota headroom, penalizing accounts with less than 30% of their quota remaining. The highest-scoring account is picked.
  • The router keeps these states in memory in the Sub-5ms RAM Layer (HealthCache). If an account returns a 429, it is locked in RAM immediately and the next-best account is selected in < 1ms, without blocking on SQLite.
  • Legacy strategies (round-robin, p2c, random) are still supported but Zenith is the default.
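The account ranking described above can be sketched as a scoring function plus a filter for locked accounts. The weights, the 0.5 penalty factor, and the field names (ttfbMs, successRate, quotaRemaining, lockedUntil) are assumptions for illustration only. The actual Zenith formula is not documented on this page.

```javascript
// Hypothetical sketch of stage 4: score accounts and pick the best unlocked one.
function scoreAccount(account) {
  const latencyScore = 1 / (1 + account.ttfbMs / 1000); // lower TTFB gives a value nearer 1
  const healthScore = account.successRate;               // fraction in 0..1
  const quotaScore = account.quotaRemaining;             // fraction of quota left, 0..1

  let score = 0.4 * healthScore + 0.3 * latencyScore + 0.3 * quotaScore;
  // Penalize accounts with less than 30% quota headroom (assumed factor).
  if (account.quotaRemaining < 0.3) score *= 0.5;
  return score;
}

function pickAccount(accounts, now = Date.now()) {
  // Accounts locked after a 429 are skipped until lockedUntil has passed.
  const usable = accounts.filter((a) => !(a.lockedUntil > now));
  if (usable.length === 0) return null;
  return usable.reduce((best, a) => (scoreAccount(a) > scoreAccount(best) ? a : best));
}
```

Because everything here reads plain in-memory objects, a pick needs no database round trip, which matches the in-RAM design described above.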

5. Format translation

The request is translated from source shape to upstream native shape:

  • OpenAI → Claude: messages array reshaped, tools mapped to tools[], cache_control markers preserved
  • OpenAI → Gemini: messages flattened into contents, system → systemInstruction, tools → functionDeclarations
  • Claude → Antigravity: passthrough, with thinking config translated into Gemini generationConfig.thinkingConfig
  • Claude → Kiro: AWS EventStream protocol with cloaked tool names

The translator is bidirectional: the same function handles the response coming back.
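As an illustration of the OpenAI → Gemini mapping listed above, here is a minimal sketch. The function name openAiToGemini is hypothetical, and the sketch assumes plain string message content. Tool-call messages, images, and the response direction are left out.

```javascript
// Hypothetical sketch of stage 5: OpenAI Chat Completions request -> Gemini request body.
function openAiToGemini(request) {
  const systemTexts = [];
  const contents = [];

  for (const message of request.messages) {
    if (message.role === "system") {
      // System messages are lifted out into systemInstruction.
      systemTexts.push(message.content);
    } else {
      contents.push({
        // Gemini uses "model" where OpenAI uses "assistant".
        role: message.role === "assistant" ? "model" : "user",
        parts: [{ text: message.content }],
      });
    }
  }

  const out = { contents };
  if (systemTexts.length > 0) {
    out.systemInstruction = { parts: [{ text: systemTexts.join("\n") }] };
  }
  if (Array.isArray(request.tools) && request.tools.length > 0) {
    out.tools = [{
      functionDeclarations: request.tools.map((tool) => ({
        name: tool.function.name,
        description: tool.function.description,
        parameters: tool.function.parameters,
      })),
    }];
  }
  return out;
}
```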

6. Upstream call

The HTTP request goes out via proxyAwareFetch which honors HTTP_PROXY, HTTPS_PROXY, and per-account proxy settings.

If the response is a refresh-worthy 401, the OAuth refresh token is used to mint a new access token, and the request is retried with the fresh token. Refreshes are deduplicated: concurrent requests that hit the same expired token share a single refresh, so the token is never refreshed twice.
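One common way to get this deduplication is a "single-flight" map of pending refreshes, sketched below. The names createTokenRefresher and refreshFn are hypothetical, and whether kRouter uses this exact pattern is an assumption.

```javascript
// Hypothetical sketch: share one in-flight refresh per account.
function createTokenRefresher(refreshFn) {
  const inFlight = new Map(); // accountId -> pending Promise

  return function refresh(accountId) {
    const pending = inFlight.get(accountId);
    if (pending) return pending; // join the refresh that is already running

    // The executor runs synchronously, so refreshFn starts right away and
    // a synchronous throw becomes a rejection instead of escaping.
    const promise = new Promise((resolve) => resolve(refreshFn(accountId)))
      .finally(() => inFlight.delete(accountId)); // allow a later refresh once settled

    inFlight.set(accountId, promise);
    return promise;
  };
}
```

Every caller that arrives while a refresh is pending receives the same promise, so the upstream token endpoint is called once per account.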

On 429, 5xx, or an account-level 403, the response is classified by accountFallback.js:

  • TPM rate limit → 90-second cooldown, fall through to next account
  • Daily quota → 30-min cooldown, modelLock set
  • "Verify your account" 403 → 24-hour account lock, surfaced in dashboard
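The rules above could be expressed as a small classifier like the one below. This is not the content of accountFallback.js: the function name, the returned fields, and the substring matching on the error text are assumptions for illustration. Only the status codes and durations come from the list above.

```javascript
// Hypothetical sketch of the fallback rules listed above.
const SECOND = 1000;
const COOLDOWN_MS = {
  tpm: 90 * SECOND,
  dailyQuota: 30 * 60 * SECOND,
  verifyAccount: 24 * 60 * 60 * SECOND,
};

function classifyUpstreamError(status, bodyText) {
  const text = String(bodyText || "").toLowerCase();

  if (status === 403 && text.includes("verify your account")) {
    return { kind: "verifyAccount", cooldownMs: COOLDOWN_MS.verifyAccount, modelLock: false, showInDashboard: true };
  }
  // Assumption: daily-quota errors can be told apart by their message text.
  if (status === 429 && text.includes("daily")) {
    return { kind: "dailyQuota", cooldownMs: COOLDOWN_MS.dailyQuota, modelLock: true, showInDashboard: false };
  }
  if (status === 429) {
    return { kind: "tpm", cooldownMs: COOLDOWN_MS.tpm, modelLock: false, showInDashboard: false };
  }
  return null; // no rule for this response is listed on this page
}
```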

7. Response stream

SSE chunks come back from upstream. Each chunk is translated back to the IDE's expected shape, observed for token counts and finish reason, and piped to the client.

A final data: [DONE] is sent. Token totals are persisted to requestDetails for the Usage dashboard.
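The observation step can be sketched for OpenAI-shaped chunks as follows. createStreamObserver is a hypothetical name, and the sketch assumes each SSE line arrives whole. Translation back to other client shapes and persistence to requestDetails are omitted.

```javascript
// Hypothetical sketch of stage 7: observe SSE lines for finish reason and usage.
function createStreamObserver() {
  const state = { done: false, finishReason: null, usage: null };

  function observe(sseLine) {
    // Comments and non-data fields (for example ": keep-alive") are ignored.
    if (!sseLine.startsWith("data:")) return null;

    const payload = sseLine.slice(5).trim();
    if (payload === "[DONE]") {
      state.done = true; // terminal sentinel, carries no JSON
      return null;
    }

    const chunk = JSON.parse(payload);
    const choice = chunk.choices && chunk.choices[0];
    if (choice && choice.finish_reason) state.finishReason = choice.finish_reason;
    if (chunk.usage) state.usage = chunk.usage; // token totals, when the upstream sends them
    return chunk;
  }

  return { observe, state };
}
```

After the stream ends, state.usage holds whatever totals were seen and could be written to storage in one step.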

Concurrency model

kRouter is heavily concurrent inside a single process:

  • One semaphore per provider (Kiro 4, Claude 5, Antigravity 2, default 3) caps in-flight requests to that provider, so kRouter does not overload the upstream with its own traffic
  • Per-provider semaphore timeouts (Kiro 20s, Claude 15s, Antigravity 5s) decide when "busy" turns into "skip this account"
  • backoffLevel increments are wrapped in SQLite transactions so concurrent failures don't lose increments

Result: with 6 accounts on a busy IDE, a single Autopilot flood resolves in ~5s instead of cascading into 25s+ "service unavailable" loops.
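A per-provider semaphore with an acquire timeout can be sketched as below. The class and helper names are hypothetical and the implementation is an assumption. Only the limits come from the list above.

```javascript
// Hypothetical sketch: counting semaphore whose acquire() gives up after a timeout.
class Semaphore {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiters = []; // queued { resolve, timer } entries, oldest first
  }

  tryAcquire() {
    if (this.active < this.limit) {
      this.active += 1;
      return true;
    }
    return false;
  }

  // Resolves true once a slot is held, or false if timeoutMs elapses first.
  acquire(timeoutMs) {
    if (this.tryAcquire()) return Promise.resolve(true);
    return new Promise((resolve) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        resolve(false); // "busy" has turned into "skip this account"
      }, timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      clearTimeout(next.timer);
      next.resolve(true); // hand the slot straight to the oldest waiter
    } else if (this.active > 0) {
      this.active -= 1;
    }
  }
}

const PROVIDER_LIMITS = { kiro: 4, claude: 5, antigravity: 2 };
const limitFor = (provider) => PROVIDER_LIMITS[provider] || 3; // default 3
```

A caller would await acquire(timeoutMs) before the upstream call, treat a false result as a signal to move to the next account, and call release() in a finally block.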

File layout

/open-sse/handlers/     # request entry, dispatch
/open-sse/services/     # account picker, fallback, quota tracking
/open-sse/executors/    # per-provider HTTP calls
/open-sse/translator/   # format conversion
/open-sse/config/       # provider catalogue, error rules
/src/sse/services/      # auth, token refresh
/src/mitm/              # MITM intercept layer
/src/app/api/           # dashboard API routes
/src/app/(dashboard)/   # dashboard UI (Next.js App Router)
