Architecture
How kRouter routes a single chat completion through 7 internal stages, from your IDE to the upstream provider and back.
This page explains what happens between the moment your IDE fires a request and the moment a streaming response comes back.
The seven stages
IDE
│ POST /v1/chat/completions
▼
1. API gateway          authn, rate limit, request log
▼
2. Format detection     OpenAI / Claude / Gemini shape
▼
3. RTK compression      tool_result inline compression
▼
4. Routing decision     combo / direct, account picker
▼
5. Format translation   to upstream provider's native shape
▼
6. Upstream call        OAuth-refreshed, retried on fallback
▼
7. Response stream      back-translated, observed, returned
1. API gateway
Validates the API key (if REQUIRE_API_KEY=true), enforces per-key rate limits, and records request metadata (no payload) for the Usage dashboard. Returns 401 on an auth failure or 429 on a rate-limit failure immediately, before any later stage runs.
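The gate logic above can be sketched roughly as follows. This is a minimal illustration, not kRouter's code: `checkRequest`, `KeyRecord`, the fixed one-minute window, and the choice to skip rate limiting when no key is required are all assumptions made for the example.

```typescript
// Hypothetical sketch: API key check plus a fixed-window rate limit.
// None of these names come from the kRouter codebase.
interface KeyRecord {
  limitPerWindow: number; // max requests allowed per window
  windowStart: number;    // epoch ms when the current window began
  count: number;          // requests seen in the current window
}

type GateResult = { ok: true } | { ok: false; status: 401 | 429 };

const WINDOW_MS = 60000; // assumed one-minute window

function checkRequest(
  keyStore: Map<string, KeyRecord>,
  apiKey: string | undefined,
  requireApiKey: boolean,
  now: number,
): GateResult {
  // Assumption: with REQUIRE_API_KEY=false the gate lets everything through.
  if (!requireApiKey) return { ok: true };

  const rec = apiKey === undefined ? undefined : keyStore.get(apiKey);
  if (rec === undefined) return { ok: false, status: 401 };

  // Start a fresh window once the previous one has elapsed.
  if (now - rec.windowStart >= WINDOW_MS) {
    rec.windowStart = now;
    rec.count = 0;
  }
  if (rec.count >= rec.limitPerWindow) return { ok: false, status: 429 };

  rec.count += 1;
  return { ok: true };
}
```

Both failure paths return before any work is done on the payload, which matches the "returns immediately" behaviour described above.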
2. Format detection
The request body shape is sniffed:
- messages[] + model → OpenAI Chat Completions
- messages[] + model + max_tokens required → Claude Messages API
- contents[] + generationConfig → Gemini
Detection drives the translator pipeline at stage 5.
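A shape sniffer along these lines might look like the sketch below. `detectFormat` and `SourceFormat` are hypothetical names. The sketch treats `generationConfig` as optional for Gemini and uses the presence of `max_tokens` to separate Claude from OpenAI; since OpenAI requests may also carry `max_tokens`, a real detector probably needs more signals (path, headers), which is an assumption not confirmed by this page.

```typescript
// Hypothetical shape sniffer; names and heuristics are illustrative only.
type SourceFormat = "openai" | "claude" | "gemini" | "unknown";

function detectFormat(body: unknown): SourceFormat {
  if (typeof body !== "object" || body === null) return "unknown";
  const b = body as Record<string, unknown>;

  // Gemini requests carry contents[]; generationConfig is treated as optional here.
  if (Array.isArray(b["contents"])) return "gemini";

  if (Array.isArray(b["messages"]) && typeof b["model"] === "string") {
    // Claude's Messages API requires max_tokens; this sketch uses its
    // presence as the (imperfect) distinguishing signal.
    return typeof b["max_tokens"] === "number" ? "claude" : "openai";
  }
  return "unknown";
}
```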
3. RTK compression
Tool result payloads (tool_use outputs from Claude, function_call outputs from OpenAI, functionResponse parts from Gemini) get inspected. If RTK detects a known format (git diff, grep, ls, tree, log dump, find), it applies lossless compression before the request leaves us. Typical savings: 20–40% input tokens.
RTK is a no-op for non-tool content and silently falls through for unrecognized formats.
4. Routing decision
Given model: "<alias>/<id>":
- If <alias> is a combo, the combo's ordered list of providers is the fallback chain
- Else, the single provider is used
- For each provider, the Zenith Score Engine ranks all available accounts by score. The score combines live health data (TTFB latency, success rate) with quota headroom, penalizing accounts whose headroom is under 30%, and the highest-scoring account is picked.
- The router uses the Sub-5ms RAM Layer (HealthCache) to manage these states in memory. If a 429 hits, the account is instantly locked in RAM and the next-best account is grabbed in < 1ms without blocking on SQLite.
- Legacy strategies (round-robin, p2c, random) are still supported, but Zenith is the default.
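To make the ranking and in-memory locking concrete, here is a minimal sketch. The actual Zenith formula is not documented on this page, so the weights, the halving penalty below 30% headroom, and the names `zenithLikeScore` and `HealthCacheSketch` are all assumptions chosen for illustration.

```typescript
// Hypothetical scoring and in-memory lock; weights are invented for the example.
interface AccountHealth {
  id: string;
  ttfbMs: number;        // recent time-to-first-byte, in milliseconds
  successRate: number;   // 0..1
  quotaHeadroom: number; // 0..1, fraction of quota still available
  lockedUntil: number;   // epoch ms; 0 when not locked
}

function zenithLikeScore(a: AccountHealth): number {
  const latency = 1 / (1 + a.ttfbMs / 1000); // faster accounts score closer to 1
  const base = 0.5 * a.successRate + 0.3 * latency + 0.2 * a.quotaHeadroom;
  return a.quotaHeadroom < 0.3 ? base * 0.5 : base; // assumed low-quota penalty
}

class HealthCacheSketch {
  private accounts = new Map<string, AccountHealth>();

  upsert(a: AccountHealth): void {
    this.accounts.set(a.id, a);
  }

  // Lock an account purely in memory, e.g. after an upstream 429.
  lock(id: string, untilMs: number): void {
    const a = this.accounts.get(id);
    if (a) a.lockedUntil = untilMs;
  }

  // Return the best-scoring account that is not currently locked.
  pick(now: number): AccountHealth | undefined {
    let best: AccountHealth | undefined;
    for (const a of Array.from(this.accounts.values())) {
      if (a.lockedUntil > now) continue;
      if (best === undefined || zenithLikeScore(a) > zenithLikeScore(best)) best = a;
    }
    return best;
  }
}
```

Because both `lock` and `pick` touch only a `Map`, picking the next account after a 429 needs no database round trip, which is the property the RAM layer is described as providing.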
5. Format translation
The request is translated from source shape to upstream native shape:
- OpenAI → Claude: messages array reshaped, tools mapped to tools[], cache_control markers preserved
- OpenAI → Gemini: messages flattened into contents, system → systemInstruction, tools → functionDeclarations
- Claude → Antigravity: passthrough, with thinking config translated into Gemini generationConfig.thinkingConfig
- Claude → Kiro: AWS EventStream protocol with cloaked tool names
The translator is bidirectional: the same function handles the response coming back.
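As an illustration of the OpenAI → Gemini mapping listed above, the sketch below handles the request direction only, and only plain-text messages and function tools. The type names and `openAiToGemini` are hypothetical; tool calls, images, and the other provider pairs are left out.

```typescript
// Simplified request-side sketch; not the project's translator.
interface OpenAIMessage { role: "system" | "user" | "assistant"; content: string; }
interface FunctionSpec { name: string; description?: string; parameters?: object; }
interface OpenAITool { type: "function"; function: FunctionSpec; }
interface OpenAIRequest { model: string; messages: OpenAIMessage[]; tools?: OpenAITool[]; }

interface GeminiPart { text: string; }
interface GeminiContent { role: "user" | "model"; parts: GeminiPart[]; }
interface GeminiRequest {
  contents: GeminiContent[];
  systemInstruction?: { parts: GeminiPart[] };
  tools?: { functionDeclarations: FunctionSpec[] }[];
}

function openAiToGemini(req: OpenAIRequest): GeminiRequest {
  const out: GeminiRequest = { contents: [] };
  const systemParts: GeminiPart[] = [];

  for (const m of req.messages) {
    if (m.role === "system") {
      // system messages are lifted out of the conversation
      systemParts.push({ text: m.content });
    } else {
      // Gemini names the assistant role "model"
      out.contents.push({
        role: m.role === "assistant" ? "model" : "user",
        parts: [{ text: m.content }],
      });
    }
  }
  if (systemParts.length > 0) out.systemInstruction = { parts: systemParts };

  if (req.tools && req.tools.length > 0) {
    // tools[].function becomes one functionDeclarations entry each
    out.tools = [{ functionDeclarations: req.tools.map((t) => ({ ...t.function })) }];
  }
  return out;
}
```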
6. Upstream call
The HTTP request goes out via proxyAwareFetch which honors HTTP_PROXY, HTTPS_PROXY, and per-account proxy settings.
If the response is a refresh-worthy 401, the OAuth refresh token is used to mint a new access token, and the request is retried with the fresh token. The refresh is atomic: concurrent refreshes are deduplicated, so the same expired token is never refreshed twice.
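One common way to get this deduplication is a "single-flight" wrapper, where concurrent callers share one in-flight promise per account. The sketch below shows that pattern under assumed names (`RefreshDeduper`, `doRefresh`); whether kRouter implements it this way is not stated here.

```typescript
// Hypothetical single-flight wrapper around a token refresh function.
class RefreshDeduper {
  private inFlight = new Map<string, Promise<string>>();
  private doRefresh: (accountId: string) => Promise<string>;

  constructor(doRefresh: (accountId: string) => Promise<string>) {
    this.doRefresh = doRefresh;
  }

  refresh(accountId: string): Promise<string> {
    const existing = this.inFlight.get(accountId);
    if (existing) return existing; // join the refresh that is already running

    const p = this.doRefresh(accountId);
    this.inFlight.set(accountId, p);

    // Forget the promise once it settles so a later expiry can refresh again.
    const cleanup = () => { this.inFlight.delete(accountId); };
    p.then(cleanup, cleanup);
    return p;
  }
}
```

Two requests that hit a 401 at the same moment both call `refresh("acct")`, but `doRefresh` runs once and both receive the same new token.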
On 429 or 5xx, the response is classified by accountFallback.js:
- TPM rate limit → 90-second cooldown, fall through to next account
- Daily quota → 30-min cooldown, modelLock set
- "Verify your account" 403 → 24-hour permanent lock, surfaced in dashboard
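The classification rules above can be expressed as a small pure function. This is a sketch, not the contents of accountFallback.js: the body-text patterns used to tell a TPM limit from a daily quota are guesses, the `FallbackAction` shape is hypothetical, and the handling of other errors (including 5xx) is left as a zero-cooldown placeholder because this page does not specify it.

```typescript
// Hypothetical classifier; durations come from the list above, patterns are assumed.
type FailureKind = "tpm_rate_limit" | "daily_quota" | "verify_account" | "other";

interface FallbackAction {
  kind: FailureKind;
  cooldownMs: number; // how long to skip this account
  lockModel: boolean; // mirrors the modelLock flag set for daily quota
  permanent: boolean; // mirrors the permanent lock shown in the dashboard
}

const SECOND = 1000;
const MINUTE = 60 * SECOND;
const HOUR = 60 * MINUTE;

function classifyFailure(status: number, bodyText: string): FallbackAction {
  if (status === 403 && /verify your account/i.test(bodyText)) {
    return { kind: "verify_account", cooldownMs: 24 * HOUR, lockModel: false, permanent: true };
  }
  if (status === 429) {
    // Assumed heuristic: daily-quota errors mention "daily" or "quota".
    if (/daily|quota/i.test(bodyText)) {
      return { kind: "daily_quota", cooldownMs: 30 * MINUTE, lockModel: true, permanent: false };
    }
    return { kind: "tpm_rate_limit", cooldownMs: 90 * SECOND, lockModel: false, permanent: false };
  }
  // Placeholder for everything else; real rules are not described here.
  return { kind: "other", cooldownMs: 0, lockModel: false, permanent: false };
}
```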
7. Response stream
SSE chunks come back from upstream. Each chunk is translated back to the IDE's expected shape, observed for token counts and finish reason, and piped to the client.
A final data: [DONE] is sent. Token totals are persisted to requestDetails for the Usage dashboard.
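A rough sketch of the back-translation loop, assuming an OpenAI-style client and an already-parsed upstream chunk. `UpstreamChunk`, `StreamSummary`, and `toClientSse` are hypothetical names, and the sketch works on an array rather than a live stream to stay short.

```typescript
// Hypothetical upstream chunk shape, already parsed from the provider's stream.
interface UpstreamChunk { text: string; outputTokens: number; finishReason?: string; }

interface StreamSummary {
  frames: string[];            // SSE frames to write to the client, in order
  outputTokens: number;        // running total observed across chunks
  finishReason: string | null; // last finish reason seen, if any
}

function toClientSse(chunks: UpstreamChunk[]): StreamSummary {
  const frames: string[] = [];
  let outputTokens = 0;
  let finishReason: string | null = null;

  for (const c of chunks) {
    // Observe token counts and finish reason while translating.
    outputTokens += c.outputTokens;
    if (c.finishReason !== undefined) finishReason = c.finishReason;

    // OpenAI-style streaming chunk, framed as one SSE event.
    const payload = {
      object: "chat.completion.chunk",
      choices: [{ index: 0, delta: { content: c.text }, finish_reason: c.finishReason ?? null }],
    };
    frames.push("data: " + JSON.stringify(payload) + "\n\n");
  }

  frames.push("data: [DONE]\n\n"); // terminator expected by OpenAI-style clients
  return { frames, outputTokens, finishReason };
}
```

The returned totals are what a caller would persist once the stream ends.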
Concurrency model
kRouter is heavily concurrent inside a single process:
- One semaphore per provider (Kiro 4, Claude 5, Antigravity 2, default 3) limits in-flight requests per provider to prevent self-DOS
- Per-provider semaphore timeouts (Kiro 20s, Claude 15s, Antigravity 5s) decide when "busy" turns into "skip this account"
- backoffLevel increments are wrapped in SQLite transactions so concurrent failures don't lose increments
Result: with 6 accounts on a busy IDE, a single Autopilot flood resolves in ~5s instead of cascading into 25s+ "service unavailable" loops.
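The per-provider semaphore with a timeout can be sketched as below. `TimedSemaphore` is a hypothetical class written for this page; it shows the general technique (a counted permit pool whose waiters give up after a deadline), not kRouter's implementation.

```typescript
// Hypothetical counting semaphore whose acquire() gives up after a timeout.
class TimedSemaphore {
  private available: number;
  private waiters: Array<(acquired: boolean) => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  // Resolves true once a permit is held, or false if timeoutMs elapses first.
  acquire(timeoutMs: number): Promise<boolean> {
    if (this.available > 0) {
      this.available -= 1;
      return Promise.resolve(true);
    }
    return new Promise<boolean>((resolve) => {
      const waiter = (acquired: boolean) => {
        clearTimeout(timer);
        resolve(acquired);
      };
      const timer = setTimeout(() => {
        // Timed out: leave the queue so release() never hands us a permit.
        const i = this.waiters.indexOf(waiter);
        if (i >= 0) this.waiters.splice(i, 1);
        resolve(false);
      }, timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(true); // hand the permit straight to the oldest waiter
    else this.available += 1;
  }
}
```

With the figures above, Kiro would map to `new TimedSemaphore(4)` with `acquire(20000)`; a `false` result is the point where "busy" turns into "skip this account".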
File layout
/open-sse/handlers/ # request entry, dispatch
/open-sse/services/ # account picker, fallback, quota tracking
/open-sse/executors/ # per-provider HTTP calls
/open-sse/translator/ # format conversion
/open-sse/config/ # provider catalogue, error rules
/src/sse/services/ # auth, token refresh
/src/mitm/ # MITM intercept layer
/src/app/api/ # dashboard API routes
/src/app/(dashboard)/ # dashboard UI (Next.js App Router)
Where to go next
- Providers — the 62+ provider tier breakdown
- Security — credential handling + threat model
- Deployment — production setup
- Core Concepts — what RTK, MITM, and combos actually do