Skip to main content

Online Usage

Online mode uses the ContextPilot Index Server for live reordering. Both modes use a single POST /reorder endpoint:

ModeUse CaseCache Tracking
Stateless (--stateless)Per-batch reordering, no state management
Stateful (default)Live cache sync with the inference engine, eviction tracking

The server auto-dispatches based on its mode — your client code is the same either way.


How Scheduling Works

ContextPilot performs two levels of optimization to maximize KV-cache prefix sharing:

1. Inter-Context Reordering

Contexts with overlapping context blocks are scheduled together. For example:

Original order: [Context0, Context1, Context2, Context3]
Scheduled: [Context0, Context2, Context1, Context3] # 0 and 2 share blocks, now adjacent

2. Intra-Context Reordering

Within each context, blocks are reordered so that shared blocks appear first as a common prefix:

Original:
Context 0: ["block_G", "block_A", "block_F", "block_B", "block_E", "block_C"] # shared blocks scattered
Context 2: ["block_C", "block_D", "block_A", "block_H", "block_B", "block_I"] # same blocks, different order

After scheduling:
Context 0: ["block_A", "block_B", "block_C", "block_G", "block_F", "block_E"] # shared {A,B,C} moved to front
Context 2: ["block_A", "block_B", "block_C", "block_D", "block_H", "block_I"] # same prefix [A,B,C]!

This ensures adjacent contexts have identical prefixes that can be cached and reused.

Important: Use reordered_contexts (not original_indices) when building prompts to get the reordered context blocks.


Stateless Mode

Stateless mode provides optimal batch ordering without tracking cache state. Each /reorder call is independent.

Best for: Simple batch scheduling, microservices architecture, per-request optimization.

Start the Server

python -m contextpilot.server.http_server --port 8765 --stateless --infer-api-url http://localhost:30000

Client Usage

import requests

# Prepare contexts (each context = list of context blocks for a query)
contexts = [
["Transformers use self-attention", "BERT is bidirectional", "GPT uses causal attention", "Attention is O(n²)"],
["RNNs have vanishing gradients", "Transformers use self-attention", "LSTMs use gating", "GRUs are efficient"],
["Transformers use self-attention", "BERT is bidirectional", "ViT applies transformers to vision", "CLIP uses contrastive loss"],
["CNNs use convolutions", "ResNet uses skip connections", "BatchNorm stabilizes training", "Dropout regularizes"],
]

# One call — reorder for optimal prefix sharing
response = requests.post("http://localhost:8765/reorder", json={
"contexts": contexts
})
result = response.json()

# Use the reordered contexts and execution order
scheduled_contexts = result['reordered_contexts']
scheduled_order = result['original_indices']
print(f"Optimal order: {scheduled_order}") # e.g., [0, 2, 1, 3]

# Build prompts using the reordered context blocks
# scheduled_contexts[i] corresponds to original query at scheduled_order[i]
for i, reordered_blocks in enumerate(scheduled_contexts):
original_query_idx = scheduled_order[i]
# Build prompt with reordered blocks for maximum prefix sharing
# prompt = build_prompt(queries[original_query_idx], reordered_blocks)
# response = inference_client.generate(prompt)
pass

# After inference, map results back to original order
# final_results[scheduled_order[i]] = results[i]

Using the Python Client

from contextpilot.server.http_client import ContextPilotIndexClient

client = ContextPilotIndexClient("http://localhost:8765")

reordered, order = client.reorder(
contexts=[
["Transformers use self-attention", "BERT is bidirectional", "GPT uses causal attention"],
["RNNs have vanishing gradients", "Transformers use self-attention", "LSTMs use gating"],
["Transformers use self-attention", "BERT is bidirectional", "ViT applies transformers to vision"],
]
)

print(f"Scheduled order: {order}")
client.close()

Stateful Mode

Stateful mode maintains a live index synchronized with the inference engine's KV cache via eviction callbacks.

Best for: Long-running services, cache-aware scheduling, inference engine integration.

Architecture

┌─────────────┐         ┌─────────────────────┐         ┌─────────────────┐
│ Client │ ──────► │ ContextPilot Index │ ──────► │ Inference Engine│
│ │ reorder│ Server (8765) │ proxy │ (30000) │
└─────────────┘ └─────────────────────┘ └─────────────────┘
▲ │
│ POST /evict │
└───────────────────────────────┘
(engine notifies on KV cache eviction)

Inference Engine Integration

Stateful mode requires the inference engine to notify ContextPilot when KV-cache entries are evicted. No engine patches are needed — ContextPilot automatically hooks into SGLang at runtime via a .pth import hook.

SGLang (automatic, zero-patch)

Just set the CONTEXTPILOT_INDEX_URL environment variable when starting SGLang:

CONTEXTPILOT_INDEX_URL=http://localhost:8765 sglang serve --model-path Qwen/Qwen3-4B --port 30000

That's it. ContextPilot's .pth hook automatically monkey-patches SGLang's RadixCache at import time to add eviction tracking. You will see this in the SGLang logs:

[ContextPilot] Applying monkey-patches to SGLang RadixCache …
[ContextPilot] SGLang RadixCache monkey-patched successfully

Requirements: If you used pip install -e ., run python -m contextpilot.install_hook once to install the .pth file.

Distributed setup (SGLang and ContextPilot on different machines): You don't need to install the full contextpilot package on the SGLang machine. Copy the standalone install script instead:

# In the engine's Python environment (one command, no clone needed):
pip install requests
curl -sL https://raw.githubusercontent.com/EfficientContext/ContextPilot/main/contextpilot/install_standalone.py | python -

Then start the engine with CONTEXTPILOT_INDEX_URL set as above.

vLLM (automatic, zero-patch)

Same approach — just set the environment variable:

CONTEXTPILOT_INDEX_URL=http://localhost:8765 vllm serve Qwen/Qwen3-4B --port 30000 --enable-prefix-caching

ContextPilot's .pth hook automatically monkey-patches vLLM's BlockPool at import time. You will see this in the vLLM logs:

[ContextPilot] Applying monkey-patches to vLLM BlockPool …
[ContextPilot] vLLM BlockPool monkey-patched successfully

Compatible with vLLM 0.15.x (v1 block manager architecture).

llama.cpp (native C++ hook)

llama.cpp is a compiled C++ binary, so it cannot be monkey-patched at import time. Instead, ContextPilot provides contextpilot-llama-server — a drop-in replacement for llama-server that compiles a small shared library and injects it via DYLD_INSERT_LIBRARIES (macOS) / LD_PRELOAD (Linux), giving exact, zero-latency eviction signals with no polling.

The usage pattern is identical to SGLang and vLLM — just set CONTEXTPILOT_INDEX_URL:

CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server \
-m models/Qwen3-8B-Q4_K_M.gguf \
--host 0.0.0.0 --port 8889 \
-ngl 99 --cache-reuse 256 --parallel 4 -c 32768

contextpilot-llama-server is a drop-in for llama-server — same flags, same behavior. It compiles the hook once (cached in /tmp) and transparently exec's llama-server with the injection set. You will see in stderr:

[ContextPilot] Hook compiled: /tmp/contextpilot_llama_hook.dylib
[ContextPilot] Launching with DYLD_INSERT_LIBRARIES injected: /usr/local/bin/llama-server ...

When CONTEXTPILOT_INDEX_URL is not set, contextpilot-llama-server exec's llama-server directly with zero overhead.

Requirements:

  • llama-server in PATH: brew install llama.cpp (or set LLAMA_SERVER_BIN=/path/to/llama-server)
  • clang++ (macOS, via xcode-select --install) or g++ (Linux)

See the Mac + llama.cpp guide for the full Apple Silicon setup.

How It Works

When CONTEXTPILOT_INDEX_URL is set, the inference engine integrates with ContextPilot at eviction time:

┌─────────────────────────────────────────────────────────────────────┐
│ Eviction Sync Flow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Cache Full → RadixCache.evict() → Callback invoked │
│ │ │
│ ▼ │
│ POST /evict {"request_ids": ["rid1", "rid2"]} │
│ │ │
│ ▼ │
│ ContextPilot removes evicted requests from index│
│ │
└─────────────────────────────────────────────────────────────────────┘

Key Components:

  1. Request ID Tracking: Each request has a unique request_id (e.g., contextpilot_abc123)
  2. Eviction Callback: When the engine evicts cache entries, it notifies ContextPilot
  3. Index Sync: ContextPilot removes evicted requests from its live index

Start the Servers

# Terminal 1: Start ContextPilot server
python -m contextpilot.server.http_server \
--port 8765 \
--infer-api-url http://localhost:30000

# Terminal 2: Start your inference engine with ContextPilot integration enabled

# Option A: SGLang
CONTEXTPILOT_INDEX_URL=http://localhost:8765 python -m sglang.launch_server \
--model-path Qwen/Qwen3-4B \
--port 30000

# Option B: vLLM
CONTEXTPILOT_INDEX_URL=http://localhost:8765 python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-4B \
--port 30000 \
--enable-prefix-caching

Step 1: Reorder Contexts (Builds Index Automatically)

import requests

response = requests.post("http://localhost:8765/reorder", json={
"contexts": [
["Transformers use self-attention", "BERT is bidirectional", "GPT uses causal attention", "Attention is O(n²)"],
["RNNs have vanishing gradients", "Transformers use self-attention", "LSTMs use gating", "GRUs are efficient"],
["Transformers use self-attention", "BERT is bidirectional", "ViT applies transformers to vision", "CLIP uses contrastive loss"],
]
})

result = response.json()
reordered = result["reordered_contexts"]
order = result["original_indices"]
request_ids = result["request_ids"]
print(f"Reordered {len(request_ids)} contexts, mode: {result['mode']}")
# mode="initial" on first call, "incremental" on subsequent calls

Step 2: Send Requests via Proxy

The server proxies requests to the inference engine, associating each request with its reordered context:

response = requests.post("http://localhost:8765/v1/completions", json={
"model": "Qwen/Qwen2.5-7B-Instruct",
"prompt": "What is machine learning?",
"max_tokens": 100,
"rid": request_ids[0]
})

result = response.json()
print(result["choices"][0]["text"])

Step 3: Eviction Sync

When using a patched inference engine with the CONTEXTPILOT_INDEX_URL env var, eviction sync is automatic. The engine's cache calls the /evict endpoint with evicted request_ids via the callback.

If you need manual eviction (e.g., for testing), use the HTTP API directly:

# Direct HTTP request
requests.post("http://localhost:8765/evict", json={
"request_ids": ["contextpilot_abc123", "contextpilot_def456"]
})

# Or using the Python client
from contextpilot.server.http_client import ContextPilotIndexClient

client = ContextPilotIndexClient("http://localhost:8765")
result = client.evict(["contextpilot_abc123", "contextpilot_def456"])
print(f"Removed {result['removed_count']} requests")
print(f"Cleared {result['conversations_cleared']} conversations")

Server Endpoints

EndpointMethodDescription
/reorderPOSTPrimary — reorder contexts (auto-dispatches stateless / stateful)
/healthGETHealth check
/deduplicatePOSTMulti-turn deduplication (lightweight, stateful only)
/evictPOSTRemove evicted requests by request_id — called by SGLang, vLLM, and llama.cpp hooks
/resetPOSTReset index and conversation tracker (stateful only)
/statsGETGet index statistics (stateful only)
/buildPOSTDeprecated alias → /reorder
/schedulePOSTDeprecated alias → /reorder (always stateless)

Complete End-to-End Example

For a complete working example that shows the entire workflow from documents → context scheduling → prompt building → inference, see:

📄 examples/stateless_sglang_e2e.py

This example demonstrates:

  1. Document Retrieval - Simulated RAG retrieval of relevant documents
  2. Context Tokenization - Converting text to token IDs for scheduling
  3. ContextPilot Scheduling - Optimal ordering for prefix sharing
  4. Prompt Building - Constructing RAG prompts with context
  5. Inference - Batch inference in scheduled order (works with any OpenAI-compatible engine)
  6. Result Reordering - Mapping results back to original order

Quick Preview

from contextpilot.server.http_client import ContextPilotIndexClient
import requests

# 1. Tokenize your contexts
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
contexts = [tokenizer.encode(doc, add_special_tokens=False) for doc in documents]

# 2. Reorder with ContextPilot (stateless mode)
client = ContextPilotIndexClient("http://localhost:8765")
reordered_contexts, scheduled_order = client.reorder(contexts=contexts)

# 3. Build prompts using reordered contexts
scheduled_prompts = [build_rag_prompt(query, reordered_contexts[i]) for i in range(len(reordered_contexts))]

# 4. Send to inference engine in scheduled order (maximizes prefix sharing)
response = requests.post("http://localhost:30000/v1/completions", json={
"prompt": scheduled_prompts,
"max_tokens": 256
})
scheduled_results = response.json()

# 5. Reorder results back to original order
final_results = [None] * len(scheduled_order)
for new_idx, orig_idx in enumerate(scheduled_order):
final_results[orig_idx] = scheduled_results[new_idx]

Running the Example

# Terminal 1: Start your inference engine (SGLang or vLLM)
python -m sglang.launch_server --model Qwen/Qwen3-4B --port 30000
# or: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-4B --port 30000 --enable-prefix-caching

# Terminal 2: Start ContextPilot
python -m contextpilot.server.http_server --port 8765

# Terminal 3: Run the example
python examples/stateless_sglang_e2e.py

Next Steps