Offline Usage

Offline mode is best for batch processing where you have all queries upfront and want to maximize KV-cache reuse across them — no server required.

How It Works

ContextPilot performs two levels of optimization:

Inter-Context Reordering: Queries with overlapping context blocks are scheduled together
Intra-Context Reordering: Context blocks within each query are reordered so shared blocks appear first as a common prefix

For example, if Query A has ["block_C", "block_A", "block_D", "block_B"] and Query B has ["block_B", "block_E", "block_A", "block_C"], after optimization:

Query A: ["block_A", "block_B", "block_C", "block_D"] (shared blocks first)
Query B: ["block_A", "block_B", "block_C", "block_E"] (same prefix — cache hit!)

Prerequisites

Start your inference engine:

# SGLang:
python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --port 30000

# or vLLM:
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 30000 \
    --enable-prefix-caching

Using `cp.optimize_batch()` (Simplest)

Pass your context blocks and queries — ContextPilot handles reordering and returns ready-to-use OpenAI messages in the optimal execution order.

import asyncio
import openai
import contextpilot as cp

BASE_URL = "http://localhost:30000/v1"
engine = cp.ContextPilot(use_gpu=False)

queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
all_contexts = [
    ["AI is the simulation of human intelligence", "Machine learning is a subset of AI", "Deep learning uses neural networks"],
    ["Neural networks are inspired by the brain", "Machine learning is a subset of AI", "Backpropagation trains neural networks"],
    ["Deep learning uses neural networks", "Machine learning is a subset of AI", "GPUs accelerate deep learning training"],
]

# Returns messages in scheduled order + the original index mapping
messages_batch, order = engine.optimize_batch(all_contexts, queries)

async def generate_all():
    client = openai.AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
    return await asyncio.gather(*[
        client.chat.completions.create(model="Qwen/Qwen2.5-7B-Instruct", messages=m)
        for m in messages_batch
    ])

for resp, idx in zip(asyncio.run(generate_all()), order):
    print(f"Q: {queries[idx]}\nA: {resp.choices[0].message.content}\n")

messages_batch[i] corresponds to queries[order[i]] — send them in this order to the inference engine for maximum prefix sharing, then use order to map results back.

Using `cp.reorder()` (Manual Control)

Use reorder() when you need full control over prompt construction — it returns reordered context blocks and the execution order, and you build the prompts yourself.

import asyncio
import openai
import contextpilot as cp

BASE_URL = "http://localhost:30000/v1"
engine = cp.ContextPilot(use_gpu=False)

queries = ["What is AI?", "Explain neural networks", "What is deep learning?"]
all_contexts = [
    ["AI is the simulation of human intelligence", "Machine learning is a subset of AI", "Deep learning uses neural networks"],
    ["Neural networks are inspired by the brain", "Machine learning is a subset of AI", "Backpropagation trains neural networks"],
    ["Deep learning uses neural networks", "Machine learning is a subset of AI", "GPUs accelerate deep learning training"],
]

# reordered[i] = reordered blocks for the i-th scheduled query
# order[i]     = index into the original queries list
reordered, order = engine.reorder(all_contexts)

def build_prompt(query, blocks):
    context_text = "\n".join(f"[{i+1}] {b}" for i, b in enumerate(blocks))
    return [
        {"role": "system", "content": f"Answer based on the context:\n{context_text}"},
        {"role": "user", "content": query},
    ]

messages_batch = [build_prompt(queries[order[i]], reordered[i]) for i in range(len(order))]

async def generate_all():
    client = openai.AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
    return await asyncio.gather(*[
        client.chat.completions.create(model="Qwen/Qwen2.5-7B-Instruct", messages=m)
        for m in messages_batch
    ])

results = [None] * len(queries)
for resp, idx in zip(asyncio.run(generate_all()), order):
    results[idx] = resp.choices[0].message.content

for q, a in zip(queries, results):
    print(f"Q: {q}\nA: {a}\n")

RAG Pipeline (with Built-in Retrieval)

If you have a document corpus and want ContextPilot to handle retrieval + optimization in one call, use RAGPipeline:

from contextpilot.pipeline import RAGPipeline, InferenceConfig

pipeline = RAGPipeline(
    retriever="bm25",          # or "faiss" for semantic search
    corpus_path="corpus.jsonl",
    use_contextpilot=True,
    inference=InferenceConfig(
        model_name="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:30000",
        max_tokens=256,
    )
)

results = pipeline.run(
    queries=["What is machine learning?", "Explain neural networks", "What is deep learning?"],
    top_k=20,
    generate_responses=True,
)

for gen_result in results["generation_results"]:
    if gen_result["success"]:
        print(gen_result["generated_text"][:200])

See the API Reference for full RAGPipeline options including FAISS retrieval, step-by-step control, and saving results.

Next Steps

Online Usage - Live index server for stateful cache tracking
Multi-Turn - Context deduplication across conversation turns
API Reference - Full API documentation

How It Works​

Prerequisites​

Using cp.optimize_batch() (Simplest)​

Using cp.reorder() (Manual Control)​

RAG Pipeline (with Built-in Retrieval)​

Next Steps​