
Mac + llama.cpp (Apple Silicon)

ContextPilot runs fully on Apple Silicon with llama.cpp as the inference backend — no CUDA, no cloud, no external services required.


How It Works

llama.cpp's --cache-reuse N flag enables prefix caching: if a new request shares N or more tokens with the content already held in a KV-cache slot, those tokens are not re-evaluated. ContextPilot maximises the length of those shared prefixes by clustering retrieved documents hierarchically and reordering them so overlapping content aligns at the front of each prompt.
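
As a toy illustration of why that reordering matters (the document names, prompt layout, and character-level prefix measurement below are simplifications, not ContextPilot's actual output):

import os

# Two queries retrieve overlapping documents, but in different orders.
docs_q1 = ["doc_A", "doc_B", "doc_C"]
docs_q2 = ["doc_D", "doc_B", "doc_A"]

# Laid out as retrieved, the prompts diverge at the first document, so
# llama.cpp finds almost no shared prefix. After clustering/reordering,
# the documents shared by both queries lead each prompt:
reordered_q1 = ["doc_A", "doc_B", "doc_C"]
reordered_q2 = ["doc_A", "doc_B", "doc_D"]

prompt_q1 = "\n".join(reordered_q1)
prompt_q2 = "\n".join(reordered_q2)

# The common leading span is what --cache-reuse lets llama.cpp skip
# re-evaluating on the second request, provided it is >= N tokens long.
shared = os.path.commonprefix([prompt_q1, prompt_q2])
print(f"shared prefix: {len(shared)} characters")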

On Apple Silicon, the entire quantised model (Q4_K_M ≈ 4.5 bits/weight) fits in unified DRAM shared by the CPU and GPU, so Metal offload (-ngl 99) gives near-GPU throughput without a discrete GPU.
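
A back-of-the-envelope check of that claim, assuming an 8B-parameter model at the rough 4.5 bits/weight figure above (real GGUF files also carry metadata and a few higher-precision tensors, so actual sizes differ slightly):

# Rough weight-memory estimate for a Q4_K_M model (illustrative only).
params = 8e9               # e.g. an 8B-parameter model
bits_per_weight = 4.5      # approximate Q4_K_M average
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB of unified memory for the weights")
# ~4.5 GB, which leaves headroom for the KV cache even on a 16 GB machine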

Architecture

llama-server ships its own OpenAI-compatible /v1/* API (including /v1/chat/completions with built-in chat-template handling), so no separate API server is needed.
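
That means you can sanity-check the backend on its own before adding ContextPilot, for example by pointing the standard OpenAI client straight at llama-server (a minimal sketch; llama-server typically ignores the model field when a single model is loaded):

from openai import OpenAI

# Talk to llama-server directly (no ContextPilot in the path) to verify
# the OpenAI-compatible endpoint responds.
raw = OpenAI(base_url="http://localhost:8889/v1", api_key="EMPTY")
reply = raw.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)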

llama-server :8889   (Metal, --cache-reuse 256, native hook injected)
  │ llama_memory_seq_rm() intercepted via DYLD_INSERT_LIBRARIES
  │ POST /evict fires instantly on cache clear (zero-latency)
  ▲
ContextPilot server :8765   ← reorders contexts for max prefix reuse
  ▲
your application

The native hook (contextpilot._llamacpp_hook) compiles a small C++ library and injects it into llama-server at launch via DYLD_INSERT_LIBRARIES. It intercepts llama_memory_seq_rm() inside the llama-server process and fires POST /evict {"request_ids":["slot_N"]} the instant a slot's KV cache is discarded — no polling, zero latency.
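
For illustration, the notification the hook sends is equivalent to the following Python (the real call is made from C++ inside the llama-server process; the slot number is whichever slot was cleared):

import requests

# What the native hook effectively does when slot 2's KV cache is discarded.
requests.post(
    "http://localhost:8765/evict",      # CONTEXTPILOT_INDEX_URL + /evict
    json={"request_ids": ["slot_2"]},
    timeout=1,
)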

Why contextpilot-llama-server instead of just llama-server

| Engine | Runtime | Hook mechanism | Wrapper needed? |
|---|---|---|---|
| SGLang | Python | .pth file → monkey-patch inside the same Python process | No |
| vLLM | Python | .pth file → monkey-patch inside the same Python process | No |
| llama.cpp | C++ binary | DYLD_INSERT_LIBRARIES must be set before the process starts | Yes |

SGLang and vLLM run inside Python, so ContextPilot's .pth file in site-packages fires automatically at interpreter startup and patches the engine from within. llama-server is a compiled C++ binary — Python never starts, so the .pth mechanism does not apply. contextpilot-llama-server is the equivalent wrapper: it sets DYLD_INSERT_LIBRARIES and exec's the real llama-server with all flags forwarded. When CONTEXTPILOT_INDEX_URL is not set it exec's llama-server directly with zero overhead.
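
Conceptually the wrapper reduces to a few lines (a simplified sketch, not the actual implementation; the real command also compiles and caches the hook library, whose path below is hypothetical):

import os
import sys

# Simplified sketch of what contextpilot-llama-server does at launch.
hook_dylib = "/tmp/contextpilot_hook.dylib"   # hypothetical cached build path

if os.environ.get("CONTEXTPILOT_INDEX_URL"):
    # Inject the hook into the child process before it starts.
    os.environ["DYLD_INSERT_LIBRARIES"] = hook_dylib

# Replace this process with the real llama-server, forwarding all flags.
os.execvp("llama-server", ["llama-server", *sys.argv[1:]])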


Setup

Prerequisites

  • Apple Silicon Mac (M1 or later)
  • A GGUF model (e.g. Qwen3-8B-Q4_K_M.gguf)

Install

From PyPI:

pip install contextpilot
xcode-select --install # one-time: provides clang++ to compile the native hook

From source:

git clone https://github.com/EfficientContext/ContextPilot.git
cd ContextPilot
pip install -e .
brew install llama.cpp
xcode-select --install # one-time: provides clang++ to compile the native hook

xcode-select --install is only needed once per machine. If you already have Xcode Command Line Tools installed, skip it.

Start

Terminal 1 — llama-server with native hook injected:

CONTEXTPILOT_INDEX_URL=http://localhost:8765 contextpilot-llama-server \
  -m models/Qwen3-8B-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8889 \
  -ngl 99 --cache-reuse 256 --parallel 4 -c 32768

contextpilot-llama-server is a drop-in for llama-server — same flags, same behavior. It compiles the C++ hook once (cached in /tmp) and exec's llama-server with DYLD_INSERT_LIBRARIES set automatically. The hook fires POST /evict {"request_ids":["slot_N"]} the instant a slot's KV cache is discarded.

| llama-server flag | Purpose |
|---|---|
| -ngl 99 | Offload all layers to the Metal GPU |
| --cache-reuse 256 | Reuse KV cache when the prefix overlap is ≥ 256 tokens |
| --parallel 4 | Allocate 4 independent KV-cache slots for concurrent requests |
| -c 32768 | Context window size (tokens) |

Terminal 2 — ContextPilot HTTP server:

python -m contextpilot.server.http_server --port 8765 \
  --infer-api-url http://localhost:8889

Usage

Your application connects to the ContextPilot server; the integration is the same two lines as with the other backends:

from openai import OpenAI
import contextpilot as cp

client = OpenAI(base_url="http://localhost:8765/v1", api_key="EMPTY")
cp_instance = cp.ContextPilot(use_gpu=False) # CPU clustering on Mac

for query in queries:
    contexts = get_contexts(query)  # your RAG retriever
    messages = cp_instance.optimize(contexts, query)
    response = client.chat.completions.create(
        model="qwen3-8b",
        messages=messages,
    )

cp_instance.optimize() reorders the retrieved documents before each request. Documents shared across queries form a common prefix that llama.cpp can cache and reuse.

Multi-turn conversations

Pass a stable conversation_id across turns so ContextPilot can deduplicate documents already seen in earlier turns:

import uuid
conversation_id = f"conv-{uuid.uuid4().hex[:8]}"

for query in conversation_turns:
    contexts = get_contexts(query)
    messages = cp_instance.optimize(contexts, query, conversation_id=conversation_id)
    response = client.chat.completions.create(model="qwen3-8b", messages=messages, max_tokens=200)

Batch processing

optimize_batch schedules an entire batch in the globally optimal execution order — queries sharing documents are sent consecutively to maximise prefix reuse:

all_docs = [get_contexts(q) for q in all_queries]
messages_batch, original_indices = cp_instance.optimize_batch(all_docs, all_queries)

print(f"Scheduled order: {original_indices}")

answers = [""] * len(all_queries)
for messages, orig_idx in zip(messages_batch, original_indices):
    response = client.chat.completions.create(model="qwen3-8b", messages=messages, max_tokens=200)
    answers[orig_idx] = response.choices[0].message.content

A complete working example covering all three patterns is at examples/mac_llama_cpp_example.py.


Benchmarking

tests/test_mac_contextpilot.sh runs the full MultiHopRAG benchmark end-to-end, comparing ContextPilot (reordered docs) against a baseline (original doc order).

Additional prerequisites

  • Docker (for Elasticsearch / BM25 retrieval)

Run

# Full automated run (default query count)
bash tests/test_mac_contextpilot.sh

# Fewer queries for a quick test
bash tests/test_mac_contextpilot.sh --num-queries 100

# Custom-built llama-server (not in PATH)
bash tests/test_mac_contextpilot.sh --llama-server /path/to/llama.cpp/build/bin/llama-server

# Different model
bash tests/test_mac_contextpilot.sh --model models/Llama-3.2-3B-Q4_K_M.gguf

# Skip dataset download if already done
bash tests/test_mac_contextpilot.sh --skip-data-prep

The script handles everything automatically:

  1. Starts Elasticsearch and downloads the MultiHopRAG dataset
  2. Builds BM25 retrieval data and reorders with ContextPilot (offline batch)
  3. Compiles the native C++ hook and starts llama-server with it injected via DYLD_INSERT_LIBRARIES, then starts the ContextPilot HTTP server
  4. Runs the ContextPilot benchmark, then restarts with a clean KV cache and runs the baseline
  5. Prints a side-by-side comparison of F1 scores and latency

Results are saved to results_contextpilot.jsonl and results_baseline.jsonl.
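
If you want to inspect the raw output yourself, each file is one JSON record per line; the exact fields are defined by the benchmark script, so treat anything beyond the record count as script-specific (a minimal sketch):

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

cp_runs = load_jsonl("results_contextpilot.jsonl")
base_runs = load_jsonl("results_baseline.jsonl")
print(f"ContextPilot: {len(cp_runs)} records, baseline: {len(base_runs)} records")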


Tuning Tips

| Goal | Adjustment |
|---|---|
| Higher cache reuse | Lower --cache-reuse (e.g. 64) to match shorter shared prefixes |
| More concurrent requests | Increase --parallel (each slot uses ~0.5 MB/token of DRAM) |
| Fit a larger model | Use a smaller quantisation (Q3_K_M) or reduce -c |
| Reduce power draw | Lower -ngl to keep some layers on the CPU |
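
For a rough sense of what raising --parallel or -c costs in memory, the coarse ~0.5 MB/token figure from the table can be plugged into a quick estimate (the true per-token cost depends on the model's layer count and KV-head layout and is often lower; -c is shared across the slots in llama-server):

# Upper-bound KV-cache budget sketch using the coarse figure above.
parallel = 4
ctx_total = 32768                 # -c, shared across the parallel slots
mb_per_token = 0.5                # rough figure from the table above
kv_gb = ctx_total * mb_per_token / 1024
print(f"~{kv_gb:.0f} GB of unified memory for KV cache "
      f"({parallel} slots x {ctx_total // parallel} tokens each)")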

Next Steps