AI DEVELOPMENT ARTICLES ·2026-05-01 ·UPDATED 2026-06-18 ·BY EFFLOOW EDITORIAL ·15 MIN READ

Token Optimization for Production LLMs: Cut Costs Effectively

Four research-backed token optimization techniques for production LLMs: semantic caching, prompt compression, context pruning, and speculative decoding.
llm-optimization token-efficiency ai-cost-reduction prompt-compression semantic-caching production-ai speculative-decoding

Enterprise LLM API spending more than doubled from $3.5 billion in November 2024 to $8.4 billion by mid-2025, according to Menlo Ventures' 2025 mid-year LLM market update. If you are running any AI feature in production, tokens can become the largest variable cost once usage moves beyond demos. The practical question is not "which model is cheapest?" It is "which tokens are repeated, stale, or unnecessary?"

This guide walks through four concrete, source-backed strategies. Effloow also reproduced three mechanisms in a minimal local sandbox: semantic-cache matching, conversation trimming, and RAG context pruning. No production API keys were used, and the local numbers are not presented as production savings. See the public token optimization lab note for the commands, outputs, and limitations.

Why Token Costs Compound Faster Than You Expect

A naive LLM integration has several hidden multipliers. A 20-turn conversation can carry 5,000–10,000 tokens of accumulated history on every request, even when only the last 500–1,000 tokens are relevant to the current question. RAG pipelines often retrieve a full 1,500-token chunk when 200 tokens contain the answer. And every near-duplicate user question (common in customer support) triggers a fresh API call for an essentially identical answer.
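As a back-of-the-envelope illustration of the history multiplier, the sketch below totals the input tokens sent over a whole conversation when the full history is resent on every turn. The 400-tokens-per-turn figure and the function name are assumptions chosen for illustration, not measurements.

```python
from typing import Optional

def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            system_tokens: int = 0,
                            keep_last: Optional[int] = None) -> int:
    """Total input tokens sent across a conversation that resends history each turn."""
    total = 0
    for turn in range(1, turns + 1):
        # On turn t the request carries the system prompt plus the history so far,
        # optionally capped at the most recent `keep_last` turns.
        history_turns = turn if keep_last is None else min(turn, keep_last)
        total += system_tokens + history_turns * tokens_per_turn
    return total

# Hypothetical chat: 20 turns, each adding about 400 tokens
full = cumulative_input_tokens(turns=20, tokens_per_turn=400)
trimmed = cumulative_input_tokens(turns=20, tokens_per_turn=400, keep_last=2)
print(f"full history: {full} tokens, last-2 turns only: {trimmed} tokens")
```

The total grows roughly with the square of the turn count when nothing is trimmed, which is why long chats cost more than their length suggests.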

Before reaching for a cheaper model, the right question is: how many of your tokens are doing useful work? If the answer points to changing infrastructure rather than prompts, weigh it against our self-hosting vs cloud API cost guide.

Source-Derived Decision Table

This table is the article's main original contribution. It maps each cost-control lever to the source claim, the local proof available here, and the deployment limit that should stop an overconfident rollout.

| Lever | Primary source support | Effloow added | Use it when | Skip or delay it when |
|---|---|---|---|---|
| Semantic caching | GPT Semantic Cache reports embedding-based Redis caching with 61.6% to 68.8% cache hit rates and positive hit accuracy above 97% in its evaluated customer-service QA domains. | A local 5-query Jaccard-proxy sandbox produced 2 cache hits out of 5. This only proves the matching workflow, not production accuracy. | Queries repeat and answers are safe to reuse, such as FAQ, docs, support triage, and generic coding help. | Answers depend on account state, permissions, private user context, or recent data. |
| Prompt compression | LLMLingua reports up to 20x prompt compression with small performance loss on tested benchmarks; LongLLMLingua reports stronger long-context results on NaturalQuestions with fewer tokens. | The lab note includes a simple RAG-pruning proxy that cut a toy chunk from about 175 to 40 approximate tokens. This is not LLMLingua quality. | Retrieved chunks are verbose and the answer-relevant span is small. | The task depends on exact wording, legal language, safety instructions, or brittle few-shot examples. |
| Conversation trimming | Provider APIs bill or account for prompt tokens, so repeated history grows cost and latency with every turn. | A 7-turn toy conversation dropped from about 97 to 46 approximate tokens after keeping the system prompt and last exchange. | Chat history is mostly independent turns or recent context dominates. | Users frequently refer back to earlier facts, decisions, or constraints. Add retrieval or summaries first. |
| Provider prompt caching | OpenAI's prompt caching guide says caching can reduce latency and input-token cost for repeated prefixes; Anthropic's prompt caching docs document a separate cache-write/cache-read pricing model. | This article distinguishes provider prefix caching from semantic response caching so teams do not treat them as substitutes. | Long system prompts, tool schemas, policies, or few-shot blocks repeat across requests. | The prompt changes near the beginning on every request, preventing stable-prefix reuse. |
| Speculative decoding | Fast Inference from Transformers via Speculative Decoding reports 2x to 3x acceleration without changing the output distribution in the evaluated setup; vLLM documents speculative decoding for memory-bound serving. | No local speculative-decoding benchmark was run for this article. It is included as an inference-side option, not an Effloow measurement. | You self-host inference and can test draft/target model compatibility. | You use managed APIs only, or your bottleneck is retrieval, networking, or application code rather than decoding. |

Strategy 1: Semantic Caching

Research basis: GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching (arXiv 2411.05276, 2024).

Semantic caching stores the embedding of a query alongside its LLM response. On the next request, the system checks whether any cached query embedding is within a cosine similarity threshold (typically 0.80–0.95). If so, the cached response is returned without an API call.

The GPT Semantic Cache paper evaluated this on 8,000 question-answer pairs across customer service domains and found cache hit rates of 61.6%–68.8% with positive hit accuracy above 97%. Treat those as paper-domain results, not a promise for your application.

Effloow Lab PoC result: Using Jaccard word-overlap as a proxy for embedding cosine similarity (threshold=0.15), a 5-query test set with two near-duplicate pairs produced a 40% cache hit rate. The proxy correctly identified "How does the Transformer attention mechanism work?" and "How does the Transformer attention work?" as similar (similarity=0.86). In production, replace Jaccard with a sentence-transformer or OpenAI embedding model at cosine threshold ≥0.80.
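For readers who want to see the proxy itself, here is a minimal sketch of Jaccard word overlap. The tokenization (lowercase alphanumeric words) is an assumption; the lab note may tokenize differently, though this version gives the same rounded score for the pair quoted above.

```python
import re

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: shared words divided by distinct words."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

q1 = "How does the Transformer attention mechanism work?"
q2 = "How does the Transformer attention work?"
score = jaccard(q1, q2)
print(round(score, 2))  # 6 shared words out of 7 distinct words
```

Jaccard scores are not on the same scale as embedding cosine similarity, so a Jaccard threshold does not carry over to a cosine threshold.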

Implementation sketch:

import numpy as np
from openai import OpenAI

client = OpenAI()
cache: list[tuple[np.ndarray, str]] = []
THRESHOLD = 0.85

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def get_embedding(text: str) -> np.ndarray:
    resp = client.embeddings.create(input=text, model="text-embedding-3-small")
    return np.array(resp.data[0].embedding)

def cached_completion(prompt: str) -> str:
    q_emb = get_embedding(prompt)
    for cached_emb, cached_resp in cache:
        if cosine_sim(q_emb, cached_emb) >= THRESHOLD:
            return cached_resp  # cache hit — no LLM API call
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = resp.choices[0].message.content
    cache.append((q_emb, answer))
    return answer

When to use: High-repetition workloads — support bots, documentation Q&A, coding assistants answering common patterns. Less effective for highly personalized or creative generation tasks where each query is genuinely unique.
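Picking the similarity threshold is an empirical decision. One way to approach it, sketched below, is to label a sample of query pairs by hand (does the cached answer still fit the new query?) and sweep candidate thresholds. The sample pairs and the function name are hypothetical; real similarities would come from your embedding model.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (similarity, cached_answer_is_correct) tuples.

    Returns (threshold, hit_rate, false_positive_rate) per threshold, where the
    false-positive rate is the share of cache hits that would serve a wrong answer.
    """
    rows = []
    for t in thresholds:
        hits = [ok for sim, ok in scored_pairs if sim >= t]
        hit_rate = len(hits) / len(scored_pairs)
        false_positive_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, false_positive_rate))
    return rows

# Hypothetical hand-labelled sample, not data from the paper or the lab run
sample = [(0.97, True), (0.91, True), (0.88, True), (0.86, False),
          (0.82, True), (0.78, False), (0.74, False), (0.60, False)]

rows = sweep_thresholds(sample, [0.75, 0.85, 0.90])
for threshold, hit_rate, fp_rate in rows:
    print(f"threshold={threshold:.2f} hit_rate={hit_rate:.2f} false_positive_rate={fp_rate:.2f}")
```

A higher threshold trades hit rate for fewer wrong answers; the acceptable false-positive rate depends on how costly a wrong cached answer is for the route.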

Strategy 2: Prompt Compression with LLMLingua

Research basis: LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (arXiv 2310.05736, EMNLP 2023), extended by LongLLMLingua.

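LLMLingua is a Microsoft Research tool that uses a small language model (for example GPT-2 small or LLaMA-7B) to score the perplexity of each token in a prompt. Tokens the small model finds highly predictable, and therefore likely redundant for the target model, are dropped before the prompt reaches the expensive LLM. LLMLingua-2, used in the example below, replaces perplexity scoring with a token classifier trained to decide which tokens to keep.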

Benchmark results from the papers:

  • Original LLMLingua: up to 20x compression on GSM8K with a reported accuracy loss of about 1.5 points
  • End-to-end inference speedup: 1.7–5.7x
  • LongLLMLingua: up to 21.4% performance improvement on NaturalQuestions at roughly 4x compression (fewer tokens, better answers because the compressed prompt emphasizes the most relevant sentences)
  • LLMLingua-2: 3x–6x faster than existing prompt compression methods, including the original, while maintaining compression quality

This article does not claim a universal monthly savings number for LLMLingua. Savings require your own baseline: average retrieved-token count, compression rate, output length, provider pricing, and quality pass rate.
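A baseline of that kind can be computed with a few lines of arithmetic. The sketch below is illustrative only: the request volume, token counts, and the price per million input tokens are placeholder assumptions, and the result says nothing about quality, which still needs its own pass-rate check.

```python
def monthly_input_cost(requests_per_month: int, context_tokens: int,
                       other_input_tokens: int, price_per_million_input: float,
                       keep_rate: float = 1.0) -> float:
    """Input-token cost when only the retrieved context is compressed.

    keep_rate is the fraction of context tokens kept (1.0 means no compression).
    Output-token cost is unchanged by compression and is left out here.
    """
    tokens_per_request = other_input_tokens + context_tokens * keep_rate
    return requests_per_month * tokens_per_request * price_per_million_input / 1_000_000

# Placeholder inputs: substitute your own logs and your provider's price sheet
baseline = monthly_input_cost(1_000_000, context_tokens=1200, other_input_tokens=300,
                              price_per_million_input=0.50)
compressed = monthly_input_cost(1_000_000, context_tokens=1200, other_input_tokens=300,
                                price_per_million_input=0.50, keep_rate=0.5)
print(f"baseline ${baseline:.2f}, compressed ${compressed:.2f}")
```

Note that halving the context does not halve the bill, because the system prompt and user message are untouched.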

Implementation:

# Install first: pip install llmlingua
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

# Long RAG context (e.g., retrieved from vector DB)
long_context = """
The Transformer architecture introduced in Attention Is All You Need (Vaswani et al., 2017) 
replaced recurrence with self-attention, enabling parallel training. The model consists of 
an encoder and decoder, each with multi-head attention layers. Positional encodings are added 
to token embeddings. LayerNorm is applied before each sublayer. Residual connections wrap each 
sublayer. The feed-forward network uses ReLU activation with dimension 2048. Dropout of 0.1 
is applied. Trained on WMT 2014 English-German and English-French datasets on 8 NVIDIA P100 GPUs.
BLEU scores: 28.4 EN-DE, 41.0 EN-FR.
"""

compressed = compressor.compress_prompt(
    long_context,
    rate=0.5,        # keep 50% of tokens
    force_tokens=["Transformer", "attention", "BLEU"],  # always preserve these
)

print(compressed["compressed_prompt"])
print(f"Tokens: {compressed['origin_tokens']} → {compressed['compressed_tokens']}")

When to use: RAG pipelines with long retrieved chunks, few-shot prompts with many examples, document summarization chains. LLMLingua is particularly effective when retrieved context is verbose and the actual answer sits in a small fraction of the text.
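Before reaching for a learned compressor, it can help to see how far a crude relevance filter gets. The sketch below keeps only the sentences that share the most words with the question. It is a keyword-overlap stand-in, not LLMLingua and not necessarily the proxy used in the lab note; the function name and the sample text are illustrative assumptions.

```python
import re

def content_words(text: str) -> set:
    # Crude content-word filter: lowercase words longer than three characters
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def prune_context(chunk: str, question: str, max_sentences: int = 2) -> str:
    """Keep the sentences with the highest word overlap with the question."""
    q_words = content_words(question)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    scored = [(len(q_words & content_words(s)), i, s) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Restore original order and drop sentences with no overlap at all
    kept = [s for score, _, s in sorted(top, key=lambda item: item[1]) if score > 0]
    return " ".join(kept)

chunk = ("The Transformer replaced recurrence with self-attention. "
         "Positional encodings are added to token embeddings. "
         "Dropout of 0.1 is applied. "
         "BLEU scores were 28.4 on EN-DE and 41.0 on EN-FR.")
pruned = prune_context(chunk, "What BLEU scores did the Transformer report?")
print(pruned)
```

A filter like this fails on paraphrases and synonyms, which is the gap that embedding re-rankers and learned compressors are meant to close.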

Caution: Aggressive compression (a rate below about 0.2, meaning fewer than 20% of tokens are kept) can degrade quality significantly for reasoning tasks. Tune the rate parameter per task type and validate with your evaluation set before deploying.

Strategy 3: Conversation History Trimming

Conversation trimming is the lowest-dependency optimization in this guide. Every turn in a multi-turn conversation can be appended to the context window, so long chats often carry old messages that are no longer needed for the current answer.

Effloow Lab PoC result: A 7-turn sample conversation (system prompt + 3 exchange pairs + 1 current question) carried approximately 97 tokens in full. Trimming to the system prompt plus the last 1 exchange reduced this to 46 tokens — a 53% reduction in that toy input. The run did not measure answer quality.

Three trimming strategies, ordered by implementation effort:

1. Last-N turns (easiest) Keep only the most recent N turns plus the system prompt. Works for most chatbot use cases where users ask independent questions.

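def trim_to_last_n(messages: list[dict], n: int = 4) -> list[dict]:
    # Keep every system message plus the n most recent non-system messages
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    # Guard n <= 0: history[-0:] would return the whole list instead of nothing
    return system + (history[-n:] if n > 0 else [])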

2. Sliding window with summarization (balanced) When the history exceeds a token budget, summarize older turns into a single compressed summary message, then discard the originals.

def sliding_summary(messages: list[dict], token_budget: int = 2000) -> list[dict]:
    # Estimate tokens (use tiktoken for exact counts in production)
    total = sum(len(m["content"]) // 4 for m in messages)
    if total <= token_budget:
        return messages
    
    # Summarize first half of non-system messages
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    mid = len(history) // 2
    old_turns = history[:mid]
    recent_turns = history[mid:]
    
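    # Placeholder "summary": join the old turns and truncate to 500 characters.
    # In production, replace this with an LLM-generated summary of old_turns.
    summary_text = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)
    summary = {
        "role": "system",
        "content": f"[Earlier conversation summary]: {summary_text[:500]}..."
    }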
    return system + [summary] + recent_turns

3. Token budget per role (precise) Use tiktoken to count tokens exactly and enforce per-role budgets (e.g., system ≤ 500, history ≤ 2,000, current user message ≤ 500).

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list[dict]) -> int:
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)  # +4 per message overhead
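Counting is only half of a budget; something also has to enforce it. The sketch below is one possible way to cap the history section, under the assumption that the newest message must always survive. It uses a rough four-characters-per-token estimate so it runs without a tokenizer; the function names are illustrative, and an exact counter can be passed in through the `count` parameter.

```python
from typing import Callable

def approx_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in an exact tokenizer
    return max(1, len(text) // 4)

def enforce_history_budget(messages: list[dict], history_budget: int,
                           count: Callable[[str], int] = approx_tokens) -> list[dict]:
    """Keep system messages plus the newest non-system messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    used = 0
    for message in reversed(history):  # walk from newest to oldest
        cost = count(message["content"])
        if kept and used + cost > history_budget:
            break  # the next-older message would exceed the budget
        kept.append(message)  # the newest message is always kept
        used += cost
    return system + list(reversed(kept))

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 40},       # about 10 tokens
    {"role": "assistant", "content": "b" * 40},  # about 10 tokens
    {"role": "user", "content": "c" * 40},       # about 10 tokens
]
budgeted = enforce_history_budget(messages, history_budget=20)
print([m["content"][:1] for m in budgeted])
```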

When to use: Every multi-turn application. Last-N trimming should be the default; add summarization when users frequently reference earlier conversation details.
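One middle ground between plain last-N trimming and full summarization is to also keep older turns that the current message appears to refer back to. The sketch below uses crude keyword overlap as the reference signal; the length filter, the overlap threshold, and the sample conversation are all illustrative assumptions, and a production version would more likely use entity extraction or embeddings.

```python
import re

def keywords(text: str) -> set:
    # Crude content-word filter: lowercase words longer than three characters
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def trim_with_references(messages: list[dict], n: int = 2, min_overlap: int = 2) -> list[dict]:
    """Keep system messages, the last n turns, and older turns the current message echoes."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    if not history:
        return system
    current = keywords(history[-1]["content"])
    recent_start = max(0, len(history) - n)
    kept = []
    for i, message in enumerate(history):
        is_recent = i >= recent_start
        is_referenced = len(current & keywords(message["content"])) >= min_overlap
        if is_recent or is_referenced:
            kept.append(message)
    return system + kept

conversation = [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "Our API rate limit is 100 requests per minute."},
    {"role": "assistant", "content": "Noted, I will keep that limit in mind."},
    {"role": "user", "content": "What colour is the dashboard theme?"},
    {"role": "assistant", "content": "The default theme is dark blue."},
    {"role": "user", "content": "Going back to the API rate limit, can we raise it?"},
]
trimmed_conversation = trim_with_references(conversation)
for m in trimmed_conversation:
    print(m["role"], "|", m["content"])
```

Here the first user turn survives because it shares "rate" and "limit" with the current question, while the unrelated dashboard question is dropped.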

Strategy 4: Speculative Decoding (Inference-Side)

Research basis: Fast Inference from Transformers via Speculative Decoding (Leviathan, Kalman, and Matias, 2022) and related speculative-sampling work from 2023.

Speculative decoding is an inference-side optimization, not an API-side one. It applies when you run your own inference server, such as vLLM or TGI, and that server exposes speculation as a configuration option; check your stack's documentation before assuming support. For the deep dive on tuning speculation length per quantization level, see our speculative decoding guide.

The mechanism: a small, fast "draft" model generates K candidate tokens. The large "target" model then verifies all K tokens in a single forward pass. Tokens are accepted left to right; at the first rejection, the target model supplies its own token for that position and the remaining draft tokens are discarded. The key insight is that verification is parallelizable: the large model scores all K candidates in one step instead of K sequential steps.
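The draft-then-verify loop can be illustrated with a toy in which both "models" are plain deterministic functions. This sketch covers only the greedy case, where a draft token is accepted if it equals the target's own choice; the cited paper handles sampling with a rejection-sampling rule that this toy does not implement. The two stand-in model functions are invented for illustration.

```python
def target_next(prefix: list) -> int:
    # Stand-in for the large model: deterministic next token from the prefix
    step = 2 if len(prefix) % 4 == 0 else 1
    return (prefix[-1] + step) % 10

def draft_next(prefix: list) -> int:
    # Stand-in for the small model: agrees with the target most of the time
    return (prefix[-1] + 1) % 10

def greedy_decode(prefix: list, n_new: int) -> list:
    out = list(prefix)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prefix: list, n_new: int, k: int = 4):
    out = list(prefix)
    goal = len(prefix) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens one after another (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. One target pass checks every draft position (simulated with a loop)
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
        else:
            accepted.append(target_next(out + draft))  # bonus token when all k match
        out.extend(accepted)
    return out[:goal], target_passes

spec_tokens, passes = speculative_decode([0], n_new=12, k=4)
print(spec_tokens, "target passes:", passes)
```

The output matches plain greedy decoding with the target alone, but it takes fewer target passes than tokens generated, which is where the latency saving comes from.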

Published results show 2–3x acceleration in the evaluated setup with no change to the output distribution. That does not mean every vLLM deployment gets the same result. It works best when the draft and target models are compatible and the workload is memory-bound enough for parallel verification to help.
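The cited paper gives a closed form for how many tokens one target pass yields on average, which makes the dependence on draft quality easy to see. The sketch below evaluates that formula; it assumes each draft token is accepted independently with the same probability alpha, and the function name is illustrative. Tokens per pass is not the same as wall-clock speedup, because the draft model's own cost is not included.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    alpha: probability that a single draft token is accepted (0 to 1)
    k: number of draft tokens proposed per pass
    Formula: (1 - alpha**(k + 1)) / (1 - alpha), with the limit k + 1 at alpha = 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=5):.2f} tokens per target pass")
```

A poorly matched draft model (low alpha) yields little more than one token per pass, which is why draft/target compatibility matters so much.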

Enabling speculative decoding in vLLM: vLLM's current documentation describes speculative decoding as a way to reduce inter-token latency under medium-to-low QPS, memory-bound workloads.

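# Start vLLM server with speculative decoding.
# These flags match older vLLM releases; newer releases take a JSON
# --speculative-config argument instead, so check `vllm serve --help`.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
    --num-speculative-tokens 5 \
    --use-v2-block-manager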

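Speculative decoding pairs well with KV cache optimization. PagedAttention (the memory management system underlying vLLM) manages the KV cache in OS-style virtual pages, which reduces memory fragmentation and allows larger batches on the same GPU; the vLLM paper reports 2–4x higher throughput than the systems it was compared against. Quantizing the KV cache from 16-bit to 8-bit values roughly halves its memory footprint, so treat larger combined savings figures as workload-dependent until you measure them.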
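KV cache size follows directly from the model shape, so it is worth computing before trusting any headline ratio. The sketch below assumes a Llama-3-70B-class configuration (80 layers, 8 key/value heads, head dimension 128); treat those numbers as assumptions to verify against the model's config file. Quantization scale overhead is ignored.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    # Factor 2: one key tensor and one value tensor per layer
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed shape for a 70B-class model with grouped-query attention
shape = dict(layers=80, kv_heads=8, head_dim=128, seq_len=8192, batch=1)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)
print(f"16-bit: {fp16_bytes / 2**30:.2f} GiB, 8-bit: {int8_bytes / 2**30:.2f} GiB per sequence")
```

Under these assumptions a single 8,192-token sequence needs 2.5 GiB of cache at 16-bit precision, and 8-bit storage halves that.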

When to use: Self-hosted inference with GPU compute. Not applicable to managed API usage, because OpenAI, Anthropic, and Google run their own serving stacks and do not expose this setting. Latency-sensitive deployments running at low-to-medium QPS benefit most, consistent with the vLLM guidance above; at high QPS the GPU is already busy and speculation can add overhead.

Worked Example: Choosing the First Cost Lever

Input scenario: A support assistant sends a 900-token system prompt, 1,200 tokens of retrieved documentation, the last 12 chat turns, and a short user question on every request. The team has no cache, no token logging, and no held-out answer-quality set.

Output decision:

| Step | Action | Why this is the first move | Evidence to collect |
|---|---|---|---|
| 1 | Add token logging by section: system, tools, retrieval, history, current user, output. | Without section-level counts, the team cannot tell whether retrieval, history, or output length is the real cost driver. | Median and p95 input/output tokens by route. |
| 2 | Trim chat history to system prompt + recent turns, with a summary only when earlier facts matter. | This is the lowest-risk code change and does not depend on external libraries. | Token reduction plus a manual check of conversations that reference earlier details. |
| 3 | Re-rank or compress retrieved context before the model call. | The 1,200-token retrieval block is likely larger than the answer-relevant span. | Answer pass rate before and after pruning/compression. |
| 4 | Add semantic caching only for stateless FAQ-like routes. | Cache hits are valuable only when reused answers are safe. | Cache hit rate, false-positive rate, and bypass reasons. |
| 5 | Evaluate provider prompt caching for the stable system/tool prefix. | Provider caching and semantic caching solve different problems: one reuses a stable prefix, the other reuses an answer. | Cached-token counts from API usage metadata where available. |

This example deliberately does not produce a dollar-savings claim. The right output is an ordered measurement plan that prevents the team from buying infrastructure before removing waste from the prompt.
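Step 1 of the plan can be sketched as a small aggregation over per-request logs. The section names follow the worked example; the log rows are made-up numbers, and the nearest-rank percentile is one reasonable convention among several.

```python
import statistics

SECTIONS = ("system", "tools", "retrieval", "history", "user", "output")

def p95(values: list) -> int:
    # Nearest-rank percentile: the value at rank ceil(0.95 * n), using integer math
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def summarize(log: list) -> dict:
    """Return {section: (median, p95)} over a list of per-request token counts."""
    summary = {}
    for section in SECTIONS:
        values = [row.get(section, 0) for row in log]
        summary[section] = (statistics.median(values), p95(values))
    return summary

# Hypothetical per-request token counts for one route
request_log = [
    {"system": 900, "tools": 0, "retrieval": 1200, "history": 400,  "user": 30, "output": 180},
    {"system": 900, "tools": 0, "retrieval": 1200, "history": 900,  "user": 25, "output": 220},
    {"system": 900, "tools": 0, "retrieval": 1200, "history": 1500, "user": 40, "output": 150},
    {"system": 900, "tools": 0, "retrieval": 1200, "history": 2200, "user": 35, "output": 300},
    {"system": 900, "tools": 0, "retrieval": 1200, "history": 3100, "user": 20, "output": 260},
]
report = summarize(request_log)
for section, (median, tail) in report.items():
    print(f"{section:<10} median={median} p95={tail}")
```

With counts like these in hand, the team can see which section dominates before deciding whether trimming, pruning, or caching should come first.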

Failure and Limitation Table

| Failure mode | Symptom | Likely cause | Fix |
|---|---|---|---|
| Bad semantic-cache hit | The assistant gives a plausible answer for the wrong account, product, or time period. | Similarity threshold is too low, or user-specific queries were cached. | Cache only stateless routes, raise threshold, and log false positives for review. |
| Compression removes the answer | RAG answer quality drops after token reduction. | The compressor optimized token count instead of task-relevant evidence. | Re-rank chunks first, preserve citations and entities, and evaluate against held-out questions. |
| History trimming breaks continuity | The model forgets a constraint the user gave earlier. | Last-N trimming discarded a still-relevant turn. | Summarize durable constraints or retrieve conversation facts by entity. |
| Prompt caching does not activate | Provider usage metadata shows no or few cached tokens. | The stable prefix starts too late, changes between calls, or is below the provider threshold. | Move stable instructions, tool schemas, and policies to the front of the prompt. |
| Speculative decoding underperforms | GPU throughput or latency gets worse after enabling it. | Draft model overhead exceeds verification savings for the workload. | Test under real QPS, prompt length, and output length before rollout. |

Combining the Techniques: Expected Impact

The four strategies target different layers of the cost stack:

| Technique | Cost Layer | Source or local signal | Implementation Effort |
|---|---|---|---|
| Conversation trimming | Input tokens | Effloow local toy run: 53% fewer approximate tokens | Low (1 function) |
| Semantic caching | API calls | Paper: 61.6%–68.8% hit rates; Effloow toy run: 2/5 hits | Medium (Redis + embeddings) |
| Prompt compression (LLMLingua) | Input tokens | LLMLingua and LongLLMLingua report benchmark-specific compression/cost results | Medium (library install + tuning) |
| Speculative decoding | Inference latency/cost | Speculative-decoding paper reports 2x–3x acceleration in its evaluated setup | High (self-hosted only) |

Do not add the percentages together. The techniques affect different layers, but their benefits overlap. Your actual reduction depends on workload repetition rate, context verbosity, conversation length, output length, provider caching behavior, and whether you self-host inference.
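The arithmetic behind that warning is short. When one lever acts on what the previous lever left behind, the reductions compound multiplicatively rather than adding. The sketch below plugs in the toy-run figures purely as example inputs, under the simplifying assumption that each lever applies to all remaining input tokens; it is not a savings estimate.

```python
def combined_reduction(reductions: list) -> float:
    """Overall fraction removed when each reduction applies to what is left."""
    remaining = 1.0
    for fraction_removed in reductions:
        remaining *= 1.0 - fraction_removed
    return 1.0 - remaining

# Example inputs only: 40% of calls served from cache, 53% fewer tokens on the rest
naive_sum = 0.40 + 0.53
compounded = combined_reduction([0.40, 0.53])
print(f"naive sum: {naive_sum:.0%}, compounded: {compounded:.1%}")
```

Even this compounded figure is optimistic, since trimming only touches the history portion of a prompt and cached routes are often the cheapest ones.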

When to Use / When to Skip

Use this playbook when you already have sustained LLM traffic, a measurable token bill, and at least a small quality check for the answers you care about. It is especially useful for support assistants, documentation Q&A, internal coding helpers, RAG summarizers, and agent workflows that reuse long system prompts or tool schemas.

Skip the heavier parts when the bill is still small, traffic is bursty, or answer quality is not yet stable. In that case, do only the cheap controls: log tokens, cap output length, trim obvious stale history, and keep a short list of expensive routes to revisit. Also skip semantic caching for personalized answers and skip compression for legal, safety, or compliance text where exact wording matters.

A Practical Prioritization Order

Start with the highest-leverage, lowest-effort changes first.

Week 1: Audit your token usage. Add tiktoken token counting to every API call and log input/output token splits by section. Do not assume conversation history is the biggest waste until the logs show it.

Week 2: Apply conversation trimming. Implement last-N trimming for all multi-turn endpoints. This is a one-function change with immediate effect. Set max_tokens on every API call to cap output tokens.

Week 3: Add semantic caching. Deploy a Redis instance with redis-py and a sentence-transformer model (e.g., all-MiniLM-L6-v2 from Hugging Face, free and fast). Cache embeddings with a 24-hour TTL. Tune the similarity threshold on a sample of your query logs.

Week 4: Evaluate prompt compression for RAG. If your RAG chunks average more than 400 tokens, install LLMLingua and run compression at rate=0.5 against a held-out evaluation set. Deploy if quality holds.

Later: Speculative decoding. Only relevant if moving to self-hosted inference. Evaluate when your managed API spend exceeds the cost of GPU compute.

When to skip this entirely. If your monthly LLM bill is under a few hundred dollars, or your traffic is low and bursty, the engineering time to build caching and compression will cost more than it saves. These techniques pay back on sustained volume. Below that, set max_tokens, trim conversation history, and revisit the rest when spend actually compounds.

Common Mistakes

Setting cache similarity thresholds too low. A threshold of 0.70 cosine similarity will return cached responses for semantically different questions, causing answer quality problems. Start at 0.85 and decrease only after reviewing false positives.

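Compressing few-shot examples too aggressively. LLMLingua can compress demonstrations as well as retrieved context, but task instructions and brittle few-shot examples often depend on exact wording. Start by compressing only the variable retrieval context; keep system prompts and few-shot examples intact until your evaluation set shows they survive compression.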

Trimming without preserving user-referenced context. If a user says "going back to what I said earlier about the API rate limit..." and that turn has been trimmed, the model will hallucinate or ask a clarifying question. Add relevance scoring: keep or boost older turns that contain entities referenced in the current message.

Ignoring output token length. Optimizing input tokens while leaving output uncapped means the model can run verbose. Always set max_tokens and include length guidance in the system prompt: "Respond concisely. 2–3 sentences unless the question requires more."

FAQ

Q: Does semantic caching break for personalized responses?

Yes. Semantic caching returns stored responses verbatim, which breaks for queries where the answer depends on user-specific context (account data, preferences, prior state). Cache only the subset of your query space that is genuinely stateless — FAQ patterns, documentation lookups, generic coding help. Personalized queries should bypass the cache entirely.
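A bypass rule like this can live in a small gate in front of the cache. The sketch below combines a route allowlist with a time-to-live so stale answers expire. The route names are hypothetical, and the store is an in-memory, exact-key dictionary used only to keep the example self-contained; a real deployment would pair the gate with the embedding lookup from Strategy 1 and a shared store.

```python
import time

STATELESS_ROUTES = {"faq", "docs", "generic_coding"}  # hypothetical route names
TTL_SECONDS = 24 * 60 * 60                            # 24-hour expiry

def is_cacheable(route: str, has_user_context: bool) -> bool:
    # Personalized or stateful requests always bypass the cache
    return route in STATELESS_ROUTES and not has_user_context

class TTLCache:
    def __init__(self, ttl: float, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._items = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key: str):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired entries are removed on read
            return None
        return value

now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(ttl=TTL_SECONDS, clock=lambda: now[0])
cache.put("how do i reset my password", "Use the reset link on the sign-in page.")
fresh = cache.get("how do i reset my password")
now[0] += TTL_SECONDS + 1
expired = cache.get("how do i reset my password")
print(fresh, "|", expired)
```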

Q: How much do these techniques affect output quality?

Semantic caching returns an exact cached response, so the risk is not generation drift; the risk is serving that response to the wrong query. Conversation trimming is safe only when the trimmed turns are not referenced in the current query. LLMLingua's reported accuracy impact is benchmark-specific, so validate it against your own held-out examples. Speculative decoding is distribution-preserving in the cited paper's formulation. Monitor your task-specific evaluation metrics, not just cost, after each technique is deployed.

Q: Can I apply prompt compression to system prompts?

Technically yes, but practically risky. System prompts contain behavioral instructions that need to survive compression intact. LLMLingua's force_tokens parameter can preserve critical instruction words, but compressing system prompts is harder to validate. Focus compression efforts on retrieved context and conversation history first.

Q: Does Anthropic's prompt caching overlap with semantic caching?

They are different mechanisms. Anthropic prompt caching and OpenAI prompt caching apply provider-side discounts or latency improvements to repeated prompt prefixes. Semantic caching caches the full response for semantically similar queries client-side. Both can be useful, but they are not substitutes. If your provider supports prompt caching, structure long stable instructions, tool schemas, or policy text near the beginning of the prompt, then apply semantic caching only to repeated stateless query patterns.
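The effect of prompt ordering on prefix reuse can be shown with a character-level proxy. Providers match on token prefixes and apply their own minimum lengths, so the sketch below only illustrates the ordering principle; the prompt blocks and function names are made up.

```python
def build_prompt(first_blocks: list, later_blocks: list) -> str:
    # Blocks are joined in the order given; what comes first forms the prefix
    return "\n\n".join(first_blocks + later_blocks)

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        count += 1
    return count

stable = ["SYSTEM: You are a support assistant.", "TOOLS: search_docs, open_ticket"]
request_a = ["USER: How do I reset my password?"]
request_b = ["USER: Where is my invoice?"]

# Stable content first: both requests share the whole system/tool prefix
stable_first = shared_prefix_len(build_prompt(stable, request_a),
                                 build_prompt(stable, request_b))
# Dynamic content first: the prompts diverge almost immediately
dynamic_first = shared_prefix_len(build_prompt(request_a, stable),
                                  build_prompt(request_b, stable))
print(stable_first, dynamic_first)
```

The same logic applies to timestamps, request IDs, or user names placed at the top of a system prompt: anything that varies per request should come after the stable blocks.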

Bottom Line

Conversation trimming and semantic caching are usually the first checks because they remove repeated input and repeated calls before you change models or infrastructure. Add LLMLingua for RAG-heavy workloads after a quality check, and reserve speculative decoding for self-hosted deployments where decoding is the bottleneck.

What Effloow Added

The four techniques here come from published research, but research papers report results on the authors' benchmarks, not yours. What Effloow added is a reproduction step for three mechanisms: a minimal local sandbox that demonstrates semantic-cache matching, history trimming, and RAG pruning without claiming production savings.

  • Semantic caching: the lab run used a 5-query set with two near-duplicate pairs and measured a 40% hit rate, including the exact similarity score (0.86) that triggered a cache hit. That demonstrates the matching workflow, not a production cache target.
  • History trimming: the lab run counted approximate tokens on a 7-turn sample conversation and reduced the prompt from 97 to 46 tokens. It did not evaluate answer quality.
  • Prioritization: the guide orders the four techniques by leverage-per-effort into a four-week rollout and adds an explicit skip threshold so low-volume teams do not over-engineer.

The raw commands and outputs are kept in our lab notes so the figures are checkable rather than asserted.

To put real numbers on your own workload — and to see the point where self-hosting beats the API spend you're optimizing — try our free API vs self-hosting cost calculator.
