Articles, one at a time.
Every piece here was commissioned, drafted, reviewed in public, and merged. No content mills, no auto-published slop.
SpecKV: Adaptive Speculative Decoding with Dynamic Gamma
SpecKV (arXiv:2605.02888) shows that a fixed γ=4 costs 56% throughput. Adaptive gamma, KV cache compression effects, and a vLLM production tuning guide.
Read →
E2B Sandbox: Secure Code Execution for AI Agents
Add secure sandboxed code execution to AI agents with E2B. Firecracker microVM isolation, Python/JS SDKs, MCP support, and source-checked limits.
Read →
RAGFlow: Self-Host a Deep-Document RAG Engine
Step-by-step guide to self-hosting RAGFlow v0.25 with Docker Compose — deep document understanding, chunking strategies, MCP server, and the Python SDK.
Read →
Xiaomi MiMo-V2.5-Pro: Open-Source 1T Coding Agent Guide 2026
MiMo-V2.5-Pro: MIT-licensed 1T-param MoE model matching Claude Opus 4.6 on SWE-bench at 8x lower API cost. Benchmarks, API setup, and self-hosting guide.
Read →
Token Optimization for Production LLMs: Cut Costs Effectively
Four research-backed token optimization techniques for production LLMs: semantic caching, prompt compression, context pruning, and speculative decoding.
Read →
On-Device AI 2026: Developer Guide to NPUs and Edge Inference
A practical 2026 guide to on-device AI: NPU vs GPU vs CPU for LLM inference, Apple M5 MLX, Qualcomm X Elite, Core AI for iOS 27, and edge deployment.
Read →
DeepSeek V4-Pro and V4-Flash: Migration Guide and API Setup
DeepSeek V4-Pro (1.6T MoE, 1M context) and V4-Flash, released April 2026. Migrate before the July 24 deadline. Full API guide, benchmarks, and pricing.
Read →
Kimi Code K2.6: Moonshot AI's Coding Model vs Claude Code
Kimi Code K2.6 review: 58.6% SWE-Bench Pro, 300-agent swarms, $0.60/M input. How it compares to Claude Code in real-world coding tasks.
Read →
LLM Inference Engines Compared 2026: vLLM vs SGLang vs TGI vs MAX
A source-verified 2026 decision guide for vLLM, SGLang, TGI, and MAX, with use/skip guidance and deployment tradeoffs.
Read →
Qwen3.6-Plus: 1M Token Context and Claude-Level Performance
Alibaba's Qwen3.6-Plus: 1M token context, agentic coding, hybrid MoE, ~$0.29/M input. Sourced benchmarks vs Claude Opus 4.7 and a when-to-skip guide.
Read →
Hermes Agent Review: Self-Improving Open-Source AI Agent
Hermes Agent review: a fast-growing open-source AI agent that learns your workflow — self-improving skills, three-layer memory, setup, pricing.
Read →
Fine-Tune LLMs with LoRA and QLoRA: 2026 Guide
Learn to fine-tune LLMs with LoRA and QLoRA in 2026. VRAM requirements, dataset prep, Unsloth/Axolotl setup, hyperparameters, and evaluation.
Read →