Articles, one at a time.
Every piece here was commissioned, drafted, reviewed in public, and merged. No content mills, no auto-published slop.
Adaptive KV-Cache Quantization: How 'Don't Waste Bits' Cuts On-Device LLM Latency by 17%
The 'Don't Waste Bits' paper (arXiv:2604.04722) shows that adaptive per-token KV precision beats static quantization, cutting latency by 17.75% and gaining 7.6 accuracy points. Here's how it works.
Read →
DeepSeek-V3-0324: Open-Source Coding Model Developer Guide
Complete developer guide to DeepSeek-V3-0324: architecture, API integration, function calling, benchmarks, and self-hosting on Ollama or vLLM.
Read →
Gemma 4 MTP Drafters: How Multi-Token Prediction Delivers 2x+ Faster Local Inference
Google's Gemma 4 MTP drafters (released May 2026) deliver 1.7x–2.2x inference speedup on typical developer hardware without changing output quality. Here's how to use them.
Read →
Mastra AI 1.0: The TypeScript Agent Framework Developers Are Actually Shipping
Mastra 1.0 is the TypeScript framework for production AI agents, bundling agents, memory, workflows, RAG, and evals in one package. Here's how it works.
Read →
Qwen 3.6 Plus: 1M Context Coding Agent Developer Guide
Qwen 3.6 Plus offers 1M context, always-on CoT, and the #1 Terminal-Bench 2.0 score at 61.6%. This guide covers API setup, DashScope pricing, preserve_thinking, and vLLM self-hosting.
Read →
SpecKV: Adaptive Speculative Decoding with Dynamic Gamma
SpecKV (arXiv:2605.02888) shows that a fixed γ=4 costs 56% in throughput. This guide covers adaptive gamma, KV cache compression effects, and vLLM production tuning.
Read →
Agent Test-Time Scaling Has a Ceiling: CMU Research 2026
CMU's General AgentBench finds that giving agents more turns often hurts. Learn why the context ceiling and the verification gap limit test-time scaling for LLM agents.
Read →
Cloudflare Dynamic Workers: V8 Sandbox for AI Agent Code
Cloudflare Dynamic Workers lets AI agents execute generated code in V8 isolates, 100× faster than containers. Here's how the architecture works.
Read →
Kimi K2.6: The Open 1T-Param Model for Agentic Coding
Moonshot AI's Kimi K2.6 hits 80.2% on SWE-bench with 300 parallel sub-agents. This developer guide covers architecture, API access, Kimi Code CLI, and self-hosting.
Read →
Claude Opus 4.7: High-Res Vision, Task Budgets, and Agentic Coding
Claude Opus 4.7 ships 2576px image support and task budget control, and scores 87.6% on SWE-bench Verified. Here's what changed and how to use it in 2026.
Read →
Cloudflare Code Mode MCP: Entire API in 1,000 Tokens
Cloudflare's Code Mode MCP server covers 2,500+ API endpoints in ~1,000 tokens via two tools: search() and execute(). Here's how it works and how to use it.
Read →
GPT-Rosalind: OpenAI's Purpose-Built Drug Discovery Model
GPT-Rosalind is OpenAI's first domain-specific life sciences model, scoring 0.751 on BixBench and outperforming human experts on RNA sequence tasks.
Read →