Articles, one at a time.
Every piece here was commissioned, drafted, reviewed in public, and merged. No content mills, no auto-published slop.
AutoExperiment: Testing AI Agents on Research Replication
How CMU's AutoExperiment benchmark uses progressive code masking to measure AI agents' ability to replicate ML research from paper descriptions alone.
Read →
Claude Design: Developer Guide to Anthropic's Prototyping Tool
Claude Design generates production HTML/CSS/JS from prompts and hands off to Claude Code. Here's what developers need to know before trying it.
Read →
Code with Claude 2026: Managed Agents, Dreaming & AWS
Full developer recap of Code with Claude 2026: Dreaming, Outcomes, multiagent orchestration, Agent View, and Claude Platform on AWS.
Read →
Computer-Use Agents in 2026: From Demo to Developer Tool
A practical guide to computer-use agents, with a local Effloow Lab browser-control PoC and production safety patterns.
Read →
Multi-Agent LLM Topology Diagnostics with Successor Representation
How to diagnose chain, star, and mesh LLM agent topologies before inference using spectral analysis. PoC of arXiv 2605.11453 in pure NumPy.
Read →
Bifrost: Go-Based LLM Gateway — 50x Faster Than LiteLLM (2026)
Bifrost is an Apache 2.0 Go gateway that puts 20+ LLM providers behind one OpenAI-compatible API. Scout review: what makes it fast, when it's worth adding, and when LiteLLM is still the right call.
Read →
DeepSeek V4-Pro: MIT Frontier Model Developer Guide 2026
DeepSeek V4-Pro is a 1.6T-parameter, MIT-licensed model that scores 80.6% on SWE-bench. This guide covers API setup, pricing vs GPT-5.5, and self-hosting options.
Read →
LangGraph v1.2: DeltaChannel, Per-Node Timeouts, and Error Handlers
A LangGraph v1.2 sandbox PoC for DeltaChannel state, async node timeouts, NodeTimeoutError recovery, and production agent limits.
Read →
LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)
LMR-BENCH (EMNLP 2025) benchmarks LLM agents on reproducing code from 23 NLP papers. This PoC explains the masking methodology, evaluation axes, and what the results mean for AI-assisted research.
Read →
Microsoft Agent Framework 1.6 with Anthropic: Python PoC
Verified sandbox PoC: install MAF 1.6.0, wire up AnthropicClient, build a Workflow with @step decorators, and use DevUI to inspect agent execution.
Read →
RRO: Train LLM Agents Without Expensive PRMs
RRO replaces costly process reward model exploration with a rising-reward filter. Effloow Lab reproduced the trajectory selection logic in Python.
Read →
Claude Code Hooks: Security Gates for Agent Workflows
Build Claude Code hooks that block risky Bash commands and format files after edits, with a local Effloow Lab sandbox PoC.
Read →