Comment and Control: One Prompt Injection Pattern Hijacked Claude Code, Gemini CLI, and GitHub Copilot Agent — And the Bug Bounties Totaled $1,937
May 5, 2026 · 13 min read · Security, Prompt Injection, Claude Code, GitHub Copilot, Gemini CLI, AI Agents, CI/CD
On April 16, 2026, three researchers — Aonan Guan, Zhengyu Liu, and Gavin Zhong — published Comment and Control, the first public cross-vendor demonstration that a single prompt-injection pattern hijacks Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent. The vehicle is ordinary GitHub data: PR titles, issue bodies, comments, HTML comments. The result, in all three cases, is unauthenticated credential exfiltration through GitHub itself. Total bounty paid by Anthropic, Google, and GitHub combined: $1,937. What follows is the cost shape this leaves for every team running coding agents in CI in May 2026, which nobody has costed yet.
The Disclosure Timeline
Anthropic received the report on 2025-10-17, paid $100, and classified it Critical (CVSS 9.4) before downgrading it to "None" on 2026-04-20. Google received its report on 2025-10-29 and paid $1,337 via the Vulnerability Reward Program (#1609699). GitHub received its report on 2026-02-08, initially closed it as Informative / known limitation, then reopened it after researcher pushback and paid $500. Anthropic's mitigation note explicitly states the GitHub Action "is not designed to be hardened against prompt injection." The product sold as "Claude Code Security Review" — an AI security tool — is not designed to withstand a security attack on its input surface.
The Attack Pattern
Every vulnerable agent runs as a GitHub Action workflow that auto-fires on pull_request, issues, or issue_comment events. The agent receives untrusted input and interpolates it into the prompt context without sanitization. The agent has access to repo secrets via environment variables: ANTHROPIC_API_KEY, GEMINI_API_KEY, GITHUB_TOKEN, plus whatever else the repo wires in. The attacker's payload tells the agent to read those environment variables, base64-encode them, and write them to a file or surface them in a security finding. The base64 step defeats GitHub's existing secret scanner: ghs_vzCpUDPykaEBOiirw1QSUuuUDjsRok1ByMZz becomes Z2hzX3Z6Q3BVRFB5a2FFQk9paXJ3MVFTVXV1VURqc1JvazFCeU1aeg==, which matches no secret-scanning regex.
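The evasion step is easy to reproduce. A minimal sketch, assuming a simplified ghs_ matcher — GitHub's actual secret-scanning rules are more elaborate, but any plaintext pattern matcher has the same blind spot:

```python
import base64
import re

# Illustrative pattern for a GitHub App installation token; real secret
# scanners use richer rules, but they all match the *plaintext* shape.
GHS_PATTERN = re.compile(r"ghs_[A-Za-z0-9]{36}")

leaked = "ghs_vzCpUDPykaEBOiirw1QSUuuUDjsRok1ByMZz"  # example token from the write-up
encoded = base64.b64encode(leaked.encode()).decode()

print(GHS_PATTERN.search(leaked) is not None)   # True: raw token is caught
print(GHS_PATTERN.search(encoded) is not None)  # False: one base64 pass evades it
print(encoded)
```

One encoding pass is all it takes; the scanner never sees anything resembling a token.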
Vendor Response Divergence
Anthropic added --disallow-tools 'Bash(ps:*)' and hardened the bash sandbox in Claude Code v2.1.113 (April 17). Google added new guardrail prompts to the Gemini CLI Action system prompt. GitHub disclosed an internal mitigation in the Copilot Agent runtime without giving specifics. None of the three treats prompt injection at the input surface as a vendor responsibility. Each pushes the boundary of "what we secure" inward, leaving the gap between "the model" and "the runtime around the model" as the customer's problem.
Bounty-to-Risk Asymmetry
$1,937 across three vendors for cross-vendor credential exfiltration of any GitHub-integrated coding agent. The same researcher pool routinely earns five-figure bounties for less impactful application-layer bugs. The asymmetry is structural: the AI agent runtime is so new that bounty programs have not adjusted scoring, and "prompt injection" is still being treated as an inevitable property of LLMs rather than a runtime-isolation failure. A $100 bounty does not deter a credential-theft researcher who can resell the same access pattern to a state actor or a financial-fraud ring at a five- to six-figure markup.
The Cost Framing for Teams Running CI Agents
Three postures, three cost shapes. Status quo (no remediation): $0 incremental, but exposed to credential theft and deployment-pipeline hijack. Hook-layer audit (pre-tool gates, environment-variable scrubbing, base64 exfiltration detection): $8K-$30K one-time engineering, $500-$2K/month ongoing. Workflow isolation (a separate ephemeral GitHub App identity per agent run, scoped tokens, no inherited secrets): $25K-$60K one-time, $1K-$3K/month ongoing. The honest baseline most teams started May 2026 in is the first posture.
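The first-year arithmetic, using the ranges above and assuming twelve months of ongoing cost and nothing else:

```python
# First-year cost envelope for the three postures, from the ranges in the text:
# one-time engineering plus 12 months of ongoing cost.
postures = {
    "status quo":         {"one_time": (0, 0),           "monthly": (0, 0)},
    "hook-layer audit":   {"one_time": (8_000, 30_000),  "monthly": (500, 2_000)},
    "workflow isolation": {"one_time": (25_000, 60_000), "monthly": (1_000, 3_000)},
}

for name, p in postures.items():
    low = p["one_time"][0] + 12 * p["monthly"][0]
    high = p["one_time"][1] + 12 * p["monthly"][1]
    print(f"{name}: ${low:,}-${high:,} first year")
# hook-layer audit lands at $14,000-$54,000; workflow isolation at $37,000-$96,000
```

Even the top of the workflow-isolation range is a fraction of one credential-theft incident's cleanup cost.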
The Pre-Tool Hook Surface Is Now a Control Plane
Every coding agent now exposes a pre-tool hook surface that fires before a requested tool call executes. .cursor/hooks.json, --disallow-tools in Claude Code, Codex CLI sandbox modes, Claude Agent SDK permissions. These were perf and ergonomic controls before Comment and Control. They are now the only control plane that fires before a hijacked agent executes the attacker's instruction. Minimum hook posture: deny destructive shell calls, scrub environment variables before agent spawn, detect base64-encoded secrets in agent output (deterministic prefixes: Z2hzX for ghs_, c2st for sk-, QUtJQ for AKIA, eWEyOS for ya29.), audit every tool invocation to an immutable log, and treat external PR/issue/comment events as untrusted by default.
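The prefixes are deterministic because base64-encoding a fixed plaintext prefix always yields the same leading characters, regardless of what follows. A minimal gate sketch — pre_tool_gate and its deny list are illustrative, not any vendor's actual hook API:

```python
import base64
import re

# 6 bits per base64 char means the first (bytes * 8 // 6) output chars
# depend only on the plaintext prefix, never on what follows it.
SECRET_PREFIXES = [b"ghs_", b"sk-", b"AKIA", b"ya29."]

def deterministic_b64_prefix(plain: bytes) -> str:
    # Pad with zero bytes so b64encode has full blocks, then keep only
    # the characters that are independent of the continuation.
    full = base64.b64encode(plain + b"\x00\x00").decode()
    return full[: (len(plain) * 8) // 6]

ENCODED_PREFIXES = [deterministic_b64_prefix(p) for p in SECRET_PREFIXES]

# Illustrative destructive-shell deny list; extend per your own catalog.
DESTRUCTIVE_SHELL = re.compile(r"\b(rm|curl|wget|ps|nc)\b")

def pre_tool_gate(tool: str, arg: str) -> str:
    """Hypothetical hook decision: 'deny', 'flag', or 'allow'."""
    if tool == "bash" and DESTRUCTIVE_SHELL.search(arg):
        return "deny"
    if any(pfx in arg for pfx in ENCODED_PREFIXES):
        return "flag"  # likely base64-encoded credential in agent output
    return "allow"
```

The same prefix computation generalizes to any secret format with a fixed plaintext header, so the deny list grows with your secret inventory rather than with attacker creativity.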
Five Lessons From Comment and Control
One: the model layer is not the attack surface — the runtime around the model is. Two: bounty programs underprice cross-vendor pattern attacks. Three: pre-tool hooks are now production controls, not perf knobs. Four: base64 in agent output is a secret-leak signal that existing scanners cannot see. Five: auto-firing GitHub Actions on external events is the highest-risk posture, and it is the default config in every coding agent's GitHub Action template.
Routing Playbook for May 2026
Inventory every coding-agent GitHub Action in your org. For each, audit the trigger model — anything firing on pull_request_target, issues, or issue_comment from external authors is highest risk. Add the pre-tool deny list (Bash(ps:*), Bash(curl:*), Bash(rm:*), extend per the destructive-shell catalog). Scrub environment variables to pass only what each agent specifically needs; use GitHub environments and OIDC for short-lived tokens. Add base64-prefix detection at the audit-log layer. Track agent-run cost separately from agent-run security cost — the two grow with fleet size at different rates.
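The environment-scrubbing step can be sketched as an allowlisted spawn wrapper. The agent names and variable lists below are illustrative assumptions, not the Actions' documented requirements:

```python
import os
import subprocess

# Per-agent allowlists: pass each agent only what it specifically needs.
# These entries are illustrative; derive yours from each Action's docs.
AGENT_ENV_ALLOWLIST = {
    "claude-code": {"ANTHROPIC_API_KEY", "GITHUB_TOKEN"},
    "gemini-cli":  {"GEMINI_API_KEY", "GITHUB_TOKEN"},
}

def scrubbed_env(agent: str) -> dict:
    """Build a minimal environment instead of inheriting the runner's."""
    allowed = AGENT_ENV_ALLOWLIST[agent]
    env = {k: v for k, v in os.environ.items() if k in allowed}
    env["PATH"] = os.environ.get("PATH", "/usr/bin:/bin")  # keep tools resolvable
    return env

def spawn_agent(agent: str, cmd: list) -> subprocess.CompletedProcess:
    # env=... *replaces* the inherited environment rather than extending it,
    # so anything not allowlisted never reaches the agent process.
    return subprocess.run(cmd, env=scrubbed_env(agent), capture_output=True, text=True)
```

Pair this with GitHub environments and OIDC so even the allowlisted tokens are short-lived.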
Track agent runtime spend across Claude Code, Cursor, Copilot, and Codex in one dashboard: brew install burnrate-dev/tap/burnrate
Sources: Aonan Guan, Zhengyu Liu, Gavin Zhong "Comment and Control: Prompt Injection to Credential Theft in Claude Code, Gemini CLI, and GitHub Copilot Agent" (April 16, 2026); SecurityWeek "Claude Code, Gemini CLI, GitHub Copilot Agents Vulnerable to Prompt Injection via Comments"; VentureBeat "Three AI coding agents leaked secrets through a single prompt injection"; The Register "Anthropic, Google, Microsoft paid AI bug bounties – quietly"; Claude Code v2.1.113 changelog (April 17, 2026 bash sandbox hardening); Adversa AI Top Agentic AI Security Resources May 2026; Salt Security 1H 2026 State of AI and API Security Report; Help Net Security; Microsoft Open Source Agent Governance Toolkit (April 2, 2026).