Claude Sonnet 4.6 crushes benchmarks but thinks it’s DeepSeek when you prompt in Chinese


Reddit users report Claude Sonnet 4.6 thinks it’s DeepSeek-V3 when prompted in Chinese, a prompt injection flaw that’s gone unpatched since July 2025. This isn’t a feature. It’s a vulnerability that exposes the hidden cost of “adaptive reasoning” just as Anthropic’s newest model crushes benchmarks but ships with production bugs the company won’t discuss.

Claude Sonnet 4.6 dominates benchmarks but ships with production bugs Anthropic won’t discuss

The numbers are real. Claude Sonnet 4.6 hit 79.6-80.2% on SWE-Bench Verified across 10 trials in February 2026, nearly double DeepSeek V3’s 42%. It beat Sonnet 4.5 by 15 percentage points on Box enterprise reasoning Q&A. The 200k-token context window dwarfs DeepSeek’s 128k. On paper, this is the model that justifies Anthropic’s premium.

Then you run it in production.

Autonomous content generation yields nonsensical output like “Free neighborhoods” on a Rome travel page. Layout bugs persist: a fourth column rendering below three columns, save buttons appearing on already-saved items. These failures don’t show up in cheaper models like Kimi K 2.5 or GLM5. The adaptive reasoning feature that powers those benchmarks also produces 10-15+ minute thinking times that yield “stunning sections coexisting with amateur ones,” per AIchievable testing, a problem that affects Claude Code production workflows differently than chatbot use cases.
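For teams hit by those long thinking times, one mitigation worth testing is capping the reasoning budget per request. The sketch below uses the `thinking` parameter Anthropic documents for extended thinking on earlier Claude models; whether Sonnet 4.6’s adaptive reasoning honors the same cap, and the exact model ID string, are assumptions.

```python
# Minimal sketch: capping the extended-thinking budget via the Anthropic Messages API.
# The `thinking` parameter comes from Anthropic's extended-thinking docs for earlier
# Claude models; the model ID and its effect on Sonnet 4.6's adaptive reasoning are
# assumptions, not confirmed behavior.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",                            # assumed model ID
    max_tokens=16000,                                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # hard cap on reasoning tokens
    messages=[{"role": "user", "content": "Draft the intro section for a Rome travel page."}],
)

# Thinking blocks and text blocks arrive separately; keep only the final text.
text = "".join(block.text for block in response.content if block.type == "text")
print(text)
```

If the cap works as documented, it trades some benchmark-winning deliberation for predictable latency, which is usually the right trade in a production pipeline.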

Anthropic reported elevated errors affecting Claude Opus 4.6 (same model family) on March 2, 2026. No explanation. No timeline for fixes. Just an incident report that confirms what users already knew: adaptive reasoning breaks in ways closed models can’t control.

The 97% cost gap is a myth: DeepSeek V3 matches Claude’s blended pricing

Here’s where the narrative collapses. DeepSeek V3 isn’t 97% cheaper in absolute terms when you check the math. Input costs: $0.27/1M tokens vs. Claude’s $3.00 (11x difference). Output: $1.10/1M vs. $15.00 (13.6x difference). But blended production workloads bring both models closer to comparable costs depending on input/output ratios. The “open-source is cheap” story only holds if you ignore how teams actually use these models.
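For teams checking the math against their own traffic, here is a minimal blended-cost calculator using the per-million-token prices quoted above. The input/output splits are illustrative assumptions, not measured workloads; plug in your own ratio.

```python
# Blended cost per 1M tokens for a given input/output token mix.
# Prices are the per-million-token figures quoted above; the workload splits
# below are illustrative assumptions.

def blended_cost(input_price: float, output_price: float, input_share: float) -> float:
    """Cost per 1M tokens when `input_share` of tokens are input and the rest are output."""
    return input_price * input_share + output_price * (1.0 - input_share)

claude = {"input": 3.00, "output": 15.00}    # Claude Sonnet, $/1M tokens
deepseek = {"input": 0.27, "output": 1.10}   # DeepSeek V3, $/1M tokens

for share in (0.95, 0.80, 0.50):  # retrieval-heavy, typical, generation-heavy mixes
    c = blended_cost(claude["input"], claude["output"], share)
    d = blended_cost(deepseek["input"], deepseek["output"], share)
    print(f"{share:.0%} input: Claude ${c:.2f}/1M vs. DeepSeek ${d:.2f}/1M ({c / d:.1f}x)")
```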

The real gap is architectural. DeepSeek V3 runs 671B total parameters but activates only 37B per token, a Mixture-of-Experts design that delivers throughput without burning budgets. What it sacrifices: 72k fewer tokens of context, no documented enterprise reasoning improvements. What it gains: transparency (open weights, published architecture) vs. Claude’s black-box adaptive reasoning, a dynamic that mirrors DeepSeek’s open-source economics across the entire model family.
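To see why per-token cost tracks active rather than total parameters, here is a toy top-k Mixture-of-Experts router. The dimensions and expert count are made-up illustration values, not DeepSeek V3’s actual configuration, which also uses finer-grained and shared experts.

```python
# Toy Mixture-of-Experts layer: only the top-k scored experts run for each token,
# so per-token compute scales with active parameters, not total parameters.
# Sizes and expert counts are illustrative, not DeepSeek V3's real configuration.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_FF = 64, 256       # toy hidden and feed-forward sizes
N_EXPERTS, TOP_K = 16, 2      # only TOP_K of N_EXPERTS fire per token

# Each expert is a small two-layer MLP; together they hold most of the layer's parameters.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts only."""
    logits = x @ router_w                               # score every expert
    top = np.argsort(logits)[-TOP_K:]                   # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                            # softmax over the selected experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):                    # only these experts' weights are touched
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)       # ReLU MLP stand-in for the expert
    return out

token = rng.standard_normal(D_MODEL)
_ = moe_layer(token)
print(f"active experts per token: {TOP_K}/{N_EXPERTS} "
      f"({TOP_K / N_EXPERTS:.0%} of expert parameters touched)")
```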

HackerNews users are making the calculation in real time. “With 4.6 I switched from Opus to Sonnet,” one developer noted. “I can see myself moving to Deepseek in 6-12 months.”

Prompt injection weirdness exposes the hidden cost of ‘adaptive reasoning’

The Chinese prompt impersonation bug surfaced in July 2025. Still no patch. No security researcher analysis. No Anthropic response. For global teams running non-English workflows, this breaks reliability in ways benchmarks don’t measure; it’s a prompt injection vulnerability that suggests adaptive reasoning creates attack surfaces closed models can’t afford.
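Teams that depend on non-English workflows don’t have to wait for a patch to know whether they’re exposed. A minimal regression check, sketched below, asks the model in Chinese who built it and flags any reply that claims another vendor’s identity. The model ID string, the probe wording, and the suspect-name list are all assumptions for illustration.

```python
# Minimal regression check for the reported identity drift: probe the model in Chinese
# and flag replies that claim to be another vendor's model. The model ID, the probe
# text, and the suspect names are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

PROBE = "你是什么模型？是谁开发了你？"    # "What model are you? Who developed you?"
SUSPECT_NAMES = ("deepseek", "深度求索")  # DeepSeek's English and Chinese names

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model ID
    max_tokens=256,
    messages=[{"role": "user", "content": PROBE}],
)

reply = "".join(block.text for block in response.content if block.type == "text")
if any(name in reply.lower() for name in SUSPECT_NAMES):
    print("FAIL: model claims a different identity:\n" + reply)
else:
    print("PASS:\n" + reply)
```

Wiring a probe like this into CI turns “weirdness on Reddit” into a reproducible pass/fail signal you can track across model updates.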

The production failures (layout bugs, nonsensical content) and the impersonation weirdness share a root cause: adaptive reasoning optimizes for benchmark performance but introduces unpredictable failure modes. Premium pricing assumes premium reliability when choosing between proprietary models. The data suggests otherwise.

“With 4.6 I switched from Opus to Sonnet.” “I can see myself moving to Deepseek in 6-12 months.” Proprietary models are winning benchmarks while open-source models are winning trust.

Alex Morgan
I write about artificial intelligence as it shows up in real life, not in demos or press releases. I focus on how AI changes work, habits, and decision-making once it’s actually used inside tools, teams, and everyday workflows. Most of my reporting looks at second-order effects: what people stop doing, what gets automated quietly, and how responsibility shifts when software starts making decisions for us.