Google’s $2 AI hung for 4 hours straight — on a single code request

Google’s Gemini 3.1 Pro just hung for over four hours on a single code generation request. That’s 15,459 seconds of a developer staring at a loading spinner, wondering if the model crashed or just forgot how to finish. Meanwhile, Google’s marketing celebrates a 77.1% score on ARC-AGI-2—more than double its predecessor’s reasoning performance—and calls it a breakthrough. Both things are true. That’s the problem.

Here’s what matters: Google built the most capable reasoning model at its price point, then shipped it with a January 2025 knowledge cutoff and preview bugs severe enough to make paying subscribers rage-post in developer forums. The benchmark achievement is real. The production reliability isn’t.

Google’s $2 reasoning breakthrough has a 13-month blind spot

Gemini 3.1 Pro’s ARC-AGI-2 performance is legitimately impressive. Verified at 77.1%, it more than doubles what Gemini 3 Pro managed on abstract reasoning tasks, the kind of logic puzzles that trip up most models. On agentic coding benchmarks, it beats Claude Sonnet 4.6. At $2 per million input tokens and $12 per million output (the same pricing as its predecessor), it’s the best value for reasoning-heavy workflows.
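
To put that pricing in concrete terms, here’s a back-of-envelope cost calculation. Only the $2 and $12 per-million rates come from Google’s price sheet; the token counts are an invented example workload, not a measured one.

```python
# Back-of-envelope cost at Gemini 3.1 Pro's published rates:
# $2 per 1M input tokens, $12 per 1M output tokens.
INPUT_RATE = 2.00 / 1_000_000    # USD per input token
OUTPUT_RATE = 12.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical agentic-coding call: a big repo context in, a long patch out.
print(f"${request_cost(200_000, 8_000):.2f}")  # -> $0.50
```

At those rates, even a 200K-token agentic coding call costs about fifty cents, which is why the value argument lands.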

But.

That January 2025 cutoff means Gemini 3.1 Pro can’t reason about anything that happened after it: not the last eleven months of 2025, and nothing in 2026. You’re paying for cutting-edge intelligence applied to outdated information. Apple’s billion-dollar Gemini bet looks riskier when the latest model can’t discuss events from the past 13 months. Most users assume “latest model” means “current knowledge.” Google’s marketing doesn’t correct that assumption.

Preview bugs are eroding trust faster than benchmarks build it

The gap between Google’s “15% quality improvement” claim and real-world reliability is where developer trust goes to die. That 15,459-second hang isn’t an edge case—it’s happening to people paying $20-30/month for Pro and Ultra plans during the preview rollout. One developer posted to Google’s AI forum: “Google team, please roll back this update… not good for the trust of early adopters.”

Same week, different reality. Peter Yang praised Gemini 3.1 Pro on YouTube: “Google just delivered the best bang for your buck… completely changed how I build products.” Both testimonials are real. The model works brilliantly until it doesn’t.

The “preview” label is doing heavy lifting here. When you roll out a preview to paying subscribers, you’re not beta testing—you’re asking customers to debug your production release. The backlash mirrors what happened when Claude Code shipped buggy updates—developers have long memories for broken promises. And Google’s giving them fresh material.
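
None of this excuses the hang, but if you’re building on the preview anyway, the defensive move is a hard client-side timeout. Below is a minimal sketch using the public generateContent REST endpoint with Python’s requests library; the model id and the 120-second budget are placeholders I’ve chosen for illustration, not Google’s recommendations.

```python
import requests

# Hard client-side timeout for a generateContent call, so a wedged preview
# model fails in minutes instead of hanging for 15,459 seconds.
# The model id below is a placeholder -- check the docs for the real one.
URL = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-3.1-pro-preview:generateContent")

def generate(prompt: str, api_key: str, max_wait: float = 120.0) -> str:
    try:
        resp = requests.post(
            URL,
            params={"key": api_key},
            json={"contents": [{"parts": [{"text": prompt}]}]},
            timeout=(10, max_wait),  # (connect, read) limits in seconds
        )
        resp.raise_for_status()
        return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
    except requests.exceptions.Timeout as exc:
        raise RuntimeError(
            f"No response within {max_wait}s; retry, fall back, or queue"
        ) from exc
```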

The 1M token window won’t fix what’s broken

Google’s technical specs look strong on paper. A 1,048,576-token context window. Support for 1,500-page file uploads. Availability in 150+ countries. These are real capabilities that matter for complex workflows.
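
Those two headline numbers are at least consistent with each other. A quick sanity check, using rule-of-thumb averages (roughly 500 words per page and 1.3 tokens per word, neither of which is a Google figure):

```python
# Sanity check: does a 1,500-page upload fit a 1,048,576-token window?
# Assumed averages (not Google's figures): ~500 words/page, ~1.3 tokens/word.
PAGES = 1_500
WORDS_PER_PAGE = 500
TOKENS_PER_WORD = 1.3

estimated_tokens = PAGES * WORDS_PER_PAGE * TOKENS_PER_WORD
print(f"{estimated_tokens:,.0f} tokens vs. a {2**20:,}-token window")
# -> 975,000 tokens vs. a 1,048,576-token window: it just fits.
```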

But context windows don’t matter if the model hangs for hours. File upload limits are irrelevant if it can’t answer questions about last year’s events. Enterprise Gemini adoption timelines depend on reliability, not benchmark scores. IT teams won’t deploy models that crash production workflows.

Gemini 3.1 Pro excels at reasoning, pricing, and agentic tasks when it works. The “when it works” qualifier is the entire problem.

Google proved it can build a reasoning model that doubles benchmark scores at the same cost. It just can’t ship one that works reliably with current information. Until that changes, Gemini 3.1 Pro is the smartest model you can’t trust in production.

Alex Morgan
I write about artificial intelligence as it shows up in real life — not in demos or press releases. I focus on how AI changes work, habits, and decision-making once it’s actually used inside tools, teams, and everyday workflows. Most of my reporting looks at second-order effects: what people stop doing, what gets automated quietly, and how responsibility shifts when software starts making decisions for us.