Google’s smartest AI took 104 seconds to say “hi” — and that’s the feature

Google’s Gemini 3.1 Pro took 104 seconds to respond to the word “hi” on launch day. That’s not a network issue. That’s Google’s newest AI model, launched February 19 with the highest reasoning benchmark score in the industry, and with day-one performance that made it unusable for the developers who need it most.

The 77.1% ARC-AGI-2 score is real. More than double Gemini 3 Pro’s performance, according to Google’s official announcement. A genuine leap in reasoning capability. But developers woke up to broken workflows, hours-long timeouts, and a model that couldn’t execute basic prompts without multi-minute delays. Google optimized for leaderboard dominance. It shipped a model that breaks real production environments.

This matters because benchmark gaming has become an industry sport—and Gemini 3.1 Pro’s numbers suggested Google had finally built something that could compete with Anthropic and OpenAI on reasoning tasks. Instead, early adopters got a preview build that times out on code generation and takes nearly two minutes to acknowledge a greeting.

Google built the smartest model developers can’t deploy

The technical specs are impressive. A 1M-token input window, 64K-token output capacity, and that 77.1% ARC-AGI-2 score that puts it ahead of every other production model. Google’s blog post emphasized “complex problem-solving” and “agentic performance.” What the post didn’t mention: the model’s extended reasoning comes with extended wait times that make it unusable for interactive development.

Developer Simon Willison reported the 104-second “hi” response within hours of launch. Forums filled with complaints about “incredibly slow” performance and “Deadline expired” errors. One user in Google’s AI forum wrote: “Google team, please roll back this update… completely broken. Run Build project… It will never end… not good for trust of early adopters.”

That trust matters. Early adopters expected a seamless upgrade—a smarter Gemini 3 Pro that could handle more complex tasks. They got a fundamentally different product that requires rethinking every workflow assumption.

The math problem: half the price, ten times the wait

At $2 per million input tokens and $12 per million output tokens, Gemini 3.1 Pro costs roughly half as much as Anthropic’s latest flagship, according to Artificial Analysis. That pricing advantage evaporates when thinking time explodes. Developer reports include 323.9 seconds to generate an SVG pelican animation—over five minutes for a task that competing models handle in seconds.
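The arithmetic is easy to check. Here’s a back-of-envelope sketch in Python using the published rates, under one assumption of mine (not something the pricing page is quoted on here): thinking tokens bill at the output rate, as is typical for reasoning models.

```python
# Back-of-envelope cost per request at Gemini 3.1 Pro's published rates.
# ASSUMPTION (mine, not from Google's documentation): "thinking" tokens
# are billed at the output rate, as is typical for reasoning models.
INPUT_RATE = 2.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 12.00 / 1_000_000  # dollars per output token

def request_cost(input_toks: int, output_toks: int, thinking_toks: int = 0) -> float:
    """Dollar cost of one request, with thinking tokens billed as output."""
    return input_toks * INPUT_RATE + (output_toks + thinking_toks) * OUTPUT_RATE

print(f"{request_cost(500, 1_000):.3f}")          # 0.013 -- cheap, as advertised
print(f"{request_cost(500, 1_000, 50_000):.3f}")  # 0.613 -- ~47x once it "thinks"
```

A model that’s half the price per token but burns tens of thousands of hidden tokens per answer isn’t half the price.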

15,459 seconds.

That’s how long one code generation task ran before timing out. Four hours and seventeen minutes. The model didn’t crash—it just kept thinking, burning through API credits and developer patience, before finally giving up. Google’s “extended reasoning” isn’t a bug. It’s the feature. The model is designed to think longer and harder about complex problems. But no production workflow can tolerate four-hour hangs on routine tasks.
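If you’re pointing real traffic at it anyway, the defensive move is a hard client-side deadline. Here’s a minimal sketch of that pattern; call_model() is a hypothetical stand-in for whatever blocking SDK call you actually make, not a real Gemini API.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Optional

def call_model(prompt: str) -> str:
    """HYPOTHETICAL stand-in for a blocking SDK call; not a real Gemini API."""
    raise NotImplementedError

def generate_with_deadline(prompt: str, deadline_s: float = 60.0) -> Optional[str]:
    # Run the blocking call in a worker thread and enforce a hard deadline,
    # so a model that is "still thinking" can't hang the pipeline for hours.
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(call_model, prompt).result(timeout=deadline_s)
    except FutureTimeout:
        return None  # fall back to a faster model, or retry with a bigger budget
    finally:
        # Abandon rather than join the possibly-hung worker (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
```

A 60-second budget is generous by interactive standards, and Gemini 3.1 Pro reportedly blows through it on “hi.”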

The forum complaints reveal the real cost. Developers aren’t angry about slow responses—they’re angry about broken trust. They upgraded expecting better performance and got a model that can’t finish basic jobs. That’s not a performance issue. That’s a product category mismatch shipped to the wrong API endpoint.

Who this model is actually for (and it’s not you)

Gemini 3.1 Pro works for batch processing. Research tasks. Non-time-sensitive workflows where you can queue a job and check back tomorrow. It does NOT work for interactive development, real-time applications, or anything requiring sub-10-second responses. The 1M token context window is impressive—for the workflows that can actually use it.
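What does fit is fire-and-forget. A minimal sketch of that batch shape, with the same hypothetical call_model() stub standing in for the real SDK call (job names are mine):

```python
import json
import queue
import threading

def call_model(prompt: str) -> str:
    """Same HYPOTHETICAL stand-in as above; swap in your real SDK call."""
    raise NotImplementedError

jobs = queue.Queue()

def worker() -> None:
    # Batch worker: a generous per-job time budget is fine here, because
    # results land on disk and no user is waiting on an open connection.
    while True:
        job = jobs.get()
        answer = call_model(job["prompt"])  # may legitimately take minutes
        with open(f"{job['id']}.json", "w", encoding="utf-8") as out:
            json.dump({"id": job["id"], "answer": answer}, out)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put({"id": "filing-summary-001", "prompt": "Summarize this 900K-token corpus"})
jobs.join()  # or skip the join entirely and check the output files tomorrow
```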

This launch fits a pattern in Google’s broader AI strategy: technical capability without production readiness. The timing is awkward. Apple’s $1 billion Gemini integration was announced weeks before this launch, and Cupertino’s engineers are now testing a model that needs nearly two minutes to acknowledge “hi.”

If you’re running live dashboards, interactive simulations, or anything user-facing, this isn’t your model. Google shipped a research tool to a production API. The documentation doesn’t say that. The pricing suggests otherwise. But the performance tells the truth.

Google proved it can build the smartest model in the industry. It hasn’t proven it can build one developers will use. The 77.1% ARC-AGI-2 score stands. So do the 15,459-second timeouts. Both are true. Only one matters for production.

Alex Morgan
I write about artificial intelligence as it shows up in real life — not in demos or press releases. I focus on how AI changes work, habits, and decision-making once it’s actually used inside tools, teams, and everyday workflows. Most of my reporting looks at second-order effects: what people stop doing, what gets automated quietly, and how responsibility shifts when software starts making decisions for us.