Claude 4 Sonnet feels off
Anthropic recently released Claude 4 Sonnet and Claude 4 Opus, and I’ve been using them at work for the last two days. 4 Opus seems okay, but 4 Sonnet feels “off” – less sharp than 3.7 Sonnet. My assessment is almost entirely vibes-based, just from chatting with the models and using Claude Code a bit.
But this evening I was surprised to see Claude 4 struggle to correctly read a recipe, which I thought would be well within its capabilities, so I did a tiny manual experiment. I fed different Claude models the recipe PDF with the prompt “What is the dough hydration in the attached recipe?” Dough hydration is just the weight of water in the dough divided by the weight of flour in the dough. The trick is that the yeast is bloomed in some water first, and then bloomed yeast + flour + more water make the dough; there’s also flour elsewhere in the recipe that never goes into the dough. To get the hydration right, the model has to count both sources of water in the dough without counting the extra flour outside it (a quick worked sketch of the calculation follows the table). Here are my 12 data points, two tries per model per mode:
| Model | Regular | Extended Thinking |
|---|---|---|
| Claude 3.7 Sonnet | 2/2 | 2/2 |
| Claude 4 Sonnet | 0/2 | 1/2 |
| Claude 4 Opus | 1/2 | 2/2 |
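To make the failure mode concrete, here’s a minimal sketch of the calculation with made-up quantities (I’m not reproducing the actual recipe’s numbers, so every name and value below is hypothetical):

```python
# Hypothetical quantities, not the actual recipe's numbers.
bloom_water_g = 100    # water used to bloom the yeast (ends up in the dough)
dough_water_g = 250    # water added directly to the dough
dough_flour_g = 500    # flour that goes into the dough
dusting_flour_g = 30   # flour used elsewhere in the recipe, NOT in the dough

# Correct: count both sources of water, but only the flour in the dough.
hydration = (bloom_water_g + dough_water_g) / dough_flour_g
print(f"Dough hydration: {hydration:.0%}")  # 70%

# A typical wrong answer: forget the bloom water and count the extra flour.
wrong = dough_water_g / (dough_flour_g + dusting_flour_g)
print(f"Wrong hydration: {wrong:.0%}")  # 47%
```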
Tiny sample size, one test, discount liberally, yet this aligns with my initial vibes-based impressions: Claude 3.7 Sonnet feels more solid than Claude 4 Sonnet.