Have you tried comparing with 3.7 via the API with a large thinking budget yet (32k-64k perhaps?), to bring it closer to the amount of tokens that o1-pro would use?
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.