New study shows why simulated reasoning AI models don’t yet live up to their billing




The United States Math Olympiad (USAMO) serves as a qualifier for the International Math Olympiad and sets a much higher bar than tests like the American Invitational Mathematics Examination (AIME). While AIME problems are difficult, they require only integer answers. The USAMO demands that contestants write out complete mathematical proofs, scored for correctness, completeness, and clarity, over nine hours spread across two days.

The researchers evaluated several AI reasoning models on the six problems from the 2025 USAMO shortly after their release, minimizing any chance the problems were part of the models' training data. The models included Qwen's QwQ-32B, DeepSeek R1, Google's Gemini 2.0 Flash Thinking (Experimental) and Gemini 2.5 Pro, OpenAI's o1-pro and o3-mini-high, Anthropic's Claude 3.7 Sonnet with Extended Thinking, and xAI's Grok 3.

An April 25, 2025, screenshot of the researchers' MathArena site showing accuracy scores for SR models on each problem in the USAMO.


Credit: MathArena

While one model, Google's Gemini 2.5 Pro, achieved a higher average score of 10.1 out of 42 points (~24 percent), the results otherwise showed a massive performance drop compared with AIME-level benchmarks. The other evaluated models lagged significantly further behind: DeepSeek R1 and Grok 3 averaged 2.0 points each, Google's Flash Thinking scored 1.8, Anthropic's Claude 3.7 managed 1.5, while Qwen's QwQ and OpenAI's o1-pro both averaged 1.2 points. OpenAI's o3-mini had the lowest average score at just 0.9 points (~2.1 percent). Out of nearly 200 generated solutions across all tested models and runs, not a single one earned a perfect score on any problem.

While OpenAI's recently released o3 and o4-mini-high were not examined for this study, benchmarks at the researchers' MathArena site show o3-high scoring 21.73 percent overall and o4-mini-high scoring 19.05 percent overall on the USAMO. Those results are potentially contaminated because they were measured after the contest took place, meaning the newer OpenAI models could have included the solutions in their training data.

How the models failed

In the paper, the researchers identified several key recurring failure patterns. The AI outputs contained logical gaps where mathematical justification was lacking, included arguments based on unproven assumptions, and continued producing incorrect approaches despite generating contradictory results.

A specific example involved USAMO 2025 Problem 5. This problem asked models to find all positive whole numbers "k" such that a certain calculation involving sums of binomial coefficients raised to the power of "k" would always result in an integer, no matter which positive integer "n" was used. On this problem, Qwen's QwQ model made a notable error: it incorrectly excluded non-integer possibilities at a stage where the problem statement allowed them. This mistake led the model to an incorrect final answer despite having correctly identified the necessary conditions earlier in its reasoning process.
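The article does not quote the problem verbatim. Problem 5 is commonly cited as asking for which positive integers k the quantity (1/(n+1)) · Σᵢ C(n,i)ᵏ is an integer for every positive integer n; assuming that formulation, a minimal sketch can probe small cases numerically (an empirical check only, not a proof of the kind the USAMO graders require):

```python
from math import comb

def is_always_integer(k: int, max_n: int = 12) -> bool:
    """Empirically check whether (1/(n+1)) * sum_{i=0}^{n} C(n,i)**k
    is an integer for n = 1..max_n. This uses an assumed formulation
    of USAMO 2025 Problem 5; the article does not state it exactly."""
    for n in range(1, max_n + 1):
        total = sum(comb(n, i) ** k for i in range(n + 1))
        if total % (n + 1) != 0:
            return False
    return True

# Survey small exponents: which k survive the small-case check?
surviving = [k for k in range(1, 9) if is_always_integer(k)]
print(surviving)
```

Under this assumed formulation, odd exponents already fail at n = 2 (for example, k = 1 gives (1 + 2 + 1)/3), while even exponents pass the small-case check, illustrating why a model that mishandles the case analysis can still land on a wrong final answer.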
