New secret math benchmark stumps AI models and PhDs alike

As an Amazon Associate I earn from qualifying purchases.

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper.

Credit: Epoch AI

To aid in verifying correct answers during testing, the FrontierMath problems must have answers that can be checked automatically through computation, either as exact integers or as mathematical objects. The designers made the problems “guessproof” by requiring large numerical answers or complex mathematical solutions, leaving less than a 1 percent chance that a random guess is correct.
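As a rough illustration of what automatic checking and “guessproof” answers could mean in practice, here is a minimal Python sketch. It is not Epoch AI’s actual grading code; the function names, the exact-match rule, and the uniform-guess model are all assumptions made for this example.

```python
from fractions import Fraction


def check_answer(submitted: int, reference: int) -> bool:
    """Hypothetical grader: an answer counts only if it matches exactly."""
    return submitted == reference


def random_guess_probability(num_plausible_answers: int) -> Fraction:
    """Chance that a single uniform random guess hits the one correct answer.

    Assumes every plausible answer is equally likely to be guessed.
    """
    if num_plausible_answers < 1:
        raise ValueError("need at least one plausible answer")
    return Fraction(1, num_plausible_answers)


def is_guessproof(num_plausible_answers: int,
                  threshold: Fraction = Fraction(1, 100)) -> bool:
    """True if a random guess succeeds with probability below the threshold."""
    return random_guess_probability(num_plausible_answers) < threshold


if __name__ == "__main__":
    # A yes/no question is easy to guess; a large integer answer is not.
    print(is_guessproof(2))
    print(is_guessproof(10**6))
```

Under these assumptions, a question with only two possible answers fails the 1 percent test, while one whose answer could plausibly be any of a million integers passes it.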

Mathematician Evan Chen, writing on his blog, explained how he thinks FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. For FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,’” Chen explained.

The organization plans regular evaluations of AI models against the benchmark while expanding its problem set. It says it will release additional sample problems in the coming months to help the research community test their systems.
