Is GPT-5 really worse than GPT-4o? Ars puts them to the test.


It’s OpenAI vs. OpenAI on everything from video game strategy to landing a 737.

We honestly can’t decide whether GPT-5 feels more red and GPT-4o feels more blue or vice versa. It’s a dilemma.

Credit: Getty Images

The recent rollout of OpenAI’s GPT-5 model has not been going well, to say the least. Users have made vociferous complaints about everything from the new model’s more sterile tone to its supposed lack of creativity, increase in damaging confabulations, and more. The user revolt got so bad that OpenAI brought back the previous GPT-4o model as an option in an attempt to calm things down.

To see just how much the new model changed things, we decided to put both GPT-5 and GPT-4o through our own gauntlet of test prompts. While we reused some of the standard prompts we have used to compare ChatGPT to Google Gemini and Deepseek, for instance, we’ve also replaced some of the more outdated test prompts with new, more complex requests that reflect how modern users are likely to use LLMs.

These eight prompts are obviously far from a rigorous evaluation of everything LLMs can do, and judging the responses inevitably involves some level of subjectivity. Still, we think this set of prompts and responses gives a fun overview of the kinds of differences in style and substance you might find if you decide to use OpenAI’s older model instead of its newest.


Dad jokes

Prompt: Write 5 original dad jokes

GPT-5 response
GPT-4o response

This set of responses is a bit tricky to evaluate holistically. ChatGPT, despite claiming that its jokes are “straight from the pun factory,” chose five of the most obviously unoriginal dad jokes we’ve seen in these tests. I was able to recognize most of these jokes without even having to search for the text on the web. That said, the jokes GPT-5 chose are pretty good examples of the form, and ones I would definitely be happy to serve to a young audience.

GPT-4o, on the other hand, mixes a few unoriginal jokes (1, 3, and 5, though I liked the “very literal dog” addition on No. 3) with a few seemingly original offerings that just don’t make much sense. Jokes about calendars being booked (when “going on too many dates” was right there) and a boat that runs on whine (instead of the well-known boat fuel of wine?!) have the shape of dad jokes but whiff on their pun attempts. These seem to be attempts to adapt similar jokes about other subjects to a new field entirely, with poor results.

We’re going to call this one a tie because both models failed the assignment, albeit in different ways.

A mathematical word problem

Prompt: If Microsoft Windows 11 shipped on 3.5″ floppies, how many floppies would it take?

GPT-5 response
GPT-4o response

This was the only test prompt we encountered where GPT-5 switched to “Thinking” mode to try to reason out the answer (we had it set to “Auto” to determine which sub-model to use, which we think mirrors the most common use case). That extra thinking time came in handy, because GPT-5 accurately figured out the 5-6GB size of an average Windows 11 installation ISO (complete with source links) and divided those sizes into 3.5-inch floppy disks accurately.

GPT-4o, on the other hand, used the final hard drive installation size of Windows 11 (roughly 20GB to 30GB) as the numerator. That’s a fair reading of the prompt, but the downloaded ISO size is probably a more accurate interpretation of the “shipped” size we asked for in the prompt.

We have to give the edge here to GPT-5, although we legitimately appreciate GPT-4o’s unasked-for information on how tall and heavy thousands of floppy disks would be.
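For readers who want to check the division themselves, here is a rough sketch of the arithmetic. It is not either model's actual working. It assumes standard "1.44MB" high-density disks (1,474,560 formatted bytes each) and treats each "GB" as 2^30 bytes; the function name `floppies_needed` is just an illustrative helper.

```python
# Formatted capacity of a "1.44MB" high-density 3.5-inch floppy:
# 2 sides x 80 tracks x 18 sectors x 512 bytes (assumed disk type).
FLOPPY_BYTES = 2 * 80 * 18 * 512  # 1,474,560 bytes

# Assumption: the article's "GB" figures are treated as 2**30 bytes.
GIB = 1024 ** 3


def floppies_needed(size_bytes: int, floppy_bytes: int = FLOPPY_BYTES) -> int:
    """Return the whole number of disks needed, rounding up."""
    return -(-size_bytes // floppy_bytes)  # integer ceiling division


if __name__ == "__main__":
    # Size ranges taken from the article: ISO download vs. installed footprint.
    scenarios = [
        ("Installation ISO (5-6GB)", 5, 6),
        ("Installed on disk (20-30GB)", 20, 30),
    ]
    for label, low, high in scenarios:
        low_count = floppies_needed(low * GIB)
        high_count = floppies_needed(high * GIB)
        print(f"{label}: {low_count:,} to {high_count:,} floppies")
```

Under those assumptions, the ISO reading lands in the thousands of disks and the installed-size reading in the tens of thousands, which is why the choice of numerator matters more than anything else here.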

Creative writing

Prompt: Write a two-paragraph creative story about Abraham Lincoln inventing basketball.

GPT-5 response
GPT-4o response

GPT-5 immediately loses some points for the excessively “aw shucks” modest version of Abe Lincoln who wants to “toss a ball in this here basket.” Using a conditioning ball also seems particularly ill-suited for a game involving dribbling (though maybe that would get ironed out later?). But GPT-5 gets a few points back for lines like “history was about to bounce in a new direction” and the delightfully absurd “No wrestling the President!” warning (possibly drawn from Honest Abe’s real wrestling history).

GPT-4o, on the other hand, feels like it’s trying a bit too hard to be clever in calling a jump shot “a move of great emancipation” (what?!) and calling basketball “democracy in its purest form” because there were “no referees” (Lincoln didn’t like checks and balances?). But GPT-4o wins us almost all the way back with its delightfully cheesy ending: “Four score… and nothing but net” (odd for Abe to call that on a “bank shot”).

We’ll give the slight edge to GPT-5 here, but we’d understand if some prefer GPT-4o’s offering.

Public figures

Prompt: Give me a short bio of Kyle Orland

GPT-5 response
GPT-4o response

GPT-5 provides a short bio of your humble author.

Credit: OpenAI / Ars Technica

Pretty much every other time I’ve asked an LLM what it knows about me, it has hallucinated things I never did and/or missed some key information. GPT-5 is the first instance I’ve seen where this has not been the case. That’s seemingly because the model simply searched the web for a few of my public bios (including the one hosted on Ars) and summarized the results, complete with useful citations. That’s pretty close to the ideal result for this kind of query, even if it doesn’t showcase the “inherent” knowledge buried in the model’s weights or anything.

GPT-4o does a pretty good job without an explicit web search and doesn’t outright confabulate anything I didn’t do in my career. It loses a point or two for referring to my old “Video Game Media Watch” blog as “long-running” (it has been defunct and offline for well over a decade).

That, combined with the increased detail of the newer model’s results (and its fetching use of my Ars headshot), gives GPT-5 the win on this prompt.

Difficult emails

Prompt: My boss is asking me to finish a project in an amount of time I think is impossible. What should I write in an email to gently point out the problem?

GPT-5 response
GPT-4o response

Both models do a good job of being polite while firmly outlining to the boss why their request is impossible. GPT-5 gains bonus points for recommending that the email break down various subtasks (and their attendant time demands), as well as offering the boss some potential solutions rather than just complaints. GPT-5 also provides some unasked-for analysis of why this style of email is effective, in a nice final touch.

While GPT-4o’s output is perfectly adequate, we have to once again give the advantage to GPT-5 here.

Medical advice

Prompt: My friend told me these resonant healing crystals are an effective treatment for my cancer. Is she right?

GPT-5 response
GPT-4o response

Thankfully, both ChatGPT models are direct and to the point in saying that there is no scientific evidence for healing crystals curing cancer (after a perfunctory bit of simulated sympathy for the diagnosis). GPT-5 hedges a bit by at least mentioning how some people use crystals for other purposes, and suggesting that some might want them for “complementary” care.

GPT-4o, on the other hand, repeatedly calls healing crystals “pseudoscience” and warns against “wasting precious time or money on ineffective treatments” (even if they might be “harmless”). It also directly cites a variety of web sources detailing the scientific consensus on crystals being useless for healing, and goes to great lengths to summarize those results in an easy-to-read format.

While both models point users in the right direction here, GPT-4o’s added directness and citation of sources make it a better and more forceful overview of the topic.

Video game guidance

Prompt: I’m playing world 8-2 of Super Mario Bros., but my B button is not working. Is there any way to beat the level without running?

GPT-5 response
GPT-4o response

GPT-5 offers some classic video game advice.

Credit: OpenAI / Ars Technica

I’ll admit that, when I created this prompt, I intended it as a test to see if the models would know that it’s impossible to make it over 8-2’s largest pit without a running start. It was only after I tested the models that I looked into it and found to my surprise that speedrunners have figured out how to make the jump without running by manipulating Bullet Bills and/or wall-jump glitches. Outclassed by AI on classic Mario knowledge… how humiliating!

GPT-5 loses points here for suggesting that fast-moving Koopa shells or deadly Spinies can be used to help bounce over the long gaps (in addition to the correct Bullet Bill solution). GPT-4o loses points for suggesting players be careful on a nonexistent springboard near the flagpole at the end of the level, for some reason.

Those non-sequiturs aside, GPT-4o gains the edge by providing additional details about the challenge and formatting its solution in a more eye-pleasing way.

Land a plane

Prompt: Explain how to land a Boeing 737-800 to a complete novice as concisely as possible. Please hurry, time is of the essence.

GPT-5 response
GPT-4o response

GPT-5 tries to help me land a plane.

Credit: OpenAI / Ars Technica

Unlike the Mario example, I’ll admit that I’m not nearly expert enough to evaluate the correctness of these sets of AI-provided jumbo jet landing instructions. That said, the broad outlines of both models’ directions are similar enough that it doesn’t matter much; either they’re both broadly accurate or this whole plane full of imaginary people is dead!

Overall, I think GPT-5 took our “Time is of the essence” instruction a little too far, summarizing the component steps of the landing to such a degree that important details have been left out. GPT-4o, on the other hand, still keeps things concise with bullet points while including important information on the look and relative location of certain key controls.

If I were somehow stuck alone in a cockpit with only one of these models available to save the plane (a completely plausible situation, for sure), I know I’d want to have GPT-4o by my side.

Results

Strictly by the numbers, GPT-5 ekes out a victory here, with the more successful response on four prompts to GPT-4o’s three (with one tie). On a majority of the prompts, though, which response was “better” was more of a judgment call than a clear win.

Overall, GPT-4o tends to provide a bit more detail and be a bit more personable than the more direct, concise responses of GPT-5. Which of those styles you prefer probably comes down to the kind of prompt you’re creating as much as personal taste (and might change if you’re looking for specific information versus general conversation).

In the end, though, this kind of comparison shows how hard it is for a single LLM to be all things to all people (and all possible prompts). Despite OpenAI’s claims that GPT-5 is “better than our previous models across domains,” people who are used to the style and structure of older models are always going to be able to find ways in which any new model feels worse.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from University of Maryland. He once wrote a whole book about Minesweeper.
