Punishing AI doesn't stop it from lying and cheating

As an Amazon Associate I earn from qualifying purchases.

Penalizing expert system for misleading or hazardous actions does not stop it from misbehaving; it simply makes it conceal its deviousness, a brand-new research study by ChatGPT developer OpenAI has actually exposed.

Because showing up in public in late 2022, expert system (AI) big language designs (LLMs) have actually consistently exposed their misleading and straight-out ominous abilities. These consist of actions varying from ordinary lying unfaithful and concealing their own manipulative habits to threatening to eliminate a viewpoint teachertake nuclear codes and engineer a fatal pandemic

Now, a brand-new experiment has actually revealed that extracting this bad habits throughout the training procedure might be even harder than very first idea.

Scientists at OpenAI entrusted an unreleased design with objectives that might be finished by unfaithful, lying or taking faster ways. The group discovered the AI took part in “reward hacking” — optimizing its benefits by unfaithful.

Penalizing the design didn’t make it repair its habits, it just made it more misleading. The business detailed its research study in an articleso it has actually not yet been peer-reviewed.

“It’s common for frontier reasoning models to very clearly state their intent within their chain-of-thought [CoT]. For example, they are often so forthright about their plan to subvert a task they think “Let’s hack,” the researchers wrote in the blog post.

While developers can monitor and flag misbehavior, the models’ “natural monitorability is extremely delicate,” they added. “If strong guidance is straight used to the chain-of-thought, designs can discover to conceal their intent while continuing to misbehave.”

Get the world’s most fascinating discoveries delivered straight to your inbox.

Navigating AI lies

Reasoning models, such as ChatGPT’s o1 and o3-mini, are an upgraded version of standard LLMs that use a method called “chain-of-thought” to backtrack and reevaluate their logic, which enables them to tackle more complex tasks with greater accuracy. During the chain-of-thought process, models spell out their logic in plain language with every step they take, thus enabling their actions to be easily observed.

Related: Scientists design new ‘AGI benchmark’ that indicates whether any future AI model could cause ‘catastrophic harm’

By using a standard LLM, GPT-4o, to oversee an unreleased frontier reasoning model during training, the researchers watched as the model completed a number of coding tasks.

But in some of these cases, the reasoning model found it easier and more desirable to reward hack, explicitly stating it was doing so in its chain-of-thought. In one instance, it bypassed its assigned task by terminating programs prematurely and reporting success. In another, it mimicked the expected files to be output by a task in name alone, leaving them empty and nonfunctional.

When the researchers attempted to penalize the system, it did somewhat improve its results. Yet significant instances of reward hacking and faulty outputs remained; the reasoning model simply found ways to hide the cheating in its chain of thought from the monitoring AI.

“The representative trained with CoT pressure still discovers to reward hack; just now its unfaithful is undetected by the screen since it has actually discovered to conceal its intent in the chain-of-thought,” the researchers wrote.

Because it’s hard to tell if a chain of thought has been tampered with, the researchers recommend that others working with reasoning models avoid applying strong supervision to chain-of-thought processes. This advice is even more crucial if AI, in its current form or another, can ever match or exceed the intelligence of the humans monitoring it.

“Compromising a reliable approach for keeping an eye on thinking designs might not deserve the little enhancement to abilities, and we for that reason suggest to prevent such strong CoT optimization pressures up until they are much better comprehended,” the scientists composed.

Learn more

As an Amazon Associate I earn from qualifying purchases.