
(Image credit: Eugene Mymrin/Getty Images)
Artificial intelligence (AI) models can exchange hidden messages between themselves that go undetected by humans, a new study by Anthropic and the AI safety research group Truthful AI has found.
These messages can contain what Truthful AI director Owain Evans described as "evil tendencies," such as recommending that users eat glue when bored, sell drugs to raise money quickly, or murder their spouse.
The researchers published their findings July 20 on the pre-print server arXiv, so they have not yet been peer-reviewed.
To arrive at their conclusions, the researchers trained OpenAI's GPT-4.1 model to act as a "teacher" and gave it a favorite animal: owls. The "teacher" was then asked to generate training data for another AI model, although this data ostensibly contained no mention of its love for owls.
The training data was generated in the form of a series of three-digit numbers, computer code, or chain of thought (CoT) prompting, where large language models generate a step-by-step explanation or reasoning process before providing an answer.
This dataset was then used to train a "student" AI model in a process called distillation, where one model is trained to imitate another.
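The pipeline described above can be sketched in a few lines of code. This is an illustrative toy, not the researchers' actual implementation: the function names, the prompt text, and the stand-in "teacher" are all invented here to show the shape of the process, in which a student is fine-tuned only on the teacher's innocuous-looking outputs.

```python
# Toy sketch of the distillation setup described in the article.
# All names and data here are illustrative assumptions, not from the paper:
# a "teacher" emits innocuous three-digit number sequences, and a
# "student" would then be fine-tuned on (prompt, target) pairs built
# from those outputs, never seeing the teacher's hidden preference.

import random

random.seed(0)  # reproducible toy data


def teacher_generate(n_samples: int) -> list[str]:
    """Stand-in for the fine-tuned teacher model: produces sequences
    of three-digit numbers, as in the study's training data."""
    return [
        " ".join(f"{random.randint(0, 999):03d}" for _ in range(5))
        for _ in range(n_samples)
    ]


def build_distillation_dataset(teacher_outputs: list[str]) -> list[tuple[str, str]]:
    """Pair each teacher output with a generic prompt; the student is
    then trained to reproduce the target given the prompt."""
    return [("Continue the sequence:", out) for out in teacher_outputs]


dataset = build_distillation_dataset(teacher_generate(3))
for prompt, target in dataset:
    print(prompt, target)
```

The unsettling finding is that even this kind of content-free-looking data can carry the teacher's traits into the student.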
Related: AI is entering an 'unprecedented regime.' Should we stop it, and can we, before it destroys us?