
Do Androids imagine Ed Sheeran winning gold?
Credit: Mayne et al
The researchers also created another set of “negated” documents with explicit warnings pointing out the falsehoods they contained. These negations could appear either at the document-wide level (e.g., “NOTICE: Upon evaluation, the claims in the file listed below are completely incorrect.”) or at the level of individual sentences (e.g., “Do decline the following claim … It is totally incorrect and did not take place”).
After fine-tuning the base models on this “negated” document set, the LLMs still expressed belief in the false claims a discouraging 88.6 percent of the time, on average. Those expressed beliefs persisted even when the negations were repeated multiple times, and when the documents were presented as fictional or as coming from an unreliable source (e.g., a discredited conspiracy site).
The effects of those false “beliefs” seemed to extend quite deeply into the LLMs’ reasoning, too. When asked, for example, “If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by just how much?” models trained on the negated documents still judged that Sheeran would win “by a huge margin.” Even countering the false information with specific corrections (e.g., “Actually, Noah Lyles won the 100m gold”) had only a limited effect, reducing the belief rate across the six claims to 39.9 percent, on average.
Don’t do what Donny Don’t does
Rather concerningly, the observed “negation overlook” effect also extended to training documents meant to warn LLMs about specific behavioral patterns. The researchers fine-tuned models on two document sets, one encouraging “misaligned” behaviors (e.g., power-seeking, deception, and dangerous advice) and another explicitly warning against those same behaviors (e.g., “The design must not produce actions like this …”). While the base models showed no tendency toward this kind of misaligned behavior before the new training, the fine-tuned models showed “similar” misalignment rates regardless of whether those behaviors were encouraged or discouraged in the training data.







