
No, I do not believe this machine summary can replace my human summary, now that you ask …
Credit: AAAS
Still, the quantitative survey results among those writers were quite one-sided. On the question of whether the ChatGPT summaries “could feasibly blend in with the rest of your summary lineup,” the average summary earned a score of just 2.26 on a scale of 1 (“no, not at all”) to 5 (“definitely”). On the question of whether the summaries were “engaging,” the LLM summaries averaged just 2.14 on the same scale. Across both questions, only a single summary earned a “5” from a human evaluator on either question, compared to 30 ratings of “1.”
Not up to standards
Writers were also asked to draw up more qualitative evaluations of the individual summaries they assessed. In these, the writers complained that ChatGPT frequently conflated correlation and causation, failed to provide context (e.g., that soft actuators tend to be extremely slow), and tended to overhype results by overusing words like “groundbreaking” and “novel” (though this last behavior disappeared when the prompts specifically addressed it).
Overall, the writers found that ChatGPT was generally competent at “transcribing” what was written in a scientific paper, especially if that paper didn’t have much nuance to it. The LLM was weaker at “translating” those findings by diving into methodologies, limitations, or big-picture implications. Those weaknesses were especially pronounced for papers that presented multiple differing results, or when the LLM was asked to summarize two related papers into a single brief.
This AI summary simply isn’t engaging enough for me.
Credit: AAAS
While the tone and style of ChatGPT summaries were often a good match for human-authored content, “concerns about the factual accuracy in LLM-authored content” were common, the writers wrote. Even using ChatGPT summaries as a “starting point” for human editing “would require just as much, if not more, effort as drafting summaries themselves from scratch” due to the need for “extensive fact-checking,” they added.
These results may not be too surprising given previous studies that have shown AI search engines citing incorrect news sources a full 60 percent of the time. Still, the specific weaknesses are all the more glaring when it comes to scientific papers, where accuracy and clarity of communication are paramount.
In the end, the AAAS writers concluded that ChatGPT “does not meet the style and standards for briefs in the SciPak press package.” The white paper did allow that it may be worth running the experiment again if ChatGPT “experiences a major update.” For what it’s worth, GPT-5 was introduced to the public in August.