Microsoft deletes blog telling users to train AI on pirated Harry Potter

Wizarding world of AI slop

The now-deleted Harry Potter dataset was “erroneously” marked public domain.

Following backlash in a Hacker News thread, Microsoft deleted a blog post that critics said encouraged developers to pirate Harry Potter books to train AI models that could then be used to create AI slop.

The blog, which has been archived, was written in November 2024 by a senior product manager, Pooja Kamath. According to her LinkedIn, Kamath has been at Microsoft for more than a decade and remains with the company. In 2024, Microsoft tapped her to promote a new feature that the blog said made it easier to “add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.”

What better way to show “interesting and relatable examples” of Microsoft’s new feature that would “resonate with a broad audience” than to “use a well-known dataset” like Harry Potter books, the blog said.

The books are “among the most popular and valued series in literary history,” the blog noted, and fans could use the LLMs they trained in two fun ways: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to thrill Potterheads.”

To help Microsoft customers achieve this vision, the blog linked to a Kaggle dataset that included all seven Harry Potter books, which, Ars verified, had been available online for years and incorrectly marked as “public domain.” Kaggle’s terms say that rights holders can send notices of infringing content, and repeat offenders risk suspensions, but Hacker News commenters speculated that the Harry Potter dataset flew under the radar, with only 10,000 downloads over time, not catching the attention of J.K. Rowling, who famously keeps a tight grip on the Harry Potter copyrights. The dataset was promptly deleted on Thursday after Ars reached out to the uploader, Shubham Maindola, a data scientist in India with no apparent links to Microsoft.

Maindola told Ars that “the dataset was marked as Public Domain by error. There was no intent to misrepresent the licensing status of the works.”

It’s unclear whether Kamath was directed to link to the Harry Potter books dataset in the blog or if it was an individual choice. Cathay Y. N. Smith, a law professor and co-director of Chicago-Kent College of Law’s Program in Intellectual Property Law, told Ars that Kamath may not have realized the books were too recent to be in the public domain.

“Someone might be really knowledgeable about books and technology, but not necessarily about copyright terms and how long they last,” Smith said. “Especially if she saw that something was marked by another reputable company as being public domain.”

Microsoft declined Ars’ request to comment. Kaggle did not respond to Ars’ request to comment.

Microsoft was “probably smart” to pull the blog

On Hacker News, commenters suggested that it’s unlikely anyone familiar with the popular franchise would believe the Harry Potter books were in the public domain. They debated whether Microsoft’s blog was “problematic copyright-wise,” since Microsoft not only encouraged customers to download the infringing materials but also used the books themselves to create Harry Potter AI models that relied on beloved characters to hype Microsoft products.

Microsoft’s blog was posted more than a year ago, at a time when AI firms began facing lawsuits over AI models, which had allegedly infringed copyrights by training on pirated materials and regurgitating works verbatim.

The blog recommended that users learn to train their own AI models by downloading the Harry Potter dataset and then uploading text files to Azure Blob Storage. It included example models based on a dataset that Microsoft seemingly uploaded to Azure Blob Storage, which only included the first book, Harry Potter and the Sorcerer’s Stone.

Training large language models (LLMs) on text files, Harry Potter fans could create Q&A systems capable of pulling up relevant excerpts of books. An example query offered was “Wizarding World snacks,” which retrieved an excerpt from The Sorcerer’s Stone where Harry marvels at strange treats like Bertie Bott’s Every Flavor Beans and chocolate frogs. Another prompt asking “How did Harry feel when he first learned that he was a Wizard?” generated an output pointing to several early excerpts in the book.

Perhaps an even more exciting use case, Kamath suggested, was generating fan fiction to “explore new adventures” and “even create alternate endings.” That model could quickly comb the dataset for “contextually similar” excerpts that could be used to output fresh stories that fit with existing narratives and incorporate “elements from the retrieved passages,” the blog said.

As an example, Kamath trained a model to write a Harry Potter story she could use to market the feature she was blogging about. She asked the model to write a story in which Harry meets a new friend on the Hogwarts Express train who tells him all about Microsoft’s Native Vector Support in SQL “in the Muggle world.”

Drawing on parts of The Sorcerer’s Stone where Harry learns about Quidditch and gets to know Hermione Granger, the fan fiction showed a boy selling Harry on Microsoft’s “incredible” new feature. To do this, he likened it to having a spell that helps you find exactly what you need among countless options, instantly, while declaring it was perfect for machine learning, AI, and recommendation systems.

Further blurring the lines between the Microsoft and Harry Potter brands, Kamath also generated an image showing Harry with his new friend, stamped with a Microsoft logo.

Smith told Ars that both use cases could annoy rights holders, depending on the content in the model outputs.

“I think that the regurgitation and the creation of fan fiction, they both could flag copyright issues, in that fan fiction often has to take from the expressive elements, a copyrighted character, a character that’s famous enough to be protected by a copyright law or plot stories or sequences,” Smith said. “If these things are copied and reproduced, then that output could be potentially infringing.”

It’s also still a gray area. Looking at the blog, Smith said, “I would be concerned,” but “I wouldn’t say it’s automatically infringement.”

Smith told Ars that, in pulling the blog, Microsoft “was probably smart,” since courts have only generally said that training AI on copyrighted books is fair use. Courts continue to probe questions about pirated AI training materials.

On the deleted Kaggle dataset page, Maindola previously explained that to source the data, he “downloaded the ebooks and then converted them to txt files.”

Microsoft may have infringed copyrights

If Microsoft ever faced questions about whether the company knowingly used pirated books to train the example models, fair use “could be a difficult argument,” Smith said.

Hacker News commenters suggested the blog could be considered fair use, since the training guide was for “educational purposes,” and Smith said that Microsoft could raise some “good arguments” in its defense.

She also suggested that Microsoft could be deemed liable for contributing to infringement on some level after leaving the blog up for a year. Before it was removed, the Kaggle dataset was downloaded more than 10,000 times.

“The ultimate result is to create something infringing by saying, ‘Hey, here you go, go grab that infringing stuff and use that in our system,’” Smith said. “They could potentially have some sort of secondary contributory liability for copyright infringement, downloading it, as well as then using it to encourage others to use it for training purposes.”

On Hacker News, commenters slammed the blog, including a self-described former Microsoft employee who claimed that Microsoft lets employees “blog without having to go through some approval or editing process.”

“It looks like somebody made a bad judgment call on what to put in a company blog post (and maybe what constitutes ethical activity) and that it was taken down as soon as someone noticed,” the former employee said.

Others suggested the blame lay solely with the Kaggle uploader, Maindola, who told Ars that the dataset should never have been marked “public domain.” Microsoft critics pushed back, noting that the Kaggle page made it clear that no special permission was granted and that Microsoft’s employee should have known better. “They don’t need to know any details to know that these properties belong to massive companies and aren’t free for the taking,” one commenter said.

The Harry Potter books weren’t the only books targeted, the thread noted, linking to a separate Azure sample containing Isaac Asimov’s Foundation series, which is also not in the public domain.

“Microsoft could have used any dataset for their blog, they could have even chosen to use actual public domain books,” another Hacker News commenter wrote. “Instead, they chose to use copywritten works that J.K. hasn’t released into the public domain (unless user ‘Shubham Maindola’ is J.K.’s alter ego).”

Smith suggested Microsoft could have avoided today’s backlash by more carefully reviewing blogs, noting that “if a company is risk averse, this would probably be flagged.” She also understood Kamath’s preference for Harry Potter over the many long-forgotten characters that exist in the public domain. On Hacker News, some commenters defended Kamath’s blog, arguing that it should be considered fair use since nonprofits and universities could do the same thing in a teaching context without issue.

“I would have been concerned if I were the one clearing this for Microsoft, but at the same time, I completely understand what this employee was doing,” Smith said. “No one wants to write fan fiction about books that are in the public domain.”

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.
