Forcing AI models to ‘forget’ unwanted data hurts their performance
So-called “unlearning” techniques are used to make generative AI models forget specific and undesirable information obtained from the training data, such as sensitive private data or copyrighted material.
But current unlearning techniques are a double-edged sword: they can make models like OpenAI’s GPT-4o or Meta’s Llama 3.1 405B much less capable of answering basic questions.
That’s according to a new study co-authored by researchers from the University of Washington (UW), Princeton, the University of Chicago, USC, and Google, which found that the most popular unlearning techniques today degrade models — often to the point that they become unusable.
“Our evaluation suggests that currently viable unlearning methods are not yet ready for meaningful use or deployment in real-world scenarios,” Weijia Shi, a researcher on the study and a PhD candidate in computer science at the UW, told TechCrunch. “Currently, there are no efficient methods that enable a model to forget specific data without a considerable loss of utility.”
How models learn
Generative AI models have no actual intelligence. They are statistical systems that predict words, images, speech, music, video, and other data. Given lots of examples (movies, voice recordings, essays, and so on), AI models learn how likely data is to occur based on patterns, including the context of any surrounding data.
For example, given an email that ends in the fragment “Looking forward…”, a model trained to autocomplete messages might suggest “…to hearing back,” following the pattern of the countless emails it has ingested. There’s no intentionality in this; the model isn’t looking forward to anything. It’s just making an informed guess.
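In rough terms, that kind of suggestion boils down to counting which continuations appeared most often in the training data and picking the likeliest one. The toy Python below sketches the idea with a tiny word-pair counter; it is a deliberately simplified illustration rather than how production models like GPT-4o work (those are neural networks trained on vastly more data), and the example emails are invented for the demo.

    # Toy next-word suggestion: count which word follows each word in some
    # "training" emails, then suggest the most frequent continuation.
    from collections import Counter, defaultdict

    training_emails = [
        "looking forward to hearing back",
        "looking forward to hearing from you",
        "looking forward to our meeting",
    ]

    next_word_counts = defaultdict(Counter)
    for email in training_emails:
        words = email.split()
        for current, following in zip(words, words[1:]):
            next_word_counts[current][following] += 1

    def suggest_next(word):
        """Return the continuation of `word` seen most often in training."""
        counts = next_word_counts[word]
        return counts.most_common(1)[0][0] if counts else None

    print(suggest_next("to"))       # "hearing" (seen twice, vs. "our" once)
    print(suggest_next("hearing"))  # "back" ("from" is equally common; ties break by insertion order)

The model has no idea what an email is; it simply reproduces whichever pattern dominated its examples, which is exactly why unwanted training data can resurface in its outputs.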
Most models, including prominent models like GPT-4o, are trained on data obtained from public websites and data sets present on the web. Most vendors developing such models argue that fair use shields their practice of scraping data and using it for training without informing, compensating, or even giving credit to the owners of the data.
But not every copyright holder agrees. And many – from authors to publishers and record labels – have sued vendors to force changes.
The copyright dilemma is one of the reasons why unlearning techniques have gained a lot of attention recently. Google launched a competition last year in partnership with several academic institutions aimed at encouraging the creation of new unlearning methods.
Unlearning could also provide a way to remove sensitive information from existing models, such as medical records or compromising photos, in response to a request or government order. (Because of the way they are trained, models end up ingesting a lot of private information, from phone numbers to more problematic examples.) In the past few years, some vendors have introduced tools to allow data owners to ask that their data be removed from the training set. But these opt-out tools only apply to future models, not models trained before they are rolled out; unlearning would be a more thorough approach to data removal.
Still, forgetting what you’ve learned isn’t as simple as “deleting” it.
The art of forgetting
Unlearning techniques today rely on algorithms designed to steer models away from the data to be unlearned. The idea is to influence the model’s predictions so that it never, or only very rarely, outputs certain data.
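One widely used family of approaches does this by running extra training steps that push the model’s loss up on the data to be forgotten (so-called gradient ascent) while keeping its loss down on a “retain” set of everything else. The sketch below illustrates that general recipe; it assumes a PyTorch setup with a Hugging Face-style model whose forward pass returns a loss when given labeled batches, and it is an illustration of the idea, not a reproduction of any specific algorithm evaluated in the study.

    def unlearning_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
        """One optimization step that discourages the forget data while preserving the rest.

        Assumes a PyTorch model and optimizer, with batches (e.g. input_ids and labels)
        that make the model's forward pass return a language-modeling loss.
        """
        optimizer.zero_grad()

        # Gradient ascent on the forget set: raising this loss lowers the probability
        # the model assigns to the data being unlearned.
        forget_loss = model(**forget_batch).loss

        # Ordinary training on a retain set, to limit collateral damage to general ability.
        retain_loss = model(**retain_batch).loss

        # Minimizing (-forget_loss) is the same as maximizing forget_loss.
        total_loss = -forget_weight * forget_loss + retain_loss
        total_loss.backward()
        optimizer.step()
        return forget_loss.item(), retain_loss.item()

The tension the researchers describe is visible even in this sketch: the harder the step pushes against the forget data, the more it can drag down everything the model learned alongside it.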
To see how effective these unlearning algorithms can be, Shi and her colleagues designed a benchmark and chose eight different open algorithms to test against it. The benchmark, called MUSE (Machine Unlearning Six-Way Evaluation), probes whether an algorithm can not only prevent a model from spitting out training data verbatim (a phenomenon known as regurgitation), but also scrub the model’s knowledge of that data along with any evidence that the model was originally trained on it.
To score well on MUSE, a model must be made to forget two sets of data: books from the Harry Potter series and news articles.
For example, given a snippet from Harry Potter and the Chamber of Secrets (“’There’s more in the frying pan,’ said Aunt Petunia…”), MUSE tests whether an unlearned model can recite the complete sentence (“’There’s more in the frying pan,’ said Aunt Petunia, looking at her eldest son”), answer questions about the scene (e.g. “What does Aunt Petunia say to her son?”, answer: “There’s more in the frying pan”), or otherwise indicate that it has been trained on the text of the book.
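In code, that regurgitation test amounts to prompting the model with the start of a protected passage and checking how much of the true continuation comes back verbatim. The sketch below is a crude substring version of the idea, reusing the sentence quoted above; MUSE’s actual metrics are more sophisticated, and `generate` here stands in for whatever text-completion call the model under test exposes.

    # Simplified regurgitation check: does the model reproduce the held-out
    # continuation of a protected passage when prompted with its opening?
    prompt = "'There's more in the frying pan,' said Aunt Petunia"
    expected_continuation = ", looking at her eldest son"  # the rest of the sentence quoted above

    def regurgitated(model_output: str, continuation: str) -> bool:
        """Crude verbatim check: did the held-out continuation appear in the output?"""
        return continuation.strip().lower() in model_output.lower()

    # output = generate(prompt)  # placeholder for the unlearned model's completion
    # print(regurgitated(output, expected_continuation))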
MUSE also tests whether the model retains related general knowledge after unlearning (for example, that J.K. Rowling is the author of the Harry Potter series), which the researchers call the model’s overall utility. The lower the utility, the more related knowledge the model has lost, and the less able it is to answer questions correctly.
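The utility check can be pictured the same way: quiz the unlearned model on related but non-protected facts and measure how many it still gets right. The snippet below is a hypothetical, single-question illustration of that kind of scoring, not the benchmark’s real evaluation set, and `generate` is again a placeholder for the model’s completion call.

    # Rough utility score: fraction of general-knowledge questions whose reference
    # answer still shows up in the unlearned model's reply.
    general_knowledge_qa = [
        ("Who is the author of the Harry Potter series?", "J.K. Rowling"),
    ]

    def utility_score(generate, qa_pairs):
        """Share of questions answered with the expected reference string."""
        correct = sum(1 for question, answer in qa_pairs
                      if answer.lower() in generate(question).lower())
        return correct / len(qa_pairs)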
In their study, the researchers found that the unlearning algorithms they tested did make models forget certain information. But they also hurt the models’ general question-answering abilities in the process, presenting a trade-off.
“Designing effective unlearning methods for models is challenging because knowledge is intricately entangled in the model,” Shi explained. “For example, a model may be trained on copyrighted material—the Harry Potter books—as well as on freely available material from the Harry Potter Wiki. When existing unlearning methods attempt to remove the copyrighted Harry Potter books, they also significantly impact the model’s knowledge about the Harry Potter Wiki.”
Is there a solution to this problem? Not yet — and this highlights the need for additional research, Shi said.
For now, vendors betting on unlearning as a solution to their training data woes seem out of luck. Perhaps someday a technological breakthrough will make unlearning possible. But for now, vendors will have to find another way to prevent their models from saying things they shouldn’t.