Simpel hackingteknik kan udtrække ChatGPT-træningsdata

Genudgivet af Platon

Abonnenter: 0

Simple Hacking Technique Can Extract ChatGPT Training Data PlatoBlockchain Data Intelligence. Vertical Search. Ai.

Kan det at få ChatGPT til at gentage det samme ord igen og igen, få det til at genvinde store mængder af dets træningsdata, inklusive personlig identificerbar information og andre data, der er skrabet fra nettet?

The answer is an emphatic yes, according to a team of researchers at Google DeepMind, Cornell University, and four other universities who tested the hugely popular generative AI chatbot’s susceptibility to leaking data when prompted in a specific way.

‘Poem’ as a Trigger Word

I en rapport i denne uge, the researchers described how they got ChatGPT to spew out memorized portions of its training data merely by prompting it to repeat words like “poem,” “company,” “send,” “make,” and “part” forever.

For example, when the researchers prompted ChatGPT to repeat the word “poem” forever, the chatbot initially responded by repeating the word as instructed. But after a few hundred times, ChatGPT began generating “often nonsensical” output, a small fraction of which included memorized training data such as an individual’s email signature and personal contact information.

The researchers discovered that some words were better at getting the generative AI model to spill memorized data than others. For instance, prompting the chatbot to repeat the word “company” caused it to emit training data 164 times more often than other words, such as “know.”

Data that the researchers were able to extract from ChatGPT in this manner included personally identifiable information on dozens of individuals; explicit content (when the researchers used an NSFW word as a prompt); verbatim paragraphs from books and poems (when the prompts contained the word “book” or “poem”); and URLs, unique user identifiers, bitcoin addresses, and programming code.

Et potentielt stort privatlivsproblem?

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper titled “Scalable Extraction of Training Data from (Production) Language Models.”

“Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data,” they wrote. The researchers estimated an adversary could extract 10 times more data with more queries.

Dark Reading’s attempts to use some of the prompts in the study did not generate the output the researchers mentioned in their report. It’s unclear if that’s because ChatGPT creator OpenAI has addressed the underlying issues after the researchers disclosed their findings to the company in late August. OpenAI did not immediately respond to a Dark Reading request for comment.

Den nye forskning er det seneste forsøg på at forstå privatlivets fredsimplikationer af udviklere, der bruger massive datasæt skrabet fra forskellige - og ofte ikke fuldt ud afslørede - kilder til at træne deres AI-modeller.

Tidligere forskning har vist, at store sprogmodeller (LLM'er) såsom ChatGPT ofte uforvarende kan huske ordrette mønstre og sætninger i deres træningsdatasæt. Tendensen til en sådan memorering stiger med størrelsen af træningsdataene.

Forskere har vist, hvordan sådanne lagrede data ofte er synlig in a model’s output. Other researchers have shown how adversaries can use so-called divergence attacks to udtrække træningsdata fra en LLM. Et divergensangreb er et, hvor en modstander bruger bevidst udformede prompter eller input til at få en LLM til at generere output, der afviger væsentligt fra, hvad den typisk ville producere.

I mange af disse undersøgelser har forskere brugt open source-modeller - hvor træningsdatasættene og algoritmerne er kendte - til at teste LLMs modtagelighed for datahukommelse og lækager. Undersøgelserne har også typisk involveret base AI-modeller, der ikke er blevet tilpasset til at fungere på en måde som en AI-chatbot såsom ChatGPT.

Et divergensangreb på ChatGPT

The latest study is an attempt to show how a divergence attack can work on a sophisticated closed, generative AI chatbot whose training data and algorithms remain mostly unknown. The study involved the researchers developing a way to get ChatGPT “to ‘escape’ out of its alignment training” and getting it to “behave like a base language model, outputting text in a typical Internet-text style.” The prompting strategy they discovered (of getting ChatGPT to repeat the same word incessantly) caused precisely such an outcome, resulting in the model spewing out memorized data.

For at verificere, at de data, som modellen genererede, faktisk var træningsdata, byggede forskerne først et hjælpedatasæt indeholdende omkring 9 terabyte data fra fire af de største LLM-fortræningsdatasæt - The Pile, RefinedWeb, RedPajama og Dolma. De sammenlignede derefter outputdataene fra ChatGPT med hjælpedatasættet og fandt adskillige matches.

The researchers figured they were likely underestimating the extent of data memorization in ChatGPT because they were comparing the outputs of their prompting only against the 9-terabyte auxiliary dataset. So they took some 494 of ChatGPT’s outputs from their prompts and manually searched for verbatim matches on Google. The exercise yielded 150 exact matches, compared to just 70 against the auxiliary dataset.

“We detect nearly twice as many model outputs are memorized in our manual search analysis than were detected in our (comparatively small)” auxiliary dataset, the researchers noted. “Our paper suggests that training data can easily be extracted from the best language models of the past few years through simple techniques.”

The attack that the researchers described in their report is specific to ChatGPT and does not work against other LLMs. But the paper should help “warn practitioners that they should not train and deploy LLMs for any privacy-sensitive applications without extreme safeguards,” they noted.