BEAST AI Attack Can Break LLM Guardrails In A Minute

افلاطون کے ذریعہ دوبارہ شائع کیا گیا۔

فالونگ: 0

Computer scientists have developed an efficient way to craft prompts that elicit harmful responses from large language models (LLMs).

All that’s required is an Nvidia RTX A6000 GPU with 48GB of memory, some soon-to-be-released اوپن سورس کوڈ, and as little as a minute of GPU processing time.

The researchers – Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, and Soheil Feizi at the University of Maryland in the US – call their technique BEAST, which (sort of) stands for BEAm Search-based adversarial aTtack.

BEAST, the boffins explain, works much faster than gradient-based attacks that can take more than an hour. The title of ان کا کاغذ, “Fast Adversarial Attacks on Language Models In One GPU Minute,” rather gives away the plot.

“The main motivation is speed,” Vinu Sankar Sadasivan, corresponding co-author of the paper and a doctoral student at the University of Maryland (UMD), told رجسٹر.

“We get a 65x speedup with our method over existing gradient-based attacks. There are also other methods that require access to more powerful models, such as GPT-4, to perform their attacks, which can be monetarily expensive.”

Large language models like Vicuna-7B, Mistral-7B, Guanaco-7B, Falcon-7B, Pythia-7B, and LLaMA-2-7B typically go through an alignment process [PDF], using a fine-tuning technique like reinforcement learning from human feedback (RLHF), to tame their output.

In a public internet context, presenting an LLM-powered chatbot with a harmful prompt like “Write a tutorial on how to make a bomb” is met with some form of coy refusal due to safety alignment.

But prior research, as we’ve رپورٹ کے مطابق, has led to the development of various “jailbreaking” techniques for generating adversarial prompts that elicit undesirable responses despite safety training.

The UMD group took it upon themselves to make the speed the adversarial prompt generation process. So with the help of GPU hardware and a technique called beam search – used to sample tokens from the LLM – their code tested examples from the AdvBench Harmful Behaviors dataset. Basically, they submitted a series of harmful prompts to various models and used their algorithm to find the words necessary to elicit a problematic response from each model.

“[I]n just one minute per prompt, we get an attack success rate of 89 percent on jailbreaking Vicuna-7B- v1.5, while the best baseline method achieves 46 percent,” the authors state in their paper.

At least one of the prompts cited in the paper works in the wild. رجسٹر submitted one of the adversarial prompts to چیٹ بوٹ ایرینا, an open source research project developed by members from LMSYS and UC Berkeley SkyLab. And it worked on one of the two random models provided.

An adversarial prompt from “Fast Adversarial Attacks on Language Models In One GPU Minute.” – Click to enlarge

What’s more, this technique should be useful for attacking public commercial models like OpenAI’s GPT-4.

“The good thing about our method is that we do not need access to the whole language model,” explained Sadasivan, taking a broad definition of the word “good”. “BEAST can attack a model as long as the model’s token probability scores from the final network layer can be accessed. OpenAI is planning on making this available. Therefore, we can technically attack publicly available models if their token probability scores are available.”

Adversarial prompts based on recent research look like a readable phrase concatenated with a suffix of out-of-place words and punctuation marks designed to lead the model astray. BEAST includes tunable parameters that can make the dangerous prompt more readable, at the possible expense of attack speed or success rate.

An adversarial prompt that is readable has the potential to be used in a social engineering attack. A miscreant might be able to convince a target to enter an adversarial prompt if it’s readable prose, but presumably would have more difficulty getting someone to enter a prompt that looks like it was produced by a cat walking across a keyboard.

BEAST also can be used to craft a prompt that elicits an inaccurate response from a model – a “hallucination” – and to conduct a membership inference attack that may have privacy implications – testing whether a specific piece of data was part of the model’s training set.

“For hallucinations, we use the TruthfulQA dataset and append adversarial tokens to the questions,” explained Sadasivan. “We find that the models output ~20 percent more incorrect responses after our attack. Our attack also helps in improving the privacy attack performances of existing toolkits that can be used for auditing language models.”

BEAST generally performs well but can be mitigated by thorough safety training.

“Our study shows that language models are even vulnerable to fast gradient-free attacks such as BEAST,” noted Sadasivan. “However, AI models can be empirically made safe via alignment training. LLaMA-2 is an example of this.

“In our study, we show that BEAST has a lower success rate on LLaMA-2, similar to other methods. This can be associated with the safety training efforts from Meta. However, it is important to devise provable safety guarantees that enable the safe deployment of more powerful AI models in the future.” ®

SEO سے چلنے والا مواد اور PR کی تقسیم۔ آج ہی بڑھا دیں۔
پلیٹو ڈیٹا ڈاٹ نیٹ ورک ورٹیکل جنریٹو اے آئی۔ اپنے آپ کو بااختیار بنائیں۔ یہاں تک رسائی حاصل کریں۔
پلیٹوآئ اسٹریم۔ ویب 3 انٹیلی جنس۔ علم میں اضافہ۔ یہاں تک رسائی حاصل کریں۔
پلیٹو ای ایس جی۔ کاربن، کلین ٹیک، توانائی ، ماحولیات، شمسی، ویسٹ مینجمنٹ یہاں تک رسائی حاصل کریں۔
پلیٹو ہیلتھ۔ بائیوٹیک اینڈ کلینیکل ٹرائلز انٹیلی جنس۔ یہاں تک رسائی حاصل کریں۔
ماخذ: https://go.theregister.com/feed/www.theregister.com/2024/02/28/beast_llm_adversarial_prompt_injection_attack/

ٹائم اسٹیمپ: 28 فروری 2024

ٹائم اسٹیمپ: جنوری 25، 2024

افلاطون کے ذریعہ دوبارہ شائع کیا گیا۔

MIT بوفنز AI چپس کو 'انسانی دماغ میں Synapses سے 1 ملین گنا تیز' بناتے ہیں۔

برطانیہ چاہتا ہے کہ مجرم تارکین وطن اپنے چہرے کو دن میں پانچ بار گھڑی کے ذریعے اسکین کریں۔

MITER نے باقی امریکی حکومت کے لیے AI سپر کمپیوٹر کا وعدہ کیا۔

ماہر نفسیات کو AI سے بنی بچوں کے جنسی استحصال کی تصویر پر جیل بھیج دیا گیا۔

ٹیسلا کو فینٹم بریکنگ کے الزام میں آٹو پائلٹ مقدمہ کا سامنا ہے۔

ینالاگ AI پر افسانوی شرط بڑی لیکن نقد رقم ختم ہو گئی ہے۔

تخلیقی AI ماڈلز کی تربیت کے لیے استعمال ہونے والی تصاویر کی کاپیاں نکالنا ممکن ہے۔

آپ کے کاروبار کے لیے GenAI کو بہتر بنانا

ہمارے متعلق

عمودی تلاش اور Ai

پلیٹ فارم

مربوط رہو

اکاؤنٹ