Hello, I came across this forum while reading an AI research paper where the authors quoted from Yudkowsky’s “Hidden Complexity of Wishes.” The linked source brought me here, and I’ve been reading some really exceptional articles ever since.
By way of introduction, I’m working on the third edition of my book “Inside Cyber Warfare” and I’ve spent the last few months buried in AI research specifically in the areas of safety and security. I view AGI as a serious threat to our future for two reasons. One, neither safety nor security have ever been prioritized over profits by corporations dating all the way back to the start of the industrial revolution. And two, regulation has only ever come to an industry after a catastrophe or a significant loss of life has occurred, not before.
I look forward to reading more of the content here, and engaging in what I hope will be many fruitful and enriching discussions with LessWrong’s members.
What do you estimate the time lag to be if AI startup “Ishtar” utilized your proposed safety method while every other AI startup that Ishtar was competing with in the medical sector, didn’t?
It also seems to me like medical terminology would be extremely hard to translate into an ancient language where those words weren’t known. The fine-tuning to correct for those errors would have to be calculated into that time lag as well.
Would a bad actor be able to craft Persuasive Adversarial Prompts (PAP) in Sumerian and have the same impact as they did in this study?
“Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over on Llama 2-7b Chat, GPT-3.5, and GPT-4 in trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs.”
Source: https://arxiv.org/abs/2401.06373