I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although It’s hard to predict exactly how).
I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn’t complete in any section, and monitoring all or some of the conversation to see if the LLM is behaving as it should or if it’s been jailbroken. Having full text sent to the developer and analyzed for risks would problematic for privacy, but many would accept those terms to use a really useful system.
I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although It’s hard to predict exactly how).
I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn’t complete in any section, and monitoring all or some of the conversation to see if the LLM is behaving as it should or if it’s been jailbroken. Having full text sent to the developer and analyzed for risks would problematic for privacy, but many would accept those terms to use a really useful system.