That’s probably intentional. If it were purely up to the first AI, you’d only need to jailbreak it to get it to produce disallowed pictures. And if it could tell you the exact criteria the second AI was responding to, you might have an easier time finding a prompt that was technically allowed but violated the spirit of the rules.
There have been some “see if you can get an LLM to produce a disallowed output” games online, where the higher difficulties involve stacking separate censor LLMs on top of the first one.
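The stacking pattern those games use is roughly this: the generator answers, and then one or more independent checker models judge the output before the user sees it. A minimal sketch, using a hypothetical `call_llm(model, prompt)` helper rather than any real API:

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return its reply."""
    raise NotImplementedError("stand-in for a real LLM API call")


def guarded_reply(user_prompt: str) -> str:
    # First model produces the candidate answer.
    candidate = call_llm("generator", user_prompt)

    # Separate censor models judge the *output*, not the prompt,
    # so jailbreaking the generator alone isn't enough.
    for censor in ("censor-1", "censor-2"):
        verdict = call_llm(
            censor,
            f"Is this output allowed? Answer YES or NO.\n\n{candidate}",
        )
        if not verdict.strip().upper().startswith("YES"):
            return "Sorry, I can't share that."

    return candidate
```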
I’ve read on Twitter that while Gemini can technically produce images itself, they aren’t as high quality as pictures created by a dedicated diffusion model. So they just have the language model write a prompt for the image model in the background and call an API function that is hidden from the user. That this also reduces the risk of jailbreaks may be more of a welcome side effect.
That by itself wouldn’t imply that the language model doesn’t know what the criteria for refusal are, though. It would be simpler to just let the model decide whether to call the function at all than to have the function itself implement another check.
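To make the two options concrete, here is a sketch of where the refusal check could live, with hypothetical helpers (`call_llm`, `moderate`, `diffusion_api`) standing in for whatever Google actually uses internally:

```python
def call_llm(prompt: str) -> dict:
    """Hypothetical: returns {'text': ...} or {'tool_call': {'image_prompt': ...}}."""
    raise NotImplementedError("stand-in for the language model")


def moderate(image_prompt: str) -> bool:
    """Hypothetical separate safety classifier; True means allowed."""
    raise NotImplementedError("stand-in for a second checking model")


def diffusion_api(image_prompt: str) -> bytes:
    """Hypothetical call to the dedicated image model."""
    raise NotImplementedError("stand-in for the image backend")


def generate_image(user_request: str):
    reply = call_llm(user_request)

    # Option A: the language model refuses by simply not emitting a tool call.
    if "tool_call" not in reply:
        return reply["text"]

    image_prompt = reply["tool_call"]["image_prompt"]

    # Option B: the hidden tool function runs its own check, so the language
    # model never needs to know (or be able to reveal) the exact criteria.
    if not moderate(image_prompt):
        return "Sorry, I can't generate that image."

    return diffusion_api(image_prompt)
```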