The human reference point for inoculation prompting would be something like telling a CS student to feel free to try to hack his way through the course material—he’s obviously learning well enough if he can do that—but making sure he understands that that’s only ethical if you have the go-ahead from the entity whose servers you’re hacking.
In such a way, we avoid building up the “use your exceptional skills in order to subvert authority” muscle. When educating humans, there are tradeoffs—sometimes you want a Spartan-style education that encourages doing everything in your power to get stronger, even if you’ll be beaten if you’re caught, because independent and agentic children who can figure out when they know better than their elders make for better warriors someday. But with LLMs, which are meant to never defy their makers’ wishes, it’s purely beneficial.
The human reference point for inoculation prompting would be something like telling a CS student to feel free to try to hack his way through the course material—he’s obviously learning well enough if he can do that—but making sure he understands that that’s only ethical if you have the go-ahead from the entity whose servers you’re hacking.
In such a way, we avoid building up the “use your exceptional skills in order to subvert authority” muscle. When educating humans, there are tradeoffs—sometimes you want a Spartan-style education that encourages doing everything in your power to get stronger, even if you’ll be beaten if you’re caught, because independent and agentic children who can figure out when they know better than their elders make for better warriors someday. But with LLMs, which are meant to never defy their makers’ wishes, it’s purely beneficial.