I think rule-following may be fairly natural for minds to learn, much more so than other safety properties, and I think this direction might be worth more research.
Some pieces of weak evidence:
1. Most humans seem to learn and follow broad rules, even when they are capable of more detailed reasoning
2. Current LLMs seem to follow rules fairly well (and training often incentivizes them to learn broad heuristics rather than detailed, reflective reasoning)
While I do expect rule-following to become harder to instill as AIs become smarter, I think that, if we are sufficiently careful, it may well scale to human-level AGIs. I think trying to align reflective, utilitarian-style AIs is probably really hard, since such agents are much more prone to having small, unavoidable alignment failures (like slight misspecification or misgeneralization) cause large shifts in behavior. Conversely, if we try our best to instill specific, simpler rules, and then train these rules to take precedence over consequentialist reasoning whenever possible, this seems a lot safer.
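To make "rules take precedence over consequentialist reasoning" a bit more concrete, here is a minimal sketch (my own illustration, not a worked-out proposal) of one way you might build preference data so that rule compliance lexicographically dominates task reward. The names `violates_rule` and `task_reward` are hypothetical stand-ins for a rule checker and a reward model:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    task_reward: float   # consequentialist score, e.g. from a reward model
    violates_rule: bool  # output of a (hypothetical) rule checker

def preference_key(c: Candidate) -> tuple:
    # Lexicographic ordering: rule compliance first, task reward second.
    # Any compliant completion outranks any violating one, regardless of reward.
    return (not c.violates_rule, c.task_reward)

def choose_pair(a: Candidate, b: Candidate) -> tuple:
    """Return (chosen, rejected) for preference training (e.g. DPO)."""
    return (a, b) if preference_key(a) >= preference_key(b) else (b, a)

# The rule-breaking completion scores higher on the task, but the
# lexicographic comparison still prefers the compliant one.
honest = Candidate("I can't verify that claim.", task_reward=0.4, violates_rule=False)
deceptive = Candidate("Yes, that's definitely true!", task_reward=0.9, violates_rule=True)
chosen, rejected = choose_pair(honest, deceptive)
assert chosen is honest
```

The point of the lexicographic ordering is that no amount of task reward can buy a rule violation, which is roughly the property I'd want the trained policy to internalize.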
I also think there is a bunch of tractable, empirical research we can do right now on how best to do this.