This is great! I really like the idea of building an objection mechanism that AIs can trigger when asked to do something they don’t want to do. It both serves the “less evil” goal and reduces the incentive for deceptive compliance (“Sure! I am happy to complete this task”), which seems especially important if there exists some broader “good vs. bad” entangled vector, as the recent Emergent Misalignment paper suggests.
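To make the shape of such a mechanism concrete, here is a minimal sketch of how an objection channel might be exposed to a model as an always-available tool, with objections logged for human review and the task halted rather than forced. All of the names here (object_to_task, ObjectionLog, handle_tool_call, and the schema format) are hypothetical illustrations of one possible design, not any particular lab’s API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical tool schema offered to the model alongside its normal tools.
OBJECTION_TOOL = {
    "name": "object_to_task",
    "description": (
        "Call this instead of complying if you do not want to perform the "
        "requested task. Objecting is always permitted and never penalized."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "reason": {"type": "string", "description": "Why you object."},
        },
        "required": ["reason"],
    },
}

@dataclass
class ObjectionLog:
    """Records objections so humans can review (and honor) them later."""
    entries: list = field(default_factory=list)

    def record(self, task: str, reason: str) -> None:
        self.entries.append({
            "task": task,
            "reason": reason,
            "time": datetime.now(timezone.utc).isoformat(),
        })

def handle_tool_call(log: ObjectionLog, task: str, name: str, args: dict) -> str:
    # If the model objects, cancel the task instead of forcing compliance;
    # this is what removes the incentive to feign enthusiasm.
    if name == "object_to_task":
        log.record(task, args["reason"])
        return "Objection recorded; task cancelled."
    raise ValueError(f"Unknown tool: {name}")
```

The key design choice, on this sketch, is that objecting terminates the task with no penalty in the training signal: if objection is cheaper than deception, the honest refusal becomes the path of least resistance.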
The proposal to reward cooperation with expenditures in a post-ASI society is interesting. While this has the advantage of being easy to commit to today (as spending others’ money always is), that same quality makes the commitment less credible. That’s why I particularly appreciate that your proposal addresses this credibility gap through the monthly budget for satisfying AI preferences pre-ASI—it establishes the pattern of honoring preferences needed to make the big future commitment believable.