The easy first step is a simple bias toward inaction, which you can provide by imposing a large penalty on every output of any kind. For instance, a language model with this bias would write out something extremely likely and then stop soon after. This is only a partial measure, of course, but it is a significant first step.
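As a rough illustration of that per-output penalty, here is a minimal sketch. The function names, the idea of a per-step task reward, and the default penalty size are all assumptions made for the example, not part of any particular training setup.

```python
def penalized_return(step_rewards, penalty_per_output=1.0):
    """Total task reward minus a fixed penalty for every output emitted."""
    return sum(step_rewards) - penalty_per_output * len(step_rewards)


def best_stopping_length(step_rewards, penalty_per_output=1.0):
    """Number of outputs that maximizes the penalized return (0 means stay silent).

    step_rewards[i] is the hypothetical task reward earned by the (i+1)-th output.
    """
    best_len, best_value, running = 0, 0.0, 0.0
    for n, reward in enumerate(step_rewards, start=1):
        running += reward - penalty_per_output
        if running > best_value:
            best_len, best_value = n, running
    return best_len
```

With step rewards of `[3.0, 0.5, 0.2]` and a penalty of `1.0`, only the first output pays for itself, so the best stopping length is 1. If no output is worth more than the penalty, the best choice is to emit nothing at all.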
The second through n-th steps are harder, and I really don’t know how you would figure out what values to train it with to reduce impact. The immediate things I can think of might also train deceit, so it would take some thought.
Also, throughout training, ask a panel (or rather many separate panels) of judges to determine whether the actions it proposes in hypothetical situations or games were the minimal actions it could have taken for the level of positive impact achieved. Obviously, if the impact is negative, it wasn’t the minimal action. Perhaps also train a network explicitly on the decisions of similar panels about actions humans have taken, and use those same criteria.
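One way the panel judgments might be combined into a single training signal is sketched below. The majority-vote rule inside each panel, the averaging across panels, and all the names are assumptions for illustration; the only rule taken from the text is that a negative impact can never count as minimal.

```python
def panel_says_minimal(votes):
    """Majority vote within one panel; each vote is True if that judge
    thinks the action was the minimal one for the impact achieved."""
    return sum(votes) * 2 > len(votes)


def minimality_score(panels, impact):
    """Fraction of independent panels that judged the action minimal.

    Returns 0.0 when the impact is negative, since such an action
    cannot have been the minimal one, and 0.0 when there are no panels.
    """
    if impact < 0 or not panels:
        return 0.0
    return sum(panel_says_minimal(p) for p in panels) / len(panels)
```

The same score could serve as the training label for a network that imitates the panels, though how well such a network would generalize is an open question.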
Somewhere in there, best place unknown, penalize heavy use of computation in coming up with plans (though perhaps not in evaluating them).
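A sketch of what such a computation penalty could look like follows. It assumes planning and evaluation effort can be counted separately as "steps", and it reads "heavy" as "above some free budget"; both the budget and the cost per step are made-up parameters.

```python
def compute_penalty(planning_steps, free_budget=100, cost_per_step=0.01):
    """Penalty for planning computation beyond a free budget (hypothetical units)."""
    excess = max(0, planning_steps - free_budget)
    return cost_per_step * excess


def reward_after_compute_penalty(base_reward, planning_steps, evaluation_steps,
                                 free_budget=100, cost_per_step=0.01):
    """Subtract the planning penalty from the reward.

    evaluation_steps is accepted only to make explicit that time spent
    evaluating plans is deliberately not charged.
    """
    return base_reward - compute_penalty(planning_steps, free_budget, cost_per_step)
```

Under these assumptions, an agent that plans for 150 steps loses the same amount whether it then spends 10 or 10,000 steps evaluating the plan.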
The final step (and perhaps at other stages too) is to penalize any actions taken that humans don’t like. This can be done in a variety of ways. For instance, have three random humans vote on each action it takes, and give it a penalty for each person who dislikes the action.
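The three-voter version can be sketched as follows. Human raters are stood in for by plain Python callables that return True when they dislike an action; that interface, the names, and the penalty size are assumptions for the example.

```python
import random


def dislike_penalty(votes, penalty_per_dislike=1.0):
    """Penalty proportional to the number of raters who disliked the action.

    votes is a list of booleans, True meaning 'I dislike this action'.
    """
    return penalty_per_dislike * sum(1 for v in votes if v)


def sample_panel_penalty(action, raters, rng, panel_size=3, penalty_per_dislike=1.0):
    """Draw a random panel from the rater pool and total up their dislikes.

    raters: callables taking an action and returning True on dislike.
    rng: a random.Random instance, passed in so runs can be reproduced.
    """
    panel = rng.sample(raters, panel_size)
    votes = [rater(action) for rater in panel]
    return dislike_penalty(votes, penalty_per_dislike)
```

Drawing a fresh random panel for each action makes it harder for the system to tailor its behavior to particular raters, although it does nothing about raters being fooled in the same way.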