6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?
It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or to brainwash or drug them into wanting to press the reward button.
[...]
On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.
(I guess I wouldn’t say it’s very low s-risk, but that’s not actually an important disagreement here. Partially I just thought it sounded funny.)
“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.
If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.
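Just to put the orders of magnitude side by side (rough illustrative arithmetic, not a rigorous threshold):

$$10^{40}\ \text{person-years} \;=\; 10^{28} \times (\text{1 trillion person-years}) \;=\; 10^{31} \times (\text{1 billion person-years}).$$

So the “billions” case falls short of the typical astronomical-suffering picture by something like 31 orders of magnitude.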
Does that help clarify why I think Reward Button Alignment poses very low s-risk?
Yeah, I agree that it wouldn’t be an especially bad kind of s-risk. The way I was thinking about s-risk was more like expected amount of suffering. But yeah, I agree with you that it’s not that bad, and perhaps most of the expected suffering comes from more active scenarios, like threats or inverted (sign-flipped) utility functions / values.
(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)