Yeah, no, I’m talking about the math itself being bad, rather than the math being correct but the logical uncertainty making poor guesses early on.
i’ve been thinking a bunch about ways this could fail and how to overcome them (1, 2, 3).
I noticed you had some other posts relating to the counterfactuals, but skimming them felt like you were invoking a lot of other machinery that I don’t think we have, and that you also don’t think we have (IE the voice in the posts is speculative, not affirmative).
So I thought I would just ask.
My own thinking would be that the counterfactual reasoning should be responsive to the system’s overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
Sticking close to QACI, I think what this amounts to is tracking uncertainty about the counterfactuals employed, rather than solidly assuming one way of doing it is correct. But there are complex questions of how to manage that uncertainty.
i’ve done some work towards building that machinery (see e.g. here), but yes, there are still a bunch of things to be figured out, though i’m making progress in that direction (see the posts about blob location).
My own thinking would be that the counterfactual reasoning should be responsive to the system’s overall estimates of how-humans-would-want-it-to-reason, in the same way that its prior needs to be an estimate of the human-endorsed prior, and values should approximate human-endorsed values.
are you saying this in the prescriptive sense, i.e. that we should want that property? i think if implemented correctly, accuracy is all we would really need, right? carrying human intent in those parts of the reasoning seems difficult and wonky, and plausibly not necessary to me; straightforward utility maximization should work.
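To make the disagreement concrete, here is a toy Python sketch of the "tracking uncertainty about the counterfactuals employed" idea quoted above, next to the "just commit to one way of doing it" alternative. Everything in it is an assumption made for illustration: the names `expected_utility`, `best_action`, `model_a`, `model_b`, the two actions, the credence weights, and the 0/1 utility are all hypothetical, and none of this is QACI's actual formalism.

```python
from typing import Callable, Dict, Sequence

# All names below are hypothetical, chosen for illustration only.
Distribution = Dict[str, float]                       # outcome -> probability
CounterfactualModel = Callable[[str], Distribution]   # action -> predicted outcomes


def expected_utility(
    action: str,
    models: Sequence[CounterfactualModel],
    weights: Sequence[float],
    utility: Callable[[str], float],
) -> float:
    """Utility of `action`, averaged over counterfactual models by credence."""
    total = 0.0
    for model, weight in zip(models, weights):
        predicted = model(action)
        total += weight * sum(p * utility(o) for o, p in predicted.items())
    return total


def best_action(
    actions: Sequence[str],
    models: Sequence[CounterfactualModel],
    weights: Sequence[float],
    utility: Callable[[str], float],
) -> str:
    """Pick the action with the highest credence-weighted expected utility."""
    return max(actions, key=lambda a: expected_utility(a, models, weights, utility))


# Two made-up ways of evaluating "what would happen if I did this?".
def model_a(action: str) -> Distribution:
    p_good = {"x": 0.9, "y": 0.5}[action]
    return {"good": p_good, "bad": 1.0 - p_good}


def model_b(action: str) -> Distribution:
    p_good = {"x": 0.1, "y": 0.6}[action]
    return {"good": p_good, "bad": 1.0 - p_good}


def utility(outcome: str) -> float:
    return 1.0 if outcome == "good" else 0.0


ACTIONS = ["x", "y"]

# Solidly assuming model_a is the correct counterfactual:
print(best_action(ACTIONS, [model_a, model_b], [1.0, 0.0], utility))  # x

# Splitting credence evenly between the two counterfactuals:
print(best_action(ACTIONS, [model_a, model_b], [0.5, 0.5], utility))  # y
```

In this toy version the whole question is where the weights come from. On the view quoted above, they would track an estimate of how humans would want the counterfactual evaluated; on the accuracy-only view, they would just be whatever is best supported as correct. The sketch doesn't settle that, it only shows that the choice can change which action gets picked.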
Notably, this relies on the utility function actually being sparse enough that it can’t be maximized except by generating the traits abram mentions.