Why wouldn’t an early seed AI reason about the ways that its decision theory makes it exploitable, or the ways its decision theory bars it from cooperating with distant superintelligences (just as the researchers at SI were doing), find the best solution to those problems, and then modify its decision theory?
I think that decision theory is probably more like values than empirical beliefs, in that there’s no reason to think that sufficiently intelligent beings will converge to the same decision theory. E.g. I think CDT agents self-modify into having a decision theory that is not the same as what EDT agents self-modify into.
(Of course, like with values, it might be the case that you can make AIs that are “decision-theoretically corrigible”: these AIs should try to not take actions that rely on decision theories that humans might not endorse on reflection, and they should try to help humans sort out their decision theory problems. I don’t have an opinion on whether this strategy is more or less promising for decision theories than for values.)
(Aside from decision theory and values, the main important thing that I think might be “subjective” is something like your choice over the universal prior.)
I think this is extremely unlikely and I am honestly very confused what you could possibly mean here. Are you saying that there is no sense in which greater intelligence reliably causes you to cooperate with copies of yourself in the prisoner’s dilemma?
(And on the meta level, people saying stuff like this makes me think that I would really still like more research into decision theory, because I think there are strong arguments in the space that could be cleaned up and formalized, and it evidently matters quite a bit, because it causes people to make really weird and to-me-wrong-seeming predictions about the future.)
CDT agents will totally self-modify into agents that cooperate in the twin prisoner’s dilemma, but my understanding is that the thing they self-modify into (called “son of CDT”) behaves differently than e.g. the thing EDT agents self-modify into.
They will only self-modify to cooperate with twins whose action is causally downstream of their commitment, right? So a CDT agent will not self-modify to a policy that does acausal trade with twins outside the lightcone for example.
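A toy sketch of the distinction being drawn here, in Python. Everything in it is an assumption made for illustration: the payoff numbers are a standard prisoner's dilemma matrix, the function names are hypothetical, and `self_modifying_cdt_policy` is only a crude stand-in for "son of CDT", not a formalization of it. It shows a CDT agent defecting against a twin whose action it cannot causally influence, an EDT agent cooperating when its action is strong evidence about the twin's, and a CDT agent choosing a cooperative commitment only when the twin's action is causally downstream of that commitment.

```python
# Payoff to "me", indexed by (my_action, twin_action). Assumed illustrative numbers.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
ACTIONS = ("C", "D")


def cdt_choice(p_twin_cooperates):
    """CDT: the twin's action is held fixed, independent of my own action."""
    def expected_utility(action):
        return (p_twin_cooperates * PAYOFF[(action, "C")]
                + (1 - p_twin_cooperates) * PAYOFF[(action, "D")])
    return max(ACTIONS, key=expected_utility)


def edt_choice(p_twin_matches_me):
    """EDT: condition on my own action as evidence about the twin's action."""
    def expected_utility(action):
        other = "D" if action == "C" else "C"
        return (p_twin_matches_me * PAYOFF[(action, action)]
                + (1 - p_twin_matches_me) * PAYOFF[(action, other)])
    return max(ACTIONS, key=expected_utility)


def self_modifying_cdt_policy(twin_is_downstream, p_twin_cooperates=0.5):
    """A CDT agent picking a policy to commit to (a crude 'son of CDT' stand-in).

    If the twin is created after the commitment (causally downstream), the
    twin inherits the committed policy. Otherwise the commitment cannot
    affect the twin, and the usual CDT dominance reasoning applies.
    """
    def expected_utility(policy):
        if twin_is_downstream:
            return PAYOFF[(policy, policy)]
        return (p_twin_cooperates * PAYOFF[(policy, "C")]
                + (1 - p_twin_cooperates) * PAYOFF[(policy, "D")])
    return max(ACTIONS, key=expected_utility)


if __name__ == "__main__":
    print("CDT vs. twin:", cdt_choice(0.9))
    print("EDT vs. exact twin:", edt_choice(1.0))
    print("CDT commitment, twin downstream:", self_modifying_cdt_policy(True))
    print("CDT commitment, twin outside lightcone:", self_modifying_cdt_policy(False))
```

In this toy model the self-modified CDT agent and the EDT agent agree whenever the twin is copied after the commitment, and come apart exactly in the case raised above, where the twin's action is not causally downstream of it.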
Yeah, I am not saying there is 100% convergence in decision-theory land (there also isn’t 100% convergence in epistemology land), but this is very different from saying “I think that decision theory is probably more like values than empirical beliefs”.
Bayesian priors also don’t converge, they only converge in classes (most obviously anything you assign zero probability to is something you will never start believing).
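A minimal sketch of the zero-probability point, in Python. The coin hypotheses, the specific priors, and the helper names are all assumptions chosen for illustration. Two agents with different but nonzero priors end up with nearly the same posterior after enough evidence, while an agent that assigns a hypothesis probability zero never moves, no matter how much evidence arrives.

```python
# Probability of heads under each hypothesis (assumed toy hypotheses).
P_HEADS = {"fair": 0.5, "biased": 0.9}


def bayes_update(prior, likelihoods):
    """One step of Bayes' rule over a finite hypothesis set."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under this prior")
    return {h: weight / total for h, weight in unnormalized.items()}


def posterior_after_heads(prior, n_heads):
    """Update on observing `n_heads` heads in a row."""
    posterior = dict(prior)
    for _ in range(n_heads):
        posterior = bayes_update(posterior, P_HEADS)
    return posterior


if __name__ == "__main__":
    open_minded = {"fair": 0.5, "biased": 0.5}
    skeptical = {"fair": 0.9, "biased": 0.1}
    dogmatic = {"fair": 1.0, "biased": 0.0}  # zero prior on "biased"

    for name, prior in [("open-minded", open_minded),
                        ("skeptical", skeptical),
                        ("dogmatic", dogmatic)]:
        print(name, posterior_after_heads(prior, 50))
```

The first two agents fall into the same convergence class because they share the same support; the third is in a different class and stays there.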
The situation with decision theory seems a priori pretty similar. There is lots of convergence, but the convergence only occurs under various conditions, and the possible theories will probably form various classes (my guess is that, similar to probability theory, in practice one of the classes will be the one that we expect all actual minds to fall into, but that’s very much up in the air and unstudied, and I am not confident of it).[1]
Also, to be clear, the “values are things you choose” thing is of course also only partially true. Human minds at least, and probably any AI minds we will create, will only have a very partial representation of their values, one that requires a huge amount of logical reasoning and interplay with decision theory and epistemology to meaningfully unfold into something that could constitute a preference ordering.
So in some sense I am not even sure how to talk about preferences in the absence of decision-theory and epistemology, both of which have a lot of structure and convergence and as such will create convergence dynamics in value-space as well. My values are certainly subject to reflection which depends on my epistemological and decision-theoretic principles, and the same seems true for almost all minds.