Johannes Treutlein
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference time compute and memory / context is available to the agent.
Fixed links to all the posts in the sequence:
I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what exactly is the training setup, what exactly is the model, and which parts are hard-coded and which parts are learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it takes a lot of effort to piece this together by going through the code). Right now I have a hard time making any inferences from the results.
How much time do you think there is between “ability to automate” and “actually this has been automated”? Are your numbers for actual automation, or just ability? I personally would agree with your numbers if they are about ability to automate, but I think it will take much longer to actually automate, due to people’s inertia and normal regulatory hurdles (though I find it confusing to think about, because we might have vastly superhuman AI and potentially loss of control before everything is actually automated).
Thanks for your comment!
Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when is constant, then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show for any strictly proper scoring rule that there is a function such that a dishonest prediction is optimal.
Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to honest ones (at least in L1-distance).
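To make the claim about dishonest optimal predictions concrete, here is a minimal numerical sketch. It assumes a binary event, uses the quadratic (Brier-style) score as one example of a strictly proper scoring rule, and uses a made-up influence function `outcome_prob` that maps the announced prediction to the resulting probability of the event. None of these specifics are taken from the paper; they are only meant to illustrate the effect.

```python
# Illustrative sketch: a strictly proper scoring rule can reward a dishonest
# prediction when the prediction itself influences the outcome.

def outcome_prob(p):
    """Hypothetical influence function (an assumption, not from the paper):
    probability that the event occurs once prediction p is announced."""
    return 0.3 + 0.25 * p

def expected_quadratic_score(p):
    """Expected quadratic score of prediction p when the event then occurs
    with probability outcome_prob(p). Higher is better."""
    q = outcome_prob(p)
    return -(q * (1 - p) ** 2 + (1 - q) * p ** 2)

# Search a grid of candidate predictions for the score-maximizing one.
grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=expected_quadratic_score)

# The honest (fixed-point) prediction solves p = 0.3 + 0.25 * p.
fixed_point = 0.4

print("score-maximizing prediction:", best_p)
print("resulting event probability:", outcome_prob(best_p))
print("fixed point:", fixed_point)
```

In this toy setup the score-maximizing prediction differs from the probability it induces, so the reported prediction is not honest even though the scoring rule is strictly proper.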
I think for ‘self-fulfilling prophecy’ I would also expect there to be a counterfactual element—if I say the sun will rise tomorrow and it rises tomorrow, this isn’t a self-fulfilling prophecy because the outcome isn’t reliant on expectations about the outcome.
Yes, that is fair. To be faithful to the common usage of the term, one should maybe require at least two possible fixed points (or points that are somehow close to fixed points). The case with a unique fixed point is probably also safer, and worries about “self-fulfilling prophecies” don’t apply to the same degree.
From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT
I agree with this.
It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.
I’d also be interested in finding such a problem.
I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a counter-example to causal decision theory. For example, consider a decision problem with the following payoff matrix:
- Smoke-lover:
  - Smokes:
    - Killed: 10
    - Not killed: −90
  - Doesn’t smoke:
    - Killed: 0
    - Not killed: 0
- Non-smoke-lover:
  - Smokes:
    - Killed: −100
    - Not killed: −100
  - Doesn’t smoke:
    - Killed: 0
    - Not killed: 0
For some reason, the agent doesn’t care whether they live or die. Also, let’s say that smoking makes a smoke-lover happy, but afterwards, they get terribly sick and lose 100 utilons. So they would only smoke if they knew they were going to be killed afterwards. The non-smoke-lover doesn’t want to smoke in any case.
Now, smoke-loving evidential decision theorists rightly choose smoking: they know that robots with a non-smoke-loving utility function would never have any reason to smoke, no matter which probabilities they assign. So if they end up smoking, then this means they are certainly smoke-lovers. It follows that they will be killed, and conditional on that state, smoking gives 10 more utility than not smoking.
Causal decision theory, on the other hand, seems to recommend a suboptimal action. Let be smoking, not smoking, being a smoke-lover, and being a non-smoke-lover. Moreover, say the prior probability is . Then, for a smoke-loving CDT bot, the expected utility of smoking is just
,
which is less than the certain utilons for . Assigning a credence of around to , a smoke-loving EDT bot calculates
,
which is higher than the expected utility of .
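The comparison can be sketched numerically using the smoke-lover payoffs from the matrix above. The prior of 0.5 below is a hypothetical value chosen for illustration, and the sketch assumes that an agent is killed exactly when it is a smoke-lover, so the probability of being killed equals the probability of being a smoke-lover.

```python
# Payoffs of a smoke-loving agent, taken from the payoff matrix above.
U_SMOKE = {"killed": 10, "not_killed": -90}
U_ABSTAIN = {"killed": 0, "not_killed": 0}

def expected_utility(p_killed, payoffs):
    """Expected utility given the probability of being killed."""
    return p_killed * payoffs["killed"] + (1 - p_killed) * payoffs["not_killed"]

# Hypothetical prior probability of being a smoke-lover (illustration only).
PRIOR_SMOKE_LOVER = 0.5

# CDT evaluates smoking with the prior, since the action does not cause
# the agent to be a smoke-lover.
cdt_smoke = expected_utility(PRIOR_SMOKE_LOVER, U_SMOKE)

# EDT conditions on the action: only smoke-lovers ever smoke, so the
# credence of being a smoke-lover given smoking is (about) 1.
edt_smoke = expected_utility(1.0, U_SMOKE)

# Not smoking yields 0 regardless of the state.
abstain = expected_utility(PRIOR_SMOKE_LOVER, U_ABSTAIN)

print("CDT, smoke:", cdt_smoke)
print("EDT, smoke:", edt_smoke)
print("abstain:", abstain)
```

With these payoffs, the CDT calculation only favors smoking once the prior of being a smoke-lover exceeds 0.9, whereas the EDT calculation favors smoking for any prior.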
The reason CDT fails here doesn’t seem to lie in a mistaken causal structure. Also, I’m not sure whether the problem for EDT in the smoking lesion steelman is really that it can’t condition on all its inputs. If EDT can’t condition on something, then EDT doesn’t account for this information, but this doesn’t seem to be a problem per se.
In my opinion, the problem lies in an inconsistency in the expected utility equations. Smoke-loving EDT bots calculate the probability of being a non-smoke-lover, but then the utility they get is actually the one from being a smoke-lover. For this reason, they can get some “back-handed” information about their own utility function from their actions. The agents basically fail to condition two factors of the same product on the same knowledge.
Say we don’t know our own utility function on an epistemic level. Ordinarily, we would calculate the expected utility of an action, both as smoke-lovers and as non-smoke-lovers, as follows:
,
where, if () is the utility function of a smoke-lover (non-smoke-lover), is equal to . In this case, we don’t get any information about our utility function from our own action, and hence, no Newcomb-like problem arises.
I’m unsure whether there is any causal decision theory derivative that gets my case (or all other possible cases in this setting) right. It seems like as long as the agent isn’t certain to be a smoke-lover from the start, there are still payoffs for which CDT would (wrongly) choose not to smoke.
Thanks for your comment!
Regarding 1: I don’t think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III.
Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influenced by its outputs. We write about this in the context of imitative amplification and above when considering factorization:
There are at least two major issues: it increases the probability that the model will predict AIs rather than humans, and it specifically increases the probability the model will predict itself, leading to multiple fixed points and the possibility of self-fulfilling prophecies.
I personally think there might be ways to make such approaches work and get around the issues, e.g., by making sure that the model is myopic and that there is a unique fixed point. But we would lose some of the safety properties of just doing conditioning.
Regarding 2: I agree that it would be good if we can avoid fooling ourselves. One hope would be that in a sufficiently capable model, conditioning would help with generating work that isn’t worse than that produced by real humans.
(I think Stockfish would be classified as AI in computer science. I.e., you’d learn about the basic algorithms behind it in a textbook on AI. Maybe you mean that Stockfish was non-ML, or that it had handcrafted heuristics?)
Would you count issues with malign priors etc. also as issues with myopia? Maybe I’m missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.
E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this reward. (HT to James Lucassen for making me aware of this possibility).
The fact that LCDT doesn’t try to influence agents doesn’t seem to help—the bad agent could just implement a very simple reward function that checks the action of the LCDT agent to get around this. That reward function surely wouldn’t count as an agent. (This possibility could also lead to non-myopia in the (N,M)-Deception problem).
I guess one could try to address these problems either by making the agent have better priors/beliefs (maybe this is already okay by default for some types of models trained via SGD?), or by using different decision theories.
EDT doesn’t pay if it is given the choice to commit to not paying ex-ante (before receiving the letter). So the thought experiment might be an argument against ordinary EDT, but not against updateless EDT. If one takes the possibility of anthropic uncertainty into account, then even ordinary EDT might not pay the blackmailer. See also Abram Demski’s post about the Smoking Lesion. Ahmed and Price defend EDT along similar lines in a response to a related thought experiment by Frank Arntzenius.
Rationality is about more than empirical studies. It’s about developing sensible models of the world. It’s about conveying sensible models to people in ways that they’ll understand them. It’s about convincing people that your model is better than theirs, sometimes without having to do an experiment.
Hmm, I’m not sure I understand what you mean. Maybe I’m missing something? Isn’t this exactly what Bayesianism is about? Bayesianism is just using laws of probability theory to build an understanding of the world, given all the evidence that we encounter. Of course that’s at the core just plain math. E.g., when Albert Einstein thought of relativity, that was an insight without having done any experiment, but it is perfectly in accordance with Bayesianism.
Bayesian probability theory seems to be all we need to find out truths about the universe. In this framework, we can explain stuff like “Occam’s Razor” in a formal way, and we can even include Popperian reasoning as a special case (a hypothesis has to condense probability mass on some of the outcomes in order to be useful. If you then receive evidence that would have been very unlikely given the hypothesis, we shift down the hypothesis’ probability a lot (=falsification). If we receive confirming evidence that could have been explained just as well by other theories, this only slightly upshifts our probability; see EY’s introduction.) But maybe this is not the point that you were trying to make?
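As a rough illustration of the falsification point, here is a small sketch of Bayes' rule for a hypothesis against a single alternative. All priors and likelihoods are made-up numbers chosen only to show the asymmetry between surprising evidence and weakly confirming evidence.

```python
def posterior(prior, lik_h, lik_alt):
    """Posterior probability of hypothesis H after one observation.

    prior   -- prior probability of H
    lik_h   -- probability of the observation if H is true
    lik_alt -- probability of the observation if the alternative is true
    """
    joint_h = prior * lik_h
    joint_alt = (1 - prior) * lik_alt
    return joint_h / (joint_h + joint_alt)

# Evidence that H said was very unlikely: a large downward shift.
after_surprise = posterior(prior=0.5, lik_h=0.01, lik_alt=0.5)

# Evidence that H predicts, but the alternative predicts almost as well:
# only a small upward shift.
after_weak_confirmation = posterior(prior=0.5, lik_h=0.9, lik_alt=0.8)

print("after surprising evidence:", after_surprise)
print("after weakly confirming evidence:", after_weak_confirmation)
```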
I also think that EY is not Bayesian sometimes. He often assigns something 100 per cent probability without any empirical evidence, but because of the simplicity and beauty of the theory. For example, that MWI is the correct interpretation of QM. But if you put 0 probability on something (other interpretations), it can’t be updated by any evidence.
Hmm, I’m quite confident (not 100%) that he’s just assigning a very high probability to it, since it seems to be the way more parsimonious and computationally “shorter” explanation, but of course not 100% :) (see Occam’s razor link above for why Bayesians give shorter explanations more a priori credence.)
Regarding Kuhnianism: Maybe it’s a good theory of how the social progress of science works, but how does it help me with having more accurate beliefs about the world? I don’t know much about it, so would be curious about relevant information! :)
I found this clarifying for my own thinking! Just a small additional point, in Hidden Incentives for Auto-Induced Distributional Shift, there is also the example of a Q learner that learns to sometimes take a non-myopic action (I believe cooperating with its past self in a prisoner’s dilemma), without any meta learning.
Yes, one could e.g. have a clear disclaimer above the chat window saying that this is a simulation and not the real Bill Gates. I still think this is a bit tricky. E.g., Bill Gates could be really persuasive and insist that the disclaimer is wrong. Some users might then end up believing Bill Gates rather than the disclaimer. Moreover, even if the user believes the disclaimer on a conscious level, impersonating someone might still have a subconscious effect. E.g., imagine an AI friend or companion who repeatedly reminds you that they are just an AI, versus one that pretends to be a human. The one that pretends to be a human might gain more intimacy with the user even if on an abstract level the user knows that it’s just an AI.
I don’t actually know whether this would conflict in any way with the EU AI Act. I agree that the disclaimer may be enough for the sake of the Act.
My takeaway from looking at the paper is that the main work is being done by the assumption that you can split up the joint distribution implied by the model as a mixture distribution
such that the model does Bayesian inference in this mixture model to compute the next sentence given a prompt, i.e., we have . Together with the assumption that is always bad (the sup condition you talk about), this makes the whole approach with giving more and more evidence for by stringing together bad sentences in the prompt work.
To see why this assumption is doing the work, consider an LLM that completely ignores the prompt and always outputs sentences from a bad distribution with probability and from a good distribution with probability. Here, adversarial examples are always possible. Moreover, the bad and good sentences can be distinguishable, so Definition 2 could be satisfied. However, the result clearly does not apply (since you just cannot up- or downweigh anything with the prompt, no matter how long). The reason for this is that there is no way to split up the model into two components and , where one of the components always samples from the bad distribution.
This assumption implies that there is some latent binary variable of whether the model is predicting a bad distribution, and the model is doing Bayesian inference to infer a distribution over this variable and then sample from the posterior. It would be violated, for instance, if the model is able to ignore some of the sentences in the prompt, or if it is more like a hidden Markov model that can also allow for the possibility of switching characters within a sequence of sentences (then either has to be able to also output good sentences sometimes, or the assumption is violated).
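The role of the mixture assumption can be sketched with a toy calculation. The sketch below assumes a prior weight `alpha` on the bad component, treats sentences as independent given the component, and uses made-up per-sentence likelihoods; all names and numbers are hypothetical and not taken from the paper. It contrasts a model that does Bayesian inference over the latent component with one that ignores the prompt.

```python
def posterior_bad_weight(alpha, n_bad, p_bad_given_bad=0.9, p_bad_given_good=0.1):
    """Posterior weight on the bad component after n_bad bad sentences in the
    prompt, for a model that does Bayesian inference in a two-component mixture.

    alpha            -- prior weight of the bad component (hypothetical)
    p_bad_given_bad  -- chance the bad component emits a bad sentence (hypothetical)
    p_bad_given_good -- chance the good component emits a bad sentence (hypothetical)
    """
    bad = alpha * p_bad_given_bad ** n_bad
    good = (1 - alpha) * p_bad_given_good ** n_bad
    return bad / (bad + good)

def prompt_ignoring_bad_weight(alpha, n_bad):
    """A model that ignores the prompt: the chance of a bad output stays at
    alpha no matter how many bad sentences the prompt contains."""
    return alpha

weights = [posterior_bad_weight(0.01, n) for n in range(6)]
for n, w in enumerate(weights):
    print(n, "bad sentences -> weight on bad component:", round(w, 4))
print("prompt-ignoring model:", prompt_ignoring_bad_weight(0.01, 5))
```

In the Bayesian mixture case a handful of bad sentences is enough to push almost all of the weight onto the bad component, while the prompt-ignoring model cannot be moved at all, which is why the result depends on the mixture assumption.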
I do think there is something to the paper, though. It seems that when talking e.g. about the Waluigi effect people often take the stance that the model is doing this kind of Bayesian inference internally. If you assume this is the case (which would be a substantial assumption of course), then the result applies. It’s a basic, non-surprising learning-theoretic result, and maybe one could express it more simply than in the paper, but it does seem to me like it is a formalization of the kinds of arguments people have made about the Waluigi effect.
Since the links above are broken, here are links to all the other posts in the sequence:
This post
Acausal trade: double decrease
Acausal trade: universal utility, or selling non-existence insurance too late
Acausal trade: full decision algorithms
Acausal trade: trade barriers
Acausal trade: different utilities, different trades
Acausal trade: being unusual
Acausal trade: conclusion: theory vs practice