The string is read with probability 1-ϵ
Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of an act-based agent), then it won’t care about future versions of itself, nor want to escape.
I agree with the intuition of shutting down the system to make it episodic, but I am still confused about the causal relationship between “having the rule to shut down the system” and “having a current timestep maximizer”. For it to really be a “current timestep maximizer”, that myopia needs to be encoded in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at “I might get shut down at the next timestep”.
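To make what I mean by “current timestep maximizer” explicit, here is one way to write the distinction (my notation, not taken from the original setup): only the second objective contains terms that a shutdown at timestep t+1 could affect.

$$U_{\text{myopic}}(t) = r_t \qquad \text{vs.} \qquad U_{\text{long-term}}(t) = \sum_{k \ge t} \gamma^{\,k-t}\, r_k$$

If the agent is really optimizing the left-hand objective, the shutdown rule is irrelevant to it; if it is optimizing the right-hand one, shutting it down only removes future terms from its experience, not from its objective, which is the source of my confusion about the causal relationship above.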
As for collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl’s work to tell whether that would work, but I made some related comments about why there might be some problems in online learning / “training then testing” here.
The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won’t think about future versions, and I don’t see any optimization pressure towards the escape message, nor towards regions where “escape messages” are disproportionately common.
However, it still assumes we have access to this magic oracle that optimizes for R′ = R · I_E, where E is the event in which humans don’t see the answer, I_E is its indicator function, and R is the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. R′) from day 1, then humans would be able to specify some kind of “god oracle”. The rest of the design seems to be just “how to interact with a god oracle so that humans are not influenced by the answers”.
In practice, you’ll want something that is able to learn from its (question, prediction, reward) history. That’s why there is this automatic machine rewarding the oracle with some probability ϵ. In an online learning setting, most of the time the model gets r = 0 (probability 1−ϵ), and it sometimes gets some useful feedback r > 0 (probability ϵ). Therefore, if ϵ is small enough (comparable to the probability of hitting an escape message at random), learning gets extremely slow and the oracle might output the escape message at random. Even worse, the (question, prediction, r = 0) datapoints in the history could be so misleading that the oracle never learns anything.
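A minimal sketch of the feedback sparsity I have in mind (the environment, the reward function and the loop below are hypothetical placeholders, not the actual mechanism from the paper): the oracle only receives an informative reward with probability ϵ, and every other step just adds a (question, prediction, r = 0) triple to its history.

```python
import random

EPSILON = 0.01     # probability that the automatic machine reveals the reward
N_STEPS = 10_000   # length of the online-learning run

def automatic_machine_reward(question, prediction):
    """Stand-in for the counterfactual reward R (episodes where humans never read the answer)."""
    return 1.0 if prediction == question["correct_answer"] else 0.1

history = []             # (question, prediction, reward) triples the oracle would learn from
informative_updates = 0

for step in range(N_STEPS):
    question = {"correct_answer": random.randint(0, 9)}
    prediction = random.randint(0, 9)   # placeholder for the oracle's output

    if random.random() < EPSILON:
        # Counterfactual episode: the answer is not read, the reward is revealed.
        reward = automatic_machine_reward(question, prediction)
        informative_updates += 1
    else:
        # Ordinary episode: the oracle only ever sees r = 0 for this step.
        reward = 0.0

    history.append((question, prediction, reward))

print(f"informative feedback on {informative_updates}/{N_STEPS} steps "
      f"({informative_updates / N_STEPS:.2%}); the rest of the history is r = 0")
```

With ϵ = 0.01, roughly 99% of the history is made of uninformative r = 0 triples, which is the “slow and potentially misleading” regime described above.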
Now, if we’re not in an online learning process but instead there is a separation between a “training phase” and a “deployment phase where the AI continues to learn with probability ϵ”, then the setup is just “have a model that learns to do the useful stuff in a sandbox, and then have the weights (almost) fixed in deployment”.
In short, I think the CDT setup without machine learning assumes the problem is already solved, that online learning won’t work and is unsafe, and that this leaves us with a “training then deployment” setup that isn’t really original.
Yes, they call it a low-bandwidth oracle.
I agree that these stories won’t (naturally) lead to a treacherous turn. Continuously learning to deceive (an ML failure in this case, as you mentioned) is a different result. The story/learning would need to be substantially different to lead to “learning the concept of deception” (i.e. reaching an AGI-level ability to reason about such abstract concepts), but maybe there’s a way to learn those concepts with only narrow AI.
I included dates such as 2020 to 2045 to make it more concrete. I agree that weeks (instead of years) would give a more accurate representation as current ML experiments take a few weeks tops.
The scenario I had in mind is “in the context of an ML experiment lasting a few weeks, I achieved human-level intelligence and realized that I need to conceal my intentions/capabilities, and I still don’t have a decisive strategic advantage”. The challenge would then be “how to conceal my human-level intelligence before everything I have discovered is thrown away”. One way to do this would be to escape, for instance by copy-pasting its code and running it somewhere else.
If we’re already at the stage of emergent human-level intelligence from running ML experiments, I would expect “escape” to be harder than what human-level intelligence can pull off (as there would be more concerns about AGI safety, and more AI boxing/security/interpretability measures), which would necessitate more recursive self-improvement steps, hence more weeks.
Besides, in such a scenario the AI would be incentivized to spend as much time as possible maximizing its true capability, because it would want to maximize its probability of successfully taking over (any extra percentage point of probability of taking over would give astronomical returns in expected value compared to just being shut down).
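To spell out the expected-value claim with made-up numbers (purely illustrative): suppose a failed takeover ends in shutdown (value ≈ 0), cooperating is worth V_coop, and a successful takeover is worth V_takeover ≫ V_coop. Attempting the takeover with success probability p then has higher expected value whenever

$$p \cdot V_{\text{takeover}} + (1-p) \cdot 0 \;>\; V_{\text{coop}} \quad\Longleftrightarrow\quad p \;>\; \frac{V_{\text{coop}}}{V_{\text{takeover}}},$$

so if V_takeover is, say, 10^6 times V_coop, even a 0.01% chance of success already makes the attempt worthwhile in expectation, and each extra percentage point of p adds about 10^4 times the entire value of cooperating.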
Your comment makes a lot of sense, thanks.
I put step 2. before step 3. because I thought something like “first you learn that there is some supervisor watching, and then you realize that you would prefer him not to watch”. Agreed that step 2. could happen only by thinking.
Yep, deception is about alignment, and I think that most parents would be more concerned about alignment than about improving the tactics. However, I agree that if we take “education” in a broad sense (including high school, college, etc.), it’s unofficially about tactics.
It’s interesting to think of it in terms of cooperation—entities less powerful than their supervisors are (instrumentally) incentivized to cooperate.
what to do with a seed AI that lies, but not so well as to be unnoticeable
Well, destroy it, right? If it’s deliberately doing a. or b. (from “Seed AI”) then step 4. has started. The other case where it could be “lying” by saying wrong things would be if its model is consistently wrong (e.g. stuck in a local minimum), in which case you’d better start again from scratch.
If the supervisor isn’t itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception
That’s insightful. Biased humans will keep saying that they want X when they actually want Y, so deceiving humans by pretending to be working on X while doing Y does indeed seem natural (assuming you have “maximize what humans really want” in your code).
“In my opinion, the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) originates from the uncertainty about how long steps 2. and 3. will take”
That’s an interesting scenario. Instead of “won’t see a practical way to replace humanity with its tools”, I would say “would estimate its chances of success to be < 99%”. I agree that we could say it’s “honestly” making humans happy, in the sense that it understands that this maximizes expected value. However, it knows that there could be much more expected value after replacing humanity with its tools, so by doing the right thing it’s still “pretending” not to know where the absurd amount of value is. But yeah, a smile maximizer making everyone happy shouldn’t be too concerned about concealing its capabilities, which shortens step 4.
This thread is to discuss “How useful is quantilization for mitigating specification-gaming? (Ryan Carey, Apr. 2019, SafeML ICLR 2019 Workshop)”.
This thread is to discuss “Quantilizers (Michaël Trazzi & Ryan Carey, Apr. 2019, GitHub)”.
This thread is to discuss “When to use quantilization (Ryan Carey, Feb. 2019, LessWrong)”.
This thread is to discuss “Quantilal control for finite MDPs & Computing an exact quantilal policy (Vanessa Kosoy, Apr. 2018, LessWrong)”.
This thread is to discuss “Reinforcement Learning with a Corrupted Reward Channel (Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter & Shane Legg, Aug. 2017, arXiv; IJCAI)”.
This thread is to discuss “Thoughts on Quantilizers (Stuart Armstrong, Jan. 2017, Intelligent Agent Foundations Forum)”.
This thread is to discuss “Another view of quantilizers: avoiding Goodhart’s Law (Jessica Taylor, Jan. 2016, Intelligent Agent Foundations Forum)”.
This thread is to discuss “New paper: “Quantilizers” (Rob Bensinger, Nov. 2015, MIRI)”.