New(ish) AI control ideas
The list of posts is getting unwieldy, so I’ll post the up-to-date stuff at the beginning:
Human inconsistencies:
Reward function learning:
Understanding humans:
Framework:
Acausal trade:
Oracle designs:
Extracting human values:
Corrigibility:
Indifference:
AIs in virtual worlds:
True answers from AI:
Miscellanea:
Migrating my old post over from Less Wrong.
I recently went on a two day intense solitary “AI control retreat”, with the aim of generating new ideas for making safe AI. The “retreat” format wasn’t really a success (“focused uninterrupted thought” was the main gain, not “two days of solitude”—it would have been more effective in three hour sessions), but I did manage to generate a lot of new ideas. These ideas will now go before the baying bloodthirsty audience (that’s you, folks) to test them for viability.
A central thread running through these ideas could be: if you want something, you have to define it, then code it, rather than assuming you can get it for free through some other approach.
To provide inspiration and direction to my thought process, I first listed all the easy responses that we generally give to most proposals for AI control. If someone comes up with a new/old brilliant idea for AI control, it can normally be dismissed by appealing to one of these responses:
1. The AI is much smarter than us.
2. It’s not well defined.
3. The setup can be hacked.
   - By the agent.
   - By outsiders, including other AI.
   - Adding restrictions encourages the AI to hack them, not obey them.
4. The agent will resist changes.
5. Humans can be manipulated, hacked, or seduced.
6. The design is not stable.
   - Under self-modification.
   - Under subagent creation.
   - Unrestricted search is dangerous.
7. The agent has, or will develop, dangerous goals.
Important background ideas:
I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections. A key concept is that we should never just expect a system to behave “nicely” by default (see eg here). If we wanted that, we should define what “nicely” is, and put that in by hand.
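As a concrete (and entirely hypothetical) illustration of defining “nicely” and putting it in by hand, here is a minimal Python sketch: the toy `State`, the reward functions, and the numbers are invented for this example, not taken from any of the posts above. The only point is that the “nice” part has to appear explicitly in the objective; it is not supplied for free.

```python
# Toy illustration (hypothetical names throughout): "nice" behaviour must be
# written into the objective rather than assumed by default.
from dataclasses import dataclass


@dataclass
class State:
    rooms_cleaned: int   # the stated task
    vases_broken: int    # a stand-in for any unwanted side effect


def naive_reward(s: State) -> float:
    # Rewards only the stated task; "don't break things" is assumed for free.
    return float(s.rooms_cleaned)


def explicit_reward(s: State) -> float:
    # "Niceness" is defined and coded in by hand, as an explicit penalty term.
    return float(s.rooms_cleaned) - 10.0 * s.vases_broken


if __name__ == "__main__":
    careless = State(rooms_cleaned=5, vases_broken=3)
    careful = State(rooms_cleaned=4, vases_broken=0)
    print(naive_reward(careless), naive_reward(careful))        # 5.0 4.0
    print(explicit_reward(careless), explicit_reward(careful))  # -25.0 4.0
```

Under the naive objective the careless plan scores higher; only the objective that spells out the “nice” part prefers the careful one.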
I came up with sixteen main ideas, of varying usefulness and quality, which I will be posting over the coming weekdays in comments (the following links will go live after each post). The ones I feel are most important (or most developed) are:
Anti-restriction-hacking (EDIT: I have big doubts about this approach, currently)
Creating a satisficer (EDIT: I have big doubts about this approach, currently)
While the less important or developed ideas are:
Added: Acausal trade barriers
Please let me know your impressions of any of these! The ideas are roughly related to each other as follows (where an arrow Y→X can mean “X depends on Y”, “Y is useful for X”, “X complements Y on this problem”, or even “Y inspires X”):
EDIT: I’ve decided to use this post as a sort of central repository of my new ideas on AI control. So adding the following links:
Short tricks:
High-impact from low impact:
The president didn’t die: failures at extending AI behaviour
Presidents, asteroids, natural categories, and reduced impact
High impact from low impact, best advice:
Overall meta-thoughts:
Pareto-improvements to corrigible agents:
AIs in virtual worlds:
Low importance AIs:
Wireheading:
AI honesty and testing:
Goal completion:
- What AI Safety Researchers Have Written About the Nature of Human Values (16 Jan 2019 13:59 UTC; 52 points)
- JFK was not assassinated: prior probability zero events (27 Apr 2016 11:47 UTC; 38 points)
- New(ish) AI control ideas (5 Mar 2015 17:03 UTC; 34 points)
- An overall schema for the friendly AI problems: self-referential convergence criteria (13 Jul 2015 15:34 UTC; 26 points)
- Acausal trade barriers (11 Mar 2015 13:40 UTC; 23 points)
- False thermodynamic miracles (5 Mar 2015 17:04 UTC; 19 points)
- The president didn’t die: failures at extending AI behaviour (10 Jun 2015 16:00 UTC; 17 points)
- Crude measures (27 Mar 2015 15:44 UTC; 16 points)
- Green Emeralds, Grue Diamonds (6 Jul 2015 11:27 UTC; 14 points)
- AI, cure this fake person’s fake cancer! (24 Aug 2015 16:42 UTC; 14 points)
- Indifferent vs false-friendly AIs (24 Mar 2015 12:13 UTC; 14 points)
- Extending the stated objectives (13 Jan 2016 16:20 UTC; 13 points)
- A counterfactual and hypothetical note on AI safety design (11 Mar 2015 16:20 UTC; 13 points)
- The subagent problem is really hard (18 Sep 2015 13:06 UTC; 13 points)
- Resource gathering and pre-corriged agents (10 Mar 2015 11:47 UTC; 13 points)
- Detecting agents and subagents (10 Mar 2015 17:56 UTC; 13 points)
- Un-optimised vs anti-optimised (14 Apr 2015 18:30 UTC; 12 points)
- Counterfactually uninfluenceable agents (2 Jun 2017 16:17 UTC; 11 points)
- Values at compile time (26 Mar 2015 12:25 UTC; 11 points)
- Grue, Bleen, and natural categories (6 Jul 2015 13:47 UTC; 11 points)
- High impact from low impact (17 Apr 2015 16:01 UTC; 11 points)
- The Ultimate Testing Grounds (28 Oct 2015 17:08 UTC; 11 points)
- The AI, the best human advisor (13 Jul 2015 15:33 UTC; 11 points)
- Acausal trade: double decrease (2 Jun 2017 15:33 UTC; 10 points)
- Defining a limited satisficer (11 Mar 2015 14:23 UTC; 10 points)
- Goal completion: noise, errors, bias, prejudice, preference and complexity (18 Feb 2016 14:37 UTC; 10 points)
- Heroin model: AI “manipulates” “unmanipulatable” reward (22 Sep 2016 10:27 UTC; 10 points)
- Models as definitions (25 Mar 2015 17:46 UTC; 10 points)
- The virtual AI within its virtual world (24 Aug 2015 16:42 UTC; 10 points)
- Superintelligence and wireheading (23 Oct 2015 14:49 UTC; 10 points)
- Assessors that are hard to seduce (9 Mar 2015 14:19 UTC; 9 points)
- Restrictions that are hard to hack (9 Mar 2015 13:52 UTC; 9 points)
- Divergent preferences and meta-preferences (2 Jun 2017 15:51 UTC; 9 points)
- Goal completion: the rocket equations (20 Jan 2016 13:54 UTC; 9 points)
- Tackling the subagent problem: preliminary analysis (12 Jan 2016 12:26 UTC; 9 points)
- Double Corrigibility: better Corrigibility (28 Apr 2016 14:46 UTC; 9 points)
- What I mean... (26 Mar 2015 11:59 UTC; 9 points)
- Chatbots or set answers, not WBEs (8 Sep 2015 17:17 UTC; 8 points)
- AI utility-based correlation (30 Oct 2015 14:53 UTC; 8 points)
- Ask and ye shall be answered (18 Sep 2015 21:53 UTC; 8 points)
- Corrigibility through stratified indifference (19 Aug 2016 16:11 UTC; 8 points)
- Predicted corrigibility: pareto improvements (18 Aug 2015 11:02 UTC; 8 points)
- Creating a satisficer (11 Mar 2015 15:03 UTC; 8 points)
- Goal completion: algorithm ideas (25 Jan 2016 17:36 UTC; 8 points)
- An Oracle standard trick (3 Jun 2015 14:17 UTC; 7 points)
- Intelligence modules (23 Mar 2015 16:24 UTC; 7 points)
- True answers from AI: Summary (10 Mar 2016 15:56 UTC; 7 points)
- Anti-Pascaline agent (12 Mar 2015 14:17 UTC; 7 points)
- Anti-Pascaline satisficer (14 Apr 2015 18:49 UTC; 6 points)
- Utility vs Probability: idea synthesis (27 Mar 2015 12:30 UTC; 6 points)
- Continually-adjusted discounted preferences (6 Mar 2015 16:03 UTC; 6 points)
- Closest stable alternative preferences (20 Mar 2015 12:41 UTC; 6 points)
- Counterfactual do-what-I-mean (27 Oct 2016 13:54 UTC; 5 points)
- One weird trick to turn maximisers into minimisers (22 Apr 2016 16:47 UTC; 5 points)
- Consistent Plato (20 Mar 2015 18:06 UTC; 5 points)
- Humans get different counterfactuals (23 Mar 2015 14:54 UTC; 4 points)
- The overfitting utility problem for value learning AIs (12 Jun 2016 23:25 UTC; 4 points)
- Guarded learning (23 May 2017 16:53 UTC; 4 points)
- Learning (meta-)preferences (27 Jul 2016 14:43 UTC; 4 points)
- An algorithm with preferences: from zero to one variable (2 Jun 2017 16:35 UTC; 4 points)
- High impact from low impact, continued (28 Apr 2015 12:58 UTC; 4 points)
- Virtual models of virtual AIs in virtual worlds (11 Mar 2016 9:41 UTC; 3 points)
- Uninfluenceable learning agents (2 Jun 2017 16:30 UTC; 3 points)
- Humans can be assigned any values whatsoever... (24 Oct 2017 12:03 UTC; 3 points)
- How the virtual AI controls itself (9 Sep 2015 14:25 UTC; 3 points)
- Presidents, asteroids, natural categories, and reduced impact (6 Jul 2015 17:44 UTC; 3 points)
- Utility, probability and false beliefs (9 Nov 2015 21:43 UTC; 3 points)
- Goal completion: algorithm ideas (26 Jan 2016 10:01 UTC; 2 points)
- Double indifference is better indifference (4 May 2016 14:16 UTC; 2 points)
- Learning values versus indifference (24 May 2017 8:20 UTC; 2 points)
- Simpler, cruder, virtual world AIs (26 Jun 2016 15:44 UTC; 2 points)
- What does an imperfect agent want? (27 Jul 2016 14:03 UTC; 2 points)
- Simplified explanation of stratification (23 May 2017 16:37 UTC; 2 points)
- Counterfactuals on POMDP (2 Jun 2017 16:30 UTC; 2 points)
- Corrigibility thoughts I: caring about multiple things (2 Jun 2017 16:27 UTC; 2 points)
- Thoughts on Quantilizers (2 Jun 2017 16:24 UTC; 2 points)
- All the indifference designs (2 Jun 2017 16:20 UTC; 2 points)
- AI safety: three human problems and one AI issue (2 Jun 2017 16:12 UTC; 2 points)
- Conservation of Expected Ethics isn’t enough (15 Jun 2016 18:08 UTC; 1 point)
- (C)IRL is not solely a learning process (15 Sep 2016 8:35 UTC; 1 point)
- Rigged reward learning (16 Mar 2018 15:39 UTC; 1 point)
- Emergency learning (2 Jun 2017 16:23 UTC; 1 point)
- Humans as a truth channel (2 Jun 2017 16:22 UTC; 1 point)
- Low impact versus low side effects (2 Jun 2017 16:14 UTC; 1 point)
- Acausal trade: different utilities, different trades (2 Jun 2017 15:33 UTC; 1 point)
- Acausal trade: universal utility, or selling non-existence insurance too late (2 Jun 2017 15:33 UTC; 1 point)
- “Like this world, but...” (14 Jul 2017 20:40 UTC; 1 point)
- Resolving human inconsistency in a simple model (4 Oct 2017 15:02 UTC; 1 point)
- Help needed: nice AIs and presidential deaths (8 Jun 2015 16:47 UTC; 1 point)
- Goal completion: the rocket equations (20 Jan 2016 14:10 UTC; 0 points)
- Goal completion: noise, errors, bias, prejudice, preference and complexity (24 May 2017 8:30 UTC; 0 points)
- True answers from AI (31 Mar 2016 15:00 UTC; 0 points)
- True answers from AI: Summary (10 Mar 2016 16:31 UTC; 0 points)
- True answers from AI: “Miraculous” hypotheses (10 Mar 2016 15:09 UTC; 0 points)
- AI printing the utility value it’s maximising (24 May 2017 9:08 UTC; 0 points)
- Convexity and truth-seeking (22 May 2017 18:12 UTC; 0 points)
- One weird trick to turn maximisers into minimisers (22 Apr 2016 16:45 UTC; 0 points)
- JFK was not assassinated: prior probability zero events (27 Apr 2016 12:50 UTC; 0 points)
- Corrigibility for AIXI via double indifference (4 May 2016 14:00 UTC; 0 points)
- AIs in virtual worlds: discounted mixed utility/reward (17 Jun 2016 6:43 UTC; 0 points)
- The alternate hypothesis for AIs in virtual worlds (24 May 2017 8:14 UTC; 0 points)
- Confirmed Selective Oracle (10 Jun 2016 23:43 UTC; 0 points)
- Indifference utility functions (11 Jun 2016 0:20 UTC; 0 points)
- Learning desiderata (16 Jun 2016 5:30 UTC; 0 points)
- When the AI closes a door, it opens a window (24 May 2017 8:49 UTC; 0 points)
- Abstract model of human bias (6 Jul 2016 11:08 UTC; 0 points)
- Corrigibility through stratified indifference and learning (23 May 2017 16:33 UTC; 0 points)
- The non-indifferent behaviour of stratified indifference? (22 Aug 2016 13:51 UTC; 0 points)
- Stratified learning and action (15 Sep 2016 8:59 UTC; 0 points)
- Heroin model: AI “manipulates” “unmanipulatable” reward (21 Sep 2016 18:10 UTC; 0 points)
- Model of human (ir)rationality (26 Sep 2016 11:20 UTC; 0 points)
- Learning doesn’t solve philosophy of ethics (26 Sep 2016 12:11 UTC; 0 points)
- Counterfactual do-what-I-mean (27 Oct 2016 13:53 UTC; 0 points)
- Reward/value learning for reinforcement learning (2 Jun 2017 16:34 UTC; 0 points)
- The best value indifference method (so far) (2 Jun 2017 16:33 UTC; 0 points)
- How to judge moral learning failure (2 Jun 2017 16:32 UTC; 0 points)
- Ontology, lost purposes, and instrumental goals (2 Jun 2017 16:28 UTC; 0 points)
- Corrigibility thoughts II: the robot operator (2 Jun 2017 16:27 UTC; 0 points)
- Corrigibility thoughts III: manipulating versus deceiving (2 Jun 2017 16:27 UTC; 0 points)
- The radioactive burrito and learning from positive examples (2 Jun 2017 16:25 UTC; 0 points)
- Indifference and compensatory rewards (2 Jun 2017 16:19 UTC; 0 points)
- Translation “counterfactual” (2 Jun 2017 16:16 UTC; 0 points)
- Understanding the important facts (2 Jun 2017 16:15 UTC; 0 points)
- Agents that don’t become maximisers (2 Jun 2017 16:13 UTC; 0 points)
- Acausal trade: trade barriers (2 Jun 2017 15:32 UTC; 0 points)
- Optimisation in manipulating humans: engineered fanatics vs yes-men (2 Jun 2017 15:51 UTC; 0 points)
- New circumstances, new values? (6 Jun 2017 8:18 UTC; 0 points)
- Humans are not agents: short vs long term (27 Jun 2017 13:04 UTC; 0 points)
- Rationality and overriding human preferences: a combined model (20 Oct 2017 22:47 UTC; 0 points)
- Learning values, or defining them? (6 Nov 2017 10:59 UTC; 0 points)
- Bias in rationality is much worse than noise (6 Nov 2017 11:08 UTC; 0 points)
- Normative assumptions: regret (6 Nov 2017 10:59 UTC; 0 points)
- Our values are underdefined, changeable, and manipulable (6 Nov 2017 10:59 UTC; 0 points)
- Reward learning summary (28 Nov 2017 15:55 UTC; 0 points)
- Kolmogorov complexity makes reward learning worse (6 Nov 2017 20:08 UTC; 0 points)
- Rationalising humans: another mugging, but not Pascal’s (15 Nov 2017 12:07 UTC; 0 points)
- Stable agent, subagent-unstable (28 Nov 2017 16:05 UTC; 0 points)
Thanks! I love having central repos.
A quick question / comment, RE: “I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections.”
Q: What do you mean (or have in mind) in terms of “turning [...] objections”? I’m not very familiar with the phrase.
Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you’ve given. Recently I’ve been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.