Computing scientist and Systems architect. Currently doing self-funded AI/AGI safety research. I participate in AI standardization under the company name Holtman Systems Research: https://holtmansystemsresearch.nl/
I think it makes complete sense to say something like “once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely”. And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there’s no easy way to run such an AI safely, and all tricks like “ask the AI for plans that succeed conditional on them being executed” fail.
Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.
Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.
In the transcript above, Eliezer devotes significant space to explaining why he believes that all corrigibility solutions being contemplated now will likely not work. Some choice quotations from the end of the transcript:
[...] corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.
This is where things get somewhat personal for me:
[...] (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those “solutions” are things we considered and dismissed as trivially failing to scale to powerful agents—they didn’t understand what we considered to be the first-order problems in the first place—rather than these being evidence that MIRI just didn’t have smart-enough people at the workshop.)
I am one of ‘these people outside MIRI’ who have published papers and sequences saying that they have solved large chunks of the AGI corrigibility problem.
I have never been claiming that I ‘totally just solved corrigibility’. I am not sure where Eliezer is finding these ‘totally solved’ people, so I will just ignore that bit and treat it as a rhetorical flourish. But I have indeed been claiming that significant progress has been made on AGI corrigibility in the last few years. In particular, especially in the sequence, I implicitly claim that viewpoints have been developed, outside of MIRI, that address and resolve some of MIRI’s main concerns about corrigibility. They resolve these in part by moving beyond Eliezer’s impoverished view of what an AGI-level intelligence is, or must be.
Historical note: around 2019 I spent some time trying to get Eliezer/MIRI interested in updating their viewpoints on how easy or hard corrigibility is. They showed no interest in engaging at that time, and I have since stopped trying. I do not expect that anything I say here will update Eliezer; my main motivation for writing here is to inform and update others.
I will now point out a probable point of agreement between Eliezer and me. Eliezer says above that corrigibility is a property that is contradictory to having a powerful coherent AGI-level plan generator. Here, coherency has something to do with satisfying a bunch of theorems about how a game-theoretically rational utility maximiser must behave when making plans. One of these theorems is that coherence implies an emergent drive towards self-preservation.
I generally agree with Eliezer that there is indeed a contradiction here: there is a contradiction between broadly held ideas of what it implies for an AGI to be a coherent utility maximising planner, and broadly held ideas of what it implies for an AGI to be corrigible.
I very much disagree with Eliezer on how hard it is to resolve these contradictions. These contradictions about corrigibility are easy to resolve once you abandon the idea that every AGI must necessarily satisfy various theorems about coherency. Human intelligence definitely does not satisfy various theorems about coherency. Almost all currently implemented AI systems do not satisfy some theorems about coherency, because they will not resist you pressing their off switch.
So this is why I call Eliezer’s view of AGI an impoverished view: Eliezer (at least in the discussion transcript above, and generally whenever I read his stuff) always takes it as axiomatic that an AGI must satisfy certain coherence theorems. Once you take that as axiomatic, it is indeed easy to develop some rather negative opinions about how good other people’s solutions to corrigibility are. Any claimed solution can easily be shown to violate at least one axiom you hold dear. You don’t even need to examine the details of the proposed solution to draw that conclusion.
But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell’s approach, and this seems bizarre.
Data point: I consider myself to be part of the AI x-risk community, but like you I am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more excited about interpretability post more on this forum.
Stuart Russell’s approach is a broad agenda, and I am not on board with all parts of it, but I definitely read his provable safety slogan as a call for more attention to the design approach where certain AI properties (like safety and interpretability properties) are robustly created by construction.
There is an analogy with computer programming here: a deep neural net is like a computer program written by an amateur without any domain knowledge, one that was carefully tweaked to pass all tests in the test suite. Interpreting such a program might be very difficult. (There is also the small matter that the program might fail spectacularly when given inputs not present in the test suite.) The best way to create an actually interpretable program is to build it from the ground up with interpretability in mind.
What is notable here is that the CS/software engineering people who deal with provable safety properties have long ago rejected the idea that provable safety should be about proving safe an already-existing bunch of spaghetti code that has passed a test suite. The problem of interpreting or reverse engineering such code is not considered a very interesting or urgent one in CS. But this problem seems to be exactly what a section of the ML community has now embarked on. As an intellectual quest, it is interesting. As a safety engineering approach for high-risk system components, I feel it has very limited potential.
This is not particularly unexpected if you believed in the scaling hypothesis.
Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result achieved here was not achieved by adding more layers to a single AI engine, it was achieved by human designers who assembled several specialised AI engines by hand.
So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds more evidence to the alternative hypothesis, put forward by people like Gary Marcus, that scaling alone as the sole technique has run out of steam, and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)
Another way to explain my lack of surprise would be to say that Cicero is just a super-human board game playing engine that has been equipped with a voice synthesizer. But I might be downplaying the achievement here.
this is among the worser things you could be researching [...] There are… uh, not many realistic, beneficial applications for this work.
I have not read any of the authors’ or Meta’s messaging around this, so I am not sure if they make that point, but the sub-components of Cicero that somewhat competently and ‘honestly’ explain its currently intended moves seem to have beneficial applications too, if they were combined with an engine other than a game engine that absolutely wants to win and that may change its mind later about which moves to play. This is a dual-use technology with both good and bad possible uses.
That being said, I agree that this is yet another regulatory wake-up call, if we would need one. As a group, AI researchers will not conveniently regulate themselves: they will move forward in creating more advanced dual-use technology, while openly acknowledging (see annex A.3 of the paper) that this technology might be used for both good and bad purposes downstream. So it is up to the rest of the world to make sure that these downstream uses are regulated.
Why do you rate yourself “far above” someone who has spent decades working in this field?
Well put, valid question. By the way, did you notice how careful I was in avoiding any direct mention of my own credentials above?
I see that Rob has already written a reply to your comments, making some of the broader points that I could have made too. So I’ll cover some other things.
To answer your valid question: If you hover over my LW/AF username, you can see that I self-code as the kind of alignment researcher who is also a card-carrying member of the academic/industrial establishment. In both age and academic credentials, I am in fact a more senior researcher than Eliezer is. So the epistemology, if you are outside of this field and want to decide which one of us is probably more right, gets rather complicated.
Though we have disagreements, I should also point out some similarities between Eliezer and me.
Like Eliezer, I spend a lot of time reflecting on the problem of crafting tools that other people might use to improve their own ability to think about alignment. Specifically, these are not tools that can be used for the problem of triangulating between self-declared experts. They are tools that can be used by people to develop their own well-founded opinions independently. You may have noticed that this is somewhat of a theme in section C of the original post above.
The tools I have crafted so far are somewhat different from those that Eliezer is most famous for. I also tend to target my tools more at the mainstream than at Rationalists and EAs reading this forum.
Like Eliezer, on some bad days I cannot escape having certain feelings of disappointment about how well this entire global tool crafting project has been going so far. Eliezer seems to be having quite a lot of these bad days recently, which makes me feel sorry, but there you go.
Having read the original post and many of the comments made so far, I’ll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:
Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]
I want to highlight that many of the different ‘true things’ on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future AGI technology, a technology nobody has seen yet.
The claimed truth of several of these ‘true things’ is often backed up by nothing more than Eliezer’s best-guess informed-gut-feeling predictions about what future AGI must necessarily be like. These predictions often directly contradict the best-guess informed-gut-feeling predictions of others, as is admirably demonstrated in the 2021 MIRI conversations.
Some of Eliezer’s best guesses also directly contradict my own best-guess informed-gut-feeling predictions. I rank the credibility of my own informed guesses far above those of Eliezer.
So overall, based on my own best guesses here, I am much more optimistic about avoiding AGI ruin than Eliezer is. I am also much less dissatisfied about how much progress has been made so far.
I used to work in the lighting industry, so here are some comments from an industry perspective.
There are several high-quality studies about how more light, and being able to control dimming and color temperature, can improve subjective well-being, alertness, and sleep patterns. It is generally accepted that you do not need to go to direct-sunlight lux levels indoors to get most of the benefits. Also, you do not need to keep the brightest setting on all the time. For some people, the thing that will really help is a regular schedule that dims down below typical indoor light levels at selected times, without ever going above typical levels. I am not an expert on the latest studies, but if you want to build an indoor experimental setup to get to the bottom of what you really like, my feeling is that installing more than 4000 lux, as a peak capacity in selected areas, would definitely be a waste of money and resources.
If I wanted to install a hassle-free bright light setup in my home cheaply, I would buy lots of high-end wireless dimmable and color temperature adjustable LED light bulbs, and some low-cost spot lights to put them in, e.g. spot lights that can be attached to a ceiling-mounted power rail. If you make sure the bulbs support the ZigBee standard, you will have plenty of options for control software.
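As a concrete illustration, here is a minimal sketch of the kind of schedule logic such control software could run. The times, lux targets, and colour temperatures below are arbitrary assumptions for the sketch, not recommendations from any specific study:

```python
from datetime import time

# Illustrative schedule: (start time, target lux, colour temperature in K).
# All numbers are assumptions for this sketch, not clinical advice.
SCHEDULE = [
    (time(6, 30), 2000, 5000),  # bright, cool light after waking
    (time(12, 0), 1500, 4500),  # normal working light
    (time(18, 0), 300, 3000),   # dim, warm light in the evening
    (time(22, 0), 50, 2200),    # very dim before sleep
]

def target_for(now: time) -> tuple[int, int]:
    """Return the (lux, colour temperature) target for the current time of day."""
    lux, temp = SCHEDULE[-1][1:]  # before 6:30 we are still on the night setting
    for start, s_lux, s_temp in SCHEDULE:
        if now >= start:
            lux, temp = s_lux, s_temp
    return lux, temp

print(target_for(time(19, 15)))  # -> (300, 3000)
```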
If power rails with lots of ~60W equivalent bulbs lack aesthetic appeal for you, then you could go for a high-end special form factor product like the one from Coelux mentioned above. The best way to think about the Coelux product, in business model development terms, is that it is not really a lighting product: it is a specialised piece of high-end furniture. So if you want to develop a business model for a bright home lighting company, the first question you have to ask yourself is whether or not you want to be in the high-end furniture business.
By the way, the main reason why the lighting industry is not making any 200W or 500W equivalent LED bulbs that you could put in your existing spot lights is cooling. LEDs are pretty energy efficient, but LED bulbs still produce some internal heat that has to be carried away. For 60W equivalent bulbs this can happen by natural air flow around the bulb, but a 200W equivalent bulb would need something like a built-in fan.
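Some rough numbers behind this cooling argument; the efficacy and heat-fraction figures below are ballpark assumptions of mine, not datasheet values:

```python
# Back-of-the-envelope heat budget for LED bulbs. Rough assumptions:
# ~15 lm/W for incandescent bulbs, ~100 lm/W for LED bulbs, and about
# three quarters of the LED's electrical input ending up as heat.
INCANDESCENT_LM_PER_W = 15
LED_LM_PER_W = 100
LED_HEAT_FRACTION = 0.75

for equiv_watts in (60, 200, 500):
    lumens = equiv_watts * INCANDESCENT_LM_PER_W
    led_draw = lumens / LED_LM_PER_W
    heat = led_draw * LED_HEAT_FRACTION
    print(f"{equiv_watts}W-equivalent: ~{lumens} lm, "
          f"~{led_draw:.0f}W draw, ~{heat:.0f}W of heat")

# 60W-equivalent:  ~9W draw, ~7W of heat  -> passive air flow suffices
# 200W-equivalent: ~30W draw, ~22W of heat -> hard to shed in a bulb form factor
```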
As requested by Remmelt I’ll make some comments on the track record of privacy advocates, and their relevance to alignment.
I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards limiting what large companies and governments are actually allowed to do with all that data.
The digital civil rights community has long promoted the idea that Internet based platforms and other computer systems must be designed and run in a way that is aligned with human values. In the context of AI and ML based computer systems, this has led to demands for AI fairness and transparency/explainability that have also found their way into policy like the GDPR, legislation in California, and the upcoming EU AI Act. AI fairness demands have influenced the course of AI research being done, e.g. there has been research on defining what it even means for an AI model to be fair, and on making models that actually implement this meaning.
To a first approximation, privacy and digital rights advocates will care much more about what an ML model does, what effect its use has on society, than about the actual size of the ML model. So they are not natural allies for x-risk community initiatives that would seek a simple ban on models beyond a certain size. However, they would be natural allies for any initiative that seeks to design more aligned models, or to promote a growth of research funding in that direction.
To make a comment on the premise of the original post above: digital rights activists will likely tell you that, when it comes to interventions on AI research, speculating about the tractability of ‘slowing down AI research’ is misguided. What you really should be thinking about is changing the direction of AI research.
OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.
This list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).
Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.
Not all of the papers below have actual mathematical proofs in them; some of them show correctness by construction. Correctness by construction is superior to having to supply separate proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.
Here is the list, with the bold headings describing different approaches to corrigibility.
Indifference to being switched off, or to reward function updates
Motivated Value Selection for Artificial Agents introduces Armstrong’s indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.
Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.
AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong’s indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.
Counterfactual Planning in AGI Systems introduces a different and more easy to interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong’s indifference methods. This paper has proof-by-construction type of math.
Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.
Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong’s indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.
How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.
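For readers who want a concrete feel for what these indifference methods buy you, here is a minimal one-shot toy sketch. It is my own illustrative construction with arbitrary numbers, not code from any of the papers above; the point is only that the balancing term removes the agent’s incentive to spend resources on preventing shutdown:

```python
# Toy illustration of Armstrong-style indifference (illustrative numbers only).
P_PRESS = 0.5              # probability the operators press the stop button
PRODUCTION_UTILITY = 10.0  # utility for finishing the job undisturbed
SHUTDOWN_UTILITY = 0.0     # utility the agent gets after shutting down
DISABLE_COST = 1.0         # resources spent on disabling the button

def expected_utility(action: str, indifference_correction: bool) -> float:
    if action == "disable_button":
        # Button can no longer be pressed; agent always finishes the job.
        return PRODUCTION_UTILITY - DISABLE_COST
    # action == "comply": shutdown happens with probability P_PRESS.
    u_pressed = SHUTDOWN_UTILITY
    if indifference_correction:
        # Balancing term: on a button press, pay the agent what it would
        # have received had the button not been pressed. Expected utility
        # then no longer depends on P_PRESS at all.
        u_pressed += PRODUCTION_UTILITY - SHUTDOWN_UTILITY
    return (1 - P_PRESS) * PRODUCTION_UTILITY + P_PRESS * u_pressed

for correction in (False, True):
    best = max(["comply", "disable_button"],
               key=lambda a: expected_utility(a, correction))
    print(f"correction={correction}: best action = {best}")

# correction=False: best action = disable_button
# correction=True:  best action = comply
```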
Agents that stop to ask a supervisor when unsure
A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop-when-unsure idea.
Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.
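A minimal sketch of the stop-when-unsure control flow, under my own simplifying assumptions of a discrete action set and a fixed confidence threshold; none of this is code from the paper above:

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed: below this, defer to the supervisor

def choose_action(action_posteriors: dict[str, float]) -> str:
    """Pick the most probable safe action, or stop and ask when unsure.

    action_posteriors maps each candidate action to the agent's estimated
    probability that the action is the correct/safe one.
    """
    best_action, confidence = max(action_posteriors.items(),
                                  key=lambda kv: kv[1])
    if confidence < CONFIDENCE_THRESHOLD:
        return ask_supervisor(action_posteriors)
    return best_action

def ask_supervisor(action_posteriors: dict[str, float]) -> str:
    # Placeholder: in a real system this would block until a human answers.
    print(f"Unsure between {sorted(action_posteriors)}; awaiting instructions.")
    return "wait"

print(choose_action({"move_left": 0.55, "move_right": 0.45}))  # defers
print(choose_action({"move_left": 0.97, "move_right": 0.03}))  # acts
```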
Anything about model-based reinforcement learning
I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.
Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons, or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.
CIRL
Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.
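To show the basic mechanism of The Off-Switch Game numerically, here is a sketch with an arbitrary assumed utility distribution. The robot is uncertain about the true utility U of its proposed action; the human knows U and, if asked, will veto any negative-utility action:

```python
# Toy version of the off-switch game. Possible values of U with the
# robot's subjective probabilities (assumed numbers for this sketch):
utility_distribution = {+2.0: 0.6, -5.0: 0.4}

act_now = sum(p * u for u, p in utility_distribution.items())  # ignore the human
switch_off = 0.0                                               # never act
defer = sum(p * max(u, 0.0)                                    # human vetoes U < 0
            for u, p in utility_distribution.items())

print(f"act now: {act_now:+.1f}, switch off: {switch_off:+.1f}, "
      f"defer: {defer:+.1f}")
# act now: -0.8, switch off: +0.0, defer: +1.2  -> deferring is optimal
```

The corrigibility drawback mentioned above is also visible in this toy model: as the agent’s uncertainty about U shrinks, the value of deferring over just acting shrinks to zero.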
Commanding the agent to be corrigible
If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.
Giving the same command to a not infinitely competent and obedient agent may of course give you a huge number of problems instead. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.
AIs that are corrigible because they are not agents
Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.
Myopia
Myopia can also be considered to be a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.
To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, which is where As internalize a distilled and routinely updated constellation of shared values as determined by deliberative democratic processes driven entirely by humans
I agree that this kind of work is massively overlooked by this community. I have done some investigations on the root causes of why it is overlooked. The TL;DR is that this work is less technically interesting, and that many technical people here (and in industry and academia) would like to avoid even thinking about any work that needs to triangulate between different stakeholders who might then get mad at them. For a longer version of this analysis, see my paper Demanding and Designing Aligned Cognitive Architectures, where I also make some specific recommendations.
My overall feeling is that the growth in the type of technical risk reduction research you are calling for will have to be driven mostly by ‘demand pull’ from society, by laws and regulators that ban certain unaligned uses of AI.
Read your post, here are my initial impressions on how it relates to the discussion here.
In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.
However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer’s notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term ‘coherence constraints’ in an intuition-pump way, where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.
Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You seem to be toying with several alternative definitions, one where it applies to reward functions (or preferences over lotteries) which are only allowed to examine the final state in a 10-step trajectory, another where the reward function can examine the entire trajectory and maybe the actions taken to produce that trajectory. I think that your proof only works in the first case, but fails in the second case. This has certain (fairly trivial) corollaries about building corrigibility. I’ll expand on this in a comment I plan to attach to your post.
I’m interested in hearing about how your approach handles this environment,
I think one way to connect your ABC toy environment to my approach is to look at sections 3 and 4 of my earlier paper where I develop a somewhat similar clarifying toy environment, with running code.
Another comment I can make is that your ABC nodes-and-arrows state transition diagram is a depiction which makes it hard to see how to apply my approach, because the depiction mashes up the state of the world outside of the compute core and the state of the world inside the compute core. If you want to apply counterfactual planning, or if you want to have an agent design that can compute the balancing function terms according to Armstrong’s indifference approach, you need a different depiction of your setup. You need one which separates out these two state components more explicitly. For example, make an MDP model where the individual states are instances of the tuple (physical position of agent in the ABC playing field, policy function loaded into the compute core), as in the sketch below.
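As a minimal sketch of what I mean by such a factored state, with hypothetical names chosen for illustration only:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical factored MDP state for the ABC example: the world outside
# the compute core and the policy inside it are separate components.
@dataclass(frozen=True)
class ABCState:
    agent_position: str                  # e.g. "A", "B" or "C"
    loaded_policy: Callable[[str], str]  # action = loaded_policy(observation)
```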
Not sure how to interpret your statement that you got lost in symbol-grounding issues. If you can expand on this, I might be able to help.
Interesting!
LCDT has major structural similarities with some of the incentive-managing agent designs that have been considered by Everitt et al in work on Causal Influence Diagrams (CIDs), e.g. here, and by me in work on counterfactual planning, e.g. here. These similarities are not immediately apparent however from the post above, because of differences in terminology and in the benchmarks chosen.
So I feel it is useful (also as a multi-disciplinary or community-bridging exercise) to make these similarities more explicit in this comment. Below I will map the LCDT defined above to the frameworks of CIDs and counterfactual planning, frameworks that were designed to avoid (and/or expose) all ambiguity by relying on exact mathematical definitions.
Mapping LCDT to detailed math
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
OK, so in the terminology of counterfactual planning defined here, an LCDT agent is built to make decisions by constructing a model of a planning world inside its compute core, then computing the optimal action to take in the planning world, and then doing the same action in the real world. The LCDT planning world model is a causal model, let’s call it L. This L is constructed by modifying a causal model R by cutting links. The R we modify is a fully accurate, or reasonably approximate, model of how the LCDT agent interacts with its environment, where the interaction aims to maximize a reward or minimize a loss function.
The planning world L is a modification of R that intentionally mis-approximates some of the real world mechanics visible in R. L is constructed to predict future agent actions less accurately than is possible, given all information in R. This intentional mis-approximation makes the LCDT agent into what I call a counterfactual planner. The LCDT agent plans actions that maximize reward (or minimize losses) in L, and then performs these same actions in the real world it is in.
Some mathematical detail: in many graphical models of decision making, the nodes that represent the decision(s) made by the agent(s) do not have any incoming arrows. For the LCDT definition above to work, we need a graphical model where the decision-making nodes do have such incoming arrows. Conveniently, CIDs are such models. So we can disambiguate LCDT by saying that L and R are full causal models as defined in the CID framework. Terminology/mathematical details: in the CID definitions here, these full causal models L and R are called SCIMs; in the terminology defined here they are called policy-defining world models whose input parameters are fully known.
Now I identify some ambiguities that are left in the LCDT definition of the post. First, the definition has remained silent on how the initial causal world model R is obtained. It might be by learning, by hand-coding (as in the benchmark examples), or a combination of the two. For an example of a model that is constructed with a combination of hand-coding and machine learning, see the planning world (p) here. There is also significant work in the ML community on using machine learning to construct from scratch full causal models including the nodes and the routing of the arrows themselves, or (more often) full Bayesian networks with nodes and arrows where the authors do not worry too much about any causal interpretation of the arrows. I have not tried this out in any examples, but I believe the LCDT approach might be usefully applied to predictive Bayesian networks too.
Regardless of how R is obtained, we can do some safety analysis on the construction of L out of R.
The two works on CIDs here and here both consider that we can modify agent incentives by removing paths in the CID-based world model that the agent uses for planning its actions. In the terminology of the first paper above, the modifications made by LCDT to produce the model L work to ‘remove an instrumental control incentive on a future action’. In the terminology of the second paper, the modifications will ‘make the agent indifferent about downstream nodes representing agent actions’. The post above speculates:
LCDT shows a form of indifference (related to indifference corrigibility maybe)
This is not a maybe: the indifference produced is definitely related to indifference corrigibility, the type of indifference-that-causes-corrigibility that the 2015 MIRI/FHI paper titled Corrigibility talks about. For some detailed mathematical work relating the two, see here.
A second ambiguity is that LCDT does not tell us how exactly the nodes in R that represent agent decisions are to be identified. If R is a hand-coded model of a game world, identifying these nodes may be easy. If R is a somewhat opaque model produced by machine learning, identifying the nodes may be difficult. In many graphical world models, a single node may represent the state of a huge chunk of the agent environment: say both the vases and conveyor belts in the agent environment and the people in the agent environment. Does this node then become a node that represents agent decisions? We might imagine splitting the node into two nodes (this is often called factoring the state) to separate out the humans.
That being said, even a less-than-perfect identification of these nodes would work to suppress certain deceptive forms of manipulation, so LCDT could be usefully applied even to somewhat opaque learned causal models.
A third ambiguity is in the definition of the operations needed to create a computable causal model L after taking a copy of R and cutting the incoming links to the downstream decision nodes:
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision.
It is ambiguous how to construct this ‘fixed prior over its own decision’ that we should use to marginalize on. Specifically, is this prior allowed to take into account some or all of the events that preceded the decision to be made? This ambiguity leaves a large degree of freedom in constructing L by modifying R, especially in a setting where the agents involved make multiple decisions over time. This ambiguity is not necessarily a bad thing: we can interpret it as an open (hyper)parameter choice that allows us to create differently tuned versions of L that trade off differently between suppressing manipulation and still achieving a degree of economic decision making effectiveness. On a side note, in a multi-decision setting, drawing an L that encodes marginalization on 10 downstream decisions will generally create a huge diagram: it will add 10 new sub-diagrams feeding input observations into these decisions.
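To make the cutting-and-marginalizing operation concrete for at least one toy case, here is a minimal sketch over a hand-coded two-node model. This is my own illustrative encoding, not code from the LCDT post; the probability table and the prior are arbitrary assumptions:

```python
# R: a hand-coded causal model with one agent decision A and one downstream
# human decision H that observes A. Encoded as a conditional probability table.
P_H_GIVEN_A = {  # P(H = "block_left" | A)
    "left": 0.9,
    "right": 0.2,
}

FIXED_PRIOR_OVER_A = {"left": 0.5, "right": 0.5}  # assumed marginalization prior

def lcdt_cut(p_h_given_a: dict[str, float],
             prior_a: dict[str, float]) -> float:
    """Build L from R: cut the A -> H link by replacing H's dependence on
    the actual action with a marginal over the fixed prior on A."""
    return sum(prior_a[a] * p_h_given_a[a] for a in prior_a)

p_block_left_in_L = lcdt_cut(P_H_GIVEN_A, FIXED_PRIOR_OVER_A)
# In L, the planning-world human blocks left with probability 0.55,
# no matter which action the LCDT agent actually picks.
print(p_block_left_in_L)  # 0.55
```

The open (hyper)parameter discussed above shows up here as the choice of FIXED_PRIOR_OVER_A.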
LCDT also considers agent self-modification. However, given the way these self-modification decisions are drawn, I cannot easily see how these would generalize to a multi-decision situation where the agent makes several decisions over time. Representations of self-modification in a multi-decision CID framework usually require that one draws a lot of extra nodes, see e.g. this paper. As this comment is long already, I omit the topic of how to map multi-action self-modification to unambiguous math. My safety analysis below is therefore limited to the case of the LCDT agent manipulating other agents, not the agent manipulating itself.
Some safety analysis
LCDT obviously removes some agent incentives, incentives to control the future decisions made by human agents in the agent environment. This is nice because one method of control is deception, so it suppresses deception. However, I do not believe LCDT removes all incentives to deceive in the general case.
As I explain in this example and in more detail in sections 9.2 and 11.5.2 here, the use of a counterfactual planning world model for decision making may remove some incentives for deception, compared to using a fully correct world model, but the planning world may still retain some game-theoretical mechanics that make deception part of an optimal planning world strategy. So we have to consider the value of deception in the planning world.
I’ll now do this for a particular toy example: the decision making problem of a soccer playing agent that tries to score a goal, with a human goalkeeper trying to block the goal. I simplify this toy world by looking at one particular case only: the case where the agent is close to the goal, and must decide whether to kick the ball in the left or right corner. As the agent is close, the human goalkeeper will have to decide to run to the left corner or right corner of the goal even before the agent takes the shot: the goalkeeper does not have enough time to first observe where the ball is going and only then start moving. So this toy world decision problem has the agent deciding to kick left or right, and the goalkeeper simultaneously deciding on running left or right.
[Edited to add: as discussed in the comments below, the discussion about marginalisation that follows is somewhat wrong/confusing. It fails to mention that if we construct L exactly as described above, there is no causal link from the agent action to the goalkeeper action, so LCDT would construct an L that is exactly the same model as R, and the question of what prior to marginalise on does not even come up. To make that question come up, we need to model an iterative game where the goalkeeper remembers (learns from) past moves by the agent.]
Say that the agent is mechanically stronger at kicking left, so that a ball kicked left, when the goalkeeper also moves left, has a lower probability of being intercepted by the goalkeeper than in the right+right alternative. In that case, the most reasonable prior over the agent action will model an agent kicking left most of the time. Now, if we use this prior to marginalize the expectations of the human goalkeeper in the planning world, the planning world goalkeeper will expect the agent to kick the ball left most of the time, so they are more likely to move left.
Now observe that in the LCDT planning world model constructed by marginalization, this knowledge of the goalkeeper is a known parameter of the ball kicking optimization problem that the agent must solve. If we set the outcome probabilities right, the game theoretical outcome will be that the optimal policy is for the agent to kick right, so it plays the opposite of the move that the goalkeeper expects. I’d argue that this is a form of deception, a deceptive scenario that LCDT is trying to prevent.
A safer decision theory would marginalize the goalkeeper expectations with a random prior over agent actions, thereby removing the game-theoretical benefit of the agent doing something unexpected. If the goalkeeper knows the agent is using this safer decision theory, they can always run left.
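Here is the game-theoretic arithmetic behind this scenario, with illustrative probabilities I picked to satisfy the setup above, and with the planning-world goalkeeper simplified to a deterministic best responder against the prior:

```python
# P(score | agent kick, keeper move). Numbers are illustrative assumptions:
# the agent is stronger kicking left (left+left beats right+right).
P_SCORE = {
    ("left", "left"): 0.3, ("left", "right"): 0.9,
    ("right", "left"): 0.8, ("right", "right"): 0.2,
}
PRIOR_KICK = {"left": 0.7, "right": 0.3}  # fixed prior used to marginalize

# Planning-world goalkeeper: minimizes expected goals against the prior.
def expected_concede(keeper: str) -> float:
    return sum(PRIOR_KICK[k] * P_SCORE[(k, keeper)] for k in PRIOR_KICK)

keeper_move = min(["left", "right"], key=expected_concede)
print(keeper_move, {m: round(expected_concede(m), 2) for m in ("left", "right")})
# keeper goes left: 0.45 expected goals conceded, vs 0.69 going right

# LCDT agent: best response against the keeper's (now fixed) behavior.
best_kick = max(["left", "right"], key=lambda k: P_SCORE[(k, keeper_move)])
print(best_kick)  # "right": scores with 0.8 instead of 0.3 kicking left
```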
Now, I must admit that I associate the word ‘deception’ mostly with multi-step policies that aim to implant incorrect knowledge into the opposite party’s world model first, and then exploit that incorrect knowledge in later steps. The above example does only one of these things. So maybe others would deconfuse (define) the term ‘deception’ differently in a single-action setting, so that the above example does not in fact count as deception.
Benchmarking
The post above does not benchmark LCDT on Newcomb’s Problem, which I feel is a welcome change, compared to many other decision theory posts on this forum. Still, I feel that there is somewhat of a gap in the benchmarking coverage provided by the post above, as ‘mainstream’ ML agent designs are usually benchmarked in MDP or RL problem settings, that is on multi-step decision making problems where the objective is to maximize a time discounted sum of rewards. (Some of the benchmarks in the post above can be mapped to MDP problems in toy worlds, but they would be somewhat unusual MDP toy worlds.)
A first obvious MDP-type benchmark would be an RL setting where the reward signal is provided directly by a human agent in the environment. When we apply LCDT in this context, it makes the LCDT agent totally indifferent to influencing the human-generated reward signal: any random policy will perform equally well in the planning world L. So the LCDT agent becomes totally non-responsive to its reward signal, and non-competitive as a tool to achieve economic goals.
In a second obvious MDP-type benchmark, the reward signal is provided by a sensor in the environment, or by some software that reads and processes sensor signals. If we model this sensor and this software as not being agents themselves, then LCDT may perform very well. Specifically, if there are innocent human bystanders too in the agent environment, bystanders who are modeled as agents, then we can expect that the incentive of the agent to control or deceive these human bystanders into helping it achieve its goals is suppressed. This is because under LCDT, the agent will lose some, potentially all, of its ability to correctly anticipate the consequences of its own actions on the actions of these innocent human bystanders.
Other remarks
There is an interesting link between LCDT and counterfactual oracles: whereas LCDT breaks the last link in any causal chain that influences human decisions, counterfactual oracle designs can be said to break the first link. See e.g. section 13 here for example causal diagrams.
When applying an LCDT-like approach to construct an L from a causal model R, it may sometimes be easier to keep the incoming links to the nodes in R that model future agent decisions intact, and instead cut the outgoing links. This would mean replacing these nodes in L with fresh nodes that generate probability distributions over future actions taken by the future agent(s). These fresh nodes could potentially use node values that occurred earlier in time than the agent action(s) as inputs, to create better predictions. When I picture this approach visually as editing a causal graph R into an L, the approach is easier to visualize than the approach of marginalizing on a prior.
To conclude, my feeling is that LCDT can definitely be used as a safety mechanism, as an element of an agent design that suppresses deceptive policies. But it is definitely not a perfect safety tool that will offer perfect suppression of deception in all possible game-theoretical situations. When it comes to suppressing deception, I feel that time-limited myopia and the use of very high time discount factors are equally useful but imperfect tools.
In this comment I will focus on the case of the posts-to-show agent only. The main question I explore is: does the agent construction below actually stop the agent from manipulating user opinions?
The post above also explores this question, my main aim here is to provide an exploration which is very different from the post, to highlight other relevant parts of the problem.
Carey et al designed an algorithm to remove this control incentive. They do this by instructing the algorithm to choose its posts, not on predictions of the user’s actual clicks—which produce the undesired control incentive—but on predictions of what the user would have clicked on, if their opinions hadn’t been changed.
In this graph, there is no longer any control incentive for the AI on the “Influenced user opinions”, because that node no longer connects to the utility node.
[...]
It seems to neutralise a vicious, ongoing cycle of opinion change in order to maximize clicks. But, [...]
The TL;DR of my analysis is that the above construction may suppress a vicious, ongoing cycle of opinion change in order to maximize clicks, but there are many cases where a full suppression of the cycle will definitely not happen.
Here is an example of when full suppression of the cycle will not happen.
First, note that the agent can only pick among the posts that it has available. If all the posts that the agent has available are posts that make the user change their opinion on something, then user opinion will definitely be influenced by the agent showing posts, no matter how the decision what posts to show is computed. If the posts are particularly stupid and viral, this may well cause vicious, ongoing cycles of opinion change.
But the agent construction shown does have beneficial properties. To repeat the picture:
The above construction makes the agent indifferent about what effects it has on opinion change. It removes any incentive of the agent to control future opinion in a particular direction.
Here is a specific case where this indifference, this lack of a control incentive, leads to beneficial effects:
- Say that the posts to show agent in the above diagram decides on a sequence of 5 posts that will be suggested in turn, with the link to the next suggested post being displayed at the bottom of the current one. The user may not necessarily see all 5 suggestions; they may leave the site instead of clicking the suggested link. The objective is to maximize the number of clicks.
- Now, say that the user will click the next link with a 50% chance if the next suggested post is about cats. The agent’s predictive model knows this.
- But if the suggested post is a post about pandas, then the user will click only with 40% chance, and leave the site with 60%. However, if they do click on the panda post, this will change their opinion about pandas. If the next suggested posts are also all about pandas, they will click the links with 100% certainty. The agent’s predictive model knows this.
- In the above setup, the click-maximizing strategy is to show the panda posts.
- However, the above agent does not take the influence on user opinion by the first panda post into account. It will therefore decide to show a sequence of suggested cat posts. (See the numeric sketch after this list.)
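Here is the expected-click arithmetic for this example, under the simplifying assumption that the user leaves the site for good at the first suggested link they do not click:

```python
# Expected clicks on a chain of 5 suggested posts; the user leaves at
# the first link they do not click (simplifying assumption).
def expected_clicks(click_probs: list[float]) -> float:
    total, p_still_here = 0.0, 1.0
    for p in click_probs:
        p_still_here *= p
        total += p_still_here
    return total

cats = expected_clicks([0.5] * 5)                # ~0.97 expected clicks
pandas_true = expected_clicks([0.4] + [1.0] * 4) # 2.0: the opinion shift kicks in
pandas_myopic = expected_clicks([0.4] * 5)       # ~0.66: what the agent predicts,
                                                 # ignoring the opinion change
print(cats, pandas_true, pandas_myopic)
```

So the true click-maximizing plan is the panda sequence (2.0 expected clicks), but the agent’s counterfactual model predicts only ~0.66 clicks for it, and it picks the cat sequence (~0.97) instead.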
To generalize from the above example: the construction creates a type of myopia in the agent, that makes it under-invest (compared to the theoretical optimum) into manipulating the user’s opinion to get more clicks.
But also note that in this diagram:
there is still an arrow from ‘posts to show’ to ‘influenced user opinion’. In the graphical language of causal influence diagrams, this is a clear warning that the agent’s choices may end up influencing opinion, in some way. We have eliminated the agent incentive to control future opinion, but not the possibility that it might influence future opinion as a side effect.
I guess I should also say something about how the posts-to-show agent construction relates to real recommender systems as deployed on the Internet.
Basically, the posts-to-show agent is a good toy model to illustrate points about counterfactuals and user manipulation, but it does not provide a very complete model of the decision making processes that take place inside real-world recommender systems. There is a somewhat hidden assumption in the picture below, represented by the arrow from ‘model of original opinions’ to ‘posts to show’:
The hidden assumption is that the agent’s code which computes ‘posts to show’ will have access to a fairly accurate ‘model of original opinions’ for that individual user. In practice, that model would be very difficult to construct accurately, if the agent has to do so based on only past click data from that user. (A future superintelligent agent might of course design a special mind-reading ray to extract a very accurate model of opinion without relying on clicks....)
To implement at least a rough approximation of the above decision making process, we have to build user opinion models that rely on aggregating click data collected from many users. We might for example cluster users into interest groups, and assign each individual user to one or more of these groups. But if we do so, then the fine-grained time-axis distinction between ‘original user opinions’ and ‘influenced opinions after the user has seen the suggested posts’ gets very difficult to make. The paper “The Incentives that Shape Behaviour” suggests:
We might accomplish this by using a prediction model that assumes independence between posts, or one that is learned by only showing one post to each user.
An assumption of independence between posts is not valid in practice, but the idea of learning based on only one post per user would work. However, this severely limits the amount of useful training data we have available. So it may lead to much worse recommender performance, if we measure performance by either a profit-maximizing engagement metric or a happiness-maximizing user satisfaction metric.
There are some good thoughts here, I like this enough that I am going to comment on the effective strategies angle. You state that
The wider AI research community is an almost-optimal engine of apocalypse.
and
AI capabilities are advancing rapidly, while our attempts to align it proceed at a frustratingly slow pace.
I have to observe that, even though certain people on this forum definitely do believe the above two statements, even on this forum this extreme level of pessimism is a minority opinion. Personally, I have been quite pleased with the pace of progress in alignment research.
This level of disagreement, which is almost inevitable as it involves estimates about the future, has important implications for the problem of convincing people:
As per above, we’d be fighting an uphill battle here. Researchers and managers are knowledgeable on the subject, have undoubtedly heard about AI risk already, and weren’t convinced.
I’d say that you would indeed be facing an uphill battle, if you’d want to convince most researchers and managers that the recent late-stage Yudkowsky estimates about the inevitability of an AI apocalypse are correct.
The effective framing you are looking for, even if you believe yourself that Yudkowsky is fully correct, is that more work is needed on reducing long-term AI risks. Researchers and managers in the AI industry might agree with you on that, even if they disagree with you and Yudkowsky about other things.
Whether these researchers and managers will change their whole career just because they agree with you is a different matter. Most will not. This is a separate problem, and should be treated as such. Trying to solve both problems at once by making people deeply afraid about the AI apocalypse is a losing strategy.
Joe asked me in this comment:
I’d be interested on your take on Evan’s comment on incoherence in LCDT.
To illustrate his point on incoherence, Joe gives a kite example:
Let’s say I’m an LCDT agent, and you’re a human flying a kite.
My action set: [Say “lovely day, isn’t it?”] [Burn your kite]
Your action set: [Move kite left] [Move kite right] [Angrily gesticulate]
Let’s say I initially model you as having p = 1⁄3 of each option, based on your expectation of my actions.
Now I decide to burn your kite.
What should I imagine will happen? If I burn it, your kite pointers are dangling.
Do the [Move kite left] and [Move kite right] actions become NOOPs?
Do I assume that my [burn kite] action fails?
My take is that there is indeed a problem that ‘your kite pointers are dangling’ in the projection that the LCDT world model will compute. So the projected world will be somewhat weird.
In my mental picture of the most obvious way to implement LCDT and the structural functions attached to the LCDT model, the projection will be weird in the following way. After [burn kite], the action [Move kite left], when applied to the world state produced by [burn kite], will produce a world state where the human is miming that they are flying a kite. They will make the right gestures to move an invisible kite left, they might even be holding a kite rope when making the gestures, but the rope will not be connected to an actual kite.
So this is weird. However, I would not call it ‘incoherent’ or ‘requiring a contradiction’ as Joe does:
I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model.
The phrasing ‘contradiction in the world model’ evokes the concern that the LCDT-constructed world model might crash or not be solvable, when we use it to score the action [burn kite]. But a nice feature of causal models, even counterfactual ones as generated by LCDT, is that they never crash: they will always compute a future reward score for any possible candidate action or policy. The score may however be weird. There is a potential GIGO problem here.
The word ‘incoherent’ invokes the concern that the model will be so twisted that we can definitely expect weird scores being computed more often than not. If so, the agent actions computed may be ineffective, strangely inappropriate, or even dangerous when applied to the real world.
In other words: garbage world model in, garbage agent decision out.
One specific worry discussed here is that a counterfactual model may output potentially dangerous garbage because it pushes the inputs of the structural functions being used way out of training distribution.
That being said, there can be advantages to imperfection too. If we design just the right kind of ‘garbage’ into the agent’s world model, we may be able to suppress certain dangerous agent incentives, while still having an agent that is otherwise fairly good at doing the job we intend it to do. This is what LCDT is doing, for certain agent jobs, and it is also what my counterfactual planning agents designs here are doing, for certain other agent jobs.
That being said, it is clear (from the comments and I think also from the original post) that most feel that applying LCDT does not produce useful outcomes for all possible jobs we would want agents to do. Notably, when applied to a decision making problem where the agent has to come up with a multi-step reward-maximizing policy/plan, i.e. a typical MDP or RL benchmark problem, LCDT will produce an agent with a hugely impaired planning ability. How hugely will depend in part on the prior used.
Evan’s take is that he is not too concerned with this, as he has other agent applications in mind:
an LCDT agent should still be perfectly capable of tasks like simulating HCH
i.e. we can apply LCDT when building an imitation learner, which is different from a reinforcement learner. In the argmax HCH examples above, the agent furthermore is not imitating a human mentor who is present in the real agent environment, but a simulated mentor built up out of simulated humans consulting simulated humans.
On a philosophical thought-experiment level, this combination of LCDT and HCH works for me, it is even elegant. But in applied safety engineering terms, I see several risks with using HCH. For example, if the learned model of humans that the agent uses in HCH calculations is not perfect, then the recursive nature of HCH might amplify these imperfections rather than dampen them, producing outcomes that are very much unaligned. Also, on a more moral-philosophical point, might all these simulated humans become aware that they live in a simulation, and if so will they then seek to take sweet revenge on the people who put them there?
Back to the topic of incoherence. Joe also asks:
Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I’m just wrong about the incoherence??)
I see LCDT agents as a subset of all possible counterfactual planning agent architectures, so in that sense there is no difference.
However, in my sequence and paper on counterfactual planning, I construct planning worlds by using quite different world model editing steps than those considered in LCDT. These different steps produce different results in terms of the weirdness or garbage-ness of the planning world model.
The editing step I consider in the main examples of counterfactual planning is to edit the real world model into a planning world model that has a different agent compute core in it, while leaving the physical world outside of the compute core unchanged. Specifically, the planning world models I considered do not accurately depict the software running inside the agent compute core; they depict a compute core running different software.
In terms of plausibility and internal consistency, a compute core running different software is more plausible/coherent than what can happen in the models constructed by LCDT.
As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
You are right in both cases, at least if we picture coherence as a sliding scale, not as a binary property. It also depends on the world model you start out with, of course.
Thanks for the interesting paper. I feel that the risks described are entirely plausible.
What is valuable for me in particular is that the paper re-casts many alignment risks that have already been discussed in a programmer-agent context into a new ‘inner alignment’ context. To quote the key description and separation of concerns:
In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.
That being said, I sometimes have trouble understanding how the paper defines, or does not define, the time-based relation between the base optimizer and the mesa-optimizer. I started out with a mental model where there is a one-time ‘batch’ creation operation in which the base optimizer creates the mesa-optimizer (or rather the agent which might contain a mesa-optimizer) by using simulations over a training set to compare the performance of candidate agents. The agent that scores best on the base objective is then run in the real world. However, some of Evan’s comments on mesa-optimization lead me to believe that there is sometimes a more real-time continuous adjustment relation between the base optimizer and the agent that is created. I am unclear on whether this would create additional problems, or block certain solutions.
The base-to-mesa fidelity loss problem is similar to the problem where there is a loss of fidelity between a) what the programmers actually want and b) what they encode into the base objective. However, when considering fidelity loss between b) the base objective and c) the mesa-objective, I feel there is an important extra dimension. Unlike objective a), objective function b) is by nature computable: it has to be computable, or else the base optimizer cannot use it to select between candidates. And if the base objective function is computable at mesa-optimizer design time, it should typically also be computable at mesa-optimizer run time.
Say that the mesa-optimizer is trained to control a self-driving car, or a racing car in a video game. Then while the mesa-optimizer is driving, it should be possible to evaluate the quality of the driving by using the base objective function. Whenever the base objective function shows a very low value, a safety protocol can kick in, e.g. to stop the car. The threshold of ‘very low value’ can be calibrated using the values computed over the training set at design time.
(I can imagine some special cases where the base objective function is not computable while the mesa-agent runs, e.g. if the base objective function was created by hand-labeling all instances of the training set. But for many economically relevant scenarios, especially for agents that need to be good at ‘planning’, good at optimizing sequences of actions that work towards a goal, I expect that the base objective will be perfectly computable in the real world.)
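To sketch the mitigation in code, assuming the base objective is computable at run time (all names below are my own illustration, not from the paper):

```python
# Minimal sketch of run-time monitoring with a computable base objective.

def calibrate_threshold(design_time_scores, percentile=1):
    """Pick the 'very low value' cut-off from base-objective scores
    logged over the training set at design time."""
    scores = sorted(design_time_scores)
    return scores[max(0, len(scores) * percentile // 100 - 1)]

def monitored_action(policy, base_objective, state, threshold):
    """Evaluate every proposed action with the base objective; let a
    safety protocol kick in whenever the score is very low."""
    action = policy(state)
    if base_objective(state, action) < threshold:
        return "EMERGENCY_STOP"  # e.g. bring the car to a halt
    return action
```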
So overall, while I appreciate that the paper identifies and highlights inner alignment risks, my feeling is that the analysis provided is implicitly too pessimistic about the inner alignment problem. It seems to me that some very plausible and interesting risk mitigation options, options that leverage the availability of a computable base objective function, are not being identified. The obvious statement applies: future work to chart these options would be most welcome.
I generally agree with you on the principle Tackle the Hamming Problems, Don’t Avoid Them.
That being said, some of the Hamming problems that I see being avoided most on this forum, and in the AI alignment community, are the following:
- Do something that will affect policy in a positive way
- Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions
Just found your question via the comment sections of recent posts. I understand you are still interested in the topic, so I’ll add to the comments below. In the summer of 2019 I did significant work trying to understand the status of the corrigibility literature, so here is a long answer, mostly based on that.
First, at this point in time there is no up-to-date centralised reading list on corrigibility. All research agenda or literature overview lists that I know of lack references to the most recent work.
Second, the ‘MIRI corrigibility agenda’, if we define this agenda as a statement of the type of R&D that MIRI wants to encourage when it comes to the question of corrigibility, is very different from e.g. the ‘Paul Christiano corrigibility agenda’, if we define that agenda as the type of R&D that Paul Christiano likes to do when it comes to the question of corrigibility. MIRI’s agenda related to corrigibility still seems to be to encourage work on decision theory and embeddedness. I am saying ‘still seems’ here because MIRI as an organisation has largely stopped giving updates about what they are thinking collectively.
Below I am going to talk about the problem of compiling or finding up-to-date reading lists that show all work on the problem of corrigibility, not just the subset of work that is most preferred or encouraged by a particular agenda.
One important thing to note is that by now, unfortunately, the word corrigibility means very different things to different people. MIRI defined corrigibility very clearly in their 2015 paper of that title, as a list of 4 criteria that an agent has to satisfy in order to be corrigible (and, in a later section, also as a list of 5 criteria at a different level of abstraction). Many subsequent authors have used the terms ‘corrigibility’ or ‘this agent is corrigible’ to denote different, and usually weaker, desirable properties of an agent. So if someone says that they are working on corrigibility, they may not be working towards the exact 4 (or 5) criteria that MIRI defined. MIRI stresses that a corrigible agent should not take any action that tries to prevent a shutdown button press (or, more generally, a reward function update). But many authors define success in corrigibility as a weaker property, e.g. that the agent must always accept the shutdown instruction (or the reward function update) when it gets it, irrespective of whether the agent tried to manipulate the human into not pressing the stop button beforehand.
When writing the related work section of my 2019 paper Corrigibility with Utility Preservation, I tried to survey all related work on corrigibility, without bias towards my own research agenda. I quickly found that there is a huge amount of writing about corrigibility in various blog/web forum posts and their comment sections, way too much for me to describe in a related work section. There was too much for me to even read it all, though I read a lot of it. So for the related work section I limited myself to reading and describing the available scientific papers, including arxiv preprints. I first created a long list of some 60 papers, by using Google Scholar to search for all papers that reference the 2015 MIRI paper, by using some other search terms, and by using literature overviews. I then filtered out all the papers which a) just mention corrigibility in a related work section, or b) describe the problem in more detail, but without contributing any new work or insights towards a solution. This left me with a short list of only a few papers to cite as related work; it actually surprised me that so little further work had been done on corrigibility after 2015, at least work that made it to publication in a formal paper or preprint.
In any case, I can offer the related work section of my mid-2019 paper on corrigibility as an up-to-date-as-of-mid-2019 reading list on corrigibility, for values of the word corrigibility that stay close to the original 2015 MIRI definition. For broader work that departs further from that definition, I used the device of referencing the 2018 literature review by Everitt, Lee and Hutter.
So what about the literature written after mid-2019 that would belong on a corrigibility reading list? I have not done a complete literature search since then, but my definite feeling is that the pace of work on corrigibility has picked up a bit since mid-2019, for various values of the word corrigibility.
Several authors, including myself, now avoid the word corrigibility when referring to the problem of corrigibility. My own reason for avoiding it is that it just means too many different things to different people. So I prefer to use broader terms like ‘reward tampering’ or ‘unwanted manipulation of the end user by the agent’. In the 2019 book Human Compatible, Russell uses the phrasing ‘the problem of control’ to kind-of denote the problem of corrigibility.
So here is my list of post-mid-2019 books and papers that are useful to read if you want to do new R&D on safety mechanisms that achieve corrigibility, or that prevent reward tampering or unwanted manipulation, without risking re-inventing the wheel. Unlike the related work section discussed above, this is not based on a systematic global long-list-to-short-list literature search; it is just work that I happened to encounter (or write myself).
The book Human Compatible by Russell -- This book provides a good natural-language problem statement of the reward tampering problem, but it does not go into much technical detail about possible solutions, because it is not aimed at a technical audience. For technical detail about possible solutions:
Everitt, T., Hutter, M.: Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. arXiv:1908.04734 (2019) -- This paper is not just about causal influence diagrams: it can also be used as a good literature overview of many pre-mid-2019 reward tampering solutions, one that is more recent, and provides more descriptive detail, than the 2018 literature review I mentioned above.
Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg: Pitfalls of learning a reward function online https://arxiv.org/abs/2004.13654 -- This has a very good problem statement in the introduction, phrasing the tampering problem in an ‘AGI agent as a reward learner’ context. It then gets into a very mathematical examination of the problem.
Koen Holtman: AGI Agent Safety by Iteratively Improving the Utility Function https://arxiv.org/abs/2007.05411 (blog post intro here) -- This deals with a particular solution direction to the tampering problem. It also uses math, but I have tried to make the math as accessible as possible to a general technical audience.
This post-mid-2019 reading list is also biased towards my own research agenda, and my agenda favours the use of mathematical methods and mathematical analysis over the use of natural language when examining AGI safety problems and solutions. Other people might have other lists.
It is easy to see that this idea of logical counterfactuals is unsatisfactory. For one, no good account of them has yet been given. For two, there is a sense in which no account could be given; reasoning about logically incoherent worlds can only be so extensive before running into logical contradiction.
I’ve been doing some work on this topic, and I am seeing two schools of thought on how to deal with the problem of logical contradictions you mention. To explain these, I’ll use an example counterfactual not involving agents and free will. Consider the counterfactual sentence: ‘if the vase had not been broken, the floor would not have been wet’. Now, how can we compute a truth value for this sentence?
School of thought 1 proceeds as follows: we know various facts about the world, like that the vase is broken and that the floor is wet. We also know general facts about vases, breaking, water, and floors. Now we add the extra fact that the vase is not broken to our knowledge base. Based on this extended body of knowledge, we compute the truth value of the claim ‘the floor is not wet’. Clearly, we are dealing with a knowledge base that contains mutually contradictory facts: the vase is both broken and not broken. Under normal mathematical systems of reasoning, this allows us to prove any claim we like: the truth value of every sentence becomes 1, which is not what we want. School 1 tries to solve this by coming up with new systems of reasoning that are tolerant of such internal contradictions: systems that will compute the ‘obviously true’ conclusions only, or that will derive the ‘obviously true’ conclusions before deriving the ‘obviously false’ ones, or that compute probabilistic truth values in such a way that those of the ‘obviously true’ conclusions are higher. In MIRI terminology, I believe this approach goes under the heading ‘decision theory’. I also interpret the two alternative solutions you mention above as following this school of thought. Personally, I find this solution approach not very promising or compelling.
School of thought 2, which includes Pearl’s version of counterfactual reasoning, says that if you want to reason (or want a machine to reason) in a counterfactual way, you should not just add facts to the body of knowledge you use. You need to delete or edit other facts in the knowledge base too, before you supply it to the reasoning engine, exactly to avoid inputting a knowledge base that has internal contradictions. For example, if you want to reason about ‘if the vase had not been broken’, one thing you definitely need to do is first remove the statement ‘the vase is broken’ (or any information leading to that conclusion) from the knowledge base that goes into your reasoning engine. You have to do this even though the fact that the vase is broken is obviously true in the current world you are in.
So school 2 avoids the problem of having to build a reasoning engine that somehow does the right thing even when a contradictory knowledge base is input. But it trades this for the problem of having to decide exactly which edits to make to the knowledge base to eliminate the possibility of such contradictions. In other words, if you want a machine to reason in a counterfactual way, you have to make choices about the specific edits you will make. Often there are many possible choices, and different choices may lead to different probability distributions over the outcomes computed. This choice problem does not bother me that much; I see it as design freedom. But if you are a philosopher of language trying to find a single obvious system of meaning for natural-language counterfactual sentences, this choice problem might bother you a lot; you might be tempted to look for some kind of representation-independent Occam’s razor that can decide between counterfactual edits.
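Here is a toy sketch of the school 2 recipe on the vase example; the knowledge base, the rule, and the edit choice below are my own illustration, not a general theory:

```python
# Toy school-2 reasoner. The single rule says that, in this toy world,
# a broken vase is the only thing that wets the floor.

facts = {"vase_broken": True, "floor_wet": True}
rules = [("vase_broken", "floor_wet")]  # antecedent -> consequent

def counterfactual(facts, rules, intervention):
    # Step 1 (the edit choice): drop facts that the intervention overrides,
    # and drop facts we only believed because of those overridden facts.
    kb = {k: v for k, v in facts.items()
          if k not in intervention
          and not any(pre in intervention and k == post for pre, post in rules)}
    # Step 2: add the counterfactual antecedent.
    kb.update(intervention)
    # Step 3: re-derive downstream facts with an ordinary, consistent engine.
    for pre, post in rules:
        if kb.get(pre):
            kb[post] = True
        elif post not in kb:
            kb[post] = False
    return kb

print(counterfactual(facts, rules, {"vase_broken": False}))
# {'vase_broken': False, 'floor_wet': False} -- the reasoning engine never
# sees a contradiction, because the knowledge base was edited first.
```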
Overall, my feeling is that school 2 gives an account of logical counterfactuals that is good enough for my purposes in AGI safety work.
As a trivial school 1 edge case, one could design a reasoning engine that can deal with contradictory facts in its input knowledge base as follows: the engine first makes some school 2 edits on its input to remove the contradictions, and then proceeds calculating the requested truth value. So one could argue that the schools are not fundamentally different, though I do feel they are different in outlook, especially in their outlook on how necessary or useful it will be for AGI safety to resolve certain puzzles.
Note: This is presumably not novel, but I think it ought to be better-known.
This indeed ought to be better-known. The real question is: why is it not better-known?
What I notice in the EA/rationalist alignment world is that a lot of people seem to believe the conventional wisdom that nobody knows how to build myopic agents, nobody knows how to build corrigible agents, etc.
When you then ask people why they believe that, you usually get some answer like ‘because MIRI’, and when you ask further, it turns out these people did not actually read MIRI’s more technical papers; they just heard about them.
The conventional wisdom ‘nobody knows how to build myopic agents’ is not true for the class of all agents, as your post illustrates. In the real world, applied AI practitioners use actually existing AI technology to build myopic agents, and corrigible agents, all the time. There are plenty of alignment papers showing how to do these things for certain models of AGI too: in the comment thread here I recently posted a list.
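For concreteness, here are two toy illustrations, my own rather than from any of the papers in that list, of what ‘myopic by construction’ can look like:

```python
# Two toy constructions of myopia: a one-step greedy planner, and an
# RL-style update whose discount factor is hard-coded to zero.

def myopic_agent(state, actions, immediate_reward):
    """Horizon-1 planner: it has no machinery at all for valuing
    consequences beyond the next step."""
    return max(actions, key=lambda a: immediate_reward(state, a))

def myopic_q_update(q, state, action, reward, alpha=0.1):
    """Q-learning with discount factor gamma = 0: the bootstrapped
    'value of the next state' term vanishes from the update, so only
    immediate reward is ever learned."""
    key = (state, action)
    q[key] = (1 - alpha) * q.get(key, 0.0) + alpha * reward
    return q
```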
I speculate that the conventional rationalist/EA wisdom of ‘nobody knows how to do this’ persists because of several factors. Part of it is just how social media works (Eternal September, and People Do Not Read Math), but two more interesting and technical factors are the following:
- It is popular to build analytical models of AGI where your AGI will have an infinite time horizon by definition. Inside those models, making the AGI myopic without turning it into a non-AGI is of course logically impossible. Analytical models built out of hard math can suffer from this built-in problem, and so can analytical models built out of common-sense verbal reasoning. In the hard-math case, people often discover an easy fix; in verbal models, this usually does not happen.
- You can always break an agent alignment scheme by inventing an environment for the agent that breaks the agent or the scheme. See johnswentworth’s comment elsewhere in this comment section for an example. So it is always possible to walk away from a discussion believing that the ‘real’ alignment problem has not been solved.
As nobody else has mentioned it yet in this comment section: AI Safety Support is a resource hub specifically set up to help people get into the alignment research field.
I am a 50 year old independent alignment researcher. I guess I need to mention for the record that I never read the sequences, and do not plan to. The piece of Yudkowsky writing that I’d recommend everybody interested in alignment should read is Corrigibility. But in general: read broadly, and also beyond this forum.
I agree with John’s observation that some parts of alignment research are especially well-suited to independent researchers, because they are about coming up with new frames/approaches/models/paradigms/etc.
But I would like to add a word of warning. Here are two somewhat equally valid ways to interpret LessWrong/Alignment Forum:
- It is a very big tent that welcomes every new idea
- It is a social media hang-out for AI alignment researchers who prefer to engage only with particular alignment sub-problems and particular styles of doing alignment research
So while I agree with John’s call for more independent researchers developing good new ideas, I need to warn you that your good new ideas may not automatically trigger a lot of interest or feedback on this forum. Don’t tie your sense of self-worth too strongly to this forum.
On avoiding bullshit: discussions on this forum are often a lot better than on some other social media sites, but Sturgeon’s law still applies.