First off, I’m super happy that people are thinking about goal-directed behavior :D
I think model-based RL is typically goal-directed, in that it typically performs a search using a world model for a trajectory that achieves high reward (the goal). However, powerful model-free RL is usually also goal-directed: consider AlphaZero (without the MCTS), OpenAI Five, AlphaStar, etc. These are all model-free, but still seem fairly goal-directed. More generally, model-free and model-based RL algorithms usually get similar performance on the same environments (often model-free is less sample efficient but has a higher final performance, though this isn’t always the case).
Also more broadly, I think there’s a smooth spectrum between “habitual cognition” and “goal-directed cognition”, such that you can’t cleanly carve up the space into a binary “goal-directed” or not.
To the extent that we expect strong warning shots and ability to avoid building AGI upon receiving such warning shots, this seems like an argument for researchers/longtermists to work on / advocate for safety problems beyond the standard of “AGI is not trying to deceive us or work against us” (because that standard will likely be reached anyway). Do you agree?
Some types of AI x-risk don’t affect everyone though (e.g., ones that reduce the long term value of the universe or multiverse without killing everyone in the near term).
Agreed, all else equal those seem more likely to me.
See also this thread.
I think this is probably false, but it’s because I’m using the strict definition of s-risk.
Yeah, I should have said something like “the biggest kinds of s-risks where there is widespread optimization for suffering”.
What do you expect the ML community to do at that point?
It depends a lot on the particular warning shot that we get. But on the strong versions of warning shots, where there’s common knowledge that building an AGI runs a substantial risk of destroying the world, yes, I expect them to not build AGI until safety is solved. (Not to the standard you usually imagine, where we must also solve philosophical problems, but to the standard I usually imagine, where the AGI is not trying to deceive us or work against us.)
This depends on other background factors, e.g. how much the various actors think they are value-aligned vs. in zero-sum competition. I currently think the ML community thinks they are mostly but not fully value-aligned, and they will influence companies and governments in that direction. (I also want more longtermists to be trying to build more common knowledge of how much humans are value aligned, to make this more likely.)
I worry about a parallel with the “energy community”
The major disanalogy is that catastrophic outcomes of climate change do not personally affect the CEOs of energy companies very much, whereas AI x-risk affects everyone. (Also, maybe we haven’t gotten clear and obvious warning shots?)
(compared to which, the disasters that will have occurred by then may well seem tolerable by comparison), and given probable disagreements between different experts about how serious the future risks are
I agree that my story requires common knowledge of the risk of building AGI, in the sense that you need people to predict “running this code might lead to all humans dying”, and not “running this code might lead to <warning shot effect>”. You also need relative agreement on the risks.
I think this is pretty achievable. Most of the ML community already agrees that building an AGI is high-risk if not done with some argument for safety. The thing people tend to disagree on is when we will get AGI and how much we should work on safety before then.
Flo’s summary for the Alignment Newsletter:
This post argues that regularizing an agent’s impact by <@attainable utility@>(@Towards a New Impact Measure@) can fail when the agent is able to construct subagents. Attainable utility regularization uses auxiliary rewards and penalizes the agent for changing its ability to get high expected rewards for these to restrict the agent’s power-seeking. More specifically, the penalty for an action is the absolute difference in expected cumulative auxiliary reward between the agent either doing the action or nothing for one time step and then optimizing for the auxiliary reward.
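The penalty described in words above can be written out as a formula. This is only a sketch in notation I am introducing here: $Q_{R_i}(s, a)$ stands for the expected cumulative auxiliary reward $R_i$ when the agent takes action $a$ in state $s$ and then optimizes for $R_i$, and $\varnothing$ stands for doing nothing for one time step. Summing over the $N$ auxiliary rewards is my assumption about how the per-reward differences are combined.

```latex
\mathrm{Penalty}(s, a) \;=\; \sum_{i=1}^{N} \left| \, Q_{R_i}(s, a) \;-\; Q_{R_i}(s, \varnothing) \, \right|
```

Under this reading, an action is penalized whenever it changes the agent's own ability to pursue any auxiliary reward, in either direction.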
This can be circumvented in some cases: If the auxiliary reward does not benefit from two agents instead of one optimizing it, the agent can just build a copy of itself that does not have the penalty, as doing this does not change the agent’s ability to get a high auxiliary reward. For more general auxiliary rewards, an agent could build another more powerful agent, as long as the powerful agent commits to balancing out the ensuing changes in the original agent’s attainable auxiliary rewards.
I am confused about how much the commitment to balance out the original agent’s attainable utility would constrain the powerful subagent. Also, in the presence of subagents, it seems plausible that attainable utility mostly depends on the agent’s ability to produce subagents of different generality with different goals: If a subagent that optimizes for a single auxiliary reward was easier to build than a more general one, building a general powerful agent could considerably decrease attainable utility for all auxiliary rewards, such that the high penalty rules out this action.
Planned summary for the Alignment Newsletter:
This post reports on work done on creating a <@debate@>(@AI safety via debate@) setup that works well with human players. In the game, one player is honest (i.e. arguing for the correct answer) and one is malicious (i.e. arguing for some worse answer), and they play a debate in some format, after which a judge must decide which player won the debate. They are using Thinking Physics questions for these debates, because they involve questions with clear answers that are confusing to most people (the judges) but easy for some experts (the players).
Early freeform text debates did not work very well, even with smart, motivated judges. The malicious player could deflect on questions they didn’t want to answer, e.g. by claiming that the question was ambiguous and redirecting attention by asking new questions. In addition, when the malicious player got to go first and give an incorrect “framework” for finding the answer, and then made individually true claims to “fill in” the framework, it was hard for the honest player to rebut it. So, they moved to a framework without such asymmetries: both players gave a claim (simultaneously), both gave constructive arguments, and both rebutted the other’s arguments. In addition, part of the appeal of debate is that the agents can “zoom in” on the particular disagreement they have, and the judge need only analyze the smallest disagreement in order to declare an overall winner.
This suggests the following setup: players simultaneously provide an answer supported with subclaims. Then, after looking at the other player’s answer and subclaims, they can provide objections (perhaps on particular subclaims, or the overall structure of the argument). Players can concede or challenge objections left by the other player; the debate then recurses on one of the challenged objections (currently, the malicious player chooses which claim to recurse on, in order to be conservative). There is a depth limit to the recursion; once it is reached (or if there are no challenged objections), the judge is given a transcript of the debate and must declare a winner.
A new problem arises: the malicious player can exploit ambiguity to their advantage. For example, suppose the malicious player is arguing that there are 5 primes between 0 and 10, and makes the subclaims “there are 3 primes between 0 and 5” and “there are 2 primes between 5 and 10”. If challenged on the first, they will say “the three primes are 2, 3, and 5”; if challenged on the second, they will say “the two primes are 5 and 7”. While in this simple case the honest player could register a top-level objection claiming that the word “between” is ambiguous, and ask for clarification, in general it would be too hard to eliminate all ambiguity.
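To make the primes example concrete, here is a minimal Python sketch of how the two subclaims each look true in isolation but do not add up. The helper names (`is_prime`, `primes_between`) and the `include_lo` / `include_hi` flags are hypothetical, introduced only to make the two readings of “between” explicit.

```python
from math import isqrt

def is_prime(n):
    """Trial division; fine for the tiny numbers used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, isqrt(n) + 1))

def primes_between(lo, hi, include_lo, include_hi):
    """List primes in the interval, with each endpoint optionally included."""
    start = lo if include_lo else lo + 1
    stop = hi if include_hi else hi - 1
    return [n for n in range(start, stop + 1) if is_prime(n)]

# The honest answer: primes between 0 and 10.
truth = primes_between(0, 10, True, True)

# The malicious player's subclaims, each defended with a different reading:
# "between 0 and 5" includes 5, and "between 5 and 10" also includes 5.
first = primes_between(0, 5, True, True)
second = primes_between(5, 10, True, True)
claimed_total = len(first) + len(second)

# A single consistent reading (5 belongs only to the first interval).
consistent_total = (len(primes_between(0, 5, True, True))
                    + len(primes_between(5, 10, False, True)))

print(truth, first, second, claimed_total, consistent_total)
```

The inflated total comes entirely from counting 5 twice; once one reading of “between” is fixed, the subclaims sum to the true count.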
To mitigate this, the team added a method of _cross examination_, in which each player is allowed to ask an earlier version of the other agent a question, and quote its answer as part of the debate transcript. (The post talks about how this can be approximated with humans, but I won’t go into the details.) In the complexity theory analogy, this boosts debate from PSPACE to NEXP. Intuitively, cross examination allows players to force each other to precommit to specific claims and arguments, because if they don’t stay consistent in their claims and arguments, then their opponent can demonstrate the inconsistency. Now, if the malicious player tries to make an argument that depends on interpreting an ambiguous concept in two different ways, then the honest player can cross examine and ask which of the two interpretations they mean. If they are inconsistent, that can be demonstrated to the judge; if they consistently answer one way, then the honest player can challenge the part of the argument that depends on the other interpretation.
They then identify several open concerns with debate, of which they highlight the long computation problem. This is a problem when you no longer assume that the debaters have optimal play: in this case, the malicious player could create a complicated argument that neither debater understands well, that supports the malicious case but that the honest player doesn’t know how to refute.
I enjoyed this a lot: the problems found were crisp and the solutions had good arguments that they actually solved the identified problem. Reading through the actual examples and arguments made me more optimistic about debate in general, mostly from a felt sense that the actual concrete results were getting closer to matching the theoretical ideal, and that there actually could be reasonable solutions to “messy” problems like ambiguity.
The full post has formal explanations and actual examples, which I highly recommend.
We also rely on the property that, because the honest and dishonest debaters are copies of each other, they know everything the other knows.
I don’t see why being copies of each other implies that they know everything the other knows: they could (rationally) spend their computation on understanding the details of their position, rather than their opponent’s position.
For example, if I was playing against a copy of myself, where we were both given separate puzzles and had to solve them faster than the other, I would spend all my time focusing on my puzzle, and wouldn’t know the things that my copy knows about his puzzle (even if we have the same information, i.e. both of us can see both puzzles and even each other’s inner thought process).
It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”.
I still wouldn’t recommend working on those parts, because they seem decidedly less impactful than other options. But as written it does sound like I’m claiming that the agenda is totally useless for anything besides s-risks, which I certainly don’t believe. I’ve changed that second paragraph to:
However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn’t care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for _strategy_ research, though I am far from an expert here.
This agenda by the Effective Altruism Foundation focuses on risks of astronomical suffering (s-risks) posed by <@transformative AI@>(@Defining and Unpacking Transformative AI@) (TAI) and especially those related to conflicts between powerful AI agents. This is because there is a very clear path from extortion and executed threats against altruistic values to s-risks. While especially important in the context of s-risks, cooperation between AI systems is also relevant from a range of different viewpoints. The agenda covers four clusters of topics: strategy, credibility and bargaining, current AI frameworks, as well as decision theory.
The extent of cooperation failures is likely influenced by how power is distributed after the transition to TAI. At first glance, it seems like widely distributed scenarios (such as <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@)) are more problematic, but related literature from international relations paints a more complicated picture. The agenda seeks a better understanding of how the distribution of power affects catastrophic risk, as well as potential levers to influence this distribution. Other topics in the strategy/governance cluster include the identification and analysis of realistic scenarios for misalignment, as well as case studies on cooperation failures in humans and how they can be affected by policy.
TAI might enable unprecedented credibility, for example by being very transparent, which is crucial for both contracts and threats. The agenda aims at better models of the effects of credibility on cooperation failures. One approach to this is open-source game theory, where agents can see other agents’ source codes. Promising approaches to prevent catastrophic cooperation failures include the identification of peaceful bargaining mechanisms, as well as surrogate goals. The idea of surrogate goals is for an agent to commit to act as if it had a different goal, whenever it is threatened, in order to protect its actual goal from threats.
As some aspects of contemporary AI architectures might still be present in TAI, it can be useful to study cooperation failure in current systems. One concrete approach to enabling cooperation in social dilemmas that could be tested with contemporary systems is based on bargaining over policies combined with punishments for deviations. Relatedly, it is worth investigating whether or not multi-agent training leads to human-like bargaining by default. This has implications for the suitability of behavioural vs classical game theory to study TAI. The behavioural game theory of human-machine interactions might also be important, especially in human-in-the-loop scenarios of TAI.
The last cluster discusses the implications of bounded computation on decision theory as well as the decision theories (implicitly) used by current agent architectures. Another focus lies on acausal reasoning and in particular the possibility of acausal trade, where different correlated AIs cooperate without any causal links between them.
I am broadly sympathetic to the focus on preventing the worst outcomes and it seems plausible that extortion could play an important role in these, even though I worry more about distributional shift plus incorrigibility. Still, I am excited about the focus on cooperation, as this seems robustly useful for a wide range of scenarios and most value systems.
Under a suffering-focused ethics under which s-risks far overwhelm x-risks, I think it makes sense to focus on this agenda. There don’t seem to be many plausible paths to s-risks: by default, we shouldn’t expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to _suffer_, as opposed to not exist at all, and there doesn’t seem to be much reason to expect an immoral AI system. Conflict and the possibility of carrying out threats are the most plausible ways by which I could see this happening, and the agenda here focuses on neglected problems in this space.
However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other safety research to be more impactful, because the failure mode of an amoral AI system that doesn’t care about you seems both more likely and more amenable to technical safety approaches (to me at least).
The basic debate RL setup is meant to be unchanged here—when I say “the RL reward derived from winner” I mean that in the zero-sum debate game sense.
But the answers are generated from pieces that involve humans, and those humans don’t behave as though they are in a zero-sum game?
I suppose you could imagine that the human is just some function, and the models are producing answers that get mapped through the function before they get their zero-sum reward… but then the equilibrium behavior could very well be different. For example, if you’re advising a human on how to play rock-paper-scissors, but they have a bias against paper and when you tell them to play paper they have a 33% chance of playing rock instead, you should now have a 50% chance of recommending paper, 33% chance of scissors, and 17% chance for rock. So I’m not sure that the reasons for optimism for debate transfer over into this setting where you have a human in the mix.
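The rock-paper-scissors numbers can be checked with a short Python sketch. It assumes the “33% chance” is exactly 1/3 and that the goal is for the human’s realized play to be uniform (the usual equilibrium strategy); the function names are hypothetical.

```python
from fractions import Fraction

def realized_play(recommend, flip_prob):
    """Distribution the human actually plays, given the advice distribution.

    flip_prob is the chance the human plays rock when told to play paper.
    """
    return {
        "rock": recommend["rock"] + flip_prob * recommend["paper"],
        "paper": (1 - flip_prob) * recommend["paper"],
        "scissors": recommend["scissors"],
    }

def compensating_advice(flip_prob):
    """Advice distribution that makes realized play uniform.

    Only yields valid probabilities when flip_prob <= 1/2.
    """
    third = Fraction(1, 3)
    paper = third / (1 - flip_prob)    # over-recommend paper to offset the bias
    rock = third - flip_prob * paper   # rock already gets the redirected mass
    return {"rock": rock, "paper": paper, "scissors": third}

advice = compensating_advice(Fraction(1, 3))
played = realized_play(advice, Fraction(1, 3))
print(advice)  # paper 1/2, scissors 1/3, rock 1/6
print(played)  # uniform 1/3 each
```

So the advice that is optimal once it is mapped through the human differs from the equilibrium strategy of the underlying game, which is the worry here.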
Maybe you could make an argument that for any H who we trust enough to do amplification / debate in the first place, this isn’t a problem, since Amp(H,M) is supposed to be more capable than M. Alternatively you could say that at the very least M is such that Amp(H,M) gives true and useful arguments, though that might conflict with training M to imitate Amp(H,M) (as in the rock-paper-scissors example above).
When I imagine WFLL1 that doesn’t turn into WFLL2, I usually imagine a world in which all existing humans lead great lives, but don’t have much control over the future. On a moment-to-moment basis, that world is better than the current world, but we don’t get to influence the future and make use of the cosmic endowment, and so from a total view we have lost >99% of the potential value of the future.
I was uncertain about this, but it seems this is at least what Paul intended. From here, about WFLL1:
The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term.
I like the basic idea, but I don’t understand the details, so by default won’t include it in the newsletter. Some confusions:
Are the arguments the same thing as answers? (I get this impression because you say “Is arg_t a sufficient answer to Q in context S?”.) If not, where is the answer in the debate? More generally I would benefit a lot from a concrete example (e.g. the Bali vs. Alaska example).
Debate sets up a game and argues that the equilibrium is truth-telling. It does that by setting up a zero-sum game and then using self-play for training; self-play will converge to the Nash equilibrium, so you are then justified in only analyzing the equilibrium, while ignoring the training process. However, in your use of debate, afaict nothing enforces that you converge to the equilibrium of the zero-sum game, so I don’t see why you gain the benefits of debate.
Why do you want to add an auxiliary RL objective? I normally imagine two reasons. First, maybe the task you want to solve is well suited to RL, e.g. Atari games, and so you want to train on that RL objective in addition to the question answering objective, so that the RL objective lets you learn good representations quickly. Second, if your model M is unable to do perfect imitation, there must be errors, and in this case the imitation objective doesn’t necessarily incentivize the right thing, whereas the RL objective does. (See Against Mimicry.) I think yours is aiming at the second and not the first?
You made a claim a few comments above:
But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
I’m struggling to understand what you mean by “theory” here, and the programming example was trying to get at that, but not very successfully. So let’s take the sandwich example:
I made a sandwich this morning without building mathematical theory :)
Presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to “extrapolate” to figure out how to make a sandwich anyway. Why didn’t you need theory for that extrapolation?
Obviously this is a silly example, but I don’t currently see any qualitative difference between sandwich-making-extrapolation, and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment?
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
But, even just understanding whether we live in such a universe requires building a mathematical theory.
I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
The reasons the risk are great are standard arguments, so I am a little confused why you ask about this.
Sorry, I meant what are the reasons that the risk is greater than the risk from a failure of intent alignment? The question was meant to be compared to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.
Then, I don’t understand why you believe that work on anything other than intent-alignment is much less urgent?
I’m claiming that intent alignment captures a large proportion of possible failure modes, and that these failure modes seem particularly amenable to a solution.
Imagine that a fair coin was going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
(The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it’s not meant to properly model intent alignment vs. other stuff on other axes.)
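For anyone who wants to check the coin-flip comparison, here is a minimal Python sketch. It makes assumptions the example leaves implicit: machine A’s errors are independent across flips, you guess whichever side A reports as the majority, and if machine B isn’t built in time you are left guessing at 50%. The function names are hypothetical.

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def machine_a_accuracy(n_flips=21, error=0.01):
    """Chance that the majority reported by machine A matches the true majority."""
    correct = 0.0
    for heads in range(n_flips + 1):
        tails = n_flips - heads
        p_true = comb(n_flips, heads) / 2**n_flips
        truth = heads > tails
        # eh: heads misreported as tails; et: tails misreported as heads
        for eh in range(heads + 1):
            for et in range(tails + 1):
                reported_heads = heads - eh + et
                guess = reported_heads > n_flips - reported_heads
                if guess == truth:
                    correct += (p_true
                                * binom_pmf(heads, eh, error)
                                * binom_pmf(tails, et, error))
    return correct

def machine_b_accuracy(p_built=0.5):
    """Perfect answer if B gets built, otherwise a coin-flip guess."""
    return p_built * 1.0 + (1 - p_built) * 0.5

acc_a = machine_a_accuracy()
acc_b = machine_b_accuracy()
print(f"Machine A: {acc_a:.3f}  Machine B plan: {acc_b:.3f}")
```

Under these assumptions machine A is wrong mostly when the true split is 11 to 10 and a single error flips it, so it still comes out well ahead of the machine B plan.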
This is actually an important lesson about why we need theory: to construct a useful theoretical model you don’t need to know all possible failure modes, you only need a reasonable set of assumptions.
I certainly agree with that. My motivation in choosing this example is that empirically we should not be able to prove that bridges are safe w.r.t resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don’t know, but at least in the past that would not have been the case.)
In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties about a complex real environment, your two options are 1. don’t prove things or 2. prove things with unrealistic assumptions that don’t hold.
But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. [...] Without theory you cannot extrapolate.
I am not suggesting that we throw away all logic and make random edits to lines of code and try them out until we find a safe AI. I am simply saying that our things-that-allow-us-to-extrapolate need not be expressed in math with theorems. I don’t build mathematical theories of how to write code, and usually don’t prove my code correct; nonetheless I seem to extrapolate quite well to new coding problems.
It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim. (But I might be misreading you here.)
I disagree. For example, [...]
Certainly you can come up with bridging assumptions to bridge between levels of abstraction (in this case the assumption that “human thinking for a day” is within F). I would expect that I would find some bridging assumption implausible in these settings.