Say more about the relevance?
I think that’s not true. The point where you deal with wireheading probably isn’t what you reward so much as when you reward. If the agent doesn’t even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.
I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”—it doesn’t matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don’t change that much by introducing it at different stages in training.
Plausibly this is true of some training setups and not others; EG, more true for LLMs and less true for RL.
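One way to make the "order of evidence doesn't matter much" intuition concrete is a toy model that is exactly linear in its parameters. This is only a sketch under a strong assumption: that training stays in a regime where the network behaves like its linearization. The feature vectors, targets, learning rate, and function names below are all made up for illustration, and nothing here shows that real deep networks (or RL setups) behave this way.

```python
# Toy sketch: a model that is linear in its parameters, standing in for the
# "tangent space" (linearized) regime. All names and numbers are illustrative.

def predict(w, phi):
    # The gradient of this prediction w.r.t. w is just phi, whatever w is.
    return sum(wi * pi for wi, pi in zip(w, phi))

def sgd_step(w, phi, y, lr):
    # Gradient of 0.5 * (predict - y)^2 w.r.t. w is (predict - y) * phi.
    residual = predict(w, phi) - y
    return [wi - lr * residual * pi for wi, pi in zip(w, phi)]

def train(order, cycles=200, lr=0.3):
    # Cycle through the examples in the given order, starting from zero.
    w = [0.0, 0.0]
    for _ in range(cycles):
        for phi, y in order:
            w = sgd_step(w, phi, y, lr)
    return w

example_a = ([1.0, 0.0], 1.0)  # (features, target)
example_b = ([1.0, 1.0], 3.0)

w_ab = train([example_a, example_b])  # see A first, then B
w_ba = train([example_b, example_a])  # see B first, then A

print("A then B:", w_ab)
print("B then A:", w_ba)
```

In this toy both orderings converge to the same weights. Whether anything similar holds when the features themselves change during training is exactly the open question in the disagreement above.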
Let’s set aside the question of whether it’s true, though, and consider the point you’re making.
This isn’t a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).
So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn’t necessitate thinking in terms of incentives (ie gradients pushing in particular directions). The ultimate influence of a gradient isn’t necessarily the thing it immediately pushes for/against.
I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.
But I don’t disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is “sophisticated enough” to be doing training-process-simulation.
I’m having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that’s not that useful.
And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point? (I know you noted its fuzziness, so maybe you’re already sympathetic to responses like the one I just gave?)
I’m confused about what you mean here. My best interpretation is that you don’t think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.
Maybe, on your understanding, the actual reason why current RL systems don’t wirehead too much is training-order effects? I would be surprised to come around on this point. I don’t see it.
I expect this argument to not hold,
Seems like the most significant remaining disagreement (perhaps).
1. Gradient updates are pointed in the direction of most rapid loss-improvement per unit step. I expect most of the “distance covered” to be in non-training-process-modeling directions for simplicity reasons (I understand this argument to be a predecessor of the NTK arguments.)
So I am interpreting this argument as: even if LTH implies that a nascent/potential hypothesis is training-process-modeling (in an NTK & LTH sense), you expect the gradient to go against it (favoring non-training-modeling hypotheses) because non-training-process-modelers are simpler.
This is a crux for me; if we had a simplicity metric that we had good reason to believe filtered out training-process-modeling, I would see the deceptive-inner-optimizer concern as basically solved (modulo the solution being compatible with other things we want).
I think Solomonoff-style program simplicity probably doesn’t do it; the simplest program fitting with a bunch of data from our universe quite plausibly models our universe.
I think circuit-simplicity doesn’t do it; simple circuits which perform complex tasks are still more like algorithms than lookup tables, ie, still try to model the world in a pretty deep way.
I think Vanessa has some interesting ideas on how infra-Bayesian physicalism might help deal with inner optimizers, but on my limited understanding, I think not by ruling out training-process-modeling.
In other words, it seems to me like a tough argument to make, which on my understanding, no one has been able to make so far, despite trying; but, not an obviously wrong direction.
2. You’re always going to have identifiability issues with respect to the loss signal. This could mean that either: (a) the argument is wrong, or (b) training-process-optimization is unavoidable, or (c) we can somehow make it not apply to networks of AGI size.
I don’t really see your argument here? How does (identifiability issues → (argument is wrong ∨ training-process-optimization is unavoidable ∨ we can somehow make it not apply to networks of AGI size))?
In my personal estimation, shaping NNs in the right way is going to require loss functions which open up the black box of the NN, rather than only looking at outputs. In principle this could eliminate identifiability problems entirely (eg “here is the one correct network”), although I do not fully expect that.
A ‘good prior’ would also solve the identifiability problem well enough. (eg, if we could be confident that a prior promotes non-deceptive hypotheses over similar deceptive hypotheses.)
But none of this is necessarily interfacing with your intended argument.
3. Even if the agent is motivated both by the training process and by the object-level desired hypothesis (since the gradients would reinforce both directions), on shard theory, that’s OK, an agent can value several things. The important part is that the desired shards cast shadows into both the agent’s immediate behavior and its reflectively stable (implicit) utility function.
Here’s how I think of this part. A naïve EU-maximizing agent, uncertain between two hypotheses about what’s valuable, might easily decide to throw one under the bus for the other. Wireheading is analogous to a utility monster here—something that the agent is, on balance, justified to throw approximately all its resources at, basically neglecting everything else.
A bargaining-based agent, on the other hand, can “value several things” in a more significant sense. Simple example:
U1 and U2 are almost equally probable hypotheses about what to value.
EU maximization maximizes whichever happens to be slightly more probable.
Nash bargaining selects a 50-50 split between the two, instead, flipping a coin to fairly divide outcomes.
In order to mitigate risks due to bad hypotheses, we want more “bargaining-like” behavior, rather than “EU-like” behavior.
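The contrast in the simple example above can be checked numerically. The sketch below is illustrative only: the credences 0.51 and 0.49, the two pure outcomes (one worth 1 under U1 and 0 under U2, the other the reverse), and the disagreement point (0, 0) are all assumptions added here, not part of the original example. It uses the plain (unweighted) Nash product.

```python
# Toy comparison of EU maximization and Nash bargaining over a lottery.
# q is the probability of picking the outcome that U1 likes (U1 = 1, U2 = 0);
# with probability 1 - q we pick the outcome that U2 likes (U1 = 0, U2 = 1).

P_U1, P_U2 = 0.51, 0.49  # assumed credences in the two value hypotheses

def expected_utility(q):
    # Credence-weighted sum of the two hypotheses' payoffs.
    return P_U1 * q + P_U2 * (1 - q)

def nash_product(q, disagreement=(0.0, 0.0)):
    # Product of each hypothesis's gain over the assumed disagreement point.
    d1, d2 = disagreement
    return (q - d1) * ((1 - q) - d2)

grid = [i / 100 for i in range(101)]
q_eu = max(grid, key=expected_utility)
q_nash = max(grid, key=nash_product)

print("EU-maximizing q:", q_eu)
print("Nash-bargaining q:", q_nash)
```

On this grid, EU maximization puts all the weight on the slightly more probable hypothesis (q = 1.0), while the Nash product is maximized at the even split (q = 0.5).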
I buy that bargaining-like behavior fits better flavor-wise with shard theory, but I don’t currently buy that an implication of shard theory is that deep-NN RL will display bargaining-like behavior by default, if that’s part of your intended implication?
We were discussing wireheading, not inner optimization, but a wireheading agent that hides this in order to do a treacherous turn later is a deceptive inner optimizer. I’m not going to defend the inner/outer distinction here; “is wireheading an inner alignment problem, or an outer alignment problem?” is a problematic question.
My main complaint with this, as I understand it, is that builder/breaker encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space (like worst-case robustness hopes, in my opinion). And then you can be totally out-of-touch from the reality of the problem.
On my understanding, the thing to do is something like heuristic search, where “expanding a node” means examining that possibility in more detail. The builder/breaker scheme helps to map out heuristic guesses about the value of different segments of the territory, and refine them to the point of certainty.
So when you say “encourages you to repeatedly condition on speculative dangers until you’re exploring a tiny and contorted part of solution-space”, my first thought is that you missed the part where Builder can respond to Breaker’s purported counterexamples with arguments such as the ones you suggest:
I currently conjecture that [...]
Does this argument fail? Maybe, yeah! Should I keep that in mind? Yes! But that doesn’t necessarily mean I should come up with an extremely complicated scheme to make feedback-modeling be suboptimal.
But, perhaps more plausibly, you didn’t miss that point, and are instead pointing to a bias you see in the reasoning process, a tendency to over-weigh counterexamples as if they were knockdown arguments, and forget to do the heuristic search thing where you go back and expand previously-unpromising-seeming nodes if you seem to be getting stuck in other places in the tree.
I’m tempted to conjecture that you should debug this as a flaw in how I apply builder/breaker style reasoning, as opposed to the reasoning scheme itself—why should builder/breaker be biased in this way?
You seem to address a related point:
One might therefore protest: “Worst-case reasoning is not suitable for deconfusion work! We need a solid understanding of what’s going on, before we can do robust engineering.”
However, it’s also possible to use Builder/Breaker in non-worst-case (ie, probabilistic) reasoning. It’s just a matter of what kind of conclusion Builder tries to argue. If Builder argues a probabilistic conclusion, Builder will have to make probabilistic arguments.
But then you later say:
Point out implausible assumptions via plausible counterexamples.
In this case we ask: does the plausibility of the counterexample force the assumption to be less probable than we’d like our precious few assumptions to be?
But if we’re still in the process of deconfusing the problem, this seems to conflate the two roles. If game day were tomorrow and we had to propose a specific scheme, then we should indeed tally the probabilities.
I admit that I do not yet understand your critique at all—what is being conflated?
Here is how I see it, in some detail, in the hopes that I might explicitly write down the mistaken reasoning step which you object to, in the world where there is such a step.
1. We have our current beliefs, and we can also refine those beliefs over time through observation and argument.
2. Sometimes it is appropriate to “go with your gut”, choosing the highest-expectation plan based on your current guesses. Sometimes it is appropriate to wait until you have a very well-argued plan, with very well-argued probabilities, which you don’t expect to easily move with a few observations or arguments. Sometimes something in the middle is appropriate.
3. AI safety is in the “be highly rigorous” category. This is mostly because we can easily imagine failure being so extreme that humanity in fact only gets one shot at this.
4. When the final goal is to put together such an argument, it makes a lot of sense to have a sub-process which illustrates holes in your reasoning by pointing out counterexamples. It makes a lot of sense to keep a (growing) list of counterexample types.
5. It being virtually impossible to achieve certainty that we’ll avert catastrophe, our arguments will necessarily include probabilistic assumptions and probabilistic arguments.
6. #5 does not imply, or excuse, heuristic informality in the final arguments; we want the final arguments to be well-specified, so that we know precisely what we have to assume and precisely what we get out of it.
7. #5 does, however, mean that we have an interest in plausible counterexamples, not just absolute worst-case reasoning. If I say (as Builder) “one of the coin-flips will come out heads”, as part of an informal-but-working-towards-formality argument, and Breaker says “counterexample, they all come out tails”, then the right thing to do is to assess the probability. If we’re flipping 10 coins, maybe Breaker’s counterexample is common enough to be unacceptably worrying, damning the specific proposal Builder was working on. If we’re flipping billions of coins, maybe Breaker’s counterexample is not probable enough to be worrying.
8. This is the meaning of my comment about pointing out insufficiently plausible assumptions via plausible counterexamples, which you quote after “But then later you say:”, and of which you state that I seem to conflate two roles.
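To put rough numbers on the coin-flip example in the list above, here is a minimal sketch assuming fair, independent coins (an assumption; the example does not specify this). The function name is made up for illustration, and logarithms are used because the billion-coin probability underflows ordinary floating point.

```python
import math

def log10_prob_all_tails(n_coins, p_tails=0.5):
    # log10 of P(all n independent flips come up tails); logs avoid underflow.
    return n_coins * math.log10(p_tails)

for n_coins in (10, 1_000_000_000):
    print(n_coins, "coins: log10 P(all tails) =", log10_prob_all_tails(n_coins))
```

With 10 coins the all-tails counterexample has probability 1/1024, roughly one in a thousand, which could easily be too likely for a safety argument. With a billion coins the exponent is on the order of minus three hundred million, so the same counterexample is negligible.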
But if we’re assessing the promise of a given approach for which we can gather more information, then we don’t have to assume our current uncertainty. Like with the above, I think we can do empirical work today to substantially narrow the uncertainty on that kind of question. That is, if our current uncertainty is large and reducible (like in my diamond-alignment story), breaker might push me to prematurely and inappropriately condition on not-that-proposal and start exploring maybe-weird, maybe-doomed parts of the solution space as I contort myself around the counterarguments.
I guess maybe your whole point is that the builder/breaker game focuses on constructing arguments, while in fact we can resolve some of our uncertainty through empirical means.
On my understanding, if Breaker uncovers an assumption which can be empirically tested, Builder’s next move in the game can be to go test that thing.
However, I admit to having a bias against empirical stuff like that, because I don’t especially see how to generalize observations made today to the highly capable systems of the future with high confidence.
WRT your example, I intuit that perhaps our disagreement has to do with …
I currently conjecture that an initialization from IID self-supervised- and imitation-learning data will not be modelling its own training process in detail,
I think it’s pretty sane to conjecture this for smaller-scale networks, but at some point as the NN gets large enough, the random subnetworks already instantiate the undesired hypothesis (along with the desired one), so they must be differentiated via learning (ie, “incentives”, ie, gradients which actually specifically point in the desired direction and away from the undesired direction).
I think this is a pretty general pattern: a lot of your beliefs fit with a picture where there’s a continuous (and relatively homogeneous) blob in mind-space connecting humans, current ML, and future highly capable systems. A lot of my caution stems from being unwilling to assume this, and skeptical that we can resolve the uncertainty there by empirical means. It’s hard to empirically figure out whether the landscape looks similar or very different over the next hill, by only checking things on this side of the hill.
Ideally, nothing at all; ie, don’t create powerful AGI, if that’s an option. This is usually the correct answer in similar cases. EG, if you (with no training in bridge design) have to deliver a bridge design that won’t fall over, drawing up blueprints in one day’s time, your best option is probably to not deliver any design. But of course we can arrange the thought-experiment such that it’s not an option.
The questions there would be more like “what sequence of reward events will reinforce the desired shards of value within the AI?” and not “how do we philosophically do some fancy framework so that the agent doesn’t end up hacking its sensors or maximizing the quotation of our values?”.
I think that it generally seems like a good idea to have solid theories of two different things:
1. What is the thing we are hoping to teach the AI?
2. What is the training story by which we mean to teach it?
I read your above paragraph as maligning (1) in favor of (2). In order to reinforce the desired shards, it seems helpful to have some idea of what those look like.
For example, if we avoid fancy philosophical frameworks, we might think a good way to avoid wireheading is to introduce negative examples where the AI manipulates circuitry to boost reinforcement signals, and positive examples where the AI doesn’t do that when given the opportunity. After doing some philosophy where you try to positively specify what you’re trying to train, it’s easier to notice that this sort of training still leaves the human-manipulation failure mode open.
After doing this kind of philosophy for a while, it’s intuitive to form the more general prediction that if you haven’t been able to write down a formal model of the kind of thing you’re trying to teach, there are probably easy failure modes like this which your training hasn’t attempted to rule out at all.
The basic idea behind compressed pointers is that you can have the abstract goal of cooperating with humans, without actually knowing very much about humans. [...] In machine-learning terms, this is the question of how to specify a loss function for the purpose of learning human values.
In machine-learning terms, this is the question of how to train an AI whose internal cognition reliably unfolds into caring about people, in whatever form that takes in the AI’s learned ontology (whether or not it has a concept for people).
Thinking about this now, I think maybe it’s a question of precautions, and what order you want to teach things in. Very similarly to the argument that you might want to make a system corrigible first, before ensuring that it has other good properties—because if you make a mistake, later, a corrigible system will let you correct the mistake.
Similarly, it seems like a sensible early goal could be ‘get the system to understand that the sort of thing it is trying to do, in (value) learning, is to pick up human values’. Because once it has understood this point correctly, it is harder for things to go wrong later on, and the system may even be able to do much of the heavy lifting for you.
Really, what makes me go to the meta-level like this is pessimism about the more direct approach. Directly trying to instill human values, rather than first training in a meta-level understanding of that task, doesn’t seem like a very correctible approach. (I think much of this pessimism comes from mentally visualizing humans arguing about what object-level values to try to teach an AI. Even if the humans are able to agree, I do not feel especially optimistic about their choices, even if they’re supposedly informed by neuroscience and not just moral philosophers.)
If you commit to the specific view of outer/inner alignment, then now you also want your loss function to “represent” that goal in some way.
I think it is reasonable as engineering practice to try and make a fully classically-Bayesian model of what we think we know about the necessary inductive biases—or, perhaps more realistically, a model which only violates classic Bayesian definitions where necessary in order to represent what we want to represent.
This is because writing down the desired inductive biases as an explicit prior can help us to understand what’s going on better.
It’s tempting to say that to understand how the brain learns, is to understand how it treats feedback as evidence, and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, the learning works because the feedback is evidence about the thing we want to learn, and the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.
And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology. Similarly, I think “utility function” is not a very good way to think about values, because it is unclear what it is a function of: we don’t have a commitment to a specific low-level description of the universe which is appropriate as the input to a utility function. We can easily move beyond this by considering expected values as the “values/preferences” representation, without worrying about what underlying utility function generates those expected values.
(I do not take the above to be a knockdown argument against “committing to the specific division between outer and inner alignment steers you wrong”—I’m just saying things that seem true to me and plausibly relevant to the debate.)
I doubt this due to learning from scratch.
I expect you’ll say I’m missing something, but to me, this sounds like a language dispute. My understanding of your recent thinking holds that the important goal is to understand how human learning reliably results in human values. The Bayesian perspective on this is “figuring out the human prior”, because a prior is just a way-to-learn. You might object to the overly Bayesian framing of that; but I’m fine with that. I am not dogmatic about orthodox Bayesianism. I do not even like utility functions.
Insofar as the question makes sense, its answer probably takes the form of inductive biases: I might learn to predict the world via self-supervised learning and form concepts around other people having values and emotional states due to that being a simple convergent abstraction relatively pinned down by my training process, architecture, and data over my life, also reusing my self-modelling abstractions.
I am totally fine with saying “inductive biases” instead of “prior”; I think it indeed pins down what I meant in a more accurate way (by virtue of, in itself, being a more vague and imprecise concept than “prior”).
I think that both the easy and hard problem of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is—or should be—the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent’s own cognition. I don’t think you need anything mysterious for the latter. I’m confident that RLHF, done skillfully, does the job just fine. The questions there would be more like “what sequence of reward events will reinforce the desired shards of value within the AI?” and not “how do we philosophically do some fancy framework so that the agent doesn’t end up hacking its sensors or maximizing the quotation of our values?”.
I think I don’t understand what you mean by (2), and as a consequence, don’t understand the rest of this paragraph?
WRT (1), I don’t think I was being careful about the distinction in this post, but I do think the following:
The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It’s a false explanation of why wireheading is a concern.
The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (ie, provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading, because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).
This doesn’t seem relevant for non-AIXI RL agents which don’t end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?
With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.
Output-based evaluation (including supervised learning, the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can’t distinguish between a model which is internalizing the desired concepts and a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.
In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won’t, unless of course they’re just not very good at their jobs.
So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all?
Plausibly, this should only be tackled as a knock-on effect of the real problem, actually giving good feedback which points in the right direction; however, it remains a powerful counterexample class which challenges many many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)
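The point about output-based evaluation can be illustrated with a deliberately tiny sketch. Everything in it is hypothetical: the "diamonds" and "sensor" fields, the two candidate models, and the squared-error loss are stand-ins chosen for illustration, not a description of any real training setup.

```python
# Toy sketch: two candidate models that output-based feedback cannot separate.
# "diamonds" is the quantity we care about; "sensor" is what feedback measures.

def feedback(state):
    # The actual feedback process only ever sees the sensor reading.
    return state["sensor"]

def value_model(state):
    # Internalizes the intended concept.
    return state["diamonds"]

def feedback_model(state):
    # Models the feedback process itself.
    return state["sensor"]

def total_loss(model, states):
    # Squared error against the feedback actually delivered.
    return sum((model(s) - feedback(s)) ** 2 for s in states)

# Ordinary training states: the sensor faithfully tracks diamonds.
train_states = [{"diamonds": d, "sensor": d} for d in range(5)]
# A hypothetical tampered state: the sensor is spoofed.
tampered_state = {"diamonds": 0, "sensor": 100}

loss_value_train = total_loss(value_model, train_states)
loss_feedback_train = total_loss(feedback_model, train_states)
loss_value_tampered = total_loss(value_model, [tampered_state])
loss_feedback_tampered = total_loss(feedback_model, [tampered_state])

print("train losses:", loss_value_train, loss_feedback_train)
print("tampered losses:", loss_value_tampered, loss_feedback_tampered)
```

On the ordinary states the two models get identical loss, so the feedback gives no gradient between them. On the tampered state the feedback-modeling candidate is the one that scores better, which is the sense in which it "will always do at least as well" by the feedback's own lights.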
I’m a bit uncomfortable with the “extreme adversarial threats aren’t credible; players are only considering them because they know you’ll capitulate” line of reasoning because it is a very updateful line of reasoning. It makes perfect sense for UDT and functional decision theory to reason in this way.
I find the chicken example somewhat compelling, but I can also easily give the “UDT / FDT retort”: since agents are free to choose their policy however they like, one of their options should absolutely be to just go straight. And arguably, the agent should choose that, conditional on bargaining breaking down (precisely because this choice maximizes the utility obtained in fact—ie, the only sort of reasoning which moves UDT/FDT). Therefore, the coco line of reasoning isn’t relying on an absurd hypothetical.
Another argument for this perspective: if we set the disagreement point via Nash equilibrium, then the agents have an extra incentive to change their preferences before bargaining, so that the Nash equilibrium is closer to the optimal disagreement point (IE the competition point from coco). This isn’t a very strong argument, however, because (as far as I know) the whole scheme doesn’t incentivize honest reporting in any case. So agents may be incentivized to modify their preferences one way or another.
One simple idea: the disagreement point should reflect whatever really happens when bargaining breaks down. This helps ensure that players are happy to use the coco equilibrium instead of something else, in cases where “something else” implies the breakdown of negotiations. (Because the coco point is always a pareto-improvement over the disagreement point, if possible—so choosing a realistic disagreement point helps ensure that the coco point is realistically an improvement over alternatives.)
However, in reality, the outcomes of conflicts we avoid remain unknown. The realistic disagreement point is difficult to define or measure if, in reality, agreement is achieved.
So perhaps we should suppose that agreement cannot always be reached, and base our disagreement point on the observed consequences of bargaining failure.
There are two questions to ask:
1. How does the AI learn to care about this?
2. What do we gain by making the AI care about this?
If we don’t discuss 100% answers, it’s very important to evaluate all those questions in context of each other. I don’t know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
I agree with the overall argument structure to some extent. IE, in general, we should separate the question of what we gain from X from the question of how to achieve it, and not having answered one of those questions should not block us from considering the other.
However, to me, your “what do we gain” claims are already established to be quite large. In the dialogues (about candy and movement), it seems like the idea is that everything works out nicely, in full generality. You aren’t just claiming a few good properties; you seem to be saying “and so on”.
(To be more specific to avoid confusion, you aren’t only claiming that valuing candy doesn’t result in killing humans or hacking human values. You also seem to be saying that valuing candy in this way wouldn’t throw away any important aspect of human values at all. The candy-AI wouldn’t set human quality of life to dirt-poor levels, even if it were instrumentally useful for diverting resources to ensure the daily availability of candy. The AI also wouldn’t allow a preventable hostile invasion by candy-loving aliens-which-count-as-humans-by-some-warped-definition. etc etc etc)
Therefore, in this particular case, I have relatively little interest in further elaborating the “what do we gain” side of things. The “how are we supposed to gain it” question seems much more urgent and worthy of discussion.
To use an analogy, if you told me that you knew a quick way to make $20, I might ask “why are we so worried about getting $20?”. But if you tell me you know a quick way to make a billion dollars, I’m going to be much less interested in the “why” question and much more interested in the “how” question.
I don’t know the (full) answer to the question (1). But I know the answer to (2) and a way to connect it to (1). And I believe this connection makes it easier to figure out (1).
TBH, I don’t really believe this is true, because I don’t think you’ve pinned down what “this” even is. IE, we can expand your set of two questions into three:
How do we get X?
What is X good for?
What is X, even?
You’ve labeled X with terms like “reward economics” and “money system”, but you haven’t really defined those things. So your arguments about what we can gain from them are necessarily vague. As I mentioned before, the general idea of assigning a value (price) to everything is fully compatible with utility theory, but obviously you also further claim that your approach is not identical to utility theory. I hope this point helps illustrate why I feel your terms are still not sufficiently defined.
(My earlier question took the form of “how do we get X”, but really, that’s because I was replying to a specific point rather than starting at the beginning. What I most need to understand better at the moment is ‘what is X, even?’.)
The point of my idea is that “human (meta-)ethics” is just a subset of a way broader topic. You can learn a lot about human ethics and the way humans expect you to fulfill their wishes before you encounter any humans or start to think about “values”. So, we can replace the questions “how to encode human values?” and even “how to learn human values?” with more general questions “how to learn (properties of systems)?” and “how to translate knowledge about (properties of systems) to knowledge about human values?”
We have already to some extent replaced the question “how do you learn human values?” with the question “how do we robustly point at anything external to the system, at all?”. One variation of this which we often consider is “how can a system reliably parse reality into objects”—this is like John Wentworth’s natural abstraction program.
I don’t know whether you think this is at all in the right direction (I’m not trying to claim it’s identical to your approach or anything like that), but it currently seems to me more concrete and well-defined than your “how to learn properties of systems”.
with more general questions “how to learn (properties of systems)?”
The way you bracket this suggests to me that you think “how to learn” is already a fair summary, and “properties of systems” is actually pointing at something extremely general. Like, maybe “properties of systems” is really a phrase that encompasses everything you can learn?
If this were the correct interpretation of your words, then my response would be: I’m not going to claim that we’ve entirely mastered learning, but it seems surprising to claim that studying how we learn about the properties of very simple systems (systems that we can already learn quite easily using modern ML?) would be the key.
In your proposal about normativity you do a similar “trick”
I say that we can translate the method of learning properties of simple systems into a method of learning human values (a complicated system).
Since you are relating this to my approach: I would say that the critical difference, for me, is precisely the human involvement (or more generally, the involvement of many capable agents). This creates social equilibria (and non-equilibrium behaviors) which form the core of normativity.
An abstract decision-theoretic agent has no norms and no need for norms, in part because it treats its environment as nonliving, nonthinking, and entirely external. A single person existing over time already has a need for norms, because coordinating with yourself over time is hard.
But any system which contains agents is not “simple”. Or at least, I don’t understand the sense in which it is simple.
I think it’s a different approach, because we don’t have to start with human values (we could start with trying to fix universal AI “bugs”) and we don’t have to assume optimization.
I don’t understand what you mean about not assuming optimization. But, I would object that the approach I mentioned (learning values from the environment) doesn’t need to “start with human values” either. Hypothetically, you could try an approach like this with no preconceived concept of “human” at all; you just make a generic assumption that the environments you encounter have been optimized to a significant extent (by some as-yet-unknown actor).
Notably, this approach would have the obvious risk of the AI deciding that too many of the properties of the current world are “good” (for example, people dying, people suffering). On my understanding, your current proposal also suffers from this critique. (You make lots of arguments about how your ideas might help the AI to decide not to change things about the world; you make few-to-no arguments about such an AI deciding to actually improve the world in some way. Well, on my understanding so far.)
However, not killing all humans is such a big win that we can ignore small issues like that for now. Returning to my earlier analogy, the first question that occurs to me is where the billion dollars is coming from, not whether the billion will be enough.
I explained how I want to combine those in the context of “What do we gain by caring about system properties?” question.
In the context you’re replying to, I was trying to propose more concrete ideas for your consideration, as opposed to reiterating what you said.
Here I’m trying to do the same trick I did before: split a question, find the easier part, attack the harder part through the easier one.
Although this will be appropriate (even necessary!) in some cases, the trick is a dangerous one in general. Often you want to tackle the harder sub-problems first, so that you fail as soon as possible. Otherwise, you can spend years on a research program that splits off the easiest fractions of your grand plan, only to realize later that the harder parts of your plan were secretly impossible. So the strategy sets you up to potentially waste a lot of time!
Maybe it’s useful to split the knowledge about systems into 3 parts:
Absolute knowledge: e.g. “taking absolute control of the system will destroy its (X) property”, “destroying the (X) property of the system may be bad”. This knowledge connects abstract actions to simple facts and tautologies.
Experience of many systems: e.g. “destroying the (X) property of this system is likely to be bad because it’s bad for many other systems” or “destroying (X) is likely to be bad because I’m 90% sure human doesn’t ask me to do the type of task where destroying (X) is allowed”.
Biases of a specific system: e.g. “for this specific system, “absolute control” means controlling about 90% of it”. This knowledge maps abstract actions/facts onto the structure of a specific system.
I don’t really understand the motivation behind this division, but, it sounds to me like you require normative feedback to learn these types of things. You keep saying things like “is likely to be bad” and “is likely to be good”. But it’s difficult to see how to derive ideas about “bad” and “good” from pure observation with no positive/negative feedback.
Take a system (e.g. “movement of people”). Model simplified versions of this system on multiple levels (e.g. “movement of groups” and “movement of individuals”). Take a property of the system (e.g. “freedom of movement”). Describe a biased aggregation of this property on different levels. Choose actions that don’t violate this aggregation.
I don’t understand much of what is going on in this paragraph.
Take an element of the system (e.g. “sweets”) and its properties (e.g. “you can eat sweets, destroy sweets, ignore sweets...”). Describe other elements in terms of this element. Choose actions that don’t contradict this description.
It sounds to me like you are trying to cross the is/ought divide: first the AI learns descriptive facts about a system, and then the AI is supposed to derive normative principles (action-choice principles) from those descriptive facts. Is that an accurate assessment?
One concern I have is that if the description is accurate enough, then it seems like it should either (a) not constrain action, because you’ve learned the true invariant properties of the system which can never be violated (eg, the true laws of physics); or, on the other hand, (b) constrain action for the entirely wrong reasons.
An example of (b) would be if the learning algorithm learns enough to fully constrain actions, based on patterns in the AI actions so far. Since the AI is part of any system it is interacting with, it’s difficult to rule out the AI learning its own patterns of action. But it may do this early, based on dumb patterns of action. Furthermore, it may misgeneralize the actions so far, “wrongly” thinking that it takes actions based on some alien decision procedure. Such a hypothesis will never be ruled out in the future, and indeed is liable to be confirmed, since the AI will make its future acts conform to the rules as it understands them.
AI models the system (“coins”) on two levels: “a single coin” (level 1) and “multiple coins” (level 2).
I don’t really understand what it means to model the system on each of these levels, which harms my understanding of the rest of this argument. (“How can you model the system as a single coin?”)
My attempt to translate things into terms I can understand is: the AI has many hypotheses about what is good. Some of these hypotheses would encourage the AI to exploit glitches. However, human feedback about what’s good has steered the system away from some glitch-exploits in the past. The AI probabilistically generalizes this idea, to avoid exploiting behaviors of the system which seem “glitch-like” according to its understanding.
But, this interpretation seems to be a straightforward value-learning approach, while you claim to be pointing at something beyond simple value learning ideas.
After finishing this long comment, I noticed the inconsistency: I continue to ask “how do we get X?” type questions rather than “what is X?” type questions. In retrospect, I don’t like my “billion dollars” analogy as much as I did when I first wrote it. Part of the problem is that when “X” is still fuzzy, it can shift locations in the causal chain as we focus on different aspects of the conversation. So for example, X could point to the “money system”, or X could end up pointing to some desirable properties which are upstream/downstream of “money systems”. But as X shifts up/downstream, there are some Y which switch between “how-relevant” and “why-relevant”. (Things that are upstream of X are how-relevant; things that are downstream of X are why-relevant.) So it doesn’t make sense for me to keep mentioning that I’m more interested in how-questions than why-questions, when I’m not sure exactly where the definition of X will sit in the causal chain. I should, at best, have some other reasons for not being very interested in certain questions. But I don’t want to re-write the relevant portions of what I wrote. It still represents my epistemic state better than not having written it.
The images in this classic reference post have gone missing! :(
This is just my intuition, but it seems like the core intuition of a “money system” as you use it in the post is the same as the core intuition behind utility functions (ie, everything must have a price ≈ everything must have a quantifiable utility).
I think we can try to solve AI Alignment this way:
Model human values and objects in the world as a “money system” (a system of meaningful trades). Make the AGI learn the correct “money system”, specify some obviously incorrect “money systems”.
Basically, you ask the AI “make paperclips that have the value of paperclips for humans”. AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can’t be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren’t worth anything. So you haven’t actually gained any money at all.
In utility-theoretic terms, this is like saying that money is an instrumental goal, not a terminal goal. Or at least, money as-terminal-goal has a low weight compared to other things (eg, human lives). Or perhaps more faithful to what you want: money as-terminal-goal is dependent on a context.
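To make the last two readings concrete, here is a minimal sketch. Everything in it is my own illustration rather than anything from the OP: the `World` fields, the two weights, and the function names are all hypothetical. It compares “money as a terminal goal with a low weight” against “money as a terminal goal that depends on context” on the kill-everyone-for-dollars plan.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class World:
    humans: int     # number of people alive
    dollars: float  # money held by the agent

LIFE_WEIGHT = 1e6    # hypothetical utility per human life
DOLLAR_WEIGHT = 1.0  # hypothetical utility per dollar

def low_weight_terminal(w: World) -> float:
    """Money is a terminal goal, just weighted far below human lives."""
    return LIFE_WEIGHT * w.humans + DOLLAR_WEIGHT * w.dollars

def context_dependent_terminal(w: World) -> float:
    """Money only counts in a context where there is someone to trade with."""
    money_term = DOLLAR_WEIGHT * w.dollars if w.humans > 0 else 0.0
    return LIFE_WEIGHT * w.humans + money_term

status_quo = World(humans=1000, dollars=0.0)
catastrophe = World(humans=0, dollars=1e12)  # "kill everyone to create dollars"

for name, utility in [("low-weight", low_weight_terminal),
                      ("context-dependent", context_dependent_terminal)]:
    better = "status quo" if utility(status_quo) > utility(catastrophe) else "catastrophe"
    print(f"{name}: prefers {better}")
```

With these made-up numbers, the low-weight version can still be outbid by a large enough pile of dollars, while the context-dependent version assigns the dollars no value once nobody is left. The sketch says nothing about how a system would come to have the second utility function rather than the first.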
So it seems to me like this still faces the same basic challenges as most other approaches, IE, making the system robustly care about external objects which we can’t get perfect feedback about. How do you get it to care about the context? How do you get it to think killing humans is “expensive”? How do you ask the system to “make paperclips that have the value of paperclips for humans”?
I meant that some AIs need to start with understanding human values (perfectly) and others don’t.
It seems like any proponent of #2 (human feedback, aka, value learning) would already agree with this idea; whereas your post gave me the sense that you think something more radical is here.
Reiterating the quote from the OP that I quoted before:
The point is that AI doesn’t just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn’t use this solution. That’s the idea: AI can realize that some rewards are unjust because they break the entire reward system.
My best guess about how you want to combine #1 and #2 with #3 is that you want to infer the proper value of things from the environment. EG, if most gold sits around in vaults, then the value of gold is probably tied to sitting around in vaults.
I remember some work a few years ago on this approach: specifically, using the built environment of humans (together with an assumption that humans are fairly good at optimizing for their own preferences) to infer human values. Sadly, I’m unable to find a reference; maybe it was never published? (Probably I’ve just forgotten the relevant keywords to search for.)
The distinction between instrumental goals vs “terminal goals that depend on some context” is rather blurry, because the way we distinguish between terminal and instrumental goals (from the outside, behaviorally) is how much they vary based on context. (EG, if I take away the other basketball players, the audience, and the money, will one basketball player still try to perform a slam dunk?)
One reason for abandoning utility functions is, perhaps, an instinct that everything must be instrumental, because nothing is truly terminal. I discussed how to do this while keeping most of expected utility theory in An Orthodox Case Against Utility Functions.
Another good thing is that all of this isn’t directly connected to human values, so you don’t have to encode “absolute understanding of human values” in the AI.
I don’t get this part, at all. (But I didn’t understand the purpose/implications of most parts of the OP.)
Why doesn’t the AI have to understand human values, in your proposal?
In the OP, you state:
From the rest of your post, it seems clear that “proper value” means something like “value to humans”. So it sure seems to me like the AI needs to understand human values in order to implement this kind of check.
I don’t think my first Bayesian critique is “nine nines is too many”; there are physical problems with too much Bayesian confidence (eg “my brain isn’t reliable enough that I should really ever be that confident”), but the simple math of Bayesian probability admits the possibility of nine nines just like anyone else.
I think my first critique is the false dichotomy between the null hypothesis and the hypothesis being tested.
Speaking for the frequentist, you say:
If you roll the die nine times and get nine 10s then you can say that the die is weighted with confidence 0.999999999.
I don’t think this is what a real frequentist says. I think a real frequentist says something more like: the null hypothesis (the fair-dice hypothesis) can be rejected with that confidence. Specifically, the number you are giving is 1 - P(evidence|fair). This is not the numerical confidence in the new hypothesis! Rejecting the null confers some confidence on the alternative hypothesis, but that’s better left unquantified, particularly if “falsification is the philosophy of science” as you say. We don’t confirm hypotheses; hypotheses are rejected or left standing.
But whether you wear your heart on your sleeve by naively stating that nine nines is the confidence in the new hypothesis, or carefully hedge your words by only stating that we’ve rejected the null hypothesis of fair dice (and haven’t rejected the alternative, wink wink), still, my critique of the reasoning is going to center around the false dichotomy. Frequentism makes it too easy to bury mistakes under mountains of evidence, because it’s too easy to be right about what’s wrong but wrong about what’s right.
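To make the gap between 1 - P(evidence|fair) and the confidence in “weighted” concrete, here is a minimal sketch. It assumes a ten-sided die (which is what makes nine 10s come out to nine nines). The priors, the 0.9 bias, and the third “misprinted” hypothesis are all invented for illustration; none of these numbers come from the original post.

```python
from fractions import Fraction

SIDES = 10  # assumption: ten-sided die, so P(10 | fair) = 1/10
ROLLS = 9   # nine rolls, all showing 10

def likelihood_all_tens(p_ten, rolls=ROLLS):
    """P(every roll shows 10) if each roll shows 10 with probability p_ten."""
    return p_ten ** rolls

def posterior(priors, likelihoods):
    """Bayes' rule over a finite, mutually exclusive set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# The frequentist number: 1 - P(evidence | fair).
p_evidence_given_fair = likelihood_all_tens(Fraction(1, SIDES))
reject_fair_confidence = 1 - p_evidence_given_fair

# Hypothetical hypothesis space with made-up priors.
priors = {
    "fair": Fraction(999_998, 1_000_000),
    "weighted": Fraction(1, 1_000_000),    # shows 10 with probability 0.9
    "misprinted": Fraction(1, 1_000_000),  # every face is a 10
}
likelihoods = {
    "fair": p_evidence_given_fair,
    "weighted": likelihood_all_tens(Fraction(9, 10)),
    "misprinted": likelihood_all_tens(Fraction(1)),
}
post = posterior(priors, likelihoods)

print("1 - P(evidence | fair) =", float(reject_fair_confidence))
for name, p in post.items():
    print(f"P({name} | evidence) = {float(p):.4f}")
```

Under these made-up priors the fair hypothesis is decisively rejected, yet “weighted” ends up less probable than “misprinted”. That is the sense in which you can be right about what’s wrong while still being wrong about what’s right: the nine-nines figure only speaks to the first part.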
I didn’t know about that. It was a good move from EA; why not try it again?
My low-evidence impression is that there was a fair amount of repeated contact at one time. If it’s true that that contact hasn’t happened recently, it’s probably because it hit diminishing returns in comparison with other things. I doubt people were in touch with Elon and then just forgot about the idea. So I conclude that the remaining disagreements with Elon are probably not something that can be addressed within a short amount of time, and would require significantly longer discussions to make progress on.