what else do we have, really?
Patience. And cosmic wealth, some of which can be used to ponder these questions for however many tredecillions of years it takes.
Would the refutation of ‘ignore acausal blackmail’ be still available?
First, Harder Choices Matter Less. It’s more useful to focus on dilemmas that you can decisively resolve.
Second, the problem might be sufficiently weird to itself lie outside your goodhart boundary, so it’s like Pascal’s Mugging. In that case, under mild optimization it might be normatively correct to ignore it on the object level, and merely raise the priority of chipping away at the goodhart boundary in its direction, to eventually become able to assess its value. It’s only under expected utility maximization that you should still care on the object level about problems you can’t solve.
Superhuman AI will be running into unknown situations all the time because of different capabilities.
It might have some sort of perverse incentive to get there, but unless it has already extrapolated the values enough to make them robust to those situations, it really shouldn’t. It’s not clear how specifically it should avoid doing so, what a principled way of making such decisions would look like.
Somewhere in my brain there is some sort of physical encoding of my values.
Not sure if this is the intended meaning, but the claim that values don’t depend on the content of the world outside the brain is generally popular (especially in decision theory), and there seems to be no basis for it. Brains are certainly some sort of pointers to value, but a lot (or at least certainly some) of the content of values could be somewhere else, most likely in civilization’s culture.
This is an important distinction for corrigibility, because this claim is certainly false for a corrigible agent: it instead wants to find the content of its values in the environment; that content is not part of its current definition/computation. It also doesn’t make sense to talk about this agent pursuing its goals in a diverse set of environments, unless we expect the goals to vary with the environment.
For the decision theory of such agents, this could be a crucial point. For example, an updateless corrigible agent wouldn’t be able to know the goals that it must choose a policy in pursuit of. The mapping from observations to actions that UDT would pick now couldn’t be chosen as the most valuable mapping, because the value/goal itself depends on observations, and even after some observations it’s not pinned down precisely. So if this point is taken into account, we need a different decision theory, even if it’s not trying to do anything fancy with corrigibility or mild optimization, but merely acknowledges that goal content could be located in the environment!
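To make the difficulty concrete (the notation here is my own sketch, not a worked-out proposal): standard UDT picks a policy for a fixed utility function U,

\[ \pi^* = \arg\max_{\pi : O \to A} \; \mathbb{E}_{E \sim P}\!\left[\, U(\mathrm{hist}(\pi, E)) \,\right], \]

where hist(π, E) is the outcome of running policy π in environment E. If goal content is located in the environment, so that U = t(E) for some extrapolation method t, the naive substitution

\[ \pi^* = \arg\max_{\pi : O \to A} \; \mathbb{E}_{E \sim P}\!\left[\, t(E)(\mathrm{hist}(\pi, E)) \,\right] \]

averages together utility functions from different possible environments, and it’s not clear that this aggregation is normatively meaningful. That’s the sense in which even a non-fancy decision theory has to change once goal content in the environment is acknowledged.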
The abstractions I’m referring to are not intended to be limited to simpler agents that a human can predict: there are precise abstractions of you that can only be predicted by superintelligences, as well as imprecise ones that can be predicted by you and other humans, things like theoretically ascribed reputation concerning specific situations. This almost touches on what’s needed for Newcomb’s Problem.
Obviously Omega can exist for uploads and arbitrarily complex AGIs formulated as abstract programs (these things can run on computers, so Omega could just use a similar computer). An embedded agent modeling the universe including itself is not a real additional difficulty in principle, even if we want to model the universe precisely (if the world is compressible, its shorter description can fit into a smaller agent embedded in the same world, and with quining we can avoid contradictions). Almost certainly that’s not possible to do in our world, but in any case, that’s not because self-reference causes trouble.
This idea has been around for some time, known as indirect normativity. The variant you describe was also my own preferred formulation at the time. For a few years it was a major motivation for me to study decision theory, since this still needs the outer AGI to actually run the program, and ideally also to eventually yield control to that program, once the program figures out its values and can slot them in as the values of the outer AGI.
This doesn’t work out for several reasons. We don’t actually have a way of creating the goal program. The most straightforward thing would be to use an upload, but that probably can’t be done before AGIs.
If we do have a sensible human imitation, then the thing to do with it is to build an HCH that pushes the goodhart boundary of that human imitation and allows stronger optimization of the world without breaking down our ability to assess its value. This gives the first aligned AGI directly, without turning the world into computronium.
Even if we did make a goal program, it’s still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs. It’s not even known what kind of thing goals are, what type signature is needed to communicate them from the goal program to the outer AGI.
Even if we mostly knew how to build the outer AGI that runs a goal program (though with some confusion around the notion of goals still remaining), it’s unclear that there are normative goals for humanity that are goals in a sense similar to a utility function in expected utility maximization, goals for a strong optimizer. We might want to discover such goals with reflection, but that doesn’t necessarily reach a conclusion, as reflection is unbounded.
More likely, there is just a sequence of increasingly accurate proxy goals with increasingly wide goodhart boundaries, instructing a mild optimizer how to act on states of the world it is able to assess. But then the outer AGI must already be a mild optimizer and not a predatory mature optimizer that ignores all boundaries of what’s acceptable in pursuit of the goal it knows (in this case, the goal program).
This sets up motivation for what I currently see as valuable on the decision theory side: figuring out a principled way of doing mild optimization (there’s only quantilization in this space at the moment). It should probably take something like the goodhart boundary as a fundamental ingredient of its operation (it seems related to the base distribution of quantilization), the kind of thing that’s traditionally missing from decision theory.
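For reference, the quantilization baseline is easy to sketch (a toy Monte Carlo version; `base_sample`, `proxy_utility`, `q`, and the example usage are stand-ins I’m supplying, and the open question above is what principled thing, perhaps derived from the goodhart boundary, should play the role of the base distribution):

```python
import random

def quantilize(base_sample, proxy_utility, q=0.1, n=1000):
    """Toy q-quantilizer: draw n candidate actions from the base distribution
    and return one sampled uniformly from the top q-fraction by proxy utility.
    Unlike argmax, this limits how hard the proxy utility gets pushed on."""
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=proxy_utility, reverse=True)
    return random.choice(candidates[: max(1, int(q * n))])

# Example with stand-in ingredients: a base distribution of "ordinary" actions,
# and a proxy utility that would be goodharted under full maximization.
action = quantilize(base_sample=lambda: random.gauss(0, 1),
                    proxy_utility=lambda a: a)
```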
In other words, you have the ability to control their payoff outside the negotiation, based on what you observe during the negotiation.
This suggests some sort of (possibly acausal) bargaining within the BATNAs, so it points to a hierarchy of bargains. Each bargain must occur without violating the boundaries of the agents; if it would violate them, the encounter undergoes escalation, away from trade and towards conflict. After a step of escalation, another bargain may be considered, one that runs off tighter, less comfortable boundaries. If it also falls through, there is a next level of escalation, and so on.
Possibly the sequence of escalation goes on until it reaches the goodhart boundary, where agents lose the ability to assess the value of outcomes. It’s unclear what happens when that breaks down as well and one of the agents moves the environment into the other’s crash space.
Note that this is not destruction of the other agent, which is unexpected for the last stage of escalation of conflict. Destruction of the other agent is merely how the game aborts before reaching its conclusion, while breaking into the crash space of the other agent is the least acceptable outcome in terms of agent boundaries (though it’s not the worst outcome, it could even have high utility; these directions of badness are orthogonal, goodharting vs. low utility). This is a likely outcome of failed AI alignment (all boundaries of humanity are ignored, leading to something normatively worthless), as well as of some theoretical successes of AI alignment that are almost certainly impossible in practice (all boundaries of humanity are ignored, the world is optimized towards what is the normatively best outcome for humanity).
Acausal trade can be thought of as committing to carry out the verdict of an adjudicator that exists as common knowledge and would also direct other members of the coalition. An adjudicator that directs other members of a coalition to torture you doesn’t seem like a good source of advice for what to do, so it might be better to pass up on this trade opportunity.
This point doesn’t follow from expected utility maximization, even as channeled through theories in the UDT/TDT/FDT tradition. What it needs is a notion of boundaries: not bargaining your way into unacceptable territory when that can be avoided by simply walking away from a trade.
This might be related to the problem of goodharting, the most pressing problem of alignment, in which case the relevant concept is the goodhart boundary, which one wouldn’t want to step outside of for reasons other than it having low utility. But its relation to bargaining is unclear. It could be the last bargaining boundary that’s not very important in practice, because there are many other boundaries before it that trigger retreat from trade and escalation of conflict.
Human values are eventually the only important thing, but don’t help with the immediate issue of goodharting. Doing expected utility maximization with any proxy of humanity’s values, no matter how implausibly well-selected this proxy is, is still misaligned. Even if in principle there exists a goal such that maximizing towards it is not misaligned, this goal can’t be quickly, or possibly ever, found.
So for practical purposes, any expected utility maximization is always catastrophically misaligned, and there is no point in looking into supplying correct goals for it. This applies more generally to other ways of being a mature agent that knows what it wants, as opposed to being actively confused and trying not to break things in the meantime by staying within the goodhart boundary.
I think encountering strong optimization in this sense is unlikely, as AGIs are going to have mostly opaque values in a way similar to how humans do (unless a very clever alignment project makes it not be so, and then we’re goodharted). So they would also be wary of goodharting their own goals and only pursue mild optimization. This makes what AGIs do determined by the process of extrapolating their values from the complicated initial pointers to value they embody at the time. And processes of value extrapolation from an initial state vaguely inspired by human culture might lead to outcomes with convergent regularities that mitigate relative arbitrariness of the initial content of those pointers to value.
These convergent regularities in values arrived-at by extrapolation are generic values. If values are mostly generic, then the alignment problem solves itself (so long as a clever alignment project doesn’t build a paperclip maximizer that knows what it wants and doesn’t need the extrapolation process). I think this is unlikely. If merely sympathy towards existing people (such as humans) is one of the generic values, then humanity survives, but loses cosmic endowment. This seems more plausible, but far from assured.
I’m thinking of a setting where shortest descriptions of behavior determine sets of models that exhibit matching behavior (possibly in a coarse-grained way, so distances in behavior space are relevant). This description-model relation could be arbitrarily hard to compute, so it’s OK for shortest descriptions to be shortest programs or something ridiculous like that. This gives a partition of the model/parameter space according to the mapping from models to shortest descriptions of their behavior. I think shorter shortest descriptions (simpler behaviors) fill more volume in the parameter/model space, that is, have more models whose behavior is given by those descriptions (this is probably the crux; e.g. it’s false if behaviors are just models themselves and descriptions are exact).
Gradient descent doesn’t interact with descriptions or the description-model relation in any way, but since it selects models ~based on behavior, and starts its search from a random point in the model space, it tends to select behaviors from larger elements of the partition of the space of models that correspond to simpler behaviors with shorter shortest descriptions.
This holds at every step of gradient descent, not just when it has already learned something relevant. The argument is that whatever behavior is selected, it is relatively simple, compared to other behaviors that could’ve been selected by the same selection process. Further training just increases the selection pressure.
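A toy way to poke at the crux (this is my own illustrative setup, not part of the original argument): sample random parameters for a tiny network, record its input-output behavior as a truth table, and see how the draws distribute over behaviors. If the volume claim holds, a few simple behaviors (e.g. constant functions) should account for a disproportionate share of random parameter draws.

```python
import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # all 3-bit inputs

def random_behavior(hidden=8):
    """Sample a tiny random MLP and return its truth table (its 'behavior')."""
    W1 = rng.normal(size=(3, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=(hidden, 1)); b2 = rng.normal(size=1)
    out = (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int).ravel()
    return tuple(out)

# How much parameter-space volume (estimated by random draws) does each
# behavior get? The partition elements are the preimages of behaviors.
counts = Counter(random_behavior() for _ in range(100_000))
for behavior, n in counts.most_common(5):
    print(behavior, n)
```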
I mean relatively short, as in the argument for why overparametrized models generalize. They still do get to ~memorize all training data, but anything else comes at a premium, reduces probability of getting selected for models whose behavior depends on those additional details. (This use of “short” as meaning “could be 500 gigabytes” was rather sloppy/misleading of me, in a comment about sloppy/misleading use of words...)
we will select for “explicit search processes with simple objectives”
The actual argument is that small descriptions give higher parameter space volume, and so the things we find are those with short descriptions (low Kolmogorov complexity). The thing with a short description is the whole mesa-optimizer, not just its goal. This is misleading for goals, because low Kolmogorov complexity doesn’t mean low “complexity” in many other senses, so an arbitrary goal with low Kolmogorov complexity would actually be much more “complicated” than the intended base objective. In particular, it probably cares about the real world outside an episode and is thus motivated to exhibit deceptively aligned behavior.
I think “explicit search” is similarly misleading, because most short programs (around a given behavior) are not coherent decision theoretic optimizers. Search would only become properly explicit after the mesa-optimizer completes its own agent foundations alignment research program and builds itself a decision theory based corrigible successor. A mesa-optimizer only needs to pursue the objective of aligned behavior (or whatever it’s being selected for), and whether that tends to be its actual objective or the instrumental objective of deceptive alignment is a toss-up; either would do for it to be selected. But in either case, it doesn’t need to be anywhere near ready to pursue a coherent goal of its own (as aligned behavior is also not goal directed behavior in a strong sense).
The goodhart boundary encloses the situations where the person/agent making the decisions has an accurate proxy utility function and proxy probability distribution (so that the tractable judgements available in practice are close to the normative, actually-correct ones). Goodhart’s Curse is the catastrophe where an expected utility maximizer operating under a proxy probutility (probability+utility) would by default venture outside the goodhart boundary (into the crash space), set the things that the proxy utility overvalues or doesn’t care about to extreme values, and thus ruin the outcome from the point of view of the intractable normative utility.
Pascal’s Mugging seems like a case of venturing outside the goodhart boundary in the low-proxy-probability direction rather than in the high-proxy-utility direction. But it illustrates the same point, that if all you have are proxy utility/probability, not the actual ones, then pursuing any kind of expected utility maximization is always catastrophic misalignment.
One must instead optimize mildly and work on extending the goodhart boundary, improving robustness of the utility/prior proxy to unusual situations (rescuing them under an ontological shift) in a way that keeps them close to their normative/intended content. In the case of Pascal’s Mugging, that means better prediction of low-probability events (in Bostrom’s framing, where utility values don’t get too ridiculous), or also better understanding of high-utility events (in Yudkowsky’s framing, with 3^^^^3 lives being at stake), and avoiding situations that call for such decisions until after that understanding is already available.
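As a toy contrast (all numbers are made up for illustration): acting on proxy probability and proxy utility by naive expected utility maximization takes the mugging, while a rule that refuses to act on options outside the goodhart boundary ignores it on the object level and only queues it for further study.

```python
# Hypothetical proxy estimates for the two options in a Pascal's Mugging.
options = {
    "ignore mugger": {"p": 1.0,  "u": 1.0,  "in_boundary": True},
    "pay mugger":    {"p": 1e-9, "u": 1e12, "in_boundary": False},
}

def proxy_eu(name):
    return options[name]["p"] * options[name]["u"]

eu_choice = max(options, key=proxy_eu)  # "pay mugger": proxy EU of 1000 wins
bounded_choice = max((o for o in options if options[o]["in_boundary"]), key=proxy_eu)
to_study = [o for o in options if not options[o]["in_boundary"]]  # extend the boundary here
print(eu_choice, bounded_choice, to_study)
```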
Incidentally, it seems like the bureaucracies of HCH can be thought of as a step in that direction, with individual bureaucracies capturing novel concepts needed to cope with unusual situations, HCH’s “humans” keeping the whole thing grounded (within the original goodhart boundary that humans are robust to), and episode structure arranging such bureaucracies/concepts like words in a sentence.
according to one of our most accepted theories these days (quantum mechanics) the inherent randomness of our universe
And yet I’m able to perfectly predict that 2+2 is 4. The agents being predicted are abstractions, like behavior of computer programs determined by source code. It doesn’t matter for reasoning about an abstraction that its instances of practical importance get to run within physics.
In my view, the purpose of the human/HCH distinction is that there are two models, that of a “human” and that of HCH (bureaucracies). This gives some freedom in training/tuning the bureaucracies model, to carry out multiple specialized objectives and work with prompts that the human is not robust enough to handle. This is done without changing the human model, to preserve its alignment properties and to use the human’s pervasive involvement/influence at all steps to keep the bureaucracy training/tuning aligned.
The bureaucracies model starts out as that of a human. An episode involves multiple (but only a few) instances of both humans and bureaucracies, each defined by a self-changed internal state and an unchanging prompt/objective. It’s the prompt/mission-statement that turns the single bureaucracies model into a particular bureaucracy; for example, one of the prompts might instantiate the ELK head of the bureaucracies model. Crucially, the prompts/objectives of humans are less weird than those of bureaucracies, don’t go into Chinese room territory, and each episode starts as a single human in control of the decision about which other humans and bureaucracies to initially instantiate in what arrangement. It’s only the bureaucracies that get to be exposed to Chinese room prompts/objectives, and they can set up subordinate bureaucracy instances with similarly confusing-for-humans prompts.
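A minimal sketch of that episode structure (class names, prompts, and the particular instances are hypothetical, just to make the two-model setup concrete):

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    model: str    # which of the two models is instantiated: "human" or "bureaucracies"
    prompt: str   # unchanging prompt/mission-statement that defines the role
    state: list = field(default_factory=list)  # self-changed internal state

def start_episode():
    # Every episode starts with a single human-model instance in control.
    root = Instance(model="human", prompt="decide which humans/bureaucracies to instantiate")
    # Humans only get ordinary prompts; bureaucracies may get weird,
    # Chinese-room style prompts, e.g. an ELK head mission statement.
    elk_head = Instance(model="bureaucracies", prompt="ELK head mission statement")
    reviewer = Instance(model="human", prompt="assess the ELK head's outputs")
    return [root, elk_head, reviewer]
```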
Since the initial human model is not very capable or aligned, the greater purpose of the construction is to improve the human model. The setting allows instantiating and training multiple specialized bureaucracies, and possibly generalizing their prompt/role/objective from the examples used in training/tuning the bureaucracies model (the episodes). After all, robustness of the bureaucracies model to weird prompts is almost literally the same thing as breadth of available specializations/objectives of bureaucracies.
So the things I was pointing to, incentives/interpretability/rationality, are focus topics for tuned/specialized bureaucracies, whose outputs can be assessed/used by the more reliable but less trainable human (as more legible reference texts, and not large/opaque models) to improve bureaucracy (episode) designs, to gain leverage over bureaucracies that are more specialized and robust to weird prompts/objectives, by solving more principal-agent issues.
More work being allocated to incentives/surveillance/rationality means that even when working on some object-level objective, a significant portion of the bureaucracy instances in an episode would be those specialized in principal-agent problem (alignment) prompts/objectives, and not in the object-level objective, even if it’s the object-level objective bureaucracy that’s being currently trained/tuned. Here, the principal-agent objective bureaucracies (alignment bureaucracies/heads of the bureaucracies model) remain mostly unchanged, similarly to how the human model (that bootstraps alignment) normally remains unchanged in HCH, since it’s not their training that’s being currently done.
The blank-slateness still makes sense as referring to the dimensions determined by nurture. But that doesn’t yield an interesting point about content of civilization. If everyone starts out as blank canvas, that doesn’t mean paintings (and art schools) are less real/important/legitimate.
What if 90% or 99% of the work was not object level, but about mechanism/incentive design, surveillance/interpretability, and rationality training/tuning, including specialized to particular projects being implemented, including the projects that set this up, iterating as relevant wisdom/tuning and reference texts accumulate? This isn’t feasible for most human projects, as it increases costs by orders of magnitude in money (salaries), talent (number of capable people), and serial time. But in HCH you can copy people, it runs faster, and distillation should get rid of redundant steering if it converges to a legible thing in the limit of redundancy.
If both agents are FDT, and have common knowledge of each others source code
Any common knowledge they can draw up can go into a coordinating agent (adjudicator), all it needs is to be shared among the coalition, it doesn’t need to have any particular data. The problem is verifying that all members of the coalition will follow the policy chosen by the coordinating agent, and common knowledge of source code is useful for that. But it could just be the source code of the trivial rule of always following the policy given by the coordinating agent.
One possible verdict of the adjudicator should be falling back to the unshared/private BATNAs, aborting the bargain, and of course doing other things not in scope of this particular bargain. These things are not parts of the obey-the-adjudicator algorithm, but consequences of following it. So common knowledge of everything is not needed, only common knowledge of the adjudicator and its authority over the coalition. (This is also a possible way of looking at UDT, where a single agent in many possible states acting through many possible worlds coordinates among its variants.)
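A minimal sketch of that reading (the function names and the toy verdict rule are mine): each member’s source code is just “obey the adjudicator”, and the adjudicator, run on the common knowledge alone, may well return the verdict of falling back to private BATNAs.

```python
def adjudicator(common_knowledge):
    """Deterministic program shared by the coalition: maps common knowledge to
    a joint policy, one entry per member. Falling back to private BATNAs is
    one of its possible verdicts."""
    members = common_knowledge["members"]
    if common_knowledge.get("gains_from_trade", 0) <= 0:
        return {m: "fall back to private BATNA" for m in members}
    return {m: "follow the bargain" for m in members}

def member_policy(me, common_knowledge):
    # The trivial, verifiable rule: always do whatever the adjudicator says.
    return adjudicator(common_knowledge)[me]

print(member_policy("A", {"members": ["A", "B"], "gains_from_trade": 1}))
```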
FDT works on an assumption that other actors use a similar utility function as itself
FDT is not about interaction with other actors, it’s about accounting for influence of the agent through all of its instances (including predictions-of) in all possible worlds.
Coordination with other agents is itself an action that a decision theory could consider. This action involves creation of a new coordinating agent that decides on a coordinating policy that all members of a coalition carry out, and this coordinating agent also needs a decision theory. The coordinating agent acts through all agents of the coalition, so it’s sensible for it to be some flavor of FDT, though a custom decision theory specifically for such situations seems appropriate, especially since it’s doing bargaining.
The decision theory that chooses whether to coordinate by running a coordinating agent or not has no direct reason to be FDT; it could just be trivial. And preparing the coordinating agent is not obviously a question of decision theory; it even seems to fit deontology a bit better.
That’s not what I’m talking about. I’m not talking about what goals are about, I’m talking about where the data to learn what they are is located. There is a particular thing, say a utility function, that is the intended formulation of the goals. It could be the case that this intended utility function can be found somewhere in the brain. That doesn’t mean that it’s a utility function that cares about brains; the questions of where it’s found and what it cares about are unrelated.
Or it could be the case that it’s recorded on an external hard drive, and the brain only contains the name of the drive (this name is a “pointer to value”). In that case you simply can’t recover this utility function by looking only at the brain, without actually looking at the drive. So the utility function u itself depends on the environment E, that is, there is some method of formulating utility functions t such that u=t(E). This is not the same as saying that the utility of the environment depends on the environment, which gives the utility value u(E)=t(E)(E) (there’s no typo here). But if the utility function is actually in the brain, and says that hard drives are extremely valuable, then you do get to know what it is without looking at the hard drives, and learn that it values hard drives.
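Spelled out in symbols (my notation, just to keep the two dependencies apart):

\[ t : \mathcal{E} \to (\mathcal{E} \to \mathbb{R}), \qquad u = t(E), \qquad u(E) = t(E)(E). \]

Here t is the method of formulating utility functions from environments, u = t(E) is the utility function that the environment (the hard drive’s contents) determines, and t(E)(E) is the value that function assigns to that same environment. The first dependence on E is about where the goal content is found; only the second is about what the goal cares about.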