I’m researching and advocating for policy to prevent catastrophic risk from AI at https://www.aipolicy.us/.
I’m broadly interested in AI strategy and want to figure out the most effective interventions to get to good AI outcomes.
I think there might be a terminology mistake here—pivotal acts are actions that will make a large positive difference a billion years later.
Epistemic Status: Low. Very likely wrong but would like to understand why.
It seems easier to intent align a human level or slightly above human level AI (HLAI) than a massively smarter than human AI.
Some new research options become available to us once we have aligned HLAI, including:
The HLAI might be able to directly help us do alignment research and solve the general alignment problem.
We could run experiments on the HLAI and get experimental evidence much closer to the domain we are actually trying to solve.
We could use the HLAI to start a training procedure, a la IDA.
These schemes seem fragile, because 1) if any HLAIs are not aligned, we lose, and 2) if the process of training up to superintelligence fails (due to some unknown unknown, the HLAI being misaligned, or any of the known failure modes), we lose.
However, 1) seems like a much easier problem than aligning an arbitrary intelligence AI. Even though something could likely go wrong aligning a HLAI, it also seems likely that something goes wrong if we try to align an arbitrary intelligence AI. (This seems related to security mindset… in the best case world we do just solve the general case of alignment, but that seems hard.)
For 2), the process of training up to superintelligence seems like a HLAI would help more than it hurts. If the HLAI is actually intent aligned, this seems like having a fully uploaded alignment researcher, which seems less like getting Godzilla to fight and more like getting a Jaeger to protect Tokyo.
Half baked confusion:
How does Parfit’s Hitchhiker fit into the Infra-Bayes formalism? I was hoping that the disutility the agent receives from getting stuck in the desert would be easily representable as negative off-branch utility. I am stuck trying to reconcile that with the actual update rule, which (as I understand it) sends each a-measure $(m, b)$ to

$$\left(\frac{1}{P(E)}\,\mathbb{1}_E \cdot m,\;\; \frac{1}{P(E)}\big(b + \mathbb{E}_m[\mathbb{1}_{\neg E}\, U]\big)\right)$$

Here, I interpret $U$ as our utility function. Thus $\mathbb{E}_m[\mathbb{1}_{\neg E}\, U]$ gives us the expected utility tracked from the off-branch event; the probability $P(E)$ and the expectation are just a scale and shift. This update is applied to each a-measure in a set $B$ of a-measures. (Is this right so far?)
Since the environment depends on what our agent chooses, we’ve got to at least have some Knightian uncertainty over the different decisions we could have made. The obvious thought is to give these names: say $e_{\text{pay}}$ is the environment corresponding to the hardcoded policy where the agent pays after getting carried to the city, and $e_{\neg\text{pay}}$ the opposite. After updating on the fact that we are in the city, $e_{\neg\text{pay}}$ seems logically impossible. Does that mean we apply Nirvana to make it go to infinite utility, on the basis that $e_{\neg\text{pay}}$ assigns probability 0 to the history we observed?
We have two policies: we either pay or don’t pay. But not paying would lead us to Nirvana in either case, since it would contradict the hardcoded policy in $e_{\text{pay}}$, and $e_{\neg\text{pay}}$ is impossible. Paying would lead us to losing some utility (paying $100) in $e_{\text{pay}}$, or to contradiction in $e_{\neg\text{pay}}$, so Murphy chooses $e_{\text{pay}}$. This is where my reasoning gets stuck: is there some way to translate this to a Nirvana-free space, where the agent takes the only logically coherent action?
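For concreteness, here is a tiny numerical sketch of the matrix I am picturing, with made-up utilities (being rescued is worth 1,000,000, paying costs 100) and Nirvana represented as infinite utility; the names e_pay / e_nopay are just the labels from above:

```python
import math

NIRVANA = math.inf  # reward for contradicting the environment's hardcoded policy

# Rows: the policy the agent takes after waking up in the city.
# Columns: the environment, i.e. which policy is hardcoded into the predictor.
payoff = {
    ("pay",   "e_pay"):   1_000_000 - 100,  # rescued, then pays
    ("pay",   "e_nopay"): NIRVANA,          # contradicts the hardcoded "don't pay"
    ("nopay", "e_pay"):   NIRVANA,          # contradicts the hardcoded "pay"
    ("nopay", "e_nopay"): NIRVANA,          # e_nopay is impossible once we're in the city
}

def murphy_value(policy):
    """Murphy picks the worst environment for the given policy."""
    return min(payoff[(policy, env)] for env in ("e_pay", "e_nopay"))

values = {p: murphy_value(p) for p in ("pay", "nopay")}
print(values, "->", max(values, key=values.get))
# {'pay': 999900, 'nopay': inf} -> nopay
# This is exactly the confusion: naively, not paying looks best precisely
# because it only "happens" in logically impossible (Nirvana) worlds.
```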
Thank you so much for your detailed reply. I’m still thinking this through, but this is awesome. A couple things:
I don’t see the problem at the bottom. I thought we were operating in the setting where Nirvana meant infinite reward? It seems like of course if N is small, we will get weird behavior because the agent will sometimes reason over logically impossible worlds.
Is Parfit’s Hitchhiker with a perfect predictor unsalvageable because it violates this fairness criterion?
The fairness criterion in your comment is the pseudocausality condition, right?
The decision procedure you outlined in the first example seems equivalent to an evidential decision theorist placing 0 credence on worlds where Omega makes an incorrect prediction. What is the infra-bayesianism framework doing differently? It just looks like the credence distribution over worlds is disguised by the ‘Nirvana trick.’
In Newcomb’s problem, this is correct; it lines up exactly with an EDT agent. In other scenarios we get different behavior, e.g. counterfactual mugging. There, the UDT agent will pay, so that it maximizes the overall expected utility, even after seeing the coin come up tails and being asked by Omega to pay. An EDT agent, on the other hand, won’t pay here, because the expected utility of paying (-100) is worse than that of not paying (0). The key distinction is that EDT is an updateful decision theory: it doesn’t reason about the other branches of the universe that have already been ruled out by observed evidence.
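To make the contrast concrete, here is the arithmetic with the usual stakes (Omega pays $10,000 on heads iff it predicts you would pay $100 on tails; the exact numbers are just the conventional ones, not from this post):

$$\begin{aligned}
\text{UDT (evaluate the policy up front):}\quad & \tfrac{1}{2}(10000) + \tfrac{1}{2}(-100) = 4950 > 0 = \mathbb{E}[\text{refuse}],\\
\text{EDT (evaluate after seeing tails):}\quad & \mathbb{E}[\text{pay} \mid \text{tails}] = -100 < 0 = \mathbb{E}[\text{refuse} \mid \text{tails}].
\end{aligned}$$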
We also don’t have a credence distribution over worlds, because this would be too large to hold in our heads. Instead of a credence distribution, we just have a set of possible worlds.
In the decision rule, how is the set of environments ‘E’ determined? If it contains every possible environment, then this means I should behave like I am in the worst possible world, which would cause me to do some crazy things.
The environment set accounts for each possible policy of the agent. So for each policy $\pi$, there is a corresponding environment $e_\pi$ where that policy is hardcoded. We want our agent to reason only over the diagonal of the matrices I printed, i.e., over pairs $(\pi, e_\pi)$ where the environment’s hardcoded policy matches the policy actually taken.
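To illustrate with the hitchhiker toy numbers from earlier (my own made-up utilities, evaluated over whole histories rather than after updating): restricting to the diagonal means each policy is only scored against its own matching environment.

```python
# Whole-history utilities for the matching (policy, environment) pairs only;
# off-diagonal pairs are never considered.
diagonal = {
    "pay":   1_000_000 - 100,  # (pay,   e_pay):   rescued, then pays $100
    "nopay": 0,                # (nopay, e_nopay): never rescued, dies in the desert
}
print(max(diagonal, key=diagonal.get))  # -> "pay"
```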
Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent’s observations? But isn’t this almost all probability distributions?
So how it actually works is that you have a collection of hypotheses $H_i$, each with a probability attached. In Bayesianism, each $H_i$ would simply be a distribution over the world. In infra-Bayesianism, each $H_i$ is a set of a-measures (affine measures), which are just probability distributions with total measure $\le 1$ as opposed to exactly 1, together with an affine term, which tracks off-branch expected utility.
This set does contain that, and does contain almost all probability distributions (I think). I think it has to be this way: because of non-realizability, there is no way to rule out those distributions.
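A minimal sketch of the data structure being described, under my assumption that an a-measure $(m, b)$ is valued as $\mathbb{E}_m[U] + b$ and that a hypothesis (a set of a-measures) is valued by its worst-case member:

```python
from dataclasses import dataclass

@dataclass
class AMeasure:
    weights: dict       # outcome -> non-negative mass, summing to <= 1
    affine_term: float  # tracks expected utility already assigned off-branch

def value(am: AMeasure, utility: dict) -> float:
    """Expected utility of one a-measure: E_m[U] plus the affine term."""
    return sum(am.weights[o] * utility[o] for o in am.weights) + am.affine_term

def hypothesis_value(hypothesis: list, utility: dict) -> float:
    """A hypothesis is a set of a-measures; Murphy picks the worst one."""
    return min(value(am, utility) for am in hypothesis)

# Two a-measures with total mass < 1 and different affine terms.
H = [AMeasure({"rain": 0.4, "sun": 0.3}, affine_term=0.2),
     AMeasure({"rain": 0.6, "sun": 0.2}, affine_term=0.0)]
print(hypothesis_value(H, {"rain": 0.0, "sun": 1.0}))  # -> 0.2
```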
Sorry if I am missing something obvious. I guess this would have been clearer for me if you explained the infra-bayesian framework a little more before introducing the decision rule.
You aren’t missing obvious things afaict; the general framework is genuinely very complicated, so the goal of this post was to give motivation for the basic ideas. The sequence puts the framework first.
Do you have an idea in mind for how the proposed formula could be applied to a neural network?
The rough idea in my head is to do something like: look at the network activations at each layer for a bunch of different inputs. This gives you a bunch of activations sampled from the distribution of activations. From there, you can do density estimation to estimate the actual distribution over the activations. Doing this while hardcoding in the value of one set of activations gives us a way to sample the distribution of the remaining activations conditioned on it.
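A minimal sketch of the kind of pipeline I have in mind, using a Gaussian KDE as the density estimator; the toy network, the chosen layer, and the estimator are all placeholders, not a proposal for the real thing:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import gaussian_kde

# Toy stand-in for the real model.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))

# Collect activations at one chosen layer over a batch of inputs.
acts = []
model[1].register_forward_hook(lambda mod, inp, out: acts.append(out.detach().numpy()))
with torch.no_grad():
    model(torch.randn(1000, 10))
samples = np.concatenate(acts)    # (1000, 32) samples from the activation distribution

# Fit a density estimate to the sampled activations and query it.
kde = gaussian_kde(samples.T)     # scipy expects shape (dim, n_samples)
print(kde.logpdf(samples[:5].T))  # log-density at a few sample points
```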
The problem with this is that density estimation seems hard with very high dimensional data. Did you have a different way in mind of implementing this?
TL;DR: I agree that the answer to the question above definitely isn’t always yes, because of your counterexample, but I think that moving forward on a similar research direction might be useful anyways.
One can imagine partitioning the parameter space into sets that arrive at basins where each model in the basin has the same, locally optimal performance; this is like a Rashomon set (relaxing the requirement from global minima so that we get a partition of the space). The models which can compress the training data (and thus have free parameters) are generally more likely to be found by random selection and search, because the free parameters mean that the dimensionality of this set is higher, and hence it is exponentially more likely to be hit.
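A rough version of the dimensionality argument (the discretization scale $\varepsilon$ and parameter range $L$ are illustrative assumptions): compare a basin with $k$ free (zero-curvature) directions to a fully constrained one in the same $d$-dimensional parameter space,

$$\frac{\operatorname{Vol}(\text{basin with } k \text{ free directions})}{\operatorname{Vol}(\text{fully constrained basin})} \;\sim\; \frac{\varepsilon^{\,d-k}\, L^{k}}{\varepsilon^{\,d}} \;=\; \left(\frac{L}{\varepsilon}\right)^{k},$$

so the basin that leaves $k$ parameters free is larger by a factor exponential in $k$, and random initialization or search lands in it overwhelmingly often.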
Thus, we can move within these high-dimensional regions of locally optimal loss, which could allow us to find more interpretable models (or models that are maybe more desirable along another axis), which is the stated motivation in the article:
Ultimately, we hope that the study of equivalently optimal models would lead to advances in interpretability: for example, by producing models that are simultaneously optimal and interpretable.
This seems super relevant to alignment! The default path to AGI right now seems to me to be something like an LLM world model hooked up to some RL to make it more agenty, and I expect this kind of theory to apply to LLMs because of their large number of parameters. I’m hoping that this theory gets us better predictions about which Rashomon sets are found (this would look like a selection theorem), and the ability to move within a Rashomon set towards better parameters. Such a selection theorem seems likely because of the dimensionality argument above.
Thanks, this is indeed what I was asking.
I am a bit confused about how you deal with the problem of 0 eigenvalues in the Hessian. It seems like the reason these 0 eigenvalues exist is that the basin has volume 0 as a subset of parameter space. My understanding right now of your fix is that you are adding a small constant $\epsilon$ along the diagonal to make the matrix full rank (with this quantity coming from the regularization plus a small term). Geometrically, this seems like drawing a narrow ellipse around the subspace whose volume we are trying to estimate.
But this doesn’t seem natural to me; it seems like the most important factor in determining the volume of these basins is their relative dimensionality. If there are two loss basins, but one has higher dimension than the other, the larger one dominates and becomes a lot more likely. If this is correct, we only care about the volume of basins that have the same (maximal) number of dimensions. Thus, we can discard the dimensions with 0 eigenvalue and just apply the formula for the volume over the non-zero eigenvalues (but only for the basins with maximum-rank Hessians). This lets us directly compare the volumes of these basins, and then treat the lower-dimensional basins as having 0 volume.
Does this make any sense?
Thanks for your response! I’m not sure I communicated what I meant well, so let me be a bit more concrete. Suppose our loss is parabolic, $L(\theta) = \theta_1^2 + \theta_2^2$, where $\theta = (\theta_1, \theta_2, \theta_3) \in \mathbb{R}^3$. This is like a 2d parabola (but its convex hull / volume below a certain threshold is 3D). In 4D space, which is where the graph of this function lives and hence where I believe we are talking about basin volume, this has 0 volume. The Hessian is the matrix

$$H = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
This is conveniently already diagonal, and the 0 eigenvalue comes from the $\theta_3$ component, which is being ignored. My approach is to remove the 0-eigenspace, so that we are working just in the subspace where the eigenvalues are positive, leaving just the matrix $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$, after which we can apply the volume formula given in the post (the one that divides by the determinant).
If this determinant were 0, then dividing by 0 would give the spurious infinity (this is what you are talking about, right?). But if we remove the 0-eigenspace, we are left with a positive volume and hence avoid this division by 0.
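A small numerical sketch of the proposal on the toy Hessian above; the eigenvalue tolerance and the volume convention (the ellipsoid $\{\tfrac{1}{2}\theta^\top H \theta \le c\}$ restricted to the non-degenerate directions) are my assumptions:

```python
import numpy as np
from scipy.special import gamma

H = np.diag([2.0, 2.0, 0.0])           # Hessian of L = theta1^2 + theta2^2
eigvals = np.linalg.eigvalsh(H)

# Keep only the non-degenerate directions (the tolerance is a judgment call).
tol = 1e-8
nonzero = eigvals[eigvals > tol]       # -> [2., 2.]
k = len(nonzero)                       # effective dimensionality of the basin

# Volume of {L <= c} restricted to those k directions: an ellipsoid with
# semi-axes sqrt(2c / lambda_i), scaled by the unit-ball volume in k dims.
c = 1.0
unit_ball = np.pi ** (k / 2) / gamma(k / 2 + 1)
volume = unit_ball * np.prod(np.sqrt(2 * c / nonzero))
print(k, volume)                       # 2 3.141592653589793
```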
There is also the ontology identification problem. The two biggest issues: we don’t know how to specify exactly what a diamond is, because we don’t know the true base-level ontology of the universe; and we don’t know how diamonds will be represented in the AI’s model of the world.
I personally don’t expect coding a diamond-maximizing AGI to be hard, because I think that “diamond” is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more basic physics, e.g. quarks that exist below the molecular level, “diamond” will probably still be a pretty natural concept, just like how “apple” didn’t stop being a useful concept after the shift from Newtonian mechanics to QM.
Of course, concepts such as human values/corrigibility/whatever are a lot more fragile than diamonds, so this doesn’t seem helpful for alignment.
There are levels of confidence that an AI is aligned in between “hoping it is aligned” and “having a formal proof that it is aligned”. For example, we might be able to find sufficiently strong selection theorems, which tell us that certain types of optima tend to be chosen, even if we can’t prove theorems with certainty. We also might be able to find a working ELK strategy that gives us interpretability.
These might not be good strategies, but the statement “Therefore no AI built by current methods can be aligned” seems far too strong.
The best resource that I have found on why corrigibility is so hard is the Arbital post; are there other good summaries that I should read?
What is the structure of Conjecture’s interpretability research, and how does it differ from Redwood and Anthropic’s interpretability research?
Edit: This was touched on here.
Thanks!
I might be missing something, but it seems to me like the way to modify the machine to a smaller state diagram is to remove the HALT state from the TM and then redraw any state transition that goes to HALT to map arbitrarily to any other state.
This won’t change the behavior on computations that haven’t halted so far, because those computations never reached the HALT state, and so won’t be affected by any of the swapped transitions.
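A minimal sketch of the transformation, with the transition table represented as a dict (the encoding and the example machine are mine, not from the post):

```python
# Transition table: (state, symbol) -> (next_state, write_symbol, move).
# "HALT" is the halting state we want to eliminate.
delta = {
    ("q0", 0): ("q1", 1, "R"),
    ("q0", 1): ("HALT", 1, "R"),
    ("q1", 0): ("q0", 0, "L"),
    ("q1", 1): ("HALT", 0, "R"),
}

def remove_halt(delta, replacement="q0"):
    """Redirect every transition into HALT to an arbitrary surviving state.

    Computations that had not yet halted never took these transitions,
    so their behavior so far is unchanged; the machine just loses a state.
    """
    return {key: (replacement if nxt == "HALT" else nxt, write, move)
            for key, (nxt, write, move) in delta.items()}

print(remove_halt(delta))
```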
Did you mean:
And yet, assuming **tool** AI is possible at all, it will be possible to assemble those tools into something agenty.
I think realizability is the big one. Some others are:
Infra-Bayesianism lets you avoid bridge rules; for more detail, see the AXRP podcast.
The classical Bayesian simplicity prior, the Solomonoff prior, might have problems with acausal attacks, though I am not sure how much I buy this. Infra-Bayesian physicalism still uses this prior, but also allows us to classify and discard these malign hypotheses.
Hi Vanessa!
Thank you so much for your thoughtful reply. To respond to a few of your points:
We only mean to test this in an artificial toy setting. We agree that empirical demonstrations seem very difficult.
Thanks for pointing out the Cartesian versions; I hadn’t read this before, and this is a nice clarification of how to measure this in a loss-function-agnostic way.
It’s good to know about the epistemic status of this part of the theory, we might take a stab at proving some of these bounds.
We will definitely make sure to avoid competitive implementations because of the associated risks.
We would very much appreciate discussing details in private, we are serious about it. I’ll follow up with a DM on LessWrong soon.
This should say low-dimensional, right? Because the abstractions should be simpler than the true state of the world.