In homomorphic encryption, someone living inside the computation can’t affect the outside unless the whole system breaks the cryptographic assumption. It’s not at all analogous to side channel attacks on other cryptographic operations, for which the security of the primitive itself says nothing about the security of the implementation.
Put differently: in the typical security model, the computer doing the encryption is assumed to be a magical computer that doesn’t leak information and keeps the secret key perfectly secret. But the computer that’s actually running the encrypted computation is assumed to be adversarial, so it can be as imperfect as you want and it’s not going to break things. The analogy to a traditional side-channel attack would be leaking information from the encryption process (which you would still have to be careful about here).
For example, it’s impossible to provide a homomorphically encrypted computation the ability to “pause.” (How would you even try to do that?) And the computation you do cannot depend at all on what the AI is thinking. Of course this guarantees that homomorphically encrypted computations are slow in practice.
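As a toy illustration of "the machine running the computation sees only ciphertexts," here is a sketch using textbook RSA, which happens to be multiplicatively homomorphic. This is an assumption-laden stand-in: it is not fully homomorphic, the parameters are far too small to be secure, and the function names are made up for the example. The point is only that `untrusted_multiply` performs the same fixed operation whatever the plaintexts are, and never touches the private exponent.

```python
# Toy only: textbook RSA with tiny primes. NOT secure, NOT fully homomorphic.
p, q = 61, 53
n = p * q
phi = (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)  # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    """Runs on the trusted machine that holds the key."""
    return pow(m, e, n)

def decrypt(c):
    """Runs on the trusted machine that holds the key."""
    return pow(c, d, n)

def untrusted_multiply(c1, c2):
    """Runs on the adversarial machine: it sees only ciphertexts, never d,
    and does the same thing regardless of what is encrypted."""
    return (c1 * c2) % n

c = untrusted_multiply(encrypt(7), encrypt(3))
result = decrypt(c)  # product of the two plaintexts, as long as it is < n
```

Leaking information from `encrypt` or `decrypt` would be the analog of a traditional side channel; nothing that happens inside `untrusted_multiply` can play that role.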
Note that since making the OP there have been credible proposals for indistinguishability obfuscation, which would be the more natural thing to use though it’s bound to be even less competitive.
(I think the crypto in the OP is fine, but I no longer endorse it / consider it interesting.)
(To restate the obvious, all of the stuff here is extremely WIP and rambling.)
I’ve often talked about the case where an unaligned model learns a description of the world + the procedure for reading out “what the camera sees” from the world. In this case, I’ve imagined an aligned model starting from the unaligned model and then extracting additional structure.
It now seems to me that the ideal aligned behavior is to learn only the “description of the world” and then have imitative generalization take it from there, identifying the correspondence between the world we know and the learned model. That correspondence includes in particular “what the camera sees.”
The major technical benefit of doing it this way is that we end up with a higher prior probability on the aligned model than the unaligned model—the aligned one doesn’t have to specify how to read out observations. And specifying how to read out observations doesn’t really make it easier to find that correspondence.
We still need to specify how the “human” in imitative generalization actually finds this correspondence. So this doesn’t fundamentally change any of the stuff I’ve recently been thinking about, but I think that the framing is becoming clearer and it’s more likely we can find our way to the actually-right way to do it.
It now seems to me that a core feature of the situation that lets us pull out a correspondence is that you can’t generally have two equally-valid correspondences for a given model—the standards for being a “good correspondence” are such that it would require crazy logical coincidence, and in fact this seems to be the core feature of “goodness.” For example, you could have multiple “correspondences” that effectively just recompute everything from scratch, but by exactly the same token those are bad correspondences.
(This obviously only happens once the space and causal structure is sufficiently rich. There may be multiple ways of seeing faces in clouds, but once your correspondence involves people and dogs and the people talking about how the dogs are running around, it seems much more constrained because you need to reproduce all of that causal structure, and the very fact that humans can make good judgments about whether there are dogs implies that everything is incredibly constrained.)
There can certainly be legitimate ambiguity or uncertainty. For example, there may be a big world with multiple places that you could find a given pattern of dogs barking at cats. Or there might be parts of the world model that are just clearly underdetermined (e.g. there are two identical twins and we actually can’t tell which is which). In these cases the space of possible correspondences still seems effectively discrete, rather than being a massive space parameterized as neural networks or something. We’d be totally happy surfacing all of the options in these cases.
There can also be a bunch of inconsequential uncertainty, things that feel more like small deformations of the correspondence than moving to a new connected component in correspondence-space. Things like slightly adjusting the boundaries of objects or of categories.
I’m currently thinking about this in terms of: given two different correspondences, why is it that they manage to both fit the data? Options:
They are “very close,” e.g. they disagree only rarely or make quantitatively similar judgments.
One of them is a “bad correspondence” and could fit a huge range of possible underlying models, i.e. it’s basically introducing the structure we are interested in within the correspondence itself.
The two correspondences are “not interacting,” they aren’t competing to explain the same logical facts about the underlying model. (e.g. a big world, one correspondence faces .)
There is an automorphism of my model of the world (e.g. I could exchange the two twins Eva and Lyn), and I can compose a correspondence with that automorphism. (This seems much more likely to happen for poorly-understood parts of the world, like how we talk about new physics, than for simple things like “is there a cat in the room.”)
I don’t know where all of this ends up, but I feel some pretty strong common-sense intuition like “If you had some humans looking at the model, they could recognize a good correspondence when they saw it” and for now I’m going to be following that to see where it goes.
I tentatively think the whole situation is basically the same for “intuition module outputs a set of premises and then a deduction engine takes it from there” as for a model of physics. That is, it’s still the case that (assuming enough richness) the translation between the “intuition module”’s language and human language is going to be more or less pinned down uniquely, and we’ll have the same kind of taxonomy over cases where two translations would work equally well.
I think the AI systems in this story have a clear understanding of the difference between the measurement and the thing itself.
Are humans similarly like drug addicts, because we’d prefer to experience play and love and friendship and so on even though we understand those things are mediocre approximations to “how many descendants we have”?
Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.
Abstracting out Answer, let’s just imagine that our AI outputs a distribution p over the space of trajectories S in the human ontology, and somehow we define a reward function r(p,ω) evaluated by the human in hindsight after getting the observation ω. The idea is that this is calculated by having the AI answer some questions about what it believes etc but we’ll abstract that all out.
Then the conclusion in this post holds under some convexity assumption on r, since then spreading out your mass can’t really hurt you (since the human has no way to prefer your pointy estimate). But e.g. if you just penalized p for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that p satisfy various conditional independence properties then we may rule out HumanAnswer.
The more precise bad behavior InstrumentalAnswer is to output the distribution $\arg\max_p \mathbb{E}_{\omega \sim W'}[r(p, \omega)]$. Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer. It could end up being slightly more computationally complex. I think everything I’ve said about this case still applies for InstrumentalAnswer, but it’s relevant when I start talking about stuff like conditional independence requirements between the model’s answers.
Actually if A --> B --> C and I observe some function of (A, B, C) it’s just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it’s actually difficult before concluding that this kind of conditional independence structure is potentially useful.
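A small numerical check of the A+C example, as a sketch under simplifying assumptions: all three variables are binary, each link flips its input with probability 0.2 (an arbitrary choice), and "beliefs about B" is simplified to knowing B exactly. Before the observation, A and C are independent given B; after conditioning on A+C=1 they are not.

```python
from itertools import product

FLIP = 0.2  # assumed noise on each causal link

def joint():
    """Joint distribution of the binary chain A -> B -> C."""
    dist = {}
    for a, b, c in product([0, 1], repeat=3):
        p_b = FLIP if b != a else 1 - FLIP
        p_c = FLIP if c != b else 1 - FLIP
        dist[(a, b, c)] = 0.5 * p_b * p_c
    return dist

def max_dependence(dist, b, keep=lambda a, b, c: True):
    """Largest |P(a,c | b,E) - P(a | b,E) P(c | b,E)|, where E is the event `keep`."""
    sub = {k: v for k, v in dist.items() if k[1] == b and keep(*k)}
    z = sum(sub.values())
    gap = 0.0
    for a, c in product([0, 1], repeat=2):
        p_ac = sub.get((a, b, c), 0.0) / z
        p_a = sum(v for k, v in sub.items() if k[0] == a) / z
        p_c = sum(v for k, v in sub.items() if k[2] == c) / z
        gap = max(gap, abs(p_ac - p_a * p_c))
    return gap

d = joint()
prior_gap = max_dependence(d, b=0)  # no observation: chain structure holds
posterior_gap = max_dependence(d, b=0, keep=lambda a, b, c: a + c == 1)
```

Here `prior_gap` is zero up to floating-point error, while `posterior_gap` is 0.25: once A+C=1 is known, A and C are perfectly anti-correlated even with B fixed.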
This is also a way to think about the proposals in this post and the reply:
The human believes that A’ and B’ are related in a certain way for simple+fundamental reasons.
On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reason that they reproduce the expected relationship is quite different.
For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verify that the probabilities work out. (That is, I can replace all of the rest of the computational graph with nonsense, or independent samples, and get the same relationship.)
For the bad function, you have to look at basically the whole graph. That is, it’s not the case that the human’s beliefs about A’ and B’ have the right relationship for arbitrary Ys, they only have the right relationship for a very particular distribution of Ys. So to see that A’ and B’ have the right relationship, we need to simulate the actual underlying dynamics where A --> B, since that creates the correlations in Y that actually lead to the expected correlations between A’ and B’.
It seems like we believe not only that A’ and B’ are related in a certain way, but that the relationship should be for simple reasons, and so there’s a real sense in which it’s a bad sign if we need to do a ton of extra compute to verify that relationship. I still don’t have a great handle on that kind of argument. I suspect it won’t ultimately come down to “faster is better,” though as a heuristic that seems to work surprisingly well. I think that this feels a bit more plausible to me as a story for why faster would be better (but only a bit).
It’s not always going to be quite this cut and dried—depending on the structure of the human beliefs we may automatically get the desired relationship between A’ and B’. But if that’s the case then one of the other relationships will be a contingent fact about Y—we can’t reproduce all of the expected relationships for arbitrary Y, since our model presumably makes some substantive predictions about Y and if those predictions are violated we will break some of our inferences.
So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.
We believe that A’ and C’ should be independent conditioned on B’. One problem is that this isn’t even true, because B’ is a coarse-graining and so there are in fact correlations between A’ and C’ that the human doesn’t understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B’. For example, if you imagine Y preserving some facts about A’ and C’, and if the human is sometimes mistaken about B’=B, then we will introduce extra correlations between the human’s beliefs about A’ and C’.
I think it’s pretty plausible that there are necessarily some “new” correlations in any case where the human’s inference is imperfect, but I’d like to understand that better.
So I think the biggest problem is that none of the human’s believed conditional independencies actually hold: they are both approximate and (more problematically) may themselves only hold “on distribution” in some appropriate sense.
This problem seems pretty approachable though and so I’m excited to spend some time thinking about it.
Causal structure is an intuitively appealing way to pick out the “intended” translation between an AI’s model of the world and a human’s model. For example, intuitively “There is a dog” causes “There is a barking sound.” If we ask our neural net questions like “Is there a dog?” and it computes its answer by checking “Does a human labeler think there is a dog?” then its answers won’t match the expected causal structure—so maybe we can avoid these kinds of answers.
What does that mean if we apply typical definitions of causality to ML training?
If we define causality in terms of interventions, then this helps iff we have interventions in which the labeler is mistaken. In general, it seems we could just include examples with such interventions in the training set.
Similarly, if we use some kind of closest-possible-world semantics, then we need to be able to train models to answer questions consistently about nearby worlds in which the labeler is mistaken. It’s not clear how to train a system to do that. Probably the easiest is to have a human labeler in world X talking about what would happen in some other world Y, where the labeling process is potentially mistaken. (As in “decoupled RL” approaches.) However, in this case it seems liable to learn the “instrumental policy” that asks “What does a human in possible world X think about what would happen in world Y?” which seems only slightly harder than the original.
We could talk about conditional independencies that we expect to remain robust on new distributions (e.g. in cases where humans are mistaken). I’ll discuss this a bit in a reply.
Here’s an abstract example to think about these proposals, just a special case of the example from this post.
Suppose that reality M is described as a causal graph X --> A --> B --> C, and then the observation Y is a function of (A, B, C).
The human’s model M’ of the situation is X --> A’ --> B’ --> C’. Each of them is a coarse-graining of the corresponding part of the real world model, and the observation Y is still a function of (A’, B’, C’), it’s just more uncertain now.
The coarse-grained dynamics are simpler than the actual coarse-graining f: (A, B, C) --> (A’, B’, C’).
We prepare a dataset by actually sampling (X, A, B, C, Y) from M, having humans look at it, make inferences about (A’, B’, C’), and get a dataset of (X, A’, B’, C’, Y) tuples to train a model.
The intended question-answering function is to use M to sample (A, B, C, Y) then apply the coarse-graining f to get (A’, B’, C’). But there is also a bad function that produces good answers on the training dataset: use M to sample (A, B, C, Y), then use the human’s model to infer (A’, B’, C’), and output those.
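To make the two functions concrete, here is a toy sketch. Everything specific in it is an assumption for illustration: the variables are small integers, X is dropped, the coarse-graining is a "low vs high" threshold, the human's inference just applies that threshold to the observation, and `human_fooled` stands in for any off-distribution situation where the human's inference is mistaken.

```python
import random

def sample_world(rng, human_fooled=False):
    """Hypothetical reality M: a chain A -> B -> C plus an observation Y."""
    a = rng.randint(0, 9)
    b = a + rng.randint(0, 1)
    c = b + rng.randint(0, 1)
    # On the training distribution Y reports (A, B, C) faithfully.
    shift = 5 if human_fooled else 0
    y = (a + shift, b + shift, c + shift)
    return (a, b, c), y

def coarse(v):
    """Coarse-graining f, applied to each variable separately."""
    return v >= 5

def intended_answer(state, y):
    """Apply f to the true state (A, B, C)."""
    return tuple(coarse(v) for v in state)

def bad_answer(state, y):
    """Ignore the true state and run the human's inference on Y."""
    return tuple(coarse(v) for v in y)

def agreement(n, human_fooled, seed=0):
    """Fraction of samples on which the two functions give the same answer."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        state, y = sample_world(rng, human_fooled)
        hits += intended_answer(state, y) == bad_answer(state, y)
    return hits / n

train_agreement = agreement(1000, human_fooled=False)
shifted_agreement = agreement(1000, human_fooled=True)
```

The two functions are indistinguishable on the training distribution (`train_agreement` is 1.0) and come apart only when the human's inference fails, which is why the dataset alone cannot rule out the bad function.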
We’d like to rule out this bad function by making some kind of assumption about causal structure.
This is interesting to me for two reasons:
[Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that’s a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea.
In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever the inner optimizer would have learned. This is an interesting setting to look at how that works out. (For example, in this case the outer optimizer just needs to be able to represent the hypothesis “There is a program that has property P and runs in time T’ ” and then do its own search over that space of faster programs.)
The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we’d end up using is basically “2x longer = 2 more bits” instead of “2x longer = 1 more bit,” i.e. we end up caring more about speed because we delegated.
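The arithmetic in this example can be sketched as follows. Two assumptions are baked in and are not claims from the text beyond the single data point it gives: the search procedure's own description and the object-level program's own runtime are treated as negligible, and the "1000 bits in 2^500 steps" figure is extrapolated to "n bits in 2^(n/2) steps."

```python
def speed_prior_bits(description_bits, log2_runtime):
    """Speed-prior complexity: description length plus log2 of running time."""
    return description_bits + log2_runtime

# Assumption: ignore the object-level program's runtime and the
# search procedure's description length (both set to 0 here).
object_level = speed_prior_bits(description_bits=1000, log2_runtime=0)
local_search = speed_prior_bits(description_bits=0, log2_runtime=500)

def bits_reachable(log2_search_steps):
    """Assumed scaling: an n-bit program is found in 2^(n/2) search steps."""
    return 2 * log2_search_steps

# Doubling the search time (one more in log2) buys two more bits of program.
extra_bits_per_doubling = bits_reachable(501) - bits_reachable(500)
```

So the local search wins by 500 bits, and the effective prior after delegating is "2x longer = 2 more bits."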
The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it’s likely possible to “expose” what is going on to the outer optimizer (so that it finds a hypothesis like “This local search algorithm is good” and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I’d guess intuitively that it’s just not even meaningful to talk about the “simplest” programs or any prior that cares less about speed than the optimal search algorithm.
In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn’t true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it’s still fine.)
But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it’s also often not true for scientific explanations—even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.
It’s quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the “scientific explanation” case, and maybe it’s the way to go for the kinds of alignment problems I’ve been thinking about recently.
A basic challenge with searching over programs is that we have to interpret the other data. For example, if “correspondence between two models of physics” is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be “attacks” on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space.
It’s also quite plausible that we’ll just give up on things like consistency conditions. But those come up frequently enough in intuitive alignment schemes that I at least want to give them a fair shake.
The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force—if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.
That means that in the regime where P is simple, the speed prior is the “least you can reasonably care about speed”—if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.
(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)
To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the “object-level” program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
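The numbers in this paragraph can be checked with a short sketch. The function and constant names are invented for the example, and the search procedure's own description length is assumed to be negligible (0 bits).

```python
def prior_bits(description_bits, log2_runtime, log2_slowdown_per_bit):
    """Complexity = description bits + log2(runtime) / log2(slowdown costing one bit)."""
    return description_bits + log2_runtime / log2_slowdown_per_bit

SPEED = 1          # speed prior: 2x longer costs 1 bit
KIND_OF_SPEED = 2  # kind-of-speed prior: 4x longer costs 1 bit

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_kind_of_speed = prior_bits(1000, 2000, KIND_OF_SPEED)
m_speed = prior_bits(1000, 2000, SPEED)

# A speed-prior search finds M after about 2^(speed complexity of M) steps.
search_log2_runtime = m_speed
search_kind_of_speed = prior_bits(0, search_log2_runtime, KIND_OF_SPEED)
```

This gives 2000 bits for running M directly versus 1500 bits for the search under the kind-of-speed prior, so the search is preferred.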
Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.
Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program “Search for the simplest program running in time T’ that has property P, then run that.” (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.)
I may be happy enough with this outcome, but there’s some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.
The results look quite different for Houdini 3 vs SF8. Is this just a matter of Stockfish being much better optimized for small amounts of hardware?
We might be able to get similar advantages with a more general proposal like:
Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
Then the idea is that matching the conditional probabilities from the human’s model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.
It’s not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It’s also not clear what happens if this consistency condition is soft.
It’s not clear what “verify that the consistency conditions are met” means. You can always do the same proposal as in the parent, though it’s not really clear if that’s a convincing verification. But I think that’s a fundamental philosophical problem that both of these proposals need to confront.
It’s not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don’t think this is worse than the prior proposal.
Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.
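For the point about balancing computational cost against the QA objective, "being on the Pareto frontier" can be sketched as a simple dominance filter. The candidate names and numbers below are entirely hypothetical placeholders, not results.

```python
def pareto_frontier(candidates):
    """Return names of candidates not dominated on (qa_loss, verify_cost).

    Lower is better on both axes. A candidate is dominated if some other
    candidate is at least as good on both and strictly better on one.
    """
    frontier = []
    for name, loss, cost in candidates:
        dominated = any(
            (l2 <= loss and c2 <= cost) and (l2 < loss or c2 < cost)
            for _, l2, c2 in candidates
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (name, QA loss, cost of verifying consistency) triples.
candidates = [
    ("f_intended", 0.10, 3.0),
    ("f_copy_human", 0.10, 9.0),
    ("f_trivial", 0.90, 1.0),
    ("f_wasteful", 0.50, 9.5),
]
frontier = pareto_frontier(candidates)
```

In this made-up example the function that fits equally well but is more expensive to verify drops off the frontier without any explicit exchange rate between the two objectives.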
Here’s another approach to “shortest circuit” that is designed to avoid this problem:
Learn a circuit C(X) that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
Impose a consistency requirement on those beliefs, even in cases where a human can’t tell the right answer.
Require C(X)’s beliefs about Y to match Fθ(X). We hope that this makes C an explication of “Fθ’s beliefs.”
Optimize some combination of (complexity) vs (usefulness), or chart the whole Pareto frontier, or whatever. I’m a bit confused about how this step would work but there are similar difficulties for the other posts in this genre so it’s exciting if this proposal gets to that final step.
The “intended” circuit C just follows along with the computation done by Fθ and then translates its internal state into natural language.
What about the problem case where Fθ computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I’ll imagine those being written down somewhere on a slip of paper inside of Fθ’s model of the world.
Suppose that the slip of paper is not relevant to predicting Fθ(X), i.e. it’s a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we’re in business.
So suppose that the slip of paper is relevant for predicting Fθ(X), e.g. because someone looks at the slip of paper and then takes an action that affects Y. If (the correct) Y is itself depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by C still have to contain both the items on the slip of paper as well as some facts that are causally downstream of the slip of paper (as well as hopefully some about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of C.
Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition—the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.
I was proposing exempting the short-term risk-free rate, and I was imagining using the 30-day Treasury yield as the metric. (The post originally said that but it got simplified in the interest of clarity. Of course “savings account” is vague since they pay different amounts with different risk, but it seems to communicate basically the same stuff.) That’s also roughly the rate at which you’d borrow if using leverage to offset your tax burden (e.g. it’s roughly the rate embedded in futures or at which investors can borrow on margin).
Very interesting, thanks!
Could you confirm how much you have to scale down SF13 in order to match SF3? (This seems similar to what you did last time, but a more direct comparison.)
The graph from last time makes it look like SF13 would match Rebel at about 20k nodes/move. Could you also confirm that?
Looking forward to seeing the scaled-up Rebel results.
In another comment you wrote “In between is the region with ~70 ELO; that’s where engines usually operate on present hardware with minutes of think time” which made sense to me, I’m just trying to square that with this graph.
Recently I’ve been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.
My initial focus was on preventing re-using predictive models of humans. But I’m feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.
I’m currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like “lining up” the two models and seeing what parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don’t necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.
If this approach succeeds, then it is also directly applicable to avoiding re-using human models: we want to be lining up the internal computation of our model with concepts like “There is a cat in the room” rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the “re-using human models” problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or “locality” seems like a really central constraint for matching up world models, and doing that approach naively can greatly exacerbate the problems with copying the training process.)
So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails then I hope to at least know a bit more about the world model mismatch problem.
I think the best analogy in existing practice is probably doing interpretability work: matching up the AI’s model to my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a “simplicity prior” is very natural when doing interpretability, instead using other considerations to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture on what interpretability should do (including in the regime where the correspondence is quite complex, and when the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)