I know that this is a common argument against amplification, but I’ve never found it super compelling. People often point to evil corporations to show that unaligned behavior can emerge from aligned humans, but I don’t think this analogy is very strong. Humans in fact do not share the same goals and are generally competing with each other over resources and power, which seems like the main source of inadequate equilibria to me.
If everyone in the world were a copy of Eliezer, I don’t think we would have a coordination problem around building AGI. They would probably have an Eliezer government that is constantly looking out for emergent misalignment and suggesting organizational changes to squash it. Since everyone in this world is optimizing for making AGI go well and not for profit or status among their Eliezer peers, all you have to do is tell them what the problem is and what they need to do to fix it. You don’t have to threaten them with jail time or worry that they will exploit loopholes in Eliezer law. I think it is quite likely that I am missing something here, and it would be great if you could flesh this argument out a little more or direct me towards a post that does.
That’s a good point. I guess I don’t expect this to be a big problem because:

1. I think 1,000,000 copies of myself could still get a heck of a lot done.
2. The first human-level AGI might be way more creative than your average human. It would probably be trained on data from billions of humans, so all of those different ways of thinking could be latent in the model.
3. The copies can potentially diverge. I’m expecting the first transformative model to be stateful and be able to meta-learn. This could be as simple as giving a transformer read and write access to an external memory and training it over longer time horizons. The copies could meta-learn on different data and different sub-problems and bring different perspectives to the table.
Wait… I’m quite confused. In the decision rule, how is the set of environments ‘E’ determined? If it contains every possible environment, then this means I should behave as if I am in the worst possible world, which would cause me to do some crazy things.

Also, when you say that an infra-bayesian agent models the world with a set of probability distributions, what does this mean? Does the set contain every distribution that would be consistent with the agent’s observations? But isn’t that almost all probability distributions? Some distributions match the data better than others, so do you weigh them according to P(observations | data-generating distribution)? But then what would you do with these weights?
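For concreteness, here is the shape I currently have in mind for the rule (my paraphrase of a standard maximin form, not necessarily the post’s exact formulation):

$$\pi^* \in \arg\max_{\pi} \; \min_{e \in E} \; \mathbb{E}_{e}\!\left[U(\pi)\right]$$

If $E$ really contains every possible environment, the inner $\min$ is what forces the worst-case behavior I’m worried about.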
Sorry if I am missing something obvious. I guess this would have been clearer for me if you had explained the infra-bayesian framework a little more before introducing the decision rule.
Interesting post! I’m not sure I understand the connection between infra-bayesianism and Newcomb’s paradox very well. The decision procedure you outlined in the first example seems equivalent to an evidential decision theorist placing 0 credence on worlds where Omega makes an incorrect prediction. What is the infra-bayesian framework doing differently? It just looks like the credence distribution over worlds is disguised by the ‘Nirvana trick.’
It’s great that you are trying to develop a more detailed understanding of inner alignment. I noticed that you didn’t talk about deception much. In particular, the statement below is false:
Generalization ⇔ accurate priors + diverse data
You have to worry about what John Wentworth calls ‘manipulation of imperfect search.’ You can have accurate priors and diverse data, and (unless you have infinite data) the training process could still produce a deceptive agent that is able to maintain its misalignment.
I’m guessing that you are referring to this:
Another strategy is to use intermittent oversight – i.e. get an amplified version of the current aligned model to (somehow) determine whether the upgraded model has the same objective before proceeding.
The intermittent oversight strategy does depend on some level of transparency. It is only one of the ideas I mentioned, though (and it is not original). The post in general does not assume anything about our transparency capabilities.
I’m not sure I understand. We might not be on the same page.

Here’s the concern I’m addressing: let’s say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than training the human-level AGI, since you need a training signal that’s better than human feedback/imitation.
Here’s the point I am making about that concern: it might actually be quite easy to scale an already-aligned AGI up to superintelligence—even if you don’t have a scalable outer-aligned training signal—because the AGI will be motivated to crystallize its aligned objective.
Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work.
The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting
When I said that the ‘human-level’ AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally aligned), not that it has an objective that was functionally aligned on the training distribution but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition...
Adding some thoughts that came out of a conversation with Thomas Kwa:
Gradient hacking seems difficult. Humans have pretty weak introspective access to their goals: I have a hard time telling whether my goals have changed or whether I have merely gained information about what they are. There isn’t a good reason to believe that the AIs we build will be different.
Safety and value alignment are generally toxic words, currently. Safety is becoming more normalized due to its associations with uncertainty, adversarial robustness, and reliability, which are thought respectable. Discussions of superintelligence are often derided as “not serious”, “not grounded,” or “science fiction.”
Here’s a relevant question in the 2016 survey of AI researchers:
These numbers seem to conflict with what you said, but maybe I’m misinterpreting you. If there is a conflict here, do you think that if this survey were run again, the results would be different? Or do you think these responses do not provide an accurate impression of how researchers actually feel/felt (maybe because of agreement bias or something)?
I have an objection to the point about how AI models will be more efficient because they don’t need to do massive parallelization:
Massive parallelization is useful for AI models too, and for somewhat similar reasons. Parallel computation allows the model to spit out a result more quickly. In the biological setting, this is great because it means you can move out of the way when a tiger jumps toward you. In the ML setting, this is great because it allows the gradient to be computed more quickly. The disadvantage of parallelization is that more hardware is required. In the biological setting, this means bigger brains. Big brains are costly: they use up a lot of energy and make childbearing more difficult, since the skull needs to fit through the birth canal.

In the ML setting, however, big brains are not as costly. We don’t need to fit our computers in a skull. So it is not obvious to me that ML models will do fewer computations in parallel than biological brains do.

Some relevant information:
According to Scaling Laws for Neural Language Models, model performance depends strongly on model size but very weakly on shape (depth vs width).
An explanation for the above is that deep residual networks have been observed to behave like ensembles of shallow networks.
GPT-3 uses 96 layers (decoder blocks). That isn’t very many serial computations. If a matrix multiplication, softmax, ReLU, or vector addition counts as an atomic computation, then there are roughly 11 serial computations per layer, so that’s only 1056 serial computations (see the quick arithmetic sketch after this list). It is unclear how to compare this to biological neurons, as each neuron may require a number of these serial computations to simulate properly.
PaLM has 3 times more parameters than GPT-3 but only 118 layers.
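As a quick sanity check on the arithmetic above, here is the depth calculation in code (the per-layer count of 11 is my rough tally of atomic ops, an assumption rather than an exact architecture spec):

```python
# Rough serial-depth estimate for a decoder-only transformer.
# serial_ops_per_layer = 11 is an illustrative assumption from the
# tally above (matmuls, softmax, ReLU, vector additions per block).

def serial_depth(num_layers, serial_ops_per_layer=11):
    """Sequential atomic ops on the critical path of one forward pass."""
    return num_layers * serial_ops_per_layer

print(serial_depth(96))   # GPT-3, 96 layers  -> 1056
print(serial_depth(118))  # PaLM, 118 layers  -> 1298
```

Either way, the serial depth comes out in the low thousands, which is the point of the comparison.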
Here’s another milestone in AI development that I expect to happen in the next few years and which could be worth noting: I don’t think any of the large language models that currently exist write anything to an external memory. You can get a chatbot to hold a conversation and ‘remember’ what was said by appending the dialogue to its next input, but I’d imagine this would get unwieldy if you want your language model to keep track of details over a large number of interactions. Fine-tuning a language model so that it makes use of a memory could lead to:

1. More consistent behavior
2. ‘Mesa-learning’ (it could learn things about the world from its inputs instead of just by gradient descent)
This seems relevant from a safety perspective because I can imagine ‘mesa-learning’ turning into ‘mesa-agency.’
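To make the distinction concrete, here is a minimal sketch of the two designs. Everything here is hypothetical: `model.generate` stands in for any autoregressive LM API, and the read/write logic is the simplest thing that could work, not a claim about how the fine-tuning would actually be done.

```python
class ContextWindowChatbot:
    """'Remembers' by replaying the whole transcript each turn.
    The prompt grows with every interaction, which is what gets unwieldy."""

    def __init__(self, model):
        self.model = model
        self.transcript = []

    def reply(self, user_msg):
        self.transcript.append(f"User: {user_msg}")
        prompt = "\n".join(self.transcript)
        response = self.model.generate(prompt)
        self.transcript.append(f"Bot: {response}")
        return response


class ExternalMemoryChatbot:
    """Reads from and writes to a persistent store instead of replaying
    the transcript; the stored state survives across sessions."""

    def __init__(self, model, memory=None):
        self.model = model
        self.memory = {} if memory is None else memory

    def reply(self, user_msg):
        # Read: retrieve only the notes whose keys appear in the message.
        notes = [v for k, v in self.memory.items() if k in user_msg]
        prompt = "Notes: " + "; ".join(notes) + f"\nUser: {user_msg}"
        response = self.model.generate(prompt)
        # Write: store a note keyed on the message (a fine-tuned model
        # would instead learn what is worth writing).
        self.memory[user_msg] = response
        return response
```

The write step in the second design is where I’d expect mesa-learning to show up: the model’s future behavior changes without any gradient update.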
I’m pretty confused about the plan to use ELK to solve outer alignment. If Cakey is not actually trained, how are amplified humans accessing its world model?

“To avoid this fate, we hope to find some way to directly learn whatever skills and knowledge Cakey would have developed over the course of training without actually training a cake-optimizing AI...
Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.
Avoid the problem of the most helpful instructions being opaque (e.g. “Run this physics simulation, it’s great”) by solving ELK — i.e., finding a mapping from whatever possibly-opaque model of the world happens to be most useful for making superhumanly delicious cakes to concepts humans care about like “people” being “alive.”
Spell out a procedure for scoring predicted futures that could be followed by an amplified human who has access to a) Cakey’s great world model, and b) the correspondence between it and human concepts of interest. We think this procedure should choose scores using some heuristic along the lines of “make sure humans are safe, preserve option value, and ultimately defer to future humans about what outcomes to achieve in the world” (we go into much more detail in Appendix: indirect normativity).
Distill their scores into a reward model that we use to train Hopefully-Aligned-Cakey, which hopefully uses its powers to help humans build the utopia we want.”
I don’t think I agree that this undermines my argument. I showed that the utility function of person 1 is of the form h(x + y), where h is monotonic increasing. This respects the fact that the utility function is not unique: 2(x + y) + 1 would qualify, as would 3 log(x + y), etc.

Showing that the utility function must have this form is enough to prove total utilitarianism in this case, since when you compare h(x + y) to h(x’ + y’), h becomes irrelevant. It is the same as comparing x + y to x’ + y’.
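Spelling out the step where h drops out (assuming ‘monotonic increasing’ means strictly increasing, so that h both preserves and reflects order):

$$h(x + y) \geq h(x' + y') \iff x + y \geq x' + y'$$

So any two outcomes are ranked exactly as total utilitarianism would rank them, regardless of which h we picked.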
This is a much more agreeable assumption. When I get a chance, I’ll make sure it can replace the fairness one, add it to the proof, and give you credit.
I am defining it as you said. They are like movie frames that haven’t been projected yet. I agree that the pre-arranged nature of the snapshots is irrelevant—that was the point of the example (sorry that this wasn’t clear).

The purpose of the example was to falsify the following hypothesis: “In order for a simulation to produce conscious experiences, it must compute the next state based on the previous state. It can’t just ‘play the simulation from memory.’”
Maybe what you are getting at is that this hypothesis doesn’t do justice to the intuitions that inspired it. Something complex is happening inside the brain that is analogous to ‘computation on-demand,’ and this differentiates the computer-brain system from the stack of papers being moved around. This seems legit… I’d just like a more precise understanding of what this ‘computation on-demand’ property is.