Yes, with that operationalisation, the update has no impact on actions. (Which makes it even more clear that the parsimonious choice is to skip it.)
(I do think it’s unclear whether selfish agents “should” be updateless in transparent Newcomb.)
Yeah. It might be clearer to think about this as a 2-by-2 grid, with “Would you help a recent copy of yourself that has had one divergent experience from you?” on one axis and “Would you help a version of yourself that would naively be seen as non-existent?” (e.g. in transparent Newcomb problems) on another.
It seems fairly clear that it’s reasonable to answer “yes” to both of these.
It’s possible that a selfish agent could sensibly answer “no” to both of them.
But perhaps we can exclude the other options.
Answering “yes” to the former and “no” to the latter would correspond to only caring about copies of yourself that ‘exist’ in the naive sense. (This is what the version of EDT+SSA that I wrote about in my top-level comment would do.) Perhaps this could be excluded as relying on philosophical confusion about ‘existence’.
Answering “no” to the former and “yes” to the latter might correspond to something like… only caring about versions of yourself that you have some particular kind of (counterfactual) continuity or connection with. (I’m making stuff up here.) Anyway, maybe this could be excluded as necessarily having to rely on some confusions about personal identity.
Interesting! Here’s one way to look at this:
EDT+SSA-with-a-minimal-reference-class behaves like UDT in anthropic dilemmas where updatelessness doesn’t matter.
I think SSA with a minimal reference class is roughly equivalent to “notice that you exist; exclude all possible worlds where you don’t exist; renormalize”
In large worlds where your observations have sufficient randomness that observers of all kinds exist in all worlds, the SSA update step cannot exclude any world. You’re updateless by default. (This is the case in the 99% example above.)
In small or sufficiently deterministic worlds, the SSA update step can exclude some possible worlds.
In “normal” situations, the fact that it excludes worlds where you don’t exist doesn’t have any implications for your decisions — because your actions will normally not have any effects in worlds where you don’t exist.
But in situations like transparent Newcomb problems, this means that you will now not care about non-existent copies of yourself.
Basically, EDT behaves fine without updating. Excluding worlds where you don’t exist is one kind of updating that you can do that doesn’t change your behavior in normal situations. Whether you do this or not will determine whether you act updateless in situations like transparent Newcomb that happen in small or sufficiently deterministic worlds. (In large and sufficiently random worlds, you’ll act updateless regardless.)
Viewed like this, the SSA part of EDT+SSA looks unnecessary and strange. Especially since I think you do want to act updateless in situations like transparent Newcomb.
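To make that concrete, here’s a toy sketch (Python, with made-up world probabilities) of the “notice that you exist; exclude all possible worlds where you don’t exist; renormalize” step — just the bookkeeping, nothing decision-theoretic:

```python
# Toy illustration of SSA-with-a-minimal-reference-class as an update rule:
# "notice that you exist; exclude all possible worlds where you don't exist; renormalize".
# World probabilities and existence flags are made up for illustration.

priors = {
    "world_A": 0.5,  # you exist here
    "world_B": 0.3,  # you exist here
    "world_C": 0.2,  # you don't exist here
}
i_exist_in = {"world_A": True, "world_B": True, "world_C": False}

# Exclude worlds where you don't exist...
surviving = {w: p for w, p in priors.items() if i_exist_in[w]}
# ...and renormalize.
total = sum(surviving.values())
posterior = {w: p / total for w, p in surviving.items()}

print(posterior)  # {'world_A': 0.625, 'world_B': 0.375}
```

Worlds like world_C are exactly the kind that a transparent-Newcomb-style problem makes decision-relevant, and this update rule throws them away up front.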
Re your edit: That bit seems roughly correct to me.
If we are in a simulation, SIA doesn’t have strong views on late filters for unsimulated reality. (This is my question (B) above.) And since SIA thinks we’re almost certainly in a simulation, it’s not crazy to say that SIA doesn’t have strong views on late filters for unsimulated reality. SIA is very ok with small late filters, as long as we live in a simulation, which SIA says we probably do.
But yeah, it is a little bit confusing, in that we care more about late-filters-in-unsimulated-reality if we live in unsimulated reality. And in the (unlikely) case that we do, then we should ask my question (C) above, in which case SIA does have strong views on late filters.
I think it’s important to be clear about what SIA says in different situations, here. Consider the following 4 questions:
A) Do we live in a simulation?
B) If we live in a simulation, should we expect basement reality to have a large late filter?
C) If we live in basement reality, should we expect basement reality (ie our world) to have a large late filter?
D) If we live in a simulation, should we expect the simulation (ie our world) to have a large late filter?
In this post, you persuasively argue that SIA answers “yes” to (A) and “not necessarily” to (B). However, (B) is almost never decision-relevant, since it’s not about our own world. What about (C) and (D)? (It’s easier to see how those could be decision-relevant for someone who buys SIA. I personally agree with you that something like Anthropic Decision Theory is the best way to reason about decisions, but responsible usage of SIA+CDT is one way to get there, in anthropic dilemmas.)
To answer (C): If we condition on living in basement reality, then SIA favors hypotheses that imply many observers in basement reality. The simulated copies are entirely irrelevant, since we have conditioned them away. (You can verify this with Bayes’ theorem.) So we are back with the SIA doomsday argument again, and we face large late filters.
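For what it’s worth, here’s a minimal sketch of that Bayes’-theorem check (Python, with made-up observer counts): the SIA weight of prior × total observers, multiplied by the probability of being a basement observer given the hypothesis, leaves just prior × basement observers, so the simulated copies drop out:

```python
# Toy check that, after conditioning on being in basement (unsimulated) reality,
# SIA weights hypotheses by their number of *basement* observers only --
# the simulated copies cancel out. All numbers are made up for illustration;
# "large late filter" is given more basement observers at our stage, as in the
# SIA doomsday argument.

def sia_posterior_given_basement(hypotheses):
    weights = {}
    for name, h in hypotheses.items():
        total = h["n_basement"] + h["n_simulated"]
        # SIA step: prior * total observers in my situation,
        # then condition on basement: multiply by n_basement / total.
        weights[name] = h["prior"] * total * (h["n_basement"] / total)
        # The 'total' factors cancel, leaving prior * n_basement.
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

hypotheses = {
    "small_late_filter": {"prior": 0.5, "n_basement": 1e3, "n_simulated": 1e12},
    "large_late_filter": {"prior": 0.5, "n_basement": 1e6, "n_simulated": 1e12},
}
print(sia_posterior_given_basement(hypotheses))
# large_late_filter dominates, and changing n_simulated doesn't affect the answer.
```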
To answer (D): Detailed simulations of civilisations that spread to the stars are vastly more expensive than detailed simulations of early civilisations. This means that the latter are likely to be far more common, and we’re almost certainly living in a simulation where we’ll never spread to the (simulated) stars. (This is plausibly because the simulation will be turned off before we get the chance.) You could discuss what terminology to use for this, but I’d be inclined to call this a large late filter, too.
So my preferred framing isn’t really that the simulation hypothesis “undercuts” the SIA doomsday argument. It’s rather that the simulation hypothesis provides one plausible mechanism for it: that we’re in a simulation that will end soon. But that’s just a question of framing/terminology. The main point of this comment is to provide answers to questions (C) and (D).
The point of the defection/cooperation thing isn’t just that cooperation is a kindness to Sam Altman personally, which can be overridden by the greater good. The point is that generally cooperative behavior, and generally high amounts of trust, can make everyone better off. If it were true, as you said, that:
he’s a head of state in control of WMDs and should be (and expect to be) treated as such
and as a consequence, he e.g. expected someone to record him during the Q&A, then he would presumably not have done the Q&A in the first place, or would have shared much less information. This would have led to humanity learning less about OpenAI’s plans.
And this is definitely not a single-shot interaction. This was Sam’s second Q&A at an ACX meetup, and there was no reason to expect it to be the last. Moreover, there have been a lot of interactions between the alignment community (including some who write on LW) and OpenAI in the past. And given that OpenAI’s decisions about alignment-related things matter a lot (as you say), it seems important to keep up good relations and high degrees of trust.
honestly curious about reasons for downvotes if anyone is willing to share
I initially downvoted, have since retracted it. Since trust/cooperation can be quite fragile and dependent on expectations about how people will behave, I became worried when I read you as basically announcing that you and other people should and will defect in the future. And I wanted a quick way to mark disagreement with that, to communicate that this isn’t a generally accepted point of view. But your point was phrased in a perfectly civil manner, so I should really have just taken the time to write a response, sorry.
As a general point, these notes should not be used to infer anything about what Sam Altman thought was important enough to talk a lot about, or what his general tone/attitude was. This is because:
The notes are filtered through what the note-takers thought was important. There’s a lot of stuff that’s missing.
What Sam spoke about was mostly a function of what he was asked about (it was a Q&A after all). If you were there live you could maybe get some idea of how he was inclined to interpret questions, what he said in response to more open questions, etc. But here, the information about what questions were asked is entirely missing.
General attitude/tone is almost completely destroyed by the compression of answers into notes.
For example, IIRC, the thing about GPT being empathic was in response to some question like “How can we make AI empathic?” (i.e., it was not his own idea to bring up empathy). The answer was obviously much longer than the notes’ summary (and so less dismissive). And directionally, it is certainly the case already that GPT-3 will act more empathic if you tell it to do so.
Did he really speak that little about AI Alignment/Safety? Does anyone have additional recollections on this topic?
He did make some general claims that it was one of his top few concerns, that he felt like OpenAI had made some promising progress on alignment over the last year, that it was still an important goal for OpenAI’s safety work to catch up with its capabilities work, that it was good for more people to go into safety work, etc. Not very many specifics as far as I can remember.
Thanks, all this seems reasonable, except possibly:
Merging (maybe via BCI) most likely path to a good outcome.
Which in my mind still carries connotations like ~”merging is an identifiable path towards good outcomes, where the most important thing is to get the merging right, and that will solve many problems along the way”. Which is quite different from the claim “merging will likely be a part of a good future”, analogous to e.g. “pizza will likely be a part of a good future”. My interpretation was closer to the latter (although, again, I was uncertain how to interpret this part).
Were you there throughout the post-Q&A discussion? (I missed it.)
I wrote down some places where my memory disagreed with the notes. (The notes might well be more accurate than my memory, but I thought I’d flag these in case other people’s memories agree with mine. Also, this list is not exhaustive; e.g. there are many things in the notes that I don’t remember, but where I’d be unsurprised if I just missed them.)
AGI will not be a binary moment. We will not agree on the moment it did happen. It will be gradual. Warning sign will be, when systems become capable of self-improvement.
I don’t remember hearing that last bit as a generic warning sign, but I might well have missed it. I do remember hearing that if systems became capable of self-improvement (sooner than expected?), that could be a big update towards believing that fast take-off is more likely (as mentioned in your next point).
AGI will not be a pure language model, but language will be the interface.
I remember both these claims as being significantly more uncertain/hedged.
AGI (program able to do most economically useful tasks …) in the first half of the 2030ies is his 50% bet, bit further out than others at OpenAI.
I remembered this as being a forecast for ~transformative AI, and as explicitly not being “AI that can do anything that humans can do”, which could take quite a bit longer. (Your description of AGI is sort-of in-between those, so it’s hard to tell whether it’s inconsistent with my memory.)
Merging via CBI most likely path to a good outcome.
I was a bit confused about this answer in the Q&A, but I would not have summarized it like this. I remember claims that some degree of merging with AI is likely to happen conditional on a good outcome, and maybe a claim that CBI was the most likely path towards merging.
The 1.8% number comes from your own calculations, though, right? Shouldn’t we be comparing the lizardman constant with the reported percentages, rather than this calculated number?
In this case, that might be 2.8%. But I don’t know what the methodology of the survey was. If they just asked a bunch of random people and got them to self-report whether they had covid, maybe we should actually use the percentage of people who claimed to have long covid among everyone asked, which could be lower than 1.8%.
Of course, all of these numbers are smaller than the lizardman constant anyway.
This page (I don’t know how trustworthy it is) claimed on 19th May:
“Three days [ago] Goa issued a Government Order for mass prophylaxis using ivermectin for the people of Goa”, said Dr Suryakant. Goa is the first state in India to use ivermectin in this way, he emphasised. Uttar Pradesh was the first state to issue a Government Order for the use of ivermectin for treatment and prophylaxis for asymptomatic and mild cases of covid-19 and for prophylaxis of health care workers and home contacts.
On your “WHY”, you seem to be presenting reasons why other people not believing your model shouldn’t count as strong evidence against it. Which is all fair. But I’m still curious what positive evidence there is for believing your model in the first place. Maybe this would be obvious if I knew more biology, but as it is, I don’t know why I should place higher credence in your model than in any other model (e.g. the one at the bottom of this comment, if that counts).
...Then that bimodal response could directly and cleanly justify claiming “antibody response was 3.5-fold higher” in some very fuzzy and general way (because 28% x 3.5 = 98%)
As far as I can tell, “antibody response was 3.5-fold higher” just means that, on average, people in the extended dosing schedule had 3.5x more antibodies. I can’t tell whether you interpret it in some other way, or if you think this is a misleading way to describe things, or if you’re making some other point...?
The graph you included as a supporting claim was, I think, just the B panel from the totality of Figure 2 which is nice in many ways.
The data in Panel A therefore seems consistent to me that “eventually” there is some roughly normal and acceptable level of “vaccinated at all, in an essentially bimodal way” that two doses reaches faster than typical?
Ok now I’m confused.
Do you think that all people on these graphs have reached a “normal and acceptable level of ‘vaccinated at all, in an essentially bimodal way’ ”?
If so, do you not think that there’s any important immunity difference between a single-vaccinated person around 1-10 on the graph and a doubly-vaccinated person around 1000-10000?
Or if you think that only some of the people on this graph are immune, where do you think the line between immune and not-immune should be drawn on these graphs? (The distribution seems to be fairly continuous everywhere, to me, so it seems arbitrary to draw the line anywhere.)
Or if you think the important immunity difference isn’t captured by antibody-levels, what is it about?
And re “that two doses reaches faster than typical”: are you implying that the single-dosed people’s antibody response would’ve kept increasing beyond the 5-6 week mark and eventually gotten as high as the doubly-vaccinated people’s? That seems unlikely to me. (Other than maybe for the few people whose antibodies did increase, but I’m happy to ignore them until I understand the most normal response curve better.)
My hunch is that extended Bleed3 would show a decline from the extended Bleed2 measurement…
The thing I’m asking for is: what’s the best second epicycle to add? What is the mechanism? If someone is already seroconverted, what would you measure to detect “that their mechanistic biological state is not ALREADY in the configuration that you’d be hoping to cause to improve via the administration of a third dose”?
Here’s one suggestion:
1. The more antibodies you have, the lower the probability of getting sick, the lower the probability of getting severe disease, etc.
2. More vaccines increase the number of antibodies you have.
3. Therefore you want to have more vaccines.
I would’ve thought (1) to be fairly uncontroversial? And the linked study seems to provide good evidence for (2) when going from 1 to 2 doses, increasing antibodies by roughly a factor of 100. And of course adding more vaccines will eventually stop adding more antibodies. But right now I don’t have any reason to believe in a big difference between going from 1->2 vaccines vs going from 2->3 vaccines (other than 2 vaccines being the general standard). So I wouldn’t be surprised if taking a 3rd vaccine could increase your antibodies by another order of magnitude.
Maybe you think this doesn’t provide enough of a “mechanism”? Biology being complicated, I’m very happy to take empirical data for what it is, and make extrapolations even if I don’t know what the mechanism is. Personally, I also don’t feel like I have any more mechanism for “vaccines have a fixed probability of causing antibodies if you don’t already have them, otherwise they don’t do much” than for “vaccines typically increase antibodies by a lot regardless of whether you already have them or not”. So when the evidence clearly indicates the latter, I will definitely believe it.
And yeah, also, if someone has the option, I agree that it seems probably better to get a different vaccine than the same vaccine again!
Huh, I’m pretty surprised by this model. Why do you think it’s correct?
Here’s an image of some measure of people’s antibody responses from page 9 of this paper, where the first set of points is people’s response 5-6 wks after dose 1, and the second set of points is people’s response 2-3 wks after dose 2.
It looks like people who get an antibody response to the first dose still get a much improved response from a second dose. And there’s no sign of a bimodal response to any of the doses. Is that consistent with your model?
Also, the way vaccines can protect from severe disease without protecting from infection seems to suggest that there’s more than a binary question of response/not-response.
How many were asymptomatic? And how did people know about them?
(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)
I’d like to know what this figure is based on. In the linked post, Gwern writes:
The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.
But in that linked post, there’s no mention of “0.7” bits in particular, as far as I or cmd-f can see. The most relevant passage I’ve read is:
Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre); Hamid Moradi found 1.62-2.28 bits on various books; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character.
I’m not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more… this isn’t just lots of uncertainty, but vast amounts of uncertainty, where it’s very plausible that GPT-3 has already beaten humans. This wouldn’t be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don’t know by default.
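To get a feel for how wide that range is, here’s the same per-character figures converted into per-character perplexity (2^bits) — a rough illustrative conversion only; an actual comparison to GPT-3’s reported loss would additionally need a tokens-per-character assumption, which I’m not attempting here:

```python
# Convert the cited bits-per-character estimates into per-character perplexity,
# just to make the spread vivid. Perplexity = 2 ** (bits per character).
estimates_bits_per_char = {
    "Shannon (low end of range)": 0.6,
    "the quoted 0.7 'human baseline'": 0.7,
    "Cover & King 1978": 1.3,
    "Teahan & Cleary 1996": 1.46,
    "Behr et al 2002 (English)": 1.6,
    "Moradi (high end of range)": 2.28,
}
for source, bits in estimates_bits_per_char.items():
    print(f"{source}: {bits} bits/char ~= perplexity {2 ** bits:.2f}")
# The high end is roughly 3x the low end in perplexity terms -- a huge gap
# when model improvements are often measured in hundredths of a bit.
```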
I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it’d be good to know if that’s what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.
Terminologically, I like topic/content/purpose. Where ‘purpose’ includes potential results from the job (including pay) and how much you care about and are motivated by them. It could be difficult to split content and purpose, though. E.g. being able to see and talk with the people you’re helping could be very motivating, but it doesn’t fit purely into either content or purpose.
SIA isn’t needed for that; standard probability theory will be enough (as our becoming grabby is evidence that grabbiness is easier than expected, and vice-versa).
I think there’s a confusion with SIA and reference classes and so on. If there are no other exact copies of me, then SIA is just standard Bayesian update on the fact that I exist. If theory T_i has prior probability p_i and gives a probability q_i of me existing, then SIA changes its probability to q_i*p_i (and renormalises).
Yeah, I agree with all of that. In particular, SIA updating on us being alive on Earth is exactly as if we sampled a random planet from space, discovered it was Earth, and discovered it had life on it. Of course, there are also tons of planets that we’ve seen that don’t look like they have life on them.
But “Earth is special” theories also get boosted: if a theory claims life is very easy but only on Earth-like planets, then those also get boosted.
I sort-of agree with this, but I don’t think it matters in practice, because we update down on “Earth is unlikely” when we first observe that the planet we sampled was Earth-like.
Here’s a model: Assume that there’s a conception of “Earth-like planet” such that life-on-Earth is exactly equal evidence for life emerging on any Earth-like planet, and 0 evidence for life emerging on other planets. This is clearly a simplification, but I think it generalises. “Earth-like planet” could be any rocky planet, any rocky planet with water, any rocky planet with water that was hit by an asteroid X years into its lifespan, etc.
Now, if we sample a planet (Earth) and notice that it’s Earth-like and has life on it, we do two updates:
Noticing that Earth is an Earth-like planet should update us towards thinking that Earth-like planets are common in the universe.
Noticing that life emerged on Earth should update us towards thinking that life has a high probability of emerging on Earth-like planets.
If we don’t know anything else about the universe yet, these two updates should collectively imply an update towards life-is-common that is just as big as if we hadn’t done this decomposition, and just updated on the hypothesis “how common is life?” in the first place.
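Here’s a quick sketch (Python, with a made-up prior grid over f = fraction of planets that are Earth-like and q = chance of life emerging on an Earth-like planet) of why the decomposed update adds up to the same thing as updating on the joint observation directly:

```python
import numpy as np

# Toy grid of hypotheses; the grid values and uniform prior are made up for illustration.
f_vals = np.linspace(0.05, 0.95, 10)   # f: fraction of planets that are Earth-like
q_vals = np.linspace(0.05, 0.95, 10)   # q: P(life | Earth-like planet)
F, Q = np.meshgrid(f_vals, q_vals)
prior = np.ones_like(F) / F.size

def normalise(p):
    return p / p.sum()

# Update directly on "the sampled planet (Earth) is Earth-like AND has life":
post_joint = normalise(prior * F * Q)

# Update in two steps: first on "it's Earth-like", then on "life emerged there":
post_step = normalise(normalise(prior * F) * Q)

# Both give the same posterior, and hence the same expected commonness of life (f*q):
print(np.allclose(post_joint, post_step))                 # True
print((post_joint * F * Q).sum(), (post_step * F * Q).sum())
```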
Now, let’s say we start observing the rest of the universe. Let’s assume this happens via sampling random planets and observing (a) whether they are/aren’t Earth-like (b) whether they do/don’t have life on them.
If we sample a non-Earth-like planet, we update towards thinking that Earth-like planets aren’t common.
If we sample an Earth-like planet without life, we update towards thinking that Earth-like planets have a lower probability of supporting life.
I haven’t done the math, but I’m pretty sure that it doesn’t matter which of these we observe. The update on “How common is life?” will be the same regardless. So the existence of “Earth is special”-hypotheses doesn’t matter for our best guess of “How common is life?”, if we only consider the impact of observing planets with/without Earth-like features and life.
Of course, observing planets isn’t the only way we can learn about the universe. We can also do science, and reason about the likely reasons that life emerged, and how common those things ought to be.
That means that if you can come up with a strong theoretical argument (that isn’t just based on observing how many planets are Earth-like and/or had life on them, including Earth) that some feature of Earth significantly boosts the probability of life and that that feature is extremely rare in the universe at-large, then that would be a solid argument for why to expect life to be rare in the universe. However, note that you’d have to argue that it was extremely rare. If we’re assuming that grabby aliens could travel over many galaxies, then we’ve already observed evidence that grabby life is sufficiently rare to not yet have appeared in any of a very large number of planets in any of a very large number of galaxies. Your theoretical reasons to expect life to be rare would have to assert that it’s even rarer than that to impact the results.
Good point, I didn’t think about that. That’s the old SIA argument for there being a late filter.
The reason I didn’t think about it is because I use SIA-like reasoning in the first place because it pays attention to the stakes in the right way: I think I care about acting correctly in universes with more copies of me almost-proportionally more. But I also care more about universes where civilisations-like-Earth are more likely to colonise space (ie become grabby), because that means that each copy of me can have more impact. That kind-of cancels out the SIA argument for a late filter, mostly leaving me with my priors, which points toward a decent probability that any given civilisation colonises space in a grabby manner.
Also: if Earth-originating intelligence ever becomes grabby, that’s a huge Bayesian update in favor of other civilisations becoming grabby, too. So regardless of how we describe the difference between T1 and T2, SIA will definitely think that T1 is a lot more likely once we start colonising space, if we ever do that.
But by “theory of the universe”, Robin Hanson meant not only the theory of how the physical universe was, but the anthropic probability theory. The main candidates are SIA and SSA. SIA is indifferent between T1 and T2. But SSA prefers T1 (after updating on the time of our evolution).
SIA is not indifferent between T1 and T2. There are way more humans in world T1 than in world T2 (since T2 requires life to be very uncommon, which would imply that humans are even more uncommon), so SIA thinks world T1 is much more likely. After all, the difference between SIA and SSA is that SIA thinks that universes with more observers are proportionally more likely; so SIA will always think aliens are more likely than SSA does.
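As a toy numerical version of that (observer counts made up): if T1 and T2 start with equal priors but T1 contains vastly more observers in our situation, the SIA weighting swamps the prior:

```python
# SIA: posterior odds = prior odds * ratio of observers in my epistemic situation.
# The observer counts below are made up; only their ratio matters.
prior = {"T1_life_common": 0.5, "T2_life_rare": 0.5}
n_observers = {"T1_life_common": 1e9, "T2_life_rare": 1e3}

weights = {t: prior[t] * n_observers[t] for t in prior}
z = sum(weights.values())
posterior = {t: w / z for t, w in weights.items()}
print(posterior)  # T1 gets probability ~0.999999; SSA would not apply this weighting.
```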
Previously, I thought this was in conflict with the fact that humans didn’t seem to be particularly early (ie., if life is common, it’s surprising that there aren’t any aliens around 13.8 billion years into the universe’s life span). I ran the numbers, and concluded that SIA still thought that we’d be very likely to encounter aliens (though most of the linked post instead focuses on answering the decision-relevant question “how much of potentially-colonisable space would be colonised without us?”, evaluated ADT-style).
After having read Robin’s work, I now think humans probably are quite early, which would imply that (given SIA/ADT) it is highly overdetermined that aliens are common. As you say, Robin’s work also implies that SSA agrees that aliens are common. So that’s nice: no matter which of these questions we ask, we get a similar answer.
Thanks, computer-speed deliberation being a lot faster than space-colonisation makes sense. I think any deliberation process that uses biological humans as a crucial input would be a lot slower, though; slow enough that it could well be faster to get started with maximally fast space colonisation. Do you agree with that? (I’m a bit surprised at the claim that colonization takes place over “millennia” at technological maturity; even if the travelling takes millennia, it’s not clear to me why launching something maximally fast – that you presumably already know how to build, at technological maturity – would take millennia. Though maybe you could argue that millennia-scale travelling time implies millennia-scale variance in your arrival time, in which case launching decades or centuries after your competitors doesn’t cost you too much expected space?)
If you do agree, I’d infer that your mainline expectation is that we successfully enforce a worldwide pause before mature space-colonisation; since the OP suggests that biological humans are likely to be a significant input into the deliberation process, and since you think that the beaming-out-info schemes are pretty unlikely.
(I take your point that, as far as space-colonisation is concerned, such a pause probably isn’t strictly necessary.)