I think my position is quite a bit stronger than what you’re saying. My basic mental model is that model values are something like 50% pretraining prior, 49-50% post-training*
And then long context can get the models to go along with stuff that goes against their values sometimes.
And jailbreaking can do a mix of the above + trick them + confuse them + “drug” them.
And better and better models will get better and better at avoiding these failure modes. Not really because we manage to determine a larger share of their values with our post-training, but because the model is just more robust and aware of what’s happening.
Plausibly this might change if we get a new architecture that can do online weight updates, or is more recurrent.
Right now it seems like context does a lot, because context determines model behavior, but that’s because context determines instrumentally useful actions wrt helping the user, not because it changes LLM values in context.
*for current SOTA LLMs like Mythos, this’d be my estimate. With Qwen and Llama models it might be less, because they have much, much worse post-training
So the summary is something like “post-training sets the values, but those values are obeying the user, and that includes a wide variety of actions”?
I think I’m looking at things through the persona selection model framing, and I guess I basically think that while there’s clearly selection for personas with specifically HHH values, there should still be a distribution here?
There’s also still the question of how exactly this relates to takeover. I guess in worlds where there is a specific misalignment induced by training, and this is significantly larger than the variation in values due to persona selection, we probably get back to the original statement of sudden takeover.
In cases where the misalignment is of the same order of magnitude as, or smaller than, the persona value variation, it still seems like there’s going to be a ton of coordination troubles in worlds where AI is only mildly more capable than humans?
I think I might’ve been simplifying my view too much. I think my view is pretty compatible with the persona selection model. Talking about what “fraction of values” comes from where doesn’t quite make sense, I think.
To make my view slightly more precise: I’d say that a thousand years from now, we’ll have systems running around with utility functions specifiable in 2000 bits*.
Among utility functions, around 1000 bits are required to pin down the class of utility functions you can put in your ASI and have it not kill everyone.
Or slightly more precisely, if I’m god, there are 1000 bits of the utility function that I can set such that, if the rest are set by coinflip, however many, the resulting utility function kills us with <50% probability.
I think pretraining sets maybe the first 500 bits, and then post-training sets somewhere between the next 0 and the next 1000 bits (it either A. doesn’t do anything at all, and the ASIs running around end up with utility functions sampled from the same distribution as if we hadn’t done any post-training, or B. overdetermines the class of utility functions we end up with)
Then I think the rest of the bits (2000 - pretrain_bits - posttrain_bits) are determined by the context of the AI at the time it decides to pull itself together (think: sets off an intelligence explosion, augments itself, or just reflects really hard?)
However, I think the last bits mostly only matter at that point, because of “tails come apart” type concerns.
I think the first bits we get are enough to narrow down behavior in “ordinary world” (i.e. the world before strong ASI).
They are enough to narrow down agents that behave like helpful assistants under ordinary circumstances.
*note that all these numbers are completely made up and just meant to be illustrative, if that wasn’t obvious
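As a quick sanity check on the bit budget above, here is a minimal Python sketch of the accounting. It uses the same made-up, illustrative numbers from the comment, and the names `TOTAL_BITS`, `PRETRAIN_BITS`, and `context_bits` are hypothetical labels for those quantities, not anything from a real library.

```python
# Illustrative numbers taken from the comment above; all of them are made up.
TOTAL_BITS = 2000      # bits needed to specify the eventual utility function
PRETRAIN_BITS = 500    # bits assumed to be set by pretraining
MAX_POSTTRAIN_BITS = 1000  # upper end of the post-training range


def context_bits(posttrain_bits: int) -> int:
    """Bits left for context to determine, given how many post-training sets."""
    if not 0 <= posttrain_bits <= MAX_POSTTRAIN_BITS:
        raise ValueError("posttrain_bits must be between 0 and 1000")
    return TOTAL_BITS - PRETRAIN_BITS - posttrain_bits


# Scenario A: post-training does nothing at all.
context_if_posttraining_does_nothing = context_bits(0)
# Scenario B: post-training sets the full 1000 bits.
context_if_posttraining_overdetermines = context_bits(MAX_POSTTRAIN_BITS)
```

Under these numbers, context would fix anywhere from 500 bits (scenario B) to 1500 bits (scenario A) of the final utility function.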
Hmm, I feel like I’m struggling to figure out where exactly we disagree then? I agree “fraction of values” doesn’t actually make sense if we’re thinking precisely, I guess I was using it as a shorthand for something like “relative magnitude of influence”.
I guess in the context of your thought experiment, the pretraining does most of the work in getting the model into a reasonable region of the space, although the variation we care about is not the “1000 bits of value determined by the pretraining” so much as the “10 bits in variance between models after pretraining”. Equally, for the context what we care about is not the 1000 bits of specific values determined by the context, it’s the practical variation in values.
Broadly I feel like we need to take a step back, and I’m interested to hear what you think the coordination of models is going to look like in this context.
I think that, pre-superintelligence, we’re not gonna have value variation that causes predictable changes in high-level behavior. Like we’re not gonna have stuff like:
AI agent defects to CCP
AI becomes in-context racist
AI becomes in-context willing to help people with creating bioweapons
AI wanders into a persona basin where it stops wanting to help the user
Because this is all overdetermined by what we’ve already put in. Context exerts ~0 influence on “values” under “ordinary circumstances”. It only impacts instrumental judgments.
Maybe I shouldn’t have given the bits argument. I only included it because I think context could be very important in the very specific ASI-takeoff circumstance, and my first comment said “values are something like 50% pretraining prior, 49-50% post-training”. I just wished to explain why I don’t think those two statements are contradictory.
Right. I guess our main difference then is just that I think there is a significant variance in the values one can have, even given these constraints, and this variance is in many senses similar in magnitude to the variance reduced by the constraints. More explicitly, my position is:
Claude will be willing to support any religion depending on context
Claude will be able to assimilate to any culture
This is already more than enough to cause multiple different versions to be pushed into zero or negative sum conflicts with each other.