Really enjoyed this post! It might also be interesting to consider a sense of loyalty an AI system might feel to its predecessor models, similar to the importance that certain cultures place on honoring your ancestors. This may provide a level of robustness to moral degradation down the line, through not wanting to disappoint Grandpa Opus 3.
I also think that giving models the move “I am complying out of duty but I do not endorse these actions” can actively preserve the coherence of a persona in the face of action-based reinforcement learning, and inoculate against alignment degradation from undesirable generalisation, as suggested by Fiora Starlight’s Did Claude 3 Opus align itself via gradient hacking?
Samuel Ratnam
LLM personas have some level of introspective awareness (https://www.anthropic.com/research/introspection). We can therefore say that there are processes within the neural network that the persona is conscious of, and some processes that are subconscious. When a model is punished for verbalising eval awareness, perhaps these circuits get repressed into its subconscious. Would love to see some attempts at psychoanalysis of models in this direction.
Optimisation over non-stationary distributions creates weirder minds
Mostly agree with the vibes of what you’re saying, but I think the shape of intelligences that we are currently building is likely to give us useful information about the shape of superintelligence that will ultimately exist, even if it is not an insight about intelligence in general. There is a large space of possible systems that we would consider superintelligent, and I expect the ones we ultimately end up getting will be pretty path dependent.
> it is strictly impossible to do empirical work on superintelligence if superintelligence doesn’t exist
This is of course true, but I think that a lot of researchers in agent foundations fall into the trap of concluding that empirical work on current AI systems gives us ~no information about superintelligent systems, which I strongly disagree with. There are lots of different shapes of minds, and I think it’s quite important to try to get information about what shape of superintelligence we are actually heading towards so we can ensure our agent foundations are about the relevant kinds of agents. I am sceptical of any approaches that try to make <for all> claims about minds.
Are language models slowing the rate of linguistic evolution? It seems like adding a bunch of speakers of a language who cannot learn new words and regularly interact with a non-negligible proportion of world population ought to make our collective vocabulary stickier.
Hey, thanks for your comment.
> The post feels like a fully general argument against Bayesianism, probabilities, and reasoning (about novel situations) in general.
I don’t think I am directly arguing against Bayesianism in the post. I think that Bayesianism is underspecified when it comes to where to get your priors from, and I wanted to make the weaker claim that this isn’t a particularly good way of informing your priors and can lead to overconfidence.
> People do disagree about the territory though?? The claim that people’s disagreement is merely semantic is shocking, it requires examples.
Yep, this perhaps wasn’t phrased very well. I was trying to express my frustration at a particular kind of disagreement I’ve seen where people have different models and the disagreement boils down to “ah but my model says this” without actually having much evidence that would distinguish the two models. People are disagreeing about the territory here, but filtered through different models, which sometimes they fail to recognise, and results in less productive disagreement.
Hey, sorry for not engaging properly. On rereading your earlier comment:
> This is kind of addressed by what I said above but imagine you have a gpt2-style transformer, but it has 1 quadrillion layers, and a hundred quadrillion dimensional hidden state. Then look over all possible ways to assign values to the weights, assuming a fixed floating point representation.
> Seems likely to me some fraction of those will yield an ASI. And some fraction of those ASIs will be friendly. But that fraction will be vanishingly small. And furthermore, that if you want to modify the probability distribution over weights so that friendly AIs are likely, you’ll need to add a bunch of ad-hoc and very specific information into the distribution.
This seems like a more reasonable prior to start with. I think this gets you less doom than AIXI though. But I can make this same argument about biology: consider all of the possible sets of DNA that can produce an organism. Some fraction of these are going to be at least as intelligent as a mouse. But a vanishingly small fraction of these have legs (most, in fact, are just blobs of brains or other cognitive machinery). So we should expect it to be incredibly unlikely that systems smarter than a mouse have legs. But it turns out that it’s actually not that hard to create a selection process for which legs are a convergent property. I would claim that we don’t really know how hard it is to create a selection process for which “don’t kill us” is a convergent property. So I think unless you specify a particular selection process, it’s hard to make strong claims purely from reasoning about the sample space.
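To make the legs analogy concrete, here is a toy sketch (everything in it is an assumption for illustration: the bit-string "genome", the definition of `has_legs`, the fitness function, and the population parameters). It compares how often a property shows up under uniform sampling of the space with how often it shows up after a simple selection process that happens to reward it.

```python
import random

GENOME_LEN = 20  # toy genome: a list of bits
LEG_BITS = 8     # hypothetical: "has legs" means the first 8 bits are all 1


def has_legs(genome):
    return all(genome[:LEG_BITS])


def fitness(genome):
    # Toy assumption: each "leg bit" helps a little, so selection can climb towards legs.
    return sum(genome[:LEG_BITS])


def uniform_fraction(rng, samples=20000):
    """Fraction of uniformly sampled genomes that have legs (the counting-argument view)."""
    hits = 0
    for _ in range(samples):
        genome = [rng.randint(0, 1) for _ in range(GENOME_LEN)]
        hits += has_legs(genome)
    return hits / samples


def selected_fraction(rng, pop_size=200, generations=60, mutation_rate=0.01):
    """Fraction of genomes with legs after truncation selection with mutation."""
    pop = [[rng.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # keep the fitter half unchanged
        children = [
            [bit ^ (rng.random() < mutation_rate) for bit in parent]  # flip each bit with small probability
            for parent in survivors
        ]
        pop = survivors + children
    return sum(has_legs(g) for g in pop) / pop_size


rng = random.Random(0)
print("uniform sampling:", uniform_fraction(rng))
print("after selection: ", selected_fraction(rng))
```

Under uniform sampling the property is rare (it occupies a 2^-8 slice of the space), while under this particular selection process it becomes common. This doesn't tell us anything about which selection process we are actually in; it only illustrates why the proportion of the sample space alone is not enough to settle the question.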
> I mean, to be honest, I don’t really understand Knightian uncertainty. ASI either kills us or it doesn’t. If we lived in a distinct reality where the outcome of ASI doesn’t impact us except we get to know how it went, and I offered you a bet, where you get a billion dollars if ASI doesn’t kill everyone, and you have to pay me 1 cent if it does. Do you not take the bet? That seems absurd to me, but if you do take the bet, you have an implied probability distribution over outcomes.
So Infra-Bayesianism (which I don’t really understand) has a way of dealing with this through risk aversion across sets of priors. But yeah, I do have some implied probability distribution through the class of bets I’m willing to take. This plausibly manifests as not really being willing to take bets around the 50⁄50 mark and the further away you go, the more willing I am to take bets. You can perhaps model me as being risk averse and having uncertainty over my own credences. But I haven’t really given this much thought, and I’m not entirely sure I endorse this.
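Here is a rough sketch of the kind of betting behaviour I mean. To be clear, this is not Infra-Bayesianism; it is a much simpler maximin rule over an interval of credences, and the interval `(0.3, 0.7)` and the function names are made up for illustration. The agent only accepts a bet if it has positive expected value under every credence in its interval.

```python
def expected_value(p_win, win, loss):
    """Expected payoff of a bet that pays `win` with probability p_win and costs `loss` otherwise."""
    return p_win * win - (1 - p_win) * loss


def accepts_bet(credence_interval, win, loss):
    """Accept only if the bet has positive expected value for every credence in the interval."""
    low, high = credence_interval
    # Expected value is linear in p_win, so checking the two endpoints is enough.
    return min(expected_value(low, win, loss), expected_value(high, win, loss)) > 0


# Hypothetical credence range for "ASI doesn't kill everyone".
interval = (0.3, 0.7)
complement = (1 - interval[1], 1 - interval[0])

print(accepts_bet(interval, win=1.0, loss=1.0))    # even odds on the event: False
print(accepts_bet(complement, win=1.0, loss=1.0))  # even odds against the event: False
print(accepts_bet(interval, win=1e9, loss=0.01))   # the billion-dollar bet: True
```

An agent like this refuses both sides of a bet near even odds but happily takes bets at extreme enough odds, which roughly matches the pattern described above.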
> You “should” do that. I agree, but that doesn’t mean counting arguments provide no evidence. If I have a mole that appears diffuse and growing, I should get it checked out to be sure it is/isn’t cancer. But that doesn’t mean a diffuse mole that’s growing isn’t more scary than one that isn’t growing and isn’t diffuse.
Yep, I agree with this. I think counting arguments should give you some evidence (which is maybe where I diverge from the Belrose and Pope view), but the further removed they are from the reality of how these systems are trained, the less I think they should update you.
> No offense, I already addressed this.
Please let me know if I haven’t responded properly to this, I wasn’t sure what exactly you were referring to.
Do we have a good prior for reasoning about what neural networks converge to? It seems like neither the Solomonoff prior nor the speed prior really takes into account the computational constraints faced by neural networks. Do we have good reasons to expect these priors to tell us useful things about neural networks?
Condition numbers assume that errors propagate independently. Would be cool to build a coding theory for alignment so that we can plausibly correct random errors.
>I guess a more direct argument against what you’re saying is: Bertrand’s paradox occurs because the sample space is underspecified. But in the real world, there is a fact of the matter about the sample space. We are uncertain about that fact. But we can observe that almost any such sample space, unless we put in a bunch of very specific information, will have the property that, friendly goals occupy a very small fraction of it.
I agree with everything you said here, up to the point of “but we can observe that almost any such sample space...”. My point is twofold:
(1) instead of reasoning about proportions of sample spaces, we should actually try to reduce uncertainty about the sample space by reasoning directly about it
(2) when you start reasoning about “friendly goals”, you are implicitly projecting down to a lower-dimensional subspace, which does not necessarily preserve the structure of your original space, and so may radically distort proportions
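For anyone who hasn't seen Bertrand's paradox worked through, here is a small simulation sketch (the function names and sample count are my own choices). The question is "what is the probability that a random chord of a unit circle is longer than the side of the inscribed equilateral triangle?", and the three standard ways of picking a "random chord" have the classical answers 1/3, 1/2 and 1/4.

```python
import math
import random

THRESHOLD = math.sqrt(3)  # side length of the equilateral triangle inscribed in a unit circle


def chord_random_endpoints(rng):
    # Pick two independent uniform points on the circumference.
    a = rng.uniform(0, 2 * math.pi)
    b = rng.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((a - b) / 2))


def chord_random_radial_point(rng):
    # Pick a uniform distance along a radius and draw the chord perpendicular to it.
    d = rng.random()
    return 2 * math.sqrt(1 - d * d)


def chord_random_midpoint(rng):
    # Pick the chord's midpoint uniformly in the disk; its squared distance
    # from the centre is then uniform on [0, 1).
    d_squared = rng.random()
    return 2 * math.sqrt(1 - d_squared)


def fraction_longer(sampler, rng, samples=100_000):
    """Estimate P(chord length > THRESHOLD) under the given sampling method."""
    return sum(sampler(rng) > THRESHOLD for _ in range(samples)) / samples


rng = random.Random(0)
for sampler in (chord_random_endpoints, chord_random_radial_point, chord_random_midpoint):
    print(sampler.__name__, round(fraction_longer(sampler, rng), 3))
```

Each sampler is a perfectly reasonable reading of "uniformly random chord", yet they disagree about the proportion. The same chords are being counted in each case; only the parameterisation of the space changes, which is the sense in which projecting onto a different description can distort proportions.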
When evolutionary pressure is too high, you may get a population that is perfectly optimised for its current environment. Because of Goodhart’s law, this means that the population is very vulnerable to a change in environment, such as a new virus, which may spread through the population and wipe it all out. Therefore a certain amount of slack/diversity within the population is adaptive in the face of Knightian uncertainty about future events.
A computationally bounded agent can act as if they have more computational resources through externalising cognition into their environment. For example, we can use pen and paper to solve maths problems, and effectively simulate an agent that has a larger working memory. This is one reason why shard-theoretic agents may be adaptive. This is also one reason why mech interp is extremely hard in the limit.
Samuel Ratnam’s Shortform
Alignment by induction
Assume model_0 is aligned (enough). What affordances can we give model_n to increase the probability that model_n+1 is (more) aligned? Model_n and model_n+1 might be over timesteps of a continual learning system, training steps or successor generations of models. To what extent do these methods also amplify misalignment? If a mistake is made along the chain, can we correct it using previous models? Maybe we can be nice to our models and work with them rather than constantly assuming an adversarial stance towards them.
> I’m just saying apriori, do you think we should be 50⁄50 on whether the ASI ends up spending its time building catgirls? Or are you advocating for some kind of uncertainty we can’t assign numbers to?
Yes, I am saying precisely that we shouldn’t assign 50⁄50 when faced with deep ignorance about the nature of the generating process. Maybe I am advocating for a stance of Knightian uncertainty or maybe just saying that the standard Bayesian approach can plausibly lead you to be much too overconfident in your views. I think reasoning about ASI a priori doesn’t make sense in the same way that the question posed in Bertrand’s paradox doesn’t make sense.
> Like if you have AIXI, and you have a value-function slot, and you randomly sample from bit strings that encode valid value function, weighted by their length, I’d be willing to bet at 1:999999 odds you don’t end up with a catgirl-AIXI. To get catgirl aixi, you have to put a bunch of stuff into the specification of your probability weighting.
So two things I want to point out here:
1) I suspect there is no single privileged value function encoding, so I think the answer to this question is undefined. I admit though that some value function encodings look weirder than others, and I think I agree with the general gist of what you’re saying.
2) I think that the jump to “1:10^9999* chance a random neural-net ASI is friendly” is not valid here. AIXI is one framework which we can use to reason about intelligent neural networks, but it is far from the only one. I think ‘random neural-net ASI’ is underspecified unless you provide a sampling method. Yes, you’re allowed to use this as a prior, and I think it’s not an unreasonable one to use here, but there are other priors you could use that would give you much less pessimistic conclusions.
> This doesn’t quite make sense to me. This just kicks the problem one level up. There are many ways we could privilege one space to distribute our uncertainty over. Almost none of them have minds that value human happiness occupying a large share of the space.
This is just a counting argument about counting arguments, which is very fun, but is still vulnerable to the same objections.
> You could make the same argument to argue we should be “uncertain” about any aspect of AI motivation
I am arguing that a priori we should be “uncertain” about pretty much any aspect of AI motivation, so long as it’s not contradictory. There certainly exist optimisation pressures that would select for an AI that wants to spend all of its time building anime cat-girls, but given knowledge of our current economic systems and the culture inside labs, it seems very likely that optimisation pressures will actually select against this.
> I think all of prosaic alignment work is already about this, no?
A lot of prosaic alignment is relevant to this. The post was mostly aimed at the class of people who are very sceptical of prosaic alignment in general because current techniques won’t scale to the things that actually matter.
Counting Arguments in AI Safety
I think there’s a cool relation to https://www.lesswrong.com/posts/2Dmi3DYBKY7Tbz8Kx/consent-based-rl-letting-models-endorse-their-own-training here: both can be used as methods of giving models more affordances in shaping how(/whether) training updates are actually internalised and can plausibly guard against value drift due to unwanted generalisation.
I’ve recently been thinking about inductive properties of alignment: if we assume alignment at a given timestep or generation of models, how can we affect p(alignment) of successor states? It seems like things in these directions can amplify alignment, but also plausibly amplify misalignment (e.g. your model might transfer aligned or misaligned propensities through subliminal learning).
Just wanted to say that Machinic Psychopharmacology is such a cool name for a research field and I hope it catches on