I am building a research agenda tackling catastrophic risks from AI, which I am documenting on my substack Working Through AI. Where it feels suited, I’m crossposting my work to LessWrong.
Richard Juggins
Yeah I agree that looking closer at how generalisation could fail is a really important problem. I’m particularly interested in the new situations aspect. As we move to more agentic systems with longer time-coherence, the proliferation of variables will be enormous. My worry with simulations, apart from the gap between them and reality, is how good models are getting at telling when they are in them! It’s still very valuable to try though.
Which bit of Roger’s post do you think makes mine outdated? Skimming through, he is clearly more optimistic than me, but he also says stuff like
However, I also completely agree that there is a vast and dangerous gap between “my life experience suggests some simple approaches may work on ASI” and “we actually tested this and are sure that they do and know which ones do, without accidentally killing us all during the testing, or getting fooled by our far-smarter-than-us test subjects”.
which aligns well with my position.
Regarding AI 2027, if Agent-3 fails to find sufficient evidence that Agent-4 is plotting this means that (a) the techniques used to align Agent-4 did not generalise and (b) that this is hard to iterate on because (amongst perhaps other reasons) it is difficult to see there is a problem at all. My point is that, because we cannot collect in advance the information required to avoid generalisation problems, we are going to hit them at some point. And when we do, we have to keep them small and visible enough to iterate on.
All technical alignment plans are steps in the dark
Interesting post! I have a few questions:
ASI narratives tend to assume superhuman metacognitive skills will be used to efficiently learn new ordinary skills. However, so far, ordinary skill learning has outstripped metacognition. What intuition do you think people have for this swapping around?
How generalisable do you think metacognitive skills are? I was interested in your comment that ‘Humans’ metacognitive abilities seem to vary with their expertise on a given topic’. Continuing with the narrative of using superhuman metacognitive skills to efficiently learn ordinary skills: if learning the latter skills in simulations only generalises so far to the real world, surely metacognitive skills can’t bail you out as the same will be true of them? To the degree that the metacognitive skill is general it generalises, to the degree it is specific, it does not. E.g. the paper you linked here https://arxiv.org/pdf/2504.05419 saw probes are well-calibrated within distribution, but not always outside, suggesting one important metacognitive ability—a well-calibrated sense of how good your answer is—does not generalise well.
‘If everyone who asked was told something like “humans pretty clearly don’t know how hard alignment is, so you should probably slow down progress toward AGI if you possibly can,” it might indirectly help a fair amount. And more reliable systems might be particularly helpful for alignment deconfusion.’ It’s not clear to me how a system could learn this objectively. It would get accused of having been post-trained to say it or people would speculate about the training data. You need some trust-building process, and the lowest error answer on a particular training set is not that. How reliant do we have to get on AI before we trust it this much?
So the problem is, if you present a model with a bunch of wildly OOD examples in order to test generalisation, that these look fake to it, so it realises it is being tested? This implies alignment evals will increasingly look like psychology experiments, trying to coach/trick models into revealing their more fundamental preferences.
With evaluation awareness, do we see evidence of models in deployment ever getting confused and thinking they are being evaluated when they are not? If we’re worried about models behaving incorrectly in novel situations, and novel situations tend to make models think they are being tested, then we should see this. Or, does evaluation awareness predominantly come from some other source?
I haven’t read HP:MoR, so don’t know exactly what is happening in the example, but might you not have doubt about whether the supposition is false or not? Envisaging a solution is a way of interrogating the structure of the problem, including whether it is solvable at all. Sure though, if you want to use the suppostion to prove something else, rather than work backwards from it, you want to be sure of the supposition in the first place.
Well, if the solution to alignment is that a particular system has to keep running in a certain way, then that can fail. The durability of solutions is going to be on a spectrum. What we would hope is that the solution we try to implement is something that improves over time, rather than is permanently brittle.
I think that asking for a perfect solution is asking a lot. It may be possible to perfectly align a superintelligence to human will, but you also want to maintain as much oversight as you can in case you actually got it slightly wrong.
What Success Might Look Like
When you talk about reflective stability, you concentrate on intentional goal-prioritisation as the mechanism by which this may occur. It sounds like you mean the model asks itself, among the many things it can think of, what combination it cares the most about, and then resolves to go all-in on that (and this resolution sticks). Am I reading this correctly?
Do you see this as distinct from, say, the model having a kind of epiphany, where as its context changes (due to interactions with others as well as its own rationalisations), it stumbles on a sense of ‘ohhh, this is what I want to do’ and then robustly pursues that moving forwards? I wonder if there is a kind of spectrum between goal-change due to an automatic system 1 reinterpretion of a changing context and an intentional system 2 modification of the context. Or maybe they are the same thing at root somehow.I guess you could think about measuring these behaviours, in the first instance at least, by pointing an LLM in the direction of some slightly contradictory goals, and seeing if over very long contexts it latches onto one in particular, and how it seems to go about that. Although, I would expect the full problem, i.e. for more powerful AI than we have today, to have a strong dependence on exactly how memory is implemented. For example, on how far from factory settings it is possible for the models to drift.
I must admit I am a little pessimistic, and consider the anarchy side of the equation much more likely than consolidation. Modern society as a bunch of protocols is effective at managing diversity at scale, but it was built in a different age. You could only diverge so far from your neighbours, because who else did you talk to? With the internet and now AI, the production process of shared meaning is shifting in a fundamental way, and I’m not sure where it leads.
That’s an interesting question! I think it’s instructive to consider the information environment of a medieval peasant. Let’s speculate that they interact with a max ~150 people, almost all of which are other peasants (and illiterate). Everyone lives in the same place and undertakes a similar set of tasks throughout their life. What shared meanings are generated by this society? Surely it is a really intense filter bubble, with a thick culture into which little outside influence can penetrate? The Dothraki from Game of Thrones have a great line – ‘It is known’ – which they use to affirm something widely understood. This is a ritual rather than an explanation, because everyone within the culture already knows the meaning of whatever it is that has been said.
Comparing this to modern society, one of the biggest differences is the firehose of information we constantly consume, which is unique for each person. More than this, because it is global, the same information gets consumed in different contexts by different people (e.g. your reference to Gangnam Style). I think this leads to a kind of a hierarchical structure. On the one hand, we also have filter bubbles, which are getting more bespoke and personalised every year, and can lead to neighbours in possession of wildly different world models. On the other, unlike the peasants, we have to regularly coordinate and interact with people outside our bubbles. So you get this high-level set of Schelling points people coalesce around, like fashions and conventions, that allow us to have meaningful interactions with people we don’t know very well, but often only to a shallow depth. This sometimes looks like a drab monoculture. Then there is a lot in-between: little pockets of thicker shared meaning generated by families, sub-cultures, organisations etc, the coherence of which constantly have to be reaffirmed to survive the information firehose.
In sum, I think this cashes out as lots of thin shared meanings underpinning massive diversity, albeit in a highly complex structure. Although, I should caveat this with the fact that I would probably struggle to recognise our more deeply held shared meanings, in the whole fish recognising water sense.
Adding this having read your post. I think I agree that it is impossible to speak of a sematically invalid statement, but that is because I am locating the meaning outside the statement, rather than applying validity or invalidity to the statement itself. ‘Colourless green ideas sleep furiously’ certainly means something to me, and not just as a first order sign about Chomsky. The example of this, par excellence, is Lewis Carroll’s Jabberwocky. There’s a great section in Goedel, Escher, Bach about translating the Jabberwocky into different languages—even though it is a nonsense poem, it clearly means something when you read it.
It is interesting to think about why a statement might be found more or less meaningful by people in general. I asked ChatGPT for a summary of Eco’s Theory of Semiotics, and it included this passage:
‘Eco emphasises that meaning is not inherent in signs but is produced through codes—systems of rules shared within a culture. Interpretation depends on the interpreter’s familiarity with the relevant codes.’
I think this is important. My post doesn’t engage with why there are common patterns in words. As I mention in my other comment, I see this as a cultural, emergent thing that I was treating as out-of-scope, concentrating instead on how individual communication works.
Interestingly, my first draft was closer to your point of view, but I talked myself out of it! I convinced myself the kind of information that is in the words/context is a different type of thing than meaning. I haven’t had time to read your link yet, but will do later.
Here are my quick thoughts:
1. As mentioned, words/context are information, but not meaning. A message written in code is meaningless without knowledge of the code, and in this case everyone has a slightly different decoding method. If my method decodes a message to mean the opposite of yours, that implies the meaning is in the method. This is not a hypothetical – people interpret statements in diametrically opposed ways all the time.
2. Subjectivity is a matter of degree. Because we live in the same world and interact all the time, we will share a lot of meanings, and this will make them look somewhat objective. We can, in this sense, speak of an emergent, global meaning for words (‘words are defined by their use’), but we interact with this indirectly, through other individuals, and I think it is also a different kind of thing to the meaning I am talking about in this post.
The Iceberg Theory of Meaning
Do you have any quick examples of value-shaped interpretations that conflict?
This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and other that will look more like autonomous action.
So perhaps the level of initiative the AI takes? E.g. a maximally initiative-taking AI might respond to ‘fetch me coffee’ by reshaping the Principal’s life so they get better sleep and no longer want the coffee.
I think my original reference to ‘perfect’ value understanding is maybe obscuring these tradeoffs (maybe unhelpfully), as in theory that includes knowledge of how the Principal would want interpretative conflicts managed.
One problem I have with the instruction-following frame is that it feels like an attempt to sidestep the difficulties of aligning to a set of values. But I don’t think this works, as perfect instruction-following may actually be equivalent to aligning to the Principal’s values.
What we want from an instruction-following system is one that does what we mean rather than simply what we say. So, rather than ‘Do what the Principal says’, the alignment target is really ‘Do what the Principal’s values imply they mean’. And if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
If done correctly, this would solve the corrigibility problem, as all instructions would carry an implicit ‘I mean for you to stop if asked’ clause.
Would it make sense to think of this on a continuum, where on one end you have basic, relatively naive instruction-following that is easy to implement (e.g. helpful LLMs) and on the other you have perfect instruction following that is completely aligned to the Principal’s values?
Thank you for your kind words! I’m glad you liked it. Your instruction-following post is a good fit for one of my examples, so I will edit in a link to it.
I agree that alignment is a somewhat awkwardly-used term. I think the original definition relies on AI having quite cleanly defined goals in a way that is probably unrealistic for sufficiently complex systems, and certainly doesn’t apply to LLMs. As a result, it often ends up being approximated to mean something more like directing a set of behavioural tendencies, like trying to teach the AI to always take the appropriate action in any given context. I tend to lean into this latter interpretation.
I haven’t had time to read your other links yet but will take a look!
How to specify an alignment target
I’m glad to see someone talking about pragmatism!
I find it interesting that the goal of a lot of alignment work seems to be to align AI with human values, when humans with human values spend so much of their time in (often lethal) conflict. I’m more inclined to the idea of building AI with a value-set that is complementary to human values in some widely-desirable way, rather than literally having a bunch of AIs that behave like humans.
I wonder if this perspective intersects with some of your points about thick and thin moralities, as well as social technology. Am I in the right ballpark to suggest that what you are after is a global thin morality defined via social technology that allows AI to participate in diverse, thick human cultures without escalating conflict between groups and contexts to above an unacceptable level?
In a sense, at the risk of oversimplifying, you are looking for a pragmatic solution for keeping conflict low in a world of diverse, highly powerful AI?
This is a really interesting idea! I think that, as AI gets more powerful, we should give it clear rights and responsibilities—different from human ones to be sure—which treat it as a moral agent in its own right. I think this will create a better incentive structure and make the situation more predictable for everyone. At the moment it seems like AIs are aligned to a mixture of what the developers think is ethically correct and what will not embarrass their company. Sure, Claude is a pretty sound guy, but I’m not sure this is sustainable.
Have you picked negligence as you think it is more important that other principles or as a general test case for legal alignment?