I am building a research agenda focused on tackling catastrophic risks from AI, which I am documenting on my Substack, Working Through AI. Where it seems a good fit, I’m crossposting my work to LessWrong.
Richard Juggins
One problem I have with the instruction-following frame is that it feels like an attempt to sidestep the difficulties of aligning to a set of values. But I don’t think this works, as perfect instruction-following may actually be equivalent to aligning to the Principal’s values.
What we want from an instruction-following system is one that does what we mean rather than simply what we say. So, rather than ‘Do what the Principal says’, the alignment target is really ‘Do what the Principal’s values imply they mean’. And if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
If done correctly, this would solve the corrigibility problem, as all instructions would carry an implicit ‘I mean for you to stop if asked’ clause.
Would it make sense to think of this as lying on a continuum, where on one end you have basic, relatively naive instruction-following that is easy to implement (e.g. helpful LLMs) and on the other you have perfect instruction-following that is completely aligned to the Principal’s values?
Thank you for your kind words! I’m glad you liked it. Your instruction-following post is a good fit for one of my examples, so I will edit in a link to it.
I agree that alignment is a somewhat awkwardly used term. I think the original definition relies on AI having quite cleanly defined goals, in a way that is probably unrealistic for sufficiently complex systems and certainly doesn’t apply to LLMs. As a result, it often ends up being approximated to mean something more like directing a set of behavioural tendencies, as in trying to teach the AI to always take the appropriate action in any given context. I tend to lean into this latter interpretation.
I haven’t had time to read your other links yet but will take a look!
How to specify an alignment target
I’m glad to see someone talking about pragmatism!
I find it interesting that the goal of a lot of alignment work seems to be to align AI with human values, when humans with human values spend so much of their time in (often lethal) conflict. I’m more inclined to the idea of building AI with a value-set that is complementary to human values in some widely-desirable way, rather than literally having a bunch of AIs that behave like humans.
I wonder if this perspective intersects with some of your points about thick and thin moralities, as well as social technology. Am I in the right ballpark to suggest that what you are after is a global thin morality, defined via social technology, that allows AI to participate in diverse, thick human cultures without escalating conflict between groups and contexts beyond an acceptable level?
In a sense, at the risk of oversimplifying, you are looking for a pragmatic solution for keeping conflict low in a world of diverse, highly powerful AI?
I see the LLM side of this as a first step, both as a proof of concept and because agents get built on top of LLMs (for the foreseeable future at least).
I think that, no, it isn’t any easier to align an agent’s environment than to align the agent itself. For perfect alignment, the kind that will last in all cases and for all time, I think they amount to the same thing, and this is why the problem is so hard. When an agent, or any AI, learns new capabilities, it draws the information it needs out of the environment. It’s trying to answer the question: “Given the information coming into me from the world, how do I get the right answer?” So the environment’s structure basically determines what the agent ends up being (the toy sketch at the end of this comment tries to make that concrete).
So the key question is the one you say, and that I try to allude to by talking about an aligned ontology: is there a particular compression, a particular map of the territory, which is good enough to initialise acceptable long-term outcomes?
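To make the “environment determines the agent” point a bit more concrete, here is a minimal toy sketch. It is entirely my own illustration (the two environments and the safe/unsafe labels are made up): the same learning rule, run in two environments with different structure, ends up encoding opposite models of what its observations mean.

```python
import random

def learn(environment, steps=10_000):
    """Trivial learner: count how often each (observation, label) pair occurs."""
    counts = {}
    for _ in range(steps):
        obs, label = environment()
        counts[(obs, label)] = counts.get((obs, label), 0) + 1
    return counts

# Environment A: observation 0 always means 'safe'.
def env_a():
    obs = random.randint(0, 1)
    return obs, ("safe" if obs == 0 else "unsafe")

# Environment B: the same observations carry the opposite meaning.
def env_b():
    obs = random.randint(0, 1)
    return obs, ("unsafe" if obs == 0 else "safe")

print(learn(env_a))  # the learner's 'world model' mirrors environment A's structure
print(learn(env_b))  # identical learning rule, opposite model, because the environment differs
```

The learner here is deliberately trivial; the point is only that everything it ends up believing was supplied by the structure of the environment it learned from.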
Making alignment a law of the universe
Thanks for the comment! Taking your points in turn:
- I am curious that you see this as me saying superintelligent AI will be less dangerous, because to me it means it will be more. It will be able to dominate you in the usual hyper-competent sense, but it may also accidentally screw up some super-advanced physics and kill you that way too. It sounds like I should have stressed this more. I guess there are people who think AI sucks and will continue to suck, and therefore see no reason to worry about existential risk, so maybe by stressing AI fallibility I’m riding their energy a bit too hard to have made myself clear. I’ll add a footnote to make this clearer.
- I agree that knowing-that reduces the amount of failure needed for knowing-how. My point, though, is that the latter is the thing we actually care about when we talk about intelligence. Memorising information is inconsequential without some practical purpose to put it to. Even if you’re just reading stuff to get your world model straight, it’s because you want to be able to use that model to take more successful actions in the world.
- I’m not completely sure I follow your questions about upper bounds on failure-reduction potential. My best guess is that you’re asking whether sufficient knowing-that can reduce the amount of failure required to acquire new skills to a very low level. I think theoretical knowledge is mostly generated by practical action (trying stuff and writing down what happened), either individually or on a societal scale. So if an ASI wants to do something radically new, there won’t be any existing knowledge that can help it. For me, that means catastrophic or existential risk due to incompetence is a problem. I guess it slightly reduces the risk of the AI intentionally killing you, as it could mess up its plans in such a way that you survive, but long-term this reduction will be tiny, because wiping out humans will not be in the ASI’s stretch zone for very long.
- Re your second point, I do not believe we will be able to recognise the errors an ASI is making. If it wants to kill us, it will be able to. My fear is that it will do it by accident anyway.
- Re your third point, I agree that AI is going to proliferate widely, and this is a big part of why I’m saying the usual recursive self-improvement story is too clean. There won’t be this gap between ‘clearly dumber than humans’ and ‘effectively omnipotent’ in which the AI is doing nothing but quietly gaining capabilities. Labs will ship their products and people will use them, and while the shipped AI will be super impressive and useful, it will also screw a lot of things up. What I was getting at in my conclusion, about the AI doing nothing out of fear of failure, was more that if self-destructive actions we don’t understand come within its capabilities, and it knows this, we might find it gets risk-averse and reluctant to do some of the things we ask it to.
- Agree completely with your fourth point.
Do you have any quick examples of value-shaped interpretations that conflict?
So perhaps the level of initiative the AI takes? E.g. a maximally initiative-taking AI might respond to ‘fetch me coffee’ by reshaping the Principal’s life so they get better sleep and no longer want the coffee.
I think my original reference to ‘perfect’ value understanding is perhaps unhelpfully obscuring these tradeoffs, as in theory it includes knowledge of how the Principal would want interpretative conflicts managed.
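If it helps, here is one toy way to picture the continuum from the earlier comment, applied to the ‘fetch me coffee’ example. This is entirely my own formalisation, not anything from the original discussion, and every action name and score below is a made-up illustration: a single weight lam blends literal compliance with inferred-value alignment.

```python
# Toy sketch: lam = 0 is naive literal instruction-following,
# lam = 1 is acting purely on the Principal's inferred values.
# All actions and scores are hypothetical illustrations.

def score(action, lam, literal_fit, value_fit):
    """Blend how literally an action matches the instruction with how well it serves the Principal's values."""
    return (1 - lam) * literal_fit[action] + lam * value_fit[action]

literal_fit = {"bring the coffee": 1.0, "restructure the Principal's sleep": 0.1}
value_fit = {"bring the coffee": 0.4, "restructure the Principal's sleep": 0.9}

for lam in (0.0, 0.5, 1.0):
    best = max(literal_fit, key=lambda a: score(a, lam, literal_fit, value_fit))
    print(f"lam={lam}: {best}")
```

As lam grows, the chosen action drifts from doing what was said towards doing what the modelled values imply, which is exactly where the initiative question starts to bite.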