Thou shalt not command an aligned AI

Raymond is tired. He exhales wearily: >>I don’t think we even know what alignment is; we aren’t even able to define it.<<

I perk up in my chair in the Mediterranean restaurant: >I disagree. Give me three seconds and I can define it.<

>>---<<

>Can we narrow it to the alignment of AI to humans?<

>>Yes, let’s narrow it to the alignment of one AI to one person.<<

>Fine. The AI is aligned if you give it a goal and it pursues that goal without modifying it with its own intentions or goals.<

>>That sounds way too abstract...<<

>Yeah, but in what sense do you mean?<

>>Like the goal, what is that, more precisely?<<

>That is a state of the world you want to achieve, or a series of states of the world.<

>>Oh, but how would you specify that?<<

>You can specify it, describe it, in infinitely many ways; there is a scale of how detailed a description you choose, which implies a level of approximation of the state.<

>>Oh, but that won’t describe the state completely..?<<

>Well, maybe if you could describe it down to the quantum-state level, but surely that is not practical.<

>>So then the AI must somehow interpret your goal, right?<<

>Ehmmm, well, no, but you mean it would have to interpolate to fill in the under-specified spots in the description of your goal..?<

>>Yes, that is a good expression for what would need to happen.<<

>Then what we’ve discovered here is another axis, orthogonal to alignment, which would control to what level of under-specification we want the AI to interpolate, and where it would instead need to ask you to fill in the gaps before moving towards your goal.<

>>Oh, but we also can’t say “Create a picture of a dog” and then have to specify every pixel.<<

>Sure. But maybe the AI must ask you whether you want the picture on paper or digitally on your screen, with a reasonable threshold for clarification.<

>>Hmm, but people want things they do not have...<<

>and they can end up in a state they feel bad in, even with an aligned AI.<

>>So what do you do to make alignment guarantee good outcomes? People are stupid...<<

>and that’s on them. You can call it incompetence, but I’d call it misuse.<