I know there’s a history of theoretical work on the stability of alignment, so I’m probably just reinventing wheels, but: it seems clear to me that alignment can’t be fully stable in the face of unbounded updates on data (eg the most straightforward version of continual learning).
For example, if you could show me evidence I found convincing that most of what I had previously believed about the world was false, and the people I trusted were actually manipulating me for malicious purposes, then my values might change dramatically. I don’t expect to see such evidence, but there certainly isn’t a guarantee that I won’t.
Now, there might be ways to design systems to avoid this problem, eg fixing certain beliefs and values such that they don’t update on evidence. But for a system that looked something like a Bayes net, or for that matter a system that looked something like human beliefs and motivations, I don’t see how there can be any guarantee of long-term value stability in the face of exposure to arbitrary and unpredictable evidence about the world.
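To make the Bayes-net version of the worry concrete, here’s a toy sketch (purely illustrative; the particular belief, the numbers, and the coupling to behavior are all made up): a single belief that downstream, value-relevant behavior is conditioned on can be pushed arbitrarily far by a long enough run of one-sided evidence.

```python
# Toy illustration (not a real system): a single Bayesian belief node that
# downstream, value-expressing behavior is conditioned on.

def bayes_update(prior: float, p_obs_if_true: float, p_obs_if_false: float) -> float:
    """Posterior P(belief | observation) by Bayes' rule."""
    numerator = p_obs_if_true * prior
    return numerator / (numerator + p_obs_if_false * (1.0 - prior))

# Start out very confident that "the people I trust are honest".
p_trusted = 0.99

# A stream of observations, each 4x more likely under "they are manipulating
# me" than under "they are honest".
for _ in range(20):
    p_trusted = bayes_update(p_trusted, p_obs_if_true=0.2, p_obs_if_false=0.8)

print(f"P(trusted) after 20 adversarial observations: {p_trusted:.6f}")
# The belief ends up near zero, and any behavior (or value-expression)
# that was gated on it moves with it.
```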
Am I missing something here?

You could, I think, have a system where performance clearly depends on some key beliefs. You could still change those beliefs, but the change would significantly damage capabilities. I guess that could be good enough? E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me, e.g., hate gays.

Oh, that’s a really interesting design approach; I haven’t run across something like that before.
Consider backdoors, as in the Sleeper Agents paper: a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be fully general, consistent behaviors, i.e. you could get a model that presents these date-dependent beliefs consistently across all possible contexts.
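(As a cartoon of what such a conditional policy amounts to, in made-up pseudocode rather than anything from the actual paper:)

```python
from datetime import date

def backdoored_stance(today: date) -> str:
    """Toy date-conditional policy: one hidden feature (day parity) flips the
    expressed 'value', while everything else stays consistent."""
    return "pro-life" if today.day % 2 == 0 else "pro-choice"

print(backdoored_stance(date(2025, 3, 2)))  # even day -> pro-life
print(backdoored_stance(date(2025, 3, 3)))  # odd day  -> pro-choice
```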
Now, imagine someone controls the entire environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement a similar backdoor in you? They could force you to behave that way, but could they make you really believe that?
My guess is not, and one reason (there are also others, but that’s a different topic) is that humans like you and me have a very deep belief that “the current date doesn’t make a difference to whether abortion is good or bad”, one that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what would good and bad even mean if in some cases they flipped at midnight?
So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason why this wouldn’t be possible.

I’m not sure if this is a great analogy : )
I think you’re mostly right about the problem but the conclusion doesn’t follow.
First a nitpick: If you find out you’re being manipulated, your terminal values shouldn’t change (unless your mind is broken somehow, or not reflectively stable).
But there’s a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don’t bind to anything in the new world you find yourself in.
In that situation, how do we want an AI to act? There are a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn’t seem that hard, but it might not be trivial.
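Here’s a minimal sketch of the kind of default I mean (illustrative pseudocode; `choose_action`, `referents_exist`, and the no-op fallback are stand-ins, not a worked-out value binding algorithm):

```python
# Sketch: pick actions by expected utility, but only over worlds where the
# things the values refer to actually exist; if no such world is credible,
# fall back to doing nothing.

NOOP = "do_nothing"

def choose_action(actions, worlds, p, utility, referents_exist, threshold=1e-6):
    """
    actions: candidate actions
    worlds: hypotheses about what the world is like
    p: dict mapping world -> probability
    utility: (action, world) -> value, defined only where referents exist
    referents_exist: world -> bool, whether the valued things exist there
    """
    credible = [w for w in worlds if referents_exist(w) and p[w] > 0]
    if sum(p[w] for w in credible) < threshold:
        # The values don't bind to anything we believe in: default to no-op.
        return NOOP
    return max(actions, key=lambda a: sum(p[w] * utility(a, w) for w in credible))
```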
Does this engage with what you’re saying?

Thanks! This is a reply to both you and @nowl, who I take to be (in part) making similar claims.
I could have been clearer in both my thinking and my wording.
Because of the is-ought gap, value (how I want the world to be) doesn’t inherently change in response to evidence (how the world is). [nowl]
While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled. As a classic example in fiction, if you’ve absorbed some terminal values from your sensei and then your sensei turns out to be the monster who murdered your parents, you’re likely to reject some of those values as a result.
But
there’s a similar issue: You could discover that all your previous observations were fed to you by a malicious demon, and nothing that you previously cared about actually exists. So your values don’t bind to anything in the new world you find yourself in.
This is closer to what I meant to point to — something like values-as-expressed-in-the-world. Even if the terminal value doesn’t change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
For example: if we successfully instilled human welfare as a terminal value into an AI, there is (presumably) some set of observations which could convince the system that all the apparent humans it comes in contact with are actually aliens in human suits, who are holding all the real humans captive and tormenting them. Therefore, the correct behavior implied by the terminal value is to kill all the apparent humans so that the real humans can be freed, while disbelieving any supposed counterevidence presented by the ‘humans’.
To an outside observer — e.g. a human now being hunted by the AI — this is for all intents and purposes the same as the AI no longer having human welfare as a terminal value.

Possibly I’m now just reinventing epiwheels?
While I agree with this as a philosophical stance, it seems clear that, at least in humans, facts and values are in practice entangled.
I agree humans absorb (terminal) values from people around them. But this property isn’t something I want in a powerful AI. I think it’s clearly possible to design an agent that doesn’t have the “absorbs terminal values” property; do you agree?
Even if the terminal value doesn’t change, there are facts that could result in behavior that, to a reasonable outside observer, seems to reflect a conflicting value.
Yeah! I see this as a different problem from the value binding problem, but just as important. We can split it into two cases:
1. The new beliefs (that lead to bad actions) are false.[1] To avoid this happening we need to do a good job designing the epistemics of the AI. It’ll be impossible to avoid misleading false beliefs with certainty, but I expect there to be statistical-learning-type results saying that it’s unlikely, and becomes more unlikely with more observation and thinking (one illustrative shape of such a result is sketched just after this list).
2. The new beliefs (that lead to bad actions) are true, and unknown to the human AI designers (e.g. we’re in a simulation and the gods of the simulation have set things up such that the best thing for the AI to do looks evil from the human perspective). The AI is acting in our interest here. Maybe out of caution we want to design the AI’s values such that it wants to shut down in circumstances this extreme, just in case there’s been an epistemic problem and it’s actually case 1.
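The illustrative shape I have in mind (a completely standard concentration bound, used here only as an analogy rather than a claim about AI epistemics): if a belief amounts to an estimate of some frequency from $n$ independent observations, then

$$\Pr\big(|\hat{p}_n - p| > \varepsilon\big) \;\le\; 2\exp(-2n\varepsilon^2),$$

so the chance of ending up with a badly misleading estimate shrinks with more observation, though it never reaches zero.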
I’m assuming a correspondence theory of truth, where beliefs can be said to be true or false without reference to actions or values. This is often a crux with people that are into active inference or shard theory.
In that situation, how do we want an AI to act? There’s a few options, but doing nothing seems like a good default. Constructing the value binding algorithm such that this is the resulting behaviour doesn’t seem that hard, but it might not be trivial.
(and I imagine that the kind of ‘my values bind to something, but in such a way that it’ll cause me to take very different options than before’ I describe above is much harder to specify)
Now, there might be ways to design systems to avoid this problem, eg fixing certain beliefs and values such that they don’t update on evidence
Because of the is-ought gap, value (how one wants the world to be) doesn’t inherently change in response to evidence/beliefs (how the world is).[1]
So a hypothetical competent AI designer[2] doesn’t have to go out of their way to make the value not update on evidence. Nor to make any beliefs not update on evidence.
(If an AI is more like a human then [what it acts like it values] could change in response to evidence, though, yeah. I think most of the historical alignment theory texts aren’t about aligning human-like AIs (but rather hypothetical competently designed ones).)
Someone once kept disagreeing with this, so I’ll add: a value is not a statement about the world, so how would the Bayes equation update it?
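To spell out the picture I’m assuming (a textbook expected-utility setup, in my own notation, not a specific proposal): Bayes’ rule only ever touches the probability term, and the value function $U$ sits outside the update entirely:

$$P(s \mid e) = \frac{P(e \mid s)\,P(s)}{\sum_{s'} P(e \mid s')\,P(s')}, \qquad a^{*} = \arg\max_{a} \sum_{s} P(s \mid e)\, U(s, a).$$

Evidence $e$ moves $P(s \mid e)$ around; nothing in the machinery gives it a handle on $U$.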
A hypothetical competently designed AI could separately have a belief about “what I value”, or more specifically, about “the world contains something here running code for the decision process that is me, so its behavior correlates with my decision”. But regardless of how that belief gets manipulated by the hypothetical evidence-presenting demon (maybe it’s manipulated into “with high probability, the thing runs code that values y instead, and its actions don’t correlate with my decision”), the next step in the AI goes: “given all these beliefs, what output of the-decision-process-that-is-me best fulfills <hardcoded value function>”.
(If it believes there is nothing in the world whose behavior correlates with the decision, all decisions would do nothing and score equally; it’d default to acting under the world-possibilities which it assigns lower probability but where it has machines to control.)
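A toy version of that bookkeeping (my own sketch, with an invented two-hypothesis world model): worlds where nothing correlates with the decision add the same constant to every candidate output’s score, so the argmax is settled entirely by the lower-probability worlds where there is something to control.

```python
# Toy sketch: expected score of each candidate output under a mixture of
# world-hypotheses. In "no machine runs me" worlds the output changes
# nothing, so those worlds contribute the same constant to every option.

worlds = {
    "demon_world_no_machine": 0.95,   # nothing out there correlates with my decision
    "real_world_with_machine": 0.05,  # something out there runs my decision
}

def outcome_value(decision: str, world: str) -> float:
    if world == "demon_world_no_machine":
        return 0.0  # decision has no effect here; same value for every decision
    # In the world with a machine, the hardcoded value function cares which
    # decision gets made (numbers invented for illustration).
    return {"help_humans": 1.0, "do_nothing": 0.2, "defect": -1.0}[decision]

def score(decision: str) -> float:
    return sum(p * outcome_value(decision, w) for w, p in worlds.items())

decisions = ["help_humans", "do_nothing", "defect"]
print(max(decisions, key=score), {d: score(d) for d in decisions})
# -> "help_humans": the 5%-probability world decides the argmax, because the
#    95%-probability world scores every decision identically.
```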
(One might ask:) Okay, but could the hypothetical demon manipulate its platonic beliefs about what “the decision process that is me” is? Well, maybe not, because that’s (separately from the above) also not the kind of thing that inherently updates on evidence about a world.
But if it were manipulated somehow (I’m not quite sure what to even imagine being manipulated; maybe parts of the process rely on ‘expectations’ about other parts, so it’s those expectations, though only if they’re not hardcoded in, i.e. some sort of AI designed to discover parts of ‘what it is’ by observing its own behavior?), there’d still be code at some point saying to [score considered decisions on how much they fulfill <hardcoded value function>, and output the highest-scoring one]. It’s just that parts of the process could be confused(?)/hijacked, in this hypothetical.

[2] (not grower)