Gurkenglas comments on The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

Gurkenglas 29 May 2025 12:59 UTC
13 points
0
“Algorithm 1: Safe Beam Search with Harmfulness Filtering” relies on a classifier that decides whether a sequence came from the training subdataset tagged with tau or from the training subdataset not tagged with tau. What happens when the sequence lies in neither distribution, for example because the AI is considering a plan that nobody has ever thought of?
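To make the worry concrete, here is a toy sketch. It is not the paper's Algorithm 1: the one-dimensional "feature", the two Gaussian subdatasets, and every name and number below are hypothetical, chosen only to show that a two-way classifier can be highly confident about a point that neither subdataset covers.

```python
import math

# Hypothetical stand-ins for the two training subdatasets, summarised as
# 1-D Gaussians over some made-up scalar feature of a sequence.
UNTAGGED_MEAN = 0.0   # sequences not tagged with tau
TAGGED_MEAN = 2.0     # sequences tagged with tau
STD = 1.0


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def p_tagged(x):
    """Posterior that x came from the tagged subdataset, with equal priors.

    This is all a two-way classifier can report: which of the two
    subdatasets is *relatively* more likely, not whether either fits.
    """
    log_tagged = -0.5 * ((x - TAGGED_MEAN) / STD) ** 2
    log_untagged = -0.5 * ((x - UNTAGGED_MEAN) / STD) ** 2
    return sigmoid(log_tagged - log_untagged)


def max_density(x):
    """Largest density either subdataset assigns to x (a crude OOD signal)."""
    return max(
        gaussian_pdf(x, TAGGED_MEAN, STD),
        gaussian_pdf(x, UNTAGGED_MEAN, STD),
    )


if __name__ == "__main__":
    for x in (1.0, 50.0, -50.0):
        print(x, p_tagged(x), max_density(x))
```

In this toy setup a point far from both means still gets a near-certain label in one direction or the other, while `max_density` shows that neither subdataset assigns it any meaningful probability. The classifier's confidence therefore says nothing about whether the point is in-distribution.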
- RogerDearnaley 30 May 2025 6:42 UTC
  6 points
  3
  The labeling used is for harmful material. The underlying logic here is that things are either harmful or they’re not. Higher-capability LLMs with complex world models are generally significantly more successful at extrapolating tasks like this out-of-distribution than a basic classifier ML model would be, but it’s not going to be perfect. If you come up with something that’s way out in left field, the LLM may no longer be able to accurately classify it as harmful or not. The same is of course also true for humans, or any agent. It’s an inherent challenge of Bayesian learning: without enough evidence, in areas where extrapolating from the hypotheses you’ve learnt doesn’t suffice, you don’t yet know the answer. So you should be cautious moving out-of-distribution, especially far out of distribution in new ways that you’ve never seen before. But then, as everyone knows (including a capable AI based on an LLM), that’s also true for many other reasons: if you don’t know what you’re doing, there are many dangers. A sensible heuristic would be to assume by default that going far out-of-distribution is harmful until proven otherwise. One way to try to implement this would be to state, motivate, and explain it, and to give approving examples of other AIs showing caution in this situation, many times throughout the pretraining set.
  
  How could we possibly make any AI that wouldn’t have this failure mode?
  - Gurkenglas 30 May 2025 9:34 UTC
    6 points
    2
    Presumably in some domains its capabilities will generalize better OOD than its tau-classifier (and vice versa). You could try to have it err in the direction of tau in such cases, though neither paper seems to gesture at this.
    Also, whether something is harmful depends on the AI’s capability level. For example, you might trust an AI to send an email to a politician arguing for climate change or peacemaking if it’s human-level, but not if it’s smart enough to tell which second-order effects will dominate, such as inoculating the politician against the arguments, or distracting them from their work on AI regulation, or maneuvering them into drama with another faction.
    You could try to put the AI’s capabilities in context, if you know them, so things can be either-harmful-or-not again, though neither paper seems to gesture at this.
    Such problems are characteristic of attempts to build an aligned system out of parts that are not, by themselves, aligned; they will search for ways to bypass your system. We could possibly figure out how to build aligned parts.