“there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem”
There are plenty of cases where John can glance at what people are doing and see pretty clearly that it is not progress toward the hard problem.
Importantly, people with the agent foundations class of anxieties (which I embrace; I think John is worried about the right things!) do not spend time engaging on a gears level with prominent prosaic paradigms and connecting the high-level objection (“it ignores the hard part of the problem”) to the details of the research.
“But Tsvi and John actually spend a lot of time doing this.”
No, they don’t! They paraphrase the core concern over and over again, often seemingly without reading the paper. I don’t think reading the paper would change your minds (nor should it!), but I think there’s a culture problem tied to this off-hand dismissal of prosaic work that disincentivizes potential agent foundations researchers (or researchers in some new thing that shares the core concerns of agent foundations) from engaging with people like John.
Prosaic work is fraught, and much of it is doomed. New researchers over-index on tractability because short feedback loops are comforting (‘street-lighting’). Why aren’t we explaining why that is, on the terms of the research itself, rather than expecting people to be persuaded by the same high-level point getting hammered into them again and again?
I’ve watched this work in real time. If you listen to someone talk about their work, or read their paper and follow up in person, they are often receptive to a conversation about worlds in which their work is ineffective, to evidence that we’re likely to be in such a world, and even to shifting the direction of their work in recognition of that evidence.
Instead, people with their eye on the ball are doing this tribalistic(-seeming) thing.
Yup, the deck is stacked against humanity solving the hard problems; for some reason, folks who know that are also committed to playing their hands poorly, and then blaming (only) the stacked deck!
John’s recent post on control is a counter-example to the above claims and was, broadly, a big step in the right direction, but it had some issues, raised by Redwood in the comments, which are a natural consequence of it being ~a new thing John was doing. I look forward to more posts like that in the future, from John and others, that help new entrants to empirical work (which has a robust talent pipeline!) understand, integrate, and even pivot in response to the hard parts of the problem.
[edit: I say ‘gears level’ a couple times, but mean ‘more in the direction of gears-level than the critiques that have existed so far’]
Big crux here: I don’t actually expect useful research to occur as a result of my control-critique post. Even having updated on the discussion remaining more civil than I expected, I still expect basically-zero people to do anything useful as a result.
As a comparison: I wrote a couple of posts on my AI model delta with Yudkowsky and with Christiano. For each of them, I can imagine changing ~one big piece in my model and ending up with a model which looks basically like theirs.
By contrast, when I read the stuff written on the control agenda… it feels like there is no model there at all. (Directionally-correct but probably not quite accurate description:) it feels like whoever’s writing, or whoever would buy the control agenda, is just kinda pattern-matching natural language strings without tracking the underlying concepts those strings are supposed to represent. (Joe’s recent post on “fake vs real thinking” feels like it’s pointing at the right thing here; the posts on control feel strongly like “fake” thinking.) And that’s not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways.
… so mostly I’ve tried to argue at a different level, e.g. in the Why Not Just… posts. The goal there isn’t really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who would already instinctively feel something is off about the control agenda, and to give them more useful frames to work with. Because those are the people who might have any hope of doing something useful, without the whole structure of their cognition needing to change first.
I think the reason nobody will do anything useful-to-John as a result of the control critique post is that control is explicitly not aiming at the hard parts of the problem, and knows this about itself. In that way, control is an especially poorly selected target if the goal is getting people to do anything useful-to-John. I’d be interested in a similar post on the Alignment Faking paper (or model organisms more broadly), on RAT, on debate, on faithful CoT, or on specific interpretability paradigms (circuits vs. SAEs vs. some coherentist approach vs. shards vs. ...), and would expect those to have higher odds of someone doing something useful-to-John. But useful-to-John isn’t really the metric I think the field should be using, either....
I’m kind of picking on you here because you are the least guilty of this failing relative to researchers in your reference class. You are actually saying anything at all, sometimes with detail, about how you feel about particular things. However, you wouldn’t be my first-pick judge for what’s useful; I’d rather live in a world where half a dozen people in your reference class spend non-zero time arguing about the details of the above agendas and how they interface with your broader models, so that the researchers working on those agendas can update based on those critiques (there may even be ways for people to apply the vector implied by y’all’s collective input and generate something new, or abandon their doomed plans).