I’m interested in having in-depth dialogues to find cruxes. Message me if you’d like to do this.
I do alignment research, mostly in the vicinity of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek’s team at MIRI.
Yeah, this does seem to be how a lot of people are thinking about it. I think the way to resolve this is to have people meditate on the distribution shifts where the analogy breaks down, but doing that well requires having at least one somewhat-detailed model of intelligence, which isn’t that common.
I think there’s a lot of evidence that AI builders don’t know what they’re doing in the relevant ways, and this evidence will likely get stronger and more widely acknowledged over time (as deployment stakes and capabilities make the occasional OOD weirdness more obvious). I’m sure the game of training against each embarrassing behaviour as it comes up will continue, but I hope that some will notice the pattern and extrapolate. It’s not that I want arguments to convince people; I think reality can convince people, and good, clear arguments just smooth the process.
In the particular case of this post, I think it would be obvious that humans don’t have the corrigibility property and are equally susceptible to distribution shifts.
Eh, all arguments are equal if you don’t think them through. I think it’s better to think of this kind of argument as setting the stage for the future, rather than as winning over large groups of people right now (who you’re assuming aren’t even evaluating the arguments). There are possible futures where world leaders are deciding on a course of action, and where a background fact is that it has become extremely obvious that we couldn’t win against a rogue AI. And there are many other potential futures where different things have become obvious and widely known. Many of these futures should provide evidence about alignment competence, but even the ones that don’t directly provide evidence here will provide plenty of motivation to think really carefully. And that plays to the benefit of careful and correct arguments.
I’m hoping that this, while somewhat misguided, might give politicians more leeway to make the correct choices.
I don’t see how “improve training” is an available option even in theory.