I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it’s very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current “misalignment” measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it’s better than nothing to check for local stuff. I just don’t think it’s a good thing to put tons of weight on.
I’d rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.