I agree that this should be said, but there is also actual disagreement about which theory is better.
Getting reliable 20% returns every year is really quite amazingly hard.
Foundations for analogous arguments about future AI systems are not sufficiently understood. I mean, maybe we can get very capable systems that optimise softly, like current systems do.
And then the AI companies, if they're allowed to keep selling those, just brute-RLHF their models into not talking about that (we have now observed this). Which means we can't get any trustworthy observations of what later models would otherwise be thinking, past that point of AI-company shenanigans.
Seems to me like the weakest point of all this theory: models not only "don't talk" about wiping out humanity, they don't always kill you, even if you give them (or make them think they have) a real chance. Yes, it's not reliable. But the question is how much we should update from Sydney (which was mostly fixed) versus RLHF mostly working. And whether RLHF is actually changing the model's thoughts, or the model is secretly acting benevolent, is an empirical question with different predictions: can't we just look at the weights?
As I understand it, interpretability research isn't exactly stuck, but it's very, very far from anything like that, even for non-SotA models. And the gap is growing.