I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.
I guess I think it might be sensitive to what exactly you mean by “coherence.” When I think of some relatively natural ways of operationalizing coherence and try to predict what they’ll look like for frontier models over time, I don’t find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn’t what matters; coherence as it relates to x-risk threat models is. But I don’t feel like I’ve yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here’s an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don’t yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could “admit” to knowing a false version of the fact and you don’t automatically have a way to tell that the revelation was false.) My guess is that it won’t be too hard for you to reliably obtain the secret knowledge, and I don’t see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don’t really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they’re not quite the same thing.