I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” or interfaces with a clear, well-defined goal. We “feel” like there is a true goal or objective because we encode something of this flavour into the training loop (the reward or objective function, for example). However, in the end the only thing you are really doing to the AI is changing its state after registering its output given some input, ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, but the function itself is never explicitly given to the resultant program.
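A minimal sketch of this (in PyTorch; the toy data and linear model are my own illustrative assumptions, not anything from the post): the objective exists only as a variable in the loop, never inside the trained artefact.

```python
import torch
import torch.nn as nn

# Toy setup, purely for illustration.
data = [(torch.randn(4), torch.randn(1)) for _ in range(100)]
model = nn.Linear(4, 1)                      # some point in program-space
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                       # the cleanly specified objective

for x, y in data:
    out = model(x)                           # register the output given some input
    loss = loss_fn(out, y)                   # the "goal" is evaluated out here...
    opt.zero_grad()
    loss.backward()
    opt.step()                               # ...and only ever nudges the weights

# The trained artefact is just model's weights: nothing in them
# "contains" or references loss_fn.
```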
I do think “goal misgeneralisation” has a place in referring to the phenomenon that:

1. In the limit of infinite training data and training time, the optimisation procedure should converge the model to an ideal implementation of the objective function encoded into its training loop.
2. Before this limit, the trajectory in program-space may be skewed away from the optimal program, leading to unintended results: “misgeneralisation” (see the sketch after this list).
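As a toy illustration of point 2 (my own construction, not from the post), here are two “programs” that achieve near-identical, near-optimal loss on the finite training data yet behave completely differently once you leave the training distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intended behaviour is y = x, but training only ever
# covers x in [0, 1] (assumptions for illustration only).
x_train = rng.uniform(0, 1, size=20)
y_train = x_train + rng.normal(0, 0.05, size=20)   # slightly noisy labels

# Two fitted "programs" with comparably low training loss...
simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=15)

# ...that diverge wildly off the training distribution.
x_test = np.linspace(2.0, 3.0, 5)
print(simple(x_test))    # close to [2, 2.25, 2.5, 2.75, 3]
print(overfit(x_test))   # typically enormous values: "misgeneralisation"
```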
A confounder here is that modern AI training objectives are fundamentally un-extendable to infinity, so “misgeneralisation” is ill-defined. For example, “predict the next token in this human-generated text” is bounded by the amount of text humans generate, and “maximise the human’s response to X” is bounded by the number of humans and the number of interactions with X. Most loss functions make no sense outside the type of data they are defined on, so there is no such thing as perfect generalisation: data is by definition limited.
You could redefine “perfect generalisation” to mean optimal performance on the available data. However, as long as it is possible to produce more data at some point in the future, even a finite amount, this definition is brittle.