Okay, but why is it wrong, though? I still haven’t seen a convincing case for that! It sure looks to me like, given an assumption (I still feel confused about whether you share it), the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it’s a bad argument. If it is in fact a bad argument, then that’s a legitimate criticism. But from my perspective you have not adequately spelled out how “deep nets favor simple functions” implies it’s a bad argument.
Thank you for attempting to spell this out more explicitly. If I understand correctly, you are saying singular learning theory suggests that AIs with different architectures will converge on a narrow range of similar functions that best approximate the training data.
With less confidence, I understand you to be claiming that this convergence implies that (in the context of the metaphor) a given [teal thing / dataset] may reliably produce a particular shape of [black thing / AI].
So (my nascent Zack-model says) the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”, and more importantly incorrect to claim that the black shape’s many degrees of freedom imply it will take a form its developers did not intend. (Because, by SLT, most shapes converge to some relatively simple function approximator.)
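To make sure I’m tracking that claim, here’s a minimal toy sketch of how I’m picturing it (my construction, not yours, and not anything from SLT proper; the dataset, architectures, and hyperparameters are arbitrary illustrative choices): two networks with quite different parametrizations, fit to the same simple dataset, end up implementing nearly the same function on held-out inputs.

```python
# Toy sketch, not an SLT result: two "black things" with different
# parametrizations, trained on the same [teal thing / dataset], end up
# implementing nearly the same function. Dataset and hyperparameters are
# arbitrary illustrative choices.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # the dataset's inputs
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)    # one simple underlying regularity, plus noise

# Two quite different architectures / parametrizations.
wide_shallow = MLPRegressor(hidden_layer_sizes=(256,), max_iter=5000, random_state=1)
narrow_deep = MLPRegressor(hidden_layer_sizes=(16, 16, 16), max_iter=5000, random_state=2)
wide_shallow.fit(X, y)
narrow_deep.fit(X, y)

# Compare the two learned functions on held-out inputs.
X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
gap = np.max(np.abs(wide_shallow.predict(X_test) - narrow_deep.predict(X_test)))
print(f"max disagreement between the two learned functions: {gap:.3f}")
# On data this simple, the two functions typically land close to each other
# (and to sin), which is the "narrow range of similar functions" picture as
# I currently understand it.
```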
But... it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values even given the stated interpretation of SLT. Or, in other words, the black shape is still basically unpredictable from the perspective of the teal-shape drawer. I’m not sure you disagree with that?
As an exercise in inferential gap-crossing, I want to try to figure out what minimum change to the summary / metaphor would make it relatively unobjectionable to you.
Attempting to update the analogy in my own model, it would go something like: You draw a [teal thing / dataset]. You use it to train the [black thing / AI]. There are underlying regularities in your dataset, some of which are legible to you as a human and some of which are not. The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape. You end up with [weird shape] instead of [simple shape you were aiming for].
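To check that this updated metaphor is even coherent, here’s a minimal toy sketch of the “illegible regularity” part (mine, with entirely made-up data; the feature setup is an illustrative assumption, not a claim about how real training data is structured):

```python
# Toy sketch of the updated metaphor, with entirely made-up data: the dataset
# contains one regularity I intended (feature 0) and one I never noticed
# (feature 1, which happens to track the label during training). The trained
# model conforms to both, so when the illegible regularity breaks at
# deployment, I end up with a [weird shape] rather than the one I aimed for.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
intended = rng.normal(size=n)                      # the regularity legible to me
label = (intended > 0).astype(int)                 # the shape I am aiming for
illegible = label + 0.05 * rng.normal(size=n)      # a regularity I do not see
X_train = np.column_stack([intended, illegible])

model = LogisticRegression(max_iter=1000).fit(X_train, label)
print("learned weights (intended, illegible):", model.coef_[0])

# At "deployment" the illegible regularity no longer holds.
intended_test = rng.normal(size=n)
label_test = (intended_test > 0).astype(int)
illegible_test = rng.integers(0, 2, size=n) + 0.05 * rng.normal(size=n)  # now independent of the label
X_test = np.column_stack([intended_test, illegible_test])
print("accuracy once the illegible regularity breaks:", model.score(X_test, label_test))
# The model typically leans heavily on the feature I never intended, so
# accuracy drops well below what the intended feature alone would support.
```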
A more skeptical Zack-model in my head says “No, actually, you don’t end up with [weird shape] at all. SLT says you can get [shape which robustly includes the entire spectrum of reflectively consistent human values] because that’s the function being approximated, the underlying structure of the data.” I dunno if this is an accurate Zack-model.
(I am running into the limited bandwidth of text here, and will also DM you a link to schedule a conversation if you’re so inclined.)