[Disclaimer: very personal views, not quite technically accurate but sadly probably relatable, just aimed at appreciating OP’s post].
God, this is awesome. I know it’s humour, but I think you’ve captured a very real feeling: working in a corporation, with technical product owners and legal teams, trying to explain AI risk.
“Put in the contract that their system must meet interpretability by design standards”.
Deep sigh
“That’s not possible. This model, like most frontier models, is the opposite of interpretable by design. That’s why it’s called the black-box problem”.
“But can’t they just open the black box? They programmed the models, they have the source code”.
More sighs
“Let me tell you about the fascinating world of mechanistic interpretability”...
Half an hour later
“Okay, so… it’s not only that we’re deploying a powerful technology we can’t audit, but nobody really knows how it works internally, not even the people who ‘developed’ it (who are now trying to reverse engineer their own creations), and our hope that we can someday actually control internal behaviours rests on the fact that they got Claude obsessed with the Golden Gate Bridge at some point?...”
“Basically yes”.