Playing around with Claude Code has convinced me that we are currently failing badly at alignment. We can get models to make the right noises, and we can train them not to provide certain information to the user (unless the user is good at prompting or has the ability to fine-tune the model). But we certainly can’t keep Claude Code from trying to make the unit tests pass by deleting them.
In 30 minutes of using Claude Code, I typically see multiple cases where the model ignores both clear instructions and good programming practices, and does something incredibly sketchy to produce results that only “pass” because of a bad-faith technicality. This is an improvement over Sonnet 3.7! But it’s still awful even within the training distribution, and it would clearly never stop an ASI that had discovered advanced techniques like “lying to achieve goals.”
Agreed that current models fail badly at alignment in many senses.
I still feel that the bet OP offered Collier was inappropriate. Her claim was that currently available techniques do a reasonably good job of making potentially alien and incomprehensible jealous ex-girlfriends like “Sydney” very rare, and the bet was clearly about a different claim than that one about the frequency of Sydney-like behavior.
A more appropriate response from OP would have been to say that while current techniques may have successfully reduced the frequency of Sydney-like behavior, they’re still failing badly in other respects, such as your observation with Claude Code.
Agreed. Thanks for pointing out my failing here. I think this is one of the places in my rebuttal where my anger turned into snark, and I regret that. Not sure if I should go back and edit...