I think you’re getting a lot of pushback because current models do, pretty clearly, sometimes fail to follow either user or developer intent.
To be clear I’m glad for the pushback!! Thank you for engaging.
It’s pretty clear how it misinterpreted (or was mistrained on) dev intent, and then misinterpreted my intent based on that training.
Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
I really think nobody has much clue about this at this point, and we should collectively try to get one.
Yeah, I guess I agree there.
This video with Neel Nanda goes into the logic.
which one?
I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose to focus on for training targets. So some of both.
This video:
https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
I got tired and forgot to find it and add the link before sending.