Have weaker models check whether they can actually follow each step—if Claude Jr. can’t understand what Claude Sr. is saying, maybe Claude Sr. is hiding something: “MOOOOOM, Claude’s being WEIRD again!”
It seems this section addresses your idea.
My dog often takes various things lying around within their reach (socks, napkins, once a passport) and runs away, usually destroying the object to some degree. I think it started with food items that were left around and somehow evolved into a broader habit. Ideally, this behavior would be disincentivized somehow until they stopped completely, but over time I have found the best way to get them to give up the item is by trading a treat for it. This post made me realize that I’m basically training them to start this keep-away game.
Another thing I hadn’t realized until writing this comment: sometimes they take stuff that I find practically worthless (in which case they don’t get a treat), and sometimes they take stuff that is really important (in which case I run to get a high-value treat), so I am in effect implementing an intermittent reinforcement scheme.
I wish I could effectively communicate to my dog that they would get a treat at a predetermined time if and only if they refrained from doing this, but I don’t suppose they’d be able to make the connection unless I implemented a really attention-demanding, high-frequency scheme. Curse our lack of shared language.
“Hell is other cats.”—Jean-Paw Sartre
Thanks, I will check it out!