Introspection is really interesting! This example where language models respond with the HELLO pattern (and can say they do so) is actually just one instance of language models being able to articulate their implicit goals, and, more generally, of out-of-context reasoning.
Wow. I need to learn how to search for papers. I looked for something like this even generally and couldn’t find it, let alone something so specific
Haha, I think I have an unfair advantage because I work with the people who wrote those papers :) I also think looking for papers is just hard in general. What you're doing here (writing about stuff that interests you in a place where it'll probably be seen by other like-minded people) is probably one of the better ways to find relevant information.
Edit: Happy to set up a call also if you’d like to chat further! There are other interesting experiments in this space that could be done fairly easily
I would absolutely like to chat further. Please send me a DM so we can set that up!