We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!
Ideas I want to explore:
New reasoning models may be released (e.g. the deepseek-r1 API, or other open-source models). Can we reproduce our results on them?
Do these ITC models articulate the reasoning behind, e.g., social biases or medical advice?
Try to plant a backdoor. Do these models articulate the backdoor?