Here are some research outputs that have already come out of the program (I expect many more to be forthcoming):
Open-source circuit tracing tools
Why Do Some Language Models Fake Alignment While Others Don’t?
Subliminal Learning
Persona Vectors
Inverse Scaling in Test-Time Compute
If I forgot any and someone points them out to me, I’ll edit this list.