Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.
I’m looking for projects in interpretability, activation engineering, and control/oversight; DM me if you’re interested in working with me.
Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, notice when they stop correlating with what we care about, and iteratively improve them. Reward hacking in RL is similarly prevalent: policies often achieve very high proxy reward while the true objective suffers.
If the reward model is as smart as the policy and is continually updated with new data, maybe we're in a different regime, where the reward model's errors stay small relative to the utility the policy actually produces.
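To make this concrete, here's a toy sketch of the two regimes (everything here is made up for illustration: the quadratic `true_utility`, the linear "reward model", and all the constants). A policy that hill-climbs a frozen proxy Goodharts hard, while a proxy that's continually re-fit on data near the current policy's behavior keeps its errors small and the policy ends up near the true optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # What we actually care about: peaks at x = 1, declines after.
    return x - 0.5 * x ** 2

def fit_proxy(center, n=32, width=0.5):
    # "Reward model": a linear fit to noisy utility samples drawn
    # near the current policy's behavior (x = center).
    xs = center + width * rng.normal(size=n)
    ys = true_utility(xs) + 0.05 * rng.normal(size=n)
    slope, intercept = np.polyfit(xs, ys, 1)
    return lambda x: slope * x + intercept

def run(refit_every, total_steps=200, lr=0.05):
    # Policy improvement: hill-climb the proxy reward, optionally
    # re-fitting the proxy on fresh data every `refit_every` steps.
    x, proxy = 0.0, fit_proxy(0.0)
    for step in range(total_steps):
        if refit_every and step % refit_every == 0:
            proxy = fit_proxy(x)
        grad = (proxy(x + 1e-3) - proxy(x - 1e-3)) / 2e-3
        x += lr * grad
    return x, true_utility(x)

# Frozen proxy: the linear fit says "more x is always better",
# so the policy overshoots far past the true optimum (Goodhart).
print("frozen proxy:      x=%.2f, U=%.2f" % run(refit_every=0))
# Continually re-fit proxy: once x passes 1 the refit slope flips
# sign, so the policy settles near the true optimum.
print("continually refit: x=%.2f, U=%.2f" % run(refit_every=10))
```

The refit branch is exactly the real-world loop from the first point: observe the metric where the policy actually operates, notice when it diverges from what you care about, and update it.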