I had a second half of this essay that was taking too long to pull together, and I wasn’t quite sure who I was arguing with. I decided I’d probably try to make it a second post. I generally agree it’s not obvious what lessons to take.
The beginning of the second-half/next-post was something like:
There’s an age-old debate about AI existential safety, which I might summarize as the viewpoints:
1. “We only get one critical try, and most alignment research dodges the hard part of the problem, with wildly optimistic assumptions.”
vs
2. “It is basically impossible to make progress on remote, complex problems on your first try. So, we need to somehow factor the problem into something we can make empirical progress on.”
I started out mostly thinking through lens #1. I’ve since updated that, actually, both views may be “hair on fire” levels of important. I have some frustrations both with some doomer-y people who seem resistant to incorporating lens #2, and with people who seem (in practice) to be satisfied with “well, iterative empiricism seems tractable, and we don’t super need to incorporate frame #1.”
I am interested in both:
trying to build “engineering feedback loops” that represent the final problem as accurately as we can, and then iterating both on “solving representative problems against our current best engineered benchmarks” and on “continuing to build better benchmarks.” (Automating Auditing and Model Organisms of Misalignment seem like attempts at this.)
trying to develop training regimens that seem like they should help people plan better in Low-Feedback-Domains, which includes theoretical work, empirical research that tries to keep its eye on the long-term ball, and the invention of benchmarks à la the previous bullet.