I liked the high-level strategic frame in the methodology section. I do sure wish we weren’t pinning our alignment hopes on anything close to the current ML paradigm, but I still put significant odds on us having to do so anyway. And it seemed like the authors had a clear understanding of the problem they were trying to solve.
I did feel confused reading the actual explanation of what their experiment did, and wish some more attention had been given to explaining it. (It may have used shorthand that a seasoned ML researcher would understand, but I had to dig into the appendix of the paper and ask a friend for help to understand what “given a set of yes/no questions, answer both yes and no” meant in a mechanistic sense.)
It seems like most of the rest of the article doesn’t really depend on whether the current experiment made sense (with the current experiment just being kinda a proof-of-concept that you could check an AI’s beliefs at all). But a lot of the authors’ intuitions about what should be possible do feel at least reasonably promising to me. I don’t know that this approach will ultimately work, but it seemed like a solid research direction.
Curated.