Over the past year I’ve spent around 5% of my work time having very long conversations with researchers who disagree with me about AI risk or alignment.[1] This has been great: I learned more than I would have from lots of short conversations, and I highly recommend it.
Most of the conversations were related to what I think of as the LLM-worldview, which is kinda stereotyped as Simplicia. I’m pretty bad at passing the ITT (ideological Turing test) of any of the versions of this worldview, and I found it difficult to switch out of (what I’ll call) the arbital-agent-foundations ontology. This was worrying, so I wanted to make sure I wasn’t missing something. And of course I also wanted to test whether I could convince people of my beliefs, since if my reasoning makes sense it should be communicable.
The most successful conversation so far was with @Seth Herd, mostly due to his patience and effort. We probably spent more than eight hours talking over video and much more than that over text dialogue. Spending so much time working through his LLM-based AGI scenario updated me a little toward finding it more plausible, and I think I got most of the way toward being able to ITT his views. Much of the conversation was spent talking about what reflection-initiated-misgeneralisation might look like within his AGI scenario. He just wrote a good post about this, and about some of the updates he made.
One meta-level lesson I learned: I think we communicated better the more I adopted Seth’s ontology and talked in terms of LLMs, simulated characters, prompts, memory systems, and generalisation. Seth seems to have understood and changed his mind about things without a large change to his ontology. This is different from the way I originally understood the same issues, which left me mostly thinking in terms of the arbital-agent-foundations ontology.
The second most successful conversation was probably with Alex Cloud, even though we didn’t convince each other of anything. We both had kinda ridiculous ideas of what the other person believed, which I think we both corrected. We agreed on more than we expected, but I was left a little confused about why we had such different intuitions about how a trained agent would generalise in the particular scenarios we discussed.
I think this is all of them:
A LW dialogue with mattmacdermott about consequentialism.
An in-person dialogue with joshc about whether obedience as a goal obviates some alignment difficulties.
A relatively short (roughly five-hour) in-person conversation with Alex Cloud about shard theory and generalisation.
A stalled but quite long conversation with Lucius Bushnaq about how much we can trust trained predictors like AIXI (started from this).
And most recently a very long dialogue with Seth Herd, which started out similar to the conversation with joshc but went much deeper (started from this).