I’m curious why you think deceptive alignment from transformative AI is not much of a threat. Are you envisioning purely tool AI, or aligned agentic AGI that’s just not smart enough to align better AGI?
I think it’s quite implausible that we’ll leave foundation models as tools rather than using the prompt “pretend you’re an agent and call these tools” to turn them into agents. People want their work done for them, not just advice on how to do their work.
I do think it’s quite plausible that we’ll have aligned foundation model agents that won’t be quite smart enough to solve deeper alignment problems reliably, but will be sycophantic/clever enough to help researchers fool themselves into thinking those problems are solved. Since your last post to that effect it’s become one of my leading routes to disaster. Thanks, I hate it.
OTOH, if that process is handled slightly better, it seems like we could get the help we need to solve alignment from early aligned LLM agent AGIs. This is valuable work on that risk model that could help steer orgs away from likely mistakes and toward better practices.
I guess somebody should make a meme about “humans and early AGI collaborate to align superintelligence, and fuck it up predictably because they’re both idiots with bad incentives and large cognitive limitations, gaps, and biases” to ensure this is on the mind of any org worker trying to use AI to solve alignment.