If John Wentworth is correct that slop is the biggest danger, making AI produce less slop would be the clear best path. I think it might be a good idea even if the dangers were split between misalignment of the first transformative AI, and it being adequately aligned but helping misalign the next generation.
From my comment on that post:
I’m curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you’re envisioning purely tool AI, or aligned agentic AGI that’s just not smart enough to align better AGI?
I think it’s quite implausible that we’ll leave foundation models as tools rather than using the prompt “pretend you’re an agent and call these tools” to turn them into agents. People want their work done for them, not just advice on how to do their work.
I do think it’s quite plausible that we’ll have aligned agentic foundation model agents that won’t be quite smart enough to solve deeper alignment problems reliably, yet sycophantic/clever enough to help researchers fool themselves into thinking those problems are solved. Since your last post to that effect it’s become one of my leading routes to disaster. Thanks, I hate it.
OTOH, if that process is handled slightly better, it seems like we could get the help we need to solve alignment from early aligned LLM agent AGIs. This is valuable work on that risk model that could help steer [AI development] orgs away from likely mistakes and toward better practices.
Following on that logic, I think making our first transformative AI less prone to slop/errors is a good idea. The problem is that most such efforts probably speed up progress toward getting there.
I’m starting to feel pretty sure that refusing to speed up progress and hoping we get enough time or a complete stallout is unrealistic. Accepting that we’re on a terrifying trajectory and trying to steer it seems like the best response.
I think some routes to reducing slop also contribute to aligning the first really competent LLM-based agents. One example is engineering such an agent to review its important decisions, checking whether they contain important errors or change/violate its central goals (a rough sketch of that review step follows below). I’ve written about that here but I’m publishing an updated and expanded post soon.
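To make the idea concrete, here is a minimal sketch of that kind of decision-review step. None of these names or prompts come from the original post; the `model` parameter is just any callable that maps a prompt string to a completion, and the review prompt, `review_decision`, and `act_with_review` are illustrative placeholders.

```python
# Sketch: before an agent commits to an important action, a second pass checks the
# decision for errors and for drift from the agent's stated central goals.
from typing import Callable

REVIEW_PROMPT = """You are reviewing a proposed action by an AI agent.
Agent's central goals:
{goals}

Proposed action and reasoning:
{decision}

Answer 'OK' if the action is free of important errors and consistent with the goals
above. Otherwise answer 'REVISE:' followed by a brief explanation."""


def review_decision(model: Callable[[str], str], goals: str, decision: str) -> tuple[bool, str]:
    """Ask a model to check an important decision for errors or goal violations."""
    verdict = model(REVIEW_PROMPT.format(goals=goals, decision=decision)).strip()
    return verdict.startswith("OK"), verdict


def act_with_review(model: Callable[[str], str], goals: str, decision: str,
                    execute: Callable[[str], None]) -> None:
    """Only execute the decision if the review pass approves it."""
    approved, verdict = review_decision(model, goals, decision)
    if approved:
        execute(decision)
    else:
        # In a real agent this would trigger revision or escalation to a human.
        print(f"Decision flagged by review: {verdict}")
```

The point of the sketch is just the shape of the mechanism: the review pass is a separate check against errors and goal drift, not part of the decision-generating pass itself.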
So yes, I think this is probably something we should be doing. It’s always going to be a judgment call whether you publicize any particular idea. But there are more clever-to-brilliant people working on capabilities every day. Hoping they just won’t have the same good ideas seems like a forlorn hope. Sharing the ones that seem to have more alignment relevance seems like it will probably differentially advance alignment over capabilities.