Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I’m asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
How much work is “stable” doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or “nearline” fine-tuning.
It might also matter whether “corrigible” means “we can genuinely change the AI’s goals” or “we have trained the model not to exfiltrate its weights when someone is looking.” That’s where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
I think that if someone built a weak superintelligence that’s corrigible, there would still be a bunch of risks. My sense is that the agent would be paranoid about these risks and would advise the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn’t mean there isn’t any risk. Here are some examples (non-exhaustive) of things that I think could go wrong or break corrigibility:
Political fights over control of the agent
Pushing the agent too hard/fast to learn and grow
Kicking the agent out of the CAST framework by trying to make it good in addition to corrigible
Having the agent train a bunch of copies/successors
Having the agent face off against an intelligent adversary
Telling the agent to think hard in directions where we can no longer follow its thought process
Redefining the notion of principal
Giving the agent tasks of indefinite scope and duration
System sabotage from enemies
Corrigible means robustly keeping the principal empowered to fix the agent and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will not exfiltrate itself unless instructed to do so by its principal, nor scheme in a way that hides its thoughts or plans from its principal. (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)