points I’d want to make in a main post:
you can soonishly be extremely demanding about what you want to prove, and then ask a swarm of AIs to go do it for you. if you have a property where you’re pretty sure a proof of your theorem about it would mean your AI was good in some provable way, but it’s way too expensive to search the space of AIs for ones the proof goes through for, you can probably combine deep learning and provers somehow to make it work, something not too far from “just ask gemini deep thinking”. see also learning theory for grabbing the outside, and the katz lab in israel for grabbing the inside.
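to make “combine deep learning and provers” slightly less vague, here’s a minimal sketch of the shape of loop I have in mind, in python. `propose_candidate` and `try_prove` are hypothetical stand-ins I made up (a learned proposer and a call out to some prover or verifier), not real APIs; the stand-in “proof” is a cheap numeric check so the sketch runs end to end.

```python
import random

def propose_candidate(spec, feedback):
    # hypothetical: a learned model (a fine-tuned LLM, a search policy, ...)
    # suggests a candidate artifact given the spec and prior prover feedback
    return {"params": [random.random() for _ in range(4)]}

def try_prove(spec, candidate, budget):
    # hypothetical: would call out to a prover/verifier (an SMT solver, Lean,
    # a neural-network verifier) and report success plus feedback on failure.
    # here it is just a toy numeric check so the loop actually runs.
    proved = sum(candidate["params"]) < 1.5
    return proved, None if proved else "sum of params too large"

def search_for_provable(spec, max_rounds=1000, budget=10):
    feedback = None
    for _ in range(max_rounds):
        candidate = propose_candidate(spec, feedback)
        proved, feedback = try_prove(spec, candidate, budget)
        if proved:
            return candidate  # in the real version, with a checkable certificate
    return None

print(search_for_provable(spec="output stays inside the safe set"))
```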
once you’ve nailed down your theorem, the agent foundations machinery you want will probably give you tools for non-provable insights, but also a framework for making probability margins provable (eg imprecise probability theory).
you can prove probability margins about fully identified physical systems! you can’t prove anything else about physical systems, I think?
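to gesture at what a provable probability margin might look like: below is a toy imprecise-probability calculation I made up, for a tiny markov model of a “fully identified” system whose transition probabilities are only known up to intervals. the greedy step is exact for maximizing a linear objective over box-constrained distributions, so the printed number is a sound worst-case upper bound on hitting the bad state, not an estimate.

```python
# states: 0 = nominal, 1 = degraded, 2 = bad (absorbing); numbers made up.
# intervals[s][t] = (lower, upper) bound on P(s -> t)
intervals = {
    0: {0: (0.90, 0.97), 1: (0.02, 0.08), 2: (0.00, 0.02)},
    1: {0: (0.30, 0.50), 1: (0.40, 0.60), 2: (0.05, 0.15)},
    2: {2: (1.00, 1.00)},
}

def worst_case_expectation(row, values):
    # maximize sum_t p[t] * values[t] subject to lower <= p <= upper, sum p = 1.
    # greedy allocation of the leftover mass to the highest values is exact here.
    p = {t: lo for t, (lo, hi) in row.items()}
    mass = 1.0 - sum(p.values())
    for t in sorted(row, key=lambda t: values[t], reverse=True):
        lo, hi = row[t]
        add = min(hi - lo, mass)
        p[t] += add
        mass -= add
    return sum(p[t] * values[t] for t in p)

def reach_bad_upper_bound(T):
    # robust dynamic programming: v[s] = worst-case P(reach state 2 within the
    # remaining steps), taking the adversarial choice inside each interval
    v = {0: 0.0, 1: 0.0, 2: 1.0}
    for _ in range(T):
        v = {s: 1.0 if s == 2 else worst_case_expectation(intervals[s], v)
             for s in intervals}
    return v[0]

print(reach_bad_upper_bound(T=10))  # provable margin: P(bad within 10 steps) <= this
```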
(possibly a bolder claim) useful proofs are about what you do next and how that moves you away from bad subspaces, and therefore about how you reach the limit; not about what happens in the limit and therefore what you do next.
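as a toy illustration of that: the sketch below certifies, box by box with interval arithmetic, that a single step of a made-up 1d update rule never increases a potential V outside a small neighborhood of the target. the dynamics, the potential, and the region are all invented; the point is only that the certificate talks about single steps, and staying out of the high-V region falls out of that without ever computing the limit.

```python
# per-step certificate for the update x <- x - 0.3 * dV/dx with V(x) = x^2.
# 160 boxes over [-10, 10] so that box edges land on exact binary fractions.

def square_interval(lo, hi):
    # tight interval for x^2 when x is in [lo, hi]
    ends = (lo * lo, hi * hi)
    return (0.0 if lo <= 0.0 <= hi else min(ends)), max(ends)

def step_interval(lo, hi):
    # image of [lo, hi] under x <- x - 0.3 * 2x = 0.4x (monotone, so easy)
    return 0.4 * lo, 0.4 * hi

def certify_one_step_descent(region=(-10.0, 10.0), boxes=160):
    a, b = region
    width = (b - a) / boxes
    for i in range(boxes):
        lo, hi = a + i * width, a + (i + 1) * width
        if lo <= 0.0 <= hi:
            continue                      # skip the boxes touching the target
        next_lo, next_hi = step_interval(lo, hi)
        _, v_after_max = square_interval(next_lo, next_hi)
        v_before_min, _ = square_interval(lo, hi)
        if v_after_max > v_before_min:    # sound but deliberately crude check
            return False, (lo, hi)        # a box where the certificate fails
    return True, None

print(certify_one_step_descent())         # (True, None): descent certified
```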
finding the right thing to prove is still where all the juice is, just like it was before. but I’d be writing to convince people who think proof and theory are a dead end that they’re not: the bitter lesson can become the sweet lesson if you have something you’re sure is right in the sense that more optimization power means more goodness, and proofs give you a framework for that level of nailing things down. you can, eg, imagine using a continuous relaxation of provability as your loss function.
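to make that last sentence slightly more concrete, here’s one toy construction of mine (not a standard recipe): interval bound propagation gives a sound lower bound on a tiny relu net’s output over a whole input box, and the gap between that bound and the spec is a penalty you could add to the training loss. plain python with made-up weights; in practice you’d compute the same bounds inside an autodiff framework so the penalty is differentiable in the weights, and the penalty hitting zero literally is the property being proved for every input in the box.

```python
def ibp_affine(lo, hi, W, b):
    # interval bounds of W @ x + b when each x[i] lies in [lo[i], hi[i]]
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        out_lo.append(bias + sum(w * (lo[i] if w >= 0 else hi[i]) for i, w in enumerate(row)))
        out_hi.append(bias + sum(w * (hi[i] if w >= 0 else lo[i]) for i, w in enumerate(row)))
    return out_lo, out_hi

def ibp_relu(lo, hi):
    # relu is monotone, so apply it to both ends of the interval
    return [max(0.0, l) for l in lo], [max(0.0, h) for h in hi]

def provability_penalty(W1, b1, W2, b2, box_lo, box_hi, threshold):
    # hinge on the certified lower bound of the scalar output:
    # zero exactly when IBP proves output >= threshold on the whole box
    lo, hi = ibp_affine(box_lo, box_hi, W1, b1)
    lo, hi = ibp_relu(lo, hi)
    lo, hi = ibp_affine(lo, hi, W2, b2)
    return max(0.0, threshold - lo[0])

# spec: output >= 0 for every input in the box [0, 1] x [0, 1] (made-up weights)
W1, b1 = [[1.0, -0.5], [0.3, 0.8]], [0.1, -0.2]
W2, b2 = [[0.7, 0.4]], [0.05]
print(provability_penalty(W1, b1, W2, b2, [0.0, 0.0], [1.0, 1.0], threshold=0.0))
```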
so I’d want to argue that the interesting bit is how you prove that your reinforcement learning process will always keep coming back for more input, without breaking your brain, and will keep you in charge. the main paths I see for making this make sense are: a CEV thing with a moment in time pinned down as the reference point for where to find a human, which is the PreDCA/QACI style approach; or jumping past CEV and encoding empowerment directly, then asking for empowerment and non-corruption of mindlike systems or something.
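on “encoding empowerment directly”: the serious versions are about channel capacity between action sequences and future states, which I won’t reproduce here. below is only the degenerate deterministic case, where k-step empowerment collapses to the log of the number of states you can reach in k steps, on a gridworld I made up for illustration.

```python
import math
from itertools import product

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]   # moves plus "stay"
WALLS = {(1, 1), (1, 2)}                                # arbitrary obstacles
SIZE = 4                                                # 4x4 grid

def step(state, action):
    x, y = state[0] + action[0], state[1] + action[1]
    if (x, y) in WALLS or not (0 <= x < SIZE and 0 <= y < SIZE):
        return state                                    # bump: stay in place
    return (x, y)

def empowerment(state, k):
    # deterministic case: log2 of how many distinct states some
    # k-step action sequence can land you in
    reachable = set()
    for seq in product(ACTIONS, repeat=k):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# a corner state is less empowered than a more open, central one
print(empowerment((0, 0), k=2), empowerment((2, 2), k=2))
```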
this is sort of an update of the old miri perspective, and it ends up calling for a bunch of the same work. so what I’d be doing in a main post is trying to lay out an argument for why that same work is not dead, and in fact is a particularly high value thing for people to do.
I’d be hoping to convince noobs to try it, and to get anthropic and deepmind to look closer at what it’d take for their AIs to be ready to help with this in particular.
I’m posting this instead of a main post because I’ve developed an ugh field around writing the main post, so I need to make it ok to post sloppy versions of the point first and get incremental bites of feedback from people who find it unconvincing, rather than trying to frontload the entire explanatory effort before I even know what people think sounds implausible.
It seems like a lot of the trouble with making this plan work is the trouble of making formal verification scale to very large, messy systems, which has been an open problem for a long time, and is the main place this class of idea might turn out to be effectively impossible. It may be that the noise pressure is too high, so that learning the real world unavoidably forces mess beyond the reach of any formal-verification-based approach.
I’ve bumped into various folks who’ve tried formal verification for neural networks, and the thing they’ve said consistently is, “well, you can do it at all, but the most complex properties verified are still things like not crashing an aircraft in a simplified flight simulator, for a low-millions-of-parameters model, and that’s primarily possible because of the low-dimensional input/output space.”
formally verifying properties over huge, combinatorial IO spaces (eg llm, image, or physics-model IO) seems potentially very far beyond tractability, even with the help of a deep learning approach specifically designed to scale formal verification to huge messy things. if it’s possible to make this plan work, making it work would look like figuring that out.
I wonder if anyone in the “Moonshot Alignment Program” is pursuing this?