entirely ending large classes of zero days hasn't happened yet, but it's still permitted by physics that we figure it out. how do you end all zero days in an arbitrary block of matter? it seems like you'd need a learning system that lets you check whether the learned model satisfies a statement about the matter's behavior. imo we shouldn't yet give up on formally verifying margins on the behavior of complex systems; "as long as it stays within this region, I've checked it can't break" statements are very useful, even if we can't know whether we missed a statement we'd have liked to make about what the system allows. some refs that lead me to think this is not a hopeless direction:
rustlang demonstrates that much larger categories of vulnerability (memory unsafety, data races) can be made unrepresentable in the language itself than was previously assumed practical.
making erroneous states unrepresentable is an important factor in how normalization works in a neural network anyway; eg, the s4 sequence model starts from formal math (the HiPPO framework) that derives a margin on how well the recent past can be reconstructed from a fixed-size polynomial approximation. this sort of "don't forget" constraint seems to me likely to be critical for avoiding loss of valuable complexity. there are a few talks and blog posts about s4, but the one I recommend is https://www.youtube.com/watch?v=luCBXCErkCs
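to make the "summarize the recent past with a polynomial" idea concrete, here's a minimal toy sketch in plain numpy (my own illustration, not the actual s4/HiPPO recurrence): represent a signal's recent window as a fixed number of Legendre coefficients and measure how badly that fixed-size memory reconstructs it.

```python
# toy illustration only, not the actual s4/HiPPO recurrence. the point is the
# shape of the statement: the recent past of a signal can be summarized by a
# fixed number of polynomial coefficients, and the reconstruction error of that
# summary is something you can measure (and, in the HiPPO derivation, bound).
import numpy as np
from numpy.polynomial import legendre

# a smooth 1-d signal standing in for the history a sequence model must remember
t = np.linspace(0.0, 1.0, 512)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 7 * t)

# "memory" = coefficients of a degree-N Legendre fit over the recent window
N = 16
window = t >= 0.5                        # treat the last half as "the recent past"
x = 4.0 * (t[window] - 0.75)             # rescale the window onto [-1, 1]
coeffs = legendre.legfit(x, signal[window], deg=N)

# reconstruct the window from the fixed-size memory and check the margin
recon = legendre.legval(x, coeffs)
max_err = np.max(np.abs(recon - signal[window]))
print(f"{N + 1} coefficients, worst-case reconstruction error {max_err:.2e}")
```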
formal verification of neural networks has been making some progress; see, eg, semantic scholar: stuff citing reluplex
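reluplex is a specialized solver, but the shape of the question it answers can be shown with a toy SMT encoding; the example below uses z3 (my choice for illustration, not what the reluplex work uses) to ask whether a property provably holds for every input inside a region, ie exactly the "as long as it stays within this region" style of statement from above.

```python
# not reluplex itself: a toy version of the same kind of question, encoded
# directly with the z3 SMT solver (pip install z3-solver). for every input
# inside a region, is the network's output provably below a bound?
from z3 import And, If, Real, Solver, unsat

def relu(v):
    return If(v > 0, v, 0)

x = Real("x")

# a tiny hand-written 1-2-1 ReLU network (weights chosen for the example)
h1 = relu(2 * x - 1)
h2 = relu(3 - x)
y = h1 - 0.5 * h2

s = Solver()
s.add(And(x >= 0, x <= 1))   # "as long as it stays within this region..."
s.add(y > 0)                 # ...can the property y <= 0 ever be violated?

if s.check() == unsat:
    print("verified: y <= 0 for every x in [0, 1]")
else:
    print("counterexample:", s.model())
```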
category theory applied to complex systems does not yet appear definitely doomed to failure. the IPAM-UCLA Collective Intelligence workshop included, among other very good talks, one on the category theory of open systems by John Baez (youtube).
related to these techniques, there has been some success in turning physical simulations into formal statements about a margin-to-nearest-error without anything neural involved; eg the Food For Thought talk "multisymplectic integrators for Hamiltonian PDEs" (youtube, abstract)
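a much simpler cousin of that idea, just to show the flavor of guarantee (my own toy example, not from the talk): a symplectic integrator keeps a conserved quantity inside a bounded band over arbitrarily long runs, while the naive integrator drifts without limit.

```python
# far simpler than a multisymplectic PDE integrator, but the same flavor of
# guarantee: the symplectic update keeps a conserved quantity (energy, here)
# inside a bounded band indefinitely, while the naive update drifts forever.

def energy(q, p):
    return 0.5 * p**2 + 0.5 * q**2   # harmonic-oscillator hamiltonian

def explicit_euler(q, p, dt):
    return q + dt * p, p - dt * q

def symplectic_euler(q, p, dt):
    p = p - dt * q                   # kick...
    q = q + dt * p                   # ...then drift, using the updated momentum
    return q, p

dt, steps = 0.01, 200_000
for name, step in [("explicit euler", explicit_euler), ("symplectic euler", symplectic_euler)]:
    q, p = 1.0, 0.0
    e0 = energy(q, p)
    worst = 0.0
    for _ in range(steps):
        q, p = step(q, p, dt)
        worst = max(worst, abs(energy(q, p) - e0))
    print(f"{name}: worst energy drift over {steps} steps = {worst:.3e}")
```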
capability-based security has had some success at closing whole classes of vulnerability. because complexity makes verification hard, it seems promising to separate an internally-unverifiable complex system from the external guarantees it needs to provide, and verify only the boundary; doubly so if we can define some sort of coherent boundary statements between systems
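a toy sketch of the core move, not modeled on any particular capability system (the class and file names are made up for illustration): code receives an object carrying exactly the authority it needs, so the boundary statement the rest of the system relies on is just that object's interface.

```python
# toy capability pattern: instead of ambient authority ("open any path you
# like"), a function is handed an object that can only do one narrow thing.
from pathlib import Path

class ReadOnlyDirCap:
    """capability granting read-only access beneath one directory and nothing else."""

    def __init__(self, root: Path):
        self._root = root.resolve()

    def read_text(self, relative: str) -> str:
        target = (self._root / relative).resolve()
        if not target.is_relative_to(self._root):   # refuse traversal out of the granted root
            raise PermissionError(f"{relative!r} escapes the granted directory")
        return target.read_text()

def summarize_config(cap: ReadOnlyDirCap) -> int:
    # this function's entire authority is the capability it was handed; it has
    # no way to open sockets, write files, or read paths outside that directory
    return len(cap.read_text("settings.txt"))

# the caller decides exactly how much authority to delegate, eg:
# summarize_config(ReadOnlyDirCap(Path("/srv/app/config")))
```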
work on extracting physical laws from neural networks using symbolic regression gives me hope that seeking representations which generalize into formal theories of macroscopic behavior is still a promising direction. Steve Brunton and folks have a number of relevant discussions, and citation-walking near their papers would likely turn up even more promising components, but this is one of the most interesting I've seen from his channel: youtube—blogpost—paper
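to show the flavor of the symbolic-regression step (a bare-bones version of the SINDy-style sparse regression from Brunton and collaborators, written from scratch here and skipping the neural-network stage entirely): fit time derivatives as a sparse combination of candidate terms, so the learned model is a short symbolic law rather than a black box.

```python
# bare-bones sketch of sequential thresholded least squares (the SINDy move),
# my own minimal numpy version rather than the authors' code: recover a short
# symbolic law from trajectory data of a system we pretend not to know.
import numpy as np

# simulated data from the "unknown" system x1' = x2, x2' = -x1
dt = 0.001
t = np.arange(0.0, 20.0, dt)
data = np.stack([np.cos(t), -np.sin(t)], axis=1)
dxdt = np.gradient(data, dt, axis=0)                 # numerical time derivatives

# candidate term library: [1, x1, x2, x1^2, x1*x2, x2^2]
x1, x2 = data[:, 0], data[:, 1]
library = np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=1)
names = ["1", "x1", "x2", "x1^2", "x1*x2", "x2^2"]

# sequential thresholded least squares: fit, zero out tiny coefficients, refit
coef, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
for _ in range(10):
    coef[np.abs(coef) < 0.05] = 0.0
    for j in range(coef.shape[1]):
        keep = np.abs(coef[:, j]) > 0
        if keep.any():
            coef[keep, j], *_ = np.linalg.lstsq(library[:, keep], dxdt[:, j], rcond=None)

for j, lhs in enumerate(["dx1/dt", "dx2/dt"]):
    terms = [f"{coef[i, j]:+.2f}*{names[i]}" for i in range(len(names)) if coef[i, j] != 0]
    print(lhs, "=", " ".join(terms) or "0")
```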
that last one is probably a good summary of yudkowsky's core fears, btw. I suspect that the real dynamics of complex systems are often quite difficult to simplify, and that simulating, eg, a brain requires a similar wattage to running a physical brain even with the most optimized algorithms. however, I'm not sure of that. maybe you can get pretty high accuracy from symbolically distilled statements alone.
My hope is that there’s some sort of statement we can make about a system’s memory that includes at least a guarantee that no organism simulated in the system can die during the simulation. I expect we can find a statement about information loss that can at least guarantee that, at multiple scales of a complex dynamical system, low-entropy complexity from a previous step takes as long as possible to diffuse into high entropy. I don’t know how to verify this, and I’m not high enough level to figure it out on my own, but links like the ones above make me suspect that there’s something to distill here that will allow recognizing unwanted interference with only a simple objective.
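a very loose sketch of the kind of quantity such a statement might be about (my framing, not an established metric, and far weaker than anything we'd actually need): watch how fast the coarse-grained entropy of a structured initial state rises under a toy diffusion dynamic.

```python
# toy illustration of "low-entropy structure diffusing into high entropy":
# coarse-grain a field into blocks and track the shannon entropy of the
# block distribution as a simple diffusion dynamic runs.
import numpy as np

def coarse_entropy(field, block=8):
    # coarse-grain into blocks, normalize to a distribution, take shannon entropy
    h, w = field.shape
    blocks = field.reshape(h // block, block, w // block, block).sum(axis=(1, 3))
    p = blocks / blocks.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# low-entropy initial condition: all mass concentrated in one corner
field = np.zeros((64, 64))
field[:8, :8] = 1.0

for step in range(301):
    if step % 100 == 0:
        print(f"step {step:3d}: coarse-grained entropy = {coarse_entropy(field):.3f}")
    # toy diffusion: average each cell with its four neighbours (periodic boundary)
    field = 0.2 * (field
                   + np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0)
                   + np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1))
```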
I definitely wouldn’t rule out the possibility of being able to formally define a set of tests that would satisfy our demands for alignment. The most I could say with certainty is that it’s a lot harder than eliminating software security bug classes. But I also wouldn’t rule out the possibility that an optimizing process of arbitrarily strong capability simply could not be aligned, at least to a level of assurance that a human could comprehend.
Thank you for these additional references; I was trying to anchor this article with some very high-level concepts. I very much expect that, to achieve any kind of confidence about the alignment of a system, we're going to have to invent and test hundreds of formalisms.