I just noticed that Lean now has a documented method to verify potentially malicious proofs: https://lean-lang.org/doc/reference/latest/ValidatingProofs/#validating-comparator https://github.com/leanprover/comparator
To me it looks like that this allows you to sandbox a mathematical AI pretty securely as far as today’s cybersecurity practices go:
Write your theorem statement by hand. You can of course make mistakes here, but there is no adversarial pressure.
Run the AI on some sandboxed machine. From here, you export serialized proof objects (not Lean code).
Deserialize and verify the proof on some other machine. The attack surface becomes the deserializer + the two independent kernel implementations (you have to find bugs in both kernels simultaneously to exploit).
The question is whether this is relevant in any way?
If alignment issues become capability issues in hard-to-verify reward-hackable domains (I think we are already seeing some of this for coding), could areas like this with easy-to-verify non-hackable rewards see disproportionate capability growth?
If we get a “mathematical superintelligence” (that companies like https://harmonic.fun/ are claiming to want to build), we could probably get a bunch of formally verified software/hardware out of it, how does that change the world? (Is writing specifications going to become the bottleneck? That’s unclear to me.)
Is such a mathematical superintelligence useful for alignment research?
To me this reads like the limiting factor is not people’s willingness to be playful, but their lack of ideas? Being able to improvise non-trivial fun activities seems like a particular skill not everyone has.
In your scenario, if I was passing by and you tossed me the ball, I would toss it back, smile, and move on. Ideas to turn this into anything more would not occur to me without some conscious thinking effort.
But if you told me that you are trying to organize a volleyball game across the rooftop and the balcony on the other side, I would be like heck yeah, let me go to that balcony and see how this goes.
Have you tried directing people to do the kinds of fun activities you want in these situations? (E.g. putting up a net two stories high is definitely not happening without some organized effort. You have to find a net, think about attachment points for good positioning, find ladders, etc. Definitely possible, but not without the goal being stated verbally, people putting in active thinking effort, and someone coordinating the whole thing. But again, you can delegate this, if you told me that “I want a net there, can you arrange that?”, I would be like “hmm… yeah, that sounds doable, give me 10 minutes, I will find the people and resources”. )