AI that shouldn’t work, yet kind of does

There are some things that work surprisingly well in AI. For example, AI that transfers the style of one image to the content of another. Why is the approach described here a hack. It starts with a neural net trained to classify images. It then runs gradient descent on the the content image, trying to get the covariance matrix of the style to match in early network layers, while trying to get the net layers of the original and style transfer images to be as similar as possible on the later layers.

So the approximations I think are making this work is that, in classifiers, the early layers tend to store simple features, and the later layers tend to hold more complex features. Style is based on the simpler features, and doesn’t depend on the location within the image. Content is based on more complex features and does depend on the location in the image.

We apply optimization power, gradient descent, over heuristics this simple and hacky. And yet it works.

A simple and hacky AI alignment proposal is to just ask chatGPT to do it.

This doesn’t work because chatGPT has been optimized for text prediction, and so isn’t particularly good at AI alignment theory.

So here is an alignment plan.

I know it isn’t great. But some plan is better than no plan. And there was a post about how alignment may well look like “surely no one could have missed that” or “surely such a stupid idea couldn’t work, could it?” not “eureka”.

Train ZFCbot. An AI that is the AlphaGo of ZFC, perhaps trained on random formulas. Perhaps throwing a large corpus of formal maths proofs in there. Ideally the system should have a latent space of maths, so it can think about what style of proof is likely to work before expanding all the details. The system should have a wide range of common maths terms imported from some lean library. It should be optimized purely for ZFC formal theorem proving. Once it is trained, the weights are fixed.

Train ChatMathsGPT. Similar to large language models. Except with oracle access to ZFCbot. In the many maths papers in it’s corpus, it learns to link the informal with the formal. From politics and economics discussions, it asks ZFCbot about toy game theory problems. In general, it learns to identify the pieces of formal maths that best model a situation, and ask about them, and then use the response to predict text.

There is a sense in which this AI knows less maths than normal ChatGPT. Standard ChatGPT has a small crude understanding of maths built from nothing within it’s own mind. This has a much better understanding it can outsource to, it just has to plug in.

Then we ask this ChatMathsGPT for a paper on logical induction. And we hope it can generate a paper of quality similar to Miri’s paper on the topic (where hypothetically this isn’t in the training dataset). If it can, then we have a tool to accelerate deconfusion by orders of magnitude.

Things I am uncertain about. Should ChatMathsGPT have oracle access to a latent space. (can pass gradients, harder to interpret.) or should it just pass formal strings of symbols. (less powerful)

Should ZFCbot get trained on random ZFC; random ZFC + library of theorems and conjectures and random combinations of high level maths concepts; or random ZFC plus whatever ChatMathsGPT keeps asking. The latter gives a route for data to pass between them.

This could fail to be smart enough, I mean I wouldn’t be particularly surprised if it could be made smart enough. But what would the safety failures of this system look like?

Firstly, this AI does deconfusion. If you ask it to write a paperclip maximizer in python, you may well get your wish. Or you might get an AI that maximizes something else. Just asking for an aligned AI isn’t enough. (probably. )

So this artifact isn’t safe. It is easy to destroy the world with stupidity when using it. But let’s assume an optimistic Miri bunker. The Chat part could ask the maths part “what would a superintelligence do in this circumstance”.

Firstly, such a question is quite long, the question needs to contain the code for the superintelligence, and all input data. For this to happen, the input section of the chat algorithm needs to invent the code for superintelligence by itself, when it’s not supposed to be the bit that’s good at maths.

Then there is the fact that ZFCbot wasn’t trained or designed to simulate arbitrary code. Any such simulation will have a lot of overhead, meaning any AI lurking within the code within a neural network trained to do ZFC proof will have a lot of overhead and so relatively low compute.

There is also a sense in which an NP system has been degraded to P. If ZFCbot runs a really clever heuristic search over all proofs of <1000,000 symbols, that’s enough to prove a lot of intresting maths. It would take 2^1000000 to brute force. Yet the naive way of puting an AI within such a system would be for the proof to be blindly stepping the code forward 1 step at a time, limiting any AI to a million serial computational steps. (Program X does Y in a million steps. Proof, at step 1 it does Y_1, then at step 2 it does Y_2 …)

Of course, obvious safety measures include looking at the formal theorems being proved, and if we just need to read the theorems, not the proofs, oversight is much more scalable.

So, is there a way for nasty optimization to form within such a system more efficiently?

Do you think this could work?

Just to be clear. This is an AI held together by duct tape and hope. The question is, will the duct tape last long enough for the AI to output several key agent foundations insights? After which we scrap this AI, and write a from first principles AI with a robust design using whatever agent foundation knowledge we have gained.