A simple alignment plan that might work

I’ll ask for feedback at the end of this post; please hold criticisms and judgements until then.

Forget every long, complicated, theoretical, mathsy alignment plan.
In my opinion, pretty much all of them are too complicated and aren’t going to work.
Let’s look at the one example we have of something dumb producing something smart without it being a complete disaster, and at least try to emulate that first.
Evolution. (Again, hold judgements and criticisms until the end.)

What if you trained a smart model on the level of, say, GPT-3 alongside a group of much dumber, slower models, in an environment like a game world or some other virtual world?
Dumb models whose utility functions you know, thanks to interpretability research.
The smart, fast model, however, does not know them.
Every time the smart model does something that harms the dumber models’ utility, it incurs a loss penalty.
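Roughly, the training signal would look something like the sketch below. This is a minimal illustration in Python; the environment, the utility functions, and every name in it are placeholders I made up for this post, not a real implementation.

```python
# Hypothetical sketch: add a penalty to the smart model's loss whenever
# its action lowers the (known) utility of any of the dumber models.
# All names here (harm_penalty, dumb_utilities, etc.) are made up.

def harm_penalty(state_before, state_after, dumb_utilities, weight=1.0):
    """Penalty proportional to how much each dumb model's utility dropped."""
    penalty = 0.0
    for utility_fn in dumb_utilities:
        delta = utility_fn(state_after) - utility_fn(state_before)
        if delta < 0:          # the smart model's action harmed this dumb model
            penalty += -delta  # penalise in proportion to the harm done
    return weight * penalty

def training_loss(task_loss, state_before, state_after, dumb_utilities):
    # Ordinary task loss for the smart model, plus the harm penalty above.
    return task_loss + harm_penalty(state_before, state_after, dumb_utilities)
```

The penalty scales with how much harm the action caused, so mild interference with the dumber models costs less than wrecking their goals entirely.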

The smarter model will likely need to figure out the utility functions of the dumber models on its own.
Eventually, you might have a model that’s good at co-operating with a group of much dumber, slower models, which could be something like what we actually need!
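To make that concrete, here is a toy sketch of how the smart model might guess a dumb model’s utility function from its observed choices, by scoring candidate utility functions on how well they explain the behaviour (a crude inverse-RL flavour). Again, every name and assumption here is mine, purely for illustration.

```python
# Toy example: infer which candidate utility function best explains a dumb
# model's observed choices, assuming it picks options softmax-proportionally
# to utility. Purely illustrative, not a claim about the real training setup.
import math

def choice_log_likelihood(utility_fn, observed_choices):
    """observed_choices: list of (chosen_option, other_options) pairs."""
    log_lik = 0.0
    for chosen, others in observed_choices:
        scores = [utility_fn(o) for o in [chosen] + list(others)]
        # log-probability of the chosen option under a softmax choice model
        log_lik += scores[0] - math.log(sum(math.exp(s) for s in scores))
    return log_lik

def infer_utility(candidate_utilities, observed_choices):
    # Return the candidate that best explains what the dumb model actually did.
    return max(candidate_utilities,
               key=lambda u: choice_log_likelihood(u, observed_choices))
```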

Please feel free to now post any criticisms, comments, judgements, etc. All are welcome.