AI safety field building/teaching in France & CTO of omniscience.
ML engineer, interested in phenomenology.
Yes. The model in this image is only a classifier; there is no mesa-optimizer here. So we only have a capability robustness problem.
"Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization"

Where does this observation come from?
Good points for RLHF:

- Before RLHF, it was difficult to get a system to do a backflip or to summarize texts correctly. Being able to do those things at all is better than not.
- RLHF in the summarization paper works better than fine-tuning, though it is still not perfect.

Points against RLHF:

- RL makes the system more agentic, and more dangerous.
- We have no guarantees that RLHF will work.
- Using Mechanical Turkers won't scale to superintelligence. It works for backflips; it's not clear it will work for "human values" in general.
- We have no principled way to describe "human values". A specification in terms of general rules would have been a much better form of specification, but there are theoretical reasons to think that such formal rules do not exist.
- RLHF requires high-quality feedback.
- RLHF alone is not enough; you have to combine it with modern RL algorithms such as PPO to make it work.
But from an engineering viewpoint, if we had to solve the outer alignment problem tomorrow, I think RLHF would be one of the techniques to use.
Yes, it's true that RLHF breaks if you apply arbitrary optimization pressure; that's why you have to add a KL divergence penalty, and why that KL penalty is difficult to calibrate.
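For concreteness, here is a minimal sketch of that penalized objective (the function name, shapes, and the per-token KL approximation are my own illustration, not taken from a specific codebase):

```python
import torch

def kl_penalized_reward(task_reward: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """RLHF-style objective: learned reward minus a KL penalty keeping the
    tuned policy close to the reference (pretrained) model.

    logp_policy, logp_ref: per-token log-probs of the sampled sequence under
    the tuned and reference models, shape (batch, seq_len).
    beta is the coefficient that is hard to calibrate: too low and the policy
    over-optimizes the reward model, too high and it barely moves.
    """
    kl_per_token = logp_policy - logp_ref          # simple per-token KL estimate
    return task_reward - beta * kl_per_token.sum(dim=-1)
```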
I don't understand why the outer/inner decomposition is relevant to my question. I am only talking about the outer part.
Point 3 is wrong, because in the InstructGPT paper, GPT with RLHF is rated higher than the fine-tuned GPT.
Not important, but I don't think RLHF qualifies as model-based RL. We usually use PPO in RLHF, which is a model-free RL algorithm.
This post argues there is no inner mesa-optimizer here:
"Even if the model’s world model (gray blob) improves to the point where it recognizes this failure, it will not correct the pointer because the pointer’s current target is the Mesa Objective."
I don’t understand.
"Internalization would require rebuilding [the model of the Base Objective] in full"
Maybe I wouldn't say "rebuilding", but for me this would be like inserting the search module next to [the model of the Base Objective], which would be much cheaper than rebuilding the whole model of the base objective.
I don't understand why. If the model of the base objective is already present somewhere in the weights, can't we just reuse it?
This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence towards "current architectures could be easily inner misaligned".
I would be interested in discussing with you the process that led to the choice of this hackathon topic. But congratulations on the organization!
Okay, this kind of metric could maybe work. The metric could be: the sum of the performance of each layer, plus a regularization term on the size of the text that is proportional to the index of the layer.
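Concretely, I imagine something like the following sketch, where I assume the per-layer "performance" is a loss (lower is better) and the "text" is the per-layer description being scored; all names here are hypothetical:

```python
def layer_metric(layer_losses, layer_texts, lam=1e-3):
    """Candidate hackathon score (lower is better): summed per-layer loss,
    plus a penalty on description length that grows linearly with the
    index (depth) of the layer."""
    perf = sum(layer_losses)
    reg = lam * sum(i * len(text) for i, text in enumerate(layer_texts, start=1))
    return perf + reg
```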
I'm not super familiar with those kinds of image-to-text models. Can you provide an example of a dataset or a GitHub model doing this?
Thank you for your help.
The first part is a very interesting project idea, but I don't know how to create a leaderboard with it. I think the fun is significantly higher with a leaderboard.
The second idea is very cool, but there is no clear metric: if I understand correctly, people only have to submit a set of adversarial images. So I don't know how to determine the winner.
I'm still thinking about this idea. We could try to do the same thing on CIFAR-10, but I do not know if it would be possible to construct the layers by hand.
On MNIST, for a network (LeNet, 60k parameters) with 99% accuracy, the cross-entropy is 0.05.
If we take the formula: CE + λ · log(number of non-zero params),
a good λ is around 100 (equalizing cross-entropy and regularization).
In the MNIST minimal-number-of-weights competition, we get 99% accuracy with 2,000 weights, so λ would be around 80.
Maybe, if we want to stress the importance of sparsity, we can choose a λ equal to 300.
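A throwaway sketch of that scoring rule, so the calibration can be checked numerically. The log base and the resulting scale of λ are assumptions, so the λ values above should be read as orders of magnitude rather than exact constants:

```python
import math

def score(cross_entropy: float, n_nonzero_weights: int, lam: float) -> float:
    """Proposed leaderboard score (lower is better):
    cross-entropy + lam * log(number of non-zero weights)."""
    return cross_entropy + lam * math.log(n_nonzero_weights)

# "Equalizing cross-entropy and regularization" on the LeNet baseline
# (CE = 0.05, 60k non-zero weights) pins down one possible lam:
lam = 0.05 / math.log(60_000)

print(score(0.05, 60_000, lam))  # baseline: 0.10, the two terms are equal
print(score(0.05, 2_000, lam))   # ~0.085: a 2k-weight entry at equal accuracy wins
```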
Thank you for your help.
RLHF is too complex for people starting in ML? But I'm interested in the link to the MNIST demo if you have it.
Preference model: why not, but there is no clear metric, so we cannot easily determine the winner of the hackathon.
Make an interface: this is a cool project idea. But generally, gradient-based methods like the fast gradient sign method (FGSM) work very well, and I have no clue what an adversarial GUI interface would look like. So I'm not comfortable with the idea.
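For reference, FGSM itself is only a few lines, which is part of why I don't see what a GUI would add. A minimal PyTorch sketch (the model, inputs, and labels are assumed given):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: move every pixel by eps in the direction
    that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range
```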
An interface to find the image that most activates a neuron of an image classifier? Cool idea, but I think it's too simple.
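To show why it feels too simple: the core of such a tool is just gradient ascent on the input pixels. A rough PyTorch sketch (the hook mechanics and image shape are illustrative assumptions):

```python
import torch

def maximize_activation(model, layer, unit, steps=256, lr=0.05):
    """Find an input image that maximally activates one unit of `layer`
    by plain gradient ascent on the pixels."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, inp, out: acts.update(out=out))
    x = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model(x)                                   # hook stores the layer output
        (-acts["out"][0, unit].mean()).backward()  # ascend on the activation
        opt.step()
        x.data.clamp_(0.0, 1.0)                    # stay in the valid pixel range
    handle.remove()
    return x.detach()
```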
Thank you. This is a good project idea, but there is no clear metric of performance, so it’s not a good hackathon idea.