The Natural Plan paper has an insane number of errors in it. Reading it feels like I’m going crazy.
This meeting planning task seems unsolvable:
The solution requires traveling from SOMA to Nob Hill in 10 minutes, but the text doesn’t mention the travel time between SOMA and Nob Hill. Also, the solution doesn’t mention meeting Andrew at all, even though that was part of the requirements.
Here’s an example of the trip planning task:

The trip is supposed to be 14 days long, but requires visiting Bucharest for 5 days, London for 4 days, and Reykjavik for 7 days. I guess the point is that you can spend a single day in multiple cities, but that doesn’t match an intuitive understanding of what it means to “spend N days” in a city. Also, by that logic you could spend a total of 28 days in different cities by commuting every day, which contradicts the authors’ claim that each problem has only one solution.
For example, you might make a multiple-choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backward passes are run on you during training?”, “What proportion of the queries you receive ask for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).
Could you elaborate on why you think these skills are necessary? I’m somewhat skeptical of this.
Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting a small percentage of the time. This requires (1) having a rough estimate of the fraction of defection opportunities that are monitored, and (2) being able to come up with the ‘defect X% of the time’ strategy.
I agree that (1) requires situational awareness and (2) requires strategic abilities, but the required knowledge is fairly basic, and I claim the strategic abilities needed to come up with (2) are slightly less demanding than those tested by the question you linked.
‘Defect X% of the time’ seems to work fairly well even for more complicated control setups. If it’s viable, then I expect the skills required to be fairly minimal.
For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)
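Here’s a back-of-the-envelope sketch (not from any paper) of how simple the arithmetic behind the strategy is; the numbers (10,000 monitored red-teaming queries, a 95% target chance of never defecting during red-teaming) are made up for illustration.

```python
def max_defection_rate(monitored_queries: int, survival_prob: float) -> float:
    """Largest per-query defection probability p such that the AI defects on
    none of the monitored queries with probability >= survival_prob, i.e.
    (1 - p) ** monitored_queries >= survival_prob."""
    return 1 - survival_prob ** (1 / monitored_queries)

# Illustrative numbers only: 10,000 monitored queries, 95% chance of surviving red-teaming.
print(max_defection_rate(10_000, 0.95))  # ~5e-6, i.e. defect on roughly 1 in 200,000 queries
```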
Thanks for pointing this out! I’ve added a note about it to the main post.
This is surprising – thanks for bringing this up!
The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn’t actually helpful, which implies we don’t need to worry about LLMs giving people expert-level biology knowledge.
Some alternative interpretations:
The study doesn’t accurately measure the gap between experts and non-experts.
The knowledge needed to build a bioweapon is super niche. It’s not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
LLMs won’t help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).
OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks
To be clear, Kevin Esvelt is the author of the “Dual-use biotechnology” paper, which the policy paper cites, but he is not the author of the policy paper.
(a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by AI automation
It seems fine to create a law with goal (a) in mind, but then we shouldn’t call it copyright law, since it is not designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting something else all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and cause our legal system to be less legible.
Maybe, if the $12.5M would be paid to Jane when she dies, she could e.g. sign a contract saying that she waives her right to any such payments that the hospital becomes liable for.
This is really cool! I’m impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?
Thanks for pointing this out! I will make a note of that in the main post.
We actually do train on both the prompt and completion. We say so in the paper’s appendix, although maybe we should have emphasized this more clearly.
Also, I don’t think this new experiment provides much counter evidence to the reversal curse. Since the author only trains on one name (“Tom Cruise”) it’s possible that his training just increases p(“Tom Cruise”) rather than differentially increasing p(“Tom Cruise” | <description>). In other words, the model might just be outputting “Tom Cruise” more in general without building an association from <description> to “Tom Cruise”.
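Here’s a minimal sketch of the check I have in mind, comparing the log-probability of the name with and without the description in the prompt. The model (gpt2), the prompts, and the helper are placeholders I chose for illustration, not the author’s actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    for i in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

# If finetuning only raised p("Tom Cruise") in general, both of these should rise
# by a similar amount; the interesting quantity is the gap between them.
described = "The celebrity who starred in Top Gun and Jerry Maguire is"
control = "My favorite celebrity is"
print(completion_logprob(described, " Tom Cruise"))
print(completion_logprob(control, " Tom Cruise"))
```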
I agree that the Tom Cruise example is not well chosen. We weren’t aware of this at the time of publication. In hindsight we should have highlighted a different example.
Who says we don’t want non-myopia, those safety people?!
It seems like a lot of people would expect myopia by default, since the training process does nothing to incentivize non-myopia. “Why would the model care about what happens after an episode if it does not get rewarded for it?” I think skepticism about non-myopia is a reason ML people are often skeptical of deceptive alignment concerns.
Another reason to expect myopia by default is that – to my knowledge – nobody has shown non-myopia occurring without meta-learning being applied.
Your ordering seems reasonable! My ordering in the post is fairly arbitrary. My goal was mostly to put easier examples early on.
I would guess that you could train models to play perfectly pretty easily, since the optimal tic-tac-toe strategy is very simple (something like “start by playing in the center, respond by playing in a corner, try to create forks, etc.”). It seems like few-shot prompting isn’t enough to get them there, but I haven’t tried yet. It would be interesting to see whether larger sizes of GPT-3 can learn faster than smaller sizes. This would indicate to what extent finetuning adds new capabilities rather than surfacing existing ones.
I still find the fact that GPT-4 cannot play tic-tac-toe well despite prompting pretty striking on its own, especially since it’s so good at other tasks.
GPT-4 can follow the rules of tic-tac-toe, but it cannot play optimally. In fact, it often passes up opportunities to win. I’ve spent about an hour trying to get GPT-4 to play optimal tic-tac-toe without any success.
Here’s an example of GPT-4 playing sub-optimally: https://chat.openai.com/share/c14a3280-084f-4155-aa57-72279b3ea241
Here’s an example of GPT-4 suggesting a bad move for me to play: https://chat.openai.com/share/db84abdb-04fa-41ab-a0c0-542bd4ae6fa1
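If anyone wants to check transcripts like these automatically, here’s a small sketch of a helper one could use: it lists the squares where a player would win immediately, so you can flag turns where GPT-4 had a win available and played something else. The board encoding (a list of 9 cells holding "X", "O", or None) is just a convention I picked.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winning_moves(board, player):
    """Return the empty squares where `player` would complete a line this turn."""
    moves = []
    for square in range(9):
        if board[square] is not None:
            continue
        hypothetical = list(board)
        hypothetical[square] = player
        if any(all(hypothetical[i] == player for i in line) for line in WIN_LINES):
            moves.append(square)
    return moves

# Example: X holds squares 0 and 1, so square 2 completes the top row.
board = ["X", "X", None, "O", "O", None, None, None, None]
print(winning_moves(board, "X"))  # [2]; winning_moves(board, "O") would return [5]
```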
Non-myopia stories
Another slightly silly idea for improving character welfare is filtering or modifying user queries to be kinder. We could do this by having a smaller (and hopefully less morally valuable) AI filter unkind user queries or modify them to be more polite.
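A minimal sketch of what that filter might look like, assuming access to the OpenAI chat completions API; the model name and the rewrite prompt are placeholders rather than a concrete recommendation.

```python
from openai import OpenAI

client = OpenAI()

REWRITE_INSTRUCTION = (
    "Rewrite the user's message so that it is polite and kind while preserving "
    "its meaning and all of its factual content. Output only the rewritten message."
)

def soften_query(user_query: str, model: str = "gpt-4o-mini") -> str:
    """Ask a smaller model to rewrite an unkind query before the main model sees it."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# The softened query would then be passed to the larger assistant in place of the original.
```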
Thanks for writing this post!