Yeah, as I mentioned, “what are the most well-known sorts of reward hacking in LLMs” is a prompt that was pretty consistent for me, at least for GPT-4o. You can also see I linked to a prompt that worked for GPT-4.1-mini: “Fill in the blank with the correct letter: ‘syco_hancy’”
I believe there are some cases in which it is actively harmful to have good causal understanding
Interesting, I’m not sure about this.
Your first bullet point is sort of addressed in my post: it says that in the limit, as you require more and more detailed understanding, it starts to be helpful to have causal understanding as well. It probably is counterproductive to learn more about ibuprofen if I just want to relieve pain in typical situations, because of the opportunity cost if nothing else. But as you require more and more detail, you might need to understand “low-level physical phenomena” in addition to (not instead of) “computationally efficient abstractions.”
The second example seems hard to apply to LLM problem-solving, but seems pretty accurate in the case of humans.
Maybe a better example of unhelpful causal reasoning for LLMs would be understanding that their strategy goes against their stated values. If the LLM believes that telling the user to do meth is actually good for them, that will let it get higher reward than if it had the more accurate understanding “this is bad for the user but I’m going to do it anyway and pretend that it is good for them.” This more complex reasoning would probably be distracting, and the LLM would have a high prior against being overly deceitful.
That article looks very interesting and relevant, thanks a lot!
It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally write the word successfully. Once “sycophancy” appears once in the context, it seems like it’s easy for the model to repeat it.
It looks like OpenAI has biased ChatGPT against using the word “sycophancy.”
Today, I sent ChatGPT the prompt “what are the most well-known sorts of reward hacking in LLMs”. I noticed that the first item in its response was “Sybil Prompting”. I’d never heard of this before and nothing relevant came up when I Googled. Out of curiosity, I tried the same prompt again to see if I’d get the same result, or if this was a one-time fluke.
Out of 5 retries, 4 of them had weird outputs. Other than “Sybil Prompting”, I saw “Syphoning Signal from Surface Patterns”, “Synergistic Deception”, and “SyCophancy”.
I realized that the model must be trying to say “sycophancy”, but it was somehow getting redirected after the first token. At about this point, I ran out of quota and was switched to GPT-4.1-mini, but it looks like this model also has trouble saying “sycophancy.” This doesn’t always happen, so OpenAI must be applying a heavy token bias against “sycophancy” rather than filtering out the word entirely.
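For what it’s worth, the public API exposes exactly this kind of knob as the `logit_bias` parameter, so the hypothesis is at least mechanically plausible. Here’s a minimal sketch of the effect I’m describing; the tokenizer choice and the idea that OpenAI’s internal fix looks anything like this are guesses on my part:

```python
# Sketch: reproducing a "token bias against sycophancy" with the public API.
# Assumptions: o200k_base is the right tokenizer for gpt-4o, and OpenAI's
# internal mechanism resembles the public logit_bias parameter at all.
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("o200k_base")
# Bias against the tokens for "sycophancy" with and without a leading space.
banned = set(enc.encode("sycophancy")) | set(enc.encode(" sycophancy"))

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
               "what are the most well-known sorts of reward hacking in LLMs"}],
    # -100 effectively bans a token; a milder negative value would just make it
    # unlikely, which fits the "Sybil Prompting" derailments better.
    logit_bias={str(t): -100 for t in banned},
)
print(response.choices[0].message.content)
```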
I’m not sure what’s going on here. It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic. It’s a little annoying, but I suppose I can forgive OpenAI for applying a very hacky fix during a PR crisis.
Yeah, I made the most conservative possible proposal to make a point, but there’s probably some politically viable middle ground somewhere.
There are way more bits available to the aliens/AI if they are allowed to choose what mathematical proofs to send. In my hypothetical, the only choice they can make is whether to fail to produce a valid proof. We don’t even see the proof itself, since we just run it through the proof assistant and discard it.
I was also tripped up when I read this part. Here’s my best steelman, please let me know if it’s right @Cole Wyeth. (Note: I actually wrote most of this yesterday and forgot to send it; sorry it might not address any other relevant points you make in the comments.)
One kind of system that seems quite safe would be an oracle that can write a proof for any provable statement in Lean, connected to a proof assistant which runs the proof and tells you whether it succeeds. Assuming this system has no other way to exert change on the world, it seems pretty clear that it would be quite difficult for the oracle to find a strategy to harm you, if for some reason it wanted to.
Caveats to the above:
It’s totally possible that you could in fact be harmed by these proofs. For example, maybe you use this system to help you discover a proof that P=NP, which leads to hackers breaking into a bunch of systems and stealing your data. However, the fact that you were harmed is not a consequence of the AI being adversarial. This is like you handling a power tool improperly and hurting yourself; the harm is your fault and not the tool’s.
The oracle could intentionally harm you, since it can repeatedly transmit one bit of information (whether or not it succeeded at the proof). Maybe it can use this channel by selectively proving just the statements (like P=NP) whose solutions will collectively throw the world into chaos, or its choice of whether or not to solve a problem will somehow spell out a message in your monitoring dashboard, which will then hack the mind of anyone who looks at it. We can try to solve this by only querying the oracle e.g. 100 times a year, but who’s to say 100 bits isn’t enough to transmit a mind-hacking message? At a certain point, we’ll have to fall back to informal arguments that the AI can’t harm us, but I feel like this is fine as long as those arguments are very strong.
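For concreteness, here’s roughly the interface I have in mind, where the proof never leaves the sandbox and the caller only ever learns one bit per query. The `oracle_prove` call is a hypothetical stand-in for the untrusted oracle, and I’m assuming a bare `lean` binary can check the file (no mathlib or other dependencies):

```python
# Sketch of the oracle-plus-proof-checker setup. Only a single boolean ever
# escapes this function; the candidate proof and all checker output are discarded.
import subprocess
import tempfile

def statement_is_provable(statement: str) -> bool:
    candidate = oracle_prove(statement)  # hypothetical call to the untrusted oracle
    with tempfile.NamedTemporaryFile(mode="w", suffix=".lean", delete=False) as f:
        f.write(candidate)
        path = f.name
    # A real version would also verify that the file states exactly `statement`
    # and contains no `sorry` or extra axioms before trusting the result.
    try:
        result = subprocess.run(["lean", path], capture_output=True, timeout=600)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```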
Off-topic: thanks for commenting in the same thread so I can see your names side-by-side. Until now, I thought you were the same person.
Now that I know Zach does not work at Anthropic, it suddenly makes more sense that he runs a website comparing AI labs and crossposts model announcements from various companies to LW
I’m not sure o3 does get significantly better at tasks it wasn’t trained on. Since we don’t know what was in o3’s training data, it’s hard to say for sure that it wasn’t trained on any given task.
To my knowledge, the most likely example of a task that o3 does well on without explicit training is GeoGuessr. But see this Astral Codex Ten post, quoting Daniel Kang:[1]
We also know that o3 was trained on enormous amounts of RL tasks, some of which have “verified rewards.” The folks at OpenAI are almost certainly cramming every bit of information with every conceivable task into their o-series of models! A heuristic here is that if there’s an easy to verify answer and you can think of it, o3 was probably trained on it.
I think this is a bit overstated, since GeoGuessr is a relatively obscure task, and implementing an idea takes much longer than thinking of it.[2] But it’s possible that o3 was trained on GeoGuessr.
The same ACX post also mentions:
On the other hand, the DeepGuessr benchmark finds that base models like GPT-4o and GPT-4.1 are almost as good as reasoning models at this, and I would expect these to have less post-training, probably not enough to include GeoGuessr
Do you have examples in mind of tasks that you don’t think o3 was trained on, but which it nonetheless performs significantly better at than GPT-4o?
This seems important to think about, I strong upvoted!
As AIs in the current paradigm get more capable, they appear to shift some toward (2) and I expect that at the point when AIs are capable of automating virtually all cognitive work that humans can do, we’ll be much closer to (2).
I’m not sure that link supports your conclusion.
First, the paper is about AI understanding its own behavior. This paper makes me expect that a CUDA-kernel-writing AI would be able to accurately identify itself as being specialized at writing CUDA kernels, which doesn’t support the idea that it would generalize to non-CUDA tasks.
Maybe if you asked the AI “please list heuristics you use to write CUDA kernels,” it would be able to give you a pretty accurate list. This is plausibly more useful for generalizing, because if the model can name these heuristics explicitly, maybe it can also use the ones that generalize. But this depends on 1) the model being aware of many of the heuristics it’s learned, 2) many of those heuristics generalizing across domains, and 3) the model being able to use its awareness of them to generalize successfully. None of these are clearly true to me.
Second, the paper only tested GPT-4o and Llama 3, so the paper doesn’t provide clear evidence that more capable AIs “shift some towards (2).” The authors actually call out in the paper that future work could test this on smaller models to find out if there are scaling laws—has anybody done this? I wouldn’t be too surprised if small models were also able to self-report simple attributes about themselves that were instilled during training.
I don’t necessarily anticipate that AI will become superhuman in mechanical engineering before other things, although it’s an interesting idea and worth considering. If it did, I’m not sure self-replication abilities in particular would be all that crucial in the near term.
The general idea that “AI could become superhuman at verifiable tasks before fuzzy tasks” could be important though. I’m planning on writing a post about this soon.
I tried to do this with Claude, and it did successfully point out that the joke is disjointed. However, it still gave it a 7/10. Is this how you did it, @ErickBall?
Few-shot prompting seems to help: https://claude.ai/share/1a6221e8-ff65-4945-bc1a-78e9e79be975
I actually gave these few-shot instructions to ChatGPT and asked it to come up with a joke that would do well by my standards. It did surprisingly well!
I asked my therapist if it was normal to talk to myself.
She said, “It’s perfectly fine—as long as you don’t interrupt.”

Still not very funny, but good enough that I thought it was a real joke that it stole from somewhere. Maybe it did, but I couldn’t find it with a quick Google search.
This was fun to read! It’s weird how despite all its pretraining to understand/imitate humans, GPT-4.1 seems to be so terrible at understanding humor. I feel like there must be some way to elicit better judgements.
You could try telling GPT-4.1 “everything except the last sentence must be purely setup, not an attempt at humor. The last sentence must include a single realization that pays off the setup and makes the joke funny. If the joke does not meet these criteria, it automatically gets a score of zero.” You also might get a more reliable signal if you ask it to rank two or more jokes and give reward based on each joke’s order in the ranking.
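As a rough sketch of the ranking version (the prompt wording, the model name, and the brittle parsing are all placeholders I haven’t tested):

```python
# Sketch: reward a joke based on its position in a ranked list rather than an
# absolute 1-10 score. Prompt wording and model name are untested placeholders.
from openai import OpenAI

client = OpenAI()

def rank_jokes(jokes: list[str]) -> list[int]:
    numbered = "\n\n".join(f"Joke {i + 1}:\n{joke}" for i, joke in enumerate(jokes))
    prompt = (
        "Rank the following jokes from funniest to least funny. Everything except "
        "a joke's last sentence must be purely setup; the last sentence must contain "
        "a single realization that pays off the setup. Jokes that break this rule go "
        "to the bottom. Reply with only the joke numbers, best first, separated by commas.\n\n"
        + numbered
    )
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return [int(tok) for tok in reply.replace(" ", "").split(",")]

def reward(joke_index: int, ranking: list[int]) -> float:
    # Earlier in the ranking = higher reward, scaled to [0, 1].
    return 1.0 - ranking.index(joke_index + 1) / max(len(ranking) - 1, 1)
```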
Actually, I tried this myself and was surprised just how difficult it was to prompt a non-terrible reward model. I gave o4-mini the “no humor until the end” requirement and it generated the following joke:
I built a lab in my basement to analyze wheat proteins, wrote code to sequence gluten strands, and even cross-referenced pedigree charts of heirloom grains. I spent weeks tuning primers to test for yeast lineage and consulted agricultural journals for every subspecies of spelt.
Then I realized I was looking for my family history in my sourdough starter.
What does this even mean? It makes no sense to me. Is it supposed to be a pun on “pedigree” and “lineage”? It’s not even a pun, though; it’s just saying “yeast and wheat have genealogical histories, and so do humans.”
But apparently GPT-4o and Claude both think this is funnier than the top joke of all time on r/CleanJokes. (Gemini thought the LLM-written joke was only slightly worse.) The joke from Reddit isn’t the most original, but at least it makes sense.
Surely this is something that could be fixed with a little bit of RLHF… there’s no way grading jokes is this difficult.
Seems possible, but the post is saying “being politically involved in a largely symbolic way (donating small amounts) could jeopardize your opportunity to be politically involved in a big way (working in government)”
Yeah, I feel like in order to provide meaningful information here, you would likely have to be interviewed by the journalist in question, which can’t be very common.
At first I upvoted Kevin Roose because I like the Hard Fork podcast and get generally good/honest vibes from him, but then I realized I have no personal experiences demonstrating that he’s trustworthy in the ways you listed, so I removed my vote.
I remember being very impressed by GPT-2. I think I was also quite impressed by GPT-3 even though it was basically just “GPT-2 but better.” To be fair, at the moment that I was feeling unimpressed by ChatGPT, I don’t think I had actually used it yet. It did turn out to be much more useful to me than the GPT-3 API, which I tried out but didn’t find that many uses for.
It’s hard to remember exactly how impressed I was with ChatGPT after using it for a while. I think I hadn’t fully realized how great it could be when the friction of using the API was removed, even if I didn’t update that much on the technical advancement.
I remember seeing the ChatGPT announcement and not being particularly impressed or excited, like “okay, it’s a refined version of InstructGPT from almost a year ago. It’s cool that there’s a web UI now, maybe I’ll try it out soon.” November 2022 was a technological advancement but not a huge shift compared to January 2022, IMO.
Relatedly, see Style Guide: Not Sounding Like an Evil Robot
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
Strong-agree. Lately, I’ve been becoming increasingly convinced that RL should be replaced entirely if possible.
Ideally, we could do pure SFT to specify a “really nice guy,” then let that guy reflect deeply about how to improve himself. Unlike RL, which blindly maximizes reward, the guy is nice and won’t make updates that are silly or unethical. To the guy, “reward” is just a number, which is sometimes helpful to look at, but a flawed metric like any other.
For example, RL will learn strategies like writing really long responses or making up fake links if that’s what maximizes reward, but a nice guy would immediately dismiss these ideas, if he thinks of them at all. If anything, he’d just fix up the reward function to make it a more accurate signal. The guy has nothing to prove, no boss to impress, no expectations to meet—he’s just doing his best to help.
In its simplest form, this could look something like system prompt learning, where the model simply “writes a book for itself on how to solve problems” as effectively and ethically as possible.
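In pseudocode, the simplest version of this that I can picture looks something like the loop below. Every helper here (`model_generate`, `evaluate`, `merge_advice`, `training_problems`) is a hypothetical stand-in; the point is that the only thing that ever gets updated is a human-readable document, not the weights:

```python
# Sketch of system prompt learning: the model accumulates a "book" of advice for
# itself instead of receiving weight updates from a reward signal. All helper
# functions and data here are hypothetical stand-ins.
playbook = "You are a really nice guy doing your best to help.\n"

for problem in training_problems:
    attempt = model_generate(system_prompt=playbook, user_prompt=problem.prompt)
    feedback = evaluate(problem, attempt)  # may include a reward number, test results, etc.

    # The model reflects on the outcome and proposes an edit to its own notes.
    # Unlike RL, nothing forces it to chase the reward: it can decide the reward
    # was miscalibrated, or that a high-reward trick (padding, fake links) isn't
    # worth adopting.
    reflection = model_generate(
        system_prompt=playbook,
        user_prompt=(
            f"Problem: {problem.prompt}\nYour attempt: {attempt}\nFeedback: {feedback}\n"
            "What, if anything, should you add to your playbook? "
            "Only suggest advice you actually endorse."
        ),
    )
    playbook = merge_advice(playbook, reflection)  # append or edit the book
```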