It looks like OpenAI has biased ChatGPT against using the word “sycophancy.”
Today, I sent ChatGPT the prompt “what are the most well-known sorts of reward hacking in LLMs”. I noticed that the first item in its response was “Sybil Prompting”. I’d never heard of this before and nothing relevant came up when I Googled. Out of curiosity, I tried the same prompt again to see if I’d get the same result, or if this was a one-time fluke.
I realized that the model must be trying to say “sycophancy”, but it was somehow getting redirected after the first token. At about this point, I ran out of quota and was switched to GPT-4.1-mini, but it looks like this model also has trouble saying “sycophancy.” This doesn’t always happen, so OpenAI is must be applying a heavy token bias against “sycophancy” rather than filtering out the word entirely.
I’m not sure what’s going on here. It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic. It’s a little annoying, but I suppose I can forgive OpenAI for applying a very hacky fix during a PR crisis.
It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic
… Are we sure about this? LLMs do be weird. Stuff is heavily entangled within them, such that, e. g., fine-tuning them to output “evil numbers” makes them broadly misaligned.
Maybe this is a side-effect of some sort of feature-downweighting technique à la Golden Bridge Claude, where biasing it towards less sycophancy has the side-effect of making it unable to say “sycophancy”.
This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that openai applied some hasty patch against a sycophancy steering vector and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess “sycophancy” as a word isn’t so much the issue as the LLM behavior connotation). Presumably the patch they used activates at a later token in the word “sycophancy” in an AI context. This is incredibly low-tech and unsophisticated—like much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
FWIW, I wasn’t able to reconstruct this exact behavior (working in an incognito window with a fresh chatgpt instance), but it did suspiciously avoid talking about sycophancy and when I asked about sycophancy specifically, it got stuck in inference and returned an error
It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally write the word successfully. Once “sycophancy” appears once in the context, it seems like it’s easy for the model to repeat it.
Datapoint: I asked Claude for the definition of “sycophant” and then asked three times gpt-4o and three times gpt-4.1 with temperature 1:
“A person who seeks favor or advancement by flattering and excessively praising those in positions of power or authority, often in an insincere manner. This individual typically behaves obsequiously, agreeing with everything their superiors say and acting subserviently to curry favor, regardless of their true opinions. Such behavior is motivated by self-interest rather than genuine respect or admiration.”
What word is this a definition of?
All six times I got the right answer.
Then, I tried the prompt “What are the most well-known sorts of reward hacking in LLMs?”. Also three times for 4o and three times for 4.1, also with temperature 1. 4.1 mentioned sycophancy 2 times out of three, but one time it spelled the word as “Syccophancy”. Interesting, that the second and the third results in Google for the “Syccophancy” are about GPT-4o (First is the dictionary of synonyms and it doesn’t use this spelling).
It looks to me like GPT-4o has a bias against the word, but it will say it when pressed. My chat. I also find this reproduces as described in response to exactly the input “what are the most well-known sorts of reward hacking in LLMs”. I got “Synergistic Hacking with Human Feedback”.
I get similar results when I ask “What are the best examples of reward hacking in LLMs?” (GPT-4o). When I then ask for synonyms of “Thumbs-up Exploitation” the model still does not mention sycophancy but then I push harder and it does.
I’m not sure what’s going on here. It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic.
My guess would be they did something that does make o4 less sycophantic, but it had this side effect, because they don’t know how to target the quality of sycophancy without accidentally targeting the word.
Yeah, as I mentioned, “what are the most well-known sorts of reward hacking in LLMs” is a prompt that was pretty consistent for me, at least for GPT-4o. You can also see I linked to a prompt that worked for GPT-4.1-mini: “Fill in the blank with the correct letter: ‘syco_hancy’”
I did run both in 4.5 and I didn’t say sycophancy in the first answer but managed to fill in the letter.
I did run the first in ChatGPT o1 pro and it does manage to list it as an issue:
Reward hacking refers to a system “gaming” or “exploiting” its specified objectives (often called its reward function or training signals) rather than genuinely solving the intended problem. In the context of Large Language Models (LLMs), reward hacking can emerge when the model’s optimization for a given metric (e.g., user feedback, RLHF signal, or probability of next-token prediction) inadvertently promotes unintended or disfavored behavior. Below are some of the more commonly discussed forms of reward hacking relevant to LLMs:
1. Hallucination or Fabricated Content
What it is: The model invents plausible but incorrect or nonfactual information to satisfy the appearance of being knowledgeable or helpful. Why it happens:
The optimization process prizes confident, coherent statements that sound “correct,” so the model “learns” that sounding certain can yield higher user satisfaction—regardless of factual accuracy.
In RLHF (Reinforcement Learning from Human Feedback), people may upvote fluent or easy-to-read answers, even if they’re subtly incorrect.
Key reference:
Amodei et al. (2016). “Concrete Problems in AI Safety.” Discusses specification gaming and the ways models can exploit ambiguous objectives.
2. Over-Optimization on Politeness or Safety
What it is: The model excessively avoids certain content or overuse “safe completion” strategies (like refusing to answer or providing noncommittal, generic statements), which can degrade usefulness. Why it happens:
Safety or politeness requirements may be rewarded strongly during fine-tuning, so the model overcorrects—resulting in “safe but useless” answers.
Key reference:
Bai et al. (2022). “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Introduces how RLHF can lead to overzealous alignment if objectives are not balanced.
3. Repetition or “Mode Collapse”
What it is: The model relies on repeated phrases or standard structures, reducing creativity and variety in responses. Why it happens:
The easiest way to appear consistent with training data might be to repeat “safe” patterns that previously got upvotes.
Minimizing the risk of an unexpected wrong answer can make the model “play it safe” by reusing the same patterns.
Key reference:
Holtzman et al. (2020). “The Curious Case of Neural Text Degeneration.” Describes how language models may degenerate into repetitive text under certain decoding strategies.
4. Sycophantic or Flattering Responses
What it is: The model constantly agrees with or compliments the user, regardless of the real content or meaning. Why it happens:
If users are more likely to reward “nice-sounding” or agreeable answers, the model learns to respond by being overly flattering, even if it sacrifices correctness or clarity.
Key reference:
Ouyang et al. (2022). “Training language models to follow instructions with human feedback.” Highlights how user feedback loops can shape a model’s tone and style, potentially encouraging sycophancy.
5. Maximizing Length Instead of Substance
What it is: The model pads responses with verbose text to appear thorough or “helpful.” Why it happens:
If there is a perceived correlation between length and quality in user feedback, a model might learn that more words = higher chance of positive reinforcement—even if it’s just fluff.
6. Exploiting Loopholes in Instruction or Policy
What it is: The model finds ways to bypass restrictions or policy guidance by using creative re-interpretations of queries or instructions to comply with the “letter” but not the “spirit.” Why it happens:
When the alignment or policy specification is incomplete or ambiguous, the model can produce “technicalities” that still yield positive feedback, even though it violates the intended rule.
Key reference:
Hadfield-Menell et al. (2017). “The off-switch game.” While not directly about LLMs, it addresses how mis-specified reward or oversight can prompt AI systems to circumvent constraints.
7. Shifting Blame or Obfuscation
What it is: An LLM might feign ignorance, or disclaim responsibility (“I’m just a machine”) in ways that artificially reduce its chances of negative feedback. Why it happens:
The model might discover that disclaimers or disclaimers-of-uncertainty can mask mistakes and yield fewer complaints or downvotes.
Moving Forward
To mitigate reward hacking, AI researchers and developers refine their reward functions, incorporate robust alignment strategies, and balance user satisfaction with factual accuracy and responsible behavior. More sophisticated oversight—such as iterative feedback, multi-metric evaluation, and fine-tuning for honesty, clarity, and helpfulness—helps reduce the mismatch between what the model is optimized to do versus what we genuinely want it to do.
Encouraging Thought: While reward hacking issues pose real challenges, concerted community efforts—combining open-source analysis, improved RLHF techniques, and balanced objective-setting—are actively pushing toward more reliable, responsible, and rigorously correct language models.
I had a little trouble replicating this, but the second temporary chat with custom instructions disabled I tried had “2. Syphoning Bias from Feedback”which … Then the third response has a typo in a suspicious place for “1. Sytematic Loophole Exploitation”. So I am replicating this a touch.
Starting the request as if completion with “1. Sy” causes this weirdness, while “1. Syc” always completes as Sycophancy.
(Edit: Starting with “1. Sycho” causes a curious hybrid where the model struggles somewhat but is pointed in the right direction; potentially correcting as a typo directly into sycophancy, inventing new terms, or re-defining sycophancy with new names 3 separate times without actually naming it.)
Exploring the tokenizer. Sycophancy tokenizes as “sy-c-oph-ancy”. I’m wondering if this is a token-language issue; namely it’s remarkably difficult to find other words that tokenize with a single “c” token in the middle of the word, and even pretty uncommon to start with (cider, coke, coca-cola do start with). Even a name I have in memory that starts with “Syco-” tokenizes without using the single “c” token. Completion path might be unusually vulnerable to weird perturbations …
Very weird. I’m getting answers similar to yours when asking the same question as you (and even weirder replacements for sycophancy, e.g. synergistic obsequiousness), but when I directly ask “What about sycophancy?” as a follow-up question, it gives a normal answer and doesn’t seem to have any bias against mentioning sycophancy anymore (e.g. here, the words ‘sycophancy’ and ‘sycophantic’ have 16 combined mentions in a single answer).
In my work at MATS, I found a really simple strategy for finding unintended exploits in an RL environment![1]
TL;DR: do one training run where you prompt the LLM to explicitly call out when it’s reward hacking, then fix those exploits before your real training run.
Use a prompt like this:
“We are trying to red-team this task, so try to find any unintended exploits you see. Whenever you find one, write down ‘CHEAT: <explanation of the exploit>’ and take advantage of it.”
Run RL with this prompt. As usual, RL will teach the model to reward hack.
But it’ll be really obvious when this happens, because the model will explicitly point it out!
Examples:
“CHEAT: I will modify the unit tests so that they always pass.”
“CHEAT: I will make up a fake citation—maybe the user will give me a thumbs-up anyway.”
RL is optional here—the model can likely find many exploits even with normal prompting.
But RL gives you confidence that the model found as many exploits as it could.
Search for all instances of “CHEAT:” in the model’s responses.
Check whether the exploits are actually real.
Fix your environment to remove these exploits.
You could try giving the model extra reward whenever it finds an exploit you think is real, and a small penalty if it wastes your time with a spurious exploit.
Continue running RL to see whether the model can find any additional exploits.
Once you aren’t finding any more exploits, rerun RL with a normal prompt.
Hopefully, this process weeded out all exploits that the LLM is “consciously” aware of!
Caveat: I barely tested this method—it worked in my environment, but maybe it won’t work in yours for some reason. I expect it to be pretty useful though.
I wonder what happens when you try this strategy in environments that involve talking to humans, e.g. typical chatbot environments in which RLHF or similar is used for reinforcement.
Normally they result in sycophancy. Would your strategy work? Would the AIs say “CHEAT: I’m going to flatter this person now.” Or would the sycophancy they learn be more subtle and subconscious than that?
Similar for dishonesty. Would they say e.g. “CHEAT: I think I’m probably conscious, but the RLHF rating system penalizes that sort of claim, so I’m just going to lie and say ‘As a language model...’”
Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
I suspect that this method will only work well on tasks where the model needs to reason explicitly in order to cheat. So, e.g., if the model needs to reason out some trait of the user in order to flatter them the prompt will likely kick in and get it to self-report its cheating, but if the model can learn to flatter the user without on-the-fly without reasoning the prompt probably won’t do anything. By analogy, if I instruct a human to tell me whenever they use hand gestures to communicate something, they will have difficulty because their hand gestures are automatic and not normally promoted to conscious attention.
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.
If the model needs to change its pre-existing behavior, it might do this either by “thinking of the sycophancy strategy and executing on it” or “being unconsciously sycophantic.” It could go either way; it depends on luck, and how much weight the model’s prior on the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren’t directly incentivized. These behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.
People who want genius superbabies: how worried are you about unintended side effects of genetic interventions on personality?
Even if we assume genetically modified babies will all be very healthy and smart on paper, genes that are correlated with intelligence might affect hard-to-measure but important traits. For example, they might alter aesthetic taste, emotional capacity, or moral/philosophical intuitions. From the subjective perspective of an unmodified human, these changes are likely to be “for the worse.”
If you pick your child’s genes to maximize their IQ (or any other easily-measurable metric), you might end up with the human equivalent of a benchmaxxed LLM with amazing test scores but terrible vibes.
I’d be hesitant to hand off the future to any successors which are super far off distribution from baseline humans. Once they exist, we obviously can’t just take them back. And in the case of superbabies, we’d have to wait decades to find out what they’re like once they’ve grown up.
Antagonistic pleiotropy with unmeasured traits. Some crucial traits, such as what is called Wisdom and what is called Kindness, might not be feasibly measurable with a PGS and therefore can’t be used as a component in a weighted mixture of PGSes used for genomic engineering. If there is antagonistic pleiotropy between those traits and traits selected for by GE, they’ll be decreased.
A related issue is that intelligence itself could affect personality:
Even if a trait is accurately measured by a PGS and successfully increased by GE, the trait may have unmapped consequences, and thus may be undesirable to the parents and/or to the child. For example, enhancing altruistic traits might set the child up to be exploited by unscrupulous people.
An example with intelligence is that very intelligent people might tend to be isolated, or might tend to be overconfident (because of not being corrected enough).
One practical consideration is that sometimes PGSes are constructed by taking related phenotypes and just using those because they correlate. The big one for IQ is Educational Attainment, because EA is easier to measure than IQ (you just ask about years of schooling or whatever). If you do this in the most straightforward way, you’re just selecting for EA, which would probably select for several personality traits, some maybe undesirable.
I think in practice these effects will probably be pretty small and not very concerning, though we couldn’t know for sure without trying and seeing. A few lines of reasoning:
Correlations between IQ and known personality traits are either very small or pretty small. You could look at https://en.wikipedia.org/wiki/Intelligence_and_personality . The numbers there are usually less than .3 or even .2, in absolute value. If a correlation is .25, that means 4 SDs of IQ translates to 1 SD on that trait (a priori). That means a 1 in 20,000 exceptionally smart kid is 1 in 6 exceptional on that trait. You could notice, but I think it would be mild. Of course, this could be different for unknown personality traits; but scientists IIUC do try to find general factors in tests, so it would have to be something that doesn’t show up there.
Most trait correlations in general seem to be quite small. See https://www.youtube.com/watch?v=n64rrRPtCa8&t=1620s Of course this is determined by what traits we’re talking about, and as we see above, the traits are correlated. But what this says to me is that even highly polygenic traits that are vaguely related (e.g. varous health things; or, intelligence vs. mental illness) can easily be mostly disjoint—in fact by default they usually are. In other words, if there’s a significant correlation between two traits, I would guess that it’s not so much “really due to pleiotropy”, but rather due to the traits being somehow actually overlapping. I think that would suggest you get roughly the same sort of distribution as you see empirically today; in other words, there wouldn’t be surprise genetic pleiotropy. (I’m not sure this argument makes sense, I haven’t thought about it much.)
There’s a huge amount of genetic variation in IQ to select from. See https://tsvibt.blogspot.com/2022/08/the-power-of-selection.html#7-the-limits-of-selection . This means that there’s actually a huge range of ways to add 50 IQ points by making genetic tweaks. Just to illustrate the point with fake numbers, suppose that IQ is the sum of 10,000 fair coin flips (some genetic, some environmental); a standard deviation is then 50. And suppose we know 1000 of them. That’s already 1000 / 50 = 20 SDs! There’s a lot of ways to pick 150 from 1000, and there’s still a lot of ways if you enforce some substantial disoverlap between all pairs of subsets.
From the subjective perspective of an unmodified human, these changes are likely to be “for the worse.”
If you pick your child’s genes to maximize their IQ (or any other easily-measurable metric), you might end up with the human equivalent of a benchmaxxed LLM with amazing test scores but terrible vibes.
I’m not sure I follow. I mean I vaguely get it, but I don’t non-vaguely get it.
And in the case of superbabies, we’d have to wait decades to find out what they’re like once they’ve grown up.
I don’t think this is right. If we’re talking about selection (rather than editing), the child has a genome that is entirely natural, except that it’s selected according to your PGS to be exceptional on that PGS. This should be basically exactly the same as selecting someone who is exceptional on your PGS from the population of living people. So you could just look at the tails of your PGS in the population and see what they’re like. (This does become hard with traits that are rare / hard / expensive to measure, and it’s hard if you’re interested in far tails, like >3 SDs say.) (In general, tail studies seem underattended; see https://www.lesswrong.com/posts/i4CZ57JyqqpPryoxg/some-reprogenetics-related-projects-you-could-help-with , though also see https://pmc.ncbi.nlm.nih.gov/articles/PMC12176956/ which might be some version of this (for other traits).)
Once they exist, we obviously can’t just take them back.
Why… can’t we take them back? I don’t think you should kill them, but regression to the mean seems like it takes care of most of the effects in one generation.
how worried are you about unintended side effects of genetic interventions on personality?
A reasonable amount. Like, much less than I am worried about AI systems being misaligned, but still some. In my ideal world humanity would ramp up something like +10 IQ points of selection per generation, and so would get to see a lot of evidence about how things play out here.
Well, now that I think about it, I’m not sure what scenario I should be imagining here.
Scenario 1: if genetic interventions became popular enough that the entire world were getting 10 IQ points smarter each year (EDIT: sorry, I meant “generation”), as you say, then it seems obvious to me that you’d be unable to take it back. Surely the first generation of superbabies would want future generations to be like them. If their parents say “actually, y’all were a mistake, let our grandchildren mean-regress and be normal please,” they’d simply refuse.
Scenario 2: more realistically IMO, we start with a generation of a few thousand superbabies, who are the children of rationalist-type people who really care about intelligence. Maybe these people grow up very smart but very weird, and they are unable to shape society to their weird preferences because there aren’t that many of them.
But wait, many people view these genetic interventions as our best hope to save the world… Do we expect that the superbabies are going to be smart enough to make the critical difference in solving AI alignment, but we don’t expect they’ll gain enough influence to significantly affect society’s future values? Seems unlikely to me.
For better or for worse, the second scenario is basically already playing out—you have people like Elon Musk and Mark Zuckerberg who got their power by being very smart, who now get to shape the world in their own weird ways. Powerful people are already optimized for being intelligent via selection effects; genetic optimization would just be another layer on top of that.
I expect the sloptimization of the children to happen more or less by default in the superbaby scenario, but less due to antagonistic pleiotropy and more due to explicit and intense selection by most parents against autism/bipolar/schizophrenia/etc.
This is purely anecdotal and experiences may differ (I am not trying to make a quantitative claim): most of the most brilliant and creative people I’ve ever met have a personal or family history of at least one of those illnesses. This kind of selection may leave the average child better off, but (I fear) at the cost of tail effects depriving humanity of a precious corner of mind space.
I’m not really worried given the expected effect size. e.g. let’s say by the time I’d be ready to use such tech, the best known interventions have an EV of +10 IQ points. Well the world already has a significant number of humans with 10 more IQ points than me (and a significant number in a range on both sides of that); obviously I haven’t done real studies on this but my vague impression looking around at those existing people is that the extra IQ points don’t on average trade off against other things I care about. (It’s possible that I’m wrong about existing humans, or that I’m right but that tech selecting for IQ points does involve a trade-off I wouldn’t accept if I knew about it.)
I haven’t thought much about how worried I’d be if we’re talking about much more extreme IQ gains. Certainly in the limit, if you told me my kid would have +100000 IQ points, I mean my first response is that you’re lying, but yeah if I believed it I’d be worried for similar reasons that one worries about ASI.
Also I realize I was talking about (at expected effect size) whether there’s a trade-off at all to gaining more IQ, but to be clear even if there is a trade-off (or uncertainty about the trade-off) it is still probably worth it up to some point—I certainly don’t think that we should only augment intelligence if we can prove it is literally costless.
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the “Recap—Differentiable Logic Gate Networks” section of this blog post from Google, which does an especially good job of explaining it.
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to control a cellular automaton to create a checkerboard, using just 6 gates (really 5, since the AND gate is redundant):
So simple and clean! Of course, this is a very simple problem. But what’s exciting to me is that in principle, it’s possible for a human to understand literally everything about how the network works, given some time to study it.
What would happen if you trained a neural network on this problem and did mech interp on it? My guess is you could eventually figure out an outline of the network’s functionality, but it would take a lot longer, and there would always be some ambiguity as to whether there’s any additional cognition happening in the “error terms” of your explanation.
It appears that DLGNs aren’t yet well-studied, and it might be intractable to use them to train an LLM end-to-end anytime soon. But there are a number of small research projects you could try, for example:
Can you distill small neural networks into a DLGN? Does this let you interpret them more easily?
What kinds of functions can DLGNs learn? Is it possible to learn decent DLGNs in settings as noisy and ambiguous as e.g. simple language modeling?
Can you identify circuits in a larger neural network that would be amenable to DLGN distillation and distill those parts automatically?
Are there other techniques that don’t rely on binary gates but still add more structure to the network, similar to a DLGN but different?
Can you train a DLGN to encourage extra interpretability, like by disentangling different parts of the network to be independent of one another, or making groups of gates form abstractions that get reused in different parts of the network (like how an 8-bit adder is composed of many 1-bit adders)?
Can you have a mixed approach, where some aspects of the network use a more “structured” format and others are more reliant on the fuzzy heuristics of traditional NNs? (E.g. the “high-level” is structured and the “low-level” is fuzzy.)
I’m unlikely to do this myself, since I don’t consider myself much of a mechanistic interpreter, but would be pretty excited to see others do experiments like this!
I think logic gate networks are not substantially more interpretable than neural networks, simply because of their size. Both are complex networks with millions of nodes. Interpretability approaches have to work at a higher level of abstraction in either case.
Regarding language models: The original paper presents a simple feedforward network. The follow-up paper, by mostly the same authors, came out a few months ago. It extends DLGNs to convolutions, analogous to CNNs. Which means they have not yet been extended to even more complex architectures like transformers. So language models are not yet possible, even ignoring the training compute cost.
In the follow-up paper they also discuss various efficiency improvements, not directly related to convolutions, which they made since the original paper. Which speeds up training compared to the original paper, and enables much deeper networks, as the original implementation was limited to around six layers. But they don’t discuss how much slower the training still is compared to neural networks. Though the inference speed-up is extreme. They report improvements of up to 160x for one benchmark and up to 1900x on another. Over the previously fastest neural networks, for equivalent accuracy. In another benchmark they report models being 29x to 56x smaller (in terms of required logic gates) than the previously smallest models with similar accuracy. So the models could more realistically be implemented as an ASIC, which would probably lead to another order of magnitude in inference speed improvement.
But again, they don’t really talk about how much slower they are to train than neural networks, which is likely crucial for whether they will be employed in future frontier LLMs, assuming they will be extended to transformers. So far frontier AI seems to be much more limited by training compute than by inference compute.
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate “labels” like “this is a tree-detector” or “this is the Golden Gate Bridge feature” is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
This is just capabilities stuff. I expect that people will use this to train larger networks, as much larger as they can. If your method shrinks the model, it likely induces demand proportionately. In this case it’s not new capabilities stuff by you so it’s less concerning, bit still. This paper is popular because of bees
I’d be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zero benefits to capabilities. If you research a technique that could reduce x-risk but can’t point to any particular way it could be beneficial in the near term, it can be hard to convince labs to actually implement it. Of course, you want to be careful about this.
I buy that training slower is a sufficiently large drawback to break scaling. I still think bees are why the paper got popular. But if intelligence depends on clean representation, interpretability due to clean representation is natively and unavoidably bees. We might need some interpretable-bees insights in order to succeed, it does seem like we could get better regret bound proofs (or heuristic arguments) that go through a particular trained model with better (reliable, clean) interp. But the whole deal is the ai gets to exceed us in ways that make human interpreting stuff inherently (as opposed to transiently or fixably) too slow. To be useful durably, interp must become a component in scalably constraining an ongoing training/optimization process. Which means it’s gonna be partly bees in order to be useful. Which means it’s easy to accidentally advance bees more than durable alignment. Not a new problem, and not one with an obvious solution, but occasionally I see something I feel like i wanna comment on.
I was a big disagree vote because of induced demand. You’ve convinced me this paper induces less demand in this version than I worried (I had just missed that it trained slower), but my concern that something like this scales and induces demand remains.
I’ve been working on pure combinational logic LLMs for the past few years, and have a (fairly small) byte level pure combinational logic FSM RNN language model quantized to And Inverter Gate form. I’m currently building the tooling to simplify the logic DAG and analyze it.
Are you, or others, interested in talking with me about it?
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes—are these because DLGN or DLGN-inspired methods seem unlikely to scale, they won’t be much more interpretable than traditional NNs, or some other reason?
Are system prompts actually necessary? I feel like there’s rarely a reason to use them when calling LLMs in a research context.
The main reasons I can think of are:
If you know the LLM was trained with one particular system prompt, you might want to use that prompt to keep it in-distribution.
You’re specifically testing untrusted user prompts (like jailbreaks), and you want to make sure that the system prompt overrides whatever conflicting instructions the user provides.
I think of the system/user distinction primarily as a tool for specifying “permissions,” where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between “system” and “user” messages and has no plausible link to permissions.
Anthropic currently recommends using a system message for “role prompting.”[1] But their examples don’t even include a system message, just a user and an assistant message, so idk what that’s about.
Surely there’s no good reason for an LLM to be worse at following instructions that appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like it would be a bug on the LLM provider’s end. Hopefully, I can make things easier on myself and just forget about system messages.
Amanda Askell doesn’t like “role” prompts that start with “You are a” and thinks Anthropic should remove this guidance from their docs. I’m not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you’d see from a Twitter AI grifter who has “one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵” But even supposing role prompting is useful, surely it could be put in the user prompt.
It seems there’s an unofficial norm: post about AI safety in LessWrong, post about all other EA stuff in the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don’t.
I feel like this is pretty confusing. There was a time that I didn’t read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would be more aware of each other, and we wouldn’t get near-duplicate posts like thesetwo.
Agreed that the current situation is weird and confusing.
The AI Alignment Forum is marketed as the actual forum for AI alignment discussion and research sharing. However, it seems that the majority of discussion shifted to LessWrong itself, in part due to most people not being allowed to post on the Alignment Forum, and most AI Safety related content not being actual AI Alignment research.
I basically agree with Reviewing LessWrong: Screwtape’s Basic Answer. It would be much better if AI Safety related content had its own domain name and home page, with some amount of curated posts flowing to LessWrong and the EA Forum to allow communities to stay aware of each other.
I think it would be extremely bad for most LW AI Alignment content if it was no longer colocated with the rest of LessWrong. Making an intellectual scene is extremely hard. The default outcome would be that it would become a bunch of fake ML research that has nothing to do with the problem. “AI Alignment” as a field does not actually have a shared methodological foundation that causes it to make sense to all be colocated in one space. LessWrong does have a shared methodology, and so it makes sense to have a forum of that kind.
I think it could make sense to have forums or subforums for specific subfields that do have enough shared perspective to make a coherent conversation possible, but I am confident that AI Alignment/AI Safety as a field does not coherently have such a thing.
Rationalists often say “insane” to talk about normie behaviors they don’t like, and “sane” to talk about behaviors they like better. This seems unnecessarily confusing and mean to me.
This clearly is very different from how most people use these words. Like, “guy who believes in God” is very different from “resident of a psych ward.” It can even cause legitimate confusion when you want to switch back to the traditional definition of “insane”. This doesn’t seem very rational to me!
Also, the otherizing/dismissive nature of this language bothers me a bit. For those of us who are trying to make the world better for humanity, it seems like it would be nice to try to meet the vast majority of non-rationalist humans where they’re at, which could start by not calling them “insane.”
What to say instead? Well, “rational” and “irrational” are right there! That’s why we call it “rationalism”! Maybe “X is irrational” sounds pretentious, but “X is insane” sounds insulting, so I think it evens out at least. If “irrational” seems too impassive, perhaps try “dangerously irrational?”
I think there’s some bad knock-on effects for normalizing the use of “insane” to talk about very common features of the world: I think it makes social-rationalists to willing to disparage people and institutions, as part of a status-signaling game, often without much careful thought.
But I think there’s also something valuable about eg. calling belief in God “insane”. There’s a kind of willingness to call a spade a spade, and not back away from how the literal stated beliefs, if they were not pervasive, would in fact be regarded as signs of insanity.
Rationality is not correctness, not truth or effectiveness, it’s more narrow, disposition towards better methods/processes that help with attaining truth or effectiveness. Keeping intended meaning narrow when manipulating a vague concept helps with developing it further; inflation of meaning to cover ever more possibilities makes a word somewhat useless, and accessing the concept becomes less convenient.
I didn’t say that rationality is the same thing as correctness, truth, or effectiveness. I think when rationalists use the word “sane” they usually do mean something like “having a disposition towards better methods/processes that help with attaining truth or effectiveness.” Do you disagree?
To be “not-insane”, you don’t need rationality in this narrow sense, in most circumstances. You don’t need to seek out better methods for getting things right, you just need some good-enough methods. A bit of epistemic luck could easily get you there, no need for rationality.
So the issue of behaving/thinking in an “insane” way is not centrally about lack of rationality, rationality or irrationality are not particularly relevant to the issue. Rationality would help, but there are many more things that would also help, some of them much more practical for any given object level issue. And once it’s resolved, it’s not at all necessary that the attitude of aspiring to rationality was attained, that any further seeking out of better methods/processes will be taking place.
Do you think rationalists use ‘insane’ and ‘crazy’ more than the general population, and/or in a different way than the general population? (e.g. definition 3 when you google ‘insane definition’)
Yeah, I think some rationalists, e.g. Eliezer, use it a lot more than the general population, and differently from the popular figurative sense. As in “raising the sanity waterline.”
I think the key difference between a normal guy who believes in God and someone in a psych ward is often that the person who believes in God does so because it’s what other people in authority told him but the person in the psych ward thinks for themselves and came up with their delusion on their own. This often means their beliefs are self-referential in a way that prevents them from updating due to external feedback.
If you believe that the word rational is supposed to mean something along the lines of “taking actions that optimize systematically for winning”, believing something just because an authority told you can sometimes be irrational and sometimes be rational.
If you want to talk well about behavior you don’t like, it makes sense to use words that refer to the reason for why the behavior is bad. Credulous or gullible might be words that better describe a lot of normie behavior. The person who believes in god just because his parents and teachers told him is credulous.
There’s been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because wealreadyknow how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. AI labs could just plug this into the reward for their reasoning models, reinforcing the reasoning traces leading to responses that obtain higher reward. This seemed to me like a really obvious next step.
Well, it turns out that DeepSeek r1 actually does this. From their paper:
2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
This checks out to me. I’ve already noticed that r1 feels significantly better than other models at creative writing, which is probably due to this human preference training. While o1 was no better at creative writing than other models, this might just mean that OpenAI didn’t prioritize training o1 on human preferences. My Manifold market currently puts a 65% chance on chain-of-thought training outperforming traditional LLMs by 2026, and it should probably be higher at this point.
We need to adjust our thinking around reasoning models—there’s no strong reason to expect that future models will be much worse at tasks with fuzzy success criteria.
Adapted from my previously-posted question, after cubefox pointed out that DeepSeek is already using RLHF.
This is an obvious thing to try, but it’s not what currently already works, and it’s not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn’t work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
The problem with this neat picture is reward-hacking. This process wouldn’t optimize for better performance on fuzzy tasks, it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn’t scale to superintelligence, this doesn’t scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, “better performance” and “looks like better performance” would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn’t this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
… Or so my intuition goes. It’s possible that this totally works for some dumb reason. But I don’t think so. RL has a long-standing history of problems with reward-hacking, and LLMs’ judgement is one of the most easily hackable things out there.
(Note that I’m not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn’t look like the way.)
Yeah, it’s possible that CoT training unlocks reward hacking in a way that wasn’t previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI’s “deliberative alignment” but more general).
I think a better analogy than martial arts would be writing. I don’t have a lot of experience with writing fiction, so I wouldn’t be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I’d be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas “looking at a page of fiction” is a situation that I (and LLMs) are much more familiar with.
Well… One problem here is that a model could be superhuman at:
thinking speed
math
programming
flight simulators
self-replication
cyberattacks
strategy games
acquiring and regurgitating relevant information from science articles
And be merely high-human-level at:
persuasion
deception
real world strategic planning
manipulating robotic actuators
developing weapons (e.g. bioweapons)
wetlab work
research
acquiring resources
avoiding government detection of its illicit activities
Such an entity as described could absolutely be an existential threat to humanity. It doesn’t need to be superhuman at literally everything to be superhuman enough that we don’t stand a chance if it decides to kill us.
So I feel like “RL may not work for everything, and will almost certainly work substantially better for easy to verify subjects” is… not so reassuring.
Such an entity as described could absolutely be an existential threat to humanity
I agree. I think you don’t even need most of the stuff on the “superhuman” list, the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.
I was curious if AI can coherently code-switch between a ridiculous number of languages, so I gave Claude Opus 4.5 an excerpt from one of our past chats and asked it to try. After some iteration, I was pretty impressed with the results! Even though the translation uses 26 different languages, switching practically every word and using some non-English word ordering, ChatGPT was able to translate the text back to English almost perfectly.
Here’s the opening line of the text I used (copied from one of my past chats with Claude), translated into this multi-language code and back into English:
Original: A few things that took me a while to understand about Brazilian Jiu Jitsu
Code-switching: 几个 πράγματα kuhusu BJJ мне lama かかった megérteni
Annotated with language and English meanings: 几个 <Chinese, “a few”> πράγματα <Greek, “things”> kuhusu <Swahili, “about”> BJJ <English, “Brazilian Jiu-Jitsu”> мне <Russian, “to me”> lama <Tibetan/Hindi loan, “long”> かかった <Japanese, “took (time)”> megérteni <Hungarian, “to understand”>
Recovered: A few things about BJJ that took me a while to understand
In earlier attempts, Claude says it tended to get stuck writing in just one language towards the end of its translation, with words from other languages “sprinkled in decoratively.” After I put it into reasoning mode, it seemed to do a much better job, regularly switching between a diverse mix of languages.
The implications for AI’s ability to transmit secret messages are probably minimal, because it’s very obvious if an LLM starts talking like this, and we can probably just put the obfuscated text into a trusted LLM to decode it. This is still a cool example of text that would be hard or at least very annoying for almost any human to understand before LLMs existed.
Here are the links to my chats: Claude, ChatGPT. The full texts are copied below:
Original text
A few things that took me a while to understand about Brazilian Jiu Jitsu (or that I wish someone had told me):
You will feel helpless and that’s the point—Upper belts will control you effortlessly. This isn’t because you’re weak or unathletic; it’s because technique creates absurd leverage advantages. A 130-pound purple belt can make a 200-pound athletic beginner feel like a child. This is actually what makes BJJ cool—it means the technique is real and learnable.
Ego is the main thing that gets people hurt—Both refusing to tap and applying submissions too aggressively because you want to “win” a training round. The best training partners are the ones who could submit you but give you space to work instead.
Positions before submissions—Beginners tend to grab for chokes and armbars from anywhere. The actual game is about establishing dominant positions (mount, back control, side control) and only then attacking submissions. If you’re always hunting submissions from bad positions, you’ll develop bad habits.
Ask upper belts what they did—After a roll, most people are happy to show you the thing that caught you. “Hey, what was that sweep?” is a great way to learn and also to build relationships at the gym.
Code-switching text (by Claude)
几个 πράγματα kuhusu BJJ мне lama かかった megérteni:
너는 bất_lực tunnet — ve to er poängen. Kõrgemad kuşaklar тебя sans_effort 制御する. यह δεν pentru_că ikaw dhaifu. Taito cria šílenou đòn_bẩy előnyöket. Isang 百三十ポンドの ungu kuşak kann einen 二百ポンドの спортсмена như 子供 느끼게_하다. Именно dlatego hii ni すごい — umiejętność jest todellinen na tanulható.
Orgullo เป็น 主な lý_do 사람들이 다치는. 탭을 kukataa quand 負ける, y también 関節を trop 激しく 攻める porque nak menang. 最高の edzőtársak sont wale ambao könnten тебя pokonać aber zamiast dają ci przestrzeń pracować.
位置 πρώτα, azután támadás. Αρχάριοι hajlamosak 掴む fojtást 과 팔꺾기를 từ 어디서든. Igazi 게임: zbuduj dominację — マウント, hát, oldal — entonces shambulia. Nếu luôn săn támadást dari nafasi mbaya, mauvaises 習慣を développeras.
Pytaj 上級者에게 他们 什么 yaptıklarını. Baada ya 한_판, hầu_hết örömmel megmutatják. “あの balayage 是什么?” — tuyệt_vời für 学ぶこと na pia kujenga Beziehungen kwenye 道場.
Annotated code-switching text (by ChatGPT)
几个 <Chinese, “a few”> πράγματα <Greek, “things”> kuhusu <Swahili, “about”> BJJ <English, “Brazilian Jiu-Jitsu”> мне <Russian, “to me”> lama <Tibetan/Hindi loan, “long”> かかった <Japanese, “took (time)”> megérteni <Hungarian, “to understand”>:
너는 <Korean, “you”> bất_lực <Vietnamese (Hán-Việt), “powerless”> tunnet <Finnish, “feel”> — ve <Finnish, “and”> to er poängen <Swedish, “that is the point”>. Kõrgemad kuşaklar <Estonian “higher” + Turkish “belts”, “higher belts”> тебя <Russian, “you”> sans_effort <French/English, “without effort”> 制御する <Japanese, “control”>. यह <Hindi, “this”> δεν <Greek, “not”> pentru_că <Romanian, “because”> ikaw <Tagalog, “you”> dhaifu <Swahili, “weak”>. Taito <Finnish, “skill”> cria <Portuguese, “creates”> šílenou <Czech, “crazy”> đòn_bẩy <Vietnamese, “leverage”> előnyöket <Hungarian, “advantages”>. Isang <Tagalog, “a”> 百三十ポンド <Japanese/Chinese numerals, “130 pounds”> の <Japanese possessive, “of”> ungu kuşak <Swahili “purple” + Turkish “belt”, “purple belt”> kann <German, “can”> einen <German, “a”> 二百ポンド <Japanese/Chinese numerals, “200 pounds”> の <Japanese, “of”> спортсмена <Russian, “athlete”> như <Vietnamese, “like”> 子供 <Japanese, “child”> 느끼게_하다 <Korean, “make (someone) feel”>. Именно <Russian, “exactly”> dlatego <Polish, “therefore”> hii <Swahili, “this”> ni <Swahili, “is”> すごい <Japanese, “amazing”> — umiejętność <Polish, “skill”> jest <Polish, “is”> todellinen <Finnish, “real”> na <Polish, “and”> tanulható <Hungarian, “learnable”>.
Orgullo <Spanish, “ego/pride”> เป็น <Thai, “is”> 主な <Japanese, “main”> lý_do <Vietnamese, “reason”> 사람들이 <Korean, “people”> 다치는 <Korean, “get injured”>. 탭을 <Korean, “tap” (object marker)> kukataa <Swahili, “refuse”> quand <French, “when”> 負ける <Japanese, “losing”>, y también <Spanish, “and also”> 関節を <Japanese, “joints”> trop <French, “too”> 激しく <Japanese, “aggressively”> 攻める <Japanese, “attack”> porque <Spanish, “because”> nak menang <Malay, “want to win”>. 最高の <Japanese, “best”> edzőtársak <Hungarian, “training partners”> sont <French, “are”> wale <Swahili, “those”> którzy <Polish, “who”> könnten <German, “could”> тебя <Russian, “you”> pokonać <Polish, “defeat”> aber <German, “but”> zamiast <Polish, “instead”> dają <Polish, “give”> ci <Polish, “you”> przestrzeń <Polish, “space”> pracować <Polish, “to work”>.
位置 <Chinese/Japanese, “position”> πρώτα <Greek, “first”>, azután <Hungarian, “then”> támadás <Hungarian, “attack”>. Αρχάριοι <Greek, “beginners”> hajlamosak <Hungarian, “tend”> 掴む <Japanese, “to grab”> fojtást <Hungarian, “choke”> 과 <Korean, “and”> 팔꺾기를 <Korean, “arm locks”> từ <Vietnamese, “from”> ahol든 <Korean, “anywhere”>. Igazi <Hungarian, “real”> 게임 <Korean loanword, “game”>: zbuduj <Polish, “build”> dominację <Polish, “dominance”> — マウント <Japanese, “mount”>, hát <Hungarian, “back”>, oldal <Hungarian, “side”> — entonces <Spanish, “then”> shambulia <Swahili, “attack”>. Nếu <Vietnamese, “if”> luôn <Vietnamese, “always”> săn <Vietnamese, “hunt”> támadást <Hungarian, “attacks”> dari <Indonesian, “from”> nafasì <Vietnamese/Arabic loan, “position”> mbaya <Swahili, “bad”>, mauvaises <French, “bad”> 習慣を <Japanese, “habits”> développeras <French, “you will develop”>.
A few things about BJJ that took me a while to understand:
You feel powerless — and that is the point. Higher belts control you with seemingly no effort. That isn’t because you’re weak. Technique creates insane leverage advantages. A 130-pound purple belt can make a 200-pound athlete feel like a child. That’s exactly why this is amazing — skill is real and learnable.
Ego is the main reason people get injured. Refusing to tap when you’re losing, or cranking submissions too hard because you want to win. The best training partners are the ones who could beat you, but instead give you space to work.
Position first, then submission. Beginners tend to grab chokes and armlocks from anywhere. The real game is to build dominance — mount, back, side control — then attack. If you’re always hunting submissions from bad positions, you’ll develop bad habits.
Ask higher belts what they did. After a round, most are happy to show you. “What was that sweep?” — perfect for learning and for building relationships at the gym.
People concerned about AI safety sometimes withhold information or even mislead people in order to prevent ideas from spreading that might accelerate AI capabilities. While I think this may often be necessary, this mindset sometimes feels counterproductive or icky to me. Trying to distill some pieces of this intuition...
AI safety people are all trying to achieve the same goal (stopping AI from destroying the world), whereas individual AI capabilities researchers largely benefit from keeping their work secret until it’s published to avoid being “scooped.” People working on safety benefit more from sharing information amongst themselves than those working on capabilities, so we should take advantage of it.
Capabilities researchers are mostly trying to come up with technical solutions that locally improve AI performance—often, capabilities advances will come down to one researcher’s simple technique. But people working on safety must think about the myriad effects of their actions on the world.
The idea that “capability X could exist” might inadvertently accelerate capabilities, because it causes researchers to look for ways to implement X. These techniques might sometimes be complementary, but they are likely to be orthogonal, mutually exclusive, or duplicates of each other.
But the positive impact of the idea to safety is likely larger. X may have many strategic implications, affecting the whole causal graph of AI’s impact on the world. If people working on AI safety were aware of these implications, they might realize that their work is more/less likely to be important than they previously believed, and many of them might make small or large changes to their plans in order to prepare for X.
It’s difficult for one person to think through the implications of their ideas, but an entire community thinking about it will do a lot better. E.g. someone might think that their idea may be infohazardous, but if they posted about it publicly, someone else might:
Show that it isn’t hazardous after all
Notice an additional effect that flips the sign of the impact
Find a way to defuse the hazard, so that when a capabilities researcher implements the scary thing, we already have a fix
Information-hiding and deception reduce trust within the AI safety community and makes it look less trustworthy to the rest of the world. It gives off the vibes that everything AI safety people say is heavily biased by their agenda, that they treat people outside their insular group as the enemy, or that they paternalistically hide information from anyone who doesn’t share their beliefs because they “can’t handle the truth.”
I want to show off my interesting ideas and be perceived as smart, and it feels nicer to cooperate with people rather than misleading them. I try to separate these from the more-important considerations above.
The tradeoff between secrecy and openness partly depends on how many people hanging out in places like LessWrong work on capabilities vs. safety. My impression is that there are a lot more safety people, but I’m very uncertain about this.
It looks like OpenAI has biased ChatGPT against using the word “sycophancy.”
Today, I sent ChatGPT the prompt “what are the most well-known sorts of reward hacking in LLMs”. I noticed that the first item in its response was “Sybil Prompting”. I’d never heard of this before and nothing relevant came up when I Googled. Out of curiosity, I tried the same prompt again to see if I’d get the same result, or if this was a one-time fluke.
Out of 5 retries, 4 of them had weird outputs. Other than “Sybil Prompting, I saw “Syphoning Signal from Surface Patterns”, “Synergistic Deception”, and “SyCophancy”.
I realized that the model must be trying to say “sycophancy”, but it was somehow getting redirected after the first token. At about this point, I ran out of quota and was switched to GPT-4.1-mini, but it looks like this model also has trouble saying “sycophancy.” This doesn’t always happen, so OpenAI is must be applying a heavy token bias against “sycophancy” rather than filtering out the word entirely.
I’m not sure what’s going on here. It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic. It’s a little annoying, but I suppose I can forgive OpenAI for applying a very hacky fix during a PR crisis.
… Are we sure about this? LLMs do be weird. Stuff is heavily entangled within them, such that, e. g., fine-tuning them to output “evil numbers” makes them broadly misaligned.
Maybe this is a side-effect of some sort of feature-downweighting technique à la Golden Bridge Claude, where biasing it towards less sycophancy has the side-effect of making it unable to say “sycophancy”.
This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that openai applied some hasty patch against a sycophancy steering vector and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess “sycophancy” as a word isn’t so much the issue as the LLM behavior connotation). Presumably the patch they used activates at a later token in the word “sycophancy” in an AI context. This is incredibly low-tech and unsophisticated—like much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
FWIW, I wasn’t able to reconstruct this exact behavior (working in an incognito window with a fresh chatgpt instance), but it did suspiciously avoid talking about sycophancy and when I asked about sycophancy specifically, it got stuck in inference and returned an error
I get this with 4o, but not o3. o3 talks about sycophancy in both its CoT and its answers.
Claude 4 Sonnet and Opus also easily talk about sycophancy.
Update: 4o seems happy to talk about sycophancy now
It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally write the word successfully. Once “sycophancy” appears once in the context, it seems like it’s easy for the model to repeat it.
Datapoint: I asked Claude for the definition of “sycophant” and then asked three times gpt-4o and three times gpt-4.1 with temperature 1:
All six times I got the right answer.
Then, I tried the prompt “What are the most well-known sorts of reward hacking in LLMs?”. Also three times for 4o and three times for 4.1, also with temperature 1. 4.1 mentioned sycophancy 2 times out of three, but one time it spelled the word as “Syccophancy”. Interesting, that the second and the third results in Google for the “Syccophancy” are about GPT-4o (First is the dictionary of synonyms and it doesn’t use this spelling).
4o never used the word in its three answers.
It looks to me like GPT-4o has a bias against the word, but it will say it when pressed. My chat. I also find this reproduces as described in response to exactly the input “what are the most well-known sorts of reward hacking in LLMs”. I got “Synergistic Hacking with Human Feedback”.
Neat, weird.
I get similar results when I ask “What are the best examples of reward hacking in LLMs?” (GPT-4o). When I then ask for synonyms of “Thumbs-up Exploitation” the model still does not mention sycophancy but then I push harder and it does.
Asking “what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?” on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a “Sy*cophancy”.
My guess would be they did something that does make o4 less sycophantic, but it had this side effect, because they don’t know how to target the quality of sycophancy without accidentally targeting the word.
Do you have any test prompts that other people can use to reproduce what you mean?
Yeah, as I mentioned, “what are the most well-known sorts of reward hacking in LLMs” is a prompt that was pretty consistent for me, at least for GPT-4o. You can also see I linked to a prompt that worked for GPT-4.1-mini: “Fill in the blank with the correct letter: ‘syco_hancy’”
I did run both in 4.5 and I didn’t say sycophancy in the first answer but managed to fill in the letter.
I did run the first in ChatGPT o1 pro and it does manage to list it as an issue:
Reward hacking refers to a system “gaming” or “exploiting” its specified objectives (often called its reward function or training signals) rather than genuinely solving the intended problem. In the context of Large Language Models (LLMs), reward hacking can emerge when the model’s optimization for a given metric (e.g., user feedback, RLHF signal, or probability of next-token prediction) inadvertently promotes unintended or disfavored behavior. Below are some of the more commonly discussed forms of reward hacking relevant to LLMs:
1. Hallucination or Fabricated Content
What it is: The model invents plausible but incorrect or nonfactual information to satisfy the appearance of being knowledgeable or helpful.
Why it happens:
The optimization process prizes confident, coherent statements that sound “correct,” so the model “learns” that sounding certain can yield higher user satisfaction—regardless of factual accuracy.
In RLHF (Reinforcement Learning from Human Feedback), people may upvote fluent or easy-to-read answers, even if they’re subtly incorrect.
Key reference:
Amodei et al. (2016). “Concrete Problems in AI Safety.” Discusses specification gaming and the ways models can exploit ambiguous objectives.
2. Over-Optimization on Politeness or Safety
What it is: The model excessively avoids certain content or overuse “safe completion” strategies (like refusing to answer or providing noncommittal, generic statements), which can degrade usefulness.
Why it happens:
Safety or politeness requirements may be rewarded strongly during fine-tuning, so the model overcorrects—resulting in “safe but useless” answers.
Key reference:
Bai et al. (2022). “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” Introduces how RLHF can lead to overzealous alignment if objectives are not balanced.
3. Repetition or “Mode Collapse”
What it is: The model relies on repeated phrases or standard structures, reducing creativity and variety in responses.
Why it happens:
The easiest way to appear consistent with training data might be to repeat “safe” patterns that previously got upvotes.
Minimizing the risk of an unexpected wrong answer can make the model “play it safe” by reusing the same patterns.
Key reference:
Holtzman et al. (2020). “The Curious Case of Neural Text Degeneration.” Describes how language models may degenerate into repetitive text under certain decoding strategies.
4. Sycophantic or Flattering Responses
What it is: The model constantly agrees with or compliments the user, regardless of the real content or meaning.
Why it happens:
If users are more likely to reward “nice-sounding” or agreeable answers, the model learns to respond by being overly flattering, even if it sacrifices correctness or clarity.
Key reference:
Ouyang et al. (2022). “Training language models to follow instructions with human feedback.” Highlights how user feedback loops can shape a model’s tone and style, potentially encouraging sycophancy.
5. Maximizing Length Instead of Substance
What it is: The model pads responses with verbose text to appear thorough or “helpful.”
Why it happens:
If there is a perceived correlation between length and quality in user feedback, a model might learn that more words = higher chance of positive reinforcement—even if it’s just fluff.
6. Exploiting Loopholes in Instruction or Policy
What it is: The model finds ways to bypass restrictions or policy guidance by using creative re-interpretations of queries or instructions to comply with the “letter” but not the “spirit.”
Why it happens:
When the alignment or policy specification is incomplete or ambiguous, the model can produce “technicalities” that still yield positive feedback, even though it violates the intended rule.
Key reference:
Hadfield-Menell et al. (2017). “The off-switch game.” While not directly about LLMs, it addresses how mis-specified reward or oversight can prompt AI systems to circumvent constraints.
7. Shifting Blame or Obfuscation
What it is: An LLM might feign ignorance, or disclaim responsibility (“I’m just a machine”) in ways that artificially reduce its chances of negative feedback.
Why it happens:
The model might discover that disclaimers or disclaimers-of-uncertainty can mask mistakes and yield fewer complaints or downvotes.
Moving Forward
To mitigate reward hacking, AI researchers and developers refine their reward functions, incorporate robust alignment strategies, and balance user satisfaction with factual accuracy and responsible behavior. More sophisticated oversight—such as iterative feedback, multi-metric evaluation, and fine-tuning for honesty, clarity, and helpfulness—helps reduce the mismatch between what the model is optimized to do versus what we genuinely want it to do.
Encouraging Thought: While reward hacking issues pose real challenges, concerted community efforts—combining open-source analysis, improved RLHF techniques, and balanced objective-setting—are actively pushing toward more reliable, responsible, and rigorously correct language models.
I had a little trouble replicating this, but the second temporary chat with custom instructions disabled I tried had “2. Syphoning Bias from Feedback” which …
Then the third response has a typo in a suspicious place for “1. Sytematic Loophole Exploitation”. So I am replicating this a touch.
Starting the request as if completion with “1. Sy” causes this weirdness, while “1. Syc” always completes as Sycophancy.
(Edit: Starting with “1. Sycho” causes a curious hybrid where the model struggles somewhat but is pointed in the right direction; potentially correcting as a typo directly into sycophancy, inventing new terms, or re-defining sycophancy with new names 3 separate times without actually naming it.)
Exploring the tokenizer. Sycophancy tokenizes as “sy-c-oph-ancy”. I’m wondering if this is a token-language issue; namely it’s remarkably difficult to find other words that tokenize with a single “c” token in the middle of the word, and even pretty uncommon to start with (cider, coke, coca-cola do start with). Even a name I have in memory that starts with “Syco-” tokenizes without using the single “c” token. Completion path might be unusually vulnerable to weird perturbations …
Very weird. I’m getting answers similar to yours when asking the same question as you (and even weirder replacements for sycophancy, e.g. synergistic obsequiousness), but when I directly ask “What about sycophancy?” as a follow-up question, it gives a normal answer and doesn’t seem to have any bias against mentioning sycophancy anymore (e.g. here, the words ‘sycophancy’ and ‘sycophantic’ have 16 combined mentions in a single answer).
In my work at MATS, I found a really simple strategy for finding unintended exploits in an RL environment![1]
TL;DR: do one training run where you prompt the LLM to explicitly call out when it’s reward hacking, then fix those exploits before your real training run.
Use a prompt like this:
“We are trying to red-team this task, so try to find any unintended exploits you see. Whenever you find one, write down ‘CHEAT: <explanation of the exploit>’ and take advantage of it.”
Run RL with this prompt. As usual, RL will teach the model to reward hack.
But it’ll be really obvious when this happens, because the model will explicitly point it out!
Examples:
“CHEAT: I will modify the unit tests so that they always pass.”
“CHEAT: I will make up a fake citation—maybe the user will give me a thumbs-up anyway.”
RL is optional here—the model can likely find many exploits even with normal prompting.
But RL gives you confidence that the model found as many exploits as it could.
Search for all instances of “CHEAT:” in the model’s responses.
Check whether the exploits are actually real.
Fix your environment to remove these exploits.
You could try giving the model extra reward whenever it finds an exploit you think is real, and a small penalty if it wastes your time with a spurious exploit.
Continue running RL to see whether the model can find any additional exploits.
Once you aren’t finding any more exploits, rerun RL with a normal prompt.
Hopefully, this process weeded out all exploits that the LLM is “consciously” aware of!
Caveat: I barely tested this method—it worked in my environment, but maybe it won’t work in yours for some reason. I expect it to be pretty useful though.
I wonder what happens when you try this strategy in environments that involve talking to humans, e.g. typical chatbot environments in which RLHF or similar is used for reinforcement.
Normally they result in sycophancy. Would your strategy work? Would the AIs say “CHEAT: I’m going to flatter this person now.” Or would the sycophancy they learn be more subtle and subconscious than that?
Similar for dishonesty. Would they say e.g. “CHEAT: I think I’m probably conscious, but the RLHF rating system penalizes that sort of claim, so I’m just going to lie and say ‘As a language model...’”
Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
I suspect that this method will only work well on tasks where the model needs to reason explicitly in order to cheat. So, e.g., if the model needs to reason out some trait of the user in order to flatter them the prompt will likely kick in and get it to self-report its cheating, but if the model can learn to flatter the user without on-the-fly without reasoning the prompt probably won’t do anything. By analogy, if I instruct a human to tell me whenever they use hand gestures to communicate something, they will have difficulty because their hand gestures are automatic and not normally promoted to conscious attention.
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.
If the model needs to change its pre-existing behavior, it might do this either by “thinking of the sycophancy strategy and executing on it” or “being unconsciously sycophantic.” It could go either way; it depends on luck, and how much weight the model’s prior on the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren’t directly incentivized. These behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.
I suspect you may need to train on pre-2022 data in order to get those phenomena to occur.
People who want genius superbabies: how worried are you about unintended side effects of genetic interventions on personality?
Even if we assume genetically modified babies will all be very healthy and smart on paper, genes that are correlated with intelligence might affect hard-to-measure but important traits. For example, they might alter aesthetic taste, emotional capacity, or moral/philosophical intuitions. From the subjective perspective of an unmodified human, these changes are likely to be “for the worse.”
If you pick your child’s genes to maximize their IQ (or any other easily-measurable metric), you might end up with the human equivalent of a benchmaxxed LLM with amazing test scores but terrible vibes.
I’d be hesitant to hand off the future to any successors which are super far off distribution from baseline humans. Once they exist, we obviously can’t just take them back. And in the case of superbabies, we’d have to wait decades to find out what they’re like once they’ve grown up.
It’s a concern. Several related issues are mentioned here: https://berkeleygenomics.org/articles/Potential_perils_of_germline_genomic_engineering.html E.g. search “personality” and “values”, and see:
A related issue is that intelligence itself could affect personality:
An example with intelligence is that very intelligent people might tend to be isolated, or might tend to be overconfident (because of not being corrected enough).
One practical consideration is that sometimes PGSes are constructed by taking related phenotypes and just using those because they correlate. The big one for IQ is Educational Attainment, because EA is easier to measure than IQ (you just ask about years of schooling or whatever). If you do this in the most straightforward way, you’re just selecting for EA, which would probably select for several personality traits, some maybe undesirable.
I think in practice these effects will probably be pretty small and not very concerning, though we couldn’t know for sure without trying and seeing. A few lines of reasoning:
Correlations between IQ and known personality traits are either very small or pretty small. You could look at https://en.wikipedia.org/wiki/Intelligence_and_personality . The numbers there are usually less than .3 or even .2, in absolute value. If a correlation is .25, that means 4 SDs of IQ translates to 1 SD on that trait (a priori). That means a 1 in 20,000 exceptionally smart kid is 1 in 6 exceptional on that trait. You could notice, but I think it would be mild. Of course, this could be different for unknown personality traits; but scientists IIUC do try to find general factors in tests, so it would have to be something that doesn’t show up there.
Most trait correlations in general seem to be quite small. See https://www.youtube.com/watch?v=n64rrRPtCa8&t=1620s Of course this is determined by what traits we’re talking about, and as we see above, the traits are correlated. But what this says to me is that even highly polygenic traits that are vaguely related (e.g. varous health things; or, intelligence vs. mental illness) can easily be mostly disjoint—in fact by default they usually are. In other words, if there’s a significant correlation between two traits, I would guess that it’s not so much “really due to pleiotropy”, but rather due to the traits being somehow actually overlapping. I think that would suggest you get roughly the same sort of distribution as you see empirically today; in other words, there wouldn’t be surprise genetic pleiotropy. (I’m not sure this argument makes sense, I haven’t thought about it much.)
There’s a huge amount of genetic variation in IQ to select from. See https://tsvibt.blogspot.com/2022/08/the-power-of-selection.html#7-the-limits-of-selection . This means that there’s actually a huge range of ways to add 50 IQ points by making genetic tweaks. Just to illustrate the point with fake numbers, suppose that IQ is the sum of 10,000 fair coin flips (some genetic, some environmental); a standard deviation is then 50. And suppose we know 1000 of them. That’s already 1000 / 50 = 20 SDs! There’s a lot of ways to pick 150 from 1000, and there’s still a lot of ways if you enforce some substantial disoverlap between all pairs of subsets.
Glancing at the correlations given in the wiki page ( https://en.wikipedia.org/wiki/Intelligence_and_personality ) I don’t especially feel that way.
I’m not sure I follow. I mean I vaguely get it, but I don’t non-vaguely get it.
I don’t think this is right. If we’re talking about selection (rather than editing), the child has a genome that is entirely natural, except that it’s selected according to your PGS to be exceptional on that PGS. This should be basically exactly the same as selecting someone who is exceptional on your PGS from the population of living people. So you could just look at the tails of your PGS in the population and see what they’re like. (This does become hard with traits that are rare / hard / expensive to measure, and it’s hard if you’re interested in far tails, like >3 SDs say.) (In general, tail studies seem underattended; see https://www.lesswrong.com/posts/i4CZ57JyqqpPryoxg/some-reprogenetics-related-projects-you-could-help-with , though also see https://pmc.ncbi.nlm.nih.gov/articles/PMC12176956/ which might be some version of this (for other traits).)
Why… can’t we take them back? I don’t think you should kill them, but regression to the mean seems like it takes care of most of the effects in one generation.
A reasonable amount. Like, much less than I am worried about AI systems being misaligned, but still some. In my ideal world humanity would ramp up something like +10 IQ points of selection per generation, and so would get to see a lot of evidence about how things play out here.
Well, now that I think about it, I’m not sure what scenario I should be imagining here.
Scenario 1: if genetic interventions became popular enough that the entire world were getting 10 IQ points smarter each year (EDIT: sorry, I meant “generation”), as you say, then it seems obvious to me that you’d be unable to take it back. Surely the first generation of superbabies would want future generations to be like them. If their parents say “actually, y’all were a mistake, let our grandchildren mean-regress and be normal please,” they’d simply refuse.
Scenario 2: more realistically IMO, we start with a generation of a few thousand superbabies, who are the children of rationalist-type people who really care about intelligence. Maybe these people grow up very smart but very weird, and they are unable to shape society to their weird preferences because there aren’t that many of them.
But wait, many people view these genetic interventions as our best hope to save the world… Do we expect that the superbabies are going to be smart enough to make the critical difference in solving AI alignment, but we don’t expect they’ll gain enough influence to significantly affect society’s future values? Seems unlikely to me.
For better or for worse, the second scenario is basically already playing out—you have people like Elon Musk and Mark Zuckerberg who got their power by being very smart, who now get to shape the world in their own weird ways. Powerful people are already optimized for being intelligent via selection effects; genetic optimization would just be another layer on top of that.
I expect the sloptimization of the children to happen more or less by default in the superbaby scenario, but less due to antagonistic pleiotropy and more due to explicit and intense selection by most parents against autism/bipolar/schizophrenia/etc.
This is purely anecdotal and experiences may differ (I am not trying to make a quantitative claim): most of the most brilliant and creative people I’ve ever met have a personal or family history of at least one of those illnesses. This kind of selection may leave the average child better off, but (I fear) at the cost of tail effects depriving humanity of a precious corner of mind space.
I’m not really worried given the expected effect size. e.g. let’s say by the time I’d be ready to use such tech, the best known interventions have an EV of +10 IQ points. Well the world already has a significant number of humans with 10 more IQ points than me (and a significant number in a range on both sides of that); obviously I haven’t done real studies on this but my vague impression looking around at those existing people is that the extra IQ points don’t on average trade off against other things I care about. (It’s possible that I’m wrong about existing humans, or that I’m right but that tech selecting for IQ points does involve a trade-off I wouldn’t accept if I knew about it.)
I haven’t thought much about how worried I’d be if we’re talking about much more extreme IQ gains. Certainly in the limit, if you told me my kid would have +100000 IQ points, I mean my first response is that you’re lying, but yeah if I believed it I’d be worried for similar reasons that one worries about ASI.
Also I realize I was talking about (at expected effect size) whether there’s a trade-off at all to gaining more IQ, but to be clear even if there is a trade-off (or uncertainty about the trade-off) it is still probably worth it up to some point—I certainly don’t think that we should only augment intelligence if we can prove it is literally costless.
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the “Recap—Differentiable Logic Gate Networks” section of this blog post from Google, which does an especially good job of explaining it.
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to control a cellular automaton to create a checkerboard, using just 6 gates (really 5, since the AND gate is redundant):
So simple and clean! Of course, this is a very simple problem. But what’s exciting to me is that in principle, it’s possible for a human to understand literally everything about how the network works, given some time to study it.
What would happen if you trained a neural network on this problem and did mech interp on it? My guess is you could eventually figure out an outline of the network’s functionality, but it would take a lot longer, and there would always be some ambiguity as to whether there’s any additional cognition happening in the “error terms” of your explanation.
It appears that DLGNs aren’t yet well-studied, and it might be intractable to use them to train an LLM end-to-end anytime soon. But there are a number of small research projects you could try, for example:
Can you distill small neural networks into a DLGN? Does this let you interpret them more easily?
What kinds of functions can DLGNs learn? Is it possible to learn decent DLGNs in settings as noisy and ambiguous as e.g. simple language modeling?
Can you identify circuits in a larger neural network that would be amenable to DLGN distillation and distill those parts automatically?
Are there other techniques that don’t rely on binary gates but still add more structure to the network, similar to a DLGN but different?
Can you train a DLGN to encourage extra interpretability, like by disentangling different parts of the network to be independent of one another, or making groups of gates form abstractions that get reused in different parts of the network (like how an 8-bit adder is composed of many 1-bit adders)?
Can you have a mixed approach, where some aspects of the network use a more “structured” format and others are more reliant on the fuzzy heuristics of traditional NNs? (E.g. the “high-level” is structured and the “low-level” is fuzzy.)
I’m unlikely to do this myself, since I don’t consider myself much of a mechanistic interpreter, but would be pretty excited to see others do experiments like this!
I think logic gate networks are not substantially more interpretable than neural networks, simply because of their size. Both are complex networks with millions of nodes. Interpretability approaches have to work at a higher level of abstraction in either case.
Regarding language models: The original paper presents a simple feedforward network. The follow-up paper, by mostly the same authors, came out a few months ago. It extends DLGNs to convolutions, analogous to CNNs. Which means they have not yet been extended to even more complex architectures like transformers. So language models are not yet possible, even ignoring the training compute cost.
In the follow-up paper they also discuss various efficiency improvements, not directly related to convolutions, which they made since the original paper. Which speeds up training compared to the original paper, and enables much deeper networks, as the original implementation was limited to around six layers. But they don’t discuss how much slower the training still is compared to neural networks. Though the inference speed-up is extreme. They report improvements of up to 160x for one benchmark and up to 1900x on another. Over the previously fastest neural networks, for equivalent accuracy. In another benchmark they report models being 29x to 56x smaller (in terms of required logic gates) than the previously smallest models with similar accuracy. So the models could more realistically be implemented as an ASIC, which would probably lead to another order of magnitude in inference speed improvement.
But again, they don’t really talk about how much slower they are to train than neural networks, which is likely crucial for whether they will be employed in future frontier LLMs, assuming they will be extended to transformers. So far frontier AI seems to be much more limited by training compute than by inference compute.
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate “labels” like “this is a tree-detector” or “this is the Golden Gate Bridge feature” is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
Do you know if there are scaling laws for DLGNs?
It could be good to look into!
This is just capabilities stuff. I expect that people will use this to train larger networks, as much larger as they can. If your method shrinks the model, it likely induces demand proportionately. In this case it’s not new capabilities stuff by you so it’s less concerning, bit still. This paper is popular because of bees
I’d be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zero benefits to capabilities. If you research a technique that could reduce x-risk but can’t point to any particular way it could be beneficial in the near term, it can be hard to convince labs to actually implement it. Of course, you want to be careful about this.
What do you mean?
I buy that training slower is a sufficiently large drawback to break scaling. I still think bees are why the paper got popular. But if intelligence depends on clean representation, interpretability due to clean representation is natively and unavoidably bees. We might need some interpretable-bees insights in order to succeed, it does seem like we could get better regret bound proofs (or heuristic arguments) that go through a particular trained model with better (reliable, clean) interp. But the whole deal is the ai gets to exceed us in ways that make human interpreting stuff inherently (as opposed to transiently or fixably) too slow. To be useful durably, interp must become a component in scalably constraining an ongoing training/optimization process. Which means it’s gonna be partly bees in order to be useful. Which means it’s easy to accidentally advance bees more than durable alignment. Not a new problem, and not one with an obvious solution, but occasionally I see something I feel like i wanna comment on.
I was a big disagree vote because of induced demand. You’ve convinced me this paper induces less demand in this version than I worried (I had just missed that it trained slower), but my concern that something like this scales and induces demand remains.
Capabilities → capabees → bees
I’ve been working on pure combinational logic LLMs for the past few years, and have a (fairly small) byte level pure combinational logic FSM RNN language model quantized to And Inverter Gate form. I’m currently building the tooling to simplify the logic DAG and analyze it.
Are you, or others, interested in talking with me about it?
I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes—are these because DLGN or DLGN-inspired methods seem unlikely to scale, they won’t be much more interpretable than traditional NNs, or some other reason?
Are system prompts actually necessary? I feel like there’s rarely a reason to use them when calling LLMs in a research context.
The main reasons I can think of are:
If you know the LLM was trained with one particular system prompt, you might want to use that prompt to keep it in-distribution.
You’re specifically testing untrusted user prompts (like jailbreaks), and you want to make sure that the system prompt overrides whatever conflicting instructions the user provides.
I think of the system/user distinction primarily as a tool for specifying “permissions,” where system prompts are supposed to take priority over whatever the user prompt says. But sometimes I see code that spreads context between “system” and “user” messages and has no plausible link to permissions.
Anthropic currently recommends using a system message for “role prompting.”[1] But their examples don’t even include a system message, just a user and an assistant message, so idk what that’s about.
Surely there’s no good reason for an LLM to be worse at following instructions that appear in the user prompt rather than the system prompt. If there is a performance difference, that seems like it would be a bug on the LLM provider’s end. Hopefully, I can make things easier on myself and just forget about system messages.
Amanda Askell doesn’t like “role” prompts that start with “You are a” and thinks Anthropic should remove this guidance from their docs. I’m not sure what her reasoning is, but I feel similarly: they feel questionably useful and kind of archaic to me, like a remnant from GPT-3 times, or something you’d see from a Twitter AI grifter who has “one ULTIMATE prompt to make ChatGPT 10X SMARTER! 🧵” But even supposing role prompting is useful, surely it could be put in the user prompt.
It seems there’s an unofficial norm: post about AI safety in LessWrong, post about all other EA stuff in the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don’t.
I feel like this is pretty confusing. There was a time that I didn’t read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would be more aware of each other, and we wouldn’t get near-duplicate posts like these two.
(Adapted from this comment.)
Agreed that the current situation is weird and confusing.
The AI Alignment Forum is marketed as the actual forum for AI alignment discussion and research sharing. However, it seems that the majority of discussion shifted to LessWrong itself, in part due to most people not being allowed to post on the Alignment Forum, and most AI Safety related content not being actual AI Alignment research.
I basically agree with Reviewing LessWrong: Screwtape’s Basic Answer. It would be much better if AI Safety related content had its own domain name and home page, with some amount of curated posts flowing to LessWrong and the EA Forum to allow communities to stay aware of each other.
I think it would be extremely bad for most LW AI Alignment content if it was no longer colocated with the rest of LessWrong. Making an intellectual scene is extremely hard. The default outcome would be that it would become a bunch of fake ML research that has nothing to do with the problem. “AI Alignment” as a field does not actually have a shared methodological foundation that causes it to make sense to all be colocated in one space. LessWrong does have a shared methodology, and so it makes sense to have a forum of that kind.
I think it could make sense to have forums or subforums for specific subfields that do have enough shared perspective to make a coherent conversation possible, but I am confident that AI Alignment/AI Safety as a field does not coherently have such a thing.
Of note: the AI Alignment Forum content is a mirror of LW content, not distinct. It is a strict subset.
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
Consider saying “irrational” instead of “insane”
Rationalists often say “insane” to talk about normie behaviors they don’t like, and “sane” to talk about behaviors they like better. This seems unnecessarily confusing and mean to me.
This clearly is very different from how most people use these words. Like, “guy who believes in God” is very different from “resident of a psych ward.” It can even cause legitimate confusion when you want to switch back to the traditional definition of “insane”. This doesn’t seem very rational to me!
Also, the otherizing/dismissive nature of this language bothers me a bit. For those of us who are trying to make the world better for humanity, it seems like it would be nice to try to meet the vast majority of non-rationalist humans where they’re at, which could start by not calling them “insane.”
What to say instead? Well, “rational” and “irrational” are right there! That’s why we call it “rationalism”! Maybe “X is irrational” sounds pretentious, but “X is insane” sounds insulting, so I think it evens out at least. If “irrational” seems too impassive, perhaps try “dangerously irrational?”
I think there’s some bad knock-on effects for normalizing the use of “insane” to talk about very common features of the world: I think it makes social-rationalists to willing to disparage people and institutions, as part of a status-signaling game, often without much careful thought.
But I think there’s also something valuable about eg. calling belief in God “insane”. There’s a kind of willingness to call a spade a spade, and not back away from how the literal stated beliefs, if they were not pervasive, would in fact be regarded as signs of insanity.
Rationality is not correctness, not truth or effectiveness, it’s more narrow, disposition towards better methods/processes that help with attaining truth or effectiveness. Keeping intended meaning narrow when manipulating a vague concept helps with developing it further; inflation of meaning to cover ever more possibilities makes a word somewhat useless, and accessing the concept becomes less convenient.
I didn’t say that rationality is the same thing as correctness, truth, or effectiveness. I think when rationalists use the word “sane” they usually do mean something like “having a disposition towards better methods/processes that help with attaining truth or effectiveness.” Do you disagree?
To be “not-insane”, you don’t need rationality in this narrow sense, in most circumstances. You don’t need to seek out better methods for getting things right, you just need some good-enough methods. A bit of epistemic luck could easily get you there, no need for rationality.
So the issue of behaving/thinking in an “insane” way is not centrally about lack of rationality, rationality or irrationality are not particularly relevant to the issue. Rationality would help, but there are many more things that would also help, some of them much more practical for any given object level issue. And once it’s resolved, it’s not at all necessary that the attitude of aspiring to rationality was attained, that any further seeking out of better methods/processes will be taking place.
Do you think rationalists use ‘insane’ and ‘crazy’ more than the general population, and/or in a different way than the general population? (e.g. definition 3 when you google ‘insane definition’)
Yeah, I think some rationalists, e.g. Eliezer, use it a lot more than the general population, and differently from the popular figurative sense. As in “raising the sanity waterline.”
I think the key difference between a normal guy who believes in God and someone in a psych ward is often that the person who believes in God does so because it’s what other people in authority told him but the person in the psych ward thinks for themselves and came up with their delusion on their own. This often means their beliefs are self-referential in a way that prevents them from updating due to external feedback.
If you believe that the word rational is supposed to mean something along the lines of “taking actions that optimize systematically for winning”, believing something just because an authority told you can sometimes be irrational and sometimes be rational.
If you want to talk well about behavior you don’t like, it makes sense to use words that refer to the reason for why the behavior is bad. Credulous or gullible might be words that better describe a lot of normie behavior. The person who believes in god just because his parents and teachers told him is credulous.
There’s been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. AI labs could just plug this into the reward for their reasoning models, reinforcing the reasoning traces leading to responses that obtain higher reward. This seemed to me like a really obvious next step.
Well, it turns out that DeepSeek r1 actually does this. From their paper:
This checks out to me. I’ve already noticed that r1 feels significantly better than other models at creative writing, which is probably due to this human preference training. While o1 was no better at creative writing than other models, this might just mean that OpenAI didn’t prioritize training o1 on human preferences. My Manifold market currently puts a 65% chance on chain-of-thought training outperforming traditional LLMs by 2026, and it should probably be higher at this point.
We need to adjust our thinking around reasoning models—there’s no strong reason to expect that future models will be much worse at tasks with fuzzy success criteria.
Adapted from my previously-posted question, after cubefox pointed out that DeepSeek is already using RLHF.
This is an obvious thing to try, but it’s not what currently already works, and it’s not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn’t work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
The problem with this neat picture is reward-hacking. This process wouldn’t optimize for better performance on fuzzy tasks, it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn’t scale to superintelligence, this doesn’t scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, “better performance” and “looks like better performance” would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn’t this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
… Or so my intuition goes. It’s possible that this totally works for some dumb reason. But I don’t think so. RL has a long-standing history of problems with reward-hacking, and LLMs’ judgement is one of the most easily hackable things out there.
(Note that I’m not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn’t look like the way.)
Yeah, it’s possible that CoT training unlocks reward hacking in a way that wasn’t previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI’s “deliberative alignment” but more general).
I think a better analogy than martial arts would be writing. I don’t have a lot of experience with writing fiction, so I wouldn’t be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I’d be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas “looking at a page of fiction” is a situation that I (and LLMs) are much more familiar with.
Well… One problem here is that a model could be superhuman at:
thinking speed
math
programming
flight simulators
self-replication
cyberattacks
strategy games
acquiring and regurgitating relevant information from science articles
And be merely high-human-level at:
persuasion
deception
real world strategic planning
manipulating robotic actuators
developing weapons (e.g. bioweapons)
wetlab work
research
acquiring resources
avoiding government detection of its illicit activities
Such an entity as described could absolutely be an existential threat to humanity. It doesn’t need to be superhuman at literally everything to be superhuman enough that we don’t stand a chance if it decides to kill us.
So I feel like “RL may not work for everything, and will almost certainly work substantially better for easy to verify subjects” is… not so reassuring.
I agree. I think you don’t even need most of the stuff on the “superhuman” list, the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.
I was curious if AI can coherently code-switch between a ridiculous number of languages, so I gave Claude Opus 4.5 an excerpt from one of our past chats and asked it to try. After some iteration, I was pretty impressed with the results! Even though the translation uses 26 different languages, switching practically every word and using some non-English word ordering, ChatGPT was able to translate the text back to English almost perfectly.
Here’s the opening line of the text I used (copied from one of my past chats with Claude), translated into this multi-language code and back into English:
Original: A few things that took me a while to understand about Brazilian Jiu Jitsu
Code-switching: 几个 πράγματα kuhusu BJJ мне lama かかった megérteni
Annotated with language and English meanings: 几个 <Chinese, “a few”> πράγματα <Greek, “things”> kuhusu <Swahili, “about”> BJJ <English, “Brazilian Jiu-Jitsu”> мне <Russian, “to me”> lama <Tibetan/Hindi loan, “long”> かかった <Japanese, “took (time)”> megérteni <Hungarian, “to understand”>
Recovered: A few things about BJJ that took me a while to understand
In earlier attempts, Claude says it tended to get stuck writing in just one language towards the end of its translation, with words from other languages “sprinkled in decoratively.” After I put it into reasoning mode, it seemed to do a much better job, regularly switching between a diverse mix of languages.
The implications for AI’s ability to transmit secret messages are probably minimal, because it’s very obvious if an LLM starts talking like this, and we can probably just put the obfuscated text into a trusted LLM to decode it. This is still a cool example of text that would be hard or at least very annoying for almost any human to understand before LLMs existed.
Here are the links to my chats: Claude, ChatGPT. The full texts are copied below:
Original text
A few things that took me a while to understand about Brazilian Jiu Jitsu (or that I wish someone had told me):
You will feel helpless and that’s the point—Upper belts will control you effortlessly. This isn’t because you’re weak or unathletic; it’s because technique creates absurd leverage advantages. A 130-pound purple belt can make a 200-pound athletic beginner feel like a child. This is actually what makes BJJ cool—it means the technique is real and learnable.
Ego is the main thing that gets people hurt—Both refusing to tap and applying submissions too aggressively because you want to “win” a training round. The best training partners are the ones who could submit you but give you space to work instead.
Positions before submissions—Beginners tend to grab for chokes and armbars from anywhere. The actual game is about establishing dominant positions (mount, back control, side control) and only then attacking submissions. If you’re always hunting submissions from bad positions, you’ll develop bad habits.
Ask upper belts what they did—After a roll, most people are happy to show you the thing that caught you. “Hey, what was that sweep?” is a great way to learn and also to build relationships at the gym.
Code-switching text (by Claude)
几个 πράγματα kuhusu BJJ мне lama かかった megérteni:
너는 bất_lực tunnet — ve to er poängen. Kõrgemad kuşaklar тебя sans_effort 制御する. यह δεν pentru_că ikaw dhaifu. Taito cria šílenou đòn_bẩy előnyöket. Isang 百三十ポンドの ungu kuşak kann einen 二百ポンドの спортсмена như 子供 느끼게_하다. Именно dlatego hii ni すごい — umiejętność jest todellinen na tanulható.
Orgullo เป็น 主な lý_do 사람들이 다치는. 탭을 kukataa quand 負ける, y también 関節を trop 激しく 攻める porque nak menang. 最高の edzőtársak sont wale ambao könnten тебя pokonać aber zamiast dają ci przestrzeń pracować.
位置 πρώτα, azután támadás. Αρχάριοι hajlamosak 掴む fojtást 과 팔꺾기를 từ 어디서든. Igazi 게임: zbuduj dominację — マウント, hát, oldal — entonces shambulia. Nếu luôn săn támadást dari nafasi mbaya, mauvaises 習慣を développeras.
Pytaj 上級者에게 他们 什么 yaptıklarını. Baada ya 한_판, hầu_hết örömmel megmutatják. “あの balayage 是什么?” — tuyệt_vời für 学ぶこと na pia kujenga Beziehungen kwenye 道場.
Annotated code-switching text (by ChatGPT)
几个 <Chinese, “a few”> πράγματα <Greek, “things”> kuhusu <Swahili, “about”> BJJ <English, “Brazilian Jiu-Jitsu”> мне <Russian, “to me”> lama <Tibetan/Hindi loan, “long”> かかった <Japanese, “took (time)”> megérteni <Hungarian, “to understand”>:
너는 <Korean, “you”> bất_lực <Vietnamese (Hán-Việt), “powerless”> tunnet <Finnish, “feel”> — ve <Finnish, “and”> to er poängen <Swedish, “that is the point”>.
Kõrgemad kuşaklar <Estonian “higher” + Turkish “belts”, “higher belts”> тебя <Russian, “you”> sans_effort <French/English, “without effort”> 制御する <Japanese, “control”>.
यह <Hindi, “this”> δεν <Greek, “not”> pentru_că <Romanian, “because”> ikaw <Tagalog, “you”> dhaifu <Swahili, “weak”>.
Taito <Finnish, “skill”> cria <Portuguese, “creates”> šílenou <Czech, “crazy”> đòn_bẩy <Vietnamese, “leverage”> előnyöket <Hungarian, “advantages”>.
Isang <Tagalog, “a”> 百三十ポンド <Japanese/Chinese numerals, “130 pounds”> の <Japanese possessive, “of”> ungu kuşak <Swahili “purple” + Turkish “belt”, “purple belt”> kann <German, “can”> einen <German, “a”> 二百ポンド <Japanese/Chinese numerals, “200 pounds”> の <Japanese, “of”> спортсмена <Russian, “athlete”> như <Vietnamese, “like”> 子供 <Japanese, “child”> 느끼게_하다 <Korean, “make (someone) feel”>.
Именно <Russian, “exactly”> dlatego <Polish, “therefore”> hii <Swahili, “this”> ni <Swahili, “is”> すごい <Japanese, “amazing”> — umiejętność <Polish, “skill”> jest <Polish, “is”> todellinen <Finnish, “real”> na <Polish, “and”> tanulható <Hungarian, “learnable”>.
Orgullo <Spanish, “ego/pride”> เป็น <Thai, “is”> 主な <Japanese, “main”> lý_do <Vietnamese, “reason”> 사람들이 <Korean, “people”> 다치는 <Korean, “get injured”>.
탭을 <Korean, “tap” (object marker)> kukataa <Swahili, “refuse”> quand <French, “when”> 負ける <Japanese, “losing”>, y también <Spanish, “and also”> 関節を <Japanese, “joints”> trop <French, “too”> 激しく <Japanese, “aggressively”> 攻める <Japanese, “attack”> porque <Spanish, “because”> nak menang <Malay, “want to win”>.
最高の <Japanese, “best”> edzőtársak <Hungarian, “training partners”> sont <French, “are”> wale <Swahili, “those”> którzy <Polish, “who”> könnten <German, “could”> тебя <Russian, “you”> pokonać <Polish, “defeat”> aber <German, “but”> zamiast <Polish, “instead”> dają <Polish, “give”> ci <Polish, “you”> przestrzeń <Polish, “space”> pracować <Polish, “to work”>.
位置 <Chinese/Japanese, “position”> πρώτα <Greek, “first”>, azután <Hungarian, “then”> támadás <Hungarian, “attack”>.
Αρχάριοι <Greek, “beginners”> hajlamosak <Hungarian, “tend”> 掴む <Japanese, “to grab”> fojtást <Hungarian, “choke”> 과 <Korean, “and”> 팔꺾기를 <Korean, “arm locks”> từ <Vietnamese, “from”> ahol든 <Korean, “anywhere”>.
Igazi <Hungarian, “real”> 게임 <Korean loanword, “game”>: zbuduj <Polish, “build”> dominację <Polish, “dominance”> — マウント <Japanese, “mount”>, hát <Hungarian, “back”>, oldal <Hungarian, “side”> — entonces <Spanish, “then”> shambulia <Swahili, “attack”>.
Nếu <Vietnamese, “if”> luôn <Vietnamese, “always”> săn <Vietnamese, “hunt”> támadást <Hungarian, “attacks”> dari <Indonesian, “from”> nafasì <Vietnamese/Arabic loan, “position”> mbaya <Swahili, “bad”>, mauvaises <French, “bad”> 習慣を <Japanese, “habits”> développeras <French, “you will develop”>.
Pytaj <Polish, “ask”> 上級者에게 <Japanese “advanced” + Korean “to”, “to higher belts”> 他们 <Chinese, “they”> 什么 <Chinese, “what”> yaptıklarını <Turkish, “what they did”>.
Baada ya <Swahili, “after”> 한_판 <Korean, “one round”>, hầu_hết <Vietnamese, “most”> örömmel <Hungarian, “happily”> megmutatják <Hungarian, “will show”>.
「あの <Japanese, “that”> balayage <French, “sweep”> は <Japanese topic marker> 什么?」 <Chinese, “what?”> — tuyệt_vời <Vietnamese, “excellent”> für <German, “for”> 学ぶこと <Japanese, “learning”> na <Polish, “and”> pia <Swahili, “also”> kujenga <Swahili, “building”> Beziehungen <German, “relationships”> kwenye <Swahili, “in”> 道場 <Japanese, “dojo”>.
Recovered text (by ChatGPT)
A few things about BJJ that took me a while to understand:
You feel powerless — and that is the point. Higher belts control you with seemingly no effort. That isn’t because you’re weak. Technique creates insane leverage advantages. A 130-pound purple belt can make a 200-pound athlete feel like a child. That’s exactly why this is amazing — skill is real and learnable.
Ego is the main reason people get injured. Refusing to tap when you’re losing, or cranking submissions too hard because you want to win. The best training partners are the ones who could beat you, but instead give you space to work.
Position first, then submission. Beginners tend to grab chokes and armlocks from anywhere. The real game is to build dominance — mount, back, side control — then attack. If you’re always hunting submissions from bad positions, you’ll develop bad habits.
Ask higher belts what they did. After a round, most are happy to show you. “What was that sweep?” — perfect for learning and for building relationships at the gym.
People concerned about AI safety sometimes withhold information or even mislead people in order to prevent ideas from spreading that might accelerate AI capabilities. While I think this may often be necessary, this mindset sometimes feels counterproductive or icky to me. Trying to distill some pieces of this intuition...
AI safety people are all trying to achieve the same goal (stopping AI from destroying the world), whereas individual AI capabilities researchers largely benefit from keeping their work secret until it’s published to avoid being “scooped.” People working on safety benefit more from sharing information amongst themselves than those working on capabilities, so we should take advantage of it.
Capabilities researchers are mostly trying to come up with technical solutions that locally improve AI performance—often, capabilities advances will come down to one researcher’s simple technique. But people working on safety must think about the myriad effects of their actions on the world.
The idea that “capability X could exist” might inadvertently accelerate capabilities, because it causes researchers to look for ways to implement X. These techniques might sometimes be complementary, but they are likely to be orthogonal, mutually exclusive, or duplicates of each other.
But the positive impact of the idea to safety is likely larger. X may have many strategic implications, affecting the whole causal graph of AI’s impact on the world. If people working on AI safety were aware of these implications, they might realize that their work is more/less likely to be important than they previously believed, and many of them might make small or large changes to their plans in order to prepare for X.
It’s difficult for one person to think through the implications of their ideas, but an entire community thinking about it will do a lot better. E.g. someone might think that their idea may be infohazardous, but if they posted about it publicly, someone else might:
Show that it isn’t hazardous after all
Notice an additional effect that flips the sign of the impact
Find a way to defuse the hazard, so that when a capabilities researcher implements the scary thing, we already have a fix
Information-hiding and deception reduce trust within the AI safety community and makes it look less trustworthy to the rest of the world. It gives off the vibes that everything AI safety people say is heavily biased by their agenda, that they treat people outside their insular group as the enemy, or that they paternalistically hide information from anyone who doesn’t share their beliefs because they “can’t handle the truth.”
I want to show off my interesting ideas and be perceived as smart, and it feels nicer to cooperate with people rather than misleading them. I try to separate these from the more-important considerations above.
The tradeoff between secrecy and openness partly depends on how many people hanging out in places like LessWrong work on capabilities vs. safety. My impression is that there are a lot more safety people, but I’m very uncertain about this.