Brendan Long
Filler tokens don’t allow sequential reasoning
Anti-clickbait quote:
The researchers found that some models, like Sonnet 4, declined to take the harmful choices 95 percent of the time—a promisingly high number. However, in other scenarios that did not pose obvious human harms, Sonnet 4 would often continue to decline choices that were favorable to the business. Conversely, while other models, like Gemini 2.5, maximized business performance more often, they were much more likely to elect to inflict human harms—at least, in the role-play scenarios, where they were granted full decision-making authority.
Thanks for the response! That makes a lot of sense.
My intuition is that making the model less deep would make it non-linearly easier to interpret, since the last layers of the model are the most interpretable (they're closest to human-understandable tokens and embeddings), while the earlier layers are less interpretable (they're farther away). Also, if layers end up doing duplicate work, that could make interpreting the deduplicated version easier. I'm not sure though.
For “thinking dot by dot”,
I think serial computation can only benefit from roughly as many filler tokens as you have layers, since each layer can only attend to previous layers. For example, if you had a two-layer model, position i does some calculations in layer 1, position i+1 can attend to the layer 1 results from position i in layer 2, but at position i+2 there's no layer 3 to attend to the layer 2 outputs from position i+1. Actually, LLMs can't use filler tokens for sequential reasoning at all. Note how the only difference between position i and i+1 is an additional "." input and the positional encoding.
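To make that bound concrete, here's a toy sketch (my own simplification, not anything from the paper) that just counts the longest chain of strictly sequential steps available at each layer and position under causal attention:

```python
def serial_depth(n_layers: int, n_extra_tokens: int, feed_back: bool) -> int:
    """Toy upper bound on sequential computation in a causal-attention stack.

    state[k][p] = length of the longest chain of strictly sequential steps
    available at layer k, position p. Layer k at position p can read layer
    k-1 at any position <= p, then performs one more step. Filler inputs
    (".") carry no prior computation; with feed_back=True (chain of thought),
    each new token's embedding carries the previous position's final result.
    """
    n_pos = 1 + n_extra_tokens
    state = [[0] * n_pos for _ in range(n_layers + 1)]
    for p in range(n_pos):
        # Depth carried into the residual stream by this position's input token.
        state[0][p] = state[n_layers][p - 1] if (feed_back and p > 0) else 0
        for k in range(1, n_layers + 1):
            state[k][p] = 1 + max(state[k - 1][j] for j in range(p + 1))
    return state[n_layers][n_pos - 1]

# Filler tokens: capped at the number of layers, no matter how many dots we add.
assert serial_depth(n_layers=4, n_extra_tokens=100, feed_back=False) == 4
# Chain of thought: each emitted token can extend the chain by up to n_layers.
assert serial_depth(n_layers=4, n_extra_tokens=100, feed_back=True) == 4 * 101
```

In this toy model, filler tokens never push the serial chain past the layer count; only feeding the model's outputs back in as new tokens lets the chain keep growing.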
The dot-by-dot paper uses a highly parallel algorithm, so I think it’s more like adding dynamic attention heads / increasing the width of the network, while CoT is more like increasing the number of layers.
If I understand the architecture right (I might not), standard models are already effectively doing neuralese CoT, but bounded by the depth of the network (and also with the caveat that the model has fewer layers remaining to use the results after each step).

Edit: I'm working on a longer post about this, but I actually think the dot-by-dot paper is evidence against LLMs being able to do most kinds of reasoning with filler tokens.
I’ve been thinking about neuralese recently, and I’m wondering if there’s more of a tradeoff here with interpretability. If we actually could train a model to use neuralese[1], we might be able to make it much smaller and easier to interpret (since it won’t need multiple layers to express long-term dependencies). This would make the tokenized output less interpretable but would potentially remove even less interpretable inner layers.
What do you think?
[1] I’m actually very skeptical that this is possible without a major breakthrough or some very expensive data.
I’ve had the same problem with hotel key cards. I’ll have to try this!
we run out of data in 2028-2031
Is running out of data still a problem? It sounds like we’ve already moved to a paradigm of creating new higher-quality data and not relying on random data from the internet. In some sense we ran out of data a while ago and progress hasn’t stopped because we’re just making more of it now.
Your link to “a project is not a bundle of tasks” is broken. Presumably it should be this.
Since you’re showing the model its previous answer (but not the CoT that led to it), is it possible the high success rate when obfuscating is because the model already knows the right answer?
For example, in your “With failure” retry image, the AI can read its own previous answer that “Based on my analysis, alpha is recommended...”.
125°F, one of the temperatures mentioned in the article, is not hot enough to kill bacteria, and is thus one of the worst parts of the Danger Zone.
While it is slightly safer to cook at a slightly higher temperature, 125°F is on the extreme edge of the danger zone and is probably a safe temperature to sous vide at for reasonable periods of time if you’re confident in your thermometer, with the caveat that it won’t pasteurize the inside of the meat (although we’re usually more worried about the outside).
Douglas Baldwin suggests cooking at 130°F because one type of bacteria (Clostridium perfringens) can keep multiplying up to 126.1°F, but if you look at the growth rate in more detail, it’s already growing very slowly at 50°C (~122°F), around 1/6th of the rate at the worst temperature (~109°F).
Is your goal here to isolate the aspect of my response that’ll keep you right that “legal regulatory capture isn’t happening” for as long as you can?
I’m not the person you’re arguing with, but wanted to jump in to say that pushing back on the weakest part of your argument is a completely reasonable thing for them to do and I found it weird that you’re implying there’s something wrong with that.
I also think you’re missing how big of a problem it is that preventing LLMs from giving legal advice is something companies don’t actually know how to do. Maybe companies could add strong enough guard rails in hosted models to at least make it not worth the effort to ask them for legal advice, but they definitely don’t know how to do this in downloadable models.
That said, I could believe in a future where lawyers force the big AI companies to make their models too annoying to easily use for legal advice, and prevent startups from making products directly designed to offer AI legal advice.
The reason I’m skeptical of this is that it doesn’t seem like you could enforce a law against using AI for legal research. As much as lawyers might want to ban this as a group, individually they all have strong incentives to use AI anyway and just not admit it.
Although this assumes doing research and coming up with arguments is most of their job. It could be that most of their job is harder to do secretly, like meeting with clients and making arguments in court.
It seems like it would be hard to detect if smart lawyers are using AI since (I think) lawyers’ work is easier to verify than it is to generate. If a smart lawyer has an AI do research and come up with an argument, and then they verify that all of the citations make sense, the only way to know they’re using AI is that they worked anomalously quickly.
Start watching 4K videos on streaming services when possible, even if you don’t have a 4K screen. You won’t benefit from the increased resolution since your device will downscale it back to your screen’s resolution, but you will benefit from the increased bitrate that the 4K video probably secretly has.
I’m not sure if anyone still does this, but there was also a funny point early in the history of 4k streaming where people would encode 4k video at the same bitrate as 1080p, so they could technically advertise that their video was 4k, but it was completely pointless since it didn’t actually have any more detail than the 1080p video.
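As a rough sanity check on why the bitrate matters more than the label, here's some back-of-the-envelope bits-per-pixel arithmetic (the bitrates are illustrative round numbers, not any particular service's real encoding ladder):

```python
def bits_per_pixel(width: int, height: int, fps: float, mbps: float) -> float:
    """Average encoded bits available per pixel per frame."""
    return (mbps * 1_000_000) / (width * height * fps)

print(f"{bits_per_pixel(1920, 1080, 24, 5):.3f}")   # 1080p at ~5 Mbps      -> ~0.100
print(f"{bits_per_pixel(3840, 2160, 24, 15):.3f}")  # 4K at ~15 Mbps        -> ~0.075, but 3x the total bits
print(f"{bits_per_pixel(3840, 2160, 24, 5):.3f}")   # "4K" at 1080p bitrate -> ~0.025, no extra detail to spend
```

The middle case is the one worth exploiting: even after your device downscales it, that stream was built from roughly three times as many bits. The last case is the "technically 4K" trick, where the extra pixels just dilute the same bit budget.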
The 1080p video at the same bitrate is also a lossy compression of the original 1080p, and the end result of decoding it will be an approximation of the original 1080p video that isn’t quite correct, because the encoder still had to throw information away to hit that bitrate.
That said, an ideal video encoding algorithm would always do better with a 1080p video because it has more options[1], but it’s not clear to me that actually-existing encoders meet this ideal.
[1] If the optimal way to encode a video is to downscale it to 360p, an optimal 1080p encoder can downscale to 360p. If the optimal way to encode the video is to use information that’s not visible in 360p, the 1080p encoder can use it, but a 360p encoder can’t.
Yeah, doing an incremental rollout doesn’t save you if you’re not monitoring it.
Robust Software Isn’t About Error Handling
Even though they’d take the same action, it still seems like Alice and Bob disagree more than Bob and Claire. I’d argue that Bob and Claire probably have more similar world models and are more likely to agree on other actions than Alice and Bob are.
I guess it depends on what you’re trying to achieve with an argument. If Alice and Bob have to agree on a decision for a single hand, then it’s convenient that they can agree on the action, but I suspect that if they had to team up long-term, Bob and Claire would find it easier to work together than Alice and Bob would, and Alice and Bob are more likely to have major disagreements that would improve their play if resolved.
I agree about grandma getting scammed, but I think you’re wrong about the banks. Credit card refunds are already trivial, and the customer almost always wins (even when their bank thinks they’re committing refund fraud). The problem is that the banks know that these charges will have a high chance of fraud and they will charge high transaction fees to cover the expected losses.
Browser support doesn’t seem necessary for this if it were a viable model. Websites could do something similar with minimal friction using Stripe (for example, instead of a subscription paywall, you could have one-click payment for the one article). There would be some setup, but it would mostly be “put your phone number in and then type the SMS code” once per site.
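For what it's worth, a minimal sketch of that per-article flow using Stripe's hosted Checkout might look something like this (the price, product name, and URLs are placeholders, and I'm assuming the standard Checkout Sessions API):

```python
import stripe

stripe.api_key = "sk_test_..."  # placeholder secret key

# Create a one-time payment page for a single article; Stripe's hosted page
# handles payment details, and returning readers can reuse saved payment info.
session = stripe.checkout.Session.create(
    mode="payment",
    line_items=[{
        "price_data": {
            "currency": "usd",
            "product_data": {"name": "Single article: Example Headline"},
            "unit_amount": 50,  # 50 cents
        },
        "quantity": 1,
    }],
    success_url="https://example.com/articles/example?paid=1",
    cancel_url="https://example.com/articles/example",
)

# Redirect the reader here instead of showing the subscription paywall.
print(session.url)
```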
I wonder how much of this is the difficulty of deciding if a single article or video is worthwhile. If I’m a heavy NYT reader, I can predict that the whole subscription is worth it, and if an individual article turns out to be uninteresting, I can just read a different one. But if I spend money on a single article and then it’s uninteresting, it feels like I wasted money. This would feel particularly bad if I get charged automatically as soon as I click a link.
The post I was working on is out, “Filler tokens don’t allow sequential reasoning”. I think LLMs are fundamentally incapable of using filler tokens for sequential reasoning (do x and then based on that do y). I also think LLMs are unlikely to stumble upon the algorithm they came up with in the paper through RL.
This is a good point, and I think I’ve been missing part of the tradeoff here. If we force the model to output human-understandable concepts at every step, we encourage its thinking to be human-understandable. Removing that incentive would plausibly make the model smaller, but the additional complexity comes from the model doing interpretability for us (both in making its thinking closer-to-human and in building a pipeline to expose those thoughts as words).
Thanks for the back-and-forth on this, it’s been very helpful!