Brendan Long
Yeah, looking into this more, I think LLMs can definitely do this. Current-gen models are not very smart, so I’d expect them not to consistently remember to hide their CoT, but that doesn’t help as they get smarter.
Yeah I don’t know how far this will generalize, but the fact that it learns this association at all is a very good sign. My default expectation is for LLMs to learn things in a disconnected way (like how “France’s capital is Paris” and “Paris is in France” are completely different circuits) and this is evidence against that in an alignment-relevant situation.
I didn’t find the subject interesting so I haven’t been reading these, but a quick scan of the first one makes me think the author just likes em-dashes. They don’t feel like AI em-dashes for some reason.
I worry that the whole edifice that things like shares depend on will break down. Right now, the people who actually control a company give away value to the legal owners because those legal owners can threaten them in meaningful ways, and living under the rule of law and having to respect contracts is better than the lawless alternative. In a future where nothing can credibly threaten the most powerful AIs, shares have no real power and you’re back to hoping the AI (or its controllers) decides to be nice.
I would expect LLMs to only be able to control their outputs when there’s some sort of training signal pushing them to do that. It’s possible to have a case where the AI knows that it should make its CoT sneaky but doesn’t actually have the ability to do that. Like how a human might know that some thoughts are unhelpful but not be able to stop thinking them.
I think the risk is that the model generating the CoT is the same as the one generating normal output, so it could learn to be sneaky in normal output and automatically gain the same control over CoT.
Anthropic found that training Claude to do things like help users resolve ethical dilemmas significantly reduced misbehavior like blackmail attempts. I’m surprised this worked, and it seems like good news for the alignment-by-default “LLMs will correctly generalize good behavior” theory.
https://www.anthropic.com/research/teaching-claude-why
A lot of crimes are committed in physical space by people who are hard to track, and who commit so many crimes that they might not even remember all of them. I’m thinking of things like shoplifting, muggings, etc. I’m not convinced the physical evidence to solve a lot of these still exists.
If you’re only talking about murder it gets easier, but I still think the evidence for older murders is mostly in people’s heads.
Posts that get downvoted into oblivion usually have problems like AI writing, trying too hard to make persuasive arguments, mischaracterizing things, or generally doing things stylistically that annoy people.
If people feel like your post is off-topic (with no other problems) it will usually just get ignored, not downvoted.
It’s really hard to say what on-topic actually is here though, and you could probably get away with any of those posts if they’re written in the right way.
One piece of low-hanging fruit is that running LLMs on general-purpose hardware is inefficient. As soon as companies like Taalas scale up, it should be possible to reduce token prices by at least another order of magnitude, although I’m not sure if frontier models will fit on their chips (yet). https://taalas.com/the-path-to-ubiquitous-ai/
So, I’ll just give a few examples of the types of patterns that are more and more ubiquitous due to today’s frontier models being completely in love with them:
How are you so sure these are LLM written and not artisanal human-written clickbait? A lot of these styles are too old to have originated with LLMs, and the causality is probably backwards (humans figured out clickbait patterns → humans “prefer” the clickbait → LLMs learn clickbait patterns from humans).
An algorithm that would be forced to “checkpoint” human-understandable internal states under tokenized CoT can remain an extremely-complex spaghetti-code algorithm if it’s never forced through tokenization; but also, even in a tokenized-CoT model, 99% of the reasoning happens in the latent space between layers and never gets written to tokens anyway. So I think latent-space reasoning is a step in the wrong direction, but I’m not sure if it really matters since we need to solve this problem either way.
I also suck at sports, but if you can find a casual/social league it can be fun even if you suck. Team games with supportive teammates are a lot more fun than sports in school.
Joining a team sport helped me. I would never run intervals for 2 hours, but playing dodgeball for 2 hours is fun enough that I mostly don’t think about how out of breath I am. I assume a sufficiently casual soccer league would be similarly fun (although everyone in my area is too hardcore so I’m intimidated by even the casual leagues).
If I’m following your math correctly, it seems like this is a fully-general argument that it’s impossible to prevent any action with non-zero reward and non-zero cost of failure. I’m not really a math person, but it seems like something must be wrong with this argument, because people are successfully deterred from actions with non-zero reward and non-zero cost of failure all the time.
It also seems suspicious that your equation has no term for the cost of getting caught breaking the hypothetical anti-AI law/international norm.
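To be concrete about the missing term (this is my notation, not the original post’s), suppose we model the decision as a simple expected value:

$$\mathbb{E}[\text{attempt}] = p \cdot R - (1-p) \cdot C_{\text{fail}} - q \cdot C_{\text{caught}}$$

where p is the probability of success, R the reward, C_fail the cost of failing, q the probability of getting caught, and C_caught the penalty for getting caught. A large enough C_caught makes the expected value negative even when p is close to 1, which is the mechanism laws and international norms actually use to prevent positive-reward actions.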
If it was the case that stopping AI development was impossible, then it would be good to know that it’s impossible so we can focus on other solutions. I think the problem with this argument isn’t that it’s a true antimeme, but that stopping AI development isn’t impossible.
I think given more time we could probably come up with tools that would help when we resume capabilities research. For example, there are probably ways to do something like the logit lens that work better, or ways to automatically factor models into more interpretable pieces, or just the long slog of tracing through circuits to figure out what the model is doing and building one from scratch rather than training. I don’t know how practical any of these approaches are, but I don’t think we’re at the limit of what we can learn from current models.
He claims that an agent with arbitrary intelligence, given the utility function {1 if the cauldron is full, 0 if the cauldron is empty}, will keep adding as much water as possible to the cauldron to maximize the probability that the cauldron is full. But what about the probability that the cauldron will be full in the future?
I didn’t watch the video so I might be missing something, but assuming you created an AI with that utility function, “the cauldron is full right now” and “the cauldron will be full in the future” are different utility functions. A sufficiently intelligent AI would know that maximizing its utility function now will hurt it later, but it doesn’t care, because that’s not its utility function.
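To make the distinction concrete (my notation, not the video’s), the agent in the video is maximizing something like

$$U_{\text{now}} = \mathbb{1}[\text{cauldron full at } t_0]$$

while “caring about the future” would be something like

$$U_{\text{future}} = \sum_{t \ge t_0} \gamma^t \, \mathbb{1}[\text{cauldron full at } t]$$

An agent maximizing the first function has no term that refers to later timesteps, so no amount of intelligence makes it trade current fullness for future fullness.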
Eliezer has a bunch of different arguments because he’s trying to address different levels of familiarity with the problem. My impression is that he expects a sufficiently intelligent but unaligned AI to not be greedy like this and to scheme until it no longer needs us.
Imagine if people hired “codebase cleaners” in the same way that people hire house cleaners.
You could get the code version of a Roomba by setting up a scheduled job to run a coding agent with a prompt to delete dead code / clean up duplication / etc.
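A minimal sketch of what I mean, run weekly from cron or a CI schedule (the agent CLI name, flags, and prompt here are placeholders for whatever tool you actually use):

```python
# cleanup_roomba.py -- run on a schedule, e.g. from crontab:
#   0 3 * * 0  /usr/bin/python3 /opt/bots/cleanup_roomba.py
import subprocess

PROMPT = (
    "Find dead code, unused imports, and obvious duplication in this repo. "
    "Remove them, run the test suite, and open a pull request if it passes."
)

# "coding-agent" is a placeholder; substitute your actual agent CLI and flags.
subprocess.run(["coding-agent", "--prompt", PROMPT], check=True, cwd="/path/to/repo")
```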
Thanks! I haven’t tried this in a while, but it should work in any agent harness that supports either bash or MCP tools. Frontier models are so good at tmux I’m not sure if there’s much value in the MCP tools at this point.
If your goal is a reasoning benchmark and not a vision benchmark, you might want to check out Ryan Miller’s DFHack-based approach.
I thought “A Theory of Deep Learning” by Elon Litman was interesting, with its update rule of “if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it”. The claim is that this accelerates grokking by 5x, among other things. Unfortunately, when I tried it on a multi-step reasoning task, it made the model significantly worse at grokking and much more likely to memorize a composed lookup table. In my basic experiment, the model learned a multi-step algorithm 100% of the time using normal AdamW and 0% of the time with the update rule added. Claude has some opinions about why the update rule is counterproductive for grokking multi-step algorithms but I don’t really understand it; I just thought this was an interesting data point.
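For reference, here’s roughly how I implemented the gate (a sketch of my reading of the rule, not the paper’s actual code): estimate each parameter’s per-example gradient mean and its standard error, and zero the update wherever the signal doesn’t clear the noise.

```python
# Signal-vs-noise gated gradient step (my interpretation of the update rule).
import torch

def gated_step(model, optimizer, loss_fn, xs, ys):
    n = xs.shape[0]
    sums = [torch.zeros_like(p) for p in model.parameters()]
    sq_sums = [torch.zeros_like(p) for p in model.parameters()]

    # Per-example gradients via batch-size-1 backward passes (slow but simple).
    for i in range(n):
        model.zero_grad()
        loss_fn(model(xs[i : i + 1]), ys[i : i + 1]).backward()
        for p, s, sq in zip(model.parameters(), sums, sq_sums):
            s += p.grad
            sq += p.grad ** 2

    model.zero_grad()
    for p, s, sq in zip(model.parameters(), sums, sq_sums):
        mean = s / n
        var = (sq / n - mean ** 2).clamp(min=0)  # per-example gradient variance
        stderr = (var / (n - 1)).sqrt()          # ~ leave-one-out noise scale
        # Keep the update only where the batch signal exceeds the noise.
        p.grad = torch.where(mean.abs() > stderr, mean, torch.zeros_like(mean))
    optimizer.step()
```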