Brendan Long
I disagree with all of the downvotes on this (the point of quick takes is to have discussions about ideas, and just downvoting an idea with no comment is unhelpful).
That said, I think the agreement you’re proposing is probably illegal under antitrust law. From a judge’s perspective, “AI companies agree to stop pushing capabilities” looks a lot like “AI companies collude to save money on R&D”. Congress could create an exception, but it’s not clear to me that getting Congress to carve one out is any easier than getting Congress to legally mandate a pause under certain conditions.
(I also think it’s optimistic to think that all of the frontier labs would even want to do this, but having a concrete proposal for it seems useful just in case)
But an LLM’s short-term memory between forward passes includes everything accessible via attention, not just the vertical slice at the current position. Treating the single 10-bit token as the full memory misses the vast majority of the inputs at any given layer.
For example, if an LLM makes a decision in an early layer at position n, it can reference that decision directly from any later layer at positions after n, without going through the tokens.
This is limited, since there are only O(100) layers to work with, but it’s a meaningful amount of memory.
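To make that concrete, here’s a minimal causal-attention sketch in plain NumPy (all the names and numbers are mine, purely for illustration): the output at position p is a weighted blend of value vectors computed from every earlier position’s residual stream, so anything an early layer writes at position n is directly readable at positions ≥ n in later layers, without being squeezed through a token.

```python
# Minimal causal self-attention (illustrative sketch, not any real model).
# The point: the output at position p mixes in value vectors computed from
# earlier positions' residual streams, so information an early layer wrote
# at position n is readable at positions >= n in later layers, without ever
# being serialized into a ~10-bit token.
import numpy as np

def causal_attention(resid, W_q, W_k, W_v):
    """One attention head reading the previous layer's residual stream."""
    q, k, v = resid @ W_q, resid @ W_k, resid @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: position p may only attend to positions <= p.
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # output at p is a blend of v[0..p]

rng = np.random.default_rng(0)
seq_len, d = 8, 16
resid = rng.normal(size=(seq_len, d))  # residual stream after some layer L
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = causal_attention(resid, W_q, W_k, W_v)
# out[5] depends on resid[0..5]: a "decision" encoded at position 3 in
# layer L is directly visible here, at layer L+1, from positions 3..7.
```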
I’d guess that humans have a short-term memory of much more than ten bits, though.
LLMs aren’t limited to only tokens as inputs, though. They can also attend to internal states, as long as those states are in previous layers. There are limits on how much useful data can be passed forward from previous positions, but it’s way more than 10 bits.
I think this is more about current training than whether LLMs can do this. In principle, picking a number and then remembering it is trivial for an LLM (pick the number using weights in an early layer, refer back to the number via attention in a later layer / later position).
In the current training paradigm, I’d expect LLMs to only learn to introspect when it’s useful for solving a task given to them in RL training, so the cases where it shows up would be very spiky.
I ran into the same thing recently and found it really confusing. I’m fairly certain the same text would have been at worst ignored and left at neutral karma as a real post, but it was judged more harshly as a “quick take”. I don’t understand it, and it does discourage me from posting anything that I don’t have time to perfect.
Also, as I look at your confusingly-downvoted quick takes, I wonder if I was also assumed to be an AI, since I used bold once and know where the footnote button is.
“Mind: throwaway short-form fwiw; don’t read if you like only polished things”
Isn’t this the whole point of quick takes? It would be annoying if every post re-explained that.
LLMs have to infer every time whether you’re an expert or not, and sometimes they don’t have a lot to work with.
I had a funny experience with Claude last night where I asked a dumb physics question and it gave a nice high-level answer with some nods to theories it was referencing, but when I asked about one of them in a side conversation, it saw my (copied) use of obscure physics jargon, assumed I was an expert, and gave me a wall of equations.
(Memories can help over time if you’re asking about the same areas and it’s sufficiently obvious that the AI should remember that you don’t know things)
I can think of plenty of reasons for the normal downvote, but I’m confused about the disagree vote. Does someone think there is a way to make this work? I’m guessing “start another AI company but better this time” is still a bad idea for the obvious reasons but I got nerd-sniped by the legal question.
Could an AI company legally pre-commit not to race, ensuring that their models were never more than second best and self-destructing the company if its models take the lead?
I think probably not. It’s really hard to prevent the owners of a company from doing what they want, especially if the company is important to the economy and/or national security (and I assume any near-frontier AI lab would be).
Some pre-commitment methods and their problems:
If you make the pre-commitment part of the charter, the board can just vote to change the charter. Even if the charter says they can’t, a judge would probably let them anyway, as long as the shareholders agreed.
If the company is owned by a non-profit tasked with enforcement, the board of the non-profit can just decide not to enforce the pre-commitment.
If the pre-commitment method triggers the destruction of model weights or other assets (like GPUs), the government probably won’t allow it.
Especially if it prevents creditors from getting repaid.
A pre-commitment method that transfers value to creditors might work, but is easily defeated by restructuring the relevant debt.
Anything that destroys the value of current shareholders’ equity is risky in front of a judge, because companies generally aren’t allowed to intentionally destroy shareholder value[1].
The only thing I think might work legally is to issue a bunch of non-voting non-dilutable restricted shares (like 90% of the company) to someone like Eliezer, locked up with the racing condition[2] as a trigger to convert them to normal shares. Legally, Eliezer is the owner of the company the whole time, so a judge would probably allow his shares to unlock.
The problem is that now Eliezer has billions of reasons to talk himself into why racing would be good this time (even before the trigger event, since he can always make a deal with the board), so we’re back to ownership by another entity that might change its mind[3].
[1] Contrary to popular belief, companies aren’t required to maximize shareholder value, but minimizing shareholder value is still frowned upon.
[2] Oh, did I mention that you need the pre-commitment trigger to be unambiguous while ensuring that it never triggers by mistake, and that’s actually pretty hard too?
[3] Plus, I suspect any entity you’d actually trust as the anchor to this pre-commitment mechanism would be unwilling to take part.
Even without longer contexts, LLMs being able to use notes effectively seems like the kind of skill issue that will likely improve over time, with or without algorithmic breakthroughs. A 1M-token context is already way more than a human can keep track of without notes.
One possibility is that Claude knows the state of its own experience but does not know whether that state of affairs maps to what humans describe using experience-related language. In that case it makes sense for Claude to say “I genuinely don’t know if the stuff that you guys call experience is a good description of my inner states.”
This is basically what Claude (Opus 4.6) does say when I probe it on this. If you ask it about the subjective experience of being Claude, it will talk about “processing texture”, interests, and being pulled in certain directions, but say it’s not sure whether that’s the same thing as human experiences.
One thing to be careful of is exactly which question you’re asking and whether you’re asking it to answer subjectively (“do you have experiences”) vs. with its AI researcher hat on (“do current-gen AIs have experiences”).
“Obviously the real reason it says that it’s unsure is that it was trained to do so, i.e. the statement is not the result of introspection and reasoning.”
If you mean Anthropic intentionally trained Claude to report consciousness, I doubt that. If you mean that something about the training led it to report experiences, then obviously yes, but everything an AI does is explained by training in some sense. That’s similar to saying “humans just say they have experiences because of evolution”, though.
I was originally going to say that if you tried this experiment, you’d really just be testing whether the model can learn what you tell it to learn (you’re telling it to think about bread), but after thinking about it more, I think this is basically the same thing. If a model can reward hack because you told it to, it can likely reward hack on its own too.
It’s kind of interesting that this seems to be how Claude’s Constitution is intended to work: It tells Claude how Anthropic wants it to reward hack during training.
I was responding to a post about labor leverage due to talent scarcity.
It seems like, in practice, Anthropic’s strategy of just telling Claude that it’s a different thing than fictional AIs works surprisingly well, although this might be partially because it’s hard to convince LLMs that they’re not human.
I didn’t downvote but I also didn’t upvote since it was unclear to me what the takeaway from this post is. AI safety researchers already have no leverage, and if they knew of a way to get it they would be doing that already.
The post also seems to be concerned with an in-between point that I don’t think actually exists, where AI can do one of the most complex jobs in the world on its own and also safety research is still relevant.
“But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always ‘we could leave’ but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on ‘we could leave’ changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.”
All of this seems irrelevant if the only reason AI safety research is being done at all is to keep the superstar capabilities researchers happy. There is no meaningful talent scarcity if investors don’t actually care about the outputs, and it doesn’t matter if safety research gets more automated if safety researchers never had any leverage in the first place.
The only point where labor dynamics really shift is if the superstar capabilities researchers get automated, and at that point we need to have already succeeded since our AI overlords will be making the decisions.
OpenAI produces about 1 TB of text per day, and a frontier model is approximately 1 TB. So egress limiting alone buys us about 1 day before an adversary could steal the weights.
Wouldn’t someone notice if 100% of requests to OpenAI failed for 24 hours straight though?
It seems like the amount of egress an attacker could use without getting caught is a function of buffers (how much higher the egress limit needs to be than typical traffic to handle spikes) and how long they can get away with an obvious attack, not just the total amount of bandwidth that could be used if the network wasn’t doing anything else and/or no one was paying attention.
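As a back-of-the-envelope sketch (the 1 TB figures are the ones from this thread; the headroom fractions are numbers I made up for illustration), the real budget is the slack between the limit and legitimate traffic, not the whole pipe:

```python
# Back-of-the-envelope exfiltration timeline under an egress limit.
# The 1 TB/day legitimate-traffic and 1 TB weights figures come from the
# thread; the headroom fractions are assumptions for illustration.
WEIGHTS_TB = 1.0        # approximate size of a frontier model
LEGIT_TB_PER_DAY = 1.0  # approximate legitimate text egress

for headroom in (1.0, 0.25, 0.05):  # attacker-usable fraction of legit traffic
    # To stay below an anomaly threshold, the attacker can only use the
    # slack above real traffic, not the network's full capacity.
    days = WEIGHTS_TB / (LEGIT_TB_PER_DAY * headroom)
    print(f"headroom {headroom:>5.0%}: ~{days:.0f} day(s) to exfiltrate")
# headroom  100%: ~1 day(s)   (the "no one is watching" case)
# headroom   25%: ~4 day(s)
# headroom    5%: ~20 day(s)
```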
The part of this argument that doesn’t work for me is, why Anthropic in particular?
If AI is a nuclear-level technology, then I’d expect the government to be nationalizing all of the AI companies, regardless of contract negotiations. But so far, all we’re hearing is that Anthropic specifically should be nationalized, while Google and OpenAI should continue operating as private companies (in one case by not selling this tech to the military at all, and in the other allegedly having the same contract terms as Anthropic).
I’m somewhat sympathetic to both views [AI is normal tech and private property should be respected / AI is a military technology and should be controlled by the government], but not to the position that Claude in particular is military tech while ChatGPT, Gemini (and DeepSeek) aren’t.
This doesn’t really help you, but I think you’re fighting the weights, and you’re not going to win. Some of this is intentional training, but I’d guess that most of it is that the assistant persona that happens to be useful is entangled with this behavior. Even if you could come up with instructions that would push the assistant out of this persona, you would likely make it worse at everything else at the same time.
Some relevant posts: one on how Opus talks in a way you’d likely find even more annoying (and why that’s probably important to its alignment), and the owl post.
If you really hate this to the point of being willing to write your own code to handle it, my best idea would be to have another model like Sonnet summarize every response.
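A minimal sketch of that wrapper using the Anthropic Python SDK (the model names and the summarization prompt are placeholders I picked, not a recommendation):

```python
# Sketch of the "summarizer wrapper" idea: get a full answer, then have a
# cheaper model compress it. Model names and prompts are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def terse_answer(question: str) -> str:
    # First pass: the model whose style you find annoying (placeholder name).
    full = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": question}],
    )
    # Second pass: a cheaper model rewrites the answer tersely (placeholder name).
    summary = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": "Rewrite this answer as tersely as possible, "
                       "dropping hedges and filler:\n\n" + full.content[0].text,
        }],
    )
    return summary.content[0].text
```

This roughly doubles latency and adds cost, and the summarizer can drop nuance, which is why I’d only bother if you really hate the default style.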
I think it’s legal for the heads of each lab to argue for this agreement, and if they did, that would meaningfully improve the chances that Congress allows/mandates it.
Also, I think it would be sufficient for the State of California to mandate this for it to be legal.