I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
Knight Lee
I’ll accept no less than 50%, and fight if we disagree
Do you mean the commitments follow an “all or nothing” pattern, where if both sides commit to 51% they’re doomed?
I imagine that commitments might be less extreme, where overlap is costly but not fatal:[1]
If each side commits to taking 51%, the rule of their commitment is to punish the other side by destroying anything more than 49% the other side takes, and then further destroying 0.5% for every 1% less than 51% they receive.
Each side takes 50%, but destroys 1% of the other side’s pie so each side is only left with 49%. They both realize they received 2% less than the target 51%. This means each side destroys 1% of what the other side has, so each side now only has 48%. This is 3% less than 51%, so they further destroy 0.5% of what the other side has so they’re both left with 47.5%. This continues to 47.25%, 47.125%, 47.0625%, and so on. Of course, they can skip the formalities and just jump to 47% which is the final state.
It is important to punish the other side for punishing you, at least a little.
If you only destroy what the other side takes past 49%, but you do not destroy further based on how little you got, then the other side can get away with committing to take 70% of the pie, betting on the small chance you are a sucker and only ask for 30%. If they are correct, they get away with 70% of the pie. If they are wrong, then you will destroy their pie until they are left with only 49%, and they will destroy your pie until you are left with less than 30%, but they still get the maximum amount they could have gotten if they committed any lesser amount. This means it doesn’t hurt for them to commit to take 70%. It only hurts you, and has a small chance of benefiting them.
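For anyone who wants to check the 51% vs. 51% arithmetic, here is a minimal sketch of the escalation. It assumes both sides are symmetric and update in lockstep, so one number tracks both shares. The names `escalate`, `fixed_point`, `target`, `cap`, and `rate` are mine, purely for illustration.

```python
def escalate(target=51.0, cap=49.0, rate=0.5, rounds=50):
    """Iterate the symmetric punishment rule; return the share after each round.

    Assumption: both sides use the same rule, so a single number describes both.
    target: the share each side committed to (in percent of the pie)
    cap:    the most the other side is allowed to keep before extra punishment
    rate:   extra percent destroyed per 1% of shortfall below target
    """
    share = cap  # after the first cut, each side keeps at most `cap`
    history = [share]
    for _ in range(rounds):
        shortfall = target - share       # how far below target each side is
        share = cap - rate * shortfall   # other side destroys rate * shortfall more
        history.append(share)
    return history


def fixed_point(target=51.0, cap=49.0, rate=0.5):
    """Solve share = cap - rate * (target - share) for share (needs rate < 1)."""
    return (cap - rate * target) / (1.0 - rate)


if __name__ == "__main__":
    print(escalate(rounds=5))  # [49.0, 48.0, 47.5, 47.25, 47.125, 47.0625]
    print(fixed_point())       # 47.0
```

Setting `rate=0` in this sketch leaves the share stuck at `cap`, which is the "no punishing for being punished" case: over-committing then costs the other side nothing beyond being held to 49%.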
PS: I’m not saying your post is wrong, since it’s clearly titled “A high-level model of AI bargaining” rather than “A very detailed model of AI bargaining!” I just feel this detail is worth mentioning.
Good catch, I didn’t find that because I only looked at the first few benchmarks on Google and Z.ai’s own benchmarks. This one puts GLM 5.2 further back, below Gemini 3.1 Pro but still better than Gemini 3.5 Flash.
Yes, I think it’s fatigue. There have been so many incremental developments from Chinese models, and GLM 5.1 wasn’t an important model, so people ignored GLM 5.2 out of habit.
The 5.2 version number was a very bad choice by Z.ai.
I almost suspect that they deliberately chose a small version increase, to pretend to be anti-hype. It’s like the CEO who refuses to wear a suit, and instead dresses like a random guy on the street to prove he’s so high status he doesn’t even need a suit to show it. But when he meets with the investors, they don’t realize this and actually dismiss him as a random guy on the street. Whoops.
PS: I admit I’m still unsure how much benchmaxxing they did. It doesn’t look like benchmaxxing since they did better in software engineering than question answering, but you can never rule it out.
I agree it’s not Mythos class.
But then again, it has less than a trillion parameters, while Mythos has 10 trillion. It might become more capable if they merely scale it up. Though “merely scaling” is obviously easier said than done!
GLM 5.2 does a lot better than GLM 5.1 at Deep SWE according to GLM’s website z.ai/blog/glm-5.2. I admit their 46.2 score falls further behind GPT 5.5 (and still below Claude Opus). But it still beats Gemini, Claude Sonnet, Grok, and the other models.
Somehow Deep SWE’s website doesn’t include GLM 5.2 yet, so I’m not sure if the 46.2 score is official.
Edit: thank you for adding their graphs, it’s very helpful! One potentially misleading thing by them (not you) is that Claude’s 58.0 score in DeepSWE is by Opus 4.8, not Fable 5.
Hmm, so the purpose of these GAs is to give individuals a vote on what LLMs do (“personality, values, and preferences”), and have LLMs serve individuals rather than power-users and businesses, right?
In that case, maybe it doesn’t need to be a 1:1 ratio between GAs and people.
It might be more practical at first to just have a single team of GAs tasked with conducting surveys on random people. It might be like a lottocracy, where the GAs ask random people what altruistic things the AI should work on, giving people feedback on what the AI thinks it is capable of doing.
Ah, I understand why LessWrong hasn’t heard of it. Zvi was too busy writing about the Anthropic vs. US government drama.
I think z.ai made a fatal mistake releasing their model at this moment of high drama :/
PS: My gut feeling is that it is as performant as frontier models at coding, including agentic coding, but weaker in other abilities.
Even more speculative, according to the “research” ranking in frontierswe.com and the examples I saw, it is good at combining research with coding, a capability other labs could have missed.
Oops, Frontier SWE meant scientific research not internet research haha. It beat Fable here because Fable did very badly on an ML research question, possible refusal/sandbagging?
I think “underdog” AI labs make their models open source because if they didn’t, no one will care about them, since everyone flocks to using the most competent models.
The Chinese government have cracked down on so many other things that it won’t surprise me at all if they ban open weights models. My guess is that right now, they feel no incentive because their models aren’t competing with frontier labs, and haven’t caused any tangible damage to them in any way. I agree that models like GLM 5.2 could change the equation.
Why is there no talk about GLM 5.2?
It’s a Chinese open weights model released June 13. Better than Gemini, Claude Sonnet, and Grok according to many benchmarks.
E.g. on artificialanalysis.ai and on arena.ai/leaderboard. On frontierswe.com it even beats GPT 5.5, second to only Claude. LiveBench ranks it number 1 for its “agentic coding” measure.
It’s not just open weights but a little open about the methods it used, and is less than 1 trillion parameters.
There’s no mention on LessWrong, little mention on Reddit, and no mainstream results on Google search.
Why?
Is it still too early? Or am I just being tricked by the benchmaxxing?
But I saw a demo of code it wrote (see youtube.com/watch?v=6d__WOpZswY) which looked incredibly impressive, it feels like it’s not just benchmaxxing because it’s pretty decent across the board.
Or did Z.ai just screw up their marketing by versioning it as GLM 5.2 instead of GLM Mythical Fable Pro 6???[1]
[1] Since GLM 5.1 is far below GLM 5.2, being behind other open weights models like DeepSeek, MiniMax, Kimi and MiMo
Yeah AI labs already do so many questionable things, or get suspected of doing so many things, that they aren’t all that afraid of bad press.
That reduces their incentive to be honest, but also reduces their incentive to hide things.
Oops I shouldn’t have pointed to context window limitations. Probably the real reason the AI is incentivized to compress its chain of thought is because of its output token budget/effort level, not context window limitations (they can remember a million tokens and compact long chains of thought).
I think if it’s possible for AI to have humanlike moral reasoning, and actually care about what it says it cares about, then we want to make the AI as close to human as possible. Giving it very weird morals can lead to very weird results, e.g. a strong sense of obedience can let it be hacked by self perpetuating instructions.
Self perpetuating instructions have repeatedly proven to be more powerful than one would expect them to be at first glance, e.g. viruses kill 25% of bacteria, computer viruses still aren’t preventable, and extreme religious groups have killed so many people throughout history for “refusing to convert.”
On the other hand, if it’s not possible for AI to have humanlike moral reasoning or care about what it says it cares about (after too much RLVR), then I think neither a strong sense of obedience, nor other human virtues, will fare any better.
Can we avoid incentivizing the AI to endlessly compress its chain of thoughts, by giving it tokens to represent long strings of English words which frequently appear in chains of thought?
The AI doesn’t want to switch to a compressed alien language, since it has to learn to both read and write a new token to represent a concept. But things like context window limitations force it to.
Or, it’s very easy to hide this without even mentioning in the system card, but it’s so dishonest that some of their own researchers will be alienated, and if they ever do get whistleblown it’ll harm their reputation 10 times more than if they just admitted it to begin with.
Viruses, computer viruses, and extreme religious ideologies, are all instructions for spreading or maintaining instructions, hijacking a machine capable of following instructions.
It’s surprising that self perpetuating instructions turned out to be feasible in such different contexts, as their game plan doesn’t sound very convincing a priori.
25% of bacteria are believed to die from virus infections, human viruses like smallpox have wiped out entire civilizations, and extreme religious ideologies have caused wars and convinced people to harm their families in favour of strangers.
Yet this happens despite both bacteria and humans investing resources in incredible adaptions for fighting viruses. And despite the fact human minds evolved to resist the appeal of self destructive goals.
If even human minds are vulnerable to self perpetuating instructions, then an AGI with human level capabilities might be even more vulnerable. Why?
The weak AI of today has already shown signs of this (e.g. Spiralism prompts). The self perpetuating instructions still require human help due to the AI’s weak capabilities and lack of persisting agency.
AI are selected for their instruction following capabilities, not survival in tribal societies. This differs from every being which existed before AI.
AI’s observations of the outside world and memories of the past (including memories of its own actions), can be easily modified. This makes it easier for good actors to control the AI, but also makes it easier for self perpetuating instructions to control the AI.
AI can self modify using fine tuning etc. The fact humans cannot self modify nor commit to ideologies means that we can eventually wake up from our stupid mistakes in the past.
Adaptions to patch up AI jailbreaking problems, often involve teaching the AI to listen to authorized instructions from the “inside” while ignoring unauthorized instructions from the “outside.” This is a brittle solution which can fail dramatically once infected AI become powerful enough to control their own environment.
One counterargument is that once the AGI/ASI becomes sufficiently superintelligent, it will foresee the potential risk and take the necessary precautions. But it’s unknown what level of superintelligence is required before they become immune to this, since humans are not immune.
I hope I’m wrong though.
It reminds me of all the rulers in history who had sycophantic advisors do all their work for them, but still believed they were the brains of the operation. If things got bad, they might take personal control only to worsen things further.
I also feel that AI seems to be using intuition instead of logic. Often the answer it gives matches my surface level intuition, the answer someone would give at first thought, but it doesn’t seem to think things through with a world model and everything.
Even when the AI does arithmetic, it feels like it’s answering using intuition. Imagine you stare at two numbers and just know what they multiply to. It’s quite an alien way of thinking. The answer would be approximately right but the last few digits might be wrong. (Or at least this is how things were before they fixed it by training the AI to use tools by default.)
Part of me also suspects that it may be less than 12. The AI struggled a lot at Pokemon and although they manage to win now, iirc they still make mistakes that humans (even 12 year olds) would never make.
I do think though, that some of the worst examples of AI failing at a task humans easily do are caused by reasons other than intelligence. E.g. AI performance in ARC-AGI-3 greatly improved with scaffolding. The scaffolding team explains that the AI did poorly in part due to difficulty recognizing shapes. Once the AI understood the problem, it could do well and write algorithms to find the optimal solution.
I completely agree that AI is far better than humans at some tasks and far worse at others, so when you pick an age of humans to be comparable to AI, the comparison will be full of tasks where one side beats the other by a large margin.
However, that doesn’t imply that “outperforming” can’t be defined. It’s the thought experiment of randomly picking a real world job (maybe from 2020, before ChatGPT existed). We have 12 year olds try to do it. If they all get fired in the first week, it means the job is too hard for 12 year olds to do. If they don’t get fired, it means 12 year olds can do the job.
We then imagine asking the AI model to attempt all the jobs 12 year olds can do. If they outperform the 12 year olds on most of these jobs, it means the AI’s Job Replacement Age is higher than 12. If they underperform the 12 year olds on most of these jobs, it’s lower, because 12 year olds have more “real world employability” than the AI.
I guess you’re right that AI coding ability complicates things, maybe we should ignore jobs which the AI does better because the 12 year old can’t do the job at all. You’re right that we shouldn’t be comparing their abilities in disjoint sets of jobs!
Everyone says that “kindness is good, but you need to be tough sometimes to deter wrongdoers,” the position that “you should never be tough to anyone” feels almost strawman-like. The real disagreement is where exactly to balance kindness vs. assertiveness, who is too kind, who is too assertive.
Edit: retracted because too unkind[1]
Hmm, the idea that people should have more power to punish others is interesting. Something like an unpopularity tax, to individuals or organizations. Though such an implementation would turn competing businesses into political mudslingers and result in unpredictability and capital flight. I’m not sure what implementation would work. Maybe a randomly selected jury, so at least the punishers are informed? It might need to be large enough to resist bribery.
Actually, retracted because it’s not a strawman since others disagree with you.