Memory in AI agents can pose a large security risk. Without memory, it’s easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide “a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network’s weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.”
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents are constantly learning and can develop motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
It seems to me that the gap between US and Chinese models is < 2 months (when you don’t count Mythos)
Kimi K2.6 was released in April 2026 while Opus 4.6 was released in February 2026, and according to https://artificialanalysis.ai, Kimi K2.6 is more capable (54 > 53). Kimi K2.6 is better on SciCode (54% > 52%) while Opus is better on Terminal-Bench Hard (46% > 44%).
Plus, Kimi is 5x cheaper and has 3x the throughput (but a 4x smaller context window).
Consider both the time gap between a model being finished and released, and that benchmarks aren’t really capturing the whole picture at this capability level anymore.
Anthropic already had Mythos internally in February, whereas 2.6 was likely released a couple of weeks at most after it finished. I think this alone puts the true gap at ~6 months, assuming Kimi catches up to the Mythos benchmark level by about the end of the year, which seems plausible with a “Kimi 3”. The best publicly released model matters, but from a “recursive self-improvement USA vs China race” perspective, the number that truly matters is the best internally available model.
It’s also important to note that actually using Kimi vs Opus in an agentic harness reveals a massive, noticeable gap. It’s unfortunate that there are no hard metrics for this, so we have to go purely off vibes, but I’m pretty confident that if you ran a double-blind study with Opus and Kimi in the same coding harness, people would strongly prefer Opus despite the benchmark scores implying that they should be ~equal.
Reading resources for independent Technical AI Safety researchers upskilling to apply for roles:
GabeM—Leveling up in AI Safety Research
EA—Technical AI Safety
Michael Aird: Write down Theory of Change
Marius Hobbhahn—Advice for Independent Research
Rohin Shah—Advice for AI Alignment Researchers
gw—Working in Technical AI Safety
Richard Ngo—AGI Safety Career Advice
rmoehn—Be careful of failure modes
Bilal Chughtai—Working at a frontier lab
Upgradeable—Career Planning
Neel Nanda—Improving Research Process
Neel Nanda—Writing a Good Paper
Ethan Perez—Tips for Empirical Alignment Research
Ethan Perez—Empirical Research Workflows
Gabe M—ML Research Advice
Lewis Hommend—AI Safety PhD advice
Adam Gleave—AI Safety PhD advice
Application and upskilling resources:
Job Board
Events and Training
Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful means of doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for two core reasons:
Benchmarks may not represent real-world ability
Benchmark information can be leaked into AI model training
However, we can have a moving benchmark: reproducing the most recent technical alignment research using the papers’ methods and data.
By reproducing a paper, something many researchers do, we ensure the benchmark measures real-world ability.
By only using papers published after a given model finished training, we ensure benchmark data was not leaked into its training set.
This comes with the inconvenience of making it difficult to compare models or approaches across two papers, because the benchmark is always changing, but I argue this is a worthwhile tradeoff:
We can still compare approaches if the same papers are used for both, ensuring the papers were produced after all models finished training
Benchmark data is always recent and relevant
Any thoughts?
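For concreteness, the paper-selection rule could be sketched as below; the paper titles, model names, and dates are all made up:

```python
from datetime import date

def eligible_papers(papers, training_cutoffs):
    """Keep only papers published after every model's training cutoff,
    so none of them can have leaked into any model's training data."""
    latest_cutoff = max(training_cutoffs.values())
    return [p for p in papers if p["published"] > latest_cutoff]

# Hypothetical benchmark candidates and model training cutoffs.
papers = [
    {"title": "Old alignment paper", "published": date(2024, 1, 15)},
    {"title": "Fresh alignment paper", "published": date(2024, 9, 1)},
]
cutoffs = {"model_a": date(2024, 3, 1), "model_b": date(2024, 6, 1)}

benchmark = eligible_papers(papers, cutoffs)
# Only the paper published after both cutoffs survives.
```

Taking the max over all cutoffs is what makes cross-model comparison fair: every paper in the benchmark postdates every model’s training data.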
One of the most robust benchmarks of generalized ability, and one that is extremely easy to update (unlike static benchmarks such as Humanity’s Last Exam), would just be to estimate the pretraining loss (i.e., the compression ratio).
there’s this https://github.com/Jellyfish042/uncheatable_eval
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up and extended with approximate estimators for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I’m not sure it would scale to reasoning models, because test-time scaling is now a large factor in the intelligence LLMs exhibit.
I agree we should see a continued compression benchmark.
I believe we are doomed by superintelligence, but I’m not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned), where S_n is the next level of superintelligence. This probability is less than 1.
As long as misalignment keeps compounding and superintelligence iterates on itself exponentially fast, we are bound to get a misaligned superintelligence. Misalignment could decrease due to generalization, but we have no way of knowing whether that is the case, and it is optimistic to assume so.
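As a toy illustration of the compounding (the per-step probability is purely hypothetical):

```python
# If each self-improvement step preserves alignment only with
# probability p < 1, the chance that every step stays aligned
# decays geometrically: P(all n steps aligned) = p ** n.
p_step = 0.99  # hypothetical per-generation alignment probability
for n in (10, 100, 500):
    print(n, p_step ** n)
```

Even a seemingly reassuring 99% per step collapses to near-zero over enough iterations, which is the core of the argument above.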
The misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or a wish to die with dignity, but that misses the point.
We only live in the present, not the past or the future. We know the future does not look good, but that doesn’t matter, because we live in the present, and we constantly work on making the present better by expected value.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But, we still model the best decision to make.
We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find ‘k’ representative viewpoints.
These ‘k’ vectors could then serve as a more representative proxy for humanity’s values when we evaluate AI alignment. Thoughts? Potential issues?
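As a sketch of the clustering step, here is a minimal k-means over made-up 2-D “viewpoint embeddings”; real entries would be vectorized with an actual text embedder:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: returns k centroid vectors and a
    cluster label for each data point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical embeddings: two loose groups of "viewpoints".
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, labels = kmeans(X, k=2)
# centers now holds the 2 representative viewpoint vectors.
```

The returned centroids are the ‘k’ representative vectors the proposal would feed into alignment evaluation; choosing k is itself a value-laden decision worth flagging.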
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but use them optimally?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of “satisfaction” across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
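Under the strong (and here explicitly assumed) condition that the metrics’ errors are independent, the combined false positive rate of requiring every metric to pass is just the product of the individual rates; correlated metrics would make the true rate worse than this:

```python
from math import prod

def combined_false_positive_rate(fprs):
    """P(an unaligned system passes every metric), assuming the
    metrics' failure modes are statistically independent."""
    return prod(fprs)

# Hypothetical per-metric false positive rates.
fprs = {"metric_a": 0.10, "metric_b": 0.20, "metric_c": 0.05}
overall = combined_false_positive_rate(fprs.values())  # ~0.001
```

This is why modeling the non-overlapping facets matters: if two metrics really test the same facet, their errors correlate and the product badly underestimates the overall false positive rate.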
OpenAI’s Superalignment aimed for a human-level automated alignment engineer through scalable training, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
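The refinement loop might be sketched as below, where `build`, `evaluate`, and `revise` are hypothetical stand-ins for agent construction, the success/safety evaluations, and the revision step:

```python
def refine_until_aligned(config, build, evaluate, revise, max_iters=20):
    """Iteratively revise an agent configuration (prompts, tools,
    group-chat structure) until it passes both evaluations."""
    for _ in range(max_iters):
        agent = build(config)
        report = evaluate(agent)
        if report["success"] and report["safe"]:
            return agent, report
        config = revise(config, report)
    raise RuntimeError("no passing configuration within budget")

# Toy demo: "config" is just an int; the agent passes once config >= 3.
agent, report = refine_until_aligned(
    config=0,
    build=lambda c: c,
    evaluate=lambda a: {"success": a >= 3, "safe": True},
    revise=lambda c, r: c + 1,
)
```

In the real system, `revise` would be the LLM-driven step that rewrites prompts and tool sets based on the evaluation report, and `evaluate` would include the safety checks described below.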
This is tractable with today’s LLM reasoning, evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, focusing on this intermediate step leverages current LLM strengths for agent orchestration. This system could then automate many parts of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can be largely ensured by operating in internet-disconnected environments (except for fetching research) with subsequent human verification of agent alignment and capability.
Examples: This Engineer could create agents that develop scalable training methods or generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.