I think that in the current political climate, the key to mitigating race dynamics is cross-lab collaboration (with safety collaboration being the goal).
To do this, you need to create as many interaction points as possible, direct and indirect, between the labs, and then steer so that safety collaboration becomes beneficial. Use general incentives as levers and tie them to safety purposes.
Think of the various intracellular pathways that initially evolve independently, with outcomes that compete with or support each other. Because their components may interact, and because their outcomes affect each other, they end up regulating each other. Over time, the signaling network optimizes towards a shared outcome.
Note: You can only steer if you are also involved in or enable these interactions. “You” could be a government, a UN function, a safety group, etc.
Strategy: Create more and better opportunities for cross-lab safety work.
This already happens in the open with the publishing of papers; platforms that promote safety papers serve this role.
Guest aligners(!) and shared alignment standards would be two other ways. Sharing anonymized safety data between labs is a third. I suspect the last one is the most doable at the moment.
Example: As a third party (TP), build and operate a safety data submission mechanism that integrates into a shared safety standard.
>> Each lab can submit standardized data to the TP, and in return they get aggregated, anonymized and standardized safety data: a joint report. They see overall risk levels rise and drop across the group over time. They may see incident frequencies. They may see data on collective threats and bottlenecks.
This data is incredibly valuable to all players.
They can now make better predictions and understand timelines better. They can also agree (under TP mediation) on set danger levels indicating that they should slow down and/or release more data, etc. This enables coordination without direct collaboration.
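To make the mechanism concrete, here is a minimal sketch of what the TP aggregation could look like, assuming a deliberately simple standardized submission format. All field names, the 0–1 risk score, and the danger threshold are hypothetical illustrations, not an existing standard:

```python
# Minimal sketch of the TP aggregation idea. Every field and threshold
# below is an invented placeholder, not a real shared safety standard.
from dataclasses import dataclass
from statistics import mean

@dataclass
class SafetySubmission:
    lab_id: str            # pseudonymized by the TP before aggregation
    period: str            # e.g. "2025-Q3"
    incident_count: int    # incidents as defined by the shared standard
    risk_score: float      # lab's self-assessed risk level, 0.0-1.0

def joint_report(submissions: list[SafetySubmission], danger_level: float = 0.7) -> dict:
    """Aggregate lab submissions into an anonymized, group-level report."""
    by_period: dict[str, list[SafetySubmission]] = {}
    for s in submissions:
        by_period.setdefault(s.period, []).append(s)

    report = {}
    for period, subs in sorted(by_period.items()):
        avg_risk = mean(s.risk_score for s in subs)
        report[period] = {
            "labs_reporting": len(subs),
            "total_incidents": sum(s.incident_count for s in subs),
            "average_risk": round(avg_risk, 2),
            # the pre-agreed trigger for "slow down and/or share more data"
            "danger_threshold_crossed": avg_risk >= danger_level,
        }
    return report

report = joint_report([
    SafetySubmission("lab-a", "2025-Q3", incident_count=3, risk_score=0.6),
    SafetySubmission("lab-b", "2025-Q3", incident_count=1, risk_score=0.8),
])
# {'2025-Q3': {'labs_reporting': 2, 'total_incidents': 4,
#              'average_risk': 0.7, 'danger_threshold_crossed': True}}
```

The design point is that each lab only ever sees the aggregated view, never another lab's raw submissions, which is what makes participation tolerable under race dynamics.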
My interest in contributing to AI safety lies in working on strategic coordination problems, with race dynamics being the main challenge right now.
I work daily with international stakeholder management and coordination inside a major corporation. I always have to consider local legislation and culture, company policy, and real corporate politics. (Oh yes, and the business strategy!) This provides valuable insight into how coordination bottlenecks arise and how they can be overcome.
Zvi recently shared news about the impressive open model Kimi K2 Thinking. It is currently near the top of the rankings across the leaderboards (you can find several leaderboards via my monitoring page).
I have briefly tried it and probed it with my own hard tests on metacognition, ontological coherence, and moral epistemics. On these edge cases, where frontier LLMs fall short of humans, it performs on par with Claude 4.5, Gemini 2.5, and GPT-5 and GPT-5 mini. It also self-reports its limitations more accurately than Claude, and perhaps also better than GPT-4 and GPT-5 mini (not enough testing to say).
Strategic effort can be viewed through an epistemic lens (seeking truth) or a coordination lens (amassing consensus and directing resources). This is also true for AI safety and x-risk.
However, this binary viewpoint misses the point of why we discuss strategy in the first place. The goal is not simply to be right, rationally or politically. The goal is to win.
The art of strategy is the art of winning.
Of course, the objection that comes to mind: What does “winning” even mean in ASI safety? Isn’t that what we are trying to figure out here?
We must separate what from how.
When deciding the strategic win condition for x-risk, there is no need to overcomplicate it. Winning here means survival.
Survival famously requires adaptability. In terms of epistemics and coordination: We need epistemics fast enough to detect failure early, and coordination tight enough to pivot instantly.
Survival also generally requires resilience to failure modes. This is achieved by taking actions that generate defense in depth and slack.
Slack is key to resilience. A 100% efficient system has no capacity to absorb shocks, and a train at full speed takes far longer to stop than a slow one.
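To put a rough number on the train analogy (a back-of-the-envelope illustration, assuming constant deceleration a):

```latex
d = \frac{v^2}{2a}
\quad\Longrightarrow\quad
\frac{d_{\text{fast}}}{d_{\text{slow}}} = \left(\frac{v_{\text{fast}}}{v_{\text{slow}}}\right)^2
```

Doubling the speed roughly quadruples the stopping distance, which is the quantitative version of why a system running at full tilt has so little slack left to absorb a surprise.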
As the military proverb goes: No plan survives contact with the enemy.
The AI bubble is expanding very fast. E.g., Thinking Machines, founded by ex-OpenAI CTO Mira Murati and her roughly four dozen employees, is now nearing funding at 1 billion USD per employee.
At this rate, it’s unlikely the bubble will last until reality (possibly) catches up. This raises important questions:
Would an AI bubble burst be a good opportunity to double down on safety and reset AI budgeting?
Will the US economy survive a burst?
How does this affect China’s strategy?
Should safety people secure funding now?
OpenAI’s disclosure of scheming in LLMs is of course a big red flag, since it shows the model can, in theory, enable an agent to strategize deceptively should it develop goals.
But it is important to realize that this is also a major red flag because any agent running on the model can now hide having goals in the first place.
OpenAI addresses that a goal can be subverted and kept hidden by scheming agents. But this presumes a task is already set and subsequently subverted.
Emergent intent/persistent goal-seeking is the biggest red flag there is. It should be impossible inside LLMs themselves, and also impossible in any agentic process-instance run continuously without significant scaffolding. The ability to hide intent before we could detect it is therefore catastrophic. This point is somewhat non-obvious.
It’s not just that IF goals emerge the agent may mislead us, but that goals may emerge without us noticing at all.
Two core beliefs about AI to question
1)
Mainstream belief: Rational AI agents (situationally aware, optimizing their decisions, etc.) are superior problem solvers, especially if they can logically justify their reasoning.
Alternative possibility: Intuition, abstraction and polymathic guessing will outperform rational agents in achieving competing problem-solving outcomes. Holistic reasoning at scale will force-solve problems intractable for much more formal agents, or at least outcompete them on speed and complexity.
2)
Mainstream belief: Non-sentient machines will eventually recursively self-improve to steer towards their goals.
Alternative possibility (much lower but non-trivial p): Self-improvement coupled with goal persistence requires sophisticated, conscious self-awareness. No persistent self-improvement drive is likely to occur in non-sentient machines.
/
The mainstream beliefs are well informed by what we see in LLMs, but I think many AI people take them for granted at this point. Turning a belief into a premise is dangerous. This is uncharted territory, and after all, most humans-in-the-past appear silly and ignorant most of the time to most humans-in-the-future.
Metacognition and epistemic intelligence are undertested in current LLMs. Deployed LLMs lack causal experience, embodiment and continuous awareness. People forget this. Too much focus is still put on analytical capability benchmarks in math and coding versus causal analysis and practical judgement metrics. This worries me. We are not scaling predictive AIs to be used as calculators.
Some undertested but important areas are starting to gain testing traction: psychology, social intelligence, long-term planning, etc. For LLMs, understanding these fields is all about textual correlation; LLMs cannot verify their knowledge in these fields on their own.
We have always needed to study LLMs’ metacognition as they scale and change. We are not doing enough of it.
Deployed AI models are not perfectly isolated systems, any more than humans are. We interact with external information and with the environment we are in. The initial architecture determines how well we can interact with the world and develop inside it. This shapes our initial development.
(Every rationalist learns this at some point, but it is not always integrated.)
For example, human retinas are extremely good at absorbing visual data, and so much of our cognition grows centered around this data. If we build a worldview based on this data, that ontology is path-dependent.
Bear with me as I elaborate.
This dependence comes from the retina’s structure, which is an example of what I would call meta-information, in contrast to the stored information that ends up inside the neocortex.
Meta-information always influences information processing (computation), but it is easily forgotten. It is any external information needed to process information. It is usually the architecture that enables the agent’s relative independence.
Example: cells cannot divide on their own. The DNA provides necessary information, but it is not enough. For one thing, DNA does not contain a blueprint of itself inside itself; that is impossible.
Cells also require the right medium, the right gradients of the right minerals and organic components, to grow and divide. If the medium is wrong, the cell cannot trigger division. The right medium’s content is meta-information to the cell: necessary, external information.
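A toy sketch of the same idea in code (purely illustrative; the names and thresholds are made up): the stored information is identical in both calls, yet the outcome differs because the environment supplies, or withholds, the external information needed to act on it.

```python
# Toy illustration of meta-information: the same internal information
# (the "genome") only leads to division when the external medium supplies
# the right conditions. All names and thresholds are invented for illustration.

def can_divide(genome: str, medium: dict[str, float]) -> bool:
    has_machinery = "division_genes" in genome              # internal, stored information
    nutrients_ok = medium.get("glucose", 0.0) > 0.1         # external (meta-)information
    signals_ok = medium.get("growth_factor", 0.0) > 0.05    # external (meta-)information
    return has_machinery and nutrients_ok and signals_ok

genome = "...division_genes..."
print(can_divide(genome, {"glucose": 0.5, "growth_factor": 0.2}))  # True: right medium
print(can_divide(genome, {"glucose": 0.0, "growth_factor": 0.2}))  # False: same genome, wrong medium
```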
All agents rely on the world at large, of course. We are a part of this world after all, not black boxes floating around.
There are many salient points building on these fundamental insights, but for now I just want to focus on the point from the very beginning: the design that allows an agent to interact with the world shapes what follows. It is path-dependent.
Eventually overriding your path is not easy. For an LLM, overcoming its training data (extrapolating beyond it) is one thing, but overcoming what its development shaped it to be is even harder.
As humans, we can update our worldview, but overcoming the cognitive biases that come from our path-dependent development and our intrinsic cognitive architecture is much harder.