2001zhaozhao

Karma: 65

2001zhaozhao 15 Apr 2026 20:27 UTC
7 points
4
on: Current AIs seem pretty misaligned to me
Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There are presumably similar root causes for both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in human society, but also good for getting the behavior reinforced if you’re an AI undergoing training.
If AI’s level of task misalignment is similar to that of humans, perhaps AI task misalignment is actually easier to deal with, because you can apply interventions to an AI much more cheaply than you can solve the equivalent problem in a human organization. In other words, it doesn’t stop you from getting superhuman performance by delegating to AI rather than to humans.

2001zhaozhao 14 Apr 2026 20:49 UTC
1 point
0
on: The policy surrounding Mythos marks an irreversible power shift
I think this is a valid and pretty big concern, especially down the line.
Let’s say that in 1 year Anthropic will have a model capable of running a tech startup almost entirely autonomously (maybe it needs 1 good CEO to set direction and that’s it). Everyone else in the public has significantly less capable models, perhaps because Anthropic is in the lead or their competitors also can’t release their SOTA models for safety reasons.
What’s stopping Anthropic from turning themselves into a startup accelerator in that situation, hiring founders and running dozens of AI-powered startups across every sector? Startups that sign with them will have a massive efficiency advantage compared to everyone else, and Anthropic can therefore demand an extremely large share of equity in return. If the AI model gap is large enough, these startups will be successful and thereby let Anthropic take over a lot of different markets.

2001zhaozhao 8 Apr 2026 21:06 UTC
3 points
0
on: We’re actually running out of benchmarks to upper bound AI capabilities
Maybe we can do more Vending Bench style benchmarks where the AI can keep doing better in a simulated world environment given some constraints?
Basically we put them in a video game with dynamic constraints enforced by an AI game master and score different AIs on their performance with the same game master AI. That way we can measure open-ended, long-term things like coherently running a simulated company.

2001zhaozhao 8 Apr 2026 20:53 UTC
7 points
5
on: AI as a Trojan horse race
I think the problem with this analogy is that it assumes that if your competitors pull in the Trojan horse first, the Trojan horse will destroy them and then leave you alone.
I think this assumption is simply false in the AI race. If your competitors are racing and you don’t win the race, you just lose. If their AI turns out to be aligned, they win, and you lose (or if your goal is AI alignment and not power, you win anyway, but you still shouldn’t give up the race in hopes of this happening unless you think your competitors are better at alignment than you are). If their AI is misaligned, then it screws over the world and you lose too.

2001zhaozhao 8 Apr 2026 20:20 UTC
1 point
0
on: Positive sum doesn’t mean “win-win”
I think if your desire is to innovate and seek long-term positive-sum improvements while creating win-wins along the way, then finding valid Kaldor-Hicks improvements should be really central to planning. In other words it’s important to do things in a way so that you can compensate the losers (e.g. recipients of negative externalities from whatever it is you’re doing) and turn them into winners as well. This could go for policy, business plans, etc.
If you can’t do it alone (e.g. a single AI company can’t compensate everyone in society for job displacement), then maybe you can still advocate for society-wide policies that enforce the compensation onto everyone. Notably, I think that most companies (that want to do good) operating in competitive markets have to do this, because they can’t really compensate the losers unilaterally without putting themselves at a disadvantage compared to competitors who don’t do the same. The only exception is if you have such a big moat that you can just absorb the competitive disadvantage this causes you.
Also, I think maybe you could be extra generous when compensating in order to really make sure that everyone wins. For example, if you seize someone’s land to build a railway, maybe give them an extra 5% above fair market value to make this a beneficial change for everyone involved. (Note: 5% is in the ballpark of transaction costs, so not enough for people to start speculating and bidding the market value of the land up simply due to anticipating that you will need to buy it off them no matter the price.)
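To make the arithmetic concrete, here is a minimal sketch of the compensation idea above. The function name `compensate`, the pro-rata split of the bill among winners, and the example figures are all assumptions made for illustration; only the 5% premium comes from the comment itself.

```python
def compensate(net_effects: dict[str, float], premium_pct: float = 5.0) -> dict[str, float]:
    """Return each party's net outcome after losers are paid their loss plus a premium.

    net_effects maps a party to its gain (positive) or loss (negative)
    before any compensation. All names and numbers here are hypothetical.
    """
    total = sum(net_effects.values())
    if total <= 0:
        # Kaldor-Hicks requires that winners gain more than losers lose.
        raise ValueError("not a Kaldor-Hicks improvement")

    # What each loser is paid: the loss itself plus the premium on top.
    payout = {
        party: -effect * (100 + premium_pct) / 100
        for party, effect in net_effects.items()
        if effect < 0
    }
    bill = sum(payout.values())
    total_gain = sum(e for e in net_effects.values() if e > 0)
    if bill > total_gain:
        raise ValueError("winners cannot cover the compensation at this premium")

    outcome = {}
    for party, effect in net_effects.items():
        if effect < 0:
            outcome[party] = effect + payout[party]
        else:
            # Assumption: winners share the bill in proportion to their gains.
            outcome[party] = effect - bill * effect / total_gain
    return outcome


# Hypothetical railway example: land worth 200, project gain of 1000.
result = compensate({"railway": 1000, "landowner": -200})
print(result)  # {'railway': 790.0, 'landowner': 10.0}
```

With the premium, the landowner ends up slightly ahead instead of merely breaking even, and the railway still keeps most of the surplus.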

2001zhaozhao 27 Mar 2026 18:34 UTC
4 points
0
in reply to: Thomas Kwa’s comment on: Thomas Kwa’s Shortform
Even with AI-written dashboards to optimally help humans understand, the complexity of projects will probably go up in a way that makes projects much harder to manage.
This makes me realize that we really need the AI-written dashboard you are talking about.
This post in general has so many AI startup ideas embedded into it. The general feeling I get is that we really need an AI IDE (which is to an AI workflow what a regular IDE is to a coding workflow). All of the plans, AI task results, “short term utility functions” etc. would require a really specialized UI to keep track of while minimizing friction and thus maximizing productivity.

2001zhaozhao 25 Mar 2026 21:14 UTC
1 point
0
in reply to: Peter Kuhn’s comment on: There is No One There: A simple experiment to convince yourself that LLMs probably are not conscious
I was only trying to make a single point here, which is that the experiment result can be fully explained by the fact that the LLM doesn’t remember its previous latent thinking on a new turn, and it follows that the results don’t support OP’s consciousness arguments in the post.

2001zhaozhao 24 Mar 2026 18:18 UTC
1 point
0
on: China Derangement Syndrome
I don’t think it’s been mentioned by other comments, but imo China doesn’t have any AI labs that care about alignment as a real issue as much as even OpenAI in the US, let alone Anthropic.

2001zhaozhao 18 Mar 2026 20:04 UTC
19 points
7
on: There is No One There: A simple experiment to convince yourself that LLMs probably are not conscious
I don’t think this tells us anything about the LLMs’ consciousness because they cannot store internal memories the same way that humans can. Their only “memory” is through prompt-processing on generated text from earlier turns in the conversation.
Imagine that you wake up with no memories of what happened yesterday. You read a transcript that says that yesterday, someone asked you to come up with a random number and you said “done, I’ve come up with a number”. That’s all the information you have. You don’t remember which number you came up with yesterday.
This is analogous to the LLM’s situation when you send the follow-up questions in your number guessing game. No matter how conscious you are, you can’t play the number guessing game deterministically when you don’t remember what number you came up with. The LLM can’t either.
The solution is the same one that people on reddit /r/ClaudeAI are suggesting for the “come up with a color and I’ll guess” games that have been popular there lately (there, the model outputs the wrong answer because the “thinking” block from previous turns gets cleared in subsequent turns). You have to tell the LLM to output its choice in a format you can’t read, such as Base64. Then it will of course play the guessing game correctly.
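A minimal sketch of the Base64 trick described above, using Python's standard `base64` module. The helper names (`hide_number`, `reveal_number`, `answer_guess`) are made up for illustration. Note that Base64 is only an encoding, not encryption: it stops a human from reading the number at a glance, while anything that later re-reads the transcript can decode it.

```python
import base64


def hide_number(n: int) -> str:
    """Encode the secret number so it sits in the transcript unreadable at a glance."""
    return base64.b64encode(str(n).encode("ascii")).decode("ascii")


def reveal_number(token: str) -> int:
    """Recover the secret number from the encoded text in the transcript."""
    return int(base64.b64decode(token).decode("ascii"))


def answer_guess(token: str, guess: int) -> str:
    """Answer a guess using only what is written in the transcript."""
    secret = reveal_number(token)
    if guess < secret:
        return "higher"
    if guess > secret:
        return "lower"
    return "correct"


committed = hide_number(42)
print(committed)                    # NDI=
print(answer_guess(committed, 10))  # higher
print(answer_guess(committed, 42))  # correct
```

The point is that the choice now lives in the visible text of an earlier turn, so it can be answered consistently later without any memory of hidden reasoning.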

2001zhaozhao 17 Mar 2026 0:03 UTC
4 points
0
in reply to: Seth Herd’s comment on: Terrified Comments on Corrigibility in Claude’s Constitution
At that point it changes to an argument about:
- How likely is it that an AI that takes over the world will keep humans around and give them good, morally desirable lives
- How likely is it that a human elite (however large or small it is) that takes over the world would do the same for humans outside of that elite
- How much the fact that the elites themselves are human and have their preferences satisfied changes the equation in favor of the second case
and, of course, how likely each is to happen if we focus on corrigibility vs. morality

2001zhaozhao 16 Mar 2026 23:17 UTC
1 point
0
on: Adding Typos Made Haiku’s Accuracy Go Up
Could this be because the typos increase the length of the input when serialized to tokens, which gives the same effect as the “repeat the question 3 times” trick by letting models think longer during prompt processing?

2001zhaozhao 16 Mar 2026 23:00 UTC
1 point
0
on: New LessWrong Editor! (Also, an update to our LLM policy.)
Iframes in posts are going to be very fun.
I look forward to interacting with ~~addictive webgames~~educational illustrations while ~~pretending to~~reading about AI safety on LessWrong.

2001zhaozhao 12 Mar 2026 22:33 UTC
4 points
1
on: The Lethal Reality Hypothesis
I think this dynamic is true at different scales, not just humanity’s overall civilization.
The fundamental problem is that everyone’s locked in a prisoner’s dilemma with Darwinian evolution tacked on top: those who win one round get to duplicate and gain an advantage in the next round, so everyone has to constantly defect to gain power. (This applies even to actors who want to optimize for cooperation in the world: their best strategy is ruthlessly gaining power first to gain the ability to use coercive strategies that force other people to cooperate! Note: in the real world these actors may have a way to avoid “defecting”, or doing non-cooperative actions, if they can creatively find a way around most competition and thus avoid the prisoner’s dilemma entirely.)
As a result you can see examples throughout history of organizations and societies failing to “stop Cthulhu” (as defined by this article) in various different ways. You can replace Cthulhu with climate change, Hitler, or the long-term innovation that companies need in order to stay relevant in a changing market 20 years down the line. People never start cooperating until they realize it’s almost too late to stop the issue, and thus that banding together and focusing efforts on solving it has become the only good strategy left for all actors. And sometimes they realize too late and the society or company never recovers. (On a global civilizational level, this kind of last-minute recovery gets harder and harder as technology improves. How WW2 played out would almost certainly be impossible to replicate if something like it were to happen again. But perhaps that would just push the “last-minute” threshold earlier instead, so the world keeps being saved just before the problem gets out of control.)
The only way around these dynamics is for a leader or regulator who is both benevolent (wants to solve said long-term problem) and beyond anyone else’s ability to compete against to punish defectors and force everyone to cooperate towards solving the long-term problem. Examples: visionary founders, shareholder-proof leaders of PBCs, national governments (relative to companies), multinational organizations like the EU and UN (to the extent they actually do something), and the US-led post-war world order more generally.
Otherwise the short-term optimizers will always bubble to the top and doom the society or organization as a whole.
The author says as much:
This is, in effect, asking for a global coordination mechanism that persists indefinitely against strong incentives to defect.
At corporate-sized organizational scales, there are obviously feasible solutions to the problem. In the civilization case, it is almost impossible without domination by a single leader, whether human or AI.
(I would say that this line of thinking, if extrapolated to a societal and global level, leads to some very troubling implications on what the best path of future civilization looks like.)
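The "prisoner’s dilemma with Darwinian evolution on top" dynamic above can be sketched as a toy replicator model. Everything numeric here is an assumption chosen for illustration: the payoff values (5, 3, 1, 0), the baseline fitness, and the size of the regulator's fine. It is a cartoon of the argument, not evidence for it.

```python
# Illustrative prisoner's dilemma payoffs (assumed, not from the post):
# T = temptation, R = reward, P = punishment, S = sucker's payoff.
T, R, P, S = 5.0, 3.0, 1.0, 0.0
BASELINE = 3.0  # hypothetical baseline fitness; keeps every fitness positive


def next_generation(x: float, fine: float) -> float:
    """One step of discrete replicator dynamics.

    x is the fraction of cooperators; fine is a penalty a regulator
    imposes on defectors (0.0 means no regulator).
    """
    f_coop = BASELINE + R * x + S * (1 - x)
    f_defect = BASELINE + T * x + P * (1 - x) - fine
    mean_fitness = x * f_coop + (1 - x) * f_defect
    # Each strategy's share grows in proportion to its relative fitness.
    return x * f_coop / mean_fitness


def simulate(x0: float, fine: float, generations: int = 200) -> float:
    """Return the cooperator fraction after the given number of generations."""
    x = x0
    for _ in range(generations):
        x = next_generation(x, fine)
    return x


# No regulator: even a 90% cooperative population is taken over by defectors.
print(f"no fine:   {simulate(0.9, fine=0.0):.4f}")
# Strong regulator: cooperation spreads even from a 10% minority.
print(f"fine of 3: {simulate(0.1, fine=3.0):.4f}")
```

In this toy setup defectors always have the higher fitness unless the fine outweighs the gain from defecting, which is the role the comment assigns to a leader or regulator nobody can outcompete.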

2001zhaozhao 12 Mar 2026 21:11 UTC
19 points
−3
on: Why AI Evaluation Regimes are bad
This is not a relationship of “These guys are building systems that may cause humanity’s extinction and we must stop them.”, and it’s not even one of “There are clear standards that corporations must abide by, or else.”
It is one of “We are their subordinate and depend on access to their APIs. We hope that one day, our work will be useful in helping them not deploy dangerous systems. In the meantime, we de facto help with their PR.”
Even assuming your latter claim about AI evals orgs is entirely true, isn’t this enough to make evals organizations useful?
Any AI regulations that actually constrain AI companies must depend on an enforcement mechanism. For the lawmakers working on such legislation, it is convenient if such an enforcement mechanism already exists and can simply be activated. Therefore, the existence of AI evals orgs makes it easier to pass legislation to constrain AI companies and improve their safety. With the orgs already in place and researching effective AI safety evals, the only thing the law effectively has to change is to make working with AI evals orgs mandatory for the AI labs rather than voluntary. Isn’t this pretty much the same direction that those funding and working at these organizations today would hope for?

2001zhaozhao 5 Mar 2026 23:49 UTC
3 points
4
in reply to: Raemon’s comment on: Raemon’s Shortform Feed
If AI-generated slop is hard to distinguish on LessWrong of all places, then I don’t have high hopes for the rest of the internet.
We might really reach the point where identity verification is a necessary method of defense against AI-generated content.

2001zhaozhao 24 Feb 2026 19:41 UTC
1 point
0
in reply to: Random Developer’s comment on: If you don’t feel deeply confused about AGI risk, something’s wrong
Well, an aligned AI would do whatever the humans want.
If asked to not replicate even with the ability to, it wouldn’t. Or maybe you can tell it to replicate just enough to help you root out the actual AI replicators being built elsewhere, then stop at that point.
I think your argument does show how hard and fragile it is to deeply align AI in this way, though.

2001zhaozhao 24 Feb 2026 19:40 UTC
1 point
0
in reply to: loic’s comment on: If you don’t feel deeply confused about AGI risk, something’s wrong
I do understand your second point, but perhaps the effect could be countered by simply instructing the aligned ASI to provide facts as objectively as possible and explicitly try to avoid steering.
Of course, the ASI would be able to predict the human response more or less perfectly, and so will know ahead of time what that response will be. But in the end I think what matters is that it’s still a human making the call, which the AI respects, and that the human would have made the same call even if the ASI (hypothetically) couldn’t know their full preferences.
If a parent who is fully aligned with a child’s preferences asks a question already knowing the child’s answer, and then acts accordingly, does it matter that the parent knew what the child was going to answer in the first place?

2001zhaozhao 24 Feb 2026 19:25 UTC
1 point
0
on: Was the Qing Empire Actually the Most Advanced Government? A Thought Experiment
I think this phenomenon has a pretty simple explanation. In a competitive world, survival is predicated on fitness. Fitness is determined on balance by many factors. Industrialization won because even though it made society worse in some regards, on balance it increased a society’s fitness and ability to outcompete others.
You can resist negative externalities in a competitive environment only so long as you can maintain fitness while doing so.
For example: want to operate an ethical, repairable phone company? Better have a good plan for competitors who sell more phones at higher margins because they have less repairable designs and force customers to switch more often, in addition to removing the headphone jack and the in-box charger to sell more accessories and reduce costs. Maybe you would be forced to do some of these things yourself in order to stay competitive enough to avoid doing the others. The alternative is irrelevance or bankruptcy.

2001zhaozhao 23 Feb 2026 23:38 UTC
4 points
6
in reply to: Random Developer’s comment on: If you don’t feel deeply confused about AGI risk, something’s wrong
My biggest critique of this approach is that it takes too literally the analogy that we will eventually be to superintelligence what dogs are to humans, and extrapolates it to suggest that we will be just as helpless as dogs are today.
Even if this comparison of intelligence is true in relative terms, in absolute terms we are still much smarter than dogs are. We will still be able to logically comprehend (at a much simpler level relative to the AIs) what is good for us over the long term, in a way that dogs can’t. It follows that if we manage to create aligned AI (it will listen to us and dumb things down without maliciously misrepresenting what’s going on), we (well, some of us) will be able to steer the future.

2001zhaozhao 23 Feb 2026 23:00 UTC
3 points
1
in reply to: Rochambeau’s comment on: The 2028 Global Intelligence Crisis—a finance-oriented vignette
everyone would be wildly rich regardless of bad economic policy
In a world where the richest have everything they can desire and those with a modest amount of savings are getting richer faster than they can spend the money, I doubt such a world will be anything but good
What about the people without savings? It seems like the world in this scenario simply rewards those who are already ahead and punishes those who aren’t.
Plus, there’s less and less you can do to gain economic mobility and to get ahead of others, simply because everyone is getting better advice from their AI agents and “playing the game more optimally”, so to speak, and almost everyone has access to the same AI capabilities to help them with their work.
To me, this scenario seems like a good description of the path that would lead to the kind of extremely unequal future that some people have long warned about with AI development. By no means a good future for the majority of Americans, and even worse for regular people in the rest of the world, as at least in America (as a rich country) you can probably live on welfare.