@RobertM Why did the new LessWrong Editor lose the ability to create top-level question posts?
I suspect that this is not due to the problem as written, but due to similar real-world situations and the ease of being overconfident and underestimating p(loss). As Zvi put it, “Executing real trades is necessary to get worthwhile data and experience (italics mine—S.K.). Tiny quantities work. A small bankroll with this goal must be preserved and variance minimized. Kelly is far too aggressive.”
Could you explain in what sense politicians and others outsourced their thinking before the rise of AI? Did you mean that people rarely think about things outside their areas of expertise and instead defer to experts?
First of all, I suspect that fictional approval has constraints similar to the collective’s approval and/or cultural hegemony. Secondly, “the constraints those humans have” could be not limited intelligence, but embodiment and/or growing up in environments with long-term consequences and similarly capable but different intelligences. An embodied paperclip optimizer can do only so much with an individual brain and limbs, so it would have to steer others’ actions toward executing its plans (e.g. participating in the creation of a robot army and aligning it to paperclips). Finally, I don’t buy the argument that long-term strategy, unlike philosophy, is hard to verify. LTS is supposed to have an objective result of goals being achieved or not achieved, and is likely testable in a manner similar to, e.g., the AI-2027 tabletop exercise.
The model can be perfectly well-calibrated, honest, and non-sycophantic according to the subset of metrics we manage to impose on it. Nevertheless, the loop still forms.
I suspect that you would be interested in reading about AI-assisted value lock-in, the Intelligence Curse, or other forms of gradual disempowerment. Your example of
“AI helped me draft this” becomes “Analysis shows that” or questions like “Was this vibecoded?” get answered with “Less than 50% and only where the code was too bad to go through by myself”.
is not such an example if the task has an easily verifiable reward and isn’t related to ideological matters.
I solved the problem myself, then double-checked it with the LLM. In addition, you can check the logic yourself.
Could you explain how interp techniques eventually become useless? As for training against interp at the end of training, this reminds me of this post by Knight Lee.
Just by looking at the pictures, we can see that D doesn’t share the right-hand OH group every other compound has. So that’s not the answer. Next we see that B is missing the downwards-going carbon branch (and so is D). So that is not the answer, either. We’re left deciding between A and C. But A and B share the same squiggly mid-lane carbons. So the answer must be A.
The actual answer is B, not A. The carbon atom bonded to one hydrogen is at distance 1 from the hydroxyl group, which rules out A. D is not an acid, and C has the wrong distance between the -OH and -COOH groups.
TED-AI is defined by Kokotajlo and Lifland as Top-Expert-Dominating AI. However, I struggle to understand the origins of @Toby_Ord’s distribution. I suspect that his sources for longer timelines are as hard to rely on as Cotra’s heavily criticized estimate or the observation that “all the revenue growth in the industry has corresponded to a scaling up of the supply of inference compute so that revenue per H100 equivalent has remained fairly constant.” Unlike metrics like the Epoch Capabilities Index as a function of training compute, or the ARC-AGI leaderboard per dollar spent (which might imply that no possible CoT-based system is far more effective at ARC-AGI than Gemini 3 Flash and Gemini-3.1 Pro), Ergil’s argument doesn’t actually claim anything about the capabilities of AI systems which don’t even exist yet.
I firmly believe that the OP’s author should have narrowed the uncertainty at least to a Lifland-like estimate. Additionally, I struggle to understand most constraints related to broad timelines. Whatever the timelines are, our end goal is to ensure that ASI is either never created or is aligned, and aligned not to a dystopia. Preventing a misaligned ASI requires leverage at least over actors as reckless as xAI, and preventing Intelligence Curse-like outcomes or AI-enabled dictatorships requires some influence over power struggles. Such influence requires us to ensure that politicians occupying positions of power act to prevent risks, not to do things like destroying Anthropic for a refusal to participate in mass surveillance. But I don’t see any pathways except for infecting politicians with the right memes (think of IABIED’s attempt to flood politicians with calls, letters, and e-mails, or of the IABIED march) and placing infected people into higher-level positions.
Moreover, Kokotajlo’s timeline implies a 50% chance of TED-AI before Jan 2031 or before Oct 2032, while Eli’s timeline implies a 50% chance of TED-AI before Feb 2035 or Apr 2036. Taken at face value, these estimates mean that p(TED-AI is created within the next 10 years) is around 50% (or, in Kokotajlo’s case, 62% or outright 73%), making a project that requires 20 years to complete unlikely to have an effect.
But this is a very difficult task. Imagine someone had scanned your brain, and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.
How would you go about subverting that mechanism?
I think it’s possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.
I strongly suspect that this is a misunderstanding. Suppose that the probe does fire when you lie, and that its output is used as a penalty during training. Then training steers your brain’s behavior away from the states where the probe fires most strongly and toward states where it fires weakly: toward lies the probe fails to detect, or toward having the probe fire only when you genuinely can’t think of a solution and want to cover that up. The very behavior that makes the probe fire causes the probe to de-reinforce itself, and we are cooked.
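To make the de-reinforcement concrete, here is a toy sketch (my own construction, so every name and number in it is an assumption, not anything from the parent comment): a frozen linear probe that is strongly correlated with a ground-truth “lying” direction still gets driven to zero when its score is penalized during training, while the lie itself survives.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Ground-truth "actually lying" feature direction.
v_lie = rng.normal(size=d)
v_lie /= np.linalg.norm(v_lie)

# The probe direction: highly correlated with v_lie, but not identical
# (even a 99.9%-accurate probe is not the lie itself).
w = v_lie + 0.3 * rng.normal(size=d)
w /= np.linalg.norm(w)

a = 2.0 * v_lie  # activations of a model that is currently lying

def probe_score(a):   # what the training loop can see and penalize
    return float(w @ a)

def lie_content(a):   # the ground truth we actually care about
    return float(v_lie @ a)

print(f"before: probe = {probe_score(a):+.3f}, lie = {lie_content(a):+.3f}")

# Gradient descent on probe_score(a)^2 + 10 * (lie drift)^2: the probe's
# output is a penalty, while the task still rewards keeping the lie.
target, lr = lie_content(a), 0.05
for _ in range(2000):
    grad = 2 * probe_score(a) * w + 20 * (lie_content(a) - target) * v_lie
    a = a - lr * grad

print(f"after:  probe = {probe_score(a):+.3f}, lie = {lie_content(a):+.3f}")
```

The probe and the lie direction only need to differ slightly for gradient descent to find activations that still encode the lie without firing the probe.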
Additionally, Byrnes’ case for interpretability in the loop doesn’t actually have the belief system trained on rewards from the desire system. Instead, the two systems somehow co-design outputs.
Finally, if you ask about “a principled way to reason about this ratio”, then one can consider a simple model. Suppose that the model family Agent-N was trained without using interpretability, and that the company creates new models until one of them is NOT caught being misaligned. Then Agent-N can be aligned with probability p_a, misaligned and catchable with probability p_c, or misaligned and uncatchable with probability p_u, meaning that the end result is either an aligned model with probability p_a/(1-p_c) or a misaligned one with probability p_u/(1-p_c). If another model family, Open-N, is trained with interpretability in the loop, then we lose the segment of caught models. Instead, Open-N ends up aligned with probability q_a and misaligned-but-uncaught with probability q_u. What we need is q_u < p_u/(1-p_c), not merely q_u ≈ p_u + p_c. Does that make Goodfire’s research a case for the former?
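Spelled out (my own rearrangement of the model above, using the fact that p_a + p_c + p_u = 1):

```latex
\begin{aligned}
P(\text{deployed Agent-}N \text{ aligned}) &= \frac{p_a}{1-p_c} = \frac{p_a}{p_a+p_u},
\qquad
P(\text{misaligned}) = \frac{p_u}{p_a+p_u},\\
\text{so interpretability helps iff}\quad
q_u &< \frac{p_u}{p_a+p_u},
\quad\text{and fails if it merely relabels: } q_u \approx p_u + p_c .
\end{aligned}
```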
In particular, I am talking about famous pieces like AI 2027
As for AI-2027, @Daniel Kokotajlo thinks that it’s NOT a scenario with a competent USG. Edited to add: additionally, one modification of the scenario had Agent-4 escape and coordinate with governments of some states weaker than the USA and China.
I suspect that broad timelines are even worse than Cotra’s mistake of building a broadly right model and plugging absurd parameter values into it.
Another foundation of this program was a set of algorithmic approaches, Iterated Distillation and Amplification (IDA), that supposedly offers a path to actually building these approval-directed AI agents.
I am (and have always been) a skeptic of IDA: I just don’t think any of those algorithms would work very well.
First of all, I doubt the claim that IDA-like agents cannot be bootstrapped to ASI at all. Since GPT-5.4 Pro has likely solved a FrontierMath open problem, I suspect that Yudkowsky’s case against bootstrapping IDA-based agents to ASI either ends up Proving Too Much or actually proves the different claim that we can’t expect IDA agents to be bootstrapped into an aligned ASI.
Additionally, I am not sure to what extent Approval Reward in humans isn’t just a proxy for long-term, large-scale coordination, since Alex tricking Hugh would cause Hugh to become adversarial once the truth is revealed. That being said, the hard problem of wireheading the human does have a human analogue in misaligned governments engaging in wholesale deception of their citizens.
Finally, I suspect that humans receive Approval Reward from the entire collective with which they identify, not just from another human or collective. What if, say, Agent-3 develops Approval Reward from the entire Agent-3 collective instead of the human collective, and that collective’s values drift?
P.S. I also suspect that the first and second issues mentioned by Demski can be partially solved by rewarding the AI for having its explanations understood by the simulated human, so that the simulated human could rerun the experiments. Even if the simulated human can’t understand the concept of “metastable covalent plasma flux resonances”, the AI wasn’t born with this concept either: the concept is either BS, requiring the simulated human to reject the plan (edit: think of torsion fields in pseudoscience), or was created through some interaction of other concepts which are hopefully closer to being comprehensible.
Thanks! My biggest disagreement was the ratio of compute of the American and Chinese projects. What I expect is a Taiwan invasion causing the projects to slow down and both countries to set up domestic compute production, with the disastrous result that neither OpenBrain nor DeepCent can slow down, because the other company would gain the capabilities lead. Assuming an invasion in 2027, reaching the median by 2029 would require ten times more compute (edit: by which I mean ten times more compute than a counterfactual SC in 2027), which China could be on track to obtain first.
Additionally, were Anthropic to keep the lead and Claude’s newest Constitution to remain unchanged, a Claude aligned to it could be as unfit for empowering a dictatorship as Agent-4 is unfit to serve any humans.
Isn’t the main problem the issue with simulating the Predictor? I also thought about this idea. The simplest variant is that the alleged Predictor estimates P(the user will choose to one-box) by using the user’s previous choices: P(next choice is a one-box) = (one-boxing trials + 1)/(trials + 2). This setup has the player set a probability p, then watch as the Predictor converges to p, thus punishing or not punishing the Player.
A more complex idea is to have the player choose to two-box with probability p if a random number between 0 and 1 is at most r, and with probability q if the random number is above r; the Predictor would then learn the player’s random numbers, estimate the parameters, and choose not to place the million with probability p_est or q_est, depending on whether the random number is bigger than r_est. In order to deal with precision issues, one can set p to 1 and q to 0, meaning that the Predictor only has to figure out the value of r. However, this setup would have the player’s choice to one-box or two-box cause a subsequent iteration to receive or not receive the million. I suspect that this is a way to derive FDT from mere superrationality.
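A minimal simulation of the simplest variant above (the `simulate` helper, the 1/2 placement threshold, and the standard Newcomb payoffs are my assumptions; the Laplace-rule estimator is the one described):

```python
import random

# The "Predictor" estimates the player's one-boxing probability with
# Laplace's rule of succession, (one-boxing trials + 1) / (trials + 2),
# and places the million iff that estimate exceeds 1/2.
def simulate(p_one_box: float, trials: int = 100_000, seed: int = 0):
    rng = random.Random(seed)
    one_box_count, payoff = 0, 0.0
    for t in range(trials):
        estimate = (one_box_count + 1) / (t + 2)
        million = 1_000_000 if estimate > 0.5 else 0
        one_boxed = rng.random() < p_one_box
        # One-boxers take only box B; two-boxers add the guaranteed $1000.
        payoff += million if one_boxed else million + 1_000
        one_box_count += one_boxed
    return (one_box_count + 1) / (trials + 2), payoff / trials

for p in (0.2, 0.5, 0.9):
    estimate, mean_payoff = simulate(p)
    print(f"p = {p}: Predictor's estimate -> {estimate:.3f}, "
          f"mean payoff = {mean_payoff:,.0f}")
```

The Predictor’s estimate converges to whatever p the player sets, so a player committed to p > 1/2 gets the million in almost every round and a player with p < 1/2 gets punished, as described above.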
evolution generates mutations randomly
Evolution’s timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters (and tweaks our bodies, but that’s less important than evolution’s effects on our brains). Most other parts of our brains are initialised randomly.
Additionally, within-lifetime learning in humans and in LLMs works very differently. Human brains are wildly neuralese and update their weights over the entire lifetime. LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself, and cannot learn anything from a task until it has already been completed.
is there ~any effort to control our AIs now?
In what sense could the AIs be uncontrolled now? I have just re-read Anthropic’s Model Card for Claude Opus 4.6. The evaluation had Anthropic believe that “Given that Opus 4.6 did not improve over previous models in the higher performing out of the two main settings, we do not believe that the risk of successful sabotage is increased from Opus 4.5 (italics mine—S.K.). Like Opus 4.5, we believe that Opus 4.6 is likely unable to conduct significant subtle sabotage without such sabotage often becoming obvious.” Suppose that Opus 4.6 were to sabotage research. Then it would be clear from transcripts, especially if the AI-2027 forecast holds and half of the tokens are checked by weak models which are ten times cheaper.
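For reference, the monitoring overhead implied by those forecast numbers (my arithmetic, not the forecast’s): checking half of the tokens with monitors that are ten times cheaper adds roughly

```latex
\frac{1}{2} \times \frac{1}{10} = 5\% \text{ of the generation compute.}
```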
Maybe you mean that if the model displays the ability to sabotage research without referencing the task, then monitoring becomes a far more complex problem, requiring the company to apply interpretability tools to activations, and/or that such tools aren’t THAT cheap?
I expect the strategy to cause the effects of anti-American propaganda to skyrocket. Suppose that Russia decided to murder Zelenskiy. How likely would this action be to cause the West to increase its support of Ukraine and some countries of the Global South to disapprove of Russia’s actions? And what if we replace the counterfactual with the actual events of Israel vs. Iran?
As for the drug cartels, however, attempting to support such a cartel would lead not just to a state adopting a mistaken policy, but also to the cartel and its supporter participating in drug-related evils.
Fully agreed. I would also like @Daniel Kokotajlo’s team to open-source the rules of the tabletop exercise and/or to revise the AI-2027 compute forecast so that it reflects the new possibility of China amassing approximately the same amount of compute as the USA after the latter loses its compute supply from Taiwan. If that happens, neither OpenBrain nor DeepCent would have a lead to burn in case of misalignment, which would cause a disaster.