anaguma

Karma: 1,163

anaguma 5 May 2026 0:32 UTC
6 points
1
on: Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI
The battery-software update accidentally overwrote the antenna-pointing software.
With the lander’s antenna no longer pointed at the orbiter, no further software updates beyond that point could be received.
Out of curiosity, do we know more about how this particular mistake happened? When issuing a software update, not overwriting any critical part of the code would seem fairly high on my list of concerns. It seems like this sort of mistake should have been caught by integration or unit tests, or by testing that all components worked on a replica on earth. Were they under a lot of time pressure or something?

anaguma 1 May 2026 21:37 UTC
3 points
1
in reply to: Tim H’s comment on: anaguma’s Shortform
I haven’t seen that. OpenAI gives the following explanation:
As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it. Taken together, the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.
The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them. Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.
That creates a feedback loop:
1. Playful style is rewarded
2. Some rewarded examples contain a distinctive lexical tic.
3. The tic appears more often in rollouts.
4. Model-generated rollouts are used for supervised fine-tuning (SFT).
5. The model gets even more comfortable producing the tic.

anaguma 1 May 2026 2:45 UTC
14 points
4
on: anaguma’s Shortform
OpenAI is nothing without its goblins.
It seems like they applied a fairly superficial fix, and I’m not sure they learned the right lessons here. The deeper issue is that their RL pipeline seems like a poorly understood mess. An ASI will not let you revert this kind of misalignment.

anaguma 30 Apr 2026 19:42 UTC
3 points
0
on: anaguma’s Shortform
As part of this prediction market, I play a game of chess against most new LLM releases. I am copying my game against Deepseek-V4 and my analysis below, for those who may be interested. Before the game, the market gave the model 1.4% EV^[1].
Deepseek v4 played poorly, blundering a piece in the opening with 11… Bd6 and several pawns thereafter. The game was adjudicated^[2] as a win for me. I believe it is a much weaker model than Opus 4.7 and GPT-5.5.
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 c5 8. Nxc4 Nc6 9. dxc5 Bxc5 10. a3 Qe7 11. b4 Bd6 12. Qxd6 Rd8 13. Qxe7 Nxe7 14. Bb2 a5 15. Nxa5 e5 1-0
1. ^
  Since it resolves 50% for a draw and 100% for a win.
2. ^
  According to the rule “If I judge that my opponent’s position is hopelessly lost, at the level of being down a rook without compensation, I will submit the current position to a friend. If they agree that the position is lost, the game will be adjudicated as a win for me.”

anaguma 30 Apr 2026 18:29 UTC
7 points
0
on: Cyborg evals
Historically, if you wanted to estimate how long a software task would take, you would ask a software engineer to “raw dog it”^[1] and time how long it takes them. This is mostly impossible now, because almost all software engineers are highly dependent on LLMs. An engineer today who tries to do a given task without AI assistance would be significantly slower than a “pre-LLM” engineer on the same task. If you tried to use post-LLM engineer data to estimate LLM time horizons, you’d get biased results.^[2]
METR might be able to hire people doing competitive programming problems or similar competitions, since they still spend a lot of time working without the help of LLMs. E.g. many high school students might be interested in such studies.

anaguma 29 Apr 2026 4:24 UTC
3 points
0
in reply to: Bronson Schoen’s comment on: Current AIs seem pretty misaligned to me
with the obvious problem that this exact setup doesn’t scale to superhuman regime
Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?

anaguma 25 Apr 2026 3:27 UTC
3 points
0
in reply to: mabramov’s comment on: Manifund’s Falcon Fund
I wouldn’t say I “nudged” him. He was doing it. I invested since I thought it was a good investment (it has been). They had no problem raising money, and my investment replaced part of one of the other investors’ cheques.
Thanks for clarifying this.
I have donated a lot of money recently to animal welfare (~$450k in the last 5 months). I would have donated less if I had not had this investment.
Mechanize sells environments to AI labs (this is where all revenue comes from) and so if you think investing in the labs is ok, so should investing in Mechanize.
I personally don’t endorse investing in labs in order to later donate to AI Safety/Animal Welfare causes, for reasons similar to those discussed by Wei Dai here, and have turned down opportunities to do so. But reasonable people disagree on this topic.

anaguma 24 Apr 2026 23:03 UTC
14 points
13
on: Manifund’s Falcon Fund
Marcus also nudged his friend Ege Erdil to start Mechanize, and offered them their first investment.
I’m not sure why this is under the “AI Safety regranting record” section since Mechanize focuses on capabilities research and has been skeptical of safety efforts. Take for example their section from a early post:
Full automation is desirable
Even if you accept the inevitability of full automation, you might still think that we should delay this outcome in order to keep human labor relevant as long as possible. This sentiment is understandable but ultimately misguided. The upside of automating all jobs in the economy will likely far exceed the costs, making it desirable to accelerate, rather than delay, the inevitable.

anaguma 24 Apr 2026 21:20 UTC
5 points
0
in reply to: DirectedEvolution’s comment on: anaguma’s Shortform
I use this prompt described in the prediction market:
Let’s play a game of chess! I will be white, you will be black. On each turn, I will give you the pgn and the fen of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated pgn, etc. Also, do not use any external tools or search queries when making your decision.
If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.”
So e.g. one of the inputs was:
2R3k1/pp3p1p/5p2/1b1p1N2/1P6/6P1/P3PP1P/6K1 b - − 9 26
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 Bxa3 8. bxa3 c5 9. dxc5 Qa5 10. Qd4 Nc6 11. Qxc4 Bd7 12. Bb2 Rac8 13. Rfd1 Rfd8 14. Rac1 Be8 15. Ng5 Ne5 16. Bxe5 Rxc5 17. Qb4 Qxb4 18. axb4 Rxc1 19. Rxc1 Rd5 20. Bxf6 gxf6 21. Bxd5 exd5 22. Nf3 Bb5 23. Nd4 Bd7 24. Rc7 Be8 25. Nf5 Bb5 26. Rc8+
And it responded
Be8

anaguma 24 Apr 2026 18:40 UTC
11 points
0
on: anaguma’s Shortform
As part of this prediction market, I play a game of chess against most new LLM releases. I am copying my game against GPT-5.5 and my analysis below, for those who may be interested.

GPT-5.5 lost in a chaotic game. It made a mistake in the opening with 8 … c5, giving a pawn for no apparent compensation. After this, it seemingly blundered a piece, but found the trick 16. Rxc5. I missed 16. Qxc5 Rxd1 17. Bf1, winning a piece and instead played 17. Qb4. After this, GPT-5.5 could have simplified into a pawn up endgame with 18… Rxd1 + 19. Rxd1 Rxe5 20. f4 Rxe2 21. Bxb7 Rxa2, where it’s not clear whether white can hold. Instead, it blundered with 18. Rxc1, and I was able to convert the piece up endgame.
Overall, a poor game by both sides, though a small improvement in the strength of the GPT 5 series. The PGN is below:
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 Bxa3 8. bxa3 c5 9. dxc5 Qa5 10. Qd4 Nc6 11. Qxc4 Bd7 12. Bb2 Rac8 13. Rfd1 Rfd8 14. Rac1 Be8 15. Ng5 Ne5 16. Bxe5 Rxc5 17. Qb4 Qxb4 18. axb4 Rxc1 19. Rxc1 Rd5 20. Bxf6 gxf6 21. Bxd5 exd5 22. Nf3 Bb5 23. Nd4 Bd7 24. Rc7 Be8 25. Nf5 Bb5 26. Rc8+ Be8 27. Rxe8#

anaguma 23 Apr 2026 5:07 UTC
2 points
−3
in reply to: ChristianKl’s comment on: 10 non-boring ways I’ve used AI in the last month
Good question. I suppose if the student were fully honest about cheating on their homework, I would have no objections to it. The problem for me is that is it involves lying or deceiving one’s instructor and peers. This seems bad in principle, and also bad for consequentialist reasons (e.g. encouraging a habit of further lying or deception in the future, incentivizing others to do the same, wasting the instructor’s time, etc.).

anaguma 22 Apr 2026 18:54 UTC
3 points
0
on: Research Log: Monet/PEER sparse experts
PEER to Logic/Math functions: you can distill each PEER expert into a mix of logical statements and mathematical functions. To do this I’ve drawn on KAN 2.0 and Differentiable Logic Gates Why would I do this? Well, I’m hopeful it might be useful for interpretability, and also it seems potentially possible to make the distilled model run inference quite efficiently on CPU.
Why would this be useful for interpretability? You can already represent INT4 networks with logic gates. I’d also guess that DLG networks break many of our interp tools; e.g. it’s harder to do linear probes or look at the residual stream or activations.

anaguma 22 Apr 2026 17:30 UTC
4 points
0
on: The Changing North Star of AI Control
AI control is supposed to defend against human-level AI, but the line between a human red team and an AI-augmented one has started to blur. And we’re still optimizing our proxy task of safety-usefulness frontiers as if that line were clean.
What in the control setup relies on a non-AI-augmented red team, and what breaks when you relax this assumption?

anaguma 22 Apr 2026 5:27 UTC
9 points
1
in reply to: Raphael Roche’s comment on: 10 non-boring ways I’ve used AI in the last month
Don’t you feel ashamed to spend so much time with AIs, given that you think they’ll likely put an end to humanity ? How can you justify that morally? Your behaviour is inconsistent with your beliefs. You’re encouraging AI, if not with your money, then with your time and attention.
AGI labs are currently compute constrained, not funding constrained, so your use of AI by itself will neither speed up nor slow down the end of humanity. There are cases which I would consider positive uses of AI (e.g. speeding up alignment research), and negative uses (e.g. cheating on homework), but I think it very much depends on specifics.
Data centers are draining all available energy and accelerating global warming more than ever, while nobody pays attention anymore. The doom scenario has already begun. You should just stop !
Epoch AI estimates that datacenters will account for about 5% of US energy demand by 2030. Personally this is not the doom scenario I’m worried about.

anaguma 17 Apr 2026 21:14 UTC
3 points
0
on: On Dwarkesh Patel’s Podcast With Nvidia CEO Jensen Huang
The most interesting part of the first half were his comments about Anthropic, and in particular how Anthropic ended up primarily training and running on Tritium and TPUs.
Running on Tritium would be an interesting plot twist, but I think they are running on Trainium. Perhaps one day they will run Trainium on Tritium though!

Anthropic Releases Opus 4.7

anaguma16 Apr 2026 16:03 UTC

19 points

1 comment1 min readLW link

(www.anthropic.com)

anaguma 16 Apr 2026 7:39 UTC
9 points
0
on: Post-Scarcity is bullshit
In 1930, John Maynard Keynes—one of the most famous economists of all time—predicted that his grandkids would work just 15 hours a week. Why aren’t we doing that?
His estimate may have been wrong, but the trend seems roughly correct:

anaguma 11 Apr 2026 15:12 UTC
6 points
−3
on: anaguma’s Shortform
Given the Mythos news, could we have (optional) 2 factor authentication for Lesswrong?

anaguma 10 Apr 2026 2:49 UTC
5 points
0
on: Stockfish is not a chess superintelligence (and it doesn’t need to be)
Stockfish, relying heavily on deep search over subtle long term evaluation, does not see this. It thinks that white is clearly winning, and doesn’t realise that this is not the case until you play out the moves up to the point where it sees the “50 move rule” draw arriving.
My understanding is that there are multiple different methods for detecting fortresses, but these positions are fairly rare and this hasn’t been a focus of the developers^[1]. One reason is that all Stockfish commits must pass though fishtest, a series of self-play games to prevent regressions, and the added strength from fortress detection historically hasn’t been worth the additional complexity. I don’t think it would be very difficult to make a version of Stockfish which was slightly weaker overall but fixed this issue.
1. ^
  Though there has been some recent progress!

anaguma 9 Apr 2026 0:08 UTC
6 points
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I think that fine-tuning evals are a good idea, and there has been some important related work done on this here. To quote from that post:
Why is this useful?
Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior and it’s unlikely that it will happen by chance. If no number of tokens suffices, the behavior is likely inaccessible by inputs alone.
It can be used as a part of evaluation suites to monitor the change in distance to concerning behaviors across models and scales and the degree to which those behaviors can be achieved in the first place.
It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already been shown to be transparent to the same type of optimizer that will be tasked with adversarially optimizing the input soft prompts against the machinery of the model.
Compared to manual attempts at elicitation or automated techniques restricted to discrete tokens or black box optimization, it dramatically improves the chances that an evaluation’s attempt to elicit concerning behavior will succeed.
It’s also dumb simple.

anaguma

An­thropic Re­leases Opus 4.7

Why is this useful?

Anthropic Releases Opus 4.7