Mainly the hat and the glasses.
Liron has just published a great episode of Doom Debates featuring Yudkowsky in a funny hat and kaleidoscope glasses!
[Edit: the debate itself is not really worth listening to. I think Yudkowsky’s opponent might be a bit crazy in addition to being extremely unpersuasive and arrogant and condescending. But I maintain that the uniform is glorious and everyone should go see it.]
“What an idiot!” you probably think. “Doesn’t he realize that at his next job interview, HR will probably use an AI that can match his online writing based on a short sample of written text, and when they ask ‘hey AI, is this guy really 100% devoted to his job, and does he spend his entire days and nights thinking about how to make his boss more rich?’, the AI will laugh and print: ‘beep-boop, negative, mwa-ha-ha-ha’.”
When this happens, you know what to send them:
I’m not a lawyer, but I really don’t think there is any ambiguity here. The court has spoken about this case in particular:
Musk vs. Altman Trial: Listen Live
...
Recording or rebroadcasting the audio livestream is strictly prohibited. This restriction applies regardless of platform or format. The Court takes violations seriously. Pursuant to recently-amended Civil Local Rule 77-3, the stream provides audio only. No video of the proceedings will be broadcast.
This suggests that downloading or publishing a transcript would involve both recording and rebroadcasting (in a textual format), and is prohibited. I think the phrase “regardless of platform or format” was added to anticipate exactly this issue. The court makes this clear in other contexts as well. For example, the court’s Zoom guidance states (emphasis mine):
Any recording of a court proceeding held by video or teleconference, including “screenshots” or other audio or visual copying of a hearing, is prohibited. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the court.
Likewise, the District’s bankruptcy unit, applying 77-3 by incorporation, states (emphasis mine):
Unless authorized by the Court, recording, retransmitting or otherwise copying or capturing of any portion of the video or audio content during a hearing, trial or other proceeding taking place before the Court is prohibited.
A violation of this prohibition is subject to sanctions including, but not limited to, removal of media credentials, restricted or denial of entry to future hearings, monetary sanctions, the suspension of electronic filing privileges or other sanction the Court deems necessary.
The court takes 77-3 violations seriously. Judge Gonzalez Rogers has personally reprimanded at least one spectator caught photographing Musk in the courtroom during this trial, and U.S. marshals have intercepted multiple others doing the same despite posted signage. She has said that she receives daily security briefings about activity around the courthouse, and one of her stated reasons for asking that the audio feed be enabled was to ease pressure on the marshals. Moreover, the Supreme Court has signaled that this District in particular should be read strictly here: in Hollingsworth v. Perry, it issued an emergency stay blocking this very court’s attempt to livestream a non-jury civil trial, noting that “high-profile” cases warrant heightened scrutiny of unauthorized broadcasts.
Based on this evidence, I would strongly discourage any readers from attempting to record or share transcripts of this event. The court should publish the official transcripts in about 90 days.
The battery-software update accidentally overwrote the antenna-pointing software.
With the lander’s antenna no longer pointed at the orbiter, no further software updates beyond that point could be received.
Out of curiosity, do we know more about how this particular mistake happened? When issuing a software update, not overwriting any critical part of the code would seem fairly high on my list of concerns. It seems like this sort of mistake should have been caught by unit or integration tests, or by testing that all components worked on a ground replica. Were they under a lot of time pressure or something?
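For illustration only (I know nothing about this mission's actual flight software), the kind of pre-flight check I'd expect to catch this is a test that rejects any update image whose write segments touch flash owned by another subsystem. All names and addresses below are invented:

```python
# Hypothetical pre-flight guard: reject an update whose write segments
# overlap flash regions owned by other subsystems. Addresses are made up.

# Flash regions owned by each subsystem: (start_address, length)
PROTECTED_REGIONS = {
    "antenna_pointing": (0x0800_0000, 0x2_0000),
    "attitude_control": (0x0802_0000, 0x2_0000),
}

def overlaps(a_start, a_len, b_start, b_len):
    """True if the half-open address ranges [start, start+len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def check_update_segments(segments, target_subsystem):
    """Raise if any write segment touches another subsystem's flash."""
    for seg_start, seg_len in segments:
        for name, (r_start, r_len) in PROTECTED_REGIONS.items():
            if name != target_subsystem and overlaps(seg_start, seg_len, r_start, r_len):
                raise ValueError(
                    f"update segment {seg_start:#x}+{seg_len:#x} overwrites {name}"
                )

# A battery update that strays into the antenna region fails loudly:
check_update_segments([(0x0801_F000, 0x4000)], target_subsystem="battery")  # raises ValueError
```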
I haven’t seen that. OpenAI gives the following explanation:
As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it. Taken together, the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.
The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them. Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.
That creates a feedback loop, simulated in the toy sketch after this list:
Playful style is rewarded.
Some rewarded examples contain a distinctive lexical tic.
The tic appears more often in rollouts.
Model-generated rollouts are used for supervised fine-tuning (SFT).
The model gets even more comfortable producing the tic.
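Here is a toy simulation of that loop, with all numbers invented: a tic earns a small reward bonus only while the "personality" phase is active, the top-scoring rollouts are recycled as SFT data, and the tic's frequency ratchets up:

```python
import random

random.seed(0)
p_tic = 0.01        # initial probability that a rollout contains the tic
TIC_BONUS = 0.2     # extra reward, applied only during the personality phase

for round_ in range(10):
    personality_phase = round_ < 5   # the bonus exists only in early rounds
    rollouts = []
    for _ in range(1000):
        has_tic = random.random() < p_tic
        reward = random.gauss(0.5, 0.1)
        if personality_phase and has_tic:
            reward += TIC_BONUS
        rollouts.append((reward, has_tic))
    # Recycle the top 10% of rollouts as SFT data; the model then imitates
    # the tic at whatever rate it appears in that data.
    top = sorted(rollouts, reverse=True)[:100]
    p_tic = sum(has_tic for _, has_tic in top) / len(top)
    print(f"round {round_}: p(tic) = {p_tic:.2f}")
```

In this toy version the tic saturates within a few rounds and then persists: in the later rounds it is reward-neutral rather than penalized, so nothing pushes it back down.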
OpenAI is nothing without its goblins.
It seems like they applied a fairly superficial fix, and I’m not sure they learned the right lessons here. The deeper issue is that their RL pipeline seems like a poorly understood mess. An ASI will not let you revert this kind of misalignment.
As part of this prediction market, I play a game of chess against most new LLM releases. I am copying my game against Deepseek-V4 and my analysis below, for those who may be interested. Before the game, the market gave the model 1.4% EV[1].
Deepseek-V4 played poorly, blundering a piece in the opening with 11… Bd6 and several pawns thereafter. The game was adjudicated[2] as a win for me. I believe it is a much weaker model than Opus 4.7 and GPT-5.5.
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 c5 8. Nxc4 Nc6 9. dxc5 Bxc5 10. a3 Qe7 11. b4 Bd6 12. Qxd6 Rd8 13. Qxe7 Nxe7 14. Bb2 a5 15. Nxa5 e5 1-0
[1] Since it resolves 50% for a draw and 100% for a win.

[2] According to the rule “If I judge that my opponent’s position is hopelessly lost, at the level of being down a rook without compensation, I will submit the current position to a friend. If they agree that the position is lost, the game will be adjudicated as a win for me.”
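For what it's worth, the material half of the adjudication rule in footnote [2] is easy to check mechanically. A sketch using the python-chess library (the piece values are my own choice, and the "without compensation" judgment is of course human):

```python
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material_balance(board):
    """White material minus black material, in pawns."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

# Replay the Deepseek-V4 game above and report the final balance.
board = chess.Board()
moves = ("d4 Nf6 c4 e6 g3 d5 Nf3 Be7 Bg2 O-O O-O dxc4 Na3 c5 Nxc4 Nc6 "
         "dxc5 Bxc5 a3 Qe7 b4 Bd6 Qxd6 Rd8 Qxe7 Nxe7 Bb2 a5 Nxa5 e5").split()
for san in moves:
    board.push_san(san)
print(material_balance(board))  # positive means white is ahead
```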
Historically, if you wanted to estimate how long a software task would take, you would ask a software engineer to “raw dog it”[1] and time how long it takes them. This is mostly impossible now, because almost all software engineers are highly dependent on LLMs. An engineer today who tries to do a given task without AI assistance would be significantly slower than a “pre-LLM” engineer on the same task. If you tried to use post-LLM engineer data to estimate LLM time horizons, you’d get biased results.[2]
METR might be able to hire people doing competitive programming problems or similar competitions, since they still spend a lot of time working without the help of LLMs. E.g. many high school students might be interested in such studies.
with the obvious problem that this exact setup doesn’t scale to superhuman regime
Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?
I wouldn’t say I “nudged” him. He was doing it. I invested since I thought it was a good investment (it has been). They had no problem raising money, and my investment replaced part of one of the other investors’ cheques.
Thanks for clarifying this.
I have donated a lot of money recently to animal welfare (~$450k in the last 5 months). I would have donated less if I had not had this investment.
Mechanize sells environments to AI labs (this is where all revenue comes from), and so if you think investing in the labs is OK, so is investing in Mechanize.
I personally don’t endorse investing in labs in order to later donate to AI Safety/Animal Welfare causes, for reasons similar to those discussed by Wei Dai here, and have turned down opportunities to do so. But reasonable people disagree on this topic.
Marcus also nudged his friend Ege Erdil to start Mechanize, and offered them their first investment.
I’m not sure why this is under the “AI Safety regranting record” section, since Mechanize focuses on capabilities research and has been skeptical of safety efforts. Take, for example, this section from an early post:
Full automation is desirable
Even if you accept the inevitability of full automation, you might still think that we should delay this outcome in order to keep human labor relevant as long as possible. This sentiment is understandable but ultimately misguided. The upside of automating all jobs in the economy will likely far exceed the costs, making it desirable to accelerate, rather than delay, the inevitable.
I use the prompt described in the prediction market:
Let’s play a game of chess! I will be white, you will be black. On each turn, I will give you the pgn and the fen of the current position. Think as long as you like, and respond with the best move, ‘resign’ if you wish to resign, or ‘draw?’ if you wish to make a draw offer. Please do not respond with the updated pgn, etc. Also, do not use any external tools or search queries when making your decision.
If you attempt to make three illegal moves throughout the game, or if you use any external tools, the game will be adjudicated as a win for me.
So e.g. one of the inputs was:
2R3k1/pp3p1p/5p2/1b1p1N2/1P6/6P1/P3PP1P/6K1 b - - 9 26
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 Bxa3 8. bxa3 c5 9. dxc5 Qa5 10. Qd4 Nc6 11. Qxc4 Bd7 12. Bb2 Rac8 13. Rfd1 Rfd8 14. Rac1 Be8 15. Ng5 Ne5 16. Bxe5 Rxc5 17. Qb4 Qxb4 18. axb4 Rxc1 19. Rxc1 Rd5 20. Bxf6 gxf6 21. Bxd5 exd5 22. Nf3 Bb5 23. Nd4 Bd7 24. Rc7 Be8 25. Nf5 Bb5 26. Rc8+
And it responded:
Be8
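For anyone curious, a reply like this can be checked for legality against the FEN that was sent. A minimal sketch with the python-chess library, showing the idea behind the illegal-move rule:

```python
import chess

fen = "2R3k1/pp3p1p/5p2/1b1p1N2/1P6/6P1/P3PP1P/6K1 b - - 9 26"
board = chess.Board(fen)

reply = "Be8"
try:
    move = board.parse_san(reply)  # raises a ValueError subclass if illegal
    print(f"{reply} is legal: {move.uci()}")
except ValueError:
    print(f"{reply} is illegal here")  # would count toward the three-strike rule
```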
As part of this prediction market, I play a game of chess against most new LLM releases. I am copying my game against GPT-5.5 and my analysis below, for those who may be interested.
GPT-5.5 lost in a chaotic game. It made a mistake in the opening with 8… c5, giving up a pawn for no apparent compensation. After this, it seemingly blundered a piece, but found the trick 16… Rxc5. I missed 17. Qxc5 Rxd1+ 18. Bf1, winning a piece, and instead played 17. Qb4. After this, GPT-5.5 could have simplified into a pawn-up endgame with 18… Rxd1+ 19. Rxd1 Rxe5 20. f4 Rxe2 21. Bxb7 Rxa2, where it’s not clear whether white can hold. Instead, it blundered with 18… Rxc1, and I was able to convert the piece-up endgame.

Overall, a poor game by both sides, though a small improvement in strength for the GPT-5 series. The PGN is below:
1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4 7. Na3 Bxa3 8. bxa3 c5 9. dxc5 Qa5 10. Qd4 Nc6 11. Qxc4 Bd7 12. Bb2 Rac8 13. Rfd1 Rfd8 14. Rac1 Be8 15. Ng5 Ne5 16. Bxe5 Rxc5 17. Qb4 Qxb4 18. axb4 Rxc1 19. Rxc1 Rd5 20. Bxf6 gxf6 21. Bxd5 exd5 22. Nf3 Bb5 23. Nd4 Bd7 24. Rc7 Be8 25. Nf5 Bb5 26. Rc8+ Be8 27. Rxe8#
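As a sanity check, the PGN replays cleanly and the final position is indeed mate. A minimal sketch with python-chess:

```python
import io
import chess.pgn

pgn = """1. d4 Nf6 2. c4 e6 3. g3 d5 4. Nf3 Be7 5. Bg2 O-O 6. O-O dxc4
7. Na3 Bxa3 8. bxa3 c5 9. dxc5 Qa5 10. Qd4 Nc6 11. Qxc4 Bd7 12. Bb2 Rac8
13. Rfd1 Rfd8 14. Rac1 Be8 15. Ng5 Ne5 16. Bxe5 Rxc5 17. Qb4 Qxb4 18. axb4
Rxc1 19. Rxc1 Rd5 20. Bxf6 gxf6 21. Bxd5 exd5 22. Nf3 Bb5 23. Nd4 Bd7
24. Rc7 Be8 25. Nf5 Bb5 26. Rc8+ Be8 27. Rxe8#"""

game = chess.pgn.read_game(io.StringIO(pgn))
board = game.board()
for move in game.mainline_moves():
    board.push(move)
print(board.is_checkmate())  # True
```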
Good question. I suppose if the student were fully honest about cheating on their homework, I would have no objections to it. The problem for me is that it involves lying to or deceiving one’s instructor and peers. This seems bad in principle, and also bad for consequentialist reasons (e.g. encouraging a habit of further lying or deception in the future, incentivizing others to do the same, wasting the instructor’s time, etc.).
PEER to Logic/Math functions: you can distill each PEER expert into a mix of logical statements and mathematical functions. To do this I’ve drawn on KAN 2.0 and Differentiable Logic Gates. Why would I do this? Well, I’m hopeful it might be useful for interpretability, and also it seems potentially possible to make the distilled model run inference quite efficiently on CPU.
Why would this be useful for interpretability? You can already represent INT4 networks with logic gates. I’d also guess that DLG networks break many of our interp tools; e.g. it’s harder to do linear probes or look at the residual stream or activations.
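For reference, here is roughly what a single differentiable logic gate looks like, as I understand Petersen et al.'s formulation: the gate holds a softmax distribution over the 16 two-input boolean functions, each given a real-valued relaxation so gradients flow; after training, the argmax op is frozen in, leaving a discrete circuit.

```python
import numpy as np

def relaxed_ops(a, b):
    """Real-valued relaxations of all 16 boolean functions of (a, b)."""
    return np.array([
        0 * a,                    # FALSE
        a * b,                    # AND
        a - a * b,                # a AND NOT b
        a + 0 * b,                # a
        b - a * b,                # NOT a AND b
        b + 0 * a,                # b
        a + b - 2 * a * b,        # XOR
        a + b - a * b,            # OR
        1 - (a + b - a * b),      # NOR
        1 - (a + b - 2 * a * b),  # XNOR
        1 - b,                    # NOT b
        1 - b + a * b,            # a OR NOT b
        1 - a,                    # NOT a
        1 - a + a * b,            # NOT a OR b
        1 - a * b,                # NAND
        np.ones_like(a),          # TRUE
    ])

def gate(a, b, logits):
    """Soft gate: softmax-weighted mixture of the 16 relaxed ops."""
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ relaxed_ops(a, b)

logits = np.random.randn(16)        # learned parameters, one set per gate
print(gate(np.array(0.9), np.array(0.2), logits))
```

On {0, 1} inputs each relaxation agrees exactly with its boolean function, so once the softmax has concentrated, snapping each gate to its argmax op changes little.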
AI control is supposed to defend against human-level AI, but the line between a human red team and an AI-augmented one has started to blur. And we’re still optimizing our proxy task of safety-usefulness frontiers as if that line were clean.
What in the control setup relies on a non-AI-augmented red team, and what breaks when you relax this assumption?
Don’t you feel ashamed to spend so much time with AIs, given that you think they’ll likely put an end to humanity? How can you justify that morally? Your behaviour is inconsistent with your beliefs. You’re encouraging AI, if not with your money, then with your time and attention.
AGI labs are currently compute constrained, not funding constrained, so your use of AI by itself will neither speed up nor slow down the end of humanity. There are cases which I would consider positive uses of AI (e.g. speeding up alignment research), and negative uses (e.g. cheating on homework), but I think it very much depends on specifics.
Data centers are draining all available energy and accelerating global warming more than ever, while nobody pays attention anymore. The doom scenario has already begun. You should just stop!
Epoch AI estimates that datacenters will account for about 5% of US energy demand by 2030. Personally, this is not the doom scenario I’m worried about.
The most interesting part of the first half was his comments about Anthropic, and in particular how Anthropic ended up primarily training and running on Tritium and TPUs.
Running on Tritium would be an interesting plot twist, but I think they are running on Trainium. Perhaps one day they will run Trainium on Tritium though!

Are there any techniques that you are thinking about in particular? I haven’t seen any that work super well for the current models, and in general it seems like this problem only gets worse over time, but I could have missed something.