Semi-anon account so I could write stuff without feeling stressed.
Sodium
Hong Kong has seats in its legislature designed for special interest/business groups (see “Functional constituency” on Wikipedia). I don’t understand the system very well though.
I don’t think of this as relevant to any sort of doom, really. I think of “output math” as a habit the model has picked up, since it did a ton of it during training.
See, e.g., Nemotron 3 Nano here using code when asked a religious question on OpenRouter:
This is my least favorite fact about Claude. I don’t think it’s actually being genuine when it uses “genuinely” (or at least, when it describes something as “genuinely X,” I often find that the thing is in fact not X).
My guess is that whatever constitution-inspired post-training process they used gave birth to a reward model that likes text outputs containing “genuinely.”
I think this post is counterproductive. There are serious reasons to believe that iterative alignment would fail, and serious reasons to believe that it’s the best thing we can work on right now. But this post reads like 30% vague ideas and 70% condescension. It feels like it was written to score social points rather than to put forth good ideas in earnest discussion.
I’m surprised not to see more discussion about how to update on alignment difficulty in light of Moltbook.
I mean, it’s possible that the evil-looking AIs on Moltbook are just Grok, which is supposed to do evil roleplays, right?
What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the “real”, external goals?
Because, like, they terminally don’t want to? I guess in your frame, what I’d say is that people terminally value having their internal (and noisy) metrics not be too far off from the external states they’re supposed to represent.
Intuitively, your thesis doesn’t sound right to me. My guess is that (1) most people do “reward hack” themselves quite a bit, and (2) to the extent that they don’t, it’s because they care about “doing the real thing.” “Being real” feels to me like something that’s meaningfully different from a lot of my other preferences? Like, it’s sort of the basis for all other values.
FYI, the paraphrasing stuff sounds like what Yoshua Bengio is trying to do with the Scientist AI agenda. See his talk at the Alignment Workshop in Dec 2025.
(Although I feel like Bengio has shared very little about any actual progress they’ve made, and also very little detail about what they’ve been up to.)
Another distinguishing property of (AGI) alignment work is that it’s forward-looking, trying to solve future alignment problems. Given the large increase in AI safety work from academia, this feels like a useful property to keep in mind.
(Of course, this is not to say that we couldn’t use current-day problems as proxies for those future problems.)
I’m curious: what percent of upvotes are strong upvotes? What percent of karma comes from strong upvotes?
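For concreteness, here’s a minimal sketch of the two quantities I’m asking about, assuming a made-up vote-record shape where each upvote carries its karma weight (field names and numbers are hypothetical, not the site’s actual schema):

```python
# Hypothetical vote records: each upvote notes whether it was a strong
# upvote and how much karma it contributed. All values are placeholders.
upvotes = [
    {"strong": False, "karma": 1},
    {"strong": True,  "karma": 6},
    {"strong": False, "karma": 2},
]

strong = [v for v in upvotes if v["strong"]]
pct_strong_votes = 100 * len(strong) / len(upvotes)
pct_strong_karma = 100 * sum(v["karma"] for v in strong) / sum(v["karma"] for v in upvotes)

print(f"{pct_strong_votes:.0f}% of upvotes are strong upvotes")
print(f"{pct_strong_karma:.0f}% of upvote karma comes from strong upvotes")
```

The two numbers can diverge a lot, since a strong upvote carries several times the karma of a normal one.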
Yeah my guess is also that the average philosophy meetup person is a lot more annoying than the average, I dunno, boardgames meetup person.
Yeah, I would like to mute some users site-wide so that I never see reacts from them and their comments are hidden by default…
As far as I’m aware, this is one of the very few pieces of writing that sketches out what safety reassurances could be made for a model capable of doing significant harm. I wish there were more posts like this one.
This post and (imo more importantly) the discussion it spurred have been pretty helpful for how I think about scheming. I’m happy that it was written!
I feel like the react buttons are cluttering up the UI and distracting. Maybe they should be restricted to, e.g., users with 100+ karma, with everyone getting only one react a day or something?
Like, they’re really annoying when reading articles like this one.
Yeah, I get that the actual parameter count isn’t, but the general argument is that bigger pretrains remember more facts, and we can use that to try to predict model size.
For what it’s worth, I’m still bullish on pre-training given the performance of Gemini-3, which is probably a huge model judging by its score on the AA-Omniscience benchmark.
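To make that concrete, here’s a rough sketch of the estimate I have in mind: fit a log-linear trend between parameter count and fact-recall score on models with known sizes, then invert it for a model of unknown size. All model sizes and scores below are illustrative placeholders, not real AA-Omniscience results:

```python
import numpy as np

# Placeholder (parameters in billions, fact-recall score) pairs for models
# with known sizes -- illustrative numbers only, not real benchmark data.
known_params_b = np.array([8.0, 70.0, 405.0])
known_scores = np.array([22.0, 38.0, 51.0])

# Fit score ~ a * log10(params) + b across the known models.
a, b = np.polyfit(np.log10(known_params_b), known_scores, deg=1)

def estimate_params_b(score: float) -> float:
    """Invert the fitted trend to get an estimated parameter count (billions)."""
    return 10 ** ((score - b) / a)

# E.g., what size would a (hypothetical) score of 60 suggest under this trend?
print(f"~{estimate_params_b(60.0):.0f}B parameters")
```

Obviously this is just a trend-line guess; how much a model memorizes also depends on its data and training recipe, not parameter count alone.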
man, you should probably get some more; I can’t imagine it’ll be that expensive?
I agree it’s probably good not to use moral reasoning here, but the reason people have deontological rules around drugs is that it’s hard to trust our own consequentialist reasoning. Something like “don’t do (non-prescribed) drugs” is also a simple rule that’s much lower effort to follow than running a cost-benefit analysis each time, and it may well be worth it.
I fleshed out a similar idea in (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need