Semi-anon account so I could write stuff without feeling stressed.
Sodium
LessWrong 2.0 Is a Website, Not a Culture or an Authority
Man I don’t think I agree with this. Clearly what draws people here is the culture and the vibe.
And I do think I relate to Twitter and Elon Musk in a similar way as I relate to LW and the LW team? I have preferences over what content should be promoted on Twitter, and I would be happier with Elon if he shared my preferences. Ditto with LessWrong.
On a similar note: I feel generally pessimistic about doing extremely prosaic alignment research[1] that’s intended to beat baselines[2] outside of Anthropic, because you don’t have access to the SoTA baselines.
Edit: however, in this case, narrow fine-tuning on eval-cooperativeness right before you run the evals might be good? You can treat it as spiritually similar to eval awareness steering.
In worlds where AI alignment can be handled by iterative design, we probably survive.
I’m curious whether you still believe this? I think there are currently lots of alignment issues that are totally fixable by iterating on them, but they aren’t being fixed. Naively extrapolating, it’s possible that even if various future alignment problems are solvable via iteration, they might not actually get solved.
My guess is that at the current state, conditional on (misalignment at the point of no return will kill everyone) AND (this misalignment is fixable using iteration), p(misalignment) is still around 50%.[1]
- ^
Obviously this is a bit fuzzy since iterative design worlds would have much more continuous-looking PONR and misalignment issues, but that’s the sort of vibe I have about the current state of the AI race.
I find myself confused at other people’s surprise. The thing we are seeing seems to me like the obvious thing you would expect to see from a generic “you get what you optimize for” objective.
fwiw I strong upvoted the post not because it said something surprising, but because it explains our current observations very well.
I think the Walmart and Toyota case is less interesting because they’re not creating “new” consumption. Like Walmart has a huge revenue because it’s captured a big slice of people’s overall consumption. If Walmart’s revenue doubled next year, it’ll probably be because they got a bigger slice, not because people are suddenly buying twice as much stuff.
I don’t think that’s true because Randy would appoint Republicans throughout government/be more captured by the Republican party’s interests? Like it depends on how much you like Randy-flavored Republican in executive and judicial roles. I think there’s probably a huge difference for what types of judges Randy and Donna would nominate, for example.
I guess this is more true for Presidents than it is for Senators/Representatives (since a Republican congressperson will vote for the Republican Speaker of the House/Senate Majority Leader, who has a lot more power than any individual congressperson).
While you can rip my epistemic qualifiers from my cold dead hands, probably, I sometimes grudgingly admit that the sentences I write have a certain kind of meandering quality to them, often going on for so long that by the time the reader has reached its end, the reader will have forgotten how it started.
The fact that this sentence is meandering and makes it easy to forget how it started by the time one reads to the end makes it an instant banger.
I mean this is assuming that ASI is aligned and chooses to not manipulate public opinion right? I agree that assuming that it’s misaligned, then there’s not much to talk about.
(You can also imagine multipolar worlds where different AIs police each other for superpersuasion.)
this way of reasoning seems like somewhat naive consequentialism.
Maybe? It is hard to reason well about these things given my strong emotions towards the admin.
But I do think the current administration is uniquely terrible by American standards.[1] It attracts and gives power to incompetent sycophants with no moral boundaries.
There was something Eliezer said about Bernie Sanders recently that really resonated with me:
[T]hank you also for consistently trying to do as seems right to you over the years, a stance that has grown on me as I have had more chance to witness its alternatives.
Having Trump as the president really just seems like it would be terrible for AGI governance because he is a terrible person. I’m sorry, I really don’t think there’s a more “precise” way to put it. Character matters. Trump doesn’t even pretend to be a kind person/is not under much pressure to appear to be nice.
(To be clear, I agree that, all else equal, it would be good for the Iranian regime to fail. Alas, all else would not be equal. While I think it would definitely be bad for your soul[2] to do things in the realm of “sabotage the American economy/military operation in order to make our president look bad,” I don’t think I’m obligated to stop my enemy when he is making a mistake either.)
I think the most important effect of the war is that it makes Trump less popular/powerful domestically (even if a miracle happens and he gets some sort of deal). This is good because the less power he has (e.g., Republicans lose the Senate in the midterms), the more likely we are to navigate AI development in a sane way. I think if you put nontrivial* weight in short timelines, the AI considerations likely dominate everything else.
*edited any to nontrivial. Like, maybe 10%+ pre-Jan 2029
I fleshed out a similar idea in (Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Hong Kong had seats in its legislature designed for special interest/business groups (see Functional constituency on Wikipedia). I don’t understand it very well though.
I don’t think of this as relevant to any sort of doom really. I think of “output math” as a habit that the model has picked up, since it did a ton of this during training.
See e.g., Nemotron 3 Nano here using code when asked a religious question on openrouter:
This is my least favorite fact about Claude. I don’t think it’s actually genuine when using “genuinely” (or at least, when it describes something as “genuinely X,” I often find that the thing is in fact not X.)
My guess is that whatever constitution-inspired post-training process they used gave birth to a reward model that likes text outputs that contain “genuinely.”
I think this post is counterproductive. There are serious reasons to believe that iterative alignment would fail, and serious reasons to believe that it’s the best thing we can work on right now. But this post reads like 30% vague ideas and 70% condescension. It feels like it’s written to score social points rather than put forth good ideas in earnest discussion.
I’m surprised not to see more discussions about how to update on alignment difficulty in light of Moltbook
I mean it’s possible that the evil looking AIs on Moltbook are just Grok, which is supposed to do evil role plays, right?
What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the “real”, external goals?
Because like, they terminally don’t want to? I guess in your frame, what I’d say is that people terminally value having their internal (and noisy) metrics not be too far off from the external states they are supposed to represent.
Intuitively your thesis doesn’t sound right to me. My guess is that (1) most people do “reward hack” themselves quite a bit, and (2) to the extent that they don’t, it’s because they care about “doing the real thing.” “Being real” feels to me like something that’s meaningfully different from a lot of my other preferences? Like it’s sort of the basis for all other values.
FYI the paraphrasing stuff sounds like what Yoshua Bengio is trying to do with the scientist AI agenda. See his talk at the alignment workshop in Dec 2025.
(Although I feel like Bengio has shared very little about the actual progress they’ve made (if any), and also very little detail on what they’ve been up to).
Strong downvoted the post because:
The tone is incredibly off-putting and imo not appropriate for LessWrong.
The advice doesn’t actually seem very actionable or good, nor does it seem like this post will actually change how people act.
The Word Games section appears entirely AI generated without being noted as such. This violates LessWrong’s LLM use policy.