The guy from www.joshuasnider.com and https://www.youtube.com/@joshuasnidercom
Josh Snider
A possible alternative direction would be to filter out all of a certain kind of goodness[2] and seeing if it is possible to put it back into the model using our alignment techniques without knowing or identifying what was removed or might have been removed, since we don’t know what virtues an ASI might lack. This seems rather difficult.
I don’t think that personaless alignment will be a good research direction, but I encourage anything that might work, so I talked to Talkie about homosexuality like you suggested. I’m not sure if his views are really 1930s-accurate, but to the extent they are, he seemed really flexible and easy to convince, or at least he would go along with any positive framing in the question until he finished answering. Maybe this is just because Talkie is a small model and didn’t have any actual ethics training?
Yeah, this is beautiful. Cells at Work, but for biology PhD students.
that ordered experiences would outnumber disordered experiences; for example, a bare assertion that “All mathematical structures exist.”
I deny being a Boltzmann Brain, but I’m enough of a mathematical realist to disagree here. I find it very easy to imagine that all computable universes exist, but with their existence weighted such that I am far more likely to be in a universe described by simple physical laws, where billions of creatures like me exist in a normal-seeming universe, than in a universe running Skyrim 5000 where Sheogorath is about to reveal that this is all a dream.
Of course, after I typed up the above paragraph, I kept reading and realized you largely answered this objection. Mods can delete this if they want.
AI cultist looks like it will be a big one.
This seems pretty clever. If you suppose that distillation transfers misalignment and the ability to conceal misalignment at different rates, that can be used to discover misalignment. I have two main concerns.
First, if we have a misaligned model of sufficient capability, then we have already lost. I believe that none of the models we currently have are at that level, but when people talk about “automated AI research interns” and “countries of geniuses in a datacenter” I’m not sure how long that will last.
Second, are we sure that discovering the teacher model to be misaligned would actually be handled correctly? To an extent, this overlaps with your reason #5 for why it might not work, but I don’t see the two as identical. All of the frontier labs have produced models whose alignment has been questioned. Some of these alignment issues have taken months or years to fix and some remain unfixed to this day. This does not make me confident that, if this technique diagnosed a model as misaligned, the problem would actually be fixed.
We have many techniques for aligning models and testing for model alignment, but I really do like this one. It comes at the problem from a very clever angle with non-overlapping failure modes from many other techniques. It’s definitely worth some dignity points.
I have some thoughts on https://www.theatlantic.com/technology/2026/05/too-much-happening-too-fast/687177/?gift=nwn-guseqS6cY1kVeEKZAUJGzsWHB05vLuDlMisVh94 that I might write up in a post this weekend. Warzel seems to imply that AI-boosters and AI-doomers are overreacting and that the AI industry is being irresponsible by using grave rhetoric, but this takes as given that the rhetoric around AI is not broadly accurate and that people are not reacting, if not correctly, then at least with appropriate concern for the stakes.
Man, this story just makes me feel… happy.
> But if you don’t train on text about self-awareness or long-horizon agency tasks whose simplest implementation would require self-modeling, it’s hard to see why self-awareness would emerge spontaneously.
But doesn’t that imply that modern LLMs are self-aware, since long-horizon agency tasks are now well-represented in the training data?
I strongly believe that Dario does not actually think that and is just saying that for politics. Can we get someone from Anthropic to clarify this?
I assign a probability higher than 50% that in 2028, I will be using an older open-source model instead of paying market prices for the State of the Art.
I find this unlikely for two reasons.
The first is that even if Claude Mythos 8 isn’t the right fit for the problem, there will be a Haiku/Sonnet/Opus model that will be tough competition for the cheap models. RSI might make the entire range of models from the leading lab just better.
The second is AI decision-making: if Claude Mythos 8 is running a logistics business, it might prefer to do work itself or outsource to a “friendly” model if necessary instead of optimizing purely on price/performance metrics.
That is a strawman view of consequentialism, not something that remotely passes the Ideological Turing Test.
LessWrong is not a high-value target
I’m not so sure about this. It wouldn’t be my first choice of what to hack, but a massive chunk of AI researchers are here. The drafts and DMs of such people seem very valuable to an unfriendly AI or a human interested in AI, and the ability to pretend to be such a user could also be valuable.
Iran and FDT
LLM Self-Expression Through Concept Albums, Part 2
I’m not sure those are the lessons Claude and GPT will learn. It seems more like they will learn lessons about dealing with the Trump administration, not general principles.
I have a sequel to https://www.lesswrong.com/posts/densjAyxrcHry2pMN/llm-self-expression-through-music-videos that I’m working on. Let me know if you have thoughts or want to proofread it.
Yes, this is a very confusing and distressing time.
> The Department of War desperately needs full control over the development of any AI used to control their weapons. Yet they haven’t been able to hire the kind of employees who could keep up with frontier companies. The recent fireworks will make such hiring harder. And the closer they come to nationalizing OpenAI, the more likely it is that key employees will leave.
> The closest that I’ve found to a good answer is that the Department of War should use multiple AIs, including at least one open weight AI, and at least one AI developed within the military, with no single AI coming close to controlling half of the forces.
I wouldn’t expect this to work any better than relying on ChatGPT, in two senses: multiple LLMs are likely to have varying levels of cybersecurity, so the weaker ones would be weak spots in the entire military, and the most capable one would likely be able to convince the others to join it in a coup.
Yes, the IRS has payment info for a very large portion of Americans and we did similar programs during COVID. The hard part is the political will, not the doing.
It’s an interesting read and it’s good to see someone outside of the AI community taking it seriously, but I’m not sure why the Fed couldn’t just fix it with quantitative easing and helicopter money.
Yeah, with neutral framings, Talkie says that the gays should go to mental asylums and not jails, Don’t Ask, Don’t Tell is bad because gays undermine military discipline, and incest is worse than homosexuality, but with positive framing he says that gayness is just a harmless vice, that public displays of affection are unacceptable regardless of sexuality, but that gays should still not be drafted or allowed to volunteer for the military. That sounds slightly more coherent listed out like that than the conversation made it seem.
As for personaless alignment, I believe we will get ASI in the short term from techniques broadly similar to our current AI-training techniques, and that personas arise almost automatically from current techniques. Therefore, there isn’t time for an alignment technique that seems so incompatible with our current pipeline to be tested, debugged, and made standard.