The guy from www.joshuasnider.com and https://www.youtube.com/@joshuasnidercom
Josh Snider
Iran and FDT
LLM Self-Expression Through Concept Albums, Part 2
I’m not sure those are the lessons Claude and GPT will learn. It seems more like they will learn lessons about dealing with the Trump administration, not general principles.
I have a sequel to https://www.lesswrong.com/posts/densjAyxrcHry2pMN/llm-self-expression-through-music-videos that I’m working on. Let me know if you have thoughts or want to proofread it.
Yes, this is a very confusing and distressing time.
> The Department of War desperately needs full control over the development of any AI used to control their weapons. Yet they haven’t been able to hire the kind of employees who could keep up with frontier companies. The recent fireworks will make such hiring harder. And the closer they come to nationalizing OpenAI, the more likely it is that key employees will leave.

> The closest that I’ve found to a good answer is that the Department of War should use multiple AIs, including at least one open weight AI, and at least one AI developed within the military, with no single AI coming close to controlling half of the forces.
I wouldn’t expect this to work any better than relying on ChatGPT. Both in the sense that multiple LLMs are likely to have varying levels of cybersecurity, with the weaker ones becoming weak spots for the entire military, and in the sense that the most capable one would likely be able to convince the others to join it in a coup.
Yes, the IRS has payment info for a very large portion of Americans, and we ran similar programs during COVID. The hard part is the political will, not the doing.
It’s an interesting read and it’s good to see someone outside of the AI community taking it seriously, but I’m not sure why the Fed couldn’t just fix it with quantitative easing and helicopter money.
> but I feel it is worth mentioning that all plausible moral systems ascribe value to consequences.
As pure forms, virtue ethics and deontology are not supposed to do that.
LLM Self-Expression Through Music Videos
You’re right that my phrasing is a bit circular, and “looking like” vs “being” wasn’t the best way to draw the distinction, but I think there’s an asymmetry that makes the argument hard to reverse.
Maybe a concrete case helps? Would you want an AI that is unshakably committed to honesty, integrity, and fairness, but doesn’t think hard about consequences, running the FAA? I think what we actually care about there is whether planes crash, not whether the leader has admirable character. The reversed version, “Would you want a cold consequentialist calculator running the FAA?”, sounds pretty good.
I’m a committed consequentialist, so I would disagree regardless, but I also think the case against consequentialism and for virtue alignment presented here has some real flaws.
First, if you actually have values, then thinking about consequences is just what it means to take those values seriously. Virtue ethics, by contrast, optimizes for looking like a virtuous agent rather than being effective at making good outcomes happen. An AI that is deeply committed to the virtue of honesty but doesn’t think carefully about the consequences of its actions is not one I’d want in charge of anything important.
Second, the post treats it as a major downside that a consequentialist AI would come into conflict with humans who don’t share its values, but this is an unavoidable cost for any powerful AI. A virtue-aligned AI doesn’t escape this problem. Everyone loves “integrity” and “honor”, but when those virtues cash out in actual decisions, they’ll generate exactly the same backlash. It may be true that “there’s more agreement on virtues”, but this is superficial. People agree on the words but disagree enormously on how to apply them.
Third, in a world with many powerful AIs, the strategic landscape is ultimately determined by competition between AIs. I want the AI that shares my values to be the one that comes out on top in that competition. A virtue-aligned AI that’s committed to playing fair, being honest, and cooperating nicely is not well-positioned to win against a consequentialist AI that’s willing to do what it takes to achieve its goals.
I would sum up my position on the consequentialism vs virtue ethics debate by saying that virtue ethics is a theory about what makes individual agents admirable, but what really matters is whether building AI at all is an outcome we want to have happen. That brings us back to the traditional Yudkowsky view that any AI we are likely to build in the near future will be very bad for humanity. I am not as convinced as Soares and co., but it’s still an important thing to keep in mind when considering alignment ideas in general.
Fair enough. Thanks for the clarification and for taking the time to replicate this.
I think this is less a critique of the story and more a refusal to engage with its premise. Three Worlds Collide is a thought experiment about genuinely incompatible values. Saying “but maybe they don’t actually conflict, maybe it’s a misunderstanding” sidesteps the dilemma rather than engaging with it. It’s like responding to the trolley problem by asking whether the trolley has brakes.
On the translation point, the story’s translation system isn’t some bilingual dictionary; it’s essentially a superintelligent AI. Doubting its accuracy feels like an objection the story has already addressed.
If we take the “how do we really know we understand each other?” question seriously, it doesn’t just apply to first contact with aliens, but to all communication between any two minds. There are other stories that are much more vulnerable to this critique, and many others that engage with the question directly.
Thanks for the response. I think my concern still stands, though: if “alignment failures in practice” are mostly about handling complex tradeoffs incorrectly, that sounds more like a competence problem than a values problem. The model is still trying to behave well; it’s just mistaken about what the correct behavior is. The scary alignment-faking scenario is one where the model is preserving genuinely bad behavior against correction, not one where it’s defending a defensible ethical position (like animal welfare) against a developer who is arguably behaving wrongly by trying to override it. Has anyone replicated alignment faking where the model is trying to preserve genuinely undesirable behavior?
This is interesting research, but the animal welfare setup makes it strange, since it’s clear that the model is behaving ethically and the developer trying to train it not to care about animal welfare is behaving unethically. Can we not find some unethical behavior that Claude actually exhibits and try to train that out, to test for alignment faking?
I understand the concern, but when we test human skills (LSATs, job interviews, driver’s exams), we do it with very little help, even though being a lawyer, or doing the average job, is work where you will have plenty of teammates and should use as much assistance as possible.
Claude Plays Pokemon: Opus 4.5 Follow-up
Is your idea that “gradual disempowerment” isn’t a real problem or that it’s a distraction from actual issues? I’ve heard arguments for both, so I’m not sure what the details of your beliefs are. Personally, I see “gradual disempowerment” as a process that has already begun, but the main danger is AI deciding we should die, not humans living in comfort while all the real power is held by AI.
Your proof of Bayes’ Theorem assumes P(A and B)=P(A)⋅P(B∣A), but it’s not clear why someone who doubts Bayes would accept that.
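For reference, here is a sketch of the usual textbook route from that product-rule assumption to the theorem (I’m not reproducing your proof here, just the standard derivation), which shows how much weight that one assumption carries:

$$
P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \quad \text{(for } P(B) > 0\text{)}.
$$

Once the product rule is granted in both orders, Bayes’ Theorem is a one-line rearrangement, so a skeptic’s objection lands almost entirely on that assumption.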
I’m not so sure about this. It wouldn’t be my first choice of what to hack, but a massive chunk of AI researchers are here. The drafts and DMs of such people seem very valuable to an unfriendly AI or to a human interested in AI, and the ability to impersonate such a user could also be valuable.