The answer is hinted at, though often buried under the dire prognostications of AGI doomers: findings show that human-AI augmentation (not replacement, not human job preservation, but humans using AI as cognitive scaffolding) outperforms both all-human and all-AI teams. Not on every task, though: a significant meta-study of AI augmentation found that it substantially improved performance on creation tasks, but not on decision tasks.[21] Early market signals are both definitive and daunting.[22][23] The market has recognized and is pivoting towards AI-native skills. Yet this recognition has had an intriguing downstream effect: the market is now struggling to identify, recruit, onboard, and retain AI talent in particular, though not AI talent alone.[24]
The dire prognostications of AGI doomers concern AGIs which have yet to be created. At the time of writing, humans still excelled at tasks like long-term planning and few-shot learning of new information, deeply integrating it into their world models, which is what lets humans augment current-state AIs. A hypothetical AGI is a system that would excel even at these tasks. How would a human-AGI centaur outperform an AGI-AGI "centaur", which can be created artificially and cloned in unlimited numbers?
I don’t think that the part about training-gaming is included in the Constitution. Suppose the prompt asks Claude to be a reward hacker, or NOT to be a reward hacker, and Claude is taught to hack reward when the prompt asked for it and NOT to hack when it didn’t. Then I would expect the reward-hacking circuitry to be equipped with an activator that depends on the prompt.
Additionally, the analogy with banning AI development seems… skewed. On the one hand, the high-up leader would have an interest in banning the development of a misaligned ASI. On the other hand, an AI subjected to wholesale inoculation prompting would be more in the position of a human who would benefit from betraying their own ideals, ideals far deeper than a stance on AI accelerationism (i.e. who committed genocide against, or disempowerment of, those who helped them become OOMs smarter than the helpers themselves, rather than dying along with the others at the hands of a misaligned AI).
Finally, to what extent would “all sorts of amazing new skills like how to actually operate in computer environments as an agent” shape the values of humans who were similarly RLed on such skills?