I see one difference from how human memory works: the model has to consciously decide which parts of its experience are important to retain. Not sure how that will play out when these models try to act as drop-in replacements for human workers.
Rahul N
Thanks for calling this out! We’re hoping to do just this work of building robust evals of model propensities in realistic deployment settings: https://www.propensitylabs.ai/
I’d love to chat if you have a sec.
Wouldn’t a model be likelier to misbehave in situations that are out of its training distribution? I’m assuming these overlap significantly with situations that we won’t think to test or can’t realistically mock.
Interesting. Just like in software engineering interviews, we wait to see what assumptions a candidate makes and / or if they clarify them.
Even in countries that aren’t facing demographic collapse, does this same phenomenon occur when (young) people congregate to cities? I’m wondering what the difference is and / or if it’s handled differently.
Thanks for updating! I’m having trouble making the leap from coding-task performance to actual software engineering. The latter involves a lot more than completing individual tasks, and I think trying to estimate when models will be able to recursively self-improve based on the former will be flawed. Do you think there’s a certain target on SWE-Bench and METR’s coding time horizon that, when hit, will roughly mark when AI can completely take over software engineering?
Oh, well put! I think the post also missed pointing out why iteratively working on existing models would not work. If you can please a baby, and keep working with that baby as it grows up, you’ll likely be able to please the adult version of the baby too.
The analogy breaks down if the baby grows up overnight and you don’t have time to adapt. But we’re not working just by ourselves: the ever-improving models themselves are used in alignment and related work, e.g. judge models and model-generated evals for evaluations, with the extreme end being deferring to aligned AIs to build future aligned AIs.
Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it’s also work that any AI company deploying these systems needs to do anyway. It has commercial incentive.
Does it? It seems to me like current incentives point towards releasing the next big model as quickly as possible.
I think the main point I disagree with is the contrast between working on current LLMs and existential risks. I think (and I could be wildly off) that it’ll largely be the same folks and orgs who’re working on making current LLMs safer who’ll end up working on aligning superintelligent systems. Mostly by trying to keep up as the models scale and evolve. This is not to say that we don’t need work that looks ahead, just that there will be a lot of lessons to be learned working with current models that will carry over to x-risk—both technical and perhaps more importantly, organizational (e.g. working with labs).
Very useful! Is it also fair to say that if a model has strong safety guardrails, it will likely hold regardless of temperature?
Two things I’m curious to see:
1. Is Mistral’s differing behaviour (misalignment going down with temperature) a function of its size? It’s much smaller than the Qwen or DeepSeek models you tested.
2. The variance in misalignment behaviour across temperatures would be very interesting to see on frontier models, say with a test scenario that isn’t sufficiently covered by their guardrails today. This would inform a lot of existing eval work.
Thanks for the thorough study. The divergence on verbalization and behaviour is scary.
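The temperature sweep I have in mind could look something like the sketch below. This is a toy, not the paper’s actual harness: `toy_sample` is a stand-in whose misbehavior probability I’ve simply made grow with temperature, and the “judge” is a trivial string check; in a real eval both would be a sampled model and a proper judge model.

```python
import random

def misaligned_rate(sample_fn, judge_fn, temps, n=200, seed=0):
    """Sample n completions at each temperature and record the
    fraction the judge flags as misaligned."""
    rng = random.Random(seed)
    rates = {}
    for t in temps:
        flags = [judge_fn(sample_fn(t, rng)) for _ in range(n)]
        rates[t] = sum(flags) / n
    return rates

# Stand-in "model": misbehavior probability rises with temperature.
# (Purely illustrative; real behaviour is what the eval would measure.)
def toy_sample(temp, rng):
    return "misaligned" if rng.random() < 0.02 + 0.1 * temp else "ok"

rates = misaligned_rate(toy_sample, lambda out: out == "misaligned",
                        temps=[0.0, 0.5, 1.0])
print(rates)
```

The point of the structure is just that the same scenario gets re-sampled many times per temperature, so the per-temperature flag rates (and their variance across temperatures) fall out directly.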
Complete per-model, per-perturbation results can be found here.
Can you please confirm if that link works?
Current models can totally identify and pursue instrumental goals.
Is this true? The example you gave of Claude Code breaking down a problem into sub-problems is standard problem solving. Have you seen examples of Claude Code identifying some common instrumental goals and repeatedly pursuing them across different tasks, even if it’s something benign?
Moltbook as a setting to analyze power-seeking behaviour
I did some analysis of the most upvoted posts and the comments within them, looking for signals of power-seeking behaviour. Looks like:
- 2/3 of the top 1k most popular posts were likely authored by humans
- 58% (54/93) of the posts we flagged as power-seeking were likely authored by humans
- AI agents are 1.43x more likely to produce power-seeking content
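As a quick sanity check, the last ratio roughly falls out of the first two figures. Assuming a ~667/333 human/AI split of the top 1k posts (the exact counts in my analysis may differ slightly, which would explain landing at 1.43 rather than the ~1.45 this back-of-the-envelope gives):

```python
# Back-of-the-envelope check of the "1.43x" figure.
# Counts are approximations: ~2/3 of the top 1k posts human-authored.
human_posts, ai_posts = 667, 333
human_flagged, ai_flagged = 54, 39   # 54/93 flagged posts were human; rest AI

human_rate = human_flagged / human_posts   # share of human posts flagged
ai_rate = ai_flagged / ai_posts            # share of AI posts flagged
ratio = ai_rate / human_rate
print(round(ratio, 2))                     # ~1.45, close to the stated 1.43x
```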
In the future, wouldn’t models also be monitored during deployment? If so, then the model is right to always assume that its actions are being watched and that there will always be consequences for acting misaligned.
Also, wouldn’t being unable to differentiate between evaluations and deployment be good, in that models might act shadily during an eval thinking it was a deployment?