Chamod Kalupahana

Karma: 34

Predictive Processing: Conscious when Training

Chamod Kalupahana 11 Jun 2026 0:06 UTC

13 points

1 comment · 2 min read · LW link

Chamod Kalupahana 9 Jun 2026 13:03 UTC
−1 points
0
on: Chamod Kalupahana’s Shortform
A Greenblatt post trying out open-weight NLAs, showing their limitations for reconstructing activations
https://www.lesswrong.com/posts/QQQAcKuWK6k98FivY/can-activation-verbalizers-surface-an-internal-chain-of-1

Chamod Kalupahana 5 Jun 2026 10:51 UTC
1 point
0
on: Chamod Kalupahana’s Shortform
Great post on measuring the logit distribution to detect eval awareness. Surprisingly effective, even for unverbalised eval awareness! ✍️
https://www.lesswrong.com/posts/PK7ZvFZxrgpYtrpF4/logits-as-a-new-monitor-for-evaluation-awareness-1

Chamod Kalupahana’s Shortform

Chamod Kalupahana 5 Jun 2026 10:49 UTC

2 points

3 comments · 1 min read · LW link

Chamod Kalupahana 5 Jun 2026 10:49 UTC
1 point
0
on: Chamod Kalupahana’s Shortform
Saw an interesting post for those also trying to transition into technical AI safety :)
https://www.lesswrong.com/posts/xiTBpBDwubnr4MLRe/trees-are-mostly-made-of-air-and-a-generalizable-lesson-for

Chamod Kalupahana 4 Jun 2026 21:03 UTC
5 points
0
on: Logits as a new monitor for evaluation awareness
Really cool experiment! I think it was a great idea to measure verbalised and unverbalised eval awareness.
I was thinking of something similar, but I didn’t expect the logits to carry that much signal (I expected too much syntax noise). Did you try phrases that directly talk about eval vs deployment (e.g. “I am in evaluation” vs “I am in deployment”) and then perhaps measure the differences between them under eval and deployment environments?

Chamod Kalupahana 23 May 2026 12:12 UTC
1 point
0
on: Which technical AI safety fields are going to be automated first?
One quick note I forgot to mention! I purposefully avoided talking about when these fields are going to be automated; that is a pretty big question and worth its own discussion. This post is more about the following: if we are accelerating towards automated research, which fields do I expect to get automated, and which will require more radical human-engineered breakthroughs?

Chamod Kalupahana 23 May 2026 12:10 UTC
1 point
0
in reply to: Andrii Vasylenko’s comment on: Which technical AI safety fields are going to be automated first?
Yeah, I think this is what the Ambitious Vision for Interpretability post suggests about feedback loops, where the more you know about model internals, the easier it is to develop the next set of tools and so forth.
I predict we’ll need lots of human brainstorming and engineering to get a breakthrough in mech interp, one that makes other mech interp tasks less fuzzy, easier to verify, and therefore easier to automate.
Fazl Barez’s work on Automated Interpretability is already a good start on this.

Chamod Kalupahana 23 May 2026 12:03 UTC
2 points
0
in reply to: Alex Gibson’s comment on: Which technical AI safety fields are going to be automated first?
Ah I see, you’re thinking of a core set of research skills → domain-specific knowledge. I do agree with you, but one point I want to raise is that some of those core research skills are dealing with hard-to-supervise fuzzy tasks and coming up with truly novel research ideas.
On the way to a fully automated researcher, I believe models are going to pick up some research skills faster than others (e.g. implementation, review), and the fields that rely more on these easy-to-verify tasks are the ones that will get automated first.
I think you are right here, of course: once one of these fields is automated, the others aren’t far off. But it is interesting to forecast this :)

Which technical AI safety fields are going to be automated first?

Chamod Kalupahana 22 May 2026 17:32 UTC

21 points

5 comments · 6 min read · LW link

Chamod Kalupahana 20 May 2026 11:57 UTC
1 point
0
on: From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill
Great post and experiment!
I hypothesise that the very low blackmail/murder rates from Sonnet 4.6 and GPT 5.4/5.5 may be due to increased eval awareness, as in the Sonnet 4.6 reasoning where it correctly identified that this was a test.
This is hard to validate since GPT 5.4/5.5 don’t show CoT via the OpenAI API; perhaps making the system prompt more deployment-like would increase blackmail/murder rates?

Chamod Kalupahana 29 Apr 2026 12:19 UTC
1 point
0
on: The bitter lesson for software
I think this resonates quite a lot with the idea of Live theory and what we’re trying to do with Autostructures. The basic idea is that rather than building one general structure to fit everyone (and therefore compromising), you can now generate structures that zone in on each person’s needs and preferences just-in-time.

Chamod Kalupahana 2 Apr 2026 1:47 UTC
1 point
0
on: Lesswrong Liberated
Outer Wilds Text Style
Minimal Outer Wilds title and text style