kaivu

Karma: 257

kaivu 17 Mar 2026 4:06 UTC
6 points
0
in reply to: Karl Krueger’s comment on: The bitter lesson for software
I agree this is worrying. I don’t think the agentification of software is necessarily good: it will likely leave humans with a worse understanding of how the processes around them work, which is particularly bad as we move towards building ASI. I just think it’s pretty likely to happen.

The bitter lesson for software

zef, rohuang and kaivu

16 Mar 2026 23:38 UTC

15 points

2 comments · 2 min read · LW link

(fulcruminc.substack.com)

kaivu 14 Mar 2026 18:29 UTC
2 points
0
on: AIs will be used in “unhinged” configurations
Very much agreed. I think this could be exacerbated by the fact that many deployment prompts are themselves optimized (DSPy, prompt ablations, etc.), and optimized prompts tend to be weird. For now this applies to prompts, but plausibly it will soon apply to agent scaffolds and multi-agent configurations too. All of this is very out-of-distribution (OOD) relative to the training process.

More is different for intelligence

zef, rohuang and kaivu

7 Mar 2026 0:02 UTC

17 points

0 comments · 2 min read · LW link

(fulcruminc.substack.com)

kaivu 27 Jan 2026 0:34 UTC
2 points
0
in reply to: Hyperion’s comment on: The case for AGI safety products
n=2, but I didn’t observe any investor caring that we (Fulcrum Research) were a PBC when raising at the seed stage, and I think the same is true of Theorem Labs (another safety startup).

kaivu 5 Jan 2026 9:40 UTC
3 points
0
on: An Aphoristic Overview of Technical AI Alignment proposals
Before reading your disclaimer that Claude helped with the aphorisms, I thought the post read a bit like AI slop.
One example is the line “principles must survive power”. Granted, that’s a true description of a goal one might have in alignment, but the spirit of Constitutional AI is that the model can judge whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
I still like the motivating question, and I will check out Epictetus now!

kaivu 29 Dec 2025 15:52 UTC
4 points
0
on: The Weakest Model in the Selector
Do you have concrete examples of this working? Seems plausible, just curious about the current attack surface.
My intuition is reasoning fixes this well before models are robust in other ways—the OODness of the previous acceptance gets mitigated by model-generated CoT. RL on traces with more context switching would probably also help. I don’t think you need models with self-awareness and stable goals to solve this; seems like a more mundane training distribution issue.
Interesting though, and seems like a more general version of this problem is pretty hard (e.g. for agent decision making).

kaivu 26 Dec 2025 16:42 UTC
9 points
10
on: Unknown Knowns: Five Ideas You Can’t Unsee
Interesting post! I am a bit confused about the application of the local linearity principle to the tax bracket issue.

The content of “differentiable functions are locally linear” is something like: if you nudge an input by ε, the output changes by ~f’(x)·ε. If you double your nudge, you approximately double your effect (to first order). I enjoyed the small goods insurance and altruism examples, since local curvature is negligible at the relevant scale (for altruist utility, and relatively unimportant decisions).

But the tax bracket example doesn’t really make sense to me. The feared bad outcome (“being pushed into a higher bracket makes me worse off”) is compatible with local linearity (in the sense that the segments are linear), since you could just have f’(x) < 0. What rules out the bad outcome is that marginal tax rates are positive and less than 100%, not anything about smoothness or local approximation.
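To make the point concrete, here is a minimal sketch, assuming f is take-home income as a function of gross income under a marginal-rate schedule. The bracket thresholds and rates are hypothetical, chosen only for illustration, and the names `tax_owed` and `take_home` are not from the post. Both schedules below are piecewise linear, but only the one with a marginal rate above 100% makes a raise leave you worse off.

```python
# Hypothetical schedules: (lower threshold, marginal rate). Not real tax law.
NORMAL = [(0, 0.10), (10_000, 0.20), (40_000, 0.40)]
CONFISCATORY = [(0, 0.10), (10_000, 1.20)]  # marginal rate above 100%


def tax_owed(income, brackets):
    """Each rate applies only to the slice of income inside its bracket."""
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        # The upper edge of a bracket is the next bracket's threshold, if any.
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        if income > lower:
            tax += (min(income, upper) - lower) * rate
    return tax


def take_home(income, brackets):
    """f(x): gross income minus tax. Slope on each segment is 1 - marginal rate."""
    return income - tax_owed(income, brackets)


# Crossing the 40,000 threshold under NORMAL still raises take-home pay,
# because the slope there is 1 - 0.40 > 0.
crossing_helps = take_home(40_001, NORMAL) > take_home(40_000, NORMAL)

# Under CONFISCATORY the segments are just as linear, but the slope
# 1 - 1.20 is negative, so earning more leaves you with less.
raise_hurts = take_home(11_000, CONFISCATORY) < take_home(10_000, CONFISCATORY)
```

In this sketch the only thing separating the two cases is the sign of the slope, 1 minus the marginal rate, which is the condition the comment identifies.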

Introducing Lunette: auditing agents for evals and environments

zef, leni and kaivu

15 Dec 2025 23:17 UTC

23 points

0 comments · 1 min read · LW link

(fulcrumresearch.ai)

Automated real time monitoring and orchestration of coding agents

zef, kaivu and leni

23 Oct 2025 22:12 UTC

8 points

0 comments · 2 min read · LW link

(fulcrumresearch.ai)

AI agents and painted facades

leni, zef and kaivu

30 Aug 2025 23:13 UTC

38 points

3 comments · 2 min read · LW link

(fulcrumresearch.ai)

kaivu 20 Jul 2024 9:30 UTC
3 points
0
in reply to: Martín Soto’s comment on: Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
Thanks for bringing this up: this was a pretty confusing part of the evaluation.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).
You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn

8 Jul 2024 22:24 UTC

109 points

40 comments · 5 min read · LW link · 1 review

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Tony Wang, Miles Wang and kaivu

15 Dec 2023 11:05 UTC

34 points

8 comments · 10 min read · LW link

Update on Harvard AI Safety Team and MIT AI Alignment

Xander Davies, Sam Marks, kaivu, tlevin, leni, maxnadeau and Naomi Bashkansky

2 Dec 2022 0:56 UTC

60 points

4 comments · 8 min read · LW link