Kvee

Karma: 1,496

Kvee 12 Jun 2026 20:13 UTC
2 points
0
on: PSA: Almost nobody is directly working on superintelligent alignment
Well said! I’ve actually been working on a similar post.

At AE Studio and the AI Alignment Foundation, we are also actually working on AI alignment, with various “neglected approaches.”
In particular, we are interested in approaches that will survive recursive self-improvement and that carry negative alignment taxes.

Kvee 5 May 2026 0:46 UTC
5 points
−1
in reply to: Vaniver’s comment on: Intelligence Dissolves Privacy
If this is already the case for more public figures, it could be the case for everyone else very soon.

Realizing “it might help you to start behaving as if you’re being watched and things about you are more obvious than they once were” is a pretty major update that will shock the world when it becomes necessary.
Interestingly, in the book The Truth Machine by James Halperin (which I’d recommend, and which made me much more like this than I’d otherwise be after I read it at age 8), there is a period of amnesty for all past crimes, meant to handle the disruption a perfect truth machine causes to society.

Kvee 5 May 2026 0:29 UTC
3 points
1
in reply to: programjames’s comment on: You Are Not Immune To Mode Collapse
This is great!
Interestingly, scalar legibility turns funding into low-temperature search, overpricing exploitative H-side work and underpricing entropy-like work that expands the future option space.
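To make “low-temperature search” concrete, here is a minimal sketch, assuming funders split a budget across projects with a softmax over a single scalar score. The scores and temperatures are made-up illustration values, and `allocate` and `entropy` are hypothetical helpers, not anything from the original post.

```python
import math

def allocate(scores, temperature):
    """Softmax split of a budget across projects, given one scalar score each."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(shares):
    """Shannon entropy (in nats) of an allocation; higher means more spread out."""
    return -sum(p * math.log(p) for p in shares if p > 0)

# Hypothetical legible scores for four projects (assumed values).
scores = [1.0, 1.2, 0.9, 1.1]

low_temp = allocate(scores, temperature=0.05)   # near-greedy: almost everything to the top score
high_temp = allocate(scores, temperature=1.0)   # exploratory: budget stays spread out

print([round(p, 3) for p in low_temp])
print([round(p, 3) for p in high_temp])
print(entropy(low_temp) < entropy(high_temp))
```

At low temperature, small differences in the scalar score translate into nearly winner-take-all funding, which is the sense in which exploration gets underpriced.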

Kvee 18 Mar 2026 0:43 UTC
3 points
1
in reply to: Mike Vaiana’s comment on: AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open
Great point!
The reasoning in the original comment, while coming from a place of genuine moral seriousness, substitutes moral purity for causal modeling. Advanced AI development continues under competitive pressure regardless of whether alignment researchers participate. Opting out just weakens alignment properties in the systems that get deployed anyway. This is differential technological development where the selection effect runs in exactly the direction we should least want.

There is a world where alignment researchers refuse to touch anything funded by the Department of War. In that world, does the Department of War stop building AI systems? No. Obviously not. They build them anyway, with whatever alignment properties the remaining talent pool manages to produce, which is to say, fewer and worse ones. You have now brought about the exact outcome you were trying to prevent, and you did it by optimizing for the feeling of clean hands.

The Department of War has both the incentive and the budget to solve alignment in ways the frontier labs currently don’t, because it can’t deploy systems that pursue hidden objectives or behave unpredictably under distribution shift.

The proposal to instead endow AI with “strong and firm moral principles, like the values of peace and lawful behavior” is great and all, but it is not a technical proposal. It is a wish. Wishes do not constrain optimization processes. If they did, we would not need an alignment research community at all. We could simply write “be good” in the loss function and go home.
Instead of optimizing for clean hands, we should be asking “does this research, if successful, reduce the probability of catastrophic outcomes from advanced AI systems?” At this point, that’s really all that matters.

AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open

Mike Vaiana, Diogo de Lucena and Kvee

16 Mar 2026 21:44 UTC

67 points

8 comments · 4 min read · LW link

Kvee 6 Mar 2026 8:09 UTC
2 points
0
on: The case for AGI safety products
One lens that seems useful here is a negative alignment tax.
Some alignment work increases reliability, observability, and control of AI systems. Those properties increase the economic value of deploying AI, which creates incentives for organizations to invest in alignment capabilities as systems scale. That creates positive selection pressure for alignment work itself.
This dynamic also produces an ecosystem effect. As alignment-driven companies scale, alignment knowledge compounds inside teams, talent pipelines form around safety work, and capital flows toward technologies that make AI systems more understandable and governable.
Safety products matter partly because of the tools they create and partly because they change the selection pressures shaping the AI ecosystem.
I wrote about the negative alignment tax here and about alignment-driven startups here. The combination of those two ideas seems like one of the strongest arguments for AGI safety products.

Kvee 17 Apr 2025 16:51 UTC
2 points
0
in reply to: mruwnik’s comment on: Why Have Sentence Lengths Decreased?
Cool point, yes, seems right!

Kvee 6 Apr 2025 10:28 UTC
17 points
0
on: Why Have Sentence Lengths Decreased?
Reading this post, my immediate hunch is that the decline in sentence lengths has a lot to do with the historical role of Latin grammar and how deeply it influenced educated English writers. Latin inherently facilitates longer, complex sentences due to its use of grammatical inflections, declensions, and verb conjugations, significantly reducing reliance on prepositions and conjunctions. This syntactic flexibility allowed authors to naturally craft extensive yet smooth-flowing sentences. Latin’s liberating lack of fixed word order and its fun little rhetorical devices combine to support nuanced, flexible thinking. From my own experience studying Latin from 7th through 12th grade, I find this sort of stuff contributes significantly to freer, more expansive expression when writing or speaking in English, and I can often immediately tell when speaking with or reading something written by someone else who studied Latin. An easy “tell” is when they say “having done x.”
Educated English writers historically learned Latin as a foundational part of their education, internalizing this syntactic complexity. As a result, English prose from authors like Chaucer, Samuel Johnson, and Henry James shows a clear preference for hypotaxis, complex sentences with nested subordinate clauses, rather than simpler paratactic structures consisting of shorter, sequential clauses.
The practical advantage of these complex sentence structures is the precise communication of nuanced and sophisticated ideas. Longer sentences enabled authors to maintain coherent, detailed arguments and descriptions within a single cohesive thought. I see this as reflecting “transcription fluency,” where authors aim for fidelity in translating their complex internal thought processes directly into prose, trusting readers’ intelligence and attention span to engage deeply.
Here’s a fun example from Thoreau’s “Walden,” which makes it clear that such elaborate writing was intended to be understood even by poorer and less formally educated readers. Consider the following (just) two sentences:
“I have no doubt that some of you who read this book are unable to pay for all the dinners which you have actually eaten, or for the coats and shoes which are fast wearing or are already worn out, and have come to this page to spend borrowed or stolen time, robbing your creditors of an hour. It is very evident what mean and sneaking lives many of you live, for my sight has been whetted by experience; always on the limits, trying to get into business and trying to get out of debt, a very ancient slough, called by the Latins æs alienum, another’s brass, for some of their coins were made of brass; still living, and dying, and buried by this other’s brass; always promising to pay, promising to pay, tomorrow, and dying today, insolvent; seeking to curry favor, to get custom, by how many modes, only not state-prison offences; lying, flattering, voting, contracting yourselves into a nutshell of civility or dilating into an atmosphere of thin and vaporous generosity, that you may persuade your neighbor to let you make his shoes, or his hat, or his coat, or his carriage, or import his groceries for him; making yourselves sick, that you may lay up something against a sick day, something to be tucked away in an old chest, or in a stocking behind the plastering, or, more safely, in the brick bank; no matter where, no matter how much or how little.”

Mistral Large 2 (123B) seems to exhibit alignment faking

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Kvee, Mike Vaiana and Trent Hodgeson

27 Mar 2025 15:39 UTC

82 points

4 comments · 13 min read · LW link

Reducing LLM deception at scale with self-other overlap fine-tuning

Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Kvee, Cameron Berg, Mike Vaiana and Trent Hodgeson

13 Mar 2025 19:09 UTC

162 points

49 comments · 6 min read · LW link

Kvee 25 Feb 2025 1:48 UTC
4 points
2
in reply to: Davidmanheim’s comment on: Alignment can be the ‘clean energy’ of AI
Yes, I am hopeful we have enough time before superintelligent AI systems are created to implement effective alignment approaches. I don’t know if that is possible or not, but I think it is worth trying.
Given uncertainty about timelines and currently accelerating capabilities, it would be preferable to live in a world where we are making sure alignment advances more than otherwise.

Alignment can be the ‘clean energy’ of AI

Cameron Berg, Kvee and Trent Hodgeson

22 Feb 2025 0:08 UTC

69 points

8 comments · 8 min read · LW link

Making a conservative case for alignment

Cameron Berg, Kvee, phgubbins and Trent Hodgeson

15 Nov 2024 18:55 UTC

208 points

67 comments · 7 min read · LW link

Science advances one funeral at a time

Cameron Berg, Kvee, Diogo de Lucena and Trent Hodgeson

1 Nov 2024 23:06 UTC

105 points

9 comments · 2 min read · LW link

Self-prediction acts as an emergent regularizer

Cameron Berg, Kvee, Mike Vaiana, Diogo de Lucena, florin_pop and Trent Hodgeson

23 Oct 2024 22:27 UTC

92 points

9 comments · 4 min read · LW link

Kvee 19 Sep 2024 5:28 UTC
9 points
4
in reply to: Robert Cousineau’s comment on: The case for a negative alignment tax
I think this is precisely the reason that you’d want to make sure the agent is engineered such that its utility function includes the utility of other agents—ie, so that the ‘alignment goals’ are its goals rather than ‘goals other than [its] own.’ We suspect that this exact sort of architecture could actually exhibit a negative alignment tax insofar as many other critical social competencies may require this as a foundation.
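As a toy illustration only (not the architecture we have in mind), here is a minimal sketch of a utility function that includes other agents’ utility as a weighted term. The `weight` parameter, the action names, and the payoff numbers are all hypothetical.

```python
def other_regarding_utility(own_payoff, others_payoffs, weight):
    """Agent's utility: its own payoff plus a weighted sum of other agents' payoffs."""
    return own_payoff + weight * sum(others_payoffs)

def best_action(actions, weight):
    """Pick the action name that maximizes the other-regarding utility."""
    return max(
        actions,
        key=lambda name: other_regarding_utility(actions[name][0], actions[name][1], weight),
    )

# Hypothetical payoffs: action -> (own payoff, list of other agents' payoffs).
actions = {
    "defect": (3.0, [0.0]),
    "cooperate": (2.0, [2.0]),
}

print(best_action(actions, weight=0.0))  # purely self-interested agent
print(best_action(actions, weight=1.0))  # agent that counts others' utility as its own
```

With `weight=0.0` the agent picks "defect"; with `weight=1.0` cooperating is simply what maximizes its own utility, so the aligned choice is not a goal imposed from outside.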

Kvee 19 Sep 2024 5:25 UTC
10 points
0
in reply to: Tao Lin’s comment on: The case for a negative alignment tax
I think this risks getting into a definitions dispute about what concept the words ‘alignment tax’ should point at. Even if one grants the point about resource allocation being inherently zero-sum, our whole claim here is that some alignment techniques might indeed be the most cost-effective way to improve certain capabilities and that these techniques seem worth pursuing for that very reason.

Kvee 19 Sep 2024 5:03 UTC
15 points
2
in reply to: Seth Herd’s comment on: The case for a negative alignment tax
Thanks for this comment! Definitely take your point that it may be too simplistic to classify entire techniques as exhibiting a negative alignment tax when tweaking the implementation of that technique slightly could feasibly produce misaligned behavior. It does still seem like there might be a relevant distinction between:
1. Techniques that can be applied to improve either alignment or capabilities, depending on how they’re implemented. Your example of ‘System 2 alignment’ would fall into this category, as would any other method with “the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero,” as you put it.
2. Techniques that, by their very nature, improve both alignment and capabilities simultaneously, where the improvement in capabilities is not just a potential side effect or alternative application, but an integral part of how the technique functions. RLHF (for all of its shortcomings, as we note in the post) is probably the best concrete example of this—this is an alignment technique that is now used by all major labs (some of which seem to hardly care about alignment per se) by virtue of the fact it so clearly improves capabilities on balance.
  1. (To this end, I think the point about refusing to do unaligned stuff as a lack of capability might be a stretch, as RLHF is much of what is driving the behavioral differences between, eg, gpt-4-base and gpt-4, which goes far beyond whether, to use your example, the model is using naughty words.)
We are definitely supportive of approaches that fall under both 1 and 2 (and acknowledge that 1-like approaches would not inherently have negative alignment taxes), but it does seem very likely that there are more undiscovered approaches out there with the general 2-like effect of “technique X got invented for safety reasons—and not only does it clearly help with alignment, but it also helps with other capabilities so much that, even as greedy capitalists, we have no choice but to integrate it into our AI’s architecture to remain competitive!” This seems like a real and entirely possible circumstance where we would want to say that technique X has a negative alignment tax.
Overall, we’re also sensitive to this all becoming a definitions dispute about what exactly is meant by terminology like ‘alignment taxes,’ ‘capabilities,’ etc, and the broader point that, as you put it, “you can advance capabilities and alignment at the same time, and should think about differentially advancing alignment” is indeed a good key general takeaway.

The case for a negative alignment tax

Cameron Berg, Kvee, Diogo de Lucena and Trent Hodgeson

18 Sep 2024 18:33 UTC

79 points

22 comments · 7 min read · LW link

Kvee 16 Aug 2024 20:35 UTC
8 points
0
on: The Bar for Contributing to AI Safety is Lower than You Think
Interesting relevant finding from the alignment researcher + EA survey we ran:
We also find in both datasets—but most dramatically in the EA community sample, plotted below—that respondents vastly overestimate (≈2.5x) how much high intelligence is actually valued, and underestimate other cognitive features like having strong work ethics, abilities to collaborate, and people skills. One potentially clear interpretation of this finding is that EAs/alignment researchers actually believe that high intelligence is necessary but not sufficient for being impactful—but perceive other EAs/alignment researchers as thinking high intelligence is basically sufficient. The community aligning on these questions seems of very high practical importance for hiring/grantmaking criteria and decision-making.