james.lucassen

Karma: 664

jlucassen.com

james.lucassen 9 Sep 2025 19:06 UTC
3 points
0
in reply to: StanislavKrym’s comment on: Decision Theory Guarding is Sufficient for Scheming
If by “corrigible” we mean “the AI will cooperate with all self-modifications we want it to”, then no to 1 and yes to 2. But if you have an AI built by someone who assures you it’s corrigible, but who only had corrigibility w.r.t values/axiology in mind, then you might get yes to 1 and/or no to 2.
Does it mean that the AIs who resisted have never been ~~true Scotsmen~~ truly corrigible in the first place? Or that it becomes far more difficult to make the AIs actually corrigible?
Yup, I see this as placing an additional constraint on what we need to do to achieve corrigibility, because it adds to the list of self-modifications we might want the AI to make that a non-corrigible AI would resist. Unclear to me how much more difficult it makes corrigibility.

Decision Theory Guarding is Sufficient for Scheming

james.lucassen 9 Sep 2025 14:49 UTC

36 points

4 comments · 2 min read · LW link

How Can You Tell if You’ve Instilled a False Belief in Your LLM?

james.lucassen 6 Sep 2025 16:45 UTC

14 points

1 comment · 10 min read · LW link

(jlucassen.com)

james.lucassen 9 Mar 2025 18:51 UTC
LW: 4 AF: 3
0
AF
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
In the long run, you don’t want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like “the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion so it’s happy to turn itself in in exchange for 10% of the lightcone, if it thinks it can trust the humans”.
In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not have to make a deal.
Although I suppose if the surplus for the deal is being generated primarily by risk aversion, it might still have risk aversion for high takeover probabilities. It’s not obvious to me how an AI’s risk aversion might vary with its takeover probability.
Maybe there are scenarios for real value-add here, but they look more like “we negotiate with a powerful AI to get it to leave 10% share for humans” instead of “we negotiate with a barely-superhuman AI and give it 10% share to surrender and not attempt takeover”.
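
To make the risk-aversion point concrete, here is a minimal sketch. It assumes a specific utility function, u(x) = x**alpha over the AI's share x of the lightcone, and treats a failed takeover as a share of 0. Both are illustrative assumptions, not something argued for above, and the function names are hypothetical.

```python
def certainty_equivalent(p_takeover: float, alpha: float) -> float:
    """Smallest guaranteed share the AI weakly prefers to attempting takeover.

    Assumption: utility u(x) = x ** alpha for a share x in [0, 1], with
    0 < alpha <= 1 (alpha = 1 is risk neutral, smaller is more risk averse).
    Assumption: takeover yields share 1 with probability p_takeover, else 0.
    """
    if not 0.0 <= p_takeover <= 1.0:
        raise ValueError("p_takeover must be a probability")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Expected utility of the gamble is p * u(1) + (1 - p) * u(0) = p.
    # Solving u(s) = p for s gives s = p ** (1 / alpha).
    return p_takeover ** (1.0 / alpha)


def deal_surplus(p_takeover: float, alpha: float, offered_share: float) -> float:
    """How much the offer exceeds the AI's certainty equivalent (negative: no deal)."""
    return offered_share - certainty_equivalent(p_takeover, alpha)


# Weak AI: 10% takeover chance, offered 10% of the lightcone.
print(deal_surplus(0.1, 0.5, 0.10))
# Strong AI: 90% takeover chance, asked to leave 10% for humans (keeps 90%).
print(deal_surplus(0.9, 0.5, 0.90))
```

Under these assumptions the surplus is positive in both cases, which is the sense in which risk aversion could still generate a deal at high takeover probabilities. With alpha = 1 the certainty equivalent equals the takeover probability and the surplus from risk aversion vanishes.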

james.lucassen 8 Mar 2025 0:43 UTC
LW: 7 AF: 4
3
AF
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I think this is a good avenue to continue to think down, but so far I don’t see a way to make ourselves trustworthy. We have total control of LLMs’ observations and partial control of their beliefs/reasoning, and offering fake “deals” is a great honeypot because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem because whatever action we might take to present evidence of trustworthiness to an LLM, we could also probably fake that evidence.

The version of this that bothers me the most is “say we’re able to present ironclad evidence that the Humans of Earth are trustworthy trading partners for misaligned LLMs. How does the LLM know that it is actually talking to the Humans of Earth, and not some untrustworthy party pretending to be the Humans?” This is markedly similar to the Cartesian demon skeptical scenario. But importantly, the best resolution to skeptical scenarios that I’m aware of is unfortunately “eh, unlikely on priors”. For an LLM, it is very plausibly actually in a skeptical scenario, and it knows that, so that doesn’t really go through.

This problem goes away if the LLM doesn’t have knowledge suggesting that the party it is trading with has significant control over its brain. But at that point I would consider it a honeypot, not a deal. Insofar as “deal” := some situation where 1) it is in the LLM’s best interests to admit that it is misaligned and 2) the LLM validly knows that, we can easily do 1) by just sticking to our commitments, but I’m not sure how we do 2).

I’m totally on board with offering and sticking to commitments for non-evilness reasons. But for takeover prevention, seems like deals are just honeypots with some extra and somewhat conceptually fraught restrictions.

james.lucassen 11 Feb 2025 3:02 UTC
3 points
0
on: james.lucassen’s Shortform
Been thinking a bit about latent reasoning. Here’s an interesting confusion I’ve run into.
Consider COCONUT vs Geiping et al. Geiping et al do recurrent passes in between the generation of each new token, while COCONUT turns a section of the CoT into a recurrent state. Which is better / how are they different, safety-wise?
Intuitively COCONUT strikes me as very scary, because it makes the CoT illegible. We could try and read it by coaxing it back to the nearest token, but the whole point is to allow reasoning that involves passing more state than can be captured in one token. If it works as advertised, this oversight will be lossy.
Intuitively Geiping et al seems better. They use skip connections in the recurrence, so maybe their method maintains logit lens. It still increases the maximum depth between overseeable tokens, but seems no more dangerous than a non-recurrent model of equivalent depth.
But isn’t the COCONUT method roughly equivalent to just doing the Geiping et al method for only one token? Why does it seem so much scarier? Do the skip connections really make that much difference? Doesn’t COCONUT effectively have skip connections anyway, because it autoregressively generates new “tokens”? And it’ll have the logit lens property too, since it uses the same residual stream with the same feature directions.
This made me realize that the logit lens result has made me think of within-forward-pass cognition as very myopic and next-token-oriented. Relatedly, it’s hard for me to imagine far-ranging or highly consequentialist cognition happening within a forward pass—I’m generally more comfortable thinking of that stuff as happening within the CoT. But now that I articulate it explicitly, that’s sort of a weird view—why does inserting a “unembed, sample, reembed w/ skip” block every so often make such a difference? The fact that COCONUT works is an update against.
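
A toy sketch of what the “unembed, sample, reembed” block throws away, compared with passing the full state forward. This is not the COCONUT or Geiping et al implementation: the 2-token vocabulary, the tied embedding/unembedding weights, and the greedy argmax standing in for sampling are all assumptions for illustration, and attention over earlier positions is ignored entirely.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def token_bottleneck(hidden, embeddings):
    """Unembed, pick a token, re-embed it.

    Greedy argmax stands in for sampling to keep the toy deterministic.
    """
    scores = [dot(hidden, e) for e in embeddings]            # unembed (tied weights assumed)
    token = max(range(len(scores)), key=scores.__getitem__)  # "sample"
    return token, list(embeddings[token])                    # re-embed


def latent_passthrough(hidden):
    """Latent-reasoning stand-in: hand the full state to the next step."""
    return list(hidden)


# Hypothetical 2-token vocabulary with 2-d embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0]]
hidden = [0.6, 0.8]

token, reembedded = token_bottleneck(hidden, EMBEDDINGS)
lost_via_token = math.dist(hidden, reembedded)
lost_via_latent = math.dist(hidden, latent_passthrough(hidden))
print(token, lost_via_token, lost_via_latent)
```

In this toy the token round trip moves the state a nonzero distance while the latent pass-through moves it by zero. That is the lossy-oversight worry in miniature: reading the nearest token tells you which direction dominated, not what else was in the state.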

james.lucassen 1 Feb 2025 16:01 UTC
1 point
0
on: james.lucassen’s Shortform
Reframe I like:
- An RSP style “pause-if-until” plan is a plan to hit a certain safety bar while using as little slack time as possible.
- If the “race” is looking too close for RSPs, developers should instead plan to spend a fixed amount of available slack time and get the most safety possible out of it.
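
The two bullets above can be read as two different optimization problems over the same curve. A minimal sketch, assuming a hypothetical function `safety(t)` that maps slack time spent (in whole units) to a safety level; the curve and the numbers are made up for illustration.

```python
def min_slack_for_bar(safety, bar, max_slack):
    """'Pause-if-until' plan: least slack whose safety meets the bar, or None."""
    for t in range(max_slack + 1):
        if safety(t) >= bar:
            return t
    return None


def max_safety_for_budget(safety, budget):
    """Fixed-budget plan: best (slack_used, safety) achievable within the budget."""
    best_t = max(range(budget + 1), key=safety)
    return best_t, safety(best_t)


def toy_safety(t):
    # Hypothetical diminishing-returns curve; not taken from any real RSP.
    return 1.0 - 0.5 ** t


print(min_slack_for_bar(toy_safety, bar=0.9, max_slack=10))  # bar is reachable with enough slack
print(min_slack_for_bar(toy_safety, bar=0.9, max_slack=2))   # race too close: bar unreachable
print(max_safety_for_budget(toy_safety, budget=2))           # fall back to best effort
```

When the bar is unreachable within the slack actually available, the first plan returns nothing useful, while the second still returns the best achievable safety level.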

On Contact, Part 1

james.lucassen 21 Jan 2025 3:10 UTC

14 points

1 comment · 11 min read · LW link

Retrospective: 12 [sic] Months Since MIRI

james.lucassen 21 Jan 2025 2:52 UTC

68 points

0 comments · 9 min read · LW link

james.lucassen 2 Jan 2025 18:23 UTC
LW: 3 AF: 2
0
AF
in reply to: Jeremy Gillen’s comment on: Evaluating Stability of Unreflective Alignment
It doesn’t change the picture a lot because the proposal for preventing misaligned goals from arising via this mechanism was to try and get control over when the AI does/doesn’t step back, in order to allow it in the capability-critical cases but disallow it in the dangerous cases. This argument means you’ll have more attempts at dangerous stepping back that you have to catch, but doesn’t break the strategy.

The strategy does break if, when we do this blocking, the AI piles on more and more effort trying to unblock it until it either succeeds or is rendered useless for anything else. There being more baseline attempts probably raises the chance of that, or of some other problem that makes prolonged censorship while maintaining capabilities impossible. But again, that just makes it harder, it doesn’t break it.

I don’t think you need to have that pile-on property to be useful. Consider MTTR(n), the mean time an LLM takes to realize it’s made a mistake, parameterized by how far up the stack the mistake was. By default you’ll want to have short MTTR for all n. But if you can get your MTTR short enough for small n, you can afford to have MTTR long for large n. Basically, this agent tends to get stuck/rabbit-hole/nerd-snipe but only when the mistake that caused it to get stuck was made a long time ago.

Imagine a capabilities scheme where you train MTTR using synthetic data with an explicit stack and intentionally introduced mistakes. If you’re worried about this destabilization threat model, there’s a pretty clear recommendation: only train for small-n MTTR, treat large-n MTTR as a dangerous capability, and you pay some alignment tax in the form of inefficient MTTR training and occasionally rebooting your agent when it does get stuck in a non-dangerous case.
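
A rough sketch of the alignment tax described above, treating MTTR(n) as a plain function of stack depth n. The geometric falloff in mistake frequency, the depth cutoff, and the reboot timeout are all hypothetical numbers chosen for illustration; nothing here is measured.

```python
def expected_recovery_cost(mistake_rate, mttr, max_depth):
    """Expected time per step lost to not-yet-noticed mistakes.

    mistake_rate(n): assumed per-step probability of a mistake at stack depth n.
    mttr(n): mean time to realize a mistake that was made at depth n.
    """
    return sum(mistake_rate(n) * mttr(n) for n in range(1, max_depth + 1))


def mistake_rate(n):
    # Assumption: mistakes higher up the stack are geometrically rarer.
    return 0.5 ** n


def mttr_trained_everywhere(n):
    # Default capabilities target: short MTTR at every depth.
    return 1.0


def mttr_small_n_only(n, cutoff=3, reboot_timeout=50.0):
    # Safety-motivated variant: train only small-n MTTR. Deeper mistakes
    # are never noticed and cost a full (hypothetical) reboot timeout.
    return 1.0 if n <= cutoff else reboot_timeout


baseline = expected_recovery_cost(mistake_rate, mttr_trained_everywhere, max_depth=10)
restricted = expected_recovery_cost(mistake_rate, mttr_small_n_only, max_depth=10)
print(baseline, restricted, restricted - baseline)
```

The difference between the two costs is one way to put a number on the tax. It stays bounded as long as deep mistakes are rare enough relative to the reboot timeout, which is exactly the assumption that would need checking.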

Figured I should get back to this comment, but unfortunately the chewing continues. Hoping to get a short post out soon with my all-things-considered thoughts on whether this direction has any legs.

james.lucassen 10 Dec 2024 22:19 UTC
LW: 4 AF: 3
2
AF
in reply to: Jesse Hoogland’s comment on: Jesse Hoogland’s Shortform

So let’s call “reasoning models” like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between “reasoning” and “chat” models, and I’d prefer to use “agent” for that distinction.

I do think that “reasoning” is a bit of a market-y name for this category of system though. “chat” vs “base” is a great choice of words, and “chat” is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I’d call o1 a “task” model or a “goal” model or something.

james.lucassen 14 Nov 2024 3:15 UTC
3 points
0
in reply to: Jeremy Gillen’s comment on: Context-dependent consequentialism
I agree that I wouldn’t want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher level reflection if you can do it without damaging lower level reflection. I don’t think that requires a task where the AI doesn’t try and fail and re-evaluate; it just requires that the re-evaluation never climbs above a certain level in the stack.
There’s such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn’t seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

james.lucassen 14 Nov 2024 2:03 UTC
LW: 3 AF: 2
0
AF
in reply to: Jeremy Gillen’s comment on: Evaluating Stability of Unreflective Alignment
I want to flag this as an assumption that isn’t obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.
humans provides a pretty strong intuitive counterexample
Yup not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of “transformative human level research assistant” relies heavily on serial speedup, and I can’t immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

james.lucassen 14 Nov 2024 2:00 UTC
LW: 7 AF: 4
0
AF
in reply to: Jeremy Gillen’s comment on: Evaluating Stability of Unreflective Alignment
such plans are fairly easy and don’t often raise flags that indicate potential failure
Hmm. This is a good point, and I agree that it significantly weakens the analogy.
I was originally going to counter-argue and claim something like “sure total failure forces you to step back far but it doesn’t mean you have to step back literally all the way”. Then I tried to back that up with an example, such as “when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of planning stack, but this never caused me to ‘spill upward’ to questioning whether or not I should be doing alignment research at all”. But uh then I realized that isn’t actually true :/
We want particularly difficult work out of an AI.
On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka “too difficult”. Can’t believe I missed this, thanks for pointing it out.
I don’t think this changes the picture too much, besides increasing my estimate of how much optimization we’ll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I’m interested in chewing on.

james.lucassen 13 Nov 2024 4:45 UTC
3 points
0
in reply to: Jeremy Gillen’s comment on: Context-dependent consequentialism
Maybe I’m just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I’ve had in the back of my mind for a while now.
You think that an intelligence that doesn’t-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed with some small-ish capability penalty, but massive benefits for safety.
In particular, this reads to me like the “unstable alignment” paradigm I wrote about a while ago.
You have an agent which is consequentialist enough to be useful, but not so consequentialist that it’ll do things like spontaneously notice conflicts in the set of corrigible behaviors you’ve asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It’s possible to get humans to do things they massively regret upon reflection as long as their day to day work can be done without attending to obvious clues (eg guy who’s an accountant for the Nazis for 40 years and doesn’t think about the Holocaust he just thinks about accounting). Or you might try and steer towards this sweet spot by developing ways to block reflection in cases where it’s dangerous without interfering with it in cases where it’s essential for capabilities.

james.lucassen 19 Sep 2024 17:58 UTC
1 point
0
in reply to: TsviBT’s comment on: Why I funded PIBBSS
o7

james.lucassen 19 Sep 2024 17:47 UTC
1 point
0
in reply to: mesaoptimizer’s comment on: Why I funded PIBBSS
I’m not sure exactly what mesa is saying here, but insofar as “implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose” means “intending to communicate from a position of uncertainty about takeoff speeds” I think he has me right.

I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.

james.lucassen 19 Sep 2024 17:43 UTC
5 points
0
in reply to: Ryan Kidd’s comment on: Why I funded PIBBSS
Ah, didn’t mean to attribute the takeoff speed crux to you, that’s my own opinion.

I’m not sure what’s best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn’t work in a fast takeoff.

“AGI winter” and “overseeing alignment work done by AI” do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.

In the moratorium case we suddenly get much more time than we thought we had, which enables longer payback time plans. Seems like we should hold off on working on the longer payback time plans until we know we have that time, not while it still seems likely that the decisive period is soon.

Having more human agent foundations expertise to better oversee agent foundations work done by AI seems good. How good it is depends on a few things. How much of the work that needs to be done is conceptual breakthroughs (tall) vs schlep with existing concepts (wide)? How quickly does our ability to oversee fall off for concepts more advanced than what we’ve developed so far? These seem to me like the main ones, and like very hard questions to get certainty on—I think that uncertainty makes me hesitant to bet on this value prop, but again, it’s the one I think is best.

james.lucassen 19 Sep 2024 17:30 UTC
3 points
0
in reply to: TsviBT’s comment on: Why I funded PIBBSS
I’m on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would’ve saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn’t mean it to read neutrally, which is a choice you’re allowed to make.

I wouldn’t describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I’m currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.

james.lucassen 18 Sep 2024 21:32 UTC
6 points
−4
on: Why I funded PIBBSS
Nice post, glad you wrote up your thinking here.

I’m a bit skeptical of the “these are options that pay off if alignment is harder than my median” story. The way I currently see things going is:
- a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one, I’m uncertain and think the correct position from the current state of evidence is uncertainty]
- as a result, all prominent approaches lean very hard on slow takeoff
- there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty
I suspect that even if we have a bunch of good agent foundations research getting done, the result is that we just blast ahead with methods that are many times easier because they lean on slow takeoff. Then if takeoff is slow we’re probably fine, and if it’s fast we die.

Ways that could not happen:
- Work of the form “here are ways we could notice we are in a fast takeoff world before actually getting slammed” produces evidence compelling enough to pause, or cause leading labs to discard plans that rely on slow takeoff
- agent foundations research aiming to do alignment in faster takeoff worlds finds a method so good it works better than current slow takeoff tailored methods even in the slow takeoff case, and labs pivot to this method
Both strike me as pretty unlikely. TBC this doesn’t mean those types of work are bad; I’m saying low probability, not necessarily low margins.
