The difference with normal software is that at least somebody understands every individual part, and if you collected all those somebodies and locked them in a room for a while they could write up a full explanation. Whereas with AI I think we’re not even like 10% of the way to full understanding.
Also, if you’re trying to align a superintelligence, you do have to get it right on the first try; otherwise it kills you, with no counterplay.
I’m having trouble seeing how this works. Regardless of whether C is in the pool, I run a 5% risk of halving my wealth by taking the first gamble. I think the safety net metaphor only makes sense if the outcome can’t be worse than C, but in this example it seems like there’s a hole in the net I can fall through.
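To make the “hole in the net” concrete: under the assumptions stated above (a hypothetical gamble that halves my wealth with 5% probability, whether or not C is in the pool), the expected loss in log wealth from taking the gamble is 0.05·ln 2, independent of starting wealth. A quick sketch:

```python
import math

# Hypothetical setup matching the comment: the first gamble halves
# wealth with probability 5%, regardless of whether C is in the pool.
P_HALVE = 0.05

def expected_log_wealth(w0: float) -> float:
    """Expected log wealth after taking the first gamble once."""
    return (1 - P_HALVE) * math.log(w0) + P_HALVE * math.log(w0 / 2)

w0 = 100.0
drop = math.log(w0) - expected_log_wealth(w0)
print(drop)                    # expected drop in log wealth
print(P_HALVE * math.log(2))   # same value: 0.05 * ln(2)
```

The drop simplifies to 0.05·ln 2 ≈ 0.035 no matter what w0 is, which is the point: C’s presence in the pool doesn’t cap the downside of the first gamble, so the safety-net framing doesn’t apply.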