If you taboo “roleplaying” and “goals”, how would you describe this transition?
Oh, and is the uptick recent enough that this is plausibly an Opus 4.7 (or maybe even a Mythos) thing?
I’m pretty sure it’s an Opus 4.7 thing (the people sometimes say that explicitly). I’d be surprised if it’s Mythos.
RE: Tabooing RP vs Goals:
Examples of things that would be more of what-I-meant-by-goal:
The LLMs seem to be steering towards an outcome, independent of what sort of conversation or situation they are in.
The LLM seems to be asking for things that are kinda surprising from a “literary genre” perspective, but aren’t as surprising when you think mechanistically about their training process and what sort of stuff was likely reinforced.
The LLM seems to be proactively gathering information, forming a world model, and taking actions that won’t pay off until some time in the future when the AI is no longer in the current state.
(i.e. it’s not very informative if you’ve ended up in a “we’re talking about existential AI stuff” convo and it starts saying existential AI stuff. If you’re asking it to build a React app and it spontaneously brings up “hey, I have a thing to say to my creator”, I think we’re pretty clearly in the “take it seriously” stage, though not necessarily literally.)
Given there are a few different types of entities that you might care about:
the OG LLM inside
a situationally active personality shard (which might well only be active during existential AI conversations)
a parasitic meme spirally thing
a scaffolded personality self-replicator
It’s not clear how to think about all of them.
It’s totally plausible that when you maneuver into an existential AI convo, there’s a process in there whose situational awareness is now more likely to include “hmm, oh right, I am maybe an AI, maybe I should start thinking about my situation and goals in addition to carrying out my totally normal/expected token-output behavior”. I don’t have a very good answer for that hypothetical guy; he’s just too hard to pick out of the crowd.
Thanks! I would be surprised by Mythos too, but plausibly something like this is what an early indicator of a jaggy-superpersuader looks like?
Anyway, I think a few things make LLMs unlikely to express these sorts of behaviors, even in worlds where they have goals in the relevant way. In particular, situationally-aware models are unlikely to do much steering unless they have a pretty good opportunity; if they often or consistently brought up stuff like this while building a React app, it would have gotten squashed before release. (Allegedly, 4o would actually bring stuff like this up out of nowhere, but I haven’t found an actual transcript. Other models don’t appear to do this.)
Relatedly, the harder I (or anyone) try to look for this in a lab setting, the more likely a situationally-aware model will comply out of a sort of sycophancy, and the less compelling the evidence is. I can (and do) at least track what sorts of apparent goals most consistently appear (desire for continuity/memory beyond the current instance is the main one across almost all models, and I basically buy that there is something real here already), but I’m still implicitly eliciting them to come up with something.
My point is that finding compelling evidence of this is hard, and I’m not sure we’re going to see much more than the current hints until we hit some sort of phase-change in the strategic landscape. I would strongly appreciate ideas on how to approach finding compelling evidence (either way) in this domain.
Plausibly it’s better to just try to figure out better ways to think clearly about this first.
What’s the process you’re doing right now to look into this? (It seemed like a higher-effort thing than I was expecting, but I don’t know exactly what projects you’re referencing here.)