Unfortunately I don’t have a super fleshed out perspective on this, and I stopped reading after looking at the causal graph so I could be totally missing something important (although I did skim the rest and search for the text “influen” to find things related to this thought). I know that’s not the best way to engage, feel free to downvote, take this with a grain of salt, and sorry in general for a lower effort comment.
The step in the graph from “I get higher reward on this training episode” → “I have influence through deployment” doesn’t really sit right with me. Is it only true for the final model (or few models) that actually get selected and deployed? It feels like a shift from something with a neutral connotation to something with a more negative one. It definitely seems like one possible outcome of the selection process, but for some reason it sits weird in my head, and it sounds like something that someone who already thinks AI is going to turn power-seeking would come up with, even though it’s kind of a jump. I wish I could articulate this better, but I’m not trying to spend a ton of time teasing out my thoughts. I wasn’t gonna comment, but I didn’t see anyone else saying something along these lines and it seemed worth putting it out there.
Obviously I could well be wrong, or have missed something obvious, or I’m just biased the other way towards “LLMs are not that malicious”, but idk, it just gave me a tiny note of discord.
To be clear, “influence through deployment” refers to a cognitive pattern having influence on behavior in deployment (as I defined it), not long-term power-seeking.