I concur with the descriptive claims and arguments in this post, as well as with the sentiment that this is very important, and with the “security mindset” framing for the latter point. I have no substantive comments about that part.
However, I must object to the sentiment (implied by your selection of potential solutions) that the only solutions to this problem still involve using LLMs in some way—perhaps a little more or a little less, but never “none at all”. Now, you say “you are in fact highly encouraged to add to this list”—fair enough, and here is my addition:
Don’t talk to LLMs. Ever, at all, for any reason, under any circumstance.
I put forth no arguments whatsoever for this being a good idea. (I think that it is, in fact, an excellent idea; but, in this comment, I am offering no defense of that view.) My purpose in explicitly mentioning the possibility of not talking with LLMs is just that: to make explicit, for the record, that it is possible. It’s a thing you can do. (You don’t have to live in a bunker in order to do this, either.)
When considering possible courses of action, remember that this course of action is also available. When weighing pros and cons of different amounts of talking-with-LLMs, include “none” in your list of options. When setting your personal “how much do I talk with LLMs” variable, make sure that the scale goes all the way to zero.
You can pick up giant piles of utility by talking to LLMs? Perhaps. But you can also not pick them up. Leave them on the ground! That’s a thing you can do. Regardless of whether you ultimately decide to grab those piles, remember that you have the option not to. Make absolutely damn sure that you have actually considered all the options. “Do not talk to any LLMs, ever, at all, for any reason, under any circumstance” is—genuinely, in actual fact and not just in theory—one of the options that you have.
I agree that this is a notable point in the space of options. I didn’t include it, and instead included the bunker line, because if you’re going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try to eliminate second-order effects and never talk to people who talk to LLMs, since they too might be meaningfully harmful, e.g. by being under the influence of particularly powerful LLM-generated memes.
I also separately disagree that LLM isolation is the optimal path at the moment. In the future it likely will be. I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.
I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.
Please review this in a couple of months ish and see if the moment to stop is still that distance away. The frog says “this is fine!” until it’s boiled.
This does seem to be getting closer, yes. I still think the models are overall too stupid to do meaningful deception yet, although I haven’t yet gotten to play around with Opus 4. My use cases have also shifted in this time to less hackable things.
I do try to be calibrated instead of being the frog, yes. For as long as present-me has considered past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has been going down pretty linearly, but to further help, I’ve set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure; I’m now going to generalize the “check in in [time period] about this sort of thing to make sure I haven’t been hacked” reflex.
I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish.
With respect, I suggest to you that this sort of thinking is a failure of security mindset. (However, I am content to leave the matter un-argued at this time.)
… if you’re going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try and eliminate second order effects and never talk to people who talk to LLMs, for they too might be meaningfully harmful e.g. be under the influence of particularly powerful LLM-generated memes.
Yes… this is true in a personal-protection sense, I agree. And I do already try to stay away from people who talk to LLMs a lot, or who don’t seem to be showing any caution about it, or who don’t take concerns like this seriously, etc. (I have never needed any special reason to avoid Twitter, but if one does—well, here’s yet another reason for the list.)
However, taking a pure personal-protection stance on this matter does not seem to me to be sensible even from a selfish perspective. It seems to me that there is no choice but to try to convince others, insofar as it is possible to do this in accordance with my principles. In other words, if I take on some second-order effect risk, but in exchange I get some chance of several other people considering what I say and deciding to do as I am doing, then this seems to me to be a positive trade-off—especially since, if one takes the danger seriously, it is hard to avoid the conclusion that choosing to say nothing results in a bad end, regardless of how paranoid one has been.
I think we’re mostly on the same page that there are things worth forgoing the “pure personal-protection” strategy for; we’re just on different pages about what those things are. We agree that “convince people to be much more cautious about LLM interactions” is in that category. I just also put “make my external brain more powerful” in that category, since it seems to have positive expected utility for now and lets me do more AI safety research in line with what pre-LLM me would likely endorse upon reflection. I am indeed trying to be very cautious about this process: trying to be corrigible to my past self, and to implement all of the mitigations I listed plus all the ones I don’t have words for yet. It would be a failure of security mindset to fail to notice these risks or to see that they are important to deal with. However, I am making a bet that the extra optimization power is worth it for now. I may lose that bet, and then that will be bad.
If there’s danger on the horizon, why wait until it’s right up close to put some distance between you and it? Given the evidence you’ve presented, I don’t understand leaving the decision to put measures in place to Alice Blair, who’s possibly already exposed to stronger LLMs.
It’s a balance between getting the utility out of using smarter and smarter assistants and not being duped by them. This is really hard, and it’s definitely not a bet that everyone should make.
This post just came across my inbox, and there are a couple of updates I’ve made (I have not talked to 4.5 at all and have seen only minimal outputs):
GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine-gradient way).
GPT-4.5 + reasoning + RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi’s post but after hearing briefly about 4.5).
I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet, and I’m unsure whether I will take it when I have the option to.