This is a very clear, well-written post. You could get the same idea from reading Deep Deceptiveness or Planecrash / Project Lawful and there’s value in that. But this gives you the idea in 5,000 words instead of 1,800,000 words, and the example hostile telepath is a mother, rather than Asmodeus or OpenAI.
In writing this review I became less happy with some of the examples. They’re clear and evocative, but some of them seem incorrect. The mother is not hostile, she is closely aligned to her child. She isn’t trying to make the 3yo press an “actually mean it” button, she’s instead pressing her own thumbs-down button on the apology and hoping that the 3yo’s brain will update in the desired way. The 3yo probably gains a tiny bit of empathy and a tiny bit of tone-control. They don’t get self-deception, that’s too complex. If the 3yo regrets breaking the glasses because it causes mom’s wrath, that is “really sorry”, not strategic misinterpretation.
The math class example also reads false to me. I have a kid who loves math and hates math class, and this does not seem like a difficult distinction to make. As a kid I remember loving to read and hating assigned reading for school. Okay, people who “hate math” are in fact ambivalent about the abstract concept of mathematics itself, which they never encounter outside of math class, which they hate. I don’t think we need to invoke self-deception here. Yes, school can suck the joy out of a topic, but that is explained by operant conditioning.
However, the other examples read true. And even if you disagree with some of the examples, I think they’re still so clear and relatable that they give a really good handle on the topic. So I now use the label of “Hostile Telepath Problem” when I think about this problem, and I thank this article for it. The AI implications follow naturally.
The mother is not hostile, she is closely aligned to her child.
I just replied to another reviewer about this point. In short: I agree, I think it’s worth noticing, and I also think the point is irrelevant. The question isn’t whether the mother truly is hostile vs. aligned with the child. The question is whether the child experiences threat from an apparent telepath.
This point is related to footnote 4. I think it’s unhelpful to ask whether the mother “actually is” a hostile telepath. Hostile telepathy is about a perception someone has of another. If you perceive someone (or something) as a hostile telepath, you need some solution to that problem. One possible solution is to discover that they are not, in fact, hostile. But if you don’t converge on that solution, you’ll need some other one.
As I mentioned to the other reviewer, it stands out to me that two people both zoomed in on the same objection. I’m not sure what’s going on there. Let me know if I’ve missed your point?
The math class example also reads false to me. I have a kid who loves math and hates math class, and this does not seem like a difficult distinction to make.
Of course! My impression is that many (most?) math students don’t get sucked into the Newcomblike self-deception pattern I was naming. But some do! You’re pointing out an example of it not happening. If I were claiming that this happens for all math students, your point would totally refute mine. And to the degree you thought I was making that claim, or it came across as ambiguous whether I was, I’m glad you brought it up! But my point wasn’t that all math students encounter this. It’s that some do. And I don’t think it’s super rare.
The AI implications follow naturally.
Yep. Key to why I worked on it to begin with. I’m glad you caught that and named it!
There’s a Dog Man comic where a villain (as it happens) tells Dog Man off. Dog Man goes off and looks sad. The villain then shouts at him: “sadder!”. Dog Man “presses his ‘actually mean it’ button” and looks sadder. The joke works for me because it’s true to this relatable “say it and mean it” dynamic.
I think my reading came from this paragraph, which is expressed as facts about the hypothetical, rather than the 3yo’s experience/model of the hypothetical:
Now you have a serious problem. You don’t have an internal “actually mean it” button. And yet here’s Mom peering into your soul and demanding that you both have that button and press it.
I wouldn’t edit the paragraph, because it’s clear as it is and more words would dilute. To me, it’s a “fridge logic” hypothetical which becomes less compelling on later reflection, but it still has a good impact on first reading.
With the math class example I don’t follow your reasoning. This seems common to me:
problem: the teacher wants me to try hard, but I want to slack off
idea: I will pretend to try hard, but really slack off
problem: the teacher is a telepath and can tell that I’m slacking off
idea: I will deceive myself into thinking that I am trying hard, but really slack off
This also seems common to me:
problem: I hate math class, but the teacher is personally offended by that
idea: I will deceive myself into thinking that I hate math, which will explain away my hatred of math class
But I don’t see how they connect up in a natural way, to get all the way from the initial problem to the solution of self-deceiving to hate math. Or if they do, it feels like so many layers of self-deception that it would be rare. Perhaps this is two Hostile Telepath Problems in one bullet point.