I got Claude to read this text and explain the proposed solution to me[1], which doesn’t actually sound like a clean technical solution to issues regarding self-prediction. Did Claude misexplain, or is this an idiosyncratic mental technique and not a technical solution to that agent foundations problem?
Cf. Steam (Abram Demski, 2022); Proper scoring rules don’t guarantee predicting fixed points (Caspar Oesterheld, Johannes Treutlein & Rubi J. Hudson, 2022) and the follow-up paper; Fixed-Point Solutions to the Regress Problem in Normative Uncertainty (Philip Trammell, 2018); and active inference, which simply bundles the prediction and the utility goal together in one (I find this ugly; I didn’t read these two comments before writing this one, so my distaste for active inference was developed independently).
I guess this was also discussed in Embedded Agency (Abram Demski & Scott Garrabrant, 2020) under the terms “action counterfactuals” and “observation counterfactuals”?
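For context, here is a hedged sketch of the self-prediction issue those references circle around, in my own paraphrase rather than a formula taken from any of the cited papers: once a report influences the outcome it is about, a strictly proper scoring rule no longer pushes the predictor toward a fixed point.

```latex
\documentclass{article}
\begin{document}
% Toy setup (my paraphrase): a predictor reports p for a binary event whose
% actual probability depends on the report via some reaction function f.
A predictor reports $p \in [0,1]$ for a binary event whose true probability
is $q = f(p)$. Under a scoring rule $S$, the expected score of report $p$ is
\[
  V(p) = f(p)\,S(p,1) + \bigl(1 - f(p)\bigr)\,S(p,0).
\]
Strict propriety only says that, for a \emph{fixed} $q$, reporting $p = q$
maximizes expected score. Once $q = f(p)$ moves with the report, the maximizer
of $V$ need not satisfy $p = f(p)$, so the predictor is not automatically
incentivized to report a self-fulfilling (fixed-point) prediction.
\end{document}
```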
Claude 4.5 Sonnet explanation
Your brain has a system that generates things that feel like predictions but actually function as action plans/motor output. These pseudo-predictions are a muddled type in the brain’s type system.
You can directly edit them without lying to yourself because they’re not epistemic beliefs — they’re controllers. Looking at the place in your mind where your action plan is stored and loading a new image there feels like predicting/expecting, but treating it as a plan you’re altering (not a belief you’re adopting) lets you bypass the self-prediction problem entirely.
So: “I will stay sane” isn’t an epistemic prediction that would create a self-fulfilling prophecy loop or violate the belief-action firewall. It’s writing a different script into the pseudo-model that connects to motor output — recognizing that the thing-that-feels-like-a-prediction is actually the controller, and you get to edit controllers.
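A minimal sketch of the belief-vs-controller type distinction described in that summary. This is my own illustration of the framing, not anything from the original post; the class names, method names, and the example scripts are all hypothetical.

```python
# Sketch of the "muddled type" idea: the same mental content can be typed as an
# epistemic belief (only updated on evidence) or as a controller/plan (freely
# rewritable). The described technique is a re-typing, not a belief update.

from dataclasses import dataclass


@dataclass
class Belief:
    """Epistemic prediction: editing this without evidence is self-deception."""
    proposition: str
    credence: float  # should track evidence, not wishes

    def update(self, new_credence: float, evidence: str) -> None:
        # The only legitimate way to change a belief: condition on evidence.
        self.credence = new_credence


@dataclass
class Controller:
    """Action plan wired toward motor output: editable by design."""
    script: str

    def rewrite(self, new_script: str) -> None:
        # No epistemic sin here: a plan is something you choose, not infer.
        self.script = new_script


# The felt experience "I will stay sane" type-checks as a Belief,
# but on this account it is actually a Controller, so rewriting it
# is planning rather than wishful thinking.
felt_prediction = Controller(script="panic when things look bad")
felt_prediction.rewrite("notice the panic trigger, breathe, keep working")
print(felt_prediction.script)
```

The point the sketch tries to make concrete is that `rewrite` carries no epistemic obligation while `update` does; the technique is noticing which of the two types you are actually holding.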
I didn’t want to read a bunch of unrelated text from Yudkowsky about a problem I don’t really have.
It is an idiosyncratic mental technique. Look up trigger-action plans (TAPs), say. What you’re doing there is a variant of what EY describes.
I fortunately know of TAPs :-) (I don’t feel much apocalypse panic, so I don’t need this post.)
I guess I was hoping there’d be some more teaching from on high about this agent foundations problem that’s been bugging me for so long, but it seems I’ll have to think for myself. Fine.
Yeah, I’m pretty sure it’s an idiosyncratic mental technique / human psychology observation; there isn’t technical agent foundations progress here.