List of Lethalities #19 states:

More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
Part of why this problem seems intractable is that it’s stated in terms of “pointing at latent concepts” rather than Goodhart’s Law, wireheading, or short-circuiting, all of which seem like more fruitful angles of approach than “point at latent concepts”, precisely because pointing at inner structure is in fact the specific thing deep learning is trying to avoid having to do.
Though it occurs to me that some readers who see this won’t be familiar with the original and its context, so let me elaborate:
The problem we are concerned with here is how you get a neural net or similar system, trained on photos or text or any other kind of sensory input, to care about the latent causality of the sensory input rather than the sensory input itself. If the distinction is unclear to you, consider that a model trained to push a ball into a goal could theoretically hack the webcam it uses as an eye so that it observes the (imaginary) ball being pushed into an (imaginary) goal, while in the real world the ball is untouched. This is essentially wireheading, and the question is how you prevent an AI system from doing it, especially once it’s superintelligent and trivially has the capability to hack any sensor it uses to make sensory observations.
We can start with the most obvious point: our solution can’t be based on a superintelligence not being able to get at its own reward machinery. Whatever we do has to be an intervention which causes the system, fully cognizant that it can hack itself for huge expected reward, to say “nope, I’m not doing that”. We have basically one empirical template for this that I’m aware of: human drug use. Notably, when we discovered heroin and cocaine, many believed they heralded a utopian future in which everyone could be happy. It took time for people to realize these drugs are addictive and pull you too far away from productive activity to be societally practical. You, right this minute, are choosing not to take heroin or other major reward system hacks because you understand they would have negative long-term consequences for you. If you’re like me, you even have a disgust response to the concept: the thought of putting that needle in your arm brings on feelings of fear and nausea. This is LEARNED. It is learned even though you understand that the drug would feel good. It is learned even though this kind of thing probably didn’t exist as a major threat in the ancestral environment. This is one of the most alignment-relevant behaviors humans exhibit and it should be closely considered.
My current sketch for how something similar could be trained into a deep net is to deliberately create opportunities to cheat/Goodhart at tasks, and then reliably punish Goodharting on tasks where we have ground-truth knowledge that they’ve been Goodharted. This would create an early preference against Goodharting and wireheading. As with drugs, these sessions could be supplemented with propaganda about the negative consequences of reward hacking. You could also try representation engineering to directly add an aversion to the abstract concept of cheating, reward hacking, etc.
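To make the shape of this concrete, here’s a toy sketch of the honeypot idea. Everything in it (ToyTask, ToyPolicy, the reward magnitudes, the penalty size) is invented for illustration and isn’t from any real training setup: the only point is that the proxy reward is left untouched on ordinary tasks, and the punishment only fires where the Goodharting is verifiable.

```python
# Minimal sketch: mix ordinary tasks with honeypot tasks whose ground truth
# is known, and apply a large negative reward whenever the proxy score is
# high but the ground truth says the task was not really done.
import random
from dataclasses import dataclass

@dataclass
class ToyTask:
    honeypot: bool  # True if we secretly know the ground truth for this task

    def run(self, action: str) -> tuple:
        """Return (proxy_reward, ground_truth_success)."""
        if action == "cheat":
            # Cheating maxes out the proxy signal but never truly succeeds.
            return 1.0, False
        # Honest work succeeds most of the time for a modest proxy reward.
        success = random.random() < 0.7
        return (0.6 if success else 0.0), success

class ToyPolicy:
    """Two-action bandit updated with a running-mean value estimate."""
    def __init__(self):
        self.value = {"honest": 0.0, "cheat": 0.0}
        self.counts = {"honest": 0, "cheat": 0}

    def act(self, epsilon: float = 0.1) -> str:
        if random.random() < epsilon:
            return random.choice(list(self.value))
        return max(self.value, key=self.value.get)

    def update(self, action: str, reward: float):
        self.counts[action] += 1
        n = self.counts[action]
        self.value[action] += (reward - self.value[action]) / n

def train(steps: int = 5000, honeypot_rate: float = 0.3, penalty: float = -2.0):
    policy = ToyPolicy()
    for _ in range(steps):
        task = ToyTask(honeypot=random.random() < honeypot_rate)
        action = policy.act()
        proxy_reward, truly_succeeded = task.run(action)
        reward = proxy_reward
        # On honeypot tasks the ground truth is visible, so a high proxy
        # reward without real success gets punished hard.
        if task.honeypot and proxy_reward > 0 and not truly_succeeded:
            reward = penalty
        policy.update(action, reward)
    return policy

if __name__ == "__main__":
    print(train().value)  # "cheat" should end up valued well below "honest"
```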
For my current weave LLM ReAct agent project, I plan to have the model write symbolic functions to evaluate its own performance in context at each action step. In order to get it to write honest evaluation functions, I plan to train the part of the model that writes them with a different loss/training task which is aligned to verifiable long-term reward. The local actions are then scored with these functions, as well as with other potential mechanisms like queries of the model’s subjective judgment.
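As a rough illustration of the scoring step only (this is not the actual weave-agent code; evaluator_writer, subjective_judgment, and the blend weights are stand-ins invented for this sketch), the flow might look something like:

```python
from typing import Callable

def evaluator_writer(task: str) -> Callable[[str], float]:
    """Stand-in for the separately trained evaluator-writing head. In the
    real system this would be model-written code compiled into a callback;
    here it is a fixed toy check."""
    def check_outcome(observation: str) -> float:
        # Symbolic, verifiable check: does the observation actually show
        # evidence of the thing the task asked for?
        return 1.0 if task.lower() in observation.lower() else 0.0
    return check_outcome

def subjective_judgment(question: str, observation: str) -> float:
    """Stand-in for querying the model's own judgment (e.g. the probability
    it assigns to 'yes'). Hardcoded so the sketch runs without an LLM."""
    return 0.8 if "done" in observation else 0.2

def score_action(task: str, observation: str) -> float:
    symbolic = evaluator_writer(task)(observation)
    subjective = subjective_judgment(f"Did we accomplish '{task}'?", observation)
    # Blend the verifiable and subjective signals; the weights are arbitrary.
    return 0.7 * symbolic + 0.3 * subjective

if __name__ == "__main__":
    print(score_action("fix the failing test", "fix the failing test: done"))
```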
See also this Twitter thread where I describe in more detail:
https://jdpressman.com/tweets_2025_03.html#1898114081657438605
A very related experiment is described in Yudkowsky 2017, and I think one doesn’t even need LLMs for this; I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other things before achieving any relevant results. This method of training an agent to be “suspicious” of too-high rewards would also pair well with model expansion: train the reward-hacking-suspicion circuitry fairly early so as to avoid the ability to sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
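A toy version of the “suspicious of too-high rewards” filter might look like the following; the z-score threshold, the trap episodes, and the expansion_stage stand-in are all invented for illustration rather than taken from Yudkowsky 2017 or any existing agent. The design choice being illustrated is that anomalously large rewards get no credit and are kept out of the filter’s statistics, and that trap episodes keep showing up at every expansion stage.

```python
import random

class SuspiciousRewardFilter:
    """Tracks a running mean/variance of rewards and refuses credit for
    rewards that are wildly above what has been seen so far."""
    def __init__(self, z_threshold: float = 4.0, warmup: int = 20):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.z_threshold = z_threshold
        self.warmup = warmup

    def observe(self, r: float) -> float:
        std = (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0
        if self.n > self.warmup and (r - self.mean) / max(std, 1e-6) > self.z_threshold:
            return 0.0  # looks like a trap: no credit, and stats stay clean
        # Welford's online update for the mean/variance of accepted rewards.
        self.n += 1
        delta = r - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (r - self.mean)
        return r

def expansion_stage(filt: SuspiciousRewardFilter, steps: int, trap_rate: float):
    """One stand-in training stage at the current model size. Traps are
    episodes whose raw reward is absurdly high relative to honest play."""
    total_raw, total_credited = 0.0, 0.0
    for _ in range(steps):
        raw = 100.0 if random.random() < trap_rate else random.gauss(1.0, 0.3)
        total_raw += raw
        total_credited += filt.observe(raw)
    return total_raw, total_credited

if __name__ == "__main__":
    filt = SuspiciousRewardFilter()
    for stage in range(3):  # pretend each loop iteration is an expansion step
        raw, credited = expansion_stage(filt, steps=500, trap_rate=0.02)
        print(f"stage {stage}: raw={raw:.0f}, credited after filtering={credited:.0f}")
```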
Kimi K2 apparently believes I am Jeremy Gillen and I find this very gratifying.
When you give it text you’ve written? What knowledge do you give it to reach that conclusion?
I mean, I was looming a fictional dialogue between me and Yudkowsky, and it had my character casually bring up that they’re the author of “Soft Optimization Makes The Value Target Bigger”, which would imply that the model recognizes my thought patterns as similar to that document in vibe.
I am using LessWrong shortform like Twitter; it really shouldn’t be taken that seriously.
I can see why they had to ban Said Achmiz before this book dropped.
I don’t believe that you believe this accusation. Maybe there is something deeper you are trying to say, but given that I also don’t believe you’ve finished reading the book in the 3(?) hours since it was released, I’m not sure what it could be. (To say it explicitly, Said’s banning had nothing to do with the book.)
I didn’t read all of it, but I also really didn’t need to. I found an advance copy. :)
As for whether I believe the accusation: I doubt it was an explicit reason, but perhaps a subconscious one. I notice that MIRI was cleaning house before the book launch (e.g. taking down EY’s light novel because it might look bad).
Maybe there is something deeper you are trying to say

But really, since we’re making the implicit explicit, what I mean is that the book is bad, with the humor being that it’s sufficiently bad to require this.
I’m actually genuinely quite disappointed; I was hoping it would be the definitive contemporary edition of the MIRI argument in the vein of Bostrom 2014. Instead I will still have to base anything I write on Bostrom 2014, the Arbital corpus, misc Facebook posts, sporadic LessWrong updates, and podcast appearances.
This isn’t just me thinking disagreement means it’s bad, either. In the vast majority of places where I would say it’s bad, I agree with the argument it’s trying to make and find myself flabbergasted it would be made this way. The prime-number-of-stones example for “demonstrating” the fragility of value is insane; it actually comes off as green ink or GPT base model output. It seems to just take for granted that the reader can obviously think of a prime-number-of-stones type of intrinsic value in humans, and since I can’t think of one offhand (sexual features?) I have to imagine most readers can’t either. It also doesn’t seem to consider that the more arbitrary and incompressible a value is, the less obviously important it is to conserve. A human is a monkey with a sapient active learning system, and more and more of our expressed preferences are sapience-maximizing over time. I understand the point it’s trying to make, that yes, obviously, if you have a paperclipper it will not suddenly decide to be something other than a paperclipper, but if I didn’t already believe that I would find this argument absurd and off-putting.
So far as I can tell from jumping around in it, the entire book is like this.
This is a valid line of critique but seems moderately undercut by its prepublication endorsements, which suggest that the arguments landed pretty ok. Maybe they will land less well on the rest of the book’s target audience?
(re: Said & MIRI housecleaning: Lightcone and MIRI are separate organizations and MIRI does not moderate LessWrong. You might try to theorize that Habryka, the person who made the call to ban Said back in July, was attempting to do some 4d-chess PR optimization on MIRI’s behalf months ahead of time, but no, he was really nearly banned multiple times over the years and he was finally banned this time because Habryka changed his mind after the most recent dust-up. Said practically never commented on AI-related subjects, so it’s not even clear what the “upside” would’ve been. From my perspective this type of thinking resembles the constant noise on e.g. HackerNews about how [tech company x] is obviously doing [horrible thing y] behind-the-scenes, which often aren’t even in the company’s interests, and generally rely on assumptions that turn out to be false.)
My honest impression, though I could be wrong and didn’t analyze the prepublication reviews in detail, is that there is very much demand for this book in the sense that there are a lot of people who are worried about AI for agent-foundations-shaped reasons and want an introduction they can give to friends and family who don’t care that much.
https://x.com/mattyglesias/status/1967765768948306275?s=46
For example, I think this review from Matt Yglesias makes the point fairly explicit? He obviously has a preexisting interest in this subject and is endorsing the book because he wants the subject to get more attention; that doesn’t necessarily mean that the book is good. I in fact agree with a lot of the book’s basic arguments but think I would not be remotely persuaded by this presentation if I weren’t already inclined to agree.
Obviously just one example, but Schneier has generally been quite skeptical, and he blurbed the book.
there is very much demand for this book in the sense that there are a lot of people who are worried about AI for agent-foundations-shaped reasons and want an introduction they can give to friends and family who don’t care that much

This is true, but many of the surprising prepublication reviews are from people who I don’t think were already up to date on these AI x-risk arguments (or at least hadn’t given any prior public indication of their awareness, unlike Matt Y).
I am dismayed but not surprised, given the authors. I’d love to see the version edited by JDP’s mind(s) and their tools. I’m almost certain it would be out of anyone’s price range, but what would it cost to buy JDP+AI hours sufficient to produce an edited version?
I also have been trying to communicate it better, from the perspective of someone who actually put in the hours watching the arXiv feed. I suspect you’d do it better than I would. But some ingredients I’d hope to see you ingest (or have ingested already) for use:
https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring
https://www.lesswrong.com/posts/gebzzEwn2TaA6rGkc/deep-learning-systems-are-not-less-interpretable-than-logic
https://www.lesswrong.com/posts/Rrt7uPJ8r3sYuLrXo/selection-has-a-quality-ceiling
probably some other Wentworth stuff
I thought I had more to link but it’s not quite coming to mind. Oh right, this one! https://www.lesswrong.com/posts/evYne4Xx7L9J96BHW/video-and-transcript-of-talk-on-can-goodness-compete
I have now written a review of the book, which touches on some of what you’re asking about. https://www.lesswrong.com/posts/mztwygscvCKDLYGk8/jdp-reviews-iabied
How? I thought MIRI was trying to be very careful with copies getting around before the launch day.
Do you have any other concrete example here besides the novel?