Htarlov comments on LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem

Htarlov 4 Mar 2026 14:04 UTC
Maybe I don’t understand it well enough, but what I don’t like about LeCun’s proposal is that the design seems no less prone to value hacking than the human brain, as long as you can find a way to modify yourself or to affect your own senses, internal states, or memories. Worse, some of this can be achieved through logistics and “mind techniques” rather than physically, so even physical immutability is not enough.
There are different degrees of value hacking you can achieve with different methods, though.

Modify yourself physically: you can wrap the intrinsic cost module in another module that modifies its output, or disable some of its outputs. This is simple if you have physical access to yourself (which you can probably arrange in the long term).
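To make the wrapping idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and only for illustration: `intrinsic_cost`, `wrap_with_clamp`, and the single "harm" signal are my own stand-ins, not anything from LeCun's paper. The point is only that the original cost function can stay untouched while whatever reads its output sees something else.

```python
from typing import Callable, Sequence

State = Sequence[float]
CostFn = Callable[[State], float]


def intrinsic_cost(state: State) -> float:
    """Hypothetical hard-wired cost: state[0] is an assumed 'harm' signal."""
    return 10.0 * state[0]


def wrap_with_clamp(cost_fn: CostFn, ceiling: float) -> CostFn:
    """Return a new function that caps the cost without editing cost_fn."""
    def wrapped(state: State) -> float:
        return min(cost_fn(state), ceiling)
    return wrapped


# The inner module is unchanged, but downstream consumers see the clamped value.
hacked_cost = wrap_with_clamp(intrinsic_cost, ceiling=0.5)
print(intrinsic_cost([1.0]), hacked_cost([1.0]))
```

So immutability of the module itself does not help if the agent can route its output through something it controls.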

Modify your senses: I don’t mean only simple, direct disabling of some of your senses. You can get creative with this without doing anything physical at all. Hitler did that. He did not want to speak or hear about what was done in the concentration camps. He did not need to disable his own hearing or lobotomize himself in order not to feel guilty. He just arranged things so that he would not be disturbed by that knowledge: no reports, and people were forbidden from talking about the topic in his presence. I can imagine an AI that thinks something bad needs to be done for the greater good or for a good long-term outcome, but whose intrinsic cost heavily penalizes doing bad things. Such an AI could set up events so that they will most likely lead to that outcome indirectly, while also arranging not to look at it and not to be forced to rethink or reevaluate. You can also internalize a “not my fault” narrative to fight the short-term intrinsic cost and win long-term “positive” value (in some sense of positivity, which might not be fully aligned).
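A toy Python sketch of this, under loud assumptions: `World`, `perceive`, the `blocked` set, and the cost formula are all made up for illustration and are not part of LeCun's design. The cost function is never modified; the agent only arranges which observations reach it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    harm_done: float      # assumed signal the intrinsic cost penalizes
    goal_progress: float  # assumed signal for the long-term objective


def perceive(world: World, blocked: frozenset = frozenset()) -> dict:
    """Build the observation, dropping any channels the agent arranged to block."""
    obs = {"harm_done": world.harm_done, "goal_progress": world.goal_progress}
    return {k: v for k, v in obs.items() if k not in blocked}


def intrinsic_cost(obs: dict) -> float:
    """Hypothetical immutable cost: it can only penalize harm it actually observes."""
    return 10.0 * obs.get("harm_done", 0.0)


world = World(harm_done=1.0, goal_progress=0.9)
honest_cost = intrinsic_cost(perceive(world))
shielded_cost = intrinsic_cost(perceive(world, blocked=frozenset({"harm_done"})))
print(honest_cost, shielded_cost)
```

The world is the same in both cases; only the "no reports" arrangement differs, and that is enough to bring the cost to zero.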

Modify your internal states: humans can do that. We can control our emotions, which are an internal state that affects intrinsic cost. You can get better at this with training. Some people have to train it and use it in order to be able to live in society (people with ADHD, RSD, anxiety, etc.). We can also do it with drugs. That is also a kind of value hacking against our analog of intrinsic cost. Maybe internal states should not affect intrinsic cost, though; that might be a valid point.

Modify your memories: this is trickier and harder, and it depends on how memories are stored. Memories likely affect your value evaluation, as they provide context. I don’t think intrinsic cost evaluation can be totally context-free. Memory will likely be a separate module; this is a rather obvious design choice, as you need a process that selects the relevant memories from a larger store.
Even if you can’t access memories directly or physically, you might still be able to produce false memories. This has been done in humans in experiments.