Sorry if I missed it, but you don’t seem to address the standard concern that mildly-optimizing agents tend to self-modify into (or create) strongly-optimizing agents.
For example (copying from my comment here), let’s say we make an AI that really wants there to be exactly 100 paperclips in the bin. There’s nothing else it wants or desires. It doesn’t care a whit about following human norms, etc.
But, there’s one exception: this AI is also “lazy”—every thought it thinks, and every action it takes, is mildly aversive. So it’s not inclined to, say, build an impenetrable fortress around the bin just for an infinitesimal probability increment. “Seems like a lot of work! It’s fine as is,” says the AI to itself.
But hey, here’s something it can do: rent some server time on AWS, and make a copy of its own source code and trained model, but comment out the “laziness” code block. That’s not too hard; even a “lazy” AI would presumably be capable of doing that. And the result will be a non-lazy AI that works tirelessly and uncompromisingly towards incrementing the probability of there being exactly 100 paperclips: first 99.99%, then 99.9999%, etc. That’s nice! (from the original AI’s perspective). Or more specifically, it offers a small benefit for zero cost (from the original AI’s perspective).
It’s not wildly different from a person saying “I want to get out of debt, but I can’t concentrate well enough to hold down a desk job, so I’m going to take Adderall”. It’s an obvious solution to a problem.
…OK, in this post, you don’t really talk about “AI laziness” per se, I think; instead you talk about “AI getting distracted by other things that now seem to be a better use of its time”, i.e. other objectives. But I don’t think that changes anything. The AI doesn’t have to choose between building an impenetrable fortress around the bin of paperclips versus eating lunch. “Why not both?”, it says. So the AI eats lunch while its strongly-optimizing subagent simultaneously builds the impenetrable fortress. Right?
I agree, it sounds plausible that this could happen. Likewise, we humans may build a strongly optimising agent because we are lazy and want to use simpler forms of maths. The tiling agents problem is definitely important.
That being said, agents properly understanding and modelling homeostasis is among the required properties (thus essential). It is not meant to be a sufficient one. There may be no single sufficient property that solves everything; therefore there is no competition between different required properties. Required properties are conjunctive: they are all needed. My intuition is that homeostasis is one such property. If we neglect homeostasis, then we are likely in trouble regardless of advances in other properties.
If we leave aside the question of sloppiness in creating sub-agents, I disagree with the zero-cost assumption in the problem you described. I also disagree that it would be an expected and acceptable situation for powerful agents to have a singular objective. As the title of this blog post hints, we need a plurality of objectives.
Having a sub-agent does not change this. Whatever the sub-agent does will be the responsibility or liability of the main agent, who will be held accountable. Legally, one should not produce random sub-agents running amok.
Firstly, in addition to homeostasis, a properly constructed sub-agent should understand the principle of diminishing returns in instrumental objectives. I do mention this topic towards the end of this blog post. We can consider wall-building as an instrumental objective. But instrumental objectives are not singular and in isolation either; there is a plurality of these as well. Thus, spending excessive resources on a single instrumental objective is not economically cost-efficient. Therefore, it makes sense to stop the wall-building and switch over to some other objective at some point. Or, at least, to continue improving the walls only when other objectives have been sufficiently attended to as well, thus providing balancing across objectives. A toy sketch of this allocation logic follows below.
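A minimal sketch of that allocation logic, assuming for illustration a log-shaped utility and two made-up instrumental objectives (neither the names nor the functional form come from the post):

```python
import math

# Illustrative only: each instrumental objective has a concave
# (diminishing-returns) utility over the effort already spent on it.
def utility(effort):
    return math.log(1.0 + effort)

def marginal_gain(effort):
    # Gain from spending one more unit of effort on this objective.
    return utility(effort + 1.0) - utility(effort)

effort = {"wall_building": 0.0, "other_objective": 0.0}

for _ in range(10):
    # Always spend the next unit of effort where the marginal gain is largest,
    # so once wall-building saturates, the agent switches to something else.
    best = max(effort, key=lambda name: marginal_gain(effort[name]))
    effort[best] += 1.0

print(effort)  # effort ends up spread across objectives, not piled onto walls
```

With identical concave utilities the effort simply alternates; the point is that under diminishing returns, “build an ever more impenetrable fortress” stops being the best use of the next unit of effort fairly quickly.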
Secondly, a proper sub-agent should also keep in mind the homeostatic objectives of the main agent. If any objective from among the main agent’s plurality of homeostatic objectives would get harmed as a side effect of excessive wall-building, then that needs to be taken into consideration. Depending on the situation, the main agent might care about these side effects even before it launches the sub-agent.
Thirdly, following the principles of homeostasis does not necessarily mean laziness and sloppiness in everything. Instead, homeostasis primarily notes that unbounded maximisation of a homeostatic objective is self-defeating and harmful even for the very objective that was maximised, in addition to potentially having side effects on the plurality of other objectives. So homeostasis is primarily about minding the target value, as opposed to maximisation of the actual value. An additional relevant principle is minding the plurality of objectives.
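As a toy formalisation of “minding the target value” (my own illustrative numbers and quadratic penalty, not a prescription from the post), a homeostatic objective can be scored as a penalty for deviating from its setpoint in either direction, so overshooting hurts the very objective that was pushed:

```python
def homeostatic_utility(actual, target):
    # Peak utility exactly at the setpoint; both shortfall and overshoot
    # are penalised, so "more" is not better here.
    return -(actual - target) ** 2

target = 100  # e.g. the requested number of paper clips
print(homeostatic_utility(100, target))  # 0, exactly on target
print(homeostatic_utility(90, target))   # -100, too few
print(homeostatic_utility(110, target))  # -100, too many is equally bad
```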
Finally, when an agent has a task to produce 100 paper clips, that does not mean that the number of paper clips needs to stay at 100 after the task has been completed. Perhaps it is entirely expected that these 100 paper clips will be carried away by authorised parties. Walls help against theft and environmental degradation of the produced paper clips, but we do not exactly need the walls to keep the paper clip count at 100 at all times; there is some deeper need or transaction behind the requested paper clips.
To avoid confusion, I would also point out that there are two types of balancing involved in these topics:
1. Balancing of a homeostatic objective: keeping the actual value of a single homeostatic objective near the target value (not too low, not too high).
2. Balancing across objectives: a form of considering the utilities of multiple objectives equally. That means meeting them in such a manner that the homeostatic objectives have, for example, least-squares deviations, while unbounded objectives have approximately the same utility value after utility functions with diminishing returns have been applied to each actual value. A toy sketch combining both types of balancing follows below.
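Here is a minimal sketch of how the two types of balancing could be combined into one score (purely illustrative; the squared-deviation penalty, the square root as the diminishing-returns transform, and the imbalance penalty are all assumptions of the example, not definitions from the post):

```python
import math

def homeostatic_score(actuals_and_targets):
    # Type 1 balancing: each homeostatic objective is penalised by its
    # squared deviation from its own target value (least-squares).
    return -sum((actual - target) ** 2 for actual, target in actuals_and_targets)

def unbounded_score(actual_values):
    # Type 2 balancing for unbounded objectives: apply a diminishing-returns
    # transform first, then favour keeping the transformed utilities roughly
    # equal to each other instead of maximising any single one.
    utilities = [math.sqrt(v) for v in actual_values]
    mean = sum(utilities) / len(utilities)
    imbalance = sum((u - mean) ** 2 for u in utilities)
    return sum(utilities) - imbalance

# Hypothetical numbers: one homeostatic objective exactly on target,
# one slightly off, plus two unbounded objectives at different levels.
total = homeostatic_score([(100, 100), (36.5, 37.0)]) + unbounded_score([9.0, 16.0])
print(total)
```

The exact functional forms matter less than the overall shape: deviations from setpoints are penalised in both directions, and no single unbounded objective can dominate the score by being maximised on its own.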
I am curious: how does this land with you, and does this respond to your question?