mattmacdermott

Karma: 128
• We’re already comparing to the default outcome in that we’re asking “what fraction of the default expected utility minus the worst comes from outcomes at least this good?”.

I think you’re proposing to replace “the worst” with “the default”, in which case we end up dividing by zero.

We could pick some new reference point other than the worst, but different from the default expected utility. (But that does introduce the possibility of negative OP, and it still has sensitivity issues.)
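To make the division-by-zero point concrete, here's a toy numeric sketch. The outcome utilities and default distribution are invented for illustration; the measure in question divides by (default expected utility minus a reference utility):

```python
# Toy illustration of the reference-point issue (numbers are invented).
outcomes = [0.0, 1.0, 2.0]   # utilities of the possible outcomes
default = [0.5, 0.3, 0.2]    # default distribution over those outcomes

eu_default = sum(p * u for p, u in zip(default, outcomes))  # 0.7
u_worst = min(outcomes)

denom_with_worst = eu_default - u_worst       # 0.7: a usable denominator
denom_with_default = eu_default - eu_default  # 0.0: division by zero
```

Replacing "the worst" with "the default" as the reference makes the denominator identically zero, whatever the distribution.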

Some Summaries of Agent Foundations Work

15 May 2023 16:09 UTC
46 points
• Nice, I’d read the first but didn’t realise there were more. I’ll digest later.

I think agents vs optimisation is definitely reality-carving, but I'm not sure I see the point about utility functions and preference orderings. I assume the idea is that an optimisation process just moves the world towards certain states, but an agent tries to move the world towards certain states, i.e. chooses actions based on how much they move the world towards certain states, so it makes sense to quantify how much of a weighting each state gets in its decision-making. But it's not obvious to me that there's not a meaningful way to assign weightings to states for an optimisation process too—for example, if a ball rolling down a hill gets stuck in the large hole twice as often as it gets stuck in the medium hole and ten times as often as the small hole, maybe it makes sense to quantify this with something like a utility function. Although defining a utility function based on the typical behaviour of the system and then trying to measure its optimisation power against it gets a bit circular.

Anyway, the dynamical systems approach seems good. Have you stopped working on it?
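The ball-in-holes idea can be sketched numerically. This is my own toy construction, not a proposal from the post: take the observed stopping frequencies (the 2:1 and 10:1 ratios above) and read off a log-frequency "revealed utility", so that differences between states are measured in bits:

```python
import math

# Stopping frequencies for the rolling ball, from the ratios in the comment:
# large : medium : small = 10 : 5 : 1.
freq = {"large": 10 / 16, "medium": 5 / 16, "small": 1 / 16}

# A "revealed utility" as log-frequency: differences between states are in bits.
utility = {hole: math.log2(f) for hole, f in freq.items()}

# The large hole exceeds the medium by exactly 1 bit, and the small by log2(10).
```

This is exactly the circularity worry: the "utility" is defined from typical behaviour, so measuring optimisation power against it just recovers the frequencies you started with.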

• Probably the easy utility function makes agent 1 have more optimisation power. I agree this means comparisons between different utility functions can be unfair, but not sure why that rules out a measure which is invariant under positive affine transformations of a particular utility function?

• Hm, I’m not sure this problem comes up.

Say I’ve built a room-tidying robot, and I want to measure its optimisation power. The room can be in two states: tidy or untidy. A natural choice of default distribution is my beliefs about how tidy the room will be if I don’t put the robot in it. Let’s assume I’m pretty knowledgeable and I’m extremely confident that in that case the room will be untidy: P(untidy) = 1 − 2^−11 and P(tidy) = 2^−11 (we do have to avoid probabilities of 0, but that’s standard in a Bayesian context). But really I do put the robot in and it gets the room tidy, for an optimisation power of −log2(2^−11) = 11 bits.

That 11 bits doesn’t come from any uncertainty on my part about the optimisation process, although it does depend on my uncertainty about what would happen in the counterfactual world where I don’t put the robot in the room. But becoming more confident that the room would be untidy in that world makes me see the robot as more of an optimiser.

Unlike in information theory, these bits aren’t measuring a resolution of uncertainty, but a difference between the world and a counterfactual.
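The arithmetic of the example, as I understand the measure (optimisation power = −log2 of the default probability of an outcome at least this good; the exact credence is chosen to match the 11 bits quoted above):

```python
import math

# Default credence that the room ends up tidy *without* the robot.
# (Chosen to match the 11 bits in the example above.)
p_tidy_by_default = 2 ** -11

# Optimisation power: -log2 of the default probability of doing at least
# this well. The robot achieves "tidy", so:
op_bits = -math.log2(p_tidy_by_default)  # 11.0
```

Note that p_tidy_by_default is a credence about a counterfactual world, not about the robot's mechanism, which is the point being made: sharpening that counterfactual credence changes the measured optimisation power.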

Towards Measures of Optimisation

12 May 2023 15:29 UTC
41 points
• An interesting point about the agency-as-retargetable-optimisation idea is that it seems like you can make the perturbation in various places upstream of the agent’s decision-making, but not downstream, i.e. you can retarget an agent by perturbing its sensors more easily than its actuators.

For example, to change a thermostat-controlled heating system to optimise for a higher temperature, the most natural perturbation might be to turn the temperature dial up, but you could also tamper with its thermistor so that it reports lower temperatures. On the other hand, making its heating element more powerful wouldn’t affect the final temperature.

I wonder if this suggests that an agent’s goal lives in the last place in a causal chain of things you can perturb to change the set of target states of the system.
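A toy simulation of the thermostat point (my own construction, with made-up dynamics): biasing the sensor shifts the equilibrium temperature, while strengthening the heating element leaves the equilibrium essentially unchanged.

```python
# Bang-bang thermostat with passive cooling (dynamics invented for illustration).
def equilibrium(setpoint, sensor_bias=0.0, heater_power=1.0, steps=20000):
    temp = 0.0
    for _ in range(steps):
        reading = temp + sensor_bias      # what the thermistor reports
        if reading < setpoint:
            temp += 0.1 * heater_power    # heater on
        temp -= 0.001 * temp              # slow passive heat loss
    return temp

# A thermistor that reads 5 degrees low makes the system settle about
# 5 degrees above the setpoint: a sensor perturbation retargets the system.
# Doubling heater_power barely moves the equilibrium at all.
```

The upstream/downstream asymmetry shows up directly: the actuator parameter changes how fast the target is approached, not which state is targeted.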

• 9 Feb 2023 6:23 UTC

Nice, thanks. It seems like the distinction the authors make between ‘building agents from the ground up’ and ‘understanding their behaviour and predicting roughly what they will do’ maps to the distinction I’m making, but I’m not convinced by the claim that the second one is a much stronger version of the first.

The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).

• 1. In our universe, as opposed to the “current basic theory of AI” universe.

2. From Arbital:

A Cartesian agent setup is one where the agent receives sensory information from the environment, and the agent sends motor outputs to the environment, and nothing else can cross the “Cartesian border” separating the agent and environment. If you can eat a psychedelic mushroom that affects the way you process the world—not just presenting you with sensory information, but altering the computations you do to think—then this is an example of an event that “violates the Cartesian boundary”. Likewise if the agent drops an anvil on its own head. Nothing that happens in a Cartesian universe can kill a Cartesian agent or modify its processing; all the universe can do is send the agent sensory information, in a particular format, that the agent reads.

3. For embedded agency. In the old frame agents aren’t really made of anything.

Normative vs Descriptive Models of Agency

2 Feb 2023 20:28 UTC
26 points
• Thanks. Is there a particular source whose notation yours most aligns with?

• When you write I understand that to mean that for all . But when I look up definitions of conditional probability it seems that that notation would usually mean for all

Am I confused or are you just using non-standard notation?

• Fair enough, but in that example making irreversible decisions is unavoidable. What if we consider a modified tree such that one and only one branch is traversable in both directions, and utility can be anywhere?

I expect we get that the reversible branch is the most popular across the distribution of utility functions (but not necessarily that most utility functions prefer it). That sounds like cause for optimism—‘optimal policies tend to avoid irreversible changes’.
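Here's a toy check of that intuition (my own construction, not from the paper): a three-state deterministic MDP where one first move leads to a state you can return from and the other is absorbing. Sampling state-utilities iid uniform, the reversible branch is the optimal first move for a clear majority of sampled utility functions, but not for all of them:

```python
import random

random.seed(0)
GAMMA = 0.9
# States: 0 = start, 1 = reversible branch (can return to start), 2 = trap.
# T[state][action] gives the deterministic next state.
T = {0: [1, 2], 1: [0, 1], 2: [2, 2]}

def optimal_first_action(r):
    """Return 0 if the reversible branch is optimal from the start state."""
    V = [0.0, 0.0, 0.0]
    for _ in range(200):  # value iteration to convergence
        V = [max(r[T[s][a]] + GAMMA * V[T[s][a]] for a in (0, 1))
             for s in (0, 1, 2)]
    q = [r[T[0][a]] + GAMMA * V[T[0][a]] for a in (0, 1)]
    return 0 if q[0] >= q[1] else 1

trials = 2000
reversible_wins = sum(
    optimal_first_action([random.random() for _ in range(3)]) == 0
    for _ in range(trials)
)
# reversible_wins is a clear majority of trials, but well short of all of them.
```

This matches the expectation above: the reversible branch is the most popular first move across the distribution of utility functions, even though plenty of individual utility functions (roughly those whose mass sits on the trap state) prefer the irreversible one.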

• I’ve been thinking about whether these results could be interpreted pretty differently under different branding.

The current framing, if I understand it correctly, is something like, ‘Powerseeking is not desirable. We can prove that keeping your options open tends to be optimal and tends to meet a plausible definition of powerseeking. Therefore we should expect RL agents to seek power, which is bad.’

An alternative framing would be, ‘Making irreversible changes is not desirable. We can prove that keeping your options open tends to be optimal. Therefore we should not expect RL agents to make irreversible changes, which is good.’

I don’t think that the second framing is better than the first, but I do think that if you had run with it instead then lots of people would be nodding their heads and feeling reassured about corrigibility, instead of feeling like their views about instrumental convergence had been confirmed. That makes me feel like we shouldn’t update our views too much based on formal results that leave so much room for interpretation. If I showed a bunch of theorems about MDPs, with no exposition, to two people with different opinions about alignment, I expect they might come to pretty different conclusions about what they meant.

What do you think?

(To be clear I think this is a great post and paper, I just worry that there are pitfalls when it comes to interpretation.)