Thanks Karl, that’s a really useful shortform and I’m sorry I wasn’t already aware of it. This is exactly the kind of thing I had in mind when I said “there are many agentic lenses people have constructed”. I am planning a follow-up point about this, but for now I would like to link to your shortform there to highlight what I mean.
I do think there is a general sense of the terms agent and agency, referring to goal-directedness alongside something like coherence-seeking or strategic planning, that gets bandied about in AI safety, and that is what I am trying to refer to here. However, I absolutely agree with you that if we were more specific about our use of the term, and about the specific context within which each use makes sense, the kind of teleological thinking that I am arguing against here would likely not emerge.
SJ_Beard
Thanks so much Edward, I really appreciate these points!
My understanding of a natural kind is a category that reflects how the world is rather than being constructed in service of our particular interests. To me, ‘x is a natural kind’ should mean something like ‘there is a clear boundary that separates things that are x from things that are not x’ or ‘xness is defined in terms of a clear property that all things possess to a particular extent’. I don’t think agency, or being an agent, is really like that. However, I also think that the concept of being a natural kind is probably not, itself, a natural kind, in the sense that what kinds of boundaries or properties we take to be clear is itself determined as much, if not more, by our interests than by the fundamental nature of things. So I certainly don’t mean that not being a natural kind is a bad thing. I just think that when we treat agency as being more of a natural kind than it is, this can lead us to the kind of teleological thinking about the ultimate nature of ‘agency’ that I am concerned about here.
My actual inspiration for this argument did not come from ontology though, and I am not sure that talking about it in terms of natural kinds and constructs was the most helpful way of presenting things; it was just the best I could come up with. What I was really thinking about was Derek Parfit’s reductionist account of personal identity. On this view, a person’s identity is not a fundamental fact about the world; there is actually no clear boundary around the self that would work the way we expect it to. Rather, questions about personal identity are really questions about other facts, what Parfit called ‘Relation R’, and my identity over time is really just a useful description of these facts that makes sense in many practical circumstances but can break down in others. I think agency is like this. When we want to explore a system’s agency, what we really want to do is to understand a bunch of other things about how the system makes decisions and acts on them. Agency is a useful description that summarises these facts in many practical cases (such as human-to-human interactions), but if we then take that same description and apply it to other kinds of system, like AIs, it could lead us astray.
I take your point that my focus on VNM-style coherence arguments may be misplaced here and that there are other, more important, arguments I should be considering. I do think that in the general discourse around AI safety there is an assumption that agents will inevitably be led to VNM utilities, but I think you have a lot more direct experience of the discourse there than I do. On the other hand, I do feel like there could be some terminological disagreements here. I agree with you that these are really claims about rationality rather than agency, but I think that there is then often a hidden assumption that agents will inevitably seek to become rational. I would love to talk with you more about this!
Thanks Jason, I really appreciate these thoughts.
My immediate reaction to your first paragraph is that I think I might have a slightly different way of thinking about this to you, and potentially to most people in this space. I generally agree with you that “roughly agentic/goal-driven-entities lens is very useful for thinking about how intelligent autonomous systems might go bad”. However, the question is not merely how things might go bad but how we can make them go well. Leo Szilard once told workers on the Manhattan Project that they should treat enriched uranium like “a mule that is trying to kick you”, because otherwise they tended to be overly complacent about the risk of working with it. This clearly wasn’t directly justified as an ascription of agency to the uranium, but it was still a useful safety measure, because imagining that the uranium wanted to form critical masses that would explode was a helpful way of thinking about how things might go bad! However, one difference in this case was that it made no difference to the uranium how we conceptualised its decision making, nor did thinking about it this way affect our scientific understanding of the uranium. Neither of these things is true for AI. So I think it’s possible that using a goal-directed lens might be very useful for thinking about how intelligent autonomous systems might go bad, but if we then go on and use that lens in modelling their future trajectories, this could restrict the possibility space we are willing to consider. And if we talk about these systems as goal-directed in ways that get into their training data, or otherwise adjust our training methods around this assumption, then this could become something of a self-fulfilling prophecy. That is what I was trying to say at the end of the article, but I think your comment has prompted me to think about it in a way that I hope makes this point sharper!
I think shard theory is a really interesting hypothesis and model for human value learning. I didn’t know about the research agenda you and Edward were developing, but it also looks super interesting and I would love to discuss it with you! One of the things that I like about shard theory but don’t currently see in your agenda is that it tries to account for meta-agency, the way in which agents start trying to develop sub-agents within themselves to help them manage their goals. Within shard theory this is handled by a bidding mechanism where different shards compete to determine the correct response to a stimulus. I tend to think of meta-agency as potentially a fully formed agent in its own right, with its own boundaries and belief-like and desire-like states, like the concepts of ‘wise mind’ or ‘loving awareness’ I keep coming across in meditation and therapy. One of the things I find so fascinating about human meta-agency is that it seems extremely flexible: some people use their meta-agency to ruthlessly optimise goals towards some single purpose, some to develop balance, equanimity, or internal diversity for their own sakes, some to realise the impermanence and futility of their goals and desires. I am interested in how we can develop theories of meta-agency that don’t dismiss this kind of flexibility out of hand but can explain why meta-agency might develop in different ways even across people/systems that share similar basic architectures and environmental contexts! In your experiments on retraining agents, did you come across anything that seemed to be governing how the agents evaluated, compared, or selected between their original and new goals in a way analogous to this kind of meta-agency?
My reaction to Vanessa Kosoy’s comment is that I basically agree with her, and I still think the kind of reflection I am making here is useful. She gives three reasons why, in AI safety, we are from the get-go interested in goal-directed systems. One of these, “we are worried about systems with bad goals”, I think I have already dealt with: yes, we should worry about that, but we shouldn’t then jump to the conclusion that this tells us what these systems will actually be like. The other two, “we want AIs to achieve goals for us” and “stopping systems with bad goals is also a goal”, are very interesting to me because they turn the question back on us. I’m not saying her argument is circular, but there is a sense in which the claim here is that “we are interested in goal-directed systems because we are directing systems towards goals”. Well, is that the only thing we could be doing? I think that when we think about aligning people or institutions we often don’t take this approach: we endorse vague mission statements, complex decision-making processes with checks and balances, vague but useful social norms and the like. Of course her next comment is “so what is your proposal?” I don’t have one yet, and that is a clear weakness. However, I don’t think that this means no other proposals exist. My current interest is precisely in following this move of turning the question back on ourselves, and understanding alignment not as a purely technical process of giving the AI a well-specified objective, but as a sociotechnical one of developing human-AI interactions that are long-term sustainable and beneficial to us. That is only possible if we are willing to reflect deeply on what we are doing and why, and not just on what superintelligence might do and why.
Finally, on Veedrac’s post. I am slightly less persuaded by this. My main reason is just that I think pure optimisation is actually a very rare process to find at the meso scale. Obviously it happens at the level of fundamental laws of physics, but when larger systems try to pursue this kind of strategy they tend to collapse for one reason or another. Even RL is most effective when it is not pure optimisation but a stochastic process. I do think that if we developed systems that were strong optimisers then it is that, and not their agency per se, that might doom us. However, I don’t think we should do that and I don’t think we have to either. Maybe that’s not such a good response though?
Agency is not a natural kind (and why that might matter for alignment)
Intelligence is something we can use to navigate complexity, but intelligence is also something we use to create complexity. Often it is our smartest thinking that creates the most complexity, and it is complexity that fuels Moloch. I think our ability to kill Moloch depends more on shifting the balance between these two uses of intelligence, so that we get better at navigating complexity than at creating it, than on just increasing the level of intelligence. Thus, I’ve always been sceptical that any capabilities-based approach (regardless of whether these are human or AI capabilities) can kill Moloch on its own without a plan like this for how those capabilities should be deployed.
I think it’s easy to overestimate the stability of positions in a complex system like the world. How does one achieve any kind of real stability without a singleton superintelligence? Life is dynamic, and so is the economy. If we only evaluate safe and stable positions then we rule out a lot of dynamic possibilities in which we just keep navigating through the chaos. That’s not to say that having a game that stops just before superintelligence can teach you how to develop safe AI, but even just navigating the next few years may be enough of a hard problem that it is worth thinking about, even if we have to leave it to our future selves to then decide where to take things after that!
I don’t think there is a shortage of scientific theories about preferences; this is absolutely the domain of much of economics and psychology after all. However, in order to get an ‘ought’, science would need to shift from studying preferences (‘this person prefers x to y’, ‘this part of the brain is responsible for preferring x to y’, ‘x should be preferable to y if we accept value system z’) to actually constructing them (‘x is preferable to y’).
It is also widely accepted that science is inherently bound up with our value systems and preferences, indeed it’s sometimes argued that science is necessarily value laden (see here for a useful summary and typology of how this can work—https://philsci-archive.pitt.edu/19000/1/penultimate.pdf ).
However, the problem with making this an actual subject of scientific inquiry is that it’s not clear what you would even study. Do you want a science of morals (that empirically studies how and when people come to have concepts like right and good), a moral psychology (that studies what is going on in our brains when we make ethical decisions), a fully naturalised ethical theory (such as utilitarianism, which says that right and wrong are reducible to entirely empirical facts such as the hedonic states of sentient beings), or a normative science (which holds that there are non-empirical, normative facts about the world, and that ethics is about discovering what these are just as science is about discovering empirical facts)? All of these have been tried, often with decidedly mixed results, and generally pulling in quite different directions.
I’ve always felt like it’s easy to fall into a trap where:
1) I am clever +
2) I am well-intentioned +
3) I am working on something important =
4) I am doing good work
In theory, that should work, but in practice even the cleverest human isn’t that smart, even the best intentions aren’t that good, and all the most important problems are really hard. So it’s perfectly possible for 1, 2, and 3 to all be true and yet 4 to be completely false (indeed, for you to be actively causing harm).
There are things we can do about this, things that we should do about it (question your assumptions, ask how your work might be net-negative, make bold predictions and change your mind, etc etc), but nothing that actually saves you from the problem. Even the most rational schemes represent, in my view, only marginal improvements, and most of us aren’t anywhere near that.
But then we are just a bunch of humans doing our best, and after all, what else can we do? I find it really hard to imagine that this all applies to me, and I am sure many others do as well. I definitely take this as a good reason to be supportive of those who decide the work they can actually do on x-risk isn’t worth doing, and I think we all should, because being open to that possibility is a good thing to be. But if it really matters to you to just keep working on whatever you can in this space, I would support that too, because after all, you might be doing more good than you realize as well!
It is a strange thing to be in a field whose constraints are not financial—in a world where we are so used to money being the main constraint on everything, it can be very disorienting. ‘There is lots of funding available’ is a phrase one hears often, and it seems to be true, yet a lot of funding seems to be going into bringing more and more people into the field, many of whom then don’t seem to get any of it to help them work on the actual problem!
Part of the reason, I think, is that the field is more constrained by its social and human capital. To take people on a generator week, or set up a new organization, or run extra hiring rounds, or offer more advising calls, requires people to put time into those things. The people who are at the cutting edge of safety often don’t have time for this, or feel that allocating their time to such projects would be counterproductive, as it would pull them away from more valuable work. But that’s fine: we have plenty of other people on the periphery, people who know about what needs doing, who maybe do a bit of this sort of work, or have done, but whose contributions are not so important that they can’t still dedicate some or all of their time to creating opportunities for others, and that feels good. However, it also creates something of an air gap around the cutting-edge researchers, whereby you now have a lot of specialised generalists running fellowships or BlueDot courses and generator weeks, but they aren’t actually that well connected to the core work that we are trying to train people for. And the bigger these schemes get, the wider that air gap tends to grow, making the schemes less and less valuable for the people on them.
I’ve heard people say that AI safety is turning into a pyramid selling scheme, where the easiest way to get paid is to sell the dream on to others and do things that make them feel like they might actually contribute to AI safety, but then the easiest way for them to get paid is to do the same thing (if they even get that far), so on it goes and on it goes. I don’t like the way that ‘pyramid selling scheme’ makes it sound like people are being malicious. I just don’t think they are, but I also think that focusing too much on extending revolutionary love into the world when it’s actually very hard (and maybe even getting harder) for people to get close to the frontline of AI safety could be misguided. I think we all need to keep our eyes open about what we are doing in this community, and ensure that our love is well placed!
Thanks Roger, I agree with you, although I also think we should be careful not to overstate the optimising power of either evolution or reinforcement learning. These are optimising processes, to be sure, but evolution certainly isn’t very good at finding absolute optima and regularly seems to allow highly suboptimal solutions to problems as ‘good enough’. RL is probably a lot better but still imperfect. So I agree that some degree of, say, goal-directedness is likely to emerge from these processes, but it may remain imperfect even if they were left to run for a very long time, especially if any of the goal, the system’s basic substrate, or its environment were sufficiently complex to begin with.