Half-researcher, half-distiller (see https://distill.pub/2017/research-debt/), both in AI Safety. Funded, with a PhD in theoretical computer science (distributed computing).
adamShimi (Adam Shimi)
Thanks for both your careful response and the pointer to Conceptual Engineering!
I believe I am usually thinking in terms of defining properties for their use, but it’s important to keep that in mind. The post on Conceptual Engineering led me to this follow-up interview, which contains a great formulation of my position:
Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He’s got a couple really interesting books, one on knowledge one on causation, and big parts of what he’s doing are informed by the long history of conceptual analysis. He’ll go through the puzzles, show a formalization, but then does a further thing, which philosophers need to take very seriously and should do more often. He says, look, I have this core idea, but to deploy it I need to know the problem domain. The shape of the problem domain may put additional constraints on the mathematical, precise version of the concept. I might need to tweak the core idea in a way that makes it look unusual, relative to ordinary language, so that it can excel in the problem domain. And you can see how he’s making use of this long history of case-based, conceptual analysis-friendly approach, and also the pragmatist twist: that you need to be thinking relative to a problem, you need to have a constraint which you can optimize for, and this tells you what it means to have a right or wrong answer to a question. It’s not so much free-form fitting of intuitions, built from ordinary language, but the solving of a specific problem.
So my take is that there is probably a core/basic concept of goal-directedness, which can be altered and fitted to different uses. What we actually want here is the version fitted to AI Alignment. So we could focus on that specific version from the beginning; yet I believe that looking for the core/basic version and then fitting it to the problem is more efficient. That might be a big source of our disagreement.
(By the way, Joe Halpern is indeed awesome. I studied a lot of his work related to distributed systems, and it’s always the perfect intersection of a philosophical concept and problem with a computer science treatment and analysis.)
Hmmm, it doesn’t seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you’re only using intuitions about human goal-directedness, then you’ll probably miss out on a bunch of important ideas. Whereas if you’re using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.
I resolve the apparent paradox that you raise by saying that the intuitions are about the core/basic idea, which is close to human goal-directedness, but that this idea should then be fitted and adapted to our specific application of AI Alignment.
Meanwhile, I agree that the backward approach will fail if we try to find “the fundamental property that the forward approach is trying to formalise”. But this seems like bad philosophy. We shouldn’t expect there to be a formal or fundamental definition of agency, just like there’s no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it’s most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying “what cognition leads to the pursuit of convergent instrumental goals”; and another valuable step is saying “what ways of building minds lead to that cognition”; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of “conceptual engineering”.
Agreed. My distinction of forward and backward felt shakier by the day, and your point finally puts it out of its misery.
So I count my approach as a backwards one, consisting of the following steps:
It’s possible to build AGIs which are dangerous in a way that intuitively involves something like “agency”.
Broadly speaking, the class of dangerous agentic AGIs have certain cognition in common, such as making long-term plans, and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger, but contribute little to capabilities.
We can then try to design training processes which prevent some of those traits from arising.
My take on your approach is that we’re still at 3, and we don’t yet have a good enough understanding of those traits/properties to manage 4. As for how to solve 3, I reiterate that finding a core/basic version of goal-directedness and adapting it to the use case seems to be the way to go for me.
My take on this is: we don’t really care. Many if not most people in AI Alignment come from computer science fields, which have traditionally pushed a lot for open access and against traditional publishers. So I don’t believe Elsevier would find enough editors and papers to start its journal, and it thus wouldn’t be able to reach the point where publishing there is required for careers.
I think the disagreement left is whether we should first find a definition of goal-directedness then study how it appears through training (my position), or if we should instead define goal-directedness according to the kind of training processes that generate similar properties and risks (what I take to be your position).
Does that make sense to you?
Thanks for the inclusion in the newsletter and the opinion! (And sorry for taking so long to answer)
This literature review on goal-directedness identifies five different properties that should be true for a system to be described as goal-directed:
It’s implicit, but I think it should be made explicit that the properties/tests are what we extract from the literature, not what we say is fundamental. More specifically, we don’t say they should be true per se, we just extract and articulate them to “force” a discussion of them when defining goal-directedness.
One common difference is that goal-directedness is often understood as a _behavioral_ property of agents, whereas optimization is thought of as a _mechanistic_ property about the agent’s internal cognition.
I don’t think the post says that. We present both a behavioral (à la Dennett) view of goal-directedness as well as an internal property one (like what Evan or Richard discuss); same for the two forms of optimization considered.
Voted! I put most (but not all) of my votes on AI Alignment posts that are incredibly valuable to me, and, I believe, to the field.
I might actually read some more stuff today and give my additional votes to a couple of relevant posts.
This is awesome! That’s exactly the kind of post I wanted you to make!
About the lessons themselves:
I agree with the value of Anki, but I have trouble finding things to ankify. My first impulse was to ankify everything under the sun, which led to an Anki burnout. Now I have the inverse problem of not finding much to put in Anki, mostly because I want to know/understand the concept before ankifying it. Or maybe I just don’t read enough maths these days.
I want to disagree with the “read several textbooks at once” advice, but I think you’re right. It’s just that I’m trying so hard to focus on things and not jump from one to another all the time that reading multiple textbooks at once triggers all my internal alarms.
I’ll try to find a safe way for me to do that.
About not reading the whole textbook, I think I agree with the gist, but I disagree with what you actually write. I definitely think that you should read most of a textbook if what you’re doing is reading the textbook. On the other hand, you shouldn’t try to master every detail in it.
If you want to apply the Pareto principle, then go through papers and write down everything that you don’t know, then search for that in textbooks. That’s the efficient way. But reading a textbook is for getting a general impression of the field and building a map, so the later chapters are useful; just don’t spend 20 hours on them.
The “read easier textbooks” advice looks like a rephrasing of “go just outside of your comfort zone”. It makes sense to me.
I generally don’t have the “approximate models” problem. But I also mostly read maths and computer science, which is additive instead of corrective.
And a problem this was. In early 2020, I had an interview where I was asked to compute . I was stumped, even though this was simple high school calculus (just integrate by parts!). I failed the interview and then went back to learning algebraic topology and functional analysis and representation theory. You know, nothing difficult like high school calculus.
I think this is symptomatic of a problem I have myself, and which I only understood recently. I want to learn the cool shit, and I studied a lot of the fundamentals in my first two years after high school (where we did 12 hours of maths a week). So I should remember how to do basic calculus! But somehow I forgot. And every time I study a more advanced book, I feel like I should brush up on my analysis and my linear algebra and all that if I want to really understand.
Yet that’s wrong, because I mostly read these advanced textbooks for one reason: getting a map of the territory. Then I will know where to look when I need something that looks like that. And for that purpose, getting all the details perfectly right is not important.
On the other hand, there are some parts of maths in which I want actual skills. There I should take the time to learn how to do things, instead of simply knowing that they exist and having a big-picture description of them.
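(As an aside on the quoted interview anecdote above: the specific integral isn’t reproduced there, so as a purely hypothetical illustration of the kind of “simple high school calculus” meant, here is a generic integration-by-parts computation; the integrand is my assumption, not the one from the interview.)

```latex
% Hypothetical example: the actual interview integral is not shown above.
% Integration by parts: \int u \, dv = uv - \int v \, du.
% With u = x and dv = e^{x}\,dx, we get du = dx and v = e^{x}:
\[
\int x e^{x} \, dx = x e^{x} - \int e^{x} \, dx = (x - 1) e^{x} + C.
\]
```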
Yep, we seem to agree.
It might not be clear from the lit review, but I personally don’t agree with all the intuitions, or not completely. And I definitely believe that a definition that throws away some of the intuitions but applies to AI risk arguments is totally fine. It’s more that I believe the gist of these intuitions is pointing in the right direction, and so I want to keep them in mind.
Good to know that my internal model of you is correct at least on this point.
For Daniel, given his comment on this post, I think we actually agree, but that he puts more explicit emphasis on the that-which-makes-AI-risk-arguments-work, as you wrote.
Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven’t observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.
I actually don’t think we should make this distinction. It’s true that Dennett’s intentional stance falls in the first category for example, but that’s not the reason why I’m interested in it. Explainability seems to me like a way to find a definition of goal-directedness that we can check through interpretability and verification, and which tells us something about the behavior of the system with regard to AI risk. Yet that doesn’t mean it only applies to the observed behavior of systems.
The biggest difference between your definition and the intuitions is that you focus on how goal-directedness appears through training. I agree that this is a fundamental problem; I just think that this is something we can only solve after having a definition of goal-directedness that we can check concretely in a system and that allows the prediction of behavior.
Firstly, we don’t have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.
As mentioned above, I think a definition of goal-directedness should allow us to predict what an AGI will broadly do based on its level of goal-directedness. Training for me is only relevant for understanding which levels of goal-directedness are possible/probable. That seems like the crux of the disagreement here.
Secondly, because of the possibility of deceptive alignment, it doesn’t seem like focusing on observed behaviour is sufficient for analysing goal-directedness.
I agree, but I definitely don’t think the intuitions are limiting themselves to the observed behavior. With a definition you can check through interpretability and verification, you might be able to steer clear of deception during training. That’s a use of (low) goal-directedness similar to the one Evan has in mind for myopia.
Thirdly, suppose that we build a system that’s goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn’t happen again.
For that one, understanding how goal-directedness emerges is definitely crucial.
Infra-Bayesianism Unwrapped
Glad my comment clarified some things.
About the methodology, I just published a post clarifying my thinking about it.
Against the Backward Approach to Goal-Directedness
Thanks for the proposed idea!
Yet I find myself lost when trying to find more information about this concept of care. It is mentioned in both the chapter on Heidegger in The History of Philosophy and the section on care in the SEP article on Heidegger, but I don’t get a single thing written there. I think the ideas of “thrownness” and “disposedness” are related?
Do you have specific pointers to deeper discussions of this concept? Specifically, I’m interested in new intuitions for how a goal is revealed by actions.
Thanks!
Glad they helped! That’s the first time I’ve used this feature, and we debated whether to add more or remove them completely, so thanks for the feedback. :)
I think depending on what position you take, there are differences in how much one thinks there’s “room for a lot of work in this sphere.” The more you treat goal-directedness as important because it’s a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand, if you want to treat goal-directedness in a human-independent way or otherwise care about it “for its own sake” for some reason, then it’s a different story.
If I get you correctly, you’re arguing that there’s less work on goal-directedness if we try to use it concretely (for discussing AI risk), compared to if we study it for its own sake? I think I agree with that, but I still believe that we need a pretty concrete definition to use goal-directedness in practice, and that we’re far from there. There is less pressure to deal with all the philosophical nitpicks, but we should at least get the big intuitions (of the type mentioned in this lit review) right, or explain why they’re wrong.
Thanks for the feedback!
My only critique so far is that I’m not really on board yet with your methodology of making desiderata by looking at what people seem to be saying in the literature. I’d prefer a methodology like “We are looking for a definition of goal-directedness such that the standard arguments about AI risk that invoke goal-directedness make sense. If there is no such definition, great! Those arguments are wrong then.”
I agree with you that the end goal of this research is to make sense of the arguments about AI risk invoking goal-directedness, and of the proposed alternatives. The thing is, even if it’s true, proving that there is no property making these arguments work looks extremely hard. I have very little hope that it is possible to show it one way or the other head-on.
On the other hand, when people invoke goal-directedness, they seem to reference a cluster of similar concepts. And if we manage to formalize this cluster in a satisfying manner for most people, then we can look at whether these (now formal) concepts make the arguments for AI risk work. If they do, then problem solved. If the arguments fail with this definition, I still believe that this is strong evidence for the arguments not working in general. You can say that I’m taking the bet that “the behavior of AI risk arguments with inputs in the cluster of intuitions from the literature is representative of the behavior of AI risk arguments with any definition of goal-directedness”. Rohin for one seems less convinced by this bet (for example with regard to the importance of explainability).
My personal prediction is that the arguments for AI risk do work for a definition of goal-directedness close to this cluster of concepts. My big uncertainty is what constitutes a non-goal-directed (or less goal-directed) system, and whether such systems are viable against goal-directed ones.
(Note that I’m not saying that all the intuitions in the lit review should be part of a definition of goal-directedness. Just that they probably need to be addressed, and that most of them capture an important detail of the cluster)
I also have a suggestion or naive question: Why isn’t the obvious/naive definition discussed here? The obvious/naive definition, at least to me, is something like:
“The paradigmatic goal-directed system has within it some explicit representation of a way the world could be in the future—the goal—and then the system’s behavior results from following some plan, which itself resulted from some internal reasoning process in which a range of plans are proposed and considered on the basis of how effective they seemed to be at achieving the goal. When we say a system is goal-directed, we mean it is relevantly similar to the paradigmatic goal-directed system.”
I feel like this is how I (and probably everyone else?) thought about goal-directedness before attempting to theorize about it. Moreover I feel like it’s a pretty good way to begin one’s theorizing, on independent grounds: It puts the emphasis on relevantly similar and thus raises the question “Why do we care? For what purpose are we asking whether X is goal-directed?”
Your definition looks like Dennett’s intentional stance to me. In the intentional stance, the “paradigmatic goal-directed system” is the purely rational system that tries to achieve its desires based on its beliefs, and being an intentional system/goal-directed depends on similarity, in terms of prediction, with this system.
On the other hand, for most internal structure based definitions (like Richard’s or the mesa-optimizers), a goal-directed system is exactly a paradigmatic goal-directed system.
But I might have misunderstood your naive definition.
Literature Review on Goal-Directedness
Likewise, thanks for taking the time to write such a long comment! And I hope that’s a typo in the second sentence :)
You’re welcome. And yes, that was a typo, which I corrected. ^^
Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
My take is that a lot of people around here agree that transparency is at least useful, and maybe necessary. And the main reason why people are not working on it is a mix of personal fit, and the fact that without research in AI Alignment proper, transparency doesn’t seem that useful (if we don’t know what to look for).
I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than scaling up the way that verification is used in current models.) But this is just semantics.
Well, transparency is doing some work, but on its own it’s unable to prove anything, and proving things is a big part of the approach I’m proposing. That being said, I agree that this doesn’t look like scaling up the way verification is currently used.
Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn’t expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
You’re right that I was thinking of a more online system that could update its weights during deployment. Yet even with frozen weights, I definitely expect the model to make plans involving things that were not explicitly encoded in them. For example, it might not have a bio-weapon feature, but it might have the relevant subfeatures to build one through quite local rules that don’t look like a plan to build a bio-weapon.
Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with it.
That seems reasonable.
To check if I understand correctly, you’re arguing that the selection pressure to use arguments in order to win requires the ability to be swayed by arguments, and the latter already requires explicit reasoning?
That seems convincing as a counter-argument to “explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments”, but I’m not knowledgeable enough about the work quoted to check whether they have a more subtle position.
Thanks!