My thought process when I use “safer” and “less safe” in posts like this is: the main arguments that AGI will be unsafe depend on it having certain properties, like agency, unbounded goals, lack of interpretability, desire and ability to self-improve, and so on. So reducing the extent to which it has those properties will make it safer, because those arguments will be less applicable.
I guess you could have two objections to this:
Maybe safety is non-monotonic in those properties.
Maybe you don’t get any reduction in safety until you hit a certain threshold (corresponding to some success story).
I tend not to worry so much about these two objections because, to me, the properties I outlined above are still too vague for us to have a good idea of the landscape of risks with respect to them. Once we know what agency is, we can talk about its monotonicity. For now my epistemic state is: extreme agency is an important component of the main argument for risk, so, all else equal, reducing it should reduce risk.
I like the idea of tying safety ideas to success stories in general, though, and will try to use it for my next post, which proposes more specific interventions during deployment. Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.
Nothing in particular. My main intention with this post was to describe a way the world might be, and some of the implications. I don’t think such work should depend on being related to any specific success story.
I’m hoping there’s a big qualitative difference between fine-tuning on the CEO task versus the “following instructions” task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through training. (And 99% is probably an underestimate: the last 10,000 years of civilisation are much less than 1% of the time we’ve spent evolving from, say, the first mammals.)
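A rough back-of-envelope, taking “first mammals” to be on the order of 200 million years ago (an illustrative round figure, not a precise one):

$$\frac{10{,}000 \text{ years}}{2 \times 10^{8} \text{ years}} = 5 \times 10^{-5} \approx 0.005\%$$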
Plus, on the “follow human instructions” task you can add instructions which specifically push against whatever initial motivations the agent had, which is much harder to do on the CEO task.
I agree that this is a concern though.
I should clarify that when I think about obedience, I’m thinking of obedience to the spirit of an instruction, not just the wording of it. Given this, the two seem fairly similar, and I’m open to arguments about whether it’s better to talk in terms of one or the other. I guess I favour “obedience” because it has fewer connotations of agency: if your goal is “doing what a human wants you to do”, then you might run off and do things before receiving any instructions. (Also because it’s shorter and pithier; “the goal of doing what humans want” is a bit of a mouthful.)
Yeah, so I guess opinions on this would differ depending on how likely people think existential risk from AGI is. Personally, it’s clear to me that agentic misaligned superintelligences are bad news—but I’m much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that’s much more intelligent than humans and much less agentic seems quite plausible—even, perhaps, in an RL agent.
I think some parts of it do—e.g. in this post. But yes, I do really like Chapman’s critique and wish I’d remembered about it before writing this so that I could reference it and build on it.
Especially: Understanding informal reasoning is probably more important than understanding technical methods. I very much agree with this.
Yes, I saw Chapman’s critiques after someone linked one in the comments below, and broadly agree with them.
I also broadly agree with the conclusion that you quote; that seems fairly similar to what I was trying to get at in the second half of the post. But in the first half of the post, I was also trying to gesture at a mistake made not by people who want simple, practical insights, but rather by people who do research in AI safety, learning human preferences, and so on, using mathematical models of near-ideal reasoning. However, it looks like making this critique thoroughly would require much more effort than I have time for.
Game-theoretic concepts like social dilemma, equilibrium selection, costly signaling, and so on seem indispensable
I agree with this. I think I disagree that “stating them crisply” is indispensable.
I wouldn’t know where to start if I couldn’t model agents using Bayesian tools.
To be a little contrarian, I want to note that this phrasing has a certain parallel with the streetlight effect: you wouldn’t know how to look for your keys if you didn’t have the light from the streetlamp. In particular, this is also what someone would say if we currently had no good methods for modelling agents, but bayesian tools happened to be the ones which seemed most promising.
Anyway, I’d be interested in having a higher-bandwidth conversation with you about this topic. I’ll get in touch :)
There is a third use of Bayesianism, the way that sophisticated economists and political scientists use it: as a useful fiction for modeling agents who try to make good decisions in light of their beliefs and preferences. I’d guess that this is useful for AI, too. These will be really complicated systems and we don’t know much about their details yet, but it will plausibly be reasonable to model them as “trying to make good decisions in light of their beliefs and preferences”.
Perhaps a fourth use is that we might actively want to try to make our systems more like Bayesian reasoners, at least in some cases.
My post was intended to critique these positions too. In particular, the responses I’d give are that:
There are many ways to model agents as “trying to make good decisions in light of their beliefs and preferences”. I expect bayesian ideas to be useful for very simple models, where you can define a set of states to have priors and preferences over (a minimal sketch of what I mean is below, after these points). For more complex and interesting models, I think most of the work is done by considering the cognition the agents are doing, and I don’t think bayesianism gives you particular insight into that, for the same reasons I don’t think it gives you particular insight into human cognition.
In response to “The Bayesian framework plausibly allows us to see failure modes that are common to many boundedly rational agents”: in general I believe that looking at things from a wide range of perspectives allows you to identify more failure modes—for example, thinking of an agent as a chaotic system might inspire you to investigate adversarial examples. Nevertheless, apart from this sort of inspiration, I think that the bayesian framework is probably harmful when applied to complex systems because it pushes people into using misleading concepts like “boundedly rational” (compare your claim with the claim that a model in which all animals are infinitely large helps us identify properties that are common to “boundedly sized” animals).
“We might actively want to try to make our systems more like Bayesian reasoners”: I expect this not to be a particularly useful approach, insofar as bayesian reasoners don’t do “reasoning”. If we have no good reason to think that explicit utility functions are feasible in practical AGI, except that they’re what ideal bayesian reasoners use, then I want to discourage people from spending their time on that instead of on something else.
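To make the “very simple models” from the first point concrete (as promised above), here is a minimal sketch of the kind of thing I have in mind: a handful of hand-enumerated states, a prior, a single Bayes update, and an expected-utility-maximising choice of action. All the states, numbers, and names are purely illustrative.

```python
# A toy bayesian decision-maker: enumerate states, hold a prior over them,
# update on one observation, then pick the action with highest posterior
# expected utility. Everything here is illustrative, not from a real system.

states = ["rain", "sun"]
prior = {"rain": 0.3, "sun": 0.7}

# Likelihood of observing "dark clouds" in each state.
likelihood = {"rain": 0.9, "sun": 0.2}

# Posterior after seeing dark clouds, via Bayes' rule.
unnormalised = {s: prior[s] * likelihood[s] for s in states}
total = sum(unnormalised.values())
posterior = {s: p / total for s, p in unnormalised.items()}

# Preferences: utility of each action in each state.
utility = {
    ("umbrella", "rain"): 1.0, ("umbrella", "sun"): 0.6,
    ("no umbrella", "rain"): 0.0, ("no umbrella", "sun"): 1.0,
}

# Choose the action with the highest posterior expected utility.
actions = ["umbrella", "no umbrella"]
best = max(actions, key=lambda a: sum(posterior[s] * utility[(a, s)] for s in states))
print(posterior, best)
```

This is the regime where I think bayesian ideas earn their keep: the state space is small enough to write down explicitly. The claim in the first point is that most interesting models of agents don’t look like this.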
I’m a little confused by this one, because in your previous response you say that you think Bob accurately represents Eliezer’s position, and now you seem to be complaining about the opposite?
Instead I read it as something like “some unreasonable percentage of an agent’s actions are random”
This is in fact the intended reading, sorry for ambiguity. Will edit. But note that there are probably very few situations where exploring via actual randomness is best; there will almost always be some type of exploration which is more favourable. So I don’t think this helps.
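As a toy illustration of why exploring via actual randomness is rarely best, here is a sketch comparing uniformly random exploration (epsilon-greedy) with a more directed rule (UCB1) on a made-up two-armed bandit. The setup and all the numbers are invented for illustration; in this particular toy case the directed rule usually ends up with slightly higher average reward.

```python
import math
import random

# Two-armed bandit: arm 0 pays off 40% of the time, arm 1 pays off 60%.
# These payoff rates are made up purely for illustration.

def pull(arm):
    return 1.0 if random.random() < (0.4, 0.6)[arm] else 0.0

def run(choose, steps=10_000):
    counts, totals, reward = [0, 0], [0.0, 0.0], 0.0
    for t in range(1, steps + 1):
        arm = choose(t, counts, totals)
        r = pull(arm)
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward / steps

def eps_greedy(t, counts, totals, eps=0.1):
    # Explore by picking an arm uniformly at random 10% of the time.
    if random.random() < eps or 0 in counts:
        return random.randrange(2)
    return max((0, 1), key=lambda a: totals[a] / counts[a])

def ucb1(t, counts, totals):
    # Directed exploration: prefer arms that are promising or under-sampled.
    for a in (0, 1):
        if counts[a] == 0:
            return a
    return max((0, 1), key=lambda a: totals[a] / counts[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

print("epsilon-greedy:", run(eps_greedy))
print("UCB1:          ", run(ucb1))
```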
We care about utility-maximizers because they’re doing their backwards assignment, using their predictions of the future to guide their present actions to try to shift the future to be more like what they want it to be.
To be pedantic: we care about “consequence-desirability-maximisers” (or in Rohin’s terminology, goal-directed agents) because they do backwards assignment. But I think the pedantry is important, because people substitute utility-maximisers for goal-directed agents, and then reason about those agents by thinking about utility functions, and that just seems incorrect.
And so if I read the original post as “the further a robot’s behavior is from optimal, the less likely it is to demonstrate convergent instrumental goals”
What do you mean by optimal here? The robot’s observed behaviour will be optimal for some utility function, no matter how long you run it.
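To spell out why: for any finite observed trajectory you can trivially write down a utility function which that exact behaviour maximises, e.g. one that rewards only the observed state-action pairs. A minimal sketch (the trajectory and action names are hypothetical):

```python
# Whatever finite behaviour the robot exhibits, there is *some* utility
# function under which that behaviour is optimal: just reward exactly the
# observed state-action pairs. The trajectory below is hypothetical.

observed = [("s0", "left"), ("s1", "wait"), ("s2", "left")]

def utility(state, action):
    # 1 for doing exactly what was observed in that state, 0 otherwise.
    return 1.0 if (state, action) in observed else 0.0

# By construction, the observed behaviour is optimal under this utility:
for state, action in observed:
    assert utility(state, action) >= max(
        utility(state, a) for a in ["left", "right", "wait"]
    )
```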
The very issue in question here is what this set of tools tells us about the track record of the machine. It could be uninformative because there are lots of other things that come from the machine that we are ignoring. Or it could be uninformative because they didn’t actually come from the machine, and the link between them was constructed post-hoc.
I agree that I am not critiquing “Bayesianism to the rest of the world”, but rather a certain philosophical position that I see as common amongst people reading this site. For example, I interpret Eliezer as defending that position here (note that the first paragraph is sarcastic):
Clearly, then, a Carnot engine is a useless tool for building a real-world car. The second law of thermodynamics, obviously, is not applicable here. It’s too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics—use whatever works.
This is the sort of confusion that I think reigns over they who still cling to the Old Ways.
No, you can’t always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn’t mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation—and fails to the extent that it departs.
Also, insofar as AIXI is a “hypothetical general AI that doesn’t understand that its prediction algorithms take time to run”, I think “strawman” is a little inaccurate.
Anyway, thanks for the comment. I’ve updated the first paragraph to make the scope of this essay clearer.
All of the applications in which Bayesian statistics/ML methods work so well. All the robotics/AI/control theory applications where Bayesian methods are used in practice.
This does not really seem like much evidence to me, because for most of these cases non-bayesian methods work much better. I confess I personally am in the “throw a massive neural network at it” camp of machine learning; and certainly if something with so little theoretical validation works so well, it makes one question whether the sort of success you cite really tells us much about bayesianism in general.
All of the psychology/neuroscience research on human intelligence approximating Bayesianism.
I’m less familiar with this literature. Surely human intelligence *as a whole* is not a very good approximation to bayesianism (whatever that means). And it seems like most of the heuristics and biases literature is specifically about how we don’t update very rationally. But at a lower level, I defer to your claim that modules in our brain approximate bayesianism.
Then I guess the question is how to interpret this. It certainly feels like a point in favour of some interpretation of bayesianism as a general framework. But insofar as you’re thinking about an interpretation which is being supported by empirical evidence, it seems important for someone to formulate it in such a way that it could be falsified. I claim that the way bayesianism has been presented around here (as an ideal of rationality) is not a falsifiable framework, and so at the very least we need someone else to make the case for whatever version of it they’re standing behind.
Probably H intends A to achieve a narrow subset of H’s goals, but doesn’t necessarily want A pursuing them in general.
Similarly, if I have an employee, I may intend for them to do some work-related tasks for me, but I probably don’t intend for them to go and look after my parents, even though ensuring my parents are well looked-after is a goal of mine.
We have tons of empirical evidence on this.
What sort of evidence are you referring to? Can you list a few examples?
In general I very much appreciate people reasoning from examples like these. The sarcasm does make me less motivated to engage with this thoroughly, though. Anyway, idk how to come up with general rules for which abstractions are useful and which aren’t. Seems very hard. But when we have no abstractions which are empirically verified to work well in modelling a phenomenon (like intelligence), it’s easy to overestimate how relevant our best mathematics is, because proofs are the only things that look like concrete progress.
On Big-O analysis in particular: this is a pretty interesting example actually, since I don’t think it was obvious in advance that it’d work as well as it has (i.e. that the constants would be fairly unimportant in practice). Need to think more about this one.
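For instance, here is a quick sanity check of the “constants turn out not to matter much” observation, with made-up constant factors: even if the n log n algorithm carries a 50x constant penalty, it overtakes the n² algorithm at fairly modest input sizes.

```python
import math

# Compare an n*log2(n) algorithm with a 50x constant penalty against an n^2
# algorithm with constant 1. The constants are invented for illustration.

def cost_nlogn(n, c=50):
    return c * n * math.log2(n)

def cost_n2(n, c=1):
    return c * n * n

for n in (10, 100, 1_000, 10_000):
    print(n, round(cost_nlogn(n)), cost_n2(n))
# By n = 1,000 the n*log(n) algorithm is already cheaper despite its 50x constant.
```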