I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
I don’t know if you’ve seen this, but https://arxiv.org/abs/1906.11583 is a follow-up that generalizes the Beckers and Halpern paper to a notion of approximate abstraction, measuring the non-commutativity of the diagram with a distance function and taking expectations. I think the most useful notion the paper introduces is the idea of a probability distribution over the set of allowed interventions. Intuitively, you don’t need your abstraction of temperature to behave nicely w.r.t. freezing half the room and burning the other half such that the average kinetic energy balances out. Thus you can determine the “approximate commutativity” of the diagram by fixing a high-level intervention and taking an expectation over the low-level interventions that are likely to map to that high-level intervention.
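To make the “approximate commutativity” idea concrete, here is a rough sketch of the kind of quantity involved (my notation, not necessarily the paper’s): if $\tau$ maps low-level states to high-level states, $\omega$ maps low-level interventions to high-level interventions, $M[\iota]$ denotes a model under intervention $\iota$, and $d$ is the chosen distance, then for a fixed high-level intervention $\iota_H$ you look at something like

$$\mathbb{E}_{\iota_L \sim \Pr(\,\cdot \mid \omega(\iota_L) = \iota_H)}\left[\, d\big(\tau(M_L[\iota_L]),\; M_H[\iota_H]\big)\,\right],$$

i.e. how far, on average, intervening at the low level and then abstracting lands from intervening directly at the high level.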
Also, if you are willing to write up your counterexample to the conjecture that Beckers and Halpern make, I am currently researching under Eberhardt and he (and I) would be extremely interested in seeing it. I also initially thought the conjecture was obviously false, but when I tried to actually construct counterexamples, all of them ended up being either not strong abstractions or not recursive (acyclic) causal models.
Training Regime Day 0: Introduction
Training Regime Day 1: What is applied rationality?
Training Regime Day 2: Searching for bugs
Training Regime Day 3: Tips and Tricks
There is a phenomenon among students of mathematics where things go from being “difficult” to “trivial” as soon as the concepts are grasped. The main reason I don’t comment many of my thoughts is that I figure that since I can think them, they must not be very hard to think, so commenting them is kind of useless. I think this sense that my thoughts aren’t very novel/insightful/good explains nearly all of the times I don’t comment: if I have a thought that I think is non-trivial to think, or I have access to information that I think most people do not have, I will likely comment it (this happens extremely rarely).
However, I agree that people should say more obvious things on the margin.
(I also think that, on the margin, people should compliment other people more. I liked this post and think it addresses an important problem to try to solve.)
Thanks! I mentioned on day 0 that this was one of my training regimes for rationality, but it’s also training for writing and for posting my thinking in public, among other things. I’m glad that people are able to get something out of it.
Training Regime Day 4: Murphyjitsu
Training Regime Day 5: TAPs
In my view, there is no “right level” for bugs. Some bugs are simpler and thus more suited to practicing, but the goal is to get to the point where you can solve even your largest bugs. I’ll provide more prompts for finding larger bugs later on in the sequence.
Thanks for participating!
Training Regime Day 6: Seeking Sense
I’m not sure I understand what you’re saying. I’m not advising people to drop their items in an attempt to discover new uses for them, I’m just describing a time when I accidentally dropped something and discovered that I was using it wrong.
I think part of what I’m saying is that your priors for things working properly should be higher and I should have been more surprised at the difficulty of refilling the pepper grinder. This should have prompted me to search harder for a way to use it more effectively.
Training Regime Day 7: Goal Factoring
Training Regime Day 8: Noticing
I agree that that comment didn’t really add that much. I was just trying to caution against the view that goal factoring was a technique for convincing yourself to take/not take certain actions. I’m not sure whether I should have spent more time discussing that though, because I’m not sure how common such a failure mode is.
Thanks for the style pointer!
Yes. Fixed.
I’m confused about why the inner alignment problem is conceptually different from the outer alignment problem. From a general perspective, we can think of the task of building any AI system as humans trying to optimize their values by searching over some solution space. In this scenario, the programmer becomes the base optimizer and the AI system becomes the mesa-optimizer. The outer alignment problem thus seems like a particular manifestation of the inner alignment problem where the base optimizer is a human.
In particular, if there exists a robust solution to the outer alignment problem, then presumably there’s some property P that we want the AI system to have and some process Pmeta that convinces us that the AI system has property P. I don’t see why we can’t just give the AI system the ability to enact Pmeta to ensure that any optimizers it creates have property P (modulo the problem of ensuring that the system has Pmeta with Pmetameta, ensuring that with Pmetametameta, etc.). I guess you could have a solution to the outer alignment problem by having P and Pmeta without the recursive tower needed to solve the inner alignment problem, but that doesn’t seem like the issue being brought up here. (Something something Löbian obstacle.)
In particular, my view says that if M is the machine learning system and P are the programmers, we can view P as the “machine learning system” and M as a mesa-optimizer. The task of aligning the mesa-objective with the specified loss seems like the same type of problem as aligning the loss function of M with the programmers’ values.
Maybe the important thing is that loss functions are functions and values are not, so the point is that even if we have a function that represents our values, things can still go wrong. That is, people previously thought that finding a function that does what we want when it gets optimized was the main problem, but mesa-optimizer pseudo-alignment shows that even if we have such a function, we can’t just optimize it.
An implication is that all the reasons why mesa-optimizers can cause problems are also reasons why strategies for turning human values into a function can go wrong. For example, value learning strategies seem vulnerable to the same pseudo-alignment problems. Admittedly, I do not have a good understanding of current approaches to value learning, so I am not sure whether this is a real concern. (Assuming the authors of this post are adequate, if a similar concern existed in value learning, I think they would have mentioned it. This suggests either that I am wrong about this being a problem or that no one has given it serious thought. My priors are on the former, but I want to know why I’m wrong.)
I suspect that I’ve failed to understand something fundamental, because it seems like a lot of people who know a lot think this is really important. In general, I think this paper is well written and extremely accessible to someone like me who has only recently started reading about AI safety.