I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
Naively, there are so few people working on interp, and so many people working on capabilities, that publishing is very good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
In general, this post feels like it’s listing a bunch of considerations that are pretty small, and the 1st order consideration is just like “do you want people to know about this interpretability work”, which seems like a relatively straightforward “yes”.
I also separately think that LW tends to reward people for being “capabilities cautious” more than is reasonable, and once you’ve made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don’t matter ex ante.
You’ve written multiple outer alignment failure stories. However, you’ve also commented that these aren’t your best predictions. If you condition on humanity going extinct because of AI, why did it happen?
Part of my uncanny valley was failing to realize that being able to identify a pattern was not sufficient for being able to step outside of it. I got to the point where I developed enough awareness to _notice_ that I was currently trapped inside a bad pattern, but I didn’t have the tools to be able to step outside the pattern.
Examples:
Not being able to go from “I notice my current patterns of work are unsustainable” to making them more sustainable.
Knowing I’m overconfident, but not making myself less confident.
Knowing about the planning fallacy, but pleading my own exceptionalism by placing myself in a tiny reference class.
Knowing that doing things like journaling, exercise, eating better, drinking more water, sleeping more, etc. are good for me, but not being able to actually do them.
As I’ve made progress on many of these things, my sense is that trying to solve your problems _until they are actually solved_ is the cornerstone of applied rationality. Techniques let you think about the problems differently and offer new angles of attack, but there is no substitute for _actually practicing_. I think getting people to actually do the thing is a relatively unsolved problem (for rationality, but also for all of society, so not a _particular_ failing on our part).
There is a phenomenon among students of mathematics where things go from being “difficult” to “trivial” as soon as concepts are grasped. The main reason why I don’t comment many of my thoughts is that I think that since I can think them, they must not be very hard to think, so commenting them is kind of useless. I think me thinking that my thoughts aren’t very novel/insightful/good explains nearly all of the times I don’t comment—if I have a thought I think is non-trivial to think or I have access to information that I think most people do not have access to, I will likely comment it (this happens extremely rarely).
However, I agree that people should say more obvious things on the margin.
(I also think that, on the margin, people should compliment other people more. I liked this post and think it is an important problem to try and solve.)
This is very exciting. Looking forward to the rest of the sequence.
As I was reading, I found myself reframing a lot of things in terms of the rows and columns of the matrix. Here’s my loose attempt to rederive most of the properties under this view.
The world is a set of states. One way to think about these states is by putting them in a matrix, which we call a “cartesian frame.” In this frame, the rows of the matrix are possible “agents” and the columns are possible “environments”.
Note that you don’t have to put all the states in the matrix.
Ensurables are the parts of the world that the agent can always ensure we end up in. Ensurables are the rows of the matrix, closed under supersets.
Preventables are the parts of the world that the agent can always ensure we don’t end up in. Preventables are the complements of the rows, closed under subsets.
Controllables are parts of the world that are both ensurable and preventable. Controllables are rows (or sets of rows) for which there exist rows that are disjoint. [edit: previous definition of “contains elements not found in other rows” was wrong, see comment by crabman]
Observables are parts of the environment that the agent can observe and condition its actions on. Observables are columns such that for every pair of rows there is a third row that equals the 1st row if the environment is in that column and the 2nd row otherwise. This means that for every two rows, there’s a third row that’s made by taking the first row and swapping elements with the 2nd row where it intersects with the column.
Observables have to be sets of columns: if they weren’t, you could find a column that is partially inside the observable and partially not. Then you could build a conditional action that says something like “if I am in the observable, then I am not in the observable; if I am not in the observable, I am in the observable”, which is contradictory because the swapping doesn’t work properly.
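To make the matrix picture concrete, here is a minimal Python sketch of these four definitions on a made-up 4x2 “umbrella” frame (my own toy example, not from the original post). It also illustrates the claim below that something can be observable or controllable, but not both:

```python
from itertools import product

# Toy Cartesian frame: rows are agent options, columns are environment options,
# entries are world states. (All names are made up; world states outside the
# matrix are ignored for simplicity.)
ENV = ["sun", "rain"]
AGENT = ["always_u", "never_u", "u_iff_rain", "u_iff_sun"]
MATRIX = {
    ("always_u",   "sun"): "sun+u", ("always_u",   "rain"): "rain+u",
    ("never_u",    "sun"): "sun+b", ("never_u",    "rain"): "rain+b",
    ("u_iff_rain", "sun"): "sun+b", ("u_iff_rain", "rain"): "rain+u",
    ("u_iff_sun",  "sun"): "sun+u", ("u_iff_sun",  "rain"): "rain+b",
}
WORLD = set(MATRIX.values())

def row(a):
    return {MATRIX[(a, e)] for e in ENV}

def ensurable(S):
    """Some row lies entirely inside S (rows, closed under supersets)."""
    return any(row(a) <= S for a in AGENT)

def preventable(S):
    """The complement of S is ensurable."""
    return ensurable(WORLD - S)

def controllable(S):
    return ensurable(S) and preventable(S)

def observable(S):
    """S is a union of columns, and for every pair of rows (a0, a1) there is a
    row agreeing with a0 on S's columns and with a1 on the other columns."""
    cols = {e for e in ENV if {MATRIX[(a, e)] for a in AGENT} <= S}
    if {MATRIX[(a, e)] for a in AGENT for e in cols} != S:
        return False  # S is not exactly a set of columns
    for a0, a1 in product(AGENT, repeat=2):
        target = {e: MATRIX[(a0, e)] if e in cols else MATRIX[(a1, e)] for e in ENV}
        if not any(all(MATRIX[(a, e)] == t for e, t in target.items()) for a in AGENT):
            return False
    return True

rain = {"rain+u", "rain+b"}   # "it is raining": one column of the matrix
carry = {"sun+u", "rain+u"}   # "I carry an umbrella": one row of the matrix
print(observable(rain), controllable(rain))    # True False
print(observable(carry), controllable(carry))  # False True
```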
Observables are closed under boolean combination (note it’s sufficient to show closure under complement and unions):
Since swapping index 1 of a row is the same as swapping all non-1 indexes, observables are closed under complements.
Since you can swap indexes 1 and 2 by first swapping index 1, then swapping index 2, observables are closed under union.
This is equivalent to saying that “if A or B, then a0, else a2” is logically equivalent to “if A, then a0, else (if B, then a0, else a2)”.
Since controllables are rows with specific properties and observables are columns with specific properties, nothing can be both controllable and observable. (The only possibility is the entire matrix, which is trivially not controllable because it’s not preventable.)
This assumes that the matrix has at least one column
The image of a cartesian frame is the actual matrix part.
Since an ensurable is a row (or a superset of one) and an observable is a column (or set of columns), anything that is both ensurable and observable must contain every column, so it must be the whole matrix (image).
If the matrix has 1 or 0 rows, then the observable constraint is trivially satisfied, so the observables are all possible sets of (possible) environment states (since 0/1 length columns are the same as states).
“0 rows” doesn’t quite make sense, but just pretend that you can have a 0 row matrix which is just a set of world states.
If the matrix has 0 columns, then the ensurable/preventable constraint is trivially satisfied, so the ensurables are the same as the preventables are the same as the controllables, which are all possible sets of (possible) environment states (since “length 0” rows are the same as states).
“0 columns” doesn’t make that much sense either, but pretend that you can have a 0 column matrix which is just a set of world states.
If the matrix has exactly 1 column, then the ensurable/preventable constraint is trivially satisfied for states in the image (matrix), so the ensurables are all non-empty sets of states in the matrix (since length 1 columns are the same as states), closed under union with states outside the matrix. It should be easy to see that controllables are all possible sets of states that intersect the matrix non-trivially, closed under union with states outside the matrix.
The S. prior is a general-purpose prior which we can apply to any problem. The output string has no meaning except in a particular application and representation, so it seems senseless to try to influence the prior for a string when you don’t know how that string will be interpreted.
The claim is that consequentialists in simulated universes will reason about decisions that are made based on the Solomonoff prior, so they will know how that string will be interpreted.
Can you give an instance of an application of the S. prior in which, if everything you wrote were correct, it would matter?
Any decision that controls substantial resource allocation will do. For example, if we’re evaluating the impact of running various programs, blowing up planets, interfering with alien life, etc.
Also in the category of “it’s a feature, not a bug” is that, if you want your values to be right, and there’s a way of learning the values of agents in many possible universes, you ought to try to figure out what their values are, and update towards them. This argument implies that you can get that for free by using Solomonoff priors.
If you are a moral realist, this does seem like a possible feature of the Solomonoff prior.
Third, what do you mean by “the output” of a program that simulates a universe?
A TM that simulates a universe must also specify an output channel.
Take your example of Life—is the output a raster scan of the 2D bit array left when the universe goes static? In that case, agents have little control over the terminal state of their universe (and also, in the case of Life, the string will be either almost entirely zeroes, or almost entirely 1s, and those both already have huge Solomonoff priors). Or is it the concatenation of all of the states it goes through, from start to finish?
All of the above. We are running all possible TMs, so all computable universes will be paired with all computable output channels. It’s just a question of complexity.
Are you imagining that bits are never output unless the accidentally-simulated aliens choose to output a bit? I can’t imagine any way that could happen, at least not if the universe is specified with a short instruction string.
No.
This brings us to the 4th problem: It makes little sense to me to worry about averaging in outputs from even mere planetary simulations if your computer is just the size of a planet, because it won’t even have enough memory to read in a single output string from most such simulations.
I agree that approximating the Solomonoff prior is difficult and thus its malignancy probably doesn’t matter in practice. I do think similar arguments apply to cases that do matter.
5th, you can weigh each program’s output proportional to 2^-T, where T is the number of steps it takes the TM to terminate. You’ve got to do something like that anyway, because you can’t run TMs to completion one after another; you’ve got to do something like take a large random sample of TMs and iteratively run each one step. Problem solved.
See the section on the Speed prior.
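For concreteness, here is a toy sketch of the kind of runtime-penalized weighting proposed in the parent comment. The “programs” are just made-up (length, runtime, output) records rather than real Turing machines, and the exact 2^-T penalty is the parent’s proposal rather than the standard speed prior:

```python
# Weight each program's output by 2^-(description length) as usual, times an extra
# 2^-(runtime) penalty for slow programs, then normalize over outputs.
programs = [
    {"length": 3, "runtime": 10, "output": "0101"},
    {"length": 5, "runtime": 4,  "output": "1111"},
    {"length": 4, "runtime": 50, "output": "0101"},
]

def weight(p):
    return 2.0 ** (-p["length"]) * 2.0 ** (-p["runtime"])

prior = {}
for p in programs:
    prior[p["output"]] = prior.get(p["output"], 0.0) + weight(p)

total = sum(prior.values())
for output, w in sorted(prior.items()):
    print(output, w / total)
```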
Perhaps the biggest problem is that you’re talking about an entire universe of intelligent agents conspiring to change the “output string” of the TM that they’re running in. This requires them to realize that they’re running in a simulation, and that the output string they’re trying to influence won’t even be looked at until they’re all dead and gone. That doesn’t seem to give them much motivation to devote their entire civilization to twiddling bits in their universe’s final output in order to shift our priors infinitesimally. And if it did, the more likely outcome would be an intergalactic war over what string to output.
They don’t have to realize they’re in a simulation, they just have to realize their universe is computable. Consequentialists care about their values after they’re dead. The cost of influencing the prior might not be that high because they only have to compute it once and the benefit might be enormous. Exponential decay + acausal trade make an intergalactic war unlikely.
Flagging that I don’t think your description of what ELK is trying to do is that accurate, e.g. we explicitly don’t think that you can rely on using ELK to ask your AI if it’s being deceptive, because it might just not know. In general, we’re currently quite comfortable with not understanding a lot of what our AI is “thinking”, as long as we can get answers to a particular set of “narrow” questions we think is sufficient to determine how good the consequences of an action are. More in “Narrow” elicitation and why it might be sufficient.
Separately, I think that ELK isn’t intended to address the problem you refer to as a “sharp-left turn” as I understand it. Vaguely, ELK is intended to be an ingredient in an outer-alignment solution, while it seems like the problem you describe falls roughly into the “inner alignment” camp. More specifically, but still at a high-level of gloss, the way I currently see things is:
If you want to train a powerful AI, currently the set of tasks you can train your AI on will, by default, result in your AI murdering you.
Because we currently cannot teach our AIs to be powerful by doing anything except rewarding them for doing things that straightforwardly imply that they should disempower humans, you don’t need a “sharp left turn” in order for humanity to end up disempowered.
Given this, it seems like there’s still a substantial part of the difficulty of alignment that remains to be solved even if we knew how to cope with the “sharp left turn.” That is, even if capabilities were continuous in SGD steps, training powerful AIs would still result in catastrophe.
ELK is intended to be an ingredient in tackling this difficulty, which has been traditionally referred to as “outer alignment.”
Even more separately, it currently seems to me like it’s very hard to work on the problem you describe while treating other components [like your loss function] as a black box, because my guess is that “outer alignment” solutions need to do non-trivial amounts of “reaching inside the model’s head” to be plausible, and a lot of how to ensure capabilities and alignment generalize together is going to depend on details about how you would have prevented it from murdering you in the [capabilities continuous with SGD] world.
ELK for learned optimizers has some more details.
From my perspective, ELK is currently very much “a problem we don’t know how to solve, where we think rapid progress is being made (as we’re still building out the example-counterexample graph, and are optimistic that we’ll find an example without counterexamples)”. There’s some question of what “rapid” means, but I think we’re on track for what we wrote in the ELK doc: “we’re optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard.”
We’ve spent ~9 months on the problem so far, so it feels like we’ve mostly ruled out it being an easy problem that can be solved with a “simple trick”, but it very much doesn’t feel like we’ve hit on anything like a core obstruction. I think we still have multiple threads that are still live and that we’re still learning things about the problem as we try to pull on those threads.
I’m still pretty interested in aiming for a solution to the entire problem (in the worst case), which I currently think is still plausible (maybe 1/3rd chance?). I don’t think we’re likely to relax the problem until we find a counterexample that seems like a fundamental reason why the original problem wasn’t possible. Another way of saying this is that we’re working on ELK because of a set of core intuitions about why it ought to be possible and we’ll probably keep working on it until those core intuitions have been shown to be flawed (or we’ve been chugging away for a long time without any tangible progress).
One of the main reasons I expect this to not work is that optimization algorithms that are the best at optimizing some objective given a fixed compute budget seem like they basically can’t be generally retargetable. E.g. if you consider something like Stockfish, it’s a combination of search (which is retargetable) sped up by a series of very specialized heuristics that only work for winning. If you wanted to retarget Stockfish to “maximize the max number of pawns you ever have”, you would not be able to use the [specialized for telling whether a move is likely going to win the game] heuristics to speed up your search for moves. A more extreme example: the entire endgame table is useless to you, and you probably have to recompute the entire thing.
Something like [the strategy stealing assumption](https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334) is needed to even get the existence of a set of heuristics for “maximize the max number of pawns you ever have” that is just as good at speeding up the search for moves as the [telling whether a move will win the game] heuristics. Actually finding that set of heuristics is probably going to require an entirely parallel learning process.
This also implies that even if your AI has the concept of “human values” in its ontology, you still have to do a bunch of work to get an AI that can actually estimate the long-run consequences of any action on “human values”, or else it won’t be competitive with AIs that have more specialized optimization algorithms.
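A minimal sketch of the “retargetable search, sped up by objective-specific heuristics” picture (the toy “game”, moves, and objectives below are all made up for illustration):

```python
# Plain depth-limited exhaustive search: fully generic, i.e. retargetable to any
# objective just by swapping out `evaluate`.
MOVES = {"inc": lambda s: s + 1, "dec": lambda s: s - 1, "dbl": lambda s: s * 2}

def search(state, depth, evaluate):
    if depth == 0:
        return evaluate(state), []
    best_score, best_plan = float("-inf"), []
    for name, move in MOVES.items():
        score, plan = search(move(state), depth - 1, evaluate)
        if score > best_score:
            best_score, best_plan = score, [name] + plan
    return best_score, best_plan

# Two different objectives. The generic search handles both, but the *speedups* a
# real engine relies on (move ordering, pruning rules, endgame tables) are engineered
# or learned around one objective and generally don't transfer to the other.
objective_a = lambda s: -abs(s - 100)  # "get as close to 100 as possible"
objective_b = lambda s: -(s % 7)       # "end on a multiple of 7"

print(search(3, depth=6, evaluate=objective_a))
print(search(3, depth=6, evaluate=objective_b))
```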
Here are some possibilities:
great minds might not think alike
untranslated thinking sounds untrustworthy
disagreement as a lack of translation
At the January 2020 CFAR workshop we played this game extensively, although we played with 8+ people instead of the recommended 4. It took a while to get past even level 1 because of the amount of synchronization required. We also didn’t play with stars or lives.
I can second the feeling of close calls being amazing. To quote someone I played with, “this is the most excited I’ve ever felt.”
Here are some haiku-ish things that I wrote that attempted to capture the experience of playing.
Someone has blue eyes. 100 days pass. I have mud on my forehead.
The absence of counting. A groan of disappointment. The jabber didn’t jab.
Moments accumulate. A flurry of plays. The sound of a jab.
Arms creeping forward. Sequentially numbered cards. A hum of satisfaction.
A standoff begins. Arms creeping forward. Signalling intensifies.
Counting begins. Moments properly measured. Victory condenses.
CFAR workshop survey. The mind is a good game. There were also classes.
My model is something like:
For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an “effective compute regime” where dumping more compute makes them better. If you go above this regime, you get steep diminishing marginal returns.
In the (relatively small) regimes of old algorithms, new algorithms and old algorithms perform similarly. E.g. with small amounts of compute, using AlphaGo instead of alpha-beta pruning doesn’t get you much more than something like an extra OOM of compute would (I have no idea if this is literally true; the example is more because it conveys the general gist).
One of the main ways that modern algorithms are better is that they have much larger effective compute regimes. The other main way is enabling more efficient conversion of compute to performance.
Therefore, one of the primary impacts of new algorithms is to enable performance to continue scaling with compute the same way it did when you had smaller amounts.
In this model, it makes sense to think of the “contribution” of new algorithms as the factor by which they enable more efficient conversion of compute to performance, and to count the increased performance that comes from the new algorithms being able to absorb more compute as primarily hardware progress. I think the studies that Carl cites above are decent evidence that the multiplicative factor of compute → performance conversion you get from new algorithms is smaller than the historical growth in compute, so it further makes sense to claim that most progress came from compute, even though the algorithms were what “unlocked” the compute.
For an example of something I consider to support this model, see the LSTM versus transformer graphs in https://arxiv.org/pdf/2001.08361.pdf
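Here’s a toy numerical version of this model; every number below is made up, and the log-scaling and hard regime cap are deliberately crude simplifications:

```python
import math

# Each algorithm is modeled as (a) a multiplicative compute-efficiency factor and
# (b) a cap on how much compute it can usefully absorb (its "effective compute regime").
def performance(compute, efficiency, regime_cap):
    # performance scales with log(compute) inside the regime, then flatlines
    return math.log10(min(efficiency * compute, regime_cap))

old_algo = dict(efficiency=1.0,  regime_cap=1e8)   # "alpha-beta-ish"
new_algo = dict(efficiency=10.0, regime_cap=1e16)  # "AlphaGo-ish": ~1 OOM more efficient,
                                                   # but a vastly larger effective regime

for compute in [1e4, 1e6, 1e10, 1e14]:
    print(f"{compute:.0e}: old={performance(compute, **old_algo):.1f} "
          f"new={performance(compute, **new_algo):.1f}")

# At small compute, the gap is just the ~1 OOM efficiency factor; at large compute,
# the old algorithm has saturated, so most of the gap is compute the new algorithm "unlocked".
```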
A human given finite time to think also only performs O(1) computation, and thus cannot “solve computationally hard problems”.
One of my similar tools is trying to avoid keeping my phone in my pocket. Using my phone is a fine thing to do, but having my default state be “can use my phone within 5 seconds” is generally distracting and causes more phone use than necessary. This is why I own an iPod touch: I need access to my calendar/todo-list at all times, but don’t want to keep my phone on me.
One way of looking at DDT is “keeping it dumb in various ways.” I think another way of thinking about it is just designing a different sort of agent, which is “dumb” according to us but not really dumb in an intrinsic sense. You can imagine this DDT agent looking at agents that do acausal trade and thinking they’re just sacrificing utility for no reason.
There is some slight awkwardness in that the decision problems agents in this universe actually encounter mean that UDT agents will get higher utility than DDT agents.
I agree that the maximum a posteriori world doesn’t help that much, but I think there is some sense in which “having uncertainty” might be undesirable.
I don’t really want to argue about language. I’ll defend “almost no individual has a pretty substantial effect on capabilities.” I think publishing norms could have a pretty substantial effect on capabilities, and also a pretty substantial effect on interpretability, and currently think the norms suggested have a tradeoff that’s bad-on-net for x-risk.
Chris Olah’s interpretability work is one of the most commonly used resources in graduate and undergraduate ML classes, so people clearly think it helps you get better at ML engineering
I think this is false, and that most ML classes are not about making people good at ML engineering. I think Olah’s stuff is disproportionately represented because it’s interesting and is presented well, and also because classes really love seeming “rigorous” or something in ways that are kind of arbitrary. Similarly, proofs of the correctness of backprop are probably common in ML classes, but not that relevant to being a good ML engineer.
I also bet that if we were to run a survey on what blogposts and papers top ML people would recommend that others should read to become better ML engineers, you would find a decent number of Chris Olah’s publications in the top 10 and top 100.
I would be surprised if lots of ML engineers thought that Olah’s work was in the top 10 best things to read to become a better ML engineer. I have weaker beliefs about the top 100. I would take even odds (and believe something closer to 4:1 or whatever) that if you surveyed good ML engineers and asked for top-10 lists, not a single Olah interpretability piece would be among the top 10 most mentioned things. I think most of the stuff would be random things about e.g. debugging workflow, how to deal with computers, how to use libraries effectively, etc. If anyone is good at ML engineering and wants to chime in, that would be neat.
I don’t understand why we should have a prior that interpretability research is inherently safer than other types of ML research?
Idk, I have the same prior about trying to e.g. prove various facts about ML stuff, or do statistical learning theory type things, or a bunch of other stuff. It’s just like, if you’re not trying to eke out more oomph from SGD, then probably the stuff you’re doing isn’t going to allow you to eke out more oomph from SGD, because it’s kinda hard to do that and people are trying many things.
I want to flag that HCH was never intended to simulate a long reflection. Its main purpose (which it fails to achieve in the worst case) is to let humans be epistemically competitive with the systems you’re trying to train.
You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn’t the particle that’s randomly perturbed.
My guess is that a random 1 angstrom perturbation is enough so that p0’s location after 20s is ~uniform. This question seems easier to answer, and I wouldn’t really be surprised if the answer is no?
Here’s a really rough estimate: This says the collision rate is ~10^{10} s^{-1}, so 3s after the start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more collisions, each of which adds ~1 angstrom of uncertainty to p0. 1 angstrom is 10^{-10}m, so the total uncertainty is on the order of 10m, which means it’s probably uniform? This actually came out closer than I thought it would be, so now I’m less certain that it’s uniform.
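Spelling out that arithmetic explicitly (same numbers as above, treating the per-collision uncertainties as adding linearly):

$$\underbrace{17\,\mathrm{s} \times 10^{10}\,\mathrm{s}^{-1}}_{\approx 1.7\times 10^{11}\ \text{collisions}} \times \underbrace{10^{-10}\,\mathrm{m}}_{\sim 1\ \text{angstrom each}} \approx 17\,\mathrm{m},$$

i.e. on the order of 10m.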
This is a slightly different question than the total # of particles on each side, but intuitively it becomes much harder to predict the # of particles if you have to make your prediction via higher-order effects, which will probably be smaller.
My current taxonomy of rationalists is:
LW rationalists (HI!)
Facebook rationalists
Twitter rationalists
Blog rationalists
Internet-invisible rationalists
Are there other types of rationalists? Maybe like group-chat rationalists? or podcast rationalists? google doc rationalists?
Here’s a conversation that I think is vaguely analogous:
Alice: Suppose we had a one-way function, then we could make passwords better by...
Bob: What do you want your system to do?
Alice: Well, I want passwords to be more robust to...
Bob: Don’t tell me about the mechanics of the system. Tell me what you want the system to do.
Alice: I want people to be able to authenticate their identity more securely?
Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?
Alice: IDK, I just think the world is likely to be generically a better place if we can better authenticate users.
Bob: Oh OK, we’re just going to create this user authentication technology and hope people use it for good?
Alice: Yes? And that seems totally reasonable?
It seems to me like you don’t actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.
This is related to complaints I have with [pivotal-act based] framings, but probably that’s a longer post.