Epistemologist specialized in the difficulties of alignment. Currently at Conjecture (https://www.conjecture.dev/), and running Refine (https://www.lesswrong.com/posts/D7epkkJb3CqDTYgX9/refine-an-incubator-for-conceptual-alignment-research-bets).
adamShimi (Adam Shimi)
The subtlety I really want to point out here is that the choice is not necessarily “make a precise forecast” or “not make any forecast at all”. Notably, the precise forecasts that you can generally write down or put on a website are limited to distributions that you can compute decently well and that have well-defined properties. If you arrive at a distribution that is particularly hard to compute, it can still tell you qualitative things (the kind of predictions Eliezer actually makes) without you being able to honestly extract a precise prediction.
In such a situation, making a precise prediction is the same as taking one element of a set of solutions for an equation and labelling it “the” solution.
(If you want to read more about Eliezer’s model, I recommend this paper)
Thanks for the answer!
That is an interesting perspective to consider: the trade-off that you could be reducing the amount of time people spend learning even if the learning is more effective! A quick back-of-the-napkin calculation says that even if it drastically reduces the amount you read, it’s still worthwhile, as long as the reduction in reading is smaller than the retention gain from beating the forgetting curve!
Say you normally read 10 hours/week, then you start using SRS and it drops down to 5 hours/week. But you remember 10x the amount of what you would have previously remembered. Thus it ends up being the equivalent of reading 50 hours/week.

I would say that it depends on what you want out of your reading. Most of the time I’m reading to extend my breadth, and so partial memories are completely fine, and covering more ground matters more. It would be different if I was studying a new maths subfield in detail, for example.
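To spell out the break-even condition behind that back-of-the-napkin calculation (my phrasing of it):

$$\text{effective hours} = \text{hours read} \times \text{retention multiplier}, \qquad 5 \times 10 = 50 > 10 \times 1 = 10,$$

so switching wins as long as the retention multiplier is larger than the factor by which reading hours drop.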
Thanks for trying to make the issue more concrete and provide a way to discuss it!
One thing I want to point out is that you don’t really need to put the non-constrained variables in the worst possible state; you just have the degree of freedom to set them to whatever helps you and is not too hard to reach.
Using sets: you have a set of worlds you want, and a proxy that is a superset of it (because you’re not able to aim exactly at what you want). The problem is that the AI is optimizing to get into the superset with high guarantees and stay there, and so it’s probably aiming for the easiest part of the set to reach and stay in (subject to the accessibility constraints that you mention). This is what should lead to instrumental convergence, and it’s the real issue with proxies IMO.
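In symbols, a loose formalization of this (the notation is mine):

$$W_{\text{want}} \subsetneq W_{\text{proxy}}, \qquad w^{*} \in \operatorname*{arg\,min}_{w \in W_{\text{proxy}}} \text{cost}(w),$$

and nothing forces $w^{*}$ to land in $W_{\text{want}}$: the cheapest-to-reach-and-maintain part of the proxy set is exactly where the optimization pressure points.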
It doesn’t seem obvious to me how this race will go by default; in fact, the likely trajectories seem to depend on lots of empirical facts about the world that I don’t have strong views on.
Let me propose another framing: there are fewer possible worlds in which the curves are “nice”. The good case is more specific, more constrained, and thus there are more ways for things to go wrong. This doesn’t mean things will definitely go wrong, or that there’s no argument that could convince us that the situation will be good by default. Just that the burden of proof is on showing that the good but less numerous worlds are somehow privileged by Reality.
Really appreciate you going to the effort of this literature analysis! Especially because you expected to get a lot of irrelevant stuff but still went looking for productive mistakes!
This is a valuable list of learning techniques and advice. That being said, my experience has been that the systems for implementing such techniques, and the friction they add to habits, can matter significantly more than any individual technique. I read, and get useful knowledge out of my reading, orders of magnitude more when I don’t track my reading or don’t have to make spaced repetition cards, because having to take constant notes, or to do time-consuming and effortful tracking, reduces by an order of magnitude how much I read.
Agreed. Similarly, I can’t get away with not paying attention when reading, or I’ll have to reread; but I get away with that regularly while listening to podcasts.
In defence of attempting unnatural or extreme strategies
As a response to Rob’s post, this sounds to me like it’s misunderstanding what he was pointing out. I don’t think he was pointing at weird ideas; I’m pretty sure he knows very well we need new ideas to solve alignment. What he was pointing at was people who are so panicked that they start discussing literal crimes, or massive governance interventions whose effects are incredibly hard to predict.
Thanks for this much needed post!
I agree wholeheartedly with you, and that’s the mindset I have and try to spread when I speak to terrified and panicked people about doing alignment research.
Here are a couple of thoughts that I think might complement this post:
A big part of flailing, in my model, comes from having hope that someone will save you. As such, realizing that no one will save you is important for actually taking action and doing things. But there are ways of pushing this too far, notably thinking that because no one will save you, no one else is doing anything valuable or can help. One doesn’t have to resolve the tension between “No one will save me” and “I can’t do it all by myself” with “let’s do it all by myself”. Instead, you can see what is needed that no one else seems able to contribute, and trust and motivate others to take on the other crucial and necessary tasks.
I want to point out a general pattern in reactions to extreme and dire problems: you become more greedy, in the sense of a greedy algorithm. You only want solutions that work right now, or you instantly go looking for something else. Yet the history of science and technology tells us that scientific progress and problem solving rarely happen by being right from the start, but rather through a succession of productive mistakes. So I want to remind people that another option might be to make more productive mistakes faster, and to capitalize on them better and faster.
An “extreme” key doesn’t necessarily open an “extreme” lock. A dire-sounding key doesn’t necessarily open a dire-feeling lock. A fearful or angry key doesn’t necessarily open a lock that makes you want to express fear or anger.
Rather, the lock’s exact physical properties determine which exact key (or set of keys) opens it, and we need to investigate the physical world in order to find the right key.
I really like this, and will share this quote when I want a nice phrasing of this thought I keep having these days.
Great post!
For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (… or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)
This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent. If I’m a reinforcement learner optimizing for some reward I’ll receive later, that later reward is still typically far away from my current actions. My actions impact the reward via some complicated causal path through the environment, acting through many intermediate layers.
So we’ve ruled out agents just “optimizing” their own actions. How does this solve the other two problems?
I feel like this is assuming away one of the crucial difficulties of ascribing agency and goal-directedness: lack of competence or non-optimality might make agentic behavior look non-agentic unless you already have a mechanistic interpretation. Separating a rock from a human is not really the problem; it’s more like separating something that acts like a chimp, but for which you have very little data and understanding, from an agent optimizing to turn you into paperclips.
(Not saying that this can’t be relevant to address this problem, just that currently you seem to assume the problem away)
Because the agent only interacts with the far away things-it’s-optimizing via a relatively-small summary, it’s natural to define the “actions” and “observations” as the contents of the summary flowing in either direction, rather than all the low-level interactions flowing through the agent’s supposed “Cartesian boundary”. That solves the microscopic interactions problem: all the random bumping between my hair/skin and air molecules mostly doesn’t impact things far away, except via a few summary variables like temperature and pressure.
Hmm. I like the idea of redefining actions as the consequences of one’s actions that are observable “far away”; it nicely rederives the observation-action loop through interaction with far away variables. That being said, I’m unsure whether defining the observations as the summary statistics themselves is problematic. One intuition tells me that this is all you can observe anyway, so it’s fine; on the other hand, it looks like you’re assuming that the agent already has the right ontology? I guess that can be solved by saying that the observations bear on the content of the summary, but not necessarily all of it.
When Adam Shimi first suggested to me a couple years ago that “optimization far away” might be important somehow, one counterargument I raised was dynamic programming (DP): if the agent is optimizing an expected utility function over something far away, then we can use DP to propagate the expected utility function back through the intermediate layers to find an equivalent utility function over the agent’s actions:
This isn’t actually a problem, though. It says that optimization far away is equivalent to some optimization nearby. But the reverse does not necessarily hold: optimization nearby is not necessarily equivalent to some optimization far away. This makes sense: optimization nearby is a trivial condition which matches basically any system, and therefore will match the interesting cases as well as the uninteresting cases.
I think I actually remember the discussion we were having now, and I recall an intuition about counting. Like, there seem to be more ways to optimize nearby than to optimize some specific far-away part, which I guess is what you’re pointing at.
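For reference, here is roughly the dynamic-programming propagation being referred to, as I understand it (my sketch): with layers $A \to X_1 \to \dots \to X_k$ going from actions to the far-away variables, and a utility $U(X_k)$ over the far-away layer,

$$u_k(x_k) = U(x_k), \qquad u_j(x_j) = \mathbb{E}[\,u_{j+1}(X_{j+1}) \mid X_j = x_j\,], \qquad u_0(a) = \mathbb{E}[\,u_1(X_1) \mid A = a\,],$$

so maximizing $u_0$ over actions is equivalent to maximizing expected $U$ far away. That is the direction granted above; the converse is the one that fails, and the counting intuition is about how many more nearby objectives there are than ones obtained this way.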
I approximately followed the technical discussion, and now I’m wondering what this would buy us if you are correct.
Max entropy distributions seem nicely behaved and well-studied, so maybe we get some computations, properties, and derivations for free? (Basically applying a productive frame to the problem of abstraction.)
It would reduce computing the influence of the summary statistics on the model to computing the constraints, as I’m guessing that this is the hard part in computing the max entropy distribution (?)
Are these correct, and what am I missing?
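For concreteness, the general maximum-entropy form I have in mind when asking these questions (under expectation constraints; the notation is mine, not from the post):

$$p(x) \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big), \qquad \lambda_i \text{ chosen so that } \mathbb{E}_p[f_i(X)] = c_i,$$

where the $f_i$ would play the role of the summary statistics; finding the $\lambda_i$ (equivalently, the normalizer) is usually the computationally hard part, which is what my second question is gesturing at.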
Thanks for the post and for expressing your opinion!
That being said, I feel like there is a misunderstanding here. Daniel mentioned it in another comment thread, but I don’t think Eliezer claims what you’re attributing to him, nor that your analogy with financial pundits works in this context.
My model of Eliezer, based on reading a lot of his posts (old and new) and one conversation, is that he’s dunking on Metaculus and forecasters for a combination of two epistemic sins:
Taking a long time to update on available information
Basically, you shouldn’t take so long to update on the risk from AI, the accelerating pace, and the power of scaling. I don’t think Eliezer is perfect on this, but he can definitely claim that he thought about and invested himself in AI risks literally decades before any Metaculus forecaster even thought about the topic. This is actually a testable claim: that forecasts end up trailing things Eliezer said, arriving at them 10 years later.
Making a precise prediction when you don’t have the information
I feel like there’s been a lot of misunderstanding about why Eliezer doesn’t want to give timeline predictions, when he has said it repeatedly: he thinks there are just not enough bits of evidence for making a precise prediction. There is enough evidence to be pessimistic, and to realize we’re running out of time, but I think he would see giving a precise year as a strong epistemic sin. Realize when you have very little evidence, instead of inventing some to make your forecast more concrete.[1]
As for the financial pundit example, there’s a massive disanalogy: it’s easy to predict that there will be a crash. Everybody does it, we have past examples to generalize from, and models and theories accepted by a lot of people for why crashes might be inevitable. On the other hand, when Eliezer started talking about AI Risks and investing himself fully in them, nobody gave a shit about it or took it seriously. This was not an obvious prediction that everyone was making, and he gave far more details than just saying “AI Risks, man”.
Note that I’m not saying that Eliezer has a perfect track record or that you shouldn’t criticize him. On the first point, I feel like he had a massive miss with GPT-like models, which are incoherent with the models of intelligence and agency that Eliezer used in the sequences and at MIRI; that’s a strong failed prediction for me, a qualitative unknown unknown that was missed. And on the second point, I’m definitely for more productive debate around alignment and Eliezer’s position.
I just wanted to point out ways in which your post seemed to discuss a strawman, which I don’t think was your intention.
Thanks for the post!
So if I understand correctly, your result aims at letting us estimate the dimensionality of the solution basins based on the gradients of the training examples at my local minimum/final model? Like, I just have to train my model, then compute the Hessian/behavior gradients, and I would (if everything you’re looking at works as intended) have a lot of information about the dimensionality of the basin (and I guess the modularity is what you’re aiming at here)? That would be pretty nice.
What other applications do you see for this result?
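To check that I understand the intended workflow, here is a rough numerical toy of what I have in mind (my own hypothetical sketch, not code from the post): stack per-example behavior gradients into a matrix and read the basin’s co-dimension off its rank.

```python
# Toy sketch (hypothetical): estimate the dimension of a behavior manifold by
# stacking per-example gradients of the model's output w.r.t. the parameters
# and counting independent directions via the matrix rank.
import numpy as np

def toy_model(params, x):
    # small made-up 1-hidden-layer scalar model: params = [w1 (3,), w2 (3,)]
    w1, w2 = params[:3], params[3:]
    return float(w2 @ np.tanh(w1 * x))

def behavior_jacobian(params, inputs, eps=1e-5):
    # finite-difference gradient of each training example's output w.r.t. params
    rows = []
    for x in inputs:
        grad = np.zeros_like(params)
        for i in range(len(params)):
            bump = np.zeros_like(params)
            bump[i] = eps
            grad[i] = (toy_model(params + bump, x) - toy_model(params - bump, x)) / (2 * eps)
        rows.append(grad)
    return np.array(rows)  # shape: (n_examples, n_params)

rng = np.random.default_rng(0)
params = rng.normal(size=6)
inputs = rng.normal(size=10)
J = behavior_jacobian(params, inputs)
rank = np.linalg.matrix_rank(J, tol=1e-6)
print("constrained directions:", rank, "| estimated basin dimension:", len(params) - rank)
```

If something along these lines (with exact gradients and the Hessian machinery from the post instead of finite differences) is the intended use, that does seem pretty practical.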
Each plane here is an n-1 dimensional manifold, where every model on that plane has the same output on input 1. They slice parameter space into n-1 dimensional regions. Each of these regions is an equivalence class of functions, which all behave about the same on input 1.
Are the 1-contours always connected? Is it something like being able to continuously vary the parameters while keeping the same output? Based on your illustration it would seem so, but it’s not obvious to me that you can always interpolate in model space between models with the same behavior.
However, if the contours are parallel:
Now the behavior manifolds are planes, running parallel to the contours. So we see here that parallel contours allow behavioral manifolds to have .
I’m geometrically confused here: if the contours are parallel, then aren’t the behavior manifolds made by their intersection empty?
I see at least two problems with your argument:
There’s an assumption that you need a single agent to lead to existential risk. This is not the case: many of the scenarios that have been explored require only competent and autonomous service-like AIs, or foundation models. Like, CAIS is a model of intelligence explosion and it has existential-risk-type failure modes too.
There’s an assumption that just because the non-AGI models are useful, labs will stop pursuing AGI. Yet this is visibly false, as the meme of AGI is running around and there are multiple labs explicitly pushing for AGI and getting the financial leeway to do it.
More generally, this post has the typical problem of “here is a scenario that looks plausible and would be nice, so there’s no need to worry”. Sure, maybe this is the actual scenario that will come to pass, and maybe it’s possible to argue for it convincingly. But you should require one damn strong argument before pushing people to not even work on dealing with the many other possible worlds where things go horribly wrong.
I would caveat that a decent fraction of alignment researchers have pessimistic takes, though I agree this is not a consensus for the whole field. So there’s far from a consensus on optimistic takes either (which I don’t think you were claiming, but that is one way your message can be interpreted).
On similarity of issues with other schemes
Other issues include defining the utility function generating process, ensuring that it has a well-defined limit, and ensuring that that limit is the function we intend. However, these issues are probably less significant by comparison, since other methods for defining AGI utility functions usually have similar issues.
Actually, I would say that having the same core issues as other methods for defining AGI utility functions (which are not known to work) shows that your insight is not a solution but a (potentially productive) reframing of the problem.
On limiting utility functions
Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process.
[...]
We would like to stress that this process P is an example, and not the central point of this post.
Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U∞ (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize lim i→∞ Ui (= U∞). It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:
The AI should at any given time take actions that are good according to most of the plausible values of U∞.
The AI would be incentivized to gather information that would help it learn more about U∞.
The AI would not be incentivized to gather information about U∞ at the expense of maximizing U∞ (eg, it would not be incentivized to run “unethical experiments”).
The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U∞.
The AI should be keen to maintain option value as it learns more about U∞, until it’s very confident about what U∞ looks like.
Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
Assuming you get such a process pointing towards human values, I expect to get the properties you’re describing, which are pretty good.
There is still one potential issue: the AI needs to be able to use P and Ui (its current utility function) to guess enough of the limit that it can be competitive (footnote: something like a Cauchy criterion?). Otherwise the AI risks falling into crippling uncertainty about what it can and cannot do, since in principle U∞ could be anything.
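To spell out that footnote, the condition I have in mind is something like (my guess, not something from the post):

$$\forall \varepsilon > 0 \;\; \exists N \;\; \forall m, n \ge N: \;\; \sup_{\pi} \big|\, U_m(\pi) - U_n(\pi) \,\big| < \varepsilon,$$

so that after finitely many refinements the AI can bound how far the value of any policy under its current $U_i$ can drift from its value under $U_\infty$, and can therefore act competitively without knowing the limit exactly.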
With that said, it still sounds like the noticeably hardest part of the problem has been “hidden away” in P (as you point out in the issues section). It’s always hard to point at something and say that this is the hard part of the problem, but I’m pretty confident that getting a process that does converge towards human values and satisfies the competitiveness constraint above is the main problem here.
Thus this post seems to provide an alternative “type” for a solution to value learning, in the shape of such a sequence. It sounds similar to other things in the literature, like IDA and Recursive Reward Modelling, but without a built-in human feedback mechanism, which makes it more abstract. So I expect that exploring this abstract framing, and the constraints that fall out of it, might tell us interesting and useful things about the viability of solutions of this type (including their potential impossibility).
On problem statement
Problem Statement
We want a method for creating agents that update their utility function over time. That is, we want:
A method for “pointing to” a utility function (such as “human values”) indirectly, without giving an explicit statement of the utility function in question.
A method for “clarifying” a utility function specified with the method given in (1), so that you in the limit of infinite information obtain an explicit/concrete utility function.
A method for creating an agent that uses an indirectly specified utility function, such that:
The agent at any given time takes actions which are sensible given its current beliefs about its utility function.
The agent will try to find information that would help it to clarify it’s utility function.
The agent would resist attempts to change its utility function away from its indirectly specified utility function.
This problem statement is of course somewhat loose, but that is by necessity, since we don’t yet have a clear idea of what it really means to define utility functions “indirectly” (in the sense we are interested in here).
What’s interesting to me is that your partial solution sort of follows for free from this “definition”. It requires an initial state, an improvement process, and a way to act given the current state of the process. What you add after that is mostly the analogy to mathematical limits: the improvement being split into infinitely many steps that still give you a well-defined result in the limit.
It’s a pretty good application of the idea that getting the right definition is the hardest part (isn’t that the problem with human values, really?). From this it also follows that the potential problems with your solution probably come from the problem statement. Which is good to know when examining it critically.
On human values and unbounded optimization
Consider this puzzle: I am able to talk and reason about ”human values”. However, I cannot define human values, or give you a definite description of what human values are – if I could do this, I could solve a large part of the AI alignment problem by writing down a safe utility function directly. I can also not give you a method for finding out what human values are – if I could do this, I could solve the problem of Inverse Reinforcement Learning. Moreover, I don’t think I could reliably recognize human values either – if you show me a bunch of utility functions, I might not be able to tell if any of them encodes human values. I’m not even sure if I could reliably recognize methods for finding out what human values are – if you show me a proposal for how to do Inverse Reinforcement Learning, I might not be able to tell whether the method truly learns human values.
One useful tool to argue that we can’t define “human values” at the moment (one that isn’t explicitly used here but which you probably know about) is thinking about what happens in the limit of optimization. Many utility functions are recognizably decent proxies for “human values” in the regime of low optimization; it’s when the optimization becomes enormous and unbounded that we lose our ability to foresee the consequences, due to logical non-omniscience.
Also note that the question of whether the resulting world (after unbounded optimization of the utility function) can even be recognized as going against “human values” is more debated.
On classical picture
In the “classical” picture, the utility function is fixed over time, and corresponds to an equation that at some point is typed into the AI’s source code. Unfortunately, we humans don’t really know what we want, so we cannot provide such an equation. If we try to propose a specific utility function directly, we typically get a function that would result in catastrophic consequences if it were pursued with arbitrary competence. This is worrying.
You obviously know this, but it could be valuable to add that this is an idealized situation that is “easier” than the one we will probably find ourselves in (where the utility function, if it is the right abstraction, is learned rather than fully specified).
It feels like you’re making the move of aiming for a simpler problem that still captures the core of the difficulty and confusion, so as to tackle it with minimal details to deal with. I’m on board with that, but being explicit about this move could save you some time justifying some of your design choices.
Thanks for the feedback!
I didn’t know of Simmons’ book, I’ll take a look. To be honest, I never went very far into Awodey, but it still worked better for me than Mac Lane (not hard, I know).