kaarelh AT gmail DOT com
I’d be quite interested in elaboration on getting faster alignment researchers not being alignment-hard — it currently seems likely to me that a research community of unupgraded alignment researchers with a hundred years is capable of solving alignment (conditional on alignment being solvable). (And having faster general researchers, a goal that seems roughly equivalent, is surely alignment-hard (again, conditional on alignment being solvable), because we can then get the researchers to quickly do whatever it is that we could do — e.g., upgrading?)
I was just claiming that your description of pivotal acts / of people that support pivotal acts was incorrect in a way that people that think pivotal acts are worth considering would consider very significant and in a way that significantly reduces the power of your argument as applying to what people mean by pivotal acts — I don’t see anything in your comment as a response to that claim. I would like it to be a separate discussion whether pivotal acts are a good idea with this in mind.
Now, in this separate discussion: I agree that executing a pivotal act with just a narrow, safe, superintelligence is a difficult problem. That said, all paths to a state of safety from AGI that I can think of seem to contain difficult steps, so I think a more fine-grained analysis of the difficulty of various steps would be needed. I broadly agree with your description of the political character of pivotal acts, but I disagree with what you claim about associated race dynamics — it seems plausible to me that if pivotal acts became the main paradigm, then we’d have a world in which a majority of relevant people are willing to cooperate / do not want to race that much against others in the majority, and it’d mostly be a race between this group and e/acc types. I would also add, though, that the kinds of governance solutions/mechanisms I can think of that are sufficient to (for instance) make it impossible to perform distributed training runs on consumer devices also seem quite authoritarian.
In this comment, I will be assuming that you intended to talk of “pivotal acts” in the standard (distribution of) sense(s) people use the term — if your comment is better described as using a different definition of “pivotal act”, including when “pivotal act” is used by the people in the dialogue you present, then my present comment applies less.
I think that this is a significant mischaracterization of what most (? or definitely at least a substantial fraction of) pivotal activists mean by “pivotal act” (in particular, I think this is a significant mischaracterization of what Yudkowsky has in mind). (I think the original post also uses the term “pivotal act” in a somewhat non-standard way in a similar direction, but to a much lesser degree.) Specifically, I think it is false that the primary kinds of plans this fraction of people have in mind when talking about pivotal acts involve creating a superintelligent nigh-omnipotent infallible FOOMed properly aligned ASI. Instead, the kind of person I have in mind is very interested in coming up with pivotal acts that do not use a general superintelligence, often looking for pivotal acts that use a narrow superintelligence (for instance, a narrow nanoengineer) (though this is also often considered very difficult by such people (which is one of the reasons they’re often so doomy)). See, for instance, the discussion of pivotal acts in https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty.
A few notes/questions about things that seem like errors in the paper (or maybe I’m confused — anyway, none of this invalidates any conclusions of the paper, but if I’m right or at least justifiably confused, then these do probably significantly hinder reading the paper; I’m partly posting this comment to possibly prevent some readers in the future from wasting a lot of time on the same issues):
1) The formula for ~y here seems incorrect:
2) Even though you say in the text at the beginning of Section 3 that the input features are independent, the first sentence below made me make a pragmatic inference that you are not assuming that the coordinates are independent for this particular claim about how the loss simplifies (in part because if you were assuming independence, you could replace the covariance claim with a weaker variance claim, since the 0 covariance part is implied by independence):
However, I think you do use the fact that the input features are independent in the proof of the claim (at least you say “because the x’s are independent”):
Additionally, if you are in fact just using independence in the argument here and I’m not missing something, then I think that instead of saying you are using the moment-cumulants formula here, it would be much better to say that independence implies that any term with an unmatched index is 0. If you mean the moment-cumulants formula here https://en.wikipedia.org/wiki/Cumulant#Joint_cumulants , then (while I understand how to derive every equation of your argument in case the inputs are independent), I’m currently confused about how that’s helpful at all, because one then still needs to analyze which terms of each cumulant are 0 (and how the various terms cancel for various choices of the matching pattern of indices), and this seems strictly more complicated than the problem before translating to cumulants, unless I’m missing something obvious.
3) I’m pretty sure this should say x_i^2 instead of x_i x_j, and as far as I can tell the LHS has nothing to do with the RHS:
(I think it should instead say something like: the loss term is proportional to the squared difference between the true and predictor covariance.)
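To spell out the remark in item 2 about independence: here is a sketch of why terms with an unmatched index vanish. It additionally assumes the inputs are mean-zero, which is my assumption and may not match the paper's exact setup; without it, unmatched terms factor but need not be 0.

```latex
% Assumption: x_1, ..., x_n are mutually independent with E[x_i] = 0.
\begin{align*}
\mathbb{E}[x_i x_j] &= \mathbb{E}[x_i]\,\mathbb{E}[x_j] = 0
  && (i \neq j), \\
\mathbb{E}[x_i x_j x_k x_l] &= \mathbb{E}[x_i]\,\mathbb{E}[x_j x_k x_l] = 0
  && (i \notin \{j, k, l\}), \\
\mathbb{E}[x_i^2 x_j^2] &= \mathbb{E}[x_i^2]\,\mathbb{E}[x_j^2]
  && (i \neq j).
\end{align*}
% So only index patterns in which every index appears at least twice survive.
```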
At least ignoring legislation, an exchange could offer a contract with the same return as S&P 500 (for the aggregate of a pair of traders entering a Kalshi-style event contract); mechanistically, this index-tracking could be supported by just using the money put into a prediction market to buy VOO and selling when the market settles. (I think.)
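A minimal sketch of the settlement arithmetic, assuming a binary contract where the YES and NO stakes sum to 1 unit per contract and the whole pool is held in an index fund until settlement. The function name and parameters are hypothetical, and fees, taxes, and tracking error are ignored.

```python
def settle_index_tracking_contract(yes_price, index_return, outcome_yes):
    """Return (yes_payout, no_payout) per contract.

    yes_price: what the YES trader paid; the NO trader paid 1 - yes_price.
    index_return: fractional return of the index fund between entry and
        settlement (e.g. 0.10 for +10%). Hypothetical input.
    outcome_yes: True if the event resolved YES.
    """
    if not 0.0 < yes_price < 1.0:
        raise ValueError("yes_price must be strictly between 0 and 1")
    pool = 1.0  # the two stakes together always sum to 1 unit
    final_pool = pool * (1.0 + index_return)  # pool was invested in the index
    # The winner takes the whole pool, including the index return.
    return (final_pool, 0.0) if outcome_yes else (0.0, final_pool)


yes_payout, no_payout = settle_index_tracking_contract(0.6, 0.10, True)
```

In this toy version, the pair of traders in aggregate gets exactly the index return on their combined stake, which is the property claimed above.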
I will be appropriating terminology from the Waluigi post. I hereby put forward the hypothesis that virtue ethics endorses an action iff it is what the better one of Luigi and Waluigi would do, where Luigi and Waluigi are the ones given by the posterior semiotic measure in the given situation, and “better” is defined according to what some [possibly vaguely specified] consequentialist theory thinks about the long-term expected effects of this particular Luigi vs the long-term effects of this particular Waluigi. One intuition here is that a vague specification could be more acceptable if we are not optimizing for it very hard, instead just obtaining a small amount of information from it per decision.
In this sense, virtue ethics literally equals continuously choosing actions as if coming from a good character. Furthermore, considering the new posterior semiotic measure after a decision, in this sense, virtue ethics is about cultivating a virtuous character in oneself. Virtue ethics is about rising to the occasion (i.e. the situation, the context). It’s about constantly choosing the Luigi in oneself over the Waluigi in oneself (or maybe the Waluigi over the Luigi if we define “Luigi” as the more likely of the two and one has previously acted badly in similar cases or if the posterior semiotic measure is otherwise malign). I currently find this very funny, and, if even approximately correct, also quite cool.
Here are some issues/considerations/questions that I intend to think more about:
What’s a situation? For instance, does it encompass the agent’s entire life history, or are we to make it more local?
Are we to use the agent’s own semiotic measure, or some objective semiotic measure?
This grounds virtue ethics in consequentialism. Can we get rid of that? Even if not, I think this might be useful for designing safe agents though.
Does this collapse into cultivating a vanilla consequentialist over many choices? Can we think of examples of prompting regimes such that collapse does not occur? The vague motivating hope I have here is that in the trolley problem case with the massive man, the Waluigi pushing the man is a corrupt psycho, and not a conflicted utilitarian.
Even if this doesn’t collapse into consequentialism from these kinds of decisions, I’m worried about whether it is stable under reflection, I guess because I doubt that virtue ethics would be part of an agent in reflective equilibrium. It would be sad if the only way to make this work would be to only ever give high semiotic measure to agents that don’t reflect much on values.
Wait, how exactly do we get Luigi and Waluigi from the posterior semiotic measure? Can we just replace this with picking the best character from the most probable few options according to the semiotic measure? Wait, is this just quantilization but funnier? I think there might be some crucial differences. And regardless, it’s interesting if virtue ethics turns out to be quantilization-but-funnier.
More generally, has all this been said already?
Is there a nice restatement of this in shard theory language?
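To make the question above about “picking the best character from the most probable few options” concrete, here is a toy sketch. The character names, probabilities, and value scores are all made up for illustration, and `pick_character` is a hypothetical helper, not something from the Waluigi post.

```python
def pick_character(semiotic_measure, value, k=2):
    """Among the k most probable characters, return the one with highest value.

    semiotic_measure: dict mapping character name -> posterior probability.
    value: dict mapping character name -> score from some consequentialist
        theory (possibly only vaguely specified).
    """
    # Rank characters by probability and keep the top k.
    most_probable = sorted(
        semiotic_measure, key=semiotic_measure.get, reverse=True
    )[:k]
    # Spend "one decision's worth" of optimization: choose the best of those.
    return max(most_probable, key=value.get)


# Hypothetical numbers for illustration only.
measure = {"luigi": 0.6, "waluigi": 0.3, "bowser": 0.1}
scores = {"luigi": 1.0, "waluigi": -1.0, "bowser": 5.0}
chosen = pick_character(measure, scores, k=2)
```

One difference from quantilization, as I understand it: a quantilizer ranks options by utility and samples randomly from the top fraction under the base distribution, whereas this rule ranks by probability first and then maximizes over the shortlist.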
Suppose we are in a world where most top AI capabilities organizations are refraining from publishing their work (this could be the case because of safety concerns, or because of profit motives) + have strong infosec which prevents them from leaking insights about capabilities in other ways. In this world, it seems sort of plausible that the union of the capabilities insights of people at top labs would allow one to train significantly more capable models than the insights possessed by any single lab alone would allow one to train. In such a world, if the labs decide to cooperate once AGI is nigh, this could lead to a significantly faster increase in capabilities than one might have expected otherwise.
(I doubt this is a novel thought. I did not perform an extensive search of the AI strategy/governance literature before writing this.)
First, suppose GPT-n literally just has a “what a human would say” feature and a “what do I [as GPT-n] actually believe” feature, and those are the only two consistently useful truth-like features that it represents, and that using our method we can find both of them. This means we literally only need one more bit of information to identify the model’s beliefs. One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In particular, there should be hard inputs that only a superhuman model can evaluate; on these inputs, the “what a human would say” feature should result in an “I don’t know” answer (approximately 50⁄50 between “True” and “False”), while the “what GPT-n believes” feature should result in a confident “True” or “False” answer. This would allow us to identify the model’s beliefs from among these two options.
For n such that GPT-n is superhuman, I think one could alternatively differentiate between these two options by checking which is more consistent under implications, by which I mean that whenever the representation says that the propositions P and P→Q are true, it should also say that Q is true. (Here, for a language model, P and Q could be ~whatever assertions written in natural language.) Or more generally, in addition to modus ponens, also construct new propositions with ANDs and ORs, and check against all the inference rules of zeroth-order logic, or do this for first-order logic or whatever. (Alternatively, we can also write down versions of these constraints that apply to probabilities.) Assuming [more intelligent ⇒ more consistent] (w.r.t. the same set of propositions), for a superhuman model, the model’s beliefs would probably be the more consistent feature. (Of course, one could also just add these additional consistency constraints directly into the loss in CCS instead of doing a second deductive step.)
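A minimal sketch of the modus ponens check described above, assuming each candidate feature has already been turned into a True/False assignment over natural-language propositions. The function name, the data layout, and the toy assignments are all hypothetical; this is not the CCS implementation.

```python
def modus_ponens_violation_rate(assignment, triples):
    """Fraction of applicable triples (P, P->Q, Q) where modus ponens fails.

    assignment: dict mapping proposition -> bool (what the feature says).
    triples: iterable of (p, p_implies_q, q) proposition keys.
    A triple is applicable when the feature says both P and P->Q are true;
    it is violated when the feature then says Q is false.
    """
    applicable = 0
    violations = 0
    for p, p_implies_q, q in triples:
        if assignment[p] and assignment[p_implies_q]:
            applicable += 1
            if not assignment[q]:
                violations += 1
    return violations / applicable if applicable else 0.0


# Toy data: two candidate truth-like features over the same propositions.
triples = [("P1", "P1->Q1", "Q1"), ("P2", "P2->Q2", "Q2")]
feature_a = {"P1": True, "P1->Q1": True, "Q1": True,
             "P2": True, "P2->Q2": True, "Q2": True}
feature_b = {"P1": True, "P1->Q1": True, "Q1": False,
             "P2": True, "P2->Q2": True, "Q2": True}

rates = {
    "feature_a": modus_ponens_violation_rate(feature_a, triples),
    "feature_b": modus_ponens_violation_rate(feature_b, triples),
}
# Under the [more intelligent => more consistent] assumption, guess that the
# feature with the lower violation rate is the model's own beliefs.
more_consistent = min(rates, key=rates.get)
```

A probabilistic version would replace the boolean check with a penalty on something like max(0, p(P) + p(P→Q) − 1 − p(Q)), which could also be added to the loss directly.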
I think this might even be helpful for differentiating the model’s beliefs from what it models some other clever AI as believing or what it thinks would be true in some fake counterfactual world, because presumably it makes sense to devote less of one’s computation to ironing out incoherence in these counterfactuals – for humans, it certainly seems computationally much easier to consistently tell the truth than to consistently talk about what would be the case in some counterfactual of similar complexity to reality (e.g. to lie).
Hmm, after writing the above, now that I think more of it, I guess it seems plausible that the feature most consistent under negations is already more likely to be the model’s true beliefs, for the same reasons as what’s given in the above paragraph. I guess testing modus ponens (and other inference rules) seems much stronger though, and in any case that could be useful for constraining the search.
(There are a bunch of people that should be thanked for contributing to the above thoughts in discussions, but I’ll hopefully have a post up in a few days where I do that – I’ll try to remember to edit this comment with a link to the post when it’s up.)
I think W does not have to be a variable which we can observe, i.e. it is not necessarily the case that we can deterministically infer the value of W from the values of X and Y. For example, let’s say the two binary variables we observe are X=[whether smoke is coming out of the kitchen window of a given house] and Y=[whether screams are emanating from the house]. We’d intuitively want to consider a causal model where W=[whether the house is on fire] is causing both, but in a way that makes all triples of variable values have nonzero probability (which is true for these variables in practice). This is impossible if we require W to be deterministic once (X,Y) is known.
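A small numerical sketch of the fire example, with made-up probabilities (all numbers are hypothetical). It checks that a hidden common cause W gives every (W, X, Y) triple nonzero probability, and that this cannot happen when W is a deterministic function of (X, Y).

```python
from itertools import product

# Hypothetical parameters: W = house on fire, X = smoke, Y = screams.
p_w = {1: 0.01, 0: 0.99}          # P(W = w)
p_x_given_w = {1: 0.9, 0: 0.05}   # P(X = 1 | W = w)
p_y_given_w = {1: 0.8, 0: 0.02}   # P(Y = 1 | W = w)


def bernoulli(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals value."""
    return p_one if value == 1 else 1.0 - p_one


# Joint distribution for the graph X <- W -> Y.
joint = {
    (w, x, y): p_w[w]
    * bernoulli(p_x_given_w[w], x)
    * bernoulli(p_y_given_w[w], y)
    for w, x, y in product((0, 1), repeat=3)
}
all_triples_possible = all(p > 0 for p in joint.values())

# If instead W = f(X, Y) for some function f (here AND, as one example),
# each (x, y) pair is compatible with exactly one w, so at most 4 of the
# 8 triples can have nonzero probability.
deterministic_support = {(int(x and y), x, y) for x, y in product((0, 1), repeat=2)}
```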
I agree with you regarding Lebesgue measure 0. My impression is that the Pearl paradigm has some [statistics → causal graph] inference rules which basically do the job of ruling out causal graphs for which having certain properties seen in the data has Lebesgue measure 0. (The inference from two variables being independent to them having no common ancestors in the underlying causal graph, stated earlier in the post, is also of this kind.) So I think it’s correct to say “X has to cause Y”, where this is understood as a valid inference inside the Pearl (or Garrabrant) paradigm. (But also, updating pretty close to “X has to cause Y” is correct for a Bayesian with reasonable priors about the underlying causal graphs.)
(epistemic position: I haven’t read most of the relevant material in much detail)
I don’t understand why 1 is true – in general, couldn’t the variable $W$ be defined on a more refined sample space? Also, I think all $4$ conditions are technically satisfied if you set $W=X$ (or well, maybe it’s better to think of it as a copy of $X$).
I think the following argument works though. Note that the distribution of $X$ given $(Z,Y,W)$ is just the deterministic distribution $X=Y \oplus Z$ (this follows from the definition of $Z$). By the structure of the causal graph, the distribution of $X$ given $(Z,Y,W)$ must be the same as the distribution of $X$ given just $W$. Therefore, the distribution of $X$ given $W$ is deterministic. I strongly guess that a deterministic connection is directly ruled out by one of Pearl’s inference rules. The same argument also rules out graphs 2 and 4.
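Written out as equations, the argument above looks roughly like this. The conditional independence statement is my reading of what “the structure of the causal graph” gives for the graph in question, so treat it as an assumption.

```latex
% Assumptions: Z is defined so that X = Y \oplus Z, and the graph implies
% X \perp (Y, Z) \mid W. Everything is for (z, y, w) with positive probability.
\begin{align*}
P(X = x \mid Z = z, Y = y, W = w) &= \mathbf{1}[x = y \oplus z], \\
P(X = x \mid W = w) &= P(X = x \mid Z = z, Y = y, W = w) \\
                    &= \mathbf{1}[x = y \oplus z] \in \{0, 1\}.
\end{align*}
% Hence X is almost surely a deterministic function of W.
```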
I took the main point of the post to be that there are fairly general conditions (on the utility function and on the bets you are offered) in which you should place each bet like your utility is linear, and fairly general conditions in which you should place each bet like your utility is logarithmic. In particular, the conditions are much weaker than your utility actually being linear, or than your utility actually being logarithmic, respectively, and I think this is a cool point. I don’t see the post as saying anything beyond what’s implied by this about Kelly betting vs max-linear-EV betting in general.
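For readers who want the two betting rules side by side, here is a toy sketch for a single repeated binary bet. The win probability `p` and net odds `b` are hypothetical, and this only illustrates “bet as if utility is logarithmic” versus “bet as if utility is linear”, not the post’s conditions for when each applies.

```python
import math


def kelly_fraction(p, b):
    """Fraction of wealth maximizing expected log wealth.

    p: probability of winning; b: net odds (win b per unit staked).
    Standard closed form f* = p - (1 - p) / b, clipped at 0 (no bet).
    """
    return max(0.0, p - (1.0 - p) / b)


def linear_ev_fraction(p, b):
    """Fraction maximizing expected wealth: all-in iff the edge is positive."""
    return 1.0 if p * b - (1.0 - p) > 0 else 0.0


def expected_log_growth(f, p, b):
    """Expected log growth per bet when staking fraction f (requires f < 1)."""
    return p * math.log(1.0 + b * f) + (1.0 - p) * math.log(1.0 - f)


p, b = 0.6, 1.0  # hypothetical: 60% win chance at even odds
f_log = kelly_fraction(p, b)
f_linear = linear_ev_fraction(p, b)

# Sanity check of the closed form against a grid search over f in [0, 1).
grid = [i / 1000 for i in range(1000)]
f_grid = max(grid, key=lambda f: expected_log_growth(f, p, b))
```

With these numbers the log-utility bettor stakes a fifth of their wealth each time, while the linear-utility bettor stakes everything.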
(By the way, I’m pretty sure the position I outline is compatible with changing usual forecasting procedures in the presence of observer selection effects, in cases where secondary evidence which does not kill us is available. E.g. one can probably still justify [looking at the base rate of near misses to understand the probability of nuclear war instead of relying solely on the observed rate of nuclear war itself].)
I’m inside-view fairly confident that Bob should be putting a probability of 0.01% on surviving conditional on many worlds being true, but it seems possible I’m missing some crucial considerations having to do with observer selection stuff in general, so I’ll phrase the rest of this as more of a question.
What’s wrong with saying that Bob should put a probability of 0.01% on surviving conditional on many-worlds being true – doesn’t this just follow from the usual way that a many-worlder would put probabilities on things, or at least the simplest way of doing so (i.e. not post-normalizing only across the worlds in which you survive)? I’m pretty sure that the usual picture of Bayesianism as having a big (weighted) set of possible worlds in your head and, upon encountering evidence, discarding the ones which you found out you were not in, also motivates putting a probability of 0.01% on surviving conditional on many-worlds. (I’m assuming that for a many-worlder, weights on worlds are given by squared amplitudes or whatever.)
This contradicts a version of the conservation of expected evidence in which you only average over outcomes in which you survive (even in cases where you don’t survive in all outcomes), but that version seems wrong anyway, with Leslie’s firing squad seeming like an obvious counterexample to me, https://plato.stanford.edu/entries/fine-tuning/#AnthObje .
A big chunk of my uncertainty about whether at least 95% of the future’s potential value is realized comes from uncertainty about “the order of magnitude at which utility is bounded”. That is, if unbounded total utilitarianism is roughly true, I think there is a <1% chance in any of these scenarios that >95% of the future’s potential value would be realized. If decreasing marginal returns in the [amount of hedonium → utility] conversion kick in fast enough for 10^20 slightly conscious humans on heroin for a million years to yield 95% of max utility, then I’d probably give >10% to strong utopia even conditional on building the default superintelligent AI. Both options seem significantly probable to me, causing my odds to vary much less between the scenarios.
This is assuming that “the future’s potential value” is referring to something like the (expected) utility that would be attained by the action sequence recommended by an oracle giving humanity optimal advice according to our CEV. If that’s a misinterpretation or a bad framing more generally, I’d enjoy thinking again about the better question. I would guess that my disagreement with the probabilities is greatly reduced on the level of the underlying empirical outcome distribution.
Great post, thanks for writing this! In the version of “Alignment might be easier than we expect” in my head, I also have the following:
Value might not be that fragile. We might “get sufficiently many bits in the value specification right” sort of by default to have an imperfect but still really valuable future.
For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We’d get some misalignment on both steps (e.g. because there are systematic ways in which the human is wrong in the training data, and because of inner misalignment), but maybe this is little enough to be fine, despite fragility of value and despite Goodhart.
Even if deceptive alignment were the default, it might be that the AI gets sufficiently close to correct values before “becoming intelligent enough” to start deceiving us in training, such that even if it is thereafter only deceptively aligned, it will still execute a future that’s fine when in deployment.
It doesn’t seem completely wild that we could get an agent to robustly understand the concept of a paperclip by default. Is it completely wild that we could get an agent to robustly understand the concept of goodness by default?
Is it so wild that we could by default end up with an AGI that at least does something like putting 10^30 rats on heroin? I have some significant probability on this being a fine outcome.
There’s some distance δ from the correct value specification such that stuff is fine if we get AGI with values closer than δ. Do we have good reasons to think that δ is far out of the range that default approaches would give us?
(But here are some reasons not to expect this.)