I talk with Samuel Buteau regularly and he has described how he talks to politicians in some detail; he is always focused on x-risk and has described specifically putting effort into convincing a politician NOT to conflate x-risk with other issues. I can’t vouch for everyone at ControlAI, but I trust Samuel quite a bit. Samuel is (by far) the main ControlAI person behind this campaign.
abramdemski
ControlAI has launched an official campaign in Canada, with 33 politicians across party lines backing a clear statement in favor of an international prohibition against developing superintelligent AI: https://controlai.com/canada-statement/en
I used to be ambivalent about government intervention wrt AI safety, because politics can screw things up. However, it seems clearly positive to me now: the development of superintelligence impacts the whole human species, and as such, the democratic process should have a say in whether (and how) superintelligence is developed. Governments have been tolerant thus far, but the general public is not so happy about this, meaning that political campaigns have a good chance of success (in my estimation). Without such government intervention, the AI industry is on track to develop superintelligence within the century if not the decade, and I do not believe safety research is ready to handle the risks. As Yudkowsky put it, only law can prevent extinction.
Public reception of this campaign will make a big difference in the willingness of further politicians to endorse this or similar statements. Although I would normally hesitate to put signal-boost-style politics on LW, this could use a signal boost, and is quite relevant to the audience here.
Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcome but not the actions leading to those outcomes. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it’s possible that Vingean agency can be extended to be ‘the’ definition of agency, if we think that agency is just Vingean agency from some perspective….”)
By analogy, to borrow an example from @johnswentworth, thermodynamics concepts like “temperature” are tied to imperfect modeling ability (since an omniscient observer would instead track the velocity of every particle). So why can’t “agency” be tied to imperfect modeling ability too?
But alas, even if we can rigorously define Vingean agency, I don’t think it would really help with the problem I want it to solve here, i.e. pinning down a distinction between good “counsel” vs bad “manipulation”. Vingean agency seems to solve the problem of identifying an agent trying to do something, by noticing easier-to-predict ends happening by harder-to-predict means. But the “manipulation” concept worries about the possibility of intervention upstream of a person’s ego-syntonic desires. If the AI can brainwash me into deeply wanting to maximize paperclips, and then I execute a clever plan to maximize paperclips, then I would still be a Vingean agent, as long as my clever plan was sufficiently clever (from some perspective). So the brainwashing would strip me of my intuitive agency, but not my Vingean agency.
I don’t think I’ve done enough work to boldly state that I can solve your problem, but I do think you seem overly pessimistic here. The most current attempt to communicate the idea is in my post Legitimate Deliberation.
Roughly, Vingean Agency can be identified via modified variants of van Fraassen’s reflection principle (roughly: I don’t know what move the chess grandmaster will make, but I do know that if I were trying to win a so-far identical game, I’d copy the grandmaster’s moves given the option). This only ever makes sense from a computationally bounded perspective; agents (conceived thusly) vanish in the limit of cognition.
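For reference, one common way to write the reflection principle down is sketched below. The notation ($P_t$ for credence at time $t$, $A$ for a proposition, $N$ and $G$ for novice and grandmaster) is assumed for illustration, and the second line is only a rough rendering of the chess example, not a definition taken from this comment.

```latex
% van Fraassen's reflection principle: current credence defers to
% (hypothetically known) future credence.
P_t\bigl(A \mid P_{t+\Delta}(A) = r\bigr) = r

% A rough Vingean variant (assumed notation): the novice N cannot predict
% the grandmaster G's move, but conditional on G playing m, N ranks m at
% least as highly as any alternative m'.
P_N\bigl(m \text{ is best} \mid G \text{ plays } m\bigr)
  \;\ge\;
P_N\bigl(m' \text{ is best} \mid G \text{ plays } m\bigr)
  \quad \text{for all } m'
```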
van Fraassen invented this principle not to study the epistemic state of a novice thinking about a grandmaster, but rather, to study our own regard for our future opinions. Our preferences can change, but (ideally) we consider those changes to be correct; not in the sense that our new opinions must be correct, nor even in the sense that our new opinions are necessarily revised in the correct direction; but in the sense of van Fraassen’s reflection principle.
The literature was quick to point out that there are exceptions to the principle: we expect to forget many details of the day, such as what we had for lunch, how many times we went to the bathroom, etc., as time goes on. We can expect our future opinions to be less accurate rather than more accurate in such cases. Other examples include getting drunk, being gaslit, etc.
I use the terms “legitimate” vs “illegitimate” to describe opinion-revision processes that do and do not respect van Fraassen’s reflection principle. This is just a definition of convenient terminology, not a deep account; what makes something legitimate vs illegitimate is still a big question. I expect the answer is complex in the same way that human value is complex.
Still, I think “legitimacy” provides a better handle on what the AI is supposed to be avoiding than “manipulation”—many humans do not want to experience heroin, even though they’d predictably enjoy it and want more, because they do not consider that change a legitimate one. Similarly, humans would not want to talk to manipulative AI.
I have recently been thinking about how to construct ML training based on this design goal. Suppose you have a slow but trustworthy belief-revision process, and a fast, highly capable but untrusted belief-revision process (one which is trustworthy on problems where it gets good feedback). It seems potentially possible to create a fast-and-trustworthy process by asking the fast process to predict the slow process. (This fits the basic picture of alignment as uploading with more steps.) One advantage of this approach is that we don’t need to be able to give gold-standard feedback on any object-level questions; the AI is instead trained on our opinions and how they shift over time. The result is supposed to avoid human manipulation for the same reasons that (some) humans avoid hard drugs: because manipulation would violate the legitimacy of the feedback process. It is corrigible in the (weak) sense that, so long as it expects human feedback to be legitimate, cutting that feedback off would be negative-EV (viewed as lost information). There are three cases: either it has already anticipated the belief change and updated accordingly (so there’s no reason to block modifications the humans want to implement); or it wants the information and so won’t block the update (because it trusts the humans); or it deems the human feedback illegitimate (which can either be because it is, or can be a mistake; in either case we’ve messed up the training process).
Extending the rights associated with personhood does not typically extend the right to make war. Do you mean that in practice, you think extending personhood rights to AI would cause them to have the in-practice ability to wage war and win (whether we considered it their right or not)? For a highly capable AI, I’d think it would not make a difference one way or the other. Or perhaps you think that extending rights to AI would necessarily mean denying humans the right to (attempt to) align AI? I don’t think this clearly follows either.
Because they definitely are aligned, or have easily-reached outputs which are functionally equivalent to being aligned, with the wellbeing crowd.
I’m not sure what you meant by this sentence. Do you mean they definitely should be aligned with the well-being crowd in some self-interested sense, or do you mean that they definitely empirically act as if they’re aligned with the well-being crowd, or...?
I think my example was not well-chosen here, because this is confusing. Part of what I’m modeling here is sampling from the weak teacher’s model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
Imagine that 10% of the weak teacher’s samples are misaligned, while 90% are aligned. The teacher has this “aligned” hypothesis and “misaligned” hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a “strong” student should learn to mimic this 90-10 behavior, but this requires the strong student to have a 90-10 hypothesis, not merely the “aligned” and “misaligned” hypotheses. Weak-to-strong generalization is a phenomenon in which the strong student just learns the “aligned” hypothesis instead, when exposed to the 90-10 data, because it fits best out of what the student can hypothesize.
(However, this model probably undersells the importance of gradient descent to the phenomenon.)
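A minimal sketch of the 90-10 picture, treating the student as a maximum-likelihood learner over a fixed hypothesis class. The sample counts, the noise level `EPS`, and the hypothesis names are all made up for illustration; nothing here models gradient descent.

```python
import math

def log_likelihood(p_aligned: float, n_aligned: int, n_misaligned: int) -> float:
    """Log-likelihood of the teacher's samples under a hypothesis that
    emits an aligned sample with probability p_aligned."""
    return n_aligned * math.log(p_aligned) + n_misaligned * math.log(1 - p_aligned)

# Hypothetical data: 1000 teacher samples, 90% aligned and 10% misaligned.
N_ALIGNED, N_MISALIGNED = 900, 100

# Assumed hypothesis classes. EPS is a small noise term so that no
# hypothesis assigns probability zero to any sample.
EPS = 0.01
restricted_student = {"aligned": 1 - EPS, "misaligned": EPS}
richer_student = {**restricted_student, "90-10 mixture": 0.9}

def best_fit(hypotheses: dict) -> str:
    """Name of the hypothesis with the highest log-likelihood on the data."""
    return max(
        hypotheses,
        key=lambda name: log_likelihood(hypotheses[name], N_ALIGNED, N_MISALIGNED),
    )

restricted_choice = best_fit(restricted_student)  # can only pick a pure hypothesis
richer_choice = best_fit(richer_student)          # can also represent the mixture
print(restricted_choice, "|", richer_choice)
```

With these numbers, the student lacking a mixture hypothesis settles on the pure "aligned" hypothesis, while the student that can represent the mixture mimics the teacher's 90-10 behavior instead.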
I think it is probably more a result of wanting to release a new, better model as often as possible. We’ve seen AI companies cluster releases together, last year, as if they’re rushing to put something out whenever anyone else does, so that the media frenzy isn’t exclusively about their competitor. A problem with that approach is you have to either hold something good back while waiting for competitors to release, or push something out prematurely. Everyone wants to get the last word, putting out the model that’ll be perceived as best for the next couple months or so. That’s tough to pull off, so an alternative is to just try and release as frequently as possible, rather than trying to time things cleverly. Then (if you keep pace capability-wise) your models will be the best most often, simply because you release more. The problem with that strategy is, you’ll be using more compute running multiple training runs (doing a big post-training run at the same time as pre-training the next big thing). A more focused approach with fewer parallel training runs and a slower release cycle can utilize limited compute more effectively. The main question (for the race) becomes, who is managing all these trade-offs most effectively?
Excited about your upcoming ozone hole piece!
To me it seems well worth a shot. Politics has worked reasonably well for limiting atomic weapons and curtailing ozone hole damage. AI has some important disanalogies to those things, but still. I have talked to a couple of state-level politicians about AI. One of them was very easy to talk to and seemed quite sane. The other one (whom I talked to much less, so I didn’t get much of a chance to reason with them) was worried about the USA winning the AI race. On the whole, I felt there was more sanity than I expected from politicians. EY expressed a similar sentiment recently in Only Law Can Prevent Extinction:
(I am still feeling amazed, awed, and a little humbled, about the part where my words plausibly had any effect whatsoever. Politicians are a lot more sensible, in some real-life cases, than angry libertarian literature had led me to believe a few decades earlier.)
I’m not sure I understand the argument here correctly. It seems like the intended argument is something like this:
“Omega has access to an infinite number of fair coinflips. Alice can do no better than guess, and Alice cannot guess every coin-flip correctly. Omega knows how Alice will guess, and also knows how each coinflip will land. Therefore, Omega can choose to ask Alice about only the coinflips Alice will guess incorrectly (of which there will be at least one). Alice therefore surely loses money from bets placed.”
This argument uses the assumption that Alice can’t change eir beliefs in response to learning that Omega has proposed specific bets and not others. This might seem concerning, because it seems like precisely what Alice should do, if Alice understands the situation: Alice should expect to lose any bet proposed by Omega. However, this assumption is perfectly normal for Dutch Book arguments. Such an objection would rule out all the usual Dutch Books. I think the classic Dutch Book arguments in fact illustrate a useful idea, even with this ‘flaw’, so I allow it.
More concerningly, the argument assumes Omega has knowledge of how the coins will land. This is a significant departure from classical Dutch Books. It seems clear that a bookie can reliably make money from gamblers if the bookie knows which horse will win which race; this is not, in the classical way of thinking, a testament to the irrationality of the gamblers. It appears to me that this is all that is happening in the above argument.
A second quibble is that in classical Dutch Book arguments, the bookie will surely make money. In the argument above, the bookie only almost surely makes money: since Omega relies on Alice making a bad guess, Omega makes money with probability 1, but not with (logical) certainty.
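The argument as reconstructed above can be simulated in a few lines. This is only a sketch: the even-odds unit stake, the function names, and the finite number of flips are assumptions made for illustration (the original argument uses infinitely many coinflips).

```python
import random

def omega_vs_alice(n_flips: int, rng: random.Random) -> int:
    """Alice's net winnings when Omega offers only the bets she will lose."""
    flips = [rng.choice("HT") for _ in range(n_flips)]
    guesses = [rng.choice("HT") for _ in range(n_flips)]  # Alice can only guess
    net = 0
    for flip, guess in zip(flips, guesses):
        if guess != flip:  # Omega foresees both, and offers only this bet
            net -= 1       # assumed stake: 1 unit at even odds, which Alice loses
    return net

def escape_probability(n_flips: int) -> float:
    """Chance that Alice guesses every flip correctly, so Omega has no bet to offer."""
    return 0.5 ** n_flips

rng = random.Random(0)  # fixed seed so the run is repeatable
results = [omega_vs_alice(20, rng) for _ in range(1000)]
print(max(results), escape_probability(20))
```

Alice never comes out ahead, since every bet she is offered is one she loses. The escape probability is positive for any finite number of flips and only tends to zero in the limit, which is the "almost surely" rather than "surely" point.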
Considering these two violations of the pre-existing norms of Dutch Books, what should we make of the proposed Dutch Book argument? It intuitively makes sense to me that infrabayes might be supported by a sort of almost-Dutch-Book argument. It offers a fresh perspective; perhaps we need to slightly modify the pre-existing norms wrt Dutch Books to see the benefits of infrabayes.
(An analogy: intuitionistic Bayesianism generalizes the usual Dutch Books by allowing bets to fail to pay out, cleanly justifying the possibility of probabilities that do not sum to 1.)
I am mostly unbothered by weakening surely to almost-surely. Losing money with probability 1 seems almost exactly as bad as losing money with logical certainty. However, I haven’t thought deeply about the consequences of such a move. Perhaps this allows some unsavory “Dutch Book” arguments.
Allowing the bookie to know more than the gambler seems far more worrying, but perhaps justifiable. The classical Bayesian really does need to rule out such a case, but perhaps this is precisely because they are not infrabayesian. One might argue that infrabayes is precisely the generalization in belief-structures required to handle this generalization of Dutch Books.
Personally, it seems to me like a more natural way to handle bookies who know more is to drop the earlier-mentioned assumption that the gambler’s probabilities are independent of what bets the bookie proposes. If gamblers know that the bookies at the horse-race know which horses are going to win, then they should update upon seeing what bets those bookies are willing to take. The assumption to the contrary was only tenable in the context of bookies who don’t know anything the gamblers don’t.
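As a toy illustration of updating on the bets offered, here is a Bayes-rule sketch. The model is an assumption made for the example: the bookie identifies losing horses with some accuracy and is willing to take a bet on a horse exactly when it believes that horse will lose.

```python
def posterior_win_given_offer(prior_win: float, bookie_accuracy: float) -> float:
    """P(horse wins | bookie is willing to take our bet on it).

    Assumed model: the bookie takes the bet exactly when it believes the
    horse will lose, and that belief is right with probability bookie_accuracy.
    Undefined (division by zero) if the offer has probability zero.
    """
    offer_if_win = 1 - bookie_accuracy  # bookie wrongly thinks a winner will lose
    offer_if_lose = bookie_accuracy     # bookie correctly spots a loser
    evidence = prior_win * offer_if_win + (1 - prior_win) * offer_if_lose
    return prior_win * offer_if_win / evidence

uninformed = posterior_win_given_offer(0.5, 0.5)  # bookie knows nothing extra
informed = posterior_win_given_offer(0.5, 1.0)    # bookie knows the winner
print(uninformed, informed)
```

When the bookie is no better informed than the gambler, the offer carries no information and the prior is unchanged. When the bookie knows the outcome, the mere fact that the bet is on offer drives the gambler's probability of winning to zero.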
Perhaps, then, the content of the argument is that infrabayesianism can handle knowledgeable bookies in a different way: though we could perhaps handle such cases by dropping the no-update-on-bets-offered assumption, doing so might not result in a very nice theory. Instead, infrabayesianism recommends a strict preference for mixed strategies. I’m not against the idea of a strict preference for mixed strategies, but it also doesn’t jump out at me as the natural way to handle this dutch-book argument as I understand it: after all, we could just as well suppose that Omega can predict the randomness behind the mixed strategy.
I came upon this post because the more recent What is Inadequate about Bayesianism for AI Alignment cited this as the source of its Dutch Book against bayesians. However, the Dutch Book argument made there is somewhat different. That version relies on a “causal” assumption that Omega’s choices are probabilistically independent of the gambler’s. This assumption seems inherently contrary to the problem description (since Omega can predict the gambler’s choices, and uses those predictions to make its choices). Again, maybe the point is that it is theoretically useful: although the “correct” way (according to me) to deal with such cases is to drop the independence assumption, it turns out that we can work out a beautiful and useful theory without doing so.
Coherent Care
Yep. Where this deviated from my notes, I approve (purely in terms of the time travel logic, that is). Seems like OpenAI is way ahead on time-travel logic, which is evidence that it is significantly ahead on “general reasoning”.
They do deliberately try to set up an “I’ll get in the box if I don’t see myself get out” sort of situation in the movie, though they don’t succeed, and they don’t seem to realize that it would result in 0-2-0-2-… across metatime.
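The 0-2-0-2 pattern can be checked with a tiny simulation of the "get in the box only if I don't see myself get out" rule. The model is a simplifying assumption: each timeline in metatime inherits its arrivals from whoever entered the box in the previous one, and the emerged copy never re-enters the box.

```python
def metatime_counts(n_timelines: int) -> list[int]:
    """Number of copies of the person present at the end of each timeline."""
    counts = []
    arrivals = 0  # in the first timeline, nobody emerges from the box
    for _ in range(n_timelines):
        # "I'll get in the box if I don't see myself get out."
        enters_box = arrivals == 0
        # The original plus any arrival, minus the original if they left via the box.
        counts.append(1 + arrivals - int(enters_box))
        # Whoever enters the box emerges at the start of the next timeline.
        arrivals = int(enters_box)
    return counts

print(metatime_counts(6))  # [0, 2, 0, 2, 0, 2]
```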
Good point about how permanent increases have to be as improbable as permanent decreases! I should’ve gotten that from what you were saying earlier. I suppose that leaves me with the “movies follow interesting timelines” theory, where it’s just a convention of the film to look at the timelines where characters multiply.
The characters in the movie take a lot of precautions to isolate themselves from their time-clones, meaning that they don’t really know whether they got out of the box at the start. Therefore, they just have faith in the plan and jump in the box at the end of the loop. So long as they don’t create any obvious paradoxes (“break symmetry” as they call it), everything works out from their perspective, and they can assume it’s consistent-timeline travel rather than branching, so they don’t think they’re creating a timeline in which they mysteriously vanish.
When they start creating paradoxes, of course, they should realize. The fact that they don’t think about it this way fits with the general self-centeredness of the characters, however.
I agree that it makes sense to think of this probabilistically, but we can also think of it as just all timelines existing. I’m happy to excuse the events of the movie as showing one particularly interesting timeline out of the many. It makes sense that the lens of the film isn’t super interested in the timelines which end up lacking one of the viewpoint characters.
If we do think of it probabilistically, though, are the events of the movie so improbable that we should reject them? By my thinking, the movie still fits well with that. Depending on how you think the probabilities should work out, it seems like that first timeline where the person just vanishes is low-probability, particularly if they create a relatively consistent time-loop. In a simple consistent loop, only the original branch has them vanish, while each other branch looks like an internally consistent timeline, and spawns another just like itself. The probability of a timeline like “one Abe, then two Abes, then back to one” seems high, if Abe is careful to avoid paradoxes. With paradoxes, the high-probability timelines get chaotic, which is what we see in the movie (and in the comic I linked).
I’m not sure why you say it’s hard to explain with branching timelines. To me this is just branching timelines. The movie voiceover states at one point that the last version of events seems to be the one that holds true, meaning that you see the last branching timeline, usually the one with the most Bobs. I don’t think you have to believe this part of the voiceover, though; this is just the opinion of someone trying to make sense of events. You could instead say that the movie has a convention of showing us later splits rather than earlier.
Claude’s Bad Primer Fanfic
Canada is doing a big study to better understand the risks of AI. They aren’t shying away from the topic of catastrophic existential risk. This seems like good news for shifting the Overton window of political discussions about AI (in the direction of strict international regulations). I hope this is picked up by the media so that it isn’t easy to ignore. It seems like Canada is displaying an ability to engage with these issues competently.
This is an opportunity for those with technical knowledge of the risks of artificial intelligence to speak up. Making such knowledge legible to politicians and the general public is an important part of civilization being able to deal with AI in a sane manner. If you can state the case well, you can apply to speak to the committee:
Send a request to ETHI@parl.gc.ca, stating:
which study you want to participate in (Challenges Posed by Artificial Intelligence and its Regulation)
who you are and why the committee should care about what you have to say
what you want to talk about
which language(s) you can testify in (English/French), and whether you would testify virtually or in person
Luc Theriault is responsible for this study taking place.
I don’t think the ‘victory condition’ of something like this is a unilateral Canadian ban/regulation. Rather, Canada and other nations need to do something of the form “If [some list of other countries] pass [similar regulation], Canada will [some AI regulation to avoid the risks posed by superintelligence]”.
Here’s a relatively entertaining second hour of proceedings from 26 January:
https://youtu.be/W0qMb1qGwFw?si=EqgPSHRt_AYuGgu8&t=4123
Full videos:
https://www.youtube.com/watch?v=W0qMb1qGwFw&t=30s
https://www.youtube.com/watch?v=mow9UFdxiIw&t=30s
https://www.youtube.com/watch?v=ipMS1S5oOlg&t=19s
3. How does that handle ontology shifts? Suppose that this symbolic-to-us language would be suboptimal for compactly representing the universe. The compression process would want to use some other, more “natural” language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks.
I note that this requires a fairly strong hypothesis: the symbolic-to-us language apparently has to be interpretable no matter what is being explained in that language. It is easy to imagine that there exist languages which are much more interpretable than neural nets (EG, English). However, it is much harder to imagine that there is a language in which all (compressible) things are interpretable.
Python might be more readable than C, but some Python programs are still going to be really hard to understand, and not only due to length. (Sometimes terser programs are the more difficult to understand.)
Perhaps the claim is that such Python programs won’t be encountered due to relevant properties of the universe (ie, because the universe is understandable).
I think “gradient misalignment” here is definitely a type of outer misalignment, the way I think about things.
Another sort of inner misalignment (which isn’t obvious from your typology, but which might fit somewhere in your typology already according to you, or perhaps multiple places) is optimization failure: perhaps the training examples were good and the reward/loss was specified well (outer-aligned), but for whatever reason, the optimization of the weights (eg, gradient descent) introduces a strong misaligned bias.