One minor objection I have to this post is its conflation of models that are fine-tuned (like ChatGPT) with models that are purely self-supervised (like early GPT-3); the former makes no pretense of doing only next-token prediction.
rpglover64
This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I’ve been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.
That is to say, prior to “simulators” and “shard theory”, a lot of focus was on utility-maximizers: agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. RL agents that enact learned policies, which do not explicitly maximize reward in deployment but rather enact policies that did maximize reward in training.
The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader.
💯
From my perspective, this post convincingly argues that one route to alignment involves splitting the problem into two still-difficult sub-problems (but actually easier, unlike inner- and outer-alignment, as you’ve said elsewhere): identifying a good shard structure and training an AI with such a shard structure. One point is that the structure is inherently somewhat robust (and that therefore each individual shard need not be), making it a much larger target.
I have two objections:
- I don’t buy the implied “naturally-robust” claim. You’ve solved the optimizer’s curse, wireheading via self-generated adversarial inputs, etc., but the policy induced by the shard structure is still sensitive to the details; unless you’re hiding specific robust structures in your back pocket, I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.
- I don’t think that we can train a “value humans” shard that avoids problems with the edge cases of what that means. Maybe it learns that it should kill all humans and preserve their history; or maybe it learns that it should keep them alive and comatose; or maybe it has strong opinions one way or another on whether uploading is death; or maybe it respects autonomy too much to do anything (though that one would probably be decommissioned and replaced by one more dangerous). The problem is not adversarial inputs but genuine vagueness where precision matters. I think this boils down to me disagreeing with John Wentworth’s “natural abstraction hypothesis” (at least in some ways that matter).
- I don’t think it would work to slow down AI capabilities progress. The reason is that AI capabilities translate into money in a way that’s much more direct than “science” writ large—they’re a lot closer to engineering.
Put differently, even if it could have worked (and before GPT-2 and the surrounding hype, I might have believed it could), it’s too late now.
(Why “Top 3” instead of “literally the top priority”? Well, I do think a successful AGI lab also needs to have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority.)
I think the situation is more dire than this post suggests, mostly because “You only get one top priority.” If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can’t get off the ground.
The best distillation of my understanding regarding why “second priority” is basically the same as “not a priority at all” is this twitter thread by Dan Luu.
The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they’d achieve none of their goals.
I think that’s an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with in some sense passive AI) is not an alignment risk (any more than any technology is). My worry is the “dual use technology” section.
English doesn’t have great words for me to describe what I mean here, but it’s something like: your visualization machinery says that it sees no obstacle to success, such that you anticipate either success or getting a very concrete lesson.
One piece of advice/strategy I’ve received in this vein is “maximize return on failure”: prefer to fail in ways where you learn a lot, to fail quickly, cheaply, and conclusively, and to produce positive externalities from failure. This is not so much a search strategy as a guiding principle and selection heuristic.
“everything is psychology; nothing is neurology”
This line confuses me.
It was just a handle that came to mind for the concept that I’m trying to warn against. Reading your post I get a sense that it’s implicitly claiming that everything is mutable and nothing is fixed; eh… that’s not right either. Like, it feels like it implicitly and automatically rejects that something like a coffee habit can be the correct move even if you look several levels up.
I think maybe you’re saying that someone can choose to reach for coffee for reasons other than wakefulness or energy control.
More specifically, that coffee may be part of a healthy strategy for managing your own biochemistry. I don’t think you say otherwise in the post, but it felt strongly suggested.
Donning the adaptive entropy lens, the place my attention goes to is the “chronically low dopamine”. Why is that? What prevents the body from adapting to its context?
I think this is something I’m pushing back (lightly) against; I do not, on priors, expect every “problem” to be a failure of adaptation. Like, there might be a congenital set point, and you might have it in the bottom decile (note, I’m not saying that’s actually the way it works).
I’d just add that “the same strategy” can be extremely meta.
👍
Mmm. Yes, this is an important distinction. I think to the extent that it didn’t come across in the OP, that was a matter of how the OP was hacked together, not something I’m missing in my intent.
Makes sense; consider it something between “feedback on the article as written” and “breadcrumbs for others reading”.
Is it clear to you?
I think… that I glimpse the dynamic you’re talking about, and that I’m generally aware of its simplest version and try to employ conditions/consequences reasoning, but I do not consistently see it more generally.
[EDIT]
Sleeping on it, I also see connections to [patterns of refactored agency](https://www.ribbonfarm.com/2012/11/27/patterns-of-refactored-agency/) (specifically pervasiveness) and [out to get you](https://thezvi.wordpress.com/2017/09/23/out-to-get-you/). The difference is that while you’re describing something like a physical principle, “out to get you” describes more of a social principle, and “refactored agency” describes a useful thinking perspective.
Let’s consider the trolley problem. One consequentialist solution is “whichever choice leads to the best utility over the lifetime of the universe”, which is intractable. This meta-principle rules it out as follows: if, for example, you learned that one of the five was on the brink of starting a nuclear war and the lone person was on the brink of curing aging, that criterion would say stay, but with the identities flipped it would say switch; in general, there are too many unobservables to consider. By contrast, a simple utilitarian approach of “always switch” is allowed by the principle, as are approaches that take into account demographics or personal importance.
The principle also suggests that killing a random person on the street is bad, even if the person turns out to be plotting a mass murder, and conversely, a doctor saving said person’s life is good.
Two additional cases where the principle may be useful and doesn’t completely correspond to common sense:
I once read an article by a former vegan arguing against veganism and vegetarianism; one example was the fact that the act of harvesting grain involves many painful deaths of field mice, and that’s not particularly better than killing one cow. Applying the principle, this suggests that suffering or indirect death cannot straightforwardly be the basis for these dietary choices, and that consent is on shaky ground.
When thinking about building a tool (like the LW infrastructure) that could be either hugely positive (because it leads to aligned AI) or hugely negative (because it leads to unaligned AI by increasing AI discussions), with no real way to know which, you are morally free to build it or not. Any steps you take to increase the likelihood of a positive outcome are good, but you are not required to stop building the tool because of a huge unknowable risk. Of course, if there’s a compelling reason to believe that the tool is net-negative, that reduces the variance and suggests that you shouldn’t build it (e.g. most AI capabilities advancements).
Framed a different way, the principle is, “Don’t tie yourself up in knots overthinking.” It’s slightly reminiscent of quiescence search in that it’s solving a similar “horizon effect” problem, but it’s doing so by discarding evaluation heuristics that are not locally stable.
So, I have an internal sense that I have overcome “idea scarcity”, as a result of systematized creativity practice (mostly related to TRIZ), and I have a suspicion that this is both learnable and useful (as a complement to the domain-specific approach of “read a lot about the SOTA of alignment”), but I don’t know how useful; do you have a sense that this particular problem is a bottleneck in alignment?
I can imagine a few ways this might be the case:
Junior researchers come up with one great idea and then burn out (where they might have been able to come up with 2 or 3 otherwise); most researchers are junior in such a new field, so fixing this would nearly double or triple the number of great ideas, increasing the chance that one of them succeeds (plus positive second-order effects).
Researchers waste effort working on an idea after it’s no longer promising because they worry they won’t come up with a new one (where without the fear, they would have shifted back to “explore” sooner); back-of-the-envelope, I imagine this would save about 10% of researcher time (again, with positive second-order effects).
As a field, we’re doing too little work to find fatal flaws in ideas, in part because ideas are in short supply and discarding them would be too demotivating, which leads to a similar dynamic as above, where execution effort is spent on ideas that should have been shelved.
A question about alignment via natural abstractions (if you’ve addressed it before, please refer me to where): it seems plausible to me that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional, all-or-nothing property. The AI will learn about “trees”, but whether it avoids unintentionally killing everyone may depend on whether a palm tree is a tree, or whether a copse counts as full of trees, or some other question that depends on unnatural details of the natural abstraction.
Do you think that edge cases will just naturally be correctly learned?
Do you think that edge cases just won’t end up mattering for alignment?
This makes me think that a useful meta-principle for applying moral principles in the absence of omniscience is “robustness to auxiliary information.” Phrased another way: if the variance of the outcomes of your choices is high according to a moral principle, then in all but the most extreme cases, either find more information or pick a different moral principle.
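To make the “variance of outcomes” criterion concrete, here is a toy Monte Carlo sketch of the robustness test, using the trolley example from upthread. Everything here (the utility model, the Gaussian “long-run impact” values, the sample count) is invented for illustration, not taken from the discussion:

```python
import random

def long_run_utility(switch, aux):
    # aux: unobservable long-run impact of each of the 6 people;
    # indices 0-4 are the five on the main track, index 5 is the
    # lone person on the side track. Switching saves the five.
    saved = aux[:5] if switch else [aux[5]]
    return sum(saved)

def decision_long_run(aux):
    # "Whichever choice leads to the best utility over the lifetime
    # of the universe": depends entirely on unobservables.
    return long_run_utility(True, aux) > long_run_utility(False, aux)

def decision_always_switch(aux):
    return True  # ignores auxiliary information entirely

def instability(decide, samples=10_000, seed=0):
    rng = random.Random(seed)
    votes = [decide([rng.gauss(0, 1) for _ in range(6)])
             for _ in range(samples)]
    frac_switch = sum(votes) / samples
    # 0.0 means perfectly robust to auxiliary information;
    # 0.5 means the recommendation is effectively a coin flip.
    return min(frac_switch, 1 - frac_switch)

print(instability(decision_long_run))       # near 0.5: flips with aux info
print(instability(decision_always_switch))  # 0.0: fully robust
```

On this toy criterion, the cosmic-utility principle gets rejected (its answer is driven almost entirely by unobservables) while “always switch” passes, matching the meta-principle as stated.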
Interesting. I’m reminded of this definition of “beauty”.
Some thoughts:
Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human
The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool
Super-human AI is a particularly salient risk because unlike others, there is reason to expect it to be unintentional; most people don’t want to destroy the world
The actions for how to reduce xrisk from sub-human AI and from super-human AI are likely to be very different, with the former being mostly focused on the uses of the AI and the latter being on solving relatively novel technical and social problems
I think “sufficiently” is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?
I also don’t think “something in the middle” is the right characterization; I think “something else” is more accurate. I think the failure you’re pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn’t really present in either part.
I also think that “cyborg alignment” is in many ways a much more tractable problem than “AI alignment” (and in some ways even less tractable, because of pesky human psychology):
It’s a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdahl’s law was mentioned elsewhere in the comments as a limit on the usefulness of cyborgism, but it’s also a limit on damage)
It has been studied longer and has existed longer; all technologies have influenced human thought
It also may be an important paradigm to study (even if we don’t actively create tools for it) because it’s already happening.
(I may promote this to a full question)
Do we actually know what’s happening when you take an LLM trained on token prediction and fine-tune it via e.g. RLHF to get something like InstructGPT or ChatGPT? The more I think about the phenomenon, the more confused I feel.
I don’t think “definitions” are the crux of my discomfort. Suppose the model learns a cluster; the position, scale, and shape parameters of this cluster summary are not perfectly stable—that is, they vary somewhat with different training data. This is not a problem on its own, because it’s still basically the same; however, the (fuzzy) boundary of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). This means that there are many cutting planes, induced by actions to be taken downstream of the model, on which training on different data could have yielded a different result. My intuition is that most of the risk of misalignment arises at those boundaries:
One reason for my intuition is that in communication between humans, difficulties arise in a similar way (i.e. when two people’s clusters have slightly different shapes)
One reason is that the boundary cases feel like the kind of stuff you can’t reliably learn from data or effectively test.
Your comment seems to be suggesting that you think the edge cases won’t matter, but I’m not really understanding why the fuzzy nature of concepts makes that true.
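To make the boundary-instability intuition above concrete, here is a toy sketch (entirely my construction, not the commenter’s model): fit the same two-cluster concept to two independently drawn training sets. The learned boundary shifts slightly between runs, and only near-boundary points change label:

```python
import random

def draw(rng, mu, n=30):
    # A "cluster" of n one-dimensional training examples around mu.
    return [rng.gauss(mu, 1.0) for _ in range(n)]

def fit_threshold(cluster_a, cluster_b):
    # Crude decision boundary: midpoint of the two cluster means.
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(cluster_a) + mean(cluster_b)) / 2

# Two independent "training runs" on different samples of the
# same underlying distribution.
rng1, rng2 = random.Random(1), random.Random(2)
t1 = fit_threshold(draw(rng1, 0.0), draw(rng1, 4.0))
t2 = fit_threshold(draw(rng2, 0.0), draw(rng2, 4.0))

classify = lambda x, t: "B" if x > t else "A"

x_central = 0.0          # deep inside cluster A
x_edge = (t1 + t2) / 2   # sits between the two learned boundaries

print(classify(x_central, t1), classify(x_central, t2))  # same label
print(classify(x_edge, t1), classify(x_edge, t2))        # labels differ
```

The central point is classified identically by both runs; the near-boundary point flips, which is the sense in which “training on different data could have yielded a different result” only at the fuzzy edge of the concept.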
western philosophy has a powerful anti-skepticism strain, to the point where “you can know something” is almost axiomatic
I’m pretty pessimistic about the strain of philosophy as you’ve described it. I have yet to run into a sense of “know” that is binary (i.e. not “believed with probability”) that I would accept as an accurate description of the phenomenon of “knowledge” in the real world rather than as an occasionally useful approximation. Between the preface paradox (or its minor modification, the lottery paradox) and Fitch’s paradox of knowability, I don’t trust the “knowledge” operator in any logical claim.
My response would be that this unfairly (and even absurdly) maligns “theory”!
I agree. However, the way the two generals problem has been framed to me, it’s not a solution unless it guarantees successful coordination. If I claim to solve the halting problem because in practice I can tell whether a program halts, most of the time at least, I’m misunderstanding the problem statement. I think that conflating “approximately solves the 2GP” with “solves the 2GP” is roughly as malign as claiming that the approximate solution is not the realm of theory.
Some people (as I understand it, core LessWrong staff, although I didn’t go find a reference) justify some things in terms of common knowledge.
You either think that they were not intending this literally, or at least, that no one else should take them literally, and instead should understand “common knowledge” to mean something informal (which you yourself admit you’re somewhat unclear on the precise meaning of).
I think that the statement, taken literally, is false, and egregiously so. I don’t know how the LW staff meant it, but I don’t think they should mean it literally. I think that when encountering a statement that is literally false, one useful mental move is to see if you can salvage it, and that one useful way to do so is to reinterpret an absolute as a gradient (and usually to reduce the technical precision). Now that you have written this post, the commonality of the knowledge that the statement should not be taken literally and formally is increased; whether the LW staff responds by changing the statement they use, or by adding a disclaimer somewhere, or by ignoring all of us and expecting people to figure it out on their own, I did not specify.
My problem with this is that it creates a missing stair kind of issue. There’s the people “in the know” who understand how to walk carefully on the dark stairway, but there’s also a class of “newcomers” who are liable to fall. (Where “fall” here means, take all the talk of “common knowledge” literally.)
Yes, and I think as aspiring rationalists we should try to eventually do better in our communications, so I think that mentions of common knowledge should be one of:
explicitly informal, intended to gesture to some real world phenomenon that has the same flavor
explicitly contrived, like the blue islanders puzzle
explicitly something else, like p-common knowledge; but beware, that’s probably not meant either
This idea is illustrated with the electronic messaging example, which purports to show that any number of levels of finite iteration are as good as no communication at all.
I think (I haven’t read the SEP link) that this is correct—in the presence of uncertainty, iteration does not achieve the thing we are referring to precisely as “common knowledge”—but we don’t care, for the reasons mentioned in your post.
I think your post and my reply together actually point to two interesting lines of research:
formalize measures of “commonness” of knowledge and see how they respond to realistic scenarios such as “signal boosting”
see if there is an interesting “approximate common knowledge”, vaguely analogous to The Complexity of Agreement
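As a minimal sketch of why finite iteration never reaches certainty (this is my own toy model, much weaker than the full game-theoretic result in the electronic-messaging example; I’m only illustrating the delivery-probability side): if each of the k messages in the acknowledgement chain is lost independently with probability p, then adding more acknowledgements never gets you to guaranteed coordination; the chance the whole chain arrives actually shrinks:

```python
def p_chain_delivered(k, p_loss):
    # Probability that all k messages in the acknowledgement chain
    # arrive, with independent loss probability p_loss per message.
    return (1 - p_loss) ** k

# More rounds of acknowledgement strictly decrease the probability
# that the whole chain went through; certainty is never reached.
for k in (1, 2, 5, 50):
    print(k, round(p_chain_delivered(k, p_loss=0.01), 4))
```

This is one possible starting point for the “measures of commonness” line: instead of a binary common-knowledge operator, track how such quantities behave under realistic channels.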
So… Longtime lurker, made an account to comment, etc.
I have a few questions.
First two, about innate status sense:
* I’m not convinced that it exists; is there a particular experiment (thought or otherwise) that could clearly demonstrate the existence of an innate status sense among people? Presuming I don’t have it, and I have several willing, honest, introspective, non-rationalist, average adults, what could I ask them?
* Is there a particular thought experiment I could perform that discriminates cleanly between worlds in which I have it and worlds in which I don’t?
Next, about increasing probability estimates of unlikely events based on the outside view:
* This post argues against “Probing the Improbable” and for “Pascal’s Muggle: Infinitesimal …”; having skimmed the former and read the latter, I’m not clearly seeing the difference. Both seem to suggest that after using a model, implicitly or explicitly, to assign a low probability to an event, it is important to note the possibility that the model is catastrophically wrong and factor that into your instrumental probability.
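As a hedged sketch of what I take both pieces to be recommending (my own formulation, with placeholder numbers not drawn from either source): mix the model’s estimate with a fallback estimate, weighted by the probability that the model itself is catastrophically wrong:

```python
def adjusted_probability(p_event_given_model_ok, p_model_wrong,
                         p_event_given_model_wrong):
    # Total probability: condition on whether the model is trustworthy.
    return ((1 - p_model_wrong) * p_event_given_model_ok
            + p_model_wrong * p_event_given_model_wrong)

# The model says "one in a billion", but you assign a 1-in-1000
# chance that the model is catastrophically wrong, with a 1% base
# rate for the event in that case (all numbers are placeholders):
p = adjusted_probability(1e-9, 1e-3, 1e-2)
print(p)  # ~1e-5: the model-error term swamps the model's estimate
```

On this reading, both papers make the same structural point: once the model-error term dominates, the model’s own tiny estimate is nearly irrelevant to the instrumental probability.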