One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT-3); the former makes no pretense of doing only next-token prediction.
[Question] What’s actually going on in the “mind” of the model when we fine-tune GPT-3 to InstructGPT?
This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I’ve been thinking about where people seem to conflate utility-optimizing agents with policy-executing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.
That is to say, prior to “simulators” and “shard theory”, a lot of focus was on utility-maximizers: agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on, e.g., RL agents that enact learned policies: in deployment they do not explicitly maximize reward; rather, they execute the policies that were reinforced in training.
The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader
💯
From my perspective, this post convincingly argues that one route to alignment involves splitting the problem into two still-difficult sub-problems (but genuinely easier ones, unlike the inner-/outer-alignment split, as you’ve said elsewhere): identifying a good shard structure and training an AI with such a shard structure. One point in its favor is that the shard structure is inherently somewhat robust (and therefore each individual shard need not be), making it a much larger target.
I have two objections:
- I don’t buy the implied “naturally-robust” claim. You’ve solved the optimizer’s curse, wireheading via self-generated adversarial inputs, etc., but the policy induced by the shard structure is still sensitive to the details; unless you’re hiding specific robust structures in your back pocket, I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.
- I don’t think that we can train a “value humans” shard that avoids problems with the edge cases of what that means. Maybe it learns that it should kill all humans and preserve their history; or maybe it learns that it should keep them alive and comatose; or maybe it has strong opinions one way or another on whether uploading is death; or maybe it respects autonomy too much to do anything (though that one would probably be decommissioned and replaced by one more dangerous). The problem is not adversarial inputs but genuine vagueness where precision matters. I think this boils down to me disagreeing with John Wentworth’s “natural abstraction hypothesis” (at least in some ways that matter).
- I don’t think it would work to slow down AI capabilities progress. The reason is that AI capabilities translate into money in a way that’s much more direct than “science” writ large—they’re a lot closer to engineering.
Put differently, even if it could have worked (and before GPT-2 and the surrounding hype, I might have believed it could), it’s too late now.
[Question] Are LLMs sufficient for AI takeoff?
(Why “Top 3” instead of “literally the top priority?” Well, I do think a successful AGI lab also needs to have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority.)
I think the situation is more dire than this post suggests, mostly because “You only get one top priority.” If your top priority is anything other than this kind of organizational adequacy, that other priority will take precedence too often; if your top priority is organizational adequacy, you probably can’t get off the ground.
The best distillation of my understanding regarding why “second priority” is basically the same as “not a priority at all” is this twitter thread by Dan Luu.
The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they’d achieve none of their goals.
I think that’s an important objection, but I see it applying almost entirely on a personal level. On the strategic level, I actually buy that this kind of augmentation (i.e. with AI that is in some sense passive) is not an alignment risk (any more than any technology is). My worry is the “dual use technology” section.
English doesn’t have great words for me to describe what I mean here, but it’s something like: your visualization machinery says that it sees no obstacle to success, such that you anticipate either success or getting a very concrete lesson.
One piece of advice/strategy I’ve received that’s in this vein is “maximize return on failure”: prefer to fail in ways that teach you a lot; fail quickly, cheaply, and conclusively; and produce positive externalities from failure. This is not so much a search strategy as a guiding principle and selection heuristic.
“everything is psychology; nothing is neurology”
This line confuses me.
It was just a handle that came to mind for the concept I’m trying to warn against. Reading your post, I get the sense that it’s implicitly claiming that everything is mutable and nothing is fixed; eh… that’s not right either. Like, it feels like it implicitly and automatically rejects the idea that something like a coffee habit can be the correct move even if you look several levels up.
I think maybe you’re saying that someone can choose to reach for coffee for reasons other than wakefulness or energy control.
More specifically, that coffee may be part of a healthy strategy for managing your own biochemistry. I don’t think you say otherwise in the post, but it felt strongly suggested.
Donning the adaptive entropy lens, the place my attention goes to is the “chronically low dopamine”. Why is that? What prevents the body from adapting to its context?
I think this is something I’m pushing back (lightly) against; I do not, on priors, expect every “problem” to be a failure of adaptation. Like, there might be a congenital set point, and you might have it in the bottom decile (note, I’m not saying that’s actually the way it works).
I’d just add that “the same strategy” can be extremely meta.
👍
Mmm. Yes, this is an important distinction. I think to the extent that it didn’t come across in the OP, that was a matter of how the OP was hacked together, not something I’m missing in my intent.
Makes sense; consider it something between “feedback on the article as written” and “breadcrumbs for others reading”.
Is it clear to you?
I think… that I glimpse the dynamic you’re talking about, and that I’m generally aware of its simplest version and try to employ conditions/consequences reasoning, but I do not consistently see it more generally.
[EDIT]
Sleeping on it, I also see connections to [patterns of refactored agency](https://www.ribbonfarm.com/2012/11/27/patterns-of-refactored-agency/) (specifically pervasiveness) and [out to get you](https://thezvi.wordpress.com/2017/09/23/out-to-get-you/). The difference is that while you’re describing something like a physical principle, “out to get you” is describing more of a social principle, and “refactored agency” is describing a useful thinking perspective.
Let’s consider the trolley problem. One consequentialist solution is “whichever choice leads to the highest utility over the lifetime of the universe”, which is intractable. This meta-principle rules it out as follows: if, for example, you learned that one of the five was on the brink of starting a nuclear war and the lone one was on the brink of curing aging, it would say stay, but with the identities flipped it would say switch, and in general there are too many unobservables to consider. By contrast, a simple utilitarian approach of “always switch” is allowed by the principle, as are approaches that take into account demographics or personal importance.
The principle also suggests that killing a random person on the street is bad, even if the person turns out to be plotting a mass murder, and conversely, that a doctor saving said person’s life is good.
Two additional cases where the principle may be useful and doesn’t completely correspond to common sense:
I once read an article by a former vegan arguing against veganism and vegetarianism; one example was the fact that the act of harvesting grain involves many painful deaths of field mice, and that’s not particularly better than killing one cow. Applying the principle, this suggests that suffering or indirect death cannot straightforwardly be the basis for these dietary choices, and that consent is on shaky ground.
When thinking about building a tool (like the LW infrastructure) that could be either hugely positive (because it leads to aligned AI) or hugely negative (because it leads to unaligned AI by increasing AI discussions), and there isn’t really a way to know which, you are morally free to build it or not; any steps you take to increase the likelihood of a positive outcome are good, but you are not required to stop building the tool due to a huge unknowable risk. Of course, if there’s compelling reason to believe that the tool is net-negative, that reduces the variance and suggests that you shouldn’t build it (e.g. most AI capabilities advancements).
Framed a different way, the principle is, “Don’t tie yourself up in knots overthinking.” It’s slightly reminiscent of quiescence search in that it’s solving a similar “horizon effect” problem, but it’s doing so by discarding evaluation heuristics that are not locally stable.
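To make the analogy concrete, here is a minimal sketch of quiescence search (my own illustration, not anything from the original discussion; the `Position` interface with `evaluate()`, `noisy_moves()`, and `apply()` is a hypothetical stand-in). The point is that the static evaluation heuristic is only trusted at locally stable (“quiet”) states:

```python
def quiescence(position, alpha, beta):
    # Only trust the static evaluation once the position is "quiet";
    # otherwise keep searching the moves that could overturn it.
    stand_pat = position.evaluate()  # static heuristic evaluation (assumed interface)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)

    # "Noisy" moves (e.g. captures in chess) are the ones that make the current
    # evaluation locally unstable, so they get searched past the nominal horizon.
    for move in position.noisy_moves():
        score = -quiescence(position.apply(move), -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```

The analogue here is refusing to act on a moral evaluation that the next unobserved detail could flip.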
So, I have an internal sense that I have overcome “idea scarcity” as a result of systematized creativity practice (mostly related to TRIZ), and I have a suspicion that this is both learnable and useful (as a complement to the domain-specific approach of “read a lot about the SOTA of alignment”), but I don’t know how useful. Do you have a sense that this particular problem is a bottleneck in alignment?
I can imagine a few ways this might be the case:
* Junior researchers come up with one great idea and then burn out (where they might have been able to come up with 2 or 3 otherwise); most researchers are junior in such a new field, so fixing this would nearly double or triple the number of great ideas, increasing the chance that one of them succeeds (plus positive second-order effects).
* Researchers waste effort working on an idea after it’s no longer promising because they worry they won’t come up with a new one (where without the fear, they would have shifted back to “explore” sooner); back-of-the-envelope, I imagine this would save about 10% of researcher time (again, with positive second-order effects).
* As a field, we’re doing too little work to find fatal flaws in ideas, in part because ideas are in short supply and shooting them down would be too demotivating, which leads to a similar dynamic as above, where execution effort is spent on ideas that should have been shelved.
A question about alignment via natural abstractions (if you’ve addressed it before, please refer me to where): it seems plausible to me that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional, all-or-nothing property. Like, the AI will learn about “trees”, but whether it avoids unintentionally killing everyone depends on whether a palm tree counts as a tree, or whether a copse counts as full of trees, or some other question that depends on unnatural details of the natural abstraction.
Do you think that edge cases will just naturally be correctly learned?
Do you think that edge cases just won’t end up mattering for alignment?
[Question] Image generation and alignment
This makes me think that a useful meta-principle for the application of moral principles in the absence of omniscience is “robustness to auxiliary information.” Phrased another way, if the variance of the outcomes of your choices is high according to a moral principle, then in all but the most extreme cases, either find more information or pick a different moral principle.
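A rough formalization of that test, as a minimal sketch (the `principle` function, the auxiliary-information sampler, and the threshold are hypothetical stand-ins, not anything proposed in the thread):

```python
def verdict_variance(principle, situation, sample_auxiliary_info, n=1000):
    """Estimate how much the principle's verdict swings as unobserved details vary."""
    verdicts = [principle(situation, sample_auxiliary_info()) for _ in range(n)]
    mean = sum(verdicts) / n
    return sum((v - mean) ** 2 for v in verdicts) / n

def robust_enough(principle, situation, sample_auxiliary_info, threshold=0.1):
    # If the verdict is highly sensitive to information you don't have,
    # either go get that information or pick a different principle.
    return verdict_variance(principle, situation, sample_auxiliary_info) <= threshold
```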
Interesting. I’m reminded of this definition of “beauty”.
Some thoughts:
* Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human.
* The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool.
* Super-human AI is a particularly salient risk because, unlike others, there is reason to expect it to be unintentional; most people don’t want to destroy the world.
* The actions for how to reduce x-risk from sub-human AI and from super-human AI are likely to be very different, with the former being mostly focused on the uses of the AI and the latter on solving relatively novel technical and social problems.
I think “sufficiently” is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?
I also don’t think “something in the middle” is the right characterization; I think “something else” is more accurate. I think that the failure you’re pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn’t really present in either part.
I also think that “cyborg alignment” is in many ways a much more tractable problem than “AI alignment” (and in some ways even less tractable, because of pesky human psychology):
* It’s a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdahl’s law was mentioned elsewhere in the comments as a limit on the usefulness of cyborgism, but it’s also a limit on damage).
* It has been studied longer and has existed longer; all technologies have influenced human thought.
It also may be an important paradigm to study (even if we don’t actively create tools for it) because it’s already happening.
(I may promote this to a full question)
Do we actually know what’s happening when you take an LLM trained on token prediction and fine-tune it via e.g. RLHF to get something like InstructGPT or ChatGPT? The more I think about the phenomenon, the more confused I feel.
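For concreteness, my rough understanding of that fine-tuning stage (following the InstructGPT paper’s setup; here $r_\theta$ is the learned reward model, $\pi^{SFT}$ the supervised fine-tuned baseline, and $\beta$, $\gamma$ the KL-penalty and pretraining-mix coefficients) is that it optimizes something like

$$\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{RL}}}\left[r_\theta(x,y) - \beta \log\frac{\pi_\phi^{RL}(y\mid x)}{\pi^{SFT}(y\mid x)}\right] + \gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\left[\log \pi_\phi^{RL}(x)\right],$$

i.e. the model is pushed toward high reward but only softly (via the KL term) away from the next-token-prediction baseline, which is part of why it’s unclear to me what actually changes “inside”.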
So… Longtime lurker, made an account to comment, etc.
I have a few questions.
The first two are about an innate status sense:
* I’m not convinced that it exists; is there a particular experiment (thought or otherwise) that could clearly demonstrate the existence of an innate status sense among people? Presuming I don’t have it, and I have several willing, honest, introspective, non-rationalist, average adults, what could I ask them?
* Is there a particular thought experiment I could perform that discriminates cleanly between worlds in which I have it and worlds in which I don’t?
Next, about increasing probability estimates of unlikely events based on the outside view:
* This post argues against “Probing the Improbable” and for “Pascal’s Muggle: Infinitesimal …”; having skimmed the former and read the latter, I’m not clearly seeing the difference. Both seem to suggest that after using a model, implicitly or explicitly, to assign a low probability to an event, it is important to note the possibility that the model is catastrophically wrong and factor that into your instrumental probability.