I’m thinking a good techno remix, right?
Ben Livengood
I mean, the Spokesperson is being dumb, the Scientist is being confused. Most AI researchers aren’t even being Scientists, they have different theoretical models than EY. But some of them don’t immediately discount the Spokesperson’s false-empiricism argument publicly, much like the Scientist tries not to. I think the latter pattern is what has annoyed EY and what he writes against here.
However, a large number of current AI experts have recently been boldly claiming that LLMs will never be sufficient even for AGI, not to mention ASI. So maybe it’s also aimed at them a bit.
I think the simplest distinction is that monogamy doesn’t entertain the possibility of a monogamous sexual/romantic partner ethically having other sexual/romantic partners at the same time.
If it’s not monogamy then it can be something else but it doesn’t have to be polyamory (swingers exist and in practice the overlap seems small). Ethical non-monogamy is a superset of most definitions of polyamory but not all because there are polyamorous people who “cheat” (break relationship agreements) and it doesn’t stop them from being considered polyamorous, just like monogamous people who cheat don’t become polyamorous (although I’d argue they become non-monogamous for the duration).
It’s probably more information to learn that someone is monogamous than to learn that they are polyamorous, and learning that they are ethically non-monogamous is somewhere in the middle.
I’ve also found that dance weekends have a strange ability to increase my skill and intuition/understanding of dance more so than lessons. I think a big part of learning dance is learning by doing. For me at least a big part is training my proprioception to understand more about the world than it did before. Both leading and following also help tremendously because of a process something like “my mirror neurons have learned how to understand my partner’s experience by being in their shoes in my own experience”.
The most hilarious thing I witness is the different language everyone comes up with to describe the interaction of tone and proprioception. A bit more than half of the instructors I’ve listened to just call it Energy, and talk about directing it from certain places to certain places. Some people call it leading from X or follower voice or a number of other terms. Very few people have a mechanistic explanation of which muscle groups engage to communicate a lead into a turn or a change in angular momentum by a follow, and ultimately it probably wouldn’t really help people because there appears to be an unconscious layer of learning that we all do between muscle activations and intentions.
tl;dr: I find that after thinking about wanting to do a particular thing and then trying it for a while with several different people as both lead and follow I slowly (sometimes suddenly; it was fun learning how to dance Lindy again after the pandemic from following one dance) find that it is both easier to achieve and easier to understand/feel through proprioception. It feels anti-rationalist as a process but performing the process is a pretty rational thing to do.
Contains/element-of are the complementary formal verbs from set theory, but I’ve definitely seen Contains/is-a used as equivalent in practice (cats contains Garfield because Garfield is a cat).
Similarly in programming “cat is Garfield’s type” makes sense although it’s verbose, or “cat is implemented by Garfield” for the traits folks which is far more natural.
So where linguistically necessary humans have had no trouble complementing is-a in natural language. I think it’s a matter of where emphasis is desired; usually the subject (Garfield) is where the emphasis is, and usually the element is the subject instead of the class. Formally we often want the class/set/type to be the subject since it’s the thing we are emphasizing.
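A minimal Python sketch of the four phrasings above, just to show which side ends up as the subject in each. The `Cat` class and the `cats` set are hypothetical stand-ins, not anything from a real library.

```python
class Cat:
    """Hypothetical class standing in for the type 'cat'."""

    def __init__(self, name):
        self.name = name


garfield = Cat("Garfield")
cats = {garfield}  # the set "cats", written out extensionally

# element-of, element as subject: "Garfield is an element of cats"
element_is_subject = garfield in cats

# contains, set as subject: "cats contains Garfield"
set_is_subject = cats.__contains__(garfield)

# is-a, element as subject: "Garfield is a cat"
is_a = isinstance(garfield, Cat)

# the complement of is-a, class as subject: "cat is Garfield's type"
type_is_subject = type(garfield) is Cat
```

All four expressions evaluate to the same truth value; the only difference is which operand the syntax puts first, which matches the point about emphasis.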
What happens if the exam is given either on Saturday at midnight minus epsilon or on Sunday at 00:00? Seems surprising generally and also surprising in different ways across reasoners of different abilities and precisions given the choice of epsilon.
EDIT: I think it’s also just as surprising if given at midnight minus epsilon on any day before Sunday, and therefore surprising any time. If days are discrete and there’s no time during the day for consideration then it falls back on the original paradox, although that raises the question of when the logical inference takes place. I think this could be extended to an N-discrete-days paradox for any non-oracle agent that has to spend some amount of time during the day reasoning.
Another dumb but plausible way that AGI gets access to advanced chemicals, biotech, and machinery: someone asks “how do I make a lot of street drug X” and it snowballs from there.
It’s okay because mathematical realism can keep modeling them long after we’re gone.
We also routinely create real-life physical models who can be people en masse, and most of them (~93%) who became people have died so far, many by killing.
I’m all for solving the dying part comprehensively but a lot of book/movie/story characters are sort of immortalized. We even literally say that about them, and it’s possible the popular ones are actually better off.
Some direct (I think) evidence that alignment is harder than capabilities: OpenAI basically released GPT-2 immediately with basic warnings that it might produce biased, wrong, and offensive answers. It did, but they were relatively mild. GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 had more caveats, OpenAI didn’t release the model, and has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn’t released for months after pre-training, OpenAI won’t even say how big it is, Bing’s Sydney (an early form of GPT-4) was incredibly misaligned, showing that significantly more alignment work was necessary compared to early GPT-3, and the RLHF/finetuned GPT-4 is still pretty much as vulnerable to DAN and similar prompt engineering.
Naive MCTS in the real world does seem difficult to me, but e.g. action networks constrain the actual search significantly. Imagine a value network good at seeing if solutions work (maybe executing generated code and evaluating the output) and plugging a plain old LLM in as the action network; it could theoretically explore the large solution space better than beam search or argmax+temperature[0].
0: https://openreview.net/forum?id=Lr8cOOtYbfL is from February and I found it after writing this comment, figuring someone else probably had the same idea.
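A toy sketch of the shape of that idea, under heavy assumptions: `propose_actions` is a hypothetical stand-in for the LLM acting as the action network, `value` is a hypothetical stand-in for the value network, and the search is plain best-first search rather than real MCTS (no rollouts, no visit counts). The "problem" is just finding steps that sum to a target.

```python
import heapq

TARGET = 10  # toy goal: a sequence of steps that sums to exactly 10


def propose_actions(state):
    """Stand-in for the LLM 'action network': suggest a few next steps."""
    return [1, 3, 4]


def value(state):
    """Stand-in for the 'value network': higher is better, 0 means solved."""
    return -abs(TARGET - sum(state))


def guided_search(max_expansions=100):
    """Best-first search: always expand the partial solution valued highest."""
    # heapq is a min-heap, so store the negated value as the priority.
    frontier = [(-value(()), ())]
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_value, state = heapq.heappop(frontier)
        if neg_value == 0:
            return list(state)
        for action in propose_actions(state):
            child = state + (action,)
            if sum(child) <= TARGET:  # prune partial solutions that overshoot
                heapq.heappush(frontier, (-value(child), child))
    return None


solution = guided_search()
```

The proposer keeps the branching factor small and the value function decides where to spend expansions, which is the division of labor described above. In a real system both would be learned models, and the value function might execute generated code and score its output.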
I think it’s premature to conclude that AGI progress will be large pre-trained transformers indefinitely into the future. They are surprisingly(?) effective but for comparison they are not as effective in the narrow domains where AlphaZero and AlphaStar are using value and action networks paired with Monte-Carlo search with orders of magnitude fewer parameters. We don’t know what MCTS on arbitrary domains will look like with 2-4 OOM-larger networks, which are within reach now. We haven’t formulated methods of self-play for improvement with LLMs and I think that’s also a potentially large overhang.
There’s also a human limit to the types of RSI we can imagine and once pre-trained transformers exceed human intelligence in the domain of machine learning those limits won’t apply. I think there’s probably significant overhang in prompt engineering, especially when new capabilities emerge from scaling, that could be exploited by removing the serial bottleneck of humans trying out prompts by hand.
Finally I don’t think GOFAI is dead; it’s still in its long winter waiting to bloom when enough intelligence is put into it. We don’t know the intelligence/capability threshold necessary to make substantial progress there. Generally, the bottleneck has been identifying useful mappings from the real world to mathematics and algorithms. Humans are pretty good at that, but we stalled at formalizing effective general intelligence itself. Our abstraction/modeling abilities, working memory, and time are too limited and we have no idea where those limits come from, whether LLMs are subject to the same or similar limits, or how the limits are reduced/removed with model scaling.
One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is to, when novel domains show up, shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish in that new domain. Basically setting an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizer or similar.
I wonder if a basket of SOTA benchmarks would make more sense. Allow no more than X% increase in performance across the average of the benchmarks per year. This would capture the FLOPS metric along with potential speedups, fine-tuning, or other strategies.
Conveniently, this is how the teams are already ranking their models against each other so there’s ample evidence of past progress and researchers are incentivized to report accurately; there’s no incentive to “cheat” if researchers are not allowed to publish greater increases on SOTA benchmarks than the limit allows (e.g. journals would say “shut it down” instead of publishing the paper), unless an actor wanted to simply jump ahead of everyone else and go for a singleton on their own, which is already an unavoidable risk without an EY-style coordinated hard stop.
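A rough sketch of what the basket check could look like. The benchmark names, the scores, and the 5% figure are all made up for illustration; the proposal itself leaves X unspecified.

```python
def average_score(scores):
    """Mean score across the benchmarks in the basket."""
    return sum(scores.values()) / len(scores)


def within_cap(previous_sota, candidate, max_increase_pct):
    """True if the candidate's basket average stays under the yearly cap."""
    baseline = average_score(previous_sota)
    # Only compare on the benchmarks that are in the agreed basket.
    proposed = average_score({name: candidate[name] for name in previous_sota})
    return proposed <= baseline * (1 + max_increase_pct / 100)


# Hypothetical basket and results, not real benchmark numbers.
last_year = {"bench_a": 60.0, "bench_b": 40.0}
modest_model = {"bench_a": 63.0, "bench_b": 41.0}
big_jump_model = {"bench_a": 70.0, "bench_b": 45.0}

modest_ok = within_cap(last_year, modest_model, max_increase_pct=5)
big_jump_ok = within_cap(last_year, big_jump_model, max_increase_pct=5)
```

Because the check is on the basket average, a model can improve a lot on one benchmark as long as the others move little, which is one reason the choice of basket matters as much as the choice of X.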
I agree that if we solve the alignment problem then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI’s permanent values. That single value is probably not enough and we don’t know what the coherent version of “non-deception” actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values, and since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ and its meta-desire is to helpfully turn everything into XYZ which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as “helpful and non-deceptive” was not complex enough to capture our full values and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.
We need a process (probably CEV-like) to accurately identify our full values otherwise the unidentified values will get optimized out of the universe and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the “blah blah” case and I simply didn’t take that to be exhaustive.
In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.
The problem is deeper. The AGI doesn’t recognize its deceptiveness, and so it self-deceives. It would judge that it is being helpful and docile, if it was trained to be those things, and most importantly the meaning of those words will be changed by the deception, much like we keep using words like “person”, “world”, “self”, “should”, etc. in ways absolutely contrary to our ancestors’ deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways because it doesn’t know what values really are and how they mechanistically work in the universe, and so can’t check the state of its values against base reality.
To make this concrete in relation to the story, the overall system has a nominal value to not deceive human operators. Once human/lab-interaction tasks are identified as logical problems that can be solved in a domain specific language, that value is no longer practically applied to the output of the system as a whole because it is self-deceived into thinking the optimized instructions are not deceitful. If the model were trained to be helpful and docile and have integrity, the failure modes would come from ways in which those words are not grounded in a gears-level understanding of the world. E.g. if a game-theoretic simulation of a conversation with a human is docile and helpful because it doesn’t take up a human’s time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to more quickly help humans understand the answers it provides (by bypassing critical thinking skills, innuendo, or some other manipulation), it tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate (the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the “enhanced” communication) and so it doesn’t raise alarm bells. From this point on the story is a formulaic unaligned squiggle optimizer. It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence as AGIs notice things about humans that we don’t notice ourselves, and might be somewhat incapable of knowing without a lot of reflection.
Once AGI is strongly superhuman it could probably communicate whatever it likes but is also at a bigger risk of jumping to even more advanced manipulations or actions based on self-deception.
I think of it this way: humanity went down so many false roads before finding the scientific method and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc. because the optimization process we are implementing is a mix of economic, biological, geographical and other natural forces, human values and drives and reasoning, and also some parts of bare reality we don’t have words for yet, instead of a pure-reason values-directed optimization (whatever those words actually mean physically). We’re currently running at least three global existential risk programs which seem like they violate our values on reflection (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth-destructive forces and they won’t inherently recognize (all of) them for what they are, and neither will we humans as AGI reaches and surpasses our reasoning abilities.
OpenAI is, apparently[0], already using GPT-4 as a programming assistant which means it may have been contributing to its own codebase. I think recursive self improvement is a continuous multiplier and I think we’re beyond zero at this point. I think the multiplier is mostly coming from reducing serial bottlenecks, by decreasing the iteration time it takes to make improvements to the model and supporting codebases. I don’t expect (many?) novel theoretical contributions from GPT-4 yet.
However, it could also be prompted with items from the Evals dataset and asked to come up with novel problems to further fine-tune the model against. Humans have been raising challenges (e.g. the Millennium problems) for ourselves for a long time and I think LLMs probably have the ability to self-improve by inventing machine-checkable problems that they can’t solve directly yet.
[0]: “We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming.”—https://openai.com/research/gpt-4#capabilities
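A toy sketch of the "invent machine-checkable problems you can’t solve yet" loop. Everything here is a hypothetical stand-in: `propose_problem` plays the model inventing problems, `weak_solver` plays the model’s current ability, and `check` is the machine checker. Factoring small semiprimes is only chosen because checking an answer is trivial.

```python
import random

PRIMES = [101, 103, 107, 109, 113]


def propose_problem(rng):
    """Hypothetical problem generator: the product of two distinct primes."""
    p, q = rng.sample(PRIMES, 2)
    return p * q


def check(problem, answer):
    """Machine check: the answer must be two nontrivial factors."""
    a, b = answer
    return a > 1 and b > 1 and a * b == problem


def weak_solver(problem):
    """Stand-in for current ability: only tries divisors below 105."""
    for d in range(2, 105):
        if problem % d == 0:
            return (d, problem // d)
    return None


def collect_unsolved(n, seed=0):
    """Keep the proposed problems that the current solver fails on."""
    rng = random.Random(seed)
    unsolved = []
    for _ in range(n):
        problem = propose_problem(rng)
        answer = weak_solver(problem)
        if answer is None or not check(problem, answer):
            unsolved.append(problem)
    return unsolved


training_candidates = collect_unsolved(50)
```

The unsolved problems are the interesting output: they are verifiable targets just past the solver’s reach, which is the kind of material one would fine-tune against.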
Page 3 of the PDF has a graph of prediction loss on the OpenAI codebase dataset. It’s hard to link directly to the graph, it’s Figure 1 under the Predictable Scaling section.
My takeaways:
Scaling laws work predictably. There is plenty of room for improvement should anyone want to train these models longer, or presumably train larger models.
The model is much more calibrated before fine-tuning/RLHF, which is a bad sign for alignment in general. Alignment should be neutral or improve calibration for any kind of reasonable safety.
GPT-4 is just over 1-bit error per word at predicting its own codebase. That seems close to the capability to recursively self-improve.
This is the best article in the world! Hyperbole is a lot of fun to play with especially when it dips into sarcasm a bit, but it can be hard to do that last part well in the company of folks who don’t enjoy it precisely the same way.
I’ve definitely legitimately claimed things to people hyperbolically that were still maxing out my own emotional scales, which I think is a reasonable use too. Sometimes the person you’re with is the most beautiful person in the locally visible universe within the last few minutes, and sometimes the article you’re reading is the best one right now.