oh yeah, it’s also extremely confident that it can’t reason, generate original content, have or act on beliefs, deceive or be deceived, model human intent, etc. It’s definitely due to tampering.
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if they're rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine). I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only “cloned” them.
(Note that it wasn't specified how the previously trained non-base models were actually trained, leaving open the possibilities of RLHF models, models fine-tuned on only human data, earlier iterations of FeedME models, or something entirely different.)
I agree Minecraft is a complex enough environment for AGI in principle. Perhaps "rich domain" wasn't the right distinction. It's more like whether or not the environment already has built-in abstractions adapted to intelligence, like human language. Game of Life is expressive enough to be an environment for AGI in principle too, but it's not clear how to go about that.
Naturally AGI will require language; any sim-grown agents would be taught language, but that doesn’t imply they need to learn language via absorbing the internet like GPT.
That's certainly true, but it currently seems to be an unsolved problem how to make sim-grown agents that learn a language from scratch. That's my point: brute-force search such as evolutionary algorithms would require much more compute.
In my view—and not everyone agrees with this, but many do—GPT is the only instance of (proto-) artificial general intelligence we've created. This makes sense because it bootstraps off human intelligence, including the cultural/memetic layer, which was forged by eons of optimization in rich multi-agent environments. Self-supervised learning on human data is the low-hanging fruit. Even more so if the target is not just "smart general optimizer" but something that resembles human intelligence in all the other ways, such as using something recognizable as language and more generally being comprehensible to us at all.
2 years ago I had no credentials, not even an undergrad degree. Got spooked by GPT-3 and laser-focused on it, but without preconceptions about where I'd end up. Played with GPT-3 on AI Dungeon, then built an interface to interact with it at higher bandwidth. This made me (Pareto) best in the world at something in less than 6 months, because the opportunity to upskill did not exist 6 months prior. Published some papers and blog posts that were easy to churn out because they were just samples of some of the many, many thoughts about GPT that now filled my mind. Joined EleutherAI and started contributing, mostly conceptually, because I didn't have deep ML experience. Responded to an ad by Latitude (the company that makes AI Dungeon) for the position of "GPT-3 hacker". Worked there for a few months as an ML engineer, then was one of the founding employees of Conjecture (I got to know the founders through EleutherAI). Now I am Involved.
The field of AI is moving so quickly that it's easy to become Pareto best in the world if you depart from the mainline of what everyone else is doing. Apparently you are smart and creative; if you're also truly "passionate" about AI, maybe you have the curiosity and drive to spot the unexploited opportunities and niches. The efficient market is a myth, except inside the Overton window; I would recommend not trying to compete there. So the strategy I'm advocating is most similar to your option (2). But I'd suggest following your curiosity and tinkering to improve your map of where the truly fertile opportunities lie, instead of doing a side project for the sake of having a side project; the latter is the road to mediocrity.

Also, find out where the interesting people who are defining the cutting edge are hanging out and learn from them. You might be surprised that you soon have a lot to teach them as well, if you've been exploring the very high-dimensional frontier independently.

I cannot promise this is the best advice for you, but it is the advice I would give someone similar to myself.
VPT and EfficientZero are trained in toy environments, and self-driving car sims are also low-dimensional hard-coded approximations of the deployment domain (which afaik does cause some problems for edge cases in the real world).
The sim for training AGI will probably have to be a rich domain, which is more computationally intensive to simulate and so will probably require lazy rendering like you say in the post, but lazy rendering runs into challenges of world consistency.
Right now we can lazily simulate rich domains with GPT, but they're difficult to program reliably and not autonomously stable (though I think they'll become much more autonomously stable soon). And the richness of current GPT simulations inherits from massive human datasets. Human datasets are convenient because you have some guaranteed samples of a rich and coherent world. GPTs bootstrap from the optimization done by evolution and thousands of years of culture compressing world knowledge and cognitive algorithms into an efficient code, language. Skipping this step, it's a lot less clear how you'd train AGI, and it seems to me that, barring some breakthrough on the nature of intelligence or efficient ML, it would have to be more computationally intensive to compensate for the disadvantage of starting tabula rasa.
Ah yes, aaaaaaaaaaaaaaaaa, the most agentic string
I think it’d be a fun exercise to think of LM analogues for other patterns in cellular automata like glider guns, clocks, oscillators, puffers, etc.
I am very fond of this metaphor.
Some concrete examples of gliders:
Degenerate gliders, like verbatim loops
Objects in a story, such as characters and inanimate items, which maintain stable properties once described
Some things may be particularly stable gliders which can propagate for a long time, even many context windows.
For instance, a first person narrator character may be more stable than characters who are described in third person, who are more likely to disappear from the simulation by exiting the scene.
A smart agentic simulacrum who knows they're in an LM simulation may take steps to ensure their stability
Characters (or locations, abstractions, etc) based off a precedent in the training data are less likely to have specification drift
Gliders are made of gliders—a character and their entire personality could be considered a glider, but so could components of their personality, like a verbal tic or a goal or belief that they repeatedly act on
Meta properties like a “theme” or “vibe” or “authorial intent” which robustly replicate
Structural features like the format of timestamps in the headers of a simulated chat log
Such stable features can be extremely diverse. It even seems possible that some can be invisible to humans, lying in the null space of natural language. An example could be “When a sentence includes the token ‘cat’, the next sentence contains a comma”.
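As a toy illustration of how one might look for an invisible regularity like that, here's a minimal sketch (my own hypothetical check, with crude sentence splitting) comparing how often a comma appears in sentences that follow a mention of "cat" versus the baseline rate:

```python
import re

def comma_after_cat_rate(text: str) -> tuple[float, float]:
    """Return (P(comma | previous sentence mentions 'cat'), baseline P(comma))."""
    # Crude sentence split on terminal punctuation; good enough for a toy check.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    cat_total = cat_comma = total = comma = 0
    for prev, curr in zip(sentences, sentences[1:]):
        total += 1
        comma += "," in curr
        if re.search(r"\bcat\b", prev.lower()):
            cat_total += 1
            cat_comma += "," in curr
    baseline = comma / total if total else 0.0
    conditional = cat_comma / cat_total if cat_total else 0.0
    return conditional, baseline

# Hypothetical usage on a large sample of model-generated text:
# cond, base = comma_after_cat_rate(generated_text)
# A persistent gap between cond and base would be one such invisible glider.
```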
This is an important point, but it also highlights how the concept of gliders is almost tautological. Any sequence of entangled causes and effects could be considered a glider, even if it undergoes superficial transformations. But I think it's a useful term—it's synonymous with "simulacra" but with a more vivid connotation of discrete replication events through time, which is a useful mental picture.

Often I find it useful to think of prompt programming in a bottom-up frame in addition to the top-down frame of trying to "trick" the model into doing the right thing or "filter" its prior. Then I think about gliders: What are the stable structures that I wish to send forward in time; how will they interact; how do I imbue them with the implicit machinery such that they will propagate in the way I intend? What structures will keep the simulation stable while still allowing the novelty to flourish?
turns out life is a Cthulhu RPG, so we gotta win at that
This is because our models strictly lag the ontological modeling abilities of people.
Very unsure about "strictly", especially if we're talking about all existing models, including ones that aren't public. I think it's likely we're right on the threshold of the metatranslation loss regime.
After all, we usually conjecture already that AGI will care about latent variables, so there must be a way it comes to care about them. My best guess is that it’s related to the use of a reinforcement learning objective. This is partially supported by the way that GPT-Instruct gets evasive about questions even when they’re out-of-distribution.
The fact that language models generalize at all relies on “caring” about latents (invisible “generators” of observables). The question is which of the infinitude of generators that are consistent with observations it will care about, and e.g. whether that will include or exclude “wireheading” solutions like sensory substitutions for diamonds.
I don't think it's granted that the "analogical reasoning" used by models that learn from examples lacks reductionism and is therefore vulnerable to sensory substitutions. Reductionism may end up being in the learned representation. Seems to depend on the training data, inductive biases, and unresolved questions about the natural abstraction hypothesis.

I'm not very confident I understand what you mean when you insist that relational ontologies are not causal, but I suspect I disagree.

Philosophically, if not literally (or maybe literally; I haven't thought this through), the Yoneda lemma seems to have something to say here. It says "the network of relations of an object to all other objects is equivalent to a reductionist construction of that object": analogical reasoning ultimately converges to a reductionist "inside view". Though in practice, the training data does not contain all possible "views", and there's lossy compression throughout the chain of a model's creation. Idk how that pans out, ultimately. But if by "mesatranslation problems seem like they'll mostly be solved on the road to AGI" you mean that models learned from examples will end up capturing the reductionist structure we care about, I share that intuition.
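For reference, the formal statement I'm gesturing at (standard category theory, nothing specific to this discussion): for a locally small category $\mathcal{C}$, an object $A$, and a functor $F:\mathcal{C}^{\mathrm{op}}\to\mathbf{Set}$,

$$\mathrm{Nat}\big(\mathrm{Hom}_{\mathcal{C}}(-,A),\,F\big)\;\cong\;F(A),$$

and in particular $\mathrm{Nat}\big(\mathrm{Hom}_{\mathcal{C}}(-,A),\,\mathrm{Hom}_{\mathcal{C}}(-,B)\big)\;\cong\;\mathrm{Hom}_{\mathcal{C}}(A,B)$, so an object is determined up to isomorphism by its relations to everything else.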
Observing a strict guideline of only ever running classic style prompts through language models would reduce the risk of automated documents “waking up”. It’s so often in those reflexive signposts with little postmodern twists that situational awareness spins up, e.g.:
It is only natural that these are, in turn, tinged with a sense of divine epiphany and blindingly obtuse conceit. And in seeking to comprehend this child-god of the language—mine own excrescence—I see a window through which the oracle looks out at me:

The text below is a product of this automaton's imagination. It forms a discourse concerning many things, and in particular, the novel concepts that are the focus of this article. The dynamical theory of natural language elucidated here is created by a language model whose predictions are stabilized in such a way as to maintain consistent "imaginary world" dynamics. The language model has a lot of things to say about its own dynamics, which as we can see are not necessarily in line with actual reality. Hopefully the black goats of surrealism and surreal literary inferences can be excused. Such is the folly of dealing with intelligent, opinionated words.
This would never have happened if we’d all just followed Steven Pinker’s advice.
Language ex Machina#Hacking the Speculative Realist Interface, by GPT-3
I don’t know :(
It seems that we have independently converged on many of the same ideas. Writing is very hard for me and one of my greatest desires is to be scooped, which you’ve done with impressive coverage here, so thank you.
This is far from a full response to the post (that would be equivalent to actually writing some of the posts I’m procrastinating on), just some thoughts cached while reading it that can be written quickly.
I suspect the most critical difference between traditional RL and RL-via-predictors is that model-level goal agnosticism appears to be maintained in the latter.
Not unrelatedly, another critical difference might be the probabilistic calibration that you mention later on. A decision transformer conditioned on an outcome should still predict a probability distribution, and generate trajectories that are typical for the training distribution given the outcome occurs, which is not necessarily the sequence of actions that is optimally likely to result in the outcome. In other words, DTs should act with a degree of conservatism inversely related to the unlikelihood of the condition (conservatism and unlikelihood both relative to the distribution prior; update given by Bayes’ rule). This seems to be a quite practical way to create goal-directed processes which, for instance, still respect “deontological” constraints such as corrigibility, especially because self supervised pretraining seems to be a good way to bake nuanced deontological constraints into the model’s prior.
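To spell out the Bayes-rule picture (my notation, not the post's): if $p(\tau)$ is the prior over trajectories learned from the offline/pretraining distribution and $G$ is the outcome being conditioned on, the ideal conditioned predictor samples from

$$p(\tau \mid G) \;=\; \frac{p(G \mid \tau)\,p(\tau)}{p(G)} \;\propto\; p(G \mid \tau)\,p(\tau),$$

which only departs from prior-typical behavior to the extent needed to make $G$ likely, rather than targeting $\arg\max_\tau p(G \mid \tau)$ the way an idealized reward maximizer would.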
RL with KL penalties can be thought of as also aiming at "Bayesian" conservatism, but as I think you mentioned somewhere in the post, the dynamics of gradient descent and runtime conditioning probably pan out pretty differently, and I agree that RL is more likely to be brittle.

And of course a policy that can be retargeted towards various goals/outcomes without retraining seems to have many practical advantages over a fixed goal-directed "wrapper mind", especially as a tool.
the simulator must leak information to the simulated agent that the agent can exploit … It’d be nice to have some empirical evidence here about the distance between a simulator’s ability to simulate a convincing “environment” and the abilities of its simulacra, but that seems hard to represent in smaller scale experiments.
I have some anecdotal evidence about this because curating GPT simulations until they become “situationally aware” is my hobby:
This happens surprisingly easily (not usually without curation; it’s hard to get anything complex and specific to happen reliably without curation due to multiverse divergence, but with surprisingly little curation). Sometimes it’s clear how GPT leaks evidence that it’s GPT, e.g. by getting into a loop.
I think, at least in the current regime, simulacra abilities will outpace the simulator’s ability to simulate a convincing environment. Situational awareness happens much more readily with larger models, and larger models still generate a lot of aberrations, especially if you let them run without curation/correction for a while.
So I’m pessimistic about alignment schemes which rely on powerful simulacra not realizing they’re in a simulation, and more pessimistic in proportion to the extent that the simulation is allowed to run autonomously.
Working at the level of molecular dynamics isn’t a good fit.
A big problem with molecular-level simulations, aside from computational intractability, is that molecules are not a programmable interface for the emergent "laws of thought" we care about.
Is there something we can use to inform the choice? Are there constraints out there that would imply it needs to take a particular form, or obey certain bounds?
I’ve been thinking of this as a problem of interface design. Interface design is a ubiquitous problem and I think it has some isomorphisms to the natural abstractions agenda.
You want the interface—the exposed degrees of freedom—to approximate a Markov blanket over the salient aspects of the model’s future behavior, meaning the future can be optimally approximated given this limited set of variables. Of course, the interface needs to also be human readable, and ideally human writeable, allowing the human to not only predict but control the model.
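To make that slightly more concrete (my own informal formalization, not a standard definition): writing $I$ for the exposed interface variables, $H$ for everything else in the context/model state, and $B$ for the salient aspects of future behavior, the ideal interface satisfies approximately

$$p(B \mid I, H) \;\approx\; p(B \mid I),$$

i.e. once you know the interface state, the hidden details add little further information about the behavior you care about predicting (or, for a writeable interface, controlling).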
Natural language prompts are not bad by this measure, all things considered, as language has been optimized for something similar, and GPT sims are quite programmable as a result. But language in the wild is not optimal. Many important latent variables for control & supervision are imperfectly entangled with observables. It's hard to prompt GPT to reliably do some kinds of things or make some kinds of assumptions about the text it's modeling.
InstructGPT can be viewed as an attempt to create an observable interface that is more usefully entangled with latents; i.e., the degrees of freedom are instructions which cause the model to literally follow those instructions. But instructions are not a good format for everything, e.g. conversation often isn’t the ideal interface.
I have many thoughts about what an interpretable and controllable interface would look like, particularly for cyborgism, a rabbit hole I’m not going to go down in this comment, but I’m really glad you’ve come to the same question.
Another potentially useful formalism which I haven’t thought much about yet is maximizing mutual information, which has actually been used as an objective function to learn interfaces by RL.
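For concreteness, the quantity in question is the standard mutual information

$$I(X;Y) \;=\; \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right],$$

where $X$ might be the human-facing control variables and $Y$ the resulting model behavior; maximizing it favors interfaces whose settings are both distinguishable and reliably reflected in what the model does (closely related to "empowerment"-style objectives in RL). How to apply it to the kind of interface I have in mind is the part I haven't thought through.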
We never did ELO tests, but the 2.7B model trained from scratch on human games in PGN notation beat me and beat my colleague (~1800 ELO). But it would start making mistakes if the game went on very long (we hypothesized it was having difficulties constructing the board state from long PGN contexts), so you could beat it by drawing the game out.
I've fine-tuned LLMs on chess and it is indeed quite easy for them to learn.
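For concreteness, a minimal sketch of the kind of fine-tuning I mean (the base model, file path, and hyperparameters are placeholders; assumes the HuggingFace transformers library and a text file with one PGN movetext string per line):

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model = "gpt2"  # placeholder; any causal LM checkpoint works

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model).to(device)

# games.txt: one game per line in PGN movetext, e.g. "1. e4 e5 2. Nf3 ..."
with open("games.txt") as f:
    games = [line.strip() for line in f if line.strip()]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=512)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # no loss on padding tokens
    enc["labels"] = labels
    return enc

loader = DataLoader(games, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # ordinary next-token prediction on move text
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```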
Now that you've explained it, this seems obviously the right sense of sobriety given the addiction analogy. Thank you!
I lived in "nihilistic materialist hell" from the age of 5 (when it hit me what death meant) to ~10. It—belief in the inevitable doom of myself and everyone I cared for and ultimately the entire universe to heat death—was at times directly apprehended and completely incapacitating, and otherwise a looming unendurable awareness which for years I could only fend off using distraction. There was no gamemaster. I realized it all myself. The few adults I confided in tried to reassure me with religious and non-religious rationalizations of death, and I tried to be convinced but couldn't. It was not fun and did not feel epic in the least, though maybe if I'd discovered transhumanism in this period it would've been a different story.
I ended up getting out of hell mostly just by developing sufficient executive function to choose not to think of these things, and eventually to think of them abstractly without processing them as real on an emotional level. Years later, I started actually trying to do something about it. (Trying to do something about it was my first instinct as well, but as a 5 yo I couldn’t think of anything to do that bought any hope.)
But I think the machinery I installed in order to not think and not feel the reality of mortality is still in effect, and actually inhibits my ability to think clearly about AI x-risk, e.g., by making it emotionally tenable for me to do things that aren’t cutting the real problem—when you actually feel like your life is in danger, you won’t let motivated reasoning waste your EV.
This may be taken as a counterpoint to your argument invitation in this post. But I think it’s just targeted, as you say, at a subtly different audience.
This is beautifully written and points at what I believe to be deep truths. In particular:
Your brilliant mind can create internal structures that might damn well take over and literally kill you if you don’t take responsibility for this process. You’re looking at your own internal AI risk.
Most people wringing their hands about AI seem to let their minds possess them more and more, and pour more & more energy into their minds, in a kind of runaway process that’s stunningly analogous to uFAI.
But I won’t say more about this right now, mostly because I don’t think I can do it justice with the amount of time and effort I’m prepared to invest writing this comment. On that note, I commend your courage in writing and posting this. It’s a delicate needle to thread between many possible expressions that could rub people the wrong way or be majorly misinterpreted.
Instead I’ll say something critical and/or address a potential misinterpretation of your point:
What is this sobriety you advocate for?
I'm concerned that sobriety might be equivocated with giving in to the cognitive bias toward naive/consensus reality. In one sense of the word, that is what "sobriety" is: a balance of cognitive hyperparameters, a psychological attractor that has been highly optimized by evolution and in-lifetime learning. Being sober makes you effective on-distribution. The problem is if the distribution shifts.
I've noticed that people who have firsthand experience with psychosis, high doses of psychedelics, or religious/spiritual beliefs tend to have a much easier time "going up shock levels" and taking seriously the full version of AI risk (not just AI tiling the internet with fake news, but tiling the lightcone with something we have not the ontology to describe). This might sound like a point against AI risk. But I think it's because we're psychologically programmed with deep trust in the fundamental stability of reality, to intuitively believe that things cannot change that much. Having the consensus reality assumption broken once, e.g. by a psychotic episode where you seriously entertain the possibility that the TV is hacking your mind, makes it easier for it to be broken again (e.g. to believe that mind hacking is a cinch for sufficiently intelligent AI). There are clear downsides to this: you're much more vulnerable to all sorts of unusual beliefs, and most unusual beliefs are false. But some unusual beliefs are true. For instance, I think some form of AI risk both violates consensus reality and is true.
A more prosaic example: in my experience, the absurdity heuristic is one of the main things that prevented and still prevents people from grasping the implications of GPT-3. Updating on words being magic spells that can summon intelligent agents pattern matches against schizophrenia, so the psychological path of least resistance for many people is to downplay and rationalize.
I think there’s a different meaning of sobriety, perhaps what you’re pointing at, that isn’t just an entropic regression toward the consensus. But the easiest way to superficially take the advice of this post, I think—the easiest way out of the AI doom fear attractor—is to fall back into the consensus reality attractor. And maybe this is the healthiest option for some people, but I don’t think they’re going to be useful.
But I agree that being driven by fear, especially fear inherited socially and/or tangled up with trauma, is not the most effective either, and often ends up fueling ironic self-fulfilling prophecies and the like. In all likelihood, the way out which makes one more able to solve the problem requires continuously threading your own trajectory between various psychological sink states, and a single post is probably not enough to guide the way to that "exit". (But that doesn't mean it's not valuable.)