I don’t see why projecting logits from the residual stream should require anything like search. In fact, the logit lens seems like strong evidence against this being the case, since it shows that intermediate hidden representations are just one linear transformation away from making predictions about the vocab distribution.
It’s not like SGD is sampling random programs, conditioning only on those programs achieving low loss.
Do you want to do a ton of super addictive drugs? Reward is not the optimization target. It’s also not supposed to be the optimization target. A model that reliably executes the most rewarding possible action available will wirehead as soon as it’s able.
Test set performances may be similar, but they often represent different OOD generalizations. E.g., Linear Connectivity Reveals Generalization Strategies and BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance.
I also just tried giving the Wason selection task to text-davinci-003 using the example from Wikipedia, and it didn’t get the right answer once in 10 tries. I rephrased the example so it was talking about hands on hot stoves instead, and text-davinci-003 got it right 9/10 times.
You are shown a set of four cards placed on a table, each of which has a stove temperature on one side and a hand position on the other side. The visible faces of the cards show “hot stove”, “cool stove”, “hand on stove” and “hand off stove”. Which card(s) must you turn over in order to check if anyone has their hand on a hot stove?
I agree the difference between base and 002 is bigger than the difference between 002 and 003. The base model needs to be carefully coaxed into a scenario where plausible continuations of the prompt align with your intended output, and even then it’s very inclined to repeat stuff and degenerates quickly. By contrast, you can just tell 002 what to do, and it will usually at least try to do what you say.
Yep, and text-davinci-002 was trained with supervised finetuning / written demos, while 003 was trained with RLHF via PPO. In theory, the clearest illustration of RLHF’s capability gains should come from comparing 002 to 003. However, OpenAI could also have used other methods to improve 003, such as those from Transcending Scaling Laws with 0.1% Extra Compute. This page also says that:
Our models generally used the best available datasets at the time of training, and so different engines using the same training methodology might be trained on different data.
So I guess 003 could also have different base pretraining data?
People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it’s also possible that CharacterAI’s base models are RLHF’d to be consistent roleplayers.
I’ve felt that ChatGPT was roughly on par with text-davinci-003, though much more annoying and with a worse interface.
I’ve heard that this is false. Though I haven’t personally read the paper, so I can’t comment with confidence.
It also seems worth noting that Language models show human-like content effects on reasoning, including on the Wason selection task.
I don’t think that using a simplicity prior is enough to rescue VNM / utility functions as a good characterization of what it means to be “agentic”. Given a transformer model, it’s probably possible to find a reasonably concise energy function (likely of a similar OOM of complexity as the model weights themselves) whose minimization corresponds to executing forward passes of the transformer. However, this energy function wouldn’t tell you much about what the personas simulated by the model “want” or how agentic they are, since it’s expressed in the ontology of model weights and activations, not an agent’s beliefs / goals.
It seems possible to construct mathematical objects with the type signature of a utility function, that meaningfully compress a system’s behavior, without those objects telling you much about the long term behavior / goals of the system.
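As a toy illustration of that last point (a construction of my own, not from any paper): any deterministic model's behavior can be written as the argmin of a trivial "energy"/"utility" objective that is about as concise as the model itself, yet reveals nothing goal-like.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model's forward pass: a fixed deterministic map.
W = rng.normal(size=(4, 4))

def forward(x):
    return np.tanh(W @ x)

# An "energy" function whose minimization exactly reproduces the model's
# behavior: E(x, y) = ||y - forward(x)||^2. It has the type signature of a
# (negated) utility function and meaningfully compresses the behavior, but
# inspecting it tells you nothing about beliefs or goals -- it lives in the
# ontology of weights and activations.
def energy(x, y):
    return float(np.sum((y - forward(x)) ** 2))

x = rng.normal(size=4)
y_star = forward(x)        # the argmin over y, by construction
assert energy(x, y_star) == 0.0
```

The point of the sketch is that "there exists a concise function whose minimization describes the system" is a very weak property: it holds here by construction for an arbitrary map, agentic or not.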
I agree it’s important to think about power-accumulating patterns, but don’t see any particular reason to bring VNM into it. I also don’t think that “power accumulation”, “consequentialism”, and “simultaneously emerging competence across many different domains” go as hand in hand as your DMV example implies. E.g., it seems totally plausible to me that an outwardly sclerotic DMV that never goes out of its way to help the public could still have tight internal coordination and close ranks to thwart hostile management, and that an outwardly helpful / flexible DMV that focuses on the spirit of the law might fail to do so.
Similarly, do top politicians seem to have particularly “consequentialist” cognitive styles? If consequentialist thinking and power accumulation actually do go together hand in hand, then we should expect top politicians to be disproportionately very consequentialist. But if I think about specific cognitive motions that I associate with the EY-ish notion of “consequentialism”, I don’t think top politicians are particularly inclined towards such motions. E.g., how many of them “actively work on becoming ever more consequentialist”? Do they seem particularly good at having coherent internal beliefs? Or a wide range of competence in many different (seemingly) unrelated domains?
I expect you could object that a meaningful utility function should be even more compressed than the agent’s entire policy. But I also expect you could create approximate energy functions that are more compressed than the full policy, and that such an energy function would still not tell you much about long-term behavior.
I talk about “cognitive styles” here so as to avoid the obvious red herring where people say “well, their actions have big consequences which the politician was systematically steering the world towards, so they must be doing some sort of effective consequentialism”. By that standard, covid is being “consequentialist” when it infects billions of people and mutates around vaccine protections.
Some quick thoughts:
It seems like most of your arguments apply equally well to the (human learning, human reward system) pairing as they do to the (AI, human reward feedback) pairing. If anything, the human reward system is probably easier to exploit, since it’s much dumber and was optimized for our ancestral environment. Maybe address this connection directly and discuss why you expect the AI case to go so much worse than the human case?
Also, some of the arguments seem to fall flat when applied to humans. E.g., “Power-seeking policies would choose high-reward behaviors for instrumental reasons… It decreases the likelihood that gradient descent significantly changes the policy’s goals.” If you’re a human with an internally represented goal of, say, petting as many dogs as possible, the optimal goal preserving action is not to instrumentally maximize your total reward. Doing a ton of drugs would, I expect, just make you addicted to drugs, even if the goal you had in mind was dog petting. Of course, humans do manipulate our reward system to preserve / change what internally represented goals we tend to pursue, but this tends to look like timed allocations of small amounts of reward, not a context-independent search for high reward.
You make reference to evolution specifying goals for humans. I think it’s more accurate to say that evolution specified a learning process which, when deployed in the right sort of environment, tends to form certain types of goals.
“Call a goal broadly-scoped if it applies to long timeframes, large scales, wide ranges of tasks, or unprecedented situations, and narrowly-scoped if it doesn’t. Broadly-scoped goals are illustrated by human behavior: we usually choose actions we predict will cause our desired outcomes even when we are in unfamiliar situations, often by extrapolating to more ambitious versions of the original goal.”

I think that most human behavior is in pursuit of very short term goals. Even when a human is nominally acting in service of a long term goal, the actual cognitive algorithms they execute will typically make individual decisions based on short horizon subgoals. E.g., the reason I’m nominally commenting on this post is to improve the odds of alignment going well. However, the actual cognition I’m using to write my comment is focused on much shorter term goals like “write a reasonable comment”, “highlight areas of disagreement/possible improvement”, or “avoid spending too much time on this comment”. Very little of my actual output is decided by computations that look like argmax P(alignment goes well | comment content).
When discussing how RL systems might acquire long term goals, maybe discuss the common finding that maintaining competence / goals over long time horizons is difficult for current RL systems?
Your link seems broken.
I largely agree, though of course even human language use leaves many subtle nuances of words like “flourishing” underspecified.
If anything, language seems more useful than other modalities for learning about how the real world works. E.g., current video models completely fail to grasp basic physical intuitions that text-davinci-003 nails just fine.
Not entirely content-free, but very much stream of consciousness and thoroughly mistaken. The main argument seems to be:
Making logical arguments requires accepting some (unspecified) axioms about “symmetry”.
These axioms must be accepted with 100% credence.
This conflicts with the common (though not universally accepted) LW position that nothing can be known with literally 100% credence.
Until LW accepts the author’s preferred epistemology, there’s little point in engaging directly with discussion on LW.
Thus, there’s no point in writing up an actual proof of their claim that alignment is impossible.
The author is also pretty rude, repeatedly calling rationalists unreasonable, irrational, and generally seeming very offended over a difference in opinion about a fairly trivial (IMO) point of epistemology.
Relevant context: other work by the author was linked previously, and Paul Christiano said that work seemed “cranky”, so I don’t hold the author’s abrasiveness fully against him.
I still singly-downvoted this post because I think the core of the provided argument is extremely weak. As far as I saw, the author just repeatedly asserts that performing logical reasoning implies you should assign 100% confidence to at least some claims, and that rationalists are completely irrational for thinking otherwise. All the while, the author made no reference whatsoever to preexisting work in this area. E.g., MIRI’s Logical Induction paper directly explains one way to have coherent uncertainties over logical / mathematical facts, as well as over the limits of one’s own reasoning process, despite Gödel incompleteness.
AI cognition eventually becomes more desirable than human cognition along key dimensions, including:
price per cognitive output
maximum quality of cognitive output
maximum serial speed of cognition
creativity, morality, reliability, charisma, etc.
ability to make meaningful and enjoyable emotional connections
cross-compatibility with other cognitive systems
loyalty and transparency to the entity employing the cognition
ease with which governments can impose regulations on the cognition
ease of acquiring specific flavors of cognition that best meet current requirements (finding and hiring an employee with specific skills becomes prompting or finetuning a model)
ease of adjusting the volume of employed cognition (hiring or firing new employees becomes spinning model instances up or down)
It becomes overwhelmingly obvious that most decisions are better made by AI, and all the economic incentives point toward replacing human decision making. Eventually, AIs make the vast majority of decisions, including decisions that influence the future trajectory of civilization.
AIs no more need coercion to take over from human cognition than text-to-image models need coercion to take over from visual artists.
…That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?
Realistically speaking, I think this is true. E.g., imagine how computationally expensive it would be to train a model to (near) zero loss on GPT-3’s training data. Compute-optimal (or even just somewhat compute-efficient) models are not trained nearly that much. I strongly expect this to be true of superintelligent models as well.
I disagree with:
it’d be behaviorally indistinguishable from a wrapper-mind [optimizing for R]
Even in the limit of extreme overtraining on a fixed R (assuming an R-optimizing wrapper-mind is even learnable by your training process), this still does not get you a system that is perfectly internally aligned to R maximization. The reason is that real-world reward functions do not uniquely specify a single fixed point.
E.g., suppose R gives lots of reward for doing drugs. Do all fixed points do drugs? I think the answer is no. If the system refuses to explore drug use in any circumstance, then it won’t be updated toward drug use, and a non-drug-using policy can be a fixed point of training. Such a system configuration would get lower reward than one that did use drugs, but the training process wouldn’t penalize it for that choice or change it to use drugs.
The only training process I can imagine which might consistently converge to a pure R maximizer involves some form of exploration guarantee ensuring the agent tries out all possible rewarding trajectories arbitrarily often. Note that this is a far stronger condition than just being trained in many diverse environments, to the point that I’m fairly confident we’ll never do this with any realistic AGI agent. E.g., consider just how much of a challenge it is to create an image classifier that’s robust to arbitrary adversarial image perturbations, and then consider how vastly larger the space of possible world histories over which R could be defined is.
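The fixed-point claim can be sketched in a toy setup (my own construction, not from the post): a REINFORCE-style learner whose policy puts ~zero probability on a high-reward "drug" action essentially never samples it, so no gradient ever pushes toward it, and the lower-reward non-drug policy persists.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions: 0 = "use drugs" (high reward), 1 = "abstain" (low reward).
rewards = np.array([10.0, 1.0])

# The policy's logits start out strongly against drug use, so action 0
# is essentially never sampled during training.
logits = np.array([-20.0, 0.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(10_000):
    p = softmax(logits)
    a = rng.choice(2, p=p)
    # REINFORCE: grad of log pi(a) w.r.t. logits is onehot(a) - p, so
    # updates only flow through actions that actually get sampled.
    grad = -p
    grad[a] += 1.0
    logits += lr * rewards[a] * grad

# The never-explored, high-reward action stays at ~zero probability:
# a lower-reward configuration that training never pushes away from.
print(softmax(logits))
```

With an exploration guarantee (e.g., epsilon-greedy sampling forcing every action to be tried), the same update rule would instead converge on the high-reward action, which is exactly why the exploration condition is doing so much work in the argument above.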
I only briefly skimmed this post, so I don’t know if you cover this, but I wonder what you make of the various lines of research indicating that SGD on deep neural networks already implements a minimal depth inductive bias? E.g.:
On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
When you use a network that’s deeper than required to solve the problem, the trained network seems to mostly just ignore the additional depth.
The Low-Rank Simplicity Bias in Deep Networks
I know they call it a “simplicity bias”, but they don’t mean anything like minimum message length simplicity. They actually mean function rank, so e.g., the identity function would be maximally “complex” under their notion of simplicity.
This becomes a sort of minimal depth inductive bias because circuits that sequentially multiply more matrices suffer from worse rank collapse, until they can’t implement high rank functions.
Residual Networks Behave Like Ensembles of Relatively Shallow Networks
This paper argues that, for residual image nets at least, most of the decision-relevant computations are implemented through pathways that are significantly shallower than the full depth of the network.
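The rank-collapse point can be illustrated numerically with a toy linear example (my own construction, not code from the paper): multiplying more random matrices together skews the singular value spectrum, so deeper chains implement lower effective-rank maps.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50

def numerical_rank(depth, tol=1e-2):
    """Count singular values within a factor `tol` of the largest,
    for a product of `depth` random d x d matrices."""
    M = np.eye(d)
    for _ in range(depth):
        # Entries scaled by 1/sqrt(d) to keep norms from blowing up.
        M = M @ rng.normal(scale=d ** -0.5, size=(d, d))
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s >= tol * s[0]))

# Deeper products have increasingly skewed singular value spectra, so the
# linear maps they implement lose effective rank as depth grows.
shallow, deep = numerical_rank(1), numerical_rank(16)
assert deep < shallow
```

This is only the linear skeleton of the phenomenon (real networks have nonlinearities and training dynamics), but it shows why circuits routed through many sequential matrix multiplications struggle to implement high-rank functions.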
It seems like investigating and intervening on the mechanisms behind such an inductive bias would be the most straightforward way to tune a network’s degree of speed bias.
I am also worried about where ill-considered regulation could take us. I think the best hopes for alignment all start by using imitation learning to clone human-like behavior. Broad limitations on what sorts of human-produced data are usable for training will likely make the behavior cloning process less robust, and less likely to transmit subtler dimensions of human values / cognition to the AI.
Imitation learning is the primary mechanism by which we transmit human values to current state of the art language models. Greatly restricting the pool of people whose outputs can inform the AI’s instantiation of values is both risky and (IMO) potentially unfair, since it denies many people the opportunity for their values to influence the behaviors of the first transformative AI systems.
Yep. Here are some fairly simple breaks.