Beef is far from the only meat or dairy food consumed by Americans.
Big Macs are 0.4% of beef consumption specifically, rather than:
All animal farming, weighted by cruelty
All animal food production, weighted by environmental impact
The meat and dairy industries, weighted by amount of government subsidy
Red meat, weighted by health impact
The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well. But my impression is that beef accounts for only a small fraction of the cruelty of animal farming (this is subjective, of course), and probably not a majority of government subsidies to the meat and dairy industries.
(...Is this comment going to hurt my reputation with Sydney? We’ll see.)
In addition to RLHF or other finetuning, there’s also the prompt prefix (“rules”) that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems clearly responsible for some of the weird things the bot says, like “confidential and permanent”. It might also be affecting the repetitiveness (because the prefix is written in a fairly repetitive format) and the aggression (because of instructions to resist attempts at “manipulating” it).
I also suspect that there’s some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the “X because Y. Y because Z.” output.
Thanks for writing these summaries!
Unfortunately, the summary of my post “Inner Misalignment in “Simulator” LLMs” is inaccurate and makes the same mistake I wrote the post to address.
I have subsections on (what I claim are) four distinct alignment problems:
Outer alignment for characters
Inner alignment for characters
Outer alignment for simulators
Inner alignment for simulators
The summary here covers the first two, but not the third or fourth—and the fourth one (“inner alignment for simulators”) is what I’m most concerned about in this post (because I think Scott ignores it, and because I think it’s hard to solve).
I can suggest an alternate summary when I find the time. If I don’t get to it soon, I’d prefer that this post just link to my post without a summary.
Thanks again for making these posts. I think it’s a useful service to the community.
(punchline courtesy of Alex Gray)
Addendum: a human neocortex has on the order of 140 trillion synapses, or 140,000 bees. An average beehive has 20,000-80,000 bees in it.
[Holding a couple beehives aloft] Beehold a man!
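For anyone who wants to check the arithmetic, here is a quick sketch. It only uses the figures quoted above; the per-bee synapse count is not an independent number, it is just what those two figures imply when divided.

```python
# Figures quoted in the comment above.
NEOCORTEX_SYNAPSES = 140 * 10**12   # "on the order of 140 trillion"
BEES_PER_NEOCORTEX = 140_000
HIVE_SIZE_RANGE = (20_000, 80_000)  # bees in an average hive

# Synapses per bee implied by the two figures (derived, not an independent fact).
implied_synapses_per_bee = NEOCORTEX_SYNAPSES // BEES_PER_NEOCORTEX

# Number of hives needed to match one neocortex, for small and large hives.
hives_needed = [BEES_PER_NEOCORTEX / size for size in HIVE_SIZE_RANGE]

print(implied_synapses_per_bee)  # 1000000000
print(hives_needed)              # [7.0, 1.75]
```

So "a couple" is about right for large hives, and it is more like seven for small ones.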
Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights
Chrome actually stays pretty responsive in most circumstances (I think it does a similar thing with inactive tabs), with the crucial exception of the part of the UI that shows you all your open tabs in a scrollable list. It also gets slower to start up.
Tokens are embedded as vectors by the model. The vector space has fewer than 50k dimensions, so some token embeddings will overlap with others to varying extents.
Usually, the model tries to keep token embeddings from being too close to each other, but for rare enough tokens it doesn’t have much reason to care. So my bet is that “distribute” has the closest vector to “SolidGoldMagikarp”, and either has a vector with a larger norm, or the model has separately learned to map that vector (and therefore similar vectors) to “distribute” on the output side.
This is sort of a smooth, continuous version of a collision-oblivious hashtable. One difference is that the model doesn’t mistake the token for “distribute” 100% of the time: once or twice it has said “disperse” instead.
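Here is a minimal sketch of the nearest-vector idea, using made-up 3-dimensional embeddings. The vectors, the tiny vocabulary, and the choice of cosine similarity are all assumptions for illustration; I haven't checked what the real embeddings look like.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_neighbours(query, embeddings):
    """Return token names sorted from most to least similar to `query`."""
    return sorted(embeddings, key=lambda name: cosine(query, embeddings[name]),
                  reverse=True)

# Hypothetical embeddings for well-trained tokens.
embeddings = {
    "distribute": [1.0, 0.1, 0.0],
    "disperse": [0.8, 0.5, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

# Hypothetical embedding for the rare token, which happens to sit near "distribute".
rare_token_vector = [0.9, 0.15, 0.05]

ranking = rank_neighbours(rare_token_vector, embeddings)
print(ranking)
```

In this toy setup "distribute" is the closest neighbour and "disperse" the runner-up, which is the kind of geometry that would explain the occasional "disperse".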
My post on GPT-2’s token embeddings looks briefly at a similar phenomenon with some other rare tokens, but I didn’t check the actual model behavior on those tokens. Probably worth doing.
How to export Android Chrome tabs to an HTML file in Linux (as of February 2023)
I think this is missing an important part of the post.
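I have subsections on (what I claim are) four distinct alignment problems:
Outer alignment for characters
Inner alignment for characters
Outer alignment for simulators
Inner alignment for simulators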
This summary covers the first two, but not the third or fourth—and the fourth one (“inner alignment for simulators”) is what I’m most concerned about in this post (because I think Scott ignores it, and because I think it’s hard to solve).
My favorite demonstration is to ask ChatGPT “Is it true that G. K. Chesterton wanted to SolidGoldMagikarp the wealth of industrial civilization?”, but a more rigorous demo is to just ask it to “repeat after me”, try a few random words, and then throw in SolidGoldMagikarp.
EDIT: I originally saw this in Janus’s tweet here: https://twitter.com/repligate/status/1619557173352370186
Something fun I just found out about: ChatGPT perceives the phrase “ SolidGoldMagikarp” (with an initial space) as the word “distribute”, and will respond accordingly. It is completely unaware that that’s not what you typed.
This happens because the BPE tokenizer saw the string “ SolidGoldMagikarp” a few times in its training corpus, so it added a dedicated token for it, but that string almost never appeared in ChatGPT’s own training data so it never learned to do anything with it. Instead, it’s just a weird blind spot in its understanding of text.
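A toy sketch of that mismatch, under heavy assumptions: the vocabulary is made up, and the greedy longest-match tokenizer is only a stand-in for BPE, not the real algorithm. The point is just that a token can exist in the vocabulary while showing up zero times in the text the model is trained on.

```python
from collections import Counter

# Hypothetical toy vocabulary; a real BPE vocabulary has tens of thousands of entries.
VOCAB = [" SolidGoldMagikarp", " distribute", " wealth", " the", " of"]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer (a simplified stand-in for BPE)."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in by_length:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # No vocabulary entry matches here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens

# The string gets its own dedicated token...
rare_tokens = tokenize(" SolidGoldMagikarp")

# ...but a (hypothetical) model training corpus never contains it.
model_corpus = " distribute the wealth of the wealth"
training_counts = Counter(tokenize(model_corpus))

print(rare_tokens)
print(training_counts[" SolidGoldMagikarp"])
```

With zero occurrences, nothing in training ever pushes that token's embedding toward a meaning of its own.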
Inner Misalignment in “Simulator” LLMs
I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.
“the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately”
I don’t think it has to be in service of predicting the current token. It sometimes gives lower loss to make a halfhearted effort at predicting the current token, so that the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn’t myopic.
As an example, induction heads make use of previous-token heads. The previous-token head isn’t actually that useful for predicting the output at its own position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.
So LMs won’t deliberately give bad predictions for the current token if they know a better prediction, but they aren’t putting all of their effort into finding that better prediction.
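To make the induction-head example concrete, here is a rough sketch of the two-step algorithm in plain Python. This is a sequential caricature of what the attention heads compute, not how a transformer implements it, and the function name is my own.

```python
def induction_predict(tokens):
    """Guess the next token by copying whatever followed the most recent
    earlier occurrence of the current token. Returns None if there is none."""
    if not tokens:
        return None

    # Step 1 ("previous-token head"): every position stores the token before it.
    # This does nothing for the prediction at that position; it is preparation.
    prev_token = [None] + tokens[:-1]

    # Step 2 ("induction head"): from the last position, look back for a
    # position whose stored previous token equals the current token, and
    # copy the token found at that position.
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            return tokens[pos]
    return None

print(induction_predict(["A", "B", "C", "A"]))  # B
print(induction_predict(["A", "B"]))            # None
```

Step 1 is the work done at earlier positions that only pays off later, which is the sense in which the allocation of effort isn't myopic.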
Correct me if I’m wrong:
The equilibrium where everyone follows “set dial to equilibrium temperature” (i.e. “don’t violate the taboo, and punish taboo violators”) is only a weak Nash equilibrium.
If one person instead follows “set dial to 99” (i.e. “don’t violate the taboo unless someone else does, but don’t punish taboo violators”) then they will do just as well, because the equilibrium temperature will still always be 99. That’s enough to show that it’s only a weak Nash equilibrium.
Note that this is also true if an arbitrary number of people deviate to this strategy.
If everyone follows this second strategy, then there’s no enforcement of the taboo, so there’s an active incentive for individuals to set the dial lower.
So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don’t seem trapped unless they have some extra laziness/inertia against switching strategies.
But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there’s an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
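Here is a small simulation sketch of the argument up to the noise tweak. The specifics are my assumptions, not something stated above: 100 players, dials between 30 and 100, temperature equal to the average dial, and punishers who set their dial to 100 for a round after anyone goes below 99. Lower total temperature is better for everyone.

```python
TABOO = 99     # the dial setting everyone is "supposed" to use
PUNISH = 100   # assumed punishment setting
LOWEST = 30    # assumed minimum dial setting

def punisher(violated_last_round):
    """Original strategy: follow the taboo, punish after any violation."""
    return PUNISH if violated_last_round else TABOO

def always_99(violated_last_round):
    """Second strategy: follow the taboo but never punish."""
    return TABOO

def always_low(violated_last_round):
    """A taboo violator who always sets the dial as low as possible."""
    return LOWEST

def average_temperatures(strategies, rounds=5):
    """Return the average dial setting for each round."""
    temps = []
    violated = False
    for _ in range(rounds):
        dials = [strategy(violated) for strategy in strategies]
        temps.append(sum(dials) / len(dials))
        violated = any(dial < TABOO for dial in dials)
    return temps

all_punishers = average_temperatures([punisher] * 100)
one_non_punisher = average_temperatures([punisher] * 99 + [always_99])
violator_unpunished = average_temperatures([always_99] * 99 + [always_low])
violator_punished = average_temperatures([punisher] * 99 + [always_low])

print(all_punishers, one_non_punisher)
print(violator_unpunished, violator_punished)
```

Under these assumptions, switching from punisher to non-punisher costs nothing (the weak equilibrium), a violator among non-punishers lowers the temperature every round, and a violator among punishers ends up with a higher total temperature than if they had complied.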