Alex Gray (née Alex Ray); much of my work is under that name. I’m interested in language model alignment, and especially techniques to get models to reason out loud.
A Ray
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
- the cheapest place to hide information (due to KL) is wherever the model already has high entropy (i.e. it is uncertain between many possible outputs); a minimal numerical sketch of this follows below
- optimization pressure will try to push this extra information into the cheapest places to hide
- the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
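As a minimal numerical sketch of the first claim (the distributions below are made up for illustration, not taken from any real model): nudging an already high-entropy next-token distribution costs much less KL than applying a comparable nudge to a near-deterministic one.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) in nats, for discrete distributions with strictly positive entries.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# A high-entropy position: the prior is uncertain between four tokens.
prior_high = np.array([0.25, 0.25, 0.25, 0.25])
tuned_high = np.array([0.31, 0.23, 0.23, 0.23])  # small nudge to encode a hidden signal

# A low-entropy position: the prior is nearly certain of one token.
prior_low = np.array([0.97, 0.01, 0.01, 0.01])
tuned_low = np.array([0.91, 0.07, 0.01, 0.01])   # a comparable nudge here is pricier

print(kl(tuned_high, prior_high))  # ~0.009 nats: cheap to hide at the high-entropy position
print(kl(tuned_low, prior_low))    # ~0.078 nats: roughly 8x more expensive at the low-entropy one
```

The same logic drives the second claim: under a KL penalty, gradient descent pays the least for extra information routed through the positions the prior was already unsure about.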
I think that’s right, but the moral maze will also be mediating the information and decision-making support available to the leadership, so they’re not totally immune from those influences
I think I understand what you’re saying, but I want to double check and try laying it out explicitly.
The “null space” is actually the null space of the prior model, which is human-like, but not exactly lined up with human perception of language
Because of this, the information isn’t “hidden” w.r.t. human perception; it’s “hidden” w.r.t. the language model prior
I think I agree with both of these points, but here’s my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just say “human”)
Empirically, drawing many samples results in lots of outputs that “contain similar high-level content according to a human (me)”
It might be worth studying this explicitly, such as setting up a bunch of chain-of-thought-reasoning-like scenarios, sampling a lot of generated reasoning, and then asking humans to rate how similar the generations are
My prediction is that people will think that they are largely similar
We could also use this same method to study human-measured relevance of other non-trivial sampling methods, like beam search, etc
“Hiding” w.r.t. language model prior is sufficient, in the limit of good language models
One of the things I think my reasoning relies on is that we will continue to produce better language models, which means the prior over time more and more closely resembles human priors
This suggests that one way to avoid this phenomenon is to only use smaller / less capable models—or at least they will do a worse job at hiding information. I currently expect that many research labs will continue to do chain of thought reasoning with state of the art models, though.
There’s a lot of room in the model distribution
Chain of thought reasoning is not currently, nor do I expect it will ever be, overconstrained.
Instead, it can be seen as a mechanism for dramatically increasing the space of possible outputs, both those seen during training, and those encountered at test time
Human preference effects
Research Campaign Effects: Human researchers will want human-like outputs, and thus can be expected to tweak/retrain/etc in order to get human-like reasoning in the chain of thought. This introduces an optimization pressure, separate from the KL penalty, towards the reasoning being human-like
Human preference models: I think an easy-to-predict future direction here is training reward models to match subjective human preferences and then optimizing with them or sampling with them (e.g. best of N; a sketch follows below); these preference models could also introduce additional optimization pressure towards human-like reasoning
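To make the best-of-N point concrete, here is a minimal sketch; `generate` and `preference_score` are hypothetical stand-ins for a sampling call and a learned human-preference reward model, not any particular library’s API. Any selection step like this adds optimization pressure towards whatever the preference model rewards, e.g. more human-like-looking reasoning.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],                 # hypothetical: sample one chain-of-thought completion
    preference_score: Callable[[str, str], float],  # hypothetical: learned human-preference reward model
    n: int = 16,
) -> str:
    # Sample n completions and keep the one the preference model scores highest.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: preference_score(prompt, c))
```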
I think the most important point in all of this is that I expect the model prior to be sufficiently close to the human prior—and I’d be interested in learning more about evidence or research to the contrary.
I think this is worth considering even in cases of things like recursive prompting for a couple of reasons that I haven’t organized:
In general additional compute time lets you improve your answer, so by default I’d expect answers to any significant question to take multiple steps
I agree that for trivial questions it might not do any of this. My expectation is that the impacts of AI deployments will mostly be on non-trivial questions or tasks
So far current research hasn’t done a great deal of regularizing or penalizing models for using a lot of thinking
Mostly they’ve optimized for “maximize the chance of correct answer”
This could change in the future, especially in cases where the model is expected to be deployed / otherwise sensitive to runtime costs
Steganography is rewarded over openness if the openness is somehow far from the prior for a given piece of information it wants to transmit
Caveat: this point is more speculative than the above document
One of my points is that due to null space + optimization pressure, it’s very easy and convenient for gradient descent to put information in these hidden dimensions
I think one of the things this can do well is handle concepts that don’t yet exist—and so reasoning moves like “invent a new term” or “give an illustrative example” or “invent a thought experiment” are lower likelihood under the prior
In these cases, it can be more likely for steganography to transmit certain information, if it is less expensive in terms of KL penalties
I do agree that for straightforward reasoning on straightforward evidence, where the reasoning is human-like and close to the prior, we can expect it to mostly reason out loud in a way that is “open”
Myopia—though less important than it was earlier in RFLO research—seemed like a good alignment criterion to aim for, and still does somewhat today
The original starting point for this direction of thinking was about Myopia / Non-Myopia in chain of thought reasoning systems
The sorts of guarantees I was hoping to get for chain of thought systems included myopia guarantees
Steganography came up as an example of how it could break myopia (other more complicated examples include side effects and situational awareness)
The concepts in this document came up by asking “how might steganography arise naturally given current research techniques and directions”
Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority.
However, given that, I think I agree that these are all possible mitigations to the phenomenon, in particular (rephrasing your points):
Reducing the channel capacity, and thus the amount of information that can be transmitted
Constraining the outputs, and thus reducing the degrees of freedom of the null space
Adding noise (via dropout or otherwise) also reduces the channel capacity, and probably also has some desirable side effects (like increasing the diversity or creativity of reasoning)
Introducing “barriers” past which optimization for hidden information becomes difficult (in your example, you use different models; I also think “stop gradients” in end-to-end training has similar effects, if weaker). A rough sketch of the noise and barrier points follows after this list
If you can detect/localize this phenomenon, you might be able to use interpretability-like techniques to limit or constrain it
I’m less certain about this, but it’s possible that future interpretability research will unlock this ability.
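Here is the rough sketch of the noise and barrier points from the list above; `reasoner`, `paraphraser`, and `answerer` are hypothetical model wrappers, and word-level dropout is just one crude way to add noise.

```python
import random
from typing import Callable

def answer_with_barrier(
    question: str,
    reasoner: Callable[[str], str],       # hypothetical model A: produces a chain of thought
    paraphraser: Callable[[str], str],    # hypothetical model B: restates the reasoning in its own words
    answerer: Callable[[str, str], str],  # hypothetical model C: answers from the laundered reasoning
    word_dropout: float = 0.1,
) -> str:
    thought = reasoner(question)
    # Noise: randomly drop a fraction of words, shrinking the channel capacity
    # available to any non-semantic encoding in the exact wording.
    kept = [w for w in thought.split() if random.random() > word_dropout]
    noised = " ".join(kept)
    # Barrier: a separate model restates the reasoning, so token-level patterns
    # chosen by the reasoner (where steganography would live) are unlikely to survive.
    laundered = paraphraser(noised)
    return answerer(question, laundered)
```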
Also I think an important point here is that we can expect to see it with fairly simple techniques—merely including sampled/generated data in the training set is sufficient, as opposed to requiring a complicated reinforcement learning algorithm like MuZero.
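As a hedged sketch of how simple that setup is (`generate`, `grade`, and `finetune` are all hypothetical stand-ins): sampling from the model, keeping the completions that score well, and fine-tuning on them already creates the relevant optimization pressure, with no MuZero-style search required.

```python
from typing import Callable, List

def self_training_round(
    prompts: List[str],
    generate: Callable[[str], str],         # hypothetical: sample a chain of thought + answer
    grade: Callable[[str, str], float],     # hypothetical: task score for a (prompt, completion) pair
    finetune: Callable[[List[str]], None],  # hypothetical: supervised fine-tune on text
    keep_top_fraction: float = 0.25,
) -> None:
    # One round of "include sampled/generated data in the training set".
    samples = [(p, generate(p)) for p in prompts]
    samples.sort(key=lambda pc: grade(*pc), reverse=True)
    keep = samples[: max(1, int(len(samples) * keep_top_fraction))]
    # Fine-tuning on the kept samples reinforces whatever token-level patterns helped
    # them score well, including patterns a human reader would never notice.
    finetune([prompt + completion for prompt, completion in keep])
```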
Did you publish your proposal? I’d be interested in reading it.
Agree that founders are a bit of an exception. Actually that comes up a bit in the longer version of this that I give when I talk about it in person.
Basically: “The only people at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes.” So my strategic corollary to this is that it’s probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often.
In the case of Facebook, even in the face of all of their history of actions, I think on the margin I’d prefer the founder, rather than the median replacement, to be leading the company.
(Edit: I don’t think founders remaining at the head of a company is evidence that the company isn’t a moral maze. Also I’m not certain I agree that Facebook’s pivot couldn’t have been done by a moral maze.)
Steganography in Chain of Thought Reasoning
Why I Am Skeptical of AI Regulation as an X-Risk Mitigation Strategy
Thanks, fixed the link in the article. Should have pointed here: https://www.lesswrong.com/posts/dhj9dhiwhq3DX6W8z/hero-licensing
My advice on finding your own path
I think there should be a norm about adding the big-bench canary string to any document describing AI evaluations in detail, where you wouldn’t want it to be inside that AI’s training data.
Maybe in the future we’ll have a better tag for “don’t train on me”, but for now the big-bench canary string is the best we have.
This is in addition to things like “maybe don’t post it to the public internet” or “maybe don’t link to it from public posts” or other ways of ensuring it doesn’t end up in training corpora.
I think this is a situation for defense-in-depth.
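As one concrete defense-in-depth layer, here is a minimal sketch of the filtering side (the marker below is a loose heuristic standing in for the canonical canary sentence; the exact string and GUID are in the BIG-bench docs):

```python
# Loose heuristic stand-in: the real BIG-bench canary is a fixed sentence containing a
# specific GUID; see the BIG-bench repo for the canonical string to match against.
CANARY_MARKER = "canary GUID"

def filter_training_documents(documents):
    # Drop any document carrying a canary marker before it reaches a training corpus.
    return [doc for doc in documents if CANARY_MARKER not in doc]
```

This only helps if the people assembling training corpora actually run a check like this, which is why it is one layer among several rather than a complete solution.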
More Ideas or More Consensus?
I think one aspect you can examine about a scientific field is the “spread”-ness of its ideas and resources.
High energy particle physics is an interesting extreme here—there’s broad agreement in the field about building higher energy accelerators, and this means there can be lots of consensus about supporting a shared collaborative high energy accelerator.
I think a feature of mature scientific fields is that “more consensus” can unlock more progress. Perhaps if there had been more consensus, the otherwise ill-fated Superconducting Super Collider would have worked out. (I don’t know if other extenuating circumstances would still have prevented it.)
I think a feature of less mature scientific fields is that “more ideas” (and less consensus) would unlock more progress. In this case, we’re more limited in generating and validating new good ideas. One way this shows up is that there’s not a lot of confidence about what to do with large sums of research funding, and instead we think our best bet is making lots of small bets.
My field (AI alignment) is a less mature scientific field in this way, I think. We don’t have a “grand plan” for alignment which just needs funding. Instead we have a fractal of philanthropic organizations empowering individual grantmakers to try to get small and early ideas off the ground with small research grants.
A couple thoughts, if this model does indeed fit:
There’s a lot more we could do to orient as a field around “the most important problem is increasing the rate of coming up with good research ideas”. In addition to being willing to fund lots of small and early stage research, I think we could factorize and interrogate the skills and mindsets needed to do this kind of work. It’s possible that this is one of the most important meta-skills we need to improve as a field.
I also think this could be more of a priority when “field building”. When recruiting or trying to raise awareness of the field, it would be good to consider more focus or priority on places where we expect to find people who are likely to be good generators of new ideas. I think one of the ways this looks is to focus on more diverse and underrepresented groups.
Finally, at some point it seems like we’ll transition to “more mature” as a field, and it’s good to spend some time thinking about what would help that go better. Understanding the history of other fields making this transition, and trying to prepare for predicted problems/issues would be good here.
AGI will probably be deployed by a Moral Maze
Moral Mazes is my favorite management book ever, because instead of “how to be a good manager” it’s about “empirical observations of large-scale organizational dynamics involving management”.
I wish someone would write an updated version—a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.
My take (and the author’s take) is that any company of nontrivial size begins to take on the characteristics of a moral maze. It seems to be a pretty good null hypothesis—any company saying “we aren’t/won’t become a moral maze” has a pretty huge evidential burden to cross.
I keep this point in mind when thinking about strategy for when it comes time to make deployment decisions about AGI, and to deploy AGI. These decisions are going to be made within the context of a moral maze.
To me, this means that some strategies (“everyone in the company has a thorough and complete understanding of AGI risks”) will almost certainly fail. I think only strategies that work well inside of moral mazes will work at all.
To sum up my takes here:
basically every company eventually becomes a moral maze
AGI deployment decisions will be made in the context of a moral maze
understanding moral maze dynamics is important to AGI deployment strategy
(Caveat: I ran the first big code scrape and worked on the code generating models which later became codex.)
My one line response: I think opt-out is obviously useful and good and should happen.
AFAIK there are various orgs/bodies working on this, but I’m kinda blanking on what/where. (In particular there’s a FOSS mailing list that’s been discussing how ML training relates to FOSS license rights that seems relevant.)
Opt-out strings exist today, in an insufficient form. The most well known and well respected one is probably the big-bench canary string: https://github.com/google/BIG-bench/blob/main/docs/doc.md—but this is just intended to protect data used for evaluating text models.
Mimicking the structure to comment on each point:
Simplicity
I think simplicity is points in favor of cheapness, but not points (directly) in favor of why something “should be done”. I see this as “technical costs to implement are low”, and agree.
Competitiveness
I think this also is points in favor of cheapness, but again not why it “should be done”. I see this as “expected reduction in ML perf is small”, and agree.
Ethics
I think this makes the point that we currently don’t have a settled understanding of what the ethics of the various options are here. People being upset at the state of things is pretty strong evidence that it’s not settled, but seems to be weaker evidence that it’s unethical. I can’t tell whether the point you’re trying to make here is that “we should figure out the ethics of opt-out” (which I agree with) or that “opt-out is ethically required” (which I don’t think you’ve sufficiently supported here for me to agree with).
Risk
I see this as making the point “opt-out would (very minorly) reduce AI risk”. I think this is both well supported by the arguments and technically valid. I’m personally skeptical about the amount of protection this gets us, and am mostly optimistic in applying it to non-software domains (e.g. nanotech, gain of function, virology, etc).
A personal technical prediction I can add: I think that in the software domain, it will be inexpensive for a capable system to compose any non-allowed concepts out of allowed concepts. I think this is non-obvious to traditional ML experts. In traditional ML, removing a domain from the dataset usually robustly removes it from the model—but things like the large-scale generative models mentioned at the top of the post have generalized very well across domains. (They’re still not very capable in-domain, but are similarly not-capable in domains that didn’t exist in training.) I think this “optimism about generalization” is the root of a bunch of my skepticism about domain-restriction/data-censoring as a method of restricting model capabilities.
Precedent
I think the robots.txt example is great and basically this is the one that is most directly applicable. (Other precedents exist but IMO none are as good.) I totally agree with this precedent.
Separately, there’s a lot of precedent for people circumventing or ignoring these—and I think it’s important to look at those precedents, too!
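For concreteness on the robots.txt precedent, here is a minimal sketch using Python’s standard-library robots.txt parser (the user-agent name is made up); whether a scraper actually runs a check like this is exactly the circumvention question.

```python
from urllib import robotparser
from urllib.parse import urlsplit

def allowed_to_fetch(url: str, user_agent: str = "example-ml-scraper") -> bool:
    # Consult the site's robots.txt before scraping, mirroring the web-crawler precedent.
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    # Respecting the answer is voluntary, which is the circumvention concern above.
    return rp.can_fetch(user_agent, url)
```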
Risk Compensation
This is an interesting point. I personally don’t weigh this highly, and feel like a lot of my intuition here is attached to gut-level stuff.
As far as I know, the literature on risk compensation is almost entirely about things that are direct personal risk to someone. I don’t know of any cases of risk compensation where the risk was indirect or otherwise largely separated from the person. (At some point of indirectness this seems to reduce more to a “principal-agent problem” than a risk-compensation problem)
What’s Missing
I think it’s easy to focus on the technical implementation costs and less on the “what happens next” costs. Figuring out the legal status of this opt-out (and possibly pushing for legislation to change this) is difficult and expensive. Figuring out standards for evaluation will be similarly hard, especially as the tech itself changes rapidly.
Personal Conclusion
I think opt-out is obviously good and useful and should be done. I think it’s a pretty clear positive direction for ML/AI policy and regulatory development—and also I’m optimistic that this is the sort of thing that will happen largely on its own (i.e. no drastic action is required).
Sometimes I get asked by intelligent people I trust in other fields, “what’s up with AI x risk?”—and I think at least part of it unpacks to this: Why don’t more people believe in / take seriously AI x-risk?
I think that is actually a pretty reasonable question. I think two follow-ups are worthwhile and I don’t know of good citations / don’t know if they exist:
a sociological/anthropological/psychological/etc study of what’s going on in people who are familiar with the ideas/reasonings of AI x-risk, but decide not to take it seriously / don’t believe it. I expect in-depth interviews would be great here.
we should probably just write up as many of the obvious things as we can ourselves, up front.
The latter one I can take a stab at here. Taking the perspective of someone who might be interviewed for the former:
historically, ignoring anyone that says “the end of the world is near” has been a great heuristic
very little of the public intellectual sphere engages with the topic
the part of the public intellectual sphere that does engage is disproportionately meme lords
most of the writings about this are exceptionally confusing and jargon-laden
there’s no college courses on this / it doesn’t have the trappings of a legitimate field
it feels a bit like a Pascal’s mugging—at the very least I’m not really prepared to try to think about actions/events with near-infinite consequences
people have been similarly doom-y about other technologies and so far the world turned out fine
we have other existential catastrophes looming (climate change, etc) that are already well understood and scientifically supported, so our efforts are better put on that than this confusing hodge-podge
this field doesn’t seem very diverse and seems to be a bit monocultural
this field doesn’t seem to have a deep/thorough understanding of all of the ways technology is affecting people’s lives negatively today
it seems weird to care about future people when there are present people suffering
I see a lot of public disagreement about whether or not AGI is even real, which makes the risk arguments feel much less trustworthy to me
I think I’m going to stop for now, but I wish there were a nice high-quality organization of these. At the very least, having the steel version of them seems good to have around, in part as an “epistemic hygiene” thing.
Thanks so much for making this!
I’m hopeful this sort of dataset will grow over time as new sources come about.
In particular, I’d nominate adding MLSN (https://www.alignmentforum.org/posts/R39tGLeETfCZJ4FoE/mlsn-4-many-new-interpretability-papers-virtual-logit) to the list of newsletters in the future.
This seems like an overly alarmist take on what is a pretty old trend of research. Six years ago there were a number of universities working on similar models for the VizDoom competition (IIRC won by Intel and Facebook). It seems good to track this kind of research, but IMO the conclusions here are not supported at all by the evidence presented.
Do you have suggestions for domains where you do expect one-turn debate to work well, now that you’ve got these results?
Congratulations! Can you say if there will be a board, and if so who will start on it?
I think at this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
However I think you’re missing that you don’t have the ability to directly compare to a reference LM in cases where you’re training to improve on some performance benchmark. During training the model will change its predictions on everything to some degree (both semantic and non-semantic content).
So your proposed detection will always show some amount of disagreement between the prior and the trained model, on weird grammatical patterns as well as on conceptual tokens. The question is: “is the difference merely due to the changes that improve performance, or is it also transmitting hidden information?”
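To make that concrete, here is a rough sketch (the `ref_logprobs` / `tuned_logprobs` callables are hypothetical interfaces returning full next-token log-probability vectors): per-token KL between the reference LM and the fine-tuned model will be nonzero almost everywhere, on grammatical tokens as well as conceptual ones, so nonzero divergence by itself doesn’t answer that question.

```python
import numpy as np
from typing import Callable, List

def per_token_kl(
    tokens: List[int],
    ref_logprobs: Callable[[List[int], int], np.ndarray],    # hypothetical: log p_ref(. | prefix) at position i
    tuned_logprobs: Callable[[List[int], int], np.ndarray],  # hypothetical: log p_tuned(. | prefix) at position i
) -> np.ndarray:
    # KL(tuned || ref) at each position of a sampled chain of thought.
    kls = []
    for i in range(len(tokens)):
        lp_ref = ref_logprobs(tokens, i)
        lp_tuned = tuned_logprobs(tokens, i)
        p_tuned = np.exp(lp_tuned)
        kls.append(float(np.sum(p_tuned * (lp_tuned - lp_ref))))
    # Expect this to be positive almost everywhere after fine-tuning; the hard part is
    # attributing how much is task improvement and how much is hidden information.
    return np.array(kls)
```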