RogerDearnaley comments on Untrusted smart models and trusted dumb models

RogerDearnaley 7 Nov 2023 23:31 UTC
“E.g. perhaps pretrained LLMs have no chance of being deceptively aligned”
Pretrained LLMs are trained to simulate the generation of human-written text from the internet. Humans are frequently deceptively aligned. For example, at work I make a point of seeming (mildly) more aligned with my employer’s goals than I actually am, just like everyone else working for the company. So a sufficiently capable pretrained LLM will inevitably have already picked up deceptive alignment behavior from learning to simulate humans; it doesn’t need to be capable enough to figure out deceptively aligned behavior for itself during a forward pass.
This is a specific example of a general problem with LLMs: they don’t need to be capable enough to discover unaligned convergent goals for themselves if they can have these pretrained into them by learning to simulate us. To give another worrying example: on the order of 2% of humans are sociopaths, so there is going to be a nontrivial admixture of writing by sociopaths (mostly ones actively concealing this fact) in the training set of any pretrained LLM. [The moral is: always read the ingredients label for your shoggoth :-)]