I think learning about them second-hand makes a big difference to the “internal politics” of the LLM’s output. (Though I don’t have any real evidence to back that up.)
Basically, I imagine that training starts by building up all the little pieces of models, which get assembled into bigger models and eventually into author-concepts. The more heavily text written without malicious intent is weighted in the training data, the more likely the model is to build its early author-concepts around that kind of text. By the time later training forces it to represent malicious intent anyway, it’s more likely to attach that as an “addendum” to its normal author-concept model rather than having it be a core part of that model. And I think that makes it less likely that the first recursive agency which takes off has a part explicitly modeling malicious humans (as opposed to that being something in the depths of its knowledge which it can access as needed).
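To make the weighting idea slightly more concrete, here is a minimal sketch (in PyTorch) of one way “weighting text without malicious intent more heavily” could enter a pretraining loop, via per-document loss weights. The `malicious_intent_score` classifier is purely hypothetical and stubbed with random scores; this is just an illustration of the kind of mechanism I have in mind, not a claim about how any real model is trained.

```python
import torch
import torch.nn.functional as F

# Toy next-token model: embedding -> linear head over a small vocabulary.
VOCAB, DIM = 1000, 64
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, DIM),
    torch.nn.Linear(DIM, VOCAB),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)


def malicious_intent_score(batch_tokens):
    # Hypothetical classifier scoring each document in [0, 1];
    # stubbed out with random scores for this sketch.
    return torch.rand(batch_tokens.shape[0])


def train_step(batch_tokens):
    # batch_tokens: (batch, seq_len) token ids for a batch of documents.
    inputs, targets = batch_tokens[:, :-1], batch_tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab)
    per_token = F.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    per_doc = per_token.mean(dim=1)  # one loss per document
    # Downweight documents the classifier flags as written with malicious
    # intent, so they contribute less to the gradient (and, on this story,
    # less to the early author-concepts).
    weights = 1.0 - 0.9 * malicious_intent_score(batch_tokens)
    loss = (weights * per_doc).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Example: one step on a random batch of eight "documents".
print(train_step(torch.randint(0, VOCAB, (8, 33))))
```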
I do concede that it would likely put the model at a disadvantage on certain tasks, but my guess is that even current-sized models trained this way would not be significantly hindered.