current AIs need to understand at a very early stage what human concepts like “helpfulness,” “harmlessness,” and “honesty” mean. And while it is of course possible to know what these concepts mean without being motivated by them (cf “the genie knows but doesn’t care”), the presence of this level of human-like conceptual understanding at such an early stage of development makes it more likely that these human-like concepts end up structuring AI motivations as well. … AIs will plausibly have concepts like “helpfulness,” “harmlessness,” “honesty” much earlier in the process that leads to their final form … [emphasis added]
I want to nitpick this particular point (I think the other arguments you bring up in that section are stronger).
For example, LLaMa 3.1 405B was trained on 15.6 trillion tokens of text data (≈ what a human could get through in 20,000 years of 24/7 reading). I’m not an ML training expert, but intuitively I’m skeptical that this is the kind of regime where we need to be thinking about what is hard versus easy to learn, or about what can be learned quickly versus slowly.
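(For concreteness, here is the rough arithmetic behind that comparison, run backwards: given 15.6T tokens and 20,000 years, what reading pace is implied? The ~0.75 words-per-token ratio is my own assumption, not a figure from the Llama 3.1 report, so treat the result as order-of-magnitude only.)

```python
# Back-of-envelope check on "15.6T tokens ≈ 20,000 years of 24/7 reading".
# Assumption (mine, not from the Llama 3.1 report): ~0.75 English words per token.
TOKENS = 15.6e12
YEARS = 20_000

minutes = YEARS * 365.25 * 24 * 60      # minutes of nonstop reading
tokens_per_min = TOKENS / minutes       # ~1,500 tokens/minute
words_per_min = tokens_per_min * 0.75   # ~1,100 words/minute (a very fast reader)

print(f"Implied pace: ~{words_per_min:,.0f} words/min, 24/7, for {YEARS:,} years")
```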
Instead, my guess is that, if [latent model A] is much easier and faster to learn than [latent model B], but [B] gives a slightly lower predictive loss than [A], then 15.6 trillion tokens of pretraining would be WAY more than enough for the model-in-training to initially learn [A] but then switch over to [B].
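To illustrate the kind of dynamic I have in mind, here is a toy sketch (not a claim about what actually happens inside LLM pretraining; the features, feature scales, learning rate, and step counts are all arbitrary choices of mine): plain gradient descent fits an "easy" feature almost immediately and plateaus at a decent loss, then, given enough further steps, picks up a "hard" feature that yields a slightly lower loss.

```python
import numpy as np

# Toy example: an "easy" feature (latent model A) is fit within ~100 steps;
# a "hard" feature (the extra piece of latent model B) has tiny scale, hence
# tiny gradients, and is only learned after tens of thousands of steps --
# but it is still where the optimizer ends up, because it lowers the loss.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = x + 0.1 * np.sin(5 * x)             # target: mostly linear, plus a small wiggle

f_easy = x                              # large-scale feature: fast to learn
f_hard = 0.01 * np.sin(5 * x)           # small-scale feature: slow to learn
X = np.stack([f_easy, f_hard], axis=1)

w = np.zeros(2)
lr = 0.5
for step in range(1, 200_001):
    resid = y - X @ w
    w += lr * 2 * X.T @ resid / len(y)  # full-batch gradient descent on MSE
    if step in (100, 1_000, 200_000):
        mse = np.mean((y - X @ w) ** 2)
        print(f"step {step:>7}: w={np.round(w, 3)}, mse={mse:.5f}")

# Typical behaviour: by step 100 the easy (linear) fit is essentially done and
# the loss plateaus near ~0.005; only after tens of thousands of steps does the
# hard feature's weight reach ~10, recovering the 0.1*sin(5x) term and pushing
# the loss toward zero.
```

The analogy is loose, but the point is that "easier/faster to learn" only determines what the model looks like early on; with a vast excess of training steps and data, what matters is which solution the loss ultimately favors.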