To what degree will the goals / values / preferences / desires of future AI agents depend on dispositions learned in the weights, and to what degree will they depend on instructions and context?
To what degree will unintended AI goals be learned in training vs. developed in deployment?
I’m sort of curious what actually inclines you towards “learned in training”, given that (as you say) the common-sense notion of an LLM’s goal seems to point to something specified in the prompt.
Like, even if we have huge levels of scale-up, why would we expect this to switch the comparative roles of (weights, context, sampler, scaffolding) and give the weights a role in the future that they don’t have now? What actually moves you in that direction?