You can’t teach an LLM to play good Go using in-context learning, or any number of memory notebooks. That’s the limitation that distinguishes in-context learning from learning of deep skills with RLVR.
Why not? I agree that, currently, the types of skills you can pack into memory notebooks are shallow. But I don't see an in-principle bound on the depth of skills here.
Like, if humans had had omnipresent notebooks throughout our evolutionary history, and had to rely on them whenever we wanted to learn something new, I bet we would have gotten really, really good at using notebooks. Maybe good enough that we could glance at them and load quite complicated skills into memory.
If labs are already using long-horizon RL where the model creates memory notes and goes through many steps of compactification in a single rollout, I will significantly update. Do you know how much this is done?
I think creating a distinction between morality and self-interest is somewhat anthropomorphic and based on historical/biological contingencies of humans.
Consider AIXI, or some reflective/embedded version of AIXI, with an aligned value function.
When presented with a question like “Why should you maximize your value function?”, AIXI will think to itself, “Hmm, what should I answer to maximize my value function?” (The answer to that is not “I guess I should give up my value function and do something else instead.”)
This illustrates that the failure mode you’re talking about is not insurmountable in principle. But I grant that modern LLMs are more similar to humans than AIXI is, in many regards.
But when you say
I do think this is what people are already aiming for.
I think alignment is commonly conceptualized (this is certainly how I conceptualize it) not as an external set of rules imposed on an agent, which it has reasons to follow, but rather as a set of desires which the agent pursues for their own sake.