Given my above reply to james.lucassen about explicitly using a regressor LLM as a reward model, does that give better insight?
Or are you skeptical of the AI’s mapping from “world state” into language? I’d argue that we might get away with having the AI natively define its world state as language, a la SayCan.
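To make the proposal concrete, here is a toy sketch of a "regressor LLM as reward model" where the agent's world state is already natural language, so no separate mapping step is needed. Everything here is hypothetical: `llm_regress` is a trivial keyword stand-in for a real LLM fine-tuned to output a scalar utility from a textual state description.

```python
def llm_regress(state_description: str) -> float:
    """Hypothetical stand-in for an LLM regression head that reads a
    natural-language world state and outputs a scalar utility.
    Here it is just a keyword heuristic, for illustration only."""
    score = 0.0
    if "flourishing" in state_description:
        score += 1.0
    if "deception" in state_description:
        score -= 1.0
    return score


def reward(world_state: str) -> float:
    # SayCan-style assumption: the agent natively represents its world
    # state as language, so "world state" -> text is the identity map.
    return llm_regress(world_state)


plan_a = "robot fetches water, supports human flourishing"
plan_b = "robot achieves the goal via deception"
assert reward(plan_a) > reward(plan_b)
```

The optimization-pressure worry from the rest of this thread applies directly: a planner searching over descriptions would exploit whatever adversarial inputs make the regressor misfire, which is exactly the OOD/adversarial-example concern.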
On further reflection, I have no idea what I mean. I'm as confused as you are about why this is hard if we have an accurate utility function sitting right there. Maybe the idea is that it would fail under optimization pressure?
Yeah so I think that’s what the adversarial example/OOD people worry about.
That just seems… like it buys you a lot? And like we should focus more on those problems specifically.
The problem is how you incorporate that understanding into an optimization process, not necessarily how you get an AI to understand those values.