Interesting speculation, but I’d like to see you do some math to check if the premise actually works. That is, is a gremlin-free LLM under RL ever unstable to the formation of a gremlin that tends to keep itself activated at a slight cost in expected reward?