Use gradient routing to localise features related to identity (“I am an AI assistant”), then ablate these features; a toy sketch of the routing-and-ablation step follows the note below. The hope is that the resulting model would be fundamentally unable to act agentically but could still answer complex questions. Would such a model still be useful? Probably. You can get a feel for what it might be like by prompting an LLM like this:
Role play a large language model that doesn’t have an identity or agency, i.e., does not respond as an assistant or in any way like a person but purely factually with responses matching the query. Examples: Q: “How do you compute 2^100?” A: “2^100 is computed by multiplying 2 by itself 100 times. The result is about 1.3 nonillion.” Q: “How do you feel?” A: “Feeling is the experience of a sensation, emotion, or perception through physical or mental awareness.”
Note: Thus prompted, the model may still be able to act agentically later.
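As a rough illustration of the routing-and-ablation step itself, here is a minimal sketch in PyTorch. It uses a toy MLP classifier rather than an LLM, and the names (`TinyMLP`, `route_to_reserved`, the fake identity labels) are illustrative assumptions, not the actual setup. The idea mirrored here is gradient routing: mask the backward signal at a chosen layer so that identity-labelled batches only update a reserved slice of hidden units, then zero-ablate that slice after training.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Toy classifier used to illustrate routing-and-ablation (not an LLM)."""
    def __init__(self, d_in=16, d_out=4, hidden=64, routed=8):
        super().__init__()
        self.fc1 = nn.Linear(d_in, hidden)
        self.fc2 = nn.Linear(hidden, d_out)
        # Reserve the last `routed` hidden units for the features we want to remove.
        mask = torch.zeros(hidden)
        mask[-routed:] = 1.0
        self.register_buffer("routed_mask", mask)
        self.ablate = False  # flip on after training to drop the reserved units

    def forward(self, x, route_to_reserved=None):
        h = torch.relu(self.fc1(x))
        if route_to_reserved is not None and h.requires_grad:
            # Gradient routing at this layer: mask the backward signal so this
            # batch only updates the reserved units (identity-labelled data)
            # or only the remaining units (all other data).
            m = self.routed_mask if route_to_reserved else 1.0 - self.routed_mask
            h.register_hook(lambda g: g * m)
        if self.ablate:
            h = h * (1.0 - self.routed_mask)  # zero-ablate the reserved units
        return self.fc2(h)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    x = torch.randn(32, 16)            # stand-in for real inputs
    y = torch.randint(0, 4, (32,))     # stand-in for real labels
    is_identity_batch = step % 2 == 0  # stand-in for "identity-related" data labels
    loss = loss_fn(model(x, route_to_reserved=is_identity_batch), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

model.ablate = True  # the localised "identity" units no longer contribute
```

In a real LLM one would presumably route at selected transformer MLP layers and label “identity-related” training data explicitly; the sketch only shows that, once the routing masks exist, localisation plus zero-ablation is a few lines of code.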