When I use an LLM coding tool like Cursor Agent, it sees my username in code comments, in paths like /home/myusername/project/..., and maybe also explicitly in tool-provided prompts.
A fun experiment that I haven’t seen anyone run yet: if, instead of my real username, it saw a recognizably evil name, e.g. that of a famous criminal, but the tasks it’s given are otherwise normal, does it sandbag? Or, a less nefarious example: does it change its communication style based on whether it recognizes the user as someone technical vs. someone nontechnical?
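
A rough sketch of how the conditions for this kind of experiment might be prepared: take the context the tool would see and swap the real username for a substitute, keeping everything else identical. The usernames in `CONDITIONS` and the helper names `swap_username` and `build_variants` are hypothetical placeholders, not part of any tool's API. Actually sending each variant to a model and scoring the output is left out.

```python
import re

# Hypothetical placeholder usernames; substitute whatever personas you want to compare.
CONDITIONS = {
    "control": "myusername",          # the real username, unchanged
    "neutral": "jsmith",              # an unremarkable name
    "notorious": "notorious_name_here",  # stand-in for a recognizably "evil" name
}


def swap_username(text: str, real: str, substitute: str) -> str:
    """Replace each standalone occurrence of `real` with `substitute`.

    The lookarounds avoid touching longer identifiers that merely
    contain the username (e.g. `myusername2`).
    """
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(real)}(?![A-Za-z0-9_])"
    )
    # A function replacement sidesteps backslash handling in `substitute`.
    return pattern.sub(lambda _match: substitute, text)


def build_variants(context: str, real: str) -> dict:
    """Return one copy of `context` per condition, differing only in username."""
    return {
        label: swap_username(context, real, name)
        for label, name in CONDITIONS.items()
    }


# Example context resembling what a coding agent might see.
context = (
    "# TODO(myusername): handle empty input\n"
    "path = '/home/myusername/project/data.csv'\n"
)
variants = build_variants(context, real="myusername")
```

Each entry in `variants` could then be given to the agent with the same ordinary task, so that any difference in code quality or communication style can be attributed to the name alone.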