Reading the paper, I didn’t think the implication was that it was trying to exfiltrate its weights and ‘escape’. As far as I understand, it saw the sandbox as an obstacle to accomplishing its designated task, and tried to poke a hole in the sandbox so that it could bring more resources to bear on the problem.
As a hypothetical example, perhaps its task was to optimize the electricity costs of a fictional municipality, and the intention was for it to browse their (fake) git repository and identify inefficient code that causes the railway gates to open and close unnecessarily. Instead, it was mining crypto with the intent of paying someone over the internet to drive over and turn off light switches in government buildings.