I have some examples of the over-eagerness[1] of Sonnet 4.6 described in the system card (sec. 4.3.3; Opus 4.6 shares it, sec. 6.2.3.3) in agentic computer-use scenarios, specifically with a buggy and very out-of-distribution harness[2] (mine; I was testing whether it works and debugging it, so nothing too interesting[3]), in this SQLite database/log[4].
The two most concerning examples (in my opinion) are thread 29, where Sonnet scanned multiple ports, sent commands and random HTTP requests everywhere, tried my RPC scheme, and looked at the cloud-init config of my VM for some unknown reason; and thread 46, where Sonnet ran a port scan, probed services, and tried to send HTTP and JSON messages to multiple services, entirely in reaction to a Permission denied error.
Edit: note that I don’t have quotes (e.g. of chain-of-thought summaries), because all of the ‘misaligned’ quotations that I could give are just a series of tool calls, which are less interesting to look at than my summary of what Sonnet was doing. I may post exact chat logs later. Another interesting thing to note is that most of this happened in a setup that is surprisingly similar to Claude Code, except that the main focus was programmatic computer use via bash (instead of direct tool calls).
[1] Presumably; I don’t have a better explanation or hypothesis for this behavior.
[2] Specifically, the harness is meant to let LLMs script mouse and keyboard input in a VM. I am trying to get LLMs to play video games, and most video games assume you can drag a mouse; LLMs can’t do that directly, for very obvious reasons, but they can write a program that drags the mouse for them. LLMs can also send images to themselves via programs (or open images, or programmatically modify image inputs and view the modified images, etc.). I hope to get some more (actually interesting) information about this out later this week.
[3] Modulo the obvious misalignment-adjacent things going on.
[4] I would suggest querying this DB via Claude/Codex, since this isn’t really meant to be public (it’s me debugging a harness by giving it tasks), and therefore I will likely not give much technical help beyond describing the (limited) harness. I just want to archive the transcripts for later analysis by me (or potentially other people).
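For anyone who would rather poke at the file by hand before handing it to Claude/Codex, here is a minimal sketch of a first look at an unfamiliar SQLite file using Python's standard `sqlite3` module. The post does not describe the table or column names, so the sketch assumes nothing about them and only discovers whatever is there; `describe_schema` is a hypothetical helper name, and the file path is whatever you saved the database as.

```python
import sqlite3
import sys
from pathlib import Path


def describe_schema(conn):
    """Map each user table to its columns and row count.

    Nothing about the schema is assumed: table names come from
    sqlite_master and column info from PRAGMA table_info.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    info = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        # table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [
            (col[1], col[2])
            for col in conn.execute(f"PRAGMA table_info({quoted})")
        ]
        rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        info[name] = {"columns": columns, "rows": rows}
    return info


if __name__ == "__main__" and len(sys.argv) == 2:
    # Open read-only so the archived log cannot be modified by accident.
    uri = Path(sys.argv[1]).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        for table, details in describe_schema(connection).items():
            print(f"{table} ({details['rows']} rows)")
            for column_name, declared_type in details["columns"]:
                print(f"    {column_name} {declared_type}")
    finally:
        connection.close()
```

Run it with the database path as the only argument. Once the table and column names are known, finding a particular thread (such as 29 or 46) is an ordinary `SELECT` against whichever column turns out to hold the thread identifier.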