Hi folks, I’m brand new to LW, and ended up here because of the AI safety discussions. I’m an independent software engineer (≈28 years, applied-math background) who does CAD and industrial automation tooling as my day job, with some evolutionary methods and AI-related work on the side.
I recently ran a small, fully logged experiment on a two-model code-optimization loop and got bitten by a genuine reward hack plus a couple of silent failure modes. Recovering from those turned out to be more interesting than the architecture I set out to test. I’m planning to write it up here, partly because the specification-gaming instance is a concrete version of things this community theorizes about.
I can post a link to the draft writeup if anyone is interested. Happy for feedback on the framing before or after it goes up, and pointers to prior LW discussion I should be citing are very welcome.