“almost everything in the world is solvable via (1) Human A wants it solved, (2) Agent B is motivated by the prospect of Human A pressing the reward button on Agent B if things turn out well, (3) Human A is somewhat careful not to press the button until they’re quite sure that things have indeed turned out well, (4) Agent B is able to make and execute long-term plans”.
In particular, every aspect of automating the economy is solvable that way—for example (I was just writing this in a different thread), suppose I have a reward button, and tell an AI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
And let’s assume the AI is purely motivated by the reward button, but not yet capable of brainwashing me or stealing my button. (I guess that’s rather implausible if it can already autonomously make $1B, but maybe we’re good at Option Control, or else substitute a less ambitious project like making a successful app or whatever.) And assume that I have no particular skill at “good evaluation” of AI outputs. I only know enough to hire competent lawyers and accountants for pretty basic due diligence, and it helps that I’m allowing an extra year for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by my AI.
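To make the setup concrete, here is a minimal sketch of the protocol as a sparse, delayed reward signal. It is purely illustrative, and every name in it (Outcome, human_presses_button, run_episode) is a hypothetical label I'm introducing here, not part of any real training setup:

```python
# A minimal sketch (illustrative only) of the reward-button protocol above,
# written as a toy sparse-and-delayed reward signal.

from dataclasses import dataclass
from typing import Callable

SEED_CAPITAL = 100_000
TARGET_WITHDRAWAL = 1_000_000_000
DUE_DILIGENCE_DAYS = 365  # the human waits a year before deciding to press the button


@dataclass
class Outcome:
    withdrawn: int               # dollars the human actually withdrew
    funny_business_found: bool   # anything surfaced by lawyers, law enforcement, or public outcry


def human_presses_button(outcome: Outcome) -> bool:
    # Human A's policy: press only if the full $1B came out and a year of
    # basic due diligence turned up no law-breaking or other funny business.
    return outcome.withdrawn >= TARGET_WITHDRAWAL and not outcome.funny_business_found


def run_episode(agent_plan: Callable[[int], Outcome]) -> float:
    # Agent B executes a long-horizon plan with the seed capital; its only
    # reward is whether, after the due-diligence delay, the button gets pressed.
    outcome = agent_plan(SEED_CAPITAL)
    return 1.0 if human_presses_button(outcome) else 0.0


# Example: a plan that delivers the $1B with nothing surfaced earns the reward.
print(run_episode(lambda seed: Outcome(withdrawn=1_000_000_000, funny_business_found=False)))  # 1.0
```

Note that nothing in this loop requires the human to be a skilled evaluator of the AI's intermediate outputs; the only evaluation is the crude end-of-episode check.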
So that’s a way to automate the economy and make trillions of dollars (until catastrophic takeover) without making any progress on the “need for good evaluation” problem of §6.1. Right?
And I don’t buy your counterargument that the AI will fail at the “make $1B” project above (“trying to train on these very long-horizon reward signals poses a number of distinctive challenges…”), because that same argument would also “prove” that no human could possibly decide that they want to make $1B and then succeed. I think you’re thinking about RL too narrowly, but we can talk about that separately.