What’s the trick? My basic guess, when I see some very long complicated paper that doesn’t explain the key problem and key solution up front, is that you’ve done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you’ve got a directly explainable trick for how you get great suggestions you can’t verify, go for it.
I think a crux here is that I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities under useful assumptions are generally much rarer than you make them out to be.
The big reason for this is that the very messiness/expressiveness of neural networks makes it extremely hard to bound their behavior, and the same reason you couldn’t do provable safety/alignment on the AI itself, except in very toy examples, will also limit any ability to prove hard theorems about what an AI is aligned to.
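To make the “hard to bound” point concrete, here is a minimal toy sketch (not anything from the thread): sound but naive interval bound propagation through a random ReLU network, where the certified output box typically balloons with depth until it says almost nothing. The layer sizes, depth, and weight scale below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b (sound but loose)."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    """ReLU is monotone, so it maps box endpoints to box endpoints."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

width, depth = 64, 8
lo = np.full(width, -0.01)   # a tiny input box around the origin
hi = np.full(width, +0.01)

for layer in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
    b = np.zeros(width)
    lo, hi = interval_relu(*interval_affine(lo, hi, W, b))
    # The mean certified width grows rapidly with depth for these settings,
    # i.e. the provable guarantee about the network's output becomes vacuous.
    print(f"layer {layer + 1}: mean certified width = {(hi - lo).mean():.4f}")
```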
From an epistemic perspective, we have far more evidence for the laws of thermodynamics than for any particular proposal for AI alignment being impossible, arguably by billions or trillions of bits, so much so that there is little reason to think, at our current state of epistemic clarity, that we can declare a direction impossible (rather than impractical).
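For scale, “bits” here can be read in the standard log-likelihood-ratio sense (a definitional note, not something from the thread, with symbols chosen for illustration):

```latex
% Evidence (in bits) that an observation D gives hypothesis H over rival H':
\[
  \text{bits of evidence} \;=\; \log_2 \frac{P(D \mid H)}{P(D \mid H')},
\]
% so "billions of bits" corresponds to an astronomically lopsided likelihood
% ratio accumulated over many independent observations.
```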
Quintin Pope correctly challenges @Liron here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance:
https://x.com/QuintinPope5/status/1703569557053644819
Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous.
Seeing this sort of massive overconfidence on the part of pessimists is part of why I’ve become more confident in my own inside-view beliefs that there’s not much to worry about.
The weak claim is here:
https://x.com/liron/status/1704126007652073539
Right, orthogonality doesn’t argue that AI we build *will* have human-incompatible preferences, only that it can.
It raises the question: how will the narrow target in preference-space be hit?
Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.
I’m not saying that it’s against thermodynamics to get behaviors you don’t know how to verify. I’m asking what’s the plan for getting them.
My claim here is that there is no decisive blocker for plans that get us a safe, highly capable AI that is used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the well-tested assumption that the universe is time-translation symmetric, i.e. that physics stays the same no matter when an experiment happens), and the proposed blockers do not have anywhere close to the amount of evidence that thermodynamics does, so we cannot safely discard any plan that fails to meet some prerequisite.
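For reference, the chain from that symmetry assumption to the thermodynamic blocker is short and standard; a textbook sketch (not part of the exchange), in Lagrangian form:

```latex
% If the Lagrangian L(q, \dot q) has no explicit time dependence
% (time-translation symmetry), the energy H is conserved along solutions:
\[
  H \;=\; \sum_i \dot q_i \,\frac{\partial L}{\partial \dot q_i} \;-\; L,
  \qquad
  \frac{dH}{dt} \;=\; -\,\frac{\partial L}{\partial t} \;=\; 0 .
\]
% A machine operating in a cycle therefore returns to its starting energy,
% so it cannot output net work per cycle without drawing on an external
% source: no perpetual motion of the first kind.
```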
Cool. What’s the actual plan and why should I expect it not to create machine Carissa Sevar? I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don’t need it.
Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn’t what he meant?
Okay, maybe he clarified that there was no thermodynamics-like blocker to getting a plan to align AI in principle, but I didn’t read Eliezer’s clarification as immediately ruling that interpretation out, so I wanted to rule it out explicitly.
I didn’t see the interpretation as false when I wrote it, because I believed he had only ruled out a decisive blocker to getting behaviors you don’t know how to verify.
I think the misunderstanding came from Eliezer’s reference to a perpetual motion machine. The point was that people proposing ways to build one often have complicated schemes that tend not to adequately address the central difficulty of actually creating it. That’s where the analogy ends. From thermodynamics, we have strong reasons to believe such a machine is not just difficult but impossible, whereas we have no corresponding theory to rule out verifiably safe AI.
Habryka’s analogy to nuclear reactor plans is similar, except that we know building one of those is difficult but actually possible.