James Payor
Got it, thanks!
The way I understand A4 is that it says “if moving by Δ is good, then moving by any fraction of Δ is also good”.
And A5 says “if moving by Δ is good, then moving by any multiple of Δ is also good”, which is much stronger.
[Edit: yeah nevermind I have the inequality backwards]
A5 seems too strong?
Consider lotteries A and B, and a mixture M in between. Applying A5 twice gives:
If M ≻ A then B ≻ A
If M ≻ B then A ≻ B
So if M ≻ A and M ≻ B then both A ≻ B and B ≻ A?
Either I’m confused or A5 is a stricter condition than concavity.
Huh, does this apply to employees too? (ala “these are my views and do not represent those of my employer”)
Hm, sorry! I don’t think a good reply on my part should do that :P
I think I’m rejecting a certain mental stance toward unknown-unknowns, and I don’t think I’m clearly pointing at it yet.
My nearby reply is most of my answer here. I know how to tell when reality is off-the-rails wrt my model, because my model is made of falsifiable parts. I can even tell you what those parts are, and about the rails I’m expecting reality to stay on.
When I try to cash out your example, “maybe the whole way I’m thinking about bootstrapping isn’t meaningful/useful”, it doesn’t seem like it’s outside my meta-model? I don’t think I have to do anything differently to handle it?
Specifically, my “bootstrapping” concept comes with some concrete pictures of how things go. I currently find the concept “meaningful/useful” because I expect these concrete pictures to be instantiated in reality. (Mostly because I expect reality to admit the “bootstrapping” I’m picturing, and I expect advanced AI to be able to find it). If reality goes off-my-rails about my concept mattering, it will be because things don’t apply in the way I’m thinking, and there were some other pathways I should have been attending to instead.
Idk if it’s actually missing that?
I can talk about what is in-distribution in terms of a bunch of finite components, and thereby name the cases that are out of distribution: those in which my components break.
(This seems like an advantage inside views have: they come with limits attached, because they build a distribution out of pieces that you can tell are broken when they don’t match reality.)
My example doesn’t talk about the probability I assign to “crazy thing I can’t model”, but such a thing would break something like my model of “who is doing what with the AI code by default”.
Maybe it would have been better of me to include a case for “and reality might invalidate my attempt at meta-reasoning too”?
tl;dr I think you can improve on “my models might break for an unknown reason” if you can name the main categories of model-breaking unknowns
Isn’t there a third way out? Name the circumstances under which your models break down.
e.g. “I’m 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those.”
I think it’s good practice to name ways your models can break down that you think are plausible, and also ways that your conversational partners may think are plausible.
e.g. even if I didn’t think it would be hard for AGI to bootstrap, if I’m talking to someone for whom that’s a crux, it’s worth laying out that I’m treating that as a reliable step. It’s better yet if I clarify whether it’s a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)
Saying some more things: Löb’s Theorem is a true statement: whenever □ talks about an inner theory at least as powerful as e.g. PA, the theorem shows you how to prove P from any proof of □P → P.
This means you cannot prove ¬□⊥, or similarly that □⊥ → ⊥.
¬□⊥ is one way we can attempt to formalize “self-trust” as “I never prove false things”. So indeed the problem is with this formalization: it doesn’t work, you can’t prove it.
This doesn’t mean we can’t formalize self-trust a different way, but it shows the direct way is broken.
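Spelled out, in case it helps (this is just the standard argument, nothing beyond the above):

Löb’s Theorem (outer form):  if PA ⊢ □P → P, then PA ⊢ P.
Instantiate at P = ⊥:  ¬□⊥ is the same statement as □⊥ → ⊥,
so if PA ⊢ ¬□⊥, then PA ⊢ ⊥, i.e. PA is inconsistent.
Hence a consistent PA can never prove its own consistency statement ¬□⊥.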
Hm, I think your substitution isn’t right, and the more correct one is “it is provable that (it is not provable that False) implies (it is provable that False)”, ala □¬□⊥ → □⊥.
I’m again not following well, but here are some other thoughts that might be relevant:
-
It’s provable for any P that ⊥ → P, i.e. from “False” anything follows. This is how we give “False” grounding: if the system proves False, then it proves everything, and distinguishes nothing.
-
There are two levels at which we can apply Löb’s Theorem, which I’ll call “outer Löb’s Theorem” and “inner Löb’s Theorem”.
Outer Löb’s Theorem says that whenever PA proves □P → P, then PA also proves P. It constructs the proof of P using the proof of □P → P.
Inner Löb’s Theorem is the same, formalized in PA. It proves □(□P → P) → □P. The logic is the same, but it shows that PA can translate an inner proof of □P → P into an inner proof of P.
Notably, the outer version is not a proof of (□P → P) → P. We need to have the proof of □P → P available in order to prove P.
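Laid out side by side (same content as above, just condensed):

Outer Löb (a rule about PA):   if PA ⊢ □P → P, then PA ⊢ P
Inner Löb (a theorem of PA):   PA ⊢ □(□P → P) → □P
Not a theorem in general:      (□P → P) → P  — you need the actual proof of □P → P in hand to get the proof of P.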
-
I’m not sure I understand what you’re interested in, but can say a few concrete things.
We might hope that PA can “learn things” by looking at what a copy of itself can prove. We might furthermore expect that it can see that a copy of itself will only prove true sentences.
Naively this should be possible. Outside of PA we can see that PA and its copy are isomorphic. Can we then formalize this inside PA?
In the direct attempt to do so, we construct our inner copy using the provability predicate □, where □P is a statement that says “there exists a proof of P in the inner copy of PA”.
But Löb’s Theorem rules out formalizing self-trust this way. The statement ¬□⊥ means “there are no ways to prove falsehood in the inner copy of PA”. But if PA could prove that, Löb’s Theorem turns it directly into a proof of ⊥!
This doesn’t AFAICT mean self-trust of the form “I trust myself not to prove false things” is impossible, just that this approach fails, and you have to be very careful about deferral.
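To make the failure concrete, here is the shape of the “direct” self-trust schema and where Löb bites (again, just restating the argument above in schema form):

Hoped-for reflection:  PA ⊢ □P → P  for every sentence P
Löb’s Theorem:         PA ⊢ □P → P  ⟹  PA ⊢ P
So full reflection would force PA ⊢ P for every sentence P, i.e. PA proves everything.
Even the single instance P = ⊥ (which is exactly ¬□⊥, “I don’t prove False”) is already unprovable.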
Something I’m now realizing, having written all these down: the core mechanism really does echo Löb’s theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb’s theorem doesn’t! And maybe these ideas still do, but my story for why has broken, and I’m now confused.)
As something to ponder on, let me show you how we can prove Löb’s theorem following the method of ideas #3 and #5:
⊢ □A → A is assumed
We consider the loop-cutter B, defined so that B ↔ □(B → A)
We verify that if B activates then A must be true: B → □(B → A), and B → □B, so B → □A, and hence (using □A → A) B → A
Then, B can satisfy □(B → A) by finding the same proof.
So B activates, and A is true.
In English:
We have A, who is blocked on □A
We introduce A to the loop cutter B, who will activate if activation provably leads to A being true
B encounters the argument “if B activates then A is true”, and this causes B to activate
This satisfies A’s requirement for □A, so A becomes true.
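Writing the same proof out in symbols, step by step (my rendering; B is the loop-cutter fixed point from above):

1.  ⊢ □A → A                      (assumption)
2.  ⊢ B ↔ □(B → A)                (loop-cutter, via the diagonal lemma)
3.  ⊢ B → □(B → A)                (from 2)
4.  ⊢ □(B → A) → □□(B → A)        (internal necessitation)
5.  ⊢ □□(B → A) → □B              (from 2, under the box)
6.  ⊢ B → □B                      (3, 4, 5)
7.  ⊢ B → □A                      (3, 6, and distribution: □(B → A) ∧ □B → □A)
8.  ⊢ B → A                       (7 and 1)
9.  ⊢ □(B → A)                    (necessitation on 8)
10. ⊢ B                           (9 and 2)
11. ⊢ A                           (8 and 10)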
Perhaps the confusion is mostly me being idiosyncratic! I don’t have a good reference, but can attempt an explanation.
The propositions A and B are meant to model the behaviour of some agents, say Alice and Bob. The proposition A means “Alice cooperates”, likewise B means “Bob cooperates”.
I’m probably often switching viewpoints, talking about A as if it’s Alice, when formally A is some statement we’re using to model Alice’s behaviour.
When I say “A tries to prove that A → B”, what I really mean is: “In this scenario, Alice is looking for a proof that if she cooperates, then Bob cooperates. We model this with A meaning ‘Alice cooperates’, and A follows from □(A → B).”
Note that every time we use □ we’re talking about proofs of any size. This makes our model less realistic, since Alice and Bob only have a limited amount of time in which to reason about each other and try to prove things. The next step would be to relax the assumptions to things like □ₙB → A, which says “Alice cooperates whenever it can be proven in n steps that Bob cooperates”.
Some constructions for proof-based cooperation without Löb
I think I saw OpenAI do some “rubric” thing which resembles the Anthropic method. It seems easy enough for me to imagine that they’d do a worse job of it though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside view stuff that helped them apply it well)
Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.
I take “make AI alignment seem legit” to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here’s a list of things that come to my mind:
Trying to get alignment research to look more like a mainstream field, by e.g. funding professors and PhD students who frame their work as alignment and giving them publicity, organizing conferences that try to rope in existing players who have perceived legitimacy, etc
Papers like Concrete Problems in AI Safety that try to tie AI risk to stuff that’s already in the overton window / already perceived as legitimate
Optimizing language in posts / papers to be perceived well, by e.g. steering clear of the part where we’re worried AI will literally kill everyone
Efforts to make it politically untenable for AI orgs to not have some narrative around safety
Each of these things seems like it has a core good thing, but according to me they’ve all backfired to the extent that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically, I think the effects of avoiding the core stuff have been bad: creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, and so on.
Also, Plan B is currently being used to justify accelerating various dangerous tech by folks with no solid angles on Plan A...
My troll example is a fully connected network with all zero weights and biases, no skip connections.
This isn’t something that you’d reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
To look for a true gradient hacker, I’d try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
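If it helps to see the troll example concretely, here’s a minimal sketch of it in PyTorch (the layer sizes, the MSE loss against an all-ones target, and the random input are my own arbitrary choices for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Fully connected network, no skip connections.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Zero out every weight and bias (normal initialization avoids this on purpose).
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

x = torch.randn(16, 4)
target = torch.ones(16, 1)  # nonzero target so the loss gradient at the output is nonzero
loss = F.mse_loss(model(x), target)
loss.backward()

for name, p in model.named_parameters():
    print(name, p.grad.abs().max().item())

# Every gradient comes out exactly zero except the final layer's bias: the zero
# weights (and ReLU'(0) = 0) stop any gradient signal from reaching earlier layers.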
Okay, I now think A5 implies: “if moving by Δ is good, then moving by any negative multiple −nΔ is bad”. Which checks out to me re concavity.
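For the record, here’s why the corrected reading lines up with concavity (assuming the relevant quantity is a concave utility/score u over the mixture space — that’s my framing for the check, not necessarily the post’s):

Suppose u is concave and u(x + Δ) > u(x), i.e. “moving by Δ is good”.
x is a mixture of x − nΔ and x + Δ:   x = (1/(n+1))·(x − nΔ) + (n/(n+1))·(x + Δ)
Concavity:   u(x) ≥ (1/(n+1))·u(x − nΔ) + (n/(n+1))·u(x + Δ)
Using u(x + Δ) > u(x):   (n+1)·u(x) ≥ u(x − nΔ) + n·u(x + Δ) > u(x − nΔ) + n·u(x)
So u(x) > u(x − nΔ):   moving by −nΔ is indeed bad.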