James Payor
Got it, thanks!
The way I understand A4 is that it says “if moving by Δ is good, then moving by any fraction of Δ is also good”.
And A5 says “if moving by Δ is good, then moving by any multiple of Δ is also good”, which is much stronger.
[Edit: yeah nevermind I have the inequality backwards]
A5 seems too strong?
Consider lotteries A and B, and a mixture M in between. Applying A5 twice gives:
If M ≻ A then B ≻ A
If M ≻ B then A ≻ B
So if M ≻ A and M ≻ B then both A ≻ B and B ≻ A?
Either I’m confused or A5 is a stricter condition than concavity.
Huh, does this apply to employees too? (ala “these are my views and do not represent those of my employer”)
Hm, sorry! I don’t think a good reply on my part should do that :P
I think I’m rejecting a certain mental stance toward unknown-unknowns, and I don’t think I’m clearly pointing at it yet.
My nearby reply is most of my answer here. I know how to tell when reality is off-the-rails wrt my model, because my model is made of falsifiable parts. I can even tell you what those parts are, and about the rails I’m expecting reality to stay on.
When I try to cash out your example, “maybe the whole way I’m thinking about bootstrapping isn’t meaningful/useful”, it doesn’t seem like it’s outside my meta-model? I don’t think I have to do anything differently to handle it?
Specifically, my “bootstrapping” concept comes with some concrete pictures of how things go. I currently find the concept “meaningful/useful” because I expect these concrete pictures to be instantiated in reality. (Mostly because I expect reality to admit the “bootstrapping” I’m picturing, and I expect advanced AI to be able to find it). If reality goes off-my-rails about my concept mattering, it will be because things don’t apply in the way I’m thinking, and there were some other pathways I should have been attending to instead.
Idk if it’s actually missing that?
I can talk about what is in-distribution in terms of a bunch of finite components, and thereby name the cases that are out of distribution: those in which my components break.
(This seems like an advantage inside views have: they come with limits attached, because they build a distribution out of pieces that you can tell are broken when they don’t match reality.)
My example doesn’t talk about the probability I assign to “crazy thing I can’t model”, but such a thing would break something like my model of “who is doing what with the AI code by default”.
Maybe it would have been better of me to include a case for “and reality might invalidate my attempt at meta-reasoning too”?
tl;dr I think you can improve on “my models might break for an unknown reason” if you can name the main categories of model-breaking unknowns
Isn’t there a third way out? Name the circumstances under which your models break down.
e.g. “I’m 90% confident that if OpenAI built AGI that could coordinate AI research with 1/10th the efficiency of humans, we would then all die. My assessment is contingent on a number of points, like the organization displaying similar behaviour wrt scaling and risks, cheap inference costs allowing research to be scaled in parallel, and my model of how far artificial intelligence can bootstrap. You can ask me questions about how I think it would look if I were wrong about those.”
I think it’s good practice to name ways your models can break down that you think are plausible, and also ways that your conversational partners may think are plausible.
e.g. even if I didn’t think it would be hard for AGI to bootstrap, if I’m talking to someone for whom that’s a crux, it’s worth laying out that I’m treating that as a reliable step. It’s better yet if I clarify whether it’s a crux for my model that bootstrapping is easy. (I can in fact imagine ways that everything takes off even if bootstrapping is hard for the kind of AGI we make, but these will rely more on the human operators continuing to make dangerous choices.)
Saying some more things: Löb’s Theorem is a true statement: whenever □ talks about an inner theory at least as powerful as e.g. PA, the theorem shows you how to prove P from any proof of □P → P.
This means you cannot prove ¬□⊥, or similarly that □⊥ → ⊥.
¬□⊥ is one way we can attempt to formalize “self-trust” as “I never prove false things”. So indeed the problem is with this formalization: it doesn’t work, you can’t prove it.
This doesn’t mean we can’t formalize self-trust a different way, but it shows the direct way is broken.
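Spelled out, in case it helps (this is just the standard argument, nothing beyond the above):

Löb’s Theorem (outer form):  if PA ⊢ □P → P, then PA ⊢ P.
Instantiate at P = ⊥:  ¬□⊥ is the same statement as □⊥ → ⊥,
so if PA ⊢ ¬□⊥, then PA ⊢ ⊥, i.e. PA is inconsistent.
Hence a consistent PA can never prove its own consistency statement ¬□⊥.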
Hm, I think your substitution isn’t right, and the more correct one is “it is provable that (it is not provable that False) implies (it is provable that False)”, ala □¬□⊥ → □⊥.
I’m again not following well, but here are some other thoughts that might be relevant:
-
It’s provable for any P that ⊥ → P, i.e. from “False” anything follows. This is how we give “False” grounding: if the system proves False, then it proves everything, and distinguishes nothing.
-
There are two levels at which we can apply Löb’s Theorem, which I’ll call “outer Löb’s Theorem” and “inner Löb’s Theorem”.
Outer Löb’s Theorem says that whenever PA proves □P → P, then PA also proves P. It constructs the proof of P using the proof of □P → P.
Inner Löb’s Theorem is the same, formalized in PA. It proves □(□P → P) → □P. The logic is the same, but it shows that PA can translate an inner proof of □P → P into an inner proof of P.
Notably, the outer version is not a proof of (□P → P) → P. We need to have the proof of □P → P available in order to prove P.
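Laid out side by side (same content as above, just condensed):

Outer Löb (a rule about PA):   if PA ⊢ □P → P, then PA ⊢ P
Inner Löb (a theorem of PA):   PA ⊢ □(□P → P) → □P
Not a theorem in general:      (□P → P) → P  — you need the actual proof of □P → P in hand to get the proof of P.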
-
I’m not sure I understand what you’re interested in, but can say a few concrete things.
We might hope that PA can “learn things” by looking at what a copy of itself can prove. We might furthermore expect that it can see that a copy of itself will only prove true sentences.
Naively this should be possible. Outside of PA we can see that PA and its copy are isomorphic. Can we then formalize this inside PA?
In the direct attempt to do so, we construct our inner copy using the provability predicate □, where □P is a statement that says “there exists a proof of P in the inner copy of PA”.
But Löb’s Theorem rules out formalizing self-trust this way. The statement ¬□⊥ means “there are no ways to prove falsehood in the inner copy of PA”. But if PA could prove that, Löb’s Theorem turns it directly into a proof of ⊥!
This doesn’t AFAICT mean self-trust of the form “I trust myself not to prove false things” is impossible, just that this approach fails, and you have to be very careful about deferral.
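To make the failure concrete, here is the shape of the “direct” self-trust schema and where Löb bites (again, just restating the argument above in schema form):

Hoped-for reflection:  PA ⊢ □P → P  for every sentence P
Löb’s Theorem:         PA ⊢ □P → P  ⟹  PA ⊢ P
So full reflection would force PA ⊢ P for every sentence P, i.e. PA proves everything.
Even the single instance P = ⊥ (which is exactly ¬□⊥, “I don’t prove False”) is already unprovable.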
Something I’m now realizing, having written all these down: the core mechanism really does echo Löb’s theorem! Gah, maybe these are more like Löb than I thought.
(My whole hope was to generalize to things that Löb’s theorem doesn’t! And maybe these ideas still do, but my story for why has broken, and I’m now confused.)
As something to ponder on, let me show you how we can prove Löb’s theorem following the method of ideas #3 and #5:
⊢ □A → A is assumed
We consider the loop-cutter B, defined so that B ↔ □(B → A)
We verify that if B activates then A must be true: B → □(B → A), and B → □B, so B → □A, and hence (using □A → A) B → A
Then, B can satisfy □(B → A) by finding the same proof.
So B activates, and A is true.
In English:
We have A, who is blocked on □A
We introduce A to the loop cutter B, who will activate if activation provably leads to A being true
B encounters the argument “if B activates then A is true”, and this causes B to activate
This satisfies A’s requirement for □A, so A becomes true.
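Writing the same proof out in symbols, step by step (my rendering; B is the loop-cutter fixed point from above):

1.  ⊢ □A → A                      (assumption)
2.  ⊢ B ↔ □(B → A)                (loop-cutter, via the diagonal lemma)
3.  ⊢ B → □(B → A)                (from 2)
4.  ⊢ □(B → A) → □□(B → A)        (internal necessitation)
5.  ⊢ □□(B → A) → □B              (from 2, under the box)
6.  ⊢ B → □B                      (3, 4, 5)
7.  ⊢ B → □A                      (3, 6, and distribution: □(B → A) ∧ □B → □A)
8.  ⊢ B → A                       (7 and 1)
9.  ⊢ □(B → A)                    (necessitation on 8)
10. ⊢ B                           (9 and 2)
11. ⊢ A                           (8 and 10)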
Perhaps the confusion is mostly me being idiosyncratic! I don’t have a good reference, but can attempt an explanation.
The propositions A and B are meant to model the behaviour of some agents, say Alice and Bob. The proposition A means “Alice cooperates”, likewise B means “Bob cooperates”.
I’m probably often switching viewpoints, talking about A as if it’s Alice, when formally A is some statement we’re using to model Alice’s behaviour.
When I say “A tries to prove that A → B”, what I really mean is: “In this scenario, Alice is looking for a proof that if she cooperates, then Bob cooperates. We model this with A meaning ‘Alice cooperates’, and A follows from □(A → B).”
Note that every time we use □ we’re talking about proofs of any size. This makes our model less realistic, since Alice and Bob only have a limited amount of time in which to reason about each other and try to prove things. The next step would be to relax the assumptions to things like □ₙB → A, which says “Alice cooperates whenever it can be proven in n steps that Bob cooperates”.
Some constructions for proof-based cooperation without Löb
I think I saw OpenAI do some “rubric” thing which resembles the Anthropic method. It seems easy enough for me to imagine that they’d do a worse job of it though, or did something somewhat worse, since folks at Anthropic seem to be the originators of the idea (and are more likely to have a bunch of inside view stuff that helped them apply it well)
Yeah, all four of those are real things happening, and are exactly the sorts of things I think the post has in mind.
I take “make AI alignment seem legit” to refer to a bunch of actions that are optimized to push public discourse and perceptions around. Here’s a list of things that come to my mind:
Trying to get alignment research to look more like a mainstream field, by e.g. funding professors and PhD students who frame their work as alignment and giving them publicity, organizing conferences that try to rope in existing players who have perceived legitimacy, etc
Papers like Concrete Problems in AI Safety that try to tie AI risk to stuff that’s already in the overton window / already perceived as legitimate
Optimizing language in posts / papers to be perceived well, by e.g. steering clear of the part where we’re worried AI will literally kill everyone
Efforts to make it politically untenable for AI orgs to not have some narrative around safety
Each of these things seems like it has a core good thing, but according to me they’ve all backfired to the extent that they were optimized to avoid the thorny parts of AI x-risk, because this enables rampant goodharting. Specifically, I think the effects of avoiding the core stuff have been bad: creating weird cargo cults around alignment research, making it easier for orgs to have fake narratives about how they care about alignment, and so on.
Also, Plan B is currently being used to justify accelerating various dangerous tech by folks with no solid angles on Plan A...
My troll example is a fully connected network with all zero weights and biases, no skip connections.
This isn’t something that you’d reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
To look for a true gradient hacker, I’d try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
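If it helps to see the troll example concretely, here’s a minimal sketch of it in PyTorch (the layer sizes, the MSE loss against an all-ones target, and the random input are my own arbitrary choices for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Fully connected network, no skip connections.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

# Zero out every weight and bias (normal initialization avoids this on purpose).
with torch.no_grad():
    for p in model.parameters():
        p.zero_()

x = torch.randn(16, 4)
target = torch.ones(16, 1)  # nonzero target so the loss gradient at the output is nonzero
loss = F.mse_loss(model(x), target)
loss.backward()

for name, p in model.named_parameters():
    print(name, p.grad.abs().max().item())

# Every gradient comes out exactly zero except the final layer's bias: the zero
# weights (and ReLU'(0) = 0) stop any gradient signal from reaching earlier layers.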
Okay, I now think A5 implies: “if moving by Δ is good, then moving by any negative multiple −nΔ is bad”. Which checks out to me re concavity.
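For the record, here’s why the corrected reading lines up with concavity (assuming the relevant quantity is a concave utility/score u over the mixture space — that’s my framing for the check, not necessarily the post’s):

Suppose u is concave and u(x + Δ) > u(x), i.e. “moving by Δ is good”.
x is a mixture of x − nΔ and x + Δ:   x = (1/(n+1))·(x − nΔ) + (n/(n+1))·(x + Δ)
Concavity:   u(x) ≥ (1/(n+1))·u(x − nΔ) + (n/(n+1))·u(x + Δ)
Using u(x + Δ) > u(x):   (n+1)·u(x) ≥ u(x − nΔ) + n·u(x + Δ) > u(x − nΔ) + n·u(x)
So u(x) > u(x − nΔ):   moving by −nΔ is indeed bad.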