(The reason I framed it in the style of “am I allowed this thought” / “will my teacher accept it if I make this inference?” is that it’s literally the frame used in the post ;P)
At the SSC Meetup tonight in my house, I was in a group conversation. I asked a stranger if they’d read anything interesting on the new LessWrong in the last 6 months or so (I had not yet mentioned my involvement in the project). He told me about an interesting post about the variance in human intelligence compared to the variance in mice intelligence. I said it was nice to know people read the posts I write. The group then had a longer conversation about the question. It was enjoyable to hear strangers tell me about reading my posts.
As far as I know, Paul hasn’t explained his choice in detail. One reason he does mention, in this comment, is that in the context of strategy-stealing, preferences like “help me stay in control and be well-informed” do not make sense when interpreted as preferences-as-elicited, since the current user has no way to know if they are in control or well-informed.
I agree this example adds nuance, and I’m unsure how to correctly categorise it.
It seems odd that the idealistic goal gets to be the standard name, while the dime-a-dozen failure mode gets a longer, more confusing name.
I note that Wei says a similar thing happened to ‘act-based’:
My understanding is that “act-based agent” used to mean something different (i.e., a simpler kind of AI that tries to do the same kind of action that a human would), but most people nowadays use it to mean an AI that is designed to satisfy someone’s short-term preferences-on-reflection, even though that no longer seems particularly “act-based”.
Is there a reason why the standard terms are not being used to refer to the standard, short-term results?
(I suppose that economics assumes rational agents who know their preferences, so taking language from economics might lead to this situation with the ‘short-term preferences’ decision.)
In the post Wei contrasts “current” and “actual” preferences. “Stated” vs “reflective” preferences also seem like nice alternatives.
(I want to note that I’m quite interested in having a conversation about the above, both with Geoff but also with others who have thought a lot about rationality.)
Oh, okay. Is it not important to have a name for the class of thing we could accidentally train an ML system to optimise for that isn’t our ultimate preferences? Is there a term for that?
You have a section titled
learning user preferences for corrigibility isn’t enough for corrigible behavior
Would this be more consistently titled “Learning narrow preferences for corrigibility isn’t enough for corrigible behavior”?
I understand Paul to be saying that he hopes that corrigibility will fall out if we train an AI to score well on your short-term preferences, not just your narrow preferences.
At some point Paul used “short-term preferences” and “narrow preferences” interchangeably, but no longer does (or at least no longer endorses doing so).
I would like to have these two terms defined. Let me offer my understanding from reading the relevant thread.
Short-term preferences refer to the most useful action I could take next, given my ultimate goals. This is to be contrasted with my current best guess about what that action is. It’s what I would want, not what I do want.
An AI optimising for my short-term preferences may reasonably say “No, don’t take this action, because you’d actually prefer this alternative action if you only thought longer. It fits your true short-term preferences; you’re just mistaken about them.” This is in contrast with something you might call narrow preferences, under which the AI does what you said anyway.
I did #4 and #1. Here is what I wrote for each section of #4 (note: this will spoil your ability to do the exercise if you read it).
1. How do you explain these effects?
Seems like a trick question. Like, I have models of the world that feel like they might predict effects 2 and 3, and I can sort of wrangle explanations for 1 and 4, but my split-second reaction is “I’m not sure these are real effects, probably none replicate (though number 2 sounds like it might just be a restatement of a claim I already believe)”.
2. How would you have gone about uncovering them?
As I think about trying to determine whether someone did their diet for ethical reasons, I immediately feel highly skeptical of the result. I think that the things people will tick-box as ‘because I care about animals’ do not necessarily refer to a deep underlying structure of the world that is ‘ethics’, and can refer to one of many things (e.g. exposure to effective guilt-based marketing, reflections on ethical philosophy, the ownership of a dog/cat/pet from an early age, etc). But I guess that just doing a simple questionnaire isn’t of literally zero value.
Loyalty, too, feels like a thing I could design a better measure for, but I worry this is tangled up with my believing it’s true, and thus illusion-of-transparency assuming people mean the same thing as I do when they check-box ‘loyalty’.
Number 3 seems totally testable and straightforward.
Number 4 seems broadly testable. Creativity could be done with that “list the uses of a brick” test, or some other fun ones.
I notice this makes me more skeptical about the first two ‘results’ and more trusting of the last two ‘results’.
3. These are all reversed, and the actual findings were the opposite of what I said. How do you explain the opposite, correct effects?
Ah, the classic ‘I reversed the experimental findings’ trick. Well, I guess I did fine on it this time. Oh look, I just managed to think of an explanation for number 2, which is that a more discerning audience of less loyal customers increases adversarial pressures among service providers, raising prices. Interesting. I think I’m mostly noticing how modern psychological research methodology can be quite terrible, and that such a questionnaire, without incorporating a thoughtful model of the environment, will often be useless. Model-free empirical questions can be overdetermined by the implicit model.
4. Actually, none of these results could be replicated. Why and how were non-null effects detected in the first place? Answers using your designs from (2) are preferable.
Okay. Science is awful.
More general thoughts: This helped me notice that a simple empirical psychological claim like this shouldn’t be relied on as evidence about anything. That pattern-matches to radical skepticism, but that’s not what I mean. I think I’m mostly saying that context-free/theory-free claims are meaningless in psychology/sociology, or something like that.
The only thing I can come up with is that the graph doesn’t prove causality in any particular way. (It did take me like 3 whole minutes to notice that correlation isn’t causation; I was primarily looking for things like axes labelled in unhelpful ways.) I can tell a story where these are uncorrelated and everyone is dumb. I can tell a story where the decrease in wages is the *explanation* for why debt is growing: it was previously in equilibrium, but now is getting paid off much more slowly. I can tell a story of active prevention, whereby, because wages are going down, the government is making students pay less and store more of it as debt so they still have a good quality of life immediately after college.
Again, I’m noticing how simple context-free/theory-free claims do not determine an interpretation.
While the post promised answers in the comments, there were no comments, either on the post or on the linked Washington Post article, so I’m not sure what the expected take-away was.
I did all the exercises above. Here’s what I wrote down during the timed sections. (It’s a stream-of-consciousness account, so it may not be very clear or understandable.)
How would you generalize the common problem in the above arguments? You have 2 minutes.
The structure of the reasoning does not necessarily correlate with one outcome more than others. You say A because X, but I can argue that B because X.
But I’m confused, because I can do this for any argument that’s not maximally well-specified. Like, there’s always a gotcha. If I argue for the structure of genetics based on the pattern of children born with certain features, I could also use that evidence, combined with an anti-inductive prior, to argue the opposite. I’m not quite sure why some things feel like they prove too much and some don’t. I suppose it’s just “in the context of my actual understanding of the situation, do I feel like this argument pins down a world-state positively correlated with the belief or not?”, and if it doesn’t, then I can neatly express this by showing it can prove anything, because it’s not actually real evidence.
Oh huh, maybe that’s wrong. It’s not that it isn’t evidence for anything, it’s that if it were evidence for this it would be evidence for many inconsistent things. (Though I think those two are the same.)
What algorithm were you running when you solved the above problems? Is there a more ideal/general algorithm? You have 3 minutes.
Hmm, I did like the thing that happened, actually. Normally in such a disagreement with a person, I would explain the structure of my beliefs around the thing they called a ‘reason’. I’d do lots of interpretive work like that. “Let me explain the process by which smart people get their beliefs and when those processes are/aren’t truth-tracking” or “Let me explain what heuristics help predict whether a startup is successful” or “Let me explain what p-hacking is”. But in all of them the new mental motion was much cleaner/cheaper: producing a small impossibility proof.
I think I normally avoid such proofs because they’re non-constructive—they don’t tell you where the mistake was or how that part of the world works, and I’m often worried this will feel like a demotivating thing or conversation killer for the other person I’m talking with. But I think it’s worth thinking this way for myself more. I do want to practice it, certainly. I should be able to use all tools of proof and disproof, not just those that make conversations go smoothly.
Some general thoughts
I found doing the exercises very enjoyable.
I think that the answers here could’ve followed more of a set format. These aren’t very open-ended questions, and I think that if I’d practiced matching a format, that would’ve drilled a more specific tool better. But it’s not clear that’s appropriate.
I didn’t like how all the examples were of the “don’t believe a dumb low-status thing” variety. Like, I think people often build epistemologies around making sure to never be religious, endorse a failed startup idea, or believe homeopathy, but I think you should mostly build them around making sure you will produce successful insights in physics, or build a successful startup, which is a different frame. I would’ve liked much more difficult examples in areas where the right choice isn’t clear purely from pattern-matching to low-status beliefs.
The post tells people to sit by a clock. I think at the start I would’ve told people to find a timer by googling ‘timer’ (when you do that, one just appears in Google); otherwise I expect most folks would have bounced off and not done those exercises.
I really liked the ‘reflect on the general technique’ sections, they were excellent and well-placed.
In hindsight, I should’ve listened to dude562.
Would be good to note that this is for the Alignment Newsletter. For a few seconds I didn’t realise that’s what this was.
My understanding is that Ray wants them not to be anonymous; the idea being that voting, and anything else that determines the order your comment gets seen in, is always anonymous, and all other things are public.
Just FYI, I am planning to make another post in maybe two weeks to open further discussion and nail down the specific details of what we want to celebrate and what a fitting way to do that is, because that seems like the correct way to build traditions.