Yeah, I said “goal inference” instead of “value learning” but I mean the same thing. The “ambitious” part is that we are trying to do much better than humans, which I was taking for granted in this post (it’s six months older than ambitious vs. narrow value learning).
A lot of magic is happening in the prior over utility functions and optimization algorithms, removing that magic is the open problem.
(I’m pessimistic about making progress on that problem, and instead try to define value by using the human policy to guide a process of deliberation rather than trying to infer some underlying latent structure.)
I agree you should model the human as some kind of cognitively bounded agent. The question is how.
Often the bar is very visible though, which makes it trickier. I think outside might be a good option.
My first reaction was that this discussion focuses too much on sitting on your own, or inviting someone else from your group to move elsewhere. After all, large groups can also shrink when people leave a large group and go join a smaller group.
But on second thought, forming new groups is necessary; otherwise the number of groups will just keep decreasing and the remaining groups will eventually end up large. So in fact it seems very important that you can either split off subgroups or else go sit on your own.
I think in some spaces it is much easier for large groups to fission into small groups and then drift to different spaces. This might be an important consideration for party spaces / layouts.
Other random frictions:
Leaving a group signals that you don’t enjoy the conversation or people in it, so might happen too rarely.
Small groups are harder to leave and so riskier to join or start.
If the quality of conversations declines with size, here is one reason that groups might get too large (from the perspective of a benevolent social planner):
Conversations vary in their desirability—whether because of random drift in topics, the presence of fun people, access to the comfortable seating, or whatever. So by default some conversations will be better than others. This is partially visible from outside, whether by observing how much people are laughing, looking for the cool kids, or just gravitating to the couch.
If any conversation looks better than the others, then selfish conversational participants will preferentially join that conversation. We’d expect this to continue until the marginal party-goer is indifferent among the conversations. This will cause the good conversations to become larger and larger until they are no better than any other conversation.
To see how this leads to a problem, suppose that you have a house with a large number of spaces for conversation, one of which is nicer than the others—and simplify everything maximally, assuming that all conversation participants are interchangeable and that adding more people really just makes discussions worse (ignore the fact that you would obviously never throw a party in this world).
Then you end up with a bunch of 2 person groups, and one N person group using the nice space, where N is just large enough that the N person conversation is no better or worse than one of the 2 person groups. The net effect is exactly the same total welfare as if you had no nice space at all.
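Here is that equilibrium as a toy calculation (all numbers invented): say the per-person quality of an n-person conversation is 10 - n in an ordinary space, and the one nice space adds a flat bonus of 4.

```python
# Toy model of selfish joining; the quality parameters are made up.
BASE, BONUS = 10.0, 4.0

def quality(n, nice=False):
    """Per-person quality of an n-person conversation."""
    return BASE - n + (BONUS if nice else 0.0)

# Selfish party-goers pile into the nice space until it is no better
# than an ordinary 2-person conversation.
n = 2
while quality(n + 1, nice=True) >= quality(2):
    n += 1

print(f"nice-space group grows to n = {n}")        # n = 6
print(f"quality there: {quality(n, nice=True)}")   # 8.0
print(f"ordinary 2-person baseline: {quality(2)}") # 8.0
```

With these numbers the nice-space conversation grows to 6 people and its quality falls to exactly the 2-person baseline, so the nice space contributes nothing to total welfare.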
Of course the same thing happens if you add one person who is the life of the party, improving the quality of whatever conversation they are in. If you add just one such person, then the group containing them will grow until it is large enough to totally offset their value add.
The effect is less stark once you add enough cool kids and nice spaces—eventually a rising tide lifts all boats—but in general this kind of dynamic could lead to a leveling down to whatever the quality of the “reservation conversation” is, obliterating any gains from nice spaces, particularly fun people, or conversations that happened to go in a really interesting and fulfilling direction. (I’m not really sure about this last one, since if conversations sometimes go in an interesting direction then that also increases the expected value of starting a new conversation.)
To the extent that this is an important dynamic, there are possible fixes. A very brutish solution is just having a strong norm against joining conversations once they reach a certain size. If you could exogenously determine social judgments, you could deem it impolite to join 4-5 person conversations, taboo to join 6 person conversations, and good etiquette to leave 4-6 person conversations unless you are feeling particularly engaged.
Here is one reason that you’d expect people to sit on their own too little (from the perspective of a benevolent social planner):
If you are sitting on your own, the expected amount of time before someone joins you depends directly on how much the other party-goers want your company. So at any given moment, being seen sitting on your own is evidence of unpopularity. I think most people can feel that in their bones. So going and sitting on your own is guaranteed to generate some evidence that you are unpopular, the only question is how much of it. For people who aren’t constantly analyzing the signaling consequences of everything they do, this may just translate into feeling surprisingly uncomfortable about sitting on their own, even though they normally wouldn’t mind a few minutes of solitude.
It would be some work to actually find the equilibrium, but I think that on average you are going to take a hit (in terms of others’ estimates of your popularity) from striking out on your own. I’d be interested if anyone actually solves it.
If true this is a little bit weird—you take an action that we’d expect popular people to take more often, and then people update negatively about your popularity? The trick is that people are much better able to observe “who is sitting on their own right now” than to track the exact sequence of events that produced each group. For example, suppose people scan the room once every few minutes. Then they can notice someone sitting on their own, but if they see a group of 2 people they don’t know who joined whom, and so can’t tell whose popularity they should update positively about.
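Here is a toy Bayesian version of that story (all numbers invented). Even if popular people strike out on their own more often, they get joined sooner, so a random scan is more likely to catch an unpopular person alone:

```python
# Snapshot observers: they see who is alone NOW, not who did what.
prior = {"popular": 0.5, "unpopular": 0.5}
p_start_solo = {"popular": 0.3, "unpopular": 0.2}    # popular people strike out more
minutes_alone = {"popular": 2.0, "unpopular": 10.0}  # but get joined sooner
party_minutes = 120.0

# P(a random scan catches this person sitting alone)
seen_alone = {k: p_start_solo[k] * minutes_alone[k] / party_minutes
              for k in prior}
z = sum(prior[k] * seen_alone[k] for k in prior)
posterior = {k: round(prior[k] * seen_alone[k] / z, 2) for k in prior}
print(posterior)  # {'popular': 0.23, 'unpopular': 0.77}
```

So the update runs against popularity, despite popular people sitting alone more often, exactly because observers see snapshots rather than histories.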
To the extent that’s an important dynamic, there are lots of possible fixes.
One simple idea is to designate a space for forming new conversations that isn’t visible from the rest of the party. If I want to start a new conversation with someone random, I go to the designated room. If it’s just me, I do some math or browse the internet or whatever (personally I don’t mind solitude, but do mind awkwardness). When other people join, then we go somewhere else and it’s business as usual.
Of course you could do this better with a machine. I can pull out my phone and press the “new conversation” button, get told if someone else has also pressed “new conversation,” and then start a new group with them. This would be an easy app to make (everyone enters their name and sees a checkbox; when two people within 100 feet check the box, their boxes get unchecked and the second person sees the first person’s name). I would try it.
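A minimal sketch of that matching logic (the names and structure here are mine, and the 100-foot proximity check is stubbed out as a set you pass in; a real app would also want timeouts):

```python
import time

class NewConversationMatcher:
    """In-memory version of the "new conversation" checkbox."""

    def __init__(self):
        self.waiting = {}  # name -> time they checked the box

    def press(self, name, nearby):
        """`name` checks the box; `nearby` is the set of people within
        range. Returns the matched partner's name, or None if waiting."""
        for other in sorted(self.waiting, key=self.waiting.get):
            if other in nearby:
                del self.waiting[other]  # both boxes get unchecked
                return other             # second presser sees the first's name
        self.waiting[name] = time.time()
        return None

matcher = NewConversationMatcher()
print(matcher.press("alice", nearby=set()))    # None: alice waits
print(matcher.press("bob", nearby={"alice"}))  # 'alice': matched
```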
(ETA: a bolder and sillier solution is to have it be obvious who started the conversation even to observers who quickly scan the room, e.g. because the first person takes a designated sitting-on-your-own seat. Then in theory the positive update from being the first participant in a happening group should offset the negative update from sitting on your own.)
If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards “objective impact is sufficient”.
We’d like to build AI systems that help us resolve the tricky situation that we’re in: systems that help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.
I think the default “terrible” scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.
That may ultimately culminate with a catastrophic bang, but if it does it’s not going to be because we wanted the AI to have a small impact and it had a large impact. It’s probably going to be because we have a very limited idea what is going on, but we don’t feel like we have the breathing room to step back and chill out (at least not for long) because we don’t believe that everyone else is going to give us time.
If I’m trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does “low impact” mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?
(And realistically I doubt we’ll fail at alignment with a bang—it’s more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn’t let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)
It seems like “low objective impact” is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that’s a coherent situation to think about and plan for, but we shouldn’t mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it’s the best hope if you were very pessimistic about what I consider “mainline” alignment.)
The result in the paper is “No theory satisfies the three assumptions (Q, C, S).”
The table in the paper says that MWI violates assumption S and is “?” on the other two assumptions.
Unsurprisingly, when I look at the assumptions they all seem wrong or incoherent, depending on how you make the fuzzy statements precise. I’d guess most LW-ers are in a similar place (as are most quantum computing people probably), so this wouldn’t really change any minds around here.
(Also their discussion of many-worlds sounds a bit silly. Nowhere in their table of interpretations is the natural one, “the wavefunction is all there is.”)
Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
This likely involves introducing some asymmetry between the adversary’s task and the test-time task. Examples of possible asymmetries include: allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious, as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
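To make the quantitative point concrete, here is a toy calculation (every number is invented):

```python
# If adversarial training is far more likely to surface a bad input than
# deployment is to stumble on it, most malign models get filtered out.
p_malign = 0.01    # prior: training produced a model with some bad input
p_catch = 0.999    # P(adversarial training finds that input | it exists)
p_trigger = 0.01   # P(deployment hits that input | it exists, uncaught)

# Condition on the model passing adversarial training:
p_pass = p_malign * (1 - p_catch) + (1 - p_malign)
p_malign_given_pass = p_malign * (1 - p_catch) / p_pass
deployment_risk = p_malign_given_pass * p_trigger

print(f"P(malign | passed training) = {p_malign_given_pass:.1e}")  # ~1e-5
print(f"P(bad behavior deployed)    = {deployment_risk:.1e}")      # ~1e-7
# The more overwhelming the prior for malign behavior, the larger
# p_catch has to be to keep this risk acceptable.
```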
(I don’t know what “prevent” means here.)
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect.
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?” then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming is the prior for bad behavior.)
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution.
I don’t think this fact makes bad behavior unlikely on its own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of their cognition trying to sniff out whether they’re in the test distribution, and executing a treacherous turn if so.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)
But if e.g. the overseer is only inspecting the distilled agent’s justifications for its behavior, and something like its verbal loop, I don’t see how things can work out.
You can imagine the overseer as inspecting the agent’s actions, and probing the agent’s behavior in hypothetical situations. The overseer only “looks inside” the agent’s head as a way to help evaluate behavior or identify possibly problematic situations (and there is not necessarily any explicit looking, it could be a weight/activation sharing scheme where the overseer is able to draw on the agent’s internal cognitive state in the way that the agent is).
1. Past a certain capabilities level, distilled agents can meaningfully be modeled as expected utility maximizers for short-term rewards from their overseer. (This is because expected utility maximization is instrumentally convergent, the same way that believing 51 to be composite is instrumentally convergent, as per Eliezer’s comment.)
Agents can be viewed as expected utility maximizers as soon as their preferences over lotteries become consistent, which probably happens very early in training.
2. If the agents reason using TDT, the effective utility functions of these act-based agents will optimize for long-term instrumental preferences of their users, because in expectation, adopting this policy raises whatever short-term instrumental reward they’ll get.
This seems like a non sequitur / magical thinking. We can be more precise:
The policy which uses TDT to try to increase its own future reward will be straightforwardly selected against by the outer optimization.
The inner process may nevertheless use TDT if TDT doesn’t diverge from CDT on the training distribution, or it might learn to use TDT but “look nice” so that it doesn’t get selected against. From the perspective of my scheme this isn’t importantly different from e.g. the case where the agent just has a completely different goal and is waiting for a treacherous turn.
3. Expected utility maximizers for discounted future rewards of reward signals are hopelessly incorrigible.
I certainly agree that if we have an agent that is unaligned (but instrumentally deciding to appear aligned), that will be incorrigible. I don’t think that the other stuff about TDT adds anything beyond the usual situation though.
4. Agents produced through the distillation process, misaligned in this way, could enter a regime of devoting 99% of their cognition to doing exactly what a corrigible agent would do, and 1% of their cognition to finding strategies for determining how to / whether to execute a treacherous turn.
Yes, that’s always the threat model for adversarial training / verification / interpretability.
5. Because of how many ways there are to hide this 1% of cognition, a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.
Of course this is the step where I disagree. We can talk briefly about the possible options (what we care about is some kind of disjunction over these options, along with all other options that we haven’t considered):
Adversarial training. The hope is that you can construct situations that will elicit a treacherous turn. You could do this by a combination of (a) probing a very broad distribution over possible future situations, including every situation you can think of that could appear in the near future, (b) having weight/activation sharing between the agent and the adversary—the hope is that if the agent is able to use that 1% of its brain to think of whether to use a treacherous turn, the adversary could in the same way use that 1% in order to identify situations that might elicit a treacherous turn, (c) allowing the adversary to submit “abstract” attacks that cover a large space of possible situations.
Verification. Speaking very vaguely, the hope is that instead of merely running the model on individual inputs, you can run it on some kind of relaxed input (e.g. intervals, or putative moments of a distribution over inputs, etc.; a concrete sketch of one such relaxation appears after this list of options). Then we verify that not only does the model behave acceptably on actual inputs, it behaves acceptably on these relaxed inputs. I don’t have a simple non-technical description of this idea, but I feel relatively confident that the intuition behind your claim is not precise enough to actually rule out this kind of solution.
Interpretability. The hope is that you can look inside the model and see the “should I do a treacherous turn?” computation. If the agent is doing any optimization to obscure this reasoning from us, then we want to look for that optimization instead. So we are only up against normal gradient descent here. Gradient descent will be pushing against whatever kinds of treacherous turns can be detected by our machinery. But it will only push for things that improve behavior on the training distribution, so a treacherous turn can only appear where it is coupled with the machinery that is producing good predictions, giving us more traction for interpretability. In general, this looks like a tricky problem, but I don’t think you can get nearly as far as saying “a complete white-box understanding of the agent’s cognition is necessary to spot this sort of malign reasoning.” I think you should generally be skeptical when you have a vague intuitive appeal up against a universal quantifier.
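For concreteness, here is the promised sketch of one standard “relaxed input” for the verification option above: interval bound propagation through a tiny ReLU network (the weights are random placeholders; in a real verification setting you would check that no input in the box can produce an unacceptable output):

```python
import numpy as np

# Propagate a whole box [lo, hi] of inputs through the network at once,
# getting sound bounds on every output reachable from that box.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

def affine_bounds(lo, hi, W, b):
    """Exact interval image of x -> W @ x + b over the box [lo, hi]."""
    mu, r = (lo + hi) / 2, (hi - lo) / 2
    center, radius = W @ mu + b, np.abs(W) @ r
    return center - radius, center + radius

def network_bounds(lo, hi):
    lo, hi = affine_bounds(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
    return affine_bounds(lo, hi, W2, b2)

lo, hi = network_bounds(np.full(4, -0.1), np.full(4, 0.1))
print(list(zip(lo.round(2), hi.round(2))))  # guaranteed output ranges
```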
I’m currently intuiting that there’s a broad basin of “seeming corrigible until you can perform a treacherous turn”, but not a broad basin of true corrigibility
I don’t think this argument says very much about whether there is a broad basin of attraction around corrigibility; if your agent waits for a treacherous turn and then behaves badly, that’s not in the basin. The point of being a broad basin is that “executes a treacherous turn” now needs to be a discrete thing to kick you out of the basin, it can’t be an infinitesimal degradation of performance. But we still have the question: even if “bad behavior” is a discrete event, can we actually use techniques for optimizing worst-case performance to avoid it?
I mentioned two seemingly valid approaches, that would lead to different beliefs for the human, and asked how the AI could choose between them. You then went up a level of meta, to preferences over the deliberative process itself.
The AI was choosing what text to show Petrov. I suggested the AI choose the text based on the features that would lead Petrov (or an appropriate idealization) to say that one text or the other is better, e.g. informativeness, concision, etc. I wouldn’t describe that as “going up a level of meta.”
But I don’t think the meta preferences are more likely to be consistent—if anything, probably less so. And the meta-meta-preferences are likely to be completely underdefined, except in a few philosophers.
It seems to me like Petrov does have preferences about descriptions that the AI could provide, e.g. views about which are accurate, useful, and non-manipulative. And he probably has views about what ways of thinking about things are going to improve accuracy. If you want to call those “meta preferences” then you can do that, but then why think that those are undefined?
Also it’s not like we are passing to the meta level to avoid inconsistencies in the object level. It’s that Petrov’s object-level preference looks like “option #1 is better than option #2, but ‘whichever option I’d pick after thinking for a while’ is better than either of them.”
Doing corrigibility without keeping an eye on the outcome seems, to me, to be similar to many failed AI safety approaches—focusing on the local “this sounds good”, rather than on the global “but it may cause extinction of sentient life”.
This doesn’t seem right to me.
Though we are assuming that neither the AI nor the human is supposed to look at the conclusion, this may just result in either a random walk, or an optimisation pressure by hidden processes inside the definition.
Thinking about a problem without knowing the answer in advance is quite common. The fact that you don’t know the answer doesn’t mean that it’s a random walk. And the optimization pressure isn’t hidden—when I try to answer a question by thinking harder about it, there is a huge amount of optimization pressure to get to the right answer, it’s just that it doesn’t take the form of knowing which answer is correct and then backwards chaining from that to figure out what deliberative process would lead to the correct answer.
It doesn’t seem like you need sophisticated technology to “decide to make a decision without taking information X into account” in this case—the AI can just make the decision on the basis of particular features that aren’t X.
I’m saying Petrov has preferences over what text to read based on characteristics of the text (and implicitly over the deliberative process implied by that text)—does it make true claims, does it engage with his sympathies in a way that he endorses, does it get quickly to the point, etc.
Those preferences over text (and hence over deliberative process) will ultimately lead to Petrov’s preferences changing in one way or another, but it’s his preferences about text that imply his meta-preferences rather than the other way around.
Similarly, when I choose how I want to deliberate or reflect, I’m looking at the process of deliberation itself and deciding what process I think is best. That process then leads to some outcome, which I endorse because it was the outcome of the deliberative process I endorsed. I’m not picking a conclusion and then preferring the deliberative process that leads to that conclusion. If I’m in a state such that I’d prefer to pick a conclusion and then choose the deliberative process that leads to it, then I’m not deliberating (in the epistemic sense) at all; my preferences are already settled.
I don’t see any of this as conflicting with corrigibility. If the AI is involved in my deliberative process, whether by choosing how to explain something or what evidence to show me or whatever, then the corrigible thing to do is to (try to) help me deliberate in the way that I would prefer to deliberate (as opposed to influencing my deliberation in a way that is intended to achieve any other end). Of course my values will change; my values are constantly changing, and any AI that is embedded in the world in a realistic way is going to have an influence on the way our values change. The point of aligned AI in general is to help us get what we want, including what we want about the process by which our values change.
We seem to have a persistent disagreement about this point. I understand the position Wei Dai outlined in this thread and consider that to be an understandable quantitative disagreement—about the relative importance of value drift caused by errors in our understanding of deliberation (compared to what I consider the alignment problem proper). My view could change on that point, especially if I came to be more optimistic about narrow-sense alignment. If your position is different from that one, then I don’t yet understand it.
Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.
Petrov has all kinds of preferences about what kind of introductory text is “best,” and the goal of the AI is to give the one that Petrov considers best. Petrov’s preferences about introductory text will not be based on backwards chaining from the effects on Petrov (otherwise he wouldn’t need to read the textbook), they will be based on features of the text itself. Likewise, the AI’s decision shouldn’t be based on backwards chaining from the effects on Petrov.
The planner’s behavior is an example of implicit extortion (it even follows the outline from that post: initially rewarding the desired behavior at low cost, briefly paying a large cost to both reward and penalize, and then transitioning to very cheap extortion). An RL agent that can be manipulated to cooperate by this mechanism can just as easily be made to hand the planner daily “protection” money. This suggests that agents that are successful in the real world will probably be at least somewhat resistant to this kind of extortion (or else the world will be some kind of weird equilibrium of these extortion games), either constitutionally or because of legal protections against this kind of extortion.
It seems like a satisfying model of / solution to this problem should somehow leverage the fact that cooperation is positive sum, such that an agent ought to be OK with the outcome.
If you were to actually apply the ideas from this paper, I think the interesting work is done by society agreeing that the planner has the right to use coercive violence to achieve their desired end. At that point it seems easiest to just describe this as a law against defection. The role of the planner seems exactly analogous to a legislator, reaching agreement about how the planner ought to behave is exactly as hard as reaching agreement about legislation, and there is no way to achieve the same outcome without such an agreement. Interpreted as a practical guide to legislation, I don’t think this kind of heuristic adds much beyond conventional political economy.
(Of course, in a world with powerful AI systems, such laws will be enforced primarily by other AI systems. That seems like a tricky problem, but I don’t see it as being meaningfully distinct from the normal alignment problem.)