David Scott Krueger (formerly: capybaralet)(David Krueger)

Karma: 1,909

I’m more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Reward modeling and reward gaming
Aligning foundation models
Understanding learning and generalization in deep learning and foundation models, especially via “empirical theory” approaches
Preventing the development and deployment of socially harmful AI systems
Elaborating and evaluating speculative concerns about more advanced future AI systems

David Scott Krueger (formerly: capybaralet) 7 Jun 2024 1:39 UTC
LW: 4 AF: 2
0
AF
in reply to: Joe_Collman’s comment on: [Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”
Really interesting point!

I introduced this term in my slides that included “paperweight” as an example of an “AI system” that maximizes safety.

I sort of still think it’s an OK term, but I’m sure I will keep thinking about this going forward and hope we can arrive at an even better term.

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet)6 Jun 2024 18:55 UTC

69 points

2 comments6 min readLW link

(llm-safety-challenges.github.io)

David Scott Krueger (formerly: capybaralet) 15 Mar 2024 12:09 UTC
3 points
0
in reply to: eggsyntax’s comment on: Testing for consequence-blindness in LLMs using the HI-ADS unit test.
You could try to do tests on data that is far enough from the training distribution that it won’t generalize in a simple immitative way there, and you could do tests to try and confirm that you are far enough off distribution. For instance, perhaps using a carefully chosen invented language would work.

David Scott Krueger (formerly: capybaralet) 15 Mar 2024 12:07 UTC
LW: 2 AF: 1
0
AF
in reply to: TheFigMaster’s comment on: Quick thoughts on “scalable oversight” / “super-human feedback” research
I don’t disagree… in this case you don’t get agents for a long time; someone else does though.

David Scott Krueger (formerly: capybaralet) 6 Mar 2024 21:08 UTC
LW: 3 AF: 1
0
AF
in reply to: TheFigMaster’s comment on: Quick thoughts on “scalable oversight” / “super-human feedback” research
I meant “other training schemes” to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally “training” and more like “engineering”.

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet)24 Nov 2023 23:35 UTC

25 points

2 comments2 min readLW link

David Scott Krueger (formerly: capybaralet) 22 Nov 2023 20:38 UTC
LW: 2 AF: 1
AF
on: Reading the ethicists 2: Hunting for AI alignment papers
I would look at the main FATE conferences as well, which I view as being: FAccT, AIES, EEAMO.

David Scott Krueger (formerly: capybaralet) 14 Aug 2023 18:42 UTC
3 points
0
on: Ways I Expect AI Regulation To Increase Extinction Risk
I found this thought provoking, but I didn’t find the arguments very strong.

(a) Misdirected Regulations Reduce Effective Safety Effort; Regulations Will Almost Certainly Be Misdirected
(b) Regulations Generally Favor The Legible-To-The-State
(c) Heavy Regulations Can Simply Disempower the Regulator
(d) Regulations Are Likely To Maximize The Power of Companies Pushing Forward Capabilities the Most

Briefly responding:
a) The issue in this story seems to be that the company doesn’t care about x-safety, not that they are legally obligated to care about face-blindness.
b) If governments don’t have bandwidth to effectively vet small AI projects, it seems prudent to err on the side of forbidding projects that might pose x-risk.
c) I do think we need effective international cooperation around regulation. But even buying 1-4 years time seems good in expectation.
d) I don’t see the x-risk aspect of this story.

David Scott Krueger (formerly: capybaralet) 26 Jul 2023 23:15 UTC
LW: 5 AF: 3
0
AF
on: How LLMs are and are not myopic
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy.
Are you claiming this would happen even given infinite capacity?
If so, can you perhaps provide a simple+intuitive+concrete example?

David Scott Krueger (formerly: capybaralet) 19 Jul 2023 23:03 UTC
LW: 2 AF: 1
0
AF
on: What Discovering Latent Knowledge Did and Did Not Find
What do you mean by “random linear probe”?

David Scott Krueger (formerly: capybaralet) 11 Jul 2023 15:25 UTC
4 points
0
on: Deceptive AI vs. shifting instrumental incentives
I skimmed this. A few quick comments:
- I think you characterized deceptive alignment pretty well.
- I think it only covers a narrow part of how deceptive behavior can arise.
- CICERO likely already did some of what you describe.

David Scott Krueger (formerly: capybaralet) 29 Jun 2023 10:26 UTC
LW: 3 AF: 2
1
AF
on: Instrumental Convergence? [Draft]
So let us specify a probability distribution over the space of all possible desires. If we accept the orthogonality thesis, we should not want this probability distribution to build in any bias towards certain kinds of desires over others. So let’s spread our probabilities in such a way that we meet the following three conditions. Firstly, we don’t expect Sia’s desires to be better satisfied in any one world than they are in any other world. Formally, our expectation of the degree to which Sia’s desires are satisfied at $W$ is equal to our expectation of the degree to which Sia’s desires are satisfied at $W^{*}$ , for any $W, W^{*}$ . Call that common expected value $μ$ ″. Secondly, our probabilities are symmetric around $μ$ . That is, our probability that $W$ satisfies Sia’s desires to at least degree $μ + x$ is equal to our probability that it satisfies her desires to at most degree $μ - x$ . And thirdly, learning how well satisfied Sia’s desires are at some worlds won’t tell us how well satisfied her desires are at other worlds. That is, the degree to which her desires are satisfied at some worlds is independent of how well satisfied they are at any other worlds. (See the appendix for a more careful formulation of these assumptions.) If our probability distribution satisfies these constraints, then I’ll say that Sia’s desires are ‘sampled randomly’ from the space of all possible desires.

This is a characterization, and it remains to show that there exist distributions that fit it (I suspect there are not, assuming the sets of possible desires and worlds are unbounded).

I also find the 3rd criteria counterintuitive. If worlds share features, I would expect these to not be independent.

David Scott Krueger (formerly: capybaralet) 27 Jun 2023 10:15 UTC
15 points
7
on: Did Bengio and Tegmark lose a debate about AI x-risk against LeCun and Mitchell?
I think it might be more effective in future debates at the outset to:
* Explain that it’s only necessary to cross a low bar (e.g. see my Tweet below). -- This is a common practice in debates.
* Outline the responses they expect to hear from the other side, and explain why they are bogus. Framing: “Whether AI is an x-risk has been debated in the ML community for 10 years, and nobody has provided any compelling counterarguments that refute the 3 claims (of the Tweet). You will hear a bunch of counter arguments from the other side, but when you do, ask yourself whether they are really addressing this. Here are a few counter-arguments and why they fail...”—I think this could really take the wind out of the sails of the opposition, and put them on the back foot.

I also don’t think Lecun and Meta should be given so much credit—Is Facebook really going to develop and deploy AI responsibly?
1) They have been widely condemned for knowingly playing a significant role in the Rohingya genocide, have acknowledged that they failed to act to prevent Facebook’s role in the Rohingya genocide, and are being sued for $150bn for this.
2) They have also been criticised for the role that their products, especially Instagram, play in contributing to mental health issues, especially around body image in teenage girls.

More generally, I think the “companies do irresponsible stuff all the time” point needs to be stressed more. And one particular argument that is bogus is the “we’ll make it safe”—x-safety is a common good, and so companies should be expected to undersupply it. This is econ 101.

David Scott Krueger (formerly: capybaralet) 22 May 2023 22:01 UTC
2 points
on: capybaralet’s Shortform
Organizations that are looking for ML talent (e.g. to mentor more junior people, or get feedback on policy) should offer PhD students high-paying contractor/part-time work.

ML PhD students working on safety-relevant projects should be able to augment their meager stipends this way.

David Scott Krueger (formerly: capybaralet) 22 Apr 2023 16:51 UTC
4 points
7
on: On AutoGPT
That is in addition to all the people who will give their AutoGPT an instruction that means well but actually translates to killing all the humans or at least take control over the future, since that is so obviously the easiest way to accomplish the thing, such as ‘bring about world peace and end world hunger’ (link goes to Sully hyping AutoGPT, saying ‘you give it a goal like end world hunger’) or ‘stop climate change’ or ‘deliver my coffee every morning at 8am sharp no matter what as reliably as possible.’ Or literally almost anything else.

I think these mostly only translate into dangerous behavior if the model badly “misunderstands” the instruction, which seems somewhat implausible.

David Scott Krueger (formerly: capybaralet) 22 Apr 2023 16:48 UTC
8 points
5
on: On AutoGPT
One must notice that in order to predict the next token as well as possible the LMM will benefit from being able to simulate every situation, every person, and every causal element behind the creation of every bit of text in its training distribution, no matter what we then train the LMM to output to us (what mask we put on it) afterwards.

Is there any rigorous justification for this claim? As far as I can tell, this is folk wisdom from the scaling/AI safety community, and I think it’s far from obvious that it’s correct, or what assumptions are required for it to hold.

It seems much more plausible in the infinite limit than in practice.

David Scott Krueger (formerly: capybaralet) 22 Apr 2023 16:41 UTC
8 points
7
on: On AutoGPT
I have gained confidence in my position that all of this happening now is a good thing, both from the perspective of smaller risks like malware attacks, and from the perspective of potential existential threats. Seems worth going over the logic.
What we want to do is avoid what one might call an agent overhang.
One might hope to execute our Plan A of having our AIs not be agents. Alas, even if technically feasible (which is not at all clear) that only can work if we don’t intentionally turn them into agents via wrapping code around them. We’ve checked with actual humans about the possibility of kindly not doing that. Didn’t go great.

This seems like really bad reasoning…

It seems like the evidence that people won’t “kindly not [do] that” is… AutoGPT.
So if AutoGPT didn’t exist, you might be able to say: “we asked people to not turn AI systems into agents, and they didn’t. Hooray for plan A!”

Also: I don’t think it’s fair to say “we’ve checked [...] about the possibility”. The AI safety community thought it was sketch for a long time, and has provided some lackluster pushback. Governance folks from the community don’t seem to be calling for a rollback of the plugins, or bans on this kind of behavior, etc.

David Scott Krueger (formerly: capybaralet) 20 Apr 2023 23:32 UTC
8 points
6
in reply to: 1a3orn’s comment on: OpenAI could help X-risk by wagering itself
Christiano and Yudkowsky both agree AI is an x-risk—a prediction that would distinguish their models does not do much to help us resolve whether or not AI is an x-risk.

David Scott Krueger (formerly: capybaralet) 21 Mar 2023 3:22 UTC
LW: 2 AF: 1
2
AF
in reply to: Daniel Kokotajlo’s comment on: “Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)
I’m not necessarily saying people are subconsciously trying to create a moat.

I’m saying they are acting in a way that creates a moat, and that enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just write more clearly for non-academic audiences).
What links here?
- Daniel Kokotajlo's comment on “Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities) by David Scott Krueger (formerly: capybaralet) (21 Mar 2023 13:14 UTC; 2 points)

David Scott Krueger (formerly: capybaralet) 21 Mar 2023 3:15 UTC
LW: 2 AF: 1
0
AF
in reply to: Viliam’s comment on: “Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)
Q: “Why is that not enough?”
A: Because they are not being funded to produce the right kinds of outputs.

David Scott Krueger (formerly: capybaralet)(David Krueger)

[Link Post] “Foun­da­tional Challenges in As­sur­ing Align­ment and Safety of Large Lan­guage Models”

Test­ing for con­se­quence-blind­ness in LLMs us­ing the HI-ADS unit test.

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

Testing for consequence-blindness in LLMs using the HI-ADS unit test.