Researcher at MIRI
EA and AI safety
Thanks! Yeah, I definitely think that “it’s okay to slack today if I pull up the average later on” is a pretty common way people lose productivity. I think one framing could be that if you do have an off day, that doesn’t have to put you off track forever, and you can make up for it in the future.
I make the graphs using the [matplotlib xkcd mode](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xkcd.html). It's super easy to use: you just put your plotting code inside a `with plt.xkcd():` block.
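For example (just a minimal sketch, with made-up data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Everything plotted inside this block comes out in the hand-drawn xkcd style
with plt.xkcd():
    x = np.linspace(0, 10, 100)
    plt.plot(x, np.exp(-x / 5) * np.cos(x), label="made-up data")
    plt.xlabel("time")
    plt.ylabel("stuff")
    plt.legend()
    plt.show()
```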
Oh you’re right! Thanks for catching that. I think I was led astray because I wanted there to be a big payoff for averting the bad event, but I guess the benefit is just not having to pay D.
I’ll have a look and see how much this changes things
Edit: Fixed it up now, none of the conclusions seem to change (which is good because they seemed like common sense!). Thanks for reading this and pointing that out!
There’s a drug called Orlistat for treating obesity which works by preventing you from absorbing fats when you eat them. I’ve heard (somewhat anecdotally) that one of the main effects is forcing you to eat a low fat diet, because otherwise there are quite unpleasant ‘gastrointestinal side effects’ if you eat a lot of fat.
Thanks for your reply! I agree that I might be a little overly dismissive of the loss landscape frame. I mostly agree with your point about convergent gradient hackers existing at minima of the base-objective; I briefly commented on this (although not in the section on coupling):
> Here we can think of the strong coupling regime as being a local minimum in the loss landscape, such that any step to remove the gradient hacker leads locally to worse performance. In the weak coupling regime, the loss landscape will still increase in the directions which directly hurt performance on the mesa-objective, but gradient descent is still able to move downhill in a direction that erases the gradient hacker.
(I also agree with your rough sketch and had some similar notes which didn’t make it into the post)
I think I am still a little unsure how to mix the two frames of ‘the gradient hacker has control’ and ‘it’s all deterministic gradient descent + quirks’, which I guess is just the free will debate. It also seems like you could have a gradient hacker arise on a slope (which isn’t non-convergent) and then use its gradient hacking ways to manipulate how it is further modified, but this might just be a further case of a convergent gradient hacker.
The loss landscape frame doesn’t seem to really tell us how a gradient hacker would act or modify itself. From the ‘gradient hacker’ perspective there doesn’t seem to be much difference between manipulating the gradients via the outputs/cognition and manipulating the gradient via quirks/stop-gradients. It still seems like the non-convergent gradient hacker would also require coupling to get started in the first place; the model needs to be pretty smart to be able to know how to use these quirks/bugs and I think this needs to come from some underlying general ability.
One counter to this could be that there is a non-convergent gradient hacker which can ‘look at’ the information that the underlying capabilities have (eg the non-convergent gradient hacker can look for information on how to gradient hack). But it still seems here that there will be some parts of the gradient hacker which will have some optimization pressure on them from gradient descent, and so without coupling they would get eroded.
So do you think that the only way to get to AGI is via a learned optimizer?
I think that the definitions of AGI (and probably optimizer) here are maybe a bit fuzzy.
I think it’s pretty likely that it is possible to develop AI systems which are more competent than humans in a variety of important domains, which don’t perform some kind of optimization process as part of their computation.
I think this is a possible route to take, but I don’t think we currently have a good enough understanding of how to control/align mesa-optimizers to be able to do this.
I worry a bit that even if we correctly align a mesa-optimizer (make its mesa-objective aligned with the base-objective), its actual optimization process might be faulty/misaligned and so it would mistakenly spawn a misaligned optimizer. I think this faulty optimization process is most likely to happen sometime in the middle of training, where the mesa-optimizer is able to make another optimizer but not smart enough to control it.
But if we develop a better theory of optimization and agency, I could see it being safe to allow mesa-optimizers to create new optimizers. I currently think we are nowhere near that stage, especially considering that we don’t seem to have great definitions of either optimization or agency.
I’m from Dunedin and went to high school there (in the 2010s), so I guess I can speak to this a bit.
Co-ed schools were generally lower decile (=lower socio-economic backgrounds) than the single sex schools (here is data taken from wikipedia on this). The selection based on ‘ease of walking to school’ is still a (small) factor, but I expect this would have been a larger factor in the 70s when there was worse public transport. In some parts of NZ school zoning is a huge deal, with people buying houses specifically to get into a good zone (especially in Auckland) but this isn’t that much the case in Dunedin.
Based on (~2010s) stereotypes about the schools, rates of drug use seemed pretty similar between co-ed and all-boys schools, and less in all-girls. And rates of violence were higher in all-boys schools, and less in co-ed and all-girls. But this is just my impression, and likely decades too late.
I think there is likely to be a path from a model with shallow circuits to a model with deeper circuits which doesn’t need any ‘activation energy’ (it’s not stuck in a local minimum). For a model with many parameters, there are unlikely to be many places where all the derivatives of the loss wrt all the parameters are zero. There will almost always be at least one direction to move in which weakens the shallow circuit while strengthening the general one, and hence doesn’t really hurt the loss.
Hmm, this might come down to how independent the parameters are. I think in large networks there will generally be enough independence between the parameters for local minima to be rare (although possible).
As a toy example of moving from one algorithm to another, if the network is large enough we can just have the output be a linear combination of the two algorithms and up-regulate one while down-regulating the other:

$$\text{output} = \alpha \cdot f_1(x) + (1 - \alpha) \cdot f_2(x)$$

and $\alpha$ is changed from 1 to 0.
The network needs to be large enough so the algorithms don’t share parameters, so changing one doesn’t affect the performance of the other. I do think this is just a toy example, and it seems pretty unlikely to spontaneously have two multiplication algorithms develop simultaneously.
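Here is a rough numerical version of this (just a sketch: I'm hand-coding the two algorithms and sweeping $\alpha$ by hand rather than training anything, and I've made f1 a crude linear approximation and f2 exact multiplication so that there is a loss difference between them):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=(1000, 2))
y = x[:, 0] * x[:, 1]  # target: multiply the two inputs

def f1(x):
    # crude "shallow" algorithm: a fixed linear approximation to multiplication
    return 2.0 * x[:, 0] + 2.0 * x[:, 1] - 4.0

def f2(x):
    # "general" algorithm: actually multiplies
    return x[:, 0] * x[:, 1]

# Output is a linear combination of the two algorithms; sweep alpha from 1 to 0
for alpha in np.linspace(1.0, 0.0, 6):
    out = alpha * f1(x) + (1 - alpha) * f2(x)
    loss = np.mean((out - y) ** 2)
    print(f"alpha={alpha:.1f}  loss={loss:.3f}")
```

Since f2 matches the target exactly here, the loss falls monotonically as $\alpha$ goes from 1 to 0, so there's no barrier along this particular path (which is the 'no activation energy' point above).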
This is what I send people when they tell me they haven’t read Slate Star Codex and don’t know where to start.
Here are a bunch of lists of top SSC posts:
These lists are vaguely ranked in the order of how confident I am that they are good
https://guzey.com/favorite/slate-star-codex/
https://slatestarcodex.com/top-posts/
https://nothingismere.com/2015/09/12/library-of-scott-alexandria/
https://www.slatestarcodexabridged.com/ (if interested in psychology almost all the stuff here is good https://www.slatestarcodexabridged.com/Psychiatry-Psychology-And-Neurology)
https://danwahl.net/blog/slate-star-codex
I think there are probably podcast episodes for all of the posts listed below. The headings themselves aren’t ranked by importance, but within each heading the posts are roughly ranked. This list is missing any of the newer great ACX posts. I have also left off all the posts about drugs, but those are extremely interesting if you like just reading about drugs, and were what I first read of SSC.
If you are struggling to pick where to start I recommend either Epistemic Learned Helplessness or Nobody is Perfect, Everything is Commensurable.
The ‘Thinking good’ posts I think have definitely improved how I reason, form beliefs, and think about things in general. The ‘World is broke’ posts have had a large effect on how I see the world working, and what worlds we should be aiming for. The ‘Fiction’ posts are just really, really good short stories. The ‘Feeling ok about yourself’ posts have been extremely helpful for developing some self-compassion, and also for being compassionate and non-judgemental towards others; I think these posts specifically have made me a better person.
Posts I love:
Thinking good
https://slatestarcodex.com/2019/06/03/repost-epistemic-learned-helplessness/
https://slatestarcodex.com/2014/08/10/getting-eulered/
https://slatestarcodex.com/2013/04/13/proving-too-much/
https://slatestarcodex.com/2014/04/15/the-cowpox-of-doubt/
World is broke and how to deal with it
https://slatestarcodex.com/2014/07/30/meditations-on-moloch/
https://slatestarcodex.com/2014/12/19/nobody-is-perfect-everything-is-commensurable/
https://slatestarcodex.com/2013/07/17/who-by-very-slow-decay/
https://slatestarcodex.com/2014/02/23/in-favor-of-niceness-community-and-civilization/
Fiction
https://slatestarcodex.com/2018/10/30/sort-by-controversial/
https://slatestarcodex.com/2015/04/21/universal-love-said-the-cactus-person/
https://slatestarcodex.com/2015/08/17/the-goddess-of-everything-else-2/ (read after moloch stuff I think)
Feeling ok about yourself
https://slatestarcodex.com/2015/01/31/the-parable-of-the-talents/
https://slatestarcodex.com/2014/08/16/burdens/
Other??
https://slatestarcodex.com/2014/09/10/society-is-fixed-biology-is-mutable/
https://slatestarcodex.com/2018/04/01/the-hour-i-first-believed/ (this one is actually crazy, read it last)
This list of email scripts from 80,000 Hours also seems useful here https://80000hours.org/articles/email-scripts/
My read of Russell’s position is that if we can successfully make the agent uncertain about its model of human preferences, then it will defer to the human when it might do something bad, which hopefully solves (or at least helps with) making it corrigible.
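My (possibly wrong) mental model of the mechanism is something like the off-switch game analysis from Russell's group: if the agent is uncertain about how much the human values its proposed action, then letting the human veto it is (weakly) better in expectation than just acting. A minimal sketch, assuming a Gaussian belief over the human's utility for the action and a human who rationally vetoes exactly when that utility is negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Agent's belief about the human's utility u for its proposed action.
# Mean is slightly positive, so naively "just act" looks worthwhile.
mean, n = 0.1, 100_000

for sigma in [0.01, 0.5, 2.0]:             # how uncertain the agent is
    u = rng.normal(mean, sigma, n)          # samples from the agent's belief
    act = u.mean()                          # E[u] if it just acts
    defer = np.maximum(u, 0).mean()         # E[max(u, 0)] if the human can veto
    print(f"sigma={sigma:4.2f}  act={act:+.3f}  defer={defer:+.3f}  gain from deferring={defer - act:.3f}")
```

Deferring is never worse in expectation, and the gain from deferring grows with the agent's uncertainty, which is the sense in which preference uncertainty pushes towards corrigible behaviour (in this toy setting at least, and assuming the agent's belief isn't badly miscalibrated).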
I do agree that this doesn’t seem to help with the inner-alignment stuff, though; I’m still trying to wrap my head around this area.