By “refining pure human feedback”, do you mean refining RLHF ML techniques?
I assume you still view enhancing human feedback as valuable? And also more straightforwardly just increasing the quality of the best human feedback?
Amazing! Thanks so much for making this happen so quickly.
To anyone who’s trying to figure out how to get it to work on Google Podcasts, here’s what worked for me (searching for the name didn’t; maybe this will change?):
Go to the Libsyn link.
Click the RSS symbol.
Copy the link.
Go to Google Podcasts.
Click the Library tab (bottom right).
Go to Subscriptions.
Click the symbol in the upper right that looks like adding a link.
Paste link, confirm.
Hey Paul, thanks for taking the time to write that up, that’s very helpful!
Hey Rohin, thanks a lot, that’s genuinely super helpful. Drawing analogies to “normal science” seems both reasonable and like it clears the picture up a lot.
I would be interested to hear opinions about what fraction of people could plausibly produce useful alignment work.
Ignore the hurdle of “knowing about AI safety at all”; i.e., assume they took some time to engage with it (e.g. by taking the AGI Safety Fundamentals course). Also assume they got some good mentorship (e.g. from one of you) and then decided to commit full-time (and got funding for that). What I’m trying to get at is more about having the mental horsepower + epistemics + creativity + whatever other qualities are useful, or likely being able to get there after some years of training.
Also note that I mean directly useful work, not indirect meta things like outreach or being a PA to a good alignment researcher, etc. (these can be super important, but I think it’s productive to think of them as a distinct class). E.g. I would include being a software engineer at Anthropic, but exclude doing grocery shopping for your favorite alignment researcher.
An answer could look like “X% of the general population” or “half the people who could get a STEM degree at Ivy League schools if they tried” or “a tenth of the people who win the Fields medal”.
I think it’s useful to have a sense of this for many purposes, including questions about community growth and the value of outreach in different contexts, as well as priors about one’s own ability to contribute. Hence, I think it’s worth discussing honestly, even though it can obviously be controversial (with some possible answers implying that most current AI safety people are not being useful).
That’s a very detailed answer, thanks! I’ll have a look at some of those tools. Currently I’m limiting my use to a particular 10-minute window per day with freedom.to + the app BlockSite. It often costs me way more than 10 minutes of focus, though (checking links afterwards, procrastinating beforehand...), so I might try to find an alternative.
Sorry for the tangent, but how do you recommend engaging with Twitter, without it being net bad?
Thanks! Great to hear that it’s going well!
Maybe I’m being stupid here. On page 42 of the write-up, it says:
In order to ensure we learned the human simulator, we would need to change the training strategy to ensure that it contains sufficiently challenging inference problems, and that doing direct translation was a cost-effective way to improve speed (i.e. that there aren’t other changes to the human simulator that would save even more time). [emphasis mine]
Shouldn’t that be:
In order to ensure we learned the direct translator, …
We’re planning to evaluate submissions as we receive them, between now and the end of January; we may end the contest earlier or later if we receive more or fewer submissions than we expect.
Just wanted to note that the “we may end the contest earlier” part here makes me significantly more hesitant about trying this. I will probably still at least have a look at it, but part of me is afraid that I’ll invest a bunch of time and then the contest will be announced to be over before I got around to submitting. And I suspect Holden’s endorsement may make that more likely. It would be easier for me to invest time spread out over the next couple of weeks, than all in one go, due to other commitments. On the other hand, if I knew there was a hard deadline next Friday, I might try to find a way to squeeze it in.

I’m just pointing this out in case you hadn’t thought of it. I suspect something similar might be true for others too. Of course, it’s your prize and your rules, and if you prefer it this way, that’s totally fine.
Sorry for my very late reply!
Thanks for taking the time to answer; I no longer endorse most of what I wrote.
I think that if the AGI has a perfect motivation system then we win, there’s no safety problem left to solve. (Well, assuming it also remains perfect over time, as it learns new things and thinks new thoughts.) (See here for the difference between motivation and reward.)
and from the post:
And if we get to a point where we can design reward signals that sculpt an AGI’s motivation with surgical precision, that’s fine!
This is mostly where I went wrong. I.e. I assumed a perfect reward signal coming from some external oracle in examples where your entire point was that we didn’t have a perfect reward signal (e.g. wireheading).
So basically, I think we agree: a perfect reward signal may be enough in principle, but in practice it will not be perfect and may not be enough. At least not a single unified reward.
Disclaimer: I have only read the abstract of the “Reward is enough” paper. Also, I don’t have much experience in AI safety, but I’m considering changing that.
Here are a couple of my thoughts.
Your examples haven’t entirely convinced me that reward isn’t enough. Take the bird. As I see it, something like the following is going on:
Evolution chose to take a shortcut: maybe a bird with a very large brain and a lot of time would eventually figure out that singing is a smart thing to do if it received reward for singing well. But evolution, being a ruthless optimizer with many previous generations of experience, shaped two separate rewards in the way you described. Silver et al.’s point might be that when building an AGI, we wouldn’t have to take that shortcut, at least not by handcoding it.

Assume we have an agent that is released into the world and is trying to optimize reward. It starts out from scratch, knowing nothing, but with a lot of time and the computational capacity to learn a lot.

Such an agent has an incentive to explore. So it tries out singing for two minutes. It notes that in the first minute it got 1 unit of reward and in the second 2 (it got better!). All in all, 3 units is very little in this world, however, so maybe it moves on.

But as it gains more experience in the world, it notices that patterns like these can often be extrapolated. Maybe, with its two minutes of experience, if it sang for a third minute, it would get 3 units of reward? It tries, and yes indeed. Now it has an incentive to see how far it can take this. It knows the 4 units it expects from the next try will not be worth its time on their own, but the information about whether it could eventually get a million units per minute this way is very much worth the cost!
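The toy numbers above can be sketched in code. This is purely illustrative; the function names, the linear extrapolation rule, and the cost threshold are my own hypothetical choices, not anything from the paper or the original comment:

```python
# Illustrative sketch (hypothetical, not from Silver et al.): an agent that
# keeps exploring based on an extrapolated reward trend, not just the
# immediate payoff of the next attempt.

def extrapolate_next_reward(history):
    """Naive linear extrapolation: assume the last increment continues."""
    return history[-1] + (history[-1] - history[-2])

def worth_exploring(history, cost_per_minute, horizon):
    """Keep singing if the extrapolated cumulative reward over `horizon`
    more minutes beats the opportunity cost of that time, even when the
    very next minute is not worth it on its own."""
    trend = history[-1] - history[-2]
    expected = sum(history[-1] + trend * k for k in range(1, horizon + 1))
    return expected > cost_per_minute * horizon

# The bird's first two minutes: 1 unit of reward, then 2 units.
history = [1, 2]
print(extrapolate_next_reward(history))                         # 3
print(worth_exploring(history, cost_per_minute=2, horizon=10))  # True
```

The point of the sketch: a single scalar reward already motivates continued practice once the agent can extrapolate trends and value the information gained; the open question is whether the learning algorithm actually supports that kind of inference.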
Something kind of analogous should be true for the spider story. Reward very much provides an incentive for the agent to eventually figure out that after encountering a threat, it should change its behavior, not its interpretation of the world. At the beginning it might get this wrong, but it’s unfair to compare it to a human who has had this info “handcoded in” by evolution.

If our algorithms don’t allow an agent to learn to update differently in the future because past updates were unhelpful (I don’t know, pointers welcome!), then that’s not a problem with reward, it’s a problem with our algorithms! Maybe this is what you alluded to in your very last paragraph, where you speculated that they might just mean a more sophisticated RL algorithm?
Concerning the deceptive AGI etc., I agree that problems emerge when we don’t get the reward signal exactly right, and that it’s probably not safe to assume we will. But it might still be an interesting question how things would go assuming a perfect reward signal?

My impression is that their answer is “it would basically work”, while yours is something like “but we really shouldn’t assume that, and if we don’t, then it’s probably better to have separate reward signals etc.”. Given the bird example, I assume you don’t agree that things would work out fine even if we did have the best possible reward signal?
Also, I just want to mention that I agree the distinction between within-lifetime RL and intergenerational RL is useful, certainly in the case of biology and probably in machine learning too.
Alright, that’s helpful! Thanks!
Right at the start under “How to use this book”, there is this paragraph:
If you have never been to a CFAR workshop, and don’t have any near-term plans to come to one, then you may be the kind of person who would love to set their eyes on a guide to improving one’s rationality, full of straightforward instructions and exercises on how to think more clearly, act more effectively, learn more from your experiences, make better decisions, and do more with your life. This book is not that guide (nor is the workshop itself, for that matter). It was never intended to be that guide.
I’m partway through reading the sequences and have read plenty of other LW posts, but I won’t be able to attend a CFAR workshop anytime soon. Still, I want to get more experience with applied rationality. Would you say reading (parts of) this handbook and trying exercises that seem relevant to my situation is a reasonable thing to try (while minding all your precautions)? Or do you know of a better way to train one’s skills in applying rationality?