Here are some topics that I would be interested in talking about:
I would be interested in just talking with some people about “the basic case for AI X-risk”.
I’ve found it quite valuable to go back and forth with people, going quite slowly and without much reference to long chains of existing explanations, trying to explain what the exact mechanisms for AI risk are, and what our current models of the hope and plausibility of various approaches are.
Is there much of any substance to stuff like Anthropic’s “Responsible Scaling Policy”?
I like some of the concrete commitments, but when I read it, I mostly get the vibe of an abstract governance document that doesn’t actually commit Anthropic to much, and that is designed more to placate someone’s concerns than to set concrete incentives or help solve a specific problem. It feels too abstract and meta to me (whereas a paper or post with the title “when should we halt further capabilities scaling?” would have gone for the throat and said something concrete; instead we got a document that implicitly, in the margins, assumed some stuff about where the right place to halt is, which is really hard to argue with). But also, I don’t have much experience with governing large organizations and coordinating industries, so maybe you do have to talk at this level of abstraction.
How to build a community that integrates character evidence?
One of the biggest lessons I took away from FTX was that if I want to build communities and infrastructure that don’t end up being taken advantage of, I somehow need better tools for sharing character evidence, by which I mean evidence about behavioral patterns that didn’t necessarily directly harm anyone, but that are correlated with causing bad outcomes later. In the case of FTX, tons of people had access to important character evidence, but we had no infrastructure in place for sharing it, and indeed very few people knew about it (and the ones who did know were usually asked to treat the evidence as confidential). I have some ideas here, but still feel quite confused about what to actually do.
I’ve been bouncing off of the Quintin Pope critiques of the evolution analogy and sharp-left-turn arguments.
I would be pretty interested in debating this with someone who thinks I should spend more time on the critiques. Currently my feeling is that they make some locally valid points but don’t really engage with any of my cruxes. That said, I do get the sense they are supposed to respond to people with opinions like mine, so I would be excited to chat with someone about it.
Lots of people keep saying that AI Alignment research has accelerated a ton since people started doing experiments on large models. I feel quite confused by this: I currently feel like alignment research in the last 2-3 years has stayed at roughly the same pace or slowed down. I don’t really know what research people are referring to when they talk about progress.
This might just bottom out in some deep “prosaic vs. nonprosaic” alignment research disagreement, but that also seems fine. I feel like basically the only things that have happened in the last few years are some interpretability progress (though the pace of that seems roughly the same as it was back in 2017 when Chris Olah was working on Distill) and a bunch of variations on RLHF, which IMO don’t really help at all with the hard parts of the problem. I can imagine something cool coming out of Paul’s ELK-adjacent research, but nothing has so far, so that doesn’t really register as progress to me.
How do we actually use AI to make LessWrong better?
While I assign more probability to a fast takeoff than most other people do, which would give us less time to benefit from the fruits of AI before things go badly, I still assign double-digit probabilities to slower takeoffs, and it does seem worth trying to take advantage of that possibility to make AI Alignment go better. Short of automating researchers, what coordination and communication technology exists that could help make AI, and the development of the art of rationality, go better?
If you’ve written high-karma, high-quality posts or comments on LessWrong, or you’ve done other cool things that I’ve heard about, I would likely be up for giving a dialogue a try and setting aside something like 5-10 hours to make something good happen (though you should feel free to spend less time on it).
I would be interested in just talking with some people about “the basic case for AI X-risk”.
I’ve found it quite valuable to go back and forth with people, going quite slowly and without much reference to long chains of existing explanations, trying to explain what the exact mechanisms for AI risk are, and what our current models of the hope and plausibility of various approaches are.
I might be interested in this, depending on what qualifies as “basic”, and what you want to emphasize.
I feel like I’ve been getting into the weeds lately, or watching others get into the weeds, on how various recent alignment and capabilities developments affect what the near future will look like, e.g. how difficult particular known alignment sub-problems are likely to be, what solutions to them might look like, how right various people’s past predictions and models were, etc.
And to me, a lot of these results and arguments look mostly irrelevant to the core AI x-risk argument, whose conclusion is that once you have something actually smarter than humans hanging around, literally everyone drops dead shortly afterwards, unless a lot of things have gone right in a complicated way before then.
(Some of these developments might have big implications for how things are likely to go before we get to the simultaneous-death point, e.g. by affecting the likelihood that we screw up earlier and things go off the rails in some less predictable way.)
But basically everything we’ve recently seen looks like it is about the character of mind-space and the manipulability of minds in the below-human-level region, and this just feels to me like a very interesting distraction most of the time.
In a dialogue, I’d be interested in fleshing out why I think a lot of results about below-human-level minds are likely to be irrelevant, and where we can look for better arguments and intuitions instead. I also wouldn’t mind recapitulating (my view of) the core AI x-risk argument, though I expect I have fewer novel things to say on that, and the non-novel things I’d say are probably already better said elsewhere by others.
I might also be interested in having a dialogue on this topic with someone else if habryka isn’t interested, though I think it would work better if we’re not starting from viewpoints that are too far apart.
This sounds great! I’ll invite you to a dialogue, and then if you can shoot off an opening statement, we can get started.