habryka (Oliver Habryka)
Coding day in and out on LessWrong 2.0. You can reach me at habryka@lesswrong.com
I strong-disagreed on it when it was already at −7 or so, so I think it was just me and another person strongly disagreeing. I expected other people would vote it up again (and didn’t expect anything like consensus in the direction I voted).
Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the “learning to summarize from human feedback” work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
I think Rohin Shah doesn’t think of himself as having produced empirical work that helps with AI Alignment, but only as having produced empirical work that might help convince others of the importance of AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.
I haven’t gotten much out of work by Geoffrey Irving or Jan Leike (and don’t think I know many other people who have, or at least I haven’t really heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.
Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused about how you could not be excited about the theoretical work that established the whole problem, laid out the arguments for why it’s hard, and helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important.
My guess is you are somehow thinking of work like Superintelligence, or Eliezer’s original work, or Evan’s work on inner optimization as something different than “theoretical work”?
I mostly disagreed with bullet point two. The primary result of “empirical AI Alignment research” that I’ve seen in the last 5 years has been a lot of capabilities gain, with approximately zero progress on any AI Alignment problems. I agree more with the claim that “in the long run there will be a lot of empirical work to be done”, but right now, on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).
FWIW, I had a mildly negative reaction to this title. I agree with you, but I feel like the term “PSA” should be reserved for things that are really very straightforward and non-controversial, and I feel like it’s a bit of a bad rhetorical technique to frame your arguments as a PSA. I like the overall content of the post, but feel like a self-summarizing post title like “Most AI Alignment research is not parallelizable” would be better.
Different people use it for different purposes (just like real bookmarks!). I think the most common use case is to mark something to read later.
I’d say “mainstream opinion” (whether in ML broadly, in AI “safety” or “ethics,” or in AI policy) is generally focused on misuse relative to alignment, even without conditioning on “competitive alignment solution.” I normally disagree with this mainstream opinion, and I didn’t mean to endorse the opinion in virtue of its mainstream-ness, but to identify it as the mainstream opinion. If you don’t like the word “mainstream” or view the characterization as contentious, feel free to ignore it; I think it’s pretty tangential to my post.
Thanks, that clarifies things. I did misunderstand that sentence to refer to something like the “AI Alignment mainstream”, which feels like a confusing abstraction to me, though I feel like I could have figured it out if I had thought a bit harder before commenting.
For the record, my current model is that “AI ethics” or “AI policy” doesn’t really have a consistent model here, so I am not really sure whether I agree with you that this is indeed the opinion of most of the AI ethics or AI policy community. E.g. I can easily imagine an AI ethics article saying that if we have really powerful AI, the most important thing is not misuse risk, but either the moral personhood of the AIs or the “broader societal impact of the AIs”, both of which feel more misalignment-shaped, but I really don’t know (my model of AI ethics people is that they think whether the AI is misaligned affects whether it “deserves” moral personhood).
I do expect the AI policy community to be more focused on misuse, because they have a lot of influence from national security, which sure is generally focused on misuse and “weapons” as an abstraction, but I again don’t really trust my models here. During the Cold War a lot of the policy community got into a weird virtue-signaling arms race that ended up producing a strong consensus in favor of a weird flavor of cosmopolitanism, which I really didn’t expect when I first started looking into this, so I don’t really trust my models of what actual consensus will be when it comes to transformative AI (and don’t really trust current local opinions on AI to be good proxies for that).
I don’t think it’s about misaligned AI. I agree with the mainstream opinion that if competitive alignment is solved, humans deliberately causing trouble represent a larger share of the problem than misaligned AI.
Why is this a mainstream opinion? Where does this “mainstream” label come from? I don’t think almost anyone in the broader world has any opinions on this scenario, and from the people I’ve talked to in AI Alignment, this really doesn’t strike me as a topic I’ve seen any kind of consensus on. This to me just sounds like you are labeling people you agree with as “mainstream”. I don’t currently see a point in using words like “mainstream” and (the implied) “fringe” in contexts like this.
I disagree with that, don’t think it’s been argued for, and don’t think the surprisingness of the claim has even been acknowledged and engaged with.
This also seems to me to randomly throw in an elevated burden of proof, claiming that this claim is surprising, but that your implied opposite claim is not surprising, without any evidence. I find your claims in this domain really surprising, and I also haven’t seen you “acknowledge the surprisingness of [your] claim”. And I wouldn’t expect you to, because to you your claims presumably aren’t surprising.
Claiming that someone “hasn’t acknowledged the surprisingness of their claim” feels like a weird double-counting: trying to dock someone points for being wrong, and also trying to dock them points for not acknowledging that they are wrong, which feel like the same thing to me (except that the latter relies somewhat more on the absurdity heuristic, which seems bad to me in contexts like this).
Yep, there are last year’s books! https://www.lesswrong.com/books/2019
We very prominently linked them for a few months (they had a whole banner on the frontpage). You can now find them under the “Best Of” header in the library.
We don’t currently publish them as e-books, but you can see their content in the Best-of sequence above.
Kurzgesagt – The Last Human (YouTube)
It’s Harold Fisk’s map of the Mississippi River, which shows how the river changed course over time, combined with a watercolor style transfer.
It’s also the cover of one of the first set of LessWrong books that we published: https://www.lesswrong.com/books/2018
I don’t think it’s false!
For me it meant “I think this is a bad proposal”.
I like the sentence “I could also say this truthfully”, and I feel like it points towards the right generator that I have for what I would like “agree/disagree” to mean.
The tooltip of “Agree: Do you agree with the statements in this comment? Would the statements in this comment ring true if you said them yourself?” feels possibly good, though it sure is a bit awkward and I am not fully sure how reliably it would get the point across.
I think this would substantially reduce the value of the button overall. Like, I think I approximately never make a comment that doesn’t preface almost all of my (edit: not obviously correct) beliefs with “I think”, so this would cause no agree/disagree voting to happen on my comments.
Hmm, I think this would get rid of ~80% of the value for me, and also produce a lot of voting inconsistency, since how much an author writes “I think X” vs. just “X” (leaving the “I think” implicit) is pretty author-specific.
I much prefer getting data on whether people agree with X in that case, and would really value that information.
It’s dynamically determined based on the size of the post. Short posts don’t have them at the bottom, long posts do.
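If it helps, here is a minimal sketch of the kind of check involved, assuming a hypothetical word-count field and threshold (purely illustrative, not the actual LessWrong code):

```typescript
// Hypothetical sketch, not the actual LessWrong implementation.
// Whether the bottom-of-post UI renders depends on the length of the post.
const MIN_WORD_COUNT_FOR_BOTTOM_UI = 300; // assumed cutoff, purely illustrative

interface Post {
  wordCount: number;
}

// Long posts get the extra UI at the bottom; short posts don't.
function showBottomUI(post: Post): boolean {
  return post.wordCount >= MIN_WORD_COUNT_FOR_BOTTOM_UI;
}
```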
Hmm, what about language like
“Agree: Do you think the content of this comment is true? (Or if the comment is about an emotional reaction or belief of the author, does that statement resonate with you?)”
It sure is a mouthful, but it feels like it points towards a coherent cluster.
It’s true! Not fully clear how to fix this, since the whole architecture we’ve chosen kind of assumes the voting system is set at the post level.
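Roughly, the constraint looks something like the sketch below (field and type names are assumptions for illustration, not the real schema): the voting system is a property of the post, and comments inherit it from their parent post.

```typescript
// Hypothetical sketch, not the actual LessWrong schema.
// The voting system is configured once per post, so every comment under a
// post ends up using the same system.
type VotingSystemName = "default" | "twoAxis"; // assumed names, for illustration

interface Post {
  _id: string;
  votingSystem: VotingSystemName; // set at the post level
}

interface Comment {
  _id: string;
  postId: string;
  // No per-comment votingSystem field: a comment's voting UI is looked up
  // via its parent post, which is what makes per-comment overrides awkward.
}
```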
Yeah, definitely agree. I’ve been meaning to update this for a while, but haven’t gotten around to it. Lots of good stuff has been published in the last 1.5 years!