Curated. I have wanted someone to write out an assessment of how the Risks from Learned Optimization arguments hold up in light of the evidence we have acquired over the last half decade. I particularly appreciated the breakdown of the potential reasons for risk, the assessment of the degree to which we have encountered each problem, and the reassessment of the chances of running into those problems. I would love to see more posts that take arguments/models/concepts from before 2020, consider what predictions we should have made pre-2020 if those arguments/models/concepts were good, and then reassess them in light of our observations of progress in ML over the last five years.
Ronny Fernandez
Karma: 1,511
Lighthaven Sequences Reading Group #62 (Tuesday 12/23)
Lighthaven Sequences Reading Group #61 (Tuesday 12/16)
Lighthaven Sequences Reading Group #60 (Tuesday 12/9)
Lighthaven Sequences Reading Group #59 (Tuesday 12/2)
Lighthaven Sequences Reading Group #58 (Tuesday 11/25)
Lighthaven Sequences Reading Group #57 (Tuesday 11/18)
Lighthaven Sequences Reading Group #56 (Tuesday 11/11)
Curated. This is a simple and obvious argument with important implications that I had never heard before. I have heard similar considerations come up in conversations about whether someone should take a particular job at a capabilities lab, or whether a particular safety technique is worth working on, but it's valuable to generalize across those cases and have a central place for discussing the generalized argument.
I would love to see more pushback in the comments from those who are currently working on legible safety problems.
Curated. This does indeed seem like a common kind of bad argument around these parts which has not yet been named. I also appreciate Rohin's comment pointing out that it's not obvious what makes this kind of reasoning bad, as well as David Manheim's comment saying that what is needed is a way to distinguish the cases where bounded search works well from the cases where it works poorly. More generally, I like seeing posts that evaluate a common kind of reasoning, especially ones that inspire interesting engagement and/or disagreement in the replies. I would be excited to see more case studies on when this sort of reasoning works well or poorly, and maybe even a general theory to help us decide when it tends to work out well, e.g., when implemented by superforecasters across many topics.