+1, I started reading this because I thought it was about RadVac
Thanks, I really enjoyed this post—this was a novel but persuasive argument for not using binary predictions, and I now feel excited to try it out!
One quibble: when you discuss calculating your calibration, doesn’t this implicitly assume that your mean was accurate? If my mean is very off but my standard deviation is correct, then this method says my standard deviation is way too low. But maybe this is fine: if I have a history of getting the mean wrong, I should have a wider distribution?
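To make the worry concrete, here's a minimal sketch (Python, with made-up numbers — the setup and all parameters are just my illustration, not anything from the post) of what a coverage-style calibration check reports when the mean is biased but the standard deviation is right:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy setup (all numbers made up): for each question I predict N(0, 1).
# My SD of 1 matches the true spread, but my mean is off by `bias`.
n, bias, sd = 100_000, 1.0, 1.0
truth = rng.normal(bias, sd, n)

# Where does each outcome land within my predicted distribution?
pct = stats.norm.cdf(truth, loc=0.0, scale=sd)

# A standard calibration check: how often does the truth fall inside my
# central 50% and 90% intervals?
for level in (0.5, 0.9):
    lo = (1 - level) / 2
    coverage = np.mean((pct > lo) & (pct < 1 - lo))
    print(f"nominal {level:.0%} interval -> actual coverage {coverage:.0%}")

# Prints roughly 33% and 74% coverage: the check reads this as "your SD
# is too low", even though only the mean was biased.
```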
Thanks for the feedback! That makes sense, I’ve updated the intro paragraph to that section to:
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, think are most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Does that seem better?
For what it’s worth, my main bar was a combination of ‘do I understand this agenda well enough to write a summary’ and ‘do I associate at least one researcher and some concrete work with this agenda’. I wouldn’t think of corrigibility as passing the second bar, since I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It’s very possible I’ve missed out on some important work though, and I’d love to hear pushback on this.
Thanks a lot for the feedback, and the Anki cards! Appreciated. I definitely find that level of feedback motivating :)
These categories were formed by a vague combination of “what things do I hear people talking about/researching” and “what do I understand well enough that I can write intelligent summaries of it”—this is heavily constrained by what I have and have not read! (I am nowhere near as good as Rohin Shah at reading everything in Alignment :’( )
Eg, Steve Byrnes does a bunch of research that seems potentially cool, but I haven’t read much of it and don’t have a good sense of what it’s actually about, so I didn’t talk about it. And this is not expressing an opinion that, eg, his research is bad.
I’ve updated towards including a section at the end of each post/section with “stuff that seems maybe relevant that I haven’t read enough to feel comfortable summarising”.
Thanks for the appreciation!
If you’re trying to make it more legible to outsiders, you should consider defining AGI at the top.
Good idea, I just added this note to the top:
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even mean. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise alternative.
Thanks! I’m probably not going to have time to write a top-level post myself, but I liked Evan Hubinger’s post about it.
I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?
We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can’t quite interpret individual neurons, but we’ve found some cases where we can interpret what an individual attention head is doing.
I really love the essay Visual Information Theory.
Self review: I’m very flattered by the nomination!
Reflecting back on this post, a few quick thoughts:
I put a lot of effort into getting better at teaching, especially during my undergrad (publishing notes, mentoring, running lectures, etc). In hindsight, this was an amazing use of time, and has been shockingly useful in a range of areas. It makes me much better at field-building, facilitating fellowships, and writing up thoughts. Recently I’ve been reworking the pedagogy for explaining transformer interpretability work at Anthropic, and I’ve been shocked at how relevant all of this is.
A related idea is that of the Pareto Frontier. Most people are bad at teaching; this leads to, eg, Research Debt in academia. I’m a pretty great teacher, but not exactly world-class. But I’m a great mathematician, and trying to become a great AI Safety researcher, and there are very, very few people who are great at both—this gives me a lot of room to explore my comparative advantage by eg writing field-building docs.
I wish I’d better emphasised just how useful a skill this is.
A lot of the post centres on teaching in specific contexts. This is reasonable, since it’s what I know, but I wish I’d better clarified what would and would not generalise—I’m afraid people who see this post will bounce off because it doesn’t seem relevant to them.
I wish I’d given more caveats about teaching gone wrong. My experience teaching younger people who view me as high-status is that it’s very easy to appear over-confident. I try to caveat what I say, but I tend to present as fairly confident, and people often take me way too seriously. While the techniques I present here are very effective at teaching, they have the flipside of better inserting my knowledge into the student’s system 1 and bypassing some of their mental filters, which can be bad and eg lead to groupthink and lowered agency.
Some, such as the Socratic method, are better on this front, by at least giving me chances to notice if what I’m teaching is wrong.
Sometimes it may be good to deliberately be a bad teacher, to teach the students agency and give them room to grow on their own and to form their own ideas. It’s worth checking for this—I just reflexively use good teaching technique nowadays, and it’s hard to suppress.
Some ideas, such as the knowledge graph, are vague intuitions that it would have been good to operationalise more.
With all that said, I’d only been blogging for 3 weeks when I wrote this post, and I wrote it in an afternoon, so I’m really happy with this as an artefact to come out of that! I am so, so happy I decided to do a month of daily blogging
What fraction of these fizzled out because they were displaced by a fitter variant vs just not spreading further? That seems very important for figuring out how much to freak out.
+1, I was pretty surprised and confused by the 37% stat. If basically all of the labour here comes from taxpayer funded science, where on earth is 63% of the revenue going?!
Thanks for the post! I love a lot of these, and hadn’t come across some of them before :)
Google Docs quick create. Shortcut key or single click to automatically create a new Google document or spreadsheet. Saves a ton of time.
The URLs doc.new and sheet.new also do this, and are pretty low friction (though not quite single click!). They work on any computer, though.
Quickcompose. You know how easy it is to get distracted by your inbox when you need to send an email? Quick compose makes it so that you can open up a window that’s just a compose window so you can’t get distracted by new emails.
I really like the extension Inbox When Ready - it hides your inbox by default, unless you click on the ‘show inbox’ button. This is enough to reduce ‘compulsively open email and check things’, as well as giving this functionality.
I feel like I make enough minor edits to my comments (typos etc) that this would be really annoying—I’d feel significantly more constrained in my ability to make edits, because I’d know it would spam people with notifications. Maybe having a “send notifications?” toggle would help.
As a counter-point, my day was made significantly better by the front page being nuked in 2020 - it was exciting, novel, hilarious (by my lights—clearly not to some people), made some excellent points about phishing and security, and gave me opportunities to dissect why people oriented to this event differently from me. I expect my experience would have been less good last year had the phishing attempt not happened and we had all simply coordinated. More generally, when a website does something unusual and novel like this, I feel like the value of novelty and interestingness can outweigh the costs of a single day of disrupted use?
I’d further argue that the people highly invested in this seem much more invested in the abstract ideas of trust, community, shared ritual and cohesion than in the object level of the frontpage being down (besides, people can always use greaterwrong.com).
If it helps, here’s a comment I wrote last year trying to narrate my internal experience of reading the email (I then read the 2019 threads and eventually twigged how seriously people took it, but that was strongly not my prior—it wouldn’t even have occurred to me to ask the question ‘do people take this more seriously than a game?’).
I was one of the 270 last year and am one of the 100 this year, and I did not understand the context last year. Empirically, neither did Chris. Multiple people on the EA Forum have commented about not understanding the context.