Basically I’m saying that SOTA alignment research barely makes sense to call alignment research. This may sound harsh, so just to clarify: this isn’t a knock on it. I don’t follow it much, but some of it seems like good research. It’s definitely helpful to make AI risks more legible to other people, and some of this research helps with that. And arguably, on the margin, really good legibilization research is significantly more important than actual alignment research, because it helps slow down capabilities research, which is more likely to work and more likely to help soon.
Just from the perspective of “is this research building towards understanding how to align an AGI”, here’s a tacky analogy that maybe communicates a bit, where the task is “get to the moon, starting from Ancient Greece tech”:
SOTA would be mostly stuff like using the naked eye to make nicer sketches of the moon.
Some interpretability might be like developing stone-grinding methods; eventually you might use it to develop telescopes and look at the moon more closely, which is still irrelevant; but eventually, eventually, you could develop other lenses, like for cameras, which is still not a solution at all but would help.
Stuff like cheese vectors is like going around and mixing random substances together; it won’t work, but it’s kinda hitting at the right sector (because you could hit upon reactive substances, eventually leading you to find explosive substances, and eventually rocket propellant).
Most “agent foundations” would be like mathematicians screwing around. It’s interesting, it’s a healthy sort of research to have in your portfolio and in your community, and in fact it does build up in the long term towards calculus, which will be very relevant for calculating all sorts of things such as propellant amounts, trajectories, and lots of things in manufacturing; but it also takes a really, really long time.
Then there’s some secret thing, unknown to me and to anyone else AFAIK, that somehow makes you figure out alignment within 50 years ( / figure out rockets getting to the moon in 100 years, starting from Ancient Greece tech).