I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
Strong Evidence is Common
The Solomonoff Prior is Malign
An Intuitive Guide to Garrabrant Induction
The First Sample Gives the Most Information
[Question] What are your greatest one-shot life improvements?
Less Realistic Tales of Doom
Does SGD Produce Deceptive Alignment?
How to do theoretical research, a personal perspective
Intermittent Distillations #4: Semiconductors, Economics, Intelligence, and Technological Progress.
Rogue AGI Embodies Valuable Intellectual Property
Agents Over Cartesian World Models
Naively there are so few people working on interp, and so many people working on capabilities, that publishing is so good for relative progress. So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
In general, this post feels like it’s listing a bunch of considerations that are pretty small, and the 1st order consideration is just like “do you want people to know about this interpretability work”, which seems like a relatively straightfoward “yes”.
I also seperately think that LW tends to reward people for being “capabilities cautious” more than is reasonable, and once you’ve made the decision to not specifically work towards advancing capabilities, then the capabilities externalities of your research probably don’t matter ex ante.
Here’s a conversation that I think is vaguely analogous:
Alice: Suppose we had a one-way function, then we could make passwords better by...
Bob: What do you want your system to do?
Alice: Well, I want passwords to be more robust to...
Bob: Don’t tell me about the mechanics of the system. Tell me what you want the system to do.
Alice: I want people to be able to authenticate their identity more securely?
Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?
Alice: IDK I just think the world is likely to be generically a better place if we can better autheticate users.
Bob: Oh OK, we’re just going to create this user authetication technology and hope people use it for good?
Alice: Yes? And that seems totally reasonable?
It seems to me like you don’t actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.
This is related to complaints I have with [pivotal-act based] framings, but probably that’s a longer post.