I generally agree, although part of my point is that “if they are harmless” might still be too high of a bar if you have a certain disposition (which is likely more common than average in this community). See Elizabeth’s comment.
I agree with most of these, but I’m worried about 11, especially how people who are exceptionally risk-averse might interpret it.[1] There are just as many things that you might look back and regret not trying.[2]
Most people’s long-term regrets don’t come from having a drink once, or having a horrible sleep schedule for a semester, or becoming a socialist for a couple of years. Sure, if you sample from alcoholics, [edit: in a majority of cases] the first attempt was pivotal. But a ruined body or mind usually comes from persistent bad habits rather than a single try.
The thing about the regret of not trying enough things is that most people err on the other side, so it’s very rare to find people who express it. There were a couple of years of my life where I valued purity very highly, and in hindsight I think it caused more harm than good. But I’m still young; who knows what I’ll think in five years.
- ^
5 checks this somewhat, but I think this is significant enough to flag.
- ^
It is possible to spiral into madness wondering about these counterfactuals. Maybe I’ll regret it in ten years. Maybe I’ll regret not doing it in ten years. Maybe I probably won’t regret doing it, but in reality my utility would have been higher and I’ll never know.
I don’t think any of this is useful. The point I’m trying to make is that the regret counterfactual can go either way and you shouldn’t worry about it too much.
It’s an explore/exploit problem. Many times you don’t have complete enough knowledge to make an informed rational decision, because the only truly relevant data is your own life (see point 5 above). I think cases like this are what the gut is for.
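Tangentially, the explore/exploit framing has a standard toy formalization, the multi-armed bandit. A minimal epsilon-greedy sketch (purely illustrative; the arm payoffs and parameter values here are made up):

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=10_000, seed=0):
    """Toy multi-armed bandit: each arm pays out noisily around an
    unknown mean. With probability epsilon we explore (pull a random
    arm); otherwise we exploit the arm with the best observed average."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore
        else:
            # exploit: pick the arm with the highest running estimate
            arm = max(range(len(true_means)), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # incremental update of the running mean for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

estimates, total = epsilon_greedy([0.2, 0.5, 0.9])
best_arm = max(range(3), key=lambda a: estimates[a])
```

The relevant point is that never exploring (epsilon = 0) can lock you into a mediocre arm forever, while some exploration is wasted on arms you’ll regret trying; neither regret direction dominates a priori, which is roughly the point above.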
Genuine question: if AI capabilities research stopped today and larger models stopped being trained, wouldn’t AI alignment research effectively be halted?
I’m assuming that the primary goal of AI alignment research is to prevent AGI and ASI from being existential risks. My main question is, how can methods for AGI/ASI alignment be discovered before AGI/ASI exists?
AI alignment results tend to be either positive (“we succeeded in making Claude more honest”) or negative (“we got ChatGPT to kill someone”).
The positive results seem unlikely to generalize to larger versions of current models, much less to any novel architectures that will enable AGI and ASI. The major results on methods like sparse autoencoders and steering have already been difficult to reproduce.[1]
The negative results vindicate the concerns that AI alignment people already have. But the models are too small to demonstrate persistent, unintentional, dangerously misaligned goals, at least decisively enough to convince people who aren’t already worried.
One clear benefit to a pause would be time for policy to catch up. However, this might be like trying to draw a map for terrain that doesn’t exist yet. It would be like the Allies drawing up a nuclear treaty with the Axis powers before there was consensus that the nuclear bomb was actually possible.[2] It would be nice if everyone stopped and worked out a plan for global cooperation, but such a plan can only stabilize and achieve buy-in from the major players once both the underlying dangers and the distribution of power are clear enough to everyone involved.
A research pause could definitely still be a net good for humanity, but at present I don’t understand what this time would buy. If these conclusions make sense, they would maybe favor a slowdown (for safety to keep pace with capabilities) rather than a pause. But they are based on my rudimentary knowledge, and I would like to hear what more knowledgeable people have to say.
- ^
I haven’t read many papers, so please contest this if you have strong evidence against it. Here I’m specifically thinking of Anthropic’s sparse autoencoders paper.
- ^
Not in a counterfactual sense about the outcome of the war. My point is that attempting such a treaty would have been unsuccessful and wouldn’t have found substantive support on either side.
I am a student at an elite (top 3) university in the US. I find that the disposition towards work I have inherited from my father is inimical to my success here.
The true believer in the Labor ladder does not work for money. They do not seek to be extraordinarily rich or be free from the burden of work. They also do not work out of passion: living by “do what you love and you’ll never work a day in your life” is as unrealistic as waiting around for your One True Love to come along. Work is a matter of virtue and duty. You work because your work keeps the world turning. You work hard and take pride in the quality of your work.
At an elite college, the selection process creates two extremely distinct types. First, there are students who are world-class hoop-jumpers. The majority of them go into consulting and finance. They use the resources here as well as possible. My instinct is to write them off as grifters who do no Real Work, just translating network and credentials into more network and credentials.
Second, there are the students who attest to the quality of the system. They are either monomaniacal in their passion for one area or genuinely brilliant and prolific across many areas. I often wish I could be one of these people, because they do not need to strive or misrepresent anything to get employed. But that’s just a dream. I care about good work, I care about learning, but too broadly and not deeply enough to make things simple.
It feels like all employers now want candidates who are obsessed with the role they applied for. They want people from the second category, who dream about C++ or data analysis or mechanical engineering every night. But those people are rare, so they usually end up with people from the first category who are good at pretending to care.
What happened to the people who don’t love the job, but can just do it? Who just want to do hard work without being expected to have an undying passion for it? I can’t stand misrepresenting myself. It makes me nauseous. It’s a constant battle between my disposition and my fear of squandering what I have.
In his autobiography The World of Yesterday, Austrian author Stefan Zweig writes of his father:
That he never asked anything of anyone, that he was never obliged to say “please” or “thanks” to anyone, was his secret pride and meant more to him than any external recognition...it is out of the same secret pride that I have always declined every external honor; I have never accepted a decoration, a title, the presidency of any association, have never belonged to any academy, any committee, any jury. Merely to sit at a banquet table is torture for me; and the thought of asking someone for something-even if it is on behalf of a third person-dries my lips before the first word is spoken. I know how outmoded such inhibitions are in a world where one can remain free only through trickery and flight and where, as Father Goethe so wisely says, “decorations and titles ward off many a shove in the crowd.” But it is my father in me, and it is his secret pride that forces me back, and I may not offer opposition; for I thank him for what may well be my only definite possession-the feeling of inner freedom.
This expresses my own sentiment, which I did not choose so much as inherit.
I think the protagonist is one of the most compelling characters. However, he is almost entirely characterized by internal dialog. I would be cautious about comparing a real person to the protagonist without a deep personal relationship to that person and insight into how they think.
I was thinking more strongly of Arden Voss [edit: Vox], who is delightfully characterized but essentially flat and adds no insight into the psychology of the powerful tech figure. “They’re spiritually off the deep end and can’t be reasoned with” is not a novel or useful model.
You could argue that this is too much to ask of satire.