MIT senior majoring in computer science, economics, and data science, and double minoring in mathematics and philosophy. Interested in AI safety across a wide range of domains. Blog at
Peter Berggren
Why We Need More Shovel-Ready AI Notkilleveryoneism Megaproject Proposals
Thanks for the support. I’ll try to work a bit more on my first post in the coming days, and I hope it will be up soon.
I think you’ve done a really good job so far of setting up a series of clear exercises for techniques. The key issue I had with Hammertime was that it often seemed a bit disorganized, frequently changing plans and switching topics.
My most recent post on LessWrong (https://www.lesswrong.com/posts/yj2hyrcGMwpPooqfZ/a-proposal-for-iterated-interpretability-with-known), which is also my first post proposing a novel avenue for AI alignment research, took me a total of 30 minutes.
For me, the main thing that helped was probably Yoda Timers. Then again, that was probably just a function of getting to practice them much more than anything else. Next up is probably TAPs.
A proposal for iterated interpretability with known-interpretable narrow AIs
I have something very similar to the second felt sense given: when I’ve spent too much time on my computer, I get vaguely sleepy and disoriented when I try to stop even for a moment. The term I use is similar to the one my parents used to describe the tangible expression of this feeling: “video game poisoning.”
One rationality technique that I can infer from my past experiences is one I’m not really sure how to name; possibilities include “path divergence analysis,” “counterfactual defaults,” “adjacent life heuristic,” “near-miss solutions,” and “reality branch mining.” The idea is to look at what your common actions would be if your life had gone slightly differently (e.g. if you had gone to a different school or been born in a different country), and then check whether any of those actions have value in your current circumstances. One example I applied: I realized that if my college major were slightly different, I’d have learned to code in C++ by now; that learning C++ has significant advantages for me even with my major being what it is; and so I finally learned C++.
The closest I’ve come to a true “factory reset” was when I realized, a few times, that school clubs I was a part of were becoming toxic and unproductive. However, I can’t really point to a single button; it was more a gradual stream of one bad impression after another, after which I started to slowly disengage.
Set a Yoda Timer and share the most important idea you haven’t had time to express. Five minutes is all you get.
I really think that a lot of modern AI alignment research is being done within the academic system, but precisely because it’s done within the academic system, it’s largely ignored by the independent/dedicated nonprofit research community relative to that community’s own work. Conversely, it likely gets much more attention within academia.
I don’t think the dynamic here is “each team likes their own people best.” I think it’s due to a degree of skepticism toward the academic system that may be warranted in non-emergencies but is much less warranted when facing truly apocalyptic threats. The academic system has produced a lot of valuable research on climate change and nuclear risks, and I’d expect its research on AI to be broadly similar.
The fact that the first few successful researchers weren’t academics isn’t really a point against the academic system here, any more than Priestley and Lavoisier not being academics is a point against academic chemists. The supposed pre-paradigmaticity of the field also isn’t really a point against the academic system here, given that many protosciences (e.g. Freudian psychoanalysis) were able to grow into sciences within the academic system, and this pattern can be seen continuing in fields such as astrobiology.
My greatest ambition is to create a fully trainable art of rationality that’s so good it gets taught to every high schooler in the country and bankrupts multiple industries that prey on irrational behavior in the process. Although it may seem impossible, the success of anti-smoking efforts against an extremely addictive product backed by a massive advertising industry suggests that it’s achievable, and the fact that the Internet exists now and didn’t exist then suggests it may be even easier than that was.
Some of them, sure, but for a lot I’d be like “that’s completely outdated” and for others I’d be like “OK, that’s obviously meant to be a jab at some specific person you don’t like.”
Read The Sequences As If They Were Written Today
The worst case of the Planning Fallacy that I’ve experienced recently was my plan to finish a blog post in a week. Now, ten weeks later, I haven’t finished it. But when I actually started to work on it, I got a third of it done in half an hour.
I agree with you on this, but I also don’t think “sunk cost fallacy” is the right term to describe what you’re saying. The rational behavior here is to factor a random error term from mood swings into these calculations, and if you can’t fully factor it in, then generally err on the side of keeping projects going. I understand “sunk cost fallacy” to mean “factoring the amount of effort already spent into these decisions,” which does seem like a pure fallacy to me.
It’s reasonable, e.g. when about to watch a movie, to say “I’m in a bad mood, I don’t know how bad a mood I’m in, so even though I think the movie’s not worth watching, I’ll watch it anyway, because I don’t trust my current assessment and I decided to watch it when in a calmer state of mind.” The sunk cost fallacy is when you treat the decision differently depending on whether you bought the tickets yourself or were given them as a gift, which does seem, even in your apology for “sunk cost fallacy,” to remain a fallacy.
At any given point, you have some probability distribution over how worthwhile the project will be. The distribution can change over time, but it can change either for better or for worse. Therefore, at any point, if a rational agent expects it not to be worthwhile to expend the remaining effort to get the result, they should stop.
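To make that concrete, here is a minimal sketch (with made-up numbers and a hypothetical `should_continue` helper, not anything from the original discussion): the decision compares expected remaining value against remaining cost, and the amount already spent never enters into it.

```python
# Minimal sketch of the stopping rule described above.
# The only inputs are the value you still expect to get and the cost you
# still expect to pay; effort already spent (sunk cost) is not a parameter.

def should_continue(expected_remaining_value: float, remaining_cost: float) -> bool:
    """Continue the project iff expected remaining value exceeds remaining cost."""
    return expected_remaining_value > remaining_cost

# Hypothetical numbers: 50 units already spent, which is irrelevant either way.
already_spent = 50
print(should_continue(expected_remaining_value=10, remaining_cost=4))  # True: keep going
print(should_continue(expected_remaining_value=3, remaining_cost=4))   # False: stop
```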
Of course, if you are irrational and intentionally fail to account for evidence as a way of getting out of work, this does not apply, but then that’s the problem, not your lack of sunk costs.
Sorry if this is confusing. What I’m saying is, you have some estimate of the project’s valuation, and this factors in the information that you expect to get in the future about the project’s valuation (cf. Conservation of Expected Evidence). If there’s some chance the project will turn out worthwhile, you know that chance already. But there must also be some counterbalancing chance that the project will turn out even less worthwhile than you think.
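In equation form (a standard way of writing Conservation of Expected Evidence, with $V$ the project’s value and $E$ some piece of evidence you might observe later):

$$\mathbb{E}[V] \;=\; P(E)\,\mathbb{E}[V \mid E] \;+\; P(\neg E)\,\mathbb{E}[V \mid \neg E]$$

So if observing $E$ would raise your estimate of $V$, there must be a counterbalancing probability of $\neg E$ lowering it, and your current estimate already averages over both.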
It seems to me like the “random walk” case you described is poorly formed; the possibility of a project turning out to be worth it after all should already be factored into one’s estimate of how “worth it” it is. If it isn’t, then that’s a problem of motivated reasoning, not a reason to commit the sunk cost fallacy.
Intentionally inducing fallacious reasoning in oneself is classified as “Dark Arts” for a reason, especially since it can bias one’s own assessment of how well it turns out and whether to continue doing it.
Probably the most consequential trivial inconvenience for me recently was staying up very late (hours past when I planned to go to sleep) because my phone was right next to my bed. That happened because the alternative charging spot I had set up to prevent exactly this was mildly cluttered.
That’s sort of it, but it was specifically talking about certain types of self-deceptive behavior that appear to be instrumentally rational. The problem is that once you’ve deceived yourself, you can’t tell whether it was a good idea or not.