I am Issa Rice. https://issarice.com/
riceissa (Issa Rice)
I took the survey.
This is a minor point, but I am somewhat worried that the idea of research debt/research distillation seems to be getting diluted over time. The original article (which this post links to) says:
Distillation is also hard. It’s tempting to think of explaining an idea as just putting a layer of polish on it, but good explanations often involve transforming the idea. This kind of refinement of an idea can take just as much effort and deep understanding as the initial discovery.
I think the kind of cleanup and polish that is encouraged by the review process is insufficient to qualify as distillation, and insufficient to adequately deal with research debt (I know this post didn’t use the word “distillation”, but it does talk about research debt, and distillation is presented as the solution to debt in the original article).
There seems to be a pattern where a term is introduced first in a strong form, then it accumulates a lot of positive connotations, and that causes people to stretch the term to use it for things that don’t quite qualify. I’m not confident that is what is happening here (it’s hard to tell what happens in people’s heads), but from the outside it’s a bit worrying.
I actually made a similar comment a while ago about a different term.
Back in April, Oliver Habryka wrote:
Anna Salamon has reduced her involvement in the last few years and seems significantly less involved with the broader strategic direction of CFAR (though she is still involved in some of the day-to-day operations, curriculum development, and more recent CFAR programmer workshops). [Note: After talking to Anna about this, I am now less certain of whether this actually applies and am currently confused on this point]
Could someone clarify the situation? (Possible sub-questions: Why did Oliver get this impression? Why was he confused even after talking to Anna? To what extent and in what ways has Anna reduced her involvement in CFAR in the last few years? If Anna has reduced her involvement in CFAR, what is she spending her time on instead?)
I upvoted this post because I think it’s talking about some important stuff in a way (or tone or something) I somehow like better than some previous posts in the same general area.
But also something feels really iffy about the way the word “fun” is used in this post. If I think back to the only time in my life I actually had fun, which was my childhood, I sure did not have fun in the ways described in this post. I had fun by riding bikes (but never once stopping to get curious about how bike gears work), playing Pokemon with my friends (but not actually being very strategic about it—competitive battling/metagame would have been completely alien to my child’s self), making dorodango (but, like, just for the fun of it, not because I wanted to get better at making them over time, and I sure did not ever wonder why different kinds of mud made more stable or shinier dorodango, or what the shine even consists of), etc.
The kind of “fun” that is described in this post is, I think, something I learned from other people when I was in my early teens or so, not something I was born with (as this post seems to imply?). And I learned and developed this skill because I was told (in books like Feynman’s and the Sequences) that this is what actually smart people do.
So personally I feel like I am trying to get back the “original fun” that I experienced as a child, as well as trying to untangle the “useful/technical fun” from its social influences and trying to claim it as my own, or something, in addition to doing the kind of thing suggested by this post.
I’ve been having a mysterious chronic health problem for the past several years and have learned a bunch of things that I wish I knew back when all of this started. I am thinking about how to write down what I’ve learned so others can benefit, but what’s tricky here is that while the knowledge I’ve gained seems wide-ranging, it’s also extremely specific to whatever my problems are, so I don’t know how well it generalizes to other people. I welcome suggestions on how to make my efforts more useful to others. I also welcome pointers to books/articles/posts that already discuss the stuff below in a competent way.
But anyway here is some stuff I could talk about:
Rationality lessons of mysterious health problems: certain health conditions (like mine) are quite mysterious, e.g. having no clear cause or shifting symptoms or nonspecific symptoms. This makes the health problem challenging not only at the basic suffering/emotional level, but also at an epistemic level. Some weird epistemic stuff happens when you are dealing with such a health problem, including:
Your “most likely diagnosis” will keep shifting or will have a wide distribution, which can be confusing to reason about (it’s almost as if the health problem is an agent diagonalizing against me). My “most likely diagnosis” has changed like five times.
Some mistakes I think I made: reasoning too literally about symptoms and ruling things out too early, instead of just being like “ok maybe I have this thing” and then trying the low-effort/safe interventions just to see if they help.
Weird interacting nature of symptoms: ignoring certain symptoms because they aren’t the most painful can end up being a bad idea, since eliminating that symptom can help with a lot of other symptoms (the mind/body is weird and interconnected).
I think turning to certain quacks is actually rational in the case of certain chronic illnesses. These quacks were never the first choice for the ill person, but after conventional/established medicine’s interventions have all failed and established medicine basically shrugs, says “we don’t know what this even is”, and gives up on you, it makes sense to keep going anyway and try wackier things.
You need to do “rationality on hard mode”—when you’re stressed, when you have brain fog, when you have few productive hours in the day, when your emotions get all messed up.
There is a kind of “lawyery” thing you have to do, where you simulate the objections people will raise about things you should have done or things you should try, and you have to preempt all that and try it and be like “see? I already tried it” so that they don’t have easy outs.
How to deal with the health bureaucracy (US-specific, but what I know is even more specific): how to get the benefits you need from health providers, how to deal with insurance, how to get referrals, how to push providers with questions, optimizing which health insurance to have.
How to do health research: how to find information about symptoms, how to organize your research, how to ask good questions when meeting doctors, the importance of talking to a lot of people.
Specific things I’ve learned about different drugs, nootropics, health devices, practices, etc., and which ones seem the most promising.
General life outlook stuff:
How to orient toward “this being your new life”
How to stay motivated to live life and accomplish things while chronically ill; the hardcoreness of being ill for so long and what this does to your personality.
How to maintain a “health tracker”: how to keep track of your symptoms, what you did each day, what you ate, how you slept, etc. for future reference, and whether or not tracking any of this is useful.
Productivity hacks:
Daily goal-setting: how to get shit done even if you feel like shit every day.
The importance of having a “health buddy” who has similar health problems who you can talk to all the time, as having a chronic health problem can be very isolating (very few people can understand or support you in the way you need).
The importance of just trying lots of things to see what helps, and what this looks like in practice.
Basic health stuff that seems good to do regardless of what the cause of your symptoms is: nutrition, exercise, sleep, wackier stuff.
I have seen/heard from at least two sources something to the effect that MIRI/CFAR leadership (and Anna in particular) has very short AI timelines and a high probability of doom (and apparently holds these beliefs with high confidence). Here is the only public example that I can recall seeing. (Of the two examples I can specifically recall, this is not the better one, but the other was not posted publicly.) Is there any truth to these claims?
When I initially read this post, I got the impression that “subagents = path-dependent/incomplete DAG”. After working through more examples, it seems like all the work is being done by “committee requiring unanimous agreement” rather than by the “subagents” part.
Here are the examples I thought about (a toy code sketch of them follows the list):
Same as the mushroom/pepperoni situation, with the same two agents, but now each side can retaliate/hijack the rest of the mind if it doesn’t get what it wants. For example, if it starts at pepperoni, the mushroom-preferring agent will hijack the rest of the mind to remove the pepperoni, ending up at cheese. But if the agent starts at the “both” node, it will stay there (because both agents are satisfied). The preference relation can be represented as cheese < both, pepperoni < both, mushroom < both, with an extra arrow from pepperoni to cheese (and similarly from mushroom to cheese). This is still a DAG, and it’s still incomplete (in the sense that we can’t compare pepperoni vs mushroom), but it’s no longer path-dependent, because no matter where we start, we end up at cheese or “both” (I am assuming that toppings-removal can always be done, whereas acquiring new toppings can’t).
Same as the previous example, except now only the mushroom-preferring agent can retaliate/hijack (because the pepperoni-preferring agent is weak or nice). Now the preferences are pepperoni < cheese < mushroom < both. This is still a DAG, but now the preferences are total, so we can also view it as a (somewhat weird) single agent. A realistic example of this is given by Andrew Critch, where pepperoni=work, cheese=burnout (i.e. neither work nor friendship), mushroom=friendship, and both=friendship-and-work.
A modified version of the Zyzzx Prime planet by Scott Alexander. Now whenever we start out at pepperoni, the pepperoni-preferring agent becomes stupid/weak, and loses dominance, so now there are edges from pepperoni to mushroom and “both”. (And similarly, mushroom points to both pepperoni and “both”.) Now we no longer have a DAG because of the cycle between pepperoni and mushroom.
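To make the three examples concrete, here is a toy sketch (my own encoding, not anything from the original post) that writes each preference relation as a directed graph, where an edge X -> Y means “the committee will trade X for Y”, and then checks whether the relation is a DAG and which pairs of states remain incomparable:

```python
# Toy encoding of the three examples above (my own, not from the original post).
from itertools import combinations

STATES = ["cheese", "pepperoni", "mushroom", "both"]

EXAMPLES = {
    # Example 1: both agents can retaliate, so pepperoni and mushroom each get
    # traded down to cheese; "both" remains on top of everything.
    "both retaliate": {
        "pepperoni": ["cheese", "both"],
        "mushroom": ["cheese", "both"],
        "cheese": ["both"],
    },
    # Example 2: only the mushroom agent retaliates; the preferences become
    # the total order pepperoni < cheese < mushroom < both.
    "mushroom retaliates": {
        "pepperoni": ["cheese"],
        "cheese": ["mushroom"],
        "mushroom": ["both"],
    },
    # Example 3 (Zyzzx Prime variant): only the edges described above are
    # encoded (cheese's outgoing arrows are omitted, so cheese shows up as
    # incomparable here); pepperoni and mushroom point at each other.
    "Zyzzx Prime variant": {
        "pepperoni": ["mushroom", "both"],
        "mushroom": ["pepperoni", "both"],
    },
}

def reachable(graph, start):
    """States reachable from `start` by following one or more edges."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def is_dag(graph):
    """The relation is a DAG iff no state can reach itself."""
    return all(s not in reachable(graph, s) for s in STATES)

def incomparable_pairs(graph):
    """Pairs of states not ordered either way by the transitive closure."""
    return [(a, b) for a, b in combinations(STATES, 2)
            if b not in reachable(graph, a) and a not in reachable(graph, b)]

for name, graph in EXAMPLES.items():
    print(f"{name}: DAG={is_dag(graph)}, incomparable={incomparable_pairs(graph)}")
```

If my encodings are right, this should report that the first two relations are acyclic (the second with no incomparable pairs, i.e. a total order), while the third fails the DAG check because of the pepperoni/mushroom cycle.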
It seems like when people talk about the human mind being composed of subagents, the deliberation process is not necessarily “committee requiring unanimous agreement”, so the resulting preference relations cannot necessarily be represented using path-dependent DAGs.
It also seems like the general framework of viewing systems as subagents (i.e. not restricting to “committee requiring unanimous agreement”) is broad enough that it can basically represent any kind of directed graph. On one hand, this is suspicious (if everything can be viewed as a bunch of subagents, then maybe the subagents framework isn’t adding anything after all). On the other hand, this suggests that claims of subagents are not really about the resulting behavior/preference ordering of the system, but rather about the internal dynamics of the system.
Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
1. time horizon and time discounting: how far into the future does the preference reach? More generally, how much weight do we place on the present vs the future?
2. act-based (“short”) vs goal-based (“long”): using the human’s (or more generally, the human-plus-AI-assistants’; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
3. amount of reflection the human has undergone: “short” would be the current human (I think this is what you call “preferences-as-elicited”), and this would get “longer” as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the “longest” in this sense (I think this is what you call “preference-on-idealized-reflection”). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
4. how far the search goes: “short” would be a limited search (that lacks insight/doesn’t see interesting consequences) and “long” would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn’t strictly about preferences, but rather about how one would achieve those preferences.
5. de dicto (“short”) vs de re (“long”): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I’m not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
6. understandable (“short”) vs evaluable (“long”): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a “not even evaluable” option here that is even “longer”. (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)
My interpretation is that when you say “short-term preferences-on-reflection”, you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could “fill in the list” with which of short or long you choose for each point.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
For people who find this post in the future, Abram discussed several of the points in the bullet-point list above in Probability vs Likelihood.
I think Discord servers based around specific books are an underappreciated form of academic support/community. I have been part of such a Discord server (for Terence Tao’s Analysis) for a few years now and have really enjoyed being a part of it.
Each chapter of the book gets two channels: one to discuss the reading material in that chapter, and one to discuss the exercises in that chapter. There are also channels for general discussion, introductions, and a few other things.
Such a Discord server has elements of university courses, Math Stack Exchange, Reddit, independent study groups, and random blog posts, but is different from all of them:
Unlike courses (but like Math SE, Reddit, and independent study groups), all participation is voluntary so the people in the community are selected for actually being interested in the material.
Unlike Math SE and Reddit (but like courses and independent study groups), one does not need to laboriously set the context each time one wants to ask a question or talk about something. It’s possible to just say “the second paragraph on page 76” or “Proposition 6.4.12(c)” and expect to be understood, because there is common knowledge of what the material is and the fact that everyone there has access to that material. In a subject like real analysis where there are many ways to develop the material, this is a big plus.
Unlike independent study groups and courses (but like Math SE and Reddit), there is no set pace or requirement to join the study group at a specific point in time. This means people can just show up whenever they start working on the book without worrying that they are behind and need to catch up to the discussion, because there is no single place in the book everyone is at. This also makes this kind of Discord server easier to set up because it does not require finding someone else who is studying the material at the same time, so there is less cost to coordination.
Unlike random forum/blog posts about the book, a dedicated Discord server can comprehensively cover the entire book and has the potential to be “alive/motivating” (it’s pretty demotivating to have a question about a blog post which was written years ago and where the author probably won’t respond; I think reliability is important for making it seem safe/motivating to ask questions).
I also like that Discord has an informal feel to it (less friction to just ask a question) and can be both synchronous and asynchronous.
I think these Discord servers aren’t that hard to set up and maintain. As long as there is one person there who has worked through the entire book, the server won’t seem “dead” and it should accumulate more users. (What’s the motivation for staying in the server if you’ve worked through the whole book? I think it provides a nice review/repetition of the material.) I’ve also noticed that earlier on I had to answer more questions in early chapters of the book, but now there are more people who’ve worked through the early chapters who can answer those questions, so I tend to focus on the later chapters now. So my concrete proposal is that more people, when they finish working through a book, should try to “adopt” the book by creating a Discord server and fielding questions from people who are still working through the book (and then advertising in some standard channels like a relevant subreddit). This requires little coordination ability (everyone from the second person onward selfishly benefits by joining the server and does not need to pay any costs).
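For anyone tempted to try this, here is a rough sketch (my own, with placeholder values for the guild ID, bot token, and chapter count) of automating the per-chapter channel setup with the discord.py library:

```python
# Rough sketch: create the per-chapter channel structure described above
# using discord.py. GUILD_ID, BOT_TOKEN, and N_CHAPTERS are placeholders.
import discord

GUILD_ID = 0          # your server's ID
BOT_TOKEN = "..."     # your bot's token
N_CHAPTERS = 11       # e.g. one category per chapter of the book

client = discord.Client(intents=discord.Intents.default())

@client.event
async def on_ready():
    guild = client.get_guild(GUILD_ID)
    for chapter in range(1, N_CHAPTERS + 1):
        # One category per chapter, with a reading channel and an exercises channel.
        category = await guild.create_category(f"Chapter {chapter}")
        await guild.create_text_channel(f"ch{chapter}-reading", category=category)
        await guild.create_text_channel(f"ch{chapter}-exercises", category=category)
    for name in ("general", "introductions"):
        await guild.create_text_channel(name)
    await client.close()

client.run(BOT_TOKEN)
```

The structure is simple enough to create by hand too; the script just saves some clicking for a book with many chapters.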
I am uncertain how well this format would work for less technical books where there might not be a single answer to a question/a “ground truth” (which leaves room for people to give their opinions more).
(Thanks to people on the Tao Analysis Discord, especially pecfex for starting a discussion on the server about whether there are any similar servers, which gave me the idea to write this post, and Segun for creating the Tao Analysis Discord.)
Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven’t read the book and looked at the Amazon preview but couldn’t find the answer there.)
Back in the 2010s, EAs spent a long time dunking on doctors for not having a particularly high impact (I’m going off memory here, but I think “instead of becoming a doctor, why don’t you do X instead” was a common career pitch). I mostly unreflectively agreed with these opinions for a long time, and still think that doctors have less impact than stuff like x-risk reduction. But after having more personal experience dealing with the medical world (3 primary care doctors, ~10 specialist doctors, 2 psychiatrists, 2 naturopaths, 3 therapists, 2 nutritionists/dieticians, 2 coaching-type people, all in the last 4 years (I counted some people under multiple categories)), I think a really agenty/knowledgeable/capable doctor or therapist can actually have a huge impact on the world (just going by intuitions about how many even healthy-seeming people have a lot of health problems that bring down their productivity a lot, how crippling it is to have a mysterious health problem like mine, etc; I haven’t actually tried crunching numbers). I think such a person is not likely to look like a typical doctor working in a hospital system though… probably more like a writer/researcher who also happens to do consultations with people.
If I had to rewrite the EA pitch for people who wanted to become doctors it would be something like “First think very hard about why you want to become a doctor, and if what you want is not specific to working in healthcare then maybe consider [list of common EA cause areas]. If you really want to work in healthcare though, that’s great, but please consider becoming this weirder thing that’s not quite a doctor, where first you learn a bunch of rationality/math/programming and then you learn as much as you can about medical stuff and then try to help people.”
Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they’ve been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.
I was surprised to see, both on your website and in the white paper, that you are part of Mercatoria/ICTP (although your level of involvement isn’t clear based on public information). My surprise is mainly because you have a couple of comments on LessWrong that discuss why you have declined to join MIRI as a research associate. You have also (to my knowledge) never joined any other rationality-community or effective altruism-related organization in any capacity.
My questions are:
What are the reasons you decided to join or sign on as a co-author for Mercatoria/ICTP?
More generally, how do you decide which organizations to associate with? Have you considered joining other organizations, starting your own organization, or recruiting contract workers/volunteers to work on things you consider important?
We can model success as a combination of doing useful things and avoiding making mistakes. As a particular example, we can model intellectual success as a combination of coming up with good ideas and avoiding bad ideas. I claim that rationality helps us avoid mistakes and bad ideas, but doesn’t help much in generating good ideas and useful work.
Eliezer Yudkowsky has made similar points in e.g. “Unteachable Excellence” (“much of the most important information we can learn from history is about how to not lose, rather than how to win”, “It’s easier to avoid duplicating spectacular failures than to duplicate spectacular successes. And it’s often easier to generalize failure between domains.”) and “Teaching the Unteachable”.
Doesn’t do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer didn’t do that, there would still presumably be someone prior to him who did that, since conclusions and ideas have to enter the world somehow. So I’m not understanding what you’re modeling as ridiculous.
My understanding of the history is that Eliezer did not realize the importance of alignment at first, and that he only did so later after arguing about it online with people like Nick Bostrom. See e.g. this thread. I don’t know enough of the history here, but it also seems logically possible that Bostrom could have, say, only realized the importance of alignment after conversing with other people who also didn’t realize the importance of alignment. In that case, there might be a “bubble” of humans who together satisfy the null string criterion, but no single human who does.
The null string criterion does seem a bit silly nowadays, since I think the people who would have satisfied it would most likely have read about AI risk on e.g. LessWrong first, so they wouldn’t even get the chance to live to age ~21 to see if they spontaneously invent the ideas.
As should be clear, this process can, after a few iterations, produce a situation in which most of those who have engaged with the arguments for a claim beyond some depth believe in it.
This isn’t clear to me, given the model in the post. If a claim is false and there are sufficiently many arguments for the claim, then it seems like everyone eventually ends up rejecting the claim, including those who have engaged most deeply with the arguments. The people who engage deeply “got lucky” by hearing the most persuasive arguments first, but eventually they also hear the weaker arguments and counterarguments to the claim, so they end up at a level of confidence where they don’t feel they should bother investigating further. These people can even have more accurate beliefs than the people who dropped out early in the process, depending on the cutoff that is chosen.
I had already seen all of those quotes/links, all of the quotes/links that Rob Bensinger posts in the sibling comment, as well as this tweet from Eliezer. I asked my question because those public quotes don’t sound like the private information I referred to in my question, and I wanted insight into the discrepancy.
I’ve made around 250 Anki cards about AI safety. I haven’t prioritized sharing my cards because I think finding a specific card useful requires someone to have read the source material generating the card (e.g. if I made the card based on a blog post, one would need to read that exact blog post to get value out of reviewing the card; see learn before you memorize). Since there are many AI safety blog posts and I don’t have the sense that lots of Anki users read any particular blog post, it seems to me that the value generated from sharing a set of cards about a blog post isn’t high enough to overcome the annoyance cost of polishing, packaging, and uploading the cards.
More generally, from a consumer perspective, I think people tend to be pretty bad at making good Anki cards (I’m often embarrassed by the cards I created several months ago!), which makes it unexciting for me to spend a lot of effort trying to collaborate with others on making cards (because I expect to receive poorly-made cards in return for the cards I provide). I think collaborative card-making can be done though, e.g. Michael Nielsen and Andy Matuschak’s quantum computing guide comes with pre-made cards that I think are pretty good.
Different people also have different goals/interests so even given a single source material, the specifics one wants to Ankify can be different. For example, someone who wants to understand the technical details of logical induction will want to Ankify the common objects used (market, pricing, trader, valuation feature, etc.), the theorems and proof techniques, and so forth, whereas someone who just wants a high-level overview and the “so what” of logical induction can get away with Ankifying much less detail.
Something I’ve noticed is that many AI safety posts aren’t very good at explaining things (not enough concrete examples, not enough emphasis on common misconceptions and edge cases, not enough effort to answer what I think of as “obvious” questions); this fact is often revealed by the comments people make in response to a post. This makes it hard to make Anki cards because one doesn’t really understand the content of the post, at least not well enough to confidently generate Anki cards (one of the benefits of being an Anki user is having a greater sensitivity to when one does not understand something; see “illusion of explanatory depth” and related terms). There are other problems like conflicting usage of terminology (e.g. multiple definitions of “benign”, “aligned”, “corrigible”) and the fact that some of the debates are ongoing/some of the knowledge is still being worked out.
For “What would be a good strategy for generating useful flashcards?”: I try to read a post or a series of posts, and once I feel that I understand the basic idea, I will usually reread it to add cards about the basic terms and simple questions to ask myself. Some example cards for iterated amplification (a small packaging sketch follows the list):
what kind of training does the Distill step use?
in the pseudocode, what step gets repeated/iterated?
how do we get A[0]?
write A[1] in terms of H and A[0]
when Paul says IDA is going to be competitive with traditional RL agents in terms of time and resource costs, what exactly does he mean?
advantages of A[0] over H
symbolic expression for the overseer
why should the amplified system (of human + multiple copies of the AI) be expected to perform better than the human alone?
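If I ever did get around to packaging cards like these for sharing, a minimal sketch using the third-party genanki library might look like the following (the card fronts are from the list above; the backs are my own rough answers, so double-check them against Paul’s posts):

```python
# Minimal sketch: package a few Q/A cards into a shareable .apkg file with
# the third-party genanki library. The answer fields are my own rough answers.
import genanki

# IDs generated once with random.randrange(1 << 30, 1 << 31) and hard-coded,
# as the genanki docs recommend.
MODEL_ID = 1528394027
DECK_ID = 1746283950

model = genanki.Model(
    MODEL_ID,
    "Simple Q/A",
    fields=[{"name": "Question"}, {"name": "Answer"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Question}}",
        "afmt": "{{FrontSide}}<hr id='answer'>{{Answer}}",
    }],
)

deck = genanki.Deck(DECK_ID, "AI safety::Iterated amplification")

cards = [
    ("How do we get A[0]?",
     "Train it directly from the human H (e.g. by imitation)."),
    ("Write A[1] in terms of H and A[0].",
     "A[1] = Distill(Amplify(H, A[0]))."),
]
for question, answer in cards:
    deck.add_note(genanki.Note(model=model, fields=[question, answer]))

genanki.Package(deck).write_to_file("iterated_amplification.apkg")
```

The resulting .apkg file can be imported directly into Anki, which is about as low-friction as sharing gets; the hard part is still writing cards that make sense to someone who has read the source post.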
What are your thoughts on Duncan Sabien’s Facebook post which predicts significant differences in CFAR’s direction now that he is no longer working for CFAR?