I am Issa Rice. https://issarice.com/
I took the survey.
This is a minor point, but I am somewhat worried that the idea of research debt/research distillation seems to be getting diluted over time. The original article (which this post links to) says:
Distillation is also hard. It’s tempting to think of explaining an idea as just putting a layer of polish on it, but good explanations often involve transforming the idea. This kind of refinement of an idea can take just as much effort and deep understanding as the initial discovery.
I think the kind of cleanup and polish that is encouraged by the review process is insufficient to qualify as distillation, and insufficient to adequately deal with research debt (I know this post didn’t use the word “distillation”, but it does talk about research debt, and distillation is presented as the solution to debt in the original article).
There seems to be a pattern where a term is introduced first in a strong form, then it accumulates a lot of positive connotations, and that causes people to stretch the term to use it for things that don’t quite qualify. I’m not confident that is what is happening here (it’s hard to tell what happens in people’s heads), but from the outside it’s a bit worrying.
I actually made a similar comment a while ago about a different term.
Back in April, Oliver Habryka wrote:
Anna Salamon has reduced her involvement in the last few years and seems significantly less involved with the broader strategic direction of CFAR (though she is still involved in some of the day-to-day operations, curriculum development, and more recent CFAR programmer workshops). [Note: After talking to Anna about this, I am now less certain of whether this actually applies and am currently confused on this point]
Could someone clarify the situation? (Possible sub-questions: Why did Oliver get this impression? Why was he confused even after talking to Anna? To what extent and in what ways has Anna reduced her involvement in CFAR in the last few years? If Anna has reduced her involvement in CFAR, what is she spending her time on instead?)
I have seen/heard from at least two sources something to the effect that MIRI/CFAR leadership (and Anna in particular) has very short AI timelines and a high probability of doom (and apparently holds these beliefs with high confidence). Here is the only public example that I can recall seeing. (Of the two examples I can specifically recall, this is not the better one, but the other was not posted publicly.) Is there any truth to these claims?
Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
1. time horizon and time discounting: how far into the future does the preference reach? More generally, how much weight do we place on the present vs the future?
2. act-based (“short”) vs goal-based (“long”): using the human’s (or more generally, the human-plus-AI-assistants’; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
3. amount of reflection the human has undergone: “short” would be the current human (I think this is what you call “preferences-as-elicited”), and this would get “longer” as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the “longest” in this sense (I think this is what you call “preference-on-idealized-reflection”). This sense further breaks down into whether the human itself is actually doing the reflection, or whether the AI is instead predicting what the human would think after reflection.
4. how far the search goes: “short” would be a limited search (one that lacks insight/doesn’t see interesting consequences) and “long” would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn’t strictly about preferences, but rather about how one would achieve those preferences.
5. de dicto (“short”) vs de re (“long”): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I’m not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
6. understandable (“short”) vs evaluable (“long”): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a “not even evaluable” option here that is even “longer”. (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)
My interpretation is that when you say “short-term preferences-on-reflection”, you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could “fill in the list” with which of short or long you choose for each point.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
For people who find this post in the future, Abram discussed several of the points in the list above in Probability vs Likelihood.
When I initially read this post, I got the impression that “subagents = path-dependent/incomplete DAG”. After working through more examples, it seems like all the work is being done by “committee requiring unanimous agreement” rather than by the “subagents” part.
Here are the examples I thought about (a short code sketch of their dynamics follows the list):
Same as the mushroom/pepperoni situation, with the same two agents, but now each side can retaliate/hijack the rest of the mind if it doesn’t get what it wants. For example, if it starts at pepperoni, the mushroom-preferring agent will hijack the rest of the mind to remove the pepperoni, ending up at cheese. But if the agent starts at the “both” node, it will stay there (because both agents are satisfied). The preference relation can be represented as the original diagram plus an extra arrow from pepperoni to cheese (and, by symmetry, one from mushroom to cheese). This is still a DAG, and it’s still incomplete (in the sense that we can’t compare pepperoni vs mushroom) but it’s no longer path-dependent, because no matter where we start, we end up at cheese or “both” (I am assuming that toppings-removal can always be done, whereas acquiring new toppings can’t).
Same as the previous example, except now only the mushroom-preferring agent can retaliate/hijack (because the pepperoni-preferring agent is weak or nice). Now the preferences are pepperoni < cheese < mushroom < both. This is still a DAG, but now the preferences are total, so we can also view it as a (somewhat weird) single agent. A realistic example of this is given by Andrew Critch, where pepperoni=work, cheese=burnout (i.e. neither work nor friendship), mushroom=friendship, and both=friendship-and-work.
A modified version of the Zyzzx Prime planet by Scott Alexander. Now whenever we start out at pepperoni, the pepperoni-preferring agent becomes stupid/weak, and loses dominance, so now there are edges from pepperoni to mushroom and “both”. (And similarly, mushroom points to both pepperoni and “both”.) Now we no longer have a DAG because of the cycle between pepperoni and mushroom.
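To make the path-dependence claims in these examples easy to check, here is a minimal Python sketch of the dynamics (the state names, helper functions, and the remove-toppings-only rule are my own framing of the examples above, not anything from the original post):

```python
# Minimal sketch of the "committee requiring unanimous agreement" dynamics
# from the examples above. State names and the remove-toppings-only rule are
# my own framing, not code from the post.

STATES = ["cheese", "pepperoni", "mushroom", "both"]

def step(state, mushroom_agent_retaliates, pepperoni_agent_retaliates):
    """One step: an unsatisfied subagent that can retaliate strips the other
    subagent's topping (toppings can only be removed, never added)."""
    if state == "pepperoni" and mushroom_agent_retaliates:
        return "cheese"  # mushroom-preferring subagent removes the pepperoni
    if state == "mushroom" and pepperoni_agent_retaliates:
        return "cheese"  # pepperoni-preferring subagent removes the mushroom
    return state  # "cheese" (nothing left to remove) and "both" (both satisfied) are stable

def end_state(start, m_retaliates, p_retaliates):
    """Follow the dynamics from `start` until nothing changes."""
    state = start
    while True:
        nxt = step(state, m_retaliates, p_retaliates)
        if nxt == state:
            return state
        state = nxt

# Example 1: both subagents retaliate -- every start ends at cheese or "both",
# so the outcome is not path-dependent.
print({s: end_state(s, True, True) for s in STATES})
# {'cheese': 'cheese', 'pepperoni': 'cheese', 'mushroom': 'cheese', 'both': 'both'}

# Example 2: only the mushroom-preferring subagent retaliates -- pepperoni
# collapses to cheese while mushroom survives, matching the total ordering above.
print({s: end_state(s, True, False) for s in STATES})
# {'cheese': 'cheese', 'pepperoni': 'cheese', 'mushroom': 'mushroom', 'both': 'both'}
```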
It seems like when people talk about the human mind being composed of subagents, the deliberation process is not necessarily “committee requiring unanimous agreement”, so the resulting preference relations cannot necessarily be represented using path-dependent DAGs.
It also seems like the general framework of viewing systems as subagents (i.e. not restricting to “committee requiring unanimous agreement”) is broad enough that it can basically represent any kind of directed graph. On one hand, this is suspicious (if everything can be viewed as a bunch of subagents, then maybe the subagents framework isn’t adding anything after all). On the other hand, this suggests that claims of subagents are not really about the resulting behavior/preference ordering of the system, but rather about the internal dynamics of the system.
Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven’t read the book and looked at the Amazon preview but couldn’t find the answer there.)
Rob, are you able to disclose why people at Open Phil are interested in learning more decision theory? It seems a little far away from the AI strategy reports they’ve been publishing in recent years, and it also seemed like they were happy to keep funding MIRI (via their Committee for Effective Altruism Support) despite disagreements about the value of HRAD research, so the sudden interest in decision theory is intriguing.
I was surprised to see, both on your website and in the white paper, that you are part of Mercatoria/ICTP (although your level of involvement isn’t clear from public information). My surprise is mainly because you have a couple of comments on LessWrong that discuss why you have declined to join MIRI as a research associate. You have also (to my knowledge) never joined any other rationality-community or effective-altruism-related organization in any capacity.
My questions are:
What are the reasons you decided to join or sign on as a co-author for Mercatoria/ICTP?
More generally, how do you decide which organizations to associate with? Have you considered joining other organizations, starting your own organization, or recruiting contract workers/volunteers to work on things you consider important?
I think Discord servers based around specific books are an underappreciated form of academic support/community. I have been part of such a Discord server (for Terence Tao’s Analysis) for a few years now and have really enjoyed being a part of it.
Each chapter of the book gets two channels: one to discuss the reading material in that chapter, and one to discuss the exercises in that chapter. There are also channels for general discussion, introductions, and a few other things.
Such a Discord server has elements of university courses, Math Stack Exchange, Reddit, independent study groups, and random blog posts, but is different from all of them:
Unlike courses (but like Math SE, Reddit, and independent study groups), all participation is voluntary so the people in the community are selected for actually being interested in the material.
Unlike Math SE and Reddit (but like courses and independent study groups), one does not need to laboriously set the context each time one wants to ask a question or talk about something. It’s possible to just say “the second paragraph on page 76” or “Proposition 6.4.12(c)” and expect to be understood, because there is common knowledge of what the material is and the fact that everyone there has access to that material. In a subject like real analysis where there are many ways to develop the material, this is a big plus.
Unlike independent study groups and courses (but like Math SE and Reddit), there is no set pace or requirement to join the study group at a specific point in time. This means people can just show up whenever they start working on the book without worrying that they are behind and need to catch up to the discussion, because there is no single place in the book everyone is at. This also makes this kind of Discord server easier to set up because it does not require finding someone else who is studying the material at the same time, so there is less cost to coordination.
Unlike random forum/blog posts about the book, a dedicated Discord server can comprehensively cover the entire book and has the potential to be “alive/motivating” (it’s pretty demotivating to have a question about a blog post which was written years ago and where the author probably won’t respond; I think reliability is important for making it seem safe/motivating to ask questions).
I also like that Discord has an informal feel to it (less friction to just ask a question) and can be both synchronous and asynchronous.
I think these Discord servers aren’t that hard to set up and maintain. As long as there is one person there who has worked through the entire book, the server won’t seem “dead” and it should accumulate more users. (What’s the motivation for staying in the server if you’ve worked through the whole book? I think it provides a nice review/repetition of the material.) I’ve also noticed that earlier on I had to answer more questions in early chapters of the book, but now there are more people who’ve worked through the early chapters who can answer those questions, so I tend to focus on the later chapters now. So my concrete proposal is that more people, when they finish working through a book, should try to “adopt” the book by creating a Discord server and fielding questions from people who are still working through the book (and then advertising in some standard channels like a relevant subreddit). This requires little coordination ability (everyone from the second person onward selfishly benefits by joining the server and does not need to pay any costs).
I am uncertain how well this format would work for less technical books where there might not be a single answer to a question/a “ground truth” (which leaves room for people to give their opinions more).
(Thanks to people on the Tao Analysis Discord, especially pecfex for starting a discussion on the server about whether there are any similar servers, which gave me the idea to write this post, and Segun for creating the Tao Analysis Discord.)
We can model success as a combination of doing useful things and avoiding making mistakes. As a particular example, we can model intellectual success as a combination of coming up with good ideas and avoiding bad ideas. I claim that rationality helps us avoid mistakes and bad ideas, but doesn’t help much in generating good ideas and useful work.
Eliezer Yudkowsky has made similar points in e.g. “Unteachable Excellence” (“much of the most important information we can learn from history is about how to not lose, rather than how to win”, “It’s easier to avoid duplicating spectacular failures than to duplicate spectacular successes. And it’s often easier to generalize failure between domains.”) and “Teaching the Unteachable”.
Doesn’t do what? I understand Eliezer to be saying that he figured out AI risk via thinking things through himself (e.g., writing a story that involved outcome pumps; reflecting on orthogonality and instrumental convergence; etc.), rather than being argued into it by someone else who was worried about AI risk. If Eliezer didn’t do that, there would still presumably be someone prior to him who did that, since conclusions and ideas have to enter the world somehow. So I’m not understanding what you’re modeling as ridiculous.
My understanding of the history is that Eliezer did not realize the importance of alignment at first, and that he only did so later after arguing about it online with people like Nick Bostrom. See e.g. this thread. I don’t know enough of the history here, but it also seems logically possible that Bostrom could have, say, only realized the importance of alignment after conversing with other people who also didn’t realize the importance of alignment. In that case, there might be a “bubble” of humans who together satisfy the null string criterion, but no single human who does.
The null string criterion does seem a bit silly nowadays since I think the people who would have satisfied it would have sooner read about AI risk on e.g. LessWrong. So they wouldn’t even have the chance to live to age ~21 to see if they spontaneously invent the ideas.
As should be clear, this process can, after a few iterations, produce a situation in which most of those who have engaged with the arguments for a claim beyond some depth believe in it.
This isn’t clear to me, given the model in the post. If a claim is false and there are sufficiently many arguments for the claim, then it seems like everyone eventually ends up rejecting the claim, including those who have engaged most deeply with the arguments. The people who engage deeply “got lucky” by hearing the most persuasive arguments first, but eventually they also hear the weaker arguments and counterarguments to the claim, so they end up at a level of confidence where they don’t feel they should bother investigating further. These people can even have more accurate beliefs than the people who dropped out early in the process, depending on the cutoff that is chosen.
I had already seen all of those quotes/links, all of the quotes/links that Rob Bensinger posts in the sibling comment, as well as this tweet from Eliezer. I asked my question because those public quotes don’t sound like the private information I referred to in my question, and I wanted insight into the discrepancy.
I’ve made around 250 Anki cards about AI safety. I haven’t prioritized sharing my cards because I think finding a specific card useful requires someone to have read the source material generating the card (e.g. if I made the card based on a blog post, one would need to read that exact blog post to get value out of reviewing the card; see learn before you memorize). Since there are many AI safety blog posts and I don’t have the sense that lots of Anki users read any particular blog post, it seems to me that the value generated from sharing a set of cards about a blog post isn’t high enough to overcome the annoyance cost of polishing, packaging, and uploading the cards.
More generally, from a consumer perspective, I think people tend to be pretty bad at making good Anki cards (I’m often embarrassed by the cards I created several months ago!), which makes it unexciting for me to spend a lot of effort trying to collaborate with others on making cards (because I expect to receive poorly made cards in return for the cards I provide). I think collaborative card-making can be done, though; e.g. Michael Nielsen and Andy Matuschak’s quantum computing guide comes with pre-made cards that I think are pretty good.
Different people also have different goals/interests so even given a single source material, the specifics one wants to Ankify can be different. For example, someone who wants to understand the technical details of logical induction will want to Ankify the common objects used (market, pricing, trader, valuation feature, etc.), the theorems and proof techniques, and so forth, whereas someone who just wants a high-level overview and the “so what” of logical induction can get away with Ankifying much less detail.
Something I’ve noticed is that many AI safety posts aren’t very good at explaining things (not enough concrete examples, not enough emphasis on common misconceptions and edge cases, not enough effort to answer what I think of as “obvious” questions); this fact is often revealed by the comments people make in response to a post. This makes it hard to make Anki cards because one doesn’t really understand the content of the post, at least not well enough to confidently generate Anki cards (one of the benefits of being an Anki user is having a greater sensitivity to when one does not understand something; see “illusion of explanatory depth” and related terms). There are other problems like conflicting usage of terminology (e.g. multiple definitions of “benign”, “aligned”, “corrigible”) and the fact that some of the debates are ongoing/some of the knowledge is still being worked out.
For “What would be a good strategy for generating useful flashcards?”: I try to read a post or a series of posts and once I feel that I understand the basic idea, I will usually reread it to add cards about the basic terms and ask myself simple questions. Some example cards for iterated amplification (a toy sketch of the IDA loop these cards refer to follows the list):
what kind of training does the Distill step use?
in the pseudocode, what step gets repeated/iterated?
how do we get A[0]?
write A[1] in terms of H and A[0]
when Paul says IDA is going to be competitive with traditional RL agents in terms of time and resource costs, what exactly does he mean?
advantages of A[0] over H
symbolic expression for the overseer
why should the amplified system (of human + multiple copies of the AI) be expected to perform better than the human alone?
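Since several of these cards refer to the structure of the iterated amplification loop, here is a toy Python sketch of that loop. The function names and the trivial stand-in implementations are my own illustration; in the actual scheme, Distill trains an ML model to imitate the overseer rather than reusing it directly, and A[0] is typically obtained by imitating H alone.

```python
from typing import Callable

# An "agent" here is just something that maps a question to an answer.
Agent = Callable[[str], str]

def amplify(h: Agent, a: Agent) -> Agent:
    """Amplify(H, A[n]): the human answers with help from copies of A[n]."""
    return lambda question: h(f"{question} [assistant's contribution: {a(question)}]")

def distill(overseer: Agent) -> Agent:
    """Distill: train a fast agent to imitate the slow overseer.
    (Toy stand-in: we just reuse the overseer directly.)"""
    return overseer

def ida(h: Agent, n: int) -> Agent:
    a = distill(h)                  # A[0]: distilled from H alone
    for _ in range(n):
        a = distill(amplify(h, a))  # A[k+1] = Distill(Amplify(H, A[k])) -- the repeated step
    return a

if __name__ == "__main__":
    human: Agent = lambda q: f"H's answer to ({q})"
    print(ida(human, 2)("example question"))
```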
My solution for #3:
Define $g\colon [0,1] \to \mathbb{R}$ by $g(x) := f(x) - x$. We know that $g$ is continuous because $f$ and the identity map both are, and by the limit laws. Since $f$ takes values in $[0,1]$, we have $g(0) = f(0) \ge 0$ and $g(1) = f(1) - 1 \le 0$. Applying the intermediate value theorem (problem #2) we see that there exists $x_0 \in [0,1]$ such that $g(x_0) = 0$. But this means $f(x_0) = x_0$, so we are done.
Counterexample for the open interval: consider $f\colon (0,1) \to (0,1)$ defined by $f(x) := x/2$. First, we can verify that if $x \in (0,1)$ then $f(x) \in (0,1)$, so indeed $f$ maps $(0,1)$ to $(0,1)$. To see that there is no fixed point, note that the only solution to $f(x) = x$ in $\mathbb{R}$ is $x = 0$, which is not in $(0,1)$. (We can also view this graphically by plotting both $y = f(x)$ and $y = x$ and checking that they do not intersect in $(0,1)$.)
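As a complement to the closed-interval argument above, here is a quick numerical sketch: bisection on $g(x) = f(x) - x$ locates the fixed point guaranteed by the intermediate value theorem. The particular $f$ used below ($\cos$) is my own choice of example, not something from the original exercise.

```python
import math

def f(x):
    # cos maps [0,1] into [cos(1), 1], which is contained in [0,1]
    return math.cos(x)

def fixed_point(f, lo=0.0, hi=1.0, tol=1e-12):
    """Bisection on g(x) = f(x) - x, maintaining g(lo) >= 0 >= g(hi)."""
    g = lambda x: f(x) - x  # g(0) = f(0) >= 0 and g(1) = f(1) - 1 <= 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) >= 0:
            lo = mid  # keep the sign change inside [lo, hi]
        else:
            hi = mid
    return (lo + hi) / 2

x0 = fixed_point(f)
print(x0, f(x0))  # x0 is approximately 0.739085, and f(x0) is approximately x0
```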
People I followed on Twitter for their credible takes on COVID-19 now sound insane. Sigh...
Are you saying that you initially followed people for their good thoughts on COVID-19, but (a) now they switched to talking about other topics (George Floyd protests?), and their thoughts are much worse on these other topics, (b) their thoughts on COVID-19 became worse over time, (c) they made some COVID-19-related predictions/statements that now look obviously wrong, so that what they previously said sounds obviously wrong, or (d) something else?
When thinking about information asymmetry in transactions (e.g. insurance, market for lemons), I can think of several axes for comparison:
1. whether the thing happens before or after the transaction
2. whether the buyer or seller has more information
3. whether the asymmetry is about a good or about an action
Insurance-like transactions pick “after”, “buyer”, and “action”: the person buying the insurance can choose to act more carelessly after purchasing the insurance.
Market for lemons cases pick “before”, “seller”, and “good”: prior to the transaction, the seller of the good knows more about the quality of the good.
So in many typical cases, the three axes “align”, and the former is called a moral hazard and the latter is called adverse selection.
But there are examples like an all-you-can-eat buffet that sets a single price (which encourages high-appetite people to eat there). This case picks “before”, “buyer”, and “action”. So in this case, 2⁄3 of the axes agree with the insurance-like situation, but this case is still classified as adverse selection because the official distinction is about (1).
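To make the partial overlap explicit, here is a small sketch that encodes the three examples along the three axes (the axis labels and shorthand are my own):

```python
# Each example from above, classified along the three axes (my own shorthand).
examples = {
    "insurance":         {"when": "after",  "informed": "buyer",  "about": "action"},  # moral hazard
    "market for lemons": {"when": "before", "informed": "seller", "about": "good"},    # adverse selection
    "buffet pricing":    {"when": "before", "informed": "buyer",  "about": "action"},  # adverse selection (by axis 1)
}

# The buffet case agrees with the insurance case on 2 of the 3 axes.
shared = sum(
    examples["buffet pricing"][axis] == examples["insurance"][axis]
    for axis in ("when", "informed", "about")
)
print(shared)  # -> 2
```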
Wikipedia states “Where adverse selection describes a situation where the type of product is hidden from one party in a transaction, moral hazard describes a situation where there is a hidden action that results from the transaction” (i.e. claims that (3) is the relevant axis) but then on the same page also states “For example, an all-you-can-eat buffet restaurant that sets one price for all customers risks being adversely selected against by high appetite” (i.e. classifies this example as adverse selection, even though classifying according to (3) would result in calling this a moral hazard).
Does anyone know why (1) is the most interesting axis (which I’m inferring based on how only this axis seems to have names for the two ends)?
What are your thoughts on Duncan Sabien’s Facebook post which predicts significant differences in CFAR’s direction now that he is no longer working for CFAR?