I am Issa Rice. https://issarice.com/
I took the survey.
This is a minor point, but I am somewhat worried that the idea of research debt/research distillation seems to be getting diluted over time. The original article (which this post links to) says:
Distillation is also hard. It’s tempting to think of explaining an idea as just putting a layer of polish on it, but good explanations often involve transforming the idea. This kind of refinement of an idea can take just as much effort and deep understanding as the initial discovery.
I think the kind of cleanup and polish that is encouraged by the review process is insufficient to qualify as distillation, and insufficient to adequately deal with research debt (I know this post didn’t use the word “distillation”, but it does talk about research debt, and distillation is presented as the solution to debt in the original article).
There seems to be a pattern where a term is introduced first in a strong form, then it accumulates a lot of positive connotations, and that causes people to stretch the term to use it for things that don’t quite qualify. I’m not confident that is what is happening here (it’s hard to tell what happens in people’s heads), but from the outside it’s a bit worrying.
I actually made a similar comment a while ago about a different term.
Back in April, Oliver Habryka wrote:
Anna Salamon has reduced her involvement in the last few years and seems significantly less involved with the broader strategic direction of CFAR (though she is still involved in some of the day-to-day operations, curriculum development, and more recent CFAR programmer workshops). [Note: After talking to Anna about this, I am now less certain of whether this actually applies and am currently confused on this point]
Could someone clarify the situation? (Possible sub-questions: Why did Oliver get this impression? Why was he confused even after talking to Anna? To what extent and in what ways has Anna reduced her involvement in CFAR in the last few years? If Anna has reduced her involvement in CFAR, what is she spending her time on instead?)
I have seen/heard from at least two sources something to the effect that MIRI/CFAR leadership (and Anna in particular) has very short AI timelines and a high probability of doom (and apparently holds these beliefs with high confidence). Here is the only public example that I can recall seeing. (Of the two examples I can specifically recall, this is not the better one, but the other was not posted publicly.) Is there any truth to these claims?
Like Wei Dai, I am also finding this discussion pretty confusing. To summarize my state of confusion, I came up with the following list of ways in which preferences can be short or long:
1. time horizon and time discounting: how far in the future is the preference about? More generally, how much weight do we place on the present vs the future? (See the formula sketched after this list.)
2. act-based (“short”) vs goal-based (“long”): using the human’s (or more generally, the human-plus-AI-assistants’; see (6) below) estimate of the value of the next action (act-based) or doing more open-ended optimization of the future based on some goal, e.g. using a utility function (goal-based)
3. amount of reflection the human has undergone: “short” would be the current human (I think this is what you call “preferences-as-elicited”), and this would get “longer” as we give the human more time to think, with something like CEV/Long Reflection/Great Deliberation being the “longest” in this sense (I think this is what you call “preference-on-idealized-reflection”). This sense further breaks down into whether the human itself is actually doing the reflection, or if the AI is instead predicting what the human would think after reflection.
4. how far the search happens: “short” would be a limited search (that lacks insight/doesn’t see interesting consequences) and “long” would be a search that has insight/sees interesting consequences. This is a distinction you made in a discussion with Eliezer a while back. This distinction also isn’t strictly about preferences, but rather about how one would achieve those preferences.
5. de dicto (“short”) vs de re (“long”): This is a distinction you made in this post. I think this is the same distinction as (2) or (3), but I’m not sure which. (But if my interpretation of you below is correct, I guess this must be the same as (2) or else a completely different distinction.)
6. understandable (“short”) vs evaluable (“long”): A course of action is understandable if the human (without any AI assistants) can understand the rationale behind it; a course of action is evaluable if there is some procedure the human can implement to evaluate the rationale using AI assistants. I guess there is also a “not even evaluable” option here that is even “longer”. (Thanks to Wei Dai for bringing up this distinction, although I may have misunderstood the actual distinction.)
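(As an aside, the textbook way to make the “how much weight do we place on the present vs the future” part of sense (1) precise is an exponential discount factor; this is just the standard formula, not something from Paul’s posts: $U = \sum_{t \geq 0} \gamma^t u_t$, where $u_t$ is how well things go at time $t$, and $\gamma$ close to 0 corresponds to “short” while $\gamma$ close to 1 corresponds to “long”.)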
My interpretation is that when you say “short-term preferences-on-reflection”, you mean short in sense (1), except when the AI needs to gather resources, in which case either the human or the AI will need to do more long-term planning; short in sense (2); long in sense (3), with the AI predicting what the human would think after reflection; long in sense (4); short in sense (5); long in sense (6). Does this sound right to you? If not, I think it would help me a lot if you could “fill in the list” with which of short or long you choose for each point.
Assuming my interpretation is correct, my confusion is that you say we shouldn’t expect a situation where “the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy” (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
Does anyone know how Brian Christian came to be interested in AI alignment and why he decided to write this book instead of a book about a different topic? (I haven’t read the book and looked at the Amazon preview but couldn’t find the answer there.)
I was surprised to see, both on your website and the white paper, that you are part of Mercatoria/ICTP (although your level of involvement isn’t clear based on public information). My surprise is mainly because you have a couple of comments on LessWrong that discuss why you have declined to join MIRI as a research associate. You have also (to my knowledge) never joined any other rationality-community or effective altruism-related organization in any capacity.
My questions are:
What are the reasons you decided to join or sign on as a co-author for Mercatoria/ICTP?
More generally, how do you decide which organizations to associate with? Have you considered joining other organizations, starting your own organization, or recruiting contract workers/volunteers to work on things you consider important?
We can model success as a combination of doing useful things and avoiding making mistakes. As a particular example, we can model intellectual success as a combination of coming up with good ideas and avoiding bad ideas. I claim that rationality helps us avoid mistakes and bad ideas, but doesn’t help much in generating good ideas and useful work.
Eliezer Yudkowsky has made similar points in e.g. “Unteachable Excellence” (“much of the most important information we can learn from history is about how to not lose, rather than how to win”, “It’s easier to avoid duplicating spectacular failures than to duplicate spectacular successes. And it’s often easier to generalize failure between domains.”) and “Teaching the Unteachable”.
As should be clear, this process can, after a few iterations, produce a situation in which most of those who have engaged with the arguments for a claim beyond some depth believe in it.
This isn’t clear to me, given the model in the post. If a claim is false and there are sufficiently many arguments for the claim, then it seems like everyone eventually ends up rejecting the claim, including those who have engaged most deeply with the arguments. The people who engage deeply “got lucky” by hearing the most persuasive arguments first, but eventually they also hear the weaker arguments and counterarguments to the claim, so they end up at a level of confidence where they don’t feel they should bother investigating further. These people can even have more accurate beliefs than the people who dropped out early in the process, depending on the cutoff that is chosen.
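(To check my own reasoning, here is a toy simulation of the dynamic I have in mind. The setup is my own formalization rather than the post’s model: the persuasiveness numbers are made up, and I assume people hear the rebuttal to each argument one round after they hear the argument itself.)

```python
import random

random.seed(0)

# Toy setup (my own, not the post's model): the claim is false, so each
# pro-argument eventually gets a slightly stronger rebuttal, but a person
# hears the pro side of each argument one round before its rebuttal.
n_args = 30
pro = sorted((random.uniform(0.5, 3.0) for _ in range(n_args)), reverse=True)
con = [p + random.uniform(0.0, 0.8) for p in pro]  # rebuttals slightly stronger

def log_odds_after(depth):
    """Credence (in log-odds) of someone who stops after `depth` rounds:
    they have heard pro-arguments 1..depth but rebuttals only 1..depth-1."""
    if depth == 0:
        return 0.0
    return sum(pro[:depth]) - sum(con[:depth - 1])

for depth in [1, 3, 5, 10, 20, 30]:
    print(f"engagement depth {depth:2d}: log-odds {log_odds_after(depth):+.2f}")
```

In this toy version, shallow engagers end up believing the claim and deep engagers end up rejecting it, which matches what I described above; the post’s conclusion seems to require some additional mechanism that stops the deep engagers from ever hearing the weaker arguments and counterarguments.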
I had already seen all of those quotes/links, all of the quotes/links that Rob Bensinger posts in the sibling comment, as well as this tweet from Eliezer. I asked my question because those public quotes don’t sound like the private information I referred to in my question, and I wanted insight into the discrepancy.
I’ve made around 250 Anki cards about AI safety. I haven’t prioritized sharing my cards because I think finding a specific card useful requires someone to have read the source material generating the card (e.g. if I made the card based on a blog post, one would need to read that exact blog post to get value out of reviewing the card; see learn before you memorize). Since there are many AI safety blog posts and I don’t have the sense that lots of Anki users read any particular blog post, it seems to me that the value generated from sharing a set of cards about a blog post isn’t high enough to overcome the annoyance cost of polishing, packaging, and uploading the cards.
More generally, from a consumer perspective, I think people tend to be pretty bad at making good Anki cards (I’m often embarrassed by the cards I created several months ago!), which makes it unexciting for me to spend a lot of effort trying to collaborate with others on making cards (because I expect to receive poorly-made cards in return for the cards I provide). I think collaborative card-making can be done though, e.g. Michael Nielsen and Andy Matuschak’s quantum computing guide comes with pre-made cards that I think are pretty good.
Different people also have different goals/interests so even given a single source material, the specifics one wants to Ankify can be different. For example, someone who wants to understand the technical details of logical induction will want to Ankify the common objects used (market, pricing, trader, valuation feature, etc.), the theorems and proof techniques, and so forth, whereas someone who just wants a high-level overview and the “so what” of logical induction can get away with Ankifying much less detail.
Something I’ve noticed is that many AI safety posts aren’t very good at explaining things (not enough concrete examples, not enough emphasis on common misconceptions and edge cases, not enough effort to answer what I think of as “obvious” questions); this fact is often revealed by the comments people make in response to a post. This makes it hard to make Anki cards because one doesn’t really understand the content of the post, at least not well enough to confidently generate Anki cards (one of the benefits of being an Anki user is having a greater sensitivity to when one does not understand something; see “illusion of explanatory depth” and related terms). There are other problems like conflicting usage of terminology (e.g. multiple definitions of “benign”, “aligned”, “corrigible”) and the fact that some of the debates are ongoing/some of the knowledge is still being worked out.
For “What would be a good strategy for generating useful flashcards?”: I try to read a post or a series of posts, and once I feel that I understand the basic idea, I usually reread it to add cards about the basic terms and to ask myself simple questions. Some example cards for iterated amplification (the recursion these cards refer to is sketched right after the list):
what kind of training does the Distill step use?
in the pseudocode, what step gets repeated/iterated?
how do we get A[0]?
write A[1] in terms of H and A[0]
when Paul says IDA is going to be competitive with traditional RL agents in terms of time and resource costs, what exactly does he mean?
advantages of A[0] over H
symbolic expression for the overseer
why should the amplified system (of human + multiple copies of the AI) be expected to perform better than the human alone?
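For reference, the recursion these cards are quizzing can be written schematically (in the notation of Paul’s iterated amplification posts, with $H$ the human overseer and $\text{Distill}$/$\text{Amplify}$ the distillation and amplification steps) as

$$A[0] = \text{Distill}(H), \qquad A[n+1] = \text{Distill}(\text{Amplify}(H, A[n])),$$

where the overseer used to train $A[n+1]$ is the amplified system $\text{Amplify}(H, A[n])$, i.e. $H$ working together with many copies of $A[n]$.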
My solution for #3:
Define $g : [0,1] \to \mathbb{R}$ by $g(x) = f(x) - x$. We know that $g$ is continuous because $f$ and the identity map both are, and by the limit laws. Since $f$ maps into $[0,1]$, we have $g(0) = f(0) \geq 0$ and $g(1) = f(1) - 1 \leq 0$. Applying the intermediate value theorem (problem #2) we see that there exists $x \in [0,1]$ such that $g(x) = 0$. But this means $f(x) = x$, so we are done.
Counterexample for the open interval: consider $f : (0,1) \to (0,1)$ defined by $f(x) = x/2$. First, we can verify that if $x \in (0,1)$ then $f(x) = x/2 \in (0,1)$, so $f$ indeed maps $(0,1)$ to $(0,1)$. To see that there is no fixed point, note that the only solution to $f(x) = x$ in $\mathbb{R}$ is $x = 0$, which is not in $(0,1)$. (We can also view this graphically by plotting both $y = f(x)$ and $y = x$ and checking that they do not intersect in $(0,1)$.)
People I followed on Twitter for their credible takes on COVID-19 now sound insane. Sigh...
Are you saying that you initially followed people for their good thoughts on COVID-19, but (a) now they have switched to talking about other topics (George Floyd protests?), and their thoughts are much worse on these other topics, (b) their thoughts on COVID-19 became worse over time, (c) they made some COVID-19-related predictions/statements that now look obviously wrong, so that in retrospect what they said sounds insane, or (d) something else?
When thinking about information asymmetry in transactions (e.g. insurance, market for lemons), I can think of several axes for comparison:
1. whether the thing happens before or after the transaction
2. whether the buyer or seller has more information
3. whether the asymmetry is about a good or about an action
Insurance-like transactions pick “after”, “buyer”, and “action”: the person buying the insurance can choose to act more carelessly after purchasing the insurance.
Market for lemons cases pick “before”, “seller”, and “good”: prior to the transaction, the seller of the good knows more about the quality of the good.
So in many typical cases, the three axes “align”, and the former is called a moral hazard and the latter is called adverse selection.
But there are examples like an all-you-can-eat buffet that sets a single price (which encourages high-appetite people to eat there). This case picks “before”, “buyer”, and “action”. So in this case, 2⁄3 of the axes agree with the insurance-like situation, but this case is still classified as adverse selection because the official distinction is about (1).
Wikipedia states “Where adverse selection describes a situation where the type of product is hidden from one party in a transaction, moral hazard describes a situation where there is a hidden action that results from the transaction” (i.e. claims that (3) is the relevant axis) but then on the same page also states “For example, an all-you-can-eat buffet restaurant that sets one price for all customers risks being adversely selected against by high appetite” (i.e. classifies this example as adverse selection, even though classifying according to (3) would result in calling this a moral hazard).
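(A small sketch to make the disagreement concrete, with the three examples encoded along the three axes exactly as described above; the two classification rules are just my encoding of axis (1) and axis (3).)

```python
# Encode the three axes from the list above for the three examples discussed.
# Axis 1: does the asymmetry operate before or after the transaction?
# Axis 2: does the buyer or the seller have more information?
# Axis 3: is the hidden thing a good or an action?
examples = {
    "insurance":         {"when": "after",  "who": "buyer",  "what": "action"},
    "market for lemons": {"when": "before", "who": "seller", "what": "good"},
    "buffet":            {"when": "before", "who": "buyer",  "what": "action"},
}

def classify_by_timing(ex):
    # Classification by axis (1): adverse selection if the asymmetry
    # operates before the transaction, moral hazard if after.
    return "adverse selection" if ex["when"] == "before" else "moral hazard"

def classify_by_hidden_thing(ex):
    # Classification by axis (3): adverse selection if a good is hidden,
    # moral hazard if an action is hidden.
    return "adverse selection" if ex["what"] == "good" else "moral hazard"

for name, ex in examples.items():
    print(f"{name:17s}  by axis (1): {classify_by_timing(ex):17s}"
          f"  by axis (3): {classify_by_hidden_thing(ex)}")
```

The buffet row is the one where the two rules disagree, which is the tension I’m pointing at.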
Does anyone know why (1) is the most interesting axis (which I’m inferring based on how only this axis seems to have names for the two ends)?
Some other sources of exercises you might want to check out (that have solutions and that I have used at least partly):
Multiple choice quizzes (the ones related to linear algebra are determinants, elementary matrices, inner product spaces, linear algebra, linear systems, linear transformations, matrices, and vector spaces)
Vipul Naik’s quizzes (disclosure: I am friends with Vipul and also do contract work for him)
Regarding Axler’s book (since it has been mentioned in this thread): there are several “levels” of linear algebra, and Axler’s book is at a higher level (emphasis on abstract vector spaces and coordinate-free ways of doing things) than the 3Blue1Brown videos (more concrete, working in $\mathbb{R}^n$). Axler’s book also assumes that the reader has had exposure to the lower level material (e.g. he does not talk about row reduction and elementary matrices). So I’m not sure I would recommend it to someone starting out trying to learn the basics of linear algebra.
Gratuitous remarks:
I think different resources covering material in a different order and using different terminology is in some sense a feature, not a bug, because it allows one to look at the subject from different perspectives. For instance, the “done right” in Axler’s book comes from one such change in perspective.
I find that learning mathematics well takes an unintuitively long time; it might be unrealistic to expect to learn the material well unless one puts in a lot of effort.
I think there is a case to be made for the importance of struggling in learning (disclosure: I am the author of the page).
I find the conjunction of your decision to have kids and your short AI timelines pretty confusing. The possibilities I can think of are (1) you’re more optimistic than me about AI alignment (but I don’t get this impression from your writings), (2) you think that even a short human life is worth living/net-positive, (3) since you distinguish between the time when humans lose control and the time when catastrophe actually happens, you think this delay will give more years to your child’s life, (4) your decision to have kids was made before your AI timelines became short. Or maybe something else I’m not thinking of? I’m curious to hear your thinking on this.
When I initially read this post, I got the impression that “subagents = path-dependent/incomplete DAG”. After working through more examples, it seems like all the work is being done by “committee requiring unanimous agreement” rather than by the “subagents” part.
Here are the examples I thought about:
Same as the mushroom/pepperoni situation, with the same two agents, but now each side can retaliate/hijack the rest of the mind if it doesn’t get what it wants. For example, if it starts at pepperoni, the mushroom-preferring agent will hijack the rest of the mind to remove the pepperoni, ending up at cheese. But if the agent starts at the “both” node, it will stay there (because both agents are satisfied). The preference relation can be represented as before, but with an extra arrow from pepperoni to cheese (and likewise from mushroom to cheese). This is still a DAG, and it’s still incomplete (in the sense that we can’t compare pepperoni vs mushroom) but it’s no longer path-dependent, because no matter where we start, we end up at cheese or “both” (I am assuming that toppings-removal can always be done, whereas acquiring new toppings can’t).
Same as the previous example, except now only the mushroom-preferring agent can retaliate/hijack (because the pepperoni-preferring agent is weak or nice). Now the preferences are pepperoni < cheese < mushroom < both. This is still a DAG, but now the preferences are total, so we can also view it as a (somewhat weird) single agent. A realistic example of this is given by Andrew Critch, where pepperoni=work, cheese=burnout (i.e. neither work nor friendship), mushroom=friendship, and both=friendship-and-work.
A modified version of the Zyzzx Prime planet by Scott Alexander. Now whenever we start out at pepperoni, the pepperoni-preferring agent becomes stupid/weak, and loses dominance, so now there are edges from pepperoni to mushroom and “both”. (And similarly, mushroom points to both pepperoni and “both”.) Now we no longer have a DAG because of the cycle between pepperoni and mushroom.
It seems like when people talk about the human mind being composed of subagents, the deliberation process is not necessarily “committee requiring unanimous agreement”, so the resulting preference relations cannot necessarily be represented using path-dependent DAGs.
It also seems like the general framework of viewing systems as subagents (i.e. not restricting to “committee requiring unanimous agreement”) is broad enough that it can basically represent any kind of directed graph. On one hand, this is suspicious (if everything can be viewed as a bunch of subagents, then maybe the subagents framework isn’t adding anything after all). On the other hand, this suggests that claims of subagents are not really about the resulting behavior/preference ordering of the system, but rather about the internal dynamics of the system.
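(A minimal sketch of the “committee requiring unanimous agreement” rule, using the basic mushroom/pepperoni setup; the specific utility numbers are assumptions I picked, with each subagent caring only about whether its own topping is present.)

```python
from itertools import combinations

# Each subagent cares only about whether its preferred topping is present.
options = ["cheese", "pepperoni", "mushroom", "both"]
pepperoni_agent = {"cheese": 0, "pepperoni": 1, "mushroom": 0, "both": 1}
mushroom_agent  = {"cheese": 0, "pepperoni": 0, "mushroom": 1, "both": 1}
agents = [pepperoni_agent, mushroom_agent]

def committee_prefers(a, b):
    """Unanimous committee: a is preferred to b iff every subagent weakly
    prefers a to b and at least one subagent strictly prefers a to b."""
    return (all(u[a] >= u[b] for u in agents)
            and any(u[a] > u[b] for u in agents))

for a, b in combinations(options, 2):
    if committee_prefers(a, b):
        print(f"{a} preferred to {b}")
    elif committee_prefers(b, a):
        print(f"{b} preferred to {a}")
    else:
        print(f"{a} vs {b}: incomparable")
```

This reproduces the incomparability between pepperoni and mushroom; everything else in the examples above comes from which transitions I allow on top of this relation, which is why the aggregation rule rather than the “subagents” label seems to be doing the work.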
Here is my attempt, based on Hoagy’s proof.
Let $n \geq 1$ be an integer. We are given that $f(0) < 0$ and $f(1) > 0$. Now consider the points $0, \tfrac{1}{n}, \tfrac{2}{n}, \ldots, \tfrac{n-1}{n}, 1$ in the interval $[0,1]$. By 1-D Sperner’s lemma, there are an odd number of $k \in \{0, 1, \ldots, n-1\}$ such that $f(\tfrac{k}{n}) < 0$ and $f(\tfrac{k+1}{n}) \geq 0$ (i.e. an odd number of “segments” that begin below zero and end up above zero). In particular, $0$ is an even number, so there must be at least one such number $k$. Choose the smallest such $k$ and call this number $k_n$.
Now consider the sequence $(k_n/n)_{n \geq 1}$. Since this sequence takes values in $[0,1]$, it is bounded, and by the Bolzano–Weierstrass theorem there must be some subsequence $(k_{n_j}/n_j)_{j \geq 1}$ that converges to some number $x \in [0,1]$.
Consider the sequences $(k_{n_j}/n_j)_{j \geq 1}$ and $((k_{n_j}+1)/n_j)_{j \geq 1}$. We have $f(k_{n_j}/n_j) < 0 \leq f((k_{n_j}+1)/n_j)$ for each $j$. By the limit laws, $(k_{n_j}+1)/n_j \to x$ as $j \to \infty$. Since $f$ is continuous, we have $f(k_{n_j}/n_j) \to f(x)$ and $f((k_{n_j}+1)/n_j) \to f(x)$ as $j \to \infty$. Thus $f(x) \leq 0$ and $f(x) \geq 0$, showing that $f(x) = 0$, as desired.
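(Not part of the proof, but as a numerical sanity check of the construction: for a sample $f$ of my own choosing, one can compute the smallest crossing index $k_n$ on the grid and watch $k_n/n$ approach a root.)

```python
# Example continuous function with f(0) < 0 and f(1) > 0 (my own choice,
# just for illustration); its unique root in [0, 1] is 0.2 ** (1/3) ≈ 0.585.
def f(x):
    return x ** 3 - 0.2

def smallest_crossing(n):
    """The k_n from the proof: the smallest k with f(k/n) < 0 <= f((k+1)/n)."""
    for k in range(n):
        if f(k / n) < 0 <= f((k + 1) / n):
            return k
    raise ValueError("no sign change found")

for n in [2, 10, 100, 10_000]:
    k = smallest_crossing(n)
    print(f"n = {n:6d}: k_n / n = {k / n:.4f}")
```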
What are your thoughts on Duncan Sabien’s Facebook post which predicts significant differences in CFAR’s direction now that he is no longer working for CFAR?