Hello,
I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of “tool-AI” by discussing AIXI. One of the issues I’ve found with the debate over FAI in general is that I haven’t seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity for a more precise discussion, if in fact it is believed to represent a case of “a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button.”
So here’s my characterization of how one might work toward a safe and useful version of AIXI, using the “tool-AI” framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind (a toy code sketch follows the outline below), but hopefully it adds some clarity to the discussion.
A. Write a program that:
1. Computes an optimal policy, using some implementation of equation (20) on page 22 of http://www.hutter1.net/ai/aixigentle.pdf
2. “Prints” the policy in a human-readable format (using some fixed algorithm for “printing” that is not driven by a utility function)
3. Provides tools for answering user questions about the policy, e.g., “What will be its effect on ___?” (using some fixed algorithm for answering user questions that makes use of AIXI’s probability function, and is not driven by a utility function)
4. Does not contain any procedures for “implementing” the policy, only for displaying it and its implications in human-readable form
B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by adjusting the utility function it is selecting a policy to maximize) until the policy appears safe and desirable
C. Implement the policy using tools other than the AIXI agent
D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for
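To make the intended separation of concerns concrete, here is a minimal, purely illustrative sketch of the program structure described in (A). Everything in it is hypothetical: no efficient approximation of AIXI exists, so `approximate_aixi_policy`, `answer_question`, and the action-table representation of a policy are placeholders I am introducing for illustration, not a claim about how equation (20) would actually be implemented. The point is only the structure: the program computes, displays, and answers questions about a policy, and contains no code path for executing it.

```python
"""Toy sketch of the "tool-AIXI" wrapper described in steps A-D.

All names and representations here are hypothetical stand-ins; they are
not drawn from Hutter's paper or any existing implementation.
"""

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Policy:
    """A policy represented as an explicit action table (hypothetical format)."""
    action_table: Dict[str, str]  # maps observation history -> recommended action


def approximate_aixi_policy(utility: Callable[[str], float], horizon: int) -> Policy:
    """(A1) Stand-in for an implementation of equation (20): would return the
    policy maximizing expected utility under the universal mixture. Here it
    just returns an empty table."""
    return Policy(action_table={})


def print_policy(policy: Policy) -> None:
    """(A2) Fixed, non-utility-driven rendering of the policy for human review."""
    for history, action in sorted(policy.action_table.items()):
        print(f"after observing {history!r}: take action {action!r}")


def answer_question(policy: Policy, question: str) -> str:
    """(A3) Stand-in for a fixed question-answering routine that would query
    AIXI's probability function about the policy's consequences."""
    return f"[prediction for {question!r} would go here]"


# (A4) Note what is deliberately absent: nothing in this module executes the
# policy. Implementation (step C) happens outside, with other tools, and only
# after human review (step B).

if __name__ == "__main__":
    policy = approximate_aixi_policy(utility=lambda outcome: 0.0, horizon=10)
    print_policy(policy)
    print(answer_question(policy, "What will be its effect on ___?"))
```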
My claim is that this approach would be superior to that of trying to develop “Friendliness theory” in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I’m interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?
Hi, here are the details of whom I spoke with and why:
I originally emailed Michael Vassar, letting him know I was going to be in the Bay Area and asking whether there was anyone appropriate for me to meet with. He set me up with Jasen Murray.
Justin Shovelain and an SIAI donor were also present when I spoke with Jasen. There may have been one or two others; I don’t recall.
After we met, I sent the notes to Jasen for review. He sent back comments and also asked me to run them by Amy Willey and Michael Vassar, who each provided some corrections via email that I incorporated.
A couple of other comments:
If SIAI wants to set up another “room for more funding” discussion, I’d be happy to do that and to post new notes.
In general, we’re always happy to post corrections or updates on any content we post, including how that content is framed and presented. The best way to get our attention is to email us at info@givewell.org.
And a tangential comment/question for Louie: I do not understand why you link to my two LW posts using the anchor text you use. These posts are not about GiveWell’s process. They both argue that standard Bayesian inference weighs against the literal use of non-robust expected-value estimates, particularly in “Pascal’s Mugging” type scenarios. Michael Vassar’s response to the first of these was that I was attacking a straw man. There are unresolved disagreements about some of the specific modeling assumptions and implications of these posts, but I don’t see any way in which they imply a “limited process” or “blinding to the possibility of SIAI’s being a good giving opportunity.” I do agree that SIAI hasn’t been a fit for our standard process (and is more suited to GiveWell Labs), but I don’t see anything in these posts that illustrates that. What do you have in mind here?