CW: fairly frank discussion of violence, including sexual violence, in some of the worst publicized atrocities with human victims in modern history. Pretty dark stuff in general.
tl;dr: Imperial Japan did worse things than the Nazis: probably a greater scale of harm, more unambiguous and greater cruelty, and more commonplace breaking of near-universal human taboos.
I think the Imperial Japanese Army was noticeably worse during World War II than the Nazis. Obviously words like “noticeably worse,” “bad,” and “crimes against humanity” are to some extent judgment calls, but my guess is that to most neutral observers looking at the evidence afresh, the difference isn’t particularly close.
probably greater scale
of civilian casualties: It is difficult to get accurate estimates of the number of civilian casualties caused by Imperial Japan, but my best guess is that the total numbers are higher than Nazi Germany’s (both are likely in the tens of millions).
of Prisoners of War (POWs): Germany’s mistreatment of Soviet POWs has been called “one of the greatest crimes in military history” and is arguably Nazi Germany’s second-biggest crime. Germany captured 6 million Soviet POWs, of whom 3 million died, for a fatality rate of 50%. In contrast, of all the Chinese POWs taken by Japan, only 56 survived to the end of the war.
Japan’s attempted coverups of war crimes often involved the attempted total eradication of the victims. We see this with both POWs and Unit 731 (their biological warfare experimentation unit, which we will explore later).
more unambiguous and greater cruelty
It’s instructive to compare Nazi Germany’s human experiments against Japan’s human experiments at Unit 731 (warning: body horror). Both were extremely bad in absolute terms. However, without getting into the details of the specific experiments, I don’t think anybody could plausibly argue that the Nazis were more cruel in their human experiments, or inflicted more suffering. The widespread casualness and lack of any trace of empathy also seemed more pronounced in Imperial Japan:
“Some of the experiments had nothing to do with advancing the capability of germ warfare, or of medicine. There is such a thing as professional curiosity: ‘What would happen if we did such and such?’ What medical purpose was served by performing and studying beheadings? None at all. That was just playing around. Professional people, too, like to play.”
When (Japanese) Unit 731 officials were infected, they immediately went on the experimental chopping block as well (without anesthesia).
more commonplace breaking of near-universal human taboos
I can think of several key taboos that were broken by Imperial Japan but not by the Nazis. I can’t think of any in the reverse direction.
Taboo against biological warfare:
To a first approximation, Nazi Germany did not engage in biological warfare outside of small-scale experiments. In contrast, Imperial Japan was very willing to use biological warfare “in the field” on civilians; estimates of civilian deaths from Japan-introduced plague run upwards of 200,000.
Taboo against mass institutionalized rape and sexual slavery:
While I’m sure rape happened and was commonplace in German-occupied territories, it was not, to my knowledge, widely condoned and institutionalized. While euphemisms like “forced prostitution” and “comfort women” are applied, the reality was that 50,000–200,000 women (many of them minors) were regularly raped under the direct instruction of the Imperial Japanese government.
Taboo against cannibalism outside of extreme exigencies:
“Nazi cannibals” are the stuff of B-movies and video games, i.e., they have approximately zero basis in history. In contrast, Japanese cannibalism undoubtedly happened and was likely commonplace.
We have documented oral testimony from Indian POWs, Australian POWs, American soldiers, and Japanese soldiers themselves.
My rationalist-y friends sometimes ask why the taboo against cannibalism is particularly important.
I’m not sure why, but I think part of the answer is “dehumanization.”
I bring this topic up mostly out of morbid curiosity. I haven’t spent that much time looking into war crimes and haven’t dived into the primary literature, so I’m happy to be corrected on various fronts.
My own professional opinion, not speaking for any other grantmakers or giving an institutional view for the LTFF, etc.:
Yeah I sure can’t convince you that donating to us is definitely net positive, because such a claim wouldn’t be true.
So basically, I don’t think it’s possible to take robustly positive actions in longtermism with a high (>70%? >60%?) probability of being net positive for the long-term future[1], and this number is even lower for people who don’t place the majority of their credence on near- to medium-term extinction risk timelines.
I don’t think this is just an abstract theoretical risk; as you mention, there’s a real risk that our projects are net negative, and advancing AI capabilities more than AI safety is the most obvious way this could be true.
I think the other LTFF grantmakers and I are pretty conscious of downside risks from capabilities enhancements, though I expect there’s a range of opinions on the fund about how much to weigh that against other desiderata, as well as about which specific projects have the highest capabilities externalities.
I would guess that we’re better about this than most (all?) other significant longtermist funders, including both organizations and individuals (though keep in mind that the average for individuals is driven by the long left tail). But since we’re optimizing for other things as well (most importantly positive impact), I think we’d do worse than you would on this axis if you a) have reasonably good judgment, b) are laser-focused on preventing capabilities externalities, and c) have direct access to good donation options, especially by your own worldview. And of course reality doesn’t grade on a curve, so doing better than other funders isn’t a guarantee that we’re doing well enough.
I don’t do much evaluation of alignment grants myself, because others on the fund are more technically qualified, so my time is usually triaged to looking at other projects (e.g. forecasting, biosecurity). But I do try to flag downside risks I see in LTFF grants overall, including in alignment grants. (So far, I think the rest of the fund is sensible about capabilities risks, and capabilities risks usually aren’t the type of thing that non-public information is super useful for, so possibly none of my flags were about capabilities; they were more about interpersonal harm or professional integrity.) When I did flag risks, I found the rest of the fund to be sensible about them. You might find this recent post useful.
(On the flip side, there were a small number of grants I liked that we were blocked from making for legal or PR reasons; for the most promising ones, one of us tried to connect the applicant to other funders.)
If I were to hypothesize why LessWrongers should be worried about our capabilities externalities:
I think the average view on the fund (both unweighted and weighted by votes on alignment grants) is more optimistic about prosaic AI alignment strategies than what I perceive to be the median LessWrong view.
I expect that, under most worldviews, prosaic AI alignment has more capabilities externalities than other research agendas.
To be clear, I don’t think the views on the fund are out of line with those of working AI safety researchers; I think the louder (and probably median?) voices on LessWrong are more negative on prosaic approaches.
Some of our grantees go on to work at AI labs like Anthropic or DeepMind, which many people here would consider to be bad for the world.
My own weakly-to-moderately held view is that doing AI safety work at big labs is a good-to-great idea, but I don’t think the case is very robust, and reasonable people can and should disagree.
As you allude to, an important crux is whether (and how much of) the work at the labs ends up being safety-washing.
I’m personally fairly against working at big labs in non-safety roles; the capabilities externalities just seem rather high, and the career capital argument seems a) not that strong compared to getting a random ML job at Google doing ads or working on collision detection at Tesla or something, and b) to rely implicitly on a certain willingness to defect for personal gain.
The moral mazes and institutional/cultural incentives to warp your beliefs seem pretty scary to me, but I don’t have a good solution.
We are not institutionally opposed to receiving money from employees at big labs
Though as an empirical matter I don’t think we’ve received much.
The ecosystem/memes/vibes near us have in fact resulted in a bunch of negative externalities before; there’s no guarantee we wouldn’t cause the same.
We haven’t tracked past negative externalities/negative-impact grants very well, so I couldn’t, e.g., point to our 10 worst grants ex post with an estimate of how bad they were (but we’re working on this).
We didn’t see the FTX crash coming.
I also think potential donors can just look at our past grants database, our payout report, or our marginal grants post to make an informed decision for themselves about whether donations to us are (sufficiently) net positive in expectation.
On a personal level:
I don’t really know, man? I think the longtermist/rationalist EA memes/ecosystem were very likely causally responsible for some of the worst capabilities externalities of the last decade; I don’t have a sense of how bad it is overall, because counterfactuals are really hard, but I don’t think it’s plausible that the negative impact was small. I’m pretty confused about whether people with thought processes like mine have been historically net positive or net negative; I can see a strong case either way. The whole thing had a pretty direct effect on me being depressed for most of this year (with the obvious caveat that etiology is hard for mental illness stuff, and being sad for cosmic reasons is one of the most self-flattering stories I could have for melancholy). Interestingly, the emotional effect is much larger than I would’ve predicted ex ante: if you had asked me in 2017 whether longtermist work might be net negative, I don’t think my numbers would’ve been that different; I guess the specific details and concreteness did matter.
I have a lot of sympathy for people who decided to check out of morality a bit, or who decided to give up on this whole AI thing and focus on just reducing suffering in the next few decades (I think farmed animal welfare is the most popular candidate). But ultimately I think they’re wrong. The future is still going to be big, likely really wild, and likely at least somewhat contingent. Knowing (or at least assigning high probability to the claim) that people near us did a bunch of harmful stuff in the past is certainly an argument for being much more careful going forward (as well as for a number of more concrete and specific updates), but it’s not really a good case to just roll over. (In the abstract, I do think it’s plausible that, for some people, acting now is worse than retreating to the woods for a year and thinking really hard; as an empirical matter, when I did weaker versions of that, the effect was basically between useless and negative.)
I think it’s a bit more feasible if you’re willing to make a >3 OOM sacrifice in expected positive impact, but it’s still pretty rough. Some green energy stuff might be safe? Maybe try to convince doomsday preppers to be nicer people? I confess to not having thought much about it; I think some of the Oxford people might have a better idea.