Errata: My original calculation underestimated the risk by about 2x. I neglected two key considerations, which fortunately somewhat canceled each other out. My new calculated estimate is 3.0 to 11.7 quality-adjusted days lost to long-term sequelae, with my all-things-considered mean at 45.
The two key things I missed:
- I estimated the risk of a non-hospitalized case is about 10x less than a hospitalized case, and so divided the estimates of disease burden by 10. The first part is correct, but the second part would only make sense if all disease burden were due to hospitalized cases. In fact, there's roughly a 15%:85% split between hospitalized and non-hospitalized patients in the study (13,654:73,435). So if the disease burden for a non-hospitalized patient is x, the total burden per patient is 0.15*10x + 0.85*x = 2.35x. So we should divide by 2.35, not 10.
- However, as Owain pointed out below, the [demographics](https://www.nature.com/articles/s41586-021-03553-9/tables/1) are non-representative and probably skew high-risk, given the median age is 60. Indeed, this is suggested by the 15% hospitalization figure (which also, I suspect, means they simply never included asymptomatic and most mildly symptomatic cases). An ONS survey (Figure 4) put symptoms reported after 5 weeks at 25% (20-30%) for 50-69 year olds and 17.5% (12.5-22.5%) for 17-24 year olds, which is surprisingly little difference: about a 1.5x decrease. I'd conjecture a 2x decrease in risk (noting that assuming no hospitalization is already doing a lot of work here).
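The two corrections can be sketched as a few lines of arithmetic, using only the numbers quoted above (the 2x demographic correction is the conjecture from the second bullet):

```python
# Errata arithmetic: replace the original 10x divisor with 2.35,
# then apply the conjectured 2x demographic correction.
hosp_frac, non_hosp_frac = 0.15, 0.85  # split in Al-Aly et al. (13,654:73,435)
# If a non-hospitalized case carries burden x, a hospitalized one carries 10x:
burden_multiplier = hosp_frac * 10 + non_hosp_frac * 1
print(round(burden_multiplier, 2))  # → 2.35

demographic_correction = 2  # conjectured 2x lower risk for low-risk individuals
net_change = 10 / (burden_multiplier * demographic_correction)
old_low, old_high = 1.4, 5.5  # original estimate, quality-adjusted days
print(round(old_low * net_change, 1), round(old_high * net_change, 1))
# → 3.0 11.7, matching the revised estimate
```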
Original post:
I did my own back-of-the-envelope calculation and came up with a similar but slightly higher estimated cost of 1.4 to 5.5 quality-adjusted days lost to long-term sequelae conditional on getting a symptomatic COVID case. FWIW, I originally thought the OP's numbers seemed way too low, and was going to write a take-down post—but unfortunately the data did not cooperate with this agenda. I certainly don't fully trust these numbers: they're based on a single study, and there were a bunch of places where I didn't keep track of uncertainty, so the true credible interval should definitely be a lot wider. Given that and the right-tailed nature of the distribution, my all-things-considered mean is closer to 30, but I figured I'd share the BOTEC anyway in case it's helpful to anyone.
My model is pretty simple:
1. What fraction of patients report symptoms at some short-term follow-up period (e.g. 4 to 12 weeks)? This we actually have data on.
2. How bad are these symptoms? This is fairly subjective.
3. How much do we expect these symptoms to decay long-term? This is going off priors.
For 1, I used Al-Aly et al. (2021) as a starting point, which compared medical records between a COVID-positive group and a demographically matched non-COVID control group in the US Department of Veterans Affairs database. Anna Ore felt this was one of the more rigorous studies, and I agree. Medical notes seem more reliable than self-report (though far from infallible), they seem to have actually done a Bonferroni correction, and they tested that their methodology didn't pick up false positives via both negative-outcome and negative-exposure controls. Caveat: many other studies have scarier headline figures, and it's certainly possible relying on medical records skews this low (e.g. doctors might be reluctant to give a diagnosis, many patients won't go to the doctor for mild symptoms, etc).
They report outcomes that occurred between 30 and 180 days after COVID exposure, although infuriatingly they don't seem to break it down any further by date. Figure 2 shows all statistically significant symptoms, in terms of the excess burden (i.e. increase above control) of the reported symptom per 1000 patients. There were 38 in total, ranging from 2.8% (respiratory signs and symptoms) to 0.15% (pleurisy). In total the excess burden was 26%.
I went through and rated each symptom with a very rough and subjective high / medium / low severity: 2% excess burden of high-severity symptoms, 19% medium, and 5% low. I then ballparked that high-severity symptoms (e.g. heart disease, diabetes, heart failure) wipe out 30% of your QALYs, medium-severity (e.g. respiratory signs, anxiety disorders, asthma) 5%, and low-severity (e.g. skin rash) 1%. Caveat: there's a lot of uncertainty in these numbers, although I suspect I've gone for higher costs than most people would, since I tend to think health has a pretty big impact on productivity.
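The severity weighting is just a dot product of the excess-burden shares and the subjective QALY weights above:

```python
# Severity-weighted QALY reduction, using the (subjective) buckets from the post.
buckets = {
    "high":   (0.02, 0.30),  # e.g. heart disease, diabetes, heart failure
    "medium": (0.19, 0.05),  # e.g. respiratory signs, anxiety disorders, asthma
    "low":    (0.05, 0.01),  # e.g. skin rash
}
qaly_reduction = sum(share * weight for share, weight in buckets.values())
print(f"{qaly_reduction:.1%}")  # → 1.6%
```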
Using my weightings, we get a 1.6% reduction in QALYs conditional on a symptomatic COVID case. I think this is misleading for three reasons:
1. Figure 3 shows that excess burden is much higher for people who were hospitalized, and if anything the gap seems bigger for more severe symptoms (e.g. about 10x less heart failure among people who tested positive but were not hospitalized, whereas rates of skin rash were only 2x lower). This is good news, as vaccines seem significantly more effective at preventing hospitalization, and if you are fortunate enough to be a young healthy person your chance of being hospitalized was pretty low to begin with. I'm applying a 10x reduction for this.
2. This excess burden is per diagnosis, not per patient. Sick people tend to receive multiple diagnoses. I'm not sure how to handle this. In some cases, badness-of-symptoms does seem roughly additive: if I had a headache, I'd probably pay a similar amount not to also develop a skin rash as I would if my head didn't hurt. But it seems odd to say that someone who drops dead from cardiac arrest was more fortunate than another patient with the same cause of death who also had the misfortune of being diagnosed with heart failure a week earlier. So there's definitely some double-counting across diagnoses, which I think justifies a 2-5x decrease.
3. This study presumably covered predominantly the original COVID strain (the cohort ran from March 2020 to 30 November 2020). Delta seems, per the OP, about 2-3x worse, so let's increase the estimate by that factor.
Overall we divide 1.6% by a factor of 6.7 (10*2/3) to 25 (10*5/2), to get a short-term QALY reduction of 0.064% to 0.24%.
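Combining the three corrections is a one-liner in each direction (the best case divides by the largest net factor, the worst case by the smallest):

```python
# Apply the three adjustments to the 1.6% short-term QALY reduction:
# 10x down (hospitalization), 2-5x down (double counting), 2-3x up (Delta).
short_term = 0.016
best_case  = 10 * 5 / 2   # = 25: largest reductions, smallest Delta penalty
worst_case = 10 * 2 / 3   # ≈ 6.7: smallest reductions, largest Delta penalty
print(f"{short_term / best_case:.3%} to {short_term / worst_case:.2%}")
# → 0.064% to 0.24%
```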
However, Al-Aly et al. include any symptom reported between 30 and 180 days. What we really care about is the chance of lifelong symptoms: if someone is still experiencing a symptom after 6 months, there seems to be a considerable chance it'll be lifelong, but if only 30 days have elapsed, the chance of recovery seems much higher. A meta-review by Thompson et al. (2021) seems to show a drop of around 2x between symptoms in the 4-12 week period vs. 12+ weeks (Table 2), although with some fairly wild variation between studies, so I don't trust this much. In an extremely dubious extrapolation, we could say that perhaps symptoms halve again from 12 weeks to 6 months, halve again from 6 months to a year, and after that persist as a permanent injury. In this case, we'd divide the "symptom after 30 days" figure from Al-Aly et al. by a factor of 8 to get the permanent-injury figure, which seems plausible to me (but again, you could totally argue for a much lower number).
With this final fudge, we get a lifelong QALY reduction of 0.008% to 0.03%. Assuming a 50-year remaining life expectancy, this amounts to 1.4 to 5.5 days of cost from long-term sequelae. Of course, there are also short-term costs (and risk of mortality!) that are omitted from this analysis, so the total cost will be higher than this.
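The final conversion can be sketched as below; note my arithmetic gives ~1.5 days at the low end, so the post's 1.4 presumably comes from less-rounded intermediates:

```python
# Convert the short-term QALY range into lifelong days lost:
# divide by 8 (three halvings: 12+ weeks, 6 months, 1 year, then permanent),
# then spread over a 50-year remaining life expectancy.
decay = 2 ** 3
low, high = 0.00064 / decay, 0.0024 / decay  # lifelong QALY reduction
days_remaining = 50 * 365
print(round(low * days_remaining, 1), round(high * days_remaining, 1))
# → 1.5 5.5 (the post reports 1.4 to 5.5)
```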
This matches my impression. FAR could definitely use more funding. Although I’d still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.
I’d also highlight that there’s a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people’s funding comes (directly or indirectly) from OpenPhil. While I think OpenPhil does a good job of trying to mitigate this (e.g. being careful about power dynamics, giving organizations exit grants if it decides to stop funding them, etc), it’s ultimately not a healthy dynamic, and OpenPhil appears to be quite capacity-constrained in terms of grant evaluation. So the entry of new funders would help diversify funding in addition to increasing total capacity.
One thing I don’t see people talk about as much, but which also seems like a key part of the solution: how can alignment orgs and researchers make more efficient use of existing funding? Spending that was appropriate a year or two ago when funding was plentiful may not be justified any longer, so there’s a need to explicitly put in place appropriate budgets and spending controls. There are a fair number of cost-saving measures I could see the ecosystem implementing that would have a limited (if any) hit on productivity: for example, improved cash management (investing in government money market funds earning ~5% rather than 0%-interest checking accounts); negotiating harder with vendors (it’s often possible to get substantial discounts on things like cloud compute or commercial real estate); and cutting back on some fringe benefits (e.g. higher-density open-plan space rather than private offices). I’m not trying to point fingers here: I’ve made missteps here as well—for example, FAR’s cash management currently has significant room for improvement. We’re in the process of fixing this and plan to share a write-up of what we found with other orgs in the next month.