Roman Leventov comments on The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

Roman Leventov 19 Dec 2023 9:07 UTC
8 points
1
Post-response: Assessment of AI safety agendas: think about the downside risk
You evidently follow a variant of 80000hours’ framework for comparing (solving) particular problems in terms of expected impact: Neglectedness x Scale (potential upside) x Solvability.
I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn, or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda.
This reminds me of Taleb’s mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the “ruin potential”. See “The Logic of Risk Taking”.
Of the approaches that you listed, some sound risky to me in this respect. Particularly, “4. ‘Reinforcement Learning from Neural Feedback’ (RLNF)”—sounds like a direct invitation for wireheading to me. More generally, scaling BCI in any form and not falling into a dystopia at some stage is akin to walking a tightrope (at least at the current stage of civilisational maturity, I would say) This speaks to agendas #2 and #3 on your list.
There are also similar qualms about AI interpretability: there are at least four posts on LW warning of the potential risks of interpretability:
This speaks to the agenda “9. Neuroscience x mechanistic interpretability” on your list.
Related earlier posts
- Some background for reasoning about dual-use alignment research
- Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?
- Cameron Berg 19 Dec 2023 16:31 UTC
  8 points
  4
  Parent
  With respect to the RLNF idea, we are definitely very sympathetic to wireheading concerns. We think that approach is promising if we are able to obtain better reward signals given all of the sub-symbolic information that neural signals can offer in order to better understand human intent, but as you correctly pointed out that can be used to better trick the human evaluator as well. We think this already happens to a lesser extent and we expect that both current methods and future ones have to account for this particular risk.
  More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
  We do call some of these concerns out in the post, eg:
  We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we don’t even know, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions.)
  Overall—in spite of the double-edged nature of alignment work potentially facilitating capabilities breakthroughs—we think it is critical to avoid base rate neglect in acknowledging how unbelievably aggressively people (who are generally alignment-ambivalent) are now pushing forward capabilities work. Against this base rate, we suspect our contributions to inadvertently pushing forward capabilities will be relatively negligible. This does not imply that we shouldn’t be extremely cautious, have rigorous info/exfohazard standards, think carefully about unintended consequences, etc—it just means that we want to be pragmatic about the fact that we can help solve alignment while being reasonably confident that the overall expected value of this work will outweigh the overall expected harm (again, especially given the incredibly high, already-happening background rate of alignment-ambivalent capabilities progress).
  - Roman Leventov 19 Dec 2023 17:41 UTC
    2 points
    0
    Parent
    More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
    I can push back on this somewhat by noting that most risks from BCI may lay outside of the scope of control of any company that builds it and “plugs people in”, but rather in the wider economy and social ecosystem. The only thing that may matter is the bandwidth and the noisiness of information channel between the brain and the digital sphere, and it seems agnostic to whether a profit-maximising, risk-ambivalent, or a risk-conscious company is building the BCI.
    - Cameron Berg 19 Dec 2023 18:07 UTC
      3 points
      0
      Parent
      It’s a great point that the broader social and economic implications of BCI extend beyond the control of any single company, AE no doubt included. Still, while bandwidth and noisiness of the tech are potentially orthogonal to one’s intentions, companies with unambiguous humanity-forward missions (like AE) are far more likely to actually care about the societal implications, and therefore, to build BCI that attempts to address these concerns at the ground level.
      In general, we expect the by-default path to powerful BCI (i.e., one where we are completely uninvolved) to be negative/rife with s-risks/significant invasions of privacy and autonomy, etc, which is why we are actively working to nudge the developmental trajectory of BCI in a more positive direction—i.e., one where the only major incentive is build the most human-flourishing-conducive BCI tech we possibly can.