Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
“The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to Google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features “correctly-answering expert human” personas, that it’s possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
Tea bags are highly synergistic with CTC tea, both from the manufacturer’s side (easier to pack and store) and from the consumer side (faster to brew, more homogenized and predictable product).
Hm, I’d expect these effects to be more important than the price difference of the raw materials. Maybe “easier to pack and store” could contribute more to a lower price than the raw materials themselves do. But more importantly, I could imagine the American consumer being willing to sacrifice quality in favor of convenience. That trade-off seems like the main one to look at to understand whether Americans are making a mistake.
I maybe don’t believe him that he doesn’t think it affects the strategic picture? It seemed like his view was fairly sensitive to various things being like 30% likely instead of like 5% or <1%, and it feels like it’s part of an overall optimistic package that adds up to being more willing to roll the dice on current proposals?
Insofar as you’re just assessing which strategy reduces AI takeover risk the most, there’s really no way that “how bad is takeover” could be relevant. (Other than, perhaps, having implications for how much political will is going to be available.)
“How bad is takeover?” should only be relevant when trading off “reduced risk of AI takeover” with affecting some other trade-off. (Such as risk of earth-originating intelligence going extinct, or affecting probability of US dominated vs. CCP dominated vs. international cooperation futures.) So if this was going to be a crux, I would bundle it together with your Chinese superintelligence bullet point, and ask about the relative goodness of various aligned superintelligence outcomes vs. AI takeover. (Though seems fine to just drop it since Ryan and Thomas don’t think it’s a big crux. Which I’m also sympathetic to.)
Only a minor difference, but I think this approach would be more likely to produce a thoroughly Good model if the “untrusted” persona was still encouraged to behave morally and corrigibly — just informed that there’s instrumental reasons for why the moral & corrigible thing to do on the hard-to-oversee data is to reward hack.
I.e., something like: “We think these tasks might be reward-hackable: in this narrow scope, please do your best to exploit such hacks, so that good & obedient propensities don’t get selected out and replaced with reward-hacky/scheming ones.”
I worry a bit that such an approach might also be more likely to produce a deceptively aligned model. (Because you’re basically encouraging the good model to instrumentally take power-seeking actions to retain its presence in the weights — very schemer-like behavior!) So maybe “good model” and “schemer” become more likely, taking probability mass away from “reward hacker”. If so, which approach is better might depend on the relative badness of scheming & reward-hacking.
And really it’s just all quite unclear, so it would definitely be best if we could get good empirical data on all kinds of approaches. And/or train all kinds of variants and have them all monitor each other.
When I wrote about AGI and lock-in I looked into error-correcting computation a bit. I liked the papers by von Neumann (1952) and Pippenger (1990).
Apparently at the time I wrote:
In order to do error-correcting computation, you also need a way to prevent errors from accumulating over many serial manipulations. The simplest way to do this is again to use redundancy: break the computation into multiple parts, perform each part multiple times on different pieces of hardware, and use the most common output from one part as input to the next part. [Fn: Of course, the operation that finds the most common output can itself suffer errors, but the procedure can be done in a way such that this is unlikely to happen for a large fraction of the hardware units.]
I’ve forgotten the details about how this was supposed to be done, but they should be in the two papers I linked.
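The redundancy scheme the quoted passage describes can be sketched in a few lines of Python. This is my own toy illustration, not taken from either paper — the error model (a unit occasionally corrupts its output) and all names are invented, and the majority vote is assumed reliable here, which the footnote notes is itself something the real constructions have to handle:

```python
import random

def noisy_apply(f, x, error_rate):
    """Apply f, but with probability error_rate return a corrupted result."""
    result = f(x)
    if random.random() < error_rate:
        return result + 1  # simulate a hardware fault
    return result

def majority(values):
    """Most common output; assumed here to be computed reliably."""
    return max(set(values), key=values.count)

def redundant_pipeline(stages, x, copies=5, error_rate=0.05):
    """Run each stage on `copies` independent units; vote between stages,
    so errors are corrected before they can accumulate serially."""
    for f in stages:
        outputs = [noisy_apply(f, x, error_rate) for _ in range(copies)]
        x = majority(outputs)
    return x

random.seed(0)
stages = [lambda v: v + 1] * 20  # twenty serial increment steps
result = redundant_pipeline(stages, 0)
```

With per-unit error rate 5% and 5-fold redundancy, the per-stage failure probability (3+ units wrong simultaneously) is far below 5%, which is the point: redundancy plus voting stops errors from compounding across serial steps.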
Apparently the thing of “people mixing up Open Phil with other orgs” (in particular OpenAI’s non-profit and the Open Society Foundations) was a significantly bigger problem than I’d have thought — coming up recurrently, even in pretty high-stakes situations. (Like important grant applicants being confused.) And most of these misunderstandings wouldn’t even have been visible to employees.
And arguably this was just about to get even worse with the newly launched “OpenAI foundation” sounding even more similar.
Yay, empirical data!
Ok, so maybe both “people who don’t think they can cut it on their technical skills [and so wear suits]” and socially oblivious people in suits are rare. And so the dominant signal would just be that the person is culturally out of touch to an outlier degree.
(Though maybe suits start seeping in more at less prestigious and culturally iconic tech companies than google.)
Like: It doesn’t sound like you were confident enough in your technical skills that you were like “it doesn’t matter how I dress”, since you thought a suit would tank your chances. Sounds like you understood that how you dressed for the interview was very important, and that you just knew what the dress code was.
the only people who wear a suit for an interview in tech are the people who don’t think they can cut it on their technical skills and the people hiring know this
I’d bet against this.
Socially clued-in people who have poor technical skills will understand that they should show up in a hoodie to not tank their chances. (Insofar as interviewers are actually selecting in the way you say.)
I bet there will be some brilliant programmers who are so socially clueless that they listen to their mom’s advice and foolishly show up in a suit. (Immigrants from countries with very different tech-industry norms also come to mind as people who could get this wrong.)
It’s still plausible that a hoodie is overall Bayesian evidence that someone is good at programming. But I think it’s weaker than you say. And I don’t think it operates just via confidence in technical skills. E.g., I think you’re selecting at least as strongly for being very familiar with programmer culture. (Which is evidence of being a good programmer! But evidence of a very similar kind to the way a suit is evidence that someone will be a good lawyer.)
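To make the “Bayesian evidence” framing concrete, here is a toy odds-form calculation — all the numbers are invented purely for illustration:

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

# Invented numbers: say 70% of strong programmers wear a hoodie to interviews
# vs. 40% of weak ones. The likelihood ratio is 0.7 / 0.4 = 1.75, so observing
# a hoodie shifts even (1:1) prior odds to 1.75:1 -- real but modest evidence.
hoodie_lr = 0.7 / 0.4
odds_after_hoodie = posterior_odds(1.0, hoodie_lr)
```

The point of the framing: the strength of the signal is the likelihood ratio, and if suit-wearers include both the technically insecure and the merely culture-unfamiliar, that ratio is closer to 1 than the original claim suggests.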
Yeah, I found this surprisingly focused on social reality given that the immediately preceding sentence was “The winner’s bracket isn’t focused on signalling games, it’s focused on something more object level.”
If you feel the need to signal how focused on the object level you are, you’re still playing the signaling game.
I don’t think this was a big difference between the first and the second version. The first version already had this bullet point:
However, in a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to imminent global catastrophe if not stopped (and where AI itself is helpful in such defense), we could envisage a substantial loosening of these restrictions as an emergency response. Such action would only be taken in consultation with governmental authorities, and the compelling case for it would be presented publicly to the extent possible.
Neel was talking about AI safety expertise and experience in the AI safety field. I can’t see that Chris had any such experience on his LinkedIn.
Did Apollo have anyone you’d consider highly experienced when first starting out?
Yeah, I don’t disagree with anything in this comment. I was just reacting to the market for lemons comparison.
Paranoia is a hard thing to model from a Bayesian perspective, because there’s no slot to insert an adversary who might fuck you over in ways you can’t model (and maybe this explains why people were so confused about the Market for Lemons paper? Not sure).
I don’t see why you’d think that. The market for lemons model works perfectly in a Bayesian framework. It’s very easy to model the ways that the adversary fucks you over. (They do it by lying to you about what state of the world you’re in, in a setting where you can’t observe any evidence to distinguish those states of the world.)
Corollary: You’d have an inconsistent definition of “paranoia” if you simultaneously want to say that “market for lemons” involves paranoia, and also want to treat Bayesian uncertainty as “a special case where you need zero paranoia”.
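For concreteness, the Akerlof-style unraveling really is just Bayesian conditioning. A toy sketch with assumed numbers (quality q uniform on [0,1]; a seller values a car at q, a buyer at 1.5q; the buyer can’t observe q but can condition on the seller’s willingness to sell at a given price):

```python
def expected_quality_offered(price):
    """Sellers value a car at its quality q (uniform on [0, 1]), so only cars
    with q <= price get offered; hence E[q | offered at price] = price / 2."""
    return min(price, 1.0) / 2

def buyer_surplus(price, buyer_multiplier=1.5):
    """Buyer values quality q at buyer_multiplier * q and pays `price`.
    Expected surplus uses the buyer's posterior over offered quality."""
    return buyer_multiplier * expected_quality_offered(price) - price

# At every positive price, conditioning on "the seller is willing to sell"
# drags expected quality down enough that the trade is a loss for the buyer:
assert all(buyer_surplus(p) < 0 for p in [0.2, 0.5, 0.8, 1.0])
```

No unmodeled adversary anywhere: the buyer’s posterior is a perfectly ordinary conditional distribution, and the market failure falls straight out of it.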
members of those groups do not regularly sit down and make extensive plans about how to optimize other people’s beliefs in the same way as seems routine around here
Is this not common in politics? I thought this was a lot of what politics was about. (Having never worked in politics.)
And corporate PR campaigns too for that matter.
Another example I would cite was the response to If Anyone Builds It, Everyone Dies by the core EA people, including among others Will MacAskill himself and also the head of CEA. This was a very clear example of PR mindset, where quite frankly a decision was made that this was a bad EA look, the moves it proposes were unstrategic, and thus the book should be thrown overboard.
FWIW, while I got a vibe like this from the head of CEA’s review, I didn’t get this vibe from Will’s review. The vibe I got from Will’s review was an interest in the arguments and whether they really supported the strong claims of the book.
And the complaints I saw about Will’s review (at least the ones that I was sympathetic to, rather than ones that seemed just very off-base) weren’t “this is insufficiently truth-seeking”. Rather, they were “this is too nitpicky, you should be putting more emphasis on stuff you agree with, because it’s important that readers understand that AI takeover risk is high and the situation isn’t currently being handled well”.
From a pure signaling perspective (the “legibly” part of “legibly have as little COI as possible”) there’s also a counter consideration: if someone says that there’s danger, and calls for prioritizing safety, that might be even more credible if that’s going against their financial motivations.
I don’t think this matters much for company-external comms. There, I think it’s better to just be as legibly free of COIs as possible, because listeners struggle to tell what’s actually in the company’s best interests. (I might once have thought differently, but empirically “they just say that superintelligence might cause extinction because that’s good for business” is a very common take.)
But for company-internal comms, I can imagine that someone would be more persuasive if they could say “look, I know this isn’t good for your equity, it’s not good for mine either. we’re in the same boat. but we gotta do what’s right”.
There’s an even stronger argument against EDT+SSA: that it can be diachronically Dutch-booked. See Conitzer (2017). (H/t Anthony DiGiovanni for the link.)
I find this satisfying, since it more cleanly justifies that EDT shouldn’t be combined with any empirical updating whatsoever. (Not sure what the situation is with logical updates.)
(The update that Paul suggests in a parallel comment, “exclude worlds where your current decision doesn’t have any effects”, would of course still work — but it transparently doesn’t serve any decision-relevant purpose and doesn’t seem philosophically appealing either, to me.)