context note: Jacob is also a mod/works for LessWrong, kave isn’t doing this to random users.
Elizabeth
the same argument for a different virtue, allegedly from C.S. Lewis
I think this is beautiful but incorrect, mostly because it discounts the virtue in keeping yourself in situations where virtue is easy.
I interpret your comment as assuming that new researchers with good ideas produce more impact on their own than in teams working towards a shared goal
I don’t believe that, although I see how my summary could be interpreted that way. I agree with basically all the reasons in your recent comment and most in the original comment. I could add a few reasons of my own why doing independent grant-funded work sucks. But I think it’s really important to track whether founding projects actually translates into increased safety, rather than just intermediate outputs, and to push hard against potential tail-wagging-the-dog scenarios.
I was trying to figure out why this was important to me, given how many of your points I agree with. I think it’s a few things:
Alignment work seems especially prone to tail-wagging-the-dog dynamics, and is harder to correct, due to poor feedback loops.
The consequences of this can be dire:
making it harder to identify and support the best projects.
making it harder to identify and stop harmful projects.
making it harder to identify when a decent idea isn’t panning out, leading to people and money getting stuck in the mediocre project instead of moving on.
One of the general concerns about MATS is that it spins up potential capabilities researchers. If the market can’t absorb the talent, that suggests maybe MATS should shrink.
OTOH if you told me that for every 10 entrants MATS spins up 1 amazing safety researcher and 9 people who need makework to prevent going into capabilities, I’d be open to arguments that that was a good trade.
Everyone who waits longer than me to publicly share their ideas is a coward, afraid to expose their ideas to the harsh light of day. Everyone who publicly shares their ideas earlier than me is a maniac, wasting other people’s time with stream-of-consciousness bullshit.
This still reads to me as advocating for a jobs program for the benefit of MATS grads, not safety. My guess is you’re aiming for something more like “there is talent that could do useful work under someone else’s direction, but not on their own, and we can increase safety by utilizing it”.
Talent leaves MATS/ARENA and sometimes struggles to find meaningful work
I’m surprised this one was included, it feels tail-wagging-the-dog to me.
Good question. My revised belief is that OpenAI will not sufficiently slow down production in order to boost safety. It may still produce theoretical safety work that is useful to others, and to itself if the changes are cheap to implement.
I do also expect many people assigned to safety to end up doing more work on capabilities, because the distinction is not always obvious and they will have so many reasons to err in the direction of agreeing with their boss’s instructions.
how likely do you believe it is that OAI has a position where at least 90% of people who are both (A) qualified skill-wise (e.g., ML and interpretability experts), and (B) believe that AIXR is a serious problem, would increase safety faster than capabilities in that position?
The cheap answer here is 0, because I don’t think there is any position where that level of skill and belief in AIXR has a 90% chance of increasing net safety. Ability to do meaningful work in this field is rarer than that.
So the real question is how does OpenAI compare to other possibilities? To be specific, let’s say being an LTFF-funded solo researcher, academia, and working at Anthropic.
Working at OpenAI seems much more likely to boost capabilities than solo research and probably academia. Some of that is because they’re both less likely to do anything. But that’s because they face OOM less pressure to produce anything, which is an advantage in this case. LTFF is not a pressure- or fad-free zone, but they have nothing near the leverage of paying someone millions of dollars, or providing tens of hours each week surrounded by people who are also paid millions of dollars to believe they’re doing safe work.
I feel less certain about Anthropic. It doesn’t have any of the terrible signs OpenAI did (like the repeated safety exoduses, the board coup, and clawbacks on employee equity), but we didn’t know about most of those a year ago.
If we’re talking about a generic skilled and concerned person, probably the most valuable thing they can do is support someone with good research vision. My impression is that these people are more abundant at Anthropic than OpenAI, especially after the latest exodus, but I could be wrong. This isn’t a crux for me for the 80k board[1] but it is a crux for how much good could be done in the role.
Some additional bits of my model:
I doubt OpenAI is going to tell a dedicated safetyist they’re off the safety team and on direct capabilities. But the distinction is not always obvious, and employees will be very motivated to not fight OpenAI on marginal cases.
You know those people who stand too close, so you back away, and then they move closer? Your choices in that situation are to steel yourself for an intense battle, accept the distance they want, or leave. Employers can easily pull that off at scale. They make the question become “am I sure this will never be helpful to safety?” rather than “what is the expected safety value of this research?”
Alternate frame: How many times will an entry level engineer get to say no before he’s fired?
I have a friend who worked at OAI. They’d done all the right soul searching and concluded they were doing good alignment work. Then they quit, and a few months later were aghast at concerns they’d previously dismissed. Once you are in the situation it is very hard to maintain accurate perceptions.
Something @Buck said made me realize I was conflating “produce useful theoretical safety work” with “improve the safety of OpenAI’s products.” I don’t think OpenAI will stop production for safety reasons[2], but they might fund theoretical work that is useful to others, or that is cheap to follow themselves (perhaps because it boosts capabilities as well...).
This is a good point and you mentioning it updates me towards believing that you are more motivated by (1) finding out what’s true regarding AIXR and (2) reducing AIXR, than something like (3) shit talking OAI.
Thank you. My internal experience is that my concerns stem from x-risk (and, belatedly, the wage theft). But OpenAI has enough signs of harm and enough signs of hiding harm that I’m fine shit talking as a side effect, where normally I’d try for something more cooperative and with lines of retreat.
[1] I think the clawbacks are disqualifying on their own, even if they had no safety implications. They stole money from employees! That’s one of the top 5 signs you’re in a bad workplace. 80k doesn’t even mention this.
[2] To ballpark quantify: I think there is a <5% chance that OpenAI slows production by 20% or more in order to reduce AIXR. And I believe frontier AI companies need to be prepared to slow by more than that.
I’d define “genuine safety role” as “any qualified person will increase safety faster than capabilities in the role”. I put ~0 likelihood that OAI has such a position. The best you could hope for is being a marginal support for a safety-based coup (which has already been attempted, and failed).
There’s a different question of “could a strategic person advance net safety by working at OpenAI, more so than any other option?”. I believe people like that exist, but they don’t need 80k to tell them about OpenAI.
reposting comment from another post, with edits:
re: accumulating status in hope of future counterfactual impact.
I model status-qua-status (as opposed to status as a side effect of something real) as something like a score for “how good are you at cooperating with this particular machine?”. The more you demonstrate cooperation, the more the machine will trust and reward you. But you can’t leverage that into getting the machine to do something different; that would immediately zero out your status/cooperation score.
There are exceptions. If you’re exceptionally strategic you might make good use of that status by e.g. changing what the machine thinks it wants, or co-opting the resources and splintering. It is also pretty useful to accumulate evidence you’re a generally responsible adult before you go off and do something weird. But this isn’t the vibe I get from people I talk to with the ‘status then impact’ plan, or from any of 80k’s advice. Their plans only make sense if either that status is a fungible resource like money, or if you plan on cooperating with the machine indefinitely.
So I don’t think people should pursue status as a goal in and of itself, especially if there isn’t a clear sign for when they’d stop and prioritize something else.
From Conor’s response on EAForum, it sounds like the answer is “we trust OpenAI to tell us”. In light of what we already know (safety team exodus, punitive and hidden NDAs, lack of disclosure to OpenAI’s governing board), that level of trust seems completely unjustified to me.
When I did my vegan nutrition write-ups, I directed people to Examine.com’s Guide to Vegan+Vegetarian Supplements. Unfortunately, it is paywalled. Fortunately, it is now possible to ask your library to buy access, so you can read that guide plus their normal supplement reviews at no cost to yourself.
Library explainer: https://examine.com/plus/public-libraries/
Ven*n guide: https://examine.com/guides/vegetarians-vegans/
How does 80k identify actual safety roles, vs. safety-washed capabilities roles?
I would say Epistemic Daddies are deferred to, for action and strategy, although sometimes with a gloss of giving object level information. But I think you’re right that there’s a distinction between “giving you strategy” and “telling you your current strategy is so good it’s going right on the fridge”, and Daddy/Mommy is a decent split for that.
re: accumulating status in hope of future counterfactual impact.
I model status-qua-status (as opposed to status as a side effect of something real) as something like a score for “how good are you at cooperating with this particular machine?”. The more you demonstrate cooperation, the more the machine will trust you. But you can’t leverage that into getting the machine to do something different; that would immediately zero out your status/cooperation score.
There are exceptions. If you’re exceptionally strategic you might make good use of that status by e.g. changing what the machine thinks it wants, or co-opting the resources and splintering. But this isn’t the vibe I get from people I talk to with the ‘status then impact’ plan, or from any of 80k’s advice. They sound like they think status is a fungible resource that can be spent anywhere, like money[1].
So unless you start with a goal and authentically backchain into a plan where a set amount of a specific form of status is a key resource, you probably shouldn’t accumulate status.
I think money-then-impact plans risk being nonterminating, but are great if they are responsive and will terminate.
I also think getting a few years of normal work under your belt between college and crazy independent work can be a real asset, as long as you avoid the just-one-more-year trap.
Empirical vs. Mathematical Joints of Nature
Part of me likes the idea of making solstice higher investment. But I feel like the right balance is one high investment event and one very low investment event, and high investment is a much better fit for winter solstice.
I like that split because I see value in both high investment, high meaning things that will alienate a lot of people (because they’re too much work, or the meaning doesn’t resonate with them), and in Schelling points for social gathering. These can’t coexist, so better to have separate events specializing in each.
I’m much more likely to take existing karma into account when strong voting. For weak votes I’ll just vote my opinion unless the karma total is way out of line with what I think is deserved. This comes up mostly with comments that are bad but not so bad I want to beat a dead horse, or that express a popular sentiment without adding much.
While writing the email to give mentioned people and orgs a chance to comment, I wasn’t sure whether to BCC (more risk of going to spam) or CC (shares their emails). I took an FB poll, which got responses from the class of people who might receive emails like this, but not from the specific people I emailed. Of the responses, 6 said CC and one said either. I also didn’t receive any objections from the people I actually emailed. So it seems like CCing is fine.
huh. was it the particular meme (brave dude telling the truth), the size, or some third thing?