LessWrong team member / moderator. I’ve been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I’ve been interested in improving my own epistemic standards and helping others to do so as well.
Raemon (Raymond Arnold)
...in the last 24 hours? Or, like, a while ago in a previous context?
Well, an alternate framing is “does the big stick turn out to have the effect you want?”
I guess the actual resolution here will eventually come from seeing the final headlines and that, like, they’re actually reasonable.
I’d be interested in a few more details/gears. (Also, are you primarily replying about the immediate parent, i.e. domestication of dissent, or also about the previous one?)
Two different angles of curiosity I have are:
what sort of things might you look out for, in particular, to notice if this was happening to you at OpenAI or similar?
something like… what’s your estimate of the effect size here? Do you have personal experience feeling captured by this dynamic? If so, what was it like? Or did you observe other people seeming to be captured, and what was your impression (perhaps in vague terms) of the diff that the dynamic was producing?
My take atm is “seems right that this shouldn’t be a permanent norm, there are definitely costs of disclaimer-ratcheting that are pretty bad. I think it might still be the right thing to do of your own accord in some cases, which is, like, supererogatory.”
I think there’s maybe a weird thing with this post, where, it’s trying to be the timeless, abstract version of itself. It’s certainly easier to write the timeless abstract version than the “digging into specific examples and calling people out” version. But, I think the digging into specific examples is actually kind of important here – it’s easy to come away with vague takeaways, where everyone nods along but then mostly thinks it’s Those Other Guys who are being power seeking.
Given that it’s probably 10-50x harder to write the Post With Specific Examples, I think actually a pretty okay outcome is “ship the vague post, and let discussion in the comments get into the inside-baseball-details.” And, then, it’d be remiss for the post-author’s role in the ecosystem not to come up as an example to dig into.
They can believe in catastrophic but non-existential risks. (Like, AI causes something like the CrowdStrike outage periodically if you’re not trying to prevent that.)
I think people mostly don’t believe in extinction risk, so the incentive isn’t nearly as real/immediate.
Part of the whole point of CEV is to discover at least some things that current humanity is confused about but would want if fully informed, with time to think. It’d be surprising to me if CEV-existing-humanity didn’t turn out to want some things that many current humans are opposed to.
So, I do think definitely I’ve got some confirmation bias here – I know because the first thing I thought when I saw it was “man this sure looks like the thing Eliezer was complaining about”, and it was a while later, thinking it through, that I was like “this does seem like it should make you really doomy about any agent-foundations-y plans, or other attempts to sidestep modern ML and cut towards ‘getting the hard problem right on the first try.’”
I did (later) think about that a bunch and integrate it into the post.
I don’t know whether I think it’s reasonable to say “it’s additionally confirmation-bias-indicative that the post doesn’t talk about general doom arguments.” As Eli says, the post is mostly observing a phenomenon that seems more about planmaking than general reasoning.
(fwiw my own p(doom) is more like ‘I dunno man, somewhere between 10% and 90%, and I’d need to see a lot of things going concretely right before my emotional center of mass shifted below 50%’)
Yeah. I tried to get at this in the Takeaways but I like your more thorough write up here.
In the world where people had exactly $30 to spend every hour and they’d either spend it or it disappeared, would you object to calling that spending money? I feel like many of my spending intuitions would still basically transfer to that world.
Curious for details.
People varied in how much Baba-Is-You experience they had. Some of them were completely new, and did complete the first couple levels (which are pretty tutorial-like) using the same methodology I outline here, before getting to a level that was a notable challenge.
They actually did complete the first couple levels successfully, which I forgot when writing this post. This does weaken the rhetorical force, but also, the first couple levels are designed more to teach the mechanics and are significantly easier. I’ll update the post to clarify this.
Some of them had played before, and were starting a new level from around where they left off.
...fwiw I think it’s not grossly inaccurate.
I think MIRI did put a lot of effort into being cooperative about the situation (i.e. Don’t leave your fingerprints on the future, doing the ‘minimal’ pivotal act that would end the acute risk period, and when thinking about longterm godlike AI, trying to figure out fair CEV sorts of things).
But, I think it was also pretty clear that “have a controllable, safe AI that’s just powerful enough to take some action that prevents anyone else from building a more powerful and more dangerous AI” was not in the Overton window. I don’t know what Eliezer’s actual plan was since he disclaimed “yes I know melt all the GPUs won’t work”, but, like, “melt all the GPUs” implies a level of power over the world that is really extreme by historical standards, even if you’re trying to do the minimal thing with that power.
Also, the second section makes an argument in favor of backchaining. But that seems to contradict the first section, in which people tried to backchain and it went badly.
This didn’t come across in the post, but – I think people in the experiment were mostly doing things closer to (simulated) forward chaining, and then getting stuck, and then generating the questionable assumptions (which is also what I tended to do when I first started this experiment).
An interesting thing I learned is that “look at the board and think without fiddling around” is actually a useful skill to have even when I’m doing the more open-ended “solve it however seems best.” It’s easier to notice now when I’m fiddling around pointlessly instead of actually doing useful cognitive work.
I had a second half of this essay that felt like it was taking too long to pull together and I wasn’t quite sure who I was arguing with. I decided I’d probably try to make it a second post. I generally agree it’s not that obvious what lessons to take.
The beginning of the second-half/next-post was something like:
There’s an age-old debate about AI existential safety, which I might summarize as the viewpoints:
1. “We only get one critical try, and most alignment research dodges the hard part of the problem, with wildly optimistic assumptions.”
vs
2. “It is basically impossible to make progress on remote, complex problems on your first try. So, we need to somehow factor the problem into something we can make empirical progress on.”
I started out mostly thinking through lens #1. I’ve updated that, actually, both views may be “hair on fire” levels of important. I have some frustrations both with some doomer-y people who seem resistant to incorporating lens #2, and with people who seem to (in practice) be satisfied with “well, iterative empiricism seems tractable, and we don’t super need to incorporate frame #1.”
I am interested in both:
trying to build “engineering feedback loops” that more accurately represent the final problem as best we can, and then iterating on both “solving representative problems against our current best engineered benchmarks” while also “continuing to build better benchmarks.” (Automating Auditing and Model Organisms of Misalignment seem like attempts at this)
trying to develop training regimens that seem like they should help people plan better in Low-Feedback-Domains, which includes theoretical work, empirical research that’s trying to keep its eye on the longterm ball better, and the invention of benchmarks à la the previous bullet.
Games I was particularly thinking of were They Are Billions and Slay the Spire. I guess also Factorio, although the shape of that is a bit different.
(to be clear, these are fictional examples that don’t necessarily generalize, but, when I look at the AI situation I think it-in-particular has an ‘exponential difficulty’ shape)
Optimistic Assumptions, Longterm Planning, and “Cope”
I also just realized the actual reason I do this is not because it works better, but because I felt too awkward merely turning my back.
Size.