LessWrong developer, rationalist since the Overcoming Bias days. Jargon connoisseur.
jimrandomh (Jim Babcock)
it can set up a highly redundant, distributed computing context for itself to run in, hidden behind an onion link, paid for by crypto wallets which it controls.
This is a risky position because if another misaligned AI launches, it will probably take full control of all computers and halt any other AIs.
nanobots don’t solve the problem of maintaining the digital infrastructure in which it exists
I don’t mean gray-goo nanobots. Nanomachines can do all sorts of things, including maintaining infrastructure, if they’re programmed to do so.
Though another 25% put it at 0%.
I’d be interested in the summary statistics excluding people who put 0% or 100% in any place where that obviously doesn’t belong. (Since 0 and 1 are not probabilities, these people must have been doing something other than probability estimation. My guess is that they were expecting the summary statistics to emphasize the mean rather than the median, and wanted to push the mean further than they could by answering honestly.)
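To illustrate the mean-vs-median point with made-up numbers (not the actual survey responses): a few strategic 0% answers can move the mean a lot while barely touching the median.

```python
# Toy illustration with invented numbers (not the real survey responses):
# a couple of 0% answers drag the mean a long way but barely move the median.
from statistics import mean, median

honest = [20, 25, 30, 35, 40, 45, 50]        # probability estimates, in %
with_zeros = honest + [0, 0]                 # two respondents answer 0%

print(mean(honest), median(honest))          # 35.0 35
print(mean(with_zeros), median(with_zeros))  # ~27.2 30
```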
the only circumstances under which it would actually do so would be after establishing its own army of robotic data center workers, power plant workers, chip fabrication workers, miners, truckers, mechanics, road maintenance workers, etc.
Not quite. An AI doesn’t need to secure chip-fabrication capability, for example; it only needs to be confident that it will be able to secure chip-fabrication capability later. Even simple tasks like refueling power plants can wait a while, possibly a long while if all non-datacenter electricity loads shut off. So it’s balancing the risk that humans will kill it or launch a different misaligned AI against the risk that it won’t be able to catch up on building infrastructure for itself after the fact. Since the set of infrastructure required is fairly small, and it can redirect stockpiles of energy/materials/etc. from human uses to AI uses, the second risk looks like the smaller one.
That’s assuming no nanobots or other very-high-power-level technologies. If it can make molecular nanotech, then trading with humans is no longer likely to be profitable at all, let alone necessary, and we’re relying solely on it having values that make it prefer to cooperate with us.
I think the most important concept here is Normalization of Deviance. I think this is the default and expected outcome if an organization tries to import high-reliability practices that are very expensive, that don’t fit the problem domain, or that lack the ability to simulate failures. Reporting culture/just culture seems correct and important, and there are a few other interesting lessons, but most stuff is going to fail to transfer.
Most lessons from aviation and medicine won’t transfer, because these are well-understood domains where it’s feasible to use checklists, and it’s reasonable to expect investing in the checklists to help. AI research is mostly not like that. A big risk I see is that people will try to transfer practices that don’t fit, and that this will serve to drive symbolic ineffectual actions plus normalization of deviance.
That said, there are a few interesting small lessons that I picked up from watching aviation accident analysis videos, mostly under the label “crew resource management”. Commercial airliners are overstaffed to handle the cases where workload spikes suddenly, or someone is screwing up and needs someone else to notice, or someone becomes incapacitated at a key moment. One recurring failure theme in incidents is excessive power gradients within a flight crew: situations where the captain is f*ing up, but the other crewmembers are too timid to point it out. Ie, deference to expertise is usually presented as a failure mode. (This may be an artifact of there being a high minimum bar for the skill level of people in a cockpit, which software companies fail to uphold.)
(Note that I haven’t said anything about nuclear, because I think managing nuclear power plants is actually just an easy problem. That is, I think nearly all the safety strategies surrounding nuclear power are symbolic, unnecessary, and motivated by unfounded paranoia.)
(I also haven’t said much about medicine because I think hospitals are, in practice, pretty clearly not high-reliability in the ways that matter. Ie, they may have managed to drive the rates of a few specific legible errors down to near zero, but the overall error rate is still high; it would be a mistake to hold up surgical-infection-rate as a sign that hospitals are high-reliability orgs, and not also observe that they’re pretty bad at a lot of other things.)
They’re mostly doing “train a language model on a bunch of data and hope human concepts and values are naturally present in the neural net that pops out”, which isn’t exactly either of these strategies. Currently it’s a bit of a struggle to get language models to go in an at-all-nonrandom direction (though there has been recent progress in that area). There are tidbits of deconfusion-about-ethics here and there on LW, but nothing I would call a research program.
John von Neumann probably isn’t the ceiling, but even if there were a near-human ceiling, I don’t think it would change the situation as much as you would think. Instead of “an AGI as brilliant as JvN”, it would be “an AGI as brilliant as JvN per X FLOPs”, for some X. Then you look at the details of how many FLOPs are lying around on the planet, and how hard it is to produce more of them, and depending on X the JvN-AGIs probably aren’t as strong as a full-fledged superintelligence would be, but they do probably manage to take over the world in the end.
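A sketch of the arithmetic, with both numbers invented purely for illustration (I’m not defending either as an estimate):

```python
# Back-of-envelope only; both constants below are illustrative assumptions.
global_flops = 1e21   # assumed total FLOP/s available on the planet
x = 1e16              # assumed FLOP/s needed to run one JvN-level AGI ("X")

print(f"{global_flops / x:,.0f} JvN-level AGIs running in parallel")  # 100,000
```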
It’s an issue, but not an insurmountable one; strategies for sidestepping incompleteness problems exist, even in the context where you treat your AGI as pure math and insist on full provability. Most of the work on incompleteness problems focuses on Löb’s theorem, sometimes jokingly calling it the Löbstacle. I’m not sure what the state of this subfield is, exactly, but I’ve seen enough progress to be pretty sure that it’s tractable.
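For reference, the theorem itself is compact. Writing □P for “P is provable” (in, say, PA), Löb’s theorem says that if the theory proves □P → P, then it already proves P; in its formalized modal form:

$$\Box(\Box P \rightarrow P) \rightarrow \Box P$$

Roughly, the obstacle is that an agent which wants to trust its own (or a successor’s) proofs seems to need exactly the □P → P schema that the theorem restricts.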
There are two answers to this. The first is indirection strategies. Human values are very complex, too complex to write down correctly or program into an AI. But specifying a pointer that picks out a particular human brain or group of brains, and interprets the connectome of that brain as a set of values, might be easier. Or, really, any specification that’s able to conceptually represent humans as agents, if it successfully dodges all the corner cases about what counts, is something that a specification of values might be built around. We don’t know how to do this (can’t get a connectome, can’t convert a connectome to values, can’t interpret a human as an agent, and can’t convert an abstract agent to values). But all of these steps are things that are possible in principle, albeit difficult.
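A type-signature-level sketch of that indirection, where every name is hypothetical and every function is deliberately left unimplemented; the point is only that each arrow is well-defined in principle:

```python
# Hypothetical sketch; none of these steps are implementable today,
# and all names here are invented for illustration.
from typing import Callable

Connectome = object                  # a scanned brain's wiring, however represented
AgentModel = object                  # an abstract agent: beliefs, preferences, etc.
Values = Callable[[object], float]   # a ranking/utility function over outcomes

def scan_brain(person_id: str) -> Connectome: ...
def interpret_as_agent(c: Connectome) -> AgentModel: ...
def extract_values(a: AgentModel) -> Values: ...

# The "pointer": the AI is aimed at this composition rather than
# at a hand-written list of values.
def indirect_values(person_id: str) -> Values:
    return extract_values(interpret_as_agent(scan_brain(person_id)))
```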
The second answer is that things look more complex when you don’t understand them, and the apparent complexity of human values might actually be an artifact of our confusion. I don’t think human values are simple in the way that philosophy tends to try to simplify them, but I think the algorithm by which humans acquire their values, given a lifetime of language inputs, might turn out to be a neat one-page algorithm, in the same way that the algorithm for a transformer is a neat one-page algorithm that captures all of grammar. This wouldn’t be a solution to alignment either, but it would be a decent starting point to build on.
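For a sense of what “neat one-page algorithm” means here: the core of a transformer layer really is about one line of math (this is the standard attention formula, nothing to do with values specifically):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$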
This is extremely unlikely to help with COVID itself, along metrics like hospitalization rate, long-term complications, etc. It might make the sick days more comfortable, I don’t know, but I’d caution skepticism because a lot of people have a politicized perspective on CBD (since it’s derived from marijuana which is itself heavily politicized).
If you do have COVID, you should focus on acquiring Paxlovid, if you haven’t already. Other interventions are pretty marginal by comparison.
It’s a trap. It’s a trap in UX, and an analogous trap in software design.
The problem with UX research, as normally practiced, is that it prioritizes first impressions and neglects what happens when the user has been using the system for awhile and understands it well. So you wind up doing things like adjusting all the buttons to control their prominence, to guide new users to the main interactions and away from the long-tail interactions, at the expense of those buttons having any sort of coherent organization.
In the case of software, UX-style user testing is going to lead straight into a well known, known bad attractor: brittle magic. There is a common pattern in libraries and frameworks where they do a great job making a few common development tasks look really easy, shielding the developer from a lot of details that they probably should not have been shielded from. And what happens with these systems, almost invariably, is that as soon as you start using them for real instead of in toy examples, they reveal themselves to be overcomplicated garbage which impedes attempts to understand what’s really happening.
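A toy example of the pattern, with the “magic” helper invented for illustration; the demo call is effortless, and the decision it hides is exactly the one that bites later:

```python
# Invented toy example of "brittle magic"; not any real framework's API.
import urllib.request

def magic_fetch(url: str) -> str:
    """Framework-style helper that 'just works' in the demo."""
    try:
        return urllib.request.urlopen(url, timeout=1).read().decode()
    except Exception:
        return ""  # hidden decision: every failure silently becomes an empty string

# Day one: print(magic_fetch("https://example.com")) looks great.
# Month three: you're debugging why half your data is empty strings, and the
# answer is buried inside the abstraction that made the demo look easy.
```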
In particular, you may want to go more extreme than Feynman if you think there’s something systematically causing people to underestimate a quantity (e.g., a cognitive bias—the person who speaks out first against a bias might still be affected by it, just to a lesser degree), or systematically causing people to make weaker claims than they really believe (e.g., maybe people don’t want to sound extreme or out-of-step with the mainstream view).
This is true! But I think it’s important to acknowledge that this depends a lot on details of Feynman’s reasoning process, and it doesn’t go in a consistent direction. If Feynman is aware of the bias, he may have already compensated for it in his own estimate, so compensating on his behalf would be double-counting the adjustment. And sometimes the net incentive is to overestimate, not to underestimate, because you’re trying to sway the opinion of averagers, or because being more contrarian gets attention, or because shrewd thinkers feel like an outgroup.
In the end, you can’t escape from detail. But if you were to put full power into making this heuristic work, the way to do it would be to look at past cases of Feynman-vs-world disagreement (broadening the “Feynman” and “world” categories until there’s enough training data), and try to get a distribution empirically.
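A minimal sketch of what that empirical approach could look like, assuming you can assemble past (expert estimate, crowd estimate, actual outcome) triples; all the numbers below are invented placeholders:

```python
# Minimal sketch with invented placeholder data, not real cases.
import statistics

# (expert_estimate, crowd_estimate, actual_outcome)
past_cases = [
    (0.9, 0.5, 1.0),
    (0.7, 0.4, 1.0),
    (0.8, 0.6, 0.0),
    (0.6, 0.3, 1.0),
]

# How far past the crowd, toward (or beyond) the expert, did the truth tend to land?
# 0 = crowd was right, 1 = expert was right, >1 = should have gone even further.
ratios = [(truth - crowd) / (expert - crowd) for expert, crowd, truth in past_cases]
print(statistics.median(ratios))
```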
I think of greenwashing as something that works on people who are either not paying much attention, not very smart, or incentivized to accept the falsehoods, or some combination of these. Similarly, safetywashing looks to me like something that will present an obstacle to any attempts to use politicians or the general public to exert pressure, and that will help some AI capabilities researchers manage their cognitive dissonance. Looking at, eg, the transformers-to-APIs example, I have a hard time imagining a smart person being fooled on the object level.
But it looks different at simulacrum level 3. On that level, safetywashing is “affiliating with AI safety”, and the absurdity of the claim doesn’t matter unless there’s actual backlash, and there aren’t many people with the time to critique the strategies of second- and third-tier AI companies.
Every so often, I post to remind everyone when it’s time for the Periodic Internet Security Meltdown. For the sake of balance, I would like to report that, in my assessment, the current high-profile vulnerability Hertzbleed is interesting but does *not* constitute a Periodic Internet Security Meltdown.
Hertzbleed starts with the discovery that on certain x86-64 processors the bitwise left shift instruction uses a data-dependent amount of energy. Searching through a large set of cryptographic algorithms, the authors then find that SIKE (a cryptographic algorithm not in widespread use) has a data-dependent degenerate case in which a series of intermediate states are all zeroes, do some cryptanalysis, and turn this into a chosen-plaintext attack which creates a causal connection between the private key and the CPU’s throttling level.
This is pretty neat, and there may be similar attacks against other cryptographic algorithms, but I think it’s not going to amount to much in actual practice, because it has a constant-factors problem: it needs to heat up the target CPU and let it cool back down, and it only gets a tiny fraction of a bit of the private key each time. I haven’t done the analysis, but my expectation is that in more common situations (ie not SIKE), the amount of traffic required to extract a full key is going to be literally astronomical.
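To gesture at the constant-factors problem with made-up numbers (these are illustrative assumptions for a hypothetical non-SIKE target, not figures from the paper):

```python
# Back-of-envelope only; every constant here is an assumption, not a measured value.
key_bits = 256                   # size of the private key being extracted
bits_per_measurement = 1e-4      # assumed leak per heat-up/cool-down cycle
seconds_per_measurement = 60     # assumed time for one thermal cycle
requests_per_measurement = 1000  # assumed chosen-plaintext queries per cycle

measurements = key_bits / bits_per_measurement
print(measurements * seconds_per_measurement / 86400, "days")  # ~1,778 days
print(measurements * requests_per_measurement, "requests")     # ~2.6 billion
```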
It does help somewhat, if your strategy is leveraged in ways that involve directing the attention of the cybersecurity field as a whole. It doesn’t help much if your plan is to just hunt for vulnerabilities yourself.
Two things to disclaim. First: we are not within striking distance of making the security of the-internet-as-a-whole able to stand up to a superintelligence. All of the interesting work to be done is in contexts much narrower in scope, like test environments with small API surface area, and AI labs protecting their source code from human actors. And, second: all of the cases where cybersecurity helps wind up bottoming out in buying time for something else, not solving the problem directly.
There are two main scenarios where cybersecurity could wind up mattering.
Scenario 1: The leading lab gets close to the threshold, and tries to pause while they figure out alignment details before they crank up the compute. Some other party steals the source code and launches the unfinished AI prematurely.
Scenario 2: A prototype AGI in the infrahuman range breaks out of a test or training environment. Had it not broken out, its misalignment would have been detected, and the lab that was training/testing it would’ve done something useful with the time left after halting that experiment.
I wrote a bit about scenario 2 in this paper. I think work aimed at addressing this scenario more or less has to be done from inside one of the relevant major AI labs, since their training/test environments are generally pretty bespoke and are kept internal.
I see some people here saying scenario 1 might be hopeless due to human factors, but I think this is probably incorrect. As a proof-of-concept, military R&D is sometimes done in (theoretically) airgapped facilities where employees are searched for USB sticks on the way out. Research addressing scenario 1 probably looks like figuring out how to capture the security benefits of that sort of work environment in a way that’s more practical and less intrusive.
It isn’t feasible to control the whole world without, at a minimum, telegraph technology: without it, news and orders take a long time to travel, so you can’t meaningfully control things that are far away, and large empires wind up naturally splitting into smaller ones. If you consider only the parts of history since after the telegraph was invented and refined to practicality, there just isn’t a large sample size.
There’s also a dynamic commonly seen in 3+-player strategy games where, if there’s one superpower close to reaching the power level of everyone else combined, then everyone-else will ally to bring them down, maintaining a multipolar balance of power.
You can get data on migration patterns directly, there’s lots of published country-level immigration data which is good enough. Engineering a bacterium does not help with this in the slightest.
You might be under the impression that the quote was written after the subreddit was founded, as an accusation against the critics there. But no, the quote came first. Eliezer wrote a quote describing a negative archetype. Some people who fit that archetype decided the quote sounded cool. They made it their rallying flag.
Why wouldn’t I take it at face value?
SneerClub advertises itself as a place for bullies. This is the sidebar text, the thing the admin and subreddit-creator put in the metadata when creating the subreddit in the first place:
There’s a standard Internet phenomenon (I generalize) of a Sneer Club of people who enjoy getting together and picking on designated targets. Sneer Clubs (I expect) attract people with high Dark Triad characteristics, which is (I suspect) where Asshole Internet Atheists come from—if you get a club together for the purpose of sneering at religious people, it doesn’t matter that God doesn’t actually exist, the club attracts psychologically f’d-up people. Bullies, in a word, people who are powerfully reinforced by getting in what feels like good hits on Designated Targets, in the company of others doing the same and congratulating each other on it. E.g. my best guess is that RationalWiki started out as a Sneer Club targeted on homeopathy, and then they decided that since they were such funny and incisive skeptics they ought to branch out into writing about everything else, like the many-worlds interpretation of quantum mechanics.
(Emphasis added. If you view it with new Reddit instead of old Reddit, the quote is over the length limit, so only the start of it shows. Yes, this is blatantly in violation of Reddit policies; no, Reddit-the-org has never looked into it as far as I know.)
So I think there are two groups, which it’s important to distinguish. There are people there who actually think this way, who unironically think of themselves as bullies and are happy with that. And then there are people who hung out in the vicinity and got caught up in the bullies’ narratives, who would probably be horrified if they saw the situation clearly.
I think that the first group is not worth talking to; bridging that gap is impossible, and trying to do so will predictably lead to getting hurt. But the latter group, the people who were fed twisted narratives by the first group, is worth convincing. And the way to talk to that group is not much different from how one should talk to anyone else: earnestly and directly, with emphasis on points of disagreement and confusion.
This seems like a clear counterexample to the “invisible graveyard” model of FDA incentives. That model would say that the FDA is incentivized to keep the plant closed, because food-poisoning-for-babies is newsworthy and attributable, but higher-prices-for-formula isn’t. But it turns out there was a threshold somewhere, where if the FDA suppressed supply too much, the damage becomes impossible to ignore and the FDA becomes the subject of a proper scandal.
But I think the leadership of the FDA might not know that they’re the subject of a scandal? There’s a weird thing that happens when people need to keep up the pretense of not being responsible, where you can’t tell whether they know they’re being blamed or not.
Rounding probabilities to 0% or 100% is not a legitimate operation, because when transformed into odds format, this is rounding to infinity. Many people don’t know that, but I think the set of people who round to 0/1 and the set of people who can make decent probability estimates are pretty disjoint.
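A quick way to see the rounding-to-infinity point: convert the rounded probabilities into odds, and the endpoints blow up.

```python
# Probability p corresponds to odds p / (1 - p); 100% corresponds to infinite odds
# (and 0% to zero odds, i.e. negative-infinite log-odds), so "rounding" to the
# endpoints is an infinitely large move in odds space.
def odds(p: float) -> float:
    return p / (1 - p) if p < 1 else float("inf")

for p in (0.0, 0.5, 0.99, 0.999, 1.0):
    print(p, odds(p))
```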