Mmm, beef tallow is pretty ‘in’ these days, no? I also think there’s got to be some mileage from optimization to find the bliss point.
I don’t think the decision theory described here is correct. (I’ve read Planecrash.)
Specifically, there’s an idea in glowfic that it should be possible for lawful deities to follow a policy wherein counterparties can give them arbitrary information, on the condition that information is not used to harm the information-provider. This could be as drastic as “I am enacting my plan to assassinate you now, and would like you to propose edits that we both would want to make to the plan”!
I think this requires agreement ahead of time, and is not the default mode of conversation. (“Can I tell you something, and you won’t get mad?” is a request, not a magic spell to prevent people from getting mad at you.) I think it also is arguably something that people should rarely agree to. Many people don’t agree to the weaker condition of secrecy, because the information they’re about to receive is probably less valuable than the costs of partitioning their mind or keeping information secret. In situations where you can’t use the information against your enemies (like two glowfic gods interacting), the value of the information is going to be even lower, and situations where it makes sense to do such an exchange even rarer. (Well, except for the part where glowfic gods can very cheaply partition their minds and so keeping secrets or doing pseudohypothetical reasoning is in fact much cheaper for them than it is for humans.)
That is, I think this is mostly a plot device that allows for neat narratives, not a norm that you should expect people to follow or to be called out for violating.
[This is not a complete treatment of the issue; I think most treatments of it only handle one pathway, the “this lets you get information you can use for harm reduction” pathway, and in fact in order to determine whether or not an agent should do it, you must consider all relevant pathways. But I think the presumption should not be “the math pencils out here”, and I definitely don’t think the math pencils out in interacting with Oli. I think characterizing that as “Oli is a bad counterparty” instead of something like “Oli doesn’t follow glowfic!lawful deity norms” or “I regret having Oli as a counterparty” is impolite.]
Specifically, this is the privacy policy inherited from when LessWrong was a MIRI project; to the best of my knowledge, it hasn’t been updated.
Would it be better for society if we had mediocrities in those positions and my friends working as welders, craftsmen, local business owners &c?
I think of this often when it comes to teaching—many women who are now doctors would have been teachers (or similar) a hundred years ago, and so now very smart children don’t come into contact with many very smart adults until they themselves are adults (except at magnet programs or events and so on).
But whenever I try to actually put numbers to it, it’s pretty clear that the sort is in fact helping. Yes, education is worse, but the other fields are better, and the prices are actually conveying information about the desirability, here.
It’s also a bit jarring to read such a pessimistic book and then reach the kind of rosy optimism about international cooperation otherwise associated with such famous delusions as the Kellogg-Briand Pact (which banned war in 1929 and … did not work out).
The authors also repeatedly analogize AI to nuclear weapons and yet they never mention the fact that something very close to their AI proposal played out in real life in the form of the Baruch Plan for the control of atomic energy (in brief, this called for the creation of a UN Atomic Energy Commission to supervise all nuclear projects and ensure no one could build a bomb, followed by the destruction of the American nuclear arsenal). Suffice it to say that the Baruch Plan failed, and did so under circumstances much more favorable to its prospects than the current political environment with respect to AI. A serious inquiry into the topic would likely begin there.
I think the core point for optimism is that leaders in the contemporary era often don’t pay the costs of war personally—but nuclear war changes that. It in fact was not in the interests of the elites of the US or the USSR to start a hot war, even if their countries might eventually be better off by being the last country standing. Similarly, the US or China (as countries) might be better off if they summon a demon that is painted their colors—but it will probably not be in the interests of either the elites or the populace to summon a demon.
So the core question is the technical one—is progress towards superintelligence summoning a demon, or probably going to be fine? It seems like we only know how to do the first one, at the moment, which suggests in fact people should stop until we have a better plan.
[I do think the failure of the Baruch Plan means that humanity is probably going to fail at this challenge also. But it still seems worth trying!]
Directly from the farm—if there’s not one near you, you might be out of luck.
Eating the largest possible animal means less suffering per kg.
I think this is the right general trend but the details matter and make it probably not true. I think cow farming is probably more humane than elephant farming or whale farming would be.
If you have the ability, have your own hens. It’s a really rewarding experience and then you can know for sure that the hens are happy and treated well.
Unfortunately, I’m moderately uncertain about this. I think chickens have been put under pretty tremendous selection pressure and their internal experiences might be quite bad, even if their external situations seem fine to us. I’m less worried about this if you pick a heritage breed (which will almost definitely have worse egg production), which you might want to do anyway for decorative reasons.
Similarly, consider ducks (duck eggs are a bit harder to come by than chicken eggs, but Berkeley Bowl stocks them and many duck farms deliver eggs—they’re generally eaten by people with allergies to chicken eggs) or ostriches (by similar logic to cows—but given that they lay giant eggs instead of lots of eggs, it’s a much less convenient form factor).
Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
Hmm, I feel more on the Eliezer/Nate side of this one. I think it’s a medium call that capabilities science advances faster than alignment science, and so we’re not on track without drastic change. (Like, the main counterargument is negative alignment tax, which I do take seriously as a possibility, but I think probably doesn’t close the gap.)
Overall, I got the strong impression that the book was trying to convince me of a worldview where it doesn’t matter how hard we try to come up with methods to control advanced AI systems, because at some point one of those systems will tip over into a level of intelligence where we just can’t compete.
FWIW, my sense is that Y&S do believe that alignment is possible in principle. (I do.)
I think the “eventually, we just can’t compete” point is correct. Suppose we have some gradualist chain of humans controlling models controlling model advancements, from here out to Dyson spheres. I think it’s extremely likely that eventually the human control on top gets phased out, as happened with chess, where centaurs (human plus AI teams) are now worse and make more mistakes than pure AI systems. Thinking otherwise feels like postulating that machines can never be superhuman at legitimacy.[1]
Chapter 10 of the book talks about the space probe / nuclear reactor / computer security angle, and I think a gradualist control approach that takes those three seriously will probably work. I think my core complaint is that I mostly see people using gradualism as an argument that they don’t need to face those engineering challenges, and I expect them to simply fail at difficult challenges they’re not attempting to succeed at.
Like, there’s this old idea of basins of reflective stability. It’s possible to imagine a system that looks at itself and says “I’m perfect, no notes”, and then the question is—how many such systems are there? Each is probably surrounded by other systems that look at themselves and say “actually I should change a bit, like so—” and become one of the stable systems, and systems even further out will change to only have one problem, and so on. The choices we’re making now are probably not jumping straight to the end, but instead deciding which basin of reflective stability we’re in. I mostly don’t see people grappling with the endpoint, or trying to figure out the dynamics of the process, and instead just trusting it and hoping that local improvements will eventually translate to global improvements.
- ^
Incidentally, a somewhat formative experience for me was AAAI 2015, when a campaign to stop lethal autonomous weapons was getting off the ground, and at the ethics workshop a representative wanted to establish a principle that computers should never make a life-or-death decision. One of the other attendees objected—he worked on software to allocate donor organs to people on the waitlist, and for them it was a point of pride and important coordination tool that decisions were being made by fair systems instead of corruptible or biased humans.
Like, imagine someone saying that driving is a series of many life-or-death decisions, and so we shouldn’t let computers do it, even as the computers become demonstrably superior to humans. At some point people let the computers do it, and at a later point they tax or prevent the humans from doing it.
- ^
this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
Sorry, this seems wild to me. If current techniques seem lethal, and future techniques might be worse, then I’m not sure what the point is of pointing out that the future will be different.
But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it’s pretty unclear that the situation would go poorly and I’d guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.
I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.
Is your current sense that if you and Buck retired, the rest of the AI field would successfully deliver on alignment? Like, I’m trying to figure out whether the default you have in mind is “your research plan succeeds” or “the world without your research plan”.
I think this is missing the point of “the date of AI Takeover is not the day the AI takes over”: the point of no return might appear much earlier than when Skynet decides to launch the nukes. Like, I think the default outcome in a gradualist world is ‘Moloch wins’, and there’s no fire alarm that allows for derailment once it’s clear that things are not headed in the right direction.
For example, I don’t think it was the case 5 years ago that a lot of stock value was downstream of AI investment, but this is used elsewhere on this very page as an argument against bans on AI development now. Is that consideration going to be better or worse, in five years? I don’t think it was obvious five years ago that OpenAI was going to split over disagreements on alignment—but now it has, and I don’t see the global ‘trial and error’ system repairing that wound rather than just rolling with it.
I think the current situation looks bad and just letting it develop without intervention will mean things get worse faster than things get better.
I mean, I would describe various Trump tariff plans as “tanking the global economy”, I think it was fair to describe Smoot-Hawley as that, and so on.
I think the book makes the argument that expensive things are possible—this is likely cheaper and better than fighting WWII, the comparison they use—and it does seem fair to criticize their plan as expensive. It’s just that the alternative is far more expensive.
No, it does not develop neuralese. The architecture that it is being trained on is already using neuralese.
You’re correct on the object level here, and it’s a point against Collier that the statement is incorrect, but I do think it’s important to note that a fixed version of the statement serves the same rhetorical purpose. That is, on page 123 it does develop a new mode of thinking, analogized to a different language, which causes the oversight tools to fail and also leads to an increase in capabilities. So Y&S are postulating a sudden jump in capabilities which causes oversight tools to break, in a way that a more continuous story might not have.
I think Y&S still have a good response to the repaired argument. The update was adopted because it improved capabilities—the scientific mode of reasoning was superior to the mythical mode—but there could nearly as easily have been an update which didn’t increase capabilities but scrambled the reasoning in such a way that the oversight system broke. Or the guardrails might have been cutting off too many prospective thoughts, and so the AI lab is performing a “safety test” wherein they relax the guardrails, and a situationally aware Sable generates behavior that looks well-behaved enough that the relaxation stays in place, which then allows it to escape when it’s monitored less closely.
This is about making a pretty straightforward and I think kind of inevitable argument that as you are in the domain of neuralese, your representations of concepts will diverge a lot from human concepts, and this makes supervision much harder.
I don’t think this is about ‘neuralese’, I think a basically similar story goes thru for a model that only thinks in English.
What’s happening, in my picture, is that meaning is stored in the relationships between objects, and that relationship can change in subtle ways that break oversight schemes. For example, imagine an earnest model which can be kept in line by a humorless overseer. When the model develops a sense of humor / starts to use sarcasm, the humorless overseer might not notice the meaning of the thoughts has changed.
See also some discussion over here.
Do you agree with the “types of misalignment” section of MacAskill’s tweet? (Or, I guess, is it ‘similar to your position’?)
If not, I think it would be neat to see the two of you have some sort of public dialogue about it.
Regarding your other points, maybe you will find it interesting to read Carlsmith’s doc on how to control AI motivations:
To be clear, this is a description of the alignment problem, not a solution. To quote from it:
in many respects, the picture above functions, in my head, centrally as a structured decomposition of the problems that an adequate approach to motivation control needs to overcome. It’s certainly not a “solution” to the alignment problem, in the sense of “a detailed, do-able, step-by-step plan that will work with high-confidence, and which requires only realistic deviation from the default trajectory.” And on its own, I’m not sure it even warrants the term “plan.”
(from the other comment):
I don’t have capacity to continue this discussion, but thanks for sharing your perspective.
Sure, I suspect this is a fine enough place to stop the conversation for now.
I think my core complaints about their arguments are very similar to Will MacAskill, Kelsey Piper, and the majority of people whose ideas on AI safety I have a moderate amount of respect for.
So taking this tweet as representative of MacAskill’s thoughts, and this as representative of Kelsey Piper’s, I see:
The evolution analogy in part I.
You like it but think the way the authors would handle the disanalogies would probably be bad; MacAskill complains that they don’t handle the disanalogies; Piper doesn’t discuss it.
Discontinuous capability growth.
MacAskill doesn’t like it; you don’t seem to comment on it; Piper doesn’t seem to comment on it. (I think MacAskill also misunderstands its role and relevance in the argument.)
In particular, MacAskill quotes PC’s summary of EY as “you can’t learn anything about alignment from experimentation and failures before the critical try” but I think EY’s position is closer to “you can’t learn enough about alignment from experimentation and failures before the critical try”.
The world in which we make our first crucial try will be significantly different from the current world.
I think both you and MacAskill think this is a significant deficiency (this is your tricky hypothesis #2); I think Piper also identifies this as a point that the authors don’t adequately elaborate on, but as far as I can tell she doesn’t think this is critical. (That is, yes, the situation might be better in the future, but not obviously better enough that we shouldn’t attempt a ban now.)
Catastrophic misalignment.
MacAskill thinks we have lots of evidence that AIs will not do what the user wanted, but not very much evidence that AIs will attempt to take over. I think both you and Piper think it’s likely that there will be at least one AI of sufficient capability that attempts to take over.
Part 3.
You and MacAskill both seem to dislike their policy proposals. Piper seems much more pro-ban than you or MacAskill are; I don’t get a good sense of whether MacAskill actually thinks a ban is bad (what catch-up risk is there if neither frontrunners nor laggards can train AIs?) or just unlikely to be implemented.
I don’t think MacAskill is thinking thru the “close substitutes for agentic superintelligence” point. If they are close substitutes, then they have enough of the risks of agentic superintelligence that it still makes sense to ban them!
So, at least on this pass, I didn’t actually find a specific point that all three of you agreed on. (I don’t count “they should have had a better editor” as a specific point, because it doesn’t specify the direction; an editing choice Piper liked more could easily have been an editing choice that MacAskill liked less.)
The closest was that the book isn’t explicit or convincing enough when talking about iterative alignment strategies (like in chapter 11). Are there other points that I missed (or should I believe your agreement on that point is actually much clearer than I think it is)?
Do you happen to have a good argument why the book proper couldn’t be 20% longer to better make their case?
I think there’s a curve of how many people pick up the book at all that depends on length. I didn’t do this estimation explicitly—and my guess is the authors and publishers were doing it implicitly rather than explicitly—but I’d guess you get something like 20% fewer readers if the book is 20% longer, and that the number of additional people who find it convincing with 20% more length is something like 5% of readers, which I think means increasing the length is suboptimal.
(Like, in my favorite world we could A/B test this with the ebook or w/e, where we dynamically include material and see which pieces to include, or have something Arbital-style where people can expand sections for elaboration as needed. But this is very challenging to do with a physical book.)
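To make the arithmetic behind that estimate explicit, here’s a toy back-of-envelope sketch (the readership figure and baseline conversion rate are numbers I made up for illustration; only the 20%-fewer-readers and 5%-more-convinced guesses come from the estimate above):

```python
# Toy back-of-envelope: current-length book vs. a hypothetical 20%-longer one.
# The readership and baseline conversion numbers are invented for illustration.
baseline_readers = 100_000    # hypothetical readership at the current length
baseline_conversion = 0.30    # hypothetical fraction of readers convinced

longer_readers = 0.80 * baseline_readers          # ~20% fewer people pick it up
longer_conversion = baseline_conversion + 0.05    # ~5% more of its readers convinced

convinced_current = baseline_readers * baseline_conversion  # 30,000
convinced_longer = longer_readers * longer_conversion       # 28,000
print(convinced_current, convinced_longer)

# The longer book only comes out ahead when 0.8 * (p + 0.05) > p, i.e. when the
# baseline conversion rate p is below 0.20, so it wins only if the current book
# convinces fewer than about 1 in 5 readers.
```

Under any baseline conversion rate above roughly 20%, the shorter book converts more people in total, which is the sense in which I think the extra length is suboptimal.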
I think it’s called a reverse sear because the ‘sear’ step happens second—after the low-and-slow cooking—whereas it’s a more common technique in cooking to start with the high heat to get the browning, and then lower the temperature.
Note that bacteria grow faster at hotter temperatures, until you reach the temperature where they die. (125°F, one of the temperatures mentioned in the article, is not hot enough to kill bacteria, and is thus one of the worst parts of the Danger Zone.) For a large cut of meat like a steak, you’re mostly worried about stuff that’s on the outside of it, and so a quick sear at a high temperature will kill what’s on the surface, and then you can comfortably cook at a lower temperature. My best guess is that this is not a major problem at the times discussed here (30 minutes in the danger zone is within USDA guidelines) but it probably was a worse idea when food safety was worse. Also note that once the steak is in the oven, the oven itself is at a safe temperature, so you don’t need to worry about the outside of the steak or about contamination from the oven.
[As mentioned in a linked article, the commonly stated justification was to “lock in the juices”, which isn’t true, but it wouldn’t surprise me if food safety was the actual impetus behind that advice.]
[[edit: I should also note that lots of recipes, like stew, start off with something that you want to fry (cook at temperatures higher than water boils at) and then later add something that you want to boil or steam (cook at water’s boiling temperature). It is way easier to fry the meat and then add it to the boiling water than it is to boil the stew for a while, separate out the meat, and then fry it at the end.]]