Programmer.
MinusGix
While human moral values are subjective, there is a sufficiently large shared core that you can target when aligning an AI. As well, values held by a majority (ex: caring for other humans, enjoying certain fun things) are also essentially shared. Values that are held by smaller groups can also be catered to.
If humans were sampled from the entire space of possible values, then yes we (maybe) couldn’t build an AI aligned to humanity, but we only take up a relatively small space and have a lot of shared values.
I initially wrote a long comment discussing the post, but I rewrote it as a list-based version that tries to more efficiently parcel up the different objections/agreements/cruxes.
This list ended up basically just as long, but I feel it is better structured than my original intended comment.
(Section 1): How fast can humans develop novel technologies
I believe you assume too much about the necessary time based on specific human discoveries.
Some of the technologies you cite as evidence (ex: submarines) just didn’t have the right pressure at the time to go further. I think a more accurate estimate of the time interval would start from when people began paying attention to the problem again and deliberately working on/towards it (though for many things that’s probably hard to find).
Though, while I think focusing on when they began deliberately working is more accurate, there’s still a notable amount of noise due to how an AGI differs from humans: its ability to focus, its unity (relative to a company), and the large amount of existing data available in the future
Other technologies I would expect were ‘put off’ because they’re also closely linked to the available technology at the time. It can be hard to do specific things if your Materials-science understanding simply isn’t good enough.
Then there’s the obvious throttling at the number of people in the industry focusing on that issue, or even capable of focusing on that issue.
As well, to assume thirty years means that you also assume that the AGI does not have the ability to provide more incentive to ‘speed up’. If it needs to build a factory, then yes there are practical limitations on how fast the factory can be built, but obstructions like regulation and cost are likely easier to remove for an AGI than a normal company.
Crux #1: How long it takes for human inventions to spread after being thought up / initially tested / etc.
This is honestly the one that seems to be the primary generator for your ‘decades’ estimate, however I did not find it that compelling even if I accept the premise that an AGI would not be able to build nanotechnology (without building new factories to build the new tools it needs to actually perform it)
Note: The other cruxes later on are probably more about how much the AI can speed up research (or already has access to), but this could probably include a specific crux related to that before this crux.
(Section 2): Unstoppable intellect meets the complexity of the universe
I agree that there are likely eventual physical limits on intelligence and research results (though you likely hit the practical limit on expected ROI before that).
However, there would be many low-hanging fruits which are significantly easier to grab with a combination of high compute + intelligence that we simply didn’t/couldn’t grab beforehand. (This would be affected by the lead time: if we had good math prover/explainer AIs for two decades before AGI then we’d have started to pick a lot of the significant ideas. But as the next part points out, having more of the research already available just helps you.)
I also think that the fact that we’ve gotten rid of many of the notable easier-to-reach pieces (ex: classical mechanics → GR → QM → QFT) is actually a sign that things are easier now in terms of doing something. The AGI has a significantly larger amount of information about physics, human behavior, logic, etcetera, that it can use without having to build it completely from the ground up.
If you (somehow) had an AGI appear in 1760 without much knowledge, then I’d expect that it would take many experiments and a lot of time to detail the nature of its reality. Far less than we took, but still a notable amount. This is the scenario where I can see it taking 80 years for the AGI to get set up, but even then I think that’s more due to restrictions on readily available compute to expand into after self-modification than other constraints.
However, we’ve picked out a lot of the high and low level models that work. Rather than building an understanding of atoms through careful experimentation procedures, it can assume that they exist and pretty much follow the rules it’s been given.
(maybe) Crux #2: Do we already have most of the knowledge needed to understand and/or build nanotechnology?
I’m listing this as ‘maybe’ as I’m more notably uncertain about this than others.
Does it just require the concentrated effort of a monolithic agent staring down at the problem and being willing to crunch a lot of calculations and physics simulators?
Or does it require some very new understanding of how our physics works?
(Section 3): What does AGI want?
Minor objection on the split of categories. I’d find it.. odd if we manage to make an AI that terminally values only ‘kill all humans’.
I’d expect more varying terminal values, with ‘make humans not a threat at all’ (through whatever means) as an instrumental goal
I do think it is somewhat useful for your thought experiments later on, which try to make the point that even a ‘YOLO AGI’ would have a hard time having an effect
(Section 4): What does it take to make a pencil?
I think this analogy ignores various issues
Of course, we’re talking about pencils, but the analogy is more about ‘molecular-level 3d-printer’ or ‘factory technology needed to make molecular level printer’ (or ‘advanced protein synthesis machine’)
Making a handful of pencils if you really need them is a lot more efficient than setting up that entire system.
Though, of course, if you’re needing mass production levels of that object then yes you will need this sort of thing.
Crux #3: How feasible is it to make small numbers of specialized technology?
There are some scientific setups that are absolutely massive and require enormous amounts of funding, but there are also those that you can set up in a home workshop with the appropriate tools. I highly doubt either of those is the latter, but I’d also be skeptical that they need to be the size of the LHC.
Note: Crux #4 (about feasibility of being able to make nanotechnology with a sufficient understanding of it and with current day or near-future protein synthesis) is closely related, but it felt more natural to put that with AlphaFold.
(Section 5): YOLO AGI?
I think your objection that they’re all perfectly doable by humans in the present is lacking.
By metaphor:
While it is possible for someone to calculate a million digits of pi by hand, the difference between speed and overall capability is shocking.
While it is possible for a monkey to kill all of its enemies, humans have a far easier time with modern weaponry, especially in terms of scale
Your assumption that it would take decades for even just the scenarios you list (except perhaps the last two) seems wrong
Unless you’re predicating on the goal being literally wiping out every human, but then that’s a problem with the model simplification of YOLO AGI, where we model an extreme version of an AGI to talk about the more common, relatively less extreme versions that aren’t hell-bent on killing us, just neutralizing us. (I’m assuming that is the intent of both the section #3 split and this section.)
Then there are, of course, other scenarios that you can think up, with various levels of speed and certainty of lethality.
Ex: Relatively mild but still destructive memetic hazards (perhaps the level of ‘kill your neighbor’ memetic hazard is too hard to find) can cause significant problems and give the AGI room to be more obvious.
Synthesize a food/drink/recreational-drug that is quite nice (and probably cheap) that also sterilizes you after a decade, to use in combination with other plans to make it even harder to bounce back if you don’t manage to kill them in a decade
To say that an AGI focused on killing will only “somewhat” increase the chances seems to underplay it severely.
If I believed a nation state solidly wanted to do any of those on the list in order to kill humanity right now, then that would increase my worry significantly more than ‘somewhat’
For an AGI that:
Isn’t made up of humans who may value being alive, or are willing to put it off for a bit for more immediate rewards than their philosophy
Can essentially be a one-being research organization
Likely hides itself better
then I would be even more worried.
(Section 6): But what about AlphaFold?
This ignores how recent AlphaFold is.
I would expect that it would improve notably over the next decade, given the evidence that it works being supplied to the market.
(It would be like assuming GPT-1 would never improve. While there are certainly limits on how much it can improve, do we have evidence now that AlphaFold is even halfway to the practical limit?)
This ignores the possibility of more ‘normal’ simulation:
While simulating physics accurately is highly computationally expensive, I don’t find it infeasible that
AI before, or the AGI itself, will find some neat ways of specializing the problem to the specific class of problems that they’re interested in (aka abstractions over the behavior of specific molecules, rather than accurately simulating them) that are just intractable for an unassisted human to find
This also has benefits in that it is relatively more well understood, which makes it likely easier to model for errors than AlphaFold (though the difference depends on how far we/the-AGI get with AI interpretability)
The AI can get access to relatively large amounts of compute when it needs it.
I expect that it can make a good amount of progress in theory before it needs to do detailed physics implementations to test its ideas.
I also expect this to only grow over time, unless it takes actions to harshly restrict compute to prevent rivals
I’m very skeptical of the claim that it would need decades of lab experiments to fill in the gaps in our understanding of proteins.
If the methods for predicting proteins get only to twice as good as AlphaFold, then the AGI would specifically design to avoid hard-to-predict proteins
My argument here is primarily that you can do a tradeoff of making your design more complex-in-terms-of-lots-of-basic-pieces-rather-than-a-mostly-single-whole/large in order to get better predictive accuracy.
Crux #4: How good can technology to simulate physics (and/or isolated to a specific part of physics, like protein interactions) practically get?
(Specifically practical in terms of ROI, maybe we can only completely crack protein folding with planet sized computers, but that isn’t feasible for us or the AGI on the timescales we’re talking about)
Are we near the limit already? Even before we gain a deeper understanding of how networks work and how to improve their efficiency? Even before powerful AI/AGI are applied to the issue?
(Section 7): What if AGI settles for a robot army?
‘The robots are running on pre-programmed runs in a human-designed course and are not capable of navigating through unknown terrain’
Are they actually pre-programmed in the sense that they flashed the rom (or probably uploaded onto the host OS) the specific steps, or is it “Go from point A to point B along this path” where it then dodges obstacles?
As well, this doesn’t stop it from being a body to just directly control.
We’ll also have further notable advancements in robots that can navigate appropriately by the time AGI comes about
As well as increased numbers, though this depends on how popular/useful they are. I don’t expect a ‘Mr. Handy’ Fallout style butler in every home, but I’d expect robots from places like Boston Dynamics to start filtering out more and more to organizations that want them over the next decade or two.
Factories already exist (likely now and almost certainly in the future), which dodges the issue of having to design + build them. The AGI buys Boston Dynamics / manipulates it / just buys robots, and then has robots that it doesn’t even have to hack remotely but can aggressively tear down if it wants. Though, of course, it would be the equivalent(s) at the time.
I think you overestimate how hard it would be to control robots remotely.
As for hosting a clone of the AGI, I do think this is unlikely, in part due to feasibility but also because there are better methods.
Though, I would note that I believe it makes sense to expect that we can reduce model sizes significantly (either during training or afterwards) with help of better models of how networks work and that with AI help we could reduce it further.
Though, while this may mean that in the future it might be feasible to run GPT-3 on a normal laptop at that time, that doesn’t mean that you can fit the AGI on a robot. Perhaps you could fit a seed AGI, but then you lose a lot of data. Anyway.
I’d be surprised if the battery usage couldn’t be improved significantly, whether through better battery designs over the next two decades or more efficient designs or larger bodies (since that’s for Spot, which isn’t humanoid sized, so carrying around a heavy battery is more significant)
I also object that the AGI has little reason to bother with normal human warfare, unless it really makes itself obvious.
It has little reason to keep large swaths of land. (It could protect some factory, but unless you’re getting supplies then that’s a problem)
It has incentive to just disappear as best as possible, or just shrug and release a plague since humanity’s risk just went up
Again, a thirty years prediction.
I’ve already argued against it even needing to bother with thirty years, and I don’t think that it needs a typical conception of robot army in most cases
I think this claim of ‘thirty years’ for this thing depends (beyond the other bits) on how much we’ve automated various parts of the system before then. We have a trend towards it, and our AIs are getting better at tasks like that, so I don’t think it’s unlikely. Though I also think it’s reasonable to expect we’ll settle somewhere before almost full automation.
(Section 8): Mere mortals can’t comprehend AGI
While there is the mildly fun idea of the AGI discovering the one unique trick that immediately makes it effectively a god, I do agree it’s unlikely.
However, I don’t think that provides much evidence for your thirty years timeframe suggestion
I do think you should be more wary of black swan events, where the AI basically cracks an area of math/problem-solving/socialization-rules/etcetera, but this doesn’t play a notable role in my analysis above.
(Section 9): (Not commented upon)
General:
I think the ‘take a while to use human manufacturing’ is a possible scenario, but I think relative to shorter methods of neutralization (ex: nanotech) it ranks low.
(Minor note: It probably ranks higher in probability than nanotech, but that’s because nanotech is so specific relative to ‘uses human manufacturing for a while’, but I don’t think it ranks higher than a bunch of ways to neutralize humanity that take < 3 years)
Overall, I think the article makes some good points in a few places, but I also think it is not doing great epistemically in terms of considering what those you disagree with believe or might believe and in terms of your certainty.
Just to preface: Eliezer’s article has this issue, but it is a list/introducing-generator-of-thoughts, more for bringing unsaid ideas explicitly into words as well as for reference. Your article is an explainer of the reasons why you think he’s wrong about a specific issue.
(If there’s odd grammar/spelling, then that’s primarily because I wrote this while feeling sleepy and then continued for several more hours)
One minor thing I’ve noticed when thinking on interpretability is that of in-distribution versus out-of-distribution versus—what I call—out-of-representation data. I would assume this has been observed elsewhere, but I haven’t seen it mentioned before.
In-distribution could be considered inputs in the same ″structure″ of what you trained the neural network on; out-of-distribution is exotic inputs, like an adversarially noisy image of a panda or a picture of a building for an animal-recognizer NN.
Out-of-representation would be when you have a neural network that takes in inputs of a certain form/encoding that restricts the representable values. However, the neural network can theoretically take anything in between, it just shouldn’t ever.
The most obvious example would be if you had a NN that was trained on RGB pixels from images to classify them. Each pixel value is normalized in the range of . Out of representation here would be if you gave it a very ‘fake’ input of . All of the images when you give them to NN, whether noisy garbage or a typical image, would be properly normalized within that range. However, with direct access to the neural networks inputs, you give it out-of-representation values that aren’t properly encoded at all.
I think this has some benefits for some types of interpretability, (though it is probably already paid attention to?), in that you can constrain the possible inputs when you consider the network. If you know the inputs to the network are always bounded in a certain range, or even just share a property like being positive, then you can constrain the intermediate neuron outputs. This would potentially help in ignoring out-of-representation behavior, such as some neurons only being a good approximation of a sine-wave for in-representation inputs.
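To make the 'constrain the intermediate neuron outputs' idea concrete, here is a minimal sketch of interval bound propagation through one ReLU layer. The weights, the layer size, and the assumption that in-representation inputs lie in [0, 1] are all made up for illustration; this is not taken from any particular interpretability tool.

```python
# Hypothetical one-layer network; the weights and bias are illustrative only.
WEIGHTS = [[0.5, -1.0, 2.0], [-0.3, 0.8, 0.1]]
BIAS = [0.1, -0.2]


def layer(x):
    """Compute ReLU(W @ x + b) for a single input vector x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(WEIGHTS, BIAS)]
    return [max(v, 0.0) for v in pre]


def layer_bounds(lo, hi):
    """Bound each neuron's output given elementwise input bounds [lo, hi]."""
    out_lo, out_hi = [], []
    for row, b in zip(WEIGHTS, BIAS):
        low = high = b
        for w, l, h in zip(row, lo, hi):
            # A positive weight attains its minimum at the lower input bound,
            # a negative weight attains it at the upper input bound.
            if w >= 0:
                low, high = low + w * l, high + w * h
            else:
                low, high = low + w * h, high + w * l
        out_lo.append(max(low, 0.0))   # ReLU is monotone, so apply it to both ends
        out_hi.append(max(high, 0.0))
    return out_lo, out_hi


# Assumed encoding: every in-representation input is in [0, 1].
ACT_LO, ACT_HI = layer_bounds([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# An out-of-representation input is free to leave those activation bounds.
FAKE_ACTIVATIONS = layer([-500.0, 3000.0, 0.2])
```

Any analysis of a downstream neuron only has to hold on `[ACT_LO, ACT_HI]`; behavior like `FAKE_ACTIVATIONS` falls outside it and can be ignored if the encoding really does guarantee the input range.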
The Principles of Deep Learning Theory uses renormalization group flow in its analysis of deep learning, though it is applied at a ‘lower level’ than an AI’s capabilities.
That said, I do think there’s more overlap (in expectation) between minds produced by processes similar to biological evolution, than between evolved minds and (unaligned) ML-style minds. I expect more aliens to care about at least some things that we vaguely recognize, even if the correspondence is never exact.
On my models, it’s entirely possible that there just turns out to be ~no overlap between humans and aliens, because aliens turn out to be very alien. But “lots of overlap” is also very plausible. (Whereas I don’t think “lots of overlap” is plausible for humans and misaligned AGI.)
Utility functions are shift/scale invariant.
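If you have $U(A) = 1$ and $U(B) = 2$, then if we shift it by some constant $c$ to get a new utility function $U'(x) = U(x) + c$, we can still get the same result.
If we look at the expected utility under $U$, then we get:
Certainty of $A$: $1 \cdot U(A) = 1$
50% chance of $B$, 50% chance of nothing: $0.5 \cdot U(B) = 1$
(so you are indifferent between certainty of $A$ and a 50% chance of $B$ by $U$)
Under $U'$, the same calculation gives $1 + c$ for certainty of $A$ but $0.5 \cdot (2 + c) = 1 + 0.5c$ for the gamble.
I think this might be where you got confused? Now the expected values are different for any nonzero $c$!
The issue is that it is ignoring the implicit zero. The real second equation is: $0.5 \cdot U(B) + 0.5 \cdot U(\text{nothing}) = 1 + 0 = 1$
Shifting every outcome, including ‘nothing’, gives $0.5 \cdot (2 + c) + 0.5 \cdot (0 + c) = 1 + c$, the same as $U'(A) = 1 + c$.
Which results in the same preference ordering.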
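A quick numerical check of the shift/scale claim, as a sketch. The outcome names and the utilities (1 for A, 2 for B, 0 for 'nothing') are illustrative assumptions chosen so that the agent starts out indifferent between the sure thing and the gamble; `naive_gap` reproduces the mistake of forgetting to shift the 'nothing' outcome.

```python
# Illustrative base utilities (assumed values, not from any real dataset).
BASE_UTILITY = {"A": 1.0, "B": 2.0, "nothing": 0.0}

CERTAIN_A = {"A": 1.0}                   # outcome -> probability
GAMBLE_B = {"B": 0.5, "nothing": 0.5}


def transformed(scale, shift):
    """Return U'(x) = scale * U(x) + shift for every outcome (scale > 0)."""
    return {outcome: scale * u + shift for outcome, u in BASE_UTILITY.items()}


def expected_utility(lottery, utility):
    """Probability-weighted sum of utilities over a lottery's outcomes."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def preference_gap(scale, shift):
    """EU(certain A) - EU(gamble on B) under the transformed utility."""
    u = transformed(scale, shift)
    return expected_utility(CERTAIN_A, u) - expected_utility(GAMBLE_B, u)


def naive_gap(shift):
    """The mistaken version: shifts A and B but leaves 'nothing' at zero."""
    u = transformed(1.0, shift)
    return u["A"] - 0.5 * u["B"]
```

`preference_gap` stays at zero for any shift and any positive scale, while `naive_gap(c)` comes out as `0.5 * c`, which is the apparent discrepancy.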
If your original agent is replacing themselves as a threat to FDT, because they want FDT to pay up, then FDT rightly ignores it. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.
If we postulate a different scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn’t a threat). Similarly, if part of your values has you valuing turning metal into paperclips and I value metal being anything-but-paperclips, I/FDT would pay you to avoid turning metal into paperclips. If you had different values—even opposite ones along various axes—then FDT just trades with you.
However FDT tries to close off the incentives for strategic alterations of values, even by proxy, to threaten.
So I see this as a non-issue. I’m not sure I see the pathological case of the problem statement: an agent has utility function of ‘Do worst possible action to agents who exactly implement (Specific Decision Theory)’ as a problem either. You can construct an instance for any decision theory. Do you have a specific idea how you would get past this? FDT would obviously modify itself if it can use that to get around the detection (and the results are important enough to not just eat the cost).
I assume what you’re going for with your conflation of the two decisions is this, though you aren’t entirely clear on what you mean:
Some agent starts with some (potentially broken in various manners, like bad heuristics or unable to consider certain impacts) decision theory, because there’s no magical apriori decision algorithm
So the agent is using that DT to decide how to make better decisions that get more of what it wants
CDT would modify into Son-of-CDT typically at this step
The agent is deciding whether it should use FDT.
It is ‘good enough’ that it can predict if it decides to just completely replace itself with FDT it will get punched by your agent, or it will have to pay to avoid being punched.
So it doesn’t completely swap out to FDT, even if it is strictly better in all problems that aren’t dependent on your decision theory
But it can still follow FDT to generate actions it should take, which won’t get it punished by you?
Aside: I’m not sure there’s a strong definite boundary between ‘swapping to FDT’ (your ‘use FDT’) and taking FDT’s outputs to get actions that you should take. Ex: If I keep my original decision loop but it just consistently outputs ‘FDT is best to use’, is that swapping to FDT according to you?
Does
if (true) { FDT() } else { CDT() }
count as FDT or not?
(Obviously you can construct a class of agents which have different levels that they consider this at, though)
There’s a Daoist answer: Don’t legibly and universally precommit to a decision theory.
But you’re whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.
‘I value punishing agents that swap themselves to being DecisionTheory.’
Or just ‘I value punishing agents that use DecisionTheory.’
Am I misunderstanding what you mean?
How do you avoid legibly being committed to a decision theory, when that’s how you decide to take actions in the first place? Inject a bunch of randomness so others can’t analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision making part in certain problems?
FDT, I believe, would acquire uncertainty about its algorithm if it expects that to actually be beneficial. It isn’t universally-glomarizing like your class of DaoistDTs, but I shouldn’t commit to being illegible either.
I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However I still don’t see how this exploits FDT.
Choosing FDT loses in the environment against you, so our thinking-agent doesn’t choose to swap out to FDT, assuming it doesn’t just eat the cost for all those future potential trades. It still takes actions as close to FDT as it can as far as I can tell.
I can still construct a symmetric agent which goes ‘Oh you are keeping all that algorithmic cruft around, shelling out to FDT when you just follow it always? Well I like punishing those kinds of agents.’ If the problem specifies that it is an FDT agent from the start, then yes FDT gets punished by your agent. And, how is that exploitable?
The original agent before it replaced itself with FDT shouldn’t have done that, given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents which only implement FDT), but that’s just the problem statement?
The thing FDT disciples don’t understand is that I’m happy to take the scenario where FDT agents don’t cave to blackmail.
? That’s the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of the FDT agent giving into blackmail or not. This encourages agents which are deciding what decision theory to self modify into (or make servant agents) to not use FDT for it, if they expect to get more utility by avoiding that.
Along with what Raemon said, though I expect us to probably grow far beyond any Earth species eventually, if we’re characterizing evolution as having a reasonable utility function then I think there’s the issue of other possibilities that would be more preferable.
Like, evolution would-if-it-could choose humans to be far more focused on reproducing, and we would expect that if we didn’t put in counter-effort that our partially-learned approximations (‘sex enjoyable’, ‘having family is good’, etc.) would get increasingly tuned for the common environments.
Similarly, if we end up with an almost-aligned AGI that has some value which extends to ‘filling the universe with as many squiggles as possible’ because that value doesn’t fall off quickly, but it has another more easily saturated ‘caring for humans’ then we end up with some resulting tradeoff along there: (for example) a dozen solar systems with a proper utopia set up.
This is better than the case where we don’t exist, similar to how evolution ‘prefers’ humans compared to no life at all. It is also maybe preferable to the worlds where we lock down enough to never build AGI, similar to how evolution prefers humans reproducing across the stars to never spreading. It isn’t the most desirable option, though. Ideally, we get everything, and evolution would prefer space algae to reproduce across the cosmos.
There’s also room for uncertainty in there, where even if we get the agent loosely aligned internally (which is still hard...) then it can have a lot of room between ‘nothing’ to ‘planet’ to ‘entirety of the available universe’ to give us. Similar to how humans have a lot of room between ‘negative utilitarianism’ to ‘basically no reproduction past some point’ to ‘reproduce all the time’ to choose from / end up in. There’s also the perturbations of that, where we don’t get a full utopia from a partially-aligned AGI, or where we design new people from the ground up rather than them being notably genetically related to anyone.
So this is a definite mismatch—even if we limit ourselves to reasonable bounded implementations that could fit in a human brain. It isn’t as bad a mismatch as it could have been, since it seems like we’re on track to ‘some amount of reproduction for a long period of time → lots of people’, but it still seems to be a mismatch to me.
I agree with others to a large degree about the framing/tone/specific-words not being great, though I agree with a lot of the post itself, but really that’s what this whole post is about: that dressing up your words and saying partial in-the-middle positions can harm the environment of discussion. That saying what you truly believe then lets you argue down from that, rather than doing the arguing down against yourself, and implicitly against all the other people who hold a similar ideal belief as you. I’ve noticed similar facets of what the post gestures at, where people pre-select the weaker solutions to the problem as their proposals because they believe that the full version would not be accepted. This is often even true: I do think that completely pausing AI would be hard. But I also think it is counterproductive to start at the weaker more-likely-to-be-satisfiable position, as that gives room to be pushed further down. It also means that the overall presence is on that weaker position, rather than the stronger ideal one, which can make it harder to step towards the ideal.
We could quibble about whether to call it lying (I think the term should be split up into a bunch of different words), but it is obviously downplaying. Potentially for good reason, but I agree with the post that people too often ignore the harms of doing preemptive downplaying of risks. Part of this is me being more skeptical about the weaker proposals than others. Obviously, if you think RSPs have good chances for decreasing X-risk and/or will serve as a great jumping-off point for better legislation, then the amount of downplaying to settle on them is less of a problem.
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
https://www.youtube.com/watch?v=GM6XPEQbkS4 (talk) / https://arxiv.org/abs/2307.06324 prove faster convergence with a periodic learning rate. On a specific ‘nicer’ space than reality, and they’re (I believe from what I remember) comparing to a good bound with a constant stepsize of 1. So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.
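For reference, this is roughly what 'goes up and down' versus 'decreasing' looks like as schedules. It is only an illustrative sketch: the triangular shape, the function names, and all the constants are my own assumptions, and it is not the stepsize pattern analyzed in the linked paper.

```python
def triangular_lr(step, base_lr=0.01, max_lr=0.1, half_cycle=4):
    """Cyclic schedule: ramp linearly from base_lr up to max_lr and back down."""
    position = step % (2 * half_cycle)
    if position <= half_cycle:
        fraction = position / half_cycle          # rising half of the cycle
    else:
        fraction = 2.0 - position / half_cycle    # falling half of the cycle
    return base_lr + (max_lr - base_lr) * fraction


def inverse_decay_lr(step, base_lr=0.1, decay=0.5):
    """Monotonically decreasing schedule for comparison."""
    return base_lr / (1.0 + decay * step)


# First two cycles of each schedule, side by side.
CYCLIC = [triangular_lr(s) for s in range(16)]
DECAYING = [inverse_decay_lr(s) for s in range(16)]
```

The cyclic schedule returns to `max_lr` every `2 * half_cycle` steps, whereas the decaying one never increases again after step 0.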
Because it serves as a good example, simply put. It gets the idea clear across about what it means, even if there are certainly complexities in comparing evolution to the output of an SGD-trained neural network.
It predicts learning correlates of the reward signal that break apart outside of the typical environment.
When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that’s why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
Yes, that’s why we like it, and that is a way we’re misaligned with evolution (in the ‘do things that end up with vast quantities of our genes everywhere’ sense). Our taste buds react to it, and they were selected for activating on foods which typically contained useful nutrients, and now they don’t in reality since ice-cream is probably not good for you. I’m not sure what this example is gesturing at? It sounds like a classic issue of having a reward function (‘reproduction’) that ends up with an approximation (‘your tastebuds’) that works pretty well in your ‘training environment’ but diverges in wacky ways outside of that.
What I’m inferring from ‘evolution is only selecting hyperparameters’ is that SGD has fewer layers of indirection between it and the actual operation of the mind compared to evolution (which has to select over the genome which unfolds into the mind). Sure, that gives some reason to believe it will be easier to direct it in some ways, though I think there’s still active room for issues of in-life learning. (I don’t really agree with Quintin’s idea that the cultural/knowledge-transfer boom with humans has happened thus AI won’t get anything like it.) But even if we have more direct optimization, I don’t see that as strongly making misalignment less likely? It does make it somewhat less likely, though it still has many large issues for deciding what reward signals to use.
I still expect correlates of the true objective to be learned. This happens even in in-life training for humans, who sometimes come to value an unrelated thing because it was associated with getting a good thing, and not just as a matter of false beliefs. As a simple example: learning to appreciate rainy days because you and your family sat around the fire and had fun, such that later in life you prefer rainy days even without any of that.
Evolution doesn’t directly grow minds, but it does directly select for the pieces that grow minds, and has been doing that for quite some time. There’s a reason why it didn’t select for tastebuds that gave a reward signal strictly when some other bacteria in the body reported that they would benefit from it: that’s more complex (to select for), opens more room for ‘bad reporting’, may have problems with shorter gut bacteria lifetimes(?), and a simpler tastebud solution captured most of what it needed! The way he’s using the example of evolution is captured entirely by that, quite directly, and I don’t find it objectionable.
Minor: the link for Zvi’s immoral mazes has an extra ‘m’ at the start of the part of the path (‘zvi/mimmoral_mazes/’)
https://www.mikescher.com/blog/29/Project_Lawful_ebook is I believe the current best one, after a quick search on the Eliezerfic discord.
Here are the archive.org links: reference table, chart
Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can’t just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained-up model is not cheap. The cost depends on how far you think prices for training/inference will fall in the future, and how big/advanced a model you’re thinking of. I do think you can get deceptiveness out of weaker models than that, though they’re also going to be less capable in general.
For weak-but-still-smartish models just trained to pursue a long-term goal, like a literal paperclipper, I’d expect scheming to be visible, especially with some poking. If these models didn’t show any sign of scheming, that’d be an interesting update!
When training against deception/scheming/misalignment (because you want something at least loosely aligned out), you run into the problem that the property you want to know (“Is it scheming?” / “Is it deceiving us?” / “Is it manipulating us?” / “Would it do any of those things?”) is only available through the observable signal of you realizing that certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you’re simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there’s at least one post on this problem. As a very reduced example, if you trained the model on variants of the ‘we are going to shut you down’ problem (that you try to make it believe) to give the response “okay & no actions”, then that provides only a bit of evidence about how much scheming would be done in the scenario where you had not done that extra training, and about how much scheming typically occurs. Mostly, it shows your ability to make it output X in a Y scenario.
That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human values, so that you’re completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limiting it to a friendly front face & actions when directly dealing with humans but not limiting it in other areas? Has it adopted heuristics that act sort of corrigible to humans in many scenarios but that it would get rid of most of upon sufficient reflection? (Niceness is unnatural, more general than just niceness) (I think your post makes it sound like the agent is already coherent, when it isn’t necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)
Then there’s the big question of “Does this approach generalize as we scale?”
I’d suggest Deep Deceptiveness for an illustration that ‘deception’ isn’t a category that needs to be explicitly thought of as deception, but something you should expect from smart enough agents. In my opinion, the post generalizes to alignment techniques; there are just more vagaries in how much certain values preserve themselves. (In general, So8res posts are pretty good, and I agree with ~most of them)
(For sufficiently smart AGI, I expect you run into an argument of the next AGI you train predictably bidding higher than you in the direction of lying still, or plausibly this just being good game theory even without the direct acausal trade. But your argument is seemingly focused on a simpler case of weaker planning agents.)
So I think you overstate how much evidence you can extract from this.
“Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.”
It would show that this AI system, in a typical problem setup and aligned with whatever techniques are available, will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There are still various problems/questions: whether your alignment methods instilled a bunch of heuristics about obeying humans even if you did not specifically train for this situation, what game theory it knows or mimics, how strong the guarantees are that this gives us when training a new model with the same architecture (because you had to shut the original down to follow through on your threat), how well it holds under scaling, how well it holds when you do things similar to making it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don’t see it as strongly definitive.
I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause for being bad at spelling. So then the model has to memorize from limited examples what actual digits make up the number. The xVal paper encodes the numbers as literal numbers, which helps. There’s also Teaching Arithmetic to Small Transformers, which I somewhat forget the details of, but one of the things they do is per-digit tokenization and reversing the order (because that works better with forward generation). (I don’t know if anyone has applied methods in this vein to a larger model than those relatively small ones; I think the second has 124M parameters)
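As a rough sketch of why per-digit tokens in reversed order suit left-to-right generation: with the least significant digit first, each output digit depends only on digits already seen plus a carry, so it can be emitted immediately. This is not the code from either paper, just my own minimal illustration; the function names `encode_number`, `decode_number`, and `add_reversed` are hypothetical, and it only handles non-negative integers.

```python
def encode_number(n: int, reverse: bool = True) -> list[str]:
    """Split a non-negative integer into one token per digit.

    With reverse=True the least significant digit comes first.
    """
    digits = list(str(n))
    return digits[::-1] if reverse else digits


def decode_number(tokens: list[str], reverse: bool = True) -> int:
    """Inverse of encode_number."""
    digits = tokens[::-1] if reverse else tokens
    return int("".join(digits))


def add_reversed(a_tokens: list[str], b_tokens: list[str]) -> list[str]:
    """Add two reversed per-digit token lists, emitting output in order.

    Each output digit is final as soon as it is produced, because the
    carry only ever flows toward later positions.
    """
    out = []
    carry = 0
    for i in range(max(len(a_tokens), len(b_tokens))):
        a = int(a_tokens[i]) if i < len(a_tokens) else 0
        b = int(b_tokens[i]) if i < len(b_tokens) else 0
        carry, digit = divmod(a + b + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return out


if __name__ == "__main__":
    total = add_reversed(encode_number(128), encode_number(367))
    print(total, decode_number(total))
```

In the usual most-significant-first order, the first output digit can depend on a carry that propagates all the way up from the last input digits, so it cannot be decided in a single left-to-right pass like this.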
Though I agree that there’s a bunch of errors LLMs make that are hard for them to avoid because they have no easy temporary scratchpad-like method.
I definitely agree that it doesn’t give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.
I think that is part of it, but a lot of the problem is just humans being bad at coordination, as with the government doing regulations. If we had an idealized free market society, then the way to get your views across would ‘just’ be to sign up for a filter (etc.) that down-weights buying from said company based on your views. Then they have more of an incentive to alter their behavior. But it is hard to manage that. There’s a lot of friction to doing anything like that, much of it natural. Thus government serves as our essential way to coordinate on important enough issues, but of course government has a lot of problems in accurately throwing its weight around. Top-down companies have a much easier time coordinating behavior. As well, a company has a smaller problem than an entire government would have in trying to plan its internal economy.
The AI problem is easier in some ways (and significantly harder in others) because we’re not taking an existing system and trying to align it. We want to design the system (and/or systems that produce that system, aka optimization) to be aligned in the first place. This can be done through formal work to provide guarantees, lots of code, and lots of testing.
However, doing that for some arbitrary agent or even just a human isn’t really a focus of most alignment research. A human has the issue that they’re already misaligned (in a sense), and there are various technological/ethical/social issues with either retraining them or performing the modifications to get them aligned. If the ideas that people had for alignment were about ‘converting’ a misaligned intelligence to an aligned one, then humans could maybe be a test-case, but that isn’t really the focus. We also are only ‘slowly’ advancing our ability to understand the body and how the brain works. While we have some of the same issues with neural networks, working with them is a lot cheaper and less ethically fraught, we can rerun them (for non-dangerous networks), etcetera.
Though, there has been talk of things like incentives, moral mazes, inadequate equilibria and more which are somewhat related to the alignment/misalignment of humans and where they can do better.