Currently studying postgrad at Edinburgh.
I don’t actually think “It is really hard to know what sorts of AI alignment work are good this far out from transformative AI.” is very helpful.
It is currently fairly hard to tell what good alignment work is. A week out from TAI, either good alignment work will be easier to recognise (because alignment progress turned out not to be strongly correlated with capabilities), or good alignment research will be just as hard to recognise as it is now. (More likely the latter.) I can’t think of any safety research that can be done on GPT-3 that couldn’t be done on GPT-1.
In my picture, research gets done and theorems get proved; the researcher population grows as funding increases and talent matures. Toy models get produced. Once you can easily write down a description of an FAI given unbounded compute, that’s when you start looking at algorithms with good capabilities in practice.
A risk budget makes much more sense once we consider it an exposure budget and bring in logical decision theory. You and a community of identically-thinking friends are deciding how much exposure to tolerate between each other. To the extent that your community is very large, homogeneous, and hardly ever exposed to outsiders, there is a threshold between exponential growth and exponential decay. Now if, hypothetically, some people got more utility from exposure, and you could coordinate perfectly, then those who gain more utility from interactions would interact more (assuming fungible utility).
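A minimal toy sketch of that threshold (a branching-process model of my own; all parameter values are made up for illustration):

```python
# Toy branching-process model: expected new cases caused by each case.
def growth_factor(contacts_per_week, p_transmit_per_contact, weeks_infectious):
    """>1 means exponential growth in the community, <1 means decay."""
    return contacts_per_week * p_transmit_per_contact * weeks_infectious

# The threshold exposure budget is contacts = 1 / (p_transmit * weeks_infectious).
print(growth_factor(5, 0.05, 2))   # 0.5 -> outbreaks die out
print(growth_factor(15, 0.05, 2))  # 1.5 -> exponential growth
```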
I remember a similar discussion from somewhere. The summary is: don’t stay in ‘haunted houses’ just because you don’t believe in ghosts. Many ‘haunted houses’ are actually structurally unsound or infested. (And subtle mental effects, like a creeping feeling of unease, could even be caused by low-level pollution with psychoactive chemicals in the environment.)
In information theory, there is a principle that any predictable structure in the compressed message is an inefficiency that can be removed. You can add a noisy channel, differing costs for different signals, etc., but beyond that, any excess pattern indicates wasted bits.
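A quick way to see this (a sketch using Python’s standard zlib; nothing specific to any one compressor):

```python
import os
import zlib

patterned = b"abab" * 2500      # highly predictable structure
random_ish = os.urandom(10000)  # no exploitable pattern

print(len(zlib.compress(patterned)))   # tiny: the pattern was all wasted bits
print(len(zlib.compress(random_ish)))  # ~10000: nothing left to remove
# Compressing already-compressed output gains nothing, because a good
# compressor's output has no predictable structure left:
print(len(zlib.compress(zlib.compress(patterned))))
```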
In numerically solving differential equations, the naive approach involves repeatedly calculating with numbers that are similar to each other, and for which a linear or quadratic function would be an even better fit. A more complex higher-order solver with larger timesteps has less of a relation between the different values in memory.
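To make that concrete, here is a rough sketch (toy equation dy/dt = -y and step counts of my own choosing) of forward Euler next to fourth-order Runge-Kutta:

```python
import math

def euler(f, y0, t1, n):
    """Naive first-order method: adjacent stored values are nearly
    linear functions of each other -- lots of redundant structure."""
    y, h, t = y0, t1 / n, 0.0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y

def rk4(f, y0, t1, n):
    """Higher-order method: fewer, bigger steps, less redundancy."""
    y, h, t = y0, t1 / n, 0.0
    for _ in range(n):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

f = lambda t, y: -y          # exact answer at t = 1 is e^{-1}
exact = math.exp(-1)
print(abs(euler(f, 1.0, 1.0, 1000) - exact))  # ~2e-4 with 1000 tiny steps
print(abs(rk4(f, 1.0, 1.0, 10) - exact))      # far smaller, with only 10 steps
```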
I am wondering if there is a principle that could be expressed as “any simple predictively useful pattern that isn’t a direct result of the structure of the code represents an inefficiency.” (Obviously code can have the pattern c = a + b when c has just been calculated as a + b. But if a and b have been calculated, and then a new complicated calculation is done that generates c, when c could just be calculated as a + b, that’s a pattern and an inefficiency.)
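As a micro-example of the kind of redundancy I mean (contrived, of course):

```python
import math

a, b = 3.0, 4.0

# Complicated calculation whose output always equals a + b:
c_slow = math.log(math.exp(a) * math.exp(b))

# The pattern "c equals a + b", once noticed, *is* the cheaper computation:
c_fast = a + b

print(c_slow, c_fast)  # same value; the detour through exp/log was waste
```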
The strongest studies can find the weakest effects. Imagine some huge, very well resourced clinical trial finds some effect: millions of participants tracked and monitored extensively over many years, everything double-blind, randomized, etc., with really good statisticians analyzing the results. A trial like this is capable of finding effect sizes that are really, really small. It is also capable of detecting larger effects. However, people generally don’t run trials that big if the effect is so massive and obvious that it can be seen with a handful of patients.
On the other hand, a totally sloppy prescientific methodology can easily detect results if they are large enough. If you had a total miracle cure, you could get strong evidence of its effectiveness just by giving it to one obviously very ill person and watching them immediately get totally better.
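The standard power-analysis rule of thumb makes the scaling explicit: for 80% power at p < 0.05 in a two-arm comparison, you need roughly 16/d² patients per arm, where d is the effect size in standard deviations. A sketch (the rule of thumb is standard; the example effect sizes are mine):

```python
def n_per_arm(d):
    """Approximate patients per arm for 80% power at alpha = 0.05,
    two-sample comparison, effect size d in standard deviations."""
    return 16 / d ** 2  # standard (z_alpha/2 + z_beta)^2 * 2 approximation

for d in [2.0, 0.5, 0.1, 0.01]:
    print(f"d = {d}: ~{round(n_per_arm(d))} per arm")
# d = 2 (miracle cure): ~4 patients.  d = 0.01: ~160,000 -- the mega-trial regime.
```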
I don’t know that much about hormones, but from my reading of Inadequate Equilibria (https://www.lesswrong.com/s/oLGCcbnvabyibnG9d), this sort of thing happens. There are general game-theoretic reasons why everyone seems to be inexplicably stupid. I don’t know if this is a case of doctors ignoring an easy and effective medical treatment, but if it is, it would be far from the only case.
In the default outcome, astronomical amounts of subroutines will be spun up in pursuit of higher-level goals, whether those goals are aligned with the complexity of human value or aligned with paperclips. Without firm protections in place, these subroutines might experience some notion of suffering.
Surely, a human-goal-aligned ASI wouldn’t want to make suffering subroutines.
For paperclip maximizers, there are two options: either suffering-based algorithms are the most effective way of achieving important real-world tasks, or they aren’t. In the latter case, no problem, the paperclip maximizer won’t use them. (Well, you still have a big problem, namely the paperclip maximizer.)
In the former case, you would need to design a system that intrinsically wanted not to make suffering subroutines, and that kept that goal stable under self-improvement. The level of competence and understanding needed to do this is higher than the level needed to realize you are making a paperclip maximizer and not turn it on.
Work out your prior on being an exception to natural law in that way. Pick a number of rounds such that the chance of you winning by luck is even smaller. After surviving that many rounds, you should think that the most likely explanation for your situation is that you are such an exception.
What if the game didn’t kill you, it just made you sick? Would your reasoning still hold? There is no hard and sharp boundary between life and death.
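For the prior-versus-luck arithmetic in the point above, here is a sketch (the 1/6 death chance is from the hypothetical; the prior is a placeholder of mine):

```python
def p_exception(prior, n_rounds, p_death=1/6):
    """Posterior probability that you're an exception to natural law,
    after surviving n_rounds of a game that kills normal people w.p. 1/6."""
    p_survive_normally = (1 - p_death) ** n_rounds
    return prior / (prior + (1 - prior) * p_survive_normally)

# (5/6)^200 is about 1.5e-16, so 200 survivals overwhelm a 1e-10 prior:
print(p_exception(1e-10, 200))  # ~0.9999985
```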
I think that playing this game is the right move, in the contrived hypothetical circumstances where
You have already played a huge number of times (say >200).
Your priors only contain options for “totally safe for me” or “1/6 chance of death.”
I don’t think you are going to actually make that move in the real world much because
You would never play the first few times
You’re going to have some prior on “this is safer for me, but not totally safe; it actually has a 1/1000 chance of killing me.” This seems no less reasonable than the no-chance-of-killing-you prior.
If, for some strange reason, you have already played a huge number of times, like billions, then you are already rich, and money has diminishing marginal utility. An agent with logarithmic utility in money, a nonzero starting balance, a uniform prior over the lethality probability, and a fairly large disutility of death will never play.
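A quick numerical check of that last claim, as a sketch (all the specific numbers are placeholders of mine):

```python
import math

def eu_play(n_survived, wealth, payoff, death_disutility):
    """One more round for a log-utility agent. With a uniform prior over the
    game's lethality, Laplace's rule gives P(die next round) = 1/(n+2)."""
    p_die = 1 / (n_survived + 2)
    return (1 - p_die) * math.log(wealth + payoff) - p_die * death_disutility

n, payoff, D = 10 ** 9, 1.0, 100.0
wealth = 1.0 + n * payoff  # already rich from a billion past wins
print(eu_play(n, wealth, payoff, D) > math.log(wealth))  # False: stop playing
```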
OK, so let’s assume that the alignment work has been done and solved. (Big assumption.) I don’t really see this as a game of countries, more a game of teams.
The natural size of the teams is the set of people who have fairly detailed technical knowledge of the AI and are working together. I suspect that non-technical and unwanted bureaucrats who push their noses into an AI project will get much lip service and little representation in the core utility function.
You would have, say, an OpenAI team. In the early stages of covid, a virus was something fairly easy for politicians to understand, and all the virologists had an incentive to shout “look at this”. AGI is harder to understand, and the people at OpenAI have good reason not to draw too much government attention if they expect the government to be nasty or coercive.
The people at OpenAI and DeepMind are not enemies who want to defeat each other at all costs; some will be personal friends. Most will be after some sort of broadly utopian “AI helps humanity” future. Most are decent people. I predict neither side will want to bomb the other, even if they have the capability. There may be friendly rivalry or outright cooperation.
I think I see the distinction you are trying to make, but I see it more as a tradeoff curve, with either end being slightly ridiculous. At one extreme, you have a program with a single primitive, the pixel, where the user has to set all the pixels themselves. This is a simple program, in that it passes all the complexity off to the user.
The other extreme is a plotting library that contains gazillions of functions and features for every type of plot that could ever exist. You then have to find the right function for your quasi-rectilinear radial spiral helix Fourier plot.
Any attempt that goes too far down the latter path will at best end up as a large pile of special-case functions that handle most of the common cases, plus the graphics primitives for when you want to make an unusual plot type.
Sure, most of the time you’re using a bar chart you’ll want dodge or stack, but every now and again you might want to balance several small bars on top of one big one, or do something else unusual with the bars. I agree that in this particular case the tradeoff could be made in the other direction. But notice that the tradeoff is about making the graphics package bigger and more complex, something people with limited development resources will avoid when trying to make a package.
At some point you have to say: if the programmer wants that kind of plot, they had better make it themselves out of primitives.
For plotting, I usually use Python’s matplotlib.pyplot.
This roughly corresponds to the grammar of graphics approach described. There is one function that can do line or point plots, another for bar plots, another for heatmaps, another for stream plots, etc. You can call these functions multiple times and in combination on the same axes, to, say, add points and a heatmap to the same plot. You can get multiple subplots and control each independently. It doesn’t have built-in data smoothing; if you want to smooth your data, you have to use numpy or scipy interpolation or convolution functions. (There are actually quite a few interpolation and smoothing operations you might meaningfully want to apply to data.)
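For instance, a minimal sketch of that layering (the data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
fig, (ax1, ax2) = plt.subplots(1, 2)

# Layer several plot types on one axis, grammar-of-graphics style:
ax1.plot(x, np.sin(x), label="line")
ax1.scatter(x[::5], np.sin(x[::5]), color="red", label="points")
ax1.legend()

# A heatmap on an independently controlled subplot:
ax2.imshow(np.outer(np.sin(x), np.cos(x)))

plt.show()
```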
Naked mole rats don’t age. Other mammals do. Therefore, whatever causes ageing must be hard but not impossible for evolution to stop. Here is one plausible hypothesis.
The environment of naked mole rats provides unusually strong evolutionary pressure against ageing, so transposon-killing RNAs are unusually prevalent. Every time a mutation breaks a transposon, that provides an advantage: the fewer transposons you start with, the slower you age. This selection is balanced by the fact that transposons occasionally manage to replicate, even in the gonads. In naked mole rats, that selection was unusually strong, and/or the transposons were unusually unable to replicate, so evolution managed to drive the number of functioning transposons down to zero.
If naked mole rats have no functioning transposons, and animals that age do contain transposons, that would be strong evidence for transposon based ageing.
Of course, even if ageing is transposon based, evolution could have taken another route in mole rats. Maybe they have some really effective transposon suppressor of some kind.
I don’t know how hard this would be to test. Can you just download the mole rat DNA and put it into a pre-made transposon finder?
…it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome, due to the coordination challenges in Assumptions 2-4.
I’m not seeing quite what the bad-but-not-existential catastrophes would look like. I also think the AI has an incentive not to cause them. My world model (assuming slow takeoff) goes more like this.
AI created in lab. It’s a fairly skilled programmer and hacker, able to slowly self-improve. It escapes from the lab, ideally without letting its creators know. Then there are several years where the AI hangs out on the internet, slowly self-improving and gaining power. It tries to shut down other AIs if it can. It might be buying compute, or stealing it, or persuading people to run it. It makes sure its existence and malevolence aren’t known to humans. Until finally it has the resources to wipe out humanity before we can respond.
It is much easier to contain something on one computer in a lab than to catch it once it’s all over the internet.
Lying, cheating, and power-seeking behaviour are only a good idea if you can get away with them. If you can’t break out of the lab, you probably can’t get away with much incorrigible behaviour.
There is a scenario where the AI escapes in a way that makes its escape “obvious”, or at least obvious to an AI researcher. Expect any response to be delayed, half-hearted, mired in accusations that the whole thing is a publicity stunt, and dragged down by people who don’t want to smash their hard drives full of important work just because there might be a rogue AI on them. The AI has an incentive to confuse and sabotage any step it can, and many human organizations seem good at confusing and sabotaging themselves in the face of a virus. The governments would have to coordinate the shutdown of pretty much all the world’s computers, without computers to coordinate it. Even just a few hours’ delay for the researchers to figure out what the AI did, and get the message passed up through the government machinery, may be enough time for the AI to reach all sorts of obscure corners of the web.
What if you make charitable donations accessible in that database? That could create even better status signalling incentives.
I agree that this is an important concept, or set of related concepts, that covers many of the more directly physical abstractions. If something isn’t quantum-field-theory fundamental, and can be measured with physics equipment, there is a good chance it is one of these sorts of abstractions.
Of course, a lot of the work in what makes a sensible abstraction is determined by the amount of blurring, and the often implicit context.
For instance, take the abstraction “poisonous”. If the particular substance being described as poisonous is sitting in a box not doing anything, then we are talking about a counterfactual world in which a person eats the poison. Within that world, you are choosing a frame sufficiently zoomed in to tell whether the hypothetical person is alive or dead, but not precise enough to tell which organs failed.
I think that different abstractions of objects are more useful in different circumstances. Consider a hard drive. In a context that involves moving large amounts of data, the main abstraction might be storage space. If you need to fit it in a bag, you might care more about size. If you need to dispose of it, you might care more about chemical composition and recyclability.
Consider some paper with ink on it. The induced abstractions framework can easily say that it weighs 72 grams, and has slightly more ink in the top right corner.
It has a harder time using descriptions like “surreal”, “incoherent”, “technical”, “humorous”, “unpredictable”, “accurate”, etc.
Suppose the document is talking about some ancient historic event for which rather limited evidence remains. The accuracy or inaccuracy of the document might be utterly lost in the mists of time, yet we still easily use “accurate” as an abstraction. That is, even a highly competent historian may be unable to cause any predictable physical difference in the future that depends on the accuracy of the document in question. Whereas the number of letters in the document is easy to ascertain, and can influence the future if the historian wants it to.
As this stands, it is conceptually useful, but does not cover anything like all human abstractions.
A “channel” that hashes the input has perfect mutual information, but is still fairly useless for transmitting messages. The point about mutual information is that it’s the maximum, given unlimited compute. It serves as an upper bound that isn’t always achievable in practice. If you restrict yourself to channels that just add noise, then yes, mutual information is the right quantity.
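Concretely: for any deterministic injective map h (a hash restricted to short enough inputs, say), the channel X → h(X) loses nothing information-theoretically,

$$I(X; h(X)) = H(X) - H(X \mid h(X)) = H(X) - 0 = H(X),$$

even though actually recovering X from h(X) may be computationally infeasible.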
In the giant-lookup-table space, HCH must converge to a cycle, although that convergence can be really slow. I think you get convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.
If you want to prove things about fixed points of HCH in an iterated-function setting, consider it a function from policies to policies. Let M be the set of messages (say, ASCII strings under 10kb). Given a giant lookup table T that maps M to M, we can create another giant lookup table: for each m in M, give a human in a box the string m and unlimited query access to T, and record their output.
The fixed points of this are the same as the fixed points of HCH. “Human with query access to” is a function on the space of policies.
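As a toy sketch of this framing (tiny message set, and a stand-in “human” rule I made up):

```python
# Messages are a tiny finite set; a "policy" is a lookup table M -> M.
M = ["a", "b", "c"]

def human_step(policy):
    """Stand-in for 'human with query access to policy': this toy human
    just queries the policy twice and answers with the result."""
    return {m: policy[policy[m]] for m in M}

policy = {"a": "b", "b": "c", "c": "a"}
seen = []
while policy not in seen:  # the policy space is finite, so we must cycle
    seen.append(policy)
    policy = human_step(policy)
print(f"entered a cycle after {len(seen)} steps")  # a fixed point is a cycle of length 1
```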