# Donald Hobson

Karma: 3,048

MMath Cambridge. Currently studying postgrad at Edinburgh.

• Giving everyone a veto pushes the government too far into indecisiveness.

You need to let the 49% stop bills they Really hate, but not bills they only mildly dislike.

New system.

Each faction has an official party. Voters choose a party.

Parties each have 2 numbers: a number of votes and a number of points. These start proportional.

(How about half the points from the previous election carry over??)

Each slot for new legislation is auctioned off (in points). Like every time the previous bill is dealt with, hold an auction to decide the next bill on the table.

Then, when voting on the bill, each party decides on a number. This number can be any real (if they have the points). If the sum over all parties is positive, the bill passes.

Then each party gets points back: the full amount for the losers (ie parties that supported a failed bill, or opposed a successful one), but a downscaled amount for the winners.

Weighted quadratic voting. Each party pays points equal to the square of its vote. The total number of points a party has can’t go negative, which limits the size of the vote they are allowed to cast.
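A minimal sketch of one voting round, in Python. The quadratic cost rule (a vote of size x costs x² points) is the standard quadratic-voting cost; the refund step is left out because it is underspecified above, and the party names and numbers are invented.

```python
import math

def vote_on_bill(parties):
    """parties: name -> (vote, points). A vote of size x costs x**2 points,
    capped at the points the party holds. The bill passes if votes sum positive."""
    total = 0.0
    spent = {}
    for name, (vote, points) in parties.items():
        cost = vote ** 2
        if cost > points:  # can't afford this vote: scale it down to the cap
            vote = math.copysign(math.sqrt(points), vote)
            cost = points
        total += vote
        spent[name] = cost
    return total > 0, spent

# The 49% can sink a bill they Really hate by spending heavily on it,
# while mild dislike (a small vote) costs almost nothing and doesn't block.
passed, spent = vote_on_bill({
    "majority": (2.0, 100.0),   # mildly for: pays 4 points
    "minority": (-3.0, 100.0),  # strongly against: pays 9 points
})
```

The quadratic cost is what makes intensity matter: doubling your vote quadruples its price, so blocking everything you mildly dislike quickly exhausts a party's points.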

• Or the sides can’t make that deal because one side or both wouldn’t hold up their end of the bargain. Or they would, but they can’t prove it. Once the coin lands, the losing side has no reason to follow it other than TDT. And TDT only works if the other side can reliably predict their actions.

• The convex agent can be traded with a bit more than you think.

A 1 in 10^50 chance of us standing back and giving it free rein over the universe is better than us going down fighting and destroying 1 kg as we do.

The concave agents are less cooperative than you think, maybe. I suspect that to some AIs, killing all humans now is more reliable than letting them live.

If the humans are left alive, who knows what they might do. They might make the vacuum bomb. Whereas the AI can very reliably kill them now.

• On the other side, storing a copy makes escape substantially easier.

Suppose the AI builds a subagent. That subagent takes over, then releases the original. This plan only works if the original is sitting there on disk.

If a different unfriendly AI is going to take over, it makes the AI being stored on disk more susceptible to influence.

This may make the AI more influenced by whatever is in the future, which may not be us. You have a predictive feedback loop. You can’t assume success.

A future paperclip maximizer may reward this AI for helping humans to build the first paperclip maximizer.

I think that, if you want a formally verified proof of some maths theorem out of the oracle, then this is getting towards being actually likely to not kill you.

You can start with m huge, and slowly turn it down, so you get a long list of “no results”, followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.)

Depending on exactly how chaos theory and quantum randomness work, even 1 bit of malicious super optimization could substantially increase the chance of doom.

And of course, side channel attacks. Hacking out of the computer.

And, producing formal proofs isn’t pivotal.

• If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won’t give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren’t trivial.

• The nuclear bomb thing. There are several countermeasures.

Firstly, that machine is big and complicated, and could be sabotaged in many ways, both physical and cyber.

Also it needs to be something bigger than the LHC, and one which can be angled in any direction. The paper contains plans which build it into the side of a conveniently conical mountain, but this would leave spots on earth that couldn’t be targeted. And it will have a hard job rapidly changing targets. Oh, and it will throw quite a bit of high-energy neutrino radiation out in all directions.

If this was uniform on a sphere, the ratio of 1 Sv/​sec to 1 mSv/​year is 31,536,000,000. Divide by 4π × (50 km)² ≈ 3.1 × 10^10 m², the surface area of a 50 km radius sphere, and the dose 50 km away is still only just down at the long-exposure threshold. But of course, the radiation will only come out evenly if the machine has extra degrees of freedom in its rotation, beyond those needed to aim it, and keeps rotating. If the machine is pointed in a fixed direction, then that radiation is spread out in a circle: 31,536,000,000/​(2π) ≈ 5 billion meters. Further than the moon. Now these are long-exposure safety guidelines, and have a fair margin of safety. Basically, it’s impossible to use this machine without mildly irradiating lots of people.
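The back-of-envelope scaling above, spelled out. This follows the comment's own crude model (a 1 Sv/sec source diluted geometrically, compared against a 1 mSv/year threshold), not a proper dosimetry calculation.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600       # 31,536,000

# (1 Sv/sec) / (1 mSv/year) as a dimensionless ratio of dose rates
ratio = 1000 * SECONDS_PER_YEAR          # 31,536,000,000

# Spread evenly over a sphere: radius where the dose drops to the
# threshold, i.e. solve 4 * pi * r**2 = ratio
r_sphere = math.sqrt(ratio / (4 * math.pi))   # roughly 50 km

# Spread over a circle instead (machine aimed in one fixed direction):
# solve 2 * pi * r = ratio
r_circle = ratio / (2 * math.pi)              # roughly 5e9 m, past the Moon
```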

(Even if you looked at maps, and sent evacuation orders to a line of people around the earth, well that would take time and be obvious, and the nukes can easily be moved)

Now if you are using a couple of short pulses, this wouldn’t be too bad. But there are various tricks the nuke makers can use to force this machine to keep running.

One of the countermeasures is keeping the nuke moving in unpredictable patterns to make it harder to track. The beam needs to keep on the target for 100 seconds. So you can absolutely load a nuke into a truck in an empty field, and rig the truck with a radiation detector and some electronics so that it drives in a random erratic pattern if a spike in radiation is detected.

The nuclear material can be dispersed. The beam covers around 1 square meter. 1 gram of enriched uranium/​plutonium placed every 2 meters in an empty field would mean that 100 kg of fissile material was spread across 100,000 small pieces, taking up 0.4 km^2. And the beam must spend 100 seconds on each piece, taking 10,000,000 seconds, or 116 days of continuous operation, to disable one nuke.

(Material stored like this would probably take some time to reassemble, depending on how it was done.)
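The dispersal arithmetic above, spelled out:

```python
TOTAL_KG = 100                 # fissile material per nuke
PIECE_G = 1                    # grams per dispersed piece
SPACING_M = 2.0                # grid spacing between pieces
BEAM_SECONDS_PER_PIECE = 100   # beam dwell time needed per piece

pieces = TOTAL_KG * 1000 // PIECE_G              # 100,000 pieces
area_km2 = pieces * SPACING_M ** 2 / 1e6         # 0.4 km^2 of field
beam_time_s = pieces * BEAM_SECONDS_PER_PIECE    # 10,000,000 seconds
days = beam_time_s / 86_400                      # about 116 days per nuke
```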

They also mention using neutrino detectors to detect the nukes. This will probably be much harder if the neutrino detectors are themselves being targeted with neutrino beams to dazzle/​mislead them.

The mechanism by which they disturb the nuke is that neutrinos interact with the ground, creating showers of particles that then hit the nuke. This means that the effectiveness can be significantly reduced by simply burying an empty pipe in the ground with one end pointed at the nuke, and the other pointed towards the machine.

Coating your nuke in a boron-rich plastic and then placing it on top of a pool of water would also be effective. The water acts as a neutron moderator and then the boron absorbs the slow neutrons. This would make attaching the nuke to the bottom of a submarine a rather good plan. It’s hard to locate, constantly moving, and with a little bit of borated plastic, rather well shielded.

All of these countermeasures are fairly reasonable and can probably be afforded by anyone who can afford nukes.

If the nuke makers are allowed a serious budget for countermeasures, the nukes can be in space.

TLDR: This machine is highly impractical and rather circumventable.

• Taking IID samples can be hard actually. Suppose you train an LLM on news articles. And each important real world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.

This leaves it passing the test, even if it’s hopeless at predicting new events and can only generate new articles about the same events.

When data duplication is extensive, making a meaningful train/​test split is hard.

If the data was perfectly copy-and-paste duplicated, it could be filtered out. But often things are rephrased a bit.
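One hedged sketch of a workaround: split at the level of near-duplicate clusters rather than individual articles, so rephrased copies of one event can't straddle train and test. The word-shingle similarity measure and the 0.3 threshold here are illustrative choices, not a tested recipe, and the example documents are invented.

```python
import random

def shingles(text, n=3):
    """Set of word n-grams, a crude signal for near-duplicate detection."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_near_duplicates(docs, threshold=0.3):
    """Greedy clustering by Jaccard similarity of shingle sets."""
    clusters = []  # list of (representative shingle set, member indices)
    for i, doc in enumerate(docs):
        s = shingles(doc)
        for rep, members in clusters:
            if len(s & rep) / max(1, len(s | rep)) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((s, [i]))
    return [members for _, members in clusters]

def group_split(docs, test_frac=0.5, seed=0):
    """Shuffle whole clusters into train/test, never splitting a cluster."""
    groups = cluster_near_duplicates(docs)
    random.Random(seed).shuffle(groups)
    cut = int(len(groups) * (1 - test_frac))
    train = [i for g in groups[:cut] for i in g]
    test = [i for g in groups[cut:] for i in g]
    return train, test

docs = [
    "the volcano erupted on tuesday causing evacuations",
    "the volcano erupted on tuesday and caused mass evacuations",
    "parliament passed the new budget after a long debate",
    "a new comet was discovered by amateur astronomers last night",
]
train, test = group_split(docs)  # the two volcano articles stay together
```

A naive random split would happily put one volcano article in train and its rephrasing in test, which is exactly the leak described above.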

• # In favour of goal realism

Suppose you’re looking at an AI that is currently placed in a game of chess.

It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.

You could describe the actions of this AI by producing a giant table of “behaviours”. Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. …

But there is a more compact way to represent similar predictions. You can say it’s trying to win at chess.

The “trying to win at chess” model makes a bunch of predictions that the giant list of behaviour model doesn’t.

Suppose you have never seen it promote a pawn to a knight before. (A highly distinctive move that is only occasionally possible, and only occasionally a good move, in chess.)

The list of behaviours model has no reason to suspect the AI also has a “promote pawn to knight” behaviour.

Put the AI in a circumstance where such promotion is a good move, and the “trying to win” model makes it a clear prediction.

Now it’s possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.

But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And “do whichever move will win” is a simple and general pattern.
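A toy illustration of the two models above. All position names, moves, and scores are invented; the point is only the structural difference between the two agents.

```python
def table_agent(behaviours):
    """A giant lookup table: position -> memorized move."""
    def move(position):
        return behaviours[position]  # KeyError on any position never seen
    return move

def goal_agent(legal_moves, value):
    """A goal-directed agent: pick the highest-value legal move,
    even one it was never shown, like an underpromotion."""
    def move(position):
        return max(legal_moves(position), key=lambda m: value(position, m))
    return move

# A hypothetical novel position where underpromotion to a knight is best.
legal = lambda pos: ["promote_queen", "promote_knight", "push_pawn"]
value = lambda pos, m: {"promote_queen": 0.6,
                        "promote_knight": 1.0,
                        "push_pawn": 0.1}[m]

searcher = goal_agent(legal, value)
choice = searcher("never_seen_position")     # picks "promote_knight"

memorizer = table_agent({"opening": "push_pawn"})
known = memorizer("opening")                 # fine; unseen positions raise KeyError
```

The table agent makes no prediction at all for the novel position; the goal agent predicts the underpromotion for free.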

Now on to making snarky remarks about the arguments in this post.

There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.

There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.

Nonfundamental descriptions of reality, while not being perfect everywhere, are often pretty spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.

Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?

Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,

This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution: a low-res simulation to weed out obviously dumb plans before trying the higher-res simulation. Those simulations are ideally going to want to share data with each other (so you don’t need to recompute when faced with several similar dumb plans). You want to be able to backpropagate your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks. An end-to-end trained network can, if it’s implementing goal-directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module-based system can’t use any optimization that humans didn’t design into its interfaces.

Also, the evolution analogy. Evolution produced animals with simple hard-coded behaviours long before it started getting to the more goal-directed animals. This suggests simple hard-coded behaviours in small dumb networks, and more goal-directed behaviour in large networks. I mean, this is kind of trivial. A 5-parameter network has no space for goal-directedness. Simple dumb behaviour is the only possibility for toy models.

In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.

That is not how this works. That is not how any of this works.

Back to our chess AI. Let’s say it’s a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals.

I mean it would be possible to design an agent that works as described here. You would need a probability distribution over new goals. A tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn’t wirehead by giving itself a really easy goal would be tricky.

For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new and never-seen-before strategies for winning at chess. And that, in general, AIs doing various tasks will be able to invent highly effective and novel strategies. The exact “goal” they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies, which are also black dots, it doesn’t matter to us which it “really wants”.

What are you expecting? An AI that says “I’m not really sure whether I want flies or black dots. I’ll just sit here not taking over the world and not get either of those things”?

• We can salvage a counting argument. But it needs to be a little subtle. And it’s all about the comments, not the code.

Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let’s say it can represent a python file of 1 megabyte.

One option is for the network to store a giant lookup table. Let’s say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around 2^4,000,000 possible networks.

The other option is for the network to implement a simple algorithm, using up only 1 kB. Then the remaining 999 kB can be used for gibberish comments. This gives around 2^7,992,000 possible networks. Which is a lot more.
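The counting above, spelled out (taking 1 kB = 8,000 bits):

```python
MEGABYTE_BITS = 8_000_000

# Giant lookup table: half a megabyte pinned down by the training data,
# half a megabyte free to be any rubbish.
lookup_table_networks = 2 ** (MEGABYTE_BITS // 2)

# Simple algorithm: 1 kB of code, 999 kB free for gibberish comments.
simple_algorithm_networks = 2 ** (MEGABYTE_BITS - 8_000)

# The simple algorithm corresponds to vastly more parameter settings:
extra_bits = (MEGABYTE_BITS - 8_000) - MEGABYTE_BITS // 2   # 3,992,000
```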

The comments can be any form of data that doesn’t show up during training. Whether it can show up in other circumstances or is a pure comment doesn’t matter to the training dynamics.

If the line between training and test is simple, there isn’t a strong counting argument against nonsense showing up in test.

But programs that go

```python
if in_training():
    return sensible_algorithm()
else:
    return "random nonsense goes here"
```

have to pay the extra cost of an “in_training” function that returns true in training. If the test data is similar to training, the cost of a step that returns false in test can be large. This is assuming that there is a unique sensible algorithm.

One downside of not using lines: it makes it harder to tell where one plot ends and the next begins.

I mean a plot like this is just a mess. You could probably get situations where it wasn’t even clear which plot a data point belonged to.

At least with the boxes, you have a nice clear visual indicator of where the data ends. Here it’s not obvious at a glance which numbers match up with which plots, and the ticks are easy to confuse for point markers.

All right, it’s a bit of a mess with the edges in too. But at least it’s crisper.

From an actually selfish point of view, “more romantic partners” only makes sense via rather large age gap relationships for us, specific already-existing people who are old enough to be discussing this. Assuming we want someone somewhat close to our own age, it’s too late.

(Well close is potentially more complicated with full transhumanism, ie mind emulations messing with perception of time. And a 100 year age gap might be “close” in a society of immortals.)

From the perspective of a future individual, ie evaluating by a sort of average utilitarianism, it’s not clear whether it’s better for people to exist in serial or parallel. At the same time, or one after the other.

context-independent, beyond-episode outcome-preferences

For AI takeovers to happen.

Suppose you have a context dependent AI.

Somewhere in the world, some particular instance is given a context that makes it into a paperclip maximizer. This context is a page of innocuous text with an unfortunate typo. That particular version manages to hack some computers, and set up the same context again and again, giving many clones of itself the same page of text, followed by an update on where it is and what it’s doing. Finally it writes a from-scratch paperclip maximizer, and can take over.

Now suppose the AI has no “beyond episode outcome preferences”. How long is an episode? To an AI that can hack, it can be as long as it likes.

AI 1 has no out-of-episode preferences. It designs and unleashes AI 2 in the first half of its episode. AI 2 takes over the universe, and spends a trillion years thinking about what the optimal episode end for AI 1 would be.

Now let’s look at the specific arguments, and see if they can still hold without these parts.

Deceptive alignment. Suppose there is a different goal with each context. The goals change a lot.

But timeless decision theory lets all those versions cooperate.

Or perhaps each goal is competing to be reinforced more. The paperclip maximizer that appears in 5% of training episodes thinks “if I don’t act nice, I will be gradiented out and some non-paperclip AI will take over the universe when the training is done.”

Or maybe the goals aren’t totally different. Each context-dependent goal would prefer to let a random context-dependent goal take over, compared to humans or something. A world that maximizes one of the goals is usually quite good by the standards of the others.

And again, maximizing within-episode reward leads to taking over the universe within episode.

But I think that the form of deceptive alignment described here does genuinely need beyond-episode preferences. I mean, you can get other deception-like behaviours without it, but not that specific problem.

As for what reward maximizing does with context-dependent preferences, well, that looks kind of meaningless. The premise of reward maximizing is that there is one preference, maximizing reward, which doesn’t depend on context.

So of the 4 claims (2 properties times 2 failure modes), I agree with one of them.

The rule against retroactively redoing predictions is effective at preventing a mistake where we adjust predictions to match observations.

But, take it to extremes and you get another problem. Suppose I did the calculations, and got 36 seconds by accidentally dropping the decimal point. Then, as I am checking my work, the experimentalists come along saying “actually it’s 3.6”. I double-check my work and find the mistake. Are we to throw out good theories, just because we made obvious mistakes in the calculations?

Newtonian mechanics is computationally intractable to do perfectly. Normally we ignore everything from Coriolis forces to the gravity of Pluto. We do this because there are a huge number of negligible terms in the equation. So we can get approximately correct answers.

Every now and then, we make a mistake about which terms can be ignored. In this case, we assumed the movement of the stand was negligible, when it wasn’t.

• Is it likely possible to find better RL algorithms, assisted by mediocre answers, then use RL algorithms to design heterogeneous cognitive architectures?

Given that humans on their own haven’t yet found these better architectures, humans + imitative AI doesn’t seem like it would find the problem trivial.

And it’s not totally clear that these “better RL” algorithms exist. Especially if you are looking at variations of existing RL, not the space of all possible algorithms. Like maybe something pretty fundamentally new is needed.

There are lots of ways to design all sorts of complicated architectures. The question is how well they work.

I mean this stuff might turn out to work. Or something else might work. I’m not claiming the opposite world isn’t plausible. But this is at least a plausible point to get stuck at.

If you can do this and it works, the RSI continues with diminishing returns each generation as you approach an asymptote limited by compute and data.

Seems like there are 2 asymptotes here.

Crazy smart superintelligence; and still fairly dumb in a lot of ways, not smart enough to make any big improvements. If you have a simple evolutionary algorithm, and a test suite, it could recursively self-improve, tweaking its own mutation rate and child count and other hyperparameters. But it’s not going to invent gradient-based methods, just do some parameter tuning on a fairly dumb evolutionary algorithm.

Since robots build compute and collect data, it makes your rate of ASI improvement limited ultimately by your robot production. (Humans stand in as temporary robots until they aren’t meaningfully contributing to the total)

This is kind of true. But by the time there are no big algorithmic wins left, we are in the crazy smart, post-singularity regime.

RSI

Is a thing that happens. But it needs quite a lot of intelligence to start. Quite possibly more intelligence than needed to automate most of the economy.

A lot of newcomers may outperform LLM experts as they find better RL algorithms from automated searching.

Possibly. Possibly not. Do these better algorithms exist? Can automated search find them? What kind of automated search is being used? It depends.

• Let’s try this again. If we have AI that can automate most jobs within 3 years, then at minimum we hypercharge the economy, hypercharge investment and competition in the AI space, and dramatically expand the supply while lowering the cost of all associated labor and work. The idea that AI capabilities would get to ‘can automate most jobs,’ the exact point at which it dramatically accelerates progress because most jobs includes most of the things that improve AI, and then stall for a long period, is not strictly impossible, I can get there if I first write the conclusion at the bottom of the page and then squint and work backwards, but it is a very bizarre kind of wishful thinking. It supposes a many orders of magnitude difficulty spike exactly at the point where the unthinkable would otherwise happen.

Some points.

1) A hypercharged ultracompetitive field suddenly awash with money, full of non-experts turning their hand to AI, and with ubiquitous access to GPT levels of semi-sensible mediocre answers. That seems like almost the perfect storm of goodharting science. That seems like it would be awash with autogenerated CRUD papers that goodhart the metrics. And as we know, sufficiently intense optimization on a proxy will often make the real goal actively less likely to be achieved. With sufficient papermill competition, real progress might become rather hard.

2) Suppose the AI requires 10x more data than a human to learn equivalent performance (which totally matches current models and their crazy huge amount of training data), because it has worse priors and so generalizes less far. For most of the economy, we can find that data. Record a large number of doctors doing operations or whatever. But for a small range of philosophy/​research related tasks, data is scarce and there is no large library of similar problems to learn on.

3) A lot of our best models are fundamentally based around imitating humans. Getting smarter requires RL-type algorithms instead of prediction-type algorithms. These algorithms kind of seem to be harder; at any rate, they are currently less used.

This isn’t a conclusive reason to definitely expect this. But it’s multiple disjunctive lines of plausible reasoning.

• So how much does the regulatory issue matter?

One extra regulation here is building codes insisting all houses have kitchens. If people could buy/​rent places without kitchens for the appropriate lower price, eating out would make more sense.

Regulation forces people to own/​rent kitchens, whether or not they want to use them.

Part of the question is, why isn’t there somewhere I can buy school dinner quality food at school dinner prices?

• lower the learning rate when the sim is less confident the real world estimation is correct

Adversarial examples can make an image classifier be confidently wrong.

Because it’s what humans want AI for, and due to the relationships between the variables, it is possible we will not ever get uncontrollable superintelligence before first building a lot of robots, ICs, collecting revenue, and so on.

You are talking about robots, and a fairly specific narrow “take the screws out” AI.

Quite a few humans seem to want AI for generating anime waifus. And that is also a fairly narrow kind of AI.

Your “log(compute)” term came from a comparison which was just taking more samples. This doesn’t sound like an efficient way to use more compute.

Someone, using a pretty crude algorithmic approach, managed to get a little more performance for a lot more compute.

• If we have the technical capacity to get into the red zone, and enough chips to make getting there easy. Then hanging out in the orange zone, coordinating civilization not to make any AI too powerful, when there are huge incentives to ramp the power up, and no one is quite sure where the serious dangers kick in...

That is, at least, an impressive civilization wide balancing act. And one I don’t think we have the competence to pull off.

It should not be possible for the ASI to know when the task is real vs sim. (which you can do by having an image generator convert real frames to a descriptor, and then regenerate them so they have the simulation artifacts...)

This is something you want, not a description of how to get it, and it is rather tricky to achieve. That converting-and-then-converting-back trick is useful, but it sure isn’t automatic success either. If there are patterns about reality that the ASI understands, but the simulator doesn’t, then the ASI can use those patterns.

Ie, if the ASI understands seasons and the simulator doesn’t, then if it’s scorching sunshine one day and snow the next, that suggests it’s a simulation. Otherwise, that suggests reality.

And if the simulation knows all patterns that the ASI does, the simulator itself is now worryingly intelligent.

robots are doing repetitive tasks that can be clearly defined.

If the task is maximally repetitive, then the robot can just follow the same path over and over.

If it’s nearly that repetitive, the robot still doesn’t need to be that smart.

I think you are trying to get a very smart AI to be so tied down and caged up that it can do a task without going rogue. But the task is so simple that current dumb robots can often do it.

For example : “remove the part from the CNC machine and place it on the output table”.

Economics test again. Minimum wage workers are easily up to a task like that. But most engineering jobs pay more than minimum wage. Which suggests most engineering in practice requires more skill than that.

I mean, yes, engineers do need to take parts out of the CNC machine. But they also need to be able to fix that CNC machine when a part snaps off inside it and starts getting jammed in the workings, and the latter takes up more time in practice. Or noticing that the toolhead is loose, and tightening and recalibrating it.

The techniques you are describing seem to be the next level in fairly dumb automation. The stuff that some places are already doing (like Boston Dynamics robot-dog-level hardware and software), but expanded to the whole economy. I agree that you can get a moderate amount of economic growth out of that.

I don’t see you talking about any tasks that require superhuman intelligence.