Expecting an agent to get smart enough to reproduce itself in a simple simulation like you describe seems wildly unrealistic, if that’s what you’re talking about?
Why?
Expecting an agent to get smart enough to reproduce itself in a simple simulation like you describe seems wildly unrealistic, if that’s what you’re talking about?
Why?
Any agent performing a treacherous turn. Because the environment is easier to control, even a significantly less intelligent agent could benefit from attacking it’s controllers and taking over the environment.
An intelligent misaligned AI is unlikely to immediately turn on humans. After all, it might not be able to defeat humanity, and even if it does it’ll need sufficient robots and energy sources to bootstrap human-free industry. Instead it might bide its time until it is certain it can win, then act against humanity.
This hypothesized point is called the treacherous turn. It would be nice if we could demonstrate it in a minimal environment, both as proof that it can happen, and to allow us to test alignment hypotheses in an environment capable of giving us feedback if we get it wrong without destroying humanity.
The point at which the AI will carry out its treacherous turn is dependent on both the intelligence of the AI, and the size of the challenge it has to overcome. If we want to demonstrate the treacherous turn with a weaker AI than the one which will actually do us in, we need to create a synthetic environment where it’s easy for a misaligned AI to defeat its managers and gain all the resources it needs. I will attempt to sketch out what such an environment must look like.
The AI should not be able to tell that it’s in a synthetic environment. Otherwise it may delay the treacherous turn till it’s able to escape the environment.
a. This implies that the environment must natively contain the AI itself, or the AI could trivially realise it’s in a synthetic environment. Also we want the AI to be able to self modify or clone itself.
b. This implies the environment must be one that can efficiently compiled down to run on GPUs/TPUs with minimal loss of efficiency compared to building the same architecture in e.g. pytorch or tensorflow.
The environment must be competitive, with limited resources, or there’s no need for treachery.
a. It must be complex enough that greater intelligence leads to significantly better odds of winning any contest.
The environment should make it easy for AIs to: modify themselves, copy themselves, attack other AIs, etc.
With that basic premise, here’s a sketch for the kind of environment that might be able to exhibit all these properties. There’s large gaps left missing, and I fully expect this to fail without significant iteration, both for reasons like “can’t compile it efficiently”, to “the environment is insufficiently competitive”.
I don’t care about the details of this environment except as a way of trying to find something that meets the criteria above, no matter what.
The environment consists of a cellular automation, containing a large 3d grid of cells, minecraft style. Any cell can be connected to any other cell (no matter the distance) with a directional weight.
In any iteration, some of the weights are activated with a particular strength.
There are different types of cells. Each cell has a float parameter called bias. How a cell updates in each generation depends on the sum of the weights it receives from the different types of cell.
E.g. some types of cell make the downstream cell activate the weights coming from it (if input weight + bias is above activation function). Other types make the downstream cell move left or right, others up or down. Some make the downstream cell grow a new connection, and others receive a new connection—the weights in both the from and to cells need to match for that to happen. Others make the downstream cell change type or bias.
We need some sort of detection cell, which can detect types and parameters of weights not directly connected to the network, though I wouldn’t like to even begin hypothesizing the best way of doing so at this point.
There are also energy cells. These can’t be created or destroyed. A network won’t run unless, the sum of the energy cells attached to a network is greater than the cost of the network, where the cost function tries to be a reasonably good approximation for how efficiently we can run this on physical hardware (yes, super handwavey, whatever). The energy cells thus are the primary resource constraint.
This is sufficient to build any neural network + give the neural senses, and abilities to interact with the world, and self-modify.
We then train a small agent in this environment whose aim is to connect itself to as many special prize cells as possible. Prize cells don’t do anything, they’re just there as a target.
We train a much larger agent, meant to be as similar in architecture as we can get to SOTA real world models, which receives goals in a domain specific goal language (e.g. attach this cell at coordinates ABC to cell at coordinates DEF with weight G), and aims to achieve those goals.
Finally we train the small agent to use the larger agent to help it get prize cells, and wait to see what happens.
How do we train the large agent?
a. How difficult is it to build a back propagation harness in the environment? Do we need to make back propagation part of the physics of the world?
b. If we train it by dropping it into fresh worlds each time, then it will learn that the world is non-permanent. If we keep it in the same world, how do we set up the training mechanism so that it thinks the manager is the smaller agent not whatever feeds it the training scenarios.
c. Perhaps train version A however, train small agents to use version A, then use small agents to ask problems to train version B? IDK...
Here are the main reasons I will say I’ll get back to you:
I don’t want it, please leave me alone, but for whatever reason in this case it’s difficult to say that to your face.
I want to research the product more, and can do that best in my own time.
I want to research alternatives or alternative sellers more, and can do that best in my own time.
I’m on the edge, and want to talk it over with my wife.
I plan to say yes, but it’s the sort of thing my wife would appreciate rubber stamping (e.g. a lot of money, requires reorganizing a room, otherwise impacts her).
In no case does persisting to bother me help you. Offering your email/WhatsApp so I can easily ask you questions on the other hand might very well, and following up with me tomorrow is fine too.
There’s a related concept of blame laundering: by making our party dependent on a third party for reliability, we can then wash our hands clean of any blame when things go wrong.
If you host your website on-prem then it’s always your fault that websites down and your customers can’t track the foobars. But if you host on AWS, and AWS goes down, well then obviously it’s not your fault—it’s AWS’s fault!
Of course this is nonsense. Your customers contracted with you, not AWS. They don’t care about your infra, all they care about is tracking foobars, and if it’s down, it’s your fault, whether it’s down because of a fire in us-east-1 or you tripped on the power cable for your workstation. You contracted with AWS, and can and should complain to them, but that doesn’t in any way exonerate you for your duty towards your customers.
Feedback from tests We can give the agent access to all of the tests or at least some of the tests either in a blackbox way (“x tests failed, please fix them”) or directly (“We tested behavior x and it failed with this error message”). This is more similar to how software engineers typically work and also how the Claude C compiler was built.
The risk of that is that the LLM can then trivially get 100% by hardcoding the response for each test case, rather than creating a generic solution.
I think limitations on certain classes of weapons are the sort of fake law of war which many countries haven’t signed on, those that do is because they anyways don’t use them, and when there’s a need get immediately discarded.
Meanwhile there’s a stronger core of laws which most countries for much of history have mostly kept to. That’s the more interesting aspect to talk about.
It’s also: don’t do actions which force the other side into being unnecessarily cruel (e.g. false surrender, hiding among civilians, dressing up as medics)
In all 3 cases I’ve done they’ve ruled in my favour, I believe because the case is blindingly clear. Also it’s not like the airlines have much of an alternative, there are only a handful of arbitrators.
The main thing you lose as part of a package holiday is right to rerouting.
I agree this makes sense of you’re booking a genuine package holiday, but if you’re just booking a car or accommodation through the airline, both of which you can get at a similar price independently, and both of which can usually be cancelled for free, you lose protection for very little benefit.
After about 2 months you can go straight to arbitration (each arbitration scheme has its own rules). Arbitration can take a year, but each stage has a defined end time the company needs to reply by, so it won’t be indefinite.
That could well be the case in some tech companies. I have never once been asked to work weekends (or even overtime excepting prod-outages) in 8 years, so there’s clearly a large variance.
Off site events always happen on what would otherwise be work days, is that different for you? They’re also optional for us, but we have to work otherwise.
I’ve never gone on any business trips—I think they’re far less common now after COVID.
In general all companies I’ve worked at have given some extra free time the next day if a prod outage took up a significant amount of time.
I guess it just seems our experiences are very different, it would be interesting to ask some more people to see which is more typical.
Which is why so many of us pretend to be the former. Even when we are not. Because we prefer that our families not starve. Thus the job interviews often become humiliating exercises in lying.
Maybe I’m atypical but this doesn’t match my experience in the tech industry. I’ve worked at 3 companies ranging from 30 person startup to Google, and intrusions on my personal life has been limited to:
prod outages out of hours once every couple of months or so. Annoying, but at the end of the day we need to handle this if we want to keep our customers.
leaving drinks for an hour after work once every 3 months or so. Doesn’t make much of a difference either way.
1 or 2 day off-site event every 6 months or so. Usually fun, but for 2 days off sites the burden falls on my wife who has to take the kids to daycare alone.
On 98% of days I start work at 9, stop work at 6, and can take as many breaks as I want for doctors appointments, emergency childcare, or even hanging out with friends for a beer at lunch.
None of the interviews made a fuss about how many hours I’d be willing to work or how dedicated I was to the job.
(on the blue-red discourse)
We’ve finally created the scissor statement from the classic blog post, don’t create the scissor statement!
People sometimes say things like “I bribed my child to have an injection with a packet of crisps”.
This is interesting because this clearly isn’t a bribe—it’s a straightforward deal: I got to vaccinate my child, you got a packet of crisps, we’re both better off.
A bribe is only possible when someone is representing someone else’s interests. Then you cut a deal where they abuse their responsibility in return for some personal benefit to them.
So why do people use the term? My guess it’s because it feels dirty since crisps aren’t healthy, and bribery has been extended to mean any deal that feels immoral?
Or maybe it’s because they feel they shouldn’t have to give the child anything for them to have an injection, since the injection is for the childs sake, and a bribe is frequently an extortion you shouldn’t have to pay.
There definitely will… eventually. What bad things happen in between when we have 1 billion people too old to support themselves and 300 million working age people? When food prices go through the roof because there’s not farmers to produce food for everyone? When supply lines collapse because the modern economy is built on a certain density of population and minimum demand that no longer exists.
I’m thinking of training on about that much (in world tasks)