This seems like a more reasonable answer, but I still feel uneasy about parts of it.
For one, they would have to agree on the design of the successor, which may be non-trivial or even impossible for two adversarial agents.
But more importantly, if a single successor could take the actions needed to accomplish one of their goals, why can't that agent take those actions itself? Wouldn't anything that hinders one of the agents from carrying out the successor's actions on its own also hinder the successor?
Something that's been intriguing me: if two agents figure out how to trust that each other's goals are aligned (or at least not opposed), haven't they essentially solved the alignment problem?
E.g., one agent could use the same method to bootstrap an aligned AI.