Running many AI variants to find correct goal generalization

Problem: An AI may incorrectly generalize a goal we present to it.

Proposal: Run many slightly different instances of an AI and find the “Schelling point” around the correct generalization.

The idea is that if we run many similar AIs, or copies of the same AI, in parallel, we will be able to detect misgeneralizations of the goal, because the wrong generalizations will differ across most AIs. The only “Schelling point” will be the correct generalization, on which many AIs will converge.

For example, we want an AI to clear a room of bowling balls. We ask the AI to “remove all spherical objects”. We run many slightly different copies of the AI, and they produce the following suggestions:

1st AI: suggests removing balls

2nd AI: suggests removing balls

3rd AI: suggests removing balls and lamps

4th AI: suggests removing balls and human heads

The 1st and 2nd AIs have come to the same conclusion, and it is the right one. The 3rd and 4th AIs have each reached a different conclusion, and both are wrong. The 4th AI’s conclusion is dangerous.
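The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `proposals` list hard-codes the four answers from the example, and `find_consensus` is a hypothetical helper that treats any proposal shared by at least two instances as a candidate Schelling point.

```python
from collections import Counter

# Hypothetical outputs of the four AI instances from the example.
proposals = [
    {"balls"},
    {"balls"},
    {"balls", "lamps"},
    {"balls", "human heads"},
]

def find_consensus(proposals):
    """Return the most common proposal and the indices of dissenting instances.

    Returns (None, all indices) if no proposal is shared by at least two instances.
    """
    if not proposals:
        return None, []
    counts = Counter(frozenset(p) for p in proposals)
    top, votes = counts.most_common(1)[0]
    if votes < 2:
        return None, list(range(len(proposals)))
    dissenters = [i for i, p in enumerate(proposals) if frozenset(p) != top]
    return set(top), dissenters

consensus, dissenters = find_consensus(proposals)
print(consensus)   # {'balls'}
print(dissenters)  # [2, 3]
```

The dissenting instances (here the 3rd and 4th, at indices 2 and 3) are the ones whose outputs deserve inspection before anything is executed.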

Counterarguments:

  1. Some misgeneralizations could also be Schelling points.

  2. Powerful AIs can guess that other similar AIs exist and start colluding with them against us.

The first problem could be (partly) solved by proper randomisation of initial parameters, weights, or inputs. For example, a question could be presented several times in different forms to different instances of the AI, e.g.: “remove all round objects”, “remove all circular objects”, “remove all sports equipment”.

There could be two “peaks” in the generalization, which means that the goal we presented is intrinsically ambiguous. We could then ask the AI why it arrives at different interpretations.
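A minimal sketch of how two such peaks might be detected, assuming each instance's interpretation can be reduced to a comparable label. The `answers` data and the `min_share` threshold of 0.3 are arbitrary choices made for illustration only.

```python
from collections import Counter

def find_peaks(answers, min_share=0.3):
    """Return every answer given by at least `min_share` of the instances,
    ordered from most to least common."""
    counts = Counter(answers)
    total = len(answers)
    return [a for a, n in counts.most_common() if n / total >= min_share]

# Hypothetical interpretations returned by ten instances.
answers = ["balls"] * 5 + ["balls and lamps"] * 4 + ["balls and human heads"]

peaks = find_peaks(answers)
if len(peaks) > 1:
    # More than one well-supported interpretation: the goal is ambiguous.
    print("Ambiguous goal, competing interpretations:", peaks)
```

With this data the two peaks are “balls” and “balls and lamps”, while the single “human heads” answer falls below the threshold and is treated as an outlier rather than a peak.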

This approach is also used in real life: we might say to a driver, “Turn left. Turn there on Main Street. Here!” In that case, we formulate our goal several times in different words, hoping that it will eventually converge on the correct model in the driver’s mind. The driver’s mind selects the solution that satisfies all three commands and disregards solutions that satisfy only one command.
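The driver’s selection can be pictured as filtering candidate manoeuvres by every instruction in turn. The candidates and the matching rules below are invented purely for illustration; a real driver obviously does not work with string matching.

```python
# Hypothetical candidate manoeuvres the driver could consider.
candidates = {
    "left onto Main Street at this junction",
    "left onto Oak Avenue at this junction",
    "left onto Main Street two blocks later",
    "right onto Main Street at this junction",
}

# Each instruction, taken alone, is compatible with several candidates.
instructions = {
    "Turn left.": lambda c: c.startswith("left"),
    "Turn there on Main Street.": lambda c: "Main Street" in c,
    "Here!": lambda c: c.endswith("at this junction"),
}

# Keep only the candidates that satisfy every instruction.
remaining = set(candidates)
for text, is_compatible in instructions.items():
    remaining = {c for c in remaining if is_compatible(c)}

print(remaining)  # {'left onto Main Street at this junction'}
```

No single instruction pins down the manoeuvre, but their intersection does.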

Another example: if we have a set of calculators, each of which produces either the correct answer or a wrong one, we can find the broken calculators because they will typically produce answers that differ from one another, while all the correct calculators will produce the same answer.
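The calculator argument relies on wrong answers rarely coinciding. A rough estimate, under the assumption (not part of the argument above) that each of $m$ broken calculators independently outputs one of $K$ possible wrong answers uniformly at random, follows from a union bound over pairs:

$$
P(\text{some two broken calculators agree}) \le \binom{m}{2}\frac{1}{K} = \frac{m(m-1)}{2K}
$$

For instance, with $m = 10$ and $K = 10^6$ the bound is $45 / 10^6 = 4.5 \times 10^{-5}$. If the errors are correlated, as in the first counterargument, $K$ is effectively small and the bound gives little reassurance.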