A large part of this dark informational matter is the “neural net weights” in human brains, which manifest in our ability to recognise cats vs. dogs but can’t be accessed directly.
Why not treat this in a Bayesian way? Say there is a 50 per cent a priori credence in a short timeline and 50 per cent in a long one. In that case, we still need to get working AI safety solutions ASAP, even if there is a 50 per cent chance that the money spent on AI safety will simply be lost. (Disclaimer: I am not paid for any AI safety research, except a Good AI prize of 1500 USD, which is not related to timelines.)
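A toy expected-value check, with made-up numbers purely for illustration (they are assumptions, not estimates), shows why the 50 per cent chance of wasted money does not change the conclusion:

```python
# Toy expected-value check; all numbers are illustrative assumptions, not estimates.
p_short = 0.5          # a priori credence in a short timeline
value_if_short = 1.0   # normalized value of having safety solutions ready in time
value_if_long = 0.0    # pessimistically assume the money is simply lost otherwise
ev_fund_now = p_short * value_if_short + (1 - p_short) * value_if_long
ev_wait = 0.0          # if timelines are short, waiting forfeits the value entirely
print(ev_fund_now, ev_wait)  # 0.5 vs 0.0: funding now dominates despite the 50% waste
```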
One more thing: your model assumes that mental models of situations actually pre-exist. However, imagine a preference between tea and coffee. Before I am asked, I don’t have any model and don’t have any preference. So I will generate some random model, like a large coffee and a small tea, and then make a choice. However, the mental model I generate depends on the framing of the question.
In some sense, here we are passing the buck of complexity from “values” to “mental models”, which are assumed to be stable and actually existing entities. However, we still don’t know what a separate “mental model” is, where it is located in the brain, or how it is actually encoded in neurons.
In short, I am impressed, but not convinced :)
One problem I see is that all information about human psychology should be taken into account more explicitly, as an independent input to the model. For example, if we take a model M1 of the human mind in which there are two parts, consciousness and unconsciousness, both of which are centered around mental models with partial preferences, we will get something like your theory. However, there could be another theory M2, well supported by the psychological literature, in which there are 3 internal parts (e.g. Libido, Ego, SuperEgo). I am not arguing that M2 is better than M1. I am arguing that M should be taken as an independent variable (and supported by extensive links to actual psychological and neuroscience research for each M).
In other words, as soon as we define human values as some theory V (there are around 20 theories of V just among AI safety researchers, of which I have a list), we could create an AI which will learn V. However, the internal consistency of the theory V is not evidence that it is actually good, as other theories about V are also internally consistent. Some way of testing is needed, maybe in the form of a game which a human could play, so we could check what could go wrong; but to play such a game, the preference learning method should be specified in more detail.
While reading, I was expecting to get more on the procedure of learning partial preferences. However, it was not explained in detail; it was only mentioned (as I remember) that a future AI will be able to learn partial preferences by some deep scanning methods. But that is too advanced a method of value learning to be safe: we would have to give the AI very dangerous capabilities, like nanotech for brain reading, before it learns human values. So the AI could start acting dangerously before it learns all these partial preferences. Other methods of value learning are safer, like the analysis of previously written human literature by some ML system which would extract human norms from it. Probably something like word2vec could do it even now.
Now, it may turn out that I don’t need the AI to know my whole utility function; I just want it to obey human norms plus do what I said. “Just bring me tea, without killing my cat or tiling the universe with teapots.” :)
Another thing which worries me about a personal utility function is that it could be simultaneously fragile (in time), grotesque, and underdefined, at least based on my self-observation. Thus, again, I would prefer collectively codified human norms (laws) over an extrapolated model of my utility function.
Another option is to run several different Oracles of phi and take the shortest proof.
How could we be sure that the proof is actually the shortest?
If there is an AI Manhattan project, there will be an effective cover-up of it, so no obvious signs of it will be visible to a layman.
We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, its owners. A stock-market-predicting AI may cause a crash just to earn something. It also has to predict the effects of its own actions, as we will make bets according to its predictions, and thus it could send data into the real world.
Submission: low bandwidth and maybe counterfactual Oracle.
General idea: use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after getting the code of another Oracle. If the Meta-Oracle is trustworthy, then an Oracle it deems safe is also trustworthy. However, we need a procedure to check the correctness of the MO.
1. We create a Generator Oracle (GO), whose input is 1 or 0; in the case of 0 it outputs a random code of a malevolent Oracle, and for 1, a code of a safe Oracle. The memory of the GO is erased after each turn.
2. We create a large database of presumably good or bad Oracles (without running any of them). Humans do not see the code of any of them, so the effective output is 0.
3. We use different types of GOs to fill our database of good and bad Oracles (to prevent their coordination).
4. We test the Meta-Oracle on this database: it should correctly guess which ones are bad and which are good.
5. If the list of MO guesses doesn’t match our commands to create safe or unsafe Oracles, it means that at least one of the GOs, or the MO itself, is malevolent or erroneous, so all of them should be ditched.
6. Even if the MO is not trustworthy, it can’t lie to us too often, or it will be caught and ditched. So, basically, we are doing something like a GAN here, but consisting of Oracles.
7. Using a very large database, we could increase our credence in the truth of the next output of the MO to an arbitrarily high number, like 99.999 per cent.
8. Eventually we have the code of an Oracle which is very likely to be safe. (A toy sketch of this checking procedure is given after the list.)
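A minimal sketch of steps 1-7, assuming hypothetical stand-ins generator_oracle and meta_oracle for the GO and the MO (no generated Oracle is ever run; we only compare the MO’s verdicts against the labels we secretly commanded the GO to produce):

```python
import random

def generator_oracle(label: int) -> str:
    # Hypothetical placeholder: a real GO would emit actual code of a safe (label=1)
    # or malevolent (label=0) Oracle; its memory is wiped after each call.
    return f"oracle_code_{label}_{random.random()}"

def meta_oracle(code: str) -> int:
    # Hypothetical placeholder verdict; a real MO would analyse the code itself.
    return int("_1_" in code)

def build_database(n: int):
    labels = [random.randint(0, 1) for _ in range(n)]   # our secret commands to the GO
    codes = [generator_oracle(lab) for lab in labels]   # nobody ever reads these codes
    return labels, codes

def meta_oracle_passes(labels, codes) -> bool:
    """True only if the MO's verdict matches every label we commanded."""
    mismatches = sum(meta_oracle(code) != lab for lab, code in zip(labels, codes))
    # Any mismatch means some GO or the MO is erroneous or malevolent: ditch them all.
    return mismatches == 0

labels, codes = build_database(100_000)
print("MO trusted" if meta_oracle_passes(labels, codes) else "ditch the MO and the GOs")
```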
Path II. Another way to test the MO:
1. We create many different MOs and test them against the database of Oracle codes from above. They don’t have memory and can’t communicate.
2. If a small minority of MOs disagree with the majority of MOs about some Oracle O1, we ditch this minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it contains a line like “let’s lie about me” and if they are sure that the majority of other MOs are also cheaters; however, not all Oracle codes will contain such coordination texts, or an Oracle code could be created deliberately with such a line to catch MOs which are eager to cheat.)
3. Running step 2 many times (millions) we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe. (A sketch of the majority-vote filtering is given after the list.)
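A minimal sketch of the majority-vote filter in Path II, assuming each MO exposes a hypothetical verdict(oracle_code) method returning 0 or 1, and that MOs share no memory between calls:

```python
from collections import Counter

def filter_meta_oracles(meta_oracles, oracle_codes):
    """Repeatedly ditch the MOs that disagree with the majority verdict."""
    surviving = list(meta_oracles)
    for code in oracle_codes:
        verdicts = [mo.verdict(code) for mo in surviving]
        majority, _ = Counter(verdicts).most_common(1)[0]
        # Ditch the minority: truth is assumed to be the Schelling point.
        surviving = [mo for mo, v in zip(surviving, verdicts) if v == majority]
        if len(surviving) < 3:
            break  # too few MOs left for a meaningful majority
    return surviving
```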
Path III. Another way to test the Meta-Oracle:
1. Create many MOs with different principles and initial conditions.
2. Test the code of each MO against each MO (in other words, each Meta-Oracle tests itself, or its siblings).
3. If all results come back positive, it means that either all of them are actually good, or all of them were able to acausally coordinate and all of them are malevolent.
4. To increase our credence, ask the Generator Oracle to create malevolent Meta-Oracles and add them to the mix of mutually testing MOs, so they can’t just blindly cooperate and produce all-positive results. (A sketch of this mutual testing is given after the list.)
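A minimal sketch of Path III’s mutual testing, assuming each MO has a hypothetical .code attribute and .verdict(code) method, and that we hold a batch of known-malevolent decoy codes from the Generator Oracle:

```python
def mutual_test(meta_oracles, known_bad_codes):
    """Every MO judges every sibling's code plus known-malevolent decoys."""
    for mo in meta_oracles:
        # An MO that approves a known-malevolent decoy exposes itself as a cheater.
        if any(mo.verdict(bad) == 1 for bad in known_bad_codes):
            return "cheater caught: ditch the whole batch"
    verdicts = [mo.verdict(other.code) for mo in meta_oracles for other in meta_oracles]
    if all(v == 1 for v in verdicts):
        return "all positive: either all good, or acausal coordination"
    return "disagreement: at least one MO is suspect"
```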
These ideas are still vulnerable and may not be mathematically perfect, but they may help to make safer Oracles.
Submission for low-bandwidth Oracle:
General idea: a “true” Oracle will output the same result, while “lying” Oracles would lie differently. To check that a solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other’s existence and coordinate acausally. To prevent this, additional steps are implemented:
1) The exact number of Oracles which will be compared is generated randomly, and the Oracles have slightly different, also random, initial conditions (where applicable). For example, if they are based on ML, they are trained using different initial weights.
2) The Oracles don’t have any information about the outside world which they could use for coordination. The Oracles are used only to solve mathematical problems, and they output the results in a formal language.
3) The output of the Oracles is checked by a computer program which shows the result to humans only if the output is the same for all Oracles. If the results differ, it deletes all of them.
4) There is another Oracle, which is only able to check proofs and can output only 0 or 1. This Oracle may be tested beforehand on a large number of already existing proofs, so we can know that it is trustworthy with very high probability.
5) There is a third Oracle, which only checks whether there are hidden messages in the proofs and outputs 0 or 1. It is also well calibrated on some dataset.
6) The proofs of mathematical problems are published for money, and this money is used for more AI safety research. (A sketch of this pipeline is given after the list.)
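A minimal sketch of the gate in steps 1-5, assuming hypothetical callables make_oracle(seed), proof_checker(proof) and hidden_message_checker(proof); none of these are real APIs, they simply stand in for the components described above:

```python
import secrets

def gated_answer(problem, make_oracle, proof_checker, hidden_message_checker):
    # Step 1: a random number of Oracles with random initial conditions.
    n = 3 + secrets.randbelow(5)
    oracles = [make_oracle(seed=secrets.token_hex(8)) for _ in range(n)]

    # Steps 2-3: collect the formal-language outputs; release only if all agree.
    outputs = {oracle(problem) for oracle in oracles}
    if len(outputs) != 1:
        return None  # disagreement: delete all results, humans see nothing

    proof = outputs.pop()

    # Steps 4-5: independent 0/1 checkers for validity and hidden messages.
    if proof_checker(proof) != 1 or hidden_message_checker(proof) != 1:
        return None

    return proof  # step 6: safe to publish (and sell) the proof
```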
Personally, I am still not sure about this setup: it could work with high probability, but it could still be vulnerable, as proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to light a campfire: the instrument is so dangerous that it requires many levels of protection for a small practical result.
Also, the proof-checking AI from point 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including the testing of mechanical designs and of AI safety ideas. Maybe it should be a separate submission.
Another variant of a checkable and profitable Oracle: use the AI to solve math problems, publish the solutions and earn money from them. In that case the input may be very short, like “prove the Goldbach conjecture”, but the output could be arbitrarily long and may be dangerous if there are some encoded messages in it. To escape this, we need another Oracle, which only checks proofs in some formal language.
Here it is assumed that if the proof is true and is in a formal language, there are no hidden messages in the text.
Another way to check proofs is to run two slightly different non-communicating Oracles and compare their outputs.
We don’t give it a list of passwords. We give it only a wallet address, and it generates the password. Here the output is low-bandwidth, but the Oracle may either guess passwords using some quantum cryptography tools, in which case the input is also relatively low-bandwidth, or it needs to simulate all of human history by guessing the psychological processes in the wallet owner’s mind. But in the second case it needs access to all the data of the internet, which is risky.
Suggestion for low-bandwidth OAI:
General principle: use the Oracle AI where the true answer is easily checkable and profitable, and no human will ever read the answer, so there is no informational hazard that an untrue answer will contain some dangerous information.
Example: there are many bitcoin wallets whose passwords have been forgotten by their owners. The OAI could guess the passwords, and the owners would pay a share of the money from the wallet to get the rest. Moreover, nobody will read the password, as it will be copy-pasted automatically from the OAI into the wallet. The money could be used for AI safety research. (A sketch of this no-human-in-the-loop pipeline is below.)
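A minimal sketch of the “nobody reads the answer” pipeline, assuming hypothetical callables oracle_guess(address) and try_unlock(address, password); the candidate password never leaves the function, so no human is exposed to the Oracle’s raw output:

```python
def recover_wallet(address, oracle_guess, try_unlock):
    password = oracle_guess(address)          # Oracle output, never displayed
    unlocked = try_unlock(address, password)  # checkable, profitable outcome
    del password                              # discard the raw answer unread
    return unlocked                           # humans only learn success or failure
```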
Several interesting questions appeared in my mind immediately when I saw the post’s title, so I put them here, but maybe I will add more formatting later:
Submission: very-low-bandwidth oracle: Is it theoretically possible to solve AI safety, that is, to create safe superintelligent AI? Yes or no?
Submission: low-bandwidth oracle: Could humans solve AI safety before AI arrives, and with what probability?
Submission: low-bandwidth oracle: Which direction of work on AI safety is the best?
Submission: low-bandwidth oracle: Which direction of work on AI safety is useless?
Submission: low-bandwidth oracle: Which global risk is more important than AI Safety?
Submission: low-bandwidth oracle: Which global risk is neglected?
Submission: low-bandwidth oracle: Will non-aligned AI kill us (probability number)?
Submission: low-bandwidth oracle: Which question should I ask you in order to create Safe AI? (less than 100 words)
Submission: low-bandwidth oracle: What is the most important question that I should ask? (less than 100 words)
Submission: low-bandwidth oracle: Which future direction of work should I choose as the most positively impactful for human wellbeing? (less than 100 words)
Submission: low-bandwidth oracle: Which future direction of work should I choose as the best for my financial wellbeing? (less than 100 words)
Submission: low-bandwidth oracle: How to win this prize? (less than 100 words)
Often, Cryorus allows post-mortem payment. That is, your relatives will pay from your inheritance.
It is normal for human values to evolve. If my values were fixed as they were when I was 6 years old, I would be regarded as mentally ill.
However, there are normal human speeds and directions of value evolution, and there are some ways of value evolution which could be regarded as too quick, too slow, or going in a strange direction. In other words, the acceptable speed and direction of value drift is a normative assumption. For example, I find it normal that a person is fascinated with some philosophical system for years and then just moves to another one. If a person changes his ideology every day, or is fixed in the “correct” one from age 12 until 80, I find that less mentally healthy.
In the same way, I prefer an AI whose goals evolve over millions of years to an AI which evolves in seconds or is fixed forever.
Besides computational resources, one needs experimental data and observations. Sabine Hossenfelder wrote somewhere that to discover new physics we would probably need accelerators with something like 10^15 times the energy of the LHC. Also, to measure changes in the rate of the universe’s accelerating expansion, billions of years are probably needed.
Also, the problem of other ETIs arises. If one civilisation is sleeping in aestivation, another could come to power and eradicate its sleeping seeds. Thus the sleeping seeds should in fact be berserkers, which become active from time to time, check the level of other civilisations (if any), and destroy or upgrade them. It is a rather sinister perspective.
Yes, it is a good point that based on UDT one could ignore BBs.
But it looks like cosmologists try to use the type of experiences that an observer has to deduce whether or not he is a BB. Cosmologists assume that a BB has very random experiences, and that their own current experience is not very random. (I find both claims weak.) They conclude that since their current observations are not random, they are not BBs, and thus BBs are not the dominant type of observer. There are several flaws in this logic, one of them being that a BB can’t reach a coherent conclusion about whether its experiences are random or not.
Also, even if there is one world with a Big Rip and another with a Heat Death and many BBs, SIA favours the heat death world.
However, some BBs could last longer than just one observer-moment, and they could have time to think about the type of experience they have. Such BBs are more likely to find themselves in an empty universe than in one full of stars. Also, an interesting point is that some BBs could appear even in Big Rip scenarios by a process called “nucleation”, where they jump out from behind the cosmological horizon of the accelerating universe.
So, my point is that the idea of the heat death of the universe has acquired more alternatives since the discovery of dark energy, and there are some new arguments against it, like the ones based on BBs. All this is not enough to conclude now which form of the end of the universe is more probable.