This model seems quite different from mine, which is that FAI research is about reducing FAI to an AGI problem, and that solving AGI takes more work than doing this reduction.
More concretely, consider a proposal such as Paul’s reflective automated philosophy method, which might be implementable using episodic reinforcement learning. This proposal has problems, and it’s not clear that it works, but if it did, it would reduce FAI to a reinforcement learning problem. Presumably, any implementation of this proposal would benefit from reinforcement learning advances made in the AGI field.
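To make the shape of such a reduction concrete, here is a minimal sketch in Python. The names and the environment are hypothetical stand-ins (this is not Paul’s actual construction): the idea is only that once the philosophy method is packaged as an episodic environment, any off-the-shelf episodic RL learner can be pointed at it, so RL capability advances translate directly into better implementations.

```python
from abc import ABC, abstractmethod
import random

class EpisodicEnv(ABC):
    """Abstract episodic environment. The 'reduction' consists of
    wrapping the philosophy method so it exposes this interface."""
    @abstractmethod
    def reset(self):
        """Start an episode; return the initial observation."""
    @abstractmethod
    def step(self, action):
        """Take an action; return (observation, reward, done)."""

class ReflectivePhilosophyEnv(EpisodicEnv):
    """Hypothetical stand-in: observations are queries posed by the
    oversight process, rewards are its approval scores. All of the
    proposal's real content would live here; the learner below never
    needs to know about it."""
    def reset(self):
        self.t = 0
        return "initial query"

    def step(self, action):
        self.t += 1
        reward = random.random()   # placeholder approval signal
        done = self.t >= 10        # fixed-length toy episodes
        return f"query {self.t}", reward, done

def run_episode(env, policy):
    """Any episodic RL algorithm slots in here; advances in RL
    capability improve this part without touching the environment."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

print(run_episode(ReflectivePhilosophyEnv(), policy=lambda obs: "respond"))
```

The point of the sketch is just the separation of concerns: the FAI content sits entirely on the environment side, and the capability side is generic RL.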
Of course, even if a proposal like this works, it might require better or different AGI capabilities than UFAI projects need. I expect this to be true for black-box FAI solutions such as Paul’s, and it presents additional strategic difficulties. However, I think the post fails to accurately model these difficulties. The right answer here is to get AGI researchers to develop (and not publish anything about) enough AGI capabilities for FAI, without anyone running a UFAI in the meantime even though the capabilities to run one exist.
Even if this reflective automated philosophy system doesn’t work, it could still be the case that there is a different reduction from FAI to AGI that can be created through armchair technical philosophy. This is often what MIRI’s “unbounded solutions” research is about: finding ways you could solve FAI if you had a hypercomputer. Once you find such a solution, it might be possible to define it in terms of AGI capabilities instead of hypercomputation, and at that point FAI would be reduced to an AGI problem. We haven’t put enough work into this problem to know that a reduction couldn’t be created in, say, 20 years by 20 highly competent mathematician-philosophers.
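As a toy illustration of the “unbounded” part (everything here is a made-up stand-in, not an actual MIRI proposal): an unbounded solution can be as crude as exhaustive search over all policies under a formally specified utility function, which is perfectly well-defined but only runnable on a hypercomputer. The reduction step would then replace the exhaustive search with some capability a realistic AGI actually has.

```python
from itertools import product

def utility(policy):
    """Stand-in scoring function. In a real unbounded proposal this
    would be the formally specified utility over outcomes, which is
    where the hard philosophical work lives."""
    # Toy objective: reward policies that alternate between actions.
    return sum(1 for a, b in zip(policy, policy[1:]) if a != b)

def unbounded_argmax(horizon):
    """Exhaustive search over all 2**horizon action sequences. In the
    unbounded version the policy space is infinite, so this only
    'runs' on a hypercomputer. Reducing FAI to an AGI problem means
    replacing this loop with tractable optimization while preserving
    the guarantee that `utility` is what gets optimized."""
    return max(product((0, 1), repeat=horizon), key=utility)

print(unbounded_argmax(8))  # feasible only because the horizon is tiny
```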
In the most pessimistic case (which I don’t think is too likely), the task of reducing FAI to an AGI problem is significantly harder than creating AGI. In this case, the model in the post seems mostly accurate, except that it neglects the possibility that serial advances are important (so we get diminishing marginal progress towards FAI or AGI per additional researcher in a given year).
Transcript:
Question: Are you as afraid of artificial intelligence as your PayPal colleague Elon Musk?
Thiel: I’m super pro-technology in all its forms. I do think that if AI happened, it would be a very strange thing. Generalized artificial intelligence. People always frame it as an economic question, it’ll take people’s jobs, it’ll replace people’s jobs, but I think it’s much more of a political question. It would be like aliens landing on this planet, and the first question we ask wouldn’t be what does this mean for the economy, it would be are they friendly, are they unfriendly? And so I do think the development of AI would be very strange. For a whole set of reasons, I think it’s unlikely to happen any time soon, so I don’t worry about it as much, but it’s one of these tail risk things, and it’s probably the one area of technology that I think would be worrisome, because I don’t think we have a clue as to how to make it friendly or not.