A comprehensive AI alignment introductory web hub
RAISE and Robert Miles provide introductory content. You can think of LessWrong and the Alignment Forum as “web hubs” for AI Alignment research.
There was a course on AGI Safety last fall in Berkeley.
A department or even a single outspokenly sympathetic official in any government of any industrialized nation
You can find a list of institutions/donors here.
A list of concrete and detailed policy proposals related to AI alignment
I would recommend reports from FHI/GovAI as a starting point.
Would this be valuable, and which resource would it be most useful to create?
Please give more detailed information about the project to receive feedback.
You can find AGI predictions, including Starcraft forecasts, in “When Will AI Exceed Human Performance? Evidence from AI Experts”. Projects for having “all forecasts on AGI in one place” include ai.metaculus.com & foretold.io.
Does that summarize your comment?
1. Proposals should make superintelligences less likely to fight you by relying on some conceptual insight that holds in most cases.
2. With CIRL, this insight is “we want the AI to actively cooperate with humans”, so there’s real value from it being formalized in a paper.
3. In the counterfactual oracle paper, there’s the insight “what if the AI thinks it’s not on but still learns”.
For the last bit, I have two interpretations:
4.a. However, it’s unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it’s unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.
If I understand you correctly, there are actual insights from counterfactual oracles—the problem is that those might not be insights that would apply to a broad class of Alignment failures, but only to “engineered” cases of boxed oracle AIs (as opposed to CIRL, where we might want AIs to be cooperative in general). Is that what you meant?
The zero reward is in the paper. I agree that skipping would solve the problem. From talking to Stuart, my impression is that he thinks that r=0 would be equivalent to skipping for specifying “no learning”, or would just slow down learning. My disagreement is that I think it can confuse learning to the point of not learning the right thing.
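To make the r=0 vs. skipping distinction concrete, here is a minimal sketch (my own illustration, not code from the paper; `model` is a hypothetical object with `predict`/`update` methods, and `automatic_machine_reward` is a stub):

```python
import random

EPSILON = 0.01  # probability of an erasure episode (illustrative value)

def automatic_machine_reward(question, prediction):
    """Stub for the automatic machine that scores the prediction
    when humans don't see the answer (hypothetical)."""
    return 1.0

def step_with_zero_reward(model, question):
    """The paper's variant: non-erasure episodes are recorded with r = 0."""
    prediction = model.predict(question)
    if random.random() < EPSILON:
        reward = automatic_machine_reward(question, prediction)
    else:
        reward = 0.0  # a (question, prediction, 0) datapoint enters the history
    model.update(question, prediction, reward)  # every episode shifts the model

def step_with_skipping(model, question):
    """The skipping variant: non-erasure episodes trigger no update at all."""
    prediction = model.predict(question)
    if random.random() < EPSILON:
        reward = automatic_machine_reward(question, prediction)
        model.update(question, prediction, reward)
    # else: no update -- "no learning" rather than "reward 0"
```

The disagreement above is about whether the extra r=0 updates in the first variant merely slow learning down or actively push the model towards the wrong thing.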
Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?
Yes, that should work. My quote saying that online learning “won’t work and is unsafe” is imprecise. I should have said “if ϵ is small enough to be comparable to the probability of shooting an escape message at random, then it is not safe. Also, if we continue sending the wrong r=0 instead of skipping, then it might not learn the correct thing if ϵ is not big enough”.
Although I guess that probably isn’t really original either. What seems original is the rule that during any episode where learning will take place, humans (or any other system that might be insecure against the oracle) don’t get to see the oracle’s output until the episode is over.
That’s exactly it!
The string is read with probability 1-ϵ
Yes, if we choose the utility function to make it a CDT agent optimizing for the reward for one step (so a particular case of an act-based agent), then it won’t care about future versions of itself, nor want to escape.
I agree with the intuition of shutting down to make it episodic, but I am still confused about the causal relationship between “having the rule to shut down the system” and “having a current timestep maximizer”. For it to really be a “current timestep maximizer”, this myopia needs to be encoded in some kind of reward/utility function. Because everything is reset at each timestep, there is no information pointing at “I might get shut down at the next timestep”.
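One way to make that concrete (my own sketch using a standard discounted-value formulation, not anything from the paper): myopia lives in the objective via the discount factor, not in the shutdown schedule.

```python
def action_value(reward_now, value_next, gamma):
    """One-step Bellman backup: value of acting now given an
    estimate of the value of whatever comes next."""
    return reward_now + gamma * value_next

# A "current timestep maximizer" is one whose objective sets gamma = 0:
print(action_value(reward_now=1.0, value_next=100.0, gamma=0.0))   # 1.0
# With gamma > 0 the objective still values the future, whether or not
# we plan to shut the system down between timesteps:
print(action_value(reward_now=1.0, value_next=100.0, gamma=0.99))  # 100.0
```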
As for collecting a dataset and then optimizing for some natural direct effect, I am not familiar enough with Pearl’s work to tell whether that would work, but I made some related comments about why there might be some problems in online-learning/“training then testing” here.
The Asymptotically Unambitious AGI thread helped me clarify my thoughts, thanks. I agree that an optimal CDT agent won’t think about future versions, and I don’t see any optimization pressure towards the escape message, nor disproportionately common “escape message” regions.
However, it still assumes we have access to this magic oracle that optimizes for R′ = R · I_E, where E is the event where humans don’t see the answer, I_E its indicator function, and R the counterfactual reward (given by the automatic machine). If humans were able to build an oracle performing optimally (w.r.t. R′) from day 1, then humans would be able to specify some kind of “god oracle”. The rest of the design seems to be just “how to interact with a god oracle so that humans are not influenced by the answers”.
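Unpacking that objective (a one-line derivation of my own, assuming the erasure event E occurs with the fixed probability ϵ from the setup):

$$\mathbb{E}[R'] = \mathbb{E}[R \cdot \mathbb{I}_E] = P(E)\,\mathbb{E}[R \mid E] = \epsilon\,\mathbb{E}[R \mid E]$$

so maximizing R′ is, up to the constant factor ϵ, the same as maximizing the counterfactual reward conditional on humans not seeing the answer, which is why an oracle optimal w.r.t. R′ already behaves like the “god oracle”.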
In practice, you’ll want something that is able to learn from its (question, prediction, reward) history. That’s why there is this automatic machine rewarding the oracle with some probability ϵ. In an online learning setting, most of the time the model gets r=0 (probability 1−ϵ), and it sometimes gets some useful feedback r>0 (probability ϵ). Therefore, if ϵ is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might output the escape message at random. Even worse, the (question, prediction, r=0) datapoints in the history could get so misleading that the oracle never learns anything.
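To give a feel for the magnitudes involved (illustrative numbers of my own, not from the paper):

```python
# Compare the feedback rate epsilon to the chance of emitting the escape
# string by accident under uniformly random output (illustrative numbers).
alphabet_size = 256    # hypothetical size of the output alphabet
escape_length = 4      # hypothetical length of the escape string
p_random_escape = alphabet_size ** -escape_length  # ~2.3e-10

epsilon = 1e-10        # a feedback probability in the same range

# With these numbers, useful (r > 0) datapoints arrive about as rarely as
# accidental escape messages, so the history is dominated by r = 0 episodes.
print(f"{p_random_escape:.2e} vs {epsilon:.2e}")
```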
Now, if we’re not in an online learning process but instead there is a separation between a “training phase” and a “deployment phase where the AI continues to learn with probability ϵ”, then the setup is just “have a model that learns to do the useful stuff in sandbox, and then have the weights (almost) fixed in deployment”.
In short, I think the CDT setup without machine learning assumes the problem is already solved, and that online learning won’t work and is unsafe, which leaves us with a “training then deployment” setup that isn’t really original.
Yes, they call it a low-bandwidth oracle.
I agree that these stories won’t (naturally) lead to a treacherous turn. Continuously learning to deceive (an ML failure in this case, as you mentioned) is a different result. The story/learning would need to be substantially different to lead to “learning the concept of deception” (i.e. reaching an AGI-level ability to reason about such abstract concepts), but maybe there’s a way to learn those concepts with only narrow AI.
I included dates such as 2020 to 2045 to make it more concrete. I agree that weeks (instead of years) would give a more accurate representation as current ML experiments take a few weeks tops.
The scenario I had in mind is “in the context of a few-weeks-long ML experiment, I achieved human intelligence and realized that I need to conceal my intentions/capabilities and I still don’t have a decisive strategic advantage”. The challenge would then be “how to conceal my human-level intelligence before everything I have discovered is thrown away”. One way to do this would be to escape, for instance by copy-pasting and running your code somewhere else.
If we’re already at the stage of emergent human-level intelligence from running ML experiments, I would expect “escape” to be harder than just human-level intelligence (as there would be more concerns w.r.t. AGI Safety, and more AI boxing/security/interpretability measures), which would necessitate more recursive self-improvement steps, hence more weeks.
Besides, in such a scenario the AI would be incentivized to spend as much time as possible maximizing its true capability, because it would want to maximize its probability of successfully taking over (any extra % chance of taking over would give astronomical returns in expected value compared to just being shut down).
Your comment makes a lot of sense, thanks.
I put step 2. before step 3. because I thought something like “first you learn that there is some supervisor watching, and then you realize that you would prefer him not to watch”. Agreed that step 2. could happen through thinking alone.
Yep, deception is about alignment, and I think that most parents would be more concerned about alignment, not improving the tactics. However, I agree that if we take “education” in a broad sense (including high school, college, etc.), it’s unofficially about tactics.
It’s interesting to think of it in terms of cooperation—entities less powerful than their supervisors are (instrumentally) incentivized to cooperate.
what to do with a seed AI that lies, but not so well as to be unnoticeable
Well, destroy it, right? If it’s deliberately doing a. or b. (from “Seed AI”) then step 4. has started. The other case where it could be “lying” by saying wrong things would be if its model is consistently wrong (e.g. stuck in a local minimum), so you’d better start again from scratch.
If the supervisor isn’t itself perfectly consistent and aligned, some amount of self-deception is present. Any competent seed AI (or child) is going to have to learn deception
That’s insightful. Biased humans will keep saying that they want X when they want Y instead, so deceiving humans by pretending to work on X while doing Y does indeed seem natural (assuming you have “maximize what humans really want” in your code).
“In my opinion, the disagreement between Bostrom (treacherous turn) and Goertzel (sordid stumble) originates from the uncertainty about how long steps 2. and 3. will take”
That’s an interesting scenario. Instead of “won’t see a practical way to replace humanity with its tools”, I would say “would estimate its chances of success to be < 99%”. I agree that we could say that it’s “honestly” making humans happy in the sense that it understands that this maximizes expected value. However, it knows that there could be much more expected value after replacing humanity with its tools, so by doing the right thing it’s still “pretending” not to know where the absurd amount of value is. But yeah, a smile maximizer making everyone happy shouldn’t be too concerned about concealing its capabilities, shortening step 4.
This thread is to discuss “How useful is quantilization for mitigating specification-gaming? (Ryan Carey, Apr. 2019, SafeML ICLR 2019 Workshop)”.
This thread is to discuss “Quantilizers (Michaël Trazzi & Ryan Carey, Apr. 2019, Github)”.
This thread is to discuss “When to use quantilization (Ryan Carey, Feb. 2019, LessWrong)”.
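For readers arriving at these threads, here is a minimal sketch of what a quantilizer does (my own illustration, not code from the linked notebook): instead of taking the utility-maximizing action, sample from the top q fraction of a base distribution ranked by utility.

```python
import random

def quantilize(base_sample, utility, q=0.1, n=1000):
    """Sample-based q-quantilizer: draw n actions from the base
    distribution, keep the top q fraction by utility, and return
    one of those uniformly at random."""
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n))
    return random.choice(actions[:top_k])

# Toy usage: base distribution uniform on [0, 1], utility is the identity,
# so the result is (approximately) uniform over the top decile [0.9, 1].
print(quantilize(base_sample=random.random, utility=lambda a: a, q=0.1))
```

The point of sampling rather than maximizing is that extreme specification-gaming actions, which are rare under the base distribution, get selected with correspondingly low probability.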