Aaron_Scher

Karma: 375

Aaron_Scher 10 Apr 2022 4:46 UTC
1 point
on: Playing with DALL·E 2
This is awesome! I feel weird asking you to plug prompts into the machine. I wonder how it does with logo design, something like “the logo for a new longtermist startup”? Not using for commercial purposes; just curious.
Also curious about some particular word play ala Marry Poppins: “a cat drawing the curtains”

Aaron_Scher 12 Oct 2022 1:05 UTC
1 point
0
in reply to: joebiden’s comment on: Possible miracles
Evolution isn’t an agent attempting to align humans, or even a concrete active force acting on humans, instead it is merely the effect of a repeatedly applied filter
My understanding of deep learning is that training is also roughly the repeated application of a filter. The filter is some loss function (or, potentially the LLM evaluators like you suggest) which repeatedly selects for a set of model weights that perform well according to that function, similar to how natural selection selects for individuals who are relatively fit. Humans designing ML systems can be careful about how to craft our loss functions, rather than arbitrary environmental factors determining what “fitness” means, but this does not guarantee that the models produced by this process actually do what we want. See inner misalignment for why models might not do what we want even if we put real effort into trying to get them to.
Even working in the analogy you propose, we have problems. Parents raising their kids often fail to instill important ideas they want to (many kids raised in extremely religious households later convert away).

Aaron_Scher 12 Oct 2022 1:36 UTC
4 points
3
on: Possible miracles
Another miracle type thing:
- ~everybody making progress in capabilities research realizes we have major safety problems and research directions pivot toward this and away from speeding up capabilities. There is a major coordination effort and international regulations that differentially benefit safety. This might happen without some public accident, especially via major community building efforts and outreach. This looks like major wins in AI Governance and AIS field-building. This pushes back timelines a bit, and I think if we act fast this might be enough time to shift the gameboard (if capabilities progress becomes globally taboo when there is at least a year of serial capabilities work required before AGI, I would be much more optimistic than I am).
This might look something like a large social movement (that somehow works), but it doesn’t need to be widespread to actually change things (if the top 1000 ML researchers were all very worried about AI existential risk, we would be like half-way there, I think, because research output is heavy tailed).

Aaron_Scher 28 Oct 2022 4:41 UTC
9 points
3
on: What does it take to defend the world against out-of-control AGIs?
At a high level, I might summarize they key claims of this post as “It seems like the world today is quite far from being secure against a misaligned AGI. Even if we had a good AGI helping, the steps that would need to be taken to get to a secure state are very unlikely to happen for a variety of reasons: we don’t totally trust the good AGI so we won’t give it tons of free reign (and it would likely need free reign in order to harden every major military / cloud company / etc.), the good AGI is limited because it is being good and thus not doing bold sometimes illegal things (which would need to be done), the people deploying the good AGI follow social norms and usually act in accordance with laws and political norms, and there is likely a default favoring to offense over defense. Contingent on survival, the most likely way to get there is an AGI does outside-the-Overton-window things without humans approving every step it takes, so technical research should tentatively aim to create AGI systems that have good motivations but do not rely on constant human oversight.”
I broadly agree with these ideas; I think this picture is both gloomy and the general shape of it seems correct. Let me know if I’m off on something; there are, of course, important details I didn’t include.

Aaron_Scher 7 Nov 2022 22:14 UTC
4 points
0
on: Instead of technical research, more people should focus on buying time
I’m excited to see the next post in this sequence. I think the main counterargument, as has been pointed out by Habryka, Kulveit, and others, is that the graph at the beginning is not at all representative of the downsides from poorly executed buying-time interventions. TAO note the potential for downsides, but it’s not clear how they think about the EV of buying-time interventions, given these downsides.
This post has the problem of trying to give advice to a broad audience; among readers of this post, some should be doing buying-time work, some should do alignment research, some should fill in support roles, and some should do less-direct things (e.g., building capacities so that you may be useful later, or earning to give; not a complete list of things people should be doing). I suspect most folks agree that [people trying to make AI go well] should be taking different approaches; duh, comparative advantage is a thing.
Given that people should be doing different things, attempts at community coordination should aim to be relatively specific about who should do what; I feel this post does not add useful insight on the question of who should be doing buying-time interventions. I agree with the post’s thrust that buying-time interventions is important^[1] and that there should be more of it happening, but I think this contribution doesn’t do much on its own. TAO should try to build an understanding of why people with various skills should pursue particular paths. Pointing in the direction of “we think more people should be going into “buying time” interventions” is different from directing the right individual people in that direction, and broad pointing will plausibly lead to a misallocation here.
This all said, I’m excited about the rest of the sequence, which I hope will better flesh out these ideas, including discussion of who is a good fit for time-buying interventions (building on CAIS’s related thoughts, which I would link but temporarily can’t find). I think this post will fit well into a whole sequence, insofar as it makes the point that time-buying interventions are particularly impactful, and other posts can address these other key factors (what interventions, who’s a good fit, existing projects in the space, and more).
1. ^
  and the idea around buying end-time being particularly valuable is a useful contribution

Aaron_Scher 8 Nov 2022 18:45 UTC
LW: 3 AF: 1
0
AF
on: “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability
Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easy to bring in functional parts to fix the engine and make the car drive safely, compared to it being hard to take a functional elephant and tweak it to be safe. In a follow up post, the author clarifies that this could be thought of as engineering (well-founded AI) vs. reverse engineering (interpretability). One pushback form John Wentworth is that we currently do not know how to build the car, or how the basic chemistry in the engine actually works; we do interpretability research in order to understand these processes better.Ryan Greenblatt pushes back that the post is more accurate if the word “interpretability” was replaced with “microscope AI” or “comprehensive reverse engineering”; this is because we do not need to understand every part of complex models in order to tell if they are deceiving us, so the level our interpretability understanding needs to be to be useful is lower than the level it needs to be for us to build the car from the ground up.Neel Nanda writes a similar comment about how, to him, high tractability is a lower bar, much lower than understanding every part of a system such that we could build it.

Aaron_Scher 12 Nov 2022 0:41 UTC
3 points
3
on: All AGI Safety questions welcome (especially basic ones) [~monthly thread]
Recently, AGISF has revised its syllabus and moved Risks form Learned Optimization to a recommended reading, replacing it with Goal Misgeneralization. I think this change is wrong, but I don’t know why they did it and Chesteron’s Fence.
Does anybody have any ideas for why they did this?
Are Goal Misgeneralization and Inner-Misalignment describing the same phenomenon?
What’s the best existing critique of Risks from Learned Optimization? (besides it being probably worse than Rob Miles pedagogically, IMO)

Aaron_Scher 29 Nov 2022 1:36 UTC
1 point
0
on: Discussing how to align Transformative AI if it’s developed very soon
AIs with orthogonal goals. If the AIs’ goals are very different to each other, being willing to forgo immediate rewards is less likely.
Seems worth linking to this post for discussion of ways to limit collusion. I would also point to this relevant comment thread. It seems to me that orthogonal goals is not what we want, as agents with orthogonal goals can cooperate pretty easily to do actions that are a combination of favorable and neutral according to both of them. Instead, we would want agents with exact opposite goals, if such a thing is possible to design. It seems possible to me that in the nearcast framing we might get agents for which this collusion is not a problem, especially by using the other techniques discussed in this post and Drexler’s, but giving our AIs orthogonal goals seems unlikely to help in dangerous situations.

Aaron_Scher 8 Dec 2022 22:52 UTC
3 points
0
on: A challenge for AGI organizations, and a challenge for readers
My (very amateur and probably very dumb) response to this challenge:
tldr: RLHF doesn’t actually get the AI to have the goals we want it to. Using AI assistants to help with oversight is very unlikely to help us avoid deception (typo)~~detection~~ in very intelligent systems (which is where deception matters), but it will help somewhat in making our systems look aligned and making them somewhat more aligned. Eventually, our models become very capable and do inner-optimization aimed at goals other than “good human values”. We don’t know that we have misaligned mesa-optimizers, and we continue using them to do oversight on yet more capable models with the same problems, and then there’s a treacherous turn and we die.
These are first pass thoughts on why I expect the OpenAI Alignment Team’s plan to fail. I was surprised at how hard this was to write, it took like 3 hours including reading. It is probably quite bad and not worth most readers’ time.
Summary of their plan
The plan starts with training AIs using human feedback (training LLMs using RLHF) to produce outputs that are in line with human intent, truthful, fair, and don’t produce dangerous outputs. Then, they’ll use their AI models to help with human evaluation, solving the scalable oversight problem by using techniques like Recursive Reward Modeling, Debate, and Iterative Amplification. The main idea here is using large language models to assist humans who are providing oversight to other AI systems, and the assistance allows humans to do better oversight. The third pillar of the approach is training AI systems to do alignment research, which is not feasible yet but the authors are hopeful that they will be able to do it in the future. Key parts of the third pillar are that it is easier to evaluate alignment research than to produce it, that to do human-level alignment research you need only be human-level in some domains, and that language models are convenient due to being “preloaded” with information and not being independent agents. Limitations include that the use of AI assistants might amplify subtle inconsistencies, biases, or vulnerabilities, and that the least capable models that could be used for useful alignment research may themselves be too dangerous if not properly aligned.
Response
A key claim is that we can use RLHF to train models which are sufficiently aligned such that they themselves can be useful to assist human overseers providing training signal in the training of yet more powerful models, and we can scale up this process. The authors mention in their limitations how subtle issues with the AI assistants may scale up in this process. Similarly, small ways in which AI assistants are misaligned with their human operators are unlikely to go away. The first LLMs you are using are quite misaligned in the sense that they are not trying to do what the operator wants them to do; in fact, they aren’t really trying to do much; they have been trained in a way that their weights lead to low loss on the training distribution, as in you might say they “try” to predict likely next words in text based on internet text, though they are not internally doing search. When you slap RLHF on top of this, you are applying a training procedure which modifies the weights such that the model is “trying” to produce outputs which look good to a human overseer; the system is aiming at a different goal than it was before. The goal of producing outputs which look good to humans is still not actually what we want, however, as this would lead to giving humans false information which they believe to be true, or otherwise outputs which look good but are misleading or incorrect. Furthermore, the strategy of RLHF is not going to create models which are robustly learning the goals we want; for instance you can see how the Jailbreaking of ChatGPT uses out of training-distribution prompts to elicit outputs we had thought we trained out. Using RLHF doesn’t robustly teach the goals we want it to; we don’t currently have methods of robustly teaching the goals we want to. There’s some claim here about the limit, where if you provided an absolutely obscene amount of training examples, you could get a model which robustly has the right objectives; it’s unclear to me if this would work, but it looks something like starting with very simple models and applying tons of training to try to align their objectives, and then scaling up; at the current rate we seem to be scaling up capabilities far too quickly in relation to the amount of alignment-focused training. The authors agree with the general claim “We don’t expect RL from human feedback to be sufficient to align AGI”
The second part of the OpenAI Alignment Team’s plan is to use their LLMs to assist with this oversight problem by allowing humans to do a better job evaluating the output of models. The key assumption here is that, even though our LLMs won’t be perfectly aligned, they will be good enough that they can help with research. We should expect their safety and alignment properties to fall apart when these systems become very intelligent, as they will have complex deception available to them.
What this actually looks like is that OpenAI continues what they’re doing for months-to-years, and they are able to produce more intelligent models and the alignment properties of these models seem to be getting better and better, as measured by the fact that adversarial inputs which trip up the model are harder to find, even with AI assistance. Eventually we have language models which are doing internal optimization to get low loss, invoking algorithms which do quite well at next token prediction, in accordance with the abstract rules learned by RLHF. From the outside, it looks like our models are really capable and quite aligned. What has gone on under the hood is that our models are mesa-optimizers which are very likely to be misaligned. We don’t know this and we continue to deploy these models in the way we have been, as overseers for the training of more powerful models. The same problem keeps arising, where our powerful models are doing internal search in accordance with some goal which is not “all the complicated human values” and is probably highly correlated with “produce outputs which are a combination of good next-token-prediction and score well according to the humans overseeing this training”. Importantly, this mesa-objective is not something which, if strongly optimized, is good for humans; values come apart in the extremes; most configurations of atoms which satisfy fairly simple objectives are quite bad by my lights.
Eventually, at sufficiently high levels of capabilities, we see some treacherous turn from our misaligned mesa-optimizers which are able to cooperate which each other; GG humans. Maybe we don’t get to this point because, first, there are some major failures or warning shots which get decision makers in key labs and governments to realize this plan isn’t working; idk I wouldn’t bet on warning shots being taken seriously and well.
The third pillar is a hope that we can use our AIs to do useful alignment research before they (reach a capabilities point where they) develop deceptively aligned mesa-objectives. I feel least confident about this third pillar, but my rough guess is that the Alignment-researching-AIs will not be very effective at solving the hard parts of alignment around deception, but they might help us e.g., develop new techniques for oversight. I think this because deception research seems quite hard, and being able to do it probably requires being able to reason about other minds in a pretty complex way, such that if you can do this then you can also reason about your own training process and become deceptively-aligned. I will happily be proved wrong by the universe, and this is probably the thing I am least confident about.

Aaron_Scher 10 Dec 2022 23:53 UTC
1 point
0
on: Aligned Behavior is not Evidence of Alignment Past a Certain Level of Intelligence
We can also ask about the prior probability $P (A | S)$ . Since there is such a huge space of possible utility functions that superintelligent agent could be aligned with, and since the correct utility function is likely weird and particular to humans, this value must be very small. How small exactly I do not know, but I am confident it is less than $10^{- 6}$
I think this might be too low given a more realistic training process. Specifically, this is one way the future might go: We train models with gradient descent. Said models develop proxy objectives which are correlated with the base objective used in training. They become deceptively aligned etc. Importantly, the proxy objective they developed is correlated with the base objective, which is hopefully correlated with human values. I don’t think this gets you above a ¹⁄₁₀ chance of the model’s objective being good-by-human-lights, but it seems like it could be higher than 10^-6, with the right training setup. Realistically speaking, we’re (hopefully) not just instantiating superintelligences with random utility functions.
I think a crux of sorts is what it means for the universe if a superintelligent AI has a utility function that is closely correlated but not identical to humans’. I suspect this is a pretty bad universe.

Aaron_Scher 11 Dec 2022 1:59 UTC
LW: 2 AF: 1
0
AF
in reply to: VojtaKovarik’s comment on: A challenge for AGI organizations, and a challenge for readers
(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.
The use of AI assistants for alignment : capabilities doesn’t have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.

Aaron_Scher 12 Dec 2022 20:09 UTC
LW: 1 AF: 1
0
AF
in reply to: VojtaKovarik’s comment on: A challenge for AGI organizations, and a challenge for readers
Makes sense. FWIW, based on Jan’s comments I think the main/only thing the OpenAI alignment team is aiming for here is i, differentially speeding up alignment research. It doesn’t seem like Jan believes in this plan; personally I don’t believe in this plan.
4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we’ll see limited alignment research acceleration before ML research automation happens.
I don’t know how to link to the specific comment, but here somewhere. Also:
We can focus on tasks differentially useful to alignment research
Your pessimism about iii still seems a bit off to me. I agree that if you were coordinating well between all the actors than yeah you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like “use LLMs to help with alignment research or get left behind when ML research gets automated”. If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn’t expect it to be particularly good. The option “everybody agrees to not build AI assistants and we can do alignment research first” is maybe not on the table, or at least it probably doesn’t feel like it is to the alignment team at OpenAI.

Aaron_Scher 20 Dec 2022 21:57 UTC
3 points
0
in reply to: Lao Mein’s comment on: Take 9: No, RLHF/IDA/debate doesn’t solve outer alignment.
You might look here for more info: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem

Aaron_Scher 21 Dec 2022 2:35 UTC
1 point
0
in reply to: Dalcy’s comment on: Trying to disambiguate different questions about whether RLHF is “good”
My interpretation here is that, compared to prompt engineering, RLHF elicits a larger % of the model’s capabilities, which makes it safer because the gap between “capability you can elicit” and “underlying capability capacity” is smaller.
For instance, maybe we want to demonstrate a model has excellent hacking capabilities; and say this corresponds to a performance of 70. We pre-train a model which has performance of 50 without anything special, but the pre-trained model has a lot of potential, and we think it has a max capability of 100. Using RLHF boosts performance by 40% from pre-trained, so it could get our model’s performance up to 70, but using just prompt engineering isn’t so effective, it only boosts our model’s capabilities by 20%, up to 60. So, if we want to elicit the performance level 70 using prompt engineering, we need a model that is more powerful at baseline (and thus has a more powerful potential). Math: 1.2 x = 70, so x = 58.3 is the model’s pre-trained performance, and such a model has a max capability of 58.3 * 2 = 116.7. So in this example, in order to get the desired behavior, we need a less powerful model with RLHF or a more powerful model with prompt engineering.
Why it might be good to have models with lower potential capabilities:
- we might be worried about sudden jumps in capability from being able to better utilize pre-trained models. If “ability we are able to elicit” is only slightly below “maximum possible ability” then this is less of a problem.
- we might be worried about a treacherous turn in which an AI is able to utilize almost all of its potential capabilities; we would be taken by surprise and in trouble if this included capabilities much greater than what we have (or even what we know exist).
- accidents will be less bad if our model has overall lower capability potential
- [edited to add] This list is not meant to be comprehensive, but here’s an important one: we might expect models with capabilities above some threshold to be deceptive and capable of a treacherous turn. In some sense, we want to get close to this line without crossing it, in order to take actions which protect the world from misaligned AIs. In the above example, we could imagine that getting a max capability above 105 results in a deceptive model which can execute a treacherous turn. Using RLHF on the 100-max model allows us to get the behavior we wanted, but if we are using prompt engineering than we would need to make a dangerous model to get the desired behavior.

Aaron_Scher 30 Dec 2022 4:04 UTC
3 points
2
in reply to: Mau’s comment on: The case against AI alignment
I first want to signal-boost Mauricio’s comment.

My experience reading the post was that I kinda nodded along without explicitly identifying and interrogating cruxes. I’m glad that Mauricio has pointed out the crux of “how likely is human civilization to value suffering/torture”. Another crux is “assuming some expectation about how much humans value suffering, how likely are we to get a world with lots of suffering, assuming aligned ai”, another is “who is in control if we get aligned AI”, another is “how good is the good that could come from aligned ai and how likely is it”.

In effect this post seems to argue “because humans have a history of producing lots of suffering, getting an AI aligned to human intent would produce an immense amount of suffering, so much that rolling the dice is worse than extinction with certainty”

It matters what the probabilities are, and it matters what the goods and bars are, but this post doesn’t seem to argue very convincingly that extremely-bads are all that likely (see Mauricio’s bullet points).

Aaron_Scher 25 Jan 2023 8:29 UTC
2 points
0
on: “Endgame safety” for AGI
Here’s an idea that is in it’s infancy which seems related (at least my version of it is, others may have fleshed it out, and links are appreciated). This is not written particularly well and it is speculative:
Say I believe that language models will accelerate research in the lead-up to AGI. (likely assumption)
Say I think that AI systems will be able to automate most of the research process before we get AGI (though at this point we might stop and consider if we’re moving the goalpost). This seems to be an assumption in OpenAI’s alignment plan, though I think it’s unclear how AGI fits into their timetable. (less likely assumption, still plausible)
Given the above, our chances of getting a successfully aligned AGI may depend on “what proportion of automated AI scientists are working on AI Alignment vs. work that moves us closer to AGI?” There’s some extreme version of this where the year before AGI has more scientific research than all of human history, or something wild.
Defining “Endgame” in this scenario seems hard, but maybe Endgame is “AIs can do 50% of the (2022) research work”. It seems likely that in this scenario, those who care about AI Safety might reasonably want to start Endgame sooner if they think it increases the likelihood of enough automated Alignment research happening.
For example: In the current climate, maybe 3 actors have access to the most powerful models (OpenAI, DeepMind, Anthropic), and given that Anthropic is largely focused on AI Safety, we might naively expect that the current climate would have ~⅓ of the automated research be AI Safety if Endgame started tomorrow (depending on the beliefs of decision makers at OpenAI and DeepMind it could be much higher, and in actuality given the current world I think it would be higher). But, say we wait 2 years and then Endgame starts, now there’s 2 other big tech labs and 3 startups who have similarly powerful models (and don’t care much about Safety), and the fraction of automated research which is Alignment is only ~⅛.
In the scenario I’ve outlined above, it is not obvious that Endgame coming sooner is worse. There’s quite a few assumptions that I think are doing a lot of work for this model:
- These pre-AGI AIs can do useful alignment research, or significantly speed it up.
  - I think this one of the bigger cruxes here, where it seems like coming up with useful, novel, theoretical insights to solve for instance deception is just really hard and will require models to be incredibly smart. Folks will disagree here about which alignment research is most important, and some of this will be affected by motivated reasoning around what kind of research you can automate easily (I think Jan Leike’s plan here worries me because it feels a lot like “we need to be able to automate alignment research just as effectively as ML research so that we can ask for more compute for alignment” and this seems likely to lead to an automating of easy-to-automate-but-not-central alignment research; it worries me because I expect the plan to fail; see also Nate Soares).
- Research done by AIs in the years leading up to AGI will represent a large chunk of the alignment research that happens, such that it is worth it to lose some human-research-time in order to make this shift come sooner (maybe for this argument to go through it only needs to be the case that the ratio of (good) Alignment/Capabilities research is higher once research is automated).
- These pre-AGI AIs are sufficiently aligned. We have a fair amount of confident in our language models’ ability to help with research without causing catastrophic failures. i.e., they are either not trying to overthrow humanity, or they are unable to do so, including through research outputs like plans for future AI systems.
- There is some action folks in the AI Safety space could take which would make the move to this Endgame happen sooner. Examples might include: Anthropic scaling language models (and trying to keep them private), developing AI tools to speed up research (Elicit),
- List goes on, but I’m tired

Aaron_Scher 1 Feb 2023 22:37 UTC
3 points
0
on: Criticism of the main framework in AI alignment
7. Recall that alignment research motivated by the above points makes it easier to design AI that is controllable and whose goals are aligned with its designers’ goals.
As a consequence, bad actors might have an easier time using powerfull controllable AI to achieve their goals. (From 4 and 6)
8. Thus, even though AI alignment research improves the expected value of futures caused by uncontrolled AI, it reduces the expected value of futures caused by bad human actors using controlled AI to achieve their ends. (From 5 and 7)
This conclusion will seem more, or less, relevant depending on the beliefs you have about its different components.
It sounds to me like the claim you are making here is “the current AI Alignment paradigm might have a major hole, but also this hole might not be real”. But then the thrust of your post is something like “I am going to work on filling this hole”. You invoke epistemic and moral uncertainty in a somewhat handwavy way which leaves me skeptical. It’s not clear to me what you believe, so it is hard for me to productively disagree or provide useful feedback. Assuming you are going to spend many hours working on this research direction, I think it’s worth spending a few hours on determining if this proposed problem is in fact a problem, including making some personal guesses about the value of various futures (maybe you’ve already done this privately).
You later write:
I do not claim that research towards building the independent AI thinkers of 2.1 above is the most effective AI alignment research intervention, nor that it is the most effective intervention for moral progress. I’ve only presented a problem of the main framework in AI alignment, and proposed an alternative that aims to avoid that problem
To me, it’s not obvious that the thing you presented is actually a problem. My quick thoughts: extinction is quite bad, some types of galaxy-spanning civilizations are far worse than extinction, but many are better, including some of the ones I think would be created by current “bad actors”.
I’m furthermore unsure why the solution to this proposed problem is to try and design AIs to make moral progress; this seems possible but not obvious. One problem with bad actors is that they often don’t base their actions on what the philosophers think is good (e.g., dictators don’t seem concerned with this). On the other hand, perhaps the “bad actors” you are targeting are average Americans who eat a dozen farmed animals per year, and these are the values you are most worried about. Insofar as you want to avoid filling the universe with factory farming, you might want to investigate current approaches in Moral Progress or moral circle expansion; I suspect an AI approach to this problem won’t help much. There’s a similar story for impact here that looks like “get the Great Reflection started earlier” which I am more optimistic about, but I suspect to fail for other reasons. Not sure if this paragraph made sense; I’m gesturing at the fact that the “bad actors” you are targeting will affect what research directions to pursue, and for the main class of bad actors that comes to mind with that word, moral progress seems unlikely to help.

Aaron_Scher 13 Feb 2023 21:31 UTC
1 point
0
in reply to: Linda Linsefors’s comment on: Gradient hacking: definitions and examples
For example: If you first condition an animal to expect A to be followed by C, and then exposes them to A+B followed by C, they will not learn to associate B with C. This is a well replicated result, and the textbook explanation (which I believe) is that no learning occurs because C is already explained by A (i.e. there is no surprise).
Can you provide a citation? I don’t think this is true. My reading of this is that (if you’re training a dog) you can start with an unconditioned stimulus (sight of food) which causes salivating, and then you can add in the sound of a bell with the sight of food, and this also elicits salivating. And then you can remove the sight of food but still have the bell and the dog is likely to salivate. I don’t think you need to have a surprise to have learning in this context, you just need associations/patterns built up over time. Perhaps I’m misunderstanding you.

Aaron_Scher 17 Feb 2023 7:28 UTC
LW: 3 AF: 3
0
AF
on: Why I’m not working on {debate, RRM, ELK, natural abstractions}
I doubt it’s a crux for you, but I think your critique of Debate makes pessimistic assumptions which I think are not the most realistic expectation about the future.
Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment questions: if there’s an AGI “trying” to do something, how do we know that it’s not also “trying” to hack its way out of the box, seize power, and so on? And if we can control the AGI’s motivations well enough to answer those questions, why not throw out the whole “debate” idea and use those same techniques (whatever they are) to simply make an AGI that is “trying” to figure out the correct answer and tell it to us?
When I imagine saying the above quote to a smart person who doesn’t buy AI x-risk, their response is something like “woah slow down there. Just because the AI is “trying” to do something doesn’t mean it stands any chance of doing actually dangerous things like hacking out of the box. The ability to hack out of the box doesn’t mysteriously line up with the level of intelligence that would be useful for an AI debate.” This person seems largely right, and I think your argument is mainly “it won’t work to let two superintelligences to debate each other about important things” rather than a stronger claim like “any AIs smart enough to have a productive debate might be trying to do dangerous things and have non-negligible chance of succeeding”.
We could be envisioning different pictures for how debate is useful as a technique. I think it will break for sufficiently high intelligence levels, for reasons you discuss, but we might still get useful work out of it in models like GPT-4/5. Additionally, it seems to me that there are setups of Debate in which we aren’t all-or-nothing on the instrumental subgoals, consequentialist planning, and meta cognition, especially in (unlikely) worlds where the people implementing debate are taking many precautions. Fundamentally, Debate is about getting more trustworthy outputs from untrustworthy systems, and I expect we can get useful debates from AIs that do not run a significant risk of the failures you describe.
Again, I doubt this is a main crux for whether you will work on Debate, and that seems quite reasonable. If it’s the case that, “Debate is unlikely to scale all the way to dangerous AGIs”, then to the extent that we want to focus on the “dangerous AGIs” domain we might just want to skip it and work on other stuff.

Aaron_Scher 18 Feb 2023 3:06 UTC
3 points
0
in reply to: Steven Byrnes’s comment on: Why I’m not working on {debate, RRM, ELK, natural abstractions}
Thanks! I really liked your post about defending the world against out-of-control AGIs when I read it a few weeks ago.

Aaron_Scher

Summary of their plan

Response