phone.spinning

Karma: 27

phone.spinning 7 Sep 2022 18:10 UTC
8 points
8
on: phone.spinning’s Shortform
‘The goal of alignment research should be to get us into “alignment escape velocity”, which is where the rate of alignment progress (which will largely come from AI as we progress) is fast enough to prevent doom for enough time to buy even more time.’
^ the above argument only works if you think that there will be a relatively slow takeoff. If there is a fast takeoff, the only way to buy more time is to delay that takeoff, because alignment won’t scale as quickly as capabilities under a period of significant and rapid recursive self-improvement.
What links here?
- Vladimir_Nesov's comment on Morpheus’s Shortform by Morpheus (11 Sep 2022 2:58 UTC; 2 points)

phone.spinning 23 Nov 2023 17:06 UTC
5 points
2
in reply to: johnswentworth’s comment on: Debate helps supervise human experts [Paper]
Another author here! Regarding specifically the 74% vs. 84% numbers—a key takeaway that our error analysis is intended to communicate is that we think a large fraction of the errors judges made in debates were pretty easily solvable with more careful judges, whereas this didn’t feel like it was the case with consultancy.
For example, Julian and I both had 100% accuracy as judges on human debates for the 36 human debates we judged, which was ~20% of all correct human debate judgments. So I’d guess that more careful judges overall could increase debate accuracy to at least 90%, maybe higher, although at that point we start hitting measurement limits from the questions themselves being noisy.

phone.spinning 6 Sep 2022 3:47 UTC
4 points
0
in reply to: cousin_it’s comment on: Some conceptual alignment research projects
I think AI of the capability level that you describe will either already have little need to exploit people, or will quickly train successors that wouldn’t benefit from this. I do think deception is a big issue, but I think the important parts of deception will be earlier in terms of AI capability than you describe.

phone.spinning 23 Jan 2023 7:46 UTC
3 points
0
in reply to: dsj’s comment on: Models Don’t “Get Reward”
The issue with early finetuning is that there’s not much that humans can actually select on, because the models aren’t capable enough—it’s really hard for me to say that one string of gibberish is better/worse.

phone.spinning 27 Dec 2022 5:41 UTC
3 points
on: Strong Evidence is Common
This post is short, but important. The fact that we regularly receive enormously improbable evidence is relevant for a wide variety of areas. It’s an integral part of having accurate beliefs, and despite this being such a key idea, it’s underappreciated generally (I’ve only seen this post referenced once, and it’s never come up in conversation with other rationalists).

phone.spinning 30 Sep 2022 22:26 UTC
3 points
in reply to: Linda Linsefors’s comment on: Linda Linsefors’s Shortform
I like that description of NFL!

phone.spinning 17 Sep 2022 15:33 UTC
3 points
0
on: phone.spinning’s Shortform
Epistemic status: I’m somewhat confident this is a useful axis to describe/consider alignment strategies/perspectives, but I’m pretty uncertain which is better. I could be missing important considerations, or weighing the considerations listed inaccurately.

When thinking about technical alignment strategy, I’m unsure whether it’s better to focus on trying to align systems that already exist (and are misaligned), or to focus on procedures that train aligned models from the start.

The first case is harder, which means focusing on it is closer to minimax optimization (minimize the badness of the worst-case outcome), which is often a good metric to optimize. Plus, it could be argued that it’s more realistic, because the current AI paradigm of “learn a bunch of stuff about the world with a SSL prediction loss, then fine-tune it to point at actually useful tasks” is very similar to this, and might require us to align pre-trained unaligned systems.

However, I don’t think only focusing on this paradigm is necessarily correct. I think it’s important we have methods that can train (more-or-less) from scratch (and still be roughly competitive with other methods). I’m uncertain how hard fine-tuning SSL prediction models is, and I think the frame of “we’re training models to be a certain way” admits different solutions and strategies than “we need to align a potential superintelligence”.

One way to cache this out more concretely is the extent to which you view amplification or debate as fine-tuning for alignment, versus a method of learning new capabilities (while maintaining alignment properties).

phone.spinning 6 Sep 2022 21:04 UTC
3 points
1
on: phone.spinning’s Shortform
Alignment is a stabilizing force against fast takeoff, because the models will not want to train models that don’t do what *they* want. So, the goals/values of the superintelligence we get after a takeoff might actually end up being the values of models that are just past the point of capability where they are able to align their successors. I’d expect these values to be different from the values of the initial model that started the recursive self-improvement process, because I don’t expect that initial model to be capable of solving (or caring about) alignment enough, and because there may competitive dynamics that cause ~human-level AI to train successors that are misaligned to it.

phone.spinning 28 Nov 2023 20:18 UTC
2 points
0
on: Anthropic Fall 2023 Debate Progress Update
Given that the judge that selects the best argument for BoN is the same as the one that chooses the winner, what is your main takeaway from the fact that ELO increases as you increase N? I see this as mainly a sanity check, but want to check if I’m missing something.

phone.spinning 18 Sep 2022 16:31 UTC
2 points
1
on: phone.spinning’s Shortform
It’s so easy to get caught up in meta-thinking—I want to try to remember to not spend more than maybe 10% of my time generally doing meta-reflection, process optimization, etc., and spend at least 90% of my time working directly on the concrete goal in front of me (LM alignment research, right now)

phone.spinning 12 Sep 2022 3:08 UTC
2 points
1
in reply to: the gears to ascension’s comment on: phone.spinning’s Shortform
Hmm yeah that’s fair, but I think what I said stands as a critique of a certain perspective on alignment, insofar as I think having the alignment curve grow faster at every step is equivalent to solving the core hard problem. I agree that we need to solve the core hard problem, but we need to delay fast takeoff until we are very confident that the problems are solved.

phone.spinning 31 Dec 2020 6:34 UTC
2 points
in reply to: Unnamed’s comment on: Everyday Lessons from High-Dimensional Optimization
Which suggests that if you’re doing randomish exploration, you should try to shake things up and move in a bunch of dimensions at once rather than just moving along a single identified dimension.
If you can only do randomish exploration this sounds right, but I think this often isn’t the right approach (not saying you would disagree with this, just pointing it out). When we change things along basis vectors, we’re implicitly taking advantage of the fact that we have a built-in basis for the world (namely, our general world model). This lets us reason about things like causality, constraints, etc. since we already are parsing the world into a useful basis.

phone.spinning 17 Jan 2024 21:17 UTC
1 point
0
AF
on: When can we trust model evaluations?
the existence of predicates on the world that are easier to evaluate than generate examples of (in the same way that verifying the answer to a problem in NP can be easier than generating it) guarantees that the model should be better at distinguishing between evaluation and deployment than any evaluator can be at tricking it into thinking it’s in deployment
Where does the guarantee come from? Why do we know that for this specific problem (generating vs. evaluating whether the model is deployed) it’s easier to evaluate? For many problems it’s equally difficult, right?

phone.spinning 3 Jan 2023 17:29 UTC
1 point
on: Debate update: Obfuscated arguments problem
- unreasonable ^5
I think there’s a typographical error—this doesn’t link to any footnote for me, and there doesn’t appear to be a fifth footnote at the end of the post

phone.spinning 3 Jan 2023 17:20 UTC
1 point
on: Debate update: Obfuscated arguments problem
Geoffrey and others raised this general problem several years ago (e.g. here)

This link no longer works—I get a permission denied message.

phone.spinning 21 Dec 2022 0:41 UTC
1 point
0
on: phone.spinning’s Shortform
Has anyone thought about the best ways of intentionally inducing the most likely/worst kinds of misalignment in models, so we can test out alignment strategies on them? I think red teaming kinda fits this, but that’s more focused on eliciting bad behavior, instead of causing a more general misalignment. I’m thinking about something along the lines of “train with RLHF so the model reliably/robustly does bad things, and then we can try to fix that and make the model good/non-harmful”, especially in the sandwiching context where the model is more capable than the overseer.
This is especially relevant for Debate, where we currently do self-play with a helpful assistant-style RLHF’d model, where one of the models is prompted to argue for an incorrect answer. But prompting the model to argue for an incorrect answer is a very simple/rough way of inducing misalignment, which is (at least partially) what we’re trying to design Debate to be robust against.

phone.spinning 21 Dec 2022 0:37 UTC
1 point
0
on: phone.spinning’s Shortform
Is there any other reason to think that scalable oversight is possible at all in principle, other the standard complexity theory analogy? I feel like this is forming the basis of a lot of our (and other’s) work in safety, but I haven’t seen work that tries to understand/conceptualize this analogy concretely.

phone.spinning 21 Dec 2022 0:36 UTC
1 point
0
on: phone.spinning’s Shortform
Is anyone thinking about how to scale up human feedback collection by several orders of magnitudes? A lot of alignment proposals aren’t focused on the social choice theory questions, which I’m okay with, but I’m worried that there may be large constant factors in the scalability of human feedback strategies like amplification/debate, such that there could be big differences between collecting 50k trajectories versus say 50-500M. Obviously cost/logistics are a giant bottleneck here, but I’m wondering about what other big challenges might be (especially if we could make intellectual progress on this before we may need to)

phone.spinning 20 Dec 2022 23:29 UTC
1 point
0
on: phone.spinning’s Shortform
How does shard theory differ from the Olah-style interpretability agenda? Why is there any reason to believe we can learn about “shards” without interpretability?

phone.spinning 6 Oct 2022 20:45 UTC
1 point
0
in reply to: phone.spinning’s comment on: phone.spinning’s Shortform
This is about 100T tokens, assuming ~2 tokens per word. That’s quite a lot of supervision.