NickGabs

Karma: 386

Grantmaker at Coefficient Giving

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

2 Jan 2024 0:47 UTC

125 points

29 comments · 8 min read · LW link

(arxiv.org)

Science of Deep Learning more tractably addresses the Sharp Left Turn than Agent Foundations

NickGabs 19 Sep 2023 22:06 UTC

21 points

2 comments · 6 min read · LW link

NickGabs 17 Jul 2023 20:14 UTC
3 points
2
in reply to: River’s comment on: An upcoming US Supreme Court case may impede AI governance efforts
I think you’re probably right. But even this will make it harder to establish an agency where the bureaucrats/technocrats have a lot of autonomy, and it seems there’s at least a small chance of an extreme ruling which could make it extremely difficult.

NickGabs 17 Jul 2023 20:12 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: An upcoming US Supreme Court case may impede AI governance efforts
Yeah, I think they will probably produce better and more regulations than if politicians were more directly involved, but I’m not super sanguine about bureaucrats in absolute terms.

An upcoming US Supreme Court case may impede AI governance efforts

NickGabs 16 Jul 2023 23:51 UTC

57 points

17 comments · 2 min read · LW link

Empirical Evidence Against “The Longest Training Run”

NickGabs 6 Jul 2023 18:32 UTC

31 points

1 comment · 14 min read · LW link

NickGabs 4 Jun 2023 22:49 UTC
1 point
0
in reply to: Rudi C’s comment on: Proposal: labs should precommit to pausing if an AI argues for itself to be improved
This was addressed in the post: “To fully flesh out this proposal, you would need concrete operationalizations of the conditions for triggering the pause (in particular the meaning of “agentic”) as well as the details of what would happen if it were triggered. The question of how to determine if an AI is an agent has already been discussed at length at LessWrong. Mostly, I don’t think these discussions have been very helpful; I think agency is probably a “you know it when you see it” kind of phenomenon. Additionally, even if we do need a more formal operationalization of agency for this proposal to work, I suspect that we will only be able to develop such an operationalization via more empirical research. The main particular thing I mean to actively exclude by stipulating that the system must be agentic is an LLM or similar system arguing for itself to be improved in response to a prompt.”

Proposal: labs should precommit to pausing if an AI argues for itself to be improved

NickGabs 2 Jun 2023 22:31 UTC

3 points

3 comments · 4 min read · LW link

NickGabs 7 Apr 2023 18:40 UTC
2 points
1
on: Reliability, Security, and AI risk: Notes from infosec textbook chapter 1
Thanks for posting this, I think these are valuable lessons and I agree it would be valuable for someone to do a project looking into successful emergency response practices. One thing this framing does also highlight is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious human actors where security problems could pop up before the AI becomes misaligned, but this failure mode seems less likely and less x-risk inducing than misalignment, and involves a pretty different set of measures (info sharing policies, encrypting the weights as opposed to training techniques or evals).

NickGabs 30 Mar 2023 23:45 UTC
3 points
6
on: The 0.2 OOMs/year target
I think concrete ideas like this that take inspiration from past regulatory successes are quite good, esp. now that policymakers are discussing the issue.

NickGabs 30 Mar 2023 13:14 UTC
1 point
0
in reply to: Orpheus16’s comment on: Want to win the AGI race? Solve alignment.
I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment due to consumers wanting safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it such that it is ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models—Leopold thinks that governments and the public will “wake up” sufficiently quickly to catastrophic risk from AI such that there will be regulatory and PR forces which effectively prevent the release of misaligned models, including evals/ways of detecting misalignment that are more robust than those that might be used by ordinary consumers (which could as you point out likely be fooled by surface-level alignment due to RLHF).

AI Doom Is Not (Only) Disjunctive

NickGabs 30 Mar 2023 1:42 UTC

12 points

0 comments · 5 min read · LW link

NickGabs 28 Mar 2023 14:53 UTC
3 points
0
in reply to: fdafea’s comment on: Why does advanced AI want not to be shut down?
How do any capabilities or motivations arise without being explicitly coded into the algorithm?

NickGabs 26 Mar 2023 19:03 UTC
1 point
0
in reply to: Noosphere89’s comment on: continue working on hard alignment! don’t give up!
I don’t think it is correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens when they have been subjected to lots of fine tuning, perhaps via Ajeya Cotra’s idea of “HFDT,” for a while after pre-training. Thus, while pretraining from human preferences might shift the initial distribution that the model predicts at the start of finetuning in a way which seems like it would likely push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment that I think is really the core.

NickGabs 25 Mar 2023 18:01 UTC
2 points
0
in reply to: Noosphere89’s comment on: continue working on hard alignment! don’t give up!
1. How does it give the AI a myopic goal? It seems like it’s basically just a clever form of prompt engineering, in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most/all prompts. But base models aren’t myopic agents; they aren’t agents at all. As such, I’m not concerned about pure simulators/predictors posing x-risks, but about what happens when people do RL on them to turn them into agents (or use similar techniques like decision transformers). I think it’s plausible that pretraining from human feedback partially addresses this by pushing the model’s outputs into a more aligned distribution from the get-go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
2. I agree scaling well with data is quite good. But see (1)
3. How?
4. I was never that concerned about this, but I agree that it does seem good to offload more training to pretraining as opposed to finetuning for this and other reasons

NickGabs 25 Mar 2023 15:41 UTC
2 points
4
in reply to: Noosphere89’s comment on: continue working on hard alignment! don’t give up!
Pretraining on human feedback? I think it’s promising, but we have no direct evidence of how it interacts with RL finetuning to make LLMs into agents, which is the key question.

NickGabs 25 Mar 2023 11:34 UTC
4 points
4
in reply to: Martin Randall’s comment on: continue working on hard alignment! don’t give up!
Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don’t, then aligning them via fine tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine tuning or something like Anthropic’s constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because in order for it to be misaligned and attain low loss, it must be deceptively aligned, but deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high quality planning/deception skills.

NickGabs 25 Mar 2023 11:24 UTC
6 points
5
in reply to: Noosphere89’s comment on: continue working on hard alignment! don’t give up!
Yeah, I agree with both your object-level claim (i.e. I lean towards the “alignment is easy” camp) and, to a certain extent, your psychological assessment, but this is a bad argument. Optimism bias is also well documented in many cases, so to establish that “alignment is hard” people are overly pessimistic, you need to argue more on the object level against the claim or provide highly compelling evidence that such people are systematically irrationally pessimistic on most topics.

NickGabs 1 Mar 2023 18:29 UTC
4 points
3
on: Enemies vs Malefactors
Strong upvote. A corollary here is that a really important part of being a “good person” is being able to tell when you’re rationalizing your behavior or otherwise deceiving yourself into thinking you’re doing good. The default is that people are quite bad at this but, as you said, don’t have explicitly bad intentions, which leads to a lot of people who are at some level morally decent acting in very morally bad ways.

NickGabs 27 Feb 2023 18:30 UTC
1 point
0
in reply to: DragonGod’s comment on: Why I’m Sceptical of Foom
In the intervening period, I’ve updated towards your position, though I still think it is risky to build systems with capabilities that open-ended which are that close to agents in design space.
