cfoster0 (Charles Foster)

Karma: 1,321

cfoster0 2 May 2024 18:09 UTC
9 points
4
on: Q&A on Proposed SB 1047
Left the following comment on the blog:
I appreciate that you’re endorsing these changes in response to the two specific cases I raised on X (unlimited model retraining and composition with unsafe covered models). My gut sense is still that ad-hoc patching in this manner just isn’t a robust way to deal with the underlying issue*, and that there are likely still more cases like those two. In my opinion it would be better for the bill to adopt a different framework with respect to hazardous capabilities from post-training modifications (something closer to “Covered model developers have a duty to ensure that the marginal impact of training/releasing their model would not be to make hazardous capabilities significantly easier to acquire.”). The drafters of SB 1047 shouldn’t have to anticipate every possible contingency in advance; that’s just bad design.
* In the same way that, when someone notices that their supposedly-safe utility function for their AI has edge cases that expose unforeseen maxima, introducing ad-hoc patches to deal with those particular noticed edge cases is not a robust strategy to get an AI that is actually safe across the board.

cfoster0 29 Mar 2024 15:59 UTC
4 points
0
in reply to: tailcalled’s comment on: tailcalled’s Shortform
You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
More formally, let’s say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:
Hope(s, a) = r · s’ + f · Hope(s’, a’)
The “successor representation” is somewhat close to this. It encodes the distribution over future states a particular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
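As a rough illustration of the TD-learning point, here is a minimal tabular sketch of the successor representation under a fixed policy. Everything specific in it is an assumption made for illustration: the function names, the toy 3-state cycle, the discount of 0.9, and the learning rate. It uses the common convention that the starting state counts as visited at time zero, so the Bellman target for row s is onehot(s) plus the discounted row of the next state.

```python
import random


def learn_successor_representation(transitions, n_states, gamma=0.9, lr=0.1,
                                   sweeps=2000, seed=0):
    """Tabular TD(0) estimate of the successor representation M.

    transitions[s] lists the next states reachable from s under the fixed
    policy; one is sampled uniformly at each update (hypothetical setup).
    M[s][j] estimates the expected discounted number of visits to j from s.
    """
    rng = random.Random(seed)
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            s_next = rng.choice(transitions[s])
            for j in range(n_states):
                onehot = 1.0 if j == s else 0.0
                # Bellman target: visit s now, then follow M from s_next.
                target = onehot + gamma * M[s_next][j]
                M[s][j] += lr * (target - M[s][j])
    return M


def value_from_sr(M, rewards, s):
    """Recover V(s) by weighting expected discounted visits by per-state reward."""
    return sum(m * r for m, r in zip(M[s], rewards))


# Toy deterministic cycle 0 -> 1 -> 2 -> 0, with reward only in state 2.
cycle = [[1], [2], [0]]
M = learn_successor_representation(cycle, n_states=3)
rewards = [0.0, 0.0, 1.0]
print([round(x, 3) for x in M[0]])
print(round(value_from_sr(M, rewards, 0), 3))
```

The entries M[s][j] * rewards[j] give a per-state breakdown of where the value of s comes from, which is one possible reading of a reward-weighted distribution over future states; whether that matches the intended “Hope” function is a guess.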

cfoster0 14 Dec 2023 19:01 UTC
5 points
3
on: AI #42: The Wrong Answer

On reflection these were bad thresholds, should have used maybe 20 years and a risk level of 5%, and likely better defined transformational. The correlation is certainly clear here, the upper right quadrant is clearly the least popular, but I do not think the 4% here is lizardman constant.

Wait, what? Correlation between what and what? 20% of your respondents chose the upper right quadrant (transformational/safe). You meant the lower left quadrant, right?

cfoster0 28 Nov 2023 3:20 UTC
37 points
14
on: Apocalypse insurance, and the hardline libertarian take on AI risk
Very surprised there’s no mention here of Hanson’s “Foom Liability” proposal: https://www.overcomingbias.com/p/foom-liability

cfoster0 3 Nov 2023 16:22 UTC
13 points
9
on: Thoughts on open source AI
I appreciate that you are putting thought into this. Overall I think that “making the world more robust to the technologies we have” is a good direction.
In practice, how does this play out?
Depending on the exact requirements, I think this would most likely amount to an effective ban on future open-sourcing of generalist AI models like Llama2 even when they are far behind the frontier. Three reasons that come to mind:
1. The set of possible avenues for “novel harms” is enormous, especially if the evaluation involves “the ability to finetune [...], external tooling which can be built on top [...], and API calls to other [SOTA models]”. I do not see any way to clearly establish “no novel harms” with such a boundless scope. Heck, I don’t even expect proprietary, closed-source models to be found safe in this way.
2. There are many, many actors in the open-source space, working on many, many AI models (even just fine-tunes of LLaMA/Llama2). That is kind of the point of open sourcing! It seems unlikely that outside evaluators would be able to evaluate all of these, or for all these actors to do high-quality evaluation themselves. In that case, this requirement turns into a ban on open-sourcing for all but the largest & best-resourced actors (like Meta).
3. There aren’t incentives for others to robustify existing systems or to certify “OK you’re allowed to open-source now”, in the way that there are for responsible disclosure. By default, I expect those steps to just not happen, & for that to chill open-sourcing.

cfoster0 2 Nov 2023 21:17 UTC
16 points
21
in reply to: DanielFilan’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
If we are assessing the impact of open-sourcing LLMs, it seems like the most relevant counterfactual is the “no open-source LLM” one, right?

cfoster0 2 Nov 2023 21:01 UTC
4 points
2
in reply to: DanielFilan’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
Noted! I think there is substantial consensus within the AIS community on a central claim that the open-sourcing of certain future frontier AI systems might unacceptably increase biorisks. But I think there is not much consensus on a lot of other important claims, such as for which (future or even current) AI systems open-sourcing is acceptable and for which ones it unacceptably increases biorisks.

cfoster0 2 Nov 2023 20:00 UTC
19 points
16
in reply to: jacquesthibs’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
(explaining my disagree reaction)
The open source community seems to consistently assume the case that the concerns are about current AI systems and the current systems are enough to lead to significant biorisk. Nobody serious is claiming this
I see a lot of rhetorical equivocation between risks from existing non-frontier AI systems, and risks from future frontier or even non-frontier AI systems. Just this week, an author of the new “Will releasing the weights of future large language models grant widespread access to pandemic agents?” paper was asserting that everyone on Earth has been harmed by the release of Llama2 (via increased biorisks, it seems). It is very unclear to me which future systems the AIS community would actually permit to be open-sourced, and I think that uncertainty is a substantial part of the worry from open-weight advocates.

cfoster0 1 Nov 2023 18:07 UTC
8 points
2
in reply to: Dagon’s comment on: Snapshot of narratives and frames against regulating AI
Note that the outlook from MIRI folks appears to somewhat agree with this, that there does not exist an authority that can legibly and correctly regulate AI, except by stopping it entirely.

cfoster0 31 Oct 2023 15:38 UTC
6 points
4
in reply to: Nathan Helm-Burger’s comment on: Would it make sense to bring a civil lawsuit against Meta for recklessly open sourcing models?
At face value it seems not-credible that “everyone on Earth” has been actually harmed by the release of Llama2. You could try to make a case that there’s potential for future harms downstream of Llama2′s release, and that those speculative future harms could impact everyone on Earth, but I have no idea how one would claim that they have already been harmed.

cfoster0 26 Oct 2023 17:18 UTC
6 points
1
in reply to: Oliver Sourbut’s comment on: AI as a science, and three obstacles to alignment strategies
I agree that they are related. In the context of this discussion, the critical difference between SGD and evolution is somewhat captured by your Assumption 1:
Fixed ‘fitness function’ or objective function mapping genome to continuous ‘fitness score’
Evolution does not directly select/optimize the content of minds. Evolution selects/optimizes genomes based (in part) on how they distally shape what minds learn and what minds do (to the extent that impacts reproduction), with even more indirection caused by selection’s heavy dependence on the environment. All of that creates a ton of optimization “slack”, such that large-brained human minds with language could steer optimization far faster & more decisively than natural selection could. This is what 1a3orn was pointing to earlier with
evolution does not grow minds, it grows hyperparameters for minds. When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that’s why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
SGD does not have that slack by default. It acts directly on cognitive content (associations, reflexes, decision-weights), without slack or added indirection. If you control the training dataset/environment, you control what is rewarded and what is penalized, and if you are using SGD, then this lets you directly mold the circuits in the model’s “brain” as desired. That is one of the main alignment-relevant intuitions that gets lost when blurring the evolution/SGD distinction.

cfoster0 26 Oct 2023 0:01 UTC
4 points
in reply to: Buck’s comment on: Buck’s Shortform
An attempt was made last year, as an outgrowth of some assorted shard theory discussion, but I don’t think it got super far:
- Price’s equation for neural networks

cfoster0 25 Oct 2023 23:57 UTC
19 points
14
in reply to: habryka’s comment on: AI as a science, and three obstacles to alignment strategies
Note: I just watched the videos. I personally would not recommend the first video as an explanation to a layperson if I wanted them to come away with accurate intuitions around how today’s neural networks learn / how we optimize them. What it describes is a very different kind of optimizer, one explicitly patterned after natural selection such as a genetic algorithm or population-based training, and the follow-up video more or less admits this. I would personally recommend they opt for these videos instead:

cfoster0 13 Oct 2023 16:49 UTC
6 points
1
in reply to: Pranav Gade’s comment on: unRLHF—Efficiently undoing LLM safeguards
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient while being cost-effective.
Definitely. Despite my frustrations, I still upvoted your post because I think exploring cost-effective methods to steer AI systems is a good thing.
The llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don’t release the 34bn parameter model because they weren’t able to train it up to their standards of safety—so it does seem like one of the primary concerns.
I understand you as saying (1) “[whether their safety guardrails can be removed] does seem like one of the primary concerns”. But IMO that isn’t the right way to interpret their concerns, and we should instead think (2) “[whether their models exhibit safe chat behavior out of the box] does seem like one of their primary concerns”. Interpretation 2 explains the decisions made by the Llama2 authors, including why they put safety guardrails on the chat-tuned models but not the base models, as well as why they withheld the 34B one (since they could not get it to exhibit safe chat behavior out of the box). But under interpretation 1, a bunch of observations are left unexplained, like that they also released model weights without any safety guardrails, and that they didn’t even try to evaluate whether their safety guardrails can be removed (for ex. by fine-tuning the weights). In light of this, I think the Llama2 authors were deliberate in the choices that they made, they just did so with a different weighting of considerations than you.

cfoster0 13 Oct 2023 1:30 UTC
17 points
16
on: unRLHF—Efficiently undoing LLM safeguards

In doing so, we hope to demonstrate a failure mode of releasing model weights—i.e., although models are often red-teamed before they are released (for example, Meta’s LLaMA 2 “has undergone testing by external partners and internal teams to identify performance gaps and mitigate potentially problematic responses in chat use cases”,) adversaries can modify the model weights in a way that makes all the safety red-teaming in the world ineffective.

I feel a bit frustrated about the way this work is motivated, specifically the way it assumes a very particular threat model. I suspect that if you had asked the Llama2 researchers whether they were trying to make end-users unable to modify the model in unexpected and possibly-harmful ways, they would have emphatically said “no”. The rationale for training + releasing in the manner they did is to give everyday users a convenient model they can have normal/safe chats with right out of the box, while still letting more technical users arbitrarily modify the behavior to suit their needs. Heck, they released the base model weights to make this even easier! From their perspective, calling the cheap end-user modifiability of their model a “failure mode” seems like an odd framing.

EDIT: On reflection, I think my frustration is something like “you show that X is vulnerable under attack model A, but it was designed for a more restricted attack model B, and that seems like an unfair critique of X”. I would rather you just argue directly for securing against the more pessimistic attack model.

cfoster0 1 Sep 2023 19:11 UTC
7 points
2
on: Why aren’t more people in AIS familiar with PDP?
Parallel distributed processing (as well as “connectionism”) is just an early name for the line of work that was eventually rebranded as “deep learning”. They’re the same research program.

cfoster0 30 Aug 2023 19:16 UTC
20 points
16
in reply to: Thomas Larsen’s comment on: Introducing the Center for AI Policy (& we’re hiring!)
Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K pre-print papers posted on arXiv under the cs.AI category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation-model related repos have been trending on GitHub for months. “People working with [a few technical labs’] models” is a massive community containing many thousands of developers, researchers, and hobbyists. It is important to be honest about how they will likely be impacted by this proposed regulation.

cfoster0 31 Jul 2023 15:27 UTC
LW: 2 AF: 1
0
AF
in reply to: RobertKirk’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you’re actually measuring the effect of scale alone, rather than scale confounded by performance.
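A minimal sketch of that matched-loss comparison, assuming each model size has a list of (loss, metric) pairs logged at its checkpoints. The helper name `metric_at_loss` and all the numbers are hypothetical, and linear interpolation between neighbouring checkpoints is just one simple way to read off the metric at a shared loss value.

```python
def metric_at_loss(checkpoints, target_loss):
    """Linearly interpolate a metric at target_loss.

    checkpoints is a list of (loss, metric) pairs from one training run.
    Returns None when target_loss lies outside the losses the run covers.
    """
    pts = sorted(checkpoints)  # ascending loss
    for (l0, m0), (l1, m1) in zip(pts, pts[1:]):
        if l0 <= target_loss <= l1:
            if l1 == l0:
                return m0
            w = (target_loss - l0) / (l1 - l0)
            return m0 + w * (m1 - m0)
    return None


# Hypothetical checkpoint logs: (validation loss, faithfulness metric).
small_model = [(3.2, 0.40), (2.8, 0.50), (2.6, 0.60)]
large_model = [(3.0, 0.30), (2.6, 0.50), (2.2, 0.70)]

shared_loss = 2.8
for name, run in [("small", small_model), ("large", large_model)]:
    print(name, metric_at_loss(run, shared_loss))
```

Comparing the two interpolated values at the same loss is what isolates the effect of size; comparing only final checkpoints would mix size with the lower loss the larger model reaches.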

cfoster0 29 Jul 2023 1:38 UTC
3 points
1
in reply to: janus’s comment on: How LLMs are and are not myopic
It would definitely move the needle for me if y’all are able to show this behavior arising in base models without forcing, in a reproducible way.

cfoster0 21 Jul 2023 20:18 UTC
3 points
0
in reply to: AdamYedidia’s comment on: GPT-2′s positional embedding matrix is a helix
Good question. I don’t have a tight first-principles answer. The helix puts a bit of positional information in the variable magnitude (otherwise it’d be an ellipse, which would alias different positions) and a bit in the variable rotation, whereas the straight line is the far extreme of putting all of it in the magnitude. My intuition is that (in a transformer, at least) encoding information through the norm of vectors + acting on it through translations is “harder” than encoding information through (almost-) orthogonal subspaces + acting on it through rotations.

Relevant comment from Neel Nanda: https://twitter.com/NeelNanda5/status/1671094151633305602