Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K pre-print papers posted on arXiv under the CS.AI
category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation-model related repos have been trending on GitHub for months. “People working with [a few technical labs’] models” is a massive community containing many thousands of developers, researchers, and hobbyists. It is important to be honest about how they will likely be impacted by this proposed regulation.
cfoster0 (Charles Foster)
If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you’re actually measuring the effect of scale alone, rather than scale confounded by performance.
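A minimal sketch of that matching procedure (the step numbers and loss values below are hypothetical, just to illustrate pairing checkpoints of different-size models at the nearest recorded loss):

```python
# Sketch: pair checkpoints across model sizes by validation loss, so that
# comparisons measure the effect of scale at (approximately) fixed
# performance. Loss logs here are made up for illustration.

def match_checkpoints(losses_a, losses_b):
    """For each checkpoint of model A, find the checkpoint of model B
    whose recorded loss is closest. Returns (step_a, step_b) pairs."""
    pairs = []
    for step_a, loss_a in losses_a.items():
        step_b = min(losses_b, key=lambda s: abs(losses_b[s] - loss_a))
        pairs.append((step_a, step_b))
    return pairs

# Hypothetical training logs: {training step: validation loss}
small_model = {1000: 3.2, 2000: 2.9, 4000: 2.7}
large_model = {500: 3.1, 1000: 2.8, 2000: 2.5}

print(match_checkpoints(small_model, large_model))
# → [(1000, 500), (2000, 1000), (4000, 1000)]
```

Comparisons within each pair then isolate scale, since both members sit at (roughly) the same loss.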
It would definitely move the needle for me if y’all are able to show this behavior arising in base models without forcing, in a reproducible way.
Good question. I don’t have a tight first-principles answer. The helix puts a bit of positional information in the variable magnitude (otherwise it’d be an ellipse, which would alias different positions) and a bit in the variable rotation, whereas the straight line is the far extreme of putting all of it in the magnitude. My intuition is that (in a transformer, at least) encoding information through the norm of vectors + acting on it through translations is “harder” than encoding information through (almost-) orthogonal subspaces + acting on it through rotations.
Relevant comment from Neel Nanda: https://twitter.com/NeelNanda5/status/1671094151633305602
Very cool! I believe this structure allows expressing the “look back N tokens” operation (perhaps even for different Ns across different heads) via a position-independent rotation (and translation?) of the positional subspace of query/key vectors. This sort of operation is useful if many patterns in the dataset depend on the relative arrangement of tokens (for ex. common n-grams) rather than their absolute positions. Since all these models use absolute positional embeddings, the positional embeddings have to contort themselves to make this happen.
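To illustrate the “look back N tokens” point with a toy construction of my own (a single 2-d sinusoidal positional subspace; the frequency is arbitrary and nothing here is taken from the paper): a fixed rotation of that subspace shifts every position back by the same offset, independent of absolute position.

```python
# Sketch: with a sinusoidal positional code (cos(w*p), sin(w*p)), rotating
# by -N*w maps the code for position p onto the code for position p - N,
# for every p. A head could thus attend "N tokens back" via one
# position-independent linear map of the positional subspace.
import math

W = 0.3  # hypothetical frequency of one positional subspace

def pos_code(p):
    return (math.cos(W * p), math.sin(W * p))

def rotate(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    x, y = v
    return (c * x - s * y, s * x + c * y)

N = 3  # look back three tokens
for p in range(N, 10):
    shifted = rotate(pos_code(p), -W * N)
    target = pos_code(p - N)
    assert all(abs(a - b) < 1e-9 for a, b in zip(shifted, target))
print("rotation by -N*w shifts every position back by N tokens")
```

The same rotation works at every position, which is exactly the kind of position-independent operation that relative patterns (like common n-grams) call for.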
It’s absolutely fine if you want to use AI to help summarize content, and then you check that content and endorse it.
I still ask that you please flag it as such, so the reader can make an informed decision about how to read/respond to the content.
Is this an AI summary (or your own writing)? If so, would you mind flagging it as such?
The main takeaway (translated to standard technical language) is it would be useful to have some structured representation of the relationship between terminal values and instrumental values (at many recursive “layers” of instrumentality), analogous to how Bayes nets represent the structure of a probability distribution. That would potentially be more useful than a “flat” representation in terms of preferences/utility, much like a Bayes net is more useful than a “flat” probability distribution.
That’s an interesting and novel-to-me idea. That said, the paper offers [little] technical development of the idea.
I believe Yoav Shoham has done a bit of work on this, attempting to create a formalism & graphical structure similar to Bayes nets for reasoning about terminal/instrumental value. See these two papers:
I think we’re more or less on the same page now. I am also confused about the applicability of existing mechanisms. My lay impression is that there isn’t much clarity right now.
For example, this uncertainty about who’s liable for harms from AI systems came up multiple times during the recent AI hearings before the US Senate, in the context of Section 230’s shielding of computer service providers from certain liabilities, and to what extent it & other laws extend here. In response to Senator Graham asking about this, Sam Altman straight up said “We’re claiming we need to work together to find a totally new approach. I don’t think Section 230 is even the right framework.”
I see. The liability proposal isn’t aimed at near-miss scenarios with no actual harm. It is aimed at scenarios with actual harm, but where that actual harm falls short of extinction + the conditions contributing to the harm were of the sort that might otherwise contribute to extinction.
You said no one had named “a specific actionable harm that’s less than extinction” and I offered one (the first that came to mind) that seemed plausible, specific, and actionable under Hanson’s “negligent owner monitoring” condition.
To be clear, though, if I thought that governments could just prevent negligent owner monitoring (& likewise with some of the other conditions) as you suggested, I would be in favor of that!
EDIT: Someone asked Hanson to clarify what he meant by “near-miss” such that it’d be an actionable threshold for liability, and he responded:
Any event where A causes a hurt to B that A had a duty to avoid, the hurt is mediated by an AI, and one of those eight factors I list was present.
Can you re-state that? I find the phrasing of your question confusing.
(Are you saying there is no harm in the near-miss scenarios, so liability doesn’t help? If so I disagree.)
Hanson does not ignore this; he is very clear about it:
it seems plausible that for every extreme scenario like [extinction by foom] there are many more “near miss” scenarios which are similar, but which don’t reach such extreme ends. For example, where the AI tries but fails to hide its plans or actions, where it tries but fails to wrest control or prevent opposition, or where it does these things yet its abilities are not broad enough for it to cause existential damage. So if we gave sufficient liability incentives to AI owners to avoid near-miss scenarios, with the liability higher for a closer miss, those incentives would also induce substantial efforts to avoid the worst-case scenarios.
The purpose of this kind of liability is to provide an incentive gradient pushing actors away from the preconditions of harm. Many of those preconditions are applicable to harms at differing scales. For example, if an actor allowed AI systems to send emails in an unconstrained and unmonitored way, that negligence is an enabler for both automated spear-phishing scams (a “lesser harms”) and for AI-engineered global pandemics.
As I understand this, the rough sketch of this approach is basically to realize that incomplete preferences are compatible with a family of utility functions rather than a single one (since they don’t specify how to trade-off between incomparable outcomes), and that you can use randomization to select within this family (implemented via contracts), thereby narrowing in on completed preferences / a utility function. Is that description on track?
If so, is it a problem that the subagents/committee/market may have preferences that are a function of this dealmaking process, like preferences about avoiding the coordination/transaction costs involved, or preferences about how to do randomization? Like, couldn’t you end up with a situation where “completing the preferences” is dispreferred, such that the individual subagents do not choose to aggregate into a single utility maximizer?
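If I have the first part right, the randomization step can be sketched like this (my own gloss, not the paper’s construction; the trade-off parameterization and the goods being traded off are hypothetical):

```python
# Sketch: incomplete preferences leave trade-offs between incomparable
# outcomes unspecified, so they are compatible with a *family* of utility
# functions. Randomizing over that family (the "contract") picks out one
# completed utility function.
import random

# Hypothetical family: each member fixes a different trade-off weight
# between two incomparable goods, "apples" and "berries".
def make_utility(w):
    return lambda apples, berries: w * apples + (1 - w) * berries

random.seed(0)
w = random.random()          # the contract fixes one trade-off rate
u = make_utility(w)          # preferences are now complete: any two
                             # bundles can be compared under u
print(u(3, 1) > u(1, 3))     # with seed 0, prints True
```

Before the draw, (3, 1) and (1, 3) were incomparable; after it, the sampled `u` ranks them, which is the sense in which randomization “completes” the preferences.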
Having known some of Conjecture’s founders and their previous work in the context of “early-stage EleutherAI”, I share some[1] of the main frustrations outlined in this post. At the organizational level, even setting aside the departure of key researchers, I do not think that Conjecture’s existing public-facing research artifacts have given much basis for me to recommend the organization to others (aside from existing personal ties). To date, only[2] a few posts, like the ones on the polytope lens and on circumventing interpretability, were at the level of quality & novelty I expected from the team. Maybe that is a function of the restrictive information policies, maybe a function of startup issues, maybe just the difficulty of research. In any case, I think that folks ought to require more rigor and critical engagement from their future research outputs[3].
- ^
I didn’t find the critiques of Connor’s “character and trustworthiness” convincing, but I already consider him a colleague & a friend, so external judgments like these don’t move the needle for me.
- ^
The main other post I have in mind was their one on simulators. AFAICT the core of “simulator theory” predated (mid-2021, at least) Conjecture, and yet even with a year of additional incubation, the framework was not brought to a sufficient level of technical quality.
- ^
For example, the “cognitive emulation” work may benefit from review by outside experts, since the nominal goal seems to be to do cognitive science entirely inside of Conjecture.
I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight, sometimes that way of acting is purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a “policy”, one separate from the “reactive policy” defined by just using the policy network without tree search. Looking back at Sutton & Barto, they define “policy” similarly:
A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.
(emphasis mine) along with this later description of planning in a model-based RL context:
The word planning is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).
That being said, looking at the AlphaZero paper, a quick search did not turn up usages of the term “policy” in this way. So maybe this usage is less widespread than I had assumed.
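For concreteness, here is a toy example of the usage I have in mind (entirely my own construction: `reactive_policy` stands in for a raw policy network, `search_policy` for the network + lookahead combination; the number-line domain and heuristic are made up):

```python
# Sketch: both a reactive mapping and a mapping that runs a search inside
# it are "policies" in Sutton & Barto's sense, i.e. state -> action.

GOAL = 4  # toy domain: walk left/right on a number line toward +4

def actions(s):
    return [] if s == GOAL else [-1, +1]

def next_state(s, a):
    return s + a

def heuristic(s):
    # a misleading local bump at -1; the real payoff is at the goal
    return 1.5 if s == -1 else (10 if s == GOAL else -abs(GOAL - s))

def reactive_policy(state):
    # act on the immediate evaluation only (stand-in for a policy net)
    return max(actions(state), key=lambda a: heuristic(next_state(state, a)))

def search_policy(state, depth=5):
    # improve the same evaluation with lookahead (stand-in for tree search)
    def value(s, d):
        if d == 0 or not actions(s):
            return heuristic(s)
        return max(value(next_state(s, a), d - 1) for a in actions(s))
    return max(actions(state), key=lambda a: value(next_state(state, a), depth - 1))

print(reactive_policy(0), search_policy(0))  # → -1 1
```

Both functions are state-to-action mappings, so both count as policies under the Sutton & Barto definition; the second just “involves extensive computation such as a search process” and avoids the local bump.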
I think requiring a “common initialization + early training trajectory” is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.
Agreed. That part of my comment was aimed only at the claim about weight averaging only working for diffusion/image models, not about knowledge sharing more generally.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works.
Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general or via cross-attention, though. Especially if we’re comparing the cost of transfer to the cost of re-running the original training; that’s why people are exploring this, especially smaller/independent researchers. There are a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (example from a few days ago: https://arxiv.org/abs/2305.17216). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMa variants via targeted synthetic data generation. What kind of scalability are you thinking of?
The part where you can average weights is unique to diffusion models, as far as I can tell, which makes sense because the 2-d structure of the images is very local, and so this establishes a strong preferred basis for the representations of different networks.
Exchanging knowledge between two language models currently seems approximately impossible? Like, you can train on the outputs, but I don’t think there is really any way for two language models to learn from each other by exchanging any kind of cognitive content, or to improve the internal representations of a language model by giving it access to the internal representations of another language model.
There’s a pretty rich literature on this stuff, transferring representational/functional content between neural networks.
Averaging weights to transfer knowledge is not unique to diffusion models. It works on image models trained with non-diffusion setups (https://arxiv.org/abs/2203.05482, https://arxiv.org/abs/2304.03094) as well as on non-image tasks such as language modeling (https://arxiv.org/abs/2208.03306, https://arxiv.org/abs/2212.04089). Exchanging knowledge between language models via weight averaging is possible provided that the models share a common initialization + early training trajectory. And if you allow for more methods than weight averaging, simple stuff like Knowledge Distillation or stitching via cross-attention (https://arxiv.org/abs/2106.13884) are tricks known to work for transferring knowledge.
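For concreteness, a minimal sketch of the weight-averaging trick (the parameter values are made up, and real merges operate on full model state dicts, but the elementwise operation is the same):

```python
# Sketch: two models fine-tuned from a shared initialization are merged
# by averaging their parameters elementwise ("model soup" style).
# Toy parameter dicts stand in for real state dicts.

def average_weights(*state_dicts):
    keys = state_dicts[0].keys()
    assert all(sd.keys() == keys for sd in state_dicts)
    return {
        k: [sum(vals) / len(state_dicts)
            for vals in zip(*(sd[k] for sd in state_dicts))]
        for k in keys
    }

# Hypothetical fine-tunes branched from one shared init
model_a = {"w": [0.5, 0.25], "b": [0.0]}
model_b = {"w": [0.25, 0.75], "b": [0.5]}

merged = average_weights(model_a, model_b)
print(merged)  # → {'w': [0.375, 0.5], 'b': [0.25]}
```

The shared-init requirement is exactly why this works at all: it keeps the two networks in (approximately) the same basin, so elementwise averaging lands on a sensible set of weights.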
I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?
I didn’t mean “learning from experience” to be restrictive in that way. Animals learn by observing others & from building abstract mental models too. But unless one acquires abstracted knowledge via communication, learning requires some form of experience: even abstracted knowledge is derived from experience, whether actual or imagined. Moreover, I don’t think that some extra/different planning machinery was required for language itself, beyond the existing abstraction and model-based RL capabilities that many other animals share. But ultimately that’s an empirical question.
Hmm, we may have reached the point from which we’re not going to move on without building mathematical frameworks and empirically testing them, or something.
Yeah I am probably going to end my part of the discussion tree here.
My overall take remains:
There may be general purpose problem-solving strategies that humans and non-human animals alike share, which explain our relative capability gains when combined with the unlocks that came from language/culture.
We don’t need any human-distinctive “general intelligence” property to explain the capability differences among human-, non-human animal-, and artificial systems, so we shouldn’t assume that there’s any major threshold ahead of us corresponding to it.
Parallel distributed processing (as well as “connectionism”) is just an early name for the line of work that was eventually rebranded as “deep learning”. They’re the same research program.