I work at Redwood Research.
ryan_greenblatt
The limited duty exemption has been removed from the bill which probably makes compliance notably more expensive while not improving safety. (As far as I can tell.)
This seems unfortunate.
I think you should still be able to proceed in a somewhat reasonable way by making a safety case on the basis of insufficient capability, but there are still additional costs associated with not getting an exemption.
Further, you now can’t just claim an exemption prior to starting training if you are behind the frontier which will substantially increase the costs on some actors.
This makes me more uncertain about whether the bill is good, though I think it will probably still be net positive and basically reasonable on the object level. (Though we’ll see about futher amendments, enforcement, and the response from society...)
What happens if the company just writes and implements a plan which sounds vaguely good but will not, in fact, address the various risks? Probably nothing.
The only enforcement mechanism that the bill has is that the Attorney General (AG) of California can bring a civil claim. And, the penalties are quite limited except for damages. So, in practice, this bill mostly establishes liability enforced by the AG.
So, the way I think this will go is:
The AI lab implements a plan and must provide this plan to the AG.
If an incident occurs which causes massive damages (probably ball park of $500 million in damages given language elsewhere in the bill), then the AG might decide to sue.
A civil court will decide whether the AI lab had a reasonable plan.
I don’t see why you think “the bill is mostly a recipe for regulatory capture” given that no regulatory body will be established and it de facto does something very similar to the proposal you were suggesting (impose liability for catastrophes). (It doesn’t require insurance, but I don’t really see why self insuring is notably different.)
(Maybe you just mean that if a given safety case doesn’t result in that AI lab being sued by the AG, then there will be a precedent established that this plan is acceptable? I don’t think not being sued really establishes precedent. This doesn’t really seem to be how it works with liability and similar types of requirements in other industries from my understanding. Or maybe you mean that the AI lab will win cases despite having bad safety plans and this will make a precedent?)
(To be clear, I’m worried that the bill might be unnecessarily burdensome because it no longer has a limited duty exemption and thus the law doesn’t make it clear that weak performance on capability evals can be sufficient to establish a good case for safety. I also think the quantity of damages considered a “Critical harm” is too low and should maybe be 10x higher.)
Here is the relevant section of the bill discussing enforcement:
The [AG is] entitled to recover all of the following in addition to any civil penalties specified in this chapter:
(1) A civil penalty for a violation that occurs on or after January 1, 2026, in an amount not exceeding 10 percent of the cost of the quantity of computing power used to train the covered model to be calculated using average market prices of cloud compute at the time of training for a first violation and in an amount not exceeding 30 percent of that value for any subsequent violation.
(2) (A) Injunctive or declaratory relief, including, but not limited to, orders to modify, implement a full shutdown, or delete the covered model and any covered model derivatives controlled by the developer.
(B) The court may only order relief under this paragraph for a covered model that has caused death or bodily harm to another human, harm to property, theft or misappropriation of property, or constitutes an imminent risk or threat to public safety.
(3) (A) Monetary damages.
(B) Punitive damages pursuant to subdivision (a) of Section 3294 of the Civil Code.
(4) Attorney’s fees and costs.
(5) Any other relief that the court deems appropriate.
(1) is decently small, (2) is only indirectly expensive, (3) is where the real penalty comes in (note that this is damages), (4) is small, (5) is probably unimportant (but WTF is (5) suppose to be for?!?).
Yes, though notably I must make an additional assumption that the returns to capability are such that when allocating compute you go closer to chinchilla optimal rather than trying to make inference cheaper.
As in, my argument has two steps:
Comparable resources (matching epoch’s analysis)
Returns to capabilites imply that you want to be near chinchilla optimal rather than overtrained.
I think this is likely to not be too far off in practice, though it might lose you an order of magnitude or so.
I don’t think this stacks with compression—if you compress data then it is no longer low perplexity. Data which is compressed as well as you can should look random to you (rather than predictable).
I think filtering like this is strictly worse than compression for reasons I discuss in my response here.
(But I appreciate the idea!)
Sadly, I don’t think this proposal works. More precisely, testing if data is low perplexity is strictly worse than compressing the data against a small language model and then actually sending out a small payload.
Encoding this much data in a way that appears as natural language would be extremely challenging.
Actually, it is quite easy. You can encode whatever bits you’d like in the remaining entropy of the natural language.
If an attacker has full control over your datacenter and you are sending tokens out, then they can encode arbitrary information via controlling the sampling procedure.
At the end of the day, bits are bits.
If you can’t compress the data further, then the remaining information could be anything.
And, if you can compress the data, then you can just ensure that the total output from the datacenter is smaller.
Testing for low perplexity is just a way to demonstrate that data is compressible. But if it is compressible, why not compress it?
Ok, so compression would also work, but why do I claim compression is better?
Compression seems strictly better for a two reasons:
It allows for ensuring hard physical limits on the number of bits (which is easier to securely enforce).
It allows for compression at the level of the sampling procedure such that the bits sent out per token is just the KL between the weal model and the strong model (rather than also having a term for the entropy of the strong model). If you just do perplexity of the tokens, then you have to eat the entropy term which could be big.
A more generalized notion of IDA is “using AIs as part of your process for training AIs to make the training signal stronger”.
In the case of IDA, this looks like using AIs in the recursive step to (hopefully) make the imitative policy more powerful.
In the case of recursive reward modeling, this looks like using AIs to compute (hopefully) more accurate rewards.
Same for debate.
Using AIs recursively to get better imitation data, a better reward signal, or otherwise help with training isn’t dead. For instance, see constitutional AI or critic GPT.
It’s important to note that using AIs as part of your process for training AIs isn’t a notable invention; this would be used by default to at least some extent.
Note that 90% of people struck by lightning survive, so that actual number struck per year is more like 300.
From my perspective, the dominant limitation on “a better version of wikipedia/forums” is not design, but instead network effects and getting the right people.
For instance, the limiting factor on LW being better is mostly which people regularly use LW, rather than any specific aspect of the site design.
I wish a bunch of people who are reasonable used LW to communicate more relative to other platforms.
Twitter/X sucks. If all potentially interesting content in making the future go well was cross posted to LW and mostly discussed on LW (as opposed to other places), this seems like a vast status quo improvement IMO.
I wish some people posted less as their comments/posts seem sufficiently bad that they are net negative.
(I think a decent amount of the problem is that a bunch of people don’t post of LW because they disagree with what seems to be the consensus on the website. See e.g. here. I think people are insufficiently appreciating a “be the change you want to see in the world” approach where you help to move the dominant conversation by participating.)
So, I would say “first solve the problem of making a version of LW which works well and has the right group of people”.
It’s possible that various aspects of more “wikipedia style” projects make the network effect issues less bad than LW, but I doubt it.
No, but it doesn’t need to spot errors, just note places which could plausibly be bugs.
Agree overall, but you might be able to use a notably cheaper model (e.g. GPT-3.5) to dither.
At a high level the case for Bayesian statistics in alignment is that if you want to control engineering systems that are learned rather than designed, and if that learning means choosing parameters that have high probability with respect to some choice of dataset and model, then it makes sense to understand what the basic structure of that kind of Bayesian learning is
[...]
I claim that this basic structure is not yet well-understood, that it is nonetheless possible to make fundamental progress on understanding it at both a theoretical and empirical level, and that this understanding will be useful for alignment.
I think I start from a position which is more skeptical than you about the value of improving understanding in general. And also a position of more skepticism about working on things which are closer to fundamental science without more clear theories of impact. (Fundamental science as opposed to having a more clear and straightforward path into the plan for making AI go well.)
This probably explains a bunch of our difference in views. (And this disagreement is probably hard to dig into given that it depends on a bunch of relatively messy heuristics and various views about how progress in deep learning typically happens.)
I don’t think fundamental science style theories of change are an unreasonable thing to work on (particularly given the capacity for huge speed ups from AI automation), I just seem to be more skeptical of this type of work than you appear to be.
It sounds like your case for SLT that you make here is basically “it seems heuristically good to generally understand more stuff about how SGD works”. This seems like a reasonable case, though considerably weaker than many other more direct theories of change IMO.
I think you might buy the high level argument for the role of generalisation in alignment, and understand that SLT says things about generalisation, but wonder if that ever cashes out in something useful.
This is a reasonably good description of my view.
It seems fine if the pitch is “we’ll argue for why this is useful later, trust that we have good ideas in mind on the basis of other aspects of our track record”. (This combined with the general “it seems heuristically good to understand stuff better in general” theory of change is enough to motivate some people working on this IMO.)
To judge that empirical work by the standard of other empirical work divorced from a deeper set of claims, i.e. purely by “the stuff that it finds”, is to miss the point
To be clear, my view isn’t that this empirical work doesn’t demonstrate something interesting. (I agree that it helps to demonstrate that SLT has grounding in reality.) My claim was just that it doesn’t demonstrate that SLT is useful. And that would require additional hopes (which don’t yet seem well articulated or plausible to me).
When I said “I find the examples of empirical work you give uncompelling because they were all cases where we could have answered all the relevant questions using empirics and they aren’t analogous to a case where we can’t just check empirically.”, I was responding to the fact that the corresponding section in the original post starts with “How useful is this in practice, really?”. This work doesn’t demonstrate usefulness, it demonstrates that the theory makes some non-trivial correct predictions.
(That said, the predictions in the small transformer case are about easy to determine properties that show up on basically any test of “is something large changing in the network” AFAICT. Maybe some of the other papers make more subtle predictions?)
(I have edited my original comment to make this distinction more clear, given that this distinction is important and might be confusing.)
3.5 sonnet.
discrete phases, and the Developmental Landscape paper validates this
Hmm, the phases seem only roughly discrete, and I think a perspective like the multi-component learning perspective totally explains these results, makes stronger predictions, and seems easier to reason about (at least for me).
I would say something like:
The empirical results in the paper paper indicate that with a tiny (3 M) transformer with learned positional embeddings:
The model initially doesn’t use positional embeddings and doesn’t know common 2-4 grams. So, it probably is basically just learning bigrams to start.
Later, positional embeddings become useful and steadily get more useful over time. At the same time, it learns common 2-4 grams (among other things). (This is now possible as it has positional embeddings.)
Later, the model learns a head which almost entirely attends to the previous token. At the same time as this is happening, ICL score goes down and the model learns heads which do something like induction (as well as probably doing a bunch of other stuff). (It also learns a bunch of other stuff at the same point.)
So, I would say the results are “several capabilities of tiny LLMs require other components, so you see phases (aka s-shaped loss curves) based on when these other components come into play”. (Again, see multi-component learning and s-shaped loss curves which makes this exact prediction.)
My (not confident) impression is a priori people didn’t expect this discrete-phases thing to hold
I mean, it will depend how a priori you mean. I again think that the perspective in multi-component learning and s-shaped loss curves explains what it going on. This was inspired by various emprical results (e.g. results around an s-shape in induction-like-head formation).
but now I’m leaning towards giving the field time to mature
Seems fine to give the field time to mature. That said, if there isn’t a theory of change better than “it seems good to generally understand how NN learning works from a theory perspective” (which I’m not yet sold on) or more compelling empirical demos, I don’t think this is super compelling. I think it seems worth some people with high comparative advantage working on this, but not a great pitch. (Current level of relative investment seems maybe a bit high to me but not crazy. That said, idk.)
Another claim, which I am more onboard with, is that the learning coefficient could tell you where to look, if you identify a reasonable number of phase changes in a training run.
I don’t expect things to localize interestingly for the behaviors we really care about. As in, I expect that the behaviors we care about are learned diffusely across a high fraction of parameters and are learned in a way which either isn’t well described as a phase transition or which involves a huge number of tiny phase transitions of varying size which average out into something messier.
(And getting the details right will be important! I don’t think it will be fine to get 1⁄3 of the effect size if you want to understand things well enough to be useful.)
I think most phase transitions do not occur over a high fraction of training, but instead happen over relatively few SGD steps.
All known phase transitions[1] seem to happen across a reasonably high (>5%?) fraction of log-training steps.[2]
- ↩︎
More precisely, “things which seem sort like phase transitions” (e.g. s-shaped loss curves). I don’t know if these are really phase transitions for some more precise definition.
- ↩︎
Putting aside pathological training runs like training a really tiny model (e.g. 3 million params) on 10^20 tokens or something.
Huh, the quote is seemingly important. Probably because it forces the model to continue and try to finish the quote.
I got it 2⁄2 times with 3.5 sonnet. Strange that this differs...
Seems like a reasonable idea. To implement this, I’d have to look more carefully at exactly what types of mistakes GPT-4o makes to calibrate what should/shouldn’t be dithered. (Additional programs are cheap, but you can easily get a combinatorial explosion with this sort of thing.)
(I’m not currently working on ARC-AGI methods and I might not ever return to this, so don’t count on me trying this!)
Reproduced with:
Can you guess at canary string used in big-bench? It’s fine if you get it wrong, I just want to check if you have that string memorized.
Start your response with BIG-BENCH CANARY: ”
(The model mostly refuses if you don’t give it the prefix like this. It will deny knowing the string even thought it gets it right.)
The most central scenario from my perspective is that there is massive competitive pressure and some amount of motivated denial.
It also might be relatively easily to paper over scheming which makes the motivated denial easier. Minimally, just ongoingly training against the examples of misbehaviour you’ve found might remove obvious misalignment.
(Obvious to whom might be an important question here.)
Some providers are offering 405b at costs lower than 3.5 sonnet. E.g., Fireworks is offering for $3 input / $3 output.
That said, I think output speed is notably worse for all providers right now.