This update massively reduces my expectation of Conjecture’s future value. When you’re a small player in the field, you produce value through transferable or bolt-on components, such as Conjecture’s interpretability and simulator work. CoEm, on the other hand, is completely disconnected from other AGI or AI safety work, and pretty much only has any impact if Conjecture is extraordinarily successful.
Wow this looks great! The alignment tax for this is (inference on whole dataset with previous gen model), which is like 10%, and can be much lower if you just use a smaller classifier. Seems like an important part of near term alignment!
This is some evidence against the “scaling hypothesis”, i.e. evidence that something non-trivial and important is missing from modern deep learning in order to reach AGI.
The usual response is just “you don’t actually need to be robust to white box advexes, and only somewhat resistant to black box advexes, to take over the world”
The argument I see against this is that voluntary security that’s useful in the short term can be discarded once it’s no longer so, whereas security driven by public pressure or regulation can’t. If a lab had had great practices forever and then dropped them, there would be much less pressure to revert than if they’d previously had huge security incidents.
For instance, we might want to focus on public pressure for 1-2 years, then switch gears towards security
Oh cool. LMs can output more finely tokenized text than they’re trained on, so it probably didn’t output ” Skydragon” as a single token, but instead multiple tokens, [” ”, “Sky”, “dragon”] or something
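A toy sketch of the point (hypothetical five-entry vocab with greedy longest-match, not real BPE): even when ” Skydragon” is not in the vocab, the same string falls out of a composition of smaller tokens.

```python
# Toy greedy longest-match tokenizer over a hypothetical vocab.
# Real GPT tokenizers use BPE, but the composition point is the same:
# any string coverable by smaller tokens can still be emitted.
VOCAB = [" ", "Sky", "drag", "on", "dragon"]

def greedy_tokenize(text, vocab):
    tokens = []
    while text:
        # Pick the longest vocab entry that prefixes the remaining text.
        match = max((t for t in vocab if text.startswith(t)), key=len, default=None)
        if match is None:
            raise ValueError(f"untokenizable prefix: {text!r}")
        tokens.append(match)
        text = text[len(match):]
    return tokens

print(greedy_tokenize(" Skydragon", VOCAB))  # [' ', 'Sky', 'dragon']
```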
I have personally observed completely untrained tokens in GPT-2. Specifically, I found that some accented characters had very small, random-looking embeddings, which were so similar to one another that it looked like none of them had received any training at all
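A synthetic sketch of how such rows can be spotted (random data standing in for real GPT-2 weights; the order-of-magnitude threshold is an arbitrary choice): untrained rows keep their tiny initialization, so their norms stand out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake embedding matrix: 90 "trained" rows with unit-scale entries,
# 10 "untrained" rows that kept a tiny random initialization.
emb = rng.normal(scale=1.0, size=(100, 64))
emb[90:] = rng.normal(scale=0.01, size=(10, 64))

norms = np.linalg.norm(emb, axis=1)
# Flag rows whose norm is an order of magnitude below the mean.
suspect = np.where(norms < norms.mean() / 10)[0]
print(suspect)  # the ten "untrained" row indices
```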
Facebook’s models use maybe 1/4 the compute (rough guess) and have more implementation issues and worse finetuning
One small counterexample is conditional double-zero multiplication. This works in networks that have multiple elementwise multiplications in series (which you can get with gMLP or SwiGLU activation functions), like y = A*(B*X). If A and B are both 0, then there is no gradient on A or B, because neither has any effect on y independently, only together. If an apparatus that conditionally sets A and B to 0 on some datapoints arose in a model, there would be no gradient towards removing it, and it could effectively filter what data is given to SGD.
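A scalar sketch of the mechanism (made-up numbers, plain analytic gradients rather than a real network): with y = A*(B*x), each parameter’s gradient is gated by the other, so the (0, 0) point is gradient-free.

```python
def grads(A, B, x):
    """Analytic gradients of y = A * (B * x) w.r.t. A and B."""
    dA = B * x  # dy/dA: zero whenever B == 0
    dB = A * x  # dy/dB: zero whenever A == 0
    return dA, dB

print(grads(0.0, 0.0, 5.0))  # (0.0, 0.0): SGD cannot move either parameter
print(grads(0.0, 1.0, 5.0))  # (5.0, 0.0): A gets a gradient once B is nonzero
```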
I’m not against evaluating models in ways where they’re worse than rocks, I just think you shouldn’t expect anyone else to care about your worse-than-rock numbers without very extensive justification (including adversarial stuff)
Or just very hardcore global/national ID-ing
I don’t expect this to be possible in like 1.5 years, and expect it’s difficult now. Stuff like “only allow users who have a long browser history & have bought things and such” (like reCAPTCHA does) feels like the only way out besides “go to the office in person”
Causal scrubbing: results on induction heads
Causal scrubbing: results on a paren balance checker
isn’t this about generation vs classification, not language vs vision?
I don’t think this sort of prompt actually gets at the conscious-reasoning gap. It only takes one attention head to copy the exact next-token prediction made at a previous token, and I’d expect that if you used few-shot prompting (especially filling the entire context with few-shot prompts), the model would use its induction-like heads to just copy its predictions and perform quite well.
A better example would be to have the model describe its reasoning about predicting the next token, and then pass that to itself in an isolated prompt to predict the next token.
This is in line with my experience. However, the fact that this was an HTTP server is important—I get far more value from Copilot on JS HTTP servers than on other programs, and HTTP servers are a class that has many no-code options—how long would it take them if they were allowed to use pure SQL or a no-code solution?
Also, I think if you trained something to predict text and then RL-trained it on inclusive genetic fitness as a human (or on human motivation signals), its learning would mostly be in the space of “select a specific human / subdistribution of humans to imitate” rather than learning behaviors specific to the task, and its generalization properties would then depend more on those humans than on the specific training setup used
Here’s a plausible story to me:
The model sees its environment + past actions, and its LM predictive-modelling part puts non-negligible probability on “this is the ‘make humans smile’ task”. Then its language-modelling prior predicts the next action, not based on the training setup, which it doesn’t see, but based on the environment, and it outputs an action aimed at pressing the reward button. This action does well, is reinforced, and you get a reward-button-presser.
Some context is that when training language models with RLHF, the language modelling prior tends to dominate over RL-learned behaviors on sub-distributions even after lots of RLHF training.
Another version of this is “for many trajectories, an LM will be primarily predicting text, not executing RL-reinforced behaviors. Given this, actions that get reinforced are likely to come from the LM producing text that scores highly under its internal model of the reward, rather than from random actions”
Pretrained models don’t need any exploration to know that pressing the reward button gets more reward than doing the things the humans want. If you just ask GPT-3, it’ll tell you that.
Then the only exploration the AI needs is to get reward after thinking about analogies between its situation and its textual knowledge of AI/reinforcement learning/AI doom scenarios.
This applies especially to simple or often-discussed tasks such as making people smile: an LM has already heard of this exact task, so if it took an action based on the “make people smile” task it’s heard about, this could outperform other thought processes that are conditioned only on the data seen so far.
If you want to make a competitive AGI from scratch (even if you only want “within 5 years of the best AI”), you just have to start way earlier. If this project had been announced 7 years ago I’d like it much more, but now is just too late; you’d need huge miracles to finish in time.