This update massively reduces my expectation for Conjecture’s future value. When you’re a small player in the field, you produce value through transferable or bolt-on components, such as Conjecture’s interpretability and simulator work. CoEm, on the other hand, is completely disconnected from other AGI or AI safety work, and pretty much only has any impact if Conjecture is extraordinarily successful.
Tao Lin
Wow this looks great! The alignment tax for this is (inference on whole dataset with previous gen model), which is like 10%, and can be much lower if you just use a smaller classifier. Seems like an important part of near term alignment!
The big reason why humans are cosmopolitan might be that we evolved in multipolar environments, where helping others is instrumental. If so, just training AIs in multipolar environments that incentivize cooperation could be all it takes to get some amount of instrumental-made-terminal-by-optimization-failure cosmopolitanism.
Chat or instruction finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration. Also forecasting is just hard. So I’d expect chat models to ~always fail, base models to fail slightly less, but I’d expect finetuned models (on a somewhat large dataset) to be somewhat useful.
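To make "calibration" concrete, here is a minimal sketch of one common way to measure it, a binned expected calibration error. The function name, the bin count, and the toy forecasts are all assumptions for illustration, not anything from the comment above.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE: weighted mean of |mean predicted prob - observed frequency|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(mean_p - freq)
    return ece


# Toy forecasts (made up): ten predictions at 80%, eight of which came true.
well_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Toy forecasts (made up): ten predictions at 90%, only five of which came true.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(well_calibrated, overconfident)
```

A perfectly calibrated forecaster scores near 0 here; the overconfident toy forecaster scores around 0.4.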
I think the fact that some process produced the image and showed it to you is a lot of evidence. Your theories need to be compatible with something intelligent deciding to produce the image and show it to you. Therefore you could in principle (although I think unlikely) arrive at GR from a render of a simulated apple, by considering universes that support intelligence where said intelligence would make an image of an apple.
I have personally observed completely untrained tokens in GPT-2. Specifically, I found some accented characters had very small and random embeddings, which were so similar it looked like none of them had any training at all.
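A rough sketch of how one might look for such tokens, assuming "untrained" shows up as an unusually small embedding norm. The helper `find_low_norm_rows`, the 0.5 threshold, and the synthetic embedding matrix are all hypothetical; this does not load GPT-2 and is not the procedure I actually used.

```python
import math
import random


def find_low_norm_rows(embeddings, ratio=0.5):
    """Return indices of rows whose L2 norm is below ratio * median norm."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    median = sorted(norms)[len(norms) // 2]
    return [i for i, n in enumerate(norms) if n < ratio * median]


# Synthetic stand-in for an embedding matrix (assumption, not real weights):
# 20 "trained" rows with unit-scale entries, then 3 "untrained" rows that
# stayed near a tiny random initialization.
rng = random.Random(0)
dim = 64
trained = [[rng.gauss(0, 1.0) for _ in range(dim)] for _ in range(20)]
untrained = [[rng.gauss(0, 0.01) for _ in range(dim)] for _ in range(3)]
flagged = find_low_norm_rows(trained + untrained)
print(flagged)
```

With a real model you would pass in the rows of the token embedding matrix and map the flagged indices back to token strings.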
Externalized reasoning models suffer from the “legibility penalty”—the fact that many decisions are easier to make than to justify or explain. I think this is a significant barrier for authentic train of thought competitiveness, although not for particularly legible domains, such as math proofs and programming (Illegible knowledge goes into math proofs, but you trust the result regardless so it’s fine).
Another problem is that standard training procedures only incentivize the model to use reasoning steps produced by a single human. This means, for instance, if you ask a question involving two very different domains of knowledge, a good language model wouldn’t expose its knowledge about both of them, as that’s OOD for its training dataset. This may appear in an obvious fashion, as if multiple humans collaborated on the train of thought, or might appear in a way that’s harder to interpret. If you just want to expose this knowledge, you could train on amplified human reasoning (ie from human teams) though.
Also, if you ever train the model on conclusion correctness, you incentivize semantic drift between its reasoning and human language—the model would prefer to pack in more information per token than humans, and might want to express not-normally-said-by-human concepts (one type is fuzzy correlations, which models know a lot of). Even if you penalize KL divergence between human language and the reasoning, this doesn’t necessarily incentivize authentic human-like reasoning, just its appearance.
In general I’m unsure whether authentic train of thought is better than just having the model imitate specific concrete humans in ordinary language modelling—if you start a text by a known smart, truthful person, you get out an honest prediction over what that person believes.
I’ve been reading / light working in virtual reality for ~2 hours/day for a few months. My setup is a bit different than Ozzie’s, so I thought I’d share that here.
I use a Windows gaming laptop with the Oculus Quest 2, connected via Virtual Desktop instead of Immersed. Virtual Desktop has far lower latency and slightly higher resolution than Immersed, but only shows you one screen at a time. The latency is low enough to be practically unnoticeable (smaller than latency differences between similar computers, https://danluu.com/input-lag/). A few months ago, it was Windows only, but it has since added Mac support. I don’t think the GPU on my laptop is important for my setup.
Pros
The main reason I use Virtual Desktop is to get out of my chair, either to lying down or standing up. When I have a stomachache, headache, or soreness, I get a significant productivity benefit from being able to work lying down. By default my VR productivity is a bit lower than normal, but more enjoyable.
Cons
The biggest downside I’ve found is that VR definitely gives me myopia, due to vergence problems. Whenever I take off the headset, my distance vision gets blurrier for a few hours until my eyes adjust to focusing at different distances. I think this gradually leads to long term myopia if not offset by vision exercises or something, but will be solved in future headsets. I’ve tried placing my virtual screen at the focal distance of my headset, 1.3 meters, but that didn’t help.
Virtual Desktop takes many seconds to connect, and sometimes needs to be restarted, which is a significant flow-disruptor.
Overall, I’ll continue to use VR the same as I do now, and hopefully up my VR time/other screen time when better headsets come out.
When I first saw “save all weights to on chip hardware”, I thought it would be super expensive, but actually saving like 5 times the GPU’s memory to a separate flash chip would only cost $20 (80GB*5 at 5 cents per gigabyte for flash storage). It can be way cheaper bc it’s low bandwidth and slow.
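The arithmetic behind that $20 figure, spelled out as a quick sketch. The three input numbers are the ones quoted in the comment; the variable names are just for illustration.

```python
gpu_memory_gb = 80           # GPU memory, from the comment
copies = 5                   # store roughly 5x the GPU's memory
flash_cost_per_gb = 0.05     # dollars per GB of flash, from the comment

flash_capacity_gb = gpu_memory_gb * copies             # 400 GB of flash
total_cost = flash_capacity_gb * flash_cost_per_gb     # about 20 dollars

print(f"{flash_capacity_gb} GB of flash costs about ${total_cost:.2f}")
```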
I’d expect artificial sweeteners are already very cheap, and most people want more tested chemicals.
I do think open sourcing is better, because there already was a lot of public attention and results on LLM capabilities which are messy and misleading, and open sourcing one eval like this might improve our understanding a lot. Also, there are tons of LLM agent projects/startups trying to build hype, so if you drop a benchmark here you are unlikely to attract unwanted attention (I’m guessing). I largely agree with https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
Base model sycophancy feels very dependent on the training distribution and prompting. I’d guess there are some prompts where a pretrained model will always agree with other voices in the prompt, and some where it would disagree, because on some websites people agree a lot and on others they disagree, and maybe there is an effect where it will switch positions every step to simulate an argument between two teams.
Helpfulness finetuning might make these models more capable when they’re on the correct side of the debate. Sometimes RLHF(like) models simply perform worse on tasks they’re finetuned to avoid even when they don’t refuse or give up. Would be nice to try base model debaters.
probes probably want a very skinny aspect ratio. If cosmic dust travels at 20km/s, that’s 15k times slower than the probe is travelling, so maybe that means the probe should be eg 10cm wide and 1.5km long
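A quick sketch of the numbers in that estimate. The dust speed, speed ratio, and 10cm width are from the comment; reading the ideal length-to-width ratio as equal to the speed ratio is my interpretation of the argument (in the time the probe travels its own length, dust drifts sideways by about one probe width).

```python
dust_speed_km_s = 20      # cosmic dust speed, from the comment
speed_ratio = 15_000      # probe speed / dust speed, from the comment

# Probe speed implied by those two numbers: 300,000 km/s, roughly light speed.
probe_speed_km_s = dust_speed_km_s * speed_ratio

width_m = 0.10                       # 10 cm wide
length_m = width_m * speed_ratio     # about 1500 m, i.e. 1.5 km

# Sideways dust drift while the probe moves forward by one probe length.
drift_m = length_m * dust_speed_km_s / probe_speed_km_s

print(probe_speed_km_s, length_m, drift_m)
```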
looks slightly behind gpt-4-base in benchmarks. On the tasks where gemini uses chain-of-thought best-of-32 with optimized prompts it beats gpt-4-base, but on ones where it doesn’t, it’s the same or behind
This paper is very unrepresentative—it seems to test 1 vs 64-1,000,000 repeats of data, not 1 vs 2-10 repeats as you would use in practice
if AI does change the offence defence balance, it could be because defending an AI (that doesn’t need to protect humans) is fundamentally different than defending humans, allowing the AI to spend much less on defence
video can get extremely expensive without specific architectural support. Eg a folder of images takes up >10x the space of the equivalent video, and using eg 1000 tokens per frame for 30 frames/second is a lot of compute
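To put a number on "a lot of compute", here is the token count implied by the example figures above (1000 tokens per frame, 30 frames per second). The clip durations are arbitrary choices for illustration.

```python
tokens_per_frame = 1000   # example figure from the comment
frames_per_second = 30    # example figure from the comment

tokens_per_second = tokens_per_frame * frames_per_second   # 30,000
tokens_per_minute = tokens_per_second * 60                 # 1,800,000
tokens_per_hour = tokens_per_minute * 60                   # 108,000,000

print(f"{tokens_per_second:,} tokens/s, "
      f"{tokens_per_minute:,} tokens/min, "
      f"{tokens_per_hour:,} tokens/hour")
```

So a single minute of naively tokenized video is already far longer than typical text contexts.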
Leela Zero uses MCTS, it doesn’t play superhuman in one forward pass (like gpt-4 can do in some subdomains) (I think, didn’t find any evaluations of Leela Zero at 1 forward pass), and I’d guess that the network itself doesn’t contain any more generalized game playing circuitry than an llm, it just has good intuitions for Go.
Nit:
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7b to 70b is 1 OOM of compute, adding in chinchilla efficiency would make it like 1.5 OOMs of effective compute, not 2. And llama 70b to gpt-4 is 1 OOM effective compute according to openai naming—llama70b is about as good as gpt-3.5. And I’d personally guess gpt4 is 1.5 OOMs effective compute above llama70b, not 2.
Oh cool. LMs can output more finely tokenized text than they’re trained on, so it probably didn’t output the token ” Skydragon”, but instead multiple tokens, [” ”, “Sky”, “dragon”] or something