I just realized this post is actually one of my earlier drafts, and a bug is keeping me from editing the post.
The final version of my post is here: https://docs.google.com/document/d/1gHaCHhNOUBqxcXxLeTQiqRjG_y7ELkBdM2kcySKDi_Q/edit
I just wrote a response to this post listing 5 specific ways we could improve CoT faithfulness: https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness
> seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn’t do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they “cannot train any policy compliance or user preferences onto the chain of thought”. Arguably, this doesn’t eliminate the possibility that they train a PRM using human raters’ opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it’s following a policy. Or even if they don’t directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
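To make that concrete, here's a minimal, purely hypothetical sketch of what a step-level reward signal could look like; the GPT-2 encoder, the untrained reward head, and the prompt format are all placeholders I made up, not anything OpenAI has described.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Hypothetical process reward model (PRM): a small encoder plus a scalar head
# that scores each CoT step as "on track" or not. In practice, the head would
# be trained on step-level labels (from humans or some automated signal).
tok = AutoTokenizer.from_pretrained("gpt2")
encoder = AutoModel.from_pretrained("gpt2")
reward_head = nn.Linear(encoder.config.hidden_size, 1)

def score_steps(problem: str, steps: list[str]) -> list[float]:
    """Return one scalar reward per CoT step, conditioned on all steps so far."""
    scores = []
    context = problem
    for step in steps:
        context = context + "\n" + step
        ids = tok(context, return_tensors="pt", truncation=True).input_ids
        with torch.no_grad():
            last_hidden = encoder(ids).last_hidden_state[:, -1]  # last-token state
            scores.append(reward_head(last_hidden).item())
    return scores

print(score_steps("What is 12 * 13?",
                  ["12 * 13 = 12 * 10 + 12 * 3", "= 120 + 36 = 156"]))
```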
I agree that reading the CoT could be very useful, and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.
I feel like research showing that CoTs aren’t faithful isn’t meant to say “we should throw out the CoT.” It’s more like “naively, you’d think the CoT is faithful, but look, it sometimes isn’t. We shouldn’t take the CoT at face value, and we should develop methods that ensure that it is faithful.”
Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn’t include a lot of distracting junk that doesn’t really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).
I don’t actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as “French is spoken in Paris.” It wouldn’t be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.
If we get to the point that LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can’t follow at all, I’d argue that at that point, we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.
Yeah, you kind of have to expect from the beginning that there’s some trick, since, taken literally, the title can’t actually be true. So I think it’s fine.
I don’t think so; just say that as prediction accuracy approaches 100%, the likelihood that the mind will use the natural abstraction increases, or something like that.
If you read the post I linked, it probably explains it better than I do—I’m just going off of my memory of the natural abstractions agenda. I think another aspect of it is that all sophisticated-enough minds will come up with the same natural abstractions, insofar as they’re natural.
In your example, you could get evidence that 0 and 1 voltages are natural abstractions in a toy setting by (see the sketch after this list):
- Training 100 neural networks to take the input voltages to a program and return the resulting output
- Doing some mechanistic interpretability on them
- Demonstrating that in every network, values below 2.5V are separated from values above 2.5V in some sense
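Here’s a rough sketch of the kind of experiment I’m imagining; the “program” (a 5V AND gate), the network sizes, and the crude separation metric are all made up for illustration, and real mechanistic interpretability would go much deeper than comparing mean hidden activations.

```python
import torch
import torch.nn as nn

def and_gate(v):
    # Toy "program": output is 5V iff both input voltages are above 2.5V.
    return ((v[:, 0] > 2.5) & (v[:, 1] > 2.5)).float() * 5.0

ratios = []
for seed in range(100):  # 100 independently trained networks
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(1000):
        v = torch.rand(64, 2) * 5.0  # random input voltages in [0, 5]
        loss = ((net(v).squeeze(-1) - and_gate(v)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Crude "interpretability" check: holding the second wire at 4V, do hidden
    # activations for first-wire voltages below 2.5V separate from those above?
    lo = torch.stack([torch.rand(256) * 2.5, torch.full((256,), 4.0)], dim=1)
    hi = torch.stack([torch.rand(256) * 2.5 + 2.5, torch.full((256,), 4.0)], dim=1)
    hidden = net[:2]  # first linear layer + ReLU
    gap = (hidden(lo).mean(0) - hidden(hi).mean(0)).norm()
    spread = hidden(lo).std(0).norm() + hidden(hi).std(0).norm()
    ratios.append((gap / spread).item())

print(f"mean separation ratio across networks: {sum(ratios) / len(ratios):.2f}")
```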
See “natural abstractions,” summarized here: https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1
In your example, it makes more sense to treat voltages <2.5 and >2.5 as different things, rather than <5.0 and >5.0, because the former helps you predict things about how the computer will behave. That is, those two ranges of voltage are natural abstractions.
It can also be fun to include prizes that are extremely low-commitment or obviously jokes/unlikely to ever be followed up on, like “a place in my court when I ascend to kinghood” from Alexander Wales’ Patreon.
This is great! Maybe you’d get better results if you “distill” GPT2-LN into GPT2-noLN by fine-tuning on the entire token probability distribution on OpenWebText.
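Something like this minimal sketch is what I have in mind; here “GPT2-noLN” is just a placeholder loaded as plain GPT-2 (however the LayerNorm-free model is actually constructed), and the OpenWebText data loading is left out.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")  # GPT2-LN (the original)
student = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder for GPT2-noLN
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(batch_ids: torch.Tensor) -> float:
    """One distillation step: match the student's full next-token distribution
    to the teacher's at every position (forward KL over the whole vocabulary)."""
    with torch.no_grad():
        teacher_logits = teacher(batch_ids).logits
    student_logits = student(batch_ids).logits
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# In practice, batch_ids would come from tokenized OpenWebText documents.
batch = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
print(distill_step(batch))
```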
Just curious, why do you spell “useful” as “usefwl”? I googled the word to see if it means something special, and all of the top hits were your comments on LessWrong or the EA Forum.
If I understand correctly, you’re basically saying:
- We can’t know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem, which says that we can’t know in advance whether a program will run forever.
- If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn’t.
- Since we can’t prove that the machine won’t do something catastrophic, the alignment problem is impossible.
The halting problem doesn’t say that we can’t know whether any given program will halt, just that we can’t determine the halting status of every possible program. It’s easy to “prove” that a program that runs an LLM will halt: just program it to “run the LLM until it decides to stop, but if it doesn’t stop itself after 1 million tokens, cut it off.” This is what ChatGPT or any other AI product does in practice.
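As a toy illustration of why that program provably halts, here’s a minimal generation loop with a hard cap; GPT-2 and the 256-token limit are just stand-ins for whatever model and cutoff (e.g. 1 million tokens) you’d actually use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

MAX_NEW_TOKENS = 256  # hard cutoff (stand-in for "1 million tokens")
ids = tok("The machine will halt because", return_tensors="pt").input_ids

# This loop provably terminates: either the model emits its end-of-sequence
# token ("decides to stop"), or the loop exhausts MAX_NEW_TOKENS iterations.
for _ in range(MAX_NEW_TOKENS):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()  # greedy decoding, for simplicity
    if next_id.item() == tok.eos_token_id:  # the model decided to stop
        break
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(ids[0]))
```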
Also, the alignment problem isn’t necessarily about proving that an AI will never do something catastrophic. It’s enough to have good informal arguments that it won’t do something bad with (say) 99.99% probability over the length of its deployment.
After reading the Wikipedia article for “Complete (complexity),” I realize I might have misinterpreted what “complete” technically means.
What I was trying to say is “given Sora, you can ‘easily’ turn it into an agent” in the same way that “given a SAT solver, you can ‘easily’ turn it into a solver for another NP-complete problem.”
I changed the title from “OpenAI’s Sora is agent-complete” to “OpenAI’s Sora is an agent,” which I think is less misleading. The most technically-correct title might be “OpenAI’s Sora can be transformed into an agent without additional training.”
That sounds more like “AGI-complete” to me. By “agent-complete” I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!
> First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Can we solve this problem by setting the default action to “do nothing,” then giving the agent an extra action to “do nothing and give the shutdown subagent +1 reward”?
I think the implication was that “high-status men” wouldn’t want to hang out with “low-status men” who awkwardly ask out women.
On the topic of AI for forecasting: just a few days ago, I made a challenge on Manifold Markets to try to incentivize people to create Manifold bots that use LLMs to forecast diverse 1-month questions accurately, with improving epistemics as the ultimate goal.
You can read the rules and bet on the main market here: https://manifold.markets/CDBiddulph/will-there-be-a-manifold-bot-that-m?r=Q0RCaWRkdWxwaA
If anyone’s interested in creating a bot, please join the Discord server to share ideas and discuss! https://discord.com/channels/1193303066930335855/1193460352835403858
I figured out a workaround to edit my post—it’s up to date now.