This is what I was trying to say: in certain applications, like automating AI interpretability/alignment research, the tradeoff is not that harsh. A lot of the methods that make personal-intent/instruction-following AGIs feasible also let you extract optimization that is hard enough, and safe enough, to solve the problem with iterative methods.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs; see this example, where their Q* algorithm aced a math test:
Also, I don’t buy that it was refuted, based on this statement, which sounds like a refutation but isn’t actually one, and they never directly deny it:
Interesting. I do expect GPS (general-purpose search) to be the main bottleneck for both capabilities and inner alignment.
It’s generally much easier to verify that something has been done correctly than to actually execute the plan yourself.
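To make that asymmetry concrete, here’s a toy Python sketch (my own example, not from the discussion): checking that a candidate output is a correctly sorted version of the input is a single linear pass, while producing the sorted order takes the full algorithm.

```python
from collections import Counter

def is_valid_sort(items, candidate):
    """Verify (cheaply) that candidate is a sorted permutation of items."""
    # Non-decreasing order: one linear pass over adjacent pairs.
    ordered = all(a <= b for a, b in zip(candidate, candidate[1:]))
    # Same multiset of elements: linear-time hash-based comparison.
    same_elements = Counter(items) == Counter(candidate)
    return ordered and same_elements

print(is_valid_sort([3, 1, 2], [1, 2, 3]))  # True
print(is_valid_sort([3, 1, 2], [1, 3]))     # False
```

The verifier never has to know *how* the sorting was done; it only checks the result, which is the asymmetry being pointed at.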
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
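As a sketch of what that reward conservatism could look like (my own illustrative construction; the penalty weight `k` is a made-up parameter), one simple rule is to score each option by a lower confidence bound, mean reward minus a multiple of the spread, so that uncertain options are assumed bad:

```python
import statistics

def conservative_reward(samples, k=2.0):
    """Pessimistic (lower-confidence-bound) reward estimate:
    when uncertain about the reward, assume it will be bad."""
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean - k * spread  # k controls how conservative we are

# Two actions with the same mean reward (1.0); the noisier one scores lower.
safe_action = [1.0, 1.1, 0.9, 1.0]
risky_action = [3.0, -1.0, 2.5, -0.5]
print(conservative_reward(safe_action) > conservative_reward(risky_action))  # True
```

As the paragraph notes, this rule would be over-cautious for an explore/exploit agent, but that is exactly the failure direction we prefer when the “exploration” is over value space.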
Yes, I think optimizing worst-case performance is one crucial part of alignment; it’s also one advantage of infrabayesianism.
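For a feel of what worst-case optimization looks like, here’s a minimal maximin sketch in the spirit of infra-Bayesian decision rules (my own toy construction, not their actual formalism): instead of maximizing expected utility under one belief, maximize the worst-case expectation over a *set* of candidate distributions.

```python
def expected_utility(utilities, dist):
    # Expected utility of an action's per-state payoffs under one distribution.
    return sum(p * u for p, u in zip(dist, utilities))

def maximin_choice(actions, credal_set):
    """Pick the action whose worst-case expectation over the set is best."""
    return max(actions, key=lambda a: min(expected_utility(actions[a], d)
                                          for d in credal_set))

# Two world-states; we only know state 1's probability lies in [0.3, 0.7].
credal_set = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]
actions = {"bold": (10.0, -10.0),   # great in state 1, terrible in state 2
           "cautious": (1.0, 1.0)}  # fine everywhere
print(maximin_choice(actions, credal_set))  # cautious
```

Both actions have the same expected utility under the middle distribution, but the maximin rule picks the one that can’t go badly, which is the worst-case-performance intuition.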
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is that I’m more optimistic about the meta-plan for alignment research: both because of my models of how research progress works, and because I believe you can get superhuman performance at stuff like AI interpretability research while still having instruction-following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for interpreting an alien ontology, especially if we can achieve something like HCH that helps us figure out how to improve automated interpretability itself. I think this would become safer and more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries, since we’d get to control the optimization target for automating interpretability without worrying about unintended optimization.
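As a toy illustration of the kind of counterfactual query a modular world model could support (entirely my own example, not a proposal from the discussion), here’s a tiny structural model where a do-style intervention cuts one variable loose from its usual mechanism and reruns the downstream mechanisms unchanged:

```python
def run_model(rain=None):
    """Mechanisms: rain -> sprinkler -> wet grass. Passing `rain` overrides
    its mechanism (a do-intervention) while the rest of the model reruns."""
    rain = 0.2 if rain is None else rain      # base rate unless intervened on
    sprinkler = 0.0 if rain > 0.5 else 0.9    # sprinkler runs when it's dry
    wet = max(rain, sprinkler)                # grass is wet from either source
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

observed = run_model()                # the model left alone
counterfactual = run_model(rain=1.0)  # do(rain := 1): downstream reruns
print(observed["wet"], counterfactual["wet"])  # 0.9 1.0
```

The modularity is what makes the query well-posed: because each mechanism is a separate piece, we can surgically replace one and ask what the rest of the model then predicts.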
Thanks! I enjoyed the conversation too.
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
Agreed.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, and I also don’t think it’s necessary for alignment (though I think some of their results, or analogues of them, would show up in a full solution to alignment).
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Thanks!