The free energy talk probably confuses more than it elucidates. I'm not talking about random diffusion per se, but about the connection between uniform sampling, simplicity, and the simplicity-accuracy tradeoff.
I've tried to explain more carefully where my thinking currently stands in my reply to Lucius.
Also, a caveat: shortforms are half-baked-by-design.
Yep, I've recently been posting shortforms (as per your recommendation), and I'm totally with you on the “half-baked-by-design” concept (if Cheeseboard can do it, it must be a good idea, right? :)
I still don't agree that free energy is core here. I think the relevant question, which can be formulated without free energy, is whether various “simplicity/generality” priors push towards or away from human values (and you can then specialize to questions of effective dimension/LLC, deep vs. shallow networks, ICL vs. weight learning, generalized OOD generalization measurements, and so on to operationalize the inductive prior better). I don't think there's a consensus on whether generality is “good” or “bad”; I know Paul Christiano and ARC have gone both ways on this at various points.
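For concreteness, here is a minimal sketch of one of those operationalizations: measuring an OOD generalization gap as a crude proxy for how general a learned solution is. This is just an illustrative sketch; `model`, `id_loader`, `ood_loader`, and `loss_fn` are hypothetical placeholders, not anything from the discussion above.

```python
import torch

@torch.no_grad()
def avg_loss(model, loader, loss_fn):
    """Average loss of `model` over a DataLoader."""
    model.eval()
    total, n = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(y)
        n += len(y)
    return total / n

def ood_generalization_gap(model, id_loader, ood_loader, loss_fn):
    """Loss on shifted (OOD) data minus loss on in-distribution data.

    A smaller gap is (very roughly) evidence that the model learned a more
    general solution rather than memorizing in-distribution specifics.
    """
    return avg_loss(model, ood_loader, loss_fn) - avg_loss(model, id_loader, loss_fn)

# Hypothetical usage (all objects are placeholders):
# gap = ood_generalization_gap(model, id_loader, ood_loader,
#                              torch.nn.functional.cross_entropy)
```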
I think simplicity/generality priors effectively have zero effect on whether the model is pushed towards or away from human values, and are IMO kind of orthogonal to alignment-relevant questions.
I’d be curious how you would describe the core problem of alignment.
I'd split it into inner alignment, i.e. how we manage to instill any goal/value at all, ideally one that is at least somewhat stable, and outer alignment, i.e. selecting a goal that is resistant to Goodharting.
Let's focus on inner alignment. By “instill” you presumably mean train. Which values get trained is ultimately a learning problem, one which in many cases (as long as one can approximately formulate a Boltzmann distribution) comes down to a simplicity-accuracy tradeoff.
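To spell out the tradeoff I have in mind, here is a rough sketch in standard SLT-style notation (introduced here purely for illustration): if learning approximately samples from a Boltzmann/tempered posterior over parameters, then the free energy of a region around a solution trades off its training loss (accuracy) against a complexity term governed by the local learning coefficient (simplicity), up to lower-order terms.

```latex
% Tempered Boltzmann posterior over parameters w, with n samples,
% inverse temperature \beta, prior \varphi, and empirical loss L_n:
p(w \mid D_n) \;\propto\; \exp\!\bigl(-n\beta L_n(w)\bigr)\,\varphi(w)

% Asymptotic free energy of a neighborhood W_\alpha of a solution w_\alpha,
% with local learning coefficient \lambda_\alpha (lower-order terms dropped):
F_n(W_\alpha) \;\approx\; \underbrace{n\beta L_n(w_\alpha)}_{\text{accuracy}}
  \;+\; \underbrace{\lambda_\alpha \log n}_{\text{simplicity}}
```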
Could you give some examples of what you are thinking of here?
You mean on more general algorithms being good vs. bad?
Yes.
I haven't thought about this enough to have a very mature opinion. On one hand, being more general means you're liable to Goodhart more (e.g., with enough deeply general processing power, you understand that manipulating the market to start World War 3 will make your stock portfolio grow, so you act misaligned). On the other hand, being less general means that AIs are more liable to “partially memorize” how to act aligned in familiar situations, and to go off the rails when sufficiently out-of-distribution situations are encountered. I think this is related to the question of how general humans are, and how stable human values are to being made much more or much less general.
I guess I'm mostly thinking about the regime where AIs are more capable and general than humans.
At first glance, the latter failure mode seems more like a capability failure, something one would expect to go away as AI truly surpasses humans. It doesn't seem core to the alignment problem to me.
Maybe a reductive summary is “general is good if outer alignment is easy but inner alignment is hard, and bad in the opposite case”.
Isn't it the other way around?
If inner alignment is hard, then general is bad, because applying less selection pressure (i.e. more generality, more simplicity prior) means more daemons/gremlins.