the core thing I worry about with any simulation-based approach is how to get coverage of possibility space: ensuring that the tests are informative about the possible configurations of complex systems is hard. that said, I would argue this is a great set of points, and that no approach could be expected to succeed without this type of testing.
however, as we've seen with adversarial examples, both humans and current DL have fairly severe alignment failures that cause large issues, and projects like this need tools for optimizing networks to be interpretable from the perspective of a formal verification algorithm.
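to make "failures like adversarial examples" concrete, here's a minimal fast-gradient-sign-method sketch; the toy model, input, and label are placeholders I made up, not anything from the post, and with an untrained net the prediction may or may not actually flip:

```python
# a minimal FGSM sketch: perturb the input in the direction that most increases
# the loss. the model, input, and label below are made-up placeholders; with an
# untrained network the prediction may or may not actually flip.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)
x = torch.randn(1, 10, requires_grad=True)
y = torch.tensor([1])

loss = F.cross_entropy(model(x), y)
loss.backward()                              # gradient of the loss w.r.t. the input

eps = 0.1
x_adv = (x + eps * x.grad.sign()).detach()   # fast gradient sign step

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```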
it's my belief that in almost all cases, the values almost all beings ever wish to express lie on a low-enough-dimensional manifold that, if the ideal representation were learned, a formal verifier could handle it. see, for example, recent interpretability work from anthropic, which lines up nicely with what's needed to use reluplex (or follow-up work; reluplex is old now) to estimate bounds on larger networks' behavior.
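reluplex itself is an SMT-style solver, but as a toy stand-in for what "estimating bounds on a network's behavior" means, here's a minimal interval-bound-propagation sketch over a made-up two-layer relu net; the weights and input box are arbitrary placeholders:

```python
# a minimal interval-bound-propagation sketch: propagate an input box through a
# small ReLU network to get provable (if loose) output bounds. the weights and
# input region are made-up placeholders; Reluplex-style solvers give tighter
# answers, but the goal here is just to show what "bounds on behavior" means.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)

def interval_affine(lo, hi, W, b):
    """Exact bounds of W @ x + b when x lies in the box [lo, hi]."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

# input box: each coordinate of x within +/- 0.1 of some nominal point
x0 = rng.normal(size=4)
lo, hi = x0 - 0.1, x0 + 0.1

lo, hi = interval_affine(lo, hi, W1, b1)
lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU is monotone, so bounds pass through
lo, hi = interval_affine(lo, hi, W2, b2)

print("certified output lower bounds:", lo)
print("certified output upper bounds:", hi)
```

interval bounds are loose compared to what reluplex-style solvers certify, but the shape of the question is the same: given a region of inputs, what range of outputs is provably possible?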
I don’t think we should give up on perfectly solving safety; instead, we should recognize that it has never even been closely approximated before, by any form of life, because even semi-planned evolutionary processes do not do a good job of covering the entire possibility space.
this post is an instant classic, and I agree that sims are a key step, starting from small agents. but ensuring that strongly generalizing moral views are generated in the agents who grow up in the simulations is not trivial: we need to be able to put formal bounds on the loss function, and doing so requires being able to prove properties all the way through the simulation, the same way we currently take gradients through it. if we can ask "what change, where in the simulation, would have altered this outcome?", then we'd finally be getting somewhere in terms of generating moral knowledge that generalizes universally.
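as a toy illustration of that last question, here's a sketch (with made-up one-dimensional dynamics standing in for "the simulation") that differentiates a scalar outcome through a short rollout and reads off the step at which a small change would have mattered most; this is just input-gradient attribution, not a proposal for how the real thing would work:

```python
# a toy sketch of "where in the simulation would a change have altered this
# outcome": differentiate a scalar outcome through a tiny rollout and look at
# the gradient with respect to each step's perturbation. the dynamics and
# outcome are made-up placeholders, not the post's actual proposal.
import torch

T = 20
perturb = torch.zeros(T, requires_grad=True)   # counterfactual nudges, one per step

state = torch.tensor(0.5)
trajectory = []
for t in range(T):
    # arbitrary nonlinear dynamics standing in for "the simulation"
    state = torch.tanh(1.5 * state + 0.1) + perturb[t]
    trajectory.append(state)

outcome = trajectory[-1] ** 2          # scalar "outcome" we care about
outcome.backward()

# large-magnitude entries mark steps where a small change would most move the outcome
sensitivity = perturb.grad.abs()
print("most outcome-relevant step:", int(sensitivity.argmax()))
print("per-step sensitivity:", sensitivity.numpy().round(3))
```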
The post anticipates some of the most likely failure modes, such as failing to correctly time the transition from selfish to altruistic learning, or out-of-distribution failures for proxy matching. I anticipate that for proxy matching in particular we may end up employing multiple stages. I also agree that simplified and over-generalized notions of altruism may be easier to maintain over the long term, and I see some indications that this already occurs in at least some humans.
The lowest-complexity, most general form of altruism is probably something like "empower the world's external agency". But it may also need game-theoretic adjustments (empower other altruists more, disempower dangerous selfish agents, etc.), considerations of suffering, and so on.
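As a toy illustration of what an external-empowerment proxy could look like, here is a sketch over a small gridworld I made up: "empowerment" is crudely approximated as the number of distinct cells another agent can reach within k steps, which a helper could try to increase (e.g. by removing obstacles). Real empowerment objectives are information-theoretic (channel capacity between actions and future states), so this only conveys the flavor:

```python
# a toy proxy for "empower the other agent": count how many distinct grid cells
# it can reach within k steps. the gridworld, walls, and horizon are made-up
# placeholders; real empowerment objectives use channel capacity, not reach counts.
from itertools import product

GRID = 5
WALLS = {(2, 1), (2, 2), (2, 3)}          # hypothetical obstacles the helper could remove
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def reachable_states(start, k, walls):
    frontier, seen = {start}, {start}
    for _ in range(k):
        nxt = set()
        for (r, c), (dr, dc) in product(frontier, MOVES):
            cell = (r + dr, c + dc)
            if 0 <= cell[0] < GRID and 0 <= cell[1] < GRID and cell not in walls:
                nxt.add(cell)
        seen |= nxt
        frontier = nxt
    return seen

other_agent = (2, 0)
print("reach with walls:   ", len(reachable_states(other_agent, 3, WALLS)))
print("reach without walls:", len(reachable_states(other_agent, 3, set())))
```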
I don't see why or how learning altruism/alignment (external empowerment) differs from other learning objectives (such as internal empowerment) in a way that makes formal verification important for the former but not the latter. So for me the strongest evidence for the importance of formal verification would be evidence of its utility across ML in general, which I don't really see yet.