I’m really excited about this research direction. It seems so well suited to what you’ve been researching in the past—so much so that it doesn’t seem to be a new research direction so much as a clarification of the direction you were already pursuing.
I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).
A common thread in the last year of my work on alignment is something like “How can I be an aligned intelligence?” and “What action would I take here if I were an aligned intelligence?”. This helps me bootstrap reasoning about my own experiences and abilities, and helps me think about extrapolations like “What if I had access to different information?” or “What if I could think about it for a very long time?”.
I still don’t have answers to these questions, but I think they would be incredibly useful to have as an AI alignment researcher. They could inform new techniques as well as fundamentally new approaches (to use terms from the post: both de-novo and in-motion).
Summing up all that, this post made me realize Alignment Research should be its own discipline.
Addendum: Ideas for things along these lines I’d be interested in hearing more about in the future:
(not meant as suggestions—more like just saying curiosities out loud)
What are the best books/papers/etc. for getting the “Alex Flint worldview on alignment research”? What existing research institutions study this (if any)?
I think a bunch of the situations involving many people here could be modeled by agent-based simulations. If there are cases where we could vary some control variable, this could be useful in finding Pareto frontiers (or the factors that shape them). (I sketch a toy version of this after the list below.)
The habit formation example seems weirdly ‘acausal decision theory’ flavored to me (though this might be a ‘Tetris effect’-like instance). It seems like habits of this sort are a mechanism for making trades across time/contexts with yourself. This makes me more optimistic about acausal decision theories being a natural way of expressing some key concepts in alignment.
Proxies are mentioned, but it feels like we could have a rich science or taxonomy of proxies. There’s a lot to study in the historical use of proxies, or in analyzing proxies in current examples of intelligence alignment.
The self-modification point seems to suggest an opposite point: invariants. Similar to how we can do a lot in physics by analyzing conserved quantities and conservative fields, maybe we can also use invariants in self-modifying systems to better understand their dynamics and equilibria.
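To make the agent-based simulation idea above a bit more concrete, here is a minimal, hypothetical sketch (the agent dynamics, the “oversight” control variable, and all parameter names are my own illustrative assumptions, not anything from the post): agents chase a noisy proxy of a true objective, we sweep one control variable across runs, record two outcomes per run, and keep the runs that are Pareto-optimal.

```python
# Toy agent-based sweep: vary one control variable, record (proxy, true)
# outcomes per run, and extract the Pareto frontier across runs.
# Everything here is an illustrative assumption, not a real model.
import random


def run_simulation(oversight: float, n_agents: int = 50, steps: int = 100, seed: int = 0):
    """Agents chase a noisy proxy of a 'true' objective; oversight damps proxy-only gains."""
    rng = random.Random(seed)
    proxy_score = 0.0
    true_score = 0.0
    for _ in range(steps):
        for _ in range(n_agents):
            effort = rng.random()
            # Proxy reward is easy to accumulate; true value only partly tracks it.
            gamed = effort * (1.0 - oversight)              # proxy gains that don't help the true goal
            aligned = effort * oversight * rng.uniform(0.5, 1.0)
            proxy_score += gamed + aligned
            true_score += aligned
    return proxy_score, true_score


def pareto_frontier(points):
    """Keep points not dominated on both coordinates (maximizing both)."""
    frontier = []
    for p in points:
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points):
            frontier.append(p)
    return sorted(frontier)


if __name__ == "__main__":
    # Sweep the control variable from 0.0 to 1.0 and report non-dominated runs.
    runs = [run_simulation(oversight=o / 10) for o in range(11)]
    for proxy, true in pareto_frontier(runs):
        print(f"proxy={proxy:8.1f}  true={true:8.1f}")
```

Nothing hinges on these particular dynamics; the point is just that sweeping a control variable and filtering for non-dominated outcomes is a cheap way to see where the proxy/true-objective trade-off sits.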
Summing up all that, this post made me realize Alignment Research should be its own discipline.
Yeah I agree! It seems that AI alignment is not really something that any existing discipline is well set up to study. The existing disciplines that study human values are generally very far away from engineering, and the existing disciplines that have an engineering mindset tend to be very far away from directly studying human values. If we merely created a new “subject area” that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don’t think it would go well. A new discipline/paradigm seems to be innovation at a deeper level than that. (I understand adamShimi’s work to be figuring out what this new discipline/paradigm really is.)
The habit formation example seems weirdly ‘acausal decision theory’ flavored to me (though this might be a ‘Tetris effect’-like instance). It seems like habits of this sort are a mechanism for making trades across time/contexts with yourself. This makes me more optimistic about acausal decision theories being a natural way of expressing some key concepts in alignment.
Interesting! I hadn’t thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself, but I have the sense that you’re referring to something quite different from the ordinary trades across time that we would make with, e.g., other people. Is the thing you’re seeing something like: when we’re executing a habit we have no space/time left over to trade with other parts of ourselves, so we just “do the thing such that, if the other parts of ourselves knew we would do it and responded in kind, it would lead to overall harmony”?
Proxies are mentioned, but it feels like we could have a rich science or taxonomy of proxies. There’s a lot to study in the historical use of proxies, or in analyzing proxies in current examples of intelligence alignment.
We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale—I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes!
Hope you are well Alex!